A Hybrid Approach to Interpretable Analysis of Research Paper Collections
Jul 2020
We define and find a most specific generalization of a fuzzy set of
topics assigned to leaves of the rooted tree of a taxonomy. This
generalization lifts the set to a “head subject” in the higher ranks
of the taxonomy, which is meant to cover the query set “tightly”,
possibly bringing in some errors, both “gaps” and “offshoots”. Our
method involves two more automated analysis techniques: a fuzzy
clustering method, FADDIS, involving both additive and spectral
properties, and a purely structural string-to-text relevance measure
based on suffix trees annotated by frequencies. We apply this method to
extract research tendencies from two collections of research papers:
(a) about 18,000 papers published in Springer journals
on data science over 20 years, and (b) about 27,000 papers
retrieved from Springer and Elsevier journals in response to data
science related queries. We consider a taxonomy of Data Science
based on the Association for Computing Machinery Computing
Classification System (ACM-CCS 2012). Our findings allow us to
make some comments on the tendencies of research that cannot be
derived by using more conventional techniques.
Boris Mirkin, Dmitry Frolov, Alex Vlasov, Susana Nascimento, Trevor Fenner
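
The lifting idea described in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes a single head subject, an exhaustive search over taxonomy nodes, and a simple additive penalty in which every candidate pays 1 for the head subject, a hypothetical weight `gap_penalty` for each covered leaf outside the query set, and a hypothetical weight `offshoot_penalty` times the membership of each query leaf left uncovered. The function names, the weights, and the toy taxonomy are all assumptions made for illustration.

```python
from collections import defaultdict


def leaves_under(children, node):
    """Return the set of leaves in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves_under(children, kid)
    return found


def lift(parent, membership, gap_penalty=0.2, offshoot_penalty=0.5):
    """Pick the single node with the lowest illustrative penalty.

    parent:     dict mapping each non-root node to its parent
    membership: dict mapping query leaves to fuzzy membership values
    The penalty weights are hypothetical, not taken from the paper.
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    nodes = set(parent) | set(parent.values())
    query = {leaf for leaf, u in membership.items() if u > 0}

    best = None
    for node in sorted(nodes):  # sorted for a deterministic scan order
        covered = leaves_under(children, node)
        gaps = covered - query        # covered leaves not in the query set
        offshoots = query - covered   # query leaves the node does not cover
        cost = (1.0
                + gap_penalty * len(gaps)
                + offshoot_penalty * sum(membership[leaf] for leaf in offshoots))
        if best is None or cost < best[0]:
            best = (cost, node, gaps, offshoots)
    return best


# Toy taxonomy (hypothetical): root -> A, B; A -> a1, a2, a3; B -> b1, b2.
parent = {"A": "root", "B": "root",
          "a1": "A", "a2": "A", "a3": "A",
          "b1": "B", "b2": "B"}
membership = {"a1": 0.8, "a2": 0.6, "b1": 0.1}

cost, head, gaps, offshoots = lift(parent, membership)
print(head, sorted(gaps), sorted(offshoots))
```

On this toy tree the cheapest choice is node `A`, with `a3` as a gap and `b1` as an offshoot: lifting all the way to the root would add a second gap, while stopping at a single leaf would leave more of the query set uncovered.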