Finding and Parsimoniously Generalizing Fuzzy Clusters using Hierarchical Taxonomies: a Case Study in Data Science
Taxonomies play a fundamental role in structuring concepts in knowledge domains such as Biology, Medical Sciences, Education, and Computer Science.
This talk presents an algorithm for lifting a fuzzy cluster of topics to higher ranks in a hierarchical taxonomy. The algorithm, PARGen, minimizes a penalty function that balances the number of introduced ‘head subjects’ against the associated errors, namely ‘gaps’ (false positives) and ‘offshoots’ (false negatives), each with its own weight. The result is a parsimonious generalization of the topic cluster in the taxonomy.
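To make the trade-off concrete, the following is a minimal Python sketch of a penalty of this kind. It is not the PARGen algorithm itself: it only scores a given set of candidate head subjects and does not search for the minimum, it treats the cluster as crisp rather than fuzzy, and it counts gaps at the leaf level. The toy taxonomy, the function names, and the weight values are all hypothetical, chosen for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical toy taxonomy: each internal node maps to its children.
# Nodes that do not appear as keys are leaves.
TOY_TREE: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
}


def leaves_under(tree: Dict[str, List[str]], node: str) -> Set[str]:
    """Return the set of leaf topics in the subtree rooted at `node`."""
    children = tree.get(node, [])
    if not children:
        return {node}
    found: Set[str] = set()
    for child in children:
        found |= leaves_under(tree, child)
    return found


def lifting_penalty(
    tree: Dict[str, List[str]],
    cluster: Set[str],
    heads: Set[str],
    head_weight: float = 1.0,      # assumed weight, not from the talk
    gap_weight: float = 0.1,       # assumed weight, not from the talk
    offshoot_weight: float = 0.9,  # assumed weight, not from the talk
) -> float:
    """Score a candidate set of head subjects for a crisp cluster of leaves.

    Gaps are leaves covered by a head subject but absent from the cluster
    (false positives). Offshoots are cluster leaves not covered by any head
    subject (false negatives).
    """
    covered: Set[str] = set()
    for head in heads:
        covered |= leaves_under(tree, head)
    gaps = covered - cluster
    offshoots = cluster - covered
    return (
        head_weight * len(heads)
        + gap_weight * len(gaps)
        + offshoot_weight * len(offshoots)
    )


if __name__ == "__main__":
    cluster = {"a1", "a2", "b1"}
    for heads in ({"root"}, {"A"}, {"a1", "a2", "b1"}):
        score = lifting_penalty(TOY_TREE, cluster, heads)
        print(sorted(heads), round(score, 3))
```

With these assumed weights, a single general head subject that covers the whole cluster at the cost of two gaps scores lower than either keeping the three leaves as separate head subjects or lifting to "A" alone and leaving "b1" as an offshoot. Different weights would shift that balance, which is the sense in which the generalization is parsimonious.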
The PARGen algorithm is applied to a text collection of 17,685 abstracts of research papers published in 17 Springer journals related to Data Science over a 20-year period (1998-2017). The ground truth is a hierarchical taxonomy of Data Science (TDS) taken from the 2012 ACM Computing Classification System (ACM-CCS). The talk will discuss the methodology for finding fuzzy clusters of TDS leaf topics retrieved from the text, lifting them with PARGen, and identifying ‘head subjects’ that highlight research tendencies in Data Science.
Susana Nascimento has been an Assistant Professor at the Department of Computer Science of Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, since 2003, and is a member of the NOVA LINCS Knowledge-Base Systems group. She holds a degree in Computer Science Engineering (1989) and a Ph.D. in Computer Science (2002), both from Universidade NOVA de Lisboa, with a thesis on Fuzzy Cluster Analysis. Her main research area is Cluster Analysis and its applications. She has worked in oceanographic image analysis on the automatic recognition of mesoscale oceanographic features, such as upwelling and eddies, which are important dynamic phenomena in ocean circulation with implications for climate and fisheries. She is currently developing a novel methodology for using Taxonomies in Data Analysis, in collaboration with researchers from Birkbeck, University of London, and the Higher School of Economics in Moscow. She was PI of two research projects in these fields.