Data processing in the PROFIT platform, covering tasks such as data access, search, topic analysis, sentiment analysis, recommendations and preference extraction, is based on a module that extracts knowledge objects (concepts) from the contents of the platform. At the heart of this extraction tool is a thesaurus, which holds the domain-specific knowledge as a network of related concepts. PROFIT’s knowledge graph describes entities, names, objects, actors and their interrelations, and so lays the groundwork for analyzing and annotating the textual resources and for further processing of textual data (sentiment analysis, topic and trend analysis, recommendations, etc.).
PROFIT’s knowledge graph is built on two well-established, open thesauri that are already used in the broader finance/economy domain: STW Economics and EuroVoc. Using them ensures that the exchange of knowledge and data is based on commonly used and publicly maintained seed thesauri.
Integrating these two thesauri with the knowledge contributed by domain experts from the consortium required additional modeling and adjustments. Remodeling the joint thesaurus in PoolParty made it possible to resolve naming and hierarchical conflicts and to fix ambiguities, with the result that no hierarchical conflicts were left in the final PROFIT thesaurus.
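One kind of hierarchical conflict that can arise when two thesauri are fused is a cycle in the broader/narrower relation, for example when one source places concept A below B and the other places B below A. The following is a minimal sketch of how such a cycle could be detected. It assumes the hierarchy is available as (narrower, broader) pairs. The function name `find_cycle` and the toy data are hypothetical and are not taken from PoolParty or from the PROFIT thesaurus.

```python
def find_cycle(pairs):
    """Return one cycle in the broader relation as a list of concepts, or None."""
    parents = {}
    for narrower, broader in pairs:
        parents.setdefault(narrower, set()).add(broader)

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        # Depth-first walk upwards along the broader relation.
        state[node] = ACTIVE
        path.append(node)
        for parent in sorted(parents.get(node, ())):
            if state.get(parent, UNSEEN) == ACTIVE:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state.get(parent, UNSEEN) == UNSEEN:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in sorted(parents):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical toy input: each source is consistent on its own,
# but the two disagree on the direction of one relation.
source_a = [("bank", "financial institution")]
source_b = [("financial institution", "bank")]

print(find_cycle(source_a))             # no conflict within a single source
print(find_cycle(source_a + source_b))  # conflict appears only after merging
```

The recursive walk is adequate for an illustration; for a hierarchy with many levels an iterative traversal would avoid Python's recursion limit.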
The subsequent quality assurance was based on 1) the semi-automated mechanisms of PoolParty, which guarantee the absence of formal issues in the thesaurus; 2) experts in the field (UoG and DUTH), who have worked with the thesaurus and extended it; and 3) a beta test carried out with 39150 domain-specific articles, in which an average of 50 extracted concepts per article indicates the relevance/fit of the PROFIT knowledge graph.
Even though the creation of a thesaurus is an ongoing process that is never finished, the current state of the thesaurus is satisfactory, and the first experiments using (parts of) the thesaurus for document annotation show good results. The PROFIT thesaurus is now publicly available at http://profit.poolparty.biz/profit_thesaurus.html
Characteristics of the PROFIT thesaurus
The PROFIT knowledge graph features two concept schemes: EuroVoc and STW. Hence one can still explore the original structures of both thesauri while using only the fused version. The top concepts comprise the original 21 categories from EuroVoc and the original classification with seven concepts from STW; all top concepts thus stem from the categorization schemes of the base thesauri. Any concept of the thesauri may have several top concepts as broader concepts, i.e. belong to several categories. There are 10837 concepts and 11220 broader/narrower relation pairs. Since there are more relation pairs than concepts, some concepts must have more than one broader concept; 395 such poly-hierarchies exist.
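The notions of poly-hierarchy and of belonging to several top concepts can be illustrated with a small sketch. It assumes that a poly-hierarchy means a concept with more than one direct broader concept, and that a top concept is one without any broader concept. The concept names and the helper functions below are hypothetical toy examples, not data or code from the PROFIT thesaurus.

```python
from collections import defaultdict

# Hypothetical (narrower, broader) pairs for illustration only.
broader_pairs = [
    ("bond", "security"),
    ("share", "security"),
    ("security", "financial market"),
    ("bond", "public debt"),  # second broader concept for "bond"
    ("public debt", "public finance"),
]


def broader_map(pairs):
    """Map each concept to the set of its direct broader concepts."""
    parents = defaultdict(set)
    for narrower, broader in pairs:
        parents[narrower].add(broader)
    return parents


def poly_hierarchical(pairs):
    """Return the concepts that have more than one direct broader concept."""
    return sorted(c for c, ps in broader_map(pairs).items() if len(ps) > 1)


def top_concepts_of(concept, pairs):
    """Return all top concepts reachable from `concept` via broader relations."""
    parents = broader_map(pairs)
    tops, seen, stack = set(), set(), [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if not parents.get(current):
            tops.add(current)  # no broader concept: this is a top concept
        else:
            stack.extend(parents[current])
    return sorted(tops)


print(poly_hierarchical(broader_pairs))
print(top_concepts_of("bond", broader_pairs))
```

In this toy hierarchy "bond" is the only poly-hierarchical concept, and it belongs to two categories because its two broader paths end in different top concepts.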