LOMI: Enrich Linked Open Data (DBpedia) with Microdata

LOMI was created during my studies at the University of Mannheim. Its aim is to use Microdata (in this case only the schema.org vocabulary) collected by the Common Crawl project and prepared by the Web Data Commons project to enrich the information DBpedia delivers.

The program was written in Java and the source code is available on GitHub. A separate project on GitHub provides the mappings from the schema.org vocabulary to the DBpedia ontology.

Final presentation

Components of LOMI

LOMI performs three steps: Deduping, Transformation, and Instance Matching.

Deduping

The input files (example) taken from Web Data Commons contain a lot of duplicates. To remove them, the instances are processed sequentially: a hash is calculated for each instance, and any instance whose hash has already been seen is excluded as a duplicate. A 150 GB sample processed this way shrank to 12 GB, a reduction of 92%!
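To make the idea concrete, here is a minimal sketch of hash-based deduplication in Java (class and method names are my own for illustration; the real code is in the LOMI repository). It assumes each instance can be serialized into one canonical string, e.g. its sorted statements concatenated:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class Deduper {

    private final Set<String> seenHashes = new HashSet<>();

    /** Returns true if an instance with the same content was already processed. */
    public boolean isDuplicate(String serializedInstance) {
        // Set.add returns false when the hash is already present.
        return !seenHashes.add(md5(serializedInstance));
    }

    private static String md5(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Keeping only the hashes in memory, rather than the full instances, is what makes a sequential pass over input of this size feasible.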

Transformation

The deduplicated data needs to be transformed to the DBpedia ontology so that it can be aligned and matched. This is done using a manually created mapping (which is also available on GitHub).
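A simplified sketch of such a mapping in Java (the URI pairs below are only illustrative examples, not entries from the actual mapping file):

```java
import java.util.HashMap;
import java.util.Map;

public class Transformer {

    // Illustrative pairs only; the real mapping is maintained in its own GitHub project.
    private static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("http://schema.org/Person",
                    "http://dbpedia.org/ontology/Person");
        MAPPING.put("http://schema.org/birthDate",
                    "http://dbpedia.org/ontology/birthDate");
    }

    /** Returns the DBpedia equivalent of a schema.org URI, or null if unmapped. */
    public static String toDBpedia(String schemaOrgUri) {
        return MAPPING.get(schemaOrgUri);
    }
}
```

Applied to every type and predicate URI of an instance, this rewrites the schema.org statements into DBpedia ontology statements; unmapped properties can simply be dropped.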

Instance Matching

This was the most difficult part of the project. New data should only be inserted if a corresponding instance already exists in DBpedia, so LOMI needs to find instances that belong together. The idea behind the solution was quite simple: apply several hierarchical conditions, then a scoring function. For example, an instance needs to have at least two statements, otherwise there is no new information to add (normally an instance has at least one statement for the name and one for the type), and the names of two candidate instances must be similar to a defined extent. The scoring function takes into account how similar the names are, whether there are properties with equal values, and so on.
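A simplified sketch of such a scoring function in Java (the weights and the threshold are illustrative values, not the ones used in LOMI; name similarity is computed here with a plain Levenshtein distance):

```java
import java.util.Map;
import java.util.Objects;

public class MatchScorer {

    // Illustrative weights and threshold, not the values used in LOMI.
    private static final double NAME_WEIGHT = 0.7;
    private static final double PROPERTY_WEIGHT = 0.3;
    private static final double THRESHOLD = 0.8;

    /** Combined score in [0, 1] from name similarity and equal property values. */
    public static double score(String nameA, String nameB,
                               Map<String, String> propsA, Map<String, String> propsB) {
        int maxLen = Math.max(nameA.length(), nameB.length());
        double nameSim = maxLen == 0
                ? 1.0
                : 1.0 - (double) levenshtein(nameA.toLowerCase(), nameB.toLowerCase()) / maxLen;
        long equalProps = propsA.entrySet().stream()
                .filter(e -> Objects.equals(e.getValue(), propsB.get(e.getKey())))
                .count();
        double propSim = propsA.isEmpty() ? 0.0 : (double) equalProps / propsA.size();
        return NAME_WEIGHT * nameSim + PROPERTY_WEIGHT * propSim;
    }

    public static boolean isMatch(double score) {
        return score >= THRESHOLD;
    }

    /** Classic dynamic-programming edit distance between two strings. */
    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The hierarchical conditions act as cheap filters before this scoring step, so the comparatively expensive string similarity only runs on plausible candidate pairs.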
