LOMI: Enrich Linked Open Data (DBpedia) with Microdata
Components of Lomi
LOMI performs three steps: Deduping, Transformation, and Instance Matching.
The input files (Example) taken from the Web Data Commons contain many duplicates. To overcome this, the instances are processed sequentially, and duplicates are detected and excluded by computing a hash for each instance. A 150 GB sample processed this way shrank to 12 GB, a reduction of 92%.
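The hash-based deduplication described above might be sketched roughly as follows. This is an illustrative sketch, not LOMI's actual code; the instance representation (a list of subject/predicate/object triples per instance) and the choice of SHA-256 are assumptions.

```python
import hashlib


def dedupe(instances):
    """Yield each instance once, keyed by a hash of its statements.

    `instances` is assumed (hypothetically) to be an iterable of lists of
    (subject, predicate, object) triples, one list per extracted instance.
    """
    seen = set()
    for triples in instances:
        # The subject identifier varies between pages, so hash only the
        # sorted predicate/object pairs to detect content duplicates.
        canonical = "\n".join(sorted(f"{p} {o}" for _, p, o in triples))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield triples
```

Because only a small set of digests is kept in memory, the input can be streamed instance by instance, which matters at the 150 GB scale mentioned above.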
The deduped data must then be transformed to the DBpedia ontology so that it can be aligned and matched. This is done with a manually created mapping (also available on GitHub).
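The transformation step can be thought of as a predicate-rewriting pass over the triples. The mapping excerpt below is a hypothetical illustration; the actual mapping is the manually created one on GitHub, and the target URIs shown here are assumptions.

```python
# Hypothetical excerpt of a mapping from schema.org terms to DBpedia
# ontology / FOAF terms (the real, manually created mapping is on GitHub).
MAPPING = {
    "http://schema.org/name": "http://xmlns.com/foaf/0.1/name",
    "http://schema.org/birthDate": "http://dbpedia.org/ontology/birthDate",
}


def transform(triples):
    """Rewrite predicates according to MAPPING, dropping unmapped triples."""
    return [(s, MAPPING[p], o) for s, p, o in triples if p in MAPPING]
```

Dropping unmapped predicates is one plausible policy; keeping them under their original URIs would be another design choice.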
This was the most difficult part of the project. New data should only be inserted if the instance already exists in DBpedia, so LOMI must identify instances that belong together. The idea behind the solution is quite simple: apply several hierarchical conditions together with a scoring function. For example, an instance must have at least two statements, otherwise there is no new information to add (normally an instance has at least one statement for the name and one for the type), and the instances' names must be similar to a defined extent. The scoring function takes into account how similar the names are, whether any properties have equal values, and so on.
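The hierarchical conditions and scoring function described above could be sketched like this. The data layout (a name plus a dict of statements per instance), the similarity threshold, and the weights are all assumptions for illustration, not LOMI's actual parameters.

```python
from difflib import SequenceMatcher


def match_score(microdata, dbpedia, name_threshold=0.8):
    """Hypothetical matching sketch: hierarchical conditions first,
    then a score combining name similarity and shared property values.
    """
    # Condition 1: the Microdata instance needs at least two statements,
    # otherwise there is no new information beyond name and type to add.
    if len(microdata["statements"]) < 2:
        return 0.0
    # Condition 2: the names must be similar to a defined extent.
    name_sim = SequenceMatcher(
        None, microdata["name"].lower(), dbpedia["name"].lower()
    ).ratio()
    if name_sim < name_threshold:
        return 0.0
    # Score: name similarity plus a small bonus per equal property value.
    shared = set(microdata["statements"].items()) & set(dbpedia["statements"].items())
    return name_sim + 0.1 * len(shared)
```

Instances would then be matched by taking the highest-scoring candidate above some cutoff, which keeps the cheap hierarchical checks in front of the more expensive similarity computation.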