Ontologies: Perl is a planet in the solar system
Lately, I've been working on Wikipedia, both an unprecedented human adventure (I wouldn't have bet two cents on its survival a few years ago) and a reservoir of fantastic resources for natural language processing. In particular, it is a huge ontology, i.e. a structured knowledge tree, of the kind people have dreamed of building for centuries. I alluded to this in my last slide here [fr]: from the Sumerians through Ramon Llull, Leibniz and the Encyclopaedists, we have been searching for a way to organize Everything, and the semantic Web is the latest invention with that aim.
Wikipedia's knowledge tree is navigable online:
Maybe you know the Perl programming language. I'm a big fan, but let's leave that for another post. I used the corresponding Wikipedia page as a test, to see whether my little homemade programs could correctly find its place in the Wikipedian knowledge tree.
Let's follow the category links going up through the tree. The links are at the bottom of the page. The Perl page belongs to all these categories:
Ah... so apparently it's not a tree. Or maybe it's one of those Indian banyans I frequently refer to, whose branches connect and merge... Anyway, as long as there is no loop (I don't wish to be pedantic, but that is called a Directed Acyclic Graph), it is still possible to build an ontology. It's common enough:
But it nevertheless requires some care in building the links, and you quickly get lost.
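The "no loop" condition can be checked mechanically. Here is a minimal sketch in Python, assuming the category links have already been extracted into a dictionary mapping each page or category to the categories it belongs to. The function name is_dag and the toy data are mine, not Wikipedia's, and apart from "American inventions" and "Inventions by country" the category names are made up for illustration. It uses Kahn's algorithm: repeatedly remove nodes that nothing points to, and if some nodes can never be removed, they sit on a loop.

```python
from collections import deque


def is_dag(parents):
    """Return True if the category graph contains no loop (Kahn's algorithm).

    `parents` maps each node to the list of categories it belongs to.
    """
    # Collect every node, including categories that only appear as parents.
    nodes = set(parents)
    for cats in parents.values():
        nodes.update(cats)

    # Count incoming links (child -> parent) for each node.
    indegree = {n: 0 for n in nodes}
    for cats in parents.values():
        for c in cats:
            indegree[c] += 1

    # Start from nodes nothing points to, and peel the graph layer by layer.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for c in parents.get(node, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # Nodes on a loop never reach indegree 0, so they are never removed.
    return removed == len(nodes)


# Hypothetical toy data: a banyan (several parents per node), but no loop.
banyan = {
    "Perl": ["American inventions", "Scripting languages"],
    "Scripting languages": ["Programming languages"],
    "American inventions": ["Inventions by country"],
}

# Same data plus one careless link that closes a loop.
looped = dict(banyan)
looped["Inventions by country"] = ["American inventions"]

print(is_dag(banyan))
print(is_dag(looped))
```

A node with several parents is fine here. Only a chain of links that comes back to its starting point makes the check fail.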
So let's follow the links on our Perl page. It's an American invention. Ok. Let's go up. To be brief, here is the path I followed, chosen at random among all the possibilities:
- American inventions
- Inventions by country
- Categories by country
- Planets of the Solar System
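A climb like the one above is easy to automate, which is exactly why this kind of accident shows up so quickly. The following is only a sketch, not my actual programme: it assumes the same dictionary representation of category links (each node mapped to its parent categories), and path_up is a hypothetical helper name. The toy links copy the path listed above. The direct link from "Categories by country" to "Planets of the Solar System" is an assumption standing in for whatever intermediate steps the real path took.

```python
from collections import deque


def path_up(parents, start, target):
    """Breadth-first search along category links.

    Returns a shortest upward path from `start` to `target` as a list,
    or None if `target` cannot be reached.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for cat in parents.get(path[-1], []):
            if cat not in seen:  # avoid revisiting merged branches
                seen.add(cat)
                queue.append(path + [cat])
    return None


# Toy links mirroring the path above (the last link is an assumption).
toy = {
    "Perl": ["American inventions"],
    "American inventions": ["Inventions by country"],
    "Inventions by country": ["Categories by country"],
    "Categories by country": ["Planets of the Solar System"],
}

print(" > ".join(path_up(toy, "Perl", "Planets of the Solar System")))
```

Nothing in such a search knows that "belongs to the category" is not the same thing as "is a kind of", so it will happily report that Perl sits under the planets.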
Don't think that this is an isolated example. It is by far the rule, given the immense complexity of the graph. What a shame... It means there is a huge amount of work to be done before Wikipedia can be exploited, at least by automatic means. The whole effort (unprecedented in the history of Humanity, I repeat) should be praised, but to properly exploit the knowledge in it, a little structure will be required...