Jean Véronis


Thursday, May 5, 2005

Web: The future according to Yahoo

Eric Baillargeon tells me that the presentations from the 10th Search Engine Meeting, which took place in Boston on the 11th and 12th of April, are now online.

I’m extremely flattered to see Jan Pedersen, Chief Scientist at Yahoo!, doing me the honour of quoting my work on Google and using some of my figures in his introductory keynote to the conference.

[Slide from Jan Pedersen's keynote]

It was quite a dig with which to open the conference, however: the slide compares hit counts at Yahoo and Google, and doesn't exactly favour the latter ...

Jan’s presentation, entitled “Internet Search Engines: Past and Future”, is very interesting – not because it reveals anything that we didn’t know already, of course, but hearing this information confirmed by Yahoo!’s Chief Scientist does add a certain weight. I note how Jan is careful to use the term “claimed” to talk about the index size of the different search engines. In the slide that follows the one above, Jan highlights the disparity between what is claimed and what is observed, emphasising the fact that these figures also include a large number of what he calls “thin docs”: virtually empty pages, simple URLs, etc.

Yahoo's official view of the future closely resembles Google's. Many different kinds of search: images, local pages, products, desktop, and so on. Diversifying the offering in this way is certainly necessary to occupy the terrain, but I notice that little was said about improving what lies at the heart of the business, namely the quality of the search engine itself. Yet that, I believe, is what attracts users in the first place and what keeps them coming back: speed, relevance and "freshness" of results.

This silence is even more astonishing when you consider that determined research is well underway at both companies. I personally can see (at least) two areas for improvement that will be crucial to the success of search engines in the years ahead:

1. Sorting results. Currently, this is based too exclusively on notoriety (using algorithms such as PageRank), which produces aberrations of the kind I have repeatedly poked fun at on this blog (see here[fr] and here[fr] for examples) and makes the search engines very vulnerable to abuse by spammers.
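PageRank itself is public knowledge, and a toy version makes the point concrete: the score a page receives depends only on the link graph, never on the query. The link graph below is invented purely for illustration; this is a minimal power-iteration sketch, not any engine's production algorithm:

```python
# Minimal PageRank by power iteration on an invented toy link graph.
# The point: a page's score depends only on the link structure,
# never on what the user actually searched for.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],   # "d" links out, but nobody links to it
}
pages = list(links)
n = len(pages)
damping = 0.85
rank = {p: 1.0 / n for p in pages}

for _ in range(50):  # iterate until (roughly) converged
    new = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q])
                       for q in pages if p in links[q])
        new[p] = (1 - damping) / n + damping * incoming
    rank = new

# "c" collects the most incoming links, so it outranks the others
# for *every* query, relevant to it or not.
best = max(rank, key=rank.get)
```

Whatever the user types, "c" comes out on top, which is precisely why notoriety-only ranking produces the aberrations mentioned above and hands spammers a single, query-independent target (link farms).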

2. Presenting results. At the moment, results are returned any old how. But what could be more frustrating than having to plough through page after page of results from people complaining how tedious their day has been, when the "boring" you are looking for concerns drilling techniques?

The technologies that can greatly improve these two points are built on automatic language processing. For instance, the relevance of results must be established on a query-by-query basis: the notoriety of a site does not imply that it is relevant for all keywords, and only an in-depth linguistic analysis of the sites can determine which words are relevant for any given site. Improving how results are presented requires the disambiguation (at least on a cursory level) of the words on the pages (tedium or drilling?) and the grouping of results by topic (clustering). Yahoo! is taking its first timid steps in this direction, and other search engines, such as Exalead, are doing slightly better (at least for French), but we are still far from seeing any satisfactory results:

[Screenshot: Yahoo results for the query "boring"]
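The kind of clustering described above can be caricatured in a few lines. The snippets and the overlap threshold below are invented for illustration; real systems weight words, discard stopwords and use proper vector-space similarity rather than counting raw overlap:

```python
# Toy topical clustering of result snippets for the query "boring":
# group snippets whose context words overlap. Snippets are invented.
snippets = [
    "another boring meeting, what a tedious day at the office",
    "boring tools for deep well drilling in hard rock",
    "the film was so boring I fell asleep, a dull and tedious plot",
    "horizontal boring machines and drilling equipment for sale",
]

def words(text):
    # crude tokenisation: lowercase, strip commas, split on spaces
    return set(text.lower().replace(",", "").split())

clusters = []  # each cluster: [set_of_context_words, [snippets]]
for s in snippets:
    w = words(s)
    for cluster in clusters:
        # join the first cluster sharing at least 2 context words
        # besides the query term itself
        if len((cluster[0] & w) - {"boring"}) >= 2:
            cluster[0] |= w
            cluster[1].append(s)
            break
    else:
        clusters.append([w, [s]])
```

On these four invented snippets, the "tedium" results end up in one cluster and the "drilling" results in another, which is exactly the kind of grouping that would spare the user from ploughing through irrelevant pages.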

You will notice that the same type of technology is needed to make ads more relevant to queries and results – a relevance that for the moment leaves a lot to be desired, especially in languages other than English. While the average surfer may not be particularly concerned by this, the poor quality of these pairings can lead to a considerable loss of revenue for the search engine.

This research is of key strategic importance, and the official silence on the part of Yahoo! speaks volumes. For Jan Pedersen, who has been recruited by Yahoo! as its Chief Scientist, is a specialist in automatic language processing. As it happens I know his work very well (and he knows mine), since it’s a small world and our area of research is identical. Jan came to prominence in the 90s with several particularly pertinent studies on the part-of-speech tagging of text (how do you know if saw is a noun or a verb?) [1] and word-sense disambiguation (how can we tell if boring refers to ennui or drilling?) as applied to information retrieval [2]. His recent work is also of interest, since it gives an idea of the algorithm used by Yahoo! to rank its results in a similar way to PageRank (which belongs to Google) [3].

This is no coincidence. As I’ve often said here, language is at the core of information; even when we are searching for images it is words we type in to describe them. The search for information will only move forward as language technologies improve. But I understand that competition is stiff, and that it makes sense not to give too much away in front of your competitors!
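To return to the saw example: the real tagger described in [1] is a hidden Markov model, but a single invented rule on the preceding word is enough to illustrate what part-of-speech disambiguation has to decide:

```python
# Caricature of part-of-speech disambiguation for the word "saw".
# A single hand-written rule stands in for the HMM of [1]: the word
# lists and the rule itself are invented for illustration only.
DETERMINERS = {"a", "an", "the", "this", "that"}
PRONOUNS = {"i", "you", "he", "she", "we", "they"}

def tag_saw(sentence):
    toks = sentence.lower().split()
    i = toks.index("saw")
    prev = toks[i - 1] if i > 0 else ""
    if prev in DETERMINERS:
        return "noun"   # "the saw" -> the tool
    if prev in PRONOUNS:
        return "verb"   # "she saw" -> past tense of "see"
    return "unknown"
```

A determiner before saw suggests the tool, a pronoun suggests the verb; an HMM tagger generalises exactly this kind of local contextual evidence, learned from data rather than hand-written, over whole sentences.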

For further information

[1] Cutting, D., Kupiec, J., Pedersen, J., Sibun, P. (1992). A practical part-of-speech tagger. Proceedings of the third conference on Applied natural language processing (pp. 133-140). Trento, Italy. [pdf]

[2] Schütze, H., Pedersen, J. (1995). Information retrieval based on word senses. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (pp. 161-175). Sheffield, United Kingdom. [ps]

[3] Broder, A. Z., Lempel, R., Maghoul, F., Pedersen, J. (2004). Efficient PageRank Approximation via Graph Aggregation. Proceedings of the Thirteenth International World Wide Web Conference (pp. 484-485). New York, U.S.A. [pdf]

1 Comment:

Frondeur wrote...

Is automatic language processing and the grouping of results by topic really the future of search engines?

If it were so, why didn't Exalead take over the world?

I don't trust the machine to be intelligent, because it never is; I expect it to be thorough and fast, and then I deal with the understanding part myself, thank you very much.

The way to disambiguate words using current search engines is simply to search for other words from the same context: that's what everybody does, and it works pretty well. We do our own disambiguation.

How would a search engine guess the semantic field you're looking for based on just one word, anyway?

Automatic language processing is just one machine making many assumptions and being wrong most of the time.

Whereas what you call ranking by notoriety or popularity should really be called collaborative filtering: i.e., the combination of many human judgments, most of them correct, and their intersection infallible...

27 August 2005, 11:21  
