There are currently over a billion pages of information on the Internet about every topic imaginable. The question is how you can possibly find what you want. Computer algorithms can be written to search the Internet, but most are not practical because they must sacrifice precision for coverage. However, a few engines have found interesting ways of providing high quality information quickly. Page value ranking, topic-specific searches, and Meta search engines are three of the most popular because they work smarter, not harder.
While no commercial search engine makes its algorithm public, the basic structure can be inferred by testing the results. The algorithms are kept secret because publishing them would invite a thousand imitation sites, meaning little or no profit for the developers. The most primitive of searches is the sequential search, which goes through every item in the list one at a time. Yet the sheer size of the web immediately rules out this possibility. While a sequential search might return the best results, you would most likely never see any results because of the web's explosive growth rate. Even the fastest computers would take a long time, and in that time all kinds of new pages would have been created.
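To make the cost of a sequential search concrete, here is a minimal sketch in Python. The page list and the function name `sequential_search` are made up for illustration; the point is only that every page is read once per query, so the work grows in step with the size of the collection.

```python
def sequential_search(pages, term):
    """Return the URLs of every page whose text contains the term."""
    term = term.lower()
    matches = []
    for url, text in pages:            # visit every page, one at a time
        if term in text.lower():
            matches.append(url)
    return matches


# Hypothetical three-page "web" used only for this example.
pages = [
    ("a.example/cars", "Used cars for sale"),
    ("b.example/soup", "A recipe for tomato soup"),
    ("c.example/race", "Race cars and their engines"),
]

print(sequential_search(pages, "cars"))   # ['a.example/cars', 'c.example/race']
```

With three pages this is instant. With a billion pages, and new ones appearing while the loop runs, the same loop would never finish in time to be useful.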
Some of the older spiders, like Alta Vista, are designed to literally roam randomly through the web using links to other pages. This is accomplished with high-speed servers that keep 300 connections open at one time. These web spiders are content based, which means they actually read and categorize the HTML on every page. One flaw of this approach is the verbal-disagreement problem, where a particular word can describe two different concepts. Type a few words into the query and you will be lucky to find anything that relates to what you are looking for. The query words can be anywhere in a page, and they are likely to be taken out of context.
Content-based searches can also be easily manipulated. Some tactics are very deceptive. For example, some automobile web sites have stooped to writing "Buy This Car" dozens of times in hidden fonts, a subliminal version of listing "AAAA Autos" in the Yellow Pages (1). The truth is that you would never know a site was doing this unless you looked at the code, and most consumers do not look at the code. A less subtle tactic is to pay to get to the top. For example, the engine GoTo accepts payment from those who wish to be at the top of a results list, because the sites at the top will get more traffic.
Lawrence Page and Sergey Brin of Google have come up with a different idea for searching called PageRank. They realized that the most popular sites are those linked to most often from other pages. Here is the pseudocode algorithm for searching:
1. Parse the query.
2. Convert the words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
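The eight steps above can be sketched roughly as follows. This is a simplified illustration, not Google's code: the barrels are plain dictionaries mapping a wordID to a list of document IDs, the seek-and-scan of steps 3 to 7 is replaced by a set intersection, both barrels are always consulted, and the `rank` function, the `lexicon`, and all sample data are hypothetical.

```python
import heapq


def evaluate(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Return the top-k document IDs that match every query word."""
    # Steps 1-2: parse the query and convert the words into wordIDs.
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []                      # an unknown word cannot match any document
    word_ids = [lexicon[w] for w in words]

    matched = {}
    # Steps 3-7 (simplified): look in the short barrels first, then the full ones.
    for barrel in (short_barrel, full_barrel):
        doclists = [set(barrel.get(w, ())) for w in word_ids]
        for doc in set.intersection(*doclists):      # step 4: matches all terms
            if doc not in matched:
                matched[doc] = rank(doc, word_ids)   # step 5: compute the rank

    # Step 8: sort the matches by rank and return the top k.
    return heapq.nlargest(k, matched, key=matched.get)


# Hypothetical index used only for this example.
lexicon = {"search": 1, "engine": 2}
short_barrel = {1: [10, 11], 2: [11]}
full_barrel = {1: [10, 11, 12], 2: [10, 11, 12]}
scores = {10: 0.2, 11: 0.9, 12: 0.5}


def rank(doc, word_ids):
    return scores[doc]                 # stand-in for the real ranking function


print(evaluate("search engine", lexicon, short_barrel, full_barrel, rank, k=2))
# [11, 12]
```

Checking the short barrels first means the most promising documents are found before the larger full barrels are touched at all.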
Each link to a page is like a vote for that page, and the weight of that vote carries on to the pages it links to in turn. Thus a hierarchy of pages is created and the search results are much more reliable. Lycos and Excite also use the same system, but Google goes further. It also looks at the position of the words on the page, the size of the fonts, and the likelihood that the words are related to each other (1). Going the extra distance gives Google much better precision.
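The voting idea can be illustrated with a short iterative calculation. The sketch below uses the commonly published form of PageRank with a damping factor of 0.85; the damping value, the iteration count, and the four-link sample graph are assumptions made for this example, not details taken from the sources cited here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for each page from a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}          # start with equal ranks

    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no links spreads its vote over every page.
                for t in pages:
                    new[t] += damping * ranks[page] / n
        ranks = new
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
best = max(ranks, key=ranks.get)
print(best)   # C
```

Page C comes out on top because it collects votes from both A and B, and A ranks above B because its single incoming vote comes from the highly ranked C.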
Google and engines like it can still be manipulated to achieve higher rankings. Anyone who creates a set of pages with links between them can fool the system and add value to their page. So the race continues to find yet another search engine. One promising way to search for something is to use a topic-specific search engine. Among the topic-specific engines are VacationSpot.com, KidsHealth.org, and epicurious.com. These engines give you better results because they are often a front-end to a database of information, they are regularly maintained and updated, and they have a narrow focus and smaller size.
It makes sense that if you do a specific search, then you are less likely to end up with irrelevant information. The good news is that you are getting high quality results in a short period of time. The only problem with topic-specific engines is finding the right one. This is where query routing comes into play. There are two types: manual and automatic. Manual routing means you find the best topic matching your query yourself, which can be confusing. Automatic routing means designing an algorithm to do it for you.
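As a rough illustration of automatic routing, the sketch below picks the topic engine whose keyword list overlaps most with the query. The engine names, keyword sets, and the `route` function are all hypothetical; a real router would use a far richer description of each engine.

```python
def route(query, engines):
    """Return the name of the engine whose keywords best match the query."""
    words = set(query.lower().split())
    best = max(engines, key=lambda name: len(words & engines[name]))
    # If even the best engine shares no word with the query, route nowhere.
    return best if words & engines[best] else None


# Hypothetical topic engines, each described by a few keywords.
engines = {
    "travel": {"hotel", "flight", "vacation"},
    "health": {"fever", "flu", "vaccine"},
    "cooking": {"recipe", "soup", "bake"},
}

print(route("flu vaccine for kids", engines))   # health
print(route("stock prices", engines))           # None
```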
One of the newer automatic routers is called Q-Pilot. Q-Pilot uses both offline and online areas for quicker access. When a user enters a query, that query is expanded to create multiple topics that are more specific. These topics are taken from a neighborhood of pages and often represent another search engine's topics. "Q-Pilot uses the web as its knowledge base and autonomously learns what it does not know" (2). This almost sounds like artificial intelligence. Certainly the easiest way to index 100 million pages a day would be to get a computer to do it automatically.
The terms query expansion, clustering, and routing are sure to be seen many times in the near future, as they become necessities for good search engines. They can be found in some Meta search engines such as QueryServer. Query expansion, as mentioned before, is like a thesaurus. It gathers all relevant words in neighborhood pages that might mean the same as those entered in the main query. Then it checks to see how many times those words appear in similar pages: the more times they appear, the more relevant they are. It may be necessary to re-evaluate some words if there is little co-occurrence overall.
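A minimal sketch of co-occurrence counting shows the idea. It assumes the neighborhood pages are available as plain strings and uses a tiny made-up stop-word list; the function name `expand_query` and the sample pages are hypothetical and are not taken from Q-Pilot or QueryServer.

```python
from collections import Counter

# Hypothetical list of words too common to be useful as expansions.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in"}


def expand_query(query, neighborhood, top_n=3):
    """Suggest extra terms that often appear on pages matching the query."""
    terms = set(query.lower().split())
    counts = Counter()
    for text in neighborhood:
        words = set(text.lower().split())
        if terms & words:                            # page mentions a query term
            counts.update(words - terms - STOPWORDS)  # count its other words once
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical neighborhood pages.
neighborhood = [
    "jaguar car dealer",
    "jaguar car prices",
    "jaguar cat habitat",
    "tomato soup recipe",
]

print(expand_query("jaguar", neighborhood, top_n=1))   # ['car']
```

Here "car" co-occurs with the query word on two pages while every other word appears only once, so it is the strongest candidate for expansion.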
Clustering comes right after query expansion and is relatively fast. The engine sorts the results of the primary search engines and groups them. If you then see a more specific topic among those, you can go directly to the matches for it. Q-Pilot will give you three different clusters at most to reduce confusion. A pattern seems to be emerging here. New search engines are actually old engines combined with each other and with new ideas. So wouldn't the best be one that combines all features? In a word, yes. That is the idea behind a Meta search engine.
QueryServer (http://www.queryserver.com) is an example of one of the very latest Meta search engines. It uses ten primary engines like Yahoo and Google, has customizable matching and clustering, and shows you details of the results like number of matches and response time. An aspect that should not be overlooked concerning Meta search engines is the data model. The data model essentially communicates between the primary and secondary engines, converting the query into the correct format. This is necessary because some engines take plain words in search strings while others use Boolean expressions. In order to utilize the features of each engine, the data model should be able to adapt to different engines to achieve good precision.
The authors explain, "Based on such a data model, a meta search engine can achieve several advantages:
1. It will present to users a more sophisticated interface
2. Make the translation more accurate
3. Get more complete and precise results
4. Improve source selection and running priority decisions" (3).
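The translation step of such a data model can be sketched as a small function that rewrites one neutral query for each primary engine. The engine names and the three query styles below are invented for illustration and do not describe the syntax of any real engine.

```python
def translate(terms, style):
    """Rewrite a list of query terms in the syntax one engine expects."""
    if style == "keywords":
        return " ".join(terms)                   # plain words
    if style == "boolean":
        return " AND ".join(terms)               # explicit Boolean operators
    if style == "plus":
        return " ".join("+" + t for t in terms)  # required-term prefix
    raise ValueError("unknown query style: " + style)


# Hypothetical primary engines and the style each one accepts.
ENGINE_STYLES = {
    "engine_a": "keywords",
    "engine_b": "boolean",
    "engine_c": "plus",
}


def fan_out(query):
    """Produce one translated query per primary engine."""
    terms = query.split()
    return {name: translate(terms, style) for name, style in ENGINE_STYLES.items()}


print(fan_out("cheap flights")["engine_b"])   # cheap AND flights
```

Keeping the neutral query separate from each engine's syntax is what lets a Meta search engine add or replace a primary engine without changing how users enter their queries.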
Again the idea of optimizing the Internet through intelligent software shows up. It is just a matter of designing a certain algorithm that does not forget what it has learned.
Most people did not foresee the tremendous growth of the Internet in the 1990s. Computer algorithms have gone from small government programs to every personal computer in the world. You start with the most basic kind of problem solving and end up with the most complex. That, of course, is sorting through a database that grows almost exponentially.
Plain and simple, the Internet has a lot of information on it. A crawler works twenty-four hours a day digging through it all. The search engine pulls out the parts people want and hands them to the Meta search engine. The Meta search engine further discriminates until you get exactly what you are looking for. Yet behind all this are machines performing the instructions they have been given: an algorithm.