Thursday, May 1, 2008

Semantic Search Technology

Today, whether you are at your PC or wandering the corporate halls with your PDA, searching the Web has become an essential part of doing business. As a result, commercial search engines have become very lucrative and companies such as Google and Yahoo! have become household names.

In ranking Web pages, search engines follow a set of rules with the goal of returning the most relevant pages at the top of their lists. To do this, they look at the location and frequency of keywords and phrases. Keywords, however, are subject to two well-known linguistic phenomena that strongly degrade a query's precision and recall: polysemy (one word might have several meanings) and synonymy (several words or phrases might designate the same concept).

Three characteristics of search engine performance separate useful searches from fruitless ones:

* Maximum relevant information,

* Minimum irrelevant information, and

* Meaningful ranking, with the most relevant results first.
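The first two characteristics correspond to the recall and precision mentioned earlier. As a minimal sketch, assuming a hypothetical query for "jaguar" (the animal) and made-up page identifiers that do not come from any real engine, the two measures can be computed like this:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved pages that are relevant
    recall    = share of relevant pages that were retrieved
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical results for the keyword "jaguar".
# Polysemy: two pages about the car are returned although the user meant the animal.
retrieved = ["jaguar-cat-1", "jaguar-car-1", "jaguar-car-2", "jaguar-cat-2"]
# Synonymy: a relevant page that only says "Panthera onca" is never returned.
relevant = ["jaguar-cat-1", "jaguar-cat-2", "panthera-onca-1"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# prints "precision = 0.50, recall = 0.67"
```

In this toy case polysemy lowers precision (irrelevant car pages are returned) and synonymy lowers recall (a relevant page is missed).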

In addition, some search engines use Google's approach to ranking, which assesses popularity by the number of links pointing to a given site. The heart of Google's search software is PageRank, a system that relies on the Web's vast link structure as an indicator of an individual page's value. It interprets a link from page A to page B as a vote, by page A, for page B. Important sites receive a higher PageRank. Votes cast by pages that are themselves ‘important’ weigh more heavily and help to make other pages ‘important.’
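The voting idea can be sketched in a few lines. This is a simplified, textbook-style iteration, not Google's production system; the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated vote passing.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A -> B, B -> C, C -> A and B, D -> B.
links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph page B collects votes from three pages and ends up with the highest score, while page D, which nobody links to, ends up with the lowest. Page C receives only one link, but because that link comes from the ‘important’ page B, C still ranks above A.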

Nevertheless, it is still common for searches to return too many unwanted results and to miss important information. Recently, Google and other innovators have been seeking to implement limited natural language (semantic) search. Semantic search methods could improve on traditional results by using not just words, but concepts and logical relationships. Two approaches to semantics are Semantic Web Documents and Latent Semantic Indexing (LSI).

LSI is an information retrieval method that organizes existing HTML information into a semantic structure that takes advantage of some of the implicit higher-order associations of words with text objects. The resulting structure reflects the major associative patterns in the data. This permits retrieval based on the ‘latent’ semantic content of existing Web documents, rather than just on keyword matches. Because it works on documents as they are, LSI can be applied immediately to existing Web documentation. In a semantic network, by contrast, the meaning of content is better represented and logical connections are formed.
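A toy example may make the ‘latent’ idea concrete. The sketch below builds a term-by-document count matrix for three made-up documents, finds its two strongest association patterns, and compares a query with each document in that reduced space. The vocabulary, the documents, the choice of two dimensions, and the use of power iteration (instead of a library SVD routine) are all assumptions made to keep the example self-contained.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def top_eigenvectors(matrix, k, iterations=200):
    """Top-k eigenvectors of a symmetric matrix via power iteration with deflation."""
    work = [row[:] for row in matrix]
    n = len(work)
    found = []
    for _ in range(k):
        v = [1.0 + 0.1 * j for j in range(n)]  # fixed start vector
        length = math.sqrt(dot(v, v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in work])
        found.append(v)
        # Remove the pattern just found so the next pass finds the next strongest one.
        for r in range(n):
            for c in range(n):
                work[r][c] -= eigenvalue * v[r] * v[c]
    return found


terms = ["car", "automobile", "engine", "flower", "petal"]
docs = {"d0": "car engine", "d1": "automobile engine", "d2": "flower petal"}


def vectorize(text):
    words = text.split()
    return [float(words.count(term)) for term in terms]


doc_vectors = {name: vectorize(text) for name, text in docs.items()}

# Term-by-term co-occurrence matrix (A times A-transpose for the count matrix A).
cooccurrence = [
    [sum(vec[i] * vec[j] for vec in doc_vectors.values()) for j in range(len(terms))]
    for i in range(len(terms))
]
basis = top_eigenvectors(cooccurrence, k=2)


def project(vec):
    """Map a term-count vector into the 2-dimensional latent space."""
    return [dot(u, vec) for u in basis]


query = vectorize("car")
keyword = {name: cosine(query, vec) for name, vec in doc_vectors.items()}
latent = {name: cosine(project(query), project(vec)) for name, vec in doc_vectors.items()}

for name in docs:
    print(name, round(keyword[name], 2), round(latent[name], 2))
```

Document d1 never contains the word "car", so plain keyword matching scores it at zero. In the latent space it scores close to 1, because "car" and "automobile" both co-occur with "engine". The unrelated document d2 stays near zero in both cases.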

However, most semantic-network-based search engines suffer from performance problems because of the sheer scale of the semantic network. For a semantic search to be effective in finding responsive results, the network must contain a great deal of relevant information. At the same time, a large network creates difficulties in processing the many possible paths to a relevant solution.

Most of the early efforts on semantic-based search engines were highly dependent on natural language processing techniques to parse and understand the query sentence. One of the first and most popular of these search engines comes from Cycorp (http://www.cyc.com). Cyc combines the world’s largest knowledge base with the Web. Cyc (which takes its name from en-cyc-lopedia) is an immense, multi-contextual knowledge base. With the Cyc Knowledge Server, it is possible for Web sites to add common-sense intelligence and distinguish different meanings of ambiguous concepts.

For more information about technology innovations and Web video, see the following references.

REFERENCES:

Alesso, H. P. and Smith, C. F., Connections: Patterns of Discovery, John Wiley & Sons, Inc., 2008.

Alesso, H. P. and Smith, C. F., Developing Semantic Web Services, A. K. Peters Inc., 2004.

Web Site:
Video Software Lab
