More Info

 

The word breaker and stemmer we developed can be used by search engines to efficiently find a set of words in Turkish documents.

Turkish Word Breaker

The word breaker that we developed conforms to the IWordBreaker interface defined by Microsoft and is a pluggable component for the MS Search technology. Given a buffer of Unicode text, it parses the text to identify individual words and noun phrases. We optimized the tool for high throughput and minimal resource use.

Our word breaker can be installed on Microsoft Windows 2000 and Windows XP for use by the indexing engine.

This tool can also be licensed by search engines for use in three different situations:

- Index time: During indexing, our tool splits all text referenced by the search engine into words. Since indexing occurs continuously as documents are created or modified, we designed our tools to maximize throughput while using minimal resources.
- Query time: At query time, the text of a query can also be broken into words using our tool.
- Hit highlighting: Our tool can also be used during hit highlighting (locating hits within the content of a particular document so that the user can easily identify the relevant portions of the document).
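
The three uses above can be illustrated with a minimal Python sketch. It is not our word breaker and does not use the IWordBreaker interface; it assumes a simple regular-expression rule (runs of letters, optionally joined by an apostrophe) as a stand-in, and the names `break_words` and `highlight` are hypothetical. Reporting character offsets along with each word is what makes hit highlighting possible.

```python
import re

# Assumed stand-in rule: a word is a run of Unicode letters, optionally
# continued after an apostrophe (as in "Ahmet'in"). A real word breaker
# applies far richer language-specific logic.
_WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def break_words(text):
    """Return a (word, start, end) tuple for each word found in text."""
    return [(m.group(0), m.start(), m.end()) for m in _WORD.finditer(text)]


def highlight(text, terms):
    """Wrap every word that equals a query term in square brackets."""
    # Note: str.lower() is not locale-aware, so the Turkish dotted and
    # dotless I are not handled correctly by this sketch.
    wanted = {t.lower() for t in terms}
    out, pos = [], 0
    for word, start, end in break_words(text):
        if word.lower() in wanted:
            out.append(text[pos:start])   # text before the hit
            out.append("[" + word + "]")  # the highlighted hit
            pos = end
    out.append(text[pos:])                # remaining text after the last hit
    return "".join(out)
```

The same `break_words` function would be applied to documents at index time and to the query text at query time, so that both sides are split by the same rules.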

Turkish Stemmer

The stemmer that we developed conforms to the IStemmer interface defined by Microsoft. Given a word as input, our stemmer uses inflectional generation to produce grammatically related words that share the same stem or base form. Search engines can use it to generate the inflected forms of a Turkish word and thereby return more relevant results for a single search.

Our stemmer can be installed on Microsoft Windows 2000 and Windows XP for use by the indexing engine.

This tool can also be licensed by search engines to help find all documents that contain words derived from or similar to a specific set of query words. For example, without our stemmer, searching for "Ahmet resim" (Ahmet picture) in a set of documents will return only the documents that contain these words in their base forms. If a document contains the words "Ahmet'in resimleri" (pictures of Ahmet), the search engine will not be able to locate it, because the query text does not match the form used in the document. A search engine using our tool will identify this document as well, because "Ahmet'in" can be grammatically derived from "Ahmet" and "resimleri" can be derived from "resim."
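
The "Ahmet resim" example can be sketched in a few lines of Python. This is only an illustration: the `INFLECTED_FORMS` table is a hand-written assumption containing just the forms mentioned above, whereas a real stemmer generates the forms from Turkish morphology, and the function names are hypothetical.

```python
# Illustrative table only (an assumption for this sketch): a real stemmer
# derives inflected forms from the grammar instead of looking them up.
INFLECTED_FORMS = {
    "ahmet": {"ahmet", "ahmet'in"},
    "resim": {"resim", "resimleri"},
}


def expand_query(query):
    """Map each query word to the set of word forms accepted as a match."""
    return [INFLECTED_FORMS.get(w, {w}) for w in query.lower().split()]


def matches_exact(document, query):
    """Match without stemming: every query word must appear as typed."""
    doc_words = set(document.lower().split())
    return all(w in doc_words for w in query.lower().split())


def matches_expanded(document, query):
    """Match with stemming: any accepted form of each query word will do."""
    doc_words = set(document.lower().split())
    return all(forms & doc_words for forms in expand_query(query))
```

With these definitions, `matches_exact("Ahmet'in resimleri", "Ahmet resim")` is false, while `matches_expanded` with the same arguments is true.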

This tool can also be licensed by search engines for use in two different situations:

- Query-time generation: Inflected word forms are generated at query time so that the search engine can identify the documents that contain any of the inflected forms.
- Inflected hit highlighting: Our tool can also be used during hit highlighting to highlight inflected word forms as well.

Turkish Noise Words

Certain words are used so frequently that eliminating them improves both the performance of the indexing engine and the quality of the search results. The indexing engine should use the Turkish noise word list when it invokes a word breaker for Turkish, removing noise words both from query terms and from content that is included in the full-text index.
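
As a rough sketch of this filtering step in Python: the four words below are common Turkish function words chosen purely for illustration and are not the actual noise word list, and `remove_noise` is a hypothetical name.

```python
# Illustrative sample only, not the actual Turkish noise word list:
# "ve" (and), "bir" (a/one), "bu" (this), "ile" (with).
NOISE_WORDS = {"ve", "bir", "bu", "ile"}


def remove_noise(words):
    """Drop noise words from a list of words produced by a word breaker."""
    return [w for w in words if w.lower() not in NOISE_WORDS]
```

The same filter would be applied to the words of a query and to the words of each document before they enter the full-text index, so that both sides ignore the same terms.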

__________________________


© 1999-2005 hk29 software services
2005-05-09