The word breaker and stemmer we developed can be used by search engines to efficiently find a set of words in Turkish documents.

Turkish Word Breaker

The word breaker that we developed conforms to the IWordBreaker interface defined by Microsoft and is a pluggable component for the MS Search technology. It breaks buffers of Unicode characters into words: given a Unicode text, it parses the text to find individual words and noun phrases. We optimized this tool for both high throughput and minimal use of resources. Our word breaker can be installed on Microsoft Windows 2000 and Windows XP for use by the indexing engine. This tool can also be licensed by search engines to be used in three different situations:
Turkish Stemmer

The stemmer that we developed conforms to the IStemmer interface defined by Microsoft. Given a word as input, our stemmer uses inflectional generation to produce grammatically related words that share the same stem, or base form. Search engines can use our stemmer to generate the inflected forms of a Turkish word, so that a single search returns a more relevant result set. Our stemmer can be installed on Microsoft Windows 2000 and Windows XP for use by the indexing engine. This tool can also be licensed by search engines to help find all documents that contain words derived from or similar to a specific set of query words.

For example, without our stemmer, searching for "Ahmet resim" (Ahmet picture) in a set of documents returns only the documents that contain these words in their base forms. If a document contains the words "Ahmet'in resimleri" (pictures of Ahmet), the search engine cannot locate it, because the query text does not match the forms used in the document. A search engine using our tool will also identify this document, because "Ahmet'in" can be grammatically derived from "Ahmet" and "resimleri" can be derived from "resim." This tool can also be licensed by search engines to be used in two different situations:
Turkish Noise Words

Certain words are used so frequently that eliminating them improves both the performance of the indexing engine and the quality of the search results. The indexing engine should use the Turkish noise word list whenever it invokes a word breaker for Turkish, removing noise words both from query terms and from content that is included in the full-text index.
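The way the three components fit together can be illustrated with a small Python sketch. It is not the product's implementation and does not use the IWordBreaker or IStemmer COM interfaces: the names `NOISE_WORDS`, `INFLECTIONS`, `break_words`, `remove_noise`, `expand`, and `matches` are hypothetical, the noise-word set is a tiny illustrative sample rather than the actual list, and the hand-written inflection table stands in for the stemmer's inflectional generation.

```python
import re

# Hypothetical noise-word sample (illustrative only, not the product's list).
NOISE_WORDS = {"ve", "bir", "bu", "ile"}

# Hypothetical table standing in for the stemmer's output:
# base form -> inflected forms a real stemmer would generate.
INFLECTIONS = {
    "ahmet": {"ahmet", "ahmet'in"},
    "resim": {"resim", "resimler", "resimleri"},
}


def break_words(text):
    """Split Unicode text into lowercase tokens, keeping ASCII apostrophes
    inside words. Note: str.lower() is not Turkish-aware (I/ı, İ/i), so a
    real word breaker would need locale-specific case folding."""
    return [token.lower() for token in re.findall(r"\w+(?:'\w+)*", text)]


def remove_noise(tokens):
    """Drop noise words, as the indexing engine does for queries and content."""
    return [token for token in tokens if token not in NOISE_WORDS]


def expand(term):
    """Return the inflected forms of a query term (the term itself if unknown)."""
    return INFLECTIONS.get(term, {term})


def matches(query, document, use_stemmer=True):
    """True if every non-noise query term, or one of its inflected forms
    when use_stemmer is set, occurs in the document."""
    doc_tokens = set(remove_noise(break_words(document)))
    query_terms = remove_noise(break_words(query))
    if not query_terms:
        return False
    return all(
        (expand(term) if use_stemmer else {term}) & doc_tokens
        for term in query_terms
    )


document = "Ahmet'in resimleri ve bir manzara"
print(matches("Ahmet resim", document, use_stemmer=False))
print(matches("Ahmet resim", document, use_stemmer=True))
```

With the stemmer disabled, the query "Ahmet resim" fails to match the sample document because only exact base forms are compared; with expansion enabled, "Ahmet'in" and "resimleri" are accepted as forms of the query terms, mirroring the example given in the stemmer description.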