tsql - Lucene.Net Underscores causing token split


I've scripted MS SQL Server database tables, views, and stored procedures into a directory structure that I'm indexing with Lucene.Net. Some of the table, view, and procedure names contain underscores.

I use the StandardAnalyzer. If I query for a table named tir_invoicebtnwtn01, for example, I receive hits for tir and invoicebtnwtn01, rather than tir_invoicebtnwtn01.

I think the issue is that the tokenizer is splitting on _ (underscore), since it treats it as punctuation.

Is there a (simple) way to remove underscores from the punctuation list, or is there another analyzer I should be using for SQL and programming languages?

Yes, StandardAnalyzer splits on underscores; WhitespaceAnalyzer does not. Note that you can use PerFieldAnalyzerWrapper to apply a different analyzer to each field - you might want to keep all of the standard analyzer's functionality except for the table/column name field.
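For example, a PerFieldAnalyzerWrapper could keep StandardAnalyzer as the default and switch analyzers only for the field that holds the object names. A minimal sketch, assuming Lucene.Net 2.9 and a hypothetical field name "objectName":

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// StandardAnalyzer for every field, a different analyzer for the name field only.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
analyzer.AddAnalyzer("objectName", new WhitespaceAnalyzer());
// Pass this wrapper to both the IndexWriter and the QueryParser so the
// name field is tokenized the same way at index time and at query time.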

WhitespaceAnalyzer only does whitespace splitting, though; it won't lowercase tokens, for example. You might want to make your own analyzer that combines a WhitespaceTokenizer and a LowerCaseFilter, or use a LowerCaseTokenizer; a sketch of the whitespace-plus-lowercase combination is below.
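A minimal sketch of that combination, assuming Lucene.Net 2.9 (the class name is just illustrative):

using System.IO;
using Lucene.Net.Analysis;

// Splits only on whitespace, then lowercases each token, so a name like
// tir_invoicebtnwtn01 survives as a single term.
class WhitespaceLowerCaseAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}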

Edit: here is a simple custom analyzer (in C#; it translates to Java pretty easily):

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Chains a standard tokenizer, standard filter, and lowercase filter
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter;
    }
}
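Whichever analyzer you end up with, pass the same one to both the IndexWriter and the QueryParser so indexing and searching tokenize the name field alike. A rough usage sketch, assuming Lucene.Net 2.9 and a hypothetical field called "name":

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

RAMDirectory dir = new RAMDirectory();
MyAnalyzer analyzer = new MyAnalyzer();

// Index one document with the object name in a "name" field.
IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.Add(new Field("name", "tir_invoicebtnwtn01", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(doc);
writer.Close();

// Parse the query with the same analyzer so both sides agree on tokens.
QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "name", analyzer);
Query query = parser.Parse("tir_invoicebtnwtn01");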
