Lucene.Net Underscores causing token split
I've scripted my MS SQL Server database's tables, views, and stored procedures into a directory structure that I'm indexing with Lucene.Net. Some of the table, view, and procedure names contain underscores.

I use a StandardAnalyzer. If I query for a table named tir_invoicebtnwtn01, for example, I receive hits for tir and invoicebtnwtn01, rather than tir_invoicebtnwtn01.

I think the issue is that the tokenizer is splitting on _ (underscore), since it is punctuation.

Is there a (simple) way to remove underscores from the punctuation list, or is there an analyzer I should be using for SQL and programming languages?
Yes, StandardAnalyzer splits on underscores; WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to apply a different analyzer to each field - you might want to keep all of the standard analyzer's functionality except for the table/column name field.
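For example, the wrapper can be set up like this (a minimal sketch assuming Lucene.Net 2.9-era APIs; the "name" field is a hypothetical field holding the object names):

```csharp
// Use standard analysis for every field by default, but whitespace
// splitting (which keeps underscores intact) for the object-name field.
// "name" and "directory" are placeholder names used for illustration.
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
analyzer.AddAnalyzer("name", new WhitespaceAnalyzer());

// Pass the wrapper wherever an Analyzer is expected, e.g. to the IndexWriter:
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
```

Remember to use the same wrapper at query time, so the query parser tokenizes the name field the same way it was indexed.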
WhitespaceAnalyzer only does whitespace splitting, though. It won't lowercase tokens, for example. You might want to make your own analyzer that combines a WhitespaceTokenizer and a LowerCaseFilter, or look at LowerCaseTokenizer.

Edit: here's a simple custom analyzer (in C#; you can translate it to Java pretty easily):
// Chains together a whitespace tokenizer (which keeps underscores intact)
// and a lowercase filter
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        WhitespaceTokenizer baseTokenizer = new WhitespaceTokenizer(reader);
        LowerCaseFilter lcFilter = new LowerCaseFilter(baseTokenizer);
        return lcFilter;
    }
}