Adding new languages to Sitecore's Solr indexes
Sitecore's Solr indexes come with a large set of preconfigured languages, so you normally don't need to touch Solr when introducing a new language to Sitecore. However, some languages are missing from the managed schema, which means you have to take action if you want to use them in Sitecore.
Mark Lowe wrote a nice blog post on how to apply changes to the managed schema in general. Alternatively, you can use config sets to manage the schema files and share them across several of your Solr cores.
Missing languages
The following list shows configurations for languages that are missing from the managed schema provided by Sitecore 9. Just put them inside the root element of the managed schema.
The samples below contain configurations that have worked for us so far. That said, there are probably better configurations out there. If you have suggestions for improvement, please let me know in the comments.
Chinese
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_zh.txt" />
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<dynamicField name="*_t_zh" type="text_zh" indexed="true" stored="true"/>
For Chinese we used the HMMChineseTokenizerFactory. If you apply this configuration to the schema, you will probably see the following error message when restarting the Solr service:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core sitecore_master_index: Can't load schema C:\...\managed-schema: Plugin init failure for [schema.xml] fieldType "text_zh": Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.HMMChineseTokenizerFactory'
In Solr 6+ this tokenizer is not enabled by default. Therefore, you need to add the following line to the solrconfig.xml of every core that should use this tokenizer.
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
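As a rough sketch of where this directive might sit, assuming a solrconfig.xml that follows the layout of the stock Solr configuration: lib elements are direct children of the root config element. The comments marked with dots are placeholders for whatever your core already contains, not real settings.

```xml
<config>
  <!-- ... existing settings of the core ... -->

  <!-- Loads every jar from the analysis-extras contrib folder, which is
       where the Smart Chinese analyzer lives. The value after the colon
       is only a fallback for when solr.install.dir is not set; adjust it
       if your directory layout differs. -->
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />

  <!-- ... remaining settings ... -->
</config>
```

After changing solrconfig.xml, the core has to be reloaded or the Solr service restarted before the new classes are picked up.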
Korean
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ko.txt" />
    <filter class="solr.CJKBigramFilterFactory" />
  </analyzer>
</fieldType>
For Korean we used the CJKBigramFilterFactory. Keep in mind that Lucene 7.4 introduced a dedicated Korean tokenizer, KoreanTokenizerFactory, which was adopted by Solr 7.5. It looks like we are going to use it with a future version of Sitecore.
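For reference, here is a minimal sketch of what a text_ko field type based on the KoreanTokenizerFactory could look like on Solr 7.5 or later. It is modelled on the Korean field type in Solr's own default schema, not on anything shipped by Sitecore, and we have not run it in a Sitecore setup. The dynamicField line is an assumption as well: it simply follows the pattern used for Chinese and Polish.

```xml
<!-- Sketch only: needs Solr 7.5 or later; not taken from a Sitecore schema. -->
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Dictionary-based Korean tokenizer. "discard" splits compounds
         and drops the original compound form. -->
    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false" />
    <!-- Removes tokens based on their part-of-speech tag (default tag set). -->
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
    <!-- Replaces Hanja with their Hangul reading. -->
    <filter class="solr.KoreanReadingFormFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
<!-- Assumed mapping, following the *_t_zh and *_t_pl pattern. -->
<dynamicField name="*_t_ko" type="text_ko" indexed="true" stored="true"/>
```

Compared to the bigram approach, a dictionary-based tokenizer produces real word tokens instead of overlapping character pairs, which usually keeps the index smaller. Whichever variant you pick, rebuild the affected indexes after changing the field type.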
Polish
<fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pl.txt" />
    <filter class="org.apache.lucene.analysis.stempel.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>
<dynamicField name="*_t_pl" type="text_pl" indexed="true" stored="true"/>
The StempelPolishStemFilterFactory is not enabled out of the box in Solr 6+ either. The trick we used for Chinese also enables this filter.
Misconfigured languages
Two languages aren't properly configured in the managed schema of Sitecore 9. For these languages to work, you need to change the language suffix in the name attribute of the dynamicField element so that it matches the language code Sitecore uses (cs for Czech, nb for Norwegian Bokmål).
Czech
old: <dynamicField name="*_t_cz" type="text_cz" indexed="true" stored="true"/>
new: <dynamicField name="*_t_cs" type="text_cz" indexed="true" stored="true"/>
Norwegian
old: <dynamicField name="*_t_no" type="text_no" indexed="true" stored="true"/>
new: <dynamicField name="*_t_nb" type="text_no" indexed="true" stored="true"/>