Interestingly, I and both of the resource aggregators are from the same high school, i.e. Izmir Fen Lisesi
Will maintain my own list upon these. I forked (copy-pasted) and combined their lists below.
- Social Network Analysis and Data Mining (SoNADaM). Although the lab name does not include NLP or text in it, publications listed on their page should justify my decision to have them here.
- Arzucan Ozgur (used to be my TA) (Turkish polarity seed words list will be expanded in two-three months)
- Turkish Morphological Parser and Disambiguator by Haşim Sak.
- Another Turkish Morphological Disambiguator by Deniz Yüret.
- Turkish Natural Language Resources Compilation by Deniz Yüret.
- Turkish Morphological Analyzer released under GPL by Çağrı Çöltekin.
Taner Sezer’s TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.
Hasim Sak’s page contains some useful Turkish language resources and code in addition to a large web corpus.
Özgür Yılmazel’s Bibliography on Turkish Information Retrieval and Natural Language Processing.
Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.
Turkish morphology training files. Semi-automatically tagged, has limited accuracy. Two files have the same data except the second file also includes the ambiguous parses (the first parse on each line is correct).
Turkish morphology test files, second one includes ambiguous parses (the first parse on each line is correct). The data is hand tagged, it has good accuracy.
Turkish morphological tagger, includes Oflazer’s finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publically available Xerox Finite State software.
Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu
Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, in three 11MB parts. From Kemal Oflazer. Please use with permission.
From Kemal Oflazer. Please use with permission.
Turkish treebank with dependency annotations. Please use with permission.
English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.
(Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)
Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.
Turkish English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 – Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to utf8). The Turkish part of the dataset is “selectively split”, i.e. some suffixes are separated from their stems, some are not. lm.tr.gz is the Turkish text used to develop the language model.