Monthly Archives: January 2014

Turkish Newspapers’ Circulation vs Twitter Follower Counts

Colored bars indicate the ratio of circulation to Twitter follower counts.

[Figure: rates]

red bars = circulation / follower_counts
blue bars = follower_counts / circulation

Initial Observations:

  • The two newspapers with the largest shortfall of Twitter followers relative to circulation, Posta and Sözcü, are tabloid populist newspapers…
  • Liberal, Islamic, conservative and centre-right newspapers have higher circulation than Twitter follower counts.
  • Newspapers with social-liberal, left-wing or socialist ideologies are much more popular on Twitter relative to their circulation.

 

newspaper       circulation  followers  Twitter popularity
SÖZCÜ               364,726     15,175             -95.84%
POSTA               415,225     39,308             -90.53%
TARAF                76,665     13,198             -82.78%
YENİ AKİT            64,309     23,434             -63.56%
AKŞAM               105,048     42,153             -59.87%
ZAMAN             1,174,257    540,419             -53.98%
TÜRKİYE             184,053     86,338             -53.09%
BUGÜN               167,644     93,210             -44.40%
VATAN               106,247     66,565             -37.35%
YENİ ŞAFAK          128,092     97,216             -24.10%
SABAH               330,434    260,383             -21.20%
STAR                136,469    110,480             -19.04%
MİLLİ GAZETE         23,953     27,412              14.44%
YURT                 51,507     69,361              34.66%
HÜRRİYET            402,770  1,190,000             195.45%
MİLLİYET            166,858    644,844             286.46%
EVRENSEL              6,632     67,035             910.78%
BİRGÜN               10,768    144,735            1244.12%
RADİKAL              22,847    389,816            1606.20%

Circulation for the week of (01.13.2014 – 01.19.2014) is obtained from here.
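The "twitter popularity" column is consistent with the relative difference between followers and circulation, i.e. (followers - circulation) / circulation. A minimal sketch that recomputes it for three rows of the table; the function name is my own and only serves the illustration:

```python
def twitter_popularity(circulation, followers):
    """Relative difference of followers w.r.t. circulation, in percent."""
    return (followers - circulation) / circulation * 100.0


# (circulation, followers) pairs copied from the table above
rows = {
    "SÖZCÜ": (364726, 15175),
    "MİLLİ GAZETE": (23953, 27412),
    "RADİKAL": (22847, 389816),
}

for name, (circulation, followers) in rows.items():
    print(f"{name}: {twitter_popularity(circulation, followers):+.2f}%")
```

A negative value means the paper sells more copies than it has followers; a positive value means the opposite.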

Detecting Political Leanings & Propagandists on Twitter

Detecting political tweets based on hashtags (Conover et al. propose a single iteration; can it be improved by multiple iterations?):

  1. start by labeling one popular/predictive hashtag from each camp.
  2. label new hashtags if they co-occur with already labeled hashtags above a threshold rate (they need not belong to the same camp)
  3. manually remove the false positives.
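The first two steps above can be sketched as follows. This is only a rough illustration under my own assumptions: each tweet is reduced to its set of hashtags, the "co-occurrence rate" is taken to be the Jaccard coefficient between the tweets containing a candidate hashtag and the tweets containing any already labeled hashtag, and the threshold value is arbitrary. The `iterations` parameter is there to try the multiple-iterations idea; step 3 (manual cleanup) is not automated.

```python
from collections import defaultdict


def expand_hashtags(tweets, seeds, threshold=0.15, iterations=1):
    """Grow a set of political hashtags from seed hashtags by co-occurrence.

    tweets: list of sets of hashtags (one set per tweet).
    seeds: initially labeled hashtags, one or more per camp.
    """
    # Index: hashtag -> ids of the tweets that contain it
    tag_to_tweets = defaultdict(set)
    for tweet_id, tags in enumerate(tweets):
        for tag in tags:
            tag_to_tweets[tag].add(tweet_id)

    labeled = set(seeds)
    for _ in range(iterations):
        # Tweets that contain at least one labeled hashtag
        labeled_tweets = set()
        for tag in labeled:
            labeled_tweets |= tag_to_tweets.get(tag, set())

        new_tags = set()
        for tag, ids in tag_to_tweets.items():
            if tag in labeled:
                continue
            # Assumed co-occurrence rate: Jaccard coefficient
            jaccard = len(ids & labeled_tweets) / len(ids | labeled_tweets)
            if jaccard >= threshold:
                new_tags.add(tag)

        if not new_tags:
            break
        labeled |= new_tags
    return labeled


# Toy data, made up for the illustration
tweets = [
    {"#p2", "#dems"}, {"#p2", "#dems"}, {"#tcot", "#gop"},
    {"#tcot"}, {"#cats"}, {"#gop", "#teaparty"},
]
print(expand_hashtags(tweets, {"#p2", "#tcot"}, iterations=1))
print(expand_hashtags(tweets, {"#p2", "#tcot"}, iterations=2))
```

In this toy data the second iteration picks up "#teaparty", which only co-occurs with a hashtag labeled in the first iteration. That is the kind of gain (and the kind of drift) to expect from iterating.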

Constructing communication networks:

  1. vertices are tweeters of the political hashtags detected above.
  2. mention edge weights: number of mentions between the two users.
  3. retweet edge weights: number of retweets between the two users.
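A minimal sketch of building the two networks, assuming the interactions are already available as (source user, target user, kind) triples. That input format is hypothetical, and so is treating edges as undirected, which is just how I read "between the two users".

```python
from collections import Counter


def build_networks(interactions, political_users):
    """Return (mention_weights, retweet_weights) as Counters keyed by user pair.

    interactions: iterable of (source, target, kind), kind is "mention" or "retweet".
    political_users: users who tweeted at least one political hashtag (the vertices).
    """
    mention_weights = Counter()
    retweet_weights = Counter()
    for source, target, kind in interactions:
        # Keep only edges between distinct vertices of the network
        if source == target:
            continue
        if source not in political_users or target not in political_users:
            continue
        edge = tuple(sorted((source, target)))  # undirected: (a, b) == (b, a)
        if kind == "mention":
            mention_weights[edge] += 1
        elif kind == "retweet":
            retweet_weights[edge] += 1
    return mention_weights, retweet_weights


# Toy data, made up for the illustration
interactions = [
    ("a", "b", "retweet"), ("b", "a", "retweet"),
    ("a", "c", "mention"), ("a", "x", "retweet"),
]
mentions, retweets = build_networks(interactions, {"a", "b", "c"})
print(dict(mentions), dict(retweets))
```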

Clustering communication networks:

  1. starting with the retweet network constructed above, apply Newman’s modularity-based clustering algorithm.
  2. cluster by the label propagation method (Raghavan, 2007): iteratively assign each node the label that is shared by most of its neighbors. (I don’t understand why this step is needed.)
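To make the label propagation step concrete, here is a simplified sketch. It is not the exact procedure of the paper: it visits nodes in a fixed sorted order and breaks ties deterministically (the original method randomizes both), it starts from unique labels rather than from the modularity clusters of step 1, and it weights each neighbor's vote by the retweet edge weight, which is my own assumption. The adjacency structure must be symmetric.

```python
from collections import defaultdict


def label_propagation(weights, max_iters=100):
    """weights: dict node -> dict neighbor -> edge weight (symmetric)."""
    labels = {node: node for node in weights}  # every node starts alone
    for _ in range(max_iters):
        changed = False
        for node in sorted(weights):
            # Sum the edge weights behind each label among the neighbors
            votes = defaultdict(float)
            for neighbor, weight in weights[node].items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue
            best = max(votes.values())
            candidates = sorted(l for l, v in votes.items() if v == best)
            # Keep the current label on ties, otherwise take the first candidate
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Two tightly retweeting triangles joined by one weak edge (toy data)
weights = {
    "a": {"b": 3, "c": 3},
    "b": {"a": 3, "c": 3},
    "c": {"a": 3, "b": 3, "d": 1},
    "d": {"c": 1, "e": 3, "f": 3},
    "e": {"d": 3, "f": 3},
    "f": {"d": 3, "e": 3},
}
print(label_propagation(weights))
```

On this toy graph the two triangles end up with two different labels. Without the weights, a deterministic tie-break like this one can let a single label flood the whole graph, which is one reason the original method randomizes.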

Mentions form a communication bridge across which information flows between ideologically-opposed users; whereas, people with similar ideologies tend to retweet exclusively each other’s messages, especially propagandists:

  1. First, label one known popular user from each camp.
  2. At each iteration, relabel each user by argmax(assoc_1, …, assoc_n), where assoc_i is the fraction of the user’s retweet connections (users they retweeted or/∪ were retweeted by) that belong to camp_i. Stop after some iterations.
  3. If at least a fraction f of the connections are to users in the same cluster then the user is a hyperadvocate; otherwise, the user is neutral.
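The three steps above can be sketched like this. It is an illustration under my own assumptions: `retweet_neighbors` maps each user to the set of users they retweeted or were retweeted by, seed users keep their labels, ties are broken alphabetically, and the values of `iterations` and `f` are arbitrary. Since all assoc_i of a user share the same denominator, taking the argmax over raw counts gives the same result as over ratios.

```python
from collections import Counter


def label_camps(retweet_neighbors, seeds, iterations=5):
    """seeds: dict user -> camp for the manually labeled users."""
    labels = dict(seeds)
    for _ in range(iterations):
        updated = dict(labels)
        for user, neighbors in retweet_neighbors.items():
            if user in seeds:
                continue
            # Count labeled neighbors per camp (labels from the previous round)
            counts = Counter(labels[n] for n in neighbors if n in labels)
            if counts:
                updated[user] = max(sorted(counts), key=counts.get)
        labels = updated
    return labels


def classify(retweet_neighbors, labels, f=0.8):
    """Hyperadvocate if at least a fraction f of connections share the user's camp."""
    result = {}
    for user, neighbors in retweet_neighbors.items():
        if user not in labels or not neighbors:
            result[user] = "neutral"
            continue
        same = sum(1 for n in neighbors if labels.get(n) == labels[user])
        result[user] = "hyperadvocate" if same / len(neighbors) >= f else "neutral"
    return result


# Toy retweet network, made up for the illustration
retweet_neighbors = {
    "left_seed": {"a", "b", "m"},
    "right_seed": {"c", "m"},
    "a": {"left_seed", "b", "m"},
    "b": {"left_seed", "a"},
    "c": {"right_seed"},
    "m": {"left_seed", "right_seed", "a"},
}
labels = label_camps(retweet_neighbors, {"left_seed": "left", "right_seed": "right"})
print(labels)
print(classify(retweet_neighbors, labels))
```

Here "m" retweets across both camps, so only 2/3 of its connections fall in its own camp and it stays neutral with f = 0.8.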

Groups & Turkish Media

This study builds on the data we collected for Turkish Media Clustering. The extension is twofold: first, we look at the media preferences of five major groups in Turkey; second, we visualize the group descriptiveness of the media. We selected two Twitter accounts for each group, took the union of their follower IDs, and named the groups as follows:

  • Ak Party: ‘AKKULIS’, ‘AkTanitimMedya’
  • CHP: ‘CHP_online’, ‘herkesicinCHP’
  • MHP: ‘MHP_Bilgi’, ‘Ulku_Ocaklari’
  • BDP: ‘BDPgenelmerkez’, ‘HDP_Kongre’
  • Hizmet (Gulen Movement): ‘FGulencomTR’, ‘Herkul_Nagme’
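Building each group is just a set union over the follower IDs of its two accounts. A minimal sketch, assuming the follower IDs have already been downloaded into a dict keyed by account name; the IDs below are made up:

```python
def group_followers(followers_by_account, group_accounts):
    """Return dict group -> union of the follower IDs of its accounts."""
    groups = {}
    for group, accounts in group_accounts.items():
        members = set()
        for account in accounts:
            members |= set(followers_by_account.get(account, ()))
        groups[group] = members
    return groups


group_accounts = {
    "CHP": ["CHP_online", "herkesicinCHP"],
    "MHP": ["MHP_Bilgi", "Ulku_Ocaklari"],
}
# Hypothetical follower IDs, for illustration only
followers_by_account = {
    "CHP_online": {1, 2, 3},
    "herkesicinCHP": {3, 4},
    "MHP_Bilgi": {5},
    "Ulku_Ocaklari": {5, 6},
}
print(group_followers(followers_by_account, group_accounts))
```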

Resources for Turkish NLP

Interestingly, both of the resource aggregators and I are from the same high school, i.e. Izmir Fen Lisesi :-)

I will maintain my own list based on these. I forked (copy-pasted) and combined their lists below.

My Blog :) :

Amac’s Blog:

Deniz’s Blog:

TS Corpus

Taner Sezer’s TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.

BounWebCorpus

Hasim Sak’s page contains some useful Turkish language resources and code in addition to a large web corpus.

Bibliography

Özgür Yılmazel’s Bibliography on Turkish Information Retrieval and Natural Language Processing.

tr-disamb.tgz

Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.

correctparses_03.txt.gz, train.merge.gz

Turkish morphology training files. Semi-automatically tagged, so accuracy is limited. The two files contain the same data, except that the second file also includes the ambiguous parses (the first parse on each line is correct).

test.1.2.dis.gz, test.merge.gz

Turkish morphology test files; the second one includes ambiguous parses (the first parse on each line is correct). The data is hand-tagged and has good accuracy.

tr-tagger.tgz

Turkish morphological tagger, includes Oflazer’s finite state machines for Turkish. From Kemal Oflazer. Please use with permission. Requires the publicly available Xerox Finite State software.

turklex.tgz, pc_kimmo.tgz

Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu

Milliyet1.bz2, Milliyet2.bz2, Milliyet3.bz2

Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, in three 11MB parts. From Kemal Oflazer. Please use with permission.
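Since the corpus is bz2-compressed and Latin-5 (ISO-8859-9) encoded with one token per line, it can be streamed without unpacking it first. A minimal sketch using only the Python standard library; the function name is my own, and I assume each part can be read independently:

```python
import bz2


def read_tokens(path):
    """Yield tokens from a bz2-compressed, Latin-5 encoded, one-token-per-line file."""
    # "iso8859_9" is Python's codec name for Latin-5 (ISO-8859-9)
    with bz2.open(path, mode="rt", encoding="iso8859_9") as handle:
        for line in handle:
            token = line.rstrip("\r\n")
            if token:
                yield token


# Usage (assuming the file has been downloaded next to the script):
# for token in read_tokens("Milliyet1.bz2"):
#     print(token)
```

Decoding with the wrong codec (e.g. Latin-1) would not fail but would silently turn ş, ğ and İ into other characters, so the explicit encoding matters here.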

Turkish wordnet

From Kemal Oflazer. Please use with permission.

METU-Sabanci Turkish Treebank

Turkish treebank with dependency annotations. Please use with permission.

sozluk.txt.gz

English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.

sozluk-boun.txt.gz

Turkish word list (25822 words, 73K) Originally from www.cmpe.boun.edu.tr/courses/cmpe230

Avrupa Birliği Temel Terimler Sözlüğü

(Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)

BilisimSozlugu.zip

Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)

turkish.el

Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.

en-tr.zip, lm.tr.gz

Turkish-English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 – Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to UTF-8). The Turkish part of the dataset is “selectively split”, i.e. some suffixes are separated from their stems and some are not. lm.tr.gz is the Turkish text used to develop the language model.