Author Archives: toz

An Attempt to Create Heatmap of Ankara 2014 Elections by Google Fusion Tables

EDIT: This attempt failed because Google could not find the correct locations of many ballot boxes, for two reasons. First, the CHP STS dataset contains errors such as typos, acronyms, unofficial names of institutions, and wrong or outdated names. Second, GFTM (or the Google Maps API) could not find some addresses even when they were correct, especially when an address is far from the city center (e.g. Ayas, Cubuk, Kalecik, Elmadag, Akyurt, etc.); many of these addresses were mapped to a single location somewhere in downtown Ankara.

The unfortunate thing is that GFTM did not report these cases during the import, and I could not find a way of detecting, counting and correcting them other than finding spots where many different addresses were mapped ("black holes") and manually correcting the location of each ballot box address. You can play with the heatmap available here.

In my previous study, I created a ballot box map of Ankara where the color of a marker denotes the winner at that location. Yet vote shares and vote differences are not reflected in that map. We cannot tell, say, whether AKP has 0.5% or 20% more voters than CHP without clicking on a particular marker. My original intention in creating heatmaps of AKP and CHP was to reflect their vote shares and vote differences. Unfortunately, I cannot do this with Google Fusion Tables Maps (GFTM hereafter) because it represents the density, not the values, of the points:

Heatmaps display colors on the map to represent the density of points from a table.

says the GFTM page. It does, however, allow these points to be weighted. Based on my experiments, weights appear to be normalized (no matter how much I increased a point's weight, nothing seemed to change). The effect of weight is defined as follows:

The optional Weight column adjusts each point’s importance by multiplying its intensity by the specified column value

Since how densities are computed is not revealed, reverse engineering (i.e. giving lower weights to points at higher densities) is not applicable for removing the bias of clustered boxes. Weights should always be positive; otherwise GFTM places a red square at that location.

GFTM limits the number of points in a heatmap to 1000 because, unlike markers, the images are rendered in our local browsers. So I first had to merge (sum up) the ballot boxes at the same address; this also solved the boxes' superposition problem, at the cost of granularity. After this process I found ~1600 unique addresses. I then removed the non-central towns (Beypazarı, Çamlıdere, Evren, Güdül, Haymana, Kızılcahamam, Nallıhan, Polatlı and Şereflikoçhisar, listed as such here; I added one more victim, Bala), which left me with 1001 locations.
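The merge step can be sketched in a few lines of plain Python. The addresses and vote counts below are made up for illustration; the real CHP export has many more party columns.

```python
from collections import defaultdict

# Toy rows: (address, AKP votes, CHP votes). Column names are illustrative;
# the real dataset uses Turkish headers such as ALAN/ILCE/IL.
rows = [
    ("School A", 120, 140),
    ("School A", 130, 110),
    ("School B", 90, 200),
]

# Sum the vote counts of all ballot boxes that share an address.
totals = defaultdict(lambda: [0, 0])
for adres, akp, chp in rows:
    totals[adres][0] += akp
    totals[adres][1] += chp

merged = {adres: tuple(v) for adres, v in totals.items()}
# len(merged) is the number of unique addresses (heatmap points).
```

With the full dataset, `len(merged)` is what came out to ~1600 addresses above.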

The GFTM heatmap has two sliders: one for the radius, the other for the opacity. GFTM also does not allow such a map to be published. In this post I share some captured heatmaps; please let me know if you have any ideas on what/how/where to look. I hope this post helps someone :-)


Coloring & Mapping Ballot Boxes in Ankara

[EDIT]: To overcome the overlapping boxes problem discussed below, I merged the boxes in the same building and created another map. The result after merging is available here and here (click on a marker to get the details at that location):

 (go to the bottom of the page to see the per ballot box results, i.e. before merge)

I just created a map of Ankara with ballot boxes (sandik) colored by the winner party for that ballot box. Here is what I did:

  1. Dataset comes from the CHP (main opposition party) database, as the YSK (official election authority) has not released the vote counts.

     

  2. Go to your Google Drive. Click create. Connect more apps. Choose Fusion Table.
    Create Fusion Table
  3. Filter the previously downloaded xlsx spreadsheet down to Ankara ballot boxes in Excel, create a new workbook by copy/pasting into a new spreadsheet, and save it as csv.
  4. Combine the ALAN, ILCE and IL fields of the table into a new column called ADRES using Excel's CONCATENATE function, as explained here.
  5. Create a column KAZANAN to label the party with the highest vote count: =MATCH(MAX(C2:AB2),C2:AB2,0)
  6. Create another column KRENK for the icon/marker type. Here is a list of available markers in Fusion Table Maps. Excel expression: =IF(B2=1,"small_yellow",IF(B2=2,"small_red",IF(B2=3,"small_blue","small_green")))
  7. Upload your csv file to Fusion Tables. When you click Create Fusion Table as shown in the screenshot above, you are forwarded to the import page:
    Import spreadsheet Fusion Tables
  8. Google Fusion Tables automatically geocodes for you if you set the field type of ADRES to Location and go to the menu File > Geocode… NOTE: This process takes a very long time. There is supposed to be a daily limit on auto-geocoding for free accounts, but it appears this was somehow not applied to me.
    Auto geocode the given column
  9. It sometimes fails to detect the right address. You can manually alter the geocode information by switching to the table view and editing the row.
    Edit Row
  10. Editing a geocode pops up a map where you can enter an alternative address and hit the search button for suggestions, or directly enter @lon,lat information into the search box.
    Edit geocode
  11. To assign different colors/markers to ballot boxes, go to Tools > Change map and click Change feature styles. Then choose Use icon specified in a column.
    Marker selection
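The spreadsheet formulas in steps 4-6 can also be expressed in Python, which may be handy if you prefer scripting over Excel. The column values and party names below are illustrative, not from the actual dataset:

```python
# Toy ballot-box record; the real CHP export has different column names.
alan, ilce, il = "Ataturk Ilkokulu", "Cankaya", "Ankara"
votes = {"AKP": 210, "CHP": 245, "MHP": 60, "OTHER": 15}

# ADRES: concatenate the address parts (Excel: CONCATENATE).
adres = ", ".join([alan, ilce, il])

# KAZANAN: the party with the highest vote count. The Excel formula
# =MATCH(MAX(C2:AB2),C2:AB2,0) yields a numeric column index instead.
kazanan = max(votes, key=votes.get)

# KRENK: pick a marker icon per winner (Excel: nested IF on the index).
icons = {"AKP": "small_yellow", "CHP": "small_red", "MHP": "small_blue"}
krenk = icons.get(kazanan, "small_green")
```

Looping this over every row reproduces the three derived columns before the csv upload.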

Here is what to improve:

  1. Ballot boxes at the same address are superimposed.
  2. Some boxes are mislocated by Google Geocoder.

Here is the output:

Ankara ballot boxes colored


Zoom out once to get a view of Ankara by ballot box (before merging; includes every town):

What is your favorite citation manager?

Mine is Zotero. And I am using the Papership app on my iOS devices. Papership integrates very well with Zotero and lets you access, read and annotate your citations and files on Zotero. If you use Papership also on your Mac, which has recently become available, annotated files in your library are seamlessly synced across all of your devices.

Since I have a good reader app on my iPad, which also lets me use my iOS device as regular USB storage, I did not buy the annotation package ($4.99) of the Papership app. I also observed a glitch during the free trial mode. Moreover, Papership on Mac is $9.99 for now.

There are many blogs and websites telling how their authors' lives changed after they met Zotero or some other citation manager. That was the case for me as well; I just do not want to repeat here what you can find elsewhere. There are also many comparisons and how-tos on citation managers, mostly made available by university libraries, that might be useful for you:

http://guides.lib.washington.edu/content.php?pid=69943&sid=518591

http://en.wikipedia.org/wiki/Comparison_of_reference_management_software

http://www.phdontrack.net/review-and-discover/reference-managers/

http://libguides.mit.edu/references

http://hlwiki.slais.ubc.ca/index.php/Zotero_vs._Mendeley

https://itunes.apple.com/us/app/papership-for-mendeley-zotero/id631980748?mt=8

Complex Adaptive Systems: an Introduction to Computational Models of Social Life

Complex Adaptive Systems by Miller & Page

Miller, J. H., & Page, S. E. (2007). Complex Adaptive Systems: An Introduction to Computational Models of Social Life. Princeton University Press.
Here is a review by Paul Ormerod. And here is mine:
The book discusses the advancement of CAS research in the decade before its publication.
As Paul states, the book has four sections. In this post I will cover the first two; the rest will be the content of another post.
  1. themes in complexity of social worlds
  2. models as maps
  3. general concepts of ABMs
  4. detailed discussion of some actual models

Introduction

Let’s dwell on the title first.
Adaptivity of a social system is the thoughtfulness of its components, aka agents.
Complexity, in its simplest form, is value in the system that does not belong to any of its components but rather emerges through their interaction.
John H. Holland defines complex adaptive systems (CAS from now on) as systems that have a large number of components, often called agents, that interact and adapt or learn.
One of the earliest discussions of complexity was Adam Smith's invisible hand in the Wealth of Nations (1776): collections of self-interested agents lead to well-formed structures that are not part of any single agent's intention. How can we understand or prove this invisible hand? While our ability to theorize about social systems has always been vast, the set of tools available for pursuing these theories has often constrained our theoretical dreams, either implicitly or explicitly. The tools and ideas emerging from complex systems research complement existing approaches, and they should allow us to build much better theories about the world when carefully integrated with existing techniques.

Complexity in Social Worlds

Complexity emerges from highly interdependent components, where taking a component away leads to the collapse of the system, though the system is quite robust to less radical changes in its parts. Complicated worlds, on the other hand, are reducible: we can examine their elements separately to gain insight, which is not the case for complex systems.

Model #1: The Standing Ovation Problem. N spectators each receive a signal of the performance quality, si(q), and decide whether to stand up. A typical mathematical approach:

  1. si(q) = q + εi; if si(q) > T1 then stand up (εi provides heterogeneity)
  2. if the fraction standing α > T2, then everyone stands up.
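A minimal sketch of this two-threshold model, assuming uniformly distributed noise and made-up parameter values:

```python
import random

def standing_ovation(n=100, q=0.6, t1=0.5, t2=0.3, noise=0.2, seed=42):
    """Two-step threshold model: noisy individual signals, then peer pressure.

    Parameter values are illustrative, not from the book.
    """
    rng = random.Random(seed)
    # Step 1: spectator i stands if si(q) = q + eps_i exceeds T1.
    standing = [q + rng.uniform(-noise, noise) > t1 for _ in range(n)]
    # Step 2: if the fraction standing alpha exceeds T2, everyone stands.
    alpha = sum(standing) / n
    if alpha > t2:
        standing = [True] * n
    return sum(standing) / n  # final fraction standing
```

For a good performance (q well above T1) the cascade completes and everyone stands; for a poor one nobody does, with no intermediate states, which is exactly the bluntness the agent-based extensions below address.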

Approach #2. Indeed, all of the economics graduate students modeled the standing ovation without considering attending the theater with acquaintances, and used traditional tools like mathematics and statistics. However, more elaborate models can be constructed with newer computational techniques like agent-based modeling by:

  1. including much more heterogeneity, such as location and friendships.
  2. allowing more than two steps to reach the equilibrium.
  3. asking new questions: how to attract more groups? theater design? where to place shills? etc.

Result: for complex systems, canceling out differences by averaging leads to serious issues. Two other examples are given:

  • how genetic variety in bees lets them balance the temperature of their hive smoothly (negative feedback loop) [crowded highways]
  • how diversity in the thresholds of responsiveness to pheromones affects the defense pattern of a hive (positive feedback loop) [discounted products]
With the discussion of a more complex example model (the Tiebout model), the authors point to new directions by underlining that the difficulty of answering any particular scientific question is often tied to the tools we have at hand.

Modeling

  • Road maps vs. all possible details. Snow's 1855 map of cholera revealed the mode of transmission and the source. The map-makers' intention is to usefully abstract the real world, not merely to reproduce it.
  • Homomorphisms. Formal discussion of modeling. Modeling modeling.

On Emergence. We may see emergence at many levels: emergence from a mosaic (tile => image => tile => image …); nucleons, atoms, compounds, amino acids, proteins, organelles, cells, tissues, organs, organisms, societies… Prior ignorance makes a phenomenon mystical, such as planetary motion (prior to Kepler), which turned out to be rather simple, just an ellipse. Similarly, organized complexity can be understood with computational modeling.

Computational Modeling

Theoretical tools:

  • detailed verbal descriptions such as Smith’s (1776) invisible hand
  • mathematical analysis like Arrow’s (1951) possibility theorem
  • thought experiments including Hotelling’s (1929) railroad line
  • mathematical models derived from a set of first principles (currently the predominant tool in economics).
Whether the proposition that countries on a map can always be distinguished using only four colors (the so-called Four-Color Map problem) is proved by the exhaustive enumeration of all possibilities through a computer program (which has been done) or through an elegant (or even non-elegant) axiomatic proof (which has not been done) matters little if all you care about is the basic proposition. A tool's purpose is to simplify a task; employing different tools yields better theories. For example, a full understanding of supply and demand may require

  • thought experiments using Walrasian auctioneers,
  • axiomatic derivations of optimal bidding behavior,
  • computational models of adaptive agents, and
  • experiments with human subjects.

Computation and Theory 

All newly introduced tools attract questions and concerns, and computational modeling is no exception.

  • Can these tools generate new and useful insights?
  • How robust are they?
  • What biases do they introduce into our theories?
Theory makes the world understandable by finding the right set of simplifications. Modeling proceeds by deciding what simplifications to impose on the underlying entities and then, based on those abstractions, uncovering their implications. Computation in theory vs. computation as theory:
The use of a computer is neither a necessary nor a sufficient condition for considering a model computational (e.g. Schelling's (1978) coin-based method).

  • abstractions maintain a close association with the real-world agents of interest
  • uncovering the implications of these abstractions requires a sequential set of computations involving these abstractions

Neoclassical economics (an example of computation in theory):

  • individuals optimize their behavior
  • given mathematical constraints, most of the underlying agents in the real system are subsumed into a single object (a representative agent)
  • driving forces are incorporated (e.g. the system seeks an equilibrium)
  • Note: in these types of models, computation is used for numerical solution methods

Agent-based objects (computation as theory):

  • abstractions are not constrained by the limits of mathematics
  • a collection of agent-based objects is solved through their interactions using computation

Modeling vs Simulation:

  • simple entities and interactions vs. complicated ones
  • implications robust to a large class of changes vs. less robust
  • surprising results that motivate new predictions vs. less surprising ones
  • easily communicated to others vs. possibly not so easy

Objections and Responses

  • Q: answers are built into the model, so we cannot learn anything new!
    • all tools build in answers; clarity is key here. Hidden or black-box features are bad.
    • a model is bounded by its initial framework, but it can still allow for new theoretical insights
  • Q: computations lack discipline!
    • the lack of constraints is in fact a great advantage; mathematical models become unsolvable when practitioners break away from a limited set of assumptions.
    • a discipline similar to the one required for lab experiments is being formed: Is the experiment elegant? Are there confounds? Can it be easily reproduced? Is it robust to differences in experimental techniques? Do the reported results hold up to additional scrutiny?
    • flexibility: mathematical models are solved by a set of solution techniques and verification mechanisms. Given the newness of many computational approaches, it will take some time to develop agreed-upon standards for verification and validation.
  • Computational Models Are Only Approximations to Specific Circumstances
    • Giving an exact answer might not be that important; relying on approximations may be perfectly acceptable in some cases.
    • Generalizability is tied to the way the model is created, not to the medium; bad mathematical models may not extend beyond their initial structure either.
  • Computational Models Are Brittle
    • crashes are not unique to computational models
    • can be prevented by better designs
  • Computational Models Are Hard to Understand
    • due to lack of commonly accepted means for communication. UML, ODD.

Turkish Newspapers’ Circulation vs Twitter Follower Counts

Colored bars indicate the ratio of circulation to Twitter follower counts.

rates


red bars = circulation / follower_counts


blue bars = follower_counts / circulation


Initial Observations:

  • The two newspapers with the lowest Twitter popularity relative to their circulation, Posta & Sözcü, are tabloid populist newspapers…
  • Liberal, Islamic, conservative and center-right newspapers have higher circulation than Twitter followings
  • Newspapers with social-liberal, left-wing and socialist ideologies are much more popular on Twitter relative to their circulation


newspaper circulation followers twitter popularity
SÖZCÜ 364,726 15,175 -95.84%
POSTA 415,225 39,308 -90.53%
TARAF 76,665 13,198 -82.78%
YENİ AKİT 64,309 23,434 -63.56%
AKŞAM 105,048 42,153 -59.87%
ZAMAN 1,174,257 540,419 -53.98%
TÜRKİYE 184,053 86,338 -53.09%
BUGÜN 167,644 93,210 -44.40%
VATAN 106,247 66,565 -37.35%
YENİ ŞAFAK 128,092 97,216 -24.10%
SABAH 330,434 260,383 -21.20%
STAR 136,469 110,480 -19.04%
MİLLİ GAZETE 23,953 27,412 14.44%
YURT 51,507 69,361 34.66%
HÜRRİYET 402,770 1,190,000 195.45%
MİLLİYET 166,858 644,844 286.46%
EVRENSEL 6,632 67,035 910.78%
BİRGÜN 10,768 144,735 1244.12%
RADİKAL 22,847 389,816 1606.20%

Circulation figures for the week of 01.13.2014 – 01.19.2014 are obtained from here.
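The "twitter popularity" column in the table above appears to be the follower count relative to circulation, i.e. followers/circulation − 1 expressed as a percentage; a quick sketch under that assumption:

```python
def twitter_popularity(circulation, followers):
    """Percentage excess (or shortfall) of Twitter followers over print circulation.

    Assumes the table's metric is followers/circulation - 1, which matches
    both the negative (red) and positive (blue) bars.
    """
    return (followers / circulation - 1) * 100

# Reproduce two rows of the table.
sozcu = round(twitter_popularity(364726, 15175), 2)      # -> -95.84
hurriyet = round(twitter_popularity(402770, 1190000), 2)  # -> 195.45
```

A negative value means far fewer followers than print readers (red bars); a positive value means the Twitter audience exceeds circulation (blue bars).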

Detecting Political Leanings & Propagandists on Twitter

Detecting political tweets based on hashtags (a single iteration is proposed by Conover et al.; can it be improved with multiple iterations?):

  1. start by labeling one popular/predictive hashtag from each camp.
  2. label new hashtags if they co-occur with already-labeled hashtags above a threshold rate (not necessarily in the same camp)
  3. manually remove the false positives.
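A toy sketch of this co-occurrence labeling. The hashtags, seed labels and threshold value are invented; the real pipeline runs over large tweet collections:

```python
from collections import Counter

# Tweets as sets of hashtags; one seed hashtag is labeled per camp.
tweets = [
    {"campA", "tagX"}, {"campA", "tagX"}, {"campA", "tagX"},
    {"campB", "tagY"}, {"campB", "tagY"}, {"tagX", "tagY"},
]
labels = {"campA": "A", "campB": "B"}  # step 1: seed labels
threshold = 0.5  # minimum co-occurrence rate to adopt a label

# Step 2: label a hashtag with a camp if it co-occurs with an
# already-labeled hashtag in at least `threshold` of its tweets.
counts = Counter(tag for tweet in tweets for tag in tweet)
for tag in counts:
    if tag in labels:
        continue
    for seed, camp in list(labels.items()):
        cooc = sum(1 for tweet in tweets if tag in tweet and seed in tweet)
        if cooc / counts[tag] >= threshold:
            labels[tag] = camp
```

Step 3 (manually pruning false positives) would then be applied to `labels`; running the loop again with the enlarged label set is the multiple-iteration variant mused about above.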

Constructing communication networks:

  1. vertices are tweeters of the political hashtags detected above.
  2. mention edge weights: number of mentions between the two users.
  3. retweet edge weights: number of retweets between the two users.

Clustering communication networks:

  1. starting with the retweet network constructed above, apply Newman's modularity-based clustering algorithm.
  2. cluster by the label propagation method (Raghavan, 2007): iteratively assign each node the label shared by most of its neighbors. (I don't understand why this step is needed.)
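The label propagation step can be sketched on a toy retweet network as follows (node names and edge weights are invented; this is the Raghavan-style weighted-majority rule, not Newman's modularity method):

```python
from collections import Counter

# Toy undirected retweet network: adjacency dict with retweet counts as weights.
graph = {
    "u1": {"u2": 3, "u3": 2},
    "u2": {"u1": 3, "u3": 1},
    "u3": {"u1": 2, "u2": 1},
    "v1": {"v2": 4},
    "v2": {"v1": 4},
}

# Each node starts with its own label, then repeatedly adopts the label
# carried by the weighted majority of its neighbors until nothing changes.
labels = {node: node for node in graph}
for _ in range(10):  # a few sweeps suffice on a tiny graph
    changed = False
    for node, nbrs in graph.items():
        weight = Counter()
        for nbr, w in nbrs.items():
            weight[labels[nbr]] += w
        best = weight.most_common(1)[0][0]
        if labels[node] != best:
            labels[node] = best
            changed = True
    if not changed:
        break
```

Densely retweeting groups converge to a shared label, which is how the method recovers the two ideological camps.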

Mentions form a communication bridge across which information flows between ideologically opposed users, whereas people with similar ideologies tend to retweet each other's messages exclusively, especially propagandists:

  1. First, label one known popular user from each camp.
  2. At each iteration, relabel each user by argmax(assoc1, …, assocn), where associ is the fraction of the user's retweets of or by members of campi. Stop after some number of iterations.
  3. If at least a fraction f of a user's connections are to users in the same cluster, then the user is a hyperadvocate; otherwise, the user is neutral.
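The rule in step 3 reduces to a one-line ratio test; the function name, inputs and the value of f below are hypothetical, chosen only to illustrate the check:

```python
def classify(neighbor_clusters, user_cluster, f=0.8):
    """Flag a user as a hyperadvocate if at least a fraction `f` of their
    retweet connections stay inside their own cluster.

    neighbor_clusters: cluster label of each connection the user has.
    """
    same = sum(1 for c in neighbor_clusters if c == user_cluster)
    return "hyperadvocate" if same / len(neighbor_clusters) >= f else "neutral"
```

For example, a user in cluster "A" with 9 of 10 connections inside "A" would be flagged, while an even 50/50 split would be labeled neutral.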

Groups & Turkish Media

This study is built upon the data we collected for Turkish Media Clustering. The extension is twofold: first, we look at the media preferences of five major groups in Turkey; second, we visualize the group descriptiveness of the media. We selected two Twitter accounts for each group, took the union of their follower IDs, and named the groups as follows:

  • Ak Party: ‘AKKULIS’, ‘AkTanitimMedya’
  • CHP: ‘CHP_online’, ‘herkesicinCHP’
  • MHP: ‘MHP_Bilgi’, ‘Ulku_Ocaklari’
  • BDP: ‘BDPgenelmerkez’, ‘HDP_Kongre’
  • Hizmet (Gulen Movement): ‘FGulencomTR’, ‘Herkul_Nagme’

Resources for Turkish NLP

Interestingly, both of the resource aggregators and I are from the same high school, Izmir Fen Lisesi :-)

I will maintain my own list based on theirs. I forked (copy-pasted) and combined their lists below.

My Blog :) :

Amac’s Blog:

Deniz’s Blog:
TS Corpus

Taner Sezer’s TS Corpus is a 491M token general purpose Turkish corpus. See comments below for details.

BounWebCorpus

Hasim Sak’s page contains some useful Turkish language resources and code in addition to a large web corpus.

Bibliography

Özgür Yılmazel’s Bibliography on Turkish Information Retrieval and Natural Language Processing.

tr-disamb.tgz

Turkish morphological disambiguator code. Slow but 96% accurate. See Learning morphological disambiguation rules for Turkish for the theory.

correctparses_03.txt.gz, train.merge.gz

Turkish morphology training files. Semi-automatically tagged, with limited accuracy. The two files have the same data, except that the second also includes the ambiguous parses (the first parse on each line is correct).

test.1.2.dis.gz, test.merge.gz

Turkish morphology test files; the second includes ambiguous parses (the first parse on each line is correct). The data is hand-tagged and has good accuracy.

tr-tagger.tgz

Turkish morphological tagger, includes Oflazer's finite state machines for Turkish. From Kemal Oflazer; please use with permission. Requires the publicly available Xerox Finite State software.

turklex.tgz, pc_kimmo.tgz

Turkish morphology rules for PC-Kimmo by Kemal Oflazer. Older implementation. Originally from www.cs.cmu.edu

Milliyet1.bz2, Milliyet2.bz2, Milliyet3.bz2

Original Milliyet corpus, one token per line, 19,627,500 total tokens. Latin-5 encoded, in three 11MB parts. From Kemal Oflazer. Please use with permission.

Turkish wordnet

From Kemal Oflazer. Please use with permission.

METU-Sabanci Turkish Treebank

Turkish treebank with dependency annotations. Please use with permission.

sozluk.txt.gz

English-Turkish dictionary (127157 entries, 826K) Originally from www.fen.bilkent.edu.tr/~aykutlu.

sozluk-boun.txt.gz

Turkish word list (25822 words, 73K). Originally from www.cmpe.boun.edu.tr/courses/cmpe230

Avrupa Birliği Temel Terimler Sözlüğü

(Originally from: www.abgs.gov.tr/ab_dosyalar, Oct 6, 2006)

BilisimSozlugu.zip

Bilişim Sözlüğü by Bülent Sankur (Originally from: www.bilisimsozlugu.com, Oct 9, 2006)

turkish.el

Emacs extension that automatically adds accents to Turkish words while typing on an English keyboard.

en-tr.zip, lm.tr.gz

Turkish English parallel text from Kemal Oflazer, Statistical Machine Translation into a Morphologically Complex Language, Invited Paper, In Proceedings of CICLING 2008 – Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel, February 2008 (lowercased and converted to utf8). The Turkish part of the dataset is “selectively split”, i.e. some suffixes are separated from their stems, some are not. lm.tr.gz is the Turkish text used to develop the language model.

Turkish Media in 2D (MDS & PCA Plots)

Multidimensional scaling (MDS) allows us to visualize (dis)similarities in 2D by trying to preserve the distances between objects as much as possible. The positioning of the newspapers in the image below was generated using this technique (employing the manifold.MDS class in scikit-learn). The colors of the labels are the cluster colors obtained by the modularity measure, as stated in a previous post.

We also considered reducing the dimensionality with Principal Component Analysis (PCA). The resulting image is very similar to that of MDS and is also available here.
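A rough sketch of both embeddings with scikit-learn, using an invented 4x4 dissimilarity matrix in place of the real follower-similarity data (note: applying PCA directly to a distance matrix, as done below for brevity, is only a crude stand-in for running PCA on the underlying feature matrix):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.decomposition import PCA

# Hypothetical symmetric dissimilarity matrix between four media outlets;
# the real one is derived from Twitter follower similarity.
D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.3],
    [0.8, 0.9, 0.3, 0.0],
])

# MDS on precomputed dissimilarities, as with manifold.MDS in the post.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)  # one 2D point per outlet, ready to plot

# PCA alternative for comparison.
pca_coords = PCA(n_components=2).fit_transform(D)
```

Plotting `coords` with the cluster colors reproduces the kind of layout shown in the image, with similar outlets landing near each other.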

This visualization helps us see beyond clusters. For example, along the X-axis, the newspapers to the right are right-wing while the ones to the left are more leftist. And as we move from the center to the periphery, the more popular media stay close to the center while those toward the periphery are more isolated/extreme.

Multidimensional Scaling

Multidimensional Scaling applied to Turkish Media Follower Similarity on Twitter