Tag: #R

With under two weeks to go to the 2015 UK general election, there's no better time to take stock of all the voter intention polls being published by the British media. The data has already been aggregated by UKpollingreport (a site I've used before for analysis of the Scottish independence referendum), and there's not much more to be said than is already widely known: it's likely that no single party will win an outright majority, and we'll be left with a coalition or minority government.

[full post]


Arnie 2010 (source)

I recently read Arnie's autobiography (great fun). In it he writes about the various roles he's had, discussing the movies that flopped and those that were surprise box office successes, but it's hard to build up an overall picture of his career from these fragments. Similarly, the raw filmography lists at IMDb and Wikipedia are pretty uninspiring.

That gave me the idea of charting his movie career over time, attempting to show a lot of information at once: how well each film did at the box office relative to its budget, and at what points these successes and failures happened over the last few decades. After some Python-powered scraping of IMDb data, this is what I came up with:

[full post]


The most popular accounts on Twitter have millions of followers, but what are their demographics like? Twitter doesn't collect or release this kind of information, and even things like name and location are only voluntarily added to people's profiles. Unlike Google+ and Facebook, Twitter has no real-name policy. They don't care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your Twitter profile. Should you choose not to, Twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of the friends and services you mention most.

[full post]


"Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be "overrated" and vice versa.

To get some data for this I turned to the most prominent review aggregator: Rotten Tomatoes. All this analysis was done in the R programming language, and full code to reproduce it will be attached at the end.
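As a rough illustration of the comparison described above, here is a minimal sketch in Python rather than the R used for the actual analysis. The film titles and scores are invented placeholders, and `rating_gap` is a hypothetical helper name, not something taken from the post or from Rotten Tomatoes.

```python
# Hypothetical scores on a 0-100 scale; titles and numbers are invented
# placeholders, not real Rotten Tomatoes data.
films = [
    {"title": "Film A", "critics": 92, "audience": 61},
    {"title": "Film B", "critics": 35, "audience": 78},
    {"title": "Film C", "critics": 70, "audience": 72},
]


def rating_gap(film):
    """Critic score minus audience score.

    A large positive gap suggests "overrated" (critics liked it more than
    the public); a large negative gap suggests "underrated".
    """
    return film["critics"] - film["audience"]


# Sort from most overrated to most underrated.
ranked = sorted(films, key=rating_gap, reverse=True)
most_overrated = ranked[0]["title"]
most_underrated = ranked[-1]["title"]

for film in ranked:
    print(film["title"], rating_gap(film))
```

A simple difference is only one possible measure; it assumes both scores share the same scale and says nothing about how many reviews sit behind each number.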

[full post]


There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I've also come across a published letter and a short Nature article which accept that this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject, but again without much quantification. Luckily, given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.

[full post]


The Guardian newspaper has for a few years been running a data blog and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all data sets they've put together or reused in their articles.

It's a great repository of interesting data for exploratory analysis, and there's a low barrier to entry in terms of getting the data into a useful form. Here's an example using UK election polling data collected over the last thirty years.

[full post]


In the R programming language, the random number generator (RNG) is seeded each session using the current time and process ID. Via the magic of the popular Mersenne Twister PRNG, the values stored in .Random.seed are used sequentially each time "randomness" is invoked in a function. This means, of course, that the same function run in different R sessions can produce varying results, and in the case of modelling a system sensitive to initial conditions the observed differences could be huge.
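The effect of fixing the seed can be sketched in a few lines. This example is in Python, whose standard `random` module also uses the Mersenne Twister; in R the analogous step would be a call to `set.seed()` before the random draws. The `simulate` function is a hypothetical toy random walk invented for illustration, not code from the post.

```python
import random


def simulate(n_steps, seed=None):
    """Toy random walk of n_steps unit steps.

    When `seed` is given the generator starts from a fixed state, so the
    walk is reproducible; when it is None, Python seeds the generator
    itself and results vary between runs.
    """
    rng = random.Random(seed)  # an independent Mersenne Twister instance
    position = 0
    for _ in range(n_steps):
        position += rng.choice((-1, 1))
    return position


# Same seed, same sequence of draws, same final position.
first = simulate(1000, seed=42)
second = simulate(1000, seed=42)
print(first == second)  # True

# Without a seed, repeated calls will usually end up in different places.
print(simulate(1000), simulate(1000))
```

Using a dedicated generator object rather than the module-level functions keeps the seeded state local to the simulation, so other code drawing random numbers cannot disturb it.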

[full post]