Tag: #rstats

Living in Edinburgh, it's been hard to avoid the build-up to Scotland's referendum on independence. On September 18th 2014, less than a month away as I write this, people living in Scotland will go to the polls to answer the question: "Should Scotland be an independent country?"

[full post]


Apparently an 80s commercial for the helmet manufacturer Bell bore the slogan: "If you've got a $10 head, wear a $10 helmet". Nowadays it's a deeply ingrained and widely accepted idea among bikers that it's worth spending a lot of money on your headgear. A top-of-the-line Arai can sell for almost four figures, particularly if you want a nice race rep design, but what are you getting for your money and, in particular, is it any safer than a helmet you pick up for a tenth of that price?

[full post]


Arnie 2010 (source)

I recently read Arnie's autobiography (great fun). In it he writes about the various roles he's had, discussing the movies that flopped and those that were surprise box office successes, but it's hard to build up an overall picture of his career from these fragments. Similarly, the raw filmography lists on IMDb and Wikipedia are pretty uninspiring.

That gave me the idea of charting his movie career over time, attempting to show a lot of information at once: how well each film did at the box office relative to its budget, and at what points these successes and failures happened over the last few decades. After some Python-powered scraping of IMDb data, this is what I came up with:

[full post]


The most popular accounts on Twitter have millions of followers, but what are their demographics like? Twitter doesn't collect or release this kind of information, and even things like name and location are only voluntarily added to people's profiles. Unlike Google+ and Facebook, Twitter has no real-name policy: they don't care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your Twitter profile. Should you choose not to, Twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.

[full post]


"Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly lauded by the Hollywood press but proves disappointing for the real audience would be "overrated", and vice versa.

To get some data for this I turned to the most prominent review aggregator: Rotten Tomatoes. All this analysis was done in the R programming language, and full code to reproduce it will be attached at the end.

[full post]


There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I've also come across a published letter and a short Nature article which both accept that this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject, but again without much quantification. Luckily, given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.

[full post]


The Guardian newspaper has been running a data blog for a few years and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all the datasets they've put together or reused in their articles.

It's a great repository of interesting data for exploratory analysis, and there's a low barrier to entry in terms of getting the data into a useful form. Here's an example using UK election polling data collected over the last thirty years.

[full post]