07 Jun 2014
I recently read Arnie's autobiography (great fun) and in it he writes about the various roles he's had, discussing those movies that flopped or were surprise box office successes, but it's hard to build up an overall picture of his career from these fragments. Similarly, the raw filmography lists at IMDb and Wikipedia are pretty uninspiring.
That gave me the idea of charting his movie career over time, attempting to show a lot of information at once: how well each film did at the box office relative to its budget, and at what points these successes and failures happened over the last few decades. After some Python-powered scraping of IMDb data, this is what I came up with:
[full post]
25 May 2014
The most popular accounts on Twitter have millions of followers, but what are their demographics like? Twitter doesn't collect or release this kind of information, and even things like name and location are only voluntarily added to people's profiles. Unlike Google+ and Facebook, Twitter has no real-name policy. They don't care what you call yourself, because they can still infer useful information from your account activity.
For example, you can optionally set your location on your Twitter profile. Should you choose not to, Twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.
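The timezone inference described here can be sketched in a few lines of Python. This is only an illustration of the idea, not anything Twitter is known to run: the function name `estimate_utc_offset` is hypothetical, and it leans on a loud assumption that the account holder is quiet for about eight hours starting at 23:00 local time.

```python
from collections import Counter
from datetime import datetime, timezone


def estimate_utc_offset(timestamps, sleep_start_local=23, sleep_hours=8):
    """Guess a whole-hour UTC offset from timezone-aware activity timestamps.

    Assumption (not from the post): the quietest `sleep_hours`-long stretch
    of the day is sleep, and sleep begins at `sleep_start_local` local time.
    """
    # Count activity per hour of the day, measured in UTC.
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)

    def activity(start):
        # Total activity in the window beginning at `start`, wrapping past midnight.
        return sum(counts[(start + i) % 24] for i in range(sleep_hours))

    # The UTC hour at which the quietest window begins.
    quiet_start_utc = min(range(24), key=activity)

    # local = UTC + offset, so offset = local sleep start - UTC sleep start.
    offset = sleep_start_local - quiet_start_utc
    # Normalise into the range -12..+11.
    return (offset + 12) % 24 - 12
```

Real accounts are much noisier than this (shift workers, scheduled tweets, travel), so a guess like this would only ever be a starting point for the further refinement mentioned above.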
[full post]
05 May 2014
"Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be "overrated" and vice versa.
To get some data for this I turned to the most prominent review aggregator: Rotten Tomatoes. All this analysis was done in the R programming language, and full code to reproduce it will be attached at the end.
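The comparison itself boils down to a difference of two scores. As a rough sketch (in Python rather than the R used for the actual analysis), assuming each film has a critic score and an audience score on the same 0 to 100 scale, the gap can be computed and ranked like this. The function name `rating_gaps` and the placeholder films are made up for illustration and are not real Rotten Tomatoes data.

```python
def rating_gaps(films):
    """Rank films by critic score minus audience score.

    `films` is an iterable of (title, critic_score, audience_score) tuples.
    A large positive gap suggests "overrated" (critics liked it more than
    audiences); a large negative gap suggests "underrated".
    """
    gaps = [(title, critic - audience) for title, critic, audience in films]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Placeholder data, invented purely to show the shape of the input.
films = [("Film A", 90, 55), ("Film B", 40, 80), ("Film C", 70, 72)]
ranked = rating_gaps(films)
```

Here the first entry of `ranked` is the most "overrated" film and the last is the most "underrated".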
[full post]
06 Apr 2014
There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I've also come across a published letter and a short Nature article which accept this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject, but again without much quantification. Luckily, given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.
[full post]
18 Mar 2014
The Guardian newspaper has for a few years been running a data blog and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all datasets they've put together or reused in their articles.
It's a great repository of interesting data for exploratory analysis, and there's a low barrier to entry in terms of getting the data into a useful form. Here's an example using UK election polling data collected over the last thirty years.
[full post]
06 Mar 2014
In the R programming language, the random number generator (RNG) is seeded each session using the current time and process ID. Via the magic of the popular Mersenne Twister PRNG, the values stored in .Random.seed are used sequentially each time "randomness" is invoked in a function. This means, of course, that the same function run in different R sessions can produce varying results, and in the case of modelling a system sensitive to initial conditions the observed differences could be huge.
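The same behaviour is easy to demonstrate outside R. Python's standard `random` module also uses the Mersenne Twister, so the sketch below is a Python analogue of the point rather than the R code from the post; `noisy_mean` is a hypothetical stand-in for any function that consumes random numbers. Fixing the seed (the counterpart of calling `set.seed()` in R) makes repeated runs agree.

```python
import random


def noisy_mean(n, seed=None):
    """Mean of n uniform draws; stands in for any randomness-dependent function.

    With seed=None the generator is seeded from a system source, so results
    vary between runs. With a fixed seed the sequence is reproducible.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return sum(rng.random() for _ in range(n)) / n


# Same seed, same stream of pseudo-random numbers, same answer.
a = noisy_mean(1000, seed=42)
b = noisy_mean(1000, seed=42)

# No seed: each call draws from a differently seeded generator.
c = noisy_mean(1000)
```

Using a private `random.Random` instance rather than the module-level functions keeps the seeding local to the function, which avoids surprising other code that shares the global generator.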
[full post]
24 Feb 2014
As a LaTeX fan I'm used to using Beamer for presentations, but the built-in themes are definitely starting to show their age, and writing a custom .sty file looks like a nightmare, so for a while I've been looking at trying out an HTML5 framework.
[full post]