Pop data analysis and R for the web

Ben Moore (@benjaminlmoore)
Edinburgh PsychStats R-users, November 13th 2014

Slides online at:

I'm going to talk about...

A couple of examples of R analyses I've done for fun:

  1. Author inflation

  2. Overrated films

But also:

  • Creating interactive plots from R with rCharts

  • HTML5/CSS3/JS presentations from RMarkdown with Slidify

Example 1: Author inflation

Starting point

  • Interesting question
  • No real answers, speculation
  • Easy to test!!

Getting some data



  library(rplos)

  out <- searchplos(
    # Query: publication date in 2012
    q  = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]',

    # Fields to return: id (doi) and author list
    fl = "id,author",

    # Filter: only actual articles in journal PLOS ONE
    fq = list("doc_type:full", "cross_published_journal_key:PLoSONE"),

    # 500 results (max 1000 per query)
    start = 0, limit = 500, sleep = 6)
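
Counting authors per paper is then a string-splitting job. A minimal sketch, assuming the `author` field comes back as one string per paper with names separated by semicolons (an assumption about the response format); the ids and names below are made-up placeholders, not real results:

```r
# Hypothetical stand-in for the search results: one row per paper.
# Assumption: authors are packed into a single ";"-separated string.
papers <- data.frame(
  id     = c("doi-a", "doi-b", "doi-c"),
  author = c("A Smith; B Jones", "C Lee", "D Kim; E Park; F Cho; G Yoon"),
  stringsAsFactors = FALSE
)

# Split each author string on semicolons and count the pieces
papers$n_authors <- lengths(strsplit(papers$author, ";", fixed = TRUE))

print(papers[, c("id", "n_authors")])
```

Repeating this for each publication year gives one distribution of author counts per year, which is what the plots compare.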

Author count distributions

author beanplots

270 authors...

270 authors

Evidence for author inflation


High impact == high inflation ?

corr with IF


  • expand to entire NLM Medline / Pubmed records (>22 mill)

  • Try to get at "good inflation vs. bad inflation"

    • Relative growth of acknowledgements? (PMC)
    • Inflation decrease when "author contributions" brought in?

Example 2: Overrated movies

Starting point

  • Everyone relates to concept of "over/underrated" — but it's inherently subjective

  • Maybe a way to quantify this (with, e.g. films) could be:

    • Critic ratings: subjective ratings

    • Audience ratings: "objective truth" (crowd-sourced, many wrongs principle)

      • (Wrong way round? Up to you... )

  • So given this definition of "overrated":

    Q: What are the most (over|under)rated films?


Rotten Tomatoes has a REST API!


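library(RCurl)     # getURI()
library(jsonlite)  # fromJSON()

api.key <- "somelongAPIkey"

# Request the top rentals list (50 films) as JSON
rt <- getURI(paste0("",
                    "lists/dvds/top_rentals.json?apikey=", api.key, "&limit=50"))

# Parse the JSON response
rt <- fromJSON(rt)

# Pull out the titles and both sets of scores
title <- rt$movies$title
critics <- rt$movies$ratings$critics_score
audience <- rt$movies$ratings$audience_score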
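
With the three vectors in hand, the over/underrated score is just the gap between the two ratings. A minimal sketch, using a few scores from the results tables in these slides in place of a live API response; the column name `difference` and the sign convention (critics minus audience, so positive means overrated) follow those tables:

```r
# A handful of films with scores taken from the results tables
films <- data.frame(
  title    = c("Spy Kids", "About a Boy", "Facing the Giants", "Step Up"),
  critics  = c(93, 93, 13, 19),
  audience = c(45, 54, 86, 83),
  stringsAsFactors = FALSE
)

# Positive difference: critics liked it more than audiences (overrated)
# Negative difference: audiences liked it more than critics (underrated)
films$difference <- films$critics - films$audience

# Sort from most overrated to most underrated
films <- films[order(films$difference, decreasing = TRUE), ]
print(films, row.names = FALSE)
```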

This is easy, why hasn't someone done it before...



Hacky solution

  1. Get largest starting list of films possible (Top rentals: 50)

  2. For each, retrieve "similar films" (max: 5!)

  3. Unique-ify and recurse, growing film list exponentially...
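
The three steps above amount to a breadth-first crawl. A minimal sketch, with a small hard-coded lookup standing in for the "similar films" API call; `toy_similar`, `similar_films()`, `crawl_films()` and the single-letter film names are all hypothetical, and a real version would make one HTTP request per film and respect the API's rate limits:

```r
# Hypothetical stand-in for the "similar films" endpoint (max 5 per film)
toy_similar <- list(
  A = c("B", "C"),
  B = c("A", "D"),
  C = c("D", "E"),
  D = character(0),
  E = c("A", "F"),
  F = character(0)
)
similar_films <- function(film) head(toy_similar[[film]], 5)

crawl_films <- function(seed, max_rounds = 10) {
  seen     <- unique(seed)   # every film collected so far
  frontier <- seen           # films whose neighbours are not fetched yet
  for (i in seq_len(max_rounds)) {
    if (length(frontier) == 0) break
    # Step 2: retrieve similar films for each film in the frontier
    found <- unlist(lapply(frontier, similar_films), use.names = FALSE)
    # Step 3: unique-ify, keep only unseen films, and go round again
    frontier <- setdiff(unique(found), seen)
    seen     <- c(seen, frontier)
  }
  seen
}

films <- crawl_films("A")
print(films)
```

The crawl stops when a round turns up no new films or when `max_rounds` is reached, whichever comes first.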

"Walled gardens"



Results v1


Most underrated

Rank  Title                        Critics  Audience  Difference
1     Facing the Giants                 13        86         -73
2     The Boondock Saints               20        92         -72
3     Diary of a Mad Black Woman        16        87         -71
4     Grandma's Boy                     18        86         -68
=5    Step Up                           19        83         -64
=5    Now and Then                      19        83         -64
7     The Life of David Gale            19        82         -63
=8    Because I Said So                  5        66         -61
=8    Sweet November                    16        77         -61
=10   Empire Records                    24        84         -60
=12   A Night at the Roxbury            11        70         -59
=12   The Covenant                       3        62         -59

Most overrated

Rank  Title                                   Critics  Audience  Difference
1     Spy Kids                                     93        45          48
2     3 Backyards                                  76        31          45
3     Dinner with Friends                          88        45          43
=4    Stuart Little 2                              81        40          41
=4    Momma's Man                                  91        50          41
=4    Cleopatra Jones                              89        48          41
7     About a Boy                                  93        54          39
=8    Essential Killing                            85        47          38
=8    The Last Exorcism                            72        34          38
=11   Spy Kids 2: The Island of Lost Dreams        74        38          36

Did surprisingly well

author beanplots


R background

  • "Hadley"-verse

  • Robust, powerful libraries with strong theoretical underpinnings:

    • ggplot2 :: Grammar of graphics (Leland Wilkinson)
    • dplyr :: Grammar of data manipulation

  • "Ramnath"-verse

  • Neat hacks that get R talking to various javascript libraries:

    • rCharts :: js plots from lattice-like syntax
    • slidify :: HTML/JS/CSS presentations from RMarkdown

Interactive charts

How we will be doing it:


ggvis (RStudio)



But currently:



[ dimple.js, highcharts, NVD3, ... ]



D3.js handles data mapping (often JSON) and acts like jQuery for SVGs.

Very powerful but low-level. Basic graphs use the same few elements, so there is no need to reinvent the wheel for these.

Loads of js plotting libraries

dimple, NVD3, polycharts, highcharts, ...


  • rCharts: a uniform (lattice-style) plotting interface to each of these (and more!) straight from R

Example: static

# load packages and data
library(ggplot2)
d <- read.csv2("Twitter50.txt", sep="\t")


# plot with ggplot on log10 axes
# (the scales do the log transform; adding coord_trans("log10")
#  on top of them would transform the data twice)
ggplot(d, aes(x=Citations, y=Followers)) + 
  geom_point() + theme_bw() + 
  scale_x_log10(limits=c(10, 1e6)) +
  scale_y_log10(limits=c(1e4, 1e7))

# save to file from device
ggsave("science_stars.png",  # hypothetical file name
       width=5, height=5)

(Data from @biomickwatson)

science stars

Example: interactive

# load packages and data
library(rCharts)
d <- read.csv2("Twitter50.txt", sep="\t")


# dplot (dimple.js)
i <- dPlot(Followers ~ Citations, 
           data=d, type="bubble",
           groups="Name", height=480, width=520)
# axis tweaks
i$yAxis(type = "addLogAxis", overrideMin=1e4)
i$xAxis(type = "addLogAxis", overrideMin=10)

# publish as gist
i$publish("Science stars", host="gist")  # hypothetical description


✓ Quick, easy intro to interactive plots for the web

✓ Range of libraries to choose from

✓ Still evolving, new libraries added

✗ Probably will need to refer to js lib docs for customisation

✗ Sooner or later will need to edit the js source



Slidify is rCharts for presentations: RMarkdown -> HTML5/CSS/js slide deck

Again lots of output frameworks to choose from: reveal.js, io2012, ...

Why use these over PowerPoint / LaTeX Beamer?

  • Reproducible R documents

  • Embed web apps, iframes, SVGs

  • CSS3 transitions and jQuery animations

  • Participants can follow along with just a browser (+ mobiles, tablets)


Syntax ::


## Title (h2)

* Bullet1

  * sub-bullet

![an image](figure/slidify.png)

`r round(rnorm(5), 2)`

Gives ::

Title (h2)

  • Bullet1

    • sub-bullet

an image

-2.44, 1.32, -0.31, -1.78, -0.17


  1. R is a powerful tool to answer everyday questions; chances are the data is out there... Might turn into an interesting blog post, article or paper!

  2. Simple interactive charts are easy to make (see rCharts) and can add value; for custom visualisations you might be tempted towards D3.js

  3. Web presentation frameworks are a decent alternative to PowerPoint / Beamer (and easy to write in Markdown, per slidify)


Thanks for listening

People who've helped me out or I've stolen code from:

@ramnath_vaidya (rCharts, slidify), @hadley_wickham (dplyr, ggplot2, devtools), @kwbroman, @timelyportfolio, StackOverflow, @mbostock (d3.js), @jkiernander (dimple.js)

These slides at; more examples: