R for the web through pop data analysis

blogR

Pop data analysis and R for the web

Ben Moore (@benjaminlmoore)
Edinburgh PsychStats R-users, November 13^th 2014

Slides online at: blm.io/talks/blogr

I'm going to talk about...

A couple of examples of R analyses I've done for fun:

1. Author inflation

2. Overrated films

But also:

Creating interactive plots from R with rCharts
HTML5/CSS3/JS presentations from RMarkdown with Slidify

Example 1: Author inflation

Starting point

Interesting question
No real answers, speculation
Easy to test!!

Getting some data

ropensci

Code

searchplos(
  # Query: publication date in 2012
  q  = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]', 

  # Fields to return: id (doi) and author list
  fl = "id,author", 

  # Filter: only actual articles in journal PLOS ONE
  fq = list("doc_type:full",
            "cross_published_journal_key:PLoSONE"), 

  # 500 results (max 1000 per query)
  start=0, limit=500, sleep=6)

Author count distributions

author beanplots

270 authors...

270 authors

Evidence for author inflation

High impact == high inflation ?

corr with IF

TODO

expand to entire NLM Medline / Pubmed records (>22 mill)
Try to get at "good inflation vs. bad inflation"
- Relative growth of acknowledgements? (PMC)
- Inflation decrease when "author contributions" brought in?

Example 2: Overrated movies

Starting point

Everyone relates to concept of "over/underrated" — but it's inherently subjective
Maybe a way to quantify this (with, e.g. films) could be:

Critic ratings — subjective ratings
Audience ratings — "objective truth" (crowd-sourced, many wrongs principle)
- (Wrong way round? Up to you... )
So given this definition of "overrated":
Q: What are the most (over|under)rated films?

They have a REST API!

library("RCurl")
library("jsonlite")

api.key <- "somelongAPIkey"
rt <- getURI(paste0("http://api.rottentomatoes.com/api/public/v1.0/",
                    "lists/dvds/top_rentals.json?apikey=", api.key, "&limit=50"))

rt <- fromJSON(rt)

title <- rt$movies$title
critics <- rt$movies$ratings$critics_score
audience <- rt$movies$ratings$audience_score

This is easy, why hasn't someone done it before...

forum

!!!

Hacky solution

Get largest starting list of films possible (Top rentals: 50)
For each, retrieve "similar films" (max: 5!)
Unique-ify and recurse, growing film list exponentially...

"Walled gardens"

hits

Results v1

results1

Most underrated

	Title	Critics	Audience	Difference
1	Facing the Giants	13	86	-73
2	The Boondock Saints	20	92	-72
3	Diary of a Mad Black Woman	16	87	-71
4	Grandma's Boy	18	86	-68
=5	Step Up	19	83	-64
=5	Now and Then	19	83	-64
7	The Life of David Gale	19	82	-63
=8	Because I Said So	5	66	-61
=8	Sweet November	16	77	-61
=10	Empire Records	24	84	-60
=10	Beaches	29	89	-60
=12	A Night at the Roxbury	11	70	-59
=12	The Covenant	3	62	-59

Most overrated

	Title	Critics	Audience	Difference
1	Spy Kids	93	45	48
2	3 Backyards	76	31	45
3	Dinner with Friends	88	45	43
=4	Stuart Little 2	81	40	41
=4	Momma's Man	91	50	41
=4	Cleopatra Jones	89	48	41
7	About a Boy	93	54	39
=8	Essential Killing	85	47	38
=8	The Last Exorcism	72	34	38
10	Splice	74	37	37
=11	Spy Kids 2: The Island of Lost Dreams	74	38	36
=11	Bruiser	67	31	36
13	EdTV	64	29	35

Did suprisingly well

author beanplots

rCharts

R background

"Hadley"-verse
Robust, powerful libraries with strong theoretical underpinnings:
- ggplot2 :: Grammar of graphics (Leland Wilkinson)
- dplyr :: Grammar of data manipulation

"Ramnath"-verse
Neat hacks that get R talking to various javascript libraries:
- rCharts :: js plots from lattice-like syntax
- slidify :: HTML/JS/CSS presentations from RMarkdown

Interactive charts

How we will be doing it:

R

▼

ggvis (Rstudio)

▼

Vega

▼

D3.js

But currently:

R

▼

rCharts

▼

[ dimple.js, highcharts, NVD3, ... ]

▼

D3.js

D3.js

Handles data mapping (often JSON) + acts like jQuery for SVGs.

Very powerful but low-level — basic graphs use the same few elements so no need to reinvent wheel for these.

Loads of js plotting libraries

dimple, NVD3, polycharts, highcharts, ...

rCharts

Uniform (lattice-style) plotting interface for each of these (and more!) straight from R

Example: static

# load data
d <- read.csv2("Twitter50.txt", sep="\t")

library("ggplot2")

# plot with ggplot
ggplot(d, aes(x=Citations, y=Followers)) + 
  geom_point() + theme_bw() + 
  coord_trans(x="log10", y="log10") +
  scale_x_log10(limits=c(10, 1e6)) +
  scale_y_log10(limits=c(1e4, 1e7))

# save to file from device
ggsave(filename="sciTwitter.svg", 
       width=5, height=5)

(Data from @biomickwatson)

science stars

Example: interactive

# load data
d <- read.csv2("Twitter50.txt", sep="\t")

library("rCharts")

# dplot (dimple.js)
i <- dPlot(Followers ~ Citations, 
           data=d, type="bubble",
      groups="Name", height=480, width=520)
      
# axis tweaks
i$yAxis(type = "addLogAxis", overrideMin=1e4)
i$xAxis(type = "addLogAxis", overrideMin=10)

# publish as gist
i$publish()

rCharts

✓ Quick, easy intro to intractive plots for the web

✓ Range of libraries to choose from

✓ Still evolving, new libraries added

✗ Probably will need to refer to js lib docs for customisation

✗ Sooner or later will need to edit the js source

rCharts

Slidify

rCharts for presentations: RMarkdown -> HTML5/CSS/js slide deck

Again lots of output frameworks to choose from: reveal.js, io2012, ...

Why use these over PowerPoint / LaTeX Beamer?

Reproducible R documents
Embed web apps, iframes, SVGs
CSS3 transitions and jQuery animations
Participants can follow along with just a browser (+ mobiles, tablets)

Slidify

Syntax ::

---

## Title (h2)

* Bullet1

  * sub-bullet

![an image](figure/slidify.png)

'r round(rnorm(5), 2)'

Gives ::

Title (h2)

Bullet1
- sub-bullet

an image

-2.44, 1.32, -0.31, -1.78, -0.17

Summary

R is a powerful tool to answer everyday questions; chances are the data is out there... Might turn into an interesting blog post, article or paper!
Simple interactive charts are easy to make (see rCharts) and can add value, might be tempted towards D3.js for custom visualisations
Web presentation frameworks are a decent alternative to PowerPoint / Beamer (and easy to write in Markdown, per slidify)

blmoore
benjaminlmoore
[email protected]

Thanks for listening

People who've helped me out or I've stolen code from:

@ramnath_vaidya (rCharts, slidify), @hadley_wickham (dplyr, ggplot2, devtools), @kwbroman, @timelyportfolio, StackOverflow, @mbostock (d3.js), @jkiernander (dimple.js)

These slides at blm.io/talks/blogr; more examples: blm.io/blog

I'm going to talk about...

Starting point

Getting some data

Code

Author count distributions

270 authors...

Evidence for author inflation

High impact == high inflation ?

Starting point

!!!

Hacky solution

"Walled gardens"

Results v1

Most underrated

Most overrated

Did suprisingly well

R background

"Hadley"-verse

"Ramnath"-verse

Interactive charts

D3.js

Loads of js plotting libraries

Example: static

Example: interactive

rCharts

Slidify

Slidify

Syntax ::

Gives ::

Title (h2)

Summary