I. Example Analysis: Exploring the Movie Dataset

As always, we start by loading the relevant packages as well as our data:

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2); library(corrplot); library(tidyr)
data(movies)

Making Scatterplots

ggplot(movies, aes(budget, votes))+geom_point()

Finding Correlations

Where values are missing, you may need to add use="complete" as shown below:

cor(movies$votes, movies$rating)

## [1] 0.1037069

cor(movies$budget, movies$rating)

## [1] NA

cor(movies$budget, movies$rating, use="complete")

## [1] -0.01422905
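To see why the NA appears and what use="complete" changes, here is a minimal sketch on toy vectors (the values are made up for illustration and are not from the movies data). Note that "complete" is accepted as an abbreviation of the full option name "complete.obs".

```r
# Toy vectors (hypothetical values) with one missing budget
budget <- c(10, 20, NA, 40, 50)
rating <- c(6.1, 5.8, 7.0, 6.4, 6.9)

# By default a single missing value makes the whole result NA
r_default <- cor(budget, rating)

# use = "complete.obs" drops every pair that has a missing value first
r_complete <- cor(budget, rating, use = "complete.obs")

# Same result as removing the incomplete pairs by hand
keep     <- complete.cases(budget, rating)
r_manual <- cor(budget[keep], rating[keep])

print(c(default = r_default, complete = r_complete, manual = r_manual))
```

The "complete" and "manual" values should agree, since both are computed from the same four complete pairs.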

Finding Correlations by Group

Here are the correlations between budget and rating, grouped by MPAA rating:

movies %>% group_by(mpaa) %>% 
  summarize(cor.btwn.budget.rating = cor(budget, rating, use="pairwise"),
            n = length(mpaa))

## Source: local data frame [5 x 3]
## 
##    mpaa cor.btwn.budget.rating     n
## 1                  -0.01477743 53864
## 2 NC-17            -0.82738725    16
## 3    PG             0.05091155   528
## 4 PG-13             0.10474061  1003
## 5     R             0.14159642  3377
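One caution when reading the n column above: length(mpaa) counts every row in the group, including movies whose budget is missing, so the correlation may rest on far fewer movies than n suggests. The sketch below, on a made-up toy data frame, adds a second count of the rows that actually have both values; the column name n.complete is a hypothetical choice for illustration.

```r
library(dplyr)

# Toy data (hypothetical values, not from the movies dataset)
toy <- data.frame(
  mpaa   = c("PG", "PG", "PG", "PG", "R", "R", "R", "R"),
  budget = c(10, 20, NA, 40, 5, NA, NA, 30),
  rating = c(6.0, 7.0, 7.0, 6.5, 5.0, 6.0, 7.0, 8.0)
)

toy_summary <- toy %>%
  group_by(mpaa) %>%
  summarize(
    correlation = cor(budget, rating, use = "pairwise"),
    n           = n(),                                  # every row in the group
    n.complete  = sum(!is.na(budget) & !is.na(rating))  # rows with both values present
  )

print(toy_summary)
```

In the toy R group only two rows have a budget, and a correlation computed from two points is always +1 or -1, which is a reminder to treat correlations from very small groups with suspicion.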

We could limit our analysis to movies with more than 100 votes as follows:

movies %>% filter(votes>100) %>% group_by(mpaa) %>% 
  summarize(cor.btwn.budget.rating = cor(budget, rating, use="pairwise"),
            n = length(mpaa))

Making and Using New Variables

We can use the mutate function to make new variables and add them to our data.frame. The cut function is also useful. In the code below, we create four groups of years and calculate the square roots of budgets and votes.

movies <- movies %>% mutate(year.group = cut(year, c(1893,1958,1983,1997,2005)),
                            sqrt.budget = sqrt(budget),
                            sqrt.votes = sqrt(votes)
                            )

How does the correlation between the square root of the budget and the square root of the votes compare to the correlation between budget and votes?

cor(movies$sqrt.budget, movies$sqrt.votes, use="complete")

## [1] 0.6452505

cor(movies$budget, movies$votes, use="complete")

We might also be interested in how the strength of this relationship depends on the year of the movie. We can use our new year groups and find the correlation by group:

movies %>% group_by(year.group) %>% 
  summarize(correlation = cor(sqrt.budget, sqrt.votes, use="pairwise"),
            n = length(mpaa))

## Source: local data frame [5 x 3]
## 
##            year.group correlation     n
## 1 (1.89e+03,1.96e+03]   0.3161794 15082
## 2 (1.96e+03,1.98e+03]   0.3713151 14950
## 3    (1.98e+03,2e+03]   0.5843228 14334
## 4    (2e+03,2.00e+03]   0.7392093 14421
## 5                  NA          NA     1
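The group labels above are hard to read because cut uses only three significant digits in its labels by default. A minimal sketch of one way to get readable labels is shown below, using the dig.lab argument of cut on a few made-up years. It also uses include.lowest = TRUE so that a value equal to the lowest break is kept; the single NA row above is consistent with a movie from 1893 falling outside the first interval, but that is an assumption, since the output does not say which movie it is.

```r
# A few hypothetical years, including one equal to the lowest break
years  <- c(1893, 1920, 1958, 1975, 1990, 2005)
breaks <- c(1893, 1958, 1983, 1997, 2005)

# Intervals are open on the left by default, so 1893 itself gets no group
default_groups <- cut(years, breaks)
print(sum(is.na(default_groups)))

# dig.lab = 4 prints four-digit years in full;
# include.lowest = TRUE closes the first interval on the left
readable_groups <- cut(years, breaks, dig.lab = 4, include.lowest = TRUE)
print(levels(readable_groups))
print(table(readable_groups))
```

The same two arguments could be added to the cut call inside mutate if you prefer labels such as (1958,1983] in your own tables.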

Drawing Best Fit Lines

Recall that there is a close link between correlations and best-fit/prediction/least squares regression lines.
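As a reminder of that link, the slope of the least squares line equals the correlation times the ratio of the standard deviations, and the line passes through the point of means. The sketch below checks this on made-up toy data; the variable names are hypothetical.

```r
# Toy data (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Slope and intercept built from the correlation
r                <- cor(x, y)
slope_from_r     <- r * sd(y) / sd(x)
intercept_from_r <- mean(y) - slope_from_r * mean(x)

# Slope and intercept from least squares regression
fit <- lm(y ~ x)

print(c(intercept = intercept_from_r, slope = slope_from_r))
print(coef(fit))
```

Both print statements should show the same two numbers, up to rounding.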

We can add a best-fit line to a scatterplot by adding +geom_smooth(method="lm") to our ggplot code:

ggplot(movies, aes(x=budget, y=votes))+geom_point()+geom_smooth(method="lm")

If you’d like to remove the gray region (showing the uncertainty in the best-fit line, more on this later) you can do:

ggplot(movies, aes(x=budget, y=votes))+geom_point()+geom_smooth(method="lm", se=FALSE)

Finding and Interpreting Equations for Best Fit Lines

We can get the equation for the best-fit line for predicting the number of votes from the budget as follows:

lm(votes~budget, data=movies)

## 
## Call:
## lm(formula = votes ~ budget, data = movies)
## 
## Coefficients:
## (Intercept)       budget  
##   2.026e+03    2.198e-04

What does this mean? We can plug in possible budgets and predict the number of IMDB votes.

If a movie had a budget of $0, our linear prediction is that it would have 2026 votes. If it had a budget of $10,000,000, we would expect 2026 + 10000000 * 0.0002198 = 4224 votes.
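The arithmetic above can be checked in R. The first half of this sketch plugs budgets into the rounded coefficients printed above; predict_votes is a hypothetical helper written for illustration. The second half shows, on a made-up toy data frame, that predict gives the same answer as plugging into the fitted coefficients by hand.

```r
# Rounded coefficients copied from the lm output above
intercept <- 2026
slope     <- 0.0002198

# Hypothetical helper: predicted votes for a given budget
predict_votes <- function(budget) intercept + slope * budget

print(predict_votes(0))    # 2026 votes for a $0 budget
print(predict_votes(1e7))  # about 4224 votes for a $10,000,000 budget

# Same idea with a fitted model, on toy data (hypothetical values)
toy <- data.frame(budget = c(1e6, 5e6, 1e7, 5e7),
                  votes  = c(900, 2500, 5200, 11000))
fit <- lm(votes ~ budget, data = toy)

by_hand    <- coef(fit)[["(Intercept)"]] + coef(fit)[["budget"]] * 2e7
by_predict <- predict(fit, newdata = data.frame(budget = 2e7))

print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```

Using predict avoids copying rounded coefficients by hand, which matters when the slope is as small as it is here.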

II. Possible Datasets

Pick one of the following datasets and analyze it – make scatterplots, find correlations, find best fit lines and interpret the lines. If you are interested in the same dataset as a classmate you may share notes but your analysis should be your own. Your analysis should be done in a .Rmd (R markdown) file. The first block of R code should be:

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/shared_files/packages"))
library(dplyr); library(ggplot2); library(corrplot); library(tidyr)

Here are pieces of code that you can use to read in the data you’re interested in, along with a link to find out more about the data.

1. Why do some mammals sleep more than others?

Which mammals have the most REM sleep (dream sleep)? Does it depend on the size of the animal? Does it depend on the size of the animal’s brain?

You can find out more about this data here.

msleep <- read.csv('/home/rstudioshared/shared_files/data/msleep.csv')
View(msleep)

2. Do endorsements win primaries?

A full description of this data can be found here.

endorsements <- read.csv('/home/rstudioshared/shared_files/data/endorsements_june_30.csv')
View(endorsements)

If you would like to recode “won_primary” from a Yes/No answer to 1/0 you can do so as follows:

endorsements <- endorsements %>% mutate(won_primary = 1*(won_primary=="Yes"))

3. Forecasting baseball pitchers

In 1999 Voros McCracken changed the game of baseball by showing that some of what we thought we knew about pitching was wrong. He did so by looking at different elements of pitchers’ performances and seeing how strongly they were correlated from one year to the next. Try to see if you can rediscover what he was the first to find. What are the implications?

This data set has the following pitcher statistics from consecutive years (year Y and the previous year, Y minus 1):

so_rate: strikeouts per batter faced
walk_rate: walks per batter faced
hr_rate: home runs per ball in play (in other words, the denominator does not include walks or strikeouts)
babip: batting average on balls in play (the denominator is at bats less strikeouts and home runs)

dips <- read.csv('/home/rstudioshared/shared_files/data/DIPS.csv')
View(dips)

4. Cars

This data is from Consumer Reports in 1990 and covers 111 models of cars. A description of the variables can be found here.

cars <- read.csv('/home/rstudioshared/shared_files/data/car90.csv')
View(cars)

5. Employment of Recent College Graduates by Major

This data is from the American Community Survey 2010-2012 Public Use Microdata Series. The data on recent graduates is limited to college graduates who are 27 years old or younger. A description of the data set can be found here.

employment <- read.csv('/home/rstudioshared/shared_files/data/recentgrads.csv')
View(employment)

Here, you may find that it’s more useful to look at numbers as rates rather than counts. You can compute a rate and add it to your data.frame as follows:

employment <- employment %>% mutate(employed.rate = Employed/Total)

6. OECD Country Data

The Organisation for Economic Co‑operation and Development is an organization of 34 countries. Its mission is to “promote policies that will improve the economic and social well-being of people around the world” and it analyzes the economic and social well-being of member countries. This data is the result of one such analysis.

Use the OECD_key file to find the definition of variables. You can find out more about how each variable is defined on the OECD website. For instance, find out about PISA scores here.

OECD <- read.csv('/home/rstudioshared/shared_files/data/OECD.csv')
View(OECD)
OECD_key <- read.csv('/home/rstudioshared/shared_files/data/OECD_key.csv')
View(OECD_key)

Correlation and Best Fit Line Lab