Possible Datasets

Pick one of the following datasets and analyze it – make scatterplots and data summaries, find correlations, plot best-fit lines and interpret them. You could also use a decision tree, random forest or multiple regression model to predict one of the variables from several of the others. You may also want to use our good friend the bootstrap to estimate the uncertainty in any of the quantities that you find (e.g. correlations, slopes of best-fit lines). Please note that you aren’t expected to do all of this. Simply find some interesting things to do with your data and clearly explain what you did.
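
As a rough sketch of the bootstrap idea mentioned above, the code below resamples rows with replacement and recomputes a correlation each time. It uses mtcars, a dataset that ships with R, purely as a stand-in (it is not one of the assignment datasets), and the helper name boot_cor is made up for this example.

```r
# Bootstrap the correlation between two variables.
# mtcars is only a stand-in dataset that ships with R.
set.seed(1)   # make the resampling reproducible

# boot_cor is a hypothetical helper written for this sketch.
boot_cor <- function(x, y, n_boot = 2000) {
  n <- length(x)
  replicate(n_boot, {
    idx <- sample(n, replace = TRUE)   # resample row indices with replacement
    cor(x[idx], y[idx])                # correlation in the resampled data
  })
}

observed <- cor(mtcars$wt, mtcars$mpg)
boot_values <- boot_cor(mtcars$wt, mtcars$mpg)

observed
quantile(boot_values, c(0.025, 0.975))   # rough 95% percentile interval
```

The spread of boot_values gives a sense of how much the correlation could change with a different sample. The same pattern should work for a slope if you replace cor() with a call that extracts a coefficient from lm().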

If you are interested in the same dataset as a classmate, you may share notes, but your analysis should be your own. Your analysis should be done in a .Rmd (R Markdown) file. You should complete a report on this data by Thursday, November 29th.

Each of these datasets is on the server, and the first block of your R code should be:

.libPaths(c("/home/rstudioshared", "/home/rstudioshared/shared_files/packages"))

1. Why do some mammals sleep more than others?

Which mammals have the most REM sleep (dream sleep)? Does it depend on the size of the animal? Does it depend on the size of the animal’s brain?

library(ggplot2)
data(msleep)
View(msleep)
?msleep
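
One possible starting point for the questions above, offered as a sketch rather than the expected analysis: drop the mammals with no REM measurement, then compare REM sleep with body weight on a log scale. The log scale is a choice made here because body weights span several orders of magnitude.

```r
library(ggplot2)   # provides the msleep data and the plotting functions

data(msleep)

# Keep only the mammals with a recorded REM sleep value.
rem <- msleep[!is.na(msleep$sleep_rem), ]

# Correlation between log body weight and hours of REM sleep.
r <- cor(log10(rem$bodywt), rem$sleep_rem)
r

# Scatterplot with a least-squares line, body weight on a log axis.
p <- ggplot(rem, aes(x = bodywt, y = sleep_rem)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Body weight (kg, log scale)", y = "REM sleep (hours per day)")
p
```

Swapping bodywt for brainwt addresses the brain-size question, though brainwt has missing values of its own that would need the same kind of filtering.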

2. Forecasting baseball pitchers

In 1999 Voros McCracken changed the game of baseball by showing that some of what we thought we knew about pitching was wrong. He did so by looking at different elements of pitchers’ performances and seeing how strongly they were correlated from one year to the next. Try to see if you can rediscover what he was the first to find. What are the implications?

This data set has the following pitcher statistics from consecutive years (year Y and the previous year, Y minus 1):

SOrate: strikeouts per batter faced

BBrate: walks per batter faced

HRrate: home runs per ball in play (in other words, the denominator does not include walks or strikeouts)

BABIP: batting average on balls in play (the denominator is at bats less strikeouts and home runs)

ERA: earned runs per 9 innings

FBV: fastball velocity

Can you come up with the best way to predict ERA from a pitcher’s stats from the year before?

dips <- read.csv('/home/rstudioshared/shared_files/data/DIPS2.csv')

View(dips)
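
The year-to-year comparison described above can be sketched as follows. The column names in DIPS2.csv are not listed here, so this example runs on simulated toy data with hypothetical column names; run names(dips) to find the real ones and substitute them. The helper year_to_year is also made up for this sketch.

```r
# Toy stand-in for the real data: these column names are hypothetical.
# Run names(dips) to see the actual ones in DIPS2.csv.
set.seed(42)
n <- 200
skill <- rnorm(n)   # a persistent component shared by both years
toy <- data.frame(
  stat_prev = skill + rnorm(n, sd = 0.5),   # a stat in year Y minus 1
  stat_now  = skill + rnorm(n, sd = 0.5),   # the same stat in year Y
  luck_prev = rnorm(n),                     # a stat with no persistent component
  luck_now  = rnorm(n)
)

# Correlation of a stat with itself one year later.
year_to_year <- function(df, prev_col, now_col) {
  cor(df[[prev_col]], df[[now_col]], use = "complete.obs")
}

year_to_year(toy, "stat_prev", "stat_now")   # should be clearly positive
year_to_year(toy, "luck_prev", "luck_now")   # should be close to zero
```

A stat that carries over strongly from one year to the next is a good candidate predictor; one that barely carries over is mostly noise from the forecaster's point of view.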

3. Cars

This data is from a 1990 issue of Consumer Reports and covers 111 car models.

library(rpart)
data(car90)
View(car90)
?car90
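
Since rpart is already loaded for this dataset, a decision tree is a natural thing to try. The sketch below predicts price from a handful of columns; the choice of predictors is only an illustration, not a recommendation.

```r
library(rpart)   # provides both the car90 data and the rpart() tree fitter

data(car90)

# Fit a regression tree for price from a few of the other columns.
# This particular choice of predictors is only an illustration.
fit <- rpart(Price ~ HP + Weight + Disp + Type, data = car90)

print(fit)                 # text view of the splits
plot(fit, margin = 0.1)    # draw the tree
text(fit, use.n = TRUE)    # label the splits and leaf sizes

# Predicted price for each car, to compare against the actual price.
pred <- predict(fit, newdata = car90)
```

Plotting pred against car90$Price shows how well the tree fits. Keep in mind that some cars have missing prices, and rpart leaves those rows out when fitting.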

4. OECD Country Data

The Organisation for Economic Co‑operation and Development (OECD) is an organization of 34 countries. Its mission is to “promote policies that will improve the economic and social well-being of people around the world”, and it analyzes the economic and social well-being of member countries. This data is the result of one such analysis.

Use the OECD_key file to find the definition of variables. You can find out more about how each variable is defined on the OECD website. For instance, find out about PISA scores here.

OECD <- read.csv('/home/rstudioshared/shared_files/data/OECD.csv')
View(OECD)
OECD_key <- read.csv('/home/rstudioshared/shared_files/data/OECD_key.csv')
View(OECD_key)

NEW

I added up-to-date OECD data:

OECD2017 <- read.csv('/home/rstudioshared/shared_files/data/OECD2017.csv')
View(OECD2017)

You may want to simplify the column names:

data2 <- data.frame(country=OECD2017$Country)
data2$expect <- OECD2017$Life.expectancy.in.yrs
data2$no_fac <- OECD2017$Dwellings.without.basic.facilities.as.pct
data2$disp_inc <- OECD2017$Household.net.adjusted.disposable.income.in.usd
data2$emp_rate <- OECD2017$Employment.rate.as.pct
data2$educ <- OECD2017$Educational.attainment.as.pct
data2$air <- OECD2017$Air.pollution.in.ugm3
data2$water <- OECD2017$Water.quality.as.pct
data2$self_health <- OECD2017$Self.reported.health.as.pct
data2$satisf <- OECD2017$Life.satisfaction.as.avg.score
data2$feel_safe <- OECD2017$Feeling.safe.walking.alone.at.night.as.pct
data2$homicide <- OECD2017$Homicide.rate.as.rat
data2$long_hours <- OECD2017$Employees.working.very.long.hours.as.pct
data2$leisure <- OECD2017$Time.devoted.to.leisure.and.personal.care.in.hrs

View(data2)

Regression Lab and Report

Data Science
