STAT 327 Homework 4

We’ll grade your homework by opening your “hw4.Rmd” file in RStudio (in a directory containing “farm.csv”, “scores.csv”, and “hw4freeCode.R”), clicking “Knit HTML”, reading the HTML output, and reading your “hw4.Rmd” file. You should write R code anywhere you see an empty R code chunk.

Name: Michael Huizenga

Email: mhuizenga@wisc.edu

Part 1: A “jackknife” procedure to find the most outlying point in a linear relationship between two variables

First load the “XML” package to give access to readHTMLTable() and the “RCurl” package for access to getURL().

if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  stopifnot(require("XML")) # do this once per session
}
## Loading required package: XML
if (!require("RCurl")) {
  install.packages("RCurl") # do this once per lifetime
  stopifnot(require("RCurl")) # do this once per session
}
## Loading required package: RCurl
## Loading required package: bitops

Use R to get the land area (sq. miles) of each state from the second table (labeled “Land area”) in the web page https://simple.wikipedia.org/wiki/List_of_U.S._states_by_area. Hint: you can use tables = readHTMLTable(getURL("https://simple.wikipedia.org/wiki/List_of_U.S._states_by_area")) to read the data. Include code to remove the commas from the numbers.

land_table = readHTMLTable(getURL("https://simple.wikipedia.org/wiki/List_of_U.S._states_by_area"))[2]

land_table <- as.data.frame(land_table)
# drop the "NULL." prefix that as.data.frame() adds to the column names
colnames(land_table) <- substring(colnames(land_table), 6)
# strip the commas so the area columns parse as numbers
land_table$km. <- as.numeric(gsub(",", "", land_table$km.))
land_table$sq.miles <- as.numeric(gsub(",", "", land_table$sq.miles))

Use R to get farm areas of states from “farm.csv”. (Note: I got the data in 2013 from the spreadsheet in the row labeled “825 - Farms–Number and Acreage by State [Excel 131k] …” at http://www.census.gov/compendia/statab/cats/agriculture/farms_and_farmland.html. I took the 2010 acreage (1000) column, multiplied by 1000, and divided by 640 (acres per sq. mile). You do not need to use this spreadsheet; just use “farm.csv”.)

farm_areas <- read.csv("farm.csv")

Create a data frame called “area” whose columns are “state”, “farm”, and “land”, which contain state names, farm areas, and land areas, respectively. Hint: the states aren’t in the same order in the two data sets, so getting the “area” data frame right requires a little care.

# rename to the target column names
land_table$state <- land_table$State
land_table$land <- land_table$sq.miles
farm_areas$farm <- farm_areas$sq.miles
# drop the "United States" total row and keep only the columns we need
land <- subset(land_table, state != "United States", select = c("state", "land"))
farm <- subset(farm_areas, select = c("state", "farm"))
# merge() matches states by name, so the differing row orders don't matter
area <- merge(land, farm, by = "state")
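
As a quick sanity check (a minimal sketch, assuming the two sources spell state names identically), setdiff() should come back empty in both directions:

# states present in one data set but missing from the other (expect none)
setdiff(land$state, farm$state)
setdiff(farm$state, land$state)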

Make a scatterplot of y = farm area vs. x = land area.

plot(area$land, area$farm)

There are two prominent outliers. Use identify() to find their indices.

Unfortunately, identify() doesn’t work on an R graph that we’re viewing through an HTML page. We can use the RStudio menu command “Chunks > Run all” to run all the code in this file in the console, and then click on the graph in RStudio’s “Plots” tab. Once you know the indices, just assign them to variables so you can use them later. Then comment out your call to identify().

# identify() is interactive, so it can't run while knitting; run the
# commented-out call in the console and click on the two outlying points.
# identify(area$land, area$farm, labels = area$state)
land_index <- which.max(area$land) # the state with the largest land area
farm_index <- which.max(area$farm) # the state with the largest farm area
area$state[c(land_index, farm_index)]

The two outliers are Texas, which fits the roughly linear trend of the rest of the data, and Alaska, which does not fit.

Make a linear model of y = farm area vs. x = land area. Make your scatterplot again, and this time add the regression line to it. Then make a linear model of the same data, except with Alaska removed. Add that regression line, colored red, to your scatterplot.

reg_line <- lm(farm ~ land, area)
plot(area$land, area$farm)
abline(reg_line)
area_sub <- subset(area, state != "Alaska")
reg_line_sub <- lm(farm ~ land, area_sub)
abline(reg_line_sub, col = "red")
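
A legend makes clear which fit is which. A minimal sketch (run right after the abline() calls; the position and labels are arbitrary choices):

legend("topleft", legend = c("all states", "Alaska removed"),
       col = c("black", "red"), lty = 1)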

Notice that, with respect to the original regression line, Texas has the biggest residual (difference between actual and predicted y), because Alaska pulled the line down toward itself. But really Alaska is the outlier! Next we’ll do a “jackknife” procedure to discover computationally that Alaska is the most important outlier.

Make a plot of the residuals for the original model. (Hint: they’re available in the output of lm().)

plot(reg_line$fitted.values, reg_line$residuals)

Notice again that the Texas residual is bigger than the Alaska residual.
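
We can back this up numerically by ranking the absolute residuals stored in the model object; the two largest should belong to Texas, then Alaska:

# positions of the two largest absolute residuals
big <- order(abs(reg_line$residuals), decreasing = TRUE)[1:2]
area$state[big]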

Next use a loop to create n=50 models. In step i, make a model of the data with observation i removed. Then predict the value of y[i] from that model, and find the residual (difference) between (the removed) y[i] and the prediction. Save these residuals in a vector r.jack. (A “jackknife” procedure works by removing one observation (or several) from a data set, and then making a prediction from that smaller data set, and repeating this for each observation.)

n <- nrow(area)
r.jack <- numeric(n) # preallocate one jackknife residual per state
for (i in 1:n) {
  temp_sub <- area[-i, ]                # all observations except i
  temp_reg <- lm(farm ~ land, temp_sub) # refit without observation i
  predicted_val <- predict(temp_reg, newdata = area[i, ])
  # absolute difference between the held-out y[i] and its prediction
  r.jack[i] <- abs(area$farm[i] - predicted_val)
}

Plot these “jackknife” residuals.

plot(area$land, r.jack)

Notice now that Alaska is clearly the real outlier.
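
To confirm computationally, ask which state has the largest jackknife residual:

area$state[which.max(r.jack)] # expect Alaska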

Part 2: ggplot2 graphics

Use ggplot2 to make several graphs. First, here’s code to load, or install and load, the package.

if (!require("ggplot2")) {
  install.packages("ggplot2")
  stopifnot(require("ggplot2"))
}
## Loading required package: ggplot2
  1. Consider the built-in data set warpbreaks. (See ?warpbreaks, http://en.wikipedia.org/wiki/Warp_%28weaving%29, and http://en.wikipedia.org/wiki/Power_loom#Operation.) Make a histogram of the numbers of warp breaks.
ggplot(warpbreaks, aes(x = breaks)) +
  geom_histogram(binwidth = 5) # breaks is numeric, so bin it; stat = "count" is for discrete data

  1. Make a density plot of the numbers of warp breaks.
ggplot(warpbreaks, aes(x = breaks)) + geom_density()

  1. Make two density plots of warp breaks, using a different color for each wool type, on a single panel. Does the wool type have a strong effect on the number of breaks?
ggplot(warpbreaks, aes(x = breaks)) + geom_density(aes(colour = wool))

###Wool type doesn’t have a dramatic effect, but type B seems to have somewhat fewer breaks; type A’s density has a longer right tail.
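
A quick check of the group means supports this reading:

aggregate(breaks ~ wool, warpbreaks, mean) # wool A averages more breaks than wool B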
  1. Make three density plots of warp breaks, using a different color for each tension level, on a single panel. How does tension seem to affect the number of breaks?
ggplot(warpbreaks, aes(x = breaks)) + geom_density(aes(colour = tension))

###It seems as though higher tension leads to fewer breaks.
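
Again, the group means agree:

aggregate(breaks ~ tension, warpbreaks, mean) # mean breaks fall as tension rises (L > M > H)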
  1. “Old Faithful” is a geyser in Yellowstone National Park that erupts on a remarkably regular schedule (http://en.wikipedia.org/wiki/Old_Faithful). Make a scatterplot of waiting time (\(y\)) vs. most recent eruption time (\(x\)) from the built-in faithful data set. (See ?faithful.) Include a simple linear regression line. What is the most striking feature of this plot?
ggplot(faithful, aes(x = eruptions, y = waiting)) + geom_point() + geom_smooth(method = "lm")

###The most striking feature, to me, is that there seem to be no extreme outliers. That is, the length of eruptions and the time between eruptions seem to be very strongly correlated.
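
The correlation backs this up:

cor(faithful$eruptions, faithful$waiting) # about 0.90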

Part 3: Web-scraping

rm(list = ls()) # clear the workspace before starting Part 3

First load the “XML” package to give access to readHTMLTable().

if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  require("XML") # do this once per session
  }

At the bottom of the Internet Movie Database website there’s a link to the Top 250. At the “Top 250” page there’s a list of 250 movies, with a link to each movie. The first movie is The Shawshank Redemption.

With your browser on the “Top 250” page, you can do “right-click > view page source” (in Firefox or Chrome; in Safari, first do “Safari > Preferences > Advanced” and check “Show Develop menu in menu bar”) to see the HTML code that creates the page. (You do not need to learn HTML for this homework.)

Search in the HTML source page for “Shawshank”, and you’ll see that it occurs on line 833. Search for “Godfather”, and you’ll see that it occurs twice, on line 873 for “The Godfather” and on line 913 for “The Godfather: Part II”. For each of these three lines, the preceding line contains a link, relative to the main IMDB URL, to that movie’s page. Use grep() to figure out what small string is common to the 250 lines, like these three, that contain links to the top 250 movies.
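
For example, one way to test a candidate string (a sketch assuming the page layout those line numbers refer to; “titleColumn” is the marker the solution below settles on):

movies <- readLines("http://www.imdb.com/chart/top")
length(grep("titleColumn", movies)) # should be 250 if the string marks exactly the movie entries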

Notice that line 833 for “The Shawshank Redemption” includes the text “/title/tt0111161”. Pasting this onto “http://www.imdb.com” gives “http://www.imdb.com/title/tt0111161”, which is a link to the first movie’s page. Adding “/fullcredits” gives “http://www.imdb.com/title/tt0111161/fullcredits”, which is a link to the full cast and crew. Search this “fullcredits” page for “Produced” and you’ll see that “The Shawshank Redemption” was produced by “Liz Glotzer”, “David V. Lester”, and “Niki Marvin”.
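
The URL assembly itself is plain string concatenation, e.g.:

paste0("http://www.imdb.com", "/title/tt0111161", "/fullcredits")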

Write code that does the following:

movies <- readLines("http://www.imdb.com/chart/top")
# "titleColumn" marks each movie's entry; the relative link is two lines below
titles <- grep("titleColumn", movies) + 2
links_full <- movies[titles]
# characters 16-32 of those lines hold the link, e.g. "/title/tt0111161/"
links_specific <- substring(links_full, 16, 32)

# build each movie's full-credits URL and parse its tables
web_links <- paste0("http://www.imdb.com", links_specific, "fullcredits")
producers <- list()
producer_names <- list()
# readHTMLTable() parses one page at a time, so loop over the 250 URLs
tables <- vector("list", 250) # note: list(length = 250) would make a 1-element list
for (i in 1:250) {
  tables[[i]] <- readHTMLTable(web_links[i]) # one HTTP request per movie, so this is slow
}
for (i in 1:250) {
  producers[[i]] <- tables[[i]][4] # the fourth table on the full-credits page lists the producers
  df <- as.data.frame(producers[[i]])
  producer_names[[i]] <- as.vector(df$NULL.V1) # first column holds the names
}
unlisted <- unlist(producer_names)
name_table <- table(unlisted) # count how many Top-250 movies credit each producer
name_table <- sort(name_table, decreasing = TRUE)
print(name_table[1:5]) # the five most frequent producers
## unlisted
##    Bob Weinstein Harvey Weinstein    John Lasseter      Emma Thomas 
##                9                9                9                8 
##  Stanley Kubrick 
##                7

Part 4: Extra Credit (not required; worth 0, 1, or 2 points)

# ...