DATA 605 Week 11

Introduction
Data Acquisition
Model Building
Model Evaluation
Conclusion

Introduction

As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we were asked to fit a simple linear regression to data of our choice. I chose to examine the relationship between total arcade revenue and the number of computer science PhDs awarded.

Data Acquisition

The data was scraped from Tyler Vigen’s site. The arcade revenue figures originally come from the U.S. Census Bureau, and the computer science doctorate counts originally come from the National Science Foundation.

library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)

df <- read_html("http://www.tylervigen.com/view_correlation?id=97") %>%
  html_nodes(xpath = "//table[@class='alldata']") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(-X12) %>% # Drop empty column
  slice(1:3) %>% # Keep the first 3 rows
  mutate(X1 = ifelse(X1 == "", "Year", str_extract(X1, "(.*?)\\s?\\(.*?\\)"))) %>% # Cleanup first column
  mutate(X1 = str_remove(X1, " \\(.*\\)")) %>% # More first column cleanup
  mutate(X1 = str_replace_all(X1, " ", ".")) %>% # Last bit of first column cleanup
  rownames_to_column %>% # Transpose dataframe
  gather(var, value, -rowname) %>% 
  spread(rowname, value) %>%
  select(-var) # Drop unneeded column

# Pull name from first row
names(df) <- df[1,]

df <- df %>%
  slice(2:11) %>% # Drop first row
  mutate(Total.revenue.generated.by.arcades = gsub(",", "", Total.revenue.generated.by.arcades)) %>% # Remove the commas
  mutate(Computer.science.doctorates.awarded = gsub(",", "", Computer.science.doctorates.awarded)) %>% # Remove the commas
  mutate_if(is.character, as.numeric) # Convert to numeric

The end result is the following data:

Year	Total Revenue Generated by Arcades	Computer Science Doctorates Awarded
2008	1803	1787
2009	1734	1611
2000	1196	861
2001	1176	830
2002	1269	809
2003	1240	867
2004	1307	948
2005	1435	1129
2006	1601	1453
2007	1654	1656
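Because the scrape depends on the live page keeping its current layout, the same ten rows can also be entered by hand. The following is a minimal sketch, assuming the values in the table above are correct; the column names match those produced by the scraping code.

```r
# Rebuild the cleaned data from the table above so the analysis
# does not depend on the source page still being available.
df <- data.frame(
  Year = c(2008, 2009, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007),
  Total.revenue.generated.by.arcades =
    c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654),
  Computer.science.doctorates.awarded =
    c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
)

# Sort chronologically for readability; row order does not affect the regression
df <- df[order(df$Year), ]
str(df)
```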

Model Building

Now that the data is cleaned up, we fit a least-squares line with arcade revenue as the response and doctorates awarded as the predictor.

model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, data = df)

Model Evaluation

summary(model)


Call:
lm(formula = Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.513 -35.320   6.046  28.183  58.719 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         725.80591   46.30027   15.68 2.74e-07 ***
Computer.science.doctorates.awarded   0.59886    0.03701   16.18 2.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.31 on 8 degrees of freedom
Multiple R-squared:  0.9704,    Adjusted R-squared:  0.9666 
F-statistic: 261.8 on 1 and 8 DF,  p-value: 2.138e-07
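For a simple linear regression, the estimates reported above can be checked by hand. The slope is the covariance of the two variables divided by the variance of the predictor, the intercept follows from the two means, and R-squared equals the squared correlation. The sketch below re-enters the ten observations from the data table so that it runs on its own; the short names x and y are introduced here only for illustration.

```r
# Doctorates awarded (predictor) and arcade revenue (response), from the data table
x <- c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
y <- c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654)

slope     <- cov(x, y) / var(x)          # least-squares slope
intercept <- mean(y) - slope * mean(x)   # the fitted line passes through the means
r_squared <- cor(x, y)^2                 # R-squared in the one-predictor case

round(c(intercept = intercept, slope = slope, r_squared = r_squared), 5)

# Compare the hand-computed coefficients with lm()
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```

The values should agree with the summary output: an intercept of about 725.81, a slope of about 0.599, and an R-squared of about 0.970.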

Residual Analysis

plot(model)
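By default, plot(model) steps through four diagnostic plots one at a time. The sketch below shows one way to view them on a single page and to supplement the Q-Q plot with a formal normality check. The data are re-entered from the table so the block is self-contained, and the Shapiro-Wilk test is an addition that was not part of the original analysis. With only ten observations, such a test has little power, so it should be read alongside the plots rather than in place of them.

```r
# Re-enter the data and refit the model so this block runs on its own
df <- data.frame(
  Total.revenue.generated.by.arcades =
    c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654),
  Computer.science.doctorates.awarded =
    c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
)
model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded,
            data = df)

# Show the four default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)  # restore the previous plotting layout

# Formal check of residual normality (an addition, not in the original analysis)
sw <- shapiro.test(residuals(model))
print(sw)
```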

Conclusion

The model explains about 97% of the variability in arcade revenue (R-squared of 0.9704). The residual plots do not raise any concerns, and the Q-Q plot is fairly linear. Both coefficients are statistically significant. The one problem with the model is that the relationship is a spurious correlation: a close statistical fit between two series does not mean that one influences the other.