DATA 605 Week 11

Introduction

As part of the coursework for Fundamentals of Computational Mathematics at the CUNY School of Professional Studies, we were asked to fit a simple linear regression to data of our choice. I chose to examine the relationship between total arcade revenue and the number of computer science PhDs awarded.

Data Acquisition

The data was scraped from Tyler Vigen’s site. The arcade revenue figures originally come from the U.S. Census Bureau, and the computer science doctorate counts originally come from the National Science Foundation.

library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)

df <- read_html("http://www.tylervigen.com/view_correlation?id=97") %>%
  html_nodes(xpath = "//table[@class='alldata']") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(-X12) %>% # Drop empty column
  slice(1:3) %>% # Keep the first 3 rows
  mutate(X1 = ifelse(X1 == "", "Year", str_extract(X1, "(.*?)\\s?\\(.*?\\)"))) %>% # Cleanup first column
  mutate(X1 = str_remove(X1, " \\(.*\\)")) %>% # More first column cleanup
  mutate(X1 = str_replace_all(X1, " ", ".")) %>% # Last bit of first column cleanup
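  rownames_to_column() %>% # Transpose the data frame: move row names into a column,
  gather(var, value, -rowname) %>% # reshape to long form,
  spread(rowname, value) %>% # then spread the original rows out as columns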
  select(-var) # Drop unneeded column

# Use the first row (the cleaned labels) as the column names
names(df) <- as.character(unlist(df[1, ]))

df <- df %>%
  slice(2:11) %>% # Drop first row
  mutate(Total.revenue.generated.by.arcades = gsub(",", "", Total.revenue.generated.by.arcades)) %>% # Remove the commas
  mutate(Computer.science.doctorates.awarded = gsub(",", "", Computer.science.doctorates.awarded)) %>% # Remove the commas
  mutate_if(is.character, as.numeric) # Convert to numeric

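The end result is the following data. The rows are not in chronological order because spread() sorts the intermediate column labels (X10, X11, X2, ...) alphabetically; the row order does not affect the regression.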

Year  Total Revenue Generated by Arcades  Computer Science Doctorates Awarded
2008                                1803                                 1787
2009                                1734                                 1611
2000                                1196                                  861
2001                                1176                                  830
2002                                1269                                  809
2003                                1240                                  867
2004                                1307                                  948
2005                                1435                                 1129
2006                                1601                                 1453
2007                                1654                                 1656
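
For readers who want to reproduce the analysis without scraping the site, the table above can be entered by hand. This is a minimal sketch rather than the original acquisition code: the values are copied from the table, and the sort by year is an optional extra step that is not in the original pipeline.

```r
# Rebuild the cleaned data frame from the values in the table above
df <- data.frame(
  Year = c(2008, 2009, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007),
  Total.revenue.generated.by.arcades =
    c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654),
  Computer.science.doctorates.awarded =
    c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
)

# Optional: put the rows in chronological order (row order does not change the fit)
df <- df[order(df$Year), ]
rownames(df) <- NULL

# Confirm that there are 10 rows and that every column is numeric
str(df)
```

Keeping the same column names means the `lm()` call in the next section works unchanged on this data frame.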

Model Building

Now that the data is cleaned up, we fit a line with arcade revenue as the response and computer science doctorates awarded as the predictor:

model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, data = df)
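
With a single predictor, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means. The sketch below checks that arithmetic against `lm()`. The short names `revenue`, `doctorates`, and `fit` are stand-ins introduced here for illustration, with values copied from the table under Data Acquisition.

```r
# Values copied from the data table (same row order as the table)
revenue    <- c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654)
doctorates <- c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)

# Closed-form estimates for simple linear regression
slope     <- cov(doctorates, revenue) / var(doctorates)
intercept <- mean(revenue) - slope * mean(doctorates)

# Compare with the coefficients that lm() estimates
fit <- lm(revenue ~ doctorates)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
print(all.equal(unname(coef(fit)), c(intercept, slope)))
```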

Model Evaluation

summary(model)

Call:
lm(formula = Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.513 -35.320   6.046  28.183  58.719 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         725.80591   46.30027   15.68 2.74e-07 ***
Computer.science.doctorates.awarded   0.59886    0.03701   16.18 2.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.31 on 8 degrees of freedom
Multiple R-squared:  0.9704,    Adjusted R-squared:  0.9666 
F-statistic: 261.8 on 1 and 8 DF,  p-value: 2.138e-07
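
Two of the numbers in this summary can be cross-checked directly. In a simple linear regression the multiple R-squared equals the squared Pearson correlation between the two variables, and `confint()` gives an interval around each coefficient estimate. The following is a hedged sketch; `revenue`, `doctorates`, and `fit` are illustrative names, not objects from the original analysis, and the values are copied from the data table.

```r
# Values copied from the data table; the vector names are shorthand for the two columns
revenue    <- c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654)
doctorates <- c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
fit <- lm(revenue ~ doctorates)

# In simple linear regression, R-squared is the squared Pearson correlation
r         <- cor(doctorates, revenue)
r_squared <- summary(fit)$r.squared
print(c(correlation = r, correlation_squared = r^2, r_squared = r_squared))

# 95% confidence intervals for the intercept and the slope
print(confint(fit, level = 0.95))
```

The slope of about 0.6 says that, within this sample, each additional doctorate awarded is associated with roughly 0.6 more units of arcade revenue. That describes an association in the data, not an effect.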

Residual Analysis

plot(model)
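
Calling `plot()` on an `lm` object draws several diagnostic plots one after another. A 2 x 2 layout shows them together, and a numeric summary of the residuals can be read alongside them. The sketch below refits the model on values copied from the data table; `revenue`, `doctorates`, `fit`, and `res` are illustrative names rather than objects from the original analysis.

```r
# Values copied from the data table; the vector names are shorthand for the two columns
revenue    <- c(1803, 1734, 1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654)
doctorates <- c(1787, 1611, 861, 830, 809, 867, 948, 1129, 1453, 1656)
fit <- lm(revenue ~ doctorates)

# Numeric view of the residuals (their mean is zero up to rounding when an intercept is fitted)
res <- residuals(fit)
print(summary(res))

# Draw the default diagnostic plots in a single 2 x 2 grid, then restore the layout
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```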

Conclusion

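This model explains almost all of the variability in arcade revenue: the R-squared is about 0.97. The residual plots do not raise any concerns, and the Q-Q plot is fairly linear. Both coefficients are statistically significant. The only problem with this model is that the relationship is a spurious correlation: the two series have no evident causal connection, so the strong fit should not be read as evidence that one drives the other.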

Mike Silva

10 April 2019