DATA 605 Week 11
Introduction
As part of the Fundamentals of Computational Mathematics coursework at the CUNY School of Professional Studies, we were asked to fit a simple linear regression to data of our choice. I chose to look at the relationship between total arcade revenue and the number of computer science doctorates awarded.
Data Acquisition
The data was scraped from Tyler Vigen’s site. The arcade revenue figures originally come from the U.S. Census Bureau, and the computer science doctorate counts originally come from the National Science Foundation.
library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(tibble)
df <- read_html("http://www.tylervigen.com/view_correlation?id=97") %>%
  html_nodes(xpath = "//table[@class='alldata']") %>%
  html_table(fill = TRUE) %>%
  as.data.frame() %>%
  select(-X12) %>% # Drop empty column
  slice(1:3) %>% # Keep the first 3 rows
  mutate(X1 = ifelse(X1 == "", "Year", str_extract(X1, "(.*?)\\s?\\(.*?\\)"))) %>% # Cleanup first column
  mutate(X1 = str_remove(X1, " \\(.*\\)")) %>% # More first column cleanup
  mutate(X1 = str_replace_all(X1, " ", ".")) %>% # Last bit of first column cleanup
  rownames_to_column %>% # Transpose dataframe
  gather(var, value, -rowname) %>%
  spread(rowname, value) %>%
  select(-var) # Drop unneeded column
# Pull name from first row
names(df) <- df[1,]
df <- df %>%
  slice(2:11) %>% # Drop first row
  mutate(Total.revenue.generated.by.arcades = gsub(",", "", Total.revenue.generated.by.arcades)) %>% # Remove the commas
  mutate(Computer.science.doctorates.awarded = gsub(",", "", Computer.science.doctorates.awarded)) %>% # Remove the commas
  mutate_if(is.character, as.numeric) # Convert to numeric
The end result is the following data:
Year | Total Revenue Generated by Arcades | Computer Science Doctorates Awarded |
---|---|---|
2008 | 1803 | 1787 |
2009 | 1734 | 1611 |
2000 | 1196 | 861 |
2001 | 1176 | 830 |
2002 | 1269 | 809 |
2003 | 1240 | 867 |
2004 | 1307 | 948 |
2005 | 1435 | 1129 |
2006 | 1601 | 1453 |
2007 | 1654 | 1656 |
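Because the scraping code depends on the page keeping its current layout, it can be handy to have the same ten rows entered by hand. The sketch below copies the values from the table above into a data frame sorted by year. The name `arcade` is introduced here purely for illustration and is not part of the original code.

```r
# Values copied by hand from the table above. `arcade` is a name made up for
# this sketch so that it does not overwrite the scraped `df`.
arcade <- data.frame(
  Year = 2000:2009,
  Total.revenue.generated.by.arcades =
    c(1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654, 1803, 1734),
  Computer.science.doctorates.awarded =
    c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611)
)

# Quick sanity check: ten rows, every column numeric
str(arcade)
```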
Model Building
Now that the data is cleaned up, we will fit a line to it.
model <- lm(Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded, data = df)
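With a single predictor, the least-squares line has a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept is chosen so that the line passes through the point of means. A minimal sketch that checks this against `lm()`, assuming the values are typed in from the data table (the names `revenue`, `doctorates`, `slope`, `intercept`, and `fit` are introduced here for illustration):

```r
# Values typed in from the data table, ordered by year 2000 to 2009
revenue    <- c(1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654, 1803, 1734)
doctorates <- c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611)

# Closed-form simple linear regression
slope     <- cov(doctorates, revenue) / var(doctorates)
intercept <- mean(revenue) - slope * mean(doctorates)

# Same fit via lm(); the two should agree up to floating-point tolerance
fit <- lm(revenue ~ doctorates)
all.equal(unname(coef(fit)), c(intercept, slope))
```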
Model Evaluation
summary(model)
Call:
lm(formula = Total.revenue.generated.by.arcades ~ Computer.science.doctorates.awarded,
data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-63.513 -35.320   6.046  28.183  58.719
Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                         725.80591   46.30027   15.68 2.74e-07 ***
Computer.science.doctorates.awarded   0.59886    0.03701   16.18 2.14e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43.31 on 8 degrees of freedom
Multiple R-squared: 0.9704, Adjusted R-squared: 0.9666
F-statistic: 261.8 on 1 and 8 DF, p-value: 2.138e-07
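The headline numbers in this printout can also be pulled out of the fitted object directly. With a single predictor, R-squared equals the squared Pearson correlation between the two variables. A hedged sketch on values typed in from the data table (the names `revenue`, `doctorates`, `fit`, and `s` are illustrative, not part of the original code):

```r
# Values typed in from the data table, ordered by year 2000 to 2009
revenue    <- c(1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654, 1803, 1734)
doctorates <- c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611)

fit <- lm(revenue ~ doctorates)
s   <- summary(fit)

s$r.squared                 # the "Multiple R-squared" value
cor(revenue, doctorates)^2  # same quantity, computed from the correlation
s$sigma                     # the "Residual standard error" value
s$coefficients              # estimates, standard errors, t values, p-values
confint(fit, level = 0.95)  # 95% confidence intervals for both coefficients
```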
Residual Analysis
plot(model)
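By default, `plot()` on an `lm` object draws four diagnostic plots one after another: residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage. A small sketch that places them in a single 2-by-2 grid and prints the residual range for comparison with the summary output, assuming values typed in from the data table (the names `revenue`, `doctorates`, `fit`, and `old_par` are illustrative):

```r
# Values typed in from the data table, ordered by year 2000 to 2009
revenue    <- c(1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654, 1803, 1734)
doctorates <- c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611)

fit <- lm(revenue ~ doctorates)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels
plot(fit)                        # the four default diagnostic plots
par(old_par)                     # restore the previous layout

range(residuals(fit))            # compare with Min and Max under "Residuals:"
```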
Conclusion
This model explains almost all of the variability in arcade revenue: the R-squared is about 0.97. The residual plots do not raise any concerns, and the Q-Q plot is fairly linear. Both coefficients are statistically significant. The only problem with this model is that the relationship is a spurious correlation: a good statistical fit does not imply that one quantity drives the other.
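One way to see how such a strong fit can arise without any causal link is to note that both series rose over most of the decade, so each one is also strongly correlated with the calendar year. A minimal sketch, assuming the values are typed in from the data table (the names `year`, `revenue`, and `doctorates` are illustrative):

```r
# Values typed in from the data table, ordered by year 2000 to 2009
year       <- 2000:2009
revenue    <- c(1196, 1176, 1269, 1240, 1307, 1435, 1601, 1654, 1803, 1734)
doctorates <- c(861, 830, 809, 867, 948, 1129, 1453, 1656, 1787, 1611)

cor(year, revenue)        # arcade revenue vs. calendar year
cor(year, doctorates)     # doctorates awarded vs. calendar year
cor(revenue, doctorates)  # the correlation the regression is built on
```

Two quantities that trend in the same direction over the same period will tend to correlate strongly with each other, so the fit on its own says nothing about cause and effect.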