MATH 239: Project Milestone #3: Fitting a Simple Linear Model (100 points)
Goal: In this milestone we will start fitting our first simple model. Please report all your findings for this assignment in an R Markdown document.
Identify a numeric response variable in your data set and a numeric explanatory variable. (10 points)
Numeric response variable: number of EV charging stations by state
Numeric explanatory variable: median income by state
# median state income vs num units by state
ggplot(stations_income_df2, aes(med_income18_20, state_num_units.y))+
geom_point()+
ggtitle("Scatterplot of Median Income vs EV Stations by State")+
xlab("Median Income")+
ylab("EV Stations")+
theme_bw()
## Warning: Removed 3 rows containing missing values (geom_point).
## remove num units outlier
no_outlier <- stations_income_df2 %>%
filter(state_num_units.y < 40000)
ggplot(no_outlier, aes(med_income18_20, state_num_units.y))+
geom_point()+
ggtitle("Scatterplot of Median Income vs EV Stations by State, CA removed")+
xlab("Median Income")+
ylab("EV Stations")+
theme_bw()
## Warning: Removed 3 rows containing missing values (geom_point).
## modify dataframe to be same length as m1$residuals due to na values
#full dataset
stations_income_df2 <- stations_income_df2[!is.na(stations_income_df2$med_income18_20), ]
no_outlier <- no_outlier[!is.na(no_outlier$med_income18_20), ]
# full data set
m1 <- lm(state_num_units.y~med_income18_20, data = stations_income_df2)
summary(m1)
##
## Call:
## lm(formula = state_num_units.y ~ med_income18_20, data = stations_income_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3259 -1907 -1242 0 38843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.304e+03 5.104e+03 -0.451 0.654
## med_income18_20 7.024e-02 7.366e-02 0.954 0.345
##
## Residual standard error: 5854 on 50 degrees of freedom
## Multiple R-squared: 0.01787, Adjusted R-squared: -0.001778
## F-statistic: 0.9095 on 1 and 50 DF, p-value: 0.3448
#scatterplot with fitted line
ggplot(stations_income_df2, aes(med_income18_20, state_num_units.y))+
geom_point()+
ggtitle("Scatterplot of Median Income vs EV Stations by State")+
xlab("Median Income")+
ylab("EV Stations")+
theme_bw()+
geom_abline(slope=m1$coefficients[2], intercept=m1$coefficients[1],
color="blue", lty=2, lwd=1)
# no outlier
m2 <- lm(state_num_units.y~med_income18_20, data = no_outlier)
summary(m2)
##
## Call:
## lm(formula = state_num_units.y ~ med_income18_20, data = no_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1768.2 -1229.6 -714.7 921.4 6265.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.746e+02 1.575e+03 0.238 0.813
## med_income18_20 1.985e-02 2.278e-02 0.871 0.388
##
## Residual standard error: 1801 on 49 degrees of freedom
## Multiple R-squared: 0.01526, Adjusted R-squared: -0.004841
## F-statistic: 0.7591 on 1 and 49 DF, p-value: 0.3879
#scatterplot with fitted line, outlier removed
ggplot(no_outlier, aes(med_income18_20, state_num_units.y))+
geom_point()+
ggtitle("Scatterplot of median income by state vs number of EV charging units by state, CA removed")+
xlab("median income by state")+
ylab("number of EV charging units")+
theme_bw()+
geom_abline(slope=m2$coefficients[2], intercept=m2$coefficients[1],
color="blue", lty=2, lwd=1)
From the linear model above:
Null hypothesis: slope = 0
Alternative hypothesis: slope != 0
Full data set: t-distribution with 50 degrees of freedom; the test statistic is 0.954 and the p-value is 0.345.
Data set with CA removed: t-distribution with 49 degrees of freedom; the test statistic is 0.871 and the p-value is 0.388.
Full data set: We fail to reject the null hypothesis (p-value of 0.345) at the alpha = 0.05 significance level. There does not seem to be much evidence that the median income of a state is associated with its number of EV charging stations.
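As a quick check, the slope test statistic and p-value for the full data set can be recomputed directly from the coefficient table (estimate divided by its standard error, then a two-sided tail probability on 50 degrees of freedom). A small sketch; est, se, and t_stat are just local helper names introduced here:
# recompute the slope test by hand from the coefficient table of m1
est <- summary(m1)$coefficients["med_income18_20", "Estimate"]
se <- summary(m1)$coefficients["med_income18_20", "Std. Error"]
t_stat <- est/se # should match the 0.954 reported by summary(m1)
2*pt(abs(t_stat), df = 50, lower.tail = FALSE) # should match the 0.345 p-value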
anova(m1)
## Analysis of Variance Table
##
## Response: state_num_units.y
## Df Sum Sq Mean Sq F value Pr(>F)
## med_income18_20 1 31163688 31163688 0.9095 0.3448
## Residuals 50 1713228705 34264574
anova(m2)
## Analysis of Variance Table
##
## Response: state_num_units.y
## Df Sum Sq Mean Sq F value Pr(>F)
## med_income18_20 1 2462683 2462683 0.7591 0.3879
## Residuals 49 158965949 3244203
# r^2 = 1 - (SSE/SST)
1 - (1713228705/(31163688+1713228705))
## [1] 0.01786507
With a very small R-squared value of about 0.018 for the full data set, very little of the variation in the number of EV stations is explained by median income. It would be difficult to predict the number of EV stations in a state from its median income.
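The same calculation applied to the anova(m2) table gives the R-squared for the model with CA removed, which should agree with the 0.01526 reported by summary(m2):
# r^2 = 1 - (SSE/SST) for the CA-removed model
1 - (158965949/(2462683+158965949))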
# residual plot to check model assumptions of center and variance of residuals
#full dataset
m1_df <- augment(m1, stations_income_df2)
ggplot(m1_df, aes(x = .fitted, y = .resid))+
geom_point()+
geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
ggtitle("Residual Plot")+
theme_bw()
ggplot(m1_df, aes(x = .fitted, y = .resid))+
geom_point()+
geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
geom_smooth(se=FALSE)+
ggtitle("Residual Plot with locally weighted smoothing line")+
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# CA removed
m2_df <- augment(m2, no_outlier)
ggplot(m2_df, aes(x = .fitted, y = .resid))+
geom_point()+
geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
ggtitle("Residual Plot with CA removed")+
theme_bw()
ggplot(m2_df, aes(x = .fitted, y = .resid))+
geom_point()+
geom_hline(yintercept=0, lty=2, lwd=1, color="blue")+
geom_smooth(se=FALSE)+
ggtitle("Residual Plot with CA removed and locally weighted smoothing line")+
theme_bw()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# check for normal distribution of errors assumption
qqnorm(m1$residuals)
qqline(m1$residuals)
qqnorm(m2$residuals)
qqline(m2$residuals)
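To make the comparison between the two models easier, the two normal Q-Q plots could also be drawn side by side; a small base-R sketch:
# draw both normal Q-Q plots in one row for direct comparison
par(mfrow = c(1, 2))
qqnorm(m1$residuals, main = "Normal Q-Q Plot, full data")
qqline(m1$residuals)
qqnorm(m2$residuals, main = "Normal Q-Q Plot, CA removed")
qqline(m2$residuals)
par(mfrow = c(1, 1))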
# residuals vs cooks distance
ggplot(m1_df, aes(x = .cooksd, y = .resid, label=State))+
geom_point()+
geom_text(hjust=-1, vjust=1)+
ggtitle("Leverage vs Residual Plot")+
theme_bw()
## Warning: Removed 2 rows containing missing values (geom_text).
ggplot(m2_df, aes(x = .cooksd, y = .resid, label=State))+
geom_point()+
geom_text(hjust=-1, vjust=1)+
ggtitle("Leverage vs Residual Plot with CA removed")+
theme_bw()
## Warning: Removed 2 rows containing missing values (geom_text).
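The plots above use Cook's distance (the .cooksd column from augment()). If leverage itself is of interest, augment() also returns the hat values in a .hat column, which could be plotted the same way; a sketch for the full data set:
# leverage (hat values) vs residuals, full data set
ggplot(m1_df, aes(x = .hat, y = .resid, label=State))+
geom_point()+
geom_text(hjust=-1, vjust=1)+
ggtitle("Hat Values vs Residual Plot")+
theme_bw()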
The two numeric variables we looked at are the median income by state and the number of EV charging stations by state. We hypothesized that the median income (explanatory) would be associated with the number of charging stations (response); however, the scatterplot shows one apparent outlier, California, with a much higher number of stations, while the remaining states do not show any clear relationship.
We performed a hypothesis test for the slope using a t-distribution with 50 degrees of freedom (all data) or 49 degrees of freedom (CA removed). The full data set had a test statistic of 0.954 and a p-value of 0.345, and the data set with CA removed had a test statistic of 0.871 and a p-value of 0.388. In both cases we fail to reject the null hypothesis at the alpha = 0.05 significance level. There does not seem to be evidence that a state's median income is associated with the number of EV charging stations it has. The R-squared for the full data set is about 0.018 (0.015 with CA removed), indicating that almost none of the variation in the number of stations is explained by state median income.
To check the model assumptions, we created three diagnostic plots: a residual plot, a normal Q-Q plot, and a Cook's distance vs. residuals plot. In the residual plot for the full data set, the residuals seem fairly centered around 0, except for California. With California removed, it becomes clear that more points are overestimated by the model (points below 0 on the plot) than underestimated, so the residuals are not evenly spread around 0. There is no obvious fan shape, but there does appear to be a weak parabolic pattern, so the spread is not really constant either. In the Q-Q plot for the full data set, the points follow the reference line fairly well except at the far right-hand side, which indicates a right-skewed distribution. With California removed, the points do not follow the line as closely, and the tails bend upward, again indicating right skew. In the Cook's distance vs. residuals plot for the full data set, California is the only point with a Cook's distance large enough (>0.08) to be influential in the model. In the data set with California removed, however, New York, Florida, Massachusetts, and Texas all show notable influence.
Our diagnostic plots show that a simple linear model is not a good way to gain insight into these data. The data do not satisfy the model assumptions of normally distributed errors and constant, even spread of residuals around zero, so we cannot really use simple linear regression to draw conclusions about the relationship between EV stations and median income. We may be able to transform the data to better satisfy the assumptions, but in its current state it is not well suited for SLR.
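As one possible follow-up to the transformation idea above, the response could be log-transformed and the model refit; a minimal exploratory sketch, not one of the models reported in this milestone (it assumes every state has at least one charging unit, since log(0) is undefined):
# exploratory: refit with a log-transformed response and re-check diagnostics
m_log <- lm(log(state_num_units.y) ~ med_income18_20, data = stations_income_df2)
summary(m_log)
plot(m_log, which = 1:2) # residuals vs fitted and normal Q-Q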