A) (10 points)

Identify your response variable, a categorical predictor, and a numeric predictor (that you suspect might be related to your response).

Response variable: number of units by zipcode

Categorical predictor: access code (public or private)

Numeric predictor: Household retirement income, estimated per zip code by the ACS

Describe the units for these variables and for the categorical variable describe the levels.

The response variable is the number of EV units by zip code. We are saying units rather than stations because stations can have more than one unit. Think about this like at a single gas station where there are multiple pumps you can fill up at - each of those pumps would be considered a unit.

The numeric predictor is in dollars and it represents the estimate retirement income for that zip code, estimated based on ACS sample data. We are using this as a proxy for income overall since we have not obtained regular income data at the zip code level, and we recognize the limitations of interpreting these results.

The categorical predictor is access code which has 2 levels, public and private.

B) (10 points)

Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant?

There is a small p-value (<2e-16) which would suggest significance. When graphed normally there does appear to be a weak positive linear association between the variables with some potential outliers where there are higher numbers of stations. We also tried graphing the number of EV units in log10 to see if a more clear linear pattern emerged, but it didn’t look linear in the log graphic either.

m1 <- lm(ZIP.Number.Units ~ household, data = stations_income_df2)
summary(m1)

## 
## Call:
## lm(formula = ZIP.Number.Units ~ household, data = stations_income_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.72 -11.75  -6.93   0.53 879.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.933e+00  5.181e-01   15.31   <2e-16 ***
## household   9.357e-04  4.597e-05   20.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.91 on 11230 degrees of freedom
##   (277 observations deleted due to missingness)
## Multiple R-squared:  0.03558,    Adjusted R-squared:  0.0355 
## F-statistic: 414.3 on 1 and 11230 DF,  p-value: < 2.2e-16

ggplot(stations_income_df2, aes(household, ZIP.Number.Units)) +
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 277 rows containing non-finite values (stat_smooth).

## Warning: Removed 277 rows containing missing values (geom_point).

# y axis log
ggplot(stations_income_df2, aes(household, ZIP.Number.Units)) +
  geom_point() + 
  scale_y_log10()+
  geom_smooth(method = "lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 277 rows containing non-finite values (stat_smooth).
## Removed 277 rows containing missing values (geom_point).

C) (5 points)

Now, write the “dummy” variable coding for your categorical variable.

contrasts(stations_income_df2$Access.Code)

##         private
## public        0
## private       1

D) (15 points)

Fit a linear model with a response variable and a categorical explanatory variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test).

The p-value of 2.2e-16 from the ANOVA F-test suggests that there is a significant difference between the means of number of units for public vs private access locations. The linear model summary also suggests a significant difference between public vs private with a t-test p-value of <2e-16. The box plots show that although the mean for private is higher, there is a lot of overlap in values between the boxes of the plots. The first plot shows the full range of data, and the second cuts off at 100 EV units to make the boxes more visible to interpret.

m2 <- lm(ZIP.Number.Units~Access.Code, data=stations_income_df2)
summary(m2)

## 
## Call:
## lm(formula = ZIP.Number.Units ~ Access.Code, data = stations_income_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24.30 -12.07  -8.07   0.93 885.93 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         14.0665     0.3349   42.00   <2e-16 ***
## Access.Codeprivate  11.2285     0.7413   15.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.05 on 11507 degrees of freedom
## Multiple R-squared:  0.01955,    Adjusted R-squared:  0.01946 
## F-statistic: 229.4 on 1 and 11507 DF,  p-value: < 2.2e-16

# Visualize the model
library(tidyverse)
ggplot(stations_income_df2, aes(y=ZIP.Number.Units, x=Access.Code, fill=Access.Code))+
  geom_boxplot()

# zooming in to see boxplots more clearly
ggplot(stations_income_df2, aes(y=ZIP.Number.Units, x=Access.Code, fill=Access.Code))+
  geom_boxplot()+
  scale_y_continuous(limits = c(0, 100))

## Warning: Removed 234 rows containing non-finite values (stat_boxplot).

# ANOVA output
anova(m2)

## Analysis of Variance Table
## 
## Response: ZIP.Number.Units
##                Df   Sum Sq Mean Sq F value    Pr(>F)    
## Access.Code     1   235715  235715  229.41 < 2.2e-16 ***
## Residuals   11507 11823027    1027                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

E) (15 points)

Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels?

Public: y-hat = 6.680e+00 + 9.014e-04x

Private: y-hat = (6.680e+00 + 9.783e+00) + 9.014e-04x

Zip codes where stations are private have a higher y-intercept in the model, meaning at any household retirement income level on the graph we can generally expect a slightly higher number of stations when they are private. It is also important to note that there are potential outliers - three public access code points on the graph that are higher than all the others.

The scatter plot below shows data with the parallel lines from the model overlaid.

#### Second Model: Parallel lines

m3 <- lm(ZIP.Number.Units~Access.Code+household, data=stations_income_df2)
summary(m3)

## 
## Call:
## lm(formula = ZIP.Number.Units ~ Access.Code + household, data = stations_income_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.29 -11.73  -6.13   0.92 881.96 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.601e+00  5.244e-01   12.59   <2e-16 ***
## Access.Codeprivate 9.700e+00  7.477e-01   12.97   <2e-16 ***
## household          8.657e-04  4.595e-05   18.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.68 on 11229 degrees of freedom
##   (277 observations deleted due to missingness)
## Multiple R-squared:  0.04982,    Adjusted R-squared:  0.04966 
## F-statistic: 294.4 on 2 and 11229 DF,  p-value: < 2.2e-16

# VIZ: Shifts in intercept
ggplot(stations_income_df2, aes(x=household, y=ZIP.Number.Units, color=Access.Code))+
  geom_point()+
  geom_abline(intercept = m3$coefficients[1], slope=m3$coefficients[3],
              color="red", lwd=1)+
  geom_abline(intercept = m3$coefficients[1]+m3$coefficients[2], slope=m3$coefficients[3],
             color="blue", lwd=1)

## Warning: Removed 277 rows containing missing values (geom_point).

# ANOVA
anova(m3)

## Analysis of Variance Table
## 
## Response: ZIP.Number.Units
##                Df   Sum Sq Mean Sq F value    Pr(>F)    
## Access.Code     1   234619  234619  233.82 < 2.2e-16 ***
## household       1   356208  356208  355.00 < 2.2e-16 ***
## Residuals   11229 11267365    1003                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F) (15 points)

Finally, fit a multiple linear model that also includes the interaction between the numeric and categorical variables, which allows for different slopes.

What are the estimated models for the different levels?

Public: y-hat = 6.835e+00 + 8.837e-04x

Private: y-hat = (6.835e+00 + 9.783e+00) + (8.837e-04 + 8.573e-05)x

Below is a graphic of the scatter plot with lines overlaid for each level.

#### Third Model: Interaction 

m4 <- lm(ZIP.Number.Units~Access.Code*household, data=stations_income_df2)
summary(m4)

## 
## Call:
## lm(formula = ZIP.Number.Units ~ Access.Code * household, data = stations_income_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44.04 -11.66  -6.21   0.84 882.05 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.786e+00  5.634e-01  12.045  < 2e-16 ***
## Access.Codeprivate           8.651e+00  1.387e+00   6.236 4.66e-10 ***
## household                    8.446e-04  5.158e-05  16.374  < 2e-16 ***
## Access.Codeprivate:household 1.018e-04  1.135e-04   0.897     0.37    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.68 on 11228 degrees of freedom
##   (277 observations deleted due to missingness)
## Multiple R-squared:  0.04989,    Adjusted R-squared:  0.04964 
## F-statistic: 196.5 on 3 and 11228 DF,  p-value: < 2.2e-16

# VIZ: Shift intercepts and slopes
ggplot(stations_income_df2, aes(x=household, y=ZIP.Number.Units, color=Access.Code))+
  geom_point()+
  geom_abline(intercept = m4$coefficients[1], slope=m4$coefficients[3],
              color="red", lwd=1)+
  geom_abline(intercept = m4$coefficients[1]+m4$coefficients[2], slope=m4$coefficients[3] + m4$coefficients[4],
             color="blue", lwd=1)

## Warning: Removed 277 rows containing missing values (geom_point).

# ANOVA
anova(m4)

## Analysis of Variance Table
## 
## Response: ZIP.Number.Units
##                          Df   Sum Sq Mean Sq  F value Pr(>F)    
## Access.Code               1   234619  234619 233.8159 <2e-16 ***
## household                 1   356208  356208 354.9890 <2e-16 ***
## Access.Code:household     1      808     808   0.8048 0.3697    
## Residuals             11228 11266557    1003                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

G) (15 points)

Compare the models from parts (B), (D), (E), and (F).

Calculate the MSEs The numbers below are the MSEs for the four models, pulled from the ANOVA tables.

#m1 mse
1050

## [1] 1050

#m2 mse
1061

## [1] 1061

#3 mse
1035

## [1] 1035

#4 mse
1035

## [1] 1035

Discuss model differences

The linear model calculations show that there is a significant association between income and number of stations (part B), and a significant difference in the means between public vs private access and number of stations (D). The parallel lines model also yielded significant results for both income and access code type (E). However, the linear model with interaction between access and income did not improve the model (F). The interaction did not produce a significant p-value in the model, and therefore it does not significantly improve the model to include the interaction.

H) (15 points)

Conclusion:

The last two models (parallel lines and interaction) have basically the same MSE, 1035. However, the interaction is not significant (p-value = 0.45) in the ANOVA for the fourth model (different intercepts and slopes). This suggests that the best model out of the four we fit is model three, the parallel lines model, since this model has the lowest MSE while still keeping only significant variables in the model.

Although the models produce significant p-values, there may be some issues with doubled up data points where the station(s) in a particular zip code were listed as both public and private, creating overlapping data points with the same zip code and household retirement income but different access codes. What we should have done is to re-level the access code variable when we aggregated it to the zip code level. In doing so we would have made three levels as options for each zip code - all private, all public, and mixed. This would remove the overlapping data issue and make our p-values more useful for interpretation.

We came in hypothesizing that as income increased the number of EV units would also increase. Technically our models and tests say there is a significant positive association between these variables. However, the slope of the line is not very steep so there does not seem to be a very dramatic increase in number of units as retirement income increases. Also, the data do not clearly show a strong linear relationship, so running linear models may not be the best way to understand the true relationship between retirement income and EV units by zip code. From our past analysis of these data it seems that perhaps retirement income doesn’t matter as much to having EV unit access as being near EV hubs such as areas in CA with very high numbers of EV units.

Milestone 4

Alena Langford and Kelson Gipp

3/15/2022