# read the CSV with headers
regression1 <- read.csv("incidents (5).csv", header = TRUE, sep = ",")
head(regression1)
# Convert population to numeric (strip thousands separators first)
regression1$population <- as.numeric(gsub(",","",regression1$population))
regression1$population
## [1] 107353 326534 444752 750000 64403 2744878 1600000 2333000 1572816
## [10] 712091 6900000 2700000 4900000 4200000 5200000 7100000
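The comma-stripping step can be checked in isolation on a small vector. The sketch below uses made-up strings in a hypothetical `raw` vector (not values read from the incidents file) to show why `gsub()` has to run before `as.numeric()`.

```r
# Hypothetical comma-formatted strings, as they might appear in a CSV column
raw <- c("107,353", "2,744,878", "64403")

# as.numeric() alone would return NA (with a warning) for entries that contain commas,
# so remove every comma first and convert afterwards
clean <- as.numeric(gsub(",", "", raw))

print(clean)
print(is.numeric(clean))
```

Checking `anyNA()` on the converted column afterwards is a cheap way to catch entries that still failed to parse.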
head(regression1)
summary(regression1)
##      area               zone             population        incidents
##  Length:16          Length:16          Min.   :  64403   Min.   : 103.0
##  Class :character   Class :character   1st Qu.: 645256   1st Qu.: 277.8
##  Mode  :character   Mode  :character   Median :1966500   Median : 654.0
##                                        Mean   :2603489   Mean   : 695.2
##                                        3rd Qu.:4375000   3rd Qu.: 853.0
##                                        Max.   :7100000   Max.   :2072.0
str(regression1)
## 'data.frame': 16 obs. of 4 variables:
## $ area : chr "Boulder" "California-lexington" "Huntsville" "Seattle" ...
## $ zone : chr "west" "east" "east" "west" ...
## $ population: num 107353 326534 444752 750000 64403 ...
## $ incidents : int 605 103 161 1703 1003 527 721 704 105 403 ...
str(regression1$population)
## num [1:16] 107353 326534 444752 750000 64403 ...
regression2 <- regression1[, -1]  # new data frame without column 1 (area)
head(regression2)
reg.fit1<-lm(regression1$incidents ~ regression1$population)
summary(reg.fit1)
##
## Call:
## lm(formula = regression1$incidents ~ regression1$population)
##
## Residuals:
## Min 1Q Median 3Q Max
## -684.5 -363.5 -156.2 133.9 1164.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.749e+02 2.018e+02 2.353 0.0337 *
## regression1$population 8.462e-05 5.804e-05 1.458 0.1669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 534.9 on 14 degrees of freedom
## Multiple R-squared: 0.1318, Adjusted R-squared: 0.0698
## F-statistic: 2.126 on 1 and 14 DF, p-value: 0.1669
Based on the output obtained above, please answer the following question:
Is Population significant at a 5% significance level? What is the adjusted-R squared of the model?
reg.fit2<-lm(incidents ~ zone+population, data = regression1)
summary(reg.fit2)
##
## Call:
## lm(formula = incidents ~ zone + population, data = regression1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -537.21 -273.14 -57.89 188.17 766.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.612e+02 1.675e+02 0.962 0.35363
## zonewest 7.266e+02 1.938e+02 3.749 0.00243 **
## population 6.557e-05 4.206e-05 1.559 0.14300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 384.8 on 13 degrees of freedom
## Multiple R-squared: 0.5828, Adjusted R-squared: 0.5186
## F-statistic: 9.081 on 2 and 13 DF, p-value: 0.003404
At the 5% significance level, zone is statistically significant (p = 0.00243) while population is not (p = 0.143). The adjusted R-squared is 0.5186.
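The same reading of the summary table can be done programmatically. This is a minimal sketch on simulated data (the data frame `d` and its columns `x1`, `x2`, `y` are hypothetical stand-ins, not the incidents data); it pulls the p-value column and the adjusted R-squared out of the `summary()` object.

```r
set.seed(42)
# Simulated data for illustration only
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 + rnorm(30)

fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

# coef(s) is the coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
p_values <- coef(s)[, "Pr(>|t|)"]

# One logical flag per coefficient: TRUE when significant at the 5% level
significant <- p_values < 0.05

print(round(p_values, 4))
print(significant)
print(s$adj.r.squared)
```

Applied to the models in this report, the same calls would be run on `reg.fit1`, `reg.fit2`, and so on instead of `fit`.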
regression1$zone <- ifelse(regression1$zone == "west", 1, 0)  # Please explain the syntax and the output
The code above recodes zone as a numeric dummy variable: 1 when the zone is "west" and 0 for any other zone.
str(regression1)
## 'data.frame': 16 obs. of 4 variables:
## $ area : chr "Boulder" "California-lexington" "Huntsville" "Seattle" ...
## $ zone : num 1 0 0 1 1 0 1 1 0 0 ...
## $ population: num 107353 326534 444752 750000 64403 ...
## $ incidents : int 605 103 161 1703 1003 527 721 704 105 403 ...
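The 0/1 recoding is the same coding that `lm()` builds on its own for a character predictor, which is why the earlier model reported a `zonewest` coefficient. A small sketch, using a hypothetical four-element `zone` vector rather than the real column, compares the manual dummy with the design-matrix column from `model.matrix()`.

```r
# Hypothetical zone labels for illustration
zone <- c("west", "east", "east", "west")

# Manual dummy coding with ifelse(): 1 for "west", 0 otherwise
zone_dummy <- ifelse(zone == "west", 1, 0)

# Design matrix that lm() would build for a character/factor predictor:
# treatment contrasts, with the alphabetically first level ("east") as baseline
mm <- model.matrix(~ zone)

print(zone_dummy)
print(mm[, "zonewest"])
```

Both printed vectors should carry the same 0/1 pattern, so the manual recoding changes the column type but not the fitted model.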
#regression1$zone<-as.integer((regression1$zone),replace=TRUE) was not necessary
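interaction <- regression1$zone * regression1$population  # Explain the syntax
The code above creates a vector holding the element-wise product of zone and population, which serves as the interaction term. Because zone is now coded 0/1, the product equals population for west-zone rows and 0 for all other rows.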
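R's formula syntax can also build the product term itself. The sketch below runs on simulated data (the data frame `d`, its coefficients, and the column name `zone_pop` are assumptions for illustration) and compares a hand-made product column with the `zone * population` shorthand.

```r
set.seed(1)
# Simulated data, purely illustrative (not the incidents data)
d <- data.frame(
  zone = rep(c(0, 1), each = 8),
  population = runif(16, 5e4, 7e6)
)
d$incidents <- 200 + 700 * d$zone + 6e-05 * d$population + rnorm(16, sd = 300)

# Manual product term; named zone_pop so it does not mask base::interaction()
d$zone_pop <- d$zone * d$population
fit_manual <- lm(incidents ~ zone + population + zone_pop, data = d)

# Formula shorthand: zone * population expands to zone + population + zone:population
fit_formula <- lm(incidents ~ zone * population, data = d)

# Both parameterisations describe the same model, so fitted values should agree
print(all.equal(unname(fitted(fit_manual)), unname(fitted(fit_formula))))
```

The shorthand keeps the main effects and the interaction in one formula and avoids a free-standing vector that has to stay aligned with the data frame rows.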
reg.fit3<-lm(regression1$incidents~interaction+regression1$population+regression1$zone)
summary(reg.fit3)
##
## Call:
## lm(formula = regression1$incidents ~ interaction + regression1$population +
## regression1$zone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -540.91 -270.93 -59.56 187.99 767.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.659e+02 2.313e+02 0.717 0.4869
## interaction 2.974e-06 9.469e-05 0.031 0.9755
## regression1$population 6.352e-05 7.868e-05 0.807 0.4352
## regression1$zone 7.192e+02 3.108e+02 2.314 0.0392 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 400.5 on 12 degrees of freedom
## Multiple R-squared: 0.5829, Adjusted R-squared: 0.4786
## F-statistic: 5.589 on 3 and 12 DF, p-value: 0.01237
Based on the output obtained above, please answer the following question:
Is Population significant at a 5% significance level? Is Zone significant at a 5% significance level? Is the interaction term significant at a 5% significance level? What is the adjusted-R squared of the model?
At the 5% significance level, only zone is statistically significant (p = 0.0392); population (p = 0.4352) and the interaction term (p = 0.9755) are not. The adjusted R-squared is 0.4786.
Let us now run a model where the only feature is the interaction term.
reg.fit4<-lm(incidents ~ interaction, data = regression1)
summary(reg.fit4)
##
## Call:
## lm(formula = incidents ~ interaction, data = regression1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -650.28 -301.09 -83.71 123.23 1103.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.951e+02 1.320e+02 3.751 0.00215 **
## interaction 1.389e-04 4.737e-05 2.932 0.01093 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 451.9 on 14 degrees of freedom
## Multiple R-squared: 0.3804, Adjusted R-squared: 0.3361
## F-statistic: 8.595 on 1 and 14 DF, p-value: 0.01093
Is the interaction term significant at a 5% significance level? What is the adjusted-R squared of the model?
At the 5% significance level, the interaction term is significant (p = 0.01093), and the adjusted R-squared is 0.3361.
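Which of the models run above would you choose to make predictions? Why?
Of the four models, the second (incidents regressed on zone and population) is the best choice for prediction. It has the highest adjusted R-squared (0.5186) and the lowest residual standard error (384.8), and zone is significant at the 1% level (p = 0.00243). The first model has poor explanatory power (adjusted R-squared 0.0698) and its only predictor, population, is not significant. The third model adds an interaction term that contributes almost nothing (p = 0.9755) and lowers the adjusted R-squared to 0.4786. The fourth model, with the interaction term alone, has a significant predictor but explains less of the variation (adjusted R-squared 0.3361).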
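A side-by-side comparison like the one above can be assembled in code. This is a rough sketch on simulated data (the data frame `d` and the list names `pop_only`, `zone_pop`, `with_inter` are hypothetical); it collects the adjusted R-squared and residual standard error per model and runs an F-test on the two nested models.

```r
set.seed(7)
# Simulated stand-in for the incidents data (illustrative only)
d <- data.frame(zone = rep(c(0, 1), each = 8),
                population = runif(16, 5e4, 7e6))
d$incidents <- 200 + 700 * d$zone + 6e-05 * d$population + rnorm(16, sd = 300)

fits <- list(
  pop_only   = lm(incidents ~ population, data = d),
  zone_pop   = lm(incidents ~ zone + population, data = d),
  with_inter = lm(incidents ~ zone * population, data = d)
)

# One row per model: adjusted R-squared (higher is better)
# and residual standard error (lower is better)
comparison <- data.frame(
  adj_r2    = sapply(fits, function(f) summary(f)$adj.r.squared),
  resid_se  = sapply(fits, function(f) summary(f)$sigma)
)
print(comparison)

# F-test for nested models: does adding the interaction improve on zone + population?
print(anova(fits$zone_pop, fits$with_inter))
```

With the real data, the list would hold `reg.fit1` through `reg.fit4`; the `anova()` comparison only applies to pairs where one model nests the other, such as `reg.fit2` and `reg.fit3`.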