Handout_11_Regression

# read the CSV with headers
regression1 <- read.csv("incidents (5).csv", header = TRUE, sep = ",")
head(regression1)

# Convert population to numeric (strip the thousands separators first)
regression1$population <- as.numeric(gsub(",","",regression1$population))
regression1$population

##  [1]  107353  326534  444752  750000   64403 2744878 1600000 2333000 1572816
## [10]  712091 6900000 2700000 4900000 4200000 5200000 7100000
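
The comma stripping matters because as.numeric() cannot parse a string that contains thousands separators. The sketch below illustrates this with two strings, assuming the raw CSV column is formatted with commas (the variable names `raw`, `direct`, and `cleaned` are illustrative, not from the handout):

```r
# Hypothetical raw strings, formatted the way the CSV column is assumed to be
raw <- c("107,353", "2,744,878")

# Direct conversion fails: each string becomes NA (with a coercion warning)
direct <- suppressWarnings(as.numeric(raw))

# Removing the commas first gives usable numbers
cleaned <- as.numeric(gsub(",", "", raw))

print(direct)   # NA NA
print(cleaned)  # 107353 2744878
```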

head(regression1)

summary(regression1)

##      area               zone             population        incidents     
##  Length:16          Length:16          Min.   :  64403   Min.   : 103.0  
##  Class :character   Class :character   1st Qu.: 645256   1st Qu.: 277.8  
##  Mode  :character   Mode  :character   Median :1966500   Median : 654.0  
##                                        Mean   :2603489   Mean   : 695.2  
##                                        3rd Qu.:4375000   3rd Qu.: 853.0  
##                                        Max.   :7100000   Max.   :2072.0

str(regression1)

## 'data.frame':    16 obs. of  4 variables:
##  $ area      : chr  "Boulder" "California-lexington" "Huntsville" "Seattle" ...
##  $ zone      : chr  "west" "east" "east" "west" ...
##  $ population: num  107353 326534 444752 750000 64403 ...
##  $ incidents : int  605 103 161 1703 1003 527 721 704 105 403 ...

str(regression1$population)

##  num [1:16] 107353 326534 444752 750000 64403 ...

regression2 <- regression1[, -1]  # new data frame without column 1 (area)

head(regression2)

reg.fit1<-lm(regression1$incidents ~ regression1$population)

summary(reg.fit1)

## 
## Call:
## lm(formula = regression1$incidents ~ regression1$population)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -684.5 -363.5 -156.2  133.9 1164.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            4.749e+02  2.018e+02   2.353   0.0337 *
## regression1$population 8.462e-05  5.804e-05   1.458   0.1669  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 534.9 on 14 degrees of freedom
## Multiple R-squared:  0.1318, Adjusted R-squared:  0.0698 
## F-statistic: 2.126 on 1 and 14 DF,  p-value: 0.1669

Based on the output obtained above, please answer the following question:

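Is Population significant at a 5% significance level? What is the adjusted R-squared of the model?

At a 5% significance level, population is not statistically significant (p = 0.1669 > 0.05). The Adjusted R-squared is 0.0698.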
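
As a rough check on how the numbers in the coefficient table relate to each other, the sketch below recomputes the t value, the two-sided p-value, and the F-statistic from the rounded estimate and standard error printed by summary(reg.fit1). Because the inputs are rounded, the results should be close to the printed values rather than identical.

```r
# Values copied from the summary(reg.fit1) output
estimate  <- 8.462e-05  # slope for population
std_error <- 5.804e-05  # its standard error
df_resid  <- 14         # residual degrees of freedom

t_value <- estimate / std_error              # t statistic for the slope
p_value <- 2 * pt(-abs(t_value), df_resid)   # two-sided p-value
f_value <- t_value^2                         # with one predictor, F equals t squared

round(c(t = t_value, p = p_value, F = f_value), 4)
```

With a single predictor the F-test and the slope's t-test are the same test, which is why both report a p-value of 0.1669.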

reg.fit2<-lm(incidents ~ zone+population, data = regression1)

summary(reg.fit2)

## 
## Call:
## lm(formula = incidents ~ zone + population, data = regression1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.21 -273.14  -57.89  188.17  766.03 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.612e+02  1.675e+02   0.962  0.35363   
## zonewest    7.266e+02  1.938e+02   3.749  0.00243 **
## population  6.557e-05  4.206e-05   1.559  0.14300   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 384.8 on 13 degrees of freedom
## Multiple R-squared:  0.5828, Adjusted R-squared:  0.5186 
## F-statistic: 9.081 on 2 and 13 DF,  p-value: 0.003404

Are Population and/or Zone significant at a 5% significance level? What is the adjusted R-squared of the model?

At a 5% significance level, zone is statistically significant (p = 0.00243) while population is not (p = 0.14300). The Adjusted R-squared is 0.5186.

regression1$zone <- ifelse(regression1$zone == "west", 1, 0)#Please explain the syntax and the output

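ifelse() tests each element of zone: it returns 1 where the zone is "west" and 0 otherwise, and the result overwrites the zone column. The str() output below confirms that zone is now numeric (0/1) instead of character.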
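
A minimal sketch of the same recoding on a short vector, using the first five zone values implied by the str() output. The names `zone` and `dummy` are illustrative only.

```r
# First five zone labels, as implied by the str() output
zone <- c("west", "east", "east", "west", "west")

# ifelse() is vectorized: it evaluates the test for every element
dummy <- ifelse(zone == "west", 1, 0)
print(dummy)  # 1 0 0 1 1

# An equivalent approach converts the logical test directly
same <- identical(dummy, as.numeric(zone == "west"))
print(same)   # TRUE
```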

str(regression1)

## 'data.frame':    16 obs. of  4 variables:
##  $ area      : chr  "Boulder" "California-lexington" "Huntsville" "Seattle" ...
##  $ zone      : num  1 0 0 1 1 0 1 1 0 0 ...
##  $ population: num  107353 326534 444752 750000 64403 ...
##  $ incidents : int  605 103 161 1703 1003 527 721 704 105 403 ...

# regression1$zone <- as.integer(regression1$zone) was not necessary: ifelse() already returns numeric 0/1 values

interaction<-regression1$zone*regression1$population#Explain the syntax

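The above syntax multiplies the zone dummy by population element by element and stores the result in a vector named interaction. The vector equals the population for west areas (zone = 1) and 0 for all other areas, so it lets the effect of population differ between zones.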
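
R's formula syntax can build the same product term automatically: `zone * population` expands to both main effects plus their interaction. The sketch below checks this on a small simulated data set. The data frame `toy` and the column `zone_x_pop` are hypothetical and are not the handout's data; a separate name is used because `interaction` is also the name of a base R function.

```r
set.seed(1)

# Hypothetical toy data: 4 east (0) and 4 west (1) areas
toy <- data.frame(
  zone       = rep(c(0, 1), each = 4),
  population = c(1, 2, 3, 4, 1, 2, 3, 4) * 1e5
)
toy$incidents <- 100 + 300 * toy$zone + 0.001 * toy$population +
  rnorm(8, sd = 20)

# Manual product term, as in the handout
toy$zone_x_pop <- toy$zone * toy$population
fit_manual <- lm(incidents ~ zone + population + zone_x_pop, data = toy)

# Formula shorthand: main effects plus interaction
fit_formula <- lm(incidents ~ zone * population, data = toy)

# Both fits should give the same coefficients (names differ, values match)
coef_match <- all.equal(unname(coef(fit_manual)), unname(coef(fit_formula)))
print(coef_match)
```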

reg.fit3<-lm(regression1$incidents~interaction+regression1$population+regression1$zone)

summary(reg.fit3)

## 
## Call:
## lm(formula = regression1$incidents ~ interaction + regression1$population + 
##     regression1$zone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -540.91 -270.93  -59.56  187.99  767.99 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            1.659e+02  2.313e+02   0.717   0.4869  
## interaction            2.974e-06  9.469e-05   0.031   0.9755  
## regression1$population 6.352e-05  7.868e-05   0.807   0.4352  
## regression1$zone       7.192e+02  3.108e+02   2.314   0.0392 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 400.5 on 12 degrees of freedom
## Multiple R-squared:  0.5829, Adjusted R-squared:  0.4786 
## F-statistic: 5.589 on 3 and 12 DF,  p-value: 0.01237

Based on the output obtained above, please answer the following question:

Is Population significant at a 5% significance level? Is Zone significant at a 5% significance level? Is the interaction term significant at a 5% significance level? What is the adjusted R-squared of the model?

At a 5% significance level, only zone is statistically significant (p = 0.0392); population (p = 0.4352) and the interaction term (p = 0.9755) are not. The Adjusted R-squared is 0.4786.
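
One way to read the interaction model is as two separate lines, one per zone. The sketch below derives each line's intercept and slope from the rounded coefficients printed by summary(reg.fit3), assuming the dummy coding 1 = west and 0 = east. The variable names are illustrative.

```r
# Coefficients copied from the summary(reg.fit3) output
b0     <- 1.659e+02  # intercept
b_int  <- 2.974e-06  # interaction (zone * population)
b_pop  <- 6.352e-05  # population
b_zone <- 7.192e+02  # zone (1 = west, 0 = east)

# With zone = 0 the zone and interaction terms drop out
east <- c(intercept = b0, slope = b_pop)

# With zone = 1 they shift the intercept and the slope
west <- c(intercept = b0 + b_zone, slope = b_pop + b_int)

rbind(east, west)
```

The two slopes differ only by the interaction coefficient (2.974e-06), which is small relative to its standard error. This is consistent with the interaction term being far from significant.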

Let us now run a model where the only feature is the interaction term.

reg.fit4<-lm(incidents ~ interaction, data = regression1)
summary(reg.fit4)

## 
## Call:
## lm(formula = incidents ~ interaction, data = regression1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -650.28 -301.09  -83.71  123.23 1103.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 4.951e+02  1.320e+02   3.751  0.00215 **
## interaction 1.389e-04  4.737e-05   2.932  0.01093 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 451.9 on 14 degrees of freedom
## Multiple R-squared:  0.3804, Adjusted R-squared:  0.3361 
## F-statistic: 8.595 on 1 and 14 DF,  p-value: 0.01093

Is the interaction term significant at a 5% significance level? What is the adjusted R-squared of the model?

At a 5% significance level, the interaction term is significant (p = 0.01093). The Adjusted R-squared is 0.3361.

Which of the models run above would you choose to make predictions? Why?

Conclusion

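Of the four models, the second, which regresses incidents on zone and population, is the best choice for prediction. It has the highest Adjusted R-squared (0.5186) and the lowest residual standard error (384.8), and zone is significant at the 1% level (p = 0.00243). The first model fits poorly (Adjusted R-squared 0.0698), and its only predictor, population, is not significant. The third model adds the interaction term but barely changes the R-squared (0.5829 versus 0.5828); the interaction is not significant (p = 0.9755), and the Adjusted R-squared drops to 0.4786. The fourth model has a significant interaction term (p = 0.01093), but its Adjusted R-squared (0.3361) is well below that of the second model.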
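
The Adjusted R-squared values used in this comparison can be reproduced from the reported Multiple R-squared, the sample size (n = 16), and the number of predictors in each model. The sketch below does this with a small helper; `adj_r2` and `models` are illustrative names, and the inputs are the rounded values from the outputs above.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

models <- data.frame(
  model = c("reg.fit1", "reg.fit2", "reg.fit3", "reg.fit4"),
  r2    = c(0.1318, 0.5828, 0.5829, 0.3804),  # Multiple R-squared, as reported
  p     = c(1, 2, 3, 1)                       # predictors, excluding the intercept
)

models$adj_r2 <- round(adj_r2(models$r2, n = 16, p = models$p), 4)
print(models)
```

The penalty for extra predictors explains why reg.fit3 scores lower than reg.fit2 even though its Multiple R-squared is marginally higher.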