Activity 11: Handout

getwd()

## [1] "/cloud/project"

# make sure the packages for this chapter
# are installed, install if necessary
pkg <- c("ggplot2", "scales", "maptools",
              "sp", "maps", "grid", "car" )
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg)) {
  install.packages(new.pkg)  
}

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

## Warning: package 'maptools' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

# read the CSV with headers
regression1 <- read.csv("incidents.csv", header = TRUE, sep = ",")

regression1

summary(regression1)

##      area               zone            population          incidents     
##  Length:16          Length:16          Length:16          Min.   : 103.0  
##  Class :character   Class :character   Class :character   1st Qu.: 277.8  
##  Mode  :character   Mode  :character   Mode  :character   Median : 654.0  
##                                                           Mean   : 695.2  
##                                                           3rd Qu.: 853.0  
##                                                           Max.   :2072.0

str(regression1)

## 'data.frame':    16 obs. of  4 variables:
##  $ area      : chr  "Boulder" "California-lexington" "Huntsville" "Seattle" ...
##  $ zone      : chr  "west" "east" "east" "west" ...
##  $ population: chr  "107,353" "326,534" "444,752" "750,000" ...
##  $ incidents : int  605 103 161 1703 1003 527 721 704 105 403 ...

regression1$population <- as.numeric(gsub(",","",regression1$population))
regression1$population

##  [1]  107353  326534  444752  750000   64403 2744878 1600000 2333000 1572816
## [10]  712091 6900000 2700000 4900000 4200000 5200000 7100000

str(regression1$population)

##  num [1:16] 107353 326534 444752 750000 64403 ...
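The conversion above is needed because the commas make R read population as text. A minimal sketch of the same step on three values copied from the output above (the vector names pop_chr, bad, and pop_num are illustrative, not part of the handout):

```r
# Three population strings in the comma-formatted style shown by str()
pop_chr <- c("107,353", "326,534", "750,000")

# as.numeric() cannot parse strings that contain commas, so it returns NA
bad <- suppressWarnings(as.numeric(pop_chr))

# Remove every comma first, then convert to numeric
pop_num <- as.numeric(gsub(",", "", pop_chr, fixed = TRUE))

print(bad)
print(pop_num)
```

Here fixed = TRUE treats the comma as a literal character rather than a regular expression; for a plain comma both forms behave the same.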

regression2 <- regression1[, -1]  # new data frame without column 1 (area)

head(regression2)

reg.fit1<-lm(regression1$incidents ~ regression1$population)

summary(reg.fit1)

## 
## Call:
## lm(formula = regression1$incidents ~ regression1$population)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -684.5 -363.5 -156.2  133.9 1164.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            4.749e+02  2.018e+02   2.353   0.0337 *
## regression1$population 8.462e-05  5.804e-05   1.458   0.1669  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 534.9 on 14 degrees of freedom
## Multiple R-squared:  0.1318, Adjusted R-squared:  0.0698 
## F-statistic: 2.126 on 1 and 14 DF,  p-value: 0.1669

Based on the output obtained above, please answer the following question:

Is Population significant at a 5% significance level? No. Its p-value is 0.1669, which is greater than 0.05, so population is not statistically significant at the 5% level.

What is the adjusted-R squared of the model? 0.0698
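The p-value and adjusted R-squared can also be pulled out of the summary object instead of being read off the printout. A minimal sketch, assuming the built-in cars dataset as a stand-in because incidents.csv is not reproduced here (the names fit, s, p_speed, and adj_r2 are illustrative):

```r
# Stand-in model on the built-in 'cars' data (not the incidents data)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# The coefficient table is a matrix; rows are terms, columns are statistics
p_speed <- s$coefficients["speed", "Pr(>|t|)"]

# Adjusted R-squared is stored as a single number
adj_r2 <- s$adj.r.squared

# Compare the p-value with the 5% significance level
significant <- p_speed < 0.05

print(p_speed)
print(adj_r2)
print(significant)
```

For the handout's model, the same two lookups on summary(reg.fit1) would use the row name "regression1$population".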

reg.fit2<-lm(incidents ~ zone+population, data = regression1)

summary(reg.fit2)

## 
## Call:
## lm(formula = incidents ~ zone + population, data = regression1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -537.21 -273.14  -57.89  188.17  766.03 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.612e+02  1.675e+02   0.962  0.35363   
## zonewest    7.266e+02  1.938e+02   3.749  0.00243 **
## population  6.557e-05  4.206e-05   1.559  0.14300   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 384.8 on 13 degrees of freedom
## Multiple R-squared:  0.5828, Adjusted R-squared:  0.5186 
## F-statistic: 9.081 on 2 and 13 DF,  p-value: 0.003404

Based on the output obtained above, please answer the following question:

Are Population and/or Zone significant at a 5% significance level? Zone (zonewest) is statistically significant because its p-value is less than alpha (0.00243 < 0.05). Population is not statistically significant because its p-value is greater than alpha (0.143 > 0.05).

What is the adjusted-R squared of the model? 0.5186
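The coefficient is labelled zonewest because lm() turns the character column zone into a 0/1 dummy variable, with the alphabetically first level (east) as the baseline. A minimal sketch using the first four rows shown by str() above (the names toy and mm are illustrative):

```r
# First four rows of zone and population, as shown in the str() output
toy <- data.frame(
  zone = c("west", "east", "east", "west"),
  population = c(107353, 326534, 444752, 750000)
)

# model.matrix() builds the design matrix that lm() would use
mm <- model.matrix(~ zone + population, data = toy)

print(colnames(mm))
print(mm[, "zonewest"])
```

The zonewest column is 1 for west rows and 0 for east rows, so its coefficient estimates the difference in incidents between west and east at the same population.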

regression1$zone <- ifelse(regression1$zone == "west", 1, 0)#Please explain the syntax and the output

regression1

str(regression1)

## 'data.frame':    16 obs. of  4 variables:
##  $ area      : chr  "Boulder" "California-lexington" "Huntsville" "Seattle" ...
##  $ zone      : num  1 0 0 1 1 0 1 1 0 0 ...
##  $ population: num  107353 326534 444752 750000 64403 ...
##  $ incidents : int  605 103 161 1703 1003 527 721 704 105 403 ...

#regression1$zone<-as.integer((regression1$zone),replace=TRUE) was not necessary
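On the syntax of the ifelse() call above: ifelse(test, yes, no) checks the test for every element and returns yes where it is TRUE and no where it is FALSE. A minimal sketch on four zone values taken from the earlier str() output (the names zone_chr and zone_num are illustrative):

```r
# Zone labels for the first four rows of the data
zone_chr <- c("west", "east", "east", "west")

# The comparison produces a logical vector: TRUE FALSE FALSE TRUE
is_west <- zone_chr == "west"

# ifelse() maps TRUE to 1 and FALSE to 0, element by element
zone_num <- ifelse(is_west, 1, 0)

print(zone_num)
print(class(zone_num))
```

The result is already numeric, which is why str() reports zone as num and no further conversion is required.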

interaction<-regression1$zone*regression1$population#Explain the syntax
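On the syntax of the line above: `*` multiplies the two columns element by element, so the product equals population for west rows (zone = 1) and 0 for east rows (zone = 0). A minimal sketch on the first six rows shown in the earlier output, which also checks that the formula shorthand population * zone gives the same fit (the names d, inter, fit_manual, fit_formula, and same_fit are illustrative):

```r
# First six rows of the data, taken from the outputs above
d <- data.frame(
  incidents  = c(605, 103, 161, 1703, 1003, 527),
  population = c(107353, 326534, 444752, 750000, 64403, 2744878),
  zone       = c(1, 0, 0, 1, 1, 0)
)

# Element-wise product: population where zone is 1, zero where zone is 0
inter <- d$zone * d$population

# Manual interaction column versus the formula shorthand.
# In a formula, population * zone expands to
# population + zone + population:zone
fit_manual  <- lm(incidents ~ inter + population + zone, data = d)
fit_formula <- lm(incidents ~ population * zone, data = d)

# Both specifications describe the same model, so the fitted values agree
same_fit <- isTRUE(all.equal(unname(fitted(fit_manual)),
                             unname(fitted(fit_formula))))

print(inter)
print(same_fit)
```

Using the formula shorthand with data = ... keeps the coefficient names short and avoids a separate interaction vector in the workspace.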

reg.fit3<-lm(regression1$incidents~interaction+regression1$population+regression1$zone)

summary(reg.fit3)

## 
## Call:
## lm(formula = regression1$incidents ~ interaction + regression1$population + 
##     regression1$zone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -540.91 -270.93  -59.56  187.99  767.99 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            1.659e+02  2.313e+02   0.717   0.4869  
## interaction            2.974e-06  9.469e-05   0.031   0.9755  
## regression1$population 6.352e-05  7.868e-05   0.807   0.4352  
## regression1$zone       7.192e+02  3.108e+02   2.314   0.0392 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 400.5 on 12 degrees of freedom
## Multiple R-squared:  0.5829, Adjusted R-squared:  0.4786 
## F-statistic: 5.589 on 3 and 12 DF,  p-value: 0.01237

Based on the output obtained above, please answer the following question:

Is Population significant at a 5% significance level? No. Its p-value is 0.4352, which is greater than 0.05.

Is Zone significant at a 5% significance level? Yes. Its p-value is 0.0392, which is less than 0.05.

Is the interaction term significant at a 5% significance level? No. Its p-value is 0.9755, which is far greater than 0.05.

What is the adjusted-R squared of the model? 0.4786

reg.fit4<-lm(regression1$incidents~interaction)

summary(reg.fit4)

## 
## Call:
## lm(formula = regression1$incidents ~ interaction)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -650.28 -301.09  -83.71  123.23 1103.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 4.951e+02  1.320e+02   3.751  0.00215 **
## interaction 1.389e-04  4.737e-05   2.932  0.01093 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 451.9 on 14 degrees of freedom
## Multiple R-squared:  0.3804, Adjusted R-squared:  0.3361 
## F-statistic: 8.595 on 1 and 14 DF,  p-value: 0.01093

The model above (reg.fit4) uses the interaction term as its only feature.

Is the interaction term significant at a 5% significance level? Yes. Its p-value is 0.01093, which is less than 0.05, so it is statistically significant at the 5% level.

What is the adjusted-R squared of the model? 0.3361

Which of the models run above would you choose to make predictions? Why?

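I would choose the model with zone and population (reg.fit2) because it has the highest adjusted R-squared (0.5186). The model that adds the interaction term (reg.fit3) has a marginally higher multiple R-squared (0.5829 versus 0.5828), but its adjusted R-squared drops to 0.4786 and its interaction term is not significant (p = 0.9755), so the extra term adds complexity without improving the fit.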
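A minimal sketch that tabulates the R-squared values reported in the four summaries above and recomputes the adjusted values from the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), with n = 16 observations and p predictors (the data frame reported and its column names are illustrative):

```r
# Values copied from the summary() outputs above
reported <- data.frame(
  model         = c("reg.fit1", "reg.fit2", "reg.fit3", "reg.fit4"),
  predictors    = c(1, 2, 3, 1),
  r_squared     = c(0.1318, 0.5828, 0.5829, 0.3804),
  adj_r_squared = c(0.0698, 0.5186, 0.4786, 0.3361)
)

n <- 16  # number of observations in the data

# Adjusted R-squared penalises each additional predictor
reported$adj_check <- 1 - (1 - reported$r_squared) * (n - 1) /
  (n - reported$predictors - 1)

# Model with the highest adjusted R-squared
best <- reported$model[which.max(reported$adj_r_squared)]

print(reported)
print(best)
```

Multiple R-squared never decreases when a term is added, so reg.fit3 edges out reg.fit2 on that measure alone; the adjusted value accounts for the extra predictor and ranks reg.fit2 first.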