Let us start by finding our working directory.

getwd()
## [1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance"
# make sure the packages for this chapter
# are installed, install if necessary
#pkg <- c("ggplot2", "scales", "maptools",
#              "sp", "maps", "grid", "car" )
#new.pkg <- pkg[!(pkg %in% installed.packages())]
#if (length(new.pkg)) {
 # install.packages(new.pkg)  
#}

Let us now make sure that the file quiz2.csv has been uploaded into your new project.

final <- read.csv("quiz2.csv", header = TRUE, sep = ",")
#View(final)

After importing quiz2.csv from the working directory and inspecting it with View(), we can see that it has 29 records and 7 columns, with variables covering infections, IP addresses, UFO sightings, population, and area.

Using the final data frame, let us run a model of infections vs ufo2010 (alien visits, according to the UFO data).

summary(lm(infections ~ ufo2010, data= final))
## 
## Call:
## lm(formula = infections ~ ufo2010, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1187.9  -614.3  -514.2  -200.2  6092.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   630.20     305.67   2.062 0.048982 *  
## ufo2010        29.23       7.49   3.903 0.000572 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1503 on 27 degrees of freedom
## Multiple R-squared:  0.3607, Adjusted R-squared:  0.337 
## F-statistic: 15.23 on 1 and 27 DF,  p-value: 0.0005717

Given the output printed above, is ufo2010 significant at the 5% significance level? At the 1% level?

Yes, at both levels: the p-value for ufo2010 is 0.00057, far below 1%. However, R-squared is low at about 36% (adjusted R-squared about 34%), so UFO sightings alone explain only about a third of the variation in infections and do not predict them well.
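
As a quick sanity check on the table above, the t value, p-value, and R-squared for ufo2010 can be recomputed from the printed numbers. This is only a sketch: the variable names are illustrative, and because the inputs are rounded the results may differ from the R output in the last digit.

```r
# Values copied from the printed summary of lm(infections ~ ufo2010)
est <- 29.23   # slope estimate for ufo2010
se  <- 7.49    # its standard error
df  <- 27      # residual degrees of freedom (29 rows - 2 coefficients)

t_val <- est / se                 # t statistic for H0: slope = 0
p_val <- 2 * pt(-abs(t_val), df)  # two-sided p-value

# With a single predictor, F = t^2 and R^2 = F / (F + df)
f_val <- t_val^2
r2    <- f_val / (f_val + df)

print(round(c(t = t_val, p = p_val, F = f_val, R2 = r2), 4))
```

The identity F = t^2 holds only for a model with one predictor, which is why the two tests report the same p-value here.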

Let us now run a model of infections vs every single quantitative variable that is included in the dataset.

summary(lm(infections ~ pop + income + ipaddr + ufo2010, 
           data=final))
## 
## Call:
## lm(formula = infections ~ pop + income + ipaddr + ufo2010, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1575.4  -735.6  -332.8   178.0  6131.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.976e+03  1.696e+03  -1.165    0.255
## pop          1.612e-03  3.110e-03   0.518    0.609
## income       6.079e-02  4.057e-02   1.499    0.147
## ipaddr      -1.585e-03  1.455e-03  -1.089    0.287
## ufo2010      5.438e+01  4.080e+01   1.333    0.195
## 
## Residual standard error: 1496 on 24 degrees of freedom
## Multiple R-squared:  0.4372, Adjusted R-squared:  0.3434 
## F-statistic:  4.66 on 4 and 24 DF,  p-value: 0.006312

Interpret the output. How would you proceed from here in this handout, given the results obtained above?

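The overall F-test is significant (p-value 0.006), so the predictors jointly explain some of the variation in infections. However, none of the individual coefficients is significant at the 5% level: the p-values range from 0.147 (income) to 0.609 (pop). A model that is jointly significant while no single predictor is significant often points to correlated predictors. The adjusted R-squared (0.343) is also barely higher than that of the ufo2010-only model (0.337), so the extra variables add little. A reasonable next step is to check the predictors for collinearity and then simplify the model, for example by removing "income", "ipaddr", and "ufo2010" to see how population alone performs as a predictor of infections.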
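
The difference between the overall F-test and the individual t-tests in the output above can be checked by hand. A minimal sketch, using only the rounded values printed in the summary (variable names are illustrative):

```r
# Values copied from the printed summary of the four-predictor model
f_stat <- 4.66     # F-statistic
k      <- 4        # number of predictors
df_res <- 24       # residual degrees of freedom
r2     <- 0.4372   # multiple R-squared
n      <- df_res + k + 1   # 29 observations

# p-value of the overall F-test (all slopes zero vs. at least one nonzero)
p_overall <- pf(f_stat, k, df_res, lower.tail = FALSE)

# Adjusted R-squared penalises the extra predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Individual p-values from the coefficient table
p_coef <- c(pop = 0.609, income = 0.147, ipaddr = 0.287, ufo2010 = 0.195)

print(round(c(p_overall = p_overall, adj_r2 = adj_r2), 4))
print(p_coef < 0.05)   # which predictors are individually significant at 5%
```

The value 0.006 at the bottom of the summary belongs to the model as a whole, not to any one coefficient.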

#install.packages("carData")

The carData package is a dependency of car; uncomment the line above to install it if needed. We then load the car package below.

library(car) # for the vif() function
## Warning: package 'car' was built under R version 4.2.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.1

Let us explore the variance inflation factor (VIF) of the model to see whether the predictors are highly correlated with one another. Recall that strong correlation between two predictors produces multicollinearity, which inflates the standard errors of the coefficients and makes the individual estimates unreliable.

model <- lm(infections ~ pop + income + ipaddr + ufo2010, 
            data=final)
sqrt(vif(model))
##      pop   income   ipaddr  ufo2010 
## 3.285183 1.395980 5.986325 5.474044

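Because of the sqrt() call, these values are the square roots of the variance inflation factors. A common rule of thumb treats sqrt(VIF) > 2 as a sign of problematic collinearity. Here pop (3.29), ipaddr (5.99), and ufo2010 (5.47) all exceed that threshold; only income (1.40) stays below it.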
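
Since the call above wraps vif() in sqrt(), the printed numbers can be squared to recover the VIFs themselves. The sketch below does that and applies the sqrt(VIF) > 2 rule of thumb; the threshold is a convention rather than a hard rule, and the variable names are illustrative.

```r
# Square roots of the VIFs, copied from the output above
sqrt_vif <- c(pop = 3.285183, income = 1.395980,
              ipaddr = 5.986325, ufo2010 = 5.474044)

vif_vals <- sqrt_vif^2          # the VIFs themselves
r2_other <- 1 - 1 / vif_vals    # R^2 of each predictor regressed on the others

print(round(rbind(VIF = vif_vals, R2_on_others = r2_other), 3))

# Predictors flagged by the sqrt(VIF) > 2 rule of thumb
flagged <- names(sqrt_vif)[sqrt_vif > 2]
print(flagged)
```

The second row uses the definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others.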

Let us see whether population affects the number of infections in this data set. Write the null and alternative hypotheses you would use to test this relationship.

Null hypothesis (H0): the slope on pop is zero, that is, infections do not change as population changes. Alternative hypothesis (H1): the slope on pop is not zero, that is, infections change as population changes.

summary(lm(infections ~ pop, data=final))
## 
## Call:
## lm(formula = infections ~ pop, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1242.9  -635.5  -537.3  -367.5  6085.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.275e+02  3.126e+02   2.007 0.054826 .  
## pop         3.601e-03  9.668e-04   3.725 0.000912 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1528 on 27 degrees of freedom
## Multiple R-squared:  0.3394, Adjusted R-squared:  0.315 
## F-statistic: 13.87 on 1 and 27 DF,  p-value: 0.0009122

Interpret the results obtained above.

The p-value for pop is about 0.0009, below 1%, so the relationship is statistically significant and the slope is positive. However, R-squared is low at about 34% (adjusted R-squared about 32%), so population explains only about a third of the variation in infections.
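
One way to connect the hypothesis test to the coefficient table is a 95% confidence interval for the slope on pop, rebuilt here from the printed estimate and standard error. This is a sketch from rounded values with illustrative variable names; calling confint() on the fitted model would give the interval without rounding.

```r
# Values copied from the printed summary of lm(infections ~ pop)
est <- 3.601e-03   # slope estimate for pop
se  <- 9.668e-04   # its standard error
df  <- 27          # residual degrees of freedom

t_crit <- qt(0.975, df)                 # two-sided 95% critical value
ci     <- est + c(-1, 1) * t_crit * se  # confidence interval for the slope

print(signif(ci, 3))

# H0 (slope = 0) is rejected at the 5% level when the interval excludes zero
reject_h0 <- ci[1] > 0 || ci[2] < 0
print(reject_h0)
```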

Now let us define the regression of infections vs pop as pop.lm and predict the number of infections based on the variable population.

pop.lm <- lm(infections ~ pop, data=final)
predict(pop.lm, data.frame(pop=1000000), interval="confidence")
##        fit      lwr      upr
## 1 4228.649 2418.505 6038.792

Interpret the results obtained above.

For a population of 1,000,000, the fitted model predicts about 4,229 infections. Because we requested interval="confidence", the range from 2,419 to 6,039 is a 95% confidence interval for the mean number of infections at that population, not a range for a single observation. The interval is wide, and given the model's low R-squared, the prediction captures only a limited part of the variation in actual infections.
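
The call above uses interval="confidence", which describes uncertainty about the mean number of infections at pop = 1,000,000. A prediction interval for a single new observation is wider. The sketch below approximates it from the printed output alone (rounded inputs, illustrative variable names); with the fitted model available, predict() with interval = "prediction" does this directly.

```r
# Values copied from the printed output
b0    <- 6.275e+02   # intercept of pop.lm
b1    <- 3.601e-03   # slope on pop
lwr   <- 2418.505    # lower 95% confidence limit at pop = 1e6
upr   <- 6038.792    # upper 95% confidence limit at pop = 1e6
sigma <- 1528        # residual standard error
df    <- 27          # residual degrees of freedom

fit    <- b0 + b1 * 1e6                 # point prediction at pop = 1,000,000
t_crit <- qt(0.975, df)
se_fit <- (upr - lwr) / (2 * t_crit)    # standard error of the fitted mean

# A prediction interval adds the residual variance to the fit's variance
half_pi  <- t_crit * sqrt(se_fit^2 + sigma^2)
pred_int <- fit + c(-1, 1) * half_pi

print(round(c(fit = fit, lwr = pred_int[1], upr = pred_int[2])))
```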