getwd()
[1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance"
As we Know it is best practice to get the working directory so you know where you are pulling your file and data from. in case you need to change it or move a file to it.
# make sure the packages for this chapter
# are installed, install if necessary
pkg <- c("ggplot2", "scales", "maptools",
"sp", "maps", "grid", "car" )
new.pkg <- pkg[!(pkg %in% installed.packages())]
if (length(new.pkg)) {
install.packages(new.pkg)
}
Here we install or check for the packages, notice the new package variable and how it uses an “if” loop to check for packages before installing them.
# read the CSV with headers
regression1<-read.csv("incidents.csv", header=T,sep =",")
We import the incidents csv usind the read function, notice the header is true(T) so the first row is considered headers.
Notice below I use the head() function to view the fields, their data types, and weither the match.
#View(regression1)
head(regression1)
Notice the population column is character (chr) data type lets change that to an integer (int) with the code below. Notice the that this value we are changing has commas so we have to remove those, lets use the gsub() function for this.
regression1$population <- as.integer(gsub(",","",regression1$population))
regression1$population
[1] 107353 326534 444752 750000 64403 2744878 1600000 2333000 1572816 712091 6900000 2700000 4900000 4200000
[15] 5200000 7100000
Now lets do some more EDA by getting a summary of the fields. The population field having an interger value will now display statistic data from that column instead of generic character information that we’ve changed it from.
summary(regression1)
area zone population incidents
Length:16 Length:16 Min. : 64403 Min. : 103.0
Class :character Class :character 1st Qu.: 645256 1st Qu.: 277.8
Mode :character Mode :character Median :1966500 Median : 654.0
Mean :2603489 Mean : 695.2
3rd Qu.:4375000 3rd Qu.: 853.0
Max. :7100000 Max. :2072.0
Now we will get more data on the dataframe structure by using the str()function on our dataframe
str(regression1)
'data.frame': 16 obs. of 4 variables:
$ area : chr "Boulder" "California-lexington" "Huntsville" "Seattle" ...
$ zone : chr "west" "east" "east" "west" ...
$ population: int 107353 326534 444752 750000 64403 2744878 1600000 2333000 1572816 712091 ...
$ incidents : int 605 103 161 1703 1003 527 721 704 105 403 ...
Since population doesnt have any decimals I believe the interger data type suited it. Lets double check that it is outputing correctley and has all 16 records by calling it specificaly using the structure function.
str(regression1$population)
int [1:16] 107353 326534 444752 750000 64403 2744878 1600000 2333000 1572816 712091 ...
We see it is correct. Let us proceed to create a new data frame that removes the first column. We will name it regression2
regression2<-regression1[,-1]#new data frame with the deletion of column 1
Let us use the head() function to display its fields and see if we got our desired outcome.
head(regression2)
We can see that we were successful in creating the new data frame that excludes the first column (area)
reg.fit1<-lm(regression1$incidents ~ regression1$population)
Now let us create a regression model using the lm() function.Notice we define the fields we want compared(~) we are using the incidents and population fields from our first dataframe.
Let us get a summary of this new regression model we created called “reg.fit1”
summary(reg.fit1)
Call:
lm(formula = regression1$incidents ~ regression1$population)
Residuals:
Min 1Q Median 3Q Max
-684.5 -363.5 -156.2 133.9 1164.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.749e+02 2.018e+02 2.353 0.0337 *
regression1$population 8.462e-05 5.804e-05 1.458 0.1669
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 534.9 on 14 degrees of freedom
Multiple R-squared: 0.1318, Adjusted R-squared: 0.0698
F-statistic: 2.126 on 1 and 14 DF, p-value: 0.1669
We can see that its adjusted r squared is insufficient at 0.698 and its p-value is also insufficient and above the 5% we can accept.
reg.fit2<-lm(incidents ~ zone+population, data = regression1)
After creating a new regression model that compares incidents to zone and population combined, we run a summary to see how it performed.
summary(reg.fit2)
Call:
lm(formula = incidents ~ zone + population, data = regression1)
Residuals:
Min 1Q Median 3Q Max
-537.21 -273.14 -57.89 188.17 766.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.612e+02 1.675e+02 0.962 0.35363
zonewest 7.266e+02 1.938e+02 3.749 0.00243 **
population 6.557e-05 4.206e-05 1.559 0.14300
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 384.8 on 13 degrees of freedom
Multiple R-squared: 0.5828, Adjusted R-squared: 0.5186
F-statistic: 9.081 on 2 and 13 DF, p-value: 0.003404
This is an improvement we can see that our r squared is about 50% and our p-value is below the significance level.
regression1$zone <- ifelse(regression1$zone == "west", 1, 0)
Let’s change the zone field from characters to a binary giving a value of 1 to “west” and 0 to “east”
#View(regression1)
head(regression1)
we can see this change was effective and now the zone field s changed to numeric binary data, well suited for a target class.
str(regression1)
'data.frame': 16 obs. of 4 variables:
$ area : chr "Boulder" "California-lexington" "Huntsville" "Seattle" ...
$ zone : num 1 0 0 1 1 0 1 1 0 0 ...
$ population: int 107353 326534 444752 750000 64403 2744878 1600000 2333000 1572816 712091 ...
$ incidents : int 605 103 161 1703 1003 527 721 704 105 403 ...
Interesting that the head output shows zone as a dbl but the summary output shows it as num.
#regression1$zone<-as.integer((regression1$zone),replace=TRUE) was not necessary
This change wasn’t necessary given it is already numeric.
interaction<-regression1$zone*regression1$population
We create a new field that combines the “zone’ and”population” columns into one named “interaction”
Then we use it to fit a new regression model that compares “incidents” to it(“interaction”), “population”, and “zone”)
reg.fit3<-lm(regression1$incidents~interaction+regression1$population+regression1$zone)
Lets see how this new model named reg.fit3 performs
summary(reg.fit3)
Call:
lm(formula = regression1$incidents ~ interaction + regression1$population +
regression1$zone)
Residuals:
Min 1Q Median 3Q Max
-540.91 -270.93 -59.56 187.99 767.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.659e+02 2.313e+02 0.717 0.4869
interaction 2.974e-06 9.469e-05 0.031 0.9755
regression1$population 6.352e-05 7.868e-05 0.807 0.4352
regression1$zone 7.192e+02 3.108e+02 2.314 0.0392 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 400.5 on 12 degrees of freedom
Multiple R-squared: 0.5829, Adjusted R-squared: 0.4786
F-statistic: 5.589 on 3 and 12 DF, p-value: 0.01237
We can see it is worse than the previous reg.fit2 but better still than our first model reg.fit1.
reg.fit4<-lm(regression1$incidents~interaction)
Lets create and test one last regression model this time only comparing the “incidents” field to the “interaction” field we created.
summary(reg.fit4)
Call:
lm(formula = regression1$incidents ~ interaction)
Residuals:
Min 1Q Median 3Q Max
-650.28 -301.09 -83.71 123.23 1103.76
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.951e+02 1.320e+02 3.751 0.00215 **
interaction 1.389e-04 4.737e-05 2.932 0.01093 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 451.9 on 14 degrees of freedom
Multiple R-squared: 0.3804, Adjusted R-squared: 0.3361
F-statistic: 8.595 on 1 and 14 DF, p-value: 0.01093
Though this last model was also poor like the previous with an even lower r-squared at 33% and an accetptable p-value.
If I had to use one of these model to predict “incidents” I would use our second one “reg.fit2” since it had the best R-squared at 0.5186 and the best p-value at 0.003404. The p value is low enough and the r squared is higher than any of the other models.