This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter. Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I. When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
# Where to find datasets?
library(datasets)
seatbelts <- datasets::Seatbelts
#print(class(seatbelts))
df <- as.data.frame(seatbelts)
# df$Date <- rownames(seatbelts)
print(head(df))
# install.packages("zoo")
# install.packages("xts")
library(zoo)
library(xts)
df$Date <- as.Date(time(seatbelts))
df$law <- as.factor(df$law)
sapply(df, class)
DriversKilled drivers front rear kms
"numeric" "numeric" "numeric" "numeric" "numeric"
PetrolPrice VanKilled law Date
"numeric" "numeric" "factor" "Date"
#convert the factor column into a categorical variable
df$law <- as.factor(df$law)
sapply(df, class)
DriversKilled drivers front rear kms
"numeric" "numeric" "numeric" "numeric" "numeric"
PetrolPrice VanKilled law Date
"numeric" "numeric" "factor" "Date"
boxplot(DriversKilled ~ law, data = df)
t.test(DriversKilled ~ law, data = df)
Welch Two Sample t-test
data: DriversKilled by law
t = 5.1253, df = 29.609, p-value = 1.693e-05
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
15.39892 35.81899
sample estimates:
mean in group 0 mean in group 1
125.8698 100.2609
boxplot(VanKilled ~ law, data = df)
t.test(VanKilled ~ law, data = df)
Welch Two Sample t-test
data: VanKilled by law
t = 9.4624, df = 47.965, p-value = 1.508e-12
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
3.474406 5.349366
sample estimates:
mean in group 0 mean in group 1
9.585799 5.173913
## If the hypothesis that drivers killed are fewer when the seatbelt law is in effect, is TRUE,
## then, can we PREDICT based on the measureable number of deaths on a given day, whether the
## Law was in effect or not?
# This is a classification problem
# Y = law (response variable)
# X = [DriversKilled, VanKilled]
# Logistic regression for classification of whether the Law is effect for any given date
model <- glm (formula = law ~ DriversKilled + VanKilled, family = "binomial", data = df)
summary(model)
Call:
glm(formula = law ~ DriversKilled + VanKilled, family = "binomial",
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.31774 -0.37820 -0.17077 -0.04051 2.94635
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.38029 1.90610 3.872 0.000108 ***
DriversKilled -0.05231 0.01563 -3.347 0.000816 ***
VanKilled -0.52171 0.13099 -3.983 6.81e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 140.740 on 191 degrees of freedom
Residual deviance: 88.991 on 189 degrees of freedom
AIC: 94.991
Number of Fisher Scoring iterations: 7
exp(model$coefficients) #less than 1 (odds) means that odds decreased when drivers killed increased. similarly for vankilled.
(Intercept) DriversKilled VanKilled
1604.0552725 0.9490347 0.5935072
# DriversKilled VanKilled
# 0.9490347 0.5935072
predictionsOfLawInEffect <- predict(model, newdata = df) # LOG ODDS of each row correspondign to when the law is in effect
boxplot(predictionsOfLawInEffect ~ df$law) #Boxplot to find the optimal log-odds cutoff
predictionsOfLawInEffect_CATEGORICAL <- predictionsOfLawInEffect > -2
# -2 is the threshold we found by visually analyzing the graph
table(df$law, predictionsOfLawInEffect_CATEGORICAL) # Confusion matrix: compares the actual v/s predicted state of law = 1 or 0 in a Confusion matrix
predictionsOfLawInEffect_CATEGORICAL
FALSE TRUE
0 133 36
1 4 19
plot(density(exp(predictionsOfLawInEffect))) # plot of the "ODDS" of each row being classified for the law being in effect
# ODDS = p / (1 - p); where ODDS = exp(predictionsOfLawInEffect)
# => (1 - p)*ODDS = p
# => p = ODDS/(1+ODDS)
# Probability of each row corresponding to the law being in effect:
plot(density(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect)))) # plot of the "ODDS" of each row being classified for the law being in effect
summary(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000343 0.0030743 0.0338102 0.1197917 0.1595388 0.8213848
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 0.0000343 0.0030743 0.0338102 0.1197917 0.1595388 0.8213848
# boxplot to find the optimal probability cutoff:
boxplot(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect)) ~ df$law)
# Linear regression for REGRESSION of the number of people that died ON the X's
# / regressors law-being-in-effect and total number of drivers
# Y = number of people that died i.e. DriversKilled variable
# X = [law, drivers, PetrolPrice] # elim variables that confound the relationship between the predictor and the response of law=1 by logic
model2 <- lm (DriversKilled ~ law , data = df)
summary(model2)
Call:
lm(formula = DriversKilled ~ law, data = df)
Residuals:
Min 1Q Median 3Q Max
-46.870 -17.870 -5.565 14.130 72.130
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 125.870 1.849 68.082 < 2e-16 ***
law1 -25.609 5.342 -4.794 3.29e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.03 on 190 degrees of freedom
Multiple R-squared: 0.1079, Adjusted R-squared: 0.1032
F-statistic: 22.98 on 1 and 190 DF, p-value: 3.288e-06
## Export df to a file:
write.csv(df, row.names = F, file = "britishSeatBeltStudy.csv")