Info about R notebook

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter. Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I. When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Where to find datasets: the datasets library

# Where to find datasets?
library(datasets)

seatbelts <- datasets::Seatbelts
#print(class(seatbelts))

df <- as.data.frame(seatbelts)
# df$Date <- rownames(seatbelts)
print(head(df))

Load time-series modeling library (XTS and Zoo) and using them to decode the date column in the datasets::Seatbelts dataset

# install.packages("zoo")
# install.packages("xts")
library(zoo)
library(xts)
df$Date <- as.Date(time(seatbelts)) 
df$law <- as.factor(df$law)
sapply(df, class)
DriversKilled       drivers         front          rear           kms 
    "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
  PetrolPrice     VanKilled           law          Date 
    "numeric"     "numeric"      "factor"        "Date" 
#convert the factor column into a categorical variable

T-test: compares categorical levels of grouping variable based on numerical value of choice (DriversKilled or VanKilled) (exploratory aanlysis of difference between law = 1 and law = 0)

df$law <- as.factor(df$law)
sapply(df, class)
DriversKilled       drivers         front          rear           kms 
    "numeric"     "numeric"     "numeric"     "numeric"     "numeric" 
  PetrolPrice     VanKilled           law          Date 
    "numeric"     "numeric"      "factor"        "Date" 
boxplot(DriversKilled ~ law, data = df)

t.test(DriversKilled ~ law, data = df)

    Welch Two Sample t-test

data:  DriversKilled by law
t = 5.1253, df = 29.609, p-value = 1.693e-05
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 15.39892 35.81899
sample estimates:
mean in group 0 mean in group 1 
       125.8698        100.2609 
boxplot(VanKilled ~ law, data = df)

t.test(VanKilled ~ law, data = df)

    Welch Two Sample t-test

data:  VanKilled by law
t = 9.4624, df = 47.965, p-value = 1.508e-12
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 3.474406 5.349366
sample estimates:
mean in group 0 mean in group 1 
       9.585799        5.173913 
## If the hypothesis that drivers killed are fewer when the seatbelt law is in effect, is TRUE, 
## then, can we PREDICT based on the measureable number of deaths on a given day, whether the 
## Law was in effect or not?  
#  This is a classification problem

#  Y = law (response variable)
#  X = [DriversKilled, VanKilled]



# Logistic regression for classification of whether the Law is effect for any given date 
model <- glm (formula = law ~  DriversKilled + VanKilled, family = "binomial", data = df)
summary(model)

Call:
glm(formula = law ~ DriversKilled + VanKilled, family = "binomial", 
    data = df)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.31774  -0.37820  -0.17077  -0.04051   2.94635  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    7.38029    1.90610   3.872 0.000108 ***
DriversKilled -0.05231    0.01563  -3.347 0.000816 ***
VanKilled     -0.52171    0.13099  -3.983 6.81e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 140.740  on 191  degrees of freedom
Residual deviance:  88.991  on 189  degrees of freedom
AIC: 94.991

Number of Fisher Scoring iterations: 7
exp(model$coefficients) #less than 1 (odds) means that odds decreased when drivers killed increased. similarly for vankilled. 
  (Intercept) DriversKilled     VanKilled 
 1604.0552725     0.9490347     0.5935072 
# DriversKilled     VanKilled 
#     0.9490347     0.5935072

Create predictions of the fit logistic model against our original data - each row being an independent evaluation of the model

predictionsOfLawInEffect <- predict(model, newdata = df)  # LOG ODDS of each row correspondign to when the law is in effect
boxplot(predictionsOfLawInEffect ~ df$law)  #Boxplot to find the optimal log-odds cutoff

predictionsOfLawInEffect_CATEGORICAL <- predictionsOfLawInEffect > -2
# -2 is the threshold we found by visually analyzing the graph
table(df$law, predictionsOfLawInEffect_CATEGORICAL)  # Confusion matrix: compares the actual v/s predicted state of law = 1 or 0 in a Confusion matrix
   predictionsOfLawInEffect_CATEGORICAL
    FALSE TRUE
  0   133   36
  1     4   19

visualize a probability density plot of the predicted odds of the response of law - 1 in the predicted set of evaluated records

odds = p/(1-p)

p / odds/(1+odds)

plot(density(exp(predictionsOfLawInEffect)))  # plot of the "ODDS" of each row being classified for the law being in effect


# ODDS = p / (1 - p); where ODDS = exp(predictionsOfLawInEffect)
# => (1 - p)*ODDS = p
# => p = ODDS/(1+ODDS)

Probability of each row corresponding to the law being in effect:

# Probability of each row  corresponding to the law being in effect: 
plot(density(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect))))  # plot of the "ODDS" of each row being classified for the law being in effect

summary(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect)))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000343 0.0030743 0.0338102 0.1197917 0.1595388 0.8213848 
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
# 0.0000343 0.0030743 0.0338102 0.1197917 0.1595388 0.8213848

boxplot to find the optimal probability cutoff:

# boxplot to find the optimal probability cutoff:
boxplot(exp(predictionsOfLawInEffect)/(1+exp(predictionsOfLawInEffect)) ~ df$law)

Perform lin reg as an example use-case to redicted drivers that died as a function of the law status being in effect

# Linear regression for REGRESSION of the number of people that died ON the X's 
# / regressors law-being-in-effect and total number of drivers  

# Y = number of people that died i.e.  DriversKilled variable
# X = [law, drivers, PetrolPrice] # elim variables that confound the relationship between the predictor and the response of law=1 by logic

model2 <- lm (DriversKilled ~ law  , data = df)
summary(model2)

Call:
lm(formula = DriversKilled ~ law, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.870 -17.870  -5.565  14.130  72.130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  125.870      1.849  68.082  < 2e-16 ***
law1         -25.609      5.342  -4.794 3.29e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 24.03 on 190 degrees of freedom
Multiple R-squared:  0.1079,    Adjusted R-squared:  0.1032 
F-statistic: 22.98 on 1 and 190 DF,  p-value: 3.288e-06

export df to a file:

## Export df to a file:
write.csv(df, row.names = F, file = "britishSeatBeltStudy.csv")
LS0tCnRpdGxlOiAiQ2xhc3NpZmljYXRpb24gYW5kIFJlZ3Jlc3Npb24gQW5hbHlzaXMgdXNpbmcgQnJpdGlzaCBTZWF0YmVsdCBTdHVkeSBEYXRhIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCiMjIyMgSW5mbyBhYm91dCBSIG5vdGVib29rClRoaXMgaXMgYW4gW1IgTWFya2Rvd25dKGh0dHA6Ly9ybWFya2Rvd24ucnN0dWRpby5jb20pIE5vdGVib29rLiBXaGVuIHlvdSBleGVjdXRlIGNvZGUgd2l0aGluIHRoZSBub3RlYm9vaywgdGhlIHJlc3VsdHMgYXBwZWFyIGJlbmVhdGggdGhlIGNvZGUuIAoKVHJ5IGV4ZWN1dGluZyB0aGlzIGNodW5rIGJ5IGNsaWNraW5nIHRoZSAqUnVuKiBidXR0b24gd2l0aGluIHRoZSBjaHVuayBvciBieSBwbGFjaW5nIHlvdXIgY3Vyc29yIGluc2lkZSBpdCBhbmQgcHJlc3NpbmcgKkN0cmwrU2hpZnQrRW50ZXIqLiAKQWRkIGEgbmV3IGNodW5rIGJ5IGNsaWNraW5nIHRoZSAqSW5zZXJ0IENodW5rKiBidXR0b24gb24gdGhlIHRvb2xiYXIgb3IgYnkgcHJlc3NpbmcgKkN0cmwrQWx0K0kqLgpXaGVuIHlvdSBzYXZlIHRoZSBub3RlYm9vaywgYW4gSFRNTCBmaWxlIGNvbnRhaW5pbmcgdGhlIGNvZGUgYW5kIG91dHB1dCB3aWxsIGJlIHNhdmVkIGFsb25nc2lkZSBpdCAoY2xpY2sgdGhlICpQcmV2aWV3KiBidXR0b24gb3IgcHJlc3MgKkN0cmwrU2hpZnQrSyogdG8gcHJldmlldyB0aGUgSFRNTCBmaWxlKS4KClRoZSBwcmV2aWV3IHNob3dzIHlvdSBhIHJlbmRlcmVkIEhUTUwgY29weSBvZiB0aGUgY29udGVudHMgb2YgdGhlIGVkaXRvci4gQ29uc2VxdWVudGx5LCB1bmxpa2UgKktuaXQqLCAqUHJldmlldyogZG9lcyBub3QgcnVuIGFueSBSIGNvZGUgY2h1bmtzLiBJbnN0ZWFkLCB0aGUgb3V0cHV0IG9mIHRoZSBjaHVuayB3aGVuIGl0IHdhcyBsYXN0IHJ1biBpbiB0aGUgZWRpdG9yIGlzIGRpc3BsYXllZC4KCiMjIyBXaGVyZSB0byBmaW5kIGRhdGFzZXRzOiB0aGUgZGF0YXNldHMgbGlicmFyeQpgYGB7cn0KIyBXaGVyZSB0byBmaW5kIGRhdGFzZXRzPwpsaWJyYXJ5KGRhdGFzZXRzKQoKc2VhdGJlbHRzIDwtIGRhdGFzZXRzOjpTZWF0YmVsdHMKI3ByaW50KGNsYXNzKHNlYXRiZWx0cykpCgpkZiA8LSBhcy5kYXRhLmZyYW1lKHNlYXRiZWx0cykKIyBkZiREYXRlIDwtIHJvd25hbWVzKHNlYXRiZWx0cykKcHJpbnQoaGVhZChkZikpCmBgYAoKCiMjIyBMb2FkIHRpbWUtc2VyaWVzIG1vZGVsaW5nIGxpYnJhcnkgKFhUUyBhbmQgWm9vKSBhbmQgdXNpbmcgdGhlbSB0byBkZWNvZGUgdGhlIGRhdGUgY29sdW1uIGluIHRoZSAqZGF0YXNldHM6OlNlYXRiZWx0cyogZGF0YXNldApgYGB7cn0KIyBpbnN0YWxsLnBhY2thZ2VzKCJ6b28iKQojIGluc3RhbGwucGFja2FnZXMoInh0cyIpCmxpYnJhcnkoem9vKQpsaWJyYXJ5KHh0cykKZGYkRGF0ZSA8LSBhcy5EYXRlKHRpbWUoc2VhdGJlbHRzKSkgCmRmJGxhdyA8LSBhcy5mYWN0b3IoZGYkbGF3KQpzYXBwbHkoZGYsIGNsYXNzKQojY29udmVydCB0aGUgZmFjdG9yIGNvbHVtbiBpbnRvIGEgY2F0ZWdvcmljYWwgdmFyaWFibGUKCmBgYAojIyMgVC10ZXN0OiBjb21wYXJlcyBjYXRlZ29yaWNhbCBsZXZlbHMgb2YgZ3JvdXBpbmcgdmFyaWFibGUgYmFzZWQgb24gbnVtZXJpY2FsIHZhbHVlIG9mIGNob2ljZSAoRHJpdmVyc0tpbGxlZCBvciBWYW5LaWxsZWQpIChleHBsb3JhdG9yeSBhYW5seXNpcyBvZiBkaWZmZXJlbmNlIGJldHdlZW4gbGF3ID0gMSBhbmQgbGF3ID0gMCkKYGBge3J9CmRmJGxhdyA8LSBhcy5mYWN0b3IoZGYkbGF3KQpzYXBwbHkoZGYsIGNsYXNzKQoKYm94cGxvdChEcml2ZXJzS2lsbGVkIH4gbGF3LCBkYXRhID0gZGYpCnQudGVzdChEcml2ZXJzS2lsbGVkIH4gbGF3LCBkYXRhID0gZGYpCgpib3hwbG90KFZhbktpbGxlZCB+IGxhdywgZGF0YSA9IGRmKQp0LnRlc3QoVmFuS2lsbGVkIH4gbGF3LCBkYXRhID0gZGYpCmBgYAoKYGBge3J9CiMjIElmIHRoZSBoeXBvdGhlc2lzIHRoYXQgZHJpdmVycyBraWxsZWQgYXJlIGZld2VyIHdoZW4gdGhlIHNlYXRiZWx0IGxhdyBpcyBpbiBlZmZlY3QsIGlzIFRSVUUsIAojIyB0aGVuLCBjYW4gd2UgUFJFRElDVCBiYXNlZCBvbiB0aGUgbWVhc3VyZWFibGUgbnVtYmVyIG9mIGRlYXRocyBvbiBhIGdpdmVuIGRheSwgd2hldGhlciB0aGUgCiMjIExhdyB3YXMgaW4gZWZmZWN0IG9yIG5vdD8gIAojICBUaGlzIGlzIGEgY2xhc3NpZmljYXRpb24gcHJvYmxlbQoKIyAgWSA9IGxhdyAocmVzcG9uc2UgdmFyaWFibGUpCiMgIFggPSBbRHJpdmVyc0tpbGxlZCwgVmFuS2lsbGVkXQoKCgojIExvZ2lzdGljIHJlZ3Jlc3Npb24gZm9yIGNsYXNzaWZpY2F0aW9uIG9mIHdoZXRoZXIgdGhlIExhdyBpcyBlZmZlY3QgZm9yIGFueSBnaXZlbiBkYXRlIAptb2RlbCA8LSBnbG0gKGZvcm11bGEgPSBsYXcgfiAgRHJpdmVyc0tpbGxlZCArIFZhbktpbGxlZCwgZmFtaWx5ID0gImJpbm9taWFsIiwgZGF0YSA9IGRmKQpzdW1tYXJ5KG1vZGVsKQoKZXhwKG1vZGVsJGNvZWZmaWNpZW50cykgI2xlc3MgdGhhbiAxIChvZGRzKSBtZWFucyB0aGF0IG9kZHMgZGVjcmVhc2VkIHdoZW4gZHJpdmVycyBraWxsZWQgaW5jcmVhc2VkLiBzaW1pbGFybHkgZm9yIHZhbmtpbGxlZC4gCiMgRHJpdmVyc0tpbGxlZCAgICAgVmFuS2lsbGVkIAojICAgICAwLjk0OTAzNDcgICAgIDAuNTkzNTA3MgpgYGAKCiMjIyBDcmVhdGUgcHJlZGljdGlvbnMgb2YgdGhlIGZpdCBsb2dpc3RpYyBtb2RlbCBhZ2FpbnN0IG91ciBvcmlnaW5hbCBkYXRhIC0gZWFjaCByb3cgYmVpbmcgYW4gaW5kZXBlbmRlbnQgZXZhbHVhdGlvbiBvZiB0aGUgbW9kZWwKYGBge3J9CnByZWRpY3Rpb25zT2ZMYXdJbkVmZmVjdCA8LSBwcmVkaWN0KG1vZGVsLCBuZXdkYXRhID0gZGYpICAjIExPRyBPRERTIG9mIGVhY2ggcm93IGNvcnJlc3BvbmRpZ24gdG8gd2hlbiB0aGUgbGF3IGlzIGluIGVmZmVjdApib3hwbG90KHByZWRpY3Rpb25zT2ZMYXdJbkVmZmVjdCB+IGRmJGxhdykgICNCb3hwbG90IHRvIGZpbmQgdGhlIG9wdGltYWwgbG9nLW9kZHMgY3V0b2ZmCnByZWRpY3Rpb25zT2ZMYXdJbkVmZmVjdF9DQVRFR09SSUNBTCA8LSBwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QgPiAtMgojIC0yIGlzIHRoZSB0aHJlc2hvbGQgd2UgZm91bmQgYnkgdmlzdWFsbHkgYW5hbHl6aW5nIHRoZSBncmFwaAp0YWJsZShkZiRsYXcsIHByZWRpY3Rpb25zT2ZMYXdJbkVmZmVjdF9DQVRFR09SSUNBTCkgICMgQ29uZnVzaW9uIG1hdHJpeDogY29tcGFyZXMgdGhlIGFjdHVhbCB2L3MgcHJlZGljdGVkIHN0YXRlIG9mIGxhdyA9IDEgb3IgMCBpbiBhIENvbmZ1c2lvbiBtYXRyaXgKYGBgCgojIyMgdmlzdWFsaXplIGEgcHJvYmFiaWxpdHkgZGVuc2l0eSBwbG90IG9mIHRoZSBwcmVkaWN0ZWQgb2RkcyBvZiB0aGUgcmVzcG9uc2Ugb2YgbGF3IC0gMSBpbiB0aGUgcHJlZGljdGVkIHNldCBvZiBldmFsdWF0ZWQgcmVjb3JkcwojIyMjIG9kZHMgPSBwLygxLXApCiMjIyMgcCAvIG9kZHMvKDErb2RkcykKYGBge3J9CnBsb3QoZGVuc2l0eShleHAocHJlZGljdGlvbnNPZkxhd0luRWZmZWN0KSkpICAjIHBsb3Qgb2YgdGhlICJPRERTIiBvZiBlYWNoIHJvdyBiZWluZyBjbGFzc2lmaWVkIGZvciB0aGUgbGF3IGJlaW5nIGluIGVmZmVjdAoKIyBPRERTID0gcCAvICgxIC0gcCk7IHdoZXJlIE9ERFMgPSBleHAocHJlZGljdGlvbnNPZkxhd0luRWZmZWN0KQojID0+ICgxIC0gcCkqT0REUyA9IHAKIyA9PiBwID0gT0REUy8oMStPRERTKQpgYGAKCiMjIyBQcm9iYWJpbGl0eSBvZiBlYWNoIHJvdyAgY29ycmVzcG9uZGluZyB0byB0aGUgbGF3IGJlaW5nIGluIGVmZmVjdDogCmBgYHtyfQojIFByb2JhYmlsaXR5IG9mIGVhY2ggcm93ICBjb3JyZXNwb25kaW5nIHRvIHRoZSBsYXcgYmVpbmcgaW4gZWZmZWN0OiAKcGxvdChkZW5zaXR5KGV4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpLygxK2V4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpKSkpICAjIHBsb3Qgb2YgdGhlICJPRERTIiBvZiBlYWNoIHJvdyBiZWluZyBjbGFzc2lmaWVkIGZvciB0aGUgbGF3IGJlaW5nIGluIGVmZmVjdApzdW1tYXJ5KGV4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpLygxK2V4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpKSkKIyAgICAgIE1pbi4gICAxc3QgUXUuICAgIE1lZGlhbiAgICAgIE1lYW4gICAzcmQgUXUuICAgICAgTWF4LiAKIyAwLjAwMDAzNDMgMC4wMDMwNzQzIDAuMDMzODEwMiAwLjExOTc5MTcgMC4xNTk1Mzg4IDAuODIxMzg0OApgYGAKCiMjIyBib3hwbG90IHRvIGZpbmQgdGhlIG9wdGltYWwgcHJvYmFiaWxpdHkgY3V0b2ZmOgpgYGB7cn0KIyBib3hwbG90IHRvIGZpbmQgdGhlIG9wdGltYWwgcHJvYmFiaWxpdHkgY3V0b2ZmOgpib3hwbG90KGV4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpLygxK2V4cChwcmVkaWN0aW9uc09mTGF3SW5FZmZlY3QpKSB+IGRmJGxhdykKYGBgCgojIyMgUGVyZm9ybSBsaW4gcmVnIGFzIGFuIGV4YW1wbGUgdXNlLWNhc2UgdG8gcmVkaWN0ZWQgZHJpdmVycyB0aGF0IGRpZWQgYXMgYSBmdW5jdGlvbiBvZiB0aGUgbGF3IHN0YXR1cyBiZWluZyBpbiBlZmZlY3QKYGBge3J9CiMgTGluZWFyIHJlZ3Jlc3Npb24gZm9yIFJFR1JFU1NJT04gb2YgdGhlIG51bWJlciBvZiBwZW9wbGUgdGhhdCBkaWVkIE9OIHRoZSBYJ3MgCiMgLyByZWdyZXNzb3JzIGxhdy1iZWluZy1pbi1lZmZlY3QgYW5kIHRvdGFsIG51bWJlciBvZiBkcml2ZXJzICAKCiMgWSA9IG51bWJlciBvZiBwZW9wbGUgdGhhdCBkaWVkIGkuZS4gIERyaXZlcnNLaWxsZWQgdmFyaWFibGUKIyBYID0gW2xhdywgZHJpdmVycywgUGV0cm9sUHJpY2VdICMgZWxpbSB2YXJpYWJsZXMgdGhhdCBjb25mb3VuZCB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gdGhlIHByZWRpY3RvciBhbmQgdGhlIHJlc3BvbnNlIG9mIGxhdz0xIGJ5IGxvZ2ljCgptb2RlbDIgPC0gbG0gKERyaXZlcnNLaWxsZWQgfiBsYXcgICwgZGF0YSA9IGRmKQpzdW1tYXJ5KG1vZGVsMikKYGBgCgojIyMgZXhwb3J0IGRmIHRvIGEgZmlsZToKYGBge3J9CiMjIEV4cG9ydCBkZiB0byBhIGZpbGU6CndyaXRlLmNzdihkZiwgcm93Lm5hbWVzID0gRiwgZmlsZSA9ICJicml0aXNoU2VhdEJlbHRTdHVkeS5jc3YiKQoKYGBgCgo=