The managers of Capital Bikeshare have found that the system works smoothly until more than 500 bikes are rented in any one hour. At that point, it becomes necessary to insert extra bikes into the system and move them across stations to balance loads.

library(knitr)
library(glmnet)
#code pset #1:
library(data.table)
biketab <- fread("bikeshare.csv")
# tell R which are factors
biketab[, c("dteday", "mnth","season","weekday","hr","weathersit") := list(
  factor(dteday), factor(mnth), factor(season), 
  factor(weekday), factor(hr), factor(weathersit))]
####### Q1: outliers and FDR
## the next command calculates total cnt by day, 
# also keeping track of the corresponding yr and mnth id.
daytots <- biketab[, list(total=sum(cnt)), by=c("dteday","yr","mnth")]
row.names(daytots) <- daytots$dteday
# simple regression
daylm <- glm(total ~ yr*mnth, data=daytots)
#### Q2: lasso regression
library(gamlr)
source("naref.R")
mmbike <- sparse.model.matrix(
    cnt ~ . + yr*mnth + hr*notbizday, 
    data=naref(biketab))[,-1]
y <- log(biketab$cnt)
## note, I need lambda.min.ratio=1e-4 because otherwise we don't get a path
## out to complex enough models (i.e. cv err is still decreasing at termination)
fitlin <- cv.gamlr( mmbike, y, lmr=1e-4, verb=F )

library(caret)
library(ROCR)
library(tidyverse)
library(gamlr)
library(knitr)
library(stringr)
library(janitor)

3.1
Define the binary outcome variable overload that is one if cnt > 500, zero otherwise.

#define overload as any day where cnt is greater than 500
biketab$overload = ifelse(biketab$cnt > 500,1,0)

Fit and plot the lasso path for regression of overload onto the same model matrix used in Question 2 (no need for cross validation).

#create sparse matrix for test data set
overload.matrix = sparse.model.matrix(
  overload ~ .+ yr*mnth + hr*notbizday,
  data = naref(biketab))[,-1]
overload.y =as.factor(biketab$overload)
overload.reg = gamlr(overload.matrix,
                     overload.y,
                     family = "binomial")

Plot of Lambda and Coefficients:

plot(overload.reg, xvar = "lambda")

Plot of Lambda and Coefficients using glmnet

overload.reg.glmnet = glmnet(overload.matrix, overload.y, family = "binomial")
plot(overload.reg.glmnet, xvar = "lambda")

3.2
Summarize how hour-of-day effects the probability of an overload during business days. Consider a single hour with a strong effect and compare this to its effect in the regression of Q2.

Extracting the effects of hr on probability of overload

#extract the coefficients generated by regression to indiciate effect sizes
#since lambda argument is not specified, it is using the min lambda
coef.df = coef(overload.reg) %>% as.matrix() %>% as.data.frame()
coef.df$var.names = rownames(coef.df)
colnames(coef.df)[1]= "Beta.Val"
#create data frame with just the hrs variables - excluding not busuziness day hours
hrs.df = filter(coef.df, str_detect(rownames(coef.df), regex("hr[0-9]{1,2}$")))
#arrange hrs.df by absolute valuate
hrs.df = arrange(hrs.df, desc(abs(hrs.df$Beta.Val)))
kable(head(hrs.df, 2), caption = "Odds Ratio of Likelihood of Overload for hrs",
      col.names = c("Beta Value", "Hour"))
Beta Value Hour
0.0876101 hr17
0.0428949 hr18

Extracting the effecits of hr from fitlin regression

#extract the coefficients generated by regression to indiciate effect sizes
#since lambda argument is not specified, it is using the min lambda
coef.df.fitlin = coef(fitlin) %>% as.matrix() %>% as.data.frame()
coef.df.fitlin$var.names = rownames(coef.df.fitlin)
colnames(coef.df.fitlin)[1]= "Beta.Val"
#create data frame with just the hrs variables - excluding not busuziness day hours
hrs.df.fitlin = filter(coef.df.fitlin, str_detect(rownames(coef.df.fitlin), regex("hr[0-9]{1,2}$")))
#arrange hrs.df by absolute valuate
hrs.df.fitlin = arrange(hrs.df.fitlin, desc(abs(hrs.df.fitlin$Beta.Val)))
kable(head(hrs.df.fitlin, 10), caption = "Odds Ratio of Likelihood of Overload for hrs in fitlin",
      col.names = c("Beta Value", "Hour"))
Beta Value Hour
-3.5575318 hr3
-3.4407608 hr4
-3.0278172 hr2
-2.3535056 hr1
-1.8732525 hr5
-1.5466945 hr0
1.0855031 hr8
0.9857572 hr17
0.9350662 hr18
-0.6705973 hr23

We can see that by order both the fitlin and the logistic regrssion of overload that we get different hours having impacts on the outcome. for fitlin we can see that hrs3 and hrs4 have the largest impact on the log(cnt) of bikes rented. It is evident that it is not until hr8 we see that a positive relationship occurrs. The cofficients are essentially telling us that in the very late evening and very early morning, the (log) number of bikes rented goes down.

As for the logistic regression of overload, only hr7 seems to have a positive impact. However, since this is a logistic regression the coefficient represents the odds ratio. Since the coefficeint is less than 1 we can say that there hr7 has approximately is 87% less likely to be overloaded, holding all else equal.


3.3
Suppose that it costs you $200/hr in overtime pay if you have an overload (cnt > 500) with your usual number of staff. Staffing an extra driver to move the bikes costs only $100/hr and means you don’t have to pay any overtime. At what probability for overload > 0 will you want to staff an extra driver?

df = data.frame(Regular.Staff = c(-200,0),
                Extra.Employee = c(-100,-100))
rownames(df) = c("Overload", "Don't Overload")
kable(df, col.names = c("Regular Staff", "Extra Employee"))
Regular Staff Extra Employee
Overload -200 -100
Don’t Overload 0 -100

Using expected probability we can see that our \(p\) threshold is:

\[ 200(1 - p) - 100p > 0\] \[ 200 - 200p - 100p > 0\] \[ 200 - 300p > 0\] \[ 200 > 300p\] \[ \frac{200}{300} > p\] \[ p < .67\]

3.4
Plot and describe the ROC curve for your AICc-optimal regression from 3.1. What is the sensitivity and specificity of your rule from 3.3 if applied with this regression?

source("roc.R")
#using mmbike matrix
test = gamlr(mmbike, overload.y, family = "binomial")
p = predict(test, 
            mmbike, 
            type = "response")
roc(p=p, y = overload.y)

p = predict(overload.reg, 
            overload.matrix, 
            type = "response")
roc(p=p, y = overload.y)

prediction.df = data.frame(prediction = p,
                           reality = overload.y)
prediction.df$threshold.pred = ifelse(prediction.df$seg100 > .67, 1,0)
prediction.df$correct.prediction  = ifelse(prediction.df$threshold.pred == prediction.df$reality, 1, 0)
confusion.matrix = caret::confusionMatrix(prediction.df$reality, prediction.df$threshold.pred)
x = as.table(confusion.matrix, what = "xtabs")
x <- cbind(x, Total = rowSums(x))
x <- rbind(x, Total = colSums(x))
x
          0    1 Total
0     16100    0 16100
1       139 1140  1279
Total 16239 1140 17379



row.names(x) = c("Predicted Not Overloaded", "Predicted Overloaded", "Total")
colnames(x) = c("Predicted Not Overloaded", "Predicted Overloaded", "Total")
x = as.data.frame(x)
kable(x, caption = "Confusion Matrix for Full Data Frame p > .67")

\[Sensativity = \frac{\# \ Correct \ Overloaded}{\# \ Overloaded}\]

\[Specificity = \frac{\# \ Incorrectly \ Not \ Overloaded}{\# \ Not \ Overloaded}\]

specificity = 1140 / 1140
print(paste0("Sensativity: ",specificity *100,"%"))
[1] "Sensativity: 100%"
sensativity = 139/16329
print(paste0("Specificity: ",(1 - sensativity),"%"))
[1] "Specificity: 0.991487537509952%"

From confusion matrix:

confusion.matrix$byClass[1]
Sensitivity 
  0.9914404 
confusion.matrix$byClass[2]
Specificity 
          1 

3.5  

Now, take the test sample and

• fit the regression path excluding this sample (e.g., on mmbike[-test,]).

#create test and training split
trainIndex = createDataPartition(biketab$overload, times = 1, p = .8, list = F)
#create training data frame and training outcome variable
train.df = biketab[trainIndex,]
train.y = train.df$overload
#create test data frame and test outcome
test.df = biketab[-(trainIndex),]
test.y = train.df$overload
train.matrix <- sparse.model.matrix(
  overload ~ . + yr*mnth + hr*notbizday, 
    data=naref(train.df))[,-1]
train.reg = gamlr(train.matrix, train.y, family = "binomial")
plot(train.reg, xvar = "lambda")

• use the AICc-optimal model from this path to predict for the test set.

test.matrix <- sparse.model.matrix(
  overload ~ . + yr*mnth + hr*notbizday, 
    data=naref(test.df))[,-1]

test.y = test.df$overload
pred = predict(train.reg, test.matrix, type = "response")

• plot the ‘out-of-sample’ ROC curve for these predictions.

Compare this curve to your ROC curve from 3.4 and describe what they imply about the quality of AICc selection for this regression.

---
title: 'Machine Learning Problem Set #2'
author: "Tom Curran"
date: "April 24, 2018"
output:
  html_notebook: default
  pdf_document: default
---

The managers of Capital Bikeshare have found that the system works smoothly until more than 500 bikes are rented in any one hour. At that point, it becomes necessary to insert extra bikes into the system and move them across stations to balance loads.


```{r pset_1, message=FALSE, warning=FALSE}
library(knitr)
library(glmnet)
#code pset #1:
library(data.table)
biketab <- fread("bikeshare.csv")
# tell R which are factors
biketab[, c("dteday", "mnth","season","weekday","hr","weathersit") := list(
  factor(dteday), factor(mnth), factor(season), 
  factor(weekday), factor(hr), factor(weathersit))]

####### Q1: outliers and FDR
## the next command calculates total cnt by day, 
# also keeping track of the corresponding yr and mnth id.
daytots <- biketab[, list(total=sum(cnt)), by=c("dteday","yr","mnth")]
row.names(daytots) <- daytots$dteday
# simple regression
daylm <- glm(total ~ yr*mnth, data=daytots)

#### Q2: lasso regression
library(gamlr)
source("naref.R")
mmbike <- sparse.model.matrix(
	cnt ~ . + yr*mnth + hr*notbizday, 
	data=naref(biketab))[,-1]
y <- log(biketab$cnt)
## note, I need lambda.min.ratio=1e-4 because otherwise we don't get a path
## out to complex enough models (i.e. cv err is still decreasing at termination)
fitlin <- cv.gamlr( mmbike, y, lmr=1e-4, verb=F )

```

***

```{r libraries, message=FALSE, warning=FALSE}
library(caret)
library(ROCR)
library(tidyverse)
library(gamlr)
library(knitr)
library(stringr)
library(janitor)
```

**3.1**\
**Define the binary outcome variable overload that is one if cnt > 500, zero otherwise.**
```{r train.test.split, message=FALSE, warning=FALSE}

#define overload as any day where cnt is greater than 500
biketab$overload = ifelse(biketab$cnt > 500,1,0)


```

**Fit and plot the lasso path for regression of overload onto the same model matrix used in Question 2 (no need for cross validation).**
```{r overload.reg, message=FALSE, warning=FALSE}
#create sparse matrix for test data set
overload.matrix = sparse.model.matrix(
  overload ~ .+ yr*mnth + hr*notbizday,
  data = naref(biketab))[,-1]

overload.y =as.factor(biketab$overload)

overload.reg = gamlr(overload.matrix,
                     overload.y,
                     family = "binomial")

```

#### Plot of Lambda and Coefficients:

```{r lambda.plot, message=FALSE, warning=FALSE}
plot(overload.reg, xvar = "lambda")
```

#### Plot of Lambda and Coefficients using glmnet

```{r lambdagraph.glmnet, message=FALSE, warning=FALSE}
overload.reg.glmnet = glmnet(overload.matrix, overload.y, family = "binomial")

plot(overload.reg.glmnet, xvar = "lambda")
```

**3.2**\
**Summarize how hour-of-day effects the probability of an overload during business days. Consider a single hour with a strong effect and compare this to its effect in the regression of Q2.**

Extracting the effects of `hr` on probability of `overload`

```{r}
#extract the coefficients generated by regression to indiciate effect sizes
#since lambda argument is not specified, it is using the min lambda
coef.df = coef(overload.reg) %>% as.matrix() %>% as.data.frame()
coef.df$var.names = rownames(coef.df)
colnames(coef.df)[1]= "Beta.Val"

#create data frame with just the hrs variables - excluding not busuziness day hours
hrs.df = filter(coef.df, str_detect(rownames(coef.df), regex("hr[0-9]{1,2}$")))
#arrange hrs.df by absolute valuate
hrs.df = arrange(hrs.df, desc(abs(hrs.df$Beta.Val)))

kable(head(hrs.df, 2), caption = "Odds Ratio of Likelihood of Overload for hrs",
      col.names = c("Beta Value", "Hour"))

```

Extracting the effecits of `hr` from `fitlin` regression

```{r}
#extract the coefficients generated by regression to indiciate effect sizes
#since lambda argument is not specified, it is using the min lambda
coef.df.fitlin = coef(fitlin) %>% as.matrix() %>% as.data.frame()
coef.df.fitlin$var.names = rownames(coef.df.fitlin)
colnames(coef.df.fitlin)[1]= "Beta.Val"

#create data frame with just the hrs variables - excluding not busuziness day hours
hrs.df.fitlin = filter(coef.df.fitlin, str_detect(rownames(coef.df.fitlin), regex("hr[0-9]{1,2}$")))
#arrange hrs.df by absolute valuate
hrs.df.fitlin = arrange(hrs.df.fitlin, desc(abs(hrs.df.fitlin$Beta.Val)))

kable(head(hrs.df.fitlin, 10), caption = "Odds Ratio of Likelihood of Overload for hrs in fitlin",
      col.names = c("Beta Value", "Hour"))

```


We can see that by order both the fitlin and the logistic regrssion of overload that we get different hours having impacts on the outcome. for `fitlin` we can see that `hrs3` and `hrs4` have the largest impact on the `log(cnt)` of bikes rented. It is evident that it is not until `hr8` we see that a positive relationship occurrs. The cofficients are essentially telling us that in the very late evening and very early morning, the (log) number of bikes rented goes down. 

As for the logistic regression of `overload`, only `hr7` seems to have a positive impact. However, since this is a logistic regression the coefficient represents the odds ratio. Since the coefficeint is less than 1 we can say that there `hr7` has approximately is 87% less likely to be overloaded, holding all else equal. 

***

**3.3**\
**Suppose that it costs you \$200/hr in overtime pay if you have an overload (cnt > 500) with your usual number of staff. Staffing an extra driver to move the bikes costs only \$100/hr and means you don’t have to pay any overtime. At what probability for overload > 0 will you want to staff an extra driver?**

```{r}
df = data.frame(Regular.Staff = c(-200,0),
                Extra.Employee = c(-100,-100))

rownames(df) = c("Overload", "Don't Overload")

kable(df, col.names = c("Regular Staff", "Extra Employee"))
```

Using expected probability we can see that our $p$ threshold is:

$$ 200(1 - p) - 100p > 0$$
$$ 200 - 200p - 100p > 0$$
$$ 200 - 300p > 0$$
$$ 200 > 300p$$
$$ \frac{200}{300} > p$$
$$ p < .67$$

**3.4**\
**Plot and describe the ROC curve for your AICc-optimal regression from 3.1. What is the sensitivity and specificity of your rule from 3.3 if applied with this regression?**

```{r}
source("roc.R")

#using mmbike matrix
test = gamlr(mmbike, overload.y, family = "binomial")

p = predict(test, 
            mmbike, 
            type = "response")

roc(p=p, y = overload.y)



```
```{r}
p = predict(overload.reg, 
            overload.matrix, 
            type = "response")

roc(p=p, y = overload.y)


prediction.df = data.frame(prediction = p,
                           reality = overload.y)
```

```{r}
prediction.df$threshold.pred = ifelse(prediction.df$seg100 > .67, 1,0)

prediction.df$correct.prediction  = ifelse(prediction.df$threshold.pred == prediction.df$reality, 1, 0)

confusion.matrix = caret::confusionMatrix(prediction.df$reality, prediction.df$threshold.pred)

x = as.table(confusion.matrix, what = "xtabs")

x <- cbind(x, Total = rowSums(x))
x <- rbind(x, Total = colSums(x))
x
```

```{r}



row.names(x) = c("Predicted Not Overloaded", "Predicted Overloaded", "Total")
colnames(x) = c("Predicted Not Overloaded", "Predicted Overloaded", "Total")
x = as.data.frame(x)
kable(x, caption = "Confusion Matrix for Full Data Frame p > .67")
```


$$Sensativity = \frac{\# \ Correct \ Overloaded}{\# \ Overloaded}$$

$$Specificity = \frac{\# \ Incorrectly \ Not \ Overloaded}{\# \ Not \ Overloaded}$$


```{r}
specificity = 1140 / 1140
print(paste0("Sensativity: ",specificity *100,"%"))
sensativity = 139/16329
print(paste0("Specificity: ",(1 - sensativity),"%"))
```

From confusion matrix:

```{r}

confusion.matrix$byClass[1]

confusion.matrix$byClass[2]
```

***
**3.5** \ 

**Now, take the test sample and**

• fit the regression path excluding this sample (e.g., on mmbike[-test,]). 

```{r, message=FALSE, warning=FALSE}
#create test and training split
trainIndex = createDataPartition(biketab$overload, times = 1, p = .8, list = F)

#create training data frame and training outcome variable
train.df = biketab[trainIndex,]

train.y = train.df$overload

#create test data frame and test outcome
test.df = biketab[-(trainIndex),]
test.y = train.df$overload

train.matrix <- sparse.model.matrix(
  overload ~ . + yr*mnth + hr*notbizday, 
	data=naref(train.df))[,-1]

train.reg = gamlr(train.matrix, train.y, family = "binomial")
plot(train.reg, xvar = "lambda")
```

• use the AICc-optimal model from this path to predict for the test set.
```{r}
test.matrix <- sparse.model.matrix(
  overload ~ . + yr*mnth + hr*notbizday, 
	data=naref(test.df))[,-1]

test.y = test.df$overload
pred = predict(train.reg, test.matrix, type = "response")


```

• plot the ‘out-of-sample’ ROC curve for these predictions.
```{r}
```

Compare this curve to your ROC curve from 3.4 and describe what they imply about the quality of AICc selection for this regression.