Myra Hallman Mini Project 3

library(dplyr)
library(caret)
library(ggplot2)
library(forecast)
options(scipen=999)
historical.df <- read.csv("historical_data.csv")
historical.df <-na.omit(historical.df)
new.df <- read.csv("new_data.csv")
new.df <-na.omit(new.df)
purchase.df<- read.csv("historical_data.csv")
purchase.df <- na.omit(purchase.df) 

1. Each catalog costs approximately $2 to mail (including printing, postage, and mailing costs). Estimate the gross profit that the firm could expect from the remaining 199,000 names if it selects them randomly from the pool. (Note from class: get the response rate and the expected profit; Excel is fine for this.)

199000 * (14407 / 500 - 2) 
[1] 5335986
2. Develop a model for classifying a customer as a purchaser or non-purchaser using historical data.
2.a. Pre-process the data to make sure that the binary outcome is recorded as a “factor”.
historical.df$purchase <- factor(
  historical.df$purchase,
  levels = c(1, 0),
  labels = c("purchaser", "non_purchaser")
  )
historical.df$web <- factor(
  historical.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)
historical.df$gender <- factor(
  historical.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)
historical.df$res <- factor(
  historical.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)
historical.df$us <- factor(
  historical.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)
new.df$web <- factor(
  new.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)
new.df$gender <- factor(
  new.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)
new.df$res <- factor(
  new.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)
new.df$us <- factor(
  new.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)

2.b. Run 2 logistic regressions by choosing 2 subsets of the predictors. You can choose any 2 subsets that you want. Make sure that you set up your models to generate predicted probabilities. Use leave-group-out cross-validation, allocating 70% of the data to training.

historical.logit.reg.fit <- train(
  purchase ~ web + gender,
  data = historical.df,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           classProbs=TRUE, savePredictions=TRUE, 
                           summaryFunction = twoClassSummary))
historical2.logit.reg.fit <- train(
  purchase ~ gender + res,
  data = historical.df,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           classProbs=TRUE, savePredictions=TRUE, 
                           summaryFunction = twoClassSummary))

2.c. Compare the results of the models you estimated in 2.b, and select the better model using area under the ROC curve as the decision criterion.

historical.logit.reg.fit$results
  parameter       ROC Sens Spec ROCSD SensSD SpecSD
1      none 0.7474555    0    1    NA     NA     NA
historical2.logit.reg.fit$results
  parameter       ROC Sens Spec ROCSD SensSD SpecSD
1      none 0.5159033    0    1    NA     NA     NA
3. Develop a model for predicting purchase amount using historical data.
3.a. Create a data frame containing only the rows of the historical data for customers who made a purchase (hint: you can use the filter function from the dplyr package).
purchase.df <- read.csv("historical_data.csv")
purchase.df <- filter(purchase.df, purchase==1)
head(purchase.df)
  us feq last_update first_update web gender res purchase spending
1  1   5        2081         2438   0      1   0        1     1416
2  1   2         490         2829   0      0   0        1      204
3  1   2        2802         2834   1      0   0        1      130
4  1   5         542         3125   0      0   0        1      578
5  1   1        2158         2158   1      0   0        1       10
6  1   2        4005         4127   0      1   0        1      141
purchase.df$purchase <- factor(
  purchase.df$purchase,
  levels = c(1, 0),
  labels = c("purchaser", "non_purchaser")
  )
purchase.df$web <- factor(
  purchase.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)
purchase.df$gender <- factor(
  purchase.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)
purchase.df$res <- factor(
  purchase.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)
purchase.df$us <- factor(
  purchase.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)

3.b. Use the subset of the data created in step 3.a. Run 2 multiple linear regressions by choosing 2 subsets of the predictors. You can choose any 2 subsets that you want. Use leave-group-out cross-validation, allocating 70% of the data to training.

linear.reg.fit <- train(
  spending ~ web + gender,
  data = purchase.df,
  method = "lm",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                          savePredictions=TRUE))
linear2.reg.fit<- train(
  spending ~ gender + res,
  data = purchase.df,
  method = "lm",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           savePredictions=TRUE))

3.c. Compare the results of the models you estimated in 3.b, and select the better model using RMSE (root mean square error).

linear.reg.fit$results
  intercept     RMSE   Rsquared      MAE RMSESD RsquaredSD MAESD
1      TRUE 349.6288 0.06974091 174.5637     NA         NA    NA
linear2.reg.fit$results
  intercept     RMSE   Rsquared      MAE RMSESD RsquaredSD MAESD
1      TRUE 315.9691 0.01705332 187.4861     NA         NA    NA
4. For the new data, predict whether the customers will make a purchase and, if so, predict the purchase amount.
4.a. Add a column to the new data with predicted probabilities of purchase from the selected logistic regression. (Hint from class: the probability comes from the logistic regression and the spending from the linear regression; expected profit per customer is p * Spend - 2.)
pred.probs <- predict(
  historical.logit.reg.fit,
  newdata = new.df,
  type = "prob"
)
new.df$pred.purchase.prob <- pred.probs$purchaser

4.b. Add a column to the new data with predicted spending from the selected multiple linear regression.

new.df$spending <- predict(linear2.reg.fit,
          newdata=new.df)

4.c. Assuming that each catalog costs $2, what is the expected profit if you mail the catalog to everyone in the new dataset?

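new.df$expected.value <- new.df$pred.purchase.prob * new.df$spending  # P(purchase) * predicted spending
sum(new.df$expected.value) - (nrow(new.df) * 2)  # subtract $2 for every catalog mailed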

4.d. Assuming that each catalog costs $2, what is the expected profit if you mail the catalog to everyone whose predicted probability of purchase is greater than 50%? What if you mail the catalog to everyone whose predicted probability of purchase is greater than 70%?

---
title: "Mini Project 3"
output: html_notebook
---
Myra Hallman
Mini Project 3

 

```{r}
library(dplyr)
library(caret)
library(ggplot2)
library(forecast)

options(scipen=999)

historical.df <- read.csv("historical_data.csv")
historical.df <-na.omit(historical.df)

new.df <- read.csv("new_data.csv")
new.df <-na.omit(new.df)

purchase.df<- read.csv("historical_data.csv")

purchase.df <- na.omit(purchase.df) 

```


1. Each catalog costs approximately $2 to mail 
(including printing, postage, and mailing costs). 
Estimate the gross profit that the firm could expect from the 
remaining 199,000 names if it selects them randomly from the pool.
(Note from class: get the response rate and the expected profit; Excel is fine for this.)
```{r}
199000 * (14407 / 500 - 2) 
```
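The one-line calculation above can be unpacked into named quantities. This is only a sketch: the notebook does not label the figures, so it assumes that 14407 is the total spending observed across a sample of 500 mailed names. All variable names are made up for illustration.

```r
# Assumed meaning of the figures in the expression above (not stated in the notebook)
total.spending   <- 14407   # assumed: total spending observed in the sample
n.sampled        <- 500     # assumed: number of names in that sample
cost.per.catalog <- 2       # mailing cost per catalog, from the question
n.remaining      <- 199000  # names left in the pool, from the question

# Average spending per mailed name, net of the mailing cost
profit.per.name <- total.spending / n.sampled - cost.per.catalog

# Scale up to the remaining names
gross.profit <- n.remaining * profit.per.name
print(gross.profit)
```

Writing it this way makes it easier to check each input separately if the response rate or the cost changes.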

2. Develop a model for classifying a customer as a purchaser or 
non-purchaser using historical data.
2.a. Pre-process the data to make sure that the binary outcome is 
recorded as a “factor”.
```{r}


historical.df$purchase <- factor(
  historical.df$purchase,
  levels = c(1, 0),
  labels = c("purchaser", "non_purchaser")
  )

historical.df$web <- factor(
  historical.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)

historical.df$gender <- factor(
  historical.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)

historical.df$res <- factor(
  historical.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)

historical.df$us <- factor(
  historical.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)




new.df$web <- factor(
  new.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)

new.df$gender <- factor(
  new.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)

new.df$res <- factor(
  new.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)

new.df$us <- factor(
  new.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)



```
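The recoding above repeats the same `factor()` call for every column and data frame. One possible way to shorten it is a small helper, sketched below in base R. The function `recode.binary`, the list `label.map`, and the toy data frame are hypothetical and not part of the notebook; the labels are copied from the chunk above.

```r
# Hypothetical helper: recode 0/1 columns as factors, with 1 as the first level
recode.binary <- function(df, label.map) {
  for (col in names(label.map)) {
    # Skip columns that are absent from this data frame
    if (col %in% names(df)) {
      df[[col]] <- factor(df[[col]], levels = c(1, 0), labels = label.map[[col]])
    }
  }
  df
}

# Labels follow the chunk above; the list names are the column names
label.map <- list(
  purchase = c("purchaser", "non_purchaser"),
  web      = c("Yes", "No"),
  gender   = c("Male", "Female"),
  res      = c("Residential", "Commercial"),
  us       = c("US", "Non.US")
)

# Toy data standing in for historical.df or new.df
toy.df <- data.frame(web = c(1, 0, 1), gender = c(0, 0, 1), us = c(1, 1, 0))
toy.df <- recode.binary(toy.df, label.map)
str(toy.df)
```

Keeping 1 as the first level matters here because `twoClassSummary` treats the first factor level as the event of interest.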

2.b. Run 2 logistic regressions by choosing 2 subsets of the predictors. 
You can choose any 2 subsets that you want. Make sure that you set up 
your models to generate predicted probabilities. Use leave-group-out 
cross-validation, allocating 70% of the data to training.

```{r}
historical.logit.reg.fit <- train(
  purchase ~ web + gender,
  data = historical.df,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           classProbs=TRUE, savePredictions=TRUE, 
                           summaryFunction = twoClassSummary))

historical2.logit.reg.fit <- train(
  purchase ~ gender + res,
  data = historical.df,
  method = "glm",
  family = "binomial",
  metric = "ROC",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           classProbs=TRUE, savePredictions=TRUE, 
                           summaryFunction = twoClassSummary))

```

2.c. Compare the results of the models you estimated in 2.b, and select 
the better model using area under the ROC curve as the decision criterion.

```{r}
historical.logit.reg.fit$results
historical2.logit.reg.fit$results
```
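The ROC column in `$results` is the area under the ROC curve: the chance that a randomly chosen purchaser gets a higher predicted probability than a randomly chosen non-purchaser, so 0.5 means the ranking is no better than random. The sketch below computes that quantity by hand with the rank-sum formula. The function `auc.rank` and the toy predictions are hypothetical and only illustrate the idea; caret does its own computation.

```r
# AUC as the share of purchaser/non-purchaser pairs that are ranked correctly
auc.rank <- function(prob, is.positive) {
  r  <- rank(prob)          # average ranks, so ties count as half
  n1 <- sum(is.positive)    # number of purchasers
  n0 <- sum(!is.positive)   # number of non-purchasers
  (sum(r[is.positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy held-out predictions (made up for illustration)
prob        <- c(0.9, 0.8, 0.7, 0.3, 0.6, 0.2)
is.positive <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# 8 of the 9 purchaser/non-purchaser pairs are ordered correctly
auc.rank(prob, is.positive)
```

Because AUC only depends on the ranking of the probabilities, a model can have a reasonable ROC value while still showing Sens = 0 at the default 0.5 cutoff.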

3. Develop a model for predicting purchase amount using historical data.
3.a. Create a data frame containing a subset of rows in historical data 
containing only the cases (rows) of the customers who made a purchase 
(hint: you can use filter function from dplyr package)

```{r}
purchase.df <- read.csv("historical_data.csv")

purchase.df <- filter(purchase.df, purchase==1)
head(purchase.df)

purchase.df$purchase <- factor(
  purchase.df$purchase,
  levels = c(1, 0),
  labels = c("purchaser", "non_purchaser")
  )

purchase.df$web <- factor(
  purchase.df$web,
  levels = c(1, 0),
  labels = c("Yes", "No")
)

purchase.df$gender <- factor(
  purchase.df$gender,
  levels = c(1, 0),
  labels = c("Male", "Female")
)

purchase.df$res <- factor(
  purchase.df$res,
  levels = c(1, 0),
  labels = c("Residential", "Commercial")
)

purchase.df$us <- factor(
  purchase.df$us,
  levels = c(1, 0),
  labels = c("US", "Non.US")
)
```

3.b. Use the subset of the data created in step 3.a. Run 2 multiple linear 
regressions by choosing 2 subsets of the predictors. You can choose any 2 
subsets that you want. Use leave-group-out cross-validation, allocating 
70% of the data to training.

```{r}
linear.reg.fit <- train(
  spending ~ web + gender,
  data = purchase.df,
  method = "lm",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                          savePredictions=TRUE))

linear2.reg.fit<- train(
  spending ~ gender + res,
  data = purchase.df,
  method = "lm",
  trControl = trainControl("LGOCV", number=1, p=.7, 
                           savePredictions=TRUE))
```
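With `number=1` and `p=.7`, the setup above amounts to a single random split: fit on about 70% of the rows and measure error on the rest. The base R sketch below shows that idea on simulated data. Everything in it (the toy data frame, the seed, the coefficients) is an assumption for illustration, and it uses plain random sampling, which may differ from how caret builds its partition.

```r
set.seed(1)  # arbitrary seed so the split is reproducible

# Toy data standing in for purchase.df (values are simulated)
n <- 200
toy.df <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  res    = factor(sample(c("Residential", "Commercial"), n, replace = TRUE))
)
toy.df$spending <- 100 + 40 * (toy.df$gender == "Male") + rnorm(n, sd = 25)

# One random split: 70% of the rows for training, the rest held out
train.rows <- sample(seq_len(n), size = floor(0.7 * n))
train.df   <- toy.df[train.rows, ]
test.df    <- toy.df[-train.rows, ]

# Fit on the training rows, then score on the held-out rows
fit  <- lm(spending ~ gender + res, data = train.df)
pred <- predict(fit, newdata = test.df)
rmse <- sqrt(mean((test.df$spending - pred)^2))
rmse
```

Since there is only one split, the RMSE depends on which rows land in the holdout, which is also why the SD columns in `$results` are `NA`.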

3.c. Compare the results of the models you estimated in 3.b, and select the 
better model using RMSE (root mean square error).

```{r}
linear.reg.fit$results
linear2.reg.fit$results
```

4. For the new data, predict whether the customers will make a purchase and, if so,
predict the purchase amount.
4.a. Add a column to the new data with predicted probabilities of purchase from the 
selected logistic regression.
(Hint from class: the probability comes from the logistic regression and the spending 
from the linear regression; expected profit per customer is p * Spend - 2.)
```{r}
pred.probs <- predict(
  historical.logit.reg.fit,
  newdata = new.df,
  type = "prob"
)
new.df$pred.purchase.prob <- pred.probs$purchaser
```


4.b. Add a column to the new data with predicted spending from the selected multiple 
linear regression.
```{r}
new.df$spending <- predict(linear2.reg.fit,
          newdata=new.df)


```


4.c. Assuming that each catalog costs $2, what is the expected profit if you mail the 
catalog to everyone in the new dataset?


```{r}
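new.df$expected.value <- new.df$pred.purchase.prob * new.df$spending  # P(purchase) * predicted spending
sum(new.df$expected.value) - (nrow(new.df) * 2)  # subtract $2 for every catalog mailed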
```

4.d. Assuming that each catalog costs $2, what is the expected profit if you mail the 
catalog to everyone whose predicted probability of purchase is greater than 50%? 
What if you mail the catalog to everyone whose predicted probability of purchase is 
greater than 70%?

```{r}

```
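One way to approach 4.d is to apply the p * Spend - 2 rule only to the customers above the cutoff. The sketch below does this on made-up numbers. The helper `expected.profit` and the toy vectors are hypothetical; in the notebook the inputs would presumably be `new.df$pred.purchase.prob` and `new.df$spending`.

```r
# Hypothetical helper: expected profit from mailing only customers above a cutoff
expected.profit <- function(prob, spend, cutoff = -Inf, cost = 2) {
  mailed <- prob > cutoff                      # who receives a catalog
  sum(prob[mailed] * spend[mailed] - cost)     # p * Spend - cost, per mailed customer
}

# Toy predictions (made up for illustration)
prob  <- c(0.90, 0.75, 0.60, 0.40, 0.10)
spend <- c(100, 50, 20, 200, 30)

expected.profit(prob, spend)                # mail everyone
expected.profit(prob, spend, cutoff = 0.5)  # only probability > 50%
expected.profit(prob, spend, cutoff = 0.7)  # only probability > 70%
```

A higher cutoff saves mailing cost but also gives up customers with a low probability and high predicted spending, so the totals need to be compared rather than assumed.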