---
title: "Homework 9 due 4/27/20"
output: html_notebook
---

## Part A - Kaggle (20%)

**Make at least one submission to the Kaggle competition beyond the first sample_submission you already made.** This means you should have at least 2 entries on the Leaderboard, and your RMSE score must be < 107.28565. If you have already done this, then you are done with this part of the homework.  
(Eventually you will need to have a total of at least 5 submissions by May 3, Sunday, 11:59 pm.)

## Part B - Multiple Regression & Logistic Regression Practice (80%)

Do your work in this R Notebook file, showing all the outputs and answers along with the code. (If you have trouble working with an R notebook file, you may want to review this [video](https://youtu.be/0dD2U7rrGaA), the same one I mentioned before.)  
**Enter your answers on the Blackboard quizzes provided.** Submitting the R notebook is optional.

### 1. More Catalog Marketing
This exercise practices using different models for subsets of the data and computing RMSEs. It is related to what we did in the previous homework and in class.  
First, run the code below to read the two data sets into their respective data frames, as before.

```{r}
train <- read.csv("CatalogTrain.csv")    
test <- read.csv("CatalogTest.csv")
```

Recall that the model we used in class was:

```{r}
model <- lm(log(Amount.Spent) ~ Close + Salary + Children + Previous.Customer + 
              Previous.Spent + Catalogs, data = train)
```

This gave an RMSE of **606.31** on the train data and **626.67** on the test data.
              
(a) In **train**, create a new variable **logAmt** equal to the log of Amount.Spent, so that we can work with logAmt instead of log(Amount.Spent).  
Use the *pairs.panels* function from the psych package to create a scatterplot matrix of these variables:  
Salary, Children, Previous.Spent, Catalogs, logAmt.  
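
As a minimal sketch of this step, the block below builds the log column and the scatterplot matrix on made-up data. The data frame `toy` and all of its values are invented for illustration and are not the catalog data. It assumes the psych package may or may not be installed, and falls back to base R's `pairs` if it is not.

```r
# Toy data standing in for train (all values are invented).
set.seed(1)
toy <- data.frame(
  Salary = round(runif(50, 20000, 120000)),
  Children = sample(0:3, 50, replace = TRUE),
  Previous.Spent = round(runif(50, 0, 2000)),
  Catalogs = sample(c(6, 12, 18, 24), 50, replace = TRUE),
  Amount.Spent = round(runif(50, 50, 3000))
)

# New column holding the natural log of Amount.Spent.
toy$logAmt <- log(toy$Amount.Spent)

# Scatterplot matrix of the selected columns.
vars <- c("Salary", "Children", "Previous.Spent", "Catalogs", "logAmt")
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(toy[, vars])
} else {
  pairs(toy[, vars])  # base R fallback, without correlations or histograms
}
```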

(b) Look at the scatterplot between **Previous.Spent** and **logAmt**. Notice there are many zero values for Previous.Spent. If Previous.Customer = No, the customer did not make any purchases the previous year, so Previous.Spent must be 0. Verify that Previous.Spent is 0 for every customer with Previous.Customer = No.
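
One possible way to run this kind of check, sketched on a tiny invented data frame (`toy` and its values are hypothetical): pull out the Previous.Spent values for the "No" rows and test whether all of them are zero.

```r
# Tiny invented example; the real check would use train instead of toy.
toy <- data.frame(
  Previous.Customer = c("No", "Yes", "No", "Yes", "Yes"),
  Previous.Spent = c(0, 250, 0, 0, 1200)
)

# Previous.Spent values for the rows where Previous.Customer is "No".
spentNo <- toy$Previous.Spent[toy$Previous.Customer == "No"]
summary(spentNo)

# TRUE only if every one of those values is exactly 0.
allZero <- all(spentNo == 0)
allZero
```

Note that the reverse does not have to hold: a previous customer can also have Previous.Spent equal to 0, as in the fourth toy row.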

(c) For customers who are not previous customers, it makes no sense to use Previous.Spent as a predictor.  So we will try building separate regression models for the customers who are previous customers and those who are not.  Create 2 data frames from train as follows:   
trainYes = subset of train for which Previous.Customer == Yes  
trainNo = subset of train for which Previous.Customer == No
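
A short sketch of splitting a data frame on a Yes/No column with base R's `subset`, again on invented data (`toy`, `toyYes` and `toyNo` are hypothetical names):

```r
# Invented data; the assignment applies the same idea to train.
toy <- data.frame(
  Previous.Customer = c("No", "Yes", "No", "Yes", "Yes"),
  Amount.Spent = c(120, 900, 75, 430, 1500)
)

toyYes <- subset(toy, Previous.Customer == "Yes")
toyNo <- subset(toy, Previous.Customer == "No")

# Sanity check: the two pieces together account for every row.
nrow(toyYes) + nrow(toyNo) == nrow(toy)
```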

(d) Now build separate scatterplot matrices for the trainYes and trainNo data frames. For trainNo, Previous.Spent is always 0, so we don't want to include that variable.  
For trainNo, the variables should be Salary, Children, Catalogs, logAmt.  
For trainYes, the variables should be Salary, Children, Previous.Spent, Catalogs, logAmt.

(e) For trainYes, try running a linear regression model with y = logAmt and X = Close, Salary, Children, Previous.Spent and Catalogs. Let's call this model1. Are all the x variables statistically significant?

(f) Now, test for multicollinearity with the vif function from the car package.
Which two variables have a VIF value higher than 5?
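
To show what a VIF measures, here is a sketch that computes it by hand on simulated data: regress each predictor on the other predictors and take 1 / (1 - R^2). The variables `x1`, `x2`, `x3` and the helper `vifByHand` are invented for illustration. If the car package is installed, `car::vif` on the same model should give matching values for this kind of model with numeric predictors only.

```r
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
y <- 1 + x1 + x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# VIF for one predictor: regress it on the other predictors, then 1 / (1 - R^2).
vifByHand <- function(name, data, predictors) {
  others <- setdiff(predictors, name)
  fit <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(fit)$r.squared)
}

predictors <- c("x1", "x2", "x3")
vifs <- sapply(predictors, vifByHand, data = toy, predictors = predictors)
round(vifs, 2)

# Optional comparison with the car package, if it is available.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(lm(y ~ x1 + x2 + x3, data = toy)))
}
```

In this simulation `x1` and `x2` carry almost the same information, so both get a large VIF while `x3` stays near 1. Dropping either one of the collinear pair is what brings the remaining VIFs back down.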

(g) Let's try removing one of these two variables. Run 2 regressions:   
**model2** with x = Close, Salary, Children, Catalogs and   
**model3** with x = Close, Children, Previous.Spent and Catalogs.    
Compare the RMSE and Adj R-squared of the two regression models.
Based on these, which one should you keep?

(h) Now, let's decide on a regression model for **trainNo** data frame. These customers were not previous customers, so we will not include the Previous.Spent variable.  For trainNo, run a regression model with y = logAmt and x = Close, Salary, Children, Catalogs. Call this model **model4**.

(i) There will be one variable that is not significant. Remove this variable and run the regression again. Call this **model5**.

(j) Run the predict function with model2 on trainYes and model5 on trainNo. Call the results **predTrainYes** and **predTrainNo**, respectively. What are the RMSEs of these predictions (remember, the rmse function comes from the Metrics package)? Don't forget: since the predictions are for logAmt, use exp to get the predicted Amount.Spent.
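
The back-transform step is easy to forget, so here is a sketch of the full chain on simulated data: fit on the log scale, predict, apply `exp`, then compute the RMSE in the original units. The data frame `toy`, the model `toyModel`, and the coefficients used to simulate the data are all invented. The RMSE is computed from its definition, with `Metrics::rmse` shown only if that package is installed.

```r
set.seed(3)
toy <- data.frame(
  Salary = runif(100, 20000, 120000),
  Catalogs = sample(c(6, 12, 18, 24), 100, replace = TRUE)
)
# Simulated spending whose log is linear in the predictors (invented numbers).
toy$Amount.Spent <- exp(4 + 0.00002 * toy$Salary + 0.04 * toy$Catalogs +
                          rnorm(100, sd = 0.3))
toy$logAmt <- log(toy$Amount.Spent)

toyModel <- lm(logAmt ~ Salary + Catalogs, data = toy)

predLog <- predict(toyModel, newdata = toy)  # predictions on the log scale
predAmt <- exp(predLog)                      # back to the Amount.Spent scale

# RMSE from its definition: square root of the mean squared error.
rmseByHand <- sqrt(mean((toy$Amount.Spent - predAmt)^2))
rmseByHand

# Same quantity from the Metrics package, if it is available.
if (requireNamespace("Metrics", quietly = TRUE)) {
  print(Metrics::rmse(toy$Amount.Spent, predAmt))
}
```

Comparing `toy$Amount.Spent` against `predLog` instead of `predAmt` would mix dollars with log-dollars and give a meaningless RMSE.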

(k) So far, we have obtained the RMSEs separately by running different models on trainYes and trainNo, the subsets of the train data. Run the following code to combine them and find the overall RMSE.

```{r}
# Add predicted Amount.Spent column to the two data frames.
trainYes1 <- cbind(trainYes, pred.Amount = exp(predTrainYes))  
trainNo1 <- cbind(trainNo, pred.Amount = exp(predTrainNo))

# Combine the two data frames to make one data frame then get the rmse from it.
trainCombined <- rbind(trainYes1, trainNo1)
rmse(trainCombined$Amount.Spent, trainCombined$pred.Amount)
```
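
The reason for stacking the rows first is that the overall RMSE is not the simple average of the two subset RMSEs. It pools the squared errors, weighted by the number of rows in each subset. The sketch below checks this on a few invented numbers (all vectors and the helper `rmseOf` are hypothetical).

```r
# Invented actual and predicted values for two subsets.
actualYes <- c(100, 250, 400)
predYes <- c(120, 230, 390)
actualNo <- c(50, 80)
predNo <- c(70, 60)

rmseOf <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmseYes <- rmseOf(actualYes, predYes)
rmseNo <- rmseOf(actualNo, predNo)

# Overall RMSE from the stacked rows, as rbind followed by rmse does.
overall <- rmseOf(c(actualYes, actualNo), c(predYes, predNo))

# Same value from the subset RMSEs, weighting squared errors by row counts.
nYes <- length(actualYes)
nNo <- length(actualNo)
pooled <- sqrt((nYes * rmseYes^2 + nNo * rmseNo^2) / (nYes + nNo))

all.equal(overall, pooled)
```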

(l) Now let's use model2 and model5 on the appropriate rows of the test data. Build 2 smaller data frames from the test data as you did with train data in (c). They should be named **testNo** and **testYes**.

(m) Repeat parts (j) and (k) on the data frames testYes and testNo to find the two separate RMSEs, then the overall RMSE.

(n) Recall that the RMSEs of the previous model were 606.31 (train) and 626.67 (test). Compare these to the new overall RMSEs from the train data (from (k)) and the test data (from (m)). Are the new RMSEs better or worse for the train data? What about for the test data?


### 2. Lending Club
In the lending industry, investors provide loans to borrowers in exchange for the promise of repayment with interest. If the borrower repays the loan, then the lender profits from the interest. However, if the borrower is unable to repay the loan (default), then the lender loses money. Therefore, lenders face the problem of predicting the risk of a borrower being unable to repay a loan.  
**LendingClub.csv** contains data from [lendingclub.com](https://www.lendingclub.com/), a website that connects borrowers and investors over the Internet. This dataset represents a sample of loans that were funded through the lendingclub.com platform between 2007 and 2015. The definitions of the columns are in the file **LendingClub_DataDescription.csv**.   

Here, the dependent variable is **loan_status**.   
Based on this data, we would like to build a model that predicts which borrowers will pay back the loan in full, that is, loan_status = 1. We use logistic regression, which computes for each loan the probability that it will be paid in full.

(a)	Read the data into a data frame **lending** and run str().  How many observations are in this data set?

(b) How many loans were paid in full? What proportion is that?

(c) Create the training data and the test data as follows. Load the package caTools and set the seed to 20. Use sample.split to randomly allocate 80% of the rows to the data frame **train** and the remaining 20% to the data frame **test**. Recall that using the sample.split function with lending$loan_status ensures that the distribution of loan_status is similar between the training set and the test set. (Use the set of commands we went over in class.) What proportion of loans were paid in full in train? What about in test?
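
A sketch of a split that preserves the outcome proportions, on an invented data frame (`toy`, `toyTrain`, `toyTest` are hypothetical names). It uses `caTools::sample.split` when that package is installed. The base R branch is only a stand-in that samples 80% within each class, included so the sketch runs either way, and is not the method covered in class.

```r
set.seed(20)
# Invented data: 80 loans paid in full (1) and 20 not paid in full (0).
toy <- data.frame(id = 1:100, loan_status = rep(c(1, 0), times = c(80, 20)))

if (requireNamespace("caTools", quietly = TRUE)) {
  # sample.split returns a logical vector: TRUE marks rows for the training set.
  inTrain <- caTools::sample.split(toy$loan_status, SplitRatio = 0.8)
} else {
  # Stand-in: sample 80% of the rows within each loan_status class.
  inTrain <- rep(FALSE, nrow(toy))
  for (cls in unique(toy$loan_status)) {
    rows <- which(toy$loan_status == cls)
    inTrain[sample(rows, round(0.8 * length(rows)))] <- TRUE
  }
}

toyTrain <- toy[inTrain, ]
toyTest <- toy[!inTrain, ]

# Proportion paid in full in each piece; these should be close to each other.
mean(toyTrain$loan_status)
mean(toyTest$loan_status)
```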

(d) Let's work with the **train** data first. Build a logistic regression with the following predictors:
log(annual_inc), delinq_2yrs, dti, int_rate, loan_amnt. Here, log is the natural log of the annual income. We use the log because annual income is highly right-skewed. [If you are annoyed by scientific notation like -1.507e-05 in the output, you can suppress it by running this command before asking for the output: options(scipen = 999). All output from that point on will avoid scientific notation. To undo this later in the session, use the command: options(scipen = 0).]
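
As a sketch of the general pattern, the block below fits a logistic regression with `glm` and `family = binomial` on simulated data. The data frame `toy`, the model `toyLogit`, and the coefficients used to simulate the outcome are invented, and only two predictors are used to keep it short.

```r
set.seed(4)
n <- 300
toy <- data.frame(
  annual_inc = exp(rnorm(n, mean = 11, sd = 0.5)),  # right-skewed incomes
  int_rate = runif(n, 5, 25)
)
# Simulated outcome: higher income helps, higher interest rate hurts (invented).
z <- -8 + 1.0 * log(toy$annual_inc) - 0.1 * toy$int_rate
toy$loan_status <- rbinom(n, 1, plogis(z))

# family = binomial is what makes glm fit a logistic regression.
toyLogit <- glm(loan_status ~ log(annual_inc) + int_rate,
                data = toy, family = binomial)
summary(toyLogit)
coef(toyLogit)
```

The log transform can be written directly in the formula, so there is no need to add a separate log-income column first.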

(e)	According to the regression coefficients, increasing which variables will increase the probability of the loan being paid in full?  Does this make sense?

(f) Write down the expression for the logit, z. (Use this kind of format: z = 2.567 + 0.435 x - 1.345 y)

(g) Suppose you have the following data about borrower A. Annual income is $73,500, the debt-to-income ratio is 13.0, there was one delinquency in the past two years, and the loan amount he wants is $20,000 at an interest rate of 8%. What is the probability that borrower A will pay back the loan in full? (Compute it by plugging the variable values into the regression to get z, then plugging this z value into the expression 1/(1+exp(-z)).) What are the odds of him paying back in full? Would you lend to this borrower?   
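
The arithmetic from logit to probability to odds can be sketched with made-up numbers. The coefficients and inputs below are invented and have nothing to do with the fitted model. They only show the order of the steps and the fact that the odds equal exp(z).

```r
# Invented coefficients and inputs, in the form z = b0 + b1 * x + b2 * y.
b0 <- 0.5
b1 <- 0.8
b2 <- -0.3
x <- 2
y <- 3

z <- b0 + b1 * x + b2 * y  # the logit
p <- 1 / (1 + exp(-z))     # probability from the logit
odds <- p / (1 - p)        # odds from the probability

c(z = z, p = p, odds = odds)

# Two identities worth knowing: p matches plogis(z), and odds match exp(z).
all.equal(p, plogis(z))
all.equal(odds, exp(z))
```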

(h)	Now compute the probability directly with the predict function. You should get the same probability you computed above.

(i) For all the observations in the train data set, use the **predict** function to compute the probabilities of paying back in full. Name the result **predTrain**. Since you are using the train data that the model came from, you don't need to specify newdata in the predict function. Verify that this is a vector of 9,249 probability values.
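
A sketch of both uses of `predict` on a simulated logistic model (`toy`, `toyLogit` and the borrower values are invented). The key detail is `type = "response"`: without it, `predict` on a `glm` returns the logit z rather than a probability.

```r
set.seed(5)
n <- 300
toy <- data.frame(
  annual_inc = exp(rnorm(n, mean = 11, sd = 0.5)),
  int_rate = runif(n, 5, 25)
)
toy$loan_status <- rbinom(
  n, 1, plogis(-8 + log(toy$annual_inc) - 0.1 * toy$int_rate)
)
toyLogit <- glm(loan_status ~ log(annual_inc) + int_rate,
                data = toy, family = binomial)

# One invented borrower, with the same column names the model was fit on.
borrower <- data.frame(annual_inc = 60000, int_rate = 10)

# Probability by hand: plug the values into the fitted logit, then transform.
b <- coef(toyLogit)
zManual <- b[["(Intercept)"]] + b[["log(annual_inc)"]] * log(60000) +
  b[["int_rate"]] * 10
pManual <- 1 / (1 + exp(-zManual))

# Probability from predict; type = "response" asks for probabilities.
pPredict <- predict(toyLogit, newdata = borrower, type = "response")
all.equal(pManual, unname(pPredict))

# Without newdata, predict returns one probability per row used in the fit.
predToy <- predict(toyLogit, type = "response")
length(predToy) == nrow(toy)
```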

(j) Now, for all the observations in the **test** data set, use the **predict** function to compute the probabilities of paying back in full. Name the result **predTest**. Display the first 10 values.