Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_3.rmd

Unit 4: Trees

Preliminaries

A quick primer on Classification And Regression Trees (CART)

Classification trees, as the name implies, are used to separate the dataset into classes of the response variable. Often the response variable has two classes: Yes or No (1 or 0). If the target variable has more than two categories, CART still applies, although related algorithms such as C4.5 are also commonly used in that setting. For binary splits, the standard CART procedure is used. Thus classification trees are used when the response or target variable is categorical in nature.

Regression trees are needed when the response variable is numeric or continuous, for example the predicted price of a consumer good. Thus regression trees are applicable for prediction problems, as opposed to classification.

Keep in mind that in either case, the predictors or independent variables may be categorical or numeric. It is the target variable that determines the type of decision tree needed.

which-type-of-decision-tree-to-use
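To make the distinction concrete, here is a minimal sketch using R's rpart package; the data frame and variable names are made up for illustration, not taken from the course data:

library(rpart)

# Hypothetical data: one categorical outcome, one numeric outcome
df <- data.frame(x1 = rnorm(100), x2 = runif(100))
df$class_y <- factor(ifelse(df$x1 > 0, "Yes", "No"))   # categorical response
df$num_y   <- 3 * df$x1 + rnorm(100)                   # numeric response

# Classification tree: use method = "class" for a categorical response
class_tree <- rpart(class_y ~ x1 + x2, data = df, method = "class")

# Regression tree: use method = "anova" for a numeric response
reg_tree <- rpart(num_y ~ x1 + x2, data = df, method = "anova")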

Judge, Jury & Classifier: An Introduction to Trees (1)

From the course slides:

ABOUT THE DATA

Cases from 1994 through 2001.

In this period, the same nine justices presided over SCOTUS:

  • Breyer, Ginsburg, Kennedy, O’Connor, Rehnquist (Chief Justice), Scalia, Souter, Stevens, Thomas
  • A rare data set: the longest period of time with the same set of justices in over 180 years.

We will focus on predicting Justice Stevens’ decisions:

  • Started out moderate, but became more liberal
  • Self-proclaimed conservative

The Variables

In this problem, our dependent variable is whether or not Justice Stevens voted to reverse the lower court decision.
This is a binary variable, Reverse, taking values:

  • 1 if Justice Stevens decided to reverse or overturn the lower court decision.
  • 0 if Justice Stevens voted to affirm or maintain the lower court decision.

Our independent variables are six different properties of the case.

  • Circuit: Circuit court of origin, 13 courts (1st-11th, DC, FED).
  • Issue: Issue area of case, 11 areas (e.g., civil rights, federal taxation).
  • Petitioner: type of petitioner, 12 categories (e.g., US, an employer).
  • Respondent: type of respondent, 12 categories (same as for Petitioner).
  • LowerCourt: Ideological direction of lower court decision (this was based on the judgment by the authors of the study), 2 categories:
    • conservative
    • liberal
  • Unconst: whether petitioner argued that a law/practice was unconstitutional, binary variable.

VIDEO 1: THE SUPREME COURT

QUICK QUESTION

How much data do you think Andrew Martin should use to build his model?

Ans:Information from all cases with the same set of justices as those he is trying to predict. Data from cases where the justices were different might just add noise to our problem.

EXPLANATION:Andrew Martin should use all data from the cases with the same set of justices. The justices do not change every year, and typically you want to use as much data as you have available.

VIDEO 2: CART

QUICK QUESTION

Suppose that you have the following CART tree:

QQ2_SupremeCourt

How many splits are in this tree?

Ans:3

For which data observations should we predict “Red”, according to this tree? Select all that apply.

Ans: If X is less than 60, and Y is any value. & If X is greater than or equal to 60 and less than 85, and Y is less than 20.

EXPLANATION:This tree has three splits. The first split says to predict “Red” if X is less than 60, regardless of the value of Y. Otherwise, we move to the second split. The second split says to check the value of Y - if it is greater than or equal to 20, predict “Gray”. Otherwise, we move to the third split. This split checks the value of X again. If X is less than 85 (and greater than or equal to 60 by the first split) and Y is less than 20, then we predict “Red”. Otherwise, we predict “Gray”.

VIDEO 3: SPLITTING AND PREDICTIONS

QUICK QUESTION

Suppose you have a subset of 20 observations, where 14 have outcome A and 6 have outcome B. What proportion of observations have outcome A?

14/(14+6)
## [1] 0.7

Ans: 0.7 (The fraction of observations that have outcome A is 14/20 = 0.7.)

QUICK QUESTION

The following questions ask about the subset of 20 observations from the previous question.

If we set the threshold to 0.25 when computing predictions of outcome A, will we predict A or B for these observations?

Ans:A

If we set the threshold to 0.5 when computing predictions of outcome A, will we predict A or B for these observations?

Ans:A

If we set the threshold to 0.75 when computing predictions of outcome A, will we predict A or B for these observations?

Ans:B

EXPLANATION:Since 70% of these observations have outcome A, we will predict A if the threshold is below 0.7, and we will predict B if the threshold is above 0.7.
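These answers follow mechanically from comparing the proportion of outcome A to each threshold. As a quick check, a one-liner in R:

# Proportion of outcome A is 14/20 = 0.7; predict A when it meets the threshold
prop_A <- 14/20
sapply(c(0.25, 0.5, 0.75), function(t) ifelse(prop_A >= t, "A", "B"))
## [1] "A" "A" "B"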

VIDEO 4: CART IN R (R script reproduced here)

Logistic Regression

We can first try the logistic regression classification technique.

stevens_full <- read.csv("stevens.csv", stringsAsFactors = TRUE)
str(stevens_full)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
#Some of the variables, namely `Docket` and `Term`, are not interesting for our purpose, so we remove them.
stevens <- stevens_full[, -c(1, 2)]

#logistic model
# model_LogRegr <- glm(Reverse ~ . - Docket - Term, data = stevens, family = binomial)
model_LogRegr <- glm(Reverse ~ ., data = stevens, family = binomial)

summary(model_LogRegr)
## 
## Call:
## glm(formula = Reverse ~ ., family = binomial, data = stevens)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4748  -0.9222   0.4212   0.8805   2.2353  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                            0.54410    2.08161   0.261 0.793797
## Circuit11th                            0.99854    0.55292   1.806 0.070926
## Circuit1st                             0.76049    0.72138   1.054 0.291783
## Circuit2nd                             1.42816    0.58598   2.437 0.014801
## Circuit3rd                             0.04341    0.61757   0.070 0.943960
## Circuit4th                             1.48778    0.60068   2.477 0.013255
## Circuit5th                             1.68235    0.56190   2.994 0.002753
## Circuit6th                             1.48956    0.59973   2.484 0.013002
## Circuit7th                             0.49603    0.55787   0.889 0.373925
## Circuit8th                             0.26834    0.55420   0.484 0.628253
## Circuit9th                             1.14267    0.47906   2.385 0.017069
## CircuitDC                              0.61482    0.60133   1.022 0.306578
## CircuitFED                             0.25634    0.64661   0.396 0.691786
## IssueCivilRights                      -0.03997    1.40847  -0.028 0.977360
## IssueCriminalProcedure                -0.15254    1.41269  -0.108 0.914012
## IssueDueProcess                        0.30429    1.43514   0.212 0.832085
## IssueEconomicActivity                  0.13116    1.38507   0.095 0.924557
## IssueFederalismAndInterstateRelations  0.13246    1.44584   0.092 0.927005
## IssueFederalTaxation                  -0.86018    1.54330  -0.557 0.577278
## IssueFirstAmendment                   -0.75179    1.42809  -0.526 0.598592
## IssueJudicialPower                    -0.08953    1.38479  -0.065 0.948449
## IssuePrivacy                           1.54454    1.61703   0.955 0.339492
## IssueUnions                           -0.44031    1.48202  -0.297 0.766390
## PetitionerBUSINESS                     1.28748    1.31922   0.976 0.329094
## PetitionerCITY                         0.21754    1.49548   0.145 0.884343
## PetitionerCRIMINAL.DEFENDENT           2.13822    1.33785   1.598 0.109987
## PetitionerEMPLOYEE                     1.78408    1.37351   1.299 0.193973
## PetitionerEMPLOYER                     0.65596    1.45800   0.450 0.652779
## PetitionerGOVERNMENT.OFFICIAL          1.30488    1.35869   0.960 0.336856
## PetitionerINJURED.PERSON               0.59468    1.51320   0.393 0.694323
## PetitionerOTHER                        1.50279    1.29944   1.156 0.247482
## PetitionerPOLITICIAN                   1.02443    1.39867   0.732 0.463904
## PetitionerSTATE                        1.13253    1.36351   0.831 0.406202
## PetitionerUS                           2.03763    1.36048   1.498 0.134206
## RespondentBUSINESS                    -1.71957    0.82967  -2.073 0.038211
## RespondentCITY                        -1.87125    1.11065  -1.685 0.092021
## RespondentCRIMINAL.DEFENDENT          -3.05773    0.86421  -3.538 0.000403
## RespondentEMPLOYEE                    -1.81206    0.91460  -1.981 0.047562
## RespondentEMPLOYER                    -0.90141    1.06608  -0.846 0.397815
## RespondentGOVERNMENT.OFFICIAL         -2.56409    0.97349  -2.634 0.008440
## RespondentINJURED.PERSON              -3.24236    1.03590  -3.130 0.001748
## RespondentOTHER                       -2.05311    0.79489  -2.583 0.009798
## RespondentPOLITICIAN                  -1.58367    0.95899  -1.651 0.098658
## RespondentSTATE                       -1.72107    0.91967  -1.871 0.061290
## RespondentUS                          -2.84583    0.88542  -3.214 0.001308
## LowerCourtliberal                     -1.16242    0.25050  -4.640 3.48e-06
## Unconst                                0.08061    0.27981   0.288 0.773278
##                                          
## (Intercept)                              
## Circuit11th                           .  
## Circuit1st                               
## Circuit2nd                            *  
## Circuit3rd                               
## Circuit4th                            *  
## Circuit5th                            ** 
## Circuit6th                            *  
## Circuit7th                               
## Circuit8th                               
## Circuit9th                            *  
## CircuitDC                                
## CircuitFED                               
## IssueCivilRights                         
## IssueCriminalProcedure                   
## IssueDueProcess                          
## IssueEconomicActivity                    
## IssueFederalismAndInterstateRelations    
## IssueFederalTaxation                     
## IssueFirstAmendment                      
## IssueJudicialPower                       
## IssuePrivacy                             
## IssueUnions                              
## PetitionerBUSINESS                       
## PetitionerCITY                           
## PetitionerCRIMINAL.DEFENDENT             
## PetitionerEMPLOYEE                       
## PetitionerEMPLOYER                       
## PetitionerGOVERNMENT.OFFICIAL            
## PetitionerINJURED.PERSON                 
## PetitionerOTHER                          
## PetitionerPOLITICIAN                     
## PetitionerSTATE                          
## PetitionerUS                             
## RespondentBUSINESS                    *  
## RespondentCITY                        .  
## RespondentCRIMINAL.DEFENDENT          ***
## RespondentEMPLOYEE                    *  
## RespondentEMPLOYER                       
## RespondentGOVERNMENT.OFFICIAL         ** 
## RespondentINJURED.PERSON              ** 
## RespondentOTHER                       ** 
## RespondentPOLITICIAN                  .  
## RespondentSTATE                       .  
## RespondentUS                          ** 
## LowerCourtliberal                     ***
## Unconst                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 779.86  on 565  degrees of freedom
## Residual deviance: 622.81  on 519  degrees of freedom
## AIC: 716.81
## 
## Number of Fisher Scoring iterations: 4

We get a model where some of the most significant variables are:

  • whether or not the case is from the 2nd circuit court, with a coefficient of 1.428
  • whether or not the case is from the 4th circuit court, with a coefficient of 1.488
  • whether or not the lower court decision was liberal, with a coefficient of -1.162

While this tells us that the case being from the 2nd or 4th circuit courts is predictive of Justice Stevens reversing the case, and the lower court decision being liberal is predictive of Justice Stevens affirming the case, it’s difficult to understand which factors are more important due to things like the scales of the variables, and the possibility of multicollinearity.

It’s also difficult to quickly evaluate what the prediction would be for a new case.

Classification and Regression Trees (CART)

So instead of logistic regression, Martin and his colleagues used a method called classification and regression trees, or CART.

This method builds what is called a tree by splitting on the values of the independent variables.
To predict the outcome for a new observation or case, you can follow the splits in the tree and at the end, you predict the most frequent outcome in the training set that followed the same path.

Some advantages of CART are that:

  • it does not assume a linear model, like logistic regression or linear regression, and
  • it is a very interpretable model.

Example of CART splits

This plot shows sample data for two independent variables, x and y, and each data point is colored by the outcome variable, red or gray.
CART tries to split this data into subsets so that each subset is as pure or homogeneous as possible. The first three splits that CART would create are shown here.

example

Then the standard prediction made by a CART model is just a majority vote within each subset.

A CART model is represented by what we call a tree.

example

The tree for the splits we just generated is shown on the right.

  • The first split tests whether the variable \(x < 60\).
    • If yes, the model says to predict red, and
    • If no, the model moves on to the next split.
  • The second split checks whether or not \(y < 20\).
    • If no, the model says to predict gray.
    • If yes, the model moves on to the next split.
  • The third split checks whether or not \(x < 85\).
    • If yes, then the model says to predict red, and
    • if no, the model says to predict gray.
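These three splits can also be written out as plain decision rules. Here is a minimal sketch in R, where x and y are hypothetical vectors of coordinates rather than variables from the course data:

# The tree's three splits expressed as nested decision rules
predict_color <- function(x, y) {
  ifelse(x < 60, "Red",                          # first split
         ifelse(y >= 20, "Gray",                 # second split
                ifelse(x < 85, "Red", "Gray")))  # third split
}

predict_color(x = c(40, 70, 70, 90), y = c(50, 10, 30, 10))
## [1] "Red"  "Red"  "Gray" "Gray"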

CART and splitting

The previous example shows a CART tree with three splits, but why not two, or four, or even five?

There are different ways to control how many splits are generated.

  • One way is by setting a lower bound for the number of data points in each subset.
    In R, this is called the minbucket parameter, for the minimum number of observations in each bucket or subset.
    The smaller minbucket is, the more splits will be generated. But if it’s too small, overfitting will occur.
    This means that CART will fit the training set almost perfectly.
    This is bad because then the model will probably not perform well on test set data or new data.
    On the other hand, if the minbucket parameter is too large, the model will be too simple and the accuracy will be poor.

Predictions from CART

In each subset of a CART tree, we have a bucket of observations, which may contain both possible outcomes. In the Supreme Court case, we will be classifying observations as either affirm or reverse, again a binary outcome, as in the example shown above.

In the example we classified each subset as either red or gray depending on the majority in that subset.

Instead of just taking the majority outcome as the prediction, we can compute the percentage of observations in a subset with each type of outcome. As an example, if we have a subset with 10 affirms and two reverses, then 83% of the data is affirm.
Then, just like in logistic regression, we can use a threshold value to obtain our prediction.
For this example, we would predict affirm with a threshold of 0.5, since the majority is affirm.

But if we increase that threshold to 0.9, we would predict reverse for this example.

Then by varying the threshold value, we can compute an ROC curve and compute an AUC value to evaluate our model.
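As a small sketch of this thresholding logic, using the hypothetical bucket with 10 affirms and 2 reverses:

# Fraction of "affirm" in this bucket of the tree
p_affirm <- 10/(10 + 2)   # about 0.83

# Predict affirm when its proportion meets the threshold
ifelse(p_affirm >= 0.5, "affirm", "reverse")
## [1] "affirm"
ifelse(p_affirm >= 0.9, "affirm", "reverse")
## [1] "reverse"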

# Unit 4 - "Judge, Jury, and Classifier" Lecture


# VIDEO 4

# Read in the data
stevens = read.csv("stevens.csv", stringsAsFactors = TRUE) # factors are needed below (the default in R < 4.0)
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
#Some of the variables, namely `Docket` and `Term`, are not interesting for our purpose, so we remove them.
stevensless <- stevens[, -c(1, 2)]
library(caTools)
## 
## Attaching package: 'caTools'
## The following objects are masked from 'package:base64enc':
## 
##     base64decode, base64encode
set.seed(3000) # to get the same split every time
spl1 = sample.split(stevensless$Reverse, SplitRatio = 0.7)
Train1 = subset(stevensless, spl1==TRUE)
Test1 = subset(stevensless, spl1==FALSE)

#Fit _Logistic Regression_ model
#As a reference we also fit a _logistic regression_ model to the _training_ data set:
model_LogRegr <- glm(Reverse ~ ., data = Train1, family = binomial)
summary(model_LogRegr)
## 
## Call:
## glm(formula = Reverse ~ ., family = binomial, data = Train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3832  -0.9186   0.3458   0.8470   2.2290  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                            0.45008    2.34027   0.192  0.84749
## Circuit11th                            1.11621    0.65391   1.707  0.08783
## Circuit1st                             0.45867    1.16360   0.394  0.69344
## Circuit2nd                             2.31471    0.72699   3.184  0.00145
## Circuit3rd                             0.46020    0.74762   0.616  0.53819
## Circuit4th                             1.95459    0.72631   2.691  0.00712
## Circuit5th                             2.03668    0.67009   3.039  0.00237
## Circuit6th                             1.48370    0.71501   2.075  0.03798
## Circuit7th                             0.69997    0.68327   1.024  0.30563
## Circuit8th                             0.78445    0.67347   1.165  0.24411
## Circuit9th                             1.34112    0.58353   2.298  0.02154
## CircuitDC                              0.57449    0.72694   0.790  0.42936
## CircuitFED                             0.63139    0.74354   0.849  0.39579
## IssueCivilRights                       0.15302    1.41314   0.108  0.91377
## IssueCriminalProcedure                -0.08072    1.42975  -0.056  0.95498
## IssueDueProcess                        0.48455    1.44254   0.336  0.73694
## IssueEconomicActivity                  0.21192    1.39093   0.152  0.87891
## IssueFederalismAndInterstateRelations  0.28026    1.48114   0.189  0.84992
## IssueFederalTaxation                  -1.76887    1.71259  -1.033  0.30167
## IssueFirstAmendment                   -0.50309    1.45309  -0.346  0.72918
## IssueJudicialPower                    -0.11009    1.37988  -0.080  0.93641
## IssuePrivacy                           2.39866    1.79602   1.336  0.18170
## IssueUnions                           -0.78603    1.52914  -0.514  0.60723
## PetitionerBUSINESS                     1.36136    1.60051   0.851  0.39500
## PetitionerCITY                        -0.19086    1.79841  -0.106  0.91548
## PetitionerCRIMINAL.DEFENDENT           2.01263    1.61931   1.243  0.21391
## PetitionerEMPLOYEE                     2.09290    1.67320   1.251  0.21099
## PetitionerEMPLOYER                     0.68498    1.75856   0.390  0.69690
## PetitionerGOVERNMENT.OFFICIAL          0.35138    1.62596   0.216  0.82890
## PetitionerINJURED.PERSON               1.46768    1.84218   0.797  0.42562
## PetitionerOTHER                        1.40732    1.57153   0.896  0.37051
## PetitionerPOLITICIAN                   0.77215    1.71668   0.450  0.65286
## PetitionerSTATE                        0.82069    1.62757   0.504  0.61409
## PetitionerUS                           1.45574    1.62834   0.894  0.37132
## RespondentBUSINESS                    -1.73754    0.98272  -1.768  0.07705
## RespondentCITY                        -2.63295    1.33944  -1.966  0.04933
## RespondentCRIMINAL.DEFENDENT          -3.00803    1.01539  -2.962  0.00305
## RespondentEMPLOYEE                    -2.32134    1.09025  -2.129  0.03324
## RespondentEMPLOYER                    -0.92216    1.45769  -0.633  0.52699
## RespondentGOVERNMENT.OFFICIAL         -2.38169    1.21421  -1.962  0.04982
## RespondentINJURED.PERSON              -3.46752    1.20843  -2.869  0.00411
## RespondentOTHER                       -2.24797    0.94253  -2.385  0.01708
## RespondentPOLITICIAN                  -1.67445    1.12146  -1.493  0.13541
## RespondentSTATE                       -1.36931    1.09153  -1.254  0.20966
## RespondentUS                          -3.02756    1.04803  -2.889  0.00387
## LowerCourtliberal                     -0.95835    0.30572  -3.135  0.00172
## Unconst                               -0.18029    0.35504  -0.508  0.61159
##                                         
## (Intercept)                             
## Circuit11th                           . 
## Circuit1st                              
## Circuit2nd                            **
## Circuit3rd                              
## Circuit4th                            **
## Circuit5th                            **
## Circuit6th                            * 
## Circuit7th                              
## Circuit8th                              
## Circuit9th                            * 
## CircuitDC                               
## CircuitFED                              
## IssueCivilRights                        
## IssueCriminalProcedure                  
## IssueDueProcess                         
## IssueEconomicActivity                   
## IssueFederalismAndInterstateRelations   
## IssueFederalTaxation                    
## IssueFirstAmendment                     
## IssueJudicialPower                      
## IssuePrivacy                            
## IssueUnions                             
## PetitionerBUSINESS                      
## PetitionerCITY                          
## PetitionerCRIMINAL.DEFENDENT            
## PetitionerEMPLOYEE                      
## PetitionerEMPLOYER                      
## PetitionerGOVERNMENT.OFFICIAL           
## PetitionerINJURED.PERSON                
## PetitionerOTHER                         
## PetitionerPOLITICIAN                    
## PetitionerSTATE                         
## PetitionerUS                            
## RespondentBUSINESS                    . 
## RespondentCITY                        * 
## RespondentCRIMINAL.DEFENDENT          **
## RespondentEMPLOYEE                    * 
## RespondentEMPLOYER                      
## RespondentGOVERNMENT.OFFICIAL         * 
## RespondentINJURED.PERSON              **
## RespondentOTHER                       * 
## RespondentPOLITICIAN                    
## RespondentSTATE                         
## RespondentUS                          **
## LowerCourtliberal                     **
## Unconst                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 545.70  on 395  degrees of freedom
## Residual deviance: 428.71  on 349  degrees of freedom
## AIC: 522.71
## 
## Number of Fisher Scoring iterations: 5
#Out-of-Sample predictions of the _Logistic Regression_ model
predict_LogRegr_Test<- predict(model_LogRegr,type = "response", newdata = Test1)
cmat_LR <- table(Test1$Reverse, predict_LogRegr_Test > 0.5)
cmat_LR 
##    
##     FALSE TRUE
##   0    47   30
##   1    27   66
accu_LR <- (cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR
## [1] 0.6647059
#Overall Accuracy = 0.6647
#Sensitivity = 66 / 93 = 0.7097 ( = TP rate)
#Specificity = 47 / 77 = 0.6104
#FP rate = 30 / 77 = 0.3896

###################################################

# Read in the data again for CART modeling
stevens = read.csv("stevens.csv", stringsAsFactors = TRUE) # factors are needed below (the default in R < 4.0)
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
# Split the data
library(caTools)
set.seed(3000) # to get the same split every time
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)

# Install rpart library
#install.packages("rpart")
library(rpart)
#install.packages("rpart.plot")
library(rpart.plot)

# CART model
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", minbucket=25) 
#method="class" arg tells us to make classification tree and not regression tree.
#minbucket = 25 limits the tree so that it does not overfit to our training set. We selected a value of 25, but we could pick a smaller or larger value. We will see another way to limit the tree later in this lecture.

#The model can be be represented as a decision tree:
prp(StevensTree) #plotting the classification tree

# Make predictions
PredictCART = predict(StevensTree, newdata = Test, type = "class") #We need to give type = "class" if we want the majority class predictions. This is like using a threshold of 0.5.

#Now lets assess the accuracy of the model through confusion matrix
cmat_CART<-table(Test$Reverse, PredictCART)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##    PredictCART
##      0  1
##   0 41 36
##   1 22 71
#let's now compute the overall accuracy

accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(41+71)/(41+36+22+71)
## [1] 0.6588235
#Our baseline model always predicts the most frequent outcome, i.e. reversal, denoted by code 1
(22+71)/(41+36+22+71)
## [1] 0.5470588
#Sensitivity = 71 / 93 = 0.7634 ( = TP rate)
#Specificity = 41 / 77 = 0.5325
#FP rate = 36 / 77 = 0.4675

#If we had used the logistic regression model, we would have an overall accuracy of 0.6647. A baseline model that always predicts Reverse (the most common outcome) has an accuracy of 54.7%. Here the CART model significantly beats the baseline model and is competitive with the logistic model, as well as MORE INTERPRETABLE than the logistic model: a CART tree is a series of decision rules which can easily be explained.

#Lets evaluate our model using the ROC curve
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
PredictROC = predict(StevensTree, newdata = Test)
head(PredictROC)
##            0         1
## 1  0.3035714 0.6964286
## 3  0.3035714 0.6964286
## 4  0.4000000 0.6000000
## 6  0.4000000 0.6000000
## 8  0.4000000 0.6000000
## 21 0.3035714 0.6964286
#For each observation in the test set, it gives two numbers which can be thought of as the probability of outcome 0 and the probability of outcome 1.
#More concretely, each test set observation is classified into a subset, or bucket, of our CART tree. These numbers give the percentage of training set data in that subset with outcome 0 and the percentage of data in the training set in that subset with outcome 1.
#We will use the second column as our probabilities to generate an ROC curve.

#First we use the prediction() function with first argument the second column of PredictROC, and second argument the true outcome values, Test$Reverse.
#We pass the output of prediction() to performance(), giving two more arguments for the measures we want on the Y and X axes of our ROC curve: true positive rate and false positive rate.
pred = prediction(PredictROC[,2], Test$Reverse)
perf = performance(pred, "tpr", "fpr")

#and the plot
plot(perf)

#QUICK QUESTION  

#Compute the AUC of the CART model from the previous video, using the following command in your R console:
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.6927105
#What is the AUC?
#Ans:0.6927105

#Now, recall that in Video 4, our tree had 7 splits. Let's see how this changes if we change the value of minbucket.

#First build a CART model that is similar to the one we built in Video 4, except change the minbucket parameter to 5. Plot the tree.
# CART model
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", minbucket=5) #method="class" tells rpart to build a classification tree rather than a regression tree.
prp(StevensTree) #plotting the classification tree

#How many splits does the tree have?
#Ans:16
#EXPLANATION:If you plot the tree with prp(StevensTree), you can see that the tree has 16 splits! This tree is probably overfit to the training data, and is not as interpretable.

#Now build a CART model that is similar to the one we built in Video 4, except change the minbucket parameter to 100. Plot the tree.
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", minbucket=100) #method="class" tells rpart to build a classification tree rather than a regression tree.
prp(StevensTree) #plotting the classification tree

#How many splits does the tree have?
#Ans:1
#EXPLANATION:If you plot the tree with prp(StevensTree), you can see that the tree only has one split! This tree is probably not fit well enough to the training data.

VIDEO 5: RANDOM FORESTS (R script reproduced here)

The Random Forests method was designed to improve the prediction accuracy of CART and works by building a large number of CART trees.
Unfortunately, this makes the method less interpretable than CART, so often you need to decide if you value the interpretability or the increase in accuracy more.

To make a prediction for a new observation, each tree in the forest votes on the outcome and we pick the outcome that receives the majority of the votes.
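As a toy sketch of that voting step (the 0/1 vote matrix below is simulated, not taken from a real forest):

set.seed(1)
# Rows are 5 observations, columns are 200 trees' 0/1 votes
votes <- matrix(rbinom(5 * 200, 1, 0.5), nrow = 5)
forest_prediction <- ifelse(rowMeans(votes) > 0.5, 1, 0)   # majority vote per observation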

Building a Forest

How does random forests build many CART trees?

We cannot just run CART multiple times, because it would create the same tree every time. To prevent this, Random Forests

  • only allows each tree to split on a random subset of the independent variables, and
  • each tree is built from what we call a bagged or bootstrapped sample of the data.
    This just means that the data used as the training data for each tree is selected randomly with replacement.
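A minimal sketch of drawing one such bagged sample, where Train stands for any training data frame:

# Sample row indices with replacement: some rows repeat, others are left out
bag_idx <- sample(nrow(Train), size = nrow(Train), replace = TRUE)
bagged_train <- Train[bag_idx, ]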

Some important parameter values need to be selected in randomForest():

  • nodesize: the minimum number of observations in a subset, equivalent to the minbucket parameter from CART.
    A smaller value of nodesize, which leads to bigger trees, may take longer in R.
    Random forests is much more computationally intensive than CART.
  • ntree: the number of trees to build.
    This should not be set too small, but the larger it is the longer it will take.
    A couple hundred trees is typically plenty.

A nice thing about random forests is that the method is not as sensitive to the parameter values as CART is.

IMPORTANT NOTE on randomness

Keep in mind that Random Forests has a random component.
You may have gotten a different confusion matrix than me because there is a random component to this method.

# VIDEO 5 - Random Forests

# Install randomForest package
#install.packages("randomForest")
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Read in the data again for CART modeling
stevens = read.csv("stevens.csv", stringsAsFactors = TRUE) # factors are needed below (the default in R < 4.0)
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
# Split the data
library(caTools)
set.seed(3000) # to get the same split every time
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)

# Build random forest model
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 ) # this throws a warning: we need to convert the outcome variable to a factor
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
# Convert outcome to factor
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)

# Try again
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

# Make predictions
PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
##    PredictForest
##      0  1
##   0 41 36
##   1 19 74
(41+74)/(41+36+19+74) 
## [1] 0.6764706
# Overall accuracy is about 67.6%

#Recall that our:
#logistic regression model had an accuracy of 66.5% and
#our CART model had an accuracy of 65.9%
#So at about 67.6%, our random forest model improved our accuracy a little bit over CART. Sometimes you will see a smaller improvement in accuracy, and sometimes you'll see that random forests can significantly improve accuracy over CART.

QUICK QUESTION

IMPORTANT NOTE: When creating random forest models, you might still get different answers from the ones you see here even if you set the random seed. This has to do with different operating systems and the random forest implementation.

Let’s see what happens if we set the seed to two different values and create two different random forest models.

First, set the seed to 100, and then re-build the random forest model, exactly like we did in the previous video (Video 5). Then make predictions on the test set. What is the accuracy of the model on the test set?

# Install randomForest package
#install.packages("randomForest")
library(randomForest)

# Read in the data again for CART modeling
stevens = read.csv("stevens.csv", stringsAsFactors = TRUE) # factors are needed below (the default in R < 4.0)
str(stevens)
## 'data.frame':    566 obs. of  9 variables:
##  $ Docket    : Factor w/ 566 levels "00-1011","00-1045",..: 63 69 70 145 97 181 242 289 334 436 ...
##  $ Term      : int  1994 1994 1994 1994 1995 1995 1996 1997 1997 1999 ...
##  $ Circuit   : Factor w/ 13 levels "10th","11th",..: 4 11 7 3 9 11 13 11 12 2 ...
##  $ Issue     : Factor w/ 11 levels "Attorneys","CivilRights",..: 5 5 5 5 9 5 5 5 5 3 ...
##  $ Petitioner: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Respondent: Factor w/ 12 levels "AMERICAN.INDIAN",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LowerCourt: Factor w/ 2 levels "conser","liberal": 2 2 2 1 1 1 1 1 1 1 ...
##  $ Unconst   : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ Reverse   : int  1 1 1 1 1 0 1 1 1 1 ...
# Split the data
library(caTools)
set.seed(3000) # to get the same split every time
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl==TRUE)
Test = subset(stevens, spl==FALSE)

# Convert outcome to factor
Train$Reverse = as.factor(Train$Reverse)
Test$Reverse = as.factor(Test$Reverse)

set.seed(100)
StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

# Make predictions
PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
##    PredictForest
##      0  1
##   0 43 34
##   1 19 74
(43+74)/(43+34+19+74) 
## [1] 0.6882353
# Overall accuracy is about 68.8%

################################

#Now, set the seed to 200, and then re-build the random forest model, exactly like we did in the previous video (Video 5). Then make predictions on the test set. What is the accuracy of this model on the test set?

set.seed(200)

StevensForest = randomForest(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, ntree=200, nodesize=25 )

# Make predictions
PredictForest = predict(StevensForest, newdata = Test)
table(Test$Reverse, PredictForest)
##    PredictForest
##      0  1
##   0 44 33
##   1 17 76
(44+76)/(44+33+17+76) 
## [1] 0.7058824
# Overall accuracy is about 70.6%

#As we see here, the random component of the random forest method can change the accuracy. The accuracy for a more stable dataset will not change very much, but a noisy dataset can be significantly affected by the random samples.

VIDEO 6: CROSS-VALIDATION (R script reproduced here)

In CART, the value of minbucket can affect the model’s out-of-sample accuracy.
As we discussed above, if minbucket is too small, over-fitting might occur. On the other hand, if minbucket is too large, the model might be too simple.

So how should we set this parameter value?

We could select the value that gives the best testing set accuracy, but this would not be right. The idea of the testing set is to measure model performance on data the model has never seen before. By picking the value of minbucket to get the best test set performance, the testing set was implicitly used to generate the model.

Instead, we will use a method called K-fold Cross Validation, which is one way to properly select the parameter value.

K-fold Cross Validation

This method works by going through the following steps.

  • First, we split the training set into \(k\) equally sized subsets, or folds. In this example, k equals 5.
  • Then we select \(k - 1\) folds (four, in this example) to estimate the model, and compute predictions on the remaining fold, which is often referred to as the validation set.
  • We build a model and make predictions for each possible parameter value we are considering.
  • We repeat this for each of the other folds.

Ultimately cross validation builds many models, one for each fold and possible parameter value. Then, for each candidate parameter value, and for each fold, we can compute the accuracy of the model.
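The caret package automates all of this for us below, but the mechanics can be sketched by hand. In this sketch the fold assignment is real R code, while the model fitting inside the loop is only indicated by a comment:

set.seed(1)
k <- 5
n <- 100                                   # hypothetical number of training observations
fold <- sample(rep(1:k, length.out = n))   # randomly assign each observation to a fold

for (i in 1:k) {
  build_rows    <- which(fold != i)   # k-1 folds used to build the model
  validate_rows <- which(fold == i)   # held-out fold used to evaluate it
  # ... fit the model on build_rows and record its accuracy on validate_rows ...
}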

This plot shows the possible parameter values on the X-axis, and the accuracy of the model on the Y-axis, with one line for each of the \(k\) repeats of the experiment.

example

We then average the accuracy over the \(k\) folds to determine the final parameter value that we want to use.

Typically, the behavior looks like the curves shown in the plot:

  • if the parameter value is too small, then the accuracy is lower, because the model is probably over-fit to the training set.
  • if the parameter value is too large, then the accuracy is also lower, because the model is too simple.

In this case, we would pick a parameter value around 6, because it leads to the maximum average accuracy over all parameter values.

CV in R

So far, we have used the parameter minbucket to limit our tree in R.
When we use cross validation in R, we will use a parameter called cp instead, the complexity parameter.

It is like Adjusted \(R^2\) for linear regression, and AIC for logistic regression, in that it measures the trade-off between model complexity and accuracy on the training set.

A smaller cp value leads to a bigger tree, so a smaller cp value might over-fit the model to the training set. But a cp value that is too large might build a model that is too simple.

# VIDEO 6

# Install cross-validation packages
#install.packages("caret")
library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:mosaic':
## 
##     dotPlot
#install.packages("e1071")
library(e1071)

# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01)) 
#This will define our cp parameters to test as numbers from 0.01 to 0.5, in increments of 0.01.

# Perform the cross validation
save_CV<-train(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method = "rpart", trControl = numFolds, tuneGrid = cpGrid )
save_CV
## CART 
## 
## 396 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 357, 356, 357, 356, 357, 356, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa      
##   0.01  0.6433974  0.267916905
##   0.02  0.6359615  0.248964339
##   0.03  0.6208974  0.225208239
##   0.04  0.6333974  0.258053232
##   0.05  0.6436538  0.283134471
##   0.06  0.6436538  0.283134471
##   0.07  0.6436538  0.283134471
##   0.08  0.6436538  0.283134471
##   0.09  0.6436538  0.283134471
##   0.10  0.6436538  0.283134471
##   0.11  0.6436538  0.283134471
##   0.12  0.6436538  0.283134471
##   0.13  0.6436538  0.283134471
##   0.14  0.6436538  0.283134471
##   0.15  0.6436538  0.283134471
##   0.16  0.6436538  0.283134471
##   0.17  0.6436538  0.283134471
##   0.18  0.6436538  0.283134471
##   0.19  0.6436538  0.283134471
##   0.20  0.6061538  0.188881247
##   0.21  0.5808333  0.122201772
##   0.22  0.5605769  0.063162246
##   0.23  0.5479487  0.021739130
##   0.24  0.5453846  0.009090909
##   0.25  0.5453846  0.000000000
##   0.26  0.5453846  0.000000000
##   0.27  0.5453846  0.000000000
##   0.28  0.5453846  0.000000000
##   0.29  0.5453846  0.000000000
##   0.30  0.5453846  0.000000000
##   0.31  0.5453846  0.000000000
##   0.32  0.5453846  0.000000000
##   0.33  0.5453846  0.000000000
##   0.34  0.5453846  0.000000000
##   0.35  0.5453846  0.000000000
##   0.36  0.5453846  0.000000000
##   0.37  0.5453846  0.000000000
##   0.38  0.5453846  0.000000000
##   0.39  0.5453846  0.000000000
##   0.40  0.5453846  0.000000000
##   0.41  0.5453846  0.000000000
##   0.42  0.5453846  0.000000000
##   0.43  0.5453846  0.000000000
##   0.44  0.5453846  0.000000000
##   0.45  0.5453846  0.000000000
##   0.46  0.5453846  0.000000000
##   0.47  0.5453846  0.000000000
##   0.48  0.5453846  0.000000000
##   0.49  0.5453846  0.000000000
##   0.50  0.5453846  0.000000000
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.19.
#We get a table describing the cross validation accuracy for different cp parameters.
#The first column gives the cp parameter that was tested
#The second column gives the cross validation accuracy for that cp value.
#The accuracy starts lower, and then increases, and then will start decreasing again, as we saw in the example shown above.

plot(save_CV) #or

plot(save_CV$results$cp, save_CV$results$Accuracy, type="l", xlab="cp", ylab="accuracy")

#when several cp values tie for the best accuracy, train() here reports the largest of them, cp = 0.19

# Create a new CART model
StevensTreeCV = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst, data = Train, method="class", cp = 0.17) # using a cp value from the optimal range found by cross-validation above (any cp between 0.05 and 0.19 yields the same tree here)

# Make predictions (Out-of-Sample predictions of the Cross Validated CART model)
PredictCV = predict(StevensTreeCV, newdata = Test, type = "class")
cmat_CART_CV<-table(Test$Reverse, PredictCV) #confusion matrix
cmat_CART_CV
##    PredictCV
##      0  1
##   0 59 18
##   1 29 64
accu_CART_CV <- (cmat_CART_CV[1,1] + cmat_CART_CV[2,2])/sum(cmat_CART_CV)
accu_CART_CV  #(59+64)/(59+18+29+64)
## [1] 0.7235294
#Overall Accuracy = 72.4%
#Sensitivity = 64 / 93 = 0.6882 ( = TP rate)
#Specificity = 59 / 77 = 0.7662
#FP rate = 18 / 77 = 0.2338

#Recall that our previous CART model had an accuracy of 65.9%

#What does this decision tree look like?
prp(StevensTreeCV)

#Surprisingly, the best cross-validated CART model achieves an accuracy of 72.4% with just one split, based on the value of the LowerCourt variable.

About Cross Validation

Cross validation helps us make sure we are selecting a good parameter value, and often this will significantly increase the accuracy. If we had already happened to select a good parameter value, then the accuracy might not have increased that much. Nevertheless, by using cross validation, we can be sure that we are selecting a smart parameter value.

Can a CART model actually predict Supreme Court case outcomes better than a group of experts?

  • Martin and his colleagues used 628 previous Supreme Court cases between 1994 and 2001 to build their model.
  • They made predictions for the 68 cases that would be decided in October 2002, before the term started.

The model

Their model had two stages of CART trees.

  • The first stage involved making predictions using two CART trees.
    • One to predict a unanimous liberal decision and
    • One to predict a unanimous conservative decision.
    • If the trees gave conflicting responses or both predicted no, then they moved on to the next stage.
  • The second stage consisted of predicting the decision of each individual justice, and then using the majority decision of all nine justices as the final prediction for the case.

It turns out that about 50% of Supreme Court cases result in a unanimous decision, so the first stage alone was a nice first step to detect the easier cases.

This is the decision tree for Justice O’Connor:

example

And this is the decision tree for Justice Souter:

example

This shows an unusual property of the CART trees that Martin and his colleagues developed.
They use predictions from some trees as independent variables in other trees.
In this tree, the first split is whether or not Justice Ginsburg’s predicted decision is liberal. So we have to run Justice Ginsburg’s CART tree first, see what the prediction is, and then use that as input for Justice Souter’s tree.

If we predict that Justice Ginsburg will make a liberal decision, then Justice Souter will probably make a liberal decision too, and vice versa.

The experts

Martin and his colleagues recruited 83 legal experts:

  • 71 academics and 12 attorneys.
  • 38 had previously clerked for a Supreme Court justice, 33 were chaired professors, and 5 were current or former law school deans.
  • Experts were asked to predict only within their area of expertise.
    • More than one expert was assigned to each case.
  • They were allowed to consider any source of information, but not allowed to communicate with each other regarding predictions.

The results

For the 68 cases in October 2002, the predictions were made, and at the end of the month the results were computed.

For predicting the overall decision that was made by the Supreme Court,

  • the models had an accuracy of 75%, while
  • the experts only had an accuracy of 59%.

So the models had a significant edge over the experts in predicting the overall case outcomes.

However, when the predictions were run for individual justices, the model and the experts performed very similarly, with an accuracy of about 68%.
For some justices, the model performed better, and for some justices, the experts performed better.

Keeping an Eye on Healthcare Costs: The D2Hawkeye Story (2)

VIDEO 1: THE STORY OF D2HAWKEYE

QUICK QUESTION

In what ways do you think an analytics approach to predicting healthcare cost will improve upon the previous approach of human judgment? Select all that apply.

Ans:It will allow D2Hawkeye to analyze millions of patients; it will allow D2Hawkeye to make predictions faster than doctors can; and it will allow D2Hawkeye to use all available data (millions of cases) to make decisions.

EXPLANATION:All of the above are correct. There are many advantages to having an analytics approach to predict cost. However, it is important that the models are interpretable, so trees are a great model to use in this situation.

VIDEO 2: CLAIMS DATA

QUICK QUESTION

A common problem in analytics is that you have some data available, but it’s not the ideal dataset. This is the case for this problem, where we only have claims data. Which of the following pieces of information would we ideally like to have in our dataset, but are not included in claims data?

Ans:Blood test results, Physical exam results (weight, height, blood pressure, etc.)

VIDEO 3: THE VARIABLES

QUICK QUESTION

While we don’t have all of the data we would ideally like to have in this problem (like test results), we can define new variables using the data we do have. Which of the following were new variables defined to help predict healthcare cost? Select all that apply.

Ans:Variables to capture chronic conditions, Noncompliance to treatment, Illness severity, Interactions between illnesses

EXPLANATION:All of these variables were defined using the claims data to improve cost predictions. This shows how the intuition of experts can be used to define new variables and improve the model.

VIDEO 4: ERROR MEASURES

QUICK QUESTION

The image below shows the penalty error matrix that we discussed in the previous video.

penalty matrix

We can interpret this matrix as follows. Suppose the actual outcome for an observation is 3, and we predict 2. We find 3 on the top of the matrix, and go down to the second row (since we forecasted 2). The penalty error for this mistake is 2. If for another observation we predict (forecast) 4, but the actual outcome is 1, that is a penalty error of 3.

What is the worst mistake we can make, according to the penalty error matrix?

Ans:We predict 1 (very low cost), but the actual outcome is 5 (very high cost).

EXPLANATION:The highest cost is 8, which occurs when the forecast is 1 (very low cost), but the actual cost is 5 (very high cost). It would be much worse for us to ignore an actual high cost observation than to accidentally predict high cost for someone who turns out to be low cost.
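The full matrix follows from the rule implied by these two examples: over-forecasting by d buckets costs d, while under-forecasting by d buckets costs 2d. A sketch in R, assuming that reading of the matrix:

# Rows = forecasted bucket, columns = actual bucket
penalty <- outer(1:5, 1:5, function(forecast, actual)
  ifelse(actual > forecast, 2 * (actual - forecast), forecast - actual))
penalty[2, 3]   # forecast 2, actual 3: penalty 2
penalty[4, 1]   # forecast 4, actual 1: penalty 3
penalty[1, 5]   # forecast 1, actual 5: penalty 8, the worst mistake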

What are the “best” types of mistakes we can make, according to the penalty error matrix?

Ans:Mistakes where we predict one cost bucket HIGHER than the actual outcome.

EXPLANATION:We are happier with mistakes where we predict one cost bucket higher than the actual outcome, since this just means we are being a little overly cautious.

VIDEO 5: CART TO PREDICT COST

QUICK QUESTION

What were the most important factors in the CART trees to predict cost?

Ans:Cost ranges from the previous year

EXPLANATION:The most important variables in a CART tree are at the top of the tree - in this case, they are the cost ranges from the previous year.

Let's reproduce the slide notes in brief here, as captured in Videos 1 to 5, before we proceed to the analysis.

INTRODUCTION

The specific exercise we are going to see in this lecture is an analytics approach to building models, starting with 2.4 million people over a three-year span, the period between 2001 and 2003.

We make out-of-sample predictions for the period 2003 to 2004.

This was in the early years of D2Hawkeye. Out of the 2.4 million people, we included only people with data for at least 10 months in both the observation period and the results period. This decreased the data to 400,000 people.

The D2Hawkeye data

To build an analytics model, let us discuss the variables we used.

  • 13,000 diagnosis codes; these are the codes that claims data utilize for diagnoses.
  • 22,000 different codes for procedures.
  • 45,000 codes for prescription drugs.

To work with this massive number of variables, we aggregated them as follows.

  • Out of the 13,000 diagnoses, we defined 217 diagnosis groups.
  • Out of the 22,000 procedures, we aggregated the data to develop 213 procedure groups.
  • Finally, from the 45,000 prescription drugs, we developed 189 therapeutic groups.

To illustrate how we infer further information from the data, the graph here shows time on the horizontal axis and costs in thousands of dollars on the vertical axis.

example

  • Patient one is a patient who, on a monthly basis, has costs on the order of $10,000 to $15,000, a fairly significant cost but fairly constant in time.

  • Patient two has an annual cost of a similar size to patient one, but in all months except the third the costs are almost $0, whereas in the third month they are about $70,000.

This comparison motivated an additional variable we defined, indicating whether the patient has a chronic or an acute condition.
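As a hypothetical illustration of how such a flag might be derived from monthly costs (the numbers and the 50% cutoff below are made up for illustration):

# Monthly costs (in dollars) for the two hypothetical patients above
patient1 <- rep(12500, 12)             # steady costs: chronic profile
patient2 <- c(0, 0, 70000, rep(0, 9))  # a single spike: acute profile

# Flag a profile as acute if a single month dominates the annual cost
is_acute <- function(monthly) max(monthly)/sum(monthly) > 0.5
is_acute(patient1)   # FALSE
is_acute(patient2)   # TRUE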

In addition to the initial variables, we also defined, in collaboration with medical doctors, 269 medically-defined rules.

For example,

  • Interaction between various illnesses (e.g., obesity and depression).
  • Interaction between diagnosis and age (e.g., more than 65 years old and coronary artery disease).
  • Noncompliance with treatment (e.g., non-fulfillment of a particular drug order).
  • Illness severity (e.g., severe depression as opposed to regular depression).

There is another set of variables involving demographic information, such as gender and age.

An important set of variables relates to cost. Rather than using costs directly, we bucketed costs and considered everyone within a bucket equally. We defined five buckets, partitioned in such a way that each accounted for 20% of the total cost.

The partitions were:

  • from $0 to $3,000,
  • from $3,000 to $8,000,
  • from $8,000 to $19,000,
  • from $19,000 to $55,000, and
  • above $55,000.

Patients with costs below $3,000 made up 78% of all patients.
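A minimal sketch of this bucketing with R's cut(), where reimbursement is a hypothetical vector of annual costs:

reimbursement <- c(1200, 5000, 15000, 30000, 90000)   # hypothetical annual costs
bucket <- cut(reimbursement,
              breaks = c(-Inf, 3000, 8000, 19000, 55000, Inf),
              labels = 1:5)
bucket
## [1] 1 2 3 4 5
## Levels: 1 2 3 4 5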

Interpretation of the cost buckets

This diagram shows the various levels of risk.

cost buckets

  • Bucket one consists of patients that have rather low risk.
  • Bucket two has what is called emerging risk.
  • In bucket three, moderate level of risk.
  • Bucket four, high risk.
  • And bucket five, very high risk.

From a medical perspective,

  • Buckets two and three, the emerging and moderate risk patients, are candidates for wellness programs.
  • Bucket four, the high risk patients, are candidates for disease management programs.
  • Bucket five, the very high risk patients, are candidates for case management.

Error Measures

Let us introduce the error measures we used in building the analytics models.

Of course \(R^2\) was used, but it was accompanied by other measures.

The so-called “penalty error” is motivated by the fact that if you classify a very high-risk patient as a low-risk patient, this is more costly than the reverse, namely classifying a low-risk patient as a very high-risk patient.

Motivated by this, we developed a penalty error with the idea of using asymmetric penalties. The following figure shows the penalty error as a matrix between outcome and (model) forecast.

penalty matrix

For example, whenever we classify a low-risk patient as high-risk, we pay a penalty of 2, which is a difference of 3 minus 1, the difference in the error. However, for the reverse case, when you classify a bucket 3 patient as bucket 1 patient, the penalty is twice as much.

In general, the penalties for forecasting a lower bucket than the outcome (underestimating risk) are double the corresponding penalties for overestimating it.
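
As a compact sketch (our own helper, not course code), the penalty error of a set of forecasts can be computed by indexing the penalty matrix with (outcome, forecast) pairs and averaging; this is equivalent to the sum(table(...) * PenaltyMatrix) / n computation used in the R sessions below.

# P: 5x5 penalty matrix with rows = outcome and columns = forecast;
# actual, forecast: bucket numbers 1-5 as plain integer vectors
penaltyError <- function(actual, forecast, P) {
  mean(P[cbind(actual, forecast)])  # matrix indexing picks P[actual[i], forecast[i]] for each patient
}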

Baseline reference

To judge the quality of the analytics models we developed, we compare them with a baseline, which is to simply predict that the cost in the next period will be the same as the cost in the current period.

We observed that, as far as identification of buckets is concerned, the baseline's accuracy was 75%.
The average penalty error of the baseline was 0.56.

The Method: CART with multi-class classification

In this case, we use multi-class classification because the outcome to predict is the bucket classification, for which there are five classes.

Classification trees have the major advantage of being interpretable by the physicians who observe and judge them; people were able to identify the resulting rules as reasonable.
Human intuition agreed with the output of the analytics model.

To give you an example, let us consider patients that have two types of diagnosis: coronary artery disease (CAD) and diabetes.

example_tree

example_tree

  • If a patient does not have coronary artery disease, we classify the patient as bucket one.
  • If the patient has coronary artery disease, then we check whether the person has diabetes.
    • If the patient has diabetes, then it’s bucket five, very high risk.
    • If the patient does not have diabetes (but does have coronary artery disease), the patient is classified as bucket three.

In the D2Hawkeye application, the most important factors at the top of the tree were related to cost.
As the tree grows, the secondary factors involve various chronic illnesses and some of the medical rules we discussed earlier.

For example, whether or not the patient has asthma and depression. If the patient has both asthma and depression, then it’s bucket five.

Examples of bucket 5 patients

  • Under 35 years old, between $3,300 and $3,900 in claims, CAD, but no office visits in the last year.

  • Claims between $3,900 and $43,000, with at least $8,000 paid in the last 12 months, $4,300 in pharmacy claims, an acute cost profile, and a cancer diagnosis.

  • More than $58,000 in claims, at least $55,000 paid in the last 12 months, and not an acute profile.

VIDEO 6: CLAIMS DATA IN R

LOADING THE DATA

This data comes from the DE-SynPUF dataset, published by the United States Centers for Medicare and Medicaid Services (CMS).
The observations represent a 1% random sample of Medicare beneficiaries, limited to those still alive at the end of 2008.

The Variables

Our independent variables are from 2008, and we will be predicting cost in 2009. They are:

  • age: patient’s age in years at the end of 2008,

and several binary variables indicating whether (1) or not (0) the patient had diagnosis codes for a particular disease or related disorder in 2008:

  • alzheimers,
  • arthritis,
  • cancer,
  • copd, chronic obstructive pulmonary disease,
  • depression,
  • diabetes,
  • heart.failure,
  • ihd, ischemic heart disease,
  • kidney, kidney disease,
  • osteoporosis,
  • stroke.

Finally there are non-medical-diagnostic variables:

  • reimbursement2008: the total amount of Medicare reimbursements for this patient in 2008.
  • reimbursement2009: the total value of all Medicare reimbursements for the patient in 2009.
  • bucket2008: the cost bucket the patient fell into in 2008,
  • bucket2009: the cost bucket the patient fell into in 2009.
# VIDEO 6

# Read in the data
Claims <- read.csv("ClaimsData1.csv")
str(Claims)
## 'data.frame':    458005 obs. of  16 variables:
##  $ age              : int  85 59 67 52 67 68 75 70 67 67 ...
##  $ alzheimers       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ arthritis        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cancer           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ copd             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ depression       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ diabetes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart.failure    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ihd              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kidney           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ osteoporosis     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ reimbursement2008: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2008       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ reimbursement2009: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bucket2009       : int  1 1 1 1 1 1 1 1 1 1 ...
# Percentage of patients in each cost bucket
#These cost buckets are defined using the thresholds determined by D2Hawkeye. We can verify that the distribution of patients across cost buckets has the same structure as what we saw for D2Hawkeye by computing the percentage of patients in each bucket.
table(Claims$bucket2009)/nrow(Claims)
## 
##           1           2           3           4           5 
## 0.671267781 0.190170413 0.089466272 0.043324855 0.005770679
round(100.0*table(Claims$bucket2009)/nrow(Claims), 2)
## 
##     1     2     3     4     5 
## 67.13 19.02  8.95  4.33  0.58
#The vast majority of patients in this data set have low cost.

hist(log10(Claims$reimbursement2008 + 1), xlab = "log10(2008 reimbursements + 1)", ylab = "Frequency", main = "Distribution of 2008 Reimbursements", col = "orange") #the +1 avoids log10(0) = -Inf for the many patients with zero reimbursements

#A MODEL FOR HEALTHCARE COST PREDICTION

#Our goal will be to predict the cost bucket the patient fell into in 2009 using a CART model.

# Split the data
library(caTools)

set.seed(88)

spl = sample.split(Claims$bucket2009, SplitRatio = 0.6) #First we split our entire data set into training and test sets, with 60/40 split:
ClaimsTrain = subset(Claims, spl==TRUE)
ClaimsTest = subset(Claims, spl==FALSE)
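#Note: sample.split stratifies on the outcome vector, so bucket2009 has (approximately) the same distribution in ClaimsTrain and ClaimsTest.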

summary(ClaimsTrain)
##       age           alzheimers       arthritis          cancer       
##  Min.   : 26.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.: 67.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 73.00   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   : 72.64   Mean   :0.1928   Mean   :0.1546   Mean   :0.06402  
##  3rd Qu.: 81.00   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :100.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##       copd          depression        diabetes      heart.failure   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1369   Mean   :0.2129   Mean   :0.3809   Mean   :0.2852  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       ihd             kidney        osteoporosis        stroke       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.4205   Mean   :0.1616   Mean   :0.1738   Mean   :0.04497  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  reimbursement2008   bucket2008    reimbursement2009   bucket2009   
##  Min.   :     0    Min.   :1.000   Min.   :     0    Min.   :1.000  
##  1st Qu.:     0    1st Qu.:1.000   1st Qu.:   130    1st Qu.:1.000  
##  Median :   960    Median :1.000   Median :  1540    Median :1.000  
##  Mean   :  4016    Mean   :1.438   Mean   :  4280    Mean   :1.522  
##  3rd Qu.:  3110    3rd Qu.:2.000   3rd Qu.:  4220    3rd Qu.:2.000  
##  Max.   :194910    Max.   :5.000   Max.   :158800    Max.   :5.000
#QUICK QUESTION  

#What is the average age of patients in the training set, ClaimsTrain?
mean(ClaimsTrain$age)
## [1] 72.63773
#Ans:72.63773  #or by looking at summary(ClaimsTrain): the mean age is listed under the age variable
#What proportion of people in the training set (ClaimsTrain) had at least one diagnosis code for diabetes?
prop.table(table(ClaimsTrain$diabetes)) #or table(ClaimsTrain$diabetes)/nrow(ClaimsTrain)
## 
##         0         1 
## 0.6191017 0.3808983
#Ans:0.3808983 
#since diabetes is a binary variable, the mean value of diabetes gives the proportion of people with at least one diagnosis code for diabetes.

VIDEO 7: BASELINE METHOD AND PENALTY MATRIX

The baseline method would predict that the cost bucket for a patient in 2009 will be the same as it was in 2008.
So let’s create a classification matrix to compute the accuracy for the baseline method on the test set.

The accuracy is the sum of the diagonal, the observations that were classified correctly, divided by the total number of observations in our test set.

# VIDEO 7

# Baseline method
cmat_base <- table(ClaimsTest$bucket2009, ClaimsTest$bucket2008)
cmat_base
##    
##          1      2      3      4      5
##   1 110138   7787   3427   1452    174
##   2  16000  10721   4629   2931    559
##   3   7006   4629   2774   1621    360
##   4   2688   1943   1415   1539    352
##   5    293    191    160    309    104
base_accu <- sum(diag(cmat_base))/nrow(ClaimsTest)
base_accu  #(110138 + 10721 + 2774 + 1539 + 104)/nrow(ClaimsTest)
## [1] 0.6838135
#The accuracy of the baseline method is 68.38%

#How about the penalty error?
#We need to first create a penalty matrix, where we will put the actual outcomes on the left (rows), and the predicted outcomes on the top (columns) (note that this is the transpose of the matrix shown in the figure above).

# Penalty Matrix
PenaltyMatrix = matrix(c(0,1,2,3,4,2,0,1,2,3,4,2,0,1,2,6,4,2,0,1,8,6,4,2,0), byrow=TRUE, nrow=5)

PenaltyMatrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    2    3    4
## [2,]    2    0    1    2    3
## [3,]    4    2    0    1    2
## [4,]    6    4    2    0    1
## [5,]    8    6    4    2    0
# Penalty Error of Baseline Method
as.matrix(table(ClaimsTest$bucket2009, ClaimsTest$bucket2008))*PenaltyMatrix
##    
##         1     2     3     4     5
##   1     0  7787  6854  4356   696
##   2 32000     0  4629  5862  1677
##   3 28024  9258     0  1621   720
##   4 16128  7772  2830     0   352
##   5  2344  1146   640   618     0
sum(as.matrix(table(ClaimsTest$bucket2009,ClaimsTest$bucket2008))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7386055
#The penalty error of the baseline method is 0.739.

#Our goal will be to create a CART model that has an accuracy higher than 68.38% and a penalty error lower than 0.739.

#QUICK QUESTION  

#Suppose that instead of the baseline method discussed in the previous video, we used the baseline method of predicting the most frequent outcome for all observations. This new baseline method would predict cost bucket 1 for everyone.

#The confusion matrix would be the following:

cmat_all1 <- matrix(c(rowSums(cmat_base), rep(0, 20)), byrow = FALSE, nrow = 5)
cmat_all1
##        [,1] [,2] [,3] [,4] [,5]
## [1,] 122978    0    0    0    0
## [2,]  34840    0    0    0    0
## [3,]  16390    0    0    0    0
## [4,]   7937    0    0    0    0
## [5,]   1057    0    0    0    0
#What would the accuracy of this baseline method be on the test set?
#Accuracy:
sum(cmat_all1[,1])  # equals nrow(ClaimsTest): every observation is predicted bucket 1
## [1] 183202
accu_base_all_1 <- cmat_all1[1,1]/nrow(ClaimsTest)
accu_base_all_1  
## [1] 0.67127
#or
prop.table(table(ClaimsTest$bucket2009))[1]
##       1 
## 0.67127
#Ans:0.67127

#What would the penalty error of this baseline method be on the test set?
# as.matrix(cmat_all1*PenaltyMatrix)
# sum(as.matrix(cmat_all1*PenaltyMatrix))
altbase_penalty_error <- sum(as.matrix(cmat_all1*PenaltyMatrix))/nrow(ClaimsTest)
altbase_penalty_error 
## [1] 1.044301
#Ans:1.044301
##For the penalty error, since this baseline method predicts 1 for all observations, it would have a penalty error of:(0*122978 + 2*34840 + 4*16390 + 6*7937 + 8*1057)/nrow(ClaimsTest) = 1.044301

VIDEO 8: PREDICTING HEALTHCARE COSTS IN R

# VIDEO 8

# Load necessary libraries
library(rpart)
library(rpart.plot)

# CART model
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005)

#Note that even though we have a multi-class classification problem here, we build our tree in the same way as a binary classification problem.
#Note on cp: the cp value we are using was selected through cross-validation on the training set. We are not performing the cross-validation here, because it takes a significant amount of time on a data set of this size. Remember that we have almost 275,000 observations in our training set.
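#A sketch of how that cross-validation might be done (not run here; it assumes the caret and e1071 packages, which the recitation below also uses, and a hypothetical grid of cp values around 0.00005):
# library(caret)
# ClaimsTrain$bucket2009 = as.factor(ClaimsTrain$bucket2009)  # treat the outcome as a class, not a number
# cv.ctrl = trainControl(method = "cv", number = 10)
# cv.grid = expand.grid(.cp = seq(0.00001, 0.0005, by = 0.00005))
# train(bucket2009 ~ . - reimbursement2009, data = ClaimsTrain, method = "rpart", trControl = cv.ctrl, tuneGrid = cv.grid)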

prp(ClaimsTree)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting

#We have a huge tree here. This makes sense for a few reasons.
#One is the large number of observations in our training set. Another is that we have a five-class classification problem, so the classification is more complex than a binary classification case, like the one we saw in the previous lecture.
#The trees used by D2Hawkeye were also very large CART trees. While this hurts the interpretability of the model, it is still possible to describe each of the buckets of the tree according to the splits.

# Make predictions (Out-of-Sample predictions of the CART model)
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")
#We need to give type = "class" if we want the majority class predictions.This is like using a threshold of 0.5.
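#Without type = "class", predict() on an rpart classification tree returns a matrix of class probabilities (one column per bucket); type = "class" simply picks the most probable bucket for each observation.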

cmat_CART <-table(ClaimsTest$bucket2009, PredictTest)
cmat_CART
##    PredictTest
##          1      2      3      4      5
##   1 114141   8610    124    103      0
##   2  18409  16102    187    142      0
##   3   8027   8146    118     99      0
##   4   3099   4584     53    201      0
##   5    351    657      4     45      0
CART_accu <- sum(diag(cmat_CART))/nrow(ClaimsTest)
CART_accu #(114141 + 16102 + 118 + 201 + 0)/nrow(ClaimsTest)  #adding the nos along the diagonal and dividing by total no of obs in the test set
## [1] 0.7126669
#The accuracy of this CART model is 71.27%.

#For the penalty error, we can use the penalty matrix we prepared previously
as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix
##    PredictTest
##         1     2     3     4     5
##   1     0  8610   248   309     0
##   2 36818     0   187   284     0
##   3 32108 16292     0    99     0
##   4 18594 18336   106     0     0
##   5  2808  3942    16    90     0
sum(as.matrix(table(ClaimsTest$bucket2009,PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.7578902
#The penalty error of this CART model is 0.758, higher than the baseline's 0.739

#While we increased the accuracy, the penalty error also went up. Why?
#By default, rpart will try to maximize the overall accuracy, and every type of error is seen as having a penalty of one.
#Our CART model predicts buckets 3, 4, and 5 so rarely because there are very few observations in these classes. Therefore we do not really expect this model to do better on the penalty error than the baseline method.
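#We can see this directly from the column sums of the classification matrix:
colSums(cmat_CART)  # how often each bucket is predicted
##      1      2      3      4      5 
## 144027  38099    486    590      0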

#How can we fix this?

#The rpart function allows us to specify a parameter called loss.
#This is the penalty matrix we want to use when building our model.

#If the rpart function knows that we will be giving a higher penalty to some types of errors over others, it might choose different splits when building the model to minimize the worst types of errors. We will probably get a lower overall accuracy with this new model, but hopefully, the penalty error will be much lower too.

# New CART model with loss matrix
ClaimsTree = rpart(bucket2009 ~ age + alzheimers + arthritis + cancer + copd + depression + diabetes + heart.failure + ihd + kidney + osteoporosis + stroke + bucket2008 + reimbursement2008, data=ClaimsTrain, method="class", cp=0.00005, parms=list(loss=PenaltyMatrix))

# Redo predictions and penalty error (Out-of-Sample predictions of the CART model with loss)
PredictTest = predict(ClaimsTree, newdata = ClaimsTest, type = "class")

CART_loss_cmat<-table(ClaimsTest$bucket2009, PredictTest)
CART_loss_cmat
##    PredictTest
##         1     2     3     4     5
##   1 94310 25295  3087   286     0
##   2  7176 18942  8079   643     0
##   3  3590  7706  4692   401     1
##   4  1304  3193  2803   636     1
##   5   135   356   408   156     2
CART_loss_accu <- sum(diag(CART_loss_cmat))/nrow(ClaimsTest)
CART_loss_accu  #(94310 + 18942 + 4692 + 636 + 2)/nrow(ClaimsTest)
## [1] 0.6472746
#The accuracy of this CART model is 64.73%.

as.matrix(table(ClaimsTest$bucket2009, PredictTest))*PenaltyMatrix
##    PredictTest
##         1     2     3     4     5
##   1     0 25295  6174   858     0
##   2 14352     0  8079  1286     0
##   3 14360 15412     0   401     2
##   4  7824 12772  5606     0     1
##   5  1080  2136  1632   312     0
sum(as.matrix(table(ClaimsTest$bucket2009,PredictTest))*PenaltyMatrix)/nrow(ClaimsTest)
## [1] 0.6418161
#The penalty error of the CART model with loss is 0.642.

#Our accuracy is now lower than the baseline method, but our penalty error is also much lower.

#QUICK QUESTION  

#In the previous video, we constructed two CART models. The first CART model, without the loss matrix, predicted bucket 1 for 78.6% of the observations in the test set. Did the second CART model, with the loss matrix, predict bucket 1 for more or fewer of the observations, and why?
#Ans:According to the penalty matrix, some of the worst types of errors are to predict bucket 1 when the actual cost bucket is higher. Therefore, the model with the penalty matrix predicted bucket 1 less frequently.
#EXPLANATION:If you look at the classification matrix for the second CART model, we predicted bucket 1 less frequently. This is because, according to the penalty matrix, some of the worst types of errors are to predict bucket 1 when the actual cost bucket is higher.

DISCUSSION

models performance

models performance

First, we observe that the overall accuracy of the method (the percentage of patients whose bucket it predicts correctly) is 80%, compared to 75% for the baseline.
However, notice that this improvement is achieved in an interesting way.

  • For bucket one patients, the two models are equivalent.
    This supports the idea that healthy people stay healthy, which is exactly what the baseline model assumes.
  • For buckets two to five, notice that the accuracy increases substantially:
    • for bucket two, from 31% to 60%, i.e. it roughly doubles,
    • for bucket three, from 21% to 53%, i.e. it more than doubles,
    • for bucket four, from 19% to 39%, i.e. it roughly doubles,
    • for bucket five, from 23% to 30%, an improvement, though not as big as for the other buckets.

The overall penalty improves, going from 0.56 to 0.52. The improvement for bucket one is small, but there are significant improvements as we move to higher buckets.

What is the Analytics Edge?

  • Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention.
  • Because the model was interpretable, physicians were able to improve the model by identifying new variables and refining existing variables.
  • Analytics gave D2Hawkeye an edge over the competition, which was using “last-century” methods.

Location Location Location_Regression Trees for Housing Data (Recitation)_3 (R script reproduced here)

# Unit 4, Recitation


# VIDEO 2

# Read and examine the data
boston = read.csv("boston.csv")
str(boston)
## 'data.frame':    506 obs. of  16 variables:
##  $ TOWN   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
##  $ TRACT  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
##  $ LON    : num  -71 -71 -70.9 -70.9 -70.9 ...
##  $ LAT    : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
# Plot observations
## All TRACTs ##
plot(boston$LON, boston$LAT, xlab = "Longitude", ylab = "Latitude", main = "Boston Housing Density")

# Tracts alongside the Charles River
points(boston$LON[boston$CHAS==1], boston$LAT[boston$CHAS==1], col="blue", pch=19)

# Plot MIT tract
points(boston$LON[boston$TRACT==3531],boston$LAT[boston$TRACT==3531],col="red", pch=20)

# Plot pollution # TRACTs where pollution is higher than average #
summary(boston$NOX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3850  0.4490  0.5380  0.5547  0.6240  0.8710
points(boston$LON[boston$NOX>=0.55], boston$LAT[boston$NOX>=0.55], col="green", pch=20)

# Plot prices ## Above average home prices
plot(boston$LON, boston$LAT)
summary(boston$MEDV)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

# VIDEO 3

# Trying Linear Regression using LAT and LON
plot(boston$LAT, boston$MEDV)

cor(boston$LAT, boston$MEDV)
## [1] 0.006825792
plot(boston$LON, boston$MEDV)

cor(boston$LON, boston$MEDV)
## [1] -0.3229467
#Apply linear regression 
latlonlm = lm(MEDV ~ LAT + LON, data=boston)
summary(latlonlm)
## 
## Call:
## lm(formula = MEDV ~ LAT + LON, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.460  -5.590  -1.299   3.695  28.129 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3178.472    484.937  -6.554 1.39e-10 ***
## LAT             8.046      6.327   1.272    0.204    
## LON           -40.268      5.184  -7.768 4.50e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.693 on 503 degrees of freedom
## Multiple R-squared:  0.1072, Adjusted R-squared:  0.1036 
## F-statistic: 30.19 on 2 and 503 DF,  p-value: 4.159e-13
# Visualize regression output
plot(boston$LON, boston$LAT, xlab = "Longitude", ylab = "Latitude", main = "Boston Housing Density")

#Actual above median home prices
points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

#Linear regression predictions on home prices above median
head(latlonlm$fitted.values)  # preview only; all 506 fitted values are used in the plot below
##        1        2        3        4        5        6 
## 18.75633 18.81648 18.21651 17.97483 17.77344 17.60024
points(boston$LON[latlonlm$fitted.values >= 21.2], boston$LAT[latlonlm$fitted.values >= 21.2], col="blue", pch="$")

# Video 4

#Regression trees on longitude and latitude

# Load CART packages
library(rpart)
library(rpart.plot)

# CART model
latlontree = rpart(MEDV ~ LAT + LON, data=boston)
prp(latlontree)

# Visualize output
plot(boston$LON, boston$LAT, xlab = "Longitude", ylab = "Latitude", main = "Boston Housing Density")
points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

#Get the predictions of the regression tree according to the model latlontree
fittedvalues = predict(latlontree)
points(boston$LON[fittedvalues>=21.2], boston$LAT[fittedvalues>=21.2], col="blue", pch="$")

#Simplify the tree by increasing minbucket (this also guards against overfitting)
latlontree = rpart(MEDV ~ LAT + LON, data=boston, minbucket=50)
plot(latlontree)
text(latlontree)

#prp plotting function of rpart
prp(latlontree)

# Visualize Output
plot(boston$LON, boston$LAT, xlab = "Longitude", ylab = "Latitude", main = "Boston Housing Density")
abline(v=-71.07)
abline(h=42.21)
abline(h=42.17)
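#(the three abline values above mark the splits of the minbucket=50 tree, as read off its prp plot)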
points(boston$LON[boston$MEDV>=21.2], boston$LAT[boston$MEDV>=21.2], col="red", pch=20)

# VIDEO 5

#Linear Regression Vs Regression trees on all independent variables

# Let's use all the variables

# Split the data
library(caTools)
set.seed(123)
split = sample.split(boston$MEDV, SplitRatio = 0.7)
train = subset(boston, split==TRUE)
test = subset(boston, split==FALSE)

names(boston)
##  [1] "TOWN"    "TRACT"   "LON"     "LAT"     "MEDV"    "CRIM"    "ZN"     
##  [8] "INDUS"   "CHAS"    "NOX"     "RM"      "AGE"     "DIS"     "RAD"    
## [15] "TAX"     "PTRATIO"
# Create linear regression
linreg = lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
summary(linreg)
## 
## Call:
## lm(formula = MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + 
##     RM + AGE + DIS + RAD + TAX + PTRATIO, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.511  -2.712  -0.676   1.793  36.883 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.523e+02  4.367e+02  -0.578   0.5638    
## LAT          1.544e+00  5.192e+00   0.297   0.7664    
## LON         -2.987e+00  4.786e+00  -0.624   0.5329    
## CRIM        -1.808e-01  4.390e-02  -4.118 4.77e-05 ***
## ZN           3.250e-02  1.877e-02   1.731   0.0843 .  
## INDUS       -4.297e-02  8.473e-02  -0.507   0.6124    
## CHAS         2.904e+00  1.220e+00   2.380   0.0178 *  
## NOX         -2.161e+01  5.414e+00  -3.992 7.98e-05 ***
## RM           6.284e+00  4.827e-01  13.019  < 2e-16 ***
## AGE         -4.430e-02  1.785e-02  -2.482   0.0135 *  
## DIS         -1.577e+00  2.842e-01  -5.551 5.63e-08 ***
## RAD          2.451e-01  9.728e-02   2.519   0.0122 *  
## TAX         -1.112e-02  5.452e-03  -2.040   0.0421 *  
## PTRATIO     -9.835e-01  1.939e-01  -5.072 6.38e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.595 on 350 degrees of freedom
## Multiple R-squared:  0.665,  Adjusted R-squared:  0.6525 
## F-statistic: 53.43 on 13 and 350 DF,  p-value: < 2.2e-16
# Make predictions
linreg.pred = predict(linreg, newdata=test)
linreg.sse = sum((linreg.pred - test$MEDV)^2)
linreg.sse
## [1] 3037.088
# Create a CART model (#Regression trees)
tree = rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data=train)
prp(tree)

# Make predictions
tree.pred = predict(tree, newdata=test)
tree.sse = sum((tree.pred - test$MEDV)^2)
tree.sse
## [1] 4328.988
# Video 7

#Applying cross validation on the training set

# Load libraries for cross-validation
library(caret)
library(e1071)

# Number of folds
tr.control = trainControl(method = "cv", number = 10)

# cp values
cp.grid = expand.grid( .cp = (0:10)*0.001)

# What did we just do?
1*0.001 
## [1] 0.001
10*0.001 
## [1] 0.01
0:10
##  [1]  0  1  2  3  4  5  6  7  8  9 10
0:10 * 0.001
##  [1] 0.000 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010
# Cross-validation
tr = train(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO, data = train, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
tr
## CART 
## 
## 364 samples
##  15 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 326, 327, 328, 328, 328, 328, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE      Rsquared 
##   0.000  4.791281  0.7502534
##   0.001  4.772360  0.7526106
##   0.002  4.813061  0.7483929
##   0.003  4.807751  0.7462847
##   0.004  4.817262  0.7428971
##   0.005  4.889312  0.7350155
##   0.006  4.874780  0.7322640
##   0.007  4.864068  0.7335005
##   0.008  4.803353  0.7388394
##   0.009  4.803353  0.7388394
##   0.010  4.803353  0.7388394
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.001.
# Extract tree
best.tree = tr$finalModel
prp(best.tree)

# Make predictions
best.tree.pred = predict(best.tree, newdata=test)
best.tree.sse = sum((best.tree.pred - test$MEDV)^2)
best.tree.sse
## [1] 3675.766
linreg.sse
## [1] 3037.088
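#Even after cross-validation, the best tree's SSE (3675.77) is still higher than linear regression's (3037.09): the linear model remains the better fit for this data set.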

Assignment 4

UNDERSTANDING WHY PEOPLE VOTE

UNDERSTANDING WHY PEOPLE VOTE

In August 2006, three researchers (Alan Gerber and Donald Green of Yale University, and Christopher Larimer of the University of Northern Iowa) carried out a large-scale field experiment in Michigan, USA, to test the hypothesis that one of the reasons people vote is social, or extrinsic, pressure. To quote the first paragraph of their 2008 research paper:

Among the most striking features of a democratic political system is the participation of millions of voters in elections. Why do large numbers of people vote, despite the fact that … “the casting of a single vote is of no significance where there is a multitude of electors”? One hypothesis is adherence to social norms. Voting is widely regarded as a citizen duty, and citizens worry that others will think less of them if they fail to participate in elections. Voters’ sense of civic duty has long been a leading explanation of vote turnout…

In this homework problem we will use both logistic regression and classification trees to analyze the data they collected.

THE DATA

The researchers grouped about 344,000 voters into different groups randomly - about 191,000 voters were a “control” group, and the rest were categorized into one of four “treatment” groups. These five groups correspond to five binary variables in the dataset.

  1. “Civic Duty” (variable civicduty) group members were sent a letter that simply said “DO YOUR CIVIC DUTY - VOTE!”
  2. “Hawthorne Effect” (variable hawthorne) group members were sent a letter that had the “Civic Duty” message plus the additional message “YOU ARE BEING STUDIED” and they were informed that their voting behavior would be examined by means of public records.
  3. “Self” (variable self) group members received the “Civic Duty” message as well as the recent voting record of everyone in that household and a message stating that another message would be sent after the election with updated records.
  4. “Neighbors” (variable neighbors) group members were given the same message as that for the “Self” group, except the message not only had the household voting records but also that of neighbors - maximizing social pressure.
  5. “Control” (variable control) group members were not sent anything, and represented the typical voting situation.

Additional variables include sex (0 for male, 1 for female), yob (year of birth), and the dependent variable voting (1 if they voted, 0 otherwise).

#Let's load the data
gerber<-read.csv("gerber.csv")
str(gerber)
## 'data.frame':    344084 obs. of  8 variables:
##  $ sex      : int  0 1 1 1 0 1 0 0 1 0 ...
##  $ yob      : int  1941 1947 1982 1950 1951 1959 1956 1981 1968 1967 ...
##  $ voting   : int  0 0 1 1 1 1 1 0 0 0 ...
##  $ hawthorne: int  0 0 1 1 1 0 0 0 0 0 ...
##  $ civicduty: int  1 1 0 0 0 0 0 0 0 0 ...
##  $ neighbors: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ self     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ control  : int  0 0 0 0 0 1 1 1 1 1 ...
#PROBLEM 1.1 - EXPLORATION AND LOGISTIC REGRESSION

#We will first get familiar with the data. Load the CSV file gerber.csv into R. What proportion of people in this dataset voted in this election?
prop.table(table(gerber$voting)) #108696/(108696+235388) = 0.316
## 
##         0         1 
## 0.6841004 0.3158996
#Ans:0.3158996 

######################################

#PROBLEM 1.2 - EXPLORATION AND LOGISTIC REGRESSION  

#Which of the four "treatment groups" had the largest percentage of people who actually voted (voting = 1)?
tapply(gerber$voting,gerber$civicduty, mean)
##         0         1 
## 0.3160698 0.3145377
tapply(gerber$voting, gerber$hawthorne, mean)
##         0         1 
## 0.3150909 0.3223746
tapply(gerber$voting, gerber$self, mean)
##         0         1 
## 0.3122446 0.3451515
tapply(gerber$voting, gerber$neighbors, mean)
##         0         1 
## 0.3081505 0.3779482
#Ans:Neighbors
#EXPLANATION:The variable with the largest value in the "1" column has the largest fraction of people voting in their group - this is the Neighbors group.

########################################

#PROBLEM 1.3 - EXPLORATION AND LOGISTIC REGRESSION  

#Build a logistic regression model for voting using the four treatment group variables as the independent variables (civicduty, hawthorne, self, and neighbors). Use all the data to build the model (DO NOT split the data into a training set and testing set). Which of the following coefficients are significant in the logistic regression model? Select all that apply.

LogModel <- glm(voting~ hawthorne + civicduty + neighbors + self  , data = gerber, family = binomial)
summary(LogModel)
## 
## Call:
## glm(formula = voting ~ hawthorne + civicduty + neighbors + self, 
##     family = binomial, data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9744  -0.8691  -0.8389   1.4586   1.5590  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -0.863358   0.005006 -172.459  < 2e-16 ***
## hawthorne    0.120477   0.012037   10.009  < 2e-16 ***
## civicduty    0.084368   0.012100    6.972 3.12e-12 ***
## neighbors    0.365092   0.011679   31.260  < 2e-16 ***
## self         0.222937   0.011867   18.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428090  on 344079  degrees of freedom
## AIC: 428100
## 
## Number of Fisher Scoring iterations: 4
#Ans:civicduty,hawthorne,self,neighbors
#EXPLANATION:If you look at the output of summary(LogModel), you can see that all of the treatment group variables are significant.

############################################

#PROBLEM 1.4 - EXPLORATION AND LOGISTIC REGRESSION  

#Using a threshold of 0.3, what is the accuracy of the logistic regression model? (When making predictions, you don't need to use the newdata argument since we didn't split our data.)
# Predictions on the test set
predictLog= predict(LogModel, type="response")

# Confusion matrix with threshold of 0.3
table(gerber$voting, predictLog > 0.3)
##    
##      FALSE   TRUE
##   0 134513 100875
##   1  56730  51966
#Therefore the overall accuracy is
(134513+51966)/(134513+100875+56730+51966)
## [1] 0.5419578
#Ans:0.5419578
#######################################

#PROBLEM 1.5 - EXPLORATION AND LOGISTIC REGRESSION  

#Using a threshold of 0.5, what is the accuracy of the logistic regression model?

# Confusion matrix with threshold of 0.5
table(gerber$voting,predictLog > 0.5)
##    
##      FALSE
##   0 235388
##   1 108696
#Therefore the overall accuracy is
(235388+0)/(108696+235388)
## [1] 0.6841004
#Ans:0.6841004

####################################

#PROBLEM 1.6 - EXPLORATION AND LOGISTIC REGRESSION  

#Compare your previous two answers to the percentage of people who did not vote (the baseline accuracy) and compute the AUC of the model. What is happening here?

#Baseline accuracy/model
(134513+100875)/(134513+100875+56730+51966) #From the confusion matrix: an easier baseline is to predict the most frequent outcome (not voting)
## [1] 0.6841004
#or
table(gerber$voting) # 235388/(235388+108696)
## 
##      0      1 
## 235388 108696
#  AUC 
library(ROCR)
ROCRpred = prediction(predictLog, gerber$voting)
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.5308461
#Ans: Even though all of the variables are significant, this is a weak predictive model. 
#EXPLANATION:Even though all of our variables are significant, our model does not improve over the baseline model of just predicting that someone will not vote, and the AUC is low. So while the treatment groups do make a difference, this is a weak predictive model.

##############################################

#PROBLEM 2.1 - TREES   

#We will now try out trees. Build a CART tree for voting using all data and the same four treatment variables we used before. Don't set the option method="class" - we are actually going to create a regression tree here. We are interested in building a tree to explore the fraction of people who vote, or the probability of voting. We'd like CART to split our groups if they have different probabilities of voting. If we used method='class', CART would only split if one of the groups had a probability of voting above 50% and the other had a probability of voting less than 50% (since the predicted outcomes would be different). However, with regression trees, CART will split even if both groups have probability less than 50%.

#Leave all the parameters at their default values. You can use the following command in R to build the tree:

CARTmodel = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber)
prp(CARTmodel)

#Plot the tree. What happens, and if relevant, why?
#Ans:No variables are used (the tree is only a root node) - none of the variables make a big enough effect to be split on.
#EXPLANATION:If you plot the tree, with prp(CARTmodel), you should just see one leaf! There are no splits in the tree, because none of the variables make a big enough effect to be split on.
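#A quick programmatic check (a sketch, not part of the assignment): an rpart tree with no splits has a single row in its frame.
nrow(CARTmodel$frame)  # 1 row = root node only, i.e. no splits
## [1] 1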

############################################

#PROBLEM 2.2 - TREES  

#Now build the tree using the command:

CARTmodel2 = rpart(voting ~ civicduty + hawthorne + self + neighbors, data=gerber, cp=0.0)
#to force the complete tree to be built. Then plot the tree. What do you observe about the order of the splits?
prp(CARTmodel2)

#Ans: Neighbor is the first split, civic duty is the last.
#EXPLANATION:We saw in Problem 1 that the highest fraction of voters was in the Neighbors group, followed by the Self group, followed by the Hawthorne group, and lastly the Civic Duty group. And we see here that the tree detects this trend.

############################################

#PROBLEM 2.3 - TREES  

#Using only the CART tree plot, determine what fraction (a number between 0 and 1) of "Civic Duty" people voted:
#Ans:0.31
#EXPLANATION:You can find this answer by reading the tree - the people in the civic duty group correspond to the bottom right split, which has value 0.31 in the leaf.

##########################################

#PROBLEM 2.4 - TREES  

#Make a new tree that includes the "sex" variable, again with cp = 0.0. Notice that sex appears as a split that is of secondary importance to the treatment group.
#Now build the tree using the command:

CARTmodel3= rpart(voting ~ civicduty + hawthorne + self + neighbors + sex, data=gerber, cp=0.0)
prp(CARTmodel3)

#In the control group, which gender is more likely to vote?
#Ans:Men (0) 

#In the "Civic Duty" group, which gender is more likely to vote?
#Men (0) 

#EXPLANATION:if you plot the tree with prp(CARTmodel3), you can see that there is a split on the "sex" variable after every treatment variable split. For the control group, which corresponds to the bottom left, sex = 0 (male) corresponds to a higher voting percentage.For the civic duty group, which corresponds to the bottom right, sex = 0 (male) corresponds to a higher voting percentage.

############################

#PROBLEM 3.1 - INTERACTION TERMS  

#We know trees can handle "nonlinear" relationships, e.g. "in the 'Civic Duty' group and female", but as we will see in the next few questions, it is possible to do the same for logistic regression. First, let's explore what trees can tell us some more.

#Let's just focus on the "Control" treatment group. Create a regression tree using just the "control" variable, then create another tree with the "control" and "sex" variables, both with cp=0.0.

CARTcontrol= rpart(voting ~ control, data=gerber, cp=0.0)
prp(CARTcontrol,digits = 6)

CARTsex= rpart(voting ~ control+ sex, data=gerber, cp=0.0)
prp(CARTsex,digits = 6)

#In the "control" only tree, what is the absolute value of the difference in the predicted probability of voting between being in the control group versus being in a different group? You can use the absolute value function to get answer, i.e. abs(Control Prediction - Non-Control Prediction). Add the argument "digits = 6" to the prp command to get a more accurate estimate.
abs(0.296638-0.34)
## [1] 0.043362
#Ans:0.043362
#EXPLANATION:The split says that if control = 1, predict 0.296638, and if control = 0, predict 0.34. The absolute difference between these is 0.043362.

###########################################

#PROBLEM 3.2 - INTERACTION TERMS  

#Now, using the second tree (with control and sex), determine who is affected more by NOT being in the control group (being in any of the four treatment groups):

abs(0.302795-0.345818)
## [1] 0.043023
abs(0.290456-0.334176)
## [1] 0.04372
#Ans:They are affected about the same (change in probability within 0.001 of each other)
#EXPLANATION:The first split says that if control = 1, go left. Then, if sex = 1 (female) predict 0.290456, and if sex = 0 (male) predict 0.302795. On the other side of the tree, where control = 0, if sex = 1 (female) predict 0.334176, and if sex = 0 (male) predict 0.345818. So for women, not being in the control group increases the fraction voting by 0.04372. For men, not being in the control group increases the fraction voting by 0.04302. So men and women are affected about the same.

#####################################

#PROBLEM 3.3 - INTERACTION TERMS  

#Going back to logistic regression now, create a model using "sex" and "control". Interpret the coefficient for "sex":
LogModelSex<- glm(voting~ sex + control , data = gerber, family = binomial)
summary(LogModelSex)
## 
## Call:
## glm(formula = voting ~ sex + control, family = binomial, data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9220  -0.9012  -0.8290   1.4564   1.5717  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.635538   0.006511 -97.616  < 2e-16 ***
## sex         -0.055791   0.007343  -7.597 3.02e-14 ***
## control     -0.200142   0.007364 -27.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428443  on 344081  degrees of freedom
## AIC: 428449
## 
## Number of Fisher Scoring iterations: 4
#Ans:Coefficient is negative, reflecting that women are less likely to vote 
#EXPLANATION:If you look at the summary of the model, you can see that the coefficient for the "sex" variable is -0.055791. This means that women are less likely to vote, since women have a larger value in the sex variable, and a negative coefficient means that larger values are predictive of 0.

################################################

#PROBLEM 3.4 - INTERACTION TERMS  

#The regression tree calculated the percentage voting exactly for every one of the four possibilities (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control). Logistic regression has attempted to do the same, although it wasn't able to do as well because it can't consider exactly the joint possibility of being a woman and in the control group.

#We can quantify this precisely. Create the following dataframe (this contains all of the possible values of sex and control), and evaluate your logistic regression using the predict function (where "LogModelSex" is the name of your logistic regression model that uses both control and sex):

Possibilities = data.frame(sex=c(0,0,1,1),control=c(0,1,0,1))
predict(LogModelSex, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3462559 0.3024455 0.3337375 0.2908065
#The four values in the results correspond to the four possibilities in the order they are stated above ( (Man, Not Control), (Man, Control), (Woman, Not Control), (Woman, Control) ). What is the absolute difference between the tree and the logistic regression for the (Woman, Control) case? Give an answer with five numbers after the decimal point.
abs(0.2908065-0.290456)
## [1] 0.0003505
#Ans:0.0003505
#EXPLANATION:The CART tree predicts 0.290456 for the (Woman, Control) case, and the logistic regression model predicts 0.2908065. So the absolute difference, to five decimal places, is 0.00035.

############################################

#PROBLEM 3.5 - INTERACTION TERM

#So the difference is not too big for this dataset, but it is there. We're going to add a new term to our logistic regression now, that is the combination of the "sex" and "control" variables - so if this new variable is 1, that means the person is a woman AND in the control group. We can do that with the following command:

LogModel2 = glm(voting ~ sex + control + sex:control, data=gerber, family="binomial")
summary(LogModel2)
## 
## Call:
## glm(formula = voting ~ sex + control + sex:control, family = "binomial", 
##     data = gerber)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9213  -0.9019  -0.8284   1.4573   1.5724  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.637471   0.007603 -83.843  < 2e-16 ***
## sex         -0.051888   0.010801  -4.804 1.55e-06 ***
## control     -0.196553   0.010356 -18.980  < 2e-16 ***
## sex:control -0.007259   0.014729  -0.493    0.622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429238  on 344083  degrees of freedom
## Residual deviance: 428442  on 344080  degrees of freedom
## AIC: 428450
## 
## Number of Fisher Scoring iterations: 4
#How do you interpret the coefficient for the new variable in isolation? That is, how does it relate to the dependent variable?
#Ans: If a person is a woman and in the control group, the chance that she voted goes down.
#EXPLANATION:This coefficient is negative, so that means that a value of 1 in this variable decreases the chance of voting. This variable will have value 1 if the person is a woman and in the control group.

##################################

#PROBLEM 3.6 - INTERACTION TERMS  

#Run the same code as before to calculate the average for each group:

predict(LogModel2, newdata=Possibilities, type="response")
##         1         2         3         4 
## 0.3458183 0.3027947 0.3341757 0.2904558
#Now what is the difference between the logistic regression model and the CART model for the (Woman, Control) case? Again, give your answer with five numbers after the decimal point.
abs(0.2904558-0.290456)
## [1] 2e-07
#Ans:2e-07
#EXPLANATION:The logistic regression model now predicts 0.2904558 for the (Woman, Control) case, so there is now a very small difference (practically zero) between CART and logistic regression.

#############################################

#PROBLEM 3.7 - INTERACTION TERMS  

#This example has shown that trees can capture nonlinear relationships that logistic regression can not, but that we can get around this sometimes by using variables that are the combination of two variables. Should we always include all possible interaction terms of the independent variables when building a logistic regression model?
#Ans: No
#EXPLANATION:We should not use all possible interaction terms in a logistic regression model due to overfitting. Even in this simple problem, we have four treatment groups and two values for sex. If we have an interaction term for every treatment variable with sex, we will double the number of variables. In smaller data sets, this could quickly lead to overfitting.

LETTER RECOGNITION

One of the earliest applications of the predictive analytics methods we have studied so far in this class was to automatically recognize letters, which post office machines use to sort mail. In this problem, we will build a model that uses statistics of images of four letters in the Roman alphabet – A, B, P, and R – to predict which letter a particular image corresponds to.

Note that this is a multiclass classification problem. We have mostly focused on binary classification problems (e.g., predicting whether an individual voted or not, whether the Supreme Court will affirm or reverse a case, whether or not a person is at risk for a certain disease, etc.). In this problem, we have more than two classifications that are possible for each observation, like in the D2Hawkeye lecture.

The file letters_ABPR.csv contains 3116 observations, each of which corresponds to a certain image of one of the four letters A, B, P and R. The images came from 20 different fonts, which were then randomly distorted to produce the final images; each such distorted image is represented as a collection of pixels, each of which is “on” or “off”. For each such distorted image, we have available certain statistics of the image in terms of these pixels, as well as which of the four letters the image is. This data comes from the UCI Machine Learning Repository.

This dataset contains the following 17 variables:

  • letter = the letter that the image corresponds to (A, B, P or R)
  • xbox = the horizontal position of where the smallest box covering the letter shape begins.
  • ybox = the vertical position of where the smallest box covering the letter shape begins.
  • width = the width of this smallest box.
  • height = the height of this smallest box.
  • onpix = the total number of “on” pixels in the character image
  • xbar = the mean horizontal position of all of the “on” pixels
  • ybar = the mean vertical position of all of the “on” pixels
  • x2bar = the mean squared horizontal position of all of the “on” pixels in the image
  • y2bar = the mean squared vertical position of all of the “on” pixels in the image
  • xybar = the mean of the product of the horizontal and vertical position of all of the “on” pixels in the image
  • x2ybar = the mean of the product of the squared horizontal position and the vertical position of all of the “on” pixels
  • xy2bar = the mean of the product of the horizontal position and the squared vertical position of all of the “on” pixels
  • xedge = the mean number of edges (the number of times an “off” pixel is followed by an “on” pixel, or the image boundary is hit) as the image is scanned from left to right, along the whole vertical length of the image
  • xedgeycor = the mean of the product of the number of horizontal edges at each vertical position and the vertical position
  • yedge = the mean number of edges as the image is scanned from top to bottom, along the whole horizontal length of the image
  • yedgexcor = the mean of the product of the number of vertical edges at each horizontal position and the horizontal position
#Let's load the dataset
letters<-read.csv("letters_ABPR.csv")
str(letters)
## 'data.frame':    3116 obs. of  17 variables:
##  $ letter   : Factor w/ 4 levels "A","B","P","R": 2 1 4 2 3 4 4 1 3 3 ...
##  $ xbox     : int  4 1 5 5 3 8 2 3 8 6 ...
##  $ ybox     : int  2 1 9 9 6 10 6 7 14 10 ...
##  $ width    : int  5 3 5 7 4 8 4 5 7 8 ...
##  $ height   : int  4 2 7 7 4 6 4 5 8 8 ...
##  $ onpix    : int  4 1 6 10 2 6 3 3 4 7 ...
##  $ xbar     : int  8 8 6 9 4 7 6 12 5 8 ...
##  $ ybar     : int  7 2 11 8 14 7 7 2 10 5 ...
##  $ x2bar    : int  6 2 7 4 8 3 5 3 6 7 ...
##  $ y2bar    : int  6 2 3 4 1 5 5 2 3 5 ...
##  $ xybar    : int  7 8 7 6 11 8 6 10 12 7 ...
##  $ x2ybar   : int  6 2 3 8 6 4 5 2 5 6 ...
##  $ xy2bar   : int  6 8 9 6 3 8 7 9 4 6 ...
##  $ xedge    : int  2 1 2 6 0 6 3 2 4 3 ...
##  $ xedgeycor: int  8 6 7 11 10 6 7 6 10 9 ...
##  $ yedge    : int  7 2 5 8 4 7 5 3 4 8 ...
##  $ yedgexcor: int  10 7 11 7 8 7 8 8 8 9 ...
#PROBLEM 1.1 - PREDICTING B OR NOT B  

#Let's warm up by attempting to predict just whether a letter is B or not. To begin, load the file letters_ABPR.csv into R, and call it letters. Then, create a new variable isB in the dataframe, which takes the value "TRUE" if the observation corresponds to the letter B, and "FALSE" if it does not. You can do this by typing the following command into your R console:
letters$isB = as.factor(letters$letter == "B")

#Now split the data set into a training and testing set, putting 50% of the data in the training set. Set the seed to 1000 before making the split. The first argument to sample.split should be the dependent variable "letters$isB". Remember that TRUE values from sample.split should go in the training set.

library(caTools)
set.seed(1000) # to get the same split every time
spl = sample.split(letters$isB, SplitRatio = 0.5)
train = subset(letters, spl==TRUE)
test= subset(letters, spl==FALSE)

table(train$isB) #The output of table(train$isB) tells us that "not B" is more common
## 
## FALSE  TRUE 
##  1175   383
#Before building models, let's consider a baseline method that always predicts the most frequent outcome, which is "not B". What is the accuracy of this baseline method on the test set?
table(test$isB)
## 
## FALSE  TRUE 
##  1175   383
1175/(1175+383)
## [1] 0.754172
#Ans:0.754172
#EXPLANATION:To compute the accuracy of the baseline method on the test set, we first need to see which outcome value is more frequent in the training set, by using the table function. The output of table(train$isB) tells us that "not B" is more common. So our baseline method is to predict "not B" for everything. How well would this do on the test set? We need to run the table command again, this time on the test set:table(test$isB).There are 1175 observations that are not B, and 383 observations that are B. So the baseline method accuracy on the test set would be 1175/(1175+383) = 0.754172
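#Equivalently, we can read off the class proportions in the test set in one step:
prop.table(table(test$isB))
## 
##     FALSE      TRUE 
## 0.7541720 0.2458280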

##########################################################

#PROBLEM 1.2 - PREDICTING B OR NOT B  

#Now build a classification tree to predict whether a letter is a B or not, using the training set to build your model. Remember to remove the variable "letter" out of the model, as this is related to what we are trying to predict! To just remove one variable, you can either write out the other variables, or remember what we did in the Billboards problem in Week 3, and use the following notation:

library(rpart)      #for rpart()
library(rpart.plot) #for prp()

CARTb = rpart(isB ~ . - letter, data=train, method="class")

#We are just using the default parameters in our CART model, so we don't need to add the minbucket or cp arguments at all. We also added the argument method="class" since this is a classification problem.

#What is the accuracy of the CART model on the test set? (Use type="class" when making predictions on the test set.)

#The model can be represented as a decision tree:
prp(CARTb) #plotting the classification tree

# Make predictions
predictions = predict(CARTb, newdata = test, type = "class") #We need to give type = "class" if we want the majority class predictions. This is like using a threshold of 0.5.

#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test$isB, predictions)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##        predictions
##         FALSE TRUE
##   FALSE  1118   57
##   TRUE     43  340
#let's now compute the overall accuracy

accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(1118+340)/(1118+57+43+340)
## [1] 0.9358151
#Ans:0.9358151
#EXPLANATION:To compute the accuracy on the test set, we need to divide the sum of the true positives and true negatives by the total number of observations: (1118+340)/nrow(test) = 0.9358151

########################################

#PROBLEM 1.3 - PREDICTING B OR NOT B   

#Now, build a random forest model to predict whether the letter is a B or not (the isB variable) using the training set. You should use all of the other variables as independent variables, except letter (since it helped us define what we are trying to predict!). Use the default settings for ntree and nodesize (don't include these arguments at all). Right before building the model, set the seed to 1000. (NOTE: You might get a slightly different answer on this problem, even if you set the random seed. This has to do with your operating system and the implementation of the random forest algorithm.)

library(randomForest)

set.seed(1000)
# Build random forest model
RFb = randomForest(isB ~ . - letter, data = train) 
#or RFb = randomForest(isB ~ xbox + ybox + width + height + onpix + xbar + ybar + x2bar + y2bar + xybar + x2ybar + xy2bar + xedge + xedgeycor + yedge + yedgexcor, data=train)

#What is the accuracy of the model on the test set?

predictions = predict(RFb, newdata=test)
#Now let's assess the accuracy of the model through the confusion matrix
cmat_forest<-table(test$isB, predictions)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_forest
##        predictions
##         FALSE TRUE
##   FALSE  1165   10
##   TRUE      9  374
#let's now compute the overall accuracy

accu_forest <- (cmat_forest[1,1] + cmat_forest[2,2])/sum(cmat_forest)
accu_forest  #or (1165+374)/(1165+10+9+374) 
## [1] 0.9878049
#Ans: 0.9878049
#EXPLANATION:The accuracy of the model on the test set is the sum of the true positives and true negatives, divided by the total number of observations in the test set:(1165+374)/nrow(test) = 0.9878049

#######################################################

#PROBLEM 2.1 - PREDICTING THE LETTERS A, B, P, R  

#Let us now move on to the problem that we were originally interested in, which is to predict whether or not a letter is one of the four letters A, B, P or R.

#As we saw in the D2Hawkeye lecture, building a multiclass classification CART model in R is no harder than building the models for binary classification problems. Fortunately, building a random forest model is just as easy.

#The variable in our data frame which we will be trying to predict is "letter". Start by converting letter in the original data set (letters) to a factor by running the following command in R:

letters$letter = as.factor( letters$letter )

#Now, generate new training and testing sets of the letters data frame using letters$letter as the first input to the sample.split function. Before splitting, set your seed to 2000. Again put 50% of the data in the training set. (Why do we need to split the data again? Remember that sample.split balances the outcome variable in the training and testing sets. With a new outcome variable, we want to re-generate our split.)

library(caTools)
set.seed(2000) # to get the same split every time
spl = sample.split(letters$letter, SplitRatio = 0.5)
train2 = subset(letters, spl==TRUE)
test2= subset(letters, spl==FALSE)

#In a multiclass classification problem, a simple baseline model is to predict the most frequent class of all of the options.
table(train2$letter) #tells us that "P" has the most observations
## 
##   A   B   P   R 
## 394 383 402 379
#What is the baseline accuracy on the testing set?

table(test2$letter) #"P" was the most frequent class in the training set; count its test-set frequency
## 
##   A   B   P   R 
## 395 383 401 379
401/nrow(test2)
## [1] 0.2573813
#Ans:0.2573813
#EXPLANATION:To compute the accuracy of the baseline method on the test set, we need to first figure out the most common outcome in the training set. The output of table(train2$letter) tells us that "P" has the most observations. So we will predict P for all letters. On the test set, we can run the table command table(test2$letter) to see that it has 401 observations that are actually P. So the test set accuracy of the baseline method is 401/nrow(test2) = 0.2573813.

##################################

#PROBLEM 2.2 - PREDICTING THE LETTERS A, B, P, R  

#Now build a classification tree to predict "letter", using the training set to build your model. You should use all of the other variables as independent variables, except "isB", since it is related to what we are trying to predict! Just use the default parameters in your CART model. Add the argument method="class" since this is a classification problem. Even though we have multiple classes here, nothing changes in how we build the model from the binary case.
# Build CART model
CARTletter = rpart(letter ~ . - isB, data = train2, method="class")
prp(CARTletter)

#What is the test set accuracy of your CART model? Use the argument type="class" when making predictions.
#(HINT: When you are computing the test set accuracy using the confusion matrix, you want to add everything on the main diagonal and divide by the total number of observations in the test set, which can be computed with nrow(test), where test is the name of your test set).

# Make predictions
predictLetter = predict(CARTletter, newdata = test2, type = "class") #We need to give type = "class" if we want the majority class predictions. This is like using a threshold of 0.5.

#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test2$letter, predictLetter)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##    predictLetter
##       A   B   P   R
##   A 348   4   0  43
##   B   8 318  12  45
##   P   2  21 363  15
##   R  10  24   5 340
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2] + cmat_CART[3,3] + cmat_CART[4,4])/sum(cmat_CART) #(348+318+363+340)/nrow(test2)
accu_CART  
## [1] 0.8786906
#Ans:0.8786906
#Looking at the confusion matrix, table(test2$letter, predictLetter), we want to sum the main diagonal (the correct predictions) and divide by the total number of observations in the test set
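#or equivalently, sum the main diagonal in one step:
sum(diag(cmat_CART)) / nrow(test2)
## [1] 0.8786906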


##################################

#PROBLEM 2.3 - PREDICTING THE LETTERS A, B, P, R  

#Now build a random forest model on the training data, using the same independent variables as in the previous problem -- again, don't forget to remove the isB variable. Just use the default parameter values for ntree and nodesize (you don't need to include these arguments at all). Set the seed to 1000 right before building your model. (Remember that you might get a slightly different result even if you set the random seed.)

library(randomForest)
set.seed(1000)

# Build random forest model
RFletter= randomForest(letter ~ .-isB, data = train2) 

#What is the test set accuracy of your random forest model?
predictLetter= predict(RFletter, newdata=test2)

#Now let's assess the accuracy of the model through the confusion matrix
cmat_forest<-table(test2$letter, predictLetter)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_forest
##    predictLetter
##       A   B   P   R
##   A 390   0   3   2
##   B   0 380   1   2
##   P   0   5 393   3
##   R   3  12   0 364
#let's now compute the overall accuracy
accu_forest <- (cmat_forest[1,1] + cmat_forest[2,2] + cmat_forest[3,3] + cmat_forest[4,4])/sum(cmat_forest)
accu_forest  #or (390+380+393+364)/nrow(test2) = 0.9801027
## [1] 0.9801027
#Ans: 0.9801027
#The test set accuracy is the sum of the numbers on the main diagonal, divided by the total number of observations in the test set.

#You should find this value rather striking, for several reasons. The first is that it is significantly higher than the value for CART, highlighting the gain in accuracy that is possible from using random forest models. The second is that while the accuracy of CART decreased significantly as we transitioned from the problem of predicting B/not B (a relatively simple problem) to the problem of predicting the four letters (certainly a harder problem), the accuracy of the random forest model decreased by a tiny amount.
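#A quick side-by-side of the four test-set accuracies computed above makes the comparison concrete:
data.frame(problem = c("isB", "isB", "letter", "letter"),
           model = c("CART", "random forest", "CART", "random forest"),
           accuracy = c(0.9358151, 0.9878049, 0.8786906, 0.9801027))
##   problem         model  accuracy
## 1     isB          CART 0.9358151
## 2     isB random forest 0.9878049
## 3  letter          CART 0.8786906
## 4  letter random forest 0.9801027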

PREDICTING EARNINGS FROM CENSUS DATA

The United States government periodically collects demographic information by conducting a census.

In this problem, we are going to use census information about an individual to predict how much a person earns – in particular, whether the person earns more than $50,000 per year. This data comes from the UCI Machine Learning Repository.

The file census.csv contains 1994 census data for 31,978 individuals in the United States.

The dataset includes the following 13 variables:

  • age = the age of the individual in years
  • workclass = the classification of the individual’s working status (does the person work for the federal government, work for the local government, work without pay, and so on)
  • education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, so on)
  • maritalstatus = the marital status of the individual
  • occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales and so on)
  • relationship = relationship of individual to his/her household
  • race = the individual’s race
  • sex = the individual’s sex
  • capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
  • capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
  • hoursperweek = the number of hours the individual works per week
  • nativecountry = the native country of the individual
  • over50k = whether or not the individual earned more than $50,000 in 1994
#PROBLEM 1.1 - A LOGISTIC REGRESSION MODEL

#Let's begin by building a logistic regression model to predict whether an individual's earnings are above $50,000 (the variable "over50k") using all of the other variables as independent variables. First, read the dataset census.csv into R.

#let's read in the dataset
Census<-read.csv("census.csv")
str(Census)
## 'data.frame':    31978 obs. of  13 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ education    : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ maritalstatus: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation   : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ over50k      : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
#Then, split the data randomly into a training set and a testing set, setting the seed to 2000 before creating the split. Split the data so that the training set contains 60% of the observations, while the testing set contains 40% of the observations.

library(caTools)

# Randomly split data into training and test sets
set.seed(2000)
spl<-sample.split(Census$over50k, SplitRatio=0.6) #split on the dependent variable: 60% of observations to the training set, 40% to the test set
train<-subset(Census,spl== TRUE)
test <-subset(Census, spl== FALSE)

#Next, build a logistic regression model to predict the dependent variable "over50k", using all of the other variables in the dataset as independent variables. Use the training set to build the model.

#Build logistic regression model using all other variables
censusglm<-glm(over50k ~ ., data=train, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(censusglm)
## 
## Call:
## glm(formula = over50k ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.1065  -0.5037  -0.1804  -0.0008   3.3383  
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error z value
## (Intercept)                              -8.658e+00  1.379e+00  -6.279
## age                                       2.548e-02  2.139e-03  11.916
## workclass Federal-gov                     1.105e+00  2.014e-01   5.489
## workclass Local-gov                       3.675e-01  1.821e-01   2.018
## workclass Never-worked                   -1.283e+01  8.453e+02  -0.015
## workclass Private                         6.012e-01  1.626e-01   3.698
## workclass Self-emp-inc                    7.575e-01  1.950e-01   3.884
## workclass Self-emp-not-inc                1.855e-01  1.774e-01   1.046
## workclass State-gov                       4.012e-01  1.961e-01   2.046
## workclass Without-pay                    -1.395e+01  6.597e+02  -0.021
## education 11th                            2.225e-01  2.867e-01   0.776
## education 12th                            6.380e-01  3.597e-01   1.774
## education 1st-4th                        -7.075e-01  7.760e-01  -0.912
## education 5th-6th                        -3.170e-01  4.880e-01  -0.650
## education 7th-8th                        -3.498e-01  3.126e-01  -1.119
## education 9th                            -1.258e-01  3.539e-01  -0.355
## education Assoc-acdm                      1.602e+00  2.427e-01   6.601
## education Assoc-voc                       1.541e+00  2.368e-01   6.506
## education Bachelors                       2.177e+00  2.218e-01   9.817
## education Doctorate                       2.761e+00  2.893e-01   9.544
## education HS-grad                         1.006e+00  2.169e-01   4.638
## education Masters                         2.421e+00  2.353e-01  10.289
## education Preschool                      -2.237e+01  6.864e+02  -0.033
## education Prof-school                     2.938e+00  2.753e-01  10.672
## education Some-college                    1.365e+00  2.195e-01   6.219
## maritalstatus Married-AF-spouse           2.540e+00  7.145e-01   3.555
## maritalstatus Married-civ-spouse          2.458e+00  3.573e-01   6.880
## maritalstatus Married-spouse-absent      -9.486e-02  3.204e-01  -0.296
## maritalstatus Never-married              -4.515e-01  1.139e-01  -3.962
## maritalstatus Separated                   3.609e-02  1.984e-01   0.182
## maritalstatus Widowed                     1.858e-01  1.962e-01   0.947
## occupation Adm-clerical                   9.470e-02  1.288e-01   0.735
## occupation Armed-Forces                  -1.008e+00  1.487e+00  -0.677
## occupation Craft-repair                   2.174e-01  1.109e-01   1.960
## occupation Exec-managerial                9.400e-01  1.138e-01   8.257
## occupation Farming-fishing               -1.068e+00  1.908e-01  -5.599
## occupation Handlers-cleaners             -6.237e-01  1.946e-01  -3.204
## occupation Machine-op-inspct             -1.862e-01  1.376e-01  -1.353
## occupation Other-service                 -8.183e-01  1.641e-01  -4.987
## occupation Priv-house-serv               -1.297e+01  2.267e+02  -0.057
## occupation Prof-specialty                 6.331e-01  1.222e-01   5.180
## occupation Protective-serv                6.267e-01  1.710e-01   3.664
## occupation Sales                          3.276e-01  1.175e-01   2.789
## occupation Tech-support                   6.173e-01  1.533e-01   4.028
## occupation Transport-moving                      NA         NA      NA
## relationship Not-in-family                7.881e-01  3.530e-01   2.233
## relationship Other-relative              -2.194e-01  3.137e-01  -0.699
## relationship Own-child                   -7.489e-01  3.507e-01  -2.136
## relationship Unmarried                    7.041e-01  3.720e-01   1.893
## relationship Wife                         1.324e+00  1.331e-01   9.942
## race Asian-Pac-Islander                   4.830e-01  3.548e-01   1.361
## race Black                                3.644e-01  2.882e-01   1.265
## race Other                                2.204e-01  4.513e-01   0.488
## race White                                4.108e-01  2.737e-01   1.501
## sex Male                                  7.729e-01  1.024e-01   7.545
## capitalgain                               3.280e-04  1.372e-05  23.904
## capitalloss                               6.445e-04  4.854e-05  13.277
## hoursperweek                              2.897e-02  2.101e-03  13.791
## nativecountry Canada                      2.593e-01  1.308e+00   0.198
## nativecountry China                      -9.695e-01  1.327e+00  -0.730
## nativecountry Columbia                   -1.954e+00  1.526e+00  -1.280
## nativecountry Cuba                        5.735e-02  1.323e+00   0.043
## nativecountry Dominican-Republic         -1.435e+01  3.092e+02  -0.046
## nativecountry Ecuador                    -3.550e-02  1.477e+00  -0.024
## nativecountry El-Salvador                -6.095e-01  1.395e+00  -0.437
## nativecountry England                    -6.707e-02  1.327e+00  -0.051
## nativecountry France                      5.301e-01  1.419e+00   0.374
## nativecountry Germany                     5.474e-02  1.306e+00   0.042
## nativecountry Greece                     -2.646e+00  1.714e+00  -1.544
## nativecountry Guatemala                  -1.293e+01  3.345e+02  -0.039
## nativecountry Haiti                      -9.221e-01  1.615e+00  -0.571
## nativecountry Holand-Netherlands         -1.282e+01  2.400e+03  -0.005
## nativecountry Honduras                   -9.584e-01  3.412e+00  -0.281
## nativecountry Hong                       -2.362e-01  1.492e+00  -0.158
## nativecountry Hungary                     1.412e-01  1.555e+00   0.091
## nativecountry India                      -8.218e-01  1.314e+00  -0.625
## nativecountry Iran                       -3.299e-02  1.366e+00  -0.024
## nativecountry Ireland                     1.579e-01  1.473e+00   0.107
## nativecountry Italy                       6.100e-01  1.333e+00   0.458
## nativecountry Jamaica                    -2.279e-01  1.387e+00  -0.164
## nativecountry Japan                       5.072e-01  1.375e+00   0.369
## nativecountry Laos                       -6.831e-01  1.661e+00  -0.411
## nativecountry Mexico                     -9.182e-01  1.303e+00  -0.705
## nativecountry Nicaragua                  -1.987e-01  1.507e+00  -0.132
## nativecountry Outlying-US(Guam-USVI-etc) -1.373e+01  8.502e+02  -0.016
## nativecountry Peru                       -9.660e-01  1.678e+00  -0.576
## nativecountry Philippines                 4.393e-02  1.281e+00   0.034
## nativecountry Poland                      2.410e-01  1.383e+00   0.174
## nativecountry Portugal                    7.276e-01  1.477e+00   0.493
## nativecountry Puerto-Rico                -5.769e-01  1.357e+00  -0.425
## nativecountry Scotland                   -1.188e+00  1.719e+00  -0.691
## nativecountry South                      -8.183e-01  1.341e+00  -0.610
## nativecountry Taiwan                     -2.590e-01  1.350e+00  -0.192
## nativecountry Thailand                   -1.693e+00  1.737e+00  -0.975
## nativecountry Trinadad&Tobago            -1.346e+00  1.721e+00  -0.782
## nativecountry United-States              -8.594e-02  1.269e+00  -0.068
## nativecountry Vietnam                    -1.008e+00  1.523e+00  -0.662
## nativecountry Yugoslavia                  1.402e+00  1.648e+00   0.851
##                                          Pr(>|z|)    
## (Intercept)                              3.41e-10 ***
## age                                       < 2e-16 ***
## workclass Federal-gov                    4.03e-08 ***
## workclass Local-gov                      0.043641 *  
## workclass Never-worked                   0.987885    
## workclass Private                        0.000218 ***
## workclass Self-emp-inc                   0.000103 ***
## workclass Self-emp-not-inc               0.295646    
## workclass State-gov                      0.040728 *  
## workclass Without-pay                    0.983134    
## education 11th                           0.437738    
## education 12th                           0.076064 .  
## education 1st-4th                        0.361897    
## education 5th-6th                        0.516008    
## education 7th-8th                        0.263152    
## education 9th                            0.722228    
## education Assoc-acdm                     4.10e-11 ***
## education Assoc-voc                      7.74e-11 ***
## education Bachelors                       < 2e-16 ***
## education Doctorate                       < 2e-16 ***
## education HS-grad                        3.52e-06 ***
## education Masters                         < 2e-16 ***
## education Preschool                      0.973996    
## education Prof-school                     < 2e-16 ***
## education Some-college                   5.00e-10 ***
## maritalstatus Married-AF-spouse          0.000378 ***
## maritalstatus Married-civ-spouse         6.00e-12 ***
## maritalstatus Married-spouse-absent      0.767155    
## maritalstatus Never-married              7.42e-05 ***
## maritalstatus Separated                  0.855672    
## maritalstatus Widowed                    0.343449    
## occupation Adm-clerical                  0.462064    
## occupation Armed-Forces                  0.498170    
## occupation Craft-repair                  0.049972 *  
## occupation Exec-managerial                < 2e-16 ***
## occupation Farming-fishing               2.15e-08 ***
## occupation Handlers-cleaners             0.001353 ** 
## occupation Machine-op-inspct             0.176061    
## occupation Other-service                 6.14e-07 ***
## occupation Priv-house-serv               0.954385    
## occupation Prof-specialty                2.22e-07 ***
## occupation Protective-serv               0.000248 ***
## occupation Sales                         0.005282 ** 
## occupation Tech-support                  5.63e-05 ***
## occupation Transport-moving                    NA    
## relationship Not-in-family               0.025562 *  
## relationship Other-relative              0.484263    
## relationship Own-child                   0.032716 *  
## relationship Unmarried                   0.058392 .  
## relationship Wife                         < 2e-16 ***
## race Asian-Pac-Islander                  0.173504    
## race Black                               0.206001    
## race Other                               0.625263    
## race White                               0.133356    
## sex Male                                 4.52e-14 ***
## capitalgain                               < 2e-16 ***
## capitalloss                               < 2e-16 ***
## hoursperweek                              < 2e-16 ***
## nativecountry Canada                     0.842879    
## nativecountry China                      0.465157    
## nativecountry Columbia                   0.200470    
## nativecountry Cuba                       0.965432    
## nativecountry Dominican-Republic         0.962972    
## nativecountry Ecuador                    0.980829    
## nativecountry El-Salvador                0.662181    
## nativecountry England                    0.959686    
## nativecountry France                     0.708642    
## nativecountry Germany                    0.966572    
## nativecountry Greece                     0.122527    
## nativecountry Guatemala                  0.969180    
## nativecountry Haiti                      0.568105    
## nativecountry Holand-Netherlands         0.995736    
## nativecountry Honduras                   0.778775    
## nativecountry Hong                       0.874155    
## nativecountry Hungary                    0.927653    
## nativecountry India                      0.531661    
## nativecountry Iran                       0.980736    
## nativecountry Ireland                    0.914628    
## nativecountry Italy                      0.647194    
## nativecountry Jamaica                    0.869467    
## nativecountry Japan                      0.712179    
## nativecountry Laos                       0.680866    
## nativecountry Mexico                     0.481103    
## nativecountry Nicaragua                  0.895132    
## nativecountry Outlying-US(Guam-USVI-etc) 0.987115    
## nativecountry Peru                       0.564797    
## nativecountry Philippines                0.972640    
## nativecountry Poland                     0.861624    
## nativecountry Portugal                   0.622327    
## nativecountry Puerto-Rico                0.670837    
## nativecountry Scotland                   0.489616    
## nativecountry South                      0.541809    
## nativecountry Taiwan                     0.847878    
## nativecountry Thailand                   0.329678    
## nativecountry Trinadad&Tobago            0.434105    
## nativecountry United-States              0.946020    
## nativecountry Vietnam                    0.507799    
## nativecountry Yugoslavia                 0.394874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21175  on 19186  degrees of freedom
## Residual deviance: 12104  on 19090  degrees of freedom
## AIC: 12298
## 
## Number of Fisher Scoring iterations: 15
#Which variables are significant, or have factors that are significant? (Use 0.1 as your significance threshold, so variables with a period or dot in the stars column should be counted too. You might see a warning message here - you can ignore it and proceed. This message is a warning that we might be overfitting our model to the training set.) Select all that apply.
#Ans:age,workclass,education,maritalstatus,occupation,relationship,sex,capitalgain,capitalloss,hoursperweek 

##########################################

#PROBLEM 1.2 - A LOGISTIC REGRESSION MODEL  

#What is the accuracy of the model on the testing set? Use a threshold of 0.5. (You might see a warning message when you make predictions on the test set - you can safely ignore it.)

#Out-of-Sample predictions of the Logistic Regression model
predictTest<-predict(censusglm, newdata=test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
#Accuracy of model using threshold of 0.5
cmat_LR<- table(test$over50k, predictTest > 0.5)
cmat_LR
##         
##          FALSE TRUE
##    <=50K  9051  662
##    >50K   1190 1888
#Overall accuracy on the testing set
accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR #If we divide the sum of the main diagonal by the sum of all of the entries in the matrix, we obtain the accuracy:(9051+1888)/(9051+662+1190+1888) = 0.8552107
## [1] 0.8552107
#Ans: 0.8552107

#######################################

#PROBLEM 1.3 - A LOGISTIC REGRESSION MODEL  

#What is the baseline accuracy for the testing set?

#We need to first determine the most frequent outcome in the training set. To do that, we table the dependent variable in the training set:
table(train$over50k)
## 
##  <=50K   >50K 
##  14570   4617
#"<=50K" is the more frequent outcome (14570 observations), so this is what the baseline predicts. To generate the accuracy of the baseline on the test set, we can table the dependent variable in the test set:

#Our baseline model predicts the most frequent outcome from the training set ("<=50K") for every observation in the test set
table(test$over50k) 
## 
##  <=50K   >50K 
##   9713   3078
9713/nrow(test) #9713/(9713+3078) = 0.7593621
## [1] 0.7593621
#Ans:0.7593621

###########################################

#PROBLEM 1.4 - A LOGISTIC REGRESSION MODEL  

#What is the area-under-the-curve (AUC) for this model on the test set?

#Let's evaluate our model using the ROC curve
library(ROCR)

#Compute the test set predictions
ROCRpred <- prediction(predictTest, test$over50k)
perf<- performance(ROCRpred, "tpr", "fpr")
plot(perf)

##Compute the AUC 
as.numeric(performance(ROCRpred, "auc")@y.values)
## [1] 0.9061598
#Ans: 0.9061598

#########################################

#PROBLEM 2.1 - A CART MODEL  

#We have just seen how the logistic regression model for this data achieves a high accuracy. Moreover, the significances of the variables give us a way to gauge which variables are relevant for this prediction task. However, it is not immediately clear which variables are more important than the others, especially due to the large number of factor variables in this problem.

#Let us now build a classification tree to predict "over50k". Use the training set to build the model, and all of the other variables as independent variables. Use the default parameters, so don't set a value for minbucket or cp. Remember to specify method="class" as an argument to rpart, since this is a classification problem. After you are done building the model, plot the resulting tree.

#Build CART model using all defaults
library(rpart)
library(rpart.plot)

censustree<-rpart(over50k ~ ., data=train, method="class") #method="class" arg tells us to make classification tree and not regression tree
prp(censustree)

#How many splits does the tree have in total?
#Ans:4

#################################

#PROBLEM 2.2 - A CART MODEL  

#Which variable does the tree split on at the first level (the very first split of the tree)?
#Ans:relationship.

####################################

#PROBLEM 2.3 - A CART MODEL  

#Which variables does the tree split on at the second level (immediately after the first split of the tree)? Select all that apply.
#Ans:education,capitalgain
#The second splits are on capitalgain and education.

######################################

#PROBLEM 2.4 - A CART MODEL  

#What is the accuracy of the model on the testing set? Use a threshold of 0.5. (You can either add the argument type="class", or generate probabilities and use a threshold of 0.5 like in logistic regression.)

#Accuracy using threshold of 0.5 (2 ways to do same thing):

#1st method: using the class predictions from the CART model:

#Make predictions
PredictCART = predict(censustree, newdata = test, type = "class") #We need to give type = "class" if we want the majority class predictions. This is like using a threshold of 0.5.

#Now let's assess the accuracy of the model through the confusion matrix
cmat_CART<-table(test$over50k, PredictCART)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_CART
##         PredictCART
##           <=50K  >50K
##    <=50K   9243   470
##    >50K    1482  1596
#let's now compute the overall accuracy
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
accu_CART  #(9243+1596)/(9243+470+1482+1596) = 0.8473927 #sum the diagonal entries and divide by the sum of all of the terms
## [1] 0.8473927
#or 
sum(diag(cmat_CART)) /nrow(test)
## [1] 0.8473927
#Ans:0.8473927

#2nd method: using the probabilities with a 0.5 threshold
PredictCART <- predict(censustree, newdata=test)[,2]
cmat_CART<- table(test$over50k, PredictCART > 0.5)
sum(diag(cmat_CART)) /nrow(test)
## [1] 0.8473927
#Ans:0.8473927

##############################################

#PROBLEM 2.5 - A CART MODEL 

#Let us now consider the ROC curve and AUC for the CART model on the test set. You will need to get predicted probabilities for the observations in the test set to build the ROC curve and compute the AUC. Remember that you can do this by removing the type="class" argument when making predictions, and taking the second column of the resulting object.

#Let's evaluate our model using the ROC curve
library(ROCR)

#Generate the predictions for the tree. Note that unlike the previous question, when we call the predict function, we leave out the argument type = "class" from the function call. Without this extra part, we will get the raw probabilities of the dependent variable values for each observation, which we need in order to generate the AUC. We need to take the second column of the output:
PredictROC = predict(censustree, newdata = test)
head(PredictROC)
##        <=50K       >50K
## 2  0.2794982 0.72050176
## 5  0.2794982 0.72050176
## 7  0.9490143 0.05098572
## 8  0.6972807 0.30271934
## 11 0.6972807 0.30271934
## 12 0.2794982 0.72050176
#First we use the prediction() function with first argument the second column of PredictROC, and second argument the true outcome values, test$over50k.
#We pass the output of prediction() to performance() to which we give also two arguments for what we want on the X and Y axes of our ROC curve, true positive rate and false positive rate.
pred = prediction(PredictROC[,2], test$over50k)
perf = performance(pred, "tpr", "fpr")

#and the plot
plot(perf)

#Compute the AUC of the CART model
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8470256
#Plot the ROC curve for the CART model you have estimated. Observe that compared to the logistic regression ROC curve, the CART ROC curve is less smooth than the logistic regression ROC curve. Which of the following explanations for this behavior is most correct? (HINT: Think about what the ROC curve is plotting and what changing the threshold does.)
#Ans: The probabilities from the CART model take only a handful of values (five, one for each end bucket/leaf of the tree); the changes in the ROC curve correspond to setting the threshold to one of those values.
#EXPLANATION:Choice 1 is on the right track, but is incorrect, because the number of variables that you use in a model does not determine how the ROC curve looks. In particular, try fitting logistic regression with hoursperweek as the only variable; you will see that the ROC curve is very smooth.
#Choice 2 is not correct. The smoothness of the ROC curve will generally depend on the number of data points, but in the case of the particular CART model we have estimated, varying the amount of testing set data will not change the qualitative behavior of the ROC curve.
#Choice 3 is the correct answer. The breakpoints of the curve correspond to the false and true positive rates when the threshold is set to the five possible probability values.
#Choice 4 is also not correct. In logistic regression, the continuity of an independent variable means that you will have a large range of predicted class probabilities in your test set data; this, in turn, means that you will see a large range of true and false positive rates as you change the threshold for generating predictions. In CART, the continuity of the variables does not at all affect the continuity of the predicted class probabilities; for our CART tree, there are only five possible probability values.
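#We can verify this claim directly (a quick sketch): the CART predictions take only a handful of distinct values, one per leaf of the tree.
sort(unique(PredictROC[,2])) #should return just five distinct probabilities (output not shown)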

################################

#PROBLEM 2.6 - A CART MODEL  

#What is the AUC of the CART model on the test set?

#Compute the AUC of the CART model
as.numeric(performance(pred, "auc")@y.values)
## [1] 0.8470256
#Ans:0.8470256

###############################################

#PROBLEM 3.1 - A RANDOM FOREST MODEL

#Before building a random forest model, we'll down-sample our training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory when trying to train the model, since random forests are much more computationally intensive than CART or logistic regression. For this reason, before continuing we will define a new training set to be used when building our random forest model, containing 2000 randomly selected observations from the original training set. Do this by running the following commands in your R console (assuming your training set is called "train"):

set.seed(1)
trainSmall = train[sample(nrow(train), 2000), ]

#Let us now build a random forest model to predict "over50k", using the dataset "trainSmall" as the data used to build the model. Set the seed to 1 again right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest "can not handle categorical predictors with more than 32 categories", re-build the model without the nativecountry variable as one of the independent variables.)

#Build random forest model
library(randomForest)

set.seed(1)
CensusForest<-randomForest(over50k ~ ., data=trainSmall)

#Then, make predictions using this model on the entire test set. What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that you don't need a "type" argument when making predictions with a random forest model if you want to use a threshold of 0.5. Also, note that your accuracy might be different from the one reported here, since random forest models can still differ depending on your operating system, even when the random seed is set. )

#Make predictions
predictTest= predict(CensusForest, newdata = test)

#Now let's assess the accuracy of the model through the confusion matrix
cmat_forest<-table(test$over50k, predictTest)  #first arg is the true outcomes and the second is the predicted outcomes
cmat_forest
##         predictTest
##           <=50K  >50K
##    <=50K   9586   127
##    >50K    1985  1093
#let's now compute the overall accuracy
accu_forest <- (cmat_forest[1,1] + cmat_forest[2,2])/sum(cmat_forest)
accu_forest  #or (9586+1093)/nrow(test) = 0.8348839
## [1] 0.8348839
#or
sum(diag(cmat_forest)) / nrow(test)
## [1] 0.8348839
#Ans:0.8348839

#####################################

#PROBLEM 3.2 - A RANDOM FOREST MODEL  

#As we discussed in lecture, random forest models work by building a large collection of trees. As a result, we lose some of the interpretability that comes with CART in terms of seeing how predictions are made and which variables are important. However, we can still compute metrics that give us insight into which variables are important.

#One metric that we can look at is the number of times, aggregated over all of the trees in the random forest model, that a certain variable is selected for a split. To view this metric, run the following lines of R code (replace "MODEL" with the name of your random forest model):

#Find out the number of times, aggregated over all of the trees in random forest model, that a certain variable is selected for a split:

vu = varUsed(CensusForest, count=TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, names(CensusForest$forest$xlevels[vusorted$ix]))

#This code produces a chart that for each variable measures the number of times that variable was selected for splitting (the value on the x-axis). Which of the following variables is the most important in terms of the number of splits?
#Ans:age
#EXPLANATION:If you run the three lines of R code given in this problem, you can see that age is used significantly more than the other variables.

###################################

#PROBLEM 3.3 - A RANDOM FOREST MODEL  

#A different metric we can look at is related to "impurity", which measures how homogenous each bucket or leaf of the tree is. In each tree in the forest, whenever we select a variable and perform a split, the impurity is decreased. Therefore, one way to measure the importance of a variable is to average the reduction in impurity, taken over all the times that variable is selected for splitting in all of the trees in the forest. To compute this metric, run the following command in R (replace "MODEL" with the name of your random forest model):

#Impurity measures how homogeneous each bucket or leaf of the tree is.
varImpPlot(CensusForest)

#Which one of the following variables is the most important in terms of mean reduction in impurity?
#Ans:occupation
#EXPLANATION:If you generate the plot with the command varImpPlot(CensusForest), you can see that occupation gives a larger reduction in impurity than the other variables.Notice that the importance as measured by the average reduction in impurity is in general different from the importance as measured by the number of times the variable is selected for splitting. Although age and occupation are important variables in both metrics, the order of the variables is not the same in the two plots.
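#The numbers behind the plot can also be inspected directly:
importance(CensusForest) #mean decrease in Gini impurity for each variable (output not shown)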


############################################

#PROBLEM 4.1 - SELECTING CP BY CROSS-VALIDATION  

#We now conclude our study of this data set by looking at how CART behaves with different choices of its parameters.

#Let us select the cp parameter for our CART model using k-fold cross validation, with k = 10 folds. Do this by using the train function. Set the seed beforehand to 2. Test cp values from 0.002 to 0.1 in 0.002 increments, by using the following command:

cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))

#Selecting the cp parameter of our CART model using k-fold cross-validation, with k=10 folds and cp values from 0.002 to 0.1 in 0.002 increments.

library(caret)
library(e1071)

#Set the seed to 2:
set.seed(2)

#Define cross-validation experiment

#specify that we are going to use k-fold cross validation with 10 folds:
numFolds = trainControl( method = "cv", number = 10 )

#Specify the grid of cp values that we wish to evaluate:
cartGrid = expand.grid( .cp = seq(0.002,0.1,0.002))
#This will define our cp parameters to test as numbers from 0.002 to 0.1, in increments of 0.002.

#Perform the cross validation by running the train function and view the result:
save_CV<-train(over50k ~ ., data=train, method="rpart", trControl=numFolds, tuneGrid=cartGrid)
save_CV
## CART 
## 
## 19187 samples
##    12 predictor
##     2 classes: ' <=50K', ' >50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17268, 17269, 17268, 17268, 17269, 17269, ... 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa     
##   0.002  0.8515658  0.55758898
##   0.004  0.8476566  0.55353089
##   0.006  0.8444252  0.53388285
##   0.008  0.8441646  0.53439869
##   0.010  0.8443210  0.53574983
##   0.012  0.8443210  0.53574983
##   0.014  0.8443210  0.53574983
##   0.016  0.8434351  0.53116984
##   0.018  0.8405163  0.51397907
##   0.020  0.8395783  0.50605490
##   0.022  0.8391090  0.50179921
##   0.024  0.8391090  0.50179921
##   0.026  0.8391090  0.50179921
##   0.028  0.8391090  0.50179921
##   0.030  0.8391090  0.50179921
##   0.032  0.8391090  0.50179921
##   0.034  0.8370243  0.48876941
##   0.036  0.8321776  0.46362978
##   0.038  0.8265491  0.43857371
##   0.040  0.8246203  0.43016472
##   0.042  0.8246203  0.43016472
##   0.044  0.8246203  0.43016472
##   0.046  0.8220148  0.41296653
##   0.048  0.8220148  0.41296653
##   0.050  0.8220148  0.41296653
##   0.052  0.8153954  0.35631451
##   0.054  0.8128414  0.32562676
##   0.056  0.8119034  0.30789310
##   0.058  0.8119034  0.30789310
##   0.060  0.8119034  0.30789310
##   0.062  0.8119034  0.30789310
##   0.064  0.8095585  0.29541138
##   0.066  0.8095585  0.29541138
##   0.068  0.8009584  0.24623306
##   0.070  0.7958509  0.21466597
##   0.072  0.7958509  0.21466597
##   0.074  0.7958509  0.21466597
##   0.076  0.7725557  0.07856024
##   0.078  0.7593684  0.00000000
##   0.080  0.7593684  0.00000000
##   0.082  0.7593684  0.00000000
##   0.084  0.7593684  0.00000000
##   0.086  0.7593684  0.00000000
##   0.088  0.7593684  0.00000000
##   0.090  0.7593684  0.00000000
##   0.092  0.7593684  0.00000000
##   0.094  0.7593684  0.00000000
##   0.096  0.7593684  0.00000000
##   0.098  0.7593684  0.00000000
##   0.100  0.7593684  0.00000000
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.002.
#Also, remember to use the entire training set "train" when building this model. The train function might take some time to run.

plot(save_CV) #or

plot(save_CV$results$cp, save_CV$results$Accuracy, type="l", xlab="cp", ylab="accuracy")

#Which value of cp does the train function recommend?
save_CV$bestTune
##      cp
## 1 0.002
#Ans:0.002
#EXPLANATION:In other words, the best value was cp = 0.002, corresponding to the lowest cp value we tested. If we look more closely at the accuracy at different cp values, we can see that it decreases steadily as the cp value increases: larger cp values produce simpler trees, and here the simpler trees are steadily less accurate.
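#The cross-validated accuracy at the chosen cp can be read off the results table above:
max(save_CV$results$Accuracy)
## [1] 0.8515658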

########################################

#PROBLEM 4.2 - SELECTING CP BY CROSS-VALIDATION 

#Fit a CART model to the training data using this value of cp. What is the prediction accuracy on the test set?

#Create a new CART model i.e. fit a CART model to the training data using this value of cp:
CensusCV <- rpart(over50k ~ ., data=train, method="class", cp=save_CV$bestTune) # using the cp value (save_CV$bestTune = 0.002) obtained from cross-validation above

#Make predictions (Out-of-Sample predictions of the Cross Validated CART model)
PredictCV = predict(CensusCV, newdata = test, type = "class")
cmat_CART_CV<-table(test$over50k, PredictCV) #confusion matrix
cmat_CART_CV
##         PredictCV
##           <=50K  >50K
##    <=50K   9178   535
##    >50K    1240  1838
#let's now compute the overall accuracy
accu_CART_CV <- (cmat_CART_CV[1,1] + cmat_CART_CV[2,2])/sum(cmat_CART_CV)
accu_CART_CV  #(9178+1838)/(9178+535+1240+1838) = 0.8612306
## [1] 0.8612306
#or
sum(diag(cmat_CART_CV)) / nrow(test)
## [1] 0.8612306
#Ans:0.8612306

#######################################

#PROBLEM 4.3 - SELECTING CP BY CROSS-VALIDATION  

#Compared to the original accuracy using the default value of cp, this new CART model is an improvement, and so we should clearly favor this new model over the old one -- or should we? Plot the CART tree for this model. How many splits are there?

#What does this decision tree look like?
prp(CensusCV)

#Ans:18

#This highlights one important tradeoff in building predictive models. By tuning cp, we improved our accuracy by over 1%, but our tree became significantly more complicated. In some applications, such an improvement in accuracy would be worth the loss in interpretability. In others, we may prefer a less accurate model that is simpler to understand and describe over a more accurate -- but more complicated -- model.
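#Putting the two CART test-set accuracies computed above side by side:
c(default_cp = 0.8473927, cv_tuned_cp = 0.8612306)
##  default_cp cv_tuned_cp 
##   0.8473927   0.8612306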

STATE DATA REVISITED (OPTIONAL)

We will be revisiting the “state” dataset from one of the optional problems in Unit 2. This dataset has, for each of the fifty U.S. states, the population, per capita income, illiteracy rate, murder rate, high school graduation rate, average number of frost days, area, latitude and longitude, division the state belongs to, region the state belongs to, and two-letter abbreviation. This dataset comes from the U.S. Department of Commerce, Bureau of the Census.

Load the dataset into R and convert it to a data frame by running the following two commands in R:

data(state)
statedata = data.frame(state.x77)

If you can’t access the state dataset in R, here is a CSV file with the same data that you can load into R using the read.csv function: statedataSimple.csv. Be sure to call the output of the read.csv function “statedata”.

After you have loaded the data into R, inspect the data set using the command: str(statedata)

This dataset has 50 observations (one for each US state) and the following 8 variables:

  • Population - the population estimate of the state in 1975
  • Income - per capita income in 1974
  • Illiteracy - illiteracy rates in 1970, as a percent of the population
  • Life.Exp - the life expectancy in years of residents of the state in 1970
  • Murder - the murder and non-negligent manslaughter rate per 100,000 population in 1976
  • HS.Grad - percent of high-school graduates in 1970
  • Frost - the mean number of days with minimum temperature below freezing from 1931-1960 in the capital or a large city of the state
  • Area - the land area (in square miles) of the state

We will try to build a model for life expectancy using regression trees, and employ cross-validation to improve our tree’s performance.

str(statedata)
## 'data.frame':    50 obs. of  8 variables:
##  $ Population: num  3615 365 2212 2110 21198 ...
##  $ Income    : num  3624 6315 4530 3378 5114 ...
##  $ Illiteracy: num  2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
##  $ Life.Exp  : num  69 69.3 70.5 70.7 71.7 ...
##  $ Murder    : num  15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
##  $ HS.Grad   : num  41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
##  $ Frost     : num  20 152 15 65 20 166 139 103 11 60 ...
##  $ Area      : num  50708 566432 113417 51945 156361 ...
#PROBLEM 1.1 - LINEAR REGRESSION MODELS 

#Let's recreate the linear regression models we made in the previous homework question. First, predict Life.Exp using all of the other variables as the independent variables (Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area). Use the entire dataset to build the model.

RegModel<-lm(Life.Exp~.,data=statedata)
summary(RegModel)
## 
## Call:
## lm(formula = Life.Exp ~ ., data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
## Population   5.180e-05  2.919e-05   1.775   0.0832 .  
## Income      -2.180e-05  2.444e-04  -0.089   0.9293    
## Illiteracy   3.382e-02  3.663e-01   0.092   0.9269    
## Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *  
## Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .  
## Area        -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10
#What is the adjusted R-squared of the model?
summary(RegModel)$adj.r.squared
## [1] 0.6921823
#Ans:0.6921823

###########################################

#PROBLEM 1.2 - LINEAR REGRESSION MODELS 

#Calculate the sum of squared errors (SSE) between the predicted life expectancies using this model and the actual life expectancies:

#Compute sample SSE
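#(SSE = sum of (predicted - actual)^2 across all 50 states; since a model's residuals are actual minus predicted, the two computations below are equivalent.)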

Predictions<-predict(RegModel)
SSE = sum((Predictions- statedata$Life.Exp)^2)
SSE
## [1] 23.29714
#or
sum(RegModel$residuals^2)
## [1] 23.29714
#Ans:23.29714

###############################################

#PROBLEM 1.3 - LINEAR REGRESSION MODELS 

#Build a second linear regression model using just Population, Murder, Frost, and HS.Grad as independent variables (the best 4 variable model from the previous homework). What is the adjusted R-squared for this model?

RegModel1<-lm(Life.Exp~Population + Murder + Frost + HS.Grad, data=statedata)
summary(RegModel1)
## 
## Call:
## lm(formula = Life.Exp ~ Population + Murder + Frost + HS.Grad, 
##     data = statedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12
summary(RegModel1)$adj.r.squared
## [1] 0.712569
#Ans:0.712569

#PROBLEM 1.4 - LINEAR REGRESSION MODELS  

#Calculate the sum of squared errors again, using this reduced model:

#Compute sample SSE
Predictions<-predict(RegModel1)
SSE = sum((Predictions- statedata$Life.Exp)^2)
SSE
## [1] 23.30804
#or
sum(RegModel1$residuals^2)
## [1] 23.30804
#Ans:23.30804
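#Note that this is nearly identical to the full model's SSE of 23.29714, despite using three fewer variables.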

################################################

#PROBLEM 1.5 - LINEAR REGRESSION MODELS 

#Which of the following is correct?
#Ans:Trying different combinations of variables in linear regression is like trying different numbers of splits in a tree - this controls the complexity of the model. 
#EXPLANATION:The correct answer is the first one. Trying different combinations of variables in linear regression controls the complexity of the model. This is similar to trying different numbers of splits in a tree, which is also controlling the complexity of the model.
#The second answer is incorrect because as we see here, a model with fewer variables actually has a higher adjusted R-squared. If your accuracy is just as good, a model with fewer variables is almost always better.
#The third answer is incorrect because the variables we removed have non-zero correlations with the dependent variable Life.Exp. Illiteracy and Area are negatively correlated with Life.Exp, with correlations of -0.59 and -0.11. Income is positively correlated with Life.Exp, with a correlation of 0.34. These correlations can be computed by typing the following into your R console:
cor(statedata$Life.Exp, statedata$Income)
## [1] 0.3402553
cor(statedata$Life.Exp, statedata$Illiteracy)
## [1] -0.5884779
cor(statedata$Life.Exp, statedata$Area)
## [1] -0.1073319
##################################################

#PROBLEM 2.1 - CART MODELS  

#Let's now build a CART model to predict Life.Exp using all of the other variables as independent variables (Population, Income, Illiteracy, Murder, HS.Grad, Frost, Area). We'll use the default minbucket parameter, so don't add the minbucket argument. Remember that in this problem we are not as interested in predicting life expectancies for new observations as we are in understanding how life expectancy relates to the other variables we have, so we'll use all of the data to build our model. You shouldn't use the method="class" argument since this is a regression tree.

#Leave all the parameters at their default values. You can use the following command in R to build the tree:

# Load CART packages
library(rpart)
library(rpart.plot)

CARTmodel= rpart(Life.Exp~.,data=statedata)
prp(CARTmodel)

#Plot the tree. Which of these variables appear in the tree? Select all that apply.
#Ans:"Murder"
#The only variable used in the tree is "Murder".

#######################################

#PROBLEM 2.2 - CART MODELS  

#Use the regression tree you just built to predict life expectancies (using the predict function), and calculate the sum-of-squared-errors (SSE) like you did for linear regression. What is the SSE?

# Make predictions
PredictionsCART= predict(CARTmodel)
tree.sse = sum((PredictionsCART - statedata$Life.Exp)^2)
tree.sse
## [1] 28.99848
#Ans:28.99848
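#(A regression tree predicts the mean of Life.Exp within each leaf, and this tree splits only on Murder, so it fits the training data less flexibly than the linear models above.)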

#################################

#PROBLEM 2.3 - CART MODELS  

#The error is higher than for the linear regression models. One reason might be that we haven't made the tree big enough. Set the minbucket parameter to 5, and recreate the tree.

CARTmodel2= rpart(Life.Exp~.,data=statedata,minbucket=5)
prp(CARTmodel2)

#Which variables appear in this new tree? Select all that apply.
#Ans:Murder, HS.Grad, and Area 

##########################################

#PROBLEM 2.4 - CART MODELS  

#Do you think the default minbucket parameter is smaller or larger than 5 based on the tree that was built?
#Ans:Larger
#EXPLANATION:Since the tree now has more splits, it must be true that the default minbucket parameter was limiting the tree from splitting more before. So the default minbucket parameter must be larger than 5.
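#For reference, rpart's documented defaults are minsplit = 20 and minbucket = round(minsplit/3), i.e. about 7, which is indeed larger than 5. A quick sketch to check the defaults from the function signature:
args(rpart.control)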

#############################################

#PROBLEM 2.5 - CART MODELS  

#What is the SSE of this tree?

PredictionsCART2 = predict(CARTmodel2)
tree.sse = sum((PredictionsCART2 - statedata$Life.Exp)^2)
tree.sse
## [1] 23.64283
#Ans:23.64283

########################################

#PROBLEM 2.6 - CART MODELS  

#Can we do even better? Create a tree that predicts Life.Exp using only Area, with the minbucket parameter set to 1. What is the SSE of this newest tree?

CARTmodel3 = rpart(Life.Exp ~ Area, data=statedata, minbucket=1)

#Make predictions
PredictionsCART3 = predict(CARTmodel3)

#Compute SSE
tree.sse = sum((statedata$Life.Exp - PredictionsCART3)^2)
tree.sse
## [1] 9.312442
#Ans:9.312442

######################################

#PROBLEM 2.7 - CART MODELS  

#This is the lowest error we have seen so far. What would be the best interpretation of this result?
#Ans:We can build almost perfect models given the right parameters, even if they violate our intuition of what a good model should be.
#EXPLANATION:The correct answer is the second one. By making the minbucket parameter very small, we could build an almost perfect model using just one variable, one that is not even our most significant variable. However, if you plot the tree using prp(CARTmodel3), you can see that the tree has 22 splits! This is not a very interpretable model, and will not generalize well.
#The first answer is incorrect because our tree model that was not overfit performed similarly to the linear regression model. Trees only look better than linear regression here because we are overfitting the model to the data.
#The third answer is incorrect because Area is not actually a very meaningful predictor. Without overfitting the tree, our model would not be very accurate only using Area.
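#A sketch to view the overfit tree mentioned above and count its splits (internal nodes) directly from the rpart frame:
prp(CARTmodel3)
sum(CARTmodel3$frame$var != "<leaf>")  #number of splits; 22 here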

#############################################

#PROBLEM 3.1 - CROSS-VALIDATION 

#Adjusting the variables included in a linear regression model is a form of model tuning. In Problem 1 we showed that by removing variables from our linear regression model (tuning the model), we were able to maintain the fit of the model while using a simpler model. A rule of thumb is that simpler models are more interpretable and generalizable. We will now tune our regression tree to see if we can improve the fit of our tree while keeping it as simple as possible.

#Load the caret library, and set the seed to 111. Set up the controls exactly like we did in the lecture (10-fold cross-validation) with cp varying over the range 0.01 to 0.50 in increments of 0.01. Use the train function to determine the best cp value for a CART model using all of the available independent variables, and the entire dataset statedata. What value of cp does the train function recommend? (Remember that the train function tells you to pick the largest value of cp with the lowest error when there are ties, and explains this at the bottom of the output.)

library(caret)
library(e1071)

set.seed(111)

#set up the cross-validation controls by typing:
fitControl = trainControl(method = "cv", number = 10)
cartGrid = expand.grid(.cp = seq(0.01, 0.5, 0.01) )

#then use the train function to find the best value of cp by typing:
save_CV<-train(Life.Exp ~ ., data=statedata, method="rpart", trControl = fitControl, tuneGrid = cartGrid)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
save_CV
## CART 
## 
## 50 samples
##  7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 45, 45, 46, 45, 45, 43, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared 
##   0.01  0.9710867  0.5079610
##   0.02  0.9710867  0.5079610
##   0.03  0.9851044  0.4959796
##   0.04  0.9914151  0.4959796
##   0.05  0.9914151  0.4959796
##   0.06  0.9796163  0.5147632
##   0.07  0.9709231  0.5380553
##   0.08  0.9709231  0.5380553
##   0.09  0.9709231  0.5380553
##   0.10  0.9709231  0.5380553
##   0.11  0.9709231  0.5380553
##   0.12  0.9709231  0.5380553
##   0.13  1.0467593  0.4623293
##   0.14  1.0467593  0.4623293
##   0.15  1.0811534  0.4478183
##   0.16  1.1430940  0.3916691
##   0.17  1.1430940  0.3916691
##   0.18  1.1315421  0.3916691
##   0.19  1.1315421  0.3916691
##   0.20  1.1315421  0.3916691
##   0.21  1.1315421  0.3916691
##   0.22  1.1315421  0.3916691
##   0.23  1.1390467  0.3812782
##   0.24  1.1390467  0.3812782
##   0.25  1.1390467  0.3812782
##   0.26  1.1390467  0.3812782
##   0.27  1.1390467  0.3812782
##   0.28  1.1390467  0.3812782
##   0.29  1.1390467  0.3812782
##   0.30  1.1390467  0.3812782
##   0.31  1.1390467  0.3812782
##   0.32  1.1390467  0.3812782
##   0.33  1.1390467  0.3812782
##   0.34  1.1390467  0.3812782
##   0.35  1.1390467  0.3812782
##   0.36  1.1390467  0.3812782
##   0.37  1.1390467  0.3812782
##   0.38  1.1390467  0.3812782
##   0.39  1.1390467  0.3812782
##   0.40  1.1390467  0.3812782
##   0.41  1.1390467  0.3812782
##   0.42  1.1390467  0.3812782
##   0.43  1.1390467  0.3812782
##   0.44  1.1390467  0.3812782
##   0.45  1.1390467  0.3812782
##   0.46  1.1390467  0.3812782
##   0.47  1.1805471  0.3429189
##   0.48  1.2657748  0.2689183
##   0.49  1.3047621  0.2004959
##   0.50  1.3336338  0.1187613
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.12.
plot(save_CV)

#The train function tells you to pick the largest value of cp with the lowest error when there are ties.
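#A sketch for extracting the chosen value programmatically (caret stores it in the train object's bestTune element):
save_CV$bestTune  #cp = 0.12, matching the output above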

#####################################


#PROBLEM 3.2 - CROSS-VALIDATION  

#Create a tree with the value of cp you found in the previous problem, all of the available independent variables, and the entire dataset "statedata" as the training data. Then plot the tree. You'll notice that this is actually quite similar to the first tree we created with the initial model. Interpret the tree: we predict the life expectancy to be 70 if the murder rate is greater than or equal to
CARTmodelCV = rpart(Life.Exp ~ ., data=statedata, cp=0.12)
prp(CARTmodelCV)

#Ans:6.6

#and is less than
#Ans:11
#You can see that the life expectancy is predicted to be 70 if Murder is greater than or equal to 6.6 (the first split) and less than 11 (the second split).

##############################

#PROBLEM 3.3 - CROSS-VALIDATION  

#Calculate the SSE of this tree:

#Before calculating SSE, make predictions
PredictionsCART4 = predict(CARTmodelCV)

#Compute SSE
tree.sse = sum((statedata$Life.Exp - PredictionsCART4)^2)
tree.sse
## [1] 32.86549
#Ans:32.86549

###########################################

#PROBLEM 3.4 - CROSS-VALIDATION  

#Recall the first tree (default parameters), the second tree (minbucket = 5), and the third tree (selected with cross-validation). Given what you have learned about cross-validation, which of the three models would you expect to perform best if we used it for prediction on a test set? For this question, suppose we had actually set aside a few observations (states) in a test set, and we want to make predictions on those states.
#Ans:The model we just made with the "best" cp
#EXPLANATION:The purpose of cross-validation is to pick the tree that will perform the best on a test set. So we would expect the model we made with the "best" cp to perform best on a test set.

##############################################

#PROBLEM 3.5 - CROSS-VALIDATION  

#At the end of Problem 2 we made a very complex tree using just Area. Use train with the same parameters as before but just using Area as an independent variable to find the best cp value (set the seed to 111 first). Then build a new tree using just Area and this value of cp.

set.seed(111)
save_CV1<-train(Life.Exp ~ Area, data=statedata, method="rpart", trControl = fitControl, tuneGrid = cartGrid )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
save_CV1
## CART 
## 
## 50 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 45, 45, 46, 45, 45, 43, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE      Rsquared 
##   0.01  1.205723  0.5389506
##   0.02  1.205723  0.5389506
##   0.03  1.219739  0.5055887
##   0.04  1.219739  0.5055887
##   0.05  1.219739  0.5055887
##   0.06  1.232473  0.5045906
##   0.07  1.281022  0.5412386
##   0.08  1.298106  0.5162642
##   0.09  1.313189  0.5006905
##   0.10  1.313189  0.5006905
##   0.11  1.313189  0.5006905
##   0.12  1.301598  0.5006905
##   0.13  1.381009  0.4212663
##   0.14  1.381009  0.4212663
##   0.15  1.400114  0.4462950
##   0.16  1.440090  0.3878287
##   0.17  1.436418  0.4158177
##   0.18  1.436418  0.4158177
##   0.19  1.436418  0.4158177
##   0.20  1.436418  0.4158177
##   0.21  1.368197  0.4103372
##   0.22  1.368197  0.4103372
##   0.23  1.368197  0.4103372
##   0.24  1.368197  0.4103372
##   0.25  1.368197  0.4103372
##   0.26  1.368197  0.4103372
##   0.27  1.316606        NaN
##   0.28  1.316606        NaN
##   0.29  1.316606        NaN
##   0.30  1.316606        NaN
##   0.31  1.316606        NaN
##   0.32  1.316606        NaN
##   0.33  1.316606        NaN
##   0.34  1.316606        NaN
##   0.35  1.316606        NaN
##   0.36  1.316606        NaN
##   0.37  1.316606        NaN
##   0.38  1.316606        NaN
##   0.39  1.316606        NaN
##   0.40  1.316606        NaN
##   0.41  1.316606        NaN
##   0.42  1.316606        NaN
##   0.43  1.316606        NaN
##   0.44  1.316606        NaN
##   0.45  1.316606        NaN
##   0.46  1.316606        NaN
##   0.47  1.316606        NaN
##   0.48  1.316606        NaN
##   0.49  1.316606        NaN
##   0.50  1.316606        NaN
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was cp = 0.02.
plot(save_CV1)

#Then, build a new CART tree by typing:
CARTmodel5 = rpart(Life.Exp ~ Area, data=statedata, cp=0.02)
prp(CARTmodel5)

#How many splits does the tree have?
#Ans:4
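#To confirm the split count without reading the plot (same rpart-frame sketch as before):
sum(CARTmodel5$frame$var != "<leaf>")  #4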

###################################

#PROBLEM 3.6 - CROSS-VALIDATION  

#The lower left leaf (or bucket) corresponds to the lowest predicted Life.Exp of 70. Observations in this leaf correspond to states with area greater than or equal to
#Ans:9579

#and area less than
#Ans:51000

#EXPLANATION:To get to this leaf, we go through 3 splits: Area less than 62,000, Area greater than or equal to 9,579, and Area less than 51,000. This means that this leaf is composed of states with an area greater than or equal to 9,579 and less than 51,000 square miles.

########################################

#PROBLEM 3.7 - CROSS-VALIDATION  

#We have simplified the previous "Area tree" considerably by using cross-validation. Calculate the SSE of the cross-validated "Area tree", and select all of the following correct statements that apply:

#Before calculating SSE, make predictions
PredictionsCART5 = predict(CARTmodel5)

#Compute SSE
tree.sse = sum((statedata$Life.Exp - PredictionsCART5)^2)
tree.sse
## [1] 44.26817
#Ans: The Area variable is not as predictive as Murder rate
#EXPLANATION:The original Area tree was overfitting the data - it was uninterpretable. Area is not as useful as Murder - if it were, it would have appeared in the cross-validated tree. Cross-validation is not designed to improve the fit on the training data, but it won't necessarily make it worse either. Cross-validation cannot guarantee improving the SSE on unseen data, although it often helps.

#############################
        #########################################################