ANLY 530 - Lab 1

1. Tree-based Classification

Question 1: If you see an accuracy of 100%, what does it mean? Does this mean that we design a perfect model?
Answer 1: In this case, false negatives (Type II error) are 14, while false positives (Type I error) are 17. True positives are 8, and true negatives are 61; hence, the accuracy is 69% for the decision tree model on the test set. “Positive” means being creditable / not defaulted / not declined. “Negative” means being not creditable / defaulted / declined.

High accuracy could imply overfitting. If an accuracy of 100% occurs on the test set, it means some predictor / independent / explanatory variables are highly multicollinear to the target / dependent / response variable in the model, e.g., one of the predictors is the exact same as the response. Hence, such a model is “perfect” but meaningless.

It is worth noting that several RPubs from Dhull (2018), Pierre (2018), and et al., made a mistake in the decision tree model on the News Popularity case: the dataset should have 17 variables; however, c5.0 output indicates 18 attributes due to the clumsy code “newsShort$shares = newsShort$popular”, resulting in the 100% accuracy rate.

Step 1: Collecting the data

credit <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/credit.csv")
##MORE DESCRIPTIVE
#options(scipen=100,digits=2);pastecs::stat.desc(credit) %>% kable() %>% kable_styling()
str(credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

credit$Creditability <- as.factor(credit$Creditability)

Step 2: Exploring the data

summary(credit$Credit.Amount) #German bank, monetary unit is Deutsche Mark (DM)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

table(credit$Creditability) #creditable: yes(0), no(1); no creditable = default

## 
##   0   1 
## 300 700

set.seed(12345)
creditR <- credit[order(runif(1000)),]
creditTraining <- creditR[1:900,]
creditTest <- creditR[901:1000,]

##ALTERNATIVE
#training <- sample(1000, 900)
#creditTraining <- credit[training,]
#creditTest <- credit[-training,]

prop.table(table(creditTraining$Creditability))

## 
##         0         1 
## 0.3088889 0.6911111

prop.table(table(creditTest$Creditability))

## 
##    0    1 
## 0.22 0.78

Step 3: Training a model on the data

creditModelDT <- C5.0(x=creditTraining[,!(colnames(creditTraining) %in% c("Creditability"))], y=creditTraining[,"Creditability"])
##OR
#creditModel <- C5.0(x=creditTraining[-1], y=creditTraining$Creditability)
summary(creditModelDT)

## 
## Call:
## C5.0.default(x = creditTraining[, !(colnames(creditTraining)
##  %in% c("Creditability"))], y = creditTraining[, "Creditability"])
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Dec 26 00:54:21 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## Account.Balance > 2:
## :...Concurrent.Credits > 2:
## :   :...Age..years. > 33: 1 (179/11)
## :   :   Age..years. <= 33:
## :   :   :...Credit.Amount > 6681:
## :   :       :...Length.of.current.employment <= 2: 0 (4)
## :   :       :   Length.of.current.employment > 2:
## :   :       :   :...Payment.Status.of.Previous.Credit <= 3: 1 (4)
## :   :       :       Payment.Status.of.Previous.Credit > 3: 0 (3/1)
## :   :       Credit.Amount <= 6681:
## :   :       :...Occupation > 2:
## :   :           :...Occupation <= 3: 1 (120/12)
## :   :           :   Occupation > 3:
## :   :           :   :...Duration.of.Credit..month. <= 33: 1 (9)
## :   :           :       Duration.of.Credit..month. > 33: 0 (3)
## :   :           Occupation <= 2:
## :   :           :...No.of.Credits.at.this.Bank > 1: 1 (6)
## :   :               No.of.Credits.at.this.Bank <= 1:
## :   :               :...Most.valuable.available.asset > 1: 0 (3)
## :   :                   Most.valuable.available.asset <= 1:
## :   :                   :...Credit.Amount <= 1987: 1 (8/1)
## :   :                       Credit.Amount > 1987: 0 (2)
## :   Concurrent.Credits <= 2:
## :   :...Guarantors > 1: 1 (4)
## :       Guarantors <= 1:
## :       :...Purpose <= 0:
## :           :...Most.valuable.available.asset <= 2: 0 (5)
## :           :   Most.valuable.available.asset > 2:
## :           :   :...No.of.dependents <= 1: 1 (7/1)
## :           :       No.of.dependents > 1: 0 (2)
## :           Purpose > 0:
## :           :...Purpose <= 4: 1 (35/2)
## :               Purpose > 4:
## :               :...Length.of.current.employment <= 2: 0 (4)
## :                   Length.of.current.employment > 2:
## :                   :...No.of.dependents > 1: 0 (3/1)
## :                       No.of.dependents <= 1:
## :                       :...Length.of.current.employment > 3: 1 (4)
## :                           Length.of.current.employment <= 3:
## :                           :...Instalment.per.cent <= 2: 1 (2)
## :                               Instalment.per.cent > 2: 0 (2)
## Account.Balance <= 2:
## :...Payment.Status.of.Previous.Credit <= 1:
##     :...Value.Savings.Stocks <= 2: 0 (49/10)
##     :   Value.Savings.Stocks > 2:
##     :   :...Credit.Amount <= 2064: 0 (3)
##     :       Credit.Amount > 2064: 1 (9/1)
##     Payment.Status.of.Previous.Credit > 1:
##     :...Credit.Amount > 7980:
##         :...Value.Savings.Stocks > 4:
##         :   :...Payment.Status.of.Previous.Credit <= 2: 0 (4/1)
##         :   :   Payment.Status.of.Previous.Credit > 2: 1 (3)
##         :   Value.Savings.Stocks <= 4:
##         :   :...Account.Balance > 1: 0 (15)
##         :       Account.Balance <= 1:
##         :       :...Concurrent.Credits <= 2: 0 (2)
##         :           Concurrent.Credits > 2:
##         :           :...Credit.Amount <= 10297: 0 (6)
##         :               Credit.Amount > 10297: 1 (3)
##         Credit.Amount <= 7980:
##         :...Duration.of.Credit..month. <= 11:
##             :...Occupation > 3:
##             :   :...Concurrent.Credits <= 2: 1 (3)
##             :   :   Concurrent.Credits > 2:
##             :   :   :...Payment.Status.of.Previous.Credit <= 2: 1 (4/1)
##             :   :       Payment.Status.of.Previous.Credit > 2: 0 (3)
##             :   Occupation <= 3:
##             :   :...Age..years. > 32: 1 (34)
##             :       Age..years. <= 32:
##             :       :...Most.valuable.available.asset <= 1: 1 (13/1)
##             :           Most.valuable.available.asset > 1:
##             :           :...Instalment.per.cent <= 3: 1 (6/1)
##             :               Instalment.per.cent > 3: 0 (6/1)
##             Duration.of.Credit..month. > 11:
##             :...Duration.of.Credit..month. > 36:
##                 :...Length.of.current.employment <= 1: 1 (3)
##                 :   Length.of.current.employment > 1:
##                 :   :...No.of.dependents > 1: 1 (5/1)
##                 :       No.of.dependents <= 1:
##                 :       :...Duration.in.Current.address <= 1: 1 (4/1)
##                 :           Duration.in.Current.address > 1: 0 (23)
##                 Duration.of.Credit..month. <= 36:
##                 :...Guarantors > 2:
##                     :...Foreign.Worker <= 1: 1 (23/1)
##                     :   Foreign.Worker > 1: 0 (2)
##                     Guarantors <= 2:
##                     :...Credit.Amount <= 1381:
##                         :...Telephone > 1:
##                         :   :...Sex...Marital.Status > 3: 0 (2)
##                         :   :   Sex...Marital.Status <= 3:
##                         :   :   :...Duration.of.Credit..month. <= 16: 1 (7)
##                         :   :       Duration.of.Credit..month. > 16: 0 (3/1)
##                         :   Telephone <= 1:
##                         :   :...Concurrent.Credits <= 2: 0 (9)
##                         :       Concurrent.Credits > 2:
##                         :       :...Account.Balance <= 1: 0 (29/6)
##                         :           Account.Balance > 1: [S1]
##                         Credit.Amount > 1381:
##                         :...Guarantors > 1:
##                             :...Foreign.Worker > 1: 1 (2)
##                             :   Foreign.Worker <= 1:
##                             :   :...Instalment.per.cent > 2: 0 (5)
##                             :       Instalment.per.cent <= 2: [S2]
##                             Guarantors <= 1:
##                             :...Payment.Status.of.Previous.Credit > 3:
##                                 :...Age..years. > 33: 1 (22)
##                                 :   Age..years. <= 33:
##                                 :   :...Purpose > 3: 1 (7)
##                                 :       Purpose <= 3: [S3]
##                                 Payment.Status.of.Previous.Credit <= 3:
##                                 :...Instalment.per.cent <= 2:
##                                     :...No.of.dependents > 1:
##                                     :   :...Purpose <= 0: 1 (2)
##                                     :   :   Purpose > 0: 0 (3)
##                                     :   No.of.dependents <= 1: [S4]
##                                     Instalment.per.cent > 2:
##                                     :...Concurrent.Credits <= 1: 1 (8/1)
##                                         Concurrent.Credits > 1:
##                                         :...Sex...Marital.Status <= 1: 0 (6/1)
##                                             Sex...Marital.Status > 1:
##                                             :...Account.Balance > 1: [S5]
##                                                 Account.Balance <= 1: [S6]
## 
## SubTree [S1]
## 
## Duration.in.Current.address > 3: 1 (8/1)
## Duration.in.Current.address <= 3:
## :...Purpose > 2: 0 (5)
##     Purpose <= 2:
##     :...Type.of.apartment <= 1: 0 (2)
##         Type.of.apartment > 1: 1 (5/1)
## 
## SubTree [S2]
## 
## Duration.in.Current.address <= 2: 1 (2)
## Duration.in.Current.address > 2: 0 (4/1)
## 
## SubTree [S3]
## 
## Duration.of.Credit..month. <= 16: 1 (4)
## Duration.of.Credit..month. > 16:
## :...Length.of.current.employment <= 3: 0 (8)
##     Length.of.current.employment > 3: 1 (6/1)
## 
## SubTree [S4]
## 
## Duration.in.Current.address > 1: 1 (41/6)
## Duration.in.Current.address <= 1:
## :...Value.Savings.Stocks > 3: 0 (2)
##     Value.Savings.Stocks <= 3:
##     :...Length.of.current.employment > 2: 1 (4)
##         Length.of.current.employment <= 2:
##         :...Instalment.per.cent <= 1: 0 (3)
##             Instalment.per.cent > 1: 1 (3/1)
## 
## SubTree [S5]
## 
## Sex...Marital.Status > 3: 0 (2)
## Sex...Marital.Status <= 3:
## :...Length.of.current.employment > 3: 1 (10)
##     Length.of.current.employment <= 3:
##     :...Duration.in.Current.address <= 1: 1 (5)
##         Duration.in.Current.address > 1:
##         :...Length.of.current.employment <= 2: 0 (4)
##             Length.of.current.employment > 2:
##             :...Value.Savings.Stocks <= 1: 0 (3)
##                 Value.Savings.Stocks > 1: 1 (5)
## 
## SubTree [S6]
## 
## Payment.Status.of.Previous.Credit > 2: 0 (3)
## Payment.Status.of.Previous.Credit <= 2:
## :...Purpose <= 0: 0 (7/1)
##     Purpose > 0:
##     :...Most.valuable.available.asset <= 1: 0 (5/1)
##         Most.valuable.available.asset > 1:
##         :...Sex...Marital.Status <= 2: 1 (6)
##             Sex...Marital.Status > 2:
##             :...Length.of.current.employment > 4: 0 (5)
##                 Length.of.current.employment <= 4:
##                 :...Telephone > 1: 1 (3)
##                     Telephone <= 1:
##                     :...Length.of.current.employment <= 2: 0 (2)
##                         Length.of.current.employment > 2:
##                         :...Age..years. <= 28: 1 (4)
##                             Age..years. > 28: 0 (2)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      85   70( 7.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     233    45    (a): class 0
##      25   597    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Account.Balance
##   67.11% Credit.Amount
##   63.11% Concurrent.Credits
##   55.33% Payment.Status.of.Previous.Credit
##   50.33% Age..years.
##   45.44% Duration.of.Credit..month.
##   40.11% Guarantors
##   24.44% Occupation
##   18.33% Instalment.per.cent
##   15.56% Purpose
##   14.22% Length.of.current.employment
##   13.67% Duration.in.Current.address
##   12.67% Value.Savings.Stocks
##   12.22% No.of.dependents
##    9.33% Sex...Marital.Status
##    9.00% Telephone
##    8.78% Most.valuable.available.asset
##    4.22% Foreign.Worker
##    2.11% No.of.Credits.at.this.Bank
##    0.78% Type.of.apartment
## 
## 
## Time: 0.0 secs

Step 4: Evaluating model performance

creditPredDT <- predict(creditModelDT, creditTest)
#table(creditTest$Creditability, creditPredDT)
CrossTable(creditTest$Creditability, creditPredDT, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability")) #a confusion matrix of binary classification, "negative" means being (1) not creditable / defaulted / declined

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                      | Predicted Creditability 
## Actual Creditability |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    0 |         8 |        14 |        22 | 
##                      |     0.080 |     0.140 |           | 
## ---------------------|-----------|-----------|-----------|
##                    1 |        17 |        61 |        78 | 
##                      |     0.170 |     0.610 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |        25 |        75 |       100 | 
## ---------------------|-----------|-----------|-----------|
## 
##

2. Random Forest

Question 2: What are the three most important features in this model?
Answer 2: Use the function importance() from the package “randomForest” to see the variable importance measures by mean decrease in accuracy in the random forest model, i.e., importance(creditModelRF) or creditModelRF$importance or caret::varImp(creditModelRF). The model, with the random seed 12345, suggest the three most important features as “Account Balance”, “Duration of Credit”, and “Payment Status of Previous Credit”. The model, with the random seed 23458, suggest the three most important features as “Account Balance”, “Duration of Credit”, and “Credit Amount”. Both random forest cases and, plus, the decision tree model value “Account Balance” the most.

Furthermore, “The scikit-learn Random Forest feature importance and R’s default Random Forest feature importance strategies are biased. To get reliable results in Python, use permutation importance, provided here and in our rfpimp package (via pip). For R, use importance=T in the Random Forest constructor then type=1 in R’s importance() function. In addition, your feature importance measures will only be reliable if your model is trained with suitable hyper-parameters” (Parr, Turgutlu, Csiszar, and Howard, 2018).

creditModelRF <- randomForest(Creditability~., data=creditTraining, importance=T)
summary(creditModelRF)

##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted        900   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           1800   matrix numeric  
## oob.times        900   -none- numeric  
## classes            2   -none- character
## importance        80   -none- numeric  
## importanceSD      60   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                900   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

creditPredRF <- predict(creditModelRF, creditTest)
#table(creditTest$Creditability, creditPredRF)
CrossTable(creditTest$Creditability, creditPredRF, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability")) #a confusion matrix of binary classification

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                      | Predicted Creditability 
## Actual Creditability |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    0 |        11 |        11 |        22 | 
##                      |     0.110 |     0.110 |           | 
## ---------------------|-----------|-----------|-----------|
##                    1 |         9 |        69 |        78 | 
##                      |     0.090 |     0.690 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |        20 |        80 |       100 | 
## ---------------------|-----------|-----------|-----------|
## 
##

importance(creditModelRF, type=1) #type1: MeanDecreaseAccuracy; type2: MeanDecreaseGini

##                                   MeanDecreaseAccuracy
## Account.Balance                             33.0987788
## Duration.of.Credit..month.                  20.7930720
## Payment.Status.of.Previous.Credit           16.6331475
## Purpose                                      5.2239655
## Credit.Amount                               16.2319279
## Value.Savings.Stocks                        10.9540498
## Length.of.current.employment                 7.6996932
## Instalment.per.cent                          7.0556056
## Sex...Marital.Status                        -0.2679574
## Guarantors                                  11.8663372
## Duration.in.Current.address                  3.9844930
## Most.valuable.available.asset                7.4861647
## Age..years.                                  5.7940226
## Concurrent.Credits                           4.8355012
## Type.of.apartment                            4.1302085
## No.of.Credits.at.this.Bank                   2.9144824
## Occupation                                   3.2507050
## No.of.dependents                             1.3437617
## Telephone                                    2.6267656
## Foreign.Worker                               1.9573448

Change the random seed to 23458

set.seed(23458)
creditR2 <- credit[order(runif(1000)),]
creditTraining2 <- creditR2[1:900,]
creditTest2 <- creditR2[901:1000,]

creditModelRF2 <- randomForest(Creditability~., data=creditTraining2, importance=T)
creditPredRF2 <- predict(creditModelRF2, creditTest2)
CrossTable(creditTest2$Creditability, creditPredRF2, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                      | Predicted Creditability 
## Actual Creditability |         0 |         1 | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    0 |        13 |        18 |        31 | 
##                      |     0.130 |     0.180 |           | 
## ---------------------|-----------|-----------|-----------|
##                    1 |         5 |        64 |        69 | 
##                      |     0.050 |     0.640 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |        18 |        82 |       100 | 
## ---------------------|-----------|-----------|-----------|
## 
##

importance(creditModelRF2, type=1) #type1: MeanDecreaseAccuracy; type2: MeanDecreaseGini

##                                   MeanDecreaseAccuracy
## Account.Balance                              33.925075
## Duration.of.Credit..month.                   21.341865
## Payment.Status.of.Previous.Credit            15.334914
## Purpose                                       6.111192
## Credit.Amount                                15.464864
## Value.Savings.Stocks                          8.748573
## Length.of.current.employment                  9.490680
## Instalment.per.cent                           3.404935
## Sex...Marital.Status                          3.613527
## Guarantors                                   10.154618
## Duration.in.Current.address                   3.482690
## Most.valuable.available.asset                 7.491665
## Age..years.                                   7.539748
## Concurrent.Credits                            6.332402
## Type.of.apartment                             4.586761
## No.of.Credits.at.this.Bank                    3.811475
## Occupation                                    3.928423
## No.of.dependents                              1.676712
## Telephone                                     1.353508
## Foreign.Worker                                2.165521

3. Adding Regression to Trees

Question 3: What is your interpretation about this amount of RMSE?
Answer 3: This question may not be valid because the wine’s quality is a factor variable.

The RMSE stands for “root mean squared error”, also known as “root mean squared deviation” (RMSD), and it is used to evaluate models by summarizing the differences between the actual / observed and predicted values. Plus, the RMSE can be defined as a gradient function in Machine Learning.

The model, assuming it is, has an RMSE of 0.7448, calculated by the function rmse() from the package “Metrics”. Such a value indicates that the model is fair. The white wine quality ranges from 3 to 9, hence it is interpreted that the model may overrate or underrate the quality by one level.

The RMSE is not meaningful for factors! The target / dependent / response variable is categorical type, more specifically, ordinal. ① It is incorrect to compute the Pearson’s correlation between its actual and predicted values. ② “No formal distributional assumptions, random forests are non-parametric and can thus handle skewed and multi-modal data as well as categorical data that are ordinal or non-ordinal” (Richmond, 2016).

Step 1: Collecting the data

wine <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/whitewines.csv")
str(wine)

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
##  $ volatile.acidity    : num  0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
##  $ citric.acid         : num  0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
##  $ residual.sugar      : num  1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
##  $ chlorides           : num  0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
##  $ free.sulfur.dioxide : num  6 41 33 11 36 22 33 17 34 40 ...
##  $ total.sulfur.dioxide: num  62 113 123 74 119 95 152 67 90 130 ...
##  $ density             : num  0.993 0.999 0.995 0.991 0.993 ...
##  $ pH                  : num  3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
##  $ sulphates           : num  0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
##  $ alcohol             : num  10.4 8.9 10.1 11.2 10.9 ...
##  $ quality             : int  5 6 6 4 6 6 6 6 6 7 ...

table(wine$quality) #ordinal type with 7 levels; cannot check normality!

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Step 2: Exploring and preparing the data

#no randomization in sampling
wineTraining <- wine[1:3750,] #75%
wineTest <- tail(wine,-3750)

Step 3: Training a model on the data

#Recursive Partitioning And Regression Trees
(wineModelR <- rpart(quality~., data=wineTraining)) #method="class" because y is a factor

## n= 3750 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 3750 2945.53200 5.870933  
##    2) alcohol< 10.85 2372 1418.86100 5.604975  
##      4) volatile.acidity>=0.2275 1611  821.30730 5.432030  
##        8) volatile.acidity>=0.3025 688  278.97670 5.255814 *
##        9) volatile.acidity< 0.3025 923  505.04230 5.563380 *
##      5) volatile.acidity< 0.2275 761  447.36400 5.971091 *
##    3) alcohol>=10.85 1378 1070.08200 6.328737  
##      6) free.sulfur.dioxide< 10.5 84   95.55952 5.369048 *
##      7) free.sulfur.dioxide>=10.5 1294  892.13600 6.391036  
##       14) alcohol< 11.76667 629  430.11130 6.173291  
##         28) volatile.acidity>=0.465 11   10.72727 4.545455 *
##         29) volatile.acidity< 0.465 618  389.71680 6.202265 *
##       15) alcohol>=11.76667 665  403.99400 6.596992 *

#rpart.plot(wineModelR, digits=3)
rpart.plot(wineModelR, digits=4, type=3, extra=101)

Step 4: Evaluating model performance

winePredR <- predict(wineModelR, wineTest)
summary(winePredR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.545   5.563   5.971   5.893   6.202   6.597

summary(wineTest$quality)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.901   6.000   9.000

rmse(wineTest$quality, winePredR) #Root Mean Squared Error

## [1] 0.7448093

4. Evaluating News Popularity

Question 4: Try decision tree and random forest and evaluate the model.
Answer 4: Define the shares more than 1,400 as popular; Select 16 predictor variables with 39,644 observations; Set the random seed to 12345; Divide 75% into a training set and 25% into a test set; Run a decision tree model, a random forest model, and recursive partitioning.

The decision tree model has 59.11% accuracy with the three most important features as “num_hrefs”, “num_imgs”, and “kw_max_max”. The random forest model has 60.25% accuracy with the three most important features as “num_imgs”, “num_hrefs”, and “n_non_stop_words”. The regression trees model has 56.37% accuracy with the three most important features as “num_imgs”, “global_sentiment_polarity”, and “avg_positive_polarity”.

Step 1: Collecting the data

news <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/OnlineNewsPopularity_for_R.csv")
news$popular <- case_when(news$shares>=1400~1, T~0) #popularity: yes(1), no(0)
news$popular <- as.factor(news$popular)

vars <- c("popular", "n_tokens_title", "n_tokens_content", "n_unique_tokens", "n_non_stop_words", "num_hrefs", "num_imgs", "num_videos", "average_token_length", "num_keywords", "kw_max_max", "global_sentiment_polarity", "avg_positive_polarity", "title_subjectivity", "title_sentiment_polarity", "abs_title_subjectivity", "abs_title_sentiment_polarity")
news17 <- news[vars]
str(news17)

## 'data.frame':    39644 obs. of  17 variables:
##  $ popular                     : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
##  $ n_tokens_title              : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content            : num  219 255 211 531 1072 ...
##  $ n_unique_tokens             : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words            : num  1 1 1 1 1 ...
##  $ num_hrefs                   : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_imgs                    : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                  : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length        : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ kw_max_max                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ global_sentiment_polarity   : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ avg_positive_polarity       : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ title_subjectivity          : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity    : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity      : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity: num  0.188 0 0 0 0.136 ...

Step 2: Pre-processing

set.seed(12345)
training2 <- sample(nrow(news17), nrow(news17)*0.75) #75%, 29733 of 39644
newsTraining <- news17[training2,]
newsTest <- news17[-training2,]
prop.table(table(newsTraining$popular))

## 
##         0         1 
## 0.4677631 0.5322369

prop.table(table(newsTest$popular))

## 
##         0         1 
## 0.4623146 0.5376854

Step 3: Modeling and evaluation

#Decision Tree model
newsModelDT <- C5.0(newsTraining[-1], newsTraining$popular)
summary(newsModelDT)

## 
## Call:
## C5.0.default(x = newsTraining[-1], y = newsTraining$popular)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Dec 26 00:54:32 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 29733 cases (17 attributes) from undefined.data
## 
## Decision tree:
## 
## num_hrefs > 16:
## :...kw_max_max <= 18200:
## :   :...n_tokens_content <= 1206: 0 (36/8)
## :   :   n_tokens_content > 1206: 1 (9/2)
## :   kw_max_max > 18200:
## :   :...n_tokens_content <= 254: 1 (400/83)
## :       n_tokens_content > 254:
## :       :...n_tokens_content > 3212: 1 (40/2)
## :           n_tokens_content <= 3212:
## :           :...num_keywords <= 5:
## :               :...kw_max_max <= 617900: 1 (74/24)
## :               :   kw_max_max > 617900:
## :               :   :...average_token_length <= 4.940762: 1 (491/226)
## :               :       average_token_length > 4.940762:
## :               :       :...num_hrefs <= 37:
## :               :           :...num_imgs <= 9: 0 (96/20)
## :               :           :   num_imgs > 9: 1 (23/9)
## :               :           num_hrefs > 37:
## :               :           :...n_tokens_title <= 12: 1 (19/3)
## :               :               n_tokens_title > 12: 0 (7/2)
## :               num_keywords > 5:
## :               :...n_tokens_title <= 9:
## :                   :...n_tokens_title <= 7: 1 (453/108)
## :                   :   n_tokens_title > 7:
## :                   :   :...num_keywords > 6: 1 (1085/337)
## :                   :       num_keywords <= 6:
## :                   :       :...avg_positive_polarity <= 0.4356061: 1 (124/38)
## :                   :           avg_positive_polarity > 0.4356061: 0 (51/18)
## :                   n_tokens_title > 9:
## :                   :...num_imgs <= 2: 1 (1020/468)
## :                       num_imgs > 2:
## :                       :...n_tokens_title > 12:
## :                           :...num_keywords <= 8:
## :                           :   :...n_tokens_content <= 388: 1 (34/7)
## :                           :   :   n_tokens_content > 388: 0 (183/83)
## :                           :   num_keywords > 8:
## :                           :   :...num_keywords > 9: 1 (77/21)
## :                           :       num_keywords <= 9:
## :                           :       :...num_hrefs <= 18: 0 (8/1)
## :                           :           num_hrefs > 18: 1 (44/13)
## :                           n_tokens_title <= 12:
## :                           :...average_token_length > 4.690059: 1 (605/176)
## :                               average_token_length <= 4.690059:
## :                               :...num_hrefs <= 31: 1 (454/163)
## :                                   num_hrefs > 31:
## :                                   :...num_keywords <= 7:
## :                                       :...avg_positive_polarity <= 0.2858333: 1 (4)
## :                                       :   avg_positive_polarity > 0.2858333: 0 (36/6)
## :                                       num_keywords > 7:
## :                                       :...num_imgs <= 34: 1 (88/34)
## :                                           num_imgs > 34: 0 (12/2)
## num_hrefs <= 16:
## :...num_videos > 22:
##     :...average_token_length <= 4.595656: 0 (150/41)
##     :   average_token_length > 4.595656: 1 (13/3)
##     num_videos <= 22:
##     :...num_imgs > 3:
##         :...num_keywords <= 6:
##         :   :...num_imgs > 15:
##         :   :   :...n_tokens_content <= 254: 1 (26/5)
##         :   :   :   n_tokens_content > 254:
##         :   :   :   :...abs_title_sentiment_polarity <= 0.6375: 0 (303/109)
##         :   :   :       abs_title_sentiment_polarity > 0.6375: 1 (17/4)
##         :   :   num_imgs <= 15:
##         :   :   :...global_sentiment_polarity > 0.07433511:
##         :   :       :...num_videos <= 2: 1 (638/235)
##         :   :       :   num_videos > 2:
##         :   :       :   :...title_sentiment_polarity <= -0.07777778: 1 (5)
##         :   :       :       title_sentiment_polarity > -0.07777778:
##         :   :       :       :...avg_positive_polarity <= 0.3379489: 1 (4)
##         :   :       :           avg_positive_polarity > 0.3379489: 0 (21/5)
##         :   :       global_sentiment_polarity <= 0.07433511:
##         :   :       :...title_subjectivity > 0.8333333: 1 (34/7)
##         :   :           title_subjectivity <= 0.8333333:
##         :   :           :...n_tokens_content <= 302:
##         :   :               :...num_imgs > 11: 0 (27/10)
##         :   :               :   num_imgs <= 11:
##         :   :               :   :...n_tokens_title <= 13: 1 (71/21)
##         :   :               :       n_tokens_title > 13: 0 (3)
##         :   :               n_tokens_content > 302:
##         :   :               :...num_imgs <= 12: 0 (168/48)
##         :   :                   num_imgs > 12:
##         :   :                   :...num_imgs <= 14: 1 (19/6)
##         :   :                       num_imgs > 14: 0 (14/4)
##         :   num_keywords > 6:
##         :   :...title_sentiment_polarity > 0.4: 1 (449/130)
##         :       title_sentiment_polarity <= 0.4:
##         :       :...avg_positive_polarity > 0.4718768:
##         :           :...abs_title_subjectivity > 0.07272727: 1 (253/64)
##         :           :   abs_title_subjectivity <= 0.07272727:
##         :           :   :...num_videos <= 1: 0 (23/9)
##         :           :       num_videos > 1: 1 (3)
##         :           avg_positive_polarity <= 0.4718768:
##         :           :...n_tokens_title <= 10: 1 (1317/508)
##         :               n_tokens_title > 10:
##         :               :...num_hrefs > 6: 1 (650/268)
##         :                   num_hrefs <= 6:
##         :                   :...num_videos <= 0: 0 (331/150)
##         :                       num_videos > 0:
##         :                       :...avg_positive_polarity <= 0.1920238: 1 (12)
##         :                           avg_positive_polarity > 0.1920238:
##         :                           :...num_videos > 2: 0 (14/4)
##         :                               num_videos <= 2:
##         :                               :...num_hrefs > 3: 1 (119/37)
##         :                                   num_hrefs <= 3:
##         :                                   :...n_tokens_title <= 13: 0 (59/23)
##         :                                       n_tokens_title > 13: 1 (14/3)
##         num_imgs <= 3:
##         :...kw_max_max <= 617900:
##             :...kw_max_max > 118700: 1 (1250/442)
##             :   kw_max_max <= 118700:
##             :   :...n_tokens_content > 886: 1 (98/26)
##             :       n_tokens_content <= 886:
##             :       :...num_hrefs <= 11: 1 (1428/664)
##             :           num_hrefs > 11: 0 (122/47)
##             kw_max_max > 617900:
##             :...n_unique_tokens <= 0.4065041:
##                 :...num_imgs > 0:
##                 :   :...average_token_length <= 4.911008: 1 (505/171)
##                 :   :   average_token_length > 4.911008:
##                 :   :   :...n_unique_tokens <= 0.3680441: 1 (11/3)
##                 :   :       n_unique_tokens > 0.3680441: 0 (29/5)
##                 :   num_imgs <= 0:
##                 :   :...num_videos > 1: 0 (33/14)
##                 :       num_videos <= 1:
##                 :       :...num_videos > 0: 1 (521/226)
##                 :           num_videos <= 0:
##                 :           :...n_tokens_title > 13: 1 (5)
##                 :               n_tokens_title <= 13:
##                 :               :...num_keywords <= 7: 0 (26/6)
##                 :                   num_keywords > 7: 1 (24/8)
##                 n_unique_tokens > 0.4065041:
##                 :...global_sentiment_polarity <= 0.07066981:
##                     :...num_hrefs <= 1: 0 (214/47)
##                     :   num_hrefs > 1:
##                     :   :...average_token_length > 4.729642:
##                     :       :...n_tokens_content <= 313:
##                     :       :   :...num_imgs <= 1:
##                     :       :   :   :...kw_max_max > 690400: 0 (534/209)
##                     :       :   :   :   kw_max_max <= 690400:
##                     :       :   :   :   :...n_tokens_title <= 11: 0 (97/41)
##                     :       :   :   :       n_tokens_title > 11: 1 (18/4)
##                     :       :   :   num_imgs > 1:
##                     :       :   :   :...num_keywords > 8: 1 (17/1)
##                     :       :   :       num_keywords <= 8:
##                     :       :   :       :...num_videos <= 0: 0 (55/20)
##                     :       :   :           num_videos > 0: 1 (9/3)
##                     :       :   n_tokens_content > 313:
##                     :       :   :...num_keywords <= 7: 0 (890/243)
##                     :       :       num_keywords > 7:
##                     :       :       :...num_imgs <= 2: 0 (512/171)
##                     :       :           num_imgs > 2:
##                     :       :           :...n_tokens_title > 13: 1 (4)
##                     :       :               n_tokens_title <= 13:
##                     :       :               :...num_hrefs > 7: 0 (30/8)
##                     :       :                   num_hrefs <= 7:
##                     :       :                   :...num_videos > 0: 1 (5)
##                     :       :                       num_videos <= 0: [S1]
##                     :       average_token_length <= 4.729642:
##                     :       :...num_keywords <= 5: 0 (607/234)
##                     :           num_keywords > 5:
##                     :           :...num_imgs <= 0:
##                     :               :...n_tokens_title <= 9: 0 (110/50)
##                     :               :   n_tokens_title > 9: 1 (337/137)
##                     :               num_imgs > 0:
##                     :               :...num_imgs > 2:
##                     :                   :...n_tokens_content <= 538: 1 (31/5)
##                     :                   :   n_tokens_content > 538:
##                     :                   :   :...num_keywords <= 7: 0 (20/4)
##                     :                   :       num_keywords > 7: [S2]
##                     :                   num_imgs <= 2:
##                     :                   :...num_videos > 2: 0 (40/14)
##                     :                       num_videos <= 2:
##                     :                       :...num_videos <= 0: 0 (740/327)
##                     :                           num_videos > 0:
##                     :                           :...num_imgs <= 1:
##                     :                               :...num_videos > 1: [S3]
##                     :                               :   num_videos <= 1: [S4]
##                     :                               num_imgs > 1: [S5]
##                     global_sentiment_polarity > 0.07066981:
##                     :...title_sentiment_polarity <= -0.006397306:
##                         :...n_tokens_content > 119:
##                         :   :...n_tokens_title <= 7: 1 (52/14)
##                         :   :   n_tokens_title > 7: 0 (1023/434)
##                         :   n_tokens_content <= 119:
##                         :   :...n_tokens_title > 12: 0 (53/1)
##                         :       n_tokens_title <= 12:
##                         :       :...num_imgs > 0: 1 (6)
##                         :           num_imgs <= 0:
##                         :           :...title_sentiment_polarity <= -0.625: 1 (5)
##                         :               title_sentiment_polarity > -0.625:
##                         :               :...n_unique_tokens <= 0.7198582: 1 (3)
##                         :                   n_unique_tokens > 0.7198582: [S6]
##                         title_sentiment_polarity > -0.006397306:
##                         :...num_imgs > 1:
##                             :...n_tokens_content > 242: 1 (1355/615)
##                             :   n_tokens_content <= 242:
##                             :   :...global_sentiment_polarity <= 0.3298789: 1 (199/47)
##                             :       global_sentiment_polarity > 0.3298789: 0 (14/3)
##                             num_imgs <= 1:
##                             :...num_videos > 1:
##                                 :...num_hrefs > 11: 1 (167/54)
##                                 :   num_hrefs <= 11:
##                                 :   :...num_keywords > 8:
##                                 :       :...num_imgs <= 0: [S7]
##                                 :       :   num_imgs > 0:
##                                 :       :   :...num_keywords <= 9: 0 (41/19)
##                                 :       :       num_keywords > 9:
##                                 :       :       :...num_hrefs <= 6: 0 (23/10)
##                                 :       :           num_hrefs > 6: 1 (31/8)
##                                 :       num_keywords <= 8:
##                                 :       :...n_tokens_title <= 7: 0 (51/19)
##                                 :           n_tokens_title > 7:
##                                 :           :...num_imgs <= 0: 1 (320/108)
##                                 :               num_imgs > 0:
##                                 :               :...num_videos > 13: 1 (19/4)
##                                 :                   num_videos <= 13:
##                                 :                   :...kw_max_max <= 690400: [S8]
##                                 :                       kw_max_max > 690400: [S9]
##                                 num_videos <= 1:
##                                 :...kw_max_max <= 690400:
##                                     :...num_imgs <= 0:
##                                     :   :...n_tokens_title <= 11: 1 (267/106)
##                                     :   :   n_tokens_title > 11:
##                                     :   :   :...num_keywords <= 8: 1 (67/14)
##                                     :   :       num_keywords > 8: 0 (13/4)
##                                     :   num_imgs > 0:
##                                     :   :...n_tokens_content <= 650: [S10]
##                                     :       n_tokens_content > 650: [S11]
##                                     kw_max_max > 690400:
##                                     :...num_keywords <= 6: 0 (2253/1013)
##                                         num_keywords > 6:
##                                         :...n_tokens_content > 993: 1 (105/34)
##                                             n_tokens_content <= 993:
##                                             :...num_videos > 0: 1 (949/462)
##                                                 num_videos <= 0:
##                                                 :...num_imgs <= 0: [S12]
##                                                     num_imgs > 0:
##                                                     :...num_keywords > 9: [S13]
##                                                         num_keywords <= 9: [S14]
## 
## SubTree [S1]
## 
## global_sentiment_polarity <= 0.01406926: 1 (6)
## global_sentiment_polarity > 0.01406926: 0 (13/4)
## 
## SubTree [S2]
## 
## global_sentiment_polarity <= 0.03557535: 0 (7)
## global_sentiment_polarity > 0.03557535: 1 (18/3)
## 
## SubTree [S3]
## 
## avg_positive_polarity <= 0.3235457: 1 (20/5)
## avg_positive_polarity > 0.3235457: 0 (13/2)
## 
## SubTree [S4]
## 
## kw_max_max > 690400: 1 (189/87)
## kw_max_max <= 690400:
## :...global_sentiment_polarity <= 0.06004399: 0 (7/1)
##     global_sentiment_polarity > 0.06004399: 1 (3)
## 
## SubTree [S5]
## 
## n_unique_tokens <= 0.523077: 0 (29/4)
## n_unique_tokens > 0.523077:
## :...n_tokens_title > 11: 1 (9)
##     n_tokens_title <= 11:
##     :...n_tokens_content <= 231: 0 (6)
##         n_tokens_content > 231: 1 (14/4)
## 
## SubTree [S6]
## 
## average_token_length <= 4.908397: 0 (48/6)
## average_token_length > 4.908397: 1 (7/2)
## 
## SubTree [S7]
## 
## avg_positive_polarity <= 0.3969345: 0 (77/23)
## avg_positive_polarity > 0.3969345:
## :...title_subjectivity <= 0.7348485: 1 (70/25)
##     title_subjectivity > 0.7348485: 0 (16/4)
## 
## SubTree [S8]
## 
## n_tokens_title <= 9: 1 (9/1)
## n_tokens_title > 9: 0 (11/3)
## 
## SubTree [S9]
## 
## num_videos <= 2: 1 (128/55)
## num_videos > 2: 0 (120/52)
## 
## SubTree [S10]
## 
## title_subjectivity > 0.975: 1 (26/6)
## title_subjectivity <= 0.975:
## :...average_token_length <= 4.32405: 1 (57/15)
##     average_token_length > 4.32405: 0 (745/330)
## 
## SubTree [S11]
## 
## title_sentiment_polarity <= 0.1305556: 1 (119/31)
## title_sentiment_polarity > 0.1305556:
## :...avg_positive_polarity <= 0.3142575: 0 (9)
##     avg_positive_polarity > 0.3142575:
##     :...num_hrefs <= 7: 1 (9)
##         num_hrefs > 7: 0 (29/13)
## 
## SubTree [S12]
## 
## n_tokens_content <= 141: 0 (86/25)
## n_tokens_content > 141: 1 (238/94)
## 
## SubTree [S13]
## 
## n_tokens_title <= 8: 1 (97/39)
## n_tokens_title > 8: 0 (403/192)
## 
## SubTree [S14]
## 
## abs_title_subjectivity <= 0.475: 0 (638/312)
## abs_title_subjectivity > 0.475:
## :...num_hrefs <= 3: 0 (121/37)
##     num_hrefs > 3:
##     :...title_subjectivity > 0.5: 1 (48/22)
##         title_subjectivity <= 0.5:
##         :...n_tokens_title > 8: 0 (547/255)
##             n_tokens_title <= 8:
##             :...num_hrefs > 7:
##                 :...global_sentiment_polarity <= 0.1513961: 0 (68/23)
##                 :   global_sentiment_polarity > 0.1513961: 1 (34/12)
##                 num_hrefs <= 7:
##                 :...n_tokens_title > 7: 0 (29/4)
##                     n_tokens_title <= 7:
##                     :...average_token_length > 4.681704: 0 (23/3)
##                         average_token_length <= 4.681704:
##                         :...global_sentiment_polarity > 0.2002036: 0 (4)
##                             global_sentiment_polarity <= 0.2002036:
##                             :...global_sentiment_polarity <= 0.1087384: 0 (3)
##                                 global_sentiment_polarity > 0.1087384: 1 (12/1)
## 
## 
## Evaluation on training data (29733 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     151 11342(38.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    7345  6563    (a): class 0
##    4779 11046    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% num_hrefs
##   90.07% num_imgs
##   84.00% kw_max_max
##   81.59% num_videos
##   67.36% num_keywords
##   57.91% n_tokens_content
##   55.85% n_unique_tokens
##   55.34% global_sentiment_polarity
##   47.49% title_sentiment_polarity
##   37.06% n_tokens_title
##   26.34% average_token_length
##   11.02% avg_positive_polarity
##    6.79% title_subjectivity
##    6.07% abs_title_subjectivity
##    1.08% abs_title_sentiment_polarity
## 
## 
## Time: 0.6 secs

newsPredDT <- predict(newsModelDT, newsTest)
CrossTable(newsTest$popular, newsPredDT, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (DT)"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  9911 
## 
##  
##              | Predicted (DT) 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2270 |      2312 |      4582 | 
##              |     0.229 |     0.233 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1741 |      3588 |      5329 | 
##              |     0.176 |     0.362 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      4011 |      5900 |      9911 | 
## -------------|-----------|-----------|-----------|
## 
##

#Random Forest model
newsModelRF <- randomForest(popular~., data=newsTraining, importance=T) #about 2min runtime
newsPredRF <- predict(newsModelRF, newsTest)
CrossTable(newsTest$popular, newsPredRF, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RF)"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  9911 
## 
##  
##              | Predicted (RF) 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      2257 |      2325 |      4582 | 
##              |     0.228 |     0.235 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1616 |      3713 |      5329 | 
##              |     0.163 |     0.375 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      3873 |      6038 |      9911 | 
## -------------|-----------|-----------|-----------|
## 
##

importance(newsModelRF, type=1)

##                              MeanDecreaseAccuracy
## n_tokens_title                           11.58966
## n_tokens_content                         37.20085
## n_unique_tokens                          37.27078
## n_non_stop_words                         37.61929
## num_hrefs                                38.25399
## num_imgs                                 39.04636
## num_videos                               14.99645
## average_token_length                     20.51853
## num_keywords                             18.32682
## kw_max_max                               21.97751
## global_sentiment_polarity                34.83415
## avg_positive_polarity                    13.99606
## title_subjectivity                       14.00154
## title_sentiment_polarity                 20.51038
## abs_title_subjectivity                   12.42208
## abs_title_sentiment_polarity             16.81981

#Regression Trees model
newsModelR <- rpart(popular~., data=newsTraining)
summary(newsModelR)

## Call:
## rpart(formula = popular ~ ., data = newsTraining)
##   n= 29733 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.03311044      0 1.0000000 1.0000000 0.006186146
## 2 0.01000000      2 0.9337791 0.9390279 0.006153106
## 
## Variable importance
##                  num_imgs global_sentiment_polarity     avg_positive_polarity 
##                        49                        25                         6 
##                 num_hrefs          n_non_stop_words          n_tokens_content 
##                         6                         5                         5 
##      average_token_length           n_unique_tokens 
##                         2                         2 
## 
## Node number 1: 29733 observations,    complexity param=0.03311044
##   predicted class=1  expected loss=0.4677631  P(node) =1
##     class counts: 13908 15825
##    probabilities: 0.468 0.532 
##   left son=2 (21955 obs) right son=3 (7778 obs)
##   Primary splits:
##       num_imgs                  < 3.5        to the left,  improve=165.43800, (0 missing)
##       num_hrefs                 < 16.5       to the left,  improve=138.98500, (0 missing)
##       global_sentiment_polarity < 0.09387385 to the left,  improve=106.75980, (0 missing)
##       num_keywords              < 6.5        to the left,  improve= 69.87192, (0 missing)
##       n_unique_tokens           < 0.4366337  to the right, improve= 67.85730, (0 missing)
##   Surrogate splits:
##       num_hrefs                 < 17.5       to the left,  agree=0.769, adj=0.118, (0 split)
##       n_non_stop_words          < 1          to the left,  agree=0.756, adj=0.066, (0 split)
##       n_tokens_content          < 1316.5     to the left,  agree=0.752, adj=0.051, (0 split)
##       global_sentiment_polarity < -0.1855226 to the right, agree=0.740, adj=0.004, (0 split)
##       average_token_length      < 5.654628   to the left,  agree=0.739, adj=0.004, (0 split)
## 
## Node number 2: 21955 observations,    complexity param=0.03311044
##   predicted class=1  expected loss=0.4991574  P(node) =0.7384051
##     class counts: 10959 10996
##    probabilities: 0.499 0.501 
##   left son=4 (8173 obs) right son=5 (13782 obs)
##   Primary splits:
##       global_sentiment_polarity < 0.08566651 to the left,  improve=85.15745, (0 missing)
##       n_unique_tokens           < 0.4305251  to the right, improve=57.18810, (0 missing)
##       kw_max_max                < 677000     to the right, improve=56.78869, (0 missing)
##       average_token_length      < 4.782888   to the right, improve=49.01715, (0 missing)
##       num_hrefs                 < 21.5       to the left,  improve=48.00899, (0 missing)
##   Surrogate splits:
##       avg_positive_polarity < 0.2844315  to the left,  agree=0.714, adj=0.231, (0 split)
##       n_unique_tokens       < 0.1845133  to the left,  agree=0.659, adj=0.084, (0 split)
##       n_tokens_content      < 9          to the left,  agree=0.659, adj=0.084, (0 split)
##       n_non_stop_words      < 0.5        to the left,  agree=0.659, adj=0.084, (0 split)
##       average_token_length  < 1.8        to the left,  agree=0.659, adj=0.084, (0 split)
## 
## Node number 3: 7778 observations
##   predicted class=1  expected loss=0.3791463  P(node) =0.2615949
##     class counts:  2949  4829
##    probabilities: 0.379 0.621 
## 
## Node number 4: 8173 observations
##   predicted class=0  expected loss=0.4436559  P(node) =0.2748798
##     class counts:  4547  3626
##    probabilities: 0.556 0.444 
## 
## Node number 5: 13782 observations
##   predicted class=1  expected loss=0.4652445  P(node) =0.4635254
##     class counts:  6412  7370
##    probabilities: 0.465 0.535

rpart.plot(newsModelR, digits=3, type=1)

newsPredR <- predict(newsModelR, newsTest, type="class") #don't forget type="class", otherwise it will return as probabilities (type="prob") by default
summary(newsPredR)

##    0    1 
## 2666 7245

summary(newsTest$popular)

##    0    1 
## 4582 5329

#rmse(newsTest$popular, newsPredR)
## Warning message:
## In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
CrossTable(newsTest$popular, newsPredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  9911 
## 
##  
##              | Predicted (RT) 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1462 |      3120 |      4582 | 
##              |     0.148 |     0.315 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1204 |      4125 |      5329 | 
##              |     0.121 |     0.416 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2666 |      7245 |      9911 | 
## -------------|-----------|-----------|-----------|
## 
##

newsModelR$variable.importance

##                  num_imgs global_sentiment_polarity     avg_positive_polarity 
##                165.437961                 85.901899                 19.703014 
##                 num_hrefs          n_non_stop_words          n_tokens_content 
##                 19.589658                 18.102157                 15.634839 
##      average_token_length           n_unique_tokens 
##                  7.764944                  7.137263

ANLY 530 - Lab 1

Classification Models

Zhengxiao Wei

2019-12-07

1. Tree-based Classification

2. Random Forest

3. Adding Regression to Trees

4. Evaluating News Popularity