============================================================================================================
About: This document is also available at http://rpubs.com/sherloconan/557644
Data source: https://www.kaggle.com/sherloconan/anly-53053b
Question 1: If you see an accuracy of 100%, what does it mean? Does this mean that we design a perfect model?
Answer 1: In this case, false negatives (Type II error) are 14, while false positives (Type I error) are 17. True positives are 8, and true negatives are 61; hence, the accuracy is 69% for the decision tree model on the test set. “Positive” means being creditable / not defaulted / not declined. “Negative” means being not creditable / defaulted / declined.
High accuracy could imply overfitting. If an accuracy of 100% occurs on the test set, it means some predictor / independent / explanatory variables are highly multicollinear to the target / dependent / response variable in the model, e.g., one of the predictors is the exact same as the response. Hence, such a model is “perfect” but meaningless.
It is worth noting that several RPubs from Dhull (2018), Pierre (2018), and et al., made a mistake in the decision tree model on the News Popularity case: the dataset should have 17 variables; however, c5.0 output indicates 18 attributes due to the clumsy code “newsShort$shares = newsShort$popular”, resulting in the 100% accuracy rate.
Step 1: Collecting the data
credit <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/credit.csv")
##MORE DESCRIPTIVE
#options(scipen=100,digits=2);pastecs::stat.desc(credit) %>% kable() %>% kable_styling()
str(credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
credit$Creditability <- as.factor(credit$Creditability)
Step 2: Exploring the data
summary(credit$Credit.Amount) #German bank, monetary unit is Deutsche Mark (DM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
table(credit$Creditability) #creditable: yes(0), no(1); no creditable = default
##
## 0 1
## 300 700
set.seed(12345)
creditR <- credit[order(runif(1000)),]
creditTraining <- creditR[1:900,]
creditTest <- creditR[901:1000,]
##ALTERNATIVE
#training <- sample(1000, 900)
#creditTraining <- credit[training,]
#creditTest <- credit[-training,]
prop.table(table(creditTraining$Creditability))
##
## 0 1
## 0.3088889 0.6911111
prop.table(table(creditTest$Creditability))
##
## 0 1
## 0.22 0.78
Step 3: Training a model on the data
creditModelDT <- C5.0(x=creditTraining[,!(colnames(creditTraining) %in% c("Creditability"))], y=creditTraining[,"Creditability"])
##OR
#creditModel <- C5.0(x=creditTraining[-1], y=creditTraining$Creditability)
summary(creditModelDT)
##
## Call:
## C5.0.default(x = creditTraining[, !(colnames(creditTraining)
## %in% c("Creditability"))], y = creditTraining[, "Creditability"])
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Dec 26 00:54:21 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (21 attributes) from undefined.data
##
## Decision tree:
##
## Account.Balance > 2:
## :...Concurrent.Credits > 2:
## : :...Age..years. > 33: 1 (179/11)
## : : Age..years. <= 33:
## : : :...Credit.Amount > 6681:
## : : :...Length.of.current.employment <= 2: 0 (4)
## : : : Length.of.current.employment > 2:
## : : : :...Payment.Status.of.Previous.Credit <= 3: 1 (4)
## : : : Payment.Status.of.Previous.Credit > 3: 0 (3/1)
## : : Credit.Amount <= 6681:
## : : :...Occupation > 2:
## : : :...Occupation <= 3: 1 (120/12)
## : : : Occupation > 3:
## : : : :...Duration.of.Credit..month. <= 33: 1 (9)
## : : : Duration.of.Credit..month. > 33: 0 (3)
## : : Occupation <= 2:
## : : :...No.of.Credits.at.this.Bank > 1: 1 (6)
## : : No.of.Credits.at.this.Bank <= 1:
## : : :...Most.valuable.available.asset > 1: 0 (3)
## : : Most.valuable.available.asset <= 1:
## : : :...Credit.Amount <= 1987: 1 (8/1)
## : : Credit.Amount > 1987: 0 (2)
## : Concurrent.Credits <= 2:
## : :...Guarantors > 1: 1 (4)
## : Guarantors <= 1:
## : :...Purpose <= 0:
## : :...Most.valuable.available.asset <= 2: 0 (5)
## : : Most.valuable.available.asset > 2:
## : : :...No.of.dependents <= 1: 1 (7/1)
## : : No.of.dependents > 1: 0 (2)
## : Purpose > 0:
## : :...Purpose <= 4: 1 (35/2)
## : Purpose > 4:
## : :...Length.of.current.employment <= 2: 0 (4)
## : Length.of.current.employment > 2:
## : :...No.of.dependents > 1: 0 (3/1)
## : No.of.dependents <= 1:
## : :...Length.of.current.employment > 3: 1 (4)
## : Length.of.current.employment <= 3:
## : :...Instalment.per.cent <= 2: 1 (2)
## : Instalment.per.cent > 2: 0 (2)
## Account.Balance <= 2:
## :...Payment.Status.of.Previous.Credit <= 1:
## :...Value.Savings.Stocks <= 2: 0 (49/10)
## : Value.Savings.Stocks > 2:
## : :...Credit.Amount <= 2064: 0 (3)
## : Credit.Amount > 2064: 1 (9/1)
## Payment.Status.of.Previous.Credit > 1:
## :...Credit.Amount > 7980:
## :...Value.Savings.Stocks > 4:
## : :...Payment.Status.of.Previous.Credit <= 2: 0 (4/1)
## : : Payment.Status.of.Previous.Credit > 2: 1 (3)
## : Value.Savings.Stocks <= 4:
## : :...Account.Balance > 1: 0 (15)
## : Account.Balance <= 1:
## : :...Concurrent.Credits <= 2: 0 (2)
## : Concurrent.Credits > 2:
## : :...Credit.Amount <= 10297: 0 (6)
## : Credit.Amount > 10297: 1 (3)
## Credit.Amount <= 7980:
## :...Duration.of.Credit..month. <= 11:
## :...Occupation > 3:
## : :...Concurrent.Credits <= 2: 1 (3)
## : : Concurrent.Credits > 2:
## : : :...Payment.Status.of.Previous.Credit <= 2: 1 (4/1)
## : : Payment.Status.of.Previous.Credit > 2: 0 (3)
## : Occupation <= 3:
## : :...Age..years. > 32: 1 (34)
## : Age..years. <= 32:
## : :...Most.valuable.available.asset <= 1: 1 (13/1)
## : Most.valuable.available.asset > 1:
## : :...Instalment.per.cent <= 3: 1 (6/1)
## : Instalment.per.cent > 3: 0 (6/1)
## Duration.of.Credit..month. > 11:
## :...Duration.of.Credit..month. > 36:
## :...Length.of.current.employment <= 1: 1 (3)
## : Length.of.current.employment > 1:
## : :...No.of.dependents > 1: 1 (5/1)
## : No.of.dependents <= 1:
## : :...Duration.in.Current.address <= 1: 1 (4/1)
## : Duration.in.Current.address > 1: 0 (23)
## Duration.of.Credit..month. <= 36:
## :...Guarantors > 2:
## :...Foreign.Worker <= 1: 1 (23/1)
## : Foreign.Worker > 1: 0 (2)
## Guarantors <= 2:
## :...Credit.Amount <= 1381:
## :...Telephone > 1:
## : :...Sex...Marital.Status > 3: 0 (2)
## : : Sex...Marital.Status <= 3:
## : : :...Duration.of.Credit..month. <= 16: 1 (7)
## : : Duration.of.Credit..month. > 16: 0 (3/1)
## : Telephone <= 1:
## : :...Concurrent.Credits <= 2: 0 (9)
## : Concurrent.Credits > 2:
## : :...Account.Balance <= 1: 0 (29/6)
## : Account.Balance > 1: [S1]
## Credit.Amount > 1381:
## :...Guarantors > 1:
## :...Foreign.Worker > 1: 1 (2)
## : Foreign.Worker <= 1:
## : :...Instalment.per.cent > 2: 0 (5)
## : Instalment.per.cent <= 2: [S2]
## Guarantors <= 1:
## :...Payment.Status.of.Previous.Credit > 3:
## :...Age..years. > 33: 1 (22)
## : Age..years. <= 33:
## : :...Purpose > 3: 1 (7)
## : Purpose <= 3: [S3]
## Payment.Status.of.Previous.Credit <= 3:
## :...Instalment.per.cent <= 2:
## :...No.of.dependents > 1:
## : :...Purpose <= 0: 1 (2)
## : : Purpose > 0: 0 (3)
## : No.of.dependents <= 1: [S4]
## Instalment.per.cent > 2:
## :...Concurrent.Credits <= 1: 1 (8/1)
## Concurrent.Credits > 1:
## :...Sex...Marital.Status <= 1: 0 (6/1)
## Sex...Marital.Status > 1:
## :...Account.Balance > 1: [S5]
## Account.Balance <= 1: [S6]
##
## SubTree [S1]
##
## Duration.in.Current.address > 3: 1 (8/1)
## Duration.in.Current.address <= 3:
## :...Purpose > 2: 0 (5)
## Purpose <= 2:
## :...Type.of.apartment <= 1: 0 (2)
## Type.of.apartment > 1: 1 (5/1)
##
## SubTree [S2]
##
## Duration.in.Current.address <= 2: 1 (2)
## Duration.in.Current.address > 2: 0 (4/1)
##
## SubTree [S3]
##
## Duration.of.Credit..month. <= 16: 1 (4)
## Duration.of.Credit..month. > 16:
## :...Length.of.current.employment <= 3: 0 (8)
## Length.of.current.employment > 3: 1 (6/1)
##
## SubTree [S4]
##
## Duration.in.Current.address > 1: 1 (41/6)
## Duration.in.Current.address <= 1:
## :...Value.Savings.Stocks > 3: 0 (2)
## Value.Savings.Stocks <= 3:
## :...Length.of.current.employment > 2: 1 (4)
## Length.of.current.employment <= 2:
## :...Instalment.per.cent <= 1: 0 (3)
## Instalment.per.cent > 1: 1 (3/1)
##
## SubTree [S5]
##
## Sex...Marital.Status > 3: 0 (2)
## Sex...Marital.Status <= 3:
## :...Length.of.current.employment > 3: 1 (10)
## Length.of.current.employment <= 3:
## :...Duration.in.Current.address <= 1: 1 (5)
## Duration.in.Current.address > 1:
## :...Length.of.current.employment <= 2: 0 (4)
## Length.of.current.employment > 2:
## :...Value.Savings.Stocks <= 1: 0 (3)
## Value.Savings.Stocks > 1: 1 (5)
##
## SubTree [S6]
##
## Payment.Status.of.Previous.Credit > 2: 0 (3)
## Payment.Status.of.Previous.Credit <= 2:
## :...Purpose <= 0: 0 (7/1)
## Purpose > 0:
## :...Most.valuable.available.asset <= 1: 0 (5/1)
## Most.valuable.available.asset > 1:
## :...Sex...Marital.Status <= 2: 1 (6)
## Sex...Marital.Status > 2:
## :...Length.of.current.employment > 4: 0 (5)
## Length.of.current.employment <= 4:
## :...Telephone > 1: 1 (3)
## Telephone <= 1:
## :...Length.of.current.employment <= 2: 0 (2)
## Length.of.current.employment > 2:
## :...Age..years. <= 28: 1 (4)
## Age..years. > 28: 0 (2)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 85 70( 7.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 233 45 (a): class 0
## 25 597 (b): class 1
##
##
## Attribute usage:
##
## 100.00% Account.Balance
## 67.11% Credit.Amount
## 63.11% Concurrent.Credits
## 55.33% Payment.Status.of.Previous.Credit
## 50.33% Age..years.
## 45.44% Duration.of.Credit..month.
## 40.11% Guarantors
## 24.44% Occupation
## 18.33% Instalment.per.cent
## 15.56% Purpose
## 14.22% Length.of.current.employment
## 13.67% Duration.in.Current.address
## 12.67% Value.Savings.Stocks
## 12.22% No.of.dependents
## 9.33% Sex...Marital.Status
## 9.00% Telephone
## 8.78% Most.valuable.available.asset
## 4.22% Foreign.Worker
## 2.11% No.of.Credits.at.this.Bank
## 0.78% Type.of.apartment
##
##
## Time: 0.0 secs
Step 4: Evaluating model performance
creditPredDT <- predict(creditModelDT, creditTest)
#table(creditTest$Creditability, creditPredDT)
CrossTable(creditTest$Creditability, creditPredDT, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability")) #a confusion matrix of binary classification, "negative" means being (1) not creditable / defaulted / declined
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | Predicted Creditability
## Actual Creditability | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 0 | 8 | 14 | 22 |
## | 0.080 | 0.140 | |
## ---------------------|-----------|-----------|-----------|
## 1 | 17 | 61 | 78 |
## | 0.170 | 0.610 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 25 | 75 | 100 |
## ---------------------|-----------|-----------|-----------|
##
##
Question 2: What are the three most important features in this model?
Answer 2: Use the function importance() from the package “randomForest” to see the variable importance measures by mean decrease in accuracy in the random forest model, i.e., importance(creditModelRF) or creditModelRF$importance or caret::varImp(creditModelRF). The model, with the random seed 12345, suggest the three most important features as “Account Balance”, “Duration of Credit”, and “Payment Status of Previous Credit”. The model, with the random seed 23458, suggest the three most important features as “Account Balance”, “Duration of Credit”, and “Credit Amount”. Both random forest cases and, plus, the decision tree model value “Account Balance” the most.
Furthermore, “The scikit-learn Random Forest feature importance and R’s default Random Forest feature importance strategies are biased. To get reliable results in Python, use permutation importance, provided here and in our rfpimp package (via pip). For R, use importance=T in the Random Forest constructor then type=1 in R’s importance() function. In addition, your feature importance measures will only be reliable if your model is trained with suitable hyper-parameters” (Parr, Turgutlu, Csiszar, and Howard, 2018).
creditModelRF <- randomForest(Creditability~., data=creditTraining, importance=T)
summary(creditModelRF)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 900 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 1800 matrix numeric
## oob.times 900 -none- numeric
## classes 2 -none- character
## importance 80 -none- numeric
## importanceSD 60 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 900 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
creditPredRF <- predict(creditModelRF, creditTest)
#table(creditTest$Creditability, creditPredRF)
CrossTable(creditTest$Creditability, creditPredRF, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability")) #a confusion matrix of binary classification
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | Predicted Creditability
## Actual Creditability | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 0 | 11 | 11 | 22 |
## | 0.110 | 0.110 | |
## ---------------------|-----------|-----------|-----------|
## 1 | 9 | 69 | 78 |
## | 0.090 | 0.690 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 20 | 80 | 100 |
## ---------------------|-----------|-----------|-----------|
##
##
importance(creditModelRF, type=1) #type1: MeanDecreaseAccuracy; type2: MeanDecreaseGini
## MeanDecreaseAccuracy
## Account.Balance 33.0987788
## Duration.of.Credit..month. 20.7930720
## Payment.Status.of.Previous.Credit 16.6331475
## Purpose 5.2239655
## Credit.Amount 16.2319279
## Value.Savings.Stocks 10.9540498
## Length.of.current.employment 7.6996932
## Instalment.per.cent 7.0556056
## Sex...Marital.Status -0.2679574
## Guarantors 11.8663372
## Duration.in.Current.address 3.9844930
## Most.valuable.available.asset 7.4861647
## Age..years. 5.7940226
## Concurrent.Credits 4.8355012
## Type.of.apartment 4.1302085
## No.of.Credits.at.this.Bank 2.9144824
## Occupation 3.2507050
## No.of.dependents 1.3437617
## Telephone 2.6267656
## Foreign.Worker 1.9573448
Change the random seed to 23458
set.seed(23458)
creditR2 <- credit[order(runif(1000)),]
creditTraining2 <- creditR2[1:900,]
creditTest2 <- creditR2[901:1000,]
creditModelRF2 <- randomForest(Creditability~., data=creditTraining2, importance=T)
creditPredRF2 <- predict(creditModelRF2, creditTest2)
CrossTable(creditTest2$Creditability, creditPredRF2, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual Creditability", "Predicted Creditability"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | Predicted Creditability
## Actual Creditability | 0 | 1 | Row Total |
## ---------------------|-----------|-----------|-----------|
## 0 | 13 | 18 | 31 |
## | 0.130 | 0.180 | |
## ---------------------|-----------|-----------|-----------|
## 1 | 5 | 64 | 69 |
## | 0.050 | 0.640 | |
## ---------------------|-----------|-----------|-----------|
## Column Total | 18 | 82 | 100 |
## ---------------------|-----------|-----------|-----------|
##
##
importance(creditModelRF2, type=1) #type1: MeanDecreaseAccuracy; type2: MeanDecreaseGini
## MeanDecreaseAccuracy
## Account.Balance 33.925075
## Duration.of.Credit..month. 21.341865
## Payment.Status.of.Previous.Credit 15.334914
## Purpose 6.111192
## Credit.Amount 15.464864
## Value.Savings.Stocks 8.748573
## Length.of.current.employment 9.490680
## Instalment.per.cent 3.404935
## Sex...Marital.Status 3.613527
## Guarantors 10.154618
## Duration.in.Current.address 3.482690
## Most.valuable.available.asset 7.491665
## Age..years. 7.539748
## Concurrent.Credits 6.332402
## Type.of.apartment 4.586761
## No.of.Credits.at.this.Bank 3.811475
## Occupation 3.928423
## No.of.dependents 1.676712
## Telephone 1.353508
## Foreign.Worker 2.165521
Question 3: What is your interpretation about this amount of RMSE?
Answer 3: This question may not be valid because the wine’s quality is a factor variable.
The RMSE stands for “root mean squared error”, also known as “root mean squared deviation” (RMSD), and it is used to evaluate models by summarizing the differences between the actual / observed and predicted values. Plus, the RMSE can be defined as a gradient function in Machine Learning.
The model, assuming it is, has an RMSE of 0.7448, calculated by the function rmse() from the package “Metrics”. Such a value indicates that the model is fair. The white wine quality ranges from 3 to 9, hence it is interpreted that the model may overrate or underrate the quality by one level.
The RMSE is not meaningful for factors! The target / dependent / response variable is categorical type, more specifically, ordinal. ① It is incorrect to compute the Pearson’s correlation between its actual and predicted values. ② “No formal distributional assumptions, random forests are non-parametric and can thus handle skewed and multi-modal data as well as categorical data that are ordinal or non-ordinal” (Richmond, 2016).
Step 1: Collecting the data
wine <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/whitewines.csv")
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 6.7 5.7 5.9 5.3 6.4 7 7.9 6.6 7 6.5 ...
## $ volatile.acidity : num 0.62 0.22 0.19 0.47 0.29 0.14 0.12 0.38 0.16 0.37 ...
## $ citric.acid : num 0.24 0.2 0.26 0.1 0.21 0.41 0.49 0.28 0.3 0.33 ...
## $ residual.sugar : num 1.1 16 7.4 1.3 9.65 0.9 5.2 2.8 2.6 3.9 ...
## $ chlorides : num 0.039 0.044 0.034 0.036 0.041 0.037 0.049 0.043 0.043 0.027 ...
## $ free.sulfur.dioxide : num 6 41 33 11 36 22 33 17 34 40 ...
## $ total.sulfur.dioxide: num 62 113 123 74 119 95 152 67 90 130 ...
## $ density : num 0.993 0.999 0.995 0.991 0.993 ...
## $ pH : num 3.41 3.22 3.49 3.48 2.99 3.25 3.18 3.21 2.88 3.28 ...
## $ sulphates : num 0.32 0.46 0.42 0.54 0.34 0.43 0.47 0.47 0.47 0.39 ...
## $ alcohol : num 10.4 8.9 10.1 11.2 10.9 ...
## $ quality : int 5 6 6 4 6 6 6 6 6 7 ...
table(wine$quality) #ordinal type with 7 levels; cannot check normality!
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Step 2: Exploring and preparing the data
#no randomization in sampling
wineTraining <- wine[1:3750,] #75%
wineTest <- tail(wine,-3750)
Step 3: Training a model on the data
#Recursive Partitioning And Regression Trees
(wineModelR <- rpart(quality~., data=wineTraining)) #method="class" because y is a factor
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 2945.53200 5.870933
## 2) alcohol< 10.85 2372 1418.86100 5.604975
## 4) volatile.acidity>=0.2275 1611 821.30730 5.432030
## 8) volatile.acidity>=0.3025 688 278.97670 5.255814 *
## 9) volatile.acidity< 0.3025 923 505.04230 5.563380 *
## 5) volatile.acidity< 0.2275 761 447.36400 5.971091 *
## 3) alcohol>=10.85 1378 1070.08200 6.328737
## 6) free.sulfur.dioxide< 10.5 84 95.55952 5.369048 *
## 7) free.sulfur.dioxide>=10.5 1294 892.13600 6.391036
## 14) alcohol< 11.76667 629 430.11130 6.173291
## 28) volatile.acidity>=0.465 11 10.72727 4.545455 *
## 29) volatile.acidity< 0.465 618 389.71680 6.202265 *
## 15) alcohol>=11.76667 665 403.99400 6.596992 *
#rpart.plot(wineModelR, digits=3)
rpart.plot(wineModelR, digits=4, type=3, extra=101)
Step 4: Evaluating model performance
winePredR <- predict(wineModelR, wineTest)
summary(winePredR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.545 5.563 5.971 5.893 6.202 6.597
summary(wineTest$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.901 6.000 9.000
rmse(wineTest$quality, winePredR) #Root Mean Squared Error
## [1] 0.7448093
Question 4: Try decision tree and random forest and evaluate the model.
Answer 4: Define the shares more than 1,400 as popular; Select 16 predictor variables with 39,644 observations; Set the random seed to 12345; Divide 75% into a training set and 25% into a test set; Run a decision tree model, a random forest model, and recursive partitioning.
The decision tree model has 59.11% accuracy with the three most important features as “num_hrefs”, “num_imgs”, and “kw_max_max”. The random forest model has 60.25% accuracy with the three most important features as “num_imgs”, “num_hrefs”, and “n_non_stop_words”. The regression trees model has 56.37% accuracy with the three most important features as “num_imgs”, “global_sentiment_polarity”, and “avg_positive_polarity”.
Step 1: Collecting the data
news <- read.csv("~/Documents/HU/ANLY 530-53-B/Module 5- Decision tree part II/Lab1/OnlineNewsPopularity_for_R.csv")
news$popular <- case_when(news$shares>=1400~1, T~0) #popularity: yes(1), no(0)
news$popular <- as.factor(news$popular)
vars <- c("popular", "n_tokens_title", "n_tokens_content", "n_unique_tokens", "n_non_stop_words", "num_hrefs", "num_imgs", "num_videos", "average_token_length", "num_keywords", "kw_max_max", "global_sentiment_polarity", "avg_positive_polarity", "title_subjectivity", "title_sentiment_polarity", "abs_title_subjectivity", "abs_title_sentiment_polarity")
news17 <- news[vars]
str(news17)
## 'data.frame': 39644 obs. of 17 variables:
## $ popular : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 2 1 ...
## $ n_tokens_title : num 12 9 9 9 13 10 8 12 11 10 ...
## $ n_tokens_content : num 219 255 211 531 1072 ...
## $ n_unique_tokens : num 0.664 0.605 0.575 0.504 0.416 ...
## $ n_non_stop_words : num 1 1 1 1 1 ...
## $ num_hrefs : num 4 3 3 9 19 2 21 20 2 4 ...
## $ num_imgs : num 1 1 1 1 20 0 20 20 0 1 ...
## $ num_videos : num 0 0 0 0 0 0 0 0 0 1 ...
## $ average_token_length : num 4.68 4.91 4.39 4.4 4.68 ...
## $ num_keywords : num 5 4 6 7 7 9 10 9 7 5 ...
## $ kw_max_max : num 0 0 0 0 0 0 0 0 0 0 ...
## $ global_sentiment_polarity : num 0.0926 0.1489 0.3233 0.1007 0.281 ...
## $ avg_positive_polarity : num 0.379 0.287 0.496 0.386 0.411 ...
## $ title_subjectivity : num 0.5 0 0 0 0.455 ...
## $ title_sentiment_polarity : num -0.188 0 0 0 0.136 ...
## $ abs_title_subjectivity : num 0 0.5 0.5 0.5 0.0455 ...
## $ abs_title_sentiment_polarity: num 0.188 0 0 0 0.136 ...
Step 2: Pre-processing
set.seed(12345)
training2 <- sample(nrow(news17), nrow(news17)*0.75) #75%, 29733 of 39644
newsTraining <- news17[training2,]
newsTest <- news17[-training2,]
prop.table(table(newsTraining$popular))
##
## 0 1
## 0.4677631 0.5322369
prop.table(table(newsTest$popular))
##
## 0 1
## 0.4623146 0.5376854
Step 3: Modeling and evaluation
#Decision Tree model
newsModelDT <- C5.0(newsTraining[-1], newsTraining$popular)
summary(newsModelDT)
##
## Call:
## C5.0.default(x = newsTraining[-1], y = newsTraining$popular)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Dec 26 00:54:32 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 29733 cases (17 attributes) from undefined.data
##
## Decision tree:
##
## num_hrefs > 16:
## :...kw_max_max <= 18200:
## : :...n_tokens_content <= 1206: 0 (36/8)
## : : n_tokens_content > 1206: 1 (9/2)
## : kw_max_max > 18200:
## : :...n_tokens_content <= 254: 1 (400/83)
## : n_tokens_content > 254:
## : :...n_tokens_content > 3212: 1 (40/2)
## : n_tokens_content <= 3212:
## : :...num_keywords <= 5:
## : :...kw_max_max <= 617900: 1 (74/24)
## : : kw_max_max > 617900:
## : : :...average_token_length <= 4.940762: 1 (491/226)
## : : average_token_length > 4.940762:
## : : :...num_hrefs <= 37:
## : : :...num_imgs <= 9: 0 (96/20)
## : : : num_imgs > 9: 1 (23/9)
## : : num_hrefs > 37:
## : : :...n_tokens_title <= 12: 1 (19/3)
## : : n_tokens_title > 12: 0 (7/2)
## : num_keywords > 5:
## : :...n_tokens_title <= 9:
## : :...n_tokens_title <= 7: 1 (453/108)
## : : n_tokens_title > 7:
## : : :...num_keywords > 6: 1 (1085/337)
## : : num_keywords <= 6:
## : : :...avg_positive_polarity <= 0.4356061: 1 (124/38)
## : : avg_positive_polarity > 0.4356061: 0 (51/18)
## : n_tokens_title > 9:
## : :...num_imgs <= 2: 1 (1020/468)
## : num_imgs > 2:
## : :...n_tokens_title > 12:
## : :...num_keywords <= 8:
## : : :...n_tokens_content <= 388: 1 (34/7)
## : : : n_tokens_content > 388: 0 (183/83)
## : : num_keywords > 8:
## : : :...num_keywords > 9: 1 (77/21)
## : : num_keywords <= 9:
## : : :...num_hrefs <= 18: 0 (8/1)
## : : num_hrefs > 18: 1 (44/13)
## : n_tokens_title <= 12:
## : :...average_token_length > 4.690059: 1 (605/176)
## : average_token_length <= 4.690059:
## : :...num_hrefs <= 31: 1 (454/163)
## : num_hrefs > 31:
## : :...num_keywords <= 7:
## : :...avg_positive_polarity <= 0.2858333: 1 (4)
## : : avg_positive_polarity > 0.2858333: 0 (36/6)
## : num_keywords > 7:
## : :...num_imgs <= 34: 1 (88/34)
## : num_imgs > 34: 0 (12/2)
## num_hrefs <= 16:
## :...num_videos > 22:
## :...average_token_length <= 4.595656: 0 (150/41)
## : average_token_length > 4.595656: 1 (13/3)
## num_videos <= 22:
## :...num_imgs > 3:
## :...num_keywords <= 6:
## : :...num_imgs > 15:
## : : :...n_tokens_content <= 254: 1 (26/5)
## : : : n_tokens_content > 254:
## : : : :...abs_title_sentiment_polarity <= 0.6375: 0 (303/109)
## : : : abs_title_sentiment_polarity > 0.6375: 1 (17/4)
## : : num_imgs <= 15:
## : : :...global_sentiment_polarity > 0.07433511:
## : : :...num_videos <= 2: 1 (638/235)
## : : : num_videos > 2:
## : : : :...title_sentiment_polarity <= -0.07777778: 1 (5)
## : : : title_sentiment_polarity > -0.07777778:
## : : : :...avg_positive_polarity <= 0.3379489: 1 (4)
## : : : avg_positive_polarity > 0.3379489: 0 (21/5)
## : : global_sentiment_polarity <= 0.07433511:
## : : :...title_subjectivity > 0.8333333: 1 (34/7)
## : : title_subjectivity <= 0.8333333:
## : : :...n_tokens_content <= 302:
## : : :...num_imgs > 11: 0 (27/10)
## : : : num_imgs <= 11:
## : : : :...n_tokens_title <= 13: 1 (71/21)
## : : : n_tokens_title > 13: 0 (3)
## : : n_tokens_content > 302:
## : : :...num_imgs <= 12: 0 (168/48)
## : : num_imgs > 12:
## : : :...num_imgs <= 14: 1 (19/6)
## : : num_imgs > 14: 0 (14/4)
## : num_keywords > 6:
## : :...title_sentiment_polarity > 0.4: 1 (449/130)
## : title_sentiment_polarity <= 0.4:
## : :...avg_positive_polarity > 0.4718768:
## : :...abs_title_subjectivity > 0.07272727: 1 (253/64)
## : : abs_title_subjectivity <= 0.07272727:
## : : :...num_videos <= 1: 0 (23/9)
## : : num_videos > 1: 1 (3)
## : avg_positive_polarity <= 0.4718768:
## : :...n_tokens_title <= 10: 1 (1317/508)
## : n_tokens_title > 10:
## : :...num_hrefs > 6: 1 (650/268)
## : num_hrefs <= 6:
## : :...num_videos <= 0: 0 (331/150)
## : num_videos > 0:
## : :...avg_positive_polarity <= 0.1920238: 1 (12)
## : avg_positive_polarity > 0.1920238:
## : :...num_videos > 2: 0 (14/4)
## : num_videos <= 2:
## : :...num_hrefs > 3: 1 (119/37)
## : num_hrefs <= 3:
## : :...n_tokens_title <= 13: 0 (59/23)
## : n_tokens_title > 13: 1 (14/3)
## num_imgs <= 3:
## :...kw_max_max <= 617900:
## :...kw_max_max > 118700: 1 (1250/442)
## : kw_max_max <= 118700:
## : :...n_tokens_content > 886: 1 (98/26)
## : n_tokens_content <= 886:
## : :...num_hrefs <= 11: 1 (1428/664)
## : num_hrefs > 11: 0 (122/47)
## kw_max_max > 617900:
## :...n_unique_tokens <= 0.4065041:
## :...num_imgs > 0:
## : :...average_token_length <= 4.911008: 1 (505/171)
## : : average_token_length > 4.911008:
## : : :...n_unique_tokens <= 0.3680441: 1 (11/3)
## : : n_unique_tokens > 0.3680441: 0 (29/5)
## : num_imgs <= 0:
## : :...num_videos > 1: 0 (33/14)
## : num_videos <= 1:
## : :...num_videos > 0: 1 (521/226)
## : num_videos <= 0:
## : :...n_tokens_title > 13: 1 (5)
## : n_tokens_title <= 13:
## : :...num_keywords <= 7: 0 (26/6)
## : num_keywords > 7: 1 (24/8)
## n_unique_tokens > 0.4065041:
## :...global_sentiment_polarity <= 0.07066981:
## :...num_hrefs <= 1: 0 (214/47)
## : num_hrefs > 1:
## : :...average_token_length > 4.729642:
## : :...n_tokens_content <= 313:
## : : :...num_imgs <= 1:
## : : : :...kw_max_max > 690400: 0 (534/209)
## : : : : kw_max_max <= 690400:
## : : : : :...n_tokens_title <= 11: 0 (97/41)
## : : : : n_tokens_title > 11: 1 (18/4)
## : : : num_imgs > 1:
## : : : :...num_keywords > 8: 1 (17/1)
## : : : num_keywords <= 8:
## : : : :...num_videos <= 0: 0 (55/20)
## : : : num_videos > 0: 1 (9/3)
## : : n_tokens_content > 313:
## : : :...num_keywords <= 7: 0 (890/243)
## : : num_keywords > 7:
## : : :...num_imgs <= 2: 0 (512/171)
## : : num_imgs > 2:
## : : :...n_tokens_title > 13: 1 (4)
## : : n_tokens_title <= 13:
## : : :...num_hrefs > 7: 0 (30/8)
## : : num_hrefs <= 7:
## : : :...num_videos > 0: 1 (5)
## : : num_videos <= 0: [S1]
## : average_token_length <= 4.729642:
## : :...num_keywords <= 5: 0 (607/234)
## : num_keywords > 5:
## : :...num_imgs <= 0:
## : :...n_tokens_title <= 9: 0 (110/50)
## : : n_tokens_title > 9: 1 (337/137)
## : num_imgs > 0:
## : :...num_imgs > 2:
## : :...n_tokens_content <= 538: 1 (31/5)
## : : n_tokens_content > 538:
## : : :...num_keywords <= 7: 0 (20/4)
## : : num_keywords > 7: [S2]
## : num_imgs <= 2:
## : :...num_videos > 2: 0 (40/14)
## : num_videos <= 2:
## : :...num_videos <= 0: 0 (740/327)
## : num_videos > 0:
## : :...num_imgs <= 1:
## : :...num_videos > 1: [S3]
## : : num_videos <= 1: [S4]
## : num_imgs > 1: [S5]
## global_sentiment_polarity > 0.07066981:
## :...title_sentiment_polarity <= -0.006397306:
## :...n_tokens_content > 119:
## : :...n_tokens_title <= 7: 1 (52/14)
## : : n_tokens_title > 7: 0 (1023/434)
## : n_tokens_content <= 119:
## : :...n_tokens_title > 12: 0 (53/1)
## : n_tokens_title <= 12:
## : :...num_imgs > 0: 1 (6)
## : num_imgs <= 0:
## : :...title_sentiment_polarity <= -0.625: 1 (5)
## : title_sentiment_polarity > -0.625:
## : :...n_unique_tokens <= 0.7198582: 1 (3)
## : n_unique_tokens > 0.7198582: [S6]
## title_sentiment_polarity > -0.006397306:
## :...num_imgs > 1:
## :...n_tokens_content > 242: 1 (1355/615)
## : n_tokens_content <= 242:
## : :...global_sentiment_polarity <= 0.3298789: 1 (199/47)
## : global_sentiment_polarity > 0.3298789: 0 (14/3)
## num_imgs <= 1:
## :...num_videos > 1:
## :...num_hrefs > 11: 1 (167/54)
## : num_hrefs <= 11:
## : :...num_keywords > 8:
## : :...num_imgs <= 0: [S7]
## : : num_imgs > 0:
## : : :...num_keywords <= 9: 0 (41/19)
## : : num_keywords > 9:
## : : :...num_hrefs <= 6: 0 (23/10)
## : : num_hrefs > 6: 1 (31/8)
## : num_keywords <= 8:
## : :...n_tokens_title <= 7: 0 (51/19)
## : n_tokens_title > 7:
## : :...num_imgs <= 0: 1 (320/108)
## : num_imgs > 0:
## : :...num_videos > 13: 1 (19/4)
## : num_videos <= 13:
## : :...kw_max_max <= 690400: [S8]
## : kw_max_max > 690400: [S9]
## num_videos <= 1:
## :...kw_max_max <= 690400:
## :...num_imgs <= 0:
## : :...n_tokens_title <= 11: 1 (267/106)
## : : n_tokens_title > 11:
## : : :...num_keywords <= 8: 1 (67/14)
## : : num_keywords > 8: 0 (13/4)
## : num_imgs > 0:
## : :...n_tokens_content <= 650: [S10]
## : n_tokens_content > 650: [S11]
## kw_max_max > 690400:
## :...num_keywords <= 6: 0 (2253/1013)
## num_keywords > 6:
## :...n_tokens_content > 993: 1 (105/34)
## n_tokens_content <= 993:
## :...num_videos > 0: 1 (949/462)
## num_videos <= 0:
## :...num_imgs <= 0: [S12]
## num_imgs > 0:
## :...num_keywords > 9: [S13]
## num_keywords <= 9: [S14]
##
## SubTree [S1]
##
## global_sentiment_polarity <= 0.01406926: 1 (6)
## global_sentiment_polarity > 0.01406926: 0 (13/4)
##
## SubTree [S2]
##
## global_sentiment_polarity <= 0.03557535: 0 (7)
## global_sentiment_polarity > 0.03557535: 1 (18/3)
##
## SubTree [S3]
##
## avg_positive_polarity <= 0.3235457: 1 (20/5)
## avg_positive_polarity > 0.3235457: 0 (13/2)
##
## SubTree [S4]
##
## kw_max_max > 690400: 1 (189/87)
## kw_max_max <= 690400:
## :...global_sentiment_polarity <= 0.06004399: 0 (7/1)
## global_sentiment_polarity > 0.06004399: 1 (3)
##
## SubTree [S5]
##
## n_unique_tokens <= 0.523077: 0 (29/4)
## n_unique_tokens > 0.523077:
## :...n_tokens_title > 11: 1 (9)
## n_tokens_title <= 11:
## :...n_tokens_content <= 231: 0 (6)
## n_tokens_content > 231: 1 (14/4)
##
## SubTree [S6]
##
## average_token_length <= 4.908397: 0 (48/6)
## average_token_length > 4.908397: 1 (7/2)
##
## SubTree [S7]
##
## avg_positive_polarity <= 0.3969345: 0 (77/23)
## avg_positive_polarity > 0.3969345:
## :...title_subjectivity <= 0.7348485: 1 (70/25)
## title_subjectivity > 0.7348485: 0 (16/4)
##
## SubTree [S8]
##
## n_tokens_title <= 9: 1 (9/1)
## n_tokens_title > 9: 0 (11/3)
##
## SubTree [S9]
##
## num_videos <= 2: 1 (128/55)
## num_videos > 2: 0 (120/52)
##
## SubTree [S10]
##
## title_subjectivity > 0.975: 1 (26/6)
## title_subjectivity <= 0.975:
## :...average_token_length <= 4.32405: 1 (57/15)
## average_token_length > 4.32405: 0 (745/330)
##
## SubTree [S11]
##
## title_sentiment_polarity <= 0.1305556: 1 (119/31)
## title_sentiment_polarity > 0.1305556:
## :...avg_positive_polarity <= 0.3142575: 0 (9)
## avg_positive_polarity > 0.3142575:
## :...num_hrefs <= 7: 1 (9)
## num_hrefs > 7: 0 (29/13)
##
## SubTree [S12]
##
## n_tokens_content <= 141: 0 (86/25)
## n_tokens_content > 141: 1 (238/94)
##
## SubTree [S13]
##
## n_tokens_title <= 8: 1 (97/39)
## n_tokens_title > 8: 0 (403/192)
##
## SubTree [S14]
##
## abs_title_subjectivity <= 0.475: 0 (638/312)
## abs_title_subjectivity > 0.475:
## :...num_hrefs <= 3: 0 (121/37)
## num_hrefs > 3:
## :...title_subjectivity > 0.5: 1 (48/22)
## title_subjectivity <= 0.5:
## :...n_tokens_title > 8: 0 (547/255)
## n_tokens_title <= 8:
## :...num_hrefs > 7:
## :...global_sentiment_polarity <= 0.1513961: 0 (68/23)
## : global_sentiment_polarity > 0.1513961: 1 (34/12)
## num_hrefs <= 7:
## :...n_tokens_title > 7: 0 (29/4)
## n_tokens_title <= 7:
## :...average_token_length > 4.681704: 0 (23/3)
## average_token_length <= 4.681704:
## :...global_sentiment_polarity > 0.2002036: 0 (4)
## global_sentiment_polarity <= 0.2002036:
## :...global_sentiment_polarity <= 0.1087384: 0 (3)
## global_sentiment_polarity > 0.1087384: 1 (12/1)
##
##
## Evaluation on training data (29733 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 151 11342(38.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 7345 6563 (a): class 0
## 4779 11046 (b): class 1
##
##
## Attribute usage:
##
## 100.00% num_hrefs
## 90.07% num_imgs
## 84.00% kw_max_max
## 81.59% num_videos
## 67.36% num_keywords
## 57.91% n_tokens_content
## 55.85% n_unique_tokens
## 55.34% global_sentiment_polarity
## 47.49% title_sentiment_polarity
## 37.06% n_tokens_title
## 26.34% average_token_length
## 11.02% avg_positive_polarity
## 6.79% title_subjectivity
## 6.07% abs_title_subjectivity
## 1.08% abs_title_sentiment_polarity
##
##
## Time: 0.6 secs
newsPredDT <- predict(newsModelDT, newsTest)
CrossTable(newsTest$popular, newsPredDT, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (DT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 9911
##
##
## | Predicted (DT)
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 2270 | 2312 | 4582 |
## | 0.229 | 0.233 | |
## -------------|-----------|-----------|-----------|
## 1 | 1741 | 3588 | 5329 |
## | 0.176 | 0.362 | |
## -------------|-----------|-----------|-----------|
## Column Total | 4011 | 5900 | 9911 |
## -------------|-----------|-----------|-----------|
##
##
#Random Forest model
newsModelRF <- randomForest(popular~., data=newsTraining, importance=T) #about 2min runtime
newsPredRF <- predict(newsModelRF, newsTest)
CrossTable(newsTest$popular, newsPredRF, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RF)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 9911
##
##
## | Predicted (RF)
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 2257 | 2325 | 4582 |
## | 0.228 | 0.235 | |
## -------------|-----------|-----------|-----------|
## 1 | 1616 | 3713 | 5329 |
## | 0.163 | 0.375 | |
## -------------|-----------|-----------|-----------|
## Column Total | 3873 | 6038 | 9911 |
## -------------|-----------|-----------|-----------|
##
##
importance(newsModelRF, type=1)
## MeanDecreaseAccuracy
## n_tokens_title 11.58966
## n_tokens_content 37.20085
## n_unique_tokens 37.27078
## n_non_stop_words 37.61929
## num_hrefs 38.25399
## num_imgs 39.04636
## num_videos 14.99645
## average_token_length 20.51853
## num_keywords 18.32682
## kw_max_max 21.97751
## global_sentiment_polarity 34.83415
## avg_positive_polarity 13.99606
## title_subjectivity 14.00154
## title_sentiment_polarity 20.51038
## abs_title_subjectivity 12.42208
## abs_title_sentiment_polarity 16.81981
#Regression Trees model
newsModelR <- rpart(popular~., data=newsTraining)
summary(newsModelR)
## Call:
## rpart(formula = popular ~ ., data = newsTraining)
## n= 29733
##
## CP nsplit rel error xerror xstd
## 1 0.03311044 0 1.0000000 1.0000000 0.006186146
## 2 0.01000000 2 0.9337791 0.9390279 0.006153106
##
## Variable importance
## num_imgs global_sentiment_polarity avg_positive_polarity
## 49 25 6
## num_hrefs n_non_stop_words n_tokens_content
## 6 5 5
## average_token_length n_unique_tokens
## 2 2
##
## Node number 1: 29733 observations, complexity param=0.03311044
## predicted class=1 expected loss=0.4677631 P(node) =1
## class counts: 13908 15825
## probabilities: 0.468 0.532
## left son=2 (21955 obs) right son=3 (7778 obs)
## Primary splits:
## num_imgs < 3.5 to the left, improve=165.43800, (0 missing)
## num_hrefs < 16.5 to the left, improve=138.98500, (0 missing)
## global_sentiment_polarity < 0.09387385 to the left, improve=106.75980, (0 missing)
## num_keywords < 6.5 to the left, improve= 69.87192, (0 missing)
## n_unique_tokens < 0.4366337 to the right, improve= 67.85730, (0 missing)
## Surrogate splits:
## num_hrefs < 17.5 to the left, agree=0.769, adj=0.118, (0 split)
## n_non_stop_words < 1 to the left, agree=0.756, adj=0.066, (0 split)
## n_tokens_content < 1316.5 to the left, agree=0.752, adj=0.051, (0 split)
## global_sentiment_polarity < -0.1855226 to the right, agree=0.740, adj=0.004, (0 split)
## average_token_length < 5.654628 to the left, agree=0.739, adj=0.004, (0 split)
##
## Node number 2: 21955 observations, complexity param=0.03311044
## predicted class=1 expected loss=0.4991574 P(node) =0.7384051
## class counts: 10959 10996
## probabilities: 0.499 0.501
## left son=4 (8173 obs) right son=5 (13782 obs)
## Primary splits:
## global_sentiment_polarity < 0.08566651 to the left, improve=85.15745, (0 missing)
## n_unique_tokens < 0.4305251 to the right, improve=57.18810, (0 missing)
## kw_max_max < 677000 to the right, improve=56.78869, (0 missing)
## average_token_length < 4.782888 to the right, improve=49.01715, (0 missing)
## num_hrefs < 21.5 to the left, improve=48.00899, (0 missing)
## Surrogate splits:
## avg_positive_polarity < 0.2844315 to the left, agree=0.714, adj=0.231, (0 split)
## n_unique_tokens < 0.1845133 to the left, agree=0.659, adj=0.084, (0 split)
## n_tokens_content < 9 to the left, agree=0.659, adj=0.084, (0 split)
## n_non_stop_words < 0.5 to the left, agree=0.659, adj=0.084, (0 split)
## average_token_length < 1.8 to the left, agree=0.659, adj=0.084, (0 split)
##
## Node number 3: 7778 observations
## predicted class=1 expected loss=0.3791463 P(node) =0.2615949
## class counts: 2949 4829
## probabilities: 0.379 0.621
##
## Node number 4: 8173 observations
## predicted class=0 expected loss=0.4436559 P(node) =0.2748798
## class counts: 4547 3626
## probabilities: 0.556 0.444
##
## Node number 5: 13782 observations
## predicted class=1 expected loss=0.4652445 P(node) =0.4635254
## class counts: 6412 7370
## probabilities: 0.465 0.535
rpart.plot(newsModelR, digits=3, type=1)
newsPredR <- predict(newsModelR, newsTest, type="class") #don't forget type="class", otherwise it will return as probabilities (type="prob") by default
summary(newsPredR)
## 0 1
## 2666 7245
summary(newsTest$popular)
## 0 1
## 4582 5329
#rmse(newsTest$popular, newsPredR)
## Warning message:
## In Ops.factor(actual, predicted) : ‘-’ not meaningful for factors
CrossTable(newsTest$popular, newsPredR, prop.chisq=F, prop.c=F, prop.r=F, dnn=c("Actual", "Predicted (RT)"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 9911
##
##
## | Predicted (RT)
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1462 | 3120 | 4582 |
## | 0.148 | 0.315 | |
## -------------|-----------|-----------|-----------|
## 1 | 1204 | 4125 | 5329 |
## | 0.121 | 0.416 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2666 | 7245 | 9911 |
## -------------|-----------|-----------|-----------|
##
##
newsModelR$variable.importance
## num_imgs global_sentiment_polarity avg_positive_polarity
## 165.437961 85.901899 19.703014
## num_hrefs n_non_stop_words n_tokens_content
## 19.589658 18.102157 15.634839
## average_token_length n_unique_tokens
## 7.764944 7.137263