Assignment 13 Jared Lamothe

# Load dataset
data <- read.csv("prepared_weather_forecast_data.csv")

str(data)

## 'data.frame':    2500 obs. of  6 variables:
##  $ Temperature: num  23.7 27.9 25.1 23.6 20.6 ...
##  $ Humidity   : num  89.6 46.5 83.1 74.4 96.9 ...
##  $ Wind_Speed : num  7.34 5.95 1.37 7.05 4.64 ...
##  $ Cloud_Cover: num  50.5 4.99 14.86 67.26 47.68 ...
##  $ Pressure   : num  1032 993 1007 983 981 ...
##  $ Rain       : int  1 0 0 1 0 0 0 0 0 1 ...

library(rpart)

#QUESTION 1

#Load dataset
data <- read.csv("prepared_weather_forecast_data.csv")

#build decision tree
tree_model <- rpart(Rain ~ ., data = data, method = "class")


plot(tree_model, uniform = TRUE, main = "Decision Tree")
text(tree_model, use.n = TRUE, all = TRUE, cex = 0.8)

printcp(tree_model)

## 
## Classification tree:
## rpart(formula = Rain ~ ., data = data, method = "class")
## 
## Variables actually used in tree construction:
## [1] Cloud_Cover Humidity    Temperature
## 
## Root node error: 314/2500 = 0.1256
## 
## n= 2500 
## 
##        CP nsplit rel error    xerror      xstd
## 1 0.33333      0         1 1.0000000 0.0527704
## 2 0.01000      3         0 0.0063694 0.0045021

summary(tree_model)

## Call:
## rpart(formula = Rain ~ ., data = data, method = "class")
##   n= 2500 
## 
##          CP nsplit rel error      xerror        xstd
## 1 0.3333333      0         1 1.000000000 0.052770384
## 2 0.0100000      3         0 0.006369427 0.004502063
## 
## Variable importance
## Temperature Cloud_Cover    Humidity    Pressure  Wind_Speed 
##          41          33          21           3           1 
## 
## Node number 1: 2500 observations,    complexity param=0.3333333
##   predicted class=0  expected loss=0.1256  P(node) =1
##     class counts:  2186   314
##    probabilities: 0.874 0.126 
##   left son=2 (1476 obs) right son=3 (1024 obs)
##   Primary splits:
##       Humidity    < 70.08948  to the left,  improve=113.6935000, (0 missing)
##       Cloud_Cover < 50.04157  to the left,  improve= 80.6636500, (0 missing)
##       Temperature < 24.85582  to the right, improve= 53.7338300, (0 missing)
##       Wind_Speed  < 19.52472  to the left,  improve=  1.1530790, (0 missing)
##       Pressure    < 1046.992  to the left,  improve=  0.6879844, (0 missing)
##   Surrogate splits:
##       Temperature < 10.3176   to the right, agree=0.593, adj=0.007, (0 split)
##       Wind_Speed  < 0.14542   to the right, agree=0.592, adj=0.004, (0 split)
##       Cloud_Cover < 99.06668  to the left,  agree=0.591, adj=0.002, (0 split)
##       Pressure    < 1049.888  to the left,  agree=0.591, adj=0.002, (0 split)
## 
## Node number 2: 1476 observations
##   predicted class=0  expected loss=0  P(node) =0.5904
##     class counts:  1476     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 1024 observations,    complexity param=0.3333333
##   predicted class=0  expected loss=0.3066406  P(node) =0.4096
##     class counts:   710   314
##    probabilities: 0.693 0.307 
##   left son=6 (515 obs) right son=7 (509 obs)
##   Primary splits:
##       Cloud_Cover < 50.003    to the left,  improve=194.840300, (0 missing)
##       Temperature < 24.91331  to the right, improve=121.930500, (0 missing)
##       Pressure    < 1046.538  to the left,  improve=  2.643016, (0 missing)
##       Humidity    < 78.00198  to the right, improve=  2.342653, (0 missing)
##       Wind_Speed  < 3.212949  to the right, improve=  1.105887, (0 missing)
##   Surrogate splits:
##       Humidity    < 91.88824  to the left,  agree=0.531, adj=0.057, (0 split)
##       Pressure    < 1010.525  to the left,  agree=0.530, adj=0.055, (0 split)
##       Temperature < 32.85183  to the left,  agree=0.521, adj=0.035, (0 split)
##       Wind_Speed  < 2.857783  to the right, agree=0.514, adj=0.022, (0 split)
## 
## Node number 6: 515 observations
##   predicted class=0  expected loss=0  P(node) =0.206
##     class counts:   515     0
##    probabilities: 1.000 0.000 
## 
## Node number 7: 509 observations,    complexity param=0.3333333
##   predicted class=1  expected loss=0.3831041  P(node) =0.2036
##     class counts:   195   314
##    probabilities: 0.383 0.617 
##   left son=14 (195 obs) right son=15 (314 obs)
##   Primary splits:
##       Temperature < 24.91366  to the right, improve=240.589400, (0 missing)
##       Humidity    < 79.09418  to the right, improve=  4.427431, (0 missing)
##       Pressure    < 982.5404  to the left,  improve=  1.958716, (0 missing)
##       Wind_Speed  < 19.65787  to the left,  improve=  1.919804, (0 missing)
##       Cloud_Cover < 53.5763   to the left,  improve=  1.621691, (0 missing)
##   Surrogate splits:
##       Pressure    < 982.5404  to the left,  agree=0.625, adj=0.021, (0 split)
##       Wind_Speed  < 0.1599858 to the left,  agree=0.623, adj=0.015, (0 split)
##       Cloud_Cover < 99.44046  to the right, agree=0.623, adj=0.015, (0 split)
##       Humidity    < 98.87372  to the right, agree=0.621, adj=0.010, (0 split)
## 
## Node number 14: 195 observations
##   predicted class=0  expected loss=0  P(node) =0.078
##     class counts:   195     0
##    probabilities: 1.000 0.000 
## 
## Node number 15: 314 observations
##   predicted class=1  expected loss=0  P(node) =0.1256
##     class counts:     0   314
##    probabilities: 0.000 1.000

Questions: The depth of the tree is 3. The variable used for the first split was Humidity. Path 1: the condition is Humidity < 70.09. The model predicts no rain with 100% certainty on this path, because all 1476 observations in that node belong to the no-rain class. Path 2: if the humidity is 70.09 or higher but the cloud cover is less than 50.003, the model still predicts no rain with 100% certainty (all 515 observations in that node are no rain).
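
The paths in this tree can be written out as plain rules. This is only a minimal sketch, not the fitted rpart object: `predict_rain` is a hypothetical helper, and its thresholds are copied from the summary(tree_model) output for nodes 1, 3, and 7. The five test rows are the rounded values shown by str(data).

```r
# Hypothetical helper that re-implements the three splits printed by
# summary(tree_model); thresholds are copied from that output.
predict_rain <- function(humidity, cloud_cover, temperature) {
  if (humidity < 70.08948) return(0L)      # node 2: all 1476 obs are no rain
  if (cloud_cover < 50.003) return(0L)     # node 6: all 515 obs are no rain
  if (temperature >= 24.91366) return(0L)  # node 14: all 195 obs are no rain
  1L                                       # node 15: all 314 obs are rain
}

# First five rows as displayed (rounded) by str(data)
first_rows <- data.frame(
  Humidity    = c(89.6, 46.5, 83.1, 74.4, 96.9),
  Cloud_Cover = c(50.5, 4.99, 14.86, 67.26, 47.68),
  Temperature = c(23.7, 27.9, 25.1, 23.6, 20.6)
)

rule_predictions <- mapply(predict_rain, first_rows$Humidity,
                           first_rows$Cloud_Cover, first_rows$Temperature)
print(rule_predictions)
```

The rules return 1 0 0 1 0 for these rows, which matches the first five Rain values in the str(data) output.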

#QUESTION 2

#libraries
library(tree)

#Load dataset
data <- read.csv("prepared_weather_forecast_data.csv")

#Convert the target variable
data$Rain <- as.factor(data$Rain)

#split data
set.seed(123) 
train_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

#Train a classification tree using the tree package
tree_model <- tree(Rain ~ ., data = train_data)

#Visualize the decision tree
plot(tree_model)
text(tree_model, pretty = 0)

#Predict outcomes for the test set
predictions <- predict(tree_model, test_data, type = "class")

#Evaluate model performance
#confusion matrix
conf_matrix <- table(predictions, test_data$Rain)
print(conf_matrix)

##            
## predictions   0   1
##           0 648   1
##           1   0 101

#Accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", accuracy, "\n")

## Accuracy: 0.9986667

#Summary of the tree (lists the variables actually used)
cat("Summary of the tree:\n")

## Summary of the tree:

summary(tree_model)

## 
## Classification tree:
## tree(formula = Rain ~ ., data = train_data)
## Variables actually used in tree construction:
## [1] "Humidity"    "Cloud_Cover" "Temperature"
## Number of terminal nodes:  4 
## Residual mean deviance:  0 = 0 / 1746 
## Misclassification error rate: 0 = 0 / 1750

#Compare training and testing accuracy
train_predictions <- predict(tree_model, train_data, type = "class")
train_conf_matrix <- table(train_predictions, train_data$Rain)
train_accuracy <- sum(diag(train_conf_matrix)) / sum(train_conf_matrix)
cat("Training Accuracy:", train_accuracy, "\n")

## Training Accuracy: 1

cat("Testing Accuracy:", accuracy, "\n")

## Testing Accuracy: 0.9986667

The model achieved a test accuracy of 99.87% (749 of 750 observations), which is extremely high. The only error was one rainy observation that was predicted as no rain. The three variables used were Humidity, Cloud_Cover, and Temperature. The summary lists them in that order, but it does not rank them by importance. Because the training accuracy was 100% and the residual mean deviance is 0, it might suggest that the model is fit too tightly to the training data. However, the test accuracy is almost as high, which indicates that overfitting is probably not much of an issue.
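
Since only about 14% of the test days are rainy, accuracy alone can look better than it is. The sketch below rebuilds the confusion matrix from the printed output (it does not refit the model) and compares the accuracy with a majority-class baseline; `baseline` and `rain_recall` are names I made up for this example.

```r
# Confusion matrix copied from the printed output
# (rows = predicted class, columns = actual class)
conf_matrix <- matrix(c(648, 0, 1, 101), nrow = 2,
                      dimnames = list(predictions = c("0", "1"),
                                      actual = c("0", "1")))

# Same accuracy formula as in the assignment code
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

# Accuracy of always predicting the majority class (no rain)
baseline <- max(colSums(conf_matrix)) / sum(conf_matrix)

# Share of actual rain cases that were predicted as rain
rain_recall <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])

cat("Accuracy:", accuracy, "\n")
cat("Majority-class baseline:", baseline, "\n")
cat("Recall for rain:", rain_recall, "\n")
```

Always predicting no rain would already score 86.4%, so the recall for the rain class (101 of 102) is the more telling number here.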

#QUESTION 3

#libraries
library(rpart)

#Load dataset
data <- read.csv("prepared_weather_forecast_data.csv")


data$Rain <- as.factor(data$Rain)

set.seed(123)
train_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
validation_data <- data[-train_indices, ]

rpart_model <- rpart(Rain ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))

plot(rpart_model, uniform = TRUE, main = "Unpruned Tree")
text(rpart_model, use.n = TRUE, all = TRUE, cex = 0.8)

unpruned_predictions <- predict(rpart_model, validation_data, type = "class")
unpruned_conf_matrix <- table(unpruned_predictions, validation_data$Rain)
unpruned_accuracy <- sum(diag(unpruned_conf_matrix)) / sum(unpruned_conf_matrix)
cat("Unpruned Tree Accuracy:", unpruned_accuracy, "\n")

## Unpruned Tree Accuracy: 0.9986667

printcp(rpart_model)  # Displays the cross-validation results

## 
## Classification tree:
## rpart(formula = Rain ~ ., data = train_data, method = "class", 
##     control = rpart.control(cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] Cloud_Cover Humidity    Temperature
## 
## Root node error: 212/1750 = 0.12114
## 
## n= 1750 
## 
##        CP nsplit rel error   xerror     xstd
## 1 0.33333      0         1 1.000000 0.064386
## 2 0.01000      3         0 0.014151 0.008163

optimal_cp <- rpart_model$cptable[which.min(rpart_model$cptable[, "xerror"]), "CP"]
cat("Optimal CP:", optimal_cp, "\n")

## Optimal CP: 0.01

pruned_tree <- prune(rpart_model, cp = optimal_cp)

plot(pruned_tree, uniform = TRUE, main = "Pruned Tree")
text(pruned_tree, use.n = TRUE, all = TRUE, cex = 0.8)

pruned_predictions <- predict(pruned_tree, validation_data, type = "class")
pruned_conf_matrix <- table(pruned_predictions, validation_data$Rain)
pruned_accuracy <- sum(diag(pruned_conf_matrix)) / sum(pruned_conf_matrix)
cat("Pruned Tree Accuracy:", pruned_accuracy, "\n")

## Pruned Tree Accuracy: 0.9986667

The unpruned tree has 3 splits and a relative error of 0 on the training data. The pruned tree is an exact match of the unpruned tree. I believe this is because there were already so few splits to begin with that there was nothing left to cut. The best complexity parameter was 0.01 because it had the lowest cross-validated error (xerror = 0.014) while keeping the tree as simple as possible. Since the tree was also grown with cp = 0.01, pruning at that value leaves it unchanged. There isn't any major difference between the pruned and unpruned trees: both reach 99.87% accuracy on the validation data.
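
The CP selection can be checked by hand from the printed table. This sketch copies the printcp(rpart_model) values into a matrix instead of refitting the tree. The 1-SE rule shown second is an alternative I am adding for comparison (an assumption, it was not part of the assignment code).

```r
# cptable values copied from the printcp(rpart_model) output
cptable <- matrix(
  c(0.33333, 0, 1, 1.000000, 0.064386,
    0.01000, 3, 0, 0.014151, 0.008163),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"),
                  c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# Rule used in the assignment: the row with the lowest cross-validated error
best_row <- which.min(cptable[, "xerror"])
min_xerror_cp <- cptable[best_row, "CP"]

# Alternative (not used in the assignment): the 1-SE rule picks the simplest
# tree whose xerror is within one xstd of the minimum xerror
threshold <- cptable[best_row, "xerror"] + cptable[best_row, "xstd"]
one_se_row <- min(which(cptable[, "xerror"] <= threshold))
one_se_cp <- cptable[one_se_row, "CP"]

cat("Min-xerror CP:", min_xerror_cp, "\n")
cat("1-SE CP:", one_se_cp, "\n")
```

Both rules pick the row with 3 splits, because the root-only tree has an xerror of 1, far above the threshold.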

#QUESTION 4

#Load libraries
library(party)

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

data <- read.csv("WineQT.csv")

#Remove unnecessary columns 
data$Id <- NULL

#Convert 'quality' to a factor
data$quality <- as.factor(data$quality)

#split data
set.seed(123)
train_indices <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

#Train a Conditional Inference Tree
model <- ctree(quality ~ ., data = train_data)

#Visualize
plot(model)

#Print the tree structure
print(model)

## 
##   Conditional inference tree with 14 terminal nodes
## 
## Response:  quality 
## Inputs:  fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol 
## Number of observations:  914 
## 
## 1) alcohol <= 10.3; criterion = 1, statistic = 227.187
##   2) volatile.acidity <= 0.315; criterion = 1, statistic = 68.591
##     3)*  weights = 34 
##   2) volatile.acidity > 0.315
##     4) volatile.acidity <= 0.33; criterion = 1, statistic = 37.929
##       5)*  weights = 9 
##     4) volatile.acidity > 0.33
##       6) volatile.acidity <= 0.87; criterion = 1, statistic = 35.887
##         7) volatile.acidity <= 0.55; criterion = 0.999, statistic = 23.544
##           8) sulphates <= 0.62; criterion = 0.998, statistic = 19.733
##             9)*  weights = 116 
##           8) sulphates > 0.62
##             10) sulphates <= 0.83; criterion = 1, statistic = 28.829
##               11)*  weights = 68 
##             10) sulphates > 0.83
##               12)*  weights = 12 
##         7) volatile.acidity > 0.55
##           13)*  weights = 243 
##       6) volatile.acidity > 0.87
##         14)*  weights = 25 
## 1) alcohol > 10.3
##   15) volatile.acidity <= 0.84; criterion = 1, statistic = 102.235
##     16) volatile.acidity <= 0.31; criterion = 1, statistic = 40.846
##       17)*  weights = 67 
##     16) volatile.acidity > 0.31
##       18) alcohol <= 12.9; criterion = 1, statistic = 35.085
##         19) sulphates <= 0.83; criterion = 1, statistic = 26.34
##           20) alcohol <= 11.2; criterion = 0.998, statistic = 21.734
##             21)*  weights = 149 
##           20) alcohol > 11.2
##             22) sulphates <= 0.58; criterion = 0.981, statistic = 17.211
##               23)*  weights = 44 
##             22) sulphates > 0.58
##               24)*  weights = 84 
##         19) sulphates > 0.83
##           25)*  weights = 34 
##       18) alcohol > 12.9
##         26)*  weights = 12 
##   15) volatile.acidity > 0.84
##     27)*  weights = 17

#Evaluate model performance on test data
predicted <- predict(model, test_data)
confusion_matrix <- table(Predicted = predicted, Actual = test_data$quality)
print(confusion_matrix)

##          Actual
## Predicted  3  4  5  6  7  8
##         3  0  0  0  0  0  0
##         4  0  0  3  4  0  1
##         5  2  2 59 32  1  0
##         6  0  2 26 46 22  2
##         7  0  0  4 10 12  1
##         8  0  0  0  0  0  0
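
The confusion matrix above is printed without a summary number. A minimal sketch of getting the overall accuracy and the per-class recall from it, with the counts typed in from the printed output rather than taken from the fitted ctree model (`recall_by_class` is a name I made up):

```r
# Counts copied from the printed confusion matrix
# (rows = predicted quality, columns = actual quality)
quality_levels <- as.character(3:8)
confusion_matrix <- matrix(
  c(0, 0,  0,  0,  0, 0,
    0, 0,  3,  4,  0, 1,
    2, 2, 59, 32,  1, 0,
    0, 2, 26, 46, 22, 2,
    0, 0,  4, 10, 12, 1,
    0, 0,  0,  0,  0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(Predicted = quality_levels, Actual = quality_levels)
)

# Overall accuracy: correct predictions (diagonal) over all test observations
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Recall per actual quality level: diagonal divided by column totals
recall_by_class <- diag(confusion_matrix) / colSums(confusion_matrix)

cat("Test observations:", sum(confusion_matrix), "\n")
cat("Accuracy:", round(accuracy, 4), "\n")
print(round(recall_by_class, 3))
```

On these counts the tree gets 117 of 229 test wines right (about 51%), and it never correctly identifies a wine of quality 3, 4, or 8.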

Alcohol plays the most important role at the start of the tree: the root split is alcohol <= 10.3. Volatile acidity then divides the data into subgroups. For wines with low alcohol content, volatile acidity is split several times, so it accounts for much of the difference in quality. Wines with low alcohol but medium levels of volatile acidity seem to cluster around very specific qualities. Either a lot or a little acidity tends to lower quality. Sulphates are especially important for wines with lots of alcohol. Higher levels of sulphates combined with mid to high alcohol content correlate well with higher wine quality.

This data can be used to examine the preferences of wine consumers. Consumers who like lower alcohol may also want mid-range volatile acidity. Also, sulphate, which is used to preserve the wine, should be used carefully. It may extend the taste and shelf life of a bottle, but it destroys the sense of a natural product if used excessively.

#QUESTION 5

#Load libraries
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(rpart)

#Load the dataset
data <- read.csv("WineQT.csv")

# Remove unnecessary columns
data$Id <- NULL
data$quality <- as.factor(data$quality)

#Split the data
set.seed(123)
train_indices <- createDataPartition(data$quality, p = 0.8, list = FALSE)
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

train_control <- trainControl(method = "cv", number = 5) 

grid <- expand.grid(cp = seq(0.01, 0.1, by = 0.01)) 

# Train the model
caret_model <- train(
  quality ~ ., 
  data = train_data, 
  method = "rpart", 
  trControl = train_control, 
  tuneGrid = grid, 
  preProcess = c("center", "scale")
)

#print model
print(caret_model)

## CART 
## 
## 917 samples
##  11 predictor
##   6 classes: '3', '4', '5', '6', '7', '8' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 733, 733, 733, 734, 735 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa    
##   0.01  0.5844388  0.3254379
##   0.02  0.5702369  0.2860819
##   0.03  0.5506238  0.2396987
##   0.04  0.5506238  0.2341489
##   0.05  0.5506238  0.2341489
##   0.06  0.5506238  0.2341489
##   0.07  0.5506238  0.2341489
##   0.08  0.5506238  0.2341489
##   0.09  0.5506238  0.2341489
##   0.10  0.5506238  0.2341489
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01.

#predict
caret_predictions <- predict(caret_model, test_data)



caret_conf_matrix <- confusionMatrix(caret_predictions, test_data$quality)
print(caret_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  0  0  0  0
##          4  0  0  0  0  0  0
##          5  1  4 74 38  2  0
##          6  0  2 21 44 15  2
##          7  0  0  1 10 11  1
##          8  0  0  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5708          
##                  95% CI : (0.5035, 0.6362)
##     No Information Rate : 0.4248          
##     P-Value [Acc > NIR] : 7.058e-06       
##                                           
##                   Kappa : 0.2992          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.7708   0.4783  0.39286  0.00000
## Specificity          1.000000  1.00000   0.6538   0.7015  0.93939  1.00000
## Pos Pred Value            NaN      NaN   0.6218   0.5238  0.47826      NaN
## Neg Pred Value       0.995575  0.97345   0.7944   0.6620  0.91626  0.98673
## Prevalence           0.004425  0.02655   0.4248   0.4071  0.12389  0.01327
## Detection Rate       0.000000  0.00000   0.3274   0.1947  0.04867  0.00000
## Detection Prevalence 0.000000  0.00000   0.5265   0.3717  0.10177  0.00000
## Balanced Accuracy    0.500000  0.50000   0.7123   0.5899  0.66613  0.50000
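
The Kappa value in this output can be reproduced from the confusion matrix. This is a sketch of the standard Cohen's kappa calculation with the counts typed in from the printed matrix; it is my own check, not how caret computes it internally.

```r
# Counts copied from the caret confusion matrix
# (rows = predicted quality, columns = reference quality)
quality_levels <- as.character(3:8)
cm <- matrix(
  c(0, 0,  0,  0,  0, 0,
    0, 0,  0,  0,  0, 0,
    1, 4, 74, 38,  2, 0,
    0, 2, 21, 44, 15, 2,
    0, 0,  1, 10, 11, 1,
    0, 0,  0,  0,  0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(Prediction = quality_levels, Reference = quality_levels)
)

n <- sum(cm)

# Observed agreement is the same number as the accuracy
observed_agreement <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2

cohen_kappa <- (observed_agreement - expected_agreement) /
  (1 - expected_agreement)

cat("Accuracy:", round(observed_agreement, 4), "\n")
cat("Kappa:", round(cohen_kappa, 4), "\n")
```

A kappa near 0.30 means the model is only modestly better than what the class frequencies alone would give, even though the accuracy is 57%.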

#train decision tree
rpart_model <- rpart(quality ~ ., data = train_data)

#Predict on the test set
rpart_predictions <- predict(rpart_model, test_data, type = "class")
rpart_conf_matrix <- confusionMatrix(rpart_predictions, test_data$quality)
print(rpart_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  3  4  5  6  7  8
##          3  0  0  0  0  0  0
##          4  0  0  0  0  0  0
##          5  1  4 74 38  2  0
##          6  0  2 21 44 15  2
##          7  0  0  1 10 11  1
##          8  0  0  0  0  0  0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5708          
##                  95% CI : (0.5035, 0.6362)
##     No Information Rate : 0.4248          
##     P-Value [Acc > NIR] : 7.058e-06       
##                                           
##                   Kappa : 0.2992          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.7708   0.4783  0.39286  0.00000
## Specificity          1.000000  1.00000   0.6538   0.7015  0.93939  1.00000
## Pos Pred Value            NaN      NaN   0.6218   0.5238  0.47826      NaN
## Neg Pred Value       0.995575  0.97345   0.7944   0.6620  0.91626  0.98673
## Prevalence           0.004425  0.02655   0.4248   0.4071  0.12389  0.01327
## Detection Rate       0.000000  0.00000   0.3274   0.1947  0.04867  0.00000
## Detection Prevalence 0.000000  0.00000   0.5265   0.3717  0.10177  0.00000
## Balanced Accuracy    0.500000  0.50000   0.7123   0.5899  0.66613  0.50000

# Compare the models
cat("Caret Model Accuracy:", caret_conf_matrix$overall["Accuracy"], "\n")

## Caret Model Accuracy: 0.5707965

cat("Rpart Model Accuracy:", rpart_conf_matrix$overall["Accuracy"], "\n")

## Rpart Model Accuracy: 0.5707965

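Caret centered and scaled the predictors and handled the cross-validation splits automatically. No imputation step was included in preProcess, so missing data was not actually handled here. The optimal value of the complexity parameter was 0.01, determined by cross-validation accuracy. Both the caret and rpart models achieved the same testing accuracy of 57.08%, with identical confusion matrices. This is expected: rpart's default cp is also 0.01, and tree splits do not change when the predictors are centered and scaled, so the preprocessing and the tuning did not change or improve the model.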
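
To illustrate why centering and scaling should not matter for a decision tree, here is a small sketch. It uses the built-in iris data as a stand-in for the wine data (an assumption for illustration only) and fits the same rpart tree on raw and on standardized predictors.

```r
library(rpart)

# Built-in iris data stands in for the wine data (illustration only)
raw <- iris
scaled <- iris

# Center and scale the four numeric predictors by hand
scaled[1:4] <- lapply(iris[1:4], function(x) (x - mean(x)) / sd(x))

fit_raw <- rpart(Species ~ ., data = raw, method = "class")
fit_scaled <- rpart(Species ~ ., data = scaled, method = "class")

pred_raw <- predict(fit_raw, raw, type = "class")
pred_scaled <- predict(fit_scaled, scaled, type = "class")

# Splits depend only on the ordering of the values, so the predictions agree
same_predictions <- all(pred_raw == pred_scaled)

# Default complexity parameter used when no cp is given
default_cp <- rpart.control()$cp

cat("Same predictions after centering and scaling:", same_predictions, "\n")
cat("Default cp in rpart.control():", default_cp, "\n")
```

The split thresholds move to the new scale, but each observation lands in the same leaf, which fits with the two wine models producing identical confusion matrices.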