Homework 2

I will use this dataset found on Kaggle that contains information about wine contents and quality. The dependent variable or the response variable is the wine quality and the other are the predictors. This dataset contains 12 columns and 1599 observations.

0.1 Importing The Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

dataset <- read_csv("C:\\Users\\Al Haque\\OneDrive\\Desktop\\Data 622\\Data622Hw2\\winequality-red.csv")

## Rows: 1599 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

## # A tibble: 6 × 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4               0.7           0                 1.9     0.076
## 2             7.8               0.88          0                 2.6     0.098
## 3             7.8               0.76          0.04              2.3     0.092
## 4            11.2               0.28          0.56              1.9     0.075
## 5             7.4               0.7           0                 1.9     0.076
## 6             7.4               0.66          0                 1.8     0.075
## # ℹ 7 more variables: `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>

According to Kaggle, the classes are ordered and there are more normal wines than excellent or poor ones, meaning we will have to stratify on the wine quality..

0.2 Data Exploration

## Let's check the unique wine quality., so there are scores of either 3,4,5,6,7,8 

unique(dataset$quality)

## [1] 5 6 7 4 8 3

ggplot(dataset,aes(x = quality)) +
  geom_bar() +
  ggtitle("Wine Quality Proportion") +
  xlab("Wine Quality") +
  ylab("Number of Wines") + theme_bw()

0.2.1 Histogram of The Data.

I have made a histogram of all the columns in the dataset, we melted the data into a long-format and facet-wrapped it by name, and we can see all the distribution.

data_long <- dataset %>%
  pivot_longer(colnames(dataset)) %>%
  as.data.frame()

## We can see various distribution of the dataset now.. 
ggp1 <- ggplot(data_long, aes(x = value)) +    # Draw each column as histogram
  geom_histogram() + 
  facet_wrap(~ name, scales = "free")
ggp1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3 Data-Cleaning

Let me try to clean some of the columns within the dataset..

## This library shows whether there are empty values, since all the viz is red it means that there are no missing values within  the dataset.. 

library(visdat)

vis_dat(dataset)

Now I will make a categorical column of the wine quality a score of 7 or higher is considered Good else it is bad quality.

Dataset <- dataset %>%
  mutate(Quality = ifelse(quality >= 7, "Good","Bad"),
         Quality = factor(Quality))

head(Dataset)

## # A tibble: 6 × 13
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4               0.7           0                 1.9     0.076
## 2             7.8               0.88          0                 2.6     0.098
## 3             7.8               0.76          0.04              2.3     0.092
## 4            11.2               0.28          0.56              1.9     0.075
## 5             7.4               0.7           0                 1.9     0.076
## 6             7.4               0.66          0                 1.8     0.075
## # ℹ 8 more variables: `free sulfur dioxide` <dbl>,
## #   `total sulfur dioxide` <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>, Quality <fct>

## There are more bad quality wines than there are of good quality wine.. 
Dataset %>%
  tabyl(Quality)

##  Quality    n   percent
##      Bad 1382 0.8642902
##     Good  217 0.1357098

0.4 Correlation plot

Since we have all numeric variables let’s take a look at a correlation plot..

library(corrplot)

## corrplot 0.92 loaded

corr_matrix <- Dataset %>% select_if(is.numeric) %>% 
  cor(method="pearson", use="pairwise.complete.obs")

corrplot(corr_matrix,type="upper",method = "number")

We have some correlation between some of the predictors, fixed acidity has a positive correlation between citric acid and density while it has a negative correlation with pH. While free sulfur dioxide has a postivie correlation with total sulfur dioxide.

0.5 Decision Tree

We will make a decision-tree but before let’s split the data into training and testing set..

Dataset1 <- Dataset %>%
  select(-quality)

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(1234)
# create a list of 75% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(Dataset1$Quality, p=0.75, list=FALSE)
# select 20% of the data for validation
validation <- Dataset1[-validation_index,]
# use the remaining 80% of data to training and testing the models
Training <- Dataset1[validation_index,]

Training %>%
  tabyl(Quality)

##  Quality    n   percent
##      Bad 1037 0.8641667
##     Good  163 0.1358333

validation %>%
  tabyl(Quality)

##  Quality   n   percent
##      Bad 345 0.8646617
##     Good  54 0.1353383

Each of the sets are equally proportioned by good and bad quality..

library(rpart)
library(rpart.plot)

fit <- rpart(Quality~.,data = Training,method = 'class')
rpart.plot(fit,extra = 106)

0.5.1 Make predictions.

predict_unseen <- predict(fit,validation,type = 'class')

0.5.2 Display Accuracy

Here we display the confusion matrix results we get an accuracy of 87% for the testing set which is pretty good.

confusionMatrix(predict_unseen,validation$Quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  325   32
##       Good  20   22
##                                           
##                Accuracy : 0.8697          
##                  95% CI : (0.8326, 0.9011)
##     No Information Rate : 0.8647          
##     P-Value [Acc > NIR] : 0.4198          
##                                           
##                   Kappa : 0.3856          
##                                           
##  Mcnemar's Test P-Value : 0.1272          
##                                           
##             Sensitivity : 0.9420          
##             Specificity : 0.4074          
##          Pos Pred Value : 0.9104          
##          Neg Pred Value : 0.5238          
##              Prevalence : 0.8647          
##          Detection Rate : 0.8145          
##    Detection Prevalence : 0.8947          
##       Balanced Accuracy : 0.6747          
##                                           
##        'Positive' Class : Bad             
##

0.6 Regression Tree

We are going to predict the alcohol column since that was the predictor used at the first split I am going to use the normal Dataset

## We are going to use the data but swap it to pH I should use the alcohol since it is the first split of the decision tree 
fit2 <- rpart(alcohol~.,data = Training,method = 'anova')

rpart.plot(fit2)

0.6.1 Predict on testing data

predict_unseen1 <- predict(fit2,validation,class = 'anova')
head(predict_unseen1)

##         1         2         3         4         5         6 
##  9.667434  9.838307 10.149296 11.005682  9.667434  9.838307

0.6.2 Regression Metrics

I got a function from this website: https://www.pluralsight.com/guides/non-linear-regression-trees-with-r

eval_results <- function(true, predicted, df) {
  SSE <- sum((predicted - true)^2)
  SST <- sum((true - mean(true))^2)
  R_square <- 1 - SSE / SST
  RMSE = sqrt(SSE/nrow(df))
  
# Model performance metrics
  data.frame(
    RMSE = RMSE,
    Rsquare = R_square
  )
  
}

  
eval_results(validation$alcohol,predict_unseen1,validation)

##        RMSE   Rsquare
## 1 0.7428541 0.4644592

0.7 Fit a Random Forest..

Next we will create a random forest for classification we will try to predict on our original dependent variable (wine quality..)

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## We have to clean the column names before we make the random forest... 
library(janitor)

Training1 <- Training %>%
  clean_names()

validation1 <- validation %>%
  clean_names()

## we will set a seed and make it reproducible.. 
## we will create a random forest model using the training set and then apply it on the testing set

set.seed(1)

randomf <- randomForest(quality~.,data = Training1)

randomf

## 
## Call:
##  randomForest(formula = quality ~ ., data = Training1) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 9.67%
## Confusion matrix:
##       Bad Good class.error
## Bad  1010   27  0.02603664
## Good   89   74  0.54601227

0.7.1 Variable Importance Plot

The plot shows that alcohol and sulphates, are the most important predictors when making predictions for good or bad classification for wine.

varImpPlot(randomf)

Seems alcohol and sulphates played a bigger role in classification for the wine quality.

0.8 Predict Values(Random Forest) using new test data

Using the testing data, we get a 91% accuracy on our testing data.

randmpred <- predict(randomf,newdata = validation1)

confusionMatrix(randmpred,validation1$quality)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  340   27
##       Good   5   27
##                                           
##                Accuracy : 0.9198          
##                  95% CI : (0.8887, 0.9445)
##     No Information Rate : 0.8647          
##     P-Value [Acc > NIR] : 0.0004143       
##                                           
##                   Kappa : 0.5862          
##                                           
##  Mcnemar's Test P-Value : 0.0002054       
##                                           
##             Sensitivity : 0.9855          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.9264          
##          Neg Pred Value : 0.8438          
##              Prevalence : 0.8647          
##          Detection Rate : 0.8521          
##    Detection Prevalence : 0.9198          
##       Balanced Accuracy : 0.7428          
##                                           
##        'Positive' Class : Bad             
##

Wow a 91% accuracy on the random forest model