Data Analytics Application - Build a regression model using Logistic, Support vector Machine, Multinomial, Ordinal Logistic Regression

Access wine quality data set at https://archive.ics.uci.edu/ml/datasets/Wine+Quality

Summary of Answer

I modelled properties of red and white variants of Portuguese “Vinho Verde” wineto predict the quality of each wine as given in data set. It is given that the classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones.

First, I did exploratory data analysis and split data was in the ratio of 50/50 to create training and test datasets,respectively.

For each question one regression model with all variable was created. Then Each model was tuned to improve performance.The accuracy,R-square for linear model and misclassification rate was then determined for each model. The multinominal logistic regression model predicts train data with 55% accuracy, SVM logistic regression model predicts test data with 60% accuracy,whereas ordinal regression model without scaling predicts hold-out data with 56% accuracy. Liner regression gives an R-square around 28, which means only 28% of the variance in DV is explained by IV.

Conclusion: I expected Ordinal regression to provide the most accurate prediction since the dependent variable (quality) is ordered in the original dataset but it is giving accurcy of only 40% However,The model created using Support Vector Machines most accurately predicted the test data.

Installing required packages and sourcing libraries requiredfor this HW

Importing files- Two csv files

# Structure and Summary for red wine
redwine<- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-red.csv")
redwine <- read.csv("C:/users/Amit/Dropbox/Dataset/winequality-red.csv", sep= ";")
View(redwine)
#str(redwine)
#summary(redwine)

# Structure and Summary for white wine
whitewine<- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-white.csv")
whitewine <- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-white.csv", sep=";")
View(whitewine)
#str(whitewine)
#summary(whitewine)

Merge of redwine and whitewine data

wine_redwhite <- rbind(whitewine, redwine)
class(wine_redwhite)

## [1] "data.frame"

View(wine_redwhite)
head(wine_redwhite)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6

#str(wine_redwhite)
#summary(wine_redwhite)

Red wine data has 1,599 observation and quality variable has levels from 3 to 8. White wine data has 4898 observation and quality variable has levels from 3 to 9. Combined red and wine data has 6,497 observation and has levels from 3 to 9.

Basic exploratory data analysis to get a sense of data

#Check for missing values
#colSums(is.na(wine_redwhite))

#install.packages("corrplot")
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.3.2

wine_corr <- cor(wine_redwhite)
summary(wine_corr)

##  fixed.acidity      volatile.acidity    citric.acid      
##  Min.   :-0.32905   Min.   :-0.41448   Min.   :-0.37798  
##  1st Qu.:-0.14716   1st Qu.:-0.28741   1st Qu.: 0.02663  
##  Median : 0.07113   Median : 0.09068   Median : 0.09084  
##  Mean   : 0.12095   Mean   : 0.05921   Mean   : 0.11282  
##  3rd Qu.: 0.30578   3rd Qu.: 0.26391   3rd Qu.: 0.15565  
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000  
##  residual.sugar       chlorides        free.sulfur.dioxide
##  Min.   :-0.35941   Min.   :-0.27963   Min.   :-0.35256   
##  1st Qu.:-0.18845   1st Qu.:-0.19645   1st Qu.:-0.19010   
##  Median :-0.07448   Median : 0.04185   Median :-0.06007   
##  Mean   : 0.10890   Mean   : 0.12134   Mean   : 0.08280   
##  3rd Qu.: 0.42602   3rd Qu.: 0.36624   3rd Qu.: 0.20056   
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000   
##  total.sulfur.dioxide    density               pH          
##  Min.   :-0.41448     Min.   :-0.68675   Min.   :-0.32981  
##  1st Qu.:-0.27670     1st Qu.: 0.02221   1st Qu.:-0.24198  
##  Median :-0.13990     Median : 0.17782   Median : 0.01560  
##  Mean   : 0.04997     Mean   : 0.17318   Mean   : 0.03472  
##  3rd Qu.: 0.27030     3rd Qu.: 0.38669   3rd Qu.: 0.13897  
##  Max.   : 1.00000     Max.   : 1.00000   Max.   : 1.00000  
##    sulphates           alcohol            quality         
##  Min.   :-0.27573   Min.   :-0.68675   Min.   :-0.305858  
##  1st Qu.:-0.04875   1st Qu.:-0.25912   1st Qu.:-0.107724  
##  Median : 0.12416   Median :-0.06655   Median :-0.008737  
##  Mean   : 0.15119   Mean   :-0.02748   Mean   : 0.059664  
##  3rd Qu.: 0.26950   3rd Qu.: 0.02804   3rd Qu.: 0.062980  
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.000000

corrplot(wine_corr,method="number")

# Skewness and kurtosis
#install.packages("psych")
#library("psych")
#describe(wine_redwhite)
# Histogram and table to get frquency of different qualities
#hist_1 <- hist(wine_redwhite$quality)
#hist_1
#print(hist_1$breaks)
# Change number of breaks and add labels
hist_2 <- hist(wine_redwhite$quality, breaks = 5, col ="lightblue", xlab = "Quality ", main= " Histogram of Quality rating")

table(wine_redwhite$quality)

## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

From describe function output, we observe that skewness coefficient for quality is 0.19 and kurtosis is 0.23. Also, from the scatterplot matrix, we observe that there is not a linear relationship between the DV (Quality) and the IV(Predictors).

Next, I split the data into test and train data

set.seed(1235)
wine_redwhite <- wine_redwhite[sample(nrow(wine_redwhite)),]
split <- floor(nrow(wine_redwhite)/2)
wine_train <- wine_redwhite[0:split,]
wine_test <- wine_redwhite[(split+1):(nrow(wine_redwhite)-1),]
View(wine_train)
View(wine_test)
summary(wine_train)

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.200   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.200  
##  Mean   : 7.215   Mean   :0.3425   Mean   :0.3183   Mean   : 5.546  
##  3rd Qu.: 7.700   3rd Qu.:0.4100   3rd Qu.:0.3900   3rd Qu.: 8.200  
##  Max.   :15.600   Max.   :1.3300   Max.   :1.0000   Max.   :31.600  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   :  1.0       Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.0       1st Qu.: 77.0       
##  Median :0.04700   Median : 29.0       Median :118.0       
##  Mean   :0.05608   Mean   : 30.4       Mean   :115.3       
##  3rd Qu.:0.06400   3rd Qu.: 41.0       3rd Qu.:155.0       
##  Max.   :0.61000   Max.   :131.0       Max.   :366.5       
##     density             pH          sulphates         alcohol    
##  Min.   :0.9871   Min.   :2.740   Min.   :0.2200   Min.   : 8.0  
##  1st Qu.:0.9924   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.5  
##  Median :0.9949   Median :3.210   Median :0.5000   Median :10.3  
##  Mean   :0.9947   Mean   :3.218   Mean   :0.5296   Mean   :10.5  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.3  
##  Max.   :1.0103   Max.   :4.010   Max.   :2.0000   Max.   :14.2  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.821  
##  3rd Qu.:6.000  
##  Max.   :9.000

str(wine_train)

## 'data.frame':    3248 obs. of  12 variables:
##  $ fixed.acidity       : num  6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
##  $ volatile.acidity    : num  0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
##  $ citric.acid         : num  0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
##  $ residual.sugar      : num  1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
##  $ chlorides           : num  0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
##  $ free.sulfur.dioxide : num  20 43 44 5 27 48 45 54 27 10 ...
##  $ total.sulfur.dioxide: num  74 157 166 10 69 127 178 216 80 136 ...
##  $ density             : num  0.993 0.994 0.996 0.996 0.999 ...
##  $ pH                  : num  3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
##  $ sulphates           : num  0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
##  $ alcohol             : num  9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
##  $ quality             : int  6 6 5 6 6 6 5 6 6 6 ...

str(wine_test)

## 'data.frame':    3248 obs. of  12 variables:
##  $ fixed.acidity       : num  5.9 9.1 8.3 5.8 6.9 7.4 8.5 6.8 5 7.3 ...
##  $ volatile.acidity    : num  0.21 0.27 0.28 0.29 0.27 0.55 0.17 0.41 0.27 0.65 ...
##  $ citric.acid         : num  0.31 0.32 0.48 0.38 0.25 0.19 0.74 0.31 0.32 0 ...
##  $ residual.sugar      : num  1.8 1.1 2.1 10.7 7.5 1.8 3.6 8.8 4.5 1.2 ...
##  $ chlorides           : num  0.033 0.031 0.093 0.038 0.03 0.082 0.05 0.084 0.032 0.065 ...
##  $ free.sulfur.dioxide : num  45 15 6 49 18 15 29 26 58 15 ...
##  $ total.sulfur.dioxide: num  142 151 12 136 117 34 128 45 178 21 ...
##  $ density             : num  0.99 0.994 0.994 0.994 0.991 ...
##  $ pH                  : num  3.35 3.03 3.26 3.11 3.09 3.49 3.28 3.38 3.45 3.39 ...
##  $ sulphates           : num  0.5 0.41 0.62 0.59 0.38 0.68 0.4 0.64 0.31 0.47 ...
##  $ alcohol             : num  12.7 10.6 12.4 11.2 13 10.5 12.4 10.1 12.6 10 ...
##  $ quality             : int  6 5 7 6 6 5 6 6 7 7 ...

summary(wine_test)

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.700  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 2.900  
##  Mean   : 7.215   Mean   :0.3369   Mean   :0.3189   Mean   : 5.341  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 7.900  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides     free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.009   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.038   1st Qu.: 17.00      1st Qu.: 79.0        1st Qu.:0.9923  
##  Median :0.047   Median : 29.00      Median :118.0        Median :0.9948  
##  Mean   :0.056   Mean   : 30.66      Mean   :116.2        Mean   :0.9947  
##  3rd Qu.:0.066   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9969  
##  Max.   :0.611   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2300   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5329   Mean   :10.48   Mean   :5.815  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :1.9800   Max.   :14.90   Max.   :9.000

plot(wine_test)

Question 1 -Build a logistic regression model using “quality” as the target variable. Note that you don’t have binary classification task any more. For this you will have to use multinomial logistic regression. However, you can still interpret the model output (i.e., statistical significance of the coefficients, etc. exactly the same way as binary logistic regression). (40 points)

multinomial logistic regression (question 1)-In this case we can use Multinomial logistic regression as response variable (quality) is a nominal categorical variable with more than 2 levels.

library(nnet)
wine_train$quality <- as.factor(wine_train$quality)
mlogit_model <- multinom(quality ~. ,data =wine_train, maxit = 1000)

## # weights:  91 (72 variable)
## initial  value 6320.316164 
## iter  10 value 4320.497806
## iter  20 value 4060.606046
## iter  30 value 3854.530346
## iter  40 value 3506.654411
## iter  50 value 3457.637350
## iter  60 value 3450.835013
## iter  70 value 3447.147730
## iter  80 value 3445.537837
## iter  90 value 3444.981576
## iter 100 value 3442.950730
## iter 110 value 3441.461199
## iter 120 value 3440.940957
## iter 130 value 3440.306791
## iter 140 value 3439.559266
## iter 150 value 3439.034524
## iter 160 value 3439.019201
## iter 170 value 3438.970730
## iter 180 value 3438.428701
## iter 190 value 3438.316035
## iter 200 value 3438.089704
## iter 210 value 3437.677310
## iter 220 value 3437.491963
## iter 230 value 3437.346085
## iter 240 value 3437.311541
## iter 250 value 3437.258707
## iter 260 value 3437.160001
## iter 270 value 3437.054373
## iter 280 value 3436.643428
## iter 290 value 3436.594604
## iter 300 value 3436.494691
## iter 310 value 3436.487621
## final  value 3436.487177 
## converged

mlogit_model

## Call:
## multinom(formula = quality ~ ., data = wine_train, maxit = 1000)
## 
## Coefficients:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4    80.34737   -0.40121812       -0.3009899 -0.18609933    -0.13430930
## 5  -143.77234   -0.45157483       -2.4570309 -0.52078002    -0.23245524
## 6  -120.11940   -0.44163988       -6.3501832 -0.55937887    -0.16496438
## 7   111.65170   -0.02859189       -8.4393169 -0.90090096    -0.04285993
## 8    71.90978   -0.05405534       -7.2393255 -0.07268703    -0.02706201
## 9   -23.26643    1.05731368       -9.9188910  4.69045014    -0.54367948
##    chlorides free.sulfur.dioxide total.sulfur.dioxide    density        pH
## 4  -15.62348          0.01680235          -0.01254569  -54.00569 -4.071159
## 5  -15.24590          0.07569049          -0.01640530  174.65852 -3.608298
## 6  -13.70611          0.08352995          -0.02206566  142.16967 -3.275409
## 7  -19.58021          0.09035251          -0.02426119 -105.41733 -1.556323
## 8  -18.60198          0.10593109          -0.02046752  -70.67082 -1.755639
## 9 -248.45446          0.08576505          -0.01621272  -42.75708 13.405050
##   sulphates     alcohol
## 4  3.077385 -0.62686578
## 5  3.302585 -0.82060754
## 6  5.059485 -0.01451927
## 7  6.643809  0.38419658
## 8  6.035392  0.64336353
## 9  3.378276  1.92082084
## 
## Residual Deviance: 6872.974 
## AIC: 7016.974

mlogit_output <- summary(mlogit_model)
mlogit_output

## Call:
## multinom(formula = quality ~ ., data = wine_train, maxit = 1000)
## 
## Coefficients:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4    80.34737   -0.40121812       -0.3009899 -0.18609933    -0.13430930
## 5  -143.77234   -0.45157483       -2.4570309 -0.52078002    -0.23245524
## 6  -120.11940   -0.44163988       -6.3501832 -0.55937887    -0.16496438
## 7   111.65170   -0.02859189       -8.4393169 -0.90090096    -0.04285993
## 8    71.90978   -0.05405534       -7.2393255 -0.07268703    -0.02706201
## 9   -23.26643    1.05731368       -9.9188910  4.69045014    -0.54367948
##    chlorides free.sulfur.dioxide total.sulfur.dioxide    density        pH
## 4  -15.62348          0.01680235          -0.01254569  -54.00569 -4.071159
## 5  -15.24590          0.07569049          -0.01640530  174.65852 -3.608298
## 6  -13.70611          0.08352995          -0.02206566  142.16967 -3.275409
## 7  -19.58021          0.09035251          -0.02426119 -105.41733 -1.556323
## 8  -18.60198          0.10593109          -0.02046752  -70.67082 -1.755639
## 9 -248.45446          0.08576505          -0.01621272  -42.75708 13.405050
##   sulphates     alcohol
## 4  3.077385 -0.62686578
## 5  3.302585 -0.82060754
## 6  5.059485 -0.01451927
## 7  6.643809  0.38419658
## 8  6.035392  0.64336353
## 9  3.378276  1.92082084
## 
## Std. Errors:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4   0.1454493     0.2165992        0.5532004  0.69219171     0.07410513
## 5   0.5178327     0.2052018        0.3784100  0.37689593     0.06965434
## 6   0.4284067     0.2045802        0.3693196  0.34811001     0.06951777
## 7   0.5579515     0.2071980        0.4956936  0.47244397     0.07045497
## 8   0.1321986     0.2193902        0.8597776  0.77473267     0.07408530
## 9   0.1320847     0.3346028        0.1170187  0.08956497     0.49921984
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density        pH
## 4 0.07400592          0.03622649          0.006873728 0.1450188 0.6111689
## 5 0.72950948          0.03483952          0.006457798 0.5073267 0.5901773
## 6 0.75412783          0.03485743          0.006480820 0.4195697 0.5697341
## 7 0.04534702          0.03504646          0.006626980 0.5469947 0.5958200
## 8 0.03572340          0.03571581          0.007253145 0.1314910 0.6354287
## 9 0.01741697          0.07224975          0.025817800 0.1340691 0.9254421
##   sulphates   alcohol
## 4 0.6817020 0.2258354
## 5 0.3586576 0.2135175
## 6 0.3066205 0.2105279
## 7 0.3651224 0.2133400
## 8 0.6329260 0.2247793
## 9 0.1478071 0.3393065
## 
## Residual Deviance: 6872.974 
## AIC: 7016.974

Interpretation of Train Model data. After we run the model, nnet reports the summary iterations and declaration about model convergence. In this case, the model converged quite after 310 iterations. 2. Model execution output shows some iteration history and includes the final negative log-likelihood 3436.5. This value is multiplied by two as shown in the model summary as the Residual Deviance

The Akaike Information Criterion (AIC) is 7016.9 and it provides a method for assessing the quality of your model through comparison of related models. If we have more than one similar candidate models (where all of the variables of the simpler model occur in the more complex models), then we should select the model that has the smallest AIC. It’s useful for comparing models. However, unlike adjusted R-squared, the number itself is not meaningful.

calculate coefficient and standard error

mlogit_output$coefficients

##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4    80.34737   -0.40121812       -0.3009899 -0.18609933    -0.13430930
## 5  -143.77234   -0.45157483       -2.4570309 -0.52078002    -0.23245524
## 6  -120.11940   -0.44163988       -6.3501832 -0.55937887    -0.16496438
## 7   111.65170   -0.02859189       -8.4393169 -0.90090096    -0.04285993
## 8    71.90978   -0.05405534       -7.2393255 -0.07268703    -0.02706201
## 9   -23.26643    1.05731368       -9.9188910  4.69045014    -0.54367948
##    chlorides free.sulfur.dioxide total.sulfur.dioxide    density        pH
## 4  -15.62348          0.01680235          -0.01254569  -54.00569 -4.071159
## 5  -15.24590          0.07569049          -0.01640530  174.65852 -3.608298
## 6  -13.70611          0.08352995          -0.02206566  142.16967 -3.275409
## 7  -19.58021          0.09035251          -0.02426119 -105.41733 -1.556323
## 8  -18.60198          0.10593109          -0.02046752  -70.67082 -1.755639
## 9 -248.45446          0.08576505          -0.01621272  -42.75708 13.405050
##   sulphates     alcohol
## 4  3.077385 -0.62686578
## 5  3.302585 -0.82060754
## 6  5.059485 -0.01451927
## 7  6.643809  0.38419658
## 8  6.035392  0.64336353
## 9  3.378276  1.92082084

mlogit_output$standard.errors

##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4   0.1454493     0.2165992        0.5532004  0.69219171     0.07410513
## 5   0.5178327     0.2052018        0.3784100  0.37689593     0.06965434
## 6   0.4284067     0.2045802        0.3693196  0.34811001     0.06951777
## 7   0.5579515     0.2071980        0.4956936  0.47244397     0.07045497
## 8   0.1321986     0.2193902        0.8597776  0.77473267     0.07408530
## 9   0.1320847     0.3346028        0.1170187  0.08956497     0.49921984
##    chlorides free.sulfur.dioxide total.sulfur.dioxide   density        pH
## 4 0.07400592          0.03622649          0.006873728 0.1450188 0.6111689
## 5 0.72950948          0.03483952          0.006457798 0.5073267 0.5901773
## 6 0.75412783          0.03485743          0.006480820 0.4195697 0.5697341
## 7 0.04534702          0.03504646          0.006626980 0.5469947 0.5958200
## 8 0.03572340          0.03571581          0.007253145 0.1314910 0.6354287
## 9 0.01741697          0.07224975          0.025817800 0.1340691 0.9254421
##   sulphates   alcohol
## 4 0.6817020 0.2258354
## 5 0.3586576 0.2135175
## 6 0.3066205 0.2105279
## 7 0.3651224 0.2133400
## 8 0.6329260 0.2247793
## 9 0.1478071 0.3393065

calculate Z score and p-Value for the variables in the model

z <- mlogit_output$coefficients/mlogit_output$standard.errors
p <- (1-pnorm(abs(z),0,1))*2 # I am using two-tailed z test
print(z, digits =2)

##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4         552         -1.85            -0.54      -0.269          -1.81
## 5        -278         -2.20            -6.49      -1.382          -3.34
## 6        -280         -2.16           -17.19      -1.607          -2.37
## 7         200         -0.14           -17.03      -1.907          -0.61
## 8         544         -0.25            -8.42      -0.094          -0.37
## 9        -176          3.16           -84.76      52.369          -1.09
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 4      -211                0.46                -1.83    -372 -6.7
## 5       -21                2.17                -2.54     344 -6.1
## 6       -18                2.40                -3.40     339 -5.7
## 7      -432                2.58                -3.66    -193 -2.6
## 8      -521                2.97                -2.82    -537 -2.8
## 9    -14265                1.19                -0.63    -319 14.5
##   sulphates alcohol
## 4       4.5  -2.776
## 5       9.2  -3.843
## 6      16.5  -0.069
## 7      18.2   1.801
## 8       9.5   2.862
## 9      22.9   5.661

print(p, digits =2)

##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4           0        0.0640          5.9e-01       0.788        0.06992
## 5           0        0.0278          8.4e-11       0.167        0.00085
## 6           0        0.0309          0.0e+00       0.108        0.01765
## 7           0        0.8902          0.0e+00       0.057        0.54297
## 8           0        0.8054          0.0e+00       0.925        0.71490
## 9           0        0.0016          0.0e+00       0.000        0.27613
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density      pH
## 4         0              0.6428              0.06798       0 2.7e-11
## 5         0              0.0298              0.01107       0 9.7e-10
## 6         0              0.0166              0.00066       0 9.0e-09
## 7         0              0.0099              0.00025       0 9.0e-03
## 8         0              0.0030              0.00477       0 5.7e-03
## 9         0              0.2352              0.53003       0 0.0e+00
##   sulphates alcohol
## 4   6.4e-06 5.5e-03
## 5   0.0e+00 1.2e-04
## 6   0.0e+00 9.5e-01
## 7   0.0e+00 7.2e-02
## 8   0.0e+00 4.2e-03
## 9   0.0e+00 1.5e-08

The p-Value for quality tells us that citric acid and residual sugar (for four levels of quality) are not significant. Now I’ll explore the entire data set, and analyze if we can remove any variables which do not add to model performance.

Pquality5 <- rbind(mlogit_output$coefficients[2, ],mlogit_output$standard.errors[2, ],z[2, ],p[2, ])
Pquality5

##       (Intercept) fixed.acidity volatile.acidity citric.acid
## [1,] -143.7723368   -0.45157483    -2.457031e+00  -0.5207800
## [2,]    0.5178327    0.20520178     3.784100e-01   0.3768959
## [3,] -277.6424511   -2.20063797    -6.493038e+00  -1.3817608
## [4,]    0.0000000    0.02776166     8.412226e-11   0.1670452
##      residual.sugar   chlorides free.sulfur.dioxide total.sulfur.dioxide
## [1,]  -0.2324552432 -15.2459000          0.07569049         -0.016405299
## [2,]   0.0696543433   0.7295095          0.03483952          0.006457798
## [3,]  -3.3372684614 -20.8988375          2.17254706         -2.540385993
## [4,]   0.0008460618   0.0000000          0.02981442          0.011073019
##          density            pH sulphates       alcohol
## [1,] 174.6585171 -3.608298e+00 3.3025854 -0.8206075363
## [2,]   0.5073267  5.901773e-01 0.3586576  0.2135175060
## [3,] 344.2722436 -6.113923e+00 9.2081850 -3.8432798873
## [4,]   0.0000000  9.721148e-10 0.0000000  0.0001214009

rownames(Pquality5) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pquality5)

	(Intercept)	fixed.acidity	volatile.acidity	citric.acid	residual.sugar	chlorides	free.sulfur.dioxide	total.sulfur.dioxide	density	pH	sulphates	alcohol
Coefficient	-143.7723368	-0.4515748	-2.457031	-0.5207800	-0.2324552	-15.2459000	0.0756905	-0.0164053	174.6585171	-3.6082985	3.3025854	-0.8206075
Std. Errors	0.5178327	0.2052018	0.378410	0.3768959	0.0696543	0.7295095	0.0348395	0.0064578	0.5073267	0.5901773	0.3586576	0.2135175
z stat	-277.6424511	-2.2006380	-6.493038	-1.3817608	-3.3372685	-20.8988375	2.1725471	-2.5403860	344.2722436	-6.1139226	9.2081850	-3.8432799
p value	0.0000000	0.0277617	0.000000	0.1670452	0.0008461	0.0000000	0.0298144	0.0110730	0.0000000	0.0000000	0.0000000	0.0001214

Interpretation of Coefficients

Probabilities Equation Ln(p(Quality=4)/p(Quality=3) = 80 - 0.40fixed.acidity -0.30volatile.acidity - 0.18citric.acid - 0.13residual.sugar -15.62chlorides + 0.016free.sulfur.dioxide -0.012total.sulfur.dioxide - 54.0density - 4.07ph + 3.07sulphates - 0.62*alcohol

Ln(p(Quality=5)/p(Quality=3) = 143 - 0.45fixed.acidity -2.45volatile.acidity - 0.52citric.acid - 0.23residual.sugar -15.24chlorides + 0.07free.sulfur.dioxide -0.016total.sulfur.dioxide - 174.6density - 3.60ph + 3.30sulphates - 0.82*alcohol

Caclucation of odds ratio- extract the coefficients from the model and exponentiate

oddsML <- exp(coef(mlogit_output))
print(oddsML, digits =2)

##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4     7.8e+34          0.67          7.4e-01        0.83           0.87
## 5     3.6e-63          0.64          8.6e-02        0.59           0.79
## 6     6.8e-53          0.64          1.7e-03        0.57           0.85
## 7     3.1e+48          0.97          2.2e-04        0.41           0.96
## 8     1.7e+31          0.95          7.2e-04        0.93           0.97
## 9     7.9e-11          2.88          4.9e-05      108.90           0.58
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density      pH
## 4   1.6e-07                 1.0                 0.99 3.5e-24 1.7e-02
## 5   2.4e-07                 1.1                 0.98 7.1e+75 2.7e-02
## 6   1.1e-06                 1.1                 0.98 5.5e+61 3.8e-02
## 7   3.1e-09                 1.1                 0.98 1.7e-46 2.1e-01
## 8   8.3e-09                 1.1                 0.98 2.0e-31 1.7e-01
## 9  1.3e-108                 1.1                 0.98 2.7e-19 6.6e+05
##   sulphates alcohol
## 4        22    0.53
## 5        27    0.44
## 6       158    0.99
## 7       768    1.47
## 8       418    1.90
## 9        29    6.83

The relative risk/odds ratio for a one-unit increase in the variable alcohol is .534 for being in quality 4 vs. quality 3. The relative risk/odds ratio for a one-unit increase in the variable alcohol is .440 for being in quality 5 vs. quality 3.

predictedML <- predict(mlogit_model,wine_test,na.action =na.pass, type="probs")
predicted_classML <- predict(mlogit_model,wine_test)
#wine_train$quality
# confusion matrix
caret::confusionMatrix(as.factor(predicted_classML),as.factor(wine_test$quality))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    3    4    5    6    7    8    9
##          3    0    0    0    1    0    0    0
##          4    1    2    0    0    0    0    0
##          5   11   71  654  326   37    5    0
##          6    6   35  403 1004  394   55    1
##          7    0    0    6   83  112   34    1
##          8    2    0    0    1    0    0    0
##          9    0    0    0    2    0    1    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5456          
##                  95% CI : (0.5283, 0.5628)
##     No Information Rate : 0.4363          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2689          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 3  Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.0000000 0.0185185   0.6152   0.7085  0.20626
## Specificity          0.9996902 0.9996815   0.7941   0.5117  0.95416
## Pos Pred Value       0.0000000 0.6666667   0.5924   0.5290  0.47458
## Neg Pred Value       0.9938405 0.9673344   0.8092   0.6941  0.85691
## Prevalence           0.0061576 0.0332512   0.3273   0.4363  0.16718
## Detection Rate       0.0000000 0.0006158   0.2014   0.3091  0.03448
## Detection Prevalence 0.0003079 0.0009236   0.3399   0.5844  0.07266
## Balanced Accuracy    0.4998451 0.5091000   0.7046   0.6101  0.58021
##                       Class: 8  Class: 9
## Sensitivity          0.0000000 0.0000000
## Specificity          0.9990485 0.9990758
## Pos Pred Value       0.0000000 0.0000000
## Neg Pred Value       0.9707242 0.9993837
## Prevalence           0.0292488 0.0006158
## Detection Rate       0.0000000 0.0000000
## Detection Prevalence 0.0009236 0.0009236
## Balanced Accuracy    0.4995243 0.4995379

mean(as.character(predicted_classML) != as.character(wine_test$quality))

## [1] 0.4544335

# cm <- table(predicted_class, wine_train$quality)
# print(cm)

Confusion Matrix gives that Multinominal Logist model with all variables to have 55% accuracy and has around 45% accuracy rate

Calculation of important variables using Caret function -just to exlplore more

library(caret)

## Warning: package 'caret' was built under R version 3.3.2

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

mostImportantVariables <- varImp(mlogit_model)
mostImportantVariables$Variables <- row.names(mostImportantVariables)
mostImportantVariables <- mostImportantVariables[order(-mostImportantVariables$Overall),]
print(head(mostImportantVariables))

##                     Overall        Variables
## density          589.679102          density
## chlorides        331.212145        chlorides
## volatile.acidity  34.705737 volatile.acidity
## pH                27.671878               pH
## sulphates         27.496932        sulphates
## citric.acid        6.930296      citric.acid

Lean Multinomial Model

I used coefficients magnitude, p-value and accuracy to select key variables for lean model

mlogit_model1 <- multinom(quality ~ density + chlorides + volatile.acidity + alcohol,data =wine_train, maxit = 1000)

## # weights:  42 (30 variable)
## initial  value 6320.316164 
## iter  10 value 3916.389255
## iter  20 value 3559.370598
## iter  30 value 3550.506628
## iter  40 value 3543.044583
## iter  50 value 3539.028675
## iter  60 value 3535.377250
## iter  70 value 3534.781094
## iter  80 value 3533.483356
## iter  90 value 3532.290133
## iter 100 value 3531.652642
## iter 110 value 3529.226400
## iter 120 value 3528.372508
## iter 130 value 3527.551327
## iter 140 value 3526.868656
## iter 150 value 3526.476516
## iter 160 value 3525.900301
## iter 170 value 3525.717948
## iter 180 value 3525.456945
## iter 190 value 3524.778470
## iter 200 value 3524.729025
## iter 210 value 3524.662916
## iter 220 value 3524.474316
## iter 230 value 3524.414746
## iter 240 value 3524.385488
## final  value 3524.352714 
## converged

mlogit_model1

## Call:
## multinom(formula = quality ~ density + chlorides + volatile.acidity + 
##     alcohol, data = wine_train, maxit = 1000)
## 
## Coefficients:
##   (Intercept)    density   chlorides volatile.acidity    alcohol
## 4  197.446607 -189.78397   -8.725659        0.3750108 -0.5519724
## 5   93.071512  -78.53379   -7.709595       -1.6203411 -0.8769918
## 6  -18.363948   24.68042   -4.433698       -5.5818409  0.1228285
## 7 -103.692907  102.52426  -11.575178       -8.2251245  0.8614796
## 8 -134.141176  130.69747  -20.456602       -7.6528607  0.9480950
## 9   -1.390011  -11.06833 -112.577157      -10.9758124  1.6324814
## 
## Residual Deviance: 7048.705 
## AIC: 7108.705

mlogit_output1 <- summary(mlogit_model1)
mlogit_output1

## Call:
## multinom(formula = quality ~ density + chlorides + volatile.acidity + 
##     alcohol, data = wine_train, maxit = 1000)
## 
## Coefficients:
##   (Intercept)    density   chlorides volatile.acidity    alcohol
## 4  197.446607 -189.78397   -8.725659        0.3750108 -0.5519724
## 5   93.071512  -78.53379   -7.709595       -1.6203411 -0.8769918
## 6  -18.363948   24.68042   -4.433698       -5.5818409  0.1228285
## 7 -103.692907  102.52426  -11.575178       -8.2251245  0.8614796
## 8 -134.141176  130.69747  -20.456602       -7.6528607  0.9480950
## 9   -1.390011  -11.06833 -112.577157      -10.9758124  1.6324814
## 
## Std. Errors:
##   (Intercept)   density chlorides volatile.acidity   alcohol
## 4    1.729836  1.728648 5.1367043         1.439356 0.3151521
## 5    9.446285  9.337721 4.1003520         1.384156 0.3035482
## 6   12.509405 12.377231 4.0660773         1.395951 0.3025122
## 7    3.215013  3.203924 4.9611188         1.464048 0.3044135
## 8    1.853929  1.726059 8.0632289         1.714886 0.3167942
## 9    4.752426  4.792886 0.3245852         6.887989 0.8084408
## 
## Residual Deviance: 7048.705 
## AIC: 7108.705

mlogit_output1$coefficients

##   (Intercept)    density   chlorides volatile.acidity    alcohol
## 4  197.446607 -189.78397   -8.725659        0.3750108 -0.5519724
## 5   93.071512  -78.53379   -7.709595       -1.6203411 -0.8769918
## 6  -18.363948   24.68042   -4.433698       -5.5818409  0.1228285
## 7 -103.692907  102.52426  -11.575178       -8.2251245  0.8614796
## 8 -134.141176  130.69747  -20.456602       -7.6528607  0.9480950
## 9   -1.390011  -11.06833 -112.577157      -10.9758124  1.6324814

mlogit_output1$standard.errors

##   (Intercept)   density chlorides volatile.acidity   alcohol
## 4    1.729836  1.728648 5.1367043         1.439356 0.3151521
## 5    9.446285  9.337721 4.1003520         1.384156 0.3035482
## 6   12.509405 12.377231 4.0660773         1.395951 0.3025122
## 7    3.215013  3.203924 4.9611188         1.464048 0.3044135
## 8    1.853929  1.726059 8.0632289         1.714886 0.3167942
## 9    4.752426  4.792886 0.3245852         6.887989 0.8084408

z <- mlogit_output1$coefficients/mlogit_output1$standard.errors
p <- (1-pnorm(abs(z),0,1))*2

Table of Coefficient,standard error, z statand, p value for Probability of Quality 5 ( Quality 5 has the highest probability))

Pquality5 <- rbind(mlogit_output1$coefficients[2, ],mlogit_output1$standard.errors[2, ],z[2, ],p[2, ])
Pquality5

##      (Intercept)    density   chlorides volatile.acidity     alcohol
## [1,]   93.071512 -78.533789 -7.70959532       -1.6203411 -0.87699180
## [2,]    9.446285   9.337721  4.10035199        1.3841564  0.30354821
## [3,]    9.852711  -8.410381 -1.88022768       -1.1706344 -2.88913517
## [4,]    0.000000   0.000000  0.06007705        0.2417458  0.00386303

rownames(Pquality5) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pquality5)

	(Intercept)	density	chlorides	volatile.acidity	alcohol
Coefficient	93.071512	-78.533789	-7.7095953	-1.6203411	-0.8769918
Std. Errors	9.446285	9.337721	4.1003520	1.3841564	0.3035482
z stat	9.852711	-8.410381	-1.8802277	-1.1706344	-2.8891352
p value	0.000000	0.000000	0.0600771	0.2417458	0.0038630
## extract the	coefficients	from the mode	l and exponen	tiate

exp(coef(mlogit_output1))

##    (Intercept)      density    chlorides volatile.acidity   alcohol
## 4 5.623049e+85 3.783286e-83 1.623657e-04     1.4550071084 0.5758130
## 5 2.632960e+40 7.820037e-35 4.485029e-04     0.1978312038 0.4160325
## 6 1.058372e-08 5.230815e+10 1.187051e-02     0.0037656270 1.1306905
## 7 9.262812e-46 3.355214e+44 9.396453e-06     0.0002678390 2.3666598
## 8 5.536398e-59 5.770217e+56 1.305602e-09     0.0004746842 2.5807887
## 9 2.490726e-01 1.559865e-05 1.283400e-49     0.0000171106 5.1165550

mlogit_output1$coefficients

##   (Intercept)    density   chlorides volatile.acidity    alcohol
## 4  197.446607 -189.78397   -8.725659        0.3750108 -0.5519724
## 5   93.071512  -78.53379   -7.709595       -1.6203411 -0.8769918
## 6  -18.363948   24.68042   -4.433698       -5.5818409  0.1228285
## 7 -103.692907  102.52426  -11.575178       -8.2251245  0.8614796
## 8 -134.141176  130.69747  -20.456602       -7.6528607  0.9480950
## 9   -1.390011  -11.06833 -112.577157      -10.9758124  1.6324814

Evaluation of Multinomial Logistics Models

There is no R-square for a logit model but we may use a McFadden psuedo R2 using pscl package to get this and other fit statistics

library(pscl)

## Classes and Methods for R developed in the

## Political Science Computational Laboratory

## Department of Political Science

## Stanford University

## Simon Jackman

## hurdle and zeroinfl functions by Achim Zeileis

mlogit_ModelFit<- rbind(pscl::pR2(mlogit_model)[1:6],pscl::pR2(mlogit_model1)[1:6])

## fitting null model for pseudo-r2
## # weights:  14 (6 variable)
## initial  value 6320.316164 
## iter  10 value 4123.979306
## iter  20 value 4118.078251
## final  value 4118.060977 
## converged
## fitting null model for pseudo-r2
## # weights:  14 (6 variable)
## initial  value 6320.316164 
## iter  10 value 4123.979306
## iter  20 value 4118.078251
## final  value 4118.060977 
## converged

rownames(mlogit_ModelFit) <- c("Model-0", "Model-1")
print(mlogit_ModelFit, digits = 2)

##           llh llhNull   G2 McFadden r2ML r2CU
## Model-0 -3436   -4118 1363     0.17 0.34 0.37
## Model-1 -3524   -4118 1187     0.14 0.31 0.33

Calculation of Multinomial Model error

mlogit_ModelError <- as.data.frame(rbind(cbind(mlogit_output$deviance,mlogit_output$AIC)),cbind(mlogit_output1$deviance,mlogit_output1$AIC))
                                   
names(mlogit_ModelError) <- c("Deviance", "AIC")
print(mlogit_ModelError, digits = 3)

##   Deviance  AIC
## 1     6873 7017

We observe that Model 0 with all variables has highest McFadden (0.17) and r2 scores and lowest deviance and AIC

Accuracy of Train data

predictedML1 <- predict(mlogit_model1,wine_train,na.action =na.pass, type="probs")
predicted_classML1 <- predict(mlogit_model1,wine_train)

caret::confusionMatrix(as.factor(predicted_classML1),as.factor(wine_train$quality))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   4  57 657 345  37  12   0
##          6   6  49 414 986 397  67   1
##          7   0   2   4  88 101  19   2
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5369          
##                  95% CI : (0.5196, 0.5542)
##     No Information Rate : 0.4369          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2501          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.6112   0.6949   0.1888  0.00000
## Specificity          1.000000  1.00000   0.7906   0.4893   0.9576  1.00000
## Pos Pred Value            NaN      NaN   0.5908   0.5135   0.4676      NaN
## Neg Pred Value       0.996921  0.96675   0.8043   0.6739   0.8569  0.96983
## Prevalence           0.003079  0.03325   0.3310   0.4369   0.1647  0.03017
## Detection Rate       0.000000  0.00000   0.2023   0.3036   0.0311  0.00000
## Detection Prevalence 0.000000  0.00000   0.3424   0.5911   0.0665  0.00000
## Balanced Accuracy    0.500000  0.50000   0.7009   0.5921   0.5732  0.50000
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9990764
## Prevalence           0.0009236
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

mean(as.character(predicted_classML1) != as.character(wine_train$quality))

## [1] 0.4630542

Accuracy on test data

predictedML1 <- predict(mlogit_model1,wine_test,na.action =na.pass, type="probs")
predicted_classML1 <- predict(mlogit_model1,wine_test)
caret::confusionMatrix(as.factor(predicted_classML1),as.factor(wine_test$quality))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   1   1   0   0   0   0   0
##          5  10  66 647 363  32   5   0
##          6   9  41 411 974 421  59   1
##          7   0   0   5  80  90  31   1
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5271          
##                  95% CI : (0.5098, 0.5444)
##     No Information Rate : 0.4363          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2358          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3  Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.000000 0.0092593   0.6087   0.6874  0.16575
## Specificity          1.000000 0.9996815   0.7822   0.4855  0.95675
## Pos Pred Value            NaN 0.5000000   0.5761   0.5084  0.43478
## Neg Pred Value       0.993842 0.9670364   0.8042   0.6674  0.85104
## Prevalence           0.006158 0.0332512   0.3273   0.4363  0.16718
## Detection Rate       0.000000 0.0003079   0.1992   0.2999  0.02771
## Detection Prevalence 0.000000 0.0006158   0.3458   0.5899  0.06373
## Balanced Accuracy    0.500000 0.5044704   0.6954   0.5864  0.56125
##                      Class: 8  Class: 9
## Sensitivity           0.00000 0.0000000
## Specificity           1.00000 1.0000000
## Pos Pred Value            NaN       NaN
## Neg Pred Value        0.97075 0.9993842
## Prevalence            0.02925 0.0006158
## Detection Rate        0.00000 0.0000000
## Detection Prevalence  0.00000 0.0000000
## Balanced Accuracy     0.50000 0.5000000

mean(as.character(predicted_classML1) != as.character(wine_test$quality))

## [1] 0.4729064

We observe that Test sample accuracy is around 53% (in both train and test data) with just four predictors as compared to an accuracy of 54% with all 11 variables. However, A misclassification error of 47.2% seems high. It seems that given variables are not as good in explaining the quality. In this case of multinomial regression, The category to which an outcome quality belongs to, does not assume any order in it. In reality,the 7 categories of quality have a natural order. So, Ordinal regression might improve the result

Question 2-Use support vector machine (SVM) to estimate the model. Treat quality as a multiple categorical variable. (40 points)

In order to create a SVR model with R we will need the package e1071.

# set.seed(200)
#install.packages("e1071")
library(e1071)

## 
## Attaching package: 'e1071'

## The following objects are masked from 'package:moments':
## 
##     kurtosis, moment, skewness

## The following object is masked from 'package:Hmisc':
## 
##     impute

#as.factor(quality)

Model with all variables-This SVM logistic regression model uses quality for catagorical nominal dependent variable.

modelsvm <- svm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid + chlorides + pH, data=wine_train)
summary(modelsvm)

## 
## Call:
## svm(formula = quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + citric.acid + chlorides + pH, data = wine_train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.09090909 
## 
## Number of Support Vectors:  2913
## 
##  ( 1274 894 526 98 108 3 10 )
## 
## 
## Number of Classes:  7 
## 
## Levels: 
##  3 4 5 6 7 8 9

predictedsvm <- predict(modelsvm,wine_train)
#predictedsvm

In the code above, I performed an epsilon-regression, and did not set any value for gama, but it took a default value of 0.1. There is also a cost parameter which we can change to avoid overfitting. A common way to measure error in machine learning is to use the Root Mean Squared Error (RMSE) so I will use it to compute the RMSE I took the square root and we get the RMSE. RMSE= sqrt(MSE)

rmse <- function(error)
{
  sqrt(mean(error^2))
}
 
#error <- model$residuals
error <- as.numeric(wine_train$quality) - as.numeric(predictedsvm)
svrPredictionRMSE <- rmse(error)
svrPredictionRMSE

## [1] 0.7321353

RMSE Error is around 73%

Model Tuning-The code is not run in R markdown as it takes lot of time- Result are used in the table below #tune$best.parameters # gamma cost #100 1 5

#obj <- tune.svm(modelsvm, data=wine_train, gamma= seq(.1,1,by=.1), cost= seq(1,10, by =1))
#obj$best.parameters
#plot(obj)

Final Parameters for SVM Model Gamma = 1 and Cost 5

modelsvm_tuned <- svm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid+ chlorides + pH, data = wine_train, gamma = 1, cost = 5)

wine_test_svmpredict <- predict(modelsvm_tuned,wine_test,na.action = na.pass, type="probs")

Confusion matrix and misclassification error

# caret::confusionMatrix(as.factor(wine_train_svmpredict),as.factor(wine_train$quality))
# mean(as.character(wine_train_svmpredict)!= as.character(wine_train$quality))

The model on train data test gives an accuracy of 55% and test data at 60%

Lean Model with density + chlorides + volatile.acidity + alcohol

modelsvm_tuned1 <- svm(quality ~ density + chlorides + volatile.acidity + alcohol, data = wine_train, gamma = 1, cost = 5)
wine_test_svmpredict1 <- predict(modelsvm_tuned1,wine_test,na.action = na.pass, type="class")

# Cross tabs or confusion matrix
caret::confusionMatrix(as.factor(wine_test_svmpredict1),as.factor(wine_test$quality))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    3    4    5    6    7    8    9
##          3    0    0    1    2    0    0    0
##          4    1    5    1    2    0    0    0
##          5   10   62  637  318   26    0    0
##          6    8   39  410 1020  412   72    2
##          7    1    2   14   74  103   21    0
##          8    0    0    0    1    2    2    0
##          9    0    0    0    0    0    0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.544           
##                  95% CI : (0.5267, 0.5613)
##     No Information Rate : 0.4363          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2634          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.0000000 0.046296   0.5992   0.7198  0.18969
## Specificity          0.9990706 0.998726   0.8096   0.4850  0.95860
## Pos Pred Value       0.0000000 0.555556   0.6049   0.5196  0.47907
## Neg Pred Value       0.9938367 0.968200   0.8059   0.6911  0.85493
## Prevalence           0.0061576 0.033251   0.3273   0.4363  0.16718
## Detection Rate       0.0000000 0.001539   0.1961   0.3140  0.03171
## Detection Prevalence 0.0009236 0.002771   0.3242   0.6044  0.06619
## Balanced Accuracy    0.4995353 0.522511   0.7044   0.6024  0.57414
##                       Class: 8  Class: 9
## Sensitivity          0.0210526 0.0000000
## Specificity          0.9990485 1.0000000
## Pos Pred Value       0.4000000       NaN
## Neg Pred Value       0.9713228 0.9993842
## Prevalence           0.0292488 0.0006158
## Detection Rate       0.0006158 0.0000000
## Detection Prevalence 0.0015394 0.0000000
## Balanced Accuracy    0.5100506 0.5000000

mean(as.character(wine_test_svmpredict1) != as.character(wine_test$quality))

## [1] 0.4559729

Accuracy with lean model is around 55% # Question-3 -Treat quality as a continuous variable. Estimate a linear regression model and compare the output with multinomial regression and SVM. # Invoking required R packages

library(regclass)

## Loading required package: bestglm

## Loading required package: leaps

## Loading required package: VGAM

## Loading required package: stats4

## Loading required package: splines

## 
## Attaching package: 'VGAM'

## The following object is masked from 'package:caret':
## 
##     predictors

## The following objects are masked from 'package:psych':
## 
##     fisherz, logistic, logit

## Loading required package: rpart

## Warning: package 'rpart' was built under R version 3.3.2

## Important regclass change from 1.3 to 1.4:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.

## 
## Attaching package: 'regclass'

## The following object is masked from 'package:lattice':
## 
##     qq

library(leaps)
library(bestglm)

Create the initial multinominal Linear Model with quality as a continuous dependent variable and Model with all other variables as predictors

linear1 <- as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid+ chlorides + pH

lm1 <- lm(formula =linear1 ,data=wine_train)

summary(lm1)

## 
## Call:
## lm(formula = linear1, data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2807 -0.4751 -0.0391  0.4639  2.6654 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.115e+01  1.755e+01   2.914 0.003596 ** 
## alcohol               2.666e-01  2.467e-02  10.809  < 2e-16 ***
## volatile.acidity     -1.292e+00  1.070e-01 -12.073  < 2e-16 ***
## free.sulfur.dioxide   6.837e-03  1.105e-03   6.189 6.81e-10 ***
## sulphates             6.632e-01  1.083e-01   6.124 1.02e-09 ***
## total.sulfur.dioxide -1.960e-03  3.993e-04  -4.910 9.58e-07 ***
## density              -5.245e+01  1.790e+01  -2.929 0.003420 ** 
## residual.sugar        3.597e-02  7.420e-03   4.847 1.31e-06 ***
## fixed.acidity         8.273e-02  2.258e-02   3.663 0.000253 ***
## citric.acid          -3.029e-02  1.132e-01  -0.268 0.789088    
## chlorides            -1.584e-01  4.724e-01  -0.335 0.737361    
## pH                    4.267e-01  1.294e-01   3.297 0.000987 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7364 on 3236 degrees of freedom
## Multiple R-squared:  0.283,  Adjusted R-squared:  0.2805 
## F-statistic: 116.1 on 11 and 3236 DF,  p-value: < 2.2e-16

confint(lm1, level=0.95)

##                              2.5 %        97.5 %
## (Intercept)           16.730035583  85.568370295
## alcohol                0.218273317   0.315011040
## volatile.acidity      -1.501791650  -1.082139157
## free.sulfur.dioxide    0.004671041   0.009002888
## sulphates              0.450886635   0.875599455
## total.sulfur.dioxide  -0.002743486  -0.001177562
## density              -87.550186497 -17.343374106
## residual.sugar         0.021416824   0.050513314
## fixed.acidity          0.038444681   0.127008502
## citric.acid           -0.252309480   0.191725317
## chlorides             -1.084743426   0.767854626
## pH                     0.172946115   0.680371329

anova(lm1)

## Analysis of Variance Table
## 
## Response: as.numeric(quality)
##                        Df  Sum Sq Mean Sq  F value    Pr(>F)    
## alcohol                 1  463.93  463.93 855.6230 < 2.2e-16 ***
## volatile.acidity        1  163.64  163.64 301.7982 < 2.2e-16 ***
## free.sulfur.dioxide     1   10.28   10.28  18.9631 1.374e-05 ***
## sulphates               1   24.41   24.41  45.0181 2.295e-11 ***
## total.sulfur.dioxide    1    8.64    8.64  15.9383 6.689e-05 ***
## density                 1    7.40    7.40  13.6534 0.0002235 ***
## residual.sugar          1    5.54    5.54  10.2148 0.0014066 ** 
## fixed.acidity           1    1.92    1.92   3.5405 0.0599771 .  
## citric.acid             1    0.22    0.22   0.4100 0.5220350    
## chlorides               1    0.58    0.58   1.0667 0.3017793    
## pH                      1    5.89    5.89  10.8717 0.0009870 ***
## Residuals            3236 1754.61    0.54                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Evaluation calculates Residual standard error = 0.74, Multi R2 = 0.28, and Adj R2 = 0.28. Anova evaluation finds Citric Acid, Chlorides, and Fixed acidity are not significant in this model. We can take log of variable which have high skewness to make it normal. I tried taking log of chloride and other variable but there is not much change in R-square even after making them normal. This may be due to Central limit theorem for large sample.

Graphical Analysis of Liner Regression assumption

Assumption 1- residuals plot can be used to assess the assumption that the variables have a linear relationship

unstandardizedPredicted <- predict(lm1)
unstandardizedResiduals <- resid(lm1)
 #get standardized values
standardizedPredicted <- (unstandardizedPredicted - mean(unstandardizedPredicted)) / sd(unstandardizedPredicted)
standardizedResiduals <- (unstandardizedResiduals - mean(unstandardizedResiduals)) / sd(unstandardizedResiduals)
#create standardized residuals plot
plot(standardizedPredicted, standardizedResiduals, main = "Standardized Residuals Plot", xlab = "Standardized Predicted Values", ylab = "Standardized Residuals")
#add horizontal line
abline(0,0)

Explanation of Residual plot- values that are close to the horizontal line are predicted well. The points above the line are underpredicted and the ones below the line are overpredicted. The linearity assumption is supported to the extent that the amount of points scattered above and below the line is equal.

Residual histogram- using a histogram we may assess the assumption that the residuals are normally distributed

#create residuals histogram
hist(standardizedResiduals, freq = FALSE)
#add normal curve
curve(dnorm, add = TRUE)

# Use of PP-plot to access the assumption and residuals are normally distributed

#get probability distribution for residuals
probDist <- pnorm(standardizedResiduals)
#create PP plot
plot(ppoints(length(standardizedResiduals)), sort(probDist), main = "PP Plot", xlab = "Observed Probability", ylab = "Expected Probability")
#add diagonal line
abline(0,1)

Here, the distribution is considered to be normal to the extent that the plotted points match the diagonal line.

Points in the residual plor are dispersed randomly

Stepwise variable selection

####  Create Linear Model for Variable Selection
lmdVarSelect <- lm(formula = linear1, data = wine_train)

###   Perform Step-Wise Variable Selection
step <- stepAIC(lmdVarSelect, direction="both")

## Start:  AIC=-1976.1
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + citric.acid + chlorides + pH
## 
##                        Df Sum of Sq    RSS     AIC
## - citric.acid           1     0.039 1754.7 -1978.0
## - chlorides             1     0.061 1754.7 -1978.0
## <none>                              1754.6 -1976.1
## - density               1     4.653 1759.3 -1969.5
## - pH                    1     5.895 1760.5 -1967.2
## - fixed.acidity         1     7.275 1761.9 -1964.7
## - residual.sugar        1    12.739 1767.3 -1954.6
## - total.sulfur.dioxide  1    13.069 1767.7 -1954.0
## - sulphates             1    20.333 1774.9 -1940.7
## - free.sulfur.dioxide   1    20.770 1775.4 -1939.9
## - alcohol               1    63.346 1818.0 -1862.9
## - volatile.acidity      1    79.027 1833.6 -1835.0
## 
## Step:  AIC=-1978.02
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + chlorides + pH
## 
##                        Df Sum of Sq    RSS     AIC
## - chlorides             1     0.079 1754.7 -1979.9
## <none>                              1754.7 -1978.0
## + citric.acid           1     0.039 1754.6 -1976.1
## - density               1     4.644 1759.3 -1971.4
## - pH                    1     5.968 1760.6 -1969.0
## - fixed.acidity         1     7.471 1762.1 -1966.2
## - residual.sugar        1    12.703 1767.3 -1956.6
## - total.sulfur.dioxide  1    13.743 1768.4 -1954.7
## - sulphates             1    20.306 1775.0 -1942.7
## - free.sulfur.dioxide   1    20.805 1775.5 -1941.7
## - alcohol               1    63.424 1818.1 -1864.7
## - volatile.acidity      1    89.821 1844.5 -1817.9
## 
## Step:  AIC=-1979.88
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + pH
## 
##                        Df Sum of Sq    RSS     AIC
## <none>                              1754.7 -1979.9
## + chlorides             1     0.079 1754.7 -1978.0
## + citric.acid           1     0.057 1754.7 -1978.0
## - density               1     5.144 1759.9 -1972.4
## - pH                    1     6.578 1761.3 -1969.7
## - fixed.acidity         1     7.908 1762.6 -1967.3
## - total.sulfur.dioxide  1    13.665 1768.4 -1956.7
## - residual.sugar        1    13.986 1768.7 -1956.1
## - sulphates             1    20.676 1775.4 -1943.8
## - free.sulfur.dioxide   1    20.731 1775.5 -1943.7
## - alcohol               1    63.378 1818.1 -1866.6
## - volatile.acidity      1    92.316 1847.0 -1815.3

step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + citric.acid + chlorides + pH
## 
## Final Model:
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + 
##     sulphates + total.sulfur.dioxide + density + residual.sugar + 
##     fixed.acidity + pH
## 
## 
##            Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                  3236   1754.609 -1976.097
## 2 - citric.acid  1 0.03880406      3237   1754.648 -1978.025
## 3   - chlorides  1 0.07898424      3238   1754.727 -1979.879

Model with key predictors

I createad a Linear Model 2 using only Volatile Acidity, Alcohol, free suphur dioxide and Sulphates as independent variables. From the above table, we see that these four variables contribute for most of the variation in the data

lm2 <-lm(as.numeric(quality)~ alcohol + free.sulfur.dioxide + sulphates + volatile.acidity, data=wine_train)
summary(lm2)

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol + free.sulfur.dioxide + 
##     sulphates + volatile.acidity, data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2730 -0.4754 -0.0287  0.4570  2.5672 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.4272306  0.1434983   2.977  0.00293 ** 
## alcohol              0.3232049  0.0112053  28.844  < 2e-16 ***
## free.sulfur.dioxide  0.0042448  0.0008299   5.115 3.33e-07 ***
## sulphates            0.5978010  0.0897634   6.660 3.21e-11 ***
## volatile.acidity    -1.3019892  0.0848539 -15.344  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 3243 degrees of freedom
## Multiple R-squared:  0.2706, Adjusted R-squared:  0.2697 
## F-statistic: 300.8 on 4 and 3243 DF,  p-value: < 2.2e-16

confint(lm2, level=0.95)

##                            2.5 %      97.5 %
## (Intercept)          0.145874042  0.70858725
## alcohol              0.301234711  0.34517513
## free.sulfur.dioxide  0.002617525  0.00587201
## sulphates            0.421802302  0.77379964
## volatile.acidity    -1.468361820 -1.13561661

anova(lm2)

## Analysis of Variance Table
## 
## Response: as.numeric(quality)
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## alcohol                1  463.93  463.93  842.97 < 2.2e-16 ***
## free.sulfur.dioxide    1   59.25   59.25  107.65 < 2.2e-16 ***
## sulphates              1    9.51    9.51   17.28 3.309e-05 ***
## volatile.acidity       1  129.57  129.57  235.44 < 2.2e-16 ***
## Residuals           3243 1784.81    0.55                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error = 0.74, Multi R2 = 0.28, and Adj R2 = 0.27. Anova indicates th all variables are significant in this model

Residual plot analysis

plot(residuals(lm2))

Points in the residual plor are randomly dispersed indicating that linear regression model may be appropriate for the data

An evaluation of both Linear Model 1 and Linear Model 2 concludes that neither model is well suited for predicting this data

Root mean square Error of linear model

rmse <- function(error)
{
  sqrt(mean(error^2))
}
 
error <- lm2$residuals  # same as data$Y - predictedY
predictionRMSE <- rmse(error) 
predictionRMSE

## [1] 0.7412893

Error is 74.5%- It seems that Liner regression is underestimating the true nature of the relationship and interesting things are being overlooked.

An evaluation of both Linear Model 1 and Linear Model 2 concludes that neither model is well suited for predicting this data. The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1-p) is a linear function of the regressors. When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model. The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS).

Question 4-Note that so far we treated quality as either categorical or continuous. Is there any other way you can model it? If the answer is yes, which is that way?

Yes, we can model it using Ordinal logistic regression since quality variable is ordered. Ordinal logistic regression is used to predict an ordinal dependent variable given one or more independent variables. This will be enable us to determine which of our independent variables (if any) have a statistically significant effect on our dependent variable

invoking required packages (c(“MASS”, “ORDINAL”, “ERER”))

library(MASS)
library(ordinal)

## 
## Attaching package: 'ordinal'

## The following objects are masked from 'package:VGAM':
## 
##     dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel, wine

## The following object is masked from 'package:psych':
## 
##     income

library(erer)

## Loading required package: lmtest

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'lmtest'

## The following object is masked from 'package:VGAM':
## 
##     lrtest

Covnert quality variable from factor to ordered varible

wine_train$newquality <- ordered(wine_train$quality, levels = c(3,4,5,6,7,8,9))
wine_test$newquality <- ordered(wine_test$quality, levels = c(3,4,5,6,7,8,9))
str(wine_train)

## 'data.frame':    3248 obs. of  13 variables:
##  $ fixed.acidity       : num  6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
##  $ volatile.acidity    : num  0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
##  $ citric.acid         : num  0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
##  $ residual.sugar      : num  1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
##  $ chlorides           : num  0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
##  $ free.sulfur.dioxide : num  20 43 44 5 27 48 45 54 27 10 ...
##  $ total.sulfur.dioxide: num  74 157 166 10 69 127 178 216 80 136 ...
##  $ density             : num  0.993 0.994 0.996 0.996 0.999 ...
##  $ pH                  : num  3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
##  $ sulphates           : num  0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
##  $ alcohol             : num  9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 3 4 4 4 3 4 4 4 ...
##  $ newquality          : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...

ord<-clm(formula= (newquality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol), data= wine_train, maxit = 500)

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

summary(ord)

## formula: 
## newquality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol
## data:    wine_train
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  3248 -3543.66 7121.33 8(0)  1.11e-10 2.4e+11
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## fixed.acidity         2.025e-01  5.946e-02   3.405  0.00066 ***
## volatile.acidity     -3.581e+00  2.958e-01 -12.106  < 2e-16 ***
## citric.acid          -2.591e-02  2.957e-01  -0.088  0.93018    
## residual.sugar        9.470e-02  1.961e-02   4.829 1.37e-06 ***
## chlorides            -3.176e-02  1.257e+00  -0.025  0.97984    
## free.sulfur.dioxide   1.695e-02  2.931e-03   5.784 7.28e-09 ***
## total.sulfur.dioxide -5.497e-03  1.059e-03  -5.191 2.09e-07 ***
## density              -1.364e+02  4.748e+01  -2.873  0.00407 ** 
## pH                    1.163e+00  3.412e-01   3.410  0.00065 ***
## sulphates             1.794e+00  2.819e-01   6.365 1.95e-10 ***
## alcohol               7.268e-01  6.631e-02  10.960  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 3|4  -129.23      46.55  -2.776
## 4|5  -126.69      46.55  -2.722
## 5|6  -123.48      46.55  -2.653
## 6|7  -120.89      46.55  -2.597
## 7|8  -118.60      46.54  -2.548
## 8|9  -115.01      46.55  -2.471

Lean Model

Chloride and citric acid are not significant and hence removed,

ord1 <- clm(formula=(newquality~ density + volatile.acidity + alcohol + fixed.acidity + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates), data= wine_train, maxit = 500)

## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables? 
## In addition: Absolute and relative convergence criteria were met

summary(ord1)

## formula: 
## newquality ~ density + volatile.acidity + alcohol + fixed.acidity + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates
## data:    wine_train
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  3248 -3543.67 7117.34 8(0)  9.23e-10 2.3e+11
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## density              -1.368e+02  4.636e+01  -2.951 0.003166 ** 
## volatile.acidity     -3.573e+00  2.739e-01 -13.048  < 2e-16 ***
## alcohol               7.264e-01  6.616e-02  10.979  < 2e-16 ***
## fixed.acidity         2.016e-01  5.717e-02   3.526 0.000421 ***
## residual.sugar        9.482e-02  1.901e-02   4.988 6.10e-07 ***
## free.sulfur.dioxide   1.695e-02  2.927e-03   5.791 6.99e-09 ***
## total.sulfur.dioxide -5.509e-03  1.040e-03  -5.299 1.16e-07 ***
## pH                    1.168e+00  3.334e-01   3.503 0.000460 ***
## sulphates             1.791e+00  2.773e-01   6.458 1.06e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 3|4  -129.61      45.49  -2.849
## 4|5  -127.06      45.49  -2.793
## 5|6  -123.85      45.49  -2.723
## 6|7  -121.27      45.49  -2.666
## 7|8  -118.98      45.48  -2.616
## 8|9  -115.39      45.49  -2.537

essential metrics such as p-Value, CI, Odds ratio

ctable <- coef(summary(ord1))
ctable

##                           Estimate   Std. Error    z value     Pr(>|z|)
## 3|4                  -1.296080e+02 45.487867052  -2.849286 4.381742e-03
## 4|5                  -1.270638e+02 45.488143945  -2.793339 5.216691e-03
## 5|6                  -1.238518e+02 45.489904850  -2.722622 6.476609e-03
## 6|7                  -1.212670e+02 45.485700767  -2.666047 7.674892e-03
## 7|8                  -1.189792e+02 45.484010194  -2.615846 8.900671e-03
## 8|9                  -1.153877e+02 45.487234354  -2.536704 1.119015e-02
## density              -1.368017e+02 46.355170800  -2.951164 3.165784e-03
## volatile.acidity     -3.573303e+00  0.273860886 -13.047878 6.534194e-39
## alcohol               7.263572e-01  0.066157698  10.979179 4.812789e-28
## fixed.acidity         2.016080e-01  0.057173110   3.526274 4.214512e-04
## residual.sugar        9.481623e-02  0.019009257   4.987898 6.103988e-07
## free.sulfur.dioxide   1.695236e-02  0.002927325   5.791077 6.993664e-09
## total.sulfur.dioxide -5.509088e-03  0.001039653  -5.298967 1.164595e-07
## pH                    1.167840e+00  0.333399901   3.502822 4.603572e-04
## sulphates             1.790710e+00  0.277277652   6.458183 1.059679e-10

confidence intervals

ci <- confint(ord1)
ci

##                              2.5 %        97.5 %
## density              -2.278499e+02 -46.084314695
## volatile.acidity     -4.112110e+00  -3.038426422
## alcohol               5.968855e-01   0.856280014
## fixed.acidity         8.969627e-02   0.313909873
## residual.sugar        5.761315e-02   0.132142810
## free.sulfur.dioxide   1.122166e-02   0.022698435
## total.sulfur.dioxide -7.550016e-03  -0.003474023
## pH                    5.150373e-01   1.822209401
## sulphates             1.248044e+00   2.335222477

exp(coef(ord1))

##                  3|4                  4|5                  5|6 
##         5.152025e-57         6.559643e-56         1.628685e-54 
##                  6|7                  7|8                  8|9 
##         2.159734e-53         2.128168e-52         7.722836e-51 
##              density     volatile.acidity              alcohol 
##         3.870480e-60         2.806299e-02         2.067535e+00 
##        fixed.acidity       residual.sugar  free.sulfur.dioxide 
##         1.223368e+00         1.099457e+00         1.017097e+00 
## total.sulfur.dioxide                   pH            sulphates 
##         9.945061e-01         3.215042e+00         5.993705e+00

# Odds ratio  and  Confidence interval
# exp(cbind(OR = coef(ord1),ci))
str(wine_train)

## 'data.frame':    3248 obs. of  13 variables:
##  $ fixed.acidity       : num  6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
##  $ volatile.acidity    : num  0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
##  $ citric.acid         : num  0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
##  $ residual.sugar      : num  1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
##  $ chlorides           : num  0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
##  $ free.sulfur.dioxide : num  20 43 44 5 27 48 45 54 27 10 ...
##  $ total.sulfur.dioxide: num  74 157 166 10 69 127 178 216 80 136 ...
##  $ density             : num  0.993 0.994 0.996 0.996 0.999 ...
##  $ pH                  : num  3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
##  $ sulphates           : num  0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
##  $ alcohol             : num  9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 3 4 4 4 3 4 4 4 ...
##  $ newquality          : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...

str(wine_test)

## 'data.frame':    3248 obs. of  13 variables:
##  $ fixed.acidity       : num  5.9 9.1 8.3 5.8 6.9 7.4 8.5 6.8 5 7.3 ...
##  $ volatile.acidity    : num  0.21 0.27 0.28 0.29 0.27 0.55 0.17 0.41 0.27 0.65 ...
##  $ citric.acid         : num  0.31 0.32 0.48 0.38 0.25 0.19 0.74 0.31 0.32 0 ...
##  $ residual.sugar      : num  1.8 1.1 2.1 10.7 7.5 1.8 3.6 8.8 4.5 1.2 ...
##  $ chlorides           : num  0.033 0.031 0.093 0.038 0.03 0.082 0.05 0.084 0.032 0.065 ...
##  $ free.sulfur.dioxide : num  45 15 6 49 18 15 29 26 58 15 ...
##  $ total.sulfur.dioxide: num  142 151 12 136 117 34 128 45 178 21 ...
##  $ density             : num  0.99 0.994 0.994 0.994 0.991 ...
##  $ pH                  : num  3.35 3.03 3.26 3.11 3.09 3.49 3.28 3.38 3.45 3.39 ...
##  $ sulphates           : num  0.5 0.41 0.62 0.59 0.38 0.68 0.4 0.64 0.31 0.47 ...
##  $ alcohol             : num  12.7 10.6 12.4 11.2 13 10.5 12.4 10.1 12.6 10 ...
##  $ quality             : int  6 5 7 6 6 5 6 6 7 7 ...
##  $ newquality          : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 3 5 4 4 3 4 4 5 5 ...

Accuracy on Train data

predicted_class<- predict (ord, wine_train, type = "class")
predicted_class<- ordered(predicted_class$fit, levels = c(3, 4, 5, 6, 7, 8, 9)) ## Create as Ordered Factor).
str(predicted_class)

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 4 5 4 4 ...

str(wine_train$newquality)

##  Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...

cm_ols <- table(predicted_class, wine_train$newquality)
print(cm_ols)

##                
## predicted_class    3    4    5    6    7    8    9
##               3    0    0    0    0    0    0    0
##               4    0    0    2    0    0    0    0
##               5    5   61  610  311   43   10    0
##               6    5   46  460 1032  377   68    1
##               7    0    1    3   76  115   20    2
##               8    0    0    0    0    0    0    0
##               9    0    0    0    0    0    0    0

Accuracy <- sum(diag(cm_ols))/sum(cm_ols)
Accuracy

## [1] 0.5409483

The OLS model gives and accuracy of 56%