Summary of Answer
I modelled properties of red and white variants of Portuguese “Vinho Verde” wineto predict the quality of each wine as given in data set. It is given that the classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones.
First, I did exploratory data analysis and split data was in the ratio of 50/50 to create training and test datasets,respectively.
For each question one regression model with all variable was created. Then Each model was tuned to improve performance.The accuracy,R-square for linear model and misclassification rate was then determined for each model. The multinominal logistic regression model predicts train data with 55% accuracy, SVM logistic regression model predicts test data with 60% accuracy,whereas ordinal regression model without scaling predicts hold-out data with 56% accuracy. Liner regression gives an R-square around 28, which means only 28% of the variance in DV is explained by IV.
Conclusion: I expected Ordinal regression to provide the most accurate prediction since the dependent variable (quality) is ordered in the original dataset but it is giving accurcy of only 40% However,The model created using Support Vector Machines most accurately predicted the test data.
Installing required packages and sourcing libraries requiredfor this HW
# Structure and Summary for red wine
redwine<- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-red.csv")
redwine <- read.csv("C:/users/Amit/Dropbox/Dataset/winequality-red.csv", sep= ";")
View(redwine)
#str(redwine)
#summary(redwine)
# Structure and Summary for white wine
whitewine<- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-white.csv")
whitewine <- read.csv("C:/Users/Amit/Dropbox/Dataset/winequality-white.csv", sep=";")
View(whitewine)
#str(whitewine)
#summary(whitewine)
wine_redwhite <- rbind(whitewine, redwine)
class(wine_redwhite)
## [1] "data.frame"
View(wine_redwhite)
head(wine_redwhite)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
#str(wine_redwhite)
#summary(wine_redwhite)
Red wine data has 1,599 observation and quality variable has levels from 3 to 8. White wine data has 4898 observation and quality variable has levels from 3 to 9. Combined red and wine data has 6,497 observation and has levels from 3 to 9.
#Check for missing values
#colSums(is.na(wine_redwhite))
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
wine_corr <- cor(wine_redwhite)
summary(wine_corr)
## fixed.acidity volatile.acidity citric.acid
## Min. :-0.32905 Min. :-0.41448 Min. :-0.37798
## 1st Qu.:-0.14716 1st Qu.:-0.28741 1st Qu.: 0.02663
## Median : 0.07113 Median : 0.09068 Median : 0.09084
## Mean : 0.12095 Mean : 0.05921 Mean : 0.11282
## 3rd Qu.: 0.30578 3rd Qu.: 0.26391 3rd Qu.: 0.15565
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## residual.sugar chlorides free.sulfur.dioxide
## Min. :-0.35941 Min. :-0.27963 Min. :-0.35256
## 1st Qu.:-0.18845 1st Qu.:-0.19645 1st Qu.:-0.19010
## Median :-0.07448 Median : 0.04185 Median :-0.06007
## Mean : 0.10890 Mean : 0.12134 Mean : 0.08280
## 3rd Qu.: 0.42602 3rd Qu.: 0.36624 3rd Qu.: 0.20056
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## total.sulfur.dioxide density pH
## Min. :-0.41448 Min. :-0.68675 Min. :-0.32981
## 1st Qu.:-0.27670 1st Qu.: 0.02221 1st Qu.:-0.24198
## Median :-0.13990 Median : 0.17782 Median : 0.01560
## Mean : 0.04997 Mean : 0.17318 Mean : 0.03472
## 3rd Qu.: 0.27030 3rd Qu.: 0.38669 3rd Qu.: 0.13897
## Max. : 1.00000 Max. : 1.00000 Max. : 1.00000
## sulphates alcohol quality
## Min. :-0.27573 Min. :-0.68675 Min. :-0.305858
## 1st Qu.:-0.04875 1st Qu.:-0.25912 1st Qu.:-0.107724
## Median : 0.12416 Median :-0.06655 Median :-0.008737
## Mean : 0.15119 Mean :-0.02748 Mean : 0.059664
## 3rd Qu.: 0.26950 3rd Qu.: 0.02804 3rd Qu.: 0.062980
## Max. : 1.00000 Max. : 1.00000 Max. : 1.000000
corrplot(wine_corr,method="number")
# Skewness and kurtosis
#install.packages("psych")
#library("psych")
#describe(wine_redwhite)
# Histogram and table to get frquency of different qualities
#hist_1 <- hist(wine_redwhite$quality)
#hist_1
#print(hist_1$breaks)
# Change number of breaks and add labels
hist_2 <- hist(wine_redwhite$quality, breaks = 5, col ="lightblue", xlab = "Quality ", main= " Histogram of Quality rating")
table(wine_redwhite$quality)
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
From describe function output, we observe that skewness coefficient for quality is 0.19 and kurtosis is 0.23. Also, from the scatterplot matrix, we observe that there is not a linear relationship between the DV (Quality) and the IV(Predictors).
Next, I split the data into test and train data
set.seed(1235)
wine_redwhite <- wine_redwhite[sample(nrow(wine_redwhite)),]
split <- floor(nrow(wine_redwhite)/2)
wine_train <- wine_redwhite[0:split,]
wine_test <- wine_redwhite[(split+1):(nrow(wine_redwhite)-1),]
View(wine_train)
View(wine_test)
summary(wine_train)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.200 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.200
## Mean : 7.215 Mean :0.3425 Mean :0.3183 Mean : 5.546
## 3rd Qu.: 7.700 3rd Qu.:0.4100 3rd Qu.:0.3900 3rd Qu.: 8.200
## Max. :15.600 Max. :1.3300 Max. :1.0000 Max. :31.600
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.0 Min. : 6.0
## 1st Qu.:0.03800 1st Qu.: 17.0 1st Qu.: 77.0
## Median :0.04700 Median : 29.0 Median :118.0
## Mean :0.05608 Mean : 30.4 Mean :115.3
## 3rd Qu.:0.06400 3rd Qu.: 41.0 3rd Qu.:155.0
## Max. :0.61000 Max. :131.0 Max. :366.5
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.740 Min. :0.2200 Min. : 8.0
## 1st Qu.:0.9924 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.5
## Median :0.9949 Median :3.210 Median :0.5000 Median :10.3
## Mean :0.9947 Mean :3.218 Mean :0.5296 Mean :10.5
## 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.3
## Max. :1.0103 Max. :4.010 Max. :2.0000 Max. :14.2
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.821
## 3rd Qu.:6.000
## Max. :9.000
str(wine_train)
## 'data.frame': 3248 obs. of 12 variables:
## $ fixed.acidity : num 6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
## $ volatile.acidity : num 0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
## $ citric.acid : num 0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
## $ residual.sugar : num 1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
## $ chlorides : num 0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
## $ free.sulfur.dioxide : num 20 43 44 5 27 48 45 54 27 10 ...
## $ total.sulfur.dioxide: num 74 157 166 10 69 127 178 216 80 136 ...
## $ density : num 0.993 0.994 0.996 0.996 0.999 ...
## $ pH : num 3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
## $ sulphates : num 0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
## $ alcohol : num 9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
## $ quality : int 6 6 5 6 6 6 5 6 6 6 ...
str(wine_test)
## 'data.frame': 3248 obs. of 12 variables:
## $ fixed.acidity : num 5.9 9.1 8.3 5.8 6.9 7.4 8.5 6.8 5 7.3 ...
## $ volatile.acidity : num 0.21 0.27 0.28 0.29 0.27 0.55 0.17 0.41 0.27 0.65 ...
## $ citric.acid : num 0.31 0.32 0.48 0.38 0.25 0.19 0.74 0.31 0.32 0 ...
## $ residual.sugar : num 1.8 1.1 2.1 10.7 7.5 1.8 3.6 8.8 4.5 1.2 ...
## $ chlorides : num 0.033 0.031 0.093 0.038 0.03 0.082 0.05 0.084 0.032 0.065 ...
## $ free.sulfur.dioxide : num 45 15 6 49 18 15 29 26 58 15 ...
## $ total.sulfur.dioxide: num 142 151 12 136 117 34 128 45 178 21 ...
## $ density : num 0.99 0.994 0.994 0.994 0.991 ...
## $ pH : num 3.35 3.03 3.26 3.11 3.09 3.49 3.28 3.38 3.45 3.39 ...
## $ sulphates : num 0.5 0.41 0.62 0.59 0.38 0.68 0.4 0.64 0.31 0.47 ...
## $ alcohol : num 12.7 10.6 12.4 11.2 13 10.5 12.4 10.1 12.6 10 ...
## $ quality : int 6 5 7 6 6 5 6 6 7 7 ...
summary(wine_test)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.700
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 2.900
## Mean : 7.215 Mean :0.3369 Mean :0.3189 Mean : 5.341
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 7.900
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.009 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.038 1st Qu.: 17.00 1st Qu.: 79.0 1st Qu.:0.9923
## Median :0.047 Median : 29.00 Median :118.0 Median :0.9948
## Mean :0.056 Mean : 30.66 Mean :116.2 Mean :0.9947
## 3rd Qu.:0.066 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9969
## Max. :0.611 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2300 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5329 Mean :10.48 Mean :5.815
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :1.9800 Max. :14.90 Max. :9.000
plot(wine_test)
Question 1 -Build a logistic regression model using “quality” as the target variable. Note that you don’t have binary classification task any more. For this you will have to use multinomial logistic regression. However, you can still interpret the model output (i.e., statistical significance of the coefficients, etc. exactly the same way as binary logistic regression). (40 points)
library(nnet)
wine_train$quality <- as.factor(wine_train$quality)
mlogit_model <- multinom(quality ~. ,data =wine_train, maxit = 1000)
## # weights: 91 (72 variable)
## initial value 6320.316164
## iter 10 value 4320.497806
## iter 20 value 4060.606046
## iter 30 value 3854.530346
## iter 40 value 3506.654411
## iter 50 value 3457.637350
## iter 60 value 3450.835013
## iter 70 value 3447.147730
## iter 80 value 3445.537837
## iter 90 value 3444.981576
## iter 100 value 3442.950730
## iter 110 value 3441.461199
## iter 120 value 3440.940957
## iter 130 value 3440.306791
## iter 140 value 3439.559266
## iter 150 value 3439.034524
## iter 160 value 3439.019201
## iter 170 value 3438.970730
## iter 180 value 3438.428701
## iter 190 value 3438.316035
## iter 200 value 3438.089704
## iter 210 value 3437.677310
## iter 220 value 3437.491963
## iter 230 value 3437.346085
## iter 240 value 3437.311541
## iter 250 value 3437.258707
## iter 260 value 3437.160001
## iter 270 value 3437.054373
## iter 280 value 3436.643428
## iter 290 value 3436.594604
## iter 300 value 3436.494691
## iter 310 value 3436.487621
## final value 3436.487177
## converged
mlogit_model
## Call:
## multinom(formula = quality ~ ., data = wine_train, maxit = 1000)
##
## Coefficients:
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 80.34737 -0.40121812 -0.3009899 -0.18609933 -0.13430930
## 5 -143.77234 -0.45157483 -2.4570309 -0.52078002 -0.23245524
## 6 -120.11940 -0.44163988 -6.3501832 -0.55937887 -0.16496438
## 7 111.65170 -0.02859189 -8.4393169 -0.90090096 -0.04285993
## 8 71.90978 -0.05405534 -7.2393255 -0.07268703 -0.02706201
## 9 -23.26643 1.05731368 -9.9188910 4.69045014 -0.54367948
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 -15.62348 0.01680235 -0.01254569 -54.00569 -4.071159
## 5 -15.24590 0.07569049 -0.01640530 174.65852 -3.608298
## 6 -13.70611 0.08352995 -0.02206566 142.16967 -3.275409
## 7 -19.58021 0.09035251 -0.02426119 -105.41733 -1.556323
## 8 -18.60198 0.10593109 -0.02046752 -70.67082 -1.755639
## 9 -248.45446 0.08576505 -0.01621272 -42.75708 13.405050
## sulphates alcohol
## 4 3.077385 -0.62686578
## 5 3.302585 -0.82060754
## 6 5.059485 -0.01451927
## 7 6.643809 0.38419658
## 8 6.035392 0.64336353
## 9 3.378276 1.92082084
##
## Residual Deviance: 6872.974
## AIC: 7016.974
mlogit_output <- summary(mlogit_model)
mlogit_output
## Call:
## multinom(formula = quality ~ ., data = wine_train, maxit = 1000)
##
## Coefficients:
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 80.34737 -0.40121812 -0.3009899 -0.18609933 -0.13430930
## 5 -143.77234 -0.45157483 -2.4570309 -0.52078002 -0.23245524
## 6 -120.11940 -0.44163988 -6.3501832 -0.55937887 -0.16496438
## 7 111.65170 -0.02859189 -8.4393169 -0.90090096 -0.04285993
## 8 71.90978 -0.05405534 -7.2393255 -0.07268703 -0.02706201
## 9 -23.26643 1.05731368 -9.9188910 4.69045014 -0.54367948
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 -15.62348 0.01680235 -0.01254569 -54.00569 -4.071159
## 5 -15.24590 0.07569049 -0.01640530 174.65852 -3.608298
## 6 -13.70611 0.08352995 -0.02206566 142.16967 -3.275409
## 7 -19.58021 0.09035251 -0.02426119 -105.41733 -1.556323
## 8 -18.60198 0.10593109 -0.02046752 -70.67082 -1.755639
## 9 -248.45446 0.08576505 -0.01621272 -42.75708 13.405050
## sulphates alcohol
## 4 3.077385 -0.62686578
## 5 3.302585 -0.82060754
## 6 5.059485 -0.01451927
## 7 6.643809 0.38419658
## 8 6.035392 0.64336353
## 9 3.378276 1.92082084
##
## Std. Errors:
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 0.1454493 0.2165992 0.5532004 0.69219171 0.07410513
## 5 0.5178327 0.2052018 0.3784100 0.37689593 0.06965434
## 6 0.4284067 0.2045802 0.3693196 0.34811001 0.06951777
## 7 0.5579515 0.2071980 0.4956936 0.47244397 0.07045497
## 8 0.1321986 0.2193902 0.8597776 0.77473267 0.07408530
## 9 0.1320847 0.3346028 0.1170187 0.08956497 0.49921984
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 0.07400592 0.03622649 0.006873728 0.1450188 0.6111689
## 5 0.72950948 0.03483952 0.006457798 0.5073267 0.5901773
## 6 0.75412783 0.03485743 0.006480820 0.4195697 0.5697341
## 7 0.04534702 0.03504646 0.006626980 0.5469947 0.5958200
## 8 0.03572340 0.03571581 0.007253145 0.1314910 0.6354287
## 9 0.01741697 0.07224975 0.025817800 0.1340691 0.9254421
## sulphates alcohol
## 4 0.6817020 0.2258354
## 5 0.3586576 0.2135175
## 6 0.3066205 0.2105279
## 7 0.3651224 0.2133400
## 8 0.6329260 0.2247793
## 9 0.1478071 0.3393065
##
## Residual Deviance: 6872.974
## AIC: 7016.974
Interpretation of Train Model data. After we run the model, nnet reports the summary iterations and declaration about model convergence. In this case, the model converged quite after 310 iterations. 2. Model execution output shows some iteration history and includes the final negative log-likelihood 3436.5. This value is multiplied by two as shown in the model summary as the Residual Deviance
The Akaike Information Criterion (AIC) is 7016.9 and it provides a method for assessing the quality of your model through comparison of related models. If we have more than one similar candidate models (where all of the variables of the simpler model occur in the more complex models), then we should select the model that has the smallest AIC. It’s useful for comparing models. However, unlike adjusted R-squared, the number itself is not meaningful.
mlogit_output$coefficients
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 80.34737 -0.40121812 -0.3009899 -0.18609933 -0.13430930
## 5 -143.77234 -0.45157483 -2.4570309 -0.52078002 -0.23245524
## 6 -120.11940 -0.44163988 -6.3501832 -0.55937887 -0.16496438
## 7 111.65170 -0.02859189 -8.4393169 -0.90090096 -0.04285993
## 8 71.90978 -0.05405534 -7.2393255 -0.07268703 -0.02706201
## 9 -23.26643 1.05731368 -9.9188910 4.69045014 -0.54367948
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 -15.62348 0.01680235 -0.01254569 -54.00569 -4.071159
## 5 -15.24590 0.07569049 -0.01640530 174.65852 -3.608298
## 6 -13.70611 0.08352995 -0.02206566 142.16967 -3.275409
## 7 -19.58021 0.09035251 -0.02426119 -105.41733 -1.556323
## 8 -18.60198 0.10593109 -0.02046752 -70.67082 -1.755639
## 9 -248.45446 0.08576505 -0.01621272 -42.75708 13.405050
## sulphates alcohol
## 4 3.077385 -0.62686578
## 5 3.302585 -0.82060754
## 6 5.059485 -0.01451927
## 7 6.643809 0.38419658
## 8 6.035392 0.64336353
## 9 3.378276 1.92082084
mlogit_output$standard.errors
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 0.1454493 0.2165992 0.5532004 0.69219171 0.07410513
## 5 0.5178327 0.2052018 0.3784100 0.37689593 0.06965434
## 6 0.4284067 0.2045802 0.3693196 0.34811001 0.06951777
## 7 0.5579515 0.2071980 0.4956936 0.47244397 0.07045497
## 8 0.1321986 0.2193902 0.8597776 0.77473267 0.07408530
## 9 0.1320847 0.3346028 0.1170187 0.08956497 0.49921984
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 0.07400592 0.03622649 0.006873728 0.1450188 0.6111689
## 5 0.72950948 0.03483952 0.006457798 0.5073267 0.5901773
## 6 0.75412783 0.03485743 0.006480820 0.4195697 0.5697341
## 7 0.04534702 0.03504646 0.006626980 0.5469947 0.5958200
## 8 0.03572340 0.03571581 0.007253145 0.1314910 0.6354287
## 9 0.01741697 0.07224975 0.025817800 0.1340691 0.9254421
## sulphates alcohol
## 4 0.6817020 0.2258354
## 5 0.3586576 0.2135175
## 6 0.3066205 0.2105279
## 7 0.3651224 0.2133400
## 8 0.6329260 0.2247793
## 9 0.1478071 0.3393065
z <- mlogit_output$coefficients/mlogit_output$standard.errors
p <- (1-pnorm(abs(z),0,1))*2 # I am using two-tailed z test
print(z, digits =2)
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 552 -1.85 -0.54 -0.269 -1.81
## 5 -278 -2.20 -6.49 -1.382 -3.34
## 6 -280 -2.16 -17.19 -1.607 -2.37
## 7 200 -0.14 -17.03 -1.907 -0.61
## 8 544 -0.25 -8.42 -0.094 -0.37
## 9 -176 3.16 -84.76 52.369 -1.09
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 -211 0.46 -1.83 -372 -6.7
## 5 -21 2.17 -2.54 344 -6.1
## 6 -18 2.40 -3.40 339 -5.7
## 7 -432 2.58 -3.66 -193 -2.6
## 8 -521 2.97 -2.82 -537 -2.8
## 9 -14265 1.19 -0.63 -319 14.5
## sulphates alcohol
## 4 4.5 -2.776
## 5 9.2 -3.843
## 6 16.5 -0.069
## 7 18.2 1.801
## 8 9.5 2.862
## 9 22.9 5.661
print(p, digits =2)
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 0 0.0640 5.9e-01 0.788 0.06992
## 5 0 0.0278 8.4e-11 0.167 0.00085
## 6 0 0.0309 0.0e+00 0.108 0.01765
## 7 0 0.8902 0.0e+00 0.057 0.54297
## 8 0 0.8054 0.0e+00 0.925 0.71490
## 9 0 0.0016 0.0e+00 0.000 0.27613
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 0 0.6428 0.06798 0 2.7e-11
## 5 0 0.0298 0.01107 0 9.7e-10
## 6 0 0.0166 0.00066 0 9.0e-09
## 7 0 0.0099 0.00025 0 9.0e-03
## 8 0 0.0030 0.00477 0 5.7e-03
## 9 0 0.2352 0.53003 0 0.0e+00
## sulphates alcohol
## 4 6.4e-06 5.5e-03
## 5 0.0e+00 1.2e-04
## 6 0.0e+00 9.5e-01
## 7 0.0e+00 7.2e-02
## 8 0.0e+00 4.2e-03
## 9 0.0e+00 1.5e-08
The p-Value for quality tells us that citric acid and residual sugar (for four levels of quality) are not significant. Now I’ll explore the entire data set, and analyze if we can remove any variables which do not add to model performance.
Pquality5 <- rbind(mlogit_output$coefficients[2, ],mlogit_output$standard.errors[2, ],z[2, ],p[2, ])
Pquality5
## (Intercept) fixed.acidity volatile.acidity citric.acid
## [1,] -143.7723368 -0.45157483 -2.457031e+00 -0.5207800
## [2,] 0.5178327 0.20520178 3.784100e-01 0.3768959
## [3,] -277.6424511 -2.20063797 -6.493038e+00 -1.3817608
## [4,] 0.0000000 0.02776166 8.412226e-11 0.1670452
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## [1,] -0.2324552432 -15.2459000 0.07569049 -0.016405299
## [2,] 0.0696543433 0.7295095 0.03483952 0.006457798
## [3,] -3.3372684614 -20.8988375 2.17254706 -2.540385993
## [4,] 0.0008460618 0.0000000 0.02981442 0.011073019
## density pH sulphates alcohol
## [1,] 174.6585171 -3.608298e+00 3.3025854 -0.8206075363
## [2,] 0.5073267 5.901773e-01 0.3586576 0.2135175060
## [3,] 344.2722436 -6.113923e+00 9.2081850 -3.8432798873
## [4,] 0.0000000 9.721148e-10 0.0000000 0.0001214009
rownames(Pquality5) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pquality5)
| (Intercept) | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Coefficient | -143.7723368 | -0.4515748 | -2.457031 | -0.5207800 | -0.2324552 | -15.2459000 | 0.0756905 | -0.0164053 | 174.6585171 | -3.6082985 | 3.3025854 | -0.8206075 |
| Std. Errors | 0.5178327 | 0.2052018 | 0.378410 | 0.3768959 | 0.0696543 | 0.7295095 | 0.0348395 | 0.0064578 | 0.5073267 | 0.5901773 | 0.3586576 | 0.2135175 |
| z stat | -277.6424511 | -2.2006380 | -6.493038 | -1.3817608 | -3.3372685 | -20.8988375 | 2.1725471 | -2.5403860 | 344.2722436 | -6.1139226 | 9.2081850 | -3.8432799 |
| p value | 0.0000000 | 0.0277617 | 0.000000 | 0.1670452 | 0.0008461 | 0.0000000 | 0.0298144 | 0.0110730 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0001214 |
Probabilities Equation Ln(p(Quality=4)/p(Quality=3) = 80 - 0.40fixed.acidity -0.30volatile.acidity - 0.18citric.acid - 0.13residual.sugar -15.62chlorides + 0.016free.sulfur.dioxide -0.012total.sulfur.dioxide - 54.0density - 4.07ph + 3.07sulphates - 0.62*alcohol
Ln(p(Quality=5)/p(Quality=3) = 143 - 0.45fixed.acidity -2.45volatile.acidity - 0.52citric.acid - 0.23residual.sugar -15.24chlorides + 0.07free.sulfur.dioxide -0.016total.sulfur.dioxide - 174.6density - 3.60ph + 3.30sulphates - 0.82*alcohol
oddsML <- exp(coef(mlogit_output))
print(oddsML, digits =2)
## (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4 7.8e+34 0.67 7.4e-01 0.83 0.87
## 5 3.6e-63 0.64 8.6e-02 0.59 0.79
## 6 6.8e-53 0.64 1.7e-03 0.57 0.85
## 7 3.1e+48 0.97 2.2e-04 0.41 0.96
## 8 1.7e+31 0.95 7.2e-04 0.93 0.97
## 9 7.9e-11 2.88 4.9e-05 108.90 0.58
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4 1.6e-07 1.0 0.99 3.5e-24 1.7e-02
## 5 2.4e-07 1.1 0.98 7.1e+75 2.7e-02
## 6 1.1e-06 1.1 0.98 5.5e+61 3.8e-02
## 7 3.1e-09 1.1 0.98 1.7e-46 2.1e-01
## 8 8.3e-09 1.1 0.98 2.0e-31 1.7e-01
## 9 1.3e-108 1.1 0.98 2.7e-19 6.6e+05
## sulphates alcohol
## 4 22 0.53
## 5 27 0.44
## 6 158 0.99
## 7 768 1.47
## 8 418 1.90
## 9 29 6.83
The relative risk/odds ratio for a one-unit increase in the variable alcohol is .534 for being in quality 4 vs. quality 3. The relative risk/odds ratio for a one-unit increase in the variable alcohol is .440 for being in quality 5 vs. quality 3.
predictedML <- predict(mlogit_model,wine_test,na.action =na.pass, type="probs")
predicted_classML <- predict(mlogit_model,wine_test)
#wine_train$quality
# confusion matrix
caret::confusionMatrix(as.factor(predicted_classML),as.factor(wine_test$quality))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 1 0 0 0
## 4 1 2 0 0 0 0 0
## 5 11 71 654 326 37 5 0
## 6 6 35 403 1004 394 55 1
## 7 0 0 6 83 112 34 1
## 8 2 0 0 1 0 0 0
## 9 0 0 0 2 0 1 0
##
## Overall Statistics
##
## Accuracy : 0.5456
## 95% CI : (0.5283, 0.5628)
## No Information Rate : 0.4363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2689
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.0000000 0.0185185 0.6152 0.7085 0.20626
## Specificity 0.9996902 0.9996815 0.7941 0.5117 0.95416
## Pos Pred Value 0.0000000 0.6666667 0.5924 0.5290 0.47458
## Neg Pred Value 0.9938405 0.9673344 0.8092 0.6941 0.85691
## Prevalence 0.0061576 0.0332512 0.3273 0.4363 0.16718
## Detection Rate 0.0000000 0.0006158 0.2014 0.3091 0.03448
## Detection Prevalence 0.0003079 0.0009236 0.3399 0.5844 0.07266
## Balanced Accuracy 0.4998451 0.5091000 0.7046 0.6101 0.58021
## Class: 8 Class: 9
## Sensitivity 0.0000000 0.0000000
## Specificity 0.9990485 0.9990758
## Pos Pred Value 0.0000000 0.0000000
## Neg Pred Value 0.9707242 0.9993837
## Prevalence 0.0292488 0.0006158
## Detection Rate 0.0000000 0.0000000
## Detection Prevalence 0.0009236 0.0009236
## Balanced Accuracy 0.4995243 0.4995379
mean(as.character(predicted_classML) != as.character(wine_test$quality))
## [1] 0.4544335
# cm <- table(predicted_class, wine_train$quality)
# print(cm)
Confusion Matrix gives that Multinominal Logist model with all variables to have 55% accuracy and has around 45% accuracy rate
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
mostImportantVariables <- varImp(mlogit_model)
mostImportantVariables$Variables <- row.names(mostImportantVariables)
mostImportantVariables <- mostImportantVariables[order(-mostImportantVariables$Overall),]
print(head(mostImportantVariables))
## Overall Variables
## density 589.679102 density
## chlorides 331.212145 chlorides
## volatile.acidity 34.705737 volatile.acidity
## pH 27.671878 pH
## sulphates 27.496932 sulphates
## citric.acid 6.930296 citric.acid
I used coefficients magnitude, p-value and accuracy to select key variables for lean model
mlogit_model1 <- multinom(quality ~ density + chlorides + volatile.acidity + alcohol,data =wine_train, maxit = 1000)
## # weights: 42 (30 variable)
## initial value 6320.316164
## iter 10 value 3916.389255
## iter 20 value 3559.370598
## iter 30 value 3550.506628
## iter 40 value 3543.044583
## iter 50 value 3539.028675
## iter 60 value 3535.377250
## iter 70 value 3534.781094
## iter 80 value 3533.483356
## iter 90 value 3532.290133
## iter 100 value 3531.652642
## iter 110 value 3529.226400
## iter 120 value 3528.372508
## iter 130 value 3527.551327
## iter 140 value 3526.868656
## iter 150 value 3526.476516
## iter 160 value 3525.900301
## iter 170 value 3525.717948
## iter 180 value 3525.456945
## iter 190 value 3524.778470
## iter 200 value 3524.729025
## iter 210 value 3524.662916
## iter 220 value 3524.474316
## iter 230 value 3524.414746
## iter 240 value 3524.385488
## final value 3524.352714
## converged
mlogit_model1
## Call:
## multinom(formula = quality ~ density + chlorides + volatile.acidity +
## alcohol, data = wine_train, maxit = 1000)
##
## Coefficients:
## (Intercept) density chlorides volatile.acidity alcohol
## 4 197.446607 -189.78397 -8.725659 0.3750108 -0.5519724
## 5 93.071512 -78.53379 -7.709595 -1.6203411 -0.8769918
## 6 -18.363948 24.68042 -4.433698 -5.5818409 0.1228285
## 7 -103.692907 102.52426 -11.575178 -8.2251245 0.8614796
## 8 -134.141176 130.69747 -20.456602 -7.6528607 0.9480950
## 9 -1.390011 -11.06833 -112.577157 -10.9758124 1.6324814
##
## Residual Deviance: 7048.705
## AIC: 7108.705
mlogit_output1 <- summary(mlogit_model1)
mlogit_output1
## Call:
## multinom(formula = quality ~ density + chlorides + volatile.acidity +
## alcohol, data = wine_train, maxit = 1000)
##
## Coefficients:
## (Intercept) density chlorides volatile.acidity alcohol
## 4 197.446607 -189.78397 -8.725659 0.3750108 -0.5519724
## 5 93.071512 -78.53379 -7.709595 -1.6203411 -0.8769918
## 6 -18.363948 24.68042 -4.433698 -5.5818409 0.1228285
## 7 -103.692907 102.52426 -11.575178 -8.2251245 0.8614796
## 8 -134.141176 130.69747 -20.456602 -7.6528607 0.9480950
## 9 -1.390011 -11.06833 -112.577157 -10.9758124 1.6324814
##
## Std. Errors:
## (Intercept) density chlorides volatile.acidity alcohol
## 4 1.729836 1.728648 5.1367043 1.439356 0.3151521
## 5 9.446285 9.337721 4.1003520 1.384156 0.3035482
## 6 12.509405 12.377231 4.0660773 1.395951 0.3025122
## 7 3.215013 3.203924 4.9611188 1.464048 0.3044135
## 8 1.853929 1.726059 8.0632289 1.714886 0.3167942
## 9 4.752426 4.792886 0.3245852 6.887989 0.8084408
##
## Residual Deviance: 7048.705
## AIC: 7108.705
mlogit_output1$coefficients
## (Intercept) density chlorides volatile.acidity alcohol
## 4 197.446607 -189.78397 -8.725659 0.3750108 -0.5519724
## 5 93.071512 -78.53379 -7.709595 -1.6203411 -0.8769918
## 6 -18.363948 24.68042 -4.433698 -5.5818409 0.1228285
## 7 -103.692907 102.52426 -11.575178 -8.2251245 0.8614796
## 8 -134.141176 130.69747 -20.456602 -7.6528607 0.9480950
## 9 -1.390011 -11.06833 -112.577157 -10.9758124 1.6324814
mlogit_output1$standard.errors
## (Intercept) density chlorides volatile.acidity alcohol
## 4 1.729836 1.728648 5.1367043 1.439356 0.3151521
## 5 9.446285 9.337721 4.1003520 1.384156 0.3035482
## 6 12.509405 12.377231 4.0660773 1.395951 0.3025122
## 7 3.215013 3.203924 4.9611188 1.464048 0.3044135
## 8 1.853929 1.726059 8.0632289 1.714886 0.3167942
## 9 4.752426 4.792886 0.3245852 6.887989 0.8084408
z <- mlogit_output1$coefficients/mlogit_output1$standard.errors
p <- (1-pnorm(abs(z),0,1))*2
Pquality5 <- rbind(mlogit_output1$coefficients[2, ],mlogit_output1$standard.errors[2, ],z[2, ],p[2, ])
Pquality5
## (Intercept) density chlorides volatile.acidity alcohol
## [1,] 93.071512 -78.533789 -7.70959532 -1.6203411 -0.87699180
## [2,] 9.446285 9.337721 4.10035199 1.3841564 0.30354821
## [3,] 9.852711 -8.410381 -1.88022768 -1.1706344 -2.88913517
## [4,] 0.000000 0.000000 0.06007705 0.2417458 0.00386303
rownames(Pquality5) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(Pquality5)
| (Intercept) | density | chlorides | volatile.acidity | alcohol | |
|---|---|---|---|---|---|
| Coefficient | 93.071512 | -78.533789 | -7.7095953 | -1.6203411 | -0.8769918 |
| Std. Errors | 9.446285 | 9.337721 | 4.1003520 | 1.3841564 | 0.3035482 |
| z stat | 9.852711 | -8.410381 | -1.8802277 | -1.1706344 | -2.8891352 |
| p value | 0.000000 | 0.000000 | 0.0600771 | 0.2417458 | 0.0038630 |
| ## extract the | coefficients | from the mode | l and exponen | tiate |
exp(coef(mlogit_output1))
## (Intercept) density chlorides volatile.acidity alcohol
## 4 5.623049e+85 3.783286e-83 1.623657e-04 1.4550071084 0.5758130
## 5 2.632960e+40 7.820037e-35 4.485029e-04 0.1978312038 0.4160325
## 6 1.058372e-08 5.230815e+10 1.187051e-02 0.0037656270 1.1306905
## 7 9.262812e-46 3.355214e+44 9.396453e-06 0.0002678390 2.3666598
## 8 5.536398e-59 5.770217e+56 1.305602e-09 0.0004746842 2.5807887
## 9 2.490726e-01 1.559865e-05 1.283400e-49 0.0000171106 5.1165550
mlogit_output1$coefficients
## (Intercept) density chlorides volatile.acidity alcohol
## 4 197.446607 -189.78397 -8.725659 0.3750108 -0.5519724
## 5 93.071512 -78.53379 -7.709595 -1.6203411 -0.8769918
## 6 -18.363948 24.68042 -4.433698 -5.5818409 0.1228285
## 7 -103.692907 102.52426 -11.575178 -8.2251245 0.8614796
## 8 -134.141176 130.69747 -20.456602 -7.6528607 0.9480950
## 9 -1.390011 -11.06833 -112.577157 -10.9758124 1.6324814
There is no R-square for a logit model but we may use a McFadden psuedo R2 using pscl package to get this and other fit statistics
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
mlogit_ModelFit<- rbind(pscl::pR2(mlogit_model)[1:6],pscl::pR2(mlogit_model1)[1:6])
## fitting null model for pseudo-r2
## # weights: 14 (6 variable)
## initial value 6320.316164
## iter 10 value 4123.979306
## iter 20 value 4118.078251
## final value 4118.060977
## converged
## fitting null model for pseudo-r2
## # weights: 14 (6 variable)
## initial value 6320.316164
## iter 10 value 4123.979306
## iter 20 value 4118.078251
## final value 4118.060977
## converged
rownames(mlogit_ModelFit) <- c("Model-0", "Model-1")
print(mlogit_ModelFit, digits = 2)
## llh llhNull G2 McFadden r2ML r2CU
## Model-0 -3436 -4118 1363 0.17 0.34 0.37
## Model-1 -3524 -4118 1187 0.14 0.31 0.33
mlogit_ModelError <- as.data.frame(rbind(cbind(mlogit_output$deviance,mlogit_output$AIC)),cbind(mlogit_output1$deviance,mlogit_output1$AIC))
names(mlogit_ModelError) <- c("Deviance", "AIC")
print(mlogit_ModelError, digits = 3)
## Deviance AIC
## 1 6873 7017
We observe that Model 0 with all variables has highest McFadden (0.17) and r2 scores and lowest deviance and AIC
predictedML1 <- predict(mlogit_model1,wine_train,na.action =na.pass, type="probs")
predicted_classML1 <- predict(mlogit_model1,wine_train)
caret::confusionMatrix(as.factor(predicted_classML1),as.factor(wine_train$quality))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 4 57 657 345 37 12 0
## 6 6 49 414 986 397 67 1
## 7 0 2 4 88 101 19 2
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5369
## 95% CI : (0.5196, 0.5542)
## No Information Rate : 0.4369
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2501
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.6112 0.6949 0.1888 0.00000
## Specificity 1.000000 1.00000 0.7906 0.4893 0.9576 1.00000
## Pos Pred Value NaN NaN 0.5908 0.5135 0.4676 NaN
## Neg Pred Value 0.996921 0.96675 0.8043 0.6739 0.8569 0.96983
## Prevalence 0.003079 0.03325 0.3310 0.4369 0.1647 0.03017
## Detection Rate 0.000000 0.00000 0.2023 0.3036 0.0311 0.00000
## Detection Prevalence 0.000000 0.00000 0.3424 0.5911 0.0665 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7009 0.5921 0.5732 0.50000
## Class: 9
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9990764
## Prevalence 0.0009236
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
mean(as.character(predicted_classML1) != as.character(wine_train$quality))
## [1] 0.4630542
predictedML1 <- predict(mlogit_model1,wine_test,na.action =na.pass, type="probs")
predicted_classML1 <- predict(mlogit_model1,wine_test)
caret::confusionMatrix(as.factor(predicted_classML1),as.factor(wine_test$quality))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 1 1 0 0 0 0 0
## 5 10 66 647 363 32 5 0
## 6 9 41 411 974 421 59 1
## 7 0 0 5 80 90 31 1
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5271
## 95% CI : (0.5098, 0.5444)
## No Information Rate : 0.4363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2358
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.0092593 0.6087 0.6874 0.16575
## Specificity 1.000000 0.9996815 0.7822 0.4855 0.95675
## Pos Pred Value NaN 0.5000000 0.5761 0.5084 0.43478
## Neg Pred Value 0.993842 0.9670364 0.8042 0.6674 0.85104
## Prevalence 0.006158 0.0332512 0.3273 0.4363 0.16718
## Detection Rate 0.000000 0.0003079 0.1992 0.2999 0.02771
## Detection Prevalence 0.000000 0.0006158 0.3458 0.5899 0.06373
## Balanced Accuracy 0.500000 0.5044704 0.6954 0.5864 0.56125
## Class: 8 Class: 9
## Sensitivity 0.00000 0.0000000
## Specificity 1.00000 1.0000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.97075 0.9993842
## Prevalence 0.02925 0.0006158
## Detection Rate 0.00000 0.0000000
## Detection Prevalence 0.00000 0.0000000
## Balanced Accuracy 0.50000 0.5000000
mean(as.character(predicted_classML1) != as.character(wine_test$quality))
## [1] 0.4729064
We observe that Test sample accuracy is around 53% (in both train and test data) with just four predictors as compared to an accuracy of 54% with all 11 variables. However, A misclassification error of 47.2% seems high. It seems that given variables are not as good in explaining the quality. In this case of multinomial regression, The category to which an outcome quality belongs to, does not assume any order in it. In reality,the 7 categories of quality have a natural order. So, Ordinal regression might improve the result
In order to create a SVR model with R we will need the package e1071.
# set.seed(200)
#install.packages("e1071")
library(e1071)
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:moments':
##
## kurtosis, moment, skewness
## The following object is masked from 'package:Hmisc':
##
## impute
#as.factor(quality)
modelsvm <- svm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid + chlorides + pH, data=wine_train)
summary(modelsvm)
##
## Call:
## svm(formula = quality ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + citric.acid + chlorides + pH, data = wine_train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.09090909
##
## Number of Support Vectors: 2913
##
## ( 1274 894 526 98 108 3 10 )
##
##
## Number of Classes: 7
##
## Levels:
## 3 4 5 6 7 8 9
predictedsvm <- predict(modelsvm,wine_train)
#predictedsvm
In the code above, I performed an epsilon-regression, and did not set any value for gama, but it took a default value of 0.1. There is also a cost parameter which we can change to avoid overfitting. A common way to measure error in machine learning is to use the Root Mean Squared Error (RMSE) so I will use it to compute the RMSE I took the square root and we get the RMSE. RMSE= sqrt(MSE)
rmse <- function(error)
{
sqrt(mean(error^2))
}
#error <- model$residuals
error <- as.numeric(wine_train$quality) - as.numeric(predictedsvm)
svrPredictionRMSE <- rmse(error)
svrPredictionRMSE
## [1] 0.7321353
Model Tuning-The code is not run in R markdown as it takes lot of time- Result are used in the table below #tune$best.parameters # gamma cost #100 1 5
#obj <- tune.svm(modelsvm, data=wine_train, gamma= seq(.1,1,by=.1), cost= seq(1,10, by =1))
#obj$best.parameters
#plot(obj)
Final Parameters for SVM Model Gamma = 1 and Cost 5
modelsvm_tuned <- svm(quality ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid+ chlorides + pH, data = wine_train, gamma = 1, cost = 5)
wine_test_svmpredict <- predict(modelsvm_tuned,wine_test,na.action = na.pass, type="probs")
# caret::confusionMatrix(as.factor(wine_train_svmpredict),as.factor(wine_train$quality))
# mean(as.character(wine_train_svmpredict)!= as.character(wine_train$quality))
The model on train data test gives an accuracy of 55% and test data at 60%
modelsvm_tuned1 <- svm(quality ~ density + chlorides + volatile.acidity + alcohol, data = wine_train, gamma = 1, cost = 5)
wine_test_svmpredict1 <- predict(modelsvm_tuned1,wine_test,na.action = na.pass, type="class")
# Cross tabs or confusion matrix
caret::confusionMatrix(as.factor(wine_test_svmpredict1),as.factor(wine_test$quality))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 1 2 0 0 0
## 4 1 5 1 2 0 0 0
## 5 10 62 637 318 26 0 0
## 6 8 39 410 1020 412 72 2
## 7 1 2 14 74 103 21 0
## 8 0 0 0 1 2 2 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.544
## 95% CI : (0.5267, 0.5613)
## No Information Rate : 0.4363
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2634
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.0000000 0.046296 0.5992 0.7198 0.18969
## Specificity 0.9990706 0.998726 0.8096 0.4850 0.95860
## Pos Pred Value 0.0000000 0.555556 0.6049 0.5196 0.47907
## Neg Pred Value 0.9938367 0.968200 0.8059 0.6911 0.85493
## Prevalence 0.0061576 0.033251 0.3273 0.4363 0.16718
## Detection Rate 0.0000000 0.001539 0.1961 0.3140 0.03171
## Detection Prevalence 0.0009236 0.002771 0.3242 0.6044 0.06619
## Balanced Accuracy 0.4995353 0.522511 0.7044 0.6024 0.57414
## Class: 8 Class: 9
## Sensitivity 0.0210526 0.0000000
## Specificity 0.9990485 1.0000000
## Pos Pred Value 0.4000000 NaN
## Neg Pred Value 0.9713228 0.9993842
## Prevalence 0.0292488 0.0006158
## Detection Rate 0.0006158 0.0000000
## Detection Prevalence 0.0015394 0.0000000
## Balanced Accuracy 0.5100506 0.5000000
mean(as.character(wine_test_svmpredict1) != as.character(wine_test$quality))
## [1] 0.4559729
Accuracy with lean model is around 55% # Question-3 -Treat quality as a continuous variable. Estimate a linear regression model and compare the output with multinomial regression and SVM. # Invoking required R packages
library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:caret':
##
## predictors
## The following objects are masked from 'package:psych':
##
## fisherz, logistic, logit
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 3.3.2
## Important regclass change from 1.3 to 1.4:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
##
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
##
## qq
library(leaps)
library(bestglm)
linear1 <- as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide + sulphates + total.sulfur.dioxide + density + residual.sugar+ fixed.acidity + citric.acid+ chlorides + pH
lm1 <- lm(formula =linear1 ,data=wine_train)
summary(lm1)
##
## Call:
## lm(formula = linear1, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2807 -0.4751 -0.0391 0.4639 2.6654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.115e+01 1.755e+01 2.914 0.003596 **
## alcohol 2.666e-01 2.467e-02 10.809 < 2e-16 ***
## volatile.acidity -1.292e+00 1.070e-01 -12.073 < 2e-16 ***
## free.sulfur.dioxide 6.837e-03 1.105e-03 6.189 6.81e-10 ***
## sulphates 6.632e-01 1.083e-01 6.124 1.02e-09 ***
## total.sulfur.dioxide -1.960e-03 3.993e-04 -4.910 9.58e-07 ***
## density -5.245e+01 1.790e+01 -2.929 0.003420 **
## residual.sugar 3.597e-02 7.420e-03 4.847 1.31e-06 ***
## fixed.acidity 8.273e-02 2.258e-02 3.663 0.000253 ***
## citric.acid -3.029e-02 1.132e-01 -0.268 0.789088
## chlorides -1.584e-01 4.724e-01 -0.335 0.737361
## pH 4.267e-01 1.294e-01 3.297 0.000987 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7364 on 3236 degrees of freedom
## Multiple R-squared: 0.283, Adjusted R-squared: 0.2805
## F-statistic: 116.1 on 11 and 3236 DF, p-value: < 2.2e-16
confint(lm1, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 16.730035583 85.568370295
## alcohol 0.218273317 0.315011040
## volatile.acidity -1.501791650 -1.082139157
## free.sulfur.dioxide 0.004671041 0.009002888
## sulphates 0.450886635 0.875599455
## total.sulfur.dioxide -0.002743486 -0.001177562
## density -87.550186497 -17.343374106
## residual.sugar 0.021416824 0.050513314
## fixed.acidity 0.038444681 0.127008502
## citric.acid -0.252309480 0.191725317
## chlorides -1.084743426 0.767854626
## pH 0.172946115 0.680371329
anova(lm1)
## Analysis of Variance Table
##
## Response: as.numeric(quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 463.93 463.93 855.6230 < 2.2e-16 ***
## volatile.acidity 1 163.64 163.64 301.7982 < 2.2e-16 ***
## free.sulfur.dioxide 1 10.28 10.28 18.9631 1.374e-05 ***
## sulphates 1 24.41 24.41 45.0181 2.295e-11 ***
## total.sulfur.dioxide 1 8.64 8.64 15.9383 6.689e-05 ***
## density 1 7.40 7.40 13.6534 0.0002235 ***
## residual.sugar 1 5.54 5.54 10.2148 0.0014066 **
## fixed.acidity 1 1.92 1.92 3.5405 0.0599771 .
## citric.acid 1 0.22 0.22 0.4100 0.5220350
## chlorides 1 0.58 0.58 1.0667 0.3017793
## pH 1 5.89 5.89 10.8717 0.0009870 ***
## Residuals 3236 1754.61 0.54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Evaluation calculates Residual standard error = 0.74, Multi R2 = 0.28, and Adj R2 = 0.28. Anova evaluation finds Citric Acid, Chlorides, and Fixed acidity are not significant in this model. We can take log of variable which have high skewness to make it normal. I tried taking log of chloride and other variable but there is not much change in R-square even after making them normal. This may be due to Central limit theorem for large sample.
Assumption 1- residuals plot can be used to assess the assumption that the variables have a linear relationship
unstandardizedPredicted <- predict(lm1)
unstandardizedResiduals <- resid(lm1)
#get standardized values
standardizedPredicted <- (unstandardizedPredicted - mean(unstandardizedPredicted)) / sd(unstandardizedPredicted)
standardizedResiduals <- (unstandardizedResiduals - mean(unstandardizedResiduals)) / sd(unstandardizedResiduals)
#create standardized residuals plot
plot(standardizedPredicted, standardizedResiduals, main = "Standardized Residuals Plot", xlab = "Standardized Predicted Values", ylab = "Standardized Residuals")
#add horizontal line
abline(0,0)
Explanation of Residual plot- values that are close to the horizontal line are predicted well. The points above the line are underpredicted and the ones below the line are overpredicted. The linearity assumption is supported to the extent that the amount of points scattered above and below the line is equal.
#create residuals histogram
hist(standardizedResiduals, freq = FALSE)
#add normal curve
curve(dnorm, add = TRUE)
# Use of PP-plot to access the assumption and residuals are normally distributed
#get probability distribution for residuals
probDist <- pnorm(standardizedResiduals)
#create PP plot
plot(ppoints(length(standardizedResiduals)), sort(probDist), main = "PP Plot", xlab = "Observed Probability", ylab = "Expected Probability")
#add diagonal line
abline(0,1)
Here, the distribution is considered to be normal to the extent that the plotted points match the diagonal line.
#### Create Linear Model for Variable Selection
lmdVarSelect <- lm(formula = linear1, data = wine_train)
### Perform Step-Wise Variable Selection
step <- stepAIC(lmdVarSelect, direction="both")
## Start: AIC=-1976.1
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + citric.acid + chlorides + pH
##
## Df Sum of Sq RSS AIC
## - citric.acid 1 0.039 1754.7 -1978.0
## - chlorides 1 0.061 1754.7 -1978.0
## <none> 1754.6 -1976.1
## - density 1 4.653 1759.3 -1969.5
## - pH 1 5.895 1760.5 -1967.2
## - fixed.acidity 1 7.275 1761.9 -1964.7
## - residual.sugar 1 12.739 1767.3 -1954.6
## - total.sulfur.dioxide 1 13.069 1767.7 -1954.0
## - sulphates 1 20.333 1774.9 -1940.7
## - free.sulfur.dioxide 1 20.770 1775.4 -1939.9
## - alcohol 1 63.346 1818.0 -1862.9
## - volatile.acidity 1 79.027 1833.6 -1835.0
##
## Step: AIC=-1978.02
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + chlorides + pH
##
## Df Sum of Sq RSS AIC
## - chlorides 1 0.079 1754.7 -1979.9
## <none> 1754.7 -1978.0
## + citric.acid 1 0.039 1754.6 -1976.1
## - density 1 4.644 1759.3 -1971.4
## - pH 1 5.968 1760.6 -1969.0
## - fixed.acidity 1 7.471 1762.1 -1966.2
## - residual.sugar 1 12.703 1767.3 -1956.6
## - total.sulfur.dioxide 1 13.743 1768.4 -1954.7
## - sulphates 1 20.306 1775.0 -1942.7
## - free.sulfur.dioxide 1 20.805 1775.5 -1941.7
## - alcohol 1 63.424 1818.1 -1864.7
## - volatile.acidity 1 89.821 1844.5 -1817.9
##
## Step: AIC=-1979.88
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + pH
##
## Df Sum of Sq RSS AIC
## <none> 1754.7 -1979.9
## + chlorides 1 0.079 1754.7 -1978.0
## + citric.acid 1 0.057 1754.7 -1978.0
## - density 1 5.144 1759.9 -1972.4
## - pH 1 6.578 1761.3 -1969.7
## - fixed.acidity 1 7.908 1762.6 -1967.3
## - total.sulfur.dioxide 1 13.665 1768.4 -1956.7
## - residual.sugar 1 13.986 1768.7 -1956.1
## - sulphates 1 20.676 1775.4 -1943.8
## - free.sulfur.dioxide 1 20.731 1775.5 -1943.7
## - alcohol 1 63.378 1818.1 -1866.6
## - volatile.acidity 1 92.316 1847.0 -1815.3
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + citric.acid + chlorides + pH
##
## Final Model:
## as.numeric(quality) ~ alcohol + volatile.acidity + free.sulfur.dioxide +
## sulphates + total.sulfur.dioxide + density + residual.sugar +
## fixed.acidity + pH
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 3236 1754.609 -1976.097
## 2 - citric.acid 1 0.03880406 3237 1754.648 -1978.025
## 3 - chlorides 1 0.07898424 3238 1754.727 -1979.879
I createad a Linear Model 2 using only Volatile Acidity, Alcohol, free suphur dioxide and Sulphates as independent variables. From the above table, we see that these four variables contribute for most of the variation in the data
lm2 <-lm(as.numeric(quality)~ alcohol + free.sulfur.dioxide + sulphates + volatile.acidity, data=wine_train)
summary(lm2)
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol + free.sulfur.dioxide +
## sulphates + volatile.acidity, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2730 -0.4754 -0.0287 0.4570 2.5672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4272306 0.1434983 2.977 0.00293 **
## alcohol 0.3232049 0.0112053 28.844 < 2e-16 ***
## free.sulfur.dioxide 0.0042448 0.0008299 5.115 3.33e-07 ***
## sulphates 0.5978010 0.0897634 6.660 3.21e-11 ***
## volatile.acidity -1.3019892 0.0848539 -15.344 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7419 on 3243 degrees of freedom
## Multiple R-squared: 0.2706, Adjusted R-squared: 0.2697
## F-statistic: 300.8 on 4 and 3243 DF, p-value: < 2.2e-16
confint(lm2, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 0.145874042 0.70858725
## alcohol 0.301234711 0.34517513
## free.sulfur.dioxide 0.002617525 0.00587201
## sulphates 0.421802302 0.77379964
## volatile.acidity -1.468361820 -1.13561661
anova(lm2)
## Analysis of Variance Table
##
## Response: as.numeric(quality)
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 463.93 463.93 842.97 < 2.2e-16 ***
## free.sulfur.dioxide 1 59.25 59.25 107.65 < 2.2e-16 ***
## sulphates 1 9.51 9.51 17.28 3.309e-05 ***
## volatile.acidity 1 129.57 129.57 235.44 < 2.2e-16 ***
## Residuals 3243 1784.81 0.55
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error = 0.74, Multi R2 = 0.28, and Adj R2 = 0.27. Anova indicates th all variables are significant in this model
plot(residuals(lm2))
Points in the residual plor are randomly dispersed indicating that linear regression model may be appropriate for the data
rmse <- function(error)
{
sqrt(mean(error^2))
}
error <- lm2$residuals # same as data$Y - predictedY
predictionRMSE <- rmse(error)
predictionRMSE
## [1] 0.7412893
Error is 74.5%- It seems that Liner regression is underestimating the true nature of the relationship and interesting things are being overlooked.
An evaluation of both Linear Model 1 and Linear Model 2 concludes that neither model is well suited for predicting this data. The linear model assumes that the probability p is a linear function of the regressors, while the logistic model assumes that the natural log of the odds p/(1-p) is a linear function of the regressors. When the true probabilities are extreme, the linear model can also yield predicted probabilities that are greater than 1 or less than 0. Those out-of-bounds predicted probabilities are the Achilles heel of the linear model. The linear probability model is fast by comparison because it can be estimated noniteratively using ordinary least squares (OLS).
Yes, we can model it using Ordinal logistic regression since quality variable is ordered. Ordinal logistic regression is used to predict an ordinal dependent variable given one or more independent variables. This will be enable us to determine which of our independent variables (if any) have a statistically significant effect on our dependent variable
invoking required packages (c(“MASS”, “ORDINAL”, “ERER”))
library(MASS)
library(ordinal)
##
## Attaching package: 'ordinal'
## The following objects are masked from 'package:VGAM':
##
## dgumbel, dlgamma, pgumbel, plgamma, qgumbel, rgumbel, wine
## The following object is masked from 'package:psych':
##
## income
library(erer)
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'lmtest'
## The following object is masked from 'package:VGAM':
##
## lrtest
wine_train$newquality <- ordered(wine_train$quality, levels = c(3,4,5,6,7,8,9))
wine_test$newquality <- ordered(wine_test$quality, levels = c(3,4,5,6,7,8,9))
str(wine_train)
## 'data.frame': 3248 obs. of 13 variables:
## $ fixed.acidity : num 6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
## $ volatile.acidity : num 0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
## $ citric.acid : num 0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
## $ residual.sugar : num 1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
## $ chlorides : num 0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
## $ free.sulfur.dioxide : num 20 43 44 5 27 48 45 54 27 10 ...
## $ total.sulfur.dioxide: num 74 157 166 10 69 127 178 216 80 136 ...
## $ density : num 0.993 0.994 0.996 0.996 0.999 ...
## $ pH : num 3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
## $ sulphates : num 0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
## $ alcohol : num 9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 3 4 4 4 3 4 4 4 ...
## $ newquality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...
ord<-clm(formula= (newquality~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol), data= wine_train, maxit = 500)
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
## In addition: Absolute and relative convergence criteria were met
summary(ord)
## formula:
## newquality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol
## data: wine_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 3248 -3543.66 7121.33 8(0) 1.11e-10 2.4e+11
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## fixed.acidity 2.025e-01 5.946e-02 3.405 0.00066 ***
## volatile.acidity -3.581e+00 2.958e-01 -12.106 < 2e-16 ***
## citric.acid -2.591e-02 2.957e-01 -0.088 0.93018
## residual.sugar 9.470e-02 1.961e-02 4.829 1.37e-06 ***
## chlorides -3.176e-02 1.257e+00 -0.025 0.97984
## free.sulfur.dioxide 1.695e-02 2.931e-03 5.784 7.28e-09 ***
## total.sulfur.dioxide -5.497e-03 1.059e-03 -5.191 2.09e-07 ***
## density -1.364e+02 4.748e+01 -2.873 0.00407 **
## pH 1.163e+00 3.412e-01 3.410 0.00065 ***
## sulphates 1.794e+00 2.819e-01 6.365 1.95e-10 ***
## alcohol 7.268e-01 6.631e-02 10.960 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 3|4 -129.23 46.55 -2.776
## 4|5 -126.69 46.55 -2.722
## 5|6 -123.48 46.55 -2.653
## 6|7 -120.89 46.55 -2.597
## 7|8 -118.60 46.54 -2.548
## 8|9 -115.01 46.55 -2.471
Chloride and citric acid are not significant and hence removed,
ord1 <- clm(formula=(newquality~ density + volatile.acidity + alcohol + fixed.acidity + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates), data= wine_train, maxit = 500)
## Warning: (3) Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
## In addition: Absolute and relative convergence criteria were met
summary(ord1)
## formula:
## newquality ~ density + volatile.acidity + alcohol + fixed.acidity + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + pH + sulphates
## data: wine_train
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 3248 -3543.67 7117.34 8(0) 9.23e-10 2.3e+11
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## density -1.368e+02 4.636e+01 -2.951 0.003166 **
## volatile.acidity -3.573e+00 2.739e-01 -13.048 < 2e-16 ***
## alcohol 7.264e-01 6.616e-02 10.979 < 2e-16 ***
## fixed.acidity 2.016e-01 5.717e-02 3.526 0.000421 ***
## residual.sugar 9.482e-02 1.901e-02 4.988 6.10e-07 ***
## free.sulfur.dioxide 1.695e-02 2.927e-03 5.791 6.99e-09 ***
## total.sulfur.dioxide -5.509e-03 1.040e-03 -5.299 1.16e-07 ***
## pH 1.168e+00 3.334e-01 3.503 0.000460 ***
## sulphates 1.791e+00 2.773e-01 6.458 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## 3|4 -129.61 45.49 -2.849
## 4|5 -127.06 45.49 -2.793
## 5|6 -123.85 45.49 -2.723
## 6|7 -121.27 45.49 -2.666
## 7|8 -118.98 45.48 -2.616
## 8|9 -115.39 45.49 -2.537
ctable <- coef(summary(ord1))
ctable
## Estimate Std. Error z value Pr(>|z|)
## 3|4 -1.296080e+02 45.487867052 -2.849286 4.381742e-03
## 4|5 -1.270638e+02 45.488143945 -2.793339 5.216691e-03
## 5|6 -1.238518e+02 45.489904850 -2.722622 6.476609e-03
## 6|7 -1.212670e+02 45.485700767 -2.666047 7.674892e-03
## 7|8 -1.189792e+02 45.484010194 -2.615846 8.900671e-03
## 8|9 -1.153877e+02 45.487234354 -2.536704 1.119015e-02
## density -1.368017e+02 46.355170800 -2.951164 3.165784e-03
## volatile.acidity -3.573303e+00 0.273860886 -13.047878 6.534194e-39
## alcohol 7.263572e-01 0.066157698 10.979179 4.812789e-28
## fixed.acidity 2.016080e-01 0.057173110 3.526274 4.214512e-04
## residual.sugar 9.481623e-02 0.019009257 4.987898 6.103988e-07
## free.sulfur.dioxide 1.695236e-02 0.002927325 5.791077 6.993664e-09
## total.sulfur.dioxide -5.509088e-03 0.001039653 -5.298967 1.164595e-07
## pH 1.167840e+00 0.333399901 3.502822 4.603572e-04
## sulphates 1.790710e+00 0.277277652 6.458183 1.059679e-10
ci <- confint(ord1)
ci
## 2.5 % 97.5 %
## density -2.278499e+02 -46.084314695
## volatile.acidity -4.112110e+00 -3.038426422
## alcohol 5.968855e-01 0.856280014
## fixed.acidity 8.969627e-02 0.313909873
## residual.sugar 5.761315e-02 0.132142810
## free.sulfur.dioxide 1.122166e-02 0.022698435
## total.sulfur.dioxide -7.550016e-03 -0.003474023
## pH 5.150373e-01 1.822209401
## sulphates 1.248044e+00 2.335222477
exp(coef(ord1))
## 3|4 4|5 5|6
## 5.152025e-57 6.559643e-56 1.628685e-54
## 6|7 7|8 8|9
## 2.159734e-53 2.128168e-52 7.722836e-51
## density volatile.acidity alcohol
## 3.870480e-60 2.806299e-02 2.067535e+00
## fixed.acidity residual.sugar free.sulfur.dioxide
## 1.223368e+00 1.099457e+00 1.017097e+00
## total.sulfur.dioxide pH sulphates
## 9.945061e-01 3.215042e+00 5.993705e+00
# Odds ratio and Confidence interval
# exp(cbind(OR = coef(ord1),ci))
str(wine_train)
## 'data.frame': 3248 obs. of 13 variables:
## $ fixed.acidity : num 6.2 6.6 6.3 9 12.4 7.3 7.2 6.5 7.2 6.9 ...
## $ volatile.acidity : num 0.27 0.16 0.26 0.36 0.35 0.26 0.31 0.28 0.27 0.28 ...
## $ citric.acid : num 0.49 0.21 0.25 0.52 0.49 0.33 0.35 0.38 0.31 0.41 ...
## $ residual.sugar : num 1.4 6.7 7.8 2.1 2.6 11.8 7.2 7.8 1.2 1.7 ...
## $ chlorides : num 0.05 0.055 0.058 0.111 0.079 0.057 0.046 0.031 0.031 0.05 ...
## $ free.sulfur.dioxide : num 20 43 44 5 27 48 45 54 27 10 ...
## $ total.sulfur.dioxide: num 74 157 166 10 69 127 178 216 80 136 ...
## $ density : num 0.993 0.994 0.996 0.996 0.999 ...
## $ pH : num 3.32 3.15 3.24 3.31 3.12 3.1 3.14 3.03 3.03 3.16 ...
## $ sulphates : num 0.44 0.52 0.41 0.62 0.75 0.55 0.53 0.42 0.33 0.71 ...
## $ alcohol : num 9.8 10.8 9 11.3 10.4 10 9.7 13.1 12.7 11.4 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 3 4 4 4 3 4 4 4 ...
## $ newquality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...
str(wine_test)
## 'data.frame': 3248 obs. of 13 variables:
## $ fixed.acidity : num 5.9 9.1 8.3 5.8 6.9 7.4 8.5 6.8 5 7.3 ...
## $ volatile.acidity : num 0.21 0.27 0.28 0.29 0.27 0.55 0.17 0.41 0.27 0.65 ...
## $ citric.acid : num 0.31 0.32 0.48 0.38 0.25 0.19 0.74 0.31 0.32 0 ...
## $ residual.sugar : num 1.8 1.1 2.1 10.7 7.5 1.8 3.6 8.8 4.5 1.2 ...
## $ chlorides : num 0.033 0.031 0.093 0.038 0.03 0.082 0.05 0.084 0.032 0.065 ...
## $ free.sulfur.dioxide : num 45 15 6 49 18 15 29 26 58 15 ...
## $ total.sulfur.dioxide: num 142 151 12 136 117 34 128 45 178 21 ...
## $ density : num 0.99 0.994 0.994 0.994 0.991 ...
## $ pH : num 3.35 3.03 3.26 3.11 3.09 3.49 3.28 3.38 3.45 3.39 ...
## $ sulphates : num 0.5 0.41 0.62 0.59 0.38 0.68 0.4 0.64 0.31 0.47 ...
## $ alcohol : num 12.7 10.6 12.4 11.2 13 10.5 12.4 10.1 12.6 10 ...
## $ quality : int 6 5 7 6 6 5 6 6 7 7 ...
## $ newquality : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 3 5 4 4 3 4 4 5 5 ...
predicted_class<- predict (ord, wine_train, type = "class")
predicted_class<- ordered(predicted_class$fit, levels = c(3, 4, 5, 6, 7, 8, 9)) ## Create as Ordered Factor).
str(predicted_class)
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 4 5 4 4 ...
str(wine_train$newquality)
## Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 3 4 4 4 3 4 4 4 ...
cm_ols <- table(predicted_class, wine_train$newquality)
print(cm_ols)
##
## predicted_class 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 0 2 0 0 0 0
## 5 5 61 610 311 43 10 0
## 6 5 46 460 1032 377 68 1
## 7 0 1 3 76 115 20 2
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
Accuracy <- sum(diag(cm_ols))/sum(cm_ols)
Accuracy
## [1] 0.5409483
The OLS model gives and accuracy of 56%