pacman::p_load(MASS, tidyverse, skimr, DataExplorer, ggcorrplot, fpp3, tsibble, xts, lares, caret, missForest, doParallel, car)
For this final project, I was tasked with finding a new data set and testing a variety of machine learning methods on it. The data can be found here: https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset
As explained on Kaggle, this dataset covers red variants of Spanish wines. It describes several popularity and description metrics and their effect on quality. The dataset can be used for classification or regression tasks. The classes are ordered and not balanced (the ratings run only from about 4.2 to 4.9). The task is to predict either the quality or the price of the wines using the given data.
I chose to focus on predicting price for my project. Due to the web-scraped nature of the data set, there are many irregularities and missing values.
wine <- read_csv("archive/wines_SPA.csv")
wine
skim(wine)
| Data summary | |
|---|---|
| Name | wine |
| Number of rows | 7500 |
| Number of columns | 11 |
| Column type frequency: character | 6 |
| Column type frequency: numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| winery | 0 | 1.00 | 3 | 48 | 0 | 480 | 0 |
| wine | 0 | 1.00 | 2 | 70 | 0 | 847 | 0 |
| year | 2 | 1.00 | 4 | 4 | 0 | 71 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| region | 0 | 1.00 | 4 | 31 | 0 | 76 | 0 |
| type | 545 | 0.93 | 3 | 20 | 0 | 21 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rating | 0 | 1.00 | 4.25 | 0.12 | 4.20 | 4.2 | 4.20 | 4.20 | 4.90 | ▇▁▁▁▁ |
| num_reviews | 0 | 1.00 | 451.11 | 723.00 | 25.00 | 389.0 | 404.00 | 415.00 | 32624.00 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 60.10 | 150.36 | 4.99 | 18.9 | 28.53 | 51.35 | 3119.08 | ▇▁▁▁▁ |
| body | 1169 | 0.84 | 4.16 | 0.58 | 2.00 | 4.0 | 4.00 | 5.00 | 5.00 | ▁▁▁▇▃ |
| acidity | 1169 | 0.84 | 2.95 | 0.25 | 1.00 | 3.0 | 3.00 | 3.00 | 3.00 | ▁▁▁▁▇ |
plot_missing(wine)
wine %>% filter(str_detect(year, "[:alpha:]"))
wine <- wine %>% mutate(year = ifelse(str_detect(year, "[:alpha:]"), NA, year))
plot_missing(wine)
wine %>% filter(if_any(everything(), ~ is.na(.)))
wine <- type.convert(wine, as.is = FALSE) %>% select(!country)
na.omit(wine) %>% plot_correlation(cor_args = list("use" = "everything"))
corr_var(wine,price)
corr_cross(wine)
corr_cross(wine,type=2)
There is a severe aliasing issue: many variables correspond directly to others (each wine name, for example, belongs to exactly one winery). Some of that is visible below.
aliases <- alias(lm(price ~ ., data = wine, na.action = na.exclude))
aliases$Complete[1:10, 1:10]
The printed 10 x 10 corner of the alias matrix (rows are `wine` dummy variables; columns are the intercept and `winery` dummies such as wineryAbadal, wineryAbadia Retuerta, wineryAbel Mendoza Monge, wineryAcustic Celler, wineryAdama Wines, wineryAdega Familiar Eladio Pineiro, wineryAGE, wineryAgusti Torello Mata, and wineryAlbamar) is almost entirely zeros. The single nonzero entry shows wine3.9 aliased with wineryAbadal: individual wine names map directly onto wineries, so the two sets of dummies are linearly dependent.
I had to remove these variables to fix the issue.
alias(lm(price ~ . - winery - body - wine - acidity - region, data = wine, na.action = na.exclude))
Model :
price ~ (winery + wine + year + rating + num_reviews + region +
type + body + acidity) - winery - body - wine - acidity -
region
No longer any visible issues.
vif(lm(price ~ . - winery - body - wine - acidity - region, data = wine, na.action = na.exclude))
GVIF Df GVIF^(1/(2*Df))
year 1.464693 1 1.210245
rating 1.218969 1 1.104069
num_reviews 1.019916 1 1.009909
type 1.501323 20 1.010210
corr_var(select(wine, !c(winery, body, wine, acidity, region)), price)
set.seed(123)
# Partition 1: drop the high-cardinality categoricals, then drop NA rows.
wine_no <- select(wine, !c(winery, wine, region)) %>% na.omit()
train_index <- createDataPartition(wine_no$price, p = .8, list = FALSE)
train_no <- wine_no[train_index, ]
test_no <- wine_no[-train_index, ]

set.seed(123)
# Partition 2: keep every variable, dropping only the NA rows.
wine_a <- na.omit(wine)
train_index <- createDataPartition(wine_a$price, p = .8, list = FALSE)
train <- wine_a[train_index, ]
test <- wine_a[-train_index, ]

set.seed(123)
# Partition 3: drop winery only, keeping the NA rows for imputation.
wine_alt <- select(wine, !c(winery))
train_index_a <- createDataPartition(wine_alt$price, p = .8, list = FALSE)
train_a <- wine_alt[train_index_a, ]
test_a <- wine_alt[-train_index_a, ]

set.seed(123)
# Partition 4: drop winery and the NA rows.
wine_a <- select(wine, !c(winery)) %>% na.omit()
train_index <- createDataPartition(wine_a$price, p = .8, list = FALSE)
train_alt <- wine_a[train_index, ]
test_alt <- wine_a[-train_index, ]
corr_var(train_a,price)
set.seed(123)
doParallel::registerDoParallel(cores = 4) # set based on number of CPU cores
# Set aside the high-cardinality categoricals; they are bound back on after imputation.
add_lateTr <- train_a %>% select(region, wine)
predictors <- select(train_a, !c(price, region, wine)) %>% as.data.frame()
test_pred <- select(test_a, !c(price, region, wine)) %>% as.data.frame()
add_lateTe <- test_a %>% select(region, wine)
# Impute the training predictors with random forests.
train_tree <- missForest(predictors, parallelize = 'forests', ntree = 500, maxiter = 4)$ximp
# Impute the test predictors alongside the already-imputed training rows,
# then keep only the first nrow(test_a) rows, i.e. the test set.
train_test_X <- rbind(test_pred, train_tree)
tes_treet <- missForest(as.data.frame(train_test_X), parallelize = 'forests', ntree = 500, maxiter = 4)$ximp[1:nrow(test_a), ]
I did not end up finding the alt version useful, but I left it in to show my process.
train_tree <- train_tree %>%
  mutate(across(c(year), ~ round(.) %>% as.integer()),
         price = train_a$price)
tes_treet <- tes_treet %>%
  mutate(across(c(year), ~ round(.) %>% as.integer()),
         price = test_a$price)
train_tree_alt <- train_tree %>% bind_cols(add_lateTr)
test_tree_alt <- tes_treet %>% bind_cols(add_lateTe)
corr_var(train_tree,price)
corr_var(train,price)
# Clear the foreach backend registration left over from registerDoParallel().
unregister <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}
unregister()
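One note before the models: ctrl is the resampling control shared by most of the train() calls below, but its definition did not make it into this document. A minimal definition, assuming the same 10-fold cross-validation used in the explicit trainControl() calls further down:
ctrl <- trainControl(method = "cv", number = 10) # assumed 10-fold CV, matching the explicit calls below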
set.seed(128)
lm_model <- train(price ~ .,
                  data = train_tree,
                  trace = FALSE,
                  method = 'lmStepAIC', metric = 'Rsquared', trControl = ctrl)
set.seed(128)
lm_model2 <- train(price ~ .,
                   data = train_no,
                   trace = FALSE,
                   method = 'lmStepAIC', metric = 'Rsquared', trControl = ctrl)
postResample(predict(lm_model, test_no), test_no$price)
RMSE Rsquared MAE
86.0177031 0.4597681 38.7092733
postResample(predict(lm_model2, tes_treet), tes_treet$price)
RMSE Rsquared MAE
130.2807266 0.3692052 48.3646526
set.seed(2334)
knnModel <- train(price ~ .,
                  data = train_no,
                  method = "knn",
                  preProc = c("center", "scale"),
                  trControl = ctrl,
                  tuneLength = 12)
postResample(predict(knnModel, test_no), test_no$price)
RMSE Rsquared MAE
83.14928 0.55793 21.76782
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
set.seed(2323)
knnModel2 <- train(price ~ .,
                   data = train_tree,
                   method = "knn",
                   preProc = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
postResample(predict(knnModel2, tes_treet), tes_treet$price)
RMSE Rsquared MAE
119.0166008 0.4729685 21.3613044
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(124)
knnModel3 <- train(price ~ .,
                   data = train,
                   method = "knn",
                   preProc = c("center", "scale"),
                   trControl = ctrl,
                   tuneLength = 10)
stopCluster(cl)
saveRDS(knnModel3,"Models/knnModel3.rds")
knnModel3 <- readRDS("Models/knnModel3.rds")
postResample(predict(knnModel3, test), test$price)
RMSE Rsquared MAE
79.8675442 0.6583582 17.2396813
Using all of the categorical variables worked horribly and made even the postResample() call take forever to run, so I chose to remove them.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmRTuned <- train(price ~ .,
                   data = train_tree,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   trControl = ctrl)
stopCluster(cl)
postResample(predict(svmRTuned, tes_treet), tes_treet$price)
RMSE Rsquared MAE
125.3729061 0.4477469 26.9191512
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmRTuned1 <- train(price ~ .,
                    data = train_no,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    trControl = ctrl)
stopCluster(cl)
postResample(predict(svmRTuned1, test_no), test_no$price)
RMSE Rsquared MAE
110.9879491 0.1839489 26.3349835
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = 1:15,
                        ## Use bagging instead of different random seeds.
                        .bag = FALSE)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(100)
nnetTune <- train(price ~ .,
                  data = train_tree,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 15 * (ncol(train_tree) + 1) + 15 + 1,
                  preProc = c("center", "scale"),
                  maxit = 1000,
                  allowParallel = TRUE)
stopCluster(cl)
saveRDS(nnetTune,'Models/nnetTune.rds')
nnetTune <- readRDS('Models/nnetTune.rds')
postResample(predict(nnetTune, tes_treet), tes_treet$price)
RMSE Rsquared MAE
114.1134621 0.5167132 26.9752785
I was unable to use all of the variables here; most of the other predictors did not have enough variance.
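As a quick sketch of how to verify that claim (using the imputed training frame from above), caret's nearZeroVar() flags predictors whose most frequent value dominates:
# Flag near-zero-variance predictors; saveMetrics = TRUE returns the
# frequency-ratio and percent-unique diagnostics for every column.
nzv <- nearZeroVar(train_tree, saveMetrics = TRUE)
nzv[nzv$nzv, ] # rows caret considers near zero variance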
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(100)
nnetTune2 <- train(price ~ .,
                   data = train_no,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   trControl = ctrl,
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 15 * (ncol(train_no) + 1) + 15 + 1,
                   preProc = c("center", "scale"),
                   maxit = 600,
                   allowParallel = TRUE)
stopCluster(cl)
saveRDS(nnetTune2,'Models/nnetTune2.rds')
nnetTune2 <- readRDS('Models/nnetTune2.rds')
postResample(predict(nnetTune2, test_no), test_no$price)
RMSE Rsquared MAE
75.8879166 0.5939673 28.2663009
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo <- train(price ~ ., data = train_tree,
method = 'cubist',
trControl = trainControl(method = "cv", number = 30))
stopCluster(cl)
saveRDS(cubistMo,'Models/cubistMo.rds')
cubistMo <- readRDS('Models/cubistMo.rds')
postResample(predict(cubistMo, tes_treet), tes_treet$price)
RMSE Rsquared MAE
118.210526 0.480081 20.687160
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo2 <- train(price ~ ., data = train,
method = 'cubist',
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
saveRDS(cubistMo2,'Models/cubistMo2.rds')
cubistMo2 <- readRDS('Models/cubistMo2.rds')
postResample(predict(cubistMo2, test), test$price)
RMSE Rsquared MAE
61.676003 0.759151 14.083656
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo3 <- train(price ~ ., data = train_alt,
method = 'cubist',
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
saveRDS(cubistMo3,'Models/cubistMo3.rds')
cubistMo3 <- readRDS('Models/cubistMo3.rds')
postResample(predict(cubistMo3, test_alt), test_alt$price)
RMSE Rsquared MAE
53.1418677 0.8071393 12.6795783
This data set had many issues that had to be dealt with. The year column contained character entries rather than numeric years, indicating missing vintages. There were loads of missing values with no clear reason for being missing. Many of the categorical variables were highly correlated with one another. Fixing all of these problems took careful time and consideration. I used some very useful and powerful functions from the lares package to find cross-correlations and to see which predictors were most correlated with the variable I was attempting to predict, the price of a given wine.
I then created multiple sets of training and test data. My thought process was that I was not certain why the original values were missing. As the data was collected through web scraping, there was a possibility that the rows with missing values were added by mistake and carried no real information. To test whether this was true, I created both an imputed training set and one where I simply removed all the NA values. Ultimately, I found that the training set without imputation performed much better for my best model, but it is always prudent to confirm this by testing both ways before assuming one method is better than another. Interestingly, both the linear model and the support vector machine actually did better with the imputed set.
Throughout this project, for both the modeling and the imputations, I also used a very useful capability within R that is only found by digging deeper: running code in parallel. By running my imputations and models in parallel, the code executes much faster; it is an excellent way to speed up getting work done in R. I also made sure to save any model that took exceptionally long to fit, making it a breeze to knit my file later and preserving my results for reproducibility.
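Condensed into a sketch, the pattern looks like this (the model choice, cluster size, and file path here are illustrative):
# Fit-once, reload-later pattern with a parallel backend.
if (file.exists("Models/fit.rds")) {
  fit <- readRDS("Models/fit.rds") # reuse the cached model when knitting
} else {
  cl <- makePSOCKcluster(4) # size the cluster to the machine's cores
  registerDoParallel(cl)
  fit <- train(price ~ ., data = train_no, method = "knn",
               preProc = c("center", "scale"), trControl = ctrl)
  stopCluster(cl)
  saveRDS(fit, "Models/fit.rds") # cache the slow fit for reproducibility
}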
I used a variety of machine learning algorithms taught during this class. I began with a simple linear regression model, using a stepwise (AIC) procedure to find the best set of variables possible. Due to the high correlation among many of the categorical predictors, the linear models could only handle a smaller portion of the data set. As a result, they did not perform very well.
Next I tried K-nearest neighbors. Given how many levels some of the top categorical variables had, I figured it would work well for this data. In fact, it was able to overcome the data set's various issues and gave a much better performance than the linear model did.
Then I tested a support vector machine. It also did much better than the linear regression model, but it came nowhere near the KNN approach. It could not handle most of the predictors; including them made it perform horribly. I had to limit the number of variables for that reason, which likely hurt its performance.
Next, I used a neural network, one of the later models discussed in this class. It did not like this data set: there were too many categorical classes with too few distinct values. I had to reduce the set of predictors fed to the model, otherwise some runs would converge on missing data and throw an error. In the end it did about as well as the support vector machine.
Finally, I used an alternative tree-based model that I researched and found to be very good at handling high-dimensional data such as this: Cubist. With very little tuning required, I was able to get it running on most of the predictors. It performed remarkably well and gave the highest overall score. I chose it over random forest because Cubist seemed to work much better, and I am glad to have discovered it.
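For reference, the random forest comparison alluded to here would look roughly like this. It is a sketch, not run in this report; it assumes the randomForest package is installed, and it would run slowly given how many dummy variables the formula interface creates:
# Hypothetical random forest baseline on the same partition as cubistMo3.
set.seed(1289)
rfMo <- train(price ~ ., data = train_alt,
              method = "rf",
              trControl = trainControl(method = "cv", number = 10))
postResample(predict(rfMo, test_alt), test_alt$price)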
For this project, I dealt with a complicated and messy data set. I had to fix many issues and overcome many challenges to make everything work the way I wanted. Through this work, I discovered many new methods for dealing with messy data. I am very happy with my results, but there is certainly more cleaning and modeling to be done: there are likely more problems in this data waiting to be discovered, better ways of handling the missing values, and so on. For now, though, I am happy with my current results and will take my leave here. Thank you for taking the time to read my report.