pacman::p_load(MASS, tidyverse, skimr, DataExplorer, ggcorrplot, fpp3, tsibble, xts, lares, caret, missForest, doParallel, car)
For this final project, I was tasked with finding a new data set and testing a variety of machine learning methods on it. The data can be found here: https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset
As explained on Kaggle, this dataset covers red variants of Spanish wines. It describes several popularity and description metrics and their effect on quality. The dataset can be used for classification or regression tasks. The classes are ordered and not balanced (the ratings run only from about 4.2 to 4.9). The task is to predict either the quality or the price of the wines using the given data.
I chose to focus on predicting price for my project. Due to the web-scraped nature of the data set, there are many irregularities and missing values.
wine <- read_csv("archive/wines_SPA.csv")
wine
skim(wine)
| Data summary | |
|---|---|
| Name | wine |
| Number of rows | 7500 |
| Number of columns | 11 |
| Column type frequency: character | 6 |
| Column type frequency: numeric | 5 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| winery | 0 | 1.00 | 3 | 48 | 0 | 480 | 0 |
| wine | 0 | 1.00 | 2 | 70 | 0 | 847 | 0 |
| year | 2 | 1.00 | 4 | 4 | 0 | 71 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| region | 0 | 1.00 | 4 | 31 | 0 | 76 | 0 |
| type | 545 | 0.93 | 3 | 20 | 0 | 21 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rating | 0 | 1.00 | 4.25 | 0.12 | 4.20 | 4.2 | 4.20 | 4.20 | 4.90 | ▇▁▁▁▁ |
| num_reviews | 0 | 1.00 | 451.11 | 723.00 | 25.00 | 389.0 | 404.00 | 415.00 | 32624.00 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 60.10 | 150.36 | 4.99 | 18.9 | 28.53 | 51.35 | 3119.08 | ▇▁▁▁▁ |
| body | 1169 | 0.84 | 4.16 | 0.58 | 2.00 | 4.0 | 4.00 | 5.00 | 5.00 | ▁▁▁▇▃ |
| acidity | 1169 | 0.84 | 2.95 | 0.25 | 1.00 | 3.0 | 3.00 | 3.00 | 3.00 | ▁▁▁▁▇ |
plot_missing(wine)
wine %>% filter(str_detect(year, "[:alpha:]"))
wine <- wine %>% mutate(year = ifelse(str_detect(year, "[:alpha:]"), NA, year))
plot_missing(wine)
wine %>% filter(if_any(everything(), ~ is.na(.)))
wine <- type.convert(wine, as.is = FALSE) %>% select(!country)
na.omit(wine) %>% plot_correlation(cor_args = list("use" = "everything"))
corr_var(wine,price)
corr_cross(wine)
corr_cross(wine,type=2)
There is a severe aliasing issue: many variables correspond directly to others (each wine name, for example, belongs to exactly one winery). Some of that is visible below.
aliases <- alias(lm(price ~ ., data = wine, na.action = na.exclude))
aliases$Complete[1:10, 1:10]
The printed 10 x 10 corner of the alias matrix (rows are `wine` dummy variables; columns are the intercept and `winery` dummies such as wineryAbadal, wineryAbadia Retuerta, wineryAbel Mendoza Monge, wineryAcustic Celler, wineryAdama Wines, wineryAdega Familiar Eladio Pineiro, wineryAGE, wineryAgusti Torello Mata, and wineryAlbamar) is almost entirely zeros. The single nonzero entry shows wine3.9 aliased with wineryAbadal: individual wine names map directly onto wineries, so the two sets of dummies are linearly dependent.
I had to remove these variables to fix the issue.
alias(lm(price ~ . - winery - body - wine - acidity - region, data = wine, na.action = na.exclude))
Model :
price ~ (winery + wine + year + rating + num_reviews + region +
type + body + acidity) - winery - body - wine - acidity -
region
No longer any visible issues.
vif(lm(price ~ . - winery - body - wine - acidity - region, data = wine, na.action = na.exclude))
GVIF Df GVIF^(1/(2*Df))
year 1.464693 1 1.210245
rating 1.218969 1 1.104069
num_reviews 1.019916 1 1.009909
type 1.501323 20 1.010210
corr_var(select(wine, !c(winery, body, wine, acidity, region)), price)
set.seed(123)
# Partition 1: drop the high-cardinality categoricals, then drop NA rows.
wine_no <- select(wine, !c(winery, wine, region)) %>% na.omit()
train_index <- createDataPartition(wine_no$price, p = .8, list = FALSE)
train_no <- wine_no[train_index, ]
test_no <- wine_no[-train_index, ]

set.seed(123)
# Partition 2: keep every variable, dropping only the NA rows.
wine_a <- na.omit(wine)
train_index <- createDataPartition(wine_a$price, p = .8, list = FALSE)
train <- wine_a[train_index, ]
test <- wine_a[-train_index, ]

set.seed(123)
# Partition 3: drop winery only, keeping the NA rows for imputation.
wine_alt <- select(wine, !c(winery))
train_index_a <- createDataPartition(wine_alt$price, p = .8, list = FALSE)
train_a <- wine_alt[train_index_a, ]
test_a <- wine_alt[-train_index_a, ]

set.seed(123)
# Partition 4: drop winery and the NA rows.
wine_a <- select(wine, !c(winery)) %>% na.omit()
train_index <- createDataPartition(wine_a$price, p = .8, list = FALSE)
train_alt <- wine_a[train_index, ]
test_alt <- wine_a[-train_index, ]
corr_var(train_a,price)
set.seed(123)
doParallel::registerDoParallel(cores = 4) # set based on number of CPU cores
# Set aside the high-cardinality categoricals; they are bound back on after imputation.
add_lateTr <- train_a %>% select(region, wine)
predictors <- select(train_a, !c(price, region, wine)) %>% as.data.frame()
test_pred <- select(test_a, !c(price, region, wine)) %>% as.data.frame()
add_lateTe <- test_a %>% select(region, wine)
# Impute the training predictors with random forests.
train_tree <- missForest(predictors, parallelize = 'forests', ntree = 500, maxiter = 4)$ximp
# Impute the test predictors alongside the already-imputed training rows,
# then keep only the first nrow(test_a) rows, i.e. the test set.
train_test_X <- rbind(test_pred, train_tree)
tes_treet <- missForest(as.data.frame(train_test_X), parallelize = 'forests', ntree = 500, maxiter = 4)$ximp[1:nrow(test_a), ]
I did not end up finding the alt version useful, but I left it in to show my process.
train_tree <- train_tree %>%
  mutate(across(c(year), ~ round(.) %>% as.integer()),
         price = train_a$price)
tes_treet <- tes_treet %>%
  mutate(across(c(year), ~ round(.) %>% as.integer()),
         price = test_a$price)
train_tree_alt <- train_tree %>% bind_cols(add_lateTr)
test_tree_alt <- tes_treet %>% bind_cols(add_lateTe)
corr_var(train_tree,price)
corr_var(train,price)
# Clear the foreach backend registration left over from registerDoParallel().
unregister <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}
unregister()
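One note before the models: ctrl is the resampling control shared by most of the train() calls below, but its definition did not make it into this document. A minimal definition, assuming the same 10-fold cross-validation used in the explicit trainControl() calls further down:
ctrl <- trainControl(method = "cv", number = 10) # assumed 10-fold CV, matching the explicit calls below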
set.seed(128)
lm_model <- train(price ~ .,
                  data = train_tree,
                  trace = FALSE,
                  method = 'lmStepAIC', metric = 'Rsquared', trControl = ctrl)
set.seed(128)
lm_model2 <- train(price ~ .,
                   data = train_no,
                   trace = FALSE,
                   method = 'lmStepAIC', metric = 'Rsquared', trControl = ctrl)
postResample(predict(lm_model, test_no), test_no$price)
RMSE Rsquared MAE
86.0177031 0.4597681 38.7092733
postResample(predict(lm_model2, tes_treet), tes_treet$price)
RMSE Rsquared MAE
130.2807266 0.3692052 48.3646526
set.seed(2334)
knnModel <- train(price ~ .,
                  data = train_no,
                  method = "knn",
                  preProc = c("center", "scale"),
                  trControl = ctrl,
                  tuneLength = 12)
postResample(predict(knnModel, test_no), test_no$price)
RMSE Rsquared MAE
83.14928 0.55793 21.76782
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
set.seed(2323)
knnModel2 <- train(price ~ .,
                   data = train_tree,
                   method = "knn",
                   preProc = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
postResample(predict(knnModel2, tes_treet), tes_treet$price)
RMSE Rsquared MAE
119.0166008 0.4729685 21.3613044
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(124)
knnModel3 <- train(price ~ .,
                   data = train,
                   method = "knn",
                   preProc = c("center", "scale"),
                   trControl = ctrl,
                   tuneLength = 10)
stopCluster(cl)
saveRDS(knnModel3,"Models/knnModel3.rds")
knnModel3 <- readRDS("Models/knnModel3.rds")
postResample(predict(knnModel3, test), test$price)
RMSE Rsquared MAE
79.8675442 0.6583582 17.2396813
Using all of the categorical variables worked horribly and made even the postResample() call take forever to run, so I chose to remove them.
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmRTuned <- train(price ~ .,
                   data = train_tree,
                   method = "svmRadial",
                   preProc = c("center", "scale"),
                   trControl = ctrl)
stopCluster(cl)
postResample(predict(svmRTuned, tes_treet), tes_treet$price)
RMSE Rsquared MAE
125.3729061 0.4477469 26.9191512
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
svmRTuned1 <- train(price ~ .,
                    data = train_no,
                    method = "svmRadial",
                    preProc = c("center", "scale"),
                    trControl = ctrl)
stopCluster(cl)
postResample(predict(svmRTuned1, test_no), test_no$price)
RMSE Rsquared MAE
110.9879491 0.1839489 26.3349835
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1),
                        .size = 1:15,
                        ## Use bagging instead of different random seeds.
                        .bag = FALSE)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(100)
nnetTune <- train(price ~ .,
                  data = train_tree,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 15 * (ncol(train_tree) + 1) + 15 + 1,
                  preProc = c("center", "scale"),
                  maxit = 1000,
                  allowParallel = TRUE)
stopCluster(cl)
saveRDS(nnetTune,'Models/nnetTune.rds')
nnetTune <- readRDS('Models/nnetTune.rds')
postResample(predict(nnetTune, tes_treet), tes_treet$price)
RMSE Rsquared MAE
114.1134621 0.5167132 26.9752785
I was unable to use all of the variables here; most of the other predictors did not have enough variance.
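As a quick sketch of how to verify that claim (using the imputed training frame from above), caret's nearZeroVar() flags predictors whose most frequent value dominates:
# Flag near-zero-variance predictors; saveMetrics = TRUE returns the
# frequency-ratio and percent-unique diagnostics for every column.
nzv <- nearZeroVar(train_tree, saveMetrics = TRUE)
nzv[nzv$nzv, ] # rows caret considers near zero variance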
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(100)
nnetTune2 <- train(price ~ .,
                   data = train_no,
                   method = "avNNet",
                   tuneGrid = nnetGrid,
                   trControl = ctrl,
                   linout = TRUE,
                   trace = FALSE,
                   MaxNWts = 15 * (ncol(train_no) + 1) + 15 + 1,
                   preProc = c("center", "scale"),
                   maxit = 600,
                   allowParallel = TRUE)
stopCluster(cl)
saveRDS(nnetTune2,'Models/nnetTune2.rds')
nnetTune2 <- readRDS('Models/nnetTune2.rds')
postResample(predict(nnetTune2, test_no), test_no$price)
RMSE Rsquared MAE
75.8879166 0.5939673 28.2663009
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo <- train(price ~ ., data = train_tree,
method = 'cubist',
trControl = trainControl(method = "cv", number = 30))
stopCluster(cl)
saveRDS(cubistMo,'Models/cubistMo.rds')
cubistMo <- readRDS('Models/cubistMo.rds')
postResample(predict(cubistMo, tes_treet), tes_treet$price)
RMSE Rsquared MAE
118.210526 0.480081 20.687160
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo2 <- train(price ~ ., data = train,
method = 'cubist',
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
saveRDS(cubistMo2,'Models/cubistMo2.rds')
cubistMo2 <- readRDS('Models/cubistMo2.rds')
postResample(predict(cubistMo2, test), test$price)
RMSE Rsquared MAE
61.676003 0.759151 14.083656
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
set.seed(1289)
cubistMo3 <- train(price ~ ., data = train_alt,
method = 'cubist',
preProc = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10))
stopCluster(cl)
saveRDS(cubistMo3,'Models/cubistMo3.rds')
cubistMo3 <- readRDS('Models/cubistMo3.rds')
postResample(predict(cubistMo3, test_alt), test_alt$price)
RMSE Rsquared MAE
53.1418677 0.8071393 12.6795783
This data set had many issues that had to be dealt with. The year column contained character entries rather than numeric years, indicating missing vintages. There were loads of missing values with no clear reason for being missing. Many of the categorical variables were highly correlated with one another. Fixing all of these problems took careful time and consideration. I used some very useful and powerful functions from the lares package to find cross-correlations and to see which predictors were most correlated with the variable I was attempting to predict, the price of a given wine.
I then created multiple sets of training and test data. My thought process was that I was not certain why the original values were missing. As the data was collected through web scraping, there was a possibility that the rows with missing values were added by mistake and carried no real information. To test whether this was true, I created both an imputed training set and one where I simply removed all the NA values. Ultimately, I found that the training set without imputation performed much better for my best model, but it is always prudent to confirm this by testing both ways before assuming one method is better than another. Interestingly, both the linear model and the support vector machine actually did better with the imputed set.
Throughout this project, for both the modeling and the imputations, I also used a very useful capability within R that is only found by digging deeper: running code in parallel. By running my imputations and models in parallel, the code executes much faster; it is an excellent way to speed up getting work done in R. I also made sure to save any model that took exceptionally long to fit, making it a breeze to knit my file later and preserving my results for reproducibility.
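Condensed into a sketch, the pattern looks like this (the model choice, cluster size, and file path here are illustrative):
# Fit-once, reload-later pattern with a parallel backend.
if (file.exists("Models/fit.rds")) {
  fit <- readRDS("Models/fit.rds") # reuse the cached model when knitting
} else {
  cl <- makePSOCKcluster(4) # size the cluster to the machine's cores
  registerDoParallel(cl)
  fit <- train(price ~ ., data = train_no, method = "knn",
               preProc = c("center", "scale"), trControl = ctrl)
  stopCluster(cl)
  saveRDS(fit, "Models/fit.rds") # cache the slow fit for reproducibility
}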
I used a variety of machine learning algorithms taught during this class. I began with a simple linear regression model, using a stepwise (AIC) procedure to find the best set of variables possible. Due to the high correlation among many of the categorical predictors, the linear models could only handle a smaller portion of the data set. As a result, they did not perform very well.
Next I tried K-nearest neighbors. Given how many levels some of the top categorical variables had, I figured it would work well for this data. In fact, it was able to overcome the data set's various issues and gave a much better performance than the linear model did.
Then I tested a support vector machine. It also did much better than the linear regression model, but it came nowhere near the KNN approach. It could not handle most of the predictors; including them made it perform horribly. I had to limit the number of variables for that reason, which likely hurt its performance.
Next, I used a neural network, one of the later models discussed in this class. It did not like this data set: there were too many categorical classes with too few distinct values. I had to reduce the set of predictors fed to the model, otherwise some runs would converge on missing data and throw an error. In the end it did about as well as the support vector machine.
Finally, I used an alternative tree-based model that I researched and found to be very good at handling high-dimensional data such as this: Cubist. With very little tuning required, I was able to get it running on most of the predictors. It performed remarkably well and gave the highest overall score. I chose it over random forest because Cubist seemed to work much better, and I am glad to have discovered it.
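For reference, the random forest comparison alluded to here would look roughly like this. It is a sketch, not run in this report; it assumes the randomForest package is installed, and it would run slowly given how many dummy variables the formula interface creates:
# Hypothetical random forest baseline on the same partition as cubistMo3.
set.seed(1289)
rfMo <- train(price ~ ., data = train_alt,
              method = "rf",
              trControl = trainControl(method = "cv", number = 10))
postResample(predict(rfMo, test_alt), test_alt$price)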
For this project, I dealt with a complicated and messy data set. I had to fix many issues and overcome many challenges to make everything work the way I wanted. Through this work, I discovered many new methods for dealing with messy data. I am very happy with my results, but there is certainly more cleaning and modeling to be done: there are likely more problems in this data waiting to be discovered, better ways of handling the missing values, and so on. For now, though, I am happy with my current results and will take my leave here. Thank you for taking the time to read my report.