Group Members
We have been given a dataset from a beverage manufacturing company that consists of 2,571 rows and 33 columns. The dataset contains information on different beverages and their chemical composition. The goal of this analysis is to use the 32 predictive features to predict the potential of hydrogen (pH), a measure of the acidity/alkalinity of each beverage. pH is the key KPI in this analysis.
We’ll begin by reading in the dataset and looking at each column’s data type:
# load the libraries used throughout this analysis
library(tidyverse)     # readr, dplyr, stringr, ggplot2
library(caret)         # data splitting, pre-processing, and model training
library(randomForest)

data_raw <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentData%20-%20TO%20MODEL.csv')
data_raw$obs_type <- "train"
eval_raw <- readr::read_csv('https://raw.githubusercontent.com/christianthieme/Predictive-Analytics/main/Project2/data/StudentEvaluation-%20TO%20PREDICT.csv')
eval_raw$obs_type <- "test"
combined <- data_raw %>%
rbind(eval_raw)
# convert column names to all lowercase
names(combined) <- tolower(names(combined))
# convert column name spaces to underscore
names(combined) <- str_replace_all(names(combined), ' ', '_')
df <- combined %>%
filter(obs_type == 'train') %>%
select(-obs_type)
eval <- combined %>%
filter(obs_type == 'test') %>%
select(-obs_type)
glimpse(combined)

## Rows: 2,838
## Columns: 34
## $ brand_code <chr> "B", "A", "B", "A", "A", "A", "A", "B", "B", "B", "B…
## $ carb_volume <dbl> 5.340000, 5.426667, 5.286667, 5.440000, 5.486667, 5.…
## $ fill_ounces <dbl> 23.96667, 24.00667, 24.06000, 24.00667, 24.31333, 23…
## $ pc_volume <dbl> 0.2633333, 0.2386667, 0.2633333, 0.2933333, 0.111333…
## $ carb_pressure <dbl> 68.2, 68.4, 70.8, 63.0, 67.2, 66.6, 64.2, 67.6, 64.2…
## $ carb_temp <dbl> 141.2, 139.6, 144.8, 132.6, 136.8, 138.4, 136.8, 141…
## $ psc <dbl> 0.104, 0.124, 0.090, NA, 0.026, 0.090, 0.128, 0.154,…
## $ psc_fill <dbl> 0.26, 0.22, 0.34, 0.42, 0.16, 0.24, 0.40, 0.34, 0.12…
## $ psc_co2 <dbl> 0.04, 0.04, 0.16, 0.04, 0.12, 0.04, 0.04, 0.04, 0.14…
## $ mnf_flow <dbl> -100, -100, -100, -100, -100, -100, -100, -100, -100…
## $ carb_pressure1 <dbl> 118.8, 121.6, 120.2, 115.2, 118.4, 119.6, 122.2, 124…
## $ fill_pressure <dbl> 46.0, 46.0, 46.0, 46.4, 45.8, 45.6, 51.8, 46.8, 46.0…
## $ hyd_pressure1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure2 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure3 <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ hyd_pressure4 <dbl> 118, 106, 82, 92, 92, 116, 124, 132, 90, 108, 94, 86…
## $ filler_level <dbl> 121.2, 118.6, 120.0, 117.8, 118.6, 120.2, 123.4, 118…
## $ filler_speed <dbl> 4002, 3986, 4020, 4012, 4010, 4014, NA, 1004, 4014, …
## $ temperature <dbl> 66.0, 67.6, 67.0, 65.6, 65.6, 66.2, 65.8, 65.2, 65.4…
## $ usage_cont <dbl> 16.18, 19.90, 17.76, 17.42, 17.68, 23.82, 20.74, 18.…
## $ carb_flow <dbl> 2932, 3144, 2914, 3062, 3054, 2948, 30, 684, 2902, 3…
## $ density <dbl> 0.88, 0.92, 1.58, 1.54, 1.54, 1.52, 0.84, 0.84, 0.90…
## $ mfr <dbl> 725.0, 726.8, 735.0, 730.6, 722.8, 738.8, NA, NA, 74…
## $ balling <dbl> 1.398, 1.498, 3.142, 3.042, 3.042, 2.992, 1.298, 1.2…
## $ pressure_vacuum <dbl> -4.0, -4.0, -3.8, -4.4, -4.4, -4.4, -4.4, -4.4, -4.4…
## $ ph <dbl> 8.36, 8.26, 8.94, 8.24, 8.26, 8.32, 8.40, 8.38, 8.38…
## $ oxygen_filler <dbl> 0.022, 0.026, 0.024, 0.030, 0.030, 0.024, 0.066, 0.0…
## $ bowl_setpoint <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12…
## $ pressure_setpoint <dbl> 46.4, 46.8, 46.6, 46.0, 46.0, 46.0, 46.0, 46.0, 46.0…
## $ air_pressurer <dbl> 142.6, 143.0, 142.0, 146.2, 146.2, 146.6, 146.2, 146…
## $ alch_rel <dbl> 6.58, 6.56, 7.66, 7.14, 7.14, 7.16, 6.54, 6.52, 6.52…
## $ carb_rel <dbl> 5.32, 5.30, 5.84, 5.42, 5.44, 5.44, 5.38, 5.34, 5.34…
## $ balling_lvl <dbl> 1.48, 1.56, 3.28, 3.04, 3.04, 3.02, 1.44, 1.44, 1.44…
## $ obs_type <chr> "train", "train", "train", "train", "train", "train"…
We see that all columns, with the exception of brand_code, are doubles and continuous. Excluding the response variable, this leaves us with 1 categorical variable and 31 continuous variables to work with.
In the output above, we can see that there are missing values (NAs). Let’s see how pervasive this issue is within our dataset:
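The chunk that produced the figures below isn't shown in this extract; a minimal way to reproduce the check (a sketch assuming dplyr 1.0+ and tidyr) is:

# overall share of missing cells in the training data
mean(is.na(df))
# per-column missing percentage, sorted from most to least missing
df %>%
summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
tidyr::pivot_longer(everything(), names_to = "column", values_to = "pct_missing") %>%
arrange(desc(pct_missing))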
In total, only about 1% of our data is missing, and most columns are missing only a negligible amount. mfr and brand_code have the largest amounts of missing values, at 8.25% and 4.67% of their observations, respectively. Additionally, there does not appear to be a pattern to which values are missing. Now that we know missing values are not a pervasive issue, we'll continue with our analysis.
Let’s get an understanding of the distribution of our response variable:
df %>%
select(ph) %>%
ggplot() +
aes(x = ph) +
geom_histogram(fill = "lightsalmon2", color = "black") +
labs(title = "Histogram of pH", y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
# plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)

The distribution of pH is left-skewed and multi-modal. Generally speaking, a multi-modal distribution is often an indication that there are sub-populations within the data. We know from looking at our dataset that there is a brand_code with values A, B, C, and D. Let's break the distribution above into 4 distributions based on these values:
df %>%
ggplot() +
aes(x = ph) +
geom_histogram(fill = "lightsalmon2", color = "black") +
labs(title = "Distribution of pH by Brand") +
facet_wrap(~brand_code, scales = 'free_y') +
theme(
plot.title = element_text(hjust = 0.45),
# plot.margin = ggplot2::margin(10, 20, 10, 10),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()
)

Breaking the data down to this finer grain does not do much to eliminate the bi/multi-modal nature of these distributions. There may be even more granular sub-populations within this data that we are not seeing.
Now we’ll turn our attention to the numeric features within our dataset:
df %>%
select(-ph) %>%
inspectdf::inspect_num() %>%
show_plot() +
labs(title = 'Distribution of Numeric Columns in Training Data')

We note the following about these distributions:
- air_pressurer - there appears to be either two distributions here, or a single distribution and a pocket of outliers
- balling, balling_lvl, density, fill_pressure, hyd_pressure1, hyd_pressure2, hyd_pressure3, hyd_pressure4, mnf_flow, pressure_setpoint - there appear to be two distributions here. This could potentially be connected to brand_code or something else not as easily distinguishable
- bowl_setpoint - half of all the values are around 120
- carb_flow - most values fall between 3,000 and 4,000, with a large pocket of values at 1,000 as well
- filler_speed, mfr, oxygen_filler - each appears to have either two distributions or a few significant outliers

Now that we've looked at our response variable, let's look at our explanatory variables. We'll begin by looking at brand_code, which is our only categorical variable:
df %>%
dplyr::select(brand_code, ph) %>%
ggplot() +
aes(x = brand_code, y = ph) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick', outlier.alpha = 0.35)+
labs(title = 'Brand Code Box Plots',
x = element_blank(),
y = 'pH') +
# facet_grid(~brand_code) +
# theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
# panel.background = element_blank(),
axis.ticks.x = element_line(color = "grey")
)

We can see from the above boxplots that brand_code does have a meaningful relationship with pH. We can also see some significant outliers in C, and possibly D, that will need to be evaluated further. We'll now turn our attention to the numeric features in our dataset.
Numeric Features
df %>%
select(-brand_code) %>%
gather(variable, value, -ph) %>%
ggplot(aes(x = value, y = ph)) +
geom_point(color = 'steelblue', alpha = 0.35) +
facet_wrap(~variable, scales = "free_x") +
labs(title = "pH Relationship with Independent Variables", x = element_blank())+
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey", linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)

We note the following about the relationship between pH and these variables:
- air_pressurer - it appears that there are two sub-groups here. In looking to see whether this was due to brand_code, we found that these sub-groups exist even at individual group levels
- alch_rel - most points appear to be gathered in 3 distinct areas; however, there do appear to be 7 outliers
- balling, balling_lvl - it appears that there are two sub-groups here. These sub-groups do potentially look to be associated with brand_code
- bowl_setpoint - appears to potentially be a categorical variable, as the values don't appear to be continuous. There also appear to be some outliers
- carb_flow - there appear to be two or three groups of data points. We also note the presence of outliers
- density - there appear to be two to three groups of points. We note the presence of an outlier
- filler_speed - values appear to mostly fall within the low or high range. Values in the middle are less frequent
- pressure_setpoint - values appear to be discrete, with the exception of 4 outliers
- pressure_vacuum - values appear to be discrete, with a potentially positive linear relationship. We note the presence of an outlier
- psc_co2 - values appear to be discrete. We note the presence of an outlier
- psc_fill - there appear to be 5 bands that values can fall into, with certain areas that do not have values. We may consider adding a categorical variable to capture this
- carb_pressure, carb_pressure1, carb_rel, carb_volume, fill_ounces, fill_pressure, filler_level, hyd_pressure1-4, mfr, oxygen_filler, pc_volume, psc, temperature, usage_cont - no visible relationship. We do note the presence of outliers

Correlated Features
For many models, correlation between features can be an issue. Let’s see what the correlation between our variables looks like:
numeric_values <- df %>%
dplyr::select_if(is.numeric)
numeric_values <- numeric_values[complete.cases(numeric_values),] %>%
data.frame()
train_cor <- cor(numeric_values)
corrplot::corrplot.mixed(train_cor, tl.col = 'black', tl.pos = 'lt')

We see that many of our features are highly correlated. There are several methods we could use to address this; because we have many features, it may make sense to use principal component analysis, which would allow us to reduce the number of columns in our model and hopefully produce a simpler model. Additionally, many tree-based models are not as sensitive to collinearity as linear models are, so we'll focus our efforts on non-linear models in the modeling phase.
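As one example of a filtering approach we considered (ultimately we kept all features), caret's findCorrelation flags columns whose removal would reduce the largest pairwise correlations. A sketch using the correlation matrix computed above (the cutoff is illustrative; note that ph is still in train_cor here, so in practice it would be dropped before filtering predictors):

# columns recommended for removal at a 0.90 cutoff
high_corr <- caret::findCorrelation(train_cor, cutoff = 0.90, names = TRUE)
high_corr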
Summary EDA Notes
Several of the numeric distributions appear bi-modal or multi-modal, which may point to sub-populations beyond brand_codes. It may be helpful to do some feature engineering to tease this information from the data.

In our analysis of the data, we noted several outliers. Most of those outliers come from a single observation with a pH of 9.36. We removed this outlier but saw a marked reduction in accuracy, so we made the decision to keep the value in the training set.
Many attempts were made at feature engineering, including combining features, rounding features, and expressing features as percentages of other features; however, any additional features seemed to decrease the performance of our model. It may be that each feature has fairly strong signal on its own and our engineering was confounding that signal. As such, we have not included any additional features in our dataset.
We’ll need to impute missing data for both numeric and categorical variables. We’ll use the knnImpute method for the numeric variables and a multinomial logistic regression model to impute the brand_code variable.
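As a minimal sketch of the numeric half of this approach with caret (the k value is illustrative; note that knnImpute also centers and scales the columns it returns):

# KNN imputation of the numeric predictors
num_pp <- caret::preProcess(dplyr::select(df, -ph, -brand_code), method = 'knnImpute', k = 5)
df_num_imputed <- predict(num_pp, dplyr::select(df, -ph, -brand_code))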
We do not use the combined dataset to train our model or impute values under the assumption that this evaluation data wouldn’t be available to us when training our models. In practice, when we’re faced with a new dataset, it would be impossible to use that data in the process of training our models.
# remove rows without a response variable
df3 <- df %>%
filter(!is.na(ph)) %>%
mutate(brand_code = factor(brand_code))
df3$brand_code <- case_when(
df3$brand_code == "A" ~ 1,
df3$brand_code == "B" ~ 2,
df3$brand_code == "C" ~ 3,
df3$brand_code == "D" ~ 4,
)
# remove near zero variance variables
isNZV <- nearZeroVar(df3)
df3 <- df3[, -isNZV] # hyd_pressure1
# imputing numeric variables
# mode(): returns the most frequent value of a vector; the `...` absorbs the
# na.rm argument passed by impute() (table() already drops NAs)
mode <- function(x, ...) {
tb <- table(x)
names(tb)[which.max(tb)]
}
# impute(): replaces NAs in a vector with a summary statistic computed by `fun`
# (mean for the numeric columns, mode for brand_code)
impute <- function(x, fun) {
fval <- fun(x, na.rm = TRUE)
ifelse(is.na(x), fval, x)
}
df3 <- df3 %>% mutate(
brand_code = impute(brand_code, mode),
carb_volume = impute(carb_volume, mean),
fill_ounces = impute(fill_ounces, mean),
pc_volume = impute(pc_volume, mean),
carb_pressure = impute(carb_pressure, mean),
carb_temp = impute(carb_temp, mean),
psc = impute(psc, mean),
psc_fill = impute(psc_fill, mean),
psc_co2 = impute(psc_co2, mean),
carb_pressure1 = impute(carb_pressure1, mean),
fill_pressure = impute(fill_pressure, mean),
#hyd_pressure1 = impute(hyd_pressure1, mean),
hyd_pressure2 = impute(hyd_pressure2, mean),
hyd_pressure3 = impute(hyd_pressure3, mean),
hyd_pressure4 = impute(hyd_pressure4, mean),
filler_level = impute(filler_level, mean),
filler_speed = impute(filler_speed, mean),
temperature = impute(temperature, mean),
usage_cont = impute(usage_cont, mean),
carb_flow = impute(carb_flow, mean),
mfr = impute(mfr, mean),
oxygen_filler = impute(oxygen_filler, mean),
bowl_setpoint = impute(bowl_setpoint, mean),
pressure_setpoint = impute(pressure_setpoint, mean),
alch_rel = impute(alch_rel, mean),
carb_rel = impute(carb_rel, mean),
balling_lvl = impute(balling_lvl, mean)
) %>%
mutate_if(is.character, as.numeric)
visdat::vis_miss(df3)

Processing Note
As part of our processing, we attempted many different transformations and adjustments but saw no improvement in our RMSE or MAE; in fact, many of the changes negatively affected our accuracy, so we chose a simpler processing approach. We experimented with Yeo-Johnson transformations, centering, scaling, excluding highly correlated features, and principal component analysis. Ultimately, we determined that using only KNN to impute missing values in our numeric columns and a multinomial model to impute missing values in our categorical column produced the most accurate results.
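For reference, a minimal sketch of the kind of caret recipe we experimented with before settling on the simpler approach (the method list mirrors the experiments above; none of it is used in the final pipeline):

pp_experiment <- caret::preProcess(dplyr::select(df, -ph),
method = c('YeoJohnson', 'center', 'scale', 'corr', 'pca'))
df_transformed <- predict(pp_experiment, dplyr::select(df, -ph))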
After imputing, we'll use caret's createDataPartition to create a stratified 80/20 train/test split. We will use this test set to evaluate all of our models; if two or more models have similar error and variance measures, we can continue with cross-validation on additional splits. During our EDA we noted that brand_code was a significant feature in this dataset, so we chose a stratified sampling approach in an effort to make our training and testing sets more uniform.
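The splitting chunk is not shown in this extract; a minimal reconstruction consistent with the description (the seed is illustrative) looks like this:

# stratified 80/20 split on the response
set.seed(101)
trainIndex <- caret::createDataPartition(df3$ph, p = 0.8, list = FALSE)
train <- df3[trainIndex, ]
test <- df3[-trainIndex, ]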
Having split our data into training and testing sets, we're now ready to fit and train our models. Due to the non-linear relationships between our features and the response variable, as well as the outliers and other issues noted during our exploratory data analysis, we have determined that non-linear models would most likely perform best on this dataset. With this in mind, we'll fit and train several different non-linear models to see which one predicts with the lowest error and least variance on the test dataset. Once we determine which model is the most accurate, we'll move forward with creating predictions on the provided scoring set. We have chosen the following models to train:

- Multivariate Adaptive Regression Splines (MARS)
- Support Vector Machine (SVM) with a radial kernel
- K-Nearest Neighbors (KNN)
- Random Forest
- Cubist
- Averaged Neural Network (avNNet)
As part of our training, we use k-fold cross-validation to select the optimal tuning parameters for each model.
# MARS
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(101)
mars <- train(ph ~ .,
data = train,
method = 'earth',
tuneGrid = marsGrid,
trControl = trainControl(method = 'cv')
)
# SVM
svm <- train(ph ~ .,
data = train,
method = 'svmRadial',
preProc = c('center', 'scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv')
)
# KNN
set.seed(101)
knn <- train(ph ~ .,
data = train,
method = 'knn',
preProc = c('center', 'scale'),
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv')
)
# random forest
rf <- randomForest(ph ~ .,
data = train,
ntree = 1000)
# cubist
cubist <- train(ph ~ .,
data = train,
method = 'cubist')

# Avg Neural Nets
set.seed(100)
trainingData <- select(train, -ph)
trainingData <- data.frame(trainingData)
testData <- select(test, -ph)
testData <- data.frame(testData)
neural_model <- expand.grid(.decay = c(0, 0.01, .1),
.size = c(1:10),
.bag = FALSE)
# get the maximum number of hidden units
maxSize <- max(neural_model$.size)
# there are M(p+1)+M+1 parameters in total
numWts <- 1*(maxSize * (length(trainingData) + 1) + maxSize + 1)
ctrl <- trainControl(method = "cv")
avNNet_model <- train(trainingData, train$ph,
method = "avNNet",
tuneGrid = neural_model,
trControl = ctrl,
preProcess = c("center", "scale"),
linout = TRUE,
trace = FALSE,
maxit = 500,
MaxNWts = numWts
)

avNNet_predict <- predict(avNNet_model, newdata = testData)
avNNet_results <- postResample(pred = avNNet_predict, obs = test$ph)
# postResample() returns a named numeric vector, so append MAPE with c() rather than $
avNNet_results <- c(avNNet_results, MAPE = MAPE(test$ph, avNNet_predict))

Having trained each of our models, we'll now turn our attention to evaluating the predictive power of each one. We'll use postResample to measure the error between each model's predictions and the holdout test data and display the accuracy metrics in the table below:
# predictions
test$mars <- predict(mars, test)
test$svm <- predict(svm, test)
test$knn <- predict(knn, test)
test$rf <- predict(rf, test)
test$cubist <- predict(cubist, test)
# measure of error and variance
metrics <- data.frame(rbind(
postResample(pred = test$mars, obs = test$ph),
postResample(pred = test$svm, obs = test$ph),
postResample(pred = test$knn, obs = test$ph),
postResample(pred = test$rf, obs = test$ph),
postResample(pred = test$cubist, obs = test$ph)
),
row.names = c('MARS', 'SVM', 'KNN', 'Random Forest', 'Cubist')
)
metrics$MAPE <- c(MAPE(test$ph, test$mars), MAPE(test$ph, test$svm),
MAPE(test$ph, test$knn), MAPE(test$ph, test$rf), MAPE(test$ph, test$cubist))

| Model Type | RMSE | \(R^2\) | MAE | MAPE |
|---|---|---|---|---|
| MARS | 0.1176799 | 0.4952452 | 0.0887886 | 99.9757625 |
| SVM | 0.1119914 | 0.5462167 | 0.0810064 | 99.8763735 |
| KNN | 0.1162803 | 0.5153739 | 0.0882178 | 99.8192806 |
| RF | 0.0946641 | 0.6905932 | 0.0694998 | 99.9888221 |
| Cubist | 0.0917986 | 0.692324 | 0.0670975 | 99.9619656 |
| Neural Nets | 0.1134418 | 0.5324197 | 0.08416577 | 99.98299 |
The error measures for the random forest and Cubist models were close, so we'll do one more pass of cross-validation with a new train/test split in order to have additional confidence in the chosen model.
# create new train/test split with different seed and 75/25 split
set.seed(2)
trainIndex <- createDataPartition(df3$ph,
p = 0.75,
list = F)
train2 <- df3[trainIndex, ]
test2 <- df3[-trainIndex, ]
# random forest
rf2 <- randomForest(ph ~ .,
data = train2,
ntree = 1000)
# cubist
cubist2 <- train(ph ~ .,
data = train2,
method = 'cubist')
# predictions
test2$rf <- predict(rf2, test2)
test2$cubist <- predict(cubist2, test2)
# measure of error and variance
metrics2 <- data.frame(rbind(
postResample(pred = test$cubist, obs = test$ph),
postResample(pred = test2$rf, obs = test2$ph),
postResample(pred = test2$cubist, obs = test2$ph)
),
row.names = c('Cubist v1', 'Random Forest v2', 'Cubist v2')
)
metrics2$MAPE <- c(MAPE(test$ph, test$cubist), MAPE(test2$ph, test2$rf),
MAPE(test2$ph, test2$cubist))

| Model Type | RMSE | \(R^2\) | MAE | MAPE |
|---|---|---|---|---|
| Cubist v1 | 0.0917986 | 0.692324 | 0.0670975 | 99.9619656 |
| RF v2 | 0.1021197 | 0.6520538 | 0.0733234 | 99.9617631 |
| Cubist v2 | 0.0981522 | 0.662651 | 0.0699042 | 99.9608776 |
Comparing the second random forest fit to the second Cubist fit, the random forest shows greater error on the new test split, so Cubist is again the better model. Comparing the first and second Cubist fits, the first outperforms the second.
Based on the model results above, we have chosen the Cubist regression model as our final model. The Cubist model has the lowest RMSE and MAE. Note that the \(R^2\) values should only be used to compare models of the same type.
Having selected our best performing model, we can now use caret's varImp method to look at which features within our dataset are most important to the fitted model.
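The chunk is not shown here, but the call is simply caret's varImp on the trained Cubist object, for example:

# model-specific variable importance for the final Cubist fit
cubist_imp <- varImp(cubist)
plot(cubist_imp, top = 10)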
It appears that mnf_flow, brand_code, balling, balling_lvl, and density are key to our model's performance. We saw in our correlation plot that mnf_flow had the strongest relationship with pH. Additionally, we saw in our box plots that brand_code had a significant relationship with pH as well.
balling, balling_lvl, and density are all within the top 7 most important variables in this model, and they are strongly correlated with one another. We attempted to model after excluding these collinear variables, but that resulted in a less accurate model. It's possible that the importance of these three variables is misrepresented due to the correlation.
We'll use the Cubist regression model to predict on the provided evaluation set. Before making predictions, we'll need to perform the same data processing steps that we applied to our training dataset.
We'll need to impute missing data in the final evaluation set. To do this, we'll train another multinomial logistic regression model for brand_code, excluding pH, and then use knnImpute again for the numeric variables. We'll use the entire combined dataset to impute these values, under the assumption that we have trained our models on existing data and can leverage it to get a more accurate picture of the true population distributions.
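As a minimal sketch of the brand_code model described above (assuming nnet, and that the numeric predictors have already been imputed so the model matrix is complete):

library(nnet)
# fit on rows where brand_code is known, using all other predictors except ph
known <- combined %>%
filter(!is.na(brand_code)) %>%
mutate(brand_code = factor(brand_code)) %>%
select(-ph, -obs_type)
unknown <- combined %>%
filter(is.na(brand_code)) %>%
select(-ph, -obs_type)
brand_fit <- multinom(brand_code ~ ., data = known, trace = FALSE)
# fill in the missing brand codes with the predicted class labels
combined$brand_code[is.na(combined$brand_code)] <- as.character(predict(brand_fit, newdata = unknown))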
# get all data excluding ph
eval2 <- combined %>%
select(-ph)
# convert brand code to numeric
eval2$brand_code <- case_when(
eval2$brand_code == "A" ~ 1,
eval2$brand_code == "B" ~ 2,
eval2$brand_code == "C" ~ 3,
eval2$brand_code == "D" ~ 4,
)
# isolate evaluation data only
eval2 <- eval2 %>%
filter(obs_type == 'test') %>%
select(-obs_type)
# # remove near zero variance variables
# isNZV_2 <- nearZeroVar(eval2)
# eval2 <- eval2[, -isNZV_2]
eval2 <- eval2 %>% mutate(
brand_code = impute(brand_code, mode),
carb_volume = impute(carb_volume, mean),
fill_ounces = impute(fill_ounces, mean),
pc_volume = impute(pc_volume, mean),
#carb_pressure = impute(carb_pressure, mean),
carb_temp = impute(carb_temp, mean),
psc = impute(psc, mean),
psc_fill = impute(psc_fill, mean),
psc_co2 = impute(psc_co2, mean),
carb_pressure1 = impute(carb_pressure1, mean),
fill_pressure = impute(fill_pressure, mean),
#hyd_pressure1 = impute(hyd_pressure1, mean),
hyd_pressure2 = impute(hyd_pressure2, mean),
hyd_pressure3 = impute(hyd_pressure3, mean),
hyd_pressure4 = impute(hyd_pressure4, mean),
filler_level = impute(filler_level, mean),
filler_speed = impute(filler_speed, mean),
temperature = impute(temperature, mean),
usage_cont = impute(usage_cont, mean),
density = impute(density, mean),
#carb_flow = impute(carb_flow, mean),
mfr = impute(mfr, mean),
balling = impute(balling, mean),
pressure_vacuum = impute(pressure_vacuum, mean),
oxygen_filler = impute(oxygen_filler, mean),
bowl_setpoint = impute(bowl_setpoint, mean),
pressure_setpoint = impute(pressure_setpoint, mean),
air_pressurer = impute(air_pressurer, mean),
alch_rel = impute(alch_rel, mean),
carb_rel = impute(carb_rel, mean)
) %>%
mutate_if(is.character, as.numeric)
#sum(is.na(eval2))

Having imputed any missing values in the evaluation set, we are now ready to make our final predictions.
We make our final predictions by calling the predict method and passing in the eval2 data frame, which is our evaluation set with all the necessary imputations for missing values. Once the predictions have been made, we'll write them to a CSV file where they can be evaluated.
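The prediction chunk isn't shown in this extract; a minimal version (the output filename is illustrative) is:

# score the evaluation set with the final Cubist model and export to CSV
final_predictions <- predict(cubist, newdata = eval2)
summary(final_predictions)
readr::write_csv(data.frame(ph = final_predictions), 'cubist_ph_predictions.csv')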
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.148 8.465 8.529 8.553 8.667 8.841
## [1] 8.489836 8.407301 8.495484 8.667449 8.500791 8.581749 8.457839 8.552064
## [9] 8.524187 8.697361
Throughout this project, we imported the data, performed extensive exploratory data analysis, processed our data, and finally trained models and made predictions. Throughout this process, we determined that the Cubist regression model was the most accurate, and we used it to generate our final predictions. We did an extensive amount of experimenting with transformations, imputation methods, and feature engineering, but at the end of the day, the simplest dataset seemed to perform the best.
Given the unique nature of this dataset, we believe strong domain knowledge would be necessary to take this model to the next level. With detailed, wide datasets like this one, it is hard to create meaningful new features without strong domain knowledge. We believe feature engineering is the place to focus, and that additional modeling gains could be made by creating new features from this rich dataset.