Project2:
This is a role-playing exercise. ABC Beverage leadership has asked us to understand the manufacturing process and its predictive factors, and to report back with a predictive model of PH. I need to create both a technical and a non-technical report to present the results; this R Markdown document is the technical part. At the end of this report, an Excel file with the pH predictions is produced.
Data Exploration:
Read and overview the given training data. It has 33 columns (variables) and 2,571 rows (observations), including some missing values. As for data types, only "Brand Code" is categorical; the other 32 columns are numerical. PH, the focus of this project, has a mean of 8.55 and 4 missing values.
library(readxl)     # read_excel()
library(tidyverse)  # dplyr, tidyr, purrr, forcats, ggplot2 used throughout
df_StudentData <- read_excel("C:/Users/xmei/Desktop/train.xlsx")
df_EvalData <- read_excel("C:/Users/xmei/Desktop/test.xlsx")
summary(df_StudentData)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
library(skimr)  # skim() for a richer data summary
skim(df_StudentData)
| Name | df_StudentData |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
As part of data exploration, I use a faceted grid of histograms to visualize the distribution of all numeric variables in the dataset.
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = "free") +
ggtitle("Histograms of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_bin()`).
To check for outliers and skewness in the variables, I first convert the data from wide to long format and use boxplots to show the numerical predictors.
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
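The boxplots can be complemented with numeric skewness statistics. A minimal sketch using the e1071 package (an assumption; this package is not used elsewhere in the report), ranking the predictors by absolute skewness:

library(e1071)  # skewness()
df_StudentData %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))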
The "Brand Code" column is converted to a categorical factor in both the training and evaluation datasets, and a bar chart shows the frequency distribution of each brand code. Brand B has the highest frequency: 1,239 observations (48.2%).
df_StudentData %>%
mutate(`Brand Code` = factor(`Brand Code`)) %>%
ggplot(aes(x = fct_infreq(`Brand Code`))) +
geom_bar(aes(fill = after_stat(count)), width = 0.8) +
geom_text(
stat = "count",
aes(label = after_stat(
paste0(format(count, big.mark = ","), "\n",
"(", round(count/sum(count)*100, 1), "%)")
)),
vjust = -0.5,
size = 3.2,
lineheight = 0.8
) +
scale_y_continuous(
labels = scales::comma_format(),
expand = expansion(mult = c(0, 0.12))
) +
scale_fill_gradient(low = "#E3F2FD", high = "#1565C0", guide = "none") +
labs(
title = "Brand Code Distribution",
x = "Brand Code",
y = "Count"
) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
panel.grid.major.y = element_line(color = "gray90")
)
Missing Data
31 of the variables have missing values. MFR is missing for roughly 8.25% of the observations, and Brand Code has 120 missing values. Since Brand Code is categorical, it needs to be handled differently when dealing with NA values.
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
df_StudentData %>%
summarise_all(list(~ sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0) %>%
arrange(-value) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px")
| variable | value |
|---|---|
| MFR | 212 |
| Brand Code | 120 |
| Filler Speed | 57 |
| PC Volume | 39 |
| PSC CO2 | 39 |
| Fill Ounces | 38 |
| PSC | 33 |
| Carb Pressure1 | 32 |
| Hyd Pressure4 | 30 |
| Carb Pressure | 27 |
| Carb Temp | 26 |
| PSC Fill | 23 |
| Fill Pressure | 22 |
| Filler Level | 20 |
| Hyd Pressure2 | 15 |
| Hyd Pressure3 | 15 |
| Temperature | 14 |
| Oxygen Filler | 12 |
| Pressure Setpoint | 12 |
| Hyd Pressure1 | 11 |
| Carb Volume | 10 |
| Carb Rel | 10 |
| Alch Rel | 9 |
| Usage cont | 5 |
| PH | 4 |
| Mnf Flow | 2 |
| Carb Flow | 2 |
| Bowl Setpoint | 2 |
| Density | 1 |
| Balling | 1 |
| Balling Lvl | 1 |
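Before imputing, the joint missingness pattern can be inspected with mice's md.pattern. A minimal sketch (mice itself is loaded in the next chunk; this diagnostic was not part of the original analysis):

# Each row of the pattern grid is a distinct missingness pattern;
# the final column counts how many variables are missing in that pattern.
mice::md.pattern(df_StudentData, rotate.names = TRUE)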
Transforming the Data
We will transform the data by imputing missing values using the MICE method with predictive mean matching (PMM). Predictors with near-zero variance are unlikely to be useful for modeling; on that basis, Hyd Pressure1 was removed.
set.seed(100)
library(mice)  # multivariate imputation by chained equations
# Convert Brand Code to factor so it can be imputed as a categorical variable
df_StudentData$`Brand Code` <- as.factor(df_StudentData$`Brand Code`)
init <- mice(df_StudentData, maxit=0, print=FALSE)
meth <- init$method
predM <- init$predictorMatrix
# Change the method for Brand Code
n_levels <- length(levels(df_StudentData$`Brand Code`))
if(n_levels > 2){
meth["Brand Code"] <- "polyreg"
} else {
meth["Brand Code"] <- "logreg"
}
imputed <- mice(df_StudentData, method=meth, predictorMatrix=predM, m=1, print=FALSE)
df_StudentData <- complete(imputed)
# Remove low variance columns from the numeric columns
numeric_col_names <- names(df_StudentData)[sapply(df_StudentData, is.numeric)]
low_var_cols <- nearZeroVar(df_StudentData[, numeric_col_names, drop=FALSE], names=TRUE)
if(length(low_var_cols) > 0){
df_StudentData <- df_StudentData[, !colnames(df_StudentData) %in% low_var_cols]
}
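As a sanity check on the imputations, mice can plot imputed against observed distributions. A minimal sketch for MFR, the predictor with the most missing values (densityplot here is mice's lattice-based method for mids objects; this diagnostic was not part of the original analysis):

# Blue: observed MFR values; red: imputed values
densityplot(imputed, ~ MFR)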
There are no more missing values and the transformations are complete for now.
df_StudentData %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values AFTER Imputation",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "blue"))
Model Building
To build the models, the data is split 80-20 into training and test sets. Several types of regression models, including linear regression, non-linear regression, and regression trees, will be fit and compared to choose the best-performing one.
library(caret)
library(dplyr)
set.seed(100)
# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)
train_x <- df_StudentData[index, ] %>% dplyr::select(-PH)
train_y <- df_StudentData[index, 'PH']
# test
test_x <- df_StudentData[-index, ] %>% dplyr::select(-PH)
test_y <- df_StudentData[-index, 'PH']
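Throughout, each model is scored on the held-out test set with caret's postResample, which reports RMSE, R2 (computed by caret as the squared Pearson correlation between predictions and observations), and MAE. For predictions $\hat{y}_i$ and observations $y_i$ over the $n$ test rows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = \operatorname{cor}(y, \hat{y})^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$$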
LM has RMSE 0.1412367 and R2 0.3808389.
set.seed(100)
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl)
lmPred <- predict(lmModel, test_x)
postResample(lmPred, test_y)
## RMSE Rsquared MAE
## 0.1412367 0.3808389 0.1074174
PLS has RMSE 0.1414071 and R2 0.3793440.
set.seed(100)
plsTune <- train(train_x, train_y,
method = "pls",
tuneLength = 20, trControl = ctrl,
preProc = c("center", "scale"))
plsPred <- predict(plsTune, test_x)
postResample(plsPred, test_y)
## RMSE Rsquared MAE
## 0.1414071 0.3793440 0.1076608
MARS has RMSE 0.12713823 and R2 0.49903611.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y)
## RMSE Rsquared MAE
## 0.12713823 0.49903611 0.09585218
Boosted Trees has RMSE 0.10801194 and R2 0.63814896.
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 0.10801194 0.63814896 0.07980758
Random Forest has RMSE 0.10632992 and R2 0.66740004.
set.seed(100)
library(randomForest)
rfModel <- randomForest(train_x, train_y,
                        importance = TRUE,
                        ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 0.10632992 0.66740004 0.07530304
Cubist has RMSE 0.10574881 and R2 0.65511959.
set.seed(100)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.10574881 0.65511959 0.07396093
Based on the results, Cubist has the lowest RMSE, very close to random forest, while the latter has the highest R2. The choice between them is a judgment call since the two are very comparable; I'll go with random forest this time.
rbind(lm = postResample(lmPred, test_y),
pls = postResample(plsPred, test_y),
mars = postResample(marsPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## lm 0.1412367 0.3808389 0.10741741
## pls 0.1414071 0.3793440 0.10766083
## mars 0.1271382 0.4990361 0.09585218
## randomForest 0.1063299 0.6674000 0.07530304
## boosted 0.1080119 0.6381490 0.07980758
## cubist 0.1057488 0.6551196 0.07396093
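The same comparison can be shown as a quick visual; a sketch reusing the postResample results above (lower RMSE is better):

# Collect the test-set metrics into a data frame and plot RMSE per model
model_results <- rbind(lm = postResample(lmPred, test_y),
                       pls = postResample(plsPred, test_y),
                       mars = postResample(marsPred, test_y),
                       randomForest = postResample(rfPred, test_y),
                       boosted = postResample(gbmPred, test_y),
                       cubist = postResample(cubistPred, test_y)) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("model")

ggplot(model_results, aes(x = RMSE, y = reorder(model, -RMSE))) +
  geom_col(fill = "#1565C0") +
  labs(title = "Test-set RMSE by Model", x = "RMSE", y = "Model")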
Model Evaluation
Random Forest was chosen as the best model. Let's find out which features most strongly predict the target variable. Here I rank the predictors by their scaled importance scores, where higher numbers mean greater importance. Mnf Flow has the highest importance score, closely followed by Brand Code.
rfImp <- varImp(rfModel, scale = TRUE) %>%
as.data.frame()
rfImp %>%
arrange(-Overall) %>%
kable() %>%
kable_styling() %>%
scroll_box()
| Overall | |
|---|---|
| Mnf Flow | 58.2376108 |
| Brand Code | 58.0510412 |
| Oxygen Filler | 50.6316460 |
| Pressure Vacuum | 49.6745261 |
| Usage cont | 41.9325680 |
| Temperature | 41.1480679 |
| Air Pressurer | 39.3431540 |
| Carb Rel | 35.7792701 |
| Balling Lvl | 34.4357966 |
| Balling | 33.0238037 |
| Filler Speed | 31.9305502 |
| Alch Rel | 31.3900365 |
| Carb Flow | 31.3729521 |
| Carb Pressure1 | 28.5357607 |
| Bowl Setpoint | 26.9613169 |
| Density | 26.3642820 |
| Filler Level | 25.8557828 |
| Hyd Pressure3 | 23.4760243 |
| Carb Volume | 22.4025854 |
| MFR | 21.3257259 |
| Hyd Pressure2 | 19.0875515 |
| PC Volume | 18.9377110 |
| Hyd Pressure4 | 18.6489186 |
| Fill Pressure | 18.5035162 |
| Pressure Setpoint | 16.8148359 |
| Fill Ounces | 6.0230537 |
| Carb Pressure | 5.4110057 |
| PSC Fill | 0.7716583 |
| PSC CO2 | 0.5487036 |
| Carb Temp | -0.2380696 |
| PSC | -0.7612532 |
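To see the direction of the relationship for the top-ranked predictor, a partial dependence plot can help. A minimal sketch using randomForest::partialPlot (this plot was not part of the original analysis; the column name containing a space is passed as a string):

# Marginal effect of Mnf Flow on predicted PH, averaged over the
# remaining predictors in the training data
partialPlot(rfModel,
            pred.data = as.data.frame(train_x),
            x.var = "Mnf Flow",
            main = "Partial Dependence of PH on Mnf Flow")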
Another visual showing the features that most strongly predict the target variable in the random forest model, comparing the two importance metrics (%IncMSE and node purity) for the top 10 predictors.
imp_data <- importance(rfModel) %>%
as.data.frame() %>%
mutate(variable = rownames(.)) %>%
top_n(10, IncNodePurity)
# Convert to long format
imp_long <- imp_data %>%
pivot_longer(cols = c(`%IncMSE`, "IncNodePurity"),
names_to = "importance_type",
values_to = "value") %>%
mutate(
importance_type = factor(importance_type,
levels = c("%IncMSE", "IncNodePurity"),
labels = c("Increase in MSE %", "Node Purity"))
)
# Create grouped importance plot
ggplot(imp_long, aes(x = value, y = reorder(variable, value),
fill = importance_type)) +
geom_col(position = "dodge", width = 0.7, alpha = 0.9) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.7),
hjust = -0.1, size = 3.2) +
scale_fill_manual(
values = c("Increase in MSE %" = "#3498db", "Node Purity" = "#e74c3c"),
name = "Importance Metric"
) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Feature Importance",
x = "Importance Score",
y = "Feature"
) +
theme_minimal() +
theme(
legend.position = "top",
legend.title = element_text(face = "bold"),
axis.text.y = element_text(face = "bold")
) +
facet_wrap(~importance_type, scales = "free_x")
Forecasting: apply the model to the evaluation data to get the predictions. First, the evaluation data has to be transformed the same way as the training data, imputing missing values and removing Hyd Pressure1, to ensure feature consistency.
set.seed(100)
#ensure Brand Code is factor (same as training data)
df_EvalData$`Brand Code` <- as.factor(df_EvalData$`Brand Code`)
if(!is.null(df_StudentData$`Brand Code`)) {
df_EvalData$`Brand Code` <- factor(
df_EvalData$`Brand Code`,
levels = levels(df_StudentData$`Brand Code`)
)
}
# Identify columns that exist in both datasets (excluding PH from training)
training_cols <- names(train_x)
eval_cols <- names(df_EvalData)
# Ensure df_EvalData has all columns needed for prediction
missing_cols <- setdiff(training_cols, eval_cols)
if(length(missing_cols) > 0) {
for(col in missing_cols) {
df_EvalData[[col]] <- NA
}
}
# Reorder columns to match training data
df_EvalData <- df_EvalData[, training_cols]
init_eval <- mice(df_EvalData, maxit = 0, print = FALSE)
meth_eval <- init_eval$method
predM_eval <- init_eval$predictorMatrix
# Set method for Brand Code (same as training)
if("Brand Code" %in% names(df_EvalData)) {
n_levels_eval <- length(unique(na.omit(df_EvalData$`Brand Code`)))
if(n_levels_eval > 2) {
meth_eval["Brand Code"] <- "polyreg"
} else {
meth_eval["Brand Code"] <- "logreg"
}
}
# Impute evaluation data
imputed_eval <- mice(df_EvalData,
method = meth_eval,
predictorMatrix = predM_eval,
m = 1,
print = FALSE)
df_EvalData_imputed <- complete(imputed_eval)
# Now predict PH
prediction <- predict(rfModel, df_EvalData_imputed)
head(prediction)
## 1 2 3 4 5 6
## 8.539210 8.468090 8.532999 8.599674 8.502400 8.558538
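As a quick sanity check (a sketch; output omitted), the predicted values should fall within the PH range seen in training, roughly 7.88 to 9.36:

# Five-number summary of the predicted PH values
summary(prediction)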
Average predicted PH for the evaluation data.
df_EvalData$PH <- prediction
# Overall average of the predicted PH values
df_EvalData %>%
  summarise(`Average PH` = mean(PH))
## # A tibble: 1 × 1
##   `Average PH`
##          <dbl>
## 1         8.55
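For a per-brand breakdown, the grouping column must be referenced with backticks rather than quotes, so that group_by sees the `Brand Code` column instead of a string literal. A minimal sketch (output omitted, as it was not produced in the original run):

# Average predicted PH per brand; note the backticks around Brand Code
df_EvalData %>%
  group_by(`Brand Code`) %>%
  summarise(`Average PH` = mean(PH))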
Save the results in Excel form, ready to present to leadership.
library(openxlsx)
# Write both the raw predictions and the completed evaluation data
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData),
           file = "C:/Users/xmei/Desktop/predictions_DJO.xlsx")
Conclusion: the biggest lesson I learned from this project was the handling of missing values. Brand Code, being categorical, differed from the other predictors and needed to be handled separately; the MARS model kept throwing errors until I realized this. A very interesting project to work on.