Project 2:

This is a role-playing exercise. ABC Beverage leadership has asked us to understand the manufacturing process and its predictive factors, and to report back with a predictive model of PH. I need to create both a technical and a non-technical report to present the results. This R Markdown document is the technical part. At the end of this report, an Excel file is created with the PH predictions.

Data Exploration:

Read and overview the given training data. It has 33 columns (variables) and 2,571 rows (observations), which include some missing values. As for data types, only “Brand Code” is categorical, while the other 32 columns are numerical. The focus of the project, the PH value, has a mean of 8.55 and 4 missing values.

df_StudentData  <- read_excel("C:/Users/xmei/Desktop/train.xlsx")
df_EvalData <- read_excel("C:/Users/xmei/Desktop/test.xlsx")

summary(df_StudentData)
##   Brand Code         Carb Volume     Fill Ounces      PC Volume      
##  Length:2571        Min.   :5.040   Min.   :23.63   Min.   :0.07933  
##  Class :character   1st Qu.:5.293   1st Qu.:23.92   1st Qu.:0.23917  
##  Mode  :character   Median :5.347   Median :23.97   Median :0.27133  
##                     Mean   :5.370   Mean   :23.97   Mean   :0.27712  
##                     3rd Qu.:5.453   3rd Qu.:24.03   3rd Qu.:0.31200  
##                     Max.   :5.700   Max.   :24.32   Max.   :0.47800  
##                     NA's   :10      NA's   :38      NA's   :39       
##  Carb Pressure     Carb Temp          PSC             PSC Fill     
##  Min.   :57.00   Min.   :128.6   Min.   :0.00200   Min.   :0.0000  
##  1st Qu.:65.60   1st Qu.:138.4   1st Qu.:0.04800   1st Qu.:0.1000  
##  Median :68.20   Median :140.8   Median :0.07600   Median :0.1800  
##  Mean   :68.19   Mean   :141.1   Mean   :0.08457   Mean   :0.1954  
##  3rd Qu.:70.60   3rd Qu.:143.8   3rd Qu.:0.11200   3rd Qu.:0.2600  
##  Max.   :79.40   Max.   :154.0   Max.   :0.27000   Max.   :0.6200  
##  NA's   :27      NA's   :26      NA's   :33        NA's   :23      
##     PSC CO2           Mnf Flow       Carb Pressure1  Fill Pressure  
##  Min.   :0.00000   Min.   :-100.20   Min.   :105.6   Min.   :34.60  
##  1st Qu.:0.02000   1st Qu.:-100.00   1st Qu.:119.0   1st Qu.:46.00  
##  Median :0.04000   Median :  65.20   Median :123.2   Median :46.40  
##  Mean   :0.05641   Mean   :  24.57   Mean   :122.6   Mean   :47.92  
##  3rd Qu.:0.08000   3rd Qu.: 140.80   3rd Qu.:125.4   3rd Qu.:50.00  
##  Max.   :0.24000   Max.   : 229.40   Max.   :140.2   Max.   :60.40  
##  NA's   :39        NA's   :2         NA's   :32      NA's   :22     
##  Hyd Pressure1   Hyd Pressure2   Hyd Pressure3   Hyd Pressure4   
##  Min.   :-0.80   Min.   : 0.00   Min.   :-1.20   Min.   : 52.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 86.00  
##  Median :11.40   Median :28.60   Median :27.60   Median : 96.00  
##  Mean   :12.44   Mean   :20.96   Mean   :20.46   Mean   : 96.29  
##  3rd Qu.:20.20   3rd Qu.:34.60   3rd Qu.:33.40   3rd Qu.:102.00  
##  Max.   :58.00   Max.   :59.40   Max.   :50.00   Max.   :142.00  
##  NA's   :11      NA's   :15      NA's   :15      NA's   :30      
##   Filler Level    Filler Speed   Temperature      Usage cont      Carb Flow   
##  Min.   : 55.8   Min.   : 998   Min.   :63.60   Min.   :12.08   Min.   :  26  
##  1st Qu.: 98.3   1st Qu.:3888   1st Qu.:65.20   1st Qu.:18.36   1st Qu.:1144  
##  Median :118.4   Median :3982   Median :65.60   Median :21.79   Median :3028  
##  Mean   :109.3   Mean   :3687   Mean   :65.97   Mean   :20.99   Mean   :2468  
##  3rd Qu.:120.0   3rd Qu.:3998   3rd Qu.:66.40   3rd Qu.:23.75   3rd Qu.:3186  
##  Max.   :161.2   Max.   :4030   Max.   :76.20   Max.   :25.90   Max.   :5104  
##  NA's   :20      NA's   :57     NA's   :14      NA's   :5       NA's   :2     
##     Density           MFR           Balling       Pressure Vacuum 
##  Min.   :0.240   Min.   : 31.4   Min.   :-0.170   Min.   :-6.600  
##  1st Qu.:0.900   1st Qu.:706.3   1st Qu.: 1.496   1st Qu.:-5.600  
##  Median :0.980   Median :724.0   Median : 1.648   Median :-5.400  
##  Mean   :1.174   Mean   :704.0   Mean   : 2.198   Mean   :-5.216  
##  3rd Qu.:1.620   3rd Qu.:731.0   3rd Qu.: 3.292   3rd Qu.:-5.000  
##  Max.   :1.920   Max.   :868.6   Max.   : 4.012   Max.   :-3.600  
##  NA's   :1       NA's   :212     NA's   :1                        
##        PH        Oxygen Filler     Bowl Setpoint   Pressure Setpoint
##  Min.   :7.880   Min.   :0.00240   Min.   : 70.0   Min.   :44.00    
##  1st Qu.:8.440   1st Qu.:0.02200   1st Qu.:100.0   1st Qu.:46.00    
##  Median :8.540   Median :0.03340   Median :120.0   Median :46.00    
##  Mean   :8.546   Mean   :0.04684   Mean   :109.3   Mean   :47.62    
##  3rd Qu.:8.680   3rd Qu.:0.06000   3rd Qu.:120.0   3rd Qu.:50.00    
##  Max.   :9.360   Max.   :0.40000   Max.   :140.0   Max.   :52.00    
##  NA's   :4       NA's   :12        NA's   :2       NA's   :12       
##  Air Pressurer      Alch Rel        Carb Rel      Balling Lvl  
##  Min.   :140.8   Min.   :5.280   Min.   :4.960   Min.   :0.00  
##  1st Qu.:142.2   1st Qu.:6.540   1st Qu.:5.340   1st Qu.:1.38  
##  Median :142.6   Median :6.560   Median :5.400   Median :1.48  
##  Mean   :142.8   Mean   :6.897   Mean   :5.437   Mean   :2.05  
##  3rd Qu.:143.0   3rd Qu.:7.240   3rd Qu.:5.540   3rd Qu.:3.14  
##  Max.   :148.2   Max.   :8.620   Max.   :6.060   Max.   :3.66  
##                  NA's   :9       NA's   :10      NA's   :1
skim(df_StudentData)
Data summary
Name df_StudentData
Number of rows 2571
Number of columns 33
_______________________
Column type frequency:
character 1
numeric 32
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Brand Code 120 0.95 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Carb Volume 10 1.00 5.37 0.11 5.04 5.29 5.35 5.45 5.70 ▁▆▇▅▁
Fill Ounces 38 0.99 23.97 0.09 23.63 23.92 23.97 24.03 24.32 ▁▂▇▂▁
PC Volume 39 0.98 0.28 0.06 0.08 0.24 0.27 0.31 0.48 ▁▃▇▂▁
Carb Pressure 27 0.99 68.19 3.54 57.00 65.60 68.20 70.60 79.40 ▁▅▇▃▁
Carb Temp 26 0.99 141.09 4.04 128.60 138.40 140.80 143.80 154.00 ▁▅▇▃▁
PSC 33 0.99 0.08 0.05 0.00 0.05 0.08 0.11 0.27 ▆▇▃▁▁
PSC Fill 23 0.99 0.20 0.12 0.00 0.10 0.18 0.26 0.62 ▆▇▃▁▁
PSC CO2 39 0.98 0.06 0.04 0.00 0.02 0.04 0.08 0.24 ▇▅▂▁▁
Mnf Flow 2 1.00 24.57 119.48 -100.20 -100.00 65.20 140.80 229.40 ▇▁▁▇▂
Carb Pressure1 32 0.99 122.59 4.74 105.60 119.00 123.20 125.40 140.20 ▁▃▇▂▁
Fill Pressure 22 0.99 47.92 3.18 34.60 46.00 46.40 50.00 60.40 ▁▁▇▂▁
Hyd Pressure1 11 1.00 12.44 12.43 -0.80 0.00 11.40 20.20 58.00 ▇▅▂▁▁
Hyd Pressure2 15 0.99 20.96 16.39 0.00 0.00 28.60 34.60 59.40 ▇▂▇▅▁
Hyd Pressure3 15 0.99 20.46 15.98 -1.20 0.00 27.60 33.40 50.00 ▇▁▃▇▁
Hyd Pressure4 30 0.99 96.29 13.12 52.00 86.00 96.00 102.00 142.00 ▁▃▇▂▁
Filler Level 20 0.99 109.25 15.70 55.80 98.30 118.40 120.00 161.20 ▁▃▅▇▁
Filler Speed 57 0.98 3687.20 770.82 998.00 3888.00 3982.00 3998.00 4030.00 ▁▁▁▁▇
Temperature 14 0.99 65.97 1.38 63.60 65.20 65.60 66.40 76.20 ▇▃▁▁▁
Usage cont 5 1.00 20.99 2.98 12.08 18.36 21.79 23.75 25.90 ▁▃▅▃▇
Carb Flow 2 1.00 2468.35 1073.70 26.00 1144.00 3028.00 3186.00 5104.00 ▂▅▆▇▁
Density 1 1.00 1.17 0.38 0.24 0.90 0.98 1.62 1.92 ▁▅▇▂▆
MFR 212 0.92 704.05 73.90 31.40 706.30 724.00 731.00 868.60 ▁▁▁▂▇
Balling 1 1.00 2.20 0.93 -0.17 1.50 1.65 3.29 4.01 ▁▇▇▁▇
Pressure Vacuum 0 1.00 -5.22 0.57 -6.60 -5.60 -5.40 -5.00 -3.60 ▂▇▆▂▁
PH 4 1.00 8.55 0.17 7.88 8.44 8.54 8.68 9.36 ▁▅▇▂▁
Oxygen Filler 12 1.00 0.05 0.05 0.00 0.02 0.03 0.06 0.40 ▇▁▁▁▁
Bowl Setpoint 2 1.00 109.33 15.30 70.00 100.00 120.00 120.00 140.00 ▁▂▃▇▁
Pressure Setpoint 12 1.00 47.62 2.04 44.00 46.00 46.00 50.00 52.00 ▁▇▁▆▁
Air Pressurer 0 1.00 142.83 1.21 140.80 142.20 142.60 143.00 148.20 ▅▇▁▁▁
Alch Rel 9 1.00 6.90 0.51 5.28 6.54 6.56 7.24 8.62 ▁▇▂▃▁
Carb Rel 10 1.00 5.44 0.13 4.96 5.34 5.40 5.54 6.06 ▁▇▇▂▁
Balling Lvl 1 1.00 2.05 0.87 0.00 1.38 1.48 3.14 3.66 ▁▇▂▁▆

As part of data exploration, I used a faceted grid of histograms to visualize the distribution of all numeric variables in the dataset.

df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_histogram(bins = 15) + 
  facet_wrap(~key, scales = "free") +
  ggtitle("Histograms of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_bin()`).

To check for outliers and skewness in the variables, I first convert the data from wide to long format and use boxplots to show the numerical predictors.

df_StudentData %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scales = 'free') +
  ggtitle("Boxplots of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The “Brand Code” column is converted to a categorical factor in both the training and evaluation datasets, and a bar chart shows the frequency distribution of each brand code. Brand B has the highest frequency: 1,239 observations (48.2%).

df_StudentData %>%
  mutate(`Brand Code` = factor(`Brand Code`)) %>%
  ggplot(aes(x = fct_infreq(`Brand Code`))) +
  geom_bar(aes(fill = after_stat(count)), width = 0.8) +
  geom_text(
    stat = "count",
    aes(label = after_stat(
      paste0(format(count, big.mark = ","), "\n",
             "(", round(count/sum(count)*100, 1), "%)")
    )),
    vjust = -0.5,
    size = 3.2,
    lineheight = 0.8
  ) +
  scale_y_continuous(
    labels = scales::comma_format(),
    expand = expansion(mult = c(0, 0.12))
  ) +
  scale_fill_gradient(low = "#E3F2FD", high = "#1565C0", guide = "none") +
  labs(
    title = "Brand Code Distribution",
    x = "Brand Code",
    y = "Count"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    panel.grid.major.y = element_line(color = "gray90")
  )

Missing Data

Thirty-one of the 33 variables have missing values. MFR has the most, missing for roughly 8.25% of the rows, and Brand Code has 120 missing values. Since Brand Code is categorical, it needs to be handled differently when dealing with NA values.

df_StudentData %>%
  summarise_all(list(~ sum(is.na(.)))) %>%
  gather(variable, value) %>%
  filter(value != 0) %>%
  arrange(-value) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box(width = "100%", height = "200px")
variable value
MFR 212
Brand Code 120
Filler Speed 57
PC Volume 39
PSC CO2 39
Fill Ounces 38
PSC 33
Carb Pressure1 32
Hyd Pressure4 30
Carb Pressure 27
Carb Temp 26
PSC Fill 23
Fill Pressure 22
Filler Level 20
Hyd Pressure2 15
Hyd Pressure3 15
Temperature 14
Oxygen Filler 12
Pressure Setpoint 12
Hyd Pressure1 11
Carb Volume 10
Carb Rel 10
Alch Rel 9
Usage cont 5
PH 4
Mnf Flow 2
Carb Flow 2
Bowl Setpoint 2
Density 1
Balling 1
Balling Lvl 1
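
To relate these counts to the 2,571 rows (for example, the roughly 8.25% noted for MFR), the per-column missing rate can be computed directly. A minimal sketch in base R; output is omitted here:

# Percentage of missing values per column, sorted (only columns with any missing)
missing_pct <- sort(colMeans(is.na(df_StudentData)) * 100, decreasing = TRUE)
round(missing_pct[missing_pct > 0], 2)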

Transforming the Data

We will transform the data by imputing missing values using the MICE method with predictive mean matching (PMM) for the numeric predictors and a categorical model for Brand Code. Predictor variables with near-zero variance are likely not useful for modeling and are removed; Hyd Pressure1 was the variable dropped here.

set.seed(100)

# Convert Brand Code to factor
df_StudentData$`Brand Code` <- as.factor(df_StudentData$`Brand Code`)


init <- mice(df_StudentData, maxit=0, print=FALSE)
meth <- init$method
predM <- init$predictorMatrix

# Change the method for Brand Code
n_levels <- length(levels(df_StudentData$`Brand Code`))
if(n_levels > 2){
  meth["Brand Code"] <- "polyreg"
} else {
  meth["Brand Code"] <- "logreg"
}


imputed <- mice(df_StudentData, method=meth, predictorMatrix=predM, m=1, print=FALSE)
df_StudentData <- complete(imputed)

# Remove low variance columns from the numeric columns
numeric_col_names <- names(df_StudentData)[sapply(df_StudentData, is.numeric)]
low_var_cols <- nearZeroVar(df_StudentData[, numeric_col_names, drop=FALSE], names=TRUE)
if(length(low_var_cols) > 0){
  df_StudentData <- df_StudentData[, !colnames(df_StudentData) %in% low_var_cols]
}
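# Optional check: list the columns flagged by nearZeroVar
# (per the note above, "Hyd Pressure1" in this run)
low_var_cols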

There are no more missing values and the transformations are complete for now.

df_StudentData %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values AFTER Imputation",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "blue"))

Model Building

To build the models, the data will have an 80-20 train-test split. Different types of regression models, including linear regression, non-linear regression, and regression trees, will be applied and compared to choose the best-fitting one.

library(caret)
library(dplyr)

set.seed(100)

# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)

train_x <- df_StudentData[index, ] %>% dplyr::select(-PH)
train_y <- df_StudentData[index, 'PH']

# test
test_x <- df_StudentData[-index, ] %>% dplyr::select(-PH)
test_y <- df_StudentData[-index, 'PH']
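
As a quick sanity check that the split is roughly 80-20, the row counts of the two partitions can be compared; a minimal sketch, output omitted:

# Training vs. test rows (~80% vs. ~20% of 2,571)
nrow(train_x)
nrow(test_x)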

LM has RMSE 0.1412367 and R2 0.3808389.

set.seed(100)

# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)

lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl) 

lmPred <- predict(lmModel, test_x)

postResample(lmPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1412367 0.3808389 0.1074174

PLS has RMSE 0.1414071 and R2 0.3793440.

set.seed(100)

plsTune <- train(train_x, train_y, 
                 method = "pls", 
                 tuneLength = 20, trControl = ctrl,
                 preProc = c("center", "scale"))

plsPred <- predict(plsTune, test_x)

postResample(plsPred, test_y)
##      RMSE  Rsquared       MAE 
## 0.1414071 0.3793440 0.1076608

MARS has RMSE 0.12713823 and R2 0.49903611.

marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)

set.seed(100)

# tune
marsTune <- train(train_x, train_y,
                  method = "earth",
                  tuneGrid = marsGrid,
                  trControl = trainControl(method = "cv"))

marsPred <- predict(marsTune, test_x)

postResample(marsPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.12713823 0.49903611 0.09585218

Boosted Trees has RMSE 0.10801194 and R2 0.63814896.

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
set.seed(100)

gbmTune <- train(train_x, train_y,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 verbose = FALSE)

gbmPred <- predict(gbmTune, test_x)

postResample(gbmPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.10801194 0.63814896 0.07980758

Random Forest has RMSE 0.10632992 and R2 0.66740004.

set.seed(100)

rfModel <- randomForest(train_x, train_y, 
                        importance = TRUE,
                        ntree = 1000)


rfPred <- predict(rfModel, test_x)

postResample(rfPred, test_y) 
##       RMSE   Rsquared        MAE 
## 0.10632992 0.66740004 0.07530304

Cubist has RMSE 0.10574881 and R2 0.65511959.

set.seed(100)
cubistTuned <- train(train_x, train_y, 
                     method = "cubist")

cubistPred <- predict(cubistTuned, test_x)

postResample(cubistPred, test_y)
##       RMSE   Rsquared        MAE 
## 0.10574881 0.65511959 0.07396093

Based on the results, Cubist has the lowest RMSE, very close to random forest, while the latter has the highest R2. Deciding which model to use is an interesting call since the two are very comparable; I’ll go with random forest this time.

rbind(lm = postResample(lmPred, test_y),
      pls = postResample(plsPred, test_y),
      mars = postResample(marsPred, test_y),
      randomForest = postResample(rfPred, test_y),
      boosted = postResample(gbmPred, test_y),
      cubist = postResample(cubistPred, test_y))
##                   RMSE  Rsquared        MAE
## lm           0.1412367 0.3808389 0.10741741
## pls          0.1414071 0.3793440 0.10766083
## mars         0.1271382 0.4990361 0.09585218
## randomForest 0.1063299 0.6674000 0.07530304
## boosted      0.1080119 0.6381490 0.07980758
## cubist       0.1057488 0.6551196 0.07396093

Model Evaluation

Random forest was chosen as the best model. Let’s find out which features most strongly predict the target variable. Here I rank the predictors by their importance scores, where higher values indicate greater importance. Mnf Flow has the highest importance score, closely followed by Brand Code.

rfImp <- varImp(rfModel, scale = TRUE) %>%
  as.data.frame()

rfImp %>%
  arrange(-Overall) %>%
  kable() %>% 
  kable_styling() %>%
  scroll_box()
Overall
Mnf Flow 58.2376108
Brand Code 58.0510412
Oxygen Filler 50.6316460
Pressure Vacuum 49.6745261
Usage cont 41.9325680
Temperature 41.1480679
Air Pressurer 39.3431540
Carb Rel 35.7792701
Balling Lvl 34.4357966
Balling 33.0238037
Filler Speed 31.9305502
Alch Rel 31.3900365
Carb Flow 31.3729521
Carb Pressure1 28.5357607
Bowl Setpoint 26.9613169
Density 26.3642820
Filler Level 25.8557828
Hyd Pressure3 23.4760243
Carb Volume 22.4025854
MFR 21.3257259
Hyd Pressure2 19.0875515
PC Volume 18.9377110
Hyd Pressure4 18.6489186
Fill Pressure 18.5035162
Pressure Setpoint 16.8148359
Fill Ounces 6.0230537
Carb Pressure 5.4110057
PSC Fill 0.7716583
PSC CO2 0.5487036
Carb Temp -0.2380696
PSC -0.7612532

Another visual shows the features that most strongly predict the target variable in the random forest model.

imp_data <- importance(rfModel) %>%
  as.data.frame() %>%
  mutate(variable = rownames(.)) %>%
  top_n(10, IncNodePurity) 

# Convert to long format
imp_long <- imp_data %>%
  pivot_longer(cols = c(`%IncMSE`, "IncNodePurity"),
               names_to = "importance_type",
               values_to = "value") %>%
  mutate(
    importance_type = factor(importance_type,
                             levels = c("%IncMSE", "IncNodePurity"),
                             labels = c("Increase in MSE %", "Node Purity"))
  )

# Create grouped importance plot
ggplot(imp_long, aes(x = value, y = reorder(variable, value), 
                     fill = importance_type)) +
  geom_col(position = "dodge", width = 0.7, alpha = 0.9) +
  geom_text(aes(label = round(value, 1)),
            position = position_dodge(width = 0.7),
            hjust = -0.1, size = 3.2) +
  scale_fill_manual(
    values = c("Increase in MSE %" = "#3498db", "Node Purity" = "#e74c3c"),
    name = "Importance Metric"
  ) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Feature Importance",
    x = "Importance Score",
    y = "Feature"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    legend.title = element_text(face = "bold"),
    axis.text.y = element_text(face = "bold")
  ) +
  facet_wrap(~importance_type, scales = "free_x")

Forecasting: apply the model to the evaluation data to get the predictions. First, the evaluation data has to be transformed by imputing missing values and removing Hyd Pressure1 to match the process used on the training data, ensuring feature consistency.

set.seed(100)

#ensure Brand Code is factor (same as training data)
df_EvalData$`Brand Code` <- as.factor(df_EvalData$`Brand Code`)

if(!is.null(df_StudentData$`Brand Code`)) {

  df_EvalData$`Brand Code` <- factor(
    df_EvalData$`Brand Code`,
    levels = levels(df_StudentData$`Brand Code`)
  )
}

# Identify columns that exist in both datasets (excluding PH from training)
training_cols <- names(train_x)
eval_cols <- names(df_EvalData)

# Ensure df_EvalData has all columns needed for prediction
missing_cols <- setdiff(training_cols, eval_cols)
if(length(missing_cols) > 0) {
  for(col in missing_cols) {
    df_EvalData[[col]] <- NA
  }
}

# Reorder columns to match training data
df_EvalData <- df_EvalData[, training_cols]

init_eval <- mice(df_EvalData, maxit = 0, print = FALSE)
meth_eval <- init_eval$method
predM_eval <- init_eval$predictorMatrix

# Set method for Brand Code (same as training)
if("Brand Code" %in% names(df_EvalData)) {
  n_levels_eval <- length(unique(na.omit(df_EvalData$`Brand Code`)))
  if(n_levels_eval > 2) {
    meth_eval["Brand Code"] <- "polyreg"
  } else {
    meth_eval["Brand Code"] <- "logreg"
  }
}

# Impute evaluation data
imputed_eval <- mice(df_EvalData, 
                     method = meth_eval, 
                     predictorMatrix = predM_eval, 
                     m = 1, 
                     print = FALSE)
df_EvalData_imputed <- complete(imputed_eval)

# Now predict PH
prediction <- predict(rfModel, df_EvalData_imputed)

head(prediction)
##        1        2        3        4        5        6 
## 8.539210 8.468090 8.532999 8.599674 8.502400 8.558538
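
Beyond the first six values, the spread of the predicted PH can be compared against the training range as a quick plausibility check; a minimal sketch, output omitted:

# Distribution of predicted PH vs. the range of observed PH in training
summary(prediction)
range(train_y)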

Average predicted PH for the evaluation data.

df_EvalData$PH <- prediction

# average predicted PH across the evaluation data
df_EvalData %>%
  summarise(`Average PH` = mean(PH))
## # A tibble: 1 × 1
##   `Average PH`
##          <dbl>
## 1         8.55

Save the results in an Excel file ready to present to leadership.

library(openxlsx)
df_EvalData$PH <- prediction
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData), 
           file = "C:/Users/xmei/Desktop/predictions_DJO.xlsx")

Conclusion: the biggest lesson I learned through this project was the handling of missing values. Because Brand Code is categorical, it is different from the other predictors and needs to be handled separately. The MARS model kept giving me issues until I realized this. A very interesting project to work on.