Project2:
This is a role-playing exercise. ABC Beverage leadership has asked us to understand the manufacturing process and its predictive factors, and to report back with a predictive model of PH. I need to create both a technical and a non-technical report to present the results; this R Markdown document is the technical part. At the end of this report, an Excel file with the pH predictions is produced.
Data Exploration:
Read and overview the given training data. It has 33 columns (variables) and 2,571 rows (observations), including some missing values. As for data types, only "Brand Code" is categorical; the other 32 columns are numerical. PH, the focus of this project, has a mean of 8.55 and 4 missing values.
library(readxl)     # read_excel()
library(tidyverse)  # dplyr, tidyr, purrr, forcats, ggplot2 used throughout
df_StudentData <- read_excel("C:/Users/xmei/Desktop/train.xlsx")
df_EvalData <- read_excel("C:/Users/xmei/Desktop/test.xlsx")
summary(df_StudentData)
## Brand Code Carb Volume Fill Ounces PC Volume
## Length:2571 Min. :5.040 Min. :23.63 Min. :0.07933
## Class :character 1st Qu.:5.293 1st Qu.:23.92 1st Qu.:0.23917
## Mode :character Median :5.347 Median :23.97 Median :0.27133
## Mean :5.370 Mean :23.97 Mean :0.27712
## 3rd Qu.:5.453 3rd Qu.:24.03 3rd Qu.:0.31200
## Max. :5.700 Max. :24.32 Max. :0.47800
## NA's :10 NA's :38 NA's :39
## Carb Pressure Carb Temp PSC PSC Fill
## Min. :57.00 Min. :128.6 Min. :0.00200 Min. :0.0000
## 1st Qu.:65.60 1st Qu.:138.4 1st Qu.:0.04800 1st Qu.:0.1000
## Median :68.20 Median :140.8 Median :0.07600 Median :0.1800
## Mean :68.19 Mean :141.1 Mean :0.08457 Mean :0.1954
## 3rd Qu.:70.60 3rd Qu.:143.8 3rd Qu.:0.11200 3rd Qu.:0.2600
## Max. :79.40 Max. :154.0 Max. :0.27000 Max. :0.6200
## NA's :27 NA's :26 NA's :33 NA's :23
## PSC CO2 Mnf Flow Carb Pressure1 Fill Pressure
## Min. :0.00000 Min. :-100.20 Min. :105.6 Min. :34.60
## 1st Qu.:0.02000 1st Qu.:-100.00 1st Qu.:119.0 1st Qu.:46.00
## Median :0.04000 Median : 65.20 Median :123.2 Median :46.40
## Mean :0.05641 Mean : 24.57 Mean :122.6 Mean :47.92
## 3rd Qu.:0.08000 3rd Qu.: 140.80 3rd Qu.:125.4 3rd Qu.:50.00
## Max. :0.24000 Max. : 229.40 Max. :140.2 Max. :60.40
## NA's :39 NA's :2 NA's :32 NA's :22
## Hyd Pressure1 Hyd Pressure2 Hyd Pressure3 Hyd Pressure4
## Min. :-0.80 Min. : 0.00 Min. :-1.20 Min. : 52.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 86.00
## Median :11.40 Median :28.60 Median :27.60 Median : 96.00
## Mean :12.44 Mean :20.96 Mean :20.46 Mean : 96.29
## 3rd Qu.:20.20 3rd Qu.:34.60 3rd Qu.:33.40 3rd Qu.:102.00
## Max. :58.00 Max. :59.40 Max. :50.00 Max. :142.00
## NA's :11 NA's :15 NA's :15 NA's :30
## Filler Level Filler Speed Temperature Usage cont Carb Flow
## Min. : 55.8 Min. : 998 Min. :63.60 Min. :12.08 Min. : 26
## 1st Qu.: 98.3 1st Qu.:3888 1st Qu.:65.20 1st Qu.:18.36 1st Qu.:1144
## Median :118.4 Median :3982 Median :65.60 Median :21.79 Median :3028
## Mean :109.3 Mean :3687 Mean :65.97 Mean :20.99 Mean :2468
## 3rd Qu.:120.0 3rd Qu.:3998 3rd Qu.:66.40 3rd Qu.:23.75 3rd Qu.:3186
## Max. :161.2 Max. :4030 Max. :76.20 Max. :25.90 Max. :5104
## NA's :20 NA's :57 NA's :14 NA's :5 NA's :2
## Density MFR Balling Pressure Vacuum
## Min. :0.240 Min. : 31.4 Min. :-0.170 Min. :-6.600
## 1st Qu.:0.900 1st Qu.:706.3 1st Qu.: 1.496 1st Qu.:-5.600
## Median :0.980 Median :724.0 Median : 1.648 Median :-5.400
## Mean :1.174 Mean :704.0 Mean : 2.198 Mean :-5.216
## 3rd Qu.:1.620 3rd Qu.:731.0 3rd Qu.: 3.292 3rd Qu.:-5.000
## Max. :1.920 Max. :868.6 Max. : 4.012 Max. :-3.600
## NA's :1 NA's :212 NA's :1
## PH Oxygen Filler Bowl Setpoint Pressure Setpoint
## Min. :7.880 Min. :0.00240 Min. : 70.0 Min. :44.00
## 1st Qu.:8.440 1st Qu.:0.02200 1st Qu.:100.0 1st Qu.:46.00
## Median :8.540 Median :0.03340 Median :120.0 Median :46.00
## Mean :8.546 Mean :0.04684 Mean :109.3 Mean :47.62
## 3rd Qu.:8.680 3rd Qu.:0.06000 3rd Qu.:120.0 3rd Qu.:50.00
## Max. :9.360 Max. :0.40000 Max. :140.0 Max. :52.00
## NA's :4 NA's :12 NA's :2 NA's :12
## Air Pressurer Alch Rel Carb Rel Balling Lvl
## Min. :140.8 Min. :5.280 Min. :4.960 Min. :0.00
## 1st Qu.:142.2 1st Qu.:6.540 1st Qu.:5.340 1st Qu.:1.38
## Median :142.6 Median :6.560 Median :5.400 Median :1.48
## Mean :142.8 Mean :6.897 Mean :5.437 Mean :2.05
## 3rd Qu.:143.0 3rd Qu.:7.240 3rd Qu.:5.540 3rd Qu.:3.14
## Max. :148.2 Max. :8.620 Max. :6.060 Max. :3.66
## NA's :9 NA's :10 NA's :1
library(skimr)  # skim() for a richer data summary
skim(df_StudentData)
| Name | df_StudentData |
| Number of rows | 2571 |
| Number of columns | 33 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 32 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Brand Code | 120 | 0.95 | 1 | 1 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Carb Volume | 10 | 1.00 | 5.37 | 0.11 | 5.04 | 5.29 | 5.35 | 5.45 | 5.70 | ▁▆▇▅▁ |
| Fill Ounces | 38 | 0.99 | 23.97 | 0.09 | 23.63 | 23.92 | 23.97 | 24.03 | 24.32 | ▁▂▇▂▁ |
| PC Volume | 39 | 0.98 | 0.28 | 0.06 | 0.08 | 0.24 | 0.27 | 0.31 | 0.48 | ▁▃▇▂▁ |
| Carb Pressure | 27 | 0.99 | 68.19 | 3.54 | 57.00 | 65.60 | 68.20 | 70.60 | 79.40 | ▁▅▇▃▁ |
| Carb Temp | 26 | 0.99 | 141.09 | 4.04 | 128.60 | 138.40 | 140.80 | 143.80 | 154.00 | ▁▅▇▃▁ |
| PSC | 33 | 0.99 | 0.08 | 0.05 | 0.00 | 0.05 | 0.08 | 0.11 | 0.27 | ▆▇▃▁▁ |
| PSC Fill | 23 | 0.99 | 0.20 | 0.12 | 0.00 | 0.10 | 0.18 | 0.26 | 0.62 | ▆▇▃▁▁ |
| PSC CO2 | 39 | 0.98 | 0.06 | 0.04 | 0.00 | 0.02 | 0.04 | 0.08 | 0.24 | ▇▅▂▁▁ |
| Mnf Flow | 2 | 1.00 | 24.57 | 119.48 | -100.20 | -100.00 | 65.20 | 140.80 | 229.40 | ▇▁▁▇▂ |
| Carb Pressure1 | 32 | 0.99 | 122.59 | 4.74 | 105.60 | 119.00 | 123.20 | 125.40 | 140.20 | ▁▃▇▂▁ |
| Fill Pressure | 22 | 0.99 | 47.92 | 3.18 | 34.60 | 46.00 | 46.40 | 50.00 | 60.40 | ▁▁▇▂▁ |
| Hyd Pressure1 | 11 | 1.00 | 12.44 | 12.43 | -0.80 | 0.00 | 11.40 | 20.20 | 58.00 | ▇▅▂▁▁ |
| Hyd Pressure2 | 15 | 0.99 | 20.96 | 16.39 | 0.00 | 0.00 | 28.60 | 34.60 | 59.40 | ▇▂▇▅▁ |
| Hyd Pressure3 | 15 | 0.99 | 20.46 | 15.98 | -1.20 | 0.00 | 27.60 | 33.40 | 50.00 | ▇▁▃▇▁ |
| Hyd Pressure4 | 30 | 0.99 | 96.29 | 13.12 | 52.00 | 86.00 | 96.00 | 102.00 | 142.00 | ▁▃▇▂▁ |
| Filler Level | 20 | 0.99 | 109.25 | 15.70 | 55.80 | 98.30 | 118.40 | 120.00 | 161.20 | ▁▃▅▇▁ |
| Filler Speed | 57 | 0.98 | 3687.20 | 770.82 | 998.00 | 3888.00 | 3982.00 | 3998.00 | 4030.00 | ▁▁▁▁▇ |
| Temperature | 14 | 0.99 | 65.97 | 1.38 | 63.60 | 65.20 | 65.60 | 66.40 | 76.20 | ▇▃▁▁▁ |
| Usage cont | 5 | 1.00 | 20.99 | 2.98 | 12.08 | 18.36 | 21.79 | 23.75 | 25.90 | ▁▃▅▃▇ |
| Carb Flow | 2 | 1.00 | 2468.35 | 1073.70 | 26.00 | 1144.00 | 3028.00 | 3186.00 | 5104.00 | ▂▅▆▇▁ |
| Density | 1 | 1.00 | 1.17 | 0.38 | 0.24 | 0.90 | 0.98 | 1.62 | 1.92 | ▁▅▇▂▆ |
| MFR | 212 | 0.92 | 704.05 | 73.90 | 31.40 | 706.30 | 724.00 | 731.00 | 868.60 | ▁▁▁▂▇ |
| Balling | 1 | 1.00 | 2.20 | 0.93 | -0.17 | 1.50 | 1.65 | 3.29 | 4.01 | ▁▇▇▁▇ |
| Pressure Vacuum | 0 | 1.00 | -5.22 | 0.57 | -6.60 | -5.60 | -5.40 | -5.00 | -3.60 | ▂▇▆▂▁ |
| PH | 4 | 1.00 | 8.55 | 0.17 | 7.88 | 8.44 | 8.54 | 8.68 | 9.36 | ▁▅▇▂▁ |
| Oxygen Filler | 12 | 1.00 | 0.05 | 0.05 | 0.00 | 0.02 | 0.03 | 0.06 | 0.40 | ▇▁▁▁▁ |
| Bowl Setpoint | 2 | 1.00 | 109.33 | 15.30 | 70.00 | 100.00 | 120.00 | 120.00 | 140.00 | ▁▂▃▇▁ |
| Pressure Setpoint | 12 | 1.00 | 47.62 | 2.04 | 44.00 | 46.00 | 46.00 | 50.00 | 52.00 | ▁▇▁▆▁ |
| Air Pressurer | 0 | 1.00 | 142.83 | 1.21 | 140.80 | 142.20 | 142.60 | 143.00 | 148.20 | ▅▇▁▁▁ |
| Alch Rel | 9 | 1.00 | 6.90 | 0.51 | 5.28 | 6.54 | 6.56 | 7.24 | 8.62 | ▁▇▂▃▁ |
| Carb Rel | 10 | 1.00 | 5.44 | 0.13 | 4.96 | 5.34 | 5.40 | 5.54 | 6.06 | ▁▇▇▂▁ |
| Balling Lvl | 1 | 1.00 | 2.05 | 0.87 | 0.00 | 1.38 | 1.48 | 3.14 | 3.66 | ▁▇▂▁▆ |
As part of data exploration, I use a faceted grid of histograms to visualize the distribution of all numeric variables in the dataset.
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins = 15) +
facet_wrap(~key, scales = "free") +
ggtitle("Histograms of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_bin()`).
To check for outliers and skewness in the variables, I first convert the data from wide to long format and use boxplots to show the numerical predictors.
df_StudentData %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
## Warning: Removed 724 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
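The boxplots can be complemented with numeric skewness statistics. A minimal sketch using the e1071 package (an assumption; this package is not used elsewhere in the report), ranking the predictors by absolute skewness:

library(e1071)  # skewness()
df_StudentData %>%
  keep(is.numeric) %>%
  summarise(across(everything(), ~ skewness(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "skewness") %>%
  arrange(desc(abs(skewness)))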
The "Brand Code" column is converted to a categorical factor in both the training and evaluation datasets, and a bar chart shows the frequency distribution of each brand code. Brand B has the highest frequency: 1,239 observations (48.2%).
df_StudentData %>%
mutate(`Brand Code` = factor(`Brand Code`)) %>%
ggplot(aes(x = fct_infreq(`Brand Code`))) +
geom_bar(aes(fill = after_stat(count)), width = 0.8) +
geom_text(
stat = "count",
aes(label = after_stat(
paste0(format(count, big.mark = ","), "\n",
"(", round(count/sum(count)*100, 1), "%)")
)),
vjust = -0.5,
size = 3.2,
lineheight = 0.8
) +
scale_y_continuous(
labels = scales::comma_format(),
expand = expansion(mult = c(0, 0.12))
) +
scale_fill_gradient(low = "#E3F2FD", high = "#1565C0", guide = "none") +
labs(
title = "Brand Code Distribution",
x = "Brand Code",
y = "Count"
) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
panel.grid.major.y = element_line(color = "gray90")
)
Missing Data
31 of the variables have missing values. MFR is missing for roughly 8.25% of the observations, and Brand Code has 120 missing values. Since Brand Code is categorical, it needs to be handled differently when dealing with NA values.
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()
df_StudentData %>%
summarise_all(list(~ sum(is.na(.)))) %>%
gather(variable, value) %>%
filter(value != 0) %>%
arrange(-value) %>%
kable() %>%
kable_styling() %>%
scroll_box(width = "100%", height = "200px")
| variable | value |
|---|---|
| MFR | 212 |
| Brand Code | 120 |
| Filler Speed | 57 |
| PC Volume | 39 |
| PSC CO2 | 39 |
| Fill Ounces | 38 |
| PSC | 33 |
| Carb Pressure1 | 32 |
| Hyd Pressure4 | 30 |
| Carb Pressure | 27 |
| Carb Temp | 26 |
| PSC Fill | 23 |
| Fill Pressure | 22 |
| Filler Level | 20 |
| Hyd Pressure2 | 15 |
| Hyd Pressure3 | 15 |
| Temperature | 14 |
| Oxygen Filler | 12 |
| Pressure Setpoint | 12 |
| Hyd Pressure1 | 11 |
| Carb Volume | 10 |
| Carb Rel | 10 |
| Alch Rel | 9 |
| Usage cont | 5 |
| PH | 4 |
| Mnf Flow | 2 |
| Carb Flow | 2 |
| Bowl Setpoint | 2 |
| Density | 1 |
| Balling | 1 |
| Balling Lvl | 1 |
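Before imputing, the joint missingness pattern can be inspected with mice's md.pattern. A minimal sketch (mice itself is loaded in the next chunk; this diagnostic was not part of the original analysis):

# Each row of the pattern grid is a distinct missingness pattern;
# the final column counts how many variables are missing in that pattern.
mice::md.pattern(df_StudentData, rotate.names = TRUE)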
Transforming the Data
We will transform the data by imputing missing values using the MICE method with predictive mean matching (PMM). Predictors with near-zero variance are unlikely to be useful for modeling; on that basis, Hyd Pressure1 was removed.
set.seed(100)
library(mice)  # multivariate imputation by chained equations
# Convert Brand Code to factor so it can be imputed as a categorical variable
df_StudentData$`Brand Code` <- as.factor(df_StudentData$`Brand Code`)
init <- mice(df_StudentData, maxit=0, print=FALSE)
meth <- init$method
predM <- init$predictorMatrix
# Change the method for Brand Code
n_levels <- length(levels(df_StudentData$`Brand Code`))
if(n_levels > 2){
meth["Brand Code"] <- "polyreg"
} else {
meth["Brand Code"] <- "logreg"
}
imputed <- mice(df_StudentData, method=meth, predictorMatrix=predM, m=1, print=FALSE)
df_StudentData <- complete(imputed)
# Remove low variance columns from the numeric columns
numeric_col_names <- names(df_StudentData)[sapply(df_StudentData, is.numeric)]
low_var_cols <- nearZeroVar(df_StudentData[, numeric_col_names, drop=FALSE], names=TRUE)
if(length(low_var_cols) > 0){
df_StudentData <- df_StudentData[, !colnames(df_StudentData) %in% low_var_cols]
}
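As a sanity check on the imputations, mice can plot imputed against observed distributions. A minimal sketch for MFR, the predictor with the most missing values (densityplot here is mice's lattice-based method for mids objects; this diagnostic was not part of the original analysis):

# Blue: observed MFR values; red: imputed values
densityplot(imputed, ~ MFR)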
There are no more missing values and the transformations are complete for now.
df_StudentData %>%
  mutate(across(everything(), is.na)) %>%
  pivot_longer(everything(), names_to = "variables", values_to = "missing") %>%
  count(variables, missing) %>%
  ggplot(aes(y = variables, x = n, fill = missing)) +
  geom_col(position = "fill") +
  labs(title = "Proportion of Missing Values AFTER Imputation",
       x = "Proportion") +
  scale_fill_manual(values = c("grey", "blue"))
Model Building
To build the models, the data is split 80-20 into training and test sets. Several types of regression models, including linear regression, non-linear regression, and regression trees, will be fit and compared to choose the best-performing one.
library(caret)
library(dplyr)
set.seed(100)
# index for training
index <- createDataPartition(df_StudentData$PH, p = .8, list = FALSE)
train_x <- df_StudentData[index, ] %>% dplyr::select(-PH)
train_y <- df_StudentData[index, 'PH']
# test
test_x <- df_StudentData[-index, ] %>% dplyr::select(-PH)
test_y <- df_StudentData[-index, 'PH']
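Throughout, each model is scored on the held-out test set with caret's postResample, which reports RMSE, R2 (computed by caret as the squared Pearson correlation between predictions and observations), and MAE. For predictions $\hat{y}_i$ and observations $y_i$ over the $n$ test rows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad R^2 = \operatorname{cor}(y, \hat{y})^2, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$$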
LM has RMSE 0.1412367 and R2 0.3808389.
set.seed(100)
# 10-fold cross-validation to make reasonable estimates
ctrl <- trainControl(method = "cv", number = 10)
lmModel <- train(train_x, train_y, method = "lm", trControl = ctrl)
lmPred <- predict(lmModel, test_x)
postResample(lmPred, test_y)
## RMSE Rsquared MAE
## 0.1412367 0.3808389 0.1074174
PLS has RMSE 0.1414071 and R2 0.3793440.
set.seed(100)
plsTune <- train(train_x, train_y,
method = "pls",
tuneLength = 20, trControl = ctrl,
preProc = c("center", "scale"))
plsPred <- predict(plsTune, test_x)
postResample(plsPred, test_y)
## RMSE Rsquared MAE
## 0.1414071 0.3793440 0.1076608
MARS has RMSE 0.12713823 and R2 0.49903611.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:38)
set.seed(100)
# tune
marsTune <- train(train_x, train_y,
method = "earth",
tuneGrid = marsGrid,
trControl = trainControl(method = "cv"))
marsPred <- predict(marsTune, test_x)
postResample(marsPred, test_y)
## RMSE Rsquared MAE
## 0.12713823 0.49903611 0.09585218
Boosted Trees has RMSE 0.10801194 and R2 0.63814896.
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
n.trees = seq(100, 1000, by = 50),
shrinkage = c(0.01, 0.1),
n.minobsinnode = 10)
set.seed(100)
gbmTune <- train(train_x, train_y,
method = "gbm",
tuneGrid = gbmGrid,
verbose = FALSE)
gbmPred <- predict(gbmTune, test_x)
postResample(gbmPred, test_y)
## RMSE Rsquared MAE
## 0.10801194 0.63814896 0.07980758
Random Forest has RMSE 0.10632992 and R2 0.66740004.
set.seed(100)
library(randomForest)
rfModel <- randomForest(train_x, train_y,
                        importance = TRUE,
                        ntree = 1000)
rfPred <- predict(rfModel, test_x)
postResample(rfPred, test_y)
## RMSE Rsquared MAE
## 0.10632992 0.66740004 0.07530304
Cubist has RMSE 0.10574881 and R2 0.65511959.
set.seed(100)
cubistTuned <- train(train_x, train_y,
method = "cubist")
cubistPred <- predict(cubistTuned, test_x)
postResample(cubistPred, test_y)
## RMSE Rsquared MAE
## 0.10574881 0.65511959 0.07396093
Based on the results, Cubist has the lowest RMSE, very close to random forest, while the latter has the highest R2. The choice between them is a judgment call since the two are very comparable; I'll go with random forest this time.
rbind(lm = postResample(lmPred, test_y),
pls = postResample(plsPred, test_y),
mars = postResample(marsPred, test_y),
randomForest = postResample(rfPred, test_y),
boosted = postResample(gbmPred, test_y),
cubist = postResample(cubistPred, test_y))
## RMSE Rsquared MAE
## lm 0.1412367 0.3808389 0.10741741
## pls 0.1414071 0.3793440 0.10766083
## mars 0.1271382 0.4990361 0.09585218
## randomForest 0.1063299 0.6674000 0.07530304
## boosted 0.1080119 0.6381490 0.07980758
## cubist 0.1057488 0.6551196 0.07396093
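The same comparison can be shown as a quick visual; a sketch reusing the postResample results above (lower RMSE is better):

# Collect the test-set metrics into a data frame and plot RMSE per model
model_results <- rbind(lm = postResample(lmPred, test_y),
                       pls = postResample(plsPred, test_y),
                       mars = postResample(marsPred, test_y),
                       randomForest = postResample(rfPred, test_y),
                       boosted = postResample(gbmPred, test_y),
                       cubist = postResample(cubistPred, test_y)) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("model")

ggplot(model_results, aes(x = RMSE, y = reorder(model, -RMSE))) +
  geom_col(fill = "#1565C0") +
  labs(title = "Test-set RMSE by Model", x = "RMSE", y = "Model")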
Model Evaluation
Random Forest was chosen as the best model. Let's find out which features most strongly predict the target variable. Here I rank the predictors by their scaled importance scores, where higher numbers mean greater importance. Mnf Flow has the highest importance score, closely followed by Brand Code.
rfImp <- varImp(rfModel, scale = TRUE) %>%
as.data.frame()
rfImp %>%
arrange(-Overall) %>%
kable() %>%
kable_styling() %>%
scroll_box()
| Overall | |
|---|---|
| Mnf Flow | 58.2376108 |
| Brand Code | 58.0510412 |
| Oxygen Filler | 50.6316460 |
| Pressure Vacuum | 49.6745261 |
| Usage cont | 41.9325680 |
| Temperature | 41.1480679 |
| Air Pressurer | 39.3431540 |
| Carb Rel | 35.7792701 |
| Balling Lvl | 34.4357966 |
| Balling | 33.0238037 |
| Filler Speed | 31.9305502 |
| Alch Rel | 31.3900365 |
| Carb Flow | 31.3729521 |
| Carb Pressure1 | 28.5357607 |
| Bowl Setpoint | 26.9613169 |
| Density | 26.3642820 |
| Filler Level | 25.8557828 |
| Hyd Pressure3 | 23.4760243 |
| Carb Volume | 22.4025854 |
| MFR | 21.3257259 |
| Hyd Pressure2 | 19.0875515 |
| PC Volume | 18.9377110 |
| Hyd Pressure4 | 18.6489186 |
| Fill Pressure | 18.5035162 |
| Pressure Setpoint | 16.8148359 |
| Fill Ounces | 6.0230537 |
| Carb Pressure | 5.4110057 |
| PSC Fill | 0.7716583 |
| PSC CO2 | 0.5487036 |
| Carb Temp | -0.2380696 |
| PSC | -0.7612532 |
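To see the direction of the relationship for the top-ranked predictor, a partial dependence plot can help. A minimal sketch using randomForest::partialPlot (this plot was not part of the original analysis; the column name containing a space is passed as a string):

# Marginal effect of Mnf Flow on predicted PH, averaged over the
# remaining predictors in the training data
partialPlot(rfModel,
            pred.data = as.data.frame(train_x),
            x.var = "Mnf Flow",
            main = "Partial Dependence of PH on Mnf Flow")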
Another visual showing the features that most strongly predict the target variable in the random forest model, comparing the two importance metrics (%IncMSE and node purity) for the top 10 predictors.
imp_data <- importance(rfModel) %>%
as.data.frame() %>%
mutate(variable = rownames(.)) %>%
top_n(10, IncNodePurity)
# Convert to long format
imp_long <- imp_data %>%
pivot_longer(cols = c(`%IncMSE`, "IncNodePurity"),
names_to = "importance_type",
values_to = "value") %>%
mutate(
importance_type = factor(importance_type,
levels = c("%IncMSE", "IncNodePurity"),
labels = c("Increase in MSE %", "Node Purity"))
)
# Create grouped importance plot
ggplot(imp_long, aes(x = value, y = reorder(variable, value),
fill = importance_type)) +
geom_col(position = "dodge", width = 0.7, alpha = 0.9) +
geom_text(aes(label = round(value, 1)),
position = position_dodge(width = 0.7),
hjust = -0.1, size = 3.2) +
scale_fill_manual(
values = c("Increase in MSE %" = "#3498db", "Node Purity" = "#e74c3c"),
name = "Importance Metric"
) +
scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Feature Importance",
x = "Importance Score",
y = "Feature"
) +
theme_minimal() +
theme(
legend.position = "top",
legend.title = element_text(face = "bold"),
axis.text.y = element_text(face = "bold")
) +
facet_wrap(~importance_type, scales = "free_x")
Forecasting: apply the model to the evaluation data to get the predictions. First, the evaluation data has to be transformed the same way as the training data, imputing missing values and removing Hyd Pressure1, to ensure feature consistency.
set.seed(100)
#ensure Brand Code is factor (same as training data)
df_EvalData$`Brand Code` <- as.factor(df_EvalData$`Brand Code`)
if(!is.null(df_StudentData$`Brand Code`)) {
df_EvalData$`Brand Code` <- factor(
df_EvalData$`Brand Code`,
levels = levels(df_StudentData$`Brand Code`)
)
}
# Identify columns that exist in both datasets (excluding PH from training)
training_cols <- names(train_x)
eval_cols <- names(df_EvalData)
# Ensure df_EvalData has all columns needed for prediction
missing_cols <- setdiff(training_cols, eval_cols)
if(length(missing_cols) > 0) {
for(col in missing_cols) {
df_EvalData[[col]] <- NA
}
}
# Reorder columns to match training data
df_EvalData <- df_EvalData[, training_cols]
init_eval <- mice(df_EvalData, maxit = 0, print = FALSE)
meth_eval <- init_eval$method
predM_eval <- init_eval$predictorMatrix
# Set method for Brand Code (same as training)
if("Brand Code" %in% names(df_EvalData)) {
n_levels_eval <- length(unique(na.omit(df_EvalData$`Brand Code`)))
if(n_levels_eval > 2) {
meth_eval["Brand Code"] <- "polyreg"
} else {
meth_eval["Brand Code"] <- "logreg"
}
}
# Impute evaluation data
imputed_eval <- mice(df_EvalData,
method = meth_eval,
predictorMatrix = predM_eval,
m = 1,
print = FALSE)
df_EvalData_imputed <- complete(imputed_eval)
# Now predict PH
prediction <- predict(rfModel, df_EvalData_imputed)
head(prediction)
## 1 2 3 4 5 6
## 8.539210 8.468090 8.532999 8.599674 8.502400 8.558538
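As a quick sanity check (a sketch; output omitted), the predicted values should fall within the PH range seen in training, roughly 7.88 to 9.36:

# Five-number summary of the predicted PH values
summary(prediction)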
Average predicted PH for the evaluation data.
df_EvalData$PH <- prediction
# Overall average of the predicted PH values
df_EvalData %>%
  summarise(`Average PH` = mean(PH))
## # A tibble: 1 × 1
##   `Average PH`
##          <dbl>
## 1         8.55
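For a per-brand breakdown, the grouping column must be referenced with backticks rather than quotes, so that group_by sees the `Brand Code` column instead of a string literal. A minimal sketch (output omitted, as it was not produced in the original run):

# Average predicted PH per brand; note the backticks around Brand Code
df_EvalData %>%
  group_by(`Brand Code`) %>%
  summarise(`Average PH` = mean(PH))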
Save the results in Excel form, ready to present to leadership.
library(openxlsx)
# Write both the raw predictions and the completed evaluation data
write.xlsx(list('PH' = prediction, 'EvalData_complete' = df_EvalData),
           file = "C:/Users/xmei/Desktop/predictions_DJO.xlsx")
Conclusion: the biggest lesson I learned from this project was the handling of missing values. Brand Code, being categorical, differed from the other predictors and needed to be handled separately; the MARS model kept throwing errors until I realized this. A very interesting project to work on.