Data 624 Project 2

1. Problem Statement

This is role playing. I am your new boss. I am in charge of production at ABC Beverage and you are a team of data scientists reporting to me. My leadership has told me that new regulations are requiring us to understand our manufacturing process, the predictive factors and be able to report to them our predictive model of PH. Please use the historical data set I am providing. Build and report the factors in BOTH a technical and non-technical report. I like to use Word and Excel. Please provide your non-technical report in a business friendly readable document and your predictions in an Excel readable format. The technical report should show clearly the models you tested and how you selected your final approach. Please submit both RPubs links and .rmd files or other readable formats for technical and non-technical reports. Also submit the excel file showing the prediction of your models for pH.

2. Data Exploration

Load Data

Two files are provided for this project. The file StudentData.xlsx contains the data used to build the predictive models for PH. The file StudentEvaluation.xlsx contains the data used to evaluate the selected model. The PH column in this file has no data. These files were converted to .csv and uploaded to the GitHub for easy loadability.

student_data<-read.csv("https://raw.githubusercontent.com/Luz917/data624project2/master/StudentData.csv")
student_eval<-read.csv("https://raw.githubusercontent.com/Luz917/data624project2/master/StudentEvaluation.csv")

Summary of Student Data

The Student Data data set has 2,571 observations and 33 variables.

This data set contains categorical, continuous, and discrete variables. The data set has some missing values for most of the variables. The response variable PH has four missing values. All the predictor variables have missing values except forPressure.Vacuum and Air.Pressure. The categorical variable Brand.Code has blank values.

Data summary
Name	student_data
Number of rows	2571
Number of columns	33
_______________________
Column type frequency:
character	1
numeric	32
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ï..Brand.Code	0	1	0	1	120	5	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	10	1.00	5.37	0.11	5.04	5.29	5.35	5.45	5.70	▁▆▇▅▁
Fill.Ounces	38	0.99	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	39	0.98	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	27	0.99	68.19	3.54	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	26	0.99	141.09	4.04	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	33	0.99	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	23	0.99	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	39	0.98	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	2	1.00	24.57	119.48	-100.20	-100.00	65.20	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	32	0.99	122.59	4.74	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill.Pressure	22	0.99	47.92	3.18	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure1	11	1.00	12.44	12.43	-0.80	0.00	11.40	20.20	58.00	▇▅▂▁▁
Hyd.Pressure2	15	0.99	20.96	16.39	0.00	0.00	28.60	34.60	59.40	▇▂▇▅▁
Hyd.Pressure3	15	0.99	20.46	15.98	-1.20	0.00	27.60	33.40	50.00	▇▁▃▇▁
Hyd.Pressure4	30	0.99	96.29	13.12	52.00	86.00	96.00	102.00	142.00	▁▃▇▂▁
Filler.Level	20	0.99	109.25	15.70	55.80	98.30	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	57	0.98	3687.20	770.82	998.00	3888.00	3982.00	3998.00	4030.00	▁▁▁▁▇
Temperature	14	0.99	65.97	1.38	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	5	1.00	20.99	2.98	12.08	18.36	21.79	23.75	25.90	▁▃▅▃▇
Carb.Flow	2	1.00	2468.35	1073.70	26.00	1144.00	3028.00	3186.00	5104.00	▂▅▆▇▁
Density	1	1.00	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
MFR	212	0.92	704.05	73.90	31.40	706.30	724.00	731.00	868.60	▁▁▁▂▇
Balling	1	1.00	2.20	0.93	-0.17	1.50	1.65	3.29	4.01	▁▇▇▁▇
Pressure.Vacuum	0	1.00	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	4	1.00	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	12	1.00	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Bowl.Setpoint	2	1.00	109.33	15.30	70.00	100.00	120.00	120.00	140.00	▁▂▃▇▁
Pressure.Setpoint	12	1.00	47.62	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	0	1.00	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Alch.Rel	9	1.00	6.90	0.51	5.28	6.54	6.56	7.24	8.62	▁▇▂▃▁
Carb.Rel	10	1.00	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Balling.Lvl	1	1.00	2.05	0.87	0.00	1.38	1.48	3.14	3.66	▁▇▂▁▆

Predictor Brand.Code has four different categories, and there are 120 observations with categories that are not known.

table(student_data$ï..Brand.Code)

## 
##         A    B    C    D 
##  120  293 1239  304  615

Doing a little cleanup of the variable name i..Brand.Code and simply renaming this to Brand.Code.

names(student_data)[names(student_data) == "ï..Brand.Code"] <- "Brand.Code"
names(student_eval)[names(student_eval) == "ï..Brand.Code"] <- "Brand.Code"

Disribution of Response Variable `PH`

The distribution of the response variable PH is approximately normal with some degree of skewness to the left. The mean PH value is 8.5. Because we have fewer observations towards the end of the distribution, this could mean that the models are not able to predict PH values that fall towards the tails of the distribution.

hist(student_data$PH)

Distribution of Predictors

Below shows the distribution of the numeric predictors. We can see that some variables have bimodal features (e.g., Air Pressure, Hyd.Pressure2, Hyd.Pressure3, Balling). Multimodal distributions could suggests different groups or different regions with another type of distribution shape. Some distributions appear to be more discrete with much more limited set of distinct values such as Pressure Setpoint, PSC.CO2, and Bowl.Setpoint.

plot_histogram((student_data[-c(26)]))

Scatter Plots of Predictors with Response Variable

Below is a pairwise plot of the predictor variables versus the response variable PH. This should give us an idea of the predictors’ relationship to the response variable.

Correlation

Below is heat map of correlations between variables. As you can see, some variables are more strongly correlated than others. Response variable PH is more strongly correlated to Mnf.Flow. Hyd.Pressure3 has a somewhat moderate negative correlation with Pressure.Vaccuum. There are some variables with significant positive correlations. Balling is strongly correlated with Balling.Lv1 (0.98), among others. Concerns about multicollinearity should be considered when selecting features for our models. We should avoid including pairs that are strongly correlated with each other.

Identify Multicollinearity Problem

The vifcor function of the usdm package calculates variance inflation factor (VIF). This function excludes highly correlated variables from the set through stepwise procedure. This is function is used to deal with multicollinearity problem.

The function identified 6 variables from the 31 predictors that have collinearity problem. These are Balling, Bowl.Setpoint, Balling.Lvl,MFR, Hyd.Pressure3, and Alch.Rel.

(vif_result <- vifcor(na.omit(student_data[c(2:25,27:33)])))

## 6 variables from the 31 input variables have collinearity problem: 
##  
## Balling Bowl.Setpoint Balling.Lvl MFR Hyd.Pressure3 Alch.Rel 
## 
## After excluding the collinear variables, the linear correlation coefficients ranges between: 
## min correlation ( Pressure.Setpoint ~ PC.Volume ):  0.0001991472 
## max correlation ( Carb.Rel ~ Density ):  0.852689 
## 
## ---------- VIFs of the remained variables -------- 
##            Variables       VIF
## 1        Carb.Volume 17.159340
## 2        Fill.Ounces  1.153764
## 3          PC.Volume  1.685901
## 4      Carb.Pressure 43.298925
## 5          Carb.Temp 35.460787
## 6                PSC  1.155980
## 7           PSC.Fill  1.109449
## 8            PSC.CO2  1.064636
## 9           Mnf.Flow  4.262754
## 10    Carb.Pressure1  1.434406
## 11     Fill.Pressure  3.490621
## 12     Hyd.Pressure1  2.935470
## 13     Hyd.Pressure2  4.900597
## 14     Hyd.Pressure4  1.752686
## 15      Filler.Level  2.618103
## 16      Filler.Speed  1.273500
## 17       Temperature  1.151185
## 18        Usage.cont  1.718776
## 19         Carb.Flow  1.987496
## 20           Density  4.499376
## 21   Pressure.Vacuum  2.054866
## 22     Oxygen.Filler  1.561878
## 23 Pressure.Setpoint  3.300894
## 24     Air.Pressurer  1.167671
## 25          Carb.Rel  6.339500

Below, we remove the 6 variables with collinearity problem and run the vifcor function again after removing these 6 problematic variables. The output confirms that there are no more variables with collinearity problem. We will be dropping these 6 predictors from the data frame.

temp <- student_data[c(2:25,27:33)]
temp$Balling <- NULL
temp$Bowl.Setpoint <- NULL
temp$Balling.Lvl <- NULL
temp$MFR <- NULL
temp$Hyd.Pressure3 <- NULL
temp$Alch.Rel <- NULL
vifcor(temp)

## No variable from the 25 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( Density ~ Usage.cont ):  -1.283718e-05 
## max correlation ( Carb.Rel ~ Density ):  0.8475287 
## 
## ---------- VIFs of the remained variables -------- 
##            Variables       VIF
## 1        Carb.Volume 16.772613
## 2        Fill.Ounces  1.145540
## 3          PC.Volume  1.644366
## 4      Carb.Pressure 43.057978
## 5          Carb.Temp 35.275573
## 6                PSC  1.154486
## 7           PSC.Fill  1.108167
## 8            PSC.CO2  1.059999
## 9           Mnf.Flow  4.164543
## 10    Carb.Pressure1  1.499823
## 11     Fill.Pressure  2.534785
## 12     Hyd.Pressure1  2.934876
## 13     Hyd.Pressure2  4.965940
## 14     Hyd.Pressure4  1.952196
## 15      Filler.Level  2.448071
## 16      Filler.Speed  2.027988
## 17       Temperature  1.208222
## 18        Usage.cont  1.671926
## 19         Carb.Flow  2.132049
## 20           Density  4.355882
## 21   Pressure.Vacuum  1.997420
## 22     Oxygen.Filler  1.512114
## 23 Pressure.Setpoint  2.650392
## 24     Air.Pressurer  1.160505
## 25          Carb.Rel  6.017929

Box Plots

Below is a scaled box plots of the variables. As you can see, there are some outliers in the data set. Usually, there are three general reasons for outliers. Reasons can include data entry or measurement error, sampling problems or unusual conditions, or simply a natural variation. As we do not have expert domain knowledge about the process that generated the data, we are unsure as to the nature of the reasons behind these outliers. As a result of this, it is hard do a reasonable assessment on whether to drop the outliers or not. If these outliers are a result of natural variation in the process, they capture valuable information that should probably be represented in the model.

For example, predictors MFR, Filler.Speed, and Oxygen.Filler appear to have some significant outliers.

The histogram chart below shows the distribution of predictors that appear to have some outliers. Carb.Flow and Air.Pressure appear to have multimodal distributions.

Missing Data

As you can see, approximately 8.25 percent of the rows are missing values for MFR with a count of 212 missing values. Bias is likely in analyses when missingness is more than 10%. So, based on this rule of thumb, it’s probably safe to impute missing values for MFR. Earlier data exploration revealed that the categorical variable Brand.Code has missing values. However, the plot_missing function did not reflect this, which originally showed a 0% value for missingness as the missing data are not represented by the value NA. To reflect the accurate missing ratio of Brand.Code, the value NA was assigned. Approximately, 4.67 percent of rows are missing for Brand.Code. The response variable PH missing ration is 0.16 percent with a count of 4. The rest of the other variables have low missing ratios, which suggests that imputation can be applied safely.

Near-Zero Variance

Below checks for predictors that show near zero-variance. These are predictors that do not vary much across observations and do not add much predictive information.

freqRatio is the ratio of frequencies for the most common value over the second most common value. percentUnique is the percentage of unique data points out of the total number of data points. zeroVar shows true if the predictor only has one distinct value; otherwise, false. nzv shows true if the predictor is a near zero variance predictor. The table below is sorted by nzv, and as you can see, predictor Hyd.Pressure1 has been determined to be a near zero-variance predictor. The rest of the predictors are not near zero-variance.

result <- nearZeroVar(student_data, saveMetrics= TRUE)
datatable(result[order(-result$nzv),])

3. Data Preparation

Based on the data exploration findings, we learned that there are missing values in the data set, predictors that are strongly correlated with each other, predictors that show near zero-variance, and predictors with outliers.

The student_data data frame is copied before we start modifying the data frame by dropping variables and imputing missing values. The student_eval data is also processed in parallel with student_data.

#copy of original student_data
student_data2 <- student_data

#copy of original student_eval
student_eval2 <- student_eval

Near Zero-Variance

During data exploration, it was determined that predictor Hyd.Pressure1 is a near zero-variance variable. This variable is dropped.

#drop for student_data
student_data$Hyd.Pressure1 <- NULL

#drop for student_eval
student_eval$Hyd.Pressure1 <- NULL

Multicollinearity

During data exploration, the vifcor function identified 6 predictors that have collinearity problems. These variables are dropped.

Drop variables Balling, Bowl.Setpoint, Balling.Lvl,MFR, Hyd.Pressure3, Alch.Rel.

#drop for student_data
student_data$Balling <- NULL
student_data$Bowl.Setpoint <- NULL
student_data$Balling.Lvl <- NULL
student_data$MFR <- NULL
student_data$Hyd.Pressure3 <- NULL
student_data$Alch.Rel <- NULL

#drop for student_eval
student_eval$Balling <- NULL
student_eval$Bowl.Setpoint <- NULL
student_eval$Balling.Lvl <- NULL
student_eval$MFR <- NULL
student_eval$Hyd.Pressure3 <- NULL
student_eval$Alch.Rel <- NULL

Missing Values

Drop observations with missing PH values
Assign Brand.Code with blank values to U.
Impute missing values.

The code below drops observations with missing PH values.

#drop for student_data
student_data <- student_data[!is.na(student_data$PH), ]

#note: student_eval all observations do not have PH values

The blank Brand.Code values were explicitly assigned the value NA previously. The code below assigns the category U to blank values of Brand.Code.

#student_data
student_data[is.na(student_data$Brand.Code),]$Brand.Code <- 'U'

#student_eval
student_eval[is.na(student_eval$Brand.Code),]$Brand.Code <- 'U'

Below is box plot of response variable PH by Brand.Code grouping.

boxplot(formula=PH~`Brand.Code`, data=student_data)

Impute missing values through mice package. Below is an excerpt from the mice package documentation.

The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations. Apply the imputation function. The returns an S3 object of class mids (multiply imputed data set).

#student_data
s3obj_mice = mice(student_data, print = FALSE, seed = 360)

#student_eval
s3obj_mice_eval = mice(student_eval, print = FALSE, seed = 360)

Below is density plot of the variables with imputed data.

densityplot(s3obj_mice)

The complete function of the mice package exports the imputed data. Update student_data with the imputed results.

#student_data
student_data = complete(s3obj_mice)

#student_eval
student_eval = complete(s3obj_mice_eval)

After the the data preparation steps applied above, the data frame is left with 2,567 observations and 26 variables. As you can see, we no longer have any missing values.

skim(student_data)

Data summary
Name	student_data
Number of rows	2567
Number of columns	26
_______________________
Column type frequency:
character	1
numeric	25
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Brand.Code	0	1	1	1	0	5	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	1	5.37	0.11	5.04	5.29	5.35	5.46	5.70	▁▆▇▅▁
Fill.Ounces	1	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	1	68.22	3.58	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	1	141.13	4.14	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	1	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	1	24.63	119.50	-100.20	-100.00	70.20	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	1	122.57	4.73	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill.Pressure	1	47.91	3.19	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure2	1	20.99	16.37	0.00	0.00	28.60	34.60	59.40	▇▂▇▃▁
Hyd.Pressure4	1	96.50	13.36	62.00	86.00	96.00	102.00	142.00	▂▆▇▂▁
Filler.Level	1	109.26	15.70	55.80	98.40	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	1	3640.93	837.01	998.00	3857.00	3980.00	3997.00	4030.00	▁▁▁▁▇
Temperature	1	65.97	1.39	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	1	20.99	2.98	12.08	18.37	21.78	23.75	25.90	▁▃▅▃▇
Carb.Flow	1	2472.70	1070.27	26.00	1169.00	3030.00	3188.00	5104.00	▂▅▆▇▁
Density	1	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
Pressure.Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	1	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Pressure.Setpoint	1	47.61	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	1	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Carb.Rel	1	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁

Outliers

We’re not dropping any observations with outliers.

Create Dummy Variables for Categorical Variable `Brand.Code`

The code below creates dummy variables that encode categorical variables into 1 or 0. The dummy variables are added to the data frame, and the categorical variable is then removed from the data frame.

# student_data
Brand.A <- ifelse(student_data$Brand.Code  == 'A', 1, 0)
Brand.B <- ifelse(student_data$Brand.Code  == 'B', 1, 0)
Brand.C <- ifelse(student_data$Brand.Code  == 'C', 1, 0)
Brand.D <- ifelse(student_data$Brand.Code  == 'D', 1, 0)
Brand.U <- ifelse(student_data$Brand.Code  == 'U', 1, 0)
# add dummy columns to data frame
student_data$Brand.A <- Brand.A
student_data$Brand.B <- Brand.B
student_data$Brand.C <- Brand.C
student_data$Brand.D <- Brand.D
student_data$Brand.U <- Brand.U
# remove categorical variable
student_data <- subset(student_data, select = -c(Brand.Code))


#student_eval
Brand.A <- ifelse(student_eval$Brand.Code  == 'A', 1, 0)
Brand.B <- ifelse(student_eval$Brand.Code  == 'B', 1, 0)
Brand.C <- ifelse(student_eval$Brand.Code  == 'C', 1, 0)
Brand.D <- ifelse(student_eval$Brand.Code  == 'D', 1, 0)
Brand.U <- ifelse(student_eval$Brand.Code  == 'U', 1, 0)
# add dummy columns to data frame
student_eval$Brand.A <- Brand.A
student_eval$Brand.B <- Brand.B
student_eval$Brand.C <- Brand.C
student_eval$Brand.D <- Brand.D
student_eval$Brand.U <- Brand.U
# remove categorical variable
student_eval <- subset(student_eval, select = -c(Brand.Code))

After the steps above were applied, the student_data data frame has 30 variables.

skim(student_data)

Data summary
Name	student_data
Number of rows	2567
Number of columns	30
_______________________
Column type frequency:
numeric	30
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	1	5.37	0.11	5.04	5.29	5.35	5.46	5.70	▁▆▇▅▁
Fill.Ounces	1	23.97	0.09	23.63	23.92	23.97	24.03	24.32	▁▂▇▂▁
PC.Volume	1	0.28	0.06	0.08	0.24	0.27	0.31	0.48	▁▃▇▂▁
Carb.Pressure	1	68.22	3.58	57.00	65.60	68.20	70.60	79.40	▁▅▇▃▁
Carb.Temp	1	141.13	4.14	128.60	138.40	140.80	143.80	154.00	▁▅▇▃▁
PSC	1	0.08	0.05	0.00	0.05	0.08	0.11	0.27	▆▇▃▁▁
PSC.Fill	1	0.20	0.12	0.00	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	1	0.06	0.04	0.00	0.02	0.04	0.08	0.24	▇▅▂▁▁
Mnf.Flow	1	24.63	119.50	-100.20	-100.00	70.20	140.80	229.40	▇▁▁▇▂
Carb.Pressure1	1	122.57	4.73	105.60	119.00	123.20	125.40	140.20	▁▃▇▂▁
Fill.Pressure	1	47.91	3.19	34.60	46.00	46.40	50.00	60.40	▁▁▇▂▁
Hyd.Pressure2	1	20.99	16.37	0.00	0.00	28.60	34.60	59.40	▇▂▇▃▁
Hyd.Pressure4	1	96.50	13.36	62.00	86.00	96.00	102.00	142.00	▂▆▇▂▁
Filler.Level	1	109.26	15.70	55.80	98.40	118.40	120.00	161.20	▁▃▅▇▁
Filler.Speed	1	3640.93	837.01	998.00	3857.00	3980.00	3997.00	4030.00	▁▁▁▁▇
Temperature	1	65.97	1.39	63.60	65.20	65.60	66.40	76.20	▇▃▁▁▁
Usage.cont	1	20.99	2.98	12.08	18.37	21.78	23.75	25.90	▁▃▅▃▇
Carb.Flow	1	2472.70	1070.27	26.00	1169.00	3030.00	3188.00	5104.00	▂▅▆▇▁
Density	1	1.17	0.38	0.24	0.90	0.98	1.62	1.92	▁▅▇▂▆
Pressure.Vacuum	1	-5.22	0.57	-6.60	-5.60	-5.40	-5.00	-3.60	▂▇▆▂▁
PH	1	8.55	0.17	7.88	8.44	8.54	8.68	9.36	▁▅▇▂▁
Oxygen.Filler	1	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Pressure.Setpoint	1	47.61	2.04	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	1	142.83	1.21	140.80	142.20	142.60	143.00	148.20	▅▇▁▁▁
Carb.Rel	1	5.44	0.13	4.96	5.34	5.40	5.54	6.06	▁▇▇▂▁
Brand.A	1	0.11	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand.B	1	0.48	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
Brand.C	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand.D	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
Brand.U	1	0.05	0.21	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁

4. Model Building

In this section, three categories of models were considered: (1) linear, (2) non-linear, and (3) tree-based. In total, nine models were considered. Linear models considered include PLS, Ridge, ENet, and Lasso. Non-linear models considered include KNN, MARS, and SVM. Tree-based models considered include random forest and GBM.

Split Train and Test Data

The code below splits the data frame student_datainto training and test sets with 80 percent of the observations going to the training set and 20 percent going to the test set. The training set is used to tune the models.

# Create training and testing split from training data
set.seed(525)
trainrow = createDataPartition(student_data$PH, p = 0.80, list = FALSE)
student_data_train <- student_data[trainrow, ]
student_data_test <- student_data[-trainrow, ]
colPH <- which(colnames(student_data) == "PH")
train_X <- student_data_train[, -colPH]
train_Y <- student_data_train$PH
test_X <- student_data_test[, -colPH]
test_Y <- student_data_test$PH

Build data frames to save training and test performance.

#collect results of training and test results on models
perf_train <- data.frame(Model=character(), RMSE=double(), RSquared=double(), MAE=double())
perf_test <- data.frame(Model=character(), RMSE=double(), RSquared=double(), MAE=double())

Linear Models

First set of models that will be done are the liner models. Linear models include Partial Least Square which is part of the pls library, There are three models that are from the elasticnect library and those models are Ridge, Elastic Net, and Lasso.

a. Partial Least Square (PLS)

Partial Least Squares is a type of technique that minimizes the predictors into smaller sets of components that are uncorrelated conducts partial least regression on them, and it does not do it on the original data. In PLS predictors can be highly correlated. Predictors are measured with error as it does not assume that the data is fixed. PLS uses Rsquared to select the optimal model. Here we build the PLS model.

set.seed(1)
PLS_model <- train(x=train_X,
                y=train_Y, 
                method='pls',
                metric='Rsquared',
                tuneLength=20,
                trControl=trainControl(method='cv'),
                preProcess=c('center', 'scale')
                )
PLS_model

## Partial Least Squares 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1848, 1850, 1849, 1849, 1850, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE       Rsquared   MAE      
##    1     0.1475444  0.2697501  0.1163209
##    2     0.1403038  0.3375482  0.1096748
##    3     0.1390244  0.3495563  0.1093433
##    4     0.1374810  0.3637951  0.1078142
##    5     0.1368743  0.3694355  0.1067738
##    6     0.1364368  0.3735789  0.1063230
##    7     0.1362655  0.3752907  0.1064270
##    8     0.1361407  0.3763309  0.1062417
##    9     0.1359896  0.3777980  0.1060292
##   10     0.1360214  0.3774610  0.1059708
##   11     0.1359612  0.3778966  0.1059615
##   12     0.1359398  0.3780788  0.1059709
##   13     0.1359229  0.3782254  0.1059311
##   14     0.1359397  0.3780868  0.1059473
##   15     0.1359419  0.3780920  0.1059436
##   16     0.1359748  0.3778372  0.1059720
##   17     0.1359439  0.3781005  0.1059759
##   18     0.1359370  0.3781625  0.1059648
##   19     0.1359326  0.3781984  0.1059591
##   20     0.1359354  0.3781738  0.1059602
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 13.

Below is the plot of the PLS model of the RSquared data.

plot(PLS_model)

Below is the performance of PLS model on the test set, which is similar to the training set.

PLS_model_pred <- predict(PLS_model, newdata=test_X)
(PLS_test <- postResample(pred=PLS_model_pred, obs=test_Y))

##      RMSE  Rsquared       MAE 
## 0.1379208 0.3750179 0.1071041

Save training and test performance.

#PLS
#save training and test perf
r <- PLS_model$results
s <- r[which(r$ncomp == PLS_model$bestTune$ncomp),]
perf_train[nrow(perf_train) + 1,] = c("PLS", s$RMSE, s$Rsquared, s$MAE)
perf_test[nrow(perf_test) + 1,] = c("PLS", PLS_test[1], PLS_test[2], PLS_test[3])

b. Ridge

Ridge regression is a technique that shrinks the regression coefficients, that makes variables, have an outcome with minor contributions, and makes the coefficients close to zero. The ridge model determines is RMSE by selecting the optimal model. Here we buld the ridge model.

## Define the candidate set of values
ridgeGrid <- data.frame(.lambda = seq(0, 1, by=0.1))
set.seed(1)
ridge_model <- train(x=train_X,
                y=train_Y,
               method = "ridge",
               tuneGrid = ridgeGrid,
               trControl = trainControl(method='cv') ,
               preProc = c("center", "scale")
              )
ridge_model

## Ridge Regression 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1848, 1850, 1849, 1849, 1850, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.0     0.1359338  0.3781874  0.1059620
##   0.1     0.1364204  0.3740103  0.1062830
##   0.2     0.1371790  0.3685452  0.1068324
##   0.3     0.1380115  0.3636493  0.1074350
##   0.4     0.1389151  0.3593017  0.1080551
##   0.5     0.1398857  0.3554164  0.1087703
##   0.6     0.1409165  0.3519149  0.1095544
##   0.7     0.1419997  0.3487341  0.1103864
##   0.8     0.1431270  0.3458245  0.1112790
##   0.9     0.1442907  0.3431470  0.1121981
##   1.0     0.1454833  0.3406707  0.1131479
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.

Below is the plot of the Ridge model of the RSquared data.

plot(ridge_model)

Below is the performance of ridge model on the test set, which is similar to the training set.

ridge_model_pred <- predict(ridge_model, newdata=test_X)
(ridge_test <- postResample(pred=ridge_model_pred, obs=test_Y))

##      RMSE  Rsquared       MAE 
## 0.1379925 0.3744831 0.1070752

Save training and test performance.

#RIDGE
#save training and test perf
r <- ridge_model$results
s <- r[which(r$lambda == ridge_model$bestTune$lambda),]
perf_train[nrow(perf_train) + 1,] = c("Ridge", s$RMSE, s$Rsquared, s$MAE)
perf_test[nrow(perf_test) + 1,] = c("Ridge", ridge_test[1], ridge_test[2], ridge_test[3])

c. Elastic Net (ENet)

Elastic Net creates a regression model that is then penalized with the L1-norm and L2-norm. Enet effectively shrink coefficients and it also sets some coefficients to zero. Elastic Net Model uses RMSE uses to select the optimal model. Here, we build the Elastic Net model.

set.seed(1)
enet_model <- train(x=train_X,
                y= train_Y,
               method = "enet",
                tuneGrid=expand.grid(.fraction = seq(0, 1, by=0.1), 
                                      .lambda = seq(0, 1, by=0.1)),
               trControl = trainControl(method='cv') ,
               preProc = c("center", "scale")
              )
enet_model

## Elasticnet 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1848, 1850, 1849, 1849, 1850, ... 
## Resampling results across tuning parameters:
## 
##   lambda  fraction  RMSE       Rsquared   MAE      
##   0.0     0.0       0.1720592        NaN  0.1384415
##   0.0     0.1       0.1553818  0.2602260  0.1237906
##   0.0     0.2       0.1466019  0.3063326  0.1160760
##   0.0     0.3       0.1418796  0.3347373  0.1119024
##   0.0     0.4       0.1394502  0.3503005  0.1096758
##   0.0     0.5       0.1380169  0.3609219  0.1083283
##   0.0     0.6       0.1370389  0.3687844  0.1072676
##   0.0     0.7       0.1364295  0.3736785  0.1065075
##   0.0     0.8       0.1360504  0.3767977  0.1061019
##   0.0     0.9       0.1359254  0.3779960  0.1059529
##   0.0     1.0       0.1359338  0.3781874  0.1059620
##   0.1     0.0       0.1720592        NaN  0.1384415
##   0.1     0.1       0.1583835  0.2419125  0.1263453
##   0.1     0.2       0.1501410  0.2889712  0.1192044
##   0.1     0.3       0.1447534  0.3165224  0.1144000
##   0.1     0.4       0.1416784  0.3338812  0.1116758
##   0.1     0.5       0.1397321  0.3469424  0.1099362
##   0.1     0.6       0.1385626  0.3547983  0.1087232
##   0.1     0.7       0.1377892  0.3611276  0.1079526
##   0.1     0.8       0.1371588  0.3667610  0.1071974
##   0.1     0.9       0.1367304  0.3707961  0.1066688
##   0.1     1.0       0.1364204  0.3740103  0.1062830
##   0.2     0.0       0.1720592        NaN  0.1384415
##   0.2     0.1       0.1591009  0.2353980  0.1269506
##   0.2     0.2       0.1510970  0.2845927  0.1200476
##   0.2     0.3       0.1456469  0.3111884  0.1152072
##   0.2     0.4       0.1423120  0.3279708  0.1122730
##   0.2     0.5       0.1404181  0.3405321  0.1105412
##   0.2     0.6       0.1390500  0.3501421  0.1091622
##   0.2     0.7       0.1382867  0.3562557  0.1082711
##   0.2     0.8       0.1378836  0.3605573  0.1077646
##   0.2     0.9       0.1374722  0.3649844  0.1072444
##   0.2     1.0       0.1371790  0.3685452  0.1068324
##   0.3     0.0       0.1720592        NaN  0.1384415
##   0.3     0.1       0.1594690  0.2315372  0.1272617
##   0.3     0.2       0.1515975  0.2823682  0.1204822
##   0.3     0.3       0.1461786  0.3073859  0.1156826
##   0.3     0.4       0.1427266  0.3243036  0.1126391
##   0.3     0.5       0.1408424  0.3356561  0.1109037
##   0.3     0.6       0.1394810  0.3459020  0.1095137
##   0.3     0.7       0.1386381  0.3530667  0.1085536
##   0.3     0.8       0.1383166  0.3575235  0.1079813
##   0.3     0.9       0.1381937  0.3603796  0.1077162
##   0.3     1.0       0.1380115  0.3636493  0.1074350
##   0.4     0.0       0.1720592        NaN  0.1384415
##   0.4     0.1       0.1596282  0.2296124  0.1273963
##   0.4     0.2       0.1518339  0.2813473  0.1206844
##   0.4     0.3       0.1464405  0.3049629  0.1159124
##   0.4     0.4       0.1429889  0.3213293  0.1128712
##   0.4     0.5       0.1410509  0.3328259  0.1110791
##   0.4     0.6       0.1398562  0.3421265  0.1097934
##   0.4     0.7       0.1389961  0.3500782  0.1087926
##   0.4     0.8       0.1387387  0.3547894  0.1082554
##   0.4     0.9       0.1388278  0.3572329  0.1081374
##   0.4     1.0       0.1389151  0.3593017  0.1080551
##   0.5     0.0       0.1720592        NaN  0.1384415
##   0.5     0.1       0.1596396  0.2291676  0.1274081
##   0.5     0.2       0.1518751  0.2808993  0.1207097
##   0.5     0.3       0.1464945  0.3035506  0.1159641
##   0.5     0.4       0.1431023  0.3192552  0.1129726
##   0.5     0.5       0.1411614  0.3311951  0.1111688
##   0.5     0.6       0.1401334  0.3393502  0.1099751
##   0.5     0.7       0.1393698  0.3473499  0.1090133
##   0.5     0.8       0.1392186  0.3520623  0.1085895
##   0.5     0.9       0.1395225  0.3541140  0.1086610
##   0.5     1.0       0.1398857  0.3554164  0.1087703
##   0.6     0.0       0.1720592        NaN  0.1384415
##   0.6     0.1       0.1595818  0.2294755  0.1273680
##   0.6     0.2       0.1517956  0.2807151  0.1206217
##   0.6     0.3       0.1464390  0.3024651  0.1159194
##   0.6     0.4       0.1431309  0.3177463  0.1129841
##   0.6     0.5       0.1412323  0.3299454  0.1112058
##   0.6     0.6       0.1403891  0.3370847  0.1101157
##   0.6     0.7       0.1397688  0.3448843  0.1092316
##   0.6     0.8       0.1398222  0.3489679  0.1090139
##   0.6     0.9       0.1402964  0.3510050  0.1092505
##   0.6     1.0       0.1409165  0.3519149  0.1095544
##   0.7     0.0       0.1720592        NaN  0.1384415
##   0.7     0.1       0.1595070  0.2299944  0.1273148
##   0.7     0.2       0.1516582  0.2805581  0.1204783
##   0.7     0.3       0.1463377  0.3015587  0.1158287
##   0.7     0.4       0.1431205  0.3166373  0.1129535
##   0.7     0.5       0.1413141  0.3287408  0.1112351
##   0.7     0.6       0.1406710  0.3350319  0.1102612
##   0.7     0.7       0.1402578  0.3422018  0.1095345
##   0.7     0.8       0.1405061  0.3458989  0.1095020
##   0.7     0.9       0.1411291  0.3479915  0.1098824
##   0.7     1.0       0.1419997  0.3487341  0.1103864
##   0.8     0.0       0.1720592        NaN  0.1384415
##   0.8     0.1       0.1594168  0.2307456  0.1272502
##   0.8     0.2       0.1514951  0.2803787  0.1203175
##   0.8     0.3       0.1462148  0.3007756  0.1157205
##   0.8     0.4       0.1430926  0.3157634  0.1129163
##   0.8     0.5       0.1414196  0.3274490  0.1112628
##   0.8     0.6       0.1409653  0.3331832  0.1104267
##   0.8     0.7       0.1408215  0.3393204  0.1099235
##   0.8     0.8       0.1412555  0.3428038  0.1100576
##   0.8     0.9       0.1420265  0.3450480  0.1105468
##   0.8     1.0       0.1431270  0.3458245  0.1112790
##   0.9     0.0       0.1720592        NaN  0.1384415
##   0.9     0.1       0.1593101  0.2315884  0.1271679
##   0.9     0.2       0.1513188  0.2801547  0.1201471
##   0.9     0.3       0.1460837  0.3001093  0.1156089
##   0.9     0.4       0.1430672  0.3149814  0.1128717
##   0.9     0.5       0.1415529  0.3261136  0.1113019
##   0.9     0.6       0.1412980  0.3312954  0.1106343
##   0.9     0.7       0.1414542  0.3363190  0.1103801
##   0.9     0.8       0.1420483  0.3398880  0.1106476
##   0.9     0.9       0.1429789  0.3422458  0.1112692
##   0.9     1.0       0.1442907  0.3431470  0.1121981
##   1.0     0.0       0.1720592        NaN  0.1384415
##   1.0     0.1       0.1592008  0.2323180  0.1270817
##   1.0     0.2       0.1511442  0.2798793  0.1199803
##   1.0     0.3       0.1459585  0.2994972  0.1154996
##   1.0     0.4       0.1430609  0.3141531  0.1128287
##   1.0     0.5       0.1417152  0.3247313  0.1113631
##   1.0     0.6       0.1416641  0.3293841  0.1108682
##   1.0     0.7       0.1420963  0.3335644  0.1108523
##   1.0     0.8       0.1428632  0.3372221  0.1112651
##   1.0     0.9       0.1439644  0.3396531  0.1120260
##   1.0     1.0       0.1454833  0.3406707  0.1131479
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were fraction = 0.9 and lambda = 0.

Below is the plot of the ENet model of the RSquared data.

plot(enet_model)

Below is performance of ENet model on the test set, which is similar to training set.

enet_model_pred <- predict(enet_model, newdata=test_X)
(enet_test <- postResample(pred=enet_model_pred, obs=test_Y))

##      RMSE  Rsquared       MAE 
## 0.1378915 0.3748911 0.1070772

Save training and test performance.

d. Lasso

Lasso stands for Least Absolute Shrinkage and Selection Operator. As its name suggest Lasso shrinks the coeeficients of the regression toward zero which then penalizes the regression model to L1-norm, the sum of the absolute coefficients. Lasso uses the Rsquared in order to select the optimal model. Here, we build the lasso model.

set.seed(1)
lasso_model <- train(x=train_X,
                  y=train_Y,
                  method='lasso',
                  metric='Rsquared',
                  tuneGrid=data.frame(.fraction = seq(0, 0.5, by=0.05)),
                  trControl=trainControl(method='cv'),
                  preProcess=c('center','scale')
                  )
lasso_model

## The lasso 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1852, 1848, 1850, 1849, 1849, 1850, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE       Rsquared   MAE      
##   0.00      0.1720592        NaN  0.1384415
##   0.05      0.1623547  0.1988617  0.1297894
##   0.10      0.1553818  0.2602260  0.1237906
##   0.15      0.1503884  0.2851764  0.1194232
##   0.20      0.1466019  0.3063326  0.1160760
##   0.25      0.1437345  0.3223985  0.1135287
##   0.30      0.1418796  0.3347373  0.1119024
##   0.35      0.1404829  0.3437027  0.1106357
##   0.40      0.1394502  0.3503005  0.1096758
##   0.45      0.1386665  0.3560308  0.1089289
##   0.50      0.1380169  0.3609219  0.1083283
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was fraction = 0.5.

Below is the plot of the Lasso model of the RSquared data.

plot(lasso_model)

Below is performance of lasso model on test set, which is similar to training set.

lasso_model_pred <- predict(lasso_model, newdata=test_X)
(lasso_test <- postResample(pred=lasso_model_pred, obs=test_Y))

##      RMSE  Rsquared       MAE 
## 0.1390828 0.3634917 0.1093774

Save training and test performance.

Non-Linear Models

Next set of models that will be done are the non-linear models. The non_linear models consist of K-Nearest Neighbors (KNN), Support Vector Machines (SVM),and Multivariate Adaptive Regression Splines (MARS). SVM model and KNN model both use the caret library, while MARS uses the earth library.

a. K-nearest Neighbors (KNN)

Here, we build the K-nearest Neighbor model. RMSE metric was used to select the optimal model.

set.seed(1)
knn_model <- train(x=train_X, 
                   y=train_Y, 
                  method="knn", 
                  tuneLength=20, 
                  trainControl=trainControl(method = "repeatedcv", repeats = 5),
                  preProc = c("center", "scale"))
knn_model

## k-Nearest Neighbors 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE       
##    5  0.1362882  0.4049885  0.10062242
##    7  0.1337103  0.4149618  0.09969496
##    9  0.1326581  0.4182120  0.09957466
##   11  0.1320620  0.4198883  0.09943211
##   13  0.1318634  0.4195381  0.09961616
##   15  0.1320655  0.4168911  0.09997505
##   17  0.1320747  0.4157675  0.10024156
##   19  0.1323075  0.4133827  0.10051829
##   21  0.1322902  0.4130841  0.10065080
##   23  0.1323428  0.4121573  0.10085459
##   25  0.1324471  0.4108564  0.10100372
##   27  0.1328053  0.4074699  0.10136790
##   29  0.1331310  0.4044819  0.10170746
##   31  0.1334328  0.4016739  0.10201870
##   33  0.1336948  0.3991875  0.10226727
##   35  0.1339202  0.3969727  0.10242136
##   37  0.1341456  0.3948643  0.10254753
##   39  0.1343751  0.3926863  0.10269663
##   41  0.1345295  0.3912840  0.10281018
##   43  0.1348283  0.3887038  0.10305314
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.

plot(knn_model)

Below is the performance of the KNN model on the test set, which is similar to the training set.

knn_pred <- predict(knn_model, newdata=test_X)
(knn_test <- postResample(pred=knn_pred,test_Y))

##       RMSE   Rsquared        MAE 
## 0.12439734 0.49837261 0.09557478

Save training and test performance.

b. Multivariate Adaptive Regression Splines (MARS)

Here, we build the MARS model. RMSE metric was used to select the optimal model.

set.seed(1)
mars_model <- train(x=train_X, y=train_Y, 
                      method="earth",
                      tuneLength=20,
                      preProc = c("center", "scale"))
mars_model

## Multivariate Adaptive Regression Spline 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   nprune  RMSE       Rsquared   MAE      
##    2      0.1525180  0.2094742  0.1197446
##    3      0.1444375  0.2902816  0.1130738
##    5      0.1420249  0.3146692  0.1104466
##    7      0.1400350  0.3341658  0.1083860
##    9      0.1380296  0.3530826  0.1063602
##   10      0.1372706  0.3603058  0.1057218
##   12      0.1352924  0.3785790  0.1038026
##   14      0.1349637  0.3823557  0.1031480
##   16      0.1347622  0.3847275  0.1029210
##   18      0.1346738  0.3863708  0.1027334
##   19      0.1346900  0.3865554  0.1027726
##   21      0.1346389  0.3878520  0.1027844
##   23      0.1348787  0.3865977  0.1028585
##   25      0.1349155  0.3868247  0.1029269
##   27      0.1348331  0.3879429  0.1027531
##   28      0.1349035  0.3875977  0.1027481
##   30      0.1349271  0.3878905  0.1026746
##   32      0.1348401  0.3890486  0.1025479
##   34      0.1350289  0.3876740  0.1026852
##   36      0.1350152  0.3880050  0.1026450
## 
## Tuning parameter 'degree' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 21 and degree = 1.

plot(mars_model)

Below is the performance of the MARS model on the test set, which is similar to the training set.

mars_pred <- predict(mars_model, newdata=test_X)
(mars_test <- postResample(pred=mars_pred, test_Y))

##      RMSE  Rsquared       MAE 
## 0.1340052 0.4103099 0.1027201

Save training and test performance.

c. Support Vector Machines (SVM)

Here, we build the SVM model. RMSE was used to select the optimal model.

set.seed(1)
svm_model <- train(train_X, 
                  train_Y, 
                  method="svmRadial",
                  tuneLength=10, 
                  trainControl=trainControl(method = "repeatedcv", repeats = 5),
                  preProc = c("center", "scale"))
svm_model

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 2055 samples
##   29 predictor
## 
## Pre-processing: centered (29), scaled (29) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE       
##     0.25  0.1279841  0.4484411  0.09512726
##     0.50  0.1258799  0.4649780  0.09305339
##     1.00  0.1245581  0.4760033  0.09180649
##     2.00  0.1241910  0.4810630  0.09162848
##     4.00  0.1248802  0.4802107  0.09250449
##     8.00  0.1266428  0.4735991  0.09417736
##    16.00  0.1298438  0.4592186  0.09681733
##    32.00  0.1353447  0.4332973  0.10101334
##    64.00  0.1416916  0.4051859  0.10565725
##   128.00  0.1477701  0.3804602  0.11026098
## 
## Tuning parameter 'sigma' was held constant at a value of 0.02369026
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.02369026 and C = 2.

plot(svm_model)

Below is the performance of the SVM model on the test set, which is similar to the training set.

svm_pred <- predict(svm_model, newdata=test_X)
(svm_test <- postResample(pred=svm_pred,test_Y))

##      RMSE  Rsquared       MAE 
## 0.1172226 0.5556294 0.0859758

Save training and test performance.

Tree-Based Models

Next set of models is part are the tree based models which consist of Random Forest(RF), and the Gradient Boosting Machine. RF Model uses the randomforest library while GMB Model uses the caret library.

a. Random Forest(RF)

Here, we build the random forest model with ntree set to 100.

set.seed(1)
rf_model <- randomForest(x = train_X, y = train_Y, ntree = 100)
rf_model

## 
## Call:
##  randomForest(x = train_X, y = train_Y, ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 9
## 
##           Mean of squared residuals: 0.01118717
##                     % Var explained: 62.25

Below is the performance of the random forest model on the test set, which is similar to the training set.

rf_model_pred <- predict(rf_model, test_X)
(rf_test <- postResample(pred = rf_model_pred, obs = test_Y))

##       RMSE   Rsquared        MAE 
## 0.09796695 0.70761740 0.07363158

Save training and test performance.

b. Gradient Boosting Machine (GBM)

Here, we build the gradient boosting machine model. RMSE was used to select the optimal model.

set.seed(1)
gbmGrid = expand.grid(interaction.depth = seq(1,5, by=2), n.trees = seq(100, 1000, by = 100), shrinkage = 0.1, n.minobsinnode = 5)
gbm_model <- train(train_X, train_Y, tuneGrid = gbmGrid, verbose = FALSE, method = 'gbm' )
gbm_model

## Stochastic Gradient Boosting 
## 
## 2055 samples
##   29 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2055, 2055, 2055, 2055, 2055, 2055, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   MAE       
##   1                   100     0.1339239  0.3950323  0.10477222
##   1                   200     0.1322563  0.4050135  0.10263092
##   1                   300     0.1318939  0.4076331  0.10174900
##   1                   400     0.1320518  0.4071191  0.10147871
##   1                   500     0.1321515  0.4071142  0.10125885
##   1                   600     0.1323119  0.4068150  0.10111531
##   1                   700     0.1323415  0.4073659  0.10097784
##   1                   800     0.1326016  0.4062089  0.10108686
##   1                   900     0.1327688  0.4056856  0.10109589
##   1                  1000     0.1328946  0.4055744  0.10108683
##   3                   100     0.1250318  0.4692545  0.09592086
##   3                   200     0.1222785  0.4910517  0.09249375
##   3                   300     0.1212470  0.5000584  0.09118180
##   3                   400     0.1204244  0.5072077  0.09034816
##   3                   500     0.1199141  0.5117855  0.08975774
##   3                   600     0.1195945  0.5149241  0.08944118
##   3                   700     0.1192920  0.5177363  0.08916290
##   3                   800     0.1191023  0.5199261  0.08898788
##   3                   900     0.1191735  0.5199473  0.08894566
##   3                  1000     0.1190374  0.5212941  0.08883902
##   5                   100     0.1206083  0.5056220  0.09133563
##   5                   200     0.1180559  0.5254546  0.08857157
##   5                   300     0.1170877  0.5335114  0.08756047
##   5                   400     0.1166022  0.5377664  0.08695262
##   5                   500     0.1162364  0.5410361  0.08651740
##   5                   600     0.1161295  0.5422862  0.08635330
##   5                   700     0.1161070  0.5427562  0.08626253
##   5                   800     0.1159675  0.5441361  0.08611505
##   5                   900     0.1158866  0.5450034  0.08606533
##   5                  1000     0.1158797  0.5452960  0.08602816
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 1000, interaction.depth =
##  5, shrinkage = 0.1 and n.minobsinnode = 5.

Below is the performance of the GMB model on the test set, which is similar to the training set.

gbm_model_pred <- predict(gbm_model, test_X)
(gbm_test <- postResample(pred = gbm_model_pred, obs = test_Y))

##      RMSE  Rsquared       MAE 
## 0.1041354 0.6412081 0.0802140

5. Variable of Importance

Predictor Mnf.flow appears to be the most important predictor across the nine different models considered.

a. Linear Models

The chart below shows the levels of importance for each predictor. For all linear models considered, Mnf.flow shows as the most important predictor.

b. Non-Linear Models

The chart below shows the levels of importance for each predictor. For all the non-linear models, Mnf.Flow shows as the most important predictor.

c. Tree-Based Models

The chart below shows the levels of importance for each predictor. For all tree-based models considered, Mnf.flow is the most important predictor.

gridExtra::grid.arrange(p1_trees,p2_trees, nrow=1)

6. Model Selection

Nine different models were considered from three different classes of models, which includes linear, non-linear, and tree-based models. For model selection, we’ve decided to compare the performance of each model on the test set using RMSE. Below is a table that shows how well each model did based on RMSE, RSquared, and MAE.

As you can see, Random Forest model is the best performing model based on lowest RMSE. It also happens to be the model with the highest r-squared as well.

Performance of Models on Test Set

datatable(perf_test[order(perf_test$RMSE),])

Tuning Performance

Below is a table that shows the performance of each model on the training set. The random forest metrics are below the table (as we could not figure out how to add it in the table dynamically).

datatable(perf_train[order(perf_train$RMSE),])

Random Forest is not in the table above. Below is this model’s performance on the training set.

## 
## Call:
##  randomForest(x = train_X, y = train_Y, ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 9
## 
##           Mean of squared residuals: 0.01118717
##                     % Var explained: 62.25

7. Model Evaluation

The student_eval data set has been processed in parallel with student_data so that any variables dropped or any imputation done on student_data is also performed on student_eval.

At this stage, we should expect for student_eval to have no missing values except for PH. The evaluation data set has 267 observations. As expected, there are 29 predictors.

skim(student_eval)

Data summary
Name	student_eval
Number of rows	267
Number of columns	30
_______________________
Column type frequency:
logical	1
numeric	29
________________________
Group variables	None

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
PH	267	0	NaN	:

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Carb.Volume	1	5.37	0.11	5.15	5.28	5.34	5.46	5.67	▂▇▃▅▁
Fill.Ounces	1	23.97	0.08	23.75	23.92	23.97	24.02	24.20	▁▅▇▃▁
PC.Volume	1	0.28	0.06	0.10	0.23	0.28	0.32	0.46	▁▆▇▅▁
Carb.Pressure	1	68.25	3.86	60.20	65.30	68.00	70.60	77.60	▃▆▇▃▂
Carb.Temp	1	141.27	4.34	130.00	138.40	140.80	143.90	154.00	▁▆▇▃▁
PSC	1	0.09	0.05	0.00	0.05	0.08	0.11	0.25	▆▇▃▂▁
PSC.Fill	1	0.19	0.11	0.02	0.10	0.18	0.26	0.62	▆▇▃▁▁
PSC.CO2	1	0.05	0.04	0.00	0.02	0.04	0.06	0.24	▇▃▂▁▁
Mnf.Flow	1	21.03	117.76	-100.20	-100.00	0.20	141.30	220.40	▇▁▁▆▂
Carb.Pressure1	1	122.99	4.45	113.00	120.10	123.40	125.50	136.00	▃▃▇▂▁
Fill.Pressure	1	48.13	3.47	37.80	46.00	47.80	50.20	60.20	▁▇▇▂▁
Hyd.Pressure2	1	20.04	17.23	-50.00	0.00	26.80	34.80	61.40	▁▁▆▇▁
Hyd.Pressure4	1	98.18	14.13	68.00	90.00	98.00	104.00	140.00	▅▆▇▂▁
Filler.Level	1	110.37	15.47	69.20	100.60	118.60	120.20	153.20	▂▃▇▇▁
Filler.Speed	1	3501.01	987.44	1006.00	3795.00	3918.00	3996.00	4020.00	▂▁▁▁▇
Temperature	1	66.22	1.69	63.80	65.40	65.80	66.60	75.40	▇▃▁▁▁
Usage.cont	1	20.88	3.00	12.90	18.10	21.40	23.74	24.60	▁▃▃▃▇
Carb.Flow	1	2408.64	1161.36	0.00	1083.00	3038.00	3215.00	3858.00	▂▃▁▆▇
Density	1	1.17	0.38	0.06	0.91	0.98	1.60	1.84	▁▁▇▁▅
Pressure.Vacuum	1	-5.17	0.58	-6.40	-5.60	-5.20	-4.80	-3.60	▁▇▆▃▁
Oxygen.Filler	1	0.05	0.05	0.00	0.02	0.03	0.06	0.40	▇▁▁▁▁
Pressure.Setpoint	1	47.73	2.06	44.00	46.00	46.00	50.00	52.00	▁▇▁▆▁
Air.Pressurer	1	142.83	1.23	141.20	142.20	142.60	142.80	147.20	▅▇▁▁▁
Carb.Rel	1	5.44	0.13	5.18	5.34	5.40	5.56	5.74	▂▇▂▃▂
Brand.A	1	0.13	0.34	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand.B	1	0.48	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
Brand.C	1	0.12	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
Brand.D	1	0.24	0.43	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
Brand.U	1	0.03	0.17	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁

Use Selected Model to Predict `PH` on `student_eval`

The selected model is the random forest model. This model is used to predict response variable PH on the student_data.

student_eval_predictors <- subset(student_eval, select = -PH)

#Prediction is run on random forest model 
PH_hat <- predict(rf_model, student_eval_predictors)
student_eval_results <- cbind(PH_hat, student_eval_predictors)

#save copy to csv
write.csv(student_eval_results, "Student_Eval_Predictions.csv")

Predicted `PH`

Below is a preview of the predicted PH.

Overview of Predicted `PH`

The predicted PH values do not show any values that fall outside the expected PH range of values.

Data summary
Name	student_eval_results$PH_h…
Number of rows	267
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	8.55	0.11	8.17	8.47	8.53	8.63	8.79	▁▂▇▆▃

The general shape of the distribution of the predicted PH is approximately normal. There are only 267 rows.

8. Conclusion

Nine models were considered to build a predictive model for company ABC. The purpose of this model is to predict PH levels of beverages that company ABC manufactures. The selected model is a tree-based model, specifically a random forest model. This model has the lowest RMSE and also happens to have the highest r-squared as well. We were given a student evaluation file. The selected model was used to predict the PH levels for this evaluation data set. The results of the prediction show predicted values that are within the expected range. The general shahpe of the distribution of the predicted values is approximately normal as expected.

Data 624 Project 2

Maryluz Cruz, Bill Stepniak, Sherranette Tinapunan

5/11/2021

1. Problem Statement

2. Data Exploration

Load Data

Summary of Student Data

Disribution of Response Variable PH

Distribution of Predictors

Scatter Plots of Predictors with Response Variable

Correlation

Identify Multicollinearity Problem

Box Plots

Missing Data

Near-Zero Variance

3. Data Preparation

Near Zero-Variance

Multicollinearity

Missing Values

Outliers

Create Dummy Variables for Categorical Variable Brand.Code

4. Model Building

Split Train and Test Data

Linear Models

a. Partial Least Square (PLS)

b. Ridge

c. Elastic Net (ENet)

d. Lasso

Non-Linear Models

a. K-nearest Neighbors (KNN)

b. Multivariate Adaptive Regression Splines (MARS)

c. Support Vector Machines (SVM)

Tree-Based Models

a. Random Forest(RF)

b. Gradient Boosting Machine (GBM)

5. Variable of Importance

a. Linear Models

b. Non-Linear Models

c. Tree-Based Models

6. Model Selection

Performance of Models on Test Set

Tuning Performance

7. Model Evaluation

Use Selected Model to Predict PH on student_eval

Predicted PH

Overview of Predicted PH

8. Conclusion

Disribution of Response Variable `PH`

Create Dummy Variables for Categorical Variable `Brand.Code`

Use Selected Model to Predict `PH` on `student_eval`

Predicted `PH`

Overview of Predicted `PH`