Background: My client is a global power technology leader, is a corporation of complementary business units that design, engineer, manufacture, distribute and service engines and related technologies, including fuel systems, controls, air handling, filtration, emission solutions, and electrical power generation systems.
The diesel engine business unit supply units for both on highway like semi-trucks and off highway applications like mining, construction, farming and marine. In 2020, the engine segment represents about 32% of the total generated revenue.
The headquarters of my client is located in United States and employees over 58000 people worldwide and serve customers in approximately 190 countries.
Existing Analytics: Working with their customers during the development and field testing their products, low-cost data loggers (LCDs) are installed on the engine or power generators to log performance data. The logged data is then transmitted back to my client server via cell signals or downloaded USB.
The data collected is analyzed and used to support problems solving to improve the quality of their products and used to plan for the monthly service time.
Business Problem Framing: My client have installed Spin-on filter into their field test engines through which engine oil flowing to be cleaned of impurities before getting sent through the rest if engine. These filters are designed to last at least 500hrs before getting replaced.
The oil pressure before and after the filter is monitored to determine oil differential pressure (ODP). This is done by taking a difference between post-filter oil pressure and pre-filter oil pressure.
With time, the ODP value increase more impurities from the oil are retained. The filter is expected to last about 500hrs before reaching 125kPa and at that point, the filter, the filtered must be changed
However, these filters are getting plugged so fast that most of the last about 200hrs which leads to frequent down time for replacement. My client’s customers are complaining that these irregular down times are costing them a lot of money and can be asking for as feasible solution to reduce these down times
I met with my client and there is some additional information I need to help me with working on the solution.
Stakeholder Analysis: Following are the stakeholders for this project.
My client: Is the main stakeholder in this project as it responsible for manufacturing and planning maintenance schedule for their field engines. Irregular servicing cost my client both time and money since they have to mobilize service engineers and field technicians to have filters changed.
My client’s end customers: These are the engine users who run the engines in the field. They are interested in having a reliable running engine with the minimal down time as much as possible. When the oil filters get plugged, there fault codes are triggered which leads to engine derate hence reducing the amount of power needed to perform the required job.
Data Sources: My client provided performance data from an engine that was running in Q4 2020 with a spin on filters and with the highest count of oil filter plug in issues in that particular location. The data has parameters/variables/attributes assumed to impact oil consumption.
These attributes include oil temperature, oil pressure, engine speed, engine torque, fuel rate, oil differential pressure, crankcase pressure (CCP), oil property parameters (i.e. oil viscosity, oil density, and oil dielectric) and the Oil differential pressure fault code (FC_1362). The data provided has a sample rate of 5 second interval.
This data will be prepared and transformed to handle missing data.Data scaling that might include standardization and or normalization using R packages may be necessary depending on the type of the model expected.
Expected Benefit of Analysis: With this analysis will be with predicting the ODP. With this information, my client can schedule the maintenance that aligns with the customers planned down time. Hence avoiding unplanned customer down time. This in turn will save my client a lot of money
Data mining goal from business goal:The business goal of my client is to be able to schedule maintenance intervals that aligns with spin on filter performance. This will save my client time and money spent to mobilize service and field engineers to perform filter changes at irregular intervals
Following packages are used in this entire project during the analysis.Each step has been explained and the codes have been given where necessary.
#install the packages if necessary
if(!require(tidyverse)) {install.packages("tidyverse"); require(tidyverse)}
if(!require(DataExplorer)) {install.packages("DataExplorer"); require(DataExplorer)} #For data exploration
if(!require(GGally)) {install.packages("GGally"); require(GGally)} #visual summary
if(!require(forcats)) {install.packages("forcats"); require(forcats)} #visual summary
if(!require(Hmisc)) {install.packages("Hmisc"); require(Hmisc)} #correlation table
if(!require(corrplot)) {install.packages("corrplot"); require(corrplot)} #correlation visualization
if(!require(equatiomatic)) {install.packages("equatiomatic"); require(equatiomatic)} # extracting linear equation
if(!require(pander)) {install.packages("pander"); require(pander)} # to create linear model summary
library(reshape2)
library(knitr)
library(plotly)
library(DT)
library(lubridate)
library(data.table)
lib1.name1 <- c("reshape2",
"knitr",
"plotly",
"DT",
"tidyverse",
"Data Explorer",
"lubridate",
"GGally",
"Hmisc",
"corrplot",
"equatiomatic",
"pander")
lib1.desc <- c("Used to reshape variables into groups.",
"For knitting document and include_graphics, kable functions.",
"Used to plot interactive charts.",
"Used to display the data on the screen in a scrollable format.",
"Used for data manipulation and visualization,includes following packages: readr, ggplot2, tibble, tidyr, dplyr, broom, modelr",
"Used for data exploration process.",
"Used for date manipulation.",
"Used for visual summary",
"Used to create correlation table",
"Used to create correlation table",
"Used to extract linear equation from linear model",
"Used to create a linear model summary"
)
lib1.desc <-as_tibble(cbind(lib1.name1,lib1.desc))
colnames(lib1.desc) <- c("Library", "Purpose")
kable(lib1.desc)
| Library | Purpose |
|---|---|
| reshape2 | Used to reshape variables into groups. |
| knitr | For knitting document and include_graphics, kable functions. |
| plotly | Used to plot interactive charts. |
| DT | Used to display the data on the screen in a scrollable format. |
| tidyverse | Used for data manipulation and visualization,includes following packages: readr, ggplot2, tibble, tidyr, dplyr, broom, modelr |
| Data Explorer | Used for data exploration process. |
| lubridate | Used for date manipulation. |
| GGally | Used for visual summary |
| Hmisc | Used to create correlation table |
| corrplot | Used to create correlation table |
| equatiomatic | Used to extract linear equation from linear model |
| pander | Used to create a linear model summary |
My client provided data in the excel format (Performance_33218293.csv). This data covers the entire Q4 2020 for an engine with the highest count of ODP fault counts. I uploaded the data to my GitHub engine repository (https://github.com/lawbuk/engine) where it is downloaded for the analysis. Here is a sample of our data set.
# Read in original data sets from Github
Performance <- as_tibble(read.csv("https://raw.githubusercontent.com/lawbuk/engine/main/Performance_33218293.csv", sep = ",", na.strings = "NA",strip.white = TRUE, stringsAsFactors = FALSE))
# View data with DT package
datatable(head(Performance, 10))
Purpose: This data set contains performance variables that are related to oil differential pressure. The data collected at 5 second interval and it is to be used to understand the historical performance that led to the activation of fault codes. The the data will be used to build the predictive model that will be used to plan for service time.
Period: This data was collected for date range Oct 1st, 2020 through December 31st, 2020.
Variables: this data set has 15 variables.
Peculiarities: The oil_Diff_Press variable has observations capped at 125kPa. This is because the sensor that is reading the Oil_Diff_Press can read up to 125kPa.
Below are the variables and their description in this data set
Variable.type <-lapply(Performance,class)
Variable.desc <- c("Date when data was collected",
"Hour of the day when data was collected (24hr)",
"Minute of the day when data was collected (min)",
"Cumulative Engine hour at the time of data collection (Hrs.)",
"Engine crankcase pressure (CCP) (kPa)",
"Engine oil density",
"Engine dielectric",
"Engine fuel rate (gal/l)",
"Engine speed (rpm)",
"Engine net torque (Nm)",
"Engine oil viscosity",
"Engine oil temperature (Deg C)",
"Engine oil pressure (kPa)",
"Engine Oil Differential Pressure (ODP) (kPa)",
"Engine Oil Differential Pressure Fault Code 1321")
Variable.name1 <- colnames(Performance)
parameter.desc <-as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(parameter.desc) <- c("Variable", "Type", "Description")
kable(parameter.desc)
| Variable | Type | Description |
|---|---|---|
| Date | character | Date when data was collected |
| Hour_of_Day | integer | Hour of the day when data was collected (24hr) |
| Minute | integer | Minute of the day when data was collected (min) |
| Run_Hours | numeric | Cumulative Engine hour at the time of data collection (Hrs.) |
| CCP | numeric | Engine crankcase pressure (CCP) (kPa) |
| Density | numeric | Engine oil density |
| Dielectric | numeric | Engine dielectric |
| Fuel_Rate | numeric | Engine fuel rate (gal/l) |
| Speed | numeric | Engine speed (rpm) |
| Torque | numeric | Engine net torque (Nm) |
| Viscosity | numeric | Engine oil viscosity |
| Oil_Temp | numeric | Engine oil temperature (Deg C) |
| Oil_Pressure | numeric | Engine oil pressure (kPa) |
| Oil_DP | numeric | Engine Oil Differential Pressure (ODP) (kPa) |
| FC_1362 | integer | Engine Oil Differential Pressure Fault Code 1321 |
All data cleaning steps were performed in RStudio in a R script. The first three columns are removed after getting combined to form a Datetime column, then the data type for each column is set.
Performance<-Performance%>%
mutate(time =str_c(Hour_of_Day,Minute,sep = ":"),
time =str_c(Date,time,sep = " "),
Date = ymd_hm(time)
)%>%
select(c(-time,-Hour_of_Day,-Minute))%>%
rename(Datetime=Date)
# View data with DT package
datatable(head(Performance, n=10))
Missing data: I explored the data for the missing data and as seen the Viscosity, Dielectric and Density columns each has about 0.28% of missing data. Since there are enough observations, these missing points will be used to dropped from the data set.
#Check for missing data
plot_missing(Performance,
missing_only = FALSE,
geom_label_args = list(),
title = "Parameters with missing Data",
ggtheme = theme_gray(),
theme_config = list(legend.position = c("none")))
Distributions: Next I explored the numeric data columns for distribution using histograms. Oil_DP is the target variable; however, its values are capped at 125 because of the sensor limit. In order to be able to predict the correct value, the data to be used for predictive model will be filter for values of Oil_DP less than 125kPa.
# Create Histogram for distribution
Performance %>% select(c(-Run_Hours))%>%
plot_histogram(
binary_as_factor = TRUE,
geom_histogram_args = list(bins = 30L),
scale_x = "continuous",
title = "Distribution plots",
ggtheme = theme_gray(),
theme_config = list(),
nrow = 2L,
ncol = 5L,
parallel = FALSE
)
Outliers: Then I explored data for outliers using Boxplots. These outliers will be taken out of the data as needed.
# boxplot for distribution
setDT(Performance)%>%
select(c(-Run_Hours,-FC_1362))%>%
na.omit() %>%
mutate(month = format(Datetime, "%W"),month = factor(month))%>%
melt(id.vars = c("Datetime","month"), variable.name = "param")%>%
ggplot(aes(month,value, colour = param))+ geom_boxplot()+
facet_wrap(~param, nrow = 2,scales = "free") +
ggtitle("Weekly Boxplots to show the outliers")+
xlab("Week")+
theme(axis.text.x=element_text(angle=90),legend.position="none")
Redundancy Fuel_Rate is a function of Speed and Torque, therefore it does not provide any more useful information and this will be removed from the data set
#Correlation
M <- round(rcorr(as.matrix(Performance%>%select(-c(Datetime,Run_Hours,FC_1362))))$r,4)
corrplot(M, method="number")
Data clean up will involve the following steps
clean_data<-Performance%>%
select(-c(Fuel_Rate,FC_1362))%>%
filter(Oil_DP<125,
Viscosity<40,
Density<1,
Oil_Pressure<600,
Speed>1750,
Dielectric >1.8,
Oil_Temp >80,
Torque > 10000,
!is.na(Viscosity),
!is.na(Dielectric),
!is.na(Density))
par(mar = c(4, 4, .1, .1))
#Check for missing data
plot_missing(clean_data,
missing_only = FALSE,
geom_label_args = list(),
title = "Parameters with missing Data",
ggtheme = theme_gray(),
theme_config = list(legend.position = c("none")))
# Create Histogram for distribution
clean_data %>% select(c(-Run_Hours))%>%
plot_histogram(
binary_as_factor = TRUE,
geom_histogram_args = list(bins = 30L),
scale_x = "continuous",
title = "Distribution plots",
ggtheme = theme_gray(),
theme_config = list(),
nrow = 3L,
ncol = 3L,
parallel = FALSE
)
# boxplot for distribution
setDT(clean_data)%>%
select(-Run_Hours)%>%
na.omit() %>%
mutate(month = format(Datetime, "%W"),month = factor(month))%>%
melt(id.vars = c("Datetime","month"), variable.name = "param")%>%
ggplot(aes(month,value, colour = param))+ geom_boxplot()+
facet_wrap(~param, nrow = 3,scales = "free") +
ggtitle("Weekly Boxplots to show the outliers")+
xlab("Week")+
theme(axis.text.x=element_text(angle=90),legend.position="none")
Explore the data: Let us look at the summary statistics of our cleaned data
## Datetime Run_Hours CCP Density
## Min. :2020-10-03 22:04:00 Min. :3951 Min. :0.86 Min. :0.760
## 1st Qu.:2020-10-09 17:23:15 1st Qu.:4044 1st Qu.:1.71 1st Qu.:0.790
## Median :2020-11-09 23:42:30 Median :4604 Median :1.83 Median :0.790
## Mean :2020-11-09 20:36:44 Mean :4569 Mean :1.86 Mean :0.792
## 3rd Qu.:2020-12-09 11:36:45 3rd Qu.:5046 3rd Qu.:1.99 3rd Qu.:0.800
## Max. :2020-12-15 04:16:00 Max. :5156 Max. :2.86 Max. :0.830
## Dielectric Speed Torque Viscosity Oil_Temp
## Min. :1.840 Min. :1751 Min. :10002 Min. :12.02 Min. :80.03
## 1st Qu.:2.120 1st Qu.:1812 1st Qu.:10731 1st Qu.:14.64 1st Qu.:88.13
## Median :2.170 Median :1819 Median :10746 Median :15.67 Median :91.56
## Mean :2.149 Mean :1821 Mean :10725 Mean :15.84 Mean :91.07
## 3rd Qu.:2.200 3rd Qu.:1826 3rd Qu.:10761 3rd Qu.:16.88 3rd Qu.:94.19
## Max. :2.250 Max. :1932 Max. :11313 Max. :26.20 Max. :99.06
## Oil_Pressure Oil_DP
## Min. :412.0 Min. : 27.51
## 1st Qu.:448.0 1st Qu.: 71.50
## Median :452.0 Median : 81.98
## Mean :454.1 Mean : 83.34
## 3rd Qu.:460.0 3rd Qu.: 94.53
## Max. :488.0 Max. :124.52
Correlation: Next we will look at the correlation statistics. As seen most predictor variables have have a normal distribution. Oil property variables (Dielectric, viscosity) are skewed.
clean_data%>%
select(-c(Datetime,Run_Hours, Density))%>%
as.data.frame()%>%
GGally::ggpairs()
Analytics Type In the previous section, the data is cleaned and the descriptive analysis was performed to explore the distribution using histogram and box plots. Also, weekly box plots are created to show the weekly trends in our data set.
the e performance data is cleaned and explored. Descriptive analysis and predictive analysis will be performed on the performance data.
With descriptive analysis, we will look at the distribution and trends in the data provided. Will also use scatter plots on various variables to explore any relationship between different factors. With predictive analysis, I will attempt to use correlation and regression analysis to predict the possible variables that are associated with the ODP.
Descriptive Analytics: This analysis will be applied to the cleaned data to create the visualization for the purpose understanding the historical performance of our client’s engine. In this analysis, I will use bar plots, histograms, scatter plots and box plots with the use of common R packages like ggplot2 among others.
Predictive Analytics: To perform this type of analytics, the performance data will be divided into train and test data sets. Train data set will be used to build an ODP predictive models. Whereas the test data will be used to valid the model. In addition, the fault code data will be used to test the model performance.
There might be data scaling done on some input independent variables that have high scale and then will run a model that will rank the independent variable correlation with dependent variable. Narrows down the number of independent variables to be used in the final predictive model
Model Selection The following possible models will will be considered during this analysis.
Multiple linear Regression model: The data provided has both qualitative (numerical) and qualitative (categorical) variables. The response variable (oil differential pressure) is the response valuable; thus, this will be applying a Least squares multiple linear regression to predict the value of Oil_DP and, With the use of mean squared error (MSE) measure, we can evaluate the performance of our model
Linear Stepwise Regression Model: Since we have more than one predictor variable, I will tune the model above using a linear stepwise regression model. This will give use the best simple model with only significant predictors.
Polynomial Regression Model: Depending on the accuracy of the model above, I might a polynomial regression model to improve on the accuracy.
For a successful regression analysis, I will follow the following steps:
To build a linear regression modal, I will base on these assumptions;
In case those conditions are not met, data transformation will be necessary.
Background For regression, population parameters; \(\hat\beta_0\) for slope and, \(\hat\beta_1\) for y-intercept are estimated. This is done by using a random sample df data set to solve for the slope and y intercept for each dependent parameter. The challenge here is to determine the best sample size because, slope and y-intercept are random variables. i.e their estimate values change from sample to sample.
Therefore since slope and y-intercepts are random variables then, they have a normal underline sampling distributions. Luckily enough;
\(\hat\beta_1\) is unbiased. That is, the mean of the y-intercept is equal to the population mean (\(\mu\hat\beta_1\ = \mu\beta_1\)). Thus, \(E(\hat\beta_1)\) = \(\beta_1\). This means on average, mean estimate is going to approach the true population parameter
The \(\beta_1\) standard deviation is estimated by \(S\hat\beta_1 = \frac{S}{\sqrt\sum(\hat x - x_i)^2}\). The standard deviation of \(\hat\beta_1\) is small, when the denominator \(\sqrt\sum(\hat x - x_i)^2\) is large. This gives us a tighter fit of the line to our data.
Next, the estimates are tested for significance and its done by fitting our data with a line that has a non-zero slope. \(\beta_0 ;\beta_1 = 0\) if intercepting at origin, \(\beta_0 = y, \beta_1 = 0\).
To learn about the population parameters and test for their significance, confidence intervals and hypothesis tests for \(\beta_1\) are used.
Hypothesis tests: Null hypothesis \(H_0 :\beta_1 = 0\), this means that there is no linear relationship between \(x\) and \(y\). i.e just ignore \(x\). So we want our data to give us enough evidence to reject that hypothesis in favor of my alternatives; \(H_a :\beta_1 > 0\) or \(H_a :\beta_1 < 0\) or \(H_0 :\beta_1 \ne 0\). If any of the alternative is true, then \(x\) and \(y\) have a linear relationship and the estimated slope is going to be \(\hat\beta_1\)
T-statistics: The T-distributions helps in estimating more things in our statistics, i.e drawing more information out of the sample to come up with test statistics. In this case estimating \(\hat\beta_1\) and \(S\hat\beta_1\).
So, the test statistics follows a t distribution \(T = \frac{\hat\beta_1- 0}{S\hat\beta_1} = \frac{\hat\beta_1}{S\hat\beta_1}\) where 0 is the mean assumed under the null. This has a t distribution with degrees of freedom \(df =n-2\) if null hypothesis,\(H_0\) is true.
Confidence intervals: The confidence interval is calculated using a formula; \(\hat\beta_1\pm t_{\alpha/2,n-2}*S\beta_1\) where \(t_{\alpha/2}\) is t critical values. Note, the confidence interval of \(\hat\beta_1\) gives a range of values where the true unknown parameter is likely to reside. It will also help in answering the question, is \(y\) related to \(x\) ?.
If the confidence interval of \(\beta_1\) contains 0, then there is no linear relationship between \(y\) and \(x\). Otherwise, there is a linear relationship
We will begin our simple linear model building process by using the Speed as the independent variable. Here is the relationship between Oil_DP and Speed.
data_final <- clean_data%>%
select(-c(Datetime))
plot(Oil_DP ~ Speed, data = data_final)
abline(coefficients(lm(Oil_DP ~ Speed, data = data_final)), lwd=2,lty = 2,col="red")
title("Oil Differential Pressure vs Speed")
Next, lets fit a simple linear model using Speed to estimate Oil_DP.
##
## Call:
## lm(formula = Oil_DP ~ Speed, data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.310 -12.106 -1.507 11.101 41.865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.88640 22.74477 -0.786 0.432
## Speed 0.05560 0.01249 4.451 8.72e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.32 on 5752 degrees of freedom
## Multiple R-squared: 0.003432, Adjusted R-squared: 0.003259
## F-statistic: 19.81 on 1 and 5752 DF, p-value: 8.72e-06
From the results, our linear model is represented by the following equation \[ \begin{aligned} \operatorname{Oil\_DP} &= \beta_{0} + \beta_{1}(\operatorname{Speed}) + \epsilon \end{aligned} \]
\[ \begin{aligned} \operatorname{\widehat{Oil\_DP}} &= -17.89 + 0.06(\operatorname{Speed}) \end{aligned} \]
Confidence Interval: We can calculate a 95% confidence interval for our true population slope using the quantile of t distribution. \(CI_{95} =qt(1-\frac{\alpha}{2},df)\)
#calculating a 95% confidence of interval (CI) of true population
CI_95 <- qt(0.975,df=nrow(data_final) - 2) # df is the degree of freedom
#(predictor slope + t*predictor Std.Error)
c <- coef(summary(mod_lm))[,"Estimate"]["Speed"]
e <- coef(summary(mod_lm))[,"Std. Error"]["Speed"]
p_max <- c + CI_95 *e
p_min <- c - CI_95 *e
#(predictor slope / Std.Error)
T <- c/e # this is similar to t value from the model
The results shows that we are 95% confident that the true slope, regressing independent variables on dependent variable is between 0.031111 and 0.0800917 Oil_DP kPa
Hypothesis test: Next, let us do the hypothesis test for the slope \((\beta_1)\) to see if the slope is significantly different from 0 and, compare that to the t value. \(T= \frac{\hat\beta_1}{S\hat\beta_1} =\) 0.0556013/0.0124927 = 4.4507
Then we compare the absolute values of that test t statistics to that quantile of the the t statistics we calculated. You notice that \(|T| > t_{(\alpha/2,n-2)} =\) 4.450713 > 1.9604.
Therefore, we reject the null hypothesis that the slope = 0
Hypothesis test on p values by calculating the percentile of t distribution where the p value is evaluated at the critical value calculated from the data
pt<-pt(T, df=nrow(data_final) - 2) # where T value calculated above and df is the degree of freedom
With both \(\alpha= 0.01\) or \(\alpha = 0.05\), it leads us to reject the null hypothesis that \(\beta_1 = 0\)
Confirm residuals normality: We plot a Histogram and Q-Q plot of residual from our model fit. From the QQ-Plot, we she that the model residual does not satisfy normality and linearity
plot(mod_lm$fitted.values, mod_lm$residuals,pch=20)
title("Residual distribution")
par(mfrow =c(1,2))
# par(cex.main =0.8)
hist(mod_lm$residuals)
#Normal Q-Q Plot of residuals
qqnorm(mod_lm$residuals)
qqline(mod_lm$residuals,col=2)
Collinearity: This occurs when two or more predictor variables are closely related to one another or are highly correlated. As we add more predictors to our model, it becomes hard to tell which of your predictors is contributing most on you model
This may lead to a miss classification error hence leading to over fitting a model. This happens because R might confused to which of those variables is going to be used as a predictor.
Next, lets fit a multiple regression model with all predictors.
mod_mlr <- lm(Oil_DP ~ ., data = data_final)
pander(mod_mlr)
| Â | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -246.8 | 40.83 | -6.045 | 1.59e-09 |
| Run_Hours | 0.00262 | 0.0005274 | 4.967 | 7.001e-07 |
| CCP | 9.002 | 1.203 | 7.481 | 8.488e-14 |
| Density | 289.1 | 27 | 10.71 | 1.664e-26 |
| Dielectric | 26.03 | 3.414 | 7.625 | 2.834e-14 |
| Speed | 0.1231 | 0.01206 | 10.2 | 3.1e-24 |
| Torque | 0.008719 | 0.001876 | 4.648 | 3.426e-06 |
| Viscosity | 1.875 | 0.1778 | 10.55 | 9.021e-26 |
| Oil_Temp | 0.2705 | 0.0841 | 3.216 | 0.001307 |
| Oil_Pressure | -0.7825 | 0.03317 | -23.59 | 1.506e-117 |
Residuals:
pander(summary(mod_mlr$residuals))
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| -56.62 | -10.28 | -1.417 | -8.756e-15 | 9.418 | 62.31 |
# display the actual coefficients
eq2 <- extract_eq(mod_mlr,wrap = TRUE, use_coefs = TRUE)
plot(mod_mlr$fitted.values, mod_mlr$residuals,pch=20)
title("Residual distribution")
par(mfrow =c(1,2))
# par(cex.main =0.8)
hist(mod_mlr$residuals)
#Normal Q-Q Plot of residuals
qqnorm(mod_mlr$residuals)
qqline(mod_mlr$residuals,col=2)
This model is slightly better than the first model. It is represented by the following equation \[ \begin{aligned} \operatorname{Oil\_DP} &= \beta_{0} + \beta_{1}(\operatorname{Run\_Hours}) + \beta_{2}(\operatorname{CCP}) + \beta_{3}(\operatorname{Density})\ + \\ &\quad \beta_{4}(\operatorname{Dielectric}) + \beta_{5}(\operatorname{Speed}) + \beta_{6}(\operatorname{Torque}) + \beta_{7}(\operatorname{Viscosity})\ + \\ &\quad \beta_{8}(\operatorname{Oil\_Temp}) + \beta_{9}(\operatorname{Oil\_Pressure}) + \epsilon \end{aligned} \]. This is simplified to \[ \begin{aligned} \operatorname{\widehat{Oil\_DP}} &= -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density})\ + \\ &\quad 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity})\ + \\ &\quad 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \end{aligned} \]
mod_mlr2 <- lm(Oil_DP ~., data = data_final)
drop1(mod_mlr2,test="F")
## Single term deletions
##
## Model:
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque +
## Viscosity + Oil_Temp + Oil_Pressure
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1289494 31161
## Run_Hours 1 5538 1295032 31184 24.670 7.001e-07 ***
## CCP 1 12564 1302057 31215 55.965 8.488e-14 ***
## Density 1 25739 1315233 31273 114.654 < 2.2e-16 ***
## Dielectric 1 13053 1302546 31217 58.144 2.834e-14 ***
## Speed 1 23369 1312862 31263 104.095 < 2.2e-16 ***
## Torque 1 4850 1294344 31181 21.604 3.426e-06 ***
## Viscosity 1 24972 1314466 31270 111.237 < 2.2e-16 ***
## Oil_Temp 1 2322 1291815 31170 10.342 0.001307 **
## Oil_Pressure 1 124964 1414458 31692 556.649 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_mlr2)
##
## Call:
## lm(formula = Oil_DP ~ ., data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.623 -10.276 -1.417 9.418 62.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.468e+02 4.083e+01 -6.045 1.59e-09 ***
## Run_Hours 2.620e-03 5.274e-04 4.967 7.00e-07 ***
## CCP 9.002e+00 1.203e+00 7.481 8.49e-14 ***
## Density 2.891e+02 2.700e+01 10.708 < 2e-16 ***
## Dielectric 2.603e+01 3.414e+00 7.625 2.83e-14 ***
## Speed 1.231e-01 1.206e-02 10.203 < 2e-16 ***
## Torque 8.719e-03 1.876e-03 4.648 3.43e-06 ***
## Viscosity 1.875e+00 1.778e-01 10.547 < 2e-16 ***
## Oil_Temp 2.705e-01 8.410e-02 3.216 0.00131 **
## Oil_Pressure -7.825e-01 3.317e-02 -23.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.98 on 5744 degrees of freedom
## Multiple R-squared: 0.1615, Adjusted R-squared: 0.1602
## F-statistic: 122.9 on 9 and 5744 DF, p-value: < 2.2e-16
# display the actual coefficients
eq3<-extract_eq(mod_mlr2,use_coefs = TRUE)
plot(mod_mlr2$fitted.values, mod_mlr2$residuals,pch=20)
title("Residual distribution")
par(mfrow =c(1,2))
# par(cex.main =0.8)
hist(mod_mlr2$residuals)
#Normal Q-Q Plot of residuals
qqnorm(mod_mlr2$residuals)
qqline(mod_mlr2$residuals,col=2)
\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]
Checking model condition
We check for linearity using residual plot. We need to see if the residuals are well centered around 0 (on y-axis) and spread evenly evenly above and below 0.
We can also check for normality by looking at histogram and qqplots above. As you can see the residue is centered and spreads spread evenly.
Dropping most insignificant Variable can be done by dropping the variable with the lowest value of sum of square (sum of Sq), residual sum of square (RSS) or Akaike information criterion (AIC).
After the least significant variable is dropped, re-run the model with the remaining variables, and repeat the process until all the variables are significant.
Linear Stepwise Regression Model: As seen earlier that there is collinearity in our data, I have to use a principle of parsimony . The simplest, most efficient model, the best. With application of stepwise regression, this will drop insignificant explanatory variables one by one.
# This provides the best model
mod_smlr <- step(lm(Oil_DP ~ ., data = data_final))
## Start: AIC=31161.28
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque +
## Viscosity + Oil_Temp + Oil_Pressure
##
## Df Sum of Sq RSS AIC
## <none> 1289494 31161
## - Oil_Temp 1 2322 1291815 31170
## - Torque 1 4850 1294344 31181
## - Run_Hours 1 5538 1295032 31184
## - CCP 1 12564 1302057 31215
## - Dielectric 1 13053 1302546 31217
## - Speed 1 23369 1312862 31263
## - Viscosity 1 24972 1314466 31270
## - Density 1 25739 1315233 31273
## - Oil_Pressure 1 124964 1414458 31692
drop1(mod_smlr,test="F")
## Single term deletions
##
## Model:
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque +
## Viscosity + Oil_Temp + Oil_Pressure
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1289494 31161
## Run_Hours 1 5538 1295032 31184 24.670 7.001e-07 ***
## CCP 1 12564 1302057 31215 55.965 8.488e-14 ***
## Density 1 25739 1315233 31273 114.654 < 2.2e-16 ***
## Dielectric 1 13053 1302546 31217 58.144 2.834e-14 ***
## Speed 1 23369 1312862 31263 104.095 < 2.2e-16 ***
## Torque 1 4850 1294344 31181 21.604 3.426e-06 ***
## Viscosity 1 24972 1314466 31270 111.237 < 2.2e-16 ***
## Oil_Temp 1 2322 1291815 31170 10.342 0.001307 **
## Oil_Pressure 1 124964 1414458 31692 556.649 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_smlr)
##
## Call:
## lm(formula = Oil_DP ~ Run_Hours + CCP + Density + Dielectric +
## Speed + Torque + Viscosity + Oil_Temp + Oil_Pressure, data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.623 -10.276 -1.417 9.418 62.306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.468e+02 4.083e+01 -6.045 1.59e-09 ***
## Run_Hours 2.620e-03 5.274e-04 4.967 7.00e-07 ***
## CCP 9.002e+00 1.203e+00 7.481 8.49e-14 ***
## Density 2.891e+02 2.700e+01 10.708 < 2e-16 ***
## Dielectric 2.603e+01 3.414e+00 7.625 2.83e-14 ***
## Speed 1.231e-01 1.206e-02 10.203 < 2e-16 ***
## Torque 8.719e-03 1.876e-03 4.648 3.43e-06 ***
## Viscosity 1.875e+00 1.778e-01 10.547 < 2e-16 ***
## Oil_Temp 2.705e-01 8.410e-02 3.216 0.00131 **
## Oil_Pressure -7.825e-01 3.317e-02 -23.593 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.98 on 5744 degrees of freedom
## Multiple R-squared: 0.1615, Adjusted R-squared: 0.1602
## F-statistic: 122.9 on 9 and 5744 DF, p-value: < 2.2e-16
eq5<-extract_eq(mod_smlr,use_coefs = TRUE)
plot(mod_smlr$fitted.values, mod_smlr$residuals,pch=20)
title("Residual distribution")
par(mfrow =c(1,2))
# par(cex.main =0.8)
hist(mod_smlr$residuals)
#Normal Q-Q Plot of residuals
qqnorm(mod_smlr$residuals)
qqline(mod_smlr$residuals,col=2)
Our final model is represented by the equation below:
\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]
A polynomial regression: extends linear model by powers of predictors our model. As seen from the multiple linear modeling, our residuals did not meet all our required assumption. This led me to design a polynomial regression model. For this project since we are able to fit our data onto the multiple linear regression, doing polynomial regression model dos not bring significant improvement to the model.
Just to show how this can be achieved
mod_pr <- lm(Oil_DP ~ Run_Hours + Density + Torque +
Oil_Pressure +(Density^2) +I(Torque^2) + I(Viscosity^2) + I(Density^3) + I(Viscosity^3) + I(Oil_Temp^3), data = data_final)
drop1(mod_pr,test="F")
## Single term deletions
##
## Model:
## Oil_DP ~ Run_Hours + Density + Torque + Oil_Pressure + (Density^2) +
## I(Torque^2) + I(Viscosity^2) + I(Density^3) + I(Viscosity^3) +
## I(Oil_Temp^3)
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1336870 31369
## Run_Hours 1 23276 1360145 31466 100.0066 < 2.2e-16 ***
## Density 1 337 1337207 31368 1.4499 0.228599
## Torque 1 245 1337114 31368 1.0516 0.305189
## Oil_Pressure 1 83305 1420174 31715 357.9278 < 2.2e-16 ***
## I(Torque^2) 1 291 1337161 31368 1.2500 0.263594
## I(Viscosity^2) 1 3894 1340764 31384 16.7324 4.363e-05 ***
## I(Density^3) 1 244 1337114 31368 1.0498 0.305593
## I(Viscosity^3) 1 2288 1339158 31377 9.8306 0.001725 **
## I(Oil_Temp^3) 1 5563 1342433 31391 23.9041 1.040e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_pr)
##
## Call:
## lm(formula = Oil_DP ~ Run_Hours + Density + Torque + Oil_Pressure +
## (Density^2) + I(Torque^2) + I(Viscosity^2) + I(Density^3) +
## I(Viscosity^3) + I(Oil_Temp^3), data = data_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.137 -10.731 -1.511 9.848 52.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.572e+02 1.124e+03 -0.229 0.81906
## Run_Hours 4.845e-03 4.844e-04 10.000 < 2e-16 ***
## Density 2.052e+03 1.704e+03 1.204 0.22860
## Torque -1.336e-01 1.303e-01 -1.025 0.30519
## Oil_Pressure -5.487e-01 2.900e-02 -18.919 < 2e-16 ***
## I(Torque^2) 6.924e-06 6.193e-06 1.118 0.26359
## I(Viscosity^2) 2.003e-01 4.896e-02 4.091 4.36e-05 ***
## I(Density^3) -9.253e+02 9.031e+02 -1.025 0.30559
## I(Viscosity^3) -5.980e-03 1.907e-03 -3.135 0.00172 **
## I(Oil_Temp^3) 1.632e-05 3.338e-06 4.889 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.26 on 5744 degrees of freedom
## Multiple R-squared: 0.1307, Adjusted R-squared: 0.1293
## F-statistic: 95.92 on 9 and 5744 DF, p-value: < 2.2e-16
eq6 <- extract_eq(mod_pr,use_coefs = TRUE)
plot(mod_pr$fitted.values, mod_pr$residuals,pch=20)
title("Residual distribution")
par(mfrow =c(1,2))
# par(cex.main =0.8)
hist(mod_pr$residuals)
#Normal Q-Q Plot of residuals
qqnorm(mod_pr$residuals)
qqline(mod_pr$residuals,col=2)
\[
\operatorname{\widehat{Oil\_DP}} = -257.15 + 0(\operatorname{Run\_Hours}) + 2052.03(\operatorname{Density}) - 0.13(\operatorname{Torque}) - 0.55(\operatorname{Oil\_Pressure}) + 0(\operatorname{I(Torque\char`\^2)}) + 0.2(\operatorname{I(Viscosity\char`\^2)}) - 925.3(\operatorname{I(Density\char`\^3)}) - 0.01(\operatorname{I(Viscosity\char`\^3)}) + 0(\operatorname{I(Oil\_Temp\char`\^3)})
\]
This phase assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. We will compare our results found with the original questions posed by the sponsors in the Business Understanding phase. From the the customer expectations, I needed to identify the parameters needed to predict the oil differential pressure and also, to design an oil differential pressure model.
To answer this question, I started by looking at the all the 15 parameters in the original dataset, I performed data cleaning to get ride of rows with na (missing data), outliers and, redundancies. Then I filtered the data for only or the rated condition and additional filtering performed as explained in data cleaning section above. With the cleaned data, I performed a multiple linear regression model which inspected to make sure that its residuals are normally distributed and meet linearity assumptions.
After careful examination of the dataset, I observed that the dataset is grouped in three clusters:Idle condition, Rated condition and, Transition between idle and rated conditions. Our response variable is capped at 125kPa, therefore the data was filtered for values of \(Oil\_DP < 125\). In addition, this analysis focused rated conditions but the analysis can be repeated for other conditions by just filtering the data for those conditions.
From the analysis;Run_Hours,CCP, Density, Dielectric, Speed, Torque, Oil_Pressure and, oil_temp are the significant parameters that are included in the final model. The model performance is done in previous sections. The final model is represented by the following equation. From the model, we see that all the predictors have a linear relationship with response variable
\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]
This model only focuses for on rated conditions. Here are some of the recommendations for the model improvements
The challenges I found at the beginning was the the use of a vast amount of data. Initial the dataset provided had over 2 million observations and with the limited amount of computing power, some portions of analysis were failing. This was mitigated by reducing the amount of data I was I started to use. However, the analysis was a success and the predictive model built, meets the customer requirements for this project.