Domain I: Business Understanding

Background: My client is a global power technology leader, is a corporation of complementary business units that design, engineer, manufacture, distribute and service engines and related technologies, including fuel systems, controls, air handling, filtration, emission solutions, and electrical power generation systems.

The diesel engine business unit supply units for both on highway like semi-trucks and off highway applications like mining, construction, farming and marine. In 2020, the engine segment represents about 32% of the total generated revenue.

The headquarters of my client is located in United States and employees over 58000 people worldwide and serve customers in approximately 190 countries.

Existing Analytics: Working with their customers during the development and field testing their products, low-cost data loggers (LCDs) are installed on the engine or power generators to log performance data. The logged data is then transmitted back to my client server via cell signals or downloaded USB.

The data collected is analyzed and used to support problems solving to improve the quality of their products and used to plan for the monthly service time.

Domain II: Business Problem Framing

Business Problem Framing: My client have installed Spin-on filter into their field test engines through which engine oil flowing to be cleaned of impurities before getting sent through the rest if engine. These filters are designed to last at least 500hrs before getting replaced.

The oil pressure before and after the filter is monitored to determine oil differential pressure (ODP). This is done by taking a difference between post-filter oil pressure and pre-filter oil pressure.

With time, the ODP value increase more impurities from the oil are retained. The filter is expected to last about 500hrs before reaching 125kPa and at that point, the filter, the filtered must be changed

However, these filters are getting plugged so fast that most of the last about 200hrs which leads to frequent down time for replacement. My client’s customers are complaining that these irregular down times are costing them a lot of money and can be asking for as feasible solution to reduce these down times

I met with my client and there is some additional information I need to help me with working on the solution.

  1. What variable contribute more to the oil differential pressure change? Or
  2. What variable generate the biggest change in the ODP change?
  3. Given the variable associated with ODP, what is the predicted value?
  4. When do we expect to hit a certain value of ODP?

Stakeholder Analysis: Following are the stakeholders for this project.

  1. My client: Is the main stakeholder in this project as it responsible for manufacturing and planning maintenance schedule for their field engines. Irregular servicing cost my client both time and money since they have to mobilize service engineers and field technicians to have filters changed.

  2. My client’s end customers: These are the engine users who run the engines in the field. They are interested in having a reliable running engine with the minimal down time as much as possible. When the oil filters get plugged, there fault codes are triggered which leads to engine derate hence reducing the amount of power needed to perform the required job.

Data Sources: My client provided performance data from an engine that was running in Q4 2020 with a spin on filters and with the highest count of oil filter plug in issues in that particular location. The data has parameters/variables/attributes assumed to impact oil consumption.

These attributes include oil temperature, oil pressure, engine speed, engine torque, fuel rate, oil differential pressure, crankcase pressure (CCP), oil property parameters (i.e. oil viscosity, oil density, and oil dielectric) and the Oil differential pressure fault code (FC_1362). The data provided has a sample rate of 5 second interval.

This data will be prepared and transformed to handle missing data.Data scaling that might include standardization and or normalization using R packages may be necessary depending on the type of the model expected.

Expected Benefit of Analysis: With this analysis will be with predicting the ODP. With this information, my client can schedule the maintenance that aligns with the customers planned down time. Hence avoiding unplanned customer down time. This in turn will save my client a lot of money

Data mining goal from business goal:The business goal of my client is to be able to schedule maintenance intervals that aligns with spin on filter performance. This will save my client time and money spent to mobilize service and field engineers to perform filter changes at irregular intervals

Domain III: Data Preparation

Packages Used

Following packages are used in this entire project during the analysis.Each step has been explained and the codes have been given where necessary.

#install the packages if necessary
if(!require(tidyverse)) {install.packages("tidyverse"); require(tidyverse)} 
if(!require(DataExplorer)) {install.packages("DataExplorer"); require(DataExplorer)} #For data exploration
if(!require(GGally)) {install.packages("GGally"); require(GGally)} #visual summary
if(!require(forcats)) {install.packages("forcats"); require(forcats)} #visual summary
if(!require(Hmisc)) {install.packages("Hmisc"); require(Hmisc)} #correlation table
if(!require(corrplot)) {install.packages("corrplot"); require(corrplot)} #correlation  visualization
if(!require(equatiomatic)) {install.packages("equatiomatic"); require(equatiomatic)} # extracting linear equation
if(!require(pander)) {install.packages("pander"); require(pander)} # to create linear model summary

library(reshape2)
library(knitr)
library(plotly)
library(DT)
library(lubridate)
library(data.table)
lib1.name1 <- c("reshape2",
                "knitr",
                "plotly",
                "DT",
                "tidyverse",
                "Data Explorer",
                "lubridate",
                "GGally",
                "Hmisc",
                "corrplot",
                "equatiomatic",
                "pander")

lib1.desc <- c("Used to reshape variables into groups.",
               "For knitting document and include_graphics, kable functions.",
               "Used to plot interactive charts.",
               "Used to display the data on the screen in a scrollable format.",
               "Used for data manipulation and visualization,includes following packages: readr, ggplot2, tibble, tidyr, dplyr, broom, modelr",
               "Used for data exploration process.",
               "Used for date manipulation.",
               "Used for visual summary",
               "Used to create correlation table",
               "Used to create correlation table",
               "Used to extract linear equation from linear model",
               "Used to create a linear model summary"
               
               )

lib1.desc <-as_tibble(cbind(lib1.name1,lib1.desc))
colnames(lib1.desc) <- c("Library", "Purpose")
kable(lib1.desc)
Library Purpose
reshape2 Used to reshape variables into groups.
knitr For knitting document and include_graphics, kable functions.
plotly Used to plot interactive charts.
DT Used to display the data on the screen in a scrollable format.
tidyverse Used for data manipulation and visualization,includes following packages: readr, ggplot2, tibble, tidyr, dplyr, broom, modelr
Data Explorer Used for data exploration process.
lubridate Used for date manipulation.
GGally Used for visual summary
Hmisc Used to create correlation table
corrplot Used to create correlation table
equatiomatic Used to extract linear equation from linear model
pander Used to create a linear model summary

Importing Data

My client provided data in the excel format (Performance_33218293.csv). This data covers the entire Q4 2020 for an engine with the highest count of ODP fault counts. I uploaded the data to my GitHub engine repository (https://github.com/lawbuk/engine) where it is downloaded for the analysis. Here is a sample of our data set.

# Read in original data sets from Github
Performance <- as_tibble(read.csv("https://raw.githubusercontent.com/lawbuk/engine/main/Performance_33218293.csv", sep = ",", na.strings = "NA",strip.white = TRUE, stringsAsFactors = FALSE))

# View data with DT package
datatable(head(Performance, 10))

Data Properties

Purpose: This data set contains performance variables that are related to oil differential pressure. The data collected at 5 second interval and it is to be used to understand the historical performance that led to the activation of fault codes. The the data will be used to build the predictive model that will be used to plan for service time.

Period: This data was collected for date range Oct 1st, 2020 through December 31st, 2020.

Variables: this data set has 15 variables.

Peculiarities: The oil_Diff_Press variable has observations capped at 125kPa. This is because the sensor that is reading the Oil_Diff_Press can read up to 125kPa.

Below are the variables and their description in this data set

Variable.type <-lapply(Performance,class)
Variable.desc <- c("Date when data was collected",
                   "Hour of the day when data was collected (24hr)",
                   "Minute of the day when data was collected (min)",
                   "Cumulative Engine hour at the time of data collection (Hrs.)",
                   "Engine crankcase pressure (CCP) (kPa)",
                   "Engine oil density",
                   "Engine dielectric",
                   "Engine fuel rate (gal/l)",
                   "Engine speed (rpm)",
                   "Engine net torque (Nm)",
                   "Engine oil viscosity",
                   "Engine oil temperature (Deg C)",
                   "Engine oil pressure (kPa)",
                   "Engine Oil Differential Pressure (ODP) (kPa)",
                   "Engine Oil Differential Pressure Fault Code 1321")

Variable.name1 <- colnames(Performance)
parameter.desc <-as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(parameter.desc) <- c("Variable", "Type", "Description")
kable(parameter.desc)
Variable Type Description
Date character Date when data was collected
Hour_of_Day integer Hour of the day when data was collected (24hr)
Minute integer Minute of the day when data was collected (min)
Run_Hours numeric Cumulative Engine hour at the time of data collection (Hrs.)
CCP numeric Engine crankcase pressure (CCP) (kPa)
Density numeric Engine oil density
Dielectric numeric Engine dielectric
Fuel_Rate numeric Engine fuel rate (gal/l)
Speed numeric Engine speed (rpm)
Torque numeric Engine net torque (Nm)
Viscosity numeric Engine oil viscosity
Oil_Temp numeric Engine oil temperature (Deg C)
Oil_Pressure numeric Engine oil pressure (kPa)
Oil_DP numeric Engine Oil Differential Pressure (ODP) (kPa)
FC_1362 integer Engine Oil Differential Pressure Fault Code 1321

Initial Data Exploration

All data cleaning steps were performed in RStudio in a R script. The first three columns are removed after getting combined to form a Datetime column, then the data type for each column is set.

Performance<-Performance%>%
  mutate(time =str_c(Hour_of_Day,Minute,sep = ":"),
         time =str_c(Date,time,sep = " "),
         Date = ymd_hm(time)
         )%>%
  select(c(-time,-Hour_of_Day,-Minute))%>%
  rename(Datetime=Date)

# View data with DT package
datatable(head(Performance, n=10))

Missing data: I explored the data for the missing data and as seen the Viscosity, Dielectric and Density columns each has about 0.28% of missing data. Since there are enough observations, these missing points will be used to dropped from the data set.

#Check for missing data
plot_missing(Performance,
             missing_only = FALSE,
             geom_label_args = list(),
             title = "Parameters with missing Data",
             ggtheme = theme_gray(),
             theme_config = list(legend.position = c("none")))

Distributions: Next I explored the numeric data columns for distribution using histograms. Oil_DP is the target variable; however, its values are capped at 125 because of the sensor limit. In order to be able to predict the correct value, the data to be used for predictive model will be filter for values of Oil_DP less than 125kPa.

# Create Histogram for distribution

Performance %>% select(c(-Run_Hours))%>%
  plot_histogram(
    binary_as_factor = TRUE,
    geom_histogram_args = list(bins = 30L),
    scale_x = "continuous",
    title = "Distribution plots",
    ggtheme = theme_gray(),
    theme_config = list(),
    nrow = 2L,
    ncol = 5L,
    parallel = FALSE
    )

Outliers: Then I explored data for outliers using Boxplots. These outliers will be taken out of the data as needed.

# boxplot for distribution
setDT(Performance)%>%
  select(c(-Run_Hours,-FC_1362))%>%
  na.omit() %>%
  mutate(month = format(Datetime, "%W"),month = factor(month))%>%
  melt(id.vars = c("Datetime","month"), variable.name = "param")%>%
  ggplot(aes(month,value, colour = param))+ geom_boxplot()+
  facet_wrap(~param, nrow = 2,scales = "free") +
  ggtitle("Weekly Boxplots to show the outliers")+
  xlab("Week")+
  theme(axis.text.x=element_text(angle=90),legend.position="none")

Redundancy Fuel_Rate is a function of Speed and Torque, therefore it does not provide any more useful information and this will be removed from the data set

#Correlation
M <- round(rcorr(as.matrix(Performance%>%select(-c(Datetime,Run_Hours,FC_1362))))$r,4)
corrplot(M, method="number")

Data Cleaning

Data clean up will involve the following steps

  • Removing all rows with missing values
  • Filter for Oil_DP < 125 since this has values capped at 125.
  • Filter the data only for rated engine running conditions, i.e Speed > 1750, Oil_Temp > 80 and Torque > 10000
  • Additional filtering for outliers are: Dielectric >= 1.8, Viscosity < 40 and Density < 1
  • Removed Fuel_Rate because it is redundant since its a function of Speed and Torque
clean_data<-Performance%>%
  select(-c(Fuel_Rate,FC_1362))%>%
  filter(Oil_DP<125,
         Viscosity<40,
         Density<1,
         Oil_Pressure<600,
         Speed>1750,
         Dielectric >1.8,
         Oil_Temp >80,
         Torque > 10000,
         !is.na(Viscosity),
         !is.na(Dielectric),
         !is.na(Density))

par(mar = c(4, 4, .1, .1))

#Check for missing data
plot_missing(clean_data,
             missing_only = FALSE,
             geom_label_args = list(),
             title = "Parameters with missing Data",
             ggtheme = theme_gray(),
             theme_config = list(legend.position = c("none")))

# Create Histogram for distribution

clean_data %>% select(c(-Run_Hours))%>%
  plot_histogram(
    binary_as_factor = TRUE,
    geom_histogram_args = list(bins = 30L),
    scale_x = "continuous",
    title = "Distribution plots",
    ggtheme = theme_gray(),
    theme_config = list(),
    nrow = 3L,
    ncol = 3L,
    parallel = FALSE
    )

# boxplot for distribution
setDT(clean_data)%>%
  select(-Run_Hours)%>%
  na.omit() %>%
  mutate(month = format(Datetime, "%W"),month = factor(month))%>%
  melt(id.vars = c("Datetime","month"), variable.name = "param")%>%
  ggplot(aes(month,value, colour = param))+ geom_boxplot()+
  facet_wrap(~param, nrow = 3,scales = "free") +
  ggtitle("Weekly Boxplots to show the outliers")+
  xlab("Week")+
  theme(axis.text.x=element_text(angle=90),legend.position="none")

Explore the data: Let us look at the summary statistics of our cleaned data

##     Datetime                     Run_Hours         CCP          Density     
##  Min.   :2020-10-03 22:04:00   Min.   :3951   Min.   :0.86   Min.   :0.760  
##  1st Qu.:2020-10-09 17:23:15   1st Qu.:4044   1st Qu.:1.71   1st Qu.:0.790  
##  Median :2020-11-09 23:42:30   Median :4604   Median :1.83   Median :0.790  
##  Mean   :2020-11-09 20:36:44   Mean   :4569   Mean   :1.86   Mean   :0.792  
##  3rd Qu.:2020-12-09 11:36:45   3rd Qu.:5046   3rd Qu.:1.99   3rd Qu.:0.800  
##  Max.   :2020-12-15 04:16:00   Max.   :5156   Max.   :2.86   Max.   :0.830  
##    Dielectric        Speed          Torque        Viscosity        Oil_Temp    
##  Min.   :1.840   Min.   :1751   Min.   :10002   Min.   :12.02   Min.   :80.03  
##  1st Qu.:2.120   1st Qu.:1812   1st Qu.:10731   1st Qu.:14.64   1st Qu.:88.13  
##  Median :2.170   Median :1819   Median :10746   Median :15.67   Median :91.56  
##  Mean   :2.149   Mean   :1821   Mean   :10725   Mean   :15.84   Mean   :91.07  
##  3rd Qu.:2.200   3rd Qu.:1826   3rd Qu.:10761   3rd Qu.:16.88   3rd Qu.:94.19  
##  Max.   :2.250   Max.   :1932   Max.   :11313   Max.   :26.20   Max.   :99.06  
##   Oil_Pressure       Oil_DP      
##  Min.   :412.0   Min.   : 27.51  
##  1st Qu.:448.0   1st Qu.: 71.50  
##  Median :452.0   Median : 81.98  
##  Mean   :454.1   Mean   : 83.34  
##  3rd Qu.:460.0   3rd Qu.: 94.53  
##  Max.   :488.0   Max.   :124.52

Correlation: Next we will look at the correlation statistics. As seen most predictor variables have have a normal distribution. Oil property variables (Dielectric, viscosity) are skewed.

clean_data%>%
  select(-c(Datetime,Run_Hours, Density))%>%
  as.data.frame()%>%
  GGally::ggpairs()

Domain IV: Methodology

Analytics Type In the previous section, the data is cleaned and the descriptive analysis was performed to explore the distribution using histogram and box plots. Also, weekly box plots are created to show the weekly trends in our data set.

the e performance data is cleaned and explored. Descriptive analysis and predictive analysis will be performed on the performance data.

With descriptive analysis, we will look at the distribution and trends in the data provided. Will also use scatter plots on various variables to explore any relationship between different factors. With predictive analysis, I will attempt to use correlation and regression analysis to predict the possible variables that are associated with the ODP.

  • Descriptive Analytics: This analysis will be applied to the cleaned data to create the visualization for the purpose understanding the historical performance of our client’s engine. In this analysis, I will use bar plots, histograms, scatter plots and box plots with the use of common R packages like ggplot2 among others.

  • Predictive Analytics: To perform this type of analytics, the performance data will be divided into train and test data sets. Train data set will be used to build an ODP predictive models. Whereas the test data will be used to valid the model. In addition, the fault code data will be used to test the model performance.

There might be data scaling done on some input independent variables that have high scale and then will run a model that will rank the independent variable correlation with dependent variable. Narrows down the number of independent variables to be used in the final predictive model

Model Selection The following possible models will will be considered during this analysis.

  • Multiple linear Regression model: The data provided has both qualitative (numerical) and qualitative (categorical) variables. The response variable (oil differential pressure) is the response valuable; thus, this will be applying a Least squares multiple linear regression to predict the value of Oil_DP and, With the use of mean squared error (MSE) measure, we can evaluate the performance of our model

  • Linear Stepwise Regression Model: Since we have more than one predictor variable, I will tune the model above using a linear stepwise regression model. This will give use the best simple model with only significant predictors.

  • Polynomial Regression Model: Depending on the accuracy of the model above, I might a polynomial regression model to improve on the accuracy.

Domain V: Domain V: Analysis

Regression model background

For a successful regression analysis, I will follow the following steps:

  1. Explore the data analysis. This has been already completed in the Initial Data Exploration section
  2. Fit the model using least squares regression
  3. Use the model to interpret the coefficient in the predictions
  4. Check the model assumptions
  5. Do inference

To build a linear regression modal, I will base on these assumptions;

  1. All predictors are independent of each other.
  2. Our data do not increase/decrease in variance (heteroscedasticity)
  3. Our residuals satisfy normality, independence and, linearity

In case those conditions are not met, data transformation will be necessary.

Background For regression, population parameters; \(\hat\beta_0\) for slope and, \(\hat\beta_1\) for y-intercept are estimated. This is done by using a random sample df data set to solve for the slope and y intercept for each dependent parameter. The challenge here is to determine the best sample size because, slope and y-intercept are random variables. i.e their estimate values change from sample to sample.

Therefore since slope and y-intercepts are random variables then, they have a normal underline sampling distributions. Luckily enough;

  1. \(\hat\beta_1\) is unbiased. That is, the mean of the y-intercept is equal to the population mean (\(\mu\hat\beta_1\ = \mu\beta_1\)). Thus, \(E(\hat\beta_1)\) = \(\beta_1\). This means on average, mean estimate is going to approach the true population parameter

  2. The \(\beta_1\) standard deviation is estimated by \(S\hat\beta_1 = \frac{S}{\sqrt\sum(\hat x - x_i)^2}\). The standard deviation of \(\hat\beta_1\) is small, when the denominator \(\sqrt\sum(\hat x - x_i)^2\) is large. This gives us a tighter fit of the line to our data.

Next, the estimates are tested for significance and its done by fitting our data with a line that has a non-zero slope. \(\beta_0 ;\beta_1 = 0\) if intercepting at origin, \(\beta_0 = y, \beta_1 = 0\).

To learn about the population parameters and test for their significance, confidence intervals and hypothesis tests for \(\beta_1\) are used.

Hypothesis tests: Null hypothesis \(H_0 :\beta_1 = 0\), this means that there is no linear relationship between \(x\) and \(y\). i.e just ignore \(x\). So we want our data to give us enough evidence to reject that hypothesis in favor of my alternatives; \(H_a :\beta_1 > 0\) or \(H_a :\beta_1 < 0\) or \(H_0 :\beta_1 \ne 0\). If any of the alternative is true, then \(x\) and \(y\) have a linear relationship and the estimated slope is going to be \(\hat\beta_1\)

T-statistics: The T-distributions helps in estimating more things in our statistics, i.e drawing more information out of the sample to come up with test statistics. In this case estimating \(\hat\beta_1\) and \(S\hat\beta_1\).

So, the test statistics follows a t distribution \(T = \frac{\hat\beta_1- 0}{S\hat\beta_1} = \frac{\hat\beta_1}{S\hat\beta_1}\) where 0 is the mean assumed under the null. This has a t distribution with degrees of freedom \(df =n-2\) if null hypothesis,\(H_0\) is true.

Confidence intervals: The confidence interval is calculated using a formula; \(\hat\beta_1\pm t_{\alpha/2,n-2}*S\beta_1\) where \(t_{\alpha/2}\) is t critical values. Note, the confidence interval of \(\hat\beta_1\) gives a range of values where the true unknown parameter is likely to reside. It will also help in answering the question, is \(y\) related to \(x\) ?.

If the confidence interval of \(\beta_1\) contains 0, then there is no linear relationship between \(y\) and \(x\). Otherwise, there is a linear relationship

Linear Regression Model

We will begin our simple linear model building process by using the Speed as the independent variable. Here is the relationship between Oil_DP and Speed.

data_final <- clean_data%>%
  select(-c(Datetime))

plot(Oil_DP ~ Speed, data = data_final)

abline(coefficients(lm(Oil_DP ~ Speed, data = data_final)), lwd=2,lty = 2,col="red")
title("Oil Differential Pressure vs Speed")

Next, lets fit a simple linear model using Speed to estimate Oil_DP.

## 
## Call:
## lm(formula = Oil_DP ~ Speed, data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.310 -12.106  -1.507  11.101  41.865 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.88640   22.74477  -0.786    0.432    
## Speed         0.05560    0.01249   4.451 8.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.32 on 5752 degrees of freedom
## Multiple R-squared:  0.003432,   Adjusted R-squared:  0.003259 
## F-statistic: 19.81 on 1 and 5752 DF,  p-value: 8.72e-06

From the results, our linear model is represented by the following equation \[ \begin{aligned} \operatorname{Oil\_DP} &= \beta_{0} + \beta_{1}(\operatorname{Speed}) + \epsilon \end{aligned} \]

\[ \begin{aligned} \operatorname{\widehat{Oil\_DP}} &= -17.89 + 0.06(\operatorname{Speed}) \end{aligned} \]

  • Look at the Residuals range (Min,Max) and, the median to see if median is close to 0. If so, This is a quick way to know that the residuals are normally distributed.
  • Under Coefficients, check the the y-intercept ( \(\hat\beta_1\)), the slopes,the standard errors, the corresponding t values for t- statistics and the P value.
    1. The t and p value columns indicate the values of t-test i.e. show if our coefficients chosen are going to differ significantly from from 0.
    2. The \(\hat\beta_1\) is significant with a p value close or equal to 0
    3. The slope estimate is significant with a p value close or equal to 0
  • Check the Multiple R-squared which tells us how much of the variation in dependent variable can be explained by the independent variable
  • The Adjusted R-squared adjusts for models that involve multiple independent predictor \(x\) variables. As more predictor variables are added to the model, the Multiple R-squared increases even if those predictors are not significant. Therefore we want to keep our model as simple as possible. so the adjusted R-squared is preferable

Confidence Interval: We can calculate a 95% confidence interval for our true population slope using the quantile of t distribution. \(CI_{95} =qt(1-\frac{\alpha}{2},df)\)

#calculating a  95% confidence of interval (CI) of true population
CI_95 <- qt(0.975,df=nrow(data_final) - 2) # df is the degree of freedom

#(predictor slope + t*predictor Std.Error)
c <- coef(summary(mod_lm))[,"Estimate"]["Speed"]
e <- coef(summary(mod_lm))[,"Std. Error"]["Speed"]

p_max <- c + CI_95 *e

p_min <- c - CI_95 *e

#(predictor slope / Std.Error)
T <- c/e # this is similar to t value from the model

The results shows that we are 95% confident that the true slope, regressing independent variables on dependent variable is between 0.031111 and 0.0800917 Oil_DP kPa

Hypothesis test: Next, let us do the hypothesis test for the slope \((\beta_1)\) to see if the slope is significantly different from 0 and, compare that to the t value. \(T= \frac{\hat\beta_1}{S\hat\beta_1} =\) 0.0556013/0.0124927 = 4.4507

Then we compare the absolute values of that test t statistics to that quantile of the the t statistics we calculated. You notice that \(|T| > t_{(\alpha/2,n-2)} =\) 4.450713 > 1.9604.

Therefore, we reject the null hypothesis that the slope = 0

Hypothesis test on p values by calculating the percentile of t distribution where the p value is evaluated at the critical value calculated from the data

pt<-pt(T, df=nrow(data_final) - 2) # where T value calculated above and df is the degree of freedom

With both \(\alpha= 0.01\) or \(\alpha = 0.05\), it leads us to reject the null hypothesis that \(\beta_1 = 0\)

Confirm residuals normality: We plot a Histogram and Q-Q plot of residual from our model fit. From the QQ-Plot, we she that the model residual does not satisfy normality and linearity

plot(mod_lm$fitted.values, mod_lm$residuals,pch=20)
title("Residual distribution")

par(mfrow =c(1,2))

# par(cex.main =0.8)
hist(mod_lm$residuals)

#Normal Q-Q Plot of residuals
qqnorm(mod_lm$residuals)
qqline(mod_lm$residuals,col=2)

Collinearity: This occurs when two or more predictor variables are closely related to one another or are highly correlated. As we add more predictors to our model, it becomes hard to tell which of your predictors is contributing most on you model

This may lead to a miss classification error hence leading to over fitting a model. This happens because R might confused to which of those variables is going to be used as a predictor.

Multiple Linear Regression

Next, lets fit a multiple regression model with all predictors.

mod_mlr <- lm(Oil_DP ~ ., data = data_final)
pander(mod_mlr)
Fitting linear model: Oil_DP ~ .
  Estimate Std. Error t value Pr(>|t|)
(Intercept) -246.8 40.83 -6.045 1.59e-09
Run_Hours 0.00262 0.0005274 4.967 7.001e-07
CCP 9.002 1.203 7.481 8.488e-14
Density 289.1 27 10.71 1.664e-26
Dielectric 26.03 3.414 7.625 2.834e-14
Speed 0.1231 0.01206 10.2 3.1e-24
Torque 0.008719 0.001876 4.648 3.426e-06
Viscosity 1.875 0.1778 10.55 9.021e-26
Oil_Temp 0.2705 0.0841 3.216 0.001307
Oil_Pressure -0.7825 0.03317 -23.59 1.506e-117

Residuals:

pander(summary(mod_mlr$residuals))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-56.62 -10.28 -1.417 -8.756e-15 9.418 62.31
# display the actual coefficients
eq2 <- extract_eq(mod_mlr,wrap = TRUE, use_coefs = TRUE)

plot(mod_mlr$fitted.values, mod_mlr$residuals,pch=20)
title("Residual distribution")

par(mfrow =c(1,2))

# par(cex.main =0.8)
hist(mod_mlr$residuals)

#Normal Q-Q Plot of residuals
qqnorm(mod_mlr$residuals)
qqline(mod_mlr$residuals,col=2)

This model is slightly better than the first model. It is represented by the following equation \[ \begin{aligned} \operatorname{Oil\_DP} &= \beta_{0} + \beta_{1}(\operatorname{Run\_Hours}) + \beta_{2}(\operatorname{CCP}) + \beta_{3}(\operatorname{Density})\ + \\ &\quad \beta_{4}(\operatorname{Dielectric}) + \beta_{5}(\operatorname{Speed}) + \beta_{6}(\operatorname{Torque}) + \beta_{7}(\operatorname{Viscosity})\ + \\ &\quad \beta_{8}(\operatorname{Oil\_Temp}) + \beta_{9}(\operatorname{Oil\_Pressure}) + \epsilon \end{aligned} \]. This is simplified to \[ \begin{aligned} \operatorname{\widehat{Oil\_DP}} &= -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density})\ + \\ &\quad 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity})\ + \\ &\quad 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \end{aligned} \]

  • Residuals have a normal distribution which meets our assumptions.
  • Looking at the p values, we want to focus on key values close to 0. We only take values less than 0.01. Anything above, means that that predictor is not significant to the model and should be trimmed out of the model
mod_mlr2 <- lm(Oil_DP ~., data = data_final)
drop1(mod_mlr2,test="F")
## Single term deletions
## 
## Model:
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque + 
##     Viscosity + Oil_Temp + Oil_Pressure
##              Df Sum of Sq     RSS   AIC F value    Pr(>F)    
## <none>                    1289494 31161                      
## Run_Hours     1      5538 1295032 31184  24.670 7.001e-07 ***
## CCP           1     12564 1302057 31215  55.965 8.488e-14 ***
## Density       1     25739 1315233 31273 114.654 < 2.2e-16 ***
## Dielectric    1     13053 1302546 31217  58.144 2.834e-14 ***
## Speed         1     23369 1312862 31263 104.095 < 2.2e-16 ***
## Torque        1      4850 1294344 31181  21.604 3.426e-06 ***
## Viscosity     1     24972 1314466 31270 111.237 < 2.2e-16 ***
## Oil_Temp      1      2322 1291815 31170  10.342  0.001307 ** 
## Oil_Pressure  1    124964 1414458 31692 556.649 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_mlr2)
## 
## Call:
## lm(formula = Oil_DP ~ ., data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.623 -10.276  -1.417   9.418  62.306 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.468e+02  4.083e+01  -6.045 1.59e-09 ***
## Run_Hours     2.620e-03  5.274e-04   4.967 7.00e-07 ***
## CCP           9.002e+00  1.203e+00   7.481 8.49e-14 ***
## Density       2.891e+02  2.700e+01  10.708  < 2e-16 ***
## Dielectric    2.603e+01  3.414e+00   7.625 2.83e-14 ***
## Speed         1.231e-01  1.206e-02  10.203  < 2e-16 ***
## Torque        8.719e-03  1.876e-03   4.648 3.43e-06 ***
## Viscosity     1.875e+00  1.778e-01  10.547  < 2e-16 ***
## Oil_Temp      2.705e-01  8.410e-02   3.216  0.00131 ** 
## Oil_Pressure -7.825e-01  3.317e-02 -23.593  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.98 on 5744 degrees of freedom
## Multiple R-squared:  0.1615, Adjusted R-squared:  0.1602 
## F-statistic: 122.9 on 9 and 5744 DF,  p-value: < 2.2e-16
# display the actual coefficients
eq3<-extract_eq(mod_mlr2,use_coefs = TRUE)

plot(mod_mlr2$fitted.values, mod_mlr2$residuals,pch=20)
title("Residual distribution")

par(mfrow =c(1,2))

# par(cex.main =0.8)
hist(mod_mlr2$residuals)

#Normal Q-Q Plot of residuals
qqnorm(mod_mlr2$residuals)
qqline(mod_mlr2$residuals,col=2)

\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]

Checking model condition

  • We check for linearity using residual plot. We need to see if the residuals are well centered around 0 (on y-axis) and spread evenly evenly above and below 0.

  • We can also check for normality by looking at histogram and qqplots above. As you can see the residue is centered and spreads spread evenly.

Dropping most insignificant Variable can be done by dropping the variable with the lowest value of sum of square (sum of Sq), residual sum of square (RSS) or Akaike information criterion (AIC).

  • The F values is used to test whether that parameter is significant at 5% level or 1% level
  • P value can also be used to drop the least significant variable and in this case, we drop the variable with the highest p value

After the least significant variable is dropped, re-run the model with the remaining variables, and repeat the process until all the variables are significant.

Linear Stepwise Regression Model: As seen earlier that there is collinearity in our data, I have to use a principle of parsimony . The simplest, most efficient model, the best. With application of stepwise regression, this will drop insignificant explanatory variables one by one.

# This provides the best model
mod_smlr <- step(lm(Oil_DP ~ ., data = data_final))
## Start:  AIC=31161.28
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque + 
##     Viscosity + Oil_Temp + Oil_Pressure
## 
##                Df Sum of Sq     RSS   AIC
## <none>                      1289494 31161
## - Oil_Temp      1      2322 1291815 31170
## - Torque        1      4850 1294344 31181
## - Run_Hours     1      5538 1295032 31184
## - CCP           1     12564 1302057 31215
## - Dielectric    1     13053 1302546 31217
## - Speed         1     23369 1312862 31263
## - Viscosity     1     24972 1314466 31270
## - Density       1     25739 1315233 31273
## - Oil_Pressure  1    124964 1414458 31692
drop1(mod_smlr,test="F")
## Single term deletions
## 
## Model:
## Oil_DP ~ Run_Hours + CCP + Density + Dielectric + Speed + Torque + 
##     Viscosity + Oil_Temp + Oil_Pressure
##              Df Sum of Sq     RSS   AIC F value    Pr(>F)    
## <none>                    1289494 31161                      
## Run_Hours     1      5538 1295032 31184  24.670 7.001e-07 ***
## CCP           1     12564 1302057 31215  55.965 8.488e-14 ***
## Density       1     25739 1315233 31273 114.654 < 2.2e-16 ***
## Dielectric    1     13053 1302546 31217  58.144 2.834e-14 ***
## Speed         1     23369 1312862 31263 104.095 < 2.2e-16 ***
## Torque        1      4850 1294344 31181  21.604 3.426e-06 ***
## Viscosity     1     24972 1314466 31270 111.237 < 2.2e-16 ***
## Oil_Temp      1      2322 1291815 31170  10.342  0.001307 ** 
## Oil_Pressure  1    124964 1414458 31692 556.649 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_smlr)
## 
## Call:
## lm(formula = Oil_DP ~ Run_Hours + CCP + Density + Dielectric + 
##     Speed + Torque + Viscosity + Oil_Temp + Oil_Pressure, data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.623 -10.276  -1.417   9.418  62.306 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.468e+02  4.083e+01  -6.045 1.59e-09 ***
## Run_Hours     2.620e-03  5.274e-04   4.967 7.00e-07 ***
## CCP           9.002e+00  1.203e+00   7.481 8.49e-14 ***
## Density       2.891e+02  2.700e+01  10.708  < 2e-16 ***
## Dielectric    2.603e+01  3.414e+00   7.625 2.83e-14 ***
## Speed         1.231e-01  1.206e-02  10.203  < 2e-16 ***
## Torque        8.719e-03  1.876e-03   4.648 3.43e-06 ***
## Viscosity     1.875e+00  1.778e-01  10.547  < 2e-16 ***
## Oil_Temp      2.705e-01  8.410e-02   3.216  0.00131 ** 
## Oil_Pressure -7.825e-01  3.317e-02 -23.593  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.98 on 5744 degrees of freedom
## Multiple R-squared:  0.1615, Adjusted R-squared:  0.1602 
## F-statistic: 122.9 on 9 and 5744 DF,  p-value: < 2.2e-16
eq5<-extract_eq(mod_smlr,use_coefs = TRUE)

plot(mod_smlr$fitted.values, mod_smlr$residuals,pch=20)
title("Residual distribution")

par(mfrow =c(1,2))

# par(cex.main =0.8)
hist(mod_smlr$residuals)

#Normal Q-Q Plot of residuals
qqnorm(mod_smlr$residuals)
qqline(mod_smlr$residuals,col=2)

Our final model is represented by the equation below:

\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]

Polynimial Regression Modeling

A polynomial regression: extends linear model by powers of predictors our model. As seen from the multiple linear modeling, our residuals did not meet all our required assumption. This led me to design a polynomial regression model. For this project since we are able to fit our data onto the multiple linear regression, doing polynomial regression model dos not bring significant improvement to the model.

Just to show how this can be achieved

mod_pr <- lm(Oil_DP ~ Run_Hours + Density + Torque + 
    Oil_Pressure +(Density^2) +I(Torque^2) + I(Viscosity^2) + I(Density^3) + I(Viscosity^3) + I(Oil_Temp^3), data = data_final)

drop1(mod_pr,test="F")
## Single term deletions
## 
## Model:
## Oil_DP ~ Run_Hours + Density + Torque + Oil_Pressure + (Density^2) + 
##     I(Torque^2) + I(Viscosity^2) + I(Density^3) + I(Viscosity^3) + 
##     I(Oil_Temp^3)
##                Df Sum of Sq     RSS   AIC  F value    Pr(>F)    
## <none>                      1336870 31369                       
## Run_Hours       1     23276 1360145 31466 100.0066 < 2.2e-16 ***
## Density         1       337 1337207 31368   1.4499  0.228599    
## Torque          1       245 1337114 31368   1.0516  0.305189    
## Oil_Pressure    1     83305 1420174 31715 357.9278 < 2.2e-16 ***
## I(Torque^2)     1       291 1337161 31368   1.2500  0.263594    
## I(Viscosity^2)  1      3894 1340764 31384  16.7324 4.363e-05 ***
## I(Density^3)    1       244 1337114 31368   1.0498  0.305593    
## I(Viscosity^3)  1      2288 1339158 31377   9.8306  0.001725 ** 
## I(Oil_Temp^3)   1      5563 1342433 31391  23.9041 1.040e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod_pr)
## 
## Call:
## lm(formula = Oil_DP ~ Run_Hours + Density + Torque + Oil_Pressure + 
##     (Density^2) + I(Torque^2) + I(Viscosity^2) + I(Density^3) + 
##     I(Viscosity^3) + I(Oil_Temp^3), data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.137 -10.731  -1.511   9.848  52.353 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2.572e+02  1.124e+03  -0.229  0.81906    
## Run_Hours       4.845e-03  4.844e-04  10.000  < 2e-16 ***
## Density         2.052e+03  1.704e+03   1.204  0.22860    
## Torque         -1.336e-01  1.303e-01  -1.025  0.30519    
## Oil_Pressure   -5.487e-01  2.900e-02 -18.919  < 2e-16 ***
## I(Torque^2)     6.924e-06  6.193e-06   1.118  0.26359    
## I(Viscosity^2)  2.003e-01  4.896e-02   4.091 4.36e-05 ***
## I(Density^3)   -9.253e+02  9.031e+02  -1.025  0.30559    
## I(Viscosity^3) -5.980e-03  1.907e-03  -3.135  0.00172 ** 
## I(Oil_Temp^3)   1.632e-05  3.338e-06   4.889 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.26 on 5744 degrees of freedom
## Multiple R-squared:  0.1307, Adjusted R-squared:  0.1293 
## F-statistic: 95.92 on 9 and 5744 DF,  p-value: < 2.2e-16
eq6 <- extract_eq(mod_pr,use_coefs = TRUE)

plot(mod_pr$fitted.values, mod_pr$residuals,pch=20)
title("Residual distribution")

par(mfrow =c(1,2))

# par(cex.main =0.8)
hist(mod_pr$residuals)

#Normal Q-Q Plot of residuals
qqnorm(mod_pr$residuals)
qqline(mod_pr$residuals,col=2)

\[ \operatorname{\widehat{Oil\_DP}} = -257.15 + 0(\operatorname{Run\_Hours}) + 2052.03(\operatorname{Density}) - 0.13(\operatorname{Torque}) - 0.55(\operatorname{Oil\_Pressure}) + 0(\operatorname{I(Torque\char`\^2)}) + 0.2(\operatorname{I(Viscosity\char`\^2)}) - 925.3(\operatorname{I(Density\char`\^3)}) - 0.01(\operatorname{I(Viscosity\char`\^3)}) + 0(\operatorname{I(Oil\_Temp\char`\^3)}) \]

Domain VI: Report Summary

Evaluation

This phase assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. We will compare our results found with the original questions posed by the sponsors in the Business Understanding phase. From the the customer expectations, I needed to identify the parameters needed to predict the oil differential pressure and also, to design an oil differential pressure model.

To answer this question, I started by looking at the all the 15 parameters in the original dataset, I performed data cleaning to get ride of rows with na (missing data), outliers and, redundancies. Then I filtered the data for only or the rated condition and additional filtering performed as explained in data cleaning section above. With the cleaned data, I performed a multiple linear regression model which inspected to make sure that its residuals are normally distributed and meet linearity assumptions.

Useful insights

After careful examination of the dataset, I observed that the dataset is grouped in three clusters:Idle condition, Rated condition and, Transition between idle and rated conditions. Our response variable is capped at 125kPa, therefore the data was filtered for values of \(Oil\_DP < 125\). In addition, this analysis focused rated conditions but the analysis can be repeated for other conditions by just filtering the data for those conditions.

From the analysis;Run_Hours,CCP, Density, Dielectric, Speed, Torque, Oil_Pressure and, oil_temp are the significant parameters that are included in the final model. The model performance is done in previous sections. The final model is represented by the following equation. From the model, we see that all the predictors have a linear relationship with response variable

\[ \operatorname{\widehat{Oil\_DP}} = -246.82 + 0(\operatorname{Run\_Hours}) + 9(\operatorname{CCP}) + 289.12(\operatorname{Density}) + 26.03(\operatorname{Dielectric}) + 0.12(\operatorname{Speed}) + 0.01(\operatorname{Torque}) + 1.88(\operatorname{Viscosity}) + 0.27(\operatorname{Oil\_Temp}) - 0.78(\operatorname{Oil\_Pressure}) \]

Limitations and Recommendations

This model only focuses for on rated conditions. Here are some of the recommendations for the model improvements

  • A generalized model can be built to include other conditions
  • Build a time series predictive model. This will be useful not only in predicting the oil differential pressure but also we be able to tell when to hit a given value of ODP. This will be useful for my client when it comes to scheduling service intervals.
  • Due the capped values of our response valuable, instead of building a linear regression model, a logistic regression model can be built instead. This can be done by setting the values of Oil_DP greater than 125 to be 1, else 0.
  • Model adaptability: The model above was built with Q4 2020 dataset, but our clients have latest data that is coming and it will be very beneficially for our client to have a model that can pick the latest data to make predictions.
  • Also our client filtered the parameters to only those believed to be related to ODP. It is possible there there are other parameters that more contributed to ODP than those provided. My recommendations would would be to provide all parameters to the analyst to perform a PCA to determined the most significant parameters to the ODP

The challenges I found at the beginning was the the use of a vast amount of data. Initial the dataset provided had over 2 million observations and with the limited amount of computing power, some portions of analysis were failing. This was mitigated by reducing the amount of data I was I started to use. However, the analysis was a success and the predictive model built, meets the customer requirements for this project.