Synopsis

Machine Failure analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.

Time to event, failure analysis, survival analysis, event history analysis, reliability analysis, all deal with time until occurrence of an event of interest. However, failure time may not be observed within the relevant time period, producing so-called censored observations.

The survival package is the cornerstone of the entire R survival analysis ecosystem. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986.

1. Libraries

library(tidymodels)
library(tidyverse)

library(skimr)
library(GGally)
library(ggplot2)
library(ggfortify)

library(survival)
library(survminer)

library(ranger)

2. Data

maintenance <- read_csv("maintenance.csv",
                        show_col_types = FALSE)

3. EDA

skim(maintenance)

Data summary
Name	maintenance
Number of rows	1000
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
team	0	1	5	5	0	3	0
provider	0	1	9	9	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
…1	1	499.50	288.82	0.00	249.75	499.50	749.25	999.00	▇▇▇▇▇
lifetime	1	55.20	26.47	1.00	34.00	60.00	80.00	93.00	▃▃▅▇▇
broken	1	0.40	0.49	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▅
pressureInd	1	98.60	19.96	33.48	85.56	97.22	112.25	173.28	▁▅▇▂▁
moistureInd	1	99.38	9.99	58.55	92.77	99.43	106.12	128.60	▁▂▇▇▁
temperatureInd	1	100.63	19.63	42.28	87.68	100.59	113.66	172.54	▁▆▇▂▁

ggpairs(maintenance, mapping = aes(color = provider))

4. Data transformation

maintenance <- maintenance %>% 
  
  # Remove unnecessary column
  select(-...1) %>% 
  
  # Transform column types to categorical / factor types
  mutate(team = as.factor(team)) %>% 
  mutate(provider = as.factor(provider))

head(maintenance)

## # A tibble: 6 x 7
##   lifetime broken pressureInd moistureInd temperatureInd team  provider 
##      <dbl>  <dbl>       <dbl>       <dbl>          <dbl> <fct> <fct>    
## 1       56      0        92.2       104.            96.5 TeamA Provider4
## 2       81      1        72.1       103.            87.3 TeamC Provider4
## 3       60      0        96.3        77.8          112.  TeamA Provider1
## 4       86      1        94.4       108.            72.0 TeamC Provider2
## 5       34      0        97.8        99.4          104.  TeamB Provider1
## 6       30      0        87.7       116.            89.8 TeamA Provider1

5. EDA Summary

No apparent anomalies are observed within the data. The variables in this dataset are:

lifetime - Number of weeks the machine has been active
broken - Specifies if the machine was broken or hasn’t been broken yet for the corresponding weeks in activity
pressureInd - The pressure index is used to quantify the flow of liquid through pipes, as a sudden drop of pressure can indicate a leak
moistureInd - The moisture index is a measure of the relative humidity in the air. It is important to keep track of it as excessive humidity can create mold and damage the equipment
temperature - The temperature index of the machine is computed using voltage devices called thermocouples that translate a change in voltage into temperature measure. It is recorded to avoid damages to electric circuits, fire or even explosion
team - This indicator specifies which team is using the machine
provider - This indicator specifies the name of the machine manufacturer

6. Kaplan Meier Analysis

I am going to use the Surv() function to build the standard survival object. Note that a “+” after the time in the print out of km indicates censoring.

# Kaplan Meier Survival Curve
km <- with(maintenance, Surv(lifetime, broken))

head(km, 400)

##   [1] 56+ 81  60+ 86  34+ 30+ 68+ 65  23+ 81  38+ 29+ 65  65  82  80  48+ 80 
##  [19] 92  88  74  65+ 61+ 35+ 26+ 63+ 88  79  53+ 73  60  13+ 34+ 36+ 80  65 
##  [37] 31+ 74  25+ 58+ 80  65  19+ 60  84+ 13+ 12+ 15+ 26+ 43+ 80  65   1+ 34+
##  [55] 80  12+ 92  82  65  20+ 16+  3+ 18+  7+ 47+ 23+ 39+ 48+ 57+ 68+ 18+  1+
##  [73] 81  65+  4+ 24+ 74  74  16+ 88+ 28+ 92  49+ 39+ 18+ 80+ 81  65  79  19+
##  [91] 58+ 76+ 92  92  52+ 29+ 65   8+ 53+ 40+ 46+  5+ 41+ 12+ 65  80  93  63+
## [109] 77+ 65  92  62+ 52+ 85  92  92  88  80  58+ 55+ 33+ 13+ 34+ 85  65  80+
## [127] 41+ 65  65  43+ 80  80  55+ 88  17+ 86   3+ 81+ 93  65  55+ 25+ 93  82 
## [145] 13+ 19+ 86  60  81  80  93  43+ 63+ 45+ 93  77+  9+ 60  62+ 82  29+ 65 
## [163] 49+ 72+ 16+ 56+ 80  40+ 57+ 88  93  20+ 12+ 57+ 65  50+ 42+ 49+ 65+ 65 
## [181] 44+ 92   9+ 48+ 49+ 65  36+ 60  81+ 24+ 60  65  63+ 93  92  65  28+ 25+
## [199] 74  20+ 93  57+ 46+ 60  56+  1+ 76+ 31+ 54+  1+ 64+ 53+ 62+ 12+ 76+ 53+
## [217] 86  29+ 80  65  60+ 65  88+ 47+  3+ 65  19+ 65  50+ 30+ 58+ 82  26+ 93 
## [235] 80  64+ 46+ 80  58+ 25+ 80  27+ 93   8+ 24+ 80  48+ 52+ 65  74  24+ 85 
## [253] 93  16+ 13+ 22+ 86  58+ 60  93  88  86  62+ 92  65+ 88  81  22+ 45+ 59+
## [271] 35+ 18+ 38+ 81  92  80  61   4+ 48+ 63+ 31+ 65  65  93  81  65  13+ 35+
## [289] 92  65  85  73  54+ 56+ 61+ 12+ 26+ 60  17+ 65  18+ 35+ 60  49+ 85  30+
## [307] 81  66  83+ 14+ 80  79  79   7+ 80+ 44+ 46+ 85  81   5+ 51+ 49+ 51+ 80 
## [325]  4+ 55+ 66+  9+ 76+ 65+ 71+ 85+ 21+ 45+ 74  78+ 56+ 60   6+ 50+ 28+ 93 
## [343] 69+ 28+ 46+ 24+ 61+ 77+ 54+ 22+ 63+ 81  65  81  74  66+ 85+ 30+ 80  80 
## [361] 63+ 19+ 54+ 40+ 35+ 29+ 89  80  33+ 81+ 80  88  77+  8+  2+ 14+ 80  74 
## [379] 38+ 65  79  41+ 13+ 42+ 92  92  65  34+ 65  65  65  80  61+ 45+ 33+ 55+
## [397] 86   2+ 80  61+

To begin the analysis, I will use the formula Surv(lifetime, broken) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time.

The times parameter of the summary() function provides control over which times to print. In this case printing the estimates for 60, 70, 80, and 90 weeks as these are the intervals with observed mechanical failure.

This is the simplest possible model. It only takes two lines of code to fit and produce numerical and graphical summaries.

km_fit <- survfit(Surv(lifetime, broken) ~ 1, data = maintenance)

summary(km_fit, times = c(60, 70, 80, 90))

## Call: survfit(formula = Surv(lifetime, broken) ~ 1, data = maintenance)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    60    531      32    0.940  0.0103        0.920        0.960
##    70    349      82    0.772  0.0188        0.736        0.810
##    80    261     116    0.488  0.0242        0.443        0.538
##    90     65     103    0.201  0.0209        0.164        0.246

plot(km_fit, xlab="Weeks", main = 'Kaplan Meyer Plot')

autoplot(km_fit)

Next, lets look at survival curves by team & provider (our categorical variables).

km_provider_fit <- survfit(Surv(lifetime, broken) ~ provider, data = maintenance)

km_provider_fit

## Call: survfit(formula = Surv(lifetime, broken) ~ provider, data = maintenance)
## 
##                      n events median 0.95LCL 0.95UCL
## provider=Provider1 254    116     80      79      NA
## provider=Provider2 266     91     92      92      92
## provider=Provider3 242    114     65      65      65
## provider=Provider4 238     76     88      88      88

According to the KM model the following statements can be made.

Provider1 has a median failure threshold of 80 weeks.

Provider2 has a median failure threshold of 92 weeks.

Provider3 has a median failure threshold of 65 weeks.

Provider4 has a median failure threshold of 88 weeks.

autoplot(km_provider_fit)

This plot illustrates that some of the machine types are prone to failure a lot more, compared to others - mainly Provider3 and Provider4.

km_team_fit <- survfit(Surv(lifetime, broken) ~ team, data = maintenance)

autoplot(km_team_fit)

7. Cox Proportional Hazards Model

Next, lets fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data. The formula is very similar to a normal linear or logistic model.

cox <- coxph(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)

cox_fit <- survfit(cox)

summary(cox)

## Call:
## coxph(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd + 
##     temperatureInd + team + provider, data = maintenance)
## 
##   n= 1000, number of events= 397 
## 
##                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
## pressureInd        1.833e-03  1.002e+00  2.522e-03  0.727  0.46739    
## moistureInd       -1.576e-02  9.844e-01  5.387e-03 -2.926  0.00344 ** 
## temperatureInd     2.573e-02  1.026e+00  2.948e-03  8.730  < 2e-16 ***
## teamTeamB          9.270e-02  1.097e+00  1.224e-01  0.757  0.44897    
## teamTeamC          4.094e+01  6.027e+17  1.105e+03  0.037  0.97043    
## providerProvider2 -8.199e+01  2.471e-36  8.778e+02 -0.093  0.92558    
## providerProvider3  6.450e+01  1.024e+28  6.860e+03  0.009  0.99250    
## providerProvider4 -6.125e+01  2.509e-27  7.969e+02 -0.077  0.93873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## pressureInd       1.002e+00  9.982e-01    0.9969    1.0068
## moistureInd       9.844e-01  1.016e+00    0.9740    0.9948
## temperatureInd    1.026e+00  9.746e-01    1.0202    1.0320
## teamTeamB         1.097e+00  9.115e-01    0.8631    1.3947
## teamTeamC         6.027e+17  1.659e-18    0.0000       Inf
## providerProvider2 2.471e-36  4.047e+35    0.0000       Inf
## providerProvider3 1.024e+28  9.768e-29    0.0000       Inf
## providerProvider4 2.509e-27  3.986e+26    0.0000       Inf
## 
## Concordance= 0.999  (se = 0 )
## Likelihood ratio test= 1713  on 8 df,   p=<2e-16
## Wald test            = 0.59  on 8 df,   p=1
## Score (logrank) test = 1192  on 8 df,   p=<2e-16

The CP model identifies that moistureInd and temperatureInd are highly significant predictors. However, some caution needs to be exercised in interpreting these results. It must be noted that the Cox model assumes that the covariates do not vary with time.

To analyze how the effects of the covariates change over time lets fit Aalen’s additive regression model for censored data. The documentation states:

“The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”

aa_fit <-aareg(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)

aa_fit

## Call:
## aareg(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd + 
##     temperatureInd + team + provider, data = maintenance)
## 
##   n= 1000 
##     16 out of 16 unique event times used
## 
##                       slope      coef se(coef)        z        p
## Intercept                NA        NA       NA       NA       NA
## pressureInd        7.12e-06 -5.75e-07 9.77e-06  -0.0589 9.53e-01
## moistureInd       -9.95e-05 -2.52e-05 2.06e-05  -1.2300 2.20e-01
## temperatureInd     1.55e-04  5.08e-05 8.86e-06   5.7400 9.67e-09
## teamTeamB          2.77e-04  4.59e-05 4.22e-04   0.1090 9.14e-01
## teamTeamC          2.09e-02  5.27e-03 5.04e-04  10.4000 1.62e-25
## providerProvider2 -1.96e-02 -7.09e-03 5.45e-04 -13.0000 1.31e-38
## providerProvider3  7.96e-02  1.16e-02 1.10e-03  10.6000 3.40e-26
## providerProvider4        NA        NA       NA       NA       NA
## 
## Chisq=NA on 8 df, p=NA; test weights=aalen

autoplot(aa_fit)

8. Random Forests Model

Good approach to do time-to-event modeling is to fit a Random Forests Ensemble. I am going to use ranger() to build a model for each observation in the data set.

The next code chunk builds the model using the same variables used in the Cox model above, and plots random curves, along with a curve that represents the global average for all of the observations.

r_fit <- ranger(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, 
                     data = maintenance,
                     mtry = 4,
                     importance = "permutation",
                     splitrule  = "extratrees",
                     verbose    = TRUE)

To be noted that the embedded ranger() function labels the target variable as “death” since it is constructed for general usage within medical context.

# Average the survival models

failure_times <- r_fit$unique.death.times 
surv_prob     <- data.frame(r_fit$survival)
avg_prob      <- sapply(surv_prob,mean)

# Plot the models for each observation

plot(r_fit$unique.death.times, r_fit$survival[1,], 
     type = "l", 
     ylim = c(0,1),
     col  = "red",
     xlab = "Weeks",
     ylab = "Failure",
     main = "Machine Failure Curves")

cols <- colors()
for (n in sample(c(2:dim(maintenance)[1]), 20)){
  
  lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n])
  
}

lines(failure_times, avg_prob, lwd = 2)
legend(500, 0.7, legend = c('Average = black'))

9. The next block of code illustrates how ranger() ranks variable importance

vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)

##                importance
## provider           0.4288
## team               0.0691
## temperatureInd     0.0061
## moistureInd        0.0004
## pressureInd       -0.0004

In contrast with the KM model, RF suggest that the feature with highest importance is provider & team.

10. Assess Performance

ranger() does compute Harrell’s c-index. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.

cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)

## Prediction Error = 1 - Harrell's c-index =  0.02952666

ROC value of .97 is considered as almost perfect discrimination.

Important remark is that ranger models does not address the time varying coefficients. This apparently is a challenge and many researchers are concerned with this limitation of ensembled forest models.

11. Lets comapre all models by plotting them together

kmi   <- rep("KM",length(km_fit$time))
km_df <- data.frame(km_fit$time,km_fit$surv,kmi)
names(km_df) <- c("Time","Failure","Model")

coxi   <- rep("Cox",length(cox_fit$time))
cox_df <- data.frame(cox_fit$time,cox_fit$surv,coxi)
names(cox_df) <- c("Time","Failure","Model")

rfi   <- rep("RF",length(r_fit$unique.death.times))
rf_df <- data.frame(r_fit$unique.death.times,avg_prob,rfi)
names(rf_df) <- c("Time","Failure","Model")

plot_df <- rbind(km_df,cox_df,rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Failure, color = Model))
p + geom_line()

This graph it is not very intuitive and does not provide detailed breakdown of the prediction of each covariate but it highlights model differences.

The 3 modeling approaches are inherently different. The Cox model just provides a straight line to around 80 weeks failure point. It appears that RF and KM are able to pick the relationship of the covariates in the data and thus their predictions follow distinct pattern.

As there is no clear way to evaluate the AUC for the KM and Cox models, it is difficult to speculate by how much their accuracy differs.

Machine Failure Prediction

Metodi Simeonov

December, 2021