Synopsis

Machine Failure analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.

Time to event, failure analysis, survival analysis, event history analysis, reliability analysis, all deal with time until occurrence of an event of interest. However, failure time may not be observed within the relevant time period, producing so-called censored observations.

The survival package is the cornerstone of the entire R survival analysis ecosystem. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986.

1. Libraries

library(tidymodels)
library(tidyverse)

library(skimr)
library(GGally)
library(ggplot2)
library(ggfortify)

library(survival)
library(survminer)

library(ranger)

2. Data

maintenance <- read_csv("maintenance.csv",
                        show_col_types = FALSE)

3. EDA

skim(maintenance)
Data summary
Name maintenance
Number of rows 1000
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
team 0 1 5 5 0 3 0
provider 0 1 9 9 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
…1 0 1 499.50 288.82 0.00 249.75 499.50 749.25 999.00 ▇▇▇▇▇
lifetime 0 1 55.20 26.47 1.00 34.00 60.00 80.00 93.00 ▃▃▅▇▇
broken 0 1 0.40 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
pressureInd 0 1 98.60 19.96 33.48 85.56 97.22 112.25 173.28 ▁▅▇▂▁
moistureInd 0 1 99.38 9.99 58.55 92.77 99.43 106.12 128.60 ▁▂▇▇▁
temperatureInd 0 1 100.63 19.63 42.28 87.68 100.59 113.66 172.54 ▁▆▇▂▁
ggpairs(maintenance, mapping = aes(color = provider))

4. Data transformation

maintenance <- maintenance %>% 
  
  # Remove unnecessary column
  select(-...1) %>% 
  
  # Transform column types to categorical / factor types
  mutate(team = as.factor(team)) %>% 
  mutate(provider = as.factor(provider))

head(maintenance)
## # A tibble: 6 x 7
##   lifetime broken pressureInd moistureInd temperatureInd team  provider 
##      <dbl>  <dbl>       <dbl>       <dbl>          <dbl> <fct> <fct>    
## 1       56      0        92.2       104.            96.5 TeamA Provider4
## 2       81      1        72.1       103.            87.3 TeamC Provider4
## 3       60      0        96.3        77.8          112.  TeamA Provider1
## 4       86      1        94.4       108.            72.0 TeamC Provider2
## 5       34      0        97.8        99.4          104.  TeamB Provider1
## 6       30      0        87.7       116.            89.8 TeamA Provider1

5. EDA Summary

No apparent anomalies are observed within the data. The variables in this dataset are:

6. Kaplan Meier Analysis

I am going to use the Surv() function to build the standard survival object. Note that a “+” after the time in the print out of km indicates censoring.

# Kaplan Meier Survival Curve
km <- with(maintenance, Surv(lifetime, broken))

head(km, 400)
##   [1] 56+ 81  60+ 86  34+ 30+ 68+ 65  23+ 81  38+ 29+ 65  65  82  80  48+ 80 
##  [19] 92  88  74  65+ 61+ 35+ 26+ 63+ 88  79  53+ 73  60  13+ 34+ 36+ 80  65 
##  [37] 31+ 74  25+ 58+ 80  65  19+ 60  84+ 13+ 12+ 15+ 26+ 43+ 80  65   1+ 34+
##  [55] 80  12+ 92  82  65  20+ 16+  3+ 18+  7+ 47+ 23+ 39+ 48+ 57+ 68+ 18+  1+
##  [73] 81  65+  4+ 24+ 74  74  16+ 88+ 28+ 92  49+ 39+ 18+ 80+ 81  65  79  19+
##  [91] 58+ 76+ 92  92  52+ 29+ 65   8+ 53+ 40+ 46+  5+ 41+ 12+ 65  80  93  63+
## [109] 77+ 65  92  62+ 52+ 85  92  92  88  80  58+ 55+ 33+ 13+ 34+ 85  65  80+
## [127] 41+ 65  65  43+ 80  80  55+ 88  17+ 86   3+ 81+ 93  65  55+ 25+ 93  82 
## [145] 13+ 19+ 86  60  81  80  93  43+ 63+ 45+ 93  77+  9+ 60  62+ 82  29+ 65 
## [163] 49+ 72+ 16+ 56+ 80  40+ 57+ 88  93  20+ 12+ 57+ 65  50+ 42+ 49+ 65+ 65 
## [181] 44+ 92   9+ 48+ 49+ 65  36+ 60  81+ 24+ 60  65  63+ 93  92  65  28+ 25+
## [199] 74  20+ 93  57+ 46+ 60  56+  1+ 76+ 31+ 54+  1+ 64+ 53+ 62+ 12+ 76+ 53+
## [217] 86  29+ 80  65  60+ 65  88+ 47+  3+ 65  19+ 65  50+ 30+ 58+ 82  26+ 93 
## [235] 80  64+ 46+ 80  58+ 25+ 80  27+ 93   8+ 24+ 80  48+ 52+ 65  74  24+ 85 
## [253] 93  16+ 13+ 22+ 86  58+ 60  93  88  86  62+ 92  65+ 88  81  22+ 45+ 59+
## [271] 35+ 18+ 38+ 81  92  80  61   4+ 48+ 63+ 31+ 65  65  93  81  65  13+ 35+
## [289] 92  65  85  73  54+ 56+ 61+ 12+ 26+ 60  17+ 65  18+ 35+ 60  49+ 85  30+
## [307] 81  66  83+ 14+ 80  79  79   7+ 80+ 44+ 46+ 85  81   5+ 51+ 49+ 51+ 80 
## [325]  4+ 55+ 66+  9+ 76+ 65+ 71+ 85+ 21+ 45+ 74  78+ 56+ 60   6+ 50+ 28+ 93 
## [343] 69+ 28+ 46+ 24+ 61+ 77+ 54+ 22+ 63+ 81  65  81  74  66+ 85+ 30+ 80  80 
## [361] 63+ 19+ 54+ 40+ 35+ 29+ 89  80  33+ 81+ 80  88  77+  8+  2+ 14+ 80  74 
## [379] 38+ 65  79  41+ 13+ 42+ 92  92  65  34+ 65  65  65  80  61+ 45+ 33+ 55+
## [397] 86   2+ 80  61+

To begin the analysis, I will use the formula Surv(lifetime, broken) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time.

The times parameter of the summary() function provides control over which times to print. In this case printing the estimates for 60, 70, 80, and 90 weeks as these are the intervals with observed mechanical failure.

This is the simplest possible model. It only takes two lines of code to fit and produce numerical and graphical summaries.

km_fit <- survfit(Surv(lifetime, broken) ~ 1, data = maintenance)

summary(km_fit, times = c(60, 70, 80, 90))
## Call: survfit(formula = Surv(lifetime, broken) ~ 1, data = maintenance)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    60    531      32    0.940  0.0103        0.920        0.960
##    70    349      82    0.772  0.0188        0.736        0.810
##    80    261     116    0.488  0.0242        0.443        0.538
##    90     65     103    0.201  0.0209        0.164        0.246
plot(km_fit, xlab="Weeks", main = 'Kaplan Meyer Plot')

autoplot(km_fit)

Next, lets look at survival curves by team & provider (our categorical variables).

km_provider_fit <- survfit(Surv(lifetime, broken) ~ provider, data = maintenance)

km_provider_fit
## Call: survfit(formula = Surv(lifetime, broken) ~ provider, data = maintenance)
## 
##                      n events median 0.95LCL 0.95UCL
## provider=Provider1 254    116     80      79      NA
## provider=Provider2 266     91     92      92      92
## provider=Provider3 242    114     65      65      65
## provider=Provider4 238     76     88      88      88

According to the KM model the following statements can be made.

Provider1 has a median failure threshold of 80 weeks.

Provider2 has a median failure threshold of 92 weeks.

Provider3 has a median failure threshold of 65 weeks.

Provider4 has a median failure threshold of 88 weeks.

autoplot(km_provider_fit)

This plot illustrates that some of the machine types are prone to failure a lot more, compared to others - mainly Provider3 and Provider4.

km_team_fit <- survfit(Surv(lifetime, broken) ~ team, data = maintenance)

autoplot(km_team_fit)

7. Cox Proportional Hazards Model

Next, lets fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data. The formula is very similar to a normal linear or logistic model.

cox <- coxph(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)

cox_fit <- survfit(cox)

summary(cox)
## Call:
## coxph(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd + 
##     temperatureInd + team + provider, data = maintenance)
## 
##   n= 1000, number of events= 397 
## 
##                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
## pressureInd        1.833e-03  1.002e+00  2.522e-03  0.727  0.46739    
## moistureInd       -1.576e-02  9.844e-01  5.387e-03 -2.926  0.00344 ** 
## temperatureInd     2.573e-02  1.026e+00  2.948e-03  8.730  < 2e-16 ***
## teamTeamB          9.270e-02  1.097e+00  1.224e-01  0.757  0.44897    
## teamTeamC          4.094e+01  6.027e+17  1.105e+03  0.037  0.97043    
## providerProvider2 -8.199e+01  2.471e-36  8.778e+02 -0.093  0.92558    
## providerProvider3  6.450e+01  1.024e+28  6.860e+03  0.009  0.99250    
## providerProvider4 -6.125e+01  2.509e-27  7.969e+02 -0.077  0.93873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## pressureInd       1.002e+00  9.982e-01    0.9969    1.0068
## moistureInd       9.844e-01  1.016e+00    0.9740    0.9948
## temperatureInd    1.026e+00  9.746e-01    1.0202    1.0320
## teamTeamB         1.097e+00  9.115e-01    0.8631    1.3947
## teamTeamC         6.027e+17  1.659e-18    0.0000       Inf
## providerProvider2 2.471e-36  4.047e+35    0.0000       Inf
## providerProvider3 1.024e+28  9.768e-29    0.0000       Inf
## providerProvider4 2.509e-27  3.986e+26    0.0000       Inf
## 
## Concordance= 0.999  (se = 0 )
## Likelihood ratio test= 1713  on 8 df,   p=<2e-16
## Wald test            = 0.59  on 8 df,   p=1
## Score (logrank) test = 1192  on 8 df,   p=<2e-16

The CP model identifies that moistureInd and temperatureInd are highly significant predictors. However, some caution needs to be exercised in interpreting these results. It must be noted that the Cox model assumes that the covariates do not vary with time.

To analyze how the effects of the covariates change over time lets fit Aalen’s additive regression model for censored data. The documentation states:

“The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”

aa_fit <-aareg(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)

aa_fit
## Call:
## aareg(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd + 
##     temperatureInd + team + provider, data = maintenance)
## 
##   n= 1000 
##     16 out of 16 unique event times used
## 
##                       slope      coef se(coef)        z        p
## Intercept                NA        NA       NA       NA       NA
## pressureInd        7.12e-06 -5.75e-07 9.77e-06  -0.0589 9.53e-01
## moistureInd       -9.95e-05 -2.52e-05 2.06e-05  -1.2300 2.20e-01
## temperatureInd     1.55e-04  5.08e-05 8.86e-06   5.7400 9.67e-09
## teamTeamB          2.77e-04  4.59e-05 4.22e-04   0.1090 9.14e-01
## teamTeamC          2.09e-02  5.27e-03 5.04e-04  10.4000 1.62e-25
## providerProvider2 -1.96e-02 -7.09e-03 5.45e-04 -13.0000 1.31e-38
## providerProvider3  7.96e-02  1.16e-02 1.10e-03  10.6000 3.40e-26
## providerProvider4        NA        NA       NA       NA       NA
## 
## Chisq=NA on 8 df, p=NA; test weights=aalen
autoplot(aa_fit)

8. Random Forests Model

Good approach to do time-to-event modeling is to fit a Random Forests Ensemble. I am going to use ranger() to build a model for each observation in the data set.

The next code chunk builds the model using the same variables used in the Cox model above, and plots random curves, along with a curve that represents the global average for all of the observations.

r_fit <- ranger(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, 
                     data = maintenance,
                     mtry = 4,
                     importance = "permutation",
                     splitrule  = "extratrees",
                     verbose    = TRUE)

To be noted that the embedded ranger() function labels the target variable as “death” since it is constructed for general usage within medical context.

# Average the survival models

failure_times <- r_fit$unique.death.times 
surv_prob     <- data.frame(r_fit$survival)
avg_prob      <- sapply(surv_prob,mean)
# Plot the models for each observation

plot(r_fit$unique.death.times, r_fit$survival[1,], 
     type = "l", 
     ylim = c(0,1),
     col  = "red",
     xlab = "Weeks",
     ylab = "Failure",
     main = "Machine Failure Curves")

cols <- colors()
for (n in sample(c(2:dim(maintenance)[1]), 20)){
  
  lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n])
  
}

lines(failure_times, avg_prob, lwd = 2)
legend(500, 0.7, legend = c('Average = black'))

9. The next block of code illustrates how ranger() ranks variable importance

vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
##                importance
## provider           0.4288
## team               0.0691
## temperatureInd     0.0061
## moistureInd        0.0004
## pressureInd       -0.0004

In contrast with the KM model, RF suggest that the feature with highest importance is provider & team.

10. Assess Performance

ranger() does compute Harrell’s c-index. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.

cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)
## Prediction Error = 1 - Harrell's c-index =  0.02952666

ROC value of .97 is considered as almost perfect discrimination.

Important remark is that ranger models does not address the time varying coefficients. This apparently is a challenge and many researchers are concerned with this limitation of ensembled forest models.

11. Lets comapre all models by plotting them together

kmi   <- rep("KM",length(km_fit$time))
km_df <- data.frame(km_fit$time,km_fit$surv,kmi)
names(km_df) <- c("Time","Failure","Model")

coxi   <- rep("Cox",length(cox_fit$time))
cox_df <- data.frame(cox_fit$time,cox_fit$surv,coxi)
names(cox_df) <- c("Time","Failure","Model")

rfi   <- rep("RF",length(r_fit$unique.death.times))
rf_df <- data.frame(r_fit$unique.death.times,avg_prob,rfi)
names(rf_df) <- c("Time","Failure","Model")

plot_df <- rbind(km_df,cox_df,rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Failure, color = Model))
p + geom_line()

This graph it is not very intuitive and does not provide detailed breakdown of the prediction of each covariate but it highlights model differences.

The 3 modeling approaches are inherently different. The Cox model just provides a straight line to around 80 weeks failure point. It appears that RF and KM are able to pick the relationship of the covariates in the data and thus their predictions follow distinct pattern.

As there is no clear way to evaluate the AUC for the KM and Cox models, it is difficult to speculate by how much their accuracy differs.