Synopsis
Machine Failure analysis corresponds to a set of statistical approaches used to investigate the time it takes for an event of interest to occur.
Time to event, failure analysis, survival analysis, event history analysis, reliability analysis, all deal with time until occurrence of an event of interest. However, failure time may not be observed within the relevant time period, producing so-called censored observations.
The survival package is the cornerstone of the entire R survival analysis ecosystem. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986.
1. Libraries
library(tidymodels)
library(tidyverse)
library(skimr)
library(GGally)
library(ggplot2)
library(ggfortify)
library(survival)
library(survminer)
library(ranger)2. Data
maintenance <- read_csv("maintenance.csv",
show_col_types = FALSE)3. EDA
skim(maintenance)| Name | maintenance |
| Number of rows | 1000 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| team | 0 | 1 | 5 | 5 | 0 | 3 | 0 |
| provider | 0 | 1 | 9 | 9 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| …1 | 0 | 1 | 499.50 | 288.82 | 0.00 | 249.75 | 499.50 | 749.25 | 999.00 | ▇▇▇▇▇ |
| lifetime | 0 | 1 | 55.20 | 26.47 | 1.00 | 34.00 | 60.00 | 80.00 | 93.00 | ▃▃▅▇▇ |
| broken | 0 | 1 | 0.40 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
| pressureInd | 0 | 1 | 98.60 | 19.96 | 33.48 | 85.56 | 97.22 | 112.25 | 173.28 | ▁▅▇▂▁ |
| moistureInd | 0 | 1 | 99.38 | 9.99 | 58.55 | 92.77 | 99.43 | 106.12 | 128.60 | ▁▂▇▇▁ |
| temperatureInd | 0 | 1 | 100.63 | 19.63 | 42.28 | 87.68 | 100.59 | 113.66 | 172.54 | ▁▆▇▂▁ |
ggpairs(maintenance, mapping = aes(color = provider))4. Data transformation
maintenance <- maintenance %>%
# Remove unnecessary column
select(-...1) %>%
# Transform column types to categorical / factor types
mutate(team = as.factor(team)) %>%
mutate(provider = as.factor(provider))
head(maintenance)## # A tibble: 6 x 7
## lifetime broken pressureInd moistureInd temperatureInd team provider
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 56 0 92.2 104. 96.5 TeamA Provider4
## 2 81 1 72.1 103. 87.3 TeamC Provider4
## 3 60 0 96.3 77.8 112. TeamA Provider1
## 4 86 1 94.4 108. 72.0 TeamC Provider2
## 5 34 0 97.8 99.4 104. TeamB Provider1
## 6 30 0 87.7 116. 89.8 TeamA Provider1
5. EDA Summary
No apparent anomalies are observed within the data. The variables in this dataset are:
lifetime - Number of weeks the machine has been active
broken - Specifies if the machine was broken or hasn’t been broken yet for the corresponding weeks in activity
pressureInd - The pressure index is used to quantify the flow of liquid through pipes, as a sudden drop of pressure can indicate a leak
moistureInd - The moisture index is a measure of the relative humidity in the air. It is important to keep track of it as excessive humidity can create mold and damage the equipment
temperature - The temperature index of the machine is computed using voltage devices called thermocouples that translate a change in voltage into temperature measure. It is recorded to avoid damages to electric circuits, fire or even explosion
team - This indicator specifies which team is using the machine
provider - This indicator specifies the name of the machine manufacturer
6. Kaplan Meier Analysis
I am going to use the Surv() function to build the standard survival object. Note that a “+” after the time in the print out of km indicates censoring.
# Kaplan Meier Survival Curve
km <- with(maintenance, Surv(lifetime, broken))
head(km, 400)## [1] 56+ 81 60+ 86 34+ 30+ 68+ 65 23+ 81 38+ 29+ 65 65 82 80 48+ 80
## [19] 92 88 74 65+ 61+ 35+ 26+ 63+ 88 79 53+ 73 60 13+ 34+ 36+ 80 65
## [37] 31+ 74 25+ 58+ 80 65 19+ 60 84+ 13+ 12+ 15+ 26+ 43+ 80 65 1+ 34+
## [55] 80 12+ 92 82 65 20+ 16+ 3+ 18+ 7+ 47+ 23+ 39+ 48+ 57+ 68+ 18+ 1+
## [73] 81 65+ 4+ 24+ 74 74 16+ 88+ 28+ 92 49+ 39+ 18+ 80+ 81 65 79 19+
## [91] 58+ 76+ 92 92 52+ 29+ 65 8+ 53+ 40+ 46+ 5+ 41+ 12+ 65 80 93 63+
## [109] 77+ 65 92 62+ 52+ 85 92 92 88 80 58+ 55+ 33+ 13+ 34+ 85 65 80+
## [127] 41+ 65 65 43+ 80 80 55+ 88 17+ 86 3+ 81+ 93 65 55+ 25+ 93 82
## [145] 13+ 19+ 86 60 81 80 93 43+ 63+ 45+ 93 77+ 9+ 60 62+ 82 29+ 65
## [163] 49+ 72+ 16+ 56+ 80 40+ 57+ 88 93 20+ 12+ 57+ 65 50+ 42+ 49+ 65+ 65
## [181] 44+ 92 9+ 48+ 49+ 65 36+ 60 81+ 24+ 60 65 63+ 93 92 65 28+ 25+
## [199] 74 20+ 93 57+ 46+ 60 56+ 1+ 76+ 31+ 54+ 1+ 64+ 53+ 62+ 12+ 76+ 53+
## [217] 86 29+ 80 65 60+ 65 88+ 47+ 3+ 65 19+ 65 50+ 30+ 58+ 82 26+ 93
## [235] 80 64+ 46+ 80 58+ 25+ 80 27+ 93 8+ 24+ 80 48+ 52+ 65 74 24+ 85
## [253] 93 16+ 13+ 22+ 86 58+ 60 93 88 86 62+ 92 65+ 88 81 22+ 45+ 59+
## [271] 35+ 18+ 38+ 81 92 80 61 4+ 48+ 63+ 31+ 65 65 93 81 65 13+ 35+
## [289] 92 65 85 73 54+ 56+ 61+ 12+ 26+ 60 17+ 65 18+ 35+ 60 49+ 85 30+
## [307] 81 66 83+ 14+ 80 79 79 7+ 80+ 44+ 46+ 85 81 5+ 51+ 49+ 51+ 80
## [325] 4+ 55+ 66+ 9+ 76+ 65+ 71+ 85+ 21+ 45+ 74 78+ 56+ 60 6+ 50+ 28+ 93
## [343] 69+ 28+ 46+ 24+ 61+ 77+ 54+ 22+ 63+ 81 65 81 74 66+ 85+ 30+ 80 80
## [361] 63+ 19+ 54+ 40+ 35+ 29+ 89 80 33+ 81+ 80 88 77+ 8+ 2+ 14+ 80 74
## [379] 38+ 65 79 41+ 13+ 42+ 92 92 65 34+ 65 65 65 80 61+ 45+ 33+ 55+
## [397] 86 2+ 80 61+
To begin the analysis, I will use the formula Surv(lifetime, broken) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time.
The times parameter of the summary() function provides control over which times to print. In this case printing the estimates for 60, 70, 80, and 90 weeks as these are the intervals with observed mechanical failure.
This is the simplest possible model. It only takes two lines of code to fit and produce numerical and graphical summaries.
km_fit <- survfit(Surv(lifetime, broken) ~ 1, data = maintenance)
summary(km_fit, times = c(60, 70, 80, 90))## Call: survfit(formula = Surv(lifetime, broken) ~ 1, data = maintenance)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 60 531 32 0.940 0.0103 0.920 0.960
## 70 349 82 0.772 0.0188 0.736 0.810
## 80 261 116 0.488 0.0242 0.443 0.538
## 90 65 103 0.201 0.0209 0.164 0.246
plot(km_fit, xlab="Weeks", main = 'Kaplan Meyer Plot')autoplot(km_fit)Next, lets look at survival curves by team & provider (our categorical variables).
km_provider_fit <- survfit(Surv(lifetime, broken) ~ provider, data = maintenance)
km_provider_fit## Call: survfit(formula = Surv(lifetime, broken) ~ provider, data = maintenance)
##
## n events median 0.95LCL 0.95UCL
## provider=Provider1 254 116 80 79 NA
## provider=Provider2 266 91 92 92 92
## provider=Provider3 242 114 65 65 65
## provider=Provider4 238 76 88 88 88
According to the KM model the following statements can be made.
Provider1 has a median failure threshold of 80 weeks.
Provider2 has a median failure threshold of 92 weeks.
Provider3 has a median failure threshold of 65 weeks.
Provider4 has a median failure threshold of 88 weeks.
autoplot(km_provider_fit)This plot illustrates that some of the machine types are prone to failure a lot more, compared to others - mainly Provider3 and Provider4.
km_team_fit <- survfit(Surv(lifetime, broken) ~ team, data = maintenance)
autoplot(km_team_fit)7. Cox Proportional Hazards Model
Next, lets fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data. The formula is very similar to a normal linear or logistic model.
cox <- coxph(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)
cox_fit <- survfit(cox)
summary(cox)## Call:
## coxph(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd +
## temperatureInd + team + provider, data = maintenance)
##
## n= 1000, number of events= 397
##
## coef exp(coef) se(coef) z Pr(>|z|)
## pressureInd 1.833e-03 1.002e+00 2.522e-03 0.727 0.46739
## moistureInd -1.576e-02 9.844e-01 5.387e-03 -2.926 0.00344 **
## temperatureInd 2.573e-02 1.026e+00 2.948e-03 8.730 < 2e-16 ***
## teamTeamB 9.270e-02 1.097e+00 1.224e-01 0.757 0.44897
## teamTeamC 4.094e+01 6.027e+17 1.105e+03 0.037 0.97043
## providerProvider2 -8.199e+01 2.471e-36 8.778e+02 -0.093 0.92558
## providerProvider3 6.450e+01 1.024e+28 6.860e+03 0.009 0.99250
## providerProvider4 -6.125e+01 2.509e-27 7.969e+02 -0.077 0.93873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## pressureInd 1.002e+00 9.982e-01 0.9969 1.0068
## moistureInd 9.844e-01 1.016e+00 0.9740 0.9948
## temperatureInd 1.026e+00 9.746e-01 1.0202 1.0320
## teamTeamB 1.097e+00 9.115e-01 0.8631 1.3947
## teamTeamC 6.027e+17 1.659e-18 0.0000 Inf
## providerProvider2 2.471e-36 4.047e+35 0.0000 Inf
## providerProvider3 1.024e+28 9.768e-29 0.0000 Inf
## providerProvider4 2.509e-27 3.986e+26 0.0000 Inf
##
## Concordance= 0.999 (se = 0 )
## Likelihood ratio test= 1713 on 8 df, p=<2e-16
## Wald test = 0.59 on 8 df, p=1
## Score (logrank) test = 1192 on 8 df, p=<2e-16
The CP model identifies that moistureInd and temperatureInd are highly significant predictors. However, some caution needs to be exercised in interpreting these results. It must be noted that the Cox model assumes that the covariates do not vary with time.
To analyze how the effects of the covariates change over time lets fit Aalen’s additive regression model for censored data. The documentation states:
“The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”
aa_fit <-aareg(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider, data = maintenance)
aa_fit## Call:
## aareg(formula = Surv(lifetime, broken) ~ pressureInd + moistureInd +
## temperatureInd + team + provider, data = maintenance)
##
## n= 1000
## 16 out of 16 unique event times used
##
## slope coef se(coef) z p
## Intercept NA NA NA NA NA
## pressureInd 7.12e-06 -5.75e-07 9.77e-06 -0.0589 9.53e-01
## moistureInd -9.95e-05 -2.52e-05 2.06e-05 -1.2300 2.20e-01
## temperatureInd 1.55e-04 5.08e-05 8.86e-06 5.7400 9.67e-09
## teamTeamB 2.77e-04 4.59e-05 4.22e-04 0.1090 9.14e-01
## teamTeamC 2.09e-02 5.27e-03 5.04e-04 10.4000 1.62e-25
## providerProvider2 -1.96e-02 -7.09e-03 5.45e-04 -13.0000 1.31e-38
## providerProvider3 7.96e-02 1.16e-02 1.10e-03 10.6000 3.40e-26
## providerProvider4 NA NA NA NA NA
##
## Chisq=NA on 8 df, p=NA; test weights=aalen
autoplot(aa_fit)8. Random Forests Model
Good approach to do time-to-event modeling is to fit a Random Forests Ensemble. I am going to use ranger() to build a model for each observation in the data set.
The next code chunk builds the model using the same variables used in the Cox model above, and plots random curves, along with a curve that represents the global average for all of the observations.
r_fit <- ranger(Surv(lifetime, broken) ~ pressureInd + moistureInd + temperatureInd + team + provider,
data = maintenance,
mtry = 4,
importance = "permutation",
splitrule = "extratrees",
verbose = TRUE)To be noted that the embedded ranger() function labels the target variable as “death” since it is constructed for general usage within medical context.
# Average the survival models
failure_times <- r_fit$unique.death.times
surv_prob <- data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob,mean)# Plot the models for each observation
plot(r_fit$unique.death.times, r_fit$survival[1,],
type = "l",
ylim = c(0,1),
col = "red",
xlab = "Weeks",
ylab = "Failure",
main = "Machine Failure Curves")
cols <- colors()
for (n in sample(c(2:dim(maintenance)[1]), 20)){
lines(r_fit$unique.death.times, r_fit$survival[n,], type = "l", col = cols[n])
}
lines(failure_times, avg_prob, lwd = 2)
legend(500, 0.7, legend = c('Average = black'))9. The next block of code illustrates how ranger() ranks variable importance
vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)## importance
## provider 0.4288
## team 0.0691
## temperatureInd 0.0061
## moistureInd 0.0004
## pressureInd -0.0004
In contrast with the KM model, RF suggest that the feature with highest importance is provider & team.
10. Assess Performance
ranger() does compute Harrell’s c-index. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)## Prediction Error = 1 - Harrell's c-index = 0.02952666
ROC value of .97 is considered as almost perfect discrimination.
Important remark is that ranger models does not address the time varying coefficients. This apparently is a challenge and many researchers are concerned with this limitation of ensembled forest models.
11. Lets comapre all models by plotting them together
kmi <- rep("KM",length(km_fit$time))
km_df <- data.frame(km_fit$time,km_fit$surv,kmi)
names(km_df) <- c("Time","Failure","Model")
coxi <- rep("Cox",length(cox_fit$time))
cox_df <- data.frame(cox_fit$time,cox_fit$surv,coxi)
names(cox_df) <- c("Time","Failure","Model")
rfi <- rep("RF",length(r_fit$unique.death.times))
rf_df <- data.frame(r_fit$unique.death.times,avg_prob,rfi)
names(rf_df) <- c("Time","Failure","Model")
plot_df <- rbind(km_df,cox_df,rf_df)
p <- ggplot(plot_df, aes(x = Time, y = Failure, color = Model))
p + geom_line()This graph it is not very intuitive and does not provide detailed breakdown of the prediction of each covariate but it highlights model differences.
The 3 modeling approaches are inherently different. The Cox model just provides a straight line to around 80 weeks failure point. It appears that RF and KM are able to pick the relationship of the covariates in the data and thus their predictions follow distinct pattern.
As there is no clear way to evaluate the AUC for the KM and Cox models, it is difficult to speculate by how much their accuracy differs.