Data wrangling and Exploratory Data Analysis
The column names are not present in the data set. However, they can be assigned.
data_columns <- c('id','cycle','setting1','setting2','setting3','s1','s2',
's3','s4','s5','s6','s7','s8','s9','s10','s11','s12',
's13','s14','s15','s16','s17','s18','s19','s20','s21')
colnames(aircraft_train) <- data_columns
head(aircraft_train, n=3)## # A tibble: 3 x 27
## id cycle setting1 setting2 setting3 s1 s2 s3 s4 s5 s6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -0.0007 -4.00e-4 100 519. 642. 1590. 1401. 14.6 21.6
## 2 1 2 0.0019 -3.00e-4 100 519. 642. 1592. 1403. 14.6 21.6
## 3 1 3 -0.0043 3.00e-4 100 519. 642. 1588. 1404. 14.6 21.6
## # ... with 16 more variables: s7 <dbl>, s8 <dbl>, s9 <dbl>, s10 <dbl>,
## # s11 <dbl>, s12 <dbl>, s13 <dbl>, s14 <dbl>, s15 <dbl>, s16 <dbl>,
## # s17 <dbl>, s18 <dbl>, s19 <dbl>, s20 <dbl>, s21 <dbl>, NA <lgl>
The data frame has 27 columns, however the data set has only 26 variables. Let’s put an arbitrary name to the last column in order to explore its content later.
## # A tibble: 3 x 27
## id cycle setting1 setting2 setting3 s1 s2 s3 s4 s5 s6
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -0.0007 -4.00e-4 100 519. 642. 1590. 1401. 14.6 21.6
## 2 1 2 0.0019 -3.00e-4 100 519. 642. 1592. 1403. 14.6 21.6
## 3 1 3 -0.0043 3.00e-4 100 519. 642. 1588. 1404. 14.6 21.6
## # ... with 16 more variables: s7 <dbl>, s8 <dbl>, s9 <dbl>, s10 <dbl>,
## # s11 <dbl>, s12 <dbl>, s13 <dbl>, s14 <dbl>, s15 <dbl>, s16 <dbl>,
## # s17 <dbl>, s18 <dbl>, s19 <dbl>, s20 <dbl>, s21 <dbl>, X27 <lgl>
The summary() function shows that the last column is a pitfall, as it is full of NA values.
## id cycle setting1 setting2
## Min. : 1.00 Min. : 1.0 Min. :-8.70e-03 Min. :-6.000e-04
## 1st Qu.: 26.00 1st Qu.: 52.0 1st Qu.:-1.50e-03 1st Qu.:-2.000e-04
## Median : 52.00 Median :104.0 Median : 0.00e+00 Median : 0.000e+00
## Mean : 51.51 Mean :108.8 Mean :-8.87e-06 Mean : 2.351e-06
## 3rd Qu.: 77.00 3rd Qu.:156.0 3rd Qu.: 1.50e-03 3rd Qu.: 3.000e-04
## Max. :100.00 Max. :362.0 Max. : 8.70e-03 Max. : 6.000e-04
## setting3 s1 s2 s3 s4
## Min. :100 Min. :518.7 Min. :641.2 Min. :1571 Min. :1382
## 1st Qu.:100 1st Qu.:518.7 1st Qu.:642.3 1st Qu.:1586 1st Qu.:1402
## Median :100 Median :518.7 Median :642.6 Median :1590 Median :1408
## Mean :100 Mean :518.7 Mean :642.7 Mean :1591 Mean :1409
## 3rd Qu.:100 3rd Qu.:518.7 3rd Qu.:643.0 3rd Qu.:1594 3rd Qu.:1415
## Max. :100 Max. :518.7 Max. :644.5 Max. :1617 Max. :1441
## s5 s6 s7 s8 s9
## Min. :14.62 Min. :21.60 Min. :549.9 Min. :2388 Min. :9022
## 1st Qu.:14.62 1st Qu.:21.61 1st Qu.:552.8 1st Qu.:2388 1st Qu.:9053
## Median :14.62 Median :21.61 Median :553.4 Median :2388 Median :9061
## Mean :14.62 Mean :21.61 Mean :553.4 Mean :2388 Mean :9065
## 3rd Qu.:14.62 3rd Qu.:21.61 3rd Qu.:554.0 3rd Qu.:2388 3rd Qu.:9069
## Max. :14.62 Max. :21.61 Max. :556.1 Max. :2389 Max. :9245
## s10 s11 s12 s13 s14
## Min. :1.3 Min. :46.85 Min. :518.7 Min. :2388 Min. :8100
## 1st Qu.:1.3 1st Qu.:47.35 1st Qu.:521.0 1st Qu.:2388 1st Qu.:8133
## Median :1.3 Median :47.51 Median :521.5 Median :2388 Median :8141
## Mean :1.3 Mean :47.54 Mean :521.4 Mean :2388 Mean :8144
## 3rd Qu.:1.3 3rd Qu.:47.70 3rd Qu.:522.0 3rd Qu.:2388 3rd Qu.:8148
## Max. :1.3 Max. :48.53 Max. :523.4 Max. :2389 Max. :8294
## s15 s16 s17 s18 s19
## Min. :8.325 Min. :0.03 Min. :388.0 Min. :2388 Min. :100
## 1st Qu.:8.415 1st Qu.:0.03 1st Qu.:392.0 1st Qu.:2388 1st Qu.:100
## Median :8.439 Median :0.03 Median :393.0 Median :2388 Median :100
## Mean :8.442 Mean :0.03 Mean :393.2 Mean :2388 Mean :100
## 3rd Qu.:8.466 3rd Qu.:0.03 3rd Qu.:394.0 3rd Qu.:2388 3rd Qu.:100
## Max. :8.585 Max. :0.03 Max. :400.0 Max. :2388 Max. :100
## s20 s21 X27
## Min. :38.14 Min. :22.89 Mode:logical
## 1st Qu.:38.70 1st Qu.:23.22 NA's:20631
## Median :38.83 Median :23.30
## Mean :38.82 Mean :23.29
## 3rd Qu.:38.95 3rd Qu.:23.37
## Max. :39.43 Max. :23.62
Just drop this column.
Similarly, with test set.
Now, let’s implement a helper function which allows us to create the response. The variable ttf is the time to fail (the cycles remaining between the current cycle and the cycle the engine fails). Moreover, label=0 if the engine fails before 30 cycles, and label=1 if the engine fails after 30 cycles.
data_preparation <- function(data, period){
cyc_data <- data %>% group_by(id) %>% summarise(last_cyc=max(cycle))
new_data <- inner_join(data, cyc_data, by='id')
new_data$ttf <- new_data$last_cyc - new_data$cycle
new_data$label <- ifelse(new_data$ttf > period, 1, 0)
return(new_data)
}The argument period=30 because the main goal.
aircraft_train <- data_preparation(aircraft_train, 30)
aircraft_test <- data_preparation(aircraft_test, 30)Let’s explore the summary statistics.
library(RcmdrMisc)
numSummary(aircraft_train[,c("cycle", "label", "last_cyc", "s1", "s2", "s3",
"s4", "s5", "s6", "s7", "s8", "s9", "s10", "s11", "s12", "s13", "s14",
"s15", "s16", "s17", "s18", "s19", "s20", "s21", "setting1", "setting2",
"setting3", "ttf"), drop=FALSE], statistics=c("mean", "sd",
"quantiles"), quantiles=c(0,.5,1))## mean sd 0% 50% 100% n
## cycle 1.088079e+02 6.888099e+01 1.0000 104.0000 362.0000 20631
## label 8.497407e-01 3.573341e-01 0.0000 1.0000 1.0000 20631
## last_cyc 2.166157e+02 5.002860e+01 128.0000 207.0000 362.0000 20631
## s1 5.186700e+02 0.000000e+00 518.6700 518.6700 518.6700 20631
## s2 6.426809e+02 5.000533e-01 641.2100 642.6400 644.5300 20631
## s3 1.590523e+03 6.131150e+00 1571.0400 1590.1000 1616.9100 20631
## s4 1.408934e+03 9.000605e+00 1382.2500 1408.0400 1441.4900 20631
## s5 1.462000e+01 0.000000e+00 14.6200 14.6200 14.6200 20631
## s6 2.160980e+01 1.388985e-03 21.6000 21.6100 21.6100 20631
## s7 5.533677e+02 8.850923e-01 549.8500 553.4400 556.0600 20631
## s8 2.388097e+03 7.098548e-02 2387.9000 2388.0900 2388.5600 20631
## s9 9.065243e+03 2.208288e+01 9021.7300 9060.6600 9244.5900 20631
## s10 1.300000e+00 0.000000e+00 1.3000 1.3000 1.3000 20631
## s11 4.754117e+01 2.670874e-01 46.8500 47.5100 48.5300 20631
## s12 5.214135e+02 7.375534e-01 518.6900 521.4800 523.3800 20631
## s13 2.388096e+03 7.191892e-02 2387.8800 2388.0900 2388.5600 20631
## s14 8.143753e+03 1.907618e+01 8099.9400 8140.5400 8293.7200 20631
## s15 8.442146e+00 3.750504e-02 8.3249 8.4389 8.5848 20631
## s16 3.000000e-02 0.000000e+00 0.0300 0.0300 0.0300 20631
## s17 3.932107e+02 1.548763e+00 388.0000 393.0000 400.0000 20631
## s18 2.388000e+03 0.000000e+00 2388.0000 2388.0000 2388.0000 20631
## s19 1.000000e+02 0.000000e+00 100.0000 100.0000 100.0000 20631
## s20 3.881627e+01 1.807464e-01 38.1400 38.8300 39.4300 20631
## s21 2.328971e+01 1.082509e-01 22.8942 23.2979 23.6184 20631
## setting1 -8.870147e-06 2.187313e-03 -0.0087 0.0000 0.0087 20631
## setting2 2.350831e-06 2.930621e-04 -0.0006 0.0000 0.0006 20631
## setting3 1.000000e+02 0.000000e+00 100.0000 100.0000 100.0000 20631
## ttf 1.078079e+02 6.888099e+01 0.0000 103.0000 361.0000 20631
s1, s5, s6, s10, s16, s18, s19 and setting3 are features whit null standard deviation. So, let’s drop them from the analysis, since they do not gives us useful information.
aircraft_train <- aircraft_train %>%
select(c(cycle:setting2,s2:s4,s7:s9,s11:s15,s17,s20:s21,ttf,label))
aircraft_test <- aircraft_test %>%
select(c(cycle:setting2,s2:s4,s7:s9,s11:s15,s17,s20:s21,ttf,label))Let’s explore the relationship between all the variables.
library(RColorBrewer)
colors <- brewer.pal(11, "RdBu")
corrs <- aircraft_train %>% select(cycle:label) %>%
cor() %>% as.data.frame()
corrs$variable2 <- names(corrs)
corrs <- corrs %>% gather(key='variable1', value='corr', cycle:ttf) %>%
transform(corr=round(corr, 2))
ggplot(data=corrs, mapping=aes(x = variable1, variable2)) +
geom_tile(aes(fill=corr)) + labs(fill='Correlation') +
scale_fill_gradient2(low=colors[2], mid=colors[6], high=colors[10]) +
geom_text(corrs, mapping=aes(label=corr), size=2.5) +
theme(axis.text.x=element_text(angle=90,hjust=1))There is a high correlation (\(r > 0.8\)) between the following pairs: (s14, s9), (s11, s4), (s11, s7), (s11, s12), (s4, s12), (s8,s13), (s7, s12). This may affect the performance of the Machine Learning we are going to learn. According to this findings, let’s drop features from s11 to s14.
aircraft_train <- aircraft_train %>%
select(c(setting2,s2:s4,s7,ttf,label))
aircraft_test <- aircraft_test %>%
select(c(setting2,s2:s4,s7,ttf,label))And let’s check the trends with a scatter plot matrix.
library(GGally)
aircraft_train %>% select(c(s2:s4,s7,ttf,label)) %>%
ggpairs(column=1:5,
mapping=aes(color=factor(label))) +
ggtitle("Aircraft data") +
theme(axis.text.x=element_text(angle=90,hjust=1))Acknowledgment
This exercise is possible thanks to GitHub Repository of Samimust