Aircraft Engine data

Data wrangling and Exploratory Data Analysis

The column names are not present in the data set. However, they can be assigned.

data_columns <- c('id','cycle','setting1','setting2','setting3','s1','s2',
                  's3','s4','s5','s6','s7','s8','s9','s10','s11','s12',
                  's13','s14','s15','s16','s17','s18','s19','s20','s21')
colnames(aircraft_train) <- data_columns
head(aircraft_train, n=3)

## # A tibble: 3 x 27
##      id cycle setting1 setting2 setting3    s1    s2    s3    s4    s5    s6
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     1  -0.0007 -4.00e-4      100  519.  642. 1590. 1401.  14.6  21.6
## 2     1     2   0.0019 -3.00e-4      100  519.  642. 1592. 1403.  14.6  21.6
## 3     1     3  -0.0043  3.00e-4      100  519.  642. 1588. 1404.  14.6  21.6
## # ... with 16 more variables: s7 <dbl>, s8 <dbl>, s9 <dbl>, s10 <dbl>,
## #   s11 <dbl>, s12 <dbl>, s13 <dbl>, s14 <dbl>, s15 <dbl>, s16 <dbl>,
## #   s17 <dbl>, s18 <dbl>, s19 <dbl>, s20 <dbl>, s21 <dbl>, NA <lgl>

The data frame has 27 columns, however the data set has only 26 variables. Let’s put an arbitrary name to the last column in order to explore its content later.

colnames(aircraft_train)[27] <- 'X27'
head(aircraft_train, n=3)

## # A tibble: 3 x 27
##      id cycle setting1 setting2 setting3    s1    s2    s3    s4    s5    s6
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     1  -0.0007 -4.00e-4      100  519.  642. 1590. 1401.  14.6  21.6
## 2     1     2   0.0019 -3.00e-4      100  519.  642. 1592. 1403.  14.6  21.6
## 3     1     3  -0.0043  3.00e-4      100  519.  642. 1588. 1404.  14.6  21.6
## # ... with 16 more variables: s7 <dbl>, s8 <dbl>, s9 <dbl>, s10 <dbl>,
## #   s11 <dbl>, s12 <dbl>, s13 <dbl>, s14 <dbl>, s15 <dbl>, s16 <dbl>,
## #   s17 <dbl>, s18 <dbl>, s19 <dbl>, s20 <dbl>, s21 <dbl>, X27 <lgl>

The summary() function shows that the last column is a pitfall, as it is full of NA values.

NA_count <- aircraft_train %>% summary()
NA_count

##        id             cycle          setting1            setting2         
##  Min.   :  1.00   Min.   :  1.0   Min.   :-8.70e-03   Min.   :-6.000e-04  
##  1st Qu.: 26.00   1st Qu.: 52.0   1st Qu.:-1.50e-03   1st Qu.:-2.000e-04  
##  Median : 52.00   Median :104.0   Median : 0.00e+00   Median : 0.000e+00  
##  Mean   : 51.51   Mean   :108.8   Mean   :-8.87e-06   Mean   : 2.351e-06  
##  3rd Qu.: 77.00   3rd Qu.:156.0   3rd Qu.: 1.50e-03   3rd Qu.: 3.000e-04  
##  Max.   :100.00   Max.   :362.0   Max.   : 8.70e-03   Max.   : 6.000e-04  
##     setting3         s1              s2              s3             s4      
##  Min.   :100   Min.   :518.7   Min.   :641.2   Min.   :1571   Min.   :1382  
##  1st Qu.:100   1st Qu.:518.7   1st Qu.:642.3   1st Qu.:1586   1st Qu.:1402  
##  Median :100   Median :518.7   Median :642.6   Median :1590   Median :1408  
##  Mean   :100   Mean   :518.7   Mean   :642.7   Mean   :1591   Mean   :1409  
##  3rd Qu.:100   3rd Qu.:518.7   3rd Qu.:643.0   3rd Qu.:1594   3rd Qu.:1415  
##  Max.   :100   Max.   :518.7   Max.   :644.5   Max.   :1617   Max.   :1441  
##        s5              s6              s7              s8             s9      
##  Min.   :14.62   Min.   :21.60   Min.   :549.9   Min.   :2388   Min.   :9022  
##  1st Qu.:14.62   1st Qu.:21.61   1st Qu.:552.8   1st Qu.:2388   1st Qu.:9053  
##  Median :14.62   Median :21.61   Median :553.4   Median :2388   Median :9061  
##  Mean   :14.62   Mean   :21.61   Mean   :553.4   Mean   :2388   Mean   :9065  
##  3rd Qu.:14.62   3rd Qu.:21.61   3rd Qu.:554.0   3rd Qu.:2388   3rd Qu.:9069  
##  Max.   :14.62   Max.   :21.61   Max.   :556.1   Max.   :2389   Max.   :9245  
##       s10           s11             s12             s13            s14      
##  Min.   :1.3   Min.   :46.85   Min.   :518.7   Min.   :2388   Min.   :8100  
##  1st Qu.:1.3   1st Qu.:47.35   1st Qu.:521.0   1st Qu.:2388   1st Qu.:8133  
##  Median :1.3   Median :47.51   Median :521.5   Median :2388   Median :8141  
##  Mean   :1.3   Mean   :47.54   Mean   :521.4   Mean   :2388   Mean   :8144  
##  3rd Qu.:1.3   3rd Qu.:47.70   3rd Qu.:522.0   3rd Qu.:2388   3rd Qu.:8148  
##  Max.   :1.3   Max.   :48.53   Max.   :523.4   Max.   :2389   Max.   :8294  
##       s15             s16            s17             s18            s19     
##  Min.   :8.325   Min.   :0.03   Min.   :388.0   Min.   :2388   Min.   :100  
##  1st Qu.:8.415   1st Qu.:0.03   1st Qu.:392.0   1st Qu.:2388   1st Qu.:100  
##  Median :8.439   Median :0.03   Median :393.0   Median :2388   Median :100  
##  Mean   :8.442   Mean   :0.03   Mean   :393.2   Mean   :2388   Mean   :100  
##  3rd Qu.:8.466   3rd Qu.:0.03   3rd Qu.:394.0   3rd Qu.:2388   3rd Qu.:100  
##  Max.   :8.585   Max.   :0.03   Max.   :400.0   Max.   :2388   Max.   :100  
##       s20             s21          X27         
##  Min.   :38.14   Min.   :22.89   Mode:logical  
##  1st Qu.:38.70   1st Qu.:23.22   NA's:20631    
##  Median :38.83   Median :23.30                 
##  Mean   :38.82   Mean   :23.29                 
##  3rd Qu.:38.95   3rd Qu.:23.37                 
##  Max.   :39.43   Max.   :23.62

Just drop this column.

aircraft_train <- aircraft_train %>% select(id:s21)

Similarly, with test set.

colnames(aircraft_test) <- data_columns
aircraft_test <- aircraft_test %>% select(id:s21)

Now, let’s implement a helper function which allows us to create the response. The variable ttf is the time to fail (the cycles remaining between the current cycle and the cycle the engine fails). Moreover, label=0 if the engine fails before 30 cycles, and label=1 if the engine fails after 30 cycles.

data_preparation <- function(data, period){
  cyc_data <- data %>% group_by(id) %>% summarise(last_cyc=max(cycle))
  new_data <- inner_join(data, cyc_data, by='id')
  new_data$ttf <- new_data$last_cyc - new_data$cycle
  new_data$label <- ifelse(new_data$ttf > period, 1, 0)
  return(new_data)
}

The argument period=30 because the main goal.

aircraft_train <- data_preparation(aircraft_train, 30)
aircraft_test <- data_preparation(aircraft_test, 30)

Let’s explore the summary statistics.

library(RcmdrMisc)
numSummary(aircraft_train[,c("cycle", "label", "last_cyc", "s1", "s2", "s3",
   "s4", "s5", "s6", "s7", "s8", "s9", "s10", "s11", "s12", "s13", "s14", 
  "s15", "s16", "s17", "s18", "s19", "s20", "s21", "setting1", "setting2", 
  "setting3", "ttf"), drop=FALSE], statistics=c("mean", "sd", 
  "quantiles"), quantiles=c(0,.5,1))

##                   mean           sd        0%       50%      100%     n
## cycle     1.088079e+02 6.888099e+01    1.0000  104.0000  362.0000 20631
## label     8.497407e-01 3.573341e-01    0.0000    1.0000    1.0000 20631
## last_cyc  2.166157e+02 5.002860e+01  128.0000  207.0000  362.0000 20631
## s1        5.186700e+02 0.000000e+00  518.6700  518.6700  518.6700 20631
## s2        6.426809e+02 5.000533e-01  641.2100  642.6400  644.5300 20631
## s3        1.590523e+03 6.131150e+00 1571.0400 1590.1000 1616.9100 20631
## s4        1.408934e+03 9.000605e+00 1382.2500 1408.0400 1441.4900 20631
## s5        1.462000e+01 0.000000e+00   14.6200   14.6200   14.6200 20631
## s6        2.160980e+01 1.388985e-03   21.6000   21.6100   21.6100 20631
## s7        5.533677e+02 8.850923e-01  549.8500  553.4400  556.0600 20631
## s8        2.388097e+03 7.098548e-02 2387.9000 2388.0900 2388.5600 20631
## s9        9.065243e+03 2.208288e+01 9021.7300 9060.6600 9244.5900 20631
## s10       1.300000e+00 0.000000e+00    1.3000    1.3000    1.3000 20631
## s11       4.754117e+01 2.670874e-01   46.8500   47.5100   48.5300 20631
## s12       5.214135e+02 7.375534e-01  518.6900  521.4800  523.3800 20631
## s13       2.388096e+03 7.191892e-02 2387.8800 2388.0900 2388.5600 20631
## s14       8.143753e+03 1.907618e+01 8099.9400 8140.5400 8293.7200 20631
## s15       8.442146e+00 3.750504e-02    8.3249    8.4389    8.5848 20631
## s16       3.000000e-02 0.000000e+00    0.0300    0.0300    0.0300 20631
## s17       3.932107e+02 1.548763e+00  388.0000  393.0000  400.0000 20631
## s18       2.388000e+03 0.000000e+00 2388.0000 2388.0000 2388.0000 20631
## s19       1.000000e+02 0.000000e+00  100.0000  100.0000  100.0000 20631
## s20       3.881627e+01 1.807464e-01   38.1400   38.8300   39.4300 20631
## s21       2.328971e+01 1.082509e-01   22.8942   23.2979   23.6184 20631
## setting1 -8.870147e-06 2.187313e-03   -0.0087    0.0000    0.0087 20631
## setting2  2.350831e-06 2.930621e-04   -0.0006    0.0000    0.0006 20631
## setting3  1.000000e+02 0.000000e+00  100.0000  100.0000  100.0000 20631
## ttf       1.078079e+02 6.888099e+01    0.0000  103.0000  361.0000 20631

s1, s5, s6, s10, s16, s18, s19 and setting3 are features whit null standard deviation. So, let’s drop them from the analysis, since they do not gives us useful information.

aircraft_train <- aircraft_train %>% 
  select(c(cycle:setting2,s2:s4,s7:s9,s11:s15,s17,s20:s21,ttf,label))
aircraft_test <- aircraft_test %>% 
  select(c(cycle:setting2,s2:s4,s7:s9,s11:s15,s17,s20:s21,ttf,label))

Let’s explore the relationship between all the variables.

library(RColorBrewer)
colors <- brewer.pal(11, "RdBu")

corrs <- aircraft_train %>% select(cycle:label) %>% 
  cor() %>% as.data.frame()
corrs$variable2 <- names(corrs)
corrs <- corrs %>% gather(key='variable1', value='corr', cycle:ttf) %>%
  transform(corr=round(corr, 2))
  
ggplot(data=corrs, mapping=aes(x = variable1, variable2)) +
  geom_tile(aes(fill=corr)) + labs(fill='Correlation') +
  scale_fill_gradient2(low=colors[2], mid=colors[6], high=colors[10]) + 
  geom_text(corrs, mapping=aes(label=corr), size=2.5) +
  theme(axis.text.x=element_text(angle=90,hjust=1))

There is a high correlation (\(r > 0.8\)) between the following pairs: (s14, s9), (s11, s4), (s11, s7), (s11, s12), (s4, s12), (s8,s13), (s7, s12). This may affect the performance of the Machine Learning we are going to learn. According to this findings, let’s drop features from s11 to s14.

aircraft_train <- aircraft_train %>% 
  select(c(setting2,s2:s4,s7,ttf,label))
aircraft_test <- aircraft_test %>% 
  select(c(setting2,s2:s4,s7,ttf,label))

And let’s check the trends with a scatter plot matrix.

library(GGally)
aircraft_train %>% select(c(s2:s4,s7,ttf,label)) %>%
  ggpairs(column=1:5, 
          mapping=aes(color=factor(label))) + 
  ggtitle("Aircraft data")  +
  theme(axis.text.x=element_text(angle=90,hjust=1))

Acknowledgment

This exercise is possible thanks to GitHub Repository of Samimust

Aircraft Engine data

Jaime Mosquera

January, 2021

Data wrangling and Exploratory Data Analysis

Acknowledgment