Siemens Analytics Contest

Summary

For this project, we were asked to find patterns, sequences, and other notable features in the wind turbine data. The main concern we identified was learning about and predicting glitches in the data, and that is the focus of this program. Other observations were recorded, and comments can be found throughout this document. A glitch was defined as a stop signal that was not manually recorded. Such a stop reduces the turbine’s productivity anywhere from a slight degree to a full stop. It is assumed that when turbines stop performing, this can lead to negative consequences, including a decrease in energy production and revenue, and an increase in the labor required to return the turbines to operational status.

We were able to create a prediction model for the glitches, using certain variables from the recorded signal information, that achieved 82.25% accuracy. Because the outcome is binary (a glitch either occurs or it does not), logistic regression was used. Initially, we organized the data chronologically by turbine and isolated the total number of stops for further examination. The main goal was to find a certain number of events/warnings prior to each stop in order to identify patterns and relationships. We decided on 3 previous events/warnings, with the aim of building a model that can generate a prediction with enough time to react and prevent the non-manual stop of the turbines. Therefore, we only used the stops that had 3 previous events/warnings in the creation of the model. Five variables were considered important for the model, and 2 of the 5 were statistically significant.
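As a rough illustration of the modelling step, the sketch below fits a logistic regression with `glm()` on simulated data. It is not the model from this report: the column names (`prev_code_group`, `factor_d_2nd`, `glitch`), the simulated effect sizes, and the 0.5 classification cutoff are all assumptions made for the example, and this part of the report does not show how the non-glitch cases were constructed.

```r
# Minimal sketch on simulated data; all column names are hypothetical.
set.seed(42)
n <- 500
lookback <- data.frame(
  prev_code_group = factor(sample(c("low", "mid", "high"), n, replace = TRUE)),
  factor_d_2nd    = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Simulated outcome: 1 = a non-manual stop followed, 0 = it did not
lin <- -0.5 +
  1.2 * (lookback$prev_code_group == "high") +
  0.8 * (lookback$factor_d_2nd == "C")
lookback$glitch <- rbinom(n, 1, plogis(lin))

# Logistic regression: binomial family with the default logit link
fit <- glm(glitch ~ prev_code_group + factor_d_2nd,
           data = lookback, family = binomial)

# Predicted probabilities, classified at an assumed 0.5 cutoff
pred_prob  <- predict(fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy   <- mean(pred_class == lookback$glitch)

summary(fit)$coefficients  # estimates, standard errors, z values, p-values
accuracy                   # in-sample accuracy of the simulated example
```

Accuracy measured on the same rows used for fitting tends to be optimistic, so a held-out test set would normally be used for a figure that is reported.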

It is recommended that a software program be developed to capture and update the signals received. These inputs can be entered as the variable values and updated on a rolling basis, so that a list of turbines predicted to stop can be generated. The model can be executed each time an event/warning is recorded, and signals can be rotated out as new ones come in. A first in, first out approach can be used for the group of 3 events/warnings recorded for each turbine, so that the oldest signal is dropped when a new one arrives. The group of events/warnings recorded for the model can be reset to 0 if a non-manual stop is recorded. This process and model would allow technicians to visit the turbines prior to a stop and perform preventive maintenance to avoid it.
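The rolling group of 3 events/warnings per turbine could be kept as sketched below. The function `update_window()` and its arguments are hypothetical names introduced for this example, and the codes used are illustrative values only.

```r
# Hypothetical helper: keep the `size` most recent event/warning codes per
# turbine, and clear the group when a non-manual stop is recorded.
update_window <- function(windows, station_id, code, is_nonmanual_stop, size = 3) {
  key <- as.character(station_id)
  if (is_nonmanual_stop) {
    windows[[key]] <- numeric(0)          # reset the group to 0 entries
    return(windows)
  }
  current <- c(windows[[key]], code)      # append the newest signal
  windows[[key]] <- tail(current, size)   # drop the oldest beyond `size`
  windows
}

# Usage: four signals arrive for turbine 1152, then a non-manual stop
windows <- list()
for (code in c(1014, 5113, 13902, 64115)) {
  windows <- update_window(windows, 1152, code, FALSE)
}
windows[["1152"]]   # the three most recent codes

after_stop <- update_window(windows, 1152, NA, TRUE)
length(after_stop[["1152"]])
```

The prediction model would be run on the contents of a turbine's group each time `update_window()` adds a signal and the group holds 3 entries.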

Some other aspects that we think can influence the occurrence and management of the turbine stops are also highlighted in the report. For example, we noticed that most of the stop signals for the turbines are recorded in the morning, but they do continue after 3 pm and throughout the night. Most of the visits by the technicians take place during the morning and general working hours. Finally, the visits by the technicians are usually very short; most range from 0 to 10 minutes.
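A time-of-day pattern like this can be checked by tabulating the hour of each record. The sketch below uses four illustrative timestamps in the same month/day/year hour:minute layout as the TimeOn column; the UTC time zone and the four-way split of the day are assumptions.

```r
# Illustrative timestamps in the layout used by the TimeOn column
time_on <- c("4/26/2016 8:51", "10/16/2016 10:16",
             "3/12/2016 23:20", "3/29/2016 14:45")

# Parse the text into date-times (time zone is an assumption)
parsed <- as.POSIXct(time_on, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Extract the hour and bin it into parts of the day
hour_of_day <- as.integer(format(parsed, "%H"))
part_of_day <- cut(hour_of_day,
                   breaks = c(0, 6, 12, 18, 24), right = FALSE,
                   labels = c("night", "morning", "afternoon", "evening"))
counts <- table(part_of_day)
counts
```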

From this information, a few suggestions are proposed. It seems that the technicians might be overwhelmed in the mornings, when they are faced with a great number of turbines to visit, and therefore do not spend much time per turbine visit. Since stop signals continue to be recorded after normal working hours, some measures can be taken to try to minimize the occurrence of stop signals. Having an overnight shift of technicians handle the stop codes that occur during the night would reduce the number of turbine visits needed during the morning hours. This can increase the amount of time the technicians can spend per turbine and allow them to carry out more preventive maintenance on each turbine. Overall, this can help to decrease the number of glitches that occur.

We were not provided with details about the data, so we can only base suggestions and models on the numbers and codes provided. We can say that the code of the previous event/warning and Factor D from the 2nd event/warning are statistically significant in predicting a glitch in the model. However, we are limited in how far we can explain certain relationships, and this could be preventing us from discovering more. This entire R Markdown document takes about 27 minutes to process on a multi-core, 64-bit machine.

#Recording the time 
ptm<-proc.time()
#Including library for multi core processing
library(doParallel)
# Setting up parallel processing.
cl <-makeCluster(detectCores())
registerDoParallel(cl)
#Setting working Directory to import files
setwd("C:/Users/Me/OneDriveLatestData/OneDrive - University of Central Florida - UCF/Data Mining I/Siemen'sAnalyticsContest")
#Reading in Data Files
sites<-read.csv("Sites.csv", header=TRUE)
codesnevents<-read.csv("CodesnEventsWarningStopClass.csv", header=TRUE)
numassploc<-read.csv("LocationListwNumberofAssetspLocation.csv", header=TRUE)

Below, we provide some preliminary summary statistics on the data files. No conclusions are drawn from these figures; they are used only as a starting point for the analysis.

#Exploring summary statistics
summary(sites)

##    Park_Name         FactorA      FactorB        FactorC      FactorD  
##  Park008: 15616   Min.   : 2.00   GGL:99812   AAA    :45549   A:77852  
##  Park001: 13937   1st Qu.: 4.00   GGS:72117   AAB    :36219   B:54283  
##  Park020:  9909   Median : 5.00               CCC    :16410   C:39794  
##  Park034:  8442   Mean   : 5.11               AAE    :14255            
##  Park002:  8326   3rd Qu.: 7.00               AAD    :11407            
##  Park032:  7970   Max.   :11.00               CCD    :10247            
##  (Other):107729                               (Other):37842            
##    StationID     VisitType      VisitId        ManualStop.during.Visit
##  Min.   : 1152   CU: 28748   Min.   :    178   no : 22537             
##  1st Qu.: 5464   PL:108697   1st Qu.: 364959   yes:149392             
##  Median :11278   SC: 30975   Median : 741691                          
##  Mean   :10198   UN:  3509   Mean   : 726661                          
##  3rd Qu.:13404               3rd Qu.:1078358                          
##  Max.   :18350               Max.   :1413564                          
##                                                                       
##           VisitStartTime   VisitDurMinutes      Code       ManualStop     
##  1/20/2016 11:26 :   801   Min.   : 0.00   Min.   :    2   Mode :logical  
##  4/2/2016 10:03  :   473   1st Qu.: 4.00   1st Qu.: 1014   FALSE:161588   
##  11/4/2016 7:45  :   424   Median : 8.00   Median : 5113   TRUE :10341    
##  10/13/2016 16:42:   399   Mean   :10.38   Mean   :10920                  
##  1/25/2016 12:37 :   387   3rd Qu.:16.00   3rd Qu.:13902                  
##  11/5/2016 12:04 :   385   Max.   :36.00   Max.   :64115                  
##  (Other)         :169060                                                  
##               TimeOn                   TimeOff      
##  4/26/2016 8:51  :   459                   : 29988  
##  10/16/2016 10:16:   242   4/26/2016 8:51  :   177  
##  3/19/2016 9:08  :   228   2/18/2016 13:42 :   109  
##  2/18/2016 15:46 :   205   10/16/2016 10:26:    96  
##  3/12/2016 23:20 :   197   12/19/2016 11:29:    94  
##  3/29/2016 14:45 :   195   9/14/2016 12:48 :    94  
##  (Other)         :170403   (Other)         :141371

#Exploring summary statistics
summary(codesnevents)

##       Code       EventWarningStop IsManualStop     StopUrgency   
##  Min.   :    2   Event  :  9      Mode :logical   Min.   :0.000  
##  1st Qu.: 4109   Stop   :292      FALSE:633       1st Qu.:0.000  
##  Median :13142   Warning:341      TRUE :9         Median :0.000  
##  Mean   :16719                                    Mean   :1.933  
##  3rd Qu.:15255                                    3rd Qu.:5.000  
##  Max.   :64115                                    Max.   :6.000

#Exploring summary statistics
summary(numassploc)

##    Park_Name     X.Assets     
##  Park001: 1   Min.   : 10.00  
##  Park002: 1   1st Qu.: 18.00  
##  Park003: 1   Median : 32.00  
##  Park004: 1   Mean   : 43.62  
##  Park005: 1   3rd Qu.: 67.00  
##  Park006: 1   Max.   :130.00  
##  (Other):31

Summary of Sites dataset

There are 171,929 recorded events in the dataset.

Parks with the three highest number of occurrences: Park008: 15,616; Park001: 13,937; Park020: 9,909

These parks have the most records in the entire dataset: they send the most messages, warnings, and stop records. The frequency of records varies widely by park, which could indicate that some parks are more likely to produce records in the dataset.

Factor B occurrence rate: GGL 58%; GGS 42%

The majority of the records are associated with Factor B GGL.

Factor C three highest number of occurrences: AAA: 45,549; AAB: 36,219; CCC: 16,410

Factor D occurrences: A: 77,852; B: 54,283; C: 39,794

Visit type occurrences: CU: 28,748; PL: 108,697; SC: 30,975; UN: 3,509

Visit type PL is the most common type, at 63% of records.

Manual stop by technician during visit: no: 22,537; yes: 149,392

Most but not all visits by a technician result in a manual stop.

Visit Duration in minutes: mean: 10.38 min (about 10 min 23 sec); median: 8 min

The visit duration isn’t normally distributed, with a longer tail to the right pulling the mean upwards. The majority of visits are short, less than 9 minutes.

Manual Stop issued by a person’s command: TRUE: 10,341; FALSE: 161,588

For the vast majority of records the manual stop flag is FALSE; that is, the record was not issued by a person’s command but by the machine.

Summary of Codes and Events dataset

There are 642 rows.

Event Warning Stop frequencies: Event: 9; Stop: 292; Warning: 341

The majority of the codes are warnings, followed by stops, and then events.

Is manual stop: FALSE: 633; TRUE: 9

Stop Urgency: Min: 0; Max: 6; Mean: 1.933

The stop urgency mean is 1.933, potentially meaning that most stops fall at the lower end of the 0-6 stop urgency scale and are less aggressive stops.

Summary of Locations and Assets dataset

Park Names: 37 Parks

Assets or Turbines: Min: 10; Max: 130; Mean: 43.62; Median: 32

There is a broad range of turbines per park. Half of the parks have fewer than 32 turbines. A small number of parks with a large number of turbines pull the average up from the median. Park size can be a factor affecting the efficiency of technicians and the overall performance of the turbines.

Sorting the data based on turbine and the time events/warnings/stops came on, and displaying the first 10 observations below.

#Organizing the data by turbine and time on
chronsites<-sites[order(sites$TimeOn,sites$StationID),]
#Removing unorganized data file
rm(sites)
#Viewing organized data set
head(chronsites,10)
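One point worth checking in the sort above: TimeOn is read in as text (a factor in the summary output), and `order()` on a text or factor column sorts by character order rather than by date, so "10/16/2016 10:16" would come before "4/26/2016 8:51". If a strict per-turbine chronological order is wanted, a sketch along the following lines may help. The small `demo` frame and the `TimeOnParsed` column are hypothetical, and the UTC time zone is an assumption.

```r
# Small illustrative frame; the real `sites` data has many more columns
demo <- data.frame(
  StationID = c(1152, 1152, 5464, 1152),
  TimeOn = c("10/16/2016 10:16", "4/26/2016 8:51",
             "3/19/2016 9:08", "12/1/2016 7:00"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps into real date-times before sorting
demo$TimeOnParsed <- as.POSIXct(demo$TimeOn,
                                format = "%m/%d/%Y %H:%M", tz = "UTC")

# Sort by turbine first, then chronologically within each turbine
demo_sorted <- demo[order(demo$StationID, demo$TimeOnParsed), ]
demo_sorted
```

Sorting by turbine first keeps each turbine's records contiguous, which is what a look-back over "previous records of the same turbine" relies on.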

Identifying the number of parks and turbines.

print("The number of Parks is:")

## [1] "The number of Parks is:"

length(numassploc$Park_Name)

## [1] 37

print("The Total number of turbines in the Parks is:")

## [1] "The Total number of turbines in the Parks is:"

sum(numassploc$X.Assets)

## [1] 1614

Plotting a correlation matrix to visualize relationships between variables in the complete and organized data set.

In these correlation charts, a darker red circle signifies a more negative relationship between two variables, and a darker blue circle signifies a more positive one. The findings are discussed in the next section, where we list the actual correlation values between the variables.

#Converting data set into numeric to plot
chronplot<-data.matrix(chronsites)
#Importing library needed for plotting
library(corrplot)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

Conducting a correlation test to see numerically the strongest relationships between variables in the complete and organized data set. Values range from -1 to 1: the closer to -1, the stronger the negative correlation; the closer to 1, the stronger the positive correlation; values close to 0 indicate no linear correlation.

#Output of numbers for correlations
M

##                           Park_Name       FactorA      FactorB     FactorC
## Park_Name                1.00000000 -2.989004e-01  0.391096013  0.97922606
## FactorA                 -0.29890037  1.000000e+00 -0.494333026 -0.30440909
## FactorB                  0.39109601 -4.943330e-01  1.000000000  0.41961307
## FactorC                  0.97922606 -3.044091e-01  0.419613069  1.00000000
## FactorD                 -0.14973671 -7.085312e-01  0.352882758 -0.09236555
## StationID                0.26996443 -9.322990e-01  0.438718114  0.28489523
## VisitType               -0.01493665  2.587754e-02 -0.077820046 -0.01352250
## VisitId                 -0.01278187  3.546508e-05  0.002639058 -0.01271376
## ManualStop.during.Visit -0.04609378  9.230470e-02 -0.120128080 -0.05079263
## VisitStartTime           0.05760344  1.395982e-02  0.053676657  0.05550967
## VisitDurMinutes         -0.01491422 -9.491402e-03 -0.078760141 -0.03410653
## Code                    -0.02087135  3.563529e-02 -0.006387941 -0.01856042
## ManualStop              -0.01406810  3.235151e-03 -0.013217171 -0.01648700
## TimeOn                   0.05634077  1.619690e-02  0.050334072  0.05433798
## TimeOff                  0.04790257  1.390545e-03  0.043311615  0.04623634
##                              FactorD    StationID    VisitType
## Park_Name               -0.149736709  0.269964431 -0.014936651
## FactorA                 -0.708531154 -0.932298985  0.025877542
## FactorB                  0.352882758  0.438718114 -0.077820046
## FactorC                 -0.092365553  0.284895225 -0.013522497
## FactorD                  1.000000000  0.694120070  0.033117663
## StationID                0.694120070  1.000000000 -0.035274072
## VisitType                0.033117663 -0.035274072  1.000000000
## VisitId                  0.025490590  0.006762310  0.020258577
## ManualStop.during.Visit -0.098637375 -0.091994450  0.017157196
## VisitStartTime          -0.022703274 -0.006520792 -0.061264850
## VisitDurMinutes          0.009661054  0.006687651  0.047796494
## Code                    -0.020866533 -0.035097404  0.001790626
## ManualStop               0.003102155 -0.003397407  0.005806549
## TimeOn                  -0.024539507 -0.007981090 -0.062056234
## TimeOff                 -0.022061551 -0.001718056 -0.034621624
##                               VisitId ManualStop.during.Visit
## Park_Name               -1.278187e-02            -0.046093784
## FactorA                  3.546508e-05             0.092304700
## FactorB                  2.639058e-03            -0.120128080
## FactorC                 -1.271376e-02            -0.050792628
## FactorD                  2.549059e-02            -0.098637375
## StationID                6.762310e-03            -0.091994450
## VisitType                2.025858e-02             0.017157196
## VisitId                  1.000000e+00            -0.007574048
## ManualStop.during.Visit -7.574048e-03             1.000000000
## VisitStartTime          -3.287309e-02            -0.014698249
## VisitDurMinutes          2.637532e-02             0.225731076
## Code                     1.370397e-02             0.010424067
## ManualStop              -2.875379e-03             0.098256457
## TimeOn                  -3.329477e-02            -0.011042389
## TimeOff                 -2.127363e-02            -0.012631649
##                         VisitStartTime VisitDurMinutes         Code
## Park_Name                  0.057603442    -0.014914216 -0.020871348
## FactorA                    0.013959821    -0.009491402  0.035635293
## FactorB                    0.053676657    -0.078760141 -0.006387941
## FactorC                    0.055509672    -0.034106531 -0.018560418
## FactorD                   -0.022703274     0.009661054 -0.020866533
## StationID                 -0.006520792     0.006687651 -0.035097404
## VisitType                 -0.061264850     0.047796494  0.001790626
## VisitId                   -0.032873086     0.026375321  0.013703968
## ManualStop.during.Visit   -0.014698249     0.225731076  0.010424067
## VisitStartTime             1.000000000     0.060358733  0.017206743
## VisitDurMinutes            0.060358733     1.000000000  0.018568549
## Code                       0.017206743     0.018568549  1.000000000
## ManualStop                -0.007861809     0.048796064 -0.150674381
## TimeOn                     0.992352465     0.066826252  0.017070536
## TimeOff                    0.742315981     0.057852171  0.178025918
##                           ManualStop       TimeOn      TimeOff
## Park_Name               -0.014068102  0.056340771  0.047902566
## FactorA                  0.003235151  0.016196902  0.001390545
## FactorB                 -0.013217171  0.050334072  0.043311615
## FactorC                 -0.016487000  0.054337982  0.046236340
## FactorD                  0.003102155 -0.024539507 -0.022061551
## StationID               -0.003397407 -0.007981090 -0.001718056
## VisitType                0.005806549 -0.062056234 -0.034621624
## VisitId                 -0.002875379 -0.033294766 -0.021273633
## ManualStop.during.Visit  0.098256457 -0.011042389 -0.012631649
## VisitStartTime          -0.007861809  0.992352465  0.742315981
## VisitDurMinutes          0.048796064  0.066826252  0.057852171
## Code                    -0.150674381  0.017070536  0.178025918
## ManualStop               1.000000000 -0.007858399  0.055088399
## TimeOn                  -0.007858399  1.000000000  0.746611015
## TimeOff                  0.055088399  0.746611015  1.000000000

#Removing correlation data
rm(M)

Here are the strongest positive relationships between variables: Factor C & Park name .979; Station ID & Factor D .69; Visit start time & Time off .742; Visit start time & Time on .992; Time on & Time off .746

Here are the strongest negative relationships between variables: Factor A & Station ID -.932; Factor D & Factor A -.708

Key findings: the data exhibits some strong linear correlations, meaning that when one variable increases so does the other (positive relationship), or when one variable increases the other decreases (negative relationship).

Any relationship with an absolute value of about .7 or above is considered strong.
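Rather than reading the strong pairs off the printed matrix by eye, they can be pulled out programmatically. The helper `strong_pairs()` below is a hypothetical name introduced for this sketch; the small demo matrix reuses three values from the output above (rounded to three decimals).

```r
# Hypothetical helper: list variable pairs whose absolute correlation meets
# a threshold, looking only at the upper triangle to avoid duplicate pairs
# and the diagonal of 1s.
strong_pairs <- function(M, threshold = 0.7) {
  idx <- which(abs(M) >= threshold & upper.tri(M), arr.ind = TRUE)
  data.frame(
    var1 = rownames(M)[idx[, "row"]],
    var2 = colnames(M)[idx[, "col"]],
    r    = M[idx],
    stringsAsFactors = FALSE
  )
}

# Demo on a 3 x 3 excerpt of the correlation matrix printed above
vars <- c("Park_Name", "FactorC", "FactorA")
demo_M <- matrix(c( 1.000,  0.979, -0.299,
                    0.979,  1.000, -0.304,
                   -0.299, -0.304,  1.000),
                 nrow = 3, byrow = TRUE, dimnames = list(vars, vars))
strong <- strong_pairs(demo_M)
strong
```

On the full matrix, `strong_pairs(M)` would need to be called before `rm(M)` removes it.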

We are not told what the factors are, but some of the factors have strong relationships with other variables. Additionally, there are strong relationships between some of the time variables in the dataset.

Plotting a correlation matrix to visualize relationships between variables in codes,events, and warnings data set.

#Converting data set into numeric to plot
chronplot<-data.matrix(codesnevents)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

No strong positive relationships between variables were found. There is one strong negative relationship: Stop Urgency & Event Warning Stop -.82. This relationship could be due to the mapping structure of the codes.

Plotting a correlation matrix to visualize relationships between variables in the location list with number of assets per location data set.

#Converting data set into numeric to plot
chronplot<-data.matrix(numassploc)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
rm(numassploc)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

There were no strong relationships found between the two variables in this dataset.

Selecting codes from data set related to stops to be used for selecting chronological sets of information. We will use this to subset the dataframe related to stops in the next section of code.

#Subsetting data frame to extract stops 
stopcodes<-subset(codesnevents,EventWarningStop=="Stop")
#Saving all the stop codes
stopcodes<-stopcodes$Code
#Removing data frame from memory after use
rm(codesnevents)

Traversing the data set and extracting the stops, warnings, and events in chronological order as complete sets, looking back 3 events/warnings from each stop. ALL stops are also extracted for further analysis. If a stop record has 3 previous observations that are not stops, then four records are stored in separate data subsets to help analyze what leads up to a stop: the stop itself, the record previous to the stop, the record two previous to the stop, and the record three previous to the stop.

#Setting up new data frames to store data integers for traversal
allstops<-data.frame()
stops<-data.frame()
prevwarns<-data.frame()
preprevwarns<-data.frame()
ppreprevwarns<-data.frame()
j=integer()
k=integer()
l=integer()
#Traversing dataset to find Stops, leaving enough room to be able to analyze the furthest warning from the non manual stop.
for (i in 12:(length(chronsites$Code)))
{
  #Setting up comparison value
  stationid=chronsites$StationID[i]
  #Setting up counter
  j=i-1
  #Making sure we avoid duplicates records.  Avoiding same visit.
  if ((chronsites$Code[i]%in%stopcodes)&
      (chronsites$VisitId[i]!=chronsites$VisitId[i-1]))
  { 
    #Storing stops
    stopob=chronsites[(i),]
    #Storing ALL stops independently for analysis
    allstops<-rbind(allstops,stopob)
    #Traversing data set for previous event/warning
    while(chronsites$Code[j]%in%stopcodes)
      {
       j=j-1
      }
    #There is no preceding observation
    if (chronsites$StationID[j]!=stationid)
    {
      #Removing observation not having previous
      stopob=NULL
      next
    }
    #Making sure we use the record belonging to turbine
    else
    {
    #Storing previous event/warning
    prewob=chronsites[(j),]
    #Setting up counter
    k=j-1
    #Traversing data set for previous to the previous event/warning
    while(chronsites$Code[k]%in%stopcodes)
        {
         k=k-1
        }
    }
    #Making sure we use the record belonging to turbine
    if (chronsites$StationID[k]!=stationid)
    {
      #Removing observation not having previous
      stopob=NULL
      prewob=NULL
      next
    }
    else
    {
    #Storing the previous to the previous event/warning
    preprewob=chronsites[(k),]
    #Setting up counter
    l=k-1
    #Traversing data set for pre-previous to the previous event/warning
    while(chronsites$Code[l]%in%stopcodes)
        {
         l=l-1
        }
    }
    #Making sure we use the record belonging to turbine
    if (chronsites$StationID[l]!=stationid)
    {
      #Removing observation not having previous
      stopob=NULL
      prewob=NULL
      preprewob=NULL
      next
    }
    else
    {
    #Storing pre-previous to the previous warnings
    ppreprevwarns<-rbind(ppreprevwarns,chronsites[(l),])
    preprevwarns<-rbind(preprevwarns,preprewob)
    prevwarns<-rbind(prevwarns,prewob)
    stops<-rbind(stops,stopob)
    }
  }
}
#Removing NA's that might exist
allstops=na.omit(allstops)
stops=na.omit(stops)
prevwarns=na.omit(prevwarns)
preprevwarns=na.omit(preprevwarns)
ppreprevwarns=na.omit(ppreprevwarns)
#Removing data frame from memory after use
rm(chronsites)
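The three nested look-backs in the loop above follow one rule: skip stop records, and require each non-stop record found to belong to the same turbine. That rule can be sketched as a single helper with a lower-bound guard, so the index can never run below row 1. The function `find_lookback()` and its arguments are hypothetical names for this sketch, not part of the code above.

```r
# Hypothetical helper: return the row indices of the n most recent non-stop
# records before row i, provided each belongs to the same turbine as row i.
# Returns NULL if a non-stop record from another turbine is met first, or if
# the start of the data is reached before n records are found.
find_lookback <- function(codes, stations, stop_codes, i, n = 3) {
  found <- numeric(0)
  j <- i - 1
  while (j >= 1 && length(found) < n) {
    if (!(codes[j] %in% stop_codes)) {
      if (stations[j] != stations[i]) return(NULL)  # other turbine: discard
      found <- c(found, j)
    }
    j <- j - 1                                      # stop records are skipped
  }
  if (length(found) == n) found else NULL
}

# Usage on a toy sequence where code 900 stands in for a stop code
codes    <- c(10, 11, 12, 900, 13, 900)
stations <- c(1, 1, 1, 1, 1, 1)
find_lookback(codes, stations, stop_codes = 900, i = 6)
```

With the indices in hand, the matching rows can be collected in a list and combined once at the end (for example with `do.call(rbind, ...)`), which is usually much faster than calling `rbind()` inside the loop for every stop.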

Plotting a correlation matrix to visualize relationships between variables in all stop observations.

#Converting data set into numeric to plot
chronplot<-data.matrix(allstops)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

Conducting a correlation test to see numerically the strongest relationships between variables in ALL Stops. Values range from -1 to 1: the closer to -1, the stronger the negative correlation; the closer to 1, the stronger the positive correlation; values close to 0 indicate no linear correlation.

#Output of numbers for correlations
M

##                            Park_Name      FactorA      FactorB
## Park_Name                1.000000000 -0.283437695  0.482280544
## FactorA                 -0.283437695  1.000000000 -0.474632907
## FactorB                  0.482280544 -0.474632907  1.000000000
## FactorC                  0.979257543 -0.283450835  0.516924623
## FactorD                 -0.152894206 -0.722351561  0.297810106
## StationID                0.248544491 -0.924383527  0.391991745
## VisitType               -0.030037300  0.007651303 -0.054519740
## VisitId                 -0.022701817  0.001932463 -0.003570285
## ManualStop.during.Visit -0.049496211  0.103570234 -0.099345143
## VisitStartTime           0.100763451 -0.014666522  0.055776052
## VisitDurMinutes         -0.008822199 -0.030101024 -0.057000033
## Code                    -0.027334403  0.070877847 -0.029277131
## ManualStop              -0.007858442 -0.020494147  0.017829601
## TimeOn                   0.100852506 -0.011012649  0.047740249
## TimeOff                  0.097481239 -0.010124405  0.046330615
##                              FactorC     FactorD    StationID    VisitType
## Park_Name                0.979257543 -0.15289421  0.248544491 -0.030037300
## FactorA                 -0.283450835 -0.72235156 -0.924383527  0.007651303
## FactorB                  0.516924623  0.29781011  0.391991745 -0.054519740
## FactorC                  1.000000000 -0.10856154  0.255449196 -0.034018001
## FactorD                 -0.108561543  1.00000000  0.683134931  0.041163380
## StationID                0.255449196  0.68313493  1.000000000 -0.011279736
## VisitType               -0.034018001  0.04116338 -0.011279736  1.000000000
## VisitId                 -0.023998597  0.02503345  0.007862388  0.016550770
## ManualStop.during.Visit -0.048872493 -0.09861768 -0.086254213 -0.008070014
## VisitStartTime           0.103974685 -0.01250629  0.028494316 -0.027543692
## VisitDurMinutes         -0.025049632  0.04731610  0.039664265  0.048650905
## Code                    -0.025144518 -0.03629962 -0.077596248 -0.015246015
## ManualStop              -0.007173372  0.03354477  0.023122605 -0.039202142
## TimeOn                   0.104154665 -0.01761939  0.027254795 -0.027204041
## TimeOff                  0.100721434 -0.01821553  0.027255736 -0.030015979
##                              VisitId ManualStop.during.Visit
## Park_Name               -0.022701817            -0.049496211
## FactorA                  0.001932463             0.103570234
## FactorB                 -0.003570285            -0.099345143
## FactorC                 -0.023998597            -0.048872493
## FactorD                  0.025033451            -0.098617676
## StationID                0.007862388            -0.086254213
## VisitType                0.016550770            -0.008070014
## VisitId                  1.000000000            -0.001760866
## ManualStop.during.Visit -0.001760866             1.000000000
## VisitStartTime          -0.031703196             0.005198197
## VisitDurMinutes          0.019288914             0.238047176
## Code                     0.011356309            -0.021598948
## ManualStop               0.001269653             0.149832256
## TimeOn                  -0.029956811             0.009976922
## TimeOff                 -0.028796439             0.009030345
##                         VisitStartTime VisitDurMinutes        Code
## Park_Name                  0.100763451    -0.008822199 -0.02733440
## FactorA                   -0.014666522    -0.030101024  0.07087785
## FactorB                    0.055776052    -0.057000033 -0.02927713
## FactorC                    0.103974685    -0.025049632 -0.02514452
## FactorD                   -0.012506291     0.047316096 -0.03629962
## StationID                  0.028494316     0.039664265 -0.07759625
## VisitType                 -0.027543692     0.048650905 -0.01524601
## VisitId                   -0.031703196     0.019288914  0.01135631
## ManualStop.during.Visit    0.005198197     0.238047176 -0.02159895
## VisitStartTime             1.000000000     0.052690508  0.04317385
## VisitDurMinutes            0.052690508     1.000000000  0.01528886
## Code                       0.043173851     0.015288858  1.00000000
## ManualStop                -0.050973784    -0.005655096 -0.22678285
## TimeOn                     0.990201039     0.057514525  0.04362032
## TimeOff                    0.975601875     0.057792923  0.04613988
##                           ManualStop       TimeOn      TimeOff
## Park_Name               -0.007858442  0.100852506  0.097481239
## FactorA                 -0.020494147 -0.011012649 -0.010124405
## FactorB                  0.017829601  0.047740249  0.046330615
## FactorC                 -0.007173372  0.104154665  0.100721434
## FactorD                  0.033544769 -0.017619393 -0.018215526
## StationID                0.023122605  0.027254795  0.027255736
## VisitType               -0.039202142 -0.027204041 -0.030015979
## VisitId                  0.001269653 -0.029956811 -0.028796439
## ManualStop.during.Visit  0.149832256  0.009976922  0.009030345
## VisitStartTime          -0.050973784  0.990201039  0.975601875
## VisitDurMinutes         -0.005655096  0.057514525  0.057792923
## Code                    -0.226782846  0.043620322  0.046139884
## ManualStop               1.000000000 -0.050124734 -0.051729100
## TimeOn                  -0.050124734  1.000000000  0.985293713
## TimeOff                 -0.051729100  0.985293713  1.000000000

#Removing correlation data
rm(M)

Below we will compare the strong correlations in the full data set to the same correlation in the all stop subset.

Here are the strongest positive relationships between variables: Full Data set - Factor C & Park name .979; all stop set - Factor C & Park name .979

Full Data set - Station ID & Factor D .69; all stop set - Station ID & Factor D .683

Full Data set - Visit start time & Time off .742; all stop set - Visit start time & Time off .976

Full Data set - Visit start time & Time on .992; all stop set - Visit start time & Time on .990

Full Data set - Time on & Time off .746; all stop set - Time on & Time off .985

Here are the strongest negative relationships between variables: Full Data set - Factor A & Station ID -.932; all stop set - Factor A & Station ID -.924

Full Data set - Factor D & Factor A -.708; all stop set - Factor D & Factor A -.722

Key findings: this subset of all stops continues to exhibit strong linear correlations between the same variables as the full data set. There are, however, some interesting differences. Visit start time and Time off are much more strongly correlated in the all stop subset (.976 versus .742), as are Time on and Time off (.985 versus .746). This could be due to eliminating warnings, which would not be as strongly correlated with the visit times as the stop records are.

Exploring frequencies within each variable of All Stops.

Park Names

#Sorting the number of times a park name appears
head(sort(table(allstops$Park_Name),decreasing = TRUE),10)

## 
## Park008 Park020 Park032 Park002 Park001 Park012 Park006 Park034 Park005 
##    1805    1794    1512    1498    1484    1354    1229     976     863 
## Park014 
##     851

There are 23,817 records in the All stop subset.

Park Names

Three highest frequency parks in subset: Park008 1,805 Park020 1,794 Park032 1,512

Comparison of ratio of park assets/total assets to frequency of a specific park in all stops dataset/amount of records in all stop dataset. This should let us analyze if the parks are in the allstops dataset proportionately to their size, and whether they are over or under performing.

Bottom three parks in all stops dataset more than their size would indicate: Park032 1.3% total assets, 6.3% of all stops records Park008 4.3% total assets, 7.5% of all stops records Park002 4.7% total assets, 6.2% of all stops records

Top three parks that are in all stops dataset less than their size would indicate: Park024 5% of total assets, 1.9% of all stops records Park006 8% of total assets, 5.2% of all stops records Park021 4% of total assets, 2.2% of all stops records

This analysis highlights a potentially very important issue. Some parks appear in the all stops records far more often than their size would indicate, while others appear less often than their size would indicate. This means there could be park-specific factors that are altering the machines' performance. A useful follow-up would be to analyze the better performing parks and contrast them with the under-performing parks.
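The comparison above can be expressed as a single ratio per park: share of all stops records divided by share of total assets. The sketch below is only an illustration. The data frame `park_share` and its column names are hypothetical, and the six rows simply reuse the percentages quoted above.

```r
# Hypothetical table built from the percentages quoted in the text
park_share <- data.frame(
  park        = c("Park032", "Park008", "Park002", "Park024", "Park006", "Park021"),
  asset_share = c(1.3, 4.3, 4.7, 5.0, 8.0, 4.0),  # % of total assets
  stop_share  = c(6.3, 7.5, 6.2, 1.9, 5.2, 2.2)   # % of all stops records
)

# A ratio above 1 means the park shows up in the stops more than its size suggests
park_share$ratio <- round(park_share$stop_share / park_share$asset_share, 2)

# Most over-represented parks first
park_share[order(-park_share$ratio), ]
```

Sorting by this ratio gives one ranked list instead of two separate ones, which makes it easier to pick parks to contrast.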

Factor A

#Sorting the number of times each value appears in Factor A
sort(table(allstops$FactorA),decreasing = TRUE)

## 
##    5    4    7    2    8    3    6    9   10   11 
## 4981 4210 4113 3611 2766 2565  638  395  367  171

Highest frequency for factor A: Class 5- 4,981 Class 4- 4,210 Class 7- 4,113

These classes for factor A in the all stops data set are similar to the full data set.

Factor B

#Sorting the number of times each value appears in Factor B
sort(table(allstops$FactorB),decreasing = TRUE)

## 
##   GGL   GGS 
## 14007  9810

The two frequencies for factor B: GGL 14007 59% GGS 9810 41%

Compared to the full dataset at 58% and 42%, Factor B is very similar in the subset.
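The percentages quoted for each factor can be reproduced directly from the frequency tables. As a small sketch, the counts printed above for Factor B are typed in by hand here; the name `factor_b_counts` is introduced only for this illustration. With the data loaded, the same call could be applied to `table(allstops$FactorB)`.

```r
# Counts copied from the Factor B output above
factor_b_counts <- c(GGL = 14007, GGS = 9810)

# Share of each level, in whole percent
factor_b_pct <- round(100 * prop.table(factor_b_counts))
factor_b_pct
```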

Factor C

#Sorting the number of times each value appears in Factor C
sort(table(allstops$FactorC),decreasing = TRUE)

## 
##  AAA  AAB  CCC  AAE  AAC  BBC  CCD  AAD  BBD  BBB  EEE  DDD 
## 6291 4578 2460 2329 1884 1400 1239 1097  914  697  609  319

Three highest frequencies factor C: AAA 6,291 26% AAB 4,578 19% CCC 2,460 10%

Factor C is also similar to the values in the full data set, AAA 26%, AAB 21%, CCC 10%.

Factor D

#Sorting the number of times each value appears in Factor D
sort(table(allstops$FactorD),decreasing = TRUE)

## 
##     A     B     C 
## 11433  6623  5761

Factor D frequencies: A - 11,433 48% B - 6,623 28% C - 5,761 24%

Comparing Factor D in the subset to the full data set, the factors are similar at A - 45%, B - 32%, C - 23% for the full data set.

Station ID

#Sorting the number of times each value appears in StationID
head(sort(table(allstops$StationID),decreasing = TRUE),10)

## 
## 13302 13308  5254  5200 13296 13282 13350 13334  3726 13280 
##   212   120   108   107   105   103   103   102    96    93

Three stations with highest frequencies: 13302 - 212 13308 - 120 5254 - 108

Records per station: minimum 1, maximum 212, mean 15.58, median 20

Out of the 1,614 stations, 1,529 appeared in the all stops subset. With a maximum of 212 records against a mean of 15.58, it is clear that some stations occur very frequently in the all stops records. This analysis helps identify which stations are causing the most trouble.
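The mean quoted above follows from the totals already reported (23,817 records spread over 1,529 stations). The sketch below checks that arithmetic and then shows, on a made-up vector, how per-station counts could be summarised. `station_id` is a toy stand-in, not the contest data.

```r
# Mean records per station from the totals reported in the text
mean_per_station <- round(23817 / 1529, 2)
mean_per_station

# Toy stand-in for allstops$StationID: one busy station and two quiet ones
station_id <- c(rep("A", 5), "B", "C")
per_station <- as.vector(table(station_id))

# Minimum, median, mean, and maximum of records per station
summary(per_station)
```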

Visit Type

#Sorting the number of times each value appears in Visit Type
sort(table(allstops$VisitType),decreasing = TRUE)

## 
##    PL    SC    CU    UN 
## 14563  4737  3901   616

Frequencies by type: PL 14,563 61% SC 4,737 20% CU 3,901 16% UN 616 3%

Frequencies by type full data set: PL 63% SC 18% CU 17% UN 2%

Comparing the frequencies by type to the full data set, the values for the visit type are similar.

Visit ID

#Sorting the number of times each value appears in Visit ID
head(sort(table(allstops$VisitId),decreasing = TRUE),10)

## 
##    6482 1352238   76215  721384  213230 1001456  992081  786587  116664 
##      86      56      52      46      45      45      44      43      42 
##  282556 
##      42

Three highest visit ID with frequencies:

6482 - 86 .36% 1352238 - 56 .24% 76215 - 52 .22%

These visit IDs differ from the most frequent visit IDs in the full data set, which are 1113372, 944392, and 1280538.

Visit Start Time

#Changing variable to date
allstops$VisitStartTime=strptime(allstops$VisitStartTime,format="%m/%d/%Y %H:%M")
#Extracting Hours
allstops$VisitStartTime=format(allstops$VisitStartTime,'%H')
#Changing variable to numeric to plot in histogram
allstops$VisitStartTime=as.numeric(allstops$VisitStartTime)
#Plotting Histogram
hist(allstops$VisitStartTime,main = paste("Histogram of Visit Start Times"),xlab = "Hours of the Day")

#Sorting the number of times each value appears in Visit Start Time
head(sort(table(allstops$VisitStartTime),decreasing = TRUE),10)

## 
##    8    9   10    7   13   11   12   14   15   16 
## 2806 2429 2227 2055 2037 1987 1932 1884 1658 1421

Three highest visit start times: 8 am - 2,806 9 am - 2,429 10 am - 2,227

Most start times occur between 8am-10am, and taper off until 4pm. From 4pm the visits reduce drastically, with virtually no visits happening in the evening or overnight. This implies that the machines are visited during the day hours, and most specifically in the early mornings.

Visit Duration in Minutes

#Plotting Histogram
hist(allstops$VisitDurMinutes,main = paste("Histogram of Visit Duration"),xlab = "Minutes")

#Sorting the number of times each value appears in Visit Duration in Minutes
head(sort(table(allstops$VisitDurMinutes),decreasing = TRUE),10)

## 
##    2    1    3    4    5    6    7    9    8   10 
## 2241 2140 1889 1817 1431 1268 1100  992  972  723

Three highest visit durations:

2 min - 2,241 1 min - 2,140 3 min - 1,889

Most visits are very short, and the frequency of visits decreases as the duration in minutes increases. Visits are typically less than 9 minutes, and some are as long as about 30 minutes.

Codes

#Plotting Histogram
hist(allstops$Code,main = paste("Histogram of Codes"),xlab = "Code Numbers")

#Sorting the number of times each value appears in Codes
head(sort(table(allstops$Code),decreasing = TRUE),10)

## 
##  3130  7111 10105 13902  1005  1007  5113 10118  1001  8000 
##  3161  1873  1788  1739  1585  1110  1092   868   861   452

Three most frequent codes: 3130 - 3,161 7111 - 1,873 10105 - 1,788

Most of the codes are at the 10,000 level and below.

Time On

#Changing variable to date
allstops$TimeOn=strptime(allstops$TimeOn,format="%m/%d/%Y %H:%M")
#Extracting Hours
allstops$TimeOn=format(allstops$TimeOn,'%H')
#Changing variable to numeric to plot in histogram
allstops$TimeOn=as.numeric(allstops$TimeOn)
#Plotting Histogram
hist(allstops$TimeOn,main = paste("Histogram of Times Stops were Recorded"),xlab = "Hours of the Day")

#Sorting the number of times each value appears in Time On
head(sort(table(allstops$TimeOn),decreasing = TRUE),10)

## 
##    8    9   10   11   12   13   14    7   15   16 
## 2007 1856 1849 1621 1565 1515 1501 1496 1307 1033

Three most frequent time on times: 8 am - 2,007 9 am - 1,856 10 am - 1,849

The turbines send stop signals most frequently around 8 am, and the frequency tapers off throughout the day. However, the signals continue through the night and pick back up from 5am - 7am. This might create a heavy workload for the technicians in the mornings, which may be why the technicians have so many turbine visits from 8am to noon.

Exploring stops that have previous events/warnings, looking back 3 instances. This involves organizing and filtering the events and warnings prior to the stops, and exploring the stops for which 3 previous observations are available.

Listed below are the strong relationships between variables found in each correlation subset in the 4 chunks of code. The order they are in is: 1. warning 3 previous to the stop subset 2. warning 2 previous to stop subset 3. warning previous to stop subset 4. the stop subset.

Here are the strong positive linear relationships:

Park Name & Factor C: .979, .979, .979, .979

Factor D & Station ID: .683, .683, .683, .683

Time off & Visit start time: .976, .976, .976, .976

Time on & Time off: .985, .985, .985, .985

Here are the strong negative linear relationships:

Factor D & Factor A: -.722, -.722, -.722, -.722

Station ID & Factor A: -.924, -.924, -.924, -.924

Factor B & Factor A: -.475, -.475, -.475, -.475

As you can see, the strongly correlated variable pairs and their coefficients do not change between the subsets. If there were more variability here, more valuable information might be generated. It is possible that sequential records leading up to a stop for a specific turbine do not vary enough in these variables to produce different results in each sequence.
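Reading strong pairs off four correlation plots by eye is error prone, so a small helper that lists the pairs above a cutoff may be useful. This is a sketch only: `strong_pairs`, its `cutoff` argument, and the toy data frame are hypothetical and not part of the original program. It relies on the same `cor(data.matrix(...))` step used in the chunks that follow.

```r
# Hypothetical helper: list variable pairs whose |correlation| is at least `cutoff`
strong_pairs <- function(df, cutoff = 0.7) {
  m <- cor(data.matrix(df))
  # Look only above the diagonal so each pair is reported once
  idx <- which(abs(m) >= cutoff & upper.tri(m), arr.ind = TRUE)
  data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = round(m[idx], 3)
  )
}

# Toy data: b is a linear function of a, c is unrelated noise
toy <- data.frame(a = 1:10,
                  b = 2 * (1:10) + 1,
                  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))
res <- strong_pairs(toy)
res
```

Applied to each of the four subsets in turn, the helper would return the same kind of list as the one written out above, which makes subset-to-subset comparison mechanical.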

Exploring pre-previous to previous warnings to stops.

head(ppreprevwarns,10)

Plotting a correlation matrix to visualize relationships between variables in 1st event/warning in chronological order.

#Converting data set into numeric to plot
chronplot<-data.matrix(ppreprevwarns)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

Exploring previous to previous warnings to stops.

head(preprevwarns,10)

Plotting a correlation matrix to visualize relationships between variables in 2nd event/warning in chronological order.

#Converting data set into numeric to plot
chronplot<-data.matrix(preprevwarns)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

Exploring previous warnings.

head(prevwarns,10)

Plotting a correlation matrix to visualize relationships between variables in 3rd event/warning in chronological order.

#Converting data set into numeric to plot
chronplot<-data.matrix(prevwarns)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

Exploring Stops that have 3 prior events/warnings.

#Viewing organized Non Manual Stops
head(stops,10)

Plotting a correlation matrix to visualize relationships between variables in stops.

#Converting data set into numeric to plot
chronplot<-data.matrix(stops)
#Correlation Analysis
M <- cor(chronplot)
#Removing data frame from memory after use
rm(chronplot)
#Plotting
corrplot(M, method="circle")

#Removing correlation data
rm(M)

Changing times in data frames to numeric, preparing all time data for BIC variable selection analysis.

#Changing variable to date
ppreprevwarns$VisitStartTime=strptime(ppreprevwarns$VisitStartTime,format="%m/%d/%Y %H:%M")
#Extracting Hours
ppreprevwarns$VisitStartTime=format(ppreprevwarns$VisitStartTime,'%H')
#Changing variable to numeric
ppreprevwarns$VisitStartTime=as.numeric(ppreprevwarns$VisitStartTime)
#Changing variable to date
preprevwarns$VisitStartTime=strptime(preprevwarns$VisitStartTime,format="%m/%d/%Y %H:%M")
#Extracting Hours
preprevwarns$VisitStartTime=format(preprevwarns$VisitStartTime,'%H')
#Changing variable to numeric
preprevwarns$VisitStartTime=as.numeric(preprevwarns$VisitStartTime)
#Changing variable to date
prevwarns$VisitStartTime=strptime(prevwarns$VisitStartTime,format="%m/%d/%Y %H:%M")
#Extracting Hours
prevwarns$VisitStartTime=format(prevwarns$VisitStartTime,'%H')
#Changing variable to numeric
prevwarns$VisitStartTime=as.numeric(prevwarns$VisitStartTime)
#allstops$TimeOn was already converted to numeric hours in the Time On section above;
#running strptime on it again would turn every value into NA, so it is left as is
#Changing variable to date
ppreprevwarns$TimeOn=strptime(ppreprevwarns$TimeOn,format="%m/%d/%Y %H:%M")
#Extracting Hours
ppreprevwarns$TimeOn=format(ppreprevwarns$TimeOn,'%H')
#Changing variable to numeric
ppreprevwarns$TimeOn=as.numeric(ppreprevwarns$TimeOn)
#Changing variable to date
preprevwarns$TimeOn=strptime(preprevwarns$TimeOn,format="%m/%d/%Y %H:%M")
#Extracting Hours
preprevwarns$TimeOn=format(preprevwarns$TimeOn,'%H')
#Changing variable to numeric
preprevwarns$TimeOn=as.numeric(preprevwarns$TimeOn)
#Changing variable to date
prevwarns$TimeOn=strptime(prevwarns$TimeOn,format="%m/%d/%Y %H:%M")
#Extracting Hours
prevwarns$TimeOn=format(prevwarns$TimeOn,'%H')
#Changing variable to numeric
prevwarns$TimeOn=as.numeric(prevwarns$TimeOn)
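The chunk above repeats the same three steps (parse the timestamp, extract the hour, convert to numeric) for every data frame and column. A compact alternative is sketched below under stated assumptions: `to_hour` is a hypothetical helper, the two small data frames are toy stand-ins for the warning subsets, and the time zone is fixed to UTC only to keep the example deterministic.

```r
# Hypothetical helper: timestamp string -> hour of day as a number
to_hour <- function(x, fmt = "%m/%d/%Y %H:%M") {
  as.numeric(format(strptime(as.character(x), format = fmt, tz = "UTC"), "%H"))
}

# Toy stand-ins for the warning subsets
frames <- list(
  first  = data.frame(VisitStartTime = c("02/17/2017 08:15", "02/17/2017 13:40"),
                      TimeOn         = c("02/17/2017 07:55", "02/17/2017 13:05")),
  second = data.frame(VisitStartTime = c("02/16/2017 09:30", "02/16/2017 16:10"),
                      TimeOn         = c("02/16/2017 09:05", "02/16/2017 15:45"))
)

# Apply the same conversion to both time columns of every data frame
frames <- lapply(frames, function(df) {
  for (col in c("VisitStartTime", "TimeOn")) df[[col]] <- to_hour(df[[col]])
  df
})
frames$first
```

Because the conversion lives in one function, a column that has already been converted is easier to spot and is not passed through `strptime` a second time by accident.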

Preparing data for variable selection and exploring dependent variable bias.

#Setting up new data frame
newdf=data.frame()
#Combining all variables and renaming them into 1 dataframe for picking variable subset
#Starting with the columns of the pre-previous to previous warnings.  1st event/warning in chronological order.
newdf=cbind(ppreprevwarns)
#Removing unused columns
newdf$TimeOff<-NULL
#Removing Visit ID because it is not needed
newdf$VisitId<-NULL
#Adding the columns of the previous to previous warnings.  2nd event/warning in chronological order.
newdf$pppn=cbind(preprevwarns$Park_Name)
newdf$ppfa=cbind(preprevwarns$FactorA)
newdf$ppfb=cbind(preprevwarns$FactorB)
newdf$ppfc=cbind(preprevwarns$FactorC)
newdf$ppfd=cbind(preprevwarns$FactorD)
newdf$ppsid=cbind(preprevwarns$StationID)
newdf$ppmsdv=cbind(preprevwarns$ManualStop.during.Visit)
newdf$ppvst=cbind(preprevwarns$VisitStartTime)
newdf$ppvdm=cbind(preprevwarns$VisitDurMinutes)
newdf$ppc=cbind(preprevwarns$Code)
newdf$ppms=cbind(preprevwarns$ManualStop)
newdf$ppto=cbind(preprevwarns$TimeOn)
#Adding the columns of the previous warnings.  3rd event/warning in chronological order.
newdf$ppn=cbind(prevwarns$Park_Name)
newdf$pfa=cbind(prevwarns$FactorA)
newdf$pfb=cbind(prevwarns$FactorB)
newdf$pfc=cbind(prevwarns$FactorC)
newdf$pfd=cbind(prevwarns$FactorD)
newdf$psid=cbind(prevwarns$StationID)
newdf$pmsdv=cbind(prevwarns$ManualStop.during.Visit)
newdf$pvst=cbind(prevwarns$VisitStartTime)
newdf$pvdm=cbind(prevwarns$VisitDurMinutes)
newdf$pc=cbind(prevwarns$Code)
newdf$pms=cbind(prevwarns$ManualStop)
newdf$pto=cbind(prevwarns$TimeOn)
#Adding the y column: The dependent variable, determining glitch
newdf$y=cbind(stops$ManualStop)
#Removing unused objects and data from memory
rm(allstops)
rm(stops)
rm(ppreprevwarns)
rm(preprevwarns)
rm(prevwarns)
rm(i)
rm(j)
rm(k)
rm(l)
rm(preprewob)
rm(prewob)
rm(stationid)
rm(stopcodes)
rm(stopob)
#Checking bias.  How many are TRUE or FALSE in the dependent variable
table(newdf$y)

## 
## FALSE  TRUE 
##  1220   262

The manual stop counts are 1220 FALSE and 262 TRUE. This indicates that most of the stops that have 3 previous events/warnings happened without a technician manually stopping the turbine. Next, we will pick the variables most related to these stops in order to build a predictive model that estimates which stops are non manual stops (glitches).

Splitting the data into training and testing set using the standard of 70/30. Executing BIC analysis to pick the predictive model variables from 38 possible variables.

#Creating Training and Test Data, 70/30 split. 
input_true<-newdf[which(newdf$y=="TRUE"),]  #non-glitches
input_false<-newdf[which(newdf$y=="FALSE"),] # glitches
#To ensure similar samples
set.seed(123)
#Non glitches for training
input_true_train_rows<-sample(1:nrow(input_true), 0.7*nrow(input_true))
#Glitches for training
input_false_train_rows<-sample(1:nrow(input_false),0.7*nrow(input_false))
#Combining the true and false into a training set
train_true<-input_true[input_true_train_rows,]  
train_false<-input_false[input_false_train_rows,]
train.sample<-rbind(train_true, train_false)  
#Combining the true and false into a test set
test_true<-input_true[-input_true_train_rows,]
test_false<-input_false[-input_false_train_rows,]
test.sample<-rbind(test_true,test_false)
#Including library for variable selection and regression
library(leaps)
library(glmnet)
library(glmulti)
#Executing best variable selection for predictive model
glmulti.glm.out <-glmulti(y~Park_Name+FactorA+FactorB+FactorC+FactorD+StationID+VisitType+ManualStop.during.Visit+VisitStartTime+VisitDurMinutes+Code+ManualStop+TimeOn+pppn+ppfa+ppfb+ppfc+ppfd+ppsid+ppmsdv+ppvst+ppvdm+ppc+ppms+ppto+ppn+pfa+pfb+pfc+pfd+psid+pfd+psid+pmsdv+pvst+pvdm+pc+pms+pto,data=train.sample,method="g",crit="bic",fitfunction=glm,minsize=13,maxsize=38,level=1,plotty=F,report = F,confsetsize = 5)

## TASK: Genetic algorithm in the candidate set.
## Initialization...
## Algorithm started...
## Improvements in best and average IC have bebingo en below the specified goals.
## Algorithm is declared to have converged.
## Completed.
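The 70/30 split at the top of the chunk above samples the TRUE and FALSE rows separately, so the training and test sets keep a similar class mix. A minimal sketch of the same idea as a reusable function follows. `stratified_split` and the toy data frame are hypothetical names used only for illustration, and the toy data merely reuses the 262/1220 class counts reported earlier.

```r
# Hypothetical helper: sample train_frac of the rows within each class of y_col
stratified_split <- function(df, y_col, train_frac = 0.7, seed = 123) {
  set.seed(seed)
  groups <- split(seq_len(nrow(df)), df[[y_col]])
  train_rows <- unlist(lapply(groups, function(rows) {
    rows[sample.int(length(rows), floor(train_frac * length(rows)))]
  }), use.names = FALSE)
  list(train = df[train_rows, , drop = FALSE],
       test  = df[-train_rows, , drop = FALSE])
}

# Toy data with the same class counts as the dependent variable above
toy <- data.frame(x = seq_len(1482),
                  y = rep(c(TRUE, FALSE), times = c(262, 1220)))
parts <- stratified_split(toy, "y")

# Class mix in each part
table(parts$train$y)
table(parts$test$y)
```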

#Best Model selected
summary(glmulti.glm.out@objects[[1]])

## 
## Call:
## fitfunc(formula = as.formula(x), data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.32202  -0.21201  -0.17499  -0.02632   0.94794  
## 
## Coefficients: (7 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.972e-02  6.154e-02   0.483 0.629238    
## FactorBGGS                  1.931e-02  3.009e-02   0.642 0.521227    
## ManualStop.during.Visityes  1.734e-01  4.424e-02   3.920 9.43e-05 ***
## FactorA                    -5.521e-04  6.775e-03  -0.081 0.935073    
## ManualStopTRUE                     NA         NA      NA       NA    
## ppfa                               NA         NA      NA       NA    
## ppfb                               NA         NA      NA       NA    
## ppfc                       -6.000e-03  3.976e-03  -1.509 0.131571    
## ppmsTRUE                           NA         NA      NA       NA    
## pfa                                NA         NA      NA       NA    
## pfb                                NA         NA      NA       NA    
## pfc                                NA         NA      NA       NA    
## pvdm                        3.612e-03  1.463e-03   2.468 0.013745 *  
## pc                         -2.822e-06  7.456e-07  -3.785 0.000163 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1405411)
## 
##     Null deviance: 150.71  on 1036  degrees of freedom
## Residual deviance: 144.76  on 1030  degrees of freedom
## AIC: 917
## 
## Number of Fisher Scoring iterations: 2

We searched the combinations of candidate variables with glmulti's genetic algorithm, using BIC to rank the models. The best model shown above contains 13 terms, but 7 of them have NA coefficients: as the output notes, they are not defined because of singularities, meaning they are perfectly collinear with other terms and add nothing to the fit. Therefore, we only consider the variables that received a numerical estimate. 5 variables were selected for the logistic regression. After we execute the logistic regression, we will describe the actual variables and show their relevance to the predictive model.
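To illustrate where NA coefficients of this kind come from, the toy example below fits a `glm` in which one predictor is an exact multiple of another. The data are simulated and all names are made up for the illustration; it is not a reconstruction of the contest variables.

```r
set.seed(1)

# Simulated data: x2 carries exactly the same information as x1
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- 2 * toy$x1
toy$y  <- toy$x1 + rnorm(50)

# glm keeps x1 and reports x2 as NA (not defined because of singularities)
fit <- glm(y ~ x1 + x2, data = toy)
coef(fit)

# BIC of the fitted model, the criterion used to rank candidate models
BIC(fit)
```

Dropping such aliased terms does not change the fitted values, which is why they can be left out of the final model.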

Running logistic regression for selected model in order to predict glitches. Showing the resulting model.

#Loading caret (predict() for glm objects itself comes from base R's stats package)
library(caret)
#Creating training model using logistic regression
glmmodel=glm(y~FactorB+ManualStop.during.Visit+FactorA+ppfd+pc,data=train.sample,family=binomial(link ="logit"))
#Printing out the logistic model
summary(glmmodel)

## 
## Call:
## glm(formula = y ~ FactorB + ManualStop.during.Visit + FactorA + 
##     ppfd + pc, family = binomial(link = "logit"), data = train.sample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.87161  -0.70270  -0.59512  -0.00027   2.50536  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.828e+01  4.298e+02  -0.043 0.966082    
## FactorBGGS                 -4.689e-02  1.963e-01  -0.239 0.811215    
## ManualStop.during.Visityes  1.609e+01  4.298e+02   0.037 0.970140    
## FactorA                     8.306e-02  5.842e-02   1.422 0.155133    
## ppfd                        3.335e-01  1.408e-01   2.368 0.017864 *  
## pc                         -2.770e-05  7.980e-06  -3.472 0.000517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 966.48  on 1036  degrees of freedom
## Residual deviance: 909.70  on 1031  degrees of freedom
## AIC: 921.7
## 
## Number of Fisher Scoring iterations: 16

From the above results, one can see that the selected predictive model has 5 variables that work together to provide the best predictive power. They are:

1) FactorB - Factor B in the 1st of the 3 events/warnings in chronological order, prior to a stop. One of the disguised characteristics of the park.

2) ManualStop.during.Visit - A true/false flag indicating whether any manual stop codes were registered in the alarm history provided for each visit. This variable belongs to the 1st of the 3 events/warnings in chronological order, prior to a stop.

3) FactorA - Factor A in the 1st of the 3 events/warnings in chronological order, prior to a stop. One of the disguised characteristics of the park.

4) ppfd - Factor D in the 2nd of the 3 events/warnings in chronological order, prior to a stop. One of the disguised characteristics of the park.

5) pc - The variable Code, a number representing the actual alarm, event, or fault code from the turbine’s historical information log. This variable belongs to the 3rd of the 3 events/warnings in chronological order, in other words the event/warning immediately before the stop.

Two of these stand out as the most important in the presence of all other variables in the model: Factor D of the second event/warning (ppfd, p = 0.018) and the Code of the event/warning immediately before the stop (pc, p < 0.001).
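Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the estimates printed in the summary above; the values are typed in by hand and the vector name `coefs` is introduced only for this illustration. Since y is TRUE for a manual stop, an odds ratio above 1 points toward a manual stop and below 1 toward a glitch.

```r
# Estimates copied from the logistic regression summary above
coefs <- c(FactorBGGS                 = -4.689e-02,
           ManualStop.during.Visityes =  1.609e+01,
           FactorA                    =  8.306e-02,
           ppfd                       =  3.335e-01,
           pc                         = -2.770e-05)

# Odds ratio per one-unit increase in each variable
odds_ratio <- exp(coefs)
round(odds_ratio[c("FactorA", "ppfd")], 3)

# Code values are large numbers, so a step of 1,000 code units is easier to interpret
or_pc_per_1000 <- exp(1000 * coefs[["pc"]])
round(or_pc_per_1000, 3)
```

The very large estimate and standard error for ManualStop.during.Visit (16.09 with a standard error of 429.8) mean its odds ratio is not estimated reliably, so it is best left out of this kind of reading.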

Applying the trained model to the test set and printing the misclassification error.

#Applying the trained model to the test set
pred = predict(glmmodel,newdata=test.sample,type="response")
#Including library for misclassification error function
library(InformationValue) 
misClassError(test.sample$y,pred)

## [1] 0.1775

This model achieves a misclassification error of 17.75% on the test set, that is, an accuracy of 82.25%. In other words, approximately 4 out of 5 test-set stops are classified correctly as either a manual stop or a glitch.
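Because the classes are unbalanced (1220 FALSE against 262 TRUE), it may help to compare the reported error with the error of always predicting a non manual stop. The test-set class counts are not printed in this report, so the sketch below reconstructs them from the 70/30 split and the class totals. That reconstruction is an assumption, and the variable names are introduced only for this illustration.

```r
# Class totals from the table of the dependent variable
n_manual <- 262    # y == TRUE
n_glitch <- 1220   # y == FALSE

# Rows left for testing after 70% of each class is sampled for training
test_manual <- n_manual - floor(0.7 * n_manual)
test_glitch <- n_glitch - floor(0.7 * n_glitch)

# Error of a rule that labels every test stop as a glitch
baseline_error <- test_manual / (test_manual + test_glitch)
round(baseline_error, 4)
```

Under these assumptions the baseline comes out at about 0.1775, the same figure as the reported error. A cross-tabulation of predicted against actual classes, for example `table(test.sample$y, pred > 0.5)`, would therefore be worth inspecting to see how many manual stops the model actually identifies at the 0.5 cutoff.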

The total time the program took to run is 1663.50/60 = approximately 28 minutes.

proc.time()-ptm

##    user  system elapsed 
## 1656.67    3.57 1663.50

We hope that we provided some insight into the issues you might be facing and thank you very much for your time and the opportunity.

Siemen’s Analytics Contest

Danilo Martinez and Patrick Pwooten, TEAM 29

February 17, 2017
