Executive Summary

This report explores the relationship between selected variables derived from accelerometer data recorded on the belt, forearm, arm, and dumbbell of 6 participants. The goal is to predict the "classe" variable. For the prediction of the 20 predefined test cases, a Support Vector Machine is used and delivers good results.

Data Preparation and Data Exploration

Data Source

More information about the accelerometer data is available from the website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The files should be made available in the current directory.

## size of pml-training.csv : 19622
## size of pml-testing.csv  : 20

Data Preparation

First, the columns with missing data are eliminated.
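A minimal sketch of how this step might look, assuming the files are read from the working directory and that empty strings and "#DIV/0!" entries are treated as missing; the object names (my_training_raw, my_training) are illustrative, not necessarily those of the original analysis.

# read the raw files from the working directory
my_training_raw <- read.csv( "pml-training.csv", na.strings = c( "NA", "", "#DIV/0!" ) )
my_testing_raw  <- read.csv( "pml-testing.csv",  na.strings = c( "NA", "", "#DIV/0!" ) )

# keep only the columns that contain no missing values
complete_cols <- colSums( is.na( my_training_raw ) ) == 0
my_training   <- my_training_raw[ , complete_cols ]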

## Selected Variables:
## --------------------
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

The target variable "classe" is decomposed into 5 new numerical variables "classe_A", "classe_B", "classe_C", "classe_D", and "classe_E", which are used for the classification.
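A minimal sketch of one way to create these indicator columns (one numeric 0/1 column per level of "classe"); the exact construction is an assumption:

# one numeric 0/1 indicator column per level of "classe"
for ( lvl in c( "A", "B", "C", "D", "E" ) ) {
  my_training[[ paste0( "classe_", lvl ) ]] <- as.numeric( my_training$classe == lvl )
}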

Data Exploration

Parallel coordinates of explanatory variables

By way of illustration, the explanatory variables are shown in a parallel-coordinates diagram to look for positive and negative correlations as well as groupings.
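Such a plot could, for example, be produced with GGally::ggparcoord; the package choice and the column index range are assumptions:

library( GGally )
# parallel coordinates of the sensor variables, coloured by "classe"
GGally::ggparcoord( data = my_training,
                    columns = 8:59,            # sensor columns (illustrative index range)
                    groupColumn = "classe",
                    scale = "uniminmax",
                    alphaLines = 0.1 )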

Correlations

Here is an overview of the correlations between the explanatory variables and the target variables.
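A sketch of how such an overview might be computed; the use of the corrplot package is an assumption:

library( corrplot )
# correlation matrix of the numeric explanatory variables
corr_matrix <- cor( my_training[ , 8:59 ] )
corrplot::corrplot( corr_matrix, method = "color", type = "upper", tl.cex = 0.5 )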

Data for Training, Validation and Test

## training data   : 14718 observations / 65 variables
## validation data : 4904 observations / 65 variables
## test data       : 20 observations / 60 variables
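The 75/25 split of the 19622 observations into 14718 training and 4904 validation cases could, for example, be produced with caret::createDataPartition; the seed and object names are illustrative:

library( caret )
set.seed( 12345 )                                # illustrative seed
inTrain       <- caret::createDataPartition( y = my_training$classe, p = 0.75, list = FALSE )
my_train      <- my_training[  inTrain, ]
my_validation <- my_training[ -inTrain, ]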

Modeling

The classification of the target variable "classe" via "classe_A", "classe_B", "classe_C", "classe_D", and "classe_E" is made using the Support Vector Machine method. Each level of the characteristic "classe" is classified through a separate column, created as a numeric value: the larger the value, the more likely the assignment to that level, and the smaller the value, the less likely.

Cross Validation

The training data is divided into 5 subsets.

## Size of 5-folder subsets:
## 
##    1    2    3    4    5 
## 2943 2944 2943 2943 2945
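Fold sizes like these could be obtained, for example, with caret::createFolds (shown only as a sketch; the tune.svm call below performs its own internal cross-validation):

# split the training data into 5 folds of roughly equal size
folds <- caret::createFolds( my_train$classe, k = 5, list = FALSE )
table( folds )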

Classification of variable "classe_A"

The first new column, "classe_A", is created as a numeric value. The procedure follows the "one vs. all" scheme: the level "A" is classified against all other levels of the target variable "classe", and for this purpose the new target variable "classe_A" is introduced. The larger the value, the more likely the assignment to "A", and the smaller the value, the less likely.

The function "tune.svm" allows the implicit use of cross-validation, which is used to determine the best model.
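A sketch of how the tuning object tune.resA used below might be produced with the e1071 package; the parameter grid is purely illustrative:

library( e1071 )
# tune an SVM for the one-vs-all target "classe_A" with 5-fold cross-validation
tune.resA <- e1071::tune.svm( classe_A ~ . - classe - classe_B - classe_C - classe_D - classe_E,
                              data        = my_train,
                              gamma       = c( 0.01, 0.1 ),   # illustrative grid
                              cost        = c( 1, 10 ),
                              tunecontrol = e1071::tune.control( sampling = "cross", cross = 5 ) )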

svmfitA <- tune.resA$best.model
bestPerformA <- tune.resA$best.performance
## sampling                 : 5-fold cross validation
## performance - error      : 0.0180137872199773
## performance - dispersion : 0.00151141950524768
## Error estimation of 'svm' using 5-fold cross validation: 0.0180137872199773
VmodfitA <- predict( svmfitA, newdata=my_validation, se.fit = TRUE, interval = "confidence" )
## predict test data
TmodFitA <- predict( svmfitA, newdata=my_test, se.fit = TRUE, interval = "confidence" )
## knitr::kable( TmodFitA )
## 
## Error estimation of 'svm' using 5-fold cross validation: 0.01801379

Classification of variable "classe_B"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0324431580720624
## performance - dispersion : 0.00410400815241028
## Error estimation of 'svm' using 5-fold cross validation: 0.0324431580720624

Classification of variable "classe_C"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0398682449695015
## performance - dispersion : 0.00288347762284562
## Error estimation of 'svm' using 5-fold cross validation: 0.0398682449695015

Classification of variable "classe_D"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0311277316758964
## performance - dispersion : 0.00114234089713203
## Error estimation of 'svm' using 5-fold cross validation: 0.0311277316758964

Classification of variable "classe_E"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0142034018396443
## performance - dispersion : 0.00139332017418986
## Error estimation of 'svm' using 5-fold cross validation: 0.0142034018396443

Interpretation and Visualization

Apply ML-algorithm to validation cases

Using the confusion matrix, the quality of the classification is assessed on the validation data.
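The single predicted label Vclasse compared against the true classes below could be derived by taking, for each validation case, the level with the largest of the five numeric predictions; the object names VmodfitB to VmodfitE are assumed analogues of VmodfitA:

# combine the five one-vs-all predictions, one row per validation case
V_fit   <- cbind( VmodfitA, VmodfitB, VmodfitC, VmodfitD, VmodfitE )
# predicted label = level with the largest value in each row
Vclasse <- c( "A", "B", "C", "D", "E" )[ max.col( V_fit ) ]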

table( as.factor( my_validation$classe ) )
## 
##    A    B    C    D    E 
## 1395  949  855  804  901
## Validation subset: residuals
##             Min. 1st Qu.  Median   Mean 3rd Qu.   Max.
## classe_A -0.8985 -0.0367 -0.0011 0.0039  0.0352 0.8957
## classe_B -0.9003 -0.0348  0.0016 0.0259  0.0349 0.9756
## classe_C -1.0411 -0.0377  0.0002 0.0143  0.0378 0.9322
## classe_D -1.0316 -0.0296 -0.0013 0.0216  0.0285 1.0235
## classe_E -0.3727 -0.0264 -0.0003 0.0140  0.0250 1.0365

## Error estimation of the Support Vector Machine using 5-fold cross validation:
## ---------------------------------------------------------------------------
## classe_A : 0.01801
## classe_B : 0.03244
## classe_C : 0.03987
## classe_D : 0.03113
## classe_E : 0.0142
caret::confusionMatrix( data=my_validation$classe, reference=as.factor(Vclasse) )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    0    4    0    0
##          B   69  842   36    2    0
##          C    1   22  822   10    0
##          D    1    6   95  702    0
##          E    0    1   12   13  875
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9445          
##                  95% CI : (0.9378, 0.9508)
##     No Information Rate : 0.2981          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9297          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9514   0.9667   0.8483   0.9656   1.0000
## Specificity            0.9988   0.9735   0.9916   0.9756   0.9935
## Pos Pred Value         0.9971   0.8872   0.9614   0.8731   0.9711
## Neg Pred Value         0.9798   0.9927   0.9637   0.9939   1.0000
## Prevalence             0.2981   0.1776   0.1976   0.1482   0.1784
## Detection Rate         0.2836   0.1717   0.1676   0.1431   0.1784
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9751   0.9701   0.9200   0.9706   0.9968

The key figures Accuracy = 0.9445 and Kappa = 0.9297, as well as the sensitivity and specificity values, are very high for the validation data, so the classification can be applied to the test data.

Apply ML-algorithm to 20 test cases

The "classe" label is assigned by determining which of the 5 values within a row is the largest (the level with the maximum value wins). The magnitudes let the reader see immediately how robust or uncertain the classification is: the more clearly one value stands out, the more robust the classification; the more evenly the values within a row are distributed, the more uncertain the assignment.
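A sketch of how the table below could be assembled; the robustness measure shown here (the winning value's share of the row total, which closely reproduces the reported figures) is an interpretation, and TmodFitB to TmodFitE are assumed analogues of TmodFitA:

# five one-vs-all predictions for the 20 test cases, one row per case
T_fit      <- cbind( TmodFitA, TmodFitB, TmodFitC, TmodFitD, TmodFitE )
classe_hat <- c( "A", "B", "C", "D", "E" )[ max.col( T_fit ) ]
# robustness: share of the winning value in the row total (interpretation of the reported score)
robustness <- apply( T_fit, 1, max ) / rowSums( T_fit )
df_t_fit   <- data.frame( round( T_fit, 3 ), classe = classe_hat, robustness = round( robustness, 2 ) )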

Presentation as a table
| classe_A | classe_B | classe_C | classe_D | classe_E | classe | robustness |
|---------:|---------:|---------:|---------:|---------:|:------:|-----------:|
| 0.228 | 0.524 | 0.249 | 0.035 | 0.099 | B | 0.46 |
| 0.798 | 0.096 | 0.141 | 0.106 | 0     | A | 0.7  |
| 0.369 | 0.153 | 0.39  | 0.106 | 0.103 | C | 0.35 |
| 0.9   | 0.112 | 0.073 | 0.065 | 0.082 | A | 0.73 |
| 0.942 | 0.086 | 0.114 | 0.075 | 0.048 | A | 0.74 |
| 0.053 | 0.048 | 0.275 | 0.044 | 0.554 | E | 0.57 |
| 0.046 | 0.1   | 0.137 | 0.883 | 0.042 | D | 0.73 |
| 0.115 | 0.746 | 0.108 | 0.14  | 0.055 | B | 0.64 |
| 0.946 | 0.055 | 0.086 | 0.092 | 0.081 | A | 0.75 |
| 0.924 | 0.082 | 0.107 | 0.076 | 0.051 | A | 0.75 |
| 0.229 | 0.369 | 0.143 | 0.273 | 0.025 | B | 0.36 |
| 0.179 | 0.083 | 0.6   | 0.096 | 0.043 | C | 0.6  |
| 0.101 | 0.92  | 0.046 | 0.079 | 0.095 | B | 0.74 |
| 0.951 | 0.057 | 0.09  | 0.056 | 0.046 | A | 0.79 |
| 0.076 | 0.127 | 0.099 | 0.076 | 0.852 | E | 0.69 |
| 0.11  | 0.076 | 0.027 | 0.052 | 0.944 | E | 0.78 |
| 0.915 | 0.076 | 0.073 | 0.081 | 0.116 | A | 0.73 |
| 0.138 | 0.838 | 0.095 | 0.102 | 0.106 | B | 0.66 |
| 0.084 | 0.959 | 0.037 | 0.043 | 0.069 | B | 0.8  |
| 0.046 | 1     | 0.077 | 0.065 | 0.062 | B | 0.8  |

Notice: The classification of test cases 3 and 11 does not seem to be robust; their robustness values are relatively low at 0.35 and 0.36. In particular, in test case 3 the classification as "C" with 0.39 against "A" with 0.37 and "B" with 0.15 must be described as weak.

Presentation as Heatmap

In the graphical representation of the classification, the "weak" values of test cases 3 and 11 become immediately visible.
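A minimal sketch of such a heatmap with ggplot2::geom_tile, assuming the 20 x 5 prediction values from the sketch above (T_fit) are available; column and object names are illustrative:

library( ggplot2 )
library( tidyr )
# bring the 20 x 5 prediction matrix into long format and plot it as a heatmap
heat_df <- data.frame( case = 1:20, T_fit ) %>%
  tidyr::pivot_longer( -case, names_to = "classe", values_to = "value" )

ggplot2::ggplot( heat_df, aes( x = classe, y = factor( case ), fill = value ) ) +
  geom_tile() +
  ylab( "test case" ) +
  ggtitle( "Classification values per test case" )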

Visualization of the robustness for the classifiers
Visualization for test set
## [1] "Visualization of robustness ( Juli 27 2019 )"

Two records with weak robustness stood out in the classification of the test data. The following graphics show how the robustness of the classification behaves for the individual levels of the variable "classe".

robust_scatter <- df_robust %>%
  ggplot2::ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() +
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) +
  xlim( c(0,1) ) + ylim( c(0,1) )

plotly::ggplotly( robust_scatter )
Visualization for validation set

The set of test records is too small to recognize a pattern, so we turn to the validation data. Interestingly, there is a noticeable number of classifications with robustness below 0.5 for all levels.

The interactive rendering via the "plotly" package makes it possible to look deeper into the set of data points.

robust_scatter <- df_V_robust %>%
  ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() + 
##  geom_smooth() + 
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) + xlim( c(0,1) ) + ylim( c(0,1) )

ggplotly( robust_scatter )

The patterns look similar for all levels of the variable "classe".

robust_scatter <- df_V_robust %>%
  ggplot2::ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() +
  facet_grid( . ~ classe ) +
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) + xlim( c(0, 1) ) + ylim( c(0, 1) )

plotly::ggplotly( robust_scatter )

The box plots of robustness per level show that the classification of "C" falls noticeably behind the others.

robust_boxplot <- df_V_robust %>%
  ggplot2::ggplot( aes( x = 1, y = robustness, color = classe ) ) + 
  geom_boxplot() +
  facet_grid( . ~ classe ) +
  ggtitle( "Visualization of robustness" ) +
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())

plotly::ggplotly( robust_boxplot )