Executive Summary

This report explores the relationship between selected variables derived from accelerometer data recorded on the belt, forearm, arm, and dumbbell of 6 participants. The goal is to predict the "classe" variable. For the prediction of the 20 predefined test cases, a Support Vector Machine is used and delivers good results.

Data Preparation and Data Exploration

Data Source

More information about the accelerometer data is available from the website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Training data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The files should be made available in the current directory.

## size of pml-training.csv : 19622
## size of pml-testing.csv  : 20

Data Preparation

First, the columns with missing data are eliminated.
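A minimal sketch of how this step might look, assuming the files are read from the working directory and that empty strings and "#DIV/0!" entries are treated as missing; the object names (my_training_raw, my_training) are illustrative, not necessarily those of the original analysis.

# read the raw files from the working directory
my_training_raw <- read.csv( "pml-training.csv", na.strings = c( "NA", "", "#DIV/0!" ) )
my_testing_raw  <- read.csv( "pml-testing.csv",  na.strings = c( "NA", "", "#DIV/0!" ) )

# keep only the columns that contain no missing values
complete_cols <- colSums( is.na( my_training_raw ) ) == 0
my_training   <- my_training_raw[ , complete_cols ]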

## Selected Variables:
## --------------------
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

The target variable "classe" is decomposed into 5 new numerical variables "classe_A", "classe_B", "classe_C", "classe_D", and "classe_E", which are used for the classification.
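A minimal sketch of one way to create these indicator columns (one numeric 0/1 column per level of "classe"); the exact construction is an assumption:

# one numeric 0/1 indicator column per level of "classe"
for ( lvl in c( "A", "B", "C", "D", "E" ) ) {
  my_training[[ paste0( "classe_", lvl ) ]] <- as.numeric( my_training$classe == lvl )
}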

Data Exploration

Parallel coordinates of explanatory variables

By way of illustration, the explanatory variables are shown in a parallel-coordinates diagram to look for positive and negative correlations as well as groupings.
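Such a plot could, for example, be produced with GGally::ggparcoord; the package choice and the column index range are assumptions:

library( GGally )
# parallel coordinates of the sensor variables, coloured by "classe"
GGally::ggparcoord( data = my_training,
                    columns = 8:59,            # sensor columns (illustrative index range)
                    groupColumn = "classe",
                    scale = "uniminmax",
                    alphaLines = 0.1 )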

Correlations

Here is an overview of the correlations between the explanatory variables and the target variables.
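A sketch of how such an overview might be computed; the use of the corrplot package is an assumption:

library( corrplot )
# correlation matrix of the numeric explanatory variables
corr_matrix <- cor( my_training[ , 8:59 ] )
corrplot::corrplot( corr_matrix, method = "color", type = "upper", tl.cex = 0.5 )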

Data for Training, Validation and Test

## training data   : 14718 observations / 65 variables
## validation data : 4904 observations / 65 variables
## test data       : 20 observations / 60 variables
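The 75/25 split of the 19622 observations into 14718 training and 4904 validation cases could, for example, be produced with caret::createDataPartition; the seed and object names are illustrative:

library( caret )
set.seed( 12345 )                                # illustrative seed
inTrain       <- caret::createDataPartition( y = my_training$classe, p = 0.75, list = FALSE )
my_train      <- my_training[  inTrain, ]
my_validation <- my_training[ -inTrain, ]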

Modeling

The classification of the target variable "classe" via "classe_A", "classe_B", "classe_C", "classe_D", and "classe_E" is made using the Support Vector Machine method. Each level of the characteristic "classe" is classified through a separate column, created as a numeric value: the larger the value, the more likely the assignment to that level, and the smaller the value, the less likely.

Cross Validation

The training data is divided into 5 subsets.

## Size of 5-folder subsets:
## 
##    1    2    3    4    5 
## 2943 2944 2943 2943 2945
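Fold sizes like these could be obtained, for example, with caret::createFolds (shown only as a sketch; the tune.svm call below performs its own internal cross-validation):

# split the training data into 5 folds of roughly equal size
folds <- caret::createFolds( my_train$classe, k = 5, list = FALSE )
table( folds )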

Classification of variable "classe_A"

The first new column, "classe_A", is created as a numeric value. The procedure follows the "one vs. all" scheme: the level "A" is classified against all other levels of the target variable "classe", and for this purpose the new target variable "classe_A" is introduced. The larger the value, the more likely the assignment to "A", and the smaller the value, the less likely.

The function "tune.svm" allows the implicit use of cross-validation, which is used to determine the best model.
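A sketch of how the tuning object tune.resA used below might be produced with the e1071 package; the parameter grid is purely illustrative:

library( e1071 )
# tune an SVM for the one-vs-all target "classe_A" with 5-fold cross-validation
tune.resA <- e1071::tune.svm( classe_A ~ . - classe - classe_B - classe_C - classe_D - classe_E,
                              data        = my_train,
                              gamma       = c( 0.01, 0.1 ),   # illustrative grid
                              cost        = c( 1, 10 ),
                              tunecontrol = e1071::tune.control( sampling = "cross", cross = 5 ) )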

svmfitA <- tune.resA$best.model
bestPerformA <- tune.resA$best.performance
## sampling                 : 5-fold cross validation
## performance - error      : 0.0180137872199773
## performance - dispersion : 0.00151141950524768
## Error estimation of 'svm' using 5-fold cross validation: 0.0180137872199773
VmodfitA <- predict( svmfitA, newdata=my_validation, se.fit = TRUE, interval = "confidence" )
## predict test data
TmodFitA <- predict( svmfitA, newdata=my_test, se.fit = TRUE, interval = "confidence" )
## knitr::kable( TmodFitA )
## 
## Error estimation of 'svm' using 5-fold cross validation: 0.01801379

Classification of variable "classe_B"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0324431580720624
## performance - dispersion : 0.00410400815241028
## Error estimation of 'svm' using 5-fold cross validation: 0.0324431580720624

Classification of variable "classe_C"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0398682449695015
## performance - dispersion : 0.00288347762284562
## Error estimation of 'svm' using 5-fold cross validation: 0.0398682449695015

Classification of variable "classe_D"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0311277316758964
## performance - dispersion : 0.00114234089713203
## Error estimation of 'svm' using 5-fold cross validation: 0.0311277316758964

Classification of variable "classe_E"

Analogous to the classification of the variable "classe_A".

## sampling                 : 5-fold cross validation
## performance - error      : 0.0142034018396443
## performance - dispersion : 0.00139332017418986
## Error estimation of 'svm' using 5-fold cross validation: 0.0142034018396443

Interpretation and Visualization

Apply ML-algorithm to validation cases

Using the confusion matrix, the quality of the classification is assessed on the validation data.
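The single predicted label Vclasse compared against the true classes below could be derived by taking, for each validation case, the level with the largest of the five numeric predictions; the object names VmodfitB to VmodfitE are assumed analogues of VmodfitA:

# combine the five one-vs-all predictions, one row per validation case
V_fit   <- cbind( VmodfitA, VmodfitB, VmodfitC, VmodfitD, VmodfitE )
# predicted label = level with the largest value in each row
Vclasse <- c( "A", "B", "C", "D", "E" )[ max.col( V_fit ) ]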

table( as.factor( my_validation$classe ) )
## 
##    A    B    C    D    E 
## 1395  949  855  804  901
## Validation subset: residuals
##             Min. 1st Qu.  Median   Mean 3rd Qu.   Max.
## classe_A -0.8985 -0.0367 -0.0011 0.0039  0.0352 0.8957
## classe_B -0.9003 -0.0348  0.0016 0.0259  0.0349 0.9756
## classe_C -1.0411 -0.0377  0.0002 0.0143  0.0378 0.9322
## classe_D -1.0316 -0.0296 -0.0013 0.0216  0.0285 1.0235
## classe_E -0.3727 -0.0264 -0.0003 0.0140  0.0250 1.0365

## Error estimation of the Support Vector Machine using 5-fold cross validation:
## ---------------------------------------------------------------------------
## classe_A : 0.01801
## classe_B : 0.03244
## classe_C : 0.03987
## classe_D : 0.03113
## classe_E : 0.0142
caret::confusionMatrix( data=my_validation$classe, reference=as.factor(Vclasse) )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1391    0    4    0    0
##          B   69  842   36    2    0
##          C    1   22  822   10    0
##          D    1    6   95  702    0
##          E    0    1   12   13  875
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9445          
##                  95% CI : (0.9378, 0.9508)
##     No Information Rate : 0.2981          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9297          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9514   0.9667   0.8483   0.9656   1.0000
## Specificity            0.9988   0.9735   0.9916   0.9756   0.9935
## Pos Pred Value         0.9971   0.8872   0.9614   0.8731   0.9711
## Neg Pred Value         0.9798   0.9927   0.9637   0.9939   1.0000
## Prevalence             0.2981   0.1776   0.1976   0.1482   0.1784
## Detection Rate         0.2836   0.1717   0.1676   0.1431   0.1784
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9751   0.9701   0.9200   0.9706   0.9968

The key figures Accuracy = 0.9445 and Kappa = 0.9297, as well as the sensitivity and specificity values, are very high for the validation data, so the classification can be applied to the test data.

Apply ML-algorithm to 20 test cases

The "classe" label is assigned by determining which of the 5 values within a row is the largest (the level with the maximum value wins). The magnitudes let the reader see immediately how robust or uncertain the classification is: the more clearly one value stands out, the more robust the classification; the more evenly the values within a row are distributed, the more uncertain the assignment.
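A sketch of how the table below could be assembled; the robustness measure shown here (the winning value's share of the row total, which closely reproduces the reported figures) is an interpretation, and TmodFitB to TmodFitE are assumed analogues of TmodFitA:

# five one-vs-all predictions for the 20 test cases, one row per case
T_fit      <- cbind( TmodFitA, TmodFitB, TmodFitC, TmodFitD, TmodFitE )
classe_hat <- c( "A", "B", "C", "D", "E" )[ max.col( T_fit ) ]
# robustness: share of the winning value in the row total (interpretation of the reported score)
robustness <- apply( T_fit, 1, max ) / rowSums( T_fit )
df_t_fit   <- data.frame( round( T_fit, 3 ), classe = classe_hat, robustness = round( robustness, 2 ) )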

Presentation as a table
| classe_A | classe_B | classe_C | classe_D | classe_E | classe | robustness |
|---------:|---------:|---------:|---------:|---------:|:------:|-----------:|
| 0.228 | 0.524 | 0.249 | 0.035 | 0.099 | B | 0.46 |
| 0.798 | 0.096 | 0.141 | 0.106 | 0     | A | 0.7  |
| 0.369 | 0.153 | 0.39  | 0.106 | 0.103 | C | 0.35 |
| 0.9   | 0.112 | 0.073 | 0.065 | 0.082 | A | 0.73 |
| 0.942 | 0.086 | 0.114 | 0.075 | 0.048 | A | 0.74 |
| 0.053 | 0.048 | 0.275 | 0.044 | 0.554 | E | 0.57 |
| 0.046 | 0.1   | 0.137 | 0.883 | 0.042 | D | 0.73 |
| 0.115 | 0.746 | 0.108 | 0.14  | 0.055 | B | 0.64 |
| 0.946 | 0.055 | 0.086 | 0.092 | 0.081 | A | 0.75 |
| 0.924 | 0.082 | 0.107 | 0.076 | 0.051 | A | 0.75 |
| 0.229 | 0.369 | 0.143 | 0.273 | 0.025 | B | 0.36 |
| 0.179 | 0.083 | 0.6   | 0.096 | 0.043 | C | 0.6  |
| 0.101 | 0.92  | 0.046 | 0.079 | 0.095 | B | 0.74 |
| 0.951 | 0.057 | 0.09  | 0.056 | 0.046 | A | 0.79 |
| 0.076 | 0.127 | 0.099 | 0.076 | 0.852 | E | 0.69 |
| 0.11  | 0.076 | 0.027 | 0.052 | 0.944 | E | 0.78 |
| 0.915 | 0.076 | 0.073 | 0.081 | 0.116 | A | 0.73 |
| 0.138 | 0.838 | 0.095 | 0.102 | 0.106 | B | 0.66 |
| 0.084 | 0.959 | 0.037 | 0.043 | 0.069 | B | 0.8  |
| 0.046 | 1     | 0.077 | 0.065 | 0.062 | B | 0.8  |

Notice: The classification of test cases 3 and 11 does not seem to be robust; their robustness values are relatively low at 0.35 and 0.36. In particular, in test case 3 the classification as "C" with 0.39 against "A" with 0.37 and "B" with 0.15 must be described as weak.

Presentation as Heatmap

In the graphical representation of the classification, the "weak" values of test cases 3 and 11 become immediately visible.
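A minimal sketch of such a heatmap with ggplot2::geom_tile, assuming the 20 x 5 prediction values from the sketch above (T_fit) are available; column and object names are illustrative:

library( ggplot2 )
library( tidyr )
# bring the 20 x 5 prediction matrix into long format and plot it as a heatmap
heat_df <- data.frame( case = 1:20, T_fit ) %>%
  tidyr::pivot_longer( -case, names_to = "classe", values_to = "value" )

ggplot2::ggplot( heat_df, aes( x = classe, y = factor( case ), fill = value ) ) +
  geom_tile() +
  ylab( "test case" ) +
  ggtitle( "Classification values per test case" )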

Visualization of the robustness for the classifiers
Visualization for test set
## [1] "Visualization of robustness ( Juli 27 2019 )"

Two records with weak robustness stood out in the classification of the test data. The following graphics show how the robustness of the classification behaves for the individual levels of the variable "classe".

robust_scatter <- df_robust %>%
  ggplot2::ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() +
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) +
  xlim( c(0,1) ) + ylim( c(0,1) )

plotly::ggplotly( robust_scatter )
Visualization for validation set

The set of test records is too small to recognize a pattern, so we turn to the validation data. Interestingly, there is a noticeable number of classifications with robustness below 0.5 for all levels.

The interactive rendering via the "plotly" package makes it possible to look deeper into the set of data points.

robust_scatter <- df_V_robust %>%
  ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() + 
##  geom_smooth() + 
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) + xlim( c(0,1) ) + ylim( c(0,1) )

ggplotly( robust_scatter )

The patterns look similar for all levels of the variable "classe".

robust_scatter <- df_V_robust %>%
  ggplot2::ggplot( aes( x = classify, y = robustness, color = classe ) ) + 
  geom_point() +
  facet_grid( . ~ classe ) +
  ggtitle( "Visualization of robustness" ) +
  xlab( "classify" ) + ylab( "robustness" ) + xlim( c(0, 1) ) + ylim( c(0, 1) )

plotly::ggplotly( robust_scatter )

The box plots of robustness per level show that the classification of "C" falls noticeably behind the others.

robust_boxplot <- df_V_robust %>%
  ggplot2::ggplot( aes( x = 1, y = robustness, color = classe ) ) + 
  geom_boxplot() +
  facet_grid( . ~ classe ) +
  ggtitle( "Visualization of robustness" ) +
  theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank())

plotly::ggplotly( robust_boxplot )