Prediction-Assignment

INTRODUCTION

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The data set for this analysis is from:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science. , pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

METHODS

This project uses data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. This data is used to predict whether participants did the exercise correctly or incorrectly.

This project uses two datsets: a training dataset with 19,622 observations of 53 variables, and a testing dataset of 20 observations of 53 variables. The data sets were cleaned up to removed variables that most of the observations missing.

    setwd("C:/Users/Deborah Passey/Desktop")
    
    pml_training <- read.csv("pml-training.csv", na.strings = c("","NA","#DIV/0!"))        
    pml_testing <- read.csv("pml-testing.csv", na.strings = c("","NA","#DIV/0!"))  
    
    training <- pml_training[lapply(pml_training, function(x) sum(is.na(x)) / length(x) ) < 0.05]
    testing <- pml_testing[lapply(pml_testing, function(x) sum(is.na(x)) / length(x)) < 0.05]

    training_data <- training[,-c(1:7)]
    testing_data <- testing[,-c(1:7)]  
    
    training_data$classe <- unclass(training_data$classe)
    testing_data$classe <- unclass(testing_data$classe)
    
    inTrain <- createDataPartition(y=training_data$classe, p=0.7,list = FALSE)
    train <- training_data[inTrain,] 
    test <- training_data[-inTrain,]

Data Summary for Participants

In order to cut down on processing time and predictors, the dataset was trimmed down to a select number of variables. The research from Ugulino et al (2012) was used a selection algorithm to identify 16 variables that are potentially the best at predicting whether participants did the exercise correctly: (1) Sensor on the belt - acceleration, pitch, yaw, and roll, (2) Sensor on the arm - acceleration, pitch, yaw, and roll, (3) Sensor on the forearm - acceleration, pitch, yaw, and roll, and (4) Sensor on the dumbbell - accleraton, pitch, yaw, and roll.
The data table below shows the average for each of the six participants. The table reports the average of the Euler angles: roll, pitch and yaw, and accelerometer (accel) data for the arm, dumbbell, forearm, and belt.

    means <- training[,c("user_name", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")]
    dataset <- means %>% group_by(user_name) %>% summarise_each(funs(mean))
    summ <- kable(dataset) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
    summ

user_name	roll_belt	roll_arm	roll_dumbbell	roll_forearm	pitch_belt	pitch_arm	pitch_dumbbell	pitch_forearm	yaw_belt	yaw_arm	yaw_dumbbell	yaw_forearm	total_accel_arm	total_accel_belt	total_accel_dumbbell	total_accel_forearm
adelmo	128.2869990	-15.70269	61.688812	0.000000	-41.396737	5.204404	-22.903124	0.000000	156.7731244	-26.316871	-87.25352	0.00000	22.23124	18.129240	18.411870	40.83505
carlitos	1.1677410	50.81837	28.207232	39.716886	6.582005	-13.245909	3.884116	12.446048	-93.0259961	4.276141	38.95589	15.42912	26.66388	2.997108	7.054627	35.08580
charles	122.0277149	-45.72347	-3.803530	-11.705710	15.762226	-12.492098	-7.397196	24.483634	7.3655062	-31.928391	31.64890	-64.17740	22.21352	18.391120	5.337104	30.01103
eurico	-0.0102345	90.25534	6.832875	-1.639225	3.358449	-6.694892	6.481118	4.401209	-87.2585668	46.479805	33.62308	38.14837	28.44072	3.242020	9.288925	34.25896
jeremy	0.7217578	0.00000	61.315148	108.166149	4.103906	0.000000	-47.904785	0.125050	-87.5118460	0.000000	-73.20845	87.32715	30.23633	4.422693	29.922399	33.36655
pedro	125.2708812	52.65374	-29.186737	83.750153	25.523372	-1.840647	13.321590	27.137759	-0.3736169	18.077609	109.24753	54.25502	23.87088	19.946360	10.108812	33.81609

RESULTS

Three models were fit with the training data set: (1) gradient boosting model (GBM), (2) linear model, and (3) random forest. To increase repoducibility, the “set.seed(150)” function was used for each model. The GBM used a k-fold cross-validation, where the dataset is split into k-subsets. Each subset is held out while the model is trained on the other subsets. This process is completed to determine accuracy for each of the datasets, and an overall accuracy estimate is provided. The linear model and random forest models were fit with “classe” as the outcome and 16 variables as predictors.

## GBM Model
  set.seed(200)
  control <- trainControl(method = "cv", number = 5)
  gbm_model <- train(classe ~ ., method = "gbm", data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")], trControl= control, verbose=FALSE)

## Linear Model 
  set.seed(200)
  lm_model <‐ train(classe ~.,data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")],method="lm")
 
## Random Forest
  set.seed(200)
  control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
  rf_model <- train(classe ~ ., data = train[,c("classe", "roll_belt","roll_arm", "roll_dumbbell", "roll_forearm", "pitch_belt", "pitch_arm", "pitch_dumbbell", "pitch_forearm", "yaw_belt", "yaw_arm", "yaw_dumbbell", "yaw_forearm","total_accel_arm", "total_accel_belt", "total_accel_dumbbell","total_accel_forearm")], method = "rf", ntree = 10, trControl = control, verbose=FALSE)

  trellis.par.set(caretTheme())
  plot(gbm_model)

  summary(gbm_model$finalModel)

##                                       var   rel.inf
## roll_belt                       roll_belt 33.752094
## pitch_forearm               pitch_forearm 15.998537
## roll_forearm                 roll_forearm 14.699880
## yaw_belt                         yaw_belt  8.675100
## pitch_belt                     pitch_belt  7.445905
## total_accel_belt         total_accel_belt  2.928083
## yaw_dumbbell                 yaw_dumbbell  2.622590
## yaw_arm                           yaw_arm  2.502491
## roll_arm                         roll_arm  2.363111
## roll_dumbbell               roll_dumbbell  1.870073
## total_accel_arm           total_accel_arm  1.497838
## pitch_dumbbell             pitch_dumbbell  1.221761
## pitch_arm                       pitch_arm  1.187460
## total_accel_dumbbell total_accel_dumbbell  1.135138
## total_accel_forearm   total_accel_forearm  1.069536
## yaw_forearm                   yaw_forearm  1.030404

  gbm_predict <- predict(gbm_model, testing_data)
  gbm_prediction <- chartr("12345", "ABCDE", round(gbm_predict, digits=0))
  
  summary(lm_model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7241 -0.9839 -0.0674  0.8922  3.7157 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.601e-01  7.944e-02   4.533 5.87e-06 ***
## roll_belt            -4.737e-03  1.216e-03  -3.896 9.82e-05 ***
## roll_arm              1.657e-03  2.010e-04   8.243  < 2e-16 ***
## roll_dumbbell         7.636e-04  2.172e-04   3.516  0.00044 ***
## roll_forearm          9.796e-04  1.145e-04   8.552  < 2e-16 ***
## pitch_belt           -1.392e-02  1.482e-03  -9.393  < 2e-16 ***
## pitch_arm            -7.721e-03  3.767e-04 -20.497  < 2e-16 ***
## pitch_dumbbell       -5.788e-04  4.293e-04  -1.348  0.17767    
## pitch_forearm         1.855e-02  4.633e-04  40.032  < 2e-16 ***
## yaw_belt             -6.957e-03  5.272e-04 -13.195  < 2e-16 ***
## yaw_arm               7.767e-04  1.752e-04   4.433 9.36e-06 ***
## yaw_dumbbell         -2.753e-03  2.288e-04 -12.031  < 2e-16 ***
## yaw_forearm          -8.186e-04  1.291e-04  -6.343 2.33e-10 ***
## total_accel_arm      -5.045e-03  1.131e-03  -4.461 8.21e-06 ***
## total_accel_belt      1.071e-01  8.423e-03  12.712  < 2e-16 ***
## total_accel_dumbbell  7.404e-05  1.537e-03   0.048  0.96158    
## total_accel_forearm   3.604e-02  1.233e-03  29.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.299 on 13720 degrees of freedom
## Multiple R-squared:  0.2293, Adjusted R-squared:  0.2284 
## F-statistic: 255.2 on 16 and 13720 DF,  p-value: < 2.2e-16

  lm_predict <‐ predict(lm_model,testing_data)
  lm_prediction <- chartr("12345", "ABCDE", round(lm_predict, digits=0))

  summary(rf_model$finalModel)

##                 Length Class      Mode     
## call                6  -none-     call     
## type                1  -none-     character
## predicted       13737  -none-     numeric  
## mse                10  -none-     numeric  
## rsq                10  -none-     numeric  
## oob.times       13737  -none-     numeric  
## importance         16  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             11  -none-     list     
## coefs               0  -none-     NULL     
## y               13737  -none-     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             16  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           1  -none-     logical  
## param               2  -none-     list

  rf_predict <‐  predict(rf_model,testing_data)
  rf_prediction <- chartr("12345", "ABCDE", round(rf_predict, digits=0))  
  
  ## Comparisons
  conf.matrix <- round(prop.table(table(lm_prediction, gbm_prediction, rf_prediction), 3), 3)
  conf.matrix

## , , rf_prediction = A
## 
##              gbm_prediction
## lm_prediction     A     B     C     D     E
##             A 0.143 0.000 0.000 0.000 0.000
##             B 0.286 0.286 0.000 0.000 0.000
##             C 0.143 0.143 0.000 0.000 0.000
##             D 0.000 0.000 0.000 0.000 0.000
## 
## , , rf_prediction = B
## 
##              gbm_prediction
## lm_prediction     A     B     C     D     E
##             A 0.000 0.000 0.000 0.000 0.000
##             B 0.000 0.250 0.000 0.000 0.000
##             C 0.000 0.125 0.375 0.000 0.000
##             D 0.000 0.000 0.250 0.000 0.000
## 
## , , rf_prediction = C
## 
##              gbm_prediction
## lm_prediction     A     B     C     D     E
##             A 0.000 0.000 0.000 0.000 0.000
##             B 0.000 0.000 0.000 0.000 0.000
##             C 0.000 0.000 1.000 0.000 0.000
##             D 0.000 0.000 0.000 0.000 0.000
## 
## , , rf_prediction = D
## 
##              gbm_prediction
## lm_prediction     A     B     C     D     E
##             A 0.000 0.000 0.000 0.000 0.000
##             B 0.000 0.000 0.000 0.000 0.000
##             C 0.000 0.000 0.000 1.000 0.000
##             D 0.000 0.000 0.000 0.000 0.000
## 
## , , rf_prediction = E
## 
##              gbm_prediction
## lm_prediction     A     B     C     D     E
##             A 0.000 0.000 0.000 0.000 0.000
##             B 0.000 0.000 0.333 0.000 0.000
##             C 0.000 0.000 0.000 0.333 0.333
##             D 0.000 0.000 0.000 0.000 0.000

CONCLUSIONS

The predicted cases for all three models is found below. The random forest model performed well and was used to predict the 20 cases for the quiz.

  prediction_results <- cbind(gbm_prediction, lm_prediction, rf_prediction)
  kable(prediction_results) %>%
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

gbm_prediction	lm_prediction	rf_prediction
C	D	B
A	C	A
B	B	B
B	B	A
B	B	A
E	C	E
D	C	D
C	D	B
A	A	A
B	C	A
C	C	B
C	C	C
C	C	B
A	B	A
D	C	E
C	B	E
A	B	A
C	C	B
B	C	B
B	B	B

gbm_prediction	lm_prediction	rf_prediction
C	D	B
A	C	A
B	B	B
B	B	A
B	B	A
E	C	E
D	C	D
C	D	B
A	A	A
B	C	A
C	C	B
C	C	C
C	C	B
A	B	A
D	C	E
C	B	E
A	B	A
C	C	B
B	C	B
B	B	B

gbm_prediction	lm_prediction	rf_prediction
C	D	B
A	C	A
B	B	B
B	B	A
B	B	A
E	C	E
D	C	D
C	D	B
A	A	A
B	C	A
C	C	B
C	C	C
C	C	B
A	B	A
D	C	E
C	B	E
A	B	A
C	C	B
B	C	B
B	B	B

Prediction-Assignment

Deborah Passey

7/28/2019

gbm_prediction	lm_prediction	rf_prediction
C	D	B
A	C	A
B	B	B
B	B	A
B	B	A
E	C	E
D	C	D
C	D	B
A	A	A
B	C	A
C	C	B
C	C	C
C	C	B
A	B	A
D	C	E
C	B	E
A	B	A
C	C	B
B	C	B
B	B	B