Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv.
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv.
Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged 20 to 28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25 kg).
More information about this project is available from this source: http://groupware.les.inf.puc-rio.br/har.
In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise. This is the “classe” variable in the training set. I can use any of the other variables as predictors. This report describes how I built my model, how I used cross validation, what I think the expected out-of-sample error is, and why I made the choices I did. I will also use the prediction model to predict 20 different test cases.
Let’s load the training and test data into the environment.
train_file_name <- "pml-training.csv"
test_file_name <- "pml-testing.csv"
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Download each file only if it is not already present locally
if (!file.exists(train_file_name)) {
  download.file(train_url, destfile = train_file_name)
}
if (!file.exists(test_file_name)) {
  download.file(test_url, destfile = test_file_name)
}
train_data <- read.csv(train_file_name)
test_data <- read.csv(test_file_name)
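The raw files mark missing values in more than one way: literal NA, empty strings, and spreadsheet error markers such as #DIV/0! (both show up in the str() output further down). As a minimal sketch, and not what this report does, read.csv can map all of these to NA at load time through its na.strings argument. The three-row CSV below is made-up illustrative data, not part of the real files.

```r
# Hypothetical CSV that mimics the missing-value markers seen in the real files.
csv_text <- c(
  "roll_belt,kurtosis_yaw_belt,max_roll_belt",
  "1.41,,NA",
  "1.42,#DIV/0!,NA",
  "1.48,0.5,3.2"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Treat "NA", empty fields, and "#DIV/0!" as missing while reading.
toy <- read.csv(tmp, na.strings = c("NA", "", "#DIV/0!"))

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
unlink(tmp)
```

With this approach the sparse summary columns would load as numeric columns full of NA instead of factors with an empty level, so a single is.na() check would be enough to find them.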
The training and test data have been loaded. Let’s see how many cases there are in total.
dim(train_data)
## [1] 19622 160
dim(test_data)
## [1] 20 160
The training set has 19,622 cases and the test set has 20. Each has 160 columns, so there are 159 candidate predictors for the “classe” variable.
Now, let’s see the structure of the training data.
str(train_data)
## 'data.frame': 19622 obs. of 160 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ user_name : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ raw_timestamp_part_1 : int 1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
## $ cvtd_timestamp : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ new_window : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ num_window : int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt : num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## $ kurtosis_roll_belt : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_belt : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_belt.1 : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_belt : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_belt : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_belt : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_belt : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ var_total_accel_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_belt : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_belt_x : num 0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
## $ gyros_belt_y : num 0 0 0 0 0.02 0 0 0 0 0 ...
## $ gyros_belt_z : num -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
## $ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
## $ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
## $ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
## $ magnet_belt_x : int -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
## $ magnet_belt_y : int 599 608 600 604 600 603 599 603 602 609 ...
## $ magnet_belt_z : int -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
## $ roll_arm : num -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
## $ pitch_arm : num 22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
## $ yaw_arm : num -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
## $ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
## $ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ avg_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ stddev_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ var_yaw_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ gyros_arm_x : num 0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
## $ gyros_arm_y : num 0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
## $ gyros_arm_z : num -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
## $ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
## $ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
## $ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
## $ magnet_arm_x : int -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
## $ magnet_arm_y : int 337 337 344 344 337 342 336 338 341 334 ...
## $ magnet_arm_z : int 516 513 513 512 506 513 509 510 518 516 ...
## $ kurtosis_roll_arm : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_arm : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_arm : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_arm : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_arm : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_arm : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ min_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_roll_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_pitch_arm : num NA NA NA NA NA NA NA NA NA NA ...
## $ amplitude_yaw_arm : int NA NA NA NA NA NA NA NA NA NA ...
## $ roll_dumbbell : num 13.1 13.1 12.9 13.4 13.4 ...
## $ pitch_dumbbell : num -70.5 -70.6 -70.3 -70.4 -70.4 ...
## $ yaw_dumbbell : num -84.9 -84.7 -85.1 -84.9 -84.9 ...
## $ kurtosis_roll_dumbbell : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ kurtosis_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_roll_dumbbell : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ skewness_yaw_dumbbell : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
## $ max_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_picth_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ max_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ min_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_pitch_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## $ min_yaw_dumbbell : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ amplitude_roll_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
## [list output truncated]
The data contains many missing values. Let’s count the missing or empty entries in each column of the training set.
sapply(train_data, function(x) sum(x == "" | is.na(x)))
## X user_name raw_timestamp_part_1
## 0 0 0
## raw_timestamp_part_2 cvtd_timestamp new_window
## 0 0 0
## num_window roll_belt pitch_belt
## 0 0 0
## yaw_belt total_accel_belt kurtosis_roll_belt
## 0 0 19216
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 19216 19216 19216
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt
## 19216 19216 19216
## max_picth_belt max_yaw_belt min_roll_belt
## 19216 19216 19216
## min_pitch_belt min_yaw_belt amplitude_roll_belt
## 19216 19216 19216
## amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt
## 19216 19216 19216
## avg_roll_belt stddev_roll_belt var_roll_belt
## 19216 19216 19216
## avg_pitch_belt stddev_pitch_belt var_pitch_belt
## 19216 19216 19216
## avg_yaw_belt stddev_yaw_belt var_yaw_belt
## 19216 19216 19216
## gyros_belt_x gyros_belt_y gyros_belt_z
## 0 0 0
## accel_belt_x accel_belt_y accel_belt_z
## 0 0 0
## magnet_belt_x magnet_belt_y magnet_belt_z
## 0 0 0
## roll_arm pitch_arm yaw_arm
## 0 0 0
## total_accel_arm var_accel_arm avg_roll_arm
## 0 19216 19216
## stddev_roll_arm var_roll_arm avg_pitch_arm
## 19216 19216 19216
## stddev_pitch_arm var_pitch_arm avg_yaw_arm
## 19216 19216 19216
## stddev_yaw_arm var_yaw_arm gyros_arm_x
## 19216 19216 0
## gyros_arm_y gyros_arm_z accel_arm_x
## 0 0 0
## accel_arm_y accel_arm_z magnet_arm_x
## 0 0 0
## magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 0 0 19216
## kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm
## 19216 19216 19216
## skewness_pitch_arm skewness_yaw_arm max_roll_arm
## 19216 19216 19216
## max_picth_arm max_yaw_arm min_roll_arm
## 19216 19216 19216
## min_pitch_arm min_yaw_arm amplitude_roll_arm
## 19216 19216 19216
## amplitude_pitch_arm amplitude_yaw_arm roll_dumbbell
## 19216 19216 0
## pitch_dumbbell yaw_dumbbell kurtosis_roll_dumbbell
## 0 0 19216
## kurtosis_picth_dumbbell kurtosis_yaw_dumbbell skewness_roll_dumbbell
## 19216 19216 19216
## skewness_pitch_dumbbell skewness_yaw_dumbbell max_roll_dumbbell
## 19216 19216 19216
## max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 19216 19216 19216
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 19216 19216 19216
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 19216 19216 0
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## 19216 19216 19216
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## 19216 19216 19216
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell
## 19216 19216 19216
## var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 19216 0 0
## gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y
## 0 0 0
## accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## 0 0 0
## magnet_dumbbell_z roll_forearm pitch_forearm
## 0 0 0
## yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 0 19216 19216
## kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 19216 19216 19216
## skewness_yaw_forearm max_roll_forearm max_picth_forearm
## 19216 19216 19216
## max_yaw_forearm min_roll_forearm min_pitch_forearm
## 19216 19216 19216
## min_yaw_forearm amplitude_roll_forearm amplitude_pitch_forearm
## 19216 19216 19216
## amplitude_yaw_forearm total_accel_forearm var_accel_forearm
## 19216 0 19216
## avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 19216 19216 19216
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm
## 19216 19216 19216
## avg_yaw_forearm stddev_yaw_forearm var_yaw_forearm
## 19216 19216 19216
## gyros_forearm_x gyros_forearm_y gyros_forearm_z
## 0 0 0
## accel_forearm_x accel_forearm_y accel_forearm_z
## 0 0 0
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## 0 0 0
## classe
## 0
The output shows that many variables have 19,216 missing values each. Since that is about 98% of the 19,622 training rows, I decided not to use those variables in the model. Let’s identify them.
# Return the indices of columns with at least one missing or empty value
findNonPredictors <- function(data) {
  n_missing <- sapply(data, function(x) sum(x == "" | is.na(x)))
  unname(which(n_missing != 0))
}
length(findNonPredictors(train_data))
## [1] 100
In total, 100 variables have a large number of missing values. The reason for this is unknown, but the remaining data are sufficient to predict the “classe” variable.
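The filter used here drops any column with at least one missing entry. A slightly more general variant, sketched below on made-up data, drops columns by the fraction of missing entries instead. The 0.5 threshold and the toy column values are assumptions for illustration only.

```r
# Made-up data frame: two complete columns and two mostly-empty ones.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45, 1.43),
  max_roll_belt = c(NA, NA, NA, NA, 3.2),
  kurtosis_belt = c("", "", "", "", "0.5"),
  classe        = c("A", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)

# Fraction of entries per column that are NA or the empty string.
frac_missing <- sapply(toy, function(col) mean(is.na(col) | col == ""))

# Keep columns below the chosen threshold (0.5 is an arbitrary assumption).
keep <- names(frac_missing)[frac_missing < 0.5]
toy_clean <- toy[, keep, drop = FALSE]
print(frac_missing)
print(names(toy_clean))
```

On the real training set both rules should select the same 100 columns, because every affected column is missing the same 19,216 values.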
Let’s create training and test sets without those variables.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
NotPredictors <- findNonPredictors(train_data)
training <- train_data[,-NotPredictors]
testing <- test_data[,-NotPredictors]
The first few remaining variables do not appear to be informative. The first variable, X, is merely the index of each case; user_name is the name of one of the 6 participants; and the next few variables indicate when and how participants took part in the experiment. Hence, I will exclude the first 7 variables from the model as well.
training <- training[,-(1:7)]
testing <- testing[,-(1:7)]
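Dropping columns by position works here, but it silently depends on the column order. A minimal sketch of the same step done by name is shown below. The column names come from the str() output above; the one-row data frame and its values are made up for illustration.

```r
# Bookkeeping columns listed in the str() output of the raw training data.
id_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window", "num_window")

# Made-up one-row frame with the same leading columns plus one sensor column.
toy <- data.frame(X = 1, user_name = "carlitos",
                  raw_timestamp_part_1 = 1323084231,
                  raw_timestamp_part_2 = 788290,
                  cvtd_timestamp = "05/12/2011 11:23",
                  new_window = "no", num_window = 11,
                  roll_belt = 1.41, classe = "A")

# setdiff keeps every column not named in id_cols, whatever its position.
toy_sensors <- toy[, setdiff(names(toy), id_cols), drop = FALSE]
print(names(toy_sensors))
```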
Now the training data contains only the variables needed for modelling. Let’s see a summary of the training data.
summary(training)
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00
## Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31
## 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y gyros_belt_z
## Min. :-1.040000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.:-0.030000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median : 0.030000 Median : 0.02000 Median :-0.1000
## Mean :-0.005592 Mean : 0.03959 Mean :-0.1305
## 3rd Qu.: 0.110000 3rd Qu.: 0.11000 3rd Qu.:-0.0200
## Max. : 2.220000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
## 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
## Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
## Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
## 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
## Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm
## Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800
## 1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900
## Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000
## Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612
## 3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200
## Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y
## Min. :-180.0000 Min. : 1.00 Min. :-6.37000 Min. :-3.4400
## 1st Qu.: -43.1000 1st Qu.:17.00 1st Qu.:-1.33000 1st Qu.:-0.8000
## Median : 0.0000 Median :27.00 Median : 0.08000 Median :-0.2400
## Mean : -0.6188 Mean :25.51 Mean : 0.04277 Mean :-0.2571
## 3rd Qu.: 45.8750 3rd Qu.:33.00 3rd Qu.: 1.57000 3rd Qu.: 0.1400
## Max. : 180.0000 Max. :66.00 Max. : 4.87000 Max. : 2.8400
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## Min. :-2.3300 Min. :-404.00 Min. :-318.0 Min. :-636.00
## 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0 1st Qu.:-143.00
## Median : 0.2300 Median : -44.00 Median : 14.0 Median : -47.00
## Mean : 0.2695 Mean : -60.24 Mean : 32.6 Mean : -71.25
## 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0 3rd Qu.: 23.00
## Max. : 3.0200 Max. : 437.00 Max. : 308.0 Max. : 292.00
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell
## Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71
## 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49
## Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17
## Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84
## 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61
## Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55
## pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## Min. :-149.59 Min. :-150.871 Min. : 0.00
## 1st Qu.: -40.89 1st Qu.: -77.644 1st Qu.: 4.00
## Median : -20.96 Median : -3.324 Median :10.00
## Mean : -10.78 Mean : 1.674 Mean :13.72
## 3rd Qu.: 17.50 3rd Qu.: 79.643 3rd Qu.:19.00
## Max. : 149.40 Max. : 154.952 Max. :58.00
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## Min. :-204.0000 Min. :-2.10000 Min. : -2.380
## 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
## Median : 0.1300 Median : 0.03000 Median : -0.130
## Mean : 0.1611 Mean : 0.04606 Mean : -0.129
## 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
## Max. : 2.2200 Max. :52.00000 Max. :317.000
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0
## 1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0
## Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0
## Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5
## 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0
## Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## Min. :-3600 Min. :-262.00 Min. :-180.0000 Min. :-72.50
## 1st Qu.: 231 1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00
## Median : 311 Median : 13.00 Median : 21.7000 Median : 9.24
## Mean : 221 Mean : 46.05 Mean : 33.8265 Mean : 10.71
## 3rd Qu.: 390 3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40
## Max. : 633 Max. : 452.00 Max. : 180.0000 Max. : 89.80
## yaw_forearm total_accel_forearm gyros_forearm_x
## Min. :-180.00 Min. : 0.00 Min. :-22.000
## 1st Qu.: -68.60 1st Qu.: 29.00 1st Qu.: -0.220
## Median : 0.00 Median : 36.00 Median : 0.050
## Mean : 19.21 Mean : 34.72 Mean : 0.158
## 3rd Qu.: 110.00 3rd Qu.: 41.00 3rd Qu.: 0.560
## Max. : 180.00 Max. :108.00 Max. : 3.970
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
## 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
## Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
## Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0
## 1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0
## Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0
## Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6
## 3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0
## Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
## classe
## A:5580
## B:3797
## C:3422
## D:3216
## E:3607
##
Now let’s look at the correlations among the predictors to figure out whether we need all 52 variables to predict the “classe” variable.
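# Correlation matrix of the 52 predictors (column 53 is classe)
M <- cor(training[, -53])
# Zero the diagonal so a variable is not reported as correlated with itself
diag(M) <- 0
which(abs(M) >= 0.9, arr.ind = TRUE)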
## row col
## total_accel_belt 4 1
## accel_belt_y 9 1
## accel_belt_z 10 1
## accel_belt_x 8 2
## roll_belt 1 4
## accel_belt_y 9 4
## accel_belt_z 10 4
## pitch_belt 2 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_z 10 9
## roll_belt 1 10
## total_accel_belt 4 10
## accel_belt_y 9 10
## gyros_arm_y 19 18
## gyros_arm_x 18 19
## gyros_dumbbell_z 33 31
## gyros_forearm_z 46 31
## gyros_dumbbell_x 31 33
## gyros_forearm_z 46 33
## gyros_dumbbell_x 31 46
## gyros_dumbbell_z 33 46
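Because a correlation matrix is symmetric, which() reports every pair twice, once as (row, col) and once as (col, row). The sketch below, on made-up data, keeps only the upper triangle so that each pair appears once and is reported by name. The toy variables a, b, and c are hypothetical.

```r
set.seed(1)
# Made-up data: b is nearly a copy of a, c is independent noise.
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

M <- cor(toy)
# Zero the diagonal and lower triangle so each pair is considered once.
M[!upper.tri(M)] <- 0
pairs_idx <- which(abs(M) >= 0.9, arr.ind = TRUE)

# Report the pairs by column name together with their correlation.
high_pairs <- data.frame(
  var1 = rownames(M)[pairs_idx[, "row"]],
  var2 = colnames(M)[pairs_idx[, "col"]],
  r    = M[pairs_idx]
)
print(high_pairs)
```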
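The which() output shows several highly correlated pairs of columns (absolute correlation of at least 0.9), each listed twice because the matrix is symmetric: (1, 4), (1, 9), (1, 10), (4, 9), (4, 10), (9, 10), (2, 8), (18, 19), (31, 33), (31, 46), and (33, 46). Let’s look at a few scatter plots.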
par(mfrow = c(2,3))
plot(training[,1],training[,4])
plot(training[,1],training[,9])
plot(training[,1],training[,10])
plot(training[,4],training[,9])
plot(training[,4],training[,10])
plot(training[,9],training[,10])
As expected, predictors 1, 4, 9, and 10 are highly correlated with one another, and there are no significant outliers. Let’s see the remaining scatter plots.
par(mfrow = c(1,3))
plot(training[,33],training[,46])
plot(training[,31],training[,33])
plot(training[,31],training[,46])
Most of the points are concentrated near (0, 0), but one point appears to be an outlier. Let’s investigate that particular case.
c(which.min(training[,31]), which.max(training[,31]))
## [1] 5373 13792
c(which.min(training[,33]), which.max(training[,33]))
## [1] 8929 5373
c(which.min(training[,46]), which.max(training[,46]))
## [1] 941 5373
Row 5373 holds the extreme value of all three variables, so it needs a closer look. Let’s compare its values with the column means.
training[5373,c(31,33,46)]
## gyros_dumbbell_x gyros_dumbbell_z gyros_forearm_z
## 5373 -204 317 231
sapply(training[,c(31,33,46)], mean)
## gyros_dumbbell_x gyros_dumbbell_z gyros_forearm_z
## 0.1610845 -0.1289889 0.1512450
Row 5373 lies far from the means and adds a great deal of variability. Let’s see how the correlations change when that row is removed.
M <- cor(training[-5373,-53])
diag(M) <- 0
which(abs(M) >= 0.9, arr.ind = T)
## row col
## total_accel_belt 4 1
## accel_belt_y 9 1
## accel_belt_z 10 1
## accel_belt_x 8 2
## roll_belt 1 4
## accel_belt_y 9 4
## accel_belt_z 10 4
## pitch_belt 2 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_z 10 9
## roll_belt 1 10
## total_accel_belt 4 10
## accel_belt_y 9 10
## gyros_arm_y 19 18
## gyros_arm_x 18 19
Now the correlations among variables 31, 33, and 46 no longer exceed 0.9. Let’s look at their pairwise values.
cor(training[-5373,31], training[-5373,33])
## [1] -0.6170703
cor(training[-5373,33], training[-5373,46])
## [1] 0.06143106
cor(training[-5373,31], training[-5373,46])
## [1] -0.07783241
Variables 31 and 33 are still moderately correlated (about -0.62), but the pairs (33, 46) and (31, 46) are essentially uncorrelated. I believe row 5373 was recorded incorrectly, so let’s discard it from this analysis; with almost 20,000 rows of data, losing one row has little effect.
training <- training[-5373, ]
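To illustrate why a single row can create a correlation above 0.9 on its own, here is a small sketch with made-up data. Two independent noise variables are essentially uncorrelated until one extreme point is appended. The magnitudes -204 and 317 are taken from row 5373; everything else is simulated and hypothetical.

```r
set.seed(42)
# Made-up data: two independent, roughly unit-scale readings.
x <- rnorm(1000)
y <- rnorm(1000)

# Append one extreme reading with the magnitudes seen in row 5373.
x_out <- c(x, -204)
y_out <- c(y, 317)

r_with    <- cor(x_out, y_out)  # dominated by the single extreme point
r_without <- cor(x, y)          # close to zero for independent noise
print(c(with_outlier = r_with, without_outlier = r_without))
```

Pearson correlation is built from sums of squared deviations, so one point hundreds of units from the mean outweighs a thousand points near zero.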
We removed one outlier, but some of the variables are still highly correlated. Let’s use the “pca” method of preProcess in the caret package to reduce both the correlation and the number of dimensions. It also centers and scales the data.
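# Fit the PCA transformation on the predictors only (column 53 is the outcome)
preProcValues <- preProcess(training[, -53], method = "pca")
# Apply the same transformation to the training and test predictors
trainTrans <- predict(preProcValues, training[, -53])
testTrans <- predict(preProcValues, testing[, -53])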
N <- cor(trainTrans)
diag(N) <- 0
which(abs(N) > 0.5, arr.ind = T)
## row col
The empty result shows that no pair of components has an absolute correlation above 0.5. Let’s check the dimensions of the new, uncorrelated data.
dim(trainTrans)
## [1] 19621 26
dim(testTrans)
## [1] 20 26
The 52 original predictors have been reduced to 26 uncorrelated principal components.
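For readers who want to see what the PCA step does without caret, here is a minimal base R sketch using prcomp on made-up data. It centers and scales the columns, keeps the fewest components whose cumulative variance reaches 95% (which, as far as I know, matches the default thresh of preProcess), and confirms that the component scores are uncorrelated. The toy variables are hypothetical.

```r
set.seed(7)
# Made-up data: three columns, two of which are strongly correlated.
a <- rnorm(500)
toy <- data.frame(a = a, b = a + rnorm(500, sd = 0.2), c = rnorm(500))

# Center and scale, then rotate onto the principal components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Cumulative share of variance; keep the fewest components reaching 95%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep  <- which(cum_var >= 0.95)[1]
scores  <- pca$x[, seq_len(n_keep), drop = FALSE]

# Principal component scores are uncorrelated by construction.
off_diag <- cor(pca$x)
diag(off_diag) <- 0
print(n_keep)
print(max(abs(off_diag)))
```

Here the two correlated columns collapse into one component, so two components are enough for three original variables, the same effect that reduces 52 predictors to 26 in this report.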
# Re-attach the outcome to the training set and the case id to the test set
trainTrans$classe <- training[, 53]
testTrans$problem_id <- testing[, 53]
Before fitting the models, let’s hold out 25% of the training data as a validation partition and set the cross-validation parameters (5-fold cross validation, repeated 5 times).
set.seed(12315)
inTrain <- createDataPartition(y = trainTrans$classe,
                               p = 3/4, list = FALSE)
trainPart <- trainTrans[inTrain,]
testPart <- trainTrans[-inTrain,]
trControl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
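createDataPartition splits the data while preserving the class proportions of the outcome. The base R sketch below approximates that idea on a made-up, unbalanced outcome by sampling 3/4 of the rows within each class. It is an illustration of stratified sampling under those assumptions, not caret’s actual implementation.

```r
set.seed(12315)
# Made-up outcome with unbalanced classes.
y <- factor(rep(c("A", "B", "C"), times = c(80, 40, 20)))

# Within each class, sample 3/4 of the row indices for training.
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

train_y <- y[train_idx]
test_y  <- y[-train_idx]
print(table(train_y))
print(table(test_y))
```

Both partitions keep the 4:2:1 class ratio of the full outcome, which is the property that makes the held-out accuracy comparable across classes.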
Since this is a classification problem, I will train a decision tree, gradient boosting, and a random forest on the data. I expect the random forest to give the highest accuracy.
# Decision Tree
mod_rpart <- train(classe ~ ., data = trainPart,
                   method = "rpart", trControl = trControl)
# Stochastic Gradient Boosting
mod_gbm <- train(classe ~ ., data = trainPart,
                 method = "gbm", trControl = trControl, verbose = FALSE)
# Random Forest
mod_rf <- train(classe ~ ., data = trainPart,
                method = "rf", trControl = trControl, verbose = FALSE)
Let’s see the overall accuracy of the three models on the held-out partition.
# Decision Tree (predictions first, reference labels second)
confusionMatrix(predict(mod_rpart, newdata = testPart), testPart$classe)$overall[1]
## Accuracy
## 0.3720171
# Gradient Boosting (predictions first, reference labels second)
confusionMatrix(predict(mod_gbm, newdata = testPart), testPart$classe)$overall[1]
## Accuracy
## 0.8439731
# Random Forest (predictions first, reference labels second)
confusionMatrix(predict(mod_rf, newdata = testPart), testPart$classe)$overall[1]
## Accuracy
## 0.9928615
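The accuracy reported by confusionMatrix is the share of held-out cases on the diagonal of the confusion table, and the estimated out-of-sample error is one minus that accuracy. The sketch below shows the arithmetic in base R on ten made-up cases; the truth and pred vectors are hypothetical.

```r
# Made-up reference labels and predictions for ten held-out cases.
truth <- factor(c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"))
pred  <- factor(c("A", "A", "B", "C", "C", "C", "D", "D", "E", "A"),
                levels = levels(truth))

# Cross-tabulate predictions (rows) against the reference labels (columns).
conf_tab <- table(Prediction = pred, Reference = truth)

# Accuracy is the diagonal share; out-of-sample error is its complement.
accuracy  <- sum(diag(conf_tab)) / sum(conf_tab)
oos_error <- 1 - accuracy
print(conf_tab)
print(c(accuracy = accuracy, oos_error = oos_error))
```

Applied to the random forest result above, the same arithmetic gives 1 - 0.9928615, or an error of roughly 0.7%.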
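As expected, the random forest gives the highest accuracy on the held-out partition, about 99.3%, which corresponds to an estimated out-of-sample error of about 0.7%. Hence, let’s apply the random forest model to the 20 test cases.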
pred_rf <- predict(mod_rf, newdata = testTrans)
pred_rf
## [1] B A B A A B D B A A A C B A E E A B B B
## Levels: A B C D E