We load the training dataset into the variable ‘tr’. This is the dataset we will later split into ‘training’ and ‘testing’ sets. The validation data ‘val’ (from pml-testing.csv) remains untouched until the final phase.
library(caret)   # provides preProcess, featurePlot, createDataPartition, knn3 and confusionMatrix
tr <- read.csv('pml-training.csv')
val <- read.csv('pml-testing.csv')
We take a look at how big the ‘tr’ dataset is.
dim(tr)
## [1] 19622 160
… which shows it consists of 19622 records of 160 variables. We take a look at the ‘classe’ variable, which we have to predict.
summary(tr$classe)
## A B C D E
## 5580 3797 3422 3216 3607
It is a factor with 5 levels: A, B, C, D and E. Before visualising relations between the variables, we check for NA values in the set.
sum(is.na(tr))
## [1] 1287472
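To see where these missing values are concentrated (a quick supplementary check, not part of the original write-up; the name na_per_col is only illustrative), we can count NAs per column:
na_per_col <- colSums(is.na(tr))   # number of NAs in each of the 160 columns
table(na_per_col)                  # how many columns share each NA count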
We impute these NA values with the k-nearest-neighbours algorithm.
preProcValues <- preProcess(tr, method = c("knnImpute","center","scale"))   # learn imputation, centering and scaling parameters
tr_processed <- predict(preProcValues, tr)                                  # apply them to the training data
sum(is.na(tr_processed))
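As an additional sanity check (an aside we add here, not part of the original analysis), we can verify whether the raw total acceleration columns used below contain any missing values at all:
sum(is.na(tr[, grep("^total_accel", names(tr))]))   # NA count restricted to the total_accel_* columns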
We use readings from the accelerometers on the belt, forearm, arm and dumbbell; we segregate these variables.
tr_processed <- tr   # continue with the original (un-scaled) data for the remainder of the analysis
names(tr_processed)[grep("accel",names(tr_processed))]
## [1] "total_accel_belt" "var_total_accel_belt" "accel_belt_x"
## [4] "accel_belt_y" "accel_belt_z" "total_accel_arm"
## [7] "var_accel_arm" "accel_arm_x" "accel_arm_y"
## [10] "accel_arm_z" "total_accel_dumbbell" "var_accel_dumbbell"
## [13] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [16] "total_accel_forearm" "var_accel_forearm" "accel_forearm_x"
## [19] "accel_forearm_y" "accel_forearm_z"
ta <- names(tr_processed)[grep("^total_accel",names(tr_processed))]
We now plot the total acceleration variables against each other.
featurePlot(x = tr_processed[,ta],y=tr$classe,plot='pairs')
Since the total acceleration readings from the four sensors show no strong correlation with one another, we can treat them as independent variables and use them as predictors.
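To back up this visual impression with numbers (a supplementary check, not part of the original report), the pairwise correlations of the four total acceleration variables can be inspected directly:
round(cor(tr_processed[, ta]), 2)   # pairwise correlation matrix of the total_accel_* variables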
set.seed(107)
inTrain <- createDataPartition(
  y = tr_processed$classe,
  p = .75,
  list = FALSE
)
training <- tr_processed[inTrain, ]
testing  <- tr_processed[-inTrain, ]
nrow(training)
## [1] 14718
nrow(testing)
## [1] 4904
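Because createDataPartition samples within each level of ‘classe’, the class proportions in ‘training’ and ‘testing’ should closely mirror those of the full dataset; a quick way to confirm this (an added check, output not shown) is:
round(prop.table(table(training$classe)), 3)   # class proportions in the training split
round(prop.table(table(testing$classe)), 3)    # class proportions in the testing split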
We now train a KNN-based model on the training set and predict values for the testing set.
val_acc <- val[,ta]                                                        # keep only the total_accel_* columns of the validation set
model_knn <- knn3(classe ~ ., data = training[, c(ta, "classe")], k = 3)   # KNN (k = 3) using only the total_accel_* variables as predictors
pt_k <- predict(model_knn, testing, type='class')
summary(testing$classe == pt_k)
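For a fuller picture of out-of-sample performance than the logical summary above, caret's confusionMatrix reports overall accuracy together with per-class sensitivity and specificity (output omitted here):
confusionMatrix(pt_k, testing$classe)   # accuracy and per-class statistics on the held-out testing split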
We then apply the model to the validation set.
pv_k <- predict(model_knn,val_acc,type='class')
pv_k
This outputs E B E B E B B B E E E B E E B E E E B B for the 20 entries in the ‘val’ dataset.