We load the training dataset into the variable ‘tr’. This is the dataset we will later split into ‘training’ and ‘testing’ sets. The validation data ‘val’ (from pml-testing.csv) remains untouched until the final phase.
library(caret)   # provides preProcess, featurePlot, createDataPartition, knn3 and confusionMatrix
tr <- read.csv('pml-training.csv')
val <- read.csv('pml-testing.csv')
We take a look at how big the ‘tr’ dataset is.
dim(tr)
## [1] 19622 160
… which shows it consists of 19622 records of 160 variables. We take a look at the ‘classe’ variable, which we have to predict.
summary(tr$classe)
## A B C D E
## 5580 3797 3422 3216 3607
It is a factor with 5 levels: A, B, C, D and E. Before visualising relations between the variables, we check for NA values in the set.
sum(is.na(tr))
## [1] 1287472
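To see where these missing values are concentrated (a quick supplementary check, not part of the original write-up; the name na_per_col is only illustrative), we can count NAs per column:
na_per_col <- colSums(is.na(tr))   # number of NAs in each of the 160 columns
table(na_per_col)                  # how many columns share each NA count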
We impute these NA values with the k-nearest-neighbours algorithm.
preProcValues <- preProcess(tr, method = c("knnImpute","center","scale"))   # learn imputation, centering and scaling parameters
tr_processed <- predict(preProcValues, tr)                                  # apply them to the training data
sum(is.na(tr_processed))
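As an additional sanity check (an aside we add here, not part of the original analysis), we can verify whether the raw total acceleration columns used below contain any missing values at all:
sum(is.na(tr[, grep("^total_accel", names(tr))]))   # NA count restricted to the total_accel_* columns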
We use readings from the accelerometers on the belt, forearm, arm and dumbbell; we segregate these variables.
tr_processed <- tr   # continue with the original (un-scaled) data for the remainder of the analysis
names(tr_processed)[grep("accel",names(tr_processed))]
## [1] "total_accel_belt" "var_total_accel_belt" "accel_belt_x"
## [4] "accel_belt_y" "accel_belt_z" "total_accel_arm"
## [7] "var_accel_arm" "accel_arm_x" "accel_arm_y"
## [10] "accel_arm_z" "total_accel_dumbbell" "var_accel_dumbbell"
## [13] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [16] "total_accel_forearm" "var_accel_forearm" "accel_forearm_x"
## [19] "accel_forearm_y" "accel_forearm_z"
ta <- names(tr_processed)[grep("^total_accel",names(tr_processed))]
We now plot the total acceleration variables against each other.
featurePlot(x = tr_processed[,ta],y=tr$classe,plot='pairs')
Since the total acceleration readings from the four sensors show no strong correlation with one another, we can treat them as independent variables and use them as predictors.
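To back up this visual impression with numbers (a supplementary check, not part of the original report), the pairwise correlations of the four total acceleration variables can be inspected directly:
round(cor(tr_processed[, ta]), 2)   # pairwise correlation matrix of the total_accel_* variables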
set.seed(107)
inTrain <- createDataPartition(
  y = tr_processed$classe,
  p = .75,
  list = FALSE
)
training <- tr_processed[inTrain, ]
testing  <- tr_processed[-inTrain, ]
nrow(training)
## [1] 14718
nrow(testing)
## [1] 4904
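Because createDataPartition samples within each level of ‘classe’, the class proportions in ‘training’ and ‘testing’ should closely mirror those of the full dataset; a quick way to confirm this (an added check, output not shown) is:
round(prop.table(table(training$classe)), 3)   # class proportions in the training split
round(prop.table(table(testing$classe)), 3)    # class proportions in the testing split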
We now train a KNN-based model on the training set and predict values for the testing set.
val_acc <- val[,ta]                                                        # keep only the total_accel_* columns of the validation set
model_knn <- knn3(classe ~ ., data = training[, c(ta, "classe")], k = 3)   # KNN (k = 3) using only the total_accel_* variables as predictors
pt_k <- predict(model_knn, testing, type='class')
summary(testing$classe == pt_k)
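For a fuller picture of out-of-sample performance than the logical summary above, caret's confusionMatrix reports overall accuracy together with per-class sensitivity and specificity (output omitted here):
confusionMatrix(pt_k, testing$classe)   # accuracy and per-class statistics on the held-out testing split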
We then apply the model to the validation set.
pv_k <- predict(model_knn,val_acc,type='class')
pv_k
This outputs E B E B E B B B E E E B E E B E E E B B for the 20 entries in the ‘val’ dataset.