This document presents the results of the Practical Machine Learning Peer Assessment as a single R Markdown document that can be processed by knitr and transformed into an HTML file.
To predict the class of data with many columns, a random forest is implemented without cross-validation and a test set. The first step is therefore to remove the columns with less than 60% of data, and then to carry out validation and testing in order to answer the following questions:
2. How to build the model and use cross validation.
Data source: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
install.packages("ElemStatLearn")
install.packages("caret")
library(ElemStatLearn)
library(caret)
install.packages("rpart")
library(rpart)
install.packages("randomForest")
library(randomForest)
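Calling install.packages() unconditionally reinstalls each package every time the document is knitted. A minimal sketch of a guarded variant follows; the helper name `missing_pkgs` is hypothetical and not part of the original report, and the install and attach calls are left commented out so the sketch has no side effects.

```r
# Packages named in the report.
pkgs <- c("ElemStatLearn", "caret", "rpart", "randomForest")

# Hypothetical helper: return the packages that are not yet installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing, then attach everything.
# install.packages(missing_pkgs(pkgs))
# invisible(lapply(pkgs, library, character.only = TRUE))
```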
Urla <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Urlb <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(Urla), na.strings=c("NA","#DIV/0!",""))
testing <- read.csv(url(Urlb), na.strings=c("NA","#DIV/0!",""))
dim(training)
[1] 19622 160
dim(testing)
[1] 20 160
table(training$classe)
   A    B    C    D    E
5580 3797 3422 3216 3607
There are 19622 observations in the training dataset, with 160 variables. The last column is the target variable classe. The most abundant class is A.
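To make the class balance concrete, the counts can be turned into percentages. The sketch below re-enters the counts by hand from the table(training$classe) output rather than reading the data; the variable names `class_counts` and `class_share` are illustrative only.

```r
# Class counts copied from the table(training$classe) output.
class_counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

# Share of each class in percent, rounded to one decimal place.
class_share <- round(100 * class_counts / sum(class_counts), 1)
class_share

# Name of the most frequent class.
names(which.max(class_counts))
```

Class A makes up roughly 28% of the rows, so the classes are only mildly unbalanced. On the loaded data the same figures should come from prop.table(table(training$classe)).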
Some variables have a lot of missing values; for simplicity, I removed all variables containing NA values. Several variables are also not directly related to the target variable classe, so I removed those as well: “X”, “user_name”, and all the time-related variables, such as “raw_timestamp_part_1”.
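# Number of missing values in each column
NA_Count <- sapply(seq_len(ncol(training)), function(x) sum(is.na(training[, x])))
# Indices of the columns that contain at least one NA
NA_list <- which(NA_Count > 0)
# Names of the first seven columns (identifiers and time-related variables)
colnames(training)[1:7]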
[1] "X" "user_name" "raw_timestamp_part_1" "raw_timestamp_part_2"
[5] "cvtd_timestamp" "new_window" "num_window"
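The column filtering described above can be sketched on a small made-up data frame. This is an illustration under stated assumptions, not the report's own code: the helper `drop_sparse_columns`, its arguments, and the toy values are all hypothetical. With `min_complete = 1` it drops every column containing an NA, and with `min_complete = 0.6` it corresponds to the 60% rule mentioned in the introduction.

```r
# Toy data frame standing in for `training` (values are made up).
toy <- data.frame(
  X         = 1:5,
  user_name = c("a", "b", "c", "d", "e"),
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2),
  mostly_na = c(NA, NA, NA, NA, 0.5),
  some_na   = c(0.1, NA, 0.3, 0.4, 0.5),
  classe    = c("A", "B", "A", "C", "E")
)

# Hypothetical helper: drop the leading identifier columns, then keep only
# columns whose share of non-missing values is at least `min_complete`.
drop_sparse_columns <- function(df, id_cols = 1:7, min_complete = 1) {
  df <- df[, -id_cols, drop = FALSE]
  complete_share <- colMeans(!is.na(df))
  df[, complete_share >= min_complete, drop = FALSE]
}

# Strict variant: no missing values allowed in a kept column.
strict <- drop_sparse_columns(toy, id_cols = 1:2)
# Lenient variant: keep columns with at least 60% of their values present.
lenient <- drop_sparse_columns(toy, id_cols = 1:2, min_complete = 0.6)

colnames(strict)
colnames(lenient)
```

On the real data the call would presumably be drop_sparse_columns(training), with the default id_cols = 1:7 matching the seven columns listed above. The same set of columns would then need to be kept in testing so that both data frames stay aligned.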