Executive Summary

This report presents the results of the Practical Machine Learning Peer Assessment as a single R Markdown document that knitr can process into an HTML file.

Predicting the class of data with many columns calls for a random forest, which is fitted here without a separate cross-validation or test set. The first step is therefore to remove the columns with less than 60% of their data present, and then to validate and test the data in order to answer the following questions:

1. Predict the manner in which they did the exercise. This is the “classe” variable in the training set. All other variables can be used as predictors.

2. How to build the model and use cross-validation.
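As a rough illustration of the second question, the sketch below shows one way k-fold cross-validation could be combined with a random forest. It is not the model from this report: the built-in `iris` data stands in for the training set, and the fold count `k`, the seed, and `ntree = 100` are assumptions chosen only to keep the example small.

```r
library(randomForest)

set.seed(2017)                      # assumed seed, for reproducible folds
data(iris)                          # stand-in data; Species plays the role of classe

k <- 5                              # assumed number of folds
folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold label per row

# Fit on k-1 folds, measure accuracy on the held-out fold
acc <- sapply(1:k, function(i) {
  train_fold <- iris[folds != i, ]
  test_fold  <- iris[folds == i, ]
  fit <- randomForest(Species ~ ., data = train_fold, ntree = 100)
  mean(predict(fit, test_fold) == test_fold$Species)
})

cv_accuracy <- mean(acc)            # average held-out accuracy across folds
print(cv_accuracy)
```

With the real data, `Species ~ .` would become `classe ~ .` and `iris` would be replaced by the cleaned training set.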

Data source: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

  install.packages("ElemStatLearn")
  install.packages("caret")
  library(ElemStatLearn)
  library(caret)
  install.packages("rpart")
  library(rpart)
  install.packages("randomForest")
  library(randomForest)

  Urla <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
  Urlb <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
  training <- read.csv(url(Urla), na.strings=c("NA","#DIV/0!",""))
  testing <- read.csv(url(Urlb), na.strings=c("NA","#DIV/0!",""))
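The `na.strings` argument in the calls above tells `read.csv` which raw strings to treat as missing. The following minimal sketch shows the effect on a tiny inline CSV; the column names `a` and `b` and their values are made up for illustration and do not come from the dataset.

```r
# Hypothetical three-row CSV containing the same markers as the real files
csv_text <- "a,b\n1,#DIV/0!\nNA,2\n,3\n"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Because "#DIV/0!" is read as NA, column b is parsed as numeric
# instead of being kept as text
print(toy)
n_missing <- sum(is.na(toy))
```

Without `"#DIV/0!"` in `na.strings`, a column containing that marker would be read as text rather than as numbers.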

Exploring Data

    dim(training)
    [1] 19622   160

    dim(testing)
    [1]  20 160

    table(training$classe)

       A    B    C    D    E 
    5580 3797 3422 3216 3607

There are 19622 observations in the training dataset, with 160 variables. The last column is the target variable classe. The most abundant class is A.
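The class balance can be checked directly from the counts reported above. This small sketch re-enters those counts by hand (the variable names are illustrative) and derives the total, the largest class, and each class's share of the rows.

```r
# Counts copied from the table(training$classe) output
classe_counts <- c(A = 5580, B = 3797, C = 3422, D = 3216, E = 3607)

total <- sum(classe_counts)                       # should match the row count
largest <- names(which.max(classe_counts))        # most abundant class
shares <- round(100 * classe_counts / total, 1)   # percentage of rows per class

print(total)
print(largest)
print(shares)
```

On the loaded data, `prop.table(table(training$classe))` gives the same proportions without retyping the counts.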

Some variables have a lot of missing values; for simplicity, I removed all the variables containing NA values. Several other variables are not directly related to the target variable classe, so I removed those as well: “X”, “user_name”, and all the time-related variables, such as “raw_timestamp_part_1”.

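    # Count the missing values in each column
    NA_Count = sapply(1:dim(training)[2],function(x)sum(is.na(training[,x])))
    # Indices of the columns that contain at least one NA
    NA_list = which(NA_Count>0)
    # Names of the first seven columns (row id, user, timestamps, window info)
    colnames(training[,c(1:7)])
    [1] "X"                    "user_name"            "raw_timestamp_part_1" "raw_timestamp_part_2"
    [5] "cvtd_timestamp"       "new_window"           "num_window"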
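The removal step itself could look like the sketch below. It runs on a small made-up data frame rather than the real training set: the values are hypothetical, and `id_cols` stands in for the seven leading columns listed above. It shows both the strict rule used here (drop any column with an NA) and the 60% threshold mentioned in the summary, so the two can be compared.

```r
# Hypothetical stand-in for the training data
toy <- data.frame(
  X = 1:5,
  user_name = c("a", "b", "c", "d", "e"),
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2),
  kurtosis_roll_belt = c(NA, NA, NA, 0.5, NA),   # mostly missing
  yaw_belt = c(-94, -94, NA, -95, -93),          # one missing value
  classe = c("A", "A", "B", "C", "E")
)
id_cols <- c("X", "user_name")   # assumption: plays the role of columns 1:7

is_id <- names(toy) %in% id_cols

# Strict rule: keep only columns with no missing values at all
na_count <- colSums(is.na(toy))
strict <- toy[, na_count == 0 & !is_id, drop = FALSE]

# Threshold rule: keep columns with at least 60% of their values present
present_frac <- colMeans(!is.na(toy))
lenient <- toy[, present_frac >= 0.6 & !is_id, drop = FALSE]

print(names(strict))
print(names(lenient))
```

The strict rule is simpler but discards `yaw_belt` for a single missing value, while the threshold rule keeps it and drops only the mostly empty column.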

Mach-Learn

Farzad Ravari

June 6, 2017
