The purpose of this project is to demonstrate collecting, working with, and cleaning a data set. The goal is to prepare tidy data that can be used for later analysis. The brief for this project is taken from the Getting and Cleaning Data course project in the Data Science Specialisation from Coursera. Tidy data is defined there by the following criteria:
- each variable measured should be in one column
- each different observation of that variable should be in a different row
- there should be one table for each kind of variable
One of the most exciting areas in all of data science right now is wearable computing. Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data for this project was collected from the accelerometers of the Samsung Galaxy S smartphone. The primary aims of the project are to create a tidy data set from the data provided and then to apply a machine learning technique to classify the activities and determine the most important variables.
The data used in this project can be found at the following link:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The data set consists of recordings from 30 participants. Each person performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying) while wearing a smartphone on their waist, and a number of measurements were taken using the embedded accelerometer and gyroscope. For each record in the data set, the following information is provided:
- triaxial acceleration from the accelerometer and the estimated body acceleration
- triaxial angular velocity from the gyroscope
- a 561-feature vector of time and frequency domain variables
- the activity label
- an identifier of the subject who carried out the experiment
More information about the data is available from the UCI Machine Learning Repository where the data was obtained:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
The data set obtained was split into training and test sets, with the identifier data also separated out. The aim of the first section of this analysis is to:
- merge the training and test sets into a single data set
- extract only the mean and standard deviation measurements
- apply descriptive activity names to the activities
- label the data set with descriptive variable names
The second section of the project will use a statistical modelling technique to classify the movements and identify the most important measurements.
The R packages required for this project are data.table, reshape2, randomForest and caret.
library(data.table)
library(reshape2)
library(randomForest)
library(caret)
First, we download the UCI HAR data and load it into R.
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip",temp, mode="wb")
unzip(temp)
# Load train data
trainData <- read.table("UCI HAR Dataset/train/X_train.txt")
trainLabel <- read.table("UCI HAR Dataset/train/y_train.txt")
trainSubject <- read.table("UCI HAR Dataset/train/subject_train.txt")
# Load test data
testData <- read.table("UCI HAR Dataset/test/X_test.txt")
testLabel <- read.table("UCI HAR Dataset/test/y_test.txt")
testSubject <- read.table("UCI HAR Dataset/test/subject_test.txt")
unlink(temp)
The train and test data sets are first joined into a single data set.
joinData <- rbind(trainData, testData)
# Remove no longer required data sets
remove(trainData); remove(testData)
# Show dimensions of new data table
dim(joinData)
## [1] 10299 561
The train and test labels are joined in the same way, followed by the train and test subject identifiers.
joinLabel <- rbind(trainLabel, testLabel)
joinSubject <- rbind(trainSubject, testSubject)
remove(trainLabel); remove(testLabel); remove(trainSubject); remove(testSubject)
The head function can be used to preview the data. Only the first five columns are previewed to reduce the size of the output. The preview shows that the column headers do not currently describe the data in the table. This will be addressed in the next sections.
head(joinData,5)[,c('V1', 'V2', 'V3', 'V4', 'V5')]
## V1 V2 V3 V4 V5
## 1 0.2885845 -0.02029417 -0.1329051 -0.9952786 -0.9831106
## 2 0.2784188 -0.01641057 -0.1235202 -0.9982453 -0.9753002
## 3 0.2796531 -0.01946716 -0.1134617 -0.9953796 -0.9671870
## 4 0.2791739 -0.02620065 -0.1232826 -0.9960915 -0.9834027
## 5 0.2766288 -0.01656965 -0.1153619 -0.9981386 -0.9808173
To find the mean and standard deviation measurements, the features data first needs to be loaded into R. This is the list of all features derived from the accelerometer and gyroscope signals.
features <- read.table("./UCI HAR Dataset/features.txt")
The features are stored as strings, so a regular expression is used to find the mean and standard deviation features (those containing "mean()" or "std()").
meanSD <- grep("mean\\(\\)|std\\(\\)", features[, 2])
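A quick preview of the matched names (an added check, in the style of the other previews in this document) confirms that the pattern picks out the mean and standard deviation features:
head(as.character(features[meanSD, 2]), 4)
## [1] "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y" "tBodyAcc-mean()-Z"
## [4] "tBodyAcc-std()-X"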
The joined data can then be subset using the above.
joinData <- joinData[, meanSD]
dim(joinData)
## [1] 10299 66
This has reduced the number of columns to 66. The columns can also be renamed using the features list with some additional cleaning.
names(joinData) <- gsub("\\(\\)", "", features[meanSD, 2]) # remove "()"
names(joinData) <- gsub("mean", "Mean", names(joinData)) # capitalise M
names(joinData) <- gsub("std", "Std", names(joinData)) # change "std" to "SD"
names(joinData) <- gsub("-", "", names(joinData)) # remove "-"
remove(features)
# Preview data
head(joinData,5)[,c('tBodyAccMeanX', 'tBodyAccMeanY', 'tBodyAccMeanZ', 'tBodyAccStdX',
'tBodyAccStdY')]
## tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ tBodyAccStdX tBodyAccStdY
## 1 0.2885845 -0.02029417 -0.1329051 -0.9952786 -0.9831106
## 2 0.2784188 -0.01641057 -0.1235202 -0.9982453 -0.9753002
## 3 0.2796531 -0.01946716 -0.1134617 -0.9953796 -0.9671870
## 4 0.2791739 -0.02620065 -0.1232826 -0.9960915 -0.9834027
## 5 0.2766288 -0.01656965 -0.1153619 -0.9981386 -0.9808173
The column names are now much more descriptive and the irrelevant columns have been removed.
The activity labels are also loaded from the original data into R.
activity <- read.table("./UCI HAR Dataset/activity_labels.txt")
activity
## V1 V2
## 1 1 WALKING
## 2 2 WALKING_UPSTAIRS
## 3 3 WALKING_DOWNSTAIRS
## 4 4 SITTING
## 5 5 STANDING
## 6 6 LAYING
These labels will be cleaned up by changing them to lower case, removing the underscores, and re-capitalising the letter that followed each underscore.
activity[, 2] <- tolower(gsub("_", "", activity[, 2])) # lower case, remove "_"
substr(activity[2, 2], 8, 8) <- toupper(substr(activity[2, 2], 8, 8)) # "walkingUpstairs"
substr(activity[3, 2], 8, 8) <- toupper(substr(activity[3, 2], 8, 8)) # "walkingDownstairs"
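Previewing the cleaned labels (an added check) shows the final activity names:
activity[, 2]
## [1] "walking"           "walkingUpstairs"   "walkingDownstairs"
## [4] "sitting"           "standing"          "laying"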
These labels can then be joined to the label data that was loaded previously.
activityLabel <- activity[joinLabel[, 1], 2] # map numeric labels to activity names
joinLabel[, 1] <- activityLabel
names(joinLabel) <- "activity"
head(joinLabel)
## activity
## 1 standing
## 2 standing
## 3 standing
## 4 standing
## 5 standing
## 6 standing
Finally, the full data is bound together with the subject and activity labels. The activity labels will become the dependent variable for the statistical modelling task in the next section of the project.
names(joinSubject) <- "subject"
cleanData <- cbind(joinSubject, joinLabel, joinData)
remove(joinData); remove(activity)
remove(joinLabel); remove(joinSubject)
remove(activityLabel); remove(meanSD)
head(cleanData,5)[,c('subject', 'activity', 'tBodyAccMeanX', 'tBodyAccMeanY',
'tBodyAccMeanZ')]
## subject activity tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ
## 1 1 standing 0.2885845 -0.02029417 -0.1329051
## 2 1 standing 0.2784188 -0.01641057 -0.1235202
## 3 1 standing 0.2796531 -0.01946716 -0.1134617
## 4 1 standing 0.2791739 -0.02620065 -0.1232826
## 5 1 standing 0.2766288 -0.01656965 -0.1153619
This section will use the clean data set from the previous section, applying a statistical modelling method to classify the data and determine the most important variables in that classification.
As the primary focus of this project was data cleaning, this section will employ only one model (random forests); other models will be utilised later in the portfolio.
A simple explanation of random forests is that they operate by constructing a number of decision trees on a training data set and outputting the class that is the mode of the classes across those trees. A more detailed explanation will be provided in the Explanatory Post section of the portfolio.
Random forests were chosen for this problem because some of the activities are hard to distinguish, e.g. walking vs walking upstairs. A random forest is also more robust than a single decision tree, which can suffer from high variance or bias, because it averages over many trees.
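To make the majority-vote idea concrete, the following minimal sketch (an illustrative addition using R's built-in iris data rather than the HAR data) checks that the forest's output matches the per-observation mode of the individual tree predictions:
rf.demo <- randomForest(Species ~ ., data = iris)
demo.votes <- predict(rf.demo, iris, predict.all = TRUE)
# demo.votes$individual is an n x ntree matrix of per-tree class predictions;
# the aggregate prediction is the per-row majority vote
rowMode <- function(x) names(which.max(table(x)))
all(apply(demo.votes$individual, 1, rowMode) ==
    as.character(demo.votes$aggregate)) # TRUE, barring tie-breaking differences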
First the clean data will be re-split into train and test sets using a roughly 70/30 split (21 of the 30 subjects for training). The data is split by subject in order to get an even distribution of activities in each set.
`%!in%` <- Negate(`%in%`) # convenience "not in" operator
set.seed(2828)
# Sample 21 of the 30 subjects (70%) for the training set
train_ind <- sample(unique(cleanData$subject), 21)
cleanData.train <- cleanData[cleanData$subject %in% train_ind,]
cleanData.test <- cleanData[cleanData$subject %!in% train_ind,]
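Two quick checks (an added sanity check, not part of the original brief) confirm that the subject sets are disjoint and that the split is roughly 70/30:
length(intersect(cleanData.train$subject, cleanData.test$subject)) # expect 0
round(nrow(cleanData.train) / nrow(cleanData), 2)                  # roughly 0.7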
Next, a random forest model is fitted to the training data with activity as the outcome.
# Refer to the outcome by column name so that "." excludes it from the predictors
activity.rf <- randomForest(as.factor(activity) ~ ., data = cleanData.train)
The confusion matrix for the training set shows 100% accuracy. This is expected rather than remarkable: random forests typically fit their training data perfectly, so the test set below gives a more realistic measure of performance.
training.cm <- confusionMatrix(as.factor(cleanData.train$activity),
predict(activity.rf, cleanData.train, type="class"))
training.cm[2]
## $table
## Reference
## Prediction laying sitting standing walking walkingDownstairs
## laying 1380 0 0 0 0
## sitting 0 1257 0 0 0
## standing 0 0 1370 0 0
## walking 0 0 0 1182 0
## walkingDownstairs 0 0 0 0 996
## walkingUpstairs 0 0 0 0 0
## Reference
## Prediction walkingUpstairs
## laying 0
## sitting 0
## standing 0
## walking 0
## walkingDownstairs 0
## walkingUpstairs 1090
Using the ‘varImpPlot’ function, the variables can be ranked by their importance to the model.
par(mfrow=c(1,1))
varImpPlot(activity.rf, pch=1, main="Random Forest Model Variables Importance")
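If a numeric ranking is preferred to the plot, the same scores can be extracted with the ‘importance’ function (a small added sketch; by default randomForest records the mean decrease in Gini impurity):
imp <- importance(activity.rf)
# Top ten variables by mean decrease in Gini impurity
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)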
Finally, the model is run on the test set to estimate its accuracy on unseen subjects.
test.cm <- confusionMatrix(as.factor(cleanData.test$activity),
predict(activity.rf, cleanData.test,type="class"))
test.cm[2]
## $table
## Reference
## Prediction laying sitting standing walking walkingDownstairs
## laying 562 0 0 0 2
## sitting 0 461 59 0 0
## standing 0 31 505 0 0
## walking 0 0 0 512 6
## walkingDownstairs 0 0 0 13 397
## walkingUpstairs 0 0 0 13 27
## Reference
## Prediction walkingUpstairs
## laying 0
## sitting 0
## standing 0
## walking 22
## walkingDownstairs 0
## walkingUpstairs 414
The accuracy of the random forest model on the test set is 94%. This indicates that it is a good model for predicting activities from the data collected from the smartphones.
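The accuracy figure can be read directly from the caret confusionMatrix object (an added one-liner):
test.cm$overall["Accuracy"] # approximately 0.94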
This project aimed to demonstrate the importance of creating a clean and structured data set in order to perform meaningful analysis. There are a number of other important cleaning techniques that were not required for this project but should be considered in any data cleaning task. These include filling in missing values, correcting erroneous values and standardising.
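As a brief illustration of those checks (an added sketch; in this particular data set the features contain no missing values and are already normalised to [-1, 1]):
sum(is.na(cleanData)) # count missing values; zero for this data set
# Standardising the numeric feature columns, were it needed:
cleanScaled <- scale(cleanData[, -(1:2)]) # excludes subject and activity columns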
The clean data set allowed statistical analysis to be performed in a few simple steps with meaningful results. A further step in this analysis could be to assess the random forest model against other statistical methods.