This project stems from the Coursera Cleaning Data Course, which has a 4-week curriculum. Week 4 includes an assignment on data cleaning, and this entire project is based on the methodology used to solve that assignment.
This project was completed by ASaeedSH, a student of Business Analytics who also works at ThinkFaculty Company. You can visit us at www.thinkfaculty.com
The objective here is to describe how a solution was drafted for the Coursera Cleaning Data Week 4 assignment. Code is often presented without the reasoning behind it. Here we attempt to discuss both.
This is written and executed by ASaeedSH - UMT.
The project is based on the Coursera Cleaning Data Course Week 4 assignment. The R code has to produce 5 outcomes, which the steps below work through.
Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data linked to from the course website represent data collected from the accelerometers from the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Here are the data for the project:
https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
The following are the main steps used to complete the requirements of the project:
Downloading and Cleaning the Data:
This is a prerequisite. You cannot move forward without getting the data downloaded and stored. In this particular case, the downloaded archive also has to be unzipped.
For some reason the download URL could not be included in the code here, so only the main steps are shown.
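Since the download call itself is not shown, here is a minimal sketch of how that step might look. The helper name `fetch_zip` and its `downloader` argument are assumptions made for illustration and are not part of the original script; the URL is the one listed above, and `./midtermdata` is the folder used in the later steps.

```r
# Hypothetical helper (not from the original script): download the archive
# only when it is not already on disk, and return the path to the zip file.
fetch_zip <- function(url, data_dir, zip_name = "Dataset.zip",
                      downloader = utils::download.file) {
  if (!dir.exists(data_dir)) {
    dir.create(data_dir, recursive = TRUE)
  }
  zip_path <- file.path(data_dir, zip_name)
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip file intact on Windows
    downloader(url, destfile = zip_path, mode = "wb")
  }
  zip_path
}

data_url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
# Uncomment to download (requires network access):
# zip_path <- fetch_zip(data_url, "./midtermdata")
```

Skipping the download when the file already exists keeps repeated runs of the script fast.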
# Unzip dataSet to /data directory
unzip(zipfile="./data/Dataset.zip",exdir="./data")
## Warning in unzip(zipfile = "./data/Dataset.zip", exdir = "./data"): error 1
## in extracting from zip file
The zip file is now downloaded, so the next step is to unzip it.
unzip(zipfile="./midtermdata/Dataset.zip",exdir="./midtermdata")
## Warning in unzip(zipfile = "./midtermdata/Dataset.zip", exdir = "./
## midtermdata"): error 1 in extracting from zip file
# Let's check the zip file and the new folder that has been unzipped
list.files("D:/rlang/midtermdata")
## [1] "Dataset.zip" "mainsourcecode.R" "run_analysis2.R"
## [4] "UCI HAR Dataset"
#define the path where the new folder has been unzipped
pathdata = file.path("./midtermdata", "UCI HAR Dataset")
#create a character vector which has the 28 file names
files = list.files(pathdata, recursive=TRUE)
#show the files
files
## [1] "activity_labels.txt"
## [2] "features.txt"
## [3] "features_info.txt"
## [4] "README.txt"
## [5] "test/Inertial Signals/body_acc_x_test.txt"
## [6] "test/Inertial Signals/body_acc_y_test.txt"
## [7] "test/Inertial Signals/body_acc_z_test.txt"
## [8] "test/Inertial Signals/body_gyro_x_test.txt"
## [9] "test/Inertial Signals/body_gyro_y_test.txt"
## [10] "test/Inertial Signals/body_gyro_z_test.txt"
## [11] "test/Inertial Signals/total_acc_x_test.txt"
## [12] "test/Inertial Signals/total_acc_y_test.txt"
## [13] "test/Inertial Signals/total_acc_z_test.txt"
## [14] "test/subject_test.txt"
## [15] "test/X_test.txt"
## [16] "test/y_test.txt"
## [17] "train/Inertial Signals/body_acc_x_train.txt"
## [18] "train/Inertial Signals/body_acc_y_train.txt"
## [19] "train/Inertial Signals/body_acc_z_train.txt"
## [20] "train/Inertial Signals/body_gyro_x_train.txt"
## [21] "train/Inertial Signals/body_gyro_y_train.txt"
## [22] "train/Inertial Signals/body_gyro_z_train.txt"
## [23] "train/Inertial Signals/total_acc_x_train.txt"
## [24] "train/Inertial Signals/total_acc_y_train.txt"
## [25] "train/Inertial Signals/total_acc_z_train.txt"
## [26] "train/subject_train.txt"
## [27] "train/X_train.txt"
## [28] "train/y_train.txt"
At this stage the files are unzipped and the entire data set is ready to be evaluated. There are 28 files that need to be analyzed, as listed above.
Note - This is the basis for the entire project. Ensure that this milestone is reached.
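One way to confirm the milestone is to check that the files read in the later steps are actually present. The sketch below is only an illustration: the helper name `missing_files` is hypothetical, and the file names are taken from the listing above.

```r
# Hypothetical helper (not from the original script): return the expected
# files that cannot be found under the unzipped data folder.
missing_files <- function(pathdata) {
  required <- c(
    "activity_labels.txt", "features.txt",
    file.path("train", c("X_train.txt", "y_train.txt", "subject_train.txt")),
    file.path("test", c("X_test.txt", "y_test.txt", "subject_test.txt"))
  )
  required[!file.exists(file.path(pathdata, required))]
}

# Example use, assuming the folder from the steps above:
# missing <- missing_files(file.path("./midtermdata", "UCI HAR Dataset"))
# if (length(missing) > 0) stop("Missing files: ", paste(missing, collapse = ", "))
```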
The README.txt document in the data gives a detailed perspective of what to expect and how to manipulate the data. The files fall into three core groups:
* Main: activity_labels, features, features_info, README
* Test: Inertial Signals, subject_test, X_test, y_test
* Train: Inertial Signals, subject_train, X_train, y_train
The key files are:
* ‘features_info.txt’: Shows information about the variables used on the feature vector.
* ‘features.txt’: List of all features.
* ‘activity_labels.txt’: Links the class labels with their activity name.
* ‘train/X_train.txt’: Training set.
* ‘train/y_train.txt’: Training labels.
* ‘test/X_test.txt’: Test set.
* ‘test/y_test.txt’: Test labels.
Analysis shows that you can categorize the data into 4 segments:
* training set
* test set
* features
* activity labels
Inertial Signals data is not required. Additionally, features and activity labels serve more as tags and descriptions than as data sets.
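As a small illustration, the Inertial Signals files can be dropped from a file listing with `grepl`. The vector `sample_files` below is a hand-picked subset of the listing shown earlier, used only for this sketch.

```r
# A short, hand-picked subset of the file listing (for illustration only)
sample_files <- c("activity_labels.txt", "features.txt",
                  "test/Inertial Signals/body_acc_x_test.txt",
                  "test/X_test.txt",
                  "train/Inertial Signals/body_acc_x_train.txt",
                  "train/X_train.txt")

# fixed = TRUE treats the pattern as plain text rather than a regular expression
needed <- sample_files[!grepl("Inertial Signals", sample_files, fixed = TRUE)]
needed
```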
The objective here is to build the training and test data in the sequence stated above. 4 basic data sets will be defined and created:
### 1. Output Steps - Here we begin creating the training and test data sets
#Reading training tables - xtrain / ytrain, subject train
xtrain = read.table(file.path(pathdata, "train", "X_train.txt"),header = FALSE)
ytrain = read.table(file.path(pathdata, "train", "y_train.txt"),header = FALSE)
subject_train = read.table(file.path(pathdata, "train", "subject_train.txt"),header = FALSE)
#Reading the testing tables
xtest = read.table(file.path(pathdata, "test", "X_test.txt"),header = FALSE)
ytest = read.table(file.path(pathdata, "test", "y_test.txt"),header = FALSE)
subject_test = read.table(file.path(pathdata, "test", "subject_test.txt"),header = FALSE)
#Read the features data
features = read.table(file.path(pathdata, "features.txt"),header = FALSE)
#Read activity labels data
activityLabels = read.table(file.path(pathdata, "activity_labels.txt"),header = FALSE)
Note - the issue here is that the columns are not labeled, so there is no easy way to interpret the data. This is a common issue with raw data sets. Some coders add the labels later on; however, I believe it is best to create the labels now and build a proper data set from the start.
#Assign sensible column names to the train data
colnames(xtrain) = features[,2]
colnames(ytrain) = "activityId"
colnames(subject_train) = "subjectId"
#Assign sensible column names to the test data
colnames(xtest) = features[,2]
colnames(ytest) = "activityId"
colnames(subject_test) = "subjectId"
#Assign column names to the activity labels
colnames(activityLabels) <- c('activityId','activityType')
At this stage the data sets have been created with the right column names. This makes the remaining steps easier to complete.
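To make the naming step concrete, here is a toy sketch with made-up values. The objects `toy_features` and `toy_x` are stand-ins invented for this example; the real tables are much wider. The guard checks that there is exactly one feature name per measurement column before the names are assigned.

```r
# Toy stand-ins for features.txt and X_train.txt (values are made up)
toy_features <- data.frame(
  V1 = 1:3,
  V2 = c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y", "tBodyAcc-std()-X"),
  stringsAsFactors = FALSE
)
toy_x <- data.frame(V1 = c(0.1, 0.2), V2 = c(0.3, 0.4), V3 = c(0.5, 0.6))

# Guard: one feature name per measurement column
stopifnot(nrow(toy_features) == ncol(toy_x))

# The second column of the features table holds the names
colnames(toy_x) <- toy_features[, 2]
colnames(toy_x)
```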
#Merging the train and test data - important outcome of the project
mrg_train = cbind(ytrain, subject_train, xtrain)
mrg_test = cbind(ytest, subject_test, xtest)
#Create the main data table by merging both tables - this is outcome 1
setAllInOne = rbind(mrg_train, mrg_test)
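The merge above works in two moves: `cbind` places the activity, subject and measurement columns side by side, and `rbind` stacks the training rows on top of the test rows. A toy sketch with made-up values (all `toy_` objects are invented for this example) shows the resulting shape:

```r
# Toy training tables: 2 rows
toy_ytrain <- data.frame(activityId = c(1, 2))
toy_subject_train <- data.frame(subjectId = c(1, 1))
toy_xtrain <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))

# Toy test tables: 1 row
toy_ytest <- data.frame(activityId = 3)
toy_subject_test <- data.frame(subjectId = 2)
toy_xtest <- data.frame(f1 = 0.5, f2 = 0.6)

# cbind adds columns, rbind adds rows
toy_train <- cbind(toy_ytrain, toy_subject_train, toy_xtrain)
toy_test <- cbind(toy_ytest, toy_subject_test, toy_xtest)
toy_all <- rbind(toy_train, toy_test)
dim(toy_all)
```

`rbind` on data frames matches columns by name, so both halves need identical column names, which is one reason for naming the columns before merging.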
The requirement here is to keep only the mean and standard deviation measurements. This can be done in different ways. Here we use the grepl function to select the matching columns and create a data set that meets the requirement.
# Next step is to read all the column names that are available
colNames = colnames(setAllInOne)
#Need to get a subset of all the means and standard deviations and the corresponding activityId and subjectId
#Note: "mean.." is a regular expression, so it also matches the meanFreq() columns
mean_and_std = (grepl("activityId" , colNames) | grepl("subjectId" , colNames) | grepl("mean.." , colNames) | grepl("std.." , colNames))
#A subset has to be created to get the required dataset
setForMeanAndStd <- setAllInOne[ , mean_and_std]
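As a small illustration of how the pattern choice affects the selection, the sketch below compares two patterns on a few representative column names. The stricter pattern is an alternative shown for comparison only, not the pattern used in this project.

```r
toy_names <- c("activityId", "subjectId",
               "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
               "fBodyAcc-meanFreq()-X", "tBodyAcc-max()-X")

# In a regular expression, "mean.." is "mean" followed by any two characters
loose <- grepl("mean..", toy_names) | grepl("std..", toy_names)

# Escaping the parentheses matches the literal text "mean()" or "std()"
strict <- grepl("mean\\(\\)|std\\(\\)", toy_names)

toy_names[loose]   # keeps the mean(), std() and meanFreq() columns
toy_names[strict]  # keeps only the mean() and std() columns
```

Whether the meanFreq() columns belong in the result depends on how the requirement is read, so either choice can be defended as long as it is stated.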
setWithActivityNames = merge(setForMeanAndStd, activityLabels, by='activityId', all.x=TRUE)
Note that the labels were already defined earlier. Here we are simply attaching the activity names to the data set.
setAllInOne and setForMeanAndStd are the resulting data sets so far.
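The `merge` call works like a lookup: every row receives the activity name that matches its `activityId`. A toy sketch with made-up measurements (the `toy_` objects are invented for this example):

```r
toy_set <- data.frame(activityId = c(2, 1, 2),
                      subjectId = c(1, 1, 2),
                      f1 = c(0.2, 0.1, 0.3))
toy_labels <- data.frame(activityId = c(1, 2),
                         activityType = c("WALKING", "WALKING_UPSTAIRS"),
                         stringsAsFactors = FALSE)

# all.x = TRUE keeps every measurement row, even if a label were missing
toy_named <- merge(toy_set, toy_labels, by = "activityId", all.x = TRUE)
toy_named
```

Note that `merge` sorts the result by the key column, so the row order can differ from the input.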
This is a tricky area: you first need to get the average of each variable for each activity and each subject. A good function to use here is aggregate.
# New tidy set has to be created
# activityType is part of the grouping so the activity names are kept rather than averaged
secTidySet <- aggregate(. ~ subjectId + activityId + activityType, setWithActivityNames, mean)
secTidySet <- secTidySet[order(secTidySet$subjectId, secTidySet$activityId),]
#The last step is to write the output to a text file
write.table(secTidySet, "secTidySet.txt", row.names=FALSE)
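To see what `aggregate` does with the formula interface, here is a toy sketch with made-up numbers. The data frame `toy` and its columns `f1` and `f2` are invented for this example.

```r
toy <- data.frame(subjectId = c(1, 1, 1, 2, 2),
                  activityId = c(1, 1, 2, 1, 1),
                  f1 = c(2, 4, 6, 10, 20),
                  f2 = c(1, 1, 1, 3, 5))

# "." stands for every column not named on the right-hand side
toy_tidy <- aggregate(. ~ subjectId + activityId, data = toy, FUN = mean)
toy_tidy <- toy_tidy[order(toy_tidy$subjectId, toy_tidy$activityId), ]
toy_tidy
```

Each subject and activity pair collapses to a single row holding the mean of every measurement column, which is the shape the tidy data set needs.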