The script performs the following tasks:
Information about the dataset can be found at the repository:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Merge the training and test sets to create one data set and #3, use descriptive activity names to name the activities in the data set, and #4 appropriately labels the data set with descriptive variable names. The following script below accomplishes this task.
X_train <- read.table("~/Downloads/UCI HAR Dataset/train/X_train.txt", quote="\"", stringsAsFactors=FALSE)
X_test <- read.table("~/Downloads/UCI HAR Dataset/test/X_test.txt", quote="\"", stringsAsFactors=FALSE)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, last
X_all <- bind_rows(X_train, X_test) #add all the training and test data together
View(X_all)
Now I name the columns something reasonable that indicates the data type.
features <- read.table("~/Downloads/UCI HAR Dataset/features.txt", quote="\"", stringsAsFactors=FALSE)
View(features)
names(X_all) = features$V2
subject_train <- read.table("~/Downloads/UCI HAR Dataset/train/subject_train.txt", quote="\"", stringsAsFactors=FALSE)
subject_test <- read.table("~/Downloads/UCI HAR Dataset/test/subject_test.txt", quote="\"")
subject_all <- bind_rows(subject_train, subject_test) #combine as before in the x train and test sets
activity_labels <- read.table("~/Downloads/UCI HAR Dataset/activity_labels.txt", quote="\"", stringsAsFactors=FALSE)
View(activity_labels)
subject_activity <- full_join(subject_all, activity_labels) #descriptive activity names for all the activities in the data set on each subject
## Joining by: "V1"
subject.no <- subject_activity$V1
activity <- subject_activity$V2
subject.activity <- data.frame(subject.no, activity)
all_data <- cbind(as.data.table(X_all),(subject_activity))
There are now two complete, corresponding data frames, one with all the X data and one with all the subject activity data by number and label
The next task is accomplished by joining the data to make one large data set.
y_train <- read.table("~/Downloads/UCI HAR Dataset/train/y_train.txt", quote="\"", stringsAsFactors=FALSE)
y_test <- read.table("~/Downloads/UCI HAR Dataset/train/y_train.txt", quote="\"", stringsAsFactors=FALSE)
y_all <- bind_rows(y_train, y_test) #add all the training and test data together
View(y_all)
As a precautionary step, just for good measure, I create .csv output files to save and check data or to send to those who prefer other data analysis methods
write.csv(all_data, "all_data.csv")
write.csv(subject_activity, "subject_activity.csv")
write.csv(y_all, "all_y_data.csv")
The next few scripts extracts measurements on the mean and std for each measurement; labels correctly with descriptive name. The first 6 columns in the all_data set have the mean and std, the rest are other measurements or quantiles. Below is a new data set giving just these measurements.
mean <- select(all_data, contains("mean"))
std <- select(all_data, contains("std"))
mean.std <- cbind(as.data.table(mean),(std))
mean.std.subject.activity <- cbind(as.data.table(mean.std), (subject.activity))
write.csv(mean.std.subject.activity, "mean.std.subject.activity.csv")
Create a tidy data set with average of each variable for each activity by each subject and save it as a text file. Now you have a data file ready for further analysis.
tidy <- na.omit(mean.std.subject.activity) #data set has no missing variables and contains all the means (averages) and std for each subject by activity in which they participated
write.table(tidy, row.name=FALSE, "tidy.txt")