Requirement

One of the most exciting areas in all of data science right now is wearable computing (see, for example, this article). Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data linked to from the course website represent data collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained:

http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Here are the data for the project:

https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

You should create one R script called run_analysis.R that does the following:
1. Merges the training and the test sets to create one data set.
2. Extracts only the measurements on the mean and standard deviation for each measurement.
3. Uses descriptive activity names to name the activities in the data set.
4. Appropriately labels the data set with descriptive variable names.
5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject.

Clean, tidy data (Coursera course)

Hello! Below is a description of my work. I'll show you how I coded the .R file to complete the assignment.

Lines 1-6 of the code load the required libraries.
Lines 10-16 of the code load the data files.

Data overview:
The original data is:
train set:
X: 7352 rows, 561 cols
Y: 7352 rows, 1 cols
test set:
X: 2947 rows, 561 cols
Y: 2947 rows, 1 cols

The train data contains 7352 samples, each with 561 measurements.
The test data contains 2947 samples, each with 561 measurements.

Add dataset type (code lines 18-20)
Before merging these datasets, I add a column to each of them indicating whether a sample belongs to the train or the test set. The new column is named “train_test” and takes the value “train” or “test” accordingly.

Merge data (code lines 22-26)
For each train/test set, I first merge X, Y and Subject (the volunteer) together using cbind.
Then I merge train and test sets together using rbind.
The data after merging has a total of 10299 samples, each with 561 measurements. Dimension of merged data is 10299 x 564 (3 columns are Y, Subject and train/test).
This would be the end of Question 1.
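
The tagging and merging steps above can be sketched on toy data. This is only a minimal illustration: the `toy_*` data frames are made-up stand-ins for the real files, not part of run_analysis.R.

```r
# Toy stand-ins for X (measurements), Y (activity label) and Subject.
toy_train_x <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(1.1, 1.2, 1.3))
toy_train_y <- data.frame(Y = c(1, 2, 1))
toy_train_subj <- data.frame(Subject = c(1, 1, 3))

toy_test_x <- data.frame(V1 = c(0.4, 0.5), V2 = c(1.4, 1.5))
toy_test_y <- data.frame(Y = c(2, 3))
toy_test_subj <- data.frame(Subject = c(2, 2))

# Tag each sample with its origin before merging.
toy_train_x$train_test <- "train"
toy_test_x$train_test <- "test"

# cbind places columns side by side; rbind stacks rows.
toy_train <- cbind(toy_train_x, toy_train_y, toy_train_subj)
toy_test <- cbind(toy_test_x, toy_test_y, toy_test_subj)
toy_merged <- rbind(toy_test, toy_train)

dim(toy_merged)  # 5 rows, 5 columns (2 measurements + train_test, Y, Subject)
```

The same arithmetic applies to the real data: the row counts of the two sets add up, and the column count is the number of measurements plus the three extra columns.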

Assign activity (code lines 31-40)
The next step is to assign an Activity to each Y value, based on the corresponding information in activity_labels.txt.
I first load activity_labels.txt and name the code column Y so that it matches the merged data.
Then I assign the activity for each sample by joining the activity labels to the merged data on the Y column.
The merged data now has one more column, Activity, for a total of 565.
This would be the end of Question 3.
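
The lookup described above can be sketched as follows. Note the assumptions: the script itself uses plyr's join, whereas this sketch uses base R match() as a stand-in, and `toy_labels` and `toy_data` are made-up examples.

```r
# Lookup table shaped like activity_labels.txt (code, name).
toy_labels <- data.frame(
  Y = 1:3,
  Activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"),
  stringsAsFactors = FALSE
)

# Toy merged data: only the label column Y matters here.
toy_data <- data.frame(Value = c(0.5, 0.7, 0.9, 0.2), Y = c(3L, 1L, 1L, 2L))

# match() gives, for each Y, the row of the lookup table holding that code,
# so the original row order of toy_data is preserved.
toy_data$Activity <- toy_labels$Activity[match(toy_data$Y, toy_labels$Y)]

toy_data$Activity
```

Preserving row order matters here because the rows of X, Y and Subject are aligned by position only.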

Extract part of dataset (code lines 45-48)
The next part is to extract the mean and standard deviation measurements for each sample.
I first extract the column names of the latest data.
I then use grepl to find the names that contain the string “Mean”, “mean” or “std” and keep those columns.
The data now has only 90 columns, 86 of which are mean and standard deviation measurements.
This would be the end of Question 2.
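
As a small illustration of the name filter, here is the same pattern applied to a handful of sample names. The vector `toy_names` is a hypothetical subset chosen for the example.

```r
# A few column names in the style of features.txt, plus one reference column.
toy_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X",
               "fBodyAcc-meanFreq()-X", "angle(X,gravityMean)", "Y")

# TRUE where the name contains "Mean", "mean" or "std".
keep <- grepl("[Mm]ean|std", toy_names)

toy_names[keep]
```

Because the pattern matches anywhere in the name, it also selects the meanFreq() and angle(..., gravityMean) columns, not only mean() and std().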

Rename variable names (code lines 53-70)
To make the variable names more informative, I perform the following steps:
- move the last 4 columns to the far left and rename them to “Dataset”, “Label”, “Subject”, “Activity”.
- among the remaining columns, I:
delete the string “()” because it is redundant
delete the string “Freq” because it is already represented by the prefix f
delete one copy of the duplicated string “BodyBody”
delete the string “.1” because it is meaningless
replace the prefix “angle” with “a” for prefix consistency (t, f, a)
(all of these operations are done with the sub function)
This would be the end of Question 4.
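
The renaming chain can be tried on a few sample names. This is a sketch under assumptions: `toy_names` is a made-up subset, and the first step uses fixed = TRUE to match the literal “()” rather than the regular expression used in the script.

```r
toy_names <- c("tBodyAcc-mean()-X",
               "fBodyBodyGyroMag-meanFreq()",
               "angle(X,gravityMean)")

step1 <- sub("()", "", toy_names, fixed = TRUE)  # drop the literal "()"
step2 <- sub("[Ff]req", "", step1)               # drop "Freq" / "freq"
step3 <- sub("BodyBody", "Body", step2)          # collapse the doubled "Body"
step4 <- sub("angle", "a", step3)                # shorten the "angle" prefix

step4
# "tBodyAcc-mean-X" "fBodyGyroMag-mean" "a(X,gravityMean)"
```

sub replaces only the first match in each string; gsub would be needed if a pattern could occur more than once in a name.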

Make tidy dataset (code lines 75-103)
For the last question, I:
- first rearrange the columns into this order: 4 reference columns, 57 XYZ measurements and 29 mean/std measurements (90 columns in total).
- then reshape the XYZ group and the mean/std group into long form, one row per sample and measurement (using the gather function on all columns except the 4 reference columns)
- then combine these 2 restructured datasets

The data now has this structure:
4 reference columns
1 column of measurement type
1 column of corresponding measurement value
Its dimension now becomes: 885714 x 6.
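
The wide-to-long reshaping can be sketched on a toy table. Assumptions: the script uses tidyr's gather, while this sketch uses a base R loop as a stand-in so that it runs without extra packages; `toy_wide` is made up.

```r
toy_wide <- data.frame(
  Subject = c(1, 2),
  Activity = c("WALKING", "SITTING"),
  `tBodyAcc-mean-X` = c(0.1, 0.2),
  `tBodyAccMag-mean` = c(0.3, 0.4),
  check.names = FALSE, stringsAsFactors = FALSE
)

id_cols <- c("Subject", "Activity")
measure_cols <- setdiff(names(toy_wide), id_cols)

# One block of rows per measurement column, stacked on top of each other.
toy_long <- do.call(rbind, lapply(measure_cols, function(m) {
  data.frame(toy_wide[id_cols], Measurement = m, Value = toy_wide[[m]],
             stringsAsFactors = FALSE)
}))

dim(toy_long)  # 4 rows (2 samples x 2 measurements), 4 columns
```

The row count of the long table is the number of samples times the number of measurement columns, which is how 10299 samples and 86 measurements give 885714 rows.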

The code begins below; the code book is at the end of the file.

library(plyr)     # for join command
library(tidyr)    # for gather command
library(dplyr)    # for group_by command
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1. Load training set

trainX <- read.csv("./train/X_train.txt", header = FALSE, sep = "")
trainY <- read.csv("./train/y_train.txt", header = FALSE)
trainSubj <- read.csv("./train/subject_train.txt", header = FALSE)

Load test set

testX <- read.csv("./test/X_test.txt", header = FALSE, sep = "")
testY <- read.csv("./test/y_test.txt", header = FALSE)
testSubj <- read.csv("./test/subject_test.txt", header = FALSE)

2. Add a column to indicate which dataset (train/test) each row belongs to, before merging

trainX$train_test <- "train"
testX$train_test <- "test"

3. Merge X, Y and Subject within each dataset

testMerge <- cbind(testX,testY,testSubj)
trainMerge <- cbind(trainX,trainY,trainSubj)
# Merge test + train to dt:
dt <- rbind(testMerge,trainMerge)

—– (End of question 1) —–

4. Some modifications to the merged data

Assign column names from ‘features.txt’, and name the last 3 columns “train_test”, “Y” and “Subject”

features <- read.csv("./features.txt", header = FALSE, sep = " ")
names(dt) <- c(as.character(features[,2]),"train_test","Y","Subject")

Assign the activity name corresponding to each label

activity <- read.csv("./activity_labels.txt", header = FALSE, sep = " ")
names(activity) <- c("Y", "Activity")
dt$Y <- as.integer(dt$Y)
data <- join(dt,activity, by = "Y")

—– (End of question 3) —–

5. Extract mean, standard deviation measurements

name <- names(data)
extr <- grepl("[Mm]ean|std",name)
extract <- cbind(data[,extr],data$train_test,data$Y,data$Subject,data$Activity)

—– (End of question 2) —–

6. Rename columns

Rearrange columns: move the last 4 columns to the front

ncol(extract)
## [1] 90
extract <- extract[c(87:90,1:86)]

Start with the easiest: the 4 reference columns, which are now first

varName <- names(extract)
varName[1:4] <- c("Dataset","Label","Subject","Activity")

Remove “()”, “Freq”, the duplicated “Body” and “.1”, and shorten “angle” to “a”

varName1 <- sub("\\()", "", varName)
varName2 <- sub("[Ff]req", "", varName1)
varName3 <- sub("BodyBody", "Body", varName2)
varName4 <- sub("\\.1", "", varName3)
varName5 <- sub("angle", "a", varName4)
#varName5
names(extract) <- varName5
#names(extract)

—– (End of question 4) —–

7. Make tidy datasets

  1. Rearrange columns
extract1 <- extract[,c(1:34,45:71,88:90,35:44,72:87)]

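(After this step some names carry a “.1” suffix again. Subsetting a data frame with [ makes duplicated column names unique by appending a suffix, and removing “Freq” left several names duplicated, so I strip the suffix again below.)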

varExtract1 <- names(extract1)
varExtract2 <- sub("\\.1", "", varExtract1)
names(extract1) <- varExtract2
names(extract1)
##  [1] "Dataset"                          "Label"                           
##  [3] "Subject"                          "Activity"                        
##  [5] "tBodyAcc-mean-X"                  "tBodyAcc-mean-Y"                 
##  [7] "tBodyAcc-mean-Z"                  "tBodyAcc-std-X"                  
##  [9] "tBodyAcc-std-Y"                   "tBodyAcc-std-Z"                  
## [11] "tGravityAcc-mean-X"               "tGravityAcc-mean-Y"              
## [13] "tGravityAcc-mean-Z"               "tGravityAcc-std-X"               
## [15] "tGravityAcc-std-Y"                "tGravityAcc-std-Z"               
## [17] "tBodyAccJerk-mean-X"              "tBodyAccJerk-mean-Y"             
## [19] "tBodyAccJerk-mean-Z"              "tBodyAccJerk-std-X"              
## [21] "tBodyAccJerk-std-Y"               "tBodyAccJerk-std-Z"              
## [23] "tBodyGyro-mean-X"                 "tBodyGyro-mean-Y"                
## [25] "tBodyGyro-mean-Z"                 "tBodyGyro-std-X"                 
## [27] "tBodyGyro-std-Y"                  "tBodyGyro-std-Z"                 
## [29] "tBodyGyroJerk-mean-X"             "tBodyGyroJerk-mean-Y"            
## [31] "tBodyGyroJerk-mean-Z"             "tBodyGyroJerk-std-X"             
## [33] "tBodyGyroJerk-std-Y"              "tBodyGyroJerk-std-Z"             
## [35] "fBodyAcc-mean-X"                  "fBodyAcc-mean-Y"                 
## [37] "fBodyAcc-mean-Z"                  "fBodyAcc-std-X"                  
## [39] "fBodyAcc-std-Y"                   "fBodyAcc-std-Z"                  
## [41] "fBodyAcc-mean-X"                  "fBodyAcc-mean-Y"                 
## [43] "fBodyAcc-mean-Z"                  "fBodyAccJerk-mean-X"             
## [45] "fBodyAccJerk-mean-Y"              "fBodyAccJerk-mean-Z"             
## [47] "fBodyAccJerk-std-X"               "fBodyAccJerk-std-Y"              
## [49] "fBodyAccJerk-std-Z"               "fBodyAccJerk-mean-X"             
## [51] "fBodyAccJerk-mean-Y"              "fBodyAccJerk-mean-Z"             
## [53] "fBodyGyro-mean-X"                 "fBodyGyro-mean-Y"                
## [55] "fBodyGyro-mean-Z"                 "fBodyGyro-std-X"                 
## [57] "fBodyGyro-std-Y"                  "fBodyGyro-std-Z"                 
## [59] "fBodyGyro-mean-X"                 "fBodyGyro-mean-Y"                
## [61] "fBodyGyro-mean-Z"                 "a(X,gravityMean)"                
## [63] "a(Y,gravityMean)"                 "a(Z,gravityMean)"                
## [65] "tBodyAccMag-mean"                 "tBodyAccMag-std"                 
## [67] "tGravityAccMag-mean"              "tGravityAccMag-std"              
## [69] "tBodyAccJerkMag-mean"             "tBodyAccJerkMag-std"             
## [71] "tBodyGyroMag-mean"                "tBodyGyroMag-std"                
## [73] "tBodyGyroJerkMag-mean"            "tBodyGyroJerkMag-std"            
## [75] "fBodyAccMag-mean"                 "fBodyAccMag-std"                 
## [77] "fBodyAccMag-mean"                 "fBodyAccJerkMag-mean"            
## [79] "fBodyAccJerkMag-std"              "fBodyAccJerkMag-mean"            
## [81] "fBodyGyroMag-mean"                "fBodyGyroMag-std"                
## [83] "fBodyGyroMag-mean"                "fBodyGyroJerkMag-mean"           
## [85] "fBodyGyroJerkMag-std"             "fBodyGyroJerkMag-mean"           
## [87] "a(tBodyAccMean,gravity)"          "a(tBodyAccJerkMean),gravityMean)"
## [89] "a(tBodyGyroMean,gravityMean)"     "a(tBodyGyroJerkMean,gravityMean)"
  2. Tidy columns with (X, Y, Z)
extract2 <- extract1[,c(1:61)]     # columns with X, Y, Z
res1 <- gather(extract2, key = Measurement, value = Value, -c("Dataset","Label","Subject","Activity"))
  3. Tidy columns with (mean, std)
extract3 <- extract1[,c(1:4,62:90)]     # columns with mean, std
res2 <- gather(extract3, key = Measurement, value = Value, -c("Dataset","Label","Subject","Activity"))
  4. Combine tidy (X, Y, Z) and (mean, std)
tidyData <- rbind(res1,res2)

Factorize columns and check results

cols <- c("Dataset", "Label", "Subject","Activity", "Measurement")
tidyData[cols] <- lapply(tidyData[cols], factor)
#str(tidyData)

Remove “Label” (because it is now represented by Activity)

#str(tidyData)
tidyData1 <- tidyData[,-2,drop=FALSE]
head(tidyData1)
##   Dataset Subject Activity     Measurement     Value
## 1    test       2 STANDING tBodyAcc-mean-X 0.2571778
## 2    test       2 STANDING tBodyAcc-mean-X 0.2860267
## 3    test       2 STANDING tBodyAcc-mean-X 0.2754848
## 4    test       2 STANDING tBodyAcc-mean-X 0.2702982
## 5    test       2 STANDING tBodyAcc-mean-X 0.2748330
## 6    test       2 STANDING tBodyAcc-mean-X 0.2792199
str(tidyData1)
## 'data.frame':    885714 obs. of  5 variables:
##  $ Dataset    : Factor w/ 2 levels "test","train": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Subject    : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Activity   : Factor w/ 6 levels "LAYING","SITTING",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Measurement: Factor w/ 86 levels "a(tBodyAccJerkMean),gravityMean)",..: 47 47 47 47 47 47 47 47 47 47 ...
##  $ Value      : num  0.257 0.286 0.275 0.27 0.275 ...
#tail(tidyData1)

Create a second, independent tidy data set with the average of each variable for each activity and each subject

tidyData2 <- tidyData1 %>% group_by(Dataset, Subject, Activity, Measurement) %>% summarise_at(vars(Value), mean)
names(tidyData2)[5] <- 'Mean_Summarized'
tidyData2[1:100,]
## # A tibble: 100 x 5
## # Groups:   Dataset, Subject, Activity [2]
##    Dataset Subject Activity                      Measurement
##     <fctr>  <fctr>   <fctr>                           <fctr>
##  1    test       2   LAYING a(tBodyAccJerkMean),gravityMean)
##  2    test       2   LAYING          a(tBodyAccMean,gravity)
##  3    test       2   LAYING a(tBodyGyroJerkMean,gravityMean)
##  4    test       2   LAYING     a(tBodyGyroMean,gravityMean)
##  5    test       2   LAYING                 a(X,gravityMean)
##  6    test       2   LAYING                 a(Y,gravityMean)
##  7    test       2   LAYING                 a(Z,gravityMean)
##  8    test       2   LAYING                  fBodyAcc-mean-X
##  9    test       2   LAYING                fBodyAcc-mean-X.1
## 10    test       2   LAYING                  fBodyAcc-mean-Y
## # ... with 90 more rows, and 1 more variables: Mean_Summarized <dbl>
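
The grouped averaging can be checked on a toy long table. Assumptions: the script uses dplyr's group_by and summarise_at, while this sketch uses base R aggregate as a stand-in; `toy_long` is made up.

```r
toy_long <- data.frame(
  Subject = c(1, 1, 1, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  Measurement = "tBodyAcc-mean-X",
  Value = c(0.2, 0.4, 0.5, 0.1),
  stringsAsFactors = FALSE
)

# One row per (Subject, Activity, Measurement) combination, holding the mean Value.
toy_means <- aggregate(Value ~ Subject + Activity + Measurement,
                       data = toy_long, FUN = mean)
names(toy_means)[names(toy_means) == "Value"] <- "Mean_Summarized"

toy_means
```

Subject 1 has two WALKING rows, so they collapse into a single row with their average; the other two groups each keep their single value.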

—– (End of question 5) —–

write.csv(extract, "extract.csv")
write.csv(tidyData2, "tidyData2.csv")
write.table(extract, file = "extract.txt", row.names = FALSE)
write.table(tidyData2, file = "tidyData2.txt", row.names = FALSE)

Code book

tidyData1 (long form of extract.csv):

Dataset: Original dataset
test: Test set
train: Train set

Subject: Identifier index of 30 volunteers
1..30

Activity: Assigned activity corresponding to action label
Walking
Walking Upstairs
Walking Downstairs
Sitting
Standing
Laying

Measurement: type of measurement from accelerometer, gyroscope and their calculated variables in frequency domain. The variables were filtered such that only Mean and Standard deviation measurements are selected. Total number of filtered measurements: 86.
tBodyAcc-mean-X
tBodyAcc-mean-Y
tBodyAcc-mean-Z
tBodyAcc-std-X
tBodyAcc-std-Y
tBodyAcc-std-Z
tGravityAcc-mean-X
tGravityAcc-mean-Y
tGravityAcc-mean-Z
tGravityAcc-std-X
tGravityAcc-std-Y
tGravityAcc-std-Z
tBodyAccJerk-mean-X
tBodyAccJerk-mean-Y
tBodyAccJerk-mean-Z
tBodyAccJerk-std-X
tBodyAccJerk-std-Y
tBodyAccJerk-std-Z
tBodyGyro-mean-X
tBodyGyro-mean-Y
tBodyGyro-mean-Z
tBodyGyro-std-X
tBodyGyro-std-Y
tBodyGyro-std-Z
tBodyGyroJerk-mean-X
tBodyGyroJerk-mean-Y
tBodyGyroJerk-mean-Z
tBodyGyroJerk-std-X
tBodyGyroJerk-std-Y
tBodyGyroJerk-std-Z
fBodyAcc-mean-X
fBodyAcc-mean-Y
fBodyAcc-mean-Z
fBodyAcc-std-X
fBodyAcc-std-Y
fBodyAcc-std-Z
fBodyAcc-mean-X.1
fBodyAcc-mean-Y.1
fBodyAcc-mean-Z.1
fBodyAccJerk-mean-X
fBodyAccJerk-mean-Y
fBodyAccJerk-mean-Z
fBodyAccJerk-std-X
fBodyAccJerk-std-Y
fBodyAccJerk-std-Z
fBodyAccJerk-mean-X.1
fBodyAccJerk-mean-Y.1
fBodyAccJerk-mean-Z.1
fBodyGyro-mean-X
fBodyGyro-mean-Y
fBodyGyro-mean-Z
fBodyGyro-std-X
fBodyGyro-std-Y
fBodyGyro-std-Z
fBodyGyro-mean-X.1
fBodyGyro-mean-Y.1
fBodyGyro-mean-Z.1
a(X,gravityMean)
a(Y,gravityMean)
a(Z,gravityMean)
tBodyAccMag-mean
tBodyAccMag-std
tGravityAccMag-mean
tGravityAccMag-std
tBodyAccJerkMag-mean
tBodyAccJerkMag-std
tBodyGyroMag-mean
tBodyGyroMag-std
tBodyGyroJerkMag-mean
tBodyGyroJerkMag-std
fBodyAccMag-mean
fBodyAccMag-std
fBodyAccMag-mean.1
fBodyAccJerkMag-mean
fBodyAccJerkMag-std
fBodyAccJerkMag-mean.1
fBodyGyroMag-mean
fBodyGyroMag-std
fBodyGyroMag-mean.1
fBodyGyroJerkMag-mean
fBodyGyroJerkMag-std
fBodyGyroJerkMag-mean.1
a(tBodyAccMean,gravity)
a(tBodyAccJerkMean),gravityMean)
a(tBodyGyroMean,gravityMean)
a(tBodyGyroJerkMean,gravityMean)

Value: corresponding value of the measurement type
End of tidyData1 file

tidyData2:

Dataset: Original dataset
test: Test set
train: Train set

Subject: Identifier index of 30 volunteers
1..30

Activity: Assigned activity corresponding to action label
Walking
Walking Upstairs
Walking Downstairs
Sitting
Standing
Laying

Measurement: type of measurement from accelerometer, gyroscope and their calculated variables in frequency domain. The variables were filtered such that only Mean and Standard deviation measurements are selected. Total number of filtered measurements: 86.
tBodyAcc-mean-X
tBodyAcc-mean-Y
tBodyAcc-mean-Z
tBodyAcc-std-X
tBodyAcc-std-Y
tBodyAcc-std-Z
tGravityAcc-mean-X
tGravityAcc-mean-Y
tGravityAcc-mean-Z
tGravityAcc-std-X
tGravityAcc-std-Y
tGravityAcc-std-Z
tBodyAccJerk-mean-X
tBodyAccJerk-mean-Y
tBodyAccJerk-mean-Z
tBodyAccJerk-std-X
tBodyAccJerk-std-Y
tBodyAccJerk-std-Z
tBodyGyro-mean-X
tBodyGyro-mean-Y
tBodyGyro-mean-Z
tBodyGyro-std-X
tBodyGyro-std-Y
tBodyGyro-std-Z
tBodyGyroJerk-mean-X
tBodyGyroJerk-mean-Y
tBodyGyroJerk-mean-Z
tBodyGyroJerk-std-X
tBodyGyroJerk-std-Y
tBodyGyroJerk-std-Z
fBodyAcc-mean-X
fBodyAcc-mean-Y
fBodyAcc-mean-Z
fBodyAcc-std-X
fBodyAcc-std-Y
fBodyAcc-std-Z
fBodyAcc-mean-X.1
fBodyAcc-mean-Y.1
fBodyAcc-mean-Z.1
fBodyAccJerk-mean-X
fBodyAccJerk-mean-Y
fBodyAccJerk-mean-Z
fBodyAccJerk-std-X
fBodyAccJerk-std-Y
fBodyAccJerk-std-Z
fBodyAccJerk-mean-X.1
fBodyAccJerk-mean-Y.1
fBodyAccJerk-mean-Z.1
fBodyGyro-mean-X
fBodyGyro-mean-Y
fBodyGyro-mean-Z
fBodyGyro-std-X
fBodyGyro-std-Y
fBodyGyro-std-Z
fBodyGyro-mean-X.1
fBodyGyro-mean-Y.1
fBodyGyro-mean-Z.1
a(X,gravityMean)
a(Y,gravityMean)
a(Z,gravityMean)
tBodyAccMag-mean
tBodyAccMag-std
tGravityAccMag-mean
tGravityAccMag-std
tBodyAccJerkMag-mean
tBodyAccJerkMag-std
tBodyGyroMag-mean
tBodyGyroMag-std
tBodyGyroJerkMag-mean
tBodyGyroJerkMag-std
fBodyAccMag-mean
fBodyAccMag-std
fBodyAccMag-mean.1
fBodyAccJerkMag-mean
fBodyAccJerkMag-std
fBodyAccJerkMag-mean.1
fBodyGyroMag-mean
fBodyGyroMag-std
fBodyGyroMag-mean.1
fBodyGyroJerkMag-mean
fBodyGyroJerkMag-std
fBodyGyroJerkMag-mean.1
a(tBodyAccMean,gravity)
a(tBodyAccJerkMean),gravityMean)
a(tBodyGyroMean,gravityMean)
a(tBodyGyroJerkMean,gravityMean)

Mean_Summarized: mean of the corresponding values of the same measurement type
End of tidyData2 file