“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This way of structuring the data provides a standardized way to link the structure of a dataset (its physical layout) with its semantics” (Hadley Wickham).
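As a small illustration of that structure, the base R sketch below reshapes a table whose column headers are activity names into one where subject, activity, and value are each a column. The numbers are invented for the example and are not taken from the dataset.

```r
# Hypothetical "messy" layout: the activity is hidden in the column headers
messy <- data.frame(
  subject = c(1, 2),
  WALKING = c(0.20, 0.30),
  SITTING = c(0.10, 0.05)
)

# Tidy layout: one row per subject/activity observation
activities <- c("WALKING", "SITTING")
tidy <- data.frame(
  subject  = rep(messy$subject, times = length(activities)),
  activity = rep(activities, each = nrow(messy)),
  value    = unlist(messy[activities], use.names = FALSE)
)
tidy
```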
This document outlines the process of downloading, cleaning, and tidying the data from the Human Activity Recognition Using Smartphones Dataset to produce a final tidy data set that contains the average of each variable for each activity and each subject.
The data used in this project is derived from the Human Activity Recognition Using Smartphones Dataset, available from the UCI Machine Learning Repository. The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist.
The authors of this study are: Jorge L. Reyes-Ortiz, Davide Anguita, Alessandro Ghio, Luca Oneto.
Data set URL: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
To access the data, refer to the following link: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. The database is provided in a compressed format, so this process starts with unzipping the file and then loading the data into R.
#Create directory
if(!dir.exists("data")){dir.create("data")}
#Define URL where to obtain the file
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
#Define destination
dest_file <- "./data/get_projectfiles.zip"
#Download the file
download.file(fileUrl, dest_file)
#Define directory to unzip the content
unzip_dir <- "./data"
#Unzip content
unzip(dest_file, exdir = unzip_dir)
#Confirm extraction
list.files(unzip_dir)
## [1] "get_projectfiles.zip" "UCI HAR Dataset"
#Record the download date (date() must be called, not just referenced)
dateDownloaded <- date()
dateDownloaded
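When the script is run repeatedly, it can be convenient to skip the download if the archive is already on disk. The following is a minimal sketch of that idea, not part of the original workflow: fetch_if_missing() is a hypothetical helper name, and passing mode = "wb" is an assumption intended to keep the zip archive intact on Windows.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing,
# and return the time at which the local copy was last modified.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    # mode = "wb" requests a binary transfer, which zip archives need
    download.file(url, dest, mode = "wb")
  }
  file.mtime(dest)
}

# Offline demonstration: the file already exists, so nothing is downloaded
existing <- tempfile(fileext = ".zip")
writeLines("placeholder", existing)
fetch_if_missing("https://example.invalid/archive.zip", existing)
```

In the workflow above, the call would be fetch_if_missing(fileUrl, dest_file).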
Next, we load the dplyr package, which will be used for data manipulation, and import two important metadata files from the dataset:
features.txt: contains the names of the 561 measurement variables recorded in the experiments.
activity_labels.txt: provides the mapping between activity IDs (numerical codes) and their descriptive activity names (e.g., WALKING, STANDING).
These files will help label the dataset accurately in the next steps.
#Load dplyr
library(dplyr)
#Load features and activity labels
features <- read.table("./data/UCI HAR Dataset/features.txt", sep="", header=FALSE)
activity_labels <- read.table("./data/UCI HAR Dataset/activity_labels.txt", sep="", header=FALSE)
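To show what these two lookup tables look like once loaded, the sketch below reads a few lines of each file from inline text, so it runs without the archive. The column names id, name, and label are illustrative assumptions; the code above keeps the default names V1 and V2.

```r
# A few lines of features.txt, inlined so the sketch runs offline
features_sample <- read.table(
  text = "1 tBodyAcc-mean()-X\n2 tBodyAcc-mean()-Y\n3 tBodyAcc-mean()-Z",
  header = FALSE, col.names = c("id", "name"), stringsAsFactors = FALSE
)

# The six activity codes and their labels, as listed in activity_labels.txt
labels_sample <- read.table(
  text = paste(
    "1 WALKING", "2 WALKING_UPSTAIRS", "3 WALKING_DOWNSTAIRS",
    "4 SITTING", "5 STANDING", "6 LAYING",
    sep = "\n"
  ),
  header = FALSE, col.names = c("id", "label"), stringsAsFactors = FALSE
)

str(features_sample)
labels_sample
```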
The Human Activity Recognition dataset is divided into two subsets: test and training data. Each subset contains three components:
X_*: the recorded measurements (561 features per observation),
y_*: the activity labels (numeric codes; the files are named y_test.txt and y_train.txt, with a lowercase y),
subject_*: the ID of the subject who performed the activity.
We load each component for both subsets and then combine them using data.frame() to create two complete datasets: Test and Train. These will later be merged into a single dataset for analysis.
#Load test data and create test dataset
X_test <- read.table("./data/UCI HAR Dataset/test/X_test.txt", sep="", header = FALSE)
Y_test <- read.table("./data/UCI HAR Dataset/test/y_test.txt", sep="", header = FALSE)
subject_test <- read.table("./data/UCI HAR Dataset/test/subject_test.txt", sep="", header = FALSE)
Test <- data.frame(subject_test, Y_test, X_test)
#Load train data and create train data set
X_train <- read.table("./data/UCI HAR Dataset/train/X_train.txt", sep="", header = FALSE)
Y_train <- read.table("./data/UCI HAR Dataset/train/y_train.txt", sep="", header = FALSE)
subject_train <- read.table("./data/UCI HAR Dataset/train/subject_train.txt", sep="", header = FALSE)
Train <- data.frame(subject_train, Y_train, X_train)
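The column-wise combination can be sketched on toy data. The three small data frames below are invented stand-ins for the subject, activity, and measurement files. The row-count check is an added precaution rather than part of the original script. Because read.table() names every first column V1, data.frame() makes the names unique until they are replaced later.

```r
# Toy stand-ins for subject_*, y_* and X_* (two observations, two features)
subject_toy <- data.frame(V1 = c(2, 4))
y_toy       <- data.frame(V1 = c(5, 1))
X_toy       <- data.frame(V1 = c(0.25, 0.27), V2 = c(-0.02, -0.01))

# The three inputs must describe the same observations, row for row
stopifnot(nrow(subject_toy) == nrow(y_toy), nrow(y_toy) == nrow(X_toy))

# Column-wise combination: subject first, then activity, then measurements
combined_toy <- data.frame(subject_toy, y_toy, X_toy)
names(combined_toy)
# "V1" "V1.1" "V1.2" "V2"
```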
Once the test and training datasets are loaded, we merge them into a single dataset using rbind(). This creates a unified dataset containing all observations. We then assign appropriate column names:
The first two columns are labeled “subject” and “activity”,
The remaining columns are named using the feature labels extracted from features.txt, which describe the sensor measurements.
# 1 - Merge test and train sets to create one data set
merged <- rbind(Train, Test)
#Assign column names
names(merged) <- c("subject", "activity", as.character(features$V2))
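A toy version of the merge-and-rename step is sketched below, with invented values and a single feature column. The two checks are suggested safeguards, not part of the original script: the first confirms that the number of new names matches the number of columns, and the second looks for repeated labels, which can be worth checking on the full feature list as well.

```r
# Toy train/test sets with the placeholder names produced by data.frame()
train_toy <- data.frame(V1 = c(1, 3), V1.1 = c(5, 4), V1.2 = c(0.28, 0.27))
test_toy  <- data.frame(V1 = 2, V1.1 = 6, V1.2 = 0.25)
feature_names_toy <- "tBodyAcc-mean()-X"

# Stack the observations: training rows first, then test rows
merged_toy <- rbind(train_toy, test_toy)

# Assign names only if there is exactly one per column
new_names <- c("subject", "activity", feature_names_toy)
stopifnot(length(new_names) == ncol(merged_toy))
names(merged_toy) <- new_names

# anyDuplicated() returns 0 when every column name is unique
anyDuplicated(names(merged_toy))
merged_toy
```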
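We extract only the measurements that represent means and standard deviations, which are the key variables of interest for this analysis. This is done using dplyr::select() with contains("mean") and contains("std"). Because contains() ignores case by default, the selection also keeps the meanFreq() variables and the angle() variables whose names include “Mean”.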
Next, we replace the numeric activity codes with descriptive activity names (e.g., “WALKING”, “SITTING”) using the factor levels defined in activity_labels.txt. This improves the readability and interpretability of the dataset.
# 2 - Extract only the measurements on the mean and standard deviation for each measurement
tidy_1 <- merged %>% select(subject, activity, contains("mean"), contains("std"))
# 3 - Use descriptive activity names to name the activities in the data set
tidy_1$activity <- factor(tidy_1$activity, levels = activity_labels$V1, labels = activity_labels$V2)
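The effect of the name matching and of the factor recoding can be checked on a handful of values. The sketch below uses base R grepl() as a stand-in for contains(), so it runs without dplyr. The stricter pattern is an assumption about what a narrower reading of "mean and standard deviation" might select, not the choice made in this document.

```r
feature_names <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                   "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)",
                   "tBodyAcc-max()-X")

# Case-insensitive substring match, analogous to contains("mean") / contains("std")
broad <- feature_names[grepl("mean|std", feature_names, ignore.case = TRUE)]

# Stricter alternative (an assumption): keep only the mean() and std() variables
strict <- feature_names[grepl("-(mean|std)\\(\\)", feature_names)]

# Recode numeric activity codes with an id/label lookup table
lookup <- data.frame(V1 = 1:3,
                     V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS"))
codes <- c(3, 1, 1, 2)
activity <- factor(codes, levels = lookup$V1, labels = lookup$V2)

broad
strict
activity
```

The broad match keeps four of the five names, while the strict pattern keeps only the first two.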
To improve clarity and make the dataset more descriptive, we update the variable names using a series of gsub() transformations. These changes include:
Replacing abbreviations with full terms (e.g., “Acc” → “Accelerometer”, “Gyro” → “Gyroscope”),
Clarifying time and frequency domain signals (a leading “t” → “Time”, a leading “f” → “Frequency”),
Removing special characters like dashes and parentheses,
Standardizing naming conventions for mean, standard deviation, and frequency features.
These changes result in more readable and self-explanatory variable names throughout the dataset.
# 4 - Appropriately label the data set with descriptive variable names
names(tidy_1) <- gsub("^t", "Time", names(tidy_1))
names(tidy_1) <- gsub("^f", "Frequency", names(tidy_1))
names(tidy_1) <- gsub("Acc", "Accelerometer", names(tidy_1))
names(tidy_1) <- gsub("Gyro", "Gyroscope", names(tidy_1))
names(tidy_1) <- gsub("Mag", "Magnitude", names(tidy_1))
names(tidy_1) <- gsub("BodyBody", "Body", names(tidy_1))
names(tidy_1) <- gsub("-mean\\(\\)", "Mean", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("-std\\(\\)", "STD", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("-meanFreq\\(\\)", "MeanFrequency", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("angle", "Angle", names(tidy_1))
names(tidy_1) <- gsub("gravity", "Gravity", names(tidy_1))
To complete the renaming process, we remove any remaining special characters such as parentheses () and hyphens - from the variable names. This ensures the column names are clean, consistent, and free of formatting artifacts that could interfere with later analysis or function usage.
# Remove any remaining special characters (hyphens, parentheses, commas)
names(tidy_1) <- gsub("[-(),]", "", names(tidy_1))
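To see what the renaming does to individual variables, the sketch below applies a subset of the same substitutions to two raw feature names. The clean_names() wrapper is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper bundling a subset of the substitutions used above
clean_names <- function(x) {
  x <- gsub("^t", "Time", x)
  x <- gsub("^f", "Frequency", x)
  x <- gsub("Acc", "Accelerometer", x)
  x <- gsub("Gyro", "Gyroscope", x)
  x <- gsub("Mag", "Magnitude", x)
  x <- gsub("BodyBody", "Body", x)
  x <- gsub("-mean\\(\\)", "Mean", x)
  x <- gsub("-std\\(\\)", "STD", x)
  gsub("-", "", x)  # drop the hyphen left before the axis suffix
}

raw <- c("tBodyAcc-mean()-X", "fBodyBodyGyroMag-std()")
data.frame(raw = raw, cleaned = clean_names(raw))
```

The two names become TimeBodyAccelerometerMeanX and FrequencyBodyGyroscopeMagnitudeSTD.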
In this step, we create a second, independent tidy dataset by calculating the average of each measurement variable for each activity and each subject. Using group_by() and summarise_all(), we aggregate the data so that each row holds the mean of every selected feature for one subject performing one activity; with 30 subjects and six activities, that gives at most 180 rows. This tidy dataset is well structured and ready for downstream analysis or reporting.
# 5 - Create a second, independent tidy data set with average of each variable for each activity and each subject
run_analysis <- tidy_1 %>% group_by(subject, activity) %>% summarise_all(list(mean = mean))
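The grouping logic can be verified by hand on a few rows. The sketch below uses invented values and base R aggregate() in place of the dplyr pipeline, purely so that it runs without additional packages. It is an equivalent illustration, not the method used in this document.

```r
# Toy data: subject 1 walks twice and sits once, subject 2 walks twice
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2),
  activity = factor(c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING")),
  TimeBodyAccelerometerMeanX = c(0.2, 0.4, 0.1, 0.3, 0.5)
)

# One row per subject/activity pair, holding the mean of each measurement
averages <- aggregate(. ~ subject + activity, data = toy, FUN = mean)
averages
```

Only the three subject/activity combinations present in the toy data appear in the result, one row each.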
And that’s how you tidy the Human Activity Recognition dataset in just a few simple and systematic steps. By downloading, merging, filtering, labeling, and summarizing the data, we transform raw sensor recordings into a clean, organized, and analysis-ready format, making it easier to explore patterns and build models for human activity recognition.