Tidy data: Getting and Cleaning Data Project

Synopsis

“Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This way of structuring the data provides a standardized way to link the structure of a dataset (its physical layout) with its semantics” (Hadley Wickham).

This document outlines the process of downloading, cleaning, and tidying the data from the Human Activity Recognition Using Smartphones Dataset to produce a final tidy data set that contains the average of each variable for each activity and each subject.

The data used in this project is derived from the Human Activity Recognition Using Smartphones Dataset, available from the UCI Machine Learning Repository. The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist.

The authors of this study are: Jorge L. Reyes-Ortiz, Davide Anguita, Alessandro Ghio, Luca Oneto.

Data set URL: https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

Loading and Processing the Data

Create directory and download dataset

To access the data, refer to the following link: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip. The database is provided in a compressed format, so this process starts with unzipping the file and then loading the data into R.

#Create directory
if(!dir.exists("data")){dir.create("data")}

#Define URL where to obtain the file
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"

#Define destination
dest_file <- "./data/get_projectfiles.zip"

#Download the file
download.file(fileUrl, dest_file, mode = "wb")

#Define directory to unzip the content
unzip_dir <- "./data"

#Unzip content
unzip(dest_file, exdir = unzip_dir)

#Confirm extraction
list.files(unzip_dir)

## [1] "get_projectfiles.zip" "UCI HAR Dataset"
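Re-running the steps above repeats the download every time. A minimal sketch of a guarded download follows; the helper name fetch_if_missing is hypothetical and not part of the original script, and it simply reuses the file when it is already on disk.

```r
# Hypothetical helper: download a file only when it is not already on disk.
# Returns TRUE if a download was attempted, FALSE if the existing file was reused.
fetch_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  # Create the destination folder if needed
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
  # mode = "wb" writes the archive as binary, which matters on Windows
  download.file(url, dest, mode = "wb")
  TRUE
}
```

With the objects defined earlier, the call would be fetch_if_missing(fileUrl, dest_file).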

#Record the download date
dateDownloaded <- date()
dateDownloaded

Load metadata

In this step, we load the dplyr package, which is used for data manipulation throughout. We also import two metadata files from the dataset:

features.txt: contains the names of the 561 measurement variables recorded in the experiments.

activity_labels.txt: provides the mapping between activity IDs (numerical codes) and their descriptive activity names (e.g., WALKING, STANDING).

These files will help label the dataset accurately in the next steps.

#Load dplyr
library(dplyr)

#Load features and activity labels
features <- read.table("./data/UCI HAR Dataset/features.txt", sep="", header=FALSE)
activity_labels <- read.table("./data/UCI HAR Dataset/activity_labels.txt", sep="", header=FALSE)
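Because neither file has a header row, read.table() assigns the default column names V1 and V2. As a small sketch, using inline text in place of the real file, names can also be supplied at read time; the names id and activity are illustrative choices, not part of the original script.

```r
# Inline stand-in for the first lines of activity_labels.txt
sample_labels <- "1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS"

# col.names replaces the default V1/V2; stringsAsFactors = FALSE keeps labels as text
labels_demo <- read.table(text = sample_labels, header = FALSE,
                          col.names = c("id", "activity"),
                          stringsAsFactors = FALSE)
str(labels_demo)
```

The rest of this document keeps the default V1 and V2 names, so this is only an alternative.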

Load raw Test and Training data

The Human Activity Recognition dataset is divided into two subsets: test and training data. Each subset contains three components:

X_*: the recorded measurements (561 features per observation),

Y_*: the activity labels (numeric codes),

subject_*: the ID of the subject who performed the activity.

We load each component for both subsets and then combine them using data.frame() to create two complete datasets: Test and Train. These will later be merged into a single dataset for analysis.

#Load test data and create test dataset

X_test <- read.table("./data/UCI HAR Dataset/test/X_test.txt", sep="", header = FALSE)
Y_test <- read.table("./data/UCI HAR Dataset/test/y_test.txt", sep="", header = FALSE)
subject_test <- read.table("./data/UCI HAR Dataset/test/subject_test.txt", sep="", header = FALSE)
Test <- data.frame(subject_test, Y_test, X_test)

#Load train data and create train data set

X_train <- read.table("./data/UCI HAR Dataset/train/X_train.txt", sep="", header = FALSE)
Y_train <- read.table("./data/UCI HAR Dataset/train/y_train.txt", sep="", header = FALSE)
subject_train <- read.table("./data/UCI HAR Dataset/train/subject_train.txt", sep="", header = FALSE)
Train <- data.frame(subject_train, Y_train, X_train)
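The test and training blocks differ only in the subset name, so the loading could be wrapped in one function. The sketch below is one possible way to do that; load_subset is a hypothetical helper, and it assumes the label file is named with a lowercase y (y_test.txt, y_train.txt), which should be checked against the unzipped folder.

```r
# Hypothetical helper: read the subject, activity and measurement files for
# one subset ("test" or "train") and bind them column-wise.
load_subset <- function(base_dir, subset) {
  read_part <- function(prefix) {
    path <- file.path(base_dir, subset, paste0(prefix, "_", subset, ".txt"))
    read.table(path, header = FALSE)
  }
  # Column order matches the original script: subject, activity, measurements
  data.frame(read_part("subject"), read_part("y"), read_part("X"))
}
```

Under these assumptions, Test and Train would be load_subset("./data/UCI HAR Dataset", "test") and load_subset("./data/UCI HAR Dataset", "train").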

Process the Data

Merge the 2 datasets and assign column names

Once the test and training datasets are loaded, we merge them into a single dataset using rbind(). This creates a unified dataset containing all observations. We then assign appropriate column names:

The first two columns are labeled “subject” and “activity”,

The remaining columns are named using the feature labels extracted from features.txt, which describe the sensor measurements.

# 1 - Merge test and train sets to create one data set

merged <- rbind(Train, Test)

#Assign column names

names(merged) <- c("subject", "activity", as.character(features$V2))
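Before relying on the assigned names, it can help to check whether any of them repeat, since repeated column names make a column ambiguous to select. The sketch below uses a toy vector in place of features$V2 (the repeated label is deliberate) and shows two base R functions for detecting and resolving repeats.

```r
# Toy stand-in for features$V2; the repeated label is deliberate
feature_names <- c("tBodyAcc-mean()-X", "fBodyAcc-bandsEnergy()-1,8",
                   "fBodyAcc-bandsEnergy()-1,8", "tBodyAcc-std()-X")

# anyDuplicated() returns 0 when all names are unique,
# otherwise the index of the first repeated element
anyDuplicated(feature_names)

# make.unique() appends .1, .2, ... to repeats so every name is distinct
unique_names <- make.unique(feature_names)
unique_names
```

The same two calls can be run on as.character(features$V2) to see whether the real feature list needs this treatment.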

Extract the measurements and apply activity labels

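We extract only the measurements that represent means and standard deviations, the variables of interest for this analysis, using dplyr::select() with contains("mean") and contains("std"). Because contains() ignores case by default, the selection also picks up the meanFreq() variables and the angle() variables whose names include “Mean”.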

Next, we replace the numeric activity codes with descriptive activity names (e.g., “WALKING”, “SITTING”) using the factor levels defined in activity_labels.txt. This improves the readability and interpretability of the dataset.

# 2 - Extract only the measurements on the mean and standard deviation for each measurement

tidy_1 <- merged %>% select(subject, activity, contains("mean"), contains("std"))

# 3 - Use descriptive activity names to name the activities in the data set

tidy_1$activity <- factor(tidy_1$activity, levels = activity_labels$V1, labels = activity_labels$V2)
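How many columns count as "mean" or "std" depends on how strictly the names are matched. The sketch below compares a loose, case-insensitive match (comparable to the contains() calls above) with a strict match on the literal mean() and std() suffixes. It uses base R's grepl() on a toy vector of names, not the selection method of the original script.

```r
# Toy column names in the style of features.txt
cols <- c("subject", "activity", "tBodyAcc-mean()-X", "tBodyAcc-std()-X",
          "fBodyAcc-meanFreq()-X", "angle(tBodyAccMean,gravity)")

# Loose: case-insensitive substring match on "mean" or "std"
loose <- grepl("mean|std", cols, ignore.case = TRUE)

# Strict: only names containing the literal "mean()" or "std()"
strict <- grepl("mean\\(\\)|std\\(\\)", cols)

cols[loose]   # four names, including the meanFreq() and angle() entries
cols[strict]  # only the mean() and std() entries
```

Which variant is appropriate depends on whether the meanFreq() and angle() variables are wanted in the tidy output.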

Clean and clarify variable names

To improve clarity and make the dataset more descriptive, we update the variable names using a series of gsub() transformations. These changes include:

Replacing abbreviations with full terms (e.g., “Acc” → “Accelerometer”, “Gyro” → “Gyroscope”),

Clarifying time and frequency domain signals (“t” → “Time”, “f” → “Frequency”),

Removing special characters like dashes and parentheses,

Standardizing naming conventions for mean, standard deviation, and frequency features.

These changes result in more readable and self-explanatory variable names throughout the dataset.

# 4 - Appropriately labels the data set with descriptive variable names
names(tidy_1) <- gsub("^t", "Time", names(tidy_1))
names(tidy_1) <- gsub("^f", "Frequency", names(tidy_1))
names(tidy_1) <- gsub("Acc", "Accelerometer", names(tidy_1))
names(tidy_1) <- gsub("Gyro", "Gyroscope", names(tidy_1))
names(tidy_1) <- gsub("Mag", "Magnitude", names(tidy_1))
names(tidy_1) <- gsub("BodyBody", "Body", names(tidy_1))
names(tidy_1) <- gsub("-mean\\(\\)", "Mean", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("-std\\(\\)", "STD", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("-meanFreq\\(\\)", "MeanFrequency", names(tidy_1), ignore.case = TRUE)
names(tidy_1) <- gsub("angle", "Angle", names(tidy_1))
names(tidy_1) <- gsub("gravity", "Gravity", names(tidy_1))

Final cleanup of variable names

To complete the renaming process, we remove any remaining special characters such as parentheses () and hyphens - from the variable names. This ensures the column names are clean, consistent, and free of formatting artifacts that could interfere with later analysis or function usage.

# Remove any remaining parentheses, commas, and hyphens
names(tidy_1) <- gsub("[(),-]", "", names(tidy_1))
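The renaming steps can also be collected into a single function, which makes them easy to try on a few sample names before applying them to the whole dataset. This is a sketch only: clean_names is a hypothetical helper, and its rule table covers a subset of the substitutions above.

```r
# Hypothetical helper bundling the substitutions into one reusable function
clean_names <- function(x) {
  # Names are regular expressions, values are their replacements
  rules <- c("^t" = "Time", "^f" = "Frequency",
             "Acc" = "Accelerometer", "Gyro" = "Gyroscope",
             "Mag" = "Magnitude", "BodyBody" = "Body",
             "-mean\\(\\)" = "Mean", "-std\\(\\)" = "STD",
             "[(),-]" = "")
  # Apply each rule in the order listed
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

clean_names(c("tBodyAcc-mean()-X", "fBodyBodyGyroMag-std()"))
# returns "TimeBodyAccelerometerMeanX" "FrequencyBodyGyroscopeMagnitudeSTD"
```

The order of the rules matters: the prefix rules run first, and the catch-all removal of leftover punctuation runs last.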

Create final tidy datasets with averages

In this step, we create a second, independent tidy dataset by calculating the average of each measurement variable for each activity and each subject. Using group_by() and summarise_all(), we aggregate the data so that each row holds the mean of every selected feature for one subject performing one activity, giving 180 rows (30 subjects × 6 activities). Because the function is passed as a named list, dplyr appends the suffix _mean to each summarized column name. The resulting dataset is ready for downstream analysis or reporting.

# 5 - Create a second, independent tidy data set with average of each variable for each activity and each subject
run_analysis <- tidy_1 %>% group_by(subject, activity) %>% summarise_all(list(mean = mean))
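To see what the grouping and averaging do, it helps to run them on a table small enough to check by hand. The sketch below uses base R's aggregate() rather than dplyr, so it works as an independent cross-check; the toy values are made up for illustration.

```r
# Toy data: two subjects, two activities, one measurement (values are made up)
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2, 2),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING", "SITTING", "SITTING"),
  TimeBodyAccelerometerMeanX = c(0.2, 0.4, 0.1, 0.3, 0.5, 0.7)
)

# One row per subject/activity pair, holding the mean of the measurement
toy_means <- aggregate(. ~ subject + activity, data = toy, FUN = mean)
toy_means
```

The result has four rows, one for each subject and activity combination. For subject 1 while WALKING, for example, the value is the mean of 0.2 and 0.4.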

And that’s how you tidy the Human Activity Recognition dataset in a few systematic steps. By downloading, merging, filtering, labeling, and summarizing the data, we transform raw sensor recordings into a clean, organized, analysis-ready format that is easier to explore and to use for building human activity recognition models.