This analysis is performed using RStudio on a provided dataset of daily activity for October and November of 2012. The data contain the number of steps taken in each 5-minute interval of each day. Some values are missing; these are handled by imputation.
To begin the analysis of the Activity dataset, we first need to set the correct working directory in the R environment and load the required R packages.
# Set the working directory in the R Environment and load required R packages.
setwd("C:/Users/Leigh/Desktop/Coursera/Data and Scripts")
library(knitr)
library(data.table)
library(ggplot2)
Now that R is ready, we can read in the “activity.csv” file and clean it up a bit. Once the data structure is confirmed to be correct, we calculate the total number of steps taken per day using the aggregate function, and from those totals the daily mean and median. A histogram of the daily totals gives a quick check that the results look reasonable.
# Read in the activity dataset.
activity <- read.csv('activity.csv', header = TRUE, sep = ",", colClasses=c("numeric", "character", "numeric"))
# Clean up the data: Convert Date to the date class and interval to factor classes.
# Look at the structure of the data to check it is correct.
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
activity$interval <- as.factor(activity$interval)
# The structure of the dataframe is shown below.
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
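As a quick sanity check on the structure above, the row count and the number of interval levels can be derived from the sampling scheme. This is a sketch that assumes the data span 2012-10-01 through 2012-11-30 inclusive with one row per interval; the endpoint dates and variable names are assumptions for illustration.

```r
# Assumption: one row per 5-minute interval for every day in
# October and November 2012 (the endpoint dates below are assumed).
intervals_per_day <- (24 * 60) / 5
n_days <- as.integer(as.Date("2012-11-30") - as.Date("2012-10-01")) + 1L
expected_rows <- intervals_per_day * n_days

intervals_per_day  # 288, matching the factor levels reported by str()
n_days             # 61
expected_rows      # 17568, matching the number of observations
```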
# What is mean total number of steps taken per day?
# 1. Calculate the total number of steps per day and check the top of the dataset.
daily_steps <- aggregate(steps ~ date, data=activity, sum)
colnames(daily_steps) <- c("date","steps")
head(daily_steps)
## date steps
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
# 2. Create a histogram of the Activity data using ggplot
library(ggplot2)
ggplot(daily_steps, aes(x = steps)) + geom_histogram(fill = "blue", binwidth = 1000) + labs(title="Histogram of Steps Taken per Day", x = "Number of Steps per Day", y = "Number of times in a day(Count)") + theme_bw()
# 3. After looking at the distribution, calculate the mean and median number of steps per day.
daily_steps <- aggregate(steps ~ date, activity, sum)
colnames(daily_steps) <- c("date","steps")
steps_mean <- mean(daily_steps$steps, na.rm=TRUE)
steps_median <- median(daily_steps$steps, na.rm=TRUE)
# The daily mean number of steps is:
steps_mean
## [1] 10766.19
# The daily median number of steps is:
steps_median
## [1] 10765
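Note that the first row of daily_steps is 2012-10-02 rather than 2012-10-01. The formula interface of aggregate() omits rows containing NA before summing (its default na.action is na.omit), so a day with no recorded steps is dropped instead of appearing as a zero. A minimal sketch on made-up data (the toy data frame is hypothetical):

```r
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 30, 40),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2))
)

# Formula interface: NA rows are omitted, so 2012-10-01 is dropped entirely.
daily_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps the all-NA day, but as a total of 0.
daily_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

nrow(daily_formula)        # 2 days
mean(daily_formula$steps)  # 50
mean(daily_tapply)         # about 33.3, pulled down by the spurious zero
```

Dropping the empty days is what keeps the mean and median above from being biased toward zero.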
Use the aggregate function to average the data by 5-minute interval to describe the daily activity, then plot the result. One way to summarize the daily activity pattern is to find the time interval with the maximum average number of steps.
library(ggplot2)
# Create a data frame after calculating the aggregate steps by 5-minute intervals and converting them to integers.
steps_per_interval <- aggregate(activity$steps, by = list(interval = activity$interval), FUN=mean, na.rm=TRUE)
steps_per_interval$interval <- as.integer(levels(steps_per_interval$interval)[steps_per_interval$interval])
colnames(steps_per_interval) <- c("interval", "steps")
#______________________________________________________
# Time Series plot for average steps taken (averaged across all days) versus intervals:
ggplot(steps_per_interval, aes(x=interval, y=steps)) + geom_line(color="red", size=1) + labs(title="Daily Activity Pattern for Average Number of Steps", x="5-Minute Interval", y="Number of Steps Taken") + theme_bw()
# Find the 5-minute interval containing the maximum number of steps:
max_interval <- steps_per_interval[which.max(steps_per_interval$steps),]
The plot of the average number of steps per interval shows the daily activity pattern. The curve has one main peak and several smaller peaks. Its shape is skewed and does not resemble a commonly known distribution (like a normal curve or Poisson distribution).
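The row stored in max_interval can be printed to read off the busiest interval. The sketch below uses made-up averages in place of steps_per_interval, and it assumes the interval identifiers encode clock time as HHMM (so 835 would mean 08:35). That encoding is an assumption about the dataset, and the helper interval_to_clock is hypothetical.

```r
# Made-up per-interval averages standing in for steps_per_interval.
toy_intervals <- data.frame(
  interval = c(0L, 5L, 830L, 835L, 840L),
  steps    = c(2, 1, 150, 200, 180)
)

# Assumption: identifiers are HHMM, e.g. 835 -> "08:35".
interval_to_clock <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

# which.max() returns the index of the first maximum.
peak <- toy_intervals[which.max(toy_intervals$steps), ]
peak$interval                     # 835
interval_to_clock(peak$interval)  # "08:35"
```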
To handle missing values in the Activity dataset, we can impute them using the mean values for the same time intervals.
The total number of missing values in steps can be calculated by using is.na() to flag each missing value and then summing the resulting logical vector.
missing_values <- sum(is.na(activity$steps))
missing_values
## [1] 2304
The total number of missing values in the Activity data set is 2,304. Since we have a large dataset, we can fill in the missing values with means.
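To put the count in context, the share of missing values follows from the row count reported by str(). The figures 2304, 17568 and 288 are taken from the output above; the variable names are illustrative.

```r
n_missing <- 2304
n_rows    <- 17568
intervals_per_day <- 288  # number of interval levels reported by str()

share_missing <- n_missing / n_rows
round(100 * share_missing, 1)   # 13.1 (percent of all rows)

# The missing rows amount to exactly 8 days' worth of intervals. Whether they
# actually fall on 8 complete days is an assumption this count cannot confirm.
n_missing / intervals_per_day   # 8
```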
To populate missing values in the dataset, we impute the values using the mean value at the same interval across all of the days.
Create a function that takes the Activity dataset (data) and the mean number of steps per interval (pervalue). It returns the steps column with each missing value replaced by the mean for its interval, so that we have a full dataset.
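na_fill <- function(data, pervalue) {
  # Row indices of the missing step counts.
  na_index <- which(is.na(data$steps))
  # For each missing row, look up the mean steps for its interval.
  na_replace <- unlist(lapply(na_index, FUN = function(idx) {
    interval <- data[idx, ]$interval
    pervalue[pervalue$interval == interval, ]$steps
  }))
  # Copy the steps column and overwrite only the missing entries.
  fill_steps <- data$steps
  fill_steps[na_index] <- na_replace
  fill_steps
}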
# Create a dataframe with the imputed values and check the structure.
rdata_fill <- data.frame(steps = na_fill(activity, steps_per_interval), date = activity$date, interval = activity$interval)
str(rdata_fill)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
# Check if there are any missing values remaining.
sum(is.na(rdata_fill$steps))
## [1] 0
The above output shows no missing values (all have been imputed).
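na_fill looks up one interval mean per missing row. As an alternative sketch (not the method used above), the same replacement can be written with ave(), which applies a function within each interval group. The toy data and the name fill_with_interval_mean are hypothetical.

```r
toy <- data.frame(
  steps    = c(NA, 4, 10, 6, NA, 20),
  interval = factor(c(0, 5, 0, 5, 0, 5))
)

# Replace each NA with the mean of the non-missing values in its interval.
fill_with_interval_mean <- function(steps, interval) {
  ave(steps, interval, FUN = function(x) {
    replace(x, is.na(x), mean(x, na.rm = TRUE))
  })
}

toy$steps_filled <- fill_with_interval_mean(toy$steps, toy$interval)
toy$steps_filled              # 10  4 10  6 10 20
sum(is.na(toy$steps_filled))  # 0
```

Here interval 0 has a single observed value (10), so both of its missing entries become 10, while interval 5 has no missing values and is left unchanged.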
# Plot a histogram of the daily total number of steps taken, using a bin interval of 1000 steps, after filling in the missing values.
fill_daily_steps <- aggregate(steps ~ date, rdata_fill, sum)
colnames(fill_daily_steps) <- c("date","steps")
# Create the Histogram for Frequency of Number of Steps Per Day .
library(ggplot2)
ggplot(fill_daily_steps, aes(x = steps)) + geom_histogram(fill = "green", binwidth = 1000) + labs(title="Histogram of Total Number of Steps Taken per Day", x = "Number of Steps per Day", y = "Number of Times Per Day (count)") + theme_bw()
# Calculate the mean total number of steps taken per day.
steps_mean_fill <- mean(fill_daily_steps$steps, na.rm=TRUE)
steps_mean_fill
## [1] 10766.19
The mean number of steps taken per day after imputation is 10766.189.
The original and imputed means agree to the printed precision, so imputation leaves the mean essentially unchanged.
Original Activity Data Mean : 10766.19
Imputed Activity Data Mean : 10766.189
The mean number of steps taken per day computed from the imputed data equals the mean computed from the original data alone.
This suggests that, for this dataset, filling missing values with interval means does not distort the estimate of the daily mean.
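One possible explanation for the matching means, offered as a sketch under assumptions: if the missing values cover whole days (an assumption, consistent with 2012-10-01 being absent from daily_steps), each imputed day receives the sum of the interval means as its total, which equals the mean daily total of the observed days. Adding days that sit exactly at the mean leaves the mean unchanged. The toy data below are made up to illustrate this.

```r
# Three days, two intervals each; the first day is entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 40),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean daily total from observed days only (d2 = 30, d3 = 70).
observed_totals <- aggregate(steps ~ date, data = toy, FUN = sum)
mean_before <- mean(observed_totals$steps)

# Impute by interval mean: interval 0 -> 20, interval 5 -> 30.
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- interval_means[as.character(filled$interval[missing])]

# d1 now totals 20 + 30 = 50, which is exactly the observed mean.
filled_totals <- aggregate(steps ~ date, data = filled, FUN = sum)
mean_after <- mean(filled_totals$steps)

c(before = mean_before, after = mean_after)  # both 50
```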
To compare activity on weekdays and weekends, we use the table with filled-in missing values and do the following:
1. Augment the table with a column that indicates the day of the week.
2. Subset the table into two parts: weekends (Saturday and Sunday) and weekdays (Monday through Friday).
3. Tabulate the average steps per interval for each data set.
4. Plot the two data sets in a two-panel plot for comparison.
# Average the steps for each 5-minute interval and return the interval as an integer.
weekdays_steps <- function(data) {
  weekdays_steps <- aggregate(data$steps, by = list(interval = data$interval), FUN = mean, na.rm = TRUE)
  weekdays_steps$interval <- as.integer(levels(weekdays_steps$interval)[weekdays_steps$interval])
  colnames(weekdays_steps) <- c("interval", "steps")
  weekdays_steps
}
#________________________________________________________________________________
# Split the data into weekend and weekday subsets and average each by interval.
# Note: weekdays() returns day names in the current locale; the comparison below assumes English names.
data_by_weekdays <- function(data) {
  data$weekday <- as.factor(weekdays(data$date)) # weekdays
  weekend_data <- subset(data, weekday %in% c("Saturday", "Sunday"))
  weekday_data <- subset(data, !weekday %in% c("Saturday", "Sunday"))
  weekend_steps <- weekdays_steps(weekend_data)
  weekday_steps <- weekdays_steps(weekday_data)
  weekend_steps$dayofweek <- rep("Weekend", nrow(weekend_steps))
  weekday_steps$dayofweek <- rep("Weekday", nrow(weekday_steps))
  data_by_weekdays <- rbind(weekend_steps, weekday_steps)
  data_by_weekdays$dayofweek <- as.factor(data_by_weekdays$dayofweek)
  data_by_weekdays
}
data_weekdays <- data_by_weekdays(rdata_fill)
# Create a Panel plot comparing the average number of steps taken per 5-minute interval
# across weekdays and weekends:
ggplot(data_weekdays, aes(x=interval, y=steps)) + geom_line(color="violet") + facet_wrap(~ dayofweek, nrow=2, ncol=1) + labs(x="5-Minute Interval", y="Number of steps") +
theme_bw()
In the panel plot, the weekday series has the single highest peak of all intervals. The weekend series has more peaks above one hundred steps than the weekday series, possibly because weekday activity mostly follows a work-related routine, with more intense activity concentrated in the little free time available. Overall, weekend activity is spread more evenly across the day.
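The weekday/weekend split above matches on the English names "Saturday" and "Sunday", which weekdays() only returns in an English locale. As a hedged alternative sketch, the day of the week can be taken from as.POSIXlt(), whose wday field is numeric (0 for Sunday through 6 for Saturday) regardless of locale. The helper name day_type is hypothetical.

```r
# Hypothetical helper: classify dates without relying on day names.
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday in any locale.
day_type <- function(dates) {
  wday <- as.POSIXlt(dates)$wday
  factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
         levels = c("Weekday", "Weekend"))
}

# 2012-10-05 is a Friday, so the next two dates are Saturday and Sunday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
as.character(day_type(dates))  # "Weekday" "Weekend" "Weekend" "Weekday"
```

The resulting factor could replace the dayofweek column built in data_by_weekdays, for example as a grouping variable in aggregate().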
library("knitr")
knit2html("Repo_research_activity_1.Rmd", "http://rpubs.com/leigh_math/repo_research_assignment1")
PA1_template.md and PA1_template.html