This analysis uses RStudio and a dataset of daily activity recorded during October and November 2012. The data contain the number of steps taken in each 5-minute interval of every day. Some values are missing; these are handled later by imputation.
To begin the analysis of the activity dataset, we first set the correct working directory in the R environment and load the required R packages. With the knitr package loaded, opts_knit$set is used so that the plots are base64-encoded into the output.
library(knitr)
# to base64 encode images
opts_knit$set(upload.fun = image_uri)
# Set the working directory in the R environment.
#setwd("C:/Users/Leigh/Desktop/Coursera/Data and Scripts")
Now that R is ready, we can read in the “activity.csv” file and clean it up a bit.
If the data structure is correct, we then calculate the total number of steps taken per day using the aggregate function. A basic histogram of the daily totals is a quick way to check that the results look reasonable.
# Read in the activity dataset.
activity <- read.csv('C:/Users/20292/Desktop/activity.csv', header = TRUE, sep = ",", colClasses=c("numeric", "character", "numeric"))
# Clean up the data: convert date to the Date class and interval to a factor.
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
activity$interval <- as.factor(activity$interval)
attach(activity)
# The structure of the dataframe is shown below. Check it looks correct.
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
daily_steps <- aggregate(steps ~ date, data=activity, sum)
colnames(daily_steps) <- c("date","steps")
head(daily_steps)
## date steps
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
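Note that the first row above is 2012-10-02, although the raw data start on 2012-10-01 with missing step counts. A likely explanation is that the formula interface of aggregate drops rows containing NA before summing (its default na.action is na.omit), so a day with no recorded steps does not appear in the result at all. The sketch below illustrates this on a made-up data frame; toy and toy_totals are hypothetical names and the values are invented for illustration.

```r
# Hypothetical toy data: day 1 is entirely missing, day 3 is partly missing.
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 5, NA),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# The formula method removes NA rows first, then sums what is left per date.
toy_totals <- aggregate(steps ~ date, data = toy, FUN = sum)
print(toy_totals)
# 2012-10-01 is absent; 2012-10-03 is summed over its one observed value.
```

A consequence is that days without data are excluded from the histogram and from the mean and median, rather than being counted as zero-step days.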
library(ggplot2)
ggplot(daily_steps, aes(x = steps)) + geom_histogram(fill = "blue", binwidth = 1000) + labs(title="Histogram of Steps Taken per Day", x = "Number of Steps per Day", y = "Number of Days (Count)") + theme_bw()
steps_mean <- mean(daily_steps$steps, na.rm=TRUE)
steps_median <- median(daily_steps$steps, na.rm=TRUE)
The daily mean number of steps is:
steps_mean
## [1] 10766.19
The daily median number of steps is:
steps_median
## [1] 10765
steps_per_interval <- aggregate(activity$steps, by = list(interval = activity$interval), FUN=mean, na.rm=TRUE)
#Convert to integers
steps_per_interval$interval <- as.integer(levels(steps_per_interval$interval)[steps_per_interval$interval])
colnames(steps_per_interval) <- c("interval", "steps")
#Time Series plot for average number of steps taken (averaged across all days) versus the 5-minute intervals:
ggplot(steps_per_interval, aes(x=interval, y=steps)) + geom_line(color="red", size=1) +
labs(title="Average Daily Activity Pattern", x="Interval", y="Number of steps") + theme_bw()
#### Find the 5-minute interval containing the maximum average number of steps:
max_interval <- steps_per_interval[which.max(steps_per_interval$steps),]
max_interval
## interval steps
## 104 835 206.1698
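The interval label 835 is easier to interpret as a time of day. The labels appear to encode hours and minutes as HHMM: row 104 is the 104th 5-minute interval, which starts 103 * 5 = 515 minutes after midnight, i.e. at 08:35. Assuming that encoding holds throughout, a small helper can convert labels to clock times; interval_to_clock is a hypothetical name introduced here, not part of the original analysis.

```r
# Convert an HHMM-style interval label (e.g. 835) to "HH:MM".
# Assumes the label encodes hours * 100 + minutes.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 835, 2355))
```

Under this reading, the most active 5-minute interval on an average day begins at 08:35 in the morning.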
The total number of missing values in steps can be calculated by using is.na() to flag each missing value and then summing the resulting logical vector.
missing_values <- sum(is.na(activity$steps))
missing_values
## [1] 2304
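Because TRUE counts as 1 and FALSE as 0, the same logical vector also gives the share of missing values through mean(). The sketch below shows both calculations on an invented vector; toy_steps is a hypothetical name.

```r
# Invented step counts with three missing entries out of eight.
toy_steps <- c(NA, 3, NA, 7, 0, NA, 12, 5)

n_missing     <- sum(is.na(toy_steps))   # count of missing values
share_missing <- mean(is.na(toy_steps))  # proportion of missing values

n_missing
share_missing
```

Applied to the figures in this report, 2304 missing values out of 17568 observations is roughly 13% of the data.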
To populate the missing values, we impute each one with the mean for the same 5-minute interval across all days. The median is often a more robust measure of central tendency than the mean, but here the overall median is very close to the overall mean, so imputing with the mean is reasonable and will probably bring the mean and median of the filled data together.
Create a function “na_fill” whose data argument is the activity data frame and whose pervalue argument is the steps_per_interval data frame.
na_fill <- function(data, pervalue) {
na_index <- which(is.na(data$steps))
na_replace <- unlist(lapply(na_index, FUN=function(idx){
interval = data[idx,]$interval
pervalue[pervalue$interval == interval,]$steps
}))
fill_steps <- data$steps
fill_steps[na_index] <- na_replace
fill_steps
}
rdata_fill <- data.frame(
steps = na_fill(activity, steps_per_interval),
date = activity$date,
interval = activity$interval)
str(rdata_fill)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
Check if there are any missing values remaining.
sum(is.na(rdata_fill$steps))
## [1] 0
This output shows no missing values (all have been imputed).
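The row-by-row lookup in na_fill can also be written without an explicit loop. The following is a minimal sketch of one alternative, not the method used above: it builds a per-interval table of means and uses match() to look up the replacement for every row at once. The names toy, per_interval, lookup and filled are hypothetical and the data are invented.

```r
# Invented data: two intervals observed on three days, with two gaps.
toy <- data.frame(
  steps    = c(NA, 4, 10, NA, 2, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval (the formula method ignores NA rows).
per_interval <- aggregate(steps ~ interval, data = toy, FUN = mean)

# For every row, find the mean that belongs to its interval.
lookup <- per_interval$steps[match(toy$interval, per_interval$interval)]

# Keep observed values; substitute the interval mean where steps is NA.
filled <- ifelse(is.na(toy$steps), lookup, toy$steps)
filled
```

On a data frame of this size either approach is fast enough; the vectorised form mainly avoids subsetting the data frame once per missing value.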
Plot a histogram of the daily total number of steps taken, plotted with a bin interval of 1000 steps, after filling missing values.
fill_steps_per_day <- aggregate(steps ~ date, rdata_fill, sum)
colnames(fill_steps_per_day) <- c("date","steps")
# Create the histogram of the number of steps per day.
ggplot(fill_steps_per_day, aes(x = steps)) +
geom_histogram(fill = "green", binwidth = 1000) +
labs(title="Histogram of Steps Taken per Day",
x = "Number of Steps per Day", y = "Number of Days (Count)") + theme_bw()
#Calculate and report the mean and median total number of steps taken per day.
steps_mean_fill <- mean(fill_steps_per_day$steps, na.rm=TRUE)
steps_median_fill <- median(fill_steps_per_day$steps, na.rm=TRUE)
The mean number of steps per day is:
steps_mean_fill
## [1] 10766.19
The median number of steps per day is:
steps_median_fill
## [1] 10766.19
These values differ slightly from the estimates made before imputation, but not substantially:
Before imputation:
Mean : 10766.189
Median: 10765
After imputation:
Mean : 10766.189
Median: 10766.189
After filling in the missing values, the mean is unchanged and the median now equals the mean.
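One way to see why the mean stays put while the median moves onto it: if the missing values fall on whole days (an assumption here, not something checked above), each imputed day receives the sum of the interval means, which is exactly the mean of the observed daily totals. Adding days at the mean leaves the mean unchanged and pulls the median toward it. The sketch below demonstrates this on invented data; all names and numbers are hypothetical.

```r
# Invented data: three complete days and one entirely missing day (d4).
toy <- data.frame(
  day      = rep(c("d1", "d2", "d3", "d4"), each = 2),
  interval = rep(c(0, 5), times = 4),
  steps    = c(10, 20, 30, 40, 20, 30, NA, NA)
)

# Daily totals before imputation; d4 is NA and is excluded from the mean.
observed_totals <- tapply(toy$steps, toy$day, sum)
mean_before <- mean(observed_totals, na.rm = TRUE)

# Impute each NA with the mean of its interval across the observed days.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

# Daily totals after imputation: d4 now equals the pre-imputation mean.
filled_totals <- tapply(filled, toy$day, sum)
c(mean_before  = mean_before,
  mean_after   = mean(filled_totals),
  median_after = median(filled_totals))
```

If some days were only partly missing, the imputed totals would no longer sit exactly at the mean and this equality would hold only approximately.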
weekdays_steps <- function(data) {
weekdays_steps <- aggregate(data$steps, by=list(interval = data$interval), FUN=mean, na.rm=T)
weekdays_steps$interval <- as.integer(levels(weekdays_steps$interval)[weekdays_steps$interval])
colnames(weekdays_steps) <- c("interval", "steps")
weekdays_steps
}
data_by_weekdays <- function(data) {
data$weekday <- as.factor(weekdays(data$date)) # weekdays
weekend_data <- subset(data, weekday %in% c("Saturday","Sunday"))
weekday_data <- subset(data, !weekday %in% c("Saturday","Sunday"))
weekend_steps <- weekdays_steps(weekend_data)
weekday_steps <- weekdays_steps(weekday_data)
weekend_steps$dayofweek <- rep("Weekend", nrow(weekend_steps))
weekday_steps$dayofweek <- rep("Weekday", nrow(weekday_steps))
data_by_weekdays <- rbind(weekend_steps, weekday_steps)
data_by_weekdays$dayofweek <- as.factor(data_by_weekdays$dayofweek)
data_by_weekdays
}
data_weekdays <- data_by_weekdays(rdata_fill)
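One caveat about the weekend test above: weekdays() returns day names in the language of the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A possible locale-independent alternative, sketched below as an assumption rather than as the method used in this report, relies on the numeric wday field of POSIXlt (0 is Sunday, 6 is Saturday). The names dates, wday and day_type are hypothetical.

```r
# Four consecutive dates from the study period: Friday through Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric (0 = Sunday, ..., 6 = Saturday) and does not depend on locale.
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))
day_type
```

The resulting factor could replace the name-based comparison inside data_by_weekdays without changing the rest of the function.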
Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends:
ggplot(data_weekdays, aes(x=interval, y=steps)) + geom_line(color="violet") +
facet_wrap(~ dayofweek, nrow=2, ncol=1) + labs(x="Interval", y="Number of steps") + theme_bw()
The weekday pattern contains the single highest peak of any interval. Weekend activity, however, has more peaks above a hundred steps than weekday activity, possibly because weekdays mostly follow a work-related routine in which more intense activity is concentrated in what little free time there is. Overall, activity on weekends is spread more evenly across the day.