Coursera - Statistical Inference Assignment

This analysis is performed using RStudio and a given dataset of daily activity for October and November of 2012. The data record the number of steps taken in each 5-minute interval of each day. Some values are missing; these are handled by imputation.


Loading and Pre-Processing the data

To begin the analysis of the Activity dataset, we first set the working directory in the R environment and load the required R packages. With the knitr package, we use opts_knit$set so that the plots created are embedded in the output as base64-encoded images.

library(knitr)
# to base64 encode images
opts_knit$set(upload.fun = image_uri)


# Set the working directory in the R Environment .
#setwd("C:/Users/Leigh/Desktop/Coursera/Data and Scripts")

Now that R is ready, we can read in the “activity.csv” file and clean it up a bit.

Once the data structure looks correct, we can calculate the total number of steps taken per day using the aggregate function. A basic plot of the distribution of the daily totals is a quick way to check that the results look reasonable.

# Read in the activity dataset.
activity <- read.csv('C:/Users/20292/Desktop/activity.csv', header = TRUE, sep = ",", colClasses=c("numeric", "character", "numeric"))

# Clean up the data: Convert Date to the date class and interval to factor classes.
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
activity$interval <- as.factor(activity$interval) 
attach(activity)

# The structure of the dataframe is shown below. Check it looks correct. 
str(activity)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
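
As a quick sanity check on the structure above, 17568 rows is what we would expect from 61 days (October plus November) times 288 five-minute intervals. The sketch below rebuilds that grid on a synthetic frame called toy, which is an assumption for illustration and not the real activity data. It also assumes the interval labels are coded as HHMM without leading zeros, which matches the levels shown in the str() output.

```r
# Synthetic stand-in for the activity data: 61 days x 288 intervals.
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")

# Interval labels coded as HHMM without leading zeros: 0, 5, ..., 55, 100, 105, ...
labels <- as.vector(outer(seq(0, 55, by = 5), seq(0, 2300, by = 100), `+`))

toy <- expand.grid(interval = labels, date = days)

n_days <- length(unique(toy$date))
n_intervals <- length(unique(toy$interval))

# Every day should contribute exactly one row per interval.
stopifnot(nrow(toy) == n_days * n_intervals)
c(days = n_days, intervals = n_intervals, rows = nrow(toy))
```

The same stopifnot() check can be run against the real data frame after loading it.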

What is mean total number of steps taken per day?

1. Calculate the total number of steps per day and check the top of the dataset.

daily_steps <- aggregate(steps ~ date, data=activity, sum)
colnames(daily_steps) <- c("date","steps")
head(daily_steps)
##         date steps
## 1 2012-10-02   126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
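
Note that the table starts at 2012-10-02 even though the data begin on 2012-10-01. With the formula interface, aggregate drops rows containing NA before summing, so a day with no recorded steps disappears from the result. A minimal sketch on made-up data (the toy frame is an assumption, not part of the assignment data):

```r
# Toy data: the first day is entirely missing, the second has 30 steps in total.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)

# Formula interface: NA rows are omitted first, so the all-NA day is absent.
by_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps the all-NA day but reports it as 0 steps.
by_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

by_formula
by_tapply
```

Dropping the empty days, as the formula interface does, avoids counting them as days with zero steps, which would pull the mean and median down.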

2. Create a histogram of the Activity data using ggplot

library(ggplot2)
ggplot(daily_steps, aes(x = steps)) + geom_histogram(fill = "blue", binwidth = 1000) + labs(title="Histogram of Steps Taken per Day", x = "Number of Steps per Day", y = "Number of Days (count)") + theme_bw() 

3. After looking at the distribution and accepting that it looks reasonable, calculate the mean and median number of steps per day.

daily_steps <- aggregate(steps ~ date, activity, sum)
colnames(daily_steps) <- c("date","steps")
steps_mean <- mean(daily_steps$steps, na.rm=TRUE)
steps_median <- median(daily_steps$steps, na.rm=TRUE)

The daily mean number of steps is:

steps_mean
## [1] 10766.19

The daily median number of steps is:

steps_median
## [1] 10765

What is the average daily activity pattern?

Calculate the mean number of steps for each 5-minute interval, convert the intervals to integers, and save the result in a data frame.

steps_per_interval <- aggregate(activity$steps, by = list(interval = activity$interval), FUN=mean, na.rm=TRUE)

#Convert to integers
steps_per_interval$interval <-  as.integer(levels(steps_per_interval$interval)[steps_per_interval$interval])
colnames(steps_per_interval) <- c("interval", "steps")

#Time Series plot for average number of steps taken (averaged across all days) versus the 5-minute intervals:
ggplot(steps_per_interval, aes(x=interval, y=steps)) + geom_line(color="red", size=1) +  
  labs(title="Average Daily Activity Pattern", x="Interval", y="Number of steps") +  theme_bw()

#### Find the 5-minute interval containing the maximum number of steps:
max_interval <- steps_per_interval[which.max(steps_per_interval$steps),]
max_interval
##     interval    steps
## 104      835 206.1698
The interval labeled 835 (row 104, i.e. the interval starting at 08:35) has the highest average, about 206 steps.
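
The interval labels appear to encode clock time as HHMM, which is why label 835 sits in row 104 of the table. The helper below, interval_to_clock, is a hypothetical name introduced here for illustration; it pads a label to four digits and inserts a colon.

```r
# Hypothetical helper: turn an HHMM-coded interval label into "HH:MM".
interval_to_clock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))   # 835 -> "0835"
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_clock(c(0, 55, 100, 835))
# "00:00" "00:55" "01:00" "08:35"
```

Applied to max_interval$interval, this would report the peak as 08:35.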

Imputing missing values

1. Total number of missing values:

The total number of missing values in steps can be calculated by using the is.na() function to flag each missing value and then summing the resulting logical vector.

missing_values <- sum(is.na(activity$steps))
missing_values
## [1] 2304
The total number of missing values is 2304.
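
It can also help to see where the missing values fall. The sketch below counts NAs per date on a small made-up frame (toy is an assumption for illustration); the same two calls should work on activity.

```r
# Toy data: the first day is entirely missing, the second is complete.
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3)),
  steps = c(NA, NA, NA, 0, 6, 12)
)

# Number of missing step counts on each date.
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day

# Share of missing values overall (0.5 for this toy frame).
mean(is.na(toy$steps))
```

If every affected date in the real data shows 288 missing intervals, the gaps are whole days rather than scattered readings.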

2. Strategy for filling in all of the missing values in the dataset - IMPUTE

To populate missing values, we impute each one with the mean for the same interval across days. The median is often a more robust measure of centrality than the mean, but here the median of the daily totals (10765) is very close to their mean (10766.19), so the mean is a reasonable choice, and imputing with means will likely bring the two even closer together.

Create a function “na_fill” whose data argument is the activity data frame and whose pervalue argument is the steps_per_interval data frame.

na_fill <- function(data, pervalue) {
  na_index <- which(is.na(data$steps))
  na_replace <- unlist(lapply(na_index, FUN=function(idx){
    interval = data[idx,]$interval
    pervalue[pervalue$interval == interval,]$steps
  }))
  fill_steps <- data$steps
  fill_steps[na_index] <- na_replace
  fill_steps
}
rdata_fill <- data.frame(  
  steps = na_fill(activity, steps_per_interval),  
  date = activity$date,  
  interval = activity$interval)
str(rdata_fill)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : num  1.717 0.3396 0.1321 0.1509 0.0755 ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...

Check if there are any missing values remaining.

sum(is.na(rdata_fill$steps))
## [1] 0

This output shows no missing values (all have been imputed).
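
The na_fill function above looks up one missing row at a time with lapply. As an alternative sketch, the same lookup can be done in a single pass with match(). The name na_fill_vec and the toy data are assumptions introduced here; this is not the function used to produce the results in this report.

```r
# Toy data: two days, three intervals; day 1 is entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, NA, 0, 6, 12),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3)),
  interval = rep(c(0, 5, 10), times = 2)
)

# Mean steps per interval; NA rows are dropped by the formula interface.
per_interval <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Hypothetical vectorized variant of na_fill.
na_fill_vec <- function(data, pervalue) {
  # as.character() lets the lookup work whether interval is a factor or a number.
  idx <- match(as.character(data$interval), as.character(pervalue$interval))
  ifelse(is.na(data$steps), pervalue$steps[idx], data$steps)
}

filled <- na_fill_vec(toy, per_interval)
filled
```

On the toy frame the missing day is filled with the interval means of the observed day, so both days end up with the same values.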

3. A histogram of the total number of steps taken each day

Plot a histogram of the daily total number of steps taken, plotted with a bin interval of 1000 steps, after filling missing values.

fill_steps_per_day <- aggregate(steps ~ date, rdata_fill, sum)
colnames(fill_steps_per_day) <- c("date","steps")

#Create the Histogram for Frequency of Number of Steps Per Day . 
ggplot(fill_steps_per_day, aes(x = steps)) + 
  geom_histogram(fill = "green", binwidth = 1000) + 
  labs(title="Histogram of Steps Taken per Day", 
       x = "Number of Steps per Day", y = "Number of Days (count)") + theme_bw() 

#Calculate and report the mean and median total number of steps taken per day.
steps_mean_fill   <- mean(fill_steps_per_day$steps, na.rm=TRUE)
steps_median_fill   <- median(fill_steps_per_day$steps, na.rm=TRUE)

The mean number of steps per day is:

steps_mean_fill
## [1] 10766.19

The median number of steps per day is:

steps_median_fill
## [1] 10766.19

Do these values differ from the estimates from the first part of the assignment?

Only the median changes, and only slightly; the mean is unchanged:

Before filling the data

Mean : 10766.189

Median: 10765

After filling the data

Mean : 10766.189

Median: 10766.189

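After imputation the mean and median are equal. This is what we would expect if the missing values fall on whole days (2304 missing values is exactly 8 days of 288 intervals): each such day is filled with the interval means, so its total equals the overall daily mean, which leaves the mean unchanged and pulls the median onto it.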
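
The arithmetic behind this result can be checked on a toy example. The numbers below are made up, and the sketch assumes that every day is either fully observed or fully missing, in which case a day filled with interval means has a total equal to the mean of the observed daily totals.

```r
# Toy daily totals for the days that have data.
observed_totals <- c(100, 200, 600)
m <- mean(observed_totals)            # 300

# Each fully missing day, once filled with interval means, totals m.
# Append k such days (k is an arbitrary illustrative value).
k <- 2
filled_totals <- c(observed_totals, rep(m, k))

mean(filled_totals)       # 300: unchanged
median(observed_totals)   # 200
median(filled_totals)     # 300: the median moves onto the mean
```

Adding values equal to the mean cannot change the mean, but it does place several identical values in the middle of the sorted totals, which is where the median is read off.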

Are there differences in activity patterns between weekdays and weekends?

We make this comparison using the table with imputed values, as follows:
  1. Augment the table with a column that indicates the day of the week.
  2. Subset the table into two parts: weekends (Saturday and Sunday) and weekdays (Monday through Friday).
  3. Tabulate the average steps per interval for each data set.
  4. Plot the two data sets one above the other for comparison.

weekdays_steps <- function(data) {
  weekdays_steps <- aggregate(data$steps, by=list(interval = data$interval), FUN=mean, na.rm=T)
  weekdays_steps$interval <- as.integer(levels(weekdays_steps$interval)[weekdays_steps$interval])
  colnames(weekdays_steps) <- c("interval", "steps")
  weekdays_steps
}

data_by_weekdays <- function(data) {
  data$weekday <- as.factor(weekdays(data$date)) # weekdays
  weekend_data <- subset(data, weekday %in% c("Saturday","Sunday"))
  weekday_data <- subset(data, !weekday %in% c("Saturday","Sunday"))
  
  weekend_steps <- weekdays_steps(weekend_data)
  weekday_steps <- weekdays_steps(weekday_data)
  weekend_steps$dayofweek <- rep("Weekend", nrow(weekend_steps))
  weekday_steps$dayofweek <- rep("Weekday", nrow(weekday_steps))
  
  data_by_weekdays <- rbind(weekend_steps, weekday_steps)
  data_by_weekdays$dayofweek <- as.factor(data_by_weekdays$dayofweek)
  data_by_weekdays
}
data_weekdays <- data_by_weekdays(rdata_fill)
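
One caveat with the approach above: weekdays() returns day names in the current locale, so the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch uses the numeric day of the week from as.POSIXlt instead. The helper name day_type is hypothetical and is not part of the code that produced the plot below.

```r
# Hypothetical helper: label dates as "Weekend" or "Weekday" without day names.
day_type <- function(date) {
  wday <- as.POSIXlt(date)$wday        # 0 = Sunday, ..., 6 = Saturday
  factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
         levels = c("Weekday", "Weekend"))
}

# 2012-10-05 was a Friday, 2012-10-06 a Saturday,
# 2012-10-07 a Sunday and 2012-10-08 a Monday.
day_type(as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08")))
```

The resulting factor could be stored in a column and used directly for faceting.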

Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends:

ggplot(data_weekdays, aes(x=interval, y=steps)) + geom_line(color="violet") + 
  facet_wrap(~ dayofweek, nrow=2, ncol=1) + labs(x="Interval", y="Number of steps") +  theme_bw()

The weekday panel has the highest single peak of any interval. The weekend panel has more peaks above 100 steps, possibly because weekday activity mostly follows a work-related routine, with intense activity concentrated in the little free time available. Overall, weekend activity is spread more evenly across the day.