Coursera - Statistical Inference Assignment

This analysis is performed using RStudio and a given dataset of daily activity for October and November of 2012. The data contains the number of steps taken in each 5-minute interval of each day. Some values are missing; these are handled by imputation.


Loading and Pre-Processing the data

To begin the analysis of the Activity dataset, we first set the working directory in the R environment and load the required R packages.

# Set the working directory in the R Environment and load required R packages.

setwd("C:/Users/Leigh/Desktop/Coursera/Data and Scripts")
library(knitr)
library(data.table)
library(ggplot2)

Now that R is ready, we can read in the "activity.csv" file and clean it up a bit. Once the data structure looks correct, we calculate the total number of steps taken per day using the aggregate function, and from those totals the daily mean and median. A histogram of the daily totals is a quick way to check that the results look reasonable.

# Read in the activity dataset.
activity <- read.csv('activity.csv', header = TRUE, sep = ",", colClasses=c("numeric", "character", "numeric"))

# Clean up the data: convert date to the Date class and interval to a factor.
# Look at the structure of the data to check it is correct.
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
activity$interval <- as.factor(activity$interval) 

# The structure of the dataframe is shown below. 
str(activity)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
# What is the mean total number of steps taken per day?

# 1. Calculate the total number of steps per day and check the top of the dataset.
daily_steps <- aggregate(steps ~ date, data=activity, sum)
colnames(daily_steps) <- c("date","steps")
head(daily_steps)
##         date steps
## 1 2012-10-02   126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
# 2. Create a histogram of the Activity data using ggplot

library(ggplot2)
ggplot(daily_steps, aes(x = steps)) + geom_histogram(fill = "blue", binwidth = 1000) + labs(title="Histogram of Steps Taken per Day",  x = "Number of Steps per Day", y = "Number of Days (Count)") + theme_bw() 

# 3. After looking at the distribution, calculate the mean and median number of steps per day.
daily_steps <- aggregate(steps ~ date, activity, sum)
colnames(daily_steps) <- c("date","steps")
steps_mean <- mean(daily_steps$steps, na.rm=TRUE)
steps_median <- median(daily_steps$steps, na.rm=TRUE)


# The daily mean number of steps is:
steps_mean
## [1] 10766.19
# The daily median number of steps is:
steps_median
## [1] 10765


For the Activity dataset, the mean number of steps taken per day is 10766.19 and the median is 10765.

The histogram above is centered roughly between 10,000 and 14,000 steps, a range that includes the calculated mean, so the result looks reasonable.
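
One detail worth making explicit is how the formula interface of aggregate() treats missing values: by default it drops rows with NA before summing, so a day with no recorded steps disappears from the result. That would explain why the head(daily_steps) output starts on 2012-10-02 rather than 2012-10-01. The sketch below illustrates the behavior on a small made-up data frame; `toy`, `totals_formula`, and `totals_narm` are hypothetical names, not part of the original analysis.

```r
# Toy data (made up): the first day is entirely missing,
# the third day is partly missing.
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 5, NA),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# Formula interface: rows with NA are dropped before summing,
# so the all-NA day does not appear at all.
totals_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# By-list interface with na.rm = TRUE: every date is kept,
# and the all-NA day is reported as a total of 0.
totals_narm <- aggregate(toy$steps, by = list(date = toy$date),
                         FUN = sum, na.rm = TRUE)

nrow(totals_formula)  # 2 days
nrow(totals_narm)     # 3 days
```

With the by-list form, an all-NA day would be counted as a zero-step day and pull the mean down; the formula form used in the analysis avoids that.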

________________________________________________________________________________

What is the average daily activity pattern for the number of steps taken?

Use the aggregate function to average the steps within each 5-minute interval across all days, then plot the result. One way to summarize the daily activity pattern is to find the interval with the highest average number of steps.

library(ggplot2)

# Average the steps within each 5-minute interval, then convert the interval factor back to integer labels.
steps_per_interval <- aggregate(activity$steps, by = list(interval = activity$interval), FUN=mean, na.rm=TRUE)
steps_per_interval$interval <-  as.integer(levels(steps_per_interval$interval)[steps_per_interval$interval])
colnames(steps_per_interval) <- c("interval", "steps")

#______________________________________________________

# Time Series plot for average steps taken (averaged across all days) versus intervals:
ggplot(steps_per_interval, aes(x=interval, y=steps)) + geom_line(color="red", size=1) +  labs(title="Daily Activity Pattern for Average Number of Steps", x="5-Minute Interval", y="Number of Steps Taken") +  theme_bw()

# Find the 5-minute interval containing the maximum number of steps:
max_interval <- steps_per_interval[which.max(steps_per_interval$steps),]

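The 5-minute interval labeled 835 has the maximum average value: about 206 steps. Note that 835 is the interval's label, not its position; each day has only 288 intervals.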

The time series plot of average steps per interval shows the daily activity pattern. The curve has one main peak and several smaller peaks. It is skewed and does not resemble a common distribution such as the normal or Poisson.
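
The interval values are labels rather than running counts. Assuming they encode clock time as hours followed by minutes (so that 835 would mean 08:35, an assumption about the dataset that is not stated in the text), a label can be turned into a readable time as sketched below. `to_clock` and the toy data frame are hypothetical and only for illustration.

```r
# Convert an interval label such as 835 into "08:35".
# Assumption: labels encode time as HHMM (hours * 100 + minutes).
to_clock <- function(label) {
  label <- as.integer(label)
  sprintf("%02d:%02d", label %/% 100L, label %% 100L)
}

# Toy per-interval averages (made-up values, not the real results).
toy_avg <- data.frame(interval = c(830L, 835L, 840L),
                      steps    = c(150, 200, 180))

# Pick the row with the highest average, as in the analysis above,
# then express its label as a clock time.
peak      <- toy_avg[which.max(toy_avg$steps), ]
peak_time <- to_clock(peak$interval)
peak_time  # "08:35"
```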

________________________________________________________________________________

Imputing Missing values

To handle missing values in the Activity dataset, we can impute them using mean values at similar time intervals.

1. Total number of missing values:

The total number of missing values in steps can be calculated by using is.na() to flag each missing value and then summing the resulting logical vector.

missing_values <- sum(is.na(activity$steps))
missing_values
## [1] 2304

The total number of missing values in the Activity data set is 2,304. Since we have a large dataset, we can fill in the missing values with means.
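
It can also help to see how the missing values are spread across days, since imputation behaves differently when whole days are missing than when gaps are scattered. The sketch below counts NA values per day on a made-up data frame; `toy`, `na_per_day`, and `na_share` are hypothetical names.

```r
# Toy data (made up): two intervals per day over three days.
toy <- data.frame(
  steps = c(NA, NA, 3, 7, NA, 2),
  date  = rep(as.Date(c("2012-10-01", "2012-10-02", "2012-10-03")), each = 2)
)

# Number of missing step values for each date.
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)

# Share of all rows that are missing.
na_share <- mean(is.na(toy$steps))

na_per_day
na_share
```

In the Activity data, 2,304 is exactly 8 x 288, which would be consistent with eight fully missing days, although the total count alone does not confirm that.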

________________________________________________________________________________

2. Strategy for filling in all of the missing values - IMPUTE

To populate missing values in the dataset, we impute the values using the mean value at the same interval across all of the days.

Create a function that takes the Activity dataset (data) and a table of the mean number of steps per interval (pervalue). It replaces each missing value with the mean for its interval and returns the filled steps vector, giving us a complete dataset.

na_fill <- function(data, pervalue) {
  na_index <- which(is.na(data$steps))
  na_replace <- unlist(lapply(na_index, FUN=function(idx){
    interval = data[idx,]$interval
    pervalue[pervalue$interval == interval,]$steps
  }))
  fill_steps <- data$steps
  fill_steps[na_index] <- na_replace
  fill_steps
}

# Create a dataframe with the imputed values and check the structure. 
rdata_fill <- data.frame(steps = na_fill(activity, steps_per_interval), date = activity$date, interval = activity$interval)
str(rdata_fill)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : num  1.717 0.3396 0.1321 0.1509 0.0755 ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: Factor w/ 288 levels "0","5","10","15",..: 1 2 3 4 5 6 7 8 9 10 ...
# Check if there are any missing values remaining.
sum(is.na(rdata_fill$steps))
## [1] 0

The above output shows no missing values (all have been imputed).
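
The same lookup can be written without a per-row loop. The following is a minimal sketch of a vectorized variant, not the function used above; `na_fill_vec` is a hypothetical name and the toy inputs are made up. It assumes `pervalue` has one row per interval with columns `interval` and `steps`, as `steps_per_interval` does.

```r
na_fill_vec <- function(data, pervalue) {
  # Match each row's interval label to the row of pervalue holding its mean.
  # Comparing as character avoids factor-versus-integer surprises.
  idx <- match(as.character(data$interval), as.character(pervalue$interval))
  # Keep observed values; substitute the interval mean where steps is NA.
  ifelse(is.na(data$steps), pervalue$steps[idx], data$steps)
}

# Toy inputs (made up): two intervals, three observations each.
toy <- data.frame(steps    = c(NA, 4, 2, 8, NA, 6),
                  interval = factor(c(0, 5, 0, 5, 0, 5)))
toy_means <- data.frame(interval = c(0L, 5L), steps = c(2, 6))

filled <- na_fill_vec(toy, toy_means)
filled  # 2 4 2 8 2 6
```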

________________________________________________________________________________

Create a histogram of the total number of steps taken each day to show the distribution of the data.

# Plot a histogram of the daily total number of steps taken, using a bin interval of 1000 steps, after filling in the missing values.
fill_daily_steps <- aggregate(steps ~ date, rdata_fill, sum)
colnames(fill_daily_steps) <- c("date","steps")


# Create the Histogram for Frequency of Number of Steps Per Day . 
library(ggplot2)
ggplot(fill_daily_steps, aes(x = steps)) + geom_histogram(fill = "green", binwidth = 1000) +   labs(title="Histogram of Total Number of Steps Taken per Day", x = "Number of Steps per Day", y = "Number of Days (Count)") + theme_bw() 

# Calculate the mean total number of steps taken per day.

steps_mean_fill <- mean(fill_daily_steps$steps, na.rm=TRUE)
steps_mean_fill
## [1] 10766.19

The mean number of steps taken per day is 10766.189.

________________________________________________________________________________

Do these values differ from the estimates from the first part of the assignment?

No. The two means agree to the precision shown, so imputation did not change the estimate of the daily mean.

Mean number of steps per day before imputing the missing values:

Original Activity Data Mean : 10766.19

Mean number of steps per day after filling the missing values:

Imputed Activity Data Mean : 10766.189

The mean number of steps taken per day after imputation equals the mean computed from the original data alone.

This suggests that, for the daily mean in this dataset, imputing missing values with interval means is an acceptable way to handle the missing data.
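
The exact agreement of the two means is expected when the gaps are entire days: a day filled with interval means sums to the mean daily total, so adding it leaves the mean of the daily totals unchanged. The sketch below checks this on made-up data with one fully missing day. All names are hypothetical, and whether every gap in the Activity data is a whole day is an assumption not verified here.

```r
# Toy data (made up): three days, two intervals each; day 2 is all NA.
toy <- data.frame(
  steps    = c(10, 30, NA, NA, 20, 40),
  date     = rep(as.Date(c("2012-10-01", "2012-10-02", "2012-10-03")), each = 2),
  interval = factor(rep(c(0, 5), times = 3))
)

# Mean of daily totals using only observed days (the formula drops NA rows).
mean_before <- mean(aggregate(steps ~ date, data = toy, FUN = sum)$steps)

# Impute each NA with the mean of its interval across the observed days.
interval_means <- ave(toy$steps, toy$interval,
                      FUN = function(x) mean(x, na.rm = TRUE))
toy$steps_filled <- ifelse(is.na(toy$steps), interval_means, toy$steps)

# Mean of daily totals after imputation.
mean_after <- mean(
  aggregate(steps_filled ~ date, data = toy, FUN = sum)$steps_filled
)

c(before = mean_before, after = mean_after)  # both 50
```

If the gaps were scattered within days instead, the two means would generally differ.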

________________________________________________________________________________

Are there differences in activity patterns in the number of steps per day between weekdays and weekends?

We can perform this comparison on the table with imputed values by doing the following:

1. Augment the table with a column that indicates the day of the week.
2. Subset the table into two parts: weekends (Saturday and Sunday) and weekdays (Monday through Friday).
3. Tabulate the average steps per interval for each subset.
4. Plot the two subsets in a two-panel plot for comparison.

# Average the steps within each 5-minute interval and return a data frame
# with integer interval labels.
weekdays_steps <- function(data) {
  weekdays_steps <- aggregate(data$steps, by=list(interval = data$interval), FUN=mean, na.rm=T)
  weekdays_steps$interval <- as.integer(levels(weekdays_steps$interval)[weekdays_steps$interval])
  colnames(weekdays_steps) <- c("interval", "steps")
  weekdays_steps
}

#________________________________________________________________________________

# Split the data into weekend and weekday subsets, average each by interval,
# and stack the results with a dayofweek label.
# Note: weekdays() returns locale-dependent names; the comparison below
# assumes an English locale.
data_by_weekdays <- function(data) {
  data$weekday <- as.factor(weekdays(data$date))
  weekend_data <- subset(data, weekday %in% c("Saturday","Sunday"))
  weekday_data <- subset(data, !weekday %in% c("Saturday","Sunday"))
  weekend_steps <- weekdays_steps(weekend_data)
  weekday_steps <- weekdays_steps(weekday_data)
  weekend_steps$dayofweek <- rep("Weekend", nrow(weekend_steps))
  weekday_steps$dayofweek <- rep("Weekday", nrow(weekday_steps))
  data_by_weekdays <- rbind(weekend_steps, weekday_steps)
  data_by_weekdays$dayofweek <- as.factor(data_by_weekdays$dayofweek)
  data_by_weekdays
}
data_weekdays <- data_by_weekdays(rdata_fill)


# Create a Panel plot comparing the average number of steps taken per 5-minute interval 
# across weekdays and weekends:
ggplot(data_weekdays, aes(x=interval, y=steps)) + geom_line(color="violet") + facet_wrap(~ dayofweek, nrow=2, ncol=1) + labs(x="5-Minute Interval", y="Number of steps") +
  theme_bw()

Using the panel plot, we see that weekday activity has the highest single peak across all intervals. Weekend activity, however, has more peaks above a hundred steps than weekday activity. A possible explanation is that weekdays mostly follow a work-related routine, with intense activity concentrated in the little free time available. Overall, weekend activity is spread more evenly across the day.
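
The weekday/weekend split above compares against English day names, which depends on the session locale. A locale-independent alternative, sketched here under the assumption that the input is a Date vector, uses the numeric weekday from as.POSIXlt(). `day_type` is a hypothetical helper, not part of the original code.

```r
# Classify dates as "Weekend" or "Weekday" without relying on day names.
day_type <- function(date) {
  # $wday is 0 for Sunday through 6 for Saturday, regardless of locale.
  wday <- as.POSIXlt(date)$wday
  ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
}

# 2012-10-05 is a Friday, so the next two dates are Saturday and Sunday.
dates  <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
labels <- day_type(dates)
labels  # "Weekday" "Weekend" "Weekend" "Weekday"
```

The resulting labels could replace the name-based subsetting, for example by splitting the data on `day_type(data$date)`.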

________________________________________________________________________________

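Use the knitr package to create an HTML file.

library("knitr")
# The second argument of knit2html() is the output file path, not a URL.
# RPubs address for the report: http://rpubs.com/leigh_math/repo_research_assignment1
knit2html("Repo_research_activity_1.Rmd")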

PA1_template.md and PA1_template.html