Reproducible Research - Peer Assignment#1

Important Note: Kindly visit the below link for for updated report. The plots are available in the figures folder.
rpubs –> http://rpubs.com/bram/reproducibileresearch
github –> https://github.com/buva-datascience/RepData_PeerAssessment1

Loading and preprocessing the data

1. Show any code that is needed to Load the data (i.e. read.csv()) and Process/transform the data (if necessary) into a format suitable for your analysis

The data processing is done as following:
1. Source file “activity.zip” is downloaded if not present
2. Source data is unzipped using unz() function and read via the read.csv() function
3. Examine the dataframe using header() and str() functions

# load datatable package
library(data.table)     

# function to Download the file from the website location to the local directory
dwld_file <- function(fileurl, dest){
        if (!file.exists("data")) dir.create("data")                            # create a folder if it doesnt exist        
        
        if (!file.exists(dest)) {                                               # download the file if its not already downloaded
                download.file(fileurl, destfile = dest, method = "curl")
        }
}

# Set local destination folder
dest="./data/activity.zip"

# Call funtion to download from the url
dwld_file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", dest)

# List the files in zip
unzip(dest, list=T)

##           Name Length                Date
## 1 activity.csv 350829 2014-02-11 10:08:00

# set working directory
wd <- "~/Documents/Buva/Data Science/Data Science Course-John Hopkins/05-Reproducible Research/Peer Assessment#1"
setwd(wd)

# Set the directory of the file
dest="./data/activity.zip"                                      

# Unzip and read into a dataframe        
ActivityTable <- read.csv(unz(dest, filename="activity.csv"))   

# Examine dataframe
head(ActivityTable)

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

str(ActivityTable)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

What is mean total number of steps taken per day?

2. What is mean total number of steps taken per day? Ignore the missing values in the dataset and make a histogram of the total number of steps taken each day. Calculate and report the mean and median total number of steps taken per day

Ignore NAs by filtering out the NA values from steps column and draw histogram of total steps taken each day

# ignore NAs
Activities <- ActivityTable[!is.na(ActivityTable$steps),]      

# Convert date as factor
Activities$date <- factor(as.character(Activities$date))        

# Convert interval as factor
Activities$interval <- factor(Activities$interval)       

# Calculate mean steps taken each day using tapply()
meanSteps <- tapply(Activities$steps, Activities$date, mean, na.rm=T) 

# Caluculate the total steps taken each day using tapply()
totalSteps <- tapply(Activities$steps, Activities$date, sum, na.rm=T)

# Create a histogram for the total steps taken each day
hist(totalSteps, xlab="Total Steps taken each day", main="Histogram of the total number of steps taken each day", col=555 )

plot of chunk histogram

The Median and Mean of the total number of steps taken per day is calculated as below:

# Mean of the total number of steps taken per day 
mean(totalSteps)

## [1] 10766

# Median of the total number of steps taken per day 
median(totalSteps)

## [1] 10765

The Mean of the total steps taken each day is 10766 and Median of the total steps taken each day is 10765

What is the average daily activity pattern?

3. What is the average daily activity pattern? Make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis). Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

Calculate Average steps taken per the 5 min interval using tapply() function and draw plot

# Calculate mean avg steps taken per 5 min interval for all the days
timeInterval <- tapply(Activities$steps, Activities$interval, mean, na.rm=T)

# Draw a time series plot 
plot(row.names(timeInterval), timeInterval, type = "l", xlab = "5-min interval", 
     ylab = "Average steps taken across all Days", main = "Time Series of Avg. Steps taken per 5-minute interval", 
     col = "blue")

plot of chunk pattern

# Find the max. number of steps across the time intervals
TI <- aggregate( steps ~ interval, Activities, na.rm=TRUE, mean)

TI[which.max(TI$steps),]

##     interval steps
## 104      835 206.2

We can observe that the interval 835 has the maximum average number of steps: 206.1698. The same can also be observed from the time series plot above.

Imputing missing values

3. Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
Devise a strategy for filling in all of the missing values in the dataset. Create a new dataset that is equal to the original dataset but with the missing data filled in. Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day.

Identify the missing values and filter ONLY those rows to another dataframe

# Subset the NAs into a different dataframe
NAs <- ActivityTable[is.na(ActivityTable),]

# Find the number of NAs
nrow(NAs)

## [1] 2304

sum(is.na(ActivityTable))

## [1] 2304

There is a total of 2304 missing values in the Activity dataset.

For imputing the missing values, fill in all missing values with the mean step for the same 5-minute time interval. Then draw a histogram of total steps taken each day

# Convert date as factor
NAs$date <- factor(as.character(NAs$date))

# Fill in all missing values with the mean step for the same 5-minute time interval
for( i in 1:nrow(NAs)){
        for ( j in 1:nrow(TI)){
                # Assign the mean step value for the same 5-minute time interval for the missing value        
                if( NAs$interval[i] == TI$interval[j])  NAs$steps[i] <- TI$steps[j]
        }
}

# create a new dataframe with the filled in NA values which is equal to the original dataset
newActivity <- rbind(Activities, NAs)

# Calculate mean steps taken each day using tapply()
newmeanSteps <- tapply(newActivity$steps, newActivity$date, mean, na.rm=T)

# Caluculate the total steps taken each day using tapply()
newtotalSteps <- tapply(newActivity$steps, newActivity$date, sum, na.rm=T)

# Create a histogram for the total steps taken each day with fill ins instead of NAs
hist(newtotalSteps, xlab="Total Steps taken each day", main="Histogram of the total number of steps(Imputed) taken each day", col=999 )

plot of chunk impute

After filling the missing values, the Median and Mean of the total number of steps taken per day for the new dataset is calculated as below:

# Mean of the total number of steps taken per day 
mean(newtotalSteps)

## [1] 10766

# Median of the total number of steps taken per day 
median(newtotalSteps)

## [1] 10766

After filling the missing values, the Mean of the total steps taken each day for the new dataset is 10766 and Median of the total steps taken each day for the new dataset is 10766

Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing
missing data on the estimates of the total daily number of steps?

The Mean and Median from the previous estimate are 10766 and 10765 respectively.

Kindly note that due to the imputing of missing values, median value is impacted with a a little difference, however there is no change to the mean values.

Are there differences in activity patterns between weekdays and weekends?

For this part the weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part.
Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day. Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

Following are steps that are performed:
1. Create a new variable “dayType” to hold values “Weekend/Weekdays” based on the date value
2. Convert the dayType to a factor
3. Create a dataframe which holds the Avg. steps taken per 5 min interval for all weekends and weekdays
4. Convert interval from a factor to a numeric
5. Load ggplot2 package
6. Draw a Time Series qplot of Avg. Steps taken per 5-minute interval on all Weekdays and Weekends

# check for the type of days in the dataset
table(format(as.Date(newActivity$date),"%A"))

## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##      2592      2592      2304      2304      2592      2592      2592

# create a new variable dataType with weekends and weekdays as values
newActivity$dayType <- ifelse(weekdays(as.Date(newActivity$date)) %in%  c("Saturday", "Sunday"),'weekend','weekday')

# convert dayType to a factor
newActivity$dayType <- factor(newActivity$dayType)

# Calculate mean avg steps taken per 5 min interval/dayType for all the days
plotDF <- aggregate(steps ~ interval+dayType, newActivity, mean)

# convert interval factor to a numeric 
plotDF$interval <- as.numeric(as.character(plotDF$interval))

# load ggplot2
library(ggplot2)

# Draw a timeseries plot
qplot(x=interval, y=steps, data=plotDF, xlab = "5-min interval", 
      ylab = "Avg. steps taken across all WeekDays/Weekends", geom=c("line"),
      main = "Time Series- Avg. Steps per 5-min interval on all Weekdays/Weekends") + facet_wrap(~ dayType, ncol=1)

plot of chunk daytype

Free up the environment memory as we have completed the assignment !

rm(list=ls())