Reproducible Research: Peer Assessment 1

Loading necessary libraries

library(ggplot2)
library(dplyr)
library(lubridate)

Loading and preprocessing the data

In this step we load the data and convert the date variable to a proper date format.

data <- read.csv(file = "./Data/activity.csv", header = TRUE, sep = ",")
data$date <- ymd(data$date)
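A quick sanity check after loading can confirm that the date column parsed cleanly. The sketch below is illustrative only: it uses a few hypothetical rows in place of activity.csv, and base R's as.Date as a stand-in for ymd, assuming the dates are stored as yyyy-mm-dd strings.

```r
# Hypothetical rows standing in for ./Data/activity.csv (same column names).
csv_text <- "steps,date,interval
NA,2012-10-01,0
NA,2012-10-01,5
0,2012-10-02,0
117,2012-10-02,2210"

toy <- read.csv(text = csv_text, header = TRUE, sep = ",")

# Base-R stand-in for lubridate::ymd(); assumes yyyy-mm-dd strings.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Checks: the column is a Date and no value failed to parse.
date_is_parsed <- inherits(toy$date, "Date") && !anyNA(toy$date)
print(date_is_parsed)
str(toy)
```

The same two checks can be run on the real data frame right after the ymd call.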

Question one: what is the mean total number of steps taken per day?

For this part of the work, we ignore the missing values in the dataset.

Step 1) Calculating the total number of steps taken per day:

groupby_date <- group_by(data, date)
total_steps <- summarise(groupby_date, totalSteps = sum(steps,na.rm=TRUE))
head(total_steps)

## Source: local data frame [6 x 2]
## 
##         date totalSteps
## 1 2012-10-01          0
## 2 2012-10-02        126
## 3 2012-10-03      11352
## 4 2012-10-04      12116
## 5 2012-10-05      13294
## 6 2012-10-06      15420
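A total of 0, as reported for 2012-10-01, can mean either that no steps were taken or that every value for that day was missing, because sum() with na.rm=TRUE returns 0 for an all-NA group. The sketch below illustrates the difference on hypothetical toy data (the day labels "A" and "B" are made up), using base R's tapply rather than the dplyr calls above.

```r
# Hypothetical toy data: day "A" has no recorded values, day "B" does.
toy <- data.frame(
  date  = c("A", "A", "B", "B"),
  steps = c(NA, NA, 10, 20),
  stringsAsFactors = FALSE
)

# sum() over an all-NA group with na.rm = TRUE yields 0, not NA.
totals_with_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative: drop missing rows first, so all-NA days vanish from the result.
observed <- toy[!is.na(toy$steps), ]
totals_observed <- tapply(observed$steps, observed$date, sum)

print(totals_with_zero)
print(totals_observed)

# The zero for day "A" pulls the mean of daily totals down.
print(mean(totals_with_zero))
print(mean(totals_observed))
```

Which treatment is appropriate depends on the question being asked; the point is only that the two choices give different daily means.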

Step 2) Plotting a histogram of the total number of steps taken each day.

ggplot (total_steps, aes(totalSteps))+
        geom_histogram(binwidth=1000, alpha=.5, position="identity", fill="blue", col="blue")+
        ggtitle ("Histogram of total number of steps taken each day")+
        xlab("Total number of steps per day")+
        ylab("Number of days")


Step 3) Calculating the mean and median of total number of steps taken per day.

summary <- summarise(total_steps, mean=mean(totalSteps,na.rm = TRUE), median=median(totalSteps,na.rm = TRUE))
print(summary)

## Source: local data frame [1 x 2]
## 
##      mean median
## 1 9354.23  10395

Question two: what is the average daily activity pattern?

In this part of the work we analyze the average daily activity pattern.

Step 1) Plotting a time series of the 5-minute interval and the average number of steps taken, averaged across all days.

First, we calculate the average number of steps taken in each 5-minute interval across all days.

groupBy_interval <- group_by (data, interval)
average_steps<- summarise(groupBy_interval, averageSteps=mean(steps, na.rm=TRUE))

Then, we make a time series plot of the 5-minute intervals and average number of steps taken.

ggplot (average_steps, aes(interval, averageSteps))+
        geom_line(size=.7, position="identity",color="blue")+
        ggtitle ("Time series plot of the average number of steps taken")+
        xlab("5-minute intervals")+
        ylab(" Average Number of Steps Taken")


Step 2) Finding the 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps.

filter(average_steps,averageSteps==max(averageSteps) )

## Source: local data frame [1 x 2]
## 
##   interval averageSteps
## 1      835     206.1698

Therefore, interval 835 has the highest average number of steps taken, 206.1698.
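To read the result as a time of day, the sketch below assumes that interval labels encode clock time as hour*100 + minute, so that 835 would correspond to 08:35. That encoding is an assumption inferred from the labels, and interval_to_clock is a hypothetical helper, not part of the original analysis. The toy averages are made up; which.max is shown as a base-R alternative to the filter call.

```r
# Hypothetical helper: format an interval label such as 835 as "08:35",
# assuming labels are hour * 100 + minute.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

# Made-up averages standing in for average_steps.
toy_avg <- data.frame(interval = c(830, 835, 840),
                      averageSteps = c(10, 30, 20))

# which.max() returns the index of the first maximum.
peak <- toy_avg[which.max(toy_avg$averageSteps), ]
peak_clock <- interval_to_clock(peak$interval)
print(peak_clock)
```

Unlike the filter call, which.max returns a single row even when several intervals tie for the maximum.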

Question three: imputing missing values

In the original dataset there are a number of days/intervals with missing values. In this part of the work we impute the missing values using a simple procedure and compare the results with those from the original dataset.

Step 1) Calculating the total number of missing values in the dataset.

sum(is.na(data))

## [1] 2304

Step 2) Devising a strategy for filling in all of the missing values in the dataset. Each missing value is replaced with the mean of its 5-minute interval, averaged across all days. We create a new dataset called clean_data that is equal to the original dataset but with the missing data filled in.

clean_data <- data

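# For each row with a missing step count, look up the mean for that
# row's interval in average_steps and use it as the replacement.
for (i in seq_len(nrow(clean_data))){
        if(is.na(clean_data$steps[i])){
                rowSubset <- filter(average_steps, clean_data$interval[i]==interval )
                clean_data$steps[i] <- rowSubset$averageSteps
        }
}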

sum(is.na(clean_data))

## [1] 0
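The same interval-mean imputation can be written without an explicit loop. The following is a minimal sketch on hypothetical toy data, using base R's tapply and a named lookup instead of the row-by-row filter; it is an alternative formulation, not the code that produced the results here.

```r
# Hypothetical toy data with the same column names as the analysis.
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of the observed steps per interval (named by interval label).
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Look up each missing row's interval mean and fill only those entries.
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- interval_means[as.character(filled$interval[missing])]

print(filled)
```

On a dataset with thousands of rows, the vectorized lookup avoids one filter call per missing value.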

Step 3) Plotting a histogram of the total number of steps taken each day, this time using clean_data.

groupby_date <- group_by(clean_data, date)
total_steps <- summarise(groupby_date, totalSteps = sum(steps,na.rm=TRUE))

ggplot (total_steps, aes(totalSteps))+
        geom_histogram(binwidth=1000, alpha=.5, position="identity", fill="blue", col="blue")+
        ggtitle ("Histogram of total number of steps taken each day")+
        xlab("Total number of steps per day")+
        ylab("Number of days")


Now we calculate the mean and median total number of steps taken per day once again, this time for the clean_data dataset.

summary <- summarise(total_steps, mean=mean(totalSteps,na.rm = TRUE), median=median(totalSteps,na.rm = TRUE))
print(summary)

## Source: local data frame [1 x 2]
## 
##       mean   median
## 1 10766.19 10766.19

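It can be observed that the mean has increased from 9354.23 in the original dataset to 10766.19 in the clean dataset, and the median has also increased from 10395 to 10766.19. The fact that the mean and median now coincide is consistent with the missing values falling on whole days: each such day receives the same imputed total, the sum of the interval means, which pulls the median onto the mean.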
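The size of the change in the mean can be checked with a little arithmetic. The sketch below rests on two assumptions that the text does not state: that the data span 61 days, and that the missing values cover whole days of 288 five-minute intervals each. Under those assumptions, the original mean counted the fully missing days as zeros, and spreading the same total over the observed days only should land close to the new mean.

```r
# Reported values from the outputs above.
n_missing     <- 2304       # total NA count
mean_original <- 9354.23    # mean with all-NA days counted as 0

# Assumptions, not stated in the text: the data span 61 days, and
# missing values cover whole days of 288 five-minute intervals each.
intervals_per_day <- 24 * 12
n_days            <- 61

missing_days  <- n_missing / intervals_per_day   # days with no data
observed_days <- n_days - missing_days

# Rescale: the same total spread over the observed days only.
mean_observed_days <- mean_original * n_days / observed_days

print(missing_days)
print(round(mean_observed_days, 2))
```

If the printed mean matches 10766.19, the increase is explained by the zero-valued days in the original totals rather than by any change in the observed step counts.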

Question four: are there differences in activity patterns between weekdays and weekends?

In this part we explore whether there is a difference in activity patterns between weekdays and weekends.

Step 1) Creating a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

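# Classify by numeric day of week (1 = Sunday, 7 = Saturday with lubridate's
# default week start), which avoids depending on how day-name labels are spelled.
clean_data <- mutate(clean_data, weekday = ifelse(wday(date) %in% c(1, 7), "weekend", "weekday"))
clean_data$weekday <- factor(clean_data$weekday, levels = c("weekday", "weekend"))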
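Because day-name abbreviations can vary with the lubridate version and locale, it is worth confirming the classification on a few dates whose day of week is known. The sketch below uses a hypothetical helper, day_type, built on base R's as.POSIXlt (where wday runs from 0 for Sunday to 6 for Saturday); it is a check under those assumptions, not the code used in the analysis.

```r
# Hypothetical helper: classify dates without relying on day-name labels.
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday.
day_type <- function(dates) {
  is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
  factor(ifelse(is_weekend, "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# 2012-10-05 was a Friday; 2012-10-06 and 2012-10-07 were the weekend.
check_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
classified <- day_type(check_dates)
print(data.frame(date = check_dates, type = classified))
```

A table(clean_data$weekday, useNA = "ifany") call on the real data serves the same purpose: any NA count signals that some day labels did not match.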

Step 2) Making a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

groupBy_interval <- group_by (clean_data, interval, weekday)
average_steps<- summarise(groupBy_interval, averageSteps=mean(steps))

ggplot (average_steps, aes(interval, averageSteps))+
        geom_line(size=.7, position="identity",color="blue")+
        facet_grid(weekday ~ .)+
        ggtitle ("Time series plot of the average number of steps taken by weekday")+
        xlab("5-minute intervals")+
        ylab(" Average Number of Steps Taken")
