Step Count Tracker

Synopsis

This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval each day.

The data for this assignment can be downloaded from the course web site:

Dataset: [Activity monitoring data] [52K]

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)

date: The date on which the measurement was taken, in YYYY-MM-DD format

interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and contains a total of 17,568 observations.
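The observation count is consistent with one record per 5-minute interval over the two months. A quick check, assuming the series runs from 2012-10-01 through 2012-11-30 (the end date is an assumption; the description only names the months):

```r
## Assumed date range: every day of October and November 2012
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")
## Number of 5-minute intervals in a 24-hour day
intervals_per_day <- 24 * 60 / 5
n_days <- length(days)
expected_rows <- n_days * intervals_per_day
print(c(days = n_days, intervals = intervals_per_day, rows = expected_rows))
```

Under that assumption, 61 days of 288 intervals each give 17,568 rows, matching the stated total.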

Loading the data

Reading in the dataset and/or processing the data

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
activity <- read.csv("/home/ubuntu/R/activity.csv")
summary(activity)

##      steps                date          interval     
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0  
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8  
##  Median :  0.00   2012-10-03:  288   Median :1177.5  
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5  
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2  
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0  
##  NA's   :2304     (Other)   :15840

Processing the data

activity$day <- weekdays(as.Date(activity$date))

Calculate the total number of steps taken per day

## Sum steps per day; the formula interface drops rows where steps is NA
sumTable <- aggregate(steps ~ date, data = activity, FUN = sum)
colnames(sumTable) <- c("Date", "Steps")
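Because the formula interface of aggregate() uses na.action = na.omit by default, a day whose intervals are all NA does not appear in the result at all, rather than appearing with a total of zero. A small sketch on made-up data (the toy data frame and its values are hypothetical):

```r
## Hypothetical toy data: the first day is entirely missing
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 10, 0, 5)
)
## Formula interface: rows with NA steps are dropped before summing
by_formula <- aggregate(steps ~ date, data = toy, FUN = sum)
## tapply with na.rm = TRUE keeps the all-NA day, with a total of 0
by_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
print(by_formula)
print(by_tapply)
```

In the analysis above, any day with no recorded values is therefore excluded from sumTable, and the mean and median are taken over the remaining days only.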

Histogram of the total number of steps taken each day

hist(sumTable$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day",col="grey")

Calculate the mean and median for the number of steps taken each day

## Mean for the number of steps each day
round(mean(sumTable$Steps))

## [1] 10766

## Median for the number of steps each day
round(median(sumTable$Steps))

## [1] 10765

The average number of steps taken in each 5-minute interval

avg_steps <- aggregate(activity$steps ~ activity$interval, FUN=mean)
colnames(avg_steps)<- c("interval","Average_Steps")

Time series plot

plot(avg_steps$interval,avg_steps$Average_Steps, type="l", xlab="Interval", ylab="Number of Steps",main="Average Number of Steps per Day by Interval")
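The interval identifiers run from 0 to 2355, which suggests they encode the clock time as HHMM rather than counting minutes. Under that assumption (it is not stated in the dataset description), plotting against the raw identifier leaves a gap between, for example, 55 and 100. A minimal sketch of a conversion to minutes since midnight, using a hypothetical helper interval_to_minutes:

```r
## Hypothetical helper: convert an HHMM-style identifier to minutes since midnight
interval_to_minutes <- function(interval) {
  hours   <- interval %/% 100   # integer division gives the hour
  minutes <- interval %% 100    # remainder gives the minute within the hour
  hours * 60 + minutes
}
ids <- c(0, 55, 100, 835, 2355)
print(interval_to_minutes(ids))
```

With this conversion the intervals are evenly spaced 5 minutes apart, so the x-axis of a time series plot becomes a true time axis.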

The 5-minute interval with the maximum average number of steps

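## Truncate the averages to whole steps (this overwrites the column)
avg_steps$Average_Steps <- as.integer(avg_steps$Average_Steps)
## Store the maximum under a name that does not mask base::max
max_avg <- max(avg_steps$Average_Steps)
max_step <- avg_steps[avg_steps$Average_Steps == max_avg, 1]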

Interval 835 has the highest average, about 206 steps.
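The peak can also be located with which.max(), which returns the index of the first maximum and avoids comparing against a stored value. The sketch below uses made-up averages (only the 835/206 pair comes from the analysis above) and formats the identifier as a clock time, assuming it encodes HHMM:

```r
## Hypothetical averages; only interval 835 with 206 steps is taken from the text
toy_avg <- data.frame(
  interval      = c(830, 835, 840),
  Average_Steps = c(177, 206, 195)
)
## Row holding the largest average
peak <- toy_avg[which.max(toy_avg$Average_Steps), ]
## Zero-pad to four digits, then insert a colon between hours and minutes
hhmm <- sprintf("%04d", as.integer(peak$interval))
clock <- paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))
print(peak)
print(clock)
```

If the HHMM reading is right, interval 835 corresponds to the five minutes starting at 08:35.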

Imputing missing values and comparing imputed to non-imputed data

nadata <- activity[is.na(activity$steps),]
## Number of rows having missing values
nrow(nadata)

## [1] 2304

## Merge the activity data with the interval averages
newdata<-merge(activity, avg_steps, by= "interval")

Zeros were imputed for 2012-10-01, the first day of the series. Filling that day with interval averages would have put its total more than 9,000 steps above the following day, which had only 126 steps, so its NAs were set to zero instead to fit the rising trend of the data.

## Set every interval on the first day to zero (see the note above)
newdata[as.character(newdata$date) == "2012-10-01", "steps"] <- 0
## Arrange date-wise
newdata <- arrange(newdata, date)
## Replace each remaining NA with the average for its own interval;
## indexing both sides with the same mask keeps the rows aligned
na_rows <- is.na(newdata$steps)
newdata$steps[na_rows] <- newdata$Average_Steps[na_rows]
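Filling each missing value with the average for its own interval requires the replacement vector to be subset with the same logical mask as the target. A minimal sketch on a hypothetical toy data frame:

```r
## Hypothetical toy data after merging in the per-interval averages
toy <- data.frame(
  interval      = c(0, 5, 10, 0, 5, 10),
  steps         = c(NA, 4, NA, 2, NA, 6),
  Average_Steps = c(1, 3, 7, 1, 3, 7)
)
na_rows <- is.na(toy$steps)
## Both sides use the same mask, so row i receives the average from row i
toy$steps[na_rows] <- toy$Average_Steps[na_rows]
print(toy$steps)
```

Assigning the full Average_Steps column instead would take values from the start of the column in order, so an NA could receive the average of a different interval.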

The total number of steps taken each day after missing values are imputed

newdata2 <- aggregate(newdata$steps ~ newdata$date, FUN=sum)
colnames(newdata2)<- c("date", "Steps")

Histogram of the total number of steps taken each day after missing values are imputed

hist(newdata2$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day", col="black")
hist(sumTable$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day",col = "grey",add = TRUE)
legend("topright", c("Imputed", "Non-imputed"), col=c("black", "grey"), lwd=10)

newdata$DayCategory <- ifelse(newdata$day %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
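weekdays() returns day names in the current locale, so the comparison against "Saturday" and "Sunday" only works as intended in an English locale. A sketch of a locale-independent alternative, using the wday field of as.POSIXlt() (0 is Sunday, 6 is Saturday); the helper name day_category is hypothetical:

```r
## Hypothetical helper: classify dates without relying on localized day names
day_category <- function(dates) {
  wday <- as.POSIXlt(as.Date(dates))$wday  # 0 = Sunday, ..., 6 = Saturday
  ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
}
## A Friday, Saturday, Sunday, and Monday from the study period
categories <- day_category(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
print(categories)
```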

Plot the differences in activity patterns between weekdays and weekends

The plot below compares the number of steps on weekdays and weekends. Weekdays show a higher peak earlier in the day, while weekends show more activity overall.

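## stat_summary() averages steps across days within each interval,
## so the lines match the "Avg." in the title
weekPlot <- ggplot(newdata, aes(x = interval, y = steps, color = DayCategory)) +
       stat_summary(fun = mean, geom = "line") +
       labs(title = "Avg. Daily Steps by Weektype", x = "Interval", y = "No. of Steps") +
       facet_wrap(~DayCategory, ncol = 1, nrow = 2)
print(weekPlot)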
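Averages per interval and day category can also be computed explicitly before plotting, which makes the plotted values easy to inspect. A minimal sketch on hypothetical toy data, using the same column names as newdata:

```r
## Hypothetical toy data with the same columns as newdata
toy <- data.frame(
  interval    = c(0, 0, 5, 5, 0, 5),
  steps       = c(2, 4, 10, 20, 6, 30),
  DayCategory = c("Weekday", "Weekday", "Weekday", "Weekday", "Weekend", "Weekend")
)
## One row per interval and day category, holding the mean number of steps
avg_by_type <- aggregate(steps ~ interval + DayCategory, data = toy, FUN = mean)
print(avg_by_type)
```

A table like avg_by_type could then be passed to ggplot() with geom_line() and the same facet_wrap() call, giving one averaged line per panel.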