This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.
The data for this assignment can be downloaded from the course web site:
Dataset: [Activity monitoring data] [52K]
The variables included in this dataset are:
\(steps\): Number of steps taken in a 5-minute interval (missing values are coded as NA)
\(date\): The date on which the measurement was taken, in YYYY-MM-DD format
\(interval\): Identifier for the 5-minute interval in which the measurement was taken
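The interval identifier appears to be a clock time written as HHMM rather than a count of minutes: the largest value in the data is 2355, whereas minutes since midnight would stop at 1435. A minimal sketch of converting identifiers under that assumption (the helper names `interval_to_clock` and `interval_to_minutes` are hypothetical, not part of the assignment):

```r
# Assumption: interval identifiers are clock times written as HHMM
# (0, 5, ..., 55, 100, 105, ..., 2355), not minutes since midnight.
interval_to_clock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))  # 835 becomes "0835"
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

# Minutes elapsed since midnight for the same identifier
interval_to_minutes <- function(interval) {
  (interval %/% 100) * 60 + interval %% 100
}

clock   <- interval_to_clock(c(0, 835, 2355))
minutes <- interval_to_minutes(c(0, 835, 2355))
print(clock)
print(minutes)
```

If the assumption holds, the gaps in the identifier (for example between 55 and 100) are not gaps in time, which matters when the identifier is used directly as an x-axis.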
The dataset is stored in a comma-separated-value (CSV) file and contains 17,568 observations in total.
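The observation count follows from the collection period and the sampling interval. A quick arithmetic check, assuming every day in October and November 2012 contributes a full set of 5-minute intervals:

```r
# Every calendar day from 1 October to 30 November 2012
all_days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")
n_days <- length(all_days)

# Number of 5-minute intervals in a 24-hour day
intervals_per_day <- (24 * 60) / 5

# Expected number of rows in the dataset
expected_rows <- n_days * intervals_per_day
print(expected_rows)
```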
Reading in the dataset and/or processing the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
activity <- read.csv("/home/ubuntu/R/activity.csv")
summary(activity)
##      steps               date           interval
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8
##  Median :  0.00   2012-10-03:  288   Median :1177.5
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0
##  NA's   :2304     (Other)   :15840
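The summary reports 2304 missing values in steps, which is about 13.1% of the 17,568 rows and exactly 8 x 288, the size of eight full days. Whether the NAs really cover whole days can be checked by counting them per date. A sketch on hypothetical toy data (the `toy` data frame stands in for `activity`):

```r
# Hypothetical toy data standing in for the activity data frame
toy <- data.frame(
  steps = c(NA, NA, 0, 10, 5, NA),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# Total count and share of missing step values
n_missing     <- sum(is.na(toy$steps))
share_missing <- mean(is.na(toy$steps))

# Missing values per day: shows whether NAs fill whole days or are scattered
missing_by_day <- tapply(is.na(toy$steps), toy$date, sum)
print(missing_by_day)
```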
Processing the data
activity$day <- weekdays(as.Date(activity$date))
sumTable <- aggregate(activity$steps ~ activity$date, FUN=sum)
colnames(sumTable)<- c("Date", "Steps")
hist(sumTable$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day",col="grey")
## Mean for the number of steps each day
round(mean(sumTable$Steps))
## [1] 10766
## Median for the number of steps each day
round(median(sumTable$Steps))
## [1] 10765
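One detail behind these figures: the formula interface of aggregate removes rows with NA before summing, so a day with only missing values is left out of the daily totals rather than counted as zero. The mean and median above are therefore taken over days with recorded data. A sketch on hypothetical toy data that contrasts this with a tapply call that keeps every day:

```r
# Hypothetical toy data: the first day has only missing values
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 2),
  steps = c(NA, NA, 100, 50)
)

# Formula interface: rows with NA are removed first (na.action = na.omit),
# so the all-NA day does not appear in the result at all
with_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps every day; the all-NA day sums to 0
with_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(with_formula)
print(with_tapply)
```

Counting the missing days as zero would pull the mean and median down, so the two approaches answer different questions.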
avg_steps <- aggregate(activity$steps ~ activity$interval, FUN=mean)
colnames(avg_steps)<- c("interval","Average_Steps")
plot(avg_steps$interval,avg_steps$Average_Steps, type="l", xlab="Interval", ylab="Number of Steps",main="Average Number of Steps per Day by Interval")
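## Truncate the averages to whole steps (as.integer drops the fractional part)
avg_steps$Average_Steps <- as.integer(avg_steps$Average_Steps)
## Find the interval with the highest average; a variable named "max" would mask base::max
max_avg <- max(avg_steps$Average_Steps)
max_step <- avg_steps[avg_steps$Average_Steps == max_avg, 1]
Interval 835 has the highest average number of steps: 206 after truncation to a whole number.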
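The peak interval can also be located with which.max, which returns the row index of the largest value and does not require truncating the averages first. A sketch on hypothetical toy values (not the real averages):

```r
# Hypothetical toy averages, one row per interval
toy_avg <- data.frame(
  interval      = c(0, 5, 10),
  Average_Steps = c(1.5, 9.25, 4)
)

# which.max gives the position of the first maximum
peak <- toy_avg[which.max(toy_avg$Average_Steps), ]
print(peak)
```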
nadata <- activity[is.na(activity$steps),]
## Number of rows having missing values
nrow(nadata)
## [1] 2304
## Merge the activity data with the average steps per interval
newdata<-merge(activity, avg_steps, by= "interval")
Zeros were imputed for 2012-10-01 because it was the first day: with interval averages its total would have been over 9,000 steps higher than that of the following day, which had only 126 steps. Its NAs were therefore assumed to be zeros, to fit the rising trend of the data.
newdata[as.character(newdata$date) == "2012-10-01", "steps"] <- 0
## arrange date wise
newdata <- arrange(newdata, date)
## Replace the remaining NAs with the average steps for the same interval
na_rows <- is.na(newdata$steps)
newdata$steps[na_rows] <- newdata$Average_Steps[na_rows]
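When missing values are filled from a lookup column, the replacement has to be subset with the same logical index as the target. Otherwise R takes values from the top of the column, which both triggers a length warning and assigns averages from the wrong rows. A sketch on hypothetical toy data:

```r
# Hypothetical toy data: Average_Steps holds the per-interval mean
# (3 for interval 0, 8 for interval 5)
toy <- data.frame(
  interval      = c(0, 5, 0, 5),
  steps         = c(3, NA, NA, 8),
  Average_Steps = c(3, 8, 3, 8)
)

na_rows <- is.na(toy$steps)

# Index the replacement with the same logical vector so that each NA
# receives the average from its own row. Assigning the whole column
# instead would use its first two values (3, 8) and give 3, 3, 8, 8.
toy$steps[na_rows] <- toy$Average_Steps[na_rows]
print(toy$steps)
```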
newdata2 <- aggregate(newdata$steps ~ newdata$date, FUN=sum)
colnames(newdata2)<- c("date", "Steps")
hist(newdata2$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day", col="black")
hist(sumTable$Steps, breaks=5, xlab="Steps", main = "Total Steps per Day", col = "grey", add = TRUE)
legend("topright", c("Imputed", "Non-imputed"), col=c("black", "grey"), lwd=10)
newdata$DayCategory <- ifelse(newdata$day %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
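The weekday names returned by weekdays() depend on the session locale, so matching against "Saturday" and "Sunday" assumes an English locale. A sketch of a locale-independent alternative using the numeric weekday stored in POSIXlt (the sample dates are chosen only for illustration):

```r
# Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the weekday as 0 (Sunday) through 6 (Saturday),
# independent of the session locale
wday <- as.POSIXlt(dates)$wday

day_category <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
print(day_category)
```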
The plot compares the number of steps on weekdays and weekends. Weekdays show a higher peak earlier in the day, while weekends show more activity overall.
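## Average the steps per interval within each day category so the lines show averages, as the title states
avg_by_type <- aggregate(steps ~ interval + DayCategory, data = newdata, FUN = mean)
## Name the plot object something other than "plot" to avoid masking the base function
steps_plot <- ggplot(avg_by_type, aes(x = interval, y = steps, color = DayCategory)) +
  geom_line() +
  labs(title = "Avg. Daily Steps by Weektype", x = "Interval", y = "No. of Steps") +
  facet_wrap(~DayCategory, ncol = 1, nrow = 2)
print(steps_plot)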