It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.
The data can be downloaded from the course web site:
Dataset: Activity Monitoring Data
The variables included in this dataset are:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
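The observation count can be checked with simple arithmetic: a day contains 24 × 60 / 5 = 288 five-minute intervals, and October and November together have 31 + 30 = 61 days. A minimal sketch of that check (the variable names are illustrative and not part of the original analysis):

```r
# Number of 5-minute intervals in one day
intervals_per_day <- 24 * 60 / 5

# Days in October (31) plus days in November (30)
days <- 31 + 30

# Expected number of rows if every interval of every day is present
expected_rows <- intervals_per_day * days
expected_rows
## [1] 17568
```

Once the file is loaded, the same figure could be compared against `nrow(activity)`.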
activity <- read.csv("activity.csv")
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
activity$day <- weekdays(as.Date(activity$date))
activity$DateTime <- as.POSIXct(activity$date, format = "%Y-%m-%d")
clean <- activity[!is.na(activity$steps),]
head(clean)
## steps date interval day DateTime
## 289 0 2012-10-02 0 Tuesday 2012-10-02
## 290 0 2012-10-02 5 Tuesday 2012-10-02
## 291 0 2012-10-02 10 Tuesday 2012-10-02
## 292 0 2012-10-02 15 Tuesday 2012-10-02
## 293 0 2012-10-02 20 Tuesday 2012-10-02
## 294 0 2012-10-02 25 Tuesday 2012-10-02
The following R code calculates the total number of steps taken per day.
sumTable <- aggregate(activity$steps~activity$date, FUN = sum)
colnames(sumTable)<- c("Date","Steps")
head(sumTable)
## Date Steps
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
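The table above starts at 2012-10-02 even though the raw data begin on 2012-10-01. This is consistent with the formula interface of `aggregate`, which by default drops rows containing NA before grouping, so a day with no recorded steps disappears from the result instead of showing a total of 0. The sketch below illustrates the difference on a made-up data frame (`toy` and its values are hypothetical):

```r
# Toy data: day "a" is entirely missing, day "b" has two readings
toy <- data.frame(
  date  = c("a", "a", "b", "b"),
  steps = c(NA, NA, 10, 20)
)

# Formula interface: NA rows are removed first (na.action = na.omit),
# so day "a" does not appear in the result at all
by_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps day "a" but reports a total of 0
by_tapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

by_formula
by_tapply
```

Which behavior is preferable depends on the question: dropping all-NA days keeps artificial zero totals out of the mean and median, which appears to be the approach taken here.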
The following R code plots a histogram of the total number of steps per day.
hist(sumTable$Steps, breaks = 5,xlab = "Steps",main = "Total Steps Per Day")
The mean and median are two common summaries of the center of a distribution.
The following R code calculates the mean and median of the total number of steps per day.
as.integer(mean(sumTable$Steps))
## [1] 10766
as.integer(median(sumTable$Steps))
## [1] 10765
clean <- activity[!is.na(activity$steps),]
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
## Create Average number of steps per Interval
intervalTable <- ddply(clean, .(interval), summarize, mean = mean(steps))
## Create Line Plot of average number of steps per interval
p <- ggplot(intervalTable, aes(x = interval, y = mean))
p + geom_line() + xlab("Interval") + ylab("Average Number of Steps") + ggtitle("Average Number of Steps Per Interval")
maxsteps<- max(intervalTable$mean)
## Interval that contains the maximum number of steps
intervalTable[intervalTable$mean == maxsteps,1]
## [1] 835
### The 5-minute interval with the highest average number of steps is interval 835
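The value 835 is an interval identifier, not a step count. Assuming the identifier encodes the clock time as HHMM (the values 0, 5, 10, ... and 835 are consistent with this, but the dataset description does not state it), it can be turned into a readable time. `interval_to_time` is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: convert an HHMM-style interval identifier to "HH:MM".
# Assumes identifiers such as 835 mean 08:35; this is not stated in the source.
interval_to_time <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))   # e.g. 835 -> "0835"
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_time(c(0, 835, 2355))
## [1] "00:00" "08:35" "23:55"
```

Under that assumption, the most active interval on average would be the one starting at 08:35.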
Calculation of Missing Values:
nrow(activity[is.na(activity$steps),])
## [1] 2304
The missing values are replaced with the average number of steps for the same 5-minute interval on the same day of the week. The median for that interval could be used instead.
avgTable<- ddply(clean,.(interval,day),summarize, mean = mean(steps))
nadata<- activity[is.na(activity$steps),]
newdata<- merge(nadata,avgTable, by = c("interval","day"))
### Creating a dataset equal to the original dataset but with the missing data filled in:
ndata3 <- newdata[, c("mean", "date", "interval", "day", "DateTime")]
colnames(ndata3)<- c("steps","date","interval","day","DateTime")
## Merging the dataset
merge <- rbind(clean, ndata3)
## Creating Sum of Steps per date to compare the data with step 1
sumTable2 <- aggregate(merge$steps~merge$date, FUN = sum)
colnames(sumTable2)<- c("Date","Steps")
as.integer(mean(sumTable2$Steps))
## [1] 10821
as.integer(median(sumTable2$Steps))
## [1] 11015
hist(sumTable2$Steps, breaks = 5,xlab = "Steps",main = "Total Steps",col = "Black")
hist(sumTable$Steps, breaks = 5,xlab = "Steps",main = "Total Steps",col = "Blue",add = TRUE)
legend("topright",c("Imputed Data","Non-NA Data"), fill = c("Black","Blue"))
The new mean of the imputed data is 10821 steps compared to the old mean of 10766 steps. That creates a difference of 55 steps on average per day.
The new median of the imputed data is 11015 steps compared to the old median of 10765 steps. That creates a difference of 250 steps for the median.
However, the overall shape of the distribution has not changed.
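The imputation strategy used above (fill each missing value with the mean of the observed values for the same interval and day of the week) can also be sketched without a merge. The version below is an alternative formulation on made-up data, not the code used for the results above; it relies on base R's `ave`, and the `toy` data frame and its values are hypothetical:

```r
# Toy data: one interval, three Monday rows (one missing) and one Tuesday row
toy <- data.frame(
  steps    = c(10, NA, 30, 40),
  interval = c(0, 0, 0, 0),
  day      = c("Monday", "Monday", "Monday", "Tuesday")
)

# Mean of the observed values within each (interval, day) group,
# repeated for every row of that group
group_mean <- ave(toy$steps, toy$interval, toy$day,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values stay untouched
toy$steps_filled <- ifelse(is.na(toy$steps), group_mean, toy$steps)
toy$steps_filled
## [1] 10 20 30 40
```

One practical difference from the merge-and-rbind approach is that the rows keep their original order, which may be convenient for later time-series plots.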
merge$DayCategory <- ifelse(merge$day %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
## Time-series plot of the 5-min intervals
library(lattice)
## Summarize data by interval and type of day
intervalTable2<- ddply(merge,.(interval,DayCategory),summarize, mean = mean(steps))
xyplot(mean~interval|DayCategory, data=intervalTable2, type="l", layout = c(1,2),
main="Average Steps per Interval Based on Type of Day",
ylab="Average Number of Steps", xlab="Interval")
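The weekday/weekend split above depends on `weekdays()` returning English day names, which is only true in an English locale. A minimal locale-independent sketch, assuming the dates can be parsed with `as.Date`, uses the numeric weekday stored in `POSIXlt` (the `dates` vector is illustrative):

```r
# Dates to classify: 2012-10-06 is a Saturday, 2012-10-08 is a Monday
dates <- as.Date(c("2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the weekday as an integer: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Saturday (6) and Sunday (0) count as weekend, everything else as weekday
day_category <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
day_category
## [1] "Weekend" "Weekend" "Weekday"
```

The same expression could be applied to the `date` column of the imputed dataset in place of the comparison against day names.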
Note that the echo = TRUE parameter was added to the code chunk to enable printing of the R code that generated the analysis and plot.