Reproducible Research on Health Activity Monitoring Data

Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them.

This analysis uses data from a personal activity monitoring device that records data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) from an anonymous individual and gives the number of steps taken in each 5-minute interval of each day.

Four Questions to be answered

1- What is the mean total number of steps taken per day?
2- What is the average daily activity pattern?
3- Devise a strategy to impute the missing values and compare the mean and median with the result before.
4- Are there differences in activity patterns between weekdays and weekends?

Loading and preprocessing the data

Data Source

library("plyr")
setwd("E:/Cousera-Data Science/Reproducible Research/CourseProject1/repdata-data-activity")
activity<-read.csv("activity.csv")

Data Exploration

head(activity)

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

The data frame ‘activity’ consists of 17568 observations of 3 variables:
1- steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
2- date: The date on which the measurement was taken, in YYYY-MM-DD format
3- interval: Identifier for the 5-minute interval in which the measurement was taken

str(activity)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
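
The str() output above shows the date column stored as a factor rather than as a date. As a minimal sketch, assuming the same three-column layout, the column types can be fixed when the file is read. The inline text below is a made-up stand-in for activity.csv, and the names csv_text and sample_activity are hypothetical.

```r
# Toy CSV text standing in for activity.csv (values are illustrative only).
csv_text <- "steps,date,interval
NA,2012-10-01,0
NA,2012-10-01,5
12,2012-10-02,0
7,2012-10-02,5"

# colClasses sets the type of each column at read time,
# so the date column arrives as a Date instead of a factor.
sample_activity <- read.csv(
  text = csv_text,
  colClasses = c("integer", "Date", "integer")
)

str(sample_activity)
```

With a real file, the same colClasses argument can be passed to read.csv("activity.csv", ...), which makes the later as.Date() conversion unnecessary.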

What is mean total number of steps taken per day?

In this part of the analysis, the missing values in the dataset are ignored.
1- Calculate the total number of steps taken per day and make a histogram of the total number of steps taken each day
2- Calculate and report the mean and median of the total number of steps taken per day

## Ignore the missing values in the dataset
activity1 <- activity[!is.na(activity$steps), ]
## Total number of steps for each day (ddply names the result column V1)
steps_num <- ddply(activity1, .(date), function(x) sum(x$steps))
## Histogram of the total number of steps taken each day
hist(steps_num$V1, main = "Histogram of the total number of steps taken each day", xlab = "the total number of steps taken each day", xlim = c(0, 25000), ylim = c(0, 20), breaks = 10, col = "yellow")

## Obtain the median and mean of the total number of steps taken each day
allinfo <- summary(steps_num$V1)
median_steps <- allinfo["Median"]
mean_steps <- allinfo["Mean"]
## The label order matches the value order: median first, then mean
report1 <- paste("The median and mean total number of steps taken per day are", median_steps, "and", mean_steps)
report1

## [1] "The median and mean total number of steps taken per day are 10760 and 10770"
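
The two values reported above look rounded to four significant digits, which is how older versions of summary() returned their results. As a rough sketch on made-up data, the daily totals and their unrounded mean and median can also be computed with base R alone. The data frame toy and the names daily, mean_daily and median_daily are hypothetical.

```r
# Toy data: two intervals per day over four days (illustrative values only).
toy <- data.frame(
  steps    = c(10, 20, NA, NA, 30, 50, 5, 5),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03", "2012-10-04"),
                 each = 2),
  interval = rep(c(0, 5), times = 4)
)

# The formula interface of aggregate() drops rows whose steps are NA,
# so a day with no recorded steps is omitted rather than counted as zero.
daily <- aggregate(steps ~ date, data = toy, FUN = sum)

# mean() and median() return unrounded values.
mean_daily   <- mean(daily$steps)
median_daily <- median(daily$steps)
```

Dropping the all-NA day matches the approach above of removing missing rows before summing. Counting that day as zero would pull both statistics down.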

What is the average daily activity pattern?

1- Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
2- Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

## The average daily activity pattern: mean steps per interval across all days
pattern <- ddply(activity1, .(interval), function(x) mean(x$steps))
plot(pattern, xlab = "5-minute interval", ylab = "Average number of steps", main = "The Average Daily Activity Pattern", type = "l")

## The 5-minute interval containing the maximum average number of steps
interval_max <- pattern$interval[which.max(pattern$V1)]
report2 <- paste("The 5-minute interval of", interval_max, "contains the maximum number of steps on average across all the days in the dataset")
report2

## [1] "The 5-minute interval of 835 contains the maximum number of steps on average across all the days in the dataset"
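
The identifier 835 is easier to read as a time of day. The sketch below assumes that identifiers encode hour * 100 + minute, so that 835 means 08:35. The first values in the str() output (0, 5, 10, ...) are consistent with this but do not confirm it. The helper interval_to_clock and the data frame toy_pattern are hypothetical.

```r
# Hypothetical helper: turn an interval identifier such as 835 into "08:35",
# assuming identifiers encode hour * 100 + minute.
interval_to_clock <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}

# Toy per-interval averages (illustrative values only).
toy_pattern <- data.frame(interval = c(0, 5, 835), V1 = c(2, 1, 9))

# which.max() returns the position of the first maximum.
peak_interval <- toy_pattern$interval[which.max(toy_pattern$V1)]
peak_clock    <- interval_to_clock(peak_interval)
```

Under that assumption, the busiest interval in the report corresponds to the five minutes starting at 08:35.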

Imputing missing values

1- Calculate and report the total number of missing values in the dataset.
2- Devise a strategy for filling in all of the missing values in the dataset with the mean for that 5-minute interval.
3- Create a new dataset that is equal to the original dataset but with the missing data filled in.
4- Make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?

## Count the missing values in the dataset
NA_num <- sum(is.na(activity$steps))
report3 <- paste("The number of missing values in the dataset is", NA_num)


## Fill in each missing value with the mean for that 5-minute interval
## Row indices of the NA values
which_na <- which(is.na(activity$steps))
## Interval identifier of each NA row
whichinterval <- activity$interval[which_na]
## Position of each of those intervals in 'pattern'
indx <- match(whichinterval, pattern$interval)
## The values to be filled in
fillNA <- pattern$V1[indx]
activity$steps[which_na] <- fillNA
report4 <- "The missing values in the dataset are filled with the mean for that 5-minute interval"
## The dataset 'activity' is now the dataset with the missing values filled in


## Make a histogram of the total number of steps taken each day
total_steps <- ddply(activity, .(date), function(x) sum(x$steps))
hist(total_steps$V1, xlab = "the total number of steps taken each day", main = "The total number of steps taken each day", xlim = c(0, 25000), ylim = c(0, 25), col = "yellow", breaks = 10)

allinfo2 <- summary(total_steps$V1)
## Named so as not to mask the base functions median() and mean()
median_filled <- allinfo2["Median"]
mean_filled <- allinfo2["Mean"]
report5 <- paste("The median and mean total number of steps taken per day with missing values filled in is", median_filled, mean_filled)
report6 <- "The median is increased and the mean unchanged relative to the missing values unfilled."
report3

## [1] "The number of missing values in the dataset is 2304"

report4

## [1] "The missing values in the dataset are filled with the mean for that 5-minute interval"

head(activity)

##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25

report5

## [1] "The median and mean total number of steps taken per day with missing values filled in is 10770 10770"

report6

## [1] "The median is increased and the mean unchanged relative to the missing values unfilled."
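
Why the mean stays the same can be seen from the counts. A day has 288 five-minute intervals (17568 observations over 61 days), and 2304 = 8 × 288, which is consistent with eight days missing in full. That is an assumption here, since head() only shows that 2012-10-01 begins with NAs. If it holds, every missing day is filled with the average daily profile, so its total equals the mean daily total. The mean is then unchanged and the median is pulled toward it. The sketch below shows this on made-up data using base R's ave(). The names toy, interval_mean, filled, totals_before and totals_after are hypothetical.

```r
# Toy data: three intervals per day over three days; day d2 is entirely missing.
toy <- data.frame(
  steps    = c(0, 10, 20, NA, NA, NA, 4, 30, 26),
  date     = rep(c("d1", "d2", "d3"), each = 3),
  interval = rep(c(0, 5, 10), times = 3)
)

# Per-interval mean over the observed values, aligned row by row with the data.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values are kept as they are.
filled <- toy
filled$steps <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

# Daily totals before (NA for the missing day) and after imputation.
totals_before <- tapply(toy$steps, toy$date, sum)
totals_after  <- tapply(filled$steps, filled$date, sum)

mean_before <- mean(totals_before, na.rm = TRUE)
mean_after  <- mean(totals_after)
```

In this toy case both means are equal, because the filled day's total is exactly the average of the observed days. The equality would not hold in general if missing values were scattered within otherwise observed days.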

Are there differences in activity patterns between weekdays and weekends?

1- Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
2- Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

library("chron")
## Create a new factor "week" with 2 levels: "weekday" and "weekend"
date <- as.Date(activity$date)
## Label each row by day type
activity$week <- ifelse(is.weekend(date), "weekend", "weekday")
activity$week <- as.factor(activity$week)

## Make a panel plot containing a time series plot of the 5-minute interval and the average number of steps taken, averaged across all weekdays or weekend days
pattern2 <- ddply(activity, .(interval, week), function(x) mean(x$steps))
data_weekend <- pattern2[pattern2$week == "weekend", ]
data_weekday <- pattern2[pattern2$week == "weekday", ]

## Stack the two plots in one panel: weekdays on top, weekends below
par(mfrow = c(2, 1))
plot(data_weekday$interval, data_weekday$V1, type = "l", xlab = "interval", ylab = "average steps (weekday)")

plot(data_weekend$interval, data_weekend$V1, type = "l", xlab = "interval", ylab = "average steps (weekend)")
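
The weekday/weekend split can also be made without an extra package. The following is a minimal sketch in base R, not the method used above. It relies on the wday field of POSIXlt, which numbers days from 0 (Sunday) to 6 (Saturday) and so does not depend on the locale's day names. The names dates, wday and day_type are hypothetical.

```r
# Three sample dates from the study period: a Monday, a Saturday, a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# POSIXlt stores the day of the week as 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(dates)$wday

# Two-level factor with a fixed level order.
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
```

Fixing the levels explicitly keeps the factor at two levels even when a subset of the data happens to contain only one day type.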