Reproducible Research: Peer Assessment 1

Synopsis

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.

The data for this assignment can be downloaded from the course web site:

Dataset: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.

Data Processing

Loading and preprocessing the data

The data first need to be downloaded from the course web site, currently at the following URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip

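WebURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
## Create the target folder first: download.file() does not create it
dir.create("./RepData_PeerAssessment1", showWarnings = FALSE)
download.file(WebURL,destfile="./RepData_PeerAssessment1/activity.zip",method="curl")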

As the dataset is a zipped file, let’s unzip it.

## Unzip file in a specified folder
ZipDir <- "./RepData_PeerAssessment1/activity.zip"
UnzipDir <- "./RepData_PeerAssessment1/activity"
unzip(ZipDir,exdir = UnzipDir)
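
The download and unzip steps above run every time the report is knitted. A minimal sketch of a guarded version follows; the helper name `fetch_activity` and its `dir` argument are hypothetical and not part of the original analysis. It assumes, as the paths above suggest, that the archive contains `activity.csv` at its top level.

```r
# Hypothetical helper (not part of the original analysis): download and unzip
# the archive only when the files are not already present locally.
fetch_activity <- function(url, dir = "./RepData_PeerAssessment1") {
  zip_path <- file.path(dir, "activity.zip")
  csv_path <- file.path(dir, "activity", "activity.csv")

  # Create the target folder if needed; download.file() does not create it.
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)

  # mode = "wb" keeps the zip archive intact on Windows.
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }

  # Unzip only if the CSV has not been extracted yet.
  if (!file.exists(csv_path)) {
    unzip(zip_path, exdir = file.path(dir, "activity"))
  }

  csv_path
}

# Possible usage (requires network access on the first run):
# DataSet <- read.csv(fetch_activity(WebURL), header = TRUE)
```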

Load the data

We can now load the data into R.

DataDir <- "./RepData_PeerAssessment1/activity/activity.csv"
DataSet <- read.csv(DataDir,header = TRUE)

Process/transform the data (if necessary) into a format suitable for your analysis

Let’s have a look at the data with the ‘str’ function.

str(DataSet)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

Our data frame contains 17,568 observations of 3 variables (‘steps’, ‘date’, ‘interval’). The ‘steps’ and ‘interval’ variables are integers; let’s convert them to numeric. The ‘date’ variable is a factor and therefore needs to be converted to the ‘Date’ class.

DataSet$date<-as.Date(DataSet$date,"%Y-%m-%d")
DataSet$steps <- as.numeric(DataSet$steps)
DataSet$interval <- as.numeric(DataSet$interval)
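
As an alternative to converting after loading, the column types can be declared at read time. The sketch below uses a few made-up rows in place of `activity.csv` (the `csv_text` and `toy` names are illustrative only) and the `colClasses` argument of `read.csv()`. Note that in R 4.0.0 and later, `read.csv()` reads the date column as character rather than factor by default, so the `str()` output may differ from the one shown above.

```r
# Made-up rows standing in for activity.csv (same three columns).
csv_text <- "steps,date,interval
NA,2012-10-01,0
12,2012-10-01,5
7,2012-10-02,0"

# Declare the classes up front: numeric steps, Date dates, numeric intervals.
toy <- read.csv(text = csv_text,
                colClasses = c("numeric", "Date", "numeric"))

str(toy)
```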

Results

What is the mean total number of steps taken per day?

To answer this question, let’s build a histogram:

SumStepsDay<-aggregate(DataSet$steps, list(DataSet$date),sum)
colnames(SumStepsDay) <- c("date","steps")
hist(as.numeric(SumStepsDay$steps),nclass=10,col="pink",xlab = "Numbers of steps per day",main="Histogram, Numbers of Steps per Day")
medianStepsDay<- median(as.numeric(SumStepsDay$steps),na.rm=TRUE)
meanStepsDay<- mean(as.numeric(SumStepsDay$steps),na.rm=TRUE)
abline(v = medianStepsDay,col = "blue",lwd = 6)
abline(v = meanStepsDay,col = "red",lwd = 2)
legend("topleft", c(paste("Median =",medianStepsDay), paste("Mean =",meanStepsDay)), col=c("blue", "red"), lwd=10,bty="n")

paste("The numbers of steps per day median =",medianStepsDay)

## [1] "The numbers of steps per day median = 10765"

paste("The mean numbers of steps per day =",meanStepsDay)

## [1] "The mean numbers of steps per day = 10766.1886792453"
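
The role of `na.rm=TRUE` in the code above can be illustrated on made-up data. The sketch below (the `toy` data frame is invented for illustration) shows that the non-formula `aggregate()` call returns NA for a day containing missing values, that `na.rm = TRUE` then excludes that day from the mean and median, and that the formula interface drops such rows before summing.

```r
# Made-up data: three days, two intervals each; the first day is all NA.
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  steps = c(NA, NA, 10, 20, 30, 50)
)

# Non-formula interface: a day with missing values sums to NA.
by_day <- aggregate(toy$steps, list(toy$date), sum)
colnames(by_day) <- c("date", "steps")
by_day$steps                        # NA 30 80

# na.rm = TRUE drops that day from the summary statistics.
mean(by_day$steps, na.rm = TRUE)    # 55
median(by_day$steps, na.rm = TRUE)  # 55

# Formula interface: rows with NA are omitted first, so the day disappears.
by_day_f <- aggregate(steps ~ date, data = toy, FUN = sum)
nrow(by_day_f)                      # 2
```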

What is the average daily activity pattern?

Time series plot of the average number of steps taken

Time intervals range from 0 to 2355 and encode the time of day as HHMM (for example, 2350 = 23:50), for each day of the two-month test period (2012-10-01 to 2012-11-30). We would like to get an idea of the average number of steps for each 5-minute interval.

MeanStepInterval<- aggregate(steps~interval,data=DataSet,FUN="mean")
plot(steps ~ interval, data = MeanStepInterval, type = "l", xlab = "5-minute time interval (e.g. 2350 = 23:50)", ylab = "Mean number of steps", main = "Average number of steps by 5-minute interval", col = "blue")
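
Because the interval identifiers encode clock time as HHMM, they can be turned into readable labels. The following is a minimal sketch; the helper name `interval_to_hhmm` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: convert an interval identifier such as 835 to "08:35".
interval_to_hhmm <- function(interval) {
  # Left-pad with zeros to four digits, then split into hours and minutes.
  padded <- sprintf("%04d", as.integer(interval))
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_hhmm(c(0, 835, 2355))
# "00:00" "08:35" "23:55"
```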

The 5-minute interval that, on average, contains the maximum number of steps

maxMeanInterval <- MeanStepInterval[which.max(MeanStepInterval$steps),"interval"]
paste("The time interval with the highest mean steps number is the interval",maxMeanInterval)

## [1] "The time interval with the highest mean steps number is the interval 835"

The highest mean number of steps is recorded in interval 835, i.e. the 5-minute interval starting at 08:35.

Imputing missing values

Extract NA step data

Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

NAValues <- sum(is.na(DataSet$steps))

paste(NAValues,"values are missing in the 'steps' column")

## [1] "2304 values are missing in the 'steps' column"
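
Before choosing an imputation strategy it can help to see how the missing values are distributed over the days. The sketch below counts NAs per day on made-up data (the `toy` data frame and the variable names are illustrative only); the same `tapply()` call could be applied to `DataSet`.

```r
# Made-up data: two days with three intervals each; the first day is all NA.
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 4, 0, 9)
)

# Number of missing 'steps' values for each date.
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day

# Dates on which every interval is missing (3 intervals per day in this toy).
whole_days_missing <- names(na_per_day)[na_per_day == 3]
whole_days_missing
```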

Devise a strategy for filling in all of the missing values in the dataset

Since we have already calculated the average number of steps per interval, we can use these averages to fill the NA values in the original dataset. While looping through the ‘steps’ column of the original dataset, whenever an NA value is encountered, it is replaced by the mean value for the corresponding interval.

Create a new dataset that is equal to the original dataset but with the missing data filled in.

DataSet2 <- DataSet
NAcount <- 0
for(i in 1:nrow(DataSet2)){
        if((is.na(DataSet2[i,"steps"]))){
                NAcount=NAcount+1
                DataSet2[i,"steps"] <- MeanStepInterval[match(DataSet2[i,"interval"],MeanStepInterval$interval),"steps"]
        }
}
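
The loop above can also be written without explicit iteration. The following is a sketch of a vectorized equivalent on made-up data (the `toy`, `interval_means` and `filled` names are illustrative); it uses the same `match()` lookup as the loop, applied to all missing rows at once.

```r
# Made-up data: two intervals (0 and 5) observed on three days.
toy <- data.frame(
  steps    = c(NA, 2, 4, NA, 6, 10),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval; the formula interface ignores NA rows.
# Interval 0: mean(4, 6) = 5; interval 5: mean(2, 10) = 6.
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Replace every NA with the mean of its interval in a single assignment.
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- interval_means$steps[
  match(filled$interval[missing], interval_means$interval)
]

filled$steps   # 5 2 4 6 6 10
```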

Make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day

SumStepsDay2<-aggregate(DataSet2$steps, list(DataSet2$date),sum)
colnames(SumStepsDay2) <- c("date","steps")
hist(as.numeric(SumStepsDay2$steps),nclass=10,col="pink",xlab = "Numbers of steps per day",main="Histogram, Numbers of Steps per Day")
medianStepsDay2<- median(as.numeric(SumStepsDay2$steps),na.rm=TRUE)
meanStepsDay2<- mean(as.numeric(SumStepsDay2$steps),na.rm=TRUE)
abline(v = medianStepsDay2,col = "blue",lwd = 6)
abline(v = meanStepsDay2,col = "red",lwd = 2)
legend("topleft", c(paste("Median =",medianStepsDay2), paste("Mean =",meanStepsDay2)), col=c("blue", "red"), lwd=10,bty="n")

paste("The numbers of steps per day median =",medianStepsDay2)

## [1] "The numbers of steps per day median = 10766.1886792453"

paste("The mean numbers of steps per day =",meanStepsDay2)

## [1] "The mean numbers of steps per day = 10766.1886792453"

Do these values differ from the estimates from the first part of the assignment?

The mean is not different but the median is.

What is the impact of imputing missing data on the estimates of the total daily number of steps?

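As expected, the mean is unchanged. The median has increased slightly, from 10765 to 10766.19, and now equals the mean. The 2304 missing values amount to exactly 8 × 288 intervals, which is consistent with eight whole days being missing; each such day is filled with the interval means, so its daily total equals the mean daily total.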
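
The effect can be reproduced on a small made-up example. The sketch below assumes, as a simplification, that missing values occur only as whole missing days; under that assumption, imputing with interval means gives the missing day a total equal to the mean of the observed daily totals, so the mean is unchanged while the median can move. All names and numbers are invented for illustration.

```r
# Made-up data: 4 days x 2 intervals, with day "d2" entirely missing.
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3", "d4"), each = 2),
  interval = rep(c(0, 5), times = 4),
  steps    = c(10, 30, NA, NA, 20, 60, 30, 60)
)

# Daily totals before imputation: 40, NA, 80, 90.
daily_before  <- as.vector(tapply(toy$steps, toy$date, sum))
mean_before   <- mean(daily_before, na.rm = TRUE)    # 70
median_before <- median(daily_before, na.rm = TRUE)  # 80

# Interval means from the observed days: interval 0 -> 20, interval 5 -> 50.
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill each NA with the mean of its interval.
missing <- is.na(toy$steps)
toy$steps[missing] <- as.vector(
  interval_means[as.character(toy$interval[missing])]
)

# Daily totals after imputation: 40, 70, 80, 90.
daily_after  <- as.vector(tapply(toy$steps, toy$date, sum))
mean_after   <- mean(daily_after)     # 70, unchanged
median_after <- median(daily_after)   # 75, moved toward the mean
```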

Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

Are there differences in activity patterns between weekdays and weekends?

Yes, there are; see the panel plot produced at the end of this section. What can we deduce?

Activity is spread more evenly across the day at the weekend
During the week, we observe a sharp increase in activity in the early morning (walking to work), followed by a sharp decrease (sedentary activity)

Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day

Let’s first add an additional column ‘week’ in which we record whether the date is a weekday or a weekend day. The ‘wday’ function from lubridate returns a number from 1 to 7 (note that, by default, 1 is Sunday). If the returned value is 1 (Sunday) or 7 (Saturday), the date is classified as “weekend”, otherwise as “weekday”.

require(lubridate)

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## Store the result as a factor, as the task asks for a factor variable
DataSet$week <- factor(ifelse(wday(DataSet$date) %in% c(1, 7),"weekend","weekday"))
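
The same classification can be sketched in base R without lubridate. The example below relies on the `%u` format code documented in `?strptime` (weekday as a number from 1 to 7, with Monday as 1), which does not depend on the locale's day names; the `dates`, `iso_day` and `week` variables are illustrative only.

```r
# Friday to Monday of the first full week in the dataset's period.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# ISO weekday number: 1 = Monday, ..., 6 = Saturday, 7 = Sunday.
iso_day <- as.integer(format(dates, "%u"))

# Saturday (6) and Sunday (7) are weekend days; everything else is a weekday.
week <- factor(ifelse(iso_day >= 6, "weekend", "weekday"),
               levels = c("weekday", "weekend"))
week
```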

Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis)

Let’s start by aggregating ‘steps’ by ‘week’ status (weekday or weekend), using the ‘mean’ function as before, and then build the xyplot (from the ‘lattice’ package).

require(lattice)

## Loading required package: lattice

WeekSplitSteps <- aggregate(steps ~ interval + week, data = DataSet, mean)
names(WeekSplitSteps) <- c("interval","weekdayStatus", "steps")
xyplot(steps ~ interval | weekdayStatus, WeekSplitSteps, type = "l", layout = c(1, 2), 
    xlab = "Time-interval", ylab = "Number of steps")