It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.
The data for this assignment can be downloaded from the course web site:
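Dataset: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
The variables included in this dataset are:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.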
WebURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
download.file(WebURL,destfile="./RepData_PeerAssessment1/activity.zip",method="curl")
## Unzip file in a specified folder
ZipDir <- "./RepData_PeerAssessment1/activity.zip"
UnzipDir <- "./RepData_PeerAssessment1/activity"
unzip(ZipDir,exdir = UnzipDir)
We can now load the data into R:
DataDir <- "./RepData_PeerAssessment1/activity/activity.csv"
DataSet <- read.csv(DataDir,header = TRUE)
Let’s have a look at the data with the ‘str’ function:
str(DataSet)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Our data frame is made of 17,568 observations of 3 variables (‘steps’, ‘date’, ‘interval’). The ‘steps’ and ‘interval’ variables are integers; let’s convert them to numeric. The ‘date’ variable is stored as a factor and therefore needs to be converted to the ‘Date’ format.
DataSet$date<-as.Date(DataSet$date,"%Y-%m-%d")
DataSet$steps <- as.numeric(DataSet$steps)
DataSet$interval <- as.numeric(DataSet$interval)
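As an alternative to converting the columns after loading, the types can be requested while reading the file. The sketch below uses a small made-up CSV string in place of activity.csv (the values and the names `csv_text` and `toy` are illustrative assumptions, not part of the dataset) and relies on the `colClasses` argument of `read.csv`.

```r
# Toy CSV text standing in for activity.csv (values are made up for illustration)
csv_text <- "steps,date,interval
NA,2012-10-01,0
12,2012-10-01,5
7,2012-10-02,0"

# colClasses converts each column while reading: numeric, Date, numeric.
# The "Date" class expects the default YYYY-MM-DD layout used in this dataset.
toy <- read.csv(text = csv_text, header = TRUE,
                colClasses = c("numeric", "Date", "numeric"))

str(toy)
```

With the real file, the same `colClasses` vector could be passed to the `read.csv(DataDir, ...)` call, which would make the three conversion lines unnecessary.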
To look at the total number of steps taken per day, let’s build a histogram:
SumStepsDay<-aggregate(DataSet$steps, list(DataSet$date),sum)
colnames(SumStepsDay) <- c("date","steps")
hist(as.numeric(SumStepsDay$steps),nclass=10,col="pink",xlab = "Numbers of steps per day",main="Histogram, Numbers of Steps per Day")
medianStepsDay<- median(as.numeric(SumStepsDay$steps),na.rm=TRUE)
meanStepsDay<- mean(as.numeric(SumStepsDay$steps),na.rm=TRUE)
abline(v = medianStepsDay,col = "blue",lwd = 6)
abline(v = meanStepsDay,col = "red",lwd = 2)
legend("topleft", c(paste("Median =",medianStepsDay), paste("Mean =",meanStepsDay)), col=c("blue", "red"), lwd=10,bty="n")
paste("The numbers of steps per day median =",medianStepsDay)
## [1] "The numbers of steps per day median = 10765"
paste("The mean numbers of steps per day =",meanStepsDay)
## [1] "The mean numbers of steps per day = 10766.1886792453"
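The daily totals above are computed without removing NA values, so any day containing missing measurements gets an NA total, and `na.rm=TRUE` then drops those days from the mean and median. A minimal sketch of that behaviour on hypothetical toy data (the data frame `toy` and its values are made up for illustration):

```r
# Hypothetical toy data: day 1 is entirely missing, days 2 and 3 are complete
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 30, 50),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2))
)

# Same aggregation pattern as in the report: sum of steps per date
daily <- aggregate(toy$steps, list(toy$date), sum)
colnames(daily) <- c("date", "steps")

daily$steps                                    # day 1 is NA because sum() propagates NA
daily_mean <- mean(daily$steps, na.rm = TRUE)  # averages over days 2 and 3 only
daily_mean
```

Here the mean is taken over two days rather than three, which mirrors how the report's mean and median ignore days whose total is NA.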
Time intervals range from 0 to 2355 and encode the time of day (e.g. 2350 = 23:50), repeated for each day of the two-month test period (2012-10-01 to 2012-11-30). We would like to have an idea of the average number of steps for each 5-minute interval.
MeanStepInterval<- aggregate(steps~interval,data=DataSet,FUN="mean")
plot(steps ~ interval, data = MeanStepInterval, type = "l", xlab = "5-minute time interval (e.g. 2350 = 23:50)", ylab = "Mean number of steps", main = "Average number of steps by 5-minute interval", col = "blue")
maxMeanInterval <- MeanStepInterval[which.max(MeanStepInterval$steps),"interval"]
paste("The time interval with the highest mean steps number is the interval",maxMeanInterval)
## [1] "The time interval with the highest mean steps number is the interval 835"
The highest mean number of steps is recorded at interval 835, i.e. at 8:35 AM.
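The reading of an interval identifier as a clock time can be made explicit with a small helper. This is only a sketch: `interval_to_clock` is a hypothetical name introduced here, and it assumes the identifier is the hour followed by the two-digit minute, as described above.

```r
# Convert an interval identifier (e.g. 835) into an "HH:MM" string.
# Integer division by 100 gives the hour, the remainder gives the minute.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock_times <- interval_to_clock(c(0, 835, 2355))
clock_times
```

Applied to `maxMeanInterval`, the helper would return the peak interval as a time of day instead of a bare number.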
NAValues <- sum(is.na(DataSet$steps))
paste(NAValues,"values are missing in the 'steps' column")
## [1] "2304 values are missing in the 'steps' column"
Since we have calculated the average number of steps per interval, we can use these values to fill in the NA values in the original dataset. We loop through the ‘steps’ column of the original dataset and, whenever an NA value is encountered, replace it with the mean value for the corresponding interval.
DataSet2 <- DataSet
NAcount <- 0
for (i in seq_len(nrow(DataSet2))) {
  if (is.na(DataSet2[i, "steps"])) {
    NAcount <- NAcount + 1
    DataSet2[i, "steps"] <- MeanStepInterval[match(DataSet2[i, "interval"], MeanStepInterval$interval), "steps"]
  }
}
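The same fill-in rule can also be written without an explicit loop. The sketch below is a vectorised variant on hypothetical toy data (the names `toy`, `interval_means`, `missing` and `filled` are illustrative and not part of the analysis above); it uses `match` in the same way as the loop, but on all missing rows at once.

```r
# Hypothetical toy data with two intervals (0 and 5) observed on three days
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval; the formula interface drops NA rows by default
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Replace every missing value with the mean of its interval in one step
missing <- is.na(toy$steps)
filled <- toy
filled$steps[missing] <- interval_means$steps[
  match(toy$interval[missing], interval_means$interval)
]

filled$steps
```

On a data frame of this size the loop and the vectorised form should give the same result; the vectorised form mainly avoids row-by-row indexing.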
SumStepsDay2<-aggregate(DataSet2$steps, list(DataSet2$date),sum)
colnames(SumStepsDay2) <- c("date","steps")
hist(as.numeric(SumStepsDay2$steps),nclass=10,col="pink",xlab = "Numbers of steps per day",main="Histogram, Numbers of Steps per Day")
medianStepsDay2<- median(as.numeric(SumStepsDay2$steps),na.rm=TRUE)
meanStepsDay2<- mean(as.numeric(SumStepsDay2$steps),na.rm=TRUE)
abline(v = medianStepsDay2,col = "blue",lwd = 6)
abline(v = meanStepsDay2,col = "red",lwd = 2)
legend("topleft", c(paste("Median =",medianStepsDay2), paste("Mean =",meanStepsDay2)), col=c("blue", "red"), lwd=10,bty="n")
paste("The numbers of steps per day median =",medianStepsDay2)
## [1] "The numbers of steps per day median = 10766.1886792453"
paste("The mean numbers of steps per day =",meanStepsDay2)
## [1] "The mean numbers of steps per day = 10766.1886792453"
The mean is unchanged, but the median is not.
As expected, the mean value has remained identical. The median, on the other hand, has increased slightly (from 10765 to 10766.19) and now equals the mean, a consequence of replacing the NA values with interval means.
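One way to see why the mean stays put while the median moves is a small worked example. The daily totals below are hypothetical, and the sketch assumes that each imputed day ends up with a total equal to the mean of the observed days, which is what happens when a whole day is filled with interval means.

```r
# Hypothetical daily totals for four fully observed days
observed <- c(8000, 10000, 12000, 15000)
observed_mean   <- mean(observed)    # 11250
observed_median <- median(observed)  # 11000

# Assume three further days are imputed with the observed mean as their total
imputed <- c(observed, rep(observed_mean, 3))

imputed_mean   <- mean(imputed)      # still 11250: adding the mean does not shift the mean
imputed_median <- median(imputed)    # 11250: the middle of the sorted values is now an imputed day

c(mean = imputed_mean, median = imputed_median)
```

Adding values equal to the mean leaves the mean where it was, but it places several identical values in the middle of the distribution, which can pull the median onto the mean.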
Yes, there are differences between weekdays and weekends. Let’s plot both patterns and see what we can deduce.
Let’s first add an additional column ‘week’ that records whether each date is a weekday or a weekend day. The function ‘wday’ returns a number from 1 to 7 (note that 1 corresponds to Sunday). If the returned value is 1 (Sunday) or 7 (Saturday), the day is classified as “weekend”, otherwise as “weekday”.
require(lubridate)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
DataSet$week <- ifelse(wday(DataSet$date)==1|wday(DataSet$date)==7,"weekend","weekday")
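For readers without lubridate installed, the same classification can be sketched in base R. The helper name `classify_day` is hypothetical; it relies on `as.POSIXlt`, whose `wday` component runs from 0 (Sunday) to 6 (Saturday), so the numbering differs from lubridate's 1 to 7.

```r
# Classify dates as "weekend" or "weekday" using base R only.
# as.POSIXlt(...)$wday is 0 for Sunday and 6 for Saturday.
classify_day <- function(dates) {
  day_number <- as.POSIXlt(dates)$wday
  ifelse(day_number %in% c(0, 6), "weekend", "weekday")
}

# 2012-10-01 was a Monday, 2012-10-06 a Saturday, 2012-10-07 a Sunday
example_dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))
day_types <- classify_day(example_dates)
day_types
```

Because the result does not depend on weekday names, this variant is not affected by the locale of the R session.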
Let’s start by aggregating ‘steps’ by interval and ‘week’ status (weekday or weekend), using the ‘mean’ function as before, and then build the panel plot with ‘xyplot’ (from the ‘lattice’ library).
require(lattice)
## Loading required package: lattice
WeekSplitSteps <- aggregate(steps ~ interval + week, data = DataSet, mean)
names(WeekSplitSteps) <- c("interval","weekdayStatus", "steps")
xyplot(steps ~ interval | weekdayStatus, WeekSplitSteps, type = "l", layout = c(1, 2),
xlab = "Time-interval", ylab = "Number of steps")