Reproducible Research course repdata-034
In this paper:
Step counts recorded at five-minute intervals for a single individual during October and November 2012 are analyzed to answer the questions set by the class assignment, in the order given. Data conversion and cleanup were necessary for three reasons:
1. If the time of each interval is read as a number, ggplot2 places it on a continuous scale and leaves room for values that should not exist: between 0:55 and 1:00 (stored as 55 and 100) it spaces the axis as if 0:60, 0:65 … 0:90, 0:95 were real times. This would produce a stretched graph with long smooth runs that misrepresent the original data. Reformatting the intervals as times prevents that.
2. Data were found to be missing for 8 complete days and, as the assignment instructions suggest, were replaced with the average for each interval. This drastically shrank the variance and could be very misleading in a real-world problem.
3. Question 4 requires identification of dates as weekend or weekday; a new variable, DayType, with values “Weekday” or “Weekend”, had to be created, and values computed, for that purpose.
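The distortion described in item 1 can be seen without any plotting. The following is a minimal sketch on a handful of hand-typed interval labels; it uses minutes since midnight as a stand-in for the chron times used later in this report, and the variable names are made up for the illustration.

```r
# Interval labels as stored in the raw data: 50 means 0:50, 100 means 1:00
interval <- c(50, 55, 100, 105)

# Read as plain numbers, the step from 0:55 to 1:00 looks like 45 units
numeric_gaps <- diff(interval)
print(numeric_gaps)

# Converted to minutes since midnight, every step is 5 minutes wide
minutes_since_midnight <- (interval %/% 100) * 60 + interval %% 100
true_gaps <- diff(minutes_since_midnight)
print(true_gaps)
```

On a numeric axis the hourly jump is nine times wider than it should be, which is what stretches the line plot.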
Necessary housekeeping:
* library package loading.
* creation of functions: NaRmMean and NaRmMedian, mean and median functions with the default reset so that NAs are ignored, which is the configuration needed in Questions 1 and 2 (also used in Questions 3 and 4 for consistency).
##Library Calls
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(chron))
## Establish functions NaRmMean and NaRmMedian
NaRmMean<-function(x){mean(x,na.rm=TRUE)} #mean function that skips NAs
NaRmMedian<-function(x){median(x,na.rm=TRUE)} #median function that skips NAs
Data were located in the file https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip and were downloaded into a dataset called ActivityData. Preliminary inspection revealed that missing data for steps were confined to eight days that were missing completely.
##Acquire data
ActivityZipped<-tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",ActivityZipped,"curl")
##Read data into dataset
ActivityData<-data.table()
ActivityData<-read.csv(unz(ActivityZipped,"activity.csv"))
##First inspection
summary(ActivityData)
##      steps               date          interval
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8
##  Median :  0.00   2012-10-03:  288   Median :1177.5
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0
##  NA's   :2304     (Other)   :15840
PercentStepsMissing<-100*sum(is.na(ActivityData$steps))/nrow(ActivityData)
The first summary of ActivityData showed that a value for “steps” was missing in 13.1% of the rows, a proportion more than large enough to affect the results of analysis if it were all concentrated in one area. (Unfortunately, as will be seen, the suggested remedy did concentrate them in one area.)
Missing data cases were summarized to determine whether missing values for steps were concentrated in a few places or scattered throughout the data set.
ActivityDataNAs<-data.table()
ActivityDataNAs<-ActivityData[is.na(ActivityData$steps),]
WhichDatesMissingSteps<-as.Date(unique(ActivityDataNAs$date))
NumberDatesMissingSteps<-length(WhichDatesMissingSteps)
WhichDaysMissingSteps<-weekdays(WhichDatesMissingSteps)
Pulling out the rows where the steps value was missing revealed that the record was actually missing all data for 8 complete days. Those dates were 2012-10-01, 2012-10-08, 2012-11-01, 2012-11-04, 2012-11-09, 2012-11-10, 2012-11-14, 2012-11-30, which fall on Monday, Monday, Thursday, Sunday, Friday, Saturday, Wednesday, Friday respectively.
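The counts reported so far are consistent with eight fully missing days, which can be checked with simple arithmetic. This is only a sanity-check sketch; the variable names are made up for the illustration.

```r
# Five-minute intervals in one day
intervals_per_day <- 24 * 60 / 5

# October (31 days) plus November (30 days)
days_in_record <- 31 + 30

# Rows expected in the full dataset
total_rows <- intervals_per_day * days_in_record

# Rows lost if 8 whole days are missing
missing_rows <- 8 * intervals_per_day

# Share of rows with no step count, in percent
percent_missing <- 100 * missing_rows / total_rows
print(c(total_rows, missing_rows, round(percent_missing, 1)))
```

The 2304 missing rows match the NA count in the summary, and 2304 out of 17568 rows is the 13.1% quoted earlier.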
Because missing data were distributed evenly across intervals and roughly proportionately between weekdays (6 of the 8 days) and weekends (2 of the 8 days), answers for Questions 1 and 2 would not be severely affected. This also provided a good starting point for Questions 3 and 4.
Several conversions were then applied to the ActivityData dataset:
ActivityData$steps<-as.numeric(ActivityData$steps) #convert steps
#Convert dates
ActivityData$date<-as.character(ActivityData$date) #convert dates to char from factor
ActivityData$date<-chron(dates=ActivityData$date,format="y-m-d")
##Intervals need to be converted to times so that ggplot2 won't interpolate nonsense values like "1275"
ActivityData$hour<-as.integer(ActivityData$interval/100) #extract hour
ActivityData$minutes<-ActivityData$interval-100*ActivityData$hour #extract minutes
ActivityData$CharacterHour<-as.character(ActivityData$hour) #convert hour to character
ActivityData$CharacterMinutes<-as.character(ActivityData$minutes) #convert minutes to character
Leader<-"0" #to be applied for low values in creating consistent character times
#attach Leader to CharacterHour and CharacterMinute for values <10
for (j in 1:nrow(ActivityData)) {
if (ActivityData$hour[j]<10) {
ActivityData$CharacterHour[j]<-paste(Leader,ActivityData$CharacterHour[j],sep="")
}
if (ActivityData$minutes[j]<10) { #attach Leader to CharacterMinutes for values <10
ActivityData$CharacterMinutes[j]<-paste(Leader,ActivityData$CharacterMinutes[j],sep="")
}
}
#assemble character-class interval and convert to time
Trailer<-"00" #seconds for h:m:s format
ActivityData$interval<-as.character(paste(ActivityData$CharacterHour,ActivityData$CharacterMinutes,Trailer,sep=":")) #assemble character
ActivityData$interval<-times(ActivityData$interval,out.format="h:m") #convert to time
#Establish and fill out DayType variable for later.
ActivityData$DayType<-"Weekday"
ActivityData$DayOWk<-weekdays(ActivityData$date,abbreviate=TRUE)
ActivityData$IsWeekend<-ActivityData$DayOWk=="Sat" | ActivityData$DayOWk=="Sun"
ActivityData$DayType[ActivityData$IsWeekend]<-"Weekend"
#Retain only working variables
ActivityData<-ActivityData[,c(1:3,8)]
rm(Leader,Trailer,j,PercentStepsMissing,WhichDaysMissingSteps,WhichDatesMissingSteps)
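The zero-padding loop above can also be written without a loop. The following is a minimal sketch, not the code used for this report: it assumes intervals arrive as integers such as 835, and the names interval, clock and day_type are made up for the illustration. It also swaps weekdays() for the numeric wday field of POSIXlt, so that no day-name strings have to be matched.

```r
# Raw interval labels: hours * 100 + minutes
interval <- c(0L, 5L, 55L, 100L, 835L, 2355L)

# Split into hours and minutes with integer arithmetic
hour <- interval %/% 100L
minute <- interval %% 100L

# sprintf pads each part to two digits and appends the seconds
clock <- sprintf("%02d:%02d:00", hour, minute)
print(clock)

# Weekend flag from the numeric weekday: 0 is Sunday, 6 is Saturday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
wday <- as.POSIXlt(dates)$wday
day_type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
print(day_type)
```

The resulting strings are in the h:m:s form that times() expects, so they could be passed on in the same way as the loop's output.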
Question 1 required creating a derived dataset of daily step counts (TotalStepsByDate), calculating the mean and median daily step counts, and preparing a histogram of daily step counts.
Steps_By_Date<-data.table()
Steps_By_Date<-ActivityData %>% group_by(date)
TotalStepsByDate<-Steps_By_Date%>%summarise_each(funs(sum),steps)
#uses functions defined above to ignore NAs
MeanTotalDailySteps<-NaRmMean(TotalStepsByDate$steps)
MedianTotalDailySteps<-NaRmMedian(TotalStepsByDate$steps)
Mean total daily steps, ignoring NAs: 10766
Median total daily steps, ignoring NAs: 10765
The binwidth of 1250 was chosen by successive experiments to give about 20 bins; 20 produced a pleasing balance between showing detail and leaving a clear overall impression. It also corresponds to a fairly intuitive allocation of values: 8 adjacent bins = 10,000 steps and 2 adjacent bins = 2,500 steps. If the distribution were uniform, this would have resulted in about 3 cases per bin, a low ragged line across the diagram.
In fact, though, there was a very clear central peak. That, plus the closeness of mean and median found above suggests a normal or binomial distribution, so I added an estimated density curve (red curve) and a vertical dark green line to mark the mean/median (on the scale of this drawing, they are closer together than a thin linewidth). The resemblance and fit to a Gaussian curve is quite striking.
If I were investigating further I would definitely try to fit this to a normal or binomial distribution.
TotalStepsPlot<-ggplot(TotalStepsByDate) #setting up the histogram
TotalStepsPlot<-TotalStepsPlot + aes(steps) #refer to the column by name; ggplot looks it up in TotalStepsByDate
TotalStepsPlot<-TotalStepsPlot + geom_histogram(binwidth=1250,
fill="lightblue",
colour="black",
alpha=.5,
na.rm=TRUE)
#binwidth chosen to have about 20 bins, 8 bins=10,000,
#about 3 cases/bin for a uniform distribution
TotalStepsPlot<-TotalStepsPlot + geom_density(aes(y=..count..*1250),
colour="red",
fill=NA,
show_guide=TRUE,
na.rm=TRUE)
TotalStepsPlot<-TotalStepsPlot + geom_vline(aes(xintercept=MeanTotalDailySteps),
colour="dark green",
show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + scale_y_continuous(breaks=0:15)
TotalStepsPlot<-TotalStepsPlot + labs(title="Distribution of Total Daily Steps")
TotalStepsPlot<-TotalStepsPlot + labs(x="Total Steps in One Day")
TotalStepsPlot<-TotalStepsPlot + labs(y="Days")
TotalStepsPlot
## Warning: Removed 8 rows containing non-finite values (stat_density).
Given that each day is a case in this plot, and there are therefore only 61 cases (53 once the 8 days without data are removed), the plot looks to be in good accord with a normal or binomial distribution; that hypothesis might warrant further investigation.
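The density layer in the plot above multiplies ..count.. by the binwidth so that the curve sits on the same scale as the histogram bars. The reasoning can be sketched in base R: a density integrates to 1, so multiplying it by the number of cases and by the binwidth gives a curve whose area equals the total area of the bars. The data below are synthetic (random normal values with an arbitrary spread), not the step counts, and base R's density() stands in for ggplot2's density layer.

```r
set.seed(42)
n <- 53          # days with data in this report
binwidth <- 1250 # same binwidth as the histogram

# Synthetic stand-in for daily totals; mean and sd are illustrative only
daily_steps <- rnorm(n, mean = 10766, sd = 4000)

# Kernel density estimate on an evenly spaced grid
kde <- density(daily_steps)
grid_step <- diff(kde$x[1:2])

# Area under the raw density curve is close to 1
density_area <- sum(kde$y) * grid_step

# Rescaled curve: expected count per bin at each grid point
count_curve <- kde$y * n * binwidth
curve_area <- sum(count_curve) * grid_step

# Total area of the histogram bars: one unit of height per case, binwidth wide
histogram_area <- n * binwidth
print(c(density_area, curve_area / histogram_area))
```

Both printed ratios should be close to 1, which is why the red curve and the bars can be compared directly.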
Steps_By_Interval<-data.table()
Steps_By_Interval<-ActivityData %>% group_by(interval)
AverageIntervalSteps<-Steps_By_Interval%>%summarise_each(funs(NaRmMean),steps)
## Xbreaks, which are specifications for x axis, all need to be in class times
begin<-times("00:00:00")
end<-times("23:55:00")
increment<-times("01:30:00")
Xbreaks<-seq(from=begin,to=end,by=increment)
#draw plot
AverageIntervalStepsPlot<-ggplot(AverageIntervalSteps) #setting up the line plot
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + aes(interval,steps)
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + geom_line() #line plot
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + scale_x_continuous(breaks=Xbreaks)
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(title="Average Steps by Interval")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(x="Interval (listed by begin time on 24-hour clock)")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(y="Average number of steps")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + theme_bw(base_family = 'Helvetica')
AverageIntervalStepsPlot
MaxAverageIntervalSteps <- max(AverageIntervalSteps$steps)
MaxAverageInterval<-AverageIntervalSteps[AverageIntervalSteps$steps== MaxAverageIntervalSteps,1]
Maximum average interval steps: 206.2
The interval this corresponds to is 08:35 (interval 835 in the raw data).
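Looking up the peak interval can also be done with which.max(), which returns the position of the largest value directly. The sketch below runs on a three-row toy table; only the 206.2 peak is taken from this report, and the other two values are invented for the illustration.

```r
# Toy table of interval averages (two of the three values are made up)
average_steps <- data.frame(
  interval = c("08:30", "08:35", "08:40"),
  steps = c(150.0, 206.2, 180.0),
  stringsAsFactors = FALSE
)

# Row index of the largest average
peak_row <- which.max(average_steps$steps)

# Interval label and value at that row
peak_interval <- average_steps$interval[peak_row]
peak_steps <- average_steps$steps[peak_row]
print(c(peak_interval, peak_steps))
```

Unlike comparing with ==, which.max() returns a single row even when two intervals tie for the maximum.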
Observable features here include:
The earlier inspection of the data showed that the only missing values are for steps. They are missing from exactly eight days; on each of those days, all 288 of the 5-minute intervals are missing. Furthermore, the dataset begins at midnight on October 1 and ends at 23:55 on November 30 with no skipped days.
Thus, we can take the average-per-interval vector from Question 2, which runs from 00:00 to 23:55, and insert it only into the days where NAs occur. Because R recycles the shorter vector in an element-wise multiplication, multiplying the 288 interval averages by a logical vector that is TRUE only where there is an NA yields a corrections vector that holds the interval average wherever there is an NA, and zero otherwise.
After changing NAs to zeros and adding, the new interpolated values for $steps will be complete. I chose to round the averages to integers because it gets rid of the absurdity of fractional steps in the data, reduces the very-low-step intervals down to a step value of 0, and has very little effect on higher values.
MissingStepsSelector<-is.na(ActivityData$steps)
InsertIntervalAverages<-MissingStepsSelector*round(AverageIntervalSteps$steps)
ActivityData$steps[is.na(ActivityData$steps)]<-0
ActivityData$steps<-ActivityData$steps+InsertIntervalAverages
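The recycling step is easier to follow on a toy record. The following sketch uses three made-up "days" of three intervals each, with the first and last day missing entirely; all names and values are invented for the illustration. It relies on the same assumption as the real data: rows are sorted by day and then by interval, and every day is complete.

```r
# Toy interval averages, one per interval of the day
interval_means <- c(0.4, 10.6, 50.2)

# Three days of three intervals; days 1 and 3 are missing entirely
steps <- c(NA, NA, NA,  0, 12, 47,  NA, NA, NA)

# TRUE where a value is missing
is_missing <- is.na(steps)

# The 3 rounded averages are recycled across all 9 rows,
# then zeroed wherever the original value was present
fill <- is_missing * round(interval_means)

# Replace NAs with zero and add the corrections
steps[is_missing] <- 0
imputed <- steps + fill
print(imputed)
```

The observed day is left untouched, and both missing days receive the same rounded averages.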
# initial grouping-summing of steps by date, exported to dataset TotalStepsByDate
Steps_By_Date<-data.table()
Steps_By_Date<-ActivityData %>% group_by(date)
TotalStepsByDate<-Steps_By_Date%>%summarise_each(funs(sum),steps)
#uses functions defined above to ignore NAs
MeanTotalDailySteps<-NaRmMean(TotalStepsByDate$steps)
MedianTotalDailySteps<-NaRmMedian(TotalStepsByDate$steps)
With interval values interpolated for NAs, mean total daily steps was found to be 10766, quite close to the median total daily steps of 10762. Both of these were very close to the results found the first time, for reasons discussed below.
I chose to use exactly the same code to produce the histogram, to make as direct a comparison as possible.
TotalStepsPlot<-ggplot(TotalStepsByDate) #setting up the histogram
TotalStepsPlot<-TotalStepsPlot + aes(steps) #refer to the column by name; ggplot looks it up in TotalStepsByDate
TotalStepsPlot<-TotalStepsPlot + geom_histogram(binwidth=1250,
fill="lightblue",
colour="black",
alpha=.5)
#binwidth chosen to have about 20 bins, 8 bins=10,000,
#about 3 cases/bin for a uniform distribution
TotalStepsPlot<-TotalStepsPlot + geom_density(aes(y=..count..*1250),
colour="red",
fill=NA,
show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + geom_vline(aes(xintercept=MeanTotalDailySteps),
colour="dark green",
show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + scale_y_continuous(breaks=0:15)
TotalStepsPlot<-TotalStepsPlot + labs(title="Distribution of Total Daily Steps")
TotalStepsPlot<-TotalStepsPlot + labs(x="Total Steps in One Day")
TotalStepsPlot<-TotalStepsPlot + labs(y="Days")
TotalStepsPlot
Obviously kurtosis (spikiness) of the distribution has become much more acute with the addition of the interpolated values. It is quite noticeable how much farther the central column sticks out above the peak in the red distribution curve.
The histogram still looks roughly like it might be a normal or a binomial distribution, but the peak is proportionately higher, and the standard deviation smaller.
Yet the mean and median are almost the same; the overall center didn’t change much, but the distribution changed fairly drastically.
This is because missing values were concentrated in 8 days, and because the average for each interval was used, each of those 8 days received the same interpolated values. This in turn assigned all eight missing days the same total steps, so all 8 missing days were added to the central column. Inspection shows, in fact, that in the original histogram, there were 12 values in the central column range, corresponding to the range 10,000-11,250, and in the new graph there are 20, with no other column of the histogram changed. To further verify this, I checked the sum of the interval averages:
MissingDataIntervalAverageSum<-sum(AverageIntervalSteps$steps)
The result was 10766, which of course is between 10,000 and 11,250. So the actual effect of this method of interpolation was merely to exaggerate the central tendency in the data.
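The effect described here, an unchanged center with a narrower spread, follows directly from filling whole days with interval averages. The following sketch shows it on simulated counts (Poisson draws, chosen only for convenience); the dimensions, names and values are all invented for the illustration, and no rounding is applied.

```r
set.seed(1)
n_days <- 10
n_intervals <- 4

# Rows are days, columns are intervals; simulated step counts
steps <- matrix(rpois(n_days * n_intervals, lambda = 30), nrow = n_days)

# Two whole days are missing
missing_days <- c(2, 7)
steps[missing_days, ] <- NA

# Daily totals for the days that were actually observed
observed_totals <- rowSums(steps[-missing_days, ])

# Average of each interval across the observed days
interval_means <- colMeans(steps, na.rm = TRUE)

# Fill every missing day with the same vector of interval averages
imputed <- steps
for (d in missing_days) imputed[d, ] <- interval_means
imputed_totals <- rowSums(imputed)

print(c(mean(observed_totals), mean(imputed_totals)))
print(c(sd(observed_totals), sd(imputed_totals)))
```

Every filled day gets a total equal to the observed mean, so the mean does not move while the standard deviation drops.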
First we have to regroup the dataset on two variables: interval and DayType.
StepsXIntDayType<-data.table()
StepsXIntDayType<-ActivityData %>% group_by(interval,DayType)
AverageIntervalDayTypeSteps<-StepsXIntDayType%>%summarise_each(funs(NaRmMean),steps)
##Xbreaks, which are specifications for x axis, all need to be in class times
begin<-times("00:00:00")
end<-times("23:55:00")
increment<-times("01:30:00")
Xbreaks<-seq(from=begin,to=end,by=increment)
#draw plot
AvgIntDTStpsPlot<-ggplot(AverageIntervalDayTypeSteps) #setting up the line plot
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + aes(interval,steps)
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + geom_line() #line plot
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + scale_x_continuous(breaks=Xbreaks)
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(title="Average Steps by Interval (NAs have been interpolated)")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(x="Interval (listed by begin time on 24-hour clock)")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(y="Average number of steps")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + facet_grid(DayType ~ .)
AvgIntDTStpsPlot
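Grouping on two variables at once can be sketched on a small table. The example below uses base R's aggregate() in place of the dplyr pipeline above, and the eight rows of data are invented for the illustration.

```r
# Toy data: two intervals, two day types, two observations of each combination
toy <- data.frame(
  interval = rep(c("08:00", "08:05"), times = 4),
  DayType = rep(c("Weekday", "Weekday", "Weekend", "Weekend"), times = 2),
  steps = c(10, 20, 30, 40, 14, 24, 34, 44),
  stringsAsFactors = FALSE
)

# One mean per (interval, DayType) combination
averages <- aggregate(steps ~ interval + DayType, data = toy, FUN = mean)
print(averages)

# Look up a single cell of the result
weekend_0800 <- averages$steps[averages$interval == "08:00" &
                               averages$DayType == "Weekend"]
print(weekend_0800)
```

The result has one row per combination, which is the shape facet_grid(DayType ~ .) needs to draw one panel per day type.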