Reproducible Research course repdata-034

Introduction/Housekeeping

In this paper:
Step counts recorded at five minute intervals for a single individual during the months of October and November 2012 are analyzed to answer the questions set by the class assignment, in the order given. Data conversion/massage is necessary because
1. If the time of each interval is read as a number, ggplot2’s plot will interpolate decimal values that should not exist, i.e. between 0:55 and 1:00 it will interpolate values for 0:60, 0:65 … 0:90, 0:95. This would produce a stretched graph with long smooth runs, very false to the original data. Reformatting to date format prevents that.
2. Data were found to be missing for 8 complete days, and per the assignment instructions’ suggestion, replaced with average interval data. This produced drastically shrunken variances, and might be very misleading in a real world problem.
3. Question 4 requires identification of dates as weekend or weekday; a new variable, DayType, with values “Weekday” or “Weekend”, had to be created, and values computed, for that purpose.

Necessary housekeeping:
* library package loading.
* creation of functions: NaRmMean and NaRmMedian, mean and median functions with the default reset so that NAs are ignored, which is the configuration needed in Questions 1 and 2 (also used in Questions 3 and 4 for consistency).

##Library Calls
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(magrittr))
suppressPackageStartupMessages(library(chron))
     ## Establish functions NaRmMean and NaRmMedian
NaRmMean<-function(x){mean(x,na.rm=TRUE)} #mean function that skips NAs
NaRmMedian<-function(x){median(x,na.rm=TRUE)}  #median function that skips NAs

Loading and preprocessing the data

Data were located in the file https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip and were downloaded into a dataset called ActivityData. Preliminary inspection revealed that missing data for steps were confined to eight days that were missing completely.

data loading and preliminary inspection

      ##Acquire data
ActivityZipped<-tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",ActivityZipped,"curl")
      ##Read data into dataset
ActivityData<-data.table()
ActivityData<-read.csv(unz(ActivityZipped,"activity.csv"))
      ##First inspection
summary(ActivityData)
##      steps                date          interval     
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0  
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8  
##  Median :  0.00   2012-10-03:  288   Median :1177.5  
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5  
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2  
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0  
##  NA's   :2304     (Other)   :15840
PercentStepsMissing<-100*sum(is.na(ActivityData$steps))/nrow(ActivityData)

The first summary of ActivityData showed that a value for “steps” was missing in 13.1% of the data, a proportion more than large enough to affect the results of analysis if it were all concentrated in one area. (Unfortunately as will be seen, the suggested remedy did concentrate them in one area).

examination of distribution of missing data

Missing data cases were summarized to determine whether missing values for steps were concentrated in a few places or scattered throughout the data set.

ActivityDataNAs<-data.table()
ActivityDataNAs<-ActivityData[is.na(ActivityData$steps),]
WhichDatesMissingSteps<-as.Date(unique(ActivityDataNAs$date))
NumberDatesMissingSteps<-length(WhichDatesMissingSteps)
WhichDaysMissingSteps<-weekdays(WhichDatesMissingSteps)

Pulling out the rows where the steps value was missing revealed that the record was actually missing all data for 8 complete days. Those dates were 2012-10-01, 2012-10-08, 2012-11-01, 2012-11-04, 2012-11-09, 2012-11-10, 2012-11-14, 2012-11-30, which fall on Monday, Monday, Thursday, Sunday, Friday, Saturday, Wednesday, Friday respectively.

Because missing data were distributed evenly across intervals and roughly proportionately between weekdays and weekends, answers for Questions 1 and 2 would not be severely affected. This also provided a good starting point for Questions 3 and 4.

formatting the variables in ActivityData and creating the DayType variable which will be needed in Question 4.

modify the ActivityData dataset:

  • numeric values for $steps
  • date values for $date
  • treating $interval as a time (so it won’t try to extrapolate meaningless values like “1275”).
  • create a new variable, DayType, a factor with values of “Weekday” or “Weekend”, for use in Question 4.
  • Several auxiliary variables will be created and then eliminated in the process.
ActivityData$steps<-as.numeric(ActivityData$steps) #convert steps
#Convert dates
ActivityData$date<-as.character(ActivityData$date) #convert dates to char from factor
ActivityData$date<-chron(dates=ActivityData$date,format="y-m-d")

##Intervals need to be converted to times so that ggplot2 won't interpolate nonsense values like "1275"
ActivityData$hour<-as.integer(ActivityData$interval/100) #extract hour
ActivityData$minutes<-ActivityData$interval-100*ActivityData$hour #extract minutes
ActivityData$CharacterHour<-as.character(ActivityData$hour) #convert hour to character
ActivityData$CharacterMinutes<-as.character(ActivityData$minutes) #convert minutes to character
Leader<-"0" #to be applied for low values in creating consistent character times
#attach Leader to CharacterHour and CharacterMinute for values <10
for (j in 1:nrow(ActivityData)) {
      if (ActivityData$hour[j]<10) { 
            ActivityData$CharacterHour[j]<-paste(Leader,ActivityData$CharacterHour[j],sep="")
      }
      if (ActivityData$minutes[j]<10) { #attach Leader to CharacterMinutes for values <10
            ActivityData$CharacterMinutes[j]<-paste(Leader,ActivityData$CharacterMinutes[j],sep="")
      }
}
#assemble character-class interval and convert to time
Trailer<-"00" #seconds for h:m:s format
ActivityData$interval<-as.character(paste(ActivityData$CharacterHour,ActivityData$CharacterMinutes,Trailer,sep=":")) #assemble character
ActivityData$interval<-times(ActivityData$interval,out.format="h:m") #convert to time
  
#Establish and fill out DayType variable for later.
ActivityData$DayType<-"Weekday"
ActivityData$DayOWk<-weekdays(ActivityData$date,abbreviate=TRUE)
ActivityData$IsWeekend<-ActivityData$DayOWk=="Sat" | ActivityData$DayOWk=="Sun"
ActivityData$DayType[ActivityData$IsWeekend]<-"Weekend"

#Retain only working variables
ActivityData<-ActivityData[,c(1:3,8)]
rm(Leader,Trailer,j,PercentStepsMissing,WhichDaysMissingSteps,WhichDatesMissingSteps)

Question 1. What is the mean total number of steps taken per day?

This question required creating a derived dataset of daily step counts (TotalStepsByDate), calculating the mean and median daily step counts, and preparing a histogram of daily step counts.

initial grouping and summing of steps by date, exported to dataset TotalStepsByDate

Steps_By_Date<-data.table()
Steps_By_Date<-ActivityData %>% group_by(date)
TotalStepsByDate<-Steps_By_Date%>%summarise_each(funs(sum),steps)

calculate requested statistics

#uses functions defined above to ignore NAs
MeanTotalDailySteps<-NaRmMean(TotalStepsByDate$steps) 
MedianTotalDailySteps<-NaRmMedian(TotalStepsByDate$steps)

Mean total daily steps, ignoring NAs: 10766
Median total daily steps, ignoring NAs: 10765

histogram of daily total steps

Binwidth was chosen for 20 bins by successive experiments; 20 produced a pleasing balance between showing detail and leaving a clear overall impression. 20 bins also corresponded to a fairly intuitive allocation of values: 8 adjacent bins=10,000, 2 adjacent bins= 2500. If distribution were uniform, this would have resulted in about 3 cases per bin, a low ragged line across the diagram.

In fact, though, there was a very clear central peak. That, plus the closeness of mean and median found above suggests a normal or binomial distribution, so I added an estimated density curve (red curve) and a vertical dark green line to mark the mean/median (on the scale of this drawing, they are closer together than a thin linewidth). The resemblance and fit to a Gaussian curve is quite striking.

If I were investigating further I would definitely try to fit this to a normal or binomial distribution.

TotalStepsPlot<-ggplot(TotalStepsByDate) #setting up the histogram
TotalStepsPlot<-TotalStepsPlot + aes(TotalStepsByDate$steps)
TotalStepsPlot<-TotalStepsPlot + geom_histogram(binwidth=1250, 
                                                fill="lightblue",
                                             colour="black",
                                               alpha=.5,
                                             na.rm=TRUE) 
      #binwidth chosen to have about 20 bins, 8 bins=10,000, 
      #about 3 cases/bin for a uniform distribution
TotalStepsPlot<-TotalStepsPlot + geom_density(aes(y=..count..*1250),
                                              colour="red",
                                              fill=NA,
                                              show_guide=TRUE,
                                              na.rm=TRUE)
TotalStepsPlot<-TotalStepsPlot + geom_vline(aes(xintercept=MeanTotalDailySteps),
                                            colour="dark green",
                                            show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + scale_y_continuous(breaks=0:15)
TotalStepsPlot<-TotalStepsPlot + labs(title="Distribution of Total Daily Steps")
TotalStepsPlot<-TotalStepsPlot + labs(x="Total Steps in One Day")
TotalStepsPlot<-TotalStepsPlot + labs(y="Days")

TotalStepsPlot
## Warning: Removed 8 rows containing non-finite values (stat_density).

Given that for this plot, each day is a case, and there are therefore only 61 cases, the plot looks to be in good accord with a normal or binomial distribution; that hypothesis might warrant further investigation.

Question 2. What is the average daily activity pattern?

initial grouping and finding means of steps by interval, exported to AverageIntervalSteps

Steps_By_Interval<-data.table()
Steps_By_Interval<-ActivityData %>% group_by(interval)
AverageIntervalSteps<-Steps_By_Interval%>%summarise_each(funs(NaRmMean),steps)

lineplot time series interval v. average steps

## Xbreaks, which are specifications for x axis, all need to be in class times
begin<-times("00:00:00")
end<-times("23:55:00")
increment<-times("01:30:00")
Xbreaks<-seq(from=begin,to=end,by=increment)
#draw plot
AverageIntervalStepsPlot<-ggplot(AverageIntervalSteps) #setting up the line plot
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + aes(interval,steps)
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + geom_line() #line plot
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + scale_x_continuous(breaks=Xbreaks)
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(title="Average Steps by Interval")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(x="Interval (listed by begin time on 24-hour clock)")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + labs(y="Average number of steps")
AverageIntervalStepsPlot<-AverageIntervalStepsPlot + theme_bw(base_family = 'Helvetica')
AverageIntervalStepsPlot

Maximum average interval steps

MaxAverageIntervalSteps <- max(AverageIntervalSteps$steps)
MaxAverageInterval<-AverageIntervalSteps[AverageIntervalSteps$steps== MaxAverageIntervalSteps,1]

Maximum average interval steps: 206.2
The interval this corresponds to is 0835.

Observable features here include:

  1. the subject is nearly stationary between midnight and about 5:30.
  2. a long brisk run or walk from about 8:30 to 9:30 in the morning (perhaps an exercise routine?)
  3. for most of business hours the steps average between 50 and 125, suggesting a sedentary worker at a desk
  4. activity falls off sharply after about 7:30, suggesting the subject has gone home and finished housework.
    These are all things we might expect of an office worker living in Europe, North America, or other economically developed areas.

Question 3: Imputing missing values (and recalculating total steps mean, median, and histogram)

Taking advantage of the earlier inspection of the data: the only missing data are for steps. They are missing from exactly eight days; on each of those days, all 288 of the 5 minute intervals are missing. Furthermore, the dataset begins with midnight of October 1 and ends at 23:55 of Nov 30 with no skipped days.

Thus, we can take the average-per-interval vector from question 2, which runs from 00:00 to 23:55, and add it only to the days where NAs occur. Since R automatically repeats the recurring vector in a multiplication, multiplying the averages times a logical vector which is TRUE only where there is an NA will result in a corrections vector that has a numeric value of the interval average wherever there is an NA, and zero otherwise.

After changing NAs to zeros and adding, the new interpolated values for $steps will be complete. I chose to round the averages to integers because it gets rid of the absurdity of fractional steps in the data, reduces the very-low-step intervals down to a step value of 0, and has very little effect on higher values.

interpolate values into rows with missing data for steps

MissingStepsSelector<-is.na(ActivityData$steps)
InsertIntervalAverages<-MissingStepsSelector*round(AverageIntervalSteps$steps)
ActivityData$steps[is.na(ActivityData$steps)]<-0
ActivityData$steps<-ActivityData$steps+InsertIntervalAverages

re-run same procedure as in Question 1 to generate new results for comparison

# initial grouping-summing of steps by date, exported to dataset TotalStepsByDate  
Steps_By_Date<-data.table()
Steps_By_Date<-ActivityData %>% group_by(date)
TotalStepsByDate<-Steps_By_Date%>%summarise_each(funs(sum),steps)

calculate requested statistics

#uses functions defined above to ignore NAs
MeanTotalDailySteps<-NaRmMean(TotalStepsByDate$steps) 
MedianTotalDailySteps<-NaRmMedian(TotalStepsByDate$steps)

With interval values interpolated for NAs, mean total daily steps was found to be 10766, quite close to the median total daily steps of 10762. Both these were very close to results found the first time, for reasons discussed below.

histogram of daily total steps

I chose to use exactly the same code to produce the histogram, to make as direct a comparison as possible.

TotalStepsPlot<-ggplot(TotalStepsByDate) #setting up the histogram
TotalStepsPlot<-TotalStepsPlot + aes(TotalStepsByDate$steps)
TotalStepsPlot<-TotalStepsPlot + geom_histogram(binwidth=1250, 
                                                fill="lightblue",
                                             colour="black",
                                               alpha=.5) 
      #binwidth chosen to have about 20 bins, 8 bins=10,000, 
      #about 3 cases/bin for a uniform distribution
TotalStepsPlot<-TotalStepsPlot + geom_density(aes(y=..count..*1250),
                                              colour="red",
                                              fill=NA,
                                              show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + geom_vline(aes(xintercept=MeanTotalDailySteps),
                                            colour="dark green",
                                            show_guide=TRUE)
TotalStepsPlot<-TotalStepsPlot + scale_y_continuous(breaks=0:15)
TotalStepsPlot<-TotalStepsPlot + labs(title="Distribution of Total Daily Steps")
TotalStepsPlot<-TotalStepsPlot + labs(x="Total Steps in One Day")
TotalStepsPlot<-TotalStepsPlot + labs(y="Days")

TotalStepsPlot

comparison of the two histograms

Obviously kurtosis (spikiness) of the distribution has become much more acute with the addition of the interpolated values. It is quite noticeable how much farther the central column sticks out above the peak in the red distribution curve.

The histogram still looks roughly like it might be a normal or a
binomial distribution, butthe peak is proportionately higher, and the standard deviation smaller.

Yet the mean and median are almost the same; the overall center didn’t change much, but the distribution changed fairly drastically.

This is because missing values were concentrated in 8 days, and because the average for each interval was used, each of those 8 days received the same interpolated values. This in turn assigned all eight missing days the same total steps, so all 8 missing days were added to the central column. Inspection shows, in fact, that in the original histogram, there were 12 values in the central column range, corresponding to the range 10,000-11,250, and in the new graph there are 20, with no other column of the histogram changed. To further verify this, I checked the sum of the interval averages:

MissingDataIntervalAverageSum<-sum(AverageIntervalSteps$steps)

The result was 10766, which of course is between 10,000 and 11,250. So the actual effect of this method of interpolation was merely to exaggerate the central tendency in the data.

Question 4: Are there differences in activity patterns between weekdays and weekends?

First we have to regroup the dataset on two variables: interval and DayType.

StepsXIntDayType<-data.table()
StepsXIntDayType<-ActivityData %>% group_by(interval,DayType)
AverageIntervalDayTypeSteps<-StepsXIntDayType%>%summarise_each(funs(NaRmMean),steps)

drawing the requested plots of interval average steps for weekends and weekdays

##Xbreaks, which are specifications for x axis, all need to be in class times
begin<-times("00:00:00")
end<-times("23:55:00")
increment<-times("01:30:00")
Xbreaks<-seq(from=begin,to=end,by=increment)
#draw plot
AvgIntDTStpsPlot<-ggplot(AverageIntervalDayTypeSteps) #setting up the line plot
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + aes(interval,steps)
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + geom_line() #line plot
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + scale_x_continuous(breaks=Xbreaks)
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(title="Average Steps by Interval (NAs have been interpolated)")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(x="Interval (listed by begin time on 24-hour clock)")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + labs(y="Average number of steps")
AvgIntDTStpsPlot<-AvgIntDTStpsPlot + facet_grid(DayType ~ .)
AvgIntDTStpsPlot

observations about average steps (interpolated) versus interval and DayType

  1. Overall, on weekends, the subject is more active across the whole day.
  2. Subject starts his/her heavy activity somewhat later, after a much less active waking up and getting ready period.
  3. The morning heavy workout period on weekends is only about 3/4s as intense on weekends as it is during the week.
  4. Nevertheless weekends are much more active than weekdays, because on weekends:
    1. Subject does not become inactive till well after 9 pm (as opposed to 7:30 during week).
    2. Subject has 2-3 other bursts of activity almost as big as the morning peak during weekend afternoons.