See the rendered Rmd file at RPubs.

Loading and preprocessing the data

The first step is to read the data into R with read.csv and apply the transformations needed to create a usable data frame.

  1. First, load the libraries Rmisc, dplyr, tidyr, chron, ggplot2, and mice. (Note: Rmisc uses plyr, which needs to be loaded before dplyr for dplyr::summarize to work.)
  2. Next, read the data with read.csv into a data frame named activity, using colClasses to parse the date variable as a Date through the user-defined myDate class.
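# Rmisc attaches plyr, so it must be loaded before dplyr
library(Rmisc)
library(dplyr)
library(tidyr)
library(chron)
library(ggplot2)
library(mice)
setClass('myDate')  # marker class, used only as a colClasses target
setAs("character","myDate",
      function(from) as.Date(from, format="%Y-%m-%d"))

activity <- read.csv('activity.csv',
                     stringsAsFactors = FALSE,
                     colClasses = c('numeric','myDate','numeric'))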
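The converter-class technique above can be tried without the real file. The following is a minimal sketch, assuming a three-row inline CSV with the same column layout; the class name isoDate and the sample rows are hypothetical and exist only for illustration.

```r
library(methods)

# Hypothetical converter class mirroring the myDate approach above
setClass("isoDate")
setAs("character", "isoDate",
      function(from) as.Date(from, format = "%Y-%m-%d"))

# Made-up rows in the same steps/date/interval layout (not real data)
csv_text <- paste(
  "steps,date,interval",
  "NA,2012-10-01,0",
  "12,2012-10-02,5",
  "0,2012-10-02,10",
  sep = "\n"
)

# read.csv converts the date column through the setAs() method
demo <- read.csv(text = csv_text,
                 colClasses = c("numeric", "isoDate", "numeric"))

str(demo)
class(demo$date)
```

A custom class is mainly useful when the date format is non-standard. For dates already in ISO format, passing "Date" in colClasses should give the same result.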



Data with missing values
#Get Timeseries data
ts <- activity %>%
    select(date,steps)%>%
    group_by(date)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#Histogram 
hist(ts$totalSteps, main = "Total Steps Taken Each Day", col = "gray", labels = TRUE)


#Mean and Median
summary(ts)
##       date               avgSteps         totalSteps   
##  Min.   :2012-10-01   Min.   : 0.1424   Min.   :   41  
##  1st Qu.:2012-10-16   1st Qu.:30.6979   1st Qu.: 8841  
##  Median :2012-10-31   Median :37.3785   Median :10765  
##  Mean   :2012-10-31   Mean   :37.3826   Mean   :10766  
##  3rd Qu.:2012-11-15   3rd Qu.:46.1597   3rd Qu.:13294  
##  Max.   :2012-11-30   Max.   :73.5903   Max.   :21194  
##                       NA's   :8         NA's   :8

From the above, we can see that the mean total steps per day is 10,766 and the median total steps per day is 10,765. The 8 days reported under NA's have no total and are excluded from these statistics.
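Instead of reading the mean and median off the summary() table, they can be computed directly. This is a minimal sketch on a made-up three-day data frame (the values are hypothetical), written in base R so it runs without dplyr.

```r
# Toy data: day 1 is entirely missing, days 2 and 3 are observed
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03")),
  steps = c(NA, NA, 10, 30, 0, 20)
)

# Daily totals; a day containing any NA gets an NA total, as with sum(steps) above
daily_total <- tapply(toy$steps, toy$date, sum)

# na.rm = TRUE drops the NA days, matching what summary() does
mean_total   <- mean(daily_total, na.rm = TRUE)
median_total <- median(daily_total, na.rm = TRUE)

print(daily_total)
print(c(mean = mean_total, median = median_total))
```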




Data with missing values
omitData <- na.omit(activity)
ranked <- omitData %>%
    select(interval,steps) %>%
    group_by(interval) %>%
    summarize(averageSteps=mean(steps)) %>%
    arrange(desc(averageSteps))

ggplot(ranked, aes(x=interval,y=averageSteps)) + geom_line(stat="identity")


topInt <- ranked[which.max(ranked$averageSteps),]

print(topInt)
## # A tibble: 1 × 2
##   interval averageSteps
##      <dbl>        <dbl>
## 1      835     206.1698
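The peak lookup above does not depend on the rows being sorted, since which.max() finds the maximum on its own. The sketch below shows this on a toy table. Only the 835 row is taken from the output above; the other rows are made up. The helper to_clock() is hypothetical and assumes the interval identifier encodes clock time as HHMM, which this document does not state.

```r
# Toy per-interval averages; only the 835 row comes from the output above
interval_means <- data.frame(
  interval     = c(0, 830, 835, 840, 2355),
  averageSteps = c(2, 150, 206.1698, 180, 1)
)

# which.max() returns the position of the largest value; no sorting needed
peak <- interval_means[which.max(interval_means$averageSteps), ]

# Hypothetical helper: assumes the identifier is HHMM (an assumption)
to_clock <- function(interval) {
  sprintf("%02d:%02d",
          as.integer(interval %/% 100),
          as.integer(interval %% 100))
}

print(peak)
print(to_clock(peak$interval))
```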



Imputing missing values

Sum missing values

#Number of Missing Values
x <- sum(is.na(activity))

The number of missing values is 2304.
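sum(is.na(activity)) counts missing cells across every column at once. To see which columns the missing values sit in, a per-column count helps. This is a minimal sketch on a made-up four-row data frame.

```r
# Toy data frame with the same column layout (values are made up)
toy <- data.frame(
  steps    = c(NA, 5, NA, 0),
  date     = as.Date(c("2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5)
)

total_missing  <- sum(is.na(toy))        # all columns combined
missing_by_col <- colSums(is.na(toy))    # one count per column
share_missing  <- mean(is.na(toy$steps)) # fraction of missing step records

print(total_missing)
print(missing_by_col)
print(share_missing)
```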

Impute the missing values using the mice package, setting a seed to make the result reproducible. mice() creates a mids object, which complete() then turns into a filled-in data frame.

#Impute missing values
# - convert the activity$date column to a 'character' class for imputation
# - use mice::complete() to fill the dataset using the new imputed data
#   (namespaced because tidyr also exports a complete() function)
# - convert the date variable back to a date class (mice needed a character class to impute)
activity$date <- as.character(activity$date)
imputedData <- mice(activity, m=5, maxit=50, method = 'pmm', seed = 500)
newData <- mice::complete(imputedData,2)
newData$date <- as.Date(newData$date)
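To illustrate why the seed matters, here is a deliberately simplified stand-in: a random-donor imputation in base R. This is not predictive mean matching and not what mice does internally. The function impute_random_donor() and the sample vector are hypothetical, and the sketch only shows that a fixed seed makes a random imputation repeatable.

```r
# Hypothetical helper: fill each NA with a randomly drawn observed value
impute_random_donor <- function(x, seed) {
  set.seed(seed)                      # fix the random draw
  missing <- is.na(x)
  donors  <- x[!missing]
  # sample.int() on positions avoids the sample() pitfall for length-1 vectors
  picks <- sample.int(length(donors), sum(missing), replace = TRUE)
  x[missing] <- donors[picks]
  x
}

steps <- c(0, NA, 12, 7, NA, 30)      # made-up values

first  <- impute_random_donor(steps, seed = 500)
second <- impute_random_donor(steps, seed = 500)

print(identical(first, second))       # same seed, same imputed values
print(anyNA(first))                   # no missing values remain
```

Without the seed, each run could draw different donors, and the downstream mean and histogram would shift from run to run.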
Data with missing values filled in.
#Get Timeseries data
ts2 <- newData %>%
    select(date,steps)%>%
    group_by(date)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#Histogram 
hist(ts2$totalSteps, main = "Total Steps Taken Each Day", col = "green", labels = TRUE)


#Mean and Median
summary(ts2)
##       date               avgSteps         totalSteps   
##  Min.   :2012-10-01   Min.   : 0.1424   Min.   :   41  
##  1st Qu.:2012-10-16   1st Qu.:30.6979   1st Qu.: 8841  
##  Median :2012-10-31   Median :37.3785   Median :10765  
##  Mean   :2012-10-31   Mean   :37.1062   Mean   :10687  
##  3rd Qu.:2012-11-15   3rd Qu.:44.4826   3rd Qu.:12811  
##  Max.   :2012-11-30   Max.   :73.5903   Max.   :21194

From the above, we can see that the mean steps per day is 10,687 and the median steps per day is 10,765.

Conclusion: imputing the missing values lowered the mean of the daily totals from 10,766 to 10,687. The median remained the same at 10,765.
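The size of the shift can be made explicit with a little arithmetic on the figures reported in the two summaries. A minimal sketch, using only those four rounded values:

```r
# Rounded values as reported by summary(ts) and summary(ts2)
before <- c(mean = 10766, median = 10765)
after  <- c(mean = 10687, median = 10765)

change     <- after - before            # absolute change in steps per day
pct_change <- 100 * change / before     # relative change in percent

print(change)
print(round(pct_change, 2))
```

The mean drops by 79 steps per day, which is under one percent of its original value.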




Data with missing values filled in.
  • The next step is to create a factor variable that identifies whether the date of each record falls on a weekend or a weekday. We pass an anonymous function that wraps the chron function is.weekend() to sapply over the date column, creating a new variable WeekPart that labels each date as either WEEKEND or WEEKDAY. This variable is used for the plot that follows.
#create a weekday variable for analysis purposes if needed
#create a WeekPart variable by passing an anonymous function to identify
#     if it's a WEEKEND or WEEKDAY using the chron::is.weekend() fx
activity <- newData %>%
    mutate(dayofweek = as.character(weekdays(date))) %>%
    mutate(WeekPart = as.factor(sapply(date,function(x){if(is.weekend(x))
                                           {"WEEKEND"}else{"WEEKDAY"}})))

#Get Timeseries data
d <- activity %>%
    select(date,interval,WeekPart,steps)%>%
    group_by(interval,WeekPart)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#plot timeseries

p <- ggplot(d,aes(x=interval,y=avgSteps))+
    geom_line(stat="identity",color="purple",alpha=0.5)+
    ggtitle(label = "Average Steps by Interval by Week Part")+
    facet_grid(WeekPart ~ .)

plot(p)
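The WeekPart classification above can also be sketched without chron or sapply. The version below is an alternative in base R, not the method used in this analysis. It relies on as.POSIXlt()$wday, which numbers days from 0 (Sunday) to 6 (Saturday) and does not depend on the locale. The four sample dates are chosen for illustration.

```r
# Friday through Monday (2012-10-01 was a Monday)
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Vectorised labelling, no per-element function call needed
week_part <- factor(ifelse(wday %in% c(0, 6), "WEEKEND", "WEEKDAY"),
                    levels = c("WEEKDAY", "WEEKEND"))

print(data.frame(date = dates, WeekPart = week_part))
```

Because the labelling is vectorised, it avoids calling a function once per row, which can matter on the full data set.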