Gyroscope Data

See rendered Rmd file at Rpubs

Loading and preprocessing the data

The first step is to read the data into R with read.csv and apply the transformations needed to create a usable data frame.

First, load the libraries Rmisc, dplyr, tidyr, chron, ggplot2, and mice. (NOTE: Rmisc uses plyr, which needs to be loaded before dplyr for dplyr::summarize to work.)
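The load order described above can be sketched as follows. This is a minimal sketch that assumes all of the packages are installed; attaching plyr explicitly first is an added precaution, not something the original script necessarily did.

```r
# Attach plyr (used by Rmisc) before dplyr so that dplyr's verbs
# mask plyr's functions of the same name, not the other way round.
library(plyr)
library(Rmisc)
library(dplyr)
library(tidyr)
library(chron)
library(ggplot2)
library(mice)

# Packages attached later sit earlier on the search path,
# so dplyr should appear before plyr here.
pos_dplyr <- match("package:dplyr", search())
pos_plyr  <- match("package:plyr", search())
```

Writing dplyr::summarize() explicitly removes any dependence on load order.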
Next, read the data with read.csv into a data frame named activity, using colClasses to convert the date column to class Date through the user-defined myDate class.

setClass('myDate') 
setAs("character","myDate",
      function(from) as.Date(from, format="%Y-%m-%d"))

activity <- read.csv('activity.csv',
                     stringsAsFactors = FALSE,
                     colClasses = c('numeric','myDate','numeric'))
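Because the dates are already in ISO format (%Y-%m-%d), the built-in "Date" value of colClasses is likely sufficient and avoids the custom class. The sketch below uses a small inline CSV with made-up values rather than activity.csv.

```r
# Hypothetical two-row sample with the same column layout as activity.csv
csv_text <- "steps,date,interval\nNA,2012-10-01,0\n7,2012-10-02,5"

# colClasses = "Date" converts ISO-formatted strings with as.Date()
toy <- read.csv(text = csv_text,
                colClasses = c("numeric", "Date", "numeric"))

class(toy$date)
```

The myDate approach remains useful when the dates are stored in a non-ISO format that needs an explicit format string.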

Data with missing values

#Get Timeseries data
ts <- activity %>%
    select(date,steps)%>%
    group_by(date)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#Histogram 
hist(ts$totalSteps, main = "Total Steps Taken Each Day", col = "gray", labels = TRUE)

#Mean and Median
summary(ts)

##       date               avgSteps         totalSteps   
##  Min.   :2012-10-01   Min.   : 0.1424   Min.   :   41  
##  1st Qu.:2012-10-16   1st Qu.:30.6979   1st Qu.: 8841  
##  Median :2012-10-31   Median :37.3785   Median :10765  
##  Mean   :2012-10-31   Mean   :37.3826   Mean   :10766  
##  3rd Qu.:2012-11-15   3rd Qu.:46.1597   3rd Qu.:13294  
##  Max.   :2012-11-30   Max.   :73.5903   Max.   :21194  
##                       NA's   :8         NA's   :8

From the above, we can see that the mean steps per day is 10,766 and the median steps per day is 10,765.
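The same figures can be computed directly instead of being read off summary(). The sketch below uses a toy data frame with made-up values, in which one day is entirely missing; it assumes dplyr is installed.

```r
library(dplyr)

# Toy data (hypothetical values): 2012-10-02 has no recorded steps
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  steps = c(10, 30, NA, NA, 0, 50)
)

# sum() returns NA for any day that contains a missing value
daily <- toy %>%
  group_by(date) %>%
  dplyr::summarize(totalSteps = sum(steps))

# na.rm = TRUE drops the NA days, as summary() does when it reports NA's
mean_total   <- mean(daily$totalSteps, na.rm = TRUE)
median_total <- median(daily$totalSteps, na.rm = TRUE)
```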

Data with missing values

omitData <- na.omit(activity)
ranked <- omitData %>%
    select(interval,steps) %>%
    group_by(interval) %>%
    summarize(averageSteps=mean(steps)) %>%
    arrange(desc(averageSteps))

ggplot(ranked, aes(x=interval,y=averageSteps)) + geom_line(stat="identity")

topInt <- ranked[which.max(ranked$averageSteps),]

print(topInt)

## # A tibble: 1 × 2
##   interval averageSteps
##      <dbl>        <dbl>
## 1      835     206.1698
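To read the peak interval as a time of day, the identifier can be split into hours and minutes. This sketch assumes the interval column encodes clock time as HHMM (so 835 means 08:35), which the text above does not state explicitly.

```r
# Peak interval reported above
interval <- 835L

# Assumption: the first digits are the hour, the last two the minute
clock_time <- sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
clock_time
```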


Imputing missing values

Sum missing values

#Number of Missing Values
x <- sum(is.na(activity))

The number of missing values is 2304.
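sum(is.na(activity)) counts missing cells across every column. A per-column count shows where they are concentrated; the sketch below uses a toy data frame with made-up values.

```r
# Toy data (hypothetical values) with the same columns as activity
toy <- data.frame(
  steps    = c(NA, 5, NA, 12),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5)
)

total_missing     <- sum(is.na(toy))      # all columns combined
missing_by_column <- colSums(is.na(toy))  # one count per column
missing_by_column
```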

Impute the missing values using the mice package, setting a seed to ensure reproducibility. mice() returns a mids object, which complete() then turns into a data frame.

#Impute missing values
# -convert the activity$date column back to a 'character' class for imputation 
# -use mice::complete() to fill dataset using new imputed data
# -convert the date variable back to a Date class (mice needed a character class to impute)
activity$date <- as.character(activity$date)
imputedData <- mice(activity, m=5, maxit=50, method='pmm', seed=500)
# call mice::complete() explicitly because tidyr also exports complete()
newData <- mice::complete(imputedData, 2)
newData$date <- as.Date(newData$date)
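The mids-then-complete workflow can be tried on the nhanes example data that ships with mice. This is a sketch on that example data, not on activity.csv; the second argument of complete() selects which of the m imputed data sets to return.

```r
library(mice)

# nhanes is a small example data set bundled with mice; it contains NAs
imp <- mice(nhanes, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

# Take the second of the five imputed data sets
second <- mice::complete(imp, 2)

# Imputation fills the gaps and leaves observed values unchanged
observed <- !is.na(nhanes$bmi)
unchanged <- isTRUE(all.equal(second$bmi[observed], nhanes$bmi[observed]))
```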

Data with missing values filled in.

#Get Timeseries data
ts2 <- newData %>%
    select(date,steps)%>%
    group_by(date)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#Histogram 
hist(ts2$totalSteps, main = "Total Steps Taken Each Day", col = "green", labels = TRUE)

#Mean and Median
summary(ts2)

##       date               avgSteps         totalSteps   
##  Min.   :2012-10-01   Min.   : 0.1424   Min.   :   41  
##  1st Qu.:2012-10-16   1st Qu.:30.6979   1st Qu.: 8841  
##  Median :2012-10-31   Median :37.3785   Median :10765  
##  Mean   :2012-10-31   Mean   :37.1062   Mean   :10687  
##  3rd Qu.:2012-11-15   3rd Qu.:44.4826   3rd Qu.:12811  
##  Max.   :2012-11-30   Max.   :73.5903   Max.   :21194

From the above, we can see that the mean steps per day is 10,687 and the median steps per day is 10,765.

Conclusion: imputing the missing values reduced the mean from 10,766 to 10,687 steps per day. The median remained the same at 10,765.

Data with missing values filled in.

The next step is to create a factor variable that identifies whether the date of a record falls on a weekend or a weekday. An anonymous function that calls is.weekend() from the chron package is passed to sapply over the date column, creating a new variable WeekPart that labels each date as either WEEKEND or WEEKDAY. This variable is used for the plots that follow.

#create a weekday variable for analysis purposes if needed
#create a WeekPart variable by passing an anonymous function to identify
#     if it's a WEEKEND or WEEKDAY using the chron::is.weekend() fx
activity <- newData %>%
    mutate(dayofweek = as.character(weekdays(date))) %>%
    mutate(WeekPart = as.factor(sapply(date,function(x){if(is.weekend(x))
                                           {"WEEKEND"}else{"WEEKDAY"}})))

#Get Timeseries data
d <- activity %>%
    select(date,interval,WeekPart,steps)%>%
    group_by(interval,WeekPart)%>%
    summarize(avgSteps=mean(steps),
              totalSteps=sum(steps))

#plot timeseries

p <- ggplot(d,aes(x=interval,y=avgSteps))+
    geom_line(stat="identity",color="purple",alpha=0.5)+
    ggtitle(label = "Average Steps by Interval by Week Part")+
    facet_grid(WeekPart ~ .)

plot(p)
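The WeekPart classification created above can also be sketched in base R without chron. This is an alternative technique, not the method used in this analysis; it relies on as.POSIXlt(), whose wday field does not depend on the locale.

```r
# 2012-10-05 is a Friday, 2012-10-08 a Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday runs from 0 (Sunday) to 6 (Saturday)
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Vectorised labelling, so no sapply() is needed
WeekPart <- factor(ifelse(is_weekend, "WEEKEND", "WEEKDAY"))
WeekPart
```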


Neil Kutty

July 16, 2016
