Loading and preprocessing the data
The first step is to read the data into R with read.csv and apply the transformations needed to create a usable data frame.
Load the required packages: Rmisc, dplyr, tidyr, chron, ggplot2, and mice. (NOTE: package Rmisc uses plyr, which needs to be loaded before dplyr for dplyr::summarize to work.)
Read the data with read.csv into a data frame named activity, using colClasses to set the class of the date variable to Date through the user-defined myDate class.
setClass('myDate')
setAs("character", "myDate",
      function(from) as.Date(from, format = "%Y-%m-%d"))
activity <- read.csv('activity.csv',
                     stringsAsFactors = FALSE,
                     colClasses = c('numeric', 'myDate', 'numeric'))
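The colClasses idiom above relies on read.csv coercing a column through methods::as() when it is given a class name it does not recognise. A minimal, self-contained sketch of the same idea follows. The class name isoDate and the small stand-in file are assumptions made for illustration, not part of the original analysis.

```r
library(methods)  # setClass() and setAs() live here

# Register a character -> Date coercion under a custom class name.
# "isoDate" is a hypothetical name chosen for this sketch.
setClass("isoDate")
setAs("character", "isoDate",
      function(from) as.Date(from, format = "%Y-%m-%d"))

# Write a tiny stand-in file so the example does not need activity.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("steps,date,interval",
             "NA,2012-10-01,0",
             "12,2012-10-02,5"), tmp)

# read.csv applies the registered coercion to the second column.
demo <- read.csv(tmp, colClasses = c("numeric", "isoDate", "numeric"))
str(demo)
```

An alternative with the same result is to read the column as character and call as.Date() on it afterwards; the setAs route simply keeps the conversion inside the read.csv call.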
#Get Timeseries data
ts <- activity %>%
  select(date, steps) %>%
  group_by(date) %>%
  summarize(avgSteps = mean(steps),
            totalSteps = sum(steps))
#Histogram
hist(ts$totalSteps, main = "Total Steps Taken Each Day", col = "gray", labels = TRUE)
#Mean and Median
summary(ts)
## date avgSteps totalSteps
## Min. :2012-10-01 Min. : 0.1424 Min. : 41
## 1st Qu.:2012-10-16 1st Qu.:30.6979 1st Qu.: 8841
## Median :2012-10-31 Median :37.3785 Median :10765
## Mean :2012-10-31 Mean :37.3826 Mean :10766
## 3rd Qu.:2012-11-15 3rd Qu.:46.1597 3rd Qu.:13294
## Max. :2012-11-30 Max. :73.5903 Max. :21194
## NA's :8 NA's :8
From the above, we can see that the mean steps per day is 10,766 and the median steps per day is 10,765.
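The summary output reports NA's :8, so eight days have no daily total, and the mean and median are taken over the remaining days only. The toy example below (invented numbers, not the activity data) sketches why: sum() without na.rm returns NA for any day containing a missing value, and summary statistics then have to drop those days.

```r
# Toy data: day 1 is entirely missing, days 2 and 3 are complete.
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  steps = c(NA, NA, 10, 30, 20, 60)
)

# Daily totals; a day with any NA gets an NA total.
daily <- tapply(toy$steps, toy$date, sum)
print(daily)

# Mean and median over the days that do have a total.
mean_total   <- mean(daily, na.rm = TRUE)
median_total <- median(daily, na.rm = TRUE)
n_missing    <- sum(is.na(daily))
```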
omitData <- na.omit(activity)
ranked <- omitData %>%
  select(interval, steps) %>%
  group_by(interval) %>%
  summarize(averageSteps = mean(steps)) %>%
  arrange(desc(averageSteps))
ggplot(ranked, aes(x=interval,y=averageSteps)) + geom_line(stat="identity")
topInt <- ranked[which.max(ranked$averageSteps),]
print(topInt)
## # A tibble: 1 × 2
## interval averageSteps
## <dbl> <dbl>
## 1 835 206.1698
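To read the peak interval as a time of day, the interval identifier can be split into hours and minutes. This sketch assumes the identifiers encode clock time as HHMM (so 835 would be 08:35); that encoding is an assumption about the dataset, and interval_to_clock is a hypothetical helper name.

```r
# Convert HHMM-style interval identifiers to "HH:MM" strings.
# Assumes identifiers such as 0, 5, ..., 835, ..., 2355.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock <- interval_to_clock(c(0, 835, 2355))
print(clock)
```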
Imputing missing values
Sum missing values
#Number of Missing Values
x <- sum(is.na(activity))
The number of missing values is 2304.
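sum(is.na(activity)) counts missing cells across every column at once. A per-column breakdown shows where they sit; the sketch below uses a small invented data frame rather than the activity data. If the intervals are five minutes long (288 per day), 2304 missing values would be consistent with the eight fully missing days seen earlier (8 × 288 = 2304).

```r
# Toy data frame with missing values only in the steps column.
toy <- data.frame(
  steps    = c(NA, 5, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

total_na   <- sum(is.na(toy))        # missing cells in the whole data frame
na_by_col  <- colSums(is.na(toy))    # missing cells per column
share_na   <- mean(is.na(toy$steps)) # proportion of steps that is missing

print(na_by_col)
```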
Impute the missing values using the mice package, setting a seed for reproducibility. mice() returns a mids object, which complete() then turns into a filled-in data frame.
#Impute missing values
# -convert the activity$date column back to a 'character' class for imputation
# -use mice::complete() to fill dataset using new imputed data
# -convert the date variable back to a Date class (mice needed a character class to impute)
activity$date <- as.character(activity$date)
imputedData <- mice(activity, m=5, maxit=50, meth= 'pmm', seed = 500)
newData <- complete(imputedData,2)
newData$date <- as.Date(newData$date)
#Get Timeseries data
ts2 <- newData %>%
  select(date, steps) %>%
  group_by(date) %>%
  summarize(avgSteps = mean(steps),
            totalSteps = sum(steps))
#Histogram
hist(ts2$totalSteps, main = "Total Steps Taken Each Day", col = "green", labels = TRUE)
#Mean and Median
summary(ts2)
## date avgSteps totalSteps
## Min. :2012-10-01 Min. : 0.1424 Min. : 41
## 1st Qu.:2012-10-16 1st Qu.:30.6979 1st Qu.: 8841
## Median :2012-10-31 Median :37.3785 Median :10765
## Mean :2012-10-31 Mean :37.1062 Mean :10687
## 3rd Qu.:2012-11-15 3rd Qu.:44.4826 3rd Qu.:12811
## Max. :2012-11-30 Max. :73.5903 Max. :21194
From the above, we can see that the mean steps per day is 10,687 and the median steps per day is 10,765.
Conclusion: imputing the missing values lowered the mean of the daily totals from 10,766 to 10,687 steps. The median remained the same at 10,765.
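The size of the shift can be put in relative terms with a short calculation that uses only the two means reported above:

```r
mean_before <- 10766  # mean daily total before imputation
mean_after  <- 10687  # mean daily total after imputation

abs_change <- mean_after - mean_before          # change in steps per day
pct_change <- 100 * abs_change / mean_before    # change as a percentage

print(abs_change)
print(round(pct_change, 2))
```

The mean drops by 79 steps per day, which is a little under 1% of the original value.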
Create a factor variable to identify whether the date of a record falls on a weekend or a weekday. Using the chron package function is.weekend(), pass an anonymous function to sapply on the date column, creating a new variable WeekPart that labels each date as either WEEKEND or WEEKDAY. This will be used for the plots in later questions.
#create a weekday variable for analysis purposes if needed
#create a WeekPart variable by passing an anonymous function to identify
# if it's a WEEKEND or WEEKDAY using the chron::is.weekend() fx
activity <- newData %>%
  mutate(dayofweek = as.character(weekdays(date))) %>%
  mutate(WeekPart = as.factor(sapply(date, function(x) {
    if (is.weekend(x)) "WEEKEND" else "WEEKDAY"
  })))
#Get Timeseries data
d <- activity %>%
  select(date, interval, WeekPart, steps) %>%
  group_by(interval, WeekPart) %>%
  summarize(avgSteps = mean(steps),
            totalSteps = sum(steps))
#plot timeseries
p <- ggplot(d, aes(x = interval, y = avgSteps)) +
  geom_line(stat = "identity", color = "purple", alpha = 0.5) +
  ggtitle(label = "Average Steps by Interval by Week Part") +
  facet_grid(WeekPart ~ .)
plot(p)
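The WEEKEND/WEEKDAY labelling used for this plot can also be done without sapply or chron. The sketch below is a base R alternative, not the method used in this report: it relies on the wday field of POSIXlt (0 = Sunday, 6 = Saturday), which does not depend on the locale, and applies it to a few example dates.

```r
# Example dates: a Friday, Saturday, Sunday and Monday in October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday and 6 for Saturday, regardless of locale.
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Vectorised labelling; no per-element function call is needed.
week_part <- factor(ifelse(is_weekend, "WEEKEND", "WEEKDAY"),
                    levels = c("WEEKDAY", "WEEKEND"))
print(data.frame(date = dates, WeekPart = week_part))
```

Because the whole date column is handled in one vectorised call, this scales better than calling a function once per row.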