This is an exercise to write up a simple data analysis using R Markdown.
Before we start, we need to set the working directory. I haven’t yet found a good process for doing this automatically, so let’s do it manually. To do this in RStudio, open this file and then choose:
Session | Set Working Directory | To Source File Location
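For reference, here are two code-based alternatives that might work; neither is used in this analysis, and both are assumptions about your setup rather than tested steps:

# 1. In an interactive RStudio session (assumes the rstudioapi package is installed):
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# 2. When knitting, set the knitr root directory in a setup chunk instead:
knitr::opts_knit$set(root.dir = "path/to/source/dir")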
Load the activity data from the current directory:
activity <- read.csv(unzip("activity.zip", "activity.csv", overwrite=TRUE))
Look at the structure of the data (number of observations, number of variables, type of each variable, etc.)
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
We see that the time of day of a measurement is reported in the ‘interval’ variable. At first glance, the interval looks like an integer number of minutes (e.g., 5, 10, 15, etc.):
head(activity$interval)
## [1] 0 5 10 15 20 25
But looking at values at the end of the day, we see it’s really an hour-and-minute identifier of the form ‘hhmm’:
tail(activity$interval)
## [1] 2330 2335 2340 2345 2350 2355
Let’s create a fixed-length character string, with leading zeros for the early hours, that we can use for time-of-day analysis:
activity$hhmm <- sprintf("%04d", activity$interval)
head(activity$hhmm)
## [1] "0000" "0005" "0010" "0015" "0020" "0025"
Now, create a categorical variable that tells whether the day is a weekday or a weekend day. Using the lubridate package:

* Use ymd() to convert the given date string into a proper date object
* Use the wday() function to get the day of the week as a number 1-7, with Sunday as 1
* Categorize days 1 and 7 as weekend and the others as weekday
library(lubridate)
day_of_week <- wday(ymd(activity$date))
# Helper function to map day of week to our category values
# There has to be an easier way to do this
f <- function(i) {
  day_to_category <- c('weekend',
                       'weekday',
                       'weekday',
                       'weekday',
                       'weekday',
                       'weekday',
                       'weekend')
  return (day_to_category[i])
}
activity$time_of_week <- sapply(day_of_week, f)
head(activity$time_of_week)
## [1] "weekday" "weekday" "weekday" "weekday" "weekday" "weekday"
Here’s a summary of the final activity data:
str(activity)
## 'data.frame': 17568 obs. of 5 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval : int 0 5 10 15 20 25 30 35 40 45 ...
## $ hhmm : chr "0000" "0005" "0010" "0015" ...
## $ time_of_week: chr "weekday" "weekday" "weekday" "weekday" ...
Create a table (steps_per_day) with the total number of steps taken each day. To do this, use the dplyr package to group the activity table by date and sum the number of steps taken on each day:
suppressPackageStartupMessages(library(dplyr))
steps_per_day <- activity %>% group_by(date) %>% summarise(steps = sum(steps))
head(steps_per_day)
## Source: local data frame [6 x 2]
##
## date steps
## 1 2012-10-01 NA
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-10-04 12116
## 5 2012-10-05 13294
## 6 2012-10-06 15420
Let’s look at the distribution of steps taken in a day:
hist(steps_per_day$steps,
     xlab = "Number of steps taken in a day",
     main = "Histogram of number of steps taken in a day")
Compute the mean and median of the steps taken per day, ignoring NA values:
mean_steps_per_day <- mean(steps_per_day$steps, na.rm = TRUE)
median_steps_per_day <- median(steps_per_day$steps, na.rm = TRUE)
For fun, let’s report the mean and median in-line in the text:
There was a small bit of hidden code to format the numbers for display: round the mean to 2 decimal places and use commas in both numbers. Here is the code for those lines:
* The mean number of steps per day is ` r format(round(mean_steps_per_day, 2), big.mark=',')`
* The median steps per day is ` r format(median_steps_per_day, big.mark=',')`
Compute the average number of steps taken in each 5-min interval across all days, ignoring NA values:
steps_per_interval <- activity %>% group_by(hhmm) %>% summarise(steps = mean(steps, na.rm=TRUE))
head(steps_per_interval)
## Source: local data frame [6 x 2]
##
## hhmm steps
## 1 0000 1.7169811
## 2 0005 0.3396226
## 3 0010 0.1320755
## 4 0015 0.1509434
## 5 0020 0.0754717
## 6 0025 2.0943396
Let’s plot that time series. To get a pretty x-axis label, convert the hhmm variable to a real time stamp using the strptime() function:
plot(x = strptime(steps_per_interval$hhmm, format = '%H%M'),
     y = steps_per_interval$steps,
     type = 'l',
     main = "Average steps taken over a day",
     xlab = "Time of day (hh:mm)",
     ylab = "Number of steps")
Use the which.max() function to find the index of the row that has the maximum number of steps. Extract that row from the data:
steps_per_interval[which.max(steps_per_interval$steps),]
## Source: local data frame [1 x 2]
##
## hhmm steps
## 1 0835 206.1698
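For readability, that winning hhmm value can be turned into an hh:mm clock label; a minimal sketch (not part of the original analysis):

# Format the interval with the highest average as hh:mm
max_interval <- steps_per_interval$hhmm[which.max(steps_per_interval$steps)]
format(strptime(max_interval, format = '%H%M'), '%H:%M')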
Let’s identify and fill in missing data, then revisit the analysis of the total number of steps per day.
The number of missing values can be computed using the complete.cases() function. Count the number of rows that are not complete (i.e., have an NA value in some column):
number_of_rows <- nrow(activity)
number_of_nas <- sum(!complete.cases(activity))
number_of_nas
## [1] 2304
There are 2304 missing values out of 17568 observations.
We will deal with missing values in the following way: replace each missing steps value with the average number of steps for that 5-minute interval across all days (the steps_per_interval table computed above).
So, for example, if a value is missing for 10:15 am on a given Sunday, we will substitute the average number of steps computed for 10:15 over all days in the data set. If that average is also NA, we will leave the measurement as NA.
This simple strategy may not get rid of all NAs, so we will check how many NAs are still left in the new data set.
So let’s create a new data set with the missing values filled according to this strategy.
activity2 <- activity
# Get row numbers of NA measurements
na_row_nums <- which(!complete.cases(activity2))
# Helper function: return the replacement value (the interval average) for a
# given row number in the activity2 data set
fix.na <- function(row_num) {
  interval_id <- activity2[row_num, ]$hhmm
  steps_per_interval[which(steps_per_interval$hhmm == interval_id), ]$steps
}
activity2[na_row_nums,]$steps <- sapply(na_row_nums, fix.na)
summary(activity2$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 37.38 27.00 806.00
The summary (above) of the new data set shows no NA values. Compare it to a summary of the original activity data set (below):
summary(activity$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 37.38 12.00 806.00 2304
Note that the NAs have been replaced and that the 3rd quartile is significantly higher in the new data set.
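As a direct check on the earlier caveat about leftover NAs, we could count the incomplete rows in the new data set; a minimal sketch (not run above):

# Count rows in activity2 that still contain an NA
sum(!complete.cases(activity2))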
Let’s look at the distribution of steps taken in a day in the new activity data set:
steps_per_day2 <- activity2 %>% group_by(date) %>% summarise(steps = sum(steps))
hist(steps_per_day2$steps,
     xlab = "Number of steps taken in a day",
     main = "Histogram of number of steps taken in a day with no-NA data set")
For the imputed data set, compute the mean and median of total number of steps per day:
mean_steps_per_day2 <- mean(steps_per_day2$steps)
mean_steps_per_day2
## [1] 10766.19
median_steps_per_day2 <- median(steps_per_day2$steps)
median_steps_per_day2
## [1] 10766.19
Wow, they are exactly the same. That’s unusual, and it needs some more investigation to ensure we did the NA substitution correctly.
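One plausible explanation (an assumption worth verifying, not something established above) is that the missing values fall on entire days; in that case every imputed day’s total is exactly the mean daily total, which would pull the median onto the mean. A quick sketch to check how the NAs are distributed across days:

# Count missing step values per day; a fully missing day would show 288 NAs
activity %>%
  group_by(date) %>%
  summarise(missing = sum(is.na(steps))) %>%
  filter(missing > 0)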
Use the categorical variable we added earlier (time_of_week) to compute the average weekday and weekend activity patterns.
This is the same computation used to compute the averages for all days, except that here we use filter() to select only the weekday or weekend days.
Compute the average number of steps for each 5-min interval over all weekday days:
weekday_steps_per_interval <- activity %>%
  filter(time_of_week == 'weekday') %>%
  group_by(hhmm) %>%
  summarise(steps = mean(steps, na.rm = TRUE))
summary(weekday_steps_per_interval)
## hhmm steps
## Length:288 Min. : 0.000
## Class :character 1st Qu.: 2.218
## Mode :character Median : 23.974
## Mean : 35.338
## 3rd Qu.: 51.872
## Max. :234.103
Compute the average number of steps for each 5-min interval over all weekend days:
weekend_steps_per_interval <- activity %>%
  filter(time_of_week == 'weekend') %>%
  group_by(hhmm) %>%
  summarise(steps = mean(steps, na.rm = TRUE))
summary(weekend_steps_per_interval)
## hhmm steps
## Length:288 Min. : 0.000
## Class :character 1st Qu.: 1.107
## Mode :character Median : 32.036
## Mean : 43.078
## 3rd Qu.: 75.571
## Max. :175.000
Let’s use the base graphics package. Even though it would probably be easier and prettier to use lattice or ggplot2, we can just reuse the plot we did above and tweak the data being shown (a rough ggplot2 sketch follows the plots below for comparison):
par(mfrow = c(2, 1))
plot(x = strptime(weekday_steps_per_interval$hhmm, format = '%H%M'),
     y = weekday_steps_per_interval$steps,
     type = 'l',
     main = "Weekday Days",
     xlab = "",
     ylab = "",
     ylim = c(0, 250))
plot(x = strptime(weekend_steps_per_interval$hhmm, format = '%H%M'),
     y = weekend_steps_per_interval$steps,
     type = 'l',
     main = "Weekend Days",
     xlab = "Time of day (hh:mm)",
     ylab = "Number of steps",
     ylim = c(0, 250))
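For comparison, here is a rough sketch of how the same two-panel plot might be done with ggplot2 and dplyr (assumes a reasonably recent ggplot2 is installed; this is an illustration, not part of the analysis above):

library(ggplot2)
# Stack the weekday and weekend averages into one data frame for faceting
combined <- bind_rows(
  weekday_steps_per_interval %>% mutate(time_of_week = 'Weekday Days'),
  weekend_steps_per_interval %>% mutate(time_of_week = 'Weekend Days')
) %>%
  mutate(time = as.POSIXct(strptime(hhmm, format = '%H%M')))

ggplot(combined, aes(x = time, y = steps)) +
  geom_line() +
  facet_wrap(~ time_of_week, ncol = 1) +
  scale_x_datetime(date_labels = '%H:%M') +
  labs(x = 'Time of day (hh:mm)', y = 'Number of steps')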