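In this project, I analyze data from a personal activity monitoring device. The questions addressed include the total number of steps taken each day, the average daily activity pattern, the imputation of missing values, and the differences in activity between weekdays and weekends.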
library(lubridate)  # provides ymd() for parsing the date column
setwd("/Users/Feng/Documents/AfterGraduationStudy/DataScienceSpecialization/05_ReproducibleResearch/ReproducibleResearch_HW_Quiz/W2_HW_Project1")
rawdata <- read.csv("activity.csv", header = TRUE)
rawdata$date <- ymd(rawdata$date)
dim(rawdata)
## [1] 17568 3
head(rawdata)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
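I use the dplyr package to calculate the total number of steps taken each day. Days that contain missing values get a total of NA.
library(dplyr)  # provides %>%, group_by() and summarize()
sumbyday <- rawdata %>% group_by(date) %>% summarize(sum.steps = sum(steps))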
dim(sumbyday)
## [1] 61 2
sumbyday[c(1:3, (nrow(sumbyday) - 2):nrow(sumbyday)), ]
## # A tibble: 6 × 2
## date sum.steps
## <date> <int>
## 1 2012-10-01 NA
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-11-28 10183
## 5 2012-11-29 7047
## 6 2012-11-30 NA
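The NA totals for 2012-10-01 and 2012-11-30 come from the way `sum()` propagates missing values. The sketch below uses made-up vectors (`all_missing` and `partly_missing` are illustrative names, not part of the analysis) to contrast the default behavior with `na.rm = TRUE`, which would turn a day with no recorded values into a total of 0.

```r
# Hypothetical step counts, not taken from activity.csv.
all_missing    <- c(NA_integer_, NA_integer_, NA_integer_)
partly_missing <- c(100L, NA_integer_, 50L)

# Default: a single NA makes the whole sum NA.
total_default <- sum(all_missing)

# With na.rm = TRUE the NAs are dropped, so an all-NA day sums to 0.
total_narm_all    <- sum(all_missing, na.rm = TRUE)
total_narm_partly <- sum(partly_missing, na.rm = TRUE)

print(c(total_default, total_narm_all, total_narm_partly))
```

Keeping the NA totals, as done above, means fully missing days are left out of the histogram instead of appearing as zero-step days.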
The histogram is plotted from the sumbyday data set generated in the previous step.
library(ggplot2)
g <- ggplot(data = sumbyday, aes(x = sum.steps))
g <- g + geom_histogram(color = "wheat4", fill = "hotpink", bins = 25) +
  scale_x_continuous(name = "Total Number of Steps Taken Each Day") +
  scale_y_continuous(name = "Count", breaks = seq(0, 10, 2)) +
  ggtitle(label = "Histogram of the Total Number of Steps Taken Each Day")
g_sumbyday <- g + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"),
                        plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                        axis.text = element_text(size = 12),
                        axis.title = element_text(size = 14, face = "bold"))
g_sumbyday
Mean and median of total steps taken each day are extracted from the summary of the sumbyday data set.
summary(sumbyday[2])
## sum.steps
## Min. : 41
## 1st Qu.: 8841
## Median :10765
## Mean :10766
## 3rd Qu.:13294
## Max. :21194
## NA's :8
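# Compute the values directly rather than parsing the printed summary() text,
# which would break if the printed numbers contained decimals.
meanofsum <- round(mean(sumbyday$sum.steps, na.rm = TRUE))
medianofsum <- median(sumbyday$sum.steps, na.rm = TRUE)
The mean and median of the total number of steps taken per day are 10766 steps and 10765 steps, respectively.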
To facilitate further analysis, I reshape the long-format data frame into a wide-format one. The wide-format data frame has 61 observations of 289 variables: the variables are the date and the 288 intervals, and each observation holds the data for one of the 61 days.
widedata <- reshape(rawdata, timevar = "interval", idvar = "date", direction = "wide")
# fixed = TRUE matches the literal prefix "steps." instead of treating "." as a regex wildcard
names(widedata) <- gsub("steps.", "intvl.", names(widedata), fixed = TRUE)
rownames(widedata) <- NULL
dim(widedata)
## [1] 61 289
tail(widedata[ , c(1:5, 285:289)])
## date intvl.0 intvl.5 intvl.10 intvl.15 intvl.2335 intvl.2340
## 56 2012-11-25 0 0 0 0 176 94
## 57 2012-11-26 0 0 0 0 0 0
## 58 2012-11-27 0 0 0 0 0 0
## 59 2012-11-28 0 0 0 0 0 0
## 60 2012-11-29 0 0 0 0 0 0
## 61 2012-11-30 NA NA NA NA NA NA
## intvl.2345 intvl.2350 intvl.2355
## 56 26 0 0
## 57 0 0 0
## 58 0 0 0
## 59 0 0 0
## 60 0 0 0
## 61 NA NA NA
A new data set is generated to record the average steps taken in each interval.
avgperintvl <- data.frame(index = 1:288, avg.steps = colMeans(widedata[-1], na.rm = TRUE))
#' The index variable is added to the data frame because the original interval labels jump at every hour
#' boundary (e.g., from 1255 to 1300 rather than to 1260), which would create gaps along the x-axis of
#' the plots.
dim(avgperintvl)
## [1] 288 2
head(avgperintvl)
## index avg.steps
## intvl.0 1 1.7169811
## intvl.5 2 0.3396226
## intvl.10 3 0.1320755
## intvl.15 4 0.1509434
## intvl.20 5 0.0754717
## intvl.25 6 2.0943396
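The relationship between the original interval labels and the consecutive index can also be written down explicitly. The sketch below assumes the labels encode clock time as HHMM (so 835 means 8:35); `interval_to_index` is a hypothetical helper, not part of the analysis above.

```r
# Convert an HHMM-style interval label to a 1-based index of 5-minute slots.
interval_to_index <- function(interval) {
  minutes <- (interval %/% 100) * 60 + interval %% 100  # minutes since midnight
  minutes %/% 5 + 1                                      # 1-based slot number
}

idx <- interval_to_index(c(0, 5, 55, 100, 835, 2355))
print(idx)
# [1]   1   2  12  13 104 288
```

Labels 55 and 100 map to the neighboring indices 12 and 13, so the hourly jump in the labels disappears on the index scale.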
A time series plot is presented to show the average number of steps taken during each 5-minute interval throughout a day.
g <- ggplot(data = avgperintvl, aes(x = index, y = avg.steps))
g <- g + geom_line(color = "royalblue4", size = 1) +
  geom_point(aes(x = which.max(avgperintvl$avg.steps), y = max(avg.steps)),
             shape = 1, size = 5, color = "orange")
g <- g + scale_x_continuous(name = "5-minute Interval Index", breaks = seq(0, 300, 50)) +
  scale_y_continuous(name = "Average Steps Count") +
  ggtitle(label = "Average Number of Steps Taken in Each Interval",
          subtitle = "(Averaged across All Days)")
g_meanbyintvl <- g + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"),
                           plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                           plot.subtitle = element_text(hjust = 0.5, size = 12),
                           axis.text = element_text(size = 10),
                           axis.title = element_text(size = 12, face = "bold"))
g_meanbyintvl
library(stringr)  # provides str_extract()
maxsteps <- round(max(avgperintvl$avg.steps))
maxindex <- which.max(avgperintvl$avg.steps)
maxintvl <- str_extract(rownames(avgperintvl)[maxindex], "[0-9]+")
Among all the 5-minute intervals, the interval labeled 835, i.e., 8:35 (index 104), has the highest number of steps averaged across all days in the data set (206 steps).
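To make labels such as 835 easier to read, they can be formatted as clock times. This is a minimal sketch under the same HHMM assumption; `interval_to_clock` is a hypothetical helper name.

```r
# Format an HHMM-style interval label as a zero-padded "HH:MM" string.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock <- interval_to_clock(c(0, 835, 2355))
print(clock)
# [1] "00:00" "08:35" "23:55"
```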
In accordance with the project instructions, the rawdata data frame is used to calculate the total number of missing values in the original data set.
na_record <- tapply(rawdata$steps, rawdata$date, function(x) sum(is.na(x)))
head(na_record)
## 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-06
## 288 0 0 0 0 0
na_totalnum <- sum(na_record)
na_date <- names(na_record)[na_record > 0]
There are 2304 NA entries in the original data, all of them on 8 days: 2012-10-01, 2012-10-08, 2012-11-01, 2012-11-04, 2012-11-09, 2012-11-10, 2012-11-14, 2012-11-30. Each of these days is missing all 288 intervals, and the data for the remaining 53 days are complete.
I impute the missing data on the 8 all-NA days with the mean number of steps taken in each interval, averaged over the 53 complete days.
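raw_imputed <- rawdata
na_index <- which(raw_imputed$date %in% as.Date(na_date))
# Each NA day is missing all 288 intervals, so the interval means are repeated once per NA day.
raw_imputed$steps[na_index] <- rep(avgperintvl$avg.steps, length(na_date))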
head(raw_imputed)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
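The imputation above relies on every NA day being missing in full. As a hedged alternative, the sketch below fills each NA with the mean of its interval using base R's `ave()`, which would also handle partially missing days. The `toy` data frame and its values are made up for illustration.

```r
# Toy data: 3 days x 2 intervals; the second day is entirely missing.
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:2, each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, NA, NA, 30, 40)
)

# For every row, ave() returns the mean of that row's interval group (NAs ignored).
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval mean.
toy_imputed <- toy
missing <- is.na(toy$steps)
toy_imputed$steps[missing] <- interval_mean[missing]

print(toy_imputed$steps)
# [1] 10 20 20 30 30 40
```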
With the imputed data, I recalculate the total number of steps taken each day. Note that the numbers are rounded to integers.
sumbyday_imp <- raw_imputed %>% group_by(date) %>% summarize(sum.steps = round(sum(steps)))
sumbyday_imp[c(1:3, (nrow(sumbyday_imp) - 2):nrow(sumbyday_imp)), ]
## # A tibble: 6 × 2
## date sum.steps
## <date> <dbl>
## 1 2012-10-01 10766
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-11-28 10183
## 5 2012-11-29 7047
## 6 2012-11-30 10766
The histogram of the total number of steps taken each day is replotted based on the data from sumbyday_imp.
g <- ggplot(data = sumbyday_imp, aes(x = sum.steps))
g <- g + geom_histogram(color = "slateblue3", fill = "peachpuff", bins = 25) +
  scale_x_continuous(name = "Total Number of Steps Taken Each Day") +
  scale_y_continuous(name = "Count", breaks = seq(0, 18, 2)) +
  ggtitle(label = "Histogram of the Total Number of Steps Taken Each Day",
          subtitle = "(NA Data Imputed with Mean Value)")
g_sumbyday_imp <- g + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"),
                            plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
                            plot.subtitle = element_text(hjust = 0.5, size = 12),
                            axis.text = element_text(size = 12),
                            axis.title = element_text(size = 14, face = "bold"))
g_sumbyday_imp
meanofsum_imp <- mean(sumbyday_imp$sum.steps)
medianofsum_imp <- median(sumbyday_imp$sum.steps)
sdsum <- sd(sumbyday$sum.steps, na.rm = TRUE)
sdsum_imp <- sd(sumbyday_imp$sum.steps)
In the imputed data, the mean and median of the total number of steps taken per day are both 10766 steps.
The imputation noticeably changes the distribution of the total number of steps taken per day, although its mean and median barely change. Because each of the 8 imputed days receives the same total (10766 steps, the mean), the bin containing the mean is almost twice as tall as in the first histogram, and the standard deviation of the total number of steps taken per day decreases from 4269.18 to 3974.39.
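The drop in standard deviation follows directly from adding observations that sit exactly at the mean: the sum of squared deviations stays the same while the denominator grows. With n complete days and k imputed days, the standard deviation shrinks by a factor of sqrt((n - 1) / (n + k - 1)); for n = 53 and k = 8 this gives roughly 4269.18 × sqrt(52 / 60) ≈ 3974.4, in line with the values above. The sketch below checks the relationship on made-up totals (`totals` and `k` are illustrative, not taken from the data).

```r
# Hypothetical daily totals for the complete days.
totals <- c(8000, 10000, 12000, 9000, 11000)
n <- length(totals)
k <- 3  # hypothetical number of days imputed with the mean

# Mean imputation: every imputed day receives the mean daily total.
imputed <- c(totals, rep(mean(totals), k))

mean_unchanged <- isTRUE(all.equal(mean(imputed), mean(totals)))
sd_predicted   <- sd(totals) * sqrt((n - 1) / (n + k - 1))
sd_matches     <- isTRUE(all.equal(sd(imputed), sd_predicted))

print(c(mean_unchanged, sd_matches))
# [1] TRUE TRUE
```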
# weekdays() returns locale-dependent names; this comparison assumes an English locale.
raw_imputed$dayofweek <- factor(weekdays(raw_imputed$date) %in% c("Saturday", "Sunday"),
                                labels = c("Weekday", "Weekend"))
by_dofw <- raw_imputed %>% group_by(dayofweek, interval) %>% summarize(mean.steps = mean(steps))
by_dofw$intvl.index <- rep(1:288, 2)
by_dofw[c(1:3, (nrow(by_dofw) - 2):nrow(by_dofw)), ]
## Source: local data frame [6 x 4]
## Groups: dayofweek [2]
##
## dayofweek interval mean.steps intvl.index
## <fctr> <int> <dbl> <int>
## 1 Weekday 0 2.25115304 1
## 2 Weekday 5 0.44528302 2
## 3 Weekday 10 0.17316562 3
## 4 Weekend 2345 1.70518868 286
## 5 Weekend 2350 0.02830189 287
## 6 Weekend 2355 0.13443396 288
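Because `weekdays()` returns day names in the current locale, comparing them with "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch is shown below; it assumes the `%u` format code (ISO weekday number, Monday = 1) is available through `format()`, and `dates` and `day_type` are illustrative names.

```r
# 2012-10-05 is a Friday, so the four dates run Friday through Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# "%u" gives the ISO weekday as "1".."7" (Monday = "1"), independent of locale.
is_weekend <- format(dates, "%u") %in% c("6", "7")
day_type <- factor(ifelse(is_weekend, "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

print(as.character(day_type))
# [1] "Weekday" "Weekend" "Weekend" "Weekday"
```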
Now I use the lattice package to draw the panel plot.
library(lattice)  # provides xyplot()
xyplot(mean.steps ~ intvl.index | dayofweek, data = by_dofw, type = "l", layout = c(1, 2),
       main = "Average Number of Steps Taken in Each Interval\n(Averaged across All Weekends and All Weekdays Respectively)",
       xlab = list(label = "5-minute Interval Index", fontsize = 14),
       ylab = list(label = "Average Steps Count", fontsize = 14))
Judging from the panel plot, the average number of steps taken in each interval follows a similar trend on weekends and weekdays, but the mean values in individual intervals differ appreciably between the two.
results <- by_dofw %>% group_by(dayofweek) %>%
  summarize(max.steps = max(mean.steps),
            max.int.index = intvl.index[mean.steps == max.steps],
            max.intvl = interval[mean.steps == max.steps])
print(results)
## # A tibble: 2 × 4
## dayofweek max.steps max.int.index max.intvl
## <fctr> <dbl> <int> <int>
## 1 Weekday 230.3782 104 835
## 2 Weekend 166.6392 112 915
As the table above shows, on weekdays the average number of steps peaks in the 104th interval (835, i.e., 8:35) at 230 steps, while on weekends it peaks in the 112th interval (915, i.e., 9:15) at 167 steps.
#' At the end of the Rmd document, the command below is run in the console to produce the PA1_template.md and
#' PA1_template.html files.
# rmarkdown::render("PA1_template.Rmd", c("html_document", "md_document"))