This is the first project for the Reproducible Research course, which is part of Coursera’s Data Science Specialization and the final course of Coursera’s Data Science: Foundations Using R Specialization.
The project analyzes data from a personal activity monitoring device.
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
Load packages & data
library("data.table"); library("dplyr"); library("lattice")
if(!dir.exists("./data/0904_DS-RR-w2_Activity")) {
    # create the target folder first, otherwise download.file() fails
    dir.create("./data/0904_DS-RR-w2_Activity", recursive = TRUE)
}
if(!file.exists("./data/0904_DS-RR-w2_Activity/activity.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",
                  destfile = "./data/0904_DS-RR-w2_Activity/activity.zip",
                  method = "curl")
}
if(!file.exists("./data/0904_DS-RR-w2_Activity/activity.csv")) {
    unzip("./data/0904_DS-RR-w2_Activity/activity.zip",
          exdir = "./data/0904_DS-RR-w2_Activity")
}
- Read the data
activity <- fread("./data/0904_DS-RR-w2_Activity/activity.csv")
- Process/transform the data into a format suitable for your analysis
Because the fread function is used, the data are already almost in a suitable format; only the columns are renamed.
activity <- activity %>% rename(NumberOfSteps = steps, Date = date,
                                FiveMinuteInterval = interval)
str(activity)
Classes 'data.table' and 'data.frame': 17568 obs. of 3 variables:
$ NumberOfSteps : int NA NA NA NA NA NA NA NA NA NA ...
$ Date : IDate, format: "2012-10-01" "2012-10-01" ...
$ FiveMinuteInterval: int 0 5 10 15 20 25 30 35 40 45 ...
- attr(*, ".internal.selfref")=<externalptr>
There are 17 568 observations in this data set, with the following original variables:
- steps: number of steps taken in a 5-minute interval (missing values are coded as NA)
- date: the date on which the measurement was taken, in YYYY-MM-DD format
- interval: identifier for the 5-minute interval in which the measurement was taken
For this part of the assignment, the missing values in the data set can be ignored.
- Make a histogram of the total number of steps taken each day
TDsteps <- activity %>%
    filter(!is.na(NumberOfSteps)) %>%
    group_by(Date) %>%
    summarise(DailySteps = sum(NumberOfSteps))
histogram( ~ DailySteps, TDsteps, col= "cornflowerblue",
          main = "Histogram of the total number of steps taken each day")
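One consequence of filtering out the NA rows before summing may be worth illustrating: a day whose intervals are all missing disappears from TDsteps entirely, whereas summing with na.rm = TRUE would keep it with a total of zero and pull the histogram and the mean downward. The sketch below uses a small made-up data frame (toy is hypothetical, not the activity data) to show the difference.

```r
library(dplyr)

# Hypothetical toy data: the first day is entirely missing, the second is complete.
toy <- data.frame(
    Date = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
    NumberOfSteps = c(NA, NA, 10L, 20L)
)

# Approach used in the analysis: drop NA rows first, so the all-NA day vanishes.
dropped <- toy %>%
    filter(!is.na(NumberOfSteps)) %>%
    group_by(Date) %>%
    summarise(DailySteps = sum(NumberOfSteps))

# Alternative: keep every day; an all-NA day then gets a total of 0.
zeroed <- toy %>%
    group_by(Date) %>%
    summarise(DailySteps = sum(NumberOfSteps, na.rm = TRUE))

nrow(dropped)       # 1 day remains
zeroed$DailySteps   # 0 30
```

Dropping the rows first, as done above, avoids counting unobserved days as days with zero steps.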
- Calculate and report the mean and median total number of steps taken per day
Apply mean and median to the variable DailySteps from the data set TDsteps:
TDsteps %>%
    summarise(meanTDsteps = mean(DailySteps),
              medianTDsteps = median(DailySteps))
# A tibble: 1 x 2
  meanTDsteps medianTDsteps
        <dbl>         <int>
1      10766.         10765
The mean total number of steps taken per day is 10 766 and the median is 10 765.
- Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
AvgDsteps <- activity %>%
    filter(!is.na(NumberOfSteps)) %>%
    group_by(FiveMinuteInterval) %>%
    summarise(AvgIntSteps = mean(NumberOfSteps))
xyplot(AvgIntSteps ~ FiveMinuteInterval, AvgDsteps, type = "l",
       main = "Time series plot of the average number of steps taken")
- Which 5-minute interval, on average across all the days in the data set, contains the maximum number of steps?
Filter AvgDsteps for the interval with the maximum average number of steps:
AvgDsteps %>% filter(AvgIntSteps == max(AvgIntSteps)) %>%
    rename(maxInterval = FiveMinuteInterval, maxSteps = AvgIntSteps)
# A tibble: 1 x 2
  maxInterval maxSteps
        <int>    <dbl>
1         835     206.
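The identifiers in FiveMinuteInterval run from 0 to 2355, which suggests they encode clock time as hours followed by minutes rather than a running count of minutes. Under that assumption, a small helper can turn an identifier into a readable time; interval_to_clock is a hypothetical name introduced here for illustration only.

```r
# Hypothetical helper: read an interval identifier as HHMM clock time.
# Assumes identifiers such as 835 mean 08:35 (hours * 100 + minutes).
interval_to_clock <- function(interval) {
    interval <- as.integer(interval)
    sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 835, 2355))   # "00:00" "08:35" "23:55"
```

Under this reading, interval 835 corresponds to the five minutes starting at 08:35.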
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
- Calculate and report the total number of missing values in the data set (i.e. the total number of rows with NAs)
TNAs <- sum(is.na(activity))
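Note that sum(is.na(...)) counts missing cells across all columns, which equals the number of rows with NAs only when no row has more than one missing value. The str output above suggests that only the steps column contains NAs, so the two counts should agree here, but that is an assumption worth checking. A minimal sketch on a made-up data frame (toy is hypothetical) shows how the counts can differ:

```r
# Hypothetical toy data: the first row has two missing cells.
toy <- data.frame(
    steps = c(NA, 3, NA),
    interval = c(NA, 5, 10)
)

missing_cells <- sum(is.na(toy))             # counts every NA cell
missing_rows <- sum(!complete.cases(toy))    # counts rows with at least one NA
missing_per_column <- colSums(is.na(toy))    # shows where the NAs are

missing_cells        # 3
missing_rows         # 2
missing_per_column   # steps: 2, interval: 1
```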
- Devise a strategy for filling in all of the missing values in the data set. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
Strategy: replace each NA with the mean for the given 5-minute interval (AvgIntSteps).
- Create a new data set that is equal to the original data set but with the missing data filled in
Create an auxiliary data set NAs, loop over all of its rows using AvgDsteps, and then combine them with the rest of the original data set again:
### create an auxiliary data set with only `NA`s ###
NAs <- activity %>% filter(is.na(NumberOfSteps))
### replace each `NA` value with data from `AvgDsteps`,
# i.e. with the mean for the given 5-minute interval ###
for (i in seq_len(nrow(NAs))) {
    NAs$NumberOfSteps[i] <-
        AvgDsteps[which(NAs$FiveMinuteInterval[i] ==
                        AvgDsteps$FiveMinuteInterval),]$AvgIntSteps
}
### combine new `NAs` w/ remaining data in the original data set `activity` ###
noNAs <- rbind(NAs, activity %>% filter(!is.na(NumberOfSteps)))
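The loop-and-rbind approach above works, but it places the imputed rows ahead of the observed ones in noNAs. As a hedged alternative, the same interval-mean imputation can be sketched in a single grouped mutate that keeps the original row order. The example runs on a small made-up data frame (toy and imputed are hypothetical names), not on the activity data.

```r
library(dplyr)

# Hypothetical toy data: two intervals observed on three days, with two gaps.
# NumberOfSteps is stored as double here; in the real data it is integer and
# would likely need as.numeric() first, since if_else() can insist on matching types.
toy <- data.frame(
    FiveMinuteInterval = c(0L, 5L, 0L, 5L, 0L, 5L),
    NumberOfSteps = c(NA, 4, 2, NA, 6, 8)
)

# Within each interval, replace NA with that interval's mean of observed values.
imputed <- toy %>%
    group_by(FiveMinuteInterval) %>%
    mutate(NumberOfSteps = if_else(is.na(NumberOfSteps),
                                   mean(NumberOfSteps, na.rm = TRUE),
                                   NumberOfSteps)) %>%
    ungroup()

imputed$NumberOfSteps          # 4 4 2 6 6 8
anyNA(imputed$NumberOfSteps)   # FALSE
```

Because no rows are split off and recombined, the result stays in the same order as the input.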
### `summary` will show whether there are any `NA`s in the new data set `noNAs` ###
summary(noNAs)
NumberOfSteps Date FiveMinuteInterval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 27.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
There are no NA values in the new data set noNAs.
- Make a histogram of the total number of steps taken each day and…
TDsteps <- noNAs %>%
    group_by(Date) %>%
    summarise(DailySteps = sum(NumberOfSteps))
histogram( ~ DailySteps, TDsteps, col= "cornflowerblue",
          main = "Histogram of the total number of steps
taken each day after missing values are imputed")
4a. … Calculate and report the mean and median total number of steps taken per day
4b. Do these values differ from the estimates from the first part of the assignment?
Apply mean and median to the variable DailySteps from the data set TDsteps:
TDsteps %>%
    summarise(meanTDsteps = mean(DailySteps),
              medianTDsteps = median(DailySteps))
# A tibble: 1 x 2
  meanTDsteps medianTDsteps
        <dbl>         <dbl>
1      10766.        10766.
After imputation the mean is still 10 766, while the median is now 10 766 compared with 10 765 in the first part.
4c. What is the impact of imputing missing data on the estimates of the total daily number of steps?
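A possible explanation for the pattern in the results above, offered as a sketch rather than a conclusion drawn by the analysis itself: if the missing values fall on whole days, every imputed day receives a total equal to the sum of the interval means, which is exactly the mean daily total of the observed days. Adding days that sit at the mean leaves the mean unchanged and pulls the median toward it. The toy matrix below (hypothetical numbers, two intervals per day) illustrates this arithmetic.

```r
# Hypothetical toy data: three complete days and two fully missing days,
# one row per day and one column per interval.
steps <- matrix(c(10, 20,
                  20, 20,
                  30, 60,
                  NA, NA,
                  NA, NA), ncol = 2, byrow = TRUE)

interval_means <- colMeans(steps, na.rm = TRUE)   # mean per interval
observed_totals <- rowSums(steps[1:3, ])          # totals of the complete days

# Fill each missing cell with the mean of its interval (column).
filled <- steps
missing <- is.na(filled)
filled[missing] <- interval_means[col(filled)[missing]]
imputed_totals <- rowSums(filled)

c(mean(observed_totals), mean(imputed_totals))       # both equal 160/3
c(median(observed_totals), median(imputed_totals))   # 40, then 160/3
```

In this toy case the mean stays at 160/3 while the median moves from 40 to the mean, the same direction of change as in the tibbles above.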
Use the data set with the filled-in missing values for this part
- Create a new factor variable in the data set with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day
Create the variable Wdays as numbers from 1 (Sunday) to 7 (Saturday), then replace the values 2-6 with weekday and the values 1 and 7 with weekend:
noNAs <- noNAs %>% mutate(Wdays = wday(Date)) %>%
    mutate(Wdays = if_else(Wdays == 1 | Wdays == 7, "weekend", "weekday")) %>%
    mutate(Wdays = as.factor(Wdays))
str(noNAs)
Classes 'data.table' and 'data.frame': 17568 obs. of 4 variables:
$ NumberOfSteps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ Date : IDate, format: "2012-10-01" "2012-10-01" ...
$ FiveMinuteInterval: int 0 5 10 15 20 25 30 35 40 45 ...
$ Wdays : Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
The data set now has a new factor variable Wdays with two levels: weekday and weekend.
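For readers without data.table loaded, the same weekday/weekend split can be sketched in base R. This is an alternative to the wday call used above, not the method of the analysis: as.POSIXlt exposes a numeric day of the week (0 for Sunday through 6 for Saturday), which, like wday, does not depend on the locale. The dates and the names day_index and day_type are made up for illustration.

```r
# Hypothetical sample dates: a Friday, Saturday, Sunday and Monday in October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of the week: 0 = Sunday, ..., 6 = Saturday.
day_index <- as.POSIXlt(dates)$wday

# Two-level factor with the same levels as Wdays.
day_type <- factor(ifelse(day_index %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

day_index                # 5 6 0 1
as.character(day_type)   # "weekday" "weekend" "weekend" "weekday"
```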
- Make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis)
AvgDsteps <- noNAs %>%
    group_by(Wdays, FiveMinuteInterval) %>%
    summarise(AvgIntSteps = mean(NumberOfSteps))
xyplot(AvgIntSteps ~ FiveMinuteInterval | Wdays, data = AvgDsteps, type = "l",
       layout = c(1,2), xlab = "Interval", ylab = "Number of steps",
       main = "Panel plot comparing the average number of steps
taken per 5-minute interval across weekdays and weekends")
Session Info (for reproducibility)
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lattice_0.20-41 dplyr_1.0.2 data.table_1.13.4
loaded via a namespace (and not attached):
[1] knitr_1.30 magrittr_2.0.1 tidyselect_1.1.0 R6_2.5.0
[5] rlang_0.4.9 fansi_0.4.1 stringr_1.4.0 tools_4.0.3
[9] grid_4.0.3 xfun_0.19 utf8_1.1.4 cli_2.2.0
[13] htmltools_0.5.0 ellipsis_0.3.1 assertthat_0.2.1 yaml_2.2.1
[17] digest_0.6.27 tibble_3.0.4 lifecycle_0.2.0 crayon_1.3.4
[21] purrr_0.3.4 vctrs_0.3.5 glue_1.4.2 evaluate_0.14
[25] rmarkdown_2.5 stringi_1.5.3 compiler_4.0.3 pillar_1.4.7
[29] generics_0.1.0 pkgconfig_2.0.3