Course Project 1


Loading And Preprocessing The Data

First, we find the path to the directory where the project has been downloaded or cloned. This requires the "here" package, which can be installed with install.packages("here"):

library(here)
oldWD <- getwd()
# Call here() with an explicit namespace: lubridate versions that still
# export a deprecated here() mask here::here() once lubridate is attached.
setwd(here::here())

Loading and preprocessing belong to the data-munging step of the ProjectTemplate structure; see also the README.md in the root of the project directory. The data is loaded with read.csv() into a data frame, df_steps. The code is saved in the /munge/01-A.R script, which runs every time load.project() is called:

library(ProjectTemplate)
load.project()
## Project name: ReproducibleResearch_CourseProject1
## Loading project configuration
## Autoloading packages
##  Loading package: reshape
##  Loading package: plyr
##  Loading package: dplyr
##  Loading package: ggplot2
##  Loading package: stringr
##  Loading package: lubridate
##  Loading package: Hmisc
## Autoloading helper functions
##  Running helper script: globals.R
##  Running helper script: helpers.R
## Autoloading data
## Munging data
##  Running preprocessing script: 01-A.R
dim(df_steps)
## [1] 17568     3
head(df_steps)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
summary(df_steps)
##      steps                date          interval     
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0  
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8  
##  Median :  0.00   2012-10-03:  288   Median :1177.5  
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5  
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2  
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0  
##  NA's   :2304     (Other)   :15840

NOTE: The path passed to setwd() must point to the directory where you have checked out the project on your computer.
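
The munge script is not reproduced here. As a rough sketch of what such a loading step could look like, the block below reads a few rows in the same three-column layout from an inline string instead of a file. The sample rows and the name df_example are made up for illustration, and stringsAsFactors = TRUE is an assumption chosen to match the factor date column shown in the summary above.

```r
# Hypothetical stand-in for the CSV file; the real file is read by munge/01-A.R.
csv_text <- "steps,date,interval
NA,2012-10-01,0
NA,2012-10-01,5
0,2012-10-02,0
117,2012-10-02,5"

# read.csv() treats the string "NA" as a missing value by default.
df_example <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(df_example)                 # steps: int, date: Factor, interval: int
sum(is.na(df_example$steps))    # number of missing step counts
```

If a real Date column is preferred over a factor, as.Date(as.character(df_example$date)) would be one way to convert it.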

Mean Total Number of Steps Taken Per Day

All of the necessary commands for this part are saved in the cp1_basic_processing.R script in the src folder. For this part, missing values are ignored: a day that contains missing values gets an NA total and is excluded from the mean and median through na.rm=TRUE.

Total number of steps taken per day:

df_steps %>% group_by(date) %>% summarise(sum=sum(steps))
## # A tibble: 61 x 2
##    date         sum
##    <fct>      <int>
##  1 2012-10-01    NA
##  2 2012-10-02   126
##  3 2012-10-03 11352
##  4 2012-10-04 12116
##  5 2012-10-05 13294
##  6 2012-10-06 15420
##  7 2012-10-07 11015
##  8 2012-10-08    NA
##  9 2012-10-09 12811
## 10 2012-10-10  9900
## # ... with 51 more rows

Histogram of the total number of steps taken each day:

dt_steps_per_day <- df_steps %>% group_by(date) %>%
    summarise(sum=sum(steps))
hist(dt_steps_per_day$sum, 
    xlab="Total number of steps per day",
    main = "Histogram of total number of steps per day")

[Figure: histogram of the total number of steps per day]

dev.print(png, 
    'figure/histogram_sum_steps_per_day.png',
    width=640,
    height=800)
## png 
##   2

Mean and median of the total number of steps taken per day:

mean(dt_steps_per_day$sum,na.rm=TRUE)
## [1] 10766.19
median(dt_steps_per_day$sum,na.rm=TRUE)
## [1] 10765
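
How the missing values are ignored matters here. The sketch below uses a made-up three-day data frame (toy is a hypothetical name, not part of the project) to contrast the approach used above, where a day with missing values gets an NA total and is dropped from the mean, with summing under na.rm = TRUE, which would turn such a day into a zero-step day and pull the mean down.

```r
# Made-up data: the first day is entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 50),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Daily totals without na.rm: a day with missing values stays NA.
totals_na <- sapply(split(toy$steps, toy$date), sum)
# Daily totals with na.rm = TRUE: the all-missing day becomes 0.
totals_zero <- sapply(split(toy$steps, toy$date), sum, na.rm = TRUE)

mean(totals_na, na.rm = TRUE)   # 55: average over the two observed days
mean(totals_zero)               # about 36.67: the zero day drags the mean down
```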

Average Daily Activity Pattern

Time series plot (type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):

dt_average_steps_per_interval <- df_steps %>%
    group_by(interval) %>%
    summarise(mean=mean(steps, na.rm=TRUE))

plot(dt_average_steps_per_interval$interval,
    dt_average_steps_per_interval$mean,
    type="l",
    xlab="Interval",
    ylab="Average number of steps",
    main="Average Number of Steps Per Interval")

[Figure: line plot of the average number of steps per 5-minute interval]

dev.print(png, 
    'figure/lineplot_average_steps_per_interval.png',
    width=640,
    height=800)
## png 
##   2

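5-minute interval, on average across all the days in the dataset, that contains the maximum number of steps:

which.max(dt_average_steps_per_interval$mean)
## [1] 104

Note that which.max() returns the row index, not the interval label. Row 104 corresponds to interval 835, i.e. the five minutes starting at 08:35.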
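
The mapping from a row position to an interval label can be checked with a short sketch. It rebuilds the 288 interval labels under the assumption, consistent with the summary output (minimum 0, maximum 2355), that the labels encode clock time as HHMM. The names interval_labels, peak_row and peak_label are hypothetical.

```r
# 12 five-minute offsets per hour times 24 hours; column-major order
# yields 0, 5, ..., 55, 100, 105, ..., 2355.
interval_labels <- as.vector(outer(seq(0, 55, by = 5), 100 * (0:23), "+"))
length(interval_labels)               # 288

peak_row <- 104                       # row index reported by which.max() above
peak_label <- interval_labels[peak_row]
peak_label                            # 835

# Format the label as a clock time.
sprintf("%02d:%02d",
        as.integer(peak_label %/% 100),
        as.integer(peak_label %% 100))  # "08:35"
```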

Imputing Missing Values

The total number of missing values in the dataset (i.e. the total number of rows with NAs):

sapply(df_steps, function(x) sum(is.na(x)))
##    steps     date interval 
##     2304        0        0

The chosen strategy for filling in the missing values in the dataset is to use the mean for that 5-minute interval. A new dataset that is equal to the original dataset, but with the missing data filled in:

# merge() reorders the rows, so locate the NAs in the merged data frame
# rather than reusing row indices taken from df_steps
df_steps_imputed <- merge(df_steps,dt_average_steps_per_interval,by="interval")
na_rows <- is.na(df_steps_imputed$steps)
df_steps_imputed$steps[na_rows] <- df_steps_imputed$mean[na_rows]

Histogram of the total number of steps taken each day:

dt_steps_per_day_imputed <- df_steps_imputed %>% group_by(date) %>%
    summarise(sum=sum(steps))
hist(dt_steps_per_day_imputed$sum, 
    xlab="Total number of steps per day",
    main = "Histogram of total number of steps per day")

[Figure: histogram of the total number of steps per day after imputation]

dev.print(png, 
    'figure/histogram_sum_steps_per_day_imputed.png',
    width=640,
    height=800)
## png 
##   2

The mean and median total number of steps taken per day:

mean(dt_steps_per_day_imputed$sum)
## [1] 10766.19
median(dt_steps_per_day_imputed$sum)
## [1] 10766.19

The mean is unchanged and the median rises only slightly, from 10765 to 10766.19. The 2304 missing values cover eight complete days (8 x 288 intervals), so each of those days receives the same imputed total, which equals the mean daily total. Thus, imputing with interval means adds eight days at the center of the distribution without shifting the estimates of the total daily number of steps.
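
Interval-mean imputation depends on each fill value landing on the row of its own interval. The sketch below shows one way to keep that alignment on a made-up data frame; toy, interval_mean and toy_imputed are hypothetical names, and the use of ave() is an assumption for illustration rather than the project's own code.

```r
# Made-up data: the first day is entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 50),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# ave() returns, for every row, the mean of that row's interval group,
# so the fill values are aligned with the rows by construction.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(v) mean(v, na.rm = TRUE))

toy_imputed <- toy
missing <- is.na(toy$steps)
toy_imputed$steps[missing] <- interval_mean[missing]

toy_imputed$steps                                         # 20 35 10 20 30 50
sapply(split(toy_imputed$steps, toy_imputed$date), sum)   # 55 30 80
```

In this toy case the only gap is a whole day, so its imputed total (55) equals the mean of the observed daily totals (30 and 80).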

Differences In Activity Patterns Between Weekdays And Weekends

New factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day:

source("src/cp1_function_daytype.R")
df_steps_imputed$daytype <- apply(df_steps_imputed,1,
    function(x) daytype(x[3]))
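
The daytype() function lives in src/cp1_function_daytype.R and is not shown here. A minimal sketch of how such a classifier could be written follows. The name daytype_sketch is hypothetical, and relying on as.POSIXlt()$wday is an assumption, chosen because it does not depend on the locale the way the day names returned by weekdays() do.

```r
# Hypothetical classifier: maps dates to a two-level factor.
daytype_sketch <- function(date) {
  # wday counts from 0 (Sunday) to 6 (Saturday)
  wday <- as.POSIXlt(as.Date(as.character(date)))$wday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# 2012-10-01 was a Monday, 2012-10-06 a Saturday, 2012-10-07 a Sunday.
daytype_sketch(c("2012-10-01", "2012-10-06", "2012-10-07"))
```

Because the sketch is vectorized, it could be applied to a whole date column in one call instead of row by row.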

Panel plot containing a time series plot (type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

df_weekday <- subset(df_steps_imputed,daytype=="weekday")
df_weekend <- subset(df_steps_imputed,daytype=="weekend")
dt_steps_per_day_weekday <- df_weekday %>% group_by(interval) %>%
    summarise(mean=mean(steps))
dt_steps_per_day_weekend <- df_weekend %>% group_by(interval) %>%
    summarise(mean=mean(steps))

par(mfrow=c(2,1))

plot(dt_steps_per_day_weekday$interval,
    dt_steps_per_day_weekday$mean,
    type="l",
    xlab="Interval",
    ylab="Average number of steps (weekdays)",
    main="Average Number of Steps Per Interval On Weekdays")

plot(dt_steps_per_day_weekend$interval,
    dt_steps_per_day_weekend$mean,
    type="l",
    xlab="Interval",
    ylab="Average number of steps (weekends)",
    main="Average Number of Steps Per Interval On Weekends")

[Figure: two-panel line plot of the average number of steps per interval on weekdays and on weekends]

dev.print(png, 
    'figure/multipanelplot_steps_weekdays_weekends.png',
    width=640,
    height=800)
## png 
##   2
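
dev.print() copies from an open screen device, which may not exist in a non-interactive run. As an alternative sketch, the block below draws a two-panel figure directly into a PNG file and restores the graphics parameters afterwards. The data, the temporary output path and the variable names are placeholders, not the project's own.

```r
x <- seq(0, 2355, length.out = 50)        # placeholder x values
out_file <- file.path(tempdir(), "panel_sketch.png")

png(out_file, width = 640, height = 800)  # open a file device
old_par <- par(mfrow = c(2, 1))           # par() returns the previous settings
plot(x, sin(x / 400), type = "l",
     xlab = "Interval", ylab = "Value", main = "Panel 1")
plot(x, cos(x / 400), type = "l",
     xlab = "Interval", ylab = "Value", main = "Panel 2")
par(old_par)                              # restore the single-panel layout
dev.off()                                 # close the device to write the file

file.exists(out_file)
```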

Cleanup: reset the working directory to its original value:

setwd(oldWD)
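
Saving and restoring the working directory by hand leaves it changed if the code in between fails. One possible way to make the reset automatic is sketched below; with_wd is a hypothetical helper, not part of the project or of the "here" package.

```r
# Hypothetical helper: evaluate `code` with `dir` as the working directory.
with_wd <- function(dir, code) {
  old <- setwd(dir)                  # setwd() returns the previous directory
  on.exit(setwd(old), add = TRUE)    # runs even if `code` throws an error
  force(code)                        # evaluate the lazy argument inside `dir`
}

before <- getwd()
inside <- with_wd(tempdir(), getwd())
after  <- getwd()

identical(before, after)             # TRUE: the original directory is restored
```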