“Reproducible Research: Peer Assessment 1” - B. A. Benayoun

Loading and preprocessing the data

First, the csv file is read into an R object.

activity.data <- read.csv('activity.csv',header=T)

Then, steps are summed over each day, ignoring missing data. In addition, steps are averaged across days for each time interval.

daily.steps <- rowsum(activity.data$steps, group = activity.data$date, 
                      na.rm = T)

av.interval.steps <- aggregate(activity.data$steps, 
                               by = list(Interval = activity.data$interval),
                               FUN = mean, na.rm=T)
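One detail of the daily sum is worth illustrating: with na.rm = T, rowsum() discards missing values before summing, so a day made up entirely of missing values gets a total of 0 rather than NA. The sketch below uses made-up values (toy.steps and toy.date are hypothetical, not taken from activity.csv) to show how this can lower the mean of the daily totals.

```r
# Toy data (hypothetical values, not taken from activity.csv):
# day "d1" has only missing values, day "d2" has two recorded values.
toy.steps <- c(NA, NA, 10, 20)
toy.date  <- c("d1", "d1", "d2", "d2")

# rowsum() with na.rm = TRUE drops the NAs, so the all-missing day sums to 0
toy.sums.zero <- rowsum(toy.steps, group = toy.date, na.rm = TRUE)

# tapply() with a plain sum keeps the all-missing day as NA instead
toy.sums.na <- tapply(toy.steps, toy.date, sum)

# The zero total is averaged in; the NA total can be excluded explicitly
mean(toy.sums.zero)               # averages over both days
mean(toy.sums.na, na.rm = TRUE)   # averages over the observed day only
```

Whether such all-missing days should count as zero or be left out depends on the question being asked; the analysis here keeps them as zeros.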

What is mean total number of steps taken per day?

Here is a histogram of the total number of steps taken per day in the dataset, along with the code necessary to plot it:

hist(daily.steps,breaks=20,main = "Daily activity (missing values ignored)", 
     xlab = "Total number of steps taken per day",col="deeppink")

[Figure: histogram of the total number of steps taken per day, missing values ignored]

We then calculate the mean and median total number of steps taken per day:

my.mean <- mean(daily.steps)
my.median <- median(daily.steps)

The total number of steps taken per day has a mean of 9354.2295082 steps and a median of 10395 steps.

What is the average daily activity pattern?

Here is a plot of the number of steps taken, averaged across all days, as a function of the 5-minute interval:

plot(av.interval.steps$Interval,av.interval.steps$x,type='l',col="dodgerblue",
     xlab = "5-minute interval identifier",
     ylab = "Average steps taken (missing values ignored)")

[Figure: time series of the average number of steps taken per 5-minute interval]

Then we identify the 5-minute interval with the maximum average number of steps.

my.interval.idx <- which(av.interval.steps$x == max(av.interval.steps$x))
my.interval <- av.interval.steps$Interval[my.interval.idx]

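On average across all the days in the dataset, the 5-minute interval that contains the maximum number of steps is interval 835. Note that the interval identifiers appear to encode clock time (hours followed by minutes) rather than elapsed minutes, in which case 835 corresponds to 08:35.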
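To make an interval identifier easier to read, it can be turned into a clock time. The sketch below assumes the identifier is encoded as hours * 100 + minutes, which the dataset itself does not state; interval.to.clock is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: format an interval identifier as "HH:MM",
# ASSUMING the identifier is encoded as hours * 100 + minutes.
interval.to.clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

# Example identifiers: start of day, the peak interval, a late-evening interval
interval.to.clock(c(0, 835, 2355))
```

Under that assumption, the three identifiers above are rendered as 00:00, 08:35 and 23:55.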

Imputing missing values

We identify missing values and calculate their total number in the dataset:

my.nas <- which(is.na(activity.data$steps))
my.nas.number <- length(my.nas)

There are 2304 missing data points in the dataset.
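Beyond the overall count, it can help to see how the missing values are spread across days. A minimal sketch on made-up data (the toy data frame is hypothetical) tallies missing values per day with tapply():

```r
# Toy stand-in for activity.data (hypothetical values)
toy <- data.frame(steps = c(NA, NA, 5, NA, 7, 8),
                  date  = c("d1", "d1", "d2", "d2", "d3", "d3"))

# Total number of missing values, equivalent to length(which(is.na(...)))
toy.nas.number <- sum(is.na(toy$steps))

# Number of missing values per day: is.na() gives TRUE/FALSE, and
# summing logicals counts the TRUEs within each date
toy.nas.per.day <- tapply(is.na(toy$steps), toy$date, sum)
toy.nas.per.day
```

A day whose count equals the number of intervals per day has no recorded data at all.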

We impute each missing data point with the rounded (integer) mean number of steps, computed over non-missing data points, for the corresponding 5-minute interval. A copy of the dataset is created to receive the imputed values, which are taken from the per-interval averages calculated above.

To proceed with imputation, for each missing data point, we will:

1. identify the corresponding 5-minute interval;
2. extract the average value computed before for that interval;
3. round the value to obtain an integer number of steps;
4. record the imputed value.

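# work on a copy so that the original data stay untouched
my.imputed.activity.data <- activity.data

# seq_along() is safe even if there are no missing values
for (i in seq_along(my.nas)){
  my.cur.idx <- my.nas[i]
  # interval of the current missing data point
  my.cur.interval <- activity.data$interval[my.cur.idx]
  # row of the per-interval averages matching that interval
  my.av.interval <- which(av.interval.steps$Interval == my.cur.interval)
  # rounded mean, used as the imputed value
  my.new.data.pt <- round(av.interval.steps$x[my.av.interval])
  my.imputed.activity.data$steps[my.cur.idx] <- my.new.data.pt
}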
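The same imputation can also be written without an explicit loop. The sketch below is an alternative formulation, not the code used in this report: it runs on made-up data (toy.data and toy.means are hypothetical stand-ins for activity.data and av.interval.steps) and uses match() to look up all interval means at once.

```r
# Toy stand-ins for activity.data and av.interval.steps (hypothetical values)
toy.data <- data.frame(steps    = c(NA, 4, 10, NA),
                       interval = c(0, 5, 0, 5))
toy.means <- aggregate(toy.data$steps,
                       by = list(Interval = toy.data$interval),
                       FUN = mean, na.rm = TRUE)

toy.imputed <- toy.data
toy.nas <- which(is.na(toy.data$steps))

# match() returns, for every missing row at once, the row of the
# per-interval table that holds the mean for that row's interval
toy.rows <- match(toy.data$interval[toy.nas], toy.means$Interval)

# Replace each missing value with the rounded mean of its interval
toy.imputed$steps[toy.nas] <- round(toy.means$x[toy.rows])
toy.imputed
```

On the full dataset this should give the same values as the loop, since both apply the same lookup and rounding to each missing row.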

We re-calculate the daily steps using the data with imputed missing values:

imputed.daily.steps <- rowsum(my.imputed.activity.data$steps, 
                              group = my.imputed.activity.data$date)

Now, here is a histogram of the total number of steps taken per day after imputation, along with the code necessary to plot it:

hist(imputed.daily.steps,breaks=20,main = "Daily activity (imputed missing values)", 
     xlab = "Total number of steps taken per day",col="deeppink")

[Figure: histogram of the total number of steps taken per day, imputed missing values]

We then calculate the updated mean and median total number of steps taken per day:

my.mean.imp <- mean(imputed.daily.steps)
my.median.imp <- median(imputed.daily.steps)

After imputation of missing values, the total number of steps taken per day has a mean of 10765.6393443 steps and a median of 10762 steps.

These new values differ from the original ones: both the mean (10765.6393443 vs. 9354.2295082) and the median (10762 vs. 10395) have increased. This is probably because missing steps previously contributed nothing to the daily totals, whereas they are now replaced by interval averages and added to the daily tallies.

Are there differences in activity patterns between weekdays and weekends?

We determine which day of the week each measurement was taken on, and expand the imputed table to include a weekstatus factor indicating whether it was a weekend day or a weekday.

my.dotw <- weekdays(as.Date(activity.data$date))

my.weekends <- which(my.dotw %in% c("Saturday", "Sunday"))

my.imputed.activity.data$weekstatus <- 
  factor(rep("weekday",length(my.dotw)), levels = c("weekday","weekend"),ordered = T)

my.imputed.activity.data$weekstatus[my.weekends] <- "weekend"
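Because weekdays() returns day names in the current locale, comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch, offered as an alternative and not as the code used here, relies on the numeric weekday stored in POSIXlt objects (0 is Sunday, 6 is Saturday). The dates below are illustrative only.

```r
# Illustrative dates: Friday, Saturday, Sunday, Monday
toy.dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is the day of the week as a number, 0 (Sunday) to 6 (Saturday),
# and does not depend on the locale
toy.is.weekend <- as.POSIXlt(toy.dates)$wday %in% c(0, 6)

# Build the two-level factor in one step
toy.weekstatus <- factor(ifelse(toy.is.weekend, "weekend", "weekday"),
                         levels = c("weekday", "weekend"))
toy.weekstatus
```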

We first calculate the per-interval averages for each type of day:

av.interval.steps.weekStatus <- aggregate(my.imputed.activity.data$steps, 
                               by = list(Interval = my.imputed.activity.data$interval,
                               DayType = my.imputed.activity.data$weekstatus),
                               FUN = mean)

Then we plot the time series data for weekdays and weekends, using base graphics:

my.we.idx <- which(av.interval.steps.weekStatus$DayType == "weekend")

par(mfrow = c(2,1))
par(cex = 0.6)
par(mar = c(0, 0, 0, 0), oma = c(4, 4, 5, 1))
par(mgp = c(2, 0.6, 0))

plot(av.interval.steps.weekStatus$Interval[my.we.idx],
     av.interval.steps.weekStatus$x[my.we.idx],type='l',
     col="dodgerblue",axes = FALSE)
box(col = "black")
axis(2, at = seq(0,150, 25),las=2)
mtext("Weekend", side = 3, line = -1.5, adj = 0.5, cex = 1)

plot(av.interval.steps.weekStatus$Interval[-my.we.idx],
     av.interval.steps.weekStatus$x[-my.we.idx],type='l',
     col="dodgerblue",axes = FALSE)
box(col = "black")
axis(2, at = seq(0,200, 25),las=2)
mtext("Weekday", side = 3, line = -1.5, adj = 0.5, cex = 1)

axis(1, at = seq(0,2500, 500),las=1) 

mtext("5-minute interval identifier", side = 1, outer = TRUE, line = 2.2)
mtext("Average steps taken per time interval", side = 2, outer = TRUE, line = 2.2)

[Figure: time series of the average number of steps taken per interval, weekend (top panel) and weekday (bottom panel)]
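For comparison, a similar two-panel figure can be produced with much less layout code using the lattice package. This is a sketch under the assumption that lattice is installed; toy.avg is a hypothetical stand-in for av.interval.steps.weekStatus, with made-up values.

```r
library(lattice)

# Toy stand-in for av.interval.steps.weekStatus (hypothetical values)
toy.avg <- data.frame(
  Interval = rep(c(0, 5, 10), times = 2),
  DayType  = factor(rep(c("weekday", "weekend"), each = 3)),
  x        = c(1, 3, 2, 2, 2, 3)
)

# One panel per day type, stacked in 1 column and 2 rows,
# with both panels sharing the same x axis
toy.plot <- xyplot(x ~ Interval | DayType, data = toy.avg, type = "l",
                   layout = c(1, 2),
                   xlab = "5-minute interval identifier",
                   ylab = "Average steps taken per time interval")
print(toy.plot)
```

The base graphics version gives finer control over margins and axes, while the lattice version handles panel layout and shared axes automatically.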

On average, weekday activity shows one very noticeable peak, whereas weekend activity is spread more evenly throughout the day.