Loading and preprocessing the data

First, we load the raw data.

Note: It is assumed that the file activity.csv is in the current working directory.

rData <- read.csv("activity.csv")
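Because the call above assumes the file is in the working directory, a small guard can make a missing file fail with a clearer message. This is only a minimal sketch; the helper name readActivity is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: read the CSV only if it exists, otherwise stop early
# with a message that shows where R was looking.
readActivity <- function(path = "activity.csv") {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  read.csv(path)
}
```

Calling readActivity() with no arguments would then behave like the read.csv call above when the file is present.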

Let’s take a look at the data.

str(rData)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
summary(rData)
##      steps                date          interval     
##  Min.   :  0.00   2012-10-01:  288   Min.   :   0.0  
##  1st Qu.:  0.00   2012-10-02:  288   1st Qu.: 588.8  
##  Median :  0.00   2012-10-03:  288   Median :1177.5  
##  Mean   : 37.38   2012-10-04:  288   Mean   :1177.5  
##  3rd Qu.: 12.00   2012-10-05:  288   3rd Qu.:1766.2  
##  Max.   :806.00   2012-10-06:  288   Max.   :2355.0  
##  NA's   :2304     (Other)   :15840

We observe that the date column is a factor, so we convert it to the Date class.

tData <- data.frame(steps = rData$steps, date = as.Date(rData$date, "%Y-%m-%d"), interval = rData$interval)
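The same conversion should work whether the column was read as a factor (the read.csv default before R 4.0.0) or as character (the default from R 4.0.0 on). A minimal sketch on made-up values, with illustrative variable names:

```r
# Toy date column, once as a factor and once as character (made-up values)
dFactor <- factor(c("2012-10-01", "2012-10-01", "2012-10-02"))
dChar   <- as.character(dFactor)

# as.Date() accepts both; the format string matches the YYYY-MM-DD layout
d1 <- as.Date(dFactor, "%Y-%m-%d")
d2 <- as.Date(dChar, "%Y-%m-%d")

class(d1)           # "Date"
identical(d1, d2)   # TRUE
```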

The data are now suitable for analysis.


What is mean total number of steps taken per day?

We calculate the total steps taken per day, ignoring the NAs.

sumSteps <- tapply(tData$steps, tData$date, sum, na.rm = TRUE)
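One detail of this call is easier to see on a toy example: with na.rm = TRUE, a day whose values are all NA gets a total of 0 rather than NA. The sketch below uses made-up numbers, not the activity data.

```r
# Made-up data: day "a" has only NAs, day "b" has real values
steps <- c(NA, NA, 10, 20)
day   <- c("a", "a", "b", "b")

# Same call shape as above: the all-NA day "a" becomes 0, day "b" becomes 30
totals <- tapply(steps, day, sum, na.rm = TRUE)

# Without na.rm the all-NA day stays NA and can be excluded explicitly
totalsNA  <- tapply(steps, day, sum)
meanNoNA  <- mean(totalsNA, na.rm = TRUE)   # 30, based on day "b" only
meanZeros <- mean(totals)                   # 15, pulled down by the 0
```

Which behaviour is preferable depends on whether a day without recordings should count as a day with zero steps.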

Let’s now plot the histogram of the total steps taken each day.

hist(sumSteps,10, col = "red")

We calculate the mean and the median of the total number of steps taken per day.

meanSteps <- mean(sumSteps)
medSteps <- median(sumSteps)
meanSteps
## [1] 9354.23
medSteps
## [1] 10395

What is the average daily activity pattern?

I will make a time series plot of the 5-minute intervals and the average number of steps taken, averaged across all days.

avIntSteps <- tapply(tData$steps, tData$interval, mean, na.rm = TRUE)
plot(x = as.numeric(names(avIntSteps)), y = avIntSteps, type = "l", xlab="5-Minute Interval", ylab = "Average Number of Steps Taken")

Now we find the 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps.

# First find which element has the maximum value and its name (interval)
maxIntSteps <- which.max(avIntSteps)
maxInt <- as.numeric(names(maxIntSteps))
# Plot interval on the time series chart
plot(x = as.numeric(names(avIntSteps)), y = avIntSteps, type = "l", xlab="5-Minute Interval", ylab = "Average Number of Steps Taken" )
abline(v=maxInt, lwd = 3, col = "darkgreen")

maxInt
## [1] 835

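So, as shown above, the 5-minute interval that contains the maximum average number of steps is the one labelled 835. Note that 835 is the interval's identifier, not its position in the sequence of intervals.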
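To read an interval label as a time of day, one can assume the labels encode clock time as hhmm, which fits the 288 values per day and the maximum of 2355 seen in the summary. Under that assumption, a minimal sketch (intervalToClock is a hypothetical helper name):

```r
# Assumption: interval labels are hhmm, e.g. 835 means 08:35
intervalToClock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

intervalToClock(c(0, 55, 100, 835, 2355))
# "00:00" "00:55" "01:00" "08:35" "23:55"

# Position of label 835 among the 288 daily intervals (12 per hour)
pos835 <- (835 %/% 100) * 12 + (835 %% 100) / 5 + 1   # 104
```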


Imputing missing values

Now let’s go back to the raw data and try to impute the missing values (NAs).

First I will calculate the number of missing values.

mValues <- sum(is.na(rData$steps))
mValues
## [1] 2304

Now we replace each missing value with the mean for the corresponding 5-minute interval.

# First find the indices of the NAs in the steps column of the data
mInd <- which(is.na(rData$steps))

# Then create a vector with the intervals of the rows that have NAs
mInt <- rData$interval[mInd]

# Find the values to replace the NAs with: for each interval in mInt,
# look up the matching average in avIntSteps by name
mValRepl <- apply(as.array(mInt), 1, FUN = function(x, y) y[names(y) == x],
                  y = avIntSteps)
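The same lookup can also be written as a named-vector index, which avoids the element-wise apply. A sketch on made-up interval means; avToy and mIntToy are illustrative names standing in for avIntSteps and mInt:

```r
# Made-up per-interval means, named by interval label like avIntSteps
avToy <- c("0" = 1.5, "5" = 0.3, "10" = 2.0)

# Intervals of the rows that have NAs (made-up)
mIntToy <- c(5, 0, 5, 10)

# Index by name: convert the labels to character so that R matches
# names rather than positions
replToy <- unname(avToy[as.character(mIntToy)])
replToy   # 0.3 1.5 0.3 2.0
```

The conversion to character matters here: indexing with the numeric value 5 would pick the fifth element instead of the element named "5".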

# Replace
nData <- data.frame(steps = replace(rData$steps, mInd, mValRepl), 
date = as.Date(rData$date), interval = rData$interval)

So nData is the new dataset that is equal to the original one but with the missing data filled in.

Let’s replot the histogram and calculate the mean and the median of total number of steps taken per day.

nsumSteps <- tapply(nData$steps, nData$date ,sum)
hist(nsumSteps,10, col = "darkgreen")

nmeanSteps <- mean(nsumSteps)
nmedSteps <- median(nsumSteps)
nmeanSteps
## [1] 10766.19
nmedSteps
## [1] 10766.19

We observe that imputing the missing values changed both the mean and the median. That was expected: in the first calculation the NAs were dropped with na.rm = TRUE, so days with no recorded steps were given a total of 0, which lowered the mean and widened the gap between the mean and the median.

Imputing the missing values removes that effect. The histogram shows that the distribution of the daily totals now looks closer to normal (which seems consistent with the Central Limit Theorem), and the mean equals the median.
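A likely reason the mean and the median coincide exactly is that a day made up entirely of NAs receives, after imputation, a total equal to the sum of the interval means, which is also the mean daily total of the observed days. The toy example below illustrates this on made-up numbers; it is a sketch of the mechanism, not a check on the activity data.

```r
# Made-up data: 3 days x 2 intervals, day 2 entirely missing
toy <- data.frame(
  steps    = c(10, 20, NA, NA, 30, 40),
  day      = rep(1:3, each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Per-interval means over the observed days: interval 0 -> 20, interval 5 -> 30
avToy <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill each NA with the mean of its interval
filled <- ifelse(is.na(toy$steps), avToy[as.character(toy$interval)], toy$steps)

# Daily totals after imputation: 30, 50, 70
totals <- tapply(filled, toy$day, sum)

# The imputed day's total equals the sum of the interval means
isTRUE(all.equal(as.numeric(totals["2"]), sum(avToy)))   # TRUE
```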


Are there differences in activity patterns between weekdays and weekends?

Now we separate the weekdays from the weekends. To do that, we add a factor column to our dataset.

library(dplyr)


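# Set the time locale to English so that weekdays() returns English day names
# ("English" is a Windows locale name; other platforms use different names)
Sys.setlocale("LC_TIME", "English")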
## [1] "English_United States.1252"
weekdays1 <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Add the new column
nData <- mutate(nData, day.type = factor((weekdays(nData$date) %in% weekdays1),
                                         levels=c(FALSE, TRUE), 
                                         labels=c('weekend', 'weekday'))) 
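If changing the locale is not desirable, the day type can also be derived from the numeric weekday, which does not depend on the language settings. This is an alternative sketch on made-up dates, not the approach used above:

```r
# Made-up dates: 2012-10-05 is a Friday, 2012-10-08 a Monday
d <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday <- as.POSIXlt(d)$wday

# Monday (1) to Friday (5) are weekdays, everything else is weekend
dayType <- factor(wday %in% 1:5, levels = c(FALSE, TRUE),
                  labels = c("weekend", "weekday"))
as.character(dayType)   # "weekday" "weekend" "weekend" "weekday"
```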

We make a panel plot containing a time series plot of the 5-minute interval and the average number of steps taken, averaged across all weekday days or weekend days.

# First I will divide the dataset into two subsets using the day.type factor

nDataWE <- nData[nData$day.type == "weekend",]
nDataWD <- nData[nData$day.type == "weekday",]

avIntStepsWE <- tapply(nDataWE$steps, nDataWE$interval, mean)
avIntStepsWD <- tapply(nDataWD$steps, nDataWD$interval, mean)

# I will plot 2 charts arranged in 2 rows and 1 column

par(mfrow = c(2,1))

plot (x = as.numeric(names(avIntStepsWE)), y = avIntStepsWE, type = "l", 
      xlab = "Interval", ylab = "Average Steps", main = "Weekend" )
plot (x = as.numeric(names(avIntStepsWD)), y = avIntStepsWD, type = "l",
      xlab = "Interval", ylab = "Average Steps", main = "Weekday" )
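The two subsets and the two tapply calls can also be combined into a single aggregate call that returns one row per interval and day type. A sketch on made-up data with the same column names as nData:

```r
# Made-up data in the same shape as nData (steps, interval, day.type)
toy <- data.frame(
  steps    = c(1, 3, 5, 7, 2, 4, 6, 8),
  interval = rep(c(0, 5), times = 4),
  day.type = factor(rep(c("weekday", "weekend"), each = 4))
)

# One row per (interval, day.type) combination with the mean number of steps
avByType <- aggregate(steps ~ interval + day.type, data = toy, FUN = mean)
avByType
```

The resulting long-format data frame can be filtered by day.type for base plots like the ones above, or passed to a panel-plotting function.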