The first thing we need to do is take a peek at the contents of the zip file.
fileName <- as.character(unzip("activity.zip", list=TRUE)$Name)
print(fileName)
## [1] "activity.csv"
Since the archive contains just one file, we can proceed to load the data without further action.
Reading the data into R:
data <- read.csv(unz("activity.zip", fileName))
Let's take a quick look at the data, just to make sure it's all properly loaded.
head(data)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
We're good to go.
Next, let's check the class of each variable in our data frame.
str(data)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Dates are stored as factors, so we need to take care of that using the lubridate package.
library(lubridate)
data$date <- ymd(data$date)
str(data)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : POSIXct, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Looking good.
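For readers without lubridate, a rough base R sketch of the same conversion follows. The vector `dateFactor` is a made-up stand-in for `data$date`. Going through `as.character()` first makes it explicit that we parse the factor labels rather than the underlying integer codes. Note that this yields a `Date` object rather than the `POSIXct` shown in the output above.

```r
# Toy stand-in for data$date (hypothetical values, stored as a factor)
dateFactor <- factor(c("2012-10-01", "2012-10-01", "2012-10-02"))

# Convert the labels to text first, then parse them as calendar dates
dateParsed <- as.Date(as.character(dateFactor), format = "%Y-%m-%d")

# Inspect the resulting class
class(dateParsed)
```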
As a personal note, I tend to prefer data in long format rather than wide format, and since the data is already in long format we will not reshape it.
I have also read that such a format is preferable when dealing with time series in R.
In order to perform the necessary transformations and summarise the data we will use the dplyr package.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We need to group our data by day, so that R knows we want to summarise data by day afterwards.
Then we summarise the data using the sum of all the steps taken, obtaining the total number of steps taken each day.
As suggested, we are ignoring missing values for now.
stepsByDay <- data %>%
na.omit() %>%
group_by(date) %>%
summarise(totalSteps=sum(steps))
Find the first 10 rows of the final result below.
head(stepsByDay, n=10)
## Source: local data frame [10 x 2]
##
## date totalSteps
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
## 7 2012-10-09 12811
## 8 2012-10-10 9900
## 9 2012-10-11 10304
## 10 2012-10-12 17382
str(stepsByDay)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53 obs. of 2 variables:
## $ date : POSIXct, format: "2012-10-02" "2012-10-03" ...
## $ totalSteps: int 126 11352 12116 13294 15420 11015 12811 9900 10304 17382 ...
## - attr(*, "drop")= logi TRUE
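Notice that the summary has 53 rows although the raw data covers 61 dates: `na.omit()` removes days that are entirely missing. Had we used `sum(steps, na.rm=TRUE)` instead, those days would appear with a total of zero and pull the mean down. The following base R sketch illustrates the difference on a made-up data frame (`toy` and its values are purely hypothetical).

```r
# Hypothetical toy data: day "d1" is entirely missing, day "d2" is complete
toy <- data.frame(date = c("d1", "d1", "d2", "d2"),
                  steps = c(NA, NA, 3, 4),
                  stringsAsFactors = FALSE)

# Approach 1: drop incomplete rows first, so "d1" disappears from the result
complete <- na.omit(toy)
dropped <- tapply(complete$steps, complete$date, sum)

# Approach 2: keep all rows and ignore NAs inside sum(), so "d1" totals 0
zeroed <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(dropped)
print(zeroed)
```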
We are now ready to make a histogram of the total number of steps taken each day using our summarised data.
library(ggplot2)
qplot(totalSteps,
data=stepsByDay,
geom="histogram",
binwidth=3000,
main="Histogram of the Total Number of Steps Taken Each Day\n",
ylab="Count of Days\n",
xlab="\nTotal Number of Steps per Day")
The histogram above, like all the remaining plots in this analysis, was made using the ggplot2 package.
Here we will summarise our data further.
We use the previously calculated total number of steps per day to compute the daily mean and median.
Please note that, unlike in the previous step, no grouping is required here.
stepsByDayCentralMeasures <- stepsByDay %>%
summarise(meanStepsByDay=as.numeric(mean(totalSteps)),
medianStepsByDay=as.numeric(median(totalSteps)))
print(stepsByDayCentralMeasures)
## Source: local data frame [1 x 2]
##
## meanStepsByDay medianStepsByDay
## 1 10766.19 10765
The mean is 10766.19 steps per day and the median is 10765 steps per day.
As required, we will plot the 5-minute interval against the average number of steps taken, averaged across all days.
To do this, we need to regroup our data using the identifier of the 5-minute interval.
Afterwards we summarise our data, obtaining the average number of steps, across all days, for each interval.
Note that we are still ignoring missing values at this point.
stepsByInterval <- data %>%
na.omit() %>%
group_by(interval) %>%
summarise(averageSteps=mean(steps))
Let's take a look at the first 10 lines of the summarised data.
head(stepsByInterval, n=10)
## Source: local data frame [10 x 2]
##
## interval averageSteps
## 1 0 1.7169811
## 2 5 0.3396226
## 3 10 0.1320755
## 4 15 0.1509434
## 5 20 0.0754717
## 6 25 2.0943396
## 7 30 0.5283019
## 8 35 0.8679245
## 9 40 0.0000000
## 10 45 1.4716981
With the data in the proper format, plotting our time series is right around the corner.
qplot(x=interval,
y=averageSteps,
data=stepsByInterval,
geom="line",
main="Average number of steps taken for each Interval, across all days\n",
xlab="\nInterval",
ylab="Number of Steps\n")
To find the 5-minute interval with the maximum average number of steps we will use base R, with no additional packages. We can simply subset our data.
First we need to discover which line contains our maximum value.
maxIndex <- which.max(stepsByInterval$averageSteps)
Once we know the index of our max value, we can use it to subset our data frame.
maxSteps <- stepsByInterval[maxIndex, ]
print(maxSteps)
## Source: local data frame [1 x 2]
##
## interval averageSteps
## 1 835 206.1698
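Therefore, the 5-minute interval with the maximum average number of steps is the one labelled 835. Since the interval identifiers appear to encode the time of day as HHMM, this corresponds to 08:35 rather than to the 835th minute of the day.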
During this period we observe 206.1698113 steps on average, across all days.
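To make the interval label easier to read, we can sketch a small conversion to clock time. This assumes the identifier encodes the time of day as HHMM (so 835 would mean 08:35), which is an assumption about the dataset rather than something shown in the output above. The helper `intervalToClock` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: turn an interval identifier such as 835 into "08:35",
# assuming the identifier encodes the time of day as HHMM
intervalToClock <- function(interval) {
  hours <- as.integer(interval %/% 100)    # leading digits are the hour
  minutes <- as.integer(interval %% 100)   # last two digits are the minutes
  sprintf("%02d:%02d", hours, minutes)
}

clockLabels <- intervalToClock(c(0, 55, 835))
print(clockLabels)
```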
The total number of missing values can be computed by applying the following expression to our raw data.
We can do this because R treats FALSE as 0 and TRUE as 1 when logical values are summed.
sumMissing <- sum(is.na(data$steps))
There are a total of 2304 missing values.
Alternatively we can sum the number of incomplete cases.
sum(!complete.cases(data))
## [1] 2304
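The coercion of logical values to numbers can be seen on a tiny made-up vector (the values below are purely illustrative). The same trick gives the share of missing values when we use `mean()` instead of `sum()`.

```r
# Summing a logical vector counts its TRUE entries
flags <- c(TRUE, FALSE, TRUE, TRUE)
nTrue <- sum(flags)

# Hypothetical step counts with two missing entries
toySteps <- c(NA, 12, NA, 7)
nMissing <- sum(is.na(toySteps))       # number of missing values
shareMissing <- mean(is.na(toySteps))  # proportion of missing values

print(c(nTrue = nTrue, nMissing = nMissing, shareMissing = shareMissing))
```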
To fill in the missing values we will use the mean for the corresponding 5-minute interval, as activity is highly variable throughout the day.
We can subset our data so we get a data frame with just incomplete cases.
Then, we join the resulting table with the average number of steps by interval and use it to fill our missing data.
missingValues <- data[which(is.na(data$steps)), ]
missingValues <- missingValues %>%
inner_join(stepsByInterval, by="interval") %>%
mutate(steps=averageSteps) %>%
select(-averageSteps)
The result is a new table with missing values filled with the mean for the specific 5-minute interval in which they occur.
head(missingValues)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
Now we need to replace the missing values in the original dataset with the values derived from our strategy.
We replicate our original data in a new table.
newData <- data
Then we fill the missing values with the values from our new table.
newData[which(is.na(newData$steps)), 1] <- missingValues[ , 1]
head(newData)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
Mission accomplished.
sum(!complete.cases(newData))
## [1] 0
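As an alternative sketch of the same strategy, base R's `ave()` can fill each missing value with the mean of its interval in a single step. This is not the approach used above, just a compact equivalent shown on a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical toy data: two intervals observed on three days each
toy <- data.frame(steps = c(NA, 2, 4, NA, 6, 10),
                  interval = c(0, 5, 0, 5, 0, 5))

# Within each interval, replace NAs by that interval's mean of observed values
toy$filled <- ave(toy$steps, toy$interval,
                  FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

print(toy)
```

Observed values are left untouched; only the `NA` entries receive the interval mean.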
We need to group our new data by day as we did before.
newStepsByDay <- newData %>%
group_by(date) %>%
summarise(totalSteps=sum(steps))
And the new histogram can be found below.
library(ggplot2)
qplot(totalSteps,
data=newStepsByDay,
geom="histogram",
binwidth=3000,
main="Histogram of the Total Number of Steps Taken Each Day\n",
ylab="Count of Days\n",
xlab="\nTotal Number of Steps per Day")
Not surprisingly, variability decreased and the distribution appears to be “thinner”, converging towards the center.
This happens because missing values appear in missing days: entire days for which there is no data.
Since we are filling entire days with the same interval means, these days all end up with the same total number of steps.
tapply(missingValues$steps, as.factor(missingValues$date), sum)
## 2012-10-01 2012-10-08 2012-11-01 2012-11-04 2012-11-09 2012-11-10
## 10766.19 10766.19 10766.19 10766.19 10766.19 10766.19
## 2012-11-14 2012-11-30
## 10766.19 10766.19
And this is why our distribution is now more centered: the total steps for these days are all equal to the daily mean.
Regarding the effect on the mean and median:
newStepsByDayCentralMeasures <- newStepsByDay %>%
summarise(meanStepsByDay=as.numeric(mean(totalSteps)),
medianStepsByDay=as.numeric(median(totalSteps)))
rbind(stepsByDayCentralMeasures, newStepsByDayCentralMeasures)
## Source: local data frame [2 x 2]
##
## meanStepsByDay medianStepsByDay
## 1 10766.19 10765.00
## 2 10766.19 10766.19
After filling in the missing data, the mean holds the same value while the median converged to the mean.
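The arithmetic behind this can be checked on a small made-up example: appending values equal to the mean leaves the mean unchanged, while the median is pulled towards it. The numbers below are hypothetical and chosen only for illustration.

```r
# Hypothetical daily totals for three observed days
observed <- c(100, 200, 600)
observedMean <- mean(observed)       # 300
observedMedian <- median(observed)   # 200

# Two "missing" days filled with the observed mean
filled <- c(observed, rep(observedMean, 2))
filledMean <- mean(filled)           # still 300
filledMedian <- median(filled)       # now 300, equal to the mean

print(c(observedMean, filledMean, observedMedian, filledMedian))
```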
We want to explore the differences in activity patterns between weekdays and weekends.
We will be using the filled-in missing values for this part.
We use the dplyr package to create a new column indicating whether a given date is a weekday or a weekend day.
Once that is done, we convert the new column from character to factor so that it can be used later on.
newData <- newData %>%
mutate(weekdays=ifelse(weekdays(date) == "Saturday" |
weekdays(date) == "Sunday",
"weekend",
"weekday"))
newData$weekdays <- as.factor(newData$weekdays)
str(newData)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : POSIXct, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ weekdays: Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
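One caveat: `weekdays()` returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch, using the numeric day of the week from `as.POSIXlt()` (0 is Sunday, 6 is Saturday), could look like the following. The `dates` vector is a made-up sample, not the full dataset.

```r
# Hypothetical sample: a Monday, a Saturday and a Sunday from October 2012
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# wday is 0 for Sunday through 6 for Saturday, regardless of locale
dayNumber <- as.POSIXlt(dates)$wday

# Label each date and store the result as a two-level factor
dayType <- factor(ifelse(dayNumber %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))

print(data.frame(dates, dayType))
```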
First thing to do is to get our data ready for plotting.
Note that this time we have an additional feature to group by: weekdays.
newStepsByInterval <- newData %>%
group_by(interval, weekdays) %>%
summarise(averageSteps=mean(steps))
head(newStepsByInterval, n=10)
## Source: local data frame [10 x 3]
## Groups: interval
##
## interval weekdays averageSteps
## 1 0 weekday 2.251153040
## 2 0 weekend 0.214622642
## 3 5 weekday 0.445283019
## 4 5 weekend 0.042452830
## 5 10 weekday 0.173165618
## 6 10 weekend 0.016509434
## 7 15 weekday 0.197903564
## 8 15 weekend 0.018867925
## 9 20 weekday 0.098951782
## 10 20 weekend 0.009433962
qplot(x=interval,
y=averageSteps,
data=newStepsByInterval,
geom="line",
main="Average number of steps taken for each Interval, across all days\n",
xlab="\nInterval",
ylab="Number of Steps\n",
facets=weekdays ~ .)
The activity patterns are effectively distinct for weekdays and weekend days.