This assignment uses data from a personal activity monitoring device that records data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) for one anonymous individual and contains the number of steps taken in each 5-minute interval of each day.
The data for this assignment can be downloaded from here.
The dataset has three variables (steps, date and interval), loaded and inspected as follows:
data <- read.csv("activity.csv", colClasses = c("integer", "Date", "integer"))
dim(data)
## [1] 17568 3
head(data)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
summary(data)
## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
The summary shows that missing values occur only in the steps variable.
The next steps of the analysis use a filtered copy of the original dataset from which the rows with NA values have been removed.
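The per-column count of missing values can also be checked directly instead of being read off the summary. The following is a minimal base R sketch on a small invented data frame; `toy` and `na.per.column` are hypothetical names, and the values are made up for illustration only.

```r
# Toy stand-in for the activity data (values are invented for illustration)
toy <- data.frame(
  steps    = c(NA, NA, 0L, 12L, 7L, NA),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02",
                       "2012-10-02", "2012-10-03", "2012-10-03")),
  interval = c(0L, 5L, 0L, 5L, 0L, 5L)
)

# Count missing values per column; only `steps` should be non-zero here
na.per.column <- colSums(is.na(toy))
print(na.per.column)
```

Applied to `data`, the same call would be expected to report the 2304 missing values under steps and none under date or interval, in line with the summary above.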
library(dplyr)
data.noNA <- data %>% filter(!is.na(steps))
As noticed during the preparation phase, the period of observation is from 2012-10-01 to 2012-11-30.
The number of steps is summed per day. The following graph presents the frequency of the different sums over the period of observation.
steps.by_day.sum <- summarise(group_by(data.noNA, date), steps = sum(steps))
library(ggplot2)
ggplot(steps.by_day.sum, aes(x = steps)) +
  geom_histogram() +
  labs(title = "Total number of steps taken per day") +
  labs(x = "Encountered daily steps count", y = "Frequency")
The next table contains the mean and the median of the total number of steps taken per day, computed on the filtered data:
steps.summary <- steps.by_day.sum %>%
  summarise(steps.mean = mean(steps), steps.median = median(steps))
cbind(data=factor(c("filtered")), steps.summary)
## data steps.mean steps.median
## 1 filtered 10766.19 10765
The number of steps is averaged per 5-minute interval across all days. The following graph presents these averages over the course of a day.
steps.by_interval.mean <- aggregate(steps ~ interval, data.noNA, mean)
qplot(interval,
      steps,
      data = steps.by_interval.mean,
      geom = c("line"),
      xlab = "Time series of 5 minutes intervals",
      ylab = "Average number of steps",
      main = "Daily activity")
interval.max.index <- which.max(steps.by_interval.mean$steps)
interval.max.steps <- steps.by_interval.mean[interval.max.index, ]
The 5-minute interval 835 has the highest average number of steps (about 206).
As seen during the preparation phase, only the steps variable has missing values. The observations with missing values fall on the following days:
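The interval identifiers appear to encode the time of day as hour * 100 + minute (the maximum is 2355 and there are 288 intervals per day). Under that assumption, the identifier 835 would correspond to 08:35. The helper below is a hypothetical sketch, not part of the original analysis.

```r
# Convert an HHMM-style interval identifier to a "HH:MM" clock label.
# Assumption: the identifier encodes hour * 100 + minute.
interval_to_clock <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock.labels <- interval_to_clock(c(0L, 835L, 2355L))
print(clock.labels)
```

A consequence of this encoding is that consecutive identifiers are not evenly spaced (55 is followed by 100), which is worth keeping in mind when the identifier is used directly as the x axis.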
data %>% filter(is.na(steps)) %>% group_by(date) %>% summarise(NA.count = n())
## # A tibble: 8 × 2
## date NA.count
## <date> <int>
## 1 2012-10-01 288
## 2 2012-10-08 288
## 3 2012-11-01 288
## 4 2012-11-04 288
## 5 2012-11-09 288
## 6 2012-11-10 288
## 7 2012-11-14 288
## 8 2012-11-30 288
The ratio of the missing values is the following:
steps.missing <- table(is.na(data$steps))
There are 2304 observations with NA values in the dataset, about 13% of all observations.
A missing steps value in an observation is replaced by the average number of steps for that day. The replacement would be done after joining the original dataset with the one containing the average number of steps per day.
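The percentage can be reproduced from the two counts reported above. The sketch below also shows an equivalent one-liner on a vector; `toy.steps` is an invented example, not the real data.

```r
# Share of missing values, using the counts reported above
n.missing   <- 2304
n.total     <- 17568
pct.missing <- round(100 * n.missing / n.total, 1)
print(pct.missing)

# Equivalent one-liner on a vector: the mean of a logical vector
# is the share of TRUE values
toy.steps <- c(NA, 3, 0, NA, 8, 1, 0, 2)
share.toy <- mean(is.na(toy.steps))
print(share.toy)
```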
steps.by_day.mean <- aggregate(steps ~ date, data.noNA, mean)
data.imputed.1 <- merge(x=data, y=steps.by_day.mean, by=c("date"), all = T)
head(data.imputed.1, 10)
## date steps.x interval steps.y
## 1 2012-10-01 NA 0 NA
## 2 2012-10-01 NA 5 NA
## 3 2012-10-01 NA 10 NA
## 4 2012-10-01 NA 15 NA
## 5 2012-10-01 NA 20 NA
## 6 2012-10-01 NA 25 NA
## 7 2012-10-01 NA 30 NA
## 8 2012-10-01 NA 35 NA
## 9 2012-10-01 NA 40 NA
## 10 2012-10-01 NA 45 NA
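This strategy does not solve the problem. The days with missing values are missing entirely (288 NA values each, one per interval), so no daily mean exists for them and the joined steps.y column is NA wherever steps.x is NA.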
sum(is.na(data.imputed.1$steps.y))
## [1] 2304
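The reason the daily mean cannot fill these gaps can be seen on a small invented example. The sketch below uses base R only, and the `toy` names are hypothetical. The first day has no observed values at all, so it gets no daily mean and the outer join leaves it unfilled.

```r
# Toy data: the first day is missing entirely, the second is complete
toy <- data.frame(
  steps    = c(NA, NA, NA, 4, 0, 8),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3)),
  interval = rep(c(0L, 5L, 10L), times = 2)
)

# The formula method of aggregate() drops NA rows by default
# (na.action = na.omit), so a day without observed values gets no row
toy.by_day.mean <- aggregate(steps ~ date, toy, mean)
print(toy.by_day.mean)

# The outer join then leaves the daily mean of the missing day as NA
toy.merged <- merge(toy, toy.by_day.mean, by = "date", all = TRUE)
n.unfilled <- sum(is.na(toy.merged$steps.y))
print(n.unfilled)
```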
A missing steps value in an observation is replaced by the average number of steps for that 5-minute interval. The replacement is done after joining the original dataset with the one containing the average number of steps per interval.
data.imputed.2 <- merge(x=data, y=steps.by_interval.mean, by=c("interval"), all = T) %>%
  mutate(steps = ifelse(is.na(steps.x), steps.y, steps.x)) %>%
  select(steps, date, interval) %>%
  arrange(date)
head(data.imputed.2, 10)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
## 7 0.5283019 2012-10-01 30
## 8 0.8679245 2012-10-01 35
## 9 0.0000000 2012-10-01 40
## 10 1.4716981 2012-10-01 45
This strategy solves the data cleaning problem: the steps variable of the imputed dataset no longer contains missing values.
sum(is.na(data.imputed.2$steps))
## [1] 0
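The same interval-mean imputation can be expressed without a join, which keeps the original row order. This is an alternative sketch in base R on invented data, not the method used in this report; `toy`, `interval.mean` and `steps.filled` are hypothetical names.

```r
# Toy data: three days, three intervals, some missing step counts
toy <- data.frame(
  steps    = c(NA, NA, NA, 4, 0, 8, 2, 6, NA),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"),
                         each = 3)),
  interval = rep(c(0L, 5L, 10L), times = 3)
)

# Mean steps per interval, ignoring missing values
interval.mean <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Look up the interval mean of each row by name, then use it
# only where the observed value is missing
fill <- interval.mean[as.character(toy$interval)]
toy$steps.filled <- unname(ifelse(is.na(toy$steps), fill, toy$steps))
print(toy)
```

Checking the filled column itself, for example with `anyNA(toy$steps.filled)`, confirms that the imputation left no gaps.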
data.imputed <- data.imputed.2
steps.by_day.imputed <- aggregate(steps ~ date, data.imputed, sum)
ggplot(steps.by_day.imputed, aes(x = steps)) +
  geom_histogram() +
  labs(title = "Total number of steps taken each day (imputed)") +
  labs(x = "Encountered daily steps count", y = "Frequency")
The next table compares the mean and the median of the total number of steps taken per day, before and after the cleaning of missing data:
steps.summary.imputed <- steps.by_day.imputed %>%
  summarise(steps.mean = mean(steps), steps.median = median(steps))
cbind(data=factor(c("filtered","imputed")), rbind(steps.summary, steps.summary.imputed))
## data steps.mean steps.median
## 1 filtered 10766.19 10765.00
## 2 imputed 10766.19 10766.19
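Imputing the missing steps values has a slight impact on the median, while the mean stays constant. Each of the 8 fully missing days receives the sum of the interval means as its total, which equals the mean daily total of the observed days (10766.19).
The base is now the imputed dataset. The new day.type variable classifies the observations into weekdays and weekend days.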
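The behaviour of the two indicators can be reproduced with invented numbers. In the sketch below, the hypothetical `observed` vector holds made-up daily totals, and two fully missing days are assigned the observed mean as their total, which is what interval-mean imputation amounts to when whole days are missing.

```r
# Invented daily totals for five observed days
observed <- c(8000, 9500, 10000, 12000, 15000)

# Two fully missing days each receive the observed mean as their total
imputed <- c(observed, rep(mean(observed), 2))

mean.before   <- mean(observed)
mean.after    <- mean(imputed)
median.before <- median(observed)
median.after  <- median(imputed)

print(c(mean.before, mean.after))
print(c(median.before, median.after))
```

Adding values equal to the mean leaves the mean unchanged, while the median is pulled towards the mean because the added values sit in the middle of the sorted totals.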
data.imputed <- data.imputed %>%
  mutate(weekday.posix = as.POSIXlt(date)$wday,
         day.type = ifelse(weekday.posix %in% c(0,6), "weekend", "weekday"))
head(data.imputed, 5)
## steps date interval weekday.posix day.type
## 1 1.7169811 2012-10-01 0 1 weekday
## 2 0.3396226 2012-10-01 5 1 weekday
## 3 0.1320755 2012-10-01 10 1 weekday
## 4 0.1509434 2012-10-01 15 1 weekday
## 5 0.0754717 2012-10-01 20 1 weekday
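The classification relies on the `wday` component of `POSIXlt`, which numbers the days from 0 (Sunday) to 6 (Saturday). The sketch below checks this on three dates from the observation period; the `dates` vector is chosen for illustration only.

```r
# POSIXlt weekday numbers: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))
wday  <- as.POSIXlt(dates)$wday

# Same rule as in the analysis: 0 and 6 are weekend days
day.type <- ifelse(wday %in% c(0, 6), "weekend", "weekday")
print(data.frame(date = dates, wday = wday, day.type = day.type))
```

Using the numeric `wday` rather than the day names returned by `weekdays()` keeps the classification independent of the session locale.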
The number of steps is averaged per 5-minute interval, separately for weekdays and weekend days. The following graph compares the two daily activity patterns.
steps.by_interval.by_daytype.mean <- aggregate(steps ~ interval + day.type, data.imputed, mean)
qplot(interval,
      steps,
      data = steps.by_interval.by_daytype.mean,
      geom = c("line"),
      xlab = "Time series of 5 minutes intervals",
      ylab = "Average number of steps",
      main = "Daily activity (from imputed data)") +
  facet_wrap(~ day.type, ncol = 1)