Devices that measure physical activity have become much more common in recent years. This document practices with the R language using a data set collected from one such activity-monitoring device.
The variables included in this dataset are:
* steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
* date: The date on which the measurement was taken, in YYYY-MM-DD format
* interval: Identifier for the 5-minute interval in which the measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
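As a quick sanity check on the observation count, the arithmetic can be sketched in R. The assumption that every day is recorded with a complete set of 5-minute intervals is not stated in the text; it is only used here to show what the count would imply.

```r
# Number of 5-minute intervals in one 24-hour day
intervals_per_day <- 24 * 60 / 5

# Number of observations reported for the dataset
n_obs <- 17568

# Implied number of days, ASSUMING every day contains all of its intervals
n_days <- n_obs / intervals_per_day

print(c(intervals_per_day = intervals_per_day, n_days = n_days))
```

Under that assumption the count works out to a whole number of days, which suggests the file holds one row per interval per day.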
Reading data file
I’ve used the dplyr and lattice packages to group, summarise and plot the data.
invisible(library(dplyr))
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lattice)
Sys.setlocale("LC_TIME", "C")
## [1] "C"
data <- read.csv("activity.csv", sep = ",",
                 col.names = c("steps", "date", "interval"),
                 colClasses = c("integer", "Date", "integer"))
Filter out the rows where the steps variable is NA and store the result in noNAS.
noNAS <- filter(data, !is.na(steps))
1.1. Make a histogram of the total number of steps taken each day.
noNAS %>%
  group_by(date) %>%
  summarise(totalSteps = sum(steps)) %>%
  with(histogram(totalSteps, breaks = 14, layout = c(1, 1),
                 xlab = "Total Steps per day", ylab = "Percent of total"))
1.2. Calculate and report the mean and median total number of steps taken per day.
noNAS %>%
  group_by(date) %>%
  summarise(totalSteps = sum(steps)) %>%
  ungroup() %>%
  summarise(meanSteps = mean(totalSteps),
            medianSteps = median(totalSteps))
## Source: local data frame [1 x 2]
##
## meanSteps medianSteps
## 1 10766.19 10765
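The way missing values are handled affects these daily totals. The sketch below uses a small made-up data frame (`toy`, not the real data) and base R only. It shows that dropping NA rows first removes an all-NA day entirely, while summing with `na.rm = TRUE` keeps that day with a total of 0 and pulls the mean down.

```r
# Hypothetical toy data: the second day has no recorded steps at all
toy <- data.frame(
  steps = c(10L, 20L, NA, NA, 30L, 40L),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# Drop NA rows first, then total per day: the all-NA day disappears
obs <- toy[!is.na(toy$steps), ]
dropped <- tapply(obs$steps, obs$date, sum)

# Keep every row and sum with na.rm = TRUE: the all-NA day gets a total of 0
kept <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(dropped)
print(kept)
print(c(mean_dropped = mean(dropped), mean_kept = mean(kept)))
```

Filtering first, as done for noNAS, avoids counting unrecorded days as days with zero steps.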
2.1. Time series plot (type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
noNAS %>%
  group_by(interval) %>%
  summarise(meanSteps = mean(steps)) %>%
  with(xyplot(meanSteps ~ interval, type = "l",
              main = "Average Number of Steps Taken per 5-minute Interval",
              xlab = "5-minute interval",
              ylab = "Average Steps"))
2.2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
noNAS %>%
  group_by(interval) %>%
  mutate(meanSteps = mean(steps)) %>%
  ungroup() %>%
  top_n(meanSteps, n = 1)
## Source: local data frame [53 x 4]
##
## steps date interval meanSteps
## 1 0 2012-10-02 835 206.1698
## 2 19 2012-10-03 835 206.1698
## 3 423 2012-10-04 835 206.1698
## 4 470 2012-10-05 835 206.1698
## 5 225 2012-10-06 835 206.1698
## 6 0 2012-10-07 835 206.1698
## 7 635 2012-10-09 835 206.1698
## 8 0 2012-10-10 835 206.1698
## 9 747 2012-10-11 835 206.1698
## 10 742 2012-10-12 835 206.1698
## .. ... ... ... ...
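The output lists 53 rows because mutate() keeps one row per day, but every row names the same interval, 835, with a mean of about 206.17 steps. A more compact variant is sketched below on made-up data (`toy` and `busiest` are illustrative names): it summarises to one row per interval and keeps only the row with the largest mean.

```r
library(dplyr)

# Hypothetical toy data: three intervals observed on two days
toy <- data.frame(
  steps    = c(0L, 100L, 20L, 10L, 300L, 40L),
  interval = c(0L, 5L, 10L, 0L, 5L, 10L)
)

toy %>%
  group_by(interval) %>%
  summarise(meanSteps = mean(steps)) %>%        # one row per interval
  filter(meanSteps == max(meanSteps)) -> busiest  # keep the largest mean

print(busiest)
```

If two intervals tied for the maximum, this filter would return both rows.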
3.1. Count the missing values in each variable. Only the steps variable has missing values.
nrow(filter(data, is.na(steps)))
## [1] 2304
nrow(filter(data, is.na(date)))
## [1] 0
nrow(filter(data, is.na(interval)))
## [1] 0
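The three counts can also be obtained in one call. This is a minimal sketch on made-up data (`toy` is illustrative), using base R's `is.na()` and `colSums()`.

```r
# Hypothetical toy data with NAs only in the steps column
toy <- data.frame(
  steps    = c(NA, 5L, NA, 12L),
  date     = as.Date(c("2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```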
3.2. Create a data frame df_media_interval with the mean number of steps for each interval.
noNAS %>%
  group_by(interval) %>%
  summarise(meanSteps = mean(steps)) -> df_media_interval
3.3. Replace each missing value with the mean number of steps for its interval.
for (i in seq_len(nrow(data))) {
  if (is.na(data$steps[i])) {
    # Look up the mean for this row's interval and use it as the imputed value
    data$steps[i] <- df_media_interval$meanSteps[df_media_interval$interval == data$interval[i]]
  }
}
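The same imputation can be done without a row-by-row loop. The sketch below uses made-up data and base R's `match()` as a lookup; the names `toy`, `interval_means`, `idx` and `missing` are illustrative.

```r
# Hypothetical toy data: two intervals over two days, two missing values
toy <- data.frame(
  steps    = c(NA, 10, 4, NA),
  interval = c(0L, 5L, 0L, 5L)
)

# Mean steps per interval, computed from the observed rows only
obs <- toy[!is.na(toy$steps), ]
interval_means <- aggregate(steps ~ interval, data = obs, FUN = mean)

# For every row, find the position of its interval in the lookup table
idx <- match(toy$interval, interval_means$interval)

# Fill only the missing entries; observed values are left untouched
missing <- is.na(toy$steps)
toy$steps[missing] <- interval_means$steps[idx[missing]]

print(toy)
```

On a dataset of this size the loop is fast enough, so the vectorised form is mainly a matter of style.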
3.4. Check that no NA values remain.
nrow(filter(data, is.na(steps)))
## [1] 0
3.5. Histogram of the total number of steps taken each day
data %>%
  group_by(date) %>%
  summarise(totalSteps = sum(steps)) %>%
  with(histogram(totalSteps, breaks = 14, layout = c(1, 1),
                 xlab = "Total Steps per day", ylab = "Percent of total"))
3.6. Calculate and report the mean and median total number of steps taken per day
data %>%
  group_by(date) %>%
  summarise(totalSteps = sum(steps)) %>%
  ungroup() %>%
  summarise(meanSteps = mean(totalSteps),
            medianSteps = median(totalSteps))
## Source: local data frame [1 x 2]
##
## meanSteps medianSteps
## 1 10766.19 10766.19
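The mean is the same in both datasets (10766.19), and the median changes only slightly, from 10765 to 10766.19, so imputing the missing values has little effect on these estimates.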
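One possible explanation for the median landing exactly on the mean is that the missing values cover whole days. The text does not state this, so treat it as an assumption. The sketch below, on made-up data, shows the mechanism: a day filled entirely with interval means gets a total equal to the mean daily total, so several such days leave the mean unchanged and can pull the median onto it.

```r
# Hypothetical toy data: 5 days x 2 intervals; days 4 and 5 are entirely missing
toy <- data.frame(
  day      = rep(1:5, each = 2),
  interval = rep(c(0L, 5L), times = 5),
  steps    = c(4, 6, 8, 12, 30, 60, NA, NA, NA, NA)
)

# Daily totals from the observed rows only
obs <- toy[!is.na(toy$steps), ]
totals_before <- tapply(obs$steps, obs$day, sum)

# Impute each NA with the mean of its interval (same idea as step 3.3)
interval_means <- tapply(obs$steps, obs$interval, mean)
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <-
  as.numeric(interval_means[as.character(filled$interval[missing])])
totals_after <- tapply(filled$steps, filled$day, sum)

print(c(mean_before = mean(totals_before), median_before = median(totals_before)))
print(c(mean_after = mean(totals_after), median_after = median(totals_after)))
```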
4.1. Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
data %>%
mutate(day_type = as.factor(ifelse(weekdays(as.Date(date)) %in% c("Saturday", "Sunday"), "weekend", "weekday"))) -> data
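Because weekdays() returns day names in the current locale, the comparison with "Saturday" and "Sunday" relies on the Sys.setlocale("LC_TIME", "C") call made earlier. A locale-independent alternative is sketched below using the numeric `wday` field of base R's `as.POSIXlt()`; the sample dates are chosen for illustration.

```r
# A few sample dates; 2012-10-06 and 2012-10-07 fall on a weekend
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric: 0 = Sunday, 1 = Monday, ..., 6 = Saturday,
# so the result does not depend on the LC_TIME locale
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

print(data.frame(dates, day_type))
```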
4.2. Make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
data %>%
  group_by(interval, day_type) %>%
  summarise(averaged_steps = mean(steps)) %>%
  xyplot(averaged_steps ~ interval | day_type, data = ., type = "l", layout = c(1, 2))