The data come from a personal activity monitoring device that records data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) from an anonymous individual and includes the number of steps taken in each 5-minute interval of each day. The data structure is as follows:
Observations: There are a total of 17,568 observations in this dataset.
Variables:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken
The data for this project are available here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
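As a quick consistency check on the figures above, a 5-minute sampling interval gives 288 intervals per day, and October and November together have 61 days. The sketch below only redoes that arithmetic; the variable names are illustrative and not part of the original analysis.

```r
# Consistency check: 5-minute intervals over October and November 2012.
intervals_per_day <- 24 * 60 / 5      # 288 five-minute intervals in a day
days <- 31 + 30                       # October has 31 days, November has 30
expected_rows <- intervals_per_day * days
expected_rows                         # 17568, matching the stated observation count
```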
Let’s fetch the data first
library(ggplot2,warn.conflicts = F) ; library(dplyr,warn.conflicts = F)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
if(! file.exists('activity.csv')) {
download.file(url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip',
destfile = 'data.zip')
unzip(zipfile = 'data.zip')
unlink('data.zip', recursive = T)
}
#loading the data
data <- read.csv('./activity.csv')
head(data)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
glimpse(data)
## Observations: 17,568
## Variables: 3
## $ steps <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ date <fct> 2012-10-01, 2012-10-01, 2012-10-01, 2012-10-01, 2012-...
## $ interval <int> 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 100, 10...
#loading the required packages
library(knitr)
library(dplyr)
library(ggplot2)
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
This part calculates the total number of steps taken per day and reports the mean and median of the daily totals.
TotalSteps <- data %>%
group_by(date) %>%
summarize(Total = sum(steps, na.rm = TRUE))
AvgStep <- mean(TotalSteps$Total, na.rm = TRUE)
MedianStep <- median(TotalSteps$Total, na.rm = TRUE)
TotalStepPlot <- hist(TotalSteps$Total, main = "Daily Total Steps",
xlab = "Number of Steps", ylab = "Frequency", breaks = 25, col = "aquamarine1")
AvgStep
## [1] 9354.23
MedianStep
## [1] 10395
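According to the plot and the results, the mean of the total number of steps taken per day is 9354.23 and the median is 10395. The mean is roughly 1,000 steps below the median, so the two are not especially close. Because sum(steps, na.rm = TRUE) returns 0 for any day on which every value is missing, such days show up as zero-step days in the histogram and pull the mean down. The distribution of daily totals is therefore skewed to the left rather than a symmetric bell curve.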
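One detail worth keeping in mind when reading these numbers: sum() with na.rm = TRUE returns 0 when every value in a group is NA, so a day with no recorded data is counted as a zero-step day. The toy example below (made-up values, not the activity data) illustrates how that lowers the mean, and shows one possible alternative that keeps such days as NA.

```r
# Toy data: two days with measurements, one day with no measurements at all.
toy <- data.frame(
  date  = rep(c("d1", "d2", "d3"), each = 2),
  steps = c(100, 200, NA, NA, 150, 250)
)

# An all-NA day sums to 0 when na.rm = TRUE.
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative: keep all-NA days as NA so they drop out of the summaries.
totals_na <- tapply(toy$steps, toy$date, function(s) {
  if (all(is.na(s))) NA_real_ else sum(s, na.rm = TRUE)
})

mean(totals_zero)                 # mean including the artificial zero day
mean(totals_na, na.rm = TRUE)     # mean over days that actually have data
```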
This step shows the average daily activity pattern as a time series plot: the number of steps in each 5-minute interval, averaged across all days.
AvgInterval <- data %>%
group_by(interval) %>%
summarize(Average = mean(steps, na.rm = TRUE))
MaxInterval <- AvgInterval$interval[which.max(AvgInterval$Average)]
plot(x = AvgInterval$interval, y = AvgInterval$Average, type = "l", col = "navy", lwd = 2, xlab = "5-minute Time Interval",
ylab = "Average Steps", main = "Daily Average Steps Activity Pattern")
MaxInterval
## [1] 835
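The interval identifiers in the output earlier (0, 5, ..., 55, 100, ...) appear to encode clock time as hour and minute rather than a running count of minutes, so the value printed above can be read as a time of day. A small helper, hypothetical and not part of the original analysis, makes that conversion explicit:

```r
# Convert an interval identifier such as 835 into "HH:MM".
# Assumption: identifiers encode hours * 100 + minutes, as the jump
# from 55 to 100 in the data suggests.
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 55, 100, 835, 2355))
```

Under that assumption, interval 835 corresponds to the 5 minutes starting at 08:35.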
According to MaxInterval, the interval with the highest average number of steps is 835.
Imputing missing values: 1. Report the total number of missing values
#Calculate and report the total number of missing values
Missing <- md.pattern(data)
print(Missing)
## date interval steps
## 15264 1 1 1 0
## 2304 1 1 0 1
## 0 0 2304 2304
Because a number of days and intervals have missing values (coded as NA), some calculations or summaries of the data may be biased. According to md.pattern, 15264 rows in the dataset are complete and 2304 rows have a missing steps value; date and interval are never missing.
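The same counts can be cross-checked without mice using base R. The sketch below runs on a small made-up data frame with the same column names; applied to the real data, sum(is.na(data$steps)) should reproduce the 2304 reported by md.pattern.

```r
# Toy stand-in for the activity data (values are invented for illustration).
toy <- data.frame(
  steps    = c(NA, 12, NA, 0, 7),
  date     = c("2012-10-01", "2012-10-02", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 0, 5, 5, 10)
)

n_missing  <- sum(is.na(toy$steps))     # rows with a missing step count
n_complete <- sum(complete.cases(toy))  # rows with no missing values at all
colSums(is.na(toy))                     # missing values per column

# Arithmetic on the figures reported above: 2304 missing rows correspond to
# 8 days' worth of 5-minute intervals (288 per day), and the complete and
# incomplete rows add up to the full dataset.
2304 / 288
15264 + 2304
```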
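The missing values are replaced using multiple imputation by chained equations (the mice package) with predictive mean matching (meth = 'pmm'). Five imputed datasets are generated (m = 5), and the third one is used for the rest of the analysis.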
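To give a feel for what predictive mean matching does, here is a deliberately simplified base R sketch for a single variable. It is an illustration under stated assumptions, not what mice runs internally: mice also draws the regression parameters at random and cycles over variables, both of which are omitted here. The function name pmm_sketch, the donor count k, and the toy data are all hypothetical.

```r
# Simplified predictive mean matching for one incomplete variable y with one
# complete predictor x. Not equivalent to mice(): no parameter draw, no chaining.
pmm_sketch <- function(y, x, k = 5) {
  df  <- data.frame(y = y, x = x)
  obs <- !is.na(df$y)
  fit  <- lm(y ~ x, data = df[obs, ])       # regression on observed cases only
  pred <- predict(fit, newdata = df)        # predicted mean for every case
  donor_values <- df$y[obs]
  donor_pred   <- pred[obs]
  n_donors <- min(k, length(donor_values))
  for (i in which(!obs)) {
    # The observed cases whose predicted means are closest to case i.
    nearest <- order(abs(donor_pred - pred[i]))[seq_len(n_donors)]
    # Borrow the observed value of one randomly chosen donor.
    df$y[i] <- donor_values[nearest[sample.int(n_donors, 1)]]
  }
  df$y
}

set.seed(1)                                  # arbitrary seed for a repeatable toy run
x <- 1:20
y <- 2 * x + rnorm(20)
y[c(3, 11, 17)] <- NA
imputed <- pmm_sketch(y, x)
```

Because every imputed value is copied from an observed case, the imputed values stay within the range of values that were actually recorded.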
data$date <- as.factor(data$date)
ImpData <- mice(data, m = 5, meth = 'pmm')
##
## iter imp variable
## 1 1 steps
## 1 2 steps
## 1 3 steps
## 1 4 steps
## 1 5 steps
## 2 1 steps
## 2 2 steps
## 2 3 steps
## 2 4 steps
## 2 5 steps
## 3 1 steps
## 3 2 steps
## 3 3 steps
## 3 4 steps
## 3 5 steps
## 4 1 steps
## 4 2 steps
## 4 3 steps
## 4 4 steps
## 4 5 steps
## 5 1 steps
## 5 2 steps
## 5 3 steps
## 5 4 steps
## 5 5 steps
## Warning: Number of logged events: 25
CompData <- complete(ImpData, 3)
data$date <- as.Date(data$date, format = "%Y-%m-%d")
CompData$date <- as.Date(CompData$date, format = "%Y-%m-%d")
md.pattern(CompData)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## steps date interval
## 17568 1 1 1 0
## 0 0 0 0
According to md.pattern, the new dataset, CompData, contains no missing values.
Now, with the missing data imputed, I calculate the total number of steps taken per day and report the mean and median of the daily totals.
TotalCompSteps <- CompData %>%
group_by(date) %>%
summarize(TotalComp = sum(steps))
AvgCompStep <- format(mean(TotalCompSteps$TotalComp), scientific = FALSE)
MedianCompStep <- median(TotalCompSteps$TotalComp)
TotalCompStepPlot <- hist(TotalCompSteps$TotalComp, main = "Daily Total Steps (Imputed Data)", xlab = "Number of Steps",
ylab = "Frequency", breaks = 25, col = "cyan4")
AvgCompStep
## [1] "9444.902"
MedianCompStep
## [1] 10395
In the run shown above, the mean of the total number of steps taken per day is 9444.902 and the median is 10395. Because mice() is called without a fixed seed, the imputed values, and therefore these estimates, change from run to run. Compared with the estimates from the first part (mean 9354.23, median 10395), the gap between the mean and the median is smaller than before, so the imputed data give a slightly less skewed estimate of the total daily number of steps.
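The comparison of the two gaps can be made explicit with the values printed in the outputs above (mean 9354.23 and median 10395 before imputation; mean 9444.902 and median 10395 after). This is plain arithmetic on those printed figures; the variable names are illustrative.

```r
# Mean and median of daily totals, copied from the printed outputs above.
before <- c(mean = 9354.23,  median = 10395)
after  <- c(mean = 9444.902, median = 10395)

gap_before <- before[["median"]] - before[["mean"]]   # about 1040.77
gap_after  <- after[["median"]]  - after[["mean"]]    # about 950.10
gap_after < gap_before                                # the gap narrows after imputation
```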
CompData$DayType <- ifelse(weekdays(CompData$date) %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
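One caveat with the line above: weekdays() returns day names in the current locale, so the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent variant, sketched here as an alternative rather than the method used in this report, relies on the numeric weekday from as.POSIXlt (0 is Sunday, 6 is Saturday). The helper name day_type is hypothetical.

```r
# Classify dates as "Weekend" or "Weekday" without relying on locale-specific names.
day_type <- function(dates) {
  wday <- as.POSIXlt(dates)$wday          # 0 = Sunday, ..., 6 = Saturday
  ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
}

# 2012-10-05 was a Friday, so the next two dates fall on a weekend.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
day_type(dates)
```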
I compare the activity patterns on weekdays and weekends to see whether they differ across intervals.
CompInterval <- CompData %>%
group_by(interval, DayType) %>%
summarize(IntervalAvg = mean(steps))
IntervalPlot <- ggplot(CompInterval, aes( x = interval, y = IntervalAvg, color = DayType))
IntervalPlot + geom_line(size = 1.5, alpha = 0.75) + facet_grid(DayType ~ .) + labs(title = "Daily Average Activity Pattern (Weekday & Weekend)", x= "5-minute Time Interval", y = "Average Steps") + theme(plot.title = element_text(hjust = 0.5))
The plot suggests that the onset of increasing activity shifts to the right on weekends, which means this individual tends to start the day later on weekends.
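The shift can be quantified roughly by finding, for each day type, the first interval whose average exceeds some threshold. The sketch below uses a made-up data frame shaped like CompInterval; the function name activity_onset and the threshold of 20 steps are arbitrary assumptions chosen for illustration.

```r
# First interval per day type where the average step count exceeds a threshold.
activity_onset <- function(df, threshold = 20) {
  above <- df[df$IntervalAvg > threshold, ]
  tapply(above$interval, above$DayType, min)
}

# Toy data with the same columns as CompInterval (values are invented).
toy <- data.frame(
  interval    = rep(c(500, 600, 700, 800), times = 2),
  DayType     = rep(c("Weekday", "Weekend"), each = 4),
  IntervalAvg = c(2, 30, 50, 60, 1, 3, 10, 40)
)
onset <- activity_onset(toy)
onset
```

Applied to as.data.frame(CompInterval), the same function would return one onset interval per day type, and a larger value for Weekend would support the reading of the plot.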