This assignment uses data from a personal activity monitoring device that records data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) from an anonymous individual and contains the number of steps taken in each 5-minute interval of each day.
Load the libraries required for this assignment:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
actv <- read.csv("activity.csv",na.strings = "NA")
tail(actv)
## steps date interval
## 17563 NA 2012-11-30 2330
## 17564 NA 2012-11-30 2335
## 17565 NA 2012-11-30 2340
## 17566 NA 2012-11-30 2345
## 17567 NA 2012-11-30 2350
## 17568 NA 2012-11-30 2355
actv$date <- as.Date(actv$date, format = "%Y-%m-%d")
str(actv)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
agg_step <- aggregate(actv$steps, by = list(actv$date), FUN = "sum", na.rm = TRUE)
hist(agg_step$x, col="blue", main = "Histogram plot of Total Number Steps (without missing values)", xlab = "Total Steps")
mnval <- mean(agg_step$x)
mdval <- median(agg_step$x)
Mean number of steps taken each day : 9354.2295082
Median number of steps taken each day : 10395
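A note on how `na.rm = TRUE` behaves inside `aggregate()`: a day whose readings are all `NA` still gets a row, with a total of 0, and that zero enters the mean and median. The sketch below uses a small hypothetical data frame (`toy`, not the assignment data) to contrast this with the formula interface, which drops `NA` rows before grouping. It illustrates the difference and is not a claim about which treatment the assignment expects.

```r
# Hypothetical toy data: day 1 has readings, day 2 is entirely missing
toy <- data.frame(
  steps = c(10, 20, NA, NA),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"))
)

# Same call style as above: the all-NA day is kept with a total of 0
with_zero <- aggregate(toy$steps, by = list(date = toy$date),
                       FUN = sum, na.rm = TRUE)

# Formula interface: rows containing NA are omitted first (na.action = na.omit),
# so the all-NA day does not appear in the result
without_na <- aggregate(steps ~ date, data = toy, FUN = sum)

mean(with_zero$x)       # average over 2 days, one of them a zero
mean(without_na$steps)  # average over the 1 day that has data
```

Which of the two is appropriate depends on whether a day without readings should count as a day with no steps.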
avg_step <- aggregate(actv$steps, by = list(actv$interval), FUN = "mean", na.rm = TRUE)
names(avg_step) <- c("interval","steps")
plot(avg_step$interval, avg_step$steps,type = "l", col = "red", main = "Time series plot of the average number of steps taken", xlab = "Time Interval", ylab = "Average Steps" )
max_avg <- max(avg_step$steps)
filter(avg_step, steps == max_avg)
## interval steps
## 1 835 206.1698
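The peak row can also be picked out with base R's `which.max()`, and the interval label turned into a clock time. The sketch below uses a hypothetical data frame (`avg_toy`) and assumes the interval identifier encodes the time of day as HHMM (so 835 would mean 08:35), which matches the values shown above but is an assumption about the dataset.

```r
# Hypothetical averages for four intervals
avg_toy <- data.frame(interval = c(0L, 55L, 835L, 2355L),
                      steps    = c(1.5, 0.2, 9.0, 3.2))

# Row with the highest average (the first one, if there are ties)
peak <- avg_toy[which.max(avg_toy$steps), ]

# Assumption: interval is HHMM, so pad to four digits and split hours/minutes
hhmm  <- sprintf("%04d", peak$interval)
clock <- paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))

# Minutes since midnight, useful for an evenly spaced time axis
minutes <- (avg_toy$interval %/% 100L) * 60L + avg_toy$interval %% 100L
```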
Calculate the total number of missing values in the dataset (i.e. the total number of rows with NAs)
summary(is.na(actv))
## steps date interval
## Mode :logical Mode :logical Mode :logical
## FALSE:15264 FALSE:17568 FALSE:17568
## TRUE :2304
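If a single number is wanted rather than a summary table, the count can be computed directly. A minimal sketch on a hypothetical data frame (`toy`):

```r
# Hypothetical toy data with two missing step counts
toy <- data.frame(steps = c(NA, 5, NA, 7), interval = c(0, 5, 10, 15))

n_missing_steps <- sum(is.na(toy$steps))      # NAs in one column
n_incomplete    <- sum(!complete.cases(toy))  # rows with at least one NA
na_per_column   <- colSums(is.na(toy))        # NAs in every column
```

`complete.cases()` answers the "rows with NAs" wording of the question, while `colSums(is.na(...))` shows which columns the missing values come from.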
Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
actv1 <- filter(actv, !is.na(steps)) # Rows with recorded steps
actv0 <- filter(actv, is.na(steps)) # Rows with missing steps
# Replace each missing value with the mean steps of its interval
for (i in unique(actv0$interval)) {
  actv0$steps[actv0$interval == i] <- avg_step$steps[avg_step$interval == i]
}
head(actv0)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
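The same interval-mean strategy can be written without a loop and without splitting the data, which also keeps the rows in their original order. The sketch below is one possible alternative using base R's `ave()`, run on a hypothetical data frame (`toy`) rather than the assignment data.

```r
# Hypothetical toy data: 3 days x 2 intervals, day 1 entirely missing
toy <- data.frame(
  steps    = c(NA, NA, 4, 10, 8, 20),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Mean of each interval over the rows that have data, repeated for every row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Copy the data and overwrite only the missing entries
toy_filled <- toy
missing <- is.na(toy$steps)
toy_filled$steps[missing] <- interval_mean[missing]
```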
Create a new dataset that is equal to the original dataset but with the missing data filled in.
actv_new <- rbind(actv0, actv1)
str(actv_new)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
agg_step_new <- aggregate(actv_new$steps, by = list(actv_new$date), FUN = "sum")
hist(agg_step_new$x, col="gray", main = "Histogram plot of Total Number Steps", xlab = "Total Steps")
# Re-calculate mean and median of Total number of steps per day after filling missing values
mnval_new <- mean(agg_step_new$x)
mdval_new <- median(agg_step_new$x)
Mean number of steps taken each day : 10766.189
Median number of steps taken each day : 10766.189
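Observation: Both the mean and the median are higher after the missing values are replaced than in the original dataset. In the first part, `sum(..., na.rm = TRUE)` gave any day with only missing readings a total of 0, which pulled both statistics down; after imputation those days receive a typical daily total instead.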
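The mean and median above are identical, which is what one would expect if the missing values fall on whole days (2304 missing values would be 8 days of 288 intervals each, though this is an inference from the counts, not something checked here). A day filled entirely with interval means has a total equal to the mean total of the observed days, so several such days sit exactly at the mean and can occupy the middle of the sorted totals. The sketch below checks that identity on a hypothetical matrix of toy values.

```r
# Hypothetical toy data: rows are days, columns are intervals;
# day 1 is entirely missing
steps <- matrix(c(NA, NA,
                   4, 10,
                   8, 20),
                nrow = 3, byrow = TRUE)

# Mean of each interval over the observed days
interval_mean <- colMeans(steps, na.rm = TRUE)

# Impute the missing day with the interval means
filled <- steps
filled[1, ] <- interval_mean

observed_totals <- rowSums(steps[-1, ])  # totals of the complete days
imputed_total   <- sum(filled[1, ])      # total of the imputed day

imputed_total          # same value as the line below
mean(observed_totals)
```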
Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
# Add a new column 'day' to hold day of the week
actv_new <- mutate(actv_new, day = weekdays(date))
# Add day category as 'weekend' and 'weekday'
actv_new$day_catg <- if_else(actv_new$day %in% c('Sunday','Saturday'), 'weekend', 'weekday')
actv_new$day_catg <- factor(actv_new$day_catg)
str(actv_new)
## 'data.frame': 17568 obs. of 5 variables:
## $ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ day : chr "Monday" "Monday" "Monday" "Monday" ...
## $ day_catg: Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
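One caveat: `weekdays()` returns day names in the current locale, so the comparison with 'Sunday' and 'Saturday' assumes an English locale. A locale-independent sketch, using a few hypothetical dates, relies on the numeric weekday from `as.POSIXlt()` instead:

```r
# Hypothetical dates: Friday through Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday <- as.POSIXlt(dates)$wday

day_catg <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
```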
Make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
avg_step_week <- aggregate(steps ~ interval + day_catg, data = actv_new, FUN = "mean")
head(avg_step_week)
## interval day_catg steps
## 1 0 weekday 2.25115304
## 2 5 weekday 0.44528302
## 3 10 weekday 0.17316562
## 4 15 weekday 0.19790356
## 5 20 weekday 0.09895178
## 6 25 weekday 1.59035639
library(ggplot2)
g <- ggplot(data = avg_step_week, aes(x = interval, y = steps))
g <- g + facet_grid(day_catg ~ .)
g <- g + geom_line()
g <- g + labs(title = "Avg Steps taken across weekday/weekend per Interval")
g <- g + labs(x = "Interval", y = "Average Steps")
print(g)
End of Assignment, Thanks.