data <- read.csv("activity.csv", colClasses = c("integer", "Date", "factor"))
sinNA <- na.omit(data)
For this part of the assignment, you can ignore the missing values in the dataset.
Make a histogram of the total number of steps taken each day.
Calculate and report the mean and median total number of steps taken per day.
attach(sinNA)
totalSteps <- aggregate(steps, list(Date = date), FUN = "sum")
mean(totalSteps$x)
## [1] 10766.19
median(totalSteps$x)
## [1] 10765
hist(totalSteps$x, col = "lightblue", xlab = "Number of Steps Taken Each Day",
ylab = "freq", main = "Histogram of Total Number of Steps Taken Each Day")
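Dropping rows with na.omit() before aggregating removes any day whose intervals are all missing, whereas summing with na.rm = TRUE would keep such a day with a total of 0 and pull the mean down. A minimal sketch of the difference on a made-up three-day data frame (`toy`, `dropped`, and `kept` are hypothetical names and values, not taken from activity.csv):

```r
# Hypothetical toy data: the second day is entirely missing
toy <- data.frame(
  steps = c(10, 20, NA, NA, 30, 40),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02",
                    "2012-10-02", "2012-10-03", "2012-10-03"))
)

# Approach used in this report: drop NA rows first, then sum per day
dropped <- aggregate(steps ~ date, data = na.omit(toy), FUN = sum)

# Alternative: keep every day and ignore NAs inside sum()
kept <- aggregate(toy$steps, list(date = toy$date), FUN = sum, na.rm = TRUE)

nrow(dropped)        # 2 days remain
mean(dropped$steps)  # 50
nrow(kept)           # 3 days, the missing one has a total of 0
mean(kept$x)         # 33.33333
```

Which behavior is appropriate depends on whether a day without measurements should count as a day with zero steps; this report treats it as no information.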
Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
library(ggplot2)
timeseries <- aggregate(steps, list(interval = as.numeric(as.character(interval))), FUN = "mean")
ggplot(timeseries, aes(interval, y = x)) + geom_line() + labs(title = "Time Series Plot of the 5-minute Interval", x = "5-minute intervals", y = "Average Number of Steps Taken")
maximo <- which.max(timeseries$x)
maximo
## [1] 104
timeseries[maximo, ]
## interval x
## 104 835 206.1698
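Note that 104 is a row position while 835 is the interval label. Assuming the labels encode clock time as hours * 100 + minutes (which agrees with row 104 being the 104th five-minute slot, starting 515 minutes after midnight), a label can be turned into a readable time. `interval_to_clock` is a hypothetical helper, not part of the original analysis:

```r
# Hypothetical helper: convert an interval label such as 835 to "08:35",
# assuming the label is hours * 100 + minutes
interval_to_clock <- function(label) {
  label <- as.integer(label)
  sprintf("%02d:%02d", label %/% 100L, label %% 100L)
}

interval_to_clock(835)  # "08:35"
(104 - 1) * 5           # 515 minutes after midnight, i.e. 8 h 35 min
```

Under that assumption, the most active interval on average is the one starting at 08:35.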
Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
My strategy is to fill each NA value in the steps column with the mean for that 5-minute interval. I then create a new dataset that is equal to the original dataset but with the missing data filled in.
newdata <- data
for (i in 1:nrow(newdata)) {
  if (is.na(newdata$steps[i])) {
    newdata$steps[i] <- timeseries[which(newdata$interval[i] == timeseries$interval), ]$x
  }
}
head(newdata)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
sum(is.na(newdata))
## [1] 0
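The row-by-row loop can also be written without a loop by looking up every missing row's interval mean in one step with match(). A minimal sketch on made-up data (`toy`, `means`, and `filled` are hypothetical and stand in for `data`, `timeseries`, and `newdata`):

```r
# Hypothetical toy data: two intervals observed on three days, two values missing
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean per interval over the observed values
# (the formula method of aggregate() drops NA rows by default)
means <- aggregate(steps ~ interval, data = toy, FUN = mean)

filled <- toy
missing <- is.na(filled$steps)
# match() finds, for each missing row, the row of `means` with the same interval
filled$steps[missing] <- means$steps[match(filled$interval[missing], means$interval)]

filled$steps          # 4 4 2 6 6 8
sum(is.na(filled))    # 0
```

In this report interval is a factor in newdata and numeric in timeseries, so converting both sides with as.character() before matching would avoid relying on implicit coercion.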
Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
I expect the mean not to change, because each NA value in the steps column is filled with the mean for its 5-minute interval; mathematically, I can show that the new mean is the same. The output below confirms this: the mean stays at 10766.19, while the median moves from 10765 to 10766.19.
totalSteps2 <- aggregate(newdata$steps, list(Date = newdata$date), FUN = "sum")
mean(totalSteps2$x)
## [1] 10766.19
median(totalSteps2$x)
## [1] 10766.19
hist(totalSteps2$x, col = "blue", xlab = "Number of Steps Taken Each Day",
     ylab = "freq", main = "Histogram of Total Number of Steps Taken Each Day")
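The claim that the mean does not change can be sketched as follows, under the assumption (not checked in the code above) that missing values occur only as entirely missing days. The symbols are introduced here for illustration: $s_{d,i}$ is the number of steps on observed day $d$ in interval $i$, $D$ is the number of observed days, and $M$ is the number of missing days.

```latex
\begin{align*}
m_i &= \frac{1}{D}\sum_{d=1}^{D} s_{d,i}
  && \text{(mean of interval } i \text{ over observed days)} \\
T_d &= \sum_{i} s_{d,i}, \qquad
\bar{T} = \frac{1}{D}\sum_{d=1}^{D} T_d
  && \text{(daily totals and their mean)} \\
T_{\text{imputed}} &= \sum_{i} m_i
  = \frac{1}{D}\sum_{d=1}^{D}\sum_{i} s_{d,i} = \bar{T}
  && \text{(total of a filled-in day)} \\
\bar{T}_{\text{new}} &= \frac{D\,\bar{T} + M\,\bar{T}}{D + M} = \bar{T}
  && \text{(mean after imputation)}
\end{align*}
```

Under the same assumption every filled-in day has a total of exactly $\bar{T}$, which is consistent with the new median coinciding with the mean (10766.19) in the output above.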
Creating a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
newdata$date <- as.Date(newdata$date, "%Y-%m-%d")
newdata$day <- weekdays(newdata$date)
# Weekend day names as returned by weekdays() in the Portuguese locale used for this run
happydays <- c("sábado", "domingo")
newdata$tipodia <- as.factor(ifelse(weekdays(newdata$date) %in% happydays, "weekend", "weekday"))
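Because weekdays() returns day names in the language of the system locale, the comparison against "sábado" and "domingo" only works on a machine with a Portuguese locale. A locale-independent alternative, sketched here on a few made-up dates (`dates` and `is_weekend` are hypothetical names), is to use the numeric day of the week:

```r
# Hypothetical dates: Friday through Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday counts days since Sunday: 0 = Sunday, 6 = Saturday
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
tipodia <- factor(ifelse(is_weekend, "weekend", "weekday"),
                  levels = c("weekday", "weekend"))

as.character(tipodia)  # "weekday" "weekend" "weekend" "weekday"
```

This yields the same two-level factor regardless of the language in which day names are printed.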
library(dplyr)
by_tipodia <- group_by(newdata, tipodia)
a <- summarize(by_tipodia, mean(steps))
a
## Source: local data frame [2 x 2]
##
## tipodia mean(steps)
## (fctr) (dbl)
## 1 weekday 35.61058
## 2 weekend 42.36640
by_tipodia
## Source: local data frame [17,568 x 5]
## Groups: tipodia [2]
##
## steps date interval day tipodia
## (dbl) (date) (fctr) (chr) (fctr)
## 1 1.7169811 2012-10-01 0 segunda-feira weekday
## 2 0.3396226 2012-10-01 5 segunda-feira weekday
## 3 0.1320755 2012-10-01 10 segunda-feira weekday
## 4 0.1509434 2012-10-01 15 segunda-feira weekday
## 5 0.0754717 2012-10-01 20 segunda-feira weekday
## 6 2.0943396 2012-10-01 25 segunda-feira weekday
## 7 0.5283019 2012-10-01 30 segunda-feira weekday
## 8 0.8679245 2012-10-01 35 segunda-feira weekday
## 9 0.0000000 2012-10-01 40 segunda-feira weekday
## 10 1.4716981 2012-10-01 45 segunda-feira weekday
## .. ... ... ... ... ...
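meanByTipodia <- aggregate(steps ~ interval + tipodia, data = newdata, FUN = mean)  # average per interval and day type
meanByTipodia$interval <- as.numeric(as.character(meanByTipodia$interval))
ggplot(meanByTipodia, aes(interval, steps)) + geom_line() + facet_wrap(~ tipodia, ncol = 1) +
  labs(x = "5-min intervals", y = "steps mean")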