The following piece of code installs (if necessary) and loads the packages used in this assignment.
# install.packages() expects the package name as a quoted string.
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}
if (!require(lubridate)) {
  install.packages("lubridate")
  library(lubridate)
}
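The three blocks above repeat the same install-then-load pattern. As a minimal sketch, the pattern could be wrapped in helper functions; `missing_packages` and `load_packages` are hypothetical names introduced here, not part of the original analysis.

```r
# Return the subset of `pkgs` that is not currently installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package in `pkgs`.
load_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) install.packages(to_install)
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage (not run here): load_packages(c("dplyr", "ggplot2", "lubridate"))
```

Using `requireNamespace()` for the check avoids attaching a package just to find out whether it is installed.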
We are now going to read the activity dataset.
act = read.csv("activity.csv", colClasses = c("numeric","Date","numeric"))
For this part of the assignment, we will ignore the missing values in the dataset.
The total number of steps taken per day is stored in the tibble “steps_per_day”, which we compute with the dplyr package.
steps_per_day = act %>%
  group_by(date) %>%
  summarize(total_steps = sum(steps, na.rm = TRUE))
head(steps_per_day)
## # A tibble: 6 x 2
## date total_steps
## <date> <dbl>
## 1 2012-10-01 0
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-10-04 12116
## 5 2012-10-05 13294
## 6 2012-10-06 15420
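One caveat when summing with missing values removed: a day with no recorded steps at all gets a total of 0 rather than NA, which may be why 2012-10-01 shows 0 above. Such zeros pull the mean and median of the daily totals downward. The following base R sketch illustrates the difference on made-up toy data; `sum_or_na` is a hypothetical helper, not part of the original analysis.

```r
# Toy data (made up): the first day has no recorded steps at all.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 100, 26)
)

# Summing with na.rm = TRUE turns the all-missing day into a total of 0.
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative: keep the total as NA when a day has no observations.
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
totals_na <- tapply(toy$steps, toy$date, sum_or_na)

# The two choices give different averages for the same data.
mean_with_zero <- mean(totals_zero)
mean_without   <- mean(totals_na, na.rm = TRUE)
```

Which treatment is appropriate depends on whether an all-missing day should count as a day with zero steps or as a day with no information.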
We will plot the histogram using the ggplot2 package.
qplot(steps_per_day$total_steps, binwidth = 500, xlab = "Total number of steps per day", ylab = "Frequency", main = "Histogram of total number of steps per day")
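`qplot()` has been deprecated since ggplot2 3.4.0, so newer installations emit a warning for the call above. A roughly equivalent histogram can be sketched with `ggplot()`; the toy data frame below reuses only the six daily totals printed earlier and stands in for the full `steps_per_day` tibble.

```r
library(ggplot2)

# Stand-in for steps_per_day: the six totals shown in the printed head().
toy_totals <- data.frame(
  total_steps = c(0, 126, 11352, 12116, 13294, 15420)
)

# Same bin width and labels as the qplot() call.
p <- ggplot(toy_totals, aes(x = total_steps)) +
  geom_histogram(binwidth = 500) +
  labs(
    x = "Total number of steps per day",
    y = "Frequency",
    title = "Histogram of total number of steps per day"
  )
```

Printing `p` (or calling `print(p)`) draws the plot; replacing `toy_totals` with `steps_per_day` should give the histogram for the full dataset.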
The mean and median of the total number of steps per day are calculated below.
mean_steps_per_day = mean(steps_per_day$total_steps)
median_steps_per_day = median(steps_per_day$total_steps)
Next, we calculate the average number of steps in each 5-minute interval, averaged across all days. We again use the dplyr package.
steps_per_interval = act %>%
  group_by(interval) %>%
  summarize(avg_steps = mean(steps, na.rm = TRUE))
head(steps_per_interval)
## # A tibble: 6 x 2
## interval avg_steps
## <dbl> <dbl>
## 1 0 1.7169811
## 2 5 0.3396226
## 3 10 0.1320755
## 4 15 0.1509434
## 5 20 0.0754717
## 6 25 2.0943396
We plot the time series from the above tibble using the ggplot2 package:
qplot(interval, avg_steps, data = steps_per_interval, geom = "line", xlab = "5-minute interval", ylab = "Average number of steps across all days", main = "Time series plot of interval and the number of steps, averaged across all days")
# Extract the interval as a plain number rather than a 1 x 1 tibble.
max_interval = steps_per_interval$interval[which.max(steps_per_interval$avg_steps)]
The 5-minute interval containing the maximum number of steps, on average across all the days in the dataset, is 835.
A number of days and intervals have missing values.
miss_vals = sum(is.na(act$steps))
The total number of missing values in the dataset is 2304.
To fill in the missing values, we replace each missing observation with the mean number of steps for that 5-minute interval, averaged across all days. Note that we have already calculated these means in the previous exercise.
The new data frame is called “act_imputed”. Its missing values are filled in according to the above strategy.
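act_imputed = act
# seq_along() is safe even when the vector is empty, unlike 1:length().
for (i in seq_along(act_imputed$steps)) {
  if (is.na(act_imputed$steps[i])) {
    # Find the row holding the mean for this observation's interval.
    idx = which(steps_per_interval$interval == act_imputed$interval[i])
    act_imputed$steps[i] = steps_per_interval$avg_steps[idx]
  }
}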
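The same fill-in strategy can also be written without an explicit loop. The sketch below uses base R's `match()` to look up each row's interval mean in one step; the toy data and the names `toy`, `toy_means`, and `toy_imputed` are made up for illustration.

```r
# Toy data (made up): two intervals observed on two days, with gaps.
toy <- data.frame(
  interval = c(0, 5, 0, 5),
  steps    = c(NA, 4, 2, NA)
)

# Per-interval means; the formula interface drops rows with missing steps.
toy_means <- aggregate(steps ~ interval, data = toy, FUN = mean)
names(toy_means)[2] <- "avg_steps"

# For every row, find the position of its interval in the table of means.
idx <- match(toy$interval, toy_means$interval)

# Replace only the missing readings with the matching interval mean.
toy_imputed <- toy
toy_imputed$steps <- ifelse(is.na(toy$steps), toy_means$avg_steps[idx], toy$steps)
```

On a dataset of this size the loop is fast enough, so the vectorized form is mainly a matter of brevity.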
steps_per_day_imputed = act_imputed %>%
  group_by(date) %>%
  summarize(total_steps = sum(steps, na.rm = TRUE))
qplot(steps_per_day_imputed$total_steps, binwidth = 500, xlab = "Total number of steps per day (imputed)", ylab = "Frequency", main = "Histogram of total number of steps per day for the Imputed Dataset")
mean_steps_per_day_imputed = mean(steps_per_day_imputed$total_steps)
median_steps_per_day_imputed = median(steps_per_day_imputed$total_steps)
The mean and median differ from the earlier estimates, which were computed with the missing values ignored. The difference is small, however, so the missing data do not appear to have a large impact on the estimates of the total daily number of steps.
We will use the dataset with the filled-in missing values for this part.
The column day in the new dataset act_wday indicates whether each date falls on a weekday or a weekend.
# By default, lubridate's wday() numbers Sunday as 1 and Saturday as 7.
act_wday = act_imputed %>%
  mutate(day = as.factor(ifelse(wday(date) > 1 & wday(date) < 7, "weekday", "weekend")))
head(act_wday)
## steps date interval day
## 1 1.7169811 2012-10-01 0 weekday
## 2 0.3396226 2012-10-01 5 weekday
## 3 0.1320755 2012-10-01 10 weekday
## 4 0.1509434 2012-10-01 15 weekday
## 5 0.0754717 2012-10-01 20 weekday
## 6 2.0943396 2012-10-01 25 weekday
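The weekday test above relies on `wday()` using its default week start (Sunday = 1). As a sketch of an alternative that does not depend on that option, the classification can be done in base R with `as.POSIXlt()`, whose `wday` component always runs from 0 (Sunday) to 6 (Saturday). `day_type` is a hypothetical helper name.

```r
# Classify dates as "weekday" or "weekend" using base R only.
day_type <- function(dates) {
  wd <- as.POSIXlt(dates)$wday  # 0 = Sunday, ..., 6 = Saturday
  factor(ifelse(wd %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# 2012-10-01 was a Monday; 2012-10-06 and 2012-10-07 were a weekend.
example_days <- day_type(as.Date(c("2012-10-01", "2012-10-06", "2012-10-07")))
```

Fixing the factor levels explicitly keeps the panel order stable in a faceted plot even if one category happens to be absent.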
act_wday = act_wday %>%
  group_by(interval, day) %>%
  summarize(avg_steps = mean(steps, na.rm = TRUE))
qplot(interval, avg_steps, data = act_wday, geom = "line", facets = day~., xlab = "5-minute interval", ylab = "Average number of steps across all days", main = "Time series plot of interval and the number of steps, averaged across all days")