Load the data into R data frame “activity”, make sure the data resides in your current directory. We will have to also make transformation to the “date” column to get it in proper date format.
activity<-read.csv("./activity.csv",header = TRUE)
activity[,2]<-as.Date(activity$date)
Now let’s see how our data looks like
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Let’s first build and see the histogram of total steps per day
steps_1<-with(activity,tapply(steps,date,sum,na.rm=TRUE))
hist(steps_1,col = "green",xlab = "Total Steps",ylab = "Frequency",main = "Total Number of Steps per Day")
Now, let’s see the average total steps and median of total steps taken each day
print(mean_steps<-mean(steps_1))
## [1] 9354.23
print(median_steps<-median(steps_1))
## [1] 10395
So the average total steps taken each day is 9354.2295082 and, the average median is is calculated to be 10395
Let’s also just quickly look at the summary of the total steps taken each day
summary(steps_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 6778 10400 9354 12810 21190
Let’s first make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
avg_steps<-with(activity,tapply(steps,interval,mean,na.rm=TRUE))
intervals<-unique(activity$interval)
new<-data.frame(cbind(avg_steps,intervals))
plot(new$intervals,new$avg_steps,type = "l",xlab = "Intervals",
ylab = "Average Steps",main = "Average Steps per Interval")
Now, let’s see which 5-minute interval contains the maximum number of steps
index<-which.max(new$avg_steps)
max<-new[index,2]
The 5-minute interval that contains the maximum number of steps is 835
Approach: Calculate the average of average steps per day across all dates in the data set (ignoring NA values). Then use the resulting value in place of NAs.
So, let’s create a new dataset that is equal to the original dataset but with the missing data filled in.
sum(is.na(activity$steps))
## [1] 2304
index<-which(is.na(activity$steps))
l<-length(index)
steps_avg<-with(activity,tapply(steps,date,mean,na.rm=TRUE))
na<-mean(steps_avg,na.rm = TRUE)
for (i in 1:l) {
activity[index[i],1]<-na
}
Let’s see if we filled all NAs properly and see how our new dataset looks like
sum(is.na(activity$steps))
## [1] 0
It looks like we have zero NAs which means we have successfully filled all missing data and new dataset looks like as follows
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num 37.4 37.4 37.4 37.4 37.4 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Let’s see the histogram of total steps taken each day with the new dataset
steps_2<-with(activity,tapply(steps,date,sum,na.rm=TRUE))
hist(steps_2,col = "green",xlab = "Total Steps",ylab = "Frequency",main = "Total Number of Steps per Day")
Let’s calculate and see the mean and median of the total steps taken each day
print(mean_steps_2<-mean(steps_2))
## [1] 10766.19
print(median_steps_2<-median(steps_2))
## [1] 10766.19
The average total steps taken each day is 1.076618910^{4} and the median of total steps taken each day is 1.076618910^{4}. Notice that both the average and the median of the total steps taken eah day after NAs are filled came out to be equal.
In this section, we will use dplyr package and we need to load it from the library.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Also, we will need to create a new variable in the dataset named “day” that shows the day of the week in terms of weekday or weekend.
activity_mod<- mutate(activity, day = ifelse(weekdays(activity$date) == "Saturday" | weekdays(activity$date) == "Sunday", "weekend", "weekday"))
activity_mod$day<-as.factor(activity_mod$day)
str(activity_mod)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 37.4 37.4 37.4 37.4 37.4 ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ day : Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
Now, let’s plot the weekday and weekend data in seperate graphs
act_wknd<-subset(activity_mod,as.character(activity_mod$day)=="weekend")
act_wkdy<-subset(activity_mod,as.character(activity_mod$day)=="weekday")
steps_wknd<-with(act_wknd,tapply(steps,interval,mean,na.rm=TRUE))
steps_wkdy<-with(act_wkdy,tapply(steps,interval,mean,na.rm=TRUE))
int_wknd<-unique(act_wknd$interval)
int_wkdy<-unique(act_wkdy$interval)
new_wknd<-data.frame(cbind(steps_wknd,int_wknd))
new_wkdy<-data.frame(cbind(steps_wkdy,int_wkdy))
par(mfrow=c(2,1),mar=c(4,4,2,1))
plot(new_wknd$int_wknd,new_wknd$steps_wknd,type = "l",xlab = "Intervals",
ylab = "Average Steps",main = "Weekend")
plot(new_wkdy$int_wkdy,new_wkdy$steps_wkdy,type = "l",xlab = "Intervals",
ylab = "Average Steps",main = "Weekday")