Since the zip file is already in the repository, there is no need to download it.
We just unzip it and create the activity dataset by reading the CSV file.
#Unzip and read in the csv file
activity <- read.csv(unz("./activity.zip", "activity.csv"))
To get the total number of steps per day, we aggregate the data and sum the steps for each day.
# Using aggregate function to sum up steps per each day
# Do NOT use na.rm=TRUE for sum function.
DailySums <- aggregate(activity$steps, by = list(Date = activity$date), FUN = sum)
# Rename the default column name "x"
colnames(DailySums)[2] <- "Steps"
hist(DailySums$Steps, breaks = 20, col = "tomato2")
Notice that some days are missing completely (every value is NA):
head(activity, n = 25)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
## 7 NA 2012-10-01 30
## 8 NA 2012-10-01 35
## 9 NA 2012-10-01 40
## 10 NA 2012-10-01 45
## 11 NA 2012-10-01 50
## 12 NA 2012-10-01 55
## 13 NA 2012-10-01 100
## 14 NA 2012-10-01 105
## 15 NA 2012-10-01 110
## 16 NA 2012-10-01 115
## 17 NA 2012-10-01 120
## 18 NA 2012-10-01 125
## 19 NA 2012-10-01 130
## 20 NA 2012-10-01 135
## 21 NA 2012-10-01 140
## 22 NA 2012-10-01 145
## 23 NA 2012-10-01 150
## 24 NA 2012-10-01 155
## 25 NA 2012-10-01 200
In this data set, NAs only come in whole days: if one value on a day is NA, then every value on that day is NA.
So if we had used the na.rm = TRUE option with sum() during the aggregation, then the total number of steps for those days would have been calculated as 0, which would result in lower mean and median values.
In other words, the days without any data would have been treated as days with a measurement of 0 steps, which would lead to wrong calculations.
Ignoring NAs should mean leaving them out of every calculation, so we avoid turning those days into 0 during the aggregate and sum step.
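The effect described here can be reproduced on a tiny made-up data frame. The sketch below uses hypothetical values and day labels ("A", "B") that are not part of the activity data; it only illustrates how na.rm = TRUE inside sum() turns an all-NA day into a 0.

```r
# Toy data (hypothetical): day "A" has no measurements, day "B" has two.
toy <- data.frame(
  steps = c(NA, NA, 10, 20),
  date  = c("A", "A", "B", "B")
)

# Default behaviour: a day that contains NA sums to NA.
keep_na <- aggregate(toy$steps, by = list(Date = toy$date), FUN = sum)

# With na.rm = TRUE the all-NA day becomes 0, a measurement nobody took.
drop_na <- aggregate(toy$steps, by = list(Date = toy$date), FUN = sum, na.rm = TRUE)

mean(keep_na$x, na.rm = TRUE)  # averages over day "B" only
mean(drop_na$x)                # pulled down by the artificial 0 for day "A"
```

The first mean uses only the day that was actually measured, while the second is halved by the artificial zero.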
We can now calculate the mean and median across all days while ignoring the NA days, hence na.rm = TRUE this time.
# Now drop the NA days when calculating the mean and the median
meanSteps <- round(mean(DailySums$Steps, na.rm = TRUE), 2)
medianSteps <- median(DailySums$Steps, na.rm = TRUE)
Mean :
## [1] 10766.19
Median :
## [1] 10765
Next we aggregate across intervals, using the mean function to get the average number of steps per interval across days.
IntervalMeans <- aggregate(activity$steps, by = list(Interval = activity$interval), FUN = mean, na.rm = TRUE)
# The "x" column now holds the average steps, so rename the columns accordingly
names(IntervalMeans) <- c("Interval", "AvgSteps")
For the time series plot, the Interval values are coded as integers such as 5, 10, 15 and so on, where the digits encode hours and minutes (835 means 08:35). Plotted as plain numbers, they would jump from 55 to 100 at every hour, so to get a more accurate graphic the values are converted to POSIXlt format using the strptime() function and the format %H%M.
Note that values shorter than 4 digits are first padded with leading zeros by the str_pad() function from the stringr library.
# Load library needed for str_pad
library(stringr)
# Convert the x-axis (Interval values) to time format before plotting,
# then draw the actual line plot.
plot(strptime(str_pad(IntervalMeans$Interval, 4, pad = "0"), format = "%H%M"),
     IntervalMeans$AvgSteps, type = "l",
     xlab = "Interval", ylab = "AvgSteps", col = "royalblue")
# Add a grid to the graph
grid(8,8)
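The padding and parsing step can also be checked on its own. The following is a minimal sketch that swaps in base R's sprintf() for str_pad(); the sample interval codes are hypothetical picks, and tz = "UTC" is an added assumption to keep daylight-saving changes out of the example.

```r
# Hypothetical sample of interval codes as they appear in the data.
intervals <- c(0L, 5L, 55L, 100L, 835L, 2355L)

# "%04d" left-pads an integer with zeros to a width of 4.
padded <- sprintf("%04d", intervals)

# Parse as hours and minutes; strptime() fills in the current date.
times <- strptime(padded, format = "%H%M", tz = "UTC")

format(times, "%H:%M")
```

Using sprintf() avoids the extra package dependency, while str_pad() reads a little more clearly; both produce the same four-character strings for these inputs.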
The maximum average number of steps for an interval is:
max(IntervalMeans$AvgSteps)
## [1] 206.1698
And the corresponding interval for that value is:
IntervalMeans[IntervalMeans$AvgSteps == max(IntervalMeans$AvgSteps),]
## Interval AvgSteps
## 104 835 206.1698
So the 5-minute interval 08:35 has the maximum average number of steps across all days. This is plausible if the interval falls just before the individual's office hours, although the data alone cannot confirm that.
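The lookup of the peak interval can also be written with which.max(), which returns the position of the first maximum and avoids comparing floating-point values with ==. The sketch below uses a hypothetical three-row stand-in (IntervalMeansToy, with made-up averages) rather than the real IntervalMeans.

```r
# Hypothetical stand-in with the same column names as IntervalMeans.
IntervalMeansToy <- data.frame(
  Interval = c(830L, 835L, 840L),
  AvgSteps = c(10, 30, 20)  # made-up averages, not the real results
)

# which.max() gives the row index of the largest AvgSteps value.
peak <- IntervalMeansToy[which.max(IntervalMeansToy$AvgSteps), ]
peak$Interval
```

If several intervals tied for the maximum, which.max() would report only the first one, whereas the == comparison used above would return all of them.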
Since we are asked for the total number of rows (i.e. cases) that contain NAs, we use the complete.cases() function to find the complete cases; its complement, !complete.cases(), marks the incomplete ones.
sum(!complete.cases(activity))
## [1] 2304
Since we already have the mean for each 5-minute interval, I chose to fill in each missing value with the mean for its 5-minute interval.
First, make sure the NA values come only from the steps column and never from date or interval.
sum(is.na(activity$date))
## [1] 0
sum(is.na(activity$interval))
## [1] 0
sum(is.na(activity$steps))
## [1] 2304
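The three separate checks can be combined into one call by counting NAs per column. This is a small sketch on a hypothetical three-row data frame with the same column names as activity; the values are invented for illustration.

```r
# Hypothetical miniature version of the activity data.
toy <- data.frame(
  steps    = c(NA, 3, NA),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02"),
  interval = c(0L, 5L, 0L)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_per_column <- colSums(is.na(toy))
na_per_column
```

Applied to the real dataset, the same call would be colSums(is.na(activity)).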
Now we can create a new dataset activityFilled by filling missing values from IntervalMeans
activityFilled <- activity
# Now use IntervalMeans as a lookup Table
for (i in seq_len(nrow(IntervalMeans))) {
  interval <- IntervalMeans[i, 1]
  avgValue <- IntervalMeans[i, 2]
  # Replace every NA steps value for this interval with the matching
  # average from the IntervalMeans dataset.
  activityFilled$steps[is.na(activityFilled$steps) & activityFilled$interval == interval] <- avgValue
}
# Check that all the missing values are gone
sum(!complete.cases(activityFilled))
## [1] 0
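The same fill can be done without a loop by using match() to look up each missing row's interval in the table of means. The sketch below is one possible alternative, not the method used above; toyActivity, toyMeans and toyFilled are hypothetical stand-ins with made-up values.

```r
# Hypothetical stand-in for activity: one all-NA day followed by two measured days.
toyActivity <- data.frame(
  steps    = c(NA, NA, 4, 8, 6, 12),
  interval = c(0L, 5L, 0L, 5L, 0L, 5L)
)

# Per-interval means, built the same way as IntervalMeans.
toyMeans <- aggregate(toyActivity$steps,
                      by = list(Interval = toyActivity$interval),
                      FUN = mean, na.rm = TRUE)
names(toyMeans) <- c("Interval", "AvgSteps")

toyFilled <- toyActivity
missing <- is.na(toyFilled$steps)

# match() returns, for each missing row, the row of toyMeans with the same interval.
toyFilled$steps[missing] <-
  toyMeans$AvgSteps[match(toyFilled$interval[missing], toyMeans$Interval)]

toyFilled$steps
```

The loop version is easier to follow step by step, while the match() version touches each missing row once and scales better to larger tables.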
As in part one, we sum the steps for each day, this time using the filled dataset.
# Using aggregate function to sum up steps per each day
DailySumsFilled <- aggregate(activityFilled$steps, by = list(Date = activityFilled$date), FUN = sum)
# Rename the default column name "x"
colnames(DailySumsFilled)[2] <- "Steps"
Draw the plot:
hist(DailySumsFilled$Steps, breaks = 20, col = "tomato2")
meanStepsFilled <- round(mean(DailySumsFilled$Steps), 2)
medianStepsFilled <- median(DailySumsFilled$Steps)
Mean with Filled Values:
## [1] 10766.19
Median with Filled Values:
## [1] 10766.19
Since the strategy I chose was replacing each missing value with the mean of its interval, the mean did not change. The effect on the median is that it is now equal to the mean. The days that were completely NA before are each filled with a full set of interval means, so each of them gets a daily total equal to the overall mean (10766.19). With several days sitting exactly at the mean, the median shifts there.
Note that if I had not accounted for the NA values in the first part, we would have seen a much larger difference between the old and new mean and median.
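Why the mean stays put and the median moves onto it can be seen on a small made-up example. The sketch below assumes, as in this dataset, that missing values only occur as whole days; the matrix values are hypothetical, with rows as days and columns as intervals.

```r
# Hypothetical data: 3 measured days and 2 all-NA days, 2 intervals each.
steps <- matrix(c( 4,  8,
                   6, 12,
                   2, 10,
                  NA, NA,
                  NA, NA), ncol = 2, byrow = TRUE)

daily <- rowSums(steps)                      # NA for the two missing days
mean_before   <- mean(daily, na.rm = TRUE)
median_before <- median(daily, na.rm = TRUE)

# Fill each missing day with the per-interval means.
interval_means <- colMeans(steps, na.rm = TRUE)
filled <- steps
filled[4, ] <- interval_means
filled[5, ] <- interval_means

daily_filled <- rowSums(filled)
c(mean_before, mean(daily_filled))           # identical means
c(median_before, median(daily_filled))       # median moves to the mean
```

Each filled day's total is the sum of the interval means, which equals the mean of the measured daily totals, so adding such days cannot move the mean but can pull the median toward it.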
1. Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
Here we add another column as a factor by using the weekdays() function on the date column.
DayType<-weekdays(as.POSIXct(activity$date))
DayType[DayType == "Sunday" | DayType == "Saturday"] <- "weekend"
DayType[DayType != "weekend"] <- "weekday"
activity$DayFactor <- as.factor(DayType)
#How our dataset looks now
head(activity)
## steps date interval DayFactor
## 1 NA 2012-10-01 0 weekday
## 2 NA 2012-10-01 5 weekday
## 3 NA 2012-10-01 10 weekday
## 4 NA 2012-10-01 15 weekday
## 5 NA 2012-10-01 20 weekday
## 6 NA 2012-10-01 25 weekday
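The weekdays() labelling above depends on day names being in English, because weekdays() returns names in the session's locale. A locale-independent variant, sketched here on a few hypothetical sample dates, uses the numeric wday field of POSIXlt instead (0 is Sunday, 6 is Saturday).

```r
# Hypothetical sample: Friday through Monday in the study period.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric (0 = Sunday ... 6 = Saturday), so no day names are compared.
wday <- as.POSIXlt(dates)$wday

dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))
dayType
```

On an English-locale machine both approaches give the same factor; the numeric version simply removes the dependency on the locale.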
2. Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
The plot is made with the lattice system. (I did not convert the Intervals to POSIXlt this time, since the example plot had the Intervals as integers.)
IntervalMeansbyDays<- aggregate(activity$steps, by=list(Interval=activity$interval, DayFactor=activity$DayFactor), FUN=mean, na.rm=TRUE)
library(lattice)
xyplot(x ~ Interval |DayFactor,data = IntervalMeansbyDays,type="l",layout=c(1,2), ylab="Avg Number of Steps")
There are obvious differences in the patterns. On weekdays the individual walks mostly between about 8:30 and 9:00 AM (possibly to work or school) and walks comparatively little during the rest of the day. At weekends there is more walking activity overall, and it can happen at almost any time of day.