The dataset used in this assignment is included in the GitHub repository of the assignment, which was cloned to the home directory using the following shell commands:
cd ~
git clone https://github.com/rdpeng/RepData_PeerAssessment1.git
For the data analysis the package 'reshape2' is used. The following code checks whether it is installed, installs it if not, and then loads the package:
# Install reshape2 only if it is not already available
if (!requireNamespace("reshape2", quietly = TRUE)) {
    install.packages("reshape2")
}
library("reshape2")
The zip file with the dataset is extracted into the home directory and then read in
unzip("RepData_PeerAssessment1/activity.zip")
activity <- read.csv("activity.csv", header = TRUE, nrows = 17570, colClasses = c("numeric",
"Date", "numeric"))
Reshape the dataset to get the total number of steps per day
myMelt <- melt(activity, id = "date")
myTotal <- dcast(myMelt, date ~ variable, sum)
Create the histogram of the total number of steps taken each day
hist(myTotal$steps, main = "Total number of steps")
Calculate the mean and median of the total number of steps taken each day. NAs need to be removed, as otherwise the mean and median are NA
mean(myTotal$steps, na.rm = TRUE)
## [1] 10766
median(myTotal$steps, na.rm = TRUE)
## [1] 10765
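The daily totals can also be computed without reshaping. The following is a minimal sketch in base R, not the method used in this report. It uses a small made-up data frame (`toy`, a hypothetical name) with the same column names as the activity data, and `na.action = na.pass` so that a day with missing values gets NA as its total, as in the `dcast()` result above.

```r
# Toy data with the same column names as the activity dataset (values are made up)
toy <- data.frame(
  steps    = c(10, 20, NA, NA, 5, 15),
  date     = as.Date(c("2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02",
                       "2012-10-03", "2012-10-03")),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Sum the steps per day; na.pass keeps rows with NA, so a day that
# contains missing values ends up with NA as its total
toyTotal <- aggregate(steps ~ date, data = toy, FUN = sum, na.action = na.pass)

toyTotal$steps                        # 30 NA 20
mean(toyTotal$steps, na.rm = TRUE)    # 25
median(toyTotal$steps, na.rm = TRUE)  # 25
```

Without `na.rm = TRUE`, both `mean()` and `median()` would return NA here because of the second day.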
Reshape the data to get the mean for each interval
myMelt <- melt(activity, id = "interval", na.rm = TRUE)
myInterval <- dcast(myMelt, interval ~ variable, mean)
Create a time series plot of the means for each interval
plot(myInterval$interval, myInterval$steps, type = "l")
Calculate the interval with the maximum average of steps
myInterval[myInterval$steps == max(myInterval$steps), 1]
## [1] 835
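As an alternative to comparing against `max()`, the peak interval can be looked up with `which.max()`. This is only a sketch on made-up values (`toyInterval` is a hypothetical stand-in for `myInterval`). The two approaches differ when the maximum is tied: the comparison returns every tied interval, while `which.max()` returns only the first.

```r
# Made-up interval means; the maximum 7.2 occurs twice on purpose
toyInterval <- data.frame(
  interval = c(0, 5, 10, 15),
  steps    = c(1.5, 7.2, 3.0, 7.2)
)

# which.max() gives the index of the first maximum
peak <- toyInterval$interval[which.max(toyInterval$steps)]

# The comparison used in the report gives all intervals that reach the maximum
allPeaks <- toyInterval[toyInterval$steps == max(toyInterval$steps), 1]

peak      # 5
allPeaks  # 5 15
```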
Computing the number of missing cases
nrow(activity[!complete.cases(activity), ])
## [1] 2304
Creating a new dataset in which the missing values for steps are replaced with the mean of the respective interval
x <- rep(myInterval$steps, 61)
activityNew <- activity
activityNew$steps[is.na(activityNew$steps)] <- x[is.na(activityNew$steps)]
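The `rep(..., 61)` approach relies on the data containing exactly 61 days with all intervals in the same order. A variant that looks up each missing value by its interval does not need that assumption. The sketch below uses made-up data and hypothetical names (`toy`, `intervalMeans`, `toyNew`); it is not the code used in this report.

```r
# Toy data: three days, two intervals each, two missing values (made up)
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 8),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Mean steps per interval, ignoring missing values;
# the result is named by interval ("0", "5")
intervalMeans <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Replace each NA with the mean of its own interval, looked up by name
missing <- is.na(toy$steps)
toyNew <- toy
toyNew$steps[missing] <- intervalMeans[as.character(toy$interval[missing])]

toyNew$steps  # 4 4 2 6 6 8
```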
Reshape the new dataset to get the total number of steps per day
myMeltNew <- melt(activityNew, id = "date")
myTotalNew <- dcast(myMeltNew, date ~ variable, sum)
Creating the histogram for the new dataset
hist(myTotalNew$steps, main = "Total number of steps")
Calculating mean and median for the new dataset
mean(myTotalNew$steps)
## [1] 10766
median(myTotalNew$steps)
## [1] 10766
As one can see, at the printed precision the mean is unchanged (1.0766 × 10^4), while the median rises slightly from 1.0765 × 10^4 to 1.0766 × 10^4 compared with the dataset in which the NAs were not replaced. After replacing the NAs with the average of the respective interval, the median and mean have the same value, which suggests a more symmetric distribution of the daily totals.
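One possible explanation for this comparison can be illustrated with made-up numbers. The sketch assumes that the missing values fall on whole days, which the report does not state, although 2304 missing rows would be consistent with 8 full days of 288 five-minute intervals. Under that assumption every imputed day receives the sum of the interval means, which equals the mean daily total of the complete days. The vector `toyTotals` is hypothetical.

```r
# Made-up daily totals; NA marks days that are missing entirely
toyTotals <- c(100, NA, NA, NA, 300, 140)

meanBefore   <- mean(toyTotals, na.rm = TRUE)    # 180
medianBefore <- median(toyTotals, na.rm = TRUE)  # 140

# Imputing whole days with interval means gives each of them the mean total
filled <- toyTotals
filled[is.na(filled)] <- meanBefore

meanAfter   <- mean(filled)    # 180, unchanged
medianAfter <- median(filled)  # 180, pulled onto the mean
```

The mean cannot change because only copies of the mean are added, and with several imputed days sitting exactly at the mean, the median tends to land on that value as well.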
Creating a new factor variable called 'Days' with the levels Weekday and Weekend
x <- as.factor(weekdays(activityNew$date))
levels(x) <- list(Weekday = "Friday", Weekday = "Monday", Weekend = "Saturday",
Weekend = "Sunday", Weekday = "Thursday", Weekday = "Tuesday", Weekday = "Wednesday")
activityNew$Days <- x
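Because `weekdays()` returns day names in the current locale, the mapping above only works when the names are in English. A locale-independent variant can be sketched with the numeric weekday from `as.POSIXlt()` (0 = Sunday, 6 = Saturday). The dates below are made up for illustration, and the names `dates`, `wday` and `days` are hypothetical.

```r
# Friday to Monday, 5 to 8 October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of the week: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Two-level factor, independent of the language of the day names
days <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
               levels = c("Weekday", "Weekend"))

as.character(days)  # "Weekday" "Weekend" "Weekend" "Weekday"
```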
Preparing the dataset to create a time series plot of the means for each interval for both weekend and weekday by subsetting and reshaping
myDaySub <- activityNew[activityNew$Days == "Weekday", c("interval", "steps")]
myEndSub <- activityNew[activityNew$Days == "Weekend", c("interval", "steps")]
myMeltDay <- melt(myDaySub, id = "interval", na.rm = TRUE)
myIntervalDay <- dcast(myMeltDay, interval ~ variable, mean)
myMeltEnd <- melt(myEndSub, id = "interval", na.rm = TRUE)
myIntervalEnd <- dcast(myMeltEnd, interval ~ variable, mean)
Creating a time series plot of the means for each interval for both weekend and weekday
par(mfrow = c(2, 1), mar = c(4, 4, 1, 1))
plot(myIntervalDay$interval, myIntervalDay$steps, type = "l", col = "blue",
main = "Weekday", ylab = "Number of Steps", xlab = "", xaxt = "n", ylim = c(0,
250))
plot(myIntervalEnd$interval, myIntervalEnd$steps, type = "l", col = "blue",
main = "Weekend", ylab = "Number of Steps", xlab = "interval", ylim = c(0,
250))
As can be seen, the average number of steps on the weekend is spread more evenly over the whole day, while on weekdays the average spikes at interval 835.