Reproducible Research first assignment

Preparation

The dataset used in this assignment is included in the github repository of the assignment, which was cloned to the home directory using the following code in the shell

cd to home
git clone https://github.com/rdpeng/RepData_PeerAssessment1.git

For the data analysis the package 'reshape2' is used. The following code checkes whether it is installed, installs it if not and loads the package

if (library("reshape2", logical.return = TRUE) == FALSE) {
    install.packages("reshape2")
}
library("reshape2")

1. Read in the data

The zip file with the dataset is extracted into the home directory and then read in

unzip("RepData_PeerAssessment1/activity.zip")
activity <- read.csv("activity.csv", header = TRUE, nrows = 17570, colClasses = c("numeric",
    "Date", "numeric"))

2. What is the mean total number of steps taken each day?

Reshape the dataset to get the total numbers of steps per day

myMelt <- melt(activity, id = "date")
myTotal <- dcast(myMelt, date ~ variable, sum)

Create the histeogram of total numbers of steps taken each day

hist(myTotal$steps, main = "Total number of steps")

plot of chunk unnamed-chunk-4

Calculate the the mean and median of total numbers of steps taken each day, NAs need to be removed as otherwise the mean and median are NA

mean(myTotal$steps, na.rm = TRUE)
## [1] 10766
median(myTotal$steps, na.rm = TRUE)
## [1] 10765

3. What is the average daily activity pattern?

Reshape the data to get the mean for each interval

myMelt <- melt(activity, id = "interval", na.rm = TRUE)
myInterval <- dcast(myMelt, interval ~ variable, mean)

Create a time series plot of the means for each interval

plot(myInterval$interval, myInterval$steps, type = "l")

plot of chunk unnamed-chunk-7

Calculate the interval with the maximum average of steps

myInterval[myInterval$steps == max(myInterval$steps), 1]
## [1] 835

4. Imputing missing values

Computing the number of missing cases

nrow(activity[!complete.cases(activity), ])
## [1] 2304

Creating a new dataset in which the missing values for steps are replaced with the mean of the respective interval

x <- rep(myInterval$steps, 61)
activityNew <- activity
activityNew$steps[is.na(activityNew$steps)] <- x[is.na(activityNew$steps)]

Reshape the new dataset to get the total numbers of steps per day

myMeltNew <- melt(activityNew, id = "date")
myTotalNew <- dcast(myMeltNew, date ~ variable, sum)

Creating the histeogram for the new dataset

hist(myTotalNew$steps, main = "Total number of steps")

plot of chunk unnamed-chunk-12

Calculating mean and median for the new dataset

mean(myTotalNew$steps)
## [1] 10766
median(myTotalNew$steps)
## [1] 10766

As one can see the mean and median above differ slightly from the mean ( 1.0766 × 104 ) and median ( 1.0765 × 104 ) of the dataset without replacing the NAs When replacing the NAs with the average of the respective interval the median and mean have the same value, indicating a more even distribution.

5. Are there differences in activity patterns between weekdays and weekends?

Creating an new factor variable called 'Days' with the levels weekday and weekend

x <- as.factor(weekdays(activityNew$date))
levels(x) <- list(Weekday = "Friday", Weekday = "Monday", Weekend = "Saturday",
    Weekend = "Sunday", Weekday = "Thursday", Weekday = "Tuesday", Weekday = "Wednesday")
activityNew$Days <- x

Preparing the dataset to create a time series plot of the means for each interval for both weekend and weekday by subsetting and reshaping

myDaySub <- activityNew[activityNew$Days == "Weekday", c("interval", "steps")]
myEndSub <- activityNew[activityNew$Days == "Weekend", c("interval", "steps")]

myMeltDay <- melt(myDaySub, id = "interval", na.rm = TRUE)
myIntervalDay <- dcast(myMeltDay, interval ~ variable, mean)

myMeltEnd <- melt(myEndSub, id = "interval", na.rm = TRUE)
myIntervalEnd <- dcast(myMeltEnd, interval ~ variable, mean)

Creating a time series plot of the means for each interval for both weekend and weekday

par(mfrow = c(2, 1), mar = c(4, 4, 1, 1))
plot(myIntervalDay$interval, myIntervalDay$steps, type = "l", col = "blue",
    main = "Weekday", ylab = "Number of Steps", xlab = "", xaxt = "n", ylim = c(0,
        250))
plot(myIntervalEnd$interval, myIntervalEnd$steps, type = "l", col = "blue",
    main = "Weekend", ylab = "Number of Steps", xlab = "interval", ylim = c(0,
        250))

plot of chunk unnamed-chunk-16

As can be seen the average steps taken on the weekend are much more distributed ver the whole day, while during the weekdays the average spikes at interval 835