Reproducible Research Week 2 Assignment

Reading data and displaying a summary of the data.

alldata <- read.csv("./repdata%2Fdata%2Factivity/activity.csv")
alldata$date <- as.Date(alldata$date)
alldata$weekday <- factor(weekdays(alldata$date), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
head(alldata)

##   steps       date interval weekday
## 1    NA 2012-10-01        0  Monday
## 2    NA 2012-10-01        5  Monday
## 3    NA 2012-10-01       10  Monday
## 4    NA 2012-10-01       15  Monday
## 5    NA 2012-10-01       20  Monday
## 6    NA 2012-10-01       25  Monday

summary(alldata$steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00   37.38   12.00  806.00    2304

What is mean total number of steps taken per day?

1. Calculate the total number of steps taken per day

total_steps_daywise <- tapply(alldata$steps, alldata$date, sum)
head(total_steps_daywise)

## 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-06 
##         NA        126      11352      12116      13294      15420
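The NA for 2012-10-01 appears because `sum` propagates missing values by default. A minimal sketch on made-up data (the `toy` data frame is hypothetical, not the activity file) shows how this differs from `na.rm = TRUE`, which would turn an all-missing day into a total of zero:

```r
# Hypothetical toy data: the first day is entirely missing, the second is complete.
toy <- data.frame(
  steps = c(NA, NA, 10, 20),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"))
)

# Default behaviour: any NA within a day makes that day's total NA.
totals_keep_na <- tapply(toy$steps, toy$date, sum)

# With na.rm = TRUE an all-missing day sums to 0, which would drag the mean down.
totals_drop_na <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(totals_keep_na)
print(totals_drop_na)
```

Keeping the NA totals and dropping them later with `na.rm = TRUE` in `mean` and `median` avoids counting unrecorded days as days with zero steps.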

2. Make a histogram of the total number of steps taken each day

The histogram below uses R's base graphics system. Days whose total is NA are dropped by hist() automatically.

hist(total_steps_daywise, main = "Histogram of the total number of steps taken each day", xlab = "Total number of steps taken", ylab = "Frequency", col = "steel blue", breaks = 10)

3. Calculate and report the mean and median of the total number of steps taken per day

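# Note: these object names shadow base::mean and base::median. Calls such as
# mean(x) still resolve to the functions, but distinct names would be clearer.
mean <- round(mean(total_steps_daywise, na.rm = TRUE), digits = 1)
median <- round(median(total_steps_daywise, na.rm = TRUE), digits = 1)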

The mean of the total number of steps taken per day is 10766.2 and the median of the total number of steps taken per day is 10765.
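Large values inserted through inline R expressions can come out in scientific notation. A small sketch of formatting them explicitly before they reach the text follows; the object names `daily_mean` and `daily_median` are hypothetical, and the values are simply the ones reported above:

```r
# Values taken from the report text; in the real document they would come from
# the computed mean and median objects.
daily_mean   <- 10766.2
daily_median <- 10765

# sprintf() produces fixed-point notation regardless of the scipen option.
mean_text   <- sprintf("%.1f", daily_mean)
median_text <- sprintf("%.0f", daily_median)

# format() can additionally insert a thousands separator.
mean_pretty <- format(daily_mean, nsmall = 1, big.mark = ",")

cat("Mean:", mean_text, "Median:", median_text, "Pretty:", mean_pretty, "\n")
```

Passing the already formatted strings to the inline expression keeps the rendered sentence readable.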

What is the average daily activity pattern?

1. Make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)

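Calculating the mean number of steps for every 5-minute interval, averaged across all days with missing values excluded. Despite its name, total_steps_intervalwise holds means, not totals.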

total_steps_intervalwise <- tapply(alldata$steps, alldata$interval, mean, na.rm = TRUE)

Plotting the line diagram

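# Take the x values from the names of the tapply result so they always match the y values
plot(as.integer(names(total_steps_intervalwise)), total_steps_intervalwise, col = "red", main = "Time Series Plot for each 5-minute interval", xlab = "Intervals --->", ylab = "Average number of Steps --->", type = "l")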

2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

# which.max() returns the position of the largest mean; its name is the interval label
max_interval <- as.integer(names(which.max(total_steps_intervalwise)))

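The 5-minute interval with the highest average number of steps is the one labelled 835, i.e. the five minutes starting at 08:35. This is an interval identifier, not the 835th interval of the day (a day has only 288 intervals).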
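The relation between an interval label and its position in the day can be sketched as follows. This assumes the labels encode clock time as HHMM, which the maximum label of 2355 in the data suggests; `label_to_clock` is a hypothetical helper, not part of the analysis above:

```r
# Rebuild the 288 interval labels: 0, 5, ..., 55, 100, 105, ..., 2355.
intervals <- as.vector(outer(seq(0, 55, by = 5), (0:23) * 100, "+"))

# Hypothetical helper: convert an HHMM-style label to a clock time string.
label_to_clock <- function(label) sprintf("%02d:%02d", label %/% 100, label %% 100)

peak_label    <- 835
peak_position <- match(peak_label, intervals)  # ordinal position within the day

cat("Label", peak_label, "starts at", label_to_clock(peak_label),
    "and is interval number", peak_position, "of", length(intervals), "\n")
```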

Imputing missing values

1. Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

nas <- sum(is.na(alldata$steps))

The total number of rows with NAs is 2304.

2. Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.

The strategy is to replace each missing value in the steps column with the mean number of steps for the same 5-minute interval, taken across all days on which that interval was recorded. To do this, the data are grouped by interval and, within each group, the NA values are overwritten with the group mean.

3. Create a new dataset that is equal to the original dataset but with the missing data filled in.

The above strategy has been applied to create the new dataset. The new data set is called imputed_data.

#loading libraries
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Replace the NA entries of a vector with the mean of its non-missing entries
mean_steps <- function(num) replace(num, is.na(num), mean(num, na.rm = TRUE))
# Apply the replacement within each interval group (the result stays grouped by interval)
imputed_data <- alldata %>% group_by(interval) %>% mutate(steps = mean_steps(steps))

## Warning: package 'bindrcpp' was built under R version 3.5.1

#Displaying Imputed data
head(imputed_data)

## # A tibble: 6 x 4
## # Groups:   interval [6]
##    steps date       interval weekday
##    <dbl> <date>        <int> <fct>  
## 1 1.72   2012-10-01        0 Monday 
## 2 0.340  2012-10-01        5 Monday 
## 3 0.132  2012-10-01       10 Monday 
## 4 0.151  2012-10-01       15 Monday 
## 5 0.0755 2012-10-01       20 Monday 
## 6 2.09   2012-10-01       25 Monday

summary(imputed_data)

##      steps             date               interval           weekday    
##  Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0   Monday   :2592  
##  1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8   Tuesday  :2592  
##  Median :  0.00   Median :2012-10-31   Median :1177.5   Wednesday:2592  
##  Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5   Thursday :2592  
##  3rd Qu.: 27.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2   Friday   :2592  
##  Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0   Saturday :2304  
##                                                         Sunday   :2304

new_nas <- sum(is.na(imputed_data$steps))

The imputed data has 0 NA values.
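As a cross-check, the same interval-mean imputation can be sketched in base R with `ave()`, without dplyr. The `toy` data frame and the helper name `fill_with_group_mean` are made up for illustration:

```r
# Hypothetical toy data: two intervals observed on three days, with two gaps.
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Replace the NA entries of a vector with the mean of its non-missing entries.
fill_with_group_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies the function within each interval group and keeps the row order.
toy$steps_filled <- ave(toy$steps, toy$interval, FUN = fill_with_group_mean)

print(toy)
```

Because `ave()` returns a plain vector in the original row order, the result is an ordinary data frame column with no grouping attached.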

4. Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?

The total number of steps per day is recalculated from the imputed data and plotted next to the histogram of the original data. The white vertical line in each panel marks the mean.

new_total_steps_daywise <- tapply(imputed_data$steps, imputed_data$date, sum, na.rm = TRUE)
new_mean <- round(mean(new_total_steps_daywise), digits = 1)
new_median <- round(median(new_total_steps_daywise), digits = 1)
par(mfrow = c(1,2))
hist(new_total_steps_daywise, main = "Imputed Data", xlab = "Total number of steps taken --->", ylab = "Frequency --->", col = "steel blue", breaks = 10, ylim = c(0, 25))
abline(v = new_mean, col = "white")
hist(total_steps_daywise, main = "Original Data", xlab = "Total number of steps taken --->", ylab = "Frequency --->", col = "steel blue", breaks = 10, ylim = c(0, 25))
abline(v= mean, col = "white")

The new mean with the imputed data is 10766.2 and the old mean was 10766.2. The new median with the imputed data is 10766.2 while the old median was 10765.

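Imputation leaves the mean of the daily totals unchanged and shifts the median slightly, so that it now equals the mean. This is expected with interval-mean imputation: the 2304 missing values equal 8 × 288, which is consistent with eight entirely missing days, and each such day receives a total equal to the sum of the interval means, that is, the mean daily total. Adding days that sit exactly at the mean cannot move the mean, but it does pull the median towards it.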
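The effect on the mean and median can be reproduced on a small made-up example. The sketch below assumes, as seems to be the case here, that values are missing only for whole days; the `toy` data frame is hypothetical:

```r
# Hypothetical toy data: 3 intervals x 4 days, with day "d2" entirely missing.
toy <- expand.grid(interval = c(0, 5, 10), date = c("d1", "d2", "d3", "d4"),
                   stringsAsFactors = FALSE)
toy$steps <- c(1, 2, 3,  NA, NA, NA,  4, 6, 8,  7, 1, 1)

# Mean and median of the daily totals, ignoring the missing day.
daily_before  <- tapply(toy$steps, toy$date, sum)
mean_before   <- mean(daily_before, na.rm = TRUE)
median_before <- median(daily_before, na.rm = TRUE)

# Interval-mean imputation: fill each NA with the mean of its interval.
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)
missing <- is.na(toy$steps)
toy$steps[missing] <- interval_means[as.character(toy$interval[missing])]

# The imputed day now has a total equal to the mean of the observed days.
daily_after  <- tapply(toy$steps, toy$date, sum)
mean_after   <- mean(daily_after)
median_after <- median(daily_after)

cat("Mean before/after:", mean_before, mean_after,
    " Median before/after:", median_before, median_after, "\n")
```

With unevenly scattered missing values the sum of the interval means would no longer match the mean daily total exactly, so the mean could shift as well.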

Are there differences in activity patterns between weekdays and weekends?

1. Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

# Wrap the result in factor() so that DayType is a factor variable, as the task requires
alldata$DayType <- factor(ifelse(alldata$weekday %in% c("Saturday", "Sunday"), "Weekend", "Weekday"))
head(alldata)

##   steps       date interval weekday DayType
## 1    NA 2012-10-01        0  Monday Weekday
## 2    NA 2012-10-01        5  Monday Weekday
## 3    NA 2012-10-01       10  Monday Weekday
## 4    NA 2012-10-01       15  Monday Weekday
## 5    NA 2012-10-01       20  Monday Weekday
## 6    NA 2012-10-01       25  Monday Weekday

2. Make a panel plot containing a time series plot (i.e. type=“l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

weekday_alldata <- alldata[alldata$DayType == "Weekday", ]
weekend_alldata <- alldata[alldata$DayType == "Weekend", ]
weekday_intervalwise <- tapply(weekday_alldata$steps, weekday_alldata$interval, mean, na.rm = TRUE)
weekend_intervalwise <- tapply(weekend_alldata$steps, weekend_alldata$interval, mean, na.rm = TRUE)

Plotting the line diagram for weekday and weekend

# Stack the two plots in a single two-row panel
par(mfrow = c(2, 1))
plot(as.integer(names(weekday_intervalwise)), weekday_intervalwise, col = "red", main = "Weekday Time Series Plot", xlab = "Intervals", ylab = "Avg. no. of steps", type = "l", ylim = c(0,250))
# Horizontal line at the mean of the weekday profile
abline(h = mean(weekday_intervalwise), col = "red")

plot(as.integer(names(weekend_intervalwise)), weekend_intervalwise, col = "green", main = "Weekend Time Series Plot", xlab = "Intervals", ylab = "Avg. no. of steps", type = "l", ylim = c(0,250))
# Horizontal line at the mean of the weekend profile
abline(h = mean(weekend_intervalwise), col = "green")

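The plots suggest a visible difference between weekday and weekend activity, although no formal test was run. The weekday profile appears to have more peaks and a lower standard deviation, while the weekend profile appears to have fewer peaks; neither quantity was computed here. The horizontal lines, which mark the mean of each profile, appear to sit at a similar level, so the average number of steps per interval seems comparable for the two day types.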
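Visual impressions like these can be backed by a few summary statistics per day type. The following is a minimal sketch on made-up data; the `toy` data frame stands in for the activity data and its numbers are not results from this analysis:

```r
# Hypothetical toy data: two weekday days and two weekend days, three intervals each.
toy <- data.frame(
  steps    = c(0, 50, 10, 0, 30, 20, 5, 25, 15, 10, 20, 30),
  interval = rep(c(0, 5, 10), times = 4),
  DayType  = rep(c("Weekday", "Weekday", "Weekend", "Weekend"), each = 3)
)

# Average steps per interval within each day type.
profile <- aggregate(steps ~ interval + DayType, data = toy, FUN = mean)

# Mean, standard deviation and peak of each average profile.
profile_stats <- sapply(split(profile$steps, profile$DayType),
                        function(x) c(mean = mean(x), sd = sd(x), max = max(x)))

print(round(profile_stats, 2))
```

Running the same two steps on the real data frame would show whether the means, spreads and peak heights of the two profiles actually differ.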