Loading and preprocessing the data

Since the zip file is already in the repository, we do not need to download it.
We will just unzip it and create the activity dataset by reading the csv file.

#Unzip and read in the csv file
activity <- read.csv(unz("./activity.zip", "activity.csv"))

What is the mean total number of steps taken per day?

To get the total number of steps per day, we aggregate the data and sum the steps for each day.

# Use aggregate() to sum the steps for each day.
# Do NOT pass na.rm = TRUE to sum() here; days without data should stay NA.
DailySums <- aggregate(activity$steps, by = list(Date = activity$date), FUN = sum)
# Rename the default column name "x"
colnames(DailySums)[2] <- "Steps"

Histogram of total number of steps for each day

hist(DailySums$Steps, breaks = 20, col = "tomato2")

About the mean, median and NA values

Notice that some days consist entirely of NA values:

head(activity, n = 25)
##    steps       date interval
## 1     NA 2012-10-01        0
## 2     NA 2012-10-01        5
## 3     NA 2012-10-01       10
## 4     NA 2012-10-01       15
## 5     NA 2012-10-01       20
## 6     NA 2012-10-01       25
## 7     NA 2012-10-01       30
## 8     NA 2012-10-01       35
## 9     NA 2012-10-01       40
## 10    NA 2012-10-01       45
## 11    NA 2012-10-01       50
## 12    NA 2012-10-01       55
## 13    NA 2012-10-01      100
## 14    NA 2012-10-01      105
## 15    NA 2012-10-01      110
## 16    NA 2012-10-01      115
## 17    NA 2012-10-01      120
## 18    NA 2012-10-01      125
## 19    NA 2012-10-01      130
## 20    NA 2012-10-01      135
## 21    NA 2012-10-01      140
## 22    NA 2012-10-01      145
## 23    NA 2012-10-01      150
## 24    NA 2012-10-01      155
## 25    NA 2012-10-01      200

In this data set, missing values come in whole days: whenever a value is NA, every interval of that day is NA.

If we had passed na.rm = TRUE to sum() during the aggregation, the total for each of those days would have been calculated as 0, which would lower the mean and median.

In other words, days without any data would have been treated as days with 0 measured steps, which would lead to wrong results.

Ignoring NAs should mean leaving them out of the calculation altogether, so we avoid turning those days into 0 during the aggregation.
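
The effect is easy to reproduce on a tiny made-up data set. The sketch below is only an illustration: the toy data frame and the names keep_na and drop_na are invented for this example and are not part of the analysis.

```r
# Toy data: day "A" has real readings, day "B" is entirely missing.
toy <- data.frame(
  steps = c(10, 20, NA, NA),
  date  = c("A", "A", "B", "B")
)

# Default sum() propagates NA, so the empty day stays NA.
keep_na <- aggregate(toy$steps, by = list(Date = toy$date), FUN = sum)

# With na.rm = TRUE the empty day silently becomes 0.
drop_na <- aggregate(toy$steps, by = list(Date = toy$date), FUN = sum, na.rm = TRUE)

mean(keep_na$x, na.rm = TRUE)  # 30: only the measured day counts
mean(drop_na$x)                # 15: the empty day drags the mean down
```

The second mean is halved only because a day with no measurements was counted as a day with zero steps.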

We can now calculate the mean and median over all days while ignoring the NA days, so this time we do set na.rm = TRUE.

# Now drop the NA days when calculating the mean and the median
meanSteps <- round(mean(DailySums$Steps, na.rm = TRUE), 2)
medianSteps <- median(DailySums$Steps, na.rm = TRUE)

Mean :

## [1] 10766.19

Median :

## [1] 10765

What is the average daily activity pattern?


Next we aggregate across intervals with mean() to get the average number of steps per interval across all days.

IntervalMeans <- aggregate(activity$steps, by = list(Interval = activity$interval), FUN = mean, na.rm = TRUE)
# The x column now holds the average steps, so rename the columns accordingly
names(IntervalMeans) <- c("Interval","AvgSteps")


For the time series plot, the Interval values are coded as integers such as 5, 10, 15 and so on, where 835 stands for 08:35. Plotted as plain numbers they would leave artificial gaps (the value after 55 is 100), so they are converted to POSIXlt with the strptime() function and the format %H%M.
Values shorter than 4 digits are first padded with leading zeros using the str_pad() function from the stringr library.

# Load library needed for str_pad
library(stringr)

# Convert the x-axis (Interval values) to time format,
# then draw the line plot.

plot(strptime(str_pad(IntervalMeans$Interval, 4, pad = "0"), format = "%H%M"),
     IntervalMeans$AvgSteps, type = "l",
     xlab = "Interval", ylab = "AvgSteps", col = "royalblue")
# Add a grid to the graph
grid(8,8)
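
The padding and conversion step can also be sketched in base R, without stringr. This is an alternative under the assumption that the intervals are whole numbers; the vector intervals and the other names below are made up for the example.

```r
# A few interval codes in the same HHMM-as-integer style as the data
intervals <- c(0L, 5L, 55L, 100L, 835L, 2355L)

# sprintf() pads to 4 digits with leading zeros: 5 -> "0005", 835 -> "0835"
padded <- sprintf("%04d", intervals)

# Parse the padded strings as hour and minute (the date part defaults to today)
times <- strptime(padded, format = "%H%M", tz = "UTC")

# Human-readable clock labels
labels <- format(times, "%H:%M")
labels
```

The result is one clock time per interval, so 55 and 100 end up five minutes apart instead of 45 units apart.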

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

The maximum average number of steps for an interval is

max(IntervalMeans$AvgSteps)
## [1] 206.1698

And the interval corresponding to that value is

IntervalMeans[IntervalMeans$AvgSteps == max(IntervalMeans$AvgSteps),]
##     Interval AvgSteps
## 104      835 206.1698

So the 5-minute interval starting at 08:35 has the maximum average number of steps across all days. This seems plausible, since the interval may fall just before the individual's office hours.
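
The same lookup can be written with which.max(), which returns the position of the largest value directly. The sketch below uses a made-up three-row table (toy_means, with invented averages) rather than the real IntervalMeans.

```r
# Invented averages for three neighbouring intervals
toy_means <- data.frame(
  Interval = c(830L, 835L, 840L),
  AvgSteps = c(177.3, 206.2, 195.9)
)

# which.max() gives the row index of the maximum, so no == comparison is needed
peak <- toy_means[which.max(toy_means$AvgSteps), ]
peak$Interval
```

One difference worth knowing: which.max() returns only the first maximum if there are ties, whereas the == comparison returns every tied row.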

Imputing missing values

Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

Since we are asked for the total number of rows (i.e. cases) that contain NAs, we can use the complete.cases() function to find the complete cases; its complement, !complete.cases(), marks the incomplete ones.

sum(!complete.cases(activity))
## [1] 2304
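
With 288 five-minute intervals in a day, 2304 missing rows is consistent with 2304 / 288 = 8 fully missing days. One way to check that NAs really come in whole days is to count them per date. The sketch below runs on made-up data (toy, with an intentionally partial day "d3"); on the real data the same expression would be applied to activity$steps and activity$date.

```r
# Toy data: d1 is fully missing, d2 is complete, d3 is only partly missing
toy <- data.frame(
  steps = c(NA, NA, NA, 4, 0, 7, 1, NA, 3),
  date  = rep(c("d1", "d2", "d3"), each = 3)
)

# Number of NA readings and number of readings per day
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
n_per_day  <- tapply(toy$steps, toy$date, length)

# Days that have some, but not all, readings missing
partial <- names(na_per_day)[na_per_day > 0 & na_per_day < n_per_day]
partial
```

If missing values only ever cover whole days, partial comes back empty.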

Devise a strategy for filling in all of the missing values in the dataset.

Since we already have the mean for each 5-minute interval, I chose to fill in each missing value with the mean for its 5-minute interval.

Create a new dataset that is equal to the original dataset but with the missing data filled in.

First make sure that the only NA values are in the steps column and never in date or interval.

sum(is.na(activity$date))
## [1] 0
sum(is.na(activity$interval))
## [1] 0
sum(is.na(activity$steps))
## [1] 2304

Now we can create a new dataset, activityFilled, by filling the missing values from IntervalMeans.

activityFilled <- activity

# Use IntervalMeans as a lookup table
for (i in seq_len(nrow(IntervalMeans))) {
  interval <- IntervalMeans[i, 1]
  avgValue <- IntervalMeans[i, 2]

  # Replace every NA steps value for this interval
  # with the matching average from IntervalMeans.
  activityFilled$steps[is.na(activityFilled$steps) & activityFilled$interval == interval] <- avgValue
}

# All the missing values are gone
sum(!complete.cases(activityFilled))
## [1] 0
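
The same fill can be done without a loop by using match() as the lookup. This is a sketch of an alternative, not the code used above; toy, toy_means and filled are made-up names and values.

```r
# Toy data: two intervals observed on three days, with two readings missing
toy <- data.frame(
  steps    = c(NA, 4, 6, NA, 2, 8),
  interval = c(0L, 5L, 0L, 5L, 0L, 5L)
)

# Per-interval means, built the same way as IntervalMeans
toy_means <- aggregate(toy$steps, by = list(Interval = toy$interval),
                       FUN = mean, na.rm = TRUE)
names(toy_means) <- c("Interval", "AvgSteps")

# match() finds, for each missing row, the lookup row with the same interval
filled  <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <-
  toy_means$AvgSteps[match(filled$interval[missing], toy_means$Interval)]

filled$steps  # 4 4 6 6 2 8
```

Interval 0 averages (6 + 2) / 2 = 4 and interval 5 averages (4 + 8) / 2 = 6, so those are the values written into the two gaps.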

Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day.

As in the first part:

# Use aggregate() to sum the steps for each day

DailySumsFilled <- aggregate(activityFilled$steps, by = list(Date = activityFilled$date), FUN = sum)
# Rename the default column name "x"
colnames(DailySumsFilled)[2] <- "Steps"

Draw the plot:

hist(DailySumsFilled$Steps, breaks = 20, col = "tomato2")

meanStepsFilled <- round(mean(DailySumsFilled$Steps), 2)
medianStepsFilled <- median(DailySumsFilled$Steps)

Mean with Filled Values:

## [1] 10766.19

Median with Filled Values:

## [1] 10766.19

Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?

Since the strategy I chose replaces each missing value with the mean of its interval, the mean did not change. The effect on the median is that it is now equal to the mean. The 2304 missing values correspond to 2304 / 288 = 8 whole days (a day has 288 five-minute intervals), and those days were completely NA before. Each of them now receives the same total, the sum of the interval means, which equals the mean daily total of 10766.19. With several days sitting exactly at the mean, the median shifts to that value.

Note that if I had not accounted for the NA values in the first part, we would have seen a much bigger difference between the old and the new mean and median.
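
A small worked example shows why the imputed days land on the mean. It relies on the assumption that missing values only occur as whole days; the matrix steps and its values are made up for illustration.

```r
# Rows are intervals, columns are days; day d3 is entirely missing
steps <- matrix(c(2, 4, 6, 8, NA, NA), nrow = 2,
                dimnames = list(interval = c("0", "5"),
                                day = c("d1", "d2", "d3")))

daily_totals   <- colSums(steps)                  # 6, 14, NA
interval_means <- rowMeans(steps, na.rm = TRUE)   # 4, 6

mean_before <- mean(daily_totals, na.rm = TRUE)   # (6 + 14) / 2 = 10
sum(interval_means)                               # 4 + 6 = 10, the same value

# Fill the missing day with the interval means and recompute
filled <- steps
filled[, "d3"] <- interval_means
totals_filled <- colSums(filled)                  # 6, 14, 10

mean(totals_filled)    # 10: unchanged
median(totals_filled)  # 10: now sits on the imputed day
```

Every interval mean is taken over the same set of observed days, so their sum is just the mean of the observed daily totals.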

Are there differences in activity patterns between weekdays and weekends?

1. Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
Here we add another column as a factor by applying the weekdays() function to the date column.

# weekdays() returns day names in the current locale;
# the comparison below assumes English names.
DayType <- weekdays(as.POSIXct(activity$date))
DayType[DayType == "Sunday" | DayType == "Saturday"] <- "weekend"
DayType[DayType != "weekend"] <- "weekday"
activity$DayFactor <- as.factor(DayType)
# How our dataset looks now
head(activity)
##   steps       date interval DayFactor
## 1    NA 2012-10-01        0   weekday
## 2    NA 2012-10-01        5   weekday
## 3    NA 2012-10-01       10   weekday
## 4    NA 2012-10-01       15   weekday
## 5    NA 2012-10-01       20   weekday
## 6    NA 2012-10-01       25   weekday
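
Because weekdays() depends on the locale, a version based on the numeric day of the week may be more portable. The sketch below is an alternative, not the code used above; it relies on the wday field of POSIXlt, which counts days from 0 (Sunday) to 6 (Saturday). The dates and variable names are chosen only for the example.

```r
# Three dates from the study period: a Monday, a Saturday and a Sunday
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# wday is 0 for Sunday and 6 for Saturday, regardless of locale
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
day_type
```

Fixing the factor levels explicitly also keeps the panel order stable even if one of the two day types happens to be absent.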

2. Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).


Making the plot using the lattice system. (I did not convert the intervals to POSIXlt this time, since the example plot had the intervals as integers.)

# The aggregated column keeps the default name "x", which is used in the plot formula
IntervalMeansbyDays <- aggregate(activity$steps, by = list(Interval = activity$interval, DayFactor = activity$DayFactor), FUN = mean, na.rm = TRUE)
library(lattice)
xyplot(x ~ Interval | DayFactor, data = IntervalMeansbyDays, type = "l", layout = c(1, 2), ylab = "Avg Number of Steps")



There are obvious differences between the patterns. On weekdays the individual walks mostly between about 8:30 and 9:00 AM (possibly to work or school) and does not walk much during the rest of the day. On weekends, there is more walking activity overall, and it can happen at almost any time of day.
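
As a side note, the grouped averages can also be computed with the formula interface of aggregate(), which keeps the original column names and drops rows with NA by default. The sketch below uses a made-up data frame (toy); by_type is a name invented for the example.

```r
# Toy data with one missing reading
toy <- data.frame(
  steps     = c(10, NA, 30, 20, 40, 60),
  interval  = c(0L, 0L, 0L, 5L, 5L, 5L),
  DayFactor = factor(c("weekday", "weekday", "weekend",
                       "weekday", "weekday", "weekend"))
)

# The formula method omits NA rows by default (na.action = na.omit)
# and names the result column "steps" instead of "x"
by_type <- aggregate(steps ~ interval + DayFactor, data = toy, FUN = mean)
by_type
```

With a result shaped like this, the lattice formula would read steps ~ interval | DayFactor instead of x ~ Interval | DayFactor.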