Student Name: Sherri Verdugo, M.S. Date: July 19, 2014
Step One: Load in the data
df <- read.csv("activity.csv")
Step Two: Check the head of the data
head(df)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
We have a few steps to take here. First, I like to plot the data. This time I am using qplot from the ggplot2 library. Make sure you have that installed. If you do not have it, run install.packages("ggplot2").
library(ggplot2)
df.steps <- tapply(df$steps, df$date, FUN=sum, na.rm=TRUE)
qplot(df.steps, binwidth=1000, xlab="Total # of steps taken each day",
main="Steps with binwidth set at 1000")
The next step is to find the mean and median of the total number of steps per day.
mean(df.steps, na.rm=TRUE)
## [1] 9354
median(df.steps, na.rm=TRUE)
## [1] 10395
This time, we are looking at the average daily activity pattern. This means that we have to aggregate the steps by interval and then plot the result. Again, we are using the ggplot2 library, so make sure you have it installed.
library(ggplot2)
df.averages <- aggregate(x=list(steps=df$steps), by=list(interval=df$interval), FUN=mean, na.rm=TRUE)
ggplot(data=df.averages, aes(x=interval, y=steps)) + geom_line() + xlab("Intervals set at 5 minutes") + ylab("Average of steps taken")
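To see what the aggregation step does with missing values, here is a minimal sketch on a made-up data frame. The name `toy` and its values are hypothetical and are not taken from activity.csv. Because `na.rm=TRUE` is passed through to `mean`, each interval average is computed from the non-missing observations only.

```r
# Toy data (hypothetical values): two intervals, three observations each
toy <- data.frame(
  steps    = c(NA, 10, 20, 4, NA, 6),
  interval = c(0, 0, 0, 5, 5, 5)
)

# Same call shape as the report: average steps per interval, ignoring NA
toy.averages <- aggregate(x = list(steps = toy$steps),
                          by = list(interval = toy$interval),
                          FUN = mean, na.rm = TRUE)
print(toy.averages)
```

Interval 0 averages the values 10 and 20, and interval 5 averages 4 and 6; the NA entries do not contribute.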
Further, averaged across all days in the dataset df, the 5-minute interval that contains the maximum number of steps is:
df.averages[which.max(df.averages$steps),]
## interval steps
## 104 835 206.2
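The interval identifier can be easier to read as a clock time. The sketch below assumes that the identifier encodes hours and minutes as HHMM (so 835 would mean 08:35); the report does not state this, so treat it as an assumption. The helper name `interval_to_clock` is hypothetical.

```r
# Assumption: interval identifiers encode time of day as HHMM (e.g. 835 = 08:35)
interval_to_clock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))   # left-pad with zeros to 4 digits
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

clock <- interval_to_clock(c(0, 5, 835))
print(clock)
```

Under that assumption, the peak interval 835 would fall in the morning, at 08:35.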
This dataset has many missing values that are coded as NA. The presence of missing data may introduce bias into the analysis. We need to take care to address this and carefully impute the data using R. First, we flag the missing items in the data frame. Then we generate a table to count them.
df.missing <- is.na(df$steps)
table(df.missing)
## df.missing
## FALSE TRUE
## 15264 2304
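The counts in the table can be turned into a share of missing rows with a little arithmetic. This sketch only reuses the two numbers printed above; the variable names are made up for illustration. The last line assumes 288 five-minute intervals per day (24 * 60 / 5).

```r
# Counts taken from the table above
n.present <- 15264
n.missing <- 2304

# Share of rows with a missing step count
share.missing <- n.missing / (n.present + n.missing)
print(round(share.missing, 3))

# Number of 5-minute intervals in one day, and how many days' worth of rows are missing
intervals.per.day <- 24 * 60 / 5
days.equivalent <- n.missing / intervals.per.day
print(days.equivalent)
```

About 13% of the rows are missing, which is the equivalent of 8 full days of measurements if the gaps happen to fall on whole days.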
We can replace the missing values with the mean value of the corresponding 5-minute interval by using a function that checks is.na on the number of steps. This was tricky, and it took some time to work through the various ways of doing it.
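# Return the observed step count, or the interval average when it is missing
nafiller <- function(steps, interval){
  filler <- NA
  if (!is.na(steps))
    filler <- c(steps)   # keep the observed value
  else
    filler <- (df.averages[df.averages$interval==interval, "steps"])   # interval mean
  return(filler)
}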
myfill.df <- df
myfill.df$steps <- mapply(nafiller, myfill.df$steps, myfill.df$interval)
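The same imputation can also be written without a row-by-row function call. The sketch below is an alternative, not the method used in this report: it looks up each row's interval average with `match` and fills the gaps with `ifelse`. The data frame `toy` and its values are hypothetical.

```r
# Toy data (hypothetical values): one missing observation per interval
toy <- data.frame(steps = c(NA, 3, 7, NA), interval = c(0, 5, 0, 5))

# Interval averages; the formula interface drops rows with NA by default
toy.averages <- aggregate(steps ~ interval, data = toy, FUN = mean)

# For every row, find the average that belongs to its interval
lookup <- toy.averages$steps[match(toy$interval, toy.averages$interval)]

# Keep observed values, fill missing ones with the interval average
toy.filled <- toy
toy.filled$steps <- ifelse(is.na(toy$steps), lookup, toy$steps)
print(toy.filled)
```

On a data frame of this size the difference is negligible, but the vectorized form avoids calling a function once per row.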
Now we can look at what we have done so far by calling the object.
head(myfill.df)
## steps date interval
## 1 1.71698 2012-10-01 0
## 2 0.33962 2012-10-01 5
## 3 0.13208 2012-10-01 10
## 4 0.15094 2012-10-01 15
## 5 0.07547 2012-10-01 20
## 6 2.09434 2012-10-01 25
Next, we plot a histogram of the filled-in data set and recompute the mean and median.
myts <- tapply(myfill.df$steps, myfill.df$date, FUN=sum)
qplot(myts, binwidth=1000, xlab="Total Number of Steps per Day",main="Total Number of Steps per Day After Imputation")
mean(myts)
## [1] 10766
median(myts)
## [1] 10766
From the imputation process, we notice that the mean and median values are higher. One explanation is that the original data has some days where ‘steps’ is NA for every ‘interval’. With na.rm=TRUE, the total number of steps for such a day is computed as 0, and these zero totals pull the mean and median down. After the imputation, those days receive the interval averages instead, so the mean and median increase.
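The effect of all-NA days on the daily totals can be reproduced on a small made-up example. The data frame `toy`, its day labels, and its values are hypothetical. The point of the sketch is that `sum(..., na.rm=TRUE)` over a day with only NA values returns 0, and that this zero is then included in the mean.

```r
# Toy data (hypothetical values): day d1 has no recorded steps at all
toy <- data.frame(
  steps = c(NA, NA, 100, 200, 300, 500),
  date  = rep(c("d1", "d2", "d3"), each = 2)
)

# Daily totals; the all-NA day becomes 0 rather than NA
totals <- tapply(toy$steps, toy$date, FUN = sum, na.rm = TRUE)
print(totals)

# Flag days on which every observation is missing
all.missing <- tapply(is.na(toy$steps), toy$date, all)

mean.with.zero    <- mean(totals)                 # zero total included
mean.without.zero <- mean(totals[!all.missing])   # all-NA day left out
print(c(mean.with.zero, mean.without.zero))
```

The mean that includes the zero total is lower than the mean over days with data, which matches the direction of the change seen after imputation.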
Next, we compare activity on weekdays and weekends. To do this, we have to look at the day of the week for every single measurement in the data that we are analyzing. We will continue using our filled data (myfill.df) for this portion of the assignment.
# Classify a Date as "Weekday" or "Weekend"
# Note: weekdays() returns day names in the current locale
week.identify <- function(date){
  day <- weekdays(date)
  if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
    return("Weekday")
  else if (day %in% c("Saturday", "Sunday"))
    return("Weekend")
  else
    stop("Invalid Date")
}
myfill.df$date <- as.Date(myfill.df$date)
myfill.df$day <- sapply(myfill.df$date, FUN=week.identify)
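Because `weekdays` returns day names in the current locale, the comparison against English names stops with "Invalid Date" on a non-English system. A possible locale-independent variant, shown here as a sketch rather than as the approach used in this report, relies on the numeric day of the week from `as.POSIXlt` (0 for Sunday through 6 for Saturday). The function name `week.identify.wday` is hypothetical.

```r
# Locale-independent sketch: classify by numeric weekday instead of day name
week.identify.wday <- function(date) {
  wday <- as.POSIXlt(date)$wday   # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
}

# 2012-10-01 is a Monday; 2012-10-06 and 2012-10-07 are a Saturday and a Sunday
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))
labels <- week.identify.wday(dates)
print(labels)
```

This version is also vectorized, so it can be applied to the whole date column directly without `sapply`.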
Let’s look at what we have so far for identifying each day as a weekend or a weekday. Is R smart enough to handle that? The answer is yes.
head(myfill.df$day)
## [1] "Weekday" "Weekday" "Weekday" "Weekday" "Weekday" "Weekday"
The next step is to visually explore the data that we created. We use a panel plot that shows the average number of steps per 5-minute interval, with one panel for weekdays and one for weekends. Do people take more steps on the weekends or the weekdays?
avg <- aggregate(steps ~ interval + day, data=myfill.df, mean)
ggplot(avg, aes(interval, steps))+geom_line()+ facet_grid(day ~ .) + xlab("Intervals at 5 minutes") + ylab("# of Steps")
From the graph, we see that weekday steps start out similar to the weekend steps. The difference is that more regular patterns occur in the weekend steps.
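To put a number on the weekday versus weekend comparison, the average can also be taken per day type. The sketch below uses a made-up data frame; `toy` and its values are hypothetical. The same call shape would apply to myfill.df, assuming it has the steps and day columns built earlier.

```r
# Toy data (hypothetical values): two intervals per day type
toy <- data.frame(
  steps    = c(10, 20, 30, 50),
  interval = c(0, 5, 0, 5),
  day      = c("Weekday", "Weekday", "Weekend", "Weekend")
)

# Average steps per 5-minute interval, by day type
by.day <- aggregate(steps ~ day, data = toy, FUN = mean)
print(by.day)
```

A single average per day type hides the shape of the daily pattern, so it complements the panel plot rather than replacing it.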