This is a submission for the Reproducible Research course, which is part of the data science specialization.
In order for this code to work, the CSV file extracted from the ZIP archive containing the activity data must be inside the working directory; the archive is assumed to be already extracted. The ZIP can be downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
Set the working directory so that R can locate the extracted data file.
setwd("C:/Users/nma/Documents/R/repr_rsrch_proj1")
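For a run that does not depend on a manual download, the setup step could be scripted roughly as follows. This is only a sketch: the helper name `ensure_activity_data` and its arguments are hypothetical, and it assumes the archive contains a file named `activity.csv`, as used in the rest of this report.

```r
# Hypothetical helper (not part of the original analysis): make sure
# activity.csv exists in `dir`, downloading and extracting the archive
# only when the file is missing.
ensure_activity_data <- function(
    dir = ".",
    url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip") {
  csv_path <- file.path(dir, "activity.csv")
  if (!file.exists(csv_path)) {
    zip_path <- file.path(dir, "activity.zip")
    if (!file.exists(zip_path)) {
      # mode = "wb" keeps the binary archive intact on Windows
      download.file(url, destfile = zip_path, mode = "wb")
    }
    unzip(zip_path, exdir = dir)
  }
  csv_path
}
```

With such a helper, the read step could become `read.csv(ensure_activity_data())`, and nothing is downloaded when the CSV is already present.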
Upon extracting the archive, the CSV is read using read.csv and stored in a data frame. The date attribute is converted from strings to dates using as.Date.
read_activityData <- read.csv("activity.csv")
read_activityData$date <- as.Date(read_activityData$date) # convert strings to date
head(read_activityData)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
By converting the date strings to Date objects, it becomes easier to separate weekdays from weekends and to produce the time-based plots.
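As a small illustration of what the Date conversion makes possible, the sketch below classifies three example dates without relying on day names, which depend on the locale. The variable names are illustrative only; `as.POSIXlt(...)$wday` numbers the days 0 (Sunday) to 6 (Saturday).

```r
# Example dates: a Monday, a Saturday and a Sunday in October 2012
d <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# wday is 0 for Sunday through 6 for Saturday, independent of the locale
wday <- as.POSIXlt(d)$wday
is_weekend <- wday %in% c(0, 6)

print(is_weekend)
## [1] FALSE  TRUE  TRUE
```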
Some pre-processing is done to omit rows with NA values.
activityDataCleaned <- read_activityData[complete.cases(read_activityData),]
print(nrow(read_activityData))
## [1] 17568
The row count after cleaning shows that 2304 rows (17568 - 15264) are removed in this manner.
print(nrow(activityDataCleaned))
## [1] 15264
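The filtering step can be checked on a small made-up data frame. This is a sketch with toy values, not the activity data: `complete.cases` returns TRUE for rows without any NA, so indexing with it drops the incomplete rows.

```r
# Toy data: two of the four rows have a missing step count
toy <- data.frame(steps    = c(NA, 4, 7, NA),
                  interval = c(0, 5, 10, 15))

keep      <- complete.cases(toy)   # TRUE where no column is NA
toy_clean <- toy[keep, ]

# Rows removed; equals the NA count when only one column has NAs
removed <- nrow(toy) - nrow(toy_clean)
print(removed)
## [1] 2
```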
The total number of steps per day is computed with aggregate and displayed with barplot:
steps_per_day <- aggregate(steps ~ date, data = activityDataCleaned, FUN = sum)
barplot(steps_per_day$steps, names.arg = steps_per_day$date)
plot of total number of steps taken per day
Calculate the mean and median steps per day from this cleaned data set:
aggregatedPerDate <- aggregate(activityDataCleaned$steps, by=list(activityDataCleaned$date), FUN=sum)
print(mean(aggregatedPerDate$x))
## [1] 10766
print(median(aggregatedPerDate$x))
## [1] 10765
Using the cleaned data set with missing values removed, the plot below shows the number of steps per time interval, averaged over all days:
meansPerInterval <- aggregate(activityDataCleaned$steps,by=list(activityDataCleaned$interval), FUN=mean)
mediansPerInterval <- aggregate(activityDataCleaned$steps,by=list(activityDataCleaned$interval), FUN=median)
# Plot the average daily activity pattern
library(ggplot2)
ggplot(data=meansPerInterval, aes(x=Group.1, y=x)) +
geom_line() + geom_point() + xlab("Interval") +
ylab("Average number of steps")
plot of average daily pattern
Time interval with maximum average number of steps:
maximumIndex <- which.max(meansPerInterval$x)
print(meansPerInterval[maximumIndex,"Group.1"])
## [1] 835
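The value 835 is easier to interpret as a time of day. The sketch below assumes that the interval identifiers encode the clock time as hours times 100 plus minutes. That encoding is an assumption not stated in this report, and `interval_to_clock` is a hypothetical helper.

```r
# Hypothetical helper; assumes identifiers of the form HHMM (e.g. 835 = 08:35)
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock <- interval_to_clock(c(0, 835, 2355))
print(clock)
## [1] "00:00" "08:35" "23:55"
```

Under that assumption, the most active five-minute interval starts at 08:35 in the morning.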
The input data set contains NA values. Their number is obtained with is.na, by counting the rows for which ‘steps’ is NA:
print(sum(is.na(read_activityData$steps)))
## [1] 2304
The missing values are imputed with the mean number of steps for the same interval, taken from the ‘complete’ data set obtained earlier. The original data set is copied, and a loop over the copy finds the NA entries and uses the ‘mean steps per interval’ table as a lookup table to fill them in.
# Copy the original data frame
copy <- as.data.frame(read_activityData)
# Replace NA data with means from 'means per interval' frame
for (i in 1:nrow(copy)) {
if (is.na(copy[i,"steps"])) {
intervalToSearch <- copy[i,"interval"]
value <- meansPerInterval[meansPerInterval$Group.1 == intervalToSearch, "x"]
copy[[i,"steps"]] <- value
}
}
print(head(copy))
## steps date interval
## 1 1.71698 2012-10-01 0
## 2 0.33962 2012-10-01 5
## 3 0.13208 2012-10-01 10
## 4 0.15094 2012-10-01 15
## 5 0.07547 2012-10-01 20
## 6 2.09434 2012-10-01 25
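The same lookup can be written without an explicit loop. The following is a sketch on made-up data, not a replacement for the code above: `match` finds, for each missing row, the position of its interval in the table of interval means.

```r
# Toy data: two intervals observed on three days, with two gaps
toy <- data.frame(steps    = c(NA, 2, 4, 6, 8, NA),
                  interval = c(0, 5, 0, 5, 0, 5))

# Mean per interval from the non-missing rows: interval 0 -> 6, interval 5 -> 4
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

filled  <- toy
missing <- is.na(filled$steps)
# Position of each missing row's interval in the lookup table
lookup  <- match(filled$interval[missing], interval_means$interval)
filled$steps[missing] <- interval_means$steps[lookup]

print(filled$steps)
## [1] 6 2 4 6 8 4
```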
We confirm that there are no more NA:
sum(is.na(copy$steps))
## [1] 0
Now we make a bar plot of the total steps per day, similar to the one above, but using the imputed data set.
ggplot(copy, aes(x=as.factor(date), y=steps)) + geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=-90)) + xlab("Date")
plot of interpolated data
aggregatedPerDateInterpolated <- aggregate(copy$steps, by=list(copy$date), FUN=sum)
print(mean(aggregatedPerDateInterpolated$x))
## [1] 10766
print(median(aggregatedPerDateInterpolated$x))
## [1] 10766
Based on the differences between the values above, the impact of imputing the missing data is minimal: the mean stays at 10766 and the median changes from 10765 to 10766.
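One possible explanation for the unchanged mean is that the missing values fall on whole days. This is an assumption that was not checked above, although 2304 missing rows would match 8 days of 288 intervals. In that case each imputed day receives the sum of the interval means, which equals the mean daily total. A toy sketch with made-up numbers:

```r
# Toy data: three days, two intervals each; the first day is entirely missing
toy <- data.frame(
  date     = rep(as.Date(c("2012-10-01", "2012-10-02", "2012-10-03")), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 20, 30, 40)
)

# Interval means from the observed rows: interval 0 -> 20, interval 5 -> 30
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

filled  <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <-
  interval_means$steps[match(filled$interval[missing], interval_means$interval)]

before <- aggregate(steps ~ date, data = toy,    FUN = sum)$steps  # 30, 70
after  <- aggregate(steps ~ date, data = filled, FUN = sum)$steps  # 50, 30, 70

# The imputed day gets exactly the mean daily total, so the mean is unchanged
c(mean(before), mean(after))
## [1] 50 50
```

Because the imputed days sit exactly at the mean, they can also pull the median onto the mean, which would be consistent with the median moving from 10765 to 10766.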
By splitting the filled-out data set into two, for week days and weekends, we can create an extra ‘type’ column indicating whether the data was collected on a weekday or a weekend day, and paste the data together again.
# Add a column with the day of the week (the day names depend on the locale)
copy$Weekday <- weekdays(copy$date)
# Split into week days vs weekends
weekdays <- copy[copy$Weekday != "Saturday" & copy$Weekday != "Sunday",]
weekends <- copy[copy$Weekday == "Saturday" | copy$Weekday == "Sunday",]
#Create means per interval
weekdayIntervalMeans <-aggregate(weekdays$steps,by=list(weekdays$interval), FUN=mean)
weekendIntervalMeans <-aggregate(weekends$steps,by=list(weekends$interval), FUN=mean)
weekdayIntervalMeans$type <- "Weekday"
weekendIntervalMeans$type <- "Weekend"
# Paste together again
pastedData <- rbind(weekdayIntervalMeans,weekendIntervalMeans)
We can then use this ‘type’ column as facets in a ggplot plot.
ggplot(data=pastedData, aes(x=Group.1, y=x)) +
geom_line() + xlab("Interval") +
ylab("Average number of steps") + facet_grid(type ~ .)
From the looks of these plots, there are much clearer ‘peaks’ in activity on weekdays than on weekend days, presumably because during work hours the individual from whom the data were collected is often seated, and they walk mostly during their commute to and from work.
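The split-and-paste step used for this comparison can also be sketched as a single aggregation over a ‘type’ factor. The example below uses made-up data and illustrative variable names, and classifies days by number rather than by name so that it does not depend on the locale.

```r
# Toy data: a Friday, a Saturday and a Sunday, two intervals each
toy <- data.frame(
  date     = rep(as.Date(c("2012-10-05", "2012-10-06", "2012-10-07")), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, 2, 4, 6, 8)
)

# wday: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(toy$date)$wday
toy$type <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

# One row per interval and day type, holding the mean number of steps
means_by_type <- aggregate(steps ~ interval + type, data = toy, FUN = mean)
print(means_by_type)
```

The resulting data frame already has the long layout that `facet_grid(type ~ .)` expects, with columns named `interval`, `type` and `steps` instead of `Group.1` and `x`.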