A model of Keith Helfrich

A few simple plots in R

Recent coursework for the Exploratory Data Analysis & Reproducible Research courses, which make up part of the 9-month Data Science Specialization offered on Coursera by Johns Hopkins University, offers the opportunity to examine what it takes to manipulate data & produce simple plots in R.

Some code might be expected for data manipulation. But the code to plot & annotate simple charts shows the reason why Tableau is such a pleasure to work within.

It is also interesting to observe the code to impute missing values. I came to find later that an R package already exists for this. Nevertheless, one learns more when doing it by hand :) And even with a package, this is a far simpler task to perform in Tableau.

The Data

The gist of the assignment is quite simple:

A personal activity monitoring device collects observations at 5-minute intervals throughout the day. For two months of data from an anonymous individual, we’re looking at the number of steps taken during the 5-minute intervals of each day.

The variables included are:

  • steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
  • date: The date on which the measurement was taken, in YYYY-MM-DD format
  • interval: Identifier for the 5-minute interval in which the measurement was taken

Reproducible Research: Peer Assessment 1

Keith V. Helfrich July 10, 2014

Read Data

It's nice that we can read the CSV data from inside of the zip file:

data <- read.csv(unz("activity.zip", "activity.csv"))

Pre-Processing: Remove Missing Values

For this part of the assignment, we can ignore the missing values. Original values are those which are not missing from the original data. Later, we will come back to replace the NA's with a point estimate (best guess). For now, how many missing values are we ignoring?

originalValue <- complete.cases(data)  
nMissing <- sum(!originalValue)                                              # number of records with NA
nComplete <- sum(originalValue)                                              # number of complete records

title="Missing vs. Complete Cases"  
barplot(table(originalValue),main=title,xaxt='n', col="bisque3")             # render Complete Cases barplot  
axis(side=1,at=c(.7,1.9),labels=c("Missing","Complete"),tick=FALSE)          # render axis  
text(.7,0,labels=nMissing, pos=3)                                            # label the NA's bar  
text(1.9,0,labels=nComplete, pos=3)  
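As a small sketch of the counting step: complete.cases() returns a logical vector, and since TRUE counts as 1, summing it counts rows directly. The toy data frame and the names toy, isComplete, nMissingToy and nCompleteToy are invented for illustration and are not part of the assignment.

```r
# Toy data frame, invented for illustration (not the activity data)
toy <- data.frame(steps    = c(NA, 3, 0, NA, 12),
                  interval = c(0, 5, 10, 15, 20))

isComplete   <- complete.cases(toy)   # TRUE where the row has no NA
nMissingToy  <- sum(!isComplete)      # TRUE counts as 1, so this counts incomplete rows
nCompleteToy <- sum(isComplete)       # count of complete rows

table(isComplete)                     # same counts as a table (FALSE first, then TRUE)
```

Note that indexing with a single `=` inside the brackets, as in x[x=FALSE], does not compare anything, which is why a count built that way can silently come out wrong.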

What is the mean total number of steps taken per day?

Interesting that the total number of steps per day follows a nearly normal distribution without any outliers. Let's make a histogram of the total number of steps taken each day, and report the mean and median of the same.

completes<-subset(data,complete.cases(data))              # build a subset of the complete values

splitByDay<-split(completes,completes$date, drop=TRUE)          # split the complete cases by date  
dailySteps<-sapply(splitByDay, function(x) sum(x$steps))        # build a numeric vector w/ daily sum of steps  
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3") # plot a histogram  
abline(v=mean(dailySteps), lty=3, col="blue")                   # draw a blue line thru the mean  
abline(v=median(dailySteps), lty=4, col="red")                  # draw a red line thru the median  
text(mean(dailySteps),25,labels="mean", pos=4, col="blue")      # label the mean  
text(mean(dailySteps),23,labels="median", pos=4, col="red")     # label the median  
rug(dailySteps, col="chocolate")          

The mean and median total number of steps per day are nearly the same.

summary(dailySteps)       
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8841   10765   10766   13294   21194
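The split-then-sapply pattern above can probably be condensed with tapply(), which groups one vector by another and applies a function per group. This is only a sketch on made-up numbers; the toy data frame and the name dailyToy are assumptions for illustration.

```r
# Toy data, invented for illustration: two days, three intervals each
toy <- data.frame(date     = rep(c("2012-10-02", "2012-10-03"), each = 3),
                  interval = rep(c(0, 5, 10), times = 2),
                  steps    = c(0, 10, 5, 20, 0, 30))

# tapply() groups steps by date and applies sum() to each group
dailyToy <- tapply(toy$steps, toy$date, sum)

dailyToy           # one total per day, named by date
mean(dailyToy)     # mean of the daily totals
median(dailyToy)   # median of the daily totals
```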

What is the average daily activity pattern?

Let's make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).

splitByInterval <- split(completes,completes$interval, drop=TRUE)     # split the complete cases by interval
intervalAvg <- sapply(splitByInterval, function(x) mean(x$steps))     # vector of Avg. steps per interval  
plot(intervalAvg, type="l",  
     main="5' Interval Time Series", 
     ylab="Average # Steps", 
     xlab="Interval INDEX", col="chocolate")                          # plot the 5' time series
abline(v=which.max(intervalAvg), lty=3, col="blue")                   # draw a blue line thru the max
text(which.max(intervalAvg),max(intervalAvg),  
     labels=paste("max = ",as.character(round(max(intervalAvg)))), 
     pos=4, col="blue")       

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps? Well, that would be the interval by the name 835, with an average # steps = 206, which is located in the INDEX position # 104.

names(which.max(intervalAvg))   
## [1] "835"
round(max(intervalAvg))           
## [1] 206
which.max(intervalAvg)     
## 835 
## 104
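The name 835 and the index 104 line up if the interval label encodes clock time as hours and minutes. That reading is an assumption (the variable description does not say so), but the arithmetic agrees: the 104th 5-minute slot starts 515 minutes after midnight. A minimal sketch, with label, clock and minutesAfterMidnight as illustrative names:

```r
# Assumption: an interval label such as 835 encodes HHMM (8 hours, 35 minutes)
label <- 835
clock <- sprintf("%02d:%02d", label %/% 100, label %% 100)
clock

# Cross-check against the index position reported above:
# index 104 is the 104th 5-minute slot, starting (104 - 1) * 5 minutes after midnight
minutesAfterMidnight <- (104 - 1) * 5
fromIndex <- sprintf("%02d:%02d", minutesAfterMidnight %/% 60, minutesAfterMidnight %% 60)
fromIndex
```

If the assumption holds, this person's busiest 5 minutes start at 8:35 in the morning.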

Imputing missing values

  • Since I could not reproduce the loop in the section below, I opted to transform the lists into a data frame and then merge the data for the following manipulations.

Up until now, we've ignored the records with missing values. Yet the presence of missing values can introduce bias into some calculations or summaries of the data. The total number of missing values in the dataset (i.e. the total number of rows with NAs) is 2,304:

sum(!complete.cases(data))
## [1] 2304

To impute missing values, I will use the mean across all days for the 5-minute interval in which the NA occurs. So let's create a new dataset, equal to the original, but with estimates for the missing data filled in. We'll get there by adding a fourth utility column, which contains a TRUE / FALSE boolean to indicate whether the originalValue was present (TRUE) or missing (FALSE).

newData <- cbind(data,originalValue)                          # newData, with 'originalValue' column  
splitByOrig<-split(newData,newData$originalValue, drop=TRUE)  # split newData by whether originalValue exists

# For each row in the split data frame where originalValue == FALSE, 
# replace NA with the intervalAvg (rounded to the nearest integer)
# the impute value is found with a lookup from the intervalAvg named vector created earlier

#for (row in 1:nrow(splitByOrig[["FALSE"]])){  
#    splitByOrig[["FALSE"]][row,1] <- round(subset(intervalAvg,names(intervalAvg)==as.character(splitByOrig[["FALSE"]][row,3])))
#}


# Alternative adopted
a <- as.data.frame(splitByOrig[["FALSE"]])
dim(a)
## [1] 2304    4
b <- as.data.frame(splitByOrig[["TRUE"]])
dim(b)
## [1] 15264     4
ab <- rbind(a,b)
dim(ab)
## [1] 17568     4
head(ab)
##   steps       date interval originalValue
## 1    NA 2012-10-01        0         FALSE
## 2    NA 2012-10-01        5         FALSE
## 3    NA 2012-10-01       10         FALSE
## 4    NA 2012-10-01       15         FALSE
## 5    NA 2012-10-01       20         FALSE
## 6    NA 2012-10-01       25         FALSE
library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ab <- ab %>%
        group_by(interval) %>%
                summarise(steps = mean(steps))
head(ab)
## # A tibble: 6 x 2
##   interval steps
##      <int> <dbl>
## 1        0    NA
## 2        5    NA
## 3       10    NA
## 4       15    NA
## 5       20    NA
## 6       25    NA
splitByOrig <- split(ab, seq(nrow(ab)))

Now we have a list named splitByOrig, with two data frame elements: one data frame named "TRUE" which contains all those observations for which an originalValue was present, and another named "FALSE" which contains the imputed intervalAvg # of steps for the missing 5' intervals.

The imputation is done now, so let's bring the pieces back together again. Chronological order was temporarily ruined by splitting & recombining the data, so we also need to re-order the rows by date & interval.

# I dropped this option when adopting the data frame solution
#newData <- rbind(splitByOrig[["FALSE"]],splitByOrig[["TRUE"]])           # combine the TRUE & FALSE data frames  
#newData <- newData[with(newData, order(date, interval)), ]               # re-order by date & interval
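For comparison, here is a minimal sketch of the imputation idea described above (replace each NA with the rounded mean for its interval), written without a loop or a split. It runs on a made-up toy data frame; toy, avgToy, imputed and naRows are illustrative names, and this is not the code that produced the output shown in this post.

```r
# Toy data, invented for illustration: three days x three intervals, first day missing
toy <- data.frame(date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
                  interval = rep(c(0, 5, 10), times = 3),
                  steps    = c(NA, NA, NA,  2, 10, 4,  4, 20, 8))

# Mean steps per interval, ignoring NA; the result is named by interval
avgToy <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

imputed <- toy
imputed$originalValue <- !is.na(toy$steps)   # TRUE where a value was observed
naRows <- is.na(imputed$steps)               # rows that need an estimate

# Look up each missing row's interval average by name, rounded to whole steps
imputed$steps[naRows] <- round(avgToy[as.character(imputed$interval[naRows])])
imputed
```

Because the rows are filled in place, chronological order is preserved and no re-ordering is needed afterwards. Note the na.rm = TRUE: without it, any interval that contains an NA averages to NA.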

Using the newData, make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day.

splitNewByDay <- split(newData,newData$date, drop=TRUE)                  # split the newData by date  
dailyStepsNew <- sapply(splitNewByDay, function(x) sum(x$steps))         # numeric vector w/ daily sum of steps  
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab="# Steps", col="bisque3") # plot a histogram  
abline(v=mean(dailyStepsNew, na.rm=TRUE), lty=3, col="blue")             # draw a blue line thru the new mean  
abline(v=median(dailyStepsNew, na.rm=TRUE), lty=4, col="red")            # draw a red line thru the new median  
text(mean(dailyStepsNew, na.rm=TRUE),35,labels="mean", pos=4, col="blue")   # label the mean  
text(mean(dailyStepsNew, na.rm=TRUE),33,labels="median", pos=4, col="red")  # label the median  
rug(dailyStepsNew,col="chocolate")

While the quartiles vary a bit, the mean and median total number of steps per day are exactly the same.

summary(dailySteps)     
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8841   10765   10766   13294   21194
summary(dailyStepsNew)     
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      41    8841   10765   10766   13294   21194       8

What is the impact of imputing missing data ?

In fact, by using the average value for imputation, the only difference is in the frequency (number of observations) for the center bar of the new histogram:

par(mfrow=c(1,2))

### plot the original histogram
hist(dailySteps, main="Hist Total Steps per Day", xlab="# Steps", col="bisque3", ylim=c(0,35)) # plot a histogram  
abline(v=mean(dailySteps), lty=3, col="blue")                      # draw a blue line thru the mean  
abline(v=median(dailySteps), lty=4, col="red")                     # draw a red line thru the median  
text(mean(dailySteps),25,labels="mean", pos=4, col="blue")         # label the mean  
text(mean(dailySteps),23,labels="median", pos=4, col="red")        # label the median  
rug(dailySteps, col="chocolate")

### plot the imputed histogram
hist(dailyStepsNew, main="NEW Hist: Total Steps per Day", xlab="# Steps", col="bisque3", ylab="") # plot a histogram  
abline(v=mean(dailyStepsNew, na.rm=TRUE), lty=3, col="blue")       # draw a blue line thru the new mean  
abline(v=median(dailyStepsNew, na.rm=TRUE), lty=4, col="red")      # draw a red line thru the new median  
text(mean(dailyStepsNew, na.rm=TRUE),35,labels="mean", pos=4, col="blue")   # label the mean  
text(mean(dailyStepsNew, na.rm=TRUE),33,labels="median", pos=4, col="red")  # label the median  
rug(dailyStepsNew,col="chocolate")
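One way to see why only the center bar grows: 2,304 missing rows is exactly 8 x 288, i.e. 8 days' worth of 5-minute intervals, which suggests (an assumption, not verified here) that the gaps are whole days. A whole day filled with interval averages gets a daily total close to the mean daily total, and adding values equal to the mean leaves the mean where it was. A sketch on made-up daily totals, with observed and filled as illustrative names:

```r
# Toy daily totals, invented for illustration
observed <- c(8000, 10000, 12000, 9000, 11000)

# Pretend 2 whole days were missing and each was filled with the observed mean
filled <- c(observed, rep(mean(observed), 2))

mean(observed)                      # mean before filling
mean(filled)                        # mean after filling: unchanged
length(filled) - length(observed)   # extra observations, all sitting at the mean
```

With rounded interval averages the filled total can differ slightly from the exact mean, so in practice the mean and median may shift by a few steps.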

Are there differences in activity patterns between weekdays and weekends?

Create a new factor variable in the dataset with two levels, "weekday" and "weekend", indicating whether a given date is a weekday or weekend day.

newData$date <- as.Date(strptime(newData$date, format="%Y-%m-%d")) # convert date to a date() class variable  
## Warning in strptime(newData$date, format = "%Y-%m-%d"): unknown timezone
## 'zone/tz/2017c.1.0/zoneinfo/America/Sao_Paulo'
newData$day <- weekdays(newData$date)                              # name of the day of the week (depends on locale)  
newData$day <- ifelse(newData$day %in% c("Saturday","Sunday"),     # if Saturday or Sunday,
                      "weekend",                                   #   then 'weekend'
                      "weekday")                                   #   else 'weekday'
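Since weekdays() returns day names in the language of the current session, matching on "Saturday" and "Sunday" only works in an English locale. A sketch that sidesteps the names, assuming nothing beyond base R: as.POSIXlt() exposes the day of the week as a number in its wday component (0 = Sunday through 6 = Saturday). The dates, wday and dayType names are made up for illustration.

```r
# Dates invented for illustration; 2012-10-06 and 2012-10-07 fall on a weekend
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is the day of the week as a number: 0 = Sunday, 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Build a two-level factor without relying on locale-specific day names
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))
dayType
```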

Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). The plot should look something like the following, which was created using simulated data:

## aggregate newData by steps as a function of interval + day  
stepsByDay <- aggregate(newData$steps ~ newData$interval + newData$day, newData, mean)

## reset the column names to be pretty & clean
names(stepsByDay) <- c("interval", "day", "steps")

## plot weekday over weekend time series
par(mfrow=c(1,1))  
with(stepsByDay, plot(steps ~ interval, type="n", main="Weekday vs. Weekend Avg."))  
with(stepsByDay[stepsByDay$day == "weekday",], lines(steps ~ interval, type="l", col="chocolate"))  
with(stepsByDay[stepsByDay$day == "weekend",], lines(steps ~ interval, type="l", col="16" ))  
legend("topright", lty=c(1,1), col = c("chocolate", "16"), legend = c("weekday", "weekend"), seg.len=3)

It looks like this person has a day job & does most of his or her walking on the weekends!
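As a side note on the aggregation step above: when the formula uses bare column names together with a data argument, aggregate() should name the result columns after those variables, so the renaming step becomes unnecessary. A sketch on made-up numbers; toy, avgByDay and pick are illustrative names.

```r
# Toy data, invented for illustration: two intervals, two weekday and two weekend days
toy <- data.frame(interval = rep(c(0, 5), times = 4),
                  day      = rep(c("weekday", "weekend"), each = 4),
                  steps    = c(2, 10, 4, 20, 6, 0, 10, 4))

# Mean steps for every interval + day combination; columns keep their names
avgByDay <- aggregate(steps ~ interval + day, data = toy, FUN = mean)
avgByDay

# Small helper (hypothetical) to read one value back out of the result
pick <- function(i, d) avgByDay$steps[avgByDay$interval == i & avgByDay$day == d]
pick(0, "weekday")
```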

References

  • “Exploratory Data Analysis”, by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/exdata
  • “Reproducible Research”, by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/course/repdata
  • “Data Science Specialization”, by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://www.coursera.org/specialization/jhudatascience/1
  • “Imputation in R - Stack Overflow”. http://stackoverflow.com/questions/13114812/imputation-in-r
  • “Peer Assessments / Peer Assessment 1”, Reproducible Research, by Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD, Coursera. July 11, 2014. https://class.coursera.org/repdata-004/human_grading/view/courses/972143/assessments/3/submissions