PA1 Reproducible Research, Playing with Slidify

Roberto Aviles

Peer Assessment Project 1

Coursera Data Science: Reproducible Research --- June 2015

Introduction

This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.

Data

The data for this assignment can be downloaded from the course web site: Activity monitoring data [52K]

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and contains a total of 17,568 observations.

Slide 2

  1. First, we load and preprocess the data. The activity.zip file sits in the working directory, next to PA1_template.Rmd.
library(ggplot2)
# unzip() extracts activity.csv and returns its path, which read.csv() then reads
act <- read.csv(unzip("activity.zip"))
  2. Format the dates to the appropriate type
# Convert the date column from text to Date
act$date <- as.Date(act$date, format = "%Y-%m-%d")
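The load-and-convert step can be tried without the real file. The following is a minimal sketch that assumes the three-column layout described in the Data section; the CSV text and the name `toy` are made up for illustration and are not part of the assignment data.

```r
# Made-up CSV text standing in for activity.csv (illustrative values only).
csv_text <- "steps,date,interval
NA,2012-10-01,0
NA,2012-10-01,5
12,2012-10-02,0
30,2012-10-02,5"

# Read with explicit column types so the date arrives as plain text.
toy <- read.csv(text = csv_text,
                colClasses = c("integer", "character", "integer"))

# Convert the date strings to Date objects, as in the step above.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Inspect the resulting column types.
str(toy)
```

Passing `colClasses` is a design choice rather than a requirement: it makes the column types explicit instead of relying on what `read.csv` infers.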

--- .class #3

Slide 3

  1. From the original data, aggregate the total steps per date and per interval, and name the columns
# Total steps per day; days with only missing values stay NA
act.day <- aggregate(act$steps, by=list(act$date), sum)
# Total steps per interval, skipping missing values
act.interval <- aggregate(act$steps, by=list(act$interval), sum, na.rm=TRUE)
names(act.day)[2] <- "steps"
names(act.day)[1] <- "date"
names(act.interval)[2] <- "steps"
names(act.interval)[1] <- "interval"
  2. Now, from the original data, aggregate the mean number of steps per interval and name the columns
# Mean steps per interval across all days, ignoring missing values
act.m.interval <- aggregate(act$steps, by=list(act$interval), mean, na.rm=TRUE)
names(act.m.interval)[1] <- "interval"
names(act.m.interval)[2] <- "mean.steps"
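As a point of comparison, the formula interface of `aggregate` names the result columns directly, but it also drops rows with missing values before grouping. The sketch below uses a made-up data frame (`toy`, not the assignment data) to show how the two calling styles differ on a day that has no measurements.

```r
# Made-up data: the first day has no measurements at all.
toy <- data.frame(
  steps    = c(NA, NA, 12, 30, 8, 10),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Formula interface: columns are named automatically, NA rows are dropped first.
per_interval <- aggregate(steps ~ interval, data = toy, FUN = mean)
per_day_formula <- aggregate(steps ~ date, data = toy, FUN = sum)

# by-list interface: the all-NA day is kept, with NA as its total.
per_day_list <- aggregate(toy$steps, by = list(date = toy$date), FUN = sum)

per_interval
per_day_formula
per_day_list
```

With the formula interface the empty day disappears from the result, while the by-list call keeps it as a row with `NA`.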

--- .class #4

Slide 4

First Question: What is the mean total number of steps taken per day? We'll calculate both the MEAN and the MEDIAN:

# Days with no recorded steps are NA in act.day, so drop them with na.rm
mean(act.day$steps, na.rm = TRUE)
median(act.day$steps, na.rm = TRUE)

Note that the summary command also shows the number of NA values in the set

summary(act.day$steps)
  • And the requested histogram:
hist(act.day$steps, col = "lavender", main = "Histogram of Total Number of Steps per Day",
     xlab = "Total Number of Steps per Day")
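Where the missing values are removed matters for the mean and median computed above. A small sketch on made-up numbers (not the assignment data): summing with `na.rm = TRUE` turns a day without measurements into a total of 0, which pulls the mean down, whereas leaving that day as `NA` and dropping it afterwards does not.

```r
# Made-up data: the first day has no measurements at all.
toy <- data.frame(
  steps = c(NA, NA, 12, 30, 8, 10),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2))
)

# Daily totals that keep the empty day as NA.
daily_na <- tapply(toy$steps, toy$date, sum)

# Daily totals that silently turn the empty day into 0.
daily_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(daily_na, na.rm = TRUE)  # averages over the two observed days
mean(daily_zero)              # averages over three days, one of them 0
```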

Second Question: What is the average daily activity pattern? Specifically:

  1. Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
  2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

Something slightly different (I prefer circles around the data points to plain lines)

# Re-read the raw CSV extracted by unzip() earlier
data <- read.csv("activity.csv")
# The formula interface drops rows with missing steps before averaging
stepsInInterval <- aggregate(steps ~ interval, data, mean)
plot(stepsInInterval$interval, stepsInInterval$steps, type='o', col='blue', main="Average Number of Steps per Interval", xlab="Interval", ylab="Average of Steps in the Interval")

Now we want to find which 5-minute interval in the dataset contains the maximum average number of steps.
(Note that the answer points exactly to the sudden peak in the previous plot: the 5-minute interval 835)

# Interval identifier in the row with the highest mean
act.m.interval[which.max(act.m.interval$mean.steps), 1]
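The lookup above can be illustrated on a tiny table. This sketch uses made-up means (only the identifier 835 comes from the text) and assumes the interval identifier encodes the clock time as hours and minutes, so that 835 reads as 08:35; `interval_means`, `peak` and `peak_label` are illustrative names.

```r
# Made-up per-interval means; only the identifier 835 is taken from the text.
interval_means <- data.frame(
  interval   = c(830, 835, 840),
  mean.steps = c(50, 90, 70)
)

# which.max() returns the row index of the largest mean.
peak <- interval_means$interval[which.max(interval_means$mean.steps)]

# Assumption: the identifier is HHMM, so split it into hours and minutes.
peak_label <- sprintf("%02d:%02d", peak %/% 100, peak %% 100)

peak
peak_label
```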
Now: "The presence of missing days may introduce bias into some calculations or summaries of the data"

Third Question: Imputing missing values

How many NA values are in the set?

# TRUE counts the rows with missing steps
table(is.na(data$steps))

To correct this, let's merge the data with act.m.interval and replace each missing (NA) value with the MEAN for its interval, then create a 'new' set with NO NA values

# Attach the per-interval mean to every row
act.lost <- merge(act, act.m.interval, by = "interval", sort= FALSE)
# Fill each missing step count with the mean for its interval
act.lost$steps[is.na(act.lost$steps)] <- act.lost$mean.steps[is.na(act.lost$steps)]
# Keep steps, date and interval, in the original column order
act.nona <- act.lost[, c(2,3,1)]
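The same fill-in can be done without `merge`, which reorders the rows. The following is an alternative sketch, not the method used above: it looks up each missing row's interval mean with `match()` and leaves the row order untouched. The data frame and the names `toy`, `filled` and `missing` are made up for illustration.

```r
# Made-up data: the first day has no measurements at all.
toy <- data.frame(
  steps    = c(NA, NA, 12, 30, 8, 10),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Mean steps per interval, computed from the observed rows only.
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Copy the data and fill each NA with the mean of its interval.
filled <- toy
missing <- is.na(filled$steps)
lookup <- match(filled$interval[missing], interval_means$interval)
filled$steps[missing] <- interval_means$steps[lookup]

filled
```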

Before going any further, compare the new and old sets of data. Create a new dataset with the total steps per day

# Total steps per day after filling in the missing values
act.day.new <- aggregate(act.nona$steps, by=list(act.nona$date), sum)
names(act.day.new)[1] <-"day"
names(act.day.new)[2] <-"steps"

And now plot the new 'corrected' histogram

hist(act.day.new$steps, col = "blue", main = "Total Number of Steps per Day (*without* NA values)", xlab = "Total Steps")

Looking at the histograms, it is hard to tell the difference; let's compare using the MEAN & MEDIAN:

mean(act.day.new$steps)
median(act.day.new$steps)

The MEAN values with and without NA data are the same, but the original MEDIAN was slightly smaller than the 'corrected' value
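One way to see why the mean does not move: when a whole day is missing and every interval is filled with its interval mean, that day's total equals the mean daily total of the observed days. The sketch below checks this on made-up numbers (not the assignment data) and assumes the missing values cover complete days; the median can still shift because new days are added at the mean.

```r
# Made-up data: four days, two intervals each; the first day is entirely missing.
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:3, each = 2),
  interval = rep(c(0, 5), times = 4),
  steps    = c(NA, NA, 4, 6, 5, 15, 21, 39)
)

# Per-interval means from the observed days.
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill the missing rows with the mean of their interval.
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- unname(interval_means[as.character(filled$interval[missing])])

# Daily totals before and after filling.
before <- tapply(toy$steps, toy$date, sum)
after  <- tapply(filled$steps, filled$date, sum)

c(mean(before, na.rm = TRUE), mean(after))      # the mean is unchanged
c(median(before, na.rm = TRUE), median(after))  # the median moves toward the mean
```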

Fourth Question: Are there differences in activity patterns between weekdays and weekends?

First we need to separate our set into 'weekdays' and 'weekend' days.
Then we add a new column with this new datum: wDay (weekday or weekend day).

# wday is 0 for Sunday and 6 for Saturday
act.nona$wDay <- ifelse(as.POSIXlt(act.nona$date)$wday %in% c(0,6), 'weekend', 'weekday')
# Mean steps per interval, separately for weekdays and weekends
adi <- aggregate(steps ~ interval + wDay, data=act.nona, mean)
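Using the numeric `wday` field has a practical advantage over `weekdays()`: day names depend on the session locale, while `wday` is always 0 (Sunday) to 6 (Saturday). A short sketch on four dates from October 2012, chosen here only for illustration:

```r
# Friday, Saturday, Sunday and Monday of the first week of October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of the week: 0 = Sunday, ..., 6 = Saturday, in any locale.
wday <- as.POSIXlt(dates)$wday

# Same classification rule as above, stored as a factor with a fixed level order.
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

data.frame(date = dates, wday = wday, day_type = day_type)
```

Storing the result as a factor with explicit levels is optional; it only fixes the order in which the two panels would appear in a faceted plot.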

Now it is possible to use, again, a time series plot with 'interval' on the X-axis and
the average number of steps per interval on the Y-axis, and compare the activity
of weekdays versus weekend days.

# One panel per day type, stacked vertically
ggplot(adi, aes(interval, steps)) + 
    geom_line() + 
    facet_grid(wDay ~ .) +
    xlab("5-minute Interval") + 
    ylab("Average Number of Steps")