Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

First we download the data file and unzip it.

urlfile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
download.file(urlfile, destfile = "activity.zip", method = "curl")
unzip("activity.zip", exdir = "activity")

Then we set the working directory and read the “activity.csv” file into the “actividad” data frame. We also create a variable, “dia”, that holds the weekday name of each date. The names appear in Spanish because weekdays() follows the session locale.

setwd("activity")
actividad <- read.csv("activity.csv")
actividad$dia <- weekdays(as.Date(actividad$date))
head(x = actividad)

##   steps       date interval   dia
## 1    NA 2012-10-01        0 lunes
## 2    NA 2012-10-01        5 lunes
## 3    NA 2012-10-01       10 lunes
## 4    NA 2012-10-01       15 lunes
## 5    NA 2012-10-01       20 lunes
## 6    NA 2012-10-01       25 lunes
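The `setwd()` call can be avoided by building the path with `file.path()`, which leaves the working directory unchanged for later code. The sketch below is a minimal illustration on a toy CSV written to a temporary directory. The file contents and the `toy` and `toy_dir` names are made up for the example, and `colClasses` assumes the three columns are an integer, a date and an integer, as the `head()` output above suggests.

```r
# Toy stand-in for activity/activity.csv (values are hypothetical)
toy_dir <- file.path(tempdir(), "activity")
dir.create(toy_dir, showWarnings = FALSE)
writeLines(
    c('"steps","date","interval"',
      'NA,"2012-10-01",0',
      '12,"2012-10-02",5'),
    file.path(toy_dir, "activity.csv")
)

# Read via a full path instead of setwd(); parse the date column as Date
toy <- read.csv(file.path(toy_dir, "activity.csv"),
                colClasses = c("integer", "Date", "integer"))
str(toy)
```

Parsing `date` as a `Date` on read also removes the need for `as.Date()` in later steps.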

What is mean total number of steps taken per day?

We calculate the total number of steps taken on each day of the week, ignoring missing values.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

resumen <- actividad %>%
    group_by(dia) %>%
    summarise(Total = sum(steps, na.rm = TRUE))
resumen

## # A tibble: 7 × 2
##   dia       Total
##   <chr>     <int>
## 1 domingo   85944
## 2 jueves    65702
## 3 lunes     69824
## 4 martes    80546
## 5 miércoles 94326
## 6 sábado    87748
## 7 viernes   86518
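The table above is grouped by day of the week, so each total pools every date that falls on that weekday. A total per calendar day would group by `date` instead. The following is a minimal base R sketch on a made-up `toy` data frame (the values are hypothetical, not taken from the dataset). It also shows one way to obtain the mean and median of those daily totals.

```r
toy <- data.frame(
    steps    = c(NA, NA, 10, 20, 0, 5),
    date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
    interval = rep(c(0, 5), times = 3)
)

# The formula interface drops NA rows, so a day with no observations
# (2012-10-01 here) is left out rather than reported as 0
daily <- aggregate(steps ~ date, data = toy, FUN = sum)
daily

# Mean and median of the daily totals
mean(daily$steps)
median(daily$steps)
```

By contrast, `sum(steps, na.rm = TRUE)` returns 0 for a day with no observations, which would pull the mean and median of the daily totals down.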

Then we make a histogram of the number of steps per 5-minute interval, with one panel for each day of the week.

library(ggplot2)
ggplot(actividad, aes(x = steps)) +
    geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
    facet_wrap(~dia) + # One panel per day of the week
    labs(title = "Histogram of Steps by Day of the Week",
         x = "Number of steps",
         y = "Frequency") +
    theme_minimal()

## Warning: Removed 2304 rows containing non-finite outside the scale range
## (`stat_bin()`).

Next we calculate the mean and median number of steps per 5-minute interval for each day of the week. The mean for each day:

tapply(actividad$steps, actividad$dia, mean, na.rm = TRUE)

##   domingo    jueves     lunes    martes miércoles    sábado   viernes 
##  42.63095  28.51649  34.63492  31.07485  40.94010  43.52579  42.91567

The median for each day:

tapply(actividad$steps, actividad$dia, median, na.rm = TRUE)

##   domingo    jueves     lunes    martes miércoles    sábado   viernes 
##         0         0         0         0         0         0         0
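A median of 0 alongside a positive mean indicates that at least half of the non-missing intervals in each group recorded zero steps. As a quick check of that reasoning, this sketch computes the share of zero-step intervals on a made-up vector (hypothetical values, not the dataset).

```r
# Hypothetical step counts for a handful of 5-minute intervals
steps <- c(0, 0, 0, 0, 12, 47, NA, 0, 150, 0)

# Share of non-missing intervals with no steps at all
zero_share <- mean(steps == 0, na.rm = TRUE)
zero_share

# With more than half of the values at zero, the median is 0
# even though the mean is well above it
median(steps, na.rm = TRUE)
mean(steps, na.rm = TRUE)
```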

What is the average daily activity pattern?

We make a time series plot of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all days (y-axis).

actividad %>%
    group_by(interval) %>%
    summarise(pasos = mean(steps, na.rm = TRUE)) %>%
    ggplot(aes(x=interval, y=pasos)) +
    geom_line(col="darkblue", lwd=1) +
    labs(title = "Average daily activity pattern", x = "5-minute interval",
         y = "# of steps")

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

resumen <- actividad %>%
    group_by(interval) %>%
    summarise(pasos = mean(steps, na.rm = TRUE))
head(resumen[order(-resumen$pasos, na.last = NA),])

## # A tibble: 6 × 2
##   interval pasos
##      <int> <dbl>
## 1      835  206.
## 2      840  196.
## 3      850  183.
## 4      845  180.
## 5      830  177.
## 6      820  171.
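In the sorted table above, interval 835 comes first with roughly 206 steps on average. As an alternative to sorting, `which.max()` returns the position of the maximum directly. A minimal sketch on a made-up summary table (the `toy_resumen` name and its values are hypothetical):

```r
# Hypothetical per-interval averages
toy_resumen <- data.frame(
    interval = c(0, 5, 10, 15),
    pasos    = c(1.5, 7.2, 3.1, 0.4)
)

# Row with the highest average number of steps
peak <- toy_resumen[which.max(toy_resumen$pasos), ]
peak$interval
```

`which.max()` returns only the first maximum, so ties would need a comparison such as `pasos == max(pasos)` instead.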

Imputing missing values

We count the missing values (NAs) in the steps column.

table(is.na(actividad$steps))

## 
## FALSE  TRUE 
## 15264  2304
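The table shows 2304 missing values out of 15264 + 2304 = 17568 rows, or about 13% of the records. The count and the share can also be computed directly, and every column can be checked at once. A short sketch on a made-up data frame (hypothetical values):

```r
toy <- data.frame(
    steps    = c(NA, 3, NA, 8),
    date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
    interval = c(0, 5, 0, 5)
)

n_missing     <- sum(is.na(toy$steps))   # number of missing step counts
share_missing <- mean(is.na(toy$steps))  # share of rows with a missing step count
per_column    <- colSums(is.na(toy))     # missing values in every column

n_missing
share_missing
per_column
```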

Our strategy for filling in the missing values is to replace each NA with the mean number of steps for that 5-minute interval, averaged across all days.

mean_interval <- actividad %>%
    group_by(interval) %>%
    summarise(mean_steps = mean(steps, na.rm = TRUE))

final <- actividad %>%
    left_join(mean_interval, by = "interval") %>%
    mutate(steps = ifelse(is.na(steps), mean_steps, steps)) %>%
    select(-mean_steps) 
head(final)

##       steps       date interval   dia
## 1 1.7169811 2012-10-01        0 lunes
## 2 0.3396226 2012-10-01        5 lunes
## 3 0.1320755 2012-10-01       10 lunes
## 4 0.1509434 2012-10-01       15 lunes
## 5 0.0754717 2012-10-01       20 lunes
## 6 2.0943396 2012-10-01       25 lunes
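The same interval-mean imputation can be written in base R with `ave()`, which applies a function within each group and returns a vector aligned with the input. This is only an alternative sketch on made-up data (hypothetical values, with a hypothetical `fill_with_mean` helper), not the approach used above.

```r
toy <- data.frame(
    steps    = c(NA, 4, 2, 6, NA, 8),
    interval = c(0, 5, 0, 5, 0, 5)
)

# Replace the NAs in one group with that group's mean
fill_with_mean <- function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
}

# ave() runs fill_with_mean separately for each interval
toy$steps_filled <- ave(toy$steps, toy$interval, FUN = fill_with_mean)
toy
```

Keeping the filled values in a separate column makes it easy to confirm that the observed values were left untouched.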

We then make a histogram of the steps per 5-minute interval after imputation, again with one panel for each day of the week.

final$dia <- weekdays(as.Date(final$date))
ggplot(final, aes(x = steps)) +
    geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
    facet_wrap(~dia) + # One panel per day of the week
    labs(title = "Histogram of Steps by Day of the Week",
         x = "Number of steps",
         y = "Frequency") +
    theme_minimal()
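The histogram above is built from per-interval values. To report the mean and median of the total steps per calendar day after imputation, the filled data can be summed by `date` first. A minimal base R sketch on made-up data (the `toy` values are hypothetical) that compares the summaries before and after filling:

```r
toy <- data.frame(
    steps    = c(NA, NA, 10, 20, 0, 6),
    date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
    interval = rep(c(0, 5), times = 3)
)

# Fill each NA with the mean of its interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
filled <- transform(toy, steps = ifelse(is.na(steps), interval_mean, steps))

# Daily totals; a day with only NAs stays NA in the raw data
totals_raw    <- tapply(toy$steps, toy$date, sum)
totals_filled <- tapply(filled$steps, filled$date, sum)

summary_raw    <- c(mean   = mean(totals_raw, na.rm = TRUE),
                    median = median(totals_raw, na.rm = TRUE))
summary_filled <- c(mean   = mean(totals_filled),
                    median = median(totals_filled))
rbind(raw = summary_raw, filled = summary_filled)
```

With this strategy, a day with no observations at all receives the sum of the interval means as its total.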

Are there differences in activity patterns between weekdays and weekends?

To answer this question, we create a new variable with two levels, “Weekday” and “Weekend”, indicating whether a given date is a weekday or a weekend day. We then make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

actividad %>%
    mutate(typeday=as.factor(ifelse(dia %in% c("lunes", "martes", "miércoles", "jueves", "viernes"), 
                          "Weekday", "Weekend"))) %>%
    group_by(interval, typeday) %>%
    summarise(pasos = mean(steps, na.rm = TRUE)) %>%
    ggplot(aes(x=interval, y=pasos)) +
    geom_line(col="darkblue", lwd=1) +
    facet_wrap(~typeday, nrow = 2) +
    labs(title = "Average daily activity pattern", x = "5-minute interval",
         y = "# of steps")

## `summarise()` has grouped output by 'interval'. You can override using the
## `.groups` argument.
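The weekday names in the code above are Spanish because `weekdays()` follows the session locale, so the same comparison would label every day as "Weekend" in an English locale. A locale-independent sketch, assuming the dates parse with `as.Date()`, uses the numeric weekday from `as.POSIXlt()` (0 is Sunday, 6 is Saturday). The `is_weekend` helper name is hypothetical.

```r
# TRUE for Saturdays and Sundays, regardless of the session locale
is_weekend <- function(date) {
    wday <- as.POSIXlt(as.Date(date))$wday  # 0 = Sunday, ..., 6 = Saturday
    wday %in% c(0, 6)
}

# 2012-10-01 is a Monday, so these are Mon, Fri, Sat, Sun
dates <- c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07")
typeday <- factor(ifelse(is_weekend(dates), "Weekend", "Weekday"))
data.frame(date = dates, typeday)
```

The resulting factor can replace the name-based `ifelse()` inside `mutate()` without changing the rest of the pipeline.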