Introduction

This project makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data cover two months from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.

Data

The data for this assignment were downloaded from the course web site:

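Dataset: Activity monitoring data [52K]

The variables included in this dataset are:

steps: number of steps taken in a 5-minute interval (missing values are coded as NA)
date: the date on which the measurement was taken, in YYYY-MM-DD format
interval: identifier of the 5-minute interval in which the measurement was taken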

The dataset is stored in a comma-separated-value (CSV) file activity.csv and there are a total of 17,568 observations in this dataset.

1. Code for reading in the dataset and/or processing the data

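Load the packages used throughout the report (dplyr and ggplot2), then read the personal activity monitoring data into pam.

library(dplyr)
library(ggplot2)
pam <- read.csv("activity.csv")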
str(pam)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

Because date is stored as character, we convert it to Date.

pam <- mutate(pam, "date" = as.Date(date, "%Y-%m-%d"))
str(pam)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
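
As a small self-contained check of this conversion, the sketch below applies as.Date() to a hand-made character vector (the values are illustrative and not read from activity.csv) and confirms that date arithmetic works afterwards.

```r
# Illustrative character dates in the same YYYY-MM-DD layout as the dataset
raw_dates <- c("2012-10-01", "2012-10-02", "2012-11-30")

# Convert to Date; the format string matches the layout above
dates <- as.Date(raw_dates, format = "%Y-%m-%d")

class(dates)         # the vector is now of class Date
diff(range(dates))   # time difference between the first and last date
```

Once the column is a Date, functions such as weekdays() and range() can be applied to it directly.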

2. Histogram of the total number of steps taken each day

Group the data by date, omitting NA values.

by_date <- group_by(pam[!is.na(pam$steps), ], date)
sumsteps <- summarise(by_date, "sum" = sum(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)

Plot the daily totals from the previous table as a bar chart with one bar per date.

f <- ggplot(sumsteps, aes(date, sum))
print(f + geom_bar(stat = "identity") + ggtitle("Number of steps per day") + ylab("steps"))
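
The plot above draws one bar per date, which is strictly a bar chart of daily totals. If a histogram in the usual sense is wanted (the distribution of daily totals, binned), a minimal sketch with base R hist() could look like the following. The daily totals are invented values standing in for sumsteps$sum, the 5000-step bin width is an arbitrary choice, and plot = FALSE is used only so the bin counts can be inspected without a graphics device.

```r
# Hypothetical daily totals standing in for sumsteps$sum
daily_totals <- c(100, 9900, 10300, 11000, 11400, 12100, 12800, 13300, 15400, 17400)

# Bin the totals into 5000-step bins; plot = FALSE returns the bins instead of drawing
h <- hist(daily_totals, breaks = seq(0, 20000, by = 5000), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of days falling in each bin
```

With ggplot2, the corresponding layer would be geom_histogram() applied to the sum column alone.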

3. Mean and median number of steps taken each day

Generate a new table with the mean and median number of steps per 5-minute interval for each day. Note that the median is 0 on every day with data.

mmsteps <- summarise(by_date, "mean" = mean(steps, na.rm = FALSE), "median" = median(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)
print(as.data.frame(mmsteps))
##          date       mean median
## 1  2012-10-02  0.4375000      0
## 2  2012-10-03 39.4166667      0
## 3  2012-10-04 42.0694444      0
## 4  2012-10-05 46.1597222      0
## 5  2012-10-06 53.5416667      0
## 6  2012-10-07 38.2465278      0
## 7  2012-10-09 44.4826389      0
## 8  2012-10-10 34.3750000      0
## 9  2012-10-11 35.7777778      0
## 10 2012-10-12 60.3541667      0
## 11 2012-10-13 43.1458333      0
## 12 2012-10-14 52.4236111      0
## 13 2012-10-15 35.2048611      0
## 14 2012-10-16 52.3750000      0
## 15 2012-10-17 46.7083333      0
## 16 2012-10-18 34.9166667      0
## 17 2012-10-19 41.0729167      0
## 18 2012-10-20 36.0937500      0
## 19 2012-10-21 30.6284722      0
## 20 2012-10-22 46.7361111      0
## 21 2012-10-23 30.9652778      0
## 22 2012-10-24 29.0104167      0
## 23 2012-10-25  8.6527778      0
## 24 2012-10-26 23.5347222      0
## 25 2012-10-27 35.1354167      0
## 26 2012-10-28 39.7847222      0
## 27 2012-10-29 17.4236111      0
## 28 2012-10-30 34.0937500      0
## 29 2012-10-31 53.5208333      0
## 30 2012-11-02 36.8055556      0
## 31 2012-11-03 36.7048611      0
## 32 2012-11-05 36.2465278      0
## 33 2012-11-06 28.9375000      0
## 34 2012-11-07 44.7326389      0
## 35 2012-11-08 11.1770833      0
## 36 2012-11-11 43.7777778      0
## 37 2012-11-12 37.3784722      0
## 38 2012-11-13 25.4722222      0
## 39 2012-11-15  0.1423611      0
## 40 2012-11-16 18.8923611      0
## 41 2012-11-17 49.7881944      0
## 42 2012-11-18 52.4652778      0
## 43 2012-11-19 30.6979167      0
## 44 2012-11-20 15.5277778      0
## 45 2012-11-21 44.3993056      0
## 46 2012-11-22 70.9270833      0
## 47 2012-11-23 73.5902778      0
## 48 2012-11-24 50.2708333      0
## 49 2012-11-25 41.0902778      0
## 50 2012-11-26 38.7569444      0
## 51 2012-11-27 47.3819444      0
## 52 2012-11-28 35.3576389      0
## 53 2012-11-29 24.4687500      0
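
To see why a day can have a positive mean but a median of 0, consider a toy vector of interval counts in which most intervals are idle. The numbers are invented for illustration.

```r
# Invented step counts for ten 5-minute intervals; six of them are idle
steps <- c(0, 0, 0, 0, 0, 0, 12, 40, 85, 163)

mean(steps)     # pulled upward by the few active intervals
median(steps)   # 0, because more than half of the values are 0
```

The same pattern holds for a real day whenever more than half of its 288 intervals record no steps.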

4. Time series plot of the average number of steps taken

Group the data by interval and compute the average number of steps in each interval across all days.

by_timing <- group_by(pam[!is.na(pam$steps), ], interval)
meansteps <- summarise(by_timing, "steps" = mean(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
plot(meansteps$interval, meansteps$steps, type = "l", xlab = "interval", ylab = "steps", main = "Average of steps all day")
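
The same per-interval averaging can be sketched in base R with tapply(), shown here on a tiny invented data frame of three days and three intervals so the result can be checked by hand. The data frame toy is hypothetical and is not part of the report's data.

```r
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(0, 10, 20,  NA, NA, NA,  4, 30, 40)
)

# Drop missing rows, then average steps within each interval across days
ok  <- toy[!is.na(toy$steps), ]
avg <- tapply(ok$steps, ok$interval, mean)
avg
```

Day d2 is missing entirely, so each interval average is taken over the two remaining days.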

5. The 5-minute interval that, on average, contains the maximum number of steps

Filter for the interval with the maximum average number of steps.

five <- filter(meansteps, steps == max(steps))
print(five)
## # A tibble: 1 x 2
##   interval steps
##      <int> <dbl>
## 1      835  206.

Interval 835 has the highest average across all days, with 206.1698113 steps.
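
An equivalent lookup uses which.max() on the averaged column. The sketch below also converts the interval identifier to a clock time under the assumption that identifiers encode hour and minute as HHMM (so 835 would mean 08:35). The report does not state this encoding, so treat that step as an assumption. The data frame avg_toy and its step values are invented.

```r
# Invented averages for three neighbouring intervals
avg_toy <- data.frame(interval = c(830, 835, 840), steps = c(150, 206, 180))

# Row with the largest average
peak <- avg_toy[which.max(avg_toy$steps), ]

# Assumed HHMM encoding: integer-divide for the hour, remainder for the minute
clock <- sprintf("%02d:%02d", peak$interval %/% 100, peak$interval %% 100)
clock
```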

6. Code to describe and show a strategy for imputing missing data

First, we count the NA values in the table.

missing <- sum(is.na(pam))
print(missing)
## [1] 2304

The table contains 2304 missing values in total.

Next, we check which days have missing data.

fillnas <- pam %>%
        filter(is.na(steps)) %>%
        with(table(date)) %>%
        print()
## date
## 2012-10-01 2012-10-08 2012-11-01 2012-11-04 2012-11-09 2012-11-10 2012-11-14 
##        288        288        288        288        288        288        288 
## 2012-11-30 
##        288

Each of these 8 dates has 288 NA values, and 8 × 288 = 2304, so the missing data consist of 8 entire days. Having located the missing values, we can now fill them in.
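
A quick way to confirm that missing values come in whole days is to count NAs per date and compare the count against the number of intervals per day. The sketch uses an invented data frame with 3 intervals per day instead of 288; the names toy, na_per_date and intervals_per_day are hypothetical.

```r
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(NA, NA, NA,  1, 2, 3,  NA, NA, NA)
)

# Number of missing values on each date
na_per_date <- tapply(is.na(toy$steps), toy$date, sum)
na_per_date

# Dates that are missing in full: NA count equals the number of intervals per day
intervals_per_day <- length(unique(toy$interval))
full_na_dates <- names(na_per_date)[na_per_date == intervals_per_day]
full_na_dates
```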

We now replace each NA with the mean for its interval, as computed in question 4 and rounded to the nearest whole step.

newpam <- pam
for (i in 1:nrow(newpam)){
        if(is.na(newpam$steps[i])){
                j <- i + 287
                newpam$steps[i:j] <- round(meansteps$steps)
        }
}
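
The loop works here because every run of NAs covers a full day of 288 intervals. A sketch of an alternative that does not depend on that, looking up each missing row's interval mean with match(), is shown below on invented data. The data frames toy and toy_means are hypothetical stand-ins for pam and meansteps, and the means are made up rather than computed.

```r
toy <- data.frame(
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(NA, 4, NA,  2, 6, 9)
)
toy_means <- data.frame(interval = c(0, 5, 10), steps = c(2.4, 5, 8.6))

filled  <- toy
na_rows <- is.na(filled$steps)

# For each missing row, find the row of toy_means with the same interval
idx <- match(filled$interval[na_rows], toy_means$interval)
filled$steps[na_rows] <- round(toy_means$steps[idx])

filled$steps
```

Because the lookup is keyed on the interval value, it still fills correctly when only part of a day is missing.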

7. Histogram of the total number of steps taken each day after missing values are imputed

Plot the daily totals again, now with the missing values imputed.

by_date <- group_by(newpam, date)
sumsteps <- summarise(by_date, "sum" = sum(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
fn <- ggplot(sumsteps, aes(date, sum))
print(fn + geom_bar(stat = "identity") + ggtitle("Number of steps per day") + ylab("steps"))

Report the mean and median number of steps per interval for each day, now including the imputed values.

mmsteps <- summarise(by_date, "mean" = mean(steps, na.rm = FALSE), "median" = median(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)
print(as.data.frame(mmsteps))
##          date       mean median
## 1  2012-10-01 37.3680556   34.5
## 2  2012-10-02  0.4375000    0.0
## 3  2012-10-03 39.4166667    0.0
## 4  2012-10-04 42.0694444    0.0
## 5  2012-10-05 46.1597222    0.0
## 6  2012-10-06 53.5416667    0.0
## 7  2012-10-07 38.2465278    0.0
## 8  2012-10-08 37.3680556   34.5
## 9  2012-10-09 44.4826389    0.0
## 10 2012-10-10 34.3750000    0.0
## 11 2012-10-11 35.7777778    0.0
## 12 2012-10-12 60.3541667    0.0
## 13 2012-10-13 43.1458333    0.0
## 14 2012-10-14 52.4236111    0.0
## 15 2012-10-15 35.2048611    0.0
## 16 2012-10-16 52.3750000    0.0
## 17 2012-10-17 46.7083333    0.0
## 18 2012-10-18 34.9166667    0.0
## 19 2012-10-19 41.0729167    0.0
## 20 2012-10-20 36.0937500    0.0
## 21 2012-10-21 30.6284722    0.0
## 22 2012-10-22 46.7361111    0.0
## 23 2012-10-23 30.9652778    0.0
## 24 2012-10-24 29.0104167    0.0
## 25 2012-10-25  8.6527778    0.0
## 26 2012-10-26 23.5347222    0.0
## 27 2012-10-27 35.1354167    0.0
## 28 2012-10-28 39.7847222    0.0
## 29 2012-10-29 17.4236111    0.0
## 30 2012-10-30 34.0937500    0.0
## 31 2012-10-31 53.5208333    0.0
## 32 2012-11-01 37.3680556   34.5
## 33 2012-11-02 36.8055556    0.0
## 34 2012-11-03 36.7048611    0.0
## 35 2012-11-04 37.3680556   34.5
## 36 2012-11-05 36.2465278    0.0
## 37 2012-11-06 28.9375000    0.0
## 38 2012-11-07 44.7326389    0.0
## 39 2012-11-08 11.1770833    0.0
## 40 2012-11-09 37.3680556   34.5
## 41 2012-11-10 37.3680556   34.5
## 42 2012-11-11 43.7777778    0.0
## 43 2012-11-12 37.3784722    0.0
## 44 2012-11-13 25.4722222    0.0
## 45 2012-11-14 37.3680556   34.5
## 46 2012-11-15  0.1423611    0.0
## 47 2012-11-16 18.8923611    0.0
## 48 2012-11-17 49.7881944    0.0
## 49 2012-11-18 52.4652778    0.0
## 50 2012-11-19 30.6979167    0.0
## 51 2012-11-20 15.5277778    0.0
## 52 2012-11-21 44.3993056    0.0
## 53 2012-11-22 70.9270833    0.0
## 54 2012-11-23 73.5902778    0.0
## 55 2012-11-24 50.2708333    0.0
## 56 2012-11-25 41.0902778    0.0
## 57 2012-11-26 38.7569444    0.0
## 58 2012-11-27 47.3819444    0.0
## 59 2012-11-28 35.3576389    0.0
## 60 2012-11-29 24.4687500    0.0
## 61 2012-11-30 37.3680556   34.5

From this report we can conclude that the 8 dates with imputed data now have mean and median values (37.3680556 and 34.5 respectively), while the other dates are unchanged. The plot now also shows a total for each of those 8 days.
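
One way to condense the effect of imputation into a single pair of numbers is to compare the mean and median of the daily totals before and after filling. The sketch below does this on invented totals; imputed_total is a hypothetical value, and the figures are not results from this dataset.

```r
# Invented daily totals; NA marks a day with no recorded data
totals_before <- c(10000, NA, 12000, 8000, NA)

# Assume every missing day receives the same imputed total (hypothetical value)
imputed_total <- 10500
totals_after  <- ifelse(is.na(totals_before), imputed_total, totals_before)

before <- c(mean = mean(totals_before, na.rm = TRUE),
            median = median(totals_before, na.rm = TRUE))
after  <- c(mean = mean(totals_after), median = median(totals_after))

before
after
```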

8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

Add a new variable, tday, that records the type of day (weekday or weekend), then average the steps per interval for each type of day.

npam <- newpam %>%
        mutate(day = weekdays(date)) %>%
        mutate(tday = factor(1 * (day == "Sunday" | day == "Saturday"), labels = c("weekday", "weekend"))) %>%
        group_by(tday, interval) %>%
        summarise(steps = mean(steps)) %>% 
        print()
## `summarise()` regrouping output by 'tday' (override with `.groups` argument)
## # A tibble: 576 x 3
## # Groups:   tday [2]
##    tday    interval  steps
##    <fct>      <int>  <dbl>
##  1 weekday        0 2.29  
##  2 weekday        5 0.4   
##  3 weekday       10 0.156 
##  4 weekday       15 0.178 
##  5 weekday       20 0.0889
##  6 weekday       25 1.58  
##  7 weekday       30 0.756 
##  8 weekday       35 1.16  
##  9 weekday       40 0     
## 10 weekday       45 1.73  
## # ... with 566 more rows
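
Note that weekdays() returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A sketch of a locale-independent variant uses the wday field of as.POSIXlt(), which numbers the days 0 (Sunday) through 6 (Saturday). The dates below are chosen for illustration.

```r
d <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday through 6 for Saturday, independent of locale
wday <- as.POSIXlt(d)$wday
tday <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
               levels = c("weekday", "weekend"))
tday
```

Fixing the factor levels explicitly also keeps the labels correct if the data happen to contain only one type of day.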

Plot the average number of steps per interval, with one panel for weekdays and one for weekends.

t <- ggplot(npam, aes(interval, steps)) + geom_line()
print(t + ggtitle("Average number of steps") + facet_grid(tday ~ .))