This project makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.
The data for this assignment were downloaded from the course web site:
Dataset: Activity monitoring data [52K]
The variables included in this dataset are:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
The dataset is stored in a comma-separated-value (CSV) file activity.csv and there are a total of 17,568 observations in this dataset.
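The observation count can be cross-checked with a little arithmetic: a day contains 24 * 60 / 5 five-minute intervals, and October plus November have 31 + 30 days. A minimal sketch in base R (the variable names are illustrative and not part of the original analysis):

```r
# Number of 5-minute intervals in one day
intervals_per_day <- 24 * 60 / 5
# Days in October (31) plus days in November (30)
days <- 31 + 30
# Expected number of rows if every interval of every day has a row
expected_rows <- intervals_per_day * days
print(expected_rows)
```

If the result matches the 17,568 observations stated above, every interval of every day has a row in the file, including the ones whose step count is NA.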
Loading the Personal Activity Monitoring data as pam
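library(dplyr)    # mutate(), group_by(), summarise(), filter(), %>%
library(ggplot2)  # ggplot(), geom_bar(), geom_line(), facet_grid()
pam <- read.csv("activity.csv")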
str(pam)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Because date is stored as character, we need to convert it to Date.
pam <- mutate(pam, "date" = as.Date(date, "%Y-%m-%d"))
str(pam)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
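One reason to prefer the Date class over character strings is that date arithmetic then works directly. A small base R sketch, using only the first and last months named in the description (the variable names are illustrative):

```r
# Parse the first and last calendar days of the study period
period <- as.Date(c("2012-10-01", "2012-11-30"), format = "%Y-%m-%d")
# The parsed values are of class Date, not character
print(class(period))
# Subtracting Dates gives a difference in days; add 1 to count both endpoints
n_days <- as.integer(diff(period)) + 1
print(n_days)
```

The same subtraction on character strings would raise an error, which is why the conversion is done before any grouping by date.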
Grouping the data by date, omitting NA values.
by_date <- group_by(pam[!is.na(pam$steps), ], date)
sumsteps <- summarise(by_date, "sum" = sum(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)
Plotting the total number of steps per day from the previous table as a bar chart
f <- ggplot(sumsteps, aes(date, sum))  # rows with NA were already removed before grouping
print(f + geom_bar(stat = "identity") + ggtitle("Number of steps per day") + ylab("steps"))
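The plot above draws one bar per date, so it shows daily totals over time. If a histogram of the distribution of daily totals is wanted instead, a minimal base R sketch could look like this. The totals are made-up toy values, not the real data, and the 5000-step bin width is an arbitrary choice:

```r
# Toy daily totals (hypothetical values, for illustration only)
toy_totals <- c(100, 8000, 11000, 12000, 12500, 13000, 14000, 10500, 16000, 17000)
# Count how many days fall into each 5000-step class, without drawing
h <- hist(toy_totals, breaks = seq(0, 20000, by = 5000), plot = FALSE)
# One count per class: (0, 5000], (5000, 10000], (10000, 15000], (15000, 20000]
print(h$counts)
```

Calling plot(h) would then draw one bar per class of totals rather than one bar per date.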
Generating a new table with the mean and median number of steps per 5-minute interval for each day
mmsteps <- summarise(by_date, "mean" = mean(steps, na.rm = FALSE), "median" = median(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)
print(as.data.frame(mmsteps))
## date mean median
## 1 2012-10-02 0.4375000 0
## 2 2012-10-03 39.4166667 0
## 3 2012-10-04 42.0694444 0
## 4 2012-10-05 46.1597222 0
## 5 2012-10-06 53.5416667 0
## 6 2012-10-07 38.2465278 0
## 7 2012-10-09 44.4826389 0
## 8 2012-10-10 34.3750000 0
## 9 2012-10-11 35.7777778 0
## 10 2012-10-12 60.3541667 0
## 11 2012-10-13 43.1458333 0
## 12 2012-10-14 52.4236111 0
## 13 2012-10-15 35.2048611 0
## 14 2012-10-16 52.3750000 0
## 15 2012-10-17 46.7083333 0
## 16 2012-10-18 34.9166667 0
## 17 2012-10-19 41.0729167 0
## 18 2012-10-20 36.0937500 0
## 19 2012-10-21 30.6284722 0
## 20 2012-10-22 46.7361111 0
## 21 2012-10-23 30.9652778 0
## 22 2012-10-24 29.0104167 0
## 23 2012-10-25 8.6527778 0
## 24 2012-10-26 23.5347222 0
## 25 2012-10-27 35.1354167 0
## 26 2012-10-28 39.7847222 0
## 27 2012-10-29 17.4236111 0
## 28 2012-10-30 34.0937500 0
## 29 2012-10-31 53.5208333 0
## 30 2012-11-02 36.8055556 0
## 31 2012-11-03 36.7048611 0
## 32 2012-11-05 36.2465278 0
## 33 2012-11-06 28.9375000 0
## 34 2012-11-07 44.7326389 0
## 35 2012-11-08 11.1770833 0
## 36 2012-11-11 43.7777778 0
## 37 2012-11-12 37.3784722 0
## 38 2012-11-13 25.4722222 0
## 39 2012-11-15 0.1423611 0
## 40 2012-11-16 18.8923611 0
## 41 2012-11-17 49.7881944 0
## 42 2012-11-18 52.4652778 0
## 43 2012-11-19 30.6979167 0
## 44 2012-11-20 15.5277778 0
## 45 2012-11-21 44.3993056 0
## 46 2012-11-22 70.9270833 0
## 47 2012-11-23 73.5902778 0
## 48 2012-11-24 50.2708333 0
## 49 2012-11-25 41.0902778 0
## 50 2012-11-26 38.7569444 0
## 51 2012-11-27 47.3819444 0
## 52 2012-11-28 35.3576389 0
## 53 2012-11-29 24.4687500 0
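Every median in this table is 0 while every mean is positive. That pattern appears whenever at least half of a day's intervals record no steps: the middle value is 0, but a few active intervals pull the mean up. A toy illustration with made-up values:

```r
# Hypothetical step counts for eight intervals, most of them idle
toy_steps <- c(0, 0, 0, 0, 0, 12, 40, 108)
toy_mean <- mean(toy_steps)      # pulled up by the three active intervals
toy_median <- median(toy_steps)  # middle of the sorted values, which are zeros
print(c(mean = toy_mean, median = toy_median))
```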
With this code we group the data by interval and calculate the average number of steps in each interval across all days
by_timing <- group_by(pam[!is.na(pam$steps), ], interval)
meansteps <- summarise(by_timing, "steps" = mean(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
plot(meansteps$interval, meansteps$steps, type = "l", xlab = "interval", ylab = "steps", main = "Average of steps all day")
Code to find the interval with the maximum average number of steps
five <- filter(meansteps, steps == max(steps))
print(five)
## # A tibble: 1 x 2
## interval steps
## <int> <dbl>
## 1 835 206.
Interval 835 has the maximum average number of steps across all days: 206.1698113 steps.
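To read an interval identifier as a time of day, one option is to treat it as a clock time. The sketch below assumes that the identifier encodes hours and minutes as HHMM (so 835 would mean 08:35); this encoding is an assumption and is not stated in the dataset description above. The helper name interval_to_time is hypothetical:

```r
# Assumption: the identifier encodes clock time as HHMM (e.g. 835 = 08:35)
interval_to_time <- function(interval) {
  # Left-pad with zeros to four digits, then split into hours and minutes
  padded <- sprintf("%04d", as.integer(interval))
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}
times <- interval_to_time(c(0, 45, 835))
print(times)
```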
First, we need to count the NA values
missing <- sum(is.na(pam))
print(missing)
## [1] 2304
The total number of missing values (NAs) across the table is 2304.
Next, we check which days have missing data.
fillnas <- pam %>%
  filter(is.na(steps)) %>%
  with(table(date)) %>%
  print()
## date
## 2012-10-01 2012-10-08 2012-11-01 2012-11-04 2012-11-09 2012-11-10 2012-11-14
## 288 288 288 288 288 288 288
## 2012-11-30
## 288
There are 288 NA values on each of these dates, and multiplying by 8 gives 2304. That means there are 8 complete days of missing values. Having identified where the missing values are, we can fill them in during the next step.
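The claim that missing values come only as whole days can also be checked programmatically by comparing the NA count per day with the row count per day. A base R sketch on a tiny made-up data frame (three rows per day instead of 288; all names and values are hypothetical):

```r
# Toy data: days d1 and d3 are entirely missing, d2 is fully observed
toy <- data.frame(
  steps = c(NA, NA, NA, 0, 7, 3, NA, NA, NA),
  date = rep(c("d1", "d2", "d3"), each = 3)
)
# Number of NA values and number of rows for each day
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
rows_per_day <- tapply(toy$steps, toy$date, length)
# TRUE when every day is either fully observed or fully missing
whole_days_only <- all(na_per_day == 0 | na_per_day == rows_per_day)
print(whole_days_only)
# Number of fully missing days
print(sum(na_per_day == rows_per_day))
```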
Now we need to replace the NA values with the per-interval means obtained earlier (meansteps)
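newpam <- pam
# NAs occur only as complete days (288 consecutive rows in interval order),
# so the first NA of a day marks the start of a block to fill
for (i in seq_len(nrow(newpam))) {
  if (is.na(newpam$steps[i])) {
    j <- i + 287
    # Fill the whole day with the rounded per-interval means
    newpam$steps[i:j] <- round(meansteps$steps)
  }
}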
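The loop works here because every missing value belongs to a fully missing day whose rows are in interval order. A variant that does not depend on that layout looks up each missing row's interval in the table of means with match(). The sketch below uses small made-up stand-ins for meansteps and pam (toy_means and toy_pam are hypothetical names and values):

```r
# Toy stand-ins for the table of interval means and the raw data
toy_means <- data.frame(interval = c(0, 5, 10), steps = c(1.7, 0.3, 2.6))
toy_pam <- data.frame(
  steps = c(NA, NA, NA, 4, 0, 9),
  interval = rep(c(0, 5, 10), times = 2)
)
filled <- toy_pam
na_rows <- is.na(filled$steps)
# For each missing row, find the position of its interval in the means table
idx <- match(filled$interval[na_rows], toy_means$interval)
# Replace only the missing rows with the rounded mean of their interval
filled$steps[na_rows] <- round(toy_means$steps[idx])
print(filled$steps)
```

Because the replacement is indexed by interval rather than by position, it would also handle isolated missing values inside an otherwise complete day.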
Plotting the new bar chart of total steps per day after imputing the missing values
by_date <- group_by(newpam, date)
sumsteps <- summarise(by_date, "sum" = sum(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
fn <- ggplot(sumsteps, aes(date, sum))
print(fn + geom_bar(stat = "identity") + ggtitle("Number of steps per day") + ylab("steps"))
Report of the mean and median number of steps per 5-minute interval for each day, after imputation.
mmsteps <- summarise(by_date, "mean" = mean(steps, na.rm = FALSE), "median" = median(steps, na.rm = FALSE))
## `summarise()` ungrouping output (override with `.groups` argument)
print(as.data.frame(mmsteps))
## date mean median
## 1 2012-10-01 37.3680556 34.5
## 2 2012-10-02 0.4375000 0.0
## 3 2012-10-03 39.4166667 0.0
## 4 2012-10-04 42.0694444 0.0
## 5 2012-10-05 46.1597222 0.0
## 6 2012-10-06 53.5416667 0.0
## 7 2012-10-07 38.2465278 0.0
## 8 2012-10-08 37.3680556 34.5
## 9 2012-10-09 44.4826389 0.0
## 10 2012-10-10 34.3750000 0.0
## 11 2012-10-11 35.7777778 0.0
## 12 2012-10-12 60.3541667 0.0
## 13 2012-10-13 43.1458333 0.0
## 14 2012-10-14 52.4236111 0.0
## 15 2012-10-15 35.2048611 0.0
## 16 2012-10-16 52.3750000 0.0
## 17 2012-10-17 46.7083333 0.0
## 18 2012-10-18 34.9166667 0.0
## 19 2012-10-19 41.0729167 0.0
## 20 2012-10-20 36.0937500 0.0
## 21 2012-10-21 30.6284722 0.0
## 22 2012-10-22 46.7361111 0.0
## 23 2012-10-23 30.9652778 0.0
## 24 2012-10-24 29.0104167 0.0
## 25 2012-10-25 8.6527778 0.0
## 26 2012-10-26 23.5347222 0.0
## 27 2012-10-27 35.1354167 0.0
## 28 2012-10-28 39.7847222 0.0
## 29 2012-10-29 17.4236111 0.0
## 30 2012-10-30 34.0937500 0.0
## 31 2012-10-31 53.5208333 0.0
## 32 2012-11-01 37.3680556 34.5
## 33 2012-11-02 36.8055556 0.0
## 34 2012-11-03 36.7048611 0.0
## 35 2012-11-04 37.3680556 34.5
## 36 2012-11-05 36.2465278 0.0
## 37 2012-11-06 28.9375000 0.0
## 38 2012-11-07 44.7326389 0.0
## 39 2012-11-08 11.1770833 0.0
## 40 2012-11-09 37.3680556 34.5
## 41 2012-11-10 37.3680556 34.5
## 42 2012-11-11 43.7777778 0.0
## 43 2012-11-12 37.3784722 0.0
## 44 2012-11-13 25.4722222 0.0
## 45 2012-11-14 37.3680556 34.5
## 46 2012-11-15 0.1423611 0.0
## 47 2012-11-16 18.8923611 0.0
## 48 2012-11-17 49.7881944 0.0
## 49 2012-11-18 52.4652778 0.0
## 50 2012-11-19 30.6979167 0.0
## 51 2012-11-20 15.5277778 0.0
## 52 2012-11-21 44.3993056 0.0
## 53 2012-11-22 70.9270833 0.0
## 54 2012-11-23 73.5902778 0.0
## 55 2012-11-24 50.2708333 0.0
## 56 2012-11-25 41.0902778 0.0
## 57 2012-11-26 38.7569444 0.0
## 58 2012-11-27 47.3819444 0.0
## 59 2012-11-28 35.3576389 0.0
## 60 2012-11-29 24.4687500 0.0
## 61 2012-11-30 37.3680556 34.5
According to this report, the eight dates where we imputed data now have mean and median values (a mean of 37.3680556 and a median of 34.5), and the overall total of steps increases because those days are no longer empty.
With this code we add a new variable called tday that holds the type of day (weekday or weekend)
npam <- newpam %>%
  mutate(day = weekdays(date)) %>%
  mutate(tday = factor(1 * (day == "Sunday" | day == "Saturday"), labels = c("weekday", "weekend"))) %>%
  group_by(tday, interval) %>%
  summarise(steps = mean(steps)) %>%
  print()
## `summarise()` regrouping output by 'tday' (override with `.groups` argument)
## # A tibble: 576 x 3
## # Groups: tday [2]
## tday interval steps
## <fct> <int> <dbl>
## 1 weekday 0 2.29
## 2 weekday 5 0.4
## 3 weekday 10 0.156
## 4 weekday 15 0.178
## 5 weekday 20 0.0889
## 6 weekday 25 1.58
## 7 weekday 30 0.756
## 8 weekday 35 1.16
## 9 weekday 40 0
## 10 weekday 45 1.73
## # ... with 566 more rows
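Note that weekdays() returns day names in the current locale, so the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent alternative is to use the numeric day of the week stored in POSIXlt objects. A minimal base R sketch on four example dates (the variable names are illustrative):

```r
# Friday to Monday, as Date values
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
# POSIXlt stores the day of the week in $wday: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday
# Label Saturdays and Sundays as weekend, everything else as weekday
tday <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
               levels = c("weekday", "weekend"))
print(data.frame(date = dates, tday = tday))
```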
Plotting the average number of steps per interval by type of day (weekday and weekend)
t <- ggplot(npam, aes(interval, steps)) + geom_line()
print(t + ggtitle("Average number of steps") + facet_grid(tday ~ .))