First, let’s load the necessary data analysis tools and do some simple data cleaning. I’m also going to save the figures I generate into one folder.

knitr::opts_chunk$set(fig.path = "figure/")

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- read.csv("activity.csv")

str(data)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
data$date <- as.Date(data$date, format = "%Y-%m-%d")

Now let’s calculate the total number of steps taken each day.

daily_steps <- data %>%
  group_by(date) %>%
  summarise(total_steps = sum(steps, na.rm = TRUE))

And here is the histogram of those daily totals:

ggplot(daily_steps, aes(x = total_steps)) +
  geom_histogram(binwidth = 2500, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Total Steps per Day", x = "Total steps", y = "Frequency")

We can also compute the mean and median of the daily totals:

mean_steps <- mean(daily_steps$total_steps, na.rm = TRUE)
median_steps <- median(daily_steps$total_steps, na.rm = TRUE)

mean_steps
## [1] 9354.23
median_steps
## [1] 10395
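
One thing to keep in mind about these numbers: `sum(steps, na.rm = TRUE)` returns 0 for a day where every value is missing, so such a day enters the histogram and the mean as a zero-step day. Here is a minimal sketch of that effect on made-up toy data (the `toy` data frame, its values, and the helper `sum_or_na` are hypothetical, not taken from activity.csv), together with one possible way to keep such days as NA instead.

```r
# Hypothetical toy data: day 1 is entirely missing, day 2 is complete
toy <- data.frame(
  date  = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  steps = c(NA, NA, 10, 20)
)

# Same approach as above: an all-NA day sums to 0
total_na_rm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative: keep a day as NA when nothing was recorded on it
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
total_keep_na <- tapply(toy$steps, toy$date, sum_or_na)

mean(total_na_rm)                  # the zero day pulls the mean down
mean(total_keep_na, na.rm = TRUE)  # mean over recorded days only
```

Which version is more appropriate depends on the question being asked; the analysis here keeps the `na.rm = TRUE` totals.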

Now I’m curious about the average daily activity pattern. To explore it, I make a time series plot of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all days (y-axis), and then look up the interval with the highest average.

interval_avg <- data %>%
  group_by(interval) %>%
  summarise(avg_steps = mean(steps, na.rm = TRUE))

ggplot(interval_avg, aes(x = interval, y = avg_steps)) +
  geom_line(color = "red") +
  labs(title = "Average Daily Activity Pattern", 
       x = "5-minute interval", 
       y = "Average number of steps")

max_interval <- interval_avg[which.max(interval_avg$avg_steps), "interval"]
max_interval
## # A tibble: 1 × 1
##   interval
##      <int>
## 1      835

From the plot and the computed value, the 5-minute interval starting at 8:35 am (interval 835) contains the maximum number of steps on average across all the days in the dataset.
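
The interval labels appear to encode clock time as hours times 100 plus minutes. This is an assumption, but it is consistent with the values 0, 5, ..., 45 in the `str()` output and with reading 835 as 8:35. A small hypothetical helper, `interval_to_time`, sketches the conversion:

```r
# Hypothetical helper: turn an interval label such as 835 into "08:35",
# assuming the label is hour * 100 + minute
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 835, 2355))
```

If the assumption holds, the labels jump from 55 to 100 at the end of each hour, so the x-axis of the time series plot is not perfectly evenly spaced in time.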

Note that there are a number of days/intervals where there are missing values. The presence of missing days may introduce bias into some calculations or summaries of the data, so I impute those missing values.

First, let’s count the missing values in the dataset. Then I fill each missing value with the mean for its 5-minute interval, which gives a new dataset with no missing values.

total_missing <- sum(is.na(data$steps))
total_missing
## [1] 2304
data_filled <- data

# Replace each missing value with the mean of its 5-minute interval
for (i in seq_len(nrow(data_filled))) {
  if (is.na(data_filled$steps[i])) {
    interval_val <- data_filled$interval[i]
    data_filled$steps[i] <- interval_avg$avg_steps[interval_avg$interval == interval_val]
  }
}

sum(is.na(data_filled$steps))
## [1] 0
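
The same fill can probably be written without an explicit loop. The sketch below uses base R’s `ave()` on made-up toy data (the `toy` data frame and the column `steps_filled` are hypothetical) to compute each row’s interval mean and then replaces only the missing entries:

```r
# Hypothetical toy data with two "days" of three intervals each
toy <- data.frame(
  steps    = c(NA, 10, 4, 2, NA, 6),
  interval = c(0, 5, 10, 0, 5, 10)
)

# Mean of each interval (ignoring NAs), repeated for every row of that interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing values; observed values stay unchanged
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
toy
```

On the full dataset the loop and this vectorized form should give the same result; the vectorized form is mainly shorter and avoids a row-by-row lookup.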

Now I make a histogram of the total number of steps taken each day with the imputed data, and again calculate the mean and median total number of steps per day. Here is the result.

daily_steps_filled <- data_filled %>%
  group_by(date) %>%
  summarise(total_steps = sum(steps))

ggplot(daily_steps_filled, aes(x = total_steps)) +
  geom_histogram(binwidth = 2500, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Total Steps per Day (Missing Values Imputed)",
       x = "Total steps", y = "Frequency")

mean_filled <- mean(daily_steps_filled$total_steps)
median_filled <- median(daily_steps_filled$total_steps)

mean_filled
## [1] 10766.19
median_filled
## [1] 10766.19

Comparing these values with the earlier ones, the mean increased noticeably (from 9354.23 to 10766.19), while the median changed only slightly (from 10395 to 10766.19), and the two are now exactly equal. The impact of filling in the missing data is that the daily totals become more concentrated around the center of the distribution, since missing values no longer count as zero steps.
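
The two means can be reconciled with a bit of arithmetic. This is a sketch under an assumption that the numbers support but the output above does not show directly: that the 2304 missing values form 8 completely missing days (2304 / 288 intervals per day), each counted as 0 before imputation.

```r
n_rows            <- 17568    # from str(data)
intervals_per_day <- 24 * 12  # 288 five-minute intervals in a day
n_days            <- n_rows / intervals_per_day  # 61 days
n_missing_days    <- 2304 / intervals_per_day    # 8, assuming whole days are missing

mean_filled <- 10766.19  # mean reported after imputation

# If the 8 missing days were counted as 0, the mean over all 61 days would be:
mean_with_zero_days <- mean_filled * (n_days - n_missing_days) / n_days
round(mean_with_zero_days, 2)
```

The result agrees with the mean of 9354.23 reported before imputation. Under the same assumption, each filled day totals exactly the average daily total, which would explain why the median lands on the same value as the mean.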

Lastly, I wonder whether activity patterns differ between weekdays and weekends. First I create a new factor variable in the dataset with two levels, “weekday” and “weekend”, indicating whether a given date is a weekday or a weekend day.

data_filled$weekday <- weekdays(data_filled$date)
data_filled$day_type <- ifelse(data_filled$weekday %in% c("Saturday", "Sunday"), 
                               "weekend", "weekday")
data_filled$day_type <- as.factor(data_filled$day_type)
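
One caveat: `weekdays()` returns day names in the current locale, so the comparison with "Saturday" and "Sunday" only works as intended in an English locale. A possible locale-independent sketch uses the `wday` field of `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday. The helper name `day_type_of` is hypothetical.

```r
# Hypothetical helper: classify dates without relying on locale-specific day names
day_type_of <- function(date) {
  wday <- as.POSIXlt(date)$wday  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))  # Mon, Sat, Sun
day_type_of(dates)
```

In an English locale this should give the same classification as the `weekdays()` approach above.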

Then I make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).

pattern <- data_filled %>%
  group_by(interval, day_type) %>%
  # drop the grouping explicitly so summarise() does not emit a regrouping message
  summarise(avg_steps = mean(steps), .groups = "drop")

ggplot(pattern, aes(x = interval, y = avg_steps, color = day_type)) +
  geom_line() +
  facet_grid(day_type ~ .) +
  labs(title = "Activity Patterns: Weekdays vs. Weekends",
       x = "5-minute interval", y = "Average number of steps") +
  theme_bw()

From the chart we can see that, compared to weekdays, the morning activity peak on weekends comes later, while activity is spread more evenly throughout the day.