INTRODUCTION

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them.

This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The dataset covers two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval each day.

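The variables included in this dataset are:

  1. steps: number of steps taken in a 5-minute interval (missing values are coded as NA).
  2. date: the date on which the measurement was taken, in YYYY-MM-DD format.
  3. interval: identifier for the 5-minute interval in which the measurement was taken.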

The dataset is stored in a CSV file and there are a total of 17,568 observations in the dataset.
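
The observation count can be checked with a little arithmetic: 61 days of 5-minute intervals. The sketch below assumes the interval identifiers encode clock time as HHMM without leading zeros (0, 5, ..., 55, 100, ..., 2355), which is consistent with the range 0 to 2355 reported in the data summary later in this report. The variable names are illustrative only.

```r
# Assumption: interval identifiers are HHMM-style clock times
# (0, 5, ..., 55, 100, 105, ..., 2355).
hours   <- 0:23
minutes <- seq(0, 55, by = 5)
intervals <- as.vector(outer(minutes, hours * 100, "+"))

intervals_per_day <- length(intervals)  # 24 hours * 12 intervals per hour
n_days <- 31 + 30                       # October + November

intervals_per_day
n_days * intervals_per_day
range(intervals)
```

Under this assumption the identifiers are not evenly spaced (55 is followed by 100), which is worth keeping in mind when they are used as an x axis.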

ASSIGNMENT

This assignment will be described in multiple parts. You will need to write a report that answers the questions detailed below. Ultimately, you will need to complete the entire assignment in a single R Markdown document that can be processed by knitr and transformed into an HTML file.

Throughout your report, make sure you always include the code that you used to generate the output you present. When writing code chunks in the R Markdown document, always use echo = TRUE so that someone else will be able to read the code. This assignment will be evaluated via peer assessment, so it is essential that your peer evaluators be able to review the code for your analysis.

For the plotting aspects of this assignment, feel free to use any plotting system in R (i.e., base, lattice, ggplot2).

Fork/clone the GitHub repository created for this assignment. You will submit this assignment by pushing your completed files into your forked repository on GitHub. The assignment submission will consist of the URL to your GitHub repository and the SHA-1 commit ID for your repository state.

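Questions to be answered:

  1. What is the mean total number of steps taken per day?
  2. What is the average daily activity pattern?
  3. Imputing missing values.
  4. Are there differences in activity patterns between weekdays and weekends?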

SETTING GLOBAL OPTIONS

library(ggplot2)
library(knitr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4   tidyr_1.3.1   knitr_1.45    ggplot2_3.4.4
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5       cli_3.6.2         rlang_1.1.3       xfun_0.41        
##  [5] purrr_1.0.2       generics_0.1.3    jsonlite_1.8.8    glue_1.7.0       
##  [9] colorspace_2.1-0  htmltools_0.5.7   sass_0.4.8        fansi_1.0.6      
## [13] scales_1.3.0      rmarkdown_2.25    grid_4.3.2        evaluate_0.23    
## [17] munsell_0.5.0     jquerylib_0.1.4   tibble_3.2.1      fastmap_1.1.1    
## [21] yaml_2.3.8        lifecycle_1.0.4   compiler_4.3.2    pkgconfig_2.0.3  
## [25] rstudioapi_0.15.0 digest_0.6.34     R6_2.5.1          tidyselect_1.2.0 
## [29] utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3    bslib_0.6.1      
## [33] withr_3.0.0       tools_4.3.2       gtable_0.3.4      cachem_1.0.8
knitr::opts_chunk$set(echo = TRUE,
                     warning = FALSE,
                     fig.width = 10,
                     fig.height = 5,
                     fig.keep = "all",
                     dev = "png")

GETTING THE DATA AND PRE-PROCESSING

# Reading the data

activity <- read.csv("activity.csv")

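# Parse the date column (text such as "2012-10-01"). The format must be
# passed by name: the second positional argument of as.POSIXct() is tz.
activity$date <- as.POSIXct(activity$date, format = "%Y-%m-%d")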

# Add a column with the days of the week (from the date column)

activity$day <- weekdays(activity$date)

summary(activity)
##      steps             date               interval          day           
##  Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0   Length:17568      
##  1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8   Class :character  
##  Median :  0.00   Median :2012-10-31   Median :1177.5   Mode  :character  
##  Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5                     
##  3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2                     
##  Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0                     
##  NA's   :2304
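
Date parsing in R is sensitive to how arguments are matched, so a small standalone check can be useful. The sketch below uses two illustrative date strings (not read from the file) and assumes the YYYY-MM-DD layout; the UTC time zone is chosen here only to make the example reproducible.

```r
dates_chr <- c("2012-10-01", "2012-11-30")  # illustrative values

# Name the format argument explicitly; as.POSIXct(x, tz = "", ...) takes
# the time zone, not the format, as its second positional argument.
parsed <- as.POSIXct(dates_chr, format = "%Y-%m-%d", tz = "UTC")

class(parsed)
format(parsed, "%Y-%m-%d")

# as.Date() is a lighter alternative when no time of day is needed
as_dates <- as.Date(dates_chr, format = "%Y-%m-%d")
class(as_dates)
```

Either class works with weekdays() and with aggregate(); as.Date() avoids time zone questions entirely.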

QUESTIONS

QUESTION 1:

What is the mean total number of steps taken per day?

# Calculate the total number of steps per day

TotalSteps <- data.frame(with(activity, 
                              aggregate(steps,
                                        by = list(date),
                                        sum,
                                        na.rm = TRUE)))
# Give the columns more meaningful names

names(TotalSteps) <- c("Date", "Steps")

# Plot a histogram of total steps

ggplot(TotalSteps,
       aes(x = Steps)) +
  geom_histogram(breaks = seq(0,
                              25000,
                              by = 1000),
                 fill = "pink",
                 col = "purple") +
  xlab("Steps") +
  ylab("Frequency")+
  ggtitle("Total number of steps (by Day)") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 15)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme_classic()

ggsave("question1.png")
## Saving 10 x 5 in image
# Mean and median of the total steps per day

mean(TotalSteps$Steps)
## [1] 9354.23
median(TotalSteps$Steps)
## [1] 10395
# Mark the mean and median on the histogram

ggplot(TotalSteps,
       aes(x = Steps)) +
  geom_histogram(breaks = seq(0,
                              25000,
                              by = 1000),
                 fill = "pink",
                 col = "purple") +
  xlab("Steps") +
  ylab("Frequency")+
  labs(title = "Total number of steps",
        subtitle = "(by Day)") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 15)) +
  #scale_x_continuous(expand = c(0, 0)) +
  # linewidth replaces size for lines as of ggplot2 3.4.0; the two lines
  # are labelled with annotate() below, so no colour scale is needed
  geom_vline(aes(xintercept = mean(Steps)),
             col = "black",
             linewidth = 1) +
  geom_vline(aes(xintercept = median(Steps)),
             col = "grey",
             linewidth = 1) +
  annotate("text",
           label = "Mean",
           x = 8000,
           hjust = 0,
           y = Inf,
           vjust = 2,
           color = "black") +
  annotate("text",
           label = "Median",
           x = 10500,
           hjust = 0,
           y = Inf,
           vjust = 2,
           color = "grey") +
  theme_classic()

ggsave("question1-1.png")
## Saving 10 x 5 in image
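
One detail of the daily totals is worth illustrating: sum(..., na.rm = TRUE) returns 0 for a day on which every interval is missing, so such a day is counted as a zero-step day and pulls the mean down. The sketch below shows the effect on a made-up three-day data frame (the values and the names toy, totals_zero and totals_drop are hypothetical) and contrasts it with the formula interface of aggregate(), which drops NA rows before summing.

```r
toy <- data.frame(
  date  = rep(c("d1", "d2", "d3"), each = 2),
  steps = c(100, 200, NA, NA, 300, 500)  # day d2 has no recorded values
)

# All-NA day becomes a total of 0
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Formula interface: rows with NA are omitted first, so d2 disappears
totals_drop <- aggregate(steps ~ date, data = toy, FUN = sum)

totals_zero
mean(totals_zero)        # d2 counted as 0
mean(totals_drop$steps)  # d2 excluded
```

Which treatment is appropriate depends on whether a fully missing day should be read as "no activity" or "no measurement".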

QUESTION 2:

What is the average daily activity pattern?

# Calculate the average of steps by interval

DailyAverage <- data.frame(aggregate(activity$steps,
                                     by = list(activity$interval),
                                     FUN = mean,
                                     na.rm = TRUE))

# More meaningful names

names(DailyAverage) <- c("Interval",
                         "Mean")

# Plot a line graph

ggplot(DailyAverage,
       aes(Interval,
           Mean)) +
  geom_line(col = "darkmagenta") +
  labs(title = "Average number of Steps",
       subtitle = "(Per Interval)",
       x = "Interval",
       y = "Avg. Number of Steps") +
  theme_classic()

ggsave("plot2.png")
## Saving 10 x 5 in image
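
A natural follow-up to the line graph is to ask which interval has the highest average. A minimal sketch with which.max() is shown below; toy_avg is a made-up stand-in with the same column names as DailyAverage, so the numbers are not results from the dataset.

```r
toy_avg <- data.frame(
  Interval = c(0, 5, 10, 15),
  Mean     = c(1.5, 0.3, 7.2, 2.0)  # hypothetical averages
)

# which.max() returns the row index of the first maximum
peak <- toy_avg[which.max(toy_avg$Mean), ]
peak
```

Applied to the real data, the same indexing on DailyAverage would return the interval identifier and its average step count.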

QUESTION 3:

Imputing missing values

# Calculate the total number of missing values

sum(is.na(activity$steps))
## [1] 2304

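We are going to follow three different routes:

  1. Assume that no data input means 0 steps.
  2. Replace each missing value with the mean for that 5-minute interval (averaged across all days).
  3. Replace each missing value with the median for that 5-minute interval (across all days).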
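# Route 1: replace NAs in steps with 0.
# Only the steps column is touched, so the date column keeps its class
# (funs() and mutate_all() are superseded in current dplyr).
Activity1 <- activity %>%
  mutate(steps = ifelse(is.na(steps), 0, steps))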


# Route 2: look up the mean for each row's interval (averaged across days)

meanSteps <- DailyAverage$Mean[match(activity$interval,
                                     DailyAverage$Interval)]

# Replace NAs with the mean for that interval

activity2 <- transform(activity,
                       steps = ifelse(is.na(activity$steps),
                                      yes = meanSteps,
                                      no = activity$steps))

# Route 3: replace NAs with the median for that interval
DailyMedian <- data.frame(aggregate(activity$steps,
                                    by = list(activity$interval),
                                    FUN = median,
                                    na.rm = TRUE))

names(DailyMedian) <- c("Interval",
                        "Median")

# Match against DailyMedian's own Interval column
medianSteps <- DailyMedian$Median[match(activity$interval,
                                        DailyMedian$Interval)]

activity3 <- transform(activity,
                       steps = ifelse(is.na(activity$steps),
                                      yes = medianSteps,
                                      no = activity$steps))
# Check that no NAs remain

sum(is.na(Activity1))
## [1] 0
sum(is.na(activity2))
## [1] 0
sum(is.na(activity3))
## [1] 0
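
The interval-based imputation can also be written without a separate lookup table. The sketch below is an alternative formulation, not the code used for the results in this report: it uses base R's ave() to compute the per-interval mean in place, on a made-up six-row data frame (toy and its values are hypothetical).

```r
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 10, 4, 20, 0)  # first day is entirely missing
)

# ave() returns, for every row, the mean of its interval group
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; fill gaps with the interval mean
toy$steps_imputed <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

toy
```

Swapping mean for median in the FUN argument gives the median-based variant.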
# Plot the graphs

Activity1plot <- data.frame(aggregate(steps ~ date,
                                      Activity1,
                                      sum))
activity2plot <- data.frame(aggregate(steps ~ date,
                                      activity2,
                                      sum))
activity3plot <- data.frame(aggregate(steps ~ date,
                                      activity3,
                                      sum))

ggplot(Activity1plot,
       aes(x = steps)) +
  geom_histogram(breaks = seq(0,
                              25000,
                              by = 1000),
                 fill = "pink",
                 col = "purple") +
  xlab("Steps") +
  ylab("Frequency")+
  labs(title = "Total number of steps",
        subtitle = "(NA = 0)") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 15)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme_classic()

ggsave("question3-1.png")
## Saving 10 x 5 in image
ggplot(activity2plot,
       aes(x = steps)) +
  geom_histogram(breaks = seq(0,
                              25000,
                              by = 1000),
                 fill = "blue",
                 col = "darkblue") +
  xlab("Steps") +
  ylab("Frequency")+
  labs(title = "Total number of steps",
        subtitle = "(NA = Mean)") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 15)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme_classic()

ggsave("question3-2.png")
## Saving 10 x 5 in image
ggplot(activity3plot,
       aes(x = steps)) +
  geom_histogram(breaks = seq(0,
                              25000,
                              by = 1000),
                 fill = "red",
                 col = "darkred") +
  xlab("Steps") +
  ylab("Frequency")+
  labs(title = "Total number of steps",
        subtitle = "(NA = Median)") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 15)) +
  scale_x_continuous(expand = c(0, 0)) +
  theme_classic()

ggsave("question3-3.png")
## Saving 10 x 5 in image

QUESTION 4:

Are there differences in activity patterns between weekdays and weekends?

# Adding day type

activity$dayType <- ifelse(activity$day %in% 
                             c("Saturday", "Sunday"),
                           "Weekend",
                           "Weekday")

# Create dataframe

ActivityByDay <- aggregate(steps ~ interval + dayType,
                           activity,
                           mean,
                           na.rm = TRUE)

# Plot

ggplot(ActivityByDay,
       aes(x = interval,
           y = steps,
           color = dayType)) +
  geom_line() +
  labs(title = "Average Daily Step Count",
       subtitle = "By Day Type",
       x = "Interval",
       y = "Avg. Steps") +
  facet_wrap(~dayType,
             ncol = 1,
             nrow = 2) +
  scale_color_discrete(name = "Day Type") +
  theme_classic()

ggsave("question4.png")
## Saving 10 x 5 in image
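
The weekday/weekend split above relies on weekdays() returning English day names, which holds for the English locale shown in the session information but not in general. A locale-independent sketch, assuming only base R, uses the numeric weekday stored in POSIXlt (0 = Sunday, 6 = Saturday); the four dates are illustrative.

```r
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric and does not depend on the locale: 0 = Sunday ... 6 = Saturday
wday <- as.POSIXlt(dates)$wday

day_type <- ifelse(wday %in% c(0, 6), "Weekend", "Weekday")

data.frame(date = dates, wday = wday, dayType = day_type)
```

This produces the same labels as the name-based approach in an English locale and keeps working when the report is knitted elsewhere.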