ActivityTracker Data Analysis

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The dataset consists of two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and contains a total of 17,568 observations.
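The stated total can be checked with simple arithmetic. The sketch below assumes that every day in October and November is fully covered, with one row per 5-minute interval; the variable names are illustrative only.

```r
# Number of 5-minute intervals in one day
intervals_per_day <- (24 * 60) / 5

# Days in October (31) plus days in November (30)
days <- 31 + 30

# Expected number of rows if every interval of every day has a row
total_observations <- intervals_per_day * days
print(total_observations)
```

Under that assumption the product matches the 17,568 observations reported for the dataset.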

The data for this analysis can be downloaded from the URL used in the code below, or you can simply run the code, which downloads it.

Downloading data

The code below downloads the zip file if it is missing and unzips it if the data file is missing.

setwd("A:/Coursera DS/Reproducible Research")  # author's local working directory; adjust as needed
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
zipdata <- "./repdata_data_activity.zip"
datafile <- "./activity.csv"
# Download and unzip only when the corresponding file is missing
if (!file.exists(datafile)) {
    if (!file.exists(zipdata)) {
        download.file(url, zipdata, method = "auto")
    }
    unzip(zipfile = zipdata)
}

Let us read and look at the summary of the dataset.

activity <- read.csv(datafile)
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
str(activity)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

summary(activity)

##      steps             date               interval     
##  Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
##  1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
##  Median :  0.00   Median :2012-10-31   Median :1177.5  
##  Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
##  3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
##  Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
##  NA's   :2304

The above output is a descriptive summary of the variables in the dataset. Also, from the date column we can see that 288 observations were taken on each day (17,568 observations over 61 days).
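One way to confirm the per-day count is to tabulate the date column. The sketch below uses a small synthetic data frame (`demo`, three days only) as a stand-in for `activity`. It also assumes the interval identifiers encode the clock time as hours times 100 plus minutes, which is consistent with the maximum of 2355 in the summary above.

```r
# Synthetic stand-in for the activity data: 3 days x 288 intervals
minutes <- seq(0, 24 * 60 - 5, by = 5)                    # minutes since midnight
interval_ids <- (minutes %/% 60) * 100 + minutes %% 60    # 0, 5, ..., 55, 100, ..., 2355

demo <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:2, each = length(interval_ids)),
  interval = rep(interval_ids, times = 3)
)

# Count the rows recorded for each date
obs_per_day <- table(demo$date)
print(obs_per_day)

# TRUE when every day has exactly 288 rows
all_equal_288 <- all(obs_per_day == 288)
```

Running `table(activity$date)` on the real data would give the corresponding counts for all 61 days.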

Now, since we have to plot the histogram of the number of steps per day, we first create a separate dataset called “stepsPerDay” and then plot it.

stepsPerDay <- aggregate(activity$steps, FUN= "sum", by = list(activity$date), na.rm = TRUE)
names(stepsPerDay) <- c("Date","Steps")

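The plot is shown below. Note that geom_col draws one bar per day, so this is a bar chart of daily totals rather than a histogram in the strict sense: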

library(ggplot2)
his <- ggplot(stepsPerDay, aes(Date, Steps))
his + geom_col(fill = "steelblue", na.rm = FALSE)
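For comparison, a histogram in the strict sense bins the daily totals and counts how many days fall into each bin. The sketch below uses base R's `hist` on made-up daily totals, not the real data, and the bin width of 5000 steps is an arbitrary choice for illustration.

```r
# Hypothetical daily totals, for illustration only (not taken from the dataset)
daily_totals <- c(0, 126, 8000, 9500, 10400, 11000, 11500, 12800, 15000, 21000)

# Bin the totals into 5000-step-wide bins; plot = FALSE returns the
# histogram object without drawing it
h <- hist(daily_totals, breaks = seq(0, 25000, by = 5000), plot = FALSE)

# Number of days falling into each bin
print(h$counts)

# plot(h, col = "steelblue", xlab = "Steps per day") would draw the histogram
```

With the real data, `stepsPerDay$Steps` would take the place of `daily_totals`.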

meanSteps <- mean(stepsPerDay$Steps, na.rm = TRUE)
medianSteps <- median(stepsPerDay$Steps, na.rm = TRUE)

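The mean number of steps per day is 9354.23 and the median is 10395. Because sum(..., na.rm = TRUE) returns 0 for a day whose values are all NA, any such day is counted as a zero-step day in these figures.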
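The effect of `na.rm = TRUE` on days with no recorded data can be seen on a toy example. The data frame `toy` below is hypothetical. The second call uses the formula interface of `aggregate`, whose default `na.action` drops rows containing NA before grouping, so a day with only missing values disappears instead of becoming 0.

```r
# Toy data: the first day has only missing values
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  steps = c(NA, NA, 10, 20, 30, 30)
)

# Same call shape as in the analysis: the all-NA day gets a total of 0
with_zero <- aggregate(toy$steps, by = list(toy$date), FUN = sum, na.rm = TRUE)
mean_with_zero <- mean(with_zero$x)

# Formula interface: rows with NA are dropped first, so the all-NA day is omitted
without_na_days <- aggregate(steps ~ date, data = toy, FUN = sum)
mean_without_na_days <- mean(without_na_days$steps)

print(c(mean_with_zero, mean_without_na_days))
```

Which treatment is appropriate depends on whether a day without data should count as a day with zero steps; the analysis above takes the first approach.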

Next, we make a time series plot of the average number of steps taken in each 5-minute interval, averaged across all days.

avgSteps <- aggregate(activity$steps, FUN= "mean", by = list(activity$interval), na.rm = TRUE)
names(avgSteps) <- c("Interval","avgSteps")
tsPlot <- ggplot(avgSteps, aes(Interval, avgSteps))
tsPlot + geom_line(color = "steelblue")

maxInterval <- avgSteps$Interval[which.max(avgSteps$avgSteps)]

The 5-minute interval that, on average, contains the maximum number of steps is 835.
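The interval identifiers appear to encode the time of day as hours times 100 plus minutes (they run from 0 to 2355), so 835 would correspond to 08:35. The helper below, `interval_to_clock`, is a hypothetical name introduced here to make that reading explicit; it is a sketch under that assumption.

```r
# Convert an interval identifier such as 835 to a clock string such as "08:35",
# assuming the identifier is hours * 100 + minutes
interval_to_clock <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}

print(interval_to_clock(c(0, 835, 2355)))
```

This can be applied to `maxInterval` to report the busiest interval as a time of day.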

numNA <- sum(is.na(activity$steps))

The number of missing values in the steps column of the activity dataset is 2304.

To impute the missing data, we use the per-interval averages computed earlier and replace each NA in the original dataset with the average for its interval.

newData <- activity  # make a copy of the original data
for (i in avgSteps$Interval) {
    # rows in interval i whose step count is missing
    idx <- newData$interval == i & is.na(newData$steps)
    newData$steps[idx] <- avgSteps$avgSteps[avgSteps$Interval == i]
}
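The same imputation can also be written without a loop by looking up each missing row's interval in a table of per-interval means. The sketch below runs on a small hypothetical data frame (`toy`) rather than the real data, and uses `tapply` in place of `aggregate` to build the lookup table.

```r
# Toy data: three intervals per day, first day entirely missing
toy <- data.frame(
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(NA, NA, NA, 2, 4, 6, 4, 8, 0)
)

# Mean steps per interval, ignoring missing values; names are the interval ids
means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Replace each missing value with the mean of its own interval
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- as.numeric(means[as.character(filled$interval[missing])])

print(filled)
```

The lookup by name avoids relying on row order, so it also works when missing values are scattered rather than covering whole days.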

Now we repeat the same procedure as earlier and plot the daily totals again, this time using the imputed data.

stepsPerDay2 <- aggregate(newData$steps, FUN= "sum", by = list(newData$date), na.rm = TRUE)
names(stepsPerDay2) <- c("Date","Steps")
library(ggplot2)
his <- ggplot(stepsPerDay2, aes(Date, Steps))
his + geom_col(fill = "steelblue", na.rm = FALSE)

Now we create a new factor column called “week”, which records whether each day is a weekday or a weekend day.

newData$day <- weekdays(newData$date)
newData$week <- ""
newData[newData$day == "Saturday" | newData$day == "Sunday", ]$week <- "Weekend"
newData[!(newData$day == "Saturday" | newData$day == "Sunday"), ]$week <- "Weekday"
newData$week <- factor(newData$week)
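Because `weekdays()` returns day names in the current locale, the comparison with "Saturday" and "Sunday" assumes an English locale. A locale-independent alternative, sketched below on a few example dates, uses the numeric `wday` field of `as.POSIXlt`, where 0 is Sunday and 6 is Saturday.

```r
# Example dates: a Friday, Saturday, Sunday and Monday in October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Day of week as a number: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Label each date and store the result as a factor
week <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
               levels = c("Weekday", "Weekend"))

print(data.frame(dates, wday, week))
```

Applied to the analysis, `newData$date` would replace `dates` and the result would be assigned to `newData$week`.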

We can now plot the data, faceted by the weekday/weekend category created above.

avg_step_newData <- aggregate(steps ~ interval + week, data = newData, mean)
pl <- ggplot(avg_step_newData, aes(interval, steps))
pl + geom_line(color = "steelblue") + facet_wrap(~week) +
    labs(x = "Interval", y = "Steps", title = "Avg. steps comparison in intervals across weekdays and weekends")

Shreyas Khadse

18/06/2020