Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This project makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The dataset covers two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.

The data for this project can be downloaded here:

Dataset: Data

knitr::opts_chunk$set(echo = TRUE)
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages --------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v dplyr   1.0.2
## v tibble  3.0.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(dplyr)

Variables

The variables included in this dataset are:

  • steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
  • date: The date on which the measurement was taken in YYYY-MM-DD format
  • interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
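
The interval identifier appears to encode the clock time as hours and minutes (for example, 835 for 08:35), which is consistent with the maximum value of 2355 in the data; this encoding is an assumption, not something the dataset description states. A minimal sketch of turning identifiers into readable labels, where `interval_to_time` is a hypothetical helper introduced only for illustration:

```r
# Hypothetical helper: convert an HHMM-style interval identifier to "HH:MM".
# Assumes identifiers such as 0, 5, ..., 55, 100, ..., 2355.
interval_to_time <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))  # e.g. 835 becomes "0835"
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

labels <- interval_to_time(c(0, 55, 100, 835, 2355))
print(labels)
```

If the assumption holds, such labels may be easier to read on a plot axis than the raw identifiers.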

Processing data

data <- read.csv("activity.csv", header = TRUE)
head(data)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
str(data)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

The first objective is to calculate the total number of steps taken each day.

total_step_data<- data %>%
  group_by(date) %>%
  summarise(total_steps=sum(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
total_step_data<-total_step_data %>% drop_na(total_steps)
head(total_step_data)
## # A tibble: 6 x 2
##   date       total_steps
##   <chr>            <int>
## 1 2012-10-02         126
## 2 2012-10-03       11352
## 3 2012-10-04       12116
## 4 2012-10-05       13294
## 5 2012-10-06       15420
## 6 2012-10-07       11015
str(total_step_data)
## tibble [53 x 2] (S3: tbl_df/tbl/data.frame)
##  $ date       : chr [1:53] "2012-10-02" "2012-10-03" "2012-10-04" "2012-10-05" ...
##  $ total_steps: int [1:53] 126 11352 12116 13294 15420 11015 12811 9900 10304 17382 ...
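
The daily totals above rely on `sum()` returning NA for any day that contains missing values, so that those days can be dropped afterwards. The sketch below uses made-up toy data (not the activity dataset) to show how this differs from summing with `na.rm = TRUE`, which would turn a fully missing day into a total of zero:

```r
# Toy data: day "a" is entirely missing, day "b" is complete.
toy <- data.frame(
  date  = c("a", "a", "b", "b"),
  steps = c(NA, NA, 10, 20)
)

# Default: any NA in a group makes the total NA, so the day can be dropped later.
totals_default <- tapply(toy$steps, toy$date, sum)

# na.rm = TRUE: an all-missing day becomes 0, which looks like a real total.
totals_na_rm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(totals_default)
print(totals_na_rm)
```

Keeping the NA totals and dropping them avoids artificial zero-step days in the histogram and in the mean and median.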

Visualising the calculated values

g<-ggplot(total_step_data,aes(date,total_steps))
g+ geom_bar(stat = 'identity',fill='black')+
  ggtitle('Total steps per day')+
  theme(axis.text.x = element_text(angle=90))

The summary function reports the mean and the median of the total steps taken per day.

summary(total_step_data$total_steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8841   10765   10766   13294   21194
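
Because `summary()` prints rounded values, the mean and median can also be computed directly when more precision is wanted. A small sketch using only the first six daily totals printed earlier (so the results differ from the full-data summary):

```r
# First six daily totals shown in the head() output above; illustration only.
daily_totals <- c(126, 11352, 12116, 13294, 15420, 11015)

mean_steps   <- mean(daily_totals)
median_steps <- median(daily_totals)

print(c(mean = mean_steps, median = median_steps))
```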

Time series plotting of average daily activity pattern

Averaging the number of steps within each 5-minute interval across all days and plotting the result.

step_data<-aggregate(steps~interval,data=data,mean)
head(step_data)
##   interval     steps
## 1        0 1.7169811
## 2        5 0.3396226
## 3       10 0.1320755
## 4       15 0.1509434
## 5       20 0.0754717
## 6       25 2.0943396
plot(step_data$interval,
     step_data$steps, 
     type = "l",
     xlab = 'Interval',
     ylab = 'Average steps',
     lwd=2.5,
     main = "Time Series Plot of Average Steps Taken per Interval")

To find the interval with the highest average number of steps, which.max returns the index of the largest value, which avoids matching the maximum as a text pattern with grep or grepl.

step_data[which.max(step_data$steps),]
##     interval    steps
## 104      835 206.1698

Imputing missing values

Locating the NA values in each column of the dataset

na.info<-apply(is.na(data),2,which)
str(na.info)
## List of 3
##  $ steps   : int [1:2304] 1 2 3 4 5 6 7 8 9 10 ...
##  $ date    : int(0) 
##  $ interval: int(0)
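
When only the number of missing values per column is needed, rather than their row positions, a count can be more compact. A minimal sketch on made-up toy data:

```r
# Toy data frame with two missing step counts.
toy <- data.frame(steps = c(NA, 3, NA, 7), interval = c(0, 5, 10, 15))

na_counts <- colSums(is.na(toy))     # missing values per column
na_share  <- mean(is.na(toy$steps))  # proportion of missing step counts

print(na_counts)
print(na_share)
```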

Substituting the NA values in the dataset with the mean of the column

data_withoutNA<-data
data_withoutNA$steps[which(is.na(data_withoutNA$steps))]<-mean(data$steps,na.rm = TRUE)

#checking that the new dataset has no NA values
na.info<-apply(is.na(data_withoutNA),2,which)
summary(data_withoutNA)
##      steps            date              interval     
##  Min.   :  0.00   Length:17568       Min.   :   0.0  
##  1st Qu.:  0.00   Class :character   1st Qu.: 588.8  
##  Median :  0.00   Mode  :character   Median :1177.5  
##  Mean   : 37.38                      Mean   :1177.5  
##  3rd Qu.: 37.38                      3rd Qu.:1766.2  
##  Max.   :806.00                      Max.   :2355.0

A better approach is to substitute each missing value with the mean number of steps for that particular interval. The step_data dataset already holds the mean value for each interval.

data_withoutNA<-data
for (i in seq_len(nrow(data_withoutNA))){
  # replace a missing step count with the mean for the same interval
  if (is.na(data_withoutNA[i,1])){
    data_withoutNA[i,1]=step_data[step_data[,1] %in%  data_withoutNA[i,3], 2]}
   }
head(data_withoutNA)
##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25
#checking for NA values
na.info<-apply(is.na(data_withoutNA),2,which)
str(na.info)
##  int(0)
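
The same per-interval imputation can be written without an explicit loop. The following is a sketch on made-up toy data, using `ave()` to compute each interval's mean of the observed values; it is one possible vectorised alternative, not the method used above:

```r
# Toy data: two intervals (0 and 5), each with one missing step count.
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of the observed steps within each interval, repeated for every row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched.
imputed <- toy
missing <- is.na(imputed$steps)
imputed$steps[missing] <- interval_mean[missing]

print(imputed)
```

On a dataset of this size the loop is fast enough, so the choice is mostly one of readability.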

Visualising the daily totals in the imputed dataset

# daily totals after imputation, kept separate from the per-interval means in step_data
imputed_total_data<- data_withoutNA %>% group_by(date) %>% summarise(total_steps=sum(steps))
## `summarise()` ungrouping output (override with `.groups` argument)
g<-ggplot(imputed_total_data,aes(date,total_steps))
g + geom_bar(stat='identity',fill='black')+
  ggtitle('Total steps per day')+
  theme(axis.text.x=element_text(angle=90))

#finding out median and mean of the steps taken
summary(imputed_total_data)
##      date            total_steps   
##  Length:61          Min.   :   41  
##  Class :character   1st Qu.: 9819  
##  Mode  :character   Median :10766  
##                     Mean   :10766  
##                     3rd Qu.:12811  
##                     Max.   :21194

Activity patterns between weekdays and weekends

The aim is to draw time series plots comparing the average number of steps taken per interval on weekdays and on weekends.

data_days<-data_withoutNA
data_days$day<-weekdays(as.Date(data_withoutNA$date))
# label each row as "Weekend" or "Weekday" (day names depend on the locale)
data_days$weekday<-ifelse(data_days$day %in% c("Saturday","Sunday"),"Weekend","Weekday")
data_days$day<-as.factor(data_days$day)
data_days$weekday<-as.factor(data_days$weekday)
str(data_days)
## 'data.frame':    17568 obs. of  5 variables:
##  $ steps   : num  1.717 0.3396 0.1321 0.1509 0.0755 ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
##  $ day     : Factor w/ 7 levels "Friday","Monday",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ weekday : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
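
Note that `weekdays()` returns day names in the current locale, so the comparison with "Saturday" and "Sunday" assumes an English locale. A locale-independent sketch, using the numeric day of the week from `as.POSIXlt()` on three example dates from the study period:

```r
# Three example dates: a Monday, a Saturday, and a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# $wday is 0 for Sunday through 6 for Saturday, regardless of locale.
day_number <- as.POSIXlt(dates)$wday
day_type <- factor(ifelse(day_number %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

print(data.frame(dates, day_number, day_type))
```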

Splitting the new dataset into weekdays and weekends and plotting each

weekday_data<-data_days[data_days$weekday=="Weekday",]
weekend_data<-data_days[data_days$weekday=="Weekend",]
weekday_mean<- aggregate(steps ~ interval, weekday_data, mean)
weekend_mean<- aggregate(steps ~ interval, weekend_data,mean)

par(mfrow=c(2,1))
plot(weekday_mean$interval,
     weekday_mean$steps,
     type='l',
     main='Average steps taken per interval on weekdays',
     xlab = 'Intervals',
     ylab = 'Average steps',
     col='blue',
     lwd=2.5,
     ylim = c(1,250),
     xlim = c(1,2500))
plot(weekend_mean$interval,
     weekend_mean$steps,
     type='l',
     main='Average steps taken per interval on weekends',
     xlab = 'Intervals',
     ylab = 'Average steps',
     col='black',
     lwd=2.5,
     ylim = c(1,250),
     xlim = c(1,2500))
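
The two separate averages can also be produced in a single step by grouping on both the interval and the day type. A sketch on made-up toy data, assuming a column layout like that of data_days:

```r
# Toy data with the same column names as data_days (values are invented).
toy <- data.frame(
  steps    = c(10, 20, 30, 40, 50, 70),
  interval = c(0, 0, 5, 5, 0, 5),
  weekday  = c("Weekday", "Weekday", "Weekday", "Weekday", "Weekend", "Weekend")
)

# One row per interval/day-type combination, holding the mean step count.
pattern <- aggregate(steps ~ interval + weekday, data = toy, FUN = mean)

print(pattern)
```

A long-format result like this could then feed a panel plot with one panel per day type, for example through faceting in ggplot2.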