Activity monitoring project

# Load the required libraries
library(ggplot2, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

Q1) Code for reading in the dataset and/or processing the data

First, we set up the environment so that the next user can simply run the code and get the results, with no extra setup effort.

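# Download and unzip the data only if the CSV is not already present
if (!file.exists('activity.csv')) {
  download.file(url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip',
                destfile = 'data.zip', mode = 'wb')
  unzip(zipfile = 'data.zip')
  unlink('data.zip')  # remove the archive once it is extracted
}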

Now the data is downloaded and unzipped. The next step is to read the CSV file.

activity <- read.csv('./activity.csv')
head(activity)

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

The dataset holds the number of steps taken in each 5-minute interval over 61 days (October and November 2012). Some rows contain NAs.
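A quick sanity check on the size of the data can be sketched from the description alone. The start date 2012-10-01 is taken from the output above; the end date 2012-11-30 is an assumption, since the last rows are not shown here.

```r
# Number of 5-minute intervals in one day
intervals_per_day <- 24 * 60 / 5

# Number of days, counting both endpoints (end date is an assumption)
days <- as.integer(as.Date("2012-11-30") - as.Date("2012-10-01")) + 1

# Row count we would expect if every interval of every day has a row
expected_rows <- intervals_per_day * days
expected_rows
```

Comparing expected_rows with nrow(activity) is a cheap way to confirm that no intervals are missing from the file.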

Q2) Histogram of the total number of steps taken each day

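This step uses the dplyr package to total the steps for each day. The plot below shows those daily totals as bars by date, with the overall mean as a horizontal line; the histogram of the totals themselves follows in Q3.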

# Total steps per day; a day that contains NA values keeps an NA total
act_sum <- activity %>% group_by(date) %>% summarise(steps = sum(steps))
avgstep <- mean(act_sum$steps, na.rm = TRUE)
g <- ggplot(data = act_sum, aes(x = date, y = steps)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = avgstep, color = 'magenta')
g
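Where na.rm is placed matters for the mean. The sketch below uses made-up toy values (not the real data) and base R instead of dplyr, to show the difference between leaving a fully missing day as NA and letting sum() turn it into 0.

```r
# Toy data (hypothetical values): day "d2" has no recorded steps at all
toy <- data.frame(
  date  = rep(c("d1", "d2", "d3"), each = 2),
  steps = c(10, 20, NA, NA, 5, 15)
)

# sum() without na.rm: the missing day stays NA and is skipped by mean(na.rm = TRUE)
tot_na <- sapply(split(toy$steps, toy$date), sum)

# sum() with na.rm = TRUE: the missing day becomes 0 and pulls the mean down
tot_zero <- sapply(split(toy$steps, toy$date), sum, na.rm = TRUE)

mean(tot_na, na.rm = TRUE)  # 25
mean(tot_zero)              # about 16.67
```

This is why the daily totals above are computed with a plain sum(steps), and na.rm = TRUE is only applied when averaging the totals.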

Q3) Mean and median number of steps taken each day

avgstep  # mean

## [1] 10766.19

medsteps <- median(act_sum$steps, na.rm = TRUE)
medsteps  # median

## [1] 10765

Since the median and mean are this close, a rough guess is that the distribution of daily steps is bell-shaped. Let's see.

ggplot(act_sum,aes(steps)) + geom_histogram(binwidth = 600)

The histogram is roughly bell-shaped, so the daily step totals look approximately normally distributed. The person this data belongs to appears to have a fairly consistent daily walking routine.
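A close mean and median only suggest symmetry, not normality, which is why looking at the histogram is worthwhile. The sketch below uses a made-up sample (not the step data) that is symmetric yet clearly not bell-shaped, and adds a Shapiro-Wilk test from base R as one possible formal check.

```r
# Hypothetical sample: two spikes, symmetric around 5
x <- c(rep(0, 50), rep(10, 50))

mean(x)    # 5
median(x)  # 5

# Shapiro-Wilk normality test; a small p-value speaks against normality
p <- shapiro.test(x)$p.value
p < 0.05   # TRUE for this two-spike sample
```

The analogous call on the real data would be shapiro.test(act_sum$steps); its result is not shown here.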

So far, parts 3 and 4 have been covered together in this section, since they are closely related.

Daily activity patterns

# Average steps per interval across all days, stored separately from activity
actd <- activity %>% group_by(interval) %>% summarise(steps = mean(steps, na.rm = TRUE))
# Locate the interval with the highest average
stepmaxindex <- which.max(actd$steps)
maxstep <- actd$steps[stepmaxindex]
maxinterval <- actd$interval[stepmaxindex]
g <- ggplot(actd, aes(x = interval, y = steps)) + geom_line()
g <- g + geom_hline(yintercept = maxstep, col = 'magenta', lwd = 0.6, alpha = 0.5)
g <- g + geom_vline(xintercept = maxinterval, col = 'darkgreen', lwd = 0.6, alpha = 0.5)
g

rm(g)

The interval with the highest average number of steps is interval 835, which is 8:35 in the morning. Its average is the value stored in maxstep.
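The interval identifiers encode the clock time as hours times 100 plus minutes, so 835 means 8 hours and 35 minutes. A small helper makes that conversion explicit; the name interval_to_clock is made up for this illustration.

```r
# Convert an interval code such as 835 into a clock label such as "08:35"
# (hypothetical helper, not part of the original analysis)
interval_to_clock <- function(interval) {
  hours   <- as.integer(interval %/% 100)  # leading digits are the hour
  minutes <- as.integer(interval %% 100)   # last two digits are the minutes
  sprintf("%02d:%02d", hours, minutes)
}

interval_to_clock(c(0, 835, 2355))
# "00:00" "08:35" "23:55"
```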

Dealing with NAs

First we want to know how many rows contain NA. The next step is to find a strategy to fill them in.

# Vectorised equivalent of a row-by-row counting loop
NArows <- which(is.na(activity$steps))  # indices of rows with missing steps
NAnum <- length(NArows)                 # number of missing values
len <- nrow(activity)                   # total number of rows

The number of NA rows is stored in NAnum. They make up round(NAnum / len * 100, 2) percent of the dataset, roughly 13 percent.
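Before choosing a filling strategy it helps to know how the NAs are spread over the days. The sketch below uses made-up toy values to compute the share of missing intervals per day; a share of 1 means the whole day is missing.

```r
# Toy data (hypothetical values): the first day is entirely missing
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
  steps = c(NA, NA, NA, 0, 12, 7, 4, 0, 9)
)

# Share of NA intervals within each day
na_share_by_day <- sapply(split(is.na(toy$steps), toy$date), mean)
na_share_by_day

# Days with no recordings at all
fully_missing_days <- names(na_share_by_day)[na_share_by_day == 1]
fully_missing_days
```

Running the same two lines on activity would show whether the missing values come as whole days or as scattered intervals.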

NA filling in strategy

The first thing that comes to mind is to use the earlier table of average steps per interval and put the corresponding average in place of each NA. The code for this step is not very exciting, so feel free to skip it.

# Replace each NA with the average for its interval, looked up in actd
activity$steps[NArows] <- actd$steps[match(activity$interval[NArows], actd$interval)]
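The same interval-mean filling can be sketched in base R with ave(), which returns the group mean aligned to every row. The toy values are made up, and unlike the code above this version writes the result to a new column (steps_filled, a name chosen for this illustration) so the original stays available for comparison.

```r
# Toy data (hypothetical values): the first day is missing for both intervals
toy <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 2, 10, 4, 20)
)

# Mean of the recorded steps for each interval, repeated on every row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(s) mean(s, na.rm = TRUE))

# Keep recorded values, fill the gaps with the interval mean
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
toy$steps_filled
# 3 15 2 10 4 20
```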

NA filled difference check

Here we check how filling in the NA cells changes the daily plot and, of course, the median and mean.

Let's jump to the plot. Since the code is copied from the first plot, it is hidden with echo = FALSE.

The new plot is denser than before, but nothing special happened in this new steps/date plot.

Now let's see how the new histogram of daily steps differs.

As expected, the daily totals in the new histogram cluster more tightly around the mean, since about 13 percent of the rows, previously NA, were replaced with interval averages.

New median and mean

##     median     mean
## 1 10765.00 10766.19
## 2 10766.19 10766.19

These two rows (row 1 before filling, row 2 after) are almost identical, so the filling method looks reasonable: it barely changed the mean and median of the daily totals.
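One way to see why the mean stays put while the median moves onto it: assuming the NAs fall on whole days, every filled day gets a total equal to the sum of the interval averages, which is the mean daily total. The sketch below reproduces that effect on made-up daily totals.

```r
# Hypothetical daily totals; NA marks days with no recordings at all
daily <- c(7000, 12000, NA, 9000, NA, 11500, 10500)

m <- mean(daily, na.rm = TRUE)    # 10000
median(daily, na.rm = TRUE)       # 10500

# Filling a whole missing day with interval means gives it the mean daily total
filled <- ifelse(is.na(daily), m, daily)

mean(filled)    # still 10000
median(filled)  # now 10000, one of the filled days sits in the middle
```

The price is a narrower spread, which matches the tighter clustering seen in the new histogram.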

Weekend and working days pattern

As a first step, we split the data into two parts, weekends and working days, using the date column of the dataset.

dact <- activity
dact$date <- as.Date(dact$date, format = "%Y-%m-%d")

# Flag weekend dates; weekdays() returns locale-dependent names,
# so this comparison assumes an English locale
dact$isWeekend <- weekdays(dact$date) %in% c("Saturday", "Sunday")
dact <- dact %>% group_by(interval, isWeekend) %>% summarise(steps = mean(steps))
head(dact)

## # A tibble: 6 x 3
## # Groups:   interval [3]
##   interval isWeekend  steps
##      <int> <lgl>      <dbl>
## 1        0 FALSE     2.25  
## 2        0 TRUE      0.215 
## 3        5 FALSE     0.445 
## 4        5 TRUE      0.0425
## 5       10 FALSE     0.173 
## 6       10 TRUE      0.0165
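Comparing against the day names "Saturday" and "Sunday" only works when R runs in an English locale. A locale-independent sketch uses the numeric day of the week from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. The four dates are picked for illustration.

```r
# Friday, Saturday, Sunday, Monday in the first week of the data
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric (0 = Sunday, ..., 6 = Saturday), so no day names are involved
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
is_weekend
# FALSE TRUE TRUE FALSE
```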

Now we can calculate and compare the average number of steps per 5-minute interval on weekends and on working days.

# isWeekend == FALSE marks working days, TRUE marks weekends
avgWd <- mean(dact$steps[!dact$isWeekend])
avgWe <- mean(dact$steps[dact$isWeekend])
avgtable <- data.frame(WorkingDays = avgWd, Weekends = avgWe)
avgtable

##   WorkingDays Weekends
## 1    35.61058  42.3664
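Typing the labels by hand makes it easy to attach them to the wrong group. A sketch that derives the labels from the grouping variable itself avoids that risk; the toy values are made up.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  isWeekend = c(FALSE, FALSE, TRUE, TRUE),
  steps     = c(30, 40, 50, 70)
)

# Group means, named "FALSE" and "TRUE" after the grouping values
avg <- sapply(split(toy$steps, toy$isWeekend), mean)

# Translate the group names through a lookup so each label follows its group
labels <- c("FALSE" = "WorkingDays", "TRUE" = "Weekends")
names(avg) <- unname(labels[names(avg)])
avg
# WorkingDays 35, Weekends 60
```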

Let's see how this difference changes the shape of the plots:

# Map the logical facet values to readable panel labels
isWeekend.labs <- c("FALSE" = "Working day", "TRUE" = "Weekend")
g <- ggplot(data = dact, aes(x = interval, y = steps)) + geom_line() +
  facet_grid(. ~ isWeekend, labeller = labeller(isWeekend = isWeekend.labs))
g

All questions have now been answered.