Q1) Code for reading in the dataset and/or processing the data
First of all, we set up the environment so that the next user can simply run our code and get the results, with no extra effort of fetching or unpacking the data by hand.
if (!file.exists('activity.csv')) {
  # Download the archive in binary mode (needed on Windows), extract it, then delete it
  download.file(url = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip',
                destfile = 'data.zip', mode = 'wb')
  unzip(zipfile = 'data.zip')
  unlink('data.zip', recursive = TRUE)
}
Now we have our data downloaded and unzipped. We’re one step away from reading the CSV file.
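The reading step itself is not shown in this write-up. A minimal sketch of it, assuming the unzipped file is an ordinary comma-separated file with a header row and the columns steps, date and interval. A small temporary file (hypothetical stand-in data) replaces activity.csv so the snippet runs on its own:

```r
# Hypothetical stand-in for activity.csv so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c('"steps","date","interval"',
             'NA,"2012-10-01",0',
             'NA,"2012-10-01",5',
             '12,"2012-10-02",0'), tmp)

# read.csv treats the literal NA as a missing value by default
activity_demo <- read.csv(tmp, stringsAsFactors = FALSE)

# Inspect the first rows and the column types
head(activity_demo)
str(activity_demo)

# Clean up the temporary file
unlink(tmp)
```

With the real data, the call would presumably be read.csv('activity.csv') followed by head() on the result.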
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
As we know from the dataset description, it contains the number of steps taken in 5-minute intervals over two months (61 days, October and November 2012). Some rows contain NAs.
Q2) Histogram of the total number of steps taken each day
In this step we need the dplyr package for the summary and ggplot2 for the plot.
library(dplyr)
library(ggplot2)
# Total steps per day; days with missing values stay NA
act_sum <- activity %>% group_by(date) %>% summarise(steps = sum(steps))
# Average daily total, ignoring days with missing values
avgstep <- mean(act_sum$steps, na.rm = TRUE)
g <- ggplot(data = act_sum, aes(x = date, y = steps)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = avgstep, color = 'magenta')
g
Q3) Mean and median number of steps taken each day
## [1] 10766.19
## [1] 10765
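The code behind these two numbers is hidden in the write-up; they are presumably the mean and the median of the daily totals with the missing days dropped. A minimal sketch on made-up daily totals (toy_totals is a hypothetical vector, not the real data):

```r
# Hypothetical daily totals; NA marks days without any recorded steps
toy_totals <- c(NA, 126, 11352, 12116, 13294, NA, 15420)

# na.rm = TRUE drops the missing days before computing each statistic
mean_steps <- mean(toy_totals, na.rm = TRUE)
median_steps <- median(toy_totals, na.rm = TRUE)

print(mean_steps)
print(median_steps)
```

On the real data the same two calls would be applied to act_sum$steps.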
Since the median and the mean are this close, my rough guess is that the distribution of daily steps is bell-shaped. Let’s see.
Since this histogram is roughly bell-shaped, we can say the number of daily steps is approximately normally distributed. The person this data belongs to has a regular daily walking routine.
So far I have covered parts 3 and 4 together in this section, since they are closely related.
Daily activity patterns
# Average steps per 5-minute interval across all days, ignoring NAs
actd <- activity %>%
  group_by(interval) %>%
  summarise(steps = mean(steps, na.rm = TRUE))
# Locate the interval with the highest average
stepmaxindex <- which.max(actd$steps)
maxstep <- actd$steps[stepmaxindex]
maxinteval <- actd$interval[stepmaxindex]
g <- ggplot(actd, aes(x = interval, y = steps)) + geom_line()
g <- g + geom_hline(yintercept = maxstep, col = 'magenta', lwd = 0.6, alpha = 0.5)
g <- g + geom_vline(xintercept = maxinteval, col = 'darkgreen', lwd = 0.6, alpha = 0.5)
g
The interval with the maximum average number of steps is interval 835, which is 8:35 in the morning. Its average is the value stored in maxstep.
Dealing with NAs
First we want to know how many rows contain NA. The next step is to find a strategy to fill them in.
NAnum = 0
NArows = c()
st = activity$steps
len = length(st)
# Walk over every row, counting the NAs and remembering their positions
for (i in 1:len) {
  if (is.na(st[i])) {
    NAnum = NAnum + 1
    NArows = c(NArows, i)
  }
}
The number of NAs in our dataset is stored in NAnum. As a share of all rows, that is round(NAnum / len * 100, 2) percent, roughly 13 percent.
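The same count can also be obtained without a loop. A minimal sketch using base R's is.na() and which() on a made-up vector (st_demo is hypothetical toy data, not the real steps column):

```r
# Hypothetical steps vector with a few missing entries
st_demo <- c(NA, 3, NA, 7, 0, NA, 12, 5)

# Positions of the missing values
na_rows <- which(is.na(st_demo))

# Number of missing values
na_num <- sum(is.na(st_demo))

# Share of missing values in percent: the mean of a logical vector is the fraction of TRUEs
na_pct <- round(mean(is.na(st_demo)) * 100, 2)

print(na_rows)
print(na_num)
print(na_pct)
```

Besides being shorter, this avoids growing NArows one element at a time, which gets slow on long vectors.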
NA filling strategy
The first thing that comes to mind is to reuse our previous table, the average steps per interval, and put the corresponding average in place of each NA. The code for this part is rather tedious, so feel free to skip it.
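As a compact illustration of this strategy, here is a minimal vectorised sketch using base R's ave(). The data frame demo and its values are hypothetical toy data, and this is an alternative formulation, not the code used in this report:

```r
# Hypothetical toy data: two intervals (0 and 5) observed on three days
demo <- data.frame(
  steps    = c(NA, 4, 2, 8, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# For every row, the mean of the observed steps in that row's interval
interval_mean <- ave(demo$steps, demo$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval mean
missing <- is.na(demo$steps)
demo$steps[missing] <- interval_mean[missing]

print(demo)
```

The row-by-row loop in this report applies the same rule, looking up each missing row's interval in actd.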
for (i in NArows) {
  # Look up the average for this row's interval and use it as the replacement
  replacer = as.numeric(actd[actd$interval == activity$interval[i], 2])
  activity$steps[i] = replacer
}
NA filled difference check
Here we want to check how filling in the NA cells changes our daily plot and, of course, the median and mean.
Let’s jump to the plot. Since the code is copied from the first plot, I set echo = F for that chunk.
The new plot is denser than before, but nothing special happened in this new steps/date plot.
Now let’s see how the new daily steps histogram differs.
In this new plot, as we could guess, the daily step sums are more concentrated around the mean, since we replaced about 13 percent of the values (the NAs) with averages.
New median and mean
##     median     mean
## 1 10765.00 10766.19
## 2 10766.19 10766.19
These two rows (row 1 before filling, row 2 after) are almost identical, so we can say that our filling method was practical, since it barely changed the overall shape of the data.
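One possible explanation for the unchanged mean, assuming the missing values cover whole days: a day filled entirely with interval averages gets a total equal to the average daily total, which leaves the mean where it was and can pull the median onto it. A minimal sketch on hypothetical toy data (toy, slot_mean and the values are made up for illustration):

```r
# Hypothetical toy data: three days, two intervals each; day d3 is entirely missing
toy <- data.frame(
  day      = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 30, 20, 60, NA, NA)
)

# Mean of the observed steps per interval, aligned with the rows
slot_mean <- ave(toy$steps, toy$interval,
                 FUN = function(x) mean(x, na.rm = TRUE))

# Fill the missing entries with the interval means
filled <- ifelse(is.na(toy$steps), slot_mean, toy$steps)

# Daily totals before (d3 stays NA) and after filling
before <- tapply(toy$steps, toy$day, sum)
after  <- tapply(filled, toy$day, sum)

print(mean(before, na.rm = TRUE))
print(mean(after))
print(median(after))
```

In this toy case the filled day's total equals the mean of the other days, so the mean is the same before and after filling.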
Weekend and working days pattern
In the first step we split the data into two parts using the date column of the dataset.
dact <- activity
dact$date <- as.Date(dact$date, format = "%Y-%m-%d")
# TRUE when the date falls on a Saturday or Sunday
# (weekdays() returns localized names, so this assumes an English locale)
iwe <- function(date) {
  weekdays(date) %in% c("Saturday", "Sunday")
}
isWeekend <- sapply(dact$date, iwe)
dact <- cbind(dact, isWeekend)
# Average steps per interval, separately for weekends and working days
dact <- dact %>% group_by(interval, isWeekend) %>% summarise(steps = mean(steps))
head(dact)
## # A tibble: 6 x 3
## # Groups:   interval [3]
##   interval isWeekend  steps
##      <int> <lgl>      <dbl>
## 1        0 FALSE     2.25
## 2        0 TRUE      0.215
## 3        5 FALSE     0.445
## 4        5 TRUE      0.0425
## 5       10 FALSE     0.173
## 6       10 TRUE      0.0165
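The weekend flag above compares against English day names, and weekdays() returns names in the current locale. A locale-independent sketch, offered as an alternative rather than the method used here, reads the numeric weekday from as.POSIXlt() (demo_dates is a hypothetical example vector):

```r
# Hypothetical example dates: a Monday, a Saturday and a Sunday
demo_dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# wday counts days since Sunday: 0 = Sunday, 6 = Saturday
is_weekend <- as.POSIXlt(demo_dates)$wday %in% c(0, 6)

print(is_weekend)
```

Because the comparison is numeric, the result does not change when the session runs under a non-English locale.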
By now we can calculate and see the difference between the average number of steps per 5-minute interval on weekends and on working days.
# Working days are the rows where isWeekend is FALSE; weekends are the rows where it is TRUE
avgWd <- mean(dact[dact$isWeekend == FALSE, 3]$steps)
avgWe <- mean(dact[dact$isWeekend == TRUE, 3]$steps)
avgtable <- data.frame(WorkingDays = avgWd, Weekends = avgWe)
avgtable
##   WorkingDays Weekends
## 1    35.61058  42.3664
Let’s see how this difference changes the shape of our plots:
# Map the logical facet values to readable panel labels
isWeekend.labs <- c("Working Day", "Weekend")
names(isWeekend.labs) <- c("FALSE", "TRUE")
g <- ggplot(data = dact, aes(x = interval, y = steps)) + geom_line() +
  facet_grid(. ~ isWeekend, labeller = labeller(isWeekend = isWeekend.labs))
g
So far, all questions are answered.