Loading Required Packages:
library(dplyr)
library(tidyr)
library(ggplot2)
We will first extract the CSV file from the zip archive, then load 'activity.csv' into memory with the correct data types.
activity<-read.csv(file=unzip("activity.zip"),colClasses=c("integer","Date","integer"))
First we remove the observations that have NA in steps.
activity_meanStepsDaily<-subset(x = activity,!is.na(activity$steps))
To create the histogram we need to sum the steps for each day.
Summurized.activity<-summarize(group_by(activity_meanStepsDaily,date),Sums=sum(steps))
names(Summurized.activity)<-c("dates","sums") ##Making sure that the column names do not match to a built-in function.
Now we can easily create the histogram.
##Calculating median and the mean
med<-median(Summurized.activity$sums)
mea<-mean(Summurized.activity$sums)
h<-ggplot(Summurized.activity,aes(sums))
h<-h+geom_histogram()+xlab("Steps")+geom_vline(xintercept=med,color="green",size=2) ##adding labels and median
h<-h+geom_vline(xintercept=mea,color="red",linetype="dashed",size=2) ## adding mean
h<-h+ggtitle("Total number of steps taken each day")
h ##showing the plot
The median is 10765 and is shown in green. The mean is 10766.19 and is shown as a dashed red line.
The graphic below shows the average number of steps, across all days, for each 5-minute interval.
activity.intv.summurized<-summarise(group_by(activity_meanStepsDaily,interval),avg=mean(steps))
with(activity.intv.summurized,plot(avg~interval,type="l",ylab="Average number of steps(across all days) ",main="Average number of steps per 5 minutes intervals"))
maxInt<-subset(activity.intv.summurized,avg==max(avg))
axis(1, at = maxInt$interval, las=2,col="red")
abline(v=maxInt$interval,col="red")
As the red line in the graphic shows, interval 835 has the maximum average number of steps across all days, with a value of 206.2.
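The lookup of the peak interval can also be sketched with which.max(), which returns the position of the first maximum. The data frame toy_avg and its values below are invented for illustration and are not taken from the dataset.

```r
# Toy per-interval averages (hypothetical values, not from activity.csv).
toy_avg <- data.frame(
  interval = c(0, 5, 10, 15),
  avg      = c(1.5, 4.2, 9.8, 3.1)
)

# which.max() gives the row index of the largest average;
# indexing with it returns that single row.
peak <- toy_avg[which.max(toy_avg$avg), ]
peak
```

Unlike subset(avg == max(avg)), this always returns exactly one row, so it would silently drop ties if two intervals shared the maximum.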
We will use the split-apply-combine technique. First we need to identify the observations with missing steps values (split).
activity.missing<-subset(activity,is.na(steps)) ##finding
activity.intv.summurized<-mutate(activity.intv.summurized,"rounded"=round(avg)) ##adding a column of rounded means, to be used later when merging datasets
There are 2304 observations that do not report a number of steps, i.e. they have NA as the steps value in the dataset.
Since the presence of missing values may introduce bias into the calculations, we will fill each missing value with the rounded mean of its 5-minute interval across all days. To do so, we look up the mean for each 5-minute interval by merging the two previously calculated datasets (apply).
Afterwards we simply bind the two datasets together: the imputed rows and the rows that never had missing steps (combine).
activity.missing.imputed<-select((merge(x=activity.missing,y=activity.intv.summurized,by="interval")),interval,date,"steps"=rounded)
activity.fixed<-rbind(activity.missing.imputed,activity_meanStepsDaily) #combining datasets
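The same fill can be sketched without merge() and rbind(), using base R's ave() to compute the per-interval mean in place. This is an alternative shown on invented toy data, not the method used in this analysis; the names toy, interval_mean and steps_filled are hypothetical.

```r
# Toy data standing in for activity.csv (values invented for illustration).
toy <- data.frame(
  steps    = c(10, NA, 30, NA, 5, 7),
  interval = c(0, 5, 0, 0, 5, 5)
)

# Mean of each interval computed from the non-missing rows only,
# repeated so that it lines up with every row of `toy`.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with the rounded interval mean.
toy$steps_filled <- ifelse(is.na(toy$steps), round(interval_mean), toy$steps)
toy
```

One difference worth noting: this keeps the rows in their original order, whereas merge() followed by rbind() places the imputed rows first.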
Now that we have replaced the missing values with the means of their 5-minute intervals, we can create the histogram again.
activity.fixed.summurized<-summarise(group_by(activity.fixed,date),sumSteps=sum(steps))
medFixed<-median(activity.fixed.summurized$sumSteps)
meaFixed<-mean(activity.fixed.summurized$sumSteps)
ht<-ggplot(activity.fixed.summurized,aes(sumSteps))
ht<-ht+geom_histogram()+xlab("Steps")+geom_vline(xintercept=medFixed,color="green",size=2) ##adding labels and median (builds on ht, not the earlier plot h)
ht<-ht+geom_vline(xintercept=meaFixed,color="red",linetype="dashed",size=2) ## adding mean
ht<-ht+ggtitle("Total number of steps taken each day")
ht
The new median is 10762 and is shown in green. The new mean is 10765.64 and is shown as a dashed red line.
Comparing the new values with the previous ones shows that the median changed by 3 unit(s) and the mean by 0.549335 unit(s). We conclude that our method of imputing the missing values had no significant effect on the distribution of the data, since the median and mean have barely changed.
Here we need to add a column that specifies whether a given date is a weekday or a weekend day. We will use our NA-imputed dataset, since we showed above that the filled-in values did not change the characteristics of the distribution.
activity.fixed.dayType<-mutate(activity.fixed,dayType=as.factor(ifelse (grepl('^S', weekdays(date)), "weekend", "weekday")))
activity.fixed.dayType.summ<-summarise(group_by(activity.fixed.dayType,interval,dayType),avgSteps=mean(steps))
p<-qplot(interval, avgSteps, data=activity.fixed.dayType.summ, geom="line")+facet_grid(dayType ~ .)
p<-p+ylab("Number of steps")+xlab("interval")
p
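The weekend test above relies on weekdays() returning English day names that start with "S", which depends on the session locale. A minimal locale-independent sketch, assuming only base R, uses the numeric day of the week from as.POSIXlt() instead. The dates below are illustrative and the names dates, wday and day_type are hypothetical.

```r
# Illustrative dates: a Friday, Saturday, Sunday and Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday,
# independent of the language used for day names.
wday <- as.POSIXlt(dates)$wday

# Saturday (6) and Sunday (0) are weekend days; everything else is a weekday.
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"))
day_type
```

The resulting factor could be used in mutate() in place of the grepl() expression.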