This report makes use of data from a personal activity monitoring device. The device records the number of steps taken in each 5-minute interval throughout the day. The dataset covers two months (October and November 2012) of readings from a single anonymous individual.
The data and packages were loaded into R.
library(ggplot2)
library(Hmisc)
data<-read.csv('C:\\Users\\Alina\\OneDrive\\R\\activity.csv')
The dates within the data were converted to ‘date’ format for further manipulation.
data$date<-as.Date(data$date)
The data were preprocessed to drop rows containing NA values, keeping only the complete entries. The total number of steps taken per day was then computed with the ‘aggregate’ function by summing the readings within each day.
mydata<-data[(complete.cases(data)),]
dailysteps<-aggregate(x=mydata['steps'], FUN=sum, by=list(day=mydata$date))
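As a minimal sketch of what this step does, the following uses a small made-up data frame (the values and the names `toy`, `toy_complete` and `toy_daily` are hypothetical, not the real readings). It also illustrates that a day whose readings are all missing disappears from the daily totals rather than being counted as zero.

```r
# Toy stand-in for the activity data (hypothetical values)
toy <- data.frame(
  date     = as.Date(c("2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02",
                       "2012-10-03", "2012-10-03")),
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(NA, 4, 10, 6, NA, NA)
)

# Keep only complete rows, then sum the steps within each day
toy_complete <- toy[complete.cases(toy), ]
toy_daily <- aggregate(steps ~ date, data = toy_complete, FUN = sum)

# 2012-10-03 has no complete rows, so it is absent from the result
print(toy_daily)
```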
A histogram of the total number of steps taken each day was plotted.
hist(dailysteps$steps)
Functions within R were used to calculate the mean and median number of steps taken by the subject.
mean1<-mean(dailysteps$steps)
median1<-median(dailysteps$steps)
The mean number of steps taken per day is 10766.19 (1.0766189 × 10^4) and the median number of steps taken per day is 10765.
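When a value like this is inserted into the text programmatically, R may switch to scientific notation. A minimal sketch of two ways to force fixed notation is shown here; `mean_steps` is a hypothetical stand-in for `mean1`, set to the value reported above.

```r
# Hypothetical stand-in for mean1, using the value reported in the text
mean_steps <- 10766.189

# Option 1: sprintf() with a fixed number of decimals
fixed_sprintf <- sprintf("%.2f", mean_steps)

# Option 2: format() with scientific notation switched off
fixed_format <- format(round(mean_steps, 2), nsmall = 2, scientific = FALSE)

print(fixed_sprintf)
print(fixed_format)
```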
A time series plot was made of the average number of steps taken, averaged across all days, against the 5-minute interval. First the ‘aggregate’ function was used to find the mean number of steps for each interval. The graph was then plotted from the calculated values.
stepsbytime<-aggregate(x=mydata['steps'], FUN=mean, by=list(interval=mydata$interval))
g<-ggplot(stepsbytime, aes(interval, steps)) + geom_line()+
xlab("interval") + ylab("mean steps")
print(g)
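The maximum mean number of steps occurs in the interval with identifier 835. The identifiers encode the time of day as HHMM, so this is the 5-minute interval starting at 08:35, not the 835th minute of the day.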
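The peak interval can be looked up directly from the aggregated means. The sketch below uses a three-row toy table with hypothetical values (`toy_means` is not part of the report) and assumes the HHMM encoding of the interval identifier.

```r
# Toy interval means (hypothetical values)
toy_means <- data.frame(
  interval = c(830, 835, 840),
  steps    = c(177, 206, 196)
)

# Row holding the largest mean number of steps
peak <- toy_means[which.max(toy_means$steps), ]

# Split the HHMM identifier into hours and minutes, e.g. 835 -> "08:35"
peak_time <- sprintf("%02d:%02d",
                     as.integer(peak$interval %/% 100),
                     as.integer(peak$interval %% 100))
print(peak_time)
```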
missing<-sum(is.na(data$steps))
The total number of missing values in the dataset was 2304. To fill in the missing values, the overall mean number of steps per 5-minute interval (taken over all non-missing readings) was used. The ‘impute’ function from the Hmisc package was used to replace each missing value with this mean.
imputeddata<-data
imputeddata$steps <-impute(imputeddata$steps, fun=mean)
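For comparison, here is a base-R sketch of overall-mean imputation next to a per-interval alternative. The toy data and the names `overall`, `interval_mean` and `by_interval` are hypothetical. The per-interval variant is not what the report does; it is shown only to make the difference concrete.

```r
# Toy data (hypothetical): two intervals over three days, day 2 entirely missing
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:2, each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(0, 10, NA, NA, 4, 6)
)

# Overall-mean imputation: every NA gets the mean of all observed readings
overall <- toy$steps
overall[is.na(overall)] <- mean(toy$steps, na.rm = TRUE)

# Alternative: replace each NA with the mean of its own interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
by_interval <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

print(overall)
print(by_interval)
```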
A histogram of the total number of steps taken each day after imputing the values can be made as follows.
dailysteps2<-aggregate(x=imputeddata['steps'], FUN=sum, by=list(day=imputeddata$date))
hist(dailysteps2$steps)
mean2<-mean(dailysteps2$steps)
median2<-median(dailysteps2$steps)
The mean of the total number of steps per day is 10766.19 and the median is also 10766.19 (both 1.0766189 × 10^4).
Subtracting the imputed-data values from the complete-case values gives a difference of 0 in the mean and -1.1886792 in the median. Hence the mean remains the same (since the mean was used for imputing the values) but the median increases slightly, from 10765 to about 10766.19, as a result of imputing.
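A small worked example can show why the mean is preserved while the median moves. It assumes, as a simplification, that missing readings cover whole days and that every day has the same number of intervals. All values and names below are hypothetical.

```r
# Toy data (hypothetical): two intervals per day, day 2 entirely missing
toy <- data.frame(
  date  = rep(as.Date("2012-10-01") + 0:3, each = 2),
  steps = c(4, 6, NA, NA, 5, 6, 8, 10)
)

# Daily totals from complete rows only (the formula interface drops NA rows)
before <- aggregate(steps ~ date, data = toy, FUN = sum)

# Fill every NA with the overall mean reading, then total again
filled <- toy
filled$steps[is.na(filled$steps)] <- mean(toy$steps, na.rm = TRUE)
after <- aggregate(steps ~ date, data = filled, FUN = sum)

# The imputed day totals exactly the mean daily total, so the mean is unchanged
print(c(mean_before = mean(before$steps), mean_after = mean(after$steps)))

# The median is pulled toward the mean
print(c(median_before = median(before$steps), median_after = median(after$steps)))
```

In this toy case the mean stays at 13 while the median moves from 11 to 12, the same pattern as in the report.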
The ‘weekdays’ function was used to determine whether a given date falls on a weekday or a weekend. The result was converted to a factor with two levels, ‘weekend’ and ‘weekday’.
weekdays1 <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
imputeddata$wDay <- factor((weekdays(imputeddata$date) %in% weekdays1),
levels=c(FALSE, TRUE), labels=c('weekend', 'weekday') )
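Because `weekdays()` returns day names in the current locale, matching against English names can fail on a non-English system. A locale-independent sketch, assuming the numeric `wday` field of `as.POSIXlt()` (0 = Sunday through 6 = Saturday), might look like this; `dates`, `wday_num` and `day_type` are illustrative names.

```r
# Four consecutive dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Numeric day of week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday_num <- as.POSIXlt(dates)$wday

# Monday to Friday (1 to 5) count as weekdays
day_type <- factor(wday_num %in% 1:5,
                   levels = c(FALSE, TRUE),
                   labels = c("weekend", "weekday"))
print(data.frame(dates, day_type))
```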
A panel plot was made, containing time series plots of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all weekday days or weekend days (y-axis). The ‘aggregate’ function was used to find the average number of steps for each interval, separately for weekdays and weekends. The variable ‘wDay’, which records whether a day is a weekend or a weekday, was used as the grouping variable.
meanbyweekdays<-aggregate(steps~interval+wDay,data=imputeddata, mean)
g2<-ggplot(meanbyweekdays, aes(interval, steps))+geom_line()+
facet_grid(wDay~.)+
xlab("interval") + ylab("mean steps")
print(g2)