Activity monitoring devices like Fitbit, Jawbone Up, Nike Fuelband are quite popular among health conscious individuals and tech geeks. These devices provides thier user a quantified output their movements and activities. Use cases range from Sleep Monitoring to No. of steps a person walked on a given period of time.
This analysis is done from the data collected from one such device. The device had collected data at 5 minutes interval throughout the day over a period of 2 months.
The report consummates 8 different operations and answers few key questions
Data is read from a csv file and transformed into table data frame. Using ‘mutate’ function date in string format is converted toe Time(POSIXct) format
data <- read.table("activity.csv", header=T, sep=',', na.strings="NA",
stringsAsFactors = FALSE)
# Convert the data to table data and remove from the memory
new_data <- tbl_df(data)
rm(data)
# Mutate so that the format of Date and Time(POSIXct) column is no longer a string
new_data <- mutate(new_data, date = as.Date(date, format="%Y-%m-%d"))
Mean and Median total number of steps taken per day is calculated using ‘ddply’ function
calcStepsPerDay <- function(data) {
stepsPerDay <- ddply(
data,
.(date),
function(x) {
sum(x$steps, na.rm = T)
}
)
return(stepsPerDay)
}
stepsPerDay <- calcStepsPerDay(new_data)
plot(x = stepsPerDay$date, y = stepsPerDay$V1, type= "h", main = "Total No of Steps Taken Each Day",
xlab = "Days", ylab = "Steps(total)")
mean <- mean(stepsPerDay$V1)
median <- median(stepsPerDay$V1)
Mean and Median are 9354.2295082 and 10395 respectively
A time series plot is made across 5 minute interval throughout the day and the average number of steps taken.
average <- ddply(
new_data,
.(interval),
function(x) {
mean(x$steps, na.rm = T)
}
)
plot(x = average$interval, y = average$V1, type = "l", main = "Average No. Of Steps Taken(Time Series)",
xlab = "5 Mins Interval", ylab = "Steps(Average)")
maxInterval <- average[which.max(average$V1),]
The average steps taken is calculated across 2 months of days for that particular 5 minute interval
It has been noticed that the 835th 5 minute interval contains the maximum number steps across all days.
While reading the data from data source, cells with no values are replaced with ‘NA’ this section establishes a strategy for filling the missing data.
Missing variables are replaced with the population mean of the given interval.
na_rows <- complete.cases(new_data)
noOfMissingRows <- 0
for(i in 1 : nrow(new_data)) {
if(na_rows[i] == FALSE) {
index <- match(new_data[i, "interval"], average$interval)
new_data[i, "steps"] <- average[index, "V1"]
noOfMissingRows <- noOfMissingRows + 1
}
}
stepsPerDay <- calcStepsPerDay(new_data)
plot(x = stepsPerDay$date, y = stepsPerDay$V1, type= "h", main = "Total No of Steps Taken Each Day After Imputing", xlab = "Days", ylab = "Steps(total)")
mean <- mean(stepsPerDay$V1)
median <- median(stepsPerDay$V1)
In 2304 rows data is not recorded.
New Mean and Median are 1.076618910^{4} and 1.076618910^{4} respectively. These values are quite different from the actual values. Impact of imputing resulted in same median and mean.
days <- c("Saturday", "Sunday")
weekdays <- filter(new_data, is.na(match(weekdays(date), days)))
weekends <- filter(new_data, !is.na(match(weekdays(date), days)))
aveWeekdayInterval <- ddply(
weekdays,
.(interval),
function(x) {
mean(x$steps, na.rm = T)
}
)
aveWeekendInterval <- ddply(
weekends,
.(interval),
function(x) {
mean(x$steps, na.rm = T)
}
)
p1 <- ggplot(aveWeekdayInterval, aes(interval, V1)) + geom_line() + ggtitle("Average number of steps taken across weekdays and weekends") + xlab("Interval(Weekdays)") + ylab("Average Steps")
p2 <- ggplot(aveWeekendInterval, aes(interval, V1)) + xlab ("Interval(Weekends)") + ylab("Average Steps") + geom_line()
gl = lapply(list(p1,p2), ggplotGrob)
g = do.call(rbind, c(gl, size="first"))
g$widths = do.call(unit.pmax, lapply(gl, "[[", "widths"))
grid.newpage()
grid.draw(g)
The above two plots compares the weekdays vs weekend activities.