Activity monitoring devices like Fitbit, Jawbone Up, Nike Fuelband are quite popular among health conscious individuals and tech geeks. These devices provides thier user a quantified output their movements and activities. Use cases range from Sleep Monitoring to No. of steps a person walked on a given period of time.

This analysis is done from the data collected from one such device. The device had collected data at 5 minutes interval throughout the day over a period of 2 months.

Synopsis

The report consummates 8 different operations and answers few key questions

Loading and preprocessing the data

Data is read from a csv file and transformed into table data frame. Using ‘mutate’ function date in string format is converted toe Time(POSIXct) format

data <- read.table("activity.csv", header=T, sep=',', na.strings="NA", 
                   stringsAsFactors = FALSE)
# Convert the data to table data and remove from the memory
new_data <- tbl_df(data)
rm(data)


# Mutate so that the format of Date and Time(POSIXct) column is no longer a string
new_data <- mutate(new_data, date = as.Date(date, format="%Y-%m-%d"))

What is mean total number of steps taken per day?

Mean and Median total number of steps taken per day is calculated using ‘ddply’ function

calcStepsPerDay <- function(data) {
    stepsPerDay <- ddply(
        data, 
        .(date), 
        function(x) {  
            sum(x$steps, na.rm = T)
        }
    )
    return(stepsPerDay)
}
stepsPerDay <- calcStepsPerDay(new_data)
plot(x = stepsPerDay$date, y = stepsPerDay$V1, type= "h", main = "Total No of Steps Taken Each Day",
     xlab = "Days", ylab = "Steps(total)")

mean <- mean(stepsPerDay$V1)
median <- median(stepsPerDay$V1)

Mean and Median are 9354.2295082 and 10395 respectively

What is the average daily activity pattern?

A time series plot is made across 5 minute interval throughout the day and the average number of steps taken.

average <- ddply(
    new_data, 
    .(interval), 
    function(x) {  
        mean(x$steps, na.rm = T)
    }
)
plot(x = average$interval, y = average$V1, type = "l", main = "Average No. Of Steps Taken(Time Series)",
     xlab = "5 Mins Interval", ylab = "Steps(Average)")

maxInterval <- average[which.max(average$V1),]

The average steps taken is calculated across 2 months of days for that particular 5 minute interval

It has been noticed that the 835th 5 minute interval contains the maximum number steps across all days.

Imputing missing values

While reading the data from data source, cells with no values are replaced with ‘NA’ this section establishes a strategy for filling the missing data.

Impute Strategy

Missing variables are replaced with the population mean of the given interval.

Steps

  • Identify the rows having NAs using complete cases.
  • Reuse the average data calculated for every 5 minute interval
  • Iterate through all the rows and find the NA rows
  • Replace NAs with calculated averages.
na_rows <- complete.cases(new_data)
noOfMissingRows <- 0
for(i in 1 : nrow(new_data)) {
    if(na_rows[i] == FALSE) {
        index <- match(new_data[i, "interval"], average$interval)
        new_data[i, "steps"] <- average[index, "V1"]
        noOfMissingRows <- noOfMissingRows + 1
    } 
}
stepsPerDay <- calcStepsPerDay(new_data)
plot(x = stepsPerDay$date, y = stepsPerDay$V1, type= "h", main = "Total No of Steps Taken Each Day After Imputing", xlab = "Days", ylab = "Steps(total)")

mean <- mean(stepsPerDay$V1)
median <- median(stepsPerDay$V1)

In 2304 rows data is not recorded.

New Mean and Median are 1.076618910^{4} and 1.076618910^{4} respectively. These values are quite different from the actual values. Impact of imputing resulted in same median and mean.

Are there differences in activity patterns between weekdays and weekends?

days <- c("Saturday", "Sunday")
weekdays <- filter(new_data, is.na(match(weekdays(date), days)))
weekends <- filter(new_data, !is.na(match(weekdays(date), days)))

aveWeekdayInterval <- ddply(
    weekdays, 
    .(interval), 
    function(x) {  
        mean(x$steps, na.rm = T)
    }
)

aveWeekendInterval <- ddply(
    weekends, 
    .(interval), 
    function(x) {  
        mean(x$steps, na.rm = T)
    }
)
p1 <- ggplot(aveWeekdayInterval, aes(interval, V1)) + geom_line() + ggtitle("Average number of steps taken     across weekdays and weekends") + xlab("Interval(Weekdays)") + ylab("Average Steps")
p2 <- ggplot(aveWeekendInterval, aes(interval, V1)) + xlab ("Interval(Weekends)") + ylab("Average Steps") + geom_line()

gl = lapply(list(p1,p2), ggplotGrob)     
g = do.call(rbind, c(gl, size="first"))
g$widths = do.call(unit.pmax, lapply(gl, "[[", "widths"))

grid.newpage()
grid.draw(g) 

The above two plots compares the weekdays vs weekend activities.