0.1 Summary

The goal of this analysis project is to elucidate trends in US adult’s activity levels using pedometer data taken from Coursera/Johns Hopkins website compared against data published at the National Institute of Health. The data presented in this analysis depicts the number of steps taken in five minute intervals daily between October 1,2012 and November 30,2012.

0.2 Gathering and Cleaning the Data

0.2.1 The data

The zip file was downloaded and the structure of the data was examined:

library(data.table)
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip","project1.zip")
unzip("project1.zip",list = T)
##           Name Length                Date
## 1 activity.csv 350829 2014-02-11 10:08:00
unzip("project1.zip");unlink("project1.zip")
activity.data <- fread("activity.csv")
str(activity.data)
## Classes 'data.table' and 'data.frame':   17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(activity.data)
##      steps            date              interval     
##  Min.   :  0.00   Length:17568       Min.   :   0.0  
##  1st Qu.:  0.00   Class :character   1st Qu.: 588.8  
##  Median :  0.00   Mode  :character   Median :1177.5  
##  Mean   : 37.38                      Mean   :1177.5  
##  3rd Qu.: 12.00                      3rd Qu.:1766.2  
##  Max.   :806.00                      Max.   :2355.0  
##  NA's   :2304

A glimpse of the data shows that roughly 13% is missing, therefore, imputation is necessary. Next, the date column gets transformed to YMD format to aid in the analysis. First, two new features were added. The first shows the calories burned based on the number of steps taken from today.com and the second creates a new variable for the weekday and the weekend:

activity.data$date <- as.Date(activity.data$date, format= "%Y-%m- %d") # convert to YMD
activity.data$calorie_burned <- activity.data$steps *( 250/10000) # 250 cal = 10000 steps per 

activity.data_impute <- caret::preProcess(activity.data,method="bagImpute") # impute missing values
activity.data_impute <- predict(activity.data_impute,activity.data)
summary(activity.data_impute)
##      steps              date               interval      calorie_burned   
##  Min.   :  0.000   Min.   :2012-10-01   Min.   :   0.0   Min.   : 0.0000  
##  1st Qu.:  0.000   1st Qu.:2012-10-16   1st Qu.: 588.8   1st Qu.: 0.0000  
##  Median :  0.000   Median :2012-10-31   Median :1177.5   Median : 0.0000  
##  Mean   : 33.047   Mean   :2012-10-31   Mean   :1177.5   Mean   : 0.8261  
##  3rd Qu.:  4.327   3rd Qu.:2012-11-15   3rd Qu.:1766.2   3rd Qu.: 0.1075  
##  Max.   :806.000   Max.   :2012-11-30   Max.   :2355.0   Max.   :20.1500

0.3 Exploratory Data Analysis

0.3.1 Activity Pattern

A quick comparison was done to compare the original data with the imputed data.

head(activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date],20)
##           date mean_steps total_steps
##  1: 2012-10-01        NaN           0
##  2: 2012-10-02    0.43750         126
##  3: 2012-10-03   39.41667       11352
##  4: 2012-10-04   42.06944       12116
##  5: 2012-10-05   46.15972       13294
##  6: 2012-10-06   53.54167       15420
##  7: 2012-10-07   38.24653       11015
##  8: 2012-10-08        NaN           0
##  9: 2012-10-09   44.48264       12811
## 10: 2012-10-10   34.37500        9900
## 11: 2012-10-11   35.77778       10304
## 12: 2012-10-12   60.35417       17382
## 13: 2012-10-13   43.14583       12426
## 14: 2012-10-14   52.42361       15098
## 15: 2012-10-15   35.20486       10139
## 16: 2012-10-16   52.37500       15084
## 17: 2012-10-17   46.70833       13452
## 18: 2012-10-18   34.91667       10056
## 19: 2012-10-19   41.07292       11829
## 20: 2012-10-20   36.09375       10395
head(activity.data_impute[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date],20)
##           date mean_steps total_steps
##  1: 2012-10-01   4.326523    1246.039
##  2: 2012-10-02   0.437500     126.000
##  3: 2012-10-03  39.416667   11352.000
##  4: 2012-10-04  42.069444   12116.000
##  5: 2012-10-05  46.159722   13294.000
##  6: 2012-10-06  53.541667   15420.000
##  7: 2012-10-07  38.246528   11015.000
##  8: 2012-10-08   4.326523    1246.039
##  9: 2012-10-09  44.482639   12811.000
## 10: 2012-10-10  34.375000    9900.000
## 11: 2012-10-11  35.777778   10304.000
## 12: 2012-10-12  60.354167   17382.000
## 13: 2012-10-13  43.145833   12426.000
## 14: 2012-10-14  52.423611   15098.000
## 15: 2012-10-15  35.204861   10139.000
## 16: 2012-10-16  52.375000   15084.000
## 17: 2012-10-17  46.708333   13452.000
## 18: 2012-10-18  34.916667   10056.000
## 19: 2012-10-19  41.072917   11829.000
## 20: 2012-10-20  36.093750   10395.000
lattice::histogram(total_steps~date, activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][total_steps >0], breaks = 6, main = "Percent of Steps Taken",na.rm=T)

lattice::densityplot( activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][total_steps >0,total_steps], main = "Density of Steps Taken",na.rm=T,xlab="Number of Steps Taken")

lattice::histogram(total_steps~date, activity.data_impute[,.(mean_steps=mean(steps),total_steps=sum(steps)),by=date], breaks = 6, main = "Percent of Steps Taken (Imputed)")

lattice::densityplot( activity.data_impute[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][,total_steps],  main = "Density of Steps Taken (Imputed)",na.rm=T,xlab="Number of Steps Taken")

From the original data-set without missing values imputed, the data suggest that the range from October 15 - November 1 and November 15 - November 30 the participant was the most active. In addition, the density plot suggest the participant is active with more than 10,000 steps taken. Moreover, the imputed data suggest that the participant is constantly active after October 1 through November 30. Not surprisingly, the density plot indicates sedentary and/or low activity. The next step is to investigate the daily activity pattern for the participant:

activity.data[,.(total_steps=sum(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_steps=sum(total_steps)),by=c("day","month")][order(total_steps,decreasing = T)]
##           day month total_steps
##  1: Wednesday    10       58473
##  2:    Friday    10       49283
##  3:  Saturday    10       48360
##  4:   Tuesday    10       46758
##  5:    Sunday    10       46392
##  6:    Monday    11       41207
##  7:    Sunday    11       39552
##  8:  Saturday    11       39388
##  9:    Friday    11       37235
## 10: Wednesday    11       35853
## 11:  Thursday    10       34968
## 12:   Tuesday    11       33788
## 13:  Thursday    11       30734
## 14:    Monday    10       28617
activity.data_impute[,.(total_steps=sum(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_steps=sum(total_steps)),by=c("day","month")][order(total_steps,decreasing = T)]
##           day month total_steps
##  1: Wednesday    10    58473.00
##  2:    Friday    10    49283.00
##  3:  Saturday    10    48360.00
##  4:   Tuesday    10    46758.00
##  5:    Sunday    10    46392.00
##  6:    Monday    11    41207.00
##  7:    Sunday    11    40798.04
##  8:  Saturday    11    40634.04
##  9:    Friday    11    39727.08
## 10: Wednesday    11    37099.04
## 11:  Thursday    10    34968.00
## 12:   Tuesday    11    33788.00
## 13:  Thursday    11    31980.04
## 14:    Monday    10    31109.08

The above tables elucidate the participant’s activity pattern. Both tables, original and imputed, suggest for the month of October, Wednesdays, Fridays, and Saturdays was when the participant was the most active. Whereas, in November the participant was the most active on Mondays, Sundays, and Saturdays. Lastly, a check on the distribution of the means also tells the same story:

activity.data_impute[,.(avg_steps=mean(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(avg_steps=mean(avg_steps)),by=c("day","month")][order(avg_steps,decreasing = T)]
##           day month avg_steps
##  1:    Friday    10  42.78038
##  2:  Saturday    10  41.97917
##  3: Wednesday    10  40.60625
##  4:    Sunday    10  40.27083
##  5:    Monday    11  35.76997
##  6:    Sunday    11  35.41496
##  7:  Saturday    11  35.27260
##  8:   Tuesday    10  32.47083
##  9: Wednesday    11  32.20403
## 10:  Thursday    10  30.35417
## 11:   Tuesday    11  29.32986
## 12:    Friday    11  27.58825
## 13:  Thursday    11  22.20836
## 14:    Monday    10  21.60353

0.3.2 Per Interval

A detailed investigation of the participants daily activity was ascertained from aggregating their average steps per interval.

activity.data[,.(mean_steps_per_interval=mean(steps,na.rm=T),time_hr=interval*(1/60)),by=interval][order(mean_steps_per_interval,decreasing = T)]
##      interval mean_steps_per_interval   time_hr
##   1:      835                206.1698 13.916667
##   2:      840                195.9245 14.000000
##   3:      850                183.3962 14.166667
##   4:      845                179.5660 14.083333
##   5:      830                177.3019 13.833333
##  ---                                           
## 284:      350                  0.0000  5.833333
## 285:      355                  0.0000  5.916667
## 286:      415                  0.0000  6.916667
## 287:      500                  0.0000  8.333333
## 288:     2310                  0.0000 38.500000
lattice::xyplot(mean_steps_per_interval ~ interval, data=activity.data[,.(mean_steps_per_interval=mean(steps,na.rm=T)),by=interval], type= "l", xlab="5 Min Intervals", ylab="Number of Steps", main="Average Daily Activity Pattern")

The table and time series show that the participant achieves the highest average of steps during intervals 835,840,850,845,and 830, which correspond to hours 13.833 through 14.17.

weekday_or_end <- function(date) {
    day <- weekdays(date)
    if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) 
        return("weekday") else if (day %in% c("Saturday", "Sunday")) 
        return("weekend") else stop("invalid date")
}

activity.data_impute$date <- as.Date(activity.data_impute$date)
activity.data_impute$day <- sapply(activity.data_impute$date,weekday_or_end)
lattice::xyplot(mean_steps ~ interval| as.factor(day), data=activity.data_impute[,.(mean_steps=mean(steps)),by=c("interval","day")], type="l", ylab ="Number of Steps")

activity.data_impute[,.(avg_steps=mean(steps,na.rm = T),day=weekdays(date),weekday_weekend=day,month=month(date)),by=date][,.(avg_steps=mean(avg_steps)),by=c("weekday_weekend","month")][order(avg_steps,decreasing = T)]
##    weekday_weekend month avg_steps
## 1:         weekend    10  41.12500
## 2:         weekend    11  35.34378
## 3:         weekday    10  33.30179
## 4:         weekday    11  29.00902

Based on the analysis of the steps taken, the participant was most active during the weekend.

0.3.3 Calories Burned

activity.data_impute[,.(day=weekdays(date),avg_calorie_burned=mean(calorie_burned,na.rm = T)),by=date]
##           date       day avg_calorie_burned
##  1: 2012-10-01    Monday        0.107535441
##  2: 2012-10-02   Tuesday        0.010937500
##  3: 2012-10-03 Wednesday        0.985416667
##  4: 2012-10-04  Thursday        1.051736111
##  5: 2012-10-05    Friday        1.153993056
##  6: 2012-10-06  Saturday        1.338541667
##  7: 2012-10-07    Sunday        0.956163194
##  8: 2012-10-08    Monday        0.107535441
##  9: 2012-10-09   Tuesday        1.112065972
## 10: 2012-10-10 Wednesday        0.859375000
## 11: 2012-10-11  Thursday        0.894444444
## 12: 2012-10-12    Friday        1.508854167
## 13: 2012-10-13  Saturday        1.078645833
## 14: 2012-10-14    Sunday        1.310590278
## 15: 2012-10-15    Monday        0.880121528
## 16: 2012-10-16   Tuesday        1.309375000
## 17: 2012-10-17 Wednesday        1.167708333
## 18: 2012-10-18  Thursday        0.872916667
## 19: 2012-10-19    Friday        1.026822917
## 20: 2012-10-20  Saturday        0.902343750
## 21: 2012-10-21    Sunday        0.765711806
## 22: 2012-10-22    Monday        1.168402778
## 23: 2012-10-23   Tuesday        0.774131944
## 24: 2012-10-24 Wednesday        0.725260417
## 25: 2012-10-25  Thursday        0.216319444
## 26: 2012-10-26    Friday        0.588368056
## 27: 2012-10-27  Saturday        0.878385417
## 28: 2012-10-28    Sunday        0.994618056
## 29: 2012-10-29    Monday        0.435590278
## 30: 2012-10-30   Tuesday        0.852343750
## 31: 2012-10-31 Wednesday        1.338020833
## 32: 2012-11-01  Thursday        0.107535441
## 33: 2012-11-02    Friday        0.920138889
## 34: 2012-11-03  Saturday        0.917621528
## 35: 2012-11-04    Sunday        0.107535441
## 36: 2012-11-05    Monday        0.906163194
## 37: 2012-11-06   Tuesday        0.723437500
## 38: 2012-11-07 Wednesday        1.118315972
## 39: 2012-11-08  Thursday        0.279427083
## 40: 2012-11-09    Friday        0.107535441
## 41: 2012-11-10  Saturday        0.107535441
## 42: 2012-11-11    Sunday        1.094444444
## 43: 2012-11-12    Monday        0.934461806
## 44: 2012-11-13   Tuesday        0.636805556
## 45: 2012-11-14 Wednesday        0.107535441
## 46: 2012-11-15  Thursday        0.003559028
## 47: 2012-11-16    Friday        0.472309028
## 48: 2012-11-17  Saturday        1.244704861
## 49: 2012-11-18    Sunday        1.311631944
## 50: 2012-11-19    Monday        0.767447917
## 51: 2012-11-20   Tuesday        0.388194444
## 52: 2012-11-21 Wednesday        1.109982639
## 53: 2012-11-22  Thursday        1.773177083
## 54: 2012-11-23    Friday        1.839756944
## 55: 2012-11-24  Saturday        1.256770833
## 56: 2012-11-25    Sunday        1.027256944
## 57: 2012-11-26    Monday        0.968923611
## 58: 2012-11-27   Tuesday        1.184548611
## 59: 2012-11-28 Wednesday        0.883940972
## 60: 2012-11-29  Thursday        0.611718750
## 61: 2012-11-30    Friday        0.107535441
##           date       day avg_calorie_burned
activity.data_impute[,.(avg_calorie_burned=mean(calorie_burned,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(avg_calorie_burned=mean(avg_calorie_burned)),by=c("day","month")][order(avg_calorie_burned,decreasing = T)]
##           day month avg_calorie_burned
##  1:    Friday    10          1.0695095
##  2:  Saturday    10          1.0494792
##  3: Wednesday    10          1.0151562
##  4:    Sunday    10          1.0067708
##  5:    Monday    11          0.8942491
##  6:    Sunday    11          0.8852172
##  7:  Saturday    11          0.8816582
##  8:   Tuesday    10          0.8117708
##  9: Wednesday    11          0.8049438
## 10:  Thursday    10          0.7588542
## 11:   Tuesday    11          0.7332465
## 12:    Friday    11          0.6894551
## 13:  Thursday    11          0.5550835
## 14:    Monday    10          0.5398371

When calculating the calories burned from the participants steps tracked, their mean activity level appears low and can be classified as sedentary.

activity.data_impute[,.(day=weekdays(date),total_calorie_burned=sum(calorie_burned,na.rm = T)),by=date]
##           date       day total_calorie_burned
##  1: 2012-10-01    Monday             30.97021
##  2: 2012-10-02   Tuesday              3.15000
##  3: 2012-10-03 Wednesday            283.80000
##  4: 2012-10-04  Thursday            302.90000
##  5: 2012-10-05    Friday            332.35000
##  6: 2012-10-06  Saturday            385.50000
##  7: 2012-10-07    Sunday            275.37500
##  8: 2012-10-08    Monday             30.97021
##  9: 2012-10-09   Tuesday            320.27500
## 10: 2012-10-10 Wednesday            247.50000
## 11: 2012-10-11  Thursday            257.60000
## 12: 2012-10-12    Friday            434.55000
## 13: 2012-10-13  Saturday            310.65000
## 14: 2012-10-14    Sunday            377.45000
## 15: 2012-10-15    Monday            253.47500
## 16: 2012-10-16   Tuesday            377.10000
## 17: 2012-10-17 Wednesday            336.30000
## 18: 2012-10-18  Thursday            251.40000
## 19: 2012-10-19    Friday            295.72500
## 20: 2012-10-20  Saturday            259.87500
## 21: 2012-10-21    Sunday            220.52500
## 22: 2012-10-22    Monday            336.50000
## 23: 2012-10-23   Tuesday            222.95000
## 24: 2012-10-24 Wednesday            208.87500
## 25: 2012-10-25  Thursday             62.30000
## 26: 2012-10-26    Friday            169.45000
## 27: 2012-10-27  Saturday            252.97500
## 28: 2012-10-28    Sunday            286.45000
## 29: 2012-10-29    Monday            125.45000
## 30: 2012-10-30   Tuesday            245.47500
## 31: 2012-10-31 Wednesday            385.35000
## 32: 2012-11-01  Thursday             30.97021
## 33: 2012-11-02    Friday            265.00000
## 34: 2012-11-03  Saturday            264.27500
## 35: 2012-11-04    Sunday             30.97021
## 36: 2012-11-05    Monday            260.97500
## 37: 2012-11-06   Tuesday            208.35000
## 38: 2012-11-07 Wednesday            322.07500
## 39: 2012-11-08  Thursday             80.47500
## 40: 2012-11-09    Friday             30.97021
## 41: 2012-11-10  Saturday             30.97021
## 42: 2012-11-11    Sunday            315.20000
## 43: 2012-11-12    Monday            269.12500
## 44: 2012-11-13   Tuesday            183.40000
## 45: 2012-11-14 Wednesday             30.97021
## 46: 2012-11-15  Thursday              1.02500
## 47: 2012-11-16    Friday            136.02500
## 48: 2012-11-17  Saturday            358.47500
## 49: 2012-11-18    Sunday            377.75000
## 50: 2012-11-19    Monday            221.02500
## 51: 2012-11-20   Tuesday            111.80000
## 52: 2012-11-21 Wednesday            319.67500
## 53: 2012-11-22  Thursday            510.67500
## 54: 2012-11-23    Friday            529.85000
## 55: 2012-11-24  Saturday            361.95000
## 56: 2012-11-25    Sunday            295.85000
## 57: 2012-11-26    Monday            279.05000
## 58: 2012-11-27   Tuesday            341.15000
## 59: 2012-11-28 Wednesday            254.57500
## 60: 2012-11-29  Thursday            176.17500
## 61: 2012-11-30    Friday             30.97021
##           date       day total_calorie_burned
activity.data_impute[,.(total_calorie_burned=sum(calorie_burned,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_calorie_burned=sum(total_calorie_burned)),by=c("day","month")][order(total_calorie_burned,decreasing = T)]
##           day month total_calorie_burned
##  1: Wednesday    10            1461.8250
##  2:    Friday    10            1232.0750
##  3:  Saturday    10            1209.0000
##  4:   Tuesday    10            1168.9500
##  5:    Sunday    10            1159.8000
##  6:    Monday    11            1030.1750
##  7:    Sunday    11            1019.7702
##  8:  Saturday    11            1015.6702
##  9:    Friday    11             992.8154
## 10: Wednesday    11             927.2952
## 11:  Thursday    10             874.2000
## 12:   Tuesday    11             844.7000
## 13:  Thursday    11             799.3202
## 14:    Monday    10             777.3654
activity.data_impute[,.(total_calorie_burned=sum(calorie_burned,na.rm = T)),by=month(date)]
##    month total_calorie_burned
## 1:    10             7883.215
## 2:    11             6629.746

If we compare the participants total calorie count, October showed the most activity.In particular, on Wednesdays in October, their total calorie count is equivalent to burning off a Grand Mac, fries, and a soft drink at McDonald.

0.4 Conclusion

The analysis of the pedometer data elucidates the activity patterns and health behaviors of an individual. When comparing the results from David R. Bassett, JR., Holly R. Wyatt, etc., it is apparent that the individual would not be classified as active and is likely to have a higher body mass index (BMI).