The goal of this analysis project is to elucidate trends in US adult’s activity levels using pedometer data taken from Coursera/Johns Hopkins website compared against data published at the National Institute of Health. The data presented in this analysis depicts the number of steps taken in five minute intervals daily between October 1,2012 and November 30,2012.
The zip file was downloaded and the structure of the data was examined:
library(data.table)
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip","project1.zip")
unzip("project1.zip",list = T)
## Name Length Date
## 1 activity.csv 350829 2014-02-11 10:08:00
unzip("project1.zip");unlink("project1.zip")
activity.data <- fread("activity.csv")
str(activity.data)
## Classes 'data.table' and 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(activity.data)
## steps date interval
## Min. : 0.00 Length:17568 Min. : 0.0
## 1st Qu.: 0.00 Class :character 1st Qu.: 588.8
## Median : 0.00 Mode :character Median :1177.5
## Mean : 37.38 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:1766.2
## Max. :806.00 Max. :2355.0
## NA's :2304
A glimpse of the data shows that roughly 13% is missing, therefore, imputation is necessary. Next, the date column gets transformed to YMD format to aid in the analysis. First, two new features were added. The first shows the calories burned based on the number of steps taken from today.com and the second creates a new variable for the weekday and the weekend:
activity.data$date <- as.Date(activity.data$date, format= "%Y-%m- %d") # convert to YMD
activity.data$calorie_burned <- activity.data$steps *( 250/10000) # 250 cal = 10000 steps per
activity.data_impute <- caret::preProcess(activity.data,method="bagImpute") # impute missing values
activity.data_impute <- predict(activity.data_impute,activity.data)
summary(activity.data_impute)
## steps date interval calorie_burned
## Min. : 0.000 Min. :2012-10-01 Min. : 0.0 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.:2012-10-16 1st Qu.: 588.8 1st Qu.: 0.0000
## Median : 0.000 Median :2012-10-31 Median :1177.5 Median : 0.0000
## Mean : 33.047 Mean :2012-10-31 Mean :1177.5 Mean : 0.8261
## 3rd Qu.: 4.327 3rd Qu.:2012-11-15 3rd Qu.:1766.2 3rd Qu.: 0.1075
## Max. :806.000 Max. :2012-11-30 Max. :2355.0 Max. :20.1500
A quick comparison was done to compare the original data with the imputed data.
head(activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date],20)
## date mean_steps total_steps
## 1: 2012-10-01 NaN 0
## 2: 2012-10-02 0.43750 126
## 3: 2012-10-03 39.41667 11352
## 4: 2012-10-04 42.06944 12116
## 5: 2012-10-05 46.15972 13294
## 6: 2012-10-06 53.54167 15420
## 7: 2012-10-07 38.24653 11015
## 8: 2012-10-08 NaN 0
## 9: 2012-10-09 44.48264 12811
## 10: 2012-10-10 34.37500 9900
## 11: 2012-10-11 35.77778 10304
## 12: 2012-10-12 60.35417 17382
## 13: 2012-10-13 43.14583 12426
## 14: 2012-10-14 52.42361 15098
## 15: 2012-10-15 35.20486 10139
## 16: 2012-10-16 52.37500 15084
## 17: 2012-10-17 46.70833 13452
## 18: 2012-10-18 34.91667 10056
## 19: 2012-10-19 41.07292 11829
## 20: 2012-10-20 36.09375 10395
head(activity.data_impute[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date],20)
## date mean_steps total_steps
## 1: 2012-10-01 4.326523 1246.039
## 2: 2012-10-02 0.437500 126.000
## 3: 2012-10-03 39.416667 11352.000
## 4: 2012-10-04 42.069444 12116.000
## 5: 2012-10-05 46.159722 13294.000
## 6: 2012-10-06 53.541667 15420.000
## 7: 2012-10-07 38.246528 11015.000
## 8: 2012-10-08 4.326523 1246.039
## 9: 2012-10-09 44.482639 12811.000
## 10: 2012-10-10 34.375000 9900.000
## 11: 2012-10-11 35.777778 10304.000
## 12: 2012-10-12 60.354167 17382.000
## 13: 2012-10-13 43.145833 12426.000
## 14: 2012-10-14 52.423611 15098.000
## 15: 2012-10-15 35.204861 10139.000
## 16: 2012-10-16 52.375000 15084.000
## 17: 2012-10-17 46.708333 13452.000
## 18: 2012-10-18 34.916667 10056.000
## 19: 2012-10-19 41.072917 11829.000
## 20: 2012-10-20 36.093750 10395.000
lattice::histogram(total_steps~date, activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][total_steps >0], breaks = 6, main = "Percent of Steps Taken",na.rm=T)
lattice::densityplot( activity.data[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][total_steps >0,total_steps], main = "Density of Steps Taken",na.rm=T,xlab="Number of Steps Taken")
lattice::histogram(total_steps~date, activity.data_impute[,.(mean_steps=mean(steps),total_steps=sum(steps)),by=date], breaks = 6, main = "Percent of Steps Taken (Imputed)")
lattice::densityplot( activity.data_impute[,.(mean_steps=mean(steps,na.rm=T),total_steps=sum(steps,na.rm=T)),by=date][,total_steps], main = "Density of Steps Taken (Imputed)",na.rm=T,xlab="Number of Steps Taken")
From the original data-set without missing values imputed, the data suggest that the range from October 15 - November 1 and November 15 - November 30 the participant was the most active. In addition, the density plot suggest the participant is active with more than 10,000 steps taken. Moreover, the imputed data suggest that the participant is constantly active after October 1 through November 30. Not surprisingly, the density plot indicates sedentary and/or low activity. The next step is to investigate the daily activity pattern for the participant:
activity.data[,.(total_steps=sum(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_steps=sum(total_steps)),by=c("day","month")][order(total_steps,decreasing = T)]
## day month total_steps
## 1: Wednesday 10 58473
## 2: Friday 10 49283
## 3: Saturday 10 48360
## 4: Tuesday 10 46758
## 5: Sunday 10 46392
## 6: Monday 11 41207
## 7: Sunday 11 39552
## 8: Saturday 11 39388
## 9: Friday 11 37235
## 10: Wednesday 11 35853
## 11: Thursday 10 34968
## 12: Tuesday 11 33788
## 13: Thursday 11 30734
## 14: Monday 10 28617
activity.data_impute[,.(total_steps=sum(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_steps=sum(total_steps)),by=c("day","month")][order(total_steps,decreasing = T)]
## day month total_steps
## 1: Wednesday 10 58473.00
## 2: Friday 10 49283.00
## 3: Saturday 10 48360.00
## 4: Tuesday 10 46758.00
## 5: Sunday 10 46392.00
## 6: Monday 11 41207.00
## 7: Sunday 11 40798.04
## 8: Saturday 11 40634.04
## 9: Friday 11 39727.08
## 10: Wednesday 11 37099.04
## 11: Thursday 10 34968.00
## 12: Tuesday 11 33788.00
## 13: Thursday 11 31980.04
## 14: Monday 10 31109.08
The above tables elucidate the participant’s activity pattern. Both tables, original and imputed, suggest for the month of October, Wednesdays, Fridays, and Saturdays was when the participant was the most active. Whereas, in November the participant was the most active on Mondays, Sundays, and Saturdays. Lastly, a check on the distribution of the means also tells the same story:
activity.data_impute[,.(avg_steps=mean(steps,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(avg_steps=mean(avg_steps)),by=c("day","month")][order(avg_steps,decreasing = T)]
## day month avg_steps
## 1: Friday 10 42.78038
## 2: Saturday 10 41.97917
## 3: Wednesday 10 40.60625
## 4: Sunday 10 40.27083
## 5: Monday 11 35.76997
## 6: Sunday 11 35.41496
## 7: Saturday 11 35.27260
## 8: Tuesday 10 32.47083
## 9: Wednesday 11 32.20403
## 10: Thursday 10 30.35417
## 11: Tuesday 11 29.32986
## 12: Friday 11 27.58825
## 13: Thursday 11 22.20836
## 14: Monday 10 21.60353
A detailed investigation of the participants daily activity was ascertained from aggregating their average steps per interval.
activity.data[,.(mean_steps_per_interval=mean(steps,na.rm=T),time_hr=interval*(1/60)),by=interval][order(mean_steps_per_interval,decreasing = T)]
## interval mean_steps_per_interval time_hr
## 1: 835 206.1698 13.916667
## 2: 840 195.9245 14.000000
## 3: 850 183.3962 14.166667
## 4: 845 179.5660 14.083333
## 5: 830 177.3019 13.833333
## ---
## 284: 350 0.0000 5.833333
## 285: 355 0.0000 5.916667
## 286: 415 0.0000 6.916667
## 287: 500 0.0000 8.333333
## 288: 2310 0.0000 38.500000
lattice::xyplot(mean_steps_per_interval ~ interval, data=activity.data[,.(mean_steps_per_interval=mean(steps,na.rm=T)),by=interval], type= "l", xlab="5 Min Intervals", ylab="Number of Steps", main="Average Daily Activity Pattern")
The table and time series show that the participant achieves the highest average of steps during intervals 835,840,850,845,and 830, which correspond to hours 13.833 through 14.17.
weekday_or_end <- function(date) {
day <- weekdays(date)
if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
return("weekday") else if (day %in% c("Saturday", "Sunday"))
return("weekend") else stop("invalid date")
}
activity.data_impute$date <- as.Date(activity.data_impute$date)
activity.data_impute$day <- sapply(activity.data_impute$date,weekday_or_end)
lattice::xyplot(mean_steps ~ interval| as.factor(day), data=activity.data_impute[,.(mean_steps=mean(steps)),by=c("interval","day")], type="l", ylab ="Number of Steps")
activity.data_impute[,.(avg_steps=mean(steps,na.rm = T),day=weekdays(date),weekday_weekend=day,month=month(date)),by=date][,.(avg_steps=mean(avg_steps)),by=c("weekday_weekend","month")][order(avg_steps,decreasing = T)]
## weekday_weekend month avg_steps
## 1: weekend 10 41.12500
## 2: weekend 11 35.34378
## 3: weekday 10 33.30179
## 4: weekday 11 29.00902
Based on the analysis of the steps taken, the participant was most active during the weekend.
activity.data_impute[,.(day=weekdays(date),avg_calorie_burned=mean(calorie_burned,na.rm = T)),by=date]
## date day avg_calorie_burned
## 1: 2012-10-01 Monday 0.107535441
## 2: 2012-10-02 Tuesday 0.010937500
## 3: 2012-10-03 Wednesday 0.985416667
## 4: 2012-10-04 Thursday 1.051736111
## 5: 2012-10-05 Friday 1.153993056
## 6: 2012-10-06 Saturday 1.338541667
## 7: 2012-10-07 Sunday 0.956163194
## 8: 2012-10-08 Monday 0.107535441
## 9: 2012-10-09 Tuesday 1.112065972
## 10: 2012-10-10 Wednesday 0.859375000
## 11: 2012-10-11 Thursday 0.894444444
## 12: 2012-10-12 Friday 1.508854167
## 13: 2012-10-13 Saturday 1.078645833
## 14: 2012-10-14 Sunday 1.310590278
## 15: 2012-10-15 Monday 0.880121528
## 16: 2012-10-16 Tuesday 1.309375000
## 17: 2012-10-17 Wednesday 1.167708333
## 18: 2012-10-18 Thursday 0.872916667
## 19: 2012-10-19 Friday 1.026822917
## 20: 2012-10-20 Saturday 0.902343750
## 21: 2012-10-21 Sunday 0.765711806
## 22: 2012-10-22 Monday 1.168402778
## 23: 2012-10-23 Tuesday 0.774131944
## 24: 2012-10-24 Wednesday 0.725260417
## 25: 2012-10-25 Thursday 0.216319444
## 26: 2012-10-26 Friday 0.588368056
## 27: 2012-10-27 Saturday 0.878385417
## 28: 2012-10-28 Sunday 0.994618056
## 29: 2012-10-29 Monday 0.435590278
## 30: 2012-10-30 Tuesday 0.852343750
## 31: 2012-10-31 Wednesday 1.338020833
## 32: 2012-11-01 Thursday 0.107535441
## 33: 2012-11-02 Friday 0.920138889
## 34: 2012-11-03 Saturday 0.917621528
## 35: 2012-11-04 Sunday 0.107535441
## 36: 2012-11-05 Monday 0.906163194
## 37: 2012-11-06 Tuesday 0.723437500
## 38: 2012-11-07 Wednesday 1.118315972
## 39: 2012-11-08 Thursday 0.279427083
## 40: 2012-11-09 Friday 0.107535441
## 41: 2012-11-10 Saturday 0.107535441
## 42: 2012-11-11 Sunday 1.094444444
## 43: 2012-11-12 Monday 0.934461806
## 44: 2012-11-13 Tuesday 0.636805556
## 45: 2012-11-14 Wednesday 0.107535441
## 46: 2012-11-15 Thursday 0.003559028
## 47: 2012-11-16 Friday 0.472309028
## 48: 2012-11-17 Saturday 1.244704861
## 49: 2012-11-18 Sunday 1.311631944
## 50: 2012-11-19 Monday 0.767447917
## 51: 2012-11-20 Tuesday 0.388194444
## 52: 2012-11-21 Wednesday 1.109982639
## 53: 2012-11-22 Thursday 1.773177083
## 54: 2012-11-23 Friday 1.839756944
## 55: 2012-11-24 Saturday 1.256770833
## 56: 2012-11-25 Sunday 1.027256944
## 57: 2012-11-26 Monday 0.968923611
## 58: 2012-11-27 Tuesday 1.184548611
## 59: 2012-11-28 Wednesday 0.883940972
## 60: 2012-11-29 Thursday 0.611718750
## 61: 2012-11-30 Friday 0.107535441
## date day avg_calorie_burned
activity.data_impute[,.(avg_calorie_burned=mean(calorie_burned,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(avg_calorie_burned=mean(avg_calorie_burned)),by=c("day","month")][order(avg_calorie_burned,decreasing = T)]
## day month avg_calorie_burned
## 1: Friday 10 1.0695095
## 2: Saturday 10 1.0494792
## 3: Wednesday 10 1.0151562
## 4: Sunday 10 1.0067708
## 5: Monday 11 0.8942491
## 6: Sunday 11 0.8852172
## 7: Saturday 11 0.8816582
## 8: Tuesday 10 0.8117708
## 9: Wednesday 11 0.8049438
## 10: Thursday 10 0.7588542
## 11: Tuesday 11 0.7332465
## 12: Friday 11 0.6894551
## 13: Thursday 11 0.5550835
## 14: Monday 10 0.5398371
When calculating the calories burned from the participants steps tracked, their mean activity level appears low and can be classified as sedentary.
activity.data_impute[,.(day=weekdays(date),total_calorie_burned=sum(calorie_burned,na.rm = T)),by=date]
## date day total_calorie_burned
## 1: 2012-10-01 Monday 30.97021
## 2: 2012-10-02 Tuesday 3.15000
## 3: 2012-10-03 Wednesday 283.80000
## 4: 2012-10-04 Thursday 302.90000
## 5: 2012-10-05 Friday 332.35000
## 6: 2012-10-06 Saturday 385.50000
## 7: 2012-10-07 Sunday 275.37500
## 8: 2012-10-08 Monday 30.97021
## 9: 2012-10-09 Tuesday 320.27500
## 10: 2012-10-10 Wednesday 247.50000
## 11: 2012-10-11 Thursday 257.60000
## 12: 2012-10-12 Friday 434.55000
## 13: 2012-10-13 Saturday 310.65000
## 14: 2012-10-14 Sunday 377.45000
## 15: 2012-10-15 Monday 253.47500
## 16: 2012-10-16 Tuesday 377.10000
## 17: 2012-10-17 Wednesday 336.30000
## 18: 2012-10-18 Thursday 251.40000
## 19: 2012-10-19 Friday 295.72500
## 20: 2012-10-20 Saturday 259.87500
## 21: 2012-10-21 Sunday 220.52500
## 22: 2012-10-22 Monday 336.50000
## 23: 2012-10-23 Tuesday 222.95000
## 24: 2012-10-24 Wednesday 208.87500
## 25: 2012-10-25 Thursday 62.30000
## 26: 2012-10-26 Friday 169.45000
## 27: 2012-10-27 Saturday 252.97500
## 28: 2012-10-28 Sunday 286.45000
## 29: 2012-10-29 Monday 125.45000
## 30: 2012-10-30 Tuesday 245.47500
## 31: 2012-10-31 Wednesday 385.35000
## 32: 2012-11-01 Thursday 30.97021
## 33: 2012-11-02 Friday 265.00000
## 34: 2012-11-03 Saturday 264.27500
## 35: 2012-11-04 Sunday 30.97021
## 36: 2012-11-05 Monday 260.97500
## 37: 2012-11-06 Tuesday 208.35000
## 38: 2012-11-07 Wednesday 322.07500
## 39: 2012-11-08 Thursday 80.47500
## 40: 2012-11-09 Friday 30.97021
## 41: 2012-11-10 Saturday 30.97021
## 42: 2012-11-11 Sunday 315.20000
## 43: 2012-11-12 Monday 269.12500
## 44: 2012-11-13 Tuesday 183.40000
## 45: 2012-11-14 Wednesday 30.97021
## 46: 2012-11-15 Thursday 1.02500
## 47: 2012-11-16 Friday 136.02500
## 48: 2012-11-17 Saturday 358.47500
## 49: 2012-11-18 Sunday 377.75000
## 50: 2012-11-19 Monday 221.02500
## 51: 2012-11-20 Tuesday 111.80000
## 52: 2012-11-21 Wednesday 319.67500
## 53: 2012-11-22 Thursday 510.67500
## 54: 2012-11-23 Friday 529.85000
## 55: 2012-11-24 Saturday 361.95000
## 56: 2012-11-25 Sunday 295.85000
## 57: 2012-11-26 Monday 279.05000
## 58: 2012-11-27 Tuesday 341.15000
## 59: 2012-11-28 Wednesday 254.57500
## 60: 2012-11-29 Thursday 176.17500
## 61: 2012-11-30 Friday 30.97021
## date day total_calorie_burned
activity.data_impute[,.(total_calorie_burned=sum(calorie_burned,na.rm = T),day=weekdays(date),month=month(date)),by=date][,.(total_calorie_burned=sum(total_calorie_burned)),by=c("day","month")][order(total_calorie_burned,decreasing = T)]
## day month total_calorie_burned
## 1: Wednesday 10 1461.8250
## 2: Friday 10 1232.0750
## 3: Saturday 10 1209.0000
## 4: Tuesday 10 1168.9500
## 5: Sunday 10 1159.8000
## 6: Monday 11 1030.1750
## 7: Sunday 11 1019.7702
## 8: Saturday 11 1015.6702
## 9: Friday 11 992.8154
## 10: Wednesday 11 927.2952
## 11: Thursday 10 874.2000
## 12: Tuesday 11 844.7000
## 13: Thursday 11 799.3202
## 14: Monday 10 777.3654
activity.data_impute[,.(total_calorie_burned=sum(calorie_burned,na.rm = T)),by=month(date)]
## month total_calorie_burned
## 1: 10 7883.215
## 2: 11 6629.746
If we compare the participants total calorie count, October showed the most activity.In particular, on Wednesdays in October, their total calorie count is equivalent to burning off a Grand Mac, fries, and a soft drink at McDonald.
The analysis of the pedometer data elucidates the activity patterns and health behaviors of an individual. When comparing the results from David R. Bassett, JR., Holly R. Wyatt, etc., it is apparent that the individual would not be classified as active and is likely to have a higher body mass index (BMI).