[[ Instruction: This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit HTML button, a document will be generated that includes both content as well as the output of R code chunks within this document. ]]
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5 minute intervals through out the day. The data consists of two months of data from an anonymous individual collected during the months of October and November, 2012 and include the number of steps taken in 5 minute intervals each day.
The data for this assignment can be downloaded from the course web site:
Dataset: Activity monitoring data [52K] The variables included in this dataset are:
steps: Number of steps taking in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
Before running the program set your present working directory using setwd() to where file “activity.csv” is downloaded.To know your present working directory type getwd() in the console
Before proceeding download the dataset: Activity monitoring data [52K] to your working directory and unzip as “activity.csv”. You can also download using the script download.file() command in R.
Checking the availability of the data activity.csv“.
fname = "activity.csv"
zfname = "repdata-data-activity.zip" # make sure this file is present in the working directory
if (!file.exists(fname)){
unzip(zfname)
}
In this assignment package knitr and lattice would be used to create reports and analysis
library(lattice)
library(knitr)
Import the data using read.csv() and store into “acmData”.
acmDT<-read.csv("activity.csv", header=TRUE, colClasses=c("numeric", "Date","numeric"))
Note that I have used colClasses() to declare the variable types. Specifying a colClasses() argument to read.table or read.csv can save time on importing “big data”. Make sure the number of arguments within colClasses() are exactly equal to number of columns in the data.
It is also good practice to print first few rows of the data to know some structure of the variables (columns). I would do that using the function head().
head(acmDT)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
I want to know quick summary of the data loaded using summary() function.
summary(acmDT)
## steps date interval
## Min. : 0.0 Min. :2012-10-01 Min. : 0
## 1st Qu.: 0.0 1st Qu.:2012-10-16 1st Qu.: 589
## Median : 0.0 Median :2012-10-31 Median :1178
## Mean : 37.4 Mean :2012-10-31 Mean :1178
## 3rd Qu.: 12.0 3rd Qu.:2012-11-15 3rd Qu.:1766
## Max. :806.0 Max. :2012-11-30 Max. :2355
## NA's :2304
It is important to process/transform the data (if necessary) into a format suitable for analysis or make a tidy version of raw data. In R while reading raw data using read.xxxx(), by defult it would read variables as character unless we mention colClasses as an ‘option’ like I have done while reading the data using read.csv() [viz. colClasses=c("numeric", "Date","numeric") ]. Otherwise we would have convert the varibles into specific formats using as.date() or as.numeric() (e.g. acmDT$date<-as.Date(acmDT$date))
before performing any arithmetic operations.
Creation of new variable also integral part of data analysis, like here we would create some variables like minimum date, maximum date, total number of days to know the span of the Time Series.
min_date <- min(acmDT$date)
max_date <- max(acmDT$date)
n_days <- as.numeric(max_date - min_date) + 1
We have time series data with a start date is 2012-10-01 and most recent date is 2012-11-30 over a span 61 days.
For this part of the assignment, we would ignore the missing values in the dataset. To do so we would essentially remove the missing rows or rows with value “NA” using the functon complete.cases(). This function produces a logical vector with ‘TRUE’s none of the 3 variables has a value “NA” in a particular row and’FALSE’s otherwise.
compl_acmDT<-acmDT[complete.cases(acmDT),] # data with only non-missing rows
num_compl_cases<-nrow(compl_acmDT)
# Missing data
in_compl_acmDT<-acmDT[!complete.cases(acmDT),] ## data with only missing rows
num_incmpl_cases<-nrow(in_compl_acmDT)
There are 15264 rows with no missing observation and 2304 rows with NAs in the data.
We would have to calculate total number of steps in a day using “steps” and “date”. To do so we would use functions tapply() and sum() of “steps” group by “date”. Alternatively, it can be done using library(sqldf) with a small query sqldf("select date, sum(steps) steps from compl_acmDT group by date order by date").
# Number of steps on a particular date
stps_per_day <- as.data.frame(tapply(compl_acmDT$steps, INDEX = compl_acmDT$date,
FUN = "sum", na.rm = TRUE))
colnames(stps_per_day) <- "steps"
# maximum number of steps among all the dates + some buffer to label x-axis
max_num_steps<-max(stps_per_day)+5000
# plot histogram of frequency of the steps
hist(stps_per_day$steps, main = "Total Number of Steps Per Day",
xlab = "Total Number of Steps Per Day", ylab = "Frequency",
breaks = 10, xlim = c(0, max_num_steps), ylim = c(0, 20), col = "blue")
#axis(side=1, at=axTicks(1,
#labels=formatC(axTicks(1), format="d",big.mark=','))
This can be done simply using the functions mean() and median().
mean_num_steps <- format(mean(stps_per_day$steps, na.rm = TRUE), big.mark=",",scientific=F)
median_num_steps <- format(median(stps_per_day$steps, na.rm = TRUE),big.mark=",",scientific=F)
On an average 10,766 steps taken per day and with a meadian of 10,765 steps per day. Note that I have applied na.rm=TRUE to exclude missing values NA. I have also used format() to convert large numeric numbers separated with commas (“,”) to give a pretty look.
To plot average number of steps by time interval, we need to calculate mean of “steps” by “interval”.
#calculate average number of steps by intervals without missing values
avg_no_stps_intvl <- as.data.frame(tapply(compl_acmDT$steps,
INDEX = compl_acmDT$interval,
FUN = "mean", na.rm = TRUE))
#Name the calculated average column to avg_steps
colnames(avg_no_stps_intvl) <- "avg_steps"
#Name rownames to interval as INDEX has created unterval as rownames
avg_no_stps_intvl$interval <- rownames(avg_no_stps_intvl)
row.names(avg_no_stps_intvl)<-NULL
Print few observations to check how the new summary data looks like.
#check few observations from the top
head(avg_no_stps_intvl)
## avg_steps interval
## 1 1.71698 0
## 2 0.33962 5
## 3 0.13208 10
## 4 0.15094 15
## 5 0.07547 20
## 6 2.09434 25
Optional: quick summarization of the data
#summarization
summary(avg_no_stps_intvl)
## avg_steps interval
## Min. : 0.00 Length:288
## 1st Qu.: 2.49 Class :character
## Median : 34.11 Mode :character
## Mean : 37.38
## 3rd Qu.: 52.83
## Max. :206.17
Now plotting line graph of average number of steps against 5-miniute interval using plot() and with type="l" (for line graph).
plot(avg_no_stps_intvl$interval, avg_no_stps_intvl$avg_steps, type = "l",
xlab = "Time Interval (5-mins)", ylab = "Average number of steps",
main = "Average Number of Steps by Time Interval", col="blue")
avg_no_stps_intvl_mx <- avg_no_stps_intvl[(avg_no_stps_intvl$avg_steps == max(avg_no_stps_intvl$avg_steps)), ]
max_interval <-avg_no_stps_intvl_mx[1, 2]
We can see from the plot between 0 to 500 miniutes average number steps very close to 0 (zero) with a flat trend, between 500 to 835 miniutes there is a increasing trend and reached to a maximum of 206.1698 and sudden drop to below 50, for rest of the time periods oscilating between 0 to 100 steps.
Based on the above data analysis and trends we can see that the 5-miniutes interval with maximum number of average steps per day is 835.
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
NAs)# Missing data
in_compl_acmDT<-acmDT[!complete.cases(acmDT),] ## data with only missing rows
num_incmpl_cases<-nrow(in_compl_acmDT)
Total Number of missing rows in the activity monitoring data is 2,304.
Missing value imputation can simply be done by the average steps per day, which we have calculated in the previous step. But let me use median as sometime median is a great representation of missing cases in the presence of extreme cases.
#calculate median number of steps by intervals without missing values
median_stps_intvl <- as.data.frame(tapply(compl_acmDT$steps,
INDEX = compl_acmDT$interval,
FUN = "median", na.rm = TRUE))
#Name calculated median column to median_steps
colnames(median_stps_intvl) <- "median_steps"
#Name rownames to interval as INDEX has created unterval as rownames
median_stps_intvl$interval <- rownames(median_stps_intvl)
row.names(median_stps_intvl)<-NULL
mean_median<-merge(median_stps_intvl,avg_no_stps_intvl, by="interval")
#merge mean and median with original data with missing
acmDT_with_summ<-merge(acmDT,mean_median, by="interval")
head(acmDT_with_summ)
interval steps date median_steps avg_steps
1 0 NA 2012-10-01 0 1.717
2 0 0 2012-11-23 0 1.717
3 0 0 2012-10-28 0 1.717
4 0 0 2012-11-06 0 1.717
5 0 0 2012-11-24 0 1.717
6 0 0 2012-11-15 0 1.717
tail(acmDT_with_summ)
interval steps date median_steps avg_steps
17563 2355 0 2012-10-16 0 1.075
17564 2355 0 2012-10-07 0 1.075
17565 2355 0 2012-10-25 0 1.075
17566 2355 0 2012-11-03 0 1.075
17567 2355 NA 2012-10-08 0 1.075
17568 2355 NA 2012-11-30 0 1.075
Clearly, median would be better replacement than the average. Now to identify each of the missing cell and replace with median we have to apply loop on array.
The dataset “acmDT_No_Miss” is now would be filled by median. This data would be exactly the same as original data “acmDT” with no missing.
for (i in (1:nrow(acmDT_with_summ))) {
if (is.na(acmDT_with_summ$steps[i])) {
acmDT_with_summ$steps[i] <- acmDT_with_summ$median_steps[i]
}
}
#head(acmDT_with_summ)
acmDT_No_Miss<-acmDT_with_summ[, c("steps","date", "interval")]
#head(acmDT_No_Miss)
Compare summary with missing (NAs)
summary(acmDT)
## steps date interval
## Min. : 0.0 Min. :2012-10-01 Min. : 0
## 1st Qu.: 0.0 1st Qu.:2012-10-16 1st Qu.: 589
## Median : 0.0 Median :2012-10-31 Median :1178
## Mean : 37.4 Mean :2012-10-31 Mean :1178
## 3rd Qu.: 12.0 3rd Qu.:2012-11-15 3rd Qu.:1766
## Max. :806.0 Max. :2012-11-30 Max. :2355
## NA's :2304
Compare summary with out missing (No NAs)
summary(acmDT_No_Miss)
## steps date interval
## Min. : 0 Min. :2012-10-01 Min. : 0
## 1st Qu.: 0 1st Qu.:2012-10-16 1st Qu.: 589
## Median : 0 Median :2012-10-31 Median :1178
## Mean : 33 Mean :2012-10-31 Mean :1178
## 3rd Qu.: 8 3rd Qu.:2012-11-15 3rd Qu.:1766
## Max. :806 Max. :2012-11-30 Max. :2355
Now there is no missing in the data. We can see the changes in the distribution after missing value imputation.
# Number of steps on a particular date
stps_per_day_nm <- as.data.frame(tapply(acmDT_No_Miss$steps, INDEX = acmDT_No_Miss$date, FUN = "sum"))
colnames(stps_per_day_nm) <- "steps"
# maximum number of steps among all the dates + some buffer to label x-axis
max_num_steps<-max(stps_per_day_nm)+5000
# plot histogram of frequency of the steps
hist(stps_per_day_nm$steps, main = "Total Number of Steps Per Day",
xlab = "Total Number of Steps Per Day", ylab = "Frequency",
breaks = 10, xlim = c(0, max_num_steps), ylim = c(0, 20), col = "blue")
Clearly there is a change in the distribution of number of steps. Significantly for the initial days.
Calculate and report the mean and median total number of steps taken per day with missing values replaced by medians.
mean_num_steps_nm <- format(mean(stps_per_day_nm$steps), big.mark=",",scientific=F)
median_num_steps_nm <- format(median(stps_per_day_nm$steps),big.mark=",",scientific=F)
New Mean of the steps per day is 9,504 and the median of the steps per day is 10,395. compared to a average 10,766 steps taken per day and with a meadian of 10,765 steps per day after excluding missing values.
We can observe that both the mean and median decresed after missing value imputation.
For this part the weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part.
# Logical variable - weekdays: Saturday and Sunday would be TRUE else FALSE
acmDT_No_Miss$weekdays <- grepl("^[Ss]", weekdays(as.Date(acmDT_No_Miss$date)))
for (i in (1:nrow(acmDT_No_Miss))) {
if (acmDT_No_Miss$weekdays[i] == TRUE) {
acmDT_No_Miss$week_day_factor[i] <- "weekend"
} else {
acmDT_No_Miss$week_day_factor[i] <- "weekday"
}
}
#Check
head(acmDT_No_Miss)
## steps date interval weekdays week_day_factor
## 1 0 2012-10-01 0 FALSE weekday
## 2 0 2012-11-23 0 FALSE weekday
## 3 0 2012-10-28 0 TRUE weekend
## 4 0 2012-11-06 0 FALSE weekday
## 5 0 2012-11-24 0 TRUE weekend
## 6 0 2012-11-15 0 FALSE weekday
mean_stps_per_intvl_imput <- aggregate(acmDT_No_Miss$steps, by = list(acmDT_No_Miss$interval,
acmDT_No_Miss$week_day_factor), FUN = "mean", na.rm = TRUE)
colnames(mean_stps_per_intvl_imput) <- c("interval", "weekday", "avg_steps")
sort_order <- order(as.numeric(mean_stps_per_intvl_imput$interval))
mean_stps_per_intvl <- mean_stps_per_intvl_imput[sort_order, ]
mean_by_day_type <- aggregate( acmDT_No_Miss$steps, by = list( acmDT_No_Miss$week_day_factor),
FUN = "mean", na.rm = TRUE)
mean_weekdays <- round(mean_by_day_type[1, 2], 2)
mean_weekends <- round(mean_by_day_type[2, 2], 2)
xyplot(data = mean_stps_per_intvl,
avg_steps ~ as.numeric(interval) | as.factor(weekday),
type = "l", layout = c(1, 2), col = c("blue","red"),
main = "Average Number of Steps by Time Interval",
xlab = "Time Interval (5-mins)", ylab = "Average number of steps")
xyplot(data = mean_stps_per_intvl,
avg_steps ~ as.numeric(interval) | as.factor(weekday),
type = "l", layout = c(2, 1), col = c("blue","red"),
main = "Average Number of Steps by Time Interval",
xlab = "Time Interval (5-mins)", ylab = "Average number of steps")
More and more activities over weekend compare to week days.
Note that the echo = TRUE parameter was added to the code chunk to print of the R codes that generate outputs in this report.