It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data set covers two months of measurements from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.
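The variables included in this dataset are:
steps: the number of steps taken in a 5-minute interval (missing values are coded as NA)
date: the date on which the measurement was taken, in YYYY-MM-DD format
interval: an identifier for the 5-minute interval in which the measurement was taken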
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
The first step is to read the CSV file containing the data into R.
act_data <- read.csv("activity.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
We quickly inspect the data by checking its structure and dimensions:
str(act_data)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Next, we look at the first 5 rows and the last 5 rows.
head(act_data, 5)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
tail(act_data, 5)
## steps date interval
## 17564 NA 2012-11-30 2335
## 17565 NA 2012-11-30 2340
## 17566 NA 2012-11-30 2345
## 17567 NA 2012-11-30 2350
## 17568 NA 2012-11-30 2355
The date column is read in as character, so we convert it to the Date class as follows:
act_data$date <- as.Date(act_data$date, "%Y-%m-%d")
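A quick way to confirm that a conversion like this worked is to check the class of the result and look for values that failed to parse. The sketch below uses a small made-up vector, raw_dates, rather than the real activity.csv column.

```r
# Toy character dates in the same YYYY-MM-DD format as the date column
# (raw_dates is a made-up example vector, not the real data)
raw_dates <- c("2012-10-01", "2012-10-02", "2012-11-30")
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

class(parsed)        # class of the converted vector
anyNA(parsed)        # TRUE would mean at least one string failed to parse
range(parsed)        # earliest and latest date
diff(range(parsed))  # time span covered by the dates
```

Strings that do not match the format become NA rather than raising an error, which is why the anyNA() check is useful.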
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next we compute the total and mean number of steps for each day:
mean_steps <- act_data %>% group_by(date) %>%
  summarize(total.steps = sum(steps, na.rm = TRUE),
            mean.steps = mean(steps, na.rm = TRUE))
Now we can plot a histogram of the total number of steps taken per day.
library(ggplot2)
m <- ggplot(mean_steps, aes(x = total.steps))
m + geom_histogram(binwidth = 2500) +
  theme(axis.text = element_text(size = 13),
        axis.title = element_text(size = 14)) +
  labs(x = "Total steps/day", y = "Number of Occurrences")
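From the histogram we can see that there is a slightly negative skew in the distribution of the data, with an abnormally high frequency in the first bar of the histogram. Days on which every interval is missing get a total of 0 because of na.rm = TRUE, and they all land in that first bar.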
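The effect of na.rm = TRUE on a day with no recorded values can be reproduced on a toy vector. The sketch below also shows one possible way to keep such days as NA instead of 0. The helper sum_or_na and the toy data frame are assumptions made for illustration and are not part of the original analysis.

```r
# A day on which every interval is missing
all_missing <- c(NA_integer_, NA_integer_, NA_integer_)
sum(all_missing, na.rm = TRUE)   # 0: the sum of an empty set
mean(all_missing, na.rm = TRUE)  # NaN: the mean of an empty set

# Hypothetical helper: return NA when a day has no observations at all
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

# Toy data: the first day is entirely missing, the second only partly
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 10L, NA, 20L)
)
daily <- tapply(toy$steps, toy$date, sum_or_na)
daily
```

With totals computed this way, ggplot2 would drop the all-missing days from the histogram (with a warning) rather than count them as days with zero steps.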
Now we can compute summary statistics (the five-number summary plus the mean) for the daily mean number of steps per interval as follows:
summary(mean_steps$mean.steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.1424 30.7000 37.3800 37.3800 46.1600 73.5900 8
The output reports 8 NA's: on eight days every interval is missing, so no daily mean can be computed for them.
The data will be grouped by 5-minute interval, after which the mean number of steps for each interval will be calculated as illustrated below.
mean_int <- act_data %>% group_by(interval) %>%
  summarize(mean.steps = mean(steps, na.rm = TRUE))
n <- ggplot(mean_int, aes(x = interval, y = mean.steps))
n + geom_line() + theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 14, face = "bold")) +
labs(y = "Mean number of steps") + labs(x = "Interval")
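The mean number of steps peaks between intervals 500 and 1000. Because the interval labels encode the time of day as HHMM, this corresponds to roughly 05:00 to 10:00.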
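The exact location of the peak can be read from the data rather than from the plot. The sketch below uses a made-up stand-in for mean_int called toy_int; its values are invented and do not reproduce the real result.

```r
# Toy stand-in for mean_int; the values are invented for illustration
toy_int <- data.frame(
  interval   = c(0, 55, 100, 815, 1200, 2355),
  mean.steps = c(2, 1, 0, 150, 90, 5)
)

# Row with the highest mean number of steps
peak <- toy_int[which.max(toy_int$mean.steps), ]
peak

# Interval labels encode clock time as HHMM, so pad to four digits
hhmm  <- sprintf("%04d", as.integer(peak$interval))
clock <- paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))
clock
```

Running the same two steps on the real mean_int would give the interval with the highest average activity and its time of day.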
Next we compute the proportion of missing values in the steps column:
mean(is.na(act_data$steps))
## [1] 0.1311475
Approximately 13% of the step values are missing, as shown above. The absolute number of missing values is:
sum(is.na(act_data$steps))
## [1] 2304
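Since a full day has 288 five-minute intervals and 2304 = 8 × 288, the missing values could all sit in eight complete days. A per-day count of NAs would confirm this. The sketch below shows the idea on a made-up data frame, toy, with four intervals per day instead of 288.

```r
# Toy data: three days with four intervals each; days 1 and 3 are all NA
toy <- data.frame(
  date  = rep(as.Date("2012-10-01") + 0:2, each = 4),
  steps = c(NA, NA, NA, NA, 0, 12, 7, 0, NA, NA, NA, NA)
)

# Number of missing step values on each day
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day

# Days on which every interval is missing (4 intervals per day here)
names(na_per_day)[na_per_day == 4]
```

On the real data the comparison value would be 288 rather than 4.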
Let's check for missing values in the mean.steps column of mean_int, where we stored the mean number of steps for each 5-minute interval:
sum(is.na(mean_int$mean.steps))
## [1] 0
Now let us copy the data, so that the original stays untouched, as follows:
new_act_data <- act_data
To fill in the missing values, we check each row to see whether steps is NA. When it is, we take that row's interval (index), look up the matching row in mean_int, and store it in a temporary variable, value. Finally, we take the column of interest from value, which is mean.steps, and assign it to the corresponding position in new_act_data. A for loop runs through all the rows.
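for (i in seq_len(nrow(new_act_data))) {
  if (is.na(new_act_data$steps[i])) {
    # Interval of the row whose step count is missing
    index <- new_act_data$interval[i]
    # Row of mean_int holding the mean for that interval
    value <- subset(mean_int, interval == index)
    new_act_data$steps[i] <- value$mean.steps
  }
}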
tail(new_act_data)
## steps date interval
## 17563 2.6037736 2012-11-30 2330
## 17564 4.6981132 2012-11-30 2335
## 17565 3.3018868 2012-11-30 2340
## 17566 0.6415094 2012-11-30 2345
## 17567 0.2264151 2012-11-30 2350
## 17568 1.0754717 2012-11-30 2355
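The same imputation can be written without a loop by looking up all missing rows at once with match(). The following is a sketch on made-up data (toy_data and toy_means stand in for new_act_data and mean_int), offered as an alternative rather than as the method used above.

```r
# Toy stand-ins for new_act_data and mean_int (values are made up)
toy_data <- data.frame(
  steps    = c(NA, 4, NA, 10),
  interval = c(0, 5, 0, 5)
)
toy_means <- data.frame(interval = c(0, 5), mean.steps = c(1.5, 7))

# Rows whose step count is missing
missing <- is.na(toy_data$steps)

# For each missing row, position of its interval in the table of means
idx <- match(toy_data$interval[missing], toy_means$interval)

# Replace the missing values with the corresponding interval means
toy_data$steps[missing] <- toy_means$mean.steps[idx]
toy_data$steps
```

On a data set of this size the loop is fast enough, so the vectorized form is mainly a matter of brevity.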
Grouping the imputed data by date, we can construct the histogram of total steps per day again.
new_mean <- new_act_data %>% group_by(date) %>%
  summarize(total.steps = sum(steps, na.rm = TRUE))
g <- ggplot(new_mean, aes(x=total.steps))
g + geom_histogram(binwidth = 2500) + theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 14)) + labs(y = "Frequency") + labs(x = "Total steps/day")
The abnormal bar on the left has disappeared, because the days that were entirely missing now have imputed totals instead of 0, and the data now exhibit a negatively skewed distribution around the mean.
We now explore whether the activity patterns differ between weekdays and weekends. First we label each row as a weekday or a weekend day:
new_act_data$day <- ifelse(weekdays(new_act_data$date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
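Note that weekdays() returns day names in the language of the current locale, so a comparison against "Saturday" and "Sunday" only works in an English locale. A locale-independent variant, sketched here on a few example dates, uses the numeric weekday stored in POSIXlt objects (0 is Sunday, 6 is Saturday). The variable names are chosen for illustration.

```r
# Example dates: 2012-10-05 is a Friday, 2012-10-08 a Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is the day of the week as a number: 0 = Sunday, ..., 6 = Saturday
wday_num <- as.POSIXlt(dates)$wday

# Label Saturdays and Sundays as weekend, everything else as weekday
day_type <- ifelse(wday_num %in% c(0, 6), "weekend", "weekday")
day_type
```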
Next we create two subsets, one containing the weekend and one containing the weekday data:
wend <- filter(new_act_data, day == "weekend")
wday <- filter(new_act_data, day == "weekday")
Since the day column is lost during the grouping, we add it back to the wend and wday data frames after summarizing. Lastly, we combine both data sets into one named new_int.
wend <- wend %>%
  group_by(interval) %>%
  summarize(mean.steps = mean(steps))
wend$day <- "weekend"
wday <- wday %>%
  group_by(interval) %>%
  summarize(mean.steps = mean(steps))
wday$day <- "weekday"
new_int <- rbind(wend, wday)
new_int$day <- as.factor(new_int$day)
new_int$day <- relevel(new_int$day, "weekend")
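The split, summarize, and recombine steps can also be collapsed into a single pipeline by grouping on both day and interval, which keeps the day column in the result. The sketch below shows this on a made-up data frame, toy, standing in for new_act_data. It is an alternative formulation, not the code used above.

```r
library(dplyr)

# Toy stand-in for new_act_data with a day label (values are made up)
toy <- data.frame(
  steps    = c(10, 20, 30, 40, 50, 60, 70, 80),
  interval = rep(c(0, 5), times = 4),
  day      = rep(c("weekday", "weekend"), each = 4)
)

# Group by day type and interval in one step; day survives the summary
toy_int <- toy %>%
  group_by(day, interval) %>%
  summarize(mean.steps = mean(steps)) %>%
  ungroup() %>%
  mutate(day = factor(day, levels = c("weekend", "weekday")))
toy_int
```

Setting the factor levels explicitly puts "weekend" first, which has the same effect as the relevel() call.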
The two-panel plot is now created, using the day column as a factor to separate the weekday from the weekend time series.
g <- ggplot(new_int, aes(interval, mean.steps))
g + geom_line() + facet_grid(day ~ .) +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14)) +
  labs(x = "Interval", y = "Number of Steps")
There is a marked difference between weekday and weekend activity, with the weekend showing more activity. The variance during weekends is lower than during weekdays.
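Statements like these can be backed by numbers by computing the mean and variance of the interval means for each day type. The sketch below does this on a made-up data frame, toy_int, whose values are invented and do not reproduce the real result. It is a descriptive comparison only, not a significance test.

```r
# Toy stand-in for new_int; the values are invented for illustration
toy_int <- data.frame(
  day        = rep(c("weekday", "weekend"), each = 4),
  mean.steps = c(0, 100, 20, 0, 30, 50, 40, 40)
)

# Average activity per interval for each day type
avg_by_day <- tapply(toy_int$mean.steps, toy_int$day, mean)
avg_by_day

# Variance of the interval means for each day type
var_by_day <- tapply(toy_int$mean.steps, toy_int$day, var)
var_by_day
```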