We extract the .csv file from the zip archive and read the “date” column as a Date:
DF <- read.csv(unz("activity.zip", "activity.csv"), colClasses=c("integer", "Date", "integer"))
head(DF)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
summary(DF)
## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
str(DF)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
First we select the data we need and group it per date. Then we sum the steps for each day:
library(dplyr)
stepsDay<-DF %>%
select(steps, date) %>%
group_by(date) %>%
summarise(sumSteps=sum(steps))
head(stepsDay)
## Source: local data frame [6 x 2]
##
## date sumSteps
## 1 2012-10-01 NA
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-10-04 12116
## 5 2012-10-05 13294
## 6 2012-10-06 15420
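The NA for 2012-10-01 appears because sum() returns NA as soon as one value in the group is missing. A minimal base-R sketch on made-up data (the `toy` data frame and both result names are hypothetical) shows this behaviour, and why na.rm=TRUE is not a safe shortcut here:

```r
# Hypothetical toy data: day 1 has only missing values, day 2 is complete
toy <- data.frame(
  steps = c(NA, NA, 10, 20),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"))
)

# Default behaviour: a single NA makes the whole daily total NA
totals <- tapply(toy$steps, toy$date, sum)

# With na.rm = TRUE an all-NA day silently becomes 0, which would
# show up as a spurious zero-step day in a histogram
totals_narm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
```

Keeping the NA totals and dropping them later with na.rm=TRUE in mean() and median() avoids counting days without measurements as days with zero steps.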
Now we prepare the histogram of the total number of steps taken each day (we set the bin width to 1,000 steps):
library(ggplot2)
ggplot(data=stepsDay, aes(sumSteps)) +
geom_histogram(binwidth = 1000) +
labs(title="Steps taken each day",
x = "Number of steps each day", y = "Number of days")
Now we calculate the mean and median of the total number of steps taken per day:
mean(stepsDay$sumSteps, na.rm=TRUE)
## [1] 10766.19
median(stepsDay$sumSteps, na.rm=TRUE)
## [1] 10765
Now we make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):
#To plot average daily activity
DF %>% group_by(interval) %>%
summarize(mean=mean(steps, na.rm=TRUE)) %>%
plot(., type="l", main="Average daily activity",
xlab="5-minute interval", ylab="Average steps across all days")
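# Mark the interval with the highest average number of steps on the plot
# (named maxInterval so that the base function max() is not masked)
maxInterval <- DF %>% group_by(interval) %>%
summarize(mean=mean(steps, na.rm=TRUE)) %>%
slice(which.max(mean)) %>% select(interval)
maxInterval <- maxInterval$interval # extract the value as a plain integer
abline(v=maxInterval, col="purple")
axis(1, at=maxInterval, labels = maxInterval, pos=0, col.axis="purple")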
The purple line marks the interval with the highest average number of steps. Here are the exact interval and its mean:
DF %>% group_by(interval) %>%
summarize(mean=mean(steps, na.rm=TRUE)) %>%
slice(which.max(mean)) #slice selects rows by position
## Source: local data frame [1 x 2]
##
## interval mean
## 1 835 206.1698
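The interval identifiers run from 0 to 2355 in steps of 5, which suggests that they encode the time of day as hours and minutes (HHMM). Under that assumption, which the data set itself does not state here, a small sketch can turn interval 835 into a clock time; `intervalToClock` is a hypothetical helper name:

```r
# Assumption: the identifier is HHMM, e.g. 835 means 08:35
intervalToClock <- function(interval) {
  hours   <- interval %/% 100   # integer division gives the hour part
  minutes <- interval %% 100    # remainder gives the minute part
  sprintf("%02d:%02d", hours, minutes)
}

peak <- intervalToClock(835L)
```

If the assumption holds, the most active 5-minute interval starts at 08:35 in the morning.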
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
summary(DF)
## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
There are 2304 missing values in the dataset.
We fill each NA with the mean of that 5-minute interval, averaged across all days.
This creates a new dataset that is equal to the original dataset but with the missing data filled in:
# Split main DF in two DF, one with NAs and one without NAs
DF_noNA <-DF %>% filter(complete.cases(.))
DF_NA <-DF %>% filter(!complete.cases(.))
# Create DF with intervals and their step means
meanSteps <- DF %>% group_by(interval) %>% summarize(mean=mean(steps, na.rm=TRUE))
# Merge means with DF with NAs by "interval"
DF_NA <- merge(meanSteps, DF_NA, by = "interval", all.y=TRUE)
DF_NA <- DF_NA[,c(2,4,1)] #remove column not needed and reorder columns
colnames(DF_NA) <-c("steps", "date", "interval") #rename columns
# Merge DF with no NA's and with the DF with means
DF_new <- rbind(DF_noNA, DF_NA)
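The split, merge, and rbind approach works, but it changes the row order of the data. As an alternative sketch, not the method used in this report, base R's ave() can compute the interval means in place so that the rows stay where they are. The `toy`, `intervalMean`, and `toy_filled` names are hypothetical:

```r
# Hypothetical toy data: two days, two intervals, day 1 entirely missing
toy <- data.frame(
  steps    = c(NA, NA, 10, 30),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# Mean of each interval across days, repeated for every row
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries; row order is unchanged
toy_filled <- toy
toy_filled$steps <- ifelse(is.na(toy$steps), intervalMean, toy$steps)
```

Both approaches should produce the same values; the difference is only whether the original ordering by date is preserved.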
Here is the histogram of the new DF:
library(dplyr)
library(ggplot2)
stepsDay_new<- DF_new %>%
select(steps, date) %>%
group_by(date) %>%
summarise(sumSteps=sum(steps))
ggplot(data=stepsDay_new, aes(sumSteps)) +
geom_histogram(binwidth = 1000) +
labs(title="Steps taken each day (new DF)",
x = "Number of steps each day", y = "Number of days")
Here are the mean and median of the new DF:
mean(stepsDay_new$sumSteps, na.rm=TRUE)
## [1] 10766.19
median(stepsDay_new$sumSteps, na.rm=TRUE)
## [1] 10766.19
summary(stepsDay$sumSteps) # before
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 41 8841 10760 10770 13290 21190 8
summary(stepsDay_new$sumSteps) # now
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 9819 10770 10770 12810 21190
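The estimates have changed as follows:
- The histogram now contains more observations, because the 8 days that were previously NA now have a total
- The mean stays the same: the 2304 missing values cover 8 complete days of 288 intervals each, so every filled-in day receives a total equal to the mean daily total
- The median is now equal to the mean, because those 8 filled-in days all sit at 10766.19, in the middle of the distribution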
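The effect on the mean and median can be reproduced with a few made-up daily totals. This is only an illustrative sketch; the numbers and the names `totals` and `filled` are hypothetical and not taken from the data set:

```r
# Hypothetical daily totals: three days are entirely missing
totals <- c(NA, NA, NA, 100, 150, 350)

meanBefore   <- mean(totals, na.rm = TRUE)    # 200
medianBefore <- median(totals, na.rm = TRUE)  # 150

# Fill every missing day with the mean of the observed days
filled <- ifelse(is.na(totals), meanBefore, totals)

meanAfter   <- mean(filled)    # still 200: adding copies of the mean keeps the mean
medianAfter <- median(filled)  # 200: the filled-in days now occupy the middle
```

Adding values equal to the mean cannot move the mean, but it does pull the median towards it and shrinks the spread of the distribution.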
For this part we are going to use the data set with the filled-in missing values.
We add a new column that labels each observation as Weekday or Weekend.
library(dplyr)
# wday is 0 for Sunday and 6 for Saturday, independent of the system locale
DF_new$weekday <- as.factor(
ifelse(as.POSIXlt(DF_new$date)$wday %in% c(0, 6), "Weekend", "Weekday")
)
str(DF_new)
## 'data.frame': 17568 obs. of 4 variables:
## $ steps : num 0 0 0 0 0 0 0 0 0 0 ...
## $ date : Date, format: "2012-10-02" "2012-10-02" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## $ weekday : Factor w/ 2 levels "Weekday","Weekend": 1 1 1 1 1 1 1 1 1 1 ...
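A quick way to check the classification is to run it on a week with known days. The sketch below uses as.POSIXlt()$wday, which, unlike the day names returned by weekdays(), does not depend on the language of the system locale. The `dates` and `dayType` names are illustrative only:

```r
# 2012-10-01 was a Monday, so 2012-10-06 and 2012-10-07 form a weekend
dates <- as.Date("2012-10-01") + 0:6

# as.POSIXlt()$wday counts days from Sunday (0) to Saturday (6)
dayType <- ifelse(as.POSIXlt(dates)$wday %in% c(0, 6), "Weekend", "Weekday")
```

The first five entries of `dayType` should be "Weekday" and the last two "Weekend", whatever locale the session runs in.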
Now we make a panel plot with a time series of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all weekdays or all weekend days (y-axis):
library(dplyr)
library(lattice)
DF_new_weekday <- DF_new %>%
group_by(interval, weekday) %>%
summarise(steps = mean( steps, na.rm=TRUE ))
xyplot(steps~interval | weekday, data = DF_new_weekday,
type="l", ylab="Number of steps",
xlab="Interval",
layout=c(1,2))