Author: Jerome Cholewa
First we set echo = TRUE as the default chunk option
library(knitr)
opts_chunk$set(echo = TRUE)
First we read in the data
movedata <- read.csv("activity.csv")
We select only the data without NAs
movedata_noNA <- movedata[!is.na(movedata$steps),]
We aggregate the number of steps by day
stepsPerDay <- aggregate(movedata_noNA$steps, by = list(date = movedata_noNA$date), FUN = sum)
names(stepsPerDay) <- c("Date", "Steps")
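As a small illustration of the per-day aggregation, here is a sketch on a toy data frame. The data frame `toy` and its values are made up for this example and are not taken from activity.csv; the formula interface of aggregate is used here as an equivalent alternative to the `by =` form above.

```r
# Toy data: two days, three intervals each (values invented for illustration)
toy <- data.frame(
  steps    = c(0, 10, 5, 20, NA, 30),
  date     = rep(c("2012-10-02", "2012-10-03"), each = 3),
  interval = rep(c(0L, 5L, 10L), times = 2)
)

# Drop rows with missing steps, as done for movedata above
toy_noNA <- toy[!is.na(toy$steps), ]

# Sum the steps within each date using the formula interface
toyPerDay <- aggregate(steps ~ date, data = toy_noNA, FUN = sum)
names(toyPerDay) <- c("Date", "Steps")
toyPerDay
```

The result has one row per date, with the NA row simply left out of the second day's total.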
We can plot a histogram of the number of steps per day. The dashed line represents the mean.
par(mfrow = c(1,1), mar=c(5.1,4.1,4.1,0.4))
cutpoints <- quantile (stepsPerDay$Steps, seq(0,1, .1))
with(stepsPerDay, hist(Steps, col = "green",
main = "Number of steps per day",
xlab = "Steps", breaks = 10))
abline( v = mean(stepsPerDay$Steps), lty = 3, lwd = 4)
The mean and the median of the total number of steps per day are very close, yet different.
mean(stepsPerDay$Steps)
## [1] 10766.19
median(stepsPerDay$Steps)
## [1] 10765
To analyse the daily pattern, I wrote a function that aggregates the data by interval and converts the interval identifier (an integer such as 835) into a proper time value, so that we can plot a time series later. The function is called aggregateStepsInt:
aggregateStepsInt <- function (DF) {
DFsimple <- DF[,1:3]
stepsPerInt <- aggregate(DFsimple[,1],
by = list(Interval = DFsimple$interval),
FUN = mean)
names(stepsPerInt) <- c("Interval", "AverageSteps")
# convert interval from integer to string with 4 digits (e.g. "0025")
stepsPerInt$time <- sprintf("%04d", stepsPerInt$Interval)
# add a colon after the 2nd digit
stepsPerInt$time <- sub("^([0-9]{2})([0-9]+)", "\\1:\\2",
stepsPerInt$time)
# convert time from string to time
stepsPerInt$time <- strptime( stepsPerInt$time, "%H:%M")
stepsPerInt
}
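To make the interval-to-time conversion inside aggregateStepsInt easier to follow, here is the same sequence of steps applied to a few hand-picked interval identifiers. The vector `intervals` is chosen for illustration only.

```r
# A few interval identifiers as they appear in the interval column
intervals <- c(0L, 5L, 835L, 2355L)

# Step 1: left-pad with zeros to four digits
padded <- sprintf("%04d", intervals)

# Step 2: insert a colon after the first two digits
withColon <- sub("^([0-9]{2})([0-9]+)", "\\1:\\2", padded)

# Step 3: parse as a time; strptime fills in the current date
parsed <- strptime(withColon, "%H:%M")

# Show only the time part
format(parsed, "%H:%M")
```

Interval 835 therefore becomes "0835", then "08:35", and is finally parsed as 08:35 on the day the code is run.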
We will use this function now and again later, when we compare weekdays with weekends. First we apply it to the original data set (without NAs):
stepsPerInterval <- aggregateStepsInt(movedata_noNA)
That gives a data frame with 288 observations (one per 5-minute interval in a day) and the average number of steps in each interval.
In order to plot this time series, I created a plot function (that we will re-use later). This uses the base plot system.
plotStepsInt <- function (DF, plottitle = "") {
with(DF, plot(time , AverageSteps, col = "blue", type = "l",
main = "",
yaxt = "n",
ylab = "",
xlab = "",
ylim= c(0,240)))
#format Y-axis
yat <- seq(from = 0 , to = 200, by = 50)
axis(side = 2, at = yat , labels = TRUE)
#format X-axis
r <- as.POSIXct(round(range(DF$time), "hours")) # use the function argument, not the global stepsPerInterval
axis.POSIXct(1, at = seq(r[1], r[2], by = "hour"), format = "%H:%M")
text(as.POSIXct("11:00", format = "%H:%M"), 180, labels = plottitle, pos = 4, cex = 1.5)
}
First we set some parameters and then use my plot function:
par(mfcol= c(1,1), las = 2, mar=c(5.1,4.1,4.1,0.4))
plotStepsInt(stepsPerInterval, "On all days")
title(main="Average number of steps per interval",
xlab="Time (hh:mm)", ylab = "Number of steps")
Then we calculate the maximum of the interval averages and select the observation containing it:
maxSteps <- max(stepsPerInterval$AverageSteps)
stepsPerInterval[stepsPerInterval$AverageSteps == maxSteps,]
## Interval AverageSteps time
## 104 835 206.1698 2016-09-28 08:35:00
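The maximum is about 206.17 steps on average, and it occurs in the 8:35 am interval. The date shown in the time column is only a placeholder: strptime fills in the current date when the input contains just a time.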
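An alternative way to pick the busiest interval is which.max, sketched here on a toy data frame. The name `toyInt` and its values are invented for this example.

```r
# Toy interval averages (values invented for illustration)
toyInt <- data.frame(
  Interval     = c(830L, 835L, 840L),
  AverageSteps = c(177.3, 206.2, 195.9)
)

# which.max returns the index of the first maximum
peak <- toyInt[which.max(toyInt$AverageSteps), ]
peak
```

This avoids comparing floating-point values with `==`, at the cost of returning only the first row if several intervals tie for the maximum.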
We calculate the number of NA values
sum(is.na(movedata$steps))
## [1] 2304
We now need to understand where the missing values are. To that end, we select only the rows of the original data set with NAs and list the distinct dates:
movedataNA <- movedata[is.na(movedata$steps),2]
daysNA <- unique(movedataNA)
daysNA
## [1] 2012-10-01 2012-10-08 2012-11-01 2012-11-04 2012-11-09 2012-11-10
## [7] 2012-11-14 2012-11-30
## 61 Levels: 2012-10-01 2012-10-02 2012-10-03 2012-10-04 ... 2012-11-30
Eight days contain NA values. But do some of those days also contain numeric values? Let’s find out by selecting the data only on those days and checking for any non-NA values:
movedatadaysNA <- movedata[movedata$date %in% daysNA,]
sum(!is.na(movedatadaysNA$steps))
## [1] 0
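The same question can also be answered by counting NAs per day; this is consistent with the totals above, since 8 days of 288 intervals give 8 × 288 = 2304 missing values. Below is a minimal sketch on a toy data frame with three intervals per day. The names `toy`, `naPerDay` and `obsPerDay` are made up for this example.

```r
# Toy data: three days with three intervals each; two days are entirely missing
toy <- data.frame(
  steps = c(NA, NA, NA, 4, 0, 7, NA, NA, NA),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-08"), each = 3)
)

# Number of missing values and number of rows per day
naPerDay  <- tapply(is.na(toy$steps), toy$date, sum)
obsPerDay <- tapply(toy$steps, toy$date, length)

# Days whose rows are all missing
fullyMissing <- names(naPerDay)[naPerDay == obsPerDay]

# Days with some, but not all, rows missing
partlyMissing <- names(naPerDay)[naPerDay > 0 & naPerDay < obsPerDay]

fullyMissing
partlyMissing
```

An empty `partlyMissing` corresponds to the situation found in the real data: a day is either complete or entirely missing.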
Since the sum is 0, the days that contain NAs contain only NAs. We therefore cannot replace a missing value with the mean of its own day. Instead, I chose to replace each NA in a given interval with the mean number of steps in that interval, averaged over the days that have data.
We start by making a copy of the original dataset and running a for-loop on each interval:
movedataFilled <- movedata
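for (interv in stepsPerInterval$Interval) {
# rows of this interval whose steps value is missing
rowsToFill <- is.na(movedataFilled$steps) & movedataFilled$interval == interv
# replace them with the average for this interval
movedataFilled$steps[rowsToFill] <- stepsPerInterval$AverageSteps[stepsPerInterval$Interval == interv]
}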
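The loop can also be written without iterating, by looking up each missing row's interval mean with match. This is a sketch of an alternative on toy data, not the method used in this report; `toy`, `intervalMeans` and `toyFilled` are made-up names.

```r
# Toy data: the first day is entirely missing (values invented for illustration)
toy <- data.frame(
  steps    = c(NA, NA, 2, 4, 6, 8),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0L, 5L), times = 3)
)

# Mean steps per interval; rows with NA steps are dropped by the formula method
intervalMeans <- aggregate(steps ~ interval, data = toy, FUN = mean)

# For every missing row, find the position of its interval in intervalMeans
toyFilled <- toy
missing   <- is.na(toyFilled$steps)
idx       <- match(toyFilled$interval[missing], intervalMeans$interval)

# Fill in the corresponding interval means
toyFilled$steps[missing] <- intervalMeans$steps[idx]
toyFilled
```

Here interval 0 has mean 4 and interval 5 has mean 6, so the two missing rows of the first day are filled with 4 and 6.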
We can now compare the original data with the filled data and the average data
head(movedata[movedata$date == daysNA[1],])
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
head(movedataFilled[movedataFilled$date == daysNA[1],])
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
head(stepsPerInterval)
## Interval AverageSteps time
## 1 0 1.7169811 2016-09-28 00:00:00
## 2 5 0.3396226 2016-09-28 00:05:00
## 3 10 0.1320755 2016-09-28 00:10:00
## 4 15 0.1509434 2016-09-28 00:15:00
## 5 20 0.0754717 2016-09-28 00:20:00
## 6 25 2.0943396 2016-09-28 00:25:00
We can now aggregate the number of steps per day with that new data set:
stepsPerDayFilled <- aggregate(movedataFilled$steps, by = list(date = movedataFilled$date), FUN = sum)
names(stepsPerDayFilled) <- c("Date", "Steps")
We can now plot a similar histogram as before with the missing data filled in:
cutpointsFilled <- quantile (stepsPerDayFilled$Steps, seq(0,1, .1))
par(mfrow = c(1,1), mar=c(5.1,4.1,4.1,0.4))
with(stepsPerDayFilled, hist(Steps, col = "green",
main = "Number of steps per day \n- after replacing missing values",
xlab = "Steps", breaks = 10))
abline( v = mean(stepsPerDayFilled$Steps), lty = 3, lwd = 4)
Let’s compare the summaries
summary(stepsPerDay)
## Date Steps
## 2012-10-02: 1 Min. : 41
## 2012-10-03: 1 1st Qu.: 8841
## 2012-10-04: 1 Median :10765
## 2012-10-05: 1 Mean :10766
## 2012-10-06: 1 3rd Qu.:13294
## 2012-10-07: 1 Max. :21194
## (Other) :47
summary(stepsPerDayFilled)
## Date Steps
## 2012-10-01: 1 Min. : 41
## 2012-10-02: 1 1st Qu.: 9819
## 2012-10-03: 1 Median :10766
## 2012-10-04: 1 Mean :10766
## 2012-10-05: 1 3rd Qu.:12811
## 2012-10-06: 1 Max. :21194
## (Other) :55
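Filling in the missing values raises the median by about one step (from 10765 to 10766) and leaves the mean unchanged. The minimum and maximum do not move, but the quartiles do: the first quartile rises from 8841 to 9819 and the third quartile falls from 13294 to 12811.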
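One plausible explanation for this pattern, assuming every day with data is complete: each filled day then receives a daily total equal to the mean daily total, and adding copies of the mean to a sample leaves the mean unchanged while pulling the median and quartiles toward it. A minimal sketch with made-up numbers (`x` and `filled` are hypothetical):

```r
# Daily totals for five observed days (values invented for illustration)
x <- c(2, 4, 9, 13, 22)

# Add three "filled" days, each equal to the mean of the observed days
filled <- c(x, rep(mean(x), 3))

mean(x)        # mean before filling
mean(filled)   # mean after filling: unchanged
median(x)      # median before filling
median(filled) # median after filling: now equal to the mean
```

The extra values sit in the middle of the distribution, which is why the extremes stay put while the quartiles shift inward.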
First we need to create a new variable showing the day of the week, then a factor column indicating whether that day is a weekday or falls on a weekend:
movedataFilled$Weekday <- weekdays(strptime(movedataFilled$date, format = "%Y-%m-%d"))
movedataFilled$DayType <- "Weekday" # all days are weekdays by default
# classify Saturday and Sunday as weekend (weekdays() returns locale-dependent names; an English locale is assumed)
movedataFilled$DayType[movedataFilled$Weekday %in% c("Saturday", "Sunday")] <- "Weekend"
movedataFilled$DayType <- as.factor(movedataFilled$DayType)
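Because weekdays() returns day names in the current locale, the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch can use the numeric `wday` field of POSIXlt instead; the vector `dates` below is chosen for illustration.

```r
# Four consecutive dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Classify without relying on day names
dayType <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                  levels = c("Weekday", "Weekend"))
dayType
```

The result is the same two-level factor as above, regardless of the language settings of the machine.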
We now split the data into two data frames and aggregate each one to get the average number of steps per interval:
# separate into 2 dataframes weekdays and weekend
moveWeekDays <- movedataFilled[ movedataFilled$DayType == "Weekday" ,]
moveWeekend <- movedataFilled[ movedataFilled$DayType == "Weekend" ,]
######## average daily activity patterns WeekDays and Weekends####
stepsPerIntWD <- aggregateStepsInt(moveWeekDays)
stepsPerIntWE <- aggregateStepsInt(moveWeekend)
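As an aside, the split into two data frames is not strictly required for the averaging itself: aggregate can group by interval and day type in a single call. The sketch below uses toy data with made-up values and does not add the time column that aggregateStepsInt produces.

```r
# Toy data: two intervals, observed on two weekdays and two weekend days
toy <- data.frame(
  steps    = c(10, 20, 30, 40, 2, 4, 6, 8),
  interval = rep(c(0L, 5L), times = 4),
  DayType  = factor(rep(c("Weekday", "Weekend"), each = 4))
)

# Average steps for each combination of interval and day type
avgByType <- aggregate(steps ~ interval + DayType, data = toy, FUN = mean)
avgByType
```

This gives one row per interval and day type, which is convenient when the two patterns are to be compared in a single table.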
Finally we can plot the 2 datasets on top of each other to be able to compare them:
par( xaxt = "n", mar=c(0,4.1,3.1,0.4), mfcol= c(2,1), las = 2)
plotStepsInt(stepsPerIntWD, "On weekdays") # also draws the "On weekdays" label
title(main="Average number of steps per interval", ylab = "Number of steps")
par(xaxt = "s", mar = c(5.1,4.1,0,.4))
plotStepsInt(stepsPerIntWE, "On weekends")
title(xlab = "Time (hh:mm)")
In conclusion, there seem to be fewer steps on weekend mornings, but slightly more steps around midday on weekends. The peak around 7 pm on weekdays disappears on weekends. However, around 8:30 pm there are more steps on weekends than on weekdays.