Reading data and displaying a summary of the data.
alldata <- read.csv("./repdata%2Fdata%2Factivity/activity.csv")
alldata$date <- as.Date(alldata$date)
alldata$weekday <- factor(weekdays(alldata$date), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
head(alldata)
## steps date interval weekday
## 1 NA 2012-10-01 0 Monday
## 2 NA 2012-10-01 5 Monday
## 3 NA 2012-10-01 10 Monday
## 4 NA 2012-10-01 15 Monday
## 5 NA 2012-10-01 20 Monday
## 6 NA 2012-10-01 25 Monday
summary(alldata$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 37.38 12.00 806.00 2304
total_steps_daywise <- tapply(alldata$steps, alldata$date, sum)
head(total_steps_daywise)
## 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-06
## NA 126 11352 12116 13294 15420
I have used the base plot system to create a histogram of the total number of steps taken each day.
hist(total_steps_daywise, main = "Histogram of the total number of steps taken each day", xlab = "Total number of steps taken", ylab = "Frequency", col = "steel blue", breaks = 10)
mean <- round(mean(total_steps_daywise, na.rm = TRUE), digits = 1)
median <- round(median(total_steps_daywise, na.rm = TRUE), digits = 1)
The mean of the total number of steps taken per day is 10766.2 and the median of the total number of steps taken per day is 10765.
Calculating the mean number of steps for every 5-minute interval, averaged across all days.
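When values like these are embedded in the text from R, larger numbers can end up rendered in scientific notation (1.07662 × 10^4 rather than 10766.2). A minimal sketch of formatting them as plain decimals follows. The names `mean_steps_per_day` and `median_steps_per_day` are hypothetical and are not used elsewhere in this analysis.

```r
# Hypothetical names; the values are the rounded mean and median of the daily totals.
mean_steps_per_day <- 10766.2
median_steps_per_day <- 10765

# scientific = FALSE forces fixed notation; nsmall keeps one decimal place
mean_txt <- format(mean_steps_per_day, nsmall = 1, scientific = FALSE)
median_txt <- format(median_steps_per_day, scientific = FALSE)

cat("Mean:", mean_txt, "Median:", median_txt, "\n")
```

The resulting character strings can then be placed in the text instead of the raw numeric values.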
total_steps_intervalwise <- tapply(alldata$steps, alldata$interval, mean, na.rm = TRUE)
Plotting the line diagram
plot(alldata$interval[1:288], total_steps_intervalwise, col = "red", main = "Time Series Plot for each 5-minute interval", xlab = "Intervals --->", ylab = "Average number of Steps --->", type = "l")
max <- alldata$interval[which.max(total_steps_intervalwise)]
The 5-minute interval with the maximum average number of steps is the interval with identifier 835.
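The interval identifiers in this dataset run up to 2355 across 288 five-minute intervals, which suggests that they encode the time of day as hours and minutes, so 835 would correspond to 08:35. A small sketch of that conversion, assuming this encoding holds. `interval_to_clock` is a hypothetical helper, not part of the analysis.

```r
# Hypothetical helper: assumes identifiers are HHMM with leading zeros dropped
interval_to_clock <- function(interval) {
  # Pad to four digits (e.g. 835 -> "0835"), then insert a colon
  padded <- sprintf("%04d", as.integer(interval))
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_clock(c(0, 835, 2355))
```

Under the same assumption, the position of an interval within the day can be recovered as `hours * 12 + minutes / 5 + 1`.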
nas <- sum(is.na(alldata$steps))
The total number of rows with NAs is 2304.
The strategy for filling in the missing values in the steps column is to replace each NA with the mean number of steps for the same 5-minute interval. The data are grouped by interval, and each NA is overwritten with the mean calculated for its interval.
The above strategy has been applied to create the new dataset. The new data set is called imputed_data.
#loading libraries
library(dplyr)
#Function to replace each NA in a vector with the mean of its non-missing values.
mean_steps <- function(num) replace(num, is.na(num), mean(num, na.rm = TRUE))
#Data Imputed with the above strategy
imputed_data <- alldata %>% group_by(interval) %>% mutate(steps = mean_steps(steps))
#Displaying Imputed data
head(imputed_data)
## # A tibble: 6 x 4
## # Groups: interval [6]
## steps date interval weekday
## <dbl> <date> <int> <fct>
## 1 1.72 2012-10-01 0 Monday
## 2 0.340 2012-10-01 5 Monday
## 3 0.132 2012-10-01 10 Monday
## 4 0.151 2012-10-01 15 Monday
## 5 0.0755 2012-10-01 20 Monday
## 6 2.09 2012-10-01 25 Monday
summary(imputed_data)
## steps date interval weekday
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0 Monday :2592
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8 Tuesday :2592
## Median : 0.00 Median :2012-10-31 Median :1177.5 Wednesday:2592
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5 Thursday :2592
## 3rd Qu.: 27.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2 Friday :2592
## Max. :806.00 Max. :2012-11-30 Max. :2355.0 Saturday :2304
## Sunday :2304
new_nas <- sum(is.na(imputed_data$steps))
The imputed data has 0 NA values.
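The same interval-mean strategy can also be sketched in base R with `ave()`, without loading dplyr. The example below runs on a small made-up data frame called `toy`. Its values are illustrative only and are not taken from the activity data.

```r
# Made-up data: two intervals, one missing value in each
toy <- data.frame(
  steps    = c(NA, 10, 4, 6, NA, 2),
  interval = c(0, 0, 5, 5, 5, 0)
)

# ave() applies the function within each interval group and returns
# a vector aligned with the original row order
toy$steps <- ave(toy$steps, toy$interval, FUN = function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
})

toy
sum(is.na(toy$steps))
```

Because `ave()` returns a plain vector, the result stays an ordinary data frame with no grouping attached.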
Calculating the new total number of steps taken each day and plotting histograms of the imputed and original daily totals side by side.
new_total_steps_daywise <- tapply(imputed_data$steps, imputed_data$date, sum, na.rm = TRUE)
new_mean <- round(mean(new_total_steps_daywise), digits = 1)
new_median <- round(median(new_total_steps_daywise), digits = 1)
par(mfrow = c(1,2))
hist(new_total_steps_daywise, main = "Imputed Data", xlab = "Total number of steps taken --->", ylab = "Frequency --->", col = "steel blue", breaks = 10, ylim = c(0, 25))
abline(v = new_mean, col = "white")
hist(total_steps_daywise, main = "Original Data", xlab = "Total number of steps taken --->", ylab = "Frequency --->", col = "steel blue", breaks = 10, ylim = c(0, 25))
abline(v= mean, col = "white")
The new mean with the imputed data is 10766.2 and the old mean was 10766.2. The new median with the imputed data is 10766.2 while the old median was 10765.
Imputing the data has no effect on the mean and causes only a small change in the median of the total number of steps taken per day.
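One possible explanation, assuming the missing values fall on whole days (2304 NAs would be 8 days of 288 intervals each): every fully missing day is filled with the interval means, so its total equals the average daily total and the mean does not move. A toy sketch with made-up numbers, not the activity data:

```r
# Made-up example: 3 days x 2 intervals, day "d3" entirely missing
toy <- data.frame(
  day      = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, 30, 40, NA, NA),
  stringsAsFactors = FALSE
)

# Mean of the daily totals over the observed days only
observed <- toy[!is.na(toy$steps), ]
old_mean <- mean(tapply(observed$steps, observed$day, sum))

# Fill each NA with the mean of its interval, then recompute
toy$steps <- ave(toy$steps, toy$interval, FUN = function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
})
new_mean <- mean(tapply(toy$steps, toy$day, sum))

c(old = old_mean, new = new_mean)
```

Under the same assumption, the filled days all share one identical total equal to the mean, which would also explain why the median moves onto the mean.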
alldata$DayType <- ifelse(alldata$weekday == "Saturday" | alldata$weekday == "Sunday", "Weekend", "Weekday")
head(alldata)
## steps date interval weekday DayType
## 1 NA 2012-10-01 0 Monday Weekday
## 2 NA 2012-10-01 5 Monday Weekday
## 3 NA 2012-10-01 10 Monday Weekday
## 4 NA 2012-10-01 15 Monday Weekday
## 5 NA 2012-10-01 20 Monday Weekday
## 6 NA 2012-10-01 25 Monday Weekday
weekday_alldata <- alldata[alldata$DayType == "Weekday", ]
weekend_alldata <- alldata[alldata$DayType == "Weekend", ]
weekday_intervalwise <- tapply(weekday_alldata$steps, weekday_alldata$interval, mean, na.rm = TRUE)
weekend_intervalwise <- tapply(weekend_alldata$steps, weekend_alldata$interval, mean, na.rm = TRUE)
Plotting the line diagram for weekday and weekend
plot(alldata$interval[1:288], weekday_intervalwise, col = "red", main = "Weekday Time Series Plot", xlab = "Intervals", ylab = "Avg. no. of steps", type = "l", ylim = c(0,250))
abline(h = mean(weekday_intervalwise), col = "red")
plot(alldata$interval[1:288], weekend_intervalwise, col = "green", main = "Weekend Time Series Plot", xlab = "Intervals", ylab = "Avg. no. of steps", type = "l", ylim = c(0,250))
abline(h = mean(weekend_intervalwise), col = "green")
There is a visible difference between weekday and weekend activity. Weekdays appear to have more peaks and a lower standard deviation, while weekends have fewer peaks. Both weekdays and weekends seem to have a similar mean number of steps per interval.
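These visual impressions could be checked with summary statistics. The sketch below compares the mean and standard deviation of interval averages per day type on a made-up data frame, `profile`, which is hypothetical. Its numbers say nothing about the real data.

```r
# Made-up interval averages for illustration only
profile <- data.frame(
  DayType   = rep(c("Weekday", "Weekend"), each = 4),
  avg_steps = c(0, 100, 20, 40, 30, 50, 40, 40),
  stringsAsFactors = FALSE
)

# Mean and standard deviation of the interval averages per day type
means <- tapply(profile$avg_steps, profile$DayType, mean)
sds   <- tapply(profile$avg_steps, profile$DayType, sd)

rbind(mean = means, sd = sds)
```

With the objects from this analysis, the analogous calls would be `sd(weekday_intervalwise)` and `sd(weekend_intervalwise)`, alongside the `mean()` values already drawn as horizontal lines.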