Load the data with the “readr” package, declaring the type of the date and steps columns at load time.
suppressWarnings(library(readr))
activity <- read_csv("activity.csv", col_types = cols(date = col_date(format = "%Y-%m-%d"),steps = col_double()))
Create a new table with the “dplyr” package that holds the total number of steps taken on each date.
suppressWarnings(suppressMessages(library(dplyr)))
activity.perDaySteps <- activity %>% group_by(date) %>% summarize(perDaySteps = sum(steps, na.rm = TRUE))
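One consequence of summing with na.rm = TRUE is worth keeping in mind when reading the mean and median of these totals: a day whose readings are all missing is reported as 0 steps rather than as missing. The sketch below illustrates this with base R on a small made-up data frame (the name `toy` and its values are hypothetical, not taken from activity.csv).

```r
# Hypothetical toy data: the second day has no recorded steps at all.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(10, 20, NA, NA)
)

# Per-day totals with NAs dropped: the all-NA day sums to 0.
totals_na_rm <- tapply(toy$steps, format(toy$date), sum, na.rm = TRUE)

# Per-day totals with NAs kept: the all-NA day stays NA.
totals_keep_na <- tapply(toy$steps, format(toy$date), sum)

print(totals_na_rm)
print(totals_keep_na)
```

Whether 0 or NA is the better summary for such a day depends on the question being asked; the code above only shows the difference.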
Histogram of the total number of steps taken each day, drawn with “ggplot2”:
suppressWarnings(library(ggplot2))
g <- ggplot(activity.perDaySteps, aes(x = perDaySteps))
g + geom_histogram(binwidth = 1000, fill = "blue") + theme(plot.title = element_text(hjust = 0.5)) + labs(x = "No of Steps/Day", y = "Count", title = "Total Number of Steps Taken Per Day")
Mean total number of steps taken per day:
mean(activity.perDaySteps$perDaySteps, na.rm = TRUE)
## [1] 9354.23
Median total number of steps taken per day:
median(activity.perDaySteps$perDaySteps, na.rm = TRUE)
## [1] 10395
Create a new table that contains each 5-minute interval and the number of steps taken in it, averaged across all days.
activity.timeAverage <- activity %>% group_by(interval) %>% summarize(avgSteps = mean(steps, na.rm = TRUE))
Time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
g <- ggplot(data = activity.timeAverage, aes(x = interval, y = avgSteps))
g + geom_line(color = "blue") + labs(title = "Time Series Plot of 5-min Interval and Average Number of Steps", x = "Interval", y = "Average Steps") + theme(plot.title = element_text(hjust = 0.5))
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
activity.timeAverage[activity.timeAverage$avgSteps == max(activity.timeAverage$avgSteps),1]
## # A tibble: 1 x 1
## interval
## <dbl>
## 1 835
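The same lookup can also be written with which.max(), which returns the position of the first maximum. The sketch below uses a small hypothetical stand-in (`avg`) for the averaged table, so the values are illustrative only.

```r
# Hypothetical stand-in for the table of per-interval averages.
avg <- data.frame(
  interval = c(0, 5, 10, 15),
  avgSteps = c(1.5, 7.25, 3.0, 7.0)
)

# which.max() gives the row index of the first maximum of avgSteps.
peak <- avg[which.max(avg$avgSteps), ]
print(peak)
```

One difference from the comparison with max(): if several intervals tie for the maximum, the `==` form returns all of them, while which.max() returns only the first.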
Total number of missing values in the dataset (i.e. the total number of rows where steps is NA):
sum(is.na(activity$steps))
## [1] 2304
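A count of missing values is easier to judge next to the share of rows it represents. As a minimal sketch on a hypothetical vector (`steps` below is made up), the mean of is.na() gives that share directly, because TRUE counts as 1 and FALSE as 0.

```r
# Hypothetical step counts with three missing readings out of eight.
steps <- c(NA, 0, 12, NA, 7, 3, NA, 5)

n_missing     <- sum(is.na(steps))   # number of missing values
share_missing <- mean(is.na(steps))  # fraction of values that are missing

print(n_missing)
print(share_missing)
```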
Because the dataset has many missing values, each one is filled with the average number of steps for the same 5-minute interval across all days. The result is stored in a new dataset named ‘activity.updated’.
activity.updated <- activity %>% group_by(interval) %>% mutate(steps = if_else(is.na(steps), mean(steps, na.rm = TRUE), steps))
head(activity.updated)
## # A tibble: 6 x 3
## # Groups: interval [6]
## steps date interval
## <dbl> <date> <dbl>
## 1 1.72 2012-10-01 0
## 2 0.340 2012-10-01 5
## 3 0.132 2012-10-01 10
## 4 0.151 2012-10-01 15
## 5 0.0755 2012-10-01 20
## 6 2.09 2012-10-01 25
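The imputation rule can be checked by hand on a small example. The sketch below reproduces the same idea in base R with ave(), on a hypothetical data frame (`toy` and the column `steps_filled` are made-up names); it is an illustration of the technique, not a replacement for the dplyr pipeline above.

```r
# Hypothetical toy data: three days, two intervals, two missing readings.
toy <- data.frame(
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, 4, 2, NA, 6, 8)
)

# Mean of the observed steps within each interval, repeated row by row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched.
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
print(toy)
```

Here interval 0 has observed values 2 and 6, so its missing reading becomes 4; interval 5 has observed values 4 and 8, so its missing reading becomes 6.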
To draw the histogram of steps taken each day after imputation, we first build a new per-day table from ‘activity.updated’ and then plot it.
activity.updated.perDaySteps <- activity.updated %>% group_by(date) %>% summarize(perDaySteps = sum(steps, na.rm = TRUE))
g <- ggplot(activity.updated.perDaySteps, aes(x = perDaySteps))
g + geom_histogram(binwidth = 1000, fill = "blue") + theme(plot.title = element_text(hjust = 0.5)) + labs(x = "No of Steps/Day", y = "Count", title = "Total Number of Steps Taken Per Day Using Updated Table")
Mean total number of steps taken per day according to updated table:
mean(activity.updated.perDaySteps$perDaySteps, na.rm = TRUE)
## [1] 10766.19
Median total number of steps taken per day according to updated table:
median(activity.updated.perDaySteps$perDaySteps, na.rm = TRUE)
## [1] 10766.19
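Comparing the two histograms and summary values, both the mean and the median have changed: the mean rose from 9354.23 to 10766.19 and the median from 10395 to 10766.19. The totals rise because missing values, which previously contributed nothing to the daily sums, are now replaced by interval averages. After imputation the mean and the median are identical.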
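One plausible explanation for the mean and median coinciding, offered here as an assumption rather than something checked against activity.csv: if the missing values cover whole days, every such day is filled with the same interval averages and so receives the same daily total, which equals the mean of the observed daily totals. With several identical days sitting exactly at the mean, the median can land on that value too. The toy sketch below (all names and numbers are hypothetical) shows the effect.

```r
# Hypothetical toy data: five days, two intervals; days 2 and 4 are fully missing.
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:4, each = 2),
  interval = rep(c(0, 5), times = 5),
  steps    = c(2, 4,  NA, NA,  6, 8,  NA, NA,  1, 3)
)

# Fill each missing reading with the mean of its interval.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

# Daily totals after imputation, as a plain numeric vector.
daily_totals <- as.numeric(tapply(filled, format(toy$date), sum))

print(daily_totals)
print(mean(daily_totals))
print(median(daily_totals))
```

In this toy case the observed days total 6, 14 and 4 (mean 8), both imputed days total 8, and the mean and median of the five totals are both 8.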
Create a new variable that distinguishes weekdays from weekends.
weekend <- c("Saturday","Sunday")
activity.updated <- activity.updated %>% mutate(weekday = if_else(weekdays(date) %in% weekend, "weekend", "weekday"))
head(activity.updated)
## # A tibble: 6 x 4
## # Groups: interval [6]
## steps date interval weekday
## <dbl> <date> <dbl> <chr>
## 1 1.72 2012-10-01 0 weekday
## 2 0.340 2012-10-01 5 weekday
## 3 0.132 2012-10-01 10 weekday
## 4 0.151 2012-10-01 15 weekday
## 5 0.0755 2012-10-01 20 weekday
## 6 2.09 2012-10-01 25 weekday
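Note that weekdays() returns day names in the current locale, so matching against "Saturday" and "Sunday" assumes an English locale. A locale-independent variant, sketched here on a few hypothetical dates, uses the numeric day of the week from as.POSIXlt() and stores the result as a factor (the names `dates`, `wday` and `day_type` are illustrative).

```r
# Hypothetical dates: a Monday, a Saturday and a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# as.POSIXlt()$wday is numeric: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday <- as.POSIXlt(dates)$wday

# Label Saturday (6) and Sunday (0) as weekend, everything else as weekday.
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(day_type)
```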
Next we compute the average number of steps in each interval separately for weekdays and weekends. The new table holds the interval, the day type and the average steps, which is the shape needed for the plot.
activity.updated.avgStepsInt <- activity.updated %>% group_by(interval,weekday) %>% summarize(averageSteps = mean(steps))
head(activity.updated.avgStepsInt)
## # A tibble: 6 x 3
## # Groups: interval [3]
## interval weekday averageSteps
## <dbl> <chr> <dbl>
## 1 0 weekday 2.25
## 2 0 weekend 0.215
## 3 5 weekday 0.445
## 4 5 weekend 0.0425
## 5 10 weekday 0.173
## 6 10 weekend 0.0165
Plot the average number of steps per interval, with one panel for weekdays and one for weekends:
g <- ggplot(data = activity.updated.avgStepsInt, aes(x = interval, y = averageSteps, color = weekday))
g + geom_line() + facet_wrap(weekday~., nrow = 2, ncol = 1) + labs(x = "Interval", y = "Average Number of Steps", title = "Average Number of Steps vs Interval on Weekdays and Weekends") + theme(plot.title = element_text(hjust = 0.5))