Nowadays it is possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. Let's make use of data from a personal activity monitoring device in R and try some analysis.
Data Description
The data is a CSV file, downloaded as a zip archive from the URL in the code below. The device collects data at 5-minute intervals throughout the day. The dataset covers two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval each day.
# set the working directory (optional)
setwd("/Users/ranjeetapegu/Documents/Rjeeta_Rprograming/FitnessMonitoring")
# Create the Data directory if it does not already exist
if (!file.exists("./Data")) { dir.create("./Data") }
# URL for the data download
Durl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
# Download the data
download.file(url = Durl, destfile = "./Data/RepAct.zip", mode = "wb")
# Unzip the file
unzip("./Data/RepAct.zip", exdir = "./Data/")
# Read the CSV file into a data frame
activity <- read.csv("./Data/activity.csv", header = TRUE, sep = ",")
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
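The date column is read as text, so it can help to convert it to a proper Date before doing any date arithmetic. The following is a minimal sketch on a small inline CSV; the name `csv_text` and the step values in it are hypothetical stand-ins for activity.csv, not the real data.

```r
# Toy CSV text standing in for activity.csv (hypothetical values)
csv_text <- "steps,date,interval
NA,2012-10-01,0
NA,2012-10-01,5
47,2012-10-02,0
0,2012-10-02,5"

# read.csv() accepts literal text through the 'text' argument
toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Convert the date column from character to Date
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Inspect column types and the first values
str(toy)
```

The same `as.Date()` call could be applied to `activity$date` after reading the real file.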
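What is the mean total number of steps taken per day?
Let's find the mean of the total number of steps taken per day, ignoring the missing values (i.e. NA). The total number of steps taken per day is calculated by summing the steps grouped by date. Because missing values are dropped before summing, a day whose values are all missing gets a total of 0 rather than NA, as 2012-10-01 shows below.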
Daily.step <- aggregate(activity$steps,
                        by = list(activity$date),
                        FUN = sum, na.rm = TRUE)
names(Daily.step) <- c("date", "steps")
head(Daily.step)
## date steps
## 1 2012-10-01 0
## 2 2012-10-02 126
## 3 2012-10-03 11352
## 4 2012-10-04 12116
## 5 2012-10-05 13294
## 6 2012-10-06 15420
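The 0 for 2012-10-01 comes from summing a day with no recorded values. The sketch below, on hypothetical toy data, contrasts that behaviour with an alternative that keeps such days as NA. The helper `sum_or_na` is an illustrative name, not part of the original analysis, and whether this alternative is preferable depends on the question being asked.

```r
# Toy data: the first day has no recorded steps at all (hypothetical values)
toy <- data.frame(
  steps = c(NA, NA, 10, 20, NA, 5),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# Same approach as above: NAs are dropped inside sum(), so an all-NA day sums to 0
totals_zero <- aggregate(toy$steps, by = list(date = toy$date),
                         FUN = sum, na.rm = TRUE)

# Alternative: return NA for a day with no observed values (hypothetical helper)
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
totals_na <- aggregate(toy$steps, by = list(date = toy$date), FUN = sum_or_na)

# The zero pulls the mean of the daily totals down
mean(totals_zero$x)               # mean over 0, 30, 5
mean(totals_na$x, na.rm = TRUE)   # mean over 30, 5
```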
Let's find the mean and median of the total number of steps taken per day using the summary function.
summary(Daily.step$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 6778 10400 9354 12810 21190
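Depending on the R version, summary() may display rounded values, so calling mean() and median() directly is a simple way to get the unrounded figures. The sketch below uses only the first six daily totals printed earlier, so its results are for illustration and are not the two-month mean and median.

```r
# First six daily totals from the head(Daily.step) output above
daily_steps <- c(0, 126, 11352, 12116, 13294, 15420)

# Exact mean and median of these six values
six_day_mean   <- mean(daily_steps)
six_day_median <- median(daily_steps)

six_day_mean     # 8718
six_day_median   # 11734
```

On the full data the equivalent calls would be `mean(Daily.step$steps)` and `median(Daily.step$steps)`.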
What is the average daily activity pattern?
I will use a time series plot to find the average daily activity pattern. The graph shows the 5-minute interval on the x-axis and the number of steps taken, averaged across all days, on the y-axis. Which 5-minute interval is the most active?
Avgstep.timeIn <- aggregate(activity$steps,
                            by = list(activity$interval),
                            FUN = mean, na.rm = TRUE)
names(Avgstep.timeIn) <- c("interval", "avgsteps")
plot(Avgstep.timeIn$interval, Avgstep.timeIn$avgsteps,
     type = "l", col = "blue",
     main = "Time Series Avg Steps at 5min Interval",
     xlab = "Time Interval of 5 mins",
     ylab = "Average Steps")
# Most active 5-minute interval
maxstep_int <- Avgstep.timeIn[Avgstep.timeIn$avgsteps == max(Avgstep.timeIn$avgsteps), 1]
abline(v = maxstep_int, lty = 3, col = "red")
text(875, 5, maxstep_int, col = "black", adj = c(0, -.1))
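The most active interval can also be looked up with which.max(), and the interval label can be turned into a clock time. This is a sketch on hypothetical averages. It assumes the interval label encodes the time of day as HHMM (for example, 835 meaning 08:35), which the text above does not state explicitly.

```r
# Toy interval averages (hypothetical values)
toy_avg <- data.frame(
  interval = c(825, 830, 835, 840),
  avgsteps = c(155.4, 177.3, 206.2, 195.9)
)

# Row index of the largest average (the first one if there are ties)
peak_row      <- which.max(toy_avg$avgsteps)
peak_interval <- toy_avg$interval[peak_row]

# Assumption: the label is HHMM, so split it into hours and minutes
peak_time <- sprintf("%02d:%02d",
                     as.integer(peak_interval %/% 100),
                     as.integer(peak_interval %% 100))

peak_interval   # 835
peak_time       # "08:35"
```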
There are a number of days/intervals with missing values, coded as NA. These may introduce bias into some calculations and summaries of the data. Let's get the total number of missing values in the dataset (i.e. the total number of rows with NAs).
summary(activity$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 37.38 12.00 806.00 2304
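The count can also be obtained directly rather than read off the summary. The sketch below uses a hypothetical toy data frame to count missing step values, rows with any missing value, and missing values per column.

```r
# Toy data with two missing step counts (hypothetical values)
toy <- data.frame(
  steps    = c(NA, 3, NA, 0),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Number of missing values in the steps column
n_missing_steps <- sum(is.na(toy$steps))

# Number of rows with at least one missing value in any column
n_incomplete_rows <- sum(!complete.cases(toy))

# Missing values per column
missing_by_col <- colSums(is.na(toy))

n_missing_steps     # 2
n_incomplete_rows   # 2
```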
We see there are 2304 missing values in total. We will fill each missing value with the mean for that 5-minute interval across all days and create a modified dataset named New.Activity.
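# Attach the interval average to every row (merge() sorts the result by interval)
New.Activity <- merge(activity, Avgstep.timeIn, by = "interval")
# Replace each missing value with the average for its interval, rounded to whole steps
New.Activity$steps[is.na(New.Activity$steps)] <- New.Activity$avgsteps[is.na(New.Activity$steps)]
New.Activity$steps <- round(New.Activity$steps)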
head(New.Activity)
## interval steps date avgsteps
## 1 0 2 2012-10-01 1.716981
## 2 0 0 2012-11-23 1.716981
## 3 0 0 2012-10-28 1.716981
## 4 0 0 2012-11-06 1.716981
## 5 0 0 2012-11-24 1.716981
## 6 0 0 2012-11-15 1.716981
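As the output shows, merge() returns the rows sorted by interval rather than in the original date order. If the original order matters, one alternative is ave(), which computes a per-group value aligned with the input rows. This is a sketch of that alternative on hypothetical toy data, not the method used above, and it skips the rounding step.

```r
# Toy data: three days, two intervals, two missing values (hypothetical values)
toy <- data.frame(
  steps    = c(NA, 4, 2, NA, 6, 10),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean of the observed steps per interval, repeated for every row of that interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; the row order is unchanged
imputed <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

interval_mean   # 4 7 4 7 4 7
imputed         # 4 4 2 7 6 10
```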
A summary of the new dataset shows whether all missing values have been replaced, and gives the new mean and median of the number of steps per 5-minute interval.
summary(New.Activity$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 37.38 27.00 806.00
The above results show that there are no NAs in the modified dataset.
Comparing the Actual and Modified Datasets
After comparing the summaries, it is clear that the mean, median, minimum, and maximum number of steps per 5-minute interval are unchanged after filling the missing values with the average for that interval; only the third quartile rises, from 12 to 27. Let's see whether the distribution of total steps per day changes. To check, histograms of the total number of steps per day for the actual and modified datasets are plotted side by side below:
par(mfrow = c(1, 2))
# Daily totals from the actual data
NDaily.step <- aggregate(activity$steps,
                         by = list(activity$date),
                         FUN = sum, na.rm = TRUE)
names(NDaily.step) <- c("date", "steps")
hist(NDaily.step$steps,
     main = "Actual Data",
     xlab = "Total steps per day",
     ylab = "Frequency")
# Daily totals from the modified data
MDaily.step <- aggregate(New.Activity$steps,
                         by = list(New.Activity$date),
                         FUN = sum, na.rm = TRUE)
names(MDaily.step) <- c("date", "steps")
hist(MDaily.step$steps,
     main = "Modified Data",
     xlab = "Total steps per day",
     ylab = "Frequency")
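The comparison of the histograms shows an increase in the number of days in the most frequent range of total steps. This is expected: days whose values were all missing were counted as 0 in the actual data, whereas in the modified data they receive the sum of the interval averages.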
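To compare the two distributions numerically, the bin counts can be computed with the same breaks for both datasets. The following sketch uses hypothetical daily totals; the vectors `actual_totals` and `modified_totals` and the 5000-step bin width are assumptions chosen for illustration.

```r
# Hypothetical daily totals: two all-missing days count as 0 in the actual data
actual_totals   <- c(0, 0, 9000, 11000, 12000, 16000)
# After imputation those two days get a typical total instead (hypothetical value)
modified_totals <- c(10800, 10800, 9000, 11000, 12000, 16000)

# Use identical breaks so the bins are directly comparable
common_breaks <- seq(0, 20000, by = 5000)

# plot = FALSE returns the histogram object without drawing it
actual_counts   <- hist(actual_totals,   breaks = common_breaks, plot = FALSE)$counts
modified_counts <- hist(modified_totals, breaks = common_breaks, plot = FALSE)$counts

actual_counts     # 2 1 2 1
modified_counts   # 0 1 4 1
```

In this toy case the two days move out of the lowest bin and into the bin around the typical daily total.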
Are there differences in activity patterns between weekdays and weekends?
Using the modified dataset, we create a new factor variable day with two levels, “Weekday” and “Weekend”, indicating whether a given date is a weekday or a weekend day. We then plot a time series of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
# Classify each date as weekday or weekend
# (weekdays() returns day names in the current locale; English names are assumed here)
New.Activity$d <- weekdays(as.Date(New.Activity$date))
New.Activity$day <- as.factor(ifelse(New.Activity$d %in% c("Saturday", "Sunday"),
                                     "Weekend", "Weekday"))
head(New.Activity)
## interval steps date avgsteps d day
## 1 0 2 2012-10-01 1.716981 Monday Weekday
## 2 0 0 2012-11-23 1.716981 Friday Weekday
## 3 0 0 2012-10-28 1.716981 Sunday Weekend
## 4 0 0 2012-11-06 1.716981 Tuesday Weekday
## 5 0 0 2012-11-24 1.716981 Saturday Weekend
## 6 0 0 2012-11-15 1.716981 Thursday Weekday
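Because weekdays() returns names in the session's locale, matching on "Saturday" and "Sunday" may fail on a non-English system. A locale-independent alternative is the numeric weekday from as.POSIXlt(). The sketch below uses four of the dates shown in the output above; `day_type` is an illustrative name.

```r
# Four dates taken from the output above
dates <- as.Date(c("2012-10-01", "2012-10-28", "2012-11-24", "2012-11-15"))

# Numeric day of the week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Weekend if Sunday (0) or Saturday (6), regardless of locale
day_type <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))

as.character(day_type)   # "Weekday" "Weekend" "Weekend" "Weekday"
```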
# Average steps per interval on weekdays and weekends
Int_Avgstep <- aggregate(New.Activity$steps,
                         by = list(New.Activity$interval, New.Activity$day),
                         FUN = mean, na.rm = TRUE)
names(Int_Avgstep) <- c("interval", "day", "steps")
Int_Avgstep$steps <- round(Int_Avgstep$steps)
head(Int_Avgstep)
## interval day steps
## 1 0 Weekday 2
## 2 5 Weekday 0
## 3 10 Weekday 0
## 4 15 Weekday 0
## 5 20 Weekday 0
## 6 25 Weekday 2
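The same grouping can be written with the formula interface of aggregate(), which names the result columns automatically. This is a sketch on hypothetical toy data. Note that the formula interface drops rows with missing values by default, which makes no difference here because the toy data has none.

```r
# Toy data: two intervals, two day types, two observations each (hypothetical values)
toy <- data.frame(
  interval = rep(c(0, 5), times = 4),
  day      = rep(c("Weekday", "Weekend"), each = 4),
  steps    = c(2, 0, 4, 2, 10, 0, 20, 4)
)

# Mean steps for every combination of interval and day type
toy_avg <- aggregate(steps ~ interval + day, data = toy, FUN = mean)

toy_avg
# Rows are ordered by day, then interval:
# (0, Weekday) = 3, (5, Weekday) = 1, (0, Weekend) = 15, (5, Weekend) = 2
```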
Let's use ggplot2 in R to compare activity on weekends and weekdays.
library(ggplot2)
ggplot(Int_Avgstep, aes(interval, steps)) +
  geom_line(colour = "blue") +
  xlab("Interval of 5 mins") +
  ylab("Avg steps") +
  facet_grid(day ~ .)
From the above graphs, we observe that the average number of steps taken on weekdays is considerably higher (250 steps) in the interval between 750 and 1000, whereas on weekends we do not observe such a high peak during the same period.
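The visual comparison can be backed by a number, for example the highest interval average per day type within the 750 to 1000 window. The sketch below uses hypothetical toy averages, so its values are not the ones in the plot.

```r
# Toy interval averages per day type (hypothetical values)
toy <- data.frame(
  interval = rep(c(745, 835, 900, 1200), times = 2),
  day      = rep(c("Weekday", "Weekend"), each = 4),
  steps    = c(60, 230, 170, 90, 40, 140, 130, 160)
)

# Keep only the intervals between 750 and 1000
in_window <- toy[toy$interval >= 750 & toy$interval <= 1000, ]

# Highest average within the window for each day type
peak_by_day <- tapply(in_window$steps, in_window$day, max)

peak_by_day[["Weekday"]]   # 230
peak_by_day[["Weekend"]]   # 140
```

On the real data the same filter and tapply() call could be applied to `Int_Avgstep`.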