Unzip the data into “activity.csv” and load it into the “activity” data frame, then inspect its dimensions, first rows, and structure.
# Use one directory for the existence check, the unzip target, and the read
data_dir = "reproducible_research_week2_project/RepData_PeerAssessment1"
csv_path = file.path(data_dir, "activity.csv")
if(!file.exists(csv_path)) {
  unzip(file.path(data_dir, "activity.zip"), exdir = data_dir)
}
activity = read.csv(csv_path)
dim(activity)
## [1] 17568 3
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
Convert activity$date to the Date class
activity$date = as.Date(as.character(activity$date), "%Y-%m-%d")
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
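Let's calculate the total number of steps taken per day with tapply, dropping NAs, and inspect the first few values with head. Because na.rm = TRUE is used, a day whose values are all NA gets a total of 0, as 2012-10-01 shows below.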
total_numberofsteps_perday = with(activity, tapply(steps, date, sum, na.rm = TRUE))
head(total_numberofsteps_perday)
## 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-06
## 0 126 11352 12116 13294 15420
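The effect of na.rm = TRUE on a fully missing day can be seen on a small example. This is a minimal sketch on made-up data: the toy data frame and the helper sum_or_na are hypothetical names, not part of the analysis above. It contrasts the approach used here (an all-NA day becomes 0) with an alternative that keeps such a day as NA so it can be left out of later summaries.

```r
# Toy data (hypothetical, not the real activity data): the first day is entirely missing
toy = data.frame(
  steps = c(NA, NA, 10, 20, 5, NA),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# Same approach as in the analysis: an all-NA day collapses to a total of 0
totals_zero = with(toy, tapply(steps, date, sum, na.rm = TRUE))

# Alternative (an assumption, not the author's method): keep NA for days
# that have no observed values at all
sum_or_na = function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
totals_na = with(toy, tapply(steps, date, sum_or_na))

print(totals_zero)
print(totals_na)

# Summaries can then skip the fully missing day explicitly
mean(totals_na, na.rm = TRUE)
```

Which variant is appropriate depends on whether a day without any measurements should count as a day with zero steps.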
Let's make a histogram of the total number of steps taken each day
hist(total_numberofsteps_perday, breaks = 20)
Let's calculate and report the mean and median of the total number of steps taken per day
mean(total_numberofsteps_perday)
## [1] 9354.23
median(total_numberofsteps_perday)
## [1] 10395
Let's make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) against the average number of steps taken, averaged across all days (y-axis)
First, store the x-axis data in fivemin_interval by applying unique to activity$interval
fivemin_interval = unique(activity$interval)
str(fivemin_interval)
## int [1:288] 0 5 10 15 20 25 30 35 40 45 ...
Then store the y-axis data, the mean number of steps for each interval across all days
averagenumberofstepstaken_averagedacrossalldays = with(activity,tapply(steps, interval, mean, na.rm=TRUE))
str(averagenumberofstepstaken_averagedacrossalldays)
## num [1:288(1d)] 1.717 0.3396 0.1321 0.1509 0.0755 ...
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:288] "0" "5" "10" "15" ...
And here comes the plot
#png(filename = "instructions_fig/plot1.png")
plot(fivemin_interval,averagenumberofstepstaken_averagedacrossalldays,
type = "l", xlab = "Interval" , ylab = "Number of steps", main = "Average daily activity pattern" )
#dev.off()
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
order() returns the indices that sort averagenumberofstepstaken_averagedacrossalldays in ascending order, so the last index printed is the position of its maximum.
order(averagenumberofstepstaken_averagedacrossalldays)
## [1] 9 17 24 25 26 28 29 31 33 34 37 38 39 40 47 48 52
## [18] 61 279 5 46 274 3 12 30 4 15 20 21 41 32 287 23 11
## [35] 13 273 2 16 53 54 22 45 7 44 42 286 56 14 58 280 8
## [52] 36 50 281 288 18 60 27 49 265 10 35 62 282 275 43 1 19
## [69] 6 67 272 64 51 283 264 278 66 63 59 277 285 65 57 266 55
## [86] 276 284 267 68 269 262 258 263 268 261 270 271 257 259 253 69 260
## [103] 177 254 70 242 256 243 249 241 252 234 120 246 135 250 248 255 117
## [120] 131 164 136 238 178 138 153 245 122 247 235 176 130 137 134 181 237
## [137] 133 73 132 251 152 244 139 240 129 215 119 173 183 182 233 83 128
## [154] 216 174 154 200 118 184 125 80 71 158 165 236 121 217 160 175 141
## [171] 157 123 163 159 171 179 204 180 85 199 81 82 127 92 86 72 142
## [188] 201 155 116 202 239 185 143 203 161 205 78 167 186 187 172 84 74
## [205] 89 77 140 150 87 207 126 90 170 79 93 124 232 166 75 151 88
## [222] 91 169 96 206 162 214 197 95 218 231 144 221 212 198 168 208 193
## [239] 196 149 76 145 194 188 115 156 222 98 211 94 209 97 224 195 219
## [256] 213 223 230 210 189 192 229 220 225 228 227 146 148 147 114 190 226
## [273] 191 113 112 111 110 99 109 102 100 108 101 103 106 107 105 104
The last index printed is 104, so the largest value sits at position 104. Let's look up the corresponding interval and its average number of steps
averagenumberofstepstaken_averagedacrossalldays[104]
## 835
## 206.1698
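The same lookup can be done without printing the full ordering. The sketch below uses which.max on a small named vector with made-up values (avg_steps is a hypothetical stand-in for the per-interval averages); it is an alternative to the order() approach above, not what the analysis ran.

```r
# Hypothetical per-interval averages, named by interval like a tapply() result
avg_steps = c("0" = 2, "5" = 1, "10" = 7, "15" = 4)

# which.max() returns the position of the first maximum;
# for a named vector the result carries the element's name
peak = which.max(avg_steps)

names(peak)       # interval label of the maximum
avg_steps[peak]   # the maximum average itself
```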
Let's calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)
sum(is.na(activity$steps))
## [1] 2304
Let's devise a strategy for filling in all of the missing values in the dataset. Here we impute each missing value with the mean of its 5-minute interval, and create a new dataset (activity_imputed) that equals the original dataset but with the missing data filled in.
activity_imputed= activity
missingData = is.na(activity$steps)
meanValuesByInterval = tapply(activity$steps, activity$interval, mean, na.rm = TRUE)
activity_imputed$steps[missingData] = meanValuesByInterval[as.character(activity_imputed$interval[missingData])]
sum(is.na(activity_imputed))
## [1] 0
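The interval-mean strategy can be checked by hand on a few rows. The following is a minimal sketch on made-up data that uses ave() instead of the tapply() lookup above; the names toy, interval_mean and toy_imputed are hypothetical.

```r
# Toy data (hypothetical): two intervals observed on three days, with gaps
toy = data.frame(
  steps    = c(NA, 4, 2, 8, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# ave() returns, for every row, the mean of that row's interval group
interval_mean = ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Copy the data and fill only the missing entries
toy_imputed = toy
toy_imputed$steps = ifelse(is.na(toy$steps), interval_mean, toy$steps)

toy_imputed
```

Interval 0 has one observed value (2), so both of its gaps are filled with 2; the observed values are left untouched.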
Let's make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day.
total_numberofsteps_perday_imputed = with(activity_imputed, tapply(steps, date, sum))
hist(total_numberofsteps_perday_imputed, breaks = 20)
mean(total_numberofsteps_perday_imputed)
## [1] 10766.19
median(total_numberofsteps_perday_imputed)
## [1] 10766.19
Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
mean(total_numberofsteps_perday)
## [1] 9354.23
mean(total_numberofsteps_perday_imputed)
## [1] 10766.19
Yes, the mean differs: it rises from 9354.23 to 10766.19 after imputation
median(total_numberofsteps_perday)
## [1] 10395
median(total_numberofsteps_perday_imputed)
## [1] 10766.19
Yes, the median differs too: it rises from 10395 to 10766.19 after imputation
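One possible reading of these numbers is that the missing values fall on whole days: such a day counts as 0 before imputation and as the sum of the interval means afterwards, which pulls both estimates up. The sketch below illustrates that mechanism on made-up data only; whether the real dataset follows this pattern is an assumption, and all names in the block are hypothetical.

```r
# Toy data (hypothetical): day "d1" is entirely missing
toy = data.frame(
  steps    = c(NA, NA, 10, 30, 20, 60),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Before imputation: the missing day is counted as a total of 0
before = with(toy, tapply(steps, date, sum, na.rm = TRUE))

# Impute each gap with the mean of its interval
fill = ave(toy$steps, toy$interval, FUN = function(x) mean(x, na.rm = TRUE))
filled = toy
filled$steps[is.na(toy$steps)] = fill[is.na(toy$steps)]

# After imputation: the missing day gets the sum of the interval means
after = with(filled, tapply(steps, date, sum))

c(mean_before = mean(before), mean_after = mean(after),
  median_before = median(before), median_after = median(after))
```

In this toy case the imputed day lands exactly on the mean of the observed daily totals, so the mean and the median coincide after imputation.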
Let's create a new factor variable in the dataset with two levels - “weekday” and “weekend” - indicating whether a given date is a weekday or weekend day.
library(chron)
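# Classify by date, not by interval; as.chron() converts the Date column for chron
week_factor = is.weekend(as.chron(activity_imputed$date))
# Store the two-level factor described above in the dataset
activity_imputed$day_type = factor(ifelse(week_factor, "weekend", "weekday"))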
activity_imputed_weekend = activity_imputed[week_factor,]
activity_imputed_weekday = activity_imputed[!week_factor,]
average_numberofsteps_perday_imputed_weekday = with(activity_imputed_weekday, tapply(steps, interval, mean))
average_numberofsteps_perday_imputed_weekend = with(activity_imputed_weekend, tapply(steps, interval, mean))
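The weekday/weekend split can also be written with base R only. This is a minimal sketch on made-up data, not the chron-based code above: the toy data frame, the day_type column and avg_by_type are hypothetical names, and format(date, "%u") is used because it returns the weekday number independently of the locale.

```r
# Toy data (hypothetical): Friday, Saturday, Sunday and Monday, two intervals each
toy = data.frame(
  steps    = c(10, 20, 30, 40, 50, 60, 30, 40),
  date     = as.Date(rep(c("2012-10-05", "2012-10-06",
                           "2012-10-07", "2012-10-08"), each = 2)),
  interval = rep(c(0, 5), times = 4)
)

# "%u" gives the weekday as "1" (Monday) to "7" (Sunday)
is_weekend = format(toy$date, "%u") %in% c("6", "7")

# Two-level factor stored in the dataset
toy$day_type = factor(ifelse(is_weekend, "weekend", "weekday"),
                      levels = c("weekday", "weekend"))

# Average steps per interval within each day type
avg_by_type = aggregate(steps ~ interval + day_type, data = toy, FUN = mean)
avg_by_type
```

Keeping the day type as a column means the averages for both groups come from a single aggregate() call instead of two separate subsets.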
Let's make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
par(mfrow = c(2,1))
# The y-axis shows average steps per interval, not a frequency
plot(unique(activity_imputed_weekday$interval), average_numberofsteps_perday_imputed_weekday,
type = "l", xlab = "Interval", ylab = "Number of steps", main = "Weekday days")
plot(unique(activity_imputed_weekend$interval), average_numberofsteps_perday_imputed_weekend,
type = "l", xlab = "Interval", ylab = "Number of steps", main = "Weekend days")
The plots suggest that on weekdays the number of steps rises sharply early in the day, whereas on weekends the steps are spread more evenly across the intervals.