This is the second course project in the Coursera course Reproducible Research.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
A link was provided to download the data, which is approximately 47 MB. The data needs to be processed before we can use it to answer the questions.
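For reproducibility, the raw file can also be fetched directly from R. The sketch below uses a placeholder URL, which must be replaced with the actual link from the assignment page.
# Placeholder URL -- replace with the download link given on the assignment page
data_url <- "https://example.com/repdata_data_StormData.csv.bz2"
if (!file.exists("repdata_data_StormData.csv.bz2")) {
    download.file(data_url, destfile = "repdata_data_StormData.csv.bz2", mode = "wb")
}
# Note: read.csv() can also read the compressed .bz2 file directly if preferred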
Calling some libraries or running certain calculations can produce a lot of warnings, which makes the final Markdown file look cluttered, so we use the opts_chunk$set function to turn off the warnings that might show up.
knitr::opts_chunk$set(warning = FALSE)
The working directory is the same as the directory in which the data has been downloaded and extracted. We list the contents of the directory with the dir() function.
dir()
## [1] "course_project_2.html" "course_project_2.Rmd"
## [3] "course_project_2.txt" "FALSE.html"
## [5] "repdata_data_StormData.csv" "repdata_data_StormData.csv.bz2"
## [7] "rsconnect"
We see that the directory contains the required file with a .csv extension, so we will use the read.csv() function to read the entire file.
Now we declare a new variable, StormData, that will store all of the downloaded storm data.
StormData <- read.csv("repdata_data_StormData.csv")
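Reading a file of this size with read.csv() is slow. If load time is a concern, data.table::fread() (assuming the data.table package is installed) reads the same file much faster; converting the result back to a plain data frame keeps the rest of the analysis unchanged, although column types may differ slightly (character instead of factor).
# Optional faster read; fread() returns a data.table, so convert back to a data.frame
library(data.table)
StormData <- as.data.frame(fread("repdata_data_StormData.csv"))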
It takes some time to read the entire data set, and we only need part of it, but before subsetting we check the structure and dimensions of the data.
dim(StormData)
## [1] 902297 37
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dimensions of the data are 902297 by 37, and the structure shows that the data is stored as a data.frame. Now it is time to process the data.
We need only the following columns:
1. EVTYPE
2. FATALITIES
3. INJURIES
4. PROPDMG
5. PROPDMGEXP
6. CROPDMG
7. CROPDMGEXP
As these are the required variables, we declare a new variable reqvar, a vector of the required column names, and make a new data frame named newdata that contains only those columns.
reqvar <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
newdata <- StormData[, reqvar]
We check the dimensions of the newdata variable and look at its head and tail.
dim(newdata)
## [1] 902297 7
head(newdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
tail(newdata)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER 0 0 0 K 0 K
## 902293 HIGH WIND 0 0 0 K 0 K
## 902294 HIGH WIND 0 0 0 K 0 K
## 902295 HIGH WIND 0 0 0 K 0 K
## 902296 BLIZZARD 0 0 0 K 0 K
## 902297 HEAVY SNOW 0 0 0 K 0 K
The displayed rows show no NA values, but to confirm, we check the sum of is.na() for some of the variables in newdata.
sum(is.na(newdata$FATALITIES))
## [1] 0
sum(is.na(newdata$INJURIES))
## [1] 0
sum(is.na(newdata$PROPDMG))
## [1] 0
The sum of is.na() is zero for each of these variables, so these columns of the newdata data frame contain no NAs.
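A compact way to check all of the retained columns at once is:
# Count of missing values in each retained column
colSums(is.na(newdata))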
First we look at the 10 event types with the highest counts. We do this by tabulating EVTYPE with table(), sorting the result in decreasing order, and viewing only the first 10 entries.
sort(table(newdata$EVTYPE), decreasing = TRUE)[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND TORNADO
## 288661 219940 82563 60652
## FLASH FLOOD FLOOD THUNDERSTORM WINDS HIGH WIND
## 54277 25326 20843 20212
## LIGHTNING HEAVY SNOW
## 15754 15708
We see that there are multiple values like HIGH WIND, TSTM WIND, THUNDERSTORM WINDS and many others that all describe wind events, so we group such values together. We do this for nine categories and label everything else as OTHER. The categories are:
1. HAIL
2. HEAT
3. FLOOD
4. WIND
5. SNOW
6. STORM
7. WINTER
8. RAIN
9. TORNADO
10. OTHER
newdata$EVENT <- "OTHER"
# Rows whose EVTYPE contains "HAIL" (ignoring case) are categorised as "HAIL".
newdata$EVENT[grep("HAIL", newdata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
# Rows whose EVTYPE contains "HEAT" (ignoring case) are categorised as "HEAT", and so on for the remaining categories below.
newdata$EVENT[grep("HEAT", newdata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
newdata$EVENT[grep("FLOOD", newdata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
newdata$EVENT[grep("WIND", newdata$EVTYPE, ignore.case = TRUE)] <- "WIND"
newdata$EVENT[grep("SNOW", newdata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
newdata$EVENT[grep("STORM", newdata$EVTYPE, ignore.case = TRUE)] <- "STORM"
newdata$EVENT[grep("WINTER", newdata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
newdata$EVENT[grep("RAIN", newdata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
newdata$EVENT[grep("TORNADO", newdata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
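Note that the order of these substitutions matters: a row whose EVTYPE matches several patterns ends up in the category assigned last. As a quick sanity check (not part of the original output), the raw event types that fall through to OTHER can be inspected like this:
# Most frequent raw event types that none of the nine patterns matched
sort(table(newdata$EVTYPE[newdata$EVENT == "OTHER"]), decreasing = TRUE)[1:10]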
Now we check how the regrouped table looks, again viewing the first 10 entries in decreasing order.
sort(table(newdata$EVENT), decreasing = TRUE)[1:10]
##
## HAIL WIND STORM FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 289270 255362 113180 82686 60700 48970 19604 17636 12241 2648
The table has been sorted and the event names are as we required.
Next we look at the first 10 elements of the sorted tables of the property and crop damage exponent columns.
sort(table(newdata$PROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M 0 B 5 1 2 ? m
## 465934 424665 11330 216 40 28 25 13 8 7
sort(table(newdata$CROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M k 0 B ? 2 m <NA>
## 618413 281832 1994 21 19 9 7 1 1
We see that there are several different codes representing multiples of a dollar:
* B or b: billion (10^9)
* M or m: million (10^6)
* K or k: thousand (10^3)
So we convert all of these codes into a single numeric exponent so that the damage amounts are comparable.
First we deal with the property damage and add a new column, property_damage, to the newdata data frame.
newdata$PROPDMGEXP <- as.character(newdata$PROPDMGEXP)
newdata$PROPDMGEXP[is.na(newdata$PROPDMGEXP)] <- 0
newdata$PROPDMGEXP[!grepl("K|M|B", newdata$PROPDMGEXP, ignore.case = TRUE)] <- 0
newdata$PROPDMGEXP[grep("K", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
newdata$PROPDMGEXP[grep("M", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
newdata$PROPDMGEXP[grep("B", newdata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
newdata$PROPDMGEXP <- as.numeric(as.character(newdata$PROPDMGEXP))
newdata$property_damage <- newdata$PROPDMG * 10^newdata$PROPDMGEXP
We tabulate the property_damage variable, sort it in decreasing order, and look at the first 10 entries.
sort(table(newdata$property_damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
The property damage data is now as required; all amounts are in dollars ($).
Now we deal with the crop damage and add a new column, crop_damage, to the newdata data frame.
newdata$CROPDMGEXP <- as.character(newdata$CROPDMGEXP)
newdata$CROPDMGEXP[is.na(newdata$CROPDMGEXP)] <- 0
newdata$CROPDMGEXP[!grepl("K|M|B", newdata$CROPDMGEXP, ignore.case = TRUE)] <- 0
newdata$CROPDMGEXP[grep("K", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
newdata$CROPDMGEXP[grep("M", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
newdata$CROPDMGEXP[grep("B", newdata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
newdata$CROPDMGEXP <- as.numeric(as.character(newdata$CROPDMGEXP))
newdata$crop_damage <- newdata$CROPDMG * 10^newdata$CROPDMGEXP
We tabulate the crop_damage variable, sort it in decreasing order, and look at the first 10 entries.
sort(table(newdata$crop_damage), decreasing = TRUE)[1:10]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05
## 880198 4097 2349 1984 1233 956 951 830 758 721
The crop damage data is now as required; all amounts are in dollars ($).
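The property and crop exponent handling above is identical apart from the column names, so it could be factored into a small helper function. This is only a sketch of a possible refactor, not part of the original analysis; it reproduces the behaviour above, treating any code other than K/M/B (blank, "?", "+", digits) as a multiplier of 1.
# Map a damage exponent code to a numeric multiplier (K/M/B, case-insensitive; everything else -> 1)
exp_to_multiplier <- function(code) {
    code <- toupper(as.character(code))
    mult <- rep(1, length(code))
    mult[code %in% "K"] <- 1e3
    mult[code %in% "M"] <- 1e6
    mult[code %in% "B"] <- 1e9
    mult
}
# Equivalent to the two blocks above:
# newdata$property_damage <- newdata$PROPDMG * exp_to_multiplier(newdata$PROPDMGEXP)
# newdata$crop_damage     <- newdata$CROPDMG * exp_to_multiplier(newdata$CROPDMGEXP)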
For the analysis we need three libraries.
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
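The attach messages above could also be silenced, in the same spirit as turning off warnings at the top of the report, for example with suppressPackageStartupMessages():
# Optional: load the packages without printing their start-up messages
suppressPackageStartupMessages({
    library(plyr)
    library(dplyr)
    library(ggplot2)
})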
plyr provides the ddply() and join() functions that we use to aggregate and combine the data, and ggplot2 is used to plot the results.
First we aggregate the fatalities from newdata into a new variable, agg_fatality.
agg_fatality <- ddply(newdata, .(EVENT), summarise, Total = sum(FATALITIES, na.rm = TRUE))
agg_fatality$type <- "fatality"
Now we aggregate the injuries from newdata into a new variable, agg_injury.
agg_injury <- ddply(newdata, .(EVENT), summarize, Total = sum(INJURIES, na.rm = TRUE))
agg_injury$type <- "injury"
Now we make two new data frames:
1. agg_health : the fatality and injury data frames bound together row-wise.
2. health_as_per_event : shows the fatality and injury totals side by side for each event.
agg_health <- rbind(agg_fatality, agg_injury)
health_as_per_event <- join(agg_fatality, agg_injury, by = "EVENT", type = "inner")
health_as_per_event
## EVENT Total type Total type
## 1 FLOOD 1525 fatality 8602 injury
## 2 HAIL 16 fatality 1371 injury
## 3 HEAT 3139 fatality 9224 injury
## 4 OTHER 2627 fatality 12224 injury
## 5 RAIN 115 fatality 305 injury
## 6 SNOW 165 fatality 1164 injury
## 7 STORM 417 fatality 5339 injury
## 8 TORNADO 5662 fatality 91407 injury
## 9 WIND 1210 fatality 9001 injury
## 10 WINTER 279 fatality 1891 injury
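Since dplyr is already loaded, the same totals could also be computed without plyr. This is a sketch of an equivalent aggregation, calling dplyr::summarise() explicitly to avoid the plyr/dplyr masking noted above:
# Equivalent aggregation with dplyr alone: total fatalities and injuries per event group
library(dplyr)
health_by_event_dplyr <- newdata %>%
    group_by(EVENT) %>%
    dplyr::summarise(fatalities = sum(FATALITIES, na.rm = TRUE),
                     injuries   = sum(INJURIES, na.rm = TRUE))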
Similarly, we aggregate the property damage from newdata into a new variable, agg_prop.
agg_prop <- ddply(newdata, .(EVENT), summarise, Total = sum(property_damage, na.rm = TRUE))
agg_prop$type <- "property"
Now we aggregate the crop damage from newdata into a new variable, agg_crop.
agg_crop <- ddply(newdata, .(EVENT), summarize, Total = sum(crop_damage, na.rm = TRUE))
agg_crop$type <- "crop"
Now we make two new data frames:
1. agg_money : the property and crop damage data frames bound together row-wise.
2. money_by_event : shows the property and crop damage totals side by side for each event.
agg_money <- rbind(agg_prop, agg_crop)
money_by_event <- join(agg_prop, agg_crop, by = "EVENT", type = "inner")
money_by_event
## EVENT Total type Total type
## 1 FLOOD 167502193930 property 12266906100 crop
## 2 HAIL 15733043049 property 3046837473 crop
## 3 HEAT 20325751 property 904469280 crop
## 4 OTHER 97246712338 property 23588880870 crop
## 5 RAIN 3270230193 property 919315800 crop
## 6 SNOW 1023569753 property 134683100 crop
## 7 STORM 66305015394 property 6374474888 crop
## 8 TORNADO 58593098030 property 417461520 crop
## 9 WIND 10847166619 property 1403719150 crop
## 10 WINTER 6777295252 property 47444000 crop
Now we plot the results to show which event types have the largest impact on population health and on the economy.
# Stacked bar chart of fatalities and injuries per event category
ggplot(agg_health, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total health impact") + ggtitle("Event type vs. health impact")
# Stacked bar chart of property and crop damage (in dollars) per event category
ggplot(agg_money, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total damage in dollars") + ggtitle("Event type vs. economic damage")
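As an optional refinement that is not in the original figures, the event categories can be ordered by their combined totals so the most harmful events appear at the top of each flipped bar chart; a sketch for the health plot:
# Order the event categories by combined health impact before plotting (optional tweak)
event_order <- names(sort(tapply(agg_health$Total, agg_health$EVENT, sum)))
agg_health$EVENT <- factor(agg_health$EVENT, levels = event_order)
ggplot(agg_health, aes(x = EVENT, y = Total, fill = type)) + geom_bar(stat = "identity") + coord_flip() +
    xlab("Event Type") + ylab("Total health impact") + ggtitle("Event type vs. health impact")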