Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. The basic goal of this report is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Database and answer two questions about storms and severe weather events: across the United States, which types of events are most harmful with respect to population health, and which types have the greatest economic consequences?
The report follows the principles of Reproducible Research. The steps taken to load and process the data set, including the code used, are shown in the “Data Processing” section. The “Results” section at the end of the report presents the results of this exploratory study: the top ten storm and severe weather event types that cause the most human casualties, and the top ten types that cause the most economic damage.
Two R libraries are required to run the R code chunks embedded in this document: “data.table”, used to load, read and process the data set file; and “ggplot2”, the plotting library used to create the plots provided in the Results section.
library(data.table)
library(ggplot2)
The following code reads “repdata-data-StormData.csv” into a data.table named “stormDT”. If the data set file is not available in the working directory, the code downloads the original compressed data set file and decompresses it, then reads the data set into the “stormDT” data table. A third library, “R.utils”, is loaded at this stage because it provides the “bunzip2” function used to decompress the original “.bz2” data set file.
## Downloading the NOAA Storm Database, if it is not already in the working directory
if(!file.exists("repdata-data-StormData.csv")){
library(R.utils)
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "repdata-data-StormData.csv.bz2", method = "curl")
bunzip2("repdata-data-StormData.csv.bz2", "repdata-data-StormData.csv", remove = FALSE, skip = TRUE)
paste("NOAA's original zip data set is downloaded from the link provided above, and unzipped on: ",
Sys.time())
}
## Reading the data file
stormDT <- fread("repdata-data-StormData.csv")
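As a side note, base R can read a bzip2-compressed CSV through a bzfile() connection, so decompressing to disk is one option rather than a requirement. The following is a minimal sketch on a made-up two-row file written to a temporary path; the file name and its contents are assumptions for illustration, not part of the NOAA data set.

```r
## Toy example: the temporary file and its contents are hypothetical
tmp <- tempfile(fileext = ".csv.bz2")

## Writing a tiny CSV through a bzip2-compressing text connection
con <- bzfile(tmp, open = "wt")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "FLOOD,0"), con)
close(con)

## Reading the compressed file back without decompressing it to disk first
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy)

## Cleaning up the temporary file
unlink(tmp)
```

The report's approach of decompressing once and reusing the “.csv” file avoids paying the decompression cost on every run, which is a reasonable choice for a file of this size.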
By exploring the “stormDT” data table using R’s str() and head() functions, it is clear that the data table has 902297 rows and 37 columns. The use of the dim() function provides a more compact view of the data table’s size.
## Exploring the dimensions of stormDT data table
dim(stormDT)
## [1] 902297 37
The report starts the process of cleaning the data table to make it tidy by looking at the columns “stormDT” has, and at the documentation provided by the National Weather Service at https://d396qusza40orc.cloudfront.net/repdata/peer2_doc/pd01016005curr.pdf, to understand which variables (columns) are relevant to the study’s questions. Only 7 of the 37 columns are actually relevant. These columns identify the storm/weather event type, the number of fatalities and injuries caused by the event, and the property and crop damage caused by the event. The following code prints the names of all the columns of “stormDT”; confirms the column numbers of the relevant ones; removes the irrelevant columns; and finally checks the structure of the now slimmer “stormDT”.
## Finding the relevant columns in stormDT data table
colnames(stormDT)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
## Confirming the column numbers of the columns that are relevant to event type
## and the human and economic damages caused (those are the columns to keep)
colnames(stormDT[, c(8,23:28)])
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## [6] "CROPDMG" "CROPDMGEXP"
## Removing all other columns
stormDT <- stormDT[, -c(1:7, 9:22, 29:37) ]
## Checking stormDT now, to confirm that it has only the relevant columns
str(stormDT)
## Classes 'data.table' and 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## - attr(*, ".internal.selfref")=<externalptr>
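Selecting columns by position works here, but it depends on the column order staying the same. A minimal sketch of the same kind of selection done by name, on a small made-up table (“toyDT” and its values are hypothetical), might look like this:

```r
library(data.table)

## Toy table with a few relevant and a few irrelevant columns (made-up values)
toyDT <- data.table(STATE = c("AL", "AL"),
                    EVTYPE = c("TORNADO", "HAIL"),
                    MAG = c(0, 75),
                    FATALITIES = c(0, 1),
                    INJURIES = c(15, 0))

## Naming the columns to keep, instead of relying on their positions
keep <- c("EVTYPE", "FATALITIES", "INJURIES")
slimDT <- toyDT[, keep, with = FALSE]

print(colnames(slimDT))
```

Naming the columns makes the intent visible in the code and keeps the selection correct even if the source file gains or reorders columns.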
The second step of making the data table tidy is checking whether there are missing values, and dealing with them if they exist. The following code checks for missing values in all variables and finds none.
## Checking for NA values in any of the columns, and confirming none exists
sapply(stormDT, function(x) sum(is.na(x)))
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0 0 0 0 0
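Note that is.na() does not count empty strings, and the str() output above shows that “CROPDMGEXP” contains "" values. A small sketch on made-up data (the “toy” data frame below is hypothetical) shows how blanks could be counted alongside NA values:

```r
## Toy data: one NA and several blank strings (made-up values)
toy <- data.frame(PROPDMGEXP = c("K", "", "M", NA),
                  CROPDMGEXP = c("", "", "K", "B"),
                  stringsAsFactors = FALSE)

## Counting NA values per column, as the report does
naCount <- sapply(toy, function(x) sum(is.na(x)))

## Counting blank (empty string) values per column, which is.na() misses
blankCount <- sapply(toy, function(x) sum(!is.na(x) & x == ""))

print(naCount)
print(blankCount)
```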
To make the “stormDT” table tidy, the two columns “PROPDMGEXP” and “CROPDMGEXP”, which provide the exponents for “PROPDMG” and “CROPDMG” respectively, need to be resolved, because they complicate the data unnecessarily. The following code identifies the exponent values in the two “EXP” columns, then replaces them with clear and consistent numeric multipliers. Blank values and the symbols “+”, “-” and “?” are treated as a multiplier of 1. The values in “PROPDMG” and “CROPDMG” are then multiplied by their respective cleaned exponent columns. Finally, the code removes the exponent columns, as they are no longer needed.
## Finding out the unique values of both EXP columns
unique(c(stormDT$PROPDMGEXP, stormDT$CROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
## [20] "k"
## Building a conversion named-numeric vector for the values of the PROPDMG and
## CROPDMG EXP columns to their respective 10^? values
expChar2expNum <- c("+" = 1, "-" = 1, "?" = 1, "0" = 1, "1" = 10, "2" = 100,
"3" = 1000, "4" = 10^4, "5" = 10^5, "6" = 10^6, "7" = 10^7,
"8" = 10^8, "h" = 100, "H" = 100, "k" = 1000, "K" = 1000,
"m" = 10^6, "M" = 10^6, "B" = 10^9)
## Changing the EXP character indicator to their equivalent numeric value
stormDT[PROPDMGEXP =="", "PROPDMGEXP"] <- "0"
stormDT[CROPDMGEXP =="", "CROPDMGEXP"] <- "0"
stormDT[, PROPDMGEXP := expChar2expNum[stormDT[,PROPDMGEXP]]]
stormDT[, CROPDMGEXP := expChar2expNum[stormDT[,CROPDMGEXP]]]
## Updating PROPDMG and CROPDMG by multiplying them with their respective EXP columns
stormDT[, `:=`(PROPDMG = PROPDMG * PROPDMGEXP, CROPDMG = CROPDMG * CROPDMGEXP)]
## Removing the PROPDMGEXP and CROPDMGEXP columns, which are no longer needed
stormDT <- stormDT[, -c( "PROPDMGEXP", "CROPDMGEXP") ]
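The lookup-and-multiply step above can be illustrated on a few made-up values. This sketch uses only a subset of the conversion vector, and the “dmg” and “expCode” vectors are hypothetical:

```r
## Subset of the report's conversion vector (character code -> multiplier)
expLookup <- c("0" = 1, "h" = 100, "k" = 1000, "K" = 1000, "M" = 10^6, "B" = 10^9)

## Made-up damage values and their exponent codes
dmg <- c(25, 2.5, 1.5, 3)
expCode <- c("K", "M", "", "B")

## Blank codes are mapped to "0", i.e. a multiplier of 1, as in the report
expCode[expCode == ""] <- "0"

## Indexing the named vector by code returns the matching multipliers
multiplier <- unname(expLookup[expCode])
dollars <- dmg * multiplier
print(dollars)

## A code that is absent from the lookup vector yields NA
missingCode <- unname(expLookup["Z"])
print(missingCode)
```

Because a code missing from the lookup vector yields NA, every value reported by unique() needs an entry in the conversion vector, which is the case in the report's code.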
The final step in making the data table tidy and complete enough to answer the exploratory study’s questions is to add two new columns: “totalHumanCasualties”, the sum of the fatalities and injuries caused by the event; and “totalEconomicCost”, the sum of the property and crop damage caused by the event.
## Adding 2 new columns, one sums all the human casualties caused by the row's
## specific storm/weather event, and another sums all the financial damages caused
stormDT[, `:=`(totalHumanCasualties = FATALITIES + INJURIES, totalEconomicCost = PROPDMG + CROPDMG )]
## Checking stormDT now, to confirm it is tidy, clean and includes all needed data
str(stormDT)
## Classes 'data.table' and 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ totalHumanCasualties: num 15 0 2 2 2 6 1 0 15 0 ...
## $ totalEconomicCost : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## - attr(*, ".internal.selfref")=<externalptr>
The exploratory study’s questions concern the total effect of each storm or extreme weather event type, not the effect of individual events (represented as individual rows in the data set). Therefore, “stormDT” needs to be summarized into a new data table, called here “stormEffectDT”, that has one row for each storm/weather event type and whose column values are the totals over all of that event type’s rows in “stormDT”. The rows of “stormEffectDT” are ordered by the total human health harm caused by the event type, then by the total economic harm. Ranking human harm above financial harm is a reasonable assumption, but it does not affect the final results, because the two harm types are analyzed separately, one for each research question. The Results section that follows applies some additional summarization to “stormEffectDT” to answer each research question specifically.
## Creating a new summary data table for storm/weather types and their total human
## and financial harm
stormEffectDT <- stormDT[, .( FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES),
totalHumanCasualties = sum(totalHumanCasualties),
PROPDMG = sum(PROPDMG), CROPDMG = sum(CROPDMG),
totalEconomicCost = sum(totalEconomicCost) ), by = EVTYPE ]
## Ordering the new summary table based on the most total human harm caused,
## then by the most economic harm caused
stormEffectDT <- stormEffectDT[order(totalHumanCasualties, totalEconomicCost, decreasing = TRUE), ]
## Checking stormEffectDT, to confirm it is tidy, clean and includes all needed data
str(stormEffectDT)
## Classes 'data.table' and 'data.frame': 985 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "EXCESSIVE HEAT" "TSTM WIND" "FLOOD" ...
## $ FATALITIES : num 5633 1903 504 470 816 ...
## $ INJURIES : num 91346 6525 6957 6789 5230 ...
## $ totalHumanCasualties: num 96979 8428 7461 7259 6046 ...
## $ PROPDMG : num 5.69e+10 7.75e+06 4.48e+09 1.45e+11 9.30e+08 ...
## $ CROPDMG : num 4.15e+08 4.92e+08 5.54e+08 5.66e+09 1.21e+07 ...
## $ totalEconomicCost : num 5.74e+10 5.00e+08 5.04e+09 1.50e+11 9.42e+08 ...
## - attr(*, ".internal.selfref")=<externalptr>
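The grouping step above can be seen more easily on a handful of rows. The following is a minimal sketch on made-up events (“toyDT” and its values are hypothetical), using the same sum-by-EVTYPE pattern:

```r
library(data.table)

## Five made-up events across three event types
toyDT <- data.table(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
                    FATALITIES = c(1, 0, 2, 3, 0),
                    INJURIES = c(14, 2, 6, 1, 0))

## One row per event type, each column holding the total over that type's rows
toyEffectDT <- toyDT[, .(FATALITIES = sum(FATALITIES),
                         INJURIES = sum(INJURIES)), by = EVTYPE]

## Adding the combined total and sorting so that the most harmful type is on top
toyEffectDT[, totalHumanCasualties := FATALITIES + INJURIES]
toyEffectDT <- toyEffectDT[order(-totalHumanCasualties)]
print(toyEffectDT)
```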
There are 985 rows in the “stormEffectDT” summary table, as the str() output above shows. This means that the NOAA data set contains 985 distinct event type labels. The following shows the first 10 rows of the table, that is, the 10 event types most harmful to population health, together with their total economic cost.
## Showing the 10 first rows in stormEffectDT,
## but showing here only the "type" and "total harm" columns
head(stormEffectDT[, c("EVTYPE","totalHumanCasualties", "totalEconomicCost")], 10)
## EVTYPE totalHumanCasualties totalEconomicCost
## <char> <num> <num>
## 1: TORNADO 96979 57362333946
## 2: EXCESSIVE HEAT 8428 500155700
## 3: TSTM WIND 7461 5038935845
## 4: FLOOD 7259 150319678257
## 5: LIGHTNING 6046 942471520
## 6: HEAT 3037 403258500
## 7: FLASH FLOOD 2755 18243991078
## 8: ICE STORM 2064 8967041360
## 9: THUNDERSTORM WIND 1621 3897965522
## 10: WINTER STORM 1527 6715441251
The first question that this report answers, as stated above, is: across the United States, which types of events are most harmful with respect to population health? The NOAA data set provides no means of judging the effect of storms and other extreme weather events on population health other than the number of fatalities and injuries resulting from such events. Therefore, this report measures the effect of extreme weather event types on population health in terms of total human casualties, calculated as the sum of the fatalities and injuries caused by each event type.
Unfortunately, the data also does not differentiate between classes of injuries. It reports the number of injured only as an aggregate: a permanent disability and a minor injury treated within days are each counted as a single injury. This should not affect the ability to answer the general question about the effect of different event types on population health. More granular detail would provide more insight and a more accurate measure of that effect, but the report uses the data available, which should suffice for the general question it explores.
The following code sorts the rows of the “stormEffectDT” table in decreasing order of total human casualties; in case of a tie, it orders the rows by number of fatalities and then by number of injuries, so that the event type with the most human casualties is on top. The code extracts the relevant columns (“EVTYPE”, “FATALITIES”, “INJURIES” and “totalHumanCasualties”) and places them in a new data table, “EVTYPEbyHumanCasualties”.
## Creating a new EVTYPEbyHumanCasualties data table that has the event type and only the
## human casualty columns, then ordering its rows to have the most totalHumanCasualties on top
EVTYPEbyHumanCasualties <- stormEffectDT[order(totalHumanCasualties, FATALITIES,
INJURIES, decreasing = TRUE), c(1:4)]
## Checking EVTYPEbyHumanCasualties, to confirm its structure
str(EVTYPEbyHumanCasualties)
## Classes 'data.table' and 'data.frame': 985 obs. of 4 variables:
## $ EVTYPE : chr "TORNADO" "EXCESSIVE HEAT" "TSTM WIND" "FLOOD" ...
## $ FATALITIES : num 5633 1903 504 470 816 ...
## $ INJURIES : num 91346 6525 6957 6789 5230 ...
## $ totalHumanCasualties: num 96979 8428 7461 7259 6046 ...
## - attr(*, ".internal.selfref")=<externalptr>
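The tie-breaking behaviour of the ordering above can be illustrated on a small made-up example, in which two event types have the same total but different fatality counts. The “toy” data frame and its labels are hypothetical:

```r
## Made-up event types: "EVENT A" and "EVENT B" tie on total casualties (10 each)
toy <- data.frame(EVTYPE = c("EVENT A", "EVENT B", "EVENT C"),
                  FATALITIES = c(2, 5, 1),
                  INJURIES = c(8, 5, 3),
                  stringsAsFactors = FALSE)
toy$totalHumanCasualties <- toy$FATALITIES + toy$INJURIES

## Sorting by total first; ties are broken by fatalities, then by injuries
ranked <- toy[order(toy$totalHumanCasualties, toy$FATALITIES, toy$INJURIES,
                    decreasing = TRUE), ]
print(ranked)
```

Here “EVENT B” ranks above “EVENT A” because it has more fatalities, even though both have the same total.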
The following table presents the top 10 extreme weather event types that caused the most human casualties, and are therefore the most harmful with respect to population health.
## Showing the top 10 event types that are the most harmful to US population health
knitr::kable(EVTYPEbyHumanCasualties[1:10,], caption = "**Table 1: Top 10 Extreme Weather Event Types with the Most Harm to the U.S. Population Health**")
| EVTYPE | FATALITIES | INJURIES | totalHumanCasualties |
|---|---|---|---|
| TORNADO | 5633 | 91346 | 96979 |
| EXCESSIVE HEAT | 1903 | 6525 | 8428 |
| TSTM WIND | 504 | 6957 | 7461 |
| FLOOD | 470 | 6789 | 7259 |
| LIGHTNING | 816 | 5230 | 6046 |
| HEAT | 937 | 2100 | 3037 |
| FLASH FLOOD | 978 | 1777 | 2755 |
| ICE STORM | 89 | 1975 | 2064 |
| THUNDERSTORM WIND | 133 | 1488 | 1621 |
| WINTER STORM | 206 | 1321 | 1527 |
Figure 1 shows the top 10 storm and other extreme weather event types that are the most harmful to U.S. population health as a stacked bar plot (using the “ggplot2” R library loaded earlier). Each event-type bar shows both the number of fatalities (the brown segment) and the number of injuries (the red segment). The two segments of each bar add up to the total number of human casualties shown in Table 1 above. The following code first melts the “EVTYPEbyHumanCasualties” data table, for only the top 10 event types (rows). “ggplot2” needs the data in this melted (long) form, with one row per event type and casualty kind, to build a stacked bar plot. The resulting “meltedEVTYPEbyHumanCasualties” data table is then passed to ggplot() to create Figure 1.
## Melting EVTYPEbyHumanCasualties, but ONLY for the TOP 10 Event Types causing
## the Most Human Casualties
meltedEVTYPEbyHumanCasualties <- melt(data = EVTYPEbyHumanCasualties[1:10,],
id.vars = "EVTYPE", measure.vars = c("FATALITIES", "INJURIES"))
## Using ggplot2, generating a "stacked" bar plot to show the human casualties,
## fatalities and injured, caused by the 10 TOP most harmful events
ggplot(meltedEVTYPEbyHumanCasualties,
aes(reorder(EVTYPE, value, decreasing=TRUE), value/1000, fill=variable)) +
geom_bar( position="stack", stat="identity") +
labs( x= "Storm/Weather Event Type", y= "Number of Human Casualties (in 1000s)") +
scale_y_continuous(n.breaks = 10) +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold", color = "darkblue"),
axis.text.x = element_text(angle=35, hjust=1, size = 7, color = "darkblue", face = "bold"),
axis.title.x = element_text(hjust = 0.5, size = 10, color = "darkblue", face = "bold"),
axis.title.y = element_text(hjust = 0.5, size = 10, color = "darkblue", face = "bold"),
axis.text.y = element_text(color = "darkblue", face = "bold"),
legend.text = element_text(size = 7, color = "darkblue", face = "bold"),
legend.title = element_text(size = 8, color = "darkblue", face = "bold"),
plot.caption = element_text(vjust = -1, hjust = 0.4, size = 10, face = "bold")) +
labs(title="Human Casualties by Storm/Weather Events Types (Top 10)") +
labs(caption = "Figure 1: Extreme Weather Event Types with the Most Harm to the U.S. Population Health") +
scale_fill_manual(name = "Human Casualties", labels = c( "Fatalities", "Injuries"), values = c( "#8D2D11","#F56964"))
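To make the melt step used above more concrete, the following minimal sketch melts just the first two rows of Table 1. The “wideDT” and “longDT” names are hypothetical; the values are taken from Table 1.

```r
library(data.table)

## Wide form: one row per event type, one column per casualty kind
wideDT <- data.table(EVTYPE = c("TORNADO", "EXCESSIVE HEAT"),
                     FATALITIES = c(5633, 1903),
                     INJURIES = c(91346, 6525))

## Long form: one row per (event type, casualty kind) pair
longDT <- melt(wideDT, id.vars = "EVTYPE",
               measure.vars = c("FATALITIES", "INJURIES"))
print(longDT)
```

In the long form, the “variable” column holds the casualty kind and the “value” column holds the count, which is what lets the plot map “variable” to the fill color and stack the two counts in one bar.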
The second question that this report answers is: across the United States, which types of events have the greatest economic consequences? Here too, the NOAA data set provides no means of judging the economic consequences of storms and other extreme weather events other than the crop and property damage figures. The data does not even include the health insurance cost of the human casualties caused by the events (which it records only as numbers of fatalities and injuries). Therefore, this report measures the effect of extreme weather event types on the U.S. economy in terms of total economic cost, calculated as the sum of the property damage and crop damage caused by each event type.
The following code sorts the rows of the “stormEffectDT” table in decreasing order of total economic cost, so that the event type with the highest economic cost is on top. The code extracts the relevant columns (“EVTYPE”, “PROPDMG”, “CROPDMG” and “totalEconomicCost”) and places them in a new data table, “EVTYPEbyEconomicCost”.
## Creating a new EVTYPEbyEconomicCost data table that has the event type and only the
## financial damages columns, then ordering its rows to have the most totalEconomicCost on top
EVTYPEbyEconomicCost <- stormEffectDT[order(totalEconomicCost, decreasing = TRUE), c(1, 5:7)]
## Checking EVTYPEbyEconomicCost, to confirm its structure
str(EVTYPEbyEconomicCost)
## Classes 'data.table' and 'data.frame': 985 obs. of 4 variables:
## $ EVTYPE : chr "FLOOD" "HURRICANE/TYPHOON" "TORNADO" "STORM SURGE" ...
## $ PROPDMG : num 1.45e+11 6.93e+10 5.69e+10 4.33e+10 1.57e+10 ...
## $ CROPDMG : num 5.66e+09 2.61e+09 4.15e+08 5.00e+03 3.03e+09 ...
## $ totalEconomicCost: num 1.50e+11 7.19e+10 5.74e+10 4.33e+10 1.88e+10 ...
## - attr(*, ".internal.selfref")=<externalptr>
The following table presents the top 10 extreme weather event types that caused the most economic harm to the U.S., and therefore the event types with the greatest economic consequences.
## Showing the top 10 event types that are with the most economic consequences to the U.S.
knitr::kable(EVTYPEbyEconomicCost[1:10,], caption = "**Table 2: Top 10 Extreme Weather Event Types with the Most Harm to the U.S. Economy**")
| EVTYPE | PROPDMG | CROPDMG | totalEconomicCost |
|---|---|---|---|
| FLOOD | 144657709807 | 5661968450 | 150319678257 |
| HURRICANE/TYPHOON | 69305840000 | 2607872800 | 71913712800 |
| TORNADO | 56947380676 | 414953270 | 57362333946 |
| STORM SURGE | 43323536000 | 5000 | 43323541000 |
| HAIL | 15735267513 | 3025954473 | 18761221986 |
| FLASH FLOOD | 16822673978 | 1421317100 | 18243991078 |
| DROUGHT | 1046106000 | 13972566000 | 15018672000 |
| HURRICANE | 11868319010 | 2741910000 | 14610229010 |
| RIVER FLOOD | 5118945500 | 5029459000 | 10148404500 |
| ICE STORM | 3944927860 | 5022113500 | 8967041360 |
Figure 2 shows the top 10 storm and other extreme weather event types with the greatest economic consequences to the U.S. as a stacked bar plot. Each event-type bar shows both the property damage (the red segment) and the crop damage (the teal segment). The two segments of each bar add up to the total economic cost shown in Table 2 above. The following code first melts the “EVTYPEbyEconomicCost” data table, for only the top 10 event types (rows). As stated earlier, “ggplot2” needs the data in melted (long) form to build a stacked bar plot. The resulting “meltedEVTYPEbyEconomicCost” data table is then passed to ggplot() to create Figure 2.
## Melting EVTYPEbyEconomicCost, but ONLY for the TOP 10 Event Types causing
## the Most Economic Harm
meltedEVTYPEbyEconomicCost <- melt(data = EVTYPEbyEconomicCost[1:10,],
id.vars = "EVTYPE", measure.vars = c("CROPDMG", "PROPDMG"))
## Using ggplot2, generating a "stacked" bar plot to show the financial damages,
## both crop and property damages, caused by the 10 TOP most harmful events
ggplot(meltedEVTYPEbyEconomicCost,
aes(reorder(EVTYPE, value, decreasing=TRUE), value/10^9, fill=variable)) +
geom_bar( position="stack", stat="identity") +
labs( x= "Storm/Weather Event Type", y= "Damages (in Billions of Dollars)") +
scale_y_continuous(n.breaks = 10) +
theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold", color = "darkblue"),
axis.text.x = element_text(angle=35, hjust=1, size = 7, color = "darkblue", face = "bold"),
axis.title.x = element_text(hjust = 0.5, size = 10, color = "darkblue", face = "bold"),
axis.title.y = element_text(hjust = 0.5, size = 10, color = "darkblue", face = "bold"),
axis.text.y = element_text(color = "darkblue", face = "bold"),
legend.text = element_text(size = 7, color = "darkblue", face = "bold"),
legend.title = element_text(size = 8, color = "darkblue", face = "bold"),
plot.caption = element_text(vjust = -1, hjust = 0.4, size = 10, face = "bold")) +
labs(title="Property and Crop Damages by Storm/Weather Events Types (Top 10)") +
labs(caption = "Figure 2: Extreme Weather Event Types with the Most Harm to the U.S. Economy") +
scale_fill_manual(name = "Damages", labels = c( "Crop", "Property"), values = c("#15B7BB", "#F56964"))
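Both plots rely on reorder() to place the bars from the most to the least harmful event type. As a hedged illustration of how that ordering works, the sketch below orders factor levels by the mean of a negated value, which is one way to get a decreasing order without relying on the “decreasing” argument. The vectors are made-up names holding a few fatality and injury counts from Table 1.

```r
## Long-form toy data: two values (fatalities, injuries) per event type
evtype <- c("FLOOD", "FLOOD", "TORNADO", "TORNADO", "WINTER STORM", "WINTER STORM")
value <- c(470, 6789, 5633, 91346, 206, 1321)

## reorder() sorts the levels by the mean of the second argument per level;
## negating the value puts the largest event type first
orderedEvtype <- reorder(evtype, -value)
print(levels(orderedEvtype))
## [1] "TORNADO"      "FLOOD"        "WINTER STORM"
```

Since each event type has the same number of rows in the melted table, ordering by the mean gives the same ranking as ordering by the total.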
This report started by asking two questions: across the United States, which types of storms and extreme weather events are most harmful with respect to population health, and which have the greatest economic consequences? The report discussed the steps taken to explore the NOAA data set to answer those two questions. It adhered to the principles of Reproducible Research by showing all the steps taken to process the data set, from cleaning it (following the tidy data guidelines), to summarizing it, and using it to arrive at the results presented. The code used to process the data and to generate and present the results is included in the R Markdown document that generated this HTML version of the report.
The final results show that: