Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events: The data analysis must address the following questions:
1.- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2.- Across the United States, which types of events have the greatest economic consequences?
This analysis shows that the most harmful type of weather events (1950 - 2011) to population health (including fatalities and injuries) was “Tornados” with 96,980 casualties and the most harmful to economy cost (Property and Crops) was “Floods” with $150,320 Million dollars
library(ggplot2)
library(grid)
library(gridExtra)
library(ggpubr)
library(dplyr)
The data for this assignment can be downloaded from the course web site: Storm Data [47Mb]. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. For more information, please review the documentation Storm Data Documentation
I have already downloaded the data into my working directory.
storm_data <- read.csv("./repdata_data_StormData.csv")
storm_data <- as_tibble(storm_data)
storm_data
## # A tibble: 902,297 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 4/18/19~ 0130 CST 97 MOBILE AL TORNA~ 0
## 2 1 4/18/19~ 0145 CST 3 BALDWIN AL TORNA~ 0
## 3 1 2/20/19~ 1600 CST 57 FAYETTE AL TORNA~ 0
## 4 1 6/8/195~ 0900 CST 89 MADISON AL TORNA~ 0
## 5 1 11/15/1~ 1500 CST 43 CULLMAN AL TORNA~ 0
## 6 1 11/15/1~ 2000 CST 77 LAUDERDALE AL TORNA~ 0
## 7 1 11/16/1~ 0100 CST 9 BLOUNT AL TORNA~ 0
## 8 1 1/22/19~ 0900 CST 123 TALLAPOOSA AL TORNA~ 0
## 9 1 2/13/19~ 2000 CST 125 TUSCALOOSA AL TORNA~ 0
## 10 1 2/13/19~ 2000 CST 57 FAYETTE AL TORNA~ 0
## # ... with 902,287 more rows, and 28 more variables: BGN_AZI <chr>,
## # BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>, COUNTY_END <dbl>,
## # COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>, END_LOCATI <chr>,
## # LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## # INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>, CROPDMG <dbl>,
## # CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>, ZONENAMES <chr>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <chr>, REFNUM <dbl>
In order to optimize resources, we only consider rows where Fatalities, Injuries, Property Damage Expense or Crop Damage Expense are > 0 and select 7 columns that will be used in this analysis (below are the columns selected). Next, transform the exponent values (“”,1,H,K,M,B) for both Crop Damages and Property Damage into numerical values and multiply them by economic damages
The raw data contains a lot of rows with values 0. Hence I’m filtering rows when values > 0.
sum (storm_data["FATALITIES"] == 0 & storm_data["INJURIES"] == 0 &
storm_data["PROPDMG"] == 0 & storm_data["CROPDMG"] == 0)
## [1] 647664
req_data <- storm_data %>%
select(c(8, 23:28)) %>%
filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
req_data
## # A tibble: 254,633 x 7
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 TORNADO 0 15 25 K 0 ""
## 2 TORNADO 0 0 2.5 K 0 ""
## 3 TORNADO 0 2 25 K 0 ""
## 4 TORNADO 0 2 2.5 K 0 ""
## 5 TORNADO 0 2 2.5 K 0 ""
## 6 TORNADO 0 6 2.5 K 0 ""
## 7 TORNADO 0 1 2.5 K 0 ""
## 8 TORNADO 0 0 2.5 K 0 ""
## 9 TORNADO 1 14 25 K 0 ""
## 10 TORNADO 0 0 25 K 0 ""
## # ... with 254,623 more rows
As you can see the required data does not have any NA values.
sum(is.na(req_data))
## [1] 0
Preparing the data The storm data set includes event-related injuries and fatalities, both of which need to be considered for impact to population health.
All the fatalities are grouped by weather events, summed and sorted.
fatalities <- req_data %>%
group_by(EVTYPE) %>%
select(FATALITIES) %>%
summarise(FATALITIES = sum(FATALITIES)) %>%
arrange(desc(FATALITIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
fatalities
## # A tibble: 488 x 2
## EVTYPE FATALITIES
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## # ... with 478 more rows
All the injuries are grouped by weather events, summed and sorted.
injuries <- req_data %>%
group_by(EVTYPE) %>%
select(INJURIES) %>%
summarise(INJURIES = sum(INJURIES)) %>%
arrange(desc(INJURIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
injuries
## # A tibble: 488 x 2
## EVTYPE INJURIES
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## # ... with 478 more rows
Both the fatalities and injuries are combined and grouped by weather events, summed and sorted.
combined <- req_data %>%
group_by(EVTYPE) %>%
select(FATALITIES, INJURIES) %>%
summarise(FATALITIES.and.INJURIES = sum(FATALITIES + INJURIES)) %>%
arrange(desc(FATALITIES.and.INJURIES))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
combined
## # A tibble: 488 x 2
## EVTYPE FATALITIES.and.INJURIES
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
## # ... with 478 more rows
All the three subsets created above are joined together.
health_effects <- left_join(fatalities, injuries, by="EVTYPE") %>%
left_join(., combined, by='EVTYPE')
health_effects
## # A tibble: 488 x 4
## EVTYPE FATALITIES INJURIES FATALITIES.and.INJURIES
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 FLASH FLOOD 978 1777 2755
## 4 HEAT 937 2100 3037
## 5 LIGHTNING 816 5230 6046
## 6 TSTM WIND 504 6957 7461
## 7 FLOOD 470 6789 7259
## 8 RIP CURRENT 368 232 600
## 9 HIGH WIND 248 1137 1385
## 10 AVALANCHE 224 170 394
## # ... with 478 more rows
Preparing the data Physical damages include property damages and crop damages, so we will combine both to find the events with the greatest economic consequences.
The damage value is represented by two parts “-DMG” (numeric) and “-DMGEXP” (alphanumeric) so the the first step is to convert exponent DMGEXP to numerical value for comparison then calculate to damage mounts for property and crops, then combine property damages and crop damages into a new column (CROP_PROP_TOTAL_DMG) and finally summarize damages by event types and rank to find the most significant events.
In order to convert exponential values to numeric values, I wrote the function below.
convExponent <- function(dmg, exp) {
if (exp == "K") {
dmg * 1000
} else if (exp == "M") {
dmg * 1e+06
} else if (exp == "B") {
dmg * 1e+09
} else if (exp == "") {
dmg
} else {
stop("NOT VALID DATA")
}
}
Changing the exponential values to numeric values.
exp_data <- req_data %>%
filter(PROPDMGEXP %in% c("", "K", "M", "B") &
CROPDMGEXP %in% c("", "K", "M", "B"))
exp_data$NEW_PROPDMG <- mapply(convExponent, exp_data$PROPDMG, exp_data$PROPDMGEXP)
exp_data$NEW_CROPDMG <- mapply(convExponent, exp_data$CROPDMG, exp_data$CROPDMGEXP)
exp_data
## # A tibble: 254,329 x 9
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP NEW_PROPDMG
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 TORNA~ 0 15 25 K 0 "" 25000
## 2 TORNA~ 0 0 2.5 K 0 "" 2500
## 3 TORNA~ 0 2 25 K 0 "" 25000
## 4 TORNA~ 0 2 2.5 K 0 "" 2500
## 5 TORNA~ 0 2 2.5 K 0 "" 2500
## 6 TORNA~ 0 6 2.5 K 0 "" 2500
## 7 TORNA~ 0 1 2.5 K 0 "" 2500
## 8 TORNA~ 0 0 2.5 K 0 "" 2500
## 9 TORNA~ 1 14 25 K 0 "" 25000
## 10 TORNA~ 0 0 25 K 0 "" 25000
## # ... with 254,319 more rows, and 1 more variable: NEW_CROPDMG <dbl>
All the crop damages are grouped by weather events, summed and sorted.
crop_damage <- exp_data %>%
group_by(EVTYPE) %>%
select(NEW_CROPDMG) %>%
summarise(CROP_TOTAL = sum(NEW_CROPDMG)/10^6) %>%
arrange(desc(CROP_TOTAL))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
crop_damage
## # A tibble: 482 x 2
## EVTYPE CROP_TOTAL
## <chr> <dbl>
## 1 DROUGHT 13973.
## 2 FLOOD 5662.
## 3 RIVER FLOOD 5029.
## 4 ICE STORM 5022.
## 5 HAIL 3001.
## 6 HURRICANE 2742.
## 7 HURRICANE/TYPHOON 2608.
## 8 FLASH FLOOD 1421.
## 9 EXTREME COLD 1293.
## 10 FROST/FREEZE 1094.
## # ... with 472 more rows
All the property damages are grouped by weather events, summed and sorted.
prop_damage <- exp_data %>%
group_by(EVTYPE) %>%
select(NEW_PROPDMG) %>%
summarise(PROP_TOTAL = sum(NEW_PROPDMG)/10^6) %>%
arrange(desc(PROP_TOTAL))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
prop_damage
## # A tibble: 482 x 2
## EVTYPE PROP_TOTAL
## <chr> <dbl>
## 1 FLOOD 144658.
## 2 HURRICANE/TYPHOON 69306.
## 3 TORNADO 56925.
## 4 STORM SURGE 43324.
## 5 FLASH FLOOD 16141.
## 6 HAIL 15727.
## 7 HURRICANE 11868.
## 8 TROPICAL STORM 7704.
## 9 WINTER STORM 6688.
## 10 HIGH WIND 5270.
## # ... with 472 more rows
crop and property damages are combined together, grouped by weather events, summed and sorted.
combined_damage <- exp_data %>%
group_by(EVTYPE) %>%
select(NEW_PROPDMG, NEW_CROPDMG) %>%
summarise(TOTAL_DMG = sum(NEW_PROPDMG + NEW_CROPDMG)/10^6) %>%
arrange(desc(TOTAL_DMG))
## Adding missing grouping variables: `EVTYPE`
## `summarise()` ungrouping output (override with `.groups` argument)
combined_damage
## # A tibble: 482 x 2
## EVTYPE TOTAL_DMG
## <chr> <dbl>
## 1 FLOOD 150320.
## 2 HURRICANE/TYPHOON 71914.
## 3 TORNADO 57290.
## 4 STORM SURGE 43324.
## 5 HAIL 18728.
## 6 FLASH FLOOD 17562.
## 7 DROUGHT 15019.
## 8 HURRICANE 14610.
## 9 RIVER FLOOD 10148.
## 10 ICE STORM 8967.
## # ... with 472 more rows
All the data subsets created above are joined together.
economic_effects <- left_join(crop_damage, prop_damage, by="EVTYPE") %>%
left_join(., combined_damage, by='EVTYPE')
economic_effects
## # A tibble: 482 x 4
## EVTYPE CROP_TOTAL PROP_TOTAL TOTAL_DMG
## <chr> <dbl> <dbl> <dbl>
## 1 DROUGHT 13973. 1046. 15019.
## 2 FLOOD 5662. 144658. 150320.
## 3 RIVER FLOOD 5029. 5119. 10148.
## 4 ICE STORM 5022. 3945. 8967.
## 5 HAIL 3001. 15727. 18728.
## 6 HURRICANE 2742. 11868. 14610.
## 7 HURRICANE/TYPHOON 2608. 69306. 71914.
## 8 FLASH FLOOD 1421. 16141. 17562.
## 9 EXTREME COLD 1293. 67.7 1361.
## 10 FROST/FREEZE 1094. 9.48 1104.
## # ... with 472 more rows
The following plot demonstrates that tornado is the most harmful weather event to population health and floods have the greatest economic consequences.
g1 <- ggplot(health_effects[1:10,], aes(x=FATALITIES.and.INJURIES,
y= reorder(EVTYPE,FATALITIES.and.INJURIES )))
g2 <- ggplot(economic_effects[1:10,], aes(x=TOTAL_DMG, y= reorder(EVTYPE, TOTAL_DMG)))
plot1 <- g1 + geom_bar(stat="identity", fill=heat.colors(10, alpha = 1)) +
labs(x="Fatalities and Injuries", y="Event Types",
title ="Population Health") +
theme(plot.title = element_text(hjust = 0.5,family="CM Roman", face = "italic"))
plot2 <- g2 + geom_bar(stat="identity", fill=topo.colors(10, alpha = 1)) +
labs(x="Crop and Property Damage", y="Event Types",
title = "USA Economy") +
theme(plot.title = element_text(hjust = 0.5, family="CM Roman", face = "italic"))
title1=text_grob("Top 10 Events has most Impact on", size = 15, face = "bold", hjust=0.5)
grid.arrange(plot1, plot2, top=title1)