This report investigates some key parameters of the storm conditions from the 1950 till 2011 by using the NOAA storm database. The main goal is to analyse the most harmful events in terms of population health and economic aftermath.
This section shows how data was pre-processed, first by importing the packages and then how data is loaded into R and processed for analysis
Before pre-processing data, I imported the packages “tidyverse” for data analysis and “ggrepel” for adding label texts to ggplots.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
I downloaded the dataset and used “setwd” to set work directory, I loaded the dataset “repdata_data_StormData.csv.bz2” by using the “read.csv()” function, which will read this bz2 compressed file. Importantly, this script requires that the work directory is set to the directory where the storm database is located.
#setwd("#####/####/miriam_assignment_2/data")
setwd("C:\\Users\\glori\\Downloads")
STORM_db <- read.csv("repdata_data_StormData.csv.bz2")
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
# Print column names and summary of the dataset to learn which type of data we have
colnames(STORM_db)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
summary(STORM_db)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
Because we want to investigate the population damage, we will analyse “FATALITIES” and “INJURIES”.
By using “group_by” and “summarise” functions, we can create a new dataset (“summarize_damage”) with the total number of “FATALITIES” and “INJURIES” based on event type “EVTYPE”.
summarize_damage <- STORM_db %>%
group_by(EVTYPE) %>%
summarise(total_fatalities = sum(FATALITIES, na.rm = TRUE),
total_injuries = sum(INJURIES, na.rm = TRUE))
#Filter top 5 events with most human damage
top5_fatalities <- summarize_damage %>%
top_n(total_fatalities, n = 5) %>%
arrange(desc(total_fatalities))
print(top5_fatalities)
## # A tibble: 5 × 3
## EVTYPE total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
# Filter top 5 events with most human damage
top5_injuries <- summarize_damage %>%
top_n(total_injuries, n = 5) %>%
arrange(desc(total_injuries))
print(top5_injuries)
## # A tibble: 5 × 3
## EVTYPE total_fatalities total_injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
summarize_damage_top <- summarize_damage %>%
mutate(is_top5_damage = case_when(EVTYPE %in% top5_fatalities$EVTYPE & EVTYPE %in% top5_injuries$EVTYPE ~ "Both_damages",
EVTYPE %in% top5_injuries$EVTYPE ~ "Injuries",
EVTYPE %in% top5_fatalities$EVTYPE ~ "Fatalities"))
only_top_humandamage <- summarize_damage_top %>%
filter(is_top5_damage %in% c("Both_damages", "Injuries","Fatalities"))
print(only_top_humandamage)
## # A tibble: 7 × 4
## EVTYPE total_fatalities total_injuries is_top5_damage
## <chr> <dbl> <dbl> <chr>
## 1 EXCESSIVE HEAT 1903 6525 Both_damages
## 2 FLASH FLOOD 978 1777 Fatalities
## 3 FLOOD 470 6789 Injuries
## 4 HEAT 937 2100 Fatalities
## 5 LIGHTNING 816 5230 Both_damages
## 6 TORNADO 5633 91346 Both_damages
## 7 TSTM WIND 504 6957 Injuries
For the visualization of this analysis, a scatter plot is created to show the correlation between the top injuries versus fatalities.
I used functions like geom_text_repel() to label the top event types that caused the highest human damage, where fatalities is in green, injuries in blue and both in red.
ggplot(summarize_damage_top, aes(total_fatalities, total_injuries)) +
geom_point() +
coord_cartesian(expand = TRUE) +
geom_text_repel(data = only_top_humandamage, aes(label = EVTYPE, color = is_top5_damage)) +
xlab("Fatalities") +
ylab("Injuries") +
labs(caption = "Figure 1. Scatter plot showing the top events that lead to most population health damage")
print(only_top_humandamage %>%
select(EVTYPE, total_fatalities, total_injuries, is_top5_damage) %>%
arrange(desc(total_fatalities)))
## # A tibble: 7 × 4
## EVTYPE total_fatalities total_injuries is_top5_damage
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 5633 91346 Both_damages
## 2 EXCESSIVE HEAT 1903 6525 Both_damages
## 3 FLASH FLOOD 978 1777 Fatalities
## 4 HEAT 937 2100 Fatalities
## 5 LIGHTNING 816 5230 Both_damages
## 6 TSTM WIND 504 6957 Injuries
## 7 FLOOD 470 6789 Injuries
As shown can see, the top human damage event is tornadoes, with 5633 total fatalities and 91346 total injuries, which are also in the top 5 of both fatalities and tornadoes from the storm dataset.
Across the United States, which types of events have the greatest economic consequences?
For that, we need to analyse the data from the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”, which we saw at the beginning when looking at the data. However, 2 of the variables (“PROPDMGEXP” and “CROPDMGEXP”) represent data in exponential characters.
unique(STORM_db$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
unique(STORM_db$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
We can see that the data has characters, where K/k is thousands, M/m is millions and B/b is billions.
Creating a new data table that first transforms the data of property damage and crop damage to numbers by multiplying by either 1e3 (thousands), 1e6 (millions) or 1e9 (billions).Then, this is transformed to new columns representing the sum of property and crop damage, grouped by event type (EVTYPE).
All the plots and numbers included in the new data table are calculated in billions, for a more clear visualization of the data in Figure 2.
summarize_economic_damage <- STORM_db %>%
mutate(propertydamage = case_when(PROPDMGEXP %in% c("K","k") ~ PROPDMG*1e3,
PROPDMGEXP %in% c("M","m") ~ PROPDMG*1e6,
PROPDMGEXP %in% c("B","b") ~ PROPDMG*1e9,
TRUE ~ PROPDMG),
cropdamage = case_when(CROPDMGEXP %in% c("K","k") ~ CROPDMG*1e3,
CROPDMGEXP %in% c("M","m") ~ CROPDMG*1e6,
CROPDMGEXP %in% c("B","b") ~ CROPDMG*1e9,
TRUE ~ CROPDMG)) %>%
group_by(EVTYPE) %>%
summarise(total_propertydamage = sum(propertydamage, na.rm = TRUE)/1e9,
total_cropdamage = sum(cropdamage, na.rm = TRUE)/1e9)
top5_propertydamage <- summarize_economic_damage %>%
top_n(total_propertydamage, n = 5)
top5_cropdamage <- summarize_economic_damage %>%
top_n(total_cropdamage, n = 5)
print(top5_propertydamage)
## # A tibble: 5 × 3
## EVTYPE total_propertydamage total_cropdamage
## <chr> <dbl> <dbl>
## 1 FLASH FLOOD 16.1 1.42
## 2 FLOOD 145. 5.66
## 3 HURRICANE/TYPHOON 69.3 2.61
## 4 STORM SURGE 43.3 0.000005
## 5 TORNADO 56.9 0.415
print(top5_cropdamage)
## # A tibble: 5 × 3
## EVTYPE total_propertydamage total_cropdamage
## <chr> <dbl> <dbl>
## 1 DROUGHT 1.05 14.0
## 2 FLOOD 145. 5.66
## 3 HAIL 15.7 3.03
## 4 ICE STORM 3.94 5.02
## 5 RIVER FLOOD 5.12 5.03
summarize_economic_damage_top <- summarize_economic_damage %>%
mutate(is_top5_damage = case_when(EVTYPE %in% top5_propertydamage$EVTYPE & EVTYPE %in% top5_cropdamage$EVTYPE ~ "Both_damages",
EVTYPE %in% top5_propertydamage$EVTYPE ~ "Property_damage",
EVTYPE %in% top5_cropdamage$EVTYPE ~ "Crop_damage"))
only_top_damage <- summarize_economic_damage_top %>%
filter(is_top5_damage %in% c("Both_damages", "Property_damage","Crop_damage")) %>%
mutate(total_damage = total_cropdamage + total_propertydamage)
For the visualization of this analysis, a scatter plot is created to show the correlation between the top crop damage versus the property damage.
I used functions like geom_text_repel() to label the top event types that caused the highest economic damage, where crop damage is in green, property in blue and both in red.
ggplot(summarize_economic_damage_top, aes(x = total_propertydamage, y = total_cropdamage)) +
geom_point() +
coord_cartesian(expand = TRUE) +
geom_text_repel(data = only_top_damage, aes(x = total_propertydamage, y = total_cropdamage, label = EVTYPE, color = is_top5_damage)) +
xlab("Property Damage (billions of Dollars)") +
ylab("Crop Damage (billions of Dollars)") +
labs(caption = "Figure 2. Scatter plot showing the top events that lead to most economic damage")
print(only_top_damage %>%
select(EVTYPE, total_damage, total_propertydamage, total_cropdamage, is_top5_damage) %>%
arrange(desc(total_damage)))
## # A tibble: 9 × 5
## EVTYPE total_damage total_propertydamage total_cropdamage is_top5_damage
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 FLOOD 150. 145. 5.66 Both_damages
## 2 HURRICANE/T… 71.9 69.3 2.61 Property_dama…
## 3 TORNADO 57.4 56.9 0.415 Property_dama…
## 4 STORM SURGE 43.3 43.3 0.000005 Property_dama…
## 5 HAIL 18.8 15.7 3.03 Crop_damage
## 6 FLASH FLOOD 17.6 16.1 1.42 Property_dama…
## 7 DROUGHT 15.0 1.05 14.0 Crop_damage
## 8 RIVER FLOOD 10.1 5.12 5.03 Crop_damage
## 9 ICE STORM 8.97 3.94 5.02 Crop_damage
As we can see, the top economic damage event if we put together the crop and the property damage are floods, with 150 billion dollars, with 145 billion dollars loss for property and 5.66 in crop damage. Interestingly, the event that lead to the most crop damage was drought, with 14 billions of dollars, while only 1 billion dollar in property damage.
The analysis of this data shows that tornadoes cause the biggest human damage, specially in number of reported injuries, while floods and drought have the biggest impact in crops and floods, in addition have a harmful effect in property.