Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: - Storm Data There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined. - Storm Data Documentation National Weather Service - National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The GITHUB is also available on: link
library(tidyverse) # to load required package
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Storm <- read_csv("Data_Science_Project/Storm.csv") # to load correct dataset
## Rows: 902297 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): BGN_DATE, BGN_TIME, TIME_ZONE, COUNTYNAME, STATE, EVTYPE, BGN_AZI,...
## dbl (18): STATE__, COUNTY, BGN_RANGE, COUNTY_END, END_RANGE, LENGTH, WIDTH, ...
## lgl (1): COUNTYENDN
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Storm)
names(Storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Selected Variables: 1. BGN_DATE : The beginning date 2. COUNTY : Number of the County where the event occurred 3. COUNTYNAME : Name of the county where the event occurred 4. STATE : State where the event occurred 5. EVTYPE : Type of events 6. FATALITIES : Numbers of corpses during the events 7. INJURIES : Physical Damages during the events 8. PROPDMG : Value of property damage in USD 9. PROPDMGEXP : Unit multiplier for property damage 10. CROPDMG : Value of crop damage in USD 11. CROPDMGEXP : Unit multiplier for crop damage
data <- Storm %>%
select(BGN_DATE, COUNTY, COUNTYNAME, STATE, EVTYPE,
FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
rm(Storm)
str(data)
## tibble [902,297 × 11] (S3: tbl_df/tbl/data.frame)
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] NA NA NA NA ...
summary(data)
## BGN_DATE COUNTY COUNTYNAME STATE
## Length:902297 Min. : 0.0 Length:902297 Length:902297
## Class :character 1st Qu.: 31.0 Class :character Class :character
## Mode :character Median : 75.0 Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
data$DATE <- mdy_hms(data$BGN_DATE) # to create new column date
data$MONTH <- month(data$DATE, label = T, abbr = F) # to create new column with the months
data$YEAR <- year(data$DATE) # to extract the Year in the date
table(data$PROPDMGEXP) # to see the letters used to characterise the amount of dollar
##
## - ? + 0 1 2 3 4 5 6 7
## 1 8 5 216 25 13 4 4 28 4 5
## 8 B h H K m M
## 1 40 1 6 424665 7 11330
table(data$CROPDMGEXP) # to see the letters used to characterise the amount of dollar
##
## ? 0 2 B k K m M
## 7 19 1 9 21 281832 1 1994
# To change the letters into the corresponding number
data$CROPDMGEXP[!grepl("K|M|B", data$CROPDMGEXP, ignore.case = TRUE)] <- 0
data$PROPDMGEXP[!grepl("K|M|B", data$PROPDMGEXP, ignore.case = TRUE)] <- 0
data$PROPDMGEXP[grep("K", data$PROPDMGEXP, ignore.case = TRUE)] <- "3"
data$PROPDMGEXP[grep("M", data$PROPDMGEXP, ignore.case = TRUE)] <- "6"
data$PROPDMGEXP[grep("B", data$PROPDMGEXP, ignore.case = TRUE)] <- "9"
data$PROPDMGEXP <- as.numeric(as.character(data$PROPDMGEXP))
data$CROPDMGEXP[grep("K", data$CROPDMGEXP, ignore.case = TRUE)] <- "3"
data$CROPDMGEXP[grep("M", data$CROPDMGEXP, ignore.case = TRUE)] <- "6"
data$CROPDMGEXP[grep("B", data$CROPDMGEXP, ignore.case = TRUE)] <- "9"
data$CROPDMGEXP <- as.numeric(as.character(data$CROPDMGEXP))
# To assign the letter into the corresponding amount in dollar
data$PROPCOST <- data$PROPDMG * 10^data$PROPDMGEXP #
data$CROPCOST <- data$CROPDMG * 10^data$CROPDMGEXP
# To sort the cost of Property damage in a decreasing order
sort(table(data$PROPCOST), decreasing = T)[1:20]
##
## 0 5000 10000 1000 2000 25000 50000 3000 20000 15000
## 663123 31731 21787 17544 17186 17104 13596 10364 9179 8617
## 250000 500 1e+05 2500 30000 5e+05 4000 8000 75000 2500000
## 8439 6707 6302 5807 4391 4000 3202 2877 2419 2411
# To sort the cost of Crop damage in a decreasing order
sort(table(data$CROPCOST), decreasing = T)[1:20]
##
## 0 5000 10000 50000 1e+05 1000 2000 25000 20000 5e+05 15000
## 880198 4097 2349 1984 1233 956 951 830 758 721 598
## 500 3000 250000 2e+05 1e+06 30000 75000 150000 3e+05
## 568 554 513 479 447 317 290 268 250
data$EVTYPE <- tolower(data$EVTYPE) # to change EVTYPE into lower case
# To substitute the string in order to have less categories
data$events <- data$EVTYPE
data$events <- gsub('.avalan.', 'avalanche', data$events)
data$events <- gsub('.storm.', 'storm', data$events)
data$events <- gsub('.*storm.*', 'storm', data$events)
data$events <- gsub('.*avalan.*', 'avalanche', data$events)
data$events <- gsub('.*tornado.*', 'tornado', data$events)
data$events <- gsub('.*tstm.*', 'storm', data$events)
data$events <- gsub('.*flood.*', 'flood', data$events)
data$events <- gsub('.*fire.*', 'fire', data$events)
data$events <- gsub('.*cold.*', 'cold', data$events)
data$events <- gsub('.*freez.*', 'cold', data$events)
data$events <- gsub('.*frost.*', 'cold', data$events)
data$events <- gsub('.*ice.*', 'cold', data$events)
data$events <- gsub('.*low.*', 'cold', data$events)
data$events <- gsub('.*hail.*', 'hail', data$events)
data$events <- gsub('.*dry.*', 'heat', data$events)
data$events <- gsub('.*icy.*', 'cold', data$events)
data$events <- gsub('.*summary.*', 'summary', data$events)
data$events <- gsub('.*warmth.*', 'heat', data$events)
data$events <- gsub('.*warm.*', 'heat', data$events)
data$events <- gsub('.*hurricane.*', 'hurricane', data$events)
data$events <- gsub('.*drought.*', 'heat', data$events)
data$events <- gsub('.*hot.*', 'heat', data$events)
data$events <- gsub('.*dri.*', 'heat', data$events)
data$events <- gsub('.*snow.*', 'snow', data$events)
data$events <- gsub('.*torn.*', 'tornado', data$events)
data$events <- gsub('.*volca.*', 'volcano', data$events)
data$events <- gsub('.*rain.*', 'rain', data$events)
data$events <- gsub('.*heat.*', 'heat', data$events)
data$events <- gsub('.*lightning.*', 'lightning', data$events)
data$events <- gsub('.*wint.*', 'cold', data$events)
data$events <- gsub('.*wind.*', 'wind', data$events)
data$events <- gsub('.*precipitation.*', 'rain', data$events)
data$events <- gsub('.*floo.*', 'flood', data$events)
data$events <- gsub('.*whir.*', 'wind', data$events)
data$events <- gsub('.*precip.*', 'rain', data$events)
data$events <- gsub('.*blizzard.*', 'rain', data$events)
data$events <- gsub('.*wet.*', 'wet', data$events)
data$events <- gsub('.*wat.*', 'wet', data$events)
data$events <- gsub('.*wayt.*', 'wet', data$events)
data$events <- gsub('.*urban.*', 'flood', data$events)
data$events <- gsub('.*cloud.*', 'cloud', data$events)
data$events <- gsub('.*surf.*', 'surf', data$events)
data$events <- gsub('.*fog.*', 'fog', data$events)
head(sort(table(data$events), decreasing = T), n = 15) # to see only the 15 first greatest events
##
## storm hail flood tornado wind snow lightning rain
## 351859 289277 86135 60701 26553 17535 15760 14674
## cold cloud heat fire wet fog surf
## 13045 6944 5779 4240 3897 1834 1063
data$All_Events <- 'OTHER' # to create a column names All_Events
# to set to other all the features less than 1000
data$All_Events[grep("storm", data$events, ignore.case = TRUE)] <- "STORM"
data$All_Events[grep("hail", data$events, ignore.case = TRUE)] <- "HAIL"
data$All_Events[grep("flood", data$events, ignore.case = TRUE)] <- "FLOOD"
data$All_Events[grep("tornado", data$events, ignore.case = TRUE)] <- "TORNADO"
data$All_Events[grep("wind", data$events, ignore.case = TRUE)] <- "WIND"
data$All_Events[grep("snow", data$events, ignore.case = TRUE)] <- "SNOW"
data$All_Events[grep("lightning", data$events, ignore.case = TRUE)] <- "LIGHTNING"
data$All_Events[grep("rain", data$events, ignore.case = TRUE)] <- "RAIN"
data$All_Events[grep("cold", data$events, ignore.case = TRUE)] <- "COLD"
data$All_Events[grep("cloud", data$events, ignore.case = TRUE)] <- "CLOUD"
data$All_Events[grep("heat", data$events, ignore.case = TRUE)] <- "HEAT"
data$All_Events[grep("fire", data$events, ignore.case = TRUE)] <- "FIRE"
data$All_Events[grep("wet", data$events, ignore.case = TRUE)] <- "WET"
data$All_Events[grep("fog", data$events, ignore.case = TRUE)] <- "FOG"
data$All_Events[grep("surf", data$events, ignore.case = TRUE)] <- "SURF"
data$events <- data$All_Events # To assign the value of All_Events to events
data$All_Events <- NULL # To remove the column All_Events
This table shows the event that causes the most health impact
# To create the table of the of the elements that affect the the Health by percentage
data %>% group_by(events) %>%
summarise(Total = sum(FATALITIES + INJURIES, na.rm = T)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep('Fatalities and Injuries')) %>%
arrange(desc(Total))
## # A tibble: 16 × 4
## events Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 TORNADO 97068 62.4 Fatalities and Injuries
## 2 STORM 14909 9.58 Fatalities and Injuries
## 3 HEAT 12457 8.00 Fatalities and Injuries
## 4 FLOOD 10234 6.57 Fatalities and Injuries
## 5 LIGHTNING 6048 3.88 Fatalities and Injuries
## 6 OTHER 3559 2.29 Fatalities and Injuries
## 7 WIND 2362 1.52 Fatalities and Injuries
## 8 COLD 1748 1.12 Fatalities and Injuries
## 9 FIRE 1698 1.09 Fatalities and Injuries
## 10 HAIL 1386 0.89 Fatalities and Injuries
## 11 RAIN 1315 0.845 Fatalities and Injuries
## 12 SNOW 1285 0.825 Fatalities and Injuries
## 13 FOG 1156 0.743 Fatalities and Injuries
## 14 SURF 409 0.263 Fatalities and Injuries
## 15 WET 36 0.023 Fatalities and Injuries
## 16 CLOUD 3 0.002 Fatalities and Injuries
# To create a new variable to plot
ToPlotHealth <- data %>% group_by(events) %>%
summarise(Total = sum(FATALITIES + INJURIES, na.rm = T)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep('Fatalities and Injuries')) %>%
arrange(desc(Total))
# To arrange the percentage in decreasing order to plot
ToPlotHealth$events <- factor(ToPlotHealth$events,
levels = ToPlotHealth$events[order(ToPlotHealth$Percent,
decreasing = T)])
# The Plot of the Impact on Health per percentage
ggplot(ToPlotHealth, aes(x = events, y = Percent, fill = events)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = paste0(round(Percent, 3), '%')), vjust = -.5) +
labs(x = 'Health Damage', y = 'Percentage (%)') +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_minimal() + theme(legend.position = 'none') +
ggtitle('Events that cause the Health damage in percentage in the US \nFrom 1950 - 2011')
This table shows the event that causes the most economic impact.
# To see the economic impacts per type of events
data %>% group_by(events) %>%
summarise(Total = sum(PROPCOST + CROPCOST, na.rm = T)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3),
Type = rep('Economic Damage')) %>%
arrange(desc(Total))
## # A tibble: 16 × 4
## events Total Percent Type
## <chr> <dbl> <dbl> <chr>
## 1 FLOOD 179976102832. 37.8 Economic Damage
## 2 OTHER 91392479990 19.2 Economic Damage
## 3 STORM 84624213627. 17.8 Economic Damage
## 4 TORNADO 59010560049. 12.4 Economic Damage
## 5 HAIL 18780451671. 3.94 Economic Damage
## 6 HEAT 15950224630 3.35 Economic Damage
## 7 FIRE 8904910130 1.87 Economic Damage
## 8 WIND 6840844523 1.44 Economic Damage
## 9 RAIN 4816707440 1.01 Economic Damage
## 10 COLD 3763584450 0.79 Economic Damage
## 11 SNOW 1140377802. 0.239 Economic Damage
## 12 LIGHTNING 940781537. 0.197 Economic Damage
## 13 WET 157104700 0.033 Economic Damage
## 14 SURF 101475000 0.021 Economic Damage
## 15 FOG 22829500 0.005 Economic Damage
## 16 CLOUD 194600 0 Economic Damage
# To create a new variable to plot
ToPlotEcono <- data %>% group_by(events) %>%
summarise(Total = sum(PROPCOST + CROPCOST, na.rm = T)) %>%
mutate(Percent = round(Total / sum(Total) * 100, 3), Type = rep('Economic Damage')) %>%
arrange(desc(Total))
ToPlotEcono$events <- factor(ToPlotEcono$events,
levels = ToPlotEcono$events[order(ToPlotEcono$Percent,
decreasing = T)])
# The plot of the Economic Impact per event
ggplot(ToPlotEcono, aes(x = events, y = Percent, fill = events)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = paste0(round(Percent, 3), '%')), vjust = -.5) +
labs(x = 'Economic Damage', y = 'Percentage (%)') +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
theme_minimal() + theme(legend.position = 'none') +
ggtitle('Events that cause the economic damage in percentage in the US \nFrom 1950 - 2011')
Based on the results of the Analysis, we can conclude that: 1. The most Health damages are caused by Tornado, 62% 2. The most Economic damages are cause by Flood, 38%