A preliminary look at NOAA data on fatalities, injuries, and economic damages
A simple exploratory analysis is performed, aimed at classifying weather-related events in order of number of fatalities, number of related injuries, and property and crop damage caused.
Some data manipulation, carefully reported and reproducible, allows the use of ggplot() functions to visualize the quantitative information of interest in the form of barplots.
A laconic comment preceeds most r code blocks.
A brief illustration of the results follow the plots included in the present report.
Start by calling the necessary libraries.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
First, we read the data and do some preliminary inspection.
rawData <- read.csv("/Users/francescoaldo/Desktop/COURSES MATERIAL/R DATA SCIENCE - Coursera/c5_w4_ProgAss/repdata-data-StormData.csv", header = TRUE)
StormData <- rawData
dim(StormData)
## [1] 902297 37
names(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
After unzipping and reading in the comma separated data, we keep only the variables we are interested in1.
selectedVars <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
StormData <- StormData[, selectedVars]
dim(StormData)
## [1] 902297 8
str(StormData)
## 'data.frame': 902297 obs. of 8 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
From a 902,297 x 37 data set we are now down to 902,297 x 8. Let’s take a look at the numeric variables.
summary(StormData[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")])
## FATALITIES INJURIES PROPDMG CROPDMG
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0000 Median : 0.00 Median : 0.000
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :583.0000 Max. :1700.0000 Max. :5000.00 Max. :990.000
The ‘PROPDMGEXP’ and ‘CROPDMGEXP’ variables contain the exponents (or orders of magnitude), coded as charater inputs, for the ‘PROPDMG’ and ‘CROPDMG’ variables respectively. A double call to the unique() function tells us what unique values they assume, and we then “translate” those to numbers based on the documentation.
unique(StormData$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(StormData$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
StormData$PROPDMGEXP <- toupper(StormData$PROPDMGEXP)
StormData$PROPDMGEXP[StormData$PROPDMGEXP %in% c("", "+", "-", "?")] <- "0"
StormData$PROPDMGEXP[StormData$PROPDMGEXP %in% c("B")] <- "9"
StormData$PROPDMGEXP[StormData$PROPDMGEXP %in% c("M")] <- "6"
StormData$PROPDMGEXP[StormData$PROPDMGEXP %in% c("K")] <- "3"
StormData$PROPDMGEXP[StormData$PROPDMGEXP %in% c("H")] <- "2"
StormData$CROPDMGEXP <- toupper(StormData$CROPDMGEXP)
StormData$CROPDMGEXP[StormData$CROPDMGEXP %in% c("", "+", "-", "?")] <- "0"
StormData$CROPDMGEXP[StormData$CROPDMGEXP %in% c("B")] <- "9"
StormData$CROPDMGEXP[StormData$CROPDMGEXP %in% c("M")] <- "6"
StormData$CROPDMGEXP[StormData$CROPDMGEXP %in% c("K")] <- "3"
StormData$CROPDMGEXP[StormData$CROPDMGEXP %in% c("H")] <- "2"
Now we use the numeric exponents above to create ‘total’ variables.
StormData$PROPDMGTOTAL <- StormData$PROPDMG*(10^as.numeric(StormData$PROPDMGEXP))
StormData$CROPDMGTOTAL <- StormData$CROPDMG*(10^as.numeric(StormData$CROPDMGEXP))
StormData$DMGTOTAL <- StormData$PROPDMGTOTAL + StormData$CROPDMGTOTAL
We use the ‘dplyr’ package pipeline operator to compute the total number of injuries, fatalities, and damages by type of event.
Totals <- StormData %>%
group_by(EVTYPE) %>%
summarize(SUMFATALITIES = sum(FATALITIES),
SUMINJURIES = sum(INJURIES),
SUMPROPDMG = sum(PROPDMGTOTAL),
SUMCROPDMG = sum(CROPDMGTOTAL),
TOTALDMG = sum(DMGTOTAL)
)
head(Totals, 5)
## # A tibble: 5 x 6
## EVTYPE SUMFATALITIES SUMINJURIES SUMPROPDMG SUMCROPDMG TOTALDMG
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " HIGH SURF ADVISO… 0 0 200000 0 200000
## 2 " COASTAL FLOOD" 0 0 0 0 0
## 3 " FLASH FLOOD" 0 0 50000 0 50000
## 4 " LIGHTNING" 0 0 0 0 0
## 5 " TSTM WIND" 0 0 8100000 0 8100000
Let’s get a look at the ordering, now. First, the worst: fatalities. We’ll show the “top ten”.
TotalFatalities <- arrange(Totals, desc(SUMFATALITIES))
TopTenFatalities <- head(TotalFatalities, 10)
Similarly, for injuries:
TotalInjuries <- arrange(Totals, desc(SUMINJURIES))
TopTenInjuries <- head(TotalInjuries, 10)
Finally, property, crop, and total damage.
TotalPropertyDMG <- arrange(Totals, desc(SUMPROPDMG))
TopTenPropertyDMG <- head(TotalPropertyDMG, 10)
TotalCropDMG <- arrange(Totals, desc(SUMCROPDMG))
TopTenCropDMG <- head(TotalCropDMG, 10)
TotalDMG <- arrange(Totals, desc(TOTALDMG))
TopTenTotalDMG <- head(TotalDMG, 10)
TopTenFatalities$EVTYPE <- with(TopTenFatalities,
reorder(EVTYPE, -SUMFATALITIES))
plotFatalities <- ggplot(TopTenFatalities,
aes(EVTYPE, SUMFATALITIES, label = SUMFATALITIES))
plotFatalities <- plotFatalities +
geom_bar(stat = "identity", color = "black",
fill = "lightblue", width = 0.75) +
geom_text(nudge_y = 200, color ="darkorange2") +
xlab("Type of Weather Event") +
ylab("Total Number of Fatalities") +
ggtitle("Most Fatal Events") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7, angle = 20))
plotFatalities
Most weather-event related fatalities are due to tornadoes, who account for thrice the number of deaths due to excessive heat.
TopTenInjuries$EVTYPE <- with(TopTenInjuries,
reorder(EVTYPE, -SUMINJURIES))
plotInjuries <- ggplot(TopTenInjuries,
aes(EVTYPE, SUMINJURIES, label = SUMINJURIES))
plotInjuries <- plotInjuries +
geom_bar(stat = "identity", color = "black",
fill = "lightblue", width = 0.75) +
geom_text(nudge_y = 2750, color ="darkorange2") +
scale_y_continuous(limits = c(0, 100000)) +
xlab("Type of Weather Event") +
ylab("Total Number of Injuries") +
ggtitle("Events which Caused Most Injuries") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size = 7, angle = 20))
plotInjuries
Tornadoes also “dominate” the appalling injuries statistics, with more than 90,000 events related to them: they represent more than the other nine head-listers combined.
TopTenTotalDMG$EVTYPE <- with(TopTenTotalDMG, reorder(EVTYPE, -TOTALDMG))
DamageDataLong <- TopTenTotalDMG %>%
gather(key = "Type", value = "TOTALDAMAGE", c("SUMPROPDMG", "SUMCROPDMG")) %>%
select(EVTYPE, Type, TOTALDAMAGE)
DamageDataLong$Type[DamageDataLong$Type %in% c("SUMPROPDMG")] <- "Property damage"
DamageDataLong$Type[DamageDataLong$Type %in% c("SUMCROPDMG")] <- "Crop damage"
plotDMG <- ggplot(DamageDataLong,
aes(x = EVTYPE, y = TOTALDAMAGE, fill = Type))
plotDMG <- plotDMG +
geom_bar(stat = "identity", position = "stack",
color = "black", width = 0.75) +
scale_y_continuous(labels = scales::label_comma(scale = 1/1000000000)) +
xlab("Event Type") +
ylab("Total Damage in Billion USD$") +
ggtitle("Events Causing Most Damage") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom",
axis.text.x = element_text(size = 7, angle = 20))
plotDMG
Most of the economics damages is in the form in damage to the property (buildings, factories, vehicles, etc.), with flood representing double the total damage by hurricanes (2nd place).
As was to be expected, drought, river floods and ice storms are pretty damaging for agriculture; the split of total damages between property and crop is about 50-50 for the latter two, while is almost all in crop damage for drought.
Please refer to the StormData NOAA documentation, available at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf↩︎