This data analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data records used are from 1989 to 2011. The weather events will be analyzed by population health related damages (i.e. fatailities and injuries) and by economic damage costs (i.e. property and crop). The results will illustrate the 10 most consequential weather events for the population health and economic categories.
The conclusions state that tornado category contributes to the maximum number of injuries and combined total fatalities and injuries. For the property and crop damage costs, the flood category contributes to the maximum number of property damage costs and combined total property and crop damage costs.
library(tidyverse)
theme_set(theme_light())
Download the dataset to ./data directory and read the CSV to the stormData dataframe
if(!file.exists("./data")){
dir.create("./data")}
if(!file.exists("./data/stormData.csv.bz2")){
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, destfile="./data/stormData.csv.bz2")
}
stormData <- read.csv("./data/stormData.csv.bz2", stringsAsFactors = FALSE)
Examine stormData variables types and values
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Subset stormData for the variables necessary for the data analysis
stormSel <- stormData %>%
select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
Examine stormData variables for missing values
sum(is.na(stormSel))
## [1] 0
No missing values!
In the assignments instructions, it states that “in the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.” Considering this statement, it is necessary to determine the start year and the criteria for selection. The following steps will determine the range where 85% or less of the total events are gathered.
Format BGN_DATE column as.Date, extract year from BGN_DATE and create new variable Year
stormSel$BGN_DATE <- as.Date(stormSel$BGN_DATE, format = "%m/%d/%Y")
stormSel <- stormSel %>%
mutate(Year = as.numeric(format(BGN_DATE, '%Y')))
Determine the starting year where the cumulative number of events is 85% or less
yearCnt <- stormSel %>%
group_by(Year) %>%
tally() %>%
arrange(desc(n))
colnames(yearCnt) <- c("Year", "EventFreq")
yearCnt <- yearCnt %>%
mutate(
cumsum = cumsum(EventFreq),
freq = round(EventFreq / sum(EventFreq), 3),
cum_freq = cumsum(freq)
)
tail(yearCnt[which(yearCnt$cum_freq <= 0.85), ])
## # A tibble: 6 x 5
## Year EventFreq cumsum freq cum_freq
## <dbl> <int> <int> <dbl> <dbl>
## 1 1994 20631 702131 0.023 0.779
## 2 1992 13534 715665 0.015 0.794
## 3 1993 12607 728272 0.014 0.808
## 4 1991 12522 740794 0.014 0.822
## 5 1990 10946 751740 0.012 0.834
## 6 1989 10410 762150 0.012 0.846
Starting year is 1989.
Subset stormSel starting at 1989 and FATALITIES (OR) INJURIES (OR) PROPDMG (OR) CROPDMG is greater than 0
stormSel2 <- stormSel %>%
filter(Year >= 1989 & (
FATALITIES > 0 |
INJURIES > 0 |
PROPDMG > 0 |
CROPDMG > 0))
Subset for EVTYPE order by frequencies (n)
eventCnt <- stormSel2 %>%
group_by(EVTYPE) %>%
tally() %>%
arrange(desc(n))
Print the top 40 events with most frequencies
eventCnt %>%
print(n = 40)
## # A tibble: 488 x 2
## EVTYPE n
## <chr> <int>
## 1 TSTM WIND 62607
## 2 THUNDERSTORM WIND 43655
## 3 HAIL 26092
## 4 FLASH FLOOD 20967
## 5 TORNADO 16879
## 6 LIGHTNING 13293
## 7 THUNDERSTORM WINDS 12086
## 8 FLOOD 10175
## 9 HIGH WIND 5522
## 10 STRONG WIND 3370
## 11 WINTER STORM 1508
## 12 HEAVY SNOW 1342
## 13 HEAVY RAIN 1105
## 14 WILDFIRE 857
## 15 ICE STORM 708
## 16 URBAN/SML STREAM FLD 702
## 17 EXCESSIVE HEAT 698
## 18 HIGH WINDS 657
## 19 TSTM WIND/HAIL 441
## 20 TROPICAL STORM 416
## 21 WINTER WEATHER 407
## 22 RIP CURRENT 400
## 23 WILD/FOREST FIRE 388
## 24 FLASH FLOODING 302
## 25 FLOOD/FLASH FLOOD 279
## 26 AVALANCHE 268
## 27 DROUGHT 266
## 28 BLIZZARD 253
## 29 RIP CURRENTS 241
## 30 HEAT 215
## 31 EXTREME COLD 197
## 32 LAKE-EFFECT SNOW 194
## 33 LANDSLIDE 193
## 34 STORM SURGE 177
## 35 COASTAL FLOOD 164
## 36 URBAN FLOOD 139
## 37 WINTER WEATHER/MIX 139
## 38 HIGH SURF 130
## 39 HURRICANE 129
## 40 LIGHT SNOW 119
## # … with 448 more rows
From the table, the count for unique EVTYPE variable is 488 (too many).
Using the National Weather Instruction 10-1065 (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) Section 7 - Event Types (pages 2-4) as a guide to combine most frequent event types (ex. TSTM and THUNDER* to THUNDERSTORM, and so on).
stormSel2$EVTYPE <- toupper(stormSel2$EVTYPE)
stormSel2$EVTYPE[grep("TSTM|THUNDER", stormSel2$EVTYPE)] <- "THUNDERSTORM"
stormSel2$EVTYPE[grep("HAIL", stormSel2$EVTYPE)] <- "HAIL"
stormSel2$EVTYPE[grep("FLOOD|FLD", stormSel2$EVTYPE)] <- "FLOOD"
stormSel2$EVTYPE[grep("TORNA|FUNNEL|SPOUT", stormSel2$EVTYPE)] <- "TORNADO"
stormSel2$EVTYPE[grep("LIGHTN|LIGHTI|LIGNT", stormSel2$EVTYPE)] <- "LIGHTNING"
stormSel2$EVTYPE[grep("COLD", stormSel2$EVTYPE)] <- "COLD"
stormSel2$EVTYPE[grep("WIND|GUST", stormSel2$EVTYPE)] <- "WIND"
stormSel2$EVTYPE[grep("WINTER|BLIZZARD", stormSel2$EVTYPE)] <- "WINTERSTORM"
stormSel2$EVTYPE[grep("SNOW", stormSel2$EVTYPE)] <- "SNOW"
stormSel2$EVTYPE[grep("RAIN", stormSel2$EVTYPE)] <- "RAIN"
stormSel2$EVTYPE[grep("FIRE", stormSel2$EVTYPE)] <- "WILDFIRE"
stormSel2$EVTYPE[grep("ICE|ICY", stormSel2$EVTYPE)] <- "ICE"
stormSel2$EVTYPE[grep("HEAT|DROUGHT", stormSel2$EVTYPE)] <- "HEAT"
stormSel2$EVTYPE[grep("TROPICAL", stormSel2$EVTYPE)] <- "TROPICAL STORM"
stormSel2$EVTYPE[grep("RIP", stormSel2$EVTYPE)] <- "RIP CURRENT"
stormSel2$EVTYPE[grep("AVALANC", stormSel2$EVTYPE)] <- "AVALANCHE"
stormSel2$EVTYPE[grep("SLIDE", stormSel2$EVTYPE)] <- "LANDSLIDE"
stormSel2$EVTYPE[grep("SURGE", stormSel2$EVTYPE)] <- "STORM SURGE"
stormSel2$EVTYPE[grep("SURF", stormSel2$EVTYPE)] <- "SURF"
stormSel2$EVTYPE[grep("HURRICAN|TYPHOO", stormSel2$EVTYPE)] <- "HURRICANE"
stormSel2$EVTYPE[grep("FREEZ|FROST", stormSel2$EVTYPE)] <- "FREEZE"
stormSel2$EVTYPE[grep("FOG", stormSel2$EVTYPE)] <- "FOG"
stormSel2$EVTYPE[grep("DUST", stormSel2$EVTYPE)] <- "DUST STORM"
stormSel2$EVTYPE[grep("MICRO | MIRCO", stormSel2$EVTYPE)] <- "DRY MICROBURST"
stormSel2$EVTYPE[grep("GLAZE", stormSel2$EVTYPE)] <- "GLAZE"
stormSel2$EVTYPE[grep("TSUNAMI", stormSel2$EVTYPE)] <- "TSUNAMI"
Subset again for EVTYPE with greater frequencies and print the top 40 most frequent events
eventCnt2 <- stormSel2 %>%
group_by(EVTYPE) %>%
tally() %>%
arrange(desc(n))
eventCnt2 %>%
print(n = 40)
## # A tibble: 78 x 2
## EVTYPE n
## <chr> <int>
## 1 THUNDERSTORM 119192
## 2 FLOOD 33183
## 3 HAIL 26128
## 4 TORNADO 16974
## 5 LIGHTNING 13302
## 6 WIND 9897
## 7 WINTERSTORM 2316
## 8 SNOW 1881
## 9 WILDFIRE 1258
## 10 HEAT 1246
## 11 RAIN 1200
## 12 ICE 764
## 13 RIP CURRENT 643
## 14 COLD 468
## 15 TROPICAL STORM 456
## 16 AVALANCHE 269
## 17 HURRICANE 232
## 18 STORM SURGE 225
## 19 SURF 222
## 20 LANDSLIDE 209
## 21 DUST STORM 199
## 22 FOG 181
## 23 FREEZE 170
## 24 DRY MICROBURST 78
## 25 OTHER 34
## 26 GLAZE 21
## 27 MIXED PRECIPITATION 18
## 28 TSUNAMI 14
## 29 SEICHE 9
## 30 ASTRONOMICAL HIGH TIDE 8
## 31 HEAVY MIX 8
## 32 HIGH SEAS 8
## 33 LOW TEMPERATURE 7
## 34 UNSEASONABLY WARM 7
## 35 HYPOTHERMIA/EXPOSURE 6
## 36 MIXED PRECIP 6
## 37 COASTAL STORM 4
## 38 HIGH WATER 4
## 39 MICROBURST 4
## 40 WINTRY MIX 4
## # … with 38 more rows
Using the National Weather Serivice Instruction 10-1065 page 12, the estimated damage costs for property category is calculated by PROPDMG * PROPDMGEXP (exponent). The same logic will apply to CROPDMG. Since the PROPDMGEXP and CROPDMGEXP are categorical variables, they need a conversion to the corresponding numeric exponent.
Verify the unique values for PROPDMGEXP
unique(stormSel2$PROPDMGEXP)
## [1] "M" "K" "" "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
Change PROPDMG key to corresponding numeric values
stormSel2 <- stormSel2 %>%
mutate(PROPDMGEXP1 = case_when(
PROPDMGEXP == "-" ~ 1, # 10^0 = 1
PROPDMGEXP == "+" ~ 1,
PROPDMGEXP == "0" ~ 1,
PROPDMGEXP == "2" ~ 10^2,
PROPDMGEXP == "3" ~ 10^3,
PROPDMGEXP == "4" ~ 10^4,
PROPDMGEXP == "5" ~ 10^5,
PROPDMGEXP == "6" ~ 10^6,
PROPDMGEXP == "7" ~ 10^7,
PROPDMGEXP == "H" ~ 10^2,
PROPDMGEXP == "h" ~ 10^2,
PROPDMGEXP == "K" ~ 10^3,
PROPDMGEXP == "M" ~ 10^6,
PROPDMGEXP == "m" ~ 10^6,
PROPDMGEXP == "B" ~ 10^9,
PROPDMGEXP == "" ~ 1))
Verify the unique values for CROPDMGEXP
unique(stormSel2$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
Change CROPDMG key to corresponding numeric values
stormSel2 <- stormSel2 %>%
mutate(CROPDMGEXP1 = case_when(
CROPDMGEXP == "" ~ 1, # 10^0 = 1
CROPDMGEXP == "?" ~ 1,
CROPDMGEXP == "0" ~ 1,
CROPDMGEXP == "K" ~ 10^3,
CROPDMGEXP == "k" ~ 10^3,
CROPDMGEXP == "M" ~ 10^6,
CROPDMGEXP == "m" ~ 10^6,
CROPDMGEXP == "B" ~ 10^9))
Calculate fatalities and injuries by event type
fatalities_injuries <- stormSel2 %>%
group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES),
Totals = Fatalities + Injuries) %>%
arrange(desc(Totals))
Print the top 10 events order by fatalities and injuries (combined)
fatalities_injuries_10 <- fatalities_injuries[1:10, ]
fatalities_injuries_10
## # A tibble: 10 x 4
## EVTYPE Fatalities Injuries Totals
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 1808 28037 29845
## 2 HEAT 3138 9228 12366
## 3 FLOOD 1553 8683 10236
## 4 THUNDERSTORM 608 7992 8600
## 5 LIGHTNING 817 5231 6048
## 6 WINTERSTORM 378 2682 3060
## 7 WIND 476 1954 2430
## 8 ICE 102 2183 2285
## 9 WILDFIRE 90 1608 1698
## 10 HURRICANE 133 1333 1466
Plot barchart for total fatalities, injuries and combined totals by event type.
First, convert dataframe to long format
storm_long <- gather(fatalities_injuries_10, Fatalities_Injuries, Totals, -EVTYPE, convert = FALSE)
fi_chart <- storm_long %>%
ggplot(aes(x = reorder(EVTYPE, Totals), y = Totals)) +
geom_bar(stat = "identity", aes(fill = Fatalities_Injuries), position = "dodge") +
coord_flip() +
labs(title = "United States Total Fatalities and Injuries by Event (1989-2011)",
x = "Event Type", y = "Frequency") +
theme(legend.title = element_blank())
fi_chart
The plot illustrates the tornado category contributes to the maximum number of injuries and combined total fatalities and injuries, followed by heat and flood weather events. However, the heat associated events contribute to the maximum number of fatalities, followed by tornado and flood events.
Calculate property damage costs (PropertyCost) and crop damage costs (CropCost) by event type (in million $)
stormSel2 <- stormSel2 %>%
mutate(PropertyCost = (PROPDMG * PROPDMGEXP1) / 10^6) %>%
mutate(CropCost = (CROPDMG * CROPDMGEXP1) / 10^6)
property_crop <- stormSel2 %>%
group_by(EVTYPE) %>%
summarise(PropertyTotalCost = sum(PropertyCost),
CropTotalCost = sum(CropCost),
Totals = PropertyTotalCost + CropTotalCost) %>%
arrange(desc(Totals))
Print the top 10 events order by property and crop damage costs (combined)
property_crop_10 <- property_crop[1:10, ]
property_crop_10
## # A tibble: 10 x 4
## EVTYPE PropertyTotalCost CropTotalCost Totals
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 168270. 12389. 180659.
## 2 HURRICANE 85256. 5506. 90763.
## 3 STORM SURGE 47965. 0.855 47966.
## 4 TORNADO 32294. 415. 32709.
## 5 HAIL 15978. 3047. 19024.
## 6 HEAT 1066. 14877. 15943.
## 7 THUNDERSTORM 12785. 1274. 14060.
## 8 ICE 3959. 5022. 8981.
## 9 WILDFIRE 8497. 403. 8900.
## 10 TROPICAL STORM 7716. 695. 8411.
Plot barchart for total property and crop damage costs by event type
First, convert dataframe to long format
prop_crop_long <- gather(property_crop_10, Property_Crop, Totals, -EVTYPE, convert = FALSE)
pc_chart <- prop_crop_long %>%
ggplot(aes(x = reorder(EVTYPE, Totals), y = Totals)) +
geom_bar(stat = "identity", aes(fill = Property_Crop), position = "dodge") +
coord_flip() +
labs(title = "United States Total Property and Crop Damage Costs by Event (1989-2011)",
x = "Event Type", y = "Damage Costs (millions of dollars)") +
theme(legend.title = element_blank())
pc_chart
The plot illustrates the flood category contributes to the maximum number of property damage costs and combined total property and crop damage costs, followed by hurricane and storm surge weather events. However, the heat associated events contribute to the maximum number of crop damage costs, followed by flood, hurricane, and ice/freeze categories.