The NOAA storm data analysis shows, first, that the 10 event types most harmful to population health are, in descending order: Tornado, Excessive heat, TSTM wind, Flood, Lightning, Heat, Flash flood, Ice storm, Thunderstorm wind, and Winter storm.
For the most harmful event type, Tornado, the average impact on population health is relatively lower in May, June, July, August, and September than in other months.
Second, the event types that cause the most economic damage are, in descending order: Flood, Hurricane/Typhoon, Tornado, Storm surge, Hail, Flash flood, Drought, Hurricane, River flood, and Ice storm.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
I downloaded the NOAA storm data from the course website, unzipped it, and saved it in the same directory as this Rmd file.
I use the read.csv() function to load the data, then take a first look at its structure with str():
noaa_data <- read.csv("repdata_data_StormData.csv")
str(noaa_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Both questions concern the impact of different event types, and the str() output shows that the event type variable (EVTYPE) is stored as character. I therefore convert it to a factor, which makes the grouping in later steps easier.
The as.factor() function performs the conversion:
noaa_data$EVTYPE <- as.factor(noaa_data$EVTYPE)
The first question asks which event types are most harmful with respect to population health. Based on the variables available, I measure harm to population health as the sum of fatalities and injuries, so I create a new variable, PH (population health), that adds FATALITIES and INJURIES for each record.
noaa_data$PH <- noaa_data$FATALITIES + noaa_data$INJURIES
Finally, sum the PH values for each event type and sort the totals in descending order to find the most harmful events.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
PH_max <- noaa_data %>%
  group_by(EVTYPE) %>%
  summarise(total_PH = sum(PH, na.rm = TRUE)) %>%
  arrange(desc(total_PH))
# display the top 10 most harmful events
head(PH_max, 10)
## # A tibble: 10 × 2
## EVTYPE total_PH
## <fct> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
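The table lists TSTM WIND and THUNDERSTORM WIND as separate rows because the totals are grouped on the raw EVTYPE labels. The sketch below is not part of the analysis above. It uses a toy data frame with made-up numbers (the names `toy` and `EVTYPE_clean` are illustrative) to show how grouping sums harm per label in base R, and how recoding one spelling before grouping would merge the two rows.

```r
# Toy data: made-up counts, not taken from the NOAA database
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TSTM WIND", "THUNDERSTORM WIND", "TORNADO", "TSTM WIND"),
  FATALITIES = c(1, 0, 2, 3, 0),
  INJURIES   = c(10, 4, 1, 20, 2),
  stringsAsFactors = FALSE
)
toy$PH <- toy$FATALITIES + toy$INJURIES

# Total harm per raw label (base R counterpart of group_by() + summarise())
raw_totals <- aggregate(PH ~ EVTYPE, data = toy, FUN = sum)
raw_totals <- raw_totals[order(-raw_totals$PH), ]
print(raw_totals)

# Optional recoding (an assumption): treat "TSTM WIND" as "THUNDERSTORM WIND"
toy$EVTYPE_clean <- ifelse(toy$EVTYPE == "TSTM WIND", "THUNDERSTORM WIND", toy$EVTYPE)
clean_totals <- aggregate(PH ~ EVTYPE_clean, data = toy, FUN = sum)
clean_totals <- clean_totals[order(-clean_totals$PH), ]
print(clean_totals)
```

Whether such labels should be merged is a judgment call; the ranking reported here keeps the labels exactly as they appear in the database.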
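First, convert the begin date (BGN_DATE) from character to a date-time format; only the begin time is used here:
# POSIXct stores one number per date-time, which suits a data frame column
noaa_data$BGN_DATE <- as.POSIXct(noaa_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")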
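As a quick check of the format string, the sketch below parses a few BGN_DATE values copied from the str() output above and extracts the month. The vector name `sample_dates` is illustrative, and `tz = "UTC"` is an assumption made here so that the result does not depend on the local time zone.

```r
# A few BGN_DATE values copied from the str() output above
sample_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "6/8/1951 0:00:00")

# Parse with the same format string that is applied to the full data set
# (tz = "UTC" is an assumption to keep the result machine-independent)
parsed <- as.POSIXct(sample_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Extract the two-digit month as a character vector
months <- format(parsed, "%m")
print(months)
```

A value that does not match the format string is returned as NA, so counting NA values after the conversion is a simple way to confirm that every date was parsed.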
Next, calculate the monthly average of population health damage for tornado events and plot the monthly variation:
library(ggplot2)
library(dplyr)
# extract the month of each event as a factor
noaa_data <- noaa_data %>%
  mutate(month = as.factor(format(BGN_DATE, "%m")))
# keep tornado events only, then average PH within each month
PH_avg_pmonth <- noaa_data %>%
  filter(EVTYPE == "TORNADO") %>%
  group_by(month) %>%
  summarise(avg_PH = mean(PH, na.rm = TRUE))
ggplot(PH_avg_pmonth, aes(x = month, y = avg_PH)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Tornado monthly avg population health damage",
       x = "Month",
       y = "Avg. fatalities + injuries per event")
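A monthly mean divides by the number of events recorded in that month, so a month with many low-impact events can show a low average even when its total harm is high. The sketch below uses made-up numbers (the data frame `toy` is illustrative, not NOAA data) to report the event count, total, and mean side by side for one event type.

```r
# Toy data: made-up values, not taken from the NOAA database
toy <- data.frame(
  EVTYPE = c(rep("TORNADO", 6), "FLOOD"),
  month  = c("04", "04", "05", "05", "05", "05", "05"),
  PH     = c(10, 0, 2, 4, 3, 3, 7)
)

# Keep only one event type before summarising
tornado <- toy[toy$EVTYPE == "TORNADO", ]

# Count, total, and mean harm per month (tapply() orders groups by month label)
monthly <- data.frame(
  month    = sort(unique(tornado$month)),
  n_events = as.vector(tapply(tornado$PH, tornado$month, length)),
  total_PH = as.vector(tapply(tornado$PH, tornado$month, sum)),
  avg_PH   = as.vector(tapply(tornado$PH, tornado$month, mean))
)
print(monthly)
```

In this toy case month 05 has the larger total but the smaller mean, which is why it can help to read the averages together with the event counts.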
The 10 event types most harmful to population health are, in descending order: Tornado, Excessive heat, TSTM wind, Flood, Lightning, Heat, Flash flood, Ice storm, Thunderstorm wind, and Winter storm.
For the most harmful event type, Tornado, the average impact on population health is relatively lower in May, June, July, August, and September than in other months.
Next, I explore which types of events have the greatest economic consequences. The main variables that indicate economic consequences are property damage (PROPDMG) and crop damage (CROPDMG). Each has an associated variable, PROPDMGEXP or CROPDMGEXP, that gives the magnitude of the damage figure: K stands for thousands of dollars, M for millions of dollars, and B for billions of dollars. However, these variables also contain values other than “K”, “M”, and “B”:
print(unique(noaa_data$PROPDMGEXP))
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
print(unique(noaa_data$CROPDMGEXP))
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
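These include “+”, “-”, “?”, digits, and the empty string. I treat “h” and “H” as hundreds of dollars and accept lower-case “k” and “m” as thousands and millions. I assume that the remaining values mark damage of less than 100 dollars, which is negligible next to the thousands, so the code below returns NA for those records and they add nothing to the totals.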
library(dplyr)
noaa_data <- noaa_data %>%
  mutate(propdm_adjusted = case_when(
    PROPDMGEXP == "h" ~ PROPDMG * 100,
    PROPDMGEXP == "H" ~ PROPDMG * 100,
    PROPDMGEXP == "K" ~ PROPDMG * 1000,
    PROPDMGEXP == "m" ~ PROPDMG * 1000000,
    PROPDMGEXP == "M" ~ PROPDMG * 1000000,
    PROPDMGEXP == "B" ~ PROPDMG * 1000000000,
    TRUE ~ NA_real_ # any code other than H, K, M, or B returns NA
  )) %>%
  mutate(cropdm_adjusted = case_when(
    CROPDMGEXP == "K" ~ CROPDMG * 1000,
    CROPDMGEXP == "k" ~ CROPDMG * 1000,
    CROPDMGEXP == "M" ~ CROPDMG * 1000000,
    CROPDMGEXP == "m" ~ CROPDMG * 1000000,
    CROPDMGEXP == "B" ~ CROPDMG * 1000000000,
    TRUE ~ NA_real_ # any code other than K, M, or B returns NA
  ))
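The same conversion can also be written as a lookup table, which keeps the multipliers in one place. The following is a sketch under the assumptions stated in the text (H = hundreds, K = thousands, M = millions, B = billions, anything else unknown). The helper name `exp_to_multiplier` is hypothetical and not part of the analysis above.

```r
# Multipliers for the recognised magnitude codes
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map magnitude codes to multipliers (case-insensitive),
# returning NA for any code that is not in the table
exp_to_multiplier <- function(exp_code) {
  unname(multipliers[toupper(exp_code)])
}

# Example values: 25 with "K" matches the first PROPDMG row in str() above;
# the other pairs are made up for illustration
dmg  <- c(25, 2.5, 1, 3, 7)
code <- c("K", "m", "B", "?", "")
adjusted <- dmg * exp_to_multiplier(code)
print(adjusted)
```

Unrecognised codes, including the empty string, yield NA here, which mirrors the fallback branch of the case_when() calls.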
Then add property and crop damage together as an indicator of economic impact, counting records with an unrecognized magnitude code as zero:
noaa_data$propdm_adjusted[is.na(noaa_data$propdm_adjusted)] <- 0
noaa_data$cropdm_adjusted[is.na(noaa_data$cropdm_adjusted)] <- 0
noaa_data$eco_dam <- noaa_data$propdm_adjusted + noaa_data$cropdm_adjusted
Then total the economic damage for each event type and display the 10 types with the largest totals:
library(dplyr)
ED_max <- noaa_data %>%
  group_by(EVTYPE) %>%
  summarise(total_ED = sum(eco_dam, na.rm = TRUE)) %>%
  arrange(desc(total_ED))
# display the top 10 most costly events
head(ED_max, 10)
## # A tibble: 10 × 2
## EVTYPE total_ED
## <fct> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352113590
## 4 STORM SURGE 43323541000
## 5 HAIL 18758221670
## 6 FLASH FLOOD 17562128610
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041310
As shown in the table above, the event types that cause the most economic damage are, in descending order: Flood, Hurricane/Typhoon, Tornado, Storm surge, Hail, Flash flood, Drought, Hurricane, River flood, and Ice storm.