1. Synopsis
Our objective in this report is to explore the impact of severe weather events on both public health and the economy in the United States. To achieve this objective, we use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which covers the years 1950 to 2011. More information about the NOAA storm database is available online.
In the rest of this report we try to answer the following two important questions:
Which types of weather events across the United States are most harmful with respect to population health?
Which types of weather events across the United States have the greatest economic consequences?
2. Data Loading and Processing
We first download the raw data from the given URL and then load the dataset for processing and cleaning. read.csv reads the bz2-compressed file directly, so no separate unzip step is needed.
# load the packages used below (dplyr for %>% and the data verbs, knitr for kable)
library(dplyr)
library(knitr)
# download the data
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL, destfile ="repdata%2Fdata%2FStormData.csv.bz2")
# load the data
storm_data <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
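The download step above fetches the file again on every run. A minimal sketch of a guard that skips the download when the file is already on disk follows; the helper name fetch_once is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption for illustration): download a file only
# when it is not already present locally.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed file as binary
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Demonstration with a file that already exists, so no network access happens.
existing <- tempfile(fileext = ".csv.bz2")
writeLines("placeholder", existing)
path <- fetch_once("https://example.invalid/unused", existing)
```

With the report's own URL and destination file, the call would be fetch_once(URL, "repdata%2Fdata%2FStormData.csv.bz2"), followed by the same read.csv call as above.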
As an initial task, we check details of the dataset such as its structure, dimensions, and variable names.
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
names(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
From these details, we see that all variable names are upper case; we convert them to lower case for convenience.
names(storm_data) <- tolower(names(storm_data))
We can now look at the important variables. The first is the storm type (evtype); we look at a random sample of 100 types as follows:
sample(unique(storm_data$evtype), 100)
## [1] "HEAVY RAIN/URBAN FLOOD" "Urban flood"
## [3] "HIGH WINDS AND WIND CHILL" "FROST/FREEZE"
## [5] "HEAVY SURF/HIGH SURF" "URBAN/SML STREAM FLDG"
## [7] "HURRICANE OPAL" "SEVERE THUNDERSTORM WINDS"
## [9] "THUNDERSTORM WIND 98 MPH" "THUNDERSTORM WINDSS"
## [11] "HIGH WINDS 67" "THUNDERSTORM WINDS"
## [13] "EXTREME/RECORD COLD" "THUNDERSTORM WINDS 60"
## [15] "ICE AND SNOW" "Strong Wind"
## [17] "URBAN FLOOD" "RECORD HIGH TEMPERATURES"
## [19] "Marine Accident" "FLOOD/RAIN/WIND"
## [21] "SMALL HAIL" "COASTAL/TIDAL FLOOD"
## [23] "HURRICANE FELIX" "THUNDERSTORM WINDS/HEAVY RAIN"
## [25] "LOW WIND CHILL" "FLASH FLOOD"
## [27] "ASTRONOMICAL LOW TIDE" "ICE PELLETS"
## [29] "WATERSPOUT TORNADO" "THUNDERSTORM WINDS G60"
## [31] "BLIZZARD WEATHER" "SNOW SQUALL"
## [33] "HEAVY SNOW & ICE" "BLIZZARD/HEAVY SNOW"
## [35] "FLOOD & HEAVY RAIN" "FLASH FLOOD/LANDSLIDE"
## [37] "WINTER STORMS" "RECORD HEAT WAVE"
## [39] "ABNORMALLY DRY" "WET SNOW"
## [41] "WIND ADVISORY" "Summary of June 4"
## [43] "WIND AND WAVE" "HURRICANE EMILY"
## [45] "HIGH SWELLS" "WARM DRY CONDITIONS"
## [47] "FLOOD WATCH/" "TORNDAO"
## [49] "MAJOR FLOOD" "EXCESSIVE RAIN"
## [51] "LATE SEASON HAIL" "LIGHTNING AND WINDS"
## [53] "Freezing drizzle" "RECORD RAINFALL"
## [55] "WINTER STORM" "Summary of June 6"
## [57] "Summary of May 31 pm" "COLD AND FROST"
## [59] "URBAN/SMALL STREAM" "UNSEASONABLY WARM/WET"
## [61] "Summary of June 10" "Microburst"
## [63] "HEAT DROUGHT" "Gusty Wind"
## [65] "URBAN/STREET FLOODING" " WATERSPOUT"
## [67] "APACHE COUNTY" "SMALL STREAM URBAN FLOOD"
## [69] "THUNDERESTORM WINDS" "WILD FIRES"
## [71] "FREEZING RAIN/SNOW" "VOLCANIC ASH"
## [73] "Summary August 4" "EXCESSIVELY DRY"
## [75] "LAKE FLOOD" "STRONG WIND"
## [77] "HEAVY SNOW/HIGH WIND" "SMALL STREAM AND URBAN FLOODIN"
## [79] "Metro Storm, May 26" "DROUGHT/EXCESSIVE HEAT"
## [81] "BITTER WIND CHILL" "LOCALLY HEAVY RAIN"
## [83] "BEACH EROSION/COASTAL FLOOD" "HIGH WINDS"
## [85] "THUNDERSTORM WINDS/FLASH FLOOD" "DUST DEVIL WATERSPOUT"
## [87] "HIGH WINDS 66" "Funnel Cloud"
## [89] "WATERSPOUT/TORNADO" "HVY RAIN"
## [91] "RAIN/WIND" "STRONG WIND GUST"
## [93] "FREEZING RAIN AND SNOW" "Freeze"
## [95] "TSTM WIND 52" "River Flooding"
## [97] "RECORD TEMPERATURES" "MICROBURST WINDS"
## [99] "Urban Flooding" "LAKESHORE FLOOD"
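Because sample() draws at random, the listing above changes on every run. A minimal sketch of making such a draw reproducible follows. The seed value is arbitrary and is an assumption for illustration, and the short vector stands in for unique(storm_data$evtype).

```r
# Stand-in for unique(storm_data$evtype); the names are taken from the listing above.
types <- c("FLASH FLOOD", "Urban flood", "TORNDAO", "HIGH WINDS 67",
           "Summary of June 4", " WATERSPOUT")

# Fixing the seed before sampling makes the draw repeatable.
set.seed(2024)  # arbitrary seed, an assumption for illustration
first_draw <- sample(types, 3)

set.seed(2024)
second_draw <- sample(types, 3)

# Both draws contain the same three types in the same order.
identical(first_draw, second_draw)
```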
The event type names have serious problems: some types differ only in letter case, some appear with different numbers at the end, and so on. We can follow simple steps to fix these issues. First, we convert all type names to lower case, then sort the types by frequency and show those that occur more than 5000 times.
storm_data$evtype <- tolower(storm_data$evtype)
storm_data %>% group_by(evtype) %>% count() %>%
arrange(desc(n)) %>% filter(n>5000) %>% kable()
| evtype | n |
|:---|---:|
| hail | 288661 |
| tstm wind | 219942 |
| thunderstorm wind | 82564 |
| tornado | 60652 |
| flash flood | 54277 |
| flood | 25327 |
| thunderstorm winds | 20843 |
| high wind | 20214 |
| lightning | 15754 |
| heavy snow | 15708 |
| heavy rain | 11742 |
| winter storm | 11433 |
| winter weather | 7045 |
| funnel cloud | 6844 |
| marine tstm wind | 6175 |
| marine thunderstorm wind | 5812 |
This table shows that 16 storm types occur more than 5000 times. However, two types appear under several names: (1) thunderstorm wind appears as tstm wind, thunderstorm wind, and thunderstorm winds, and (2) marine thunderstorm wind appears as marine thunderstorm wind and marine tstm wind. We fix this issue in the following code:
storm_data <- storm_data %>%
  mutate(evtype = recode(evtype,
                         "tstm wind" = "thunderstorm wind",
                         "thunderstorm winds" = "thunderstorm wind",
                         "marine tstm wind" = "marine thunderstorm wind",
                         .default = NULL))
storm_data %>% group_by(evtype) %>% count() %>%
  arrange(desc(n)) %>% filter(n > 5000) %>% kable()
| evtype | n |
|:---|---:|
| thunderstorm wind | 323349 |
| hail | 288661 |
| tornado | 60652 |
| flash flood | 54277 |
| flood | 25327 |
| high wind | 20214 |
| lightning | 15754 |
| heavy snow | 15708 |
| marine thunderstorm wind | 11987 |
| heavy rain | 11742 |
| winter storm | 11433 |
| winter weather | 7045 |
| funnel cloud | 6844 |
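The recoding above merges only the most frequent variants. The sample listing also shows leading spaces, trailing magnitudes, and misspelled plurals. A regex-based normalizer could cover more of them, as sketched below. The helper name normalize_evtype and its rules are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper (an assumption, not the original method): collapse common
# spelling variants of event types with regular expressions.
normalize_evtype <- function(x) {
  x <- tolower(trimws(x))                                  # " WATERSPOUT" -> "waterspout"
  x <- gsub("\\s+", " ", x, perl = TRUE)                   # collapse repeated spaces
  x <- sub("\\btstm\\b", "thunderstorm", x, perl = TRUE)   # expand the abbreviation
  # drop trailing magnitudes such as " 60", " g60", or " 98 mph"
  x <- sub("\\s*\\(?g?\\d+\\)?(\\s*mph)?$", "", x, perl = TRUE)
  # "thunderstorm winds" and "thunderstorm windss" -> "thunderstorm wind"
  x <- sub("(thunderstorm wind)s+$", "\\1", x, perl = TRUE)
  x
}

raw <- c(" WATERSPOUT", "TSTM WIND 52", "THUNDERSTORM WINDS G60",
         "THUNDERSTORM WIND 98 MPH", "THUNDERSTORM WINDSS",
         "Marine TSTM Wind", "hail")
cleaned <- normalize_evtype(raw)
```

The rule that strips trailing numbers is deliberately blunt: it would also shorten entries such as "summary of june 4", so the output should be inspected before it replaces evtype.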
Based on our objective and the two research questions we want to answer, we sort our analysis variables into three groups. Information about these variables is available online from the National Climatic Data Center Storm Events FAQ and the National Weather Service storm data documentation. The three groups of variables are as follows:
Variables related to weather events: the storm type (evtype).
Variables related to the impact of weather events on public health: fatalities and injuries.
Variables related to the impact of weather events on the economy: property damage (propdmg and propdmgexp) and crop damage (cropdmg and cropdmgexp). In each of these two pairs, the first variable gives the first three significant digits and the second gives the exponent; for example, a propdmg of 25 with a propdmgexp of K means 25 × 10^3 = 25,000 dollars.
Accordingly, we clean the data by dropping all the other variables.
# select relevant variables
data_cleaned <- storm_data %>%
  select(evtype, fatalities, injuries, propdmg, propdmgexp, cropdmg, cropdmgexp)
str(data_cleaned)
## 'data.frame': 902297 obs. of 7 variables:
## $ evtype : chr "tornado" "tornado" "tornado" "tornado" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ propdmgexp: chr "K" "K" "K" "K" ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: chr "" "" "" "" ...
We now convert the exponent variables (propdmgexp and cropdmgexp) to numeric multipliers and use them to calculate the property and crop costs (propcost and cropcost) as follows:
data_cleaned$propdmgexp <- tolower(data_cleaned$propdmgexp)
data_cleaned$cropdmgexp <- tolower(data_cleaned$cropdmgexp)
data_cleaned <- data_cleaned %>%
  mutate(propdmgexp = recode(propdmgexp,
                             "-" = 10^0,
                             "+" = 10^0,
                             "0" = 10^0,
                             "1" = 10^1,
                             "2" = 10^2,
                             "3" = 10^3,
                             "4" = 10^4,
                             "5" = 10^5,
                             "6" = 10^6,
                             "7" = 10^7,
                             "8" = 10^8,
                             "9" = 10^9,
                             "h" = 10^2,
                             "k" = 10^3,
                             "m" = 10^6,
                             "b" = 10^9,
                             .default = 10^0),
         cropdmgexp = recode(cropdmgexp,
                             "?" = 10^0,
                             "0" = 10^0,
                             "k" = 10^3,
                             "m" = 10^6,
                             "b" = 10^9,
                             .default = 10^0),
         propcost = propdmg*propdmgexp,
         cropcost = cropdmg*cropdmgexp) %>%
  select(evtype, fatalities, injuries, propcost, cropcost)
str(data_cleaned)
## 'data.frame': 902297 obs. of 5 variables:
## $ evtype : chr "tornado" "tornado" "tornado" "tornado" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propcost : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ cropcost : num 0 0 0 0 0 0 0 0 0 0 ...
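The exponent conversion above can also be expressed as a lookup, which keeps the mapping in one place. The sketch below uses base R only; the helper name exp_multiplier is hypothetical. It mirrors the propdmgexp mapping (letters, digits, everything else as 1), so for cropdmgexp it would treat a digit such as "2" as 10^2, whereas the code above maps it to 1.

```r
# Hypothetical helper (an assumption for illustration): translate an exponent
# code into a numeric multiplier.
exp_multiplier <- function(code) {
  lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  code <- tolower(code)
  mult <- unname(lookup[code])                 # NA for codes not in the lookup
  digit <- grepl("^[0-9]$", code)              # single digits are powers of ten
  mult[digit] <- 10^as.numeric(code[digit])
  mult[is.na(mult)] <- 1                       # "", "-", "+", "?" and the like
  mult
}

# First row of the dataset: propdmg = 25 with propdmgexp = "K".
first_cost <- 25 * exp_multiplier("K")
multipliers <- exp_multiplier(c("K", "m", "B", "h", "5", "0", "", "+", "?"))
```

Here first_cost reproduces the 25000 shown for the first row of propcost in the output above.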
3. Main Results
We compute the total fatalities, injuries, property damage, and crop damage per severe weather event type, and present the top 10 event types for each measure in tables.
Totals <- data_cleaned %>% group_by(evtype) %>%
  summarize(total_fatalities = sum(fatalities, na.rm = TRUE),
            total_injuries = sum(injuries, na.rm = TRUE),
            total_propdmg = sum(propcost, na.rm = TRUE),
            total_cropdmg = sum(cropcost, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` ungrouping output (override with `.groups` argument)
total_fatalities <- Totals %>%
select(evtype, total = total_fatalities) %>%
arrange(desc(total)) %>% head(10)
total_fatalities %>% kable()
| evtype | total |
|:---|---:|
| tornado | 5633 |
| excessive heat | 1903 |
| flash flood | 978 |
| heat | 937 |
| lightning | 816 |
| thunderstorm wind | 701 |
| flood | 470 |
| rip current | 368 |
| high wind | 248 |
| avalanche | 224 |
total_injuries <- Totals %>%
select(evtype, total = total_injuries) %>%
arrange(desc(total)) %>% head(10)
total_injuries %>% kable()
| evtype | total |
|:---|---:|
| tornado | 91346 |
| thunderstorm wind | 9353 |
| flood | 6789 |
| excessive heat | 6525 |
| lightning | 5230 |
| heat | 2100 |
| ice storm | 1975 |
| flash flood | 1777 |
| hail | 1361 |
| winter storm | 1321 |
total_propdmg <- Totals %>%
select(evtype, total = total_propdmg) %>%
arrange(desc(total)) %>% head(10)
total_propdmg %>% kable()
| evtype | total |
|:---|---:|
| flood | 144657709807 |
| hurricane/typhoon | 69305840000 |
| tornado | 56947380677 |
| storm surge | 43323536000 |
| flash flood | 16822673979 |
| hail | 15735267513 |
| hurricane | 11868319010 |
| thunderstorm wind | 9912671826 |
| tropical storm | 7703890550 |
| winter storm | 6688497251 |
total_cropdmg <- Totals %>%
select(evtype, total = total_cropdmg) %>%
arrange(desc(total)) %>% head(10)
total_cropdmg %>% kable()
| evtype | total |
|:---|---:|
| drought | 13972566000 |
| flood | 5661968450 |
| river flood | 5029459000 |
| ice storm | 5022113500 |
| hail | 3025954473 |
| hurricane | 2741910000 |
| hurricane/typhoon | 2607872800 |
| flash flood | 1421317100 |
| extreme cold | 1312973000 |
| thunderstorm wind | 1159505188 |
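The four top-10 tables above repeat the same select, arrange, and head pattern. A small helper could factor that out, as in the following base R sketch. The helper name top_events is hypothetical, and the toy data frame reuses four fatality totals from the first table.

```r
# Hypothetical helper (an assumption for illustration): return the n event
# types with the largest value in the given column.
top_events <- function(totals, column, n = 10) {
  ord <- order(totals[[column]], decreasing = TRUE)
  out <- totals[head(ord, n), c("evtype", column)]
  names(out) <- c("evtype", "total")
  rownames(out) <- NULL
  out
}

# Toy input with values copied from the fatalities table above.
toy_totals <- data.frame(
  evtype = c("flash flood", "tornado", "heat", "excessive heat"),
  total_fatalities = c(978, 5633, 937, 1903),
  stringsAsFactors = FALSE
)
top3 <- top_events(toy_totals, "total_fatalities", n = 3)
```

Applied to the report's data, top_events(Totals, "total_injuries") would play the role of the total_injuries block, and likewise for the other three columns.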
Now we plot the total fatalities and injuries side by side to compare them and evaluate the impact on public health.
par(mfrow = c(1, 2), mar = c(10, 5, 3, 2))
barplot(total_fatalities$total/10^3, las = 3, names.arg = total_fatalities$evtype, col = "black", main = "Top 10 causes for Fatalities", cex.names = 0.8, ylab = "Total Fatalities (in thousands)")
barplot(total_injuries$total/10^3, las = 3, names.arg = total_injuries$evtype, col = "red", main = "Top 10 causes for Injuries", cex.names = 0.8, ylab = "Total Injuries (in thousands)")

Similarly, we plot the total property and crop damage side by side to compare them and evaluate the impact on the economy.
par(mfrow = c(1, 2), mar = c(10, 5, 3, 1))
barplot(total_propdmg$total/10^9, las = 3, names.arg = total_propdmg$evtype, col = "gray", main = "Top 10 causes of property damage", cex.names = 0.8, ylab = "Property damage (in billion dollars)")
barplot(total_cropdmg$total/10^9, las = 3, names.arg = total_cropdmg$evtype, col = "green", main = "Top 10 causes of crop damage", cex.names = 0.8, ylab = "Crop damage (in billion dollars)")

4. Conclusion
Our analysis indicates that tornadoes, thunderstorm winds, and excessive heat are the most damaging events for public health, as reflected by the number of fatalities and injuries; tornadoes alone account for 5,633 fatalities and 91,346 injuries. On the other hand, floods, hurricanes/typhoons, and droughts have the greatest economic consequences, as reflected by the total property and crop damage: floods caused about 144.7 billion dollars of property damage, and droughts about 14.0 billion dollars of crop damage.