Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This data analysis addresses the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

First, the file needs to be downloaded:

src <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

dest <- file.path("data",basename(src))
if (!dir.exists("data"))
        dir.create("data") #download.file fails if the target folder is missing
if (!file.exists(dest))
        download.file(src,dest,method="curl",quiet=TRUE)

After downloading, the file is read in. I look at the column names to get an idea of which variables I will need for further analysis.

my_raw_data <- read_csv(dest)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   STATE__ = col_double(),
##   COUNTY = col_double(),
##   BGN_RANGE = col_double(),
##   COUNTY_END = col_double(),
##   END_RANGE = col_double(),
##   LENGTH = col_double(),
##   WIDTH = col_double(),
##   F = col_integer(),
##   MAG = col_double(),
##   FATALITIES = col_double(),
##   INJURIES = col_double(),
##   PROPDMG = col_double(),
##   CROPDMG = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   LATITUDE_E = col_double(),
##   LONGITUDE_ = col_double(),
##   REFNUM = col_double()
## )
## See spec(...) for full column specifications.
names(my_raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Cleaning and extracting relevant data

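For this analysis, the relevant information about storms and severe weather events is contained in the following columns: EVTYPE (event type), INJURIES, FATALITIES, PROPDMG and PROPDMGEXP (property damage estimate and its magnitude), and CROPDMG and CROPDMGEXP (crop damage estimate and its magnitude).

Therefore only these columns are kept in the data subset used for further analysis: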

data <- select(my_raw_data,EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
names(data) <- tolower(names(data)) #changing column names to lowercase for easier handling
head(data)
## # A tibble: 6 × 7
##    evtype injuries fatalities propdmg propdmgexp cropdmg cropdmgexp
##     <chr>    <dbl>      <dbl>   <dbl>      <chr>   <dbl>      <chr>
## 1 TORNADO       15          0    25.0          K       0       <NA>
## 2 TORNADO        0          0     2.5          K       0       <NA>
## 3 TORNADO        2          0    25.0          K       0       <NA>
## 4 TORNADO        2          0     2.5          K       0       <NA>
## 5 TORNADO        2          0     2.5          K       0       <NA>
## 6 TORNADO        6          0     2.5          K       0       <NA>

The summary of data (available here, page 12) indicates that:

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry

The estimates columns (“propdmgexp”, “cropdmgexp”) contain other characters that need to be cleaned:

unique(data$propdmgexp)
##  [1] "K" "M" NA  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data$cropdmgexp)
## [1] NA  "M" "K" "m" "B" "?" "0" "k" "2"

Characters like “+”, “?”, and “-” will be treated as an exponent of 0, i.e. a multiplier of 1. Since none of the “propdmg” and “cropdmg” values is NA, the NAs in “propdmgexp” and “cropdmgexp” will also be treated as 0 (no multiplier). Digits are used directly as powers of ten.

The “propdmg” and “cropdmg” data columns will be updated with the magnitude information from “propdmgexp” and “cropdmgexp”:

data$propdmgexp[(data$propdmgexp == "") | (data$propdmgexp == "+") | (data$propdmgexp == "?") | (data$propdmgexp == "-") | is.na(data$propdmgexp)] <- 0
data$propdmgexp[(data$propdmgexp == "h") | (data$propdmgexp == "H")] <- 2
data$propdmgexp[(data$propdmgexp == "k") | (data$propdmgexp == "K")] <- 3
data$propdmgexp[(data$propdmgexp == "m") | (data$propdmgexp == "M")] <- 6
data$propdmgexp[(data$propdmgexp == "b") | (data$propdmgexp == "B")] <- 9

data$cropdmgexp[(data$cropdmgexp == "") | (data$cropdmgexp == "+") | (data$cropdmgexp == "?") | (data$cropdmgexp == "-") | is.na(data$cropdmgexp)] <- 0
data$cropdmgexp[(data$cropdmgexp == "h") | (data$cropdmgexp == "H")] <- 2
data$cropdmgexp[(data$cropdmgexp == "k") | (data$cropdmgexp == "K")] <- 3
data$cropdmgexp[(data$cropdmgexp == "m") | (data$cropdmgexp == "M")] <- 6
data$cropdmgexp[(data$cropdmgexp == "b") | (data$cropdmgexp == "B")] <- 9

data$propdmg <- data$propdmg * 10^(as.numeric(data$propdmgexp))
data$cropdmg <- data$cropdmg * 10^(as.numeric(data$cropdmgexp))
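
The same mapping can also be written as a single lookup, which may be easier to check than ten separate assignments. The sketch below is only an illustration: the function name exp_to_multiplier is hypothetical and not part of the original analysis. It assumes the rules stated above (H, K, M, B in either case; “+”, “?”, “-”, empty strings and NA as exponent 0; single digits as powers of ten).

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# turn an exponent code into a power-of-ten multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))
  # Letter codes, plus the symbols that are treated as "no multiplier"
  lookup <- c(H = 2, K = 3, M = 6, B = 9, "+" = 0, "?" = 0, "-" = 0)
  power <- unname(lookup[code])            # NA for anything not in the table
  is_digit <- !is.na(code) & grepl("^[0-9]$", code)
  power[is_digit] <- as.numeric(code[is_digit])  # digits are used as exponents
  power[is.na(power)] <- 0                 # NA, "" and unknown codes: exponent 0
  10^power
}

# The documentation's example: 1.55B stands for $1,550,000,000
billions <- 1.55 * exp_to_multiplier("B")

# A few codes that occur in the data
mult <- exp_to_multiplier(c("K", "m", "B", NA, "?", "5"))
```

With such a helper, the update step would reduce to something like data$propdmg * exp_to_multiplier(data$propdmgexp), applied to the original, unmodified exponent column.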

Event type names also need cleanup. The names repeat in various forms, like “FLOOD”, “FLOOOD”, “FLOODS”, “flood”. The code below cleans up the event type naming.

head(unique(data$evtype),25)
##  [1] "TORNADO"                      "TSTM WIND"                   
##  [3] "HAIL"                         "FREEZING RAIN"               
##  [5] "SNOW"                         "ICE STORM/FLASH FLOOD"       
##  [7] "SNOW/ICE"                     "WINTER STORM"                
##  [9] "HURRICANE OPAL/HIGH WINDS"    "THUNDERSTORM WINDS"          
## [11] "RECORD COLD"                  "HURRICANE ERIN"              
## [13] "HURRICANE OPAL"               "HEAVY RAIN"                  
## [15] "LIGHTNING"                    "THUNDERSTORM WIND"           
## [17] "DENSE FOG"                    "RIP CURRENT"                 
## [19] "THUNDERSTORM WINS"            "FLASH FLOOD"                 
## [21] "FLASH FLOODING"               "HIGH WINDS"                  
## [23] "FUNNEL CLOUD"                 "TORNADO F0"                  
## [25] "THUNDERSTORM WINDS LIGHTNING"
#There are 977 various entries for evtype
length(unique(data$evtype))
## [1] 977
data$evtype <- toupper(data$evtype) #removes differences between upper and lower case
# evtype grouping
data$evtype <- gsub('.*FLOOD.*', 'FLOOD', data$evtype)
data$evtype <- gsub('.*FLOOOD.*', 'FLOOD', data$evtype)
data$evtype <- gsub('.*HAIL.*', 'HAIL', data$evtype)
data$evtype <- gsub('.*THUNDERSTORM.* W.*', 'THUNDERSTORM WIND', data$evtype)
data$evtype <- gsub('.*THUNDERSTORM.*', 'THUNDERSTORM', data$evtype)
data$evtype <- gsub('.*WIND.*', 'STRONG WINDS', data$evtype)
data$evtype <- gsub('.*SNOW.*', 'SNOW', data$evtype)
data$evtype <- gsub('.*TSTM.*', 'THUNDERSTORM WIND', data$evtype)
data$evtype <- gsub('.*HIGH TEMPERATURE.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*UNUSUAL\\/RECORD WARMTH.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*RECORD WARM*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*HOT WEATHER.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*RECORD TEMPERATURE.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*TORNADO.*', 'TORNADO OR TYPHOON', data$evtype)
data$evtype <- gsub('.*TYPHOON.*', 'TORNADO OR TYPHOON', data$evtype)
data$evtype <- gsub('.*TROPICAL STORM.*', 'TROPICAL STORM', data$evtype)
data$evtype <- gsub('.*HEAT.*', 'HEAT', data$evtype)
data$evtype <- gsub('.*BLIZZARD.*', 'BLIZZARD', data$evtype)
data$evtype <- gsub('.*LIGHTNING.*', 'LIGHTNING', data$evtype)
data$evtype <- gsub('.*COLD.*', 'COLD', data$evtype)
data$evtype <- gsub('.*HEAVY RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HEAVY SHOWER.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*EXCESSIVE RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HVY RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*RAIN (HEAVY).*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*EXCESSIVE PRECIPITATION.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HEAVY PRECIPITATION.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HURRICANE.*', 'HURRICANE', data$evtype)
data$evtype <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', data$evtype)
data$evtype <- gsub('.*FREEZING FOG.*', 'FREEZING FOG', data$evtype)
data$evtype <- gsub('VOLCANIC ASHFALL', 'VOLCANIC ERUPTION', data$evtype)
data$evtype <- gsub('.*FREEZING RAIN SLEET.*', 'FREEZING RAIN SLEET', data$evtype)
data$evtype <- gsub('.*SLEET \\& FREEZING RAIN.*', 'FREEZING RAIN SLEET', data$evtype)
data$evtype <- gsub('.*WILD\\/FOREST FIRE.*', 'WILDFIRE', data$evtype)
data$evtype <- gsub('\\?', 'OTHER', data$evtype)
data$evtype <- gsub('.*OTHER.*', 'OTHER', data$evtype)

#After grouping, there are 335 unique entries in evtype:
length(unique(data$evtype))
## [1] 335
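
Because the gsub calls above run one after another, a later pattern can rewrite a label produced by an earlier one, so the final grouping depends on the order of the lines. One alternative is a rule table in which the first matching pattern wins. The sketch below is a different technique from the code above, not a copy of it. The function group_evtype and the three rules are hypothetical, and applying it to the full data would give counts different from those reported in this document.

```r
# Hypothetical helper (an assumption for illustration): assign each event
# type to the label of the FIRST rule whose pattern matches it.
group_evtype <- function(evtype, rules) {
  upper <- toupper(evtype)
  out <- upper                       # unmatched names are kept, in upper case
  done <- rep(FALSE, length(out))
  for (label in names(rules)) {
    hit <- !done & grepl(rules[[label]], upper)
    out[hit] <- label
    done <- done | hit               # later rules cannot touch these rows
  }
  out
}

# Illustrative rules only; specific patterns are listed before general ones.
rules <- c(
  "THUNDERSTORM WIND" = "THUNDERSTORM.*W|TSTM",
  "FLOOD"             = "FLOO+D",
  "STRONG WINDS"      = "WIND"
)

grouped <- group_evtype(c("TSTM WIND", "Thunderstorm Winds", "FLASH FLOODING",
                          "FLOOOD", "HIGH WINDS", "DROUGHT"), rules)
```

In this sketch “TSTM WIND” stays in the thunderstorm wind group even though it also matches the general wind pattern, because the more specific rule is checked first.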

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Harm to population health is measured here by the number of injuries and fatalities (deaths) that each event type causes.

data_health <- data %>% group_by(evtype) %>% 
        summarize(deaths=sum(fatalities), injury=sum(injuries)) %>% arrange(desc(deaths))
head(data_health)
## # A tibble: 6 × 3
##               evtype deaths injury
##                <chr>  <dbl>  <dbl>
## 1 TORNADO OR TYPHOON   5700  92687
## 2               HEAT   3138   9224
## 3              FLOOD   1525   8604
## 4       STRONG WINDS   1212   8964
## 5          LIGHTNING    817   5231
## 6        RIP CURRENT    368    232

Tornadoes and typhoons are the most harmful to population health, with 5700 deaths and 92687 injuries, followed by heat (3138 and 9224, respectively). Floods rank third for deaths (1525) and strong winds rank third for injuries (8964). The chart below shows the five event types with the highest fatality counts, together with their injury counts.

par(mfrow=c(1,1),mar=c(7,2,2,1))
#top five event types by fatalities (data_health is sorted by deaths)
thedeaths <- data_health$deaths[1:5]
theinjuries <- data_health$injury[1:5]
cause <- rbind(theinjuries, thedeaths) #2 x 5 matrix: one column per event type

end_point <- 9.5 + ncol(cause)
barplot(cause,beside=TRUE,col=c("steelblue","violet"),legend = c("Injuries", "Fatalities"), main="Number of injuries and fatalities caused by natural disasters")
#event type labels are drawn rotated below each pair of bars
text(seq(2,end_point,by=3), par("usr")[3]-0.25, 
     srt = 45, adj= 1, xpd = TRUE,
     labels = data_health$evtype[1:5], cex=0.8)
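
Since the table is sorted by deaths, the ranking by injuries is easy to misread. As a small check, the sketch below retypes the six rows printed by head(data_health) and reorders them by injuries. It covers only those six rows, so event types outside that table could still rank higher on injuries than some of the ones listed. The names top_health and by_injury are introduced here for illustration.

```r
# The six rows printed by head(data_health), retyped by hand.
top_health <- data.frame(
  evtype = c("TORNADO OR TYPHOON", "HEAT", "FLOOD",
             "STRONG WINDS", "LIGHTNING", "RIP CURRENT"),
  deaths = c(5700, 3138, 1525, 1212, 817, 368),
  injury = c(92687, 9224, 8604, 8964, 5231, 232),
  stringsAsFactors = FALSE
)

# Re-rank the same rows by injuries instead of deaths.
by_injury <- top_health[order(-top_health$injury), ]
by_injury$evtype
```

Among these rows, floods and strong winds swap places between the two rankings, which is why the text names them separately.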

Across the United States, which types of events have the greatest economic consequences?

Severe weather events damage personal property as well as crops. The tables and figures below show the event types that cause the most property damage and the most crop damage. Amounts are given in millions of US dollars.

property <- data %>% group_by(evtype) %>% summarize(property_damage=sum(propdmg)/1000000) %>% arrange(desc(property_damage))
head(property,10)
## # A tibble: 10 × 2
##                evtype property_damage
##                 <chr>           <dbl>
## 1               FLOOD      168212.216
## 2  TORNADO OR TYPHOON      126909.388
## 3         STORM SURGE       43323.536
## 4                HAIL       17622.992
## 5           HURRICANE       15350.340
## 6        STRONG WINDS       10867.551
## 7            WILDFIRE        7766.963
## 8      TROPICAL STORM        7714.391
## 9        WINTER STORM        6688.497
## 10       THUNDERSTORM        6639.609
prop <- property$property_damage[1:10]
par(mar=c(1,12,8,5))
barplot(rev(prop),horiz=TRUE,col="steelblue",names.arg=rev(property$evtype[1:10]), las=2, xaxt="n", main="Property damage caused by natural disasters in millions of US Dollars") #labels are reversed together with the bars
axis(3, at = pretty(prop), labels = format(pretty(prop)), las = 1)

crops <- data %>% group_by(evtype) %>% summarize(crop_damage=sum(cropdmg)/1000000) %>% arrange(desc(crop_damage))
head(crops,10)
## # A tibble: 10 × 2
##                evtype crop_damage
##                 <chr>       <dbl>
## 1             DROUGHT  13972.5660
## 2               FLOOD  12380.1091
## 3           ICE STORM   5022.1135
## 4                HAIL   3114.2129
## 5  TORNADO OR TYPHOON   3023.6593
## 6           HURRICANE   2897.4200
## 7                COLD   1409.1155
## 8        STRONG WINDS   1344.4878
## 9        FROST/FREEZE   1094.1860
## 10               HEAT    904.4693
crop <- crops$crop_damage[1:10]
par(mar=c(1,12,8,5))
barplot(rev(crop),horiz=TRUE,col="darkgreen",names.arg=rev(crops$evtype[1:10]), las=2, xaxt="n", main="Crop damage caused by natural disasters in millions of US Dollars") #labels are reversed together with the bars
axis(3, at = pretty(crop), labels = format(pretty(crop)), las = 1)
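
Property and crop damage are reported separately above. If a single ranking of economic consequences is wanted, the two can be added per event type. The following is a minimal sketch in base R, assuming a data frame with the columns evtype, propdmg and cropdmg already converted to dollars. The function total_damage is hypothetical, and the toy numbers are made up for illustration and are not taken from the NOAA data.

```r
# Hypothetical helper (an assumption for illustration): total property plus
# crop damage per event type, in millions of US dollars, largest first.
total_damage <- function(df) {
  df$total <- (df$propdmg + df$cropdmg) / 1e6
  out <- aggregate(total ~ evtype, data = df, FUN = sum)
  out[order(-out$total), ]
}

# Made-up numbers for illustration only; NOT taken from the NOAA data.
toy <- data.frame(
  evtype  = c("FLOOD", "DROUGHT", "FLOOD", "HAIL"),
  propdmg = c(2e6, 0, 3e6, 1e6),
  cropdmg = c(1e6, 5e6, 0, 5e5),
  stringsAsFactors = FALSE
)

res <- total_damage(toy)
res
```

On the real data the call would be total_damage(data) after the exponent conversion; the result is not shown here because it was not part of the original analysis.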

Summary

In this document, the economic and health consequences of severe weather events in the United States were studied. Tornadoes and typhoons, followed by heat, cause the highest numbers of deaths and injuries of all event types. Floods and tornadoes or typhoons cause the most property damage, while drought and floods cause the most crop damage.