Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
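This data analysis addresses two questions: which types of severe weather events are most harmful to population health, and which have the greatest economic consequences.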
library(readr)
library(dplyr)
library(ggplot2)
First, the file needs to be downloaded:
src <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
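dest <- file.path("data", basename(src))
# Create the target directory if needed; download only when the file is missing
if (!dir.exists("data")) dir.create("data")
if (!file.exists(dest)) {
  download.file(src, dest, method = "curl", quiet = TRUE)
}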
After downloading, the file is read in. I look at the column names to get an idea of which variables I will need for further analysis.
my_raw_data <- read_csv(dest)
## Parsed with column specification:
## cols(
## .default = col_character(),
## STATE__ = col_double(),
## COUNTY = col_double(),
## BGN_RANGE = col_double(),
## COUNTY_END = col_double(),
## END_RANGE = col_double(),
## LENGTH = col_double(),
## WIDTH = col_double(),
## F = col_integer(),
## MAG = col_double(),
## FATALITIES = col_double(),
## INJURIES = col_double(),
## PROPDMG = col_double(),
## CROPDMG = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## LATITUDE_E = col_double(),
## LONGITUDE_ = col_double(),
## REFNUM = col_double()
## )
## See spec(...) for full column specifications.
names(my_raw_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
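For this analysis, the relevant information about storms and severe weather events is contained in the following columns: EVTYPE (event type), INJURIES and FATALITIES (health impact), PROPDMG and PROPDMGEXP (property damage estimate and its magnitude), and CROPDMG and CROPDMGEXP (crop damage estimate and its magnitude).
Only these columns are kept in a data subset for further analysis: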
data <- select(my_raw_data,EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
names(data) <- tolower(names(data)) #changing column names to lowercase for easier handling
head(data)
## # A tibble: 6 × 7
## evtype injuries fatalities propdmg propdmgexp cropdmg cropdmgexp
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 TORNADO 15 0 25.0 K 0 <NA>
## 2 TORNADO 0 0 2.5 K 0 <NA>
## 3 TORNADO 2 0 25.0 K 0 <NA>
## 4 TORNADO 2 0 2.5 K 0 <NA>
## 5 TORNADO 2 0 2.5 K 0 <NA>
## 6 TORNADO 6 0 2.5 K 0 <NA>
The summary of the data (page 12) indicates that:
Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry
The estimates columns (“propdmgexp”, “cropdmgexp”) contain other characters that need to be cleaned:
unique(data$propdmgexp)
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data$cropdmgexp)
## [1] NA "M" "K" "m" "B" "?" "0" "k" "2"
Characters such as “+”, “?”, and “-” will be treated as a magnitude of 0. Since none of the “propdmg” and “cropdmg” values is NA, the NAs in “propdmgexp” and “cropdmgexp” will also be treated as 0, that is, a multiplier of 10^0 = 1.
The letter codes are converted to powers of ten (H = 2, K = 3, M = 6, B = 9, ignoring case), digits are used as the power directly, and the “propdmg” and “cropdmg” columns are then multiplied by ten raised to that power:
data$propdmgexp[(data$propdmgexp == "") | (data$propdmgexp == "+") | (data$propdmgexp == "?") | (data$propdmgexp == "-") | is.na(data$propdmgexp)] <- 0
data$propdmgexp[(data$propdmgexp == "h") | (data$propdmgexp == "H")] <- 2
data$propdmgexp[(data$propdmgexp == "k") | (data$propdmgexp == "K")] <- 3
data$propdmgexp[(data$propdmgexp == "m") | (data$propdmgexp == "M")] <- 6
data$propdmgexp[(data$propdmgexp == "b") | (data$propdmgexp == "B")] <- 9
data$cropdmgexp[(data$cropdmgexp == "") | (data$cropdmgexp == "+") | (data$cropdmgexp == "?") | (data$cropdmgexp == "-") | is.na(data$cropdmgexp)] <- 0
data$cropdmgexp[(data$cropdmgexp == "h") | (data$cropdmgexp == "H")] <- 2
data$cropdmgexp[(data$cropdmgexp == "k") | (data$cropdmgexp == "K")] <- 3
data$cropdmgexp[(data$cropdmgexp == "m") | (data$cropdmgexp == "M")] <- 6
data$cropdmgexp[(data$cropdmgexp == "b") | (data$cropdmgexp == "B")] <- 9
data$propdmg <- data$propdmg * 10^(as.numeric(data$propdmgexp))
data$cropdmg <- data$cropdmg * 10^(as.numeric(data$cropdmgexp))
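The same recoding rules can also be written as a lookup table. The sketch below is an illustration on a small hand-made vector, not the code that produced the results in this report; the helper name `exp_to_power` and the sample values are assumptions. It follows the rules described above: letters map to powers of ten, single digits are used as the power directly, and anything else (including NA) becomes 0.

```r
# Hypothetical helper: convert a magnitude code to a power of ten.
exp_to_power <- function(code) {
  letters_map <- c(H = 2, K = 3, M = 6, B = 9)
  code <- toupper(code)
  power <- unname(letters_map[code])            # NA where the code is not H, K, M or B
  is_digit <- !is.na(code) & grepl("^[0-9]$", code)
  power[is_digit] <- as.numeric(code[is_digit]) # digits are used as the power itself
  power[is.na(power)] <- 0                      # "+", "?", "-", "" and NA: no multiplier
  power
}

# Made-up sample values, one per kind of code
codes   <- c("K", "m", NA, "B", "+", "5", "h")
amounts <- c(25, 2.5, 10, 1.55, 3, 1, 4)
damage  <- amounts * 10^exp_to_power(codes)
damage
```

With this mapping, an amount of 1.55 with code “B” becomes 1.55e9, matching the $1,550,000,000 example in the quoted documentation.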
Event type names also need cleanup. The same event appears under several spellings, such as “FLOOD”, “FLOOOD”, “FLOODS”, and “flood”. The code below groups these variants under common event type names.
head(unique(data$evtype),25)
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
## [21] "FLASH FLOODING" "HIGH WINDS"
## [23] "FUNNEL CLOUD" "TORNADO F0"
## [25] "THUNDERSTORM WINDS LIGHTNING"
# There are 977 distinct values of evtype
length(unique(data$evtype))
## [1] 977
data$evtype <- toupper(data$evtype) # removes differences in upper/lower case
# evtype grouping
data$evtype <- gsub('.*FLOOD.*', 'FLOOD', data$evtype)
data$evtype <- gsub('.*FLOOOD.*', 'FLOOD', data$evtype)
data$evtype <- gsub('.*HAIL.*', 'HAIL', data$evtype)
data$evtype <- gsub('.*THUNDERSTORM.* W.*', 'THUNDERSTORM WIND', data$evtype)
data$evtype <- gsub('.*THUNDERSTORM.*', 'THUNDERSTORM', data$evtype)
data$evtype <- gsub('.*WIND.*', 'STRONG WINDS', data$evtype)
data$evtype <- gsub('.*SNOW.*', 'SNOW', data$evtype)
data$evtype <- gsub('.*TSTM.*', 'THUNDERSTORM WIND', data$evtype)
data$evtype <- gsub('.*HIGH TEMPERATURE.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*UNUSUAL\\/RECORD WARMTH.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*RECORD WARM*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*HOT WEATHER.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*RECORD TEMPERATURE.*', 'HIGH TEMPERATURE', data$evtype)
data$evtype <- gsub('.*TORNADO.*', 'TORNADO OR TYPHOON', data$evtype)
data$evtype <- gsub('.*TYPHOON.*', 'TORNADO OR TYPHOON', data$evtype)
data$evtype <- gsub('.*TROPICAL STORM.*', 'TROPICAL STORM', data$evtype)
data$evtype <- gsub('.*HEAT.*', 'HEAT', data$evtype)
data$evtype <- gsub('.*BLIZZARD.*', 'BLIZZARD', data$evtype)
data$evtype <- gsub('.*LIGHTNING.*', 'LIGHTNING', data$evtype)
data$evtype <- gsub('.*COLD.*', 'COLD', data$evtype)
data$evtype <- gsub('.*HEAVY RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HEAVY SHOWER.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*EXCESSIVE RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HVY RAIN.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*RAIN (HEAVY).*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*EXCESSIVE PRECIPITATION.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HEAVY PRECIPITATION.*', 'HEAVY RAIN', data$evtype)
data$evtype <- gsub('.*HURRICANE.*', 'HURRICANE', data$evtype)
data$evtype <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', data$evtype)
data$evtype <- gsub('.*FREEZING FOG.*', 'FREEZING FOG', data$evtype)
data$evtype <- gsub('VOLCANIC ASHFALL', 'VOLCANIC ERUPTION', data$evtype)
data$evtype <- gsub('.*FREEZING RAIN SLEET.*', 'FREEZING RAIN SLEET', data$evtype)
data$evtype <- gsub('.*SLEET \\& FREEZING RAIN.*', 'FREEZING RAIN SLEET', data$evtype)
data$evtype <- gsub('.*WILD\\/FOREST FIRE.*', 'WILDFIRE', data$evtype)
data$evtype <- gsub('\\?', 'OTHER', data$evtype)
data$evtype <- gsub('.*OTHER.*', 'OTHER', data$evtype)
# After grouping, there are 335 unique entries in evtype:
length(unique(data$evtype))
## [1] 335
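Because each `gsub` call rewrites the column in place, the order of the rules matters: a later, broader pattern can recode a label written by an earlier one. The sketch below illustrates this on a small hand-made vector, using three of the patterns above in the same order. The helper `apply_rules` and the sample values are assumptions for illustration only.

```r
# Hypothetical helper: apply pattern -> label rules one after another, in order.
apply_rules <- function(x, rules) {
  for (i in seq_len(nrow(rules))) {
    x <- gsub(rules$pattern[i], rules$label[i], x)
  }
  x
}

rules <- data.frame(
  pattern = c(".*THUNDERSTORM.* W.*", ".*THUNDERSTORM.*", ".*WIND.*"),
  label   = c("THUNDERSTORM WIND",    "THUNDERSTORM",     "STRONG WINDS"),
  stringsAsFactors = FALSE
)

# Made-up sample of event type names
sample_types <- c("THUNDERSTORM WINDS", "HIGH WINDS", "SEVERE THUNDERSTORMS")
grouped <- apply_rules(sample_types, rules)
grouped
```

With the rules in this order, “THUNDERSTORM WINDS” ends up as “THUNDERSTORM”, because the second pattern also matches the label written by the first.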
Events harmful to population health are those that result in injuries and fatalities (deaths). Both are summed per event type, and the result is sorted by number of deaths:
data_health <- data %>% group_by(evtype) %>%
summarize(deaths=sum(fatalities), injury=sum(injuries)) %>% arrange(desc(deaths))
head(data_health)
## # A tibble: 6 × 3
## evtype deaths injury
## <chr> <dbl> <dbl>
## 1 TORNADO OR TYPHOON 5700 92687
## 2 HEAT 3138 9224
## 3 FLOOD 1525 8604
## 4 STRONG WINDS 1212 8964
## 5 LIGHTNING 817 5231
## 6 RIP CURRENT 368 232
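The table is sorted by deaths, so the injury ranking has to be read off separately. As a small sketch, the six rows shown above can be re-sorted by injuries in base R. The data frame below is typed in by hand from the printed output and covers only those six rows, not the full dataset.

```r
# The six rows printed above, re-entered by hand for illustration.
top_by_deaths <- data.frame(
  evtype = c("TORNADO OR TYPHOON", "HEAT", "FLOOD",
             "STRONG WINDS", "LIGHTNING", "RIP CURRENT"),
  deaths = c(5700, 3138, 1525, 1212, 817, 368),
  injury = c(92687, 9224, 8604, 8964, 5231, 232),
  stringsAsFactors = FALSE
)

# Re-sort the same rows by injuries, largest first.
top_by_injury <- top_by_deaths[order(-top_by_deaths$injury), ]
top_by_injury$evtype
```

Among these six rows, strong winds move ahead of floods when the ranking is by injuries rather than deaths.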
Tornadoes and typhoons are the most harmful to population health, with 5700 deaths and 92687 injuries, followed by heat (3138 and 9224, respectively). Floods rank third for deaths (1525) and strong winds third for injuries (8964). The chart below shows injuries and fatalities for the five event types with the highest death counts.
par(mfrow = c(1, 1), mar = c(7, 2, 2, 1))
thedeaths <- data_health$deaths[1:5]
theinjuries <- data_health$injury[1:5]
cause <- rbind(theinjuries, thedeaths)
colnames(cause) <- data_health$evtype[1:5] # a matrix is labelled with colnames(), not names()
end_point <- 9.5 + ncol(cause)
# axisnames = FALSE suppresses the default labels; rotated labels are added with text() below
barplot(cause, beside = TRUE, axisnames = FALSE, col = c("steelblue", "violet"), legend.text = c("Injuries", "Fatalities"), main = "Number of injuries and fatalities caused by natural disasters")
text(seq(2, end_point, by = 3), par("usr")[3] - 0.25,
     srt = 45, adj = 1, xpd = TRUE,
     labels = data_health$evtype[1:5], cex = 0.8)
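The label positions above are computed by hand from the bar layout. `barplot` also returns the bar midpoints, so the positions can be derived from its return value instead. The sketch below uses a made-up matrix (`toy`) with the same shape as `cause` and is only an illustration of that alternative.

```r
# Toy matrix with the same shape as `cause`: 2 series (rows) x 5 groups (columns).
toy <- rbind(injuries = c(5, 4, 3, 2, 1), deaths = c(1, 2, 3, 4, 5))
colnames(toy) <- paste("TYPE", 1:5)

pdf(NULL)  # draw to a null device so the sketch also runs without a display
bar_pos <- barplot(toy, beside = TRUE, axisnames = FALSE)
group_centres <- colMeans(bar_pos)  # midpoint of each pair of bars
text(group_centres, par("usr")[3], labels = colnames(toy),
     srt = 45, adj = 1, xpd = TRUE, cex = 0.8)
dev.off()

group_centres
```

For two bars per group and the default spacing, the midpoints come out as 2, 5, 8, 11, 14, the same positions that `seq(2, end_point, by = 3)` produces.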
Severe weather events damage both property and crops. The figures below show the event types that cause the most property and crop damage. Amounts are given in millions of US dollars.
property <- data %>% group_by(evtype) %>% summarize(property_damage=sum(propdmg)/1000000) %>% arrange(desc(property_damage))
head(property,10)
## # A tibble: 10 × 2
## evtype property_damage
## <chr> <dbl>
## 1 FLOOD 168212.216
## 2 TORNADO OR TYPHOON 126909.388
## 3 STORM SURGE 43323.536
## 4 HAIL 17622.992
## 5 HURRICANE 15350.340
## 6 STRONG WINDS 10867.551
## 7 WILDFIRE 7766.963
## 8 TROPICAL STORM 7714.391
## 9 WINTER STORM 6688.497
## 10 THUNDERSTORM 6639.609
prop <- property$property_damage[1:10]
par(mar=c(1,12,8,5))
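# rev() is applied to both values and labels so that each bar keeps its own label
barplot(rev(prop), horiz = TRUE, col = "steelblue", names.arg = rev(property$evtype[1:10]), las = 2, xaxt = "n", main = "Property damage caused by natural disasters in millions of US Dollars")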
axis(3, at = pretty(prop), labels = format(pretty(prop)), las = 1)
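In a horizontal bar chart the first element is drawn at the bottom, so values and labels both have to be reversed to put the largest bar on top. One way to keep them paired is a named vector. The sketch below uses the first three rows of the property table above, typed in by hand; the variable names are illustrative.

```r
# First three rows of the property table above, as a named vector (millions of USD).
damage <- c("FLOOD" = 168212.216,
            "TORNADO OR TYPHOON" = 126909.388,
            "STORM SURGE" = 43323.536)

# rev() reverses values and names together, so each bar keeps its own label.
damage_plot <- rev(damage)

pdf(NULL)  # null device so the sketch runs without a display
barplot(damage_plot, horiz = TRUE, las = 2, col = "steelblue")
dev.off()

names(damage_plot)
```

Since `barplot` takes its labels from the names of the vector, no separate `names.arg` is needed in this form.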
crops <- data %>% group_by(evtype) %>% summarize(crop_damage=sum(cropdmg)/1000000) %>% arrange(desc(crop_damage))
head(crops,10)
## # A tibble: 10 × 2
## evtype crop_damage
## <chr> <dbl>
## 1 DROUGHT 13972.5660
## 2 FLOOD 12380.1091
## 3 ICE STORM 5022.1135
## 4 HAIL 3114.2129
## 5 TORNADO OR TYPHOON 3023.6593
## 6 HURRICANE 2897.4200
## 7 COLD 1409.1155
## 8 STRONG WINDS 1344.4878
## 9 FROST/FREEZE 1094.1860
## 10 HEAT 904.4693
crop <- crops$crop_damage[1:10]
par(mar=c(1,12,8,5))
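# rev() is applied to both values and labels so that each bar keeps its own label
barplot(rev(crop), horiz = TRUE, col = "darkgreen", names.arg = rev(crops$evtype[1:10]), las = 2, xaxt = "n", main = "Crop damage caused by natural disasters in millions of US Dollars")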
axis(3, at = pretty(crop), labels = format(pretty(crop)), las = 1)
This document studied the economic and health consequences of severe weather events in the United States. Tornadoes and typhoons, followed by heat, cause the highest numbers of deaths and injuries of all event types. Floods, tornadoes and typhoons, and storm surges cause the most property damage, while drought, floods, and ice storms cause the most crop damage.