Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The research questions that this study is trying to address is as
follows:
1. Across the United States, which types of events are most harmful with
respect to population health?
2. Across the United States, which types of events have the greatest
economic consequences?
We first read in the from the raw csv file included in the zip archive. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2")
data <- read.csv(bzfile("StormData.csv.bz2"))
Next, we understand the structure and summary of data.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
The results above show that the dataset contains 902,297 observations and 37 variables.
We will now select only the important variables to conduct the next
steps of analysis. 7 variables will be chosen which are:
1. Event Type - EVTYPE
2. Number of Fatalities - FATALITIES
3. Number of Injuries - INJURIES
4. Property Damage - PROPDMG
5. Property Damage Multiplier - PROPDMGEXP
6. Crop Damage - CROPDMG
7. Crop Damange Multiplier - CROPDMGEXP
There are no missing values/NAs in the 7 variables that we are interested in, as seen in the summary data above.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df1 <- data %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We will now create a new variable called
TOTAL_POPN_HARM, which sums up the total
FATALITIES plus total INJURIES.
df1 <- df1 %>% mutate(TOTAL_POPN_HARM = FATALITIES + INJURIES)
str(df1)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP : chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP : chr "" "" "" "" ...
## $ TOTAL_POPN_HARM: num 15 0 2 2 2 6 1 0 15 0 ...
Next, we will now go through the Event Type categories and use the
gsub command combine categories to create less categories
that are unique and have more general terms.
# Standardising all naming to Uppercase
df1$EVTYPE <- toupper(df1$EVTYPE)
# Factor levels within each variables
df1$EVTYPE <- factor(df1$EVTYPE)
df1$PROPDMGEXP <- factor(df1$PROPDMGEXP)
df1$CROPDMGEXP <- factor(df1$CROPDMGEXP)
# Categorise events by more general terms
# Avalanche
df1$EVTYPE <- gsub("^.*AVALAN.*$", "AVALANCHE", df1$EVTYPE)
# Blizzard
df1$EVTYPE <- gsub("^.*BLIZZARD.*$", "BLIZZARD", df1$EVTYPE)
# Cloud
df1$EVTYPE <- gsub("^.*CLOUD.*$", "CLOUD", df1$EVTYPE)
# Cold Weather
df1$EVTYPE <- gsub("^.*COLD.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*ICE.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*ICY.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*FREEZ.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*COOL.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*LOW TEMP.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WINTER.*$", "COLD WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*FROS.*$", "COLD WEATHER", df1$EVTYPE)
# Dry Weather
df1$EVTYPE <- gsub("^.*DRY.*$", "DRY WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*DRIE.*$", "DRY WEATHER", df1$EVTYPE)
# Dust
df1$EVTYPE <- gsub("^.*DUST.*$", "DUST", df1$EVTYPE)
# Flood
df1$EVTYPE <- gsub("^.*FLOO.*$", "FLOOD", df1$EVTYPE)
# Fog
df1$EVTYPE <- gsub("^.*FOG.*$", "FOG", df1$EVTYPE)
# Hail
df1$EVTYPE <- gsub("^.*HAIL.*$", "HAIL", df1$EVTYPE)
# High Tide/Waves
df1$EVTYPE <- gsub("^.*SEAS.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SURF.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TIDE.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WAVE.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SWELL.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HIGH WATER.*$", "HIGH TIDE/WAVES", df1$EVTYPE)
# Hot Weather
df1$EVTYPE <- gsub("^.*HOT.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WARM.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HEAT.*$", "HOT WEATHER", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*HIGH TEMP.*$", "HOT WEATHER", df1$EVTYPE)
# Hurricane
df1$EVTYPE <- gsub("^.*HURRICANE.*$", "HURRICANE", df1$EVTYPE)
# Hypothermia
df1$EVTYPE <- gsub("^.*HYPOTHERMIA.*$", "HYPOTHERMIA", df1$EVTYPE)
# Landslide
df1$EVTYPE <- gsub("^.*LANDSLIDE.*$", "LANDSLIDE", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*EROSI.*$", "LANDSLIDE", df1$EVTYPE)
# Mud slide
df1$EVTYPE <- gsub("^.*MUD.*$", "MUDSLIDE", df1$EVTYPE)
# Precipitation
df1$EVTYPE <- gsub("^.*PRECIP.*$", "PRECIPITATION RELATED", df1$EVTYPE)
# Rain/Lightning
df1$EVTYPE <- gsub("^.*RAIN.*$", "RAIN/LIGHTNING", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*LIGHTNING.*$", "RAIN/LIGHTNING", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*SHOWER.*$", "RAIN/LIGHTNING", df1$EVTYPE)
# Rip Current
df1$EVTYPE <- gsub("^.*RIP CURR.*$", "RIP CURRENT", df1$EVTYPE)
# Snow
df1$EVTYPE <- gsub("^.*SNOW.*$", "SNOW", df1$EVTYPE)
# Strong Winds
df1$EVTYPE <- gsub("^.*WIND.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WINDS.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TSTM.*$", "WINDS", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*WND.*$", "WINDS", df1$EVTYPE)
# Storm
df1$EVTYPE <- gsub("^.*STORM.*$", "STORM", df1$EVTYPE)
# Storm
df1$EVTYPE <- gsub("^.*SUMMARY.*$", "SUMMARY", df1$EVTYPE)
# Tornado
df1$EVTYPE <- gsub("^.*TORNADO.*$", "TORNADO", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*GUSTNADO.*$", "TORNADO", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*TORNDAO.*$", "TORNADO", df1$EVTYPE)
# Volcano
df1$EVTYPE <- gsub("^.*VOLCANIC.*$", "VOLCANIC", df1$EVTYPE)
# Urban Small Stream
df1$EVTYPE <- gsub("^.*URBAN.*$", "URBAN/SMALL STREAM", df1$EVTYPE)
df1$EVTYPE <- gsub("^.*STREAM.*$", "URBAN/SMALL STREAM", df1$EVTYPE)
# Wet Weather
df1$EVTYPE <- gsub("^.*WET.*$", "WET WEATHER", df1$EVTYPE)
# Water Spout
df1$EVTYPE <- gsub("^.*SPOUT.*$", "WATERSPOUT", df1$EVTYPE)
# Wild Fires
df1$EVTYPE <- gsub("^.*FIRE.*$", "WILD FIRES", df1$EVTYPE)
df1$EVTYPE <- factor(df1$EVTYPE)
The property and crop damage multiplier will need to be redefined
next. Here is how it will be categorised:
1. Non numbers or 0 = 0
2. 1 = 10
3. 2 = 100
4. h/H = 100
5. 3 = 1,000
6. K = 1,000
7. 4 = 10,000
8. 5 = 100,000
9. 6 = 1,000,000
10. m/M = 1,000,000
11. 7 = 10,000,000
12. 8 = 100,000,000
13. B = 1,000,000,000
The first will be for property damage
df1$PROPDMGEXP <- factor(df1$PROPDMGEXP,
levels = c("","-","?","+","0","1","2","3","4","5","6","7","8","B","h","H","K","m","M"),
labels = c("0","0","0","0","0","10","100","1000","10000","100000","1000000","10000000",
"100000000","1000000000","100","100","1000","1000000","1000000"))
df1$PROPDMGEXP <- as.character(df1$PROPDMGEXP)
df1$PROPDMGEXP <- as.integer(df1$PROPDMGEXP)
df1 <- df1 %>% mutate(PROP_DMG_VALUE = PROPDMG * PROPDMGEXP)
The second will be for crop damage (CROP_DMG_VALUE).
Both the CROP_DMG_VALUE and
property damagePROP_DMG_VALUEwill be mutated and summed up into a new variable calledTOTAL_DMG.
df1$CROPDMGEXP <- factor(df1$CROPDMGEXP,
levels = c("","?","0","2","B","k","K","m","M"),
labels = c("0","0","0","100","1000000000","1000","1000","1000000","1000000"))
df1$CROPDMGEXP <- as.character(df1$CROPDMGEXP)
df1$CROPDMGEXP <- as.integer(df1$CROPDMGEXP)
df1 <- df1 %>% mutate(CROP_DMG_VALUE = CROPDMG * CROPDMGEXP)
df1 <- df1 %>% mutate(TOTAL_DMG = CROP_DMG_VALUE * PROP_DMG_VALUE)
str(df1)
## 'data.frame': 902297 obs. of 11 variables:
## $ EVTYPE : Factor w/ 75 levels "?","APACHE COUNTY",..: 64 64 64 64 64 64 64 64 64 64 ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP : int 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP : int 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTAL_POPN_HARM: num 15 0 2 2 2 6 1 0 15 0 ...
## $ PROP_DMG_VALUE : num 25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
## $ CROP_DMG_VALUE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TOTAL_DMG : num 0 0 0 0 0 0 0 0 0 0 ...
In this section, we will plot some graphs to answer the Research Questions.
The graph and table below show that Tornados cause the most harm to the population, with 97,043 fatalities plus injuries.
df2 <- df1 %>% group_by(EVTYPE) %>% summarise(SUM_HARM = sum(TOTAL_POPN_HARM)) %>% arrange(desc(SUM_HARM))
head(df2, 10)
## # A tibble: 10 × 2
## EVTYPE SUM_HARM
## <fct> <dbl>
## 1 TORNADO 97043
## 2 WINDS 12464
## 3 HOT WEATHER 11789
## 4 FLOOD 10127
## 5 RAIN/LIGHTNING 6438
## 6 COLD WEATHER 5305
## 7 WILD FIRES 1698
## 8 HAIL 1512
## 9 HURRICANE 1461
## 10 SNOW 1293
df3 <- head(df2, 10)
barplot(df3$SUM_HARM, names.arg = df3$EVTYPE,
main = "Event types most harmful to population health (Fatalities + Injuries)",
col = "green",
las = 2)
From the graph and table below, it can be seen that Floods cause the most economic damage.
df4 <- df1 %>% group_by(EVTYPE) %>% summarise(SUM_DMG = sum(TOTAL_DMG)) %>% arrange(desc(SUM_DMG))
head(df4, 10)
## # A tibble: 10 × 2
## EVTYPE SUM_DMG
## <fct> <dbl>
## 1 FLOOD 2.91e19
## 2 HURRICANE 1.36e19
## 3 WINDS 1.80e17
## 4 HAIL 1.50e16
## 5 DROUGHT 1.43e16
## 6 WILD FIRES 1.34e16
## 7 STORM 1.14e16
## 8 COLD WEATHER 3.72e15
## 9 TORNADO 3.08e15
## 10 BLIZZARD 2.56e15
df5 <- head(df4, 10)
barplot(df5$SUM_DMG, names.arg = df5$EVTYPE,
main = "Event types that caused the most damage",
col = "blue",
las = 2)