Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
* Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Let’s initialize data file name, zip_file name, and valid URL for raw datasets.
file_name <- "repdata-data-StormData.csv"
zip_file <- "repdata-data-StormData.csv.bz2"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
Before input data, let’s check whether the data is loaded in current directory.
# Check if the data is downloaded and download when applicable
if (!file.exists(file_name)) {
download.file(url, destfile = zip_file)
unzip(zip_file)
file.remove(zip_file)
}
Now, Let’s load the data for preprocessing and show first 5 events
full_data <- read.csv(file_name)
dim(full_data)
## [1] 902297 37
head(full_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Since the data is loaded, let’s input dplyr library for manipulating data
library(dplyr)
The objective of this analysis is to explore the damage to society from all events, and we majorly focus on the damage to either human and economy.
Let’s see how many different events were collected.
length(unique(full_data$EVTYPE))
## [1] 985
We have total 985 different event types in the data. Let’s sort the event name and see the first 50 events
sort(unique(full_data$EVTYPE))[1:50]
## [1] HIGH SURF ADVISORY COASTAL FLOOD
## [3] FLASH FLOOD LIGHTNING
## [5] TSTM WIND TSTM WIND (G45)
## [7] WATERSPOUT WIND
## [9] ? ABNORMAL WARMTH
## [11] ABNORMALLY DRY ABNORMALLY WET
## [13] ACCUMULATED SNOWFALL AGRICULTURAL FREEZE
## [15] APACHE COUNTY ASTRONOMICAL HIGH TIDE
## [17] ASTRONOMICAL LOW TIDE AVALANCE
## [19] AVALANCHE BEACH EROSIN
## [21] Beach Erosion BEACH EROSION
## [23] BEACH EROSION/COASTAL FLOOD BEACH FLOOD
## [25] BELOW NORMAL PRECIPITATION BITTER WIND CHILL
## [27] BITTER WIND CHILL TEMPERATURES Black Ice
## [29] BLACK ICE BLIZZARD
## [31] BLIZZARD AND EXTREME WIND CHIL BLIZZARD AND HEAVY SNOW
## [33] Blizzard Summary BLIZZARD WEATHER
## [35] BLIZZARD/FREEZING RAIN BLIZZARD/HEAVY SNOW
## [37] BLIZZARD/HIGH WIND BLIZZARD/WINTER STORM
## [39] BLOW-OUT TIDE BLOW-OUT TIDES
## [41] BLOWING DUST blowing snow
## [43] Blowing Snow BLOWING SNOW
## [45] BLOWING SNOW & EXTREME WIND CH BLOWING SNOW- EXTREME WIND CHI
## [47] BLOWING SNOW/EXTREME WIND CHIL BREAKUP FLOODING
## [49] BRUSH FIRE BRUSH FIRES
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
We found that the event names have not been organized well. There are some leading
and trailing spaces, some punctuations, many similarties, and etc.
Let’s make them all upper cases and remove punctuations and irrelevant spaces.
full_data$EVTYPE <- toupper(full_data$EVTYPE)
full_data$EVTYPE <- gsub("[[:punct:]]", " ",full_data$EVTYPE)# remove punctuaion
full_data$EVTYPE <- gsub("\\s+"," ",full_data$EVTYPE) # remove consequtive spaces
full_data$EVTYPE <- trimws(full_data$EVTYPE, which = "both") # remove leading and trailing spaces
sort(unique(full_data$EVTYPE))[1:50]
## [1] "" "ABNORMAL WARMTH"
## [3] "ABNORMALLY DRY" "ABNORMALLY WET"
## [5] "ACCUMULATED SNOWFALL" "AGRICULTURAL FREEZE"
## [7] "APACHE COUNTY" "ASTRONOMICAL HIGH TIDE"
## [9] "ASTRONOMICAL LOW TIDE" "AVALANCE"
## [11] "AVALANCHE" "BEACH EROSIN"
## [13] "BEACH EROSION" "BEACH EROSION COASTAL FLOOD"
## [15] "BEACH FLOOD" "BELOW NORMAL PRECIPITATION"
## [17] "BITTER WIND CHILL" "BITTER WIND CHILL TEMPERATURES"
## [19] "BLACK ICE" "BLIZZARD"
## [21] "BLIZZARD AND EXTREME WIND CHIL" "BLIZZARD AND HEAVY SNOW"
## [23] "BLIZZARD FREEZING RAIN" "BLIZZARD HEAVY SNOW"
## [25] "BLIZZARD HIGH WIND" "BLIZZARD SUMMARY"
## [27] "BLIZZARD WEATHER" "BLIZZARD WINTER STORM"
## [29] "BLOW OUT TIDE" "BLOW OUT TIDES"
## [31] "BLOWING DUST" "BLOWING SNOW"
## [33] "BLOWING SNOW EXTREME WIND CH" "BLOWING SNOW EXTREME WIND CHI"
## [35] "BLOWING SNOW EXTREME WIND CHIL" "BREAKUP FLOODING"
## [37] "BRUSH FIRE" "BRUSH FIRES"
## [39] "COASTAL EROSION" "COASTAL FLOOD"
## [41] "COASTAL FLOODING" "COASTAL FLOODING EROSION"
## [43] "COASTAL STORM" "COASTAL SURGE"
## [45] "COASTAL TIDAL FLOOD" "COASTALFLOOD"
## [47] "COASTALSTORM" "COLD"
## [49] "COLD AIR FUNNEL" "COLD AIR FUNNELS"
We can see there is still similarities, such as singular or plurals, TSTM or Thunderstorm, Flash Flood, Flood Flash, Flash Flood Flood or etc.
full_data$EVTYPE <- gsub("S$", "", full_data$EVTYPE) # plural to singular by removing trailing S
full_data[which(full_data$EVTYPE %in% c("FLASH FLOOODING","FLASH FLOOD FLOOD",
"FLOOD FLASH","FLOOD FLASH FLOOD",
"FLOOD FLASHFLOOD","FLOOD FLOOD FLASH",
"FLASH FLODDING")),]$EVTYPE <- "FLASH FLOOD"
full_data$EVTYPE <- gsub("TSTM", "THUNDERSTORM", full_data$EVTYPE, fixed = TRUE)
Even though there is still many similarties, in this project, we only focus on the top 10 the most influecial events. In orer to do that, let’s group the data by event and
summarize by the total count and see top 20 occurracne.
(full_data %>% group_by(EVTYPE) %>% summarise(CNT = n()) %>% arrange(desc(CNT)))[1:20,]
## # A tibble: 20 x 2
## EVTYPE CNT
## <chr> <int>
## 1 THUNDERSTORM WIND 323370
## 2 HAIL 288661
## 3 TORNADO 60653
## 4 FLASH FLOOD 54969
## 5 FLOOD 25330
## 6 HIGH WIND 21749
## 7 LIGHTNING 15756
## 8 HEAVY SNOW 15708
## 9 MARINE THUNDERSTORM WIND 11987
## 10 HEAVY RAIN 11768
## 11 WINTER STORM 11436
## 12 WINTER WEATHER 7045
## 13 FUNNEL CLOUD 6932
## 14 WATERSPOUT 3845
## 15 STRONG WIND 3773
## 16 URBAN SML STREAM FLD 3392
## 17 WILDFIRE 2769
## 18 BLIZZARD 2719
## 19 DROUGHT 2488
## 20 ICE STORM 2006
For the rest similarties, we correct the EVTYPE when we analize the rest data based on
whether correction of EVTYPE is really necessary.
First, let’s explore the damage to human from all the events during past half century.
In order to do that, let’s group the data by all events types and sum up all the injuries based on the event type, then store the top 10 largest number of injuries in to variable large_injuries
large_injuries <- full_data %>%
group_by(EVTYPE) %>%
summarise(INJURIES = sum(INJURIES, na.rm = TRUE)) %>%
filter(INJURIES %in% sort(INJURIES,decreasing = TRUE)[1:10])
Let’s see the result in descending order by total injuries.
print(arrange(large_injuries, desc(INJURIES)))
## # A tibble: 10 x 2
## EVTYPE INJURIES
## <chr> <dbl>
## 1 TORNADO 91346
## 2 THUNDERSTORM WIND 9363
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1792
## 9 HIGH WIND 1439
## 10 HAIL 1361
Do similar to total fatalities and see the result
large_fatalities <- full_data %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE)) %>%
filter(FATALITIES %in% sort(FATALITIES,decreasing = TRUE)[1:10])
print(arrange(large_fatalities, desc(FATALITIES)))
## # A tibble: 10 x 2
## EVTYPE FATALITIES
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 1011
## 4 HEAT 937
## 5 LIGHTNING 817
## 6 THUNDERSTORM WIND 701
## 7 RIP CURRENT 572
## 8 FLOOD 470
## 9 HIGH WIND 283
## 10 AVALANCHE 224
We can see that Tornado gives the most damage to people in death or injuries.
Let’s plot both fatalities and injuries, and visualize the differences. In order to do that, let’s combine the injuries and fatalities data by event type
fatal_injur <- inner_join(large_fatalities, large_injuries, by = "EVTYPE")
library(reshape2) # need reshape2 library for melt function betlow
fatal_injur<- melt(fatal_injur, id.vars = c("EVTYPE"))
print(fatal_injur)
## EVTYPE variable value
## 1 EXCESSIVE HEAT FATALITIES 1903
## 2 FLASH FLOOD FATALITIES 1011
## 3 FLOOD FATALITIES 470
## 4 HEAT FATALITIES 937
## 5 HIGH WIND FATALITIES 283
## 6 LIGHTNING FATALITIES 817
## 7 THUNDERSTORM WIND FATALITIES 701
## 8 TORNADO FATALITIES 5633
## 9 EXCESSIVE HEAT INJURIES 6525
## 10 FLASH FLOOD INJURIES 1792
## 11 FLOOD INJURIES 6789
## 12 HEAT INJURIES 2100
## 13 HIGH WIND INJURIES 1439
## 14 LIGHTNING INJURIES 5230
## 15 THUNDERSTORM WIND INJURIES 9363
## 16 TORNADO INJURIES 91346
Now, let’s visualize the total fatalites and injureis caused by major event type
library(ggplot2)
ggplot(data = fatal_injur, aes(x = EVTYPE, y = value,
group = variable, fill = variable)) +
geom_bar(stat = "identity",width = 0.5, position = "dodge") +
coord_flip() +
labs(title = "Total number of fatalities and injuires by event",
x = "Event Type", y = "Number of causalties")
From the graph, we can see that Tornado injuired human the most significant than any other event type; also, it took the most people’s life.
Now, let’s talk about economical influence from event.
From the data, we see that PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP were recorded all the property or crop damage caused by event. Let’s take out the data has economic influnces, which is DMG >0, from full_data to
“P_CROPDMG_data”
P_CROPDMG_data <- full_data[which(full_data$PROPDMG > 0 | full_data$PROPDMG > 0),
c("EVTYPE","PROPDMG","PROPDMGEXP",
"CROPDMG","CROPDMGEXP","REFNUM")]
summary(P_CROPDMG_data)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG
## Length:239174 Min. : 0.01 K :227481 Min. : 0.000
## Class :character 1st Qu.: 2.50 M : 11319 1st Qu.: 0.000
## Mode :character Median : 10.00 0 : 209 Median : 0.000
## Mean : 45.51 : 76 Mean : 4.398
## 3rd Qu.: 25.00 B : 40 3rd Qu.: 0.000
## Max. :5000.00 5 : 18 Max. :978.000
## (Other): 31
## CROPDMGEXP REFNUM
## :145037 Min. : 1
## K : 92734 1st Qu.:281994
## M : 1356 Median :479228
## k : 21 Mean :487226
## 0 : 16 3rd Qu.:707229
## ? : 6 Max. :902260
## (Other): 4
As we see, the PROPDMG and CROPDMG are only numerical values that stand for total economical
damage, and PROPDMGEXP and CROPDMGEXP are categorial values that stand for units of amouts.
“B” stands for billion, “K” stands for thousands, “M” stands for million, and 1,2,3,4 stands
for the power of 10s.
First, let’s organize the CROPDMGEXP and PROPDMGXP by removing all punctuation
and make them all capital
P_CROPDMG_data$PROPDMGEXP <- toupper(gsub("[[:punct:]]",0,P_CROPDMG_data$PROPDMGEXP))
P_CROPDMG_data$CROPDMGEXP <- toupper(gsub("[[:punct:]]",0,P_CROPDMG_data$CROPDMGEXP))
Also, lets’s correct letter to specific numbers, such as K to 3, M to 6,
P_CROPDMG_data[grepl("M", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "6"
P_CROPDMG_data[grepl("K", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "3"
P_CROPDMG_data[grepl("B", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "9"
P_CROPDMG_data[grepl("H", P_CROPDMG_data$PROPDMGEXP),"PROPDMGEXP"] <- "2"
P_CROPDMG_data[grepl("M", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "6"
P_CROPDMG_data[grepl("K", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "3"
P_CROPDMG_data[grepl("B", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "9"
P_CROPDMG_data[grepl("H", P_CROPDMG_data$CROPDMGEXP),"CROPDMGEXP"] <- "2"
table(P_CROPDMG_data$PROPDMGEXP)
##
## 0 2 3 4 5 6 7 9
## 76 215 8 227482 4 18 11329 2 40
table(P_CROPDMG_data$CROPDMGEXP)
##
## 0 3 6 9
## 145037 22 92755 1357 3
P_CROPDMG_data[which(P_CROPDMG_data$PROPDMGEXP == ""),"PROPDMGEXP"] <- "0"
P_CROPDMG_data[which(P_CROPDMG_data$CROPDMGEXP == ""),"CROPDMGEXP"] <- "0"
Let’s only focus on property damage.
PROP_DMG <- P_CROPDMG_data[,c("EVTYPE","PROPDMG","PROPDMGEXP","REFNUM")]
Then, adjust the PROPDMG with actual property damgage
PROP_DMG$PROPDMGEXP <- as.numeric(PROP_DMG$PROPDMGEXP)
PROP_DMG$PROPDMG <- PROP_DMG$PROPDMG*10^PROP_DMG$PROPDMGEXP
Now, lets’s group by the EVTYPE and sum up property damage by EVTYPE,
and see top 10 events
TOTAL_PROP_DMG <- PROP_DMG %>%
group_by(EVTYPE) %>%
summarise(PROP_SUM = sum(PROPDMG)) %>%
filter(PROP_SUM %in% sort(PROP_SUM,decreasing = TRUE)[1:10])
print(arrange(TOTAL_PROP_DMG, desc(PROP_SUM)))
## # A tibble: 10 x 2
## EVTYPE PROP_SUM
## <chr> <dbl>
## 1 FLOOD 144663709807
## 2 HURRICANE TYPHOON 69305840000
## 3 TORNADO 56947380676.
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 17279483492.
## 6 HAIL 15735267513.
## 7 HURRICANE 11868319010
## 8 THUNDERSTORM WIND 9920860826.
## 9 TROPICAL STORM 7703890550
## 10 WINTER STORM 6688997251
We can see the Flood makes the greatest property damage, which is total $144,663,709,807. Let’s see the difference from graph
ggplot(data = TOTAL_PROP_DMG, aes(x = EVTYPE, y = PROP_SUM)) +
geom_bar(stat = "identity", width = 0.5, position = "dodge") +
coord_flip()
For Crop Damage, lets do similary.
CROP_DMG <- P_CROPDMG_data[,c("EVTYPE","CROPDMG","CROPDMGEXP","REFNUM")]
CROP_DMG$CROPDMGEXP <- as.numeric(CROP_DMG$CROPDMGEXP)
CROP_DMG$CROPDMG <- CROP_DMG$CROPDMG*10^CROP_DMG$CROPDMGEXP
Now, lets’s group by the EVTYPE and sum up property damage by EVTYPE,
and see top 10 events
TOTAL_CROP_DMG <- CROP_DMG %>%
group_by(EVTYPE) %>%
summarise(CROP_SUM = sum(CROPDMG)) %>%
filter(CROP_SUM %in% sort(CROP_SUM,decreasing = TRUE)[1:10])
print(arrange(TOTAL_CROP_DMG, desc(CROP_SUM)))
## # A tibble: 10 x 2
## EVTYPE CROP_SUM
## <chr> <dbl>
## 1 RIVER FLOOD 5028734000
## 2 ICE STORM 5022113500
## 3 FLOOD 4073493450
## 4 HURRICANE 2688910000
## 5 HURRICANE TYPHOON 2607822800
## 6 HAIL 1878569103
## 7 FLASH FLOOD 1477879350
## 8 DROUGHT 1231755000
## 9 THUNDERSTORM WIND 1017159388
## 10 HIGH WIND 636085650
We can see the river flood makes the greatest crop damage, which is total $50,228,734,000.
From the analysis, Tornado gave us the most fatalities and injuries.
However, Flood gave us the most property and crop damage.
Therefore, Tornado gives the most effective for human’s health, and Flood gives the
the most severe economic damage.