In this report we aim to analyse the most dangerous weather events through US. To find out the most dangerous event we had to have data about weather events across the US. The data for the analysis came from U.S. National Oceanic and Atmospheric Administration. We found that tornadoe is the most dangerous event in US as for the human health and for property damage. However not all states in US suffer from tornadoes and that is why we found the most dangerous event for every state in US
Data for this analysis came from U.S. National Oceanic and Atmospheric Administration (NOAA). Database include information aboum major weather events as well as estimates of any fatalities, injuries and property damage caused by these events. Database contains events from 1950 till now.
Firs of all we have to download raw data and available documentation that comes with the data.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
url1 <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf"
url2 <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf"
download.file(url = url, destfile = "Stormdata.csv.bz2")
download.file(url = url1, destfile = "Documentation.pdf", mode = "wb")
download.file(url = url2, destfile = "FAQ.pdf", mode = "wb")
First of all we read in the data from the downloaded stormdata.csv.bz2 file. File compressed and the data is comma-separated. First row of the data is column names, so we read it like a header.
stormdata <- read.table("Stormdata.csv.bz2", sep = ",", header = TRUE)
dim(stormdata)
## [1] 902297 37
str(stormdata$EVTYPE)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
The dataset has 37 columns and approximately 900 000 rows.
Event types in the downloaded dataframe looks quite messy. Moreover we have a lot of columns that is unnecessary for our future analysis. We have to clean it to perform future analysis. First af all we have to get rid of unnecessary columns ans select columns that have relation to our research.
From the structure of stormdata we can see that columns that can be interesting for us are:
BGN_DATE - date of the event
STATE - state where event happened
FATALITIES, INJURIES - number of death and injuries due to the event
PROPDMG, CROPDMG - damage caused by the event
library(dplyr)
stormdata <- select(stormdata, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
head(stormdata)
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 4/18/1950 0:00:00 AL TORNADO 0 15 25.0 0
## 2 4/18/1950 0:00:00 AL TORNADO 0 0 2.5 0
## 3 2/20/1951 0:00:00 AL TORNADO 0 2 25.0 0
## 4 6/8/1951 0:00:00 AL TORNADO 0 2 2.5 0
## 5 11/15/1951 0:00:00 AL TORNADO 0 2 2.5 0
## 6 11/15/1951 0:00:00 AL TORNADO 0 6 2.5 0
Now we need list of events to compare our EVTYPE variable, because it is highly unlikely that there are 985 event types. In the downloaded documentation we have list of events on page 6 and there we can see only 48 types of events. First of all we need to extract information about events tome from this pdf document.
library(pdftools)
text <- pdf_text("Documentation.pdf")
text2 <- strsplit(text, "\n")
Events <- text2[[6]]
Events <- Events[9:32]
Events <- gsub("( )+", "\n", Events)
tmp <- strsplit(Events, "\n")
tmp1 <- sapply(tmp, '[[', 1)
tmp2 <- sapply(tmp, '[[', 3)
Events <- c(tmp1, tmp2)
Events
## [1] "Astronomical Low Tide" "Avalanche"
## [3] "Blizzard" "Coastal Flood"
## [5] "Cold/Wind Chill" "Debris Flow"
## [7] "Dense Fog" "Dense Smoke"
## [9] "Drought" "Dust Devil"
## [11] "Dust Storm" "Excessive Heat"
## [13] "Extreme Cold/Wind Chill" "Flash Flood"
## [15] "Flood" "Frost/Freeze"
## [17] "Funnel Cloud" "Freezing Fog"
## [19] "Hail" "Heat"
## [21] "Heavy Rain" "Heavy Snow"
## [23] "High Surf" "High Wind"
## [25] "Hurricane (Typhoon)" "Ice Storm"
## [27] "Lake-Effect Snow" "Lakeshore Flood"
## [29] "Lightning" "Marine Hail"
## [31] "Marine High Wind" "Marine Strong Wind"
## [33] "Marine Thunderstorm Wind" "Rip Current"
## [35] "Seiche" "Sleet"
## [37] "Storm Surge/Tide" "Strong Wind"
## [39] "Thunderstorm Wind" "Tornado"
## [41] "Tropical Depression" "Tropical Storm"
## [43] "Tsunami" "Volcanic Ash"
## [45] "Waterspout" "Wildfire"
## [47] "Winter Storm" "Winter Weather"
Now we have list of correct event names. Let’s look how many events in our dataframe look like events in the documentation.
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 8.090462e-05
Less than 1%. It can be caused by the difference in capital and lower letters. Let’s make all event types in our dataframe and in our list of events lowercase.
stormdata$EVTYPE <- as.character(stormdata$EVTYPE)
stormdata$EVTYPE <- tolower(stormdata$EVTYPE)
Events <- tolower(Events)
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.7041484
70% of correct event types. Now it is much better, howewer in order to avoid big mistakes in our analysis we have to figure out what this 30% represent. If we look at the list of event types in our database we can see word “tstm” which is abbreviation of thunderstorm. Lets count the number of rows with this abbreviation.
length(grep("tstm", stormdata$EVTYPE))
## [1] 227236
Almost 230 000 rows - thats big part of the data. Now we change this abbreviation to full word thunderstorm.
stormdata$EVTYPE <- sub("tstm", "thunderstorm", stormdata$EVTYPE)
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.9547499
Now 95% of event types are in list of event types. Now we can perform our analysis and look at the results. However it can be possible that these 5% or poorly documented event types responsible for majority of death and injuries. Lets separate our dataset into 2 sets. One with proper event labeling and second with rest of poorely documented events and calculate total harm to human health in both sets.
stormdata.correct <- filter(stormdata, stormdata$EVTYPE %in% Events)
stormdata.messy <- filter(stormdata, !(stormdata$EVTYPE %in% Events))
After separation we can group our datasets by event type and calculate sum of fatalities and injuries across the US for every event type
by.country.correct <- stormdata.correct %>%
group_by(EVTYPE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
arrange(desc(Death)) %>%
print
## # A tibble: 18 x 3
## EVTYPE Death Inj
## <chr> <dbl> <dbl>
## 1 tornado 5633 91346
## 2 excessive heat 1903 6525
## 3 flash flood 978 1777
## 4 heat 937 2100
## 5 lightning 816 5230
## 6 thunderstorm wind 637 8445
## 7 flood 470 6789
## 8 rip current 368 232
## 9 high wind 248 1137
## 10 avalanche 224 170
## 11 winter storm 206 1321
## 12 heavy snow 127 1021
## 13 extreme cold/wind chill 125 24
## 14 high surf 104 156
## 15 blizzard 101 805
## 16 ice storm 89 1975
## 17 wildfire 75 911
## 18 hail 15 1361
by.country.messy <- stormdata.messy %>%
group_by(EVTYPE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
arrange(desc(Death))%>%
print
## # A tibble: 166 x 3
## EVTYPE Death Inj
## <chr> <dbl> <dbl>
## 1 rip currents 204 297
## 2 heat wave 172 379
## 3 extreme cold 162 231
## 4 extreme heat 96 155
## 5 hurricane/typhoon 64 1275
## 6 thunderstorm winds 64 908
## 7 fog 62 734
## 8 hurricane 61 46
## 9 heavy surf/high surf 42 48
## 10 cold 38 48
## # ... with 156 more rows
Now we can see that in the dataset with messy event types the most harmful event is “rip currents”" that should be labelled as “rip current” (as we can see from list of correct event types). Looks like second harmful “heat wave” should be just “heat”, “extreme heat” should be “excessive heat”, “thunderstorm winds” should be “thunderstorm wind”, “heavy surf/high surf” should be “high surf” “high winds” should be “high wind”
“fog” should be “debse fog”and “hurricane” should be “hurricane (typhoon)”. For the last two event type we cant use sub command because it will substitute separate words fog and hurricane in correct event types and we can clearly see that words fog and hurricane can be related to one event only (for for dense for and hurricane for hurricane (typhoon)). “extreme cold” should be “extreme cold/wind chill” but just one word “cold” can be related to both “extreme cold/wind chill” and “Cold/Wind Chill”. We will leave it in messy data.
stormdata$EVTYPE <- sub("rip currents", "rip current", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("heat wave", "heat", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("extreme heat", "excessive heat", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("thunderstorm winds", "thunderstorm wind", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("heavy surf/high surf", "high surf", stormdata$EVTYPE)
stormdata$EVTYPE <- sub("high winds", "high wind", stormdata$EVTYPE)
stormdata[grep("fog", stormdata$EVTYPE), 3] <- "dense fog"
stormdata[grep("hurricane", stormdata$EVTYPE), 3] <- "hurricane (typhoon)"
stormdata[grep("extreme cold", stormdata$EVTYPE), 3] <- "extreme cold/wind chill"
sum(stormdata$EVTYPE %in% Events)/nrow(stormdata)
## [1] 0.9819029
After substitutions our dataset contains only 2% of messy data. Lets separate our initial dataset again and look at the summary of correct and messy datasets
stormdata.correct <- filter(stormdata, stormdata$EVTYPE %in% Events)
stormdata.messy <- filter(stormdata, !(stormdata$EVTYPE %in% Events))
by.country.correct <- stormdata.correct %>%
group_by(EVTYPE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
arrange(desc(Death)) %>%
print
## # A tibble: 46 x 3
## EVTYPE Death Inj
## <chr> <dbl> <dbl>
## 1 tornado 5633 91346
## 2 excessive heat 1999 6680
## 3 heat 1109 2479
## 4 flash flood 978 1777
## 5 lightning 816 5230
## 6 thunderstorm wind 701 9353
## 7 rip current 572 529
## 8 flood 470 6789
## 9 extreme cold/wind chill 287 255
## 10 high wind 283 1439
## # ... with 36 more rows
by.country.messy <- stormdata.messy %>%
group_by(EVTYPE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
filter(Death > quantile(Death, 0.7) | Inj > quantile(Inj, 0.7)) %>%
arrange(desc(Death)) %>%
print
## # A tibble: 147 x 3
## EVTYPE Death Inj
## <chr> <dbl> <dbl>
## 1 cold 38 48
## 2 landslide 38 52
## 3 unseasonably warm and dry 29 0
## 4 urban/sml stream fld 28 79
## 5 winter weather/mix 28 72
## 6 tornadoes, thunderstorm wind, hail 25 0
## 7 wind 23 86
## 8 flash flooding 19 8
## 9 extreme windchill 17 5
## 10 flood/flash flood 17 15
## # ... with 137 more rows
Now our summary for messy data looks like we can make more substitution but it represents little bit less than 2% of data so we can just delete it for our analysis.
In the following analysis we will use only data with correct event labels. It’s obvious that different states have different the most dangerous weather events. So to understand what is the most dangerous evnet in each state it’s worth to summarise storm data by states and show only event types that caused majority of fatalities.
by.country.correct <- stormdata.correct %>%
group_by(EVTYPE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
arrange(desc(Death)) %>%
print
## # A tibble: 46 x 3
## EVTYPE Death Inj
## <chr> <dbl> <dbl>
## 1 tornado 5633 91346
## 2 excessive heat 1999 6680
## 3 heat 1109 2479
## 4 flash flood 978 1777
## 5 lightning 816 5230
## 6 thunderstorm wind 701 9353
## 7 rip current 572 529
## 8 flood 470 6789
## 9 extreme cold/wind chill 287 255
## 10 high wind 283 1439
## # ... with 36 more rows
The plot for top 20 events that caused most damage to human health.
library(ggplot2)
g <- ggplot(by.country.correct[1:20,], aes(x = factor(EVTYPE), y = Death))
g + geom_histogram(stat = "identity", width = 0.5) + coord_flip() + labs(x = "Weather Event", y = "Number of fatalities")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
by.country.correct <- stormdata.correct %>%
group_by(EVTYPE) %>%
summarise(Damage.Property = sum(PROPDMG), Damage.Crop = sum(CROPDMG)) %>%
filter(Damage.Property >0 | Damage.Crop >0) %>%
arrange(desc(Damage.Property), desc(Damage.Crop)) %>%
print
## # A tibble: 45 x 3
## EVTYPE Damage.Property Damage.Crop
## <chr> <dbl> <dbl>
## 1 tornado 3212258. 100019.
## 2 thunderstorm wind 2659192. 194679.
## 3 flash flood 1420125. 179200.
## 4 flood 899938. 168038.
## 5 hail 688693. 579596.
## 6 lightning 603352. 3581.
## 7 high wind 380357. 19043.
## 8 winter storm 132721. 1979.
## 9 heavy snow 122252. 2166.
## 10 wildfire 84459. 4364.
## # ... with 35 more rows
And the plot for top 20 events that caused most damage to property.
g <- ggplot(by.country.correct[1:20,], aes(x = factor(EVTYPE), y = Damage.Property))
g + geom_histogram(stat = "identity", width = 0.5) + coord_flip() + labs(x = "Weather Event", y = "Damage to property")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Summary of the dataframe grouped by states.
by.states <- stormdata.correct %>%
group_by(EVTYPE, STATE) %>%
summarise(Death = sum(FATALITIES), Inj = sum(INJURIES)) %>%
ungroup %>%
group_by(STATE) %>%
filter(Death == max(Death) & (Death > 0 | Inj > 0)) %>%
arrange(desc(Death), desc(Inj)) %>%
print(n = Inf)
## # A tibble: 65 x 4
## # Groups: STATE [63]
## EVTYPE STATE Death Inj
## <chr> <fct> <dbl> <dbl>
## 1 heat IL 653 242
## 2 tornado AL 617 7929
## 3 tornado TX 538 8207
## 4 tornado MS 450 6244
## 5 tornado MO 388 4330
## 6 tornado AR 379 5116
## 7 tornado TN 368 4748
## 8 excessive heat PA 365 357
## 9 tornado OK 296 4829
## 10 rip current FL 271 247
## 11 tornado IN 252 4224
## 12 tornado MI 243 3362
## 13 tornado KS 236 2721
## 14 tornado OH 191 4438
## 15 tornado GA 180 3926
## 16 tornado LA 153 2637
## 17 tornado NC 126 2536
## 18 tornado KY 125 2806
## 19 excessive heat CA 110 260
## 20 tornado MA 108 1758
## 21 tornado MN 99 1976
## 22 tornado WI 96 1601
## 23 excessive heat NY 93 1
## 24 excessive heat MD 88 461
## 25 tornado IA 81 2208
## 26 flash flood AZ 62 150
## 27 tornado SC 59 1314
## 28 tornado NE 54 1158
## 29 heat NV 54 0
## 30 lightning CO 48 260
## 31 avalanche CO 48 35
## 32 avalanche UT 44 25
## 33 excessive heat NJ 39 300
## 34 high wind WA 38 75
## 35 rip current GU 38 55
## 36 tornado VA 36 914
## 37 flash flood PR 34 2
## 38 avalanche AK 33 17
## 39 tsunami AS 32 129
## 40 high surf HI 28 32
## 41 tornado ND 25 326
## 42 flash flood WV 24 10
## 43 avalanche WY 23 21
## 44 excessive heat DC 20 316
## 45 high wind OR 19 50
## 46 tornado SD 18 452
## 47 flash flood NM 16 18
## 48 avalanche ID 16 9
## 49 lightning MT 9 30
## 50 marine thunderstorm wind AM 9 30
## 51 excessive heat DE 7 51
## 52 thunderstorm wind NH 7 33
## 53 marine thunderstorm wind AN 7 3
## 54 lightning ME 6 70
## 55 high wind CT 6 11
## 56 marine strong wind PZ 5 3
## 57 flood VT 4 5
## 58 high surf RI 3 0
## 59 high surf VI 3 0
## 60 marine strong wind LM 2 1
## 61 marine thunderstorm wind LM 2 1
## 62 marine strong wind LS 1 0
## 63 marine strong wind PH 1 0
## 64 marine thunderstorm wind GM 1 0
## 65 high surf MH 0 1
Now we can clarly see that for a lot of the states tornado is the most harmful event type but for big part of the states another types of weather events cause majority of fatalities.
Now we can calculate damage to crops and property that was caused by different weather events.
by.states <- stormdata.correct %>%
group_by(EVTYPE, STATE) %>%
summarise(Damage.Property = sum(PROPDMG), Damage.Crop = sum(CROPDMG)) %>%
ungroup %>%
group_by(STATE) %>%
filter(Damage.Property == max(Damage.Property) & (Damage.Property >0 | Damage.Crop >0)) %>%
arrange(STATE) %>%
print(n = Inf)
## # A tibble: 66 x 4
## # Groups: STATE [66]
## EVTYPE STATE Damage.Property Damage.Crop
## <chr> <fct> <dbl> <dbl>
## 1 flood AK 10808. 8
## 2 tornado AL 167816. 1653.
## 3 waterspout AM 5102 0
## 4 marine thunderstorm wind AN 189 0
## 5 tornado AR 119550. 388.
## 6 hurricane (typhoon) AS 610 602
## 7 thunderstorm wind AZ 42352 165
## 8 high wind CA 32545. 1483.
## 9 tornado CO 18348. 21
## 10 thunderstorm wind CT 9457. 0
## 11 thunderstorm wind DC 2080. 0
## 12 lightning DE 5914 0
## 13 tornado FL 159753. 148.
## 14 tornado GA 151349. 3792.
## 15 marine thunderstorm wind GM 395. 0
## 16 tropical storm GU 1779 241
## 17 flash flood HI 3162. 585
## 18 thunderstorm wind IA 154564. 22951.
## 19 thunderstorm wind ID 10562. 36
## 20 tornado IL 128909. 1297.
## 21 tornado IN 104686. 516
## 22 tornado KS 143210. 5482.
## 23 thunderstorm wind KY 78842. 1472.
## 24 tornado LA 131476. 2844
## 25 marine thunderstorm wind LE 30 0
## 26 marine thunderstorm wind LM 1210. 0
## 27 marine thunderstorm wind LO 70 0
## 28 marine thunderstorm wind LS 400 0
## 29 flood MA 19044. 0
## 30 thunderstorm wind MD 21883. 158.
## 31 flood ME 34350. 0
## 32 high surf MH 5 0
## 33 thunderstorm wind MI 72839. 2293.
## 34 tornado MN 74033. 4205.
## 35 tornado MO 132160. 2286
## 36 tornado MS 187841. 24964.
## 37 thunderstorm wind MT 16767. 5371.
## 38 tornado NC 96280. 2039.
## 39 tornado ND 49739. 4742
## 40 tornado NE 110774. 14259.
## 41 flood NH 18406. 200
## 42 tornado NJ 14079. 200
## 43 flash flood NM 15792. 1384
## 44 high wind NV 8288. 100.
## 45 thunderstorm wind NY 85845. 1629
## 46 thunderstorm wind OH 165661. 1004
## 47 tornado OK 165168. 607.
## 48 high wind OR 6072. 5
## 49 thunderstorm wind PA 67253. 30.5
## 50 marine high wind PK 31 0
## 51 flash flood PR 16270. 1581
## 52 marine strong wind PZ 76 0
## 53 thunderstorm wind RI 2514. 0
## 54 tornado SC 56730. 271.
## 55 tornado SD 33512. 639.
## 56 marine thunderstorm wind SL 15 0
## 57 tornado TN 112161. 681
## 58 tornado TX 283097. 4866.
## 59 flash flood UT 9271. 824.
## 60 tornado VA 49360. 158
## 61 flash flood VI 561. 0
## 62 flash flood VT 19145. 739
## 63 high wind WA 20308. 93.2
## 64 tornado WI 111640. 11748.
## 65 flash flood WV 74637. 1250
## 66 tornado WY 7657. 0