Data from Department of Commerce • National Oceanic & Atmospheric Administration • National Weather Service were be used to analyse which kinds of weather event cause the most harm within the USA. About 900,000 observations were consolidated and analyzed regarding the damage they cause. With respect to health the most harmful events are tornados, thunderstorm winds and exessive heat. With respect to economics the most harmful events are flood, hurricane and tornado. Since tornado is one of the three most harmful weather events with respect to health and economics, respectively, precautionary measures against tornados should be adopted.
From the coursera course project we obtained data on characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We read in the data from the csv-file included in the zip archive. The data is a comma-separated-value file compressed via the bzip2 algorithm.
destfile = "repdata_data_StormData.csv.bz2"
if(!file.exists(destfile)){
download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2", destfile)
}
if(!exists("storm0")){
storm0 <- read.csv("repdata_data_StormData.csv.bz2")
}
After reading in the data we got an overview of the data frame and its size (there are 902,297).
dim(storm0)
## [1] 902297 37
str(storm0)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Missing values are a common problem with environmental data and so we check to se what proportion of the observations are missing (i.e. coded as NA).
mean(is.na(storm0)) ## Are missing values important here?
## [1] 0.05229737
Because the proportion of missing values is relatively low (0.0522974), we choose to ignore missing values for now.
We changed the column names to lower case for convenience.
storm1 <- storm0 ## copy data frame
names(storm1) <- tolower(names(storm1))
We looked at the column evtype. There are 0 different event types recorded. For further analysis we copied the data frame and uncapitalised the values.
storm1$evtype <- tolower(storm1$evtype) ## convert into small letters
length(unique(storm1$evtype)) ## How many different event types - recorded?
## [1] 898
We created a vector with the event types described within the documentation of the dataset.
events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
events <- tolower(events)
length(unique(events)) ## How many different event types - defined?
## [1] 48
Since there are far too many event types recorded within the dataset, we looked for recorded events which are not described in the Storm Data Event Table and edited the text variables in order to consolidate different spellings.
eventnoise <- storm1$evtype %in% events
eventnoisedata <- unique(storm1$evtype[!eventnoise])
length(unique(eventnoisedata)) ## How many different event types - not defined?
## [1] 852
There are noticeable patterns for a wrong documentation or misspelling of some events. We corrected these items. We analysed the frequency of the observed events in order to decide how to deal with wrong documented events.
countevtype1 <- count(storm1, evtype)
countevtype1[with(countevtype1, order(-n, evtype)),] ## occurrence of eventtypes
## # A tibble: 898 x 2
## evtype n
## <chr> <int>
## 1 hail 288661
## 2 tstm wind 219942
## 3 thunderstorm wind 82564
## 4 tornado 60652
## 5 flash flood 54277
## 6 flood 25327
## 7 thunderstorm winds 20843
## 8 high wind 20214
## 9 lightning 15754
## 10 heavy snow 15708
## # ... with 888 more rows
According to their frequency we corrected the observed events.
storm2 <- storm1
storm2$evtype <- gsub("tstm", "thunderstorm", storm2$evtype) #abbreviation for thunderstorm
storm2$evtype <- gsub("^(thunder.*)(.*)", "thunderstorm wind", storm2$evtype) #misspelling thunderstorm
storm2$evtype <- gsub("^((?!marine).*)((?<!marine).)thunde.* win.*", "thunderstorm wind", storm2$evtype, perl = TRUE) #misspelling thunderstorm
storm2$evtype <- gsub("wind.+", "wind", storm2$evtype) #misspelling wind
storm2$evtype <- gsub("^(wind)$", "high wind", storm2$evtype) #recategorise wind
storm2$evtype <- gsub("\\bfld\\b", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("\\bflo*ding\\b", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("(.*)flash flood(.*)", "flash flood", storm2$evtype) #misspelling flash flood
storm2$evtype <- gsub("(.*)coastal flood(.*)", "coastal flood", storm2$evtype) #misspelling coastal flood
storm2$evtype <- gsub("(.*)lakeshore flood(.*)", "lakeshore flood", storm2$evtype) #misspelling lakeshore flood
storm2$evtype <- gsub("^(flood.*)(.*)", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("^((?!flash|coastal|lakeshore).*)((?<!flash|coastal|lakeshore).)flood(.*)", "flood", storm2$evtype, perl = TRUE) #misspelling flood
storm2$evtype <- gsub("(.*)wild(.*)fire(.*)", "wildfire", storm2$evtype) #misspelling wildfire
storm2$evtype <- gsub("(.*)winter weather(.*)", "winter weather", storm2$evtype) #misspelling winter weather
storm2$evtype <- gsub("(.*)extreme cold(.*)", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("^(snow)$", "heavy snow", storm2$evtype) #misspelling snow
storm2$evtype <- gsub("\\blandslide\\b", "avalanche", storm2$evtype) #misspelling avalanche
storm2$evtype <- gsub("^(fog)$", "dense fog", storm2$evtype) #misspelling dense fog
storm2$evtype <- gsub("^(cold/wind)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("current.+", "current", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(storm surge)$", "storm surge/tide", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(freezing rain)$", "frost/freeze", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(extreme wind)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme wind
storm2$evtype <- gsub("(.*)(high|heavy surf)(.*)", "high surf", storm2$evtype) #misspelling high surf
storm2$evtype <- gsub("^((hurricane).*)", "hurricane (typhoon)", storm2$evtype) #misspelling hurricane
storm2$evtype <- gsub("^(record warmth)$", "heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(record heat)$", "excessive heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(heat wave)$", "excessive heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(cold)$", "cold/wind chill", storm2$evtype) #misspelling cold
storm2$evtype <- gsub("^(record cold)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("cloud.+", "cloud", storm2$evtype) #misspelling cloud
storm2$evtype <- gsub("^(frost)$", "frost/freeze", storm2$evtype) #misspelling frost
storm2$evtype <- gsub("^(ice)$", "frost/freeze", storm2$evtype) #misspelling freeze
storm2$evtype <- gsub("waterspout.+", "waterspout", storm2$evtype) #misspelling waterspout
storm2$evtype <- gsub("rain.+", "rain", storm2$evtype) #misspelling rain
storm2$evtype <- gsub("lake effect", "lake-effect", storm2$evtype) #misspelling lake-effect
storm2$evtype <- gsub("(.*)tornado(.*)", "tornado", storm2$evtype) #misspelling tornado
storm2$evtype <- gsub("^(freez.*)(.*)", "frost/freeze", storm2$evtype) #misspelling freeze
storm2$evtype <- gsub("^(hail.*)(.*)", "hail", storm2$evtype) #misspelling hail
storm2$evtype <- gsub("^(blizzard.*)(.*)", "thunderstorm wind", storm2$evtype) #misspelling blizzard
Some recorded events could not be classified according to the storm data event table. These data were classified as “others”.
storm2$evtype[which(!storm2$evtype %in% events)] <- c("others")
Due to the replacements the frequency of the observed events changed significantly
countevtype2 <- count(storm2, evtype)
countevtype2[with(countevtype2, order(-n, evtype)),] ## occurrence of eventtypes
## # A tibble: 44 x 2
## evtype n
## <chr> <int>
## 1 thunderstorm wind 327522
## 2 hail 288775
## 3 tornado 60686
## 4 flash flood 55668
## 5 flood 29576
## 6 high surf 23604
## 7 heavy snow 16325
## 8 lightning 15754
## 9 marine thunderstorm wind 11987
## 10 heavy rain 11785
## # ... with 34 more rows
Processing data with respect to population health
We extracted and rearranged the columns ‘evtype’, ‘fatalities’ and ‘injuries’ in order to get a tidy dataset.
varsHealth <- c("evtype", "fatalities", "injuries") #define relevant columns
stormHealth <- storm2[varsHealth] #extract relevant columns
stormHealth1 <- stormHealth %>% group_by(evtype) %>%
summarise(damages = sum(fatalities, injuries, na.rm = TRUE),
fatalities = sum(fatalities, na.rm = TRUE),
injuries = sum(injuries, na.rm = TRUE))
#calculate fatalities and injuries per eventtype
stormHealth1 <- stormHealth1[with(stormHealth1, order(-damages, evtype)),]
#order eventtypes according to their damages
head(stormHealth1)
## # A tibble: 6 x 4
## evtype damages fatalities injuries
## <chr> <dbl> <dbl> <dbl>
## 1 tornado 96997 5633 91364
## 2 thunderstorm wind 11154 838 10316
## 3 excessive heat 9031 2077 6954
## 4 flood 7386 512 6874
## 5 lightning 6046 816 5230
## 6 heat 3037 937 2100
We rearranged the calculated dataset in order to get a tidy dataset.
stormHealth2 <- stormHealth1[,c(1,3,4)] #extract relevant columns
stormHealth2 <- stormHealth2 %>% pivot_longer(-evtype, names_to = "harmtype",
values_to = "QUANTITY")
#change columns 'fatalities' and 'injuries' to rows
stormHealth2 <- as.data.frame(stormHealth2)
#define dataset as data frame
head(stormHealth2)
## evtype harmtype QUANTITY
## 1 tornado fatalities 5633
## 2 tornado injuries 91364
## 3 thunderstorm wind fatalities 838
## 4 thunderstorm wind injuries 10316
## 5 excessive heat fatalities 2077
## 6 excessive heat injuries 6954
Processing data with respect to economic consequences
For analysing economic consequences we prepared a separate data frame.
varsEconomics <- c("evtype", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp") #define relevant columns
stormEconomics <- storm2[varsEconomics] #extract relevant columns
stormEconomics1 <- stormEconomics
stormEconomics1$propdmgexp <- tolower(stormEconomics1$propdmgexp) ## convert into small letters
stormEconomics1$cropdmgexp <- tolower(stormEconomics1$cropdmgexp) ## convert into small letters
Since the amount of damage is coded in two columns, we calculated the correct amount and the sum of property and crop damages.
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="k"]<- 1000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="m"]<- 1000000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="b"]<- 1000000000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp==""]<- 1
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="k"]<- 1000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="m"]<- 1000000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="b"]<- 1000000000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp==""]<- 1
stormEconomics1$propdmgcalc <- with(stormEconomics1, (propdmg*propdmgfactor))
stormEconomics1$cropdmgcalc <- with(stormEconomics1, (cropdmg*cropdmgfactor))
For furhter analysis we extracted and rearranged the columns ‘evtype’ and the calculated economic damages,respectively.
stormEconomics2 <- stormEconomics1 %>% group_by(evtype) %>%
summarise(damages = sum(propdmgcalc, cropdmgcalc, na.rm = TRUE),
propdmgcalc = sum(propdmgcalc, na.rm = TRUE),
cropdmgcalc = sum(cropdmgcalc, na.rm = TRUE))
stormEconomics2 <- stormEconomics2[with(stormEconomics2, order(-damages, evtype)),]
#order eventtypes according to their damages
We rearranged the calculated dataset in order to get a tidy dataset.
stormEconomics3 <- stormEconomics2[,c(1,3,4)] #extract relevant columns
stormEconomics3 <- stormEconomics3 %>% pivot_longer(-evtype, names_to = "harmtype",
values_to = "QUANTITY")
#change columns 'propdmgcalc' and 'cropdmgcalc' to rows
stormEconomics3 <- as.data.frame(stormEconomics3) #define dataset as data frame
head(stormEconomics3)
## evtype harmtype QUANTITY
## 1 flood propdmgcalc 150239461307
## 2 flood cropdmgcalc 10856294050
## 3 hurricane (typhoon) propdmgcalc 84656180010
## 4 hurricane (typhoon) cropdmgcalc 5505292800
## 5 tornado propdmgcalc 56941931733
## 6 tornado cropdmgcalc 414961360
We plot stacked bar char and ordered the displayed bars according to their overall damage in oder to visualize a ranking of the most harmful types of events with respect to population health.
ggplot(stormHealth2, aes(fill=harmtype, y=QUANTITY, x=evtype)) +
geom_bar(postition="stack", stat="identity") +
scale_x_discrete(limits=stormHealth2$evtype)+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title="Most Harmful Types of Events", subtitle="with Respect to Population Health",
x="Event Type", y="Harm with Respect \n to Population Health \n [in people]") +
scale_fill_discrete(name="Type of Damage",
breaks=c("fatalities", "injuries"),
labels=c("Fatalities", "Injuries"))
We plot stacked bar char and ordered the displayed bars according to their overall damage in oder to visualize a ranking of the most harmful types of events with respect to economics.
ggplot(stormEconomics3, aes(fill=harmtype, y=QUANTITY, x=evtype)) +
geom_bar(postition="stack", stat="identity") +
scale_x_discrete(limits=stormEconomics3$evtype) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title="Most Harmful Types of Events", subtitle="with Respect to Economics",
x="Event Type", y="Harm with Respect \n to Economics \n [in US$]") +
scale_fill_discrete(name="Type of Damage",
breaks=c("cropdmgcalc", "propdmgcalc"),
labels=c("Crop Damage", "Property Damage"))