##Synopsis##
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. From these data, we found that TORNADO is the event most harmful with respect to population health, while FLOOD is the event with the greatest economic consequences.
##Loading Raw Data##
From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, we obtained the data in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
storm_data <- read.csv("data.csv", header = TRUE, sep = ",", na.strings = "")
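Because the source file is bzip2-compressed, R can also read it without a separate decompression step. The sketch below illustrates this on a small temporary file. The toy data frame, its values, and the names `toy`, `tmp`, and `toy_back` are made up for illustration and are not part of the NOAA data.

```r
# Toy stand-in for the storm data (hypothetical values, not NOAA records).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  INJURIES = c(15, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() accepts a bzfile() connection, so no manual unzip is needed.
toy_back <- read.csv(bzfile(tmp), header = TRUE, na.strings = "")
print(toy_back)
```

Reading through a connection keeps only the compressed copy on disk, which may be convenient for a file of this size.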
After loading, we check the dimensions of the dataset and look at its first few rows.
dim(storm_data)
## [1] 902297 37
head(storm_data[, 1:13])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME
## 1 0 <NA> <NA> <NA> <NA>
## 2 0 <NA> <NA> <NA> <NA>
## 3 0 <NA> <NA> <NA> <NA>
## 4 0 <NA> <NA> <NA> <NA>
## 5 0 <NA> <NA> <NA> <NA>
## 6 0 <NA> <NA> <NA> <NA>
Then we check the variables and their types.
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 34 levels " N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
## $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_DATE : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_TIME : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
## $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
## $ WFO : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
## $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
## $ ZONENAMES : Factor w/ 25111 levels " "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436780 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
##Data Processing##
###Which Type of Events are Most Harmful with Respect to Population Health###
We will concentrate on two particular variables, FATALITIES and INJURIES. We first group the data by the type of event, EVTYPE, and sum each variable within every group.
data_INJ <- aggregate(storm_data["INJURIES"], list(EVTYPE = storm_data$EVTYPE), sum)
data_FAT <- aggregate(storm_data["FATALITIES"], list(EVTYPE = storm_data$EVTYPE), sum)
data_PH <- merge(data_INJ, data_FAT, by = "EVTYPE", all = TRUE)
summary(data_PH)
## EVTYPE INJURIES FATALITIES
## HIGH SURF ADVISORY: 1 Min. : 0.0 Min. : 0.00
## COASTAL FLOOD : 1 1st Qu.: 0.0 1st Qu.: 0.00
## FLASH FLOOD : 1 Median : 0.0 Median : 0.00
## LIGHTNING : 1 Mean : 142.7 Mean : 15.38
## TSTM WIND : 1 3rd Qu.: 0.0 3rd Qu.: 0.00
## TSTM WIND (G45) : 1 Max. :91346.0 Max. :5633.00
## (Other) :979
The summary shows that there are 985 distinct event types in total (the six listed plus 979 others). We made a scatterplot to see which events have the most impact on both injuries and fatalities.
To keep the plot easy to read, I plot only the events whose total number of injuries is larger than the mean (142.7).
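The grouping and filtering steps can be sketched on a toy data frame. This is a minimal illustration, not the code used in this report: it uses the formula interface of `aggregate()` to sum both columns in one call, and it computes the mean directly. The data values and the names `toy`, `toy_PH`, and `toy_top` are hypothetical.

```r
# Hypothetical records: several rows per event type.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL", "FLOOD"),
  INJURIES   = c(15, 5, 2, 0, 1),
  FATALITIES = c(1, 0, 2, 0, 0)
)

# Sum injuries and fatalities per event type in a single call.
# Note: the formula interface drops rows containing NA by default.
toy_PH <- aggregate(cbind(INJURIES, FATALITIES) ~ EVTYPE, data = toy, FUN = sum)

# Keep only the event types whose injury total exceeds the mean total.
toy_top <- toy_PH[toy_PH$INJURIES > mean(toy_PH$INJURIES), ]
print(toy_top)
```

With these toy values only TORNADO remains, since its 20 injuries are the only total above the mean of the three groups.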
###Which Type of Events have the Greatest Economic Consequences###
To address this question, we select the variables PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. PROPDMG and CROPDMG give the numeric magnitude of the damage to property and crops, while PROPDMGEXP and CROPDMGEXP give the multiplier for the corresponding value, such as K for thousands and M for millions. To keep the values comparable, we restrict the analysis to records where the multiplier is M for both property and crop damage, so all damage figures below are in millions of dollars. Note that this leaves out records with other multipliers, including B, which also appears in the data.
data.sub <- subset(storm_data, select = c(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
data.sub1 <- subset(data.sub, data.sub$PROPDMGEXP %in% "M")
data.sub2 <- subset(data.sub1, data.sub1$CROPDMGEXP %in% "M")
head(data.sub2)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 187581 HURRICANE ERIN 25.0 M 1 M
## 187583 HURRICANE OPAL 48.0 M 4 M
## 188204 FLOODING 50.0 M 5 M
## 188205 HEAVY RAIN 50.0 M 5 M
## 191345 WINTER STORM 5.0 M 5 M
## 192339 HIGH WINDS 5.5 M 7 M
The subset above keeps only the records where both PROPDMGEXP and CROPDMGEXP are equal to M. Next, I summed the property and crop damage for each event type.
data_PRO <- aggregate(data.sub2["PROPDMG"], list(EVTYPE = data.sub2$EVTYPE), sum)
data_CRO <- aggregate(data.sub2["CROPDMG"], list(EVTYPE = data.sub2$EVTYPE), sum)
data_ECO <- merge(data_PRO, data_CRO, by = "EVTYPE", all = TRUE)
Then I merged the two summaries into one data frame for plotting.
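The K and M multipliers described above can also be converted to numeric factors, which would let records with different multipliers be compared on one scale. The sketch below is an alternative to the M-only subset, not the method used in this report. The toy values, the `multiplier` lookup table, and the column `PROPDMG_USD` are hypothetical, and reading B as billions is an assumption.

```r
# Hypothetical records with mixed multiplier codes.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "HURRICANE"),
  PROPDMG    = c(25, 2.5, 10, 1.5),
  PROPDMGEXP = c("K", "M", "K", "B"),
  stringsAsFactors = FALSE
)

# Assumed meaning of the codes: K = thousands, M = millions, B = billions.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code by name; as.character() guards against factor columns,
# which would otherwise index by integer code. Unknown codes become NA.
toy$PROPDMG_USD <- toy$PROPDMG * unname(multiplier[as.character(toy$PROPDMGEXP)])

# Total damage in dollars per event type.
toy_total <- aggregate(PROPDMG_USD ~ EVTYPE, data = toy, FUN = sum)
print(toy_total)
```

Codes outside the lookup table yield NA rather than a guessed value, so they would need an explicit decision before being included in any totals.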
##Results##
###Injuries and Fatalities due to severe weather events###
library(ggplot2)
g <- ggplot(data_PH[data_PH$INJURIES > mean(data_PH$INJURIES), ], aes(INJURIES, FATALITIES, label = EVTYPE))
g + geom_point(aes(size = INJURIES)) + geom_text(aes(size = INJURIES), colour = "red") + scale_size(range = c(1, 4)) + labs(title = "Injuries and Fatalities due to severe weather events")
According to the plot, TORNADO is the event most harmful with respect to population health. To see this clearly, I chose the top six events and listed their injury and fatality totals below. The rows are ordered by injuries first, then by fatalities.
head(data_PH[order(data_PH$INJURIES, data_PH$FATALITIES, decreasing = TRUE), ])
## EVTYPE INJURIES FATALITIES
## 834 TORNADO 91346 5633
## 856 TSTM WIND 6957 504
## 170 FLOOD 6789 470
## 130 EXCESSIVE HEAT 6525 1903
## 464 LIGHTNING 5230 816
## 275 HEAT 2100 937
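The ordering rule can be illustrated on a small example. In the sketch below, two event types are tied on injuries, so the second key (fatalities) decides their order. The values and the names `toy_PH` and `ranked` are hypothetical.

```r
# Hypothetical totals; HEAT and FLOOD are tied on INJURIES.
toy_PH <- data.frame(
  EVTYPE     = c("HEAT", "FLOOD", "LIGHTNING", "HAIL"),
  INJURIES   = c(10, 10, 7, 3),
  FATALITIES = c(4, 9, 2, 0)
)

# order() sorts by the first key and uses later keys only to break ties.
ranked <- toy_PH[order(toy_PH$INJURIES, toy_PH$FATALITIES, decreasing = TRUE), ]
print(ranked)
```

Here FLOOD is placed ahead of HEAT because it has more fatalities, even though both have the same number of injuries.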
###Economic losses due to severe weather phenomena###
library(ggplot2)
g <- ggplot(data_ECO, aes(PROPDMG, CROPDMG, label = EVTYPE))
g + geom_point(aes(size = PROPDMG)) + geom_text(aes(size = PROPDMG), colour = "red") + scale_size(range = c(1, 4)) + labs(title = "Economic losses due to severe weather phenomena")
According to the plot, FLOOD is the event with the greatest economic consequences. To see this clearly, I chose the top six events and listed their property and crop damage totals (in millions) below. The rows are ordered by property damage first, then by crop damage.
head(data_ECO[order(data_ECO$PROPDMG, data_ECO$CROPDMG, decreasing = TRUE), ])
## EVTYPE PROPDMG CROPDMG
## 5 FLOOD 3136.64 2487.21
## 17 HURRICANE 3105.87 1879.31
## 20 HURRICANE/TYPHOON 2460.75 656.64
## 3 FLASH FLOOD 1614.40 880.56
## 9 HAIL 1372.87 550.15
## 14 HIGH WIND 1150.09 481.50