Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This analysis will focus on which types of events are most harmful with respect to population health and which types of events have the greatest economic consequences.
Loading the data directly from the bz2-compressed CSV file
stormdata<- read.csv("stormdata.csv.bz2")
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
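As a minimal sketch of why no separate decompression step is needed: read.csv opens the path through a file connection, and R detects bzip2 compression when reading. The toy data frame and the temporary file below are made up for illustration and are not part of the NOAA data.

```r
# Toy example (hypothetical data): write a small CSV through a bzip2
# connection, then read it back with a plain read.csv() call.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0))
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently when reading the file
toy_back <- read.csv(tmp)
print(toy_back)
```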
To help answer which event types are most harmful to population health, I created a column that adds fatalities and injuries together, then summed the values by event type.
# Create a column that adds fatalities and injuries together, then sum it by event type for the whole U.S. #
stormdata$tothealth<- stormdata$FATALITIES + stormdata$INJURIES
df1 <- aggregate(tothealth~EVTYPE, stormdata, sum)
head(df1)
## EVTYPE tothealth
## 1 HIGH SURF ADVISORY 0
## 2 COASTAL FLOOD 0
## 3 FLASH FLOOD 0
## 4 LIGHTNING 0
## 5 TSTM WIND 0
## 6 TSTM WIND (G45) 0
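The same two steps can be sketched on a toy data frame to make the grouping visible. The rows below are invented for illustration; only the column names follow the storm data. Note that aggregate returns its rows ordered by EVTYPE, not by total, which is why the first rows of df1 can all be zero.

```r
# Toy example (hypothetical rows, real column names)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 3),
  INJURIES   = c(14, 2, 0, 1)
)

# Row-wise casualties, then one sum per event type
toy$tothealth <- toy$FATALITIES + toy$INJURIES
toy_sum <- aggregate(tothealth ~ EVTYPE, toy, sum)
print(toy_sum)
```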
Next, look at the structure of the data set to make sure the format is what is needed, and check whether there are any NA’s in the columns of interest.
# look to see what the structure of the dataframe is
str(stormdata)
## 'data.frame': 902297 obs. of 38 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## $ tothealth : num 15 0 2 2 2 6 1 0 15 0 ...
# Check whether there are any NA values in the columns of interest #
sum(is.na(stormdata$tothealth))
## [1] 0
sum(is.na(stormdata$EVTYPE))
## [1] 0
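The NA check matters because the formula method of aggregate drops rows containing NA by default (na.action = na.omit), so missing values would silently disappear from the totals. A small sketch on made-up data, assuming the same column names:

```r
# Toy example (hypothetical rows): one event type has a missing total
toy <- data.frame(
  EVTYPE    = c("TORNADO", "FLOOD", "HEAT"),
  tothealth = c(15, NA, 4)
)

# Count NA values for several columns in one call
na_counts <- colSums(is.na(toy[, c("EVTYPE", "tothealth")]))
print(na_counts)

# With the default na.action = na.omit, the FLOOD row is dropped
# from the result rather than reported as NA
toy_sum <- aggregate(tothealth ~ EVTYPE, toy, sum)
print(toy_sum)
```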
Sort the data frame from highest to lowest and reduce the data set to the top 20 event types by total number of fatalities and injuries.
# Order by total fatalities and injuries (descending) and keep the top 20 event types #
df2<-df1[order(-df1$tothealth), ]
df2<-df2[1:20, ]
df2
## EVTYPE tothealth
## 834 TORNADO 96979
## 130 EXCESSIVE HEAT 8428
## 856 TSTM WIND 7461
## 170 FLOOD 7259
## 464 LIGHTNING 6046
## 275 HEAT 3037
## 153 FLASH FLOOD 2755
## 427 ICE STORM 2064
## 760 THUNDERSTORM WIND 1621
## 972 WINTER STORM 1527
## 359 HIGH WIND 1385
## 244 HAIL 1376
## 411 HURRICANE/TYPHOON 1339
## 310 HEAVY SNOW 1148
## 957 WILDFIRE 986
## 786 THUNDERSTORM WINDS 972
## 30 BLIZZARD 906
## 188 FOG 796
## 585 RIP CURRENT 600
## 955 WILD/FOREST FIRE 557
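The top 20 lists TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS as separate event types, so thunderstorm wind casualties are split across three rows. One possible cleanup, which this analysis does not perform, is to normalize the labels before aggregating. The function name clean_evtype and the choice of which labels to merge are assumptions made for illustration; applying it would change the totals shown above.

```r
# Hypothetical helper: normalize case and whitespace, then merge the
# thunderstorm wind variants under a single label.
clean_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))
  x <- gsub("^TSTM WIND.*$", "THUNDERSTORM WIND", x)
  x <- gsub("^THUNDERSTORM WINDS?$", "THUNDERSTORM WIND", x)
  x
}

labels <- c(" tstm wind", "THUNDERSTORM WINDS", "TSTM WIND (G45)", "TORNADO")
print(clean_evtype(labels))
```

The cleaned labels could then be stored in a new column and used in place of EVTYPE in the aggregate call.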
To answer the second question, which event type has the greatest economic impact, we need additional columns that express crop damage and property damage in dollars so that the two can be added together.
The propdmg and cropdmg columns hold values in different units, and the unit for each row is given by the propdmgexp and cropdmgexp columns. These values must be converted to the same unit before they can be added.
For property damage, we multiply propdmg by 1,000 when propdmgexp is “K”, by 1,000,000 when it is “M”, and by 1,000,000,000 when it is “B”; any other code is treated as 0. We then check whether any NA’s were introduced.
# Convert property damage to a dollar value based on the property damage exp column #
stormdata$PROPDMG2 <- ifelse(stormdata$PROPDMG>0, ifelse(stormdata$PROPDMGEXP == "K", stormdata$PROPDMG*1000,
ifelse(stormdata$PROPDMGEXP == "M", stormdata$PROPDMG*1000000,
ifelse(stormdata$PROPDMGEXP == "B", stormdata$PROPDMG*1000000000, 0))), 0)
# Check to see if NA's were introduced #
sum(is.na(stormdata$PROPDMG2))
## [1] 0
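The nested ifelse calls can also be expressed as a lookup table, which is easier to read and extend. This is a sketch of an alternative, not the code used in this report: the names exp_multiplier and to_dollars are hypothetical, and unlike the property damage code above it also accepts lowercase codes by upper-casing them first.

```r
# Hypothetical lookup: exponent code -> dollar multiplier
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

to_dollars <- function(amount, exp_code) {
  # Unknown codes (for example "?" or "") find no match and give NA
  mult <- exp_multiplier[toupper(as.character(exp_code))]
  # Treat unrecognized codes as 0, matching the fallback used above
  mult[is.na(mult)] <- 0
  unname(amount * mult)
}

print(to_dollars(c(25, 2.5, 1, 3, 4), c("K", "m", "B", "?", "")))
```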
We will follow the same process for cropdmg, also accepting the lowercase codes “k” and “m”.
stormdata$CROPDMG2 <- ifelse(stormdata$CROPDMG>0, ifelse(stormdata$CROPDMGEXP == "K" | stormdata$CROPDMGEXP == "k", stormdata$CROPDMG*1000,
ifelse(stormdata$CROPDMGEXP == "M" | stormdata$CROPDMGEXP == "m", stormdata$CROPDMG*1000000,
ifelse(stormdata$CROPDMGEXP == "B", stormdata$CROPDMG*1000000000, 0))),0)
# Check to see if NA's were introduced #
sum(is.na(stormdata$CROPDMG2))
## [1] 0
With every branch reading the CROPDMGEXP column, no NA’s are introduced. Entries whose exponent code is not K, M, or B, such as “?” or a bare digit, do not correspond to any unit designation and are counted as 0. As a safeguard, any row with an NA in CROPDMG2 is still dropped before aggregating.
# Drop rows with NA values in CROPDMG2
data1 <- stormdata[!(is.na(stormdata$CROPDMG2)),]
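To see which exponent codes fall outside the K/M/B scheme, one option is to tabulate the codes of rows that report a positive amount but carry an unrecognized code. The sketch below uses invented rows; the names recognized, is_odd, and odd_rows are hypothetical.

```r
# Toy example (hypothetical rows, real column names)
toy <- data.frame(
  CROPDMG    = c(10, 5, 0, 3, 7),
  CROPDMGEXP = c("K", "?", "", "0", "B")
)

recognized <- c("K", "M", "B")

# Positive damage with a code that maps to no unit
is_odd <- toy$CROPDMG > 0 &
  !(toupper(as.character(toy$CROPDMGEXP)) %in% recognized)
odd_rows <- toy[is_odd, ]

# How often each unrecognized code occurs
print(table(as.character(odd_rows$CROPDMGEXP)))
```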
Create a column that sums property damage and crop damage, then sum it by event type.
# Create a column that sums property damage and crop damage, then aggregate by event type #
data1$totdmg <- data1$PROPDMG2 + data1$CROPDMG2
data2<- aggregate(totdmg~EVTYPE, data1, sum)
Sort the data frame from highest to lowest total damage and limit the data to the top 20 values.
# order dataframe by highest total damage and then take top 20 values
data3<-data2[order(-data2$totdmg), ]
data3<-data3[1:20,]
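Indexing with 1:20 works here because there are far more than 20 event types. As a hedged aside, head is a slightly safer way to take the top rows, since indexing past the end of a data frame pads the result with NA rows. The toy values below are made up for illustration.

```r
# Toy example (hypothetical values): only three event types
toy <- data.frame(EVTYPE = c("HAIL", "FLOOD", "TORNADO"),
                  totdmg = c(19, 150, 57))

sorted <- toy[order(-toy$totdmg), ]

padded <- sorted[1:5, ]    # 5 rows, the last two filled with NA
top    <- head(sorted, 5)  # 3 rows, no padding

print(padded)
print(top)
```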
Now that the data has been processed, we can answer the first question: across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
We will use a bar plot, sorted from highest to lowest, to see which event type has the highest total.
# bar plot of total fatalities and injuries by event type #
barplot(df2$tothealth/1000, names.arg = df2$EVTYPE, col = "lightblue", las = 2, ylab = "Total Injuries and Fatalities in Thousands", main = "HEALTH BY EVENT", cex.names = .8)
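With las = 2 the event names are drawn perpendicular to the axis, and long names can run off the bottom of the figure under the default margins. A minimal sketch of one way to handle this, assuming base graphics: enlarge the bottom margin with par(mar = ...) before plotting and restore it afterwards. The three rows reuse totals from the table above; the margin values are a guess that may need tuning.

```r
# Three rows taken from the top-20 table above
toy <- data.frame(
  EVTYPE    = c("TORNADO", "EXCESSIVE HEAT", "THUNDERSTORM WIND"),
  tothealth = c(96979, 8428, 1621),
  stringsAsFactors = FALSE
)

# Widen the bottom margin (order: bottom, left, top, right)
old_par <- par(mar = c(10, 5, 4, 2) + 0.1)
mids <- barplot(toy$tothealth / 1000, names.arg = toy$EVTYPE,
                col = "lightblue", las = 2, cex.names = 0.8,
                ylab = "Total Injuries and Fatalities in Thousands")
par(old_par)  # restore the previous margins
```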
The bar plot shows that tornadoes cause by far the highest number of injuries and fatalities.
To answer the second question (across the United States, which types of events have the greatest economic consequences?), we will use a bar plot of total property and crop damage by event type, in billions of dollars.
barplot(data3$totdmg/1000000000, names.arg = data3$EVTYPE, col = rainbow(20), las = 2, ylab = "Total Damage in Billions", main = "TOTAL DAMAGE BY EVENT")
The bar plot shows that flooding causes the highest combined property and crop damage in dollars across the U.S.