Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

This analysis will focus on which types of events are most harmful with respect to population health and which types of events have the greatest economic consequences.

Data Processing

Read the data from the bz2-compressed CSV file.

stormdata<- read.csv("stormdata.csv.bz2")
head(stormdata)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
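As a small, self-contained illustration of this step (toy data and a temporary file, not the NOAA file): read.csv opens bzip2-compressed files transparently, so no manual decompression is needed. The stringsAsFactors argument is set explicitly here because its default changed in R 4.0.0; the Factor columns in the str() output further down suggest this report was run with the older default.

```r
# Toy example: the file name and contents are made up for illustration.
tmp <- tempfile(fileext = ".csv.bz2")

# Write a tiny bzip2-compressed CSV through a bzfile() connection.
con <- bzfile(tmp, open = "w")
write.csv(
  data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0)),
  con,
  row.names = FALSE
)
close(con)

# read.csv() detects the compression and reads the file directly.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)
```

For the full storm file, the same call works with the real path; passing colClasses to read only the needed columns is an optional speed-up, not something the original analysis does.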

To help answer which event type is most harmful to population health, I created a column that adds fatalities and injuries together and then summed that column by event type.

# Create a column that adds fatalities and injuries together, then sum it by event type for the whole U.S. #
stormdata$tothealth<- stormdata$FATALITIES + stormdata$INJURIES


df1 <- aggregate(tothealth~EVTYPE, stormdata, sum)
head(df1)
##                  EVTYPE tothealth
## 1    HIGH SURF ADVISORY         0
## 2         COASTAL FLOOD         0
## 3           FLASH FLOOD         0
## 4             LIGHTNING         0
## 5             TSTM WIND         0
## 6       TSTM WIND (G45)         0
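The formula interface used above can be seen on a toy data frame. This is only a sketch with made-up rows; the column names mirror the storm data, and the values are not taken from it.

```r
# Made-up rows that mimic the columns used in the analysis.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(1, 0, 2, 0),
  INJURIES   = c(14, 3, 0, 0)
)

# Combined health impact per row.
toy$tothealth <- toy$FATALITIES + toy$INJURIES

# One row per event type, with tothealth summed within each group.
totals <- aggregate(tothealth ~ EVTYPE, data = toy, FUN = sum)
totals
```

Note that aggregate groups on the exact label, so two spellings of the same event would end up in separate rows.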

Next, look at the structure of the data set to make sure the format is what is needed, and check whether any NAs are present in the columns of interest.

# look to see what the structure of the dataframe is
str(stormdata)
## 'data.frame':    902297 obs. of  38 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ tothealth : num  15 0 2 2 2 6 1 0 15 0 ...
# Check whether there are any NA values in the columns of interest #

  sum(is.na(stormdata$tothealth))
## [1] 0
  sum(is.na(stormdata$EVTYPE))
## [1] 0

Sort the data frame from highest to lowest and reduce it to the top 20 event types by total number of health incidents.

# Order by total health incidents (descending) and keep the top 20 event types #
    
 
  df2<-df1[order(-df1$tothealth), ]
  df2<-df2[1:20, ]
  df2
##                 EVTYPE tothealth
## 834            TORNADO     96979
## 130     EXCESSIVE HEAT      8428
## 856          TSTM WIND      7461
## 170              FLOOD      7259
## 464          LIGHTNING      6046
## 275               HEAT      3037
## 153        FLASH FLOOD      2755
## 427          ICE STORM      2064
## 760  THUNDERSTORM WIND      1621
## 972       WINTER STORM      1527
## 359          HIGH WIND      1385
## 244               HAIL      1376
## 411  HURRICANE/TYPHOON      1339
## 310         HEAVY SNOW      1148
## 957           WILDFIRE       986
## 786 THUNDERSTORM WINDS       972
## 30            BLIZZARD       906
## 188                FOG       796
## 585        RIP CURRENT       600
## 955   WILD/FOREST FIRE       557
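The table lists TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate rows, and the str() output shows an EVTYPE level with leading spaces, so some events are split across several labels. The three wind rows alone sum to 7,461 + 1,621 + 972 = 10,054. A minimal sketch of normalizing labels before aggregating follows; normalize_evtype and its merge rules are hypothetical choices made for illustration, not an official NOAA mapping and not part of the original analysis.

```r
# Hypothetical helper: the merge rules below are illustrative assumptions.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))   # drop padding, unify case
  x <- gsub("\\s+", " ", x)               # collapse repeated whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)    # expand the common abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # plural to singular
  x
}

labels <- c("   HIGH SURF ADVISORY", "TSTM WIND",
            "THUNDERSTORM WINDS", "Thunderstorm  Wind")
cleaned <- normalize_evtype(labels)
cleaned
```

Applying such a function to EVTYPE before the aggregate call would change the ranking below the top row, so any merge rule should be checked against the event type documentation first.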

To answer the second question, which event type has the greatest economic impact, we need additional columns that add crop damage and property damage together.

The PROPDMG and CROPDMG columns hold values in different units, determined by the PROPDMGEXP and CROPDMGEXP columns. These values need to be converted to the same unit (dollars) before they can be added together.

For the first column, we multiply PROPDMG according to whether PROPDMGEXP contains a “K” for thousand, an “M” for million or a “B” for billion; rows with any other exponent are set to 0. We then check whether any NAs were introduced.

#  Convert the property damage column to a dollar value based on the property damage exponent column #
   stormdata$PROPDMG2 <- ifelse(stormdata$PROPDMG>0, ifelse(stormdata$PROPDMGEXP == "K", stormdata$PROPDMG*1000, 
                               ifelse(stormdata$PROPDMGEXP == "M", stormdata$PROPDMG*1000000, 
                                      ifelse(stormdata$PROPDMGEXP == "B", stormdata$PROPDMG*1000000000, 0))), 0) 
  
  
  # Check to see if NA's were introduced #
  sum(is.na(stormdata$PROPDMG2))
## [1] 0
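The nested ifelse calls can also be written as a lookup table of multipliers. The following is a sketch under stated assumptions, not the method used in this analysis: exp_multiplier and to_dollars are hypothetical names, and unlike the code above this version also accepts lowercase exponents.

```r
# Hypothetical lookup table: exponent letter -> dollar multiplier.
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: converts a damage value and its exponent code to dollars.
# Assumption: any exponent other than K, M or B (in either case) counts as 0.
to_dollars <- function(value, exponent) {
  mult <- exp_multiplier[toupper(as.character(exponent))]
  mult[is.na(mult)] <- 0      # unmatched codes such as "", "?" or digits
  unname(value * mult)
}

dollars <- to_dollars(c(25, 2.5, 3, 7), c("K", "m", "B", "?"))
dollars
```

Because a single function handles both columns, a misspelled column name in one branch cannot silently change the result for only some exponents.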

We follow the same process for CROPDMG, this time also accepting the lowercase exponents “k” and “m”.

  stormdata$CROPDMG2 <- ifelse(stormdata$CROPDMG>0, ifelse(stormdata$CROPDMGEXP == "K" | stormdata$CROPDMGEXP == "k", stormdata$CROPDMG*1000, 
                                                           ifelse(stormdata$CROPDMGEXP == "M" | stormdata$CROPDMGEXP == "m", stormdata$CROPDMG*1000000, 
                                                                  ifelse(stormdata$CrOPDMGEXP == "B", stormdata$CROPDMG*1000000000, 0))),0)
 # Check to see if NA's were introduced #
sum(is.na(stormdata$CROPDMG2))
## [1] 22

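Looks like we have introduced 22 NAs. Inspecting the data, some of these rows have entries such as “?” or numbers that do not match any unit designation. The NAs themselves, however, come from the innermost test above, which refers to stormdata$CrOPDMGEXP: column names in R are case-sensitive, so that column does not exist, and every row with positive crop damage and an exponent other than K, k, M or m (including “B”) becomes NA instead of being converted. We delete these rows from the data set here; with the name corrected to CROPDMGEXP they would be kept, and any “B” entries would be counted in billions.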

#  Drop rows with NA values in CROPDMG2
  
  data1 <- stormdata[!(is.na(stormdata$CROPDMG2)),]
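The NA count above can be reproduced on a toy data frame. This sketch uses made-up rows; it shows what the conversion returns when the innermost test refers to CrOPDMGEXP, a column name that does not exist, and what it returns with the name spelled CROPDMGEXP.

```r
# Made-up rows: one exponent of each kind, plus a zero-damage row.
toy <- data.frame(
  CROPDMG    = c(5, 2, 0, 1),
  CROPDMGEXP = c("K", "B", "", "M"),
  stringsAsFactors = FALSE
)

# Column names are case-sensitive, so this lookup returns NULL.
is.null(toy$CrOPDMGEXP)

# Same nesting as above, with the misspelled name in the innermost test.
with_typo <- ifelse(toy$CROPDMG > 0,
  ifelse(toy$CROPDMGEXP == "K", toy$CROPDMG * 1e3,
    ifelse(toy$CROPDMGEXP == "M", toy$CROPDMG * 1e6,
      ifelse(toy$CrOPDMGEXP == "B", toy$CROPDMG * 1e9, 0))), 0)

# Corrected column name: the "B" row is converted instead of becoming NA.
corrected <- ifelse(toy$CROPDMG > 0,
  ifelse(toy$CROPDMGEXP == "K", toy$CROPDMG * 1e3,
    ifelse(toy$CROPDMGEXP == "M", toy$CROPDMG * 1e6,
      ifelse(toy$CROPDMGEXP == "B", toy$CROPDMG * 1e9, 0))), 0)

sum(is.na(with_typo))
sum(is.na(corrected))
```

In the toy data only the “B” row turns into NA with the misspelled name, which suggests that the rows dropped above may include billion-dollar crop entries rather than only invalid ones.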

Create a column that adds property damage and crop damage together, then sum it by event type.

 # create a column that sums property damage and crop damage, then aggregate by event type #
  data1$totdmg <- data1$PROPDMG2 + data1$CROPDMG2
  data2<- aggregate(totdmg~EVTYPE, data1, sum)

Sort the data frame by highest total damage and limit the data to the top 20 values.

# order dataframe by highest total damage and then take top 20 values
  
  data3<-data2[order(-data2$totdmg), ]
  data3<-data3[1:20,]

Results

Now that the data has been processed we can answer the question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

We use a bar plot, sorted in descending order, to see which event type ranks highest.

# bar plot of combined injuries and fatalities by event type #
  barplot(df2$tothealth/1000, names.arg = df2$EVTYPE, col = "lightblue", las = 2, ylab = "Total Injuries and Fatalities in Thousands", main = "HEALTH BY EVENT", cex.names = .8)

The bar plot shows that tornadoes cause by far the highest combined number of injuries and fatalities: 96,979, compared with 8,428 for excessive heat, the next-highest event type.
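With las = 2 the event names are drawn vertically, and long names can run off the bottom of the figure under the default margins. A minimal sketch of widening the bottom margin with par(mar = ...) follows; it uses the top three rows of the table above, the margin values are arbitrary choices, and pdf(NULL) is used only so that the example runs without opening a window or writing a file.

```r
# Top three rows copied from the table above.
top <- data.frame(
  EVTYPE    = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND"),
  tothealth = c(96979, 8428, 7461)
)

pdf(NULL)                                    # null device: nothing is written
old_par <- par(mar = c(10, 5, 4, 2) + 0.1)   # extra room below for the labels

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(top$tothealth / 1000,
                names.arg = top$EVTYPE,
                col = "lightblue", las = 2, cex.names = 0.8,
                ylab = "Total Injuries and Fatalities in Thousands",
                main = "HEALTH BY EVENT")

par(old_par)                                 # restore the previous margins
invisible(dev.off())
```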

To answer the second question (across the United States, which types of events have the greatest economic consequences?), we use a bar plot of the total dollar amount of property and crop damage by event type, in billions.

  barplot(data3$totdmg/1000000000, names.arg = data3$EVTYPE, col = rainbow(20), las = 2, ylab = "Total Damage in Billions", main = "TOTAL DAMAGE BY EVENT")

The bar plot shows that flooding causes the highest combined property and crop damage in dollars across the U.S.