Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

In this report we aim to answer two questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

To answer these questions we use weather event data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The data is subsetted to the fields required for answering the questions, and the evtype, propdmgexp and cropdmgexp columns are cleaned for analysis. From these data, we found that across the US, tornadoes are the leading cause of harm to population health, in terms of both injuries and deaths, and floods are the leading cause of economic loss, measured as combined property and crop damage. Among the five costliest event types, floods also lead on property damage and on crop damage taken separately.

Loading the Raw Data

We obtained the data from

https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

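        fileurl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        # mode = "wb" keeps the binary .bz2 archive intact on Windows
        download.file(fileurl, "full_data.csv.bz2", mode = "wb")
        # read.csv reads the .bz2 file directly; under R >= 4.0, add stringsAsFactors = TRUE to get the factor columns shown below
        full_data <- read.csv("full_data.csv.bz2", header = TRUE)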
        
        dim(full_data)
## [1] 902297     37
        str(full_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
        head(full_data,3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
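
Re-running the report repeats the download every time. A minimal sketch of a guard that fetches the archive only when it is not already on disk; the helper name `fetch_once` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download `url`
# to `dest` only when `dest` does not exist yet, and return the path.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary archives such as .bz2 intact on Windows
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Usage with the file name from this report (not run here, needs network access):
# full_data <- read.csv(fetch_once(fileurl, "full_data.csv.bz2"), header = TRUE)
```

The function returns the local path in both cases, so it can be passed straight to read.csv.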

Processing the data

The data required for the analysis was subsetted from the full data. The fields considered are “BGN_DATE”, “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. The names are converted to lower case for ease of use, evtype is converted to lower-case character for easy handling, and bgn_date is converted to date format.

        work_data <- full_data[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
        names(work_data) <- tolower(names(work_data))
        work_data$evtype <- tolower(as.character(work_data$evtype))
        work_data$bgn_date <- as.Date(as.character(work_data$bgn_date),format = "%m/%d/%Y")
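
The date conversion works even though the raw strings carry a time component, because as.Date parses only as far as the format string requires. A small sketch using two of the BGN_DATE values shown in the head() output; the variable names are illustrative.

```r
# The raw BGN_DATE strings carry a dummy time component, e.g. "4/18/1950 0:00:00".
raw_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00")

# as.Date() reads only as far as the format requires; the trailing
# " 0:00:00" is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
format(parsed)  # year-month-day strings
```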

The data contains 898 event types, which we have to bring down to a manageable level. We remove observations that have a value of 0 in all of “fatalities”, “injuries”, “propdmg” and “cropdmg”.

        library(magrittr)
## Warning: package 'magrittr' was built under R version 3.6.3
        library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
        library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.3
         work_data <- work_data %>% mutate("check" = fatalities + injuries + propdmg + cropdmg)
         work_data <- work_data[work_data$check != 0,]
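
Assuming the four measures are never negative (the summaries in this report show a minimum of 0), their sum is zero only when every one of them is zero. The filter can be illustrated on a toy data frame; the three rows below are invented for illustration, and base R is used so the snippet runs without dplyr.

```r
# Toy rows, invented for illustration only.
toy <- data.frame(
  evtype     = c("tornado", "hail", "flood"),
  fatalities = c(0, 0, 1),
  injuries   = c(15, 0, 0),
  propdmg    = c(25, 0, 0),
  cropdmg    = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Same rule as the "check" column: keep a row if any measure is non-zero.
check <- with(toy, fatalities + injuries + propdmg + cropdmg)
kept  <- toy[check != 0, ]
kept$evtype
```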

Now the data contains 447 event types. We also group some similar events together, using the event types listed on page 6 of the documentation at https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf .

        grep("^(thun|tun|tstm)",unique(work_data$evtype),value = TRUE)
##  [1] "tstm wind"                      "thunderstorm winds"            
##  [3] "thunderstorm wind"              "thunderstorm wins"             
##  [5] "thunderstorm winds lightning"   "thunderstorm winds/hail"       
##  [7] "thunderstorm winds hail"        "thunderstorm winds/funnel clou"
##  [9] "thunderstorms winds"            "thunderstorms"                 
## [11] "thunderstorm windss"            "thunderstorm winds53"          
## [13] "thunderstorm winds 13"          "thundersnow"                   
## [15] "tstm wind 55"                   "thundertorm winds"             
## [17] "thunderstorm  winds"            "thunderstorms wind"            
## [19] "tunderstorm wind"               "thunderstorm wind/lightning"   
## [21] "thunderstorm"                   "thunderstorm wind g50"         
## [23] "thunderstorm winds."            "thunderstorm wind g55"         
## [25] "thunderstorm wind g60"          "thunderstorm winds g60"        
## [27] "tstm wind g58"                  "thunderstorm wind 65mph"       
## [29] "thunderstorm wind/ trees"       "thunderstorm wind/awning"      
## [31] "thunderstorm wind 98 mph"       "thunderstorm wind trees"       
## [33] "thunderstorm wind 60 mph"       "thunderstorm winds 63 mph"     
## [35] "thunderstorm wind/ tree"        "thunderstorm damage to"        
## [37] "thunderstorm wind 65 mph"       "thunderstorm wind."            
## [39] "thunderstorm hail"              "thunderstorm windshail"        
## [41] "thunderstorm winds and"         "tstm wind damage"              
## [43] "thunderstorm wind g52"          "thunderestorm winds"           
## [45] "thunderstorm winds/flooding"    "thundeerstorm winds"           
## [47] "thunerstorm winds"              "thunderstorm wind/hail"        
## [49] "thunderstormw"                  "tstm winds"                    
## [51] "tstmw"                          "tstm wind 65)"                 
## [53] "thunderstorm winds/ flood"      "thunderstormwinds"             
## [55] "thunderstrom wind"              "tstm wind/hail"                
## [57] "tstm wind (g45)"                "tstm wind 40"                  
## [59] "tstm wind 45"                   "tstm wind (41)"                
## [61] "tstm wind (g40)"                "tstm wind and lightning"       
## [63] "tstm wind  (g45)"               "tstm wind (g35)"               
## [65] "tstm wind g45"                  "thunderstorm wind (g40)"
        work_data$evtype[grep("^(thun|tun|tstm)",work_data$evtype)] <- "thunderstorm wind" #tstm is abbreviation for thunderstorm as per the documentation mentioned above
        
        grep("^torn",unique(work_data$evtype),value = TRUE)
## [1] "tornado"                    "tornado f0"                
## [3] "tornadoes, tstm wind, hail" "tornado f3"                
## [5] "torndao"                    "tornado f1"                
## [7] "tornado f2"                 "tornadoes"
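        work_data$evtype[grep("torn",work_data$evtype)] <- "tornado" #this pattern is not anchored with ^, so any event type containing "torn" is merged, not only the types listed above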
        
        grep("^hurri",unique(work_data$evtype),value = TRUE)
##  [1] "hurricane opal/high winds"  "hurricane erin"            
##  [3] "hurricane opal"             "hurricane"                 
##  [5] "hurricane-generated swells" "hurricane emily"           
##  [7] "hurricane gordon"           "hurricane felix"           
##  [9] "hurricane edouard"          "hurricane/typhoon"
        work_data$evtype[grep("^hurri",work_data$evtype)] <- "hurricane(typhoon)"
        
        grep("^hai",unique(work_data$evtype),value = TRUE)
##  [1] "hail"        "hail/winds"  "hail/wind"   "hail 150"    "hail 075"   
##  [6] "hail 100"    "hail 125"    "hail 200"    "hail 0.75"   "hail 75"    
## [11] "hail 175"    "hail 275"    "hail 450"    "hailstorm"   "hail damage"
        work_data$evtype[grep("^(hai)",work_data$evtype)] <- "hail"
        
       grep("\\sheat",unique(work_data$evtype),value = TRUE)
## [1] "excessive heat"         "extreme heat"           "drought/excessive heat"
## [4] "record heat"            "record/excessive heat"
       work_data$evtype[grep("\\sheat",work_data$evtype)] <- "excessive heat"
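
The sequence of grep assignments above can also be written as a single rule table applied in order. This is a sketch of a possible refactoring, not the original code: the patterns and target names are the ones used in this report, while `rules` and `regroup` are hypothetical names.

```r
# Patterns are the ones used in this report; the rule table itself is a
# hypothetical refactoring, not the original code.
rules <- c(
  "^(thun|tun|tstm)" = "thunderstorm wind",
  "torn"             = "tornado",
  "^hurri"           = "hurricane(typhoon)",
  "^hai"             = "hail",
  "\\sheat"          = "excessive heat"
)

regroup <- function(evtype, rules) {
  # Apply the rules in order, mirroring the sequential assignments above.
  for (pattern in names(rules)) {
    evtype[grepl(pattern, evtype)] <- rules[[pattern]]
  }
  evtype
}

regrouped <- regroup(
  c("tstm wind", "torndao", "hurricane opal", "hail 075", "record heat", "flood"),
  rules
)
regrouped
```

Keeping the patterns in one place makes it easier to see which event types each rule can capture before it is applied to the full data.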

After this processing, we have 343 event types. Now we look at summary statistics for “fatalities” and “injuries”.

        summary(work_data$fatalities)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0595   0.0000 583.0000
        summary(work_data$injuries)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.5519    0.0000 1700.0000

Now we are ready to answer the first question:
1. Across the United States, which types of events are most harmful with respect to population health?

To answer this question we pick the top five event types by total number of people injured or killed.

        hharm_data <- work_data %>%
                select(evtype, injuries, fatalities) %>%
                group_by(evtype) %>%
                summarise(injured = sum(injuries), death = sum(fatalities)) %>%
                mutate(impact_lives = injured + death) %>%
                arrange(desc(impact_lives)) %>%
                top_n(n = 5, wt = impact_lives)
        hharm_data
## # A tibble: 5 x 4
##   evtype            injured death impact_lives
##   <chr>               <dbl> <dbl>        <dbl>
## 1 tornado             91407  5661        97068
## 2 thunderstorm wind    9509   712        10221
## 3 excessive heat       6730  2020         8750
## 4 flood                6789   470         7259
## 5 lightning            5230   816         6046
        hharm_data$evtype <- factor(hharm_data$evtype, levels=c("tornado", "thunderstorm wind","excessive heat", "flood","lightning"))
        
        ggplot(data = hharm_data,aes(x=evtype,y=impact_lives/1000))+ geom_col(fill = "blue") + labs(x = "events type",y = "Impacted lives ('000)",title = "Top five events - Health impact")
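
The group, sum and rank steps behind the table above can be sketched in base R on a toy data set. The observations are invented for illustration, and the names `totals` and `top2` are hypothetical.

```r
# Toy observations, invented for illustration only.
toy <- data.frame(
  evtype     = c("tornado", "tornado", "flood", "hail", "lightning"),
  injuries   = c(15, 2, 3, 0, 1),
  fatalities = c(1, 0, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Sum injuries and fatalities per event type.
totals <- aggregate(cbind(injuries, fatalities) ~ evtype, data = toy, FUN = sum)
totals$impact_lives <- totals$injuries + totals$fatalities

# Order by combined impact and keep the top two.
top2 <- head(totals[order(-totals$impact_lives), ], 2)
top2$evtype
```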

We would also like to know which of these five event types caused the most deaths and the most injuries, taken separately.

        hharm_data %>% select(evtype,death) %>% arrange(desc(death)) %>% top_n(n=1,wt = death)
## # A tibble: 1 x 2
##   evtype  death
##   <fct>   <dbl>
## 1 tornado  5661
        hharm_data %>% select(evtype,injured) %>%arrange(desc(injured)) %>% top_n(n=1,wt = injured)
## # A tibble: 1 x 2
##   evtype  injured
##   <fct>     <dbl>
## 1 tornado   91407

To answer the question on events with the greatest economic consequences, we need to look at propdmg and propdmgexp for the amount of property damage, and at cropdmg and cropdmgexp for the amount of crop damage.

The field propdmg has 0 NAs and the field cropdmg has 0 NAs. The summary statistics of propdmg and cropdmg are examined.

        summary(work_data$propdmg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    5.00   42.75   25.00 5000.00
        summary(work_data$cropdmg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.411   0.000 990.000

The values of propdmg grouped by propdmgexp are examined.

        work_data %>% select(propdmg,propdmgexp) %>% group_by(propdmgexp) %>% summarise(count = n(),sum = sum(propdmg))
## # A tibble: 16 x 3
##    propdmgexp  count        sum
##    <fct>       <int>      <dbl>
##  1 ""          11585      527. 
##  2 "-"             1       15  
##  3 "+"             5      117  
##  4 "0"           210     7108. 
##  5 "2"             1       12  
##  6 "3"             1       20  
##  7 "4"             4       14.5
##  8 "5"            18      210. 
##  9 "6"             3       65  
## 10 "7"             3       82  
## 11 "B"            40      276. 
## 12 "h"             1        2  
## 13 "H"             6       25  
## 14 "K"        231428 10735292. 
## 15 "m"             7       38.9
## 16 "M"         11320   140694.

Looking at the above table, 240 observations have a numeric exponent code. We set the exponent to 0 (a multiplier of 1) for all cases other than H, h, B, b, M, m, K, k. For those letter codes we substitute the corresponding power of ten for easy calculation:
B, b will be 9
M, m will be 6
K, k will be 3
H, h will be 2
We then create a column “propcost” = propdmg * (10^propdmgexp) indicating the total cost due to property damage.

        work_data$propdmgexp <- tolower(as.character(work_data$propdmgexp))
        work_data$propdmgexp[grep("[0-9]|[+?-]",work_data$propdmgexp)] <- "0"
        work_data$propdmgexp[grep("[h]",work_data$propdmgexp)] <- "2"
        work_data$propdmgexp[grep("k",work_data$propdmgexp)] <- "3"
        work_data$propdmgexp[grep("m",work_data$propdmgexp)] <- "6"
        work_data$propdmgexp[grep("b",work_data$propdmgexp)] <- "9"
        
        work_data$propdmgexp <- as.integer(work_data$propdmgexp)
        work_data$propdmgexp[is.na(work_data$propdmgexp)] <- 0
        work_data <- work_data %>% mutate(propcost = propdmg * (10^propdmgexp))
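
The same mapping can be expressed as a lookup from exponent code to power of ten. This is a minimal sketch under the rules stated above; `exp_power` and `damage_cost` are hypothetical names, and the input values are made up to exercise each case.

```r
# Powers of ten for the letter codes, as listed in this report.
exp_power <- c(h = 2, k = 3, m = 6, b = 9)

# Hypothetical helper: turn a damage figure and its exponent code into dollars.
# Any code that is not h, k, m or b (blank, "?", "+", digits, ...) gets power 0.
damage_cost <- function(dmg, code) {
  power <- unname(exp_power[tolower(as.character(code))])
  power[is.na(power)] <- 0
  dmg * 10^power
}

costs <- damage_cost(c(25, 2.5, 1, 5, 7, 2), c("K", "M", "B", "", "?", "h"))
costs
```

For example, a propdmg of 25 with code K becomes 25 * 10^3 = 25,000 dollars, while a value with a blank or unrecognised code is kept at face value.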

Now we examine the values for cropdmg grouped by cropdmgexp

        work_data %>% select(cropdmg,cropdmgexp) %>% group_by(cropdmgexp) %>% summarise(count = n(),sum = sum(cropdmg))
## # A tibble: 8 x 3
##   cropdmgexp  count       sum
##   <fct>       <int>     <dbl>
## 1 ""         152664      11  
## 2 "?"             6       0  
## 3 "0"            17     260  
## 4 "B"             7      13.6
## 5 "k"            21     436  
## 6 "K"         99932 1342956. 
## 7 "m"             1      10  
## 8 "M"          1985   34141.

Looking at the above table, 17 observations have a numeric exponent code (“0”) and 6 have “?”; these can be ignored. We set the exponent to 0 (a multiplier of 1) for all cases other than B, b, M, m, K, k. For those letter codes we substitute the corresponding power of ten for easy calculation:
B, b will be 9
M, m will be 6
K, k will be 3
We then create a column “cropcost” = cropdmg * (10^cropdmgexp) indicating the total cost due to crop damage.

        work_data$cropdmgexp <- tolower(as.character(work_data$cropdmgexp))
        work_data$cropdmgexp[grep("[0-9]|[?]",work_data$cropdmgexp)] <- "0"
        
        work_data$cropdmgexp[grep("k",work_data$cropdmgexp)] <- "3"
        work_data$cropdmgexp[grep("m",work_data$cropdmgexp)] <- "6"
        work_data$cropdmgexp[grep("b",work_data$cropdmgexp)] <- "9"
        
        work_data$cropdmgexp <- as.integer(work_data$cropdmgexp)
        work_data$cropdmgexp[is.na(work_data$cropdmgexp)] <- 0
        work_data <- work_data %>% mutate(cropcost = cropdmg * (10^cropdmgexp)) 

Now we combine propcost and cropcost into ecocost, which we will use to answer the second question:
2. Across the United States, which types of events have the greatest economic consequences?

We pick the top five event types by total damage cost.

        work_data <- work_data %>% mutate(ecocost = propcost + cropcost)

        cost_data <- subset(work_data,subset = work_data$ecocost > 0 ,select = c("evtype","propcost","cropcost","ecocost"))
        
        cost_data <- cost_data %>%
                group_by(evtype) %>%
                summarise(propcost_USD_Billion = sum(propcost)/(10^9),
                          cropcost_USD_Billion = sum(cropcost)/(10^9),
                          ecocost_USD_Billion = sum(ecocost)/(10^9)) %>%
                arrange(desc(ecocost_USD_Billion)) %>%
                top_n(n = 5, wt = ecocost_USD_Billion)
        cost_data
## # A tibble: 5 x 4
##   evtype            propcost_USD_Billion cropcost_USD_Billion ecocost_USD_Billi~
##   <chr>                            <dbl>                <dbl>              <dbl>
## 1 flood                            145.              5.66                  150. 
## 2 hurricane(typhoo~                 84.8             5.52                   90.3
## 3 tornado                           58.6             0.417                  59.0
## 4 storm surge                       43.3             0.000005               43.3
## 5 hail                              16.0             3.03                   19.0
        cost_data$evtype <- factor(cost_data$evtype, levels=c("flood","hurricane(typhoon)","tornado", "storm surge", "hail"))
        ggplot(data = cost_data,aes(evtype,ecocost_USD_Billion)) + geom_col(fill = "brown")+ labs(x = "events",y = "Total damage (B USD)",title = "Top five events - Total Damage Cost")

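We would also like to know which of these five event types caused the greatest property loss and the greatest crop loss, taken separately. Because cost_data holds only the five event types with the largest combined cost, an event type outside that set could still rank higher on one measure alone.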

        cost_data %>% select(evtype,propcost_USD_Billion) %>% arrange(desc(propcost_USD_Billion)) %>% top_n(n=1,wt = propcost_USD_Billion) 
## # A tibble: 1 x 2
##   evtype propcost_USD_Billion
##   <fct>                 <dbl>
## 1 flood                  145.
        cost_data %>% select(evtype,cropcost_USD_Billion) %>% arrange(desc(cropcost_USD_Billion)) %>% top_n(n=1,wt = cropcost_USD_Billion) 
## # A tibble: 1 x 2
##   evtype cropcost_USD_Billion
##   <fct>                 <dbl>
## 1 flood                  5.66
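
The two lookups above run on cost_data, which at this point holds only the top five event types by combined cost. A minimal sketch of the alternative, ranking each measure over all event types before any truncation. The toy figures and event names are invented purely to show that the two approaches can disagree, and `by_type` and `top_by` are hypothetical names.

```r
# Toy per-event totals in USD billions, invented for illustration only.
by_type <- data.frame(
  evtype   = c("alpha", "beta", "gamma", "delta"),
  propcost = c(100, 50, 1, 20),
  cropcost = c(5, 4, 9, 0),
  stringsAsFactors = FALSE
)
by_type$ecocost <- by_type$propcost + by_type$cropcost

# Hypothetical helper: event type with the largest value in one column.
top_by <- function(df, column) df$evtype[which.max(df[[column]])]

# Ranking crop cost over all event types.
crop_all <- top_by(by_type, "cropcost")

# Ranking crop cost after keeping only the top two by combined cost.
top2 <- head(by_type[order(-by_type$ecocost), ], 2)
crop_top2 <- top_by(top2, "cropcost")

c(all_types = crop_all, top_two_only = crop_top2)
```

In this toy table the event with the largest crop cost has a small combined cost, so it drops out once the data is cut to the top rows. Running the per-measure ranking on the full per-event summary avoids that dependence on the cut.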

Result

Across the US, tornadoes are the leading cause of harm to population health, in terms of both the number of people injured and the number of deaths. Floods are the leading cause of economic loss as measured by combined property and crop damage; among the five costliest event types, floods also account for the largest property damage and the largest crop damage.