Synopsis

This data project is a basic exploration of the economic and public health problems caused by storms across the U.S. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The following report is based on the NOAA storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

An exploratory analysis is done to show which types of events are most harmful with respect to population health and which have the greatest economic consequences. After a brief description of the storm database, the data are processed to reduce the data set to the well-known types of weather events (see the NOAA documentation), keeping only the most relevant years in which these were recorded and excluding events that did not cause any damage. In the last section, the results are shown.

Storm database description

This database documents the following phenomena:

  1. The occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce;
  2. Rare, unusual, weather phenomena that generate media attention;
  3. Other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The storm database includes \(902297\) recorded events. Note that several of these entries may correspond to the same event type but were recorded under a different name by the storm data preparer. According to the NOAA storm database description there are a total of \(48\) different event types (table 2.1.1 in the Storm Data preparation publication).

The NOAA storm database used for this analysis can be downloaded through the “Storm data” link on the Coursera web site. The data set contains 37 variables. Since I am only interested in the health (fatalities or injuries) and property or crop damages across the US, the analysis focuses on the following variables:

  • BGN_DATE: When the event took place
  • STATE: State abbreviation (includes territories and minor islands)
  • EVTYPE: Type of weather event
  • FATALITIES and INJURIES
  • PROPDMG, CROPDMG: Property and crop damages, respectively
  • PROPDMGEXP, CROPDMGEXP: Exponent (power of 10) applied to the corresponding damage value

The begin and end times and the coordinates are read as well, for the derived variables added later.

Data processing

1. Downloading and reading the database

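# packages used throughout the report
library(plyr)      # revalue, mapvalues
library(dplyr)     # %>%, mutate, group_by, summarise
library(ggplot2)
library(knitr)     # kable
library(reshape2)  # melt
library(tidyr)     # gather
## 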
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
setClass("myDate")
setAs("character","myDate", function(from) as.Date(from, format="%m/%d/%Y %H:%M:%S",origin="1950-01-01") )
data<-read.csv("repdata%2Fdata%2FStormData.csv.bz2",sep=",",header=TRUE,colClasses =
                       c("NULL","myDate","character","character",rep("NULL",2),"factor",
                         "factor",rep("NULL",3),"myDate","character",
                         rep("NULL",9),rep("numeric",3),"factor","numeric","factor",
                         rep("NULL",3),rep("numeric",4),rep("NULL",2)))
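
The date handling above relies on a custom class name passed through colClasses. The following is a minimal sketch of that idiom on a tiny inline CSV; the class name usDateTime and the two sample rows are made up for illustration and are not part of the storm data.

```r
library(methods)  # provides setClass/setAs

# Hypothetical class name, used only as a label in colClasses
setClass("usDateTime")
setAs("character", "usDateTime",
      function(from) as.Date(from, format = "%m/%d/%Y %H:%M:%S"))

# Two made-up rows standing in for the real file
csv_text <- "BGN_DATE,EVTYPE\n4/18/1950 0:00:00,TORNADO\n7/12/1995 0:00:00,HEAT"

# read.csv coerces the first column through the setAs method defined above
toy <- read.csv(text = csv_text, colClasses = c("usDateTime", "character"))
str(toy)
```

Columns given the class "NULL" in colClasses are skipped entirely, which is how the call above keeps only a subset of the 37 variables.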

2. Choose recorded events beginning in 1993.

According to the NOAA web site (http://www.ncdc.noaa.gov/stormevents/details.jsp?type=collection), a wide range of event types only started to be recorded in 1993. For that reason, this exploratory analysis compares only events from 1993 onward.

To show this, let us explore data before and after 1993, comparing the number of different levels in the factor variable “EVTYPE” before and after 1993, and 1996, another reference date given by the NOAA website.

Total number of distinct event types in the raw data set:

str(data$EVTYPE)
##  Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...

From 1950 to 1993

df5093<-data[data$BGN_DATE<="1993-01-01" & data$BGN_DATE>="1950-01-01","EVTYPE"]
df5093<-factor(df5093)
str(df5093)
##  Factor w/ 11 levels "AVALANCHE","FLOOD",..: 10 10 10 10 10 10 10 10 10 10 ...

From 1993 to 1996

df9396<-data[data$BGN_DATE<="1996-01-01" & data$BGN_DATE>="1993-01-01","EVTYPE"]
df9396<-factor(df9396)
str(df9396)
##  Factor w/ 601 levels "?","AGRICULTURAL FREEZE",..: 116 395 291 413 595 413 280 140 470 470 ...

Therefore, I will include the events starting from 1993. Note that there are still many more than 48 event types, which is the number reported in the Storm Data publication. Many of these were recorded with similar names, or misspelled, so the data have to be cleaned before exploring them.

  • Select data starting at 1993
df<-data[data$BGN_DATE>="1993-01-01",]

3. Check for NA or NULL values

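vars<-names(df)
# count NA and empty-string entries in the STATE ... INJURIES columns
# (is.null() on a data frame column is always FALSE, so empty strings are checked instead)
checknanull<-sapply(vars[4:9],function(x){c(na=sum(is.na(df[[x]])),empty=sum(as.character(df[[x]])=="",na.rm=TRUE))})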

4. Manipulating exponential variables.

Several analyses have been done by David Hood and others on the interpretation of exponent values other than \(M,m=E6\), \(B,b=E9\), \(K,k=E3\) or \(H,h=E2\). Most of the numeric values in the EXP variables can be interpreted as \(numeric:{1,9} =E1\), \(? = NA\), \(- = NA\), \(+= E0\). Let us look at the frequency of these levels:

table(df$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 313139      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 392674      7   8557
table(df$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 430854      7     19      1      9     21 281832      1   1994

We can see that, compared to the entries with PROPDMGEXP = E9, E6 or CROPDMGEXP = E9, E6, all the numeric values and the “?”, “+”, “-” codes, which fall within \([E0,E1]\), are not significant. Therefore we can ignore them. Entries with an empty exponent give no interpretable total damage cost, so we will exclude them too.

expsubsetprop<-grep("\\?|\\+|-|^\\s*$|[0-9]",df$PROPDMGEXP)
expsubsetcrop<-grep("\\?|\\+|-|^\\s*$|[0-9]",df$CROPDMGEXP)
df$PROPDMG[expsubsetprop]<-0 # set the damage amount to 0 where the exponent is not interpretable
df$CROPDMG[expsubsetcrop]<-0
df$PROPDMGEXP[expsubsetprop]<-0 # set the matching exponent code to 0 as well
df$CROPDMGEXP[expsubsetcrop]<-0
df$PROPDMGEXP<-toupper(df$PROPDMGEXP)
df$CROPDMGEXP<-toupper(df$CROPDMGEXP)

5. Excluding those that did not cause any damage

I will exclude all those events that did not cause any damage, since we are interested in exploring only the damages caused.

dfdam<-df[df$FATALITIES!=0 | df$INJURIES!=0 | df$PROPDMG!=0 | df$CROPDMG!=0,]
  • Now multiply the variables PROPDMG and CROPDMG with the exponential variables (in exponent form).
dfdam$PROPDMGEXP<-revalue(dfdam$PROPDMGEXP, c("B"="E9", "M"="E6","K"="E3","H"="E2")) # or levels(dfdam$PROPDMGEXP)<-c("E9","E2","E3","E6")
dfdam$CROPDMGEXP<-revalue(dfdam$CROPDMGEXP, c("B"="E9", "M"="E6","K"="E3"))
dfdam$PROPDMG <- do.call(paste, c(dfdam[c("PROPDMG","PROPDMGEXP")], sep=""))
dfdam$CROPDMG <- do.call(paste, c(dfdam[c("CROPDMG","CROPDMGEXP")], sep=""))
dfdam$PROPDMG<-as.numeric(dfdam$PROPDMG)
dfdam$CROPDMG<-as.numeric(dfdam$CROPDMG)
dfdam<-dfdam[-c(11,13)]
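
As an alternative way to see the same conversion, the exponent codes can be mapped to multipliers with a named lookup vector instead of pasting strings. This is only a sketch on made-up values; the names multiplier, toy and DMG_USD are hypothetical, and uninterpretable codes are set to zero to mirror the choice made above.

```r
# Hypothetical lookup table: exponent code -> multiplier
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up damage amounts and exponent codes
toy <- data.frame(DMG = c(25, 2.5, 1, 10, 5),
                  EXP = c("K", "M", "b", "", "?"),
                  stringsAsFactors = FALSE)

# Codes missing from the table (empty, "?", digits, ...) index as NA
m <- unname(multiplier[toupper(toy$EXP)])

# Treat uninterpretable codes as zero damage
toy$DMG_USD <- ifelse(is.na(m), 0, toy$DMG * m)
toy$DMG_USD
```

A lookup like this keeps the damage columns numeric throughout, at the cost of an explicit decision about every code that is not in the table.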

6. Matching and minimizing the number of event types

Here I will try to match all the event types (around 900) with the 48 event types listed in the NOAA publication.

  • Change to upper case
dfdam$EVTYPE<-toupper(dfdam$EVTYPE)
length(unique(dfdam$EVTYPE))
## [1] 443

I included a reference list for event types (from table 2.1.1 from the National Weather Service Storm Data Documentation) in file “/files/evtyperef.csv”.

EVTYPEREF<-read.csv("files/evtyperef.csv",header=TRUE,colClasses = "character")
EVTYPEREF<-toupper(EVTYPEREF$EVTYPEREF)
  • Let us take a first look at approximate matches between the reference list above and all the EVTYPE values
uniquematches<-lapply(EVTYPEREF,function(x){unique(grep(x,dfdam$EVTYPE,value = TRUE))})
  • When an entry shares words with several event types, or lists multiple events, I match it to the most severe one first. For example, if an event is described as “heavy rain/flood/hurricane”, then “hurricane” is chosen: “hurricane” is matched first, then “flood”, and then “rain”. For this reason the matches have to be applied in a selected order, such as:
dfdam$EVTYPE[grep("ASTRONOMICAL LOW TIDE|LOW TIDE",dfdam$EVTYPE)]<-"ASTRONOMICAL LOW T"
dfdam$EVTYPE[grep("HIGH WATER|SEAS|HIGH TIDE|SURF|SWELL|WAVE|RISING",dfdam$EVTYPE)]<-"HIGH SURF"
dfdam$EVTYPE[grep("(?<!LOW )(?<! HIGH)TIDE",dfdam$EVTYPE,perl=TRUE)]<-"SURGE/TIDE"
dfdam$EVTYPE[grep("(?<!COASTAL )SURGE",dfdam$EVTYPE,perl=TRUE)]<-"SURGE/TIDE"
dfdam$EVTYPE[grep("MARINE THUNDERSTORM WIND|MARINE TSTM WIND",dfdam$EVTYPE)]<-"MARINE TDSTM W"
dfdam$EVTYPE[grep("(?<!DUST )(?<!WINTER )(?<!ICE )(?<!TROPICAL )STORM",dfdam$EVTYPE,perl=TRUE)]<-"THUNDERSTORM W"
dfdam$EVTYPE[grep("(?<!MARINE )TDSTM|(?<!MARINE )TSTM|MICROB|DOWNB",dfdam$EVTYPE,perl = TRUE)]<-"THUNDERSTORM W"
dfdam$EVTYPE[agrep("ICE STORM",dfdam$EVTYPE,max.distance = 2)]<-"I STORM"
dfdam$EVTYPE[grep("WINTER STORM",dfdam$EVTYPE)]<-"WINTER STORM"
dfdam$EVTYPE[grep("TROPICAL STORM",dfdam$EVTYPE)]<-"TROPICAL STORM"
dfdam$EVTYPE[grep("DUST STORM|BLOWING DUST",dfdam$EVTYPE)]<-"DUST STORM"
dfdam$EVTYPE[grep("FLASH",dfdam$EVTYPE)]<-"FLASH F"
dfdam$EVTYPE[grep("(?<!COASTAL )(?<!ICE )FLOOD",dfdam$EVTYPE,perl = TRUE)]<-"FLOOD"
dfdam$EVTYPE[grep("COASTAL",dfdam$EVTYPE)]<-"COASTAL F"
dfdam$EVTYPE[agrep("LIGHTNING",dfdam$EVTYPE,max.distance = 2)]<-"LIGHTNING"
dfdam$EVTYPE[grep("WINTER WEATHER",dfdam$EVTYPE)]<-"WINTER WEATHER"
dfdam$EVTYPE[grep("TROPICAL DEPRESSION",dfdam$EVTYPE)]<-"TROPICAL DEPRESSION"
dfdam$EVTYPE[agrep("WATERSPROUT",dfdam$EVTYPE,max.distance = 2)]<-"WATERSPROUT"
dfdam$EVTYPE[agrep("TORNADO",dfdam$EVTYPE,max.distance = 2)]<-"TORNADO"
dfdam$EVTYPE[grep("MARINE STRONG WIND",dfdam$EVTYPE)]<-"MARINE STRONG W"
dfdam$EVTYPE[agrep("RIP CURRENT",dfdam$EVTYPE,max.distance = 2)]<-"RIP CURRENT"
dfdam$EVTYPE[agrep("SLEET",dfdam$EVTYPE,max.distance = 2)]<-"SLEET"
dfdam$EVTYPE[agrep("AVALANCHE",dfdam$EVTYPE,max.distance = 2)]<-"AVALANCHE"
dfdam$EVTYPE[agrep("BLIZZARD",dfdam$EVTYPE,max.distance = 2)]<-"BLIZZARD"
dfdam$EVTYPE[grep("LAKE-EFFECT SNOW",dfdam$EVTYPE)]<-"LAKE-EFFECT S"
dfdam$EVTYPE[grep("SNOW",dfdam$EVTYPE)]<-"HEAVY SNOW"
dfdam$EVTYPE[grep("HURRICANE|TYPHOON",dfdam$EVTYPE)]<-"HURRICANE"
dfdam$EVTYPE[agrep("WILDFIRE",dfdam$EVTYPE,max.distance = 2)]<-"WILDFIRE"
dfdam$EVTYPE[agrep("DUST DEVIL",dfdam$EVTYPE,max.distance = 2)]<-"DUST DEVIL"
dfdam$EVTYPE[agrep("DROUGHT",dfdam$EVTYPE,max.distance = 2)]<-"DROUGHT"
dfdam$EVTYPE[agrep("EXCESSIVE HEAT",dfdam$EVTYPE,max.distance = 2)]<-"EXCESSIVE H"
dfdam$EVTYPE[grep("HEAT|WARM",dfdam$EVTYPE)]<-"HEAT"
dfdam$EVTYPE[grep("MARINE HAIL",dfdam$EVTYPE)]<-"MARINE HL"
dfdam$EVTYPE[grep("MARINE HIGH WIND",dfdam$EVTYPE)]<-"MARINE HIGH W"
dfdam$EVTYPE[grep("HAIL",dfdam$EVTYPE)]<-"HAIL"
dfdam$EVTYPE[grep("STRONG WIND",dfdam$EVTYPE)]<-"STRONG W"
dfdam$EVTYPE[grep("EXTREME COLD|HYPOTHERMIA|HYPERTHERMIA",dfdam$EVTYPE)]<-"EXTREME C/W CHILL"
dfdam$EVTYPE[grep("COLD|LOW TEMPERATURE|COOL",dfdam$EVTYPE)]<-"C/W CHILL"
dfdam$EVTYPE[grep("FREEZING FOG",dfdam$EVTYPE)]<-"F FOG"
dfdam$EVTYPE[grep("(?<!FREEZING )FOG",dfdam$EVTYPE,perl = TRUE)]<-"D FOG"
dfdam$EVTYPE[grep("DENSE SMOKE|SMOKE",dfdam$EVTYPE)]<-"D SMOKE"
dfdam$EVTYPE[grep("ICE|FROST|FREEZ|ICY",dfdam$EVTYPE)]<-"FROST/FREEZE"
dfdam$EVTYPE[grep("RAIN|PRECIP|SHOWER",dfdam$EVTYPE)]<-"HEAVY RAIN"
dfdam$EVTYPE[grep("WIND",dfdam$EVTYPE)]<-"HIGH W"
dfdam$EVTYPE[grep("SLIDE|SLUMP|SPOUT",dfdam$EVTYPE)]<-"LANDSLIDE"
dfdam$EVTYPE[grep("I STORM",dfdam$EVTYPE)]<-"ICE STORM"
dfdam$EVTYPE[grep("FIRE",dfdam$EVTYPE)]<-"WILDFIRE"
dfdam$EVTYPE[grep("\\?|APACHE|BEACH|DAM|DENSE|FUNNEL|GLAZE|MIX|TURBULENCE|URBAN|ACCIDENT|DROWNING|MISHAP|HIGH$",dfdam$EVTYPE)]<-"OTHER"
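
The priority idea described above (most severe match wins) can also be sketched as a small rule table applied in order. This is a variant, not the code used in this report: it records the first match in a separate vector, so a later pattern cannot overwrite an earlier, higher-priority one. The rules, labels and sample strings are made up for illustration.

```r
# Hypothetical rule table, ordered from highest to lowest priority
rules <- list(
  list(pattern = "HURRICANE|TYPHOON", label = "HURRICANE"),
  list(pattern = "FLASH",             label = "FLASH FLOOD"),
  list(pattern = "FLOOD",             label = "FLOOD"),
  list(pattern = "RAIN",              label = "HEAVY RAIN")
)

classify <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (r in rules) {
    # only entries that are still unlabelled can take this label
    hit <- is.na(out) & grepl(r$pattern, x)
    out[hit] <- r$label
  }
  out[is.na(out)] <- "OTHER"  # nothing matched
  out
}

raw <- c("HEAVY RAIN/FLOOD/HURRICANE", "FLASH FLOOD", "RIVER FLOOD", "RAIN", "FOG")
classify(raw, rules)
```

Because labels are written to a separate vector, the full event names can be used as labels without the risk of a later pattern matching a label assigned earlier.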

Let us look at the resulting event types:

events<-unique(dfdam[order(dfdam$EVTYPE),"EVTYPE"])
length(events)
## [1] 45
events
##  [1] "ASTRONOMICAL LOW T"  "AVALANCHE"           "BLIZZARD"           
##  [4] "COASTAL F"           "C/W CHILL"           "D FOG"              
##  [7] "DROUGHT"             "D SMOKE"             "DUST DEVIL"         
## [10] "DUST STORM"          "EXCESSIVE H"         "EXTREME C/W CHILL"  
## [13] "FLASH F"             "FLOOD"               "FROST/FREEZE"       
## [16] "HAIL"                "HEAT"                "HEAVY RAIN"         
## [19] "HEAVY SNOW"          "HIGH SURF"           "HIGH W"             
## [22] "HURRICANE"           "ICE STORM"           "LAKE-EFFECT S"      
## [25] "LANDSLIDE"           "LIGHTNING"           "MARINE HIGH W"      
## [28] "MARINE HL"           "MARINE STRONG W"     "MARINE TDSTM W"     
## [31] "OTHER"               "RIP CURRENT"         "SEICHE"             
## [34] "SLEET"               "STRONG W"            "SURGE/TIDE"         
## [37] "THUNDERSTORM W"      "TORNADO"             "TROPICAL DEPRESSION"
## [40] "TSUNAMI"             "VOLCANIC ASH"        "WATERSPROUT"        
## [43] "WILDFIRE"            "WINTER STORM"        "WINTER WEATHER"

Other variables (features)

  1. US region: Let us add a column matching the state with the US region according to the state database in R.
stateNOAA<-as.character(unique(dfdam$STATE))
#We see that there are more entries than the common 50 (includes other regions and territories)
complement_state<-function(x,y) unique(c(setdiff(x,y),setdiff(y,x)))
dfdam$REGION<-mapvalues(as.character(dfdam$STATE),from = c(as.character(state.abb),
                                                           complement_state(stateNOAA,state.abb)),
                        to =(c(as.character(state.region),rep("Other",17))))

Other useful variables that could give us insight into the correlation between region/time and events are:

  2. Elapsed time for each event when both beginning and ending date/time were reported. So let us add the column TIME_DIFF
dfdam<-dfdam%>%mutate(END_DATE2=ifelse(is.na(END_DATE),as.character(BGN_DATE),as.character(END_DATE)))
dfdam$END_DATE<-dfdam$END_DATE2
dfdam$END_DT<-as.POSIXct(paste(dfdam$END_DATE,gsub("[^0-9]","",dfdam$END_TIME)," "),format="%Y-%m-%d %H%M")
dfdam$BGN_DT<-as.POSIXct(paste(dfdam$BGN_DATE,gsub("[^0-9]","",dfdam$BGN_TIME)," "),format="%Y-%m-%d %H%M")
dfdam<-dfdam%>% mutate(TIME_DIFF=as.numeric(difftime(END_DT,BGN_DT,units="hours")))
  3. Extent: Approximate distance covered by each event when beginning and ending coordinates (latitude, longitude) are given. Now I add the column LOC_DIFF, computed with the Haversine formula for the great-circle distance.
great_distance_hf <- function(lat1,long1,lat2,long2) {
        R <- 6371 
        a <- sin((lat2 - lat1)/2)^2 + cos(lat1) * cos(lat2) * sin((long2 - long1)/2)^2
        c <- 2 * asin(min(1,sqrt(a)))
        d = R * c
        return(d) # km
}
dfdam<-dfdam%>%mutate(
        LOC_DIFF=mapply(great_distance_hf,
                        dfdam$LATITUDE,dfdam$LONGITUDE,
                        dfdam$LATITUDE_E,dfdam$LONGITUDE_))
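
For reference, the Haversine formula is defined on angles in radians, and R's sin() and cos() also work in radians. The sketch below assumes the coordinates are given in decimal degrees and converts them first; if the raw columns use another encoding they would need to be decoded beforehand. The function name haversine_km and its parameter names are hypothetical.

```r
# Great-circle distance in km, assuming inputs in decimal degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dphi <- (lat2 - lat1) * to_rad
  dlam <- (lon2 - lon1) * to_rad
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlam / 2)^2
  # pmin keeps the function vectorised and guards against rounding above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# A quarter of the equator should equal radius * pi / 2
haversine_km(0, 0, 0, 90)
```

Since the function is vectorised, it can be applied to whole columns directly, without mapply.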

Results

As stated before, I am interested in which types of events are most harmful with respect to population health across the U.S., and which types of events have the greatest economic consequences across the U.S.

First, let us look at the most frequent events throughout the years, which are not necessarily the most harmful. The figure below shows the total number of events from 1993 to 2011 for each type of weather event.

#Theme
gral_theme <- function(base_size = 12, base_family = "sans"){
        theme_minimal(base_size = base_size, base_family = base_family) +
                theme(
                        axis.text = element_text(size = 12),
                        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5),
                        axis.title = element_text(size = 14),
                        panel.grid.major = element_line(color = "grey"),
                        panel.grid.minor = element_blank(),
                        panel.background = element_rect(fill = "aliceblue"),
                        strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
                        strip.text = element_text(face = "bold", size = 12, color = "black"),
                        legend.position = "bottom",
                        legend.justification = "top",
                        legend.box = "horizontal",
                        legend.background = element_blank(),
                        panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
                )
}
g1<-ggplot(dfdam,aes(x=EVTYPE)) # refer to the column by name inside aes()
g1+geom_bar()+gral_theme()+labs(title="Total number of events from 1993 to 2011",x="Event type",y="Total number of events")
Total number of events for each event type from 1993 to 2011

# theme(axis.text.x = element_text(angle = 90, hjust = 1))

1. Consequences in population health

The most harmful events need not be the most frequent. In order to explore which type of event has been most harmful throughout the years, let us look at the total number of fatalities and injuries for each weather event type, as shown in the chart below.

fatevents<-setNames(aggregate(dfdam$FATALITIES~dfdam$EVTYPE,FUN = sum),c("Type","Fatalities"))
injevents<-setNames(aggregate(dfdam$INJURIES~dfdam$EVTYPE,FUN = sum),c("Type","Injuries"))
fatinjevents<-setNames(data.frame(fatevents$Type,fatevents$Fatalities,injevents$Injuries),c("Type","FATALITIES","INJURIES"))
df_melted <- melt(fatinjevents, id=c("Type"))
g2<-ggplot(df_melted,aes(x=value,y=Type,color=variable))+geom_point()
g2+labs(title="Total number of incidents",x="",y="")+theme(legend.title=element_blank())
Total number of fatalities and injuries for each event type
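
The per-type totals can also be computed in a single aggregate() call, which keeps fatalities and injuries in the same row for each event type. A minimal sketch on made-up numbers (the data frame toy and its values are purely illustrative):

```r
# Made-up events, not taken from the storm database
toy <- data.frame(EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD"),
                  FATALITIES = c(3, 10, 1, 2),
                  INJURIES   = c(40, 0, 5, 1))

# cbind() on the left-hand side sums both columns per event type at once
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by total fatalities, largest first
totals[order(-totals$FATALITIES), ]
```

With both sums coming from one call, there is no need to rely on two separate results being in the same row order.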

From the figure above we can see that the event type that has caused the most harm over the years (cumulatively) is the tornado. However, this may not be the type of event that has caused the most harm in a single day (or event period). This is illustrated in the following tables, where I show the top 5 most harmful incidents; the first row of each table shows the maximum number of fatalities and injuries, respectively.

dftop<-dfdam[with(dfdam,order(-dfdam$FATALITIES,-dfdam$INJURIES)),c("BGN_DATE",'EVTYPE','STATE','FATALITIES','INJURIES','TIME_DIFF','LOC_DIFF')]
kable(head(dftop,5))
|       |BGN_DATE   |EVTYPE      |STATE | FATALITIES| INJURIES|   TIME_DIFF| LOC_DIFF|
|:------|:----------|:-----------|:-----|----------:|--------:|-----------:|--------:|
|4539   |1995-07-12 |HEAT        |IL    |        583|        0| 102.0000000|    0.000|
|214090 |2011-05-22 |TORNADO     |MO    |        158|     1150|   0.3333333| 7686.457|
|62329  |1999-07-28 |EXCESSIVE H |IL    |         99|        0|  68.0000000|    0.000|
|67709  |1999-07-04 |EXCESSIVE H |PA    |         74|      135|  48.0000000|    0.000|
|18450  |1995-07-01 |EXCESSIVE H |PA    |         67|        0|          NA|    0.000|
dftop2<-dfdam[with(dfdam,order(-dfdam$INJURIES,-dfdam$FATALITIES)),c("BGN_DATE",'EVTYPE','STATE','FATALITIES','INJURIES','TIME_DIFF','LOC_DIFF')]
kable(head(dftop2,5))
|       |BGN_DATE   |EVTYPE    |STATE | FATALITIES| INJURIES|  TIME_DIFF| LOC_DIFF|
|:------|:----------|:---------|:-----|----------:|--------:|----------:|--------:|
|16588  |1994-02-08 |ICE STORM |OH    |          1|     1568| 20.0000000|    0.000|
|214090 |2011-05-22 |TORNADO   |MO    |        158|     1150|  0.3333333| 7686.457|
|213286 |2011-04-27 |TORNADO   |AL    |         44|      800|  0.7833333| 6087.296|
|58152  |1998-10-17 |FLOOD     |TX    |          2|      800| 30.2500000|    0.000|
|114987 |2004-08-13 |HURRICANE |FL    |          7|      780|  6.0000000|    0.000|

We can also look at the variable “TIME_DIFF”, which describes the elapsed time (for those events where it is available). I will first rearrange the event types into a new column that keeps those with a significant number of incidents and groups the rest as “Other”. The limits for this filter are suggested by figure 2 (more than 50 fatalities or more than 200 injuries in a single event), since the median and the mean are very small (close to zero).

summary(dfdam$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0479   0.0000 583.0000
summary(dfdam$INJURIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    0.303    0.000 1568.000
sigevtype<-unique(dfdam$EVTYPE[dfdam$FATALITIES>50 | dfdam$INJURIES>200])
sigevtype
## [1] "TORNADO"     "BLIZZARD"    "HEAT"        "ICE STORM"   "EXCESSIVE H"
## [6] "FLOOD"       "HURRICANE"
nsigtype<-complement_state(sigevtype,events)
dfdam$SIGEVTYPE<-factor(mapvalues(dfdam$EVTYPE,from = nsigtype,
                        to =rep("Other",length(nsigtype))))
palet<-c("#89C5DA", "#DA5724", "#74D944", "#CE50CA", "#3F4921", "#C0717C", "#CBD588", "#5F7FC7", 
"#673770", "#D3D93E", "#38333E")
dfdam2<-dfdam[!(is.na(dfdam$TIME_DIFF)),]
dfdam2_gather<-dfdam2%>%gather(Type,Incidents,FATALITIES,INJURIES)
ggplot(dfdam2_gather,aes(x=log10(TIME_DIFF),y=log10(Incidents),color=SIGEVTYPE))+ geom_jitter(aes(color = SIGEVTYPE), alpha=.6,size = 1.5) +
        labs(
                color = "Event Type",
                x = "Time elapsed [h]",
                y = "Incidents",
                title = "1993-2011 US Storm events",
                subtitle = "Dataset from NOAA",
                caption = ""
        )+
        scale_y_continuous(limits=c(log10(1), log10(1600)), labels = scales::math_format(10^.x))+
        scale_x_continuous(limits=c(log10(1), log10(9000)), labels = scales::math_format(10^.x))+
        facet_grid(Type ~ REGION) +
        gral_theme()+scale_color_manual(values = palet)
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning: Removed 121720 rows containing missing values (geom_point).
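
The grouping of the less harmful event types into a single “Other” level, used for the colour legend above, can be sketched without mapvalues. The vectors evtype and significant below are made up for illustration.

```r
# Made-up event types and a hypothetical list of "significant" ones
evtype      <- c("TORNADO", "HAIL", "HEAT", "LIGHTNING", "TORNADO")
significant <- c("TORNADO", "HEAT")

# Keep significant types as they are, collapse everything else into "Other"
grouped <- factor(ifelse(evtype %in% significant, evtype, "Other"))
table(grouped)
```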

And finally we can look at the number of incidents (fatalities and injuries) across different regions of the US.

dfdam$ID<-seq.int(nrow(dfdam))
dfdam_gather<-dfdam%>%gather(Group,Incidents,FATALITIES,INJURIES)
dfdam_gather$Group<-as.factor(dfdam_gather$Group)
ggplot(dfdam_gather,aes(x=BGN_DATE,y=log10(Incidents),color=SIGEVTYPE))+
        geom_jitter(aes(color = SIGEVTYPE), size = 1.5) +
        labs(
                color = "Event Type",
                x = "Date",
                y = "Incidents",
                title = "1993-2011 US Storm events",
                subtitle = "Dataset from NOAA",
                caption = ""
        ) +
        scale_y_continuous(limits=c(log10(1), log10(1600)), labels = scales::math_format(10^.x))+
        facet_grid(Group ~ REGION) +
        gral_theme()+scale_color_manual(values = palet)
## Warning: Removed 4869 rows containing missing values (geom_point).

2. Consequences in economy

The figure below shows the total cost of the damages caused by each weather event type from 1993 to 2011 across the U.S. It is clear that floods have caused the most damage. Inflation was not taken into account in this report.

prop<-setNames(aggregate(dfdam$PROPDMG~dfdam$EVTYPE,FUN = sum),c("Type","PROPERTY.DMG"))
crop<-setNames(aggregate(dfdam$CROPDMG~dfdam$EVTYPE,FUN = sum),c("Type","CROP.DMG"))
propcrop<-setNames(data.frame(prop$Type,prop$PROPERTY.DMG,crop$CROP.DMG),c("Type","PROPERTY.DMG","CROP.DMG"))
df_melted <- melt(propcrop, id=c("Type"))
g2<-ggplot(df_melted,aes(x=value,y=Type,color=variable))+geom_point()
g2+labs(title="Total cost of damages",x="",y="")+theme(legend.title=element_blank())
Consequences in property and crop damages

The same damages can be broken down by date and region across the US:

summary(dfdam$PROPDMG)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 2.000e+03 8.000e+03 1.748e+06 3.000e+04 1.150e+11
summary(dfdam$CROPDMG)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 2.164e+05 0.000e+00 5.000e+09
sevtype<-unique(dfdam$EVTYPE[dfdam$PROPDMG>2*10e8 | dfdam$CROPDMG>10e9])
sevtype
## [1] "WINTER STORM" "HURRICANE"    "FLOOD"        "HEAVY RAIN"  
## [5] "ICE STORM"    "SURGE/TIDE"   "TORNADO"
ntype<-complement_state(sevtype,events)
dfdam$SEVTYPE<-factor(mapvalues(dfdam$EVTYPE,from = ntype,
                        to =rep("Other",length(ntype))))
dfdam_gather2<-dfdam%>%gather(Group,Damages,PROPDMG,CROPDMG)
dfdam_gather2$Group<-as.factor(dfdam_gather2$Group)
ggplot(dfdam_gather2,aes(x=BGN_DATE,y=log10(Damages),color=SEVTYPE))+
        geom_jitter(aes(color = SEVTYPE),alpha=.3, size = 1.5) +
        labs(
                color = "Event Type",
                x = "Date",
                y = "Damages",
                title = "1993-2011 US Storm events",
                subtitle = "Dataset from NOAA",
                caption = ""
        ) + scale_y_continuous(limits=c(log10(1), log10(1.2e11)), labels = scales::math_format(10^.x))+
        facet_grid(Group ~ REGION) +
        gral_theme()+scale_color_manual(values=palet)

The following tables show the top 5 most damaging events in terms of property and crop costs; the first row of each table shows the most damaging event for property and crops, respectively.

dftop3<-dfdam[with(dfdam,order(-dfdam$PROPDMG,-dfdam$CROPDMG)),c("BGN_DATE",'EVTYPE','STATE','PROPDMG','CROPDMG','TIME_DIFF','LOC_DIFF')]
kable(head(dftop3,5))
|       |BGN_DATE   |EVTYPE     |STATE |   PROPDMG|  CROPDMG| TIME_DIFF| LOC_DIFF|
|:------|:----------|:----------|:-----|---------:|--------:|---------:|--------:|
|135162 |2006-01-01 |FLOOD      |CA    | 1.150e+11| 32500000|        -5|        0|
|128121 |2005-08-29 |SURGE/TIDE |LA    | 3.130e+10|        0|         3|        0|
|128120 |2005-08-28 |HURRICANE  |LA    | 1.693e+10|        0|        18|        0|
|129286 |2005-08-29 |SURGE/TIDE |MS    | 1.126e+10|        0|         3|        0|
|125932 |2005-10-24 |HURRICANE  |FL    | 1.000e+10|        0|         8|        0|
dftop4<-dfdam[with(dfdam,order(-dfdam$CROPDMG,-dfdam$PROPDMG)),c("BGN_DATE",'EVTYPE','STATE','PROPDMG','CROPDMG','TIME_DIFF','LOC_DIFF')]
kable(head(dftop4,5))
|       |BGN_DATE   |EVTYPE            |STATE |  PROPDMG|  CROPDMG| TIME_DIFF| LOC_DIFF|
|:------|:----------|:-----------------|:-----|--------:|--------:|---------:|--------:|
|4394   |1993-08-31 |FLOOD             |IL    | 5.00e+09| 5.00e+09|        NA|        0|
|10610  |1994-02-09 |ICE STORM         |MS    | 5.00e+05| 5.00e+09|        NA|        0|
|129287 |2005-08-29 |HURRICANE         |MS    | 5.88e+09| 1.51e+09|    3.5000|        0|
|144148 |2006-01-01 |DROUGHT           |TX    | 0.00e+00| 1.00e+09|  719.9833|        0|
|47054  |1998-12-20 |EXTREME C/W CHILL |CA    | 0.00e+00| 5.96e+08|  163.5000|        0|

Note that some of the top events are most probably related to Hurricane Katrina (from August 28th to 31st 2005).

3. Most affected state

The following result shows which state was the most affected (with most frequent events) in the period from 1993 to 2011.

table(dfdam$REGION)
## 
## North Central     Northeast         Other         South          West 
##         71694         24033          1421        116253         13554
table(dfdam$STATE)
## 
##    AK    AL    AM    AN    AR    AS    AZ    CA    CO    CT    DC    DE 
##   521 10161    20    29  6183    49  1536  3154  1706   863   141   358 
##    FL    GA    GM    GU    HI    IA    ID    IL    IN    KS    KY    LA 
##  7003 10098    48   156   150 14801   773  5491  5265  5941  6300  5288 
##    LC    LE    LH    LM    LO    LS    MA    MD    ME    MH    MI    MN 
##     0     2     0   101     5     2  2035  2972   574     1  4569  2733 
##    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY    OH 
##  5525 10959   944  4658  2016  5831   598  1222  1091   637  9433 12599 
##    OK    OR    PA    PH    PK    PM    PR    PZ    RI    SC    SD    SL 
##  6400   500  6715     1     7     0   799     6   242  4073  1763     1 
##    ST    TN    TX    UT    VA    VI    VT    WA    WI    WV    WY    XX 
##     0 10142 19008   832  8828    53  2351  1046  5160  3822   664     0
statemax<-dfdam%>%group_by(STATE)%>%summarise(freq=n())
statemax[statemax$freq==max(statemax$freq),"STATE"]
## # A tibble: 1 × 1
##    STATE
##   <fctr>
## 1     TX
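
The same lookup can be done in base R by sorting a frequency table. A minimal sketch on a made-up vector of state abbreviations (states and counts are illustrative names):

```r
# Made-up state abbreviations, one per event
states <- c("TX", "OH", "TX", "IA", "TX", "OH")

# Frequency table sorted from most to least frequent
counts <- sort(table(states), decreasing = TRUE)

names(counts)[1]  # state with the most events
head(counts, 3)   # the three most affected states
```

Looking at the first few entries rather than only the maximum also shows how close the runners-up are, and makes ties visible.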