This data project is a basic exploration of the economic and public health problems caused by storms across the U.S. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The following report is based on the NOAA storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
An exploratory analysis is done to show which types of events are most harmful with respect to population health and which have the greatest economic consequences. After a brief description of the storm database, the data is processed to reduce it to the well-known types of weather events (see the NOAA documentation), keeping only the most relevant years in which these were recorded and excluding events that did not cause any damage. The results are shown in the last section.
This database documents the following phenomena:
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. The storm database includes \(902297\) recorded events. Note that several of these entries may correspond to the same event type but were recorded under a different name by the storm data preparer. According to the NOAA storm database description there are a total of \(48\) different event types (Table 2.1.1 in the Storm Data preparation publication).
The NOAA storm database used for this analysis can be downloaded from the "Storm data" link on the Coursera web site. The data set contains 37 variables. Since I am only interested in health effects (fatalities or injuries) and property or crop damage across the US, I will only read the following variables:
- BGN_DATE: when the event took place
- STATE: state abbreviation (includes territories and minor islands)
- EVTYPE: type of weather event
- FATALITIES and INJURIES
- PROPDMG, CROPDMG: property and crop damage, respectively
- PROPDMGEXP, CROPDMGEXP: base-10 exponent of the damage value
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
setClass("myDate")
setAs("character","myDate", function(from) as.Date(from, format="%m/%d/%Y %H:%M:%S",origin="1950-01-01") )
data<-read.csv("repdata%2Fdata%2FStormData.csv.bz2",sep=",",header=TRUE,colClasses =
c("NULL","myDate","character","character",rep("NULL",2),"factor",
"factor",rep("NULL",3),"myDate","character",
rep("NULL",9),rep("numeric",3),"factor","numeric","factor",
rep("NULL",3),rep("numeric",4),rep("NULL",2)))
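The positional `colClasses` vector above is hard to check against the 37 columns by eye. A minimal sketch of the same idea driven by column names, run on a tiny temporary CSV file; the helper `read_selected` and the toy data are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: read only the named columns of a CSV file.
# 'keep' is a named character vector: column name -> class.
read_selected <- function(path, keep) {
  # Read just the first row to learn the column names.
  header <- names(read.csv(path, nrows = 1, check.names = FALSE))
  # Default every column to "NULL" (skipped), then switch on the wanted ones.
  classes <- setNames(rep("NULL", length(header)), header)
  classes[names(keep)] <- keep
  read.csv(path, colClasses = classes, check.names = FALSE)
}

# Toy file standing in for the storm data (values are made up).
toy_file <- tempfile(fileext = ".csv")
writeLines(c("STATE__,EVTYPE,REMARKS,FATALITIES",
             "1,TORNADO,none,0",
             "1,HAIL,none,2"), toy_file)

small <- read_selected(toy_file, c(EVTYPE = "character", FATALITIES = "numeric"))
names(small)
```

Only the two requested columns are kept, and adding or dropping a variable means editing one name instead of recounting positions.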
According to the NOAA web site http://www.ncdc.noaa.gov/stormevents/details.jsp?type=collection, all event types only started to be recorded in 1993. For that reason, only events from 1993 onwards will be compared in this exploratory analysis.
To show this, let us explore data before and after 1993, comparing the number of different levels in the factor variable “EVTYPE” before and after 1993, and 1996, another reference date given by the NOAA website.
Total number of different types of events in the raw data set:
str(data$EVTYPE)
## Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
From 1950 to 1993
df5093<-data[data$BGN_DATE<="1993-01-01" & data$BGN_DATE>="1950-01-01","EVTYPE"]
df5093<-factor(df5093)
str(df5093)
## Factor w/ 11 levels "AVALANCHE","FLOOD",..: 10 10 10 10 10 10 10 10 10 10 ...
From 1993 to 1996
df9396<-data[data$BGN_DATE<="1996-01-01" & data$BGN_DATE>="1993-01-01","EVTYPE"]
df9396<-factor(df9396)
str(df9396)
## Factor w/ 601 levels "?","AGRICULTURAL FREEZE",..: 116 395 291 413 595 413 280 140 470 470 ...
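The period-by-period comparison above can also be sketched in a single pass by labelling each row with its recording period and counting distinct event types per label. The toy data frame below is made up, and the half-open intervals are an assumption (the subsets above include both end dates):

```r
# Toy stand-in for the storm data (dates and types are made up).
toy <- data.frame(
  BGN_DATE = as.Date(c("1960-05-01", "1985-06-10", "1994-03-02",
                       "1995-07-12", "2001-08-20", "2010-01-15")),
  EVTYPE   = c("TORNADO", "TORNADO", "ICE STORM", "HEAT", "FLOOD", "HEAT")
)

# Label each row with its recording period, using the same cut-off dates.
breaks <- as.Date(c("1950-01-01", "1993-01-01", "1996-01-01", "2012-01-01"))
toy$PERIOD <- cut(toy$BGN_DATE, breaks = breaks,
                  labels = c("1950-1992", "1993-1995", "1996-2011"),
                  right = FALSE)

# Number of distinct event types per period.
n_types <- tapply(toy$EVTYPE, toy$PERIOD, function(x) length(unique(x)))
n_types
```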
Therefore, I will include only events from 1993 onwards. Note that there are still many more than 48 event types, which is the number reported in the Storm Data publication. Many of these were recorded with similar names, or misspelled, so the data has to be cleaned before exploring it.
df<-data[data$BGN_DATE>="1993-01-01",]
vars<-names(df)
checknanull<-sapply(vars[4:9],function(x){c(sum(is.na(df[x])),sum(is.null(df[x])))})
Several analyses have been done by David Hood and others on how to interpret exponent values other than \(M,m=E6\), \(B,b=E9\), \(K,k=E3\) or \(H,h=E2\). Most of the remaining values in the EXP variables can be interpreted as numeric \(\{1,\dots,9\} = E1\), \(? = NA\), \(- = NA\), \(+ = E0\). Let us look at the frequency of these levels:
table(df$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 313139 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 392674 7 8557
table(df$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 430854 7 19 1 9 21 281832 1 1994
We can see that, compared to the entries with PROPDMGEXP or CROPDMGEXP equal to E9 or E6, all the numeric values and the “?”, “+” and “-” symbols, which are within \([E0,E1]\), are not significant. Therefore we can ignore them. The empty values all lead to NA in the total damage cost, so we will exclude them too.
expsubsetprop<-grep("\\?|\\+|-|^\\s*$|[0-9]",df$PROPDMGEXP)
expsubsetcrop<-grep("\\?|\\+|-|^\\s*$|[0-9]",df$CROPDMGEXP)
df$PROPDMG[expsubsetprop]<-0 #replace all the corresponding PROPDMG or CROPDMG with 0
df$CROPDMG[expsubsetcrop]<-0
df$PROPDMGEXP[expsubsetprop]<-0 #replace all the corresponding PROPDMGEXP or CROPDMGEXP with 0
df$CROPDMGEXP[expsubsetcrop]<-0
df$PROPDMGEXP<-toupper(df$PROPDMGEXP)
df$CROPDMGEXP<-toupper(df$CROPDMGEXP)
I will exclude all those events that did not cause any damage, since we are interested in exploring only the damages caused.
dfdam<-df[df$FATALITIES!=0 | df$INJURIES!=0 | df$PROPDMG!=0 | df$CROPDMG!=0,]
dfdam$PROPDMGEXP<-revalue(dfdam$PROPDMGEXP, c("B"="E9", "M"="E6","K"="E3","H"="E2")) # or levels(dfdam$PROPDMGEXP)<-c("E9","E2","E3","E6")
dfdam$CROPDMGEXP<-revalue(dfdam$CROPDMGEXP, c("B"="E9", "M"="E6","K"="E3"))
dfdam$PROPDMG <- do.call(paste, c(dfdam[c("PROPDMG","PROPDMGEXP")], sep=""))
dfdam$CROPDMG <- do.call(paste, c(dfdam[c("CROPDMG","CROPDMGEXP")], sep=""))
dfdam$PROPDMG<-as.numeric(dfdam$PROPDMG)
dfdam$CROPDMG<-as.numeric(dfdam$CROPDMG)
dfdam<-dfdam[-c(11,13)]
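The conversion above builds strings such as "25E3" and parses them back into numbers. A minimal alternative sketch with a lookup vector of multipliers; the helper `to_dollars` is hypothetical, and treating every other code as zero is an assumption that mirrors the decision above to ignore those entries:

```r
# Multipliers for the documented exponent codes; anything else counts as 0
# here (an assumption matching the choice to ignore those entries).
exp_mult <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: damage in dollars from a value and its exponent code.
to_dollars <- function(value, code) {
  mult <- exp_mult[toupper(as.character(code))]
  mult[is.na(mult)] <- 0          # blank, "?", "+", "-", digits -> ignored
  unname(value * mult)
}

to_dollars(c(25, 2.5, 10, 3), c("K", "m", "", "5"))
# 25000 2500000 0 0
```

The lookup keeps the values numeric throughout, so no string pasting or re-parsing is needed.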
Here I will try to match all the event types (around 900) with the 48 event types reported in the NOAA publication.
dfdam$EVTYPE<-toupper(dfdam$EVTYPE)
length(unique(dfdam$EVTYPE))
## [1] 443
I included a reference list for event types (from table 2.1.1 from the National Weather Service Storm Data Documentation) in file “/files/evtyperef.csv”.
EVTYPEREF<-read.csv("files/evtyperef.csv",header=TRUE,colClasses = "character");EVTYPEREF<-toupper(EVTYPEREF$EVTYPEREF)
uniquematches<-lapply(EVTYPEREF,function(x){unique(grep(x,dfdam$EVTYPE,value = TRUE))})
dfdam$EVTYPE[grep("ASTRONOMICAL LOW TIDE|LOW TIDE",dfdam$EVTYPE)]<-"ASTRONOMICAL LOW T"
dfdam$EVTYPE[grep("HIGH WATER|SEAS|HIGH TIDE|SURF|SWELL|WAVE|RISING",dfdam$EVTYPE)]<-"HIGH SURF"
dfdam$EVTYPE[grep("(?<!LOW )(?<! HIGH)TIDE",dfdam$EVTYPE,perl=TRUE)]<-"SURGE/TIDE"
dfdam$EVTYPE[grep("(?<!COASTAL )SURGE",dfdam$EVTYPE,perl=TRUE)]<-"SURGE/TIDE"
dfdam$EVTYPE[grep("MARINE THUNDERSTORM WIND|MARINE TSTM WIND",dfdam$EVTYPE)]<-"MARINE TDSTM W"
dfdam$EVTYPE[grep("(?<!DUST )(?<!WINTER )(?<!ICE )(?<!TROPICAL )STORM",dfdam$EVTYPE,perl=TRUE)]<-"THUNDERSTORM W"
dfdam$EVTYPE[grep("(?<!MARINE )TDSTM|(?<!MARINE )TSTM|MICROB|DOWNB",dfdam$EVTYPE,perl = TRUE)]<-"THUNDERSTORM W"
dfdam$EVTYPE[agrep("ICE STORM",dfdam$EVTYPE,max.distance = 2)]<-"I STORM"
dfdam$EVTYPE[grep("WINTER STORM",dfdam$EVTYPE)]<-"WINTER STORM"
dfdam$EVTYPE[grep("TROPICAL STORM",dfdam$EVTYPE)]<-"TROPICAL STORM"
dfdam$EVTYPE[grep("DUST STORM|BLOWING DUST",dfdam$EVTYPE)]<-"DUST STORM"
dfdam$EVTYPE[grep("FLASH",dfdam$EVTYPE)]<-"FLASH F"
dfdam$EVTYPE[grep("(?<!COASTAL )(?<!ICE )FLOOD",dfdam$EVTYPE,perl = TRUE)]<-"FLOOD"
dfdam$EVTYPE[grep("COASTAL",dfdam$EVTYPE)]<-"COASTAL F"
dfdam$EVTYPE[agrep("LIGHTNING",dfdam$EVTYPE,max.distance = 2)]<-"LIGHTNING"
dfdam$EVTYPE[grep("WINTER WEATHER",dfdam$EVTYPE)]<-"WINTER WEATHER"
dfdam$EVTYPE[grep("TROPICAL DEPRESSION",dfdam$EVTYPE)]<-"TROPICAL DEPRESSION"
dfdam$EVTYPE[agrep("WATERSPROUT",dfdam$EVTYPE,max.distance = 2)]<-"WATERSPROUT"
dfdam$EVTYPE[agrep("TORNADO",dfdam$EVTYPE,max.distance = 2)]<-"TORNADO"
dfdam$EVTYPE[grep("MARINE STRONG WIND",dfdam$EVTYPE)]<-"MARINE STRONG W"
dfdam$EVTYPE[agrep("RIP CURRENT",dfdam$EVTYPE,max.distance = 2)]<-"RIP CURRENT"
dfdam$EVTYPE[agrep("SLEET",dfdam$EVTYPE,max.distance = 2)]<-"SLEET"
dfdam$EVTYPE[agrep("AVALANCHE",dfdam$EVTYPE,max.distance = 2)]<-"AVALANCHE"
dfdam$EVTYPE[agrep("BLIZZARD",dfdam$EVTYPE,max.distance = 2)]<-"BLIZZARD"
dfdam$EVTYPE[grep("LAKE-EFFECT SNOW",dfdam$EVTYPE)]<-"LAKE-EFFECT S"
dfdam$EVTYPE[grep("SNOW",dfdam$EVTYPE)]<-"HEAVY SNOW"
dfdam$EVTYPE[grep("HURRICANE|TYPHOON",dfdam$EVTYPE)]<-"HURRICANE"
dfdam$EVTYPE[agrep("WILDFIRE",dfdam$EVTYPE,max.distance = 2)]<-"WILDFIRE"
dfdam$EVTYPE[agrep("DUST DEVIL",dfdam$EVTYPE,max.distance = 2)]<-"DUST DEVIL"
dfdam$EVTYPE[agrep("DROUGHT",dfdam$EVTYPE,max.distance = 2)]<-"DROUGHT"
dfdam$EVTYPE[agrep("EXCESSIVE HEAT",dfdam$EVTYPE,max.distance = 2)]<-"EXCESSIVE H"
dfdam$EVTYPE[grep("HEAT|WARM",dfdam$EVTYPE)]<-"HEAT"
dfdam$EVTYPE[grep("MARINE HAIL",dfdam$EVTYPE)]<-"MARINE HL"
dfdam$EVTYPE[grep("MARINE HIGH WIND",dfdam$EVTYPE)]<-"MARINE HIGH W"
dfdam$EVTYPE[grep("HAIL",dfdam$EVTYPE)]<-"HAIL"
dfdam$EVTYPE[grep("STRONG WIND",dfdam$EVTYPE)]<-"STRONG W"
dfdam$EVTYPE[grep("EXTREME COLD|HYPOTHERMIA|HYPERTHERMIA",dfdam$EVTYPE)]<-"EXTREME C/W CHILL"
dfdam$EVTYPE[grep("COLD|LOW TEMPERATURE|COOL",dfdam$EVTYPE)]<-"C/W CHILL"
dfdam$EVTYPE[grep("FREEZING FOG",dfdam$EVTYPE)]<-"F FOG"
dfdam$EVTYPE[grep("(?<!FREEZING )FOG",dfdam$EVTYPE,perl = TRUE)]<-"D FOG"
dfdam$EVTYPE[grep("DENSE SMOKE|SMOKE",dfdam$EVTYPE)]<-"D SMOKE"
dfdam$EVTYPE[grep("ICE|FROST|FREEZ|ICY",dfdam$EVTYPE)]<-"FROST/FREEZE"
dfdam$EVTYPE[grep("RAIN|PRECIP|SHOWER",dfdam$EVTYPE)]<-"HEAVY RAIN"
dfdam$EVTYPE[grep("WIND",dfdam$EVTYPE)]<-"HIGH W"
dfdam$EVTYPE[grep("SLIDE|SLUMP|SPOUT",dfdam$EVTYPE)]<-"LANDSLIDE"
dfdam$EVTYPE[grep("I STORM",dfdam$EVTYPE)]<-"ICE STORM"
dfdam$EVTYPE[grep("FIRE",dfdam$EVTYPE)]<-"WILDFIRE"
dfdam$EVTYPE[grep("\\?|APACHE|BEACH|DAM|DENSE|FUNNEL|GLAZE|MIX|TURBULENCE|URBAN|ACCIDENT|DROWNING|MISHAP|HIGH$",dfdam$EVTYPE)]<-"OTHER"
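The recoding above depends on the order of the statements: specific patterns run before general ones, and shortened labels such as "FLASH F" keep later patterns from matching again. A minimal sketch of a related technique, an ordered rule table where the first matching rule wins. This is not the procedure used above; the rules shown are a small illustrative subset and `classify_evtype` is a hypothetical helper:

```r
# A few illustrative rules, most specific first (not the full list above).
rules <- data.frame(
  pattern = c("FLASH", "COASTAL", "FLOOD", "MARINE (TSTM|THUNDERSTORM) WIND",
              "TSTM|THUNDERSTORM", "WIND"),
  label   = c("FLASH FLOOD", "COASTAL FLOOD", "FLOOD",
              "MARINE THUNDERSTORM WIND", "THUNDERSTORM WIND", "HIGH WIND"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first matching rule wins; unmatched names get "OTHER".
classify_evtype <- function(x, rules) {
  x <- toupper(x)
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(rules))) {
    # Only rows that no earlier rule has claimed can be assigned.
    hit <- is.na(out) & grepl(rules$pattern[i], x)
    out[hit] <- rules$label[i]
  }
  out[is.na(out)] <- "OTHER"
  out
}

recoded <- classify_evtype(c("Flash Flooding", "COASTAL FLOOD", "RIVER FLOOD",
                             "MARINE TSTM WIND", "TSTM WIND/HAIL",
                             "GUSTY WINDS", "FOG"), rules)
recoded
```

Because every pattern is matched against the original name rather than an already recoded one, the labels can be spelled out in full.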
Let us look at the resulting event types:
events<-unique(dfdam[order(dfdam$EVTYPE),"EVTYPE"])
length(events)
## [1] 45
events
## [1] "ASTRONOMICAL LOW T" "AVALANCHE" "BLIZZARD"
## [4] "COASTAL F" "C/W CHILL" "D FOG"
## [7] "DROUGHT" "D SMOKE" "DUST DEVIL"
## [10] "DUST STORM" "EXCESSIVE H" "EXTREME C/W CHILL"
## [13] "FLASH F" "FLOOD" "FROST/FREEZE"
## [16] "HAIL" "HEAT" "HEAVY RAIN"
## [19] "HEAVY SNOW" "HIGH SURF" "HIGH W"
## [22] "HURRICANE" "ICE STORM" "LAKE-EFFECT S"
## [25] "LANDSLIDE" "LIGHTNING" "MARINE HIGH W"
## [28] "MARINE HL" "MARINE STRONG W" "MARINE TDSTM W"
## [31] "OTHER" "RIP CURRENT" "SEICHE"
## [34] "SLEET" "STRONG W" "SURGE/TIDE"
## [37] "THUNDERSTORM W" "TORNADO" "TROPICAL DEPRESSION"
## [40] "TSUNAMI" "VOLCANIC ASH" "WATERSPROUT"
## [43] "WILDFIRE" "WINTER STORM" "WINTER WEATHER"
stateNOAA<-as.character(unique(dfdam$STATE))
#We see that there are more entries than the common 50 (includes other regions and territories)
complement_state<-function(x,y) unique(c(setdiff(x,y),setdiff(y,x)))
dfdam$REGION<-mapvalues(as.character(dfdam$STATE),from = c(as.character(state.abb),
complement_state(stateNOAA,state.abb)),
to =(c(as.character(state.region),rep("Other",17))))
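The mapping above hard-codes the number of non-state codes (17). A minimal sketch that avoids the count by looking each abbreviation up in the built-in `state.abb` and `state.region` vectors; the helper name `state_to_region` is hypothetical:

```r
# Hypothetical helper: census region for a state abbreviation,
# "Other" for anything that is not one of the 50 states.
state_to_region <- function(abb) {
  region <- as.character(state.region[match(as.character(abb), state.abb)])
  region[is.na(region)] <- "Other"
  region
}

state_to_region(c("TX", "OH", "PR", "GU", "CA"))
# "South" "North Central" "Other" "Other" "West"
```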
Other useful variables that could give us an insight of region/time and events correlation are:
2. Elapsed time for each event when both the beginning and the ending date/time were reported. So let us add the column TIME_DIFF
dfdam<-dfdam%>%mutate(END_DATE2=ifelse(is.na(END_DATE),as.character(BGN_DATE),as.character(END_DATE)))
dfdam$END_DATE<-dfdam$END_DATE2
dfdam$END_DT<-as.POSIXct(paste(dfdam$END_DATE,gsub("[^0-9]","",dfdam$END_TIME),sep=" "),format="%Y-%m-%d %H%M")
dfdam$BGN_DT<-as.POSIXct(paste(dfdam$BGN_DATE,gsub("[^0-9]","",dfdam$BGN_TIME),sep=" "),format="%Y-%m-%d %H%M")
dfdam<-dfdam%>% mutate(TIME_DIFF=as.numeric(difftime(END_DT,BGN_DT,units="hours")))
great_distance_hf <- function(lat1,long1,lat2,long2) {
R <- 6371
a <- sin((lat2 - lat1)/2)^2 + cos(lat1) * cos(lat2) * sin((long2 - long1)/2)^2
c <- 2 * asin(min(1,sqrt(a)))
d = R * c
return(d) # km
}
dfdam<-dfdam%>%mutate(
LOC_DIFF=mapply(great_distance_hf,
dfdam$LATITUDE,dfdam$LONGITUDE,
dfdam$LATITUDE_E,dfdam$LONGITUDE_))
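Note that `great_distance_hf` passes its arguments straight to `sin()` and `cos()`, which work in radians in R. A minimal sketch of a vectorised variant that takes decimal degrees; the name `haversine_km` is hypothetical, and whether the LATITUDE/LONGITUDE columns are stored as plain decimal degrees is an assumption that is not checked here:

```r
# Haversine distance in km between two points given in decimal degrees.
# Assumes plain decimal degrees; other encodings must be converted first.
haversine_km <- function(lat1, long1, lat2, long2, R = 6371) {
  to_rad <- pi / 180
  dlat  <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))   # pmin keeps the function vectorised
}

# A quarter of a great circle along the equator: 6371 * pi / 2 km.
haversine_km(0, 0, 0, 90)
```

Since the function is vectorised, it can be applied to whole columns without `mapply`.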
As stated before, I am interested in which types of events are most harmful with respect to population health across the U.S., and which types of events have the greatest economic consequences across the U.S.
First, let us look at the most frequent events throughout the years, which are not necessarily the most harmful. The figure below shows the total number of events from 1993 to 2011 for each type of weather event.
#Theme
gral_theme <- function(base_size = 12, base_family = "sans"){
theme_minimal(base_size = base_size, base_family = base_family) +
theme(
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5),
axis.title = element_text(size = 14),
panel.grid.major = element_line(color = "grey"),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "aliceblue"),
strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1),
strip.text = element_text(face = "bold", size = 12, color = "black"),
legend.position = "bottom",
legend.justification = "top",
legend.box = "horizontal",
legend.background = element_blank(),
panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
)
}
g1<-ggplot(dfdam,aes(x=EVTYPE))
g1+geom_bar()+gral_theme()+labs(title="Total number of events from 1993 to 2011",x="Event type",y="Total number of events")
Total number of events for each event type through 1993 and 2011
# theme(axis.text.x = element_text(angle = 90, hjust = 1))
The most harmful events need not be the most frequent. In order to explore which type of event has been most harmful throughout the years, let us look at the total number of fatalities and injuries for each weather event type, as shown in the chart below.
fatevents<-setNames(aggregate(dfdam$FATALITIES~dfdam$EVTYPE,FUN = sum),c("Type","Fatalities"))
injevents<-setNames(aggregate(dfdam$INJURIES~dfdam$EVTYPE,FUN = sum),c("Type","Injuries"))
fatinjevents<-setNames(data.frame(fatevents$Type,fatevents$Fatalities,injevents$Injuries),c("Type","FATALITIES","INJURIES"))
df_melted <- melt(fatinjevents, id=c("Type"))
g2<-ggplot(df_melted,aes(x=value,y=Type,color=variable))+geom_point()
g2+labs(title="Total number of incidents",x="",y="")+theme(legend.title=element_blank())
Total number of fatalities and injuries for each event type
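The two separate `aggregate` calls used for this chart can be sketched as a single call that sums both outcomes at once. The toy data below is made up for illustration:

```r
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 10, 1, 0, 5),
  INJURIES   = c(20, 0, 7, 2, 1)
)

# Sum both outcomes per event type in one call.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by fatalities, then injuries, largest first.
totals <- totals[order(-totals$FATALITIES, -totals$INJURIES), ]
totals
```

Both sums then share the same rows, so they cannot drift out of alignment when the columns are combined.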
From the figure above we can see that the event type that has caused the most harm cumulatively throughout the years is the tornado. However, this may not be the type of event that has caused the most harm in a single day (or event period). This is illustrated in the following tables, where I show the top 5 most harmful incidents; the first row shows the maximum number of fatalities and injuries in each table, respectively.
dftop<-dfdam[with(dfdam,order(-dfdam$FATALITIES,-dfdam$INJURIES)),c("BGN_DATE",'EVTYPE','STATE','FATALITIES','INJURIES','TIME_DIFF','LOC_DIFF')]
kable(head(dftop,5))
| | BGN_DATE | EVTYPE | STATE | FATALITIES | INJURIES | TIME_DIFF | LOC_DIFF |
|---|---|---|---|---|---|---|---|
| 4539 | 1995-07-12 | HEAT | IL | 583 | 0 | 102.0000000 | 0.000 |
| 214090 | 2011-05-22 | TORNADO | MO | 158 | 1150 | 0.3333333 | 7686.457 |
| 62329 | 1999-07-28 | EXCESSIVE H | IL | 99 | 0 | 68.0000000 | 0.000 |
| 67709 | 1999-07-04 | EXCESSIVE H | PA | 74 | 135 | 48.0000000 | 0.000 |
| 18450 | 1995-07-01 | EXCESSIVE H | PA | 67 | 0 | NA | 0.000 |
dftop2<-dfdam[with(dfdam,order(-dfdam$INJURIES,-dfdam$FATALITIES)),c("BGN_DATE",'EVTYPE','STATE','FATALITIES','INJURIES','TIME_DIFF','LOC_DIFF')]
kable(head(dftop2,5))
| | BGN_DATE | EVTYPE | STATE | FATALITIES | INJURIES | TIME_DIFF | LOC_DIFF |
|---|---|---|---|---|---|---|---|
| 16588 | 1994-02-08 | ICE STORM | OH | 1 | 1568 | 20.0000000 | 0.000 |
| 214090 | 2011-05-22 | TORNADO | MO | 158 | 1150 | 0.3333333 | 7686.457 |
| 213286 | 2011-04-27 | TORNADO | AL | 44 | 800 | 0.7833333 | 6087.296 |
| 58152 | 1998-10-17 | FLOOD | TX | 2 | 800 | 30.2500000 | 0.000 |
| 114987 | 2004-08-13 | HURRICANE | FL | 7 | 780 | 6.0000000 | 0.000 |
We can also look at the variable “TIME_DIFF”, which describes the time elapsed (for those events for which we have the data), to see this. I will first rearrange the event types in a new column, keeping those with a significant number of incidents and grouping the rest as “Other”. To do this I will filter with a limit read off figure 2, since the median and the mean are very small (close to zero).
summary(dfdam$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0479 0.0000 583.0000
summary(dfdam$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.303 0.000 1568.000
sigevtype<-unique(dfdam$EVTYPE[dfdam$FATALITIES>50 | dfdam$INJURIES>200])
sigevtype
## [1] "TORNADO" "BLIZZARD" "HEAT" "ICE STORM" "EXCESSIVE H"
## [6] "FLOOD" "HURRICANE"
nsigtype<-complement_state(sigevtype,events)
dfdam$SIGEVTYPE<-factor(mapvalues(dfdam$EVTYPE,from = nsigtype,
to =rep("Other",length(nsigtype))))
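The grouping step above can also be sketched with `%in%` and `ifelse`, which avoids building the complement list first. The event types and the "significant" subset below are made up for illustration:

```r
# Toy event types and the "significant" subset (both made up).
evtype <- c("TORNADO", "HAIL", "HEAT", "LIGHTNING", "TORNADO")
significant <- c("TORNADO", "HEAT")

# Keep significant types, collapse everything else into "Other".
lumped <- factor(ifelse(evtype %in% significant, evtype, "Other"))
table(lumped)
```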
palet<-c("#89C5DA", "#DA5724", "#74D944", "#CE50CA", "#3F4921", "#C0717C", "#CBD588", "#5F7FC7",
"#673770", "#D3D93E", "#38333E")
dfdam2<-dfdam[!(is.na(dfdam$TIME_DIFF)),]
dfdam2_gather<-dfdam2%>%gather(Type,Incidents,FATALITIES,INJURIES)
ggplot(dfdam2_gather,aes(x=log10(TIME_DIFF),y=log10(Incidents),color=SIGEVTYPE))+ geom_jitter(aes(color = SIGEVTYPE), alpha=.6,size = 1.5) +
labs(
color = "Event Type",
x = "Time elapsed [h]",
y = "Incidents",
title = "1993-2011 US Storm events",
subtitle = "Dataset from NOAA",
caption = ""
)+
scale_y_continuous(limits=c(log10(1), log10(1600)), labels = scales::math_format(10^.x))+
scale_x_continuous(limits=c(log10(1), log10(9000)), labels = scales::math_format(10^.x))+
facet_grid(Type ~ REGION) +
gral_theme()+scale_color_manual(values = palet)
## Warning in FUN(X[[i]], ...): NaNs produced
## Warning: Removed 121720 rows containing missing values (geom_point).
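The `NaNs produced` warning is what `log10()` emits for negative input, and `log10(0)` evaluates to `-Inf`. A minimal sketch, on made-up values, of keeping only strictly positive rows before the transform so that no such values reach the plot:

```r
toy <- data.frame(TIME_DIFF = c(-5, 0, 0.5, 24), Incidents = c(3, 0, 12, 1))

# log10() of a negative number is NaN (with a warning); log10(0) is -Inf.
raw_log <- suppressWarnings(log10(toy$TIME_DIFF))

# Keep only rows where both quantities are strictly positive.
plottable <- subset(toy, TIME_DIFF > 0 & Incidents > 0)
log10(plottable$Incidents)
```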
And finally we can look at the number of incidents (fatalities and injuries) across different regions of the US.
dfdam$ID<-seq.int(nrow(dfdam))
dfdam_gather<-dfdam%>%gather(Group,Incidents,FATALITIES,INJURIES)
dfdam_gather$Group<-as.factor(dfdam_gather$Group)
ggplot(dfdam_gather,aes(x=BGN_DATE,y=log10(Incidents),color=SIGEVTYPE))+
geom_jitter(aes(color = SIGEVTYPE), size = 1.5) +
labs(
color = "Event Type",
x = "Date",
y = "Incidents",
title = "1993-2011 US Storm events",
subtitle = "Dataset from NOAA",
caption = ""
) +
scale_y_continuous(limits=c(log10(1), log10(1600)), labels = scales::math_format(10^.x))+
facet_grid(Group ~ REGION) +
gral_theme()+scale_color_manual(values = palet)
## Warning: Removed 4869 rows containing missing values (geom_point).
The figure below shows the total cost of the damage caused by each type of weather event from 1993 to 2011 across the U.S. It is clear that floods have caused the most damage. Inflation was not considered in this report.
prop<-setNames(aggregate(dfdam$PROPDMG~dfdam$EVTYPE,FUN = sum),c("Type","PROPERTY.DMG"))
crop<-setNames(aggregate(dfdam$CROPDMG~dfdam$EVTYPE,FUN = sum),c("Type","CROP.DMG"))
propcrop<-setNames(data.frame(prop$Type,prop$PROPERTY.DMG,crop$CROP.DMG),c("Type","PROPERTY.DMG","CROP.DMG"))
df_melted <- melt(propcrop, id=c("Type"))
g2<-ggplot(df_melted,aes(x=value,y=Type,color=variable))+geom_point()
g2+labs(title="Total cost of damages",x="",y="")+theme(legend.title=element_blank())
Consequences in property and crop damages
Across the US.
summary(dfdam$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 2.000e+03 8.000e+03 1.748e+06 3.000e+04 1.150e+11
summary(dfdam$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 0.000e+00 0.000e+00 2.164e+05 0.000e+00 5.000e+09
sevtype<-unique(dfdam$EVTYPE[dfdam$PROPDMG>2e9 | dfdam$CROPDMG>1e10]) # same thresholds as 2*10e8 and 10e9
sevtype
## [1] "WINTER STORM" "HURRICANE" "FLOOD" "HEAVY RAIN"
## [5] "ICE STORM" "SURGE/TIDE" "TORNADO"
ntype<-complement_state(sevtype,events)
dfdam$SEVTYPE<-factor(mapvalues(dfdam$EVTYPE,from = ntype,
to =rep("Other",length(ntype))))
dfdam_gather2<-dfdam%>%gather(Group,Damages,PROPDMG,CROPDMG)
dfdam_gather2$Group<-as.factor(dfdam_gather2$Group)
ggplot(dfdam_gather2,aes(x=BGN_DATE,y=log10(Damages),color=SEVTYPE))+
geom_jitter(aes(color = SEVTYPE),alpha=.3, size = 1.5) +
labs(
color = "Event Type",
x = "Date",
y = "Damages",
title = "1993-2011 US Storm events",
subtitle = "Dataset from NOAA",
caption = ""
) + scale_y_continuous(limits=c(log10(1), log10(1.2e11)), labels = scales::math_format(10^.x))+
facet_grid(Group ~ REGION) +
gral_theme()+scale_color_manual(values=palet)
The following tables show the top 5 most damaging events in terms of property and crop cost, where the first row shows the most damaging event for property and crops, respectively.
dftop3<-dfdam[with(dfdam,order(-dfdam$PROPDMG,-dfdam$CROPDMG)),c("BGN_DATE",'EVTYPE','STATE','PROPDMG','CROPDMG','TIME_DIFF','LOC_DIFF')]
kable(head(dftop3,5))
| | BGN_DATE | EVTYPE | STATE | PROPDMG | CROPDMG | TIME_DIFF | LOC_DIFF |
|---|---|---|---|---|---|---|---|
| 135162 | 2006-01-01 | FLOOD | CA | 1.150e+11 | 32500000 | -5 | 0 |
| 128121 | 2005-08-29 | SURGE/TIDE | LA | 3.130e+10 | 0 | 3 | 0 |
| 128120 | 2005-08-28 | HURRICANE | LA | 1.693e+10 | 0 | 18 | 0 |
| 129286 | 2005-08-29 | SURGE/TIDE | MS | 1.126e+10 | 0 | 3 | 0 |
| 125932 | 2005-10-24 | HURRICANE | FL | 1.000e+10 | 0 | 8 | 0 |
dftop4<-dfdam[with(dfdam,order(-dfdam$CROPDMG,-dfdam$PROPDMG)),c("BGN_DATE",'EVTYPE','STATE','PROPDMG','CROPDMG','TIME_DIFF','LOC_DIFF')]
kable(head(dftop4,5))
| | BGN_DATE | EVTYPE | STATE | PROPDMG | CROPDMG | TIME_DIFF | LOC_DIFF |
|---|---|---|---|---|---|---|---|
| 4394 | 1993-08-31 | FLOOD | IL | 5.00e+09 | 5.00e+09 | NA | 0 |
| 10610 | 1994-02-09 | ICE STORM | MS | 5.00e+05 | 5.00e+09 | NA | 0 |
| 129287 | 2005-08-29 | HURRICANE | MS | 5.88e+09 | 1.51e+09 | 3.5000 | 0 |
| 144148 | 2006-01-01 | DROUGHT | TX | 0.00e+00 | 1.00e+09 | 719.9833 | 0 |
| 47054 | 1998-12-20 | EXTREME C/W CHILL | CA | 0.00e+00 | 5.96e+08 | 163.5000 | 0 |
Note that some of the top events are most probably related to Hurricane Katrina (from August 28th to 31st 2005).
The following results show the number of damaging events per region and per state, and which state was the most affected (had the most frequent events) in the period from 1993 to 2011.
table(dfdam$REGION)
##
## North Central Northeast Other South West
## 71694 24033 1421 116253 13554
table(dfdam$STATE)
##
## AK AL AM AN AR AS AZ CA CO CT DC DE
## 521 10161 20 29 6183 49 1536 3154 1706 863 141 358
## FL GA GM GU HI IA ID IL IN KS KY LA
## 7003 10098 48 156 150 14801 773 5491 5265 5941 6300 5288
## LC LE LH LM LO LS MA MD ME MH MI MN
## 0 2 0 101 5 2 2035 2972 574 1 4569 2733
## MO MS MT NC ND NE NH NJ NM NV NY OH
## 5525 10959 944 4658 2016 5831 598 1222 1091 637 9433 12599
## OK OR PA PH PK PM PR PZ RI SC SD SL
## 6400 500 6715 1 7 0 799 6 242 4073 1763 1
## ST TN TX UT VA VI VT WA WI WV WY XX
## 0 10142 19008 832 8828 53 2351 1046 5160 3822 664 0
statemax<-dfdam%>%group_by(STATE)%>%summarise(freq=n())
statemax[statemax$freq==max(statemax$freq),"STATE"]
## # A tibble: 1 × 1
## STATE
## <fctr>
## 1 TX
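The same question can be answered directly from the frequency table. A minimal sketch on a made-up vector of state codes:

```r
state <- c("TX", "OH", "TX", "IA", "TX", "OH")   # made-up sample

# Sort the frequency table so the most affected states come first.
counts <- sort(table(state), decreasing = TRUE)
head(counts, 3)          # the most frequent states, largest first
top_state <- names(counts)[1]
top_state
# "TX"
```

Sorting the whole table also gives the runners-up, which a single maximum does not.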