Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
library(ggplot2)
library(knitr)
library(printr)
library(plyr)
library(dplyr)
library(lubridate)
library(gridExtra)
library(reshape2)
options(scipen = 1) # Turn off scientific notations for numbers
The data for this project come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. Data can be downloaded from the following link :
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
We have downloaded and unzipped the data in the folder named “Data” in the project working directory. We now load the data to R using the read.csv
method.
if (!"data" %in% ls()) {
data <- read.csv("Data/repdata_data_StormData.csv", sep=",")
}
In order to get a better sense of the data, we first look at the structure of the dataset :
str(data)
## 'data.frame': 902297 obs. of 39 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## $ date : POSIXct, format: "1950-04-18" "1950-04-18" ...
## $ year : num 1950 1950 1951 1951 1951 ...
Therefore, there are 902297
row and 39
in the dataset. The class of teh variables as well as the first few records were also present the output.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
In order to verify this fact, we create a histogram of year of events.
data$date <- mdy_hms(data$BGN_DATE) # change begin date to a date format
data$year<- year(data$date) # extract year from the begin date
ggplot (data, aes(year))+
geom_histogram(binwidth=3, alpha=.5, position="identity", fill="blue", col="blue")+
ggtitle ("Histogram of the start year of events")+
xlab("Year")+
ylab("Frequency")
Histogram clearly indicates that the number of events in the dataset increase after 1990.
In this section, we separately explore the impact of the storms and severe weather events on population health and economy. Specifically we are interested to know which type of events are most harmful with respect to the population health,and economy.
In this section we explore that across the United States, which types of events are most harmful with respect to population health.
To answer this question we focus on the total number of FATALITIES
and INJURIES
caused by the storm and events. Therefore we create a frequency table of event type and the total number of fatalities and injuries.
Due to the large number of event types we focus only on the top 15 most harmful event in relation to fatality and injury.
by_event <- group_by(data, EVTYPE)
summary_data <- summarise (by_event,
Num_event = n(),
Total_injury = sum(INJURIES),
Total_fatality = sum(FATALITIES),
Total_injury_fatality = sum(INJURIES+FATALITIES)
)
arranged_data <- arrange(summary_data, desc(Total_injury, Total_fatality, Total_injury_fatality))
subset25 <- slice(arranged_data, 1:15)
subset25
EVTYPE | Num_event | Total_injury | Total_fatality | Total_injury_fatality |
---|---|---|---|---|
TORNADO | 60652 | 91346 | 5633 | 96979 |
TSTM WIND | 219940 | 6957 | 504 | 7461 |
FLOOD | 25326 | 6789 | 470 | 7259 |
EXCESSIVE HEAT | 1678 | 6525 | 1903 | 8428 |
LIGHTNING | 15754 | 5230 | 816 | 6046 |
HEAT | 767 | 2100 | 937 | 3037 |
ICE STORM | 2006 | 1975 | 89 | 2064 |
FLASH FLOOD | 54277 | 1777 | 978 | 2755 |
THUNDERSTORM WIND | 82563 | 1488 | 133 | 1621 |
HAIL | 288661 | 1361 | 15 | 1376 |
WINTER STORM | 11433 | 1321 | 206 | 1527 |
HURRICANE/TYPHOON | 88 | 1275 | 64 | 1339 |
HIGH WIND | 20212 | 1137 | 248 | 1385 |
HEAVY SNOW | 15708 | 1021 | 127 | 1148 |
WILDFIRE | 2761 | 911 | 75 | 986 |
\(\;\)
The table is arranged in descending order of total injury, total fatality and total injury and fatality combined. The table clearly shows that TORNADO
is the most harmful event. After that TSTM WIND
and FLOOD
are in the second and third position.
Bar plot below is also created to visualize the frequency of the total injury and fatality by event type for the top 15 events. Again we can observe that TORNADO
is clearly indicated as the most hamful event.
subset25 <- transform(subset25, EVTYPE = reorder(EVTYPE, Total_injury_fatality))
ggplot(data=subset25, aes(x=EVTYPE, y=Total_injury, fill=Total_injury_fatality)) +
geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
geom_text(aes(label=Total_injury_fatality), hjust=-0.05 ,position=position_dodge(.9), size=3)+
theme_bw( base_family = "Avenir", base_size = 12)+
labs(title = "Top 15 most harmful events to public health") +
labs(x="Event type", y="Total injury and fatality" )+
coord_flip()+
theme(legend.position="none")
In this section of the report we explore that across the United States, which types of events have the greatest economic consequences.
To answer this question we focus on the number of property damage (PROPDMG
) and crop damage (CROPDMG
) in the dataset. However there is a multiplier associated with each observation of property and crop damage. Specifically CROPDMGEXP
and PROPDMGEXP
record the value of multipliers where we have hundred (H), Thousand (K), Million (M), Billion (B) and some Na values (+,?,…). Therefore, in order to find a meaningful value for the total damage, we need to first code the multipliers to a numerical value.
First we subset the appropriate variables and take a look at the levels of PROPDMGEXP
and CROPDMGEXP
variables.
subData <- select(data, EVTYPE, PROPDMG,CROPDMG,CROPDMGEXP,PROPDMGEXP) # selecting a subset of variables
levels <- union (levels(subData$CROPDMGEXP), levels(subData$PROPDMGEXP)) %>%
sort(decreasing = TRUE)
levels
## [1] "M" "m" "K" "k" "H" "h" "B" "8" "7" "6" "5" "4" "3" "2" "1" "0" "+"
## [18] "?" "-" ""
The following function is written to to code the factor values to numeric values. We decided to code NA
values as 1
.
recodeFunction <- function(x){
result <- toupper(x)
result <- mapvalues(result,
from = c("B", "M", "K", "H" , "1", "2", "3", "4", "5", "6", "7", "8", "0", "+" ,"?", "-" ,""),
to = c(1000000000, 1000000, 1000, 100, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1, 1, 1, 1, 1))
result <- as.numeric(result)
#the value of NAs will be coded as zero
#result[is.na(result)] <- 0
result
}
Now we apply the function to create the total property damage and total crop damage by event type. We create a table to list the top 15 most harmful events with respect to the total cost of crop damage, total cost of property damage and the total cost of crop and property damage combined.
subData <- mutate (subData, PROPIndex= recodeFunction(PROPDMGEXP), CROPIndex=recodeFunction(CROPDMGEXP) )
## The following `from` values were not present in `x`: H, 1, 3, 4, 5, 6, 7, 8, +, -
subData <- mutate (subData, Prop_damage =PROPIndex*PROPDMG , Crop_damage=CROPIndex*CROPDMG)
summary_data <- group_by(subData,EVTYPE )%>%
summarise (Num_event = n(),
Total_property_damage = sum(Prop_damage),
Total_crop_damage = sum(Crop_damage),
Total_crop_property_damage = sum(Prop_damage+Prop_damage)
)%>%
arrange(desc(Total_property_damage,Total_crop_damage,Total_crop_property_damage )) %>%
slice(1:15)
# top_n(20, Total_crop_property_damage)
kable(summary_data, align = "c")
EVTYPE | Num_event | Total_property_damage | Total_crop_damage | Total_crop_property_damage |
---|---|---|---|---|
FLOOD | 25326 | 144657709807 | 5661968450 | 289315419614 |
HURRICANE/TYPHOON | 88 | 69305840000 | 2607872800 | 138611680000 |
TORNADO | 60652 | 56947380676 | 414953270 | 113894761353 |
STORM SURGE | 261 | 43323536000 | 5000 | 86647072000 |
FLASH FLOOD | 54277 | 16822673978 | 1421317100 | 33645347957 |
HAIL | 288661 | 15735267513 | 3025954473 | 31470535025 |
HURRICANE | 174 | 11868319010 | 2741910000 | 23736638020 |
TROPICAL STORM | 690 | 7703890550 | 678346000 | 15407781100 |
WINTER STORM | 11433 | 6688497251 | 26944000 | 13376994502 |
HIGH WIND | 20212 | 5270046295 | 638571300 | 10540092590 |
RIVER FLOOD | 173 | 5118945500 | 5029459000 | 10237891000 |
WILDFIRE | 2761 | 4765114000 | 295472800 | 9530228000 |
STORM SURGE/TIDE | 148 | 4641188000 | 850000 | 9282376000 |
TSTM WIND | 219940 | 4484928495 | 554007350 | 8969856990 |
ICE STORM | 2006 | 3944927860 | 5022113500 | 7889855720 |
\(\;\)
Below we create a bar plot to better visualize the top 15 most harmful events with respect to the cost of crop damage and property damage separately.
barplot_data <- select(summary_data, EVTYPE, Total_property_damage, Total_crop_damage)
# fisrt we need to reshape the data
barplot_data <- melt(barplot_data, id="EVTYPE", variable.name="Damage", value.name="Total_cost")
ggplot(barplot_data, aes(x=reorder(EVTYPE, Total_cost), y=Total_cost, fill=Damage)) +
geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
theme_bw( base_family = "Avenir", base_size = 12)+
scale_fill_manual(values=c("#9ECAE1", "#3282BD")) +
theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1))+
labs(title = "Top 15 harmful events with higher economic impact") +
labs(x="Event type", y="Cost of damage " )
\(\;\)
Therefore, as we can clearly see in the bar plot that the FLOOD is the most harmful event with respect to the property damage. With respect to the crop damage, the FLOOD has still the highest cost, however the RIVER FLOOD and ICE STORM are in the next positions with very close numbers.
We also create a bar plot to visualize the top 15 most harmful events with respect to the total crop and property cost combined.
ggplot(summary_data, aes(x=reorder(EVTYPE, Total_crop_property_damage ), y=Total_crop_property_damage , fill=Total_crop_property_damage )) +
geom_bar(stat="identity", position='dodge', size=0.1, alpha=.85)+
theme_bw( base_family = "Avenir", base_size = 12)+
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=1))+
geom_text(aes(label=Total_crop_property_damage), hjust=-0.05 ,position=position_dodge(.9), size=3)+
labs(title = "Top 15 harmful events with higher economic impact ") +
labs(x="Event type", y="Total cost of crop and property damage " )+
coord_flip()+
theme(legend.position="none")
\(\;\)
As we can see in the bar plot of the total crop and property damage, FLOOD is the weather event with highest economic impact on crop and property costing a total of 289315419614 US Dollars .