This is a partial exploration of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (hereafter, “the data”). Storms and other severe weather events can result in fatalities, injuries, and property damage. The data spans 61 years (1950 to 2011) and is organized in 37 columns and more than 902,000 rows.
As reported at the NOAA web site, “since 1993 these narratives have been entered digitally into computer-based records by NWS personnel and loaded into the Storm Events Database…. Prior to 1993, the narratives were typed for the paper publication.” The data is organized by STATE (in column 1); some states show more activity than others, and some events have more remarks than others. Some of the most interesting data are in columns such as “EVTYPE”, “MAG”, “FATALITIES”, “INJURIES”, “PROPDMG” and “CROPDMG”.
First, we read the data and load some libraries:
AllDataStorm <- read.csv("repdata-data-StormData.csv", sep=",", header=T, quote = "", stringsAsFactors = FALSE);
library(dplyr)
library(tidyr)
library(ggplot2)
Now we pre-process the data and examine it briefly:
cnames <- unlist(strsplit(colnames(AllDataStorm), "[.]"));
cnames <- cnames[cnames != "X"];
colnames(AllDataStorm) <- cnames;
head(sort(table(AllDataStorm$EVTYPE), decreasing = TRUE))
##
##                                  "HAIL"         "TSTM WIND"
##              801945              288654              219940
## "THUNDERSTORM WIND"           "TORNADO"       "FLASH FLOOD"
##               82562               60640               54225
names(AllDataStorm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
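The literal quotation marks around the event types above, and the X and dot fragments removed from the column names, are consistent with reading the file with quote = "", which disables quote handling in read.csv. The sketch below illustrates this on a tiny made-up CSV string (the string and the variable names are illustrative assumptions, not part of the original analysis). With quoting disabled, commas inside quoted text fields are also treated as separators, which may explain the stray text fragments that show up later in the exponent columns.

```r
# Tiny made-up CSV with quoted fields (illustrative data only)
csv_text <- '"EVTYPE","FATALITIES"\n"HAIL",0\n"TORNADO",3\n'

# Quoting disabled, as in the read.csv call above:
# the double quotes stay in the values as ordinary characters
raw_quotes <- read.csv(text = csv_text, quote = "", stringsAsFactors = FALSE)

# Default quoting: the double quotes are stripped while parsing
default_quotes <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(raw_quotes[[1]])        # values still wrapped in literal quotes
print(default_quotes$EVTYPE)  # plain values
```

This is why the comparisons further down match strings such as "\"K\"" rather than "K".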
We assume that the harm caused by each type of event can be evaluated by counting “FATALITIES” and “INJURIES”.
1.1 We therefore do not need all of the columns, only “EVTYPE”, “FATALITIES” and “INJURIES”:
data<- AllDataStorm;
data$FATALITIES <- as.numeric(data$FATALITIES);
data$INJURIES <- as.numeric(data$INJURIES);
data <- data[(!is.na(data$FATALITIES) & data$FATALITIES > 0) | (!is.na(data$INJURIES) & data$INJURIES > 0), c("EVTYPE","FATALITIES","INJURIES")];
1.2 Now we group using the column “EVTYPE”
data <- group_by(data, EVTYPE);
data <- summarise(data, FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES));
data <- arrange(data, desc(FATALITIES + INJURIES));
1.3 Now we can plot. Preliminary tests show that there is no need to go beyond
the seven most harmful types of event. We noted that HEAT and FLASH FLOOD
(event types number 6 and 7), though small in number of casualties, show a
high proportion of fatalities to injuries: of the total casualties for HEAT
or FLASH FLOOD, almost 30% are fatalities.
data <- data[1:7,];
data <- gather(data, HMTYPE, HMVALUE, FATALITIES:INJURIES);
ggplot(data, aes(x = reorder(EVTYPE, -HMVALUE), y = HMVALUE, fill = HMTYPE)) + geom_bar(stat = "identity") + scale_fill_manual(values = c("black", "red"))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0)) + labs(x = "Type of Event", y = "Number of People Affected (1950-2011)") + labs(title = "The Seven Most Harmful Types of Weather Events");
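The fatality share mentioned in 1.3 is the number of fatalities divided by the total casualties (fatalities plus injuries) for an event type. A minimal sketch of that calculation follows; the event names and figures are made up for illustration and are not taken from the data.

```r
# Made-up casualty totals per event type (illustrative values only)
casualties <- data.frame(
  EVTYPE     = c("EVENT A", "EVENT B"),
  FATALITIES = c(30, 5),
  INJURIES   = c(70, 95),
  stringsAsFactors = FALSE
)

# Share of casualties that are fatalities, per event type
casualties$FATAL_SHARE <- casualties$FATALITIES /
  (casualties$FATALITIES + casualties$INJURIES)

print(casualties)
```

Applied to the grouped table from step 1.2 (before it is reshaped for plotting), the same expression would give the proportion for each real event type.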
The columns with information on damage to properties and crops, and their economic costs, are “PROPDMG” and “CROPDMG”.
2.1 We therefore need only those two columns, together with their exponent columns “PROPDMGEXP” and “CROPDMGEXP”:
AllDataStorm$PROPDMG <- as.numeric(AllDataStorm$PROPDMG);
## Warning: NAs introduced by coercion
AllDataStorm$CROPDMG <- as.numeric(AllDataStorm$CROPDMG);
## Warning: NAs introduced by coercion
data_prop <- AllDataStorm[!is.na(AllDataStorm$PROPDMG) & AllDataStorm$PROPDMG > 0, c("EVTYPE", "PROPDMG", "PROPDMGEXP")];
data_crop <- AllDataStorm[!is.na(AllDataStorm$CROPDMG) & AllDataStorm$CROPDMG > 0, c("EVTYPE", "CROPDMG", "CROPDMGEXP")];
2.2 On page 12 of the National Weather Service instructions: “alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B”
for billions”, and those can be lower case or capital letters. Other identifiers will count as zero (0), that is, as invalid data.
We must therefore apply the following conversion to our data:
DamageExp      Unit
B              1,000,000,000
M              1,000,000
K              1,000
H              100
NA or BLANK    1
2.3 However, sometimes these columns hold strings with comments instead of valid codes. As a result, there are 28 levels for the
variable “PROPDMGEXP” and 13 levels for “CROPDMGEXP”, listed below: 41 items in all (28 + 13). Appendix B1 of the aforementioned document defines
Property Damage Estimates (the file I found corresponds to 2007).
## [1] ""
## [2] " which encompassed parts of the southeastern section of the island"
## [3] "\"-\""
## [4] "\"+\""
## [5] "\"0\""
## [6] "\"2\""
## [7] "\"3\""
## [8] "\"4\""
## [9] "\"5\""
## [10] "\"6\""
## [11] "\"7\""
## [12] "\"B\""
## [13] "\"h\""
## [14] "\"H\""
## [15] "\"K\""
## [16] "\"m\""
## [17] "\"M\""
## [18] "0"
## [19] "0.00"
## [20] "000 acres slightly damaged"
## [21] "000 chickens were destroyed on a chicken farm when the structure collapsed during the storm. While miraculously"
## [22] "000 in Howard County"
## [23] "10.00"
## [24] "2.00"
## [25] "5.00"
## [26] "50.00"
## [27] "500.00"
## [28] "7.00"
## [1] ""
## [2] " 106)"
## [3] " which encompassed parts of the southeastern section of the island"
## [4] "\"0\""
## [5] "\"B\""
## [6] "\"k\""
## [7] "\"K\""
## [8] "\"m\""
## [9] "\"M\""
## [10] "0.00"
## [11] "1645CST"
## [12] "50.00"
## [13] "600 in Anne Arundel County"
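The command that produced the two listings above is not shown. One way to obtain such a list of distinct values, and to see how often each occurs, is sketched below on a small made-up vector (the vector and variable names are assumptions for illustration).

```r
# Made-up exponent codes, including an empty value and a stray comment fragment
exp_codes <- c("\"K\"", "\"M\"", "", "\"K\"", "000 in Howard County")

# Distinct values present in the column
distinct_codes <- unique(exp_codes)

# How many rows carry each value; useful for judging how much
# data would be lost by treating the odd entries as invalid
code_counts <- table(exp_codes)

print(distinct_codes)
print(code_counts)
```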
2.4 Now we need to map each of these exponent codes to its numeric multiplier, so that we can later add up the damages:
data_prop[data_prop$PROPDMGEXP == "" |
data_prop$PROPDMGEXP == "\"-\"" |
data_prop$PROPDMGEXP == "\"+\"" |
data_prop$PROPDMGEXP == "\"0\"" |
data_prop$PROPDMGEXP == "\"1\"" |
data_prop$PROPDMGEXP == "\"2\"" |
data_prop$PROPDMGEXP == "\"3\"" |
data_prop$PROPDMGEXP == "\"4\"" |
data_prop$PROPDMGEXP == "\"5\"" |
data_prop$PROPDMGEXP == "\"6\"" |
data_prop$PROPDMGEXP == "\"7\"" ,
c("PROPDMGEXP")] <- "1";
data_prop[data_prop$PROPDMGEXP == "\"H\"" | data_prop$PROPDMGEXP == "\"h\"", c("PROPDMGEXP")] <- "100";
data_prop[data_prop$PROPDMGEXP == "\"K\"" , c("PROPDMGEXP")] <- "1000";
data_prop[data_prop$PROPDMGEXP == "\"M\"" | data_prop$PROPDMGEXP == "\"m\"", c("PROPDMGEXP")] <- "1000000";
data_prop[data_prop$PROPDMGEXP == "\"B\"", c("PROPDMGEXP")] <- "1000000000";
data_prop[data_prop$PROPDMGEXP != "1" &
data_prop$PROPDMGEXP != "100" &
data_prop$PROPDMGEXP != "1000" &
data_prop$PROPDMGEXP != "1000000" &
data_prop$PROPDMGEXP != "1000000000",
c("PROPDMGEXP")] <- "0";
data_crop[data_crop$CROPDMGEXP == "" | data_crop$CROPDMGEXP == "\"0\"", c("CROPDMGEXP")] <- "1";
data_crop[data_crop$CROPDMGEXP == "\"K\"" | data_crop$CROPDMGEXP == "\"k\"", c("CROPDMGEXP")] <- "1000";
data_crop[data_crop$CROPDMGEXP == "\"M\"" | data_crop$CROPDMGEXP == "\"m\"", c("CROPDMGEXP")] <- "1000000";
data_crop[data_crop$CROPDMGEXP == "\"B\"", c("CROPDMGEXP")] <- "1000000000";
data_crop[data_crop$CROPDMGEXP != "1" &
data_crop$CROPDMGEXP != "1000" &
data_crop$CROPDMGEXP != "1000000" &
data_crop$CROPDMGEXP != "1000000000",
c("CROPDMGEXP")] <- "0";
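The same kind of mapping can be written more compactly with a named lookup vector. The sketch below is an alternative, not the code used above: the names unit_by_code and exp_to_multiplier are hypothetical, and it follows the conversion table strictly, so digits and symbols such as "0" to "7", "-" and "+" count as 0 here, whereas step 2.4 assigns them a multiplier of 1.

```r
# Multipliers from the conversion table in 2.2 (hypothetical helper names)
unit_by_code <- c(B = 1e9, M = 1e6, K = 1e3, H = 100)

exp_to_multiplier <- function(codes) {
  codes[is.na(codes)] <- ""                      # treat NA like a blank code
  # Drop the literal double quotes left in the values and ignore case
  cleaned <- toupper(gsub("\"", "", codes, fixed = TRUE))
  multiplier <- unname(unit_by_code[cleaned])    # NA where the code is not in the table
  multiplier[cleaned == ""] <- 1                 # blank means "no scaling"
  multiplier[is.na(multiplier)] <- 0             # anything else is invalid
  multiplier
}

example_codes <- c("\"K\"", "\"m\"", "", "\"B\"", "\"h\"", "1645CST", NA)
example_multipliers <- exp_to_multiplier(example_codes)
print(example_multipliers)
```

A damage column would then be scaled with something like PROPDMG * exp_to_multiplier(PROPDMGEXP).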
2.5 Almost there! We now group by “EVTYPE” so that we can add up the economic costs that the different weather events have had in the
given period, 1950 to 2011.
data_prop$PROPDMG <- data_prop$PROPDMG * as.numeric(data_prop$PROPDMGEXP);
data_prop <- data_prop[, 1:2];
data_prop <- group_by(data_prop, EVTYPE);
data_prop <- summarise(data_prop, DMG = sum(PROPDMG));
data_prop <- mutate(data_prop, Type = "PROPDMG");
data_crop$CROPDMG <- data_crop$CROPDMG * as.numeric(data_crop$CROPDMGEXP);
data_crop <- data_crop[, 1:2];
data_crop <- group_by(data_crop, EVTYPE);
data_crop <- summarise(data_crop, DMG = sum(CROPDMG));
data_crop <- mutate(data_crop, Type = "CROPDMG");
costs <- rbind(data_prop, data_crop);
costs <- spread(costs, Type, DMG);
costs[is.na(costs$CROPDMG), c("CROPDMG")] <- 0;
costs[is.na(costs$PROPDMG), c("PROPDMG")] <- 0;
costs <- arrange(costs, desc(PROPDMG + CROPDMG));
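The core of step 2.5 is: scale each damage figure by its multiplier, sum by event type, and sort. A minimal base R sketch of that idea on made-up rows follows; the data frame, the column names PROPMULT and PROP_USD, and the figures are assumptions for illustration only.

```r
# Made-up damage records with an already numeric multiplier column
events <- data.frame(
  EVTYPE   = c("FLOOD", "FLOOD", "DROUGHT"),
  PROPDMG  = c(2, 1.5, 10),
  PROPMULT = c(1e6, 1e3, 1e3),
  stringsAsFactors = FALSE
)

# Damage in dollars for each record
events$PROP_USD <- events$PROPDMG * events$PROPMULT

# Total per event type, largest first
totals <- aggregate(PROP_USD ~ EVTYPE, data = events, FUN = sum)
totals <- totals[order(-totals$PROP_USD), ]

print(totals)
```

The dplyr calls above do the same for property and crop damage separately and then combine the two totals.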
2.6 This time, let’s plot the ten types of weather event with the highest impact on the economy:
costs <- costs[1:10,];
costs <- gather(costs, TYPE, VALUE, CROPDMG:PROPDMG);
ggplot(costs, aes(x = reorder(EVTYPE,VALUE), y = VALUE/1E+9, fill=TYPE)) + geom_bar(stat = "identity") + scale_fill_manual(values = c("green", "gray"))+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0)) + labs(x = "Event Type", y = "Economic Costs in Billions of Dollars") + labs(title = "The Impact on Economy Due To Weather Events: 1950 - 2011");
Our answer to Question 2 is: the two worst types of weather event, in terms of their effects on the
economy of the affected communities, are FLOODS and HURRICANES/TYPHOONS.
Even though DROUGHT is not the ‘worst’ of the weather events, we can see that its effects on CROPS and
PROPERTY are comparable.
Closing Remarks: We have examined data from the NOAA Storm database for the period 1950 to 2011. We have
found that the worst type of weather event, in terms of its effects on the health of the people
of North American communities, is the TORNADO. We have also found that the worst type of weather event, in terms of its effects on the economy, measured as damage to property and crops, is the FLOOD.