This report presents an exploratory analysis of the NOAA Storm Data database. The goal is to find which types of events have the greatest economic and health consequences. The study covers the entire United States from 1950 to 2011, with consequences summed across states for each event type. Economic consequences are measured as the sum of property and crop losses; health consequences are examined separately as fatalities and injuries. Floods cause the greatest economic damage, whereas tornadoes and excessive heat have the greatest impact on people’s health.
Initial configuration:
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
The data can be obtained from: “https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2”. Documentation about the data can be obtained from: “https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf”. The .bz2 file should be downloaded and unzipped; the .csv file is inside it.
A description of the columns is also available at: “https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Bulk-csv-Format.pdf”
Read the data into a data frame
data <- read.csv("repdata_data_StormData.csv")
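As an aside, the manual unzip step is optional: read.csv() can read bzip2-compressed files directly, because R's file connections detect the compression when reading. The sketch below illustrates this on a small temporary file rather than the real download. The helper name fetch_if_missing and the toy columns are assumptions made for illustration only.

```r
# Hypothetical helper: download a file only when it is not already on disk.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # "wb" keeps the binary archive intact
  }
  invisible(dest)
}

# Toy stand-in for the storm file: write a tiny CSV through a bzip2 connection.
toy_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_path, open = "wt")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(0, 2)),
          con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently; no manual unzip step is needed.
toy <- read.csv(toy_path)
print(dim(toy))
```

With the real data, the same pattern would be a call to fetch_if_missing() with the URL given above and a local file name, followed by read.csv() on that .bz2 file.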
Explore first rows and dimensions
print(dim(data))
## [1] 902297 37
head(data, 2)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14 100 3 0 0 15 25.0
## 2 0 2 150 2 0 0 0 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
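There are 5 variables of interest for this report: the event type (EVTYPE), property damage (PROPDMG, with its exponent PROPDMGEXP), crop damage (CROPDMG, with its exponent CROPDMGEXP), fatalities (FATALITIES), and injuries (INJURIES).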
Load the dplyr and tidyr libraries (they must be installed beforehand with install.packages()).
#install.packages("dplyr")
#install.packages("tidyr")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
Select the columns of interest
reduced_data <- data %>% select(EVTYPE, CROPDMG, CROPDMGEXP, PROPDMG, PROPDMGEXP, INJURIES, FATALITIES)
head(reduced_data)
## EVTYPE CROPDMG CROPDMGEXP PROPDMG PROPDMGEXP INJURIES FATALITIES
## 1 TORNADO 0 25.0 K 15 0
## 2 TORNADO 0 2.5 K 0 0
## 3 TORNADO 0 25.0 K 2 0
## 4 TORNADO 0 2.5 K 2 0
## 5 TORNADO 0 2.5 K 2 0
## 6 TORNADO 0 2.5 K 6 0
Date range of the data:
summary(as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "1950-01-03" "1995-04-20" "2002-03-18" "1998-12-27" "2007-07-28" "2011-11-30"
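The format string matters here: BGN_DATE is stored as text such as "4/18/1950 0:00:00", and as.Date() keeps only the calendar date. A minimal check is sketched below. The first string matches the rows printed earlier; the second is an assumed text form of the latest date in the summary.

```r
# Sample BGN_DATE strings (the second one is an assumed spelling of the
# maximum date reported by summary()).
raw_dates <- c("4/18/1950 0:00:00", "11/30/2011 0:00:00")

# Parse month/day/year; the time part is matched by the format and then dropped.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y %H:%M:%S")
print(parsed)
print(range(parsed))
```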
Check the exponents for damage
table(reduced_data$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
table(reduced_data$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
Some exponent codes, such as “?” and “h”, are undefined. Let’s study the possible impact of these exponents by summing the raw damage values and counting the records for each code.
factors1 <- reduced_data %>% group_by(CROPDMGEXP) %>% summarise(total = sum(CROPDMG), count = n())
factors1
## # A tibble: 9 × 3
## CROPDMGEXP total count
## <chr> <dbl> <int>
## 1 "" 11 618413
## 2 "0" 260 19
## 3 "2" 0 1
## 4 "?" 0 7
## 5 "B" 13.6 9
## 6 "K" 1342956. 281832
## 7 "M" 34141. 1994
## 8 "k" 436 21
## 9 "m" 10 1
factors2 <- reduced_data %>% group_by(PROPDMGEXP) %>% summarise(total = sum(PROPDMG), count = n())
factors2
## # A tibble: 19 × 3
## PROPDMGEXP total count
## <chr> <dbl> <int>
## 1 "" 527. 465934
## 2 "+" 117 5
## 3 "-" 15 1
## 4 "0" 7108. 216
## 5 "1" 0 25
## 6 "2" 12 13
## 7 "3" 20 4
## 8 "4" 14.5 4
## 9 "5" 210. 28
## 10 "6" 65 4
## 11 "7" 82 5
## 12 "8" 0 1
## 13 "?" 0 8
## 14 "B" 276. 40
## 15 "H" 25 6
## 16 "K" 10735292. 424665
## 17 "M" 140694. 11330
## 18 "h" 2 1
## 19 "m" 38.9 7
Since these exponents may influence the results, they are taken into account using the following interpretation:
# Power of ten assigned to each exponent code
map_factors <- c("0" = 0, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 5, "6" = 6,
                 "7" = 7, "8" = 8, "-" = 0, "+" = 0, "m" = 6, "M" = 6,
                 "B" = 9, "K" = 3, "h" = 2, "H" = 2, "?" = 0)
# "?" is explicitly set to 0 because every damage value with this code is zero.
# Look up the numeric exponent for each row; codes absent from the map (the blank code) give NA, which is replaced by 0.
reduced_data$CROPDMGEXP_num <- map_factors[reduced_data$CROPDMGEXP]
reduced_data$CROPDMGEXP_num[is.na(reduced_data$CROPDMGEXP_num)] <- 0
reduced_data$PROPDMGEXP_num <- map_factors[reduced_data$PROPDMGEXP]
reduced_data$PROPDMGEXP_num[is.na(reduced_data$PROPDMGEXP_num)] <- 0
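As a rough gauge of how much the codes other than K, M and B matter under this interpretation, the per-code property totals printed above can be rescaled by hand. The sketch below uses base R only and copies the rounded totals from that output, so the result is approximate. The object names prop and share_other are introduced here for illustration.

```r
# Rounded PROPDMG totals per exponent code, copied from the summary above.
prop <- data.frame(
  code  = c("", "+", "-", "0", "1", "2", "3", "4", "5", "6", "7", "8",
            "?", "B", "H", "K", "M", "h", "m"),
  total = c(527, 117, 15, 7108, 0, 12, 20, 14.5, 210, 65, 82, 0,
            0, 276, 25, 10735292, 140694, 2, 38.9),
  stringsAsFactors = FALSE
)

# Same interpretation of the codes as in the report.
map_factors <- c("0" = 0, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 5, "6" = 6,
                 "7" = 7, "8" = 8, "-" = 0, "+" = 0, "m" = 6, "M" = 6,
                 "B" = 9, "K" = 3, "h" = 2, "H" = 2, "?" = 0)

# Look up exponents; the blank code is not in the map, so it becomes NA, then 0.
exps <- unname(map_factors[prop$code])
exps[is.na(exps)] <- 0
prop$dollars <- prop$total * 10^exps

# Share of total property damage carried by codes other than K, M, B.
defined <- prop$code %in% c("K", "M", "B")
share_other <- sum(prop$dollars[!defined]) / sum(prop$dollars)
print(signif(share_other, 2))
```

On these rounded inputs the share comes out at a fraction of one percent, which suggests the choice of interpretation for the unusual codes has limited weight in the totals.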
Convert all damage values to millions of dollars so they can be compared on a common scale.
reduced_data_all_exp <- reduced_data %>% mutate(PROPDMG = PROPDMG * 10^PROPDMGEXP_num / 10^6,
CROPDMG = CROPDMG * 10^CROPDMGEXP_num / 10^6) %>%
select(EVTYPE, CROPDMG, PROPDMG, INJURIES, FATALITIES)
head(reduced_data_all_exp)
## EVTYPE CROPDMG PROPDMG INJURIES FATALITIES
## 1 TORNADO 0 0.0250 15 0
## 2 TORNADO 0 0.0025 0 0
## 3 TORNADO 0 0.0250 2 0
## 4 TORNADO 0 0.0025 2 0
## 5 TORNADO 0 0.0025 2 0
## 6 TORNADO 0 0.0025 6 0
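The scaling can be checked by hand on a few values. In the sketch below, the first two rows reproduce the first two records shown above (25.0 K and 2.5 K), while the last two rows are made-up values added to show the M code and the blank code. Only a subset of the mapping is repeated, and the data frame name toy is an assumption.

```r
# Subset of the exponent mapping used in the report.
map_factors <- c("K" = 3, "M" = 6, "B" = 9, "?" = 0)

toy <- data.frame(
  PROPDMG    = c(25.0, 2.5, 1.2, 5),
  PROPDMGEXP = c("K", "K", "M", ""),   # last two rows are hypothetical
  stringsAsFactors = FALSE
)

# Named-vector lookup; the blank code has no entry, so it yields NA, then 0.
exps <- unname(map_factors[toy$PROPDMGEXP])
exps[is.na(exps)] <- 0

# value * 10^exponent gives dollars; dividing by 10^6 gives millions.
toy$millions <- toy$PROPDMG * 10^exps / 10^6
print(toy)
```

The first two results, 0.025 and 0.0025, agree with the PROPDMG column printed above.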
Group by event to highlight the most economically harmful events.
dmg_by_event <- reduced_data_all_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
mutate(total_dmg = CROPDMG + PROPDMG) %>% group_by(EVTYPE) %>%
summarise(economic_dmg = sum(total_dmg)) %>% arrange(desc(economic_dmg))
head(dmg_by_event)
## # A tibble: 6 × 2
## EVTYPE economic_dmg
## <fct> <dbl>
## 1 FLOOD 150320.
## 2 HURRICANE/TYPHOON 71914.
## 3 TORNADO 57362.
## 4 STORM SURGE 43324.
## 5 HAIL 18761.
## 6 FLASH FLOOD 18244.
This part checks whether using only the defined exponents K, M, and B would change the results.
# All values whose exponent is not "K", "M", or "B" are set to zero.
reduced_data_def_exp <- reduced_data %>%
mutate(PROPDMG = if_else(PROPDMGEXP %in% c("K", "M", "B"), PROPDMG, 0),
CROPDMG = if_else(CROPDMGEXP %in% c("K", "M", "B"), CROPDMG, 0))
dmg_by_event_def_exp <- reduced_data_def_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
mutate(PROPDMG = as.numeric(PROPDMG) * 10^PROPDMGEXP_num / 10^6,
CROPDMG = as.numeric(CROPDMG) * 10^CROPDMGEXP_num / 10^6) %>%
mutate(total_dmg = CROPDMG + PROPDMG) %>% group_by(EVTYPE) %>%
summarise(economic_dmg = sum(total_dmg)) %>% arrange(desc(economic_dmg))
head(dmg_by_event_def_exp)
## # A tibble: 6 × 2
## EVTYPE economic_dmg
## <fct> <dbl>
## 1 FLOOD 150320.
## 2 HURRICANE/TYPHOON 71914.
## 3 TORNADO 57341.
## 4 STORM SURGE 43324.
## 5 HAIL 18753.
## 6 FLASH FLOOD 17562.
The ranking of the top events does not change, and the totals differ only slightly (for example, FLASH FLOOD drops from 18244 to 17562), so we can be confident that our interpretation of the exponents does not affect the conclusions.
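The comparison can also be made explicit. The sketch below re-enters the six totals printed in the two tables above (in millions, rounded as shown) and reports whether the order is the same and how large the relative differences are. The object names are chosen here for illustration.

```r
events <- c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
            "STORM SURGE", "HAIL", "FLASH FLOOD")

# Totals copied from the two printed tables (millions, rounded).
all_exp <- data.frame(EVTYPE = events,
                      dmg = c(150320, 71914, 57362, 43324, 18761, 18244),
                      stringsAsFactors = FALSE)
def_exp <- data.frame(EVTYPE = events,
                      dmg = c(150320, 71914, 57341, 43324, 18753, 17562),
                      stringsAsFactors = FALSE)

# Same ranking in both tables?
same_order <- identical(all_exp$EVTYPE, def_exp$EVTYPE)

# Relative drop when only K, M, B are kept.
rel_diff <- (all_exp$dmg - def_exp$dmg) / all_exp$dmg
names(rel_diff) <- all_exp$EVTYPE

print(same_order)
print(round(100 * rel_diff, 2))  # percent
```

On these figures the largest relative change is for FLASH FLOOD, a little under 4%, and it is not enough to alter the order.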
Group by event to highlight the most harmful events for people
harm_by_event <- reduced_data_all_exp %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES)) %>%
arrange(desc(Fatalities)) %>% pivot_longer(cols = 2:3, names_to = "type",
values_to = "Count")
head(harm_by_event)
## # A tibble: 6 × 3
## EVTYPE type Count
## <fct> <chr> <dbl>
## 1 TORNADO Fatalities 5633
## 2 TORNADO Injuries 91346
## 3 EXCESSIVE HEAT Fatalities 1903
## 4 EXCESSIVE HEAT Injuries 6525
## 5 FLASH FLOOD Fatalities 978
## 6 FLASH FLOOD Injuries 1777
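Note that after pivot_longer() the table is in long format, with two rows per event (one for fatalities, one for injuries), so a row count is not the same as an event count. A minimal base-R sketch of selecting the top events by name rather than by row position follows. The toy counts and the names toy_long and n_events are made up for illustration.

```r
# Toy long-format table (made-up counts), already sorted by fatalities.
toy_long <- data.frame(
  EVTYPE = rep(c("A", "B", "C"), each = 2),
  type   = rep(c("Fatalities", "Injuries"), times = 3),
  Count  = c(50, 400, 30, 900, 10, 20),
  stringsAsFactors = FALSE
)

# Pick the first n distinct events, then keep every row belonging to them.
n_events   <- 2
top_events <- head(unique(toy_long$EVTYPE), n_events)
top_long   <- toy_long[toy_long$EVTYPE %in% top_events, ]
print(top_long)
```

Selecting by name keeps both rows of every chosen event, whatever the number of measures per event.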
The 15 most harmful events for people, ranked by fatalities, are shown.
library(ggplot2)
# harm_by_event is in long format (two rows per event), so the top 15 events are the first 30 rows
p <- ggplot(data = harm_by_event[1:30,], aes(x = Count, y = reorder(EVTYPE, Count)))
p + facet_grid(cols = vars(type)) + geom_col(aes(fill = type)) +
  labs(title = "Fatalities and Injuries by Event type", y="Type of Event")
The most harmful event for people is the tornado, with the largest number
of both fatalities and injuries. Excessive heat is second in fatalities,
although it is not second in injuries.
The 15 most economically harmful events are shown.
g <- ggplot(data = dmg_by_event[1:15,], aes(x = economic_dmg, y = reorder(EVTYPE, economic_dmg)))
g + geom_col(fill = "cyan3") + labs(title = "Property and Crop Monetary Damage",
x = "Damage in Millions of Dollars", y="Event")
The most economically harmful events are floods, followed by
hurricanes/typhoons.