The NOAA storm database available on the course website was analyzed. It contains data from 1950 to 2011. The database was cleaned by removing records classified as NA. Then, the values were filtered for each question, taking only the data that had a value greater than zero in both parameters of each question. For question 1, the values expressed in the INJURIES and FATALITIES columns were considered. For question 2, the values of the PROPDMG and CROPDMG columns were taken with their respective adjustment values for Thousands (K), Millions (M), and Billions (B). Tornadoes were found to be the event type that causes the most damage to people across the US, with a maximum that exceeds its successor by almost 5 times, as well as the highest mortality rate. Floods were found to be the event type that causes the greatest economic damage, with more damage to properties than crops.
This section shows how the data was collected and cleaned for further analysis.
The data was obtained form the course “Reproducible Research” website from Coursera platform. The data is a csv document that containsstorm events since 1950 and ends in 2011.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data <- read.csv("repdata_data_StormData.csv.bz2")
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Fatalities and injuries columns indicates this parameter
q1_df <- data[,c("STATE","COUNTY","COUNTYNAME","EVTYPE","FATALITIES","INJURIES")]
# Then I filter for all the events that present some injury or fatality
q1_df <- filter(q1_df, FATALITIES > 0 & INJURIES > 0)
# Grouped by event type:
ppl_dmg <- q1_df %>% group_by(EVTYPE) %>%
summarise(injuries = sum(INJURIES), mortality = sum(FATALITIES))
# and ordered to see which event is more harmful
ppl_dmg[with(ppl_dmg, order(-injuries,-mortality)), ]
## # A tibble: 86 × 3
## EVTYPE injuries mortality
## <chr> <dbl> <dbl>
## 1 TORNADO 60187 5227
## 2 EXCESSIVE HEAT 4791 402
## 3 FLOOD 2679 104
## 4 ICE STORM 1720 35
## 5 HEAT 1420 73
## 6 HURRICANE/TYPHOON 1219 32
## 7 BLIZZARD 718 48
## 8 LIGHTNING 649 283
## 9 TSTM WIND 646 199
## 10 FLASH FLOOD 641 171
## # … with 76 more rows
# in order to show the damage in a plot, the data was filtered again considering the value of the 75% superior class for injuries (129.75)
ppl_maxdmg <- filter(ppl_dmg, injuries > 129)
plot1 <- ggplot()+ geom_point(data = ppl_maxdmg, aes(x=EVTYPE, y=injuries), color="red", shape=4)+
geom_point(data = ppl_maxdmg, aes(x=EVTYPE, y=mortality),
color="blue", shape=5)+
xlab("Event Type")+
ylab("Total damage people health")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
print(plot1)
Despite all analysis the graph and the filtered tables shows that the worst event type related to clime it is the Tornadoes, with a max amount of people injured (60187) and with a fatality amount of 5227. the next climate event is the Excessive heat, but this is way to far of the Tornado event, with max amount of injury and fatalities of 4791 and 402 respectly.
Columns to be analyzed: - PROPDMG: indicate the economic estimated amount per event - PROPDMGEXP: character that indicate thousand (K), Million(M) an Billion (B). Same structure for the parameter CROPDMG (amount of damages in crops)
q2_df <- data[,c("STATE","EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
q2_df <- filter(q2_df, PROPDMG > 0 & CROPDMG > 0) #This eliminates the NA values too in both columns
# normalization of values for the expansion rate
unique(q2_df$PROPDMGEXP)
## [1] "B" "M" "m" "K" "5" "0" "" "3"
unique(q2_df$CROPDMGEXP)
## [1] "M" "K" "m" "k" "B" "0" ""
q2_df["PROPDMGEXP"][q2_df["PROPDMGEXP"] == "m"] <- "M"
q2_df["PROPDMGEXP"][q2_df["PROPDMGEXP"] == ""] <- "0"
q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == "m"] <- "M"
q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == "k"] <- "K"
q2_df["CROPDMGEXP"][q2_df["CROPDMGEXP"] == ""] <- "0"
# empty values were transformed to exp "0" and other letter were transformed to uppercase
head(q2_df[, c("PROPDMG","PROPDMGEXP")])
## PROPDMG PROPDMGEXP
## 1 0.1 B
## 2 5.0 M
## 3 25.0 M
## 4 48.0 M
## 5 20.0 M
## 6 50.0 K
mean(q2_df$PROPDMG)
## [1] 66.2253
# The "num" variable allows to identify the values out of range, and the number of these instances
num <- filter(q2_df, PROPDMGEXP == 5 | PROPDMGEXP == 3 )
# there are three values in the PROPDMGEXP column that indicates non-factor value
# for these values, the amount of PROPDMG is less than the mean of the same column
# so these values were removed from the total calculations.
q2_df <- q2_df %>% filter(!(PROPDMGEXP == 5 | PROPDMGEXP == 3))
# Iteration to replace values from Exp to number
prop_vect <- numeric()
crop_vect <- numeric()
# this iteration creates two vector, the they are multiply to add new columns values
# that allows to group by event type
for (i in q2_df$PROPDMGEXP) {
if (i == "K"){i <- 1000}
else if (i == "M"){i <- 1000000}
else if (i == "B"){i <-1000000000}
prop_vect <- c(prop_vect, i)
}
for (i in q2_df$CROPDMGEXP) {
if (i == "K"){i <- 1000}
else if (i == "M"){i <- 1000000}
else if (i == "B"){i <-1000000000}
crop_vect <- c(crop_vect, i)
}
q2_df$prop_damage <- as.numeric(prop_vect) * q2_df$PROPDMG
q2_df$crop_damage <- as.numeric(crop_vect) * q2_df$CROPDMG
q2_df$total_damage <- q2_df$prop_damage + q2_df$crop_damage
# Group the event type and sum the estimated values of economic loss
eco_dmg <- q2_df %>% group_by(EVTYPE) %>%
summarise(p_dmg = sum(prop_damage), c_dmg = sum(crop_damage), t_dmg = sum(total_damage))
plot2 <- ggplot()+ geom_point(data = eco_dmg, aes(x=EVTYPE,y=t_dmg),
color="red")+
xlab("Event Type")+ylab("Total economic damage")+theme_bw()+
theme(axis.text.x = element_text(angle = 90))
print(plot2)
# Filter the damage according to quantiles
quantile(eco_dmg$t_dmg)
## 0% 25% 50% 75% 100%
## 100 423750 8755000 142077900 126044533500
mean(eco_dmg$t_dmg)
## [1] 2164341765
eco_maxdmg <- filter(eco_dmg, t_dmg > mean(eco_dmg$t_dmg))
# table with max values in Million USD$
eco_maxdmg$t_dmg_MUSD <- eco_maxdmg$t_dmg/1000000
eco_maxdmg[with(eco_maxdmg, order(-t_dmg_MUSD)), ]
## # A tibble: 9 × 5
## EVTYPE p_dmg c_dmg t_dmg t_dmg_MUSD
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 FLOOD 121971090050 4073443450 126044533500 126045.
## 2 HURRICANE/TYPHOON 26740295000 2607822800 29348117800 29348.
## 3 HURRICANE 7809278000 2688910000 10498188000 10498.
## 4 RIVER FLOOD 5079635000 5028734000 10108369000 10108.
## 5 ICE STORM 86504000 5022113500 5108617500 5109.
## 6 FLASH FLOOD 2926981010 1381860350 4308841360 4309.
## 7 HAIL 1960393440 1853569100 3813962540 3814.
## 8 TORNADO 1987231000 398705950 2385936950 2386.
## 9 HURRICANE OPAL 2168000000 19000000 2187000000 2187
plot3 <- ggplot() + geom_point(data=eco_maxdmg, aes(x=EVTYPE, y=t_dmg_MUSD), color="red")+
xlab("Event Type")+
ylab("Total economic damage")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
print(plot3)
max(eco_maxdmg$t_dmg_MUSD)
## [1] 126044.5
max(eco_maxdmg$p_dmg)/1000000 # Million USD$ for better readability
## [1] 121971.1
max(eco_maxdmg$c_dmg)/1000000 # Million USD$ for better readability
## [1] 5028.734
For the economic consequences, the Flood events are most harmful, with MUSD 126,044.5, with more severe damage to the property (MUSD 121,971.1) than the crops (MUSD 5,028.734)