This project aims to show which weather events caused the most economic and health damage in the U.S. between 1950 and 2011. The data were obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA). All weather events were sorted into ten groups: flood, hail, heat, rain, snow, storm, tornado, wind, winter and others. The data show that every group caused substantial damage to crops and property, but the group that caused the most property damage was “flood”, and the group that caused the most crop damage was “others” (the “others” group covers weather events that fall outside flood, hail, heat, rain, snow, storm, tornado, wind and winter, such as wintry mix, waterspout, etc.). The group that caused the most health damage (fatalities and injuries) was tornado.
The storm database was obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA). This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Download the data.
if (!file.exists("StormData.csv.bz2")){
DataUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(DataUrl, destfile = "StormData.csv.bz2", method = "curl")
if(!file.exists("StormData.csv.bz2")){
stop("Download failed: StormData.csv.bz2 was not found")
}
}
# Reading downloaded data.
df <- read.csv("StormData.csv.bz2")
Now the head and tail of the first 10 columns of the data are shown to see its general structure.
head(df[,1:10])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
tail(df[,1:10])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY
## 902292 47 11/28/2011 0:00:00 03:00:00 PM CST 21
## 902293 56 11/30/2011 0:00:00 10:30:00 PM MST 7
## 902294 30 11/10/2011 0:00:00 02:48:00 PM MST 9
## 902295 2 11/8/2011 0:00:00 02:58:00 PM AKS 213
## 902296 2 11/9/2011 0:00:00 10:21:00 AM AKS 202
## 902297 1 11/28/2011 0:00:00 08:00:00 PM CST 6
## COUNTYNAME STATE EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088 TN WINTER WEATHER 0
## 902293 WYZ007 - 017 WY HIGH WIND 0
## 902294 MTZ009 - 010 MT HIGH WIND 0
## 902295 AKZ213 AK HIGH WIND 0
## 902296 AKZ202 AK BLIZZARD 0
## 902297 ALZ006 AL HEAVY SNOW 0
## BGN_AZI
## 902292
## 902293
## 902294
## 902295
## 902296
## 902297
The dimensions of the data are also checked.
dim(df)
## [1] 902297 37
Number of rows = 902297, number of columns = 37.
The analysis aims to answer the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
First, let's look at all the variables in the data and their characteristics.
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
According to the codebook, the variables needed are:
– EVTYPE (the type of weather event)
– Related to people's health:
FATALITIES (approx. number of deaths)
INJURIES (approx. number of injuries)
– Related to economic damage:
PROPDMG (approx. property damage)
PROPDMGEXP (the units of the property damage value)
CROPDMG (approx. crop damage)
CROPDMGEXP (the units of the crop damage value)
This is the structure of the variables needed for the project.
str(df[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")])
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Let's see the proportion of missing values in each variable.
mean(is.na(df$EVTYPE))
## [1] 0
mean(is.na(df$FATALITIES))
## [1] 0
mean(is.na(df$INJURIES))
## [1] 0
mean(is.na(df$PROPDMG))
## [1] 0
mean(is.na(df$PROPDMGEXP))
## [1] 0
mean(is.na(df$CROPDMG))
## [1] 0
mean(is.na(df$CROPDMGEXP))
## [1] 0
In all of these variables the proportion of missing values is 0, so none of the variables needed contains NA values.
Next, the exponent codes given by PROPDMGEXP and CROPDMGEXP need to be converted to numeric multipliers in order to calculate the economic damage.
First, a data frame is created with the necessary variables.
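The seven separate checks above can also be collapsed into a single call. The following is a minimal sketch using only base R; `toy` is a small made-up data frame that stands in for `df`, and `vars` would hold the seven variable names in the real analysis.

```r
# Toy stand-in for the storm data; the values are invented for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", NA),
  FATALITIES = c(0, 1, 0),
  INJURIES   = c(15, NA, 2),
  stringsAsFactors = FALSE
)

# Columns to check (hypothetical subset of the variables listed above).
vars <- c("EVTYPE", "FATALITIES", "INJURIES")

# is.na() on a data frame returns a logical matrix, so colMeans() gives the
# proportion of missing values in each column.
na_share <- colMeans(is.na(toy[, vars]))
print(na_share)
```

A result of 0 for every column corresponds to the output shown above for the real data.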
DfWithVariables <- df %>% select("EVTYPE","FATALITIES","INJURIES","PROPDMG", "PROPDMGEXP","CROPDMG","CROPDMGEXP")
Here the exponent codes are replaced by their numeric multipliers: K by 1,000, M by 1,000,000, B by 1,000,000,000 and H by 100; digits and “+” by 1; and “-”, “?” and empty values by 0.
DfWithVariables$PROPDMGEXP <- as.character(DfWithVariables$PROPDMGEXP)
DfWithVariables <- DfWithVariables %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "K"), "1000", PROPDMGEXP)) %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "m" | PROPDMGEXP == "M"), "1000000", PROPDMGEXP)) %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "B"), "1000000000", PROPDMGEXP)) %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "h" | PROPDMGEXP == "H"), "100", PROPDMGEXP)) %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "0" | PROPDMGEXP == "1" |PROPDMGEXP == "2" | PROPDMGEXP == "3"|PROPDMGEXP == "4" |PROPDMGEXP == "5" |PROPDMGEXP == "6"|PROPDMGEXP == "7" |PROPDMGEXP == "8" |PROPDMGEXP == "+"),1, PROPDMGEXP)) %>%
mutate(PROPDMGEXP = ifelse((PROPDMGEXP == "-" | PROPDMGEXP == "?" | PROPDMGEXP == ""),0, PROPDMGEXP))
DfWithVariables$PROPDMGEXP <- as.numeric(DfWithVariables$PROPDMGEXP)
DfWithVariables$CROPDMGEXP <- as.character(DfWithVariables$CROPDMGEXP)
DfWithVariables <- DfWithVariables %>%
mutate(CROPDMGEXP = ifelse((CROPDMGEXP == "k" | CROPDMGEXP == "K"), "1000",CROPDMGEXP)) %>%
mutate(CROPDMGEXP = ifelse((CROPDMGEXP == "m" | CROPDMGEXP == "M"), "1000000", CROPDMGEXP)) %>%
mutate(CROPDMGEXP = ifelse((CROPDMGEXP == "B"), "1000000000", CROPDMGEXP)) %>%
mutate(CROPDMGEXP = ifelse((CROPDMGEXP == "0" | CROPDMGEXP == "2"), "1", CROPDMGEXP)) %>%
mutate(CROPDMGEXP = ifelse((CROPDMGEXP == "" | CROPDMGEXP == "?"), "0", CROPDMGEXP))
DfWithVariables$CROPDMGEXP <- as.numeric(DfWithVariables$CROPDMGEXP)
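The same conversion could be written once and reused for both columns. The sketch below is an alternative under the assumption that the mapping is exactly the one applied in the code above; `exp_to_multiplier` is a hypothetical helper, not part of the original analysis, and the sample codes and amounts are invented.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumed rules (taken from the code above): H = 1e2, K = 1e3, M = 1e6,
# B = 1e9, digits 0-8 and "+" = 1, and "", "-", "?" = 0.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  code <- toupper(as.character(code))
  out <- unname(lookup[code])   # NA for codes that are not letters in lookup
  out[code %in% c(as.character(0:8), "+")] <- 1
  out[code %in% c("", "-", "?")] <- 0
  out
}

# Invented sample values to show the arithmetic.
codes   <- c("K", "m", "B", "h", "5", "+", "", "?")
amounts <- c(25, 2.5, 1, 3, 10, 10, 7, 7)
damage  <- amounts * exp_to_multiplier(codes)
print(damage)
```

Any code outside these rules comes back as NA rather than being silently coerced, which makes unexpected codes easier to spot.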
Now the total crop and property damage is calculated.
DfWithVariables$PROTOTALDAMAGE <- DfWithVariables$PROPDMG * DfWithVariables$PROPDMGEXP
DfWithVariables$CROPTOTALDAMAGE <- DfWithVariables$CROPDMG * DfWithVariables$CROPDMGEXP
The codebook shows that the many specific events in the data can be collected into 10 broader types of events.
The data will therefore be grouped according to the following events:
Hail
Heat
Flood
Wind
Storm
Snow
Tornado
Winter
Rain
Others
# This table shows the first 10 event types of the EVTYPE variable.
table(DfWithVariables$EVTYPE)[1:10]
##
## HIGH SURF ADVISORY COASTAL FLOOD FLASH FLOOD
## 1 1 1
## LIGHTNING TSTM WIND TSTM WIND (G45)
## 1 4 1
## WATERSPOUT WIND ?
## 1 1 1
## ABNORMAL WARMTH
## 4
# This table shows the last 11 event types of the EVTYPE variable.
table(DfWithVariables$EVTYPE)[975:985]
##
## WINTER STORM/HIGH WINDS WINTER STORMS Winter Weather
## 1 3 19
## WINTER WEATHER WINTER WEATHER MIX WINTER WEATHER/MIX
## 7026 6 1104
## WINTERY MIX Wintry mix Wintry Mix
## 2 3 1
## WINTRY MIX WND
## 90 1
The previous tables show the variety of event types that have to be grouped.
Now all the event types shown above are assigned to the 10 groups listed earlier. An event is assigned to a group when its name contains that group's keyword. For example, the COASTAL FLOOD event goes into the FLOOD group because it contains the word FLOOD. When a name contains more than one keyword, the assignment made last in the code below takes effect.
# Create new variable GROUPEVENTS
DfWithVariables$GROUPEVENTS <- "OTHERS"
DfWithVariables$GROUPEVENTS[grep("RAIN",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "RAIN"
DfWithVariables$GROUPEVENTS[grep("WINTER",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "WINTER"
DfWithVariables$GROUPEVENTS[grep("TORNADO",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "TORNADO"
DfWithVariables$GROUPEVENTS[grep("SNOW",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "SNOW"
DfWithVariables$GROUPEVENTS[grep("STORM",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "STORM"
DfWithVariables$GROUPEVENTS[grep("WIND",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "WIND"
DfWithVariables$GROUPEVENTS[grep("FLOOD",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "FLOOD"
DfWithVariables$GROUPEVENTS[grep("HEAT",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "HEAT"
DfWithVariables$GROUPEVENTS[grep("HAIL",DfWithVariables$EVTYPE,ignore.case = TRUE)] <- "HAIL"
# Table of the total of events.
table(DfWithVariables$GROUPEVENTS)
##
## FLOOD HAIL HEAT OTHERS RAIN SNOW STORM TORNADO WIND WINTER
## 82730 290401 2648 48970 12155 17652 15123 60699 363759 8160
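Because each `grep()` assignment overwrites the previous one, the order of the keywords decides where an event with several keywords ends up. The following sketch reproduces that behaviour on a handful of event names taken from the tables above; `keywords` and `group_events` are hypothetical names introduced only for this illustration.

```r
# Keywords in the same order as the grep() calls in the report.
# Later keywords overwrite earlier ones, so HAIL has the highest priority.
keywords <- c("RAIN", "WINTER", "TORNADO", "SNOW", "STORM",
              "WIND", "FLOOD", "HEAT", "HAIL")

# Hypothetical helper: assign each event name to a group by keyword match.
group_events <- function(evtype) {
  group <- rep("OTHERS", length(evtype))
  for (kw in keywords) {
    group[grepl(kw, evtype, ignore.case = TRUE)] <- kw
  }
  group
}

# A few event names that show the overlap cases.
examples <- c("COASTAL FLOOD", "WINTER STORMS", "TSTM WIND", "Wintry Mix",
              "WINTER STORM/HIGH WINDS")
print(data.frame(EVTYPE = examples, GROUP = group_events(examples),
                 stringsAsFactors = FALSE))
```

With this ordering, WINTER STORMS is counted under STORM and WINTER STORM/HIGH WINDS under WIND, while Wintry Mix falls into OTHERS because it does not contain the string WINTER.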
# Table with the total economic damage (In crops and properties) in dollars.
TotalEconomicDamage <- DfWithVariables %>% select(PROTOTALDAMAGE,CROPTOTALDAMAGE,GROUPEVENTS) %>%
group_by(GROUPEVENTS) %>% summarise(TotalProp = sum(PROTOTALDAMAGE), TotalCrop = sum(CROPTOTALDAMAGE))
## `summarise()` ungrouping output (override with `.groups` argument)
# Table with the total Health damage(Fatalities and Injuries)
TotalHealthDamage <- DfWithVariables %>% select(FATALITIES,INJURIES,GROUPEVENTS) %>%
group_by(GROUPEVENTS) %>% summarise(TotalFatalities = sum(FATALITIES), TotalInjuries = sum(INJURIES))
## `summarise()` ungrouping output (override with `.groups` argument)
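The summary tables can also be ranked directly, which gives the same answers the graphs below show visually. This is a minimal base R sketch on invented totals; `toy_totals` only borrows the column names of `TotalHealthDamage` and does not contain real results.

```r
# Invented per-group totals with the same column names as TotalHealthDamage.
toy_totals <- data.frame(
  GROUPEVENTS     = c("FLOOD", "HEAT", "TORNADO"),
  TotalFatalities = c(10, 30, 50),
  TotalInjuries   = c(70, 20, 90),
  stringsAsFactors = FALSE
)

# Sort the groups from most to fewest fatalities.
ranked <- toy_totals[order(toy_totals$TotalFatalities, decreasing = TRUE), ]
print(ranked)

# Name of the group with the largest value in each numeric column.
top_groups <- sapply(toy_totals[, -1],
                     function(x) toy_totals$GROUPEVENTS[which.max(x)])
print(top_groups)
```

The same two steps would apply to `TotalEconomicDamage` with the `TotalProp` and `TotalCrop` columns.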
The first graph shows the total property damage caused by the weather events.
ggplot(data = TotalEconomicDamage, mapping = aes(x = GROUPEVENTS , y = TotalProp, fill = TotalProp )) +
geom_bar(stat = "identity") +
geom_text(data = NULL, x = 3, y = 1.5e+10, label = "20325750", angle = 90) +
geom_text(data = NULL, x = 10, y = 1.5e+10, label = "27298000", angle = 90) +
ggtitle("Total property damage (In dollars) by weather type of event.") +
theme(plot.title = element_text(color = "red")) +
labs(x = "Weather type of event.", y = "Total Property damage in dollars.")+
scale_fill_continuous(name = "Total property damage scale")
Note that HEAT and WINTER do not show a visible bar because the y-axis range is so large, so their total economic damage (in dollars) is written on the graph instead.
FLOOD was the group that caused the most property damage between 1950 and 2011.
The second graph shows the total crop damage caused by the weather events.
ggplot(data = TotalEconomicDamage, mapping = aes(x = GROUPEVENTS , y = TotalCrop, fill = TotalCrop)) +
geom_bar(stat = "identity", position = position_dodge()) +
ggtitle("Total crop damage (In dollars) by weather type of event.") +
theme(plot.title = element_text(color = "red"))+
scale_fill_continuous(name = "Total Crop damage scale")+
geom_text(data = NULL, x = 10, y = 3.5e+9, label = "15000000", angle = 90)+
labs(x = "Weather type of event.", y = "Total crop damage in dollars.")
Events other than flood, hail, heat, rain, snow, storm, tornado, wind and winter (such as wintry mix, waterspout, etc.) caused the most crop damage between 1950 and 2011.
The third graph shows the total fatalities caused by the weather events.
ggplot(data = TotalHealthDamage, mapping = aes(x = GROUPEVENTS , y = TotalFatalities, fill = TotalFatalities )) +
geom_bar(stat = "identity", position = position_dodge()) +
ggtitle("Total fatalities by weather type of event.") +
theme(plot.title = element_text(color = "red")) +
labs(x = "Weather type of event.", y = "Total fatalities")+
scale_fill_continuous(name = "Total Fatalities")
TORNADO was the group that caused the most fatalities between 1950 and 2011.
The fourth graph shows the total number of people injured by the weather events.
ggplot(data = TotalHealthDamage, mapping = aes(x = GROUPEVENTS , y = TotalInjuries, fill = TotalInjuries )) +
geom_bar(stat = "identity", position = position_dodge()) +
ggtitle("Total injuries by weather type of event.") +
theme(plot.title = element_text(color = "red")) +
labs(x = "Weather type of event.", y = "Total injuries")+
scale_fill_continuous(name = "Total injured people")
TORNADO was the group that caused the most injuries between 1950 and 2011.