This report describes the analysis of the impact of Weather Events on the population health and Economy. The data used in this report was obtained on the the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contain the characteristics of weather events in the United States, from 1950 to November 2011, including the location and date that they occur and estimates of any fatalities, injuries, property damage and crop damage.
The analysis aimed to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The data was loaded, inspected, cleaned, processed and analysed, using dplyr, tidyr, mgsub and ggplot2 packages. After processing and analyzing the data it was found that the Tornado is the most harmful weather event to population health , resulting in higher injuries and fatalities, while Floods have greatest economic impact, resulting in higher damages to property and crops in the United States.
As mentioned before, the data was obtained from the [U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database] (https://www.weather.gov/). The events on the database are from the year 1950 to November 2011.
The data was downloaded from the following link:
[Storm Data] (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2)
The data came in the form of a comma-separated-value file compressed via the bzip2 algorithm. To read this data, the read.csv function was used.
data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE, sep = ",")
The code above loaded the data into R for the following processing and analysis.
After reading the data, it is necessary to get an overview of how the data looks like. For this, the dim, head and summary function were used.
dim(data)
## [1] 902297 37
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
summary (data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Min. : 0.000 Length:902297 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE
## Length:902297 Min. :0 Mode:logical Min. : 0.0000
## Class :character 1st Qu.:0 NA's:902297 1st Qu.: 0.0000
## Mode :character Median :0 Median : 0.0000
## Mean :0 Mean : 0.9862
## 3rd Qu.:0 3rd Qu.: 0.0000
## Max. :0 Max. :925.0000
##
## END_AZI END_LOCATI LENGTH WIDTH
## Length:902297 Length:902297 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.0000 Median : 0.000
## Mean : 0.2301 Mean : 7.503
## 3rd Qu.: 0.0000 3rd Qu.: 0.000
## Max. :2315.0000 Max. :4400.000
##
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 Length:902297 Min. : 0.000 Length:902297
## 1st Qu.: 0.00 Class :character 1st Qu.: 0.000 Class :character
## Median : 0.00 Mode :character Median : 0.000 Mode :character
## Mean : 12.06 Mean : 1.527
## 3rd Qu.: 0.50 3rd Qu.: 0.000
## Max. :5000.00 Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
As observed the data has 902297 rows and 37 variables (columns). The variables “COUNTYENDN”, “F”, “LATITUDE” and “LATITUDE_E” present NA values, therefore, none of the variables needed for this analysis.
To answer the questions of which types of events are most harmful to the population health and which have the greatest economic consequences, the data needed to be subsetted.
For the first question, only the variables event type (EVTYPE), injuries and fatalities (INJURIES, FATALITIES) were necessary, Therefore, the cbind function was used, selecting only these columns and using ["..."] to maintain the column names from the original dataset. The result was stores in another object called hlt.
hlt <- cbind(data["EVTYPE"], data["INJURIES"], data["FATALITIES"])
After this, the data was transformed. The previous dataset (hlt) was grouped by the event type. Then, for both injuries and fatalities columns, the sum was calculated, by its groups (EVTYPE). After calculating the sum, the dataset was ordered by injuries and fatalities, in descending order, therefore, the first row had the highest possible number of injuries and fatalities together. Finally, the data was sliced to select only the 10 first events with more injuries and fatalities together and transformed, by gathering the damage type into one column with two levels (INJURIES and FATALITIES), using the gather function.
With the summary function in was possible to observe that this dataset presented no NA’s, therefore there was no need to further clean and process this data.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
hltplot <- hlt %>%
# Grouping by event
group_by(EVTYPE) %>%
# Calculating total fatalities and injuries
summarise(INJURIES = sum(INJURIES),
FATALITIES = sum(FATALITIES)) %>%
# Sorting fatalities and injuries in descending order
arrange(desc(INJURIES + FATALITIES)) %>%
#selecting only 10 first records of injuries and fatalities
slice(1:10) %>%
# Melting injuries and fatalities
gather(TYPE, Value, c(INJURIES, FATALITIES))
summary(hltplot)
## EVTYPE TYPE Value
## Length:20 Length:20 Min. : 89
## Class :character Class :character 1st Qu.: 738
## Mode :character Mode :character Median : 1632
## Mean : 6859
## 3rd Qu.: 5331
## Max. :91346
For the second question, only the variables of event type (EVYTYPE), property damage and crop damage (PROPDMG, CROPDMG) were necessary and the cbind function was also used in the same way as previously.
#subset
ecn <- cbind(data["EVTYPE"],
data["PROPDMG"], data["PROPDMGEXP"],
data["CROPDMG"], data["CROPDMGEXP"])
However, the variables PROPDMG and CROPDMG were in different dimensions. In some cases the property and crop damages were in the order of millions or thousands etc. The following columns “PRPDMGEXP” and ‘CROPDMGEXP’ disclaim the exponent of these columns.
By creating a vector with these columns as factors and using the function levels() it was possible to detect the levels representing an exponen, on each column. In both cases there were exponents like B (billions), M (millions) and K(thousands) and there was also levels like H (hundreds) or levels like “?”, “+”, etc, that did not represented any exponent.
lvlPROPDMGEXP <- as.factor(ecn$PROPDMGEXP)
levels(lvlPROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
lvlCROPDMGEXP <- as.factor(ecn$CROPDMGEXP)
levels(lvlCROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Therefore, to properly subset the columns of interest for the following analysis, these exponents were substituted accordingly, using the mgsub functions, that allowed substitution of multiple patterns at one single line of code. The columns were transformed into numbers and the NA’s generated were removed. Subsequently, the columns “PROPDMG” and “CROPDMG” were multiplied, by row, using the arithmetic symbol (*) to its following exponent column, in order to obtain the actual cost of the damage, creating two new vectors. Then, these vectors along with the first column on event type (EVTYPE) were binded using cbind, with the results being stored in a new dataset.
# substitution K and M
library(mgsub)
ecn$PROPDMGEXP <- mgsub(ecn$PROPDMGEXP,
c("", "-" ,"?", "+", "0", "1" ,"2" ,"3", "4" ,"5", "6", "7" ,"8" ,"B","h" ,"H","K", "m", "M"),
c("NA", "NA", "NA", "NA", "NA", 1, 1, 1, 1, 1 , 1, 1, 1, 1000000000, 100, 100, 1000, 1000000, 1000000))
ecn$CROPDMGEXP <- mgsub(ecn$CROPDMGEXP,
c("", "?", "0", "2", "B","k","K", "m", "M"),
c("NA", "NA", "NA", 1, 1000000000, 1000, 1000, 1000000, 1000000))
ecn$PROPDMGEXP <- as.numeric(ecn$PROPDMGEXP)
ecn$CROPDMGEXP <- as.numeric(ecn$CROPDMGEXP)
#remove NAs
ecnnona <- ecn[complete.cases(ecn), ]
summary(ecnnona)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG
## Length:279540 Min. : 0.00 Min. :1.000e+00 Min. : 0.000
## Class :character 1st Qu.: 0.00 1st Qu.:1.000e+03 1st Qu.: 0.000
## Mode :character Median : 0.00 Median :1.000e+03 Median : 0.000
## Mean : 14.44 Mean :1.009e+05 Mean : 4.074
## 3rd Qu.: 3.00 3rd Qu.:1.000e+03 3rd Qu.: 0.000
## Max. :5000.00 Max. :1.000e+09 Max. :978.000
## CROPDMGEXP
## Min. :1.000e+03
## 1st Qu.:1.000e+03
## Median :1.000e+03
## Mean :2.443e+04
## 3rd Qu.:1.000e+03
## Max. :1.000e+09
#multiplication
PROPDAMAGE <- ecnnona$PROPDMG * ecnnona$PROPDMGEXP
CROPDAMAGE <- ecnnona$CROPDMG * ecnnona$CROPDMGEXP
#final dataset
dataecn <- cbind(ecnnona["EVTYPE"], PROPDAMAGE, CROPDAMAGE )
After cleaning the dataset, further processing was performed using the tidyr package and the dplyr package, similarly to what was performed with the previous dataset with injuries and fatalities (hltplot).
First, the previous dataset (dataecn) was grouped by the event type. Then, for both property and crop damage columns, the sum was calculated, by group. After calculating the sum, the dataset was ordered by property and crop damage, in descending order. Finally, the data was sliced and transformed, by selecting only the first 10 events with a higher impact and gathering the damage type into one column with two levels (PROPDAMAGE and CROPDAMAGE). The result was stores in a new dataset.
library(tidyr)
library(dplyr)
ecnplot <- dataecn %>%
# Grouping by event
group_by(EVTYPE) %>%
# Calculating total fatalities and injuries
summarise(PROPDAMAGE = sum(PROPDAMAGE),
CROPDAMAGE = sum(CROPDAMAGE)) %>%
# Sorting fatalities and injuries in descending order
arrange(desc(PROPDAMAGE + CROPDAMAGE)) %>%
#selecting only 10 first records of injuries and fatalities
slice(1:10) %>%
# Melting injuries and fatalities
gather(DMGTYPE, Value, c(PROPDAMAGE, CROPDAMAGE))
The following section describes the generation of the plots to answer the two questions of interest on this report. For both, the ggplot2 package was used, along with the two new datasets created in the previous steps: hltplot and ecnplot, to answer the first and second questions respectively.
The plot created to answer the first question was achieved using ggplot2 function. First, we defined the data and aesthetic. The x axis was the Event Type (EVTYPE) and we needed this variable to be ordered, in decresing order. Therefore, the reorder function was used, excluding the y variable (Value)
In addition, is was used the function geom_col as this function is more appropriate to display the values in the dataset, while geom_bar displays, typically, the counts.
The ggtitle function determines the title of the plot, xlab or ylab, the title of the axis and the theme function, along with the axis.text.x and element_text arguments, allow the rotation of the x.axis.
library(ggplot2)
ggplot(hltplot, aes(x = reorder(EVTYPE, -Value),
y = Value,
fill = TYPE)) +
geom_col(position="stack") +
ggtitle (" Impact of Weather Events on Population Health")+
xlab ("Event Type")+
theme(axis.text.x = element_text(angle = 90 ))
Therefore, as observed, the Tornados are responsible for the highest number of Injuries and Fatalities, therefore is the most harmful event to population health.
The second question (“Across the United States, which types of events have the greatest economic consequences?”) was answered using a similar code, to generate a similar plot.
library(ggplot2)
ggplot(ecnplot, aes(x = reorder(EVTYPE, -Value),
y = Value,
fill = DMGTYPE)) +
geom_col(position="stack") +
ggtitle ("Weather Events Impact on the Economy of United Stated")+
xlab ("Event Type")+
theme(axis.text.x = element_text(angle = 90 ))
Therefore, as observed, the Floods are responsible for the most damage in property and crops and therefore has the most impact on the economy of U.S.