In this data analysis we used the data from the NOAA Storm Database to answer two major questions:
In our analysis we use tools like graphs, tables and summaries to gain some insight about the data and columns that we will use later, after that, we created two datasets called healthDamage and economicDmg which will allow us to answer the questions in a conclusive form. Finally we put our answers in the Results section.
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url,"./StormData.csv.bz2")
stormData <- read.csv("StormData.csv.bz2", sep = ",", header = TRUE)
Once the data set is loaded we can explore the columns using:
names(stormData)
[1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
[6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
[11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
[21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
[26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
[31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS" "REFNUM"
From those columns we only need the following to carry on our analysis: STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. We select them using the select function from the dplyr package:
library(dplyr)
stormData <- select(stormData,STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Once we have our new data set, we will used it to create two more datasets:
We use the function group_by from the dplyr package to agrup our data by the EVTYPE (event type) variable and also compute the sum of each column INJURIES and FATALITIES according to the EVTYPE column
healthDamage <- stormData %>% group_by(EVTYPE) %>% summarise_at(vars(INJURIES, FATALITIES), sum)
When we inspect the healthDamage data we noted that there are events with 0 fatalities and 0 injuries
library(knitr)
kable(head(healthDamage))
| EVTYPE | INJURIES | FATALITIES |
|---|---|---|
| HIGH SURF ADVISORY | 0 | 0 |
| COASTAL FLOOD | 0 | 0 |
| FLASH FLOOD | 0 | 0 |
| LIGHTNING | 0 | 0 |
| TSTM WIND | 0 | 0 |
| TSTM WIND (G45) | 0 | 0 |
We proceed now to remove those events since they don’t contribute with any useful information to our analysis:
healthDamage <- healthDamage[!(healthDamage$INJURIES == 0 & healthDamage$FATALITIES == 0),]
With this process we reduce our healthDamage data from 985 to 220 observations, now a second inspection to the data revel us two problems, first there are events with the same name that are not grouped and second there are events from the same class, but they are not grouped because there are differences in their names
kable(head(healthDamage))
| EVTYPE | INJURIES | FATALITIES |
|---|---|---|
| AVALANCE | 0 | 1 |
| AVALANCHE | 170 | 224 |
| BLACK ICE | 24 | 1 |
| BLIZZARD | 805 | 101 |
| blowing snow | 1 | 1 |
| BLOWING SNOW | 13 | 1 |
We can fix those problems with the following strategy:
Step 1: This is easily done with:
healthDamage$EVTYPE <- sapply(healthDamage$EVTYPE, toupper)
Step 2: Here after a rapid inspection to the dataset we obtain the wrong event names and put them into the variable dupNames the names with wrong writing
dupNames <- c("AVALANCE", "COSTAL FLOOD", "COASTALSTORM", "FLASH FLOODS", "HEAVY RAIN", "HIGH WIND", "ICE ROADS", "LIGHTNING.", "RIP CURRENT", "THUNDERSTORM WIND", "THUNDERSTORM WINDSS", "THUNDERSTORMS WINDS", "THUNDERSTORMW", "THUNDERTORM WINDS", "WIND", "WINTER STORM", "WINTER WEATHER/MIX", "TSTM WIND")
and we used the variable correctNames for the correct names
correctNames <- c("AVALANCHE", "COASTAL FLOODING", "COASTAL STORM", "FLASH FLOOD", "HEAVY RAINS", "HIGH WINDS", "ICY ROADS", "LIGHTNING", "RIP CURRENTS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "WINDS", "WINTER STORMS", "WINTER WEATHER MIX", "THUNDERSTORM WINDS")
Step 3: Now, using a loop, we replace the wrong names by the correct names:
for (k in c(1:length(correctNames))){
healthDamage[healthDamage$EVTYPE == dupNames[k], 1] <- correctNames[k]
}
Now we apply again the function group_by to the dataset because now we have names that weren’t grouped before:
healthDamage <- healthDamage %>% group_by(EVTYPE) %>% summarise_at(vars(INJURIES, FATALITIES), sum)
And now we have our dataset ready for analysis, next we will process the data for the economicDmg dataset
In this section we are interested in the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP from the stormData because they are referred to the economic damage to properties and crops, a rapid inspection revel us that the columns PROPDMG and CROPDMG are numeric values:
str(stormData$PROPDMG)
num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
str(stormData$CROPDMG)
num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
But when we inspect the columns PROPDMGEXP and CROPDMGEXP we observe a curious notation:
levels(stormData$PROPDMGEXP)
[1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
[18] "m" "M"
levels(stormData$CROPDMGEXP)
[1] "" "?" "0" "2" "B" "k" "K" "m" "M"
Those values are referred to the following notation:
Before we can process our data, we need to map those values (from the two columns) with their respective number representation.
Mapping the PROPDMGEXP column
First we transform our column to a character class due to problems when mapping in to its original factor form
stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
Next we use our reference notation to map the values:
stormData[stormData$PROPDMGEXP == "0" | stormData$PROPDMGEXP == "1", 6] <- 10
stormData[stormData$PROPDMGEXP == "2" | stormData$PROPDMGEXP == "3", 6] <- 10
stormData[stormData$PROPDMGEXP == "4" | stormData$PROPDMGEXP == "5", 6] <- 10
stormData[stormData$PROPDMGEXP == "6" | stormData$PROPDMGEXP == "7" | stormData$PROPDMGEXP == "8", 6] <- 10
stormData[stormData$PROPDMGEXP == "-" | stormData$PROPDMGEXP == "?" | stormData$PROPDMGEXP == "", 6] <- 0
stormData[stormData$PROPDMGEXP == "+", 6] <- 1
stormData[stormData$PROPDMGEXP == "H" | stormData$PROPDMGEXP == "h", 6] <- 100
stormData[stormData$PROPDMGEXP == "K" | stormData$PROPDMGEXP == "k", 6] <- 1000
stormData[stormData$PROPDMGEXP == "M" | stormData$PROPDMGEXP == "m", 6] <- 1000000
stormData[stormData$PROPDMGEXP == "B", 6] <- 1000000000
Mapping the CROPDMGEXP column
We repeat the process, first we transform the factor column into a character type:
stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
then we map the values using our reference notation:
stormData[stormData$CROPDMGEXP == "0" | stormData$CROPDMGEXP == "2", 8] <- 10
stormData[stormData$CROPDMGEXP == "?" | stormData$CROPDMGEXP == "", 8] <- 0
stormData[stormData$CROPDMGEXP == "K" | stormData$CROPDMGEXP == "k", 8] <- 1000
stormData[stormData$CROPDMGEXP == "M" | stormData$CROPDMGEXP == "m", 8] <- 1000000
stormData[stormData$CROPDMGEXP == "B", 8] <- 1000000000
Finally we transform the values from the two columns into a numeric representation:
stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)
Our next step will be the creation of the dataset economicDmg, in which we will group the sum of the variables referred to the economic damage: PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP and we will group them using the function group_by by their respective event type (EVTYPE)
economicDmg <- stormData %>% group_by(EVTYPE) %>% summarise_at(vars(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), sum)
As we seen before with the healthDamage dataset, we also have inconsistencies in the names of the events in our economicDmg dataset
kable(head(economicDmg[c(18:23), ]))
| EVTYPE | PROPDMG | PROPDMGEXP | CROPDMG | CROPDMGEXP |
|---|---|---|---|---|
| AVALANCE | 0.0 | 0 | 0 | 0 |
| AVALANCHE | 1623.9 | 2191000 | 0 | 154000 |
| BEACH EROSIN | 0.0 | 0 | 0 | 0 |
| Beach Erosion | 100.0 | 1000 | 0 | 0 |
| BEACH EROSION | 0.0 | 0 | 0 | 0 |
| BEACH EROSION/COASTAL FLOOD | 0.0 | 0 | 0 | 0 |
We also have values with 0 in the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP:
sum(economicDmg$PROPDMG == 0)
[1] 579
sum(economicDmg$PROPDMGEXP == 0)
[1] 576
sum(economicDmg$CROPDMG == 0)
[1] 849
sum(economicDmg$CROPDMGEXP == 0)
[1] 824
We will address the problems with the following strategy:
Step 1: We achieve this step by just applying the function toupper to the EVTYPE column:
economicDmg$EVTYPE <- sapply(economicDmg$EVTYPE, toupper)
Step 2: We can remove the 0 values using:
economicDmg <- economicDmg[!(economicDmg$PROPDMG == 0 & economicDmg$PROPDMGEXP == 0 & economicDmg$CROPDMG == 0 & economicDmg$CROPDMGEXP == 0),]
Step 3: We define the incongruent names into the dpnames variable:
dpnames <- c("COASTAL FLOOD", "FLASH FLOOD/", "FLASH FLOODING", "FLASH FLOODING/FLOOD", "FLASH FLOODS", "FLOOD/FLASH", "FLOOD/FLASH FLOOD", "FLOOD/FLASH FLOOD", "FLOOD", "FLOODING", "GUSTY WIND", "HEAVY RAIN", "HIGH WIND", "HIGH WINDS/", "ICY ROADS", "LAKE-EFFECT SNOW", "LANDSLIDE", "LIGHTING", "LIGNTNING", "LIGHTING", "LIGNTNING", "MARINE TSTM WIND", "MUD SLIDE", "MUDSLIDE", "MUDSLIDES", "RIVER FLOOD", "RIP CURRENT", "SEVERE THUNDERSTORM", "SNOW SQUALL", "SNOW/ ICE", "STRONG WIND", "THUNDERESTORM WINDS", "THUNDEERSTORM WINDS", "THUDERSTORM WINDS", "THUNDERSTORM WIND", "THUNDERSTORM WIND.", "THUNDERSTORM WIND/ TREE", "THUNDERSTORM WIND/ TREES", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS.", "THUNDERSTORM WINDSS", "THUNDERSTORM WINS", "THUNDERSTORM WINDSHAIL", "THUNDERSTORM WINDS HAIL", "THUNDERSTORM WINDS.", "THUNDERSTORM WINDSHAIL", "THUNDERSTORM WINDSS", "THUNDERSTORM WINS", "THUNDERSTORMS WIND", "THUNDERSTORMS WINDS", "THUNDERSTORMW", "THUNDERSTORMWINDS", "THUNDERSTROM WIND", "THUNDERTORM WINDS", "THUNERSTORM WINDS", "TORNDAO", "UNSEASONABLE COLD", "URBAN FLOOD", "URBAN FLOODS", "URBAN/SML STREAM FLD", "URBAN/SMALL STREAM FLOOD", "WATERSPOUT", "WATERSPOUT-", "WATERSPOUT-TORNADO", "WATERSPOUT/ TORNADO", "WATERSPOUT/ TORNADO", "WILDFIRE", "WILDFIRES", "WIND", "WINDS", "WINTER STORM", "WINTER WEATHER/MIX", "TSTM WIND")
Step 4: We defined the correct names into the variable fxnames:
fxnames <- c("COASTAL FLOODING", "FLASH FLOOD", "FLASH FLOOD", "FLASH FLOOD/FLOOD","FLASH FLOOD", "FLOOD FLASH", "FLOOD/FLASHFLOOD", "FLOOD/FLASH/FLOOD", "FLOODS", "FLOODS", "GUSTY WINDS", "HEAVY RAINS", "HIGH WINDS", "HIGH WINDS", "ICE ROADS", "LAKE EFFECT SNOW", "LANDSLIDES", "LIGHTNING", "LIGHTNING", "LIGHTNING", "LIGHTNING", "MARINE THUNDERSTORM WIND", "MUD SLIDES", "MUD SLIDES", "MUD SLIDES", "RIVER FLOODING", "RIP CURRENTS", "SEVERE THUNDERSTORMS", "SNOW SQUALLS", "SNOW/ICE", "STRONG WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WIND TREES", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "TORNADO", "UNSEASONABLY COLD", "URBAN FLOODING", "URBAN FLOODING", "URBAN/SMALL STREAM FLOODING", "URBAN/SMALL STREAM FLOODING", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WILD FIRES", "WILD FIRES", "WINDS", "WINDS", "WINTER STORMS", "WINTER WEATHER MIX", "THUNDERSTORM WINDS")
Step 5: We replace the incongruent names with the correct names:
for (k in c(1:length(fxnames))){
economicDmg[economicDmg$EVTYPE == dpnames[k], 1] <- fxnames[k]
}
Step 6: Now we compute the sum of the columns using the group_by function
economicDmg <- economicDmg %>% group_by(EVTYPE) %>% summarise_at(vars(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), sum)
Finally we have both our healthDamage and economicDmg ready for the analysis, with this we conclude the section of Data Proccessing, the next section will focus in the Data Analysis
When we inspect our healthDamage data we note that both FATALITIES and INJURIES have their own respective values:
summary(healthDamage$INJURIES)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 3.0 747.5 38.5 91346.0
summary(healthDamage$FATALITIES)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 2.00 80.56 13.25 5633.00
But here the idea is to have a general view of the total impact to the population’s health. Now we proceed to calculate the total damage to health, we can do this by adding a new column called TOTALDMG in which we sum the injuries and fatalities.
healthDamage$TOTALDMG <- healthDamage$INJURIES + healthDamage$FATALITIES
Now if we inspect our data ordered by the total damage we can see the most harmful events to the Health:
kable(head(arrange(healthDamage, -TOTALDMG)))
| EVTYPE | INJURIES | FATALITIES | TOTALDMG |
|---|---|---|---|
| TORNADO | 91346 | 5633 | 96979 |
| THUNDERSTORM WINDS | 9385 | 702 | 10087 |
| EXCESSIVE HEAT | 6525 | 1903 | 8428 |
| FLOOD | 6789 | 470 | 7259 |
| LIGHTNING | 5230 | 817 | 6047 |
| HEAT | 2100 | 937 | 3037 |
Now we need a way to defined the top harmful events to health, if we observe the summary of the TOTALDMG:
summary(healthDamage$TOTALDMG)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 7.00 828.05 48.75 96979.00
We note that the mean value is 828.05, this is a quite high number, considering that it groups both injuries and fatalities, so we will take this value (the mean of TOTALDMG) as our metric to defined the top harmulf events to health. Next we will take a subset of the data and we will keep all the values of TOTALDMG that are equal or greater than its mean
healthDamage <- subset(healthDamage, healthDamage$TOTALDMG >= mean(healthDamage$TOTALDMG))
This process reduce our healthDamage data from 188 to only 16 observations. We can now explore the results of our analysis, this will be done in the Results section.
When inspecting our economicDmg data we noted the same problem of decide which variable we will consider to select the top economic damages:
summary(economicDmg$PROPDMG)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 5 50 32203 555 3212259
summary(economicDmg$PROPDMGEXP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 1.000e+03 2.000e+03 1.531e+08 1.003e+06 1.205e+10
summary(economicDmg$CROPDMG)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 4076 10 579596
summary(economicDmg$CROPDMGEXP)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000e+00 0.000e+00 0.000e+00 3.336e+07 1.650e+04 4.143e+09
As we did before we need a way to compute the total economic damage, but first we need to combine the total damage in properties and crops, we do this multiplyin the values from the variables PROPDMG and CROPDMG with PROPDMGEXP and CROPDMGEXP respectivily:
economicDmg$PROPDMG <- economicDmg$PROPDMG * economicDmg$PROPDMGEXP
economicDmg$CROPDMG <- economicDmg$CROPDMG * economicDmg$CROPDMG
Since the variables PROPDMGEXP and CROPDMGEXP are not longer need, we dismiss them and kept only the variables that we will use:
economicDmg <- select(economicDmg, EVTYPE, PROPDMG, CROPDMG)
The next step is calculate the total economic damage, for which we create the variable TOTALDMG which is the sum of the total values from PROPDMG and CROPDMG
economicDmg$TOTALDMG <- economicDmg$PROPDMG + economicDmg$CROPDMG
Now when we inspect our data we find that some TOTALDMG values are equal to 0.
sum(economicDmg$TOTALDMG == 0)
[1] 4
We proceed to remove those values using:
economicDmg <- economicDmg[!(economicDmg$TOTALDMG == 0),]
Now we will use the mean of the total damage variable to select our top events with the greatest economic impact:
economicDmg <- subset(economicDmg, economicDmg$TOTALDMG >= mean(economicDmg$TOTALDMG))
This process reduce our economicDmg data from 334 to only 7 observations. We can now explore the results of our analysis, this will be done in the Results section. But before we go to the next section, we need to remove the stormData from which we built the datasets healthDamage and economicDmg, because we no longer need it:
remove(stormData)
In this section we show the results of our Data Analysis:
This question is related to the healthDamage data, when we inspect it we can see that the most harmful events related to population health are:
kable(arrange(healthDamage, -TOTALDMG))
| EVTYPE | INJURIES | FATALITIES | TOTALDMG |
|---|---|---|---|
| TORNADO | 91346 | 5633 | 96979 |
| THUNDERSTORM WINDS | 9385 | 702 | 10087 |
| EXCESSIVE HEAT | 6525 | 1903 | 8428 |
| FLOOD | 6789 | 470 | 7259 |
| LIGHTNING | 5230 | 817 | 6047 |
| HEAT | 2100 | 937 | 3037 |
| FLASH FLOOD | 1777 | 980 | 2757 |
| ICE STORM | 1975 | 89 | 2064 |
| HIGH WINDS | 1439 | 283 | 1722 |
| WINTER STORMS | 1338 | 216 | 1554 |
| HAIL | 1361 | 15 | 1376 |
| HURRICANE/TYPHOON | 1275 | 64 | 1339 |
| HEAVY SNOW | 1021 | 127 | 1148 |
| RIP CURRENTS | 529 | 572 | 1101 |
| WILDFIRE | 911 | 75 | 986 |
| BLIZZARD | 805 | 101 | 906 |
From the table above we can build a bar plot for a better representation:
library(ggplot2)
ggplot(data = healthDamage, aes(x=reorder(EVTYPE, -TOTALDMG), y = TOTALDMG, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Top 16 harmful Events to Population Health") + xlab("") + ylab("Total Health Damage (Injuries + Fatalities)") + labs(fill = "Event Type") + guides(fill=guide_legend(ncol=2))
Also we show the relationship between injuries and fatalities by the events:
library(ggplot2)
ggplot(data = healthDamage, aes(x=INJURIES, y = FATALITIES)) + geom_point(aes(color = EVTYPE), size = 3) + ggtitle("Relationship between Fatalities and Injuries by Events") + xlab("Total Injuries") + ylab("Total Fatalities") + labs(color = "Event Type")
As we can see, those are the top 16 most harmful events to population health
This question is related to the economicDmg data, when we inspect it we can see that the events with greatest economic consequences are:
kable(arrange(economicDmg, -TOTALDMG))
| EVTYPE | PROPDMG | CROPDMG | TOTALDMG |
|---|---|---|---|
| TORNADO | 2.416373e+16 | 10003704343 | 2.416374e+16 |
| FLOODS | 5.935982e+15 | 29394718453 | 5.936012e+15 |
| FLASH FLOOD | 3.730919e+15 | 33976262289 | 3.730953e+15 |
| THUNDERSTORM WINDS | 3.199747e+15 | 37947636090 | 3.199785e+15 |
| HAIL | 1.348529e+15 | 335931847790 | 1.348865e+15 |
| HIGH WINDS | 5.171167e+14 | 362628613 | 5.171171e+14 |
| WINTER STORMS | 1.527205e+14 | 6145391 | 1.527205e+14 |
From the table above we can build a bar plot for a better representation:
ggplot(data = economicDmg, aes(x=reorder(EVTYPE, -TOTALDMG), y = TOTALDMG, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Top 7 Events with Greatest Economic Consequences") + xlab("") + ylab("Total Economic Damage (Properties and Crops)") + labs(fill = "Event Type")
As we can see, those are the top 7 events with greatest economic consequences