Synopsis

In this data analysis we used the data from the NOAA Storm Database to answer two major questions:

In our analysis we use tools like graphs, tables and summaries to gain some insight about the data and columns that we will use later, after that, we created two datasets called healthDamage and economicDmg which will allow us to answer the questions in a conclusive form. Finally we put our answers in the Results section.

Download the Data Set [OPTIONAL]

url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url,"./StormData.csv.bz2")

Data Processing

Read the Data Set:

stormData <- read.csv("StormData.csv.bz2", sep = ",", header = TRUE)

Select columns:

Once the data set is loaded we can explore the columns using:

names(stormData)
 [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
 [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
[11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
[21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
[26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
[31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS"    "REFNUM"    

From those columns we only need the following to carry on our analysis: STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. We select them using the select function from the dplyr package:

library(dplyr)
stormData <- select(stormData,STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Once we have our new data set, we will used it to create two more datasets:

  • healthDamage: Which will have the data related to the population’s health and
  • economicDmg: Which will have the data related to the economic damage.

Health Damage Data:

We use the function group_by from the dplyr package to agrup our data by the EVTYPE (event type) variable and also compute the sum of each column INJURIES and FATALITIES according to the EVTYPE column

healthDamage <- stormData %>% group_by(EVTYPE) %>% summarise_at(vars(INJURIES, FATALITIES), sum)

When we inspect the healthDamage data we noted that there are events with 0 fatalities and 0 injuries

library(knitr)
kable(head(healthDamage))
EVTYPE INJURIES FATALITIES
HIGH SURF ADVISORY 0 0
COASTAL FLOOD 0 0
FLASH FLOOD 0 0
LIGHTNING 0 0
TSTM WIND 0 0
TSTM WIND (G45) 0 0

We proceed now to remove those events since they don’t contribute with any useful information to our analysis:

healthDamage <- healthDamage[!(healthDamage$INJURIES == 0 & healthDamage$FATALITIES == 0),]

With this process we reduce our healthDamage data from 985 to 220 observations, now a second inspection to the data revel us two problems, first there are events with the same name that are not grouped and second there are events from the same class, but they are not grouped because there are differences in their names

kable(head(healthDamage))
EVTYPE INJURIES FATALITIES
AVALANCE 0 1
AVALANCHE 170 224
BLACK ICE 24 1
BLIZZARD 805 101
blowing snow 1 1
BLOWING SNOW 13 1

We can fix those problems with the following strategy:

  • Step 1: Convert to uppercase all the events names in the data set.
  • Step 2: Find the names with wrong writing
  • Step 3: Change the names with wrong writing by the correct ones.

Step 1: This is easily done with:

healthDamage$EVTYPE <- sapply(healthDamage$EVTYPE, toupper)

Step 2: Here after a rapid inspection to the dataset we obtain the wrong event names and put them into the variable dupNames the names with wrong writing

dupNames <- c("AVALANCE", "COSTAL FLOOD", "COASTALSTORM", "FLASH FLOODS", "HEAVY RAIN", "HIGH WIND", "ICE ROADS", "LIGHTNING.", "RIP CURRENT", "THUNDERSTORM WIND", "THUNDERSTORM WINDSS", "THUNDERSTORMS WINDS", "THUNDERSTORMW", "THUNDERTORM WINDS", "WIND", "WINTER STORM", "WINTER WEATHER/MIX", "TSTM WIND")

and we used the variable correctNames for the correct names

correctNames <- c("AVALANCHE", "COASTAL FLOODING", "COASTAL STORM", "FLASH FLOOD", "HEAVY RAINS", "HIGH WINDS", "ICY ROADS", "LIGHTNING", "RIP CURRENTS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "WINDS", "WINTER STORMS", "WINTER WEATHER MIX", "THUNDERSTORM WINDS")

Step 3: Now, using a loop, we replace the wrong names by the correct names:

for (k in c(1:length(correctNames))){
  healthDamage[healthDamage$EVTYPE == dupNames[k], 1] <- correctNames[k] 
}

Now we apply again the function group_by to the dataset because now we have names that weren’t grouped before:

healthDamage <- healthDamage %>% group_by(EVTYPE) %>% summarise_at(vars(INJURIES, FATALITIES), sum)

And now we have our dataset ready for analysis, next we will process the data for the economicDmg dataset

Economic Damage Data:

In this section we are interested in the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP from the stormData because they are referred to the economic damage to properties and crops, a rapid inspection revel us that the columns PROPDMG and CROPDMG are numeric values:

str(stormData$PROPDMG)
 num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
str(stormData$CROPDMG)
 num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...

But when we inspect the columns PROPDMGEXP and CROPDMGEXP we observe a curious notation:

levels(stormData$PROPDMGEXP)
 [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
[18] "m" "M"
levels(stormData$CROPDMGEXP)
[1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Those values are referred to the following notation:

  • H,h = hundreds = 100
  • K,k = kilos = thousands = 1,000
  • M,m = millions = 1,000,000
  • B,b = billions = 1,000,000,000
  • (+) = 1
  • (-) = 0
  • (?) = 0
  • black/empty character = 0
  • numeric 0..8 = 10

Before we can process our data, we need to map those values (from the two columns) with their respective number representation.

Mapping the PROPDMGEXP column

First we transform our column to a character class due to problems when mapping in to its original factor form

stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)

Next we use our reference notation to map the values:

stormData[stormData$PROPDMGEXP == "0" | stormData$PROPDMGEXP == "1", 6] <- 10
stormData[stormData$PROPDMGEXP == "2" | stormData$PROPDMGEXP == "3", 6] <- 10
stormData[stormData$PROPDMGEXP == "4" | stormData$PROPDMGEXP == "5", 6] <- 10
stormData[stormData$PROPDMGEXP == "6" | stormData$PROPDMGEXP == "7" | stormData$PROPDMGEXP == "8", 6] <- 10
stormData[stormData$PROPDMGEXP == "-" | stormData$PROPDMGEXP == "?" | stormData$PROPDMGEXP == "", 6] <- 0
stormData[stormData$PROPDMGEXP == "+", 6] <- 1
stormData[stormData$PROPDMGEXP == "H" | stormData$PROPDMGEXP == "h", 6] <- 100
stormData[stormData$PROPDMGEXP == "K" | stormData$PROPDMGEXP == "k", 6] <- 1000
stormData[stormData$PROPDMGEXP == "M" | stormData$PROPDMGEXP == "m", 6] <- 1000000
stormData[stormData$PROPDMGEXP == "B", 6] <- 1000000000

Mapping the CROPDMGEXP column

We repeat the process, first we transform the factor column into a character type:

stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)

then we map the values using our reference notation:

stormData[stormData$CROPDMGEXP == "0" | stormData$CROPDMGEXP == "2", 8] <- 10
stormData[stormData$CROPDMGEXP == "?" | stormData$CROPDMGEXP == "", 8] <- 0
stormData[stormData$CROPDMGEXP == "K" | stormData$CROPDMGEXP == "k", 8] <- 1000
stormData[stormData$CROPDMGEXP == "M" | stormData$CROPDMGEXP == "m", 8] <- 1000000
stormData[stormData$CROPDMGEXP == "B", 8] <- 1000000000

Finally we transform the values from the two columns into a numeric representation:

stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)

Our next step will be the creation of the dataset economicDmg, in which we will group the sum of the variables referred to the economic damage: PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP and we will group them using the function group_by by their respective event type (EVTYPE)

economicDmg <- stormData %>% group_by(EVTYPE) %>% summarise_at(vars(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), sum)

As we seen before with the healthDamage dataset, we also have inconsistencies in the names of the events in our economicDmg dataset

kable(head(economicDmg[c(18:23), ]))
EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
AVALANCE 0.0 0 0 0
AVALANCHE 1623.9 2191000 0 154000
BEACH EROSIN 0.0 0 0 0
Beach Erosion 100.0 1000 0 0
BEACH EROSION 0.0 0 0 0
BEACH EROSION/COASTAL FLOOD 0.0 0 0 0

We also have values with 0 in the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP:

sum(economicDmg$PROPDMG == 0)
[1] 579
sum(economicDmg$PROPDMGEXP == 0)
[1] 576
sum(economicDmg$CROPDMG == 0)
[1] 849
sum(economicDmg$CROPDMGEXP == 0)
[1] 824

We will address the problems with the following strategy:

  • Step 1: Convert to upper case all the event names.
  • Step 2: Remove the 0 values from the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP that occur at the same time.
  • Step 3: Find the incongruences in the event names
  • Step 4: Define the correct names
  • Step 5: Replace the incongruent names with the correct names
  • Step 6: Compute again the sum of the columns PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP according to the event type (EVTYPE) column

Step 1: We achieve this step by just applying the function toupper to the EVTYPE column:

economicDmg$EVTYPE <- sapply(economicDmg$EVTYPE, toupper)

Step 2: We can remove the 0 values using:

economicDmg <- economicDmg[!(economicDmg$PROPDMG == 0 & economicDmg$PROPDMGEXP == 0 &   economicDmg$CROPDMG == 0 & economicDmg$CROPDMGEXP == 0),]

Step 3: We define the incongruent names into the dpnames variable:

dpnames <- c("COASTAL FLOOD", "FLASH FLOOD/", "FLASH FLOODING", "FLASH FLOODING/FLOOD", "FLASH FLOODS", "FLOOD/FLASH", "FLOOD/FLASH FLOOD", "FLOOD/FLASH FLOOD", "FLOOD", "FLOODING", "GUSTY WIND", "HEAVY RAIN", "HIGH WIND", "HIGH WINDS/", "ICY ROADS", "LAKE-EFFECT SNOW", "LANDSLIDE", "LIGHTING", "LIGNTNING", "LIGHTING", "LIGNTNING", "MARINE TSTM WIND", "MUD SLIDE", "MUDSLIDE", "MUDSLIDES", "RIVER FLOOD", "RIP CURRENT", "SEVERE THUNDERSTORM", "SNOW SQUALL", "SNOW/ ICE", "STRONG WIND", "THUNDERESTORM WINDS", "THUNDEERSTORM WINDS", "THUDERSTORM WINDS", "THUNDERSTORM WIND", "THUNDERSTORM WIND.", "THUNDERSTORM WIND/ TREE", "THUNDERSTORM WIND/ TREES", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS.", "THUNDERSTORM WINDSS",  "THUNDERSTORM WINS", "THUNDERSTORM WINDSHAIL", "THUNDERSTORM WINDS HAIL", "THUNDERSTORM WINDS.", "THUNDERSTORM WINDSHAIL", "THUNDERSTORM WINDSS", "THUNDERSTORM WINS", "THUNDERSTORMS WIND", "THUNDERSTORMS WINDS", "THUNDERSTORMW", "THUNDERSTORMWINDS", "THUNDERSTROM WIND", "THUNDERTORM WINDS", "THUNERSTORM WINDS", "TORNDAO", "UNSEASONABLE COLD", "URBAN FLOOD", "URBAN FLOODS", "URBAN/SML STREAM FLD", "URBAN/SMALL STREAM FLOOD", "WATERSPOUT", "WATERSPOUT-", "WATERSPOUT-TORNADO", "WATERSPOUT/ TORNADO", "WATERSPOUT/ TORNADO", "WILDFIRE", "WILDFIRES", "WIND", "WINDS", "WINTER STORM", "WINTER WEATHER/MIX", "TSTM WIND")

Step 4: We defined the correct names into the variable fxnames:

fxnames <- c("COASTAL FLOODING", "FLASH FLOOD", "FLASH FLOOD", "FLASH FLOOD/FLOOD","FLASH FLOOD", "FLOOD FLASH", "FLOOD/FLASHFLOOD", "FLOOD/FLASH/FLOOD", "FLOODS", "FLOODS", "GUSTY WINDS", "HEAVY RAINS", "HIGH WINDS", "HIGH WINDS", "ICE ROADS", "LAKE EFFECT SNOW", "LANDSLIDES", "LIGHTNING", "LIGHTNING", "LIGHTNING", "LIGHTNING", "MARINE THUNDERSTORM WIND", "MUD SLIDES", "MUD SLIDES", "MUD SLIDES", "RIVER FLOODING", "RIP CURRENTS", "SEVERE THUNDERSTORMS", "SNOW SQUALLS", "SNOW/ICE", "STRONG WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WIND TREES",  "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS/HAIL", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "THUNDERSTORM WINDS", "TORNADO", "UNSEASONABLY COLD", "URBAN FLOODING", "URBAN FLOODING", "URBAN/SMALL STREAM FLOODING", "URBAN/SMALL STREAM FLOODING", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WATERSPOUT TORNADO", "WILD FIRES", "WILD FIRES", "WINDS", "WINDS", "WINTER STORMS", "WINTER WEATHER MIX", "THUNDERSTORM WINDS")

Step 5: We replace the incongruent names with the correct names:

for (k in c(1:length(fxnames))){
  economicDmg[economicDmg$EVTYPE == dpnames[k], 1] <- fxnames[k] 
}

Step 6: Now we compute the sum of the columns using the group_by function

economicDmg <- economicDmg %>% group_by(EVTYPE) %>% summarise_at(vars(PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP), sum)

Finally we have both our healthDamage and economicDmg ready for the analysis, with this we conclude the section of Data Proccessing, the next section will focus in the Data Analysis

Data Analysis

Health Damage Data Analysis:

When we inspect our healthDamage data we note that both FATALITIES and INJURIES have their own respective values:

summary(healthDamage$INJURIES)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     3.0   747.5    38.5 91346.0 
summary(healthDamage$FATALITIES)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    2.00   80.56   13.25 5633.00 

But here the idea is to have a general view of the total impact to the population’s health. Now we proceed to calculate the total damage to health, we can do this by adding a new column called TOTALDMG in which we sum the injuries and fatalities.

healthDamage$TOTALDMG <- healthDamage$INJURIES + healthDamage$FATALITIES

Now if we inspect our data ordered by the total damage we can see the most harmful events to the Health:

kable(head(arrange(healthDamage, -TOTALDMG)))
EVTYPE INJURIES FATALITIES TOTALDMG
TORNADO 91346 5633 96979
THUNDERSTORM WINDS 9385 702 10087
EXCESSIVE HEAT 6525 1903 8428
FLOOD 6789 470 7259
LIGHTNING 5230 817 6047
HEAT 2100 937 3037

Now we need a way to defined the top harmful events to health, if we observe the summary of the TOTALDMG:

summary(healthDamage$TOTALDMG)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    1.00     2.00     7.00   828.05    48.75 96979.00 

We note that the mean value is 828.05, this is a quite high number, considering that it groups both injuries and fatalities, so we will take this value (the mean of TOTALDMG) as our metric to defined the top harmulf events to health. Next we will take a subset of the data and we will keep all the values of TOTALDMG that are equal or greater than its mean

healthDamage <- subset(healthDamage, healthDamage$TOTALDMG >= mean(healthDamage$TOTALDMG))

This process reduce our healthDamage data from 188 to only 16 observations. We can now explore the results of our analysis, this will be done in the Results section.

Economic Damage Data Analysis:

When inspecting our economicDmg data we noted the same problem of decide which variable we will consider to select the top economic damages:

summary(economicDmg$PROPDMG)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       5      50   32203     555 3212259 
summary(economicDmg$PROPDMGEXP)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 1.000e+03 2.000e+03 1.531e+08 1.003e+06 1.205e+10 
summary(economicDmg$CROPDMG)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0    4076      10  579596 
summary(economicDmg$CROPDMGEXP)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 0.000e+00 0.000e+00 3.336e+07 1.650e+04 4.143e+09 

As we did before we need a way to compute the total economic damage, but first we need to combine the total damage in properties and crops, we do this multiplyin the values from the variables PROPDMG and CROPDMG with PROPDMGEXP and CROPDMGEXP respectivily:

economicDmg$PROPDMG <- economicDmg$PROPDMG * economicDmg$PROPDMGEXP
economicDmg$CROPDMG <- economicDmg$CROPDMG * economicDmg$CROPDMG

Since the variables PROPDMGEXP and CROPDMGEXP are not longer need, we dismiss them and kept only the variables that we will use:

economicDmg <- select(economicDmg, EVTYPE, PROPDMG, CROPDMG)

The next step is calculate the total economic damage, for which we create the variable TOTALDMG which is the sum of the total values from PROPDMG and CROPDMG

economicDmg$TOTALDMG <- economicDmg$PROPDMG + economicDmg$CROPDMG

Now when we inspect our data we find that some TOTALDMG values are equal to 0.

sum(economicDmg$TOTALDMG == 0)
[1] 4

We proceed to remove those values using:

economicDmg <- economicDmg[!(economicDmg$TOTALDMG == 0),]

Now we will use the mean of the total damage variable to select our top events with the greatest economic impact:

economicDmg <- subset(economicDmg, economicDmg$TOTALDMG >= mean(economicDmg$TOTALDMG))

This process reduce our economicDmg data from 334 to only 7 observations. We can now explore the results of our analysis, this will be done in the Results section. But before we go to the next section, we need to remove the stormData from which we built the datasets healthDamage and economicDmg, because we no longer need it:

remove(stormData)

Results

In this section we show the results of our Data Analysis:

Across the United States, which types of events are most harmful with respect to population health

This question is related to the healthDamage data, when we inspect it we can see that the most harmful events related to population health are:

kable(arrange(healthDamage, -TOTALDMG))
EVTYPE INJURIES FATALITIES TOTALDMG
TORNADO 91346 5633 96979
THUNDERSTORM WINDS 9385 702 10087
EXCESSIVE HEAT 6525 1903 8428
FLOOD 6789 470 7259
LIGHTNING 5230 817 6047
HEAT 2100 937 3037
FLASH FLOOD 1777 980 2757
ICE STORM 1975 89 2064
HIGH WINDS 1439 283 1722
WINTER STORMS 1338 216 1554
HAIL 1361 15 1376
HURRICANE/TYPHOON 1275 64 1339
HEAVY SNOW 1021 127 1148
RIP CURRENTS 529 572 1101
WILDFIRE 911 75 986
BLIZZARD 805 101 906

From the table above we can build a bar plot for a better representation:

library(ggplot2)
ggplot(data = healthDamage, aes(x=reorder(EVTYPE, -TOTALDMG), y = TOTALDMG, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Top 16 harmful Events to Population Health") + xlab("") + ylab("Total Health Damage (Injuries + Fatalities)") + labs(fill = "Event Type") + guides(fill=guide_legend(ncol=2))

Also we show the relationship between injuries and fatalities by the events:

library(ggplot2)
ggplot(data = healthDamage, aes(x=INJURIES, y = FATALITIES)) + geom_point(aes(color = EVTYPE), size = 3) + ggtitle("Relationship between Fatalities and Injuries by Events") + xlab("Total Injuries") + ylab("Total Fatalities") +  labs(color = "Event Type")

As we can see, those are the top 16 most harmful events to population health

Which types of events have the greatest economic consequences?

This question is related to the economicDmg data, when we inspect it we can see that the events with greatest economic consequences are:

kable(arrange(economicDmg, -TOTALDMG))
EVTYPE PROPDMG CROPDMG TOTALDMG
TORNADO 2.416373e+16 10003704343 2.416374e+16
FLOODS 5.935982e+15 29394718453 5.936012e+15
FLASH FLOOD 3.730919e+15 33976262289 3.730953e+15
THUNDERSTORM WINDS 3.199747e+15 37947636090 3.199785e+15
HAIL 1.348529e+15 335931847790 1.348865e+15
HIGH WINDS 5.171167e+14 362628613 5.171171e+14
WINTER STORMS 1.527205e+14 6145391 1.527205e+14

From the table above we can build a bar plot for a better representation:

ggplot(data = economicDmg, aes(x=reorder(EVTYPE, -TOTALDMG), y = TOTALDMG, fill = EVTYPE)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Top 7 Events with Greatest Economic Consequences") + xlab("") + ylab("Total Economic Damage (Properties and Crops)") + labs(fill = "Event Type")

As we can see, those are the top 7 events with greatest economic consequences