Synopsis

The goal of this project is to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States between 1950 and 2011, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage.

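In summary, we look at how these events affect communities from two different points of view: population health (fatalities and injuries) and economic consequences (property and crop damage).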

Data processing

Downloading File

First of all, let’s load the database into a data frame. We download the file from the URL only if it doesn’t already exist in the working directory.

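# Download the archive only if it is not already in the working directory
if (!file.exists("StormData.csv.bz2")) {
  download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = "./StormData.csv.bz2",
                mode = "wb")  # binary mode so the archive is not corrupted on Windows
}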

DataRaw <- read.csv("StormData.csv.bz2")
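
No manual decompression step is needed here, because R's file connections detect bzip2 compression when a file is opened for reading. A minimal sketch on a tiny temporary file (the file and its two rows are made up for illustration and are not part of the storm data):

```r
# Write a tiny bzip2-compressed CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

# read.csv() reads the compressed file directly, as it does for StormData.csv.bz2.
toy <- read.csv(tmp)
print(toy)

# Clean up the temporary file.
unlink(tmp)
```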

Exploratory Data Analysis

For this section, we need to load the following R package:

  • dplyr, for data manipulation.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Now we can inspect the contents of the data frame to find out which variables we need for the analysis.

str(DataRaw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels ""," ","  ","   ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The data frame has many variables, but to answer the two main questions (see the synopsis) we need only the following: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. To do this, we create a new data frame with only the variables of interest and remove the initial data frame to free memory.

DataInterest <- select(DataRaw, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
rm(DataRaw)

The new data frame now contains only the variables of interest for our analysis.

str(DataInterest)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

We will group the events by type and sum up the fatalities or injuries. The results will be saved in two new data frames.

# summarise total fatalities by event type and save the result in a new data frame
EventFatalities <- DataInterest %>% select(EVTYPE, FATALITIES) %>%
  group_by(EVTYPE) %>% summarise(Total_Fatalities=sum(FATALITIES)) %>% arrange(-Total_Fatalities)

# summarise total injuries by event type and save the result in a new data frame
EventInjuries <- DataInterest %>% select(EVTYPE, INJURIES) %>%
  group_by(EVTYPE) %>% summarise(Total_Injuries=sum(INJURIES)) %>% arrange(-Total_Injuries)

Now we can see the top ten event types by fatalities and by injuries between 1950 and 2011.

head(EventFatalities, 10) #show top ten events fatalities
## # A tibble: 10 x 2
##    EVTYPE         Total_Fatalities
##    <fct>                     <dbl>
##  1 TORNADO                    5633
##  2 EXCESSIVE HEAT             1903
##  3 FLASH FLOOD                 978
##  4 HEAT                        937
##  5 LIGHTNING                   816
##  6 TSTM WIND                   504
##  7 FLOOD                       470
##  8 RIP CURRENT                 368
##  9 HIGH WIND                   248
## 10 AVALANCHE                   224
head(EventInjuries, 10) #show top ten events injuries
## # A tibble: 10 x 2
##    EVTYPE            Total_Injuries
##    <fct>                      <dbl>
##  1 TORNADO                    91346
##  2 TSTM WIND                   6957
##  3 FLOOD                       6789
##  4 EXCESSIVE HEAT              6525
##  5 LIGHTNING                   5230
##  6 HEAT                        2100
##  7 ICE STORM                   1975
##  8 FLASH FLOOD                 1777
##  9 THUNDERSTORM WIND           1488
## 10 HAIL                        1361
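
The group, sum, and sort pattern used for these tables can be tried on a small example. The sketch below is a base R equivalent of the dplyr pipeline, chosen only so that it runs without extra packages; the toy data frame and its values are invented for illustration:

```r
# Toy data: event types with fatality counts (values are illustrative only).
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 2)
)

# Sum fatalities per event type.
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort so that the event type with the most fatalities comes first.
totals <- totals[order(-totals$FATALITIES), ]
print(totals)
```

Each event type appears once in the result, with its fatalities summed over all rows of that type.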

Special Processing for Property and Crop Damage Data

To express the property and crop damage in dollars, we need to extract the numeric exponent from the PROPDMGEXP and CROPDMGEXP variables and create two new columns with the resulting amounts. But first, we inspect the unique values of these variables:

unique(DataInterest$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(DataInterest$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

We can see a mix of numeric and character values. We need to turn this information into a numeric exponent (a power of ten) to get the final dollar amount. Numeric values need no change, but the characters must be transformed as follows:

  • “”, “-”, “+” and “?” become 0
  • Digits from 0 to 9 are kept as they are
  • “h” or “H” becomes 2
  • “k” or “K” becomes 3
  • “m” or “M” becomes 6
  • “b” or “B” becomes 9

The following R statements do this for us:

# convert all characters to upper case
DataInterest$PROPDMGEXP <- toupper(DataInterest$PROPDMGEXP)
DataInterest$CROPDMGEXP <- toupper(DataInterest$CROPDMGEXP)
# convert the undefined symbols to 0
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("", "-", "+", "?", "0")] <- "0"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("", "-", "+", "?", "0")] <- "0"
# convert the character 'H' to '2'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("H")] <- "2"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("H")] <- "2"
# convert the character 'K' to '3'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("K")] <- "3"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("K")] <- "3"
# convert the character 'M' to '6'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("M")] <- "6"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("M")] <- "6"
# convert the character 'B' to '9'
DataInterest$PROPDMGEXP[DataInterest$PROPDMGEXP %in% c("B")] <- "9"
DataInterest$CROPDMGEXP[DataInterest$CROPDMGEXP %in% c("B")] <- "9"
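
The same mapping can also be written as a lookup table. The sketch below is only an alternative illustration, not the code used in this analysis; the names exp_lookup and exp_to_power are hypothetical:

```r
# Named lookup vector: suffix letter -> power of ten.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

# Hypothetical helper (not part of the original analysis):
# returns the numeric exponent for each code in x.
exp_to_power <- function(x) {
  x <- toupper(as.character(x))
  # Digits "0".."9" convert directly; everything else becomes NA for now.
  power <- suppressWarnings(as.numeric(x))
  # Letters are translated through the lookup vector.
  is_letter <- x %in% names(exp_lookup)
  power[is_letter] <- unname(exp_lookup[x[is_letter]])
  # Remaining undefined symbols ("", "-", "+", "?") become 0.
  power[is.na(power)] <- 0
  power
}

result <- exp_to_power(c("K", "m", "", "?", "5", "B", "h"))
print(result)
```

Keeping the letter-to-exponent pairs in one vector makes the rules easier to check against the list above.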

Now we add two new variables holding the damage in dollars:

  • PROPDMGDOLLARS = PROPDMG × 10^PROPDMGEXP
  • CROPDMGDOLLARS = CROPDMG × 10^CROPDMGEXP
DataInterest$PROPDMGDOLLARS <- DataInterest$PROPDMG*10^as.numeric(DataInterest$PROPDMGEXP)
DataInterest$CROPDMGDOLLARS <- DataInterest$CROPDMG*10^as.numeric(DataInterest$CROPDMGEXP)
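
The as.numeric() calls work as intended because toupper() has already turned the factor columns into character vectors. Applied directly to a factor, as.numeric() would return the internal level codes rather than the printed values. A small sketch of the difference, followed by the arithmetic for the first row of the data (PROPDMG 25 with exponent "K", that is 3); the factor f is a made-up example:

```r
# A factor whose labels look like numbers (made-up example).
f <- factor(c("3", "6", "9"))

codes  <- as.numeric(f)                # internal level codes: 1 2 3
values <- as.numeric(as.character(f))  # the printed values: 3 6 9
print(codes)
print(values)

# Worked example: 25 with exponent "3" (originally "K") is 25 * 10^3 dollars.
dollars <- 25 * 10^as.numeric("3")
print(dollars)
```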

We group the events by type and sum up the property and crop damage. The results are saved in two new data frames: one for property damage, named EventPropertyDamage, and another for crop damage, named EventCropDamage.

# summarise property damage
EventPropertyDamage <- DataInterest %>% select(EVTYPE, PROPDMGDOLLARS) %>%
  group_by(EVTYPE) %>% summarise(Total_Property_Damages=sum(PROPDMGDOLLARS)) %>% arrange(-Total_Property_Damages)
# summarise for crop damage
EventCropDamage <- DataInterest %>% select(EVTYPE, CROPDMGDOLLARS) %>%
  group_by(EVTYPE) %>% summarise(Total_Crop_Damages=sum(CROPDMGDOLLARS)) %>% arrange(-Total_Crop_Damages)

Finally, we can inspect the top ten event types by impact on property and crops in these two tables:

head(EventPropertyDamage, 10)
## # A tibble: 10 x 2
##    EVTYPE            Total_Property_Damages
##    <fct>                              <dbl>
##  1 FLOOD                       144657709807
##  2 HURRICANE/TYPHOON            69305840000
##  3 TORNADO                      56947380676
##  4 STORM SURGE                  43323536000
##  5 FLASH FLOOD                  16822673978
##  6 HAIL                         15735267513
##  7 HURRICANE                    11868319010
##  8 TROPICAL STORM                7703890550
##  9 WINTER STORM                  6688497251
## 10 HIGH WIND                     5270046295
head(EventCropDamage, 10)
## # A tibble: 10 x 2
##    EVTYPE            Total_Crop_Damages
##    <fct>                          <dbl>
##  1 DROUGHT                  13972566000
##  2 FLOOD                     5661968450
##  3 RIVER FLOOD               5029459000
##  4 ICE STORM                 5022113500
##  5 HAIL                      3025954473
##  6 HURRICANE                 2741910000
##  7 HURRICANE/TYPHOON         2607872800
##  8 FLASH FLOOD               1421317100
##  9 EXTREME COLD              1292973000
## 10 FROST/FREEZE              1094086000

Results

Figure 1 shows bar plots of the top ten event types by fatalities and by injuries:

par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(EventFatalities$Total_Fatalities[1:10], names.arg = EventFatalities$EVTYPE[1:10], las=3, col="red", main="Top Ten Fatalities Events", ylab="Number of Fatalities")
barplot(EventInjuries$Total_Injuries[1:10], names.arg = EventInjuries$EVTYPE[1:10], las=3, col="yellow", main="Top Ten Injuries Events", ylab="Number of Injuries")

Figure 1

Figure 2 shows the economic impact of these events:

par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(EventPropertyDamage$Total_Property_Damages[1:10], names.arg = EventPropertyDamage$EVTYPE[1:10], col = "green", main = "Top Ten Economic Impact in Properties", ylab="Dollars", las=3)
barplot(EventCropDamage$Total_Crop_Damages[1:10], names.arg = EventCropDamage$EVTYPE[1:10], col = "green", main = "Top Ten Economic Impact in Crops", ylab="Dollars", las=3)

Figure 2
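
Because the totals run into the hundreds of billions, the dollar axis can be easier to read when the values are rescaled. The sketch below is an optional variation, not the code that produced Figure 2; it reuses the top three property damage totals from the table in the data processing section, and the names damage and damage_bn are introduced here only for illustration:

```r
# Top three property damage totals from the table above, in dollars.
damage <- c(FLOOD = 144657709807,
            `HURRICANE/TYPHOON` = 69305840000,
            TORNADO = 56947380676)

# Rescale to billions of dollars for a more readable axis.
damage_bn <- damage / 1e9

# pdf(NULL) opens a device that writes no file, so the sketch runs anywhere;
# in an interactive session these two device calls can be dropped.
pdf(NULL)
mids <- barplot(damage_bn, las = 3, col = "green",
                main = "Property Damage (top three)",
                ylab = "Billions of dollars")
invisible(dev.off())

print(round(damage_bn, 1))
```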