Synopsis

This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the severe weather events with the greatest economic consequences and those most harmful to human health. Answering these questions requires some data processing: weather events are grouped into categories, and the results are reported by group.

Data Processing

Let’s start by loading the required packages and the data from the working directory, and by identifying the key variables for this project.

require(knitr)
## Loading required package: knitr
raw_data<-read.csv("repdata_data_StormData.csv.bz2")

str(raw_data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Key variables for harm to human health:

  1. EVTYPE: Describes the weather event

  2. FATALITIES: Number of deaths

  3. INJURIES: Number of people injured

Key variables for economic consequences:

  1. EVTYPE: Describes the weather event

  2. PROPDMG: Estimated property damage

  3. CROPDMG: Estimated crop damage

I will create a new data set called data containing only those variables. Because grep() matches partial names, the PROPDMGEXP and CROPDMGEXP columns are selected as well:

to_use <- c(grep("EVTYPE",names(raw_data)),grep("FATALITIES",names(raw_data)),grep("INJURIES",names(raw_data)),grep("PROPDMG",names(raw_data)),grep("CROPDMG",names(raw_data)))

data <- raw_data[,to_use]
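The column selection above can also be written with a single pattern. The following is a minimal sketch on a toy data frame: the `toy` object and its values are invented for illustration, and only the column names mirror the storm data. It also shows that the PROPDMG and CROPDMG patterns pick up the matching EXP columns.

```r
# Toy stand-in for raw_data; values are invented, column names mirror the storm data.
toy <- data.frame(STATE = "AL", EVTYPE = "TORNADO", FATALITIES = 0, INJURIES = 15,
                  PROPDMG = 25, PROPDMGEXP = "K", CROPDMG = 0, CROPDMGEXP = "",
                  stringsAsFactors = FALSE)

# One regular expression with alternation replaces the five separate grep() calls.
keep <- grep("EVTYPE|FATALITIES|INJURIES|PROPDMG|CROPDMG", names(toy))

# Names of the selected columns, in their original order.
selected <- names(toy)[keep]
print(selected)
```

Selecting by index keeps the columns in their original order, so the result should have the same layout as the five-call version.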

Exploring the event types shows that many events are duplicated because of differences between upper and lower case spelling. The number of unique event types is:

print(length(unique(raw_data$EVTYPE)))
## [1] 985

To solve that problem, I will use the function tolower() to convert all the weather event names to lower case.

data$EVTYPE <- tolower(data$EVTYPE)

After the conversion, the number of unique events in our edited data set is:

print(length(unique(data$EVTYPE)))
## [1] 898
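Case is not the only source of duplicates: some labels also differ in whitespace (the first EVTYPE level shown above starts with three spaces). As a hedged sketch of an optional extra step, which the report itself does not apply, the labels could also be trimmed and have repeated spaces collapsed. The `ev` vector below is a small invented sample.

```r
# Small invented sample of spelling variants of the kind found in EVTYPE.
ev <- c("   HIGH SURF ADVISORY", "High Surf Advisory", "TSTM WIND", "Tstm  Wind")

# Lower-case, trim leading/trailing whitespace, and collapse runs of spaces.
clean <- gsub(" +", " ", trimws(tolower(ev)))

# Compare the number of distinct labels before and after cleaning.
n_before <- length(unique(ev))
n_after  <- length(unique(clean))
print(c(before = n_before, after = n_after))
```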

This number is still very high.

I am going to define 8 categories to group the weather events:

1. Water: damage caused mainly by water, such as precipitation, flooding, etc.

2. Wind: damage caused mainly by moving air, such as hurricanes, wind, blizzards, etc.

3. Storm: damage caused mainly by storms, lightning, etc.

4. Snow: damage caused mainly by snow, avalanches, etc.

5. Fire: damage caused mainly by fire, smoke, etc.

6. LowTemp: damage caused mainly by low temperatures, frost, hypothermia, etc.

7. HighTemp: damage caused mainly by record high temperatures, heat, etc.

8. Other: the remaining events not covered by the groups above

The next step is to create a new variable called Damage_g in data, labelling each event with one of the groups above.

# One keyword vector per group; each vector holds the keywords to look up in EVTYPE
Wind_damage  <- c("tornado","tornados", "wind","hurricane","winds","blizzard","severe turbulence","blowing dust","microburst","beach erosin",
                  "blow-out tide","wnd","torndao","typhoon")

Water_damage  <- c("hail", "rain","rains","funnel cloud","funnel","cloud","clouds","record rainfall","precipitation","water",
                   "rip current", "flooding","flood","floods","waterspout","high tides","surf","water","marine mishap","high seas",
                   "urban/small","unseasonably wet","wayterspout" ,"stream","excessive wetness","waves","wet","flash floooding",
                   "heavy seas","showers","small stream","marine accident","precip","coastal erosion","tsunami","astronomical low tide",
                   "coastal surge","heavy shower","beach erosion","remnants of floyd","rough seas","rogue wave","drowning")

Storm_damage  <- c("storm","lighting","thunderstorm","thunderstorms","tstm","rainstorm","ligntning","urban/small strm fldg","northern lights")

LowTemp_damage <- c("ice","cold", "cool","freeze","record low","low temperature record","freezing drizzle","glaze","freezing",
                    "wintry mix","winter weather","frost","icy","hypothermia","low temperature","frost","hypothermia/exposure","low temp",
                    "hyperthermia/exposure")
  
HighTemp_damage  <- c("heat","high temperature","record high","record high temperatures","record warmth","dry","warm","drought","hot")
  
Fire_damage  <- c("fire","fires","smoke")

Snow_damage  <- c("avalanche","snow","record snowfall","sleet","snowpack","avalance")

Other_damage <- c("apache county","dust devil","high","mudslides","urban and small","downburst","gustnado and","mud slide","mud slides","other",
                  "record temperatures","southeast","mud/rock slide","landslide" ,"landslides","excessive","saharan dust","mild pattern",
                  "swells","urban small","temperature record","landslump","record temperature","summary","no severe weather","dam break",
                  "none","landspout","winter weather mix", "rock slide","dam break","gustnado","mudslide","heavy mix","dam failure",
                  "?","winter mix","wintery mix","seiche","monthly temperature","vog","tropical depression","driest month","red flag criteria",
                  "dust devel","volcanic ash","volcanic ashfall","volcanic","volcanic ash plume","volcanic eruption","fog")

damage <- list("Other" = Other_damage,"Wind"=Wind_damage,"Water" = Water_damage,"Storm" = Storm_damage,"LowTemp" = LowTemp_damage,
              "HighTemp" = HighTemp_damage,"Fire" = Fire_damage,"Snow" = Snow_damage)

# damage is a list containing the keywords for all the groups. Each element of the list is a group

data$Damage_g <- rep("",dim(data)[1])   # Define the new variable with the groups

for(i in seq_along(damage)){     # Loop over groups and keywords to fill Damage_g; later groups overwrite earlier matches
  
    for(j in seq_along(damage[[i]])){
      data$Damage_g[grep(damage[[i]][j],data$EVTYPE)]  <- names(damage)[[i]]
      
    }
  
}
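Two properties of this loop are worth keeping in mind: grep() treats each keyword as a regular expression by default (so a keyword such as "?" is a metacharacter, not a literal), and an event matching keywords from several groups ends up in whichever group comes last in the list. The sketch below illustrates the precedence rule on invented data. The `groups` and `events` objects are hypothetical, and `fixed = TRUE` is an assumption that differs from the report's code: it forces literal matching.

```r
# Toy keyword lists (a small subset of the report's groups) and invented events.
groups <- list(Other = c("dust devil"),
               Wind  = c("wind", "tornado"),
               Storm = c("tstm", "storm"))
events <- c("tornado", "tstm wind", "winter storm", "dust devil", "high wind")

group_of <- rep("", length(events))
for (g in names(groups)) {
  for (kw in groups[[g]]) {
    # fixed = TRUE matches the keyword literally instead of as a regular expression.
    group_of[grepl(kw, events, fixed = TRUE)] <- g
  }
}

# "tstm wind" matches both Wind and Storm; Storm wins because it comes later.
print(data.frame(events, group_of))
```

Reordering the list changes which group wins for ambiguous labels, so the list order is effectively part of the classification.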

Now, let’s calculate the total number of fatalities and injuries per event group. The result will be stored in a data frame called health:

fatal  <- aggregate(FATALITIES ~ Damage_g, data= data, sum)
injur  <- aggregate(INJURIES ~ Damage_g, data = data, sum)

health <- merge(fatal, injur, "Damage_g")

Next, health is ordered by the total number of fatalities per event group.

health <- health[order(health$FATALITIES, decreasing = TRUE),]

print(health)
##   Damage_g FATALITIES INJURIES
## 8     Wind       6333    95397
## 2 HighTemp       3181     9276
## 7    Water       2478    11359
## 6    Storm       1088    11765
## 4    Other        944     6410
## 3  LowTemp        635     3377
## 5     Snow        396     1336
## 1     Fire         90     1608
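The two aggregate() calls followed by merge() can be combined into a single call by placing both columns on the left-hand side of the formula. This is a minimal sketch on invented data (the `toy` data frame is hypothetical), not a replacement for the output shown above.

```r
# Invented toy data with the same column names as the report's data set.
toy <- data.frame(Damage_g   = c("Wind", "Wind", "Water", "Fire"),
                  FATALITIES = c(2, 1, 0, 4),
                  INJURIES   = c(10, 5, 3, 0))

# cbind() on the left-hand side sums both columns per group in one step.
health_toy <- aggregate(cbind(FATALITIES, INJURIES) ~ Damage_g, data = toy, FUN = sum)

# Order by fatalities, largest first, as in the report.
health_toy <- health_toy[order(health_toy$FATALITIES, decreasing = TRUE), ]
print(health_toy)
```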

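To calculate the total property damage and the total crop damage, we follow a process analogous to the one described above. PROPDMG and CROPDMG are summed as recorded, without applying the PROPDMGEXP and CROPDMGEXP multipliers, so the totals are raw magnitudes rather than dollar amounts.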

prop <- aggregate(PROPDMG ~ Damage_g, data = data, sum)
crop <- aggregate(CROPDMG ~ Damage_g, data= data, sum)

economic <- merge(prop, crop, "Damage_g")
economic$Total  <- economic$PROPDMG + economic$CROPDMG # A new "Total" variable is created so the data frame can be ordered by it

economic <- economic[order(economic$Total, decreasing = TRUE),]

print(economic)
##   Damage_g    PROPDMG   CROPDMG      Total
## 7    Water 3226169.69 961333.34 4187503.03
## 8     Wind 3717726.33 133088.72 3850815.05
## 6    Storm 2893713.27 211303.49 3105016.76
## 4    Other  646000.41   4653.56  650653.97
## 5     Snow  154726.53   2195.72  156922.25
## 1     Fire  125323.29   9565.74  134889.03
## 3  LowTemp  111770.98  20289.95  132060.93
## 2 HighTemp    9069.51  35396.80   44466.31
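The figures above are sums of the recorded PROPDMG and CROPDMG values. In the storm data the companion columns PROPDMGEXP and CROPDMGEXP carry a magnitude code, with K, M and B denoting thousands, millions and billions. The sketch below shows one way such a code could be applied before summing. The `toy` data frame and the `PROPDMG_USD` column are hypothetical, and treating every other code as a multiplier of 1 is an assumption, not something the report does.

```r
# Invented toy rows; only the column names come from the storm data.
toy <- data.frame(PROPDMG    = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)

# Multipliers for the documented letter codes.
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Any code other than K/M/B is left unscaled (assumption made for this sketch).
code <- toupper(toy$PROPDMGEXP)
factor_by_row <- ifelse(code %in% names(mult), mult[code], 1)

# Damage expressed in dollars under the assumptions above.
toy$PROPDMG_USD <- toy$PROPDMG * factor_by_row
print(toy)
```

Applying the multipliers can change the ranking of the groups, since a single entry coded in billions outweighs many entries coded in thousands.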

Results

The answers to the proposed questions are detailed below:

1. Across the United States, which types of events are most harmful with respect to population health?

kable(health, row.names = FALSE)

|Damage_g | FATALITIES| INJURIES|
|:--------|----------:|--------:|
|Wind     |       6333|    95397|
|HighTemp |       3181|     9276|
|Water    |       2478|    11359|
|Storm    |       1088|    11765|
|Other    |        944|     6410|
|LowTemp  |        635|     3377|
|Snow     |        396|     1336|
|Fire     |         90|     1608|

barplot(t(health[,2:3]), names.arg = health$Damage_g, legend = c("FATALITIES", "INJURIES"), 
        main = "Total fatalities and injuries by group of events", col = c("red","yellow"),
        las=2)

The results are stark: wind events cause by far the most fatalities and injuries across the U.S. (6,333 deaths and 95,397 injuries, more than 100,000 casualties in total since 1950).

Taken separately, wind events are the leading cause of both deaths and injuries from severe weather.
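To put the table in proportion, each group's share of the totals can be computed directly from the figures printed above. The sketch below re-enters those figures by hand; the `health_tab`, `fatal_share` and `injury_share` names are introduced here for illustration only.

```r
# Figures copied from the health table in this report.
health_tab <- data.frame(
  Damage_g   = c("Wind", "HighTemp", "Water", "Storm", "Other", "LowTemp", "Snow", "Fire"),
  FATALITIES = c(6333, 3181, 2478, 1088, 944, 635, 396, 90),
  INJURIES   = c(95397, 9276, 11359, 11765, 6410, 3377, 1336, 1608))

# Share of each group in the overall totals.
health_tab$fatal_share  <- health_tab$FATALITIES / sum(health_tab$FATALITIES)
health_tab$injury_share <- health_tab$INJURIES / sum(health_tab$INJURIES)

# Print the shares as percentages with one decimal.
print(cbind(health_tab["Damage_g"],
            round(100 * health_tab[c("fatal_share", "injury_share")], 1)))
```

On these figures, the Wind group accounts for roughly 42% of recorded fatalities and 68% of recorded injuries.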

2. Across the United States, which types of events have the greatest economic consequences?

kable(economic, row.names = FALSE)

|Damage_g |    PROPDMG|   CROPDMG|      Total|
|:--------|----------:|---------:|----------:|
|Water    | 3226169.69| 961333.34| 4187503.03|
|Wind     | 3717726.33| 133088.72| 3850815.05|
|Storm    | 2893713.27| 211303.49| 3105016.76|
|Other    |  646000.41|   4653.56|  650653.97|
|Snow     |  154726.53|   2195.72|  156922.25|
|Fire     |  125323.29|   9565.74|  134889.03|
|LowTemp  |  111770.98|  20289.95|  132060.93|
|HighTemp |    9069.51|  35396.80|   44466.31|

barplot(t(economic[,2:3]), names.arg = economic$Damage_g, legend = c("PROPERTY", "CROP"), 
        main = "Total property and crop damage by groups of events", col = c("blue","green"),
        las=2)

Aggregating crop and property damage shows that Water, Wind and Storm are the event groups with the greatest economic consequences across the U.S. since 1950, with Wind events leading for property damage and Water events for crop damage.