1. Synopsis

 

In this report, we aim to identify which types of severe weather events are most harmful to population health, and which have the greatest economic consequences, across the United States from 1950 to November 2011. To investigate these questions, we obtained and explored the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States and covers 1950 to November 2011. From these data, we found that tornadoes are the most harmful weather events in terms of population health, and that floods, hurricanes/typhoons, and tornadoes have the greatest economic consequences.

 

2. Data processing

 

From the NOAA website, we obtained the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States and covers 1950 to November 2011.

 

2.1 Reading the data

 

The data come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. We first download the file from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 and then read the csv file into R. The read.csv function can read the csv.bz2 file directly, without decompressing it first, which is what we do here.

data <- read.csv("repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)
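The download step described above is not shown in code. A minimal sketch of one way to do it, assuming the archive should be fetched only when it is not already in the working directory; the helper name fetch_if_missing is hypothetical and not part of the original analysis:

```r
# URL and local file name as given in the text above
storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata-data-StormData.csv.bz2"

# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet
fetch_if_missing <- function(url, dest) {
    if (!file.exists(dest)) {
        # mode = "wb" writes the archive as binary, which matters on Windows
        download.file(url, destfile = dest, mode = "wb")
    }
    dest
}

# Usage (left commented out because it downloads a large file):
# data <- read.csv(fetch_if_missing(storm_url, storm_file), stringsAsFactors = FALSE)
```

Skipping the download when the file is already present keeps repeated runs of the report fast.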

Let’s have a look at the dimensions of the dataset.

dim(data) # 902297 * 37
## [1] 902297     37

The dataset has 902,297 rows and 37 variables.

We can look at the general structure of the data to see what those variables are and what class each one has.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We can also have a quick look at the first rows in this dataset.

head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

First, we check whether there are any missing values. We use the summary function for this purpose, since it reports the number of NAs for the numeric and logical variables.

summary(data)
##     STATE__       BGN_DATE           BGN_TIME          TIME_ZONE        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      COUNTY       COUNTYNAME           STATE              EVTYPE         
##  Min.   :  0.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31.0   Class :character   Class :character   Class :character  
##  Median : 75.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :100.6                                                           
##  3rd Qu.:131.0                                                           
##  Max.   :873.0                                                           
##                                                                          
##    BGN_RANGE          BGN_AZI           BGN_LOCATI       
##  Min.   :   0.000   Length:902297      Length:902297     
##  1st Qu.:   0.000   Class :character   Class :character  
##  Median :   0.000   Mode  :character   Mode  :character  
##  Mean   :   1.484                                        
##  3rd Qu.:   1.000                                        
##  Max.   :3749.000                                        
##                                                          
##    END_DATE           END_TIME           COUNTY_END COUNTYENDN    
##  Length:902297      Length:902297      Min.   :0    Mode:logical  
##  Class :character   Class :character   1st Qu.:0    NA's:902297   
##  Mode  :character   Mode  :character   Median :0                  
##                                        Mean   :0                  
##                                        3rd Qu.:0                  
##                                        Max.   :0                  
##                                                                   
##    END_RANGE          END_AZI           END_LOCATI       
##  Min.   :  0.0000   Length:902297      Length:902297     
##  1st Qu.:  0.0000   Class :character   Class :character  
##  Median :  0.0000   Mode  :character   Mode  :character  
##  Mean   :  0.9862                                        
##  3rd Qu.:  0.0000                                        
##  Max.   :925.0000                                        
##                                                          
##      LENGTH              WIDTH                F               MAG         
##  Min.   :   0.0000   Min.   :   0.000   Min.   :0.0      Min.   :    0.0  
##  1st Qu.:   0.0000   1st Qu.:   0.000   1st Qu.:0.0      1st Qu.:    0.0  
##  Median :   0.0000   Median :   0.000   Median :1.0      Median :   50.0  
##  Mean   :   0.2301   Mean   :   7.503   Mean   :0.9      Mean   :   46.9  
##  3rd Qu.:   0.0000   3rd Qu.:   0.000   3rd Qu.:1.0      3rd Qu.:   75.0  
##  Max.   :2315.0000   Max.   :4400.000   Max.   :5.0      Max.   :22000.0  
##                                         NA's   :843563                    
##    FATALITIES          INJURIES            PROPDMG       
##  Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Median :  0.0000   Median :   0.0000   Median :   0.00  
##  Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##  3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##  Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##                                                          
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000                     
##                                                         
##      WFO             STATEOFFIC         ZONENAMES            LATITUDE   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    LONGITUDE        LATITUDE_E     LONGITUDE_       REMARKS         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      REFNUM      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

Some variables have missing values. For this study, we use the following variables:

  • STATE
  • EVTYPE
  • FATALITIES
  • INJURIES
  • PROPDMG
  • PROPDMGEXP
  • CROPDMG
  • CROPDMGEXP

As none of those variables has missing values in the dataset, no transformation is needed to handle them. We do, however, create a subset of the initial dataset containing only those variables.

data_new <- data[,c("STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

We check the dimensions of this new dataset.

dim(data_new) # 902297 * 8
## [1] 902297      8
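The claim that the selected variables have no missing values can also be checked directly. A minimal sketch on a small made-up data frame (toy and its values are assumptions for illustration); note that is.na does not flag empty strings, which is why empty codes such as those in the EXP columns are counted separately:

```r
# Toy stand-in for the storm data (made-up values)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
    FATALITIES = c(0, 1, 0),
    PROPDMGEXP = c("K", "", "M"),
    F          = c(3L, NA, NA),
    stringsAsFactors = FALSE
)

# Number of NA values per column
na_counts <- colSums(is.na(toy))

# Number of empty strings per column (NA entries are ignored)
empty_counts <- sapply(toy, function(x) sum(x == "", na.rm = TRUE))

print(na_counts)
print(empty_counts)
```

On the real data, the same check would be colSums(is.na(data_new)).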

 

2.2 Preprocessing for harmfulness

 

The type of event is given by the variable EVTYPE. The harm caused by an event is given by two variables: FATALITIES and INJURIES.

We first sum the INJURIES and FATALITIES variables, then aggregate this sum by event type.

data_new$harm <- data_new$INJURIES + data_new$FATALITIES

harm <- aggregate(harm ~ EVTYPE, data= data_new, sum)
harm <- harm[order(harm$harm, decreasing = TRUE),]
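The two steps above can be traced on a small made-up data frame (toy and its values are assumptions for illustration), which makes it easy to confirm by hand that the totals are summed per event type and then sorted in decreasing order:

```r
# Toy stand-in for data_new (made-up values)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
    FATALITIES = c(1, 2, 0, 5),
    INJURIES   = c(10, 0, 4, 1),
    stringsAsFactors = FALSE
)

# Per-row harm, then one total per event type
toy$harm <- toy$INJURIES + toy$FATALITIES
toy_harm <- aggregate(harm ~ EVTYPE, data = toy, sum)

# Sort so that the most harmful event type comes first
toy_harm <- toy_harm[order(toy_harm$harm, decreasing = TRUE), ]
print(toy_harm)
```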

 

2.3 Preprocessing for economic consequences

 

The economic consequences can be assessed with the variables PROPDMG and CROPDMG, which are numerical values. These values must be multiplied by the magnitudes encoded in PROPDMGEXP and CROPDMGEXP, respectively. Let’s have a look at those variables.

unique(data_new$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(data_new$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

These characters signify the magnitude of the number, e.g. 1.55B for $1,550,000,000: “K” for thousands, “M” for millions, and “B” for billions. However, the coding is inconsistent: for example, both “m” and “M” appear, and they mean the same thing. We therefore first convert these characters to upper case to clean them up.

data_new$PROPDMGEXP <- toupper(data_new$PROPDMGEXP)
unique(data_new$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "+" "0" "5" "6" "?" "4" "2" "3" "H" "7" "-" "1" "8"
data_new$CROPDMGEXP <- toupper(data_new$CROPDMGEXP)
unique(data_new$CROPDMGEXP)
## [1] ""  "M" "K" "B" "?" "0" "2"

We know from the documentation the meaning of the letters in PROPDMGEXP and CROPDMGEXP, but we have no information about the digits in those variables, which presumably reflect an older encoding that may no longer be in use. Using the table function, we can see that those digits occur far less often in the dataset than the letters K and M.

table(data_new$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      H      K      M 
##      4      5      1     40      7 424665  11337
table(data_new$CROPDMGEXP)
## 
##             ?      0      2      B      K      M 
## 618413      7     19      1      9 281853   1995
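To put a number on how rare the digit codes are, the counts printed above can be divided by the total number of rows. This is a small sketch that copies the counts by hand from the table output; the variable names are illustrative only:

```r
n_rows <- 902297  # total rows, from dim(data)

# Counts of digit codes, copied from the table() output above
prop_digits <- c("0" = 216, "1" = 25, "2" = 13, "3" = 4, "4" = 4,
                 "5" = 28, "6" = 4, "7" = 5, "8" = 1)
crop_digits <- c("0" = 19, "2" = 1)

prop_share <- sum(prop_digits) / n_rows   # 300 rows carry a digit code
crop_share <- sum(crop_digits) / n_rows   # 20 rows carry a digit code

# Shares expressed in percent
print(100 * c(PROPDMGEXP = prop_share, CROPDMGEXP = crop_share))
```

Both shares are well below 0.1% of the rows.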

We therefore do not scale values that carry those digits (or the symbols “-”, “?” and “+”): only the letters, whose meaning we are sure of, are used as multipliers, and every other code is treated as a multiplier of 1.

First, we create a function to convert the values, and then we store the converted values in new variables.

conv <- function(dmg, dmgexp){
    dmg * switch(dmgexp, H = 100, K = 1000, M = 10^6, B = 10^9, 1)
}

data_new$cProp <- mapply(conv, data_new$PROPDMG, data_new$PROPDMGEXP)
data_new$cCROP <- mapply(conv, data_new$CROPDMG, data_new$CROPDMGEXP)
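Calling switch once per row through mapply can be slow on roughly 900,000 rows. A possible vectorized alternative, sketched under the assumption that unknown codes should keep a multiplier of 1 as in conv above; the names multipliers and conv_vec are hypothetical:

```r
# Multipliers for the letter codes (same values as in conv)
multipliers <- c(H = 100, K = 1000, M = 10^6, B = 10^9)

# Hypothetical vectorized counterpart of conv
conv_vec <- function(dmg, dmgexp) {
    # match() returns NA for any code that is not H, K, M or B
    idx <- match(toupper(dmgexp), names(multipliers))
    m <- unname(multipliers[idx])
    m[is.na(m)] <- 1   # empty strings, digits and symbols: leave the value unscaled
    dmg * m
}

# Small check on made-up values
print(conv_vec(c(25, 2.5, 1, 3, 7), c("K", "m", "B", "", "5")))
```

On the real data this could be used as data_new$cProp <- conv_vec(data_new$PROPDMG, data_new$PROPDMGEXP).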

Now we aggregate the dataset to prepare the plot. First, we add up the property and crop damages.

data_new$cost <- data_new$cProp + data_new$cCROP

eco <- aggregate(cost ~ EVTYPE, data= data_new, sum)
eco <- eco[order(eco$cost, decreasing = TRUE),]

 

3. Results

 

3.1 Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

 

library(ggplot2)

harm15 <- harm[1:15,]

g <- ggplot(data = harm15, aes(EVTYPE, harm, fill = harm))
g <- g + geom_bar(stat = "identity")
g <- g + xlab("Top 15 events")
g <- g + ylab("harmful measurement (Fatalities + Injuries)")
g <- g + ggtitle("15 most harmful events \n from 1950 to November 2011")
g <- g + coord_flip()
g

    As we can see from the plot above, tornadoes have the greatest impact on health. Other categories also have a significant impact, such as thunderstorm winds, lightning, floods, and excessive heat.
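By default, ggplot2 places a character variable such as EVTYPE on the axis in alphabetical order, so the bars are not ranked by value. One option, shown here as a sketch on made-up numbers (toy_harm is hypothetical), is to turn EVTYPE into a factor ordered by harm with reorder before plotting:

```r
# Made-up totals for three event types
toy_harm <- data.frame(
    EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
    harm   = c(6, 2, 15),
    stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by increasing harm
toy_harm$EVTYPE <- reorder(toy_harm$EVTYPE, toy_harm$harm)
print(levels(toy_harm$EVTYPE))

# With ggplot2 loaded, the ordered factor can then be plotted as before:
# ggplot(toy_harm, aes(EVTYPE, harm, fill = harm)) +
#     geom_bar(stat = "identity") + coord_flip()
```

Because the plot uses coord_flip, the level with the largest total ends up at the top of the chart.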

 

3.2 Across the United States, which types of events have the greatest economic consequences?

 

eco15 <- eco[1:15,]

g <- ggplot(data = eco15, aes(EVTYPE, cost, fill = cost))
g <- g + geom_bar(stat = "identity")
g <- g + xlab("Top 15 events")
g <- g + ylab("economic consequence measurement in $\n (Property damages + Crop damages)")
g <- g + ggtitle("15 events with greatest economic consequences \n from 1950 to November 2011")
g <- g + coord_flip()
g

    As we can see from the plot above, floods are the costliest category of weather events. Hurricanes/typhoons, storm surges and tornadoes are also very costly.

 

4. Environment used

 

This project was conducted with the following tools and systems: