Examining the United States Weather Data and Its Impact on the Country’s Public and Economic Sector

Introduction

This report is written in part of the Course Project for the Reproducible Research course on Coursera by Johns Hopkins University.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

The analysis of the data is conducted to answer the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The selected packages (“dplyr” and “ggplot”) are loaded into the workspace before the data is loaded as well. Some data cleaning is conducted, where significant variables for the analysis are selected, any existing missing values are handled, and some variables are transformed so that the data is less complex.

The Results section is divided into two parts: one for each research question. In each section, weather events are grouped together, and the total number of fatalities and injuries, and property and crop damages are calculated, respectively, in each section. Stacked bar plots are created in each section depicting the values calculated and displayed in tables at each section.

Data Processing

Loading packages and data

The packages included in this analysis are dplyr for data cleaning and ggplot2 for data visualization.

if(!require("dplyr")){
    install.packages("dplyr")
}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if(!require("ggplot2")){
  install.packages("ggplot2")
}
## Loading required package: ggplot2
library(dplyr)
library(ggplot2)

The data is downloaded from this link. The data is a comma-separated-value (csv) file compressed via the bzip2 algorithm to reduce its size. After the data is loaded, the structure and dimension of the data is inspected.

df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)

# Structure of data
str(df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
# Dimension of data
dim(df)
## [1] 902297     37

Extracting selected variables

Only selected variables are used in the analysis, and they are:

  • FATALITIES - approximate number of deaths
  • INJURIES - approximate number of injuries
  • PROPDMG - approximate property damage
  • PROPDMGEXP - property damage value
  • CROPDMG - approximate crop damages
  • CROPDMGEXP - crop damage value
  • EVTYPE - weather event

These variables are selected using the select() function from dplyr.

mydata <- select(df, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, EVTYPE)
head(mydata)
##   FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  EVTYPE
## 1          0       15    25.0          K       0            TORNADO
## 2          0        0     2.5          K       0            TORNADO
## 3          0        2    25.0          K       0            TORNADO
## 4          0        2     2.5          K       0            TORNADO
## 5          0        2     2.5          K       0            TORNADO
## 6          0        6     2.5          K       0            TORNADO

Checking for missing values

To ensure a complete and thorough analysis, any missing values should be handled with care. The following code checks for any missing values.

sapply(mydata, function(x) sum(is.na(x)))
## FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP     EVTYPE 
##          0          0          0          0          0          0          0

Based on the output, no missing values are recorded in the data.

Transforming variables

Some variables contain duplicate values. It is vital to transform them not only to standardize the data, but also to reduce the complexity of the data.

Transforming EVTYPE

The target variable EVTYPE contains the type of weather events. According to the output of the code below, there are 985 unique weather events listed in the data.

length(unique(mydata$EVTYPE))
## [1] 985

Some of them can be categorized into one weather event (i.e. MARINE HIGH WIND AND MARINE STRONG WIND can be grouped together as WIND). The following coding is used to transform the variable EVTYPE by grouping together similar weather events.

#Ccreate a new variable `EVENT` to transform the variable into groups
mydata$EVENT <- "OTHER"

# Group by keyword in `EVTYPE`
mydata$EVENT[grep("HAIL", mydata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
mydata$EVENT[grep("HEAT", mydata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
mydata$EVENT[grep("FLOOD", mydata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
mydata$EVENT[grep("WIND", mydata$EVTYPE, ignore.case = TRUE)] <- "WIND"
mydata$EVENT[grep("STORM", mydata$EVTYPE, ignore.case = TRUE)] <- "STORM"
mydata$EVENT[grep("SNOW", mydata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
mydata$EVENT[grep("TORNADO", mydata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
mydata$EVENT[grep("WINTER", mydata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"

# Check the variable
table(mydata$EVENT)
## 
##   FLOOD    HAIL    HEAT   OTHER    RAIN    SNOW   STORM TORNADO    WIND  WINTER 
##   82686  289270    2648   48970   12241   17660  113156   60700  255362   19604

Transforming PROPDMGEXP AND CROPDMGEXP

The variables PROPDMGEXP and CROPDMGEXP contains the unit property and crop damages, respectively, in dollars.

table(mydata$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(mydata$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

K and k represent values in thousand dollars, M and m represent million dollars, and B represent billion dollars. Similar units are grouped together, whereas NAs are considered as normal dollar values.

# Convert to character type
mydata$PROPDMGEXP <- as.character(mydata$PROPDMGEXP)

# NA's considered as dollars
mydata$PROPDMGEXP[is.na(mydata$PROPDMGEXP)] <- 0 

# Everything exept K,M,B is dollar
mydata$PROPDMGEXP[!grepl("K|M|B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- 0 

# Change values in the `PROPDMGEXP` variable
mydata$PROPDMGEXP[grep("K", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$PROPDMGEXP[grep("M", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$PROPDMGEXP[grep("B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$PROPDMGEXP <- as.numeric(as.character(mydata$PROPDMGEXP))

# Create new variable where the actual property damage value is calculated
mydata$property.damage <- mydata$PROPDMG * 10^mydata$PROPDMGEXP

# Look at first ten (sorted) property damage values
sort(table(mydata$property.damage), decreasing = TRUE)[1:10]
## 
##      0   5000  10000   1000   2000  25000  50000   3000  20000  15000 
## 663123  31731  21787  17544  17186  17104  13596  10364   9179   8617
# Do the same with `CROPDMGEXP`
mydata$CROPDMGEXP <- as.character(mydata$CROPDMGEXP)
mydata$CROPDMGEXP[is.na(mydata$CROPDMGEXP)] <- 0
mydata$CROPDMGEXP[!grepl("K|M|B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- 0
mydata$CROPDMGEXP[grep("K", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$CROPDMGEXP[grep("M", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$CROPDMGEXP[grep("B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$CROPDMGEXP <- as.numeric(as.character(mydata$CROPDMGEXP))
mydata$crop.damage <- mydata$CROPDMG * 10^mydata$CROPDMGEXP
sort(table(mydata$crop.damage), decreasing = TRUE)[1:10]
## 
##      0   5000  10000  50000  1e+05   1000   2000  25000  20000  5e+05 
## 880198   4097   2349   1984   1233    956    951    830    758    721

Results

This section shows the analysis conducted and the results to answer the research questions.

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Total number of fatalities and injuries based on weather event

fatal.injury <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(FATALITIES + INJURIES, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total/sum(Total) * 100, 3), 
         Type = rep("Fatalities and Injuries"))
fatal.injury %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT   Total Percent Type                   
##    <chr>   <dbl>   <dbl> <chr>                  
##  1 TORNADO 97068  62.4   Fatalities and Injuries
##  2 OTHER   14850   9.54  Fatalities and Injuries
##  3 HEAT    12362   7.94  Fatalities and Injuries
##  4 WIND    10210   6.56  Fatalities and Injuries
##  5 FLOOD   10126   6.50  Fatalities and Injuries
##  6 STORM    5755   3.70  Fatalities and Injuries
##  7 WINTER   2169   1.39  Fatalities and Injuries
##  8 HAIL     1386   0.89  Fatalities and Injuries
##  9 SNOW     1328   0.853 Fatalities and Injuries
## 10 RAIN      419   0.269 Fatalities and Injuries

The table shows the total number of fatalities and injuries suffered by the population based on weather events. Based on the table, it is shown that fatalities and injuries caused by tornado-related events are the highest, making up 62.3% of the data, whereas rain-related events caused the least amount of fatalities and injuries, making up less than 0.3% of the data. This is also shown in the plot below. The values at each bar are the total counts on top and the percentage at the bottom.

Total number of fatalities based on weather event

fatalities <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(FATALITIES, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total/sum(Total) * 100, 3), 
         Type = rep("Fatalities"))
fatalities %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT   Total Percent Type      
##    <chr>   <dbl>   <dbl> <chr>     
##  1 TORNADO  5661  37.4   Fatalities
##  2 HEAT     3138  20.7   Fatalities
##  3 OTHER    2626  17.3   Fatalities
##  4 FLOOD    1524  10.1   Fatalities
##  5 WIND     1209   7.98  Fatalities
##  6 STORM     416   2.75  Fatalities
##  7 WINTER    278   1.84  Fatalities
##  8 SNOW      164   1.08  Fatalities
##  9 RAIN      114   0.753 Fatalities
## 10 HAIL       15   0.099 Fatalities

The second table shows the number and percentage of fatalaties based on each weather events. Similar to the earlier table, fatalities caused by tornado-related events are the highest, making up 37.4% percent of the data. But this time, hail-related events caused the least amount of fatality, with less that 0.1% of the fatalities are caused by it.

Total number of injuries based on weather event

injuries <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(INJURIES, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total/sum(Total) * 100, 3), 
         Type = rep("Injuries"))
injuries %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT   Total Percent Type    
##    <chr>   <dbl>   <dbl> <chr>   
##  1 TORNADO 91407  65.0   Injuries
##  2 OTHER   12224   8.70  Injuries
##  3 HEAT     9224   6.56  Injuries
##  4 WIND     9001   6.40  Injuries
##  5 FLOOD    8602   6.12  Injuries
##  6 STORM    5339   3.80  Injuries
##  7 WINTER   1891   1.35  Injuries
##  8 HAIL     1371   0.976 Injuries
##  9 SNOW     1164   0.828 Injuries
## 10 RAIN      305   0.217 Injuries

This third table, on the other hand, details the number and percentage of injuries caused by weather events. Again, tornado-related events top the list, and rain-related events caused the least amount of injury.

All of the results are also shown in the bar plots below. As shown in the plot, the number of fatalities is not as notable compared to injuries.

combined.plot <- rbind(fatal.injury, fatalities, injuries)

ggplot(combined.plot, aes(x = EVENT, y = Total, fill = Type)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Weather Events", title = "Total Number of Fatalities and Injuries", subtitle = "Based on Weather Events")

2. Across the United States, which types of events have the greatest economic consequences?

Total number of property and crop damages based on weather events

prop.crop <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(property.damage + crop.damage, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total / sum(Total) * 100, 3), 
         Type = rep("Property and Crop Damages"))
prop.crop %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT           Total Percent Type                     
##    <chr>           <dbl>   <dbl> <chr>                    
##  1 FLOOD   179769100029.  37.7   Property and Crop Damages
##  2 OTHER   120835593207.  25.4   Property and Crop Damages
##  3 STORM    72678890281.  15.3   Property and Crop Damages
##  4 TORNADO  59010559549.  12.4   Property and Crop Damages
##  5 HAIL     18779880521.   3.94  Property and Crop Damages
##  6 WIND     12250885768.   2.57  Property and Crop Damages
##  7 WINTER    6824739251    1.43  Property and Crop Damages
##  8 RAIN      4189545992    0.879 Property and Crop Damages
##  9 SNOW      1158852852.   0.243 Property and Crop Damages
## 10 HEAT       924795030    0.194 Property and Crop Damages

The table above shows the total amount of property and crop damages caused by each weather event. Miscellaneous weather events dominated the list, with nearly 50% of the damages are caused by them. On the other hand, winter-related events caused the least amount of damage, making up less than 0.1% of the data.

Total number of property damages based on weather events

property <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(property.damage, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total / sum(Total) * 100, 3), 
         Type = rep("Property Damage"))
property %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT           Total Percent Type           
##    <chr>           <dbl>   <dbl> <chr>          
##  1 FLOOD   167502193929.  39.2   Property Damage
##  2 OTHER    97246712337.  22.8   Property Damage
##  3 STORM    66304415393.  15.5   Property Damage
##  4 TORNADO  58593098029.  13.7   Property Damage
##  5 HAIL     15733043048.   3.68  Property Damage
##  6 WIND     10847166618.   2.54  Property Damage
##  7 WINTER    6777295251    1.59  Property Damage
##  8 RAIN      3270230192    0.765 Property Damage
##  9 SNOW      1024169752.   0.24  Property Damage
## 10 HEAT        20325750    0.005 Property Damage

This table shows the amount of property damages caused by each weather event. Based on the table above, nearly 30% of the property damages suffered in the US is caused by tornado-related events, while heat-related events make up 0.03% of the total property damages suffered by the country.

Total number of crop damages based on weather events

crop <- mydata %>% 
  group_by(EVENT) %>% 
  summarize(Total = sum(crop.damage, na.rm = TRUE)) %>% 
  mutate(Percent = round(Total / sum(Total) * 100, 3), 
         Type = rep("Crop Damage"))
crop %>% arrange(desc(Total))
## # A tibble: 10 x 4
##    EVENT         Total Percent Type       
##    <chr>         <dbl>   <dbl> <chr>      
##  1 OTHER   23588880870  48.0   Crop Damage
##  2 FLOOD   12266906100  25.0   Crop Damage
##  3 STORM    6374474888  13.0   Crop Damage
##  4 HAIL     3046837473   6.20  Crop Damage
##  5 WIND     1403719150   2.86  Crop Damage
##  6 RAIN      919315800   1.87  Crop Damage
##  7 HEAT      904469280   1.84  Crop Damage
##  8 TORNADO   417461520   0.85  Crop Damage
##  9 SNOW      134683100   0.274 Crop Damage
## 10 WINTER     47444000   0.097 Crop Damage

Based on the third table above, which shows the amount of crop damages suffered by the US, miscellaneous events make up nearly half of the total damages. Winter-related events caused the least amount of crop damage (less than 0.1%). We can see that this table is similar to the first table shown in this section.

The bar plot below shows the amount and distribution of the property and crop damages of each weather event. As shown in the plot, crop damages are not as prominent as property damages.

combined.plot2 <- rbind(prop.crop, property, crop)

ggplot(combined.plot2, aes(x = EVENT, y = Total/1000000, fill = Type)) +
  geom_bar(stat = "identity") + 
  coord_flip() +
  labs(x = "Weather Events", y = "Total", 
       title = "Total Amount of Property and Crop Damages", 
       subtitle = "Based on Weather Events")