Synopsis

The dataset contains information regarding events caused by storms. Two questions were addressed to investigate the most detrimental event type in regards to population health and economic consequence across USA. After data cleaning and analysis, the events were found to be lightnings and thunderstorm winds respectively. This is due to the calculation of the amount of harm done in respect to the duration of the events.

Data Processing

Loading the dataset

The given dataset has the extension .csv.bz2 so this compressed data can be loaded into R with the data.table and R.utils packages, using the function fread().

library(data.table)
library(R.utils)
data <- fread("repdata_data_StormData.csv.bz2", na.strings = c(NA_character_, ""))

Now, dim() is used to see the number of observations and variables while head() is used to see the overview of the contents.

dim(data)
## [1] 902297     37
head(data, 5)
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##      <num>             <char>   <char>    <char>  <num>     <char> <char>
## 1:       1  4/18/1950 0:00:00     0130       CST     97     MOBILE     AL
## 2:       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN     AL
## 3:       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE     AL
## 4:       1   6/8/1951 0:00:00     0900       CST     89    MADISON     AL
## 5:       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN     AL
##     EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
##     <char>     <num>  <char>     <char>   <char>   <char>      <num>     <lgcl>
## 1: TORNADO         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 2: TORNADO         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 3: TORNADO         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 4: TORNADO         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 5: TORNADO         0    <NA>       <NA>     <NA>     <NA>          0         NA
##    END_RANGE END_AZI END_LOCATI LENGTH WIDTH     F   MAG FATALITIES INJURIES
##        <num>  <char>     <char>  <num> <num> <int> <num>      <num>    <num>
## 1:         0    <NA>       <NA>   14.0   100     3     0          0       15
## 2:         0    <NA>       <NA>    2.0   150     2     0          0        0
## 3:         0    <NA>       <NA>    0.1   123     2     0          0        2
## 4:         0    <NA>       <NA>    0.0   100     2     0          0        2
## 5:         0    <NA>       <NA>    0.0   150     2     0          0        2
##    PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP    WFO STATEOFFIC ZONENAMES LATITUDE
##      <num>     <char>   <num>     <char> <char>     <char>    <char>    <num>
## 1:    25.0          K       0       <NA>   <NA>       <NA>      <NA>     3040
## 2:     2.5          K       0       <NA>   <NA>       <NA>      <NA>     3042
## 3:    25.0          K       0       <NA>   <NA>       <NA>      <NA>     3340
## 4:     2.5          K       0       <NA>   <NA>       <NA>      <NA>     3458
## 5:     2.5          K       0       <NA>   <NA>       <NA>      <NA>     3412
##    LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
##        <num>      <num>      <num>  <char>  <num>
## 1:      8812       3051       8806    <NA>      1
## 2:      8755          0          0    <NA>      2
## 3:      8742          0          0    <NA>      3
## 4:      8626          0          0    <NA>      4
## 5:      8642          0          0    <NA>      5

As seen from the overview of the data above, there are lots of observations (902,297 observations in total) with 37 variables.

We want to answer 2 questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

To help answer these questions, the data will be transformed to simplify the analysis process. First, we have to consider what type of information is needed to answer each question. Use names() to see the variables in the dataset.

names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Question 1

Starting from question 1, let us narrow down the variables that are needed to answer this question.

  • The types of events are indicated by the EVTYPE variable.
  • Population health can be defined by the total number of FATALITIES and INJURIES.
  • To know the extent of harm of each event, we need to consider the duration of the events which can be calculated from the date (BGN_DATE and END_DATE) and time (BGN_TIME and END_TIME).

Question 2

For question 2, we also need to narrow down the variables to answer this question.

  • The types of events are indicated by the EVTYPE variable.
  • Economic consequences can be defined by the population health from question 1 (FATALITIES and INJURIES) added together with the total number of PROPDMG and CROPDMG.
  • To know the extent of the harm, we also need to know the duration of the events, same as question 1.

Data transformations

After knowing what variables to choose, we can narrow down the dataset to contain those variables and exclude the unnecessary variables.

data_narrowed <- data[, c(2, 3, 8, 12, 13, 23, 24, 25, 27)]
names(data_narrowed)
## [1] "BGN_DATE"   "BGN_TIME"   "EVTYPE"     "END_DATE"   "END_TIME"  
## [6] "FATALITIES" "INJURIES"   "PROPDMG"    "CROPDMG"

Before transforming the data and making summaries, incomplete observations must be removed to streamline the analysis process. In this case, an incomplete observation is defined as an observation (row) that does not contain data (NA value) on the variables: BGN_DATE, END_DATE, BGN_TIME or END_TIME.

The tidyr package contains the function drop_na() which is handy to remove observations containing NA values in the variables above. Entries containing “?” are also present in the END_TIME variable, so rows with this should be removed.

library(tidyr)
data_cleaned <- data_narrowed %>% 
      drop_na(c("BGN_DATE", "END_DATE", "BGN_TIME", "END_TIME"))
data_cleaned <- data_cleaned[!grepl("\\?", data_cleaned$END_TIME),]
dim(data_cleaned)
## [1] 658538      9

Now that all observations are filled (no cells with NA), some transformations to the DATE and TIME variables are necessary to be able to compute the duration of the events.

1. Correct the formatting of the DATE and TIME variables

In END_TIME, there are entries that only describe the phase of the day, so some assumptions will be made for these entries based on The Britannica Dictionary. | Entry | Assumption | ——————————– | All Day | 23:59 | | Morning | 05:00 | | Late Afternoon | 16:00 | | Evening | 17:00 | | Overnight | 04:00 |

# Remove "0:00:00" from BGN_DATE and END_DATE
data_cleaned$BGN_DATE <- gsub("0:00:00", "", data_cleaned$BGN_DATE)
data_cleaned$END_DATE <- gsub("0:00:00", "", data_cleaned$END_DATE)

# Substitute phase of day with time assumptions
data_cleaned$END_TIME <- sub("All Day", "2359", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Morning", "0500", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Late Afterno", "2359", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Evening", "1700", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Overnight", "0400", data_cleaned$END_TIME)

# Convert BGN_TIME and END_TIME from a 4-digit format to the correct hour:minute format up to row 5019
data_cleaned$BGN_TIME[1:5019] <- sub("(.{2})", "\\1:", data_cleaned$BGN_TIME[1:5019])
data_cleaned$END_TIME[1:5019] <- gsub("([0-9]+).*$", "\\1", data_cleaned$END_TIME[1:5019]) # remove timezone
data_cleaned$END_TIME[1:5019] <- sub("(.{2})", "\\1:", data_cleaned$END_TIME[1:5019])

2. Make new variables - BEGIN and END

Combine BGN_DATE and BGN_TIME to a new variable BEGIN with the class type DATE. Combine the END_DATE and END_TIME to a new variable END with the class type DATE. The lubridate package contains the function parse_date_time() to easily correct the different formatting in these variables.

# Create the new BEGIN and END variables
data_cleaned$BEGIN <- with(data_cleaned, paste(BGN_DATE, BGN_TIME))
data_cleaned$END <- with(data_cleaned, paste(END_DATE, END_TIME))

# Convert to DATE class
library(lubridate)
data_cleaned$BEGIN <- parse_date_time(data_cleaned$BEGIN, 
                                      orders = c("%m/%d/%y %H:%M",
                                                 "%m/%d/%y %H:%M:%s %p"))
## Warning: 1 failed to parse.
data_cleaned$END <- parse_date_time(data_cleaned$END, 
                                      orders = c("%m/%d/%y %H:%M",
                                                 "%m/%d/%y %H:%M:%s %p"))
## Warning: 32 failed to parse.

As seen in the warning messages, there are entries that have become NA after parsing the variables. This is most likely due to invalid time entries such as AM, PM, and others. Because assumptions are difficult to be determined to substitute these entries, they will be removed with drop.na().

data_cleaned <- data_cleaned %>% 
      drop_na(c("BEGIN", "END"))

3. Calculate the new DURATION variable.

Now that everything is neat and tidy, we can add a new variable, DURATION (seconds), by subtracting END with BEGIN.

data_cleaned$DURATION <- data_cleaned$END - data_cleaned$BEGIN
sum(data_cleaned$DURATION <= 0) # check for invalid duration
## [1] 383828

Unfortunately, there are some duration with invalid entries such as having duration of 0 or negative values. The entries containing 0 duration will be retained by substituting it with 1. However, negative values will be removed.

data_final <- data_cleaned # to retain the cleaned data

# substitute 0 with 1 and remove all rows with negative values
data_final$DURATION <- replace(data_final$DURATION, data_final$DURATION == 0, 1)
data_final <- data_final[data_final$DURATION > 0, ]

data_final$DURATION <- as.numeric(data_final$DURATION)
sum(data_final$DURATION <= 0) # check for invalid duration
## [1] 0

Now, all entries are valid and the data is finally ready to be analysed.

Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

A new variable will be computed to indicate the extent of harm each EVTYPE does - HARM_INDEX. This variable is calculated by the sum of INJURIES and FATALITIES divided by the DURATION.

# Make new HARM_INDEX variable
library(dplyr)
pophealth <- data_final %>%
      mutate(HARM_INDEX = ((FATALITIES + INJURIES)/DURATION))

# Group by EVTYPE
pophealth <- pophealth %>% 
      group_by(EVTYPE) %>% 
      summarise(HARM_INDEX = sum(HARM_INDEX))
pophealth$EVTYPE <- tolower(pophealth$EVTYPE)

Since there are many harmless EVTYPE in the dataset, it would be redundant to include these into the analysis. So, EVTYPE with HARM_INDEX > 0 will be plotted to see the trends. Here, the ggplot2 package will be used to make the plot.

pophealth <- pophealth[pophealth$HARM_INDEX > 0, ] # select EVTYPE with HARM_INDEX > 0

library(ggplot2)
ggplot(pophealth, aes(x = EVTYPE, y = HARM_INDEX)) + 
      geom_bar(stat = "identity", width = 1) + 
      coord_flip()

As seen from the bar plot above, the most harmful event type is the occurrence of lightning. It is important to know that the HARM_INDEX is calculated by the duration, so considering the instantaneous occurrence of lightnings, it is the most dangerous event type to population health by duration.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Similarly to the previous question, the HARM_INDEX will be calculated which is the sum of INJURIES, FATALITIES, CROPDMG and PROPDMG divided by the DURATION.

# Make new HARM_INDEX variable
econcons <- data_final %>%
      mutate(HARM_INDEX = ((FATALITIES + INJURIES + PROPDMG + CROPDMG)/DURATION))

# Group by EVTYPE
econcons <- econcons %>% 
      group_by(EVTYPE) %>% 
      summarise(HARM_INDEX = sum(HARM_INDEX))
econcons$EVTYPE <- tolower(econcons$EVTYPE)

Same as the previous question, there are many event types with 0 HARM_INDEX, so these will be excluded as well.

econcons <- econcons[econcons$HARM_INDEX > 0, ] # select EVTYPE with HARM_INDEX > 0

library(ggplot2)
ggplot(econcons, aes(x = EVTYPE, y = HARM_INDEX)) + 
      geom_bar(stat = "identity", width = 1) + 
      coord_flip()

As seen in the bar plot above, the most detrimental event type for the economy is thunderstorm winds. This may be caused by the duration and frequency of the event, causing the buildup of variables that contribute to the HARM_INDEX.