The dataset contains information regarding events caused by storms. Two questions were addressed to investigate the most detrimental event type in regards to population health and economic consequence across USA. After data cleaning and analysis, the events were found to be lightnings and thunderstorm winds respectively. This is due to the calculation of the amount of harm done in respect to the duration of the events.
The given dataset has the extension .csv.bz2 so this compressed data can be loaded into R with the data.table and R.utils packages, using the function fread().
library(data.table)
library(R.utils)
data <- fread("repdata_data_StormData.csv.bz2", na.strings = c(NA_character_, ""))
Now, dim() is used to see the number of observations and variables while head() is used to see the overview of the contents.
dim(data)
## [1] 902297 37
head(data, 5)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <num> <char> <char> <char> <num> <char> <char>
## 1: 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## <char> <num> <char> <char> <char> <char> <num> <lgcl>
## 1: TORNADO 0 <NA> <NA> <NA> <NA> 0 NA
## 2: TORNADO 0 <NA> <NA> <NA> <NA> 0 NA
## 3: TORNADO 0 <NA> <NA> <NA> <NA> 0 NA
## 4: TORNADO 0 <NA> <NA> <NA> <NA> 0 NA
## 5: TORNADO 0 <NA> <NA> <NA> <NA> 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## <num> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0 <NA> <NA> 14.0 100 3 0 0 15
## 2: 0 <NA> <NA> 2.0 150 2 0 0 0
## 3: 0 <NA> <NA> 0.1 123 2 0 0 2
## 4: 0 <NA> <NA> 0.0 100 2 0 0 2
## 5: 0 <NA> <NA> 0.0 150 2 0 0 2
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## <num> <char> <num> <char> <char> <char> <char> <num>
## 1: 25.0 K 0 <NA> <NA> <NA> <NA> 3040
## 2: 2.5 K 0 <NA> <NA> <NA> <NA> 3042
## 3: 25.0 K 0 <NA> <NA> <NA> <NA> 3340
## 4: 2.5 K 0 <NA> <NA> <NA> <NA> 3458
## 5: 2.5 K 0 <NA> <NA> <NA> <NA> 3412
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## <num> <num> <num> <char> <num>
## 1: 8812 3051 8806 <NA> 1
## 2: 8755 0 0 <NA> 2
## 3: 8742 0 0 <NA> 3
## 4: 8626 0 0 <NA> 4
## 5: 8642 0 0 <NA> 5
As seen from the overview of the data above, there are lots of observations (902,297 observations in total) with 37 variables.
We want to answer 2 questions:
To help answer these questions, the data will be transformed to simplify the analysis process. First, we have to consider what type of information is needed to answer each question. Use names() to see the variables in the dataset.
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Starting from question 1, let us narrow down the variables that are needed to answer this question.
For question 2, we also need to narrow down the variables to answer this question.
After knowing what variables to choose, we can narrow down the dataset to contain those variables and exclude the unnecessary variables.
data_narrowed <- data[, c(2, 3, 8, 12, 13, 23, 24, 25, 27)]
names(data_narrowed)
## [1] "BGN_DATE" "BGN_TIME" "EVTYPE" "END_DATE" "END_TIME"
## [6] "FATALITIES" "INJURIES" "PROPDMG" "CROPDMG"
Before transforming the data and making summaries, incomplete observations must be removed to streamline the analysis process. In this case, an incomplete observation is defined as an observation (row) that does not contain data (NA value) on the variables: BGN_DATE, END_DATE, BGN_TIME or END_TIME.
The tidyr package contains the function drop_na() which is handy to remove observations containing NA values in the variables above. Entries containing “?” are also present in the END_TIME variable, so rows with this should be removed.
library(tidyr)
data_cleaned <- data_narrowed %>%
drop_na(c("BGN_DATE", "END_DATE", "BGN_TIME", "END_TIME"))
data_cleaned <- data_cleaned[!grepl("\\?", data_cleaned$END_TIME),]
dim(data_cleaned)
## [1] 658538 9
Now that all observations are filled (no cells with NA), some transformations to the DATE and TIME variables are necessary to be able to compute the duration of the events.
In END_TIME, there are entries that only describe the phase of the day, so some assumptions will be made for these entries based on The Britannica Dictionary. | Entry | Assumption | ——————————– | All Day | 23:59 | | Morning | 05:00 | | Late Afternoon | 16:00 | | Evening | 17:00 | | Overnight | 04:00 |
# Remove "0:00:00" from BGN_DATE and END_DATE
data_cleaned$BGN_DATE <- gsub("0:00:00", "", data_cleaned$BGN_DATE)
data_cleaned$END_DATE <- gsub("0:00:00", "", data_cleaned$END_DATE)
# Substitute phase of day with time assumptions
data_cleaned$END_TIME <- sub("All Day", "2359", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Morning", "0500", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Late Afterno", "2359", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Evening", "1700", data_cleaned$END_TIME)
data_cleaned$END_TIME <- sub("Overnight", "0400", data_cleaned$END_TIME)
# Convert BGN_TIME and END_TIME from a 4-digit format to the correct hour:minute format up to row 5019
data_cleaned$BGN_TIME[1:5019] <- sub("(.{2})", "\\1:", data_cleaned$BGN_TIME[1:5019])
data_cleaned$END_TIME[1:5019] <- gsub("([0-9]+).*$", "\\1", data_cleaned$END_TIME[1:5019]) # remove timezone
data_cleaned$END_TIME[1:5019] <- sub("(.{2})", "\\1:", data_cleaned$END_TIME[1:5019])
Combine BGN_DATE and BGN_TIME to a new variable BEGIN with the class type DATE. Combine the END_DATE and END_TIME to a new variable END with the class type DATE. The lubridate package contains the function parse_date_time() to easily correct the different formatting in these variables.
# Create the new BEGIN and END variables
data_cleaned$BEGIN <- with(data_cleaned, paste(BGN_DATE, BGN_TIME))
data_cleaned$END <- with(data_cleaned, paste(END_DATE, END_TIME))
# Convert to DATE class
library(lubridate)
data_cleaned$BEGIN <- parse_date_time(data_cleaned$BEGIN,
orders = c("%m/%d/%y %H:%M",
"%m/%d/%y %H:%M:%s %p"))
## Warning: 1 failed to parse.
data_cleaned$END <- parse_date_time(data_cleaned$END,
orders = c("%m/%d/%y %H:%M",
"%m/%d/%y %H:%M:%s %p"))
## Warning: 32 failed to parse.
As seen in the warning messages, there are entries that have become NA after parsing the variables. This is most likely due to invalid time entries such as AM, PM, and others. Because assumptions are difficult to be determined to substitute these entries, they will be removed with drop.na().
data_cleaned <- data_cleaned %>%
drop_na(c("BEGIN", "END"))
Now that everything is neat and tidy, we can add a new variable, DURATION (seconds), by subtracting END with BEGIN.
data_cleaned$DURATION <- data_cleaned$END - data_cleaned$BEGIN
sum(data_cleaned$DURATION <= 0) # check for invalid duration
## [1] 383828
Unfortunately, there are some duration with invalid entries such as having duration of 0 or negative values. The entries containing 0 duration will be retained by substituting it with 1. However, negative values will be removed.
data_final <- data_cleaned # to retain the cleaned data
# substitute 0 with 1 and remove all rows with negative values
data_final$DURATION <- replace(data_final$DURATION, data_final$DURATION == 0, 1)
data_final <- data_final[data_final$DURATION > 0, ]
data_final$DURATION <- as.numeric(data_final$DURATION)
sum(data_final$DURATION <= 0) # check for invalid duration
## [1] 0
Now, all entries are valid and the data is finally ready to be analysed.
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
A new variable will be computed to indicate the extent of harm each EVTYPE does - HARM_INDEX. This variable is calculated by the sum of INJURIES and FATALITIES divided by the DURATION.
# Make new HARM_INDEX variable
library(dplyr)
pophealth <- data_final %>%
mutate(HARM_INDEX = ((FATALITIES + INJURIES)/DURATION))
# Group by EVTYPE
pophealth <- pophealth %>%
group_by(EVTYPE) %>%
summarise(HARM_INDEX = sum(HARM_INDEX))
pophealth$EVTYPE <- tolower(pophealth$EVTYPE)
Since there are many harmless EVTYPE in the dataset, it would be redundant to include these into the analysis. So, EVTYPE with HARM_INDEX > 0 will be plotted to see the trends. Here, the ggplot2 package will be used to make the plot.
pophealth <- pophealth[pophealth$HARM_INDEX > 0, ] # select EVTYPE with HARM_INDEX > 0
library(ggplot2)
ggplot(pophealth, aes(x = EVTYPE, y = HARM_INDEX)) +
geom_bar(stat = "identity", width = 1) +
coord_flip()
As seen from the bar plot above, the most harmful event type is the
occurrence of lightning. It is important to know that
the HARM_INDEX is calculated by the duration, so considering the
instantaneous occurrence of lightnings, it is the most dangerous event
type to population health by duration.
Across the United States, which types of events have the greatest economic consequences?
Similarly to the previous question, the HARM_INDEX will be calculated which is the sum of INJURIES, FATALITIES, CROPDMG and PROPDMG divided by the DURATION.
# Make new HARM_INDEX variable
econcons <- data_final %>%
mutate(HARM_INDEX = ((FATALITIES + INJURIES + PROPDMG + CROPDMG)/DURATION))
# Group by EVTYPE
econcons <- econcons %>%
group_by(EVTYPE) %>%
summarise(HARM_INDEX = sum(HARM_INDEX))
econcons$EVTYPE <- tolower(econcons$EVTYPE)
Same as the previous question, there are many event types with 0 HARM_INDEX, so these will be excluded as well.
econcons <- econcons[econcons$HARM_INDEX > 0, ] # select EVTYPE with HARM_INDEX > 0
library(ggplot2)
ggplot(econcons, aes(x = EVTYPE, y = HARM_INDEX)) +
geom_bar(stat = "identity", width = 1) +
coord_flip()
As seen in the bar plot above, the most detrimental event type for the
economy is thunderstorm winds. This may be caused by
the duration and frequency of the event, causing the buildup of
variables that contribute to the HARM_INDEX.