
RepData2
2024-01-15

U.S. National Oceanic and Atmospheric Administration’s (NOAA) Storm Analysis

Synopsis

This project involves exploring the storm database provided by the U.S. National Oceanic and Atmospheric Administration (NOAA). The database chronicles major storms and weather events in the United States, recording when and where they occurred and the fatalities, injuries, and property damage they caused. The main objective of our study is to understand the impact of different types of weather events, such as floods, typhoons, tornadoes, hail, and hurricanes, on public health and the economy. By analyzing this data, we aim to identify which weather events pose the greatest risk to the health of the US population and which have the most significant economic repercussions.

In our analysis, we have concluded that tornadoes are the most harmful events in terms of health impacts on the U.S. population, as they result in the highest number of fatalities and injuries. On the economic front, floods emerge as the most damaging, causing the greatest property and crop damage. Our results highlight the varied impacts of different weather events, underscoring the importance of tailored strategies for mitigating both health risks and economic losses.

Data Processing

The packages used in this analysis are dplyr for data cleaning and ggplot2 for data visualization.

library(dplyr)
library(ggplot2)

The data is downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2 as a bzip2-compressed comma-separated-value (CSV) file, which is then read into R. After the data is loaded, its structure and dimensions are inspected.

df <- read.csv("repdata_data_stormData.csv")
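read.csv() can read a bzip2-compressed file directly, so the archive does not have to be unpacked first. The sketch below is illustrative only: fetch_storm_data, its url argument, and the file name StormData.csv.bz2 are hypothetical and not part of the original report, and the function is defined but not called. A made-up two-row file stands in for the real data to show the round trip.

```r
# Hypothetical helper (not from the original report): download the archive
# once, then read it. `url` and `dest` are supplied by the caller.
fetch_storm_data <- function(url, dest = "StormData.csv.bz2") {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# Round trip on a made-up two-row file: write a bzip2-compressed CSV,
# then read it back with a plain read.csv() call.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0)),
          con, row.names = FALSE)
close(con)

demo <- read.csv(tmp, stringsAsFactors = FALSE)
print(demo)
```

Keeping the download behind a file.exists() check avoids fetching the large archive again each time the report is knitted.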

Structure of data

str(df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
# Dimension of data
dim(df)
## [1] 902297     37

Extracting selected variables

We only want to use the following variables for our analysis:

FATALITIES - approximate number of deaths
INJURIES - approximate number of injuries
PROPDMG - approximate property damage
PROPDMGEXP - property damage exponent (to interpret PROPDMG)
CROPDMG - approximate crop damage
CROPDMGEXP - crop damage exponent (to interpret CROPDMG)
EVTYPE - weather event type

The variables are selected using the select() function in dplyr.

mydata <- select(df, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, EVTYPE)
head(mydata)
##   FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  EVTYPE
## 1          0       15    25.0          K       0            TORNADO
## 2          0        0     2.5          K       0            TORNADO
## 3          0        2    25.0          K       0            TORNADO
## 4          0        2     2.5          K       0            TORNADO
## 5          0        2     2.5          K       0            TORNADO
## 6          0        6     2.5          K       0            TORNADO

Checking for missing values

We use the sapply function to check for any missing values, so that the analysis is based on complete data.

sapply(mydata, function(x) sum(is.na(x)))
## FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP     EVTYPE
##          0          0          0          0          0          0          0

Based on the output, no missing values are recorded in the data.
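Note that is.na() only counts values R has parsed as NA; empty strings in character columns are not flagged. A minimal sketch that counts both, using a made-up data frame toy rather than the storm data (the object names are illustrative assumptions):

```r
# Made-up example data; not taken from the storm database.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 0, NA),
  PROPDMGEXP = c("K", "", "M", ""),
  stringsAsFactors = FALSE
)

# Count NA values per column (the same check as in the report).
na_count <- sapply(toy, function(x) sum(is.na(x)))

# Count empty strings per column; only meaningful for character columns.
blank_count <- sapply(toy, function(x) {
  if (is.character(x)) sum(!is.na(x) & x == "") else 0L
})

print(rbind(na_count, blank_count))
```

In the storm data, the table() output for PROPDMGEXP and CROPDMGEXP shows a large unnamed category, which is what a blank-string count of this kind would surface.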

Transforming variables

Some variables contain many near-duplicate values, so we transform them to standardize the data and reduce complexity. The target variable EVTYPE contains the type of weather event. According to the output of the code below, there are 985 unique event types in the data.

length(unique(mydata$EVTYPE))
## [1] 985

Some of these event types can be grouped into a single category (e.g. MARINE HIGH WIND and MARINE STRONG WIND can both be grouped as WIND). The following code transforms EVTYPE by grouping similar weather events.

Create a new variable EVENT to transform the variable into groups

mydata$EVENT <- "OTHER"

Group by keyword in EVTYPE

mydata$EVENT[grep("HAIL", mydata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
mydata$EVENT[grep("HEAT", mydata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
mydata$EVENT[grep("FLOOD", mydata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
mydata$EVENT[grep("WIND", mydata$EVTYPE, ignore.case = TRUE)] <- "WIND"
mydata$EVENT[grep("STORM", mydata$EVTYPE, ignore.case = TRUE)] <- "STORM"
mydata$EVENT[grep("SNOW", mydata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
mydata$EVENT[grep("TORNADO", mydata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
mydata$EVENT[grep("WINTER", mydata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
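Because each assignment overwrites earlier ones, an event type that matches several keywords ends up in the group of the last matching keyword. A minimal sketch of that behavior, using a made-up sample of labels (evtype, keywords, and event are illustrative names, not objects from the report):

```r
# Made-up sample of event labels (illustrative only).
evtype <- c("TSTM WIND", "THUNDERSTORM WIND", "WINTER STORM",
            "FLASH FLOOD", "HEAVY RAIN/WIND", "DROUGHT")

# Same keyword order as the assignments above.
keywords <- c("HAIL", "HEAT", "FLOOD", "WIND", "STORM",
              "SNOW", "TORNADO", "WINTER", "RAIN")

# Start with OTHER, then let each keyword overwrite earlier matches.
event <- rep("OTHER", length(evtype))
for (kw in keywords) {
  event[grepl(kw, evtype, ignore.case = TRUE)] <- kw
}

print(data.frame(evtype, event))
```

With this ordering, a label containing both WIND and STORM is counted under STORM, so the order of the keywords is itself an analysis choice that affects the group totals.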

Check the variable

table(mydata$EVENT)
##
##   FLOOD    HAIL    HEAT   OTHER    RAIN    SNOW   STORM TORNADO    WIND  WINTER
##   82686  289270    2648   48970   12241   17660  113156   60700  255362   19604

# The variables PROPDMGEXP and CROPDMGEXP contain the units of property and crop damage, respectively, in dollars.

table(mydata$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
table(mydata$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994

K and k represent thousands of dollars, M and m represent millions of dollars, and B represents billions of dollars. Similar units are grouped together, whereas NAs and all other codes are treated as plain dollar values.

Convert to character type

mydata$PROPDMGEXP <- as.character(mydata$PROPDMGEXP)

NA’s considered as dollars

mydata$PROPDMGEXP[is.na(mydata$PROPDMGEXP)] <- 0

Everything except K,M,B is dollar

mydata$PROPDMGEXP[!grepl("K|M|B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- 0

Change values in the PROPDMGEXP variable

mydata$PROPDMGEXP[grep("K", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$PROPDMGEXP[grep("M", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$PROPDMGEXP[grep("B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$PROPDMGEXP <- as.numeric(as.character(mydata$PROPDMGEXP))

Create new variable where the actual property damage value is calculated

mydata$property.damage <- mydata$PROPDMG * 10^mydata$PROPDMGEXP
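The same exponent rule can also be written once as a small lookup helper and reused for both the property and crop columns. This is a sketch under stated assumptions, not the report's method: exp_to_power is a hypothetical name, it matches codes exactly rather than with grepl(), and the input vectors are made up.

```r
# Hypothetical helper: map a damage-exponent code to a power of ten.
# K/k -> 3, M/m -> 6, B/b -> 9; anything else (blank, NA, digits, symbols)
# maps to 0, mirroring the rule used in this report.
exp_to_power <- function(code) {
  lookup <- c(K = 3, M = 6, B = 9)
  power <- unname(lookup[toupper(as.character(code))])
  power[is.na(power)] <- 0
  power
}

# Made-up codes and amounts covering each case.
codes  <- c("K", "m", "B", "", "5", NA, "h")
amount <- c(25, 2.5, 1, 10, 3, 7, 2)

damage <- amount * 10^exp_to_power(codes)
print(damage)
```

Applied to the report's data, the helper would be called as exp_to_power(mydata$PROPDMGEXP) before the exponent column is overwritten, which avoids repeating the grep() assignments for CROPDMGEXP.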

Look at the ten most frequent property damage values

sort(table(mydata$property.damage), decreasing = TRUE)[1:10]
##
##      0   5000  10000   1000   2000  25000  50000   3000  20000  15000
## 663123  31731  21787  17544  17186  17104  13596  10364   9179   8617
# Do the same with `CROPDMGEXP`
mydata$CROPDMGEXP <- as.character(mydata$CROPDMGEXP)
mydata$CROPDMGEXP[is.na(mydata$CROPDMGEXP)] <- 0
mydata$CROPDMGEXP[!grepl("K|M|B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- 0
mydata$CROPDMGEXP[grep("K", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$CROPDMGEXP[grep("M", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$CROPDMGEXP[grep("B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$CROPDMGEXP <- as.numeric(as.character(mydata$CROPDMGEXP))
mydata$crop.damage <- mydata$CROPDMG * 10^mydata$CROPDMGEXP
sort(table(mydata$crop.damage), decreasing = TRUE)[1:10]
##
##      0   5000  10000  50000  1e+05   1000   2000  25000  20000  5e+05
## 880198   4097   2349   1984   1233    956    951    830    758    721

Results

This section presents the analysis and results that answer the research questions.

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

To answer this question, we compute the total number of fatalities and injuries for each weather event group.

fatal.injury <- mydata %>%
  group_by(EVENT) %>%
  summarize(Total = sum(FATALITIES + INJURIES, na.rm = TRUE)) %>%
  mutate(Percent = round(Total / sum(Total) * 100, 3),
         Type = rep("Fatalities and Injuries"))
fatal.injury %>% arrange(desc(Total))
## # A tibble: 10 × 4
##    EVENT   Total Percent Type
##    <chr>   <dbl>   <dbl> <chr>
##  1 TORNADO 97068  62.4   Fatalities and Injuries
##  2 OTHER   14850   9.54  Fatalities and Injuries
##  3 HEAT    12362   7.94  Fatalities and Injuries
##  4 WIND    10210   6.56  Fatalities and Injuries
##  5 FLOOD   10126   6.50  Fatalities and Injuries
##  6 STORM    5755   3.70  Fatalities and Injuries
##  7 WINTER   2169   1.39  Fatalities and Injuries
##  8 HAIL     1386   0.89  Fatalities and Injuries
##  9 SNOW     1328   0.853 Fatalities and Injuries
## 10 RAIN      419   0.269 Fatalities and Injuries

This table shows the total number of fatalities and injuries for each weather event group. Tornado-related events caused the most fatalities and injuries, accounting for 62.4% of the total, whereas rain-related events caused the fewest, accounting for less than 0.3%. The plot below shows the same result, with each event group's percentage of total fatalities and injuries displayed above its bar.
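Summing FATALITIES + INJURIES gives one combined figure per event group. If the two measures need to be inspected separately, they can be aggregated side by side. A minimal base-R sketch on a made-up data frame toy (not the storm data), with column names assumed to match mydata:

```r
# Made-up example data with the same column names as mydata.
toy <- data.frame(
  EVENT      = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 3, 2, 1),
  INJURIES   = c(15, 2, 0, 4, 6)
)

# Sum each measure per event group.
by_event <- aggregate(cbind(FATALITIES, INJURIES) ~ EVENT, data = toy, FUN = sum)

# Combined total and its share of the overall total, as in the table above.
by_event$Total   <- by_event$FATALITIES + by_event$INJURIES
by_event$Percent <- round(by_event$Total / sum(by_event$Total) * 100, 3)

# Largest combined total first.
by_event <- by_event[order(-by_event$Total), ]
print(by_event)
```

Keeping the two columns apart makes it possible to see whether an event group ranks highly because of deaths, injuries, or both.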

ggplot(fatal.injury, aes(x = reorder(EVENT, -Total), y = Total, fill = EVENT)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(Percent, "%")), vjust = -0.5) +
  labs(title = "Total Fatalities and Injuries by Weather Event",
       x = "Weather Event",
       y = "Total Number of Fatalities and Injuries") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  2. Across the United States, which types of events have the greatest economic consequences?

To answer this question, we compute the total dollar value of property and crop damage for each weather event group.

prop.crop <- mydata %>%
  group_by(EVENT) %>%
  summarize(Total = sum(property.damage + crop.damage, na.rm = TRUE)) %>%
  mutate(Percent = round(Total / sum(Total) * 100, 3),
         Type = rep("Property and Crop Damages"))
prop.crop %>% arrange(desc(Total))
## # A tibble: 10 × 4
##    EVENT           Total Percent Type
##    <chr>           <dbl>   <dbl> <chr>
##  1 FLOOD   179769100029.  37.7   Property and Crop Damages
##  2 OTHER   120835593207.  25.4   Property and Crop Damages
##  3 STORM    72678890281.  15.3   Property and Crop Damages
##  4 TORNADO  59010559549.  12.4   Property and Crop Damages
##  5 HAIL     18779880521.   3.94  Property and Crop Damages
##  6 WIND     12250885768.   2.57  Property and Crop Damages
##  7 WINTER    6824739251    1.43  Property and Crop Damages
##  8 RAIN      4189545992    0.879 Property and Crop Damages
##  9 SNOW      1158852852.   0.243 Property and Crop Damages
## 10 HEAT       924795030    0.194 Property and Crop Damages

The table above shows the total property and crop damage, in dollars, caused by each weather event group. Flooding dominates the list, accounting for nearly 40% of total damages. Heat-related events caused the least damage, at less than 0.2% of the total.

ggplot(prop.crop, aes(x = reorder(EVENT, -Total), y = Percent, fill = EVENT)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(Percent, "%")), vjust = -0.5) +
  labs(title = "Economic Impact of Weather Events (Property and Crop Damages)",
       x = "Weather Event",
       y = "Percentage of Total Damages") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
