This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The goal of this assignment is to use the database to answer some basic questions about severe weather events and to show the code for the entire analysis.
Questions to be answered:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
In this section we describe all the steps performed on the dataset, including downloading, reading, and transforming it, so that the data is tidy before we run the analysis.
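The download step is not shown in this report, so the following is only a minimal sketch of how it could be done. The URL and the helper name `fetch_if_missing` are assumptions introduced for illustration; only the file name matches the `read.csv` call used in this report.

```r
# Sketch of a download step. The URL below is an assumption: it is not
# stated anywhere in this report.
storm_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
storm_file <- "repdata_data_StormData.csv.bz2"

# Hypothetical helper: download only when the file is not already on disk,
# so that re-running the analysis does not fetch the archive again.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage (commented out so the sketch does not touch the network when sourced):
# fetch_if_missing(storm_url, storm_file)
# data <- read.csv(storm_file)  # read.csv can read the .bz2 file directly
```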
data<-read.csv("repdata_data_StormData.csv.bz2")
#Visualizing the data and its structure
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
dim(data)
## [1] 902297 37
The data has 902297 observations of 37 variables. The columns record, among other things, the location of each event, its magnitude, the number of fatalities and injuries, and the property and crop damage. More information can be found in the documentation: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
summary(is.na(data))
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:902297
##
## COUNTY COUNTYNAME STATE EVTYPE
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:902297
##
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:902297
##
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## Mode :logical Mode :logical Mode:logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 TRUE:902297 FALSE:902297 FALSE:902297
##
## END_LOCATI LENGTH WIDTH F
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:58734
## TRUE :843563
## MAG FATALITIES INJURIES PROPDMG
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:902297
##
## PROPDMGEXP CROPDMG CROPDMGEXP WFO
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902297 FALSE:902297
##
## STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902297 FALSE:902297 FALSE:902250 FALSE:902297
## TRUE :47
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:902257 FALSE:902297 FALSE:902297 FALSE:902297
## TRUE :40
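# Convert BGN_DATE from character to POSIXct; the trailing " 0:00:00" is
# not part of the format and is ignored when parsing
data$BGN_DATE <- as.POSIXct(data$BGN_DATE, format = "%m/%d/%Y")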
Only a few columns contain NA values: COUNTYENDN (all 902297 rows), F (843563 rows), LATITUDE (47 rows), and LATITUDE_E (40 rows).
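Note that `is.na()` does not flag empty strings, and the `str()` output shows several character columns (for example END_DATE) filled with `""`. As a rough sketch, the check could count both kinds of missing value per column. The data frame `toy` below uses made-up values and stands in for the real data.

```r
# Toy data frame standing in for the storm data (values are made up).
toy <- data.frame(
  F          = c(3L, NA, NA, 1L),
  END_DATE   = c("", "", "1/1/2000 0:00:00", ""),
  FATALITIES = c(0, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Number of NA values per column.
na_counts <- colSums(is.na(toy))

# Number of empty strings per column; only character columns can hold "".
empty_counts <- vapply(
  toy,
  function(col) if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L,
  integer(1)
)

# One row per column, with both counts side by side.
missing_overview <- data.frame(na = na_counts, empty = empty_counts)
print(missing_overview)
```

The same two calls can be applied to `data` in place of `toy`.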
To start answering the questions, we can look at summaries of the fatalities and injuries columns.
summary(data$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(data$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary_data <- data %>%
group_by(EVTYPE) %>%
summarise(
Total_Fatalities = sum(FATALITIES, na.rm = TRUE),
Total_Injuries = sum(INJURIES, na.rm = TRUE),
Total_Harm = Total_Fatalities + Total_Injuries
) %>%
arrange(desc(Total_Harm)) %>%
slice_head(n = 10)
print(summary_data)
## # A tibble: 10 × 4
## EVTYPE Total_Fatalities Total_Injuries Total_Harm
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
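The table lists TSTM WIND and THUNDERSTORM WIND as separate event types, so their totals are split across rows. The sketch below shows one way such labels could be merged before grouping. The function `normalize_evtype` and its rules are illustrative assumptions, not part of the original analysis, and the counts in `toy` are made up. Base R is used here so the example runs without extra packages.

```r
# Toy records (made-up counts) with three spellings of the same event type.
toy <- data.frame(
  EVTYPE     = c("TSTM WIND", "THUNDERSTORM WIND", "Thunderstorm Winds ", "TORNADO"),
  FATALITIES = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical normalizer: trim whitespace, upper-case, expand the TSTM
# abbreviation, and turn a trailing WINDS into WIND.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("\\bTSTM\\b", "THUNDERSTORM", x, perl = TRUE)
  x <- sub("\\bWINDS$", "WIND", x, perl = TRUE)
  x
}

toy$EVTYPE_CLEAN <- normalize_evtype(toy$EVTYPE)

# Totals per cleaned label: the three wind spellings now share one row.
totals <- aggregate(FATALITIES ~ EVTYPE_CLEAN, data = toy, FUN = sum)
print(totals)
```

A real cleanup would need more rules than these two, since the EVTYPE column contains many other spelling variants.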
After gathering the data, we create a figure to answer the first question.
library(dplyr)
library(tidyr)
library(ggplot2)
# Convert to long format for grouped bar chart
summary_long <- summary_data %>%
pivot_longer(cols = c(Total_Fatalities, Total_Injuries, Total_Harm),
names_to = "Harm_Type",
values_to = "Count")
# Plot
ggplot(summary_long, aes(x = reorder(EVTYPE, -Count), y = Count, fill = Harm_Type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Fatalities, Injuries, and Harm by Event Type",
x = "Event Type",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("salmon", "skyblue", "darkseagreen"))
The figure shows the top 10 event types that cause the most harm to the population (counting both injuries and fatalities). Now let’s take a look at damage to property and crops.
summary(data$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
summary(data$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
library(dplyr)
dmg_summary <- data %>%
group_by(EVTYPE) %>%
summarise(
Total_PropDmg = sum(PROPDMG, na.rm = TRUE),
Total_CropDmg = sum(CROPDMG, na.rm = TRUE),
Total_Dmg = Total_PropDmg + Total_CropDmg
) %>%
arrange(desc(Total_Dmg)) %>%
slice_head(n = 10)
print(dmg_summary)
## # A tibble: 10 × 4
## EVTYPE Total_PropDmg Total_CropDmg Total_Dmg
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 3212258. 100019. 3312277.
## 2 FLASH FLOOD 1420125. 179200. 1599325.
## 3 TSTM WIND 1335966. 109203. 1445168.
## 4 HAIL 688693. 579596. 1268290.
## 5 FLOOD 899938. 168038. 1067976.
## 6 THUNDERSTORM WIND 876844. 66791. 943636.
## 7 LIGHTNING 603352. 3581. 606932.
## 8 THUNDERSTORM WINDS 446293. 18685. 464978.
## 9 HIGH WIND 324732. 17283. 342015.
## 10 WINTER STORM 132721. 1979. 134700.
# Convert to long format for grouped bar chart
summary_dmg_long <- dmg_summary %>%
pivot_longer(cols = c(Total_Dmg, Total_PropDmg, Total_CropDmg),
names_to = "Damage_Type",
values_to = "Count")
# Plot
ggplot(summary_dmg_long, aes(x = reorder(EVTYPE, -Count), y = Count, fill = Damage_Type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Property, Crop, and Total Damage by Event Type",
x = "Event Type",
y = "Damage (raw PROPDMG/CROPDMG units)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("salmon", "skyblue", "darkseagreen"))
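The totals show that tornadoes and floods are the main causes of property damage, while hail accounts for the largest share of crop damage. Note that these totals sum the raw PROPDMG and CROPDMG values without applying the magnitude codes in PROPDMGEXP and CROPDMGEXP (such as the K visible in the first rows of the data), so they are not dollar amounts.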
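Because PROPDMG and CROPDMG are stored alongside a magnitude code, a ranking in dollars would first scale each value by its code. The following is a hedged sketch of that conversion on made-up data. The function `exp_to_multiplier` is a hypothetical helper that handles only the codes H, K, M and B; treating every other code (including the empty string) as a multiplier of 1 is an assumption.

```r
# Hypothetical mapping from magnitude code to multiplier. Only H, K, M and B
# are handled; any other code (including "") is treated as 1 (an assumption).
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Toy records with made-up values.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Scale each raw value by its multiplier, then total by event type.
toy$PROP_USD <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
totals <- aggregate(PROP_USD ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(-totals$PROP_USD), ]
print(totals)
```

In this toy example a single record coded B outweighs every other row, which illustrates why a ranking based on raw values can differ from one based on scaled values.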
After transforming the data and running the analysis, we can conclude that tornadoes are by far the main cause of injuries and fatalities. In terms of damage, tornadoes are also the leading event type for property damage, while hail is the biggest source of crop damage.