Goal: This project looks to answer two main questions:
Which types of events cause the most harm to people’s health across the U.S.?
Which types of events lead to the biggest economic losses across the U.S.?
Results: Tornadoes have caused the most harm to human health, while floods have caused the greatest economic losses.
Disclaimer: The event type names in the data set aren’t always clean — there are misspellings, slight variations, and duplicates. For this analysis, I’m treating each unique spelling as a separate event type, even if some of them might actually be the same thing.
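The analysis below does not merge any event type names. For readers who want to see how much a light clean-up could reduce the number of distinct labels, here is a minimal sketch on toy labels. The labels are illustrative, not rows from the data set, and the choice of normalisation steps is an assumption.

```r
# Toy labels illustrating the kinds of variation described above (not real rows)
evtype <- c("TSTM WIND", "tstm wind", " TSTM  WIND ", "Thunderstorm Wind")

# Light normalisation: upper-case, trim the ends, collapse repeated whitespace
normalised <- gsub("\\s+", " ", trimws(toupper(evtype)))

n_before <- length(unique(evtype))
n_after <- length(unique(normalised))
print(c(before = n_before, after = n_after))
```

Case and whitespace differences disappear, but genuinely different spellings (here the abbreviated and the full form) still count as separate types and would need a manual mapping.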
# The directory is specific to the author's machine; adjust it as needed.
data_dir <- "C:/Users/Lenovo/Documents/R_datasets_practise/Coursera"
data <- read.csv(file.path(data_dir, "repdata_data_StormData.csv.bz2"), header = TRUE, na.strings = "")
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 <NA> <NA> <NA> <NA> 0 NA
## 2 0 <NA> <NA> <NA> <NA> 0 NA
## 3 0 <NA> <NA> <NA> <NA> 0 NA
## 4 0 <NA> <NA> <NA> <NA> 0 NA
## 5 0 <NA> <NA> <NA> <NA> 0 NA
## 6 0 <NA> <NA> <NA> <NA> 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 <NA> <NA> 14.0 100 3 0 0 15 25.0
## 2 0 <NA> <NA> 2.0 150 2 0 0 0 2.5
## 3 0 <NA> <NA> 0.1 123 2 0 0 2 25.0
## 4 0 <NA> <NA> 0.0 100 2 0 0 2 2.5
## 5 0 <NA> <NA> 0.0 150 2 0 0 2 2.5
## 6 0 <NA> <NA> 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 <NA> <NA> <NA> <NA> 3040 8812
## 2 K 0 <NA> <NA> <NA> <NA> 3042 8755
## 3 K 0 <NA> <NA> <NA> <NA> 3340 8742
## 4 K 0 <NA> <NA> <NA> <NA> 3458 8626
## 5 K 0 <NA> <NA> <NA> <NA> 3412 8642
## 6 K 0 <NA> <NA> <NA> <NA> 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 <NA> 1
## 2 0 0 <NA> 2
## 3 0 0 <NA> 3
## 4 0 0 <NA> 4
## 5 0 0 <NA> 5
## 6 0 0 <NA> 6
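Answering these questions does not require the entire data set, since most of its columns are not relevant here. Only the following variables are needed:
EVTYPE: the type of weather event
FATALITIES: the number of deaths attributed to the event
INJURIES: the number of injuries attributed to the event
PROPDMG: the property damage amount, before its multiplier is applied
PROPDMGEXP: the multiplier code for PROPDMG (K = thousands, M = millions, B = billions)
CROPDMG: the crop damage amount, before its multiplier is applied
CROPDMGEXP: the multiplier code for CROPDMG
Now we create a subset of the original data containing only the variables listed above, keeping only events that caused at least one fatality or injury, or some property or crop damage.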
useful_Data <- subset(
  data,
  EVTYPE != "?" &
    (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
  select = c("EVTYPE",
             "FATALITIES",
             "INJURIES",
             "PROPDMG",
             "PROPDMGEXP",
             "CROPDMG",
             "CROPDMGEXP")
)
dim(useful_Data)
## [1] 254632 7
names(useful_Data)
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## [6] "CROPDMG" "CROPDMGEXP"
sum(is.na(useful_Data))
## [1] 164248
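Property and crop damage are each stored as an amount (PROPDMG, CROPDMG) plus a multiplier code (PROPDMGEXP, CROPDMGEXP). To simplify the calculations, we replace the codes K, M and B with their numeric multipliers and apply them to the amounts. Any code that is still not numeric after these replacements, as well as any missing code, becomes NA, which is what the coercion warning below reports.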
library(stringr)
useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "K", "1000")
useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "M", "1000000")
useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "B", "1000000000")
useful_Data$PROPDMG <- useful_Data$PROPDMG * as.numeric(useful_Data$PROPDMGEXP)
## Warning: NAs introduced by coercion
We do the same for the crop damage columns.
useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "K", "1000")
useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "M", "1000000")
useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "B", "1000000000")
useful_Data$CROPDMG <- useful_Data$CROPDMG * as.numeric(useful_Data$CROPDMGEXP)
## Warning: NAs introduced by coercion
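The replacements above handle only the upper-case codes K, M and B, and rows with a missing code end up with an NA damage value. One alternative is a lookup table. The sketch below is not the method used in this analysis. It rests on three assumptions: codes are case-insensitive, a missing code means the amount is already in dollars, and any unrecognised code gives NA. The names exp_multiplier, to_dollars and toy are hypothetical.

```r
# Hypothetical lookup table: multiplier for each recognised exponent code
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Convert a damage amount and its exponent code to dollars.
# Assumptions: codes are case-insensitive, a missing code means a factor of 1,
# and any unrecognised code yields NA.
to_dollars <- function(value, exp_code) {
  code <- toupper(as.character(exp_code))
  mult <- unname(exp_multiplier[code])
  mult[is.na(exp_code)] <- 1
  value * mult
}

# Toy data, not taken from the storm data set
toy <- data.frame(dmg = c(25, 2.5, 1.2, 10, 3),
                  exp = c("K", "m", "B", NA, "?"),
                  stringsAsFactors = FALSE)
toy$dollars <- to_dollars(toy$dmg, toy$exp)
print(toy)
```

Because the lookup never calls as.numeric() on the codes, it raises no coercion warning, and the handling of unusual codes is spelled out in one place.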
The chunk below adds two columns to the data set: “health”, the sum of fatalities and injuries, and “propcost”, the economic damage figure.
stormdata<-useful_Data
library(dplyr)
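# health: total number of people killed or injured
stormdata$health <- stormdata$FATALITIES + stormdata$INJURIES
# propcost: coalesce() returns the first non-missing value per row, so this is
# PROPDMG where available, otherwise CROPDMG, otherwise 0 (not the sum of the two)
stormdata$propcost <- coalesce(stormdata$PROPDMG, 0 + stormdata$CROPDMG, 0)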
head(stormdata$health)
## [1] 15 0 2 2 2 6
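Note that coalesce() returns, for each row, the first non-missing value among its arguments; it does not add them. The sketch below uses toy values to contrast the expression used for propcost with a variant that sums property and crop damage, counting missing values as zero. The vectors prop and crop and the result names are illustrative only, and the figures reported in this analysis come from the first form.

```r
library(dplyr)

# Toy values, not taken from the storm data set
prop <- c(100, NA, 50, NA)
crop <- c(10, 20, NA, NA)

# First non-missing value per row: prop if present, else crop, else 0
first_available <- coalesce(prop, 0 + crop, 0)

# Row-wise sum, with missing values counted as 0
total <- coalesce(prop, 0) + coalesce(crop, 0)

print(data.frame(prop, crop, first_available, total))
```

The two columns differ only in rows where both damage values are present, which is where the first form leaves crop damage out.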
Summarize health by EVTYPE
Since we’re interested in the “most” harmful events, let’s sum the damage by event type and build three data frames: total health impact by event, total economic impact by event, and one that combines the two.
Creating a data frame for total health impact by event
mostharmful <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(totalhealth = sum(health, na.rm = TRUE)) %>%
  arrange(desc(totalhealth))
head(mostharmful)
## # A tibble: 6 × 2
## EVTYPE totalhealth
## <chr> <dbl>
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
Doing the same for economic damage
mostcost <- stormdata %>%
  group_by(EVTYPE) %>%
  summarise(highestcost = sum(propcost, na.rm = TRUE)) %>%
  arrange(desc(highestcost))
head(mostcost)
## # A tibble: 6 × 2
## EVTYPE highestcost
## <chr> <dbl>
## 1 FLOOD 145148722800
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937234641
## 4 STORM SURGE 43323536000
## 5 HAIL 16699513420
## 6 FLASH FLOOD 16174100137.
Now, let’s combine the two data frames into a single data set.
combined_data<-full_join(mostharmful, mostcost)
## Joining with `by = join_by(EVTYPE)`
head(combined_data)
## # A tibble: 6 × 3
## EVTYPE totalhealth highestcost
## <chr> <dbl> <dbl>
## 1 TORNADO 96979 56937234641
## 2 EXCESSIVE HEAT 8428 7755700
## 3 TSTM WIND 7461 4541651340
## 4 FLOOD 7259 145148722800
## 5 LIGHTNING 6046 935239306
## 6 HEAT 3037 402523500
top10life<-mostharmful[1:10,]
library(ggplot2)
ggplot(top10life, aes(EVTYPE, totalhealth, fill = EVTYPE)) +
geom_bar(stat = "identity") +
theme_minimal() + xlab("Event Type")+
theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
ylab("No. of Individuals affected") +
ggtitle("Top 10 Deadliest Event Types in terms of Human Cost")
Fig1: A bar plot representing the 10 storm types that have caused the most harm to human health.
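When the x variable is a character column, ggplot2 arranges the bars alphabetically rather than by height. To draw them from largest to smallest, EVTYPE can be turned into a factor ordered by the plotted value. The sketch below uses the six totals printed earlier in this report. The data frame name top_events and the axis labels are illustrative.

```r
# First six rows of the health totals printed earlier in this report
top_events <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND",
             "FLOOD", "LIGHTNING", "HEAT"),
  totalhealth = c(96979, 8428, 7461, 7259, 6046, 3037),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels follow decreasing totalhealth
top_events$EVTYPE <- reorder(top_events$EVTYPE, -top_events$totalhealth)
print(levels(top_events$EVTYPE))

# Draw the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top_events, aes(EVTYPE, totalhealth)) +
    geom_col() +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
    labs(x = "Event Type", y = "Fatalities + injuries")
  print(p)
}
```

geom_col() is equivalent to geom_bar(stat = "identity"), so the same reordering can be applied to the plots in this report without changing anything else.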
top10prop <- mostcost[1:10, ]
ggplot(top10prop, aes(EVTYPE, highestcost, fill = EVTYPE)) +
geom_bar(stat = "identity") +
theme_minimal() + xlab("Event Type") +
theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
ylab("Total damage (USD)") +
ggtitle("Top 10 Costliest Event Types in terms of Property Loss")
Fig2: A bar plot representing the 10 storm types that have caused the most monetary harm to property.
From the analysis above, we can draw two conclusions:
Tornadoes have caused by far the most harm to human health.
Floods have caused the most monetary damage, followed by hurricanes/typhoons and tornadoes.