Synopsis-Course Project 2 Reproducible Research

Goal: This project looks to answer two main questions:

  1. Which types of events cause the most harm to people’s health across the U.S.?

  2. Which types of events lead to the biggest economic losses across the U.S.?

Results: Tornadoes have caused the most harm to human health, while floods are the leading cause of property damage.

Disclaimer: The event type names in the data set aren’t always clean: there are misspellings, slight variations, and duplicates. For this analysis, I’m treating each unique spelling as a separate event type, even if some of them might actually refer to the same kind of event.
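
The analysis itself does not merge spelling variants. Purely as an illustration of what light clean-up could look like, here is a minimal base-R sketch on made-up labels; `normalize_evtype` is a hypothetical helper, not part of the analysis above.

```r
# Toy event labels standing in for the real EVTYPE column (illustrative only)
evtype <- c("TSTM WIND", "tstm wind", " TSTM WIND", "Flood", "FLOOD", "HEAT")

# Hypothetical helper: upper-case, trim, and collapse repeated whitespace
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))
  gsub("[[:space:]]+", " ", x)
}

evtype_clean <- normalize_evtype(evtype)

# Compare the number of distinct labels before and after normalization
n_before <- length(unique(evtype))
n_after <- length(unique(evtype_clean))
print(c(before = n_before, after = n_after))
```

This only catches differences in case and spacing; true misspellings would still need a manual mapping.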

Data Processing

Reading in the raw data

# Path to the compressed raw data; read.csv() can read .bz2 files directly.
# Stored as data_file rather than file.path to avoid masking base::file.path().
data_file <- "C:/Users/Lenovo/Documents/R_datasets_practise/Coursera/repdata_data_StormData.csv.bz2"
data <- read.csv(data_file, header = TRUE, na.strings = "")


head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 2         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 3         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 4         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 5         0    <NA>       <NA>     <NA>     <NA>          0         NA
## 6         0    <NA>       <NA>     <NA>     <NA>          0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0    <NA>       <NA>   14.0   100 3   0          0       15    25.0
## 2         0    <NA>       <NA>    2.0   150 2   0          0        0     2.5
## 3         0    <NA>       <NA>    0.1   123 2   0          0        2    25.0
## 4         0    <NA>       <NA>    0.0   100 2   0          0        2     2.5
## 5         0    <NA>       <NA>    0.0   150 2   0          0        2     2.5
## 6         0    <NA>       <NA>    1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP  WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0       <NA> <NA>       <NA>      <NA>     3040      8812
## 2          K       0       <NA> <NA>       <NA>      <NA>     3042      8755
## 3          K       0       <NA> <NA>       <NA>      <NA>     3340      8742
## 4          K       0       <NA> <NA>       <NA>      <NA>     3458      8626
## 5          K       0       <NA> <NA>       <NA>      <NA>     3412      8642
## 6          K       0       <NA> <NA>       <NA>      <NA>     3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806    <NA>      1
## 2          0          0    <NA>      2
## 3          0          0    <NA>      3
## 4          0          0    <NA>      4
## 5          0          0    <NA>      5
## 6          0          0    <NA>      6

To answer the project’s questions we don’t need the entire data set, since most of its fields aren’t relevant. We only require the fields listed below, shown with their meanings.

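  1. EVTYPE - Type of event
  2. FATALITIES - Number of fatalities
  3. INJURIES - Number of non-fatal injuries
  4. PROPDMG - Property damage amount in USD, before applying the multiplier
  5. PROPDMGEXP - Unit multiplier for property damage (K = thousands, M = millions, B = billions)
  6. CROPDMG - Crop damage amount in USD, before applying the multiplier
  7. CROPDMGEXP - Unit multiplier for crop damage (K = thousands, M = millions, B = billions)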

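Now we create a subset of the original data that keeps only the variables listed above, drops rows whose event type is “?”, and keeps only events with at least one fatality, injury, or non-zero damage amount.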

useful_Data <- subset(data,
                      EVTYPE != "?" &
                        (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
                      select = c("EVTYPE",
                                 "FATALITIES",
                                 "INJURIES",
                                 "PROPDMG",
                                 "PROPDMGEXP",
                                 "CROPDMG",
                                 "CROPDMGEXP"))

dim(useful_Data)
## [1] 254632      7
names(useful_Data)
## [1] "EVTYPE"     "FATALITIES" "INJURIES"   "PROPDMG"    "PROPDMGEXP"
## [6] "CROPDMG"    "CROPDMGEXP"
sum(is.na(useful_Data))
## [1] 164248
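
The single total above does not show which columns hold the missing values. A per-column count is often more informative; the sketch below uses a made-up data frame that only borrows the column names, so the counts are illustrative and not results from the storm data.

```r
# Toy data frame with a few of the same column names (values are made up)
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG = c(25, 2.5, 0),
  PROPDMGEXP = c("K", NA, NA),
  CROPDMG = c(0, 10, 5),
  CROPDMGEXP = c(NA, "M", NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```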

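The PROPDMGEXP and CROPDMGEXP columns hold letter codes rather than numbers, so we convert them to numeric multipliers and apply them to the damage amounts. Any code that is not K, M, B, or already a number becomes NA during coercion, which is what the warning below reports.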

library(stringr)

useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "K", "1000")
useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "M", "1000000") 
useful_Data$PROPDMGEXP <- str_replace(useful_Data$PROPDMGEXP, "B", "1000000000")
useful_Data$PROPDMG <- useful_Data$PROPDMG * as.numeric(useful_Data$PROPDMGEXP)
## Warning: NAs introduced by coercion

We do the same for the crop damage columns.

useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "K", "1000")
useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "M", "1000000") 
useful_Data$CROPDMGEXP <- str_replace(useful_Data$CROPDMGEXP, "B", "1000000000")
useful_Data$CROPDMG <- useful_Data$CROPDMG * as.numeric(useful_Data$CROPDMGEXP)
## Warning: NAs introduced by coercion
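
An alternative to chained string replacement is a named lookup vector, which makes the handling of each code explicit. The sketch below is one possible approach, not the method used above. `exp_multiplier` is a hypothetical helper, and two choices in it are assumptions: treating H as hundreds and treating a missing code as a multiplier of 1.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier
exp_multiplier <- function(code) {
  # Assumed code table; H = hundreds is an assumption, not taken from the text
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])  # case-insensitive lookup
  mult[is.na(code)] <- 1                 # assumption: missing code means no scaling
  mult                                   # unrecognised codes remain NA
}

codes <- c("K", "m", "B", NA, "?")
amounts <- c(25, 2.5, 1, 10, 3)

multipliers <- exp_multiplier(codes)
scaled <- amounts * multipliers
print(scaled)
```

Unrecognised codes such as “?” still produce NA here, so they stay visible and can be handled deliberately.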

The chunk below adds two columns to our data set: “health” (fatalities plus injuries) and “propcost” (the damage cost used for the economic ranking).

stormdata<-useful_Data
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
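# Combined health impact: fatalities plus non-fatal injuries
stormdata$health<-stormdata$FATALITIES+stormdata$INJURIES
# coalesce() keeps the first non-NA value in each row: PROPDMG if present,
# otherwise CROPDMG, otherwise 0. It does not add the two damage columns.
stormdata$propcost <- coalesce(stormdata$PROPDMG, stormdata$CROPDMG, 0)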
head(stormdata$health)
## [1] 15  0  2  2  2  6
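
dplyr’s coalesce() returns the first non-missing value for each row rather than a sum. If a combined property-plus-crop total were wanted instead, one option is to treat missing amounts as zero and add them. The base-R sketch below contrasts the two on made-up values; `na_to_zero` is a hypothetical helper.

```r
# Made-up damage amounts; NA marks a missing value
prop <- c(25000, NA, 2500, NA)
crop <- c(NA, 1000, 500, NA)

# Hypothetical helper: replace missing amounts with zero
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

# Combined total: both amounts are added, with NA counted as 0
total_cost <- na_to_zero(prop) + na_to_zero(crop)

# First-non-missing behaviour, written out in base R for comparison
first_non_na <- ifelse(!is.na(prop), prop, ifelse(!is.na(crop), crop, 0))

print(data.frame(prop, crop, total_cost, first_non_na))
```

The two results differ only in rows where both amounts are present, such as the third row here.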

Summarize health by EVTYPE

Since we’re interested in the “most” harmful events, we sum the impact by event type and build three data frames: total health impact by event, total economic impact by event, and a combination of the two.

Creating a data frame for total health impact by event

library(dplyr)
mostharmful<-stormdata %>% group_by(EVTYPE) %>% 
  summarise(totalhealth=sum(health, na.rm=TRUE))
mostharmful<-arrange(mostharmful, desc(totalhealth))

head(mostharmful)
## # A tibble: 6 × 2
##   EVTYPE         totalhealth
##   <chr>                <dbl>
## 1 TORNADO              96979
## 2 EXCESSIVE HEAT        8428
## 3 TSTM WIND             7461
## 4 FLOOD                 7259
## 5 LIGHTNING             6046
## 6 HEAT                  3037

The same approach gives the total damage cost by event type.

library(dplyr)
mostcost<-stormdata %>% group_by(EVTYPE) %>% 
  summarise(highestcost=sum(propcost, na.rm=TRUE))
mostcost<-arrange(mostcost, desc(highestcost))

head(mostcost)
## # A tibble: 6 × 2
##   EVTYPE              highestcost
##   <chr>                     <dbl>
## 1 FLOOD             145148722800 
## 2 HURRICANE/TYPHOON  69305840000 
## 3 TORNADO            56937234641 
## 4 STORM SURGE        43323536000 
## 5 HAIL               16699513420 
## 6 FLASH FLOOD        16174100137.

Now, let’s join the two data frames into one combined data set.

combined_data<-full_join(mostharmful, mostcost)
## Joining with `by = join_by(EVTYPE)`
head(combined_data)
## # A tibble: 6 × 3
##   EVTYPE         totalhealth  highestcost
##   <chr>                <dbl>        <dbl>
## 1 TORNADO              96979  56937234641
## 2 EXCESSIVE HEAT        8428      7755700
## 3 TSTM WIND             7461   4541651340
## 4 FLOOD                 7259 145148722800
## 5 LIGHTNING             6046    935239306
## 6 HEAT                  3037    402523500
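
As an alternative to building two summaries and joining them, both totals can be computed in one grouped pass. The sketch below uses base R’s aggregate() on a made-up data frame that borrows the column names; the values are illustrative only.

```r
# Toy data with the same column names as stormdata (values are made up)
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "FLOOD", "HAIL"),
  health = c(15, 0, 2, 3, 1),
  propcost = c(25000, 1e6, 2500, 5e5, NA)
)

# Sum both columns per event type. na.action = na.pass keeps rows that
# contain NA, and na.rm = TRUE is passed on to sum() to skip those values.
totals <- aggregate(cbind(health, propcost) ~ EVTYPE, data = toy,
                    FUN = sum, na.rm = TRUE, na.action = na.pass)

# Sort by health impact, largest first
totals <- totals[order(-totals$health), ]
print(totals)
```

Without `na.action = na.pass`, the formula interface would drop the HAIL row entirely because of its missing cost.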

RESULTS

Top 10 Events that were most harmful to life

top10life<-mostharmful[1:10,]
library(ggplot2)
ggplot(top10life, aes(EVTYPE, totalhealth, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  theme_minimal() + xlab("Event Type")+
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) + 
  ylab("No. of Individuals affected") +
  ggtitle("Top 10 Deadliest Event Types in terms of Human Cost")

Fig 1: A bar plot of the 10 event types that have caused the most harm to human health (fatalities plus injuries).
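
By default ggplot2 orders a character x-axis alphabetically, so the bars do not appear in ranked order. One common option is to turn the event type into a factor ordered by the plotted value using reorder(). The sketch below uses three totals from the health table above; the plotting call is guarded so the snippet also runs where ggplot2 is not installed.

```r
# Three rows taken from the health totals shown earlier
top <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
  totalhealth = c(7259, 96979, 3037)
)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating the totals puts the largest value first
top$EVTYPE <- reorder(top$EVTYPE, -top$totalhealth)
bar_order <- levels(top$EVTYPE)
print(bar_order)

# Optional plot, built only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top, ggplot2::aes(EVTYPE, totalhealth)) +
    ggplot2::geom_col()
}
```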

Top 10 Events that were most harmful to property

top10prop<-mostcost[1:10,]
library(ggplot2)
ggplot(top10prop, aes(EVTYPE, highestcost, fill = EVTYPE)) +
  geom_bar(stat = "identity") +
  theme_minimal() + xlab("Event Type")+
  theme(axis.text.x = element_text(angle = 40, hjust = 1)) +
  ylab("Damage cost (USD)") +
  ggtitle("Top 10 Costliest Event Types in terms of Property Loss")

Fig 2: A bar plot of the 10 event types that have caused the most monetary damage to property.

Conclusions

From the analysis above, we can draw two conclusions:

  1. Tornadoes caused by far the most harm to human health.

  2. Floods caused the most monetary damage to property, followed by hurricanes/typhoons and tornadoes.