Synopsis

This project aims to identify the natural events with the greatest impact on health and the economy in the US. The first aim was operationalized as the top 10 most injurious and most fatal event types in the dataset, and the second as the top 10 costliest event types in terms of property and crop damage.

Data Processing

The .csv file required for the analysis is already in my working directory, so I load the data from it into R. I unzipped the file manually before loading it, because loading the bz2 archive directly took a very long time.

data <- read.csv("repdata_data_StormData.csv")
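For reference, `read.csv()` can also read a bzip2-compressed file directly, since R's file connections detect the compression when reading. The sketch below is only an illustration on a small, hypothetical toy file written to a temporary path; it is not the storm dataset, and it says nothing about how long the real archive takes to load.

```r
# Hypothetical toy data standing in for the storm file (assumption, not real data)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"),
                  FATALITIES = c(3, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through a file connection, which detects bzip2
roundtrip <- read.csv(tmp, stringsAsFactors = FALSE)
print(roundtrip)
```

Passing the compressed path straight to `read.csv()` avoids the manual unzip step, at the cost of decompressing on every load.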

Next, I extract the required variables from the above data frame.

Subsetting only the content needed to answer both questions makes the processing faster.

#make a new dataset with one column that contains only the unique event types
smallerdata <- data.frame(EVS=unique(data$EVTYPE))

#add a column that is the sum of all fatalities caused by the event
smallerdata$FATALITIES <- tapply(data$FATALITIES, data$EVTYPE , sum)[smallerdata$EVS]

#add a column that is the sum of all injuries caused by the event
smallerdata$INJ <- tapply(data$INJURIES, data$EVTYPE , sum)[smallerdata$EVS]

#add a column that is the sum of injuries and fatalities caused by the event
smallerdata$totalhealthissues <- smallerdata$INJ + smallerdata$FATALITIES

#add a column that is the sum of all property damage values caused by the event
#(raw PROPDMG values divided by 1000, ignoring the exponent column)
smallerdata$PROPDMG <- (tapply(data$PROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000

#add a column that is the sum of all crop damage values caused by the event
#(raw CROPDMG values divided by 1000, ignoring the exponent column)
smallerdata$CROPDMG <- (tapply(data$CROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000

#add a column that is the sum of property damage and crop damage caused by the event (same units as above)
smallerdata$totalcosts <- smallerdata$PROPDMG + smallerdata$CROPDMG
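The cost columns above sum the raw damage figures without applying the exponent columns. As a rough sketch of what applying them could look like, the example below converts a damage value and its exponent code into dollars before summing. The helper `exp_multiplier`, the toy data frame, and the choice to treat any code other than H, K, M or B as a multiplier of 1 are all assumptions made for illustration; this is not the method used in this analysis.

```r
# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: unrecognised or empty codes are treated as a multiplier of 1.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows standing in for the storm data (not real figures)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FLOOD"),
  PROPDMG    = c(25, 2.5, 10, 1),
  PROPDMGEXP = c("K", "M", "k", "B"),
  stringsAsFactors = FALSE
)

# Convert each row to dollars, then sum per event type
toy$PROPUSD <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
usd_by_event <- tapply(toy$PROPUSD, toy$EVTYPE, sum)
print(usd_by_event)
```

With a conversion like this, a single billion-dollar flood record would outweigh many thousand-dollar records, so the ranking of event types could differ from one based on raw values.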

Now that we have extracted our dataset, let's start analyzing it. A boxplot of the number of fatalities and injuries per event type shows some extreme outliers, without which the dataset would have relatively low dispersion.

boxplot (smallerdata$FATALITIES, smallerdata$INJ)

Since the outliers look similar in both distributions, I do not treat them separately and move forward with my analysis.
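To see which event types the boxplot flags as outliers, one option is `boxplot.stats()`, which returns the points beyond the whiskers. The sketch below uses a small named vector built from a few of the fatality totals reported later in this document; it is an illustrative subset, so the flagged set on the full data will differ.

```r
# Illustrative subset of per-event fatality totals (taken from the results table)
fatalities <- c(TORNADO = 5633, "FLASH FLOOD" = 978, "TSTM WIND" = 504,
                HAIL = 15, FLOOD = 470, LIGHTNING = 816,
                "HIGH WIND" = 248, "WINTER STORM" = 206)

# Values lying beyond 1.5 times the hinge spread from the hinges
outlier_values <- boxplot.stats(fatalities)$out

# Recover the event names that correspond to those values
outlier_events <- names(fatalities)[fatalities %in% outlier_values]
print(outlier_events)
```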

Results

Let's extract the top 10 most fatal events along with the top 10 most injurious events.

#loading the required libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.1
library(reshape)
## Warning: package 'reshape' was built under R version 3.6.1
#returns the top 10 most fatal events
mostfatal <- head(smallerdata[order(smallerdata$FATALITIES, decreasing= TRUE), ], 10)
#returns the top 10 most injurious events
mostinj <- head(smallerdata[order(smallerdata$INJ, decreasing= TRUE), ], 10)
#returns the top 10 most impactful events on health (i.e. fatalities + injuries)
mosttotalhealthissue <- head(smallerdata[order(smallerdata$totalhealthissues, decreasing= TRUE), ], 10)

#ordering the events according to the total health impact (i.e. injuries + fatalities) which is more intuitive
mosttotalhealthissue$EVS <- factor(mosttotalhealthissue$EVS, levels = mosttotalhealthissue$EVS[order(mosttotalhealthissue$totalhealthissues, decreasing= TRUE)])
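The ranking and re-levelling steps above can be illustrated on a toy data frame. The sketch below sorts by a total, keeps the top rows, and then fixes the factor levels in that order, which is what makes ggplot2 draw the bars from largest to smallest instead of alphabetically. The data frame `toy` and the cut-off of 2 are assumptions chosen to keep the example small.

```r
# Toy totals (three values borrowed from the results table)
toy <- data.frame(EVS = c("HAIL", "TORNADO", "FLOOD"),
                  total = c(1376, 96979, 7259),
                  stringsAsFactors = FALSE)

# Sort descending by total and keep the top 2 rows
top <- head(toy[order(toy$total, decreasing = TRUE), ], 2)

# Fix the factor levels in ranked order so plots keep this ordering
top$EVS <- factor(top$EVS, levels = top$EVS)
print(levels(top$EVS))
```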

1. Across the United States, which types of events are most harmful with respect to population health?

To tackle the above question, we take the intersection of the most fatal and the most injurious events in the USA.

intrsctmostinjandfatal <- intersect(mostfatal[,1], mostinj[,1])
intrsctmostinjandfatal
## [1] "TORNADO"        "EXCESSIVE HEAT" "FLASH FLOOD"    "HEAT"          
## [5] "LIGHTNING"      "TSTM WIND"      "FLOOD"

Let's plot the above results.

meltedmostfatalinj <- melt(mosttotalhealthissue[mosttotalhealthissue$EVS %in% intrsctmostinjandfatal,], id.vars =c("EVS", "PROPDMG", "CROPDMG", "totalcosts"))

ggplot(meltedmostfatalinj, aes(EVS, value)) + geom_bar(aes(fill = variable), position = "dodge", stat="identity")+
      ggtitle(label = "Top Weather Events Impacting Human Health in the USA") +
      xlab("Event") + ylab ("Affected # of People") + 
      scale_fill_discrete(name = "Variable", labels = c("Fatalities", "Injuries", "Total(Fatalities + Injuries)"))

The graph clearly shows that tornadoes have the highest impact. Also, injuries are more common than fatalities.
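The `melt()` call used for this plot reshapes the data from wide to long format, with one row per event and variable, which is the shape a dodged bar chart needs. As a rough base R sketch of the same idea, assuming a toy data frame `wide` with two measure columns:

```r
# Toy wide table (values borrowed from the results table)
wide <- data.frame(EVS = c("TORNADO", "FLOOD"),
                   FATALITIES = c(5633, 470),
                   INJ = c(91346, 6789),
                   stringsAsFactors = FALSE)

measure_vars <- c("FATALITIES", "INJ")

# Stack the measure columns: one row per (event, variable) pair
long <- data.frame(
  EVS      = rep(wide$EVS, times = length(measure_vars)),
  variable = rep(measure_vars, each = nrow(wide)),
  value    = unlist(wide[measure_vars], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

Each event now appears once per variable, so `aes(fill = variable)` can draw one bar per measure side by side.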

2. Across the United States, which types of events have the greatest economic consequences?

To tackle the above question, we extract the top 10 most economically costly events, i.e. the events with the highest combined crop and property damage costs.

#returns top 10 most costly events
mosteconomicalcost <- head(smallerdata[order(smallerdata$totalcosts, decreasing= TRUE), ], 10)
mosteconomicalcost
##                   EVS FATALITIES   INJ totalhealthissues   PROPDMG
## 1             TORNADO       5633 91346             96979 3212.2582
## 20        FLASH FLOOD        978  1777              2755 1420.1246
## 2           TSTM WIND        504  6957              7461 1335.9656
## 3                HAIL         15  1361              1376  688.6934
## 36              FLOOD        470  6789              7259  899.9385
## 16  THUNDERSTORM WIND        133  1488              1621  876.8442
## 15          LIGHTNING        816  5230              6046  603.3518
## 10 THUNDERSTORM WINDS         64   908               972  446.2932
## 46          HIGH WIND        248  1137              1385  324.7316
## 8        WINTER STORM        206  1321              1527  132.7206
##      CROPDMG totalcosts
## 1  100.01852  3312.2767
## 20 179.20046  1599.3251
## 2  109.20260  1445.1682
## 3  579.59628  1268.2897
## 36 168.03788  1067.9764
## 16  66.79145   943.6356
## 15   3.58061   606.9324
## 10  18.68493   464.9781
## 46  17.28321   342.0148
## 8    1.97899   134.6996
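The table lists TSTM WIND, THUNDERSTORM WIND and THUNDERSTORM WINDS as separate event types, so their totals are split across three rows. A minimal sketch of how such label variants could be merged before aggregating is shown below. The helper `normalise_evtype` and its two substitution rules are assumptions for illustration; they are not applied in this analysis and would not cover every variant in the data.

```r
# Hypothetical helper: collapse a few obvious spelling variants of event labels
normalise_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))   # unify case and surrounding whitespace
  x <- sub("^TSTM", "THUNDERSTORM", x)    # expand the TSTM abbreviation
  x <- sub("WINDS$", "WIND", x)           # drop a trailing plural
  x
}

labels <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", " Tornado")
cleaned <- normalise_evtype(labels)
print(unique(cleaned))
```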

Let's plot the above results.

#make sure of proper ordering
mosteconomicalcost$EVS <- factor(mosteconomicalcost$EVS, levels =
                                   mosteconomicalcost$EVS[order(mosteconomicalcost$totalcosts, decreasing= TRUE)])

#prepares the latter dataset for ggplotting
meltedmostecocost <- melt(mosteconomicalcost,
                          id.vars = c("EVS", "FATALITIES", "INJ", "totalhealthissues"))

#again, make sure of tidy ordering
meltedmostecocost$variable <- factor(meltedmostecocost$variable, levels = c("CROPDMG", "PROPDMG", "totalcosts"))

#gg barplotting

ggplot(meltedmostecocost, aes(EVS, value)) +
      geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
      ggtitle(label = "Top 10 Weather Events Impacting Economy in the USA") +
      xlab("Event") + ylab ("Thousand US $ (Ignoring the EXPonential variables)") + 
      scale_fill_discrete(name = "Variable", labels = c("Crop Losses", "Property Losses", "Total Losses(Crop + Property)"))

From the above graph it is clear that the tornado is once again the costliest event in the USA, causing far more damage to property than to crops.