Synopsis

The aim of this project is to explore the NOAA Storm Data and identify which severe weather events have the most serious consequences for population health and the economy. The results can help government agencies prioritize and prepare for such events.

Data Processing

Reading the data

In this section, I read in the data, inspect its structure, and count the NAs in each column.

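# stringsAsFactors = TRUE keeps text columns as factors (not the default since R 4.0); the levels() recoding later relies on this
stormdat <- read.csv("repdata%2Fdata%2FStormData.csv", na.strings = c("", "NA"), stringsAsFactors = TRUE)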
str(stormdat)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels "- 1 N Albion",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels "- .5 NNW","- 11 ESE Jay",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","$AC","$AG",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436773 levels "-2 at Deer Park\n",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
colSums(is.na(stormdat))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0       1589 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0     547332     287743     243411 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##     238978          0     902297          0     724837     499225 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0     465934          0     618413     142069     248769 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##     594029         47          0         40          0     287433 
##     REFNUM 
##          0
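
The raw NA counts are easier to compare as a share of the rows. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical stand-ins, not the storm data.

```r
# Toy data frame standing in for stormdat (hypothetical values)
toy <- data.frame(
  evtype = c("TORNADO", "FLOOD", "HAIL", "FLOOD"),
  f      = c(3, NA, NA, NA),
  mag    = c(0, 0, 175, NA)
)

# is.na() gives a logical matrix; colMeans() turns it into the
# fraction of missing values per column
na_share <- colMeans(is.na(toy))

# Sort so the emptiest columns come first
na_share <- sort(na_share, decreasing = TRUE)
print(round(na_share, 2))
```

Applied to stormdat, the same call would report a share of 1 for COUNTYENDN, since all 902297 rows of that column are NA.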

Transformations on data

  • I make a copy of the data so that a mistake does not force me to reload it.
storm1 <- stormdat
  • For ease of use, I convert the column names to lower case.
names(storm1) <- tolower(names(storm1))
names(storm1)
##  [1] "state__"    "bgn_date"   "bgn_time"   "time_zone"  "county"    
##  [6] "countyname" "state"      "evtype"     "bgn_range"  "bgn_azi"   
## [11] "bgn_locati" "end_date"   "end_time"   "county_end" "countyendn"
## [16] "end_range"  "end_azi"    "end_locati" "length"     "width"     
## [21] "f"          "mag"        "fatalities" "injuries"   "propdmg"   
## [26] "propdmgexp" "cropdmg"    "cropdmgexp" "wfo"        "stateoffic"
## [31] "zonenames"  "latitude"   "longitude"  "latitude_e" "longitude_"
## [36] "remarks"    "refnum"

Data Analysis

Analyzing events that are most harmful with respect to population health.

Two variables in the dataset describe harm to population health: fatalities and injuries. I first add the two variables together, then aggregate the combined variable by event type. Only the top 6 event types are kept, since they are the highest priority.

storm1$fatalities[which(is.na(storm1$fatalities))] <- 0
storm1$injuries[which(is.na(storm1$injuries))] <- 0
storm1$pop.damage <-  storm1$fatalities + storm1$injuries

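# Total fatalities + injuries per event type
pop.fatalities <- aggregate(pop.damage ~ evtype, data = storm1, FUN = sum)
# Sort from most to least harmful; head() keeps the first 6 rows
pop.fatalities <- head(pop.fatalities[order(pop.fatalities$pop.damage, decreasing = TRUE), ])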
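
To make the combine, aggregate, and rank steps concrete, here is a small sketch on invented numbers. The data frame `toy` and the names `totals` and `top2` are hypothetical and only illustrate the pattern used above.

```r
# Invented records: two tornadoes, two floods, one heat event
toy <- data.frame(
  evtype     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  fatalities = c(1, 0, 2, 4, 1),
  injuries   = c(10, 3, 5, 0, 2)
)

# Combine the two health measures into one column
toy$pop.damage <- toy$fatalities + toy$injuries

# One row per event type, holding the summed harm
totals <- aggregate(pop.damage ~ evtype, data = toy, FUN = sum)

# Rank from most to least harmful and keep the first two rows
totals <- totals[order(totals$pop.damage, decreasing = TRUE), ]
top2 <- head(totals, 2)
print(top2)
```

Adding fatalities and injuries gives a single ranking but weighs a death the same as an injury. If the two measures need to stay separate, `aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)` would sum each one on its own.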

Analyzing events that have the greatest economic consequences.

Two variables in the dataset capture economic consequences: property damage and crop damage. Each amount is split across two columns: a numeric value (propdmg, cropdmg) and an exponent code (propdmgexp, cropdmgexp) that gives its magnitude, such as K for thousands, M for millions, and B for billions. I recode the exponent codes as numeric multipliers and multiply them by the values to get each cost in a single column. Finally, the property and crop costs are added together and aggregated by event type. Only the top 6 event types are kept, since they are the highest priority.

levels(storm1$propdmgexp)
##  [1] "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m"
## [18] "M"
levels(storm1$propdmgexp) <- c(0,0,0,0,10,
                               100,1000,10000,100000,
                               1000000,10000000,100000000,1000000000,
                               100,100,1000,1000000,1000000)

storm1$propdmgexp <- as.numeric(as.character(storm1$propdmgexp))
storm1$propdmgcount <- storm1$propdmg * storm1$propdmgexp
storm1$propdmgcount[which(is.na(storm1$propdmgcount))] <- 0

levels(storm1$cropdmgexp)
## [1] "?" "0" "2" "B" "k" "K" "m" "M"
levels(storm1$cropdmgexp) <- c(0,0,100,1000000000,1000,1000,1000000,1000000)
storm1$cropdmgexp <- as.numeric(as.character(storm1$cropdmgexp))
storm1$cropdmgcount <- storm1$cropdmg * storm1$cropdmgexp
storm1$cropdmgcount[which(is.na(storm1$cropdmgcount))] <- 0

storm1$damage <- storm1$propdmgcount + storm1$cropdmgcount

economic <- aggregate(damage ~ evtype, sum, data = storm1)
# economic already has one row per event type, so no second aggregation is needed
economic1 <- head(economic[order(economic$damage, decreasing = TRUE), ])
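
Overwriting levels() works only if the replacement vector matches the level order position by position, so a change in the levels present in the data would silently shift the mapping. A lookup keyed by the code itself is less fragile. The sketch below is an alternative, not the method used in this report: `exp_to_multiplier` is a hypothetical helper, and it assumes the same interpretation as the recoding above (digits 1 to 8 as powers of ten, everything unrecognized or missing as 0).

```r
# Hypothetical helper: map an exponent code to a numeric multiplier
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))   # treat "k"/"K", "m"/"M", "h"/"H" alike
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[code])          # NA for codes not in the lookup

  # Digits "1".."8" are read as powers of ten, as in the recoding above
  digit <- grepl("^[1-8]$", code)
  mult[digit] <- 10^as.numeric(code[digit])

  # "-", "?", "+", "0" and missing codes contribute nothing
  mult[is.na(mult)] <- 0
  mult
}

# Invented values and codes to exercise each branch
codes <- c("K", "m", "B", "2", "?", NA, "h")
dmg   <- c(25, 2.5, 1, 3, 10, 5, 4)
cost  <- dmg * exp_to_multiplier(codes)
print(cost)
```

With this helper, the property cost could be computed as `storm1$propdmg * exp_to_multiplier(storm1$propdmgexp)` without touching the factor levels, and the same function would cover cropdmgexp.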

Results

1. Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

print(pop.fatalities)
##             evtype pop.damage
## 834        TORNADO      96979
## 130 EXCESSIVE HEAT       8428
## 856      TSTM WIND       7461
## 170          FLOOD       7259
## 464      LIGHTNING       6046
## 275           HEAT       3037
library(ggplot2)
ggplot(pop.fatalities,aes(x=reorder(evtype, -pop.damage), y=pop.damage)) + geom_bar(stat = "identity") + xlab("Event Types") + ylab("Magnitude of Damage") + ggtitle("Most Harmful effects with respect to population health")

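Tornadoes are by far the most harmful event type: they account for 96,979 combined fatalities and injuries, more than ten times the total for excessive heat, the next event type.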

2. Across the United States, which types of events have the greatest economic consequences?

print(economic1)
##                evtype       damage
## 170             FLOOD 150319678250
## 411 HURRICANE/TYPHOON  71913712800
## 834           TORNADO  57362333590
## 670       STORM SURGE  43323541000
## 244              HAIL  18761221670
## 153       FLASH FLOOD  18243990610
#Plotting the causes of most severe economic damages
ggplot(economic1,aes(x=reorder(evtype, -damage), y=damage)) + geom_bar(stat = "identity") + xlab("Event Types") + ylab("Cost of Damage") + ggtitle("Causes of Most Economic damage")

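The plot above shows the six event types that caused the most economic damage. Floods rank first with about 150 billion dollars in combined property and crop damage, roughly twice the total for hurricanes/typhoons.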