This analysis is of the US National Oceanic and Atmospheric Administration's (NOAA) Storm Database. This database is a collection of records of storms, severe weather events and similar occurrences, and provides information on when and where they occurred and on the injuries, fatalities and other harm they caused. It also contains estimates of property damage, crop damage and other losses.
The goal of the analysis is to determine which event types have had a substantial effect on people's health and which event types have had a major impact on the economy.
The database contains 902,297 observations of 37 variables. It was analysed to find out how many event types were recorded (985 different categories of events). The database was also summarised to check for missing values. Potential outliers were examined and their validity checked by reading through the event recording details in 'REMARKS'.
Loading the data into R - since we are dealing with a large data set, use 'fread' for faster reading of the data into R.
library("data.table")
## Warning: package 'data.table' was built under R version 3.2.5
stormdata <- fread( "StormData.csv",verbose = TRUE )
## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.523066 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ','
## Detected 37 columns. Longest stretch was from line 1 to line 30
## Starting data input on line 1 (either column names or first row of data). First 10 characters: "STATE__",
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 1307675 (including 1 at the end)
## Count of sep: 34819802
## nrow = MIN( nsep [34819802] / ncol [37] -1, neol [1307675] - nblank [1] ) = 967216
## Type codes ( first 5 rows): 3444344430000303003343333430000333303
## Type codes (+ middle 5 rows): 3444344434444303443343333434440333343
## Type codes (+ last 5 rows): 3444344434444303443343333434444333343
## Type codes: 3444344434444303443343333434444333343 (after applying colClasses and integer64)
## Type codes: 3444344434444303443343333434444333343 (after applying drop or select (if supplied)
## Allocating 37 column slots (37 - 0 dropped)
##
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:12
## Warning in fread("StormData.csv", verbose = TRUE): Read less rows (902297)
## than were allocated (967216). Run again with verbose=TRUE and please
## report.
## 0.000s ( 0%) Memory map (rerun may be quicker)
## 0.000s ( 0%) sep and header detection
## 1.454s ( 13%) Count rows (wc -l)
## 0.000s ( 0%) Column type detection (first, middle and last 5 rows)
## 0.562s ( 5%) Allocation of 902297x37 result (xMB) in RAM
## 9.135s ( 82%) Reading data
## 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
## 0.000s ( 0%) Coercing data already read in type bumps (if any)
## 0.020s ( 0%) Changing na.strings to NA
## 11.171s Total
Convert the fields EVTYPE, STATE and COUNTY to factors.
stormdata$COUNTY <- as.factor(stormdata$COUNTY)
stormdata$STATE <- as.factor(stormdata$STATE)
stormdata$EVTYPE <- as.factor(stormdata$EVTYPE)
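The checks described in the synopsis (counting the distinct event categories, looking for missing values and reading the 'REMARKS' narrative for suspect records) are not shown explicitly; a minimal sketch of how they could be run at this point, assuming the standard Storm Data column names:
nlevels(stormdata$EVTYPE)                     # number of distinct event categories (985 noted above)
sapply(stormdata, function(x) sum(is.na(x)))  # missing values per column
stormdata[PROPDMG == max(PROPDMG), REMARKS]   # narrative for an extreme property-damage record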
Subset the storm database so that only the required fields are selected. This eases the processing load.
stormdata1 <- as.data.frame(stormdata)
columns_needed <- c("FATALITIES","INJURIES","EVTYPE","CROPDMG","PROPDMG")
required_data <- stormdata1[columns_needed]
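As an alternative to subsetting after reading everything, fread's select argument can restrict the read to just these columns; a sketch (the object name required_data_alt is illustrative and not used later):
required_data_alt <- as.data.frame(
  fread("StormData.csv",
        select = c("FATALITIES", "INJURIES", "EVTYPE", "CROPDMG", "PROPDMG")))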
Sum up the fields by event type.
health1 <- aggregate( . ~ EVTYPE , data = required_data, sum)
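The same totals could also be computed directly on the data.table without the data frame conversion; a sketch of an equivalent (health1 from aggregate above is what the rest of the analysis uses):
health1_dt <- stormdata[, lapply(.SD, sum), by = EVTYPE,
                        .SDcols = c("FATALITIES", "INJURIES", "CROPDMG", "PROPDMG")]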
To answer the question on population health, let's look at the three event types with the largest totals of 'Fatalities' and 'Injuries'.
Let's plot the three events that caused the most injuries.
descend_injuries <- order(health1$INJURIES , decreasing = TRUE)
Injuries_data <- health1[descend_injuries,]
First3_Injuries <- Injuries_data[1:3,]
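The order-then-take-top-three pattern above is repeated below for fatalities, property damage and crop damage; a small hypothetical helper (top3_by, not part of the original code) could factor it out:
top3_by <- function(df, column) {
  df[order(df[[column]], decreasing = TRUE), ][1:3, ]
}
First3_Injuries <- top3_by(health1, "INJURIES")  # equivalent to the three lines above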
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.5
ggplot(First3_Injuries, aes(FATALITIES, INJURIES, width = 5000)) + geom_bar(stat = "identity", fill = "green", color = "red") + facet_grid(~ EVTYPE)
Similarly, let's plot the three events that caused the most fatalities.
descend_fatalities <- order(health1$FATALITIES , decreasing = TRUE)
Fatalities_data <- health1[descend_fatalities,]
First3_Fatality <- Fatalities_data[1:3,]
ggplot(First3_Fatality, aes(FATALITIES, INJURIES, width = 200)) + geom_bar(stat = "identity", fill = "blue", color = "red") + facet_grid(~ EVTYPE)
The data provides us with estimates of the property damage and crop damage done by the different events. By summing these values by event type we can see which event types caused the greatest economic consequences.
Let's plot the three events that caused the most property damage.
descend_propertydmg <- order(health1$PROPDMG , decreasing = TRUE)
Propdmg_data <- health1[descend_propertydmg,]
First3_Propdmg <- Propdmg_data[1:3,]
ggplot(First3_Propdmg, aes(CROPDMG, PROPDMG, width = 2000)) + geom_bar(stat = "identity", fill = "yellow", color = "red") + facet_grid(~ EVTYPE)
Now let's plot the three events that caused the most damage to crops.
descend_cropdmg <- order(health1$CROPDMG , decreasing = TRUE)
Cropdmg_data <- health1[descend_cropdmg,]
First3_Cropdmg <- Cropdmg_data[1:3,]
ggplot(First3_Cropdmg, aes(PROPDMG, CROPDMG, width = 20000)) + geom_bar(stat = "identity", fill = "grey", color = "red") + facet_grid(~ EVTYPE)
Events most harmful to population health: TORNADO, FLASH FLOOD, EXCESSIVE HEAT, TSTM WIND.
Events with the greatest economic consequences: TORNADO, FLASH FLOOD, TSTM WIND, HAIL.