Weather Events and Their Health and Economic Effects

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The task of this project is to assess which events are the most harmful with respect to people's lives and which cause the greatest economic damage.

Data Processing

This dataset is complex and messy, with many entry errors and many variables to consider. Therefore, the analysis will proceed in two phases: first, plotting relationships using the uncleaned variables; second, cleaning the EVTYPE variable and plotting the cleaned events.

The analysis has been broken down this way to make it a little easier to take in.

Loading Dataset
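
The loading step is not shown here, so the following is only a minimal sketch of how the data might be read. The function name load_storm_data and the file name in the usage note are hypothetical; the only assumption about the data is that it is distributed as a CSV file, possibly compressed.

```r
# Minimal sketch: read the storm data from a CSV file.
# load_storm_data is a hypothetical helper, not the author's code.
load_storm_data <- function(path) {
  # read.csv() reads gzip- and bzip2-compressed files directly,
  # so a .csv.bz2 download does not need to be unpacked first.
  read.csv(path, stringsAsFactors = FALSE)
}
```

A call such as storm <- load_storm_data("StormData.csv.bz2") would then give a data frame to work with, where the file name is a placeholder for wherever the download was saved.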

Phase 1: Plotting relationships with uncleaned variables

Since cleaning is going to take a long time, I'm first going to explore the relationships between the events as they now exist and the other variables of interest.
Variables of interest include the estimates of fatalities, injuries, and property damage.
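
The first-phase exploration could be sketched roughly as below: the mean and the sum of each outcome are computed per raw event type. The data frame is toy data made up for illustration, and the column names FATALITIES, INJURIES and PROPDMG are assumed to be the relevant ones; only EVTYPE is named in this report.

```r
# Toy data for illustration only; these are not values from the NOAA database.
storm <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FLOOD", "FLOOD"),
  FATALITIES = c(2, 0, 0, 1, 3),
  INJURIES   = c(10, 4, 0, 2, 0),
  PROPDMG    = c(25, 5, 1, 50, 30),
  stringsAsFactors = FALSE
)

# Mean of each outcome variable per (uncleaned) event type
event_mean <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG) ~ EVTYPE,
                        data = storm, FUN = mean)

# Total of each outcome variable per (uncleaned) event type
event_sum <- aggregate(cbind(FATALITIES, INJURIES, PROPDMG) ~ EVTYPE,
                       data = storm, FUN = sum)

# Rank event types by total fatalities, largest first
event_sum[order(-event_sum$FATALITIES), ]
```

Looking at both statistics is useful because the mean highlights events that are severe each time they occur, while the sum highlights events that are harmful mainly because they are frequent.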

Conclusions

Preliminary findings are inconclusive. All that can be said for certain is that different events cause differing types of damage, which is to be expected. The identity of these events remains unknown. The EVTYPE variable is very untidy, with the same events occurring under different names. A lot of cleaning will be needed.
Below are conclusions of the first phase of exploration:
MEAN

SUM

Phase 2: Cleaning the EVTYPE variable

This is going to take a lot of work to accomplish.
The expectation is that extensive use of regular expressions will suffice.
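
As a rough illustration of the kind of regular-expression cleaning meant here, the sketch below normalises a few labels and collapses variants onto one name. The raw labels and the patterns are hypothetical examples, not the ones used in the actual analysis.

```r
# Hypothetical raw labels showing the kind of variation found in EVTYPE
raw <- c("TSTM WIND", " Thunderstorm Winds", "THUNDERSTORM WIND/HAIL",
         "flash flood", "FLASH FLOODING")

# Normalise case and whitespace before matching
clean <- toupper(trimws(raw))
clean <- gsub("\\s+", " ", clean)

# Map variants onto a single label each (illustrative patterns only)
clean[grepl("^(TSTM|THUNDERSTORM) WIND", clean)] <- "THUNDERSTORM WIND"
clean[grepl("^FLASH FLOOD", clean)] <- "FLASH FLOOD"

# Count how many observations fall under each cleaned label
table(clean)
```

Normalising case and whitespace first keeps the patterns themselves short, since they no longer need to account for lower-case or padded entries.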

Official Events

Random Sampling

To reduce the complexity of the task, we will select a random sample of the observations of the dataframe and work on that rather than the entire dataframe.
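
A minimal sketch of such a sampling step is given below. The helper sample_rows, the seed and the sample size are assumptions made for illustration; the size actually used is not stated in this report.

```r
# Sketch: draw a reproducible random sample of rows from a data frame.
# sample_rows is a hypothetical helper, not the author's code.
sample_rows <- function(df, n, seed = 1) {
  set.seed(seed)                          # fix the seed so the sample can be reproduced
  idx <- sample(nrow(df), size = min(n, nrow(df)))  # sample row indices without replacement
  df[idx, , drop = FALSE]
}

# Toy data frame standing in for the full storm data
toy <- data.frame(id = 1:100,
                  EVTYPE = rep(c("HAIL", "FLOOD"), 50),
                  stringsAsFactors = FALSE)

storm_sample <- sample_rows(toy, n = 10)
nrow(storm_sample)
```

Fixing the seed matters here because the cleaning patterns are developed against the sample, and rerunning the report should reproduce the same rows.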

Matching Observations by Patterns

Tidying my data

The result of sapply() is messy, so it needs to be tidied (pun intended). We'll use the pivot_longer() function from the tidyr package.
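
The original sapply() call is not shown, so the sketch below assumes it applied grepl() once per pattern, which yields a wide logical matrix with one column per event. The event labels and patterns are hypothetical; pivot_longer() then reshapes the matrix into one row per observation and event.

```r
library(tidyr)

# Hypothetical raw labels and patterns, for illustration only
evtype <- c("TSTM WIND", "FLASH FLOOD", "HAIL", "THUNDERSTORM WIND")
patterns <- c("THUNDERSTORM WIND" = "TSTM|THUNDERSTORM",
              "FLOOD"             = "FLOOD",
              "HAIL"              = "HAIL")

# sapply() returns a logical matrix: one row per observation,
# one column per pattern (named after the pattern)
hits <- sapply(patterns, function(p) grepl(p, evtype))

# Attach the raw labels and keep the column names unchanged
wide <- data.frame(EVTYPE = evtype, hits,
                   check.names = FALSE, stringsAsFactors = FALSE)

# Reshape to long form: one row per (observation, event) pair
long <- pivot_longer(wide, cols = -EVTYPE,
                     names_to = "event", values_to = "matched")

# Keep only the pairs where the pattern actually matched
matched <- long[long$matched, c("EVTYPE", "event")]
matched
```

In the long form each observation carries its cleaned event name in a single column, which is the shape that grouping and plotting functions expect.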

Plotting The Stats

Finally, after that whole process, it is now possible to make plots.
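
The plotting code is not reproduced here, so the following is only a sketch of one possible plot, using base R graphics and toy numbers. The column names event and FATALITIES and all of the values are assumptions made for illustration.

```r
# Toy matched observations; these are not real NOAA figures
matched <- data.frame(
  event      = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(2, 1, 0, 0, 3),
  stringsAsFactors = FALSE
)

# Total fatalities per cleaned event category, largest first
totals <- sort(tapply(matched$FATALITIES, matched$event, sum),
               decreasing = TRUE)

# One bar per event category; las = 2 turns the axis labels perpendicular to the axis
barplot(totals, las = 2, ylab = "Fatalities",
        main = "Total fatalities by event category")
```

Sorting before plotting puts the most harmful categories first, which makes the ranking readable at a glance.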

Conclusion

There are about fifty separate event categories. The data we began with was rife with misspellings and misattributed entries. By using a relatively simple regular expression pattern, we have been able to pick out the most impactful events from the data frame.
Also, because the data frame is so large, we have resorted to taking a random sample of the observations to fast-track the process. The expectation is that this sample is representative of the whole dataset.

Findings