1) Earthquakes

Data Source

The U.S. Geological Survey (USGS) is a scientific agency of the U.S. Department of the Interior that, among other things, provides data and information about the occurrence of earthquakes. The data is provided in a variety of formats and at a number of update frequencies. For this analysis, data on all recorded earthquakes from the 30 days ending November 18, 2019 at 1:08 P.M. PST is analyzed. This data was obtained from https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv at the USGS web site. It was placed into a file named earthquake.csv, which is provided along with this report. The table below provides a brief overview of the variables contained in the dataset.

Variable         Description
time             Time of earthquake occurrence
latitude         Latitude of the earthquake
longitude        Longitude of the earthquake
depth            Depth of the event
mag              Magnitude of the event
magType          Algorithm or method used to calculate the magnitude of the earthquake
nst              Number of seismic stations used to determine the earthquake location
gap              Largest azimuthal gap between azimuthally adjacent stations (in degrees)
dmin             Smallest observed distance from the event epicenter to the closest seismic station
rms              Root mean square (RMS) of the travel-time residuals for the event
net              ID of the data contributor
id               Unique identifier of the earthquake
updated          Time the event record was last updated
place            Nearby named geographical region
horizontalError  Uncertainty of the earthquake location (in km)
depthError       Uncertainty of the earthquake depth (in km)
magNst           Total number of seismic stations used to calculate the earthquake's magnitude
status           Indicates whether the event has been reviewed by a person
locationSource   Network that authored the location of the event
magSource        Network that authored the preferred magnitude

A small sample of the dataset is provided below. Notice that during this time period of 30 days ending November 18, 2019, there were 11,886 observed earthquakes.
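As a rough sketch of how the file could be loaded and the number of events counted, the snippet below uses only base R. The helper name load_quakes and the two demo rows are hypothetical and for illustration only; they are not taken from earthquake.csv. The timestamp format is an assumption based on the ISO 8601 UTC strings used in the feed.

```r
# Hypothetical helper (not part of the original report): read the saved feed
# and convert the time column from text to a date-time value.
load_quakes <- function(path) {
  quakes <- read.csv(path, stringsAsFactors = FALSE)
  # Assumes timestamps look like "2019-11-18T21:08:00.000Z" (UTC).
  quakes$time <- as.POSIXct(quakes$time,
                            format = "%Y-%m-%dT%H:%M:%OSZ",
                            tz = "UTC")
  quakes
}

# Two made-up rows written to a temporary file, for illustration only.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "time,latitude,longitude,depth,mag",
  "2019-11-18T21:00:00.000Z,10.0,20.0,5.0,1.5",
  "2019-11-18T20:00:00.000Z,-30.0,40.0,10.0,4.2"
), demo_path)

demo <- load_quakes(demo_path)
nrow(demo)  # number of events in the demo file
```

Calling nrow(load_quakes("earthquake.csv")) in the same way would give the number of observed earthquakes in the file provided with this report.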

Visualizations

Earthquake Locations

The USGS collects data on earthquakes that occur around the world, not just in the US. Figure 4 shows the locations of the 11,866 observed earthquakes.

Globally Observed Earthquakes

A cursory view of the map shows earthquakes clustering in well-known seismic zones such as Alaska, the western United States, the eastern Mediterranean, and the Asian “Ring-of-Fire” around Japan, Malaysia, and the Philippines. A closer inspection reveals that while the United States has frequent earthquakes, they tend to be lower in strength (i.e., magnitude) than those in other parts of the world. For example, both the western coast of South America and the “Ring-of-Fire” show highly concentrated zones of very strong earthquakes.

Earthquake Magnitudes

Magnitude of Globally Observed Earthquakes

Figure 5 shows the approximate magnitudes of the observed earthquakes. The size of each circle shows the approximate magnitude of each quake.

Figure 6 shows the approximate magnitudes of observed earthquakes in California and nearby areas.

Observed Earthquakes Around California

Earthquake Magnitudes Detected by Sensors

Magnitudes Detected by Sensors

Analysis: For the bubble chart in figure 7, we used the variables nst, magNst, and mag. The x-axis represents the number of seismic sensors used to determine the location, and the y-axis represents the number of seismic sensors used to calculate the magnitude. The diameter of each circle is based on the magnitude of each individual observation.

As is evident from the bubble chart, more seismic sensors appear to be used to locate an earthquake than to calculate its magnitude. In addition, the distribution of the larger circles suggests that the number of seismic sensors used is not related to the measured magnitude of a quake.
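The two visual impressions above could also be checked numerically. The sketch below is one possible way to do so in base R; the function name sensor_summary and the toy values are hypothetical and are not taken from earthquake.csv. It assumes a data frame with the nst, magNst, and mag columns described earlier.

```r
# Hypothetical helper: summarise how station counts relate to each other
# and to magnitude. Rows with missing values are ignored where needed.
sensor_summary <- function(quakes) {
  both <- quakes[complete.cases(quakes[, c("nst", "magNst")]), ]
  list(
    # Share of events located with more stations than were used for magnitude.
    share_more_location = mean(both$nst > both$magNst),
    # Pearson correlation between location station count and magnitude.
    cor_nst_mag = cor(quakes$nst, quakes$mag, use = "complete.obs")
  )
}

# Made-up values for illustration only.
toy <- data.frame(
  nst    = c(10, 25, NA, 40, 55, 70),
  magNst = c(5, 12, 8, NA, 30, 41),
  mag    = c(1.2, 4.5, 0.9, 2.8, 1.1, 3.9)
)

result <- sensor_summary(toy)
```

A correlation close to zero on the real data would be consistent with the reading of the chart, although a correlation alone would not establish that station count has no effect.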

2) Disease / Illness

Data Source

The World Health Organization (WHO) was created shortly after World War II as an international agency whose mission is to improve overall world health. The WHO works within the United Nations system to help prevent and fight diseases around the world. It maintains information about this mission in an online database called the Global Health Observatory (GHO). This can be accessed at https://www.who.int/data/gho.

The R package WHO provides an interface to the GHO database. This API can obtain various datasets directly from the database. This analysis will focus on data related to cholera. Cholera is an infection caused by eating or drinking food or water that is contaminated with the bacterium Vibrio cholerae. While it is preventable and treatable, it can cause death. The WHO estimates there are as many as 4 million cases of the infection each year, with as many as 143,000 of these resulting in death. See https://www.who.int/health-topics/cholera#tab=tab_1 for further details.

The R API was used to collect data related to cholera. Observations for the following indicators were collected and analyzed.

Indicator Description Further Details
CHOLERA_0000000001 Number of reported cases of cholera https://www.who.int/data/gho/indicator-metadata-registry/imr-details/42
WSH_10 Number of diarrhoea deaths from inadequate water, sanitation and hygiene https://www.who.int/data/gho/indicator-metadata-registry/imr-details/2260

Visualizations

Cases of Cholera by Deaths from Improper Water

Cases of Cholera by Deaths from Improper Water

The smoothed scatterplot in figure 8 shows that as the number of cholera cases increases, the number of deaths from inadequate water, sanitation and hygiene remains relatively constant. This may seem counterintuitive to readers who expect a strong, positive linear correlation between cholera cases and related deaths. However, the number of deaths in a country appears to hold roughly steady regardless of the number of cholera cases.

Reported Cholera Cases by Country

Reported Cholera Cases by Country

The lollipop chart in figure 9 shows that the country with the most reported cholera cases is Haiti. Other countries with significant numbers of cases include the Democratic Republic of the Congo, Yemen, Somalia, the United Republic of Tanzania, Kenya and South Sudan. For the most part, cholera does not appear to be very prevalent in other countries.

Regional Share of Cholera Cases

Regional Share of Cholera Cases

The pie chart in figure 11 shows the relative share of reported cholera cases for each region. Africa has the largest share of cases, followed closely by the Americas and then the Eastern Mediterranean. The region with the fewest observed cholera cases is Europe. The sliver for Europe is so small that it is not visible on this pie chart.
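The shares behind a chart like this can be computed by summing cases within each region and dividing by the overall total. The base R sketch below illustrates the arithmetic; the function name regional_share and the toy numbers and region labels are hypothetical and are not the WHO figures.

```r
# Hypothetical helper: fraction of all reported cases that falls in each region.
regional_share <- function(cases, region) {
  # Sum the cases per region, ignoring missing values.
  totals <- vapply(split(cases, region), sum, numeric(1), na.rm = TRUE)
  # Convert totals to shares and list the largest region first.
  sort(totals / sum(totals), decreasing = TRUE)
}

# Made-up values for illustration only.
toy_cases  <- c(100, 300, 50, 50)
toy_region <- c("A", "A", "B", "C")

shares <- regional_share(toy_cases, toy_region)
```

With these toy numbers, region A holds 400 of the 500 cases (a share of 0.8) and regions B and C hold 0.1 each. A region with a very small share would produce the kind of invisible sliver described above.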

Data Wrangling

Although one would think the data maintained by the WHO in the GHO database is clean and possibly even tidy, it is not. This section describes how this data was retrieved, transformed and prepared for the preceding analysis and visualizations.

First, the CHOLERA_0000000001 dataset containing the observed number of reported cases of Cholera is read from the GHO database.

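library(WHO)    # provides get_data()
library(dplyr)

# Download the reported cholera cases indicator from the GHO database
tb_cholera <- get_data("CHOLERA_0000000001")

# Keep only the needed columns, ordered by country and year
tb_cholera <- tb_cholera %>%
  group_by(country) %>%
  arrange(country, year) %>%
  select(country, year, value, region) %>%
  rename("cases" = value)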

Next, the WSH_10 dataset containing the observed deaths due to inadequate water, sanitation and hygiene is downloaded and joined to the reported cholera cases data. However, it is important to note that the WSH_10 dataset only contains observations for 2016.

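# Download deaths attributed to inadequate water, sanitation and hygiene
d_table <- get_data("WSH_10")

# Attach the deaths to the cholera cases and keep the only year WSH_10 covers
viz_data <- tb_cholera %>%
  left_join(d_table,
            by = c("country", "year", "region")) %>%
  filter(year == 2016) %>%
  rename("deaths" = value)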
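
To illustrate why the filter on 2016 matters, the sketch below reproduces the same left join on a toy example using base R's merge() in place of dplyr's left_join(). The country codes and counts are made up for illustration only. Rows from years with no matching deaths record are kept by the join but carry a missing value, and the filter then removes them.

```r
# Made-up values for illustration only.
toy_cases <- data.frame(
  country = c("X", "X", "Y"),
  year    = c(2015, 2016, 2016),
  cases   = c(10, 20, 30)
)
toy_deaths <- data.frame(
  country = c("X", "Y"),
  year    = c(2016, 2016),
  deaths  = c(1, 2)
)

# all.x = TRUE keeps every row of toy_cases, matching the behaviour of a left join.
joined <- merge(toy_cases, toy_deaths, by = c("country", "year"), all.x = TRUE)

# Keep only the year for which deaths were recorded.
joined_2016 <- joined[joined$year == 2016, ]
```

Here the 2015 row for country X survives the join with deaths set to NA, and only the two 2016 rows remain after filtering.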

Although the data appears tidy and is close to being usable, the deaths variable contains extra text besides the death counts that needs to be ignored. The death counts are parsed out of this variable to bring the data closer to being ready to analyze.

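library(stringr)  # str_remove()
library(readr)    # parse_integer()
# Drop the trailing " [low-high]" range, then convert the remaining count to an integer
viz_data$deaths <- viz_data$deaths %>%
  str_remove(pattern = "[:space:]\\[[:digit:]+-[:digit:]+\\]") %>%
  parse_integer()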
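
As a small worked example of this parsing step, the sketch below applies an equivalent pattern with base R's sub() and as.integer() instead of stringr and readr. The helper name strip_interval and the sample strings are assumptions inferred from the pattern above (a count followed by a bracketed range); they are not copied from the WHO data.

```r
# Hypothetical helper: remove a trailing " [low-high]" range and return the count.
strip_interval <- function(x) {
  # One whitespace character, "[", digits, "-", digits, "]" is removed if present.
  point <- sub("[[:space:]]\\[[[:digit:]]+-[[:digit:]]+\\]", "", x)
  as.integer(point)
}

# Made-up inputs for illustration only.
parsed <- strip_interval(c("120 [80-200]", "7"))
```

Values that already consist of a bare count pass through unchanged, so the step is safe to apply to the whole column under this assumed format.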

There is one variable, gho, that exists in the original dataset but is not needed. This variable is removed, leaving the viz_data dataset tidy and ready for the above analysis.

viz_data <- viz_data %>% select(-gho)