This case study is the capstone project required to earn the Data Analytics Professional Certificate offered by Google upon successful completion of the course.
Earthquakes and tsunamis have had, and will continue to have, a major impact on human life on planet Earth throughout history. That is why I chose to analyze the Global Significant Earthquake Database (GSED), made available by the National Centers for Environmental Information (NCEI)/World Data System (WDS) through the National Oceanic and Atmospheric Administration (NOAA).
The GSED is a global listing of 6,273 earthquakes from 2150 BC to August 26, 2021.
Description of Google’s data analysis process as applied in this analysis:
Ask: It defines the problem to solve and the objectives.
Prepare: It involves data generation, collection, storage, and protection.
Process: It involves data cleaning, transformation and integrity.
Analyze: It involves data exploration, visualization, and analysis, choosing the right tools.
Share: It brings data to life using visuals to help others understand results.
Act: It involves application of insights to solve the problem.
The purpose of this project is to provide and quantify some measurable insights into the nature, characteristics, and impact of global earthquakes from 1900 to August 2021.
The inquiries and objectives of this analysis are:
The GSED is a dataset of 6,273 entries and 47 total columns. In order to be classified as a significant earthquake, an event must meet at least one of the following criteria: moderate damage (approximately $1 million or more), 10 or more deaths, magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or generation of a tsunami.
This database provides information on the date and time of occurrence, latitude and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, along with dollar damage estimates. References, political geography, and additional comments are also provided for each earthquake. If the earthquake was associated with a tsunami or volcanic eruption, it is flagged and linked to the related tsunami event or significant volcanic eruption.
This dataset should be cited as the following:
National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Earthquake Database. NOAA National Centers for Environmental Information. doi:10.7289/V5TD9V7K
This public dataset is hosted in Google BigQuery and is included in BigQuery’s 1 TB/month free tier of processing. This means that each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
So, this dataset can be found in the Google Cloud Platform (GCP) by following the path:
It can be accessed from any Google account through the Global Significant Earthquake Database (GSED) link.
The downloaded earthquake database is a csv file, which was imported using the read_csv() function.
(All the functions referenced from now on are from the R programming language)
library(tidyverse)  # provides read_csv(), arrange(), and the %>% pipe

earthquakes <- read_csv(
  "C:/Users/xiriu/Documents/Data_Analytics/Capstone/eartquakes-BQ-raw.csv",
  col_types = cols(.default = "?",
                   total_missing = "d",
                   total_missing_description = "d")) %>%
  arrange(-year)
The colnames() function was used to generate the following list of all 47 variables in this database:
colnames(earthquakes)
## [1] "id" "flag_tsunami"
## [3] "year" "month"
## [5] "day" "hour"
## [7] "minute" "second"
## [9] "focal_depth" "eq_primary"
## [11] "eq_mag_mw" "eq_mag_ms"
## [13] "eq_mag_mb" "eq_mag_ml"
## [15] "eq_mag_mfa" "eq_mag_unk"
## [17] "intensity" "country"
## [19] "state" "location_name"
## [21] "latitude" "longitude"
## [23] "region_code" "deaths"
## [25] "deaths_description" "missing"
## [27] "missing_description" "injuries"
## [29] "injuries_description" "damage_millions_dollars"
## [31] "damage_description" "houses_destroyed"
## [33] "houses_destroyed_description" "houses_damaged"
## [35] "houses_damaged_description" "total_deaths"
## [37] "total_deaths_description" "total_missing"
## [39] "total_missing_description" "total_injuries"
## [41] "total_injuries_description" "total_damage_millions_dollars"
## [43] "total_damage_description" "total_houses_destroyed"
## [45] "total_houses_destroyed_description" "total_houses_damaged"
## [47] "total_houses_damaged_description"
The first known earthquake detector was invented in 132 A.D. by the Chinese astronomer and mathematician Chang Heng, who called it an “earthquake weathercock.” In 136 A.D., a Chinese scientist named Choke updated the instrument and called it a “seismoscope.”
Seismologists study earthquakes by looking at the damage that was caused and by using seismometers. A seismometer is an instrument that records the shaking of the Earth’s surface caused by seismic waves. The term seismograph usually refers to the combined seismometer and recording device.
Since the energy released by an earthquake travels in waves, and earthquakes are recorded by a seismographic network, there are many different ways to measure different aspects of an earthquake. Magnitude is the most common measure of an earthquake’s “size” or strength, and the moment magnitude is considered the most accurate scientific scale (the Richter scale is an outdated method that is no longer used for large earthquakes). Magnitude also does not depend on where the measurement is made.
The GSED provides the following logarithmic magnitudes, whose valid values range from 0 to 9.9:
* Mw - It is based on the moment magnitude scale
* Ms - It is the surface-wave magnitude of the earthquake
* Mb - It is the compressional body wave (P-wave) magnitude
* Ml - It was the original magnitude relationship defined by Richter and Gutenberg for local earthquakes in 1935
* Mfa - These magnitudes are computed from the felt area, for earthquakes that occurred before seismic instruments were in general use
Because of the logarithmic basis of these scales, each whole number increase in magnitude represents a tenfold increase in measured amplitude; as an estimate of energy, each whole number step in the magnitude scale corresponds to the release of about 31 times more energy than the amount associated with the preceding whole number value.
For example, the following is a comparison of the strength of earthquakes of different magnitudes based on a magnitude 4 earthquake:
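As a minimal sketch (not part of the original GSED analysis), such a comparison can be computed directly from the logarithmic relationships described above: each whole-number step multiplies the measured amplitude by 10 and the released energy by roughly 10^1.5 ≈ 31.6.

library(dplyr)
library(tibble)

# Sketch: amplitude and energy of magnitudes 4-9 relative to a magnitude 4 event
tibble(magnitude = 4:9) %>%
  mutate(amplitude_ratio = 10^(magnitude - 4),                 # tenfold per whole number
         energy_ratio    = round(10^(1.5 * (magnitude - 4))))  # ~31.6x per whole number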
On the other hand, intensity scales, like the Modified Mercalli Scale (whose valid values go from 1 to 12), measure the amount of shaking at a particular location. An earthquake causes many different intensities of shaking in the area around the epicenter, so the intensity of an earthquake will vary depending on where you are. The Mercalli Scale is based on observable earthquake damage. The GSED provides the Modified Mercalli (MMI) Intensity when available.
So, from a scientific standpoint, the magnitude scale is based on seismic records, while the Mercalli scale is based on observable data, which can be subjective (we will see this illustrated in fig. 11, phase 6).
For the purpose of this analysis, I will be focusing on their equivalent magnitude. This magnitude is chosen from the available magnitude scales in this order: Mw Magnitude, Ms Magnitude, Mb Magnitude, Ml Magnitude, and Mfa Magnitude.
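The GSED already stores this equivalent magnitude in the eq_primary column. The sketch below is only an illustration (the derived column name eq_equivalent is hypothetical) of how that preference order could be reproduced with dplyr::coalesce(), using the magnitude columns listed earlier:

library(dplyr)

# Pick the first available magnitude in the order Mw > Ms > Mb > Ml > Mfa
earthquakes <- earthquakes %>%
  mutate(eq_equivalent = coalesce(eq_mag_mw, eq_mag_ms, eq_mag_mb,
                                  eq_mag_ml, eq_mag_mfa))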
Generally, earthquakes of magnitude 6 and above are the ones for concern. When nearby, they can cause shaking intensities that can begin to break chimneys and cause considerable damage to the most seismically vulnerable structures, such as non-retrofitted brick buildings.
Fig. 1: Earthquake magnitude scales
Modified Mercalli Intensity (MMI) correlation with magnitude
| Magnitude | Typical Maximum MMI Intensity |
|---|---|
| 1.0 - 3.0 | I |
| 3.0 - 3.9 | II - III |
| 4.0 - 4.9 | IV - V |
| 5.0 - 5.9 | VI - VII |
| 6.0 - 6.9 | VII - IX |
| 7.0 and higher | VIII or higher |
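As a small illustrative sketch (not part of the original analysis), the table above can be encoded in R so that an equivalent magnitude is mapped to its typical maximum MMI intensity; typical_mmi() is a hypothetical helper name:

# Map magnitude to typical maximum MMI intensity, following the table above
typical_mmi <- function(magnitude) {
  cut(magnitude,
      breaks = c(1, 3, 4, 5, 6, 7, Inf),
      labels = c("I", "II - III", "IV - V", "VI - VII", "VII - IX", "VIII or higher"),
      right  = FALSE)
}

typical_mmi(c(2.5, 6.2, 7.8))  # returns I, VII - IX, VIII or higher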
This data set was verified in SQL for duplicates and blank observations before downloading it, using the following queries:
SELECT COUNT(id) FROM `bigquery-public-data.noaa_significant_earthquakes.earthquakes`
And,
SELECT COUNT(DISTINCT id) FROM `bigquery-public-data.noaa_significant_earthquakes.earthquakes`
Both queries returned the same number of observations, 6,273, which means the GSED dataset includes records for 6,273 unique earthquakes.
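The same check can be reproduced in R after importing, as a quick sketch:

library(dplyr)

nrow(earthquakes)           # total observations
n_distinct(earthquakes$id)  # distinct ids; equal counts mean no duplicates
sum(is.na(earthquakes$id))  # number of blank (NA) ids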
However, when importing the GSED’s csv file, there was a warning about some values in columns 38 (“total_missing”) and 39 (“total_missing_description”): a boolean TRUE or FALSE value was expected, but some numeric values were found instead.
The functions str() and spec() showed the reason: when importing the csv file, these columns were defined as col_logical(). Even though this issue did not affect the final result of the present analysis in any way, it was resolved by defining these two columns as col_double() types at import time (as shown in the previous “importing_earthquake_database” code chunk).
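A quick sketch of the diagnostic steps mentioned above, assuming the earthquakes object returned by read_csv():

library(readr)

spec(earthquakes)               # the full column specification used by read_csv()
str(earthquakes$total_missing)  # the type actually assigned to the problematic column
problems(earthquakes)           # parsing issues raised during the import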
On the other hand, considering that:
I decided on a study period of 1900 to 2021 because of the amount and accuracy of the data available. The imported file was then filtered to keep only earthquakes from 1900 to 2021, and this is the base file used for this analysis.
So, the number of earthquakes in the GSED database before 1900 CE is 2,490 (39.7%), and from 1900 CE onward it is 3,783 (60.3%). These latter records are the ones used for this analysis project.
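A minimal sketch of that filtering step (the object name earthquakes_1900 is just illustrative):

library(dplyr)

# Keep only earthquakes from 1900 onward as the base file for the analysis
earthquakes_1900 <- earthquakes %>%
  filter(year >= 1900)

nrow(earthquakes) - nrow(earthquakes_1900)  # records before 1900 CE
nrow(earthquakes_1900)                      # records from 1900 CE onward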
Also, the skim_without_charts() function was very useful at this stage to quantify the “NA” values for each variable. In the new data set from 1900 onward, the following numbers of missing values were found for the most important variables (shown in parentheses) used in this analysis:
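That summary can be generated as in the sketch below, assuming the skimr package and the earthquakes_1900 object from the earlier filtering sketch:

library(skimr)

# Summary of every variable, including the count of missing (NA) values
skim_without_charts(earthquakes_1900)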
SQL was crucial to download the csv version of the GSED from Google’s BigQuery.
However, all stages of this analysis were done using the R programming language, hosted in the RStudio desktop version “RStudio 2021.09.2 Build 382 for Windows”. Excel and SQL were used sparingly (only in the Process stage, for double verification), since R can be used to achieve the goals of all data analysis phases.
Furthermore, as previously listed in phase three, the skim_without_charts() function showed missing values for all variables involved in this analysis, with the exception of the year. This meant that observations containing “NA” values in the variables mapped to the x and y axes had to be dropped before any analysis could start. The drop_na() function was very useful for this purpose.
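For example, a minimal sketch of that step (the column choices eq_primary and deaths are illustrative, not taken from the original plots):

library(dplyr)
library(tidyr)

# Drop observations missing the variables that will be mapped to the x and y axes
plot_data <- earthquakes_1900 %>%
  drop_na(eq_primary, deaths)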
Likewise, when executing arithmetic operations on variables, such as calculating sums, averages, minimums, and maximums, the “NA” values must be ignored for these operations to work. This was possible using the s() function included in the “hablar” package.
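A short sketch of that idea, assuming the hablar package and the earthquakes_1900 object from the earlier sketch:

library(dplyr)
library(hablar)

sum(c(10, NA, 5))     # returns NA because of the missing value
sum(s(c(10, NA, 5)))  # s() drops the NA, so this returns 15

earthquakes_1900 %>%
  summarise(total_deaths  = sum(s(deaths)),
            max_magnitude = max(s(eq_primary)))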
Filtering, summarizing, arranging, grouping, and slicing functions were also widely used to reshape the data into a form fit for graphics.
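A sketch of a typical reshaping pipeline of that kind (the grouping and column choices are illustrative, not taken from the original analysis):

library(dplyr)

# Count earthquakes and average magnitude per year, then keep the 10 busiest years
yearly_summary <- earthquakes_1900 %>%
  group_by(year) %>%
  summarise(n_earthquakes = n(),
            avg_magnitude = mean(eq_primary, na.rm = TRUE)) %>%
  arrange(desc(n_earthquakes)) %>%
  slice_head(n = 10)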
The interactive world map offers the option to filter earthquakes by magnitude, and is also a very useful source of information.
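The original post does not name the package behind the interactive map; as an assumption, a similar map can be sketched with the leaflet package, pre-filtering to magnitude 6 and above to mimic the magnitude filter described above:

library(dplyr)
library(leaflet)

earthquakes_1900 %>%
  filter(!is.na(latitude), !is.na(longitude), eq_primary >= 6) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   radius = ~eq_primary,
                   popup = ~paste0("Magnitude: ", eq_primary))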
Now that we have the GSED dataset clean and organized, it’s time to share and comment on the results of this analysis to comply with the goals listed in phase one above.
Since the stakeholders are the general public, each person may take and use any part of this information as they see fit. The intent is to provide useful earthquake information in a way that is accessible and easy to follow for anyone interested in looking at some statistical information about earthquakes recorded from 1900 to August 2021.
Of course, more work can be done on the GSED database; this is just a small sample within the scope of this capstone project. Someone else may draw more conclusions from these visualizations as well. The ones I provided are just a few examples.
I am thankful to Google for providing this Professional Data Analytics course and certificate, to NOAA for providing the GSED database, and to the simply amazing R community out there. Thank you for reading this post.