This case study is the capstone project required to earn the Data Analytics Professional Certificate offered by Google upon successful completion of the course.
Earthquakes and tsunamis have had, and will continue to have, a major impact on human life on planet Earth throughout history. That is why I chose to analyze the Global Significant Earthquake Database (GSED), made available by the National Centers for Environmental Information (NCEI)/World Data System (WDS) through the National Oceanic and Atmospheric Administration (NOAA).
The GSED is a global listing of 6,273 earthquakes from 2150 BC to August 26, 2021.
Description of Google’s data analysis process as applied in this analysis:
Ask: It defines the problem to solve and the objectives.
Prepare: It involves data generation, collection, storage, and protection.
Process: It involves data cleaning, transformation and integrity.
Analyze: It involves data exploration, visualization, and analysis, choosing the right tools.
Share: It brings data to life using visuals to help others understand results.
Act: It involves application of insights to solve the problem.
The purpose of this project is to provide and quantify some measurable insights into the nature, characteristics, and impact of global earthquakes from 1900 to August 2021.
The inquiries and objectives of this analysis are:
The GSED is a dataset of 6,273 entries and 47 total columns. In order to be classified as a significant earthquake, an event must meet at least one of the following criteria: moderate damage (approximately $1 million or more), 10 or more deaths, magnitude 7.5 or greater, Modified Mercalli Intensity X or greater, or generation of a tsunami.
This database provides information on the date and time of occurrence, latitude and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, along with dollar damage estimates. References, political geography, and additional comments are also provided for each earthquake. If the earthquake was associated with a tsunami or volcanic eruption, it is flagged and linked to the related tsunami event or significant volcanic eruption.
This dataset should be cited as the following:
National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Earthquake Database. NOAA National Centers for Environmental Information. doi:10.7289/V5TD9V7K
This public dataset is hosted in Google BigQuery and is included in BigQuery’s 1 TB/month free tier of processing. This means that each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
So, this dataset can be found in the Google Cloud Platform (GCP) by following the path:
It can be accessed from any Google account through the Global Significant Earthquake Database (GSED) link.
The downloaded earthquake database is a csv file, which was imported using the read_csv() function.
(All the functions referenced from now on are from the R programming language)
library(tidyverse)  # provides read_csv(), arrange(), and the %>% pipe

earthquakes <- read_csv(
  "C:/Users/xiriu/Documents/Data_Analytics/Capstone/eartquakes-BQ-raw.csv",
  col_types = cols(.default = "?",
                   total_missing = "d",
                   total_missing_description = "d")) %>%
  arrange(-year)
The colnames() function was used to generate the following list of all 47 variables in this database:
colnames(earthquakes)
## [1] "id" "flag_tsunami"
## [3] "year" "month"
## [5] "day" "hour"
## [7] "minute" "second"
## [9] "focal_depth" "eq_primary"
## [11] "eq_mag_mw" "eq_mag_ms"
## [13] "eq_mag_mb" "eq_mag_ml"
## [15] "eq_mag_mfa" "eq_mag_unk"
## [17] "intensity" "country"
## [19] "state" "location_name"
## [21] "latitude" "longitude"
## [23] "region_code" "deaths"
## [25] "deaths_description" "missing"
## [27] "missing_description" "injuries"
## [29] "injuries_description" "damage_millions_dollars"
## [31] "damage_description" "houses_destroyed"
## [33] "houses_destroyed_description" "houses_damaged"
## [35] "houses_damaged_description" "total_deaths"
## [37] "total_deaths_description" "total_missing"
## [39] "total_missing_description" "total_injuries"
## [41] "total_injuries_description" "total_damage_millions_dollars"
## [43] "total_damage_description" "total_houses_destroyed"
## [45] "total_houses_destroyed_description" "total_houses_damaged"
## [47] "total_houses_damaged_description"
The first known earthquake detector was invented in 132 A.D. by the Chinese astronomer and mathematician Chang Heng, who called it an “earthquake weathercock.” In 136 A.D., a Chinese scientist named Choke updated the instrument and called it a “seismoscope.”
Seismologists study earthquakes by looking at the damage that was caused and by using seismometers. A seismometer is an instrument that records the shaking of the Earth’s surface caused by seismic waves. The term seismograph usually refers to the combined seismometer and recording device.
Since the energy released by an earthquake travels in waves, and earthquakes are recorded by a seismographic network, there are many different ways to measure different aspects of an earthquake. Magnitude is the most common measure of an earthquake’s “size” or strength, and the moment magnitude is considered the most accurate scientific scale (the Richter scale is an outdated method that is no longer used for large earthquakes). Magnitude also does not depend on where the measurement is made.
The GSED provides the following logarithmic magnitudes, whose valid values range from 0 to 9.9:
* Mw - It is based on the moment magnitude scale
* Ms - It is the surface-wave magnitude of the earthquake
* Mb - It is the compressional body wave (P-wave) magnitude
* Ml - It was the original magnitude relationship defined by Richter and Gutenberg for local earthquakes in 1935
* Mfa - These magnitudes are computed from the felt area, for earthquakes that occurred before seismic instruments were in general use
Because of the logarithmic basis of these scales, each whole number increase in magnitude represents a tenfold increase in measured amplitude; as an estimate of energy, each whole number step in the magnitude scale corresponds to the release of about 31 times more energy than the amount associated with the preceding whole number value.
For example, the following is a comparison of the strength of earthquakes of different magnitudes based on a magnitude 4 earthquake:
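As a minimal sketch (not part of the original GSED analysis), such a comparison can be computed directly from the logarithmic relationships described above: each whole-number step multiplies the measured amplitude by 10 and the released energy by roughly 10^1.5 ≈ 31.6.

library(dplyr)
library(tibble)

# Sketch: amplitude and energy of magnitudes 4-9 relative to a magnitude 4 event
tibble(magnitude = 4:9) %>%
  mutate(amplitude_ratio = 10^(magnitude - 4),                 # tenfold per whole number
         energy_ratio    = round(10^(1.5 * (magnitude - 4))))  # ~31.6x per whole number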
On the other hand, intensity scales, like the Modified Mercalli Scale (whose valid values go from 1 to 12), measure the amount of shaking at a particular location. An earthquake causes many different intensities of shaking in the area around the epicenter, so the intensity of an earthquake will vary depending on where you are. The Mercalli Scale is based on observable earthquake damage. The GSED provides the Modified Mercalli (MMI) Intensity when available.
So, from a scientific standpoint, the magnitude scale is based on seismic records, while the Mercalli scale is based on observable data, which can be subjective (we will see this illustrated in fig. 11, phase 6).
For the purpose of this analysis, I will be focusing on their equivalent magnitude. This magnitude is chosen from the available magnitude scales in this order: Mw Magnitude, Ms Magnitude, Mb Magnitude, Ml Magnitude, and Mfa Magnitude.
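The GSED already stores this equivalent magnitude in the eq_primary column. The sketch below is only an illustration (the derived column name eq_equivalent is hypothetical) of how that preference order could be reproduced with dplyr::coalesce(), using the magnitude columns listed earlier:

library(dplyr)

# Pick the first available magnitude in the order Mw > Ms > Mb > Ml > Mfa
earthquakes <- earthquakes %>%
  mutate(eq_equivalent = coalesce(eq_mag_mw, eq_mag_ms, eq_mag_mb,
                                  eq_mag_ml, eq_mag_mfa))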
Generally, earthquakes of magnitude 6 and above are the ones for concern. When nearby, they can cause shaking intensities that can begin to break chimneys and cause considerable damage to the most seismically vulnerable structures, such as non-retrofitted brick buildings.
Fig. 1: Earthquake magnitude scales
Modified Mercalli Intensity (MMI) correlation with magnitude
| Magnitude | Typical Maximum MMI Intensity |
|---|---|
| 1.0 - 3.0 | I |
| 3.0 - 3.9 | II - III |
| 4.0 - 4.9 | IV - V |
| 5.0 - 5.9 | VI - VII |
| 6.0 - 6.9 | VII - IX |
| 7.0 and higher | VIII or higher |
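As a small illustrative sketch (not part of the original analysis), the table above can be encoded in R so that an equivalent magnitude is mapped to its typical maximum MMI intensity; typical_mmi() is a hypothetical helper name:

# Map magnitude to typical maximum MMI intensity, following the table above
typical_mmi <- function(magnitude) {
  cut(magnitude,
      breaks = c(1, 3, 4, 5, 6, 7, Inf),
      labels = c("I", "II - III", "IV - V", "VI - VII", "VII - IX", "VIII or higher"),
      right  = FALSE)
}

typical_mmi(c(2.5, 6.2, 7.8))  # returns I, VII - IX, VIII or higher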
This data set was verified in SQL for duplicates and blank observations before downloading it, using the following queries:
SELECT COUNT(id) FROM `bigquery-public-data.noaa_significant_earthquakes.earthquakes`
And,
SELECT COUNT(DISTINCT id) FROM `bigquery-public-data.noaa_significant_earthquakes.earthquakes`
Both queries returned the same number of observations, 6,273, which means the GSED dataset includes records for 6,273 unique earthquakes.
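The same check can be reproduced in R after importing, as a quick sketch:

library(dplyr)

nrow(earthquakes)           # total observations
n_distinct(earthquakes$id)  # distinct ids; equal counts mean no duplicates
sum(is.na(earthquakes$id))  # number of blank (NA) ids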
However, when importing the GSED’s csv file, there was a warning about some values in columns 38 (“total_missing”) and 39 (“total_missing_description”): a boolean TRUE or FALSE value was expected, but some numeric values were found instead.
The functions str() and spec() showed the reason: when importing the csv file, these columns were defined as col_logical(). Even though this issue did not affect the final result of the present analysis in any way, it was resolved by defining these two columns as col_double() types at import time (as shown in the previous “importing_earthquake_database” code chunk).
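A quick sketch of the diagnostic steps mentioned above, assuming the earthquakes object returned by read_csv():

library(readr)

spec(earthquakes)               # the full column specification used by read_csv()
str(earthquakes$total_missing)  # the type actually assigned to the problematic column
problems(earthquakes)           # parsing issues raised during the import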
On the other hand, considering that:
I decided on a study period of 1900 to 2021 because of the amount and accuracy of the data available. The imported file was then filtered to keep only earthquakes from 1900 to 2021, and this is the base file used for this analysis.
So, the number of earthquakes in the GSED database before 1900 CE is 2,490 (39.7%), and from 1900 CE onward it is 3,783 (60.3%). These latter records are the ones used for this analysis project.
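A minimal sketch of that filtering step (the object name earthquakes_1900 is just illustrative):

library(dplyr)

# Keep only earthquakes from 1900 onward as the base file for the analysis
earthquakes_1900 <- earthquakes %>%
  filter(year >= 1900)

nrow(earthquakes) - nrow(earthquakes_1900)  # records before 1900 CE
nrow(earthquakes_1900)                      # records from 1900 CE onward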
Also, the skim_without_charts() function was very useful at this stage to quantify the “NA” values for each variable. In the new data set from 1900 onward, the following numbers of missing values were found for the most important variables (shown in parentheses) used in this analysis:
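That summary can be generated as in the sketch below, assuming the skimr package and the earthquakes_1900 object from the earlier filtering sketch:

library(skimr)

# Summary of every variable, including the count of missing (NA) values
skim_without_charts(earthquakes_1900)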
SQL was crucial to download the csv version of the GSED from Google’s BigQuery.
However, all stages of this analysis were done using the R programming language, hosted in the RStudio desktop version “RStudio 2021.09.2 Build 382 for Windows”. Excel and SQL were used sparingly (only in the Process stage, for double verification), since R can be used to achieve the goals of all data analysis phases.
Furthermore, as previously listed in phase three, the skim_without_charts() function showed missing values for all variables involved in this analysis, with the exception of the year. This meant that observations containing “NA” values in the variables mapped to the x and y axes had to be dropped before any analysis could start. The drop_na() function was very useful for this purpose.
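For example, a minimal sketch of that step (the column choices eq_primary and deaths are illustrative, not taken from the original plots):

library(dplyr)
library(tidyr)

# Drop observations missing the variables that will be mapped to the x and y axes
plot_data <- earthquakes_1900 %>%
  drop_na(eq_primary, deaths)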
Likewise, when executing arithmetic operations on variables, such as calculating sums, averages, minimums, and maximums, the “NA” values must be ignored for these operations to work. This was possible using the s() function included in the “hablar” package.
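A short sketch of that idea, assuming the hablar package and the earthquakes_1900 object from the earlier sketch:

library(dplyr)
library(hablar)

sum(c(10, NA, 5))     # returns NA because of the missing value
sum(s(c(10, NA, 5)))  # s() drops the NA, so this returns 15

earthquakes_1900 %>%
  summarise(total_deaths  = sum(s(deaths)),
            max_magnitude = max(s(eq_primary)))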
Filtering, summarizing, arranging, grouping, and slicing functions were also widely used to reshape the data into a form fit for graphics.
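A sketch of a typical reshaping pipeline of that kind (the grouping and column choices are illustrative, not taken from the original analysis):

library(dplyr)

# Count earthquakes and average magnitude per year, then keep the 10 busiest years
yearly_summary <- earthquakes_1900 %>%
  group_by(year) %>%
  summarise(n_earthquakes = n(),
            avg_magnitude = mean(eq_primary, na.rm = TRUE)) %>%
  arrange(desc(n_earthquakes)) %>%
  slice_head(n = 10)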
The interactive world map offers the option to filter earthquakes by magnitude, and is also a very useful source of information.
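The original post does not name the package behind the interactive map; as an assumption, a similar map can be sketched with the leaflet package, pre-filtering to magnitude 6 and above to mimic the magnitude filter described above:

library(dplyr)
library(leaflet)

earthquakes_1900 %>%
  filter(!is.na(latitude), !is.na(longitude), eq_primary >= 6) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   radius = ~eq_primary,
                   popup = ~paste0("Magnitude: ", eq_primary))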
Now that we have the GSED dataset clean and organized, it’s time to share and comment on the results of this analysis to comply with the goals listed in phase one above.
Since the stakeholders are the general public, each person may take and use any part of this information as they see fit. The intent is to provide useful earthquake information in a way that is accessible and easy to follow for anyone interested in looking at some statistical information about earthquakes recorded from 1900 to August 2021.
Of course, more work can be done on the GSED database; this is just a small sample within the scope of this capstone project. Someone else may draw more conclusions from these visualizations as well. The ones I provided are just a few examples.
I am thankful to Google for providing this Professional Data Analytics course and certificate, to NOAA for providing the GSED database, and to the simply amazing R community out there. Thank you for reading this post.