Welcome to Covid19 Data Analysis

By the end of this project, you will have learned how to preprocess and merge datasets, calculate the measures you need, and prepare them for analysis. In this course, we work with the COVID-19 dataset published by Johns Hopkins University, which records the cumulative number of confirmed cases per day in each country. We also use a second dataset of life factors, scored by the people living in each country around the globe. We merge the two datasets to see whether there is any relationship between the spread of the virus in a country and how happy the people living there are.

Learning Objectives:

Understanding the purpose of the project, the datasets that will be used, and the questions that will be answered with the analysis.
Importing the COVID-19 dataset and preparing it for the analysis by dropping columns and aggregating rows.
Finding and calculating a good measure for the analysis.
Merging the two datasets and finding correlations between them.
Visualizing the analysis results using Seaborn.

1.0. Let’s import the modules

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
print('Modules are imported')
## Modules are imported

1.2 Importing covid19 dataset

Importing "covid19_Confirmed_dataset.csv" from the "Datasets" folder

corona_dataset_csv = pd.read_csv("Datasets/covid19_Confirmed_dataset.csv")
corona_dataset_csv.head()
##   Province/State Country/Region      Lat  ...  4/28/20  4/29/20  4/30/20
## 0            NaN    Afghanistan  33.0000  ...     1828     1939     2171
## 1            NaN        Albania  41.1533  ...      750      766      773
## 2            NaN        Algeria  28.0339  ...     3649     3848     4006
## 3            NaN        Andorra  42.5063  ...      743      743      745
## 4            NaN         Angola -11.2027  ...       27       27       27
## 
## [5 rows x 104 columns]

Let's check the shape of the dataset.

corona_dataset_csv.shape
## (266, 104)

1.3 Delete the useless columns

corona_dataset_csv.drop(["Lat", "Long"],axis=1, inplace = True)
corona_dataset_csv.head()
##   Province/State Country/Region  1/22/20  ...  4/28/20  4/29/20  4/30/20
## 0            NaN    Afghanistan        0  ...     1828     1939     2171
## 1            NaN        Albania        0  ...      750      766      773
## 2            NaN        Algeria        0  ...     3649     3848     4006
## 3            NaN        Andorra        0  ...      743      743      745
## 4            NaN         Angola        0  ...       27       27       27
## 
## [5 rows x 102 columns]
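Because `inplace=True` modifies the frame itself, running the drop cell a second time raises a `KeyError` (the columns are already gone). A minimal sketch of the non-mutating form, on a small frame whose values are made up for illustration and not taken from the dataset:

```python
import pandas as pd

# Toy frame with the same column layout as the confirmed-cases file
# (all values here are illustrative, not real data).
toy = pd.DataFrame({
    "Province/State": [None, "Hubei"],
    "Country/Region": ["Albania", "China"],
    "Lat": [41.15, 30.98],
    "Long": [20.17, 112.27],
    "1/22/20": [0, 444],
    "1/23/20": [0, 444],
})

# drop(columns=...) returns a new DataFrame and leaves `toy` untouched,
# so this line can be re-run safely.
trimmed = toy.drop(columns=["Lat", "Long"])

print(trimmed.columns.tolist())
```

Either form gives the same result; the non-mutating one is simply easier to re-run in a notebook.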

1.4 Aggregate the rows by country

corona_dataset_aggregated = corona_dataset_csv.groupby("Country/Region").sum()
corona_dataset_aggregated.head(10)
##                                                         Province/State  ...  4/30/20
## Country/Region                                                          ...         
## Afghanistan                                                          0  ...     2171
## Albania                                                              0  ...      773
## Algeria                                                              0  ...     4006
## Andorra                                                              0  ...      745
## Angola                                                               0  ...       27
## Antigua and Barbuda                                                  0  ...       24
## Argentina                                                            0  ...     4428
## Armenia                                                              0  ...     2066
## Australia            Australian Capital TerritoryNew South WalesNor...  ...     6766
## Austria                                                              0  ...    15452
## 
## [10 rows x 101 columns]
corona_dataset_aggregated.drop("Province/State",axis=1, inplace = True)
corona_dataset_aggregated.shape
## (187, 100)
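In the output above, summing the text column "Province/State" concatenates the province names (see the Australia row), which is why that column is dropped afterwards. One alternative, sketched here on made-up numbers, is to drop the text column before grouping so that only the daily counts are summed:

```python
import pandas as pd

# Toy frame: one country split into provinces, one reported as a whole.
# The counts are invented for illustration.
toy = pd.DataFrame({
    "Province/State": ["New South Wales", "Victoria", None],
    "Country/Region": ["Australia", "Australia", "Austria"],
    "4/29/20": [3016, 1354, 15402],
    "4/30/20": [3025, 1364, 15452],
})

# Remove the text column first, then add up the rows of each country.
per_country = (
    toy.drop(columns=["Province/State"])
       .groupby("Country/Region")
       .sum()
)

print(per_country)
```

The result has one row per country, indexed by "Country/Region", and contains only the date columns.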

2.0 Calculating a good measure

We need to find a good measure, expressed as a single number, that describes the spread of the virus in a country.

corona_dataset_aggregated.loc["China"][:3].plot()

2.1 Calculate the first derivative of the curve

corona_dataset_aggregated.loc["China"].diff().plot()

corona_dataset_aggregated.loc["Switzerland"].diff().plot()
corona_dataset_aggregated.loc["Sweden"].diff().plot()

2.2 Find maximum infection rate for China

corona_dataset_aggregated.loc["China"].diff().max()
## 15136.0
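The measure used here is the largest one-day increase in cumulative confirmed cases. A small sketch of what `diff().max()` computes, using a hypothetical five-day series rather than real data:

```python
import pandas as pd

# Hypothetical cumulative confirmed cases for one country over five days.
cumulative = pd.Series([0, 1, 4, 9, 9],
                       index=["d1", "d2", "d3", "d4", "d5"])

# diff() subtracts the previous day's total from each day's total,
# giving new cases per day. The first day has no predecessor, so it is NaN.
daily_new = cumulative.diff()

# max() skips NaN by default, so this is the largest one-day increase.
peak = daily_new.max()

print(daily_new.tolist())
print(peak)
```

Because `diff()` introduces a `NaN`, the result is a float, which is why the maxima in this section are printed with a trailing `.0`.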

2.3 Find maximum infection rate for other countries

corona_dataset_aggregated.loc["Spain"].diff().max()
## 9630.0
corona_dataset_aggregated.loc["Italy"].diff().max()
## 6557.0
countries = list(corona_dataset_aggregated.index)
max_infection_rates = []
for c in countries:
  rate = corona_dataset_aggregated.loc[c].diff().max()
  max_infection_rates.append(rate)
corona_dataset_aggregated["max_infection_rate"] = max_infection_rates
corona_dataset_aggregated.head()
##                 1/22/20  1/23/20  1/24/20  ...  4/29/20  4/30/20  max_infection_rate
## Country/Region                             ...                                      
## Afghanistan           0        0        0  ...     1939     2171               232.0
## Albania               0        0        0  ...      766      773                34.0
## Algeria               0        0        0  ...     3848     4006               199.0
## Andorra               0        0        0  ...      743      745                43.0
## Angola                0        0        0  ...       27       27                 5.0
## 
## [5 rows x 101 columns]
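The loop above can also be written without iterating over countries. This is a sketch on a toy frame (the labels "A" and "B" and their values are made up), showing that differencing along the date axis and taking the row-wise maximum gives the same numbers as the loop:

```python
import pandas as pd

# Toy cumulative counts: rows are countries, columns are dates.
toy = pd.DataFrame(
    {"1/22/20": [0, 0], "1/23/20": [2, 1], "1/24/20": [7, 1]},
    index=pd.Index(["A", "B"], name="Country/Region"),
)

# Loop version, mirroring the code in the text.
loop_rates = [toy.loc[c].diff().max() for c in toy.index]

# Vectorised version: difference along the columns (axis=1),
# then take the maximum of each row.
vector_rates = toy.diff(axis=1).max(axis=1)

print(loop_rates)
print(vector_rates.tolist())
```

Either way, the rates have to be computed before the `max_infection_rate` column is added, otherwise that column would be treated as one more date.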

2.4 Create a new dataframe with only the needed column

corona_data = pd.DataFrame(corona_dataset_aggregated["max_infection_rate"])
corona_data.head()
##                 max_infection_rate
## Country/Region                    
## Afghanistan                  232.0
## Albania                       34.0
## Algeria                      199.0
## Andorra                       43.0
## Angola                         5.0
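Wrapping the column in `pd.DataFrame(...)` is needed because selecting a single label returns a Series. An equivalent way to get a one-column DataFrame, sketched on two rows copied from the output above, is to select with a list of labels:

```python
import pandas as pd

# Two rows taken from the output shown in the text.
toy = pd.DataFrame(
    {"4/30/20": [2171, 773], "max_infection_rate": [232.0, 34.0]},
    index=pd.Index(["Afghanistan", "Albania"], name="Country/Region"),
)

as_series = toy["max_infection_rate"]     # single label: a Series
as_frame = toy[["max_infection_rate"]]    # list of labels: a DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```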

3.0 Import the WorldHappinessReport.csv dataset

Select the needed columns for the analysis, join the datasets, and calculate the correlation as the result of our analysis.

3.1 Importing the dataset

happiness_report_csv = pd.read_csv("Datasets/worldwide_happiness_report.csv")
happiness_report_csv.head()
##    Overall rank Country or region  ...  Generosity  Perceptions of corruption
## 0             1           Finland  ...       0.153                      0.393
## 1             2           Denmark  ...       0.252                      0.410
## 2             3            Norway  ...       0.271                      0.341
## 3             4           Iceland  ...       0.354                      0.118
## 4             5       Netherlands  ...       0.322                      0.298
## 
## [5 rows x 9 columns]
happiness_report_csv.shape
## (156, 9)

3.2 Let’s drop the useless columns

useless_cols = ["Overall rank", "Score", "Generosity", "Perceptions of corruption"]
happiness_report_csv.drop(useless_cols, axis=1, inplace=True)
happiness_report_csv.head()
##   Country or region  ...  Freedom to make life choices
## 0           Finland  ...                         0.596
## 1           Denmark  ...                         0.592
## 2            Norway  ...                         0.603
## 3           Iceland  ...                         0.591
## 4       Netherlands  ...                         0.557
## 
## [5 rows x 5 columns]

3.3. Change the indices of the dataframe

happiness_report_csv.set_index("Country or region", inplace = True)
happiness_report_csv.head()
##                    GDP per capita  ...  Freedom to make life choices
## Country or region                  ...                              
## Finland                     1.340  ...                         0.596
## Denmark                     1.383  ...                         0.592
## Norway                      1.488  ...                         0.603
## Iceland                     1.380  ...                         0.591
## Netherlands                 1.396  ...                         0.557
## 
## [5 rows x 4 columns]

3.5 Corona Dataframe

corona_data.head()
##                 max_infection_rate
## Country/Region                    
## Afghanistan                  232.0
## Albania                       34.0
## Algeria                      199.0
## Andorra                       43.0
## Angola                         5.0
corona_data.shape
## (187, 1)

3.6 World Happiness Report dataframe

happiness_report_csv.head()
##                    GDP per capita  ...  Freedom to make life choices
## Country or region                  ...                              
## Finland                     1.340  ...                         0.596
## Denmark                     1.383  ...                         0.592
## Norway                      1.488  ...                         0.603
## Iceland                     1.380  ...                         0.591
## Netherlands                 1.396  ...                         0.557
## 
## [5 rows x 4 columns]
happiness_report_csv.shape
## (156, 4)

3.7 Let's join the two datasets we have prepared

data1 = corona_data.join(happiness_report_csv, how="inner")
data1.head()
##              max_infection_rate  ...  Freedom to make life choices
## Afghanistan               232.0  ...                         0.000
## Albania                    34.0  ...                         0.383
## Algeria                   199.0  ...                         0.086
## Argentina                 291.0  ...                         0.471
## Armenia                   134.0  ...                         0.283
## 
## [5 rows x 5 columns]
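An inner join keeps only the index labels that appear in both frames, so a country whose name is spelled differently in the two sources is dropped without any warning. The sketch below uses small made-up frames ("Atlantis" and all numeric values are hypothetical) to show how the unmatched labels on each side can be listed:

```python
import pandas as pd

# Toy stand-ins for the two prepared frames; names and values are illustrative.
rates = pd.DataFrame(
    {"max_infection_rate": [232.0, 34.0, 100.0]},
    index=pd.Index(["Afghanistan", "Albania", "Atlantis"], name="Country/Region"),
)
factors = pd.DataFrame(
    {"GDP per capita": [0.350, 0.947, 1.340]},
    index=pd.Index(["Afghanistan", "Albania", "Finland"], name="Country or region"),
)

# Inner join on the index: only labels present in both frames survive.
joined = rates.join(factors, how="inner")

# Labels that have no partner on the other side.
only_in_rates = rates.index.difference(factors.index)
only_in_factors = factors.index.difference(rates.index)

print(joined)
print(list(only_in_rates), list(only_in_factors))
```

Running the same two `difference` calls on `corona_data` and `happiness_report_csv` would show which countries are left out of `data1`.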

3.8 Correlation Matrix

data1.corr()
##                               max_infection_rate  ...  Freedom to make life choices
## max_infection_rate                      1.000000  ...                      0.078196
## GDP per capita                          0.250118  ...                      0.394603
## Social support                          0.191958  ...                      0.456246
## Healthy life expectancy                 0.289263  ...                      0.427892
## Freedom to make life choices            0.078196  ...                      1.000000
## 
## [5 rows x 5 columns]
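`DataFrame.corr()` computes the Pearson coefficient by default, and only the `max_infection_rate` column of the matrix is needed for this analysis. As a sketch of how to read that column on its own, here is a toy frame with two made-up factors, one rising and one falling linearly with the rate:

```python
import pandas as pd

# Hypothetical data: the factor names and values are for illustration only.
toy = pd.DataFrame({
    "max_infection_rate": [10.0, 20.0, 30.0, 40.0],
    "factor_up": [1.0, 2.0, 3.0, 4.0],      # rises linearly with the rate
    "factor_down": [8.0, 6.0, 4.0, 2.0],    # falls linearly with the rate
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Keep the column of interest and drop the trivial self-correlation.
against_rate = corr["max_infection_rate"].drop("max_infection_rate")

print(against_rate)
```

Values near +1 or -1 indicate a strong linear relationship. The coefficients in the matrix above, roughly 0.08 to 0.29, are all positive but weak.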

4.0 Visualisation of the results

Our analysis is not finished until we visualize the results in figures and graphs, so that everyone can understand what we get out of it.

sns.scatterplot(data=data1, x= "GDP per capita", y = "max_infection_rate")

4.1 Plotting GDP vs maximum Infection rate

x = data1["GDP per capita"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.scatterplot(data=data1, x=x, y=y_log)

x = data1["GDP per capita"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)
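The log scale spreads out the small rates that are squeezed together on a linear axis. One caveat: `np.log10(0)` is `-inf`, so if any country in the joined data had a maximum rate of 0 (an assumption, not something the output above confirms), that point would not plot sensibly. A minimal sketch of filtering before the transform, on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical maximum infection rates, including one zero.
y = pd.Series([0.0, 10.0, 1000.0], index=["A", "B", "C"])

# Keep only strictly positive rates, then take the base-10 logarithm.
positive = y[y > 0]
y_log = np.log10(positive)

print(y_log)
```

When filtering like this, the x values must be restricted to the same index (for example `x.loc[positive.index]`) so that the two series stay aligned.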

4.2 Plotting Social support vs maximum Infection rate

x = data1["Social support"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)

4.3 Plotting Healthy life expectancy vs maximum Infection rate

x = data1["Healthy life expectancy"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)

4.4 Plotting Freedom to make life choices vs maximum Infection rate

x = data1["Freedom to make life choices"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)