By the end of this project, you will know how to preprocess and merge datasets, calculate the measures you need, and prepare them for analysis. In this course, we work with the COVID-19 dataset published by Johns Hopkins University, which contains the cumulative number of confirmed cases per day in each country. We also have a second dataset of various life factors, scored by the people living in each country around the globe. We will merge these two datasets to see whether there is any relationship between the spread of the virus in a country and how happy the people living in that country are.
Learning Objectives:
Understanding the purpose of the project, the datasets that will be used, and the questions that the analysis will answer. Importing the COVID19 dataset and preparing it for the analysis by dropping columns and aggregating rows. Finding and calculating a good measure for the analysis. Merging the two datasets and finding correlations between them. Visualizing the analysis results using Seaborn.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
print('Modules are imported')
## Modules are imported
Importing “covid19_Confirmed_dataset.csv” from the “./Datasets” folder
corona_dataset_csv = pd.read_csv("Datasets/covid19_Confirmed_dataset.csv")
corona_dataset_csv.head()
## Province/State Country/Region Lat ... 4/28/20 4/29/20 4/30/20
## 0 NaN Afghanistan 33.0000 ... 1828 1939 2171
## 1 NaN Albania 41.1533 ... 750 766 773
## 2 NaN Algeria 28.0339 ... 3649 3848 4006
## 3 NaN Andorra 42.5063 ... 743 743 745
## 4 NaN Angola -11.2027 ... 27 27 27
##
## [5 rows x 104 columns]
Let’s check the shape of the dataset.
corona_dataset_csv.shape
## (266, 104)
corona_dataset_csv.drop(["Lat", "Long"],axis=1, inplace = True)
corona_dataset_csv.head()
## Province/State Country/Region 1/22/20 ... 4/28/20 4/29/20 4/30/20
## 0 NaN Afghanistan 0 ... 1828 1939 2171
## 1 NaN Albania 0 ... 750 766 773
## 2 NaN Algeria 0 ... 3649 3848 4006
## 3 NaN Andorra 0 ... 743 743 745
## 4 NaN Angola 0 ... 27 27 27
##
## [5 rows x 102 columns]
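As a small illustration of the column drop above, here is a sketch on a made-up frame. The name `toy` and all values are invented for the example and do not come from the dataset. It uses the non-mutating form of `drop`, which returns a new frame instead of modifying the original in place.

```python
import pandas as pd

# Toy frame with the same column layout as the confirmed-cases file (values invented).
toy = pd.DataFrame({
    "Province/State": [None, "Example Province"],
    "Country/Region": ["A", "B"],
    "Lat": [10.0, 20.0],
    "Long": [30.0, 40.0],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# Non-mutating alternative to drop(..., inplace=True): returns a new frame.
toy_trimmed = toy.drop(columns=["Lat", "Long"])

print(toy.shape)          # the original frame keeps all six columns
print(toy_trimmed.shape)  # the new frame has two columns fewer
```

Either form gives the same result; the non-mutating one can be easier to rerun in a notebook because the original frame is left untouched.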
corona_dataset_aggregated = corona_dataset_csv.groupby("Country/Region").sum()
corona_dataset_aggregated.head(10)
## Province/State ... 4/30/20
## Country/Region ...
## Afghanistan 0 ... 2171
## Albania 0 ... 773
## Algeria 0 ... 4006
## Andorra 0 ... 745
## Angola 0 ... 27
## Antigua and Barbuda 0 ... 24
## Argentina 0 ... 4428
## Armenia 0 ... 2066
## Australia Australian Capital TerritoryNew South WalesNor... ... 6766
## Austria 0 ... 15452
##
## [10 rows x 101 columns]
corona_dataset_aggregated.drop("Province/State",axis=1, inplace = True)
corona_dataset_aggregated.shape
## (187, 100)
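The aggregation step sums the rows of every country that is split into provinces. The sketch below shows the same idea on invented data (`toy` and its values are assumptions for illustration). It drops the text column before grouping, so that only the daily counts are summed and no concatenated province names appear in the result.

```python
import pandas as pd

# Country "A" is reported as two provinces, country "B" as a single row (values invented).
toy = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["A", "A", "B"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 8],
})

# Drop the text column first, then sum the daily counts per country.
toy_aggregated = (
    toy.drop(columns=["Province/State"])
       .groupby("Country/Region")
       .sum()
)

print(toy_aggregated)
```

The two rows of country "A" collapse into one row whose values are the column-wise sums.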
We need to find a good measure, represented as a single number, that describes the spread of the virus in a country.
corona_dataset_aggregated.loc["China"][:3].plot()
corona_dataset_aggregated.loc["China"].diff().plot()
corona_dataset_aggregated.loc["Switzerland"].diff().plot()
corona_dataset_aggregated.loc["Sweden"].diff().plot()
corona_dataset_aggregated.loc["China"].diff().max()
## 15136.0
corona_dataset_aggregated.loc["Spain"].diff().max()
## 9630.0
corona_dataset_aggregated.loc["Italy"].diff().max()
## 6557.0
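To see why `diff().max()` can serve as such a measure, consider a short cumulative series. The numbers below are invented for illustration. `diff()` turns cumulative totals into new cases per day, and `max()` picks the largest single-day increase.

```python
import pandas as pd

# Invented cumulative confirmed cases for five consecutive days.
cumulative = pd.Series(
    [0, 2, 5, 5, 12],
    index=["1/22/20", "1/23/20", "1/24/20", "1/25/20", "1/26/20"],
)

# Day-over-day change; the first entry is NaN because it has no previous day.
daily_new = cumulative.diff()

# Largest single-day increase (NaN values are skipped by default).
peak = daily_new.max()
peak_day = daily_new.idxmax()

print(daily_new.tolist())
print(peak, peak_day)
```

Here the daily increases are 2, 3, 0 and 7, so the peak is 7 and it falls on the last day.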
countries = list(corona_dataset_aggregated.index)
max_infection_rates = []
for c in countries:
    rate = corona_dataset_aggregated.loc[c].diff().max()
    max_infection_rates.append(rate)
corona_dataset_aggregated["max_infection_rate"] = max_infection_rates
corona_dataset_aggregated.head()
## 1/22/20 1/23/20 1/24/20 ... 4/29/20 4/30/20 max_infection_rate
## Country/Region ...
## Afghanistan 0 0 0 ... 1939 2171 232.0
## Albania 0 0 0 ... 766 773 34.0
## Algeria 0 0 0 ... 3848 4006 199.0
## Andorra 0 0 0 ... 743 745 43.0
## Angola 0 0 0 ... 27 27 5.0
##
## [5 rows x 101 columns]
corona_data = pd.DataFrame(corona_dataset_aggregated["max_infection_rate"])
corona_data.head()
## max_infection_rate
## Country/Region
## Afghanistan 232.0
## Albania 34.0
## Algeria 199.0
## Andorra 43.0
## Angola 5.0
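The loop over countries can also be written without an explicit loop. This is an alternative sketch, not the approach used in this project: `DataFrame.diff(axis=1)` takes the difference along the date columns for all rows at once, and `max(axis=1)` then takes the row-wise maximum. The frame `toy` and its values are invented, and the computation has to run before any non-date column is attached to the frame.

```python
import pandas as pd

# Invented cumulative counts for two countries over three days.
toy = pd.DataFrame(
    {"1/22/20": [0, 1], "1/23/20": [4, 1], "1/24/20": [6, 9]},
    index=pd.Index(["A", "B"], name="Country/Region"),
)

# Row-by-row version, equivalent to the loop used above.
loop_rates = [toy.loc[c].diff().max() for c in toy.index]

# Vectorized version: differences along the columns, then the maximum per row.
vectorized = toy.diff(axis=1).max(axis=1)

print(loop_rates)
print(vectorized.tolist())
```

Both versions give the same numbers; the vectorized one avoids building an intermediate list.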
Select the needed columns for the analysis, join the datasets, and calculate the correlations as the result of our analysis.
happiness_report_csv = pd.read_csv("Datasets/worldwide_happiness_report.csv")
happiness_report_csv.head()
## Overall rank Country or region ... Generosity Perceptions of corruption
## 0 1 Finland ... 0.153 0.393
## 1 2 Denmark ... 0.252 0.410
## 2 3 Norway ... 0.271 0.341
## 3 4 Iceland ... 0.354 0.118
## 4 5 Netherlands ... 0.322 0.298
##
## [5 rows x 9 columns]
happiness_report_csv.shape
## (156, 9)
useless_cols = ["Overall rank", "Score", "Generosity", "Perceptions of corruption"]
happiness_report_csv.drop(useless_cols, axis=1, inplace=True)
happiness_report_csv.head()
## Country or region ... Freedom to make life choices
## 0 Finland ... 0.596
## 1 Denmark ... 0.592
## 2 Norway ... 0.603
## 3 Iceland ... 0.591
## 4 Netherlands ... 0.557
##
## [5 rows x 5 columns]
happiness_report_csv.set_index("Country or region", inplace = True)
happiness_report_csv.head()
## GDP per capita ... Freedom to make life choices
## Country or region ...
## Finland 1.340 ... 0.596
## Denmark 1.383 ... 0.592
## Norway 1.488 ... 0.603
## Iceland 1.380 ... 0.591
## Netherlands 1.396 ... 0.557
##
## [5 rows x 4 columns]
corona_data.head()
## max_infection_rate
## Country/Region
## Afghanistan 232.0
## Albania 34.0
## Algeria 199.0
## Andorra 43.0
## Angola 5.0
corona_data.shape
## (187, 1)
happiness_report_csv.head()
## GDP per capita ... Freedom to make life choices
## Country or region ...
## Finland 1.340 ... 0.596
## Denmark 1.383 ... 0.592
## Norway 1.488 ... 0.603
## Iceland 1.380 ... 0.591
## Netherlands 1.396 ... 0.557
##
## [5 rows x 4 columns]
happiness_report_csv.shape
## (156, 4)
data1 = corona_data.join(happiness_report_csv, how="inner")
data1.head()
## max_infection_rate ... Freedom to make life choices
## Afghanistan 232.0 ... 0.000
## Albania 34.0 ... 0.383
## Algeria 199.0 ... 0.086
## Argentina 291.0 ... 0.471
## Armenia 134.0 ... 0.283
##
## [5 rows x 5 columns]
data1.corr()
## max_infection_rate ... Freedom to make life choices
## max_infection_rate 1.000000 ... 0.078196
## GDP per capita 0.250118 ... 0.394603
## Social support 0.191958 ... 0.456246
## Healthy life expectancy 0.289263 ... 0.427892
## Freedom to make life choices 0.078196 ... 1.000000
##
## [5 rows x 5 columns]
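Since the question is how each life factor relates to the infection measure, it can help to read only the `max_infection_rate` column of the correlation matrix. A minimal sketch on invented data follows; the column names `factor_up` and `factor_down` are hypothetical.

```python
import pandas as pd

# Invented data: one factor rises with the rate, the other falls.
toy = pd.DataFrame({
    "max_infection_rate": [1.0, 2.0, 3.0, 4.0],
    "factor_up": [2.0, 4.0, 6.0, 8.0],
    "factor_down": [8.0, 6.0, 4.0, 2.0],
})

# Pearson correlation of every other column with max_infection_rate, sorted.
corr_with_rate = (
    toy.corr()["max_infection_rate"]
       .drop("max_infection_rate")
       .sort_values(ascending=False)
)

print(corr_with_rate)
```

In this toy case both factors are exactly linear in the rate, so the correlations are 1 and -1.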
Our analysis is not finished until we visualize the results with figures and graphs, so that everyone can understand what we get out of it.
sns.scatterplot(data=data1, x= "GDP per capita", y = "max_infection_rate")
x = data1["GDP per capita"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.scatterplot(data=data1, x=x, y=y_log)
x = data1["GDP per capita"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)
x = data1["Healthy life expectancy"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)
x = data1["Freedom to make life choices"]
y = data1["max_infection_rate"]
y_log = np.log10(y)
sns.regplot(data=data1, x=x, y=y_log)
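One caveat of the log scale: `np.log10` returns `-inf` for a value of 0, so a country whose peak is 0 would distort a regression line. Whether the joined data contains such rows is not shown here, so the following is only a sketch on invented values. It filters to positive rates before taking the logarithm and then measures the correlation on the log scale.

```python
import numpy as np
import pandas as pd

# Invented data; the first row has a peak of 0, which log10 cannot handle.
toy = pd.DataFrame({
    "GDP per capita": [0.2, 0.6, 1.0, 1.4],
    "max_infection_rate": [0.0, 10.0, 100.0, 1000.0],
})

# Keep only strictly positive rates before taking the logarithm.
positive = toy[toy["max_infection_rate"] > 0]
log_rate = np.log10(positive["max_infection_rate"])

# Pearson correlation between the factor and the log-scaled rate.
r = positive["GDP per capita"].corr(log_rate)

print(len(positive), round(r, 3))
```

The same filtered columns could then be passed to `sns.regplot` in place of the unfiltered ones.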