COVID-19 Data Analysis in R

The datasets used in this analysis contain COVID-19 metrics for countries and regions: population, total and new cases, deaths, recoveries, active and serious cases, and tests. In the main dataset each record is a cumulative snapshot for one country or region; the grouped dataset adds a daily time series per country.

The columns of each dataset are introduced below, after the data is loaded.

Load the dataset and the required packages

Importing Dataset

Three datasets are imported into this project:

  • covid: contains Country/Region, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases, Serious,Critical, Tot Cases/1M pop, Deaths/1M pop, TotalTests, Tests/1M pop, WHO Region, iso_alpha.

  • covid_grouped: contains Date (from 2020-01-22 to 2020-07-27), Country/Region, Confirmed, Deaths, Recovered, Active, New cases, New deaths, New recovered, WHO Region, iso_alpha.

  • coviddeath: contains real-world records of COVID-19 deaths and the reasons behind them.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
# Load the dataset
data <- read.csv("covid.csv")
data_grouped <-  read.csv("covid_grouped.csv")
data_death <-  read.csv("coviddeath.csv")
head(data)
##   Country.Region     Continent Population TotalCases NewCases TotalDeaths
## 1            USA North America  331198130    5032179       NA      162804
## 2         Brazil South America  212710692    2917562       NA       98644
## 3          India          Asia 1381344997    2025409       NA       41638
## 4         Russia        Europe  145940924     871894       NA       14606
## 5   South Africa        Africa   59381566     538184       NA        9604
## 6         Mexico North America  129066160     462690     6590       50517
##   NewDeaths TotalRecovered NewRecovered ActiveCases Serious.Critical
## 1        NA        2576668           NA     2292707            18296
## 2        NA        2047660           NA      771258             8318
## 3        NA        1377384           NA      606387             8944
## 4        NA         676357           NA      180931             2300
## 5        NA         387316           NA      141264              539
## 6       819         308848         4140      103325             3987
##   Tot.Cases.1M.pop Deaths.1M.pop TotalTests Tests.1M.pop     WHO.Region
## 1            15194           492   63139605       190640       Americas
## 2            13716           464   13206188        62085       Americas
## 3             1466            30   22149351        16035 South-EastAsia
## 4             5974           100   29716907       203623         Europe
## 5             9063           162    3149807        53044         Africa
## 6             3585           391    1056915         8189       Americas
##   iso_alpha
## 1       USA
## 2       BRA
## 3       IND
## 4       RUS
## 5       ZAF
## 6       MEX

Getting dataset information

dim(data)
## [1] 209  17
str(data)
## 'data.frame':    209 obs. of  17 variables:
##  $ Country.Region  : chr  "USA" "Brazil" "India" "Russia" ...
##  $ Continent       : chr  "North America" "South America" "Asia" "Europe" ...
##  $ Population      : num  3.31e+08 2.13e+08 1.38e+09 1.46e+08 5.94e+07 ...
##  $ TotalCases      : int  5032179 2917562 2025409 871894 538184 462690 455409 366671 357710 354530 ...
##  $ NewCases        : num  NA NA NA NA NA 6590 NA NA NA NA ...
##  $ TotalDeaths     : num  162804 98644 41638 14606 9604 ...
##  $ NewDeaths       : num  NA NA NA NA NA 819 NA NA NA NA ...
##  $ TotalRecovered  : num  2576668 2047660 1377384 676357 387316 ...
##  $ NewRecovered    : num  NA NA NA NA NA 4140 NA NA NA NA ...
##  $ ActiveCases     : num  2292707 771258 606387 180931 141264 ...
##  $ Serious.Critical: num  18296 8318 8944 2300 539 ...
##  $ Tot.Cases.1M.pop: num  15194 13716 1466 5974 9063 ...
##  $ Deaths.1M.pop   : num  492 464 30 100 162 391 619 517 234 610 ...
##  $ TotalTests      : num  63139605 13206188 22149351 29716907 3149807 ...
##  $ Tests.1M.pop    : num  190640 62085 16035 203623 53044 ...
##  $ WHO.Region      : chr  "Americas" "Americas" "South-EastAsia" "Europe" ...
##  $ iso_alpha       : chr  "USA" "BRA" "IND" "RUS" ...
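
The `str()` output shows `NA` in NewCases, NewDeaths, and NewRecovered. One way to quantify that before cleaning is to count missing values per column. The sketch below is only an illustration: `toy` is a hypothetical stand-in built from a few columns of the six rows printed by `head(data)`; in the real analysis the same two calls would be applied to `data`.

```r
# Hypothetical stand-in for `data`: a few columns from the six head(data) rows
toy <- data.frame(
  Country.Region = c("USA", "Brazil", "India", "Russia", "South Africa", "Mexico"),
  TotalCases     = c(5032179, 2917562, 2025409, 871894, 538184, 462690),
  NewCases       = c(NA, NA, NA, NA, NA, 6590),
  NewDeaths      = c(NA, NA, NA, NA, NA, 819),
  NewRecovered   = c(NA, NA, NA, NA, NA, 4140)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Same information as a percentage of rows
na_share <- round(100 * na_counts / nrow(toy), 1)

print(data.frame(missing = na_counts, percent = na_share))
```

Columns that are mostly missing are natural candidates for removal, which is what the cleaning step does with the three "New" columns.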

Cleaning data (Example - Using Python in R)

import pandas as pd
df = r.data  # Access the R data frame `data` through reticulate's `r` object
df = pd.DataFrame(df)  # Make sure df is a pandas DataFrame
print(df.columns)
## Index(['Country.Region', 'Continent', 'Population', 'TotalCases', 'NewCases',
##        'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'NewRecovered',
##        'ActiveCases', 'Serious.Critical', 'Tot.Cases.1M.pop', 'Deaths.1M.pop',
##        'TotalTests', 'Tests.1M.pop', 'WHO.Region', 'iso_alpha'],
##       dtype='object')

Remove Unnecessary Columns

# Drop the NewCases, NewDeaths and NewRecovered columns from df
# (axis=1 selects columns; these three are mostly NA)
df.drop(['NewCases', 'NewDeaths', 'NewRecovered'],
        axis=1, inplace=True)

# Show a random sample of 5 rows from df
df.sample(5)
##     Country.Region      Continent  ...            WHO.Region  iso_alpha
## 167    Isle of Man         Europe  ...                              IMN
## 43        Portugal         Europe  ...                Europe        PRT
## 29          Sweden         Europe  ...                Europe        SWE
## 65      Costa Rica  North America  ...              Americas        CRI
## 99           Libya         Africa  ...  EasternMediterranean        LBY
## 
## [5 rows x 14 columns]
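
The same column drop can also be done without leaving R. The following is a minimal sketch, not the method used in this report: it assumes dplyr 1.0 or later (for `any_of()`), and `toy` is a hypothetical stand-in for `data` built from values in the `head(data)` output.

```r
library(dplyr)

# Hypothetical stand-in for `data`
toy <- data.frame(
  Country.Region = c("USA", "Brazil", "Mexico"),
  TotalCases     = c(5032179, 2917562, 462690),
  NewCases       = c(NA, NA, 6590),
  NewDeaths      = c(NA, NA, 819),
  NewRecovered   = c(NA, NA, 4140)
)

# any_of() drops the listed columns and silently ignores names that are absent,
# so the call is safe to re-run on an already cleaned data frame
toy_clean <- toy %>%
  select(-any_of(c("NewCases", "NewDeaths", "NewRecovered")))

print(names(toy_clean))
```

Unlike `df.drop(..., inplace=True)` in pandas, `select()` returns a new data frame and leaves `toy` unchanged.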

Creating a table using the DT package (switching back to R)

# Load required packages
library(reticulate)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(DT)

df <- py$df  # Import the cleaned pandas DataFrame back into R

# Create interactive table using DT package
datatable(df, options = list(
  pageLength = 5,
  scrollX = TRUE
))

Bar graphs: comparing countries by total cases, total deaths, total recovered, and total tests

We use ggplot2 to draw bar charts of the total number of cases for the 15 countries with the most cases.

Horizontal bar chart for Total cases

top15 <- df %>%
  arrange(desc(TotalCases)) %>%
  slice_head(n = 15)

ggplot(top15, aes(x = reorder(`Country.Region`, TotalCases), y = TotalCases, fill = TotalCases)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 15 Countries by Total Cases", x = "Country", y = "Total Cases") +
  theme_minimal()

Vertical bar chart for Total cases

# Top 15 countries by TotalCases
top15_df <- df %>%
  arrange(desc(TotalCases)) %>%
  slice(1:15)

# Bar plot colored by TotalCases
ggplot(top15_df, aes(x = reorder(Country.Region, -TotalCases), y = TotalCases, fill = TotalCases)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Total Cases", title = "Top 15 Countries by Total COVID Cases") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar chart for Total Deaths

# Plot TotalDeaths on the y axis (top15_df is still the top 15 by TotalCases)
ggplot(top15_df, aes(x = reorder(Country.Region, -TotalDeaths), y = TotalDeaths, fill = TotalDeaths)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Total Deaths", title = "Top 15 Countries - Total Deaths") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar Chart for Total Recovered

# Plot TotalRecovered on the y axis (top15_df is still the top 15 by TotalCases)
ggplot(top15_df, aes(x = reorder(Country.Region, -TotalRecovered), y = TotalRecovered, fill = TotalRecovered)) +
  geom_bar(stat = "identity") +
  labs(x = "Country", y = "Total Recovered", title = "Top 15 Countries - Total Recovered") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Bar Chart for Total Tests

ggplot(top15_df, aes(x = TotalTests, y = reorder(Country.Region, TotalTests), fill = TotalTests)) +
  geom_bar(stat = "identity") +
  labs(x = "Total Tests", y = "Country", title = "Horizontal Bar Plot - Total Tests by Country") +
  theme_minimal()

Horizontal bar plot for Total tests by Continent

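# top15_df holds only the 15 countries with the most cases, so each bar is the
# sum of tests over those countries; FUN = sum orders the bars by that same sum
ggplot(top15_df, aes(x = TotalTests, y = reorder(Continent, TotalTests, FUN = sum, na.rm = TRUE), fill = Continent)) +
  geom_bar(stat = "identity") +
  labs(x = "Total Tests", y = "Continent", title = "Total Tests by Continent (Top 15 Countries by Cases)") +
  theme_minimal()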
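
Because `top15_df` contains only the 15 countries with the most cases, the continent bars above cover those countries only. To total tests over every country, one option is to aggregate first and plot the summary. The sketch below assumes that approach; `toy` is a hypothetical stand-in for `df` built from the six `head(data)` rows, and `tests_by_continent` is a name introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `df`
toy <- data.frame(
  Country.Region = c("USA", "Brazil", "India", "Russia", "South Africa", "Mexico"),
  Continent      = c("North America", "South America", "Asia", "Europe", "Africa", "North America"),
  TotalTests     = c(63139605, 13206188, 22149351, 29716907, 3149807, 1056915)
)

# One row per continent, summing tests and ignoring missing values
tests_by_continent <- toy %>%
  group_by(Continent) %>%
  summarise(TotalTests = sum(TotalTests, na.rm = TRUE), .groups = "drop")

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(tests_by_continent,
            aes(x = TotalTests, y = reorder(Continent, TotalTests), fill = Continent)) +
  geom_col() +
  labs(x = "Total Tests", y = "Continent", title = "Total Tests by Continent") +
  theme_minimal()
p
```

With the full data, replacing `toy` by `df` gives one bar per continent that reflects all countries rather than the top 15.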

Bubble plot for Total Cases by Continent

ggplot(df, aes(x = Continent, y = TotalCases, size = TotalCases, color = TotalCases)) +
  geom_point(alpha = 0.6) +
  labs(title = "Bubble Chart: Total Cases by Continent") +
  theme_minimal()

Bubble plot for Total Cases by Continent (log scale)

ggplot(df, aes(x = Continent, y = TotalCases, size = TotalCases, color = TotalCases)) +
  geom_point(alpha = 0.6) +
  scale_y_log10() +
  labs(title = "Bubble Chart: Total Cases by Continent (Log Scale)") +
  theme_minimal()

Heatmap for COVID-19 disease intensity across countries

The heatmap visualizes the relative intensity of the pandemic impact across nations, helping to quickly identify severely affected areas.

# Select and transform top 15 countries
heatmap_df <- df %>%
  arrange(desc(TotalCases)) %>%
  slice(1:15) %>%
  select(Country.Region, TotalCases, TotalDeaths, TotalRecovered, TotalTests) %>%
  pivot_longer(cols = -Country.Region, names_to = "Metric", values_to = "Value")

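Each metric is scaled by dividing it by its maximum among the 15 countries, so every metric tops out at 1 and metrics with very different magnitudes (for example deaths and tests) become comparable. The scaled value is then mapped to a light-to-dark blue color gradient.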

# Normalize values for better heatmap contrast
heatmap_df <- heatmap_df %>%
  group_by(Metric) %>%
  mutate(NormalizedValue = Value / max(Value, na.rm = TRUE))
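
To make the scaling concrete, here is a small worked sketch of the same max-normalization on three countries and two metrics. The values are taken from the `head(data)` output; `toy` and `toy_long` are hypothetical names used only for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: three countries, two metrics, values from head(data)
toy <- data.frame(
  Country.Region = c("USA", "Brazil", "India"),
  TotalCases     = c(5032179, 2917562, 2025409),
  TotalDeaths    = c(162804, 98644, 41638)
)

toy_long <- toy %>%
  pivot_longer(cols = -Country.Region, names_to = "Metric", values_to = "Value") %>%
  group_by(Metric) %>%                                        # scale each metric separately
  mutate(NormalizedValue = Value / max(Value, na.rm = TRUE)) %>%
  ungroup()                                                   # drop grouping for later steps

print(toy_long)
```

The country with the largest value in a metric gets 1 for that metric, and every other country gets its share of that maximum, so colors are comparable within a metric but not across metrics.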

The heatmap is created with ggplot2:

  • Rows represent the 15 countries or regions with the most cases.

  • Columns represent the selected metrics: total cases, deaths, recoveries, and tests.

ggplot(heatmap_df, aes(x = Metric, y = reorder(Country.Region, desc(Country.Region)), fill = NormalizedValue)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "COVID-19 Metrics Heatmap (Top 15 Countries)",
       x = "Metric", y = "Country", fill = "Normalized\nValue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))