AN R PROJECT ON AFRICA CONTINENT COVID-19 DATA ANALYSIS AS OF JANUARY 2020 TO MAY 2023

This project is sequal to ‘AFRICA CONTINENT COVID-19 DATA ANALYSIS AS OF JANUARY 2020 TO MAY 2023: AN SQL SERVER (SSMS) AND TEABLU PROJECT’. The urge to know the relationship of total_deaths as it relate to other variables in Africal Continent gave birth to this Project

Step One:

I loaded my libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(lubridate)
library(ggplot2)

Step Two:

Read in my data and converted from csv to rds file

covid<-read.csv('C:/Users/ebene/Documents/covid_data1.csv')
# Save as RDS file
saveRDS(covid, "C:/Users/ebene/Documents/covid_data1.rds")

Step three:

Looked at my data structures to better understant it

str(covid)# 307831 observations and 14 veriables

## 'data.frame':    307831 obs. of  14 variables:
##  $ iso_code               : chr  "AFG" "AFG" "AFG" "AFG" ...
##  $ continent              : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location               : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                   : chr  "03/01/2020" "04/01/2020" "05/01/2020" "06/01/2020" ...
##  $ total_cases            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_cases              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total_deaths           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ icu_patients           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_vaccinations     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters         : num  NA NA NA NA NA NA NA NA NA NA ...

Step Three:

Filtered rows to Africa continent

covid_2 <- subset(covid, continent == 'Africa')# has 69370 observation and 14 veriables
#covid[continent == 'Africa', ]

Step four:

I chosed the columns of intrest, then looked at the summary of my dataframe to futher understan it

covid_3 <-(covid_2[,5:14])# only need column 5 to 14 
summary(covid_3)

##   total_cases        new_cases        total_deaths      new_deaths     
##  Min.   :      1   Min.   :    0.0   Min.   :     1   Min.   :  0.000  
##  1st Qu.:   6279   1st Qu.:    0.0   1st Qu.:   120   1st Qu.:  0.000  
##  Median :  24541   Median :    2.0   Median :   365   Median :  0.000  
##  Mean   : 134406   Mean   :  188.6   Mean   :  3077   Mean   :  3.733  
##  3rd Qu.:  90780   3rd Qu.:   52.0   3rd Qu.:  1739   3rd Qu.:  0.000  
##  Max.   :4072533   Max.   :46914.0   Max.   :102595   Max.   :951.000  
##  NA's   :5158      NA's   :11        NA's   :8010     NA's   :8        
##   icu_patients    hosp_patients   total_vaccinations  people_vaccinated 
##  Min.   :   0.0   Min.   :   96   Min.   :        0   Min.   :       0  
##  1st Qu.:  19.0   1st Qu.: 3076   1st Qu.:   368280   1st Qu.:  271668  
##  Median :  49.0   Median : 4954   Median :  1841568   Median : 1415448  
##  Mean   : 423.2   Mean   : 6146   Mean   :  7848974   Mean   : 4910437  
##  3rd Qu.: 618.0   3rd Qu.: 8327   3rd Qu.:  8574775   3rd Qu.: 5628591  
##  Max.   :2694.0   Max.   :18034   Max.   :116606863   Max.   :82325196  
##  NA's   :67673    NA's   :68539   NA's   :63153       NA's   :63521     
##  people_fully_vaccinated total_boosters    
##  Min.   :       5        Min.   :       3  
##  1st Qu.:  223420        1st Qu.:  126687  
##  Median : 1088142        Median :  612567  
##  Mean   : 4319827        Mean   : 1460859  
##  3rd Qu.: 4652414        3rd Qu.: 1982127  
##  Max.   :69793071        Max.   :15063017  
##  NA's   :64358           NA's   :67753

The mean and median of most variables are far apart and maximum values and minimum values are equally far apart, there are too many NA values

Step Five:

I omited the NA values and created a plot of total_death by people vaccinated

# Remove rows with NA values
covid_4 <- na.omit(covid_3)

# Create the plot
ggplot(covid_4, aes(x = people_vaccinated, y = total_deaths)) +
  geom_point()

The relationship between total_death and population_viccinated is positive, there was increase in total_death as population_viccinated increases, this is no suprise, greater percentage of the population where not vaccinated. Seychelle, Morocco, Tunisia and South Africa top the chart of percentage of people fully vaccinated by population with 53.04%, 33.16%, 30.04% and 23.82% respectively. it was only Seychelle, that had 50% 0f her population vacinated, detail is in my first work as stated above

Step Five b:

looked at how othere virables related to total_deaths

# Create the plot
ggplot(covid_4, aes(x = people_fully_vaccinated , y = total_deaths)) +
  geom_point()

It is a positive relationship, reason is same as meantion before, most country in Africa did not have up to 30% of their population vacinated

Step Five c

ggplot(covid_4, aes(x = total_cases, y = total_deaths)) +
  geom_point()

This plot indicates that the relationship between total_cases and total death is a positiv relationship as total_cases incresed, there was a corresponding incress in total_deaths. to understand this relationship better let me quantify the direction and strength of these relationships so that I can communicate them better? This will lead me to correlation coefficient after the next plot

ggplot(covid_4, aes(x = total_boosters, y = total_deaths)) +
  geom_point()

it is a posotive relationship, but there was a period of lag, which could be as a result of the vaccing working.

Step 6 Correlation Matrix

cor(covid_4[,c('total_deaths', 'total_cases', 'people_fully_vaccinated', 'people_vaccinated', 'total_boosters')])

##                         total_deaths total_cases people_fully_vaccinated
## total_deaths               1.0000000   0.8413319               0.9360677
## total_cases                0.8413319   1.0000000               0.9592113
## people_fully_vaccinated    0.9360677   0.9592113               1.0000000
## people_vaccinated          0.9496054   0.9557839               0.9986277
## total_boosters             0.9153427   0.9462191               0.9943959
##                         people_vaccinated total_boosters
## total_deaths                    0.9496054      0.9153427
## total_cases                     0.9557839      0.9462191
## people_fully_vaccinated         0.9986277      0.9943959
## people_vaccinated               1.0000000      0.9908188
## total_boosters                  0.9908188      1.0000000

You can see the correlation between total_death and total_cases is 0.84, It’s also worth noting that exploring the correlations between the independent variables can be useful as well.

Step Seven:

I explor more optiong looking at all the relationships

covid_5 <- cor(covid_4 %>% select(where(is.numeric)))
covid_5

##                         total_cases  new_cases total_deaths new_deaths
## total_cases               1.0000000 -0.1455128    0.8413319 -0.4956454
## new_cases                -0.1455128  1.0000000   -0.1758851  0.2021870
## total_deaths              0.8413319 -0.1758851    1.0000000 -0.6415155
## new_deaths               -0.4956454  0.2021870   -0.6415155  1.0000000
## icu_patients             -0.4676779  0.3432254   -0.7928338  0.5780256
## hosp_patients            -0.6270871  0.2942600   -0.9011076  0.6337101
## total_vaccinations        0.9551956 -0.1233645    0.9317205 -0.6294260
## people_vaccinated         0.9557839 -0.1491606    0.9496054 -0.6242294
## people_fully_vaccinated   0.9592113 -0.1295804    0.9360677 -0.6149295
## total_boosters            0.9462191 -0.1015376    0.9153427 -0.6411559
##                         icu_patients hosp_patients total_vaccinations
## total_cases               -0.4676779    -0.6270871          0.9551956
## new_cases                  0.3432254     0.2942600         -0.1233645
## total_deaths              -0.7928338    -0.9011076          0.9317205
## new_deaths                 0.5780256     0.6337101         -0.6294260
## icu_patients               1.0000000     0.9671790         -0.6488904
## hosp_patients              0.9671790     1.0000000         -0.7809003
## total_vaccinations        -0.6488904    -0.7809003          1.0000000
## people_vaccinated         -0.6705657    -0.8024033          0.9974619
## people_fully_vaccinated   -0.6530315    -0.7842574          0.9990956
## total_boosters            -0.6334861    -0.7648311          0.9977908
##                         people_vaccinated people_fully_vaccinated
## total_cases                     0.9557839               0.9592113
## new_cases                      -0.1491606              -0.1295804
## total_deaths                    0.9496054               0.9360677
## new_deaths                     -0.6242294              -0.6149295
## icu_patients                   -0.6705657              -0.6530315
## hosp_patients                  -0.8024033              -0.7842574
## total_vaccinations              0.9974619               0.9990956
## people_vaccinated               1.0000000               0.9986277
## people_fully_vaccinated         0.9986277               1.0000000
## total_boosters                  0.9908188               0.9943959
##                         total_boosters
## total_cases                  0.9462191
## new_cases                   -0.1015376
## total_deaths                 0.9153427
## new_deaths                  -0.6411559
## icu_patients                -0.6334861
## hosp_patients               -0.7648311
## total_vaccinations           0.9977908
## people_vaccinated            0.9908188
## people_fully_vaccinated      0.9943959
## total_boosters               1.0000000

That’s too much information to process, let me make it easier to see patterns by using colors, shapes, and groups.

# Assuming you have a data frame named "data" containing numeric variables
# Create the correlation matrix

# Load the corrplot package
library(corrplot)

## corrplot 0.92 loaded

# Generate the correlation plot
corrplot(covid_5,
         method = "color",  # You can also choose "pie" or "ellipse" for different visualization methods
         order = "hclust",  # Orders the variables based on their similarity
         addCoef.col = "black",
         number.cex = 0.6)  # Adjust the size of the numbers in the cells

First, you’ll notice that going down the diagonal there is a value of 1 for every cell. This is to be expected because it means that the variable is perfectly correlated with itself.

Perhaps you will also notice that after ordering the variables using hierarchical clustering, it’s easier to see which variables behave similarly. For instance, people_fully_vacinated and total vaccination seem to have the same pattern of positive and negative correlations with the other variables, and a strong positive correlation with each other.

One way to prioritize your investigation is to focus on the column or row of the dependent variable, total_deaths. By looking at the total_deaths column we can see people_vaccinated is the variable that has the strongest positive correlation of 0.95 , while new_cases has the strongest negative correlation of -0.18. Thus, these correlation coefficients quantify the relationship that was displayed in the scatter plots.

Scatterplot of total_deaths and new_cases

Let’s look at one more scatterplot for a variable that is not strongly correlated with total_deaths. Based on the correlation matrix above, these variables have a correlation of -0.18.

# Create the scatter plot
ggplot(covid_4, aes(x = new_cases, y = total_deaths)) +
  geom_point()

STEP EIGHT Regression

Here’s how we can create a linear model to predict tota_deaths by regressing on new_cases and total cases from the covid_4 datafram. this is dome by using the lm function

ggplot(covid_4, aes(x = new_cases, y = total_deaths)) +
  geom_point()+
 stat_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

lm1 <- lm(totalRevenue ~ Fuel_py1, data = trd)

ggplot(covid_4, aes(x = total_cases, y = total_deaths)) +
  geom_point()+
   stat_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

Step Nine:

appling lm()

lm_1 <- lm(total_deaths ~ new_cases, data = covid_4)
summary(lm_1)

## 
## Call:
## lm(formula = total_deaths ~ new_cases, data = covid_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6635.0  -365.7   688.0  1459.8  2604.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.003e+05  2.364e+02 424.537   <2e-16 ***
## new_cases   -1.559e-01  6.920e-02  -2.253   0.0256 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2096 on 159 degrees of freedom
## Multiple R-squared:  0.03094,    Adjusted R-squared:  0.02484 
## F-statistic: 5.076 on 1 and 159 DF,  p-value: 0.02563

lm_2 <- lm(total_deaths ~ total_cases, data = covid_4)
summary(lm_2)

## 
## Call:
## lm(formula = total_deaths ~ total_cases, data = covid_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3427.2  -429.2  -276.7  1142.5  1325.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.313e+04  2.388e+03   22.25   <2e-16 ***
## total_cases 1.227e-02  6.250e-04   19.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1151 on 159 degrees of freedom
## Multiple R-squared:  0.7078, Adjusted R-squared:  0.706 
## F-statistic: 385.2 on 1 and 159 DF,  p-value: < 2.2e-16

Step Ten:

Summary of Regression Lm_1

The following observations were made: the relationship was between total_death as the dependent veriable and new_cases as the independent variable and a linear regression was fitted using lm() function.

Coefficients The analysis shows two coefficients, one for the intercept and another for the variable “new_cases.” The coefficient for the intercept is 1.003e+05 (100,300) with a standard error of 2.364e+02 (236.4), and the coefficient for “new_cases” is -1.559e-01 (-0.1559) with a standard error of 6.920e-02 (0.0692). These coefficients represent the estimated effect of the corresponding variables on the outcome.

t-value and p-value The t-value measures the significance of each coefficient estimate. The t-value for the intercept is very large (424.537), indicating its high significance. However, the t-value for the “new_cases” coefficient is -2.253, and the corresponding p-value is 0.0256, which is less significant but still below the conventional threshold of 0.05. The significance code ’*’ is used to indicate that the “new_cases” coefficient is statistically significant at the 0.05 level.

Residual standard error The residual standard error measures the average amount by which the observed values deviate from the predicted values. In this analysis, the residual standard error is 2096, indicating the average magnitude of the residuals.

R-squared and adjusted R-squared The R-squared value (0.03094) represents the proportion of the variance in the dependent variable that can be explained by the independent variable “new_cases.” In this case, the R-squared value is very low, suggesting that the linear relationship between “new_cases” and “total_deaths” explains only a small portion of the variability in the outcome. The adjusted R-squared value (0.02484) adjusts the R-squared value for the number of predictors in the model.

F-statistic and p-value The F-statistic tests the overall significance of the model. In this case, the F-statistic is 5.076 with 1 and 159 degrees of freedom, and the associated p-value is 0.02563. This indicates that the model as a whole is statistically significant, although the R-squared value is low.

summary, the analysis suggests a weak linear relationship between the variable “new_cases” and the outcome variable “total_deaths.” The coefficient estimate for “new_cases” is negative (-0.1559), indicating that an increase in “new_cases” is associated with a decrease in the outcome variable. However, the low R-squared value suggests that “new_cases” explains only a small portion of the variability in “total_deaths.” Additionally, the significance of the “new_cases” coefficient is modest, with a p-value of 0.0256, indicating that it may not be strongly significant at conventional significance levels.

Step Eleven: Summary of Regression Lm_2

The following observations were made: the relationship was between total_death as the dependent veriable and total_cases as the independent variable and a linear regression was fitted using lm() function.

Coefficients: The analysis shows two coefficients, one for the intercept and another for the variable “total_cases.” The coefficient for the intercept is 5.313e+04 (53,130) with a standard error of 2.388e+03 (2,388), and the coefficient for “total_cases” is 1.227e-02 (0.01227) with a standard error of 6.250e-04 (0.000625). These coefficients indicate the estimated effect of the corresponding variables on the outcome.

t-value and p-value The t-value measures the significance of each coefficient estimate. The larger the absolute t-value, the more significant the coefficient is. In this case, both the intercept and the “total_cases” coefficient have high t-values (22.25 and 19.63, respectively), indicating that they are statistically significant. The corresponding p-values (<2e-16) are also extremely small, which suggests strong evidence against the null hypothesis that these coefficients are zero.

Significance codes The analysis provides significance codes to indicate the level of significance. In this case, both coefficients are marked with ’***,’ which signifies the highest level of significance (p-value < 0.001).

Residual standard error The residual standard error measures the average amount by which the observed values deviate from the predicted values. In this analysis, the residual standard error is 1151, indicating the average magnitude of the residuals.

R-squared and adjusted R-squared The R-squared value (0.7078) represents the proportion of the variance in the dependent variable that can be explained by the independent variables. In other words, around 70.78% of the variability in the outcome variable can be accounted for by the linear relationship with “total_cases.” The adjusted R-squared value (0.706) adjusts the R-squared value for the number of predictors in the model.

F-statistic and p-value The F-statistic tests the overall significance of the model. A larger F-statistic indicates a better fit of the model. In this case, the F-statistic is 385.2 with 1 and 159 degrees of freedom, and the associated p-value is < 2.2e-16, indicating that the model as a whole is highly significant.

summary the analysis suggests a strong linear relationship between the variable “total_cases” and the outcome variable. The coefficient estimate for “total_cases” is positive (0.01227), indicating that an increase in “total_cases” is associated with an increase in the outcome variable. The model has a good fit, as indicated by the high R-squared value and the overall significance of the model.

Thanks for your time