For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily covid-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web csv. Either approach is acceptable; the data should be the same.

Once you have defined a data frame with the daily case and death and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks, as needed. You should make comments in the code chunks or add clarifying text between code chunks that you think further your work.

A data dictionary for the dataset is available here.

Definitions:


1. Descriptive Statistics

Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:

data$incidence_rate <- (data$cases/data$popData2020)*100000
data$fatality_rate <- (data$deaths/data$popData2020)*100000

data <- data[data$cases>=0,]
data <- data[data$deaths>=0,]

glimpse(data)
## Rows: 28,698
## Columns: 12
## $ dateRep                 <date> 2022-10-23, 2022-10-22, 2022-10-21, 2022-10-2…
## $ day                     <int> 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12…
## $ month                   <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ year                    <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ cases                   <int> 3557, 5494, 7776, 8221, 10007, 13204, 9964, 66…
## $ deaths                  <int> 0, 4, 4, 6, 8, 7, 8, 12, 6, 10, 14, 13, 11, 10…
## $ countriesAndTerritories <fct> Austria, Austria, Austria, Austria, Austria, A…
## $ geoId                   <fct> AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT…
## $ countryterritoryCode    <fct> AUT, AUT, AUT, AUT, AUT, AUT, AUT, AUT, AUT, A…
## $ popData2020             <int> 8901064, 8901064, 8901064, 8901064, 8901064, 8…
## $ incidence_rate          <dbl> 39.96151, 61.72296, 87.36034, 92.35974, 112.42…
## $ fatality_rate           <dbl> 0.00000000, 0.04493845, 0.04493845, 0.06740767…
plot1 <- ggplot(data=data,aes(x=as.character(dateRep),y=incidence_rate,group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Incidence Rates per 100,000 people per Country") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") + 
  ylab("Incidence Rate") +
  scale_fill_brewer(palette='Spectral')

plot2 <- ggplot(data=data,aes(x=as.character(dateRep),y=fatality_rate,group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Mortality Rates per 100,000 people per Country") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") + 
  ylab("Mortality Rate") +
  scale_fill_brewer(palette='Spectral')

grid.arrange(plot1, plot2, ncol=1, nrow =2)

data <- data %>%
  group_by(countriesAndTerritories) %>%
  mutate(across(cases:deaths, sum, .names = "{.col}_sum")) %>%
  ungroup()

aggs <- aggregate(cbind(deaths, cases) ~ countriesAndTerritories, data = data, FUN= sum, na.rm = TRUE)

aggs$Case_Fatality_Percent <- (aggs$deaths/aggs$cases)*100

2. Inferential Statistics

Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:

newdata <- subset(data, countriesAndTerritories == "Austria" | countriesAndTerritories == "Latvia")

plot3 <- ggplot(data = newdata, aes(x=as.character(dateRep), y=incidence_rate, group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Incidence Rates in Austria and Latvia - per 100,000 people") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") +
  ylab("Incidence Rate") +
  scale_color_manual(values = c("cyan", "gray"))

plot3

####Statement of Null & Alternative Hypotheses:
#H0: mu1 - mu2 = 0
#H0: The difference between the mean COVID-19 incidence rates per 100,000 people in Austria and Latvia the same. 
#H0: mu1 - mu2 != 0
#H0: The difference between the mean COVID-19 incidence rates per 100,000 people in Austria and Latvia is the not same.####

austria <- subset(data, countriesAndTerritories == "Austria")
latvia <- subset(data, countriesAndTerritories == "Latvia")

base::cat("Austria Skewness:",
    rockchalk::skewness(austria$incidence_rate),"\nAustriaKurtosis:",
    rockchalk::kurtosis(austria$incidence_rate), "\nLatviaSkewness:",
    rockchalk::skewness(latvia$incidence_rate),"\nLatviaKurtosis:",
    rockchalk::kurtosis(latvia$incidence_rate))
## Austria Skewness: 2.804071 
## AustriaKurtosis: 8.618152 
## LatviaSkewness: 3.770176 
## LatviaKurtosis: 14.84807
####I will use the T-test to test my hypothesis, but since the population variance is unknown, the difference between the two means is also unknown. The T-test is based on samples of independent data, and it assumes that the population is normally distributed. However, the skewness and kurtosis checks (above) show that both samples have a right skew and high peakness, and therefore are not normally distributed for incidence rates. To address this, we will take the log base-10 of the incidence rates. Before that, we need to check if the sample variances are equal using the F statistic.

#There are certain cases where there are zero incidence rates or numbers. In such cases, performing a log base-10 calculation will result in an undefined number. These instances should be kept as zero, while for all others, log base-10 should be applied.

data$L_inc <- numeric(nrow(data))

k <- 1 
for (k in 1:nrow(data)) {
  if (data$cases[k] <= 0 | is.na(data$cases[k])){data$L_inc[k] <- 0}
  else {data$L_inc[k] <- log10(data$incidence_rate[k])}
  k <- k+1}

Austria <- subset(data, countriesAndTerritories == "Austria")
Latvia <- subset(data, countriesAndTerritories == "Latvia")

par(mfrow = c(2,1))

plot(Austria$L_inc, main = "Austria Distribution")
plot(Latvia$L_inc, main = "Latvia Distribution")

par(mfrow = c(1,1))

base::cat("Log 10 Austria Skewness:",
    rockchalk::skewness(Austria$L_inc),"\nLog 10 AustriaKurtosis:",
    rockchalk::kurtosis(Austria$L_inc), "\nLog 10 LatviaSkewness:",
    rockchalk::skewness(Latvia$L_inc),"\nLog 10 LatviaKurtosis:",
    rockchalk::kurtosis(Latvia$L_inc))
## Log 10 Austria Skewness: -0.6441102 
## Log 10 AustriaKurtosis: -0.03453257 
## Log 10 LatviaSkewness: -0.6409188 
## Log 10 LatviaKurtosis: -0.3141158
####Based on the findings, it seems that the skewness and kurtosis values for both Austria and Latvia are different from zero. A positive skewness value indicates that the distribution is skewed to the right, while a negative skewness value indicates that the distribution is skewed to the left. A kurtosis value greater than 3 indicates that the distribution has heavier tails than a normal distribution, while a kurtosis value less than 3 indicates that the distribution has lighter tails than a normal distribution.The log 10 skewness and kurtosis values for both countries are negative, which suggests that the distributions are slightly skewed to the left and have lighter tails than a normal distribution. 


#H0: variance-1 = variance-2
#Ha: variance-1 != variance-2
#Let alpha = .05

df_num <- nrow(Austria) - 1
df_denom <- nrow(Latvia) - 1
cv_upper <- qf(p=.025, df1=df_num, df2 = df_denom, lower.tail = FALSE)
cv_lower <- qf(p=.0975, df1=df_num, df2 = df_denom, lower.tail = TRUE)

f <- var(Austria$L_inc)/var(Latvia$L_inc) 
{
 
  if (f>cv_upper | f<cv_lower) {
   
   print ("Reject the null hypothesis. Variances are notequal.")

} else {
    
  print ("Fail to reject the null hypothesis. Variances are equal.")} 
}
## [1] "Reject the null hypothesis. Variances are notequal."
t.test(Austria$L_inc, Latvia$L_inc,alternative = "two.sided", mu = 0, paired = FALSE, var.equal = TRUE,conf.level = 0.95)
## 
##  Two Sample t-test
## 
## data:  Austria$L_inc and Latvia$L_inc
## t = 4.8126, df = 1939, p-value = 1.605e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1189837 0.2826573
## sample estimates:
## mean of x mean of y 
##  1.223009  1.022189
####The output of the two-sample t-test suggests that there is a statistically significant difference between the mean values of the L_inc variable for Austria and Latvia (t = 4.8126, df = 1939, p-value = 1.605e-06). The 95% confidence interval for the difference in means is (0.1189837, 0.2826573), which suggests that the mean value of L_inc for Austria is higher than that of Latvia. After conducting two sample T-tests with an alpha level of .05, we reject the null hypothesis, indicating a significant difference in mean between Austria and Latvia.

3. Correlation

Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:

par(mfrow = c(2,1))

plot(x = data$incidence_rate, y = data$fatality_rate, xlab = "Incidence Rates", ylab = "Fatality Rates", main = "COVID-19 Incidence Rates vs. Fatality Rates", col = "steelblue")

#A secondary plot will be created to define x and y limits, allowing for a better understanding of correlation without the influence of outliers.

plot(x = data$incidence_rate, y = data$fatality_rate, xlab = "Incidence Rates", ylab = "Fatality Rates", main = "COVID-19 Incidence Rates vs. Fatality Rates - without Outliers", col = "pink4", xlim = c(0,1000), ylim = c(0,15))

par(mfrow = c(1,1))

#The Pearson correlation coefficient (PCC), which will be utilized, is a measure of linear correlation between two sets of data. It is calculated as the ratio between the covariance of two variables and the product of their standard deviations. This results in a normalized measurement of covariance that always falls between -1 and 1. However, it is important to note that PCC only reflects linear correlation and cannot account for other types of relationships or correlations. My hypothesis is as follows, the increase in COVID incidents, there will be an increase of COVID fatalities. Pearson correlation coefficient. (2023, November 6). In Wikipedia. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient 

cor.test(x = data$incidence_rate, y = data$fatality_rate, alternative = "greater", method = "pearson", conf.level - .95)
## 
##  Pearson's product-moment correlation
## 
## data:  data$incidence_rate and data$fatality_rate
## t = 19.134, df = 28311, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.1033278 1.0000000
## sample estimates:
##       cor 
## 0.1129893
cor_cof <- cor(x = data$incidence_rate, y = data$fatality_rate, method = "pearson", use = "complete.obs")
cat("The Correlation Coefficient with the Pearson Method for Incidence Rate vs. Fatality Rate:", cor_cof)
## The Correlation Coefficient with the Pearson Method for Incidence Rate vs. Fatality Rate: 0.1129893

4. Regression

Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.

Code is given below defining a new data frame, ‘model_df,’ which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our ’model_df” data frame.

# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.

# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.

twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
                      "Finland", "France", "Germany", "Hungary", "Ireland",
                      "Latvia", "Lithuania", "Malta", "Norway", "Poland",
                      "Portugal", "Romania", "Slovakia", "Spain", "Sweden")

sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
           70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
           498511, 450295)

gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
             71, 78, 65, 71, 91, 120)

model_df <- data %>%
  select(c(countriesAndTerritories, popData2020)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  distinct(countriesAndTerritories, .keep_all = TRUE) %>%
  add_column(sq_km, gdp_pps) %>%
  mutate(pop_dens = popData2020 / sq_km) %>%
  rename(country = countriesAndTerritories, pop = popData2020)

Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.

### The following code will be removed for students to complete the work themselves.

total_cases <- data %>%
  select(c(countriesAndTerritories, cases)) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  select(total_cases)

model_df <- model_df %>%
  add_column(total_cases)

Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).

At minimum, your modeling work should including the following:

glimpse(model_df)
## Rows: 20
## Columns: 6
## $ country     <fct> Austria, Belgium, Bulgaria, Cyprus, Denmark, Finland, Fran…
## $ pop         <int> 8901064, 11522440, 6951482, 888005, 5822763, 5525292, 6732…
## $ sq_km       <dbl> 83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386,…
## $ gdp_pps     <dbl> 128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100…
## $ pop_dens    <dbl> 106.14448, 377.66109, 62.62935, 95.99016, 130.86919, 16.34…
## $ total_cases <int> 5402162, 4597870, 1271735, 599500, 3221572, 1335318, 36952…
cat("\n\n")
summary(model_df)
##      country        pop               sq_km           gdp_pps     
##  Austria : 1   Min.   :  514564   Min.   :   316   Min.   : 51.0  
##  Belgium : 1   1st Qu.: 5266795   1st Qu.: 60701   1st Qu.: 71.0  
##  Bulgaria: 1   Median : 7926273   Median : 90723   Median : 95.5  
##  Cyprus  : 1   Mean   :17305840   Mean   :192118   Mean   :100.2  
##  Denmark : 1   3rd Qu.:13474040   3rd Qu.:342955   3rd Qu.:120.8  
##  Finland : 1   Max.   :83166711   Max.   :551695   Max.   :190.0  
##  (Other) :14                                                      
##     pop_dens        total_cases      
##  Min.   :  13.94   Min.   :  115285  
##  1st Qu.:  57.67   1st Qu.: 1319422  
##  Median : 100.50   Median : 2570210  
##  Mean   : 179.14   Mean   : 6496297  
##  3rd Qu.: 121.55   3rd Qu.: 5430242  
##  Max.   :1628.37   Max.   :36952159  
## 
rs <- lm(total_cases~pop+pop_dens+gdp_pps, data = model_df)

summary(rs)
## 
## Call:
## lm(formula = total_cases ~ pop + pop_dens + gdp_pps, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8290586 -1092896   -81818  1657083  8955212 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.875e+06  2.787e+06  -1.390    0.183    
## pop          4.285e-01  3.676e-02  11.658 3.13e-09 ***
## pop_dens     6.502e+02  2.432e+03   0.267    0.793    
## gdp_pps      2.834e+04  2.559e+04   1.108    0.284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3706000 on 16 degrees of freedom
## Multiple R-squared:  0.8963, Adjusted R-squared:  0.8769 
## F-statistic:  46.1 on 3 and 16 DF,  p-value: 4.251e-08
####After analyzing the model fit, we found that the model is statistically significant overall. However, when we looked at individual coefficients, we realized that only the 'population' coefficient is statistically significant. This is logical because more cases are expected with larger populations. It is important to note that while this coefficient estimate is low, it carries the most statistical significance in the model.

####The Adjusted r-squared value of the model is 0.8963, indicating a relatively high fit overall. In my opinion, it would be better to remove the gdp, which has a high contribution to the model fit but is not statistically significant.

The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:

# The code below defines our 'newdata' data frame for applying our model to the
# population, population density and GDP per capita for two (2). Please execute
# the code as given.

newdata <- data.frame(country = c("Luxembourg", "Netherlands"),
                      pop = c(626108, 17407585),
                      gdp_pps = c(261, 130),
                      pop_dens = c(626108, 17407585) / c(2586, 41540))

# Add code here returning the actual  total cases from our dataset for the
# Netherlands and Luxembourg.

total_cases <- data%>%
  select(c(countriesAndTerritories, cases))%>%
  group_by(countriesAndTerritories)%>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE))%>%
  filter(countriesAndTerritories %in% c("Luxembourg", "Netherlands"))%>%
  select(total_cases)

newdata <- newdata%>%
  add_column(total_cases)

cat("Total Cases Luxembourg:", newdata$total_cases[newdata$country == "Luxembourg"],
    "\nTotal Cases Netherlands:", newdata$total_cases[newdata$country == "Netherlands"])
## Total Cases Luxembourg: 301031 
## Total Cases Netherlands: 8494705
# Add code here returning the total cases for the Netherlands and Luxembourg
# predicted by our model.

myforecast <- -3.875e+06+4.285e-01*newdata$pop+6.502e+02*newdata$pop_dens+2.834e+04*newdata$gdp_pps
cat("\n\nForecasted Classes Luxembourg:", myforecast[newdata$country == "Luxembourg"], "\n\nForecasted Cases Netherlands:", myforecast[newdata$country == "Netherlands"])
## 
## 
## Forecasted Classes Luxembourg: 3947450 
## 
## Forecasted Cases Netherlands: 7540820
cat("\n\n% Forecasted vs. Actual Cases Luxembourg:", myforecast[newdata$country == "Luxembourg"]/newdata$total_cases[newdata$country == "Luxembourg"], "%\n% Forecasted vs. Actual Cases Netherlands:", myforecast[newdata$country == "Netherlands"]/newdata$total_cases[newdata$country == "Netherlands"], "%")
## 
## 
## % Forecasted vs. Actual Cases Luxembourg: 13.1131 %
## % Forecasted vs. Actual Cases Netherlands: 0.8877083 %
####The model's forecasting for Luxembourg was inaccurate because it considered the GDP, which resulted in a poor prediction. To improve the forecasting report for both Luxembourg and Netherlands, I suggest removing the GDP value from the model.

#Removal of gdp from model

myforecast_withoutgdp <- -1.064e+06+4.295e-01*newdata$pop+7.110+02*newdata$pop_dens
cat("\n\nForecasted Cases Luxembourg:", myforecast_withoutgdp[newdata$country == "Luxembourg"], "\nForecasted Cases Netherlands:", myforecast_withoutgdp[newdata$country == "Netherlands"])
## 
## 
## Forecasted Cases Luxembourg: -794595.3 
## Forecasted Cases Netherlands: 6413403
cat("\n\n% Forecasted vs. Actual Cases Luxembourg:", myforecast_withoutgdp[newdata$country == "Luxembourg"]/newdata$total_cases[newdata$country == "Luxembourg"], "%\n% Forecasted vs. Actual Cases Netherlands:", myforecast_withoutgdp[newdata$country == "Netherlands"]/newdata$total_cases[newdata$country == "Netherlands"],"%")
## 
## 
## % Forecasted vs. Actual Cases Luxembourg: -2.63958 %
## % Forecasted vs. Actual Cases Netherlands: 0.7549883 %
####The Forecasted vs. Actual Cases in Luxembourg, returned a negative percentage of -2.63958%, a bit over a 10.5% difference from the previous prediction analysis. We should include another variable to create a better model for prediction analysis.