Take Home Final Exam Dukum

For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily covid-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web csv. Either approach is acceptable; the data should be the same.

Once you have defined a data frame with the daily case and death and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks, as needed. You should make comments in the code chunks or add clarifying text between code chunks that you think further your work.

A data dictionary for the dataset is available here.

Definitions:

“Incidence rate” is equal to new daily cases per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily incidence rate in item (1), for each country, that we will explore further in items (2) and (3).
“Fatality rate” is equal to new daily deaths per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily fatality rate in item (1), for each country, that we will explore further in items (2) and (3).

1. Descriptive Statistics

Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:

Creation of a vector, ‘incidence_rate,’ equal to the daily new cases per 100K individuals, per country. Country populations are provided in ‘popData2020.’ This vector should be added to the ‘data’ data frame.
Creation of a vector, ‘fatality_rate,’ equal to the new deaths per 100K individuals, per country. Country populations are provided in ‘popData2020.’ This vector should be added to the ‘data’ data frame.
A visualization exploring new cases or incidence rates, per country, over time. You may choose a subset of countries, if you wish, but your visualization should include at least five (5) countries and include the entire time frame of the dataset.
A visualization exploring new deaths or fatality rates, per country, over time. You may choose a subset of countries, if you wish, but your visualization should include at least five (5) countries.
A table or visualization exploring some other aspect of the data. For example, you could explore case fatality rates per country; the number of deaths divided by the total number of cases. Note that to do this, you would want to like across the entire time of the dataset, looking at the total cases and deaths, per country.

data$incidence_rate <- (data$cases/data$popData2020)*100000
data$fatality_rate <- (data$deaths/data$popData2020)*100000

data <- data[data$cases>=0,]
data <- data[data$deaths>=0,]

glimpse(data)

## Rows: 28,698
## Columns: 12
## $ dateRep                 <date> 2022-10-23, 2022-10-22, 2022-10-21, 2022-10-2…
## $ day                     <int> 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12…
## $ month                   <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ year                    <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022…
## $ cases                   <int> 3557, 5494, 7776, 8221, 10007, 13204, 9964, 66…
## $ deaths                  <int> 0, 4, 4, 6, 8, 7, 8, 12, 6, 10, 14, 13, 11, 10…
## $ countriesAndTerritories <fct> Austria, Austria, Austria, Austria, Austria, A…
## $ geoId                   <fct> AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT, AT…
## $ countryterritoryCode    <fct> AUT, AUT, AUT, AUT, AUT, AUT, AUT, AUT, AUT, A…
## $ popData2020             <int> 8901064, 8901064, 8901064, 8901064, 8901064, 8…
## $ incidence_rate          <dbl> 39.96151, 61.72296, 87.36034, 92.35974, 112.42…
## $ fatality_rate           <dbl> 0.00000000, 0.04493845, 0.04493845, 0.06740767…

plot1 <- ggplot(data=data,aes(x=as.character(dateRep),y=incidence_rate,group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Incidence Rates per 100,000 people per Country") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") + 
  ylab("Incidence Rate") +
  scale_fill_brewer(palette='Spectral')

plot2 <- ggplot(data=data,aes(x=as.character(dateRep),y=fatality_rate,group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Mortality Rates per 100,000 people per Country") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") + 
  ylab("Mortality Rate") +
  scale_fill_brewer(palette='Spectral')

grid.arrange(plot1, plot2, ncol=1, nrow =2)

data <- data %>%
  group_by(countriesAndTerritories) %>%
  mutate(across(cases:deaths, sum, .names = "{.col}_sum")) %>%
  ungroup()

aggs <- aggregate(cbind(deaths, cases) ~ countriesAndTerritories, data = data, FUN= sum, na.rm = TRUE)

aggs$Case_Fatality_Percent <- (aggs$deaths/aggs$cases)*100

2. Inferential Statistics

Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:

Visualization(s) comparing the daily incidence or fatality rates of the selected countries,
A statement of the null hypothesis.
A short justification of the statistical test selected.
- Why is the test you selected an appropriate one for the comparison we’re making?
A brief discussion of any distributional assumptions of that test.
- Does the statistical test we selected require assumptions about our data?
- If so, does our data satisfy those assumptions?
Your selected alpha.
The test function output; i.e. the R output.
The relevant confidence interval, if not returned by the R test output.
A concluding statement on the outcome of the statistical test.
- i.e. Based on our selected alpha, do we reject or fail to reject our null hypothesis?

newdata <- subset(data, countriesAndTerritories == "Austria" | countriesAndTerritories == "Latvia")

plot3 <- ggplot(data = newdata, aes(x=as.character(dateRep), y=incidence_rate, group=countriesAndTerritories)) +
  geom_point(aes(color=countriesAndTerritories)) +
  ggtitle("COVID-19 Incidence Rates in Austria and Latvia - per 100,000 people") +
  theme(plot.title = element_text(hjust = .5)) +
  xlab("Time") +
  ylab("Incidence Rate") +
  scale_color_manual(values = c("cyan", "gray"))

plot3

####Statement of Null & Alternative Hypotheses:
#H0: mu1 - mu2 = 0
#H0: The difference between the mean COVID-19 incidence rates per 100,000 people in Austria and Latvia the same. 
#H0: mu1 - mu2 != 0
#H0: The difference between the mean COVID-19 incidence rates per 100,000 people in Austria and Latvia is the not same.####

austria <- subset(data, countriesAndTerritories == "Austria")
latvia <- subset(data, countriesAndTerritories == "Latvia")

base::cat("Austria Skewness:",
    rockchalk::skewness(austria$incidence_rate),"\nAustriaKurtosis:",
    rockchalk::kurtosis(austria$incidence_rate), "\nLatviaSkewness:",
    rockchalk::skewness(latvia$incidence_rate),"\nLatviaKurtosis:",
    rockchalk::kurtosis(latvia$incidence_rate))

## Austria Skewness: 2.804071 
## AustriaKurtosis: 8.618152 
## LatviaSkewness: 3.770176 
## LatviaKurtosis: 14.84807

####I will use the T-test to test my hypothesis, but since the population variance is unknown, the difference between the two means is also unknown. The T-test is based on samples of independent data, and it assumes that the population is normally distributed. However, the skewness and kurtosis checks (above) show that both samples have a right skew and high peakness, and therefore are not normally distributed for incidence rates. To address this, we will take the log base-10 of the incidence rates. Before that, we need to check if the sample variances are equal using the F statistic.

#There are certain cases where there are zero incidence rates or numbers. In such cases, performing a log base-10 calculation will result in an undefined number. These instances should be kept as zero, while for all others, log base-10 should be applied.

data$L_inc <- numeric(nrow(data))

k <- 1 
for (k in 1:nrow(data)) {
  if (data$cases[k] <= 0 | is.na(data$cases[k])){data$L_inc[k] <- 0}
  else {data$L_inc[k] <- log10(data$incidence_rate[k])}
  k <- k+1}

Austria <- subset(data, countriesAndTerritories == "Austria")
Latvia <- subset(data, countriesAndTerritories == "Latvia")

par(mfrow = c(2,1))

plot(Austria$L_inc, main = "Austria Distribution")
plot(Latvia$L_inc, main = "Latvia Distribution")

par(mfrow = c(1,1))

base::cat("Log 10 Austria Skewness:",
    rockchalk::skewness(Austria$L_inc),"\nLog 10 AustriaKurtosis:",
    rockchalk::kurtosis(Austria$L_inc), "\nLog 10 LatviaSkewness:",
    rockchalk::skewness(Latvia$L_inc),"\nLog 10 LatviaKurtosis:",
    rockchalk::kurtosis(Latvia$L_inc))

## Log 10 Austria Skewness: -0.6441102 
## Log 10 AustriaKurtosis: -0.03453257 
## Log 10 LatviaSkewness: -0.6409188 
## Log 10 LatviaKurtosis: -0.3141158

####Based on the findings, it seems that the skewness and kurtosis values for both Austria and Latvia are different from zero. A positive skewness value indicates that the distribution is skewed to the right, while a negative skewness value indicates that the distribution is skewed to the left. A kurtosis value greater than 3 indicates that the distribution has heavier tails than a normal distribution, while a kurtosis value less than 3 indicates that the distribution has lighter tails than a normal distribution.The log 10 skewness and kurtosis values for both countries are negative, which suggests that the distributions are slightly skewed to the left and have lighter tails than a normal distribution. 


#H0: variance-1 = variance-2
#Ha: variance-1 != variance-2
#Let alpha = .05

df_num <- nrow(Austria) - 1
df_denom <- nrow(Latvia) - 1
cv_upper <- qf(p=.025, df1=df_num, df2 = df_denom, lower.tail = FALSE)
cv_lower <- qf(p=.0975, df1=df_num, df2 = df_denom, lower.tail = TRUE)

f <- var(Austria$L_inc)/var(Latvia$L_inc) 
{
 
  if (f>cv_upper | f<cv_lower) {
   
   print ("Reject the null hypothesis. Variances are notequal.")

} else {
    
  print ("Fail to reject the null hypothesis. Variances are equal.")} 
}

## [1] "Reject the null hypothesis. Variances are notequal."

t.test(Austria$L_inc, Latvia$L_inc,alternative = "two.sided", mu = 0, paired = FALSE, var.equal = TRUE,conf.level = 0.95)

## 
##  Two Sample t-test
## 
## data:  Austria$L_inc and Latvia$L_inc
## t = 4.8126, df = 1939, p-value = 1.605e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1189837 0.2826573
## sample estimates:
## mean of x mean of y 
##  1.223009  1.022189

####The output of the two-sample t-test suggests that there is a statistically significant difference between the mean values of the L_inc variable for Austria and Latvia (t = 4.8126, df = 1939, p-value = 1.605e-06). The 95% confidence interval for the difference in means is (0.1189837, 0.2826573), which suggests that the mean value of L_inc for Austria is higher than that of Latvia. After conducting two sample T-tests with an alpha level of .05, we reject the null hypothesis, indicating a significant difference in mean between Austria and Latvia.

3. Correlation

Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:

Visualization(s) showing the distributions of daily incidence and fatality rates, regardless of country. Please note that both country and date should be disregarded here.
A short statement identifying the most appropriate correlation coefficient.
- For the correlation we’re interested in, which correlation coefficient is most appropriate?
- Why do you find the correlation coefficient selected to be the most appropriate?
The calculated correlation coefficient or coefficient test output; e.g. cor() or cor.test().

par(mfrow = c(2,1))

plot(x = data$incidence_rate, y = data$fatality_rate, xlab = "Incidence Rates", ylab = "Fatality Rates", main = "COVID-19 Incidence Rates vs. Fatality Rates", col = "steelblue")

#A secondary plot will be created to define x and y limits, allowing for a better understanding of correlation without the influence of outliers.

plot(x = data$incidence_rate, y = data$fatality_rate, xlab = "Incidence Rates", ylab = "Fatality Rates", main = "COVID-19 Incidence Rates vs. Fatality Rates - without Outliers", col = "pink4", xlim = c(0,1000), ylim = c(0,15))

par(mfrow = c(1,1))

#The Pearson correlation coefficient (PCC), which will be utilized, is a measure of linear correlation between two sets of data. It is calculated as the ratio between the covariance of two variables and the product of their standard deviations. This results in a normalized measurement of covariance that always falls between -1 and 1. However, it is important to note that PCC only reflects linear correlation and cannot account for other types of relationships or correlations. My hypothesis is as follows, the increase in COVID incidents, there will be an increase of COVID fatalities. Pearson correlation coefficient. (2023, November 6). In Wikipedia. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient 

cor.test(x = data$incidence_rate, y = data$fatality_rate, alternative = "greater", method = "pearson", conf.level - .95)

## 
##  Pearson's product-moment correlation
## 
## data:  data$incidence_rate and data$fatality_rate
## t = 19.134, df = 28311, p-value < 2.2e-16
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.1033278 1.0000000
## sample estimates:
##       cor 
## 0.1129893

cor_cof <- cor(x = data$incidence_rate, y = data$fatality_rate, method = "pearson", use = "complete.obs")
cat("The Correlation Coefficient with the Pearson Method for Incidence Rate vs. Fatality Rate:", cor_cof)

## The Correlation Coefficient with the Pearson Method for Incidence Rate vs. Fatality Rate: 0.1129893

4. Regression

Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.

Code is given below defining a new data frame, ‘model_df,’ which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our ’model_df” data frame.

# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.

# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.

twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
                      "Finland", "France", "Germany", "Hungary", "Ireland",
                      "Latvia", "Lithuania", "Malta", "Norway", "Poland",
                      "Portugal", "Romania", "Slovakia", "Spain", "Sweden")

sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
           70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
           498511, 450295)

gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
             71, 78, 65, 71, 91, 120)

model_df <- data %>%
  select(c(countriesAndTerritories, popData2020)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  distinct(countriesAndTerritories, .keep_all = TRUE) %>%
  add_column(sq_km, gdp_pps) %>%
  mutate(pop_dens = popData2020 / sq_km) %>%
  rename(country = countriesAndTerritories, pop = popData2020)

Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.

### The following code will be removed for students to complete the work themselves.

total_cases <- data %>%
  select(c(countriesAndTerritories, cases)) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  select(total_cases)

model_df <- model_df %>%
  add_column(total_cases)

Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).

At minimum, your modeling work should including the following:

A description - either narrative or using R output - of your ‘model_df’ data frame.
- Consider: what data types are present? What do our rows and columns represent?
The lm() summary() output of your fitted model. As we did in the second Data Analysis Assignment, you can pass your fitted model object - i.e. the output of lm() - to summary() and get additional details, including R^2, on your model fit.
A short statement on the fit of the model.
- Which, if any, of our coefficients are statistically significant?
- What is the R^2 of our model?
- Should we consider a reduced model; i.e. one with fewer parameters?

glimpse(model_df)

## Rows: 20
## Columns: 6
## $ country     <fct> Austria, Belgium, Bulgaria, Cyprus, Denmark, Finland, Fran…
## $ pop         <int> 8901064, 11522440, 6951482, 888005, 5822763, 5525292, 6732…
## $ sq_km       <dbl> 83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386,…
## $ gdp_pps     <dbl> 128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100…
## $ pop_dens    <dbl> 106.14448, 377.66109, 62.62935, 95.99016, 130.86919, 16.34…
## $ total_cases <int> 5402162, 4597870, 1271735, 599500, 3221572, 1335318, 36952…

cat("\n\n")

summary(model_df)

##      country        pop               sq_km           gdp_pps     
##  Austria : 1   Min.   :  514564   Min.   :   316   Min.   : 51.0  
##  Belgium : 1   1st Qu.: 5266795   1st Qu.: 60701   1st Qu.: 71.0  
##  Bulgaria: 1   Median : 7926273   Median : 90723   Median : 95.5  
##  Cyprus  : 1   Mean   :17305840   Mean   :192118   Mean   :100.2  
##  Denmark : 1   3rd Qu.:13474040   3rd Qu.:342955   3rd Qu.:120.8  
##  Finland : 1   Max.   :83166711   Max.   :551695   Max.   :190.0  
##  (Other) :14                                                      
##     pop_dens        total_cases      
##  Min.   :  13.94   Min.   :  115285  
##  1st Qu.:  57.67   1st Qu.: 1319422  
##  Median : 100.50   Median : 2570210  
##  Mean   : 179.14   Mean   : 6496297  
##  3rd Qu.: 121.55   3rd Qu.: 5430242  
##  Max.   :1628.37   Max.   :36952159  
##

rs <- lm(total_cases~pop+pop_dens+gdp_pps, data = model_df)

summary(rs)

## 
## Call:
## lm(formula = total_cases ~ pop + pop_dens + gdp_pps, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8290586 -1092896   -81818  1657083  8955212 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.875e+06  2.787e+06  -1.390    0.183    
## pop          4.285e-01  3.676e-02  11.658 3.13e-09 ***
## pop_dens     6.502e+02  2.432e+03   0.267    0.793    
## gdp_pps      2.834e+04  2.559e+04   1.108    0.284    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3706000 on 16 degrees of freedom
## Multiple R-squared:  0.8963, Adjusted R-squared:  0.8769 
## F-statistic:  46.1 on 3 and 16 DF,  p-value: 4.251e-08

####After analyzing the model fit, we found that the model is statistically significant overall. However, when we looked at individual coefficients, we realized that only the 'population' coefficient is statistically significant. This is logical because more cases are expected with larger populations. It is important to note that while this coefficient estimate is low, it carries the most statistical significance in the model.

####The Adjusted r-squared value of the model is 0.8963, indicating a relatively high fit overall. In my opinion, it would be better to remove the gdp, which has a high contribution to the model fit but is not statistically significant.

The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:

The predicted total new cases for both countries.
The actual total new cases for both countries.
A short statement on the performance of the model in these two (2) cases.
- Compare the new predictions to those made on the fitted dataset. You may compare the predicted values or the residuals.

# The code below defines our 'newdata' data frame for applying our model to the
# population, population density and GDP per capita for two (2). Please execute
# the code as given.

newdata <- data.frame(country = c("Luxembourg", "Netherlands"),
                      pop = c(626108, 17407585),
                      gdp_pps = c(261, 130),
                      pop_dens = c(626108, 17407585) / c(2586, 41540))

# Add code here returning the actual  total cases from our dataset for the
# Netherlands and Luxembourg.

total_cases <- data%>%
  select(c(countriesAndTerritories, cases))%>%
  group_by(countriesAndTerritories)%>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE))%>%
  filter(countriesAndTerritories %in% c("Luxembourg", "Netherlands"))%>%
  select(total_cases)

newdata <- newdata%>%
  add_column(total_cases)

cat("Total Cases Luxembourg:", newdata$total_cases[newdata$country == "Luxembourg"],
    "\nTotal Cases Netherlands:", newdata$total_cases[newdata$country == "Netherlands"])

## Total Cases Luxembourg: 301031 
## Total Cases Netherlands: 8494705

# Add code here returning the total cases for the Netherlands and Luxembourg
# predicted by our model.

myforecast <- -3.875e+06+4.285e-01*newdata$pop+6.502e+02*newdata$pop_dens+2.834e+04*newdata$gdp_pps
cat("\n\nForecasted Classes Luxembourg:", myforecast[newdata$country == "Luxembourg"], "\n\nForecasted Cases Netherlands:", myforecast[newdata$country == "Netherlands"])

## 
## 
## Forecasted Classes Luxembourg: 3947450 
## 
## Forecasted Cases Netherlands: 7540820

cat("\n\n% Forecasted vs. Actual Cases Luxembourg:", myforecast[newdata$country == "Luxembourg"]/newdata$total_cases[newdata$country == "Luxembourg"], "%\n% Forecasted vs. Actual Cases Netherlands:", myforecast[newdata$country == "Netherlands"]/newdata$total_cases[newdata$country == "Netherlands"], "%")

## 
## 
## % Forecasted vs. Actual Cases Luxembourg: 13.1131 %
## % Forecasted vs. Actual Cases Netherlands: 0.8877083 %

####The model's forecasting for Luxembourg was inaccurate because it considered the GDP, which resulted in a poor prediction. To improve the forecasting report for both Luxembourg and Netherlands, I suggest removing the GDP value from the model.

#Removal of gdp from model

myforecast_withoutgdp <- -1.064e+06+4.295e-01*newdata$pop+7.110+02*newdata$pop_dens
cat("\n\nForecasted Cases Luxembourg:", myforecast_withoutgdp[newdata$country == "Luxembourg"], "\nForecasted Cases Netherlands:", myforecast_withoutgdp[newdata$country == "Netherlands"])

## 
## 
## Forecasted Cases Luxembourg: -794595.3 
## Forecasted Cases Netherlands: 6413403

cat("\n\n% Forecasted vs. Actual Cases Luxembourg:", myforecast_withoutgdp[newdata$country == "Luxembourg"]/newdata$total_cases[newdata$country == "Luxembourg"], "%\n% Forecasted vs. Actual Cases Netherlands:", myforecast_withoutgdp[newdata$country == "Netherlands"]/newdata$total_cases[newdata$country == "Netherlands"],"%")

## 
## 
## % Forecasted vs. Actual Cases Luxembourg: -2.63958 %
## % Forecasted vs. Actual Cases Netherlands: 0.7549883 %

####The Forecasted vs. Actual Cases in Luxembourg, returned a negative percentage of -2.63958%, a bit over a 10.5% difference from the previous prediction analysis. We should include another variable to create a better model for prediction analysis.

Take Home Final Exam Dukum_Dian

Definitions:

1. Descriptive Statistics

2. Inferential Statistics

3. Correlation

4. Regression