2023-06-11

Overview

The overarching goal of this presentation is to perform a correlation analysis on a few of the variables available in the dataset. It uses the Global AI Index hosted on Kaggle, which scores countries around the world on several aspects that contribute to each country's total AI advancement score.

The Global AI Index data below is what we'll be working with:

str(ai_data)
spc_tbl_ [62 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Country              : chr [1:62] "United States of America" "China" "United Kingdom" "Canada" ...
 $ Talent               : num [1:62] 100 16.5 39.6 31.3 35.8 ...
 $ Infrastructure       : num [1:62] 94 100 71.4 77 67.6 ...
 $ Operating Environment: num [1:62] 64.6 91.6 74.7 93.9 82.4 ...
 $ Research             : num [1:62] 100 71.4 36.5 30.7 32.6 ...
 $ Development          : num [1:62] 100 80 25 25.8 28 ...
 $ Government Strategy  : num [1:62] 77.4 94.9 82.8 100 43.9 ...
 $ Commercial           : num [1:62] 100 44 18.9 14.9 27.3 ...
 $ Total score          : num [1:62] 100 62.9 40.9 40.2 39.9 ...
 $ Region               : chr [1:62] "Americas" "Asia-Pacific" "Europe" "Americas" ...
 $ Cluster              : chr [1:62] "Power players" "Power players" "Traditional champions" "Traditional champions" ...
 $ Income group         : chr [1:62] "High" "Upper middle" "High" "High" ...
 $ Political regime     : chr [1:62] "Liberal democracy" "Closed autocracy" "Liberal democracy" "Liberal democracy" ...
 - attr(*, "spec")=
  .. cols(
  ..   Country = col_character(),
  ..   Talent = col_double(),
  ..   Infrastructure = col_double(),
  ..   `Operating Environment` = col_double(),
  ..   Research = col_double(),
  ..   Development = col_double(),
  ..   `Government Strategy` = col_double(),
  ..   Commercial = col_double(),
  ..   `Total score` = col_double(),
  ..   Region = col_character(),
  ..   Cluster = col_character(),
  ..   `Income group` = col_character(),
  ..   `Political regime` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Data Cleanup

To display every country that has AI advancement data, I also use a second dataset containing the geodetic coordinates (latitude and longitude) of all countries. Because these are two separate datasets, I merge them below for ease of use.

library(dplyr)  # provides filter() and select()

# Fix country names that differ between the two datasets
geodetic_data$country[geodetic_data$country == "Netherlands"] <- "The Netherlands"
geodetic_data$country[geodetic_data$country == "United States"] <- "United States of America"
# Rename the first column of ai_data so both datasets share the key "country"
names(ai_data)[1] <- "country"
# Geodetic rows for countries present in ai_data
# (computed after the name fixes so the renamed countries are matched)
dataInAi <- filter(geodetic_data, country %in% ai_data$country)
# Merge the coordinate columns (2:4) of geodetic_data into ai_data
merged <- merge(select(geodetic_data, 2:4), ai_data, by = "country")
str(merged)
'data.frame':   62 obs. of  15 variables:
 $ country              : chr  "Argentina" "Armenia" "Australia" "Austria" ...
 $ latitude             : num  -38.4 40.1 -25.3 47.5 25.9 ...
 $ longitude            : num  -63.6 45 133.8 14.6 50.6 ...
 $ Talent               : num  8.4 6.69 25.43 16.97 4.99 ...
 $ Infrastructure       : num  56.1 37.8 63.4 64.5 60.4 ...
 $ Operating Environment: num  76 58.4 61.2 76.3 60.9 ...
 $ Research             : num  1.25 0.28 32.63 23.56 2.53 ...
 $ Development          : num  3.19 0.33 41.15 17.81 0 ...
 $ Government Strategy  : num  54.9 14.4 82.1 72.1 17.7 ...
 $ Commercial           : num  0.34 1.37 6.72 3.08 0.24 ...
 $ Total score          : num  15.24 8.49 33.86 26.89 11.79 ...
 $ Region               : chr  "Americas" "Europe" "Asia-Pacific" "Europe" ...
 $ Cluster              : chr  "Waking up" "Waking up" "Rising stars" "Waking up" ...
 $ Income group         : chr  "Upper middle" "Upper middle" "High" "High" ...
 $ Political regime     : chr  "Electoral democracy" "Electoral democracy" "Liberal democracy" "Electoral democracy" ...
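Name mismatches like the two fixed above are easy to miss, because merge() silently drops rows whose keys do not match in both datasets. Here is a minimal sketch of how such mismatches could be detected with setdiff() before merging. It uses two small hypothetical data frames (toy_ai and toy_geo) whose scores and coordinates are placeholders, not values from the real datasets.

```r
# Hypothetical toy data; the numbers are placeholders, not real scores or coordinates
toy_ai <- data.frame(country = c("Canada", "The Netherlands"),
                     score = c(1, 2),
                     stringsAsFactors = FALSE)
toy_geo <- data.frame(country = c("Canada", "Netherlands"),
                      latitude = c(10, 20),
                      longitude = c(30, 40),
                      stringsAsFactors = FALSE)

# Countries in the AI data with no exact name match in the coordinate data
unmatched <- setdiff(toy_ai$country, toy_geo$country)
print(unmatched)

# Fix the mismatch, then merge; merge() keeps only keys present in both
toy_geo$country[toy_geo$country == "Netherlands"] <- "The Netherlands"
toy_merged <- merge(toy_geo, toy_ai, by = "country")
print(nrow(toy_merged))
```

Running the same setdiff() call on the real country columns would list any names that still need to be reconciled.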

Geo Data Points

This shows all countries for which we have AI advancement data, using the merged dataset.

Data Visualization

This shows a global heat map of the countries in the Global AI Index, colored by total AI advancement score.

Data Visualization (cont.)

This is a graph of the Average Total Score by Political Regime, showing how the total AI advancement index score varies with political regime.
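As a rough illustration of the numbers behind a chart like this, the sketch below computes a group mean with tapply(). It assumes the averages are plain per-group means, and it uses only the first four rows of the dataset as displayed (rounded) in the str(ai_data) output, so the results are not the plotted values. The names sample_scores, total_score and regime are made up for this example.

```r
# First four rows as displayed (rounded) by str(ai_data); not the full dataset
sample_scores <- data.frame(
  total_score = c(100, 62.9, 40.9, 40.2),
  regime = c("Liberal democracy", "Closed autocracy",
             "Liberal democracy", "Liberal democracy"),
  stringsAsFactors = FALSE
)

# Mean total score within each political regime
avg_by_regime <- tapply(sample_scores$total_score, sample_scores$regime, mean)
print(avg_by_regime)
```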

Data Visualization (cont.)

This is a graph of the Average Total Score by Cluster, showing how the total AI advancement index score varies across clusters.

Correlation Analysis

Correlation is a statistical measure of the relationship between two variables. It quantifies the strength and direction of that relationship. The correlation coefficient ranges from -1 to +1.

Types of Correlation

  • Positive Correlation
    • Positive correlation occurs when both variables increase or decrease together and the correlation coefficient (r) is greater than 0.
  • Negative Correlation
    • Negative correlation occurs when one variable increases while the other decreases and the correlation coefficient (r) is less than 0.
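To make the two cases concrete, here is a small sketch with made-up vectors (x, y_up and y_down are illustrative only and unrelated to the AI Index data). One pair rises together, and in the other pair one variable falls as the other rises.

```r
# Made-up illustrative data, unrelated to the AI Index
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 5, 4, 6)   # tends to increase with x
y_down <- c(9, 7, 6, 4, 1)   # tends to decrease as x increases

r_pos <- cor(x, y_up)    # expected sign: positive
r_neg <- cor(x, y_down)  # expected sign: negative
print(c(positive = r_pos, negative = r_neg))
```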

Correlation Analysis of Research and Total Score

Of the scored areas in the AI Index, I expect Research and Total Score to have a very high correlation. The scatter plot here shows this visually.

Correlation Analysis (cont.)

To calculate the correlation coefficient, denoted by \(r\), we will use this correlation formula: \(r = \frac{\text{cov}(x, y)}{\text{sd}(x) \cdot \text{sd}(y)}\), in which cov() is the covariance and sd() is the standard deviation.
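For reference, the two ingredients of this formula expand as follows for \(n\) observations with sample means \(\bar{x}\) and \(\bar{y}\). These are the sample versions with an \(n - 1\) denominator, which is what R's cov() and sd() compute.

\(\text{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\)

\(\text{sd}(x) = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\)

The \(n - 1\) factors cancel in the ratio, so \(r\) can also be written as

\(r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}\)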

Using R, the correlation coefficient is:

cor(ai_data$Research, ai_data$`Total score`)
[1] 0.9458771

Since the correlation is positive and very close to 1, the Research and Total Score values of the AI Index have a very strong positive linear relationship.

Correlation Formula Proof

To verify that the earlier correlation formula gives the same result, we will also find the covariance and the standard deviations of the Research and Total Score values.

Using R, the covariance is:

# Store the result under its own name so the cov() function is not shadowed
cov_xy <- cov(ai_data$Research, ai_data$`Total score`)
cov_xy
[1] 249.1081

Using R, the standard deviations are:

sd_research <- sd(ai_data$Research)
sd_totalScore <- sd(ai_data$`Total score`)
paste0("(", sd_research, ", ", sd_totalScore, ")")
[1] "(17.4139958785747, 15.1235859350083)"

Now substitute these values into the formula:

\(r = \frac{\text{cov}(x, y)}{\text{sd}(x) \cdot \text{sd}(y)}\)

\(r = \frac{249.1081361}{17.4139959 \cdot 15.1235859}\)

\(r = 0.9458771\)

The result matches the value returned by cor(), which confirms both the correlation formula and the correlation coefficient.
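The same arithmetic can be repeated in R without the dataset. The sketch below uses only the numbers printed on this slide; the variable names are chosen for illustration, and the comparison allows for rounding in the printed inputs.

```r
# Values copied from the R output on this slide
cov_xy         <- 249.1081361
sd_research    <- 17.4139958785747
sd_total_score <- 15.1235859350083

# r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov_xy / (sd_research * sd_total_score)
print(round(r_manual, 7))

# Compare with the value reported by cor(), allowing for rounding of the inputs
matches_cor <- isTRUE(all.equal(r_manual, 0.9458771, tolerance = 1e-6))
print(matches_cor)
```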

Linear Regression

To show the strong relationship between those two variables, here is a linear regression model of Total Score on Research from the AI Index.

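# Fit Total score as a linear function of Research
mod <- lm(`Total score` ~ Research, data = ai_data)
intercept <- coef(mod)[1]          # estimated intercept
slope <- coef(mod)[2]              # estimated slope
model_residuals <- residuals(mod)  # observed minus fitted values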

The line equation for this linear regression is \(y = \beta_0 + \beta_1 x + \varepsilon\), where \(y\) is the Total Score, \(x\) is the Research score, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the residual error term.

After substituting the coefficients calculated above, the equation for the linear regression becomes

\(y = 10.27 + 0.82x + \varepsilon\)
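The slope ties back to the correlation analysis. With a single predictor, the least-squares slope equals cov(x, y) / var(x), or equivalently r · sd(y) / sd(x). The sketch below checks this using only the covariance and standard deviation printed earlier. The helper predict_total and the input value 50 are hypothetical, added only to show how the rounded equation would be used.

```r
# Values copied from the R output earlier in this presentation
cov_xy      <- 249.1081361
sd_research <- 17.4139958785747

# Least-squares slope for a single predictor: cov(x, y) / var(x)
slope_manual <- cov_xy / sd_research^2
print(round(slope_manual, 2))

# Hypothetical helper: predicted Total score from the rounded fitted line
predict_total <- function(research) 10.27 + 0.82 * research
print(predict_total(50))
```

Because the coefficients are rounded to two decimals, predictions from this helper are approximate; predict(mod, ...) on the fitted model would use the unrounded estimates.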

Linear Regression Line

Here is the same scatter plot of Total Score against Research, with the fitted regression line added.

Data Set Citation