Identifying Socio-Economic and Educational Factors Impacting Global Average IQ Scores

Introduction

The intelligence quotient, or IQ, is a measure of cognitive ability derived from tests of reasoning, knowledge, and mental processing speed. A higher IQ has been
linked to a lower risk of death. In this study, we look at countries' average IQ alongside data including Literacy Rate, Population, Gross National Income, and more to determine
the factors most closely related to average IQ.

Problem statement

While many studies show correlations between IQ and measures such as risk of death or career success, there is still a lack of research on the specific socio-economic
and educational factors associated with average IQ scores globally. Identifying these factors would make it possible for nations to focus on developing more intelligent
societies in the future.

Objective

Using a dataset that provides each country's average IQ as well as its literacy rate, Nobel Prizes won, Human Development Index score, Gross National Income, mean years of schooling, and population,
we will use multiple regression analysis to determine which of these factors are related to average IQ.

Methods

Step 1: Install and load required packages

install.packages("gclus", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'gclus' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
install.packages("ggpubr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggpubr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gclus)
## Warning: package 'gclus' was built under R version 4.4.2
## Loading required package: cluster
library(readxl)
library("utils")
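Calling install.packages() on every run downloads packages that may already be present. A minimal sketch of installing only what is missing follows; the helper names missing_packages and install_missing are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the missing packages and return their names.
install_missing <- function(pkgs, repos = "https://cran.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

# Base packages ship with R, so nothing is reported as missing here.
missing_packages(c("stats", "utils"))
```

For this report the call would be install_missing(c("gclus", "ggpubr", "tidyverse", "readxl")), followed by the library() calls above.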

Step 2: Import and summarize the data

iq_data <- read_excel("avgIQpercountry.xls")
head(iq_data)
## # A tibble: 6 × 10
##    Rank Country     `Average IQ` Continent `Literacy Rate` `Nobel Prices`
##   <dbl> <chr>              <dbl> <chr>               <dbl>          <dbl>
## 1     1 Japan               106. Asia                 0.99             29
## 2     2 Taiwan              106. Asia                 0.96              4
## 3     3 Singapore           106. Asia                 0.97              0
## 4     4 Hong Kong           105. Asia                 0.94              1
## 5     5 China               104. Asia                 0.96              8
## 6     6 South Korea         102. Asia                 0.98              0
## # ℹ 4 more variables: `HDI (2021)` <dbl>,
## #   `Mean years of schooling - 2021` <dbl>, `GNI - 2021` <dbl>,
## #   `Population - 2023` <chr>
summary(iq_data)
##       Rank       Country            Average IQ      Continent        
##  Min.   :  1   Length:193         Min.   : 42.99   Length:193        
##  1st Qu.: 49   Class :character   1st Qu.: 74.33   Class :character  
##  Median : 97   Mode  :character   Median : 82.24   Mode  :character  
##  Mean   : 97                      Mean   : 82.05                     
##  3rd Qu.:145                      3rd Qu.: 91.60                     
##  Max.   :193                      Max.   :106.48                     
##                                                                      
##  Literacy Rate     Nobel Prices       HDI (2021)    
##  Min.   :0.1900   Min.   :  0.000   Min.   :0.3850  
##  1st Qu.:0.8000   1st Qu.:  0.000   1st Qu.:0.5995  
##  Median :0.9500   Median :  0.000   Median :0.7450  
##  Mean   :0.8642   Mean   :  5.922   Mean   :0.7241  
##  3rd Qu.:0.9900   3rd Qu.:  1.000   3rd Qu.:0.8440  
##  Max.   :1.0000   Max.   :400.000   Max.   :0.9620  
##                                     NA's   :14      
##  Mean years of schooling - 2021   GNI - 2021     Population - 2023 
##  Min.   : 2.100                 Min.   :   732   Length:193        
##  1st Qu.: 6.400                 1st Qu.:  4593   Class :character  
##  Median : 9.400                 Median : 12672   Mode  :character  
##  Mean   : 9.028                 Mean   : 20812                     
##  3rd Qu.:11.600                 3rd Qu.: 30588                     
##  Max.   :14.100                 Max.   :146830                     
##  NA's   :14                     NA's   :14
Data Description: A description of the features is presented in the table below.
Variable                          | Definition
------------                      |--------------
1. Rank                           | The country's rank by average IQ
2. Country                        | Country name
3. Average IQ                     | The average IQ score of the country
4. Continent                      | The continent the country is located in
5. Literacy Rate                  | The literacy rate of the country, expressed as a proportion from 0 to 1
6. Nobel Prices                   | The total number of Nobel Prizes won by the country (the column is spelled "Nobel Prices" in the dataset)
7. HDI (2021)                     | The Human Development Index score of the country, ranging from 0 to 1
8. Mean years of schooling - 2021 | The mean years of schooling in the country
9. GNI - 2021                     | The Gross National Income in 2021, expressed in international dollars (the value range suggests a per capita figure)
10. Population - 2023             | The population of the country in 2023 (read in as text, so it must be converted before analysis)
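
The summary above shows that Population - 2023 was read as a character column, so it cannot be used numerically until it is converted. The sketch below shows one way to do that. It assumes the cause is thousands separators or placeholder text, which the source does not confirm; parse_population and the example values are hypothetical.

```r
# Hypothetical helper: turn text such as "1,234,567" into numbers.
# Assumption: commas and stray spaces are what prevent numeric parsing;
# anything that still cannot be parsed becomes NA.
parse_population <- function(x) {
  cleaned <- gsub("[, ]", "", as.character(x))
  suppressWarnings(as.numeric(cleaned))
}

# Made-up illustrative values, not taken from the dataset.
example <- c("1,250,000", "5850342", "unknown")
parse_population(example)
```

Counting the NA values after conversion, for instance with sum(is.na(...)), shows how many rows would drop out of later models.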

Step 3: Data visualization

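options(max.print=999999)
# Re-read the data from CSV. read.csv() replaces spaces and punctuation in column names with dots
# (for example, "Average IQ" becomes Average.IQ), so columns must be referenced by their dotted names.
iq_data <- read.csv("avgIQpercountry.csv")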

plot( # iq vs literacy rate
  x = iq_data$Average.IQ,
  y = iq_data$Literacy.Rate,
  main = "Average IQ vs Literacy Rate",
  xlab = "Average IQ",
  ylab = "Literacy Rate", 
  cex = 1,
  col = 12,
  pch=20
)

plot( # iq vs Nobel Prizes Won
  x = iq_data$Average.IQ,
  y = iq_data$Nobel.Prices,  # dotted name produced by read.csv(); `Nobel Prices` would return NULL
  main = "Average IQ vs Nobel Prizes Won",
  xlab = "Average IQ",
  ylab = "Nobel Prizes Won", 
  cex = 1,
  col = 12,
  pch=20
)

plot( # iq vs HDI
  x = iq_data$Average.IQ,
  y = iq_data$HDI..2021.,  # dotted name produced by read.csv(); `HDI (2021)` would return NULL
  main = "Average IQ vs HDI (2021)",
  xlab = "Average IQ",
  ylab = "HDI (2021)", 
  cex = 1,
  col = 12,
  pch=20
)

# iq vs schooling
plot(
  x = iq_data$Average.IQ,
  y = iq_data$Mean.years.of.schooling...2021,
  main = "Average IQ vs Average Years of Schooling",
  xlab = "Average IQ",
  ylab = "Average Years of Schooling", 
  cex = 1,
  col = 12,
  pch=20
)

plot( # iq vs GNI
  x = iq_data$Average.IQ,
  y = iq_data$GNI...2021,  # dotted name produced by read.csv(); `GNI - 2021` would return NULL
  main = "Average IQ vs GNI (2021)",
  xlab = "Average IQ",
  ylab = "GNI (2021)", 
  cex = 1,
  col = 12,
  pch=20
)

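# iq vs population (only countries with more than 100 million people)
# Convert to numeric first in case the column was read as text; unparseable entries become NA
iq_data$Population...2023 <- suppressWarnings(as.numeric(as.character(iq_data$Population...2023)))
filtered_pop <- subset(iq_data, Population...2023 > 100000000)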
plot(
  x = filtered_pop$Average.IQ,
  y = filtered_pop$Population...2023,
  main = "Average IQ vs Population",
  xlab = "Average IQ",
  ylab = "Population", 
  cex = 1,
  col = 12,
  pch=20
)

Interpretation: Literacy Rate is left skewed (mean 0.86, median 0.95) and is positively correlated with IQ (r = 0.63 in Step 4).
Nobel Prizes Won is strongly right skewed (median 0, maximum 400) and has only a weak positive correlation with IQ (r = 0.21 in Step 4).
HDI is slightly left skewed (mean 0.724, median 0.745), and the regressions in Steps 5 and 6 estimate a positive relationship with IQ.
Average Years of Schooling is slightly left skewed (mean 9.0, median 9.4) and does not appear to have a clear correlation with IQ in the plot.
GNI is right skewed (mean 20,812, median 12,672), and its coefficient in the Step 5 regression is positive.
Population, restricted here to countries above 100 million, shows no obvious skew and does not appear to have a clear correlation with IQ.
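
Skewness read from a scatter plot is a judgment call, so a numeric check can back it up. The sketch below computes the moment-based sample skewness on made-up vectors; sample_skewness is a hypothetical helper, not a function used in the original analysis.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up illustrative vectors, not taken from the dataset.
right_tailed <- c(0, 0, 0, 1, 29, 400)
symmetric <- c(1, 2, 3, 4, 5)

sample_skewness(right_tailed)  # positive: long right tail
sample_skewness(symmetric)     # zero: perfectly symmetric
```

Applied to a column of the data, for example sample_skewness(iq_data$Literacy.Rate), the sign of the result indicates the direction of the skew.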

Step 4: Identify contributing independent variables using correlation

iq_data <- read_excel("avgIQpercountry.xls")
iq_data$`Population - 2023` <- as.numeric(as.character(iq_data$`Population - 2023`))
## Warning: NAs introduced by coercion
corr_data <- iq_data[,c("Literacy Rate", "HDI (2021)", "Mean years of schooling - 2021", "GNI - 2021", "Nobel Prices", "Population - 2023", "Average IQ")]
pairs(~`Literacy Rate` + `HDI (2021)` + `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` + `Population - 2023` + `Average IQ`, data = iq_data)

corr <- cor(corr_data)
corr
##                                Literacy Rate HDI (2021)
## Literacy Rate                      1.0000000         NA
## HDI (2021)                                NA          1
## Mean years of schooling - 2021            NA         NA
## GNI - 2021                                NA         NA
## Nobel Prices                       0.1190685         NA
## Population - 2023                         NA         NA
## Average IQ                         0.6347257         NA
##                                Mean years of schooling - 2021 GNI - 2021
## Literacy Rate                                              NA         NA
## HDI (2021)                                                 NA         NA
## Mean years of schooling - 2021                              1         NA
## GNI - 2021                                                 NA          1
## Nobel Prices                                               NA         NA
## Population - 2023                                          NA         NA
## Average IQ                                                 NA         NA
##                                Nobel Prices Population - 2023 Average IQ
## Literacy Rate                     0.1190685                NA  0.6347257
## HDI (2021)                               NA                NA         NA
## Mean years of schooling - 2021           NA                NA         NA
## GNI - 2021                               NA                NA         NA
## Nobel Prices                      1.0000000                NA  0.2056444
## Population - 2023                        NA                 1         NA
## Average IQ                        0.2056444                NA  1.0000000
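Interpretation: Average IQ is moderately and positively correlated with Literacy Rate (r = 0.63) and weakly correlated with Nobel Prices (r = 0.21). The remaining entries are NA because HDI, Mean years of schooling, GNI, and Population contain missing values, and cor() returns NA by default for any pair involving a missing observation.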
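
The NA entries in the matrix above come from missing values rather than from an absence of correlation. A minimal sketch of the use argument of cor(), which restricts each pairwise calculation to rows where both values are present, is shown on a small made-up data frame (the column names and values are illustrative only).

```r
# Made-up illustrative data with a few missing values.
toy <- data.frame(
  iq  = c(105, 98, 90, 82, 75, 70),
  hdi = c(0.95, 0.90, NA, 0.70, 0.60, 0.50),
  lit = c(0.99, 0.97, 0.95, 0.85, NA, 0.60)
)

# Default behaviour: any pair involving a missing value yields NA.
default_corr <- cor(toy)

# Pairwise deletion: each correlation uses the rows complete for that pair.
pairwise_corr <- cor(toy, use = "pairwise.complete.obs")

default_corr
pairwise_corr
```

In the analysis itself the equivalent call would be cor(corr_data, use = "pairwise.complete.obs"); use = "complete.obs" is the alternative that keeps only rows with no missing values at all, which matches the rows lm() uses.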

Step 5: Create the regression model and identify significant variables

model <- lm(`Average IQ` ~ `Literacy Rate` + `HDI (2021)` + `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` + `Population - 2023`, data = iq_data)
summary(model)
## 
## Call:
## lm(formula = `Average IQ` ~ `Literacy Rate` + `HDI (2021)` + 
##     `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` + 
##     `Population - 2023`, data = iq_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.8726  -3.9832   0.9518   4.5501  27.3856 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.110e+01  5.479e+00   7.501 3.23e-12 ***
## `Literacy Rate`                  7.284e+00  7.266e+00   1.002   0.3175    
## `HDI (2021)`                     3.771e+01  1.451e+01   2.600   0.0101 *  
## `Mean years of schooling - 2021` 5.214e-01  5.554e-01   0.939   0.3491    
## `GNI - 2021`                     9.994e-05  5.290e-05   1.889   0.0605 .  
## `Nobel Prices`                   4.580e-03  2.098e-02   0.218   0.8274    
## `Population - 2023`              9.028e-09  4.302e-09   2.099   0.0373 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.677 on 172 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.5912, Adjusted R-squared:  0.5769 
## F-statistic: 41.46 on 6 and 172 DF,  p-value: < 2.2e-16
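Interpretation: At an alpha level of 0.05, the statistically significant independent variables are Human Development Index (p = 0.0101) and Population (p = 0.0373). GNI is marginally significant (p = 0.0605), while Literacy Rate, Mean years of schooling, and Nobel Prices are not significant.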
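
Because HDI is itself built from schooling and standard-of-living components, the predictors in this model overlap, which can inflate standard errors and hide individual effects. One common check is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The sketch below uses base R and the built-in mtcars dataset as a stand-in; vif_manual is a hypothetical helper and assumes syntactic column names.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# computed by regressing that predictor on all the others.
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

# Stand-in example on a built-in dataset, not the IQ data.
vifs <- vif_manual(mtcars, c("disp", "hp", "wt"))
vifs
```

Values well above 1 indicate that a predictor is largely explained by the others; thresholds such as 5 or 10 are common rules of thumb rather than strict cutoffs.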

Step 6: Regression model using only significant variables

model <- lm(`Average IQ` ~ `HDI (2021)` + `Population - 2023`, data = iq_data)
summary(model)
## 
## Call:
## lm(formula = `Average IQ` ~ `HDI (2021)` + `Population - 2023`, 
##     data = iq_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.733  -3.247   1.053   4.855  26.736 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.361e+01  3.170e+00   10.61   <2e-16 ***
## `HDI (2021)`        6.621e+01  4.269e+00   15.51   <2e-16 ***
## `Population - 2023` 8.120e-09  4.186e-09    1.94    0.054 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.698 on 176 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.5797, Adjusted R-squared:  0.5749 
## F-statistic: 121.4 on 2 and 176 DF,  p-value: < 2.2e-16
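Interpretation: The estimated regression model is y = 33.61 + 66.21x1 + 8.12e-9x2, where y = Average IQ, x1 = HDI, and x2 = Population. In this reduced model HDI remains highly significant (p < 2e-16), while Population is only marginally significant (p = 0.054) and no longer meets the 0.05 threshold.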
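
As a worked example of the fitted equation, the sketch below plugs in a hypothetical country with an HDI of 0.8 and a population of 50 million. The function name predict_iq and the input values are illustrative assumptions; the coefficients are the rounded estimates from the output above.

```r
# Rounded coefficients from the reduced model output.
b0 <- 33.61      # intercept
b1 <- 66.21      # HDI (2021)
b2 <- 8.12e-09   # Population - 2023

# Hypothetical helper: predicted Average IQ from HDI and population.
predict_iq <- function(hdi, population) {
  b0 + b1 * hdi + b2 * population
}

# 33.61 + 66.21 * 0.8 + 8.12e-09 * 5e7 = 33.61 + 52.968 + 0.406
predict_iq(hdi = 0.8, population = 5e7)  # 86.984
```

With the fitted object available, predict(model, newdata = ...) gives the same result using unrounded coefficients.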

Step 7: Analyzing Adjusted R-Squared

Interpretation: The adjusted R-squared value is 0.57, indicating that 57% of variability in Average IQ is explained by the predictors after their number is taken into consideration.
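
The adjusted value can be reproduced from the reported R-squared with the standard formula 1 - (1 - R^2)(n - 1) / (n - p - 1). The sketch below assumes n = 179 observations, which follows from the 176 residual degrees of freedom plus 3 estimated coefficients; adjusted_r2 is a hypothetical helper.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Reduced model (Step 6): R-squared 0.5797, 2 predictors.
adjusted_r2(0.5797, n = 179, p = 2)

# Full model (Step 5): R-squared 0.5912, 6 predictors.
adjusted_r2(0.5912, n = 179, p = 6)
```

Both results agree with the reported values (0.5749 and 0.5769) to four decimal places, and show that dropping four predictors costs very little explanatory power.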

Step 8: Model significance

Interpretation: Because the p-value < 2.2e-16, the model is statistically significant.
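
The model p-value is the upper tail probability of the F distribution at the reported F-statistic. A short sketch using the figures from the Step 6 output (F = 121.4 on 2 and 176 degrees of freedom):

```r
# Figures taken from the Step 6 summary output.
f_stat <- 121.4
df_model <- 2
df_resid <- 176

# Probability of an F value at least this large if no predictor had any effect.
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value < 2.2e-16  # TRUE, matching the "< 2.2e-16" shown by summary()
```

The threshold 2.2e-16 is the machine precision R uses when printing very small p-values, so the printed figure is an upper bound rather than the exact value.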

Step 9: Coefficient Interpretation

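B0) Intercept: 33.61 - The expected Average IQ when HDI and Population are both zero. No country has such values, so this is an extrapolation rather than a meaningful prediction.
B1) HDI (2021): 66.21 - For every one-unit increase in HDI, with Population held constant, Average IQ is expected to increase by 66.21. Because HDI ranges from 0 to 1, an increase of 0.1 corresponds to about 6.6 IQ points.
B2) Population: 8.12e-09 - For every additional person, with HDI held constant, Average IQ is expected to increase by 8.12e-09, or about 0.8 points per 100 million people.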
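
Point estimates alone do not show how precisely each coefficient is known. The sketch below builds approximate 95% confidence intervals for the Step 6 slopes from the reported estimates and standard errors, using the t distribution with 176 residual degrees of freedom. The inputs are the rounded figures from the output, so the results are approximate.

```r
# Rounded estimates and standard errors from the Step 6 output.
estimates  <- c(HDI = 66.21, Population = 8.120e-09)
std_errors <- c(HDI = 4.269, Population = 4.186e-09)

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = 176)

ci <- cbind(
  lower = estimates - t_crit * std_errors,
  upper = estimates + t_crit * std_errors
)
ci
```

The interval for HDI lies entirely above zero, while the interval for Population spans zero, which is consistent with its p-value of 0.054. With the fitted object available, confint(model) returns the same intervals from unrounded values.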

Step 10: Conclusion and Recommendations

1. In the full model, the variables significantly associated with a country's average IQ at the 0.05 level are Human Development Index and Population.
2. A model using only HDI and Population explains about 57% of the variability in average IQ, although Population is only marginally significant in that model (p = 0.054).
3. HDI is the factor most strongly related to average IQ, with a p-value near 0.01 in the full model and below 2e-16 in the reduced model.
4. Considering that HDI is based on health, knowledge (measured in mean years of schooling and expected years of schooling), and standard of living, these results add to the evidence that intelligence is
closely tied to people's health and standard of living.
5. Countries may be able to reap the benefits of a more intelligent population by focusing their efforts on developing the capabilities and standard of living of their citizens.