Methods
Step 1: Install and load required packages
install.packages("gclus", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'gclus' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
install.packages("ggpubr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ggpubr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Admin/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Admin\AppData\Local\Temp\RtmpeiIpKr\downloaded_packages
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'forcats' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gclus)
## Warning: package 'gclus' was built under R version 4.4.2
## Loading required package: cluster
library(readxl)
library("utils")
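Re-running install.packages() on every knit reinstalls packages that are already present. A minimal sketch of one alternative is shown here; the helper name missing_packages is hypothetical, and the package list assumes that readxl also needs to be available because it is loaded above.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("gclus", "ggpubr", "tidyverse", "readxl")
to_install <- missing_packages(needed)

# Install only what is missing (left commented out so the sketch has no side effects).
# if (length(to_install) > 0) install.packages(to_install)
```

Only the packages returned in to_install would then be passed to install.packages().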
Step 2: Import and summarize the data
iq_data <- read_excel("avgIQpercountry.xls")
head(iq_data)
## # A tibble: 6 × 10
## Rank Country `Average IQ` Continent `Literacy Rate` `Nobel Prices`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Japan 106. Asia 0.99 29
## 2 2 Taiwan 106. Asia 0.96 4
## 3 3 Singapore 106. Asia 0.97 0
## 4 4 Hong Kong 105. Asia 0.94 1
## 5 5 China 104. Asia 0.96 8
## 6 6 South Korea 102. Asia 0.98 0
## # ℹ 4 more variables: `HDI (2021)` <dbl>,
## # `Mean years of schooling - 2021` <dbl>, `GNI - 2021` <dbl>,
## # `Population - 2023` <chr>
summary(iq_data)
## Rank Country Average IQ Continent
## Min. : 1 Length:193 Min. : 42.99 Length:193
## 1st Qu.: 49 Class :character 1st Qu.: 74.33 Class :character
## Median : 97 Mode :character Median : 82.24 Mode :character
## Mean : 97 Mean : 82.05
## 3rd Qu.:145 3rd Qu.: 91.60
## Max. :193 Max. :106.48
##
## Literacy Rate Nobel Prices HDI (2021)
## Min. :0.1900 Min. : 0.000 Min. :0.3850
## 1st Qu.:0.8000 1st Qu.: 0.000 1st Qu.:0.5995
## Median :0.9500 Median : 0.000 Median :0.7450
## Mean :0.8642 Mean : 5.922 Mean :0.7241
## 3rd Qu.:0.9900 3rd Qu.: 1.000 3rd Qu.:0.8440
## Max. :1.0000 Max. :400.000 Max. :0.9620
## NA's :14
## Mean years of schooling - 2021 GNI - 2021 Population - 2023
## Min. : 2.100 Min. : 732 Length:193
## 1st Qu.: 6.400 1st Qu.: 4593 Class :character
## Median : 9.400 Median : 12672 Mode :character
## Mean : 9.028 Mean : 20812
## 3rd Qu.:11.600 3rd Qu.: 30588
## Max. :14.100 Max. :146830
## NA's :14 NA's :14
Data Description: A description of the features is presented in the table below.
Variable | Definition
------------ |--------------
1. Rank | The country's rank by IQ
2. Country | Country name
3. Average IQ | The average IQ score of the country
4. Continent | The continent the country is located in
5. Literacy Rate | The literacy rate of the country expressed as a proportion from 0 to 1
6. Nobel Prizes | The total number of Nobel Prizes won by the country
7. HDI (2021) | The Human Development Index score of the country, ranging from 0 to 1
8. Mean years of schooling-2021 | The mean years of schooling in the country
9. GNI-2021 | The Gross National Income per capita in 2021 expressed in international dollars
10. Population-2023 | The population of the country in 2023
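The summary above reports NA's in three numeric columns and shows Population stored as character. The sketch below, on a small made-up data frame (toy, hdi and population are illustrative names, not the real columns), shows one way to count missing values per column and to see how many entries fail numeric conversion.

```r
# Toy data standing in for the real file: one missing HDI value and
# one population entry that is not a number.
toy <- data.frame(
  hdi = c(0.92, NA, 0.55),
  population = c("1000", "n/a", "2500"),
  stringsAsFactors = FALSE
)

# Missing values per column.
na_counts <- colSums(is.na(toy))
na_counts

# Entries that cannot be converted become NA; suppressWarnings() hides
# the "NAs introduced by coercion" warning.
pop_num <- suppressWarnings(as.numeric(toy$population))
sum(is.na(pop_num))
```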
Step 3: Data visualization
options(max.print=999999)
iq_data <- read.csv("avgIQpercountry.csv")
plot( # iq vs literacy rate
x = iq_data$Average.IQ,
y = iq_data$Literacy.Rate,
main = "Average IQ vs Literacy Rate",
xlab = "Average IQ",
ylab = "Literacy Rate",
cex = 1,
col = 12,
pch=20
)

plot( # iq vs Nobel Prizes Won
x = iq_data$Average.IQ,
y = iq_data$Nobel.Prices, # read.csv() converts "Nobel Prices" to Nobel.Prices
main = "Average IQ vs Nobel Prizes Won",
xlab = "Average IQ",
ylab = "Nobel Prizes Won",
cex = 1,
col = 12,
pch=20
)

plot( # iq vs HDI
x = iq_data$Average.IQ,
y = iq_data$HDI..2021., # read.csv() converts "HDI (2021)" to HDI..2021.
main = "Average IQ vs HDI (2021)",
xlab = "Average IQ",
ylab = "HDI (2021)",
cex = 1,
col = 12,
pch=20
)

# iq vs schooling
plot(
x = iq_data$Average.IQ,
y = iq_data$Mean.years.of.schooling...2021,
main = "Average IQ vs Average Years of Schooling",
xlab = "Average IQ",
ylab = "Average Years of Schooling",
cex = 1,
col = 12,
pch=20
)

plot( # iq vs GNI
x = iq_data$Average.IQ,
y = iq_data$GNI...2021, # read.csv() converts "GNI - 2021" to GNI...2021
main = "Average IQ vs GNI (2021)",
xlab = "Average IQ",
ylab = "GNI (2021)",
cex = 1,
col = 12,
pch=20
)

# iq vs population
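iq_data$Population...2023 <- suppressWarnings(as.numeric(as.character(iq_data$Population...2023))) # non-numeric entries become NA
filtered_pop <- subset(iq_data, Population...2023 > 100000000)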
plot(
x = filtered_pop$Average.IQ,
y = filtered_pop$Population...2023,
main = "Average IQ vs Population",
xlab = "Average IQ",
ylab = "Population",
cex = 1,
col = 12,
pch=20
)
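Column names differ between the two import paths used in this report: read_excel() keeps names such as "HDI (2021)" as they are, while read.csv() by default passes them through make.names(). The short sketch below shows the converted names and what happens when a column name that does not exist is requested with `$`.

```r
# Names as they appear in the spreadsheet header.
original <- c("Average IQ", "Nobel Prices", "HDI (2021)", "GNI - 2021")

# Names as read.csv() produces them with its default check.names = TRUE:
# every character that is not valid in an R name is replaced by a dot.
converted <- make.names(original)
converted

# Asking a data frame for a column that does not exist returns NULL
# without raising an error.
df <- data.frame(a = 1:3)
is.null(df$b)
```

Because plot(x, y = NULL) draws x against its index, a misspelled column name in a plot() call produces a figure rather than an error, so the axes are worth checking.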

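Interpretation: Literacy Rate is possibly left skewed (median 0.95, mean 0.86) and is positively correlated with IQ (r = 0.63 in Step 4).
Nobel Prizes Won is right skewed (median 0, mean 5.9) and shows only a weak positive correlation with IQ (r = 0.21 in Step 4).
HDI is possibly slightly left skewed (median 0.745, mean 0.724) and appears to have a positive relationship with IQ, consistent with its positive coefficient in Step 5.
Average Years of Schooling is possibly left skewed (median 9.4, mean 9.0); its coefficient in Step 5 is positive but not significant.
GNI is possibly right skewed (median 12672, mean 20812) and appears to have a positive relationship with IQ, consistent with its positive coefficient in Step 5.
Population, restricted to countries with more than 100 million people, shows no obvious skewness and does not appear to have a clear correlation with IQ.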
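The direction of skew can be checked numerically rather than by eye. The sketch below uses a simple moment-based skewness on made-up values; sample_skewness is a hypothetical helper, not a function from the packages loaded above.

```r
# Moment-based sample skewness: positive for a long right tail,
# negative for a long left tail, zero for a symmetric sample.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up values with many zeros and one large value.
right_tailed <- c(0, 0, 0, 1, 2, 50)

sample_skewness(right_tailed)    # positive
sample_skewness(-right_tailed)   # negative
mean(right_tailed) > median(right_tailed)   # mean above median also points to right skew
```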
Step 4: Identify contributing independent variables using correlation
iq_data <- read_excel("avgIQpercountry.xls")
iq_data$`Population - 2023` <- as.numeric(as.character(iq_data$`Population - 2023`))
## Warning: NAs introduced by coercion
corr_data <- iq_data[,c("Literacy Rate", "HDI (2021)", "Mean years of schooling - 2021", "GNI - 2021", "Nobel Prices", "Population - 2023", "Average IQ")]
pairs(~`Literacy Rate` + `HDI (2021)` + `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` + `Population - 2023` + `Average IQ`, data = iq_data)

corr <- cor(corr_data)
corr
## Literacy Rate HDI (2021)
## Literacy Rate 1.0000000 NA
## HDI (2021) NA 1
## Mean years of schooling - 2021 NA NA
## GNI - 2021 NA NA
## Nobel Prices 0.1190685 NA
## Population - 2023 NA NA
## Average IQ 0.6347257 NA
## Mean years of schooling - 2021 GNI - 2021
## Literacy Rate NA NA
## HDI (2021) NA NA
## Mean years of schooling - 2021 1 NA
## GNI - 2021 NA 1
## Nobel Prices NA NA
## Population - 2023 NA NA
## Average IQ NA NA
## Nobel Prices Population - 2023 Average IQ
## Literacy Rate 0.1190685 NA 0.6347257
## HDI (2021) NA NA NA
## Mean years of schooling - 2021 NA NA NA
## GNI - 2021 NA NA NA
## Nobel Prices 1.0000000 NA 0.2056444
## Population - 2023 NA 1 NA
## Average IQ 0.2056444 NA 1.0000000
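Interpretation: Average IQ appears to be moderately correlated with Literacy Rate (r = 0.63) and weakly correlated with Nobel Prices (r = 0.21). The remaining correlations are NA because those columns contain missing values.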
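The NA entries arise because cor() by default returns NA for any pair of columns in which a value is missing. A minimal sketch on made-up data (toy, iq and hdi are illustrative names) shows how the use argument changes this.

```r
# Toy data with one missing value in hdi.
toy <- data.frame(
  iq  = c(100, 95, 90, 85, 80),
  hdi = c(0.95, 0.90, NA, 0.70, 0.60)
)

# Default (use = "everything"): the iq/hdi entry is NA.
default_corr <- cor(toy)
default_corr

# Pairwise deletion: each correlation uses the rows where both values are present.
pairwise_corr <- cor(toy, use = "pairwise.complete.obs")
pairwise_corr
```

With use = "complete.obs" instead, every correlation is computed on the same set of rows, which matches the rows lm() keeps after dropping missing observations.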
Step 5: Create the regression model and identify significant variables
model <- lm(`Average IQ` ~ `Literacy Rate` + `HDI (2021)` + `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` + `Population - 2023`, data = iq_data)
summary(model)
##
## Call:
## lm(formula = `Average IQ` ~ `Literacy Rate` + `HDI (2021)` +
## `Mean years of schooling - 2021` + `GNI - 2021` + `Nobel Prices` +
## `Population - 2023`, data = iq_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.8726 -3.9832 0.9518 4.5501 27.3856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.110e+01 5.479e+00 7.501 3.23e-12 ***
## `Literacy Rate` 7.284e+00 7.266e+00 1.002 0.3175
## `HDI (2021)` 3.771e+01 1.451e+01 2.600 0.0101 *
## `Mean years of schooling - 2021` 5.214e-01 5.554e-01 0.939 0.3491
## `GNI - 2021` 9.994e-05 5.290e-05 1.889 0.0605 .
## `Nobel Prices` 4.580e-03 2.098e-02 0.218 0.8274
## `Population - 2023` 9.028e-09 4.302e-09 2.099 0.0373 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.677 on 172 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.5912, Adjusted R-squared: 0.5769
## F-statistic: 41.46 on 6 and 172 DF, p-value: < 2.2e-16
Interpretation: At an alpha level of 0.05, the statistically significant independent variables are Human Development Index (p = 0.0101) and Population (p = 0.0373).
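Significant terms can also be pulled from the fitted model programmatically instead of being read off the printed table. The sketch below uses simulated data (x1, x2 and y are made up for illustration) and the coefficient matrix returned by summary().

```r
# Simulated data: y depends on x1 only.
set.seed(1)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 * toy$x1 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = toy)

# The coefficient matrix has one row per term and a "Pr(>|t|)" column.
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]   # drop the intercept row

# Names of the terms significant at the 0.05 level.
significant <- names(p_values)[p_values < 0.05]
significant
```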
Step 6: Regression model using only significant variables
model <- lm(`Average IQ` ~ `HDI (2021)` + `Population - 2023`, data = iq_data)
summary(model)
##
## Call:
## lm(formula = `Average IQ` ~ `HDI (2021)` + `Population - 2023`,
## data = iq_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.733 -3.247 1.053 4.855 26.736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.361e+01 3.170e+00 10.61 <2e-16 ***
## `HDI (2021)` 6.621e+01 4.269e+00 15.51 <2e-16 ***
## `Population - 2023` 8.120e-09 4.186e-09 1.94 0.054 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.698 on 176 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.5797, Adjusted R-squared: 0.5749
## F-statistic: 121.4 on 2 and 176 DF, p-value: < 2.2e-16
Interpretation: The estimated regression model is y = 33.61 + 66.21x1 + 8.12e-9x2, where y = Average IQ, x1 = HDI and x2 = Population
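As a worked example of the estimated equation, the sketch below plugs in a hypothetical country with an HDI of 0.80 and a population of 50 million; both input values are made up for illustration, and predict_iq is a hypothetical helper that hard-codes the rounded coefficients reported above.

```r
# Estimated model with the rounded coefficients from the summary output.
predict_iq <- function(hdi, population) {
  33.61 + 66.21 * hdi + 8.12e-09 * population
}

# 33.61 + 66.21 * 0.80 + 8.12e-09 * 5e7 = 33.61 + 52.968 + 0.406
predicted <- predict_iq(hdi = 0.80, population = 5e7)
predicted   # 86.984
```

Almost all of the prediction comes from the intercept and the HDI term; the population term contributes less than half a point in this example.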
Step 7: Analyzing Adjusted R-Squared
Interpretation: The adjusted R-squared value is 0.5749, indicating that about 57% of the variability in Average IQ is explained by the predictors after adjusting for the number of predictors.
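The adjusted value can be reproduced from the multiple R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). In the sketch below, n = 179 is the number of observations used (193 rows minus 14 deleted for missingness) and p = 2 is the number of predictors; adjusted_r2 is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj <- adjusted_r2(r2 = 0.5797, n = 179, p = 2)
round(adj, 4)   # 0.5749
```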
Step 8: Model significance
Interpretation: Because the p-value of the overall F-test (< 2.2e-16) is far below the alpha level of 0.05, the model is statistically significant.
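The overall p-value comes from the upper tail of the F distribution. As a sketch, it can be recomputed from the F-statistic and degrees of freedom reported in the summary output (121.4 on 2 and 176 DF).

```r
# Upper-tail probability of an F(2, 176) distribution at the observed statistic.
p_value <- pf(121.4, df1 = 2, df2 = 176, lower.tail = FALSE)

p_value < 2.2e-16   # TRUE; R prints such values as "< 2.2e-16"
p_value < 0.05      # TRUE
```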
Step 9: Coefficient Interpretation
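B0) Intercept: 33.61 - The expected Average IQ when HDI and Population are both zero. This lies outside the observed data and has no practical interpretation.
B1) HDI (2021): 66.21 - Holding Population constant, a one-unit increase in HDI is associated with an expected increase of 66.21 in Average IQ. Because HDI ranges from 0 to 1, a 0.1 increase corresponds to about 6.6 points.
B2) Population: 8.12e-09 - Holding HDI constant, each additional person is associated with an expected increase of 8.12e-09 in Average IQ, or about 0.81 points per 100 million people. This coefficient is not significant at the 0.05 level in the reduced model (p = 0.054).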
Step 10: Conclusion and Recommendations
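1. In the full model, Human Development Index and Population are the only variables significantly associated with a country's average IQ score at the 0.05 level. In the reduced model, HDI remains significant while Population is marginal (p = 0.054).
2. We determined that it is possible to predict a country's average IQ based on HDI and Population, with the reduced model explaining about 57% of the variability in average IQ.
3. HDI is the most significant factor related to average IQ, with a p-value near 0.01 in the full model and below 2e-16 in the reduced model.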
4. Because HDI combines health, knowledge (measured in mean years of schooling and expected years of schooling), and standard of living, these results add to the evidence that intelligence is closely tied to people's health and standard of living.
5. Countries can reap the benefits of a more intelligent population by focusing their efforts on developing the capabilities and standard of living of their citizens.