Title: Finding Happiness in Innovation

a. Introduction. Variable Definitions.

The primary question in this study is: Are happier countries more innovative and progressive? If so, which factors contribute to making them the most innovative?

The question is silly when applied to humans. Of course, happier people are more productive in their endeavors! And conversely they are less inventive and less productive when they are unhappy. But are countries like that? The answer is not trivial. As an example, countries are at the unhappiest period of their existence during wars. Yet, two of the most innovative periods for the US were during the Manhattan project (WWII era) and the Cold War era. And some of the great civilizations in history thrived while invading others or when facing adversity and competition. Thus, the answer depends on which country and period in history we are referring to.

I don’t have a preconceived answer and this project is meant to scratch the surface of the problem. The approach will be to postulate variables that quantify or encourage innovation, and then to postulate variables that affect it, including degree of happiness. Naturally, some variables will influence innovation more than others and the goal will be to reduce the number of variables to the most influential. We will therefore use Principal Component Analysis (PCA) to statistically determine a set of orthogonal variables (principal components), and construct a reduced model that still explains a significant fraction of the observations and their variance from a null regression model. Happiness may or not emerge among the significant variables in our data. But we will determine its relative weight in the principal components.

How to measure country Innovation? I will use a very well established index called Global Innovation Index (GII), developed and maintained by the World Intellectual Property Office (WIPO). The GII is a real number between 0 and 100, derived from the statistical analysis of 80 separate indicators (https://www.wipo.int/en/web/global-innovation-index) for over 150 countries, and used by strategists and policy makers worldwide. Below I give the link to the actual source of the compiled index since 2011.

How to measure country Happiness? I will use the well known Happiness index (https://www.worldhappiness.report/about/), a real number between 0 and 10, based on a Gallup World Poll of over 140 countries since late 2011, where representative samples of citizens answer this question: On which step of a ladder, the so-called Cantril Self-Anchoring Striving Scale, would you personally feel you stand at this time (0 at the bottom, the worst situation, and 10 at the top, the best life for you)? The Reports emerging from this polling are subjective measures of current well-being of individuals in a country and have been used by the United Nations General Assembly to support some of their proposed measures. The latest report is from 2025 (https://www.gallup.com/analytics/349487/world-happiness-report.aspx.)

How do we identify the other independent variables? We are really seeking variables that potentially may impact both Innovation and Happiness, as we want to answer the question of the importance of personal happiness to country innovation. Thus I avoided attempting to reproduce the 80 indicators on which GII is based, and identified 19 from the World Bank Development Indicators that may influence both Innovation and Happiness. Below I give more details and the links.

Most of the World Bank indicators are available since at least 2000. But the GII is available since 2011 and the Happiness index since late 2011, which limits the validity of our PCA model to the 2011-2024 period.

The 21 World Bank independent variables are all numerical and measure: (1) Infrastructure, (2) Research and manufacturing investment, (3) Human capital, (4) Knowledge output, (5) National effectiveness. Individuals are happier and nations more innovative when they have appropriate resources, progressive societies, government encouragement, security of health, education and livelihood.

Quantitative variable names and World Bank indicator codes:

Infrastructure:

Access to electricity (% of population) - EG.ELC.ACCS.ZS
Electric power consumption (kWh per capita) - EG.USE.ELEC.KH.PC
Secure Internet servers (per 1 million people) - IT.NET.SECR.P6
Individuals using the Internet (% of population) - IT.NET.USER.ZS

Research and advanced manufacturing investment:

Research and development expenditure (% of GDP) - GB.XPD.RSDV.GD.ZS
Foreign direct investment, net (BoP, current US$) - BN.KLT.DINV.CD
Medium and high-tech manufacturing value added (% manufact. value added) - NV.MNF.TECH.ZS.UN

Human capital:

Life expectancy at birth, total (years) - SP.DYN.LE00.IN
Domestic general government health expenditure per capita, PPP (interntl $) - SH.XPD.GHED.PP.CD
Unemployed with advanced education (% of labor force with advanced education) - SL.UEM.ADVN.ZS
Government expenditure on education, total (% of GDP) - SE.XPD.TOTL.GD.ZS
Completed upper secondary educ., pop. 25+, total (%) - SE.SEC.CUAT.UP.ZS

Knowledge output:

Researchers in R&D (per million people) - SP.POP.SCIE.RD.P6
Patent applications, residents - IP.PAT.RESD
Scientific and technical journal articles (per million people)- IP.JRN.ARTC.SC
High-technology exports (% manufactured exports) - TX.VAL.TECH.MF.ZS
ICT service exports (BoP, % service exports) - BX.GSR.CCIS.ZS

National effectiveness:

GDP, PPP (current interntl $) - NY.GDP.MKTP.PP.CD - measure of country wealth
GDP per person employed, (current interntl $) - SL.GDP.PCAP.EM.KD - measure of productivity
GNI per capita, Atlas method (current US$) - NY.GNP.PCAP.CD - measure of individual wealth - it is also the basis for the World Bank’s classification of country by income level in 2024:

Low income: $1,145 or less

Lower-middle income: $1,146–4,515

Upper-middle income: $4,516–14,005

High income: $14,006 or more

Adjusted net national income (current US$) - GNI minus consumption of resources - NY.ADJ.NNTY.CD

I focused mainly on the Principal Component Analysis tasks to explain the 2011-2024 data, and didn’t explore in detail the interesting aspects of understanding how countries achieve leading innovator status and how they have benefited from having an overall happier society.

b. Source databases.

The link to the bound dataset used in this study is https://raw.githubusercontent.com/raul-miranda/DS101-2025/refs/heads/main/worldbank_gii_happiness.csv. It contains 5425 rows.

To produce it, I compiled the World Development Indicators mentioned above from http://data.worldbank.org/, and downloaded the set to my github repository.

Link to repo: https://github.com/raul-miranda/DS101-2025

Name of the dataset: worldbankindicators_2000_2024_expanded.csv.

The Global Innovation report for 2025 is available at WIPO: https://www.wipo.int/en/web/global-innovation-index. However, the compilation of Global Innovation Indexes (GII) with all the supporting indicators is only made available upon paid membership. I found a compilation of the overall GII at kaggle that has been downloaded many times and has not received negative comments. The downloadable zip file is at https://www.kaggle.com/datasets/karlakovacs/global-innovation-index-wipo-2011-2024-data?resource=download.

Name of the dataset: GII_2011_2024_long_format.csv.

The World Happiness Report is available at https://www.worldhappiness.report/data-sharing/. (Direct download link: https://files.worldhappiness.report/WHR25_Data_Figure_2.1v3.xlsx?_gl=1*kq8vgr*_gcl_au*MTA1MzA2NDk3My4xNzY0NjM3MTQx). The Gallup poll results are available in their annual report. The happiness index is reported as a 3-year rolling average. Our World In Data offers the same data for free at https://ourworldindata.org/grapher/happiness-cantril-ladder.

Name of the dataset: happiness-cantril-ladder.csv.

Datasets are tabulated by country_name, country_code and year. The original bound data set contains about 5400 records in the 2000-2024 period (long format), but not all countries reported complete information.

c. Dataset preparation

Here I read and bind the 3 raw datasets (World Bank, GII, and Happiness) and produce a compiled file containing 23 columns and 5425 rows, which is uploaded to my repo at https://raw.githubusercontent.com/raul-miranda/DS101-2025/refs/heads/main/worldbank_gii_happiness.csv.

setwd(getwd()) # ensure we're at the directory where the RMD is running; but the next read_csvs are from my github acct

gii_data <- read_csv("https://raw.githubusercontent.com/raul-miranda/DS101-2025/refs/heads/main/GII_2011_2024_long_format.csv", show_col_types = FALSE)
happiness_data <- read_csv("https://raw.githubusercontent.com/raul-miranda/DS101-2025/refs/heads/main/happiness-cantril-ladder.csv", show_col_types = FALSE)
worldbank_data <- read_csv("https://raw.githubusercontent.com/raul-miranda/DS101-2025/refs/heads/main/worldbankindicators_2000_2024_expanded.csv", show_col_types = FALSE)

Bind datasets. Exclude columns “Rank” from gii_data, and “Entity” from happiness_date. Rename “Score” to “innovation_score” and “Cantril ladder score” to “happiness_score”. Rename “Country name” to “name”, “Country code” to “code”, and “Year” to “year”. Write out the bound file to my folder for manual upload to github.

Examine the top rows of bound_set and take a peek at its variable structure.

bound_set <- worldbank_data |> left_join(select(gii_data, -"Rank"), by= c("Country Code"= "Country", "Year"="Year")) |> left_join(select(happiness_data, -"Entity"), by= c("Country Code"="Code", "Year"="Year")) |>

rename (name = "Country Name", code = "Country Code", year = Year, innovation_score = Score, happiness_score = "Cantril ladder score")

write.csv(bound_set, "worldbank_gii_happiness.csv", row.names = FALSE)  #write it out to my project folder


head (bound_set, 3)

## # A tibble: 3 × 26
##   name        code   year `access_electricity%pop` `adj_national_income$`
##   <chr>       <chr> <dbl>                    <dbl>                  <dbl>
## 1 Afghanistan AFG    2000                      4.4                      0
## 2 Afghanistan AFG    2001                      9.3                      0
## 3 Afghanistan AFG    2002                     14.1                      0
## # ℹ 21 more variables: `gov_health_expend_intl$cap` <dbl>,
## #   `gov_educ_exp%gdp` <dbl>, `compl_upp_sec_educ%pop` <dbl>,
## #   power_cons_kwh_cap <dbl>, `foreign_dir_investm$` <dbl>,
## #   `gni_cap_atlas$` <dbl>, `gdp_person_emplyd_intl$` <dbl>, `gdp_intl$` <dbl>,
## #   `internet_use%pop` <dbl>, life_expect_birth_yrs <dbl>,
## #   `hi_tech_manuf%valuadded` <dbl>, patent_apps_res <dbl>,
## #   `rd_expend%gdp` <dbl>, rd_researchers_millpop <dbl>, …

str(bound_set)

## spc_tbl_ [5,425 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name                       : chr [1:5425] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ code                       : chr [1:5425] "AFG" "AFG" "AFG" "AFG" ...
##  $ year                       : num [1:5425] 2000 2001 2002 2003 2004 ...
##  $ access_electricity%pop     : num [1:5425] 4.4 9.3 14.1 19 23.8 28.7 33.5 38.4 42.4 48.3 ...
##  $ adj_national_income$       : num [1:5425] 0 0 0 0 0 ...
##  $ gov_health_expend_intl$cap : num [1:5425] 0 0 0.765 6.256 5.199 ...
##  $ gov_educ_exp%gdp           : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ compl_upp_sec_educ%pop     : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ power_cons_kwh_cap         : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ foreign_dir_investm$       : num [1:5425] 0 0 0 0 0 ...
##  $ gni_cap_atlas$             : num [1:5425] 0 0 180 190 210 250 270 330 370 460 ...
##  $ gdp_person_emplyd_intl$    : num [1:5425] 7550 6813 8327 8520 8326 ...
##  $ gdp_intl$                  : num [1:5425] 1.64e+10 1.52e+10 1.98e+10 2.20e+10 2.29e+10 ...
##  $ internet_use%pop           : num [1:5425] 0 0.00472 0.00456 0.08789 0.10581 ...
##  $ life_expect_birth_yrs      : num [1:5425] 55 55.5 56.2 57.2 57.8 ...
##  $ hi_tech_manuf%valuadded    : num [1:5425] 13.6 13.6 13.6 13.6 14.3 ...
##  $ patent_apps_res            : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rd_expend%gdp              : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rd_researchers_millpop     : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ secure_intern_serv_millpop : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ unemplyd_adv_educ%educ_pop : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hi-tec_exprts%manuf_exprts : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ict_serv_exprts%serv_exprts: num [1:5425] 0 0 0 0 0 ...
##  $ scien_pubs_millpop         : num [1:5425] 4 1 5.5 7.34 6.75 ...
##  $ innovation_score           : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ happiness_score            : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country Name` = col_character(),
##   ..   `Country Code` = col_character(),
##   ..   Year = col_double(),
##   ..   `access_electricity%pop` = col_double(),
##   ..   `adj_national_income$` = col_double(),
##   ..   `gov_health_expend_intl$cap` = col_double(),
##   ..   `gov_educ_exp%gdp` = col_double(),
##   ..   `compl_upp_sec_educ%pop` = col_double(),
##   ..   power_cons_kwh_cap = col_double(),
##   ..   `foreign_dir_investm$` = col_double(),
##   ..   `gni_cap_atlas$` = col_double(),
##   ..   `gdp_person_emplyd_intl$` = col_double(),
##   ..   `gdp_intl$` = col_double(),
##   ..   `internet_use%pop` = col_double(),
##   ..   life_expect_birth_yrs = col_double(),
##   ..   `hi_tech_manuf%valuadded` = col_double(),
##   ..   patent_apps_res = col_double(),
##   ..   `rd_expend%gdp` = col_double(),
##   ..   rd_researchers_millpop = col_double(),
##   ..   secure_intern_serv_millpop = col_double(),
##   ..   `unemplyd_adv_educ%educ_pop` = col_double(),
##   ..   `hi-tec_exprts%manuf_exprts` = col_double(),
##   ..   `ict_serv_exprts%serv_exprts` = col_double(),
##   ..   scien_pubs_millpop = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

d. Dataset cleaning

Happiness_score value for 2013 is NA for all countries; and for 2022 is NA for some countries. Since the score values reported by Gallup are a 3-year rolling average, I imputed the NA 2013 with the average of 2012 and 2014; and the NA 2022 with average of 2021 and 2023.

bound_set <- bound_set |> group_by(code) |>
  mutate (happiness_score = if_else (year==2013 & is.na(happiness_score), 
    (lag(happiness_score, n=1, order_by=year) + lead(happiness_score, n=1, order_by = year))/2, #lag() is 2012, lead() is 2014
    happiness_score)) |>
  mutate (happiness_score = if_else (year==2022 & is.na(happiness_score), 
    (lag(happiness_score, n=1, order_by=year) + lead(happiness_score, n=1, order_by = year))/2, #lag() is 2021, lead() is 2023
    happiness_score)) |>  ungroup()

head(bound_set,16)

## # A tibble: 16 × 26
##    name        code   year `access_electricity%pop` `adj_national_income$`
##    <chr>       <chr> <dbl>                    <dbl>                  <dbl>
##  1 Afghanistan AFG    2000                      4.4                      0
##  2 Afghanistan AFG    2001                      9.3                      0
##  3 Afghanistan AFG    2002                     14.1                      0
##  4 Afghanistan AFG    2003                     19                        0
##  5 Afghanistan AFG    2004                     23.8                      0
##  6 Afghanistan AFG    2005                     28.7                      0
##  7 Afghanistan AFG    2006                     33.5                      0
##  8 Afghanistan AFG    2007                     38.4                      0
##  9 Afghanistan AFG    2008                     42.4                      0
## 10 Afghanistan AFG    2009                     48.3            11080132310
## 11 Afghanistan AFG    2010                     42.7            14289975734
## 12 Afghanistan AFG    2011                     43.2            16630409946
## 13 Afghanistan AFG    2012                     69.1            18511801827
## 14 Afghanistan AFG    2013                     68              18890016968
## 15 Afghanistan AFG    2014                     89.5            18885765416
## 16 Afghanistan AFG    2015                     71.5            18572571196
## # ℹ 21 more variables: `gov_health_expend_intl$cap` <dbl>,
## #   `gov_educ_exp%gdp` <dbl>, `compl_upp_sec_educ%pop` <dbl>,
## #   power_cons_kwh_cap <dbl>, `foreign_dir_investm$` <dbl>,
## #   `gni_cap_atlas$` <dbl>, `gdp_person_emplyd_intl$` <dbl>, `gdp_intl$` <dbl>,
## #   `internet_use%pop` <dbl>, life_expect_birth_yrs <dbl>,
## #   `hi_tech_manuf%valuadded` <dbl>, patent_apps_res <dbl>,
## #   `rd_expend%gdp` <dbl>, rd_researchers_millpop <dbl>, …

Produce a reduced dataset for further processing: reduced dataset for years 2011-2024, when most Word Bank indicators are available, although still missing many innovation and happiness scores.

years_11_24  <- bound_set |> filter( year >= 2011 & year <= 2024) 

colSums (is.na(years_11_24))

##                        name                        code 
##                           0                           0 
##                        year      access_electricity%pop 
##                           0                           0 
##        adj_national_income$  gov_health_expend_intl$cap 
##                           0                           0 
##            gov_educ_exp%gdp      compl_upp_sec_educ%pop 
##                           0                           0 
##          power_cons_kwh_cap        foreign_dir_investm$ 
##                           0                           0 
##              gni_cap_atlas$     gdp_person_emplyd_intl$ 
##                           0                           0 
##                   gdp_intl$            internet_use%pop 
##                           0                           0 
##       life_expect_birth_yrs     hi_tech_manuf%valuadded 
##                           0                           0 
##             patent_apps_res               rd_expend%gdp 
##                           0                           0 
##      rd_researchers_millpop  secure_intern_serv_millpop 
##                           0                           0 
##  unemplyd_adv_educ%educ_pop  hi-tec_exprts%manuf_exprts 
##                           0                           0 
## ict_serv_exprts%serv_exprts          scien_pubs_millpop 
##                           0                           0 
##            innovation_score             happiness_score 
##                         952                         950

We will create a reduced subset of years_11_24 that drops countries that have NA for all innovation_score and then drops countries that have NA for all happiness_score. Having at least one innovation_score and one happiness_score is absolutely necessary to maintain a country on the list – we cannot impute scores with means when they are missing completely.

But leave countries that have at least one non-NA in innovation_score or in happiness_score.

years_11_24_no_is_no_hs  <- years_11_24 |> group_by(code)   |> 
                filter(any(!is.na(innovation_score))) |>
                filter(any(!is.na(happiness_score)))  |> ungroup()
colSums (is.na(years_11_24_no_is_no_hs))

##                        name                        code 
##                           0                           0 
##                        year      access_electricity%pop 
##                           0                           0 
##        adj_national_income$  gov_health_expend_intl$cap 
##                           0                           0 
##            gov_educ_exp%gdp      compl_upp_sec_educ%pop 
##                           0                           0 
##          power_cons_kwh_cap        foreign_dir_investm$ 
##                           0                           0 
##              gni_cap_atlas$     gdp_person_emplyd_intl$ 
##                           0                           0 
##                   gdp_intl$            internet_use%pop 
##                           0                           0 
##       life_expect_birth_yrs     hi_tech_manuf%valuadded 
##                           0                           0 
##             patent_apps_res               rd_expend%gdp 
##                           0                           0 
##      rd_researchers_millpop  secure_intern_serv_millpop 
##                           0                           0 
##  unemplyd_adv_educ%educ_pop  hi-tec_exprts%manuf_exprts 
##                           0                           0 
## ict_serv_exprts%serv_exprts          scien_pubs_millpop 
##                           0                           0 
##            innovation_score             happiness_score 
##                           0                         102

Some countries did not report of all the economic variables in years 2022-2024, and the dataset shows them as zero. In general, most economic variables do not change radically over 2 or 3 years. To avoid negatively biasing the data, I will impute the zeros in 2022 with the 2021 values, the zeros in 2023 with the 2022 values, and the zeros in 2024 with the 2023 values, in that order. This is more appropriate than imputing with the mean or median for the 2011-2024 values, in particular if we want to display a time series and not show a sudden decrease in the variables in 2022-2024. Some countries have not reported some of the variables during the entire 2011-2024 period, but we will not falsely impute those cases.

years_11_24_no_is_no_hs_imputed <- years_11_24_no_is_no_hs |> 
group_by(code) |> 

  mutate(across(where(is.numeric) & !year,  ~{            # do this for every numeric col except year
    prev_yr <- lag(.x, n=1, default=NA)          # gather the col value for previous year relative to the current row
    ifelse(year == 2022 & .x == 0, prev_yr, .x) })) |>    # if 2022 is zero, impute with 2021; do this for the entire dataset

  mutate(across(where(is.numeric) & !year,  ~{            # go back to the beginning and repeat
    prev_yr <- lag(.x, n=1, default=NA)          # gather the previous year value
    ifelse(year == 2023 & .x == 0, prev_yr, .x) }))  |>   # if 2023 is zero, impute with 2022; do this for the entire dataset
  
  mutate(across(where(is.numeric) & !year,  ~{            # go back to the beginning and repeat
    prev_yr <- lag(.x, n=1, default=NA)          # gather the previous year value
    ifelse(year == 2024 & .x == 0, prev_yr, .x) })) |>   # if 2024 is zero impute with 2023.

  ungroup()   #restore things

#note: could have done a for-loop, but for 3 years the loop is not more efficient or clearer than just repeating the code.

WIPO assigned innovation scores of -1 to several countries/years (mainly undeveloped ones) when lacking data. I will impute the -1 innovation score and the remaining NA happiness score with the averages for each country in the 2011-2024 year period.

years_11_24_no_is_no_hs_imputed <- years_11_24_no_is_no_hs_imputed |> 

  group_by(code) |> mutate(
    innovation_score = if_else(innovation_score==-1, mean(innovation_score[innovation_score != -1], na.rm=TRUE ),innovation_score),  #exclude -1 scores when calculating the mean()
    happiness_score = if_else(is.na(happiness_score), mean(happiness_score, na.rm=TRUE), happiness_score)) |>
  ungroup()
colSums (is.na(years_11_24_no_is_no_hs_imputed))

##                        name                        code 
##                           0                           0 
##                        year      access_electricity%pop 
##                           0                           0 
##        adj_national_income$  gov_health_expend_intl$cap 
##                           0                           0 
##            gov_educ_exp%gdp      compl_upp_sec_educ%pop 
##                           0                           0 
##          power_cons_kwh_cap        foreign_dir_investm$ 
##                           0                           0 
##              gni_cap_atlas$     gdp_person_emplyd_intl$ 
##                           0                           0 
##                   gdp_intl$            internet_use%pop 
##                           0                           0 
##       life_expect_birth_yrs     hi_tech_manuf%valuadded 
##                           0                           0 
##             patent_apps_res               rd_expend%gdp 
##                           0                           0 
##      rd_researchers_millpop  secure_intern_serv_millpop 
##                           0                           0 
##  unemplyd_adv_educ%educ_pop  hi-tec_exprts%manuf_exprts 
##                           0                           0 
## ict_serv_exprts%serv_exprts          scien_pubs_millpop 
##                           0                           0 
##            innovation_score             happiness_score 
##                           0                           0

e. Exploratory data analysis

Structure and summary of working dataset.

Get the basic stats of the innovation score and the happiness score and show a plot.

str(years_11_24_no_is_no_hs_imputed)

## tibble [2,016 × 26] (S3: tbl_df/tbl/data.frame)
##  $ name                       : chr [1:2016] "Albania" "Albania" "Albania" "Albania" ...
##  $ code                       : chr [1:2016] "ALB" "ALB" "ALB" "ALB" ...
##  $ year                       : num [1:2016] 2011 2012 2013 2014 2015 ...
##  $ access_electricity%pop     : num [1:2016] 99.7 99.9 99.9 100 100 99.9 99.9 100 100 100 ...
##  $ adj_national_income$       : num [1:2016] 1.09e+10 1.02e+10 1.09e+10 1.11e+10 9.44e+09 ...
##  $ gov_health_expend_intl$cap : num [1:2016] 244 258 271 291 314 ...
##  $ gov_educ_exp%gdp           : num [1:2016] 3.18 3.3 3.54 3.18 3.44 ...
##  $ compl_upp_sec_educ%pop     : num [1:2016] 43.5 45.3 43.8 46 48.2 ...
##  $ power_cons_kwh_cap         : num [1:2016] 2206 2118 2533 2309 2098 ...
##  $ foreign_dir_investm$       : num [1:2016] -8.47e+08 -8.35e+08 -1.23e+09 -1.07e+09 -9.10e+08 ...
##  $ gni_cap_atlas$             : num [1:2016] 4460 4370 4560 4570 4410 4370 4340 4910 5300 5320 ...
##  $ gdp_person_emplyd_intl$    : num [1:2016] 29637 31238 35412 36366 35343 ...
##  $ gdp_intl$                  : num [1:2016] 2.98e+10 3.03e+10 3.07e+10 3.27e+10 3.38e+10 ...
##  $ internet_use%pop           : num [1:2016] 47 49.4 51.8 54.3 56.9 ...
##  $ life_expect_birth_yrs      : num [1:2016] 78.3 78.1 78 78 78.4 ...
##  $ hi_tech_manuf%valuadded    : num [1:2016] 10.55 17.1 5.64 6.63 5.8 ...
##  $ patent_apps_res            : num [1:2016] 3 0 0 10 14 20 16 15 4 0 ...
##  $ rd_expend%gdp              : num [1:2016] 0 0 0 0 0 0 0 0 0 0 ...
##  $ rd_researchers_millpop     : num [1:2016] 0 0 0 0 0 0 0 0 0 0 ...
##  $ secure_intern_serv_millpop : num [1:2016] 8.26 25.51 35.58 52.27 68.39 ...
##  $ unemplyd_adv_educ%educ_pop : num [1:2016] 15.6 15.8 14.9 17.8 19.3 ...
##  $ hi-tec_exprts%manuf_exprts : num [1:2016] 1.017 0.923 0.89 0.257 1.634 ...
##  $ ict_serv_exprts%serv_exprts: num [1:2016] 3.85 6.42 8.02 6.6 4.98 ...
##  $ scien_pubs_millpop         : num [1:2016] 141 151 149 154 159 ...
##  $ innovation_score           : num [1:2016] 30.4 30.4 30.9 30.5 30.7 ...
##  $ happiness_score            : num [1:2016] 5.13 5.55 5.25 4.96 4.66 ...

summary(years_11_24_no_is_no_hs_imputed)

##      name               code                year      access_electricity%pop
##  Length:2016        Length:2016        Min.   :2011   Min.   :  6.20        
##  Class :character   Class :character   1st Qu.:2014   1st Qu.: 86.30        
##  Mode  :character   Mode  :character   Median :2018   Median : 99.80        
##                                        Mean   :2018   Mean   : 86.31        
##                                        3rd Qu.:2021   3rd Qu.:100.00        
##                                        Max.   :2024   Max.   :100.00        
##  adj_national_income$ gov_health_expend_intl$cap gov_educ_exp%gdp
##  Min.   :0.000e+00    Min.   :   0.0             Min.   : 0.000  
##  1st Qu.:1.309e+10    1st Qu.: 105.2             1st Qu.: 2.591  
##  Median :4.989e+10    Median : 483.6             Median : 3.933  
##  Mean   :4.536e+11    Mean   :1170.7             Mean   : 3.724  
##  3rd Qu.:2.621e+11    3rd Qu.:1622.6             3rd Qu.: 5.180  
##  Max.   :1.959e+13    Max.   :8505.6             Max.   :10.827  
##  compl_upp_sec_educ%pop power_cons_kwh_cap foreign_dir_investm$
##  Min.   : 0.00          Min.   :    0.0    Min.   :-3.454e+11  
##  1st Qu.: 0.00          1st Qu.:  605.9    1st Qu.:-2.882e+09  
##  Median :36.72          Median : 2217.3    Median :-7.511e+08  
##  Mean   :37.63          Mean   : 4074.7    Mean   :-1.007e+09  
##  3rd Qu.:71.57          3rd Qu.: 5251.4    3rd Qu.:-2.104e+07  
##  Max.   :98.29          Max.   :55085.2    Max.   : 2.183e+11  
##  gni_cap_atlas$   gdp_person_emplyd_intl$   gdp_intl$         internet_use%pop
##  Min.   :     0   Min.   :     0          Min.   :0.000e+00   Min.   :  0.00  
##  1st Qu.:  2100   1st Qu.: 17994          1st Qu.:4.518e+10   1st Qu.: 31.00  
##  Median :  6225   Median : 45696          Median :1.627e+11   Median : 63.26  
##  Mean   : 16100   Mean   : 57138          Mean   :9.195e+11   Mean   : 57.07  
##  3rd Qu.: 21385   3rd Qu.: 83556          3rd Qu.:6.060e+11   3rd Qu.: 83.51  
##  Max.   :105070   Max.   :291199          Max.   :3.819e+13   Max.   :100.00  
##  life_expect_birth_yrs hi_tech_manuf%valuadded patent_apps_res    
##  Min.   :46.93         Min.   : 0.000          Min.   :      0.0  
##  1st Qu.:67.65         1st Qu.: 8.533          1st Qu.:      0.0  
##  Median :74.10         Median :21.116          Median :     66.5  
##  Mean   :72.92         Mean   :23.649          Mean   :  13903.8  
##  3rd Qu.:78.88         3rd Qu.:36.802          3rd Qu.:    792.5  
##  Max.   :85.53         Max.   :92.579          Max.   :1426644.0  
##  rd_expend%gdp    rd_researchers_millpop secure_intern_serv_millpop
##  Min.   :0.0000   Min.   :   0.00        Min.   :     0.02         
##  1st Qu.:0.0000   1st Qu.:   0.00        1st Qu.:    24.98         
##  Median :0.1697   Median :  90.97        Median :   246.70         
##  Mean   :0.6648   Mean   :1322.11        Mean   : 10883.27         
##  3rd Qu.:0.9489   3rd Qu.:1883.16        3rd Qu.:  3218.64         
##  Max.   :6.0192   Max.   :9434.76        Max.   :557174.25         
##  unemplyd_adv_educ%educ_pop hi-tec_exprts%manuf_exprts
##  Min.   : 0.000             Min.   : 0.000            
##  1st Qu.: 0.000             1st Qu.: 1.449            
##  Median : 3.550             Median : 5.738            
##  Mean   : 4.887             Mean   : 9.767            
##  3rd Qu.: 6.588             3rd Qu.:13.941            
##  Max.   :35.184             Max.   :72.637            
##  ict_serv_exprts%serv_exprts scien_pubs_millpop innovation_score
##  Min.   : 0.000              Min.   :     0     Min.   :10.20   
##  1st Qu.: 2.800              1st Qu.:   197     1st Qu.:25.33   
##  Median : 6.335              Median :  1182     Median :31.50   
##  Mean   : 9.406              Mean   : 18243     Mean   :34.37   
##  3rd Qu.:11.414              3rd Qu.: 10723     3rd Qu.:41.74   
##  Max.   :64.806              Max.   :898949     Max.   :68.40   
##  happiness_score
##  Min.   :2.392  
##  1st Qu.:4.673  
##  Median :5.523  
##  Mean   :5.516  
##  3rd Qu.:6.343  
##  Max.   :7.856

cat("Innovation Score (0-100)","\n"); summary(years_11_24_no_is_no_hs_imputed$innovation_score)

## Innovation Score (0-100)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   25.33   31.50   34.37   41.74   68.40

cat("Happiness score (0-10)", "\n"); summary (years_11_24_no_is_no_hs_imputed$happiness_score)

## Happiness score (0-10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.392   4.673   5.523   5.516   6.343   7.856

ggplot(data = years_11_24_no_is_no_hs_imputed, aes(x = happiness_score, y = innovation_score)) +
  geom_point() +
  labs(title = "Country Innovation vs Happiness",
       x = "Happiness (0-10)",
       y = "Innovation (0-100)") +
  theme_minimal()

# plot it also on a log(y) scale for clarity

ggplot(data = years_11_24_no_is_no_hs_imputed, aes(x = happiness_score, y = innovation_score)) +
  geom_point() +
  scale_y_log10() +
  labs(title = "Country Innovation vs Happiness",
       x = "Happiness (0-10)",
       y = "Innovation (0-100)") +
  theme_minimal()

The first scatterplot shows a higher than linear correlation with significant scatter at the lower I and H values. The log(innovation) vs happiness appears more linear and positively correlated, with much scatter in the lower half. This indicates that happiness is not a complete predictor of innovation, and that other related factors are important, which will become obvious as we carry out the multi-factor PCA.

f. Preparation for PCA

Now, in preparation for PCA, we produce a training set and a testing set. We will use the training set to do the PCA, and get eigenvectors and eigenvalues. Then we will project both the training and testing sets into a reduced dimensional space defined by the leading principal components. We will use the same transformation (means, sds, and PCs) derived from the training set to project both the training and testing sets. A comparison of outcomes will allow us to evaluate the generalizability of the PCA model.

Will use R’s sample() function to split the data into a training set (80%) and a test set (20%).

set.seed(123) #   set.seed generates a reproducible seed for the random number generator

training_idx <- sample(1:nrow(years_11_24_no_is_no_hs_imputed), 0.8 * nrow(years_11_24_no_is_no_hs_imputed))

training_11_24  <- years_11_24_no_is_no_hs_imputed [training_idx, ]         # this is the training set, with ca 1610 rows
testing_11_24   <- years_11_24_no_is_no_hs_imputed [-training_idx, ]        # this is the test set, with ca 410 rows

Centering and scaling variables

Following with the preparation for PCA, let’s center and scale all numeric variables across all rows (not by country), that is, do (variable - mean(variable))/sd(variable). We will use R’s scale() rather than manual centering and scaling. And extract only the numeric values without year column for the training and testing sets.

First scale the training set, and then the testing set using the same means and sd to preserve consistency and generalizability of the PCA model.

training_11_24_scaled <- training_11_24 |> select(where(is.numeric), -year) |> # training data without year
  scale(center=TRUE, scale=TRUE)                                               # scale the training data
training_mean <- attr(training_11_24_scaled, "scaled:center")  # remember the means from the scaled training data
training_sd <- attr(training_11_24_scaled, "scaled:scale")     # remember the sd from the scaled training data

testing_11_24_scaled <- testing_11_24 |> select(where(is.numeric), -year) |>  # testing data without year
  scale(center=training_mean, scale=training_sd) # scale the testing data with the same training set means and sd

g. PCA of the training set and Interpretation

Now, do PCA, and plot the eigenvalues in order of importance. (Scree plot.)

training_11_24_pca <- prcomp(training_11_24_scaled, center = TRUE, scale. = TRUE)   

plot(training_11_24_pca, type="lines")   # this is the Scree plot, with an apparent "shoulder" at PC 3 to 4.

Scree plot

Clearly, the first 3 or 4 eigenvalues show that the first 3 to 4 eigenvectors (PCs) carry a very significant proportion of the variance.

But, for better analysis, show the basic statistics of the PCs.

summary (training_11_24_pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.0791 1.8383 1.20387 1.07441 1.03220 0.95469 0.92450
## Proportion of Variance 0.4122 0.1469 0.06301 0.05019 0.04632 0.03963 0.03716
## Cumulative Proportion  0.4122 0.5591 0.62215 0.67234 0.71866 0.75829 0.79545
##                            PC8    PC9    PC10    PC11   PC12    PC13   PC14
## Standard deviation     0.91293 0.8752 0.73113 0.67054 0.6217 0.58370 0.5296
## Proportion of Variance 0.03624 0.0333 0.02324 0.01955 0.0168 0.01481 0.0122
## Cumulative Proportion  0.83169 0.8650 0.88823 0.90778 0.9246 0.93940 0.9516
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.50616 0.47355 0.40476 0.35924 0.33451 0.30870 0.25606
## Proportion of Variance 0.01114 0.00975 0.00712 0.00561 0.00487 0.00414 0.00285
## Cumulative Proportion  0.96273 0.97248 0.97961 0.98522 0.99008 0.99422 0.99708
##                          PC22    PC23
## Standard deviation     0.2249 0.12916
## Proportion of Variance 0.0022 0.00073
## Cumulative Proportion  0.9993 1.00000

Importance of components

Standard deviation: shown for each PC, in decreasing order of importance. The eigenvalues by def. are the squares of the standard deviations.

Proportion of variance: calculated for each PC as the eigenvalue/sum(all eigenvalues), capturing the fraction of the total variance explained by the PC. For example, PC1 captures 41.2% of the variance, PC2 captures the next 14.7% of the variance, and so on.

Cumulative Proportion: the cumulative sum of proportion of variances. For example, PC1-PC2 capture 56% of the variance; PC1-PC8: 83%.

PC1-PC2 is adequate for a rapid visual examination; however, we need PC1-PC8 to cover at least 80% of the variance.

The Kaiser criterion says that we should keep all PCs with eigenvalues greater than 1, that is with variance greater than 1 SD. This would mean PC1-PC5, that explain 72% of the total variance.

For testing accuracy, we will continue considering 5 PCs to capture 72% of the variance.

Examine the Loadings, Scores, and the biplot

Loadings – weight of each variable on the PCs, i.e., the correlation of each variable (e.g., access_electricity) with each PC. Appear as vectors in the biplot, with magnitude and direction.

Scores – the projections of variables on the PCs. Coordinates of each original observation (country-year) in the PC space.

Biplot – scatterplot of scores of observations and loadings of variables in PC1-PC2 space. Each point represents an observation.

training_loadings <- as.data.frame(training_11_24_pca$rotation)  # Loadings or rotation matrix - weights of the variables in the PCs.

training_scores <- as.data.frame(training_11_24_pca$x)   # Scores - or coordinates of each observation in the PC space.

training_loadings

##                                     PC1          PC2         PC3          PC4
## access_electricity%pop       0.20316700  0.077106901  0.48979821  0.063528309
## adj_national_income$         0.13262112 -0.447738860 -0.01128397  0.018774599
## gov_health_expend_intl$cap   0.28642445  0.052795667 -0.17051916  0.035201014
## gov_educ_exp%gdp             0.09180427  0.089810352 -0.32324198 -0.229925574
## compl_upp_sec_educ%pop       0.21902604  0.108523752  0.21395054 -0.253954516
## power_cons_kwh_cap           0.22248778  0.078782978 -0.03143678  0.215684113
## foreign_dir_investm$         0.02306965  0.129828807 -0.20701621  0.228839763
## gni_cap_atlas$               0.28582657  0.077444920 -0.14495549  0.128709989
## gdp_person_emplyd_intl$      0.28016528  0.107693320  0.01187456  0.116297664
## gdp_intl$                    0.11815734 -0.490431449  0.03986956 -0.017879132
## internet_use%pop             0.26365605  0.103099856  0.23395957  0.034952906
## life_expect_birth_yrs        0.27013223  0.094809925  0.28304426  0.086024776
## hi_tech_manuf%valuadded      0.24484455 -0.005805019  0.01751150 -0.042308889
## patent_apps_res              0.07790886 -0.457794681  0.04082463  0.086946631
## rd_expend%gdp                0.26402320 -0.050834335 -0.16724112 -0.114190548
## rd_researchers_millpop       0.26519546  0.054703399 -0.18587509 -0.076395598
## secure_intern_serv_millpop   0.14947848  0.009786252 -0.25796375 -0.179578471
## unemplyd_adv_educ%educ_pop  -0.02627084  0.028399693  0.45238189 -0.474395852
## hi-tec_exprts%manuf_exprts   0.17596634 -0.028331285 -0.09765687 -0.085948763
## ict_serv_exprts%serv_exprts  0.04720601 -0.004278094 -0.17062690 -0.659624944
## scien_pubs_millpop           0.12906244 -0.490610902  0.03569117  0.004358199
## innovation_score             0.29453285  0.010591898 -0.04415266 -0.044053225
## happiness_score              0.26521145  0.111506134  0.04043206  0.095479125
##                                      PC5          PC6          PC7          PC8
## access_electricity%pop      -0.109414483 -0.190187315 -0.007565631  0.069455817
## adj_national_income$         0.049442085 -0.040911275 -0.030546917 -0.185111126
## gov_health_expend_intl$cap   0.141968597  0.055224479 -0.120946010 -0.188794037
## gov_educ_exp%gdp            -0.194508622 -0.788642284  0.071149983 -0.159617719
## compl_upp_sec_educ%pop       0.024781304 -0.064036080  0.120940445 -0.123315367
## power_cons_kwh_cap          -0.005532514  0.028413375 -0.393335093 -0.089723618
## foreign_dir_investm$         0.713055544 -0.232205654  0.090136601  0.317149950
## gni_cap_atlas$               0.052902046  0.162185235 -0.154000938 -0.109097296
## gdp_person_emplyd_intl$      0.029379427  0.199446830 -0.131392741 -0.084097385
## gdp_intl$                    0.026132281 -0.052361380 -0.040840801 -0.042961583
## internet_use%pop            -0.056450410 -0.128225992 -0.044576963 -0.014609844
## life_expect_birth_yrs       -0.030397734 -0.048627741 -0.031871181  0.076021744
## hi_tech_manuf%valuadded     -0.013976519  0.185086448  0.173203650  0.308288827
## patent_apps_res              0.050112967 -0.080532757 -0.009389076  0.161871327
## rd_expend%gdp                0.164289279 -0.055028379  0.079566122  0.145003867
## rd_researchers_millpop       0.155668376  0.082365407  0.194869415  0.008622786
## secure_intern_serv_millpop  -0.007478542  0.269968063  0.261517452 -0.537600635
## unemplyd_adv_educ%educ_pop   0.451616073  0.002013406  0.147573091 -0.186937389
## hi-tec_exprts%manuf_exprts  -0.347919730  0.144876939  0.497443601  0.391161987
## ict_serv_exprts%serv_exprts -0.029948813  0.127733447 -0.555456232  0.349547904
## scien_pubs_millpop           0.043343200 -0.068989596 -0.035384748 -0.021564328
## innovation_score            -0.005663304 -0.009010362  0.109080870  0.084555949
## happiness_score             -0.168071247 -0.158437577 -0.150034385  0.004859997
##                                       PC9        PC10        PC11         PC12
## access_electricity%pop      -0.3409047217  0.05646494  0.11328971  0.132204575
## adj_national_income$         0.0005089846  0.04447504 -0.00037551 -0.375117130
## gov_health_expend_intl$cap   0.0570425078  0.04546782  0.05808652 -0.117513107
## gov_educ_exp%gdp             0.1025688234 -0.07859850  0.20128412 -0.020895985
## compl_upp_sec_educ%pop       0.0087984718  0.23441551 -0.75814643 -0.232357563
## power_cons_kwh_cap           0.2964670181 -0.41177129 -0.20823075  0.201123645
## foreign_dir_investm$        -0.3449507571 -0.21031297 -0.11272279 -0.173182983
## gni_cap_atlas$               0.1388186528 -0.10574156  0.10685516 -0.171680881
## gdp_person_emplyd_intl$      0.0972681539 -0.04007233  0.15152610 -0.151274067
## gdp_intl$                   -0.0603471392 -0.02736703  0.01973595 -0.176121812
## internet_use%pop            -0.1586358114 -0.14679970 -0.06229387  0.053072459
## life_expect_birth_yrs       -0.1148475923 -0.02927932  0.10319937  0.173628395
## hi_tech_manuf%valuadded     -0.0506025230  0.30896575  0.39020215 -0.190486296
## patent_apps_res             -0.0342802446 -0.12300994 -0.07951331  0.491809391
## rd_expend%gdp                0.1995113489  0.30614173 -0.06223257  0.285136276
## rd_researchers_millpop       0.1769180249  0.23313420 -0.02001267  0.354571273
## secure_intern_serv_millpop  -0.5571828954 -0.17841177  0.03659955  0.219661566
## unemplyd_adv_educ%educ_pop   0.3146678671 -0.30513647  0.27622470 -0.015537074
## hi-tec_exprts%manuf_exprts   0.1007587821 -0.53653547 -0.12196196 -0.159753970
## ict_serv_exprts%serv_exprts -0.2431938724 -0.07756923 -0.01953002  0.003031005
## scien_pubs_millpop          -0.0181552950 -0.02206430 -0.01588087 -0.028920252
## innovation_score             0.1304866883  0.05537342  0.02495161  0.002740962
## happiness_score             -0.1461452907  0.02936745  0.08062957 -0.163515875
##                                     PC13         PC14         PC15         PC16
## access_electricity%pop      -0.005864037 -0.266075566 -0.163547422  0.229907597
## adj_national_income$        -0.009601885 -0.295446582 -0.234414847  0.003767368
## gov_health_expend_intl$cap  -0.224483355  0.016415760 -0.099398821 -0.169130266
## gov_educ_exp%gdp             0.111745383  0.143349721 -0.024971955  0.152172906
## compl_upp_sec_educ%pop       0.123110837  0.174642163  0.164433938  0.085711111
## power_cons_kwh_cap           0.541620401 -0.255437344  0.004398078  0.039625004
## foreign_dir_investm$         0.025498065 -0.021941323  0.009639595  0.054642542
## gni_cap_atlas$              -0.092975818  0.140802883  0.132653882  0.119514631
## gdp_person_emplyd_intl$     -0.231962913  0.369493931 -0.105053567  0.225785242
## gdp_intl$                    0.060716340 -0.049292428 -0.108074613  0.050075637
## internet_use%pop             0.064832408  0.389245780 -0.469977002 -0.382816415
## life_expect_birth_yrs       -0.181366710 -0.156940744  0.036289287  0.262106621
## hi_tech_manuf%valuadded      0.645522709  0.178242273  0.102175279 -0.052223655
## patent_apps_res             -0.054744738  0.410631081  0.325114912 -0.057021234
## rd_expend%gdp               -0.040178226 -0.363234206 -0.061062733 -0.213371807
## rd_researchers_millpop      -0.105244244  0.004645052 -0.310091769 -0.059141375
## secure_intern_serv_millpop   0.148815929 -0.073074068  0.112839264  0.025110672
## unemplyd_adv_educ%educ_pop   0.002490355 -0.026131483  0.135332914 -0.135784541
## hi-tec_exprts%manuf_exprts  -0.131110705 -0.116017633 -0.094520405 -0.071133905
## ict_serv_exprts%serv_exprts -0.067433354 -0.009777606 -0.066793107  0.062499784
## scien_pubs_millpop          -0.019677690  0.067516683  0.032889113  0.014912631
## innovation_score            -0.120066280 -0.054083526  0.308754996  0.432043338
## happiness_score             -0.183083731 -0.174735682  0.511135177 -0.574911910
##                                     PC17         PC18         PC19         PC20
## access_electricity%pop      -0.314105604 -0.309672573  0.304415303  0.209140444
## adj_national_income$        -0.008944124  0.025509328 -0.121576210 -0.212318565
## gov_health_expend_intl$cap  -0.024809122  0.194255067  0.080395831  0.778614815
## gov_educ_exp%gdp            -0.038021375  0.086291488  0.101112040 -0.038209233
## compl_upp_sec_educ%pop      -0.067843028  0.116962335  0.113971427 -0.005738801
## power_cons_kwh_cap          -0.104423876 -0.014976736 -0.043910174  0.003801451
## foreign_dir_investm$        -0.040675898 -0.012265488 -0.010583922 -0.056460108
## gni_cap_atlas$               0.073970305  0.013058763  0.226907802 -0.064116467
## gdp_person_emplyd_intl$     -0.116544934 -0.169970742  0.301882183 -0.370329158
## gdp_intl$                   -0.060888215  0.004223584 -0.015329675 -0.015134386
## internet_use%pop             0.373625871 -0.185216559 -0.282690839 -0.033318253
## life_expect_birth_yrs        0.273448990  0.717729559 -0.136252420 -0.155628355
## hi_tech_manuf%valuadded     -0.017471953  0.128697625  0.020848370  0.066273660
## patent_apps_res             -0.079861205  0.027007393  0.142355281  0.067084243
## rd_expend%gdp                0.442099140 -0.174176882  0.417403676 -0.143135568
## rd_researchers_millpop      -0.574968274  0.096844296 -0.321467486 -0.158627715
## secure_intern_serv_millpop   0.074962020 -0.036901664  0.035922071 -0.068650256
## unemplyd_adv_educ%educ_pop  -0.033712722 -0.017765877  0.009767949 -0.013927145
## hi-tec_exprts%manuf_exprts  -0.061826310  0.043979315  0.124224202  0.059780581
## ict_serv_exprts%serv_exprts -0.044245195  0.006824996 -0.036089037 -0.017523661
## scien_pubs_millpop           0.007607684  0.020944629 -0.026334067  0.060220431
## innovation_score             0.224147570 -0.449015573 -0.533800332  0.159796339
## happiness_score             -0.222133790 -0.054552789 -0.132200454 -0.202333940
##                                     PC21         PC22         PC23
## access_electricity%pop      -0.150245390 -0.019102240 -0.041907983
## adj_national_income$        -0.329197592  0.536125183  0.026751139
## gov_health_expend_intl$cap   0.137220972  0.162459000  0.058756003
## gov_educ_exp%gdp             0.005073354  0.027455965 -0.001771478
## compl_upp_sec_educ%pop      -0.008740239  0.005002676  0.013512469
## power_cons_kwh_cap           0.119738101  0.062001250 -0.021949174
## foreign_dir_investm$         0.028140409  0.003475330 -0.016498786
## gni_cap_atlas$              -0.667659488 -0.413844697 -0.005837778
## gdp_person_emplyd_intl$      0.439423711  0.240728355 -0.014610944
## gdp_intl$                    0.297999820 -0.511663471  0.570764934
## internet_use%pop            -0.091932187 -0.044605243  0.012090545
## life_expect_birth_yrs        0.069873377  0.006962597  0.020750645
## hi_tech_manuf%valuadded      0.001537100  0.090756328 -0.017101756
## patent_apps_res             -0.182278779  0.297117644  0.203450615
## rd_expend%gdp                0.081612769 -0.021205895  0.014024573
## rd_researchers_millpop      -0.095574980 -0.128460868 -0.024379281
## secure_intern_serv_millpop   0.047532339  0.017024607 -0.012853840
## unemplyd_adv_educ%educ_pop  -0.011006947  0.008622050  0.008591173
## hi-tec_exprts%manuf_exprts   0.011970600  0.022724159 -0.018142855
## ict_serv_exprts%serv_exprts -0.044804130  0.035309461 -0.013319840
## scien_pubs_millpop           0.176593200 -0.262316397 -0.787419370
## innovation_score             0.036524633  0.056003958  0.056798509
## happiness_score              0.088999011 -0.024094419  0.007910964

head (training_scores, 10)

##            PC1        PC2        PC3        PC4         PC5        PC6
## 1  -2.71208621 -0.2717319 -0.6239425 -0.4623557 -0.08736219  0.1230622
## 2   5.40458435  1.5159919 -1.0723738 -0.1159245  0.12021344 -1.3906208
## 3   0.02231588  0.4518721 -2.3739180 -1.1847722 -0.92852983  0.8399403
## 4  -1.36545659  0.2762572  0.5615292  0.1435630 -0.33521307 -0.5125405
## 5  -3.66568543 -0.3522913 -0.3015298 -0.1644236  0.65742844  0.0165078
## 6   0.26267053  0.3997017 -0.0744158  0.8725797 -1.70599960 -0.4994868
## 7   0.58917233 -0.2571519  1.2814419 -0.2351937  0.59870718 -0.4931806
## 8   2.26550796  0.8805655  0.8085835  1.3414248 -0.67125148  1.5561267
## 9   2.86655140  1.0283999 -0.5941148  1.1050917 -0.24545626 -0.6169065
## 10 -3.60248127 -0.5179609 -2.3967207  0.1970093 -0.42386884 -0.2789826
##            PC7          PC8         PC9        PC10        PC11        PC12
## 1   0.06164683  0.346874756  0.05524839 -0.46899499  0.39684184 -0.26043311
## 2   0.66827072  0.007930293  1.53033573  1.36619802  0.77706790  0.15936386
## 3   1.53826743 -4.305574903 -5.73152920 -1.22995670  1.18046371  1.91682639
## 4  -0.03915452  0.125958249 -0.58304772  0.20431199  0.09608363 -0.16063899
## 5   0.37871561 -0.669608126  0.64146226 -0.54579840  0.02970345  0.03729472
## 6   0.41417849  0.870772796 -0.60193348 -1.68453139  0.42313781 -0.16997207
## 7   0.41614103 -0.390985482  0.50371136  0.39333014  0.95589568  0.25474489
## 8   0.85372442  0.811908920 -0.12955590 -0.71085632 -0.27196005 -0.36398106
## 9  -0.33312374 -0.228436478  0.44400524  0.04777887  1.18379768  1.00818130
## 10  0.65757479 -0.151950176  0.91134034 -0.07285224 -0.26421540 -0.36578252
##            PC13        PC14        PC15        PC16        PC17        PC18
## 1   0.005258229 -0.03946354  0.11284612 -0.62248035 -0.30746022 -0.64941252
## 2  -0.504483097  0.32466856  0.30342227 -0.58848515 -0.43323557 -0.13123012
## 3   1.077302669 -0.43743772  0.88073756 -0.08079890  0.68401626 -0.69803322
## 4  -0.092019616 -0.36640829  0.49291023 -0.51022090 -0.39939729  0.06776449
## 5  -0.134546267  0.09022968 -0.19737198 -0.64086457 -0.09815479 -0.15711998
## 6  -0.425155149  0.02771789 -0.92364596 -0.81119352 -0.15482185 -0.09447330
## 7   0.280874666 -0.02093498 -0.22577345  0.08388804  0.37103396 -0.09282740
## 8  -1.156925637 -0.17335112  0.28112240 -0.15954463  0.18860650 -0.05636166
## 9  -0.941394260 -0.46739605 -0.05619996 -0.36307103 -0.20957384 -0.19147571
## 10  0.250261486  0.31106948  0.38882543 -0.08344845  0.13163527  0.30493193
##           PC19        PC20        PC21         PC22        PC23
## 1  -0.05106194  0.07585810 -0.14174605  0.012662800 -0.03225248
## 2   0.07273135 -0.20250377 -0.50380685 -0.410029219  0.05925335
## 3  -0.09882520 -0.63437062  0.18217275  0.137247676 -0.13462471
## 4   0.17724178 -0.02394272 -0.07011085  0.015688232 -0.03073756
## 5   0.25045232 -0.02175906 -0.17550657 -0.137530428 -0.03517410
## 6   0.16276950 -0.33549806  0.20779677 -0.001284633  0.01412862
## 7   0.07079204 -0.33040879  0.41999381 -0.109462445  0.09342940
## 8  -0.50378243  0.33791783  0.25368058  0.161588023  0.08695952
## 9  -1.11946533 -0.01528679 -0.21113138 -0.301887658  0.04325492
## 10 -0.16320263 -0.06773584  0.14828454  0.042963500  0.01363937

Conclusion from the loadings

I will only analyze the first two components here. After looking at the biplots later in the report, I will analyze the top 5 components.

PC1 has important contributions from several variables, as opposed to one or two dominant ones. Loadings above .28: gov_health_expenditure, gni_capita, gdp_person_employed, innovation_score; loadings above .26: happiness_score, internet_use, life_expectancy_birth, RD_expenditure, RD_researchers. This principal component captures the balance of scientific, economic, and social strengths of countries, essential for innovation capacity. This breath of coverage is reflected by its high eigenvalue.

PC2 has fewer dominant contributions, which are all negative and below -.44: adjusted_national_income, GDP, patent applications, scientific publications. Much less dominant contributions, all positive, and above 0.10: completed_upper_second_educ, foreign direct investment, gdp_person_employed, internet use, happiness score. This principal component captures scientific and economic strengths of countries better than PC1, counteracted to a minor extent by social factors, such as happiness, or external influences, such as foreign direct investment.

The contributions of each variable to the various PCs show interesting features: for example access_electricity contributes mostly to PC3, innovation_score mostly to PC19, happiness_score mostly to PC16. But no PC is dominated by a single variable, a sign of some degree of collinearity.

The relationships can be better understood by examing the biplots. We’ll focus only on PC1-PC2 biplots.

#  biplot (training_11_24_pca, c(1,2))   # biplot of scores and variable loadings in PC1, PC2

The biplot() function yields an unclear picture for 23 variables.

I searched for an alternative. The function fviz_pca_biplot from library factoextra seems a little clearer.

fviz_pca_biplot(
  training_11_24_pca, geom.ind = "point", col.ind =  "lightgreen", col.var = "black",
  repel = TRUE, arrowsize = 0.7)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

I put some color correlated with length of vector, ie magnitude or importance of variable

fviz_pca_biplot(
  training_11_24_pca,
  geom.ind = "point", col.ind = "orange", label = "var", labelsize = 3,  arrowsize = 0.7,
  col.var = "contrib", gradient.cols = c("blue", "red"), palette = "Dark2", repel = TRUE,
  title = "Country/year scores in PCA Space")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The biplot shows important outliers in the 4th quadrant. To identify those country-years, we’ll group by country (code), get the mean scores per country, and do a ggplot to label the countries.

Above we also see 5 clusters of observations:

1: PC1<0 & PC2>=-2.5,

2: PC1<0 & PC2<-2.5,

3: PC1>=0 & PC1<=2.5;

4: PC1>2.5 & PC2>0;

5: PC1>2.5 & PC2<=0;

let’s color those clusters.

Let’s consider only the first 5 PCs, following the earlier conclusions from the Spree plot and Kaiser criterion.

training_scores_5 <- training_scores[,1:5] 
training_scores_5$code <- training_11_24$code   # restore the country code
training_scores_5_country <- training_scores_5  |> group_by(code) |>
  summarize(across(starts_with("PC"), \(x) mean(x, na.rm = TRUE)))   # average scores of countries over 2011-2024 period
training_scores_5_country <- training_scores_5_country |> 
  mutate(  cluster = case_when(                             # define clusters 1, 2, 3,  4, and 5
            PC1 < 0 & PC2 >= -2.5 ~ 1,
            PC1 < 0 & PC2 < -2.5  ~ 2,
            PC1 >= 0 & PC1 <= 2.5 ~ 3,
            PC1 > 2.5 & PC2 >   0 ~ 4,
            PC1 > 2.5 & PC2 <= 0  ~ 5
         ))
ggplot(training_scores_5_country, aes(x = PC1, y = PC2, label=code, color = cluster)) +
  geom_point(size = 3) +
  geom_text_repel(size = 3) +
  theme_minimal(base_size = 14) +
  labs(title = "Country average scores in PC1-PC2 space",
       subtitle = "",
       x = "PC1", y = "PC2")

## Warning: ggrepel: 127 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Go back to the unclustered data to spread out the scores over the years. As stated above, PC2 reflects economic and technological strength (and its dominant loadings are negative) so this will highlight the most economically and technologically dominant countries-years.

ggplot(training_11_24, aes(x=training_11_24_pca$x[,1], y=training_11_24_pca$x[,2], label = code)) +
  geom_point() + geom_text_repel() + theme_minimal()

## Warning: ggrepel: 1592 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Conclusions from the biplots

From the biplots we see that some of the variables have strong loadings, greater than 0.25, on some of the PCs. Those that have strong loadings are worth discussing.

We qualitatively analyze the biplots to derive an interpretation of the meaning of each PC. Long vector identify the leading variables. Parallel vectors indicate correlation between variables.

16 of the variables contribute moderately to significantly to +PC1 and have small positive or slightly negative PC2 (their vectors are almost parallel to PC1, showing some collinearity); as stated above, those variables qualify scientific productivity factors: RD expenditure, GNI, innovation score, high-tech manufacturing, etc., as well as social factors: access to electricity, completed secondary education, life expectancy, etc.

4 of the variables in the 4th quadrant contribute equally strongly to +PC1 and very strongly to -PC2 (negative effect with vectors pointing down in the 4th quadrant): GDP, adjusted national income, patent applications and scientific publications. It is interesting to note that the points (scores) corresponding to the USA and China determine the magnitude and directions of those vectors. This quadrant of the biplot contains the group of countries-years that represents the most advanced economies and technological leaders.

2 of the variables point in the +PC1 and +PC2 directions (first quadrant) but are relatively weak: gov. expenditure in education, and foreign direct investments.

1 of the weak variables in the second quadrant (-PC1 +PC2 direction) is: unemployed with advanced education.

Major contributions from each variable

The largest contributions of each variable to each PC can be a little better visualized with a heatmap.

training_loadings_long <- training_loadings |> # loadings of each variable to each PC
  rownames_to_column("variable") |>
  pivot_longer(-variable, names_to = "PC", values_to = "value")

ggplot(training_loadings_long, aes(x = fct_inorder(PC), y = variable, fill = value)) +
  geom_tile() +
 scale_fill_gradient2(low = "red", mid= "lightyellow", high = "darkblue", midpoint= 0) +
    theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle=90, hjust=1),
    axis.text.y = element_text(size=7)
  ) +
  labs(title="Variable Contributions to the PCs (Loadings)",
       fill="R²")

Conclusions from the heatmap

Again we can see that PC1 receives modest contributions from several variables, capturing the balance of scientific, societal, and economic strengths of countries. More specific analysis is provided in the next section.

PC2 is more focused on scientific measures and economic power.

PC3 reflects societal variables mainly.

PC4 is strongly influenced by ICT exports and social support.

PC5 is strongly influenced by foreign direct investments, high-tech exports and societal loads.

The heatmap also shows that the contributions above 45% are: ict_service_exports to PC4, foreign_direct_investment to PC5, gov_educational_expenses to PC6, completed_upper_secondary_educ to PC11, life_expectancy_birth to PC18, gov_health_expenditure to PC20, GNI_capita to PC21 and scientific_publications to PC23.

Final conclusions on the meaning of each PC:

PC1 – Technological, economic and social capital of nations. THIS is the main topic of this project, and PC1 captures most of it.

Important variables for PC1 and their loadings:

—Scientific/technological:

Innovation score, .29

Researchers in R&D, .27

RD expenditures, .26

High-Tech manufacturing, .24

—Social advancement:

Life expectancy, .27

Completion of secondary education, .22

Internet use, .27

Access to electricity, .20

—Individual wealth:

GNI/capita, .29

GDP/worker, .28

Meaning of PC1: indicates degree of economic, scientific, technological, and societal access. Countries-years with high PC1 have high GDP per capita, excellent education, high level of innovation, and high scientific productivity and technological impact. Examples are western European, north American and Asean countries.

PC2 – Contrast between Size of the economy and scientific/technological output.

All of the important variables for PC2 have negative loadings, decreasing PC2.

Scientific publications, -.49

Patent applications, -.46

GDP, -.49

Adjusted national income, -.45

Relatively minor variables that increase PC2 are:

Internet use, and Unemployed with advanced education.

Meaning of PC2: contrasts countries-years according to economic size and scientific/technological output. For example, distinguishes Switzerland from USA, both with high technological output, but of drastically different economic size. Large negative PC2 identifies the largest economies with the leading technologies, like USA and China.

PC3 – Contrast between Social promotion and technological promotion.

Important variables are:

Access to electricity, .49

Unemployed with advanced education, .45

Life expectancy, .28

Internet use, .23

Government expenditure in education, -.32

Foreign direct investment, -.21

RD investment, -.17

Meaning of PC3: contrasts countries-years with strong social polices (positive loadings) from those with strong promotion for technology (negative loadings)

PC4 – Contrast between Technical services and high-energy production

Important variables are:

ICT service exports, -.66

Unemployed with advanced education, -.47

Government expenditure in education, -.23

Foreign direct investment, .23

Power consumption, .22

Meaning of PC4: captures the difference between countries-years that export much information technologies (negative loadings) from those that invest in heavy industry (positive loadings). Examples are eastern European, Russia, and some east Asian countries.

PC5 – Direct foreign investment versus manufacturing

Important variables are:

Foreign direct investment, .71

Unemployed with advanced education, -.45

High-tech exports, -.35

Meaning of PC5: captures the emphasis between countries-years of strong foreign direct investment (positive) against those focused on exports of high-tech products (negative). Examples exist in some of the smaller economies and developing countries like Brazil and Taiwan.

h. TESTING the PCA model with the testing set

Now let’s transform the testing set using the predict() function with the same model training_11_24_pca derived from the training set, and project the testing set on the same PCs.

training_11_24_5 <- as.data.frame(training_11_24_pca$x[ , 1:5])   # subset of training scores on 5 PCs
pred_testing_11_24 <-  as.data.frame(predict(training_11_24_pca, newdata=testing_11_24_scaled))   # predict scores from the testing set
pred_testing_11_24_5 <-  as.data.frame(pred_testing_11_24[ , 1:5]) # subset of testing scores on 5 PCs

Now let’s compare PC1 and PC2 for both training model and testing predictions.

training_11_24_5$id <- "Training"
pred_testing_11_24_5$id <- "Testing"
bound_sets <- rbind(training_11_24_5, pred_testing_11_24_5)
ggplot(bound_sets, aes(x = PC1, y = PC2, color = id)) + geom_point(alpha = 0.5) + labs(title = "Comparing Training and Testing Structures")  + scale_color_brewer(palette = "Dark2")

Conclusion

The testing data (in PC1-PC2 space) shows perfect alignment with the training data, which visually indicates that the PC model is generalizable to data not seen before in training.

i. Further Statistical Analysis

USING the 5 Principal Component model to predict innovation with linear regression

Having qualitatively verified the quality of the reduced (5 Principal Component) model, we will proceed to use it to predict innovation_score with multilinear regression and get a statistical measure of goodness of fit.

training_scores$code <- training_11_24$code                                     #restore code as separate var
training_scores$year <- training_11_24$year                                     #restore year as separate var
training_11_24_scaled <- as.data.frame(training_11_24_scaled)                   #make training_11_24_scaled a dataframe
training_scores$innovation_score <- training_11_24_scaled$innovation_score      #restore innovation score as separate var
training_scores$happiness_score <- training_11_24_scaled$happiness_score        #restore happiness score as separate var

training_11_24_regress <- lm(innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5,    #linear regression
            data = training_scores)

summary(training_11_24_regress)

## 
## Call:
## lm(formula = innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5, 
##     data = training_scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2011 -0.2586  0.0444  0.2911  1.3373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.007e-16  1.035e-02   0.000   1.0000    
## PC1          2.945e-01  3.361e-03  87.621  < 2e-16 ***
## PC2          1.059e-02  5.631e-03   1.881   0.0601 .  
## PC3         -4.415e-02  8.598e-03  -5.135 3.16e-07 ***
## PC4         -4.405e-02  9.633e-03  -4.573 5.18e-06 ***
## PC5         -5.663e-03  1.003e-02  -0.565   0.5723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4154 on 1606 degrees of freedom
## Multiple R-squared:  0.828,  Adjusted R-squared:  0.8274 
## F-statistic:  1546 on 5 and 1606 DF,  p-value: < 2.2e-16

Conclusion

The linear model on the PC1-PC5 space is statistically significant as its p-value is close to 0 and the Adjusted R-squared is 82.7% and F-statistic is large.

The coefficients for PC1, PC3, and PC4 are statistically significant with p-value well under .05. The coefficient for PC2 is borderline, within 94% confidence. The one for PC5 is not significant but PC5 contribution to innovation_score is an order of magnitude smaller than the other PCs. PC1 and PC2 increase innovation_score, while PC3-PC5 decrease it. An interpretation of this is given below, after specifically looking at the role of happiness_score.

ADDING happiness to the 5 Principal Component model to predict innovation

We will answer the initial question in this project whether happiness has anything to do with innovation. To do so, we’ll add specifically happiness_score as independent variable in the model and see what it does to the regression. I note that the happiness_score variable is almost orthogonal to PC1-PC5, contributing a fraction of 7% to PC1 (obtained from the loadings-squared matrix), 1% to PC2, 0% to PC3, 1% to PC4 and 3% to PC5.

training_11_24_regress_happy <- 
  lm (innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5 + happiness_score,          #linear regression
    data = training_scores)

summary(training_11_24_regress_happy)

## 
## Call:
## lm(formula = innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     happiness_score, data = training_scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1841 -0.2546  0.0382  0.2766  1.3769 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.055e-16  1.024e-02   0.000 1.000000    
## PC1              3.271e-01  6.386e-03  51.225  < 2e-16 ***
## PC2              2.430e-02  6.024e-03   4.034 5.74e-05 ***
## PC3             -3.918e-02  8.546e-03  -4.585 4.90e-06 ***
## PC4             -3.232e-02  9.731e-03  -3.321 0.000917 ***
## PC5             -2.633e-02  1.051e-02  -2.506 0.012308 *  
## happiness_score -1.229e-01  2.056e-02  -5.980 2.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.411 on 1605 degrees of freedom
## Multiple R-squared:  0.8317, Adjusted R-squared:  0.8311 
## F-statistic:  1322 on 6 and 1605 DF,  p-value: < 2.2e-16

Conclusion

The linear model with 5 PCs plus happiness_score is statistically significant with p-value close to zero and Adjusted R-Square of 83.1%, slightly larger than the 5 PC model. Importantly, all of the regression coefficients are statistically significant, with p-values well under .05 in most cases, except PC5. PC1 and PC2 increase innovation_score, with PC1 having the largest contribution to innovation. PC3-PC5 and happiness_score decrease innovation score.

REMOVING innovation score from PCA to leave it as an outcome variable for regression

Method 2: we will briefly re-run the PCA but on a dataset that does not include innovation_score, so that we can leave it purely as an outcome variable for regression. We will do PCA on the other 22 variables (World Bank indicators and happiness_score).

training_11_24_scaled <- as.data.frame(training_11_24_scaled)
training_11_24_scaled_no_is <- training_11_24_scaled |> select (-innovation_score)   #remove innovation_score
training_11_24_no_is_pca <- prcomp(training_11_24_scaled_no_is, center = TRUE, scale. = TRUE)    # do PCA on 22 vars

plot(training_11_24_no_is_pca, type="lines")   # this is the Scree plot, with an apparent "shoulder" at PC 3 to 4.

summary (training_11_24_no_is_pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.9455 1.8381 1.20276 1.07349 1.03218 0.95466 0.92119
## Proportion of Variance 0.3943 0.1536 0.06576 0.05238 0.04843 0.04143 0.03857
## Cumulative Proportion  0.3943 0.5479 0.61369 0.66607 0.71449 0.75592 0.79449
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.9095 0.86754 0.73021 0.67039 0.62169 0.58140 0.52938
## Proportion of Variance 0.0376 0.03421 0.02424 0.02043 0.01757 0.01536 0.01274
## Cumulative Proportion  0.8321 0.86631 0.89054 0.91097 0.92854 0.94390 0.95664
##                           PC15    PC16   PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.49826 0.44231 0.3981 0.34861 0.31022 0.25627 0.22560
## Proportion of Variance 0.01128 0.00889 0.0072 0.00552 0.00437 0.00299 0.00231
## Cumulative Proportion  0.96793 0.97682 0.9840 0.98955 0.99392 0.99691 0.99922
##                           PC22
## Standard deviation     0.13093
## Proportion of Variance 0.00078
## Cumulative Proportion  1.00000

fviz_pca_biplot(
  training_11_24_no_is_pca,
  geom.ind = "point", col.ind = "orange", label = "var", labelsize = 3,  arrowsize = 0.7,
  col.var = "contrib", gradient.cols = c("blue", "red"), palette = "Dark2", repel = TRUE,
  title = "Country/year scores in PCA Space")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The results for this PCA with 22 variables in comparison with the previous PCA with 23 variables is very similar both in the loadings or rotation matrix (magnitude and directions of variable vectors) and scores (points for countries-years in the PC1-PC2 space). Spree curve and Kaiser criterion indicate that PC1:PC5 capture 71% of the variance and are sufficient for a reduced model. The conclusion is that the previous analysis of the meaning of the PCs still applies for this PCA.

Now let’s add the columns for innovation_score and happiness_score to the new scores, and perform a linear regression on the reduced model (PC1-PC5 plus happiness_score) to predict innovation_score.

training_scores_no_is <- as.data.frame(training_11_24_no_is_pca$x)              #gather the scores from the new PCA
copy_cat  <-  c("innovation_score","happiness_score")                           # 2 cols from training_11_24_scaled
training_scores_no_is[, copy_cat] <- training_11_24_scaled[, copy_cat]          #add them to new training_scores_no_is

Now re-do the linear regression with the independent variables PC1-PC5 derived from the new PCA, plus happiness_score.

training_11_24_regress_happy_2 <- 
  lm (innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5 + happiness_score,          #linear regression
    data = training_scores_no_is)

summary(training_11_24_regress_happy_2)

## 
## Call:
## lm(formula = innovation_score ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     happiness_score, data = training_scores_no_is)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32721 -0.27937  0.04466  0.31268  1.48296 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.542e-16  1.135e-02   0.000  1.00000    
## PC1              3.220e-01  7.464e-03  43.140  < 2e-16 ***
## PC2              1.993e-02  6.702e-03   2.974  0.00299 ** 
## PC3             -3.891e-02  9.475e-03  -4.107 4.22e-05 ***
## PC4             -3.235e-02  1.078e-02  -3.001  0.00273 ** 
## PC5             -1.755e-02  1.167e-02  -1.504  0.13270    
## happiness_score -7.487e-02  2.298e-02  -3.258  0.00114 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4558 on 1605 degrees of freedom
## Multiple R-squared:  0.793,  Adjusted R-squared:  0.7922 
## F-statistic:  1025 on 6 and 1605 DF,  p-value: < 2.2e-16

Conclusion The new model with the set of 5 PCs derived from a dataset that did not include innovation_score performs well, with an adjusted R-square of 79%, lower than the previous model (83%), but acceptable. The p-value for the new model is close to zero and F-statistic is large. The PC1-PC4 and happiness_score coefficients are below .05, and are statistically significant. PC1 and PC2 are positive, and PC3-PC5 are negative, as in the previous model.

j. Discussion and General Conclusions

We started with a general question of what socioeconomic and scientific-technological variables have influenced the innovation of countries over the period 2011-2024. A special question has been whether degree of happiness of nations influence their level of innovation.

We have carried out Principal Component Analysis for 23 variables and have identified 5 principal components as sufficient to statistically account for 72% of the variance. We then interpreted the practical meaning of each of the 5 PCs in the context of the data. The 5 orthogonal vectors PC1-PC5 plus happiness_score have then been used in a multilinear regression model to predict innovation_score resulting in an excellent goodness of fit (Adj R-squared= 83%) and statistically significant coefficients. We also did PCA on 22 variables, removing innovation_score to leave it as a pure dependent variable for linear regression. The reduced model of the new PC1-PC5 plus happiness_score still has acceptable goodness of fit (Adj R-squared= 79%) and statistically significant coefficients. The signs of the regression coefficients are very important and they led us to the following general conclusions.

Since PC1 represents the scientific, technological and societal strengths of countries, the positive coefficient means that countries with more of those assets in good balance will be more innovative in general. PC2, on the other hand, distinguishes large rich countries (like USA) from small rich ones (like Switzerland). PC2 is more negative for large countries than small ones, thus the positive regression coefficient says that small rich countries tend to be more innovative than large rich countries. PC3 distinguishes emphasis on societal investment (positive) versus technological investment (negative). The negative regression coefficient says that countries that emphasize technological investments tend to be more innovative than those that invest heavily in social gains. PC4 distinguishes emphasis on ICT exports (negative) versus heavy industrial development (positive). Thus the negative regression coefficient says that countries that emphasize ICT exports versus heavy industry are more innovative. PC5 distinguishes emphasis on direct foreign investment (positive) versus high-tech exports (negative), thus the negative regression coefficient says that countries that are more focused on high-tech exports than in attracting foreign investment tend to be more innovative.

This all makes sense and agrees with our perception. How about happiness_score? The non-negligible regression coefficient and the low p-value says that it is an important and statistically significant variable. But why is the regression coefficient negative? My initial EDA showed that there is an overall positive correlation between innovation and happiness. However, the model appears to say that countries that increase their happiness tend to become less innovative!

Rationalization: overall, of course, countries that are powerful and invest heavily in industrial and societal development, tend to be happy and also innovative, as shown by the innovation dependence on PC1-PC5, which are the variables that contribute the most to innovation_score. HOWEVER, countries that are too happy and satisfied with their status quo tend to relax and become less productive, scientifically and technologically speaking. Thus, on one extreme, social unhapinness leads to low productivity, lack of motivation and consequently, poor innovation potential. On the other extreme, a large degree of happiness leads to contempt and lack of competitive goals. But there is a sweet spot. It does seem that for countries to become highly innovative, not only do they need the scientific, economic and societal capabilities, but they also need international competition, some level of struggle and anxiety, and thus a little bit of unhapinness.

k. Obstacles

The main challenge was to assemble the database in order to contain the relevant information I needed. I had to resort to three sources, World Bank, WIPO (World Intellectual Property Office) and Gallup World Poll. A major challenge was the NA information once the three sources were bound, and had to resort to significant cleaning tasks. The modeling itself demanded much time studying from several data books (provided in the class references) on how to carry out and interpret PCA, and then developing a good understanding of the meaning of the principal components in my data. The actual coding was standard. With more time I would do more graphics.

l. Future steps

I ran out of time to interpret the implications of the 5-PC model. I think there are rich conclusions to be derived not only on the subject of this report, innovation against happiness, but beyond it. I could examine specific countries and groups of countries classified in various ways. An important aspect that could be explored is to compare the predicted innovation_score from the last regression analysis done, with the actual innovation_score (GII from WIPO) for all countries. Then to calculate something I would call “innovation potential”= predicted-actual. Predictions are based on socioeconomic and technical capabilities, so countries with positive “innovation potential” have headspace to grow (underperformed), while countries with negative “innovation potential” have exceeded their capacity (overperformed). I wonder which are overperformers and which are underperformers, how have they progressed over the years, and why. Next time. This type of analysis could help strategists involved with promoting international scientific and technological advancement.

m. References

. Class notes

. H. Wickham, et al., R for Data Science, 2nd Ed., https://r4ds.hadley.nz/ (2023)

. WIPO,Global Innovation Index, Nov. 2025 Report, (https://www.wipo.int/en/web/global-innovation-index), Data Explorer (https://www.wipo.int/gii-ranking/en/), accessed Nov. 2025

. Nilsson, AH, et al., Scientific Reports 14, Feb. 1, 2024, The Cantril Ladder elicits thoughts about power and wealth (https://www.nature.com/articles/s41598-024-52939-y), accessed Nov. 2025

. What is Cantril’s Ladder?, Scioto Analysis, Feb. 9, 2024 (https://www.sciotoanalysis.com/news/2024/2/9/what-is-cantrils-ladder), accessed Nov. 2025

. Country codes and world classification (https://unstats.un.org/unsd/methodology/m49/overview/), accessed in Dec. 2025.

n. Published

https://rpubs.com/rmiranda/1380294

FINAL PROJECT - Finding Happiness in Innovation

Raul Miranda

2025-12-15

Title: Finding Happiness in Innovation

a. Introduction. Variable Definitions.

b. Source databases.

c. Dataset preparation

d. Dataset cleaning

e. Exploratory data analysis

f. Preparation for PCA

g. PCA of the training set and Interpretation

h. TESTING the PCA model with the testing set

i. Further Statistical Analysis

j. Discussion and General Conclusions

k. Obstacles

l. Future steps

m. References

n. Published