Title: High-Tech Potential of Countries

a. Introduction

The primary question in this study is: what makes countries the most productive in high tech industries? In essence, we would like to identify the countries that are in the upper quantile of countries that achieved high productivity in high-tech in the 2000-2024 period. Those that did are labeled High and the rest are labeled Low. Since status may change over time, we will find some countries oscillating between High and Low.

In order to answer the question, let’s define productivity in terms of measurable indicators found in the databases provided. After some considerations, I arrived at the set of indicators shown below, and that are accessible in World Bank Indicators. They include (1) direct measures of national productivity: High-technology exports, ICT goods exports, ICT service exports and Scientific and technical journal articles; (2) measures of national resources: Access to electricity, Internet users, and Secure internet servers; and (3) measures of research promotion: Research and development expenditure, and Researchers in R&D.

World Bank lacked a single indicator that quantitatively evaluates overall performance, specifically in high-tech. “High-tech” includes advanced technologies derived from scientific achievements in medicine, industry, information, space, etc. I found a couple of good indicators, but was particularly impressed by the United Nations Frontier Technology Readiness Index (see reference below). A body of experts have tracked and maintained this performance index since 2010 until 2023 most recently and publishes databases and perspectives. The Overall performance index for countries is a number between 0 and 1. Its statistical distribution shows four quantiles, with the upper (4Q) in the range of 0.70 and 1.

I used that FTRI index to label a categorical variable I called tech_readiness (Technological Readiness), which acquires two values: Low for 0.0 \(\le\) FTRI < 0.7, and High for 0.7 \(\le\) FTRI \(\le\) 1.0.

In this project I use Logistic Regression to classify countries as Low or High for a given year over the 2000-2024 period. The predictor variables are the World Bank indicators, and the outcome variable is the Technological Readiness label derived from the FTRI index. Since the FTRI index is lacking before 2010, I developed a model based on the 2010 to 2023 dataset, tested its statistical accuracy, and used it to predict the 2000-2009 Technological Readines of countries, based on the reported technological indices given by the World Bank.

I focused mainly on the logistic regression modeling task, and didn’t have time to go into the most useful details of studying which economic indicators characterize the higher performers.

b. Source databases. Variable definition

Downloaded to my project folder a compiled set of World Bank Indicators from http://data.worldbank.org/, including the indicators mentioned below. Name of the dataset in my folder: worlbankdataset.csv.

In addition, from the UN Frontier Technology Readiness Index (https://unctadstat.unctad.org/datacentre/) downloaded https://unctadstat.unctad.org/datacentre/dataviewer/US.FTRI. Name of the dataset in my folder: FTRI_table.csv.

The original table had the M-49 country codes (used by the UN) but lacked the ISO-3 country codes (used by the World Bank), so I used a conversion table from https://unstats.un.org/unsd/methodology/m49/overview/.

Datasets are tabulated by country_name, country_code and year. The original bound data set contains about 5400 records in the 2000-2024 period (long format), but not all countries reported complete information.

Outcome variable (categorical): tech_readiness (high, low)

From https://unctadstat.unctad.org/datacentre/reportInfo/US.FTRI – I defined the indicator Technological Readiness based on stats of FTRI Overall Index: 1Q-4Q: 0-0.3-0.5-0.7-1.0. High tech_readiness corresponds to FTRI Overall Index \(\ge 0.70.\)

From http://data.worldbank.org/, predictor variables:

Access to electricity (% of population) - quantitative

Internet users (% of population) - quantitative

Research and development expenditure (% of GDP) - quantitative

Researchers in R&D (per million people) - quantitative

Scientific and technical journal articles - quantitative

High-technology exports (% of manufactured exports) - quantitative

Secure Internet servers - quantitative

ICT goods exports (% of total goods exports) - quantitative

ICT service exports (% of service exports, BoP) - quantitative

c. Dataset preparation and cleaning

Read datasets and bind them. Examine the top rows to take a peek at its variable structure.

setwd(getwd()) # ensure we're at the directory where the RMD is running
worldbank <- read_csv("worldbankdataset.csv", show_col_types = FALSE)
ftri <- read_csv("FTRI_table.csv", show_col_types = FALSE)

combined_set <- left_join(worldbank, ftri, by= c("country_code" = "code", "year"="year"))

head (combined_set, 3)

## # A tibble: 3 × 18
##   country_name country_code  year access_electricity high_tech_exports
##   <chr>        <chr>        <dbl>              <dbl>             <dbl>
## 1 Afghanistan  AFG           2000                4.4                 0
## 2 Afghanistan  AFG           2001                9.3                 0
## 3 Afghanistan  AFG           2002               14.1                 0
## # ℹ 13 more variables: ict_goods_exports <dbl>, ict_service_exports <dbl>,
## #   internet_users <dbl>, rd_expenditures_gdp <dbl>, researchers_millpop <dbl>,
## #   scientific_articles <dbl>, secure_servers <dbl>, access_finance_idx <dbl>,
## #   ict_idx <dbl>, industry_activity_idx <dbl>, research_development_idx <dbl>,
## #   skills_idx <dbl>, overall_index <dbl>

str(combined_set)

## spc_tbl_ [5,425 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ country_name            : chr [1:5425] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ country_code            : chr [1:5425] "AFG" "AFG" "AFG" "AFG" ...
##  $ year                    : num [1:5425] 2000 2001 2002 2003 2004 ...
##  $ access_electricity      : num [1:5425] 4.4 9.3 14.1 19 23.8 28.7 33.5 38.4 42.4 48.3 ...
##  $ high_tech_exports       : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ict_goods_exports       : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ict_service_exports     : num [1:5425] 0 0 0 0 0 ...
##  $ internet_users          : num [1:5425] 0 0.00472 0.00456 0.08789 0.10581 ...
##  $ rd_expenditures_gdp     : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ researchers_millpop     : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ scientific_articles     : num [1:5425] 4 1 5.5 7.34 6.75 ...
##  $ secure_servers          : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
##  $ access_finance_idx      : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ ict_idx                 : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ industry_activity_idx   : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ research_development_idx: num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ skills_idx              : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  $ overall_index           : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   country_name = col_character(),
##   ..   country_code = col_character(),
##   ..   year = col_double(),
##   ..   access_electricity = col_double(),
##   ..   high_tech_exports = col_double(),
##   ..   ict_goods_exports = col_double(),
##   ..   ict_service_exports = col_double(),
##   ..   internet_users = col_double(),
##   ..   rd_expenditures_gdp = col_double(),
##   ..   researchers_millpop = col_double(),
##   ..   scientific_articles = col_double(),
##   ..   secure_servers = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Get the basic stats of the FTRI overall_index

summary(combined_set$overall_index)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.3211  0.4759  0.5113  0.6984  1.0000    3007

Define the outcome variable tech_readiness (high, low) based on overall_index \(\ge 0.7\) for “high”. Then make it a factor for glm to work, and relevel the ref to “low”.

combined_set <- combined_set|> mutate (tech_readiness= if_else(overall_index >= 0.7, "high", "low"))
combined_set$tech_readiness <- as.factor(combined_set$tech_readiness)
combined_set$tech_readiness <- relevel(combined_set$tech_readiness, ref= "low")
summary (combined_set)

##  country_name       country_code            year      access_electricity
##  Length:5425        Length:5425        Min.   :2000   Min.   :  0.00    
##  Class :character   Class :character   1st Qu.:2006   1st Qu.: 57.00    
##  Mode  :character   Mode  :character   Median :2012   Median : 98.60    
##                                        Mean   :2012   Mean   : 77.16    
##                                        3rd Qu.:2018   3rd Qu.:100.00    
##                                        Max.   :2024   Max.   :100.00    
##                                                                         
##  high_tech_exports ict_goods_exports ict_service_exports internet_users  
##  Min.   : 0.000    Min.   : 0.000    Min.   : 0.00000    Min.   :  0.00  
##  1st Qu.: 0.000    1st Qu.: 0.000    1st Qu.: 0.06625    1st Qu.:  2.80  
##  Median : 0.000    Median : 0.110    Median : 3.22588    Median : 24.44  
##  Mean   : 4.933    Mean   : 2.864    Mean   : 5.99365    Mean   : 34.44  
##  3rd Qu.: 5.677    3rd Qu.: 1.590    3rd Qu.: 7.90453    3rd Qu.: 64.50  
##  Max.   :95.618    Max.   :63.640    Max.   :64.80595    Max.   :100.00  
##                                                                          
##  rd_expenditures_gdp researchers_millpop scientific_articles secure_servers    
##  Min.   :0.0000      Min.   :    0.0     Min.   :     0.0    Min.   :       0  
##  1st Qu.:0.0000      1st Qu.:    0.0     1st Qu.:     4.3    1st Qu.:       0  
##  Median :0.0000      Median :    0.0     Median :    91.3    Median :      20  
##  Mean   :0.3715      Mean   :  665.6     Mean   :  8510.3    Mean   :  135938  
##  3rd Qu.:0.3542      3rd Qu.:  281.6     3rd Qu.:  1477.9    3rd Qu.:    1103  
##  Max.   :6.0192      Max.   :20047.0     Max.   :898949.1    Max.   :66850221  
##                                                                                
##  access_finance_idx    ict_idx       industry_activity_idx
##  Min.   :0.0000     Min.   :0.0000   Min.   :0.0000       
##  1st Qu.:0.7716     1st Qu.:0.3890   1st Qu.:0.3221       
##  Median :0.8386     Median :0.5836   Median :0.4919       
##  Mean   :0.8253     Mean   :0.5655   Mean   :0.4990       
##  3rd Qu.:0.8924     3rd Qu.:0.7471   3rd Qu.:0.6810       
##  Max.   :1.0000     Max.   :1.0000   Max.   :1.0000       
##  NA's   :3007       NA's   :3007     NA's   :3007         
##  research_development_idx   skills_idx     overall_index    tech_readiness
##  Min.   :0.0000           Min.   :0.0000   Min.   :0.0000   low :1817     
##  1st Qu.:0.1022           1st Qu.:0.2760   1st Qu.:0.3211   high: 601     
##  Median :0.2100           Median :0.4465   Median :0.4759   NA's:3007     
##  Mean   :0.2911           Mean   :0.4571   Mean   :0.5113                 
##  3rd Qu.:0.4653           3rd Qu.:0.6309   3rd Qu.:0.6984                 
##  Max.   :1.0000           Max.   :1.0000   Max.   :1.0000                 
##  NA's   :3007             NA's   :3007     NA's   :3007

Since the FTRI data is complete between 2010 and 2023, except for Syria and island states, we will reduce the dataset, to exclude years outside 2010-23 and country/years that don’t have tech_readiness score: Syria and small island states. After fitting a logistic regression model and testing its accuracy, we will impute the missing tech_readiness with values predicted by the regression model for the 2000-2009 period and 2024.

combined_reduced <- combined_set |>
      filter (year >= 2010 & year <= 2023 & !is.na(tech_readiness))     # the reduced set contains ca 2400 rows
#view(combined_reduced)   #remove comment to view set
colSums(is.na(combined_reduced))

##             country_name             country_code                     year 
##                        0                        0                        0 
##       access_electricity        high_tech_exports        ict_goods_exports 
##                        0                        0                        0 
##      ict_service_exports           internet_users      rd_expenditures_gdp 
##                        0                        0                        0 
##      researchers_millpop      scientific_articles           secure_servers 
##                        0                        0                        0 
##       access_finance_idx                  ict_idx    industry_activity_idx 
##                        0                        0                        0 
## research_development_idx               skills_idx            overall_index 
##                        0                        0                        0 
##           tech_readiness 
##                        0

d. Exploratory data analysis

We’ll look at the boxplots of the variables grouped by tech_readiness.

vars <- c("scientific_articles","access_electricity","high_tech_exports","ict_service_exports","ict_goods_exports","researchers_millpop","secure_servers","internet_users","rd_expenditures_gdp")

for (var in vars) {
  boxplot(combined_reduced[[var]] ~ combined_reduced$tech_readiness,
          main = paste(var),
          ylab = var,
          col = "lightblue",
          border = "darkblue"
          )
}

Conclusions

For most variables, there is clear distinction of means/medians between the two groups (low and high). Some of the variables appear normally distributed within each group, with a few outliers. Others show many significantly separated outliers. Those outliers actually fall in upper 70% of the distribution for each variable. So, they play an important role in separating high from low.

Variable rd_expenditures seems to be highly influential in separating the groups. Also important appear to be scientific_articles, high_tech_exports, and researchers_millpop. secure_servers appear to completely separate low from high, so it may cause instability in the regression.

To appropriately analyze this dataset that presents variable distributions not perfectly normal, an appropriate logistic method should be sought or at least compared with the outcome from glm() which assumes normal distribution.

However, in order to carry out this project within a reasonable time I will only use the glm() method learned in class. We will actually see whether glm() can produce a reliable and accurate model to explain this dataset.

e. Statistical analysis by logistic regression

After studying logistic regression, I learned that the best approach to develop a model is to train the model with a subset of the dataset (training set) and to test its accuracy with the remaining rows (the test set).

I will follow an approach described by Wickham using createDataPartition (from library caret), to split the data into a training set (80%) and a test set (20%). Base R has sample() but that function doesn’t balance the proportion of low/high in tech_readiness. createDataPartition preserves the proportion of each class (high, low) in both the training and test sets.

set.seed(123) #   set.seed generates a reproducible seed for the  random number generator used by createDataPartition.

trainset_idx <- createDataPartition(y = combined_reduced$tech_readiness, p = 0.8, list = FALSE)

combined_red_train <- combined_reduced[trainset_idx, ]    # this is the training set, with ca 1930 rows
combined_red_test <- combined_reduced[-trainset_idx, ]    # this is the test set, with ca 480 rows

Run with the training dataset

Now go for logistic regression with the training dataset with tech_readiness as output and 9 vars as predictors

model_train <- glm(tech_readiness ~ scientific_articles + access_electricity + 
                     high_tech_exports + ict_service_exports +  ict_goods_exports +  
                 researchers_millpop +  secure_servers + internet_users + rd_expenditures_gdp, 
                        family = binomial, data = combined_red_train)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model_train)

## 
## Call:
## glm(formula = tech_readiness ~ scientific_articles + access_electricity + 
##     high_tech_exports + ict_service_exports + ict_goods_exports + 
##     researchers_millpop + secure_servers + internet_users + rd_expenditures_gdp, 
##     family = binomial, data = combined_red_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -5.363e+01  1.552e+01  -3.456 0.000549 ***
## scientific_articles  1.428e-05  5.429e-06   2.631 0.008511 ** 
## access_electricity   4.969e-01  1.564e-01   3.177 0.001490 ** 
## high_tech_exports    6.288e-02  9.842e-03   6.389 1.67e-10 ***
## ict_service_exports -1.246e-02  1.164e-02  -1.070 0.284554    
## ict_goods_exports   -1.381e-02  1.441e-02  -0.959 0.337804    
## researchers_millpop  8.536e-04  1.116e-04   7.649 2.02e-14 ***
## secure_servers       2.878e-06  5.365e-07   5.365 8.09e-08 ***
## internet_users       1.032e-02  5.744e-03   1.797 0.072267 .  
## rd_expenditures_gdp  1.026e+00  2.166e-01   4.736 2.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2170.17  on 1934  degrees of freedom
## Residual deviance:  634.15  on 1925  degrees of freedom
## AIC: 654.15
## 
## Number of Fisher Scoring iterations: 12

Interpretation of the logistic regression and its accuracy

The deviances are measures of goodness of fit. Null d. is a measure of residuals with a model that is equal to the intercept term. Residual d. is a measure of residuals considering the intercept and all the variables. A large difference Null - Residual (2170-634) indicates a good fit as the model predicts the tech_readiness outcome with smaller deviance than just a null (intercept-only) model. The parenthesis about the dispersion parameter taken as 1 means that the model assumes normal distribution of the data and variance.

The regression coefficients show the change in the log odds of the tech_readiness for a unit change of each predictor variable. So, for example, a 1% increase of %people with access to electricity, increases the log odds of tech_readiness by roughly 0.5. While a 1% increase of %people that are internet_users, increases the log odds of tech_readiness by 0.01, substantially less than electricity_access. We can also see that a 1% of GDP increase in rd_expenditures increases the log odds of tech_readiness by roughly 1, a very significant effect.

The p-value of the intercept and 6 of the variables is well below 0.05, meaning that the calculated coefficients are statistically significant within 95% confidence. Four of them show p-value of almost 0, high_tech_exports, researchers_millpop, secure_servers, and rd_expenditures. While three of them show p-values above 0.05, ict_service_exports, ict_goods_exports and internet_users, meaning that their coefficients are not statistically significance with 95% confidence.

To calculate the p-value of the logistic model compared to the null model we can do :

with(model_train, null.deviance - deviance)

## [1] 1536.024

with(model_train, df.null - df.residual)

## [1] 9

with(model_train, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

## [1] 0

This again shows the difference in deviance between a null and a full model. 9 are the degrees of freedom. And the p-value is 0, showing that the model explains over 95% of the variance (deviances).

Now we will obtain the odds ratios by exponentiating the intercept and coefficients, and will get the confidence intervals (confint), and will show those side by side.

exp(coef(model_train))

##         (Intercept) scientific_articles  access_electricity   high_tech_exports 
##        5.128837e-24        1.000014e+00        1.643657e+00        1.064895e+00 
## ict_service_exports   ict_goods_exports researchers_millpop      secure_servers 
##        9.876184e-01        9.862872e-01        1.000854e+00        1.000003e+00 
##      internet_users rd_expenditures_gdp 
##        1.010378e+00        2.789784e+00

exp(cbind(OR = coef(model_train), confint(model_train)))

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                               OR        2.5 %       97.5 %
## (Intercept)         5.128837e-24 1.663066e-39 2.953516e-13
## scientific_articles 1.000014e+00 1.000004e+00 1.000025e+00
## access_electricity  1.643657e+00 1.279432e+00 2.353365e+00
## high_tech_exports   1.064895e+00 1.044657e+00 1.085891e+00
## ict_service_exports 9.876184e-01 9.648224e-01 1.009891e+00
## ict_goods_exports   9.862872e-01 9.591027e-01 1.015126e+00
## researchers_millpop 1.000854e+00 1.000633e+00 1.001073e+00
## secure_servers      1.000003e+00 1.000002e+00 1.000004e+00
## internet_users      1.010378e+00 9.992409e-01 1.022083e+00
## rd_expenditures_gdp 2.789784e+00 1.886414e+00 4.464174e+00

The Warnings show that in a number of cases the prediction is exactly 0 or 1, which is a sign of complete separation or quasi-complete separation, ie one or more predictor variables correlating exactly with the tech_readiness outcome. We already saw this potential problem with secure_servers in the EDA section. However, the p-values lend confidence in the value of coefficients for most (except 3) variables. And moreover, the model converged so I will continue to explore it.

Reading about this problem, it seems that a possible approach is to reduce the number of variables, or to scale and center the variables. If I have time, I will later try reducing by eliminating the ones with the highest p-value coefficients, namely ict_service exports and ict_goods_exports, since overlap to some degree with high_tech_exports, and secure_servers.

Before doing that, though, let’s get to calculate the confusion table and the ROC curve with the full 9-variable model.

Using the model (model_train) predict probabilities and the the predicted technological readiness (pred_tr) by using a cutoff of the sigmoid curve of 0.4, above which is labeled “high” and below which is labeled “low”.

pred_prob <- predict(model_train, type = "response")
pred_tr <- ifelse(pred_prob >= 0.4, "high", "low") 

compare_thetwo <- data.frame(ActualTR = combined_red_train$tech_readiness, PredictedTR = pred_tr)

Create the confusion table of tech_readiness, with Actual values in the rows, and Predicted values in the columns.

summ_compare <- table(compare_thetwo$ActualTR, compare_thetwo$PredictedTR)
summ_compare

##       
##        high  low
##   low    57 1397
##   high  423   58

Now do the accuracy eval; with ‘positive’=‘high’, ‘negative’=‘low’.

TP <- 423; TN <- 1397; FP <- 57; FN <- 58

Accuracy <- (TP + TN) / (TP + TN + FP + FN)
Precision <- TP / (TP + FP)
Sensitivity <- TP / (TP + FN)  # Recall
Specificity <- TN / (TN + FP)

F1_Score <- 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
cat (" Accuracy: ",Accuracy,"\n","Precision: ",Precision,"\n", "Recall/sensitivity: ", Sensitivity, "\n", "Specificity: ", Specificity, "\n","F1_Score: ", F1_Score)

##  Accuracy:  0.9405685 
##  Precision:  0.88125 
##  Recall/sensitivity:  0.8794179 
##  Specificity:  0.9607978 
##  F1_Score:  0.880333

These numbers confirm the reliability of the full model. The F1_score close to 1 indicates good balance of precision and recall.

DO the ROC curve and AUC:

library(pROC)

roc_obj <- roc(response = combined_red_train$tech_readiness,
                predictor = pred_prob,
                direction = "<")  # smaller prob = low

## Setting levels: control = low, case = high

# Print AUC value
auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.9738

With an AUC close to 1 (and far from 0.5) the model shows high accuracy in classifying highs and lows.

Now plot the ROC with AUC displayed

plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

Conclusion

Model model_train is accurate and precise and explains the deviances of the training set well. To further assess the accuracy of this model, now use the test data subset.

Run with the Testing Set

Now further test the accuracy of model_train using the test dataset. Predict probs from the test set we called combined_red_test, and then set the threshold in the sigmoid as before, to 0.4. (“high”: \(\le 0.4\))

pred_prob <- predict(model_train, newdata= combined_red_test, type = "response")
pred_tr <- ifelse(pred_prob >= 0.4, "high", "low") 

compare_thetwo <- data.frame(ActualTR = combined_red_test$tech_readiness, PredictedTR = pred_tr)

Create the confusion matrix for tech_readiness from the predictions obtained with test set.

summ_compare <- table(compare_thetwo$ActualTR, compare_thetwo$PredictedTR)
summ_compare

##       
##        high low
##   low    10 353
##   high  102  18

Now do the accuracy eval; with ‘positive’=‘high’, ‘negative’=‘low’.

TP <- 102; TN <- 353; FP <- 10; FN <- 18

Accuracy <- (TP + TN) / (TP + TN + FP + FN)
Precision <- TP / (TP + FP)
Sensitivity <- TP / (TP + FN)  # Recall
Specificity <- TN / (TN + FP)

F1_Score <- 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
cat (" Accuracy: ",Accuracy,"\n","Precision: ",Precision,"\n", "Recall/sensitivity: ", Sensitivity, "\n", "Specificity: ", Specificity, "\n","F1_Score: ", F1_Score)

##  Accuracy:  0.942029 
##  Precision:  0.9107143 
##  Recall/sensitivity:  0.85 
##  Specificity:  0.9724518 
##  F1_Score:  0.8793103

These numbers again confirm the reliability of the full model. The F1_score close to 1 indicates good balance of precision and recall.

Now do the ROC curve and AUC.

library(pROC)

roc_obj <- roc(response = combined_red_test$tech_readiness,
                predictor = pred_prob,
               direction = "<")  # smaller prob = low

## Setting levels: control = low, case = high

# Print AUC value
auc_val <- auc(roc_obj); auc_val

## Area under the curve: 0.9634

With an AUC close to 1 (and far from 0.5) the model shows high accuracy in classifying highs and lows also in the test set.
Plot ROC with AUC displayed

plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

Conclusion

Model model_train displays high accuracy, specificity and sensitivity, with the test dataset, equally as well as with the training set.

Backfill the missing Technical Readiness values

Since model_train accuracy is equally high, above 94% for both test and train sets, let’s go ahead and use the model_train to impute now all of the missing technical readiness levels in the FTR index data table.

We will use the original World Bank data for the other columns to predict (and back fill) the Tech_readiness since 2000, but only for the missing values of tech_readiness in the combined_set. We will not override the existing true values with the predicted ones.

In addition, I noticed that many countries including many of the developed ones, reported scarce statistics for 2024, so, rather than dropping 2024, I will also impute the 2024 variable values reported as zero, with the 2023 values, since most countries will not change their technological readiness status in a year. Notice that if 2023 value is zero, 2024 will also be zero, so 2024 will not be inflated.

Impute the 2024 values of the variables in var_list (when they are zero) with the 2023 values:

var_list <- c("scientific_articles", "access_electricity", "high_tech_exports", "ict_service_exports", "ict_goods_exports", "researchers_millpop", "secure_servers", "internet_users", "rd_expenditures_gdp")

impute_var <- function(var,year,targety) {
  case_when (
    year == targety & var == 0 ~ lag(var, n=1), TRUE ~var) }      # use  lag() to get value of previous year
                                                                # leave var untouched if not zero

combined_set <- combined_set |> group_by(country_code) |>
mutate(across(all_of(var_list), ~impute_var(.x,year,2023)))  |>    # impute 2023 when zero
mutate(across(all_of(var_list), ~impute_var(.x,year,2024)))  |>    # impute 2024 when zero
ungroup()

Now predict the probabilities for the full combined_set (5400 rows) using the successful model_train:

combined_set$predprob <- predict(model_train, newdata= combined_set, type= "response")

# And then determine pred_tech_readiness this way: When tech_readiness is available in combined_set, pred_tech_readiness = tech_readiness (the true value).  When tech_readiness is NA, pred_tech_readiness = high or low (depending on whether predprob (probability) is above or below threshold of 0.4). 

combined_set$pred_tech_readiness <- if_else(is.na(combined_set$tech_readiness), if_else(combined_set$predprob >=.4, "high","low"), combined_set$tech_readiness)

Check how the basic statistics have evolved between “original” set (2010-2023) and “predicted” set (2000-2024). The ratio High/Low is higher for the 2010-2023 period than for 2000-2024 because some countries became more productive and entered the “high” status in the latter decade compared to the early 2000s.

combined_set |>
  summarize( Orig_high = sum(tech_readiness == "high", na.rm = TRUE), Orig_low = sum(tech_readiness == "low", na.rm = TRUE),
             OrigHiLowratio = Orig_high/Orig_low,
             Pred_high = sum(pred_tech_readiness == "high", na.rm = TRUE), Pred_low = sum(pred_tech_readiness == "low", na.rm = TRUE),
             PredHiLowratio = Pred_high/Pred_low ) |> print()

## # A tibble: 1 × 6
##   Orig_high Orig_low OrigHiLowratio Pred_high Pred_low PredHiLowratio
##       <int>    <int>          <dbl>     <int>    <int>          <dbl>
## 1       601     1817          0.331       912     4513          0.202

Time-evolution of countries and indicators

Show the evolution over time: in the aggregate depict the yearly progress of country evolving from Low to High, and of a couple of crucial independent indicators: R&D Expenditures and High-Tech Exports.

combined_set_agg <- combined_set |>
  group_by(year, pred_tech_readiness) |>
  summarize(`R&D Expenditures` = mean(rd_expenditures_gdp), `High-Tech Exports` = mean(high_tech_exports), Countries = n(), .groups = 'drop')


ggplot(combined_set_agg, aes(x = year)) + 
  geom_point(aes(y = Countries, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
  labs(
    title = "World Evolution towards High-Tech in 2000-2024",
    x = "Year", y = "Countries",
    color = "Technological Readiness",
    shape = "Technological Readiness"
  ) +
  theme_minimal()

ggplot(combined_set_agg, aes(x = year)) + 
  geom_point(aes(y = `High-Tech Exports`, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
  labs(
    title = "World Evolution towards High-Tech in 2000-2024",
    x = "Year", y = "High-Tech Exports (% all exports)",
    color = "Technological Readiness",
    shape = "Technological Readiness"
  ) +
  theme_minimal()

ggplot(combined_set_agg, aes(x = year)) +
  geom_point(aes(y = `R&D Expenditures`, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
  labs(
    title = "World Evolution towards High-Tech in 2000-2024",
    x = "Year", y = "R&D Expenditures (mean %GDP)",
    color = "Technological Readiness",
    shape = "Technological Readiness"
  ) +
  theme_minimal()

f. Overall Conclusions

A logistic regression model to classify countries into high and low in terms of technological readiness was developed on the basis of economic indicators available in World Bank, and the FTRI technology readiness index available at the United Nations databases. The model was trained with a subset of the data and further tested for accuracy with another subset.

The logistic regression model is accurate and is able to predict technical readiness on the basis of the variables selected. The classification into high and low technical readiness allows an analysis of which factors are most relevant to promote countries to evolve from low to high. Two very important factors are R&D expenditures and high-technology exports. Surprisingly, the factors that are least significant in the model are ICT (information and communication technology) goods and services, despite their relevance to high-tech. Other factors have medium-level significance: number of researchers, scientific_articles, access to electricity, number of secure_servers, and percent of internet_users in the population.

The time evolution of the countries from one class to the other (low to high) shows that it is very difficult to “close the gap”. Two of the indicators display that clearly. The R&D Expenditure gap is about 1.5% GDP and it has decreased over the past decade because the “high” countries are spending less. In parallel, the High-Tech Exports gap is 15% of manufactured exports. The gap is widening in the last decade as “high countries” export more and “low countries” export less.

An important outcome from this project was to be able to predict Technological Readiness for 2000-2024, beyond the range covered by the FTRI index from 2010-2023.

g. Obstacles

Assembling the database from two different sources was challenging, as the process generated NAs for years not covered by one of the other source. The modeling itself demanded studying some case studies of applications of logistical regression, as interpretation is more difficult than with linear regression. We learned about the occurrence of complete separation, which needs to be dealt with in various ways. The regression generated a few warnings of complete separation, but at the end we showed that the model was highly accurate and didn’t need special treatment for the model to converge.

h. Future steps

The glm() model appears to be accurate and reliable after examining the outcome and statistical significance of coefficients and the model. However, the warnings of complete or quasi complete separation indicates that further work is needed to ensure the reliability. Variable scaling and centering, as well as reducing the number of variables can be tried to see whether the warnings disappear. Reading about this, it seems that there are other logistic regression methods that are better suited for non-normally distributed data. Hence learning about those methods could be tried as well.

i. References

. Class notes

. H. Wickham, et al., R for Data Science, 2nd Ed., https://r4ds.hadley.nz/ (2023)

. United Nations, Frontier Technology Readiness Index, annual. Understanding the index and definitions of each index. https://unctadstat.unctad.org/datacentre/reportInfo/US.FTRI, accessed in November 2025.

j. Published

Published at https://rpubs.com/rmiranda/1373253

PROJECT 3 - High Tech Potential of Countries

Raul Miranda

2025-11-26