Do the necessary library installations
# install.packages("tidyverse","pROC")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
The primary question in this study is: what makes countries the most productive in high tech industries? In essence, we would like to identify the countries that are in the upper quantile of countries that achieved high productivity in high-tech in the 2000-2024 period. Those that did are labeled High and the rest are labeled Low. Since status may change over time, we will find some countries oscillating between High and Low.
In order to answer the question, let’s define productivity in terms of measurable indicators found in the databases provided. After some considerations, I arrived at the set of indicators shown below, and that are accessible in World Bank Indicators. They include (1) direct measures of national productivity: High-technology exports, ICT goods exports, ICT service exports and Scientific and technical journal articles; (2) measures of national resources: Access to electricity, Internet users, and Secure internet servers; and (3) measures of research promotion: Research and development expenditure, and Researchers in R&D.
World Bank lacked a single indicator that quantitatively evaluates overall performance, specifically in high-tech. “High-tech” includes advanced technologies derived from scientific achievements in medicine, industry, information, space, etc. I found a couple of good indicators, but was particularly impressed by the United Nations Frontier Technology Readiness Index (see reference below). A body of experts have tracked and maintained this performance index since 2010 until 2023 most recently and publishes databases and perspectives. The Overall performance index for countries is a number between 0 and 1. Its statistical distribution shows four quantiles, with the upper (4Q) in the range of 0.70 and 1.
I used that FTRI index to label a categorical variable I called tech_readiness (Technological Readiness), which acquires two values: Low for 0.0 \(\le\) FTRI < 0.7, and High for 0.7 \(\le\) FTRI \(\le\) 1.0.
In this project I use Logistic Regression to classify countries as Low or High for a given year over the 2000-2024 period. The predictor variables are the World Bank indicators, and the outcome variable is the Technological Readiness label derived from the FTRI index. Since the FTRI index is lacking before 2010, I developed a model based on the 2010 to 2023 dataset, tested its statistical accuracy, and used it to predict the 2000-2009 Technological Readines of countries, based on the reported technological indices given by the World Bank.
I focused mainly on the logistic regression modeling task, and didn’t
have time to go into the most useful details of studying which economic
indicators characterize the higher performers.
Downloaded to my project folder a compiled set of World Bank Indicators from http://data.worldbank.org/, including the indicators mentioned below. Name of the dataset in my folder: worlbankdataset.csv.
In addition, from the UN Frontier Technology Readiness Index (https://unctadstat.unctad.org/datacentre/) downloaded https://unctadstat.unctad.org/datacentre/dataviewer/US.FTRI. Name of the dataset in my folder: FTRI_table.csv.
The original table had the M-49 country codes (used by the UN) but lacked the ISO-3 country codes (used by the World Bank), so I used a conversion table from https://unstats.un.org/unsd/methodology/m49/overview/.
Datasets are tabulated by country_name, country_code and year. The original bound data set contains about 5400 records in the 2000-2024 period (long format), but not all countries reported complete information.
Outcome variable (categorical): tech_readiness (high, low)
From https://unctadstat.unctad.org/datacentre/reportInfo/US.FTRI – I defined the indicator Technological Readiness based on stats of FTRI Overall Index: 1Q-4Q: 0-0.3-0.5-0.7-1.0. High tech_readiness corresponds to FTRI Overall Index \(\ge 0.70.\)
From http://data.worldbank.org/, predictor variables:
Access to electricity (% of population) - quantitative
Internet users (% of population) - quantitative
Research and development expenditure (% of GDP) - quantitative
Researchers in R&D (per million people) - quantitative
Scientific and technical journal articles - quantitative
High-technology exports (% of manufactured exports) - quantitative
Secure Internet servers - quantitative
ICT goods exports (% of total goods exports) - quantitative
ICT service exports (% of service exports, BoP) - quantitative
Read datasets and bind them. Examine the top rows to take a peek at its variable structure.
setwd(getwd()) # ensure we're at the directory where the RMD is running
worldbank <- read_csv("worldbankdataset.csv", show_col_types = FALSE)
ftri <- read_csv("FTRI_table.csv", show_col_types = FALSE)
combined_set <- left_join(worldbank, ftri, by= c("country_code" = "code", "year"="year"))
head (combined_set, 3)
## # A tibble: 3 × 18
## country_name country_code year access_electricity high_tech_exports
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan AFG 2000 4.4 0
## 2 Afghanistan AFG 2001 9.3 0
## 3 Afghanistan AFG 2002 14.1 0
## # ℹ 13 more variables: ict_goods_exports <dbl>, ict_service_exports <dbl>,
## # internet_users <dbl>, rd_expenditures_gdp <dbl>, researchers_millpop <dbl>,
## # scientific_articles <dbl>, secure_servers <dbl>, access_finance_idx <dbl>,
## # ict_idx <dbl>, industry_activity_idx <dbl>, research_development_idx <dbl>,
## # skills_idx <dbl>, overall_index <dbl>
str(combined_set)
## spc_tbl_ [5,425 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ country_name : chr [1:5425] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ country_code : chr [1:5425] "AFG" "AFG" "AFG" "AFG" ...
## $ year : num [1:5425] 2000 2001 2002 2003 2004 ...
## $ access_electricity : num [1:5425] 4.4 9.3 14.1 19 23.8 28.7 33.5 38.4 42.4 48.3 ...
## $ high_tech_exports : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
## $ ict_goods_exports : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
## $ ict_service_exports : num [1:5425] 0 0 0 0 0 ...
## $ internet_users : num [1:5425] 0 0.00472 0.00456 0.08789 0.10581 ...
## $ rd_expenditures_gdp : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
## $ researchers_millpop : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
## $ scientific_articles : num [1:5425] 4 1 5.5 7.34 6.75 ...
## $ secure_servers : num [1:5425] 0 0 0 0 0 0 0 0 0 0 ...
## $ access_finance_idx : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## $ ict_idx : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## $ industry_activity_idx : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## $ research_development_idx: num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## $ skills_idx : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## $ overall_index : num [1:5425] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. country_name = col_character(),
## .. country_code = col_character(),
## .. year = col_double(),
## .. access_electricity = col_double(),
## .. high_tech_exports = col_double(),
## .. ict_goods_exports = col_double(),
## .. ict_service_exports = col_double(),
## .. internet_users = col_double(),
## .. rd_expenditures_gdp = col_double(),
## .. researchers_millpop = col_double(),
## .. scientific_articles = col_double(),
## .. secure_servers = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Get the basic stats of the FTRI overall_index
summary(combined_set$overall_index)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.3211 0.4759 0.5113 0.6984 1.0000 3007
Define the outcome variable tech_readiness (high, low) based on overall_index \(\ge 0.7\) for “high”. Then make it a factor for glm to work, and relevel the ref to “low”.
combined_set <- combined_set|> mutate (tech_readiness= if_else(overall_index >= 0.7, "high", "low"))
combined_set$tech_readiness <- as.factor(combined_set$tech_readiness)
combined_set$tech_readiness <- relevel(combined_set$tech_readiness, ref= "low")
summary (combined_set)
## country_name country_code year access_electricity
## Length:5425 Length:5425 Min. :2000 Min. : 0.00
## Class :character Class :character 1st Qu.:2006 1st Qu.: 57.00
## Mode :character Mode :character Median :2012 Median : 98.60
## Mean :2012 Mean : 77.16
## 3rd Qu.:2018 3rd Qu.:100.00
## Max. :2024 Max. :100.00
##
## high_tech_exports ict_goods_exports ict_service_exports internet_users
## Min. : 0.000 Min. : 0.000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.06625 1st Qu.: 2.80
## Median : 0.000 Median : 0.110 Median : 3.22588 Median : 24.44
## Mean : 4.933 Mean : 2.864 Mean : 5.99365 Mean : 34.44
## 3rd Qu.: 5.677 3rd Qu.: 1.590 3rd Qu.: 7.90453 3rd Qu.: 64.50
## Max. :95.618 Max. :63.640 Max. :64.80595 Max. :100.00
##
## rd_expenditures_gdp researchers_millpop scientific_articles secure_servers
## Min. :0.0000 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.: 4.3 1st Qu.: 0
## Median :0.0000 Median : 0.0 Median : 91.3 Median : 20
## Mean :0.3715 Mean : 665.6 Mean : 8510.3 Mean : 135938
## 3rd Qu.:0.3542 3rd Qu.: 281.6 3rd Qu.: 1477.9 3rd Qu.: 1103
## Max. :6.0192 Max. :20047.0 Max. :898949.1 Max. :66850221
##
## access_finance_idx ict_idx industry_activity_idx
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.7716 1st Qu.:0.3890 1st Qu.:0.3221
## Median :0.8386 Median :0.5836 Median :0.4919
## Mean :0.8253 Mean :0.5655 Mean :0.4990
## 3rd Qu.:0.8924 3rd Qu.:0.7471 3rd Qu.:0.6810
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :3007 NA's :3007 NA's :3007
## research_development_idx skills_idx overall_index tech_readiness
## Min. :0.0000 Min. :0.0000 Min. :0.0000 low :1817
## 1st Qu.:0.1022 1st Qu.:0.2760 1st Qu.:0.3211 high: 601
## Median :0.2100 Median :0.4465 Median :0.4759 NA's:3007
## Mean :0.2911 Mean :0.4571 Mean :0.5113
## 3rd Qu.:0.4653 3rd Qu.:0.6309 3rd Qu.:0.6984
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :3007 NA's :3007 NA's :3007
Since the FTRI data is complete between 2010 and 2023, except for Syria and island states, we will reduce the dataset, to exclude years outside 2010-23 and country/years that don’t have tech_readiness score: Syria and small island states. After fitting a logistic regression model and testing its accuracy, we will impute the missing tech_readiness with values predicted by the regression model for the 2000-2009 period and 2024.
combined_reduced <- combined_set |>
filter (year >= 2010 & year <= 2023 & !is.na(tech_readiness)) # the reduced set contains ca 2400 rows
#view(combined_reduced) #remove comment to view set
colSums(is.na(combined_reduced))
## country_name country_code year
## 0 0 0
## access_electricity high_tech_exports ict_goods_exports
## 0 0 0
## ict_service_exports internet_users rd_expenditures_gdp
## 0 0 0
## researchers_millpop scientific_articles secure_servers
## 0 0 0
## access_finance_idx ict_idx industry_activity_idx
## 0 0 0
## research_development_idx skills_idx overall_index
## 0 0 0
## tech_readiness
## 0
We’ll look at the boxplots of the variables grouped by tech_readiness.
vars <- c("scientific_articles","access_electricity","high_tech_exports","ict_service_exports","ict_goods_exports","researchers_millpop","secure_servers","internet_users","rd_expenditures_gdp")
for (var in vars) {
boxplot(combined_reduced[[var]] ~ combined_reduced$tech_readiness,
main = paste(var),
ylab = var,
col = "lightblue",
border = "darkblue"
)
}
Conclusions
For most variables, there is clear distinction of means/medians between the two groups (low and high). Some of the variables appear normally distributed within each group, with a few outliers. Others show many significantly separated outliers. Those outliers actually fall in upper 70% of the distribution for each variable. So, they play an important role in separating high from low.
Variable rd_expenditures seems to be highly influential in separating the groups. Also important appear to be scientific_articles, high_tech_exports, and researchers_millpop. secure_servers appear to completely separate low from high, so it may cause instability in the regression.
To appropriately analyze this dataset that presents variable distributions not perfectly normal, an appropriate logistic method should be sought or at least compared with the outcome from glm() which assumes normal distribution.
However, in order to carry out this project within a reasonable time I will only use the glm() method learned in class. We will actually see whether glm() can produce a reliable and accurate model to explain this dataset.
After studying logistic regression, I learned that the best approach to develop a model is to train the model with a subset of the dataset (training set) and to test its accuracy with the remaining rows (the test set).
I will follow an approach described by Wickham using createDataPartition (from library caret), to split the data into a training set (80%) and a test set (20%). Base R has sample() but that function doesn’t balance the proportion of low/high in tech_readiness. createDataPartition preserves the proportion of each class (high, low) in both the training and test sets.
set.seed(123) # set.seed generates a reproducible seed for the random number generator used by createDataPartition.
trainset_idx <- createDataPartition(y = combined_reduced$tech_readiness, p = 0.8, list = FALSE)
combined_red_train <- combined_reduced[trainset_idx, ] # this is the training set, with ca 1930 rows
combined_red_test <- combined_reduced[-trainset_idx, ] # this is the test set, with ca 480 rows
Run with the training dataset
Now go for logistic regression with the training dataset with tech_readiness as output and 9 vars as predictors
model_train <- glm(tech_readiness ~ scientific_articles + access_electricity +
high_tech_exports + ict_service_exports + ict_goods_exports +
researchers_millpop + secure_servers + internet_users + rd_expenditures_gdp,
family = binomial, data = combined_red_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_train)
##
## Call:
## glm(formula = tech_readiness ~ scientific_articles + access_electricity +
## high_tech_exports + ict_service_exports + ict_goods_exports +
## researchers_millpop + secure_servers + internet_users + rd_expenditures_gdp,
## family = binomial, data = combined_red_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.363e+01 1.552e+01 -3.456 0.000549 ***
## scientific_articles 1.428e-05 5.429e-06 2.631 0.008511 **
## access_electricity 4.969e-01 1.564e-01 3.177 0.001490 **
## high_tech_exports 6.288e-02 9.842e-03 6.389 1.67e-10 ***
## ict_service_exports -1.246e-02 1.164e-02 -1.070 0.284554
## ict_goods_exports -1.381e-02 1.441e-02 -0.959 0.337804
## researchers_millpop 8.536e-04 1.116e-04 7.649 2.02e-14 ***
## secure_servers 2.878e-06 5.365e-07 5.365 8.09e-08 ***
## internet_users 1.032e-02 5.744e-03 1.797 0.072267 .
## rd_expenditures_gdp 1.026e+00 2.166e-01 4.736 2.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2170.17 on 1934 degrees of freedom
## Residual deviance: 634.15 on 1925 degrees of freedom
## AIC: 654.15
##
## Number of Fisher Scoring iterations: 12
Interpretation of the logistic regression and its
accuracy
The deviances are measures of goodness of fit. Null d. is a measure of residuals with a model that is equal to the intercept term. Residual d. is a measure of residuals considering the intercept and all the variables. A large difference Null - Residual (2170-634) indicates a good fit as the model predicts the tech_readiness outcome with smaller deviance than just a null (intercept-only) model. The parenthesis about the dispersion parameter taken as 1 means that the model assumes normal distribution of the data and variance.
The regression coefficients show the change in the log odds of the tech_readiness for a unit change of each predictor variable. So, for example, a 1% increase of %people with access to electricity, increases the log odds of tech_readiness by roughly 0.5. While a 1% increase of %people that are internet_users, increases the log odds of tech_readiness by 0.01, substantially less than electricity_access. We can also see that a 1% of GDP increase in rd_expenditures increases the log odds of tech_readiness by roughly 1, a very significant effect.
The p-value of the intercept and 6 of the variables is well below 0.05, meaning that the calculated coefficients are statistically significant within 95% confidence. Four of them show p-value of almost 0, high_tech_exports, researchers_millpop, secure_servers, and rd_expenditures. While three of them show p-values above 0.05, ict_service_exports, ict_goods_exports and internet_users, meaning that their coefficients are not statistically significance with 95% confidence.
To calculate the p-value of the logistic model compared to the null model we can do :
with(model_train, null.deviance - deviance)
## [1] 1536.024
with(model_train, df.null - df.residual)
## [1] 9
with(model_train, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0
This again shows the difference in deviance between a null and a full model. 9 are the degrees of freedom. And the p-value is 0, showing that the model explains over 95% of the variance (deviances).
Now we will obtain the odds ratios by exponentiating the intercept and coefficients, and will get the confidence intervals (confint), and will show those side by side.
exp(coef(model_train))
## (Intercept) scientific_articles access_electricity high_tech_exports
## 5.128837e-24 1.000014e+00 1.643657e+00 1.064895e+00
## ict_service_exports ict_goods_exports researchers_millpop secure_servers
## 9.876184e-01 9.862872e-01 1.000854e+00 1.000003e+00
## internet_users rd_expenditures_gdp
## 1.010378e+00 2.789784e+00
exp(cbind(OR = coef(model_train), confint(model_train)))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## OR 2.5 % 97.5 %
## (Intercept) 5.128837e-24 1.663066e-39 2.953516e-13
## scientific_articles 1.000014e+00 1.000004e+00 1.000025e+00
## access_electricity 1.643657e+00 1.279432e+00 2.353365e+00
## high_tech_exports 1.064895e+00 1.044657e+00 1.085891e+00
## ict_service_exports 9.876184e-01 9.648224e-01 1.009891e+00
## ict_goods_exports 9.862872e-01 9.591027e-01 1.015126e+00
## researchers_millpop 1.000854e+00 1.000633e+00 1.001073e+00
## secure_servers 1.000003e+00 1.000002e+00 1.000004e+00
## internet_users 1.010378e+00 9.992409e-01 1.022083e+00
## rd_expenditures_gdp 2.789784e+00 1.886414e+00 4.464174e+00
The Warnings show that in a number of cases the prediction is exactly 0 or 1, which is a sign of complete separation or quasi-complete separation, ie one or more predictor variables correlating exactly with the tech_readiness outcome. We already saw this potential problem with secure_servers in the EDA section. However, the p-values lend confidence in the value of coefficients for most (except 3) variables. And moreover, the model converged so I will continue to explore it.
Reading about this problem, it seems that a possible approach is to reduce the number of variables, or to scale and center the variables. If I have time, I will later try reducing by eliminating the ones with the highest p-value coefficients, namely ict_service exports and ict_goods_exports, since overlap to some degree with high_tech_exports, and secure_servers.
Before doing that, though, let’s get to calculate the confusion table and the ROC curve with the full 9-variable model.
Using the model (model_train) predict probabilities and the the predicted technological readiness (pred_tr) by using a cutoff of the sigmoid curve of 0.4, above which is labeled “high” and below which is labeled “low”.
pred_prob <- predict(model_train, type = "response")
pred_tr <- ifelse(pred_prob >= 0.4, "high", "low")
compare_thetwo <- data.frame(ActualTR = combined_red_train$tech_readiness, PredictedTR = pred_tr)
Create the confusion table of tech_readiness, with Actual values in the rows, and Predicted values in the columns.
summ_compare <- table(compare_thetwo$ActualTR, compare_thetwo$PredictedTR)
summ_compare
##
## high low
## low 57 1397
## high 423 58
Now do the accuracy eval; with ‘positive’=‘high’, ‘negative’=‘low’.
TP <- 423; TN <- 1397; FP <- 57; FN <- 58
Accuracy <- (TP + TN) / (TP + TN + FP + FN)
Precision <- TP / (TP + FP)
Sensitivity <- TP / (TP + FN) # Recall
Specificity <- TN / (TN + FP)
F1_Score <- 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
cat (" Accuracy: ",Accuracy,"\n","Precision: ",Precision,"\n", "Recall/sensitivity: ", Sensitivity, "\n", "Specificity: ", Specificity, "\n","F1_Score: ", F1_Score)
## Accuracy: 0.9405685
## Precision: 0.88125
## Recall/sensitivity: 0.8794179
## Specificity: 0.9607978
## F1_Score: 0.880333
These numbers confirm the reliability of the full model. The F1_score close to 1 indicates good balance of precision and recall.
DO the ROC curve and AUC:
library(pROC)
roc_obj <- roc(response = combined_red_train$tech_readiness,
predictor = pred_prob,
direction = "<") # smaller prob = low
## Setting levels: control = low, case = high
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.9738
With an AUC close to 1 (and far from 0.5) the model shows high accuracy in classifying highs and lows.
Now plot the ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
Conclusion
Model model_train is accurate and precise and explains the deviances of the training set well. To further assess the accuracy of this model, now use the test data subset.
Run with the Testing Set
Now further test the accuracy of model_train using the test dataset. Predict probs from the test set we called combined_red_test, and then set the threshold in the sigmoid as before, to 0.4. (“high”: \(\le 0.4\))
pred_prob <- predict(model_train, newdata= combined_red_test, type = "response")
pred_tr <- ifelse(pred_prob >= 0.4, "high", "low")
compare_thetwo <- data.frame(ActualTR = combined_red_test$tech_readiness, PredictedTR = pred_tr)
Create the confusion matrix for tech_readiness from the
predictions obtained with test set.
summ_compare <- table(compare_thetwo$ActualTR, compare_thetwo$PredictedTR)
summ_compare
##
## high low
## low 10 353
## high 102 18
Now do the accuracy eval; with ‘positive’=‘high’, ‘negative’=‘low’.
TP <- 102; TN <- 353; FP <- 10; FN <- 18
Accuracy <- (TP + TN) / (TP + TN + FP + FN)
Precision <- TP / (TP + FP)
Sensitivity <- TP / (TP + FN) # Recall
Specificity <- TN / (TN + FP)
F1_Score <- 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
cat (" Accuracy: ",Accuracy,"\n","Precision: ",Precision,"\n", "Recall/sensitivity: ", Sensitivity, "\n", "Specificity: ", Specificity, "\n","F1_Score: ", F1_Score)
## Accuracy: 0.942029
## Precision: 0.9107143
## Recall/sensitivity: 0.85
## Specificity: 0.9724518
## F1_Score: 0.8793103
These numbers again confirm the reliability of the full model. The F1_score close to 1 indicates good balance of precision and recall.
Now do the ROC curve and AUC.
library(pROC)
roc_obj <- roc(response = combined_red_test$tech_readiness,
predictor = pred_prob,
direction = "<") # smaller prob = low
## Setting levels: control = low, case = high
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.9634
With an AUC close to 1 (and far from 0.5) the model shows high
accuracy in classifying highs and lows also in the test set.
Plot
ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
Conclusion
Model model_train displays high accuracy, specificity and sensitivity, with the test dataset, equally as well as with the training set.
Backfill the missing Technical Readiness
values
Since model_train accuracy is equally high, above 94% for both test and train sets, let’s go ahead and use the model_train to impute now all of the missing technical readiness levels in the FTR index data table.
We will use the original World Bank data for the other columns to predict (and back fill) the Tech_readiness since 2000, but only for the missing values of tech_readiness in the combined_set. We will not override the existing true values with the predicted ones.
In addition, I noticed that many countries including many of the developed ones, reported scarce statistics for 2024, so, rather than dropping 2024, I will also impute the 2024 variable values reported as zero, with the 2023 values, since most countries will not change their technological readiness status in a year. Notice that if 2023 value is zero, 2024 will also be zero, so 2024 will not be inflated.
Impute the 2024 values of the variables in var_list (when they are zero) with the 2023 values:
var_list <- c("scientific_articles", "access_electricity", "high_tech_exports", "ict_service_exports", "ict_goods_exports", "researchers_millpop", "secure_servers", "internet_users", "rd_expenditures_gdp")
impute_var <- function(var,year,targety) {
case_when (
year == targety & var == 0 ~ lag(var, n=1), TRUE ~var) } # use lag() to get value of previous year
# leave var untouched if not zero
combined_set <- combined_set |> group_by(country_code) |>
mutate(across(all_of(var_list), ~impute_var(.x,year,2023))) |> # impute 2023 when zero
mutate(across(all_of(var_list), ~impute_var(.x,year,2024))) |> # impute 2024 when zero
ungroup()
Now predict the probabilities for the full combined_set (5400
rows) using the successful model_train:
combined_set$predprob <- predict(model_train, newdata= combined_set, type= "response")
# And then determine pred_tech_readiness this way: When tech_readiness is available in combined_set, pred_tech_readiness = tech_readiness (the true value). When tech_readiness is NA, pred_tech_readiness = high or low (depending on whether predprob (probability) is above or below threshold of 0.4).
combined_set$pred_tech_readiness <- if_else(is.na(combined_set$tech_readiness), if_else(combined_set$predprob >=.4, "high","low"), combined_set$tech_readiness)
Check how the basic statistics have evolved between “original” set (2010-2023) and “predicted” set (2000-2024). The ratio High/Low is higher for the 2010-2023 period than for 2000-2024 because some countries became more productive and entered the “high” status in the latter decade compared to the early 2000s.
combined_set |>
summarize( Orig_high = sum(tech_readiness == "high", na.rm = TRUE), Orig_low = sum(tech_readiness == "low", na.rm = TRUE),
OrigHiLowratio = Orig_high/Orig_low,
Pred_high = sum(pred_tech_readiness == "high", na.rm = TRUE), Pred_low = sum(pred_tech_readiness == "low", na.rm = TRUE),
PredHiLowratio = Pred_high/Pred_low ) |> print()
## # A tibble: 1 × 6
## Orig_high Orig_low OrigHiLowratio Pred_high Pred_low PredHiLowratio
## <int> <int> <dbl> <int> <int> <dbl>
## 1 601 1817 0.331 912 4513 0.202
Time-evolution of countries and indicators
Show the evolution over time: in the aggregate depict the yearly progress of country evolving from Low to High, and of a couple of crucial independent indicators: R&D Expenditures and High-Tech Exports.
combined_set_agg <- combined_set |>
group_by(year, pred_tech_readiness) |>
summarize(`R&D Expenditures` = mean(rd_expenditures_gdp), `High-Tech Exports` = mean(high_tech_exports), Countries = n(), .groups = 'drop')
ggplot(combined_set_agg, aes(x = year)) +
geom_point(aes(y = Countries, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
labs(
title = "World Evolution towards High-Tech in 2000-2024",
x = "Year", y = "Countries",
color = "Technological Readiness",
shape = "Technological Readiness"
) +
theme_minimal()
ggplot(combined_set_agg, aes(x = year)) +
geom_point(aes(y = `High-Tech Exports`, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
labs(
title = "World Evolution towards High-Tech in 2000-2024",
x = "Year", y = "High-Tech Exports (% all exports)",
color = "Technological Readiness",
shape = "Technological Readiness"
) +
theme_minimal()
ggplot(combined_set_agg, aes(x = year)) +
geom_point(aes(y = `R&D Expenditures`, color = pred_tech_readiness , shape = pred_tech_readiness ), size = 3) +
labs(
title = "World Evolution towards High-Tech in 2000-2024",
x = "Year", y = "R&D Expenditures (mean %GDP)",
color = "Technological Readiness",
shape = "Technological Readiness"
) +
theme_minimal()
A logistic regression model to classify countries into high and low in terms of technological readiness was developed on the basis of economic indicators available in World Bank, and the FTRI technology readiness index available at the United Nations databases. The model was trained with a subset of the data and further tested for accuracy with another subset.
The logistic regression model is accurate and is able to predict technical readiness on the basis of the variables selected. The classification into high and low technical readiness allows an analysis of which factors are most relevant to promote countries to evolve from low to high. Two very important factors are R&D expenditures and high-technology exports. Surprisingly, the factors that are least significant in the model are ICT (information and communication technology) goods and services, despite their relevance to high-tech. Other factors have medium-level significance: number of researchers, scientific_articles, access to electricity, number of secure_servers, and percent of internet_users in the population.
The time evolution of the countries from one class to the other (low to high) shows that it is very difficult to “close the gap”. Two of the indicators display that clearly. The R&D Expenditure gap is about 1.5% GDP and it has decreased over the past decade because the “high” countries are spending less. In parallel, the High-Tech Exports gap is 15% of manufactured exports. The gap is widening in the last decade as “high countries” export more and “low countries” export less.
An important outcome from this project was to be able to predict Technological Readiness for 2000-2024, beyond the range covered by the FTRI index from 2010-2023.
Assembling the database from two different sources was challenging, as the process generated NAs for years not covered by one of the other source. The modeling itself demanded studying some case studies of applications of logistical regression, as interpretation is more difficult than with linear regression. We learned about the occurrence of complete separation, which needs to be dealt with in various ways. The regression generated a few warnings of complete separation, but at the end we showed that the model was highly accurate and didn’t need special treatment for the model to converge.
The glm() model appears to be accurate and reliable after examining the outcome and statistical significance of coefficients and the model. However, the warnings of complete or quasi complete separation indicates that further work is needed to ensure the reliability. Variable scaling and centering, as well as reducing the number of variables can be tried to see whether the warnings disappear. Reading about this, it seems that there are other logistic regression methods that are better suited for non-normally distributed data. Hence learning about those methods could be tried as well.
. Class notes
. H. Wickham, et al., R for Data Science, 2nd Ed., https://r4ds.hadley.nz/ (2023)
. United Nations, Frontier Technology Readiness Index, annual. Understanding the index and definitions of each index. https://unctadstat.unctad.org/datacentre/reportInfo/US.FTRI, accessed in November 2025.
Published at https://rpubs.com/rmiranda/1373253