In a rapidly changing global environment marked by trade wars, geopolitical tensions, and health crises such as the COVID-19 pandemic, countries and businesses are re-evaluating their supply chain strategies. One such strategy gaining traction is ‘nearshoring’. Mexico, with its proximity to the US market, has emerged as a potential hub for this strategy. This document aims to analyze Mexico’s attractiveness as a nearshoring destination and understand the specific factors influencing this appeal.
Several factors have disrupted the global supply chain: - The spread of COVID-19 from its epicenter in Wuhan, China. - China, known as the “factory of the world”, experiencing a halt in business activities. - Ongoing US-China trade tensions since 2017. - Escalation of conflict between Russia and Ukraine in 2022.
Due to these disruptions, there’s a need for international companies to diversify their production chains to reduce dependency on any single region.
Nearshoring involves relocating business projects closer to end markets. Mexico’s strategic geographic location offers it a competitive advantage in this realm, potentially leading to significant economic growth.
Examples such as Tesla’s investment in Nuevo Leon in 2023 underline the increasing interest in Mexico for nearshoring purposes. However, to leverage this interest, it’s crucial to identify the factors that make Mexico attractive for such investments.
Using econometrics, we can analyze the conditions that Mexico offers for nearshoring. By tapping into databases from INEGI, Banxico, and the Ministry of Economy, we can explore factors like socioeconomic conditions, business environments, and technological capabilities. This will help in predicting the effect of nearshoring in Mexico and identify areas of opportunity for future investments.
Below, we’ll analyze data that helps answer why Mexico is an appealing country for nearshoring:
library(foreign)
library(tidyverse) # collection of R packages designed for data science
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # data manipulation
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(cowplot) # for arranging plots
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Hmisc) # several useful functions for data analysis
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:dplyr':
##
## src, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych) # functions for multivariate analysis
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:Hmisc':
##
## describe
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(naniar) # summaries and visualization of missing values NA's
library(dlookr) # summaries and visualization of missing values NA's
##
## Attaching package: 'dlookr'
##
## The following object is masked from 'package:psych':
##
## describe
##
## The following object is masked from 'package:Hmisc':
##
## describe
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## The following object is masked from 'package:base':
##
## transform
library(corrplot) # correlation plots
## corrplot 0.92 loaded
library(jtools) # presentation of regression analysis
##
## Attaching package: 'jtools'
##
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(lmtest) # diagnostic checks - linear regression analysis
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # diagnostic checks - linear regression analysis
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:psych':
##
## logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(olsrr) # diagnostic checks - linear regression analysis
##
## Attaching package: 'olsrr'
##
## The following object is masked from 'package:datasets':
##
## rivers
library(naniar) # identifying missing values
library(stargazer) # create publication quality tables
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects) # displays for linear and other regression models
## Registered S3 method overwritten by 'survey':
## method from
## summary.pps dlookr
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(caret) # Classification and Regression Training
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet) # methods for prediction and plotting, and functions for cross-validation
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
# Reading the dataset
sp_data <- read.csv("/Users/daviddrums180/Desktop/sp_data.csv")
# Getting a glimpse of the data
head(sp_data)
## periodo IED_Flujos IED_M Exportaciones Exportaciones_m Empleo Educacion
## 1 1997 12145.60 294151.2 9087.62 220090.8 NA 7.20
## 2 1998 8373.50 210875.6 9875.07 248690.6 NA 7.31
## 3 1999 13960.32 299734.4 10990.01 235960.5 NA 7.43
## 4 2000 18248.69 362631.8 12482.96 248057.2 97.83 7.56
## 5 2001 30057.18 546548.4 11300.44 205482.9 97.36 7.68
## 6 2002 24099.21 468332.0 11923.10 231707.6 97.66 7.80
## Salario_Diario Innovacion Inseguridad_Robo Inseguridad_Homicidio
## 1 24.30 11.30 266.51 14.55
## 2 31.91 11.37 314.78 14.32
## 3 31.91 12.46 272.89 12.64
## 4 35.12 13.15 216.98 10.86
## 5 37.57 13.47 214.53 10.25
## 6 39.74 12.80 197.80 9.94
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## 1 8.06 0.05 47.44 3.68
## 2 9.94 0.05 48.76 3.85
## 3 9.52 0.06 49.48 3.69
## 4 9.60 0.06 50.58 3.87
## 5 9.17 0.06 51.28 3.81
## 6 10.36 0.06 51.95 3.82
## PIB_Per_Capita INPC crisis_financiera
## 1 127570.1 33.28 0
## 2 126738.8 39.47 0
## 3 129164.7 44.34 0
## 4 130874.9 48.31 0
## 5 128083.4 50.43 0
## 6 128205.9 53.31 0
Descriptive statistics provide a quick overview of the central tendency, dispersion, and shape of the data distribution. We’ll start by examining measures like mean, median, and mode for the numerical columns.
# Descriptive statistics of the dataset
summary(sp_data)
## periodo IED_Flujos IED_M Exportaciones
## Min. :1997 Min. : 8374 Min. :210876 Min. : 9088
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:368560 1st Qu.:13260
## Median :2010 Median :27698 Median :497054 Median :21188
## Mean :2010 Mean :26770 Mean :493596 Mean :23601
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:578606 3rd Qu.:31601
## Max. :2022 Max. :48354 Max. :754438 Max. :46478
##
## Exportaciones_m Empleo Educacion Salario_Diario
## Min. :205483 Min. :95.06 Min. :7.200 Min. : 24.30
## 1st Qu.:262337 1st Qu.:95.89 1st Qu.:7.865 1st Qu.: 41.97
## Median :366294 Median :96.53 Median :8.460 Median : 54.48
## Mean :433856 Mean :96.47 Mean :8.423 Mean : 65.16
## 3rd Qu.:632356 3rd Qu.:97.08 3rd Qu.:9.000 3rd Qu.: 72.31
## Max. :785654 Max. :97.83 Max. :9.580 Max. :172.87
## NA's :3 NA's :3
## Innovacion Inseguridad_Robo Inseguridad_Homicidio Tipo_de_Cambio
## Min. :11.28 Min. :120.5 Min. : 8.04 Min. : 8.06
## 1st Qu.:12.56 1st Qu.:148.3 1st Qu.:10.25 1st Qu.:10.75
## Median :13.09 Median :181.8 Median :16.93 Median :13.02
## Mean :13.11 Mean :185.4 Mean :17.29 Mean :13.91
## 3rd Qu.:13.75 3rd Qu.:209.9 3rd Qu.:22.43 3rd Qu.:18.49
## Max. :15.11 Max. :314.8 Max. :29.59 Max. :20.66
## NA's :2 NA's :1
## Densidad_Carretera Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## Min. :0.05000 Min. :47.44 Min. :3.590 Min. :126739
## 1st Qu.:0.06000 1st Qu.:52.77 1st Qu.:3.830 1st Qu.:130964
## Median :0.07000 Median :58.09 Median :3.930 Median :136845
## Mean :0.07115 Mean :57.33 Mean :3.945 Mean :138550
## 3rd Qu.:0.08000 3rd Qu.:61.39 3rd Qu.:4.105 3rd Qu.:146148
## Max. :0.09000 Max. :65.60 Max. :4.220 Max. :153236
## NA's :3
## INPC crisis_financiera
## Min. : 33.28 Min. :0.00000
## 1st Qu.: 56.15 1st Qu.:0.00000
## Median : 73.35 Median :0.00000
## Mean : 75.17 Mean :0.07692
## 3rd Qu.: 91.29 3rd Qu.:0.00000
## Max. :126.48 Max. :1.00000
##
Based on our understanding of the dataset, we can make several transformations to make our analysis more meaningful.
# Transforming 'periodo' into a factor (categorical variable)
# sp_data$periodo <- as.factor(sp_data$periodo)
sp_data$crisis_financiera <- as.factor(sp_data$crisis_financiera)
# Creating 'exportaciones_mx' by multiplying 'Exportaciones' and 'Tipo_de_Cambio'
# sp_data <- sp_data %>%
# mutate(exportaciones_mx = Exportaciones * Tipo_de_Cambio)
# Creating 'salario_diario_mx' by multiplying 'Salario_Diario' and 'Tipo_de_Cambio'
sp_data <- sp_data %>%
mutate(salario_diario_mx = Salario_Diario * Tipo_de_Cambio)
# Checking the first few rows to ensure our new variables are correctly added
head(sp_data)
## periodo IED_Flujos IED_M Exportaciones Exportaciones_m Empleo Educacion
## 1 1997 12145.60 294151.2 9087.62 220090.8 NA 7.20
## 2 1998 8373.50 210875.6 9875.07 248690.6 NA 7.31
## 3 1999 13960.32 299734.4 10990.01 235960.5 NA 7.43
## 4 2000 18248.69 362631.8 12482.96 248057.2 97.83 7.56
## 5 2001 30057.18 546548.4 11300.44 205482.9 97.36 7.68
## 6 2002 24099.21 468332.0 11923.10 231707.6 97.66 7.80
## Salario_Diario Innovacion Inseguridad_Robo Inseguridad_Homicidio
## 1 24.30 11.30 266.51 14.55
## 2 31.91 11.37 314.78 14.32
## 3 31.91 12.46 272.89 12.64
## 4 35.12 13.15 216.98 10.86
## 5 37.57 13.47 214.53 10.25
## 6 39.74 12.80 197.80 9.94
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## 1 8.06 0.05 47.44 3.68
## 2 9.94 0.05 48.76 3.85
## 3 9.52 0.06 49.48 3.69
## 4 9.60 0.06 50.58 3.87
## 5 9.17 0.06 51.28 3.81
## 6 10.36 0.06 51.95 3.82
## PIB_Per_Capita INPC crisis_financiera salario_diario_mx
## 1 127570.1 33.28 0 195.8580
## 2 126738.8 39.47 0 317.1854
## 3 129164.7 44.34 0 303.7832
## 4 130874.9 48.31 0 337.1520
## 5 128083.4 50.43 0 344.5169
## 6 128205.9 53.31 0 411.7064
While the summary function provides a quick overview that includes some measures of dispersion (like the min, max, 1st quartile, and 3rd quartile), we might want to delve deeper into certain columns. Here we’ll focus on variance and standard deviation.
# List of variables for which to compute variance and standard deviation
columns_to_check <- c("IED_Flujos", "Exportaciones", "Empleo", "Educacion", "Salario_Diario",
"Innovacion", "Inseguridad_Robo", "Inseguridad_Homicidio", "Tipo_de_Cambio",
"Densidad_Carretera", "Densidad_Poblacion", "CO2_Emisiones", "PIB_Per_Capita", "INPC")
# Using sapply to compute variance and standard deviation for each column
variances <- sapply(sp_data[, columns_to_check], var, na.rm = TRUE)
std_devs <- sapply(sp_data[, columns_to_check], sd, na.rm = TRUE)
# Combining results into a data frame
dispersion_data <- data.frame(Variable = names(variances),
Variance = variances,
Standard_Deviation = std_devs)
dispersion_data
## Variable Variance Standard_Deviation
## IED_Flujos IED_Flujos 7.692480e+07 8.770679e+03
## Exportaciones Exportaciones 1.287347e+08 1.134613e+04
## Empleo Empleo 5.901316e-01 7.682002e-01
## Educacion Educacion 5.179146e-01 7.196629e-01
## Salario_Diario Salario_Diario 1.285149e+03 3.584897e+01
## Innovacion Innovacion 1.249904e+00 1.117991e+00
## Inseguridad_Robo Inseguridad_Robo 2.272065e+03 4.766618e+01
## Inseguridad_Homicidio Inseguridad_Homicidio 5.274689e+01 7.262705e+00
## Tipo_de_Cambio Tipo_de_Cambio 1.722372e+01 4.150147e+00
## Densidad_Carretera Densidad_Carretera 1.786154e-04 1.336471e-02
## Densidad_Poblacion Densidad_Poblacion 2.928160e+01 5.411248e+00
## CO2_Emisiones CO2_Emisiones 3.679881e-02 1.918302e-01
## PIB_Per_Capita PIB_Per_Capita 7.851913e+07 8.861102e+03
## INPC INPC 6.154715e+02 2.480870e+01
It’s crucial to identify any missing values in the dataset, as they can influence the outcome of the analysis.
# Checking for missing values
missing_vals <- is.na(sp_data)
missing_summary <- colSums(missing_vals)
# Printing the summary of missing values by column
print(missing_summary)
## periodo IED_Flujos IED_M
## 0 0 0
## Exportaciones Exportaciones_m Empleo
## 0 0 3
## Educacion Salario_Diario Innovacion
## 3 0 2
## Inseguridad_Robo Inseguridad_Homicidio Tipo_de_Cambio
## 0 1 0
## Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## 0 0 3
## PIB_Per_Capita INPC crisis_financiera
## 0 0 0
## salario_diario_mx
## 0
# Replacing NA values with median for each column
for(column in names(sp_data)) {
if(is.numeric(sp_data[[column]])) {
sp_data[[column]][is.na(sp_data[[column]])] <- median(sp_data[[column]], na.rm = TRUE)
}
}
# Check if there are any remaining missing values
post_replacement_missing_summary <- colSums(is.na(sp_data))
print(post_replacement_missing_summary)
## periodo IED_Flujos IED_M
## 0 0 0
## Exportaciones Exportaciones_m Empleo
## 0 0 0
## Educacion Salario_Diario Innovacion
## 0 0 0
## Inseguridad_Robo Inseguridad_Homicidio Tipo_de_Cambio
## 0 0 0
## Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## 0 0 0
## PIB_Per_Capita INPC crisis_financiera
## 0 0 0
## salario_diario_mx
## 0
This is especially useful when we want to get insights about the distribution, central tendency, and spread of out data at a granular level.
# Obtain descriptive statistics for the dataset
descriptive_stats <- describe(sp_data)
# Print the descriptive statistics
print(descriptive_stats)
## # A tibble: 18 × 26
## described_variables n na mean sd se_mean IQR skewness
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 periodo 26 0 2.01e+3 7.65e+0 1.5 e+0 1.25e+1 0
## 2 IED_Flujos 26 0 2.68e+4 8.77e+3 1.72e+3 1.08e+4 -0.0209
## 3 IED_M 26 0 4.94e+5 1.44e+5 2.82e+4 2.10e+5 -0.0117
## 4 Exportaciones 26 0 2.36e+4 1.13e+4 2.23e+3 1.83e+4 0.514
## 5 Exportaciones_m 26 0 4.34e+5 1.95e+5 3.82e+4 3.70e+5 0.546
## 6 Empleo 26 0 9.65e+1 7.21e-1 1.41e-1 9.25e-1 -0.193
## 7 Educacion 26 0 8.43e+0 6.75e-1 1.32e-1 9.67e-1 -0.134
## 8 Salario_Diario 26 0 6.52e+1 3.58e+1 7.03e+0 3.03e+1 1.62
## 9 Innovacion 26 0 1.31e+1 1.07e+0 2.10e-1 1.01e+0 0.141
## 10 Inseguridad_Robo 26 0 1.85e+2 4.77e+1 9.35e+0 6.16e+1 1.00
## 11 Inseguridad_Homicidio 26 0 1.73e+1 7.12e+0 1.40e+0 1.19e+1 0.430
## 12 Tipo_de_Cambio 26 0 1.39e+1 4.15e+0 8.14e-1 7.73e+0 0.498
## 13 Densidad_Carretera 26 0 7.12e-2 1.34e-2 2.62e-3 2 e-2 0.209
## 14 Densidad_Poblacion 26 0 5.73e+1 5.41e+0 1.06e+0 8.62e+0 -0.209
## 15 CO2_Emisiones 26 0 3.94e+0 1.80e-1 3.53e-2 2.48e-1 -0.126
## 16 PIB_Per_Capita 26 0 1.39e+5 8.86e+3 1.74e+3 1.52e+4 0.310
## 17 INPC 26 0 7.52e+1 2.48e+1 4.87e+0 3.51e+1 0.295
## 18 salario_diario_mx 26 0 1.03e+3 8.50e+2 1.67e+2 9.66e+2 1.46
## # ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## # p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
## # p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
## # p99 <dbl>, p100 <dbl>
By standardizing all variables to minuscules, we create an standard so when we are making analysis we can call variables easily.
# Convert all column names to lowercase
names(sp_data) <- tolower(names(sp_data))
# Rename columns that end with "_m" to end with "_mx"
new_column_names <- gsub("_m$", "_mx", names(sp_data))
names(sp_data) <- new_column_names
# Check the updated column names
head(sp_data)
## periodo ied_flujos ied_mx exportaciones exportaciones_mx empleo educacion
## 1 1997 12145.60 294151.2 9087.62 220090.8 96.53 7.20
## 2 1998 8373.50 210875.6 9875.07 248690.6 96.53 7.31
## 3 1999 13960.32 299734.4 10990.01 235960.5 96.53 7.43
## 4 2000 18248.69 362631.8 12482.96 248057.2 97.83 7.56
## 5 2001 30057.18 546548.4 11300.44 205482.9 97.36 7.68
## 6 2002 24099.21 468332.0 11923.10 231707.6 97.66 7.80
## salario_diario innovacion inseguridad_robo inseguridad_homicidio
## 1 24.30 11.30 266.51 14.55
## 2 31.91 11.37 314.78 14.32
## 3 31.91 12.46 272.89 12.64
## 4 35.12 13.15 216.98 10.86
## 5 37.57 13.47 214.53 10.25
## 6 39.74 12.80 197.80 9.94
## tipo_de_cambio densidad_carretera densidad_poblacion co2_emisiones
## 1 8.06 0.05 47.44 3.68
## 2 9.94 0.05 48.76 3.85
## 3 9.52 0.06 49.48 3.69
## 4 9.60 0.06 50.58 3.87
## 5 9.17 0.06 51.28 3.81
## 6 10.36 0.06 51.95 3.82
## pib_per_capita inpc crisis_financiera salario_diario_mx
## 1 127570.1 33.28 0 195.8580
## 2 126738.8 39.47 0 317.1854
## 3 129164.7 44.34 0 303.7832
## 4 130874.9 48.31 0 337.1520
## 5 128083.4 50.43 0 344.5169
## 6 128205.9 53.31 0 411.7064
Since there are columns that are the same but with a change in the currency, it is best if we only mantain the columns that will be used for the analysis.
# Removing specified columns from the dataset
sp_data <- sp_data %>% select(-c(exportaciones, ied_flujos, salario_diario))
In this section, visual representations of the dataset are constructed to offer insights into its structure and trends. By visually exploring the data, we can uncover patterns, correlations, or anomalies which might not be immediately evident through numeric summaries alone.
# 1. salario_diario plot
ggplot(sp_data, aes(x = salario_diario_mx, y = ied_mx)) +
geom_line() +
labs(title = "Salario Diario vs. IED_MX",
x = "Salario Diario",
y = "IED_MX") +
theme_minimal()
# 2. densidad_poblacion plot
ggplot(sp_data, aes(x = densidad_poblacion, y = ied_mx)) +
geom_histogram(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Densidad Poblacion vs. IED_MX",
x = "Densidad Poblacion",
y = "IED_MX") +
theme_minimal()
## Warning in geom_histogram(stat = "identity", fill = "skyblue", color = "black"):
## Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
# 3. Bar graph for crisis_financiera
sp_data %>%
group_by(periodo, crisis_financiera) %>%
summarise(count = n()) %>%
ggplot(aes(x = periodo, y = count, fill = as.factor(crisis_financiera))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label=periodo), vjust=0.5, hjust=10, position=position_dodge(width=0.9), size=3.5, angle=90, color="white") +
geom_hline(aes(yintercept = median(count, na.rm = TRUE)), color = "red") +
labs(title = "Distribution of Financial Crisis over Years",
x = "Year",
y = "Count",
fill = "Financial Crisis") +
theme_minimal() +
theme(axis.text.x = element_blank())
## `summarise()` has grouped output by 'periodo'. You can override using the
## `.groups` argument.
# 4. Correlation plot
cor_data <- sp_data %>% select_if(is.numeric) # removing factor variables
cor_matrix <- cor(cor_data, use = "pairwise.complete.obs")
# Visualization
plot_corr <- corrplot(cor_matrix,
method = "circle",
type = "full",
tl.cex = 0.4,
tl.col = "black",
tl.srt = 90,
)
# 5. Innovacion vs. ied_mx (using a bar graph)
ggplot(sp_data, aes(x = innovacion, y = ied_mx)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Innovacion vs. IED_MX",
x = "Innovacion",
y = "IED_MX") +
theme_minimal()
# 6. Histogram for ied_mx
plot_ied_mx <- ggplot(sp_data, aes(x = ied_mx)) +
geom_histogram(fill = "purple", color = "black", bins = 30) +
labs(title = "Histogram of IED",
x = "IED") +
theme_minimal()
plot_ied_mx
In this regression model, we’ll use all the variables as independent variables to predict the ied_mx (inward foreign direct investment). This approach provides a holistic view of the relationship each independent variable has with the dependent variable. It can serve as a foundation, upon which we refine and optimize our model in subsequent iterations.
# Regression Model 1
model_all_vars <- lm(ied_mx ~ ., data = sp_data)
summary(model_all_vars)
##
## Call:
## lm(formula = ied_mx ~ ., data = sp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93838 -22410 1652 32302 78251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.691e+08 1.614e+08 -2.907 0.0156 *
## periodo 2.363e+05 8.085e+04 2.923 0.0152 *
## exportaciones_mx -2.715e+00 1.277e+00 -2.125 0.0595 .
## empleo 6.075e+04 5.511e+04 1.102 0.2962
## educacion -1.303e+05 3.089e+05 -0.422 0.6822
## innovacion 7.813e+04 2.996e+04 2.608 0.0261 *
## inseguridad_robo -9.310e+01 8.821e+02 -0.106 0.9180
## inseguridad_homicidio 9.810e+03 1.226e+04 0.800 0.4421
## tipo_de_cambio 6.523e+04 3.966e+04 1.645 0.1310
## densidad_carretera 9.336e+06 1.174e+07 0.796 0.4448
## densidad_poblacion -1.704e+05 9.355e+04 -1.822 0.0985 .
## co2_emisiones 1.725e+05 2.846e+05 0.606 0.5579
## pib_per_capita -4.410e+00 1.430e+01 -0.308 0.7641
## inpc -2.550e+04 2.105e+04 -1.212 0.2535
## crisis_financiera1 -1.228e+05 1.122e+05 -1.095 0.2993
## salario_diario_mx 5.722e+00 3.075e+02 0.019 0.9855
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70240 on 10 degrees of freedom
## Multiple R-squared: 0.9046, Adjusted R-squared: 0.7615
## F-statistic: 6.323 on 15 and 10 DF, p-value: 0.002819
To assess the performance and reliability of our regression model, we will use the following metrics:
R-Squared: This statistic provides a measure of how well the observed outcomes are replicated by the model. It explains the proportion of variance in the dependent variable that’s explained by the independent variables. RMSE (Root Mean Square Error): It signifies the model’s prediction error. Essentially, it tells us how concentrated the data is around the line of best fit. AIC (Akaike Information Criterion): This measures the goodness of fit of our model. A lower AIC value suggests a better-fitting model.
# R-Squared
r_squared <- summary(model_all_vars)$r.squared
# RMSE
rmse <- sqrt(mean(model_all_vars$residuals^2))
# AIC
aic_value <- AIC(model_all_vars)
list(R_Squared = r_squared, RMSE = rmse, AIC = aic_value)
## $R_Squared
## [1] 0.9046175
##
## $RMSE
## [1] 43563.72
##
## $AIC
## [1] 663.2478
Post-regression diagnostics are crucial for validating the assumptions of regression analysis. We’ll be performing:
VIF (Variance Inflation Factor): To check for multicollinearity among the predictor variables. A VIF greater than 5-10 usually suggests high multicollinearity. Breusch-Pagan Test: To check for heteroskedasticity. If the p-value is less than a chosen alpha level (commonly 0.05), heteroskedasticity is present. Histogram of Residuals: To check the normality of the residuals. Ideally, the residuals should be normally distributed around zero.
# VIF for multicollinearity
vif_values <- car::vif(model_all_vars)
# Breusch-Pagan Test for heteroskedasticity
bp_test <- lmtest::bptest(model_all_vars)
# Histogram for normalization of residuals
hist(model_all_vars$residuals, main = "Histogram of Residuals", xlab = "Residuals")
# Print Results
# VIF for multicollinearity
cat("VIF Values:\n")
## VIF Values:
print(vif_values)
## periodo exportaciones_mx empleo
## 1937.375659 314.388179 7.997350
## educacion innovacion inseguridad_robo
## 220.456731 5.228816 8.958156
## inseguridad_homicidio tipo_de_cambio densidad_carretera
## 38.545261 137.259856 124.624326
## densidad_poblacion co2_emisiones pib_per_capita
## 1298.478091 13.294804 81.341486
## inpc crisis_financiera salario_diario_mx
## 1381.397996 4.710371 345.897912
cat("\n")
# Breusch-Pagan Test for heteroskedasticity
cat("Breusch-Pagan Test:\n")
## Breusch-Pagan Test:
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: model_all_vars
## BP = 14.602, df = 15, p-value = 0.4805
cat("\n")
In this iteration, we move towards a more sophisticated model by incorporating polynomial regression, specifically targeting the densidad_poblacion variable. Polynomial regression helps to capture non-linear relationships in the data.
# Regression Model 2
model_poly_selected <- lm(ied_mx ~ periodo+exportaciones_mx+educacion+innovacion+densidad_poblacion+I(densidad_poblacion^2)+pib_per_capita+crisis_financiera,data=sp_data)
summary(model_poly_selected)
##
## Call:
## lm(formula = ied_mx ~ periodo + exportaciones_mx + educacion +
## innovacion + densidad_poblacion + I(densidad_poblacion^2) +
## pib_per_capita + crisis_financiera, data = sp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -140941 -41792 8327 42336 111494
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.627e+08 9.228e+07 -2.847 0.0111 *
## periodo 1.339e+05 4.685e+04 2.859 0.0109 *
## exportaciones_mx -9.361e-02 4.692e-01 -0.199 0.8442
## educacion 1.505e+05 1.318e+05 1.142 0.2691
## innovacion 6.604e+04 2.734e+04 2.415 0.0273 *
## densidad_poblacion -5.146e+04 1.703e+05 -0.302 0.7662
## I(densidad_poblacion^2) -1.049e+03 1.628e+03 -0.644 0.5281
## pib_per_capita -1.158e+01 1.254e+01 -0.923 0.3688
## crisis_financiera1 -5.802e+04 7.903e+04 -0.734 0.4728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79320 on 17 degrees of freedom
## Multiple R-squared: 0.7932, Adjusted R-squared: 0.6959
## F-statistic: 8.152 on 8 and 17 DF, p-value: 0.0001576
To assess the performance and reliability of our regression model, we will use the following metrics:
R-Squared: This statistic provides a measure of how well the observed outcomes are replicated by the model. It explains the proportion of variance in the dependent variable that’s explained by the independent variables. RMSE (Root Mean Square Error): It signifies the model’s prediction error. Essentially, it tells us how concentrated the data is around the line of best fit. AIC (Akaike Information Criterion): This measures the goodness of fit of our model. A lower AIC value suggests a better-fitting model.
# R-Squared
r_squared <- summary(model_poly_selected)$r.squared
# RMSE
rmse <- sqrt(mean(model_poly_selected$residuals^2))
# AIC
aic_value <- AIC(model_poly_selected)
list(R_Squared = r_squared, RMSE = rmse, AIC = aic_value)
## $R_Squared
## [1] 0.7932225
##
## $RMSE
## [1] 64141.98
##
## $AIC
## [1] 669.3652
Post-regression diagnostics are crucial for validating the assumptions of regression analysis. We’ll be performing:
VIF (Variance Inflation Factor): To check for multicollinearity among the predictor variables. A VIF greater than 5-10 usually suggests high multicollinearity. Breusch-Pagan Test: To check for heteroskedasticity. If the p-value is less than a chosen alpha level (commonly 0.05), heteroskedasticity is present. Histogram of Residuals: To check the normality of the residuals. Ideally, the residuals should be normally distributed around zero.
# VIF for multicollinearity
vif_values <- car::vif(model_poly_selected)
# Breusch-Pagan Test for heteroskedasticity
bp_test <- lmtest::bptest(model_poly_selected)
# Histogram for normalization of residuals
hist(model_poly_selected$residuals, main = "Histogram of Residuals", xlab = "Residuals")
# Print Results
# VIF for multicollinearity
cat("VIF Values:\n")
## VIF Values:
print(vif_values)
## periodo exportaciones_mx educacion
## 510.208408 33.271350 31.446184
## innovacion densidad_poblacion I(densidad_poblacion^2)
## 3.416102 3373.357013 3988.545670
## pib_per_capita crisis_financiera
## 49.045748 1.832373
cat("\n")
# Breusch-Pagan Test for heteroskedasticity
cat("Breusch-Pagan Test:\n")
## Breusch-Pagan Test:
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: model_poly_selected
## BP = 8.2055, df = 8, p-value = 0.4137
cat("\n")
This Poly-Logarithmic Multiple Regression Model omits the standard ‘densidad_poblacion’ to mitigate multicollinearity while retaining its polynomial effect and linearizes the ‘innovacion’ variable through a logarithm to address its non-linear relationship with the dependent variable, ied_mx. This approach aims to capture the intricate relationships between key predictors and the outcome more accurately.
# Regression Model 3
model_poly_log <- lm(ied_mx ~ periodo+exportaciones_mx+educacion+log(innovacion)+I(densidad_poblacion^2)+pib_per_capita+crisis_financiera,data=sp_data)
summary(model_poly_log)
##
## Call:
## lm(formula = ied_mx ~ periodo + exportaciones_mx + educacion +
## log(innovacion) + I(densidad_poblacion^2) + pib_per_capita +
## crisis_financiera, data = sp_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147647 -39315 11008 38973 108161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.642e+08 8.985e+07 -2.940 0.00875 **
## periodo 1.334e+05 4.571e+04 2.917 0.00920 **
## exportaciones_mx -3.909e-02 4.406e-01 -0.089 0.93029
## educacion 1.207e+05 9.404e+04 1.283 0.21578
## log(innovacion) 7.982e+05 2.861e+05 2.790 0.01208 *
## I(densidad_poblacion^2) -1.506e+03 5.221e+02 -2.885 0.00986 **
## pib_per_capita -9.559e+00 1.094e+01 -0.874 0.39373
## crisis_financiera1 -4.875e+04 7.458e+04 -0.654 0.52161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77550 on 18 degrees of freedom
## Multiple R-squared: 0.7908, Adjusted R-squared: 0.7094
## F-statistic: 9.718 on 7 and 18 DF, p-value: 5.181e-05
To assess the performance and reliability of our regression model, we will use the following metrics:
R-Squared: This statistic provides a measure of how well the observed outcomes are replicated by the model. It explains the proportion of variance in the dependent variable that’s explained by the independent variables. RMSE (Root Mean Square Error): It signifies the model’s prediction error. Essentially, it tells us how concentrated the data is around the line of best fit. AIC (Akaike Information Criterion): This measures the goodness of fit of our model. A lower AIC value suggests a better-fitting model.
# R-Squared
r_squared <- summary(model_poly_log)$r.squared
# RMSE
rmse <- sqrt(mean(model_poly_log$residuals^2))
# AIC
aic_value <- AIC(model_poly_log)
list(R_Squared = r_squared, RMSE = rmse, AIC = aic_value)
## $R_Squared
## [1] 0.7907686
##
## $RMSE
## [1] 64521.45
##
## $AIC
## [1] 667.672
Post-regression diagnostics are crucial for validating the assumptions of regression analysis. We’ll be performing:
VIF (Variance Inflation Factor): To check for multicollinearity among the predictor variables. A VIF greater than 5-10 usually suggests high multicollinearity. Breusch-Pagan Test: To check for heteroskedasticity. If the p-value is less than a chosen alpha level (commonly 0.05), heteroskedasticity is present. Histogram of Residuals: To check the normality of the residuals. Ideally, the residuals should be normally distributed around zero.
# VIF for multicollinearity
vif_values <- car::vif(model_poly_log)
# Breusch-Pagan Test for heteroskedasticity
bp_test <- lmtest::bptest(model_poly_log)
# Histogram for normalization of residuals
hist(model_poly_log$residuals, main = "Histogram of Residuals", xlab = "Residuals")
# Print Results
# VIF for multicollinearity
cat("VIF Values:\n")
## VIF Values:
print(vif_values)
## periodo exportaciones_mx educacion
## 508.278977 30.693045 16.762437
## log(innovacion) I(densidad_poblacion^2) pib_per_capita
## 2.283523 429.161622 39.065363
## crisis_financiera
## 1.707500
cat("\n")
# Breusch-Pagan Test for heteroskedasticity
cat("Breusch-Pagan Test:\n")
## Breusch-Pagan Test:
print(bp_test)
##
## studentized Breusch-Pagan test
##
## data: model_poly_log
## BP = 6.0693, df = 7, p-value = 0.5317
cat("\n")
In the analysis below, we’ll be visualizing the predicted values from Model 3 and evaluating its performance against the actual observed values.
predicted_values <- predict(model_poly_log)
# The list of predictor variables to analyze
predictor_variables <- c("periodo", "exportaciones_mx", "educacion", "log(innovacion) ", "I(densidad_poblacion^2)", "pib_per_capita", "crisis_financiera")
# Loop through each predictor variable
for (var in predictor_variables) {
effect_plot <- effect(var, model_poly_log)
plot(effect_plot, main=paste("Effect of", var))
}
Points represent individual observations.
The x-axis portrays the real values of the dependent variable (ied_mx), while the y-axis depicts the model’s predictions for those observations.
The red line represents perfect prediction, meaning the model’s prediction matches the actual value. The closer the points are to this line, the more accurate the model’s predictions for those observations. Observations that deviate significantly from this line indicate areas where the model might be under or over-predicting.
actual_values <- sp_data$ied_mx
plot(actual_values, predicted_values, main="Model 3: Actual vs. Predicted",
xlab="Actual", ylab="Predicted", pch=19, col="blue")
abline(a=0, b=1, col="red")
### Split the Data in Training Data vs Test Data
# Lets randomly split the data into train and test set
set.seed(123) ### sets the random seed for reproducibility of results
training.samples<-sp_data$ied_mx %>%
createDataPartition(p=0.75,list=FALSE) ### Lets consider 75% of the data to build a predictive model
train.data<-sp_data[training.samples, ] ### training data to fit the linear regression model
test.data<-sp_data[-training.samples, ] ### testing data to test the linear regression model
# LASSO regression via glmnet package can only take numerical observations. Then, the dataset is transformed to model.matrix() format.
# Independent variables
x<-model.matrix(ied_mx ~ periodo+exportaciones_mx+educacion+log(innovacion)+I(densidad_poblacion^2)+pib_per_capita+crisis_financiera,train.data)[,-1] ### OLS model specification
# x<-model.matrix(Weekly_Sales~.,train.data)[,-1] ### matrix of independent variables X's
y<-train.data$ied_mx ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset (datains) has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso<-cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
# Display the best lambda value
cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? the higher the value of ?? , the more penalization there is
## [1] 10541.76
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
# Display regression coefficients
coef(lassomodel)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.776829e+06
## periodo .
## exportaciones_mx 2.422802e-01
## educacion 1.427426e+04
## log(innovacion) 6.100960e+05
## I(densidad_poblacion^2) .
## pib_per_capita 3.359098e+00
## crisis_financiera1 .
# Make predictions on the test data
x.test<-model.matrix(ied_mx ~ periodo+exportaciones_mx+educacion+log(innovacion)+I(densidad_poblacion^2)+pib_per_capita+crisis_financiera,test.data)[,-1] ### OLS model specification
# x.test<-model.matrix(Weekly_Sales~.,test.data)[,-1]
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(lassopredictions, test.data$ied_mx),
Rsquare = R2(lassopredictions, test.data$ied_mx))
## RMSE Rsquare
## 1 129268.3 0.6128246
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])+ offset_x
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
# Find the best lambda using cross-validation
set.seed(123) # x: independent variables | y: dependent variable
cv.ridge <- cv.glmnet(x,y,alpha=0.1) # alpha = 0 for RIDGE
# Display the best lambda value
cv.ridge$lambda.min # lambda: a numeric value defining the amount of shrinkage. Why min? the higher the value of ?? , the more penalization there is
## [1] 45632.75
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)
# Display regression coefficients
coef(ridgemodel)
## 8 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -5.564743e+06
## periodo 1.962827e+03
## exportaciones_mx 1.249883e-01
## educacion 2.816894e+04
## log(innovacion) 5.351597e+05
## I(densidad_poblacion^2) 1.561430e+01
## pib_per_capita 2.778428e+00
## crisis_financiera1 -1.992969e+04
# Make predictions on the test data
x.test<-model.matrix(ied_mx ~ periodo+exportaciones_mx+educacion+log(innovacion)+I(densidad_poblacion^2)+pib_per_capita+crisis_financiera,test.data)[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(ridgepredictions, test.data$ied_mx),
Rsquare = R2(ridgepredictions, test.data$ied_mx)
)
## RMSE Rsquare
## 1 122548.3 0.682736
### visualizing ridge regression results
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)
tab <- matrix(c(17587,6324,6322,0.77,0.71,0.71), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Lasso','Ridge')
tab <- as.table(tab)