Nearshoring, also known as “the relocation of business projects,” consists of optimizing production supply chains by relocating production centers and companies from the country of origin to a country where input costs (salaries, electricity, etc.) are lower and where the level of output consumption is higher (El Economista, 2023).
It is realistic for Mexico to attract Nearshoring investment projects: the country shares a 3,152-kilometer border with the United States and has a free trade agreement with the United States and Canada (T-MEC), which facilitates trade among the three countries. Another reason why Mexico is a strong candidate for Nearshoring investment projects is its human capital, since Mexican labor is qualified, low-cost and competitive.
Predictive Analytics is a field of data analytics used to make predictions about future outcomes from historical data, algorithms and data mining techniques. It relies on statistical and machine learning techniques for making predictions, such as linear and logistic regression models.
Regression analysis is a fundamental technique in Predictive Analytics, since “it is helpful to understand the relationship between one or more independent variables (x) and the variable which is going to be predicted, known as the dependent variable (y)” (IBM, 2022). There are different types of regression, such as simple linear regression, multiple linear regression and non-linear regression.
Regression analysis is a suitable strategy for evaluating how appropriate investment projects would be and for making business decisions, since it takes into account several variables that directly affect foreign investment.
A regression analysis to predict the outcomes of Nearshoring investment projects in Mexico would involve several relevant variables that explain the Flows of Foreign Direct Investment, such as labor costs, employment, education and GDP. A well-specified linear regression model can indicate whether conditions are favorable for Nearshoring investments, which in turn can point to potential improvements in the industrial models of the companies willing to invest and to an increase in their income. In addition, a regression analysis can facilitate the decision-making process regarding foreign investments.
Nearshoring reconfigures supply chains and relocates companies according to the position of the end market, and several factors make Mexico a strong candidate for foreign investment. The problem situation therefore consists of determining and analyzing, through various regression models, the relationship between the variable of interest, the Flow of Foreign Direct Investment, and the variables that can predict its behavior.
To address the problem situation, it is necessary to identify the correlation between the independent variables and the dependent variable and to build various regression models. Subsequently, the model that best explains the performance of Foreign Direct Investment based on the selected explanatory variables must be chosen, in order to obtain meaningful insights for decision-making on foreign companies' investment projects.
# Loading libraries
library(foreign)
library(dplyr) # data manipulation
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
library(Hmisc) # several useful functions for data analysis
library(psych) # functions for multivariate analysis
library(naniar) # summaries and visualization of missing values NAs
library(corrplot) # correlation plots
library(jtools) # presentation of regression analysis
library(lmtest) # diagnostic checks - linear regression analysis
library(car) # diagnostic checks - linear regression analysis
library(olsrr) # diagnostic checks - linear regression analysis
library(naniar) # identifying missing values
library(stargazer) # create publication quality tables
library(effects) # displays for linear and other regression models
library(tidyverse) # collection of R packages designed for data science
library(caret) # Classification and Regression Training
library(glmnet) # methods for prediction and plotting, and functions for cross-validation
library(readxl) # read excel
library(plotly)
library(GGally)
library(gridExtra)
library(cowplot)
# Loading the database
data <- read_excel("/cloud/project/SP_DataMexicoAtractiveness_alumn-VF.xlsx",sheet=1,range="A6:R32",na="-")
head(data)
## # A tibble: 6 × 18
## Año IED_Flujos IED_Flujos_MXN Exportaciones Exportaciones_MXN Empleo
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 294298. 9088. 220201. NA
## 2 1998 8374. 210849. 9875. 248659. NA
## 3 1999 13960. 299834. 10990. 236039. NA
## 4 2000 18249. 362638. 12483. 248061. 97.8
## 5 2001 30057. 546448. 11300. 205445. 97.4
## 6 2002 24099. 468391. 11923. 231737. 97.7
## # ℹ 12 more variables: Educación <dbl>, Salario_Diario <dbl>, Innovación <dbl>,
## # Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>, Tipo_de_Cambio <dbl>,
## # Densidad_Carretera <dbl>, Densidad_Población <dbl>, CO2_Emisiones <dbl>,
## # PIB_Per_Cápita <dbl>, INPC <dbl>, Crisis_Financiera <dbl>
# Get the names of the variables
names(data)
## [1] "Año" "IED_Flujos" "IED_Flujos_MXN"
## [4] "Exportaciones" "Exportaciones_MXN" "Empleo"
## [7] "Educación" "Salario_Diario" "Innovación"
## [10] "Inseguridad_Robo" "Inseguridad_Homicidio" "Tipo_de_Cambio"
## [13] "Densidad_Carretera" "Densidad_Población" "CO2_Emisiones"
## [16] "PIB_Per_Cápita" "INPC" "Crisis_Financiera"
# Display structure of the dataset
str(data)
## tibble [26 × 18] (S3: tbl_df/tbl/data.frame)
## $ Año : num [1:26] 1997 1998 1999 2000 2001 ...
## $ IED_Flujos : num [1:26] 12146 8374 13960 18249 30057 ...
## $ IED_Flujos_MXN : num [1:26] 294298 210849 299834 362638 546448 ...
## $ Exportaciones : num [1:26] 9088 9875 10990 12483 11300 ...
## $ Exportaciones_MXN : num [1:26] 220201 248659 236039 248061 205445 ...
## $ Empleo : num [1:26] NA NA NA 97.8 97.4 ...
## $ Educación : num [1:26] 7.2 7.31 7.43 7.56 7.68 ...
## $ Salario_Diario : num [1:26] 24.3 31.9 31.9 35.1 37.6 ...
## $ Innovación : num [1:26] 11.3 11.4 12.5 13.1 13.5 ...
## $ Inseguridad_Robo : num [1:26] 267 315 273 217 215 ...
## $ Inseguridad_Homicidio: num [1:26] 14.6 14.3 12.6 10.9 10.2 ...
## $ Tipo_de_Cambio : num [1:26] 8.06 9.94 9.52 9.6 9.17 ...
## $ Densidad_Carretera : num [1:26] 0.0521 0.053 0.055 0.0552 0.0565 ...
## $ Densidad_Población : num [1:26] 47.4 48.8 49.5 50.6 51.3 ...
## $ CO2_Emisiones : num [1:26] 3.68 3.85 3.69 3.87 3.81 ...
## $ PIB_Per_Cápita : num [1:26] 127570 126739 129165 130875 128083 ...
## $ INPC : num [1:26] 33.3 39.5 44.3 48.3 50.4 ...
## $ Crisis_Financiera : num [1:26] 0 0 0 0 0 0 0 0 0 0 ...
dim(data)
## [1] 26 18
# Identifying missing values
colSums(is.na(data))
## Año IED_Flujos IED_Flujos_MXN
## 0 0 0
## Exportaciones Exportaciones_MXN Empleo
## 0 0 3
## Educación Salario_Diario Innovación
## 3 0 2
## Inseguridad_Robo Inseguridad_Homicidio Tipo_de_Cambio
## 0 1 0
## Densidad_Carretera Densidad_Población CO2_Emisiones
## 0 0 3
## PIB_Per_Cápita INPC Crisis_Financiera
## 0 0 0
# Transforming variables
# 1. Drop the year column (a time index rather than an explanatory variable)
newData<- subset(data, select = - Año)
print(newData)
## # A tibble: 26 × 17
## IED_Flujos IED_Flujos_MXN Exportaciones Exportaciones_MXN Empleo Educación
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12146. 294298. 9088. 220201. NA 7.20
## 2 8374. 210849. 9875. 248659. NA 7.31
## 3 13960. 299834. 10990. 236039. NA 7.43
## 4 18249. 362638. 12483. 248061. 97.8 7.56
## 5 30057. 546448. 11300. 205445. 97.4 7.68
## 6 24099. 468391. 11923. 231737. 97.7 7.80
## 7 18250. 368747. 13156 265822. 97.1 7.93
## 8 25016. 481300. 13573. 261147. 96.5 8.04
## 9 25796. 458581. 16466. 292718. 97.2 8.14
## 10 21233. 368329. 17486. 303335. 96.5 8.26
## # ℹ 16 more rows
## # ℹ 11 more variables: Salario_Diario <dbl>, Innovación <dbl>,
## # Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>, Tipo_de_Cambio <dbl>,
## # Densidad_Carretera <dbl>, Densidad_Población <dbl>, CO2_Emisiones <dbl>,
## # PIB_Per_Cápita <dbl>, INPC <dbl>, Crisis_Financiera <dbl>
# 2. Replace missing values with the median value
data <- data %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
print(data)
## # A tibble: 26 × 18
## Año IED_Flujos IED_Flujos_MXN Exportaciones Exportaciones_MXN Empleo
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 294298. 9088. 220201. 96.5
## 2 1998 8374. 210849. 9875. 248659. 96.5
## 3 1999 13960. 299834. 10990. 236039. 96.5
## 4 2000 18249. 362638. 12483. 248061. 97.8
## 5 2001 30057. 546448. 11300. 205445. 97.4
## 6 2002 24099. 468391. 11923. 231737. 97.7
## 7 2003 18250. 368747. 13156 265822. 97.1
## 8 2004 25016. 481300. 13573. 261147. 96.5
## 9 2005 25796. 458581. 16466. 292718. 97.2
## 10 2006 21233. 368329. 17486. 303335. 96.5
## # ℹ 16 more rows
## # ℹ 12 more variables: Educación <dbl>, Salario_Diario <dbl>, Innovación <dbl>,
## # Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>, Tipo_de_Cambio <dbl>,
## # Densidad_Carretera <dbl>, Densidad_Población <dbl>, CO2_Emisiones <dbl>,
## # PIB_Per_Cápita <dbl>, INPC <dbl>, Crisis_Financiera <dbl>
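The median-imputation step can be illustrated on a small made-up table. The sketch below uses base R only; the helper `impute_median` and the toy values are assumptions introduced for illustration and are not taken from the dataset. Note that in the real data the missing values of “Empleo” fall in the first three years, so filling them with the overall median ignores any time trend; this may be worth comparing against other imputation choices.

```r
# Toy data (made up for illustration; not the study's values)
toy <- data.frame(
  year   = 2000:2005,
  employ = c(NA, NA, 97.1, 96.5, 97.2, 96.8),
  wage   = c(35.1, 37.6, NA, 41.5, 43.3, 45.2)
)

# Hypothetical helper: replace the NAs of a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every column while keeping the data.frame structure
toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_median)

colSums(is.na(toy_imputed))  # count of remaining missing values per column
```

Every imputed cell receives the same value within a column, which slightly reduces the variance of that column.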
# Descriptive Statistics
summary(data)
## Año IED_Flujos IED_Flujos_MXN Exportaciones
## Min. :1997 Min. : 8374 Min. :210849 Min. : 9088
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:368434 1st Qu.:13260
## Median :2010 Median :27698 Median :497116 Median :21188
## Mean :2010 Mean :26770 Mean :493603 Mean :23601
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:578789 3rd Qu.:31601
## Max. :2022 Max. :48354 Max. :754160 Max. :46478
## Exportaciones_MXN Empleo Educación Salario_Diario
## Min. :205445 Min. :95.06 Min. :7.198 Min. : 24.30
## 1st Qu.:262316 1st Qu.:96.09 1st Qu.:7.955 1st Qu.: 41.97
## Median :366363 Median :96.53 Median :8.457 Median : 54.48
## Mean :433858 Mean :96.48 Mean :8.428 Mean : 65.16
## 3rd Qu.:632411 3rd Qu.:97.01 3rd Qu.:8.929 3rd Qu.: 72.31
## Max. :785503 Max. :97.83 Max. :9.579 Max. :172.87
## Innovación Inseguridad_Robo Inseguridad_Homicidio Tipo_de_Cambio
## Min. :11.28 Min. :120.5 Min. : 8.037 Min. : 8.064
## 1st Qu.:12.60 1st Qu.:148.3 1st Qu.:10.402 1st Qu.:10.752
## Median :13.09 Median :181.8 Median :16.928 Median :13.016
## Mean :13.10 Mean :185.4 Mean :17.278 Mean :13.910
## 3rd Qu.:13.60 3rd Qu.:209.9 3rd Qu.:22.346 3rd Qu.:18.489
## Max. :15.11 Max. :314.8 Max. :29.592 Max. :20.664
## Densidad_Carretera Densidad_Población CO2_Emisiones PIB_Per_Cápita
## Min. :0.05205 Min. :47.44 Min. :3.592 Min. :126739
## 1st Qu.:0.05954 1st Qu.:52.78 1st Qu.:3.843 1st Qu.:130964
## Median :0.06989 Median :58.09 Median :3.925 Median :136845
## Mean :0.07106 Mean :57.33 Mean :3.944 Mean :138550
## 3rd Qu.:0.08275 3rd Qu.:61.39 3rd Qu.:4.088 3rd Qu.:146148
## Max. :0.09020 Max. :65.60 Max. :4.221 Max. :153236
## INPC Crisis_Financiera
## Min. : 33.28 Min. :0.00000
## 1st Qu.: 56.15 1st Qu.:0.00000
## Median : 73.35 Median :0.00000
## Mean : 75.17 Mean :0.07692
## 3rd Qu.: 91.29 3rd Qu.:0.00000
## Max. :126.48 Max. :1.00000
The descriptive statistics show a wide range between the lowest and highest values of “Salario_Diario”, “IED_Flujos_MXN” and “Exportaciones_MXN”. For “Salario_Diario” and “Exportaciones_MXN” the mean is clearly above the median, which points to right-skewed distributions in which the mean may not be a reliable measure of central tendency; for “IED_Flujos_MXN” the mean and median are close. By the 1.5 × IQR rule, only “Salario_Diario” has values beyond the upper fence (a maximum of 172.87 against a fence of about 117.8), so it is the clearest candidate for outliers. The variable “Año” is a time index, so its descriptive statistics are not informative.
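One common way to check for outliers from quartiles alone is the 1.5 × IQR rule. The sketch below applies it to the quartiles and extremes printed by `summary(data)` above; the data frame `stats` and its column names are introduced here only for illustration, and the rule is one convention among several.

```r
# Quartiles and extremes copied from the summary() output above (rounded as printed)
stats <- data.frame(
  variable = c("Salario_Diario", "IED_Flujos_MXN", "Exportaciones_MXN"),
  min = c(24.30, 210849, 205445),
  q1  = c(41.97, 368434, 262316),
  q3  = c(72.31, 578789, 632411),
  max = c(172.87, 754160, 785503)
)

# Fences at 1.5 interquartile ranges beyond the first and third quartiles
iqr <- stats$q3 - stats$q1
stats$lower_fence <- stats$q1 - 1.5 * iqr
stats$upper_fence <- stats$q3 + 1.5 * iqr

# A variable has at least one outlier if its minimum or maximum lies beyond a fence
stats$has_outlier <- stats$min < stats$lower_fence | stats$max > stats$upper_fence
stats[, c("variable", "lower_fence", "upper_fence", "has_outlier")]
```

The `has_outlier` column marks the variables whose minimum or maximum falls outside the fences; a boxplot of the raw data would show the same information observation by observation.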
# Descriptive Statistics for the dependent variable
summary(data$IED_Flujos_MXN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 210849 368434 497116 493603 578789 754160
For the response variable (IED_Flujos_MXN), the mean (493603) is slightly lower than the median (497116), so the distribution is approximately symmetric, with at most a very slight left skew.
# Measures of Dispersion
# Range, standard deviation, skewness and kurtosis
describe(data)
## vars n mean sd median trimmed mad
## Año 1 26 2009.50 7.65 2009.50 2009.50 9.64
## IED_Flujos 2 26 26770.12 8770.68 27697.59 26860.64 9079.25
## IED_Flujos_MXN 3 26 493603.16 143824.44 497115.78 494279.12 183295.95
## Exportaciones 4 26 23600.91 11346.13 21187.89 22875.85 13370.71
## Exportaciones_MXN 5 26 433858.27 194996.37 366363.33 423625.39 184306.64
## Empleo 6 26 96.48 0.72 96.53 96.48 0.75
## Educación 7 26 8.43 0.68 8.46 8.44 0.76
## Salario_Diario 8 26 65.16 35.85 54.48 60.16 22.51
## Innovación 9 26 13.10 1.07 13.09 13.09 0.78
## Inseguridad_Robo 10 26 185.42 47.67 181.83 181.16 47.06
## Inseguridad_Homicidio 11 26 17.28 7.12 16.93 16.98 9.31
## Tipo_de_Cambio 12 26 13.91 4.15 13.02 13.78 4.25
## Densidad_Carretera 13 26 0.07 0.01 0.07 0.07 0.02
## Densidad_Población 14 26 57.33 5.41 58.09 57.44 6.68
## CO2_Emisiones 15 26 3.94 0.18 3.93 3.95 0.17
## PIB_Per_Cápita 16 26 138550.10 8861.10 136845.30 138255.64 11080.42
## INPC 17 26 75.17 24.81 73.35 74.45 27.14
## Crisis_Financiera 18 26 0.08 0.27 0.00 0.00 0.00
## min max range skew kurtosis se
## Año 1997.00 2022.00 25.00 0.00 -1.34 1.50
## IED_Flujos 8373.50 48354.42 39980.92 -0.02 -0.08 1720.07
## IED_Flujos_MXN 210849.08 754159.85 543310.77 -0.01 -1.00 28206.29
## Exportaciones 9087.62 46477.58 37389.97 0.46 -1.08 2225.16
## Exportaciones_MXN 205445.01 785503.27 580058.25 0.48 -1.40 38241.93
## Empleo 95.06 97.83 2.77 -0.17 -0.73 0.14
## Educación 7.20 9.58 2.38 -0.12 -1.06 0.13
## Salario_Diario 24.30 172.87 148.57 1.43 1.44 7.03
## Innovación 11.28 15.11 3.83 0.13 -0.70 0.21
## Inseguridad_Robo 120.49 314.78 194.28 0.89 0.30 9.35
## Inseguridad_Homicidio 8.04 29.59 21.56 0.38 -1.28 1.40
## Tipo_de_Cambio 8.06 20.66 12.60 0.44 -1.39 0.81
## Densidad_Carretera 0.05 0.09 0.04 0.18 -1.50 0.00
## Densidad_Población 47.44 65.60 18.16 -0.19 -1.24 1.06
## CO2_Emisiones 3.59 4.22 0.63 -0.11 -0.95 0.04
## PIB_Per_Cápita 126738.75 153235.73 26496.98 0.28 -1.41 1737.81
## INPC 33.28 126.48 93.20 0.26 -0.95 4.87
## Crisis_Financiera 0.00 1.00 1.00 2.99 7.25 0.05
# 1. Correlation Plot
data<- select(data, - IED_Flujos, - Exportaciones)
print(data)
## # A tibble: 26 × 16
## Año IED_Flujos_MXN Exportaciones_MXN Empleo Educación Salario_Diario
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 294298. 220201. 96.5 7.20 24.3
## 2 1998 210849. 248659. 96.5 7.31 31.9
## 3 1999 299834. 236039. 96.5 7.43 31.9
## 4 2000 362638. 248061. 97.8 7.56 35.1
## 5 2001 546448. 205445. 97.4 7.68 37.6
## 6 2002 468391. 231737. 97.7 7.80 39.7
## 7 2003 368747. 265822. 97.1 7.93 41.5
## 8 2004 481300. 261147. 96.5 8.04 43.3
## 9 2005 458581. 292718. 97.2 8.14 45.2
## 10 2006 368329. 303335. 96.5 8.26 47.0
## # ℹ 16 more rows
## # ℹ 10 more variables: Innovación <dbl>, Inseguridad_Robo <dbl>,
## # Inseguridad_Homicidio <dbl>, Tipo_de_Cambio <dbl>,
## # Densidad_Carretera <dbl>, Densidad_Población <dbl>, CO2_Emisiones <dbl>,
## # PIB_Per_Cápita <dbl>, INPC <dbl>, Crisis_Financiera <dbl>
corrplot(cor(data),method = "circle",
type = "upper", order = "hclust", addCoef.col = "black",
tl.col = "black", tl.srt = 45, diag = FALSE, number.cex = 0.4)
There is a marked correlation between the dependent variable
“IED_Flujos_MXN” and the variables “Inseguridad_Robo”, “Salario_Diario”,
“Tipo_de_Cambio”, “Exportaciones_MXN”, “INPC”, “Densidad_Carretera”,
“Densidad_Población” and “Innovación”. The correlation between
“Inseguridad_Robo” and the dependent variable is negative.
The Flow of Foreign Direct Investment (“IED_Flujos_MXN”) therefore tends to be higher when exports, daily salaries, the exchange rate, innovation and population density are higher, and when robbery insecurity is lower. These are associations and do not by themselves establish causal effects.
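Whether a correlation read off the plot is statistically significant can be checked with `cor.test()`. The sketch below runs it on simulated data; the variables `exports` and `fdi` and all their values are made up for illustration and do not come from the dataset.

```r
set.seed(1)
n <- 26  # same number of yearly observations as the dataset

# Simulated series (assumption: illustrative only, not the study's data)
exports <- seq(200, 800, length.out = n) + rnorm(n, sd = 40)
fdi     <- 250 + 0.4 * exports + rnorm(n, sd = 40)

# Pearson correlation with a test of H0: rho = 0
ct <- cor.test(fdi, exports, method = "pearson")
ct$estimate   # sample correlation r
ct$p.value    # two-sided p-value
ct$conf.int   # 95% confidence interval for rho
```

With annual series that all trend upward over time, high pairwise correlations are common even between unrelated variables, so such tests are best read as descriptive.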
# 2.1 Histogram of dependent variable
ggplot(data = data, aes(x = IED_Flujos_MXN)) +
geom_histogram(fill = "blue", color="black" )+
labs(x = "IED_Flujos_MXN (millions of MXN)") +
ggtitle("Histogram of Foreign Direct Investment Flows")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram suggests that there may be outliers. The highest frequency
of the Flow of Foreign Direct Investment lies between roughly 500,000
and 600,000 million pesos. Apart from that peak, the distribution looks
roughly uniform, since the different ranges of IED_Flujos_MXN have
similar frequencies.
# 2.2 Logarithmic Histogram
ggplot(data = data, aes(x = log(IED_Flujos_MXN))) +
geom_histogram( fill = "blue", color = "black",bins=12) +
labs(title = "Logarithmic Histogram",
x = "Log(Values) of IED_Flujos_MXN",
y = "Frequency")
Because the dependent variable covers a very wide range, a logarithmic
histogram was made. The log-transformed data look closer to a normal
distribution, with a slight left skew: the highest values of the Flow of
Foreign Direct Investment are more frequent than the lowest values.
Possible outliers are still present.
# 2.3 Histograms of independent variables
data %>%
gather(key, val,- Año) %>%
ggplot(aes(x=val)) +
geom_histogram(fill = "darkblue", color="black" ) +
facet_wrap(~key, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# 3. Flow of Foreign Direct Investment x Year
ggplot(data, aes(x = Año, y = log(IED_Flujos_MXN))) +
geom_line(color="blue", linewidth=1.3) +
labs(x = "Year", y = "log(IED_Flujos_MXN)") +
ggtitle("Log of the Flow of Foreign Direct Investment (MXN)")
This graph shows an increase in the Flow of Foreign Direct Investment
over the years, with a drop between 2005 and 2010 that may be due to the
financial crisis of 2008 and 2009. The peaks show that the increase does
not follow a stable pattern: the data fluctuate around the general
trend.
# 4 Scatter Plot with "Exportaciones_MXN" and dependent variable
ggplot(data, aes(x = log(Exportaciones_MXN) , y = log(IED_Flujos_MXN))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Exports x Flow of Foreign Direct Investment (log scale)", x = "log(Exportaciones_MXN)", y = "log(IED_Flujos_MXN)")
## `geom_smooth()` using formula = 'y ~ x'
The graph shows that as the value of exports increases, the Flow of
Foreign Direct Investment increases as well, so these variables are
positively correlated. This may indicate that the variable
“Exportaciones_MXN” can help explain the behavior of the variable
“IED_Flujos_MXN”.
Least Squares Estimation (LSE)
H0: A higher average of years of education (“Educación”) does not have a negative effect on the Flow of Foreign Investment in millions of pesos.
H1: A higher average of years of education (“Educación”) has a negative effect on the Flow of Foreign Investment in millions of pesos.
H0: A high rate of the employed economically active population (“Empleo”) does not have a negative effect on the Flow of Foreign Investment in millions of pesos.
H1: A high rate of the employed economically active population (“Empleo”) has a negative effect on the Flow of Foreign Investment in millions of pesos.
H0: Minimum wage in pesos per day (“Salario_Diario”) does not have a negative effect on the Flow of Foreign Investment in millions of pesos.
H1: Minimum wage in pesos per day (“Salario_Diario”) has a negative effect on the Flow of Foreign Investment in millions of pesos.
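These hypotheses are directional, whereas `summary()` of an `lm` fit reports two-sided p-values. A minimal sketch of how a one-sided p-value for “the coefficient is negative” could be obtained from the t statistic is shown below; the data are simulated and the names `wage` and `log_fdi` are assumptions made for illustration.

```r
set.seed(42)
n <- 26

# Simulated predictor and response (illustrative only, not the study's data)
wage    <- runif(n, 25, 170)
log_fdi <- 13 - 0.004 * wage + rnorm(n, sd = 0.2)

fit <- lm(log_fdi ~ wage)
est <- coef(summary(fit))["wage", ]

# One-sided p-value for H1: beta < 0 is the lower tail of the t distribution
p_one_sided <- unname(pt(est["t value"], df = df.residual(fit), lower.tail = TRUE))
p_two_sided <- unname(est["Pr(>|t|)"])

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

When the estimated coefficient is negative, the one-sided p-value is half of the two-sided one; when it is positive, the one-sided p-value exceeds 0.5 and H0 cannot be rejected in favor of a negative effect.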
# 1. Multiple Linear Regression Model with log transformation using the independent variables (all except PIB_Per_Cápita, INPC and Crisis_Financiera)
model1<-lm(log(IED_Flujos_MXN) ~ +Inseguridad_Robo+Salario_Diario+Exportaciones_MXN+Empleo+Tipo_de_Cambio+ Densidad_Población+CO2_Emisiones+Innovación+Educación+Densidad_Carretera+Inseguridad_Homicidio,data=data)
summary(model1)
##
## Call:
## lm(formula = log(IED_Flujos_MXN) ~ +Inseguridad_Robo + Salario_Diario +
## Exportaciones_MXN + Empleo + Tipo_de_Cambio + Densidad_Población +
## CO2_Emisiones + Innovación + Educación + Densidad_Carretera +
## Inseguridad_Homicidio, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30298 -0.08985 -0.00425 0.08434 0.41901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.208e+00 1.240e+01 0.420 0.681
## Inseguridad_Robo -1.679e-03 2.526e-03 -0.664 0.517
## Salario_Diario -2.953e-03 9.814e-03 -0.301 0.768
## Exportaciones_MXN -1.893e-06 2.050e-06 -0.923 0.371
## Empleo 6.365e-02 1.053e-01 0.605 0.555
## Tipo_de_Cambio 3.659e-02 6.002e-02 0.610 0.552
## Densidad_Población -2.287e-03 1.571e-01 -0.015 0.989
## CO2_Emisiones -2.401e-01 6.381e-01 -0.376 0.712
## Innovación 1.177e-01 7.274e-02 1.618 0.128
## Educación -6.645e-02 4.719e-01 -0.141 0.890
## Densidad_Carretera 3.570e+01 5.324e+01 0.670 0.513
## Inseguridad_Homicidio 5.234e-03 2.660e-02 0.197 0.847
##
## Residual standard error: 0.2124 on 14 degrees of freedom
## Multiple R-squared: 0.7492, Adjusted R-squared: 0.5522
## F-statistic: 3.802 on 11 and 14 DF, p-value: 0.01072
# VIF
vif(model1)
## Inseguridad_Robo Salario_Diario Exportaciones_MXN
## 8.034723 68.593385 88.537762
## Empleo Tipo_de_Cambio Densidad_Población
## 3.189190 34.376981 400.553403
## CO2_Emisiones Innovación Educación
## 7.351086 3.372922 56.469750
## Densidad_Carretera Inseguridad_Homicidio
## 281.223818 19.855219
# BP test
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 9.8682, df = 11, p-value = 0.5423
# AIC
cat("AIC:", AIC(model1),"\n")
## AIC: 3.127122
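selected_model1<-model1
# RMSE on the log scale: the fitted values are logs, so they are compared with the logged response
cat("RMSE (log scale):",RMSE(fitted(selected_model1),model.response(model.frame(selected_model1))))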
# Normality test
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.9651, p-value = 0.5016
# Residuals histogram
hist(model1$residuals)
“Densidad_Carretera” has the largest coefficient estimate (35.70), but it is not statistically significant (p = 0.513); in fact, no predictor in this model is significant at the 10% level.
Multiple R-squared is 0.7492, which suggests that the predictors explain a large share of the variability in the dependent variable; however, with 11 predictors and only 26 observations, a high Multiple R-squared can also be a sign of overfitting.
Adjusted R-squared is 0.5522, noticeably lower than the Multiple R-squared, which is consistent with the model containing predictors that add little explanatory power.
The AIC is 3.127122, the highest of the four models, which indicates the weakest trade-off between fit and complexity among them.
The variance inflation factors (VIF) are above 10 for six of the eleven variables (for example, 400.55 for “Densidad_Población” and 281.22 for “Densidad_Carretera”), which indicates strong multicollinearity among the independent variables.
The p-value of the BP test is 0.5423, so there is not enough evidence to affirm that there is heteroscedasticity in the model.
The p-value of the Shapiro-Wilk test on the residuals is 0.5016, so there is not enough evidence to reject normality of the residuals.
The residual standard error (RSE) of model 1 is 0.2124 on the log scale, the highest of the four models.
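The VIF values reported by `vif()` can be reproduced by hand, which helps to see what they measure: for each predictor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below uses simulated predictors; `vif_by_hand` is a hypothetical helper, and `x1`, `x2`, `x3` are made-up variables rather than columns of the dataset.

```r
set.seed(7)
n  <- 26
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear with it
x3 <- rnorm(n)                  # unrelated predictor

# Hypothetical helper: VIF of `target` given a data.frame of the other predictors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
vifs
```

The VIF depends only on the predictors, not on the response, so dropping or combining near-duplicate predictors is the usual way to bring it down.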
# 2. Multiple Linear Regression Model with log transformation using independent variables with the highest correlation
model2<-lm(log(IED_Flujos_MXN) ~ +Inseguridad_Robo+Salario_Diario+Exportaciones_MXN+Tipo_de_Cambio+ Densidad_Población+Innovación+Densidad_Carretera,data=data)
summary(model2)
##
## Call:
## lm(formula = log(IED_Flujos_MXN) ~ +Inseguridad_Robo + Salario_Diario +
## Exportaciones_MXN + Tipo_de_Cambio + Densidad_Población +
## Innovación + Densidad_Carretera, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33207 -0.06177 -0.00462 0.08113 0.38621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.257e+01 1.769e+00 7.108 1.26e-06 ***
## Inseguridad_Robo -2.088e-03 1.410e-03 -1.480 0.1561
## Salario_Diario -6.712e-04 2.710e-03 -0.248 0.8072
## Exportaciones_MXN -2.045e-06 1.515e-06 -1.350 0.1936
## Tipo_de_Cambio 4.004e-02 5.205e-02 0.769 0.4517
## Densidad_Población -7.703e-02 5.291e-02 -1.456 0.1627
## Innovación 1.090e-01 4.466e-02 2.441 0.0252 *
## Densidad_Carretera 5.969e+01 2.946e+01 2.026 0.0578 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1925 on 18 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.6323
## F-statistic: 7.143 on 7 and 18 DF, p-value: 0.0003675
# VIF
vif(model2)
## Inseguridad_Robo Salario_Diario Exportaciones_MXN Tipo_de_Cambio
## 3.050979 6.372765 58.882327 31.481847
## Densidad_Población Innovación Densidad_Carretera
## 55.329421 1.548690 104.874788
# BP test
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 9.2562, df = 7, p-value = 0.2348
# AIC
cat("AIC:", AIC(model2),"\n")
## AIC: -3.466751
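selected_model2<-model2
# RMSE on the log scale: the fitted values are logs, so they are compared with the logged response
cat("RMSE (log scale):",RMSE(fitted(selected_model2),model.response(model.frame(selected_model2))))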
# Normality test
shapiro.test(model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.96943, p-value = 0.6086
# Residuals histogram
hist(model2$residuals)
“Innovación” is the only predictor that is statistically significant at the 5% level (p = 0.0252); “Densidad_Carretera” is significant only at the 10% level (p = 0.0578).
Multiple R-squared is 0.7353, which suggests that the predictors explain a large share of the variability in the dependent variable; however, with only 26 observations, a high Multiple R-squared can also be a sign of overfitting.
Adjusted R-squared is 0.6323, higher than in model 1 (0.5522).
The AIC is -3.466751, lower than that of model 1 (3.127122), which indicates a better trade-off between fit and complexity.
The variance inflation factors (VIF) of “Densidad_Carretera” (104.87), “Exportaciones_MXN” (58.88), “Densidad_Población” (55.33) and “Tipo_de_Cambio” (31.48) are high, which indicates the presence of multicollinearity between the variables.
The p-value of the BP test is 0.2348, so there is not enough evidence to affirm that there is heteroscedasticity in the model.
The p-value of the Shapiro-Wilk test on the residuals is 0.6086, so there is not enough evidence to reject normality of the residuals.
The residual standard error (RSE) of model 2 is 0.1925 on the log scale, lower than that of model 1 (0.2124).
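To make the AIC figures easier to interpret, the sketch below recomputes the AIC of a linear model from its residual sum of squares and compares it with `AIC()`. The data are simulated and the names `x`, `y` and `aic_by_hand` are introduced only for illustration.

```r
set.seed(3)
n <- 26
x <- rnorm(n)                          # simulated predictor
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2)  # simulated response
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)
k   <- length(coef(fit)) + 1           # regression coefficients plus the error variance

# Gaussian log-likelihood of a linear model evaluated at the ML estimates
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic_by_hand <- -2 * loglik + 2 * k

c(by_hand = aic_by_hand, built_in = AIC(fit))
```

Since the AIC is built from the log-likelihood, its sign and absolute size carry no meaning on their own; only differences between models fitted to the same response and the same observations are informative.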
# 3. Polynomial-Multiple Linear Regression Model with Logarithmic Transformation using variables with the highest correlation
model3<-lm(log(IED_Flujos_MXN) ~ +Inseguridad_Robo+Salario_Diario+Empleo+Innovación+I(Educación^2),data=data)
summary(model3)
##
## Call:
## lm(formula = log(IED_Flujos_MXN) ~ +Inseguridad_Robo + Salario_Diario +
## Empleo + Innovación + I(Educación^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32378 -0.09716 0.02006 0.10083 0.39272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6137715 5.6428850 0.463 0.64822
## Inseguridad_Robo -0.0009267 0.0010699 -0.866 0.39667
## Salario_Diario 0.0011399 0.0015270 0.746 0.46408
## Empleo 0.0864740 0.0584396 1.480 0.15453
## Innovación 0.0851125 0.0458988 1.854 0.07849 .
## I(Educación^2) 0.0152487 0.0048804 3.124 0.00534 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1886 on 20 degrees of freedom
## Multiple R-squared: 0.7174, Adjusted R-squared: 0.6468
## F-statistic: 10.16 on 5 and 20 DF, p-value: 5.901e-05
# VIF
vif(model3)
## Inseguridad_Robo Salario_Diario Empleo Innovación
## 1.827292 2.105407 1.246478 1.702636
## I(Educación^2)
## 2.158051
# BP test
bptest(model3)
##
## studentized Breusch-Pagan test
##
## data: model3
## BP = 2.8935, df = 5, p-value = 0.7164
# AIC
cat("AIC:", AIC(model3),"\n")
## AIC: -5.768831
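selected_model3<-model3
# RMSE on the log scale: the fitted values are logs, so they are compared with the logged response
cat("RMSE (log scale):",RMSE(fitted(selected_model3),model.response(model.frame(selected_model3))))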
# Normality test
shapiro.test(model3$residuals)
##
## Shapiro-Wilk normality test
##
## data: model3$residuals
## W = 0.97628, p-value = 0.7868
# Residuals histogram
hist(model3$residuals)
# Residuals vs. Fitted plot
residual <- resid(model3)
valAdjusted <- fitted(model3)
plot(valAdjusted, residual,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs. Fitted Plot")
abline(h = 0, col = "red", lty = 2)
“I(Educación^2)” is the most statistically significant predictor (p = 0.00534), and “Innovación” is significant at the 10% level (p = 0.07849).
Multiple R-squared is 0.7174, which suggests that the predictors explain a large share of the variability in the dependent variable.
Adjusted R-squared is 0.6468, the highest of the four models.
The AIC is -5.768831, the lowest of the four models, which indicates the best trade-off between fit and complexity among them.
The variance inflation factors (VIF) of all selected independent variables are below 3, so there is no evidence of problematic multicollinearity.
The p-value of the BP test is 0.7164, so there is not enough evidence to affirm that there is heteroscedasticity in the model.
The p-value of the Shapiro-Wilk test on the residuals is 0.7868, so there is not enough evidence to reject normality of the residuals.
The residual standard error (RSE) of model 3 is 0.1886 on the log scale, the lowest of the four models.
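Because the response is log(IED_Flujos_MXN), each linear coefficient can be read as an approximate proportional change in the flow of investment. The sketch below converts the Model 3 estimates copied from the summary above into percentage changes; the vector `b` is introduced only for illustration, and the squared education term is left out because its effect depends on the level of education.

```r
# Coefficients copied from the Model 3 summary above (log-scale response)
b <- c(
  "Inseguridad_Robo" = -0.0009267,
  "Salario_Diario"   =  0.0011399,
  "Empleo"           =  0.0864740,
  "Innovación"       =  0.0851125
)

# A one-unit increase in x multiplies the expected response by exp(b),
# that is, changes it by 100 * (exp(b) - 1) percent with the other predictors held fixed
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 2)
```

For example, the estimate for “Innovación” corresponds to roughly an 8.9% higher flow per additional unit, although that coefficient is significant only at the 10% level.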
# 4. Polynomial-Multiple Linear Regression Model with Logarithmic Transformation and a lagged dependent variable
model4<-lm(log(IED_Flujos_MXN) ~ log(lag(IED_Flujos_MXN))+Inseguridad_Robo+Salario_Diario+Innovación+Empleo+Educación+I(Exportaciones_MXN^2),data=data)
summary(model4)
##
## Call:
## lm(formula = log(IED_Flujos_MXN) ~ log(lag(IED_Flujos_MXN)) +
## Inseguridad_Robo + Salario_Diario + Innovación + Empleo +
## Educación + I(Exportaciones_MXN^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32073 -0.08119 -0.02452 0.11420 0.32251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.660e+00 6.388e+00 0.260 0.7981
## log(lag(IED_Flujos_MXN)) -2.592e-01 2.238e-01 -1.158 0.2628
## Inseguridad_Robo -1.705e-03 1.299e-03 -1.313 0.2067
## Salario_Diario -6.026e-04 3.058e-03 -0.197 0.8461
## Innovación 7.522e-02 4.911e-02 1.532 0.1440
## Empleo 1.169e-01 6.684e-02 1.749 0.0982 .
## Educación 3.294e-01 1.301e-01 2.531 0.0215 *
## I(Exportaciones_MXN^2) 4.037e-13 6.268e-13 0.644 0.5281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1938 on 17 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7208, Adjusted R-squared: 0.6059
## F-statistic: 6.27 on 7 and 17 DF, p-value: 0.0009595
# VIF
vif(model4)
## log(lag(IED_Flujos_MXN)) Inseguridad_Robo Salario_Diario
## 3.321289 2.243638 7.567878
## Innovación Empleo Educación
## 1.629814 1.544642 4.446445
## I(Exportaciones_MXN^2)
## 9.109554
# BP test
bptest(model4)
##
## studentized Breusch-Pagan test
##
## data: model4
## BP = 4.6992, df = 7, p-value = 0.6966
# AIC
cat("AIC:", AIC(model4),"\n")
## AIC: -2.74047
selected_model4<-model4
# RMSE on the log scale, using only the 25 rows kept in the fit (the first year has no lagged value)
cat("RMSE (log scale):",RMSE(fitted(selected_model4),model.response(model.frame(selected_model4))))
# Normality test
shapiro.test(model4$residuals)
##
## Shapiro-Wilk normality test
##
## data: model4$residuals
## W = 0.97545, p-value = 0.7829
# Residuals histogram
hist(model4$residuals)
# Residuals vs. Fitted plot
residual2 <- resid(model4)
valAdjusted2 <- fitted(model4)
plot(valAdjusted2, residual2,
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs. Fitted Plot")
abline(h = 0, col = "red", lty = 2)
###### Model 4: Diagnostic Tests Analysis
- “Educación” is the only predictor that is significant at the 5% level (p = 0.0215). “Empleo” is significant only at the 10% level (p = 0.0982), and the remaining predictors are not statistically significant.
- Multiple R-squared is 0.7208, so the predictors jointly explain about 72% of the variability in the dependent variable.
- Adjusted R-squared is 0.6059. The drop relative to the multiple R-squared reflects the penalty for using seven predictors with only 25 observations.
- The AIC is -2.74047. AIC has no absolute interpretation; it is useful only for comparing models fitted to the same data, where lower values are preferred.
- The variance inflation factors (VIF) of all selected independent variables are below 10, the conventional cutoff for severe multicollinearity. However, “I(Exportaciones_MXN^2)” (9.11) and “Salario_Diario” (7.57) are close to that cutoff, which points to considerable collinearity among the predictors.
- The p-value of the BP test is 0.6966, so the null hypothesis of homoscedasticity is not rejected.
- The p-value of the Shapiro-Wilk test is 0.7829, so normality of the residuals is not rejected.
- The RSE of model 4 is 0.1938 on 17 degrees of freedom, expressed in the units of the dependent variable. This is a small typical prediction error, which suggests that the model fits the observed data closely.
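
As a consistency check on the Model 4 figures, the adjusted R-squared and the largest VIF can be related to quantities in the output. Here n = 25 is inferred from the 17 residual degrees of freedom plus 8 estimated coefficients, p = 7 is the number of predictors, and R_j^2 denotes the R-squared obtained by regressing predictor j on the other predictors (a standard definition, not shown in the original output).

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.7208)\,\frac{24}{17} \approx 0.606

\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j}
      = 1 - \frac{1}{9.109554} \approx 0.89
\quad \text{for } \texttt{I(Exportaciones\_MXN\^{}2)}
```

The first result agrees with the reported 0.6059 up to rounding. The second indicates that roughly 89% of the variation in the squared exports term is explained by the other predictors.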
Selected regression model - Model 3
This model is a polynomial multiple linear regression model with a logarithmic transformation. The selected independent variables were:
- Inseguridad_Robo
- Salario_Diario
- Empleo
- Innovación
- I(Educación^2): The independent variable “Educación” enters as a quadratic term because it can have non-linear effects on the dependent variable. In Mexico, cheap labor is concentrated in the sector of the population with the lowest educational level, which is an attractive factor for foreign investment in projects. At the same time, a high level of education also has a positive impact on foreign investment, as corroborated by the significant positive correlation between “Educación” and “IED_Flujos_MXN”. In addition, education has a positive impact only up to a certain point, since educational attainment is bounded.
Model 3 was selected on the basis of the diagnostic tests, which give the following results and insights:
- “Innovación” and “I(Educación^2)” are statistically significant.
- Multiple R-squared is 0.7174, so the predictors jointly explain about 72% of the variability in the dependent variable.
- The AIC is -5.768831. Because this value is lower than the AIC of the other models, Model 3 is preferred under this criterion.
- The variance inflation factors (VIF) of all selected independent variables are below 10, so there is not enough evidence to affirm the presence of severe multicollinearity. Of the models compared, this one shows the least multicollinearity, since the selected explanatory variables do not explain each other.
- The p-value of the BP test is 0.7164, which is greater than the 0.05 significance level, so there is not enough evidence to reject the null hypothesis of homoscedasticity. Constant error variance is therefore assumed, which makes the standard errors and p-values of the model coefficients more reliable.
- The p-value of the Shapiro-Wilk test is 0.7868, so normality of the residuals is not rejected.
- The RSE of model 3 is 0.1886, expressed in the units of the dependent variable. It is slightly lower than the 0.1938 of model 4, which suggests that the model fits the observed data closely.
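
To make the heteroscedasticity check more concrete, the following is a minimal base-R sketch, on simulated data, of Koenker's studentized Breusch-Pagan statistic computed by hand. To my understanding this is the variant that `bptest()` reports by default, but treat that as an assumption. The data frame `toy` and its columns are hypothetical.

```r
# Simulated data with constant error variance (names are hypothetical).
set.seed(42)
n <- 25
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(n, sd = 0.2)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same regressors.
toy$e2 <- resid(fit)^2
aux <- lm(e2 ~ x1 + x2, data = toy)

# Studentized statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with k degrees of freedom.
k <- 2  # number of regressors, excluding the intercept
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = k, lower.tail = FALSE)

cat("BP =", bp_stat, " df =", k, " p-value =", bp_pval, "\n")
```

A large p-value, as reported for Model 3, means the squared residuals are not systematically explained by the regressors, so homoscedasticity is not rejected.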
From the regression results of model 3, the estimated coefficients imply the following associations between each independent variable and the fitted response for IED_Flujos_MXN (on the scale used in the model), holding the other variables constant:
- As Inseguridad_Robo increases by 1 unit, the response decreases by 0.0009267 units.
- As Salario_Diario increases by 1 unit, the response increases by 0.0011399 units.
- As Empleo increases by 1 unit, the response increases by 0.0864740 units.
- As Innovación increases by 1 unit, the response increases by 0.0851125 units.
- As I(Educación^2) increases by 1 unit, the response increases by 0.0152487 units.
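
If the response in model 3 is `log(IED_Flujos_MXN)`, the coefficients above are easier to read as percentage changes. This section does not show the model formula, so that is an assumption. The sketch below uses the coefficients reported in the text; the education level `edu_years` is a hypothetical value chosen only for illustration.

```r
# Coefficients as reported in the text for Model 3.
# "Innovacion" is written without the accent to keep the code ASCII-only.
b <- c(Inseguridad_Robo = -0.0009267,
       Salario_Diario   =  0.0011399,
       Empleo           =  0.0864740,
       Innovacion       =  0.0851125)

# Assumption: the response is log(IED_Flujos_MXN). A one-unit increase in a
# predictor then multiplies IED_Flujos_MXN by exp(b), i.e. changes it by
# 100 * (exp(b) - 1) percent.
pct_change <- 100 * (exp(b) - 1)
print(round(pct_change, 3))

# The quadratic term implies a marginal effect that depends on the level:
# d log(y) / d Educacion = 2 * b_edu2 * Educacion.
b_edu2 <- 0.0152487
edu_years <- 9  # hypothetical level of average schooling, for illustration
marginal_edu <- 2 * b_edu2 * edu_years
cat("Marginal effect of Educacion at", edu_years, "years:", marginal_edu, "\n")
```

Under this reading, the effect of education is not a single number: it grows with the current level of schooling because the variable enters the model squared.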
Not all the variables are statistically significant, but they do not present multicollinearity. The model suggests that improvements in the independent variables other than “Inseguridad_Robo” are associated with a higher Flow of Foreign Investment. Conversely, if Mexico does not improve its security policies and robbery prevention, it will be a less attractive candidate for Nearshoring investment projects.
Focusing on the previously established hypotheses, these are the conclusions:
Hypothesis 1: The null hypothesis is not rejected. There is no evidence that the average years of education of the Mexican population have a negative effect on the Flow of Foreign Investment in millions of pesos.
Hypothesis 2: The null hypothesis is not rejected, since the p-value is greater than 0.05. There is no evidence that a high rate of economically active employed population in Mexico has a negative effect on the Flow of Foreign Investment in millions of pesos.
Hypothesis 3: The null hypothesis is not rejected. There is no evidence that the minimum daily wage in Mexico has a negative effect on the Flow of Foreign Investment.
# 1. Predicted values of the dependent variable "IED_Flujos_MXN" regarding "Innovación"
effect_plot(model3, pred=Innovación, data=data, interval=TRUE)
This graph shows the positive association between innovation and the Flow of
Foreign Investment estimated in model 3. When the independent variable
“Innovación” increases by 1 unit, the fitted response for “IED_Flujos_MXN”
increases by 0.0851125 units. This means that as the patent rate
increases in Mexico, the Flow of Foreign Investment is expected to increase as
well. By focusing on innovation, Mexico becomes a stronger candidate for
Nearshoring investment projects.
# 2. Predicted values of the dependent variable "IED_Flujos_MXN" regarding "Empleo"
effect_plot(model3, pred=Empleo, data=data, interval=TRUE)
This graph shows the positive association between employment and the Flow of
Foreign Investment estimated in model 3. When the independent variable
“Empleo” increases by 1 unit, the fitted response for “IED_Flujos_MXN”
increases by 0.0864740 units. This means that as the economically
active population in Mexico increases, the Flow of Foreign
Investment is expected to increase as well.
# 3. Predicted values of the dependent variable "IED_Flujos_MXN" regarding "Inseguridad_Robo"
effect_plot(model3, pred=Inseguridad_Robo, data=data, interval=TRUE)
This graph shows the negative association between robbery insecurity and the Flow of Foreign Investment estimated in model 3. When the independent variable “Inseguridad_Robo” increases by 1 unit, the fitted response for “IED_Flujos_MXN” decreases by 0.0009267 units. This means that as robbery insecurity in Mexico increases, the Flow of Foreign Investment is expected to decrease. By focusing on reducing and counteracting robbery insecurity, Mexico becomes a stronger candidate for Nearshoring investment projects.
Exports are a relevant variable for predicting the future behavior of the Flow of Foreign Direct Investment. However, due to its multicollinearity with the other predictors, it is not a reliable variable to include in a regression model.
Data from most of the variables are approximately normally distributed, especially the “INPC”, “Empleo”, “Inseguridad_Robo” and “Innovación” variables. This means that the data are roughly symmetric around the mean and that there are no outliers that seriously hinder the interpretation of the data.
If Mexico focuses on increasing the employment rate and the average years of education of its population, it could be a more attractive candidate for Nearshoring investment projects. Not all the variables have a positive association with the Flow of Foreign Investment: “Inseguridad_Robo” has a negative one.
Mexico has improved in terms of innovation over the years. Taking into account this improvement and the statistical significance of the variable in model 3, it can be concluded that the country’s level of innovation is a useful predictor of the level of Foreign Direct Investment Flow.