Nearshoring is a business practice where companies choose to relocate their manufacturing or service operations from their home country to a nearby country, looking for one that is in close proximity to their target market. One of the main goals of nearshoring is to optimize production supply chains and minimize costs while maintaining operational efficiency.
The COVID-19 pandemic highlighted companies’ dependence on China and compromised their supply chains. In addition, the trade war between China and the United States, due to the restrictions imposed on imports from that country, intensified these challenges. (El Economista, 2022)
Mexico and its atractiveness for nearshoring is an interesting case of study due its proximity to United States, one of the larger consumer markets in the world. Be.sides, the free trade agreement between Canada, United States and the analized country in this case, not mentioning its human cheap and competitive capital.
Predictive analytics is typically employed to predict what might happen in future events or outcomes. This approach leverages data analysis and machine learning techniques to make predictions based on historical data. The process of predictive analytics involves gathering and preprocessing data, constructing predictive models, and using these models to make forecasts about future occurrences.
One of the most useful predictive analytics tool is regression analysis, which can determine the relationship between two or more independent variables and a dependent variable, which is going to be predicted. The relationships between variables are written as a mathematical equation that can help predict the outcome should one variable change. (Harvard Business School Online, 2021)
Regression analysis can assist in predicting the occurrence of “Nearshoring” for the Mexican case by exploring and analyzing the relationships among independent variables, which encompass the factors that influence nearshoring decisions, and a single dependent variable representing the outcome probability for nearshoring to occur (in this case, Foreign Direct Investment).
To sum up, regression analysis provides a fact-based knowledge to assess the influence of multiple factors on nearshoring in Mexico, enabling the possibility to make predictions and find insights about which conditions or variables are more likely to attract foreign investment and drive nearshoring activities in the country.
With all the context provided previously, it is time to understand the problem situation. The attractiveness of Mexico for nearshoring is a case that could bring many benefits to the Mexican economy and the country’s growth prospects. As mentioned earlier, Mexico boasts numerous attributes that can attract significant investments from large companies, promising a host of advantages for the nation.
In summary, the problem situation involves the analysis of the provided data to comprehend and assess the relationships among the variables within the dataset (which will be presented and explained below) and Foreign Direct Investment (FDI) flows..
To tackle this problem effectively, it is essential to gain insights through the testing of regression models. These models will guide the country’s efforts toward the targeted development of factors that can attract FDI-driven nearshoring. # Data and Methodology
data <- read.csv("C:\\Users\\danyb\\OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey\\Docs\\Documentos\\Business Intelligence\\Quinto Semestre\\Introduction to Econometrics\\data_sp.csv")
library(dlookr)
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
library(kableExtra)
library(effects)
## Loading required package: carData
## Registered S3 method overwritten by 'survey':
## method from
## summary.pps dlookr
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(visdat)
library(foreign)
library(dplyr) # data manipulation
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Hmisc) # several useful functions for data analysis
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:dlookr':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych) # functions for multivariate analysis
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:dlookr':
##
## describe
library(naniar) # summaries and visualization of missing values NAs
library(dlookr) # summaries and visualization of missing values NAs
library(corrplot) # correlation plots
## corrplot 0.92 loaded
library(jtools) # presentation of regression analysis
##
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(lmtest) # diagnostic checks - linear regression analysis
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # diagnostic checks - linear regression analysis
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(kableExtra) # HTML table attributes
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ tidyr::extract() masks dlookr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ car::recode() masks dplyr::recode()
## ✖ purrr::some() masks car::some()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library(Metrics)
##
## Attaching package: 'Metrics'
##
## The following objects are masked from 'package:caret':
##
## precision, recall
str(data)
## 'data.frame': 26 obs. of 16 variables:
## $ Period : int 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
## $ IED_Flows : num 12146 8374 13960 18249 30057 ...
## $ Exports : num 9088 9875 10990 12483 11300 ...
## $ Employment : num NA NA NA 97.8 97.4 ...
## $ Education : num 7.2 7.31 7.43 7.56 7.68 7.8 7.93 8.04 8.14 8.26 ...
## $ Daily_Salary : num 24.3 31.9 31.9 35.1 37.6 ...
## $ Innovation : num 11.3 11.4 12.5 13.2 13.5 ...
## $ Insecurity_Robbery : num 267 315 273 217 215 ...
## $ Insecurity_Homicide: num 14.6 14.3 12.6 10.9 10.2 ...
## $ Exchange_Rate : num 8.06 9.94 9.52 9.6 9.17 ...
## $ Road_Density : num 0.05 0.05 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 ...
## $ Population_Density : num 47.4 48.8 49.5 50.6 51.3 ...
## $ CO2_Emissions : num 3.68 3.85 3.69 3.87 3.81 3.82 3.95 3.98 4.1 4.19 ...
## $ GPD_PER_CAPITA : num 127570 126739 129165 130875 128083 ...
## $ CPI : num 33.3 39.5 44.3 48.3 50.4 ...
## $ Financial_Crisis : int 0 0 0 0 0 0 0 0 0 0 ...
This data set only contains numeric fields, it could make the analysis more interesting.
sum(is.na(data))
## [1] 12
There are 12 missing values in the data set. It is necessary to identify them.
colSums(is.na(data))
## Period IED_Flows Exports Employment
## 0 0 0 3
## Education Daily_Salary Innovation Insecurity_Robbery
## 3 0 2 0
## Insecurity_Homicide Exchange_Rate Road_Density Population_Density
## 1 0 0 0
## CO2_Emissions GPD_PER_CAPITA CPI Financial_Crisis
## 3 0 0 0
gg_miss_var(data)
It can be observed that the missing values are located in four
differente variables: Employment, Education, CO2_Emmissions, Innovation
and Insecurity_Homicide. It is time to fix it.
#Replacing Missing Values with Median Values.
data <- data %>%
mutate_all(~ replace_na(., median(., na.rm = TRUE)))
#Checking the replacing
colSums(is.na(data))
## Period IED_Flows Exports Employment
## 0 0 0 0
## Education Daily_Salary Innovation Insecurity_Robbery
## 0 0 0 0
## Insecurity_Homicide Exchange_Rate Road_Density Population_Density
## 0 0 0 0
## CO2_Emissions GPD_PER_CAPITA CPI Financial_Crisis
## 0 0 0 0
Foreign Direct Investment Flows and Exports are given in million dollars (USD), while the other variables are in Mexican pesos (MXN)
#It would be necessary to change the IED and the Exports to MXN
data$IED_FlowsMXN <- ((data$IED_Flows * data$Exchange_Rate) / data$CPI) * 100
data$Exports_MXN <- ((data$Exports * data$Exchange_Rate) / data$CPI) * 100
#Eliminate the original columns, they will not be necessary.
data <- subset(data, select = -c(IED_Flows, Exports))
#Confirm changes
colnames(data)
## [1] "Period" "Employment" "Education"
## [4] "Daily_Salary" "Innovation" "Insecurity_Robbery"
## [7] "Insecurity_Homicide" "Exchange_Rate" "Road_Density"
## [10] "Population_Density" "CO2_Emissions" "GPD_PER_CAPITA"
## [13] "CPI" "Financial_Crisis" "IED_FlowsMXN"
## [16] "Exports_MXN"
summary(data)
## Period Employment Education Daily_Salary
## Min. :1997 Min. :95.06 Min. :7.200 Min. : 24.30
## 1st Qu.:2003 1st Qu.:96.08 1st Qu.:7.957 1st Qu.: 41.97
## Median :2010 Median :96.53 Median :8.460 Median : 54.48
## Mean :2010 Mean :96.48 Mean :8.428 Mean : 65.16
## 3rd Qu.:2016 3rd Qu.:97.01 3rd Qu.:8.925 3rd Qu.: 72.31
## Max. :2022 Max. :97.83 Max. :9.580 Max. :172.87
## Innovation Insecurity_Robbery Insecurity_Homicide Exchange_Rate
## Min. :11.28 Min. :120.5 Min. : 8.04 Min. : 8.06
## 1st Qu.:12.60 1st Qu.:148.3 1st Qu.:10.40 1st Qu.:10.75
## Median :13.09 Median :181.8 Median :16.93 Median :13.02
## Mean :13.10 Mean :185.4 Mean :17.28 Mean :13.91
## 3rd Qu.:13.61 3rd Qu.:209.9 3rd Qu.:22.34 3rd Qu.:18.49
## Max. :15.11 Max. :314.8 Max. :29.59 Max. :20.66
## Road_Density Population_Density CO2_Emissions GPD_PER_CAPITA
## Min. :0.05000 Min. :47.44 Min. :3.590 Min. :126739
## 1st Qu.:0.06000 1st Qu.:52.77 1st Qu.:3.842 1st Qu.:130964
## Median :0.07000 Median :58.09 Median :3.930 Median :136845
## Mean :0.07115 Mean :57.33 Mean :3.943 Mean :138550
## 3rd Qu.:0.08000 3rd Qu.:61.39 3rd Qu.:4.090 3rd Qu.:146148
## Max. :0.09000 Max. :65.60 Max. :4.220 Max. :153236
## CPI Financial_Crisis IED_FlowsMXN Exports_MXN
## Min. : 33.28 Min. :0.00000 Min. :210876 Min. :205483
## 1st Qu.: 56.15 1st Qu.:0.00000 1st Qu.:368560 1st Qu.:262337
## Median : 73.35 Median :0.00000 Median :497054 Median :366294
## Mean : 75.17 Mean :0.07692 Mean :493596 Mean :433856
## 3rd Qu.: 91.29 3rd Qu.:0.00000 3rd Qu.:578606 3rd Qu.:632356
## Max. :126.48 Max. :1.00000 Max. :754438 Max. :785655
Examining this summary of the data provides insight into the nature of the data values. It can be observed that the independent variables “Exports_MXN” and “Daily_Salary” exhibit outliers due to the significant spread between their minimum and maximum values. Addressing these inconsistencies may be necessary if these variables are to be utilized in the regression models. Additionally, these variables display a substantial disparity between their mean and median values, implying potential unreliability in central tendency measures. #### Measures of dispersion
describe(data)
## vars n mean sd median trimmed mad
## Period 1 26 2009.50 7.65 2009.50 2009.50 9.64
## Employment 2 26 96.48 0.72 96.53 96.48 0.76
## Education 3 26 8.43 0.68 8.46 8.44 0.76
## Daily_Salary 4 26 65.16 35.85 54.48 60.16 22.51
## Innovation 5 26 13.10 1.07 13.09 13.09 0.79
## Insecurity_Robbery 6 26 185.42 47.67 181.83 181.16 47.06
## Insecurity_Homicide 7 26 17.28 7.12 16.93 16.98 9.31
## Exchange_Rate 8 26 13.91 4.15 13.02 13.78 4.25
## Road_Density 9 26 0.07 0.01 0.07 0.07 0.01
## Population_Density 10 26 57.33 5.41 58.09 57.44 6.68
## CO2_Emissions 11 26 3.94 0.18 3.93 3.95 0.17
## GPD_PER_CAPITA 12 26 138550.10 8861.10 136845.30 138255.64 11080.42
## CPI 13 26 75.17 24.81 73.35 74.45 27.14
## Financial_Crisis 14 26 0.08 0.27 0.00 0.00 0.00
## IED_FlowsMXN 15 26 493596.02 143849.16 497053.70 494270.03 183243.92
## Exports_MXN 16 26 433855.52 195018.66 366293.83 423610.02 184264.93
## min max range skew kurtosis se
## Period 1997.00 2022.00 25.00 0.00 -1.34 1.50
## Employment 95.06 97.83 2.77 -0.17 -0.73 0.14
## Education 7.20 9.58 2.38 -0.12 -1.05 0.13
## Daily_Salary 24.30 172.87 148.57 1.43 1.44 7.03
## Innovation 11.28 15.11 3.83 0.12 -0.70 0.21
## Insecurity_Robbery 120.49 314.78 194.29 0.89 0.30 9.35
## Insecurity_Homicide 8.04 29.59 21.55 0.38 -1.28 1.40
## Exchange_Rate 8.06 20.66 12.60 0.44 -1.39 0.81
## Road_Density 0.05 0.09 0.04 0.19 -1.41 0.00
## Population_Density 47.44 65.60 18.16 -0.19 -1.24 1.06
## CO2_Emissions 3.59 4.22 0.63 -0.11 -0.96 0.04
## GPD_PER_CAPITA 126738.75 153235.73 26496.98 0.28 -1.41 1737.81
## CPI 33.28 126.48 93.20 0.26 -0.95 4.87
## Financial_Crisis 0.00 1.00 1.00 2.99 7.25 0.05
## IED_FlowsMXN 210875.58 754437.47 543561.89 -0.01 -1.00 28211.14
## Exports_MXN 205482.92 785654.49 580171.58 0.48 -1.40 38246.31
Looking at standard deviation for the variables mentioned above could confirm the spread of the values of this variables.
ggplot(data, aes(x = Exchange_Rate, y = Exports_MXN)) +
geom_point() +
geom_smooth(method="lm",se= FALSE, color="steelblue")
## `geom_smooth()` using formula = 'y ~ x'
labs(x = "Exchange Rate",
y = "Exports_MXN",
title = "Relation between variables") +
theme_minimal()
## NULL
In this dispersion visualization, it can be observed a positive linear tendency between Exchange Rate and Exports. Taking this into account, it could be mentioned as the higher the exchange rate, the higher the Exports in the country. These types of visualizations are just to explore and understand the data, and this relation may be positive, but it could have variations.
data %>% mutate(E_intervals=cut(Education,breaks=c(7,7.5,8,8.5,9,9.6))) %>%
ggplot(aes(x=reorder(E_intervals,IED_FlowsMXN),y=IED_FlowsMXN, fill=E_intervals)) +
geom_bar(stat="identity") + coord_flip()+
scale_fill_brewer(palette="PuBu")+
labs(x="Years of Education", y="Foreign Direct Investment MXN", color="Years of Education") +
ggtitle("Foreign Direct Investment (MXN) by Years of population Education")
This bar graph shows different intervals of education years, and the
value for Foreign Direct Investment flows for each of these intervals.
Although it is a positive relation, it could be observed that the higher
value of education years is not the higher value for FDI. This
information may suggest that relation between these variables could not
be totally lineal, and it will help to take decisions of transforming
the variable by getting better results in the models.
data %>% mutate(ER_intervals=cut(Exchange_Rate,breaks=c(8,10,12,14,16,18,20,22))) %>%
ggplot(aes(x=reorder(ER_intervals,IED_FlowsMXN),y=IED_FlowsMXN, fill=ER_intervals)) +
geom_bar(stat="identity") + coord_flip()+
scale_fill_brewer(palette="PuBu")+
labs(x="Exchange Rate Value", y="FDI (MXN)", color="Exchange Rate") +
ggtitle("FDI by Exchange Rate Values")
This graph suggests some intriguing insights. The highest Foreign Direct
Investment (FDI) values coincide with one of the highest exchange rate
values, ranking second only to the top value. Interestingly, the lowest
exchange rate values correspond to the second-highest FDI values after
the previously mentioned below This observation prompts consideration of
potential reasons.
It’s possible that a high exchange rate is more favorable for companies in the export sector, while conversely, it may be less advantageous for those in the import sector. However, it’s essential to note that this hypothesis can’t be definitively confirmed due to the limitations of the dataset. Nevertheless, considering these patterns is valuable for constructing regression models.
ggplot(data, aes(x= Period, y=(IED_FlowsMXN))) +
geom_line(color="orange", linewidth=1) +
labs(x= "Year", y= "Flow of Foreign Direct Investment MXN")+
ggtitle("Flow of Foreign Direct Investment through the years")
This time series graph illustrates the behavior of the dependent
variable FDI over time. It can be observed that its highest value was
between 2010 and 2015, approximately in 2013. It also shows a decline in
2020, which could be because of pandemic for COVID-19.Despite
experiencing this decline it’s also noticeable that it has been steadily
rising in the years following that, however, there is not enough data to
confidently assert this. In summary, and looking the data for the first
years (2000-2010), it could be concluded that the values for FDI have
experienced growth through the years.
dev.new()
hist(data$IED_FlowsMXN, prob=TRUE, col='steelblue', main='Histogram with Foreign Direct Investment')
lines(density(data$IED_Flows), col=3, lwd=4)
The histogram for the dependent variable shows the distribution of the values for FDI. It can be observed that the higher frequency is located between 5 and 6 million Mexican pesos (MXN). Despite the fact that the different ranges do not have the same frequency, it is similar.
par(mfrow=c(4, 4))
variables <- c("IED_FlowsMXN", "Period", "Exports_MXN", "Employment", "Education",
"Daily_Salary", "Innovation", "Insecurity_Robbery",
"Insecurity_Homicide", "Exchange_Rate", "Road_Density",
"Population_Density", "CO2_Emissions", "GPD_PER_CAPITA", "CPI")
for (variable in variables) {
hist(data[[variable]], col="steelblue", main=variable)
}
In this part, an histogram for each variable was plotted, with the intention of observed its values behavior, and evaluate if they could need a transform or a change.
plot_normality(data,IED_FlowsMXN)
plot_normality(data,Period)
plot_normality(data,Exports_MXN)# log
plot_normality(data,Employment) #It could be log
plot_normality(data,Education)
plot_normality(data,Daily_Salary) #It could be log
plot_normality(data,Innovation) #It could be log
plot_normality(data,Insecurity_Robbery)
plot_normality(data,Insecurity_Homicide)
plot_normality(data,Exchange_Rate) #To square
plot_normality(data,Road_Density)
plot_normality(data,Population_Density)
plot_normality(data,CO2_Emissions)
plot_normality(data,GPD_PER_CAPITA)
plot_normality(data,CPI)
after that, normality diagnosis plots were plotted to see the
differences between the original distribution frequency of values, and
its possible changes.
data1<-data %>% select(Period,IED_FlowsMXN,Employment,Insecurity_Robbery,Insecurity_Homicide,CO2_Emissions,Financial_Crisis,Daily_Salary,Exchange_Rate,Exports_MXN,Road_Density,Population_Density,CPI,Innovation,Education,GPD_PER_CAPITA)
summary(data1)
## Period IED_FlowsMXN Employment Insecurity_Robbery
## Min. :1997 Min. :210876 Min. :95.06 Min. :120.5
## 1st Qu.:2003 1st Qu.:368560 1st Qu.:96.08 1st Qu.:148.3
## Median :2010 Median :497054 Median :96.53 Median :181.8
## Mean :2010 Mean :493596 Mean :96.48 Mean :185.4
## 3rd Qu.:2016 3rd Qu.:578606 3rd Qu.:97.01 3rd Qu.:209.9
## Max. :2022 Max. :754438 Max. :97.83 Max. :314.8
## Insecurity_Homicide CO2_Emissions Financial_Crisis Daily_Salary
## Min. : 8.04 Min. :3.590 Min. :0.00000 Min. : 24.30
## 1st Qu.:10.40 1st Qu.:3.842 1st Qu.:0.00000 1st Qu.: 41.97
## Median :16.93 Median :3.930 Median :0.00000 Median : 54.48
## Mean :17.28 Mean :3.943 Mean :0.07692 Mean : 65.16
## 3rd Qu.:22.34 3rd Qu.:4.090 3rd Qu.:0.00000 3rd Qu.: 72.31
## Max. :29.59 Max. :4.220 Max. :1.00000 Max. :172.87
## Exchange_Rate Exports_MXN Road_Density Population_Density
## Min. : 8.06 Min. :205483 Min. :0.05000 Min. :47.44
## 1st Qu.:10.75 1st Qu.:262337 1st Qu.:0.06000 1st Qu.:52.77
## Median :13.02 Median :366294 Median :0.07000 Median :58.09
## Mean :13.91 Mean :433856 Mean :0.07115 Mean :57.33
## 3rd Qu.:18.49 3rd Qu.:632356 3rd Qu.:0.08000 3rd Qu.:61.39
## Max. :20.66 Max. :785655 Max. :0.09000 Max. :65.60
## CPI Innovation Education GPD_PER_CAPITA
## Min. : 33.28 Min. :11.28 Min. :7.200 Min. :126739
## 1st Qu.: 56.15 1st Qu.:12.60 1st Qu.:7.957 1st Qu.:130964
## Median : 73.35 Median :13.09 Median :8.460 Median :136845
## Mean : 75.17 Mean :13.10 Mean :8.428 Mean :138550
## 3rd Qu.: 91.29 3rd Qu.:13.61 3rd Qu.:8.925 3rd Qu.:146148
## Max. :126.48 Max. :15.11 Max. :9.580 Max. :153236
corrplot(cor(data1), type = 'upper', order = 'hclust', addCoef.col = 'black', tl.cex = 0.5, tl.srt = 90, mar = c(0,0,0,0),number.cex=0.8)
Finally, this correlation plot visually represents the relationships
between each variable, including the dependent variable, by showing
their correlation coefficients. This can assist in identifying whether
there is a positive or negative correlation between the variable of
interest, ‘IED_FlowsMXN,’ and the independent variables.
Upon examining the plot, we can conclude that the variables with the highest correlation values concerning ‘IED_FlowsMXN’ are “Education”, “Innovation”, “Population_Density”, “Road_Density”, “CPI”, “Exports_MXN”, and “Exchange_Rate”. It’s worth noting that ‘Daily_Salary’ also shows a good (but lower) correlation value. ### Which is the estimation method to be used to estimate the linear regression model? The estimation method is Ordinary Least Squares (OLS).It is a common technique for estimating coefficients. “The OLS method aims to minimize the sum of square differences between the observed and predicted values”. (XLSTAT,2023)
#Linear Regression Model ## Based on EDA, formulate and describe 3 hypotheses
H1: It might be expected a positive relation between Innovation and Foreign Direct Investment flows.
H2: It might be expected a positive relation between exchange_rate and Foreign Direct Investment flows.
H3: It might be expected a negative relation between financial crisis and Foreign Direct Investment flows.
Extra * H4: It might be expected a possitive relation between Exports and Foreign Direct Investment flows.
#Constructing a model with variables mentioned below
model1<-lm(IED_FlowsMXN~ Exports_MXN + Education + Daily_Salary + Innovation + Exchange_Rate+ Road_Density + Population_Density + CPI, data=data)
summary(model1)
##
## Call:
## lm(formula = IED_FlowsMXN ~ Exports_MXN + Education + Daily_Salary +
## Innovation + Exchange_Rate + Road_Density + Population_Density +
## CPI, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -126215 -33514 994 51433 161978
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.859e+05 2.029e+06 0.387 0.7033
## Exports_MXN -2.961e-01 5.643e-01 -0.525 0.6065
## Education 2.579e+04 1.267e+05 0.204 0.8411
## Daily_Salary -3.230e+03 5.653e+03 -0.571 0.5753
## Innovation 5.166e+04 1.942e+04 2.660 0.0165 *
## Exchange_Rate 1.229e+04 2.403e+04 0.512 0.6156
## Road_Density 7.464e+06 7.199e+06 1.037 0.3144
## Population_Density -4.406e+04 4.238e+04 -1.040 0.3131
## CPI 1.299e+04 1.898e+04 0.684 0.5030
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84970 on 17 degrees of freedom
## Multiple R-squared: 0.7627, Adjusted R-squared: 0.6511
## F-statistic: 6.831 on 8 and 17 DF, p-value: 0.0004584
This model gives information about the relation between independent variables and the dependent variables (FDI). It is relevant to see that just one of the selected variables is statistically significant. To keep concluding, it will be necessary to do check tests for this model:
AIC(model1)
## [1] 672.9421
Value for AIC is to high. It will be compared with AIC for other regression models to see what is the best model, in order to make the prediction more accuracy.
vif(model1)
## Exports_MXN Education Daily_Salary Innovation
## 41.929699 25.337811 142.214820 1.502321
## Exchange_Rate Road_Density Population_Density CPI
## 34.447391 32.049259 182.117126 767.744362
Variables used in model 1 have very high values of Variance Inflation Factor (VIF), which can be interpreted as multicolinearilty between variables in the model. It must show an incorrect prediction.
histogram(model1$residuals)
shapiro.test(model1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.96741, p-value = 0.5575
In the histogram below it can be observed that residuals do follow a normal distribution, and it can be confirmed by getting a p-value higher than 0.05 or 5%.
#Estimating a model deleting some variables, and transforming some into "log variables" in order to achieve a normal distribution, and elevating some of these to square, due its possible non-linear relation with dependent variable.
model2<-lm(log(IED_FlowsMXN) ~ log(Exports_MXN)+#Try with log(Exports)
I(Education^2)+ Daily_Salary + Innovation+
I(Exchange_Rate^2)+Financial_Crisis, data = data)
summary(model2)
##
## Call:
## lm(formula = log(IED_FlowsMXN) ~ log(Exports_MXN) + I(Education^2) +
## Daily_Salary + Innovation + I(Exchange_Rate^2) + Financial_Crisis,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29124 -0.09391 0.01387 0.09175 0.41664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.313528 5.209725 3.899 0.000965 ***
## log(Exports_MXN) -0.846135 0.450525 -1.878 0.075802 .
## I(Education^2) 0.021731 0.008366 2.598 0.017675 *
## Daily_Salary 0.004002 0.002568 1.558 0.135659
## Innovation 0.110686 0.040107 2.760 0.012467 *
## I(Exchange_Rate^2) 0.001912 0.001085 1.763 0.094010 .
## Financial_Crisis -0.176630 0.139169 -1.269 0.219704
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1837 on 19 degrees of freedom
## Multiple R-squared: 0.7455, Adjusted R-squared: 0.6651
## F-statistic: 9.276 on 6 and 19 DF, p-value: 8.088e-05
With the changes made between model 1 and 2, it can be observed some principal changes in the significance for some variables, but it needs to be tested before getting any prediction.
AIC(model2)
## [1] -6.483815
Value for AIC is low, but this is not the first parameter that must be considering in testing and evaluating regression models. It’s important to see if variables in this model have multicolinearility.
vif(model2)
## log(Exports_MXN) I(Education^2) Daily_Salary Innovation
## 30.065262 6.660979 6.279672 1.370449
## I(Exchange_Rate^2) Financial_Crisis
## 13.394789 1.059659
It can be observed values above 10 in the variables “Exports_MXN” and “Exchange_Rate”. It suggests that predictions for this model may be incorrect due the multicolinearility in these variables.
histogram(model2$residuals)
shapiro.test(model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model2$residuals
## W = 0.95953, p-value = 0.3824
In the histogram below it can be observed that residuals do follow a normal distribution, and it can be confirmed by getting a p-value higher than 0.05 or 5%. It can be concluded that the problem with the models is multicolinearility. A transformation will be needing.
In the plotted histograms for variables before, it could be observed than the values for variable “Exports_MXN” did not have a normal distribution, nor using log transformation.
By analyzing these plots, along with the dispersion plot previously created for this variable, it can be concluded that there are significantly more values on the left side. Therefore, the transformation will involve replacing some of the values in the variable with the median of it, in an attempt to achieve a normal distribution, and eliminate multicolinealirity.
data1 <- data
data1$Exports_MXN[data1$Exports_MXN == max(data1$Exports_MXN)] <- median(data1$Exports_MXN)
model3<-lm(log(IED_FlowsMXN) ~ log(Exports_MXN)+
I(Education^2)+ Daily_Salary + Innovation+
I(Exchange_Rate^2)+Financial_Crisis, data = data1)
summary(model3)
##
## Call:
## lm(formula = log(IED_FlowsMXN) ~ log(Exports_MXN) + I(Education^2) +
## Daily_Salary + Innovation + I(Exchange_Rate^2) + Financial_Crisis,
## data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33898 -0.06667 -0.00561 0.06077 0.37666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3081392 3.3649076 3.955 0.000849 ***
## log(Exports_MXN) -0.2396294 0.2911425 -0.823 0.420681
## I(Education^2) 0.0149916 0.0085352 1.756 0.095119 .
## Daily_Salary 0.0016523 0.0023016 0.718 0.481548
## Innovation 0.1154369 0.0434346 2.658 0.015541 *
## I(Exchange_Rate^2) 0.0007444 0.0008843 0.842 0.410359
## Financial_Crisis -0.1533275 0.1481751 -1.035 0.313767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1965 on 19 degrees of freedom
## Multiple R-squared: 0.7086, Adjusted R-squared: 0.6166
## F-statistic: 7.702 on 6 and 19 DF, p-value: 0.0002676
With the imputation of median values, the variables with high level of significancy changes, but it is necessary to test if this model do not present multicolinearility.
AIC(model3)
## [1] -2.967193
vif(model3)
## log(Exports_MXN) I(Education^2) Daily_Salary Innovation
## 9.884281 6.056573 4.405775 1.403972
## I(Exchange_Rate^2) Financial_Crisis
## 7.778066 1.049286
Although AIC value is higher than the model 2 value, in this model the multicolinearility problem were solved, and predicting with this model may be more accuracy than doing it with the previous model.
histogram(model3$residuals)
shapiro.test(model3$residuals)
##
## Shapiro-Wilk normality test
##
## data: model3$residuals
## W = 0.96515, p-value = 0.5029
Histogram shows a bit right-skeewed distribution for the residuals of this model, but with shapiro test, due its p-value of 0.50, it can be concluded that statistical inference might be accurate.
The model selected based on the diagnostic tests is going to be model 3.
This model helps to evaluate the hypothesis made before. *For H1: It can be observed that the coefficient for variable “Innovation” is positive, so the hypothesis is accepted.
*For H2: Based on the regression results, it can be concluded that “Exchange_Rate” has a positive relationship with the dependent variable, so the hypothesis is accepted.
*For H3: Model 3 shows that the variable “Financial_Crisis” has a negative coefficient, so it might have a negative relation with the dependent variable, this is why the third hypothesis is accepted.
*For H4: It can be concluded, due the negative coefficient for “Exports_MXN” in model 3, that it has a negative relation with FDI, so the hypothesis is rejected.
• Show the predicted values of the dependent variable (e.g., effects plot)
plot(effect("Innovation",model3))
plot(effect("I(Exchange_Rate^2)",model3))
plot(effect("Financial_Crisis",model3))
plot(effect("Exports_MXN",model3))
## NOTE: Exports_MXN does not appear in the model
It is important to see how the effect plots ilustrate the previous
results and hypothesis testing obtained with model 3. Innovation and
Exchange Rate shows a positive relation with Foreign Direct Investment,
while Financial Crisis and Exports(MXN) show a negative relation.
*Despite the fact that almost all the variables had high correlation values, not all of these variables where significative for regression model. It shows how correlation and regression is not the same thing.
*It is interesting to see how exports has a negative impact in FDI, while exchange rate has a positive impact, though these two variables shown a positive relation between both.
*It will be interesting to observe the behavior of the education variable due the result of the bar graph. This variable was elevated to square, had a positive coefficient, and significance statistic for the regression model.
*To sum up: Exports: Negative relation, but not significant Education: Positive relation, and significant Daily_Salary: Positive relation, but not significant Innovation: Positive relation, and it is the most significant in the model Exchange_Rate: Positive relation, but not significant Financial_Crisis: Negative relation, but not significant.
#References: Lázaro, E. (2022, November 10). ¿Qué es el nearshoring? El Economista; El Economista. https://www.eleconomista.com.mx/empresas/Que-es-el-nearshoring-20221108-0093.html
Ordinary Least Squares regression (OLS). (2023). XLSTAT, Your Data Analysis Solution. https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols
What Is Predictive Analytics? 5 Examples | HBS Online. (2021, October 26). Business Insights Blog. https://online.hbs.edu/blog/post/predictive-analytics
#ANEX
Box.test(model3$residuals, type = "Ljung-Box") #Residuales.)
##
## Box-Ljung test
##
## data: model3$residuals
## X-squared = 1.6638, df = 1, p-value = 0.1971
acf(model3$residuals) #Hacerlo de los residuales.
Seeing these results, it can be concluded that no correlation serial is
identified in the model.
#x<-model.matrix(log(IED_FlowsMXN) ~ log(Exports_MXN)+
# I(Education^2)+ Daily_Salary + Innovation+
# I(Exchange_Rate^2)+Financial_Crisis,train.data)[,-1] ### OLS model specification
# x<-model.matrix(Weekly_Sales~.,train.data)[,-1] ### matrix of independent variables X's
#y<-train.data$IED_FlowsMXN ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset (datains) has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
#set.seed(123)
#cv.lasso<-cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
# Display the best lambda value
#cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? the higher the value of ?? , the more penalization there is
# Fit the final model on the training data
#lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
# Display regression coefficients
#coef(lassomodel)
# Make predictions on the test data
#x.test=model.matrix(log(IED_FlowsMXN) ~ log(Exports_MXN)+
# I(Education^2)+ Daily_Salary + Innovation+
# I(Exchange_Rate^2)+Financial_Crisis,train.data)[,-1] ### OLS model specification
# x.test<-model.matrix(Weekly_Sales~.,test.data)[,-1]
#lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
#data.frame(
# RMSE = RMSE(lassopredictions, test.data$IED_FlowsMXN),
#Rsquare = R2(lassopredictions, test.data$IED_FlowsMXN))
#It never run
#lasso<-glmnet(scale(x),y,alpha=1)
#plot(lasso,xvar="lambda",label=T)
#DANIEL FARIAS LASO MODEL