Nearshoring is a practice commonly adopted by companies across many industries around the world. It consists of a company outsourcing its services and/or processes to companies in a nearby, cost-effective country rather than a more distant one.
Nearshoring in Mexico has become very popular, especially among companies in the USA and Canada. Mexico's proximity to these countries, together with the T-MEC (USMCA) agreement and its competitive cost of skilled labor, makes it an attractive nearshoring destination. Beyond these factors, the political and commercial conflicts between the US and China have pushed US imports from Mexico upward, and many American companies have shifted from offshoring in Asia to nearshoring in Mexico.
One famous example, announced recently and met with excitement among the Mexican population, is the arrival of Tesla's production plant. It is worth noting the economic benefits that the recent rise in nearshoring will bring to the country, from new infrastructure to many new job opportunities: through nearshoring, Mexican GDP is projected to grow by up to 2.5% over the next 6 years, alongside reaching 50 billion US dollars in foreign direct investment (Forbes, 2023).
Predictive analytics, as its name suggests, is the process of using and analyzing data to predict future outcomes or events and to identify trends. It uses historical data to forecast potential scenarios, usually through machine learning models. This analysis answers the question "What might happen in the future?".
Regression analysis is a tool commonly used in predictive analytics that models the relationship between one or more independent variables and a dependent variable. In other words, regression analysis can help you understand how different factors affect your outcome of interest.
Regression analysis can help predict nearshoring by examining the relationship between various independent variables and the likelihood of companies choosing Mexico as a nearshoring destination. Researchers can collect historical data on factors such as labor costs, exchange rates, political stability, infrastructure development, and trade agreements, all potential drivers of nearshoring decisions. By applying regression techniques, they can analyze how these variables have influenced nearshoring in the past and model the same variables going forward.
Through regression analysis, a predictive model can be developed that quantifies the impact of each variable on the likelihood of nearshoring in Mexico. This model can then be used to make predictions based on changes in these factors. By constantly updating the model with new data, businesses, governments, or any other interested party can gain insight into the evolving landscape of nearshoring decisions and make informed strategic choices.
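As a minimal illustrative sketch of this idea in R (all data and variable names here are hypothetical), such a model can be fit on historical drivers and then queried for a prediction:

set.seed(1)
toy <- data.frame(
  fdi        = rnorm(30, mean = 100, sd = 10),  # hypothetical FDI outcome
  labor_cost = rnorm(30, mean = 50, sd = 5),    # hypothetical driver
  exch_rate  = rnorm(30, mean = 18, sd = 1)     # hypothetical driver
)
fit <- lm(fdi ~ labor_cost + exch_rate, data = toy)
summary(fit)  # estimated effect of each driver on the outcome
predict(fit, newdata = data.frame(labor_cost = 52, exch_rate = 19))  # forecast a new scenario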
## Problem Situation

Maria, an econometrics analyst at a Mexican company, plans to use official data sources to create a database exploring conditions influencing nearshoring in Mexico, including socioeconomic, business, technological, environmental, and security factors. She aims to apply econometric models to predict the effects of nearshoring in Mexico, understand why Mexico is attractive for nearshoring, and identify opportunities for new business relocation in the country.
library(foreign)
library(dplyr) # data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Hmisc) # several useful functions for data analysis
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych) # functions for multivariate analysis
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(naniar) # summaries and visualization of missing values NA's
library(corrplot) # correlation plots
## corrplot 0.92 loaded
library(jtools) # presentation of regression analysis
##
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(lmtest) # diagnostic checks - linear regression analysis
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # diagnostic checks - linear regression analysis
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(olsrr) # diagnostic checks - linear regression analysis
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(naniar) # identifying missing values
library(stargazer) # create publication quality tables
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects) # displays for linear and other regression models
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(tidyverse) # collection of R packages designed for data science
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ car::recode() masks dplyr::recode()
## ✖ purrr::some() masks car::some()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # Classification and Regression Training
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet) # methods for prediction and plotting, and functions for cross-validation
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library("readxl") # Read Excel
library(caret) # Imputation
library(missForest) # Imputation
library(mice) # Imputation
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(Hmisc) # Imputation
data <- read_excel("C:\\Users\\maria\\OneDrive\\Desktop\\AD 2023\\ECONOMETRICS\\SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = 'Datos')
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. NA 7.20 24.3 11.3
## 2 1998 8374. 9875. NA 7.31 31.9 11.4
## 3 1999 13960. 10990. NA 7.43 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.56 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.68 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.80 39.7 12.8
## 7 2003 18250. 13156 97.1 7.93 41.5 11.8
## 8 2004 25016. 13573. 96.5 8.04 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.14 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.26 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
ts_data <- read_excel("C:\\Users\\maria\\OneDrive\\Desktop\\AD 2023\\ECONOMETRICS\\SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = 'ts.data')
ts_data
## # A tibble: 96 × 3
## Year Trimestre IED_Flujos
## <dbl> <chr> <dbl>
## 1 1999 I 3596.
## 2 1999 II 3396.
## 3 1999 III 3028.
## 4 1999 IV 3940.
## 5 2000 I 4601.
## 6 2000 II 4857.
## 7 2000 III 3057.
## 8 2000 IV 5734.
## 9 2001 I 3599.
## 10 2001 II 5219.
## # ℹ 86 more rows
# Search for N/A in data (df1)
sum(is.na(data))
## [1] 12
glimpse(data)
## Rows: 26
## Columns: 15
## $ Year <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, …
## $ IED_Flujos <dbl> 12145.60, 8373.50, 13960.32, 18248.69, 30057.18,…
## $ Exportaciones <dbl> 9087.616, 9875.065, 10990.013, 12482.963, 11300.…
## $ Empleo <dbl> NA, NA, NA, 97.83000, 97.36000, 97.66000, 97.060…
## $ Educacion <dbl> 7.197968, 7.311587, 7.427924, 7.560000, 7.678270…
## $ Salario_Diario <dbl> 24.30, 31.91, 31.91, 35.12, 37.57, 39.74, 41.53,…
## $ Innovacion <dbl> 11.30141, 11.37221, 12.45897, 13.14556, 13.46812…
## $ Inseguridad_Robo <dbl> 266.5065, 314.7762, 272.8936, 216.9808, 214.5269…
## $ Inseguridad_Homicidio <dbl> 14.554142, 14.319399, 12.641071, 10.857846, 10.2…
## $ Tipo_de_Cambio <dbl> 8.0640, 9.9395, 9.5222, 9.5997, 9.1692, 10.3613,…
## $ Densidad_Carretera <dbl> 0.05205217, 0.05295475, 0.05501698, 0.05522774, …
## $ Densidad_Poblacion <dbl> 47.43650, 48.76163, 49.48089, 50.57930, 51.27675…
## $ CO2_Emisiones <dbl> 3.675330, 3.853045, 3.686918, 3.874147, 3.811393…
## $ PIB_Per_Capita <dbl> 127570.1, 126738.8, 129164.7, 130874.9, 128083.4…
## $ INPC <dbl> 33.27987, 39.47297, 44.33552, 48.30767, 50.43490…
# Search for N/A in ts_data (df2)
sum(is.na(ts_data))
## [1] 0
# Identify columns with missing values
missing <- colSums(is.na(data)) > 0
missing_col <- names(data)[missing]
missing_col
## [1] "Empleo" "Educacion" "Innovacion"
## [4] "Inseguridad_Homicidio" "CO2_Emisiones"
#After examining our dataset, we noticed several NAs, which we need to impute for the model to be more precise. We will use a machine-learning-based method to predict the missing values.
# library(Hmisc) allows multiple imputation with bootstrapping, additive regression, and predictive mean matching
# The named variables specify the predictors for imputation
data_imp <- aregImpute(~ Year + IED_Flujos + Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,
n.impute = 2, # generate two imputed datasets
type = 'pmm', # specifies imputation method (predictive mean matching)
data = data) # your data frame
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
# Display the information of the model we created
# We can see which columns have NAs
# The R-squared shows the quality of the imputation models; the closer the value is to 1, the better the prediction.
data_imp
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Year + IED_Flujos + Exportaciones + Empleo +
## Educacion + Salario_Diario + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera +
## Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,
## data = data, n.impute = 2, type = "pmm")
##
## n: 26 p: 15 Imputations: 2 nk: 3
##
## Number of NAs:
## Year IED_Flujos Exportaciones
## 0 0 0
## Empleo Educacion Salario_Diario
## 3 3 0
## Innovacion Inseguridad_Robo Inseguridad_Homicidio
## 2 0 1
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion
## 0 0 0
## CO2_Emisiones PIB_Per_Capita INPC
## 3 0 0
##
## type d.f.
## Year s 2
## IED_Flujos s 2
## Exportaciones s 2
## Empleo s 2
## Educacion s 2
## Salario_Diario s 2
## Innovacion s 2
## Inseguridad_Robo s 2
## Inseguridad_Homicidio s 2
## Tipo_de_Cambio s 2
## Densidad_Carretera s 2
## Densidad_Poblacion s 2
## CO2_Emisiones s 1
## PIB_Per_Capita s 2
## INPC s 2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Empleo Educacion Innovacion
## 1 1 1
## Inseguridad_Homicidio CO2_Emisiones
## 1 1
# Replace the NAs of each column with the imputation model values
# The column Educacion was not imputed this way, since the values the model gave us did not match the trend of the data.
# For Educacion we used another imputation method (see below).
data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo
## Warning in data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo: number
## of items to replace is not a multiple of replacement length
data$Innovacion[is.na(data$Innovacion)] <- data_imp$imputed$Innovacion
## Warning in data$Innovacion[is.na(data$Innovacion)] <-
## data_imp$imputed$Innovacion: number of items to replace is not a multiple of
## replacement length
data$Inseguridad_Homicidio[is.na(data$Inseguridad_Homicidio)] <- data_imp$imputed$Inseguridad_Homicidio
## Warning in data$Inseguridad_Homicidio[is.na(data$Inseguridad_Homicidio)] <-
## data_imp$imputed$Inseguridad_Homicidio: number of items to replace is not a
## multiple of replacement length
data$CO2_Emisiones[is.na(data$CO2_Emisiones)] <- data_imp$imputed$CO2_Emisiones
## Warning in data$CO2_Emisiones[is.na(data$CO2_Emisiones)] <-
## data_imp$imputed$CO2_Emisiones: number of items to replace is not a multiple of
## replacement length
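# Note (sketch): the warnings above occur because with n.impute = 2 each element
# of data_imp$imputed is a matrix with one column per imputation (rows = the
# missing observations). Selecting a single column makes the replacement length
# match the number of NAs and avoids the warning, e.g.:
# data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo[, 2]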
#verify the changes made
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.31 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.43 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.56 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.68 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.80 39.7 12.8
## 7 2003 18250. 13156 97.1 7.93 41.5 11.8
## 8 2004 25016. 13573. 96.5 8.04 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.14 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.26 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
# For Educacion we used linear interpolation (approx), so the imputed values follow the pattern of the data
imp_edu <- approx(seq_along(data$Educacion),data$Educacion, method = "linear", n = length(data$Educacion))$y
data$Educacion <- imp_edu
#We can see that the missing-value issue is fixed, so we can proceed with our code.
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
#Verify there isn't any NA left
sum(is.na(data))
## [1] 0
#Now that we have clean data, we can start the complete Exploratory Data Analysis
# Display dataset
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
#Identify missing values. There shouldn't be any, since they were already imputed
sum(is.na(data))
## [1] 0
#Check the type of data along with the name of each column
glimpse(data)
## Rows: 26
## Columns: 15
## $ Year <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, …
## $ IED_Flujos <dbl> 12145.60, 8373.50, 13960.32, 18248.69, 30057.18,…
## $ Exportaciones <dbl> 9087.616, 9875.065, 10990.013, 12482.963, 11300.…
## $ Empleo <dbl> 97.83000, 97.83000, 97.83000, 97.83000, 97.36000…
## $ Educacion <dbl> 7.197968, 7.297952, 7.400003, 7.512453, 7.621500…
## $ Salario_Diario <dbl> 24.30, 31.91, 31.91, 35.12, 37.57, 39.74, 41.53,…
## $ Innovacion <dbl> 11.30141, 11.37221, 12.45897, 13.14556, 13.46812…
## $ Inseguridad_Robo <dbl> 266.5065, 314.7762, 272.8936, 216.9808, 214.5269…
## $ Inseguridad_Homicidio <dbl> 14.554142, 14.319399, 12.641071, 10.857846, 10.2…
## $ Tipo_de_Cambio <dbl> 8.0640, 9.9395, 9.5222, 9.5997, 9.1692, 10.3613,…
## $ Densidad_Carretera <dbl> 0.05205217, 0.05295475, 0.05501698, 0.05522774, …
## $ Densidad_Poblacion <dbl> 47.43650, 48.76163, 49.48089, 50.57930, 51.27675…
## $ CO2_Emisiones <dbl> 3.675330, 3.853045, 3.686918, 3.874147, 3.811393…
## $ PIB_Per_Capita <dbl> 127570.1, 126738.8, 129164.7, 130874.9, 128083.4…
## $ INPC <dbl> 33.27987, 39.47297, 44.33552, 48.30767, 50.43490…
#Descriptive statistics (mean, median, standard deviation, minimum, maximum)
describe(data)
## vars n mean sd median trimmed mad
## Year 1 26 2009.50 7.65 2009.50 2009.50 9.64
## IED_Flujos 2 26 26770.12 8770.68 27697.59 26860.64 9079.25
## Exportaciones 3 26 23600.91 11346.13 21187.89 22875.85 13370.71
## Empleo 4 26 96.63 0.85 96.61 96.65 0.88
## Educacion 5 26 8.42 0.72 8.46 8.43 0.88
## Salario_Diario 6 26 65.16 35.85 54.48 60.16 22.51
## Innovacion 7 26 12.97 1.18 12.91 12.93 1.15
## Inseguridad_Robo 8 26 185.42 47.67 181.83 181.16 47.06
## Inseguridad_Homicidio 9 26 17.19 7.14 15.74 16.87 8.70
## Tipo_de_Cambio 10 26 13.91 4.15 13.02 13.78 4.25
## Densidad_Carretera 11 26 0.07 0.01 0.07 0.07 0.02
## Densidad_Poblacion 12 26 57.33 5.41 58.09 57.44 6.68
## CO2_Emisiones 13 26 3.91 0.20 3.89 3.91 0.31
## PIB_Per_Capita 14 26 138550.10 8861.10 136845.30 138255.64 11080.42
## INPC 15 26 75.17 24.81 73.35 74.45 27.14
## min max range skew kurtosis se
## Year 1997.00 2022.00 25.00 0.00 -1.34 1.50
## IED_Flujos 8373.50 48354.42 39980.92 -0.02 -0.08 1720.07
## Exportaciones 9087.62 46477.58 37389.97 0.46 -1.08 2225.16
## Empleo 95.06 97.83 2.77 -0.15 -1.13 0.17
## Educacion 7.20 9.58 2.38 -0.09 -1.28 0.14
## Salario_Diario 24.30 172.87 148.57 1.43 1.44 7.03
## Innovacion 11.28 15.11 3.83 0.16 -1.01 0.23
## Inseguridad_Robo 120.49 314.78 194.28 0.89 0.30 9.35
## Inseguridad_Homicidio 8.04 29.59 21.56 0.41 -1.27 1.40
## Tipo_de_Cambio 8.06 20.66 12.60 0.44 -1.39 0.81
## Densidad_Carretera 0.05 0.09 0.04 0.18 -1.50 0.00
## Densidad_Poblacion 47.44 65.60 18.16 -0.19 -1.24 1.06
## CO2_Emisiones 3.59 4.22 0.63 0.06 -1.38 0.04
## PIB_Per_Capita 126738.75 153235.73 26496.98 0.28 -1.41 1737.81
## INPC 33.28 126.48 93.20 0.26 -0.95 4.87
summary(data)
## Year IED_Flujos Exportaciones Empleo Educacion
## Min. :1997 Min. : 8374 Min. : 9088 Min. :95.06 Min. :7.198
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:13260 1st Qu.:96.09 1st Qu.:7.864
## Median :2010 Median :27698 Median :21188 Median :96.61 Median :8.457
## Mean :2010 Mean :26770 Mean :23601 Mean :96.63 Mean :8.424
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:31601 3rd Qu.:97.22 3rd Qu.:9.004
## Max. :2022 Max. :48354 Max. :46478 Max. :97.83 Max. :9.579
## Salario_Diario Innovacion Inseguridad_Robo Inseguridad_Homicidio
## Min. : 24.30 Min. :11.28 Min. :120.5 Min. : 8.037
## 1st Qu.: 41.97 1st Qu.:12.19 1st Qu.:148.3 1st Qu.:10.402
## Median : 54.48 Median :12.91 Median :181.8 Median :15.741
## Mean : 65.16 Mean :12.97 Mean :185.4 Mean :17.187
## 3rd Qu.: 72.31 3rd Qu.:13.60 3rd Qu.:209.9 3rd Qu.:22.346
## Max. :172.87 Max. :15.11 Max. :314.8 Max. :29.592
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## Min. : 8.064 Min. :0.05205 Min. :47.44 Min. :3.592
## 1st Qu.:10.752 1st Qu.:0.05954 1st Qu.:52.78 1st Qu.:3.718
## Median :13.016 Median :0.06989 Median :58.09 Median :3.894
## Mean :13.910 Mean :0.07106 Mean :57.33 Mean :3.915
## 3rd Qu.:18.489 3rd Qu.:0.08275 3rd Qu.:61.39 3rd Qu.:4.088
## Max. :20.664 Max. :0.09020 Max. :65.60 Max. :4.221
## PIB_Per_Capita INPC
## Min. :126739 Min. : 33.28
## 1st Qu.:130964 1st Qu.: 56.15
## Median :136845 Median : 73.35
## Mean :138550 Mean : 75.17
## 3rd Qu.:146148 3rd Qu.: 91.29
## Max. :153236 Max. :126.48
#Graph 1
# Histogram of dependent variable
hist(data$IED_Flujos, main = "Histogram of Foreign Direct Investment")
#With this graph we can see that the frequency distribution of the IED_Flujos variable is relatively normal; the only issue is a specific range with no recorded data.
#To avoid this, we apply the natural logarithm to the variable: the blank range is no longer visible, while the distribution remains roughly normal.
hist(log(data$IED_Flujos), main = "Histogram of Foreign Direct Investment - Log")
#Graph 2.
#Bar graph mapping Foreign Investment and GDP through the last years
graph<-data %>% group_by(Year)
graph
## # A tibble: 26 × 15
## # Groups: Year [26]
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
ggplot(data=graph,aes(x=Year,y=IED_Flujos,fill=PIB_Per_Capita)) +
geom_bar(stat="identity") + coord_flip() +
labs(
y="Foreign Investment",
x="Year",
fill= 'GDP',
title="Foreign Investment and GDP")
# Observing the graph, we can see how foreign investment has grown in the country over the years. At first glance, GDP was at its peak between 2015 and 2019. After the largest foreign investment inflow, in 2013, GDP started to increase, which suggests an association between foreign investment and GDP growth, although this alone does not establish causality.
#Graph 3
# Correlation matrix
library(ggplot2)
library(ggcorrplot)
corre <- round(cor(subset(data,select = -c(Year, Densidad_Carretera, INPC, PIB_Per_Capita, Educacion))), 1)
ggcorrplot(corre, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("pink", "white", "lightgreen"),
title="Correlogram",
ggtheme=theme_bw)
# In this matrix, we can see the correlation between every pair of variables. CO2 emissions show essentially no relationship with foreign investment, while on the other hand population density and exports are the most correlated with it.
# It is worth emphasizing that correlation does not imply causality; we are just checking for relationships between variables.
#Graph 4 Scatterplot
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(data, aes(Exportaciones, IED_Flujos))
g + geom_point() +
geom_smooth(method="lm", se=F) +
labs(
y="Foreign Investment",
x="Exportations",
title="Exportations vs Foreign Investment",
)
## `geom_smooth()` using formula = 'y ~ x'
#We can clearly see in the graph that there is a positive correlation between both variables, as the trend line shows: when foreign investment increases, exports also increase.
# Graph 5 Counts Plot
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(data, aes(Inseguridad_Robo, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investments",
x="Theft Danger",
title="Theft Danger & Foreign Investments")
# We can observe how the dots in the graph trend progressively downward, suggesting a negative trend line: the higher the theft insecurity, the lower the foreign investment.
Hypothesis 1: The variable 'Inversiones Extranjeras Directas' is positively correlated with the variable 'Exportaciones'. This means that when Inversiones Extranjeras increase, Exportaciones also increase.
Hypothesis 2: The variable 'Inversiones Extranjeras Directas' is not correlated at all with 'CO2_Emisiones'. This means that these variables do not affect each other.
Hypothesis 3: Inversiones Extranjeras decrease when Inseguridad_Homicidio increases, meaning that they have a negative correlation.
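Before modeling, these hypotheses can be screened with simple pairwise correlation tests; a quick sketch (with only 26 observations, the p-values are indicative at best):

# Pairwise Pearson correlation tests for each hypothesis (sketch)
cor.test(data$IED_Flujos, data$Exportaciones)          # Hypothesis 1
cor.test(data$IED_Flujos, data$CO2_Emisiones)          # Hypothesis 2
cor.test(data$IED_Flujos, data$Inseguridad_Homicidio)  # Hypothesis 3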
#Estimate five different regression models:
# - linear regression
# - linear regression with the natural logarithm applied to the dependent variable
# - polynomial regression
# - lasso regression model
# - ridge regression model
#An important observation: the variable Year was not included in the models. If Year were included, the model would become a time series, and for now we are only looking for correlation.
#Linear regression
linear_regression <- lm(IED_Flujos ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5319.2 -1757.1 -515.3 1720.7 5071.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.717e+05 4.358e+05 -2.229 0.04566 *
## Exportaciones -1.035e+00 6.479e-01 -1.597 0.13617
## Empleo 6.279e+03 3.175e+03 1.977 0.07142 .
## Educacion 1.751e+05 4.408e+04 3.973 0.00185 **
## Salario_Diario -3.695e+01 2.089e+02 -0.177 0.86256
## Innovacion 5.944e+03 1.646e+03 3.611 0.00357 **
## Inseguridad_Robo 1.522e+01 4.496e+01 0.338 0.74089
## Inseguridad_Homicidio 1.529e+03 5.305e+02 2.883 0.01377 *
## Tipo_de_Cambio -2.631e+03 8.164e+02 -3.223 0.00731 **
## Densidad_Carretera -1.496e+06 1.139e+06 -1.313 0.21361
## Densidad_Poblacion -1.865e+04 4.745e+03 -3.931 0.00200 **
## CO2_Emisiones 1.636e+04 1.396e+04 1.172 0.26392
## PIB_Per_Capita -6.483e-01 5.877e-01 -1.103 0.29161
## INPC 9.981e+02 8.324e+02 1.199 0.25367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3789 on 12 degrees of freedom
## Multiple R-squared: 0.9104, Adjusted R-squared: 0.8134
## F-statistic: 9.384 on 13 and 12 DF, p-value: 0.00022
RMSE(linear_regression$fitted.values, data$IED_Flujos)
## [1] 2573.814
#Linear regression with the natural logarithm applied to the dependent variable
linear_log<-lm(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,data=data)
summary(linear_log)
##
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.239837 -0.067162 -0.002574 0.070082 0.240449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.778e+01 1.876e+01 -0.948 0.36193
## Exportaciones -4.863e-05 2.789e-05 -1.744 0.10668
## Empleo 2.011e-01 1.367e-01 1.472 0.16679
## Educacion 5.048e+00 1.897e+00 2.661 0.02076 *
## Salario_Diario -5.549e-03 8.992e-03 -0.617 0.54871
## Innovacion 2.338e-01 7.084e-02 3.301 0.00633 **
## Inseguridad_Robo -1.919e-03 1.935e-03 -0.992 0.34101
## Inseguridad_Homicidio 5.946e-02 2.283e-02 2.604 0.02304 *
## Tipo_de_Cambio -9.908e-02 3.513e-02 -2.820 0.01545 *
## Densidad_Carretera -4.177e+01 4.902e+01 -0.852 0.41079
## Densidad_Poblacion -6.227e-01 2.042e-01 -3.049 0.01010 *
## CO2_Emisiones 5.212e-01 6.009e-01 0.867 0.40276
## PIB_Per_Capita -2.195e-05 2.529e-05 -0.868 0.40241
## INPC 6.400e-02 3.583e-02 1.786 0.09931 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.163 on 12 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.8208
## F-statistic: 9.806 on 13 and 12 DF, p-value: 0.0001759
RMSE(linear_log$fitted.values, data$IED_Flujos)
## [1] 28107.96
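# Note (sketch): linear_log$fitted.values are on the log scale, so the RMSE above
# compares log-scale predictions against raw IED_Flujos and is inflated.
# Back-transforming first puts the error on the original scale
# (the same caveat applies to the polynomial model below):
RMSE(exp(linear_log$fitted.values), data$IED_Flujos)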
##Polynomial regression
#To decide which variables to raise to a power, we first consider the theoretical reasons why each might have polynomial (quadratic or cubic) behaviour.
# In other words, we are looking for variables whose behaviour forms a curve (goes up, reaches a certain point, and comes down again), possibly going up again (cubic behaviour).
# After this analysis, the variables I considered most fitting are PIB_Per_Capita, Tipo_de_Cambio, and CO2_Emisiones.
# Once the candidate variables are defined, we graph them to see whether this theoretical behaviour is visible.
theme_set(theme_bw())
g <- ggplot(data, aes(PIB_Per_Capita, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="GDP",
title="Counts Plot")
theme_set(theme_bw())
g <- ggplot(data, aes(Tipo_de_Cambio, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="Exchange Rate",
title="Counts Plot")
theme_set(theme_bw())
g <- ggplot(data, aes(CO2_Emisiones, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="C02 Emissions",
title="Counts Plot")
#The graphs of PIB_Per_Capita and Tipo_de_Cambio show a somewhat polynomial behavior, so we can use those variables in the model. However, CO2_Emisiones does not show any such pattern, so we discard this variable.
#We therefore fit a polynomial regression with a quadratic (2nd-degree) term for PIB_Per_Capita.
#### It is IMPORTANT to mention that this data set has a limited number of records, so identifying polynomial behavior just by looking at a graph is neither certain nor accurate.
# Polynomial Regression Model
polynomial_regression <- lm(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + I(PIB_Per_Capita^2) + INPC, data=data)
summary(polynomial_regression)
##
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + I(PIB_Per_Capita^2) + INPC,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.245352 -0.065304 0.002983 0.072802 0.228259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.312e+00 4.511e+01 -0.206 0.8402
## Exportaciones -4.446e-05 3.529e-05 -1.260 0.2338
## Empleo 1.936e-01 1.469e-01 1.318 0.2143
## Educacion 5.036e+00 1.978e+00 2.545 0.0272 *
## Salario_Diario -5.285e-03 9.458e-03 -0.559 0.5875
## Innovacion 2.387e-01 7.742e-02 3.083 0.0104 *
## Inseguridad_Robo -2.190e-03 2.402e-03 -0.912 0.3814
## Inseguridad_Homicidio 6.247e-02 2.785e-02 2.243 0.0465 *
## Tipo_de_Cambio -1.018e-01 3.884e-02 -2.620 0.0238 *
## Densidad_Carretera -4.770e+01 5.849e+01 -0.816 0.4321
## Densidad_Poblacion -6.226e-01 2.129e-01 -2.925 0.0138 *
## CO2_Emisiones 5.690e-01 6.672e-01 0.853 0.4119
## PIB_Per_Capita -1.318e-04 5.280e-04 -0.250 0.8075
## I(PIB_Per_Capita^2) 3.944e-10 1.894e-09 0.208 0.8388
## INPC 6.480e-02 3.754e-02 1.726 0.1123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.17 on 11 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.8052
## F-statistic: 8.383 on 14 and 11 DF, p-value: 0.0005698
RMSE(polynomial_regression$fitted.values, data$IED_Flujos)
## [1] 28107.96
### Diagnostic Tests
# Model 1 - Linear regression
vif(linear_regression)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_regression)
##
## studentized Breusch-Pagan test
##
## data: linear_regression
## BP = 9.3084, df = 13, p-value = 0.7493
AIC(linear_regression)
## [1] 512.1483
histogram(linear_regression$residuals)
# Model 2 - Linear regression with the natural logarithm applied to the dependent variable
vif(linear_log)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_log)
##
## studentized Breusch-Pagan test
##
## data: linear_log
## BP = 7.8148, df = 13, p-value = 0.8555
AIC(linear_log)
## [1] -10.6308
histogram(linear_log$residuals)
# Model 3 - Polynomial regression
vif(polynomial_regression)
## Exportaciones Empleo Educacion
## 138.778656 13.361240 1733.310924
## Salario_Diario Innovacion Inseguridad_Robo
## 99.490502 7.214584 11.343616
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 34.191177 22.484306 530.050990
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.114283 15.528332 18944.483802
## I(PIB_Per_Capita^2) INPC
## 19061.970472 750.789865
bptest(polynomial_regression)
##
## studentized Breusch-Pagan test
##
## data: polynomial_regression
## BP = 9.6019, df = 14, p-value = 0.7907
AIC(polynomial_regression)
## [1] -8.733148
histogram(polynomial_regression$residuals)
### Split the Data in Training Data vs Test Data
set.seed(123)
training.samples<-data$IED_Flujos %>%
createDataPartition(p=0.75,list=FALSE)
train.data<-data[training.samples, ]
test.data<-data[-training.samples, ]
#################################
### LASSO REGRESSION ANALYSIS ###
#################################
# Independent variables
x<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,train.data)[,-1] ### OLS model specification
y<-train.data$IED_Flujos ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso <- cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 362.8589
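# Sketch: an alternative, more conservative choice is lambda.1se -- the largest
# lambda whose cross-validated error is within one standard error of the minimum;
# it yields a sparser, more regularized model
cv.lasso$lambda.1se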
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
lassomodel
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = cv.lasso$lambda.min)
##
## Df %Dev Lambda
## 1 5 80.58 362.9
# Display regression coefficients
coef(lassomodel)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -7549.186773
## Exportaciones .
## Empleo .
## Educacion .
## Salario_Diario 9.081805
## Innovacion 2650.871459
## Inseguridad_Robo -20.895547
## Inseguridad_Homicidio .
## Tipo_de_Cambio .
## Densidad_Carretera .
## Densidad_Poblacion .
## CO2_Emisiones -3155.292307
## PIB_Per_Capita .
## INPC 196.294087
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,test.data)[,-1] ### OLS model specification
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(lassopredictions, test.data$IED_Flujos),
Rsquare = R2(lassopredictions, test.data$IED_Flujos))
## RMSE Rsquare
## 1 10445.54 0.5707237
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])+ offset_x
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
#################################
### RIDGE REGRESSION ANALYSIS ###
#################################
# Find the best lambda using cross-validation
set.seed(123) # x: independent variables | y: dependent variable
cv.ridge <- cv.glmnet(x,y,alpha=0) # alpha = 0 for RIDGE
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.ridge$lambda.min # lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 1723.874
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)
# Display regression coefficients
coef(ridgemodel)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.637912e+04
## Exportaciones 3.203238e-02
## Empleo 4.017246e+01
## Educacion 1.352587e+03
## Salario_Diario 2.418552e+01
## Innovacion 2.105752e+03
## Inseguridad_Robo -2.619092e+01
## Inseguridad_Homicidio -4.659805e+01
## Tipo_de_Cambio 2.373136e+00
## Densidad_Carretera 4.882539e+04
## Densidad_Poblacion 1.646285e+02
## CO2_Emisiones -3.796542e+03
## PIB_Per_Capita 8.648935e-02
## INPC 4.079777e+01
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,test.data)[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(ridgepredictions, test.data$IED_Flujos),
Rsquare = R2(ridgepredictions, test.data$IED_Flujos)
)
## RMSE Rsquare
## 1 10544.56 0.5296032
### visualizing ridge regression results
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)
##HETEROSCEDASTICITY
# Using the BP test, we check for heteroscedasticity (residuals with a non-constant variance / unequal scatter) by looking at the p-value it gives us.
#H0: Homoscedasticity is present
#HA: Heteroscedasticity is present
# It is very important that our models exhibit HOMOSCEDASTICITY for the model to be more precise.
#If the p-value is less than 5% (0.05), we reject the null hypothesis, which means heteroscedasticity is present in the model.
#Looking at every model's BP test we can see:
# - Linear regression: p-value = 0.7493. H0 - homoscedasticity is present
# - Linear regression with natural logarithm: p-value = 0.8555. H0 - homoscedasticity is present
# - Polynomial regression: p-value = 0.7907. H0 - homoscedasticity is present
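# Sketch: the three Breusch-Pagan p-values can also be collected in one step
sapply(list(linear = linear_regression,
            log    = linear_log,
            poly   = polynomial_regression),
       function(m) bptest(m)$p.value)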
### MODEL ACCURACY
#Now that we have confirmed that none of our models shows heteroscedasticity, we can compare the models and decide which one fits best.
#We will take into consideration:
# - RMSE (Root Mean Square Error)
# - R2
# RMSE - Root Mean Square Error: a larger value indicates a larger gap between the predicted and observed values,
# which means a poor regression fit. Conversely, a smaller RMSE indicates a better model.
# Create a new variable in order to compare the quality and reliability of each of the regression models.
tab <- matrix(c(2573.814,28107.96,28107.95,10445.54,10544.56, 0.8134, 0.8208, 0.8405,0.5707,0.5296), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Linear Regresion (Log)','Polynomial', 'Lasso', 'Ridge')
tab <- as.table(tab)
tab
## RMSE R2
## Linear Regression 2573.8140 0.8134
## Linear Regresion (Log) 28107.9600 0.8208
## Polynomial 28107.9500 0.8405
## Lasso 10445.5400 0.5707
## Ridge 10544.5600 0.5296
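# Sketch: the same table can be derived from the model objects themselves, so the
# numbers stay in sync with the code above (fitted values for the three lm models,
# test-set predictions for lasso/ridge; note that the R2 definitions are mixed)
rmse_vals <- c(RMSE(linear_regression$fitted.values, data$IED_Flujos),
               RMSE(linear_log$fitted.values, data$IED_Flujos),
               RMSE(polynomial_regression$fitted.values, data$IED_Flujos),
               RMSE(lassopredictions, test.data$IED_Flujos),
               RMSE(ridgepredictions, test.data$IED_Flujos))
r2_vals <- c(summary(linear_regression)$adj.r.squared,
             summary(linear_log)$adj.r.squared,
             summary(polynomial_regression)$adj.r.squared,
             R2(lassopredictions, test.data$IED_Flujos),
             R2(ridgepredictions, test.data$IED_Flujos))
round(cbind(RMSE = rmse_vals, R2 = r2_vals), 4)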
##Next, we are going to check for multicollinearity.
##MULTICOLLINEARITY
# We identify this phenomenon by checking the variance inflation factors (VIF).
# Ideally, the VIF values of the variables in our models should be less than 10.
# After reviewing all the VIF tests, we conclude that the model with the most multicollinearity is clearly the polynomial one. Since the other two are fairly similar, we decided to use the LINEAR REGRESSION, since it has the best RMSE with a solid R2.
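# Sketch: the high-VIF predictors can also be flagged programmatically
high_vif <- vif(linear_regression)
names(high_vif[high_vif > 10])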
# Let's take another look at the model we chose.
#Linear regression
linear_regression <- lm(IED_Flujos ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5319.2 -1757.1 -515.3 1720.7 5071.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.717e+05 4.358e+05 -2.229 0.04566 *
## Exportaciones -1.035e+00 6.479e-01 -1.597 0.13617
## Empleo 6.279e+03 3.175e+03 1.977 0.07142 .
## Educacion 1.751e+05 4.408e+04 3.973 0.00185 **
## Salario_Diario -3.695e+01 2.089e+02 -0.177 0.86256
## Innovacion 5.944e+03 1.646e+03 3.611 0.00357 **
## Inseguridad_Robo 1.522e+01 4.496e+01 0.338 0.74089
## Inseguridad_Homicidio 1.529e+03 5.305e+02 2.883 0.01377 *
## Tipo_de_Cambio -2.631e+03 8.164e+02 -3.223 0.00731 **
## Densidad_Carretera -1.496e+06 1.139e+06 -1.313 0.21361
## Densidad_Poblacion -1.865e+04 4.745e+03 -3.931 0.00200 **
## CO2_Emisiones 1.636e+04 1.396e+04 1.172 0.26392
## PIB_Per_Capita -6.483e-01 5.877e-01 -1.103 0.29161
## INPC 9.981e+02 8.324e+02 1.199 0.25367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3789 on 12 degrees of freedom
## Multiple R-squared: 0.9104, Adjusted R-squared: 0.8134
## F-statistic: 9.384 on 13 and 12 DF, p-value: 0.00022
RMSE(linear_regression$fitted.values, data$IED_Flujos)
## [1] 2573.814
### Diagnostic Tests
vif(linear_regression)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_regression)
##
## studentized Breusch-Pagan test
##
## data: linear_regression
## BP = 9.3084, df = 13, p-value = 0.7493
AIC(linear_regression)
## [1] 512.1483
histogram(linear_regression$residuals)
#We will eliminate the variables that do not have a significant impact on the model, using the correlation chart
corre2 <- round(cor(subset(data)), 1)
ggcorrplot(corre2, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("pink", "white", "lightgreen"),
title="Correlogram",
ggtheme=theme_bw)
#We discovered that the variables with the least correlation were: CO2_Emisiones, Inseguridad_Homicidio and Innovacion.
#Experimenting with the model, we also removed the variables: Exportaciones, Educacion, Salario_Diario, Densidad_Carretera, Densidad_Poblacion
#New model
linear_regression2 <- lm(IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression2)
##
## Call:
## lm(formula = IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones +
## PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7992.0 -1593.4 238.2 1772.4 14270.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.083e+05 3.150e+05 0.344 0.7352
## Empleo -1.287e+03 2.925e+03 -0.440 0.6655
## Innovacion 3.185e+03 1.813e+03 1.757 0.0969 .
## Inseguridad_Robo -3.372e+00 4.365e+01 -0.077 0.9393
## Inseguridad_Homicidio -7.429e+01 4.071e+02 -0.183 0.8573
## Tipo_de_Cambio -2.004e+03 9.725e+02 -2.061 0.0550 .
## CO2_Emisiones -1.183e+04 1.213e+04 -0.975 0.3431
## PIB_Per_Capita 3.230e-01 4.226e-01 0.764 0.4551
## INPC 4.372e+02 1.979e+02 2.209 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5414 on 17 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.6189
## F-statistic: 6.075 on 8 and 17 DF, p-value: 0.0008996
RMSE(linear_regression2$fitted.values, data$IED_Flujos)
## [1] 4378.015
### Diagnostic Tests: New Model
vif(linear_regression2)
## Empleo Innovacion Inseguridad_Robo
## 5.218425 3.898395 3.691898
## Inseguridad_Homicidio Tipo_de_Cambio CO2_Emisiones
## 7.197865 13.887651 5.059914
## PIB_Per_Capita INPC
## 11.957778 20.559747
bptest(linear_regression2)
##
## studentized Breusch-Pagan test
##
## data: linear_regression2
## BP = 10.168, df = 8, p-value = 0.2534
AIC(linear_regression2)
## [1] 529.771
histogram(linear_regression2$residuals)
#Testing our model again, we can see that the multicollinearity was significantly reduced.
# We now have a total of five variables with no multicollinearity, each with a VIF below 10.
#Only three variables retain a small amount of multicollinearity, with VIF values ranging from roughly 12 to 21.
#Now that we have the best version of the model, we can evaluate our hypotheses.
summary(linear_regression2)
##
## Call:
## lm(formula = IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones +
## PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7992.0 -1593.4 238.2 1772.4 14270.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.083e+05 3.150e+05 0.344 0.7352
## Empleo -1.287e+03 2.925e+03 -0.440 0.6655
## Innovacion 3.185e+03 1.813e+03 1.757 0.0969 .
## Inseguridad_Robo -3.372e+00 4.365e+01 -0.077 0.9393
## Inseguridad_Homicidio -7.429e+01 4.071e+02 -0.183 0.8573
## Tipo_de_Cambio -2.004e+03 9.725e+02 -2.061 0.0550 .
## CO2_Emisiones -1.183e+04 1.213e+04 -0.975 0.3431
## PIB_Per_Capita 3.230e-01 4.226e-01 0.764 0.4551
## INPC 4.372e+02 1.979e+02 2.209 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5414 on 17 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.6189
## F-statistic: 6.075 on 8 and 17 DF, p-value: 0.0008996
#* Hypothesis 1: The variable Direct Foreign Investment is positively correlated with Exports; the more foreign investment in the country, the more exports it generates.
#*
#* To fit the model and reduce multicollinearity, the variable Exportaciones was removed from the model; however, in the data visualization section we observed a graph indicating that these variables do indeed correlate.
#* Hypothesis 2: The variable Direct Foreign Investment is not correlated at all with CO2_Emisiones. This means these variables do not affect each other.
#*
#* For CO2 emissions we observe a negative coefficient, but it is not statistically significant (p = 0.34), so we can state with reasonable confidence that there is no real correlation. Our hypothesis holds.
#* Hypothesis 3: Foreign investment decreases when the homicide rate increases, meaning they have a negative correlation.
#*
#* We can observe that the homicide variable is not significant at all; the model does not indicate any relation with foreign investment.
linear_regression2 <- lm(IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
colors <- c("blue", "green", "red", "purple")
plot(linear_regression2, which = 1, col = colors[1], pch = 19)
#Show the predicted values of the dependent variable
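# Sketch: one way to display the predicted values next to the observed ones
data.frame(Year      = data$Year,
           Observed  = data$IED_Flujos,
           Predicted = round(predict(linear_regression2, newdata = data), 1))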
After determining that our prediction model is the linear regression, and after adjusting it for multicollinearity, we gathered the following insights:
The variables most attractive to businesses seeking nearshoring in Mexico are mainly a low exchange rate and innovation. This can be seen graphically in the ridge analysis.
Cardona, S., & Cardona, S. (2023, August 10). Nearshoring en México: una oportunidad histórica para la economía del futuro. Forbes México. https://www.forbes.com.mx/nearshoring-en-mexico-una-oportunidad-historica-para-la-economia-del-futuro/
Nearshoring: La solución actual para el comercio exterior. (2021). @Thomsonreuters. https://www.thomsonreutersmexico.com/es-mx/soluciones-de-comercio-exterior/blog-comercio-exterior/nearshoring-la-solucion-actual-para-el-comercio-exteior
What Is Predictive Analytics? 5 Examples | HBS Online. (2021, October 26). Business Insights Blog. https://online.hbs.edu/blog/post/predictive-analytics
Tierney, N. (2023, February 2). Exploring Imputed Values. R-Project.org. https://cran.r-project.org/web/packages/naniar/vignettes/exploring-imputed-values.html
Imputation in R: Top 3 Ways for Imputing Missing Data - Machine Learning, R programming. (2023). Appsilon.com. https://appsilon.com/imputation-in-r/#missforest
(2013). R-Statistics.co. http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Counts%20Chart