Nearshoring is a practice commonly adopted by companies across many industries around the world. It consists of a company outsourcing its services and/or processes to companies in a nearby, cost-effective country rather than a more distant one.
Nearshoring in Mexico has become very popular, especially among companies in the USA and Canada. Mexico's proximity to these countries, together with the T-MEC (USMCA) agreement and its competitive cost of skilled labor, makes it an attractive nearshoring destination. Beyond these factors, the political and commercial conflicts between the US and China have pushed US imports from Mexico upward, and many American companies have shifted from offshoring in Asia to nearshoring in Mexico.
One famous example, announced recently and met with excitement among the Mexican population, is the arrival of Tesla's production plant. It is worth noting the economic benefits that the recent rise in nearshoring will bring to the country, from new infrastructure to many new job opportunities: through nearshoring, Mexican GDP is projected to grow by up to 2.5% over the next 6 years, alongside reaching 50 billion US dollars in foreign direct investment (Forbes, 2023).
Predictive analytics, as its name suggests, is the process of using and analyzing data to predict future outcomes or events and to identify trends. It uses historical data to forecast potential scenarios, usually through machine learning models. This analysis answers the question "What might happen in the future?".
Regression analysis is a tool commonly used in predictive analytics that models the relationship between one or more independent variables and a dependent variable. In other words, regression analysis can help you understand how different factors affect your outcome of interest.
Regression analysis can help predict nearshoring by examining the relationship between various independent variables and the likelihood of companies choosing Mexico as a nearshoring destination. Researchers can collect historical data on factors such as labor costs, exchange rates, political stability, infrastructure development, and trade agreements, all potential drivers of nearshoring decisions. By applying regression techniques, they can analyze how these variables have influenced nearshoring in the past and model the same variables going forward.
Through regression analysis, a predictive model can be developed that quantifies the impact of each variable on the likelihood of nearshoring in Mexico. This model can then be used to make predictions based on changes in these factors. By constantly updating the model with new data, businesses, governments, or any other interested party can gain insight into the evolving landscape of nearshoring decisions and make informed strategic choices.
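As a minimal illustrative sketch of this idea in R (all data and variable names here are hypothetical), such a model can be fit on historical drivers and then queried for a prediction:

set.seed(1)
toy <- data.frame(
  fdi        = rnorm(30, mean = 100, sd = 10),  # hypothetical FDI outcome
  labor_cost = rnorm(30, mean = 50, sd = 5),    # hypothetical driver
  exch_rate  = rnorm(30, mean = 18, sd = 1)     # hypothetical driver
)
fit <- lm(fdi ~ labor_cost + exch_rate, data = toy)
summary(fit)  # estimated effect of each driver on the outcome
predict(fit, newdata = data.frame(labor_cost = 52, exch_rate = 19))  # forecast a new scenario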
## Problem Situation

Maria, an econometrics analyst at a Mexican company, plans to use official data sources to create a database exploring conditions influencing nearshoring in Mexico, including socioeconomic, business, technological, environmental, and security factors. She aims to apply econometric models to predict the effects of nearshoring in Mexico, understand why Mexico is attractive for nearshoring, and identify opportunities for new business relocation in the country.
library(foreign)
library(dplyr) # data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Hmisc) # several useful functions for data analysis
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych) # functions for multivariate analysis
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(naniar) # summaries and visualization of missing values NA's
library(corrplot) # correlation plots
## corrplot 0.92 loaded
library(jtools) # presentation of regression analysis
##
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(lmtest) # diagnostic checks - linear regression analysis
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # diagnostic checks - linear regression analysis
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(olsrr) # diagnostic checks - linear regression analysis
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(naniar) # identifying missing values
library(stargazer) # create publication quality tables
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects) # displays for linear and other regression models
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(tidyverse) # collection of R packages designed for data science
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ car::recode() masks dplyr::recode()
## ✖ purrr::some() masks car::some()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # Classification and Regression Training
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet) # methods for prediction and plotting, and functions for cross-validation
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library("readxl") # Read Excel
library(caret) # Imputation
library(missForest) # Imputation
library(mice) # Imputation
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(Hmisc) # Imputation
data <- read_excel("C:\\Users\\maria\\OneDrive\\Desktop\\AD 2023\\ECONOMETRICS\\SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = 'Datos')
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. NA 7.20 24.3 11.3
## 2 1998 8374. 9875. NA 7.31 31.9 11.4
## 3 1999 13960. 10990. NA 7.43 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.56 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.68 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.80 39.7 12.8
## 7 2003 18250. 13156 97.1 7.93 41.5 11.8
## 8 2004 25016. 13573. 96.5 8.04 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.14 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.26 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
ts_data <- read_excel("C:\\Users\\maria\\OneDrive\\Desktop\\AD 2023\\ECONOMETRICS\\SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = 'ts.data')
ts_data
## # A tibble: 96 × 3
## Year Trimestre IED_Flujos
## <dbl> <chr> <dbl>
## 1 1999 I 3596.
## 2 1999 II 3396.
## 3 1999 III 3028.
## 4 1999 IV 3940.
## 5 2000 I 4601.
## 6 2000 II 4857.
## 7 2000 III 3057.
## 8 2000 IV 5734.
## 9 2001 I 3599.
## 10 2001 II 5219.
## # ℹ 86 more rows
# Search for N/A in data (df1)
sum(is.na(data))
## [1] 12
glimpse(data)
## Rows: 26
## Columns: 15
## $ Year <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, …
## $ IED_Flujos <dbl> 12145.60, 8373.50, 13960.32, 18248.69, 30057.18,…
## $ Exportaciones <dbl> 9087.616, 9875.065, 10990.013, 12482.963, 11300.…
## $ Empleo <dbl> NA, NA, NA, 97.83000, 97.36000, 97.66000, 97.060…
## $ Educacion <dbl> 7.197968, 7.311587, 7.427924, 7.560000, 7.678270…
## $ Salario_Diario <dbl> 24.30, 31.91, 31.91, 35.12, 37.57, 39.74, 41.53,…
## $ Innovacion <dbl> 11.30141, 11.37221, 12.45897, 13.14556, 13.46812…
## $ Inseguridad_Robo <dbl> 266.5065, 314.7762, 272.8936, 216.9808, 214.5269…
## $ Inseguridad_Homicidio <dbl> 14.554142, 14.319399, 12.641071, 10.857846, 10.2…
## $ Tipo_de_Cambio <dbl> 8.0640, 9.9395, 9.5222, 9.5997, 9.1692, 10.3613,…
## $ Densidad_Carretera <dbl> 0.05205217, 0.05295475, 0.05501698, 0.05522774, …
## $ Densidad_Poblacion <dbl> 47.43650, 48.76163, 49.48089, 50.57930, 51.27675…
## $ CO2_Emisiones <dbl> 3.675330, 3.853045, 3.686918, 3.874147, 3.811393…
## $ PIB_Per_Capita <dbl> 127570.1, 126738.8, 129164.7, 130874.9, 128083.4…
## $ INPC <dbl> 33.27987, 39.47297, 44.33552, 48.30767, 50.43490…
# Search for N/A in ts_data (df2)
sum(is.na(ts_data))
## [1] 0
# Identify columns with missing values
missing <- colSums(is.na(data)) > 0
missing_col <- names(data)[missing]
missing_col
## [1] "Empleo" "Educacion" "Innovacion"
## [4] "Inseguridad_Homicidio" "CO2_Emisiones"
#After examining our dataset, we noticed several NAs, which we need to impute for the model to be more precise. We will use a machine-learning-based method to predict the missing values.
# library(Hmisc) allows multiple imputation with bootstrapping, additive regression, and predictive mean matching
# The named variables specify the predictors for imputation
data_imp <- aregImpute(~ Year + IED_Flujos + Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,
n.impute = 2, # generate two imputed datasets
type = 'pmm', # specifies imputation method (predictive mean matching)
data = data) # your data frame
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
# Display the information of the model we created
# We can see which columns have NAs
# The R-squared shows the quality of the imputation models; the closer the value is to 1, the better the prediction.
data_imp
##
## Multiple Imputation using Bootstrap and PMM
##
## aregImpute(formula = ~Year + IED_Flujos + Exportaciones + Empleo +
## Educacion + Salario_Diario + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera +
## Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,
## data = data, n.impute = 2, type = "pmm")
##
## n: 26 p: 15 Imputations: 2 nk: 3
##
## Number of NAs:
## Year IED_Flujos Exportaciones
## 0 0 0
## Empleo Educacion Salario_Diario
## 3 3 0
## Innovacion Inseguridad_Robo Inseguridad_Homicidio
## 2 0 1
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion
## 0 0 0
## CO2_Emisiones PIB_Per_Capita INPC
## 3 0 0
##
## type d.f.
## Year s 2
## IED_Flujos s 2
## Exportaciones s 2
## Empleo s 2
## Educacion s 2
## Salario_Diario s 2
## Innovacion s 2
## Inseguridad_Robo s 2
## Inseguridad_Homicidio s 2
## Tipo_de_Cambio s 2
## Densidad_Carretera s 2
## Densidad_Poblacion s 2
## CO2_Emisiones s 1
## PIB_Per_Capita s 2
## INPC s 2
##
## Transformation of Target Variables Forced to be Linear
##
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## Empleo Educacion Innovacion
## 1 1 1
## Inseguridad_Homicidio CO2_Emisiones
## 1 1
# Replace the NAs of each column with the imputation model values
# The column Educacion was not imputed this way, since the values the model gave us did not match the trend of the data.
# For Educacion we used another imputation method (see below).
data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo
## Warning in data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo: number
## of items to replace is not a multiple of replacement length
data$Innovacion[is.na(data$Innovacion)] <- data_imp$imputed$Innovacion
## Warning in data$Innovacion[is.na(data$Innovacion)] <-
## data_imp$imputed$Innovacion: number of items to replace is not a multiple of
## replacement length
data$Inseguridad_Homicidio[is.na(data$Inseguridad_Homicidio)] <- data_imp$imputed$Inseguridad_Homicidio
## Warning in data$Inseguridad_Homicidio[is.na(data$Inseguridad_Homicidio)] <-
## data_imp$imputed$Inseguridad_Homicidio: number of items to replace is not a
## multiple of replacement length
data$CO2_Emisiones[is.na(data$CO2_Emisiones)] <- data_imp$imputed$CO2_Emisiones
## Warning in data$CO2_Emisiones[is.na(data$CO2_Emisiones)] <-
## data_imp$imputed$CO2_Emisiones: number of items to replace is not a multiple of
## replacement length
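# Note (sketch): the warnings above occur because with n.impute = 2 each element
# of data_imp$imputed is a matrix with one column per imputation (rows = the
# missing observations). Selecting a single column makes the replacement length
# match the number of NAs and avoids the warning, e.g.:
# data$Empleo[is.na(data$Empleo)] <- data_imp$imputed$Empleo[, 2]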
#verify the changes made
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.31 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.43 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.56 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.68 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.80 39.7 12.8
## 7 2003 18250. 13156 97.1 7.93 41.5 11.8
## 8 2004 25016. 13573. 96.5 8.04 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.14 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.26 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
# For Educacion we used linear interpolation (approx), so the imputed values follow the pattern of the data
imp_edu <- approx(seq_along(data$Educacion),data$Educacion, method = "linear", n = length(data$Educacion))$y
data$Educacion <- imp_edu
#We can see that the missing-value issue is fixed, so we can proceed with our code.
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
#Verify there isn't any NA left
sum(is.na(data))
## [1] 0
#Now that we have clean data, we can start the complete Exploratory Data Analysis
# Display dataset
data
## # A tibble: 26 × 15
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
#Identify missing values. There shouldn't be any, since they were already imputed
sum(is.na(data))
## [1] 0
#Check the type of data along with the name of each column
glimpse(data)
## Rows: 26
## Columns: 15
## $ Year <dbl> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, …
## $ IED_Flujos <dbl> 12145.60, 8373.50, 13960.32, 18248.69, 30057.18,…
## $ Exportaciones <dbl> 9087.616, 9875.065, 10990.013, 12482.963, 11300.…
## $ Empleo <dbl> 97.83000, 97.83000, 97.83000, 97.83000, 97.36000…
## $ Educacion <dbl> 7.197968, 7.297952, 7.400003, 7.512453, 7.621500…
## $ Salario_Diario <dbl> 24.30, 31.91, 31.91, 35.12, 37.57, 39.74, 41.53,…
## $ Innovacion <dbl> 11.30141, 11.37221, 12.45897, 13.14556, 13.46812…
## $ Inseguridad_Robo <dbl> 266.5065, 314.7762, 272.8936, 216.9808, 214.5269…
## $ Inseguridad_Homicidio <dbl> 14.554142, 14.319399, 12.641071, 10.857846, 10.2…
## $ Tipo_de_Cambio <dbl> 8.0640, 9.9395, 9.5222, 9.5997, 9.1692, 10.3613,…
## $ Densidad_Carretera <dbl> 0.05205217, 0.05295475, 0.05501698, 0.05522774, …
## $ Densidad_Poblacion <dbl> 47.43650, 48.76163, 49.48089, 50.57930, 51.27675…
## $ CO2_Emisiones <dbl> 3.675330, 3.853045, 3.686918, 3.874147, 3.811393…
## $ PIB_Per_Capita <dbl> 127570.1, 126738.8, 129164.7, 130874.9, 128083.4…
## $ INPC <dbl> 33.27987, 39.47297, 44.33552, 48.30767, 50.43490…
#Descriptive statistics (mean, median, standard deviation, minimum, maximum)
describe(data)
## vars n mean sd median trimmed mad
## Year 1 26 2009.50 7.65 2009.50 2009.50 9.64
## IED_Flujos 2 26 26770.12 8770.68 27697.59 26860.64 9079.25
## Exportaciones 3 26 23600.91 11346.13 21187.89 22875.85 13370.71
## Empleo 4 26 96.63 0.85 96.61 96.65 0.88
## Educacion 5 26 8.42 0.72 8.46 8.43 0.88
## Salario_Diario 6 26 65.16 35.85 54.48 60.16 22.51
## Innovacion 7 26 12.97 1.18 12.91 12.93 1.15
## Inseguridad_Robo 8 26 185.42 47.67 181.83 181.16 47.06
## Inseguridad_Homicidio 9 26 17.19 7.14 15.74 16.87 8.70
## Tipo_de_Cambio 10 26 13.91 4.15 13.02 13.78 4.25
## Densidad_Carretera 11 26 0.07 0.01 0.07 0.07 0.02
## Densidad_Poblacion 12 26 57.33 5.41 58.09 57.44 6.68
## CO2_Emisiones 13 26 3.91 0.20 3.89 3.91 0.31
## PIB_Per_Capita 14 26 138550.10 8861.10 136845.30 138255.64 11080.42
## INPC 15 26 75.17 24.81 73.35 74.45 27.14
## min max range skew kurtosis se
## Year 1997.00 2022.00 25.00 0.00 -1.34 1.50
## IED_Flujos 8373.50 48354.42 39980.92 -0.02 -0.08 1720.07
## Exportaciones 9087.62 46477.58 37389.97 0.46 -1.08 2225.16
## Empleo 95.06 97.83 2.77 -0.15 -1.13 0.17
## Educacion 7.20 9.58 2.38 -0.09 -1.28 0.14
## Salario_Diario 24.30 172.87 148.57 1.43 1.44 7.03
## Innovacion 11.28 15.11 3.83 0.16 -1.01 0.23
## Inseguridad_Robo 120.49 314.78 194.28 0.89 0.30 9.35
## Inseguridad_Homicidio 8.04 29.59 21.56 0.41 -1.27 1.40
## Tipo_de_Cambio 8.06 20.66 12.60 0.44 -1.39 0.81
## Densidad_Carretera 0.05 0.09 0.04 0.18 -1.50 0.00
## Densidad_Poblacion 47.44 65.60 18.16 -0.19 -1.24 1.06
## CO2_Emisiones 3.59 4.22 0.63 0.06 -1.38 0.04
## PIB_Per_Capita 126738.75 153235.73 26496.98 0.28 -1.41 1737.81
## INPC 33.28 126.48 93.20 0.26 -0.95 4.87
summary(data)
## Year IED_Flujos Exportaciones Empleo Educacion
## Min. :1997 Min. : 8374 Min. : 9088 Min. :95.06 Min. :7.198
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:13260 1st Qu.:96.09 1st Qu.:7.864
## Median :2010 Median :27698 Median :21188 Median :96.61 Median :8.457
## Mean :2010 Mean :26770 Mean :23601 Mean :96.63 Mean :8.424
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:31601 3rd Qu.:97.22 3rd Qu.:9.004
## Max. :2022 Max. :48354 Max. :46478 Max. :97.83 Max. :9.579
## Salario_Diario Innovacion Inseguridad_Robo Inseguridad_Homicidio
## Min. : 24.30 Min. :11.28 Min. :120.5 Min. : 8.037
## 1st Qu.: 41.97 1st Qu.:12.19 1st Qu.:148.3 1st Qu.:10.402
## Median : 54.48 Median :12.91 Median :181.8 Median :15.741
## Mean : 65.16 Mean :12.97 Mean :185.4 Mean :17.187
## 3rd Qu.: 72.31 3rd Qu.:13.60 3rd Qu.:209.9 3rd Qu.:22.346
## Max. :172.87 Max. :15.11 Max. :314.8 Max. :29.592
## Tipo_de_Cambio Densidad_Carretera Densidad_Poblacion CO2_Emisiones
## Min. : 8.064 Min. :0.05205 Min. :47.44 Min. :3.592
## 1st Qu.:10.752 1st Qu.:0.05954 1st Qu.:52.78 1st Qu.:3.718
## Median :13.016 Median :0.06989 Median :58.09 Median :3.894
## Mean :13.910 Mean :0.07106 Mean :57.33 Mean :3.915
## 3rd Qu.:18.489 3rd Qu.:0.08275 3rd Qu.:61.39 3rd Qu.:4.088
## Max. :20.664 Max. :0.09020 Max. :65.60 Max. :4.221
## PIB_Per_Capita INPC
## Min. :126739 Min. : 33.28
## 1st Qu.:130964 1st Qu.: 56.15
## Median :136845 Median : 73.35
## Mean :138550 Mean : 75.17
## 3rd Qu.:146148 3rd Qu.: 91.29
## Max. :153236 Max. :126.48
#Graph 1
# Histogram of dependent variable
hist(data$IED_Flujos, main = "Histogram of Foreign Direct Investment")
#With this graph we can see that the frequency distribution of the IED_Flujos variable is relatively normal; the only issue is a specific range with no recorded data.
#To avoid this, we apply the natural logarithm to the variable: the blank range is no longer visible, while the distribution remains roughly normal.
hist(log(data$IED_Flujos), main = "Histogram of Foreign Direct Investment - Log")
#Graph 2.
#Bar graph mapping Foreign Investment and GDP through the last years
graph<-data %>% group_by(Year)
graph
## # A tibble: 26 × 15
## # Groups: Year [26]
## Year IED_Flujos Exportaciones Empleo Educacion Salario_Diario Innovacion
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 12146. 9088. 97.8 7.20 24.3 11.3
## 2 1998 8374. 9875. 97.8 7.30 31.9 11.4
## 3 1999 13960. 10990. 97.8 7.40 31.9 12.5
## 4 2000 18249. 12483. 97.8 7.51 35.1 13.1
## 5 2001 30057. 11300. 97.4 7.62 37.6 13.5
## 6 2002 24099. 11923. 97.7 7.73 39.7 12.8
## 7 2003 18250. 13156 97.1 7.84 41.5 11.8
## 8 2004 25016. 13573. 96.5 7.94 43.3 12.6
## 9 2005 25796. 16466. 97.2 8.05 45.2 13.4
## 10 2006 21233. 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Poblacion <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Capita <dbl>, INPC <dbl>
ggplot(data=graph,aes(x=Year,y=IED_Flujos,fill=PIB_Per_Capita)) +
geom_bar(stat="identity") + coord_flip() +
labs(
y="Foreign Investment",
x="Year",
fill= 'GDP',
title="Foreign Investment and GDP")
# Observing the graph, we can see how foreign investment has grown in the country over the years. At first glance, GDP was at its peak between 2015 and 2019. After the largest foreign investment inflow, in 2013, GDP started to increase, which suggests an association between foreign investment and GDP growth, although this alone does not establish causality.
#Graph 3
# Correlation matrix
library(ggplot2)
library(ggcorrplot)
corre <- round(cor(subset(data,select = -c(Year, Densidad_Carretera, INPC, PIB_Per_Capita, Educacion))), 1)
ggcorrplot(corre, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("pink", "white", "lightgreen"),
title="Correlogram",
ggtheme=theme_bw)
# In this matrix, we can see the correlation between every pair of variables. CO2 emissions show essentially no relationship with foreign investment, while on the other hand population density and exports are the most correlated with it.
# It is worth emphasizing that correlation does not imply causality; we are just checking for relationships between variables.
#Graph 4 Scatterplot
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(data, aes(Exportaciones, IED_Flujos))
g + geom_point() +
geom_smooth(method="lm", se=F) +
labs(
y="Foreign Investment",
x="Exportations",
title="Exportations vs Foreign Investment",
)
## `geom_smooth()` using formula = 'y ~ x'
#We can clearly see in the graph that there is a positive correlation between both variables, as the trend line shows: when foreign investment increases, exports also increase.
# Graph 5 Counts Plot
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(data, aes(Inseguridad_Robo, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investments",
x="Theft Danger",
title="Theft Danger & Foreign Investments")
# We can observe how the dots in the graph trend progressively downward, suggesting a negative trend line: the higher the theft insecurity, the lower the foreign investment.
Hypothesis 1: The variable 'Inversiones Extranjeras Directas' is positively correlated with the variable 'Exportaciones'. This means that when Inversiones Extranjeras increase, Exportaciones also increase.
Hypothesis 2: The variable 'Inversiones Extranjeras Directas' is not correlated at all with 'CO2_Emisiones'. This means that these variables do not affect each other.
Hypothesis 3: Inversiones Extranjeras decrease when Inseguridad_Homicidio increases, meaning that they have a negative correlation.
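Before modeling, these hypotheses can be screened with simple pairwise correlation tests; a quick sketch (with only 26 observations, the p-values are indicative at best):

# Pairwise Pearson correlation tests for each hypothesis (sketch)
cor.test(data$IED_Flujos, data$Exportaciones)          # Hypothesis 1
cor.test(data$IED_Flujos, data$CO2_Emisiones)          # Hypothesis 2
cor.test(data$IED_Flujos, data$Inseguridad_Homicidio)  # Hypothesis 3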
#Estimate five different regression models:
# - linear regression
# - linear regression with the natural logarithm applied to the dependent variable
# - polynomial regression
# - lasso regression model
# - ridge regression model
#An important observation: the variable Year was not included in the models. If Year were included, the model would become a time series, and for now we are only looking for correlation.
#Linear regression
linear_regression <- lm(IED_Flujos ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5319.2 -1757.1 -515.3 1720.7 5071.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.717e+05 4.358e+05 -2.229 0.04566 *
## Exportaciones -1.035e+00 6.479e-01 -1.597 0.13617
## Empleo 6.279e+03 3.175e+03 1.977 0.07142 .
## Educacion 1.751e+05 4.408e+04 3.973 0.00185 **
## Salario_Diario -3.695e+01 2.089e+02 -0.177 0.86256
## Innovacion 5.944e+03 1.646e+03 3.611 0.00357 **
## Inseguridad_Robo 1.522e+01 4.496e+01 0.338 0.74089
## Inseguridad_Homicidio 1.529e+03 5.305e+02 2.883 0.01377 *
## Tipo_de_Cambio -2.631e+03 8.164e+02 -3.223 0.00731 **
## Densidad_Carretera -1.496e+06 1.139e+06 -1.313 0.21361
## Densidad_Poblacion -1.865e+04 4.745e+03 -3.931 0.00200 **
## CO2_Emisiones 1.636e+04 1.396e+04 1.172 0.26392
## PIB_Per_Capita -6.483e-01 5.877e-01 -1.103 0.29161
## INPC 9.981e+02 8.324e+02 1.199 0.25367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3789 on 12 degrees of freedom
## Multiple R-squared: 0.9104, Adjusted R-squared: 0.8134
## F-statistic: 9.384 on 13 and 12 DF, p-value: 0.00022
RMSE(linear_regression$fitted.values, data$IED_Flujos)
## [1] 2573.814
#Linear regression with the natural logarithm applied to the dependent variable
linear_log<-lm(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,data=data)
summary(linear_log)
##
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.239837 -0.067162 -0.002574 0.070082 0.240449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.778e+01 1.876e+01 -0.948 0.36193
## Exportaciones -4.863e-05 2.789e-05 -1.744 0.10668
## Empleo 2.011e-01 1.367e-01 1.472 0.16679
## Educacion 5.048e+00 1.897e+00 2.661 0.02076 *
## Salario_Diario -5.549e-03 8.992e-03 -0.617 0.54871
## Innovacion 2.338e-01 7.084e-02 3.301 0.00633 **
## Inseguridad_Robo -1.919e-03 1.935e-03 -0.992 0.34101
## Inseguridad_Homicidio 5.946e-02 2.283e-02 2.604 0.02304 *
## Tipo_de_Cambio -9.908e-02 3.513e-02 -2.820 0.01545 *
## Densidad_Carretera -4.177e+01 4.902e+01 -0.852 0.41079
## Densidad_Poblacion -6.227e-01 2.042e-01 -3.049 0.01010 *
## CO2_Emisiones 5.212e-01 6.009e-01 0.867 0.40276
## PIB_Per_Capita -2.195e-05 2.529e-05 -0.868 0.40241
## INPC 6.400e-02 3.583e-02 1.786 0.09931 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.163 on 12 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.8208
## F-statistic: 9.806 on 13 and 12 DF, p-value: 0.0001759
RMSE(linear_log$fitted.values, data$IED_Flujos)
## [1] 28107.96
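# Note (sketch): linear_log$fitted.values are on the log scale, so the RMSE above
# compares log-scale predictions against raw IED_Flujos and is inflated.
# Back-transforming first puts the error on the original scale
# (the same caveat applies to the polynomial model below):
RMSE(exp(linear_log$fitted.values), data$IED_Flujos)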
##Polynomial regression
#To decide which variables to raise to a power, we first consider the theoretical reasons why each might have polynomial (quadratic or cubic) behaviour.
# In other words, we are looking for variables whose behaviour forms a curve (goes up, reaches a certain point, and comes down again), possibly going up again (cubic behaviour).
# After this analysis, the variables I considered most fitting are PIB_Per_Capita, Tipo_de_Cambio, and CO2_Emisiones.
# Once the candidate variables are defined, we graph them to see whether this theoretical behaviour is visible.
theme_set(theme_bw())
g <- ggplot(data, aes(PIB_Per_Capita, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="GDP",
title="Counts Plot")
theme_set(theme_bw())
g <- ggplot(data, aes(Tipo_de_Cambio, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="Exchange Rate",
title="Counts Plot")
theme_set(theme_bw())
g <- ggplot(data, aes(CO2_Emisiones, IED_Flujos))
g + geom_count(col="tomato3", show.legend=F) +
labs(
y="Foreign Investment",
x="C02 Emissions",
title="Counts Plot")
#The graphs of PIB_Per_Capita and Tipo_de_Cambio show a somewhat polynomial behavior, so we can use those variables in the model. However, CO2_Emisiones does not show any such pattern, so we discard this variable.
#We therefore fit a polynomial regression with a quadratic (2nd-degree) term for PIB_Per_Capita.
#### It is IMPORTANT to mention that this data set has a limited number of records, so identifying polynomial behavior just by looking at a graph is neither certain nor accurate.
# Polynomial Regression Model
polynomial_regression <- lm(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + I(PIB_Per_Capita^2) + INPC, data=data)
summary(polynomial_regression)
##
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + I(PIB_Per_Capita^2) + INPC,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.245352 -0.065304 0.002983 0.072802 0.228259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.312e+00 4.511e+01 -0.206 0.8402
## Exportaciones -4.446e-05 3.529e-05 -1.260 0.2338
## Empleo 1.936e-01 1.469e-01 1.318 0.2143
## Educacion 5.036e+00 1.978e+00 2.545 0.0272 *
## Salario_Diario -5.285e-03 9.458e-03 -0.559 0.5875
## Innovacion 2.387e-01 7.742e-02 3.083 0.0104 *
## Inseguridad_Robo -2.190e-03 2.402e-03 -0.912 0.3814
## Inseguridad_Homicidio 6.247e-02 2.785e-02 2.243 0.0465 *
## Tipo_de_Cambio -1.018e-01 3.884e-02 -2.620 0.0238 *
## Densidad_Carretera -4.770e+01 5.849e+01 -0.816 0.4321
## Densidad_Poblacion -6.226e-01 2.129e-01 -2.925 0.0138 *
## CO2_Emisiones 5.690e-01 6.672e-01 0.853 0.4119
## PIB_Per_Capita -1.318e-04 5.280e-04 -0.250 0.8075
## I(PIB_Per_Capita^2) 3.944e-10 1.894e-09 0.208 0.8388
## INPC 6.480e-02 3.754e-02 1.726 0.1123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.17 on 11 degrees of freedom
## Multiple R-squared: 0.9143, Adjusted R-squared: 0.8052
## F-statistic: 8.383 on 14 and 11 DF, p-value: 0.0005698
RMSE(polynomial_regression$fitted.values, data$IED_Flujos)
## [1] 28107.96
### Diagnostic Tests
# Model 1 - Linear regression
vif(linear_regression)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_regression)
##
## studentized Breusch-Pagan test
##
## data: linear_regression
## BP = 9.3084, df = 13, p-value = 0.7493
AIC(linear_regression)
## [1] 512.1483
histogram(linear_regression$residuals)
# Model 2 - Linear regression with the natural logarithm applied to the dependent variable
vif(linear_log)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_log)
##
## studentized Breusch-Pagan test
##
## data: linear_log
## BP = 7.8148, df = 13, p-value = 0.8555
AIC(linear_log)
## [1] -10.6308
histogram(linear_log$residuals)
# Model 3 - Polynomial regression
vif(polynomial_regression)
## Exportaciones Empleo Educacion
## 138.778656 13.361240 1733.310924
## Salario_Diario Innovacion Inseguridad_Robo
## 99.490502 7.214584 11.343616
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 34.191177 22.484306 530.050990
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.114283 15.528332 18944.483802
## I(PIB_Per_Capita^2) INPC
## 19061.970472 750.789865
bptest(polynomial_regression)
##
## studentized Breusch-Pagan test
##
## data: polynomial_regression
## BP = 9.6019, df = 14, p-value = 0.7907
AIC(polynomial_regression)
## [1] -8.733148
histogram(polynomial_regression$residuals)
### Split the Data in Training Data vs Test Data
set.seed(123)
training.samples<-data$IED_Flujos %>%
createDataPartition(p=0.75,list=FALSE)
train.data<-data[training.samples, ]
test.data<-data[-training.samples, ]
#################################
### LASSO REGRESSION ANALYSIS ###
#################################
# Independent variables
x<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,train.data)[,-1] ### OLS model specification
y<-train.data$IED_Flujos ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso <- cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 362.8589
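# Sketch: an alternative, more conservative choice is lambda.1se -- the largest
# lambda whose cross-validated error is within one standard error of the minimum;
# it yields a sparser, more regularized model
cv.lasso$lambda.1se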
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
lassomodel
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = cv.lasso$lambda.min)
##
## Df %Dev Lambda
## 1 5 80.58 362.9
# Display regression coefficients
coef(lassomodel)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -7549.186773
## Exportaciones .
## Empleo .
## Educacion .
## Salario_Diario 9.081805
## Innovacion 2650.871459
## Inseguridad_Robo -20.895547
## Inseguridad_Homicidio .
## Tipo_de_Cambio .
## Densidad_Carretera .
## Densidad_Poblacion .
## CO2_Emisiones -3155.292307
## PIB_Per_Capita .
## INPC 196.294087
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,test.data)[,-1] ### OLS model specification
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(lassopredictions, test.data$IED_Flujos),
Rsquare = R2(lassopredictions, test.data$IED_Flujos))
## RMSE Rsquare
## 1 10445.54 0.5707237
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])+ offset_x
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
#################################
### RIDGE REGRESSION ANALYSIS ###
#################################
# Find the best lambda using cross-validation
set.seed(123) # x: independent variables | y: dependent variable
cv.ridge <- cv.glmnet(x,y,alpha=0) # alpha = 0 for RIDGE
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.ridge$lambda.min # lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 1723.874
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)
# Display regression coefficients
coef(ridgemodel)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.637912e+04
## Exportaciones 3.203238e-02
## Empleo 4.017246e+01
## Educacion 1.352587e+03
## Salario_Diario 2.418552e+01
## Innovacion 2.105752e+03
## Inseguridad_Robo -2.619092e+01
## Inseguridad_Homicidio -4.659805e+01
## Tipo_de_Cambio 2.373136e+00
## Densidad_Carretera 4.882539e+04
## Densidad_Poblacion 1.646285e+02
## CO2_Emisiones -3.796542e+03
## PIB_Per_Capita 8.648935e-02
## INPC 4.079777e+01
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC,test.data)[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(ridgepredictions, test.data$IED_Flujos),
Rsquare = R2(ridgepredictions, test.data$IED_Flujos)
)
## RMSE Rsquare
## 1 10544.56 0.5296032
### visualizing ridge regression results
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)
##HETEROSCEDASTICITY
# Using the BP test, we check for heteroscedasticity (residuals with a non-constant variance / unequal scatter) by looking at the p-value it gives us.
#H0: Homoscedasticity is present
#HA: Heteroscedasticity is present
# It is very important that our models exhibit HOMOSCEDASTICITY for the model to be more precise.
#If the p-value is less than 5% (0.05), we reject the null hypothesis, which means heteroscedasticity is present in the model.
#Looking at every model's BP test we can see:
# - Linear regression: p-value = 0.7493. H0 - homoscedasticity is present
# - Linear regression with natural logarithm: p-value = 0.8555. H0 - homoscedasticity is present
# - Polynomial regression: p-value = 0.7907. H0 - homoscedasticity is present
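# Sketch: the three Breusch-Pagan p-values can also be collected in one step
sapply(list(linear = linear_regression,
            log    = linear_log,
            poly   = polynomial_regression),
       function(m) bptest(m)$p.value)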
### MODEL ACCURACY
#Now that we have confirmed that none of our models shows heteroscedasticity, we can compare the models and decide which one fits best.
#We will take into consideration:
# - RMSE (Root Mean Square Error)
# - R2
# RMSE - Root Mean Square Error: a larger value indicates a larger gap between the predicted and observed values,
# which means a poor regression fit. Conversely, a smaller RMSE indicates a better model.
# Create a new variable in order to compare the quality and reliability of each of the regression models.
tab <- matrix(c(2573.814,28107.96,28107.95,10445.54,10544.56, 0.8134, 0.8208, 0.8405,0.5707,0.5296), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Linear Regresion (Log)','Polynomial', 'Lasso', 'Ridge')
tab <- as.table(tab)
tab
## RMSE R2
## Linear Regression 2573.8140 0.8134
## Linear Regresion (Log) 28107.9600 0.8208
## Polynomial 28107.9500 0.8405
## Lasso 10445.5400 0.5707
## Ridge 10544.5600 0.5296
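# Sketch: the same table can be derived from the model objects themselves, so the
# numbers stay in sync with the code above (fitted values for the three lm models,
# test-set predictions for lasso/ridge; note that the R2 definitions are mixed)
rmse_vals <- c(RMSE(linear_regression$fitted.values, data$IED_Flujos),
               RMSE(linear_log$fitted.values, data$IED_Flujos),
               RMSE(polynomial_regression$fitted.values, data$IED_Flujos),
               RMSE(lassopredictions, test.data$IED_Flujos),
               RMSE(ridgepredictions, test.data$IED_Flujos))
r2_vals <- c(summary(linear_regression)$adj.r.squared,
             summary(linear_log)$adj.r.squared,
             summary(polynomial_regression)$adj.r.squared,
             R2(lassopredictions, test.data$IED_Flujos),
             R2(ridgepredictions, test.data$IED_Flujos))
round(cbind(RMSE = rmse_vals, R2 = r2_vals), 4)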
##Next, we are going to check for multicollinearity.
##MULTICOLLINEARITY
# We identify this phenomenon by checking the variance inflation factors (VIF).
# Ideally, the VIF values of the variables in our models should be less than 10.
# After reviewing all the VIF tests, we conclude that the model with the most multicollinearity is clearly the polynomial one. Since the other two are fairly similar, we decided to use the LINEAR REGRESSION, since it has the best RMSE with a solid R2.
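# Sketch: the high-VIF predictors can also be flagged programmatically
high_vif <- vif(linear_regression)
names(high_vif[high_vif > 10])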
# Let's take another look at the model we chose.
#Linear regression
linear_regression <- lm(IED_Flujos ~ Exportaciones + Empleo + Educacion + Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educacion +
## Salario_Diario + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Poblacion +
## CO2_Emisiones + PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5319.2 -1757.1 -515.3 1720.7 5071.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.717e+05 4.358e+05 -2.229 0.04566 *
## Exportaciones -1.035e+00 6.479e-01 -1.597 0.13617
## Empleo 6.279e+03 3.175e+03 1.977 0.07142 .
## Educacion 1.751e+05 4.408e+04 3.973 0.00185 **
## Salario_Diario -3.695e+01 2.089e+02 -0.177 0.86256
## Innovacion 5.944e+03 1.646e+03 3.611 0.00357 **
## Inseguridad_Robo 1.522e+01 4.496e+01 0.338 0.74089
## Inseguridad_Homicidio 1.529e+03 5.305e+02 2.883 0.01377 *
## Tipo_de_Cambio -2.631e+03 8.164e+02 -3.223 0.00731 **
## Densidad_Carretera -1.496e+06 1.139e+06 -1.313 0.21361
## Densidad_Poblacion -1.865e+04 4.745e+03 -3.931 0.00200 **
## CO2_Emisiones 1.636e+04 1.396e+04 1.172 0.26392
## PIB_Per_Capita -6.483e-01 5.877e-01 -1.103 0.29161
## INPC 9.981e+02 8.324e+02 1.199 0.25367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3789 on 12 degrees of freedom
## Multiple R-squared: 0.9104, Adjusted R-squared: 0.8134
## F-statistic: 9.384 on 13 and 12 DF, p-value: 0.00022
RMSE(linear_regression$fitted.values, data$IED_Flujos)
## [1] 2573.814
### Diagnostic Tests
vif(linear_regression)
## Exportaciones Empleo Educacion
## 94.133511 12.560405 1731.885829
## Salario_Diario Innovacion Inseguridad_Robo
## 97.707251 6.563801 8.000941
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 24.965162 19.986978 404.496907
## Densidad_Poblacion CO2_Emisiones PIB_Per_Capita
## 1148.107734 13.686035 47.236728
## INPC
## 742.865236
bptest(linear_regression)
##
## studentized Breusch-Pagan test
##
## data: linear_regression
## BP = 9.3084, df = 13, p-value = 0.7493
AIC(linear_regression)
## [1] 512.1483
histogram(linear_regression$residuals)
#We will eliminate the variables that do not have a significant impact on the model, using the correlation chart
corre2 <- round(cor(subset(data)), 1)
ggcorrplot(corre2, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("pink", "white", "lightgreen"),
title="Correlogram",
ggtheme=theme_bw)
#We discovered that the variables with the least correlation were: CO2_Emisiones, Inseguridad_Homicidio and Innovacion.
#Experimenting with the model, we also removed the variables: Exportaciones, Educacion, Salario_Diario, Densidad_Carretera, Densidad_Poblacion
#New model
linear_regression2 <- lm(IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
summary(linear_regression2)
##
## Call:
## lm(formula = IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones +
## PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7992.0 -1593.4 238.2 1772.4 14270.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.083e+05 3.150e+05 0.344 0.7352
## Empleo -1.287e+03 2.925e+03 -0.440 0.6655
## Innovacion 3.185e+03 1.813e+03 1.757 0.0969 .
## Inseguridad_Robo -3.372e+00 4.365e+01 -0.077 0.9393
## Inseguridad_Homicidio -7.429e+01 4.071e+02 -0.183 0.8573
## Tipo_de_Cambio -2.004e+03 9.725e+02 -2.061 0.0550 .
## CO2_Emisiones -1.183e+04 1.213e+04 -0.975 0.3431
## PIB_Per_Capita 3.230e-01 4.226e-01 0.764 0.4551
## INPC 4.372e+02 1.979e+02 2.209 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5414 on 17 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.6189
## F-statistic: 6.075 on 8 and 17 DF, p-value: 0.0008996
RMSE(linear_regression2$fitted.values, data$IED_Flujos)
## [1] 4378.015
### Diagnostic Tests: New Model
vif(linear_regression2)
## Empleo Innovacion Inseguridad_Robo
## 5.218425 3.898395 3.691898
## Inseguridad_Homicidio Tipo_de_Cambio CO2_Emisiones
## 7.197865 13.887651 5.059914
## PIB_Per_Capita INPC
## 11.957778 20.559747
bptest(linear_regression2)
##
## studentized Breusch-Pagan test
##
## data: linear_regression2
## BP = 10.168, df = 8, p-value = 0.2534
AIC(linear_regression2)
## [1] 529.771
histogram(linear_regression2$residuals)
#Testing our model again, we can see that the multicollinearity was significantly reduced.
# We now have a total of five variables with no multicollinearity, each with a VIF below 10.
#Only three variables retain a small amount of multicollinearity, with VIF values ranging from roughly 12 to 21.
#Now that we have the best version of the model, we can evaluate our hypotheses.
summary(linear_regression2)
##
## Call:
## lm(formula = IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo +
## Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones +
## PIB_Per_Capita + INPC, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7992.0 -1593.4 238.2 1772.4 14270.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.083e+05 3.150e+05 0.344 0.7352
## Empleo -1.287e+03 2.925e+03 -0.440 0.6655
## Innovacion 3.185e+03 1.813e+03 1.757 0.0969 .
## Inseguridad_Robo -3.372e+00 4.365e+01 -0.077 0.9393
## Inseguridad_Homicidio -7.429e+01 4.071e+02 -0.183 0.8573
## Tipo_de_Cambio -2.004e+03 9.725e+02 -2.061 0.0550 .
## CO2_Emisiones -1.183e+04 1.213e+04 -0.975 0.3431
## PIB_Per_Capita 3.230e-01 4.226e-01 0.764 0.4551
## INPC 4.372e+02 1.979e+02 2.209 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5414 on 17 degrees of freedom
## Multiple R-squared: 0.7409, Adjusted R-squared: 0.6189
## F-statistic: 6.075 on 8 and 17 DF, p-value: 0.0008996
#* Hypothesis 1: The variable Direct Foreign Investment is positively correlated with Exports; the more foreign investment in the country, the more exports it generates.
#*
#* To fit the model and reduce multicollinearity, the variable Exportaciones was removed from the model; however, in the data visualization section we observed a graph indicating that these variables do indeed correlate.
#* Hypothesis 2: The variable Direct Foreign Investment is not correlated at all with CO2_Emisiones. This means these variables do not affect each other.
#*
#* For CO2 emissions we observe a negative coefficient, but it is not statistically significant (p = 0.34), so we can state with reasonable confidence that there is no real correlation. Our hypothesis holds.
#* Hypothesis 3: Foreign investment decreases when the homicide rate increases, meaning they have a negative correlation.
#*
#* We can observe that the homicide variable is not significant at all; the model does not indicate any relation with foreign investment.
linear_regression2 <- lm(IED_Flujos ~ Empleo + Innovacion + Inseguridad_Robo + Inseguridad_Homicidio + Tipo_de_Cambio + CO2_Emisiones + PIB_Per_Capita + INPC, data=data)
colors <- c("blue", "green", "red", "purple")
plot(linear_regression2, which = 1, col = colors[1], pch = 19)
#Show the predicted values of the dependent variable
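# Sketch: one way to display the predicted values next to the observed ones
data.frame(Year      = data$Year,
           Observed  = data$IED_Flujos,
           Predicted = round(predict(linear_regression2, newdata = data), 1))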
After determining that our prediction model is the linear regression, and after adjusting it for multicollinearity, we gathered the following insights:
The variables most attractive to businesses seeking nearshoring in Mexico are mainly a low exchange rate and innovation. This can be seen graphically in the ridge analysis.
Cardona, S., & Cardona, S. (2023, August 10). Nearshoring en México: una oportunidad histórica para la economía del futuro. Forbes México. https://www.forbes.com.mx/nearshoring-en-mexico-una-oportunidad-historica-para-la-economia-del-futuro/
Nearshoring: La solución actual para el comercio exterior. (2021). @Thomsonreuters. https://www.thomsonreutersmexico.com/es-mx/soluciones-de-comercio-exterior/blog-comercio-exterior/nearshoring-la-solucion-actual-para-el-comercio-exteior
What Is Predictive Analytics? 5 Examples | HBS Online. (2021, October 26). Business Insights Blog. https://online.hbs.edu/blog/post/predictive-analytics
Tierney, N. (2023, February 2). Exploring Imputed Values. R-Project.org. https://cran.r-project.org/web/packages/naniar/vignettes/exploring-imputed-values.html
Imputation in R: Top 3 Ways for Imputing Missing Data - Machine Learning, R programming. (2023). Appsilon.com. https://appsilon.com/imputation-in-r/#missforest
(2013). R-Statistics.co. http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html#Counts%20Chart