What is nearshoring? Nearshoring is a business strtegy in which a company makes the decision to transfer a specific part of the work to a different/less expensive geographic location. This system provides a closer and more effective operation location to a company.This strategy brings different benefits such as:
Why is Mexico so attractive to nearshoring? Ther are different reasons of why Mexico is so attrative for companies that whant to do nearshoring. The main reasons are:
What is predictive analytics? Predictive analytics is a forms of data analytics in which data is use to predict future events, trens or behaviors. It is mainly compose of historical data and it’s use to forecast possible scenarios. Having a probable view of the future can help organizations to make better strategic decisions.
What is the use of regresion analysis in predictive analytics? One of the most important tools in predictive analytics is the regression analysis. This is a mathematical tool that determine the relationship between two or more variables. This tool also provide us with an equation that can be use to predict a result in certain conditions of the dependent variables. According to the Harvard Business School Profesor Jan Hammond, regression analysis “allows the user to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship. Such insights can prove extremely valuable for analyzing historical trends and developing forecast.”
How can regression analysis help to predict the ocurrence of nearshoring in Mexico? Taking into consideration what it was learned during the investigation, there are a different aspects that makes Mexico an atractive place for companies to do nearshoring. Each of these factors and indicators can be consider a variable that determines the amount of Foreign Investment in the country. The regression analysis can be use to understand the relationship of each of these factors (independent variables) with the ocurrence of nearshoring (dependent variable). This will also create a mathematical equation that can be use to predict the ocurrence of nearshoring according to the behavior of the different variables.
# Importing the libraries we are going to use:
library(foreign)
library(dplyr) # data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(Hmisc) # several useful functions for data analysis
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(psych) # functions for multivariate analysis
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(naniar) # summaries and visualization of missing values NA's
library(corrplot) # correlation plots
## corrplot 0.92 loaded
library(jtools) # presentation of regression analysis
##
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(lmtest) # diagnostic checks - linear regression analysis
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car) # diagnostic checks - linear regression analysis
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
library(olsrr) # diagnostic checks - linear regression analysis
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
library(naniar) # identifying missing values
library(stargazer) # create publication quality tables
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects) # displays for linear and other regression models
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(tidyverse) # collection of R packages designed for data science
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ car::recode() masks dplyr::recode()
## ✖ purrr::some() masks car::some()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # Classification and Regression Training
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(glmnet) # methods for prediction and plotting, and functions for cross-validation
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library(readxl) # Leer archivos en excel
library(gganimate)
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
library(gapminder)
library(ggcorrplot)
# Uploading the data base. We are going to create to different data bases. One for the first page of the database and another one for the second page.
# Upload the first page of our data base.
bd1 <- read_excel("/Users/pedrovillanueva/Desktop/SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = "Datos")
# Upload the second page of our data base.
bd2 <- read_excel("/Users/pedrovillanueva/Desktop/SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = "Datos_Series_de_Tiempo")
# Confirm that our data base was uploaded correctly.
bd1
## # A tibble: 26 × 15
## Año IED_Flujos Exportaciones Empleo Educación Salario_Diario Innovación
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1997 12146. 9088. - 7.1979679955… 24.3 11.301406…
## 2 1998 8374. 9875. - 7.3115866164… 31.9 11.372208…
## 3 1999 13960. 10990. - 7.4279238305… 31.9 12.458970…
## 4 2000 18249. 12483. 97.83 7.56 35.1 13.145562…
## 5 2001 30057. 11300. 97.36 7.6782701458… 37.6 13.468116…
## 6 2002 24099. 11923. 97.66 7.8034178920… 39.7 12.798930…
## 7 2003 18250. 13156 97.06 7.9254003630… 41.5 11.810749…
## 8 2004 25016. 13573. 96.48 8.0438954581… 43.3 12.608412…
## 9 2005 25796. 16466. 97.17 8.14 45.2 13.414424…
## 10 2006 21233. 17486. 96.53 8.2591355074… 47.0 14.231369…
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <chr>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Población <dbl>,
## # CO2_Emisiones <chr>, PIB_Per_Cápita <dbl>, INPC <dbl>
bd2
## # A tibble: 96 × 3
## Año Trimestre IED_Flujos
## <dbl> <chr> <dbl>
## 1 1999 I 3596.
## 2 1999 II 3396.
## 3 1999 III 3028.
## 4 1999 IV 3940.
## 5 2000 I 4601.
## 6 2000 II 4857.
## 7 2000 III 3057.
## 8 2000 IV 5734.
## 9 2001 I 3599.
## 10 2001 II 5219.
## # ℹ 86 more rows
# Knowing the variables that we are working with is essential when making an exploratory data analysis. In the next code we take a further look into the different variables that we have in each data base.
# Creates a variable that stores the column names of each of our databases.
colnames1 <- colnames(bd1)
colnames2 <- colnames(bd2)
# Shows us the name of the variables in each of our databases.
colnames1
## [1] "Año" "IED_Flujos" "Exportaciones"
## [4] "Empleo" "Educación" "Salario_Diario"
## [7] "Innovación" "Inseguridad_Robo" "Inseguridad_Homicidio"
## [10] "Tipo_de_Cambio" "Densidad_Carretera" "Densidad_Población"
## [13] "CO2_Emisiones" "PIB_Per_Cápita" "INPC"
colnames2
## [1] "Año" "Trimestre" "IED_Flujos"
# Descriptive Statistics
# General descriptive statistics of our first data base.
descriptive_statistics_1 <- summary(bd1)
descriptive_statistics_1
## Año IED_Flujos Exportaciones Empleo
## Min. :1997 Min. : 8374 Min. : 9088 Length:26
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:13260 Class :character
## Median :2010 Median :27698 Median :21188 Mode :character
## Mean :2010 Mean :26770 Mean :23601
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:31601
## Max. :2022 Max. :48354 Max. :46478
## Educación Salario_Diario Innovación Inseguridad_Robo
## Length:26 Min. : 24.30 Length:26 Min. :120.5
## Class :character 1st Qu.: 41.97 Class :character 1st Qu.:148.3
## Mode :character Median : 54.48 Mode :character Median :181.8
## Mean : 65.16 Mean :185.4
## 3rd Qu.: 72.31 3rd Qu.:209.9
## Max. :172.87 Max. :314.8
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera Densidad_Población
## Length:26 Min. : 8.064 Min. :0.05205 Min. :47.44
## Class :character 1st Qu.:10.752 1st Qu.:0.05954 1st Qu.:52.78
## Mode :character Median :13.016 Median :0.06989 Median :58.09
## Mean :13.910 Mean :0.07106 Mean :57.33
## 3rd Qu.:18.489 3rd Qu.:0.08275 3rd Qu.:61.39
## Max. :20.664 Max. :0.09020 Max. :65.60
## CO2_Emisiones PIB_Per_Cápita INPC
## Length:26 Min. :126739 Min. : 33.28
## Class :character 1st Qu.:130964 1st Qu.: 56.15
## Mode :character Median :136845 Median : 73.35
## Mean :138550 Mean : 75.17
## 3rd Qu.:146148 3rd Qu.: 91.29
## Max. :153236 Max. :126.48
# Observations bd1:
# We can see that there are some variables that are being put as a character when they should be as a float. This means that there are inconsistencies with the data and a process of data cleaning should be done. We first need to transform this variables to numeric, then finding the missing values and lastly implement an imputation process.
# The variables that need to be examinate are: "Inseguridad_Homicidio", "Innovación", "Educación", "CO2_Emisiones" and "Empleo"
# General descriptive statistics of our second data base.
descriptive_statistics_2 <- summary(bd2)
descriptive_statistics_2
## Año Trimestre IED_Flujos
## Min. :1999 Length:96 Min. : 1341
## 1st Qu.:2005 Class :character 1st Qu.: 4351
## Median :2010 Mode :character Median : 6238
## Mean :2011 Mean : 7036
## 3rd Qu.:2016 3rd Qu.: 8053
## Max. :2023 Max. :22794
# Observations bd2:
# We can see that in this case the variable "Trimestre" is in a character form. In this case this is correct and the variable should be in this form. This means that no change in the type of data are needed.
# Transforming variables that are in a character form to numeric.
# With this variable we are transforming the character values of the columns to a numeric type in our first base.
bd1$Inseguridad_Homicidio <- as.numeric(bd1$Inseguridad_Homicidio)
## Warning: NAs introduced by coercion
bd1$Innovación <- as.numeric(bd1$Innovación)
## Warning: NAs introduced by coercion
bd1$Educación <- as.numeric(bd1$Educación)
## Warning: NAs introduced by coercion
bd1$Empleo <- as.numeric(bd1$Empleo)
## Warning: NAs introduced by coercion
bd1$CO2_Emisiones <- as.numeric(bd1$CO2_Emisiones)
## Warning: NAs introduced by coercion
# Observations:
# NAs where introduced by coercion. This means that there where measing values in these columns and latter on the process we will need to take attention of this values.
# Looking for missing values, inconsistencies or NA.
# We create a variable that stores the amount of NA in our two data bases.
na_bd1 <- sum(is.na(bd1))
na_bd2 <- sum(is.na(bd2))
# We show the number of NA in each database
na_bd1
## [1] 12
na_bd2
## [1] 0
# Observations:
# In the first data base we have 12 missing values. We need to proced with an imputation method to take care of these values. For the second database there where no NAs found so no further work needs to be done.
# Imputation method for missing values.
# Calculating the mean for each of the variables that have NA. Except for the variable Educación. For this variable, we see that there is a tendency so putting the mean will not take into consideration this trend. So we will implement
mean_empleo <- mean(bd1$Empleo, na.rm = TRUE)
mean_innovacion <- mean(bd1$Innovación, na.rm = TRUE)
mean_inseguridad <- mean(bd1$Inseguridad_Homicidio, na.rm = TRUE)
mean_co2 <- mean(bd1$CO2_Emisiones, na.rm = TRUE)
# Imputating the missing values with the mean of each category.
bd1$Empleo[is.na(bd1$Empleo)] <- mean_empleo
bd1$Innovación[is.na(bd1$Innovación)] <- mean_innovacion
bd1$Inseguridad_Homicidio[is.na(bd1$Inseguridad_Homicidio)] <- mean_inseguridad
bd1$CO2_Emisiones[is.na(bd1$CO2_Emisiones)] <- mean_co2
# We look for any missing values in our data.
na_bd1_new <- sum(is.na(bd1))
na_bd1_new
## [1] 3
# The missing values that we have are from the variable Educación. We detected a trend for this variable so we will use an approx in a sequencue model in order to take into consideration this change.
imputation_education <- approx(seq_along(bd1$Educación),bd1$Educación, method = "linear", n = length(bd1$Educación))$y
bd1$Educación <- imputation_education
# With this code we can see that our new imputated data follows the existing trend of our data base.
bd1$Educación
## [1] 7.197968 7.297952 7.400003 7.512453 7.621500 7.728329 7.837573 7.944360
## [9] 8.047740 8.132312 8.235308 8.326336 8.413348 8.500440 8.579609 8.654893
## [17] 8.762427 8.850012 8.937710 9.025490 9.113350 9.201318 9.289411 9.377608
## [25] 9.468821 9.578832
# We made another count in order to confirm that there are no more missing values.
sum(is.na(bd1))
## [1] 0
# Descriptive statistics with full data en b1
descriptive_statistics_1 <- summary(bd1)
descriptive_statistics_1
## Año IED_Flujos Exportaciones Empleo Educación
## Min. :1997 Min. : 8374 Min. : 9088 Min. :95.06 Min. :7.198
## 1st Qu.:2003 1st Qu.:21367 1st Qu.:13260 1st Qu.:96.09 1st Qu.:7.864
## Median :2010 Median :27698 Median :21188 Median :96.48 Median :8.457
## Mean :2010 Mean :26770 Mean :23601 Mean :96.47 Mean :8.424
## 3rd Qu.:2016 3rd Qu.:32183 3rd Qu.:31601 3rd Qu.:97.01 3rd Qu.:9.004
## Max. :2022 Max. :48354 Max. :46478 Max. :97.83 Max. :9.579
## Salario_Diario Innovación Inseguridad_Robo Inseguridad_Homicidio
## Min. : 24.30 Min. :11.28 Min. :120.5 Min. : 8.037
## 1st Qu.: 41.97 1st Qu.:12.60 1st Qu.:148.3 1st Qu.:10.402
## Median : 54.48 Median :13.11 Median :181.8 Median :17.110
## Mean : 65.16 Mean :13.11 Mean :185.4 Mean :17.292
## 3rd Qu.: 72.31 3rd Qu.:13.60 3rd Qu.:209.9 3rd Qu.:22.346
## Max. :172.87 Max. :15.11 Max. :314.8 Max. :29.592
## Tipo_de_Cambio Densidad_Carretera Densidad_Población CO2_Emisiones
## Min. : 8.064 Min. :0.05205 Min. :47.44 Min. :3.592
## 1st Qu.:10.752 1st Qu.:0.05954 1st Qu.:52.78 1st Qu.:3.843
## Median :13.016 Median :0.06989 Median :58.09 Median :3.946
## Mean :13.910 Mean :0.07106 Mean :57.33 Mean :3.946
## 3rd Qu.:18.489 3rd Qu.:0.08275 3rd Qu.:61.39 3rd Qu.:4.088
## Max. :20.664 Max. :0.09020 Max. :65.60 Max. :4.221
## PIB_Per_Cápita INPC
## Min. :126739 Min. : 33.28
## 1st Qu.:130964 1st Qu.: 56.15
## Median :136845 Median : 73.35
## Mean :138550 Mean : 75.17
## 3rd Qu.:146148 3rd Qu.: 91.29
## Max. :153236 Max. :126.48
# Another way of looking at the description of each of our data bases.
describe_1 <- describe(bd1)
describe_2 <- describe(bd2)
describe_1
## vars n mean sd median trimmed mad
## Año 1 26 2009.50 7.65 2009.50 2009.50 9.64
## IED_Flujos 2 26 26770.12 8770.68 27697.59 26860.64 9079.25
## Exportaciones 3 26 23600.91 11346.13 21187.89 22875.85 13370.71
## Empleo 4 26 96.47 0.72 96.48 96.48 0.75
## Educación 5 26 8.42 0.72 8.46 8.43 0.88
## Salario_Diario 6 26 65.16 35.85 54.48 60.16 22.51
## Innovación 7 26 13.11 1.07 13.11 13.09 0.78
## Inseguridad_Robo 8 26 185.42 47.67 181.83 181.16 47.06
## Inseguridad_Homicidio 9 26 17.29 7.12 17.11 16.99 9.31
## Tipo_de_Cambio 10 26 13.91 4.15 13.02 13.78 4.25
## Densidad_Carretera 11 26 0.07 0.01 0.07 0.07 0.02
## Densidad_Población 12 26 57.33 5.41 58.09 57.44 6.68
## CO2_Emisiones 13 26 3.95 0.18 3.95 3.95 0.17
## PIB_Per_Cápita 14 26 138550.10 8861.10 136845.30 138255.64 11080.42
## INPC 15 26 75.17 24.81 73.35 74.45 27.14
## min max range skew kurtosis se
## Año 1997.00 2022.00 25.00 0.00 -1.34 1.50
## IED_Flujos 8373.50 48354.42 39980.92 -0.02 -0.08 1720.07
## Exportaciones 9087.62 46477.58 37389.97 0.46 -1.08 2225.16
## Empleo 95.06 97.83 2.77 -0.14 -0.74 0.14
## Educación 7.20 9.58 2.38 -0.09 -1.28 0.14
## Salario_Diario 24.30 172.87 148.57 1.43 1.44 7.03
## Innovación 11.28 15.11 3.83 0.13 -0.70 0.21
## Inseguridad_Robo 120.49 314.78 194.28 0.89 0.30 9.35
## Inseguridad_Homicidio 8.04 29.59 21.56 0.38 -1.28 1.40
## Tipo_de_Cambio 8.06 20.66 12.60 0.44 -1.39 0.81
## Densidad_Carretera 0.05 0.09 0.04 0.18 -1.50 0.00
## Densidad_Población 47.44 65.60 18.16 -0.19 -1.24 1.06
## CO2_Emisiones 3.59 4.22 0.63 -0.15 -0.94 0.04
## PIB_Per_Cápita 126738.75 153235.73 26496.98 0.28 -1.41 1737.81
## INPC 33.28 126.48 93.20 0.26 -0.95 4.87
describe_2
## vars n mean sd median trimmed mad min max
## Año 1 96 2010.51 6.98 2010.5 2010.50 8.90 1999.00 2023.00
## Trimestre* 2 96 2.50 1.12 2.5 2.50 1.48 1.00 4.00
## IED_Flujos 3 96 7036.50 3978.53 6237.9 6458.65 2797.97 1340.58 22794.16
## range skew kurtosis se
## Año 24.00 0.01 -1.23 0.71
## Trimestre* 3.00 0.00 -1.39 0.11
## IED_Flujos 21453.58 1.60 3.01 406.06
# Visualization of our data - Grpahs
# Loading the data frame
bd1.1 <- bd1%>% select(IED_Flujos,Año,Exportaciones,Empleo,Educación,Salario_Diario,Innovación,Inseguridad_Robo,Inseguridad_Homicidio,Tipo_de_Cambio,Densidad_Carretera,Densidad_Población,CO2_Emisiones,PIB_Per_Cápita,INPC) %>% group_by(Año)
bd1.1
## # A tibble: 26 × 15
## # Groups: Año [26]
## IED_Flujos Año Exportaciones Empleo Educación Salario_Diario Innovación
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 12146. 1997 9088. 96.5 7.20 24.3 11.3
## 2 8374. 1998 9875. 96.5 7.30 31.9 11.4
## 3 13960. 1999 10990. 96.5 7.40 31.9 12.5
## 4 18249. 2000 12483. 97.8 7.51 35.1 13.1
## 5 30057. 2001 11300. 97.4 7.62 37.6 13.5
## 6 24099. 2002 11923. 97.7 7.73 39.7 12.8
## 7 18250. 2003 13156 97.1 7.84 41.5 11.8
## 8 25016. 2004 13573. 96.5 7.94 43.3 12.6
## 9 25796. 2005 16466. 97.2 8.05 45.2 13.4
## 10 21233. 2006 17486. 96.5 8.13 47.0 14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## # Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Población <dbl>,
## # CO2_Emisiones <dbl>, PIB_Per_Cápita <dbl>, INPC <dbl>
# Graph 1:
graph1 <- ggplot(data=bd1.1,aes(x=Año,IED_Flujos,y=IED_Flujos,fill=Educación)) +
geom_bar(stat="identity") + coord_flip() +
ggtitle("Effect of education in the foreign investment throught the years") +
labs(x = "Foreign Investment",
fill = "Education Level",
y = "Year")
graph1
# Analysis: In this graph we can see that the level of education has been increasing through the years. We can see a correlation between the level of education and the foreign investment. We can see that as the education has been getting better, the amount of foreign investment has increase.
# Graph 2:
a <- ggplot(bd1.1, aes(Densidad_Carretera,IED_Flujos))
graph2 <- a + geom_point() +
geom_smooth(method="lm", se=F) +
labs(
y="Total foreign investment",
x="Amount of highways",
title="Impact of the amount of highways to the foreign investment in Mexico",
caption="Source: CIC")
graph2
## `geom_smooth()` using formula = 'y ~ x'
# Analysis: In this graph we can see the relationship between the amount of highways in the country and the total foreign investment. Also we can see a tendency line that crosses the data in order to better view the behavior. We can see in the line that as the amount of highways increases the amount of foreign investment also increases. We can see that our data dosen´t fully behave as our line since we can see an early peak and then the increase becomes slower.
# Graph 3: For this graph we will create two differnt graphs and compare them between each other in order to determine which factor of insecurity affects more the foreign investment; the amount of thefs or the amount of murders.
#Graph 3,1:
theme_set(theme_bw()) # pre-set the bw theme.
b <- ggplot(bd1.1, aes(Inseguridad_Homicidio,IED_Flujos))
graph3.1 <- b + geom_count(col="tomato3", show.legend=F) +
labs(title="Correlation between homicides to the foreign investment",
y="Foreign Investment",
x="Homicide Rate",
subtitle="Counts Plot")
graph3.1
# Analysis graph 3.1: we see that there is not a clear tendency in our data. We see that our dots are skatter all over the graph so we can´t assume that the homicide rate increases or decreases the foreign investment by just looking at the behavior of the graph.
# Graph 3.2:
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(bd1.1, aes(Inseguridad_Robo,IED_Flujos))
graph3.2 <- g + geom_count(col="blue", show.legend=F) +
labs(title="Impact of theft to the foreign investment",
y="Foreign Investment",
x="Thieves Rate",
subtitle="Counts Plot")
graph3.2
# Analysis graph 3.2: in this graph we can clearly see that there is a negative correlation between the two vairables. This means that we can assume that as the thieves rate goes up the amount of foreign investment goes down.
# Analysis Graph 3_General: We can assume that the amount of theves have a bigger negative correlationship with the foreign investment than the amount of murders.
# Graph 4:
corre <- round(cor(subset(bd1,select = -c(Año,INPC,Exportaciones,Educación,Densidad_Carretera))), 1)
ggcorrplot(corre, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("red", "white", "green"),
title="Correlation between our variables",
ggtheme=theme_bw)
# In this graph we cann see that there is a positive correlation between the population density and the daily salary and the exchange rate. We can also observe that there is a negative corelationship between the foreign investment and the amount of thefs commited.
# IMPORTANT INFORMATION: The insights discover in this graphs are NOT the conclusions of the analysis since we haven´t find the concrete correlationship. This is just to have a look at our data and helps us to create hypothesis that will be further prove in our regression analysis.
# Linear regresion
practice_linear_regresion <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Salario_Diario+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+Densidad_Población+CO2_Emisiones+PIB_Per_Cápita+INPC, data = bd1)
summary(practice_linear_regresion)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación +
## Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Población +
## CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4595.1 -2152.5 -531.4 2152.8 4934.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e+06 3.812e+05 -2.695 0.01950 *
## Exportaciones -1.083e+00 6.748e-01 -1.605 0.13451
## Empleo 5.757e+03 2.371e+03 2.428 0.03184 *
## Educación 1.779e+05 4.490e+04 3.961 0.00189 **
## Salario_Diario -1.205e+01 2.044e+02 -0.059 0.95395
## Innovación 4.849e+03 1.502e+03 3.228 0.00725 **
## Inseguridad_Robo 4.352e+01 5.138e+01 0.847 0.41353
## Inseguridad_Homicidio 1.385e+03 5.516e+02 2.511 0.02736 *
## Tipo_de_Cambio -2.582e+03 8.448e+02 -3.057 0.00996 **
## Densidad_Carretera -1.240e+06 1.151e+06 -1.078 0.30232
## Densidad_Población -1.645e+04 4.287e+03 -3.836 0.00237 **
## CO2_Emisiones 9.462e+03 1.173e+04 0.807 0.43548
## PIB_Per_Cápita -3.341e-01 4.966e-01 -0.673 0.51384
## INPC 9.457e+01 9.535e+02 0.099 0.92263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared: 0.9068, Adjusted R-squared: 0.8059
## F-statistic: 8.985 on 13 and 12 DF, p-value: 0.0002737
vif(practice_linear_regresion) # We detected that there was multicolinearity. In order to errase this, we will need to do a new imputation system in our data and eliminate some variables.
## Exportaciones Empleo Educación
## 98.143017 4.886377 1727.453825
## Salario_Diario Innovación Inseguridad_Robo
## 89.896000 4.348086 10.041711
## Inseguridad_Homicidio Tipo_de_Cambio Densidad_Carretera
## 25.807378 20.578432 396.904949
## Densidad_Población CO2_Emisiones PIB_Per_Cápita
## 901.177320 7.492508 32.429142
## INPC
## 936.922243
# We are going to replace the highest and lowest registers in our data and add the average. This will eliminate multicolinearity. We
bd2 <- bd1
num_values_to_replace <- 3
#Exportaciones
median_value_exportaciones <- median(bd2[["Exportaciones"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Exportaciones"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Exportaciones"]][lowest_indices] <- median_value_exportaciones
bd2[["Exportaciones"]][highest_indices] <- median_value_exportaciones
# Empleo
median_value_empleo <- median(bd2[["Empleo"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Empleo"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Empleo"]][lowest_indices] <- median_value_empleo
bd2[["Empleo"]][highest_indices] <- median_value_empleo
# Educación
median_value_educacion <- median(bd2[["Educación"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Educación"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Educación"]][lowest_indices] <- median_value_educacion
bd2[["Educación"]][highest_indices] <- median_value_educacion
# Salario_Diario
median_value_salario <- median(bd2[["Salario_Diario"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Salario_Diario"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Salario_Diario"]][lowest_indices] <- median_value_salario
bd2[["Salario_Diario"]][highest_indices] <- median_value_salario
# Innovación
median_value_innovacion <- median(bd2[["Innovación"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Innovación"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Innovación"]][lowest_indices] <- median_value_innovacion
bd2[["Innovación"]][highest_indices] <- median_value_innovacion
# Inseguridad_Robo
median_value_robo <- median(bd2[["Inseguridad_Robo"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Inseguridad_Robo"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Inseguridad_Robo"]][lowest_indices] <- median_value_robo
bd2[["Inseguridad_Robo"]][highest_indices] <- median_value_robo
# Inseguridad_Homicidio
median_value_homicidio <- median(bd2[["Inseguridad_Homicidio"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Inseguridad_Homicidio"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Inseguridad_Homicidio"]][lowest_indices] <- median_value_homicidio
bd2[["Inseguridad_Homicidio"]][highest_indices] <- median_value_homicidio
# Tipo_de_Cambio
median_value_cambio <- median(bd2[["Tipo_de_Cambio"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Tipo_de_Cambio"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Tipo_de_Cambio"]][lowest_indices] <- median_value_cambio
bd2[["Tipo_de_Cambio"]][highest_indices] <- median_value_cambio
# Densidad_Carretera
median_value_carretera <- median(bd2[["Densidad_Carretera"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Densidad_Carretera"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Densidad_Carretera"]][lowest_indices] <- median_value_carretera
bd2[["Densidad_Carretera"]][highest_indices] <- median_value_carretera
# Densidad_Población
median_value_poblacion <- median(bd2[["Densidad_Población"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Densidad_Población"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Densidad_Población"]][lowest_indices] <- median_value_poblacion
bd2[["Densidad_Población"]][highest_indices] <- median_value_poblacion
# CO2_Emisiones
median_value_c02 <- median(bd2[["CO2_Emisiones"]], na.rm = TRUE)
sorted_indices <- order(bd2[["CO2_Emisiones"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["CO2_Emisiones"]][lowest_indices] <- median_value_c02
bd2[["CO2_Emisiones"]][highest_indices] <- median_value_c02
# PIB_Per_Cápita
median_value_pib <- median(bd2[["PIB_Per_Cápita"]], na.rm = TRUE)
sorted_indices <- order(bd2[["PIB_Per_Cápita"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["PIB_Per_Cápita"]][lowest_indices] <- median_value_pib
bd2[["PIB_Per_Cápita"]][highest_indices] <- median_value_pib
# INPC
median_value_inpc <- median(bd2[["INPC"]], na.rm = TRUE)
sorted_indices <- order(bd2[["INPC"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["INPC"]][lowest_indices] <- median_value_inpc
bd2[["INPC"]][highest_indices] <- median_value_inpc
# Performing a new linear regresion
practice2_linear_regresion <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita, data = bd2)
summary(practice_linear_regresion)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación +
## Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Población +
## CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4595.1 -2152.5 -531.4 2152.8 4934.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e+06 3.812e+05 -2.695 0.01950 *
## Exportaciones -1.083e+00 6.748e-01 -1.605 0.13451
## Empleo 5.757e+03 2.371e+03 2.428 0.03184 *
## Educación 1.779e+05 4.490e+04 3.961 0.00189 **
## Salario_Diario -1.205e+01 2.044e+02 -0.059 0.95395
## Innovación 4.849e+03 1.502e+03 3.228 0.00725 **
## Inseguridad_Robo 4.352e+01 5.138e+01 0.847 0.41353
## Inseguridad_Homicidio 1.385e+03 5.516e+02 2.511 0.02736 *
## Tipo_de_Cambio -2.582e+03 8.448e+02 -3.057 0.00996 **
## Densidad_Carretera -1.240e+06 1.151e+06 -1.078 0.30232
## Densidad_Población -1.645e+04 4.287e+03 -3.836 0.00237 **
## CO2_Emisiones 9.462e+03 1.173e+04 0.807 0.43548
## PIB_Per_Cápita -3.341e-01 4.966e-01 -0.673 0.51384
## INPC 9.457e+01 9.535e+02 0.099 0.92263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared: 0.9068, Adjusted R-squared: 0.8059
## F-statistic: 8.985 on 13 and 12 DF, p-value: 0.0002737
RMSE(practice2_linear_regresion$fitted.values, bd2$IED_Flujos)
## [1] 5486.737
vif(practice2_linear_regresion) # We asjusted the model in order to find no multicolineality.
## Exportaciones Empleo Educación
## 33.302113 1.546579 48.825615
## Innovación Inseguridad_Robo Inseguridad_Homicidio
## 1.452614 2.441999 2.319355
## Tipo_de_Cambio Densidad_Carretera CO2_Emisiones
## 4.533531 5.296000 1.526098
## PIB_Per_Cápita
## 2.818299
# Linear regresion with log
linear_regresion_log <- lm(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita, data = bd2)
summary(linear_regresion_log)
##
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educación +
## Innovación + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + CO2_Emisiones + PIB_Per_Cápita,
## data = bd2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43995 -0.11727 0.02566 0.10408 0.51909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.078e+01 2.217e+01 1.840 0.0857 .
## Exportaciones 8.867e-05 4.654e-05 1.905 0.0761 .
## Empleo -2.224e-01 1.771e-01 -1.255 0.2285
## Educación -2.092e+00 8.767e-01 -2.387 0.0306 *
## Innovación 7.228e-02 1.236e-01 0.585 0.5674
## Inseguridad_Robo -6.271e-03 3.821e-03 -1.641 0.1216
## Inseguridad_Homicidio 3.180e-02 1.836e-02 1.732 0.1038
## Tipo_de_Cambio 1.224e-01 4.181e-02 2.927 0.0104 *
## Densidad_Carretera 1.127e+01 1.385e+01 0.814 0.4285
## CO2_Emisiones 9.001e-01 6.476e-01 1.390 0.1848
## PIB_Per_Cápita 9.236e-07 1.599e-05 0.058 0.9547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2979 on 15 degrees of freedom
## Multiple R-squared: 0.641, Adjusted R-squared: 0.4017
## F-statistic: 2.678 on 10 and 15 DF, p-value: 0.04153
# Validation
RMSE(linear_regresion_log$fitted.values, bd2$IED_Flujos) # We detected a really high level of RMSE compared with the previous model
## [1] 28107.99
vif(linear_regresion_log) # We can see that our variables have no multicolineality
## Exportaciones Empleo Educación
## 33.302113 1.546579 48.825615
## Innovación Inseguridad_Robo Inseguridad_Homicidio
## 1.452614 2.441999 2.319355
## Tipo_de_Cambio Densidad_Carretera CO2_Emisiones
## 4.533531 5.296000 1.526098
## PIB_Per_Cápita
## 2.818299
bptest(linear_regresion_log) # We found out that our model has Heteroestlasticidad
##
## studentized Breusch-Pagan test
##
## data: linear_regresion_log
## BP = 11.25, df = 10, p-value = 0.3384
AIC(linear_regresion_log)
## [1] 20.51169
histogram(linear_regresion_log$residuals)
# Polynomial Regression model
# The first step to make a polynomial regression we need to determine which variables might behave in a polynomial form.
ggplot(bd1, aes(x =bd1$Exportaciones , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Exportaciones",
y = "IED Flujos")
## Warning: Use of `bd1$Exportaciones` is discouraged.
## ℹ Use `Exportaciones` instead.
## Warning: Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Empleo , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Empleo",
y = "IED Flujos")
## Warning: Use of `bd1$Empleo` is discouraged.
## ℹ Use `Empleo` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Educación , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Educacion",
y = "IED Flujos")
## Warning: Use of `bd1$Educación` is discouraged.
## ℹ Use `Educación` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Salario_Diario , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Salario Diario",
y = "IED Flujos")
## Warning: Use of `bd1$Salario_Diario` is discouraged.
## ℹ Use `Salario_Diario` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Innovación , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Innovacion",
y = "IED Flujos")
## Warning: Use of `bd1$Innovación` is discouraged.
## ℹ Use `Innovación` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Inseguridad_Robo , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Inseguridad Robo",
y = "IED Flujos")
## Warning: Use of `bd1$Inseguridad_Robo` is discouraged.
## ℹ Use `Inseguridad_Robo` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Inseguridad_Homicidio , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Inseguridad Homicidio",
y = "IED Flujos")
## Warning: Use of `bd1$Inseguridad_Homicidio` is discouraged.
## ℹ Use `Inseguridad_Homicidio` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Tipo_de_Cambio , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Tipo de cambio",
y = "IED Flujos")
## Warning: Use of `bd1$Tipo_de_Cambio` is discouraged.
## ℹ Use `Tipo_de_Cambio` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Densidad_Carretera , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Densidad Carretera",
y = "IED Flujos")
## Warning: Use of `bd1$Densidad_Carretera` is discouraged.
## ℹ Use `Densidad_Carretera` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$Densidad_Población , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "Densidad Poblacion",
y = "IED Flujos")
## Warning: Use of `bd1$Densidad_Población` is discouraged.
## ℹ Use `Densidad_Población` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$CO2_Emisiones , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "C02 Emisiones",
y = "IED Flujos")
## Warning: Use of `bd1$CO2_Emisiones` is discouraged.
## ℹ Use `CO2_Emisiones` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$PIB_Per_Cápita , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "PIB Per Capita",
y = "IED Flujos")
## Warning: Use of `bd1$PIB_Per_Cápita` is discouraged.
## ℹ Use `PIB_Per_Cápita` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
ggplot(bd1, aes(x =bd1$INPC , y = bd1$IED_Flujos)) +
geom_point(color = "blue", size = 3, shape = 16) +
labs(title = "Gráfico de Dispersión con ggplot2",
x = "INPC",
y = "IED Flujos")
## Warning: Use of `bd1$INPC` is discouraged.
## ℹ Use `INPC` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.
# Analysis: we can see that there are some variables that don`t follow a linear tendency. The values that we saw that have a polinomial behavior are: Salario Diario, Tipo de cambio, Densidad Carretera, PIB Per Capita.
# Polynomial regresion
polynomial_regresion1 <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+I(Tipo_de_Cambio^2)+Densidad_Carretera+I(Densidad_Carretera^2)+CO2_Emisiones+PIB_Per_Cápita+I(PIB_Per_Cápita^2), data = bd2)
summary(polynomial_regresion1)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación +
## Innovación + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + I(Tipo_de_Cambio^2) + Densidad_Carretera +
## I(Densidad_Carretera^2) + CO2_Emisiones + PIB_Per_Cápita +
## I(PIB_Per_Cápita^2), data = bd2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8441.7 -3443.8 -979.1 2913.1 15862.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.172e+06 1.158e+06 1.013 0.331
## Exportaciones -1.179e-01 1.510e+00 -0.078 0.939
## Empleo -3.586e+03 5.007e+03 -0.716 0.488
## Educación 7.724e+03 3.336e+04 0.232 0.821
## Innovación 4.533e+03 4.205e+03 1.078 0.302
## Inseguridad_Robo -2.786e+01 1.033e+02 -0.270 0.792
## Inseguridad_Homicidio 7.114e+02 4.855e+02 1.465 0.169
## Tipo_de_Cambio 4.904e+03 7.173e+03 0.684 0.507
## I(Tipo_de_Cambio^2) -1.268e+02 2.434e+02 -0.521 0.612
## Densidad_Carretera -6.165e+06 4.852e+06 -1.271 0.228
## I(Densidad_Carretera^2) 3.803e+07 2.999e+07 1.268 0.229
## CO2_Emisiones 2.039e+04 1.720e+04 1.186 0.259
## PIB_Per_Cápita -1.219e+01 1.576e+01 -0.774 0.454
## I(PIB_Per_Cápita^2) 4.580e-05 5.668e-05 0.808 0.435
##
## Residual standard error: 7390 on 12 degrees of freedom
## Multiple R-squared: 0.6592, Adjusted R-squared: 0.29
## F-statistic: 1.785 on 13 and 12 DF, p-value: 0.1622
# Validation
RMSE(polynomial_regresion1$fitted.values, bd2$IED_Flujos) # We saw that there is a significant numeber of RMSE compared to other models.
## [1] 5020.822
vif(polynomial_regresion1) # We can see that there is multicolineality in this model.
## Exportaciones Empleo Educación
## 56.925237 2.007776 114.861723
## Innovación Inseguridad_Robo Inseguridad_Homicidio
## 2.732010 2.899480 2.635145
## Tipo_de_Cambio I(Tipo_de_Cambio^2) Densidad_Carretera
## 216.809834 225.453056 1055.541862
## I(Densidad_Carretera^2) CO2_Emisiones PIB_Per_Cápita
## 859.760629 1.749050 4451.061848
## I(PIB_Per_Cápita^2)
## 4454.575920
bptest(polynomial_regresion1) # This model has Heteroestlasticity.
##
## studentized Breusch-Pagan test
##
## data: polynomial_regresion1
## BP = 7.1632, df = 13, p-value = 0.8935
AIC(polynomial_regresion1)
## [1] 546.8949
histogram(polynomial_regresion1$residuals)
# Lasso regresion model
### Split the Data in Training Data vs Test Data
set.seed(123) ### sets the random seed for reproducibility of results
training.samples<-bd2$IED_Flujos %>%
createDataPartition(p=0.75,list=FALSE) ### Lets consider 75% of the data to build a predictive model
train.data<-bd1[training.samples, ] ### training data to fit the linear regression model
test.data<-bd1[-training.samples, ] ### testing data to test the linear regression model
#################################
### LASSO REGRESSION ANALYSIS ###
#################################
# The dataset is transformed to model.matrix() format.
# Independent variables
x<-model.matrix(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,train.data)[,-1] ### OLS model specification
# x<-model.matrix(Weekly_Sales~.,train.data)[,-1] ### matrix of independent variables X's
y<-train.data$IED_Flujos ### dependent variable
# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every data / observation from the original dataset (datains) has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso <- cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# NO SE PUEDE USAR CON SOLO UNA VARIABLE
# Display the best lambda value
cv.lasso$lambda.min
## [1] 577.774
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
lassomodel
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = cv.lasso$lambda.min)
##
## Df %Dev Lambda
## 1 4 79.49 577.8
# Display regression coefficients
coef(lassomodel)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.689578e+05
## Exportaciones .
## Empleo 1.073709e+03
## Educación 6.902473e+03
## Innovación 2.561121e+03
## Inseguridad_Robo -2.515549e+00
## Inseguridad_Homicidio .
## Tipo_de_Cambio .
## Densidad_Carretera .
## CO2_Emisiones .
## PIB_Per_Cápita .
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~
Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,test.data)[,-1] ### OLS model specification
# x.test<-model.matrix(Weekly_Sales~.,test.data)[,-1]
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(lassopredictions, test.data$IED_Flujos),
Rsquare = R2(lassopredictions, test.data$IED_Flujos))
## RMSE Rsquare
## 1 10408.73 0.6924555
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])+ offset_x
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
#################################
### RIDGE REGRESSION ANALYSIS ###
#################################
# Find the best lambda using cross-validation
set.seed(123) # x: independent variables | y: dependent variable
cv.ridge <- cv.glmnet(x,y,alpha=0.1) # alpha = 0 for RIDGE
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.ridge$lambda.min
## [1] 2744.894
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)
# Display regression coefficients
coef(ridgemodel)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.134776e+05
## Exportaciones 6.145233e-02
## Empleo 8.671375e+02
## Educación 1.888556e+03
## Innovación 2.059001e+03
## Inseguridad_Robo -2.652667e+01
## Inseguridad_Homicidio 3.276834e+01
## Tipo_de_Cambio 9.590785e+01
## Densidad_Carretera 7.478123e+04
## CO2_Emisiones -1.611175e+03
## PIB_Per_Cápita 1.118728e-01
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,test.data)[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(ridgepredictions, test.data$IED_Flujos),
Rsquare = R2(ridgepredictions, test.data$IED_Flujos)
)
## RMSE Rsquare
## 1 10699 0.5946856
### visualizing ridge regression results
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)
# Resumen de modelos
tab <- matrix(c(5564,28102,4336,10408,10699,0.8059,0.2445,0.4704,0.692,0.594), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Linear Regresion (Log)','Polynomial ^2', 'Lasso','Ridge')
tab <- as.table(tab)
tab
## RMSE R2
## Linear Regression 5564.0000 0.8059
## Linear Regresion (Log) 28102.0000 0.2445
## Polynomial ^2 4336.0000 0.4704
## Lasso 10408.0000 0.6920
## Ridge 10699.0000 0.5940
# We can see that our model with the lowest RMSE is the polynomial model , but it has a R^2 really low so we are going to not choose that model. Our next model is the linear regression model. This has a low RMSE compare to the other models and offers the highest R^2. This means that this models shows a better balance in the parameters and is better suited for our analysis.
# Show the predicted values of the dependent variable
final_model <- practice_linear_regresion
colors <- c("purple","green","blue", "red")
par(mfrow=c(2,2))
plot(final_model, which = 1, col = colors[1], pch = 19)
plot(final_model, which = 2, col = colors[2], pch = 19)
plot(final_model, which = 3, col = colors[3], pch = 19)
plot(final_model, which = 5, col = colors[4], pch = 19)
par(mfrow=c(1,1))
summary(final_model)
##
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación +
## Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio +
## Tipo_de_Cambio + Densidad_Carretera + Densidad_Población +
## CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4595.1 -2152.5 -531.4 2152.8 4934.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e+06 3.812e+05 -2.695 0.01950 *
## Exportaciones -1.083e+00 6.748e-01 -1.605 0.13451
## Empleo 5.757e+03 2.371e+03 2.428 0.03184 *
## Educación 1.779e+05 4.490e+04 3.961 0.00189 **
## Salario_Diario -1.205e+01 2.044e+02 -0.059 0.95395
## Innovación 4.849e+03 1.502e+03 3.228 0.00725 **
## Inseguridad_Robo 4.352e+01 5.138e+01 0.847 0.41353
## Inseguridad_Homicidio 1.385e+03 5.516e+02 2.511 0.02736 *
## Tipo_de_Cambio -2.582e+03 8.448e+02 -3.057 0.00996 **
## Densidad_Carretera -1.240e+06 1.151e+06 -1.078 0.30232
## Densidad_Población -1.645e+04 4.287e+03 -3.836 0.00237 **
## CO2_Emisiones 9.462e+03 1.173e+04 0.807 0.43548
## PIB_Per_Cápita -3.341e-01 4.966e-01 -0.673 0.51384
## INPC 9.457e+01 9.535e+02 0.099 0.92263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared: 0.9068, Adjusted R-squared: 0.8059
## F-statistic: 8.985 on 13 and 12 DF, p-value: 0.0002737
After analyzing the regression model we can see that the variables with the most corelation with the foreign investment is “Tipo de cambio” , “Densidad de Población”, “Innovacion” and “Educación”. Other variables that have a significant impact are: “empleo” and “inseguridad homicidio”.
Testing our hyphotesis:
Hypothesis 1: As the amount of thefts increases the number of foreign investment decreases. Analysis: We reject our hypothesis since in our coefficients of thefts is not significant.
Hypothesis 2: As the level of education increases the more attractive mexico becomes for foreign investment. Analysis: we accept our hypothesis since we see that this has a significant relationship between the level of education and the foreign investment.
Hypothesis 3: As there is more haighways in the country, the foreign investment increases. Analysis: We reject our hyphotesis since in our coeficient of Densidad Carretera is not signficant.
What are the conditions that Mexico offers to attract nearshoring? The conditions that attract investors to make nearshoring in Mexico are a high level of education, a great level of innovation and a low exchange rate.
What econometric model should be applied to predict the effect of nearshoring in Mexico? The econometric model that should be applied to predict the effects that causes nearshoring in mexico is called “final_model” and it´s a linear regression model.
What are the areas of opportunity for Mexico in terms of relocating new business opportunities? The main strategy that Mexico can do in order to have more nearshoring opportunities is to improve in it´s innovation. This turned out to be one of the most important variables with the highest positive coeficient, which means that it has a strong positive relationship in the foreign investment as the innovation gets better. Investment in the education and educational programs and projects that potenzialice innovation across the country will be the key factor to make Mexico the go to country for nearshoring.