Part 1 - Background:

Part 2 - Problem Sitution

Part 3 - Data and Methodology

# Importing the libraries we are going to use:

library(foreign)
library(dplyr)        # data manipulation 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(forcats)      # to work with categorical variables
library(ggplot2)      # data visualization 
library(readr)        # read specific csv files
library(janitor)      # data exploration and cleaning 
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(Hmisc)        # several useful functions for data analysis 
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(psych)        # functions for multivariate analysis 
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(naniar)       # summaries and visualization of missing values NA's
library(corrplot)     # correlation plots
## corrplot 0.92 loaded
library(jtools)       # presentation of regression analysis 
## 
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
## 
##     %nin%
library(lmtest)       # diagnostic checks - linear regression analysis 
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)          # diagnostic checks - linear regression analysis
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
library(olsrr)        # diagnostic checks - linear regression analysis 
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
library(naniar)       # identifying missing values
library(stargazer)    # create publication quality tables
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(effects)      # displays for linear and other regression models
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(tidyverse)    # collection of R packages designed for data science
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ stringr   1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%()       masks ggplot2::%+%()
## ✖ psych::alpha()     masks ggplot2::alpha()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ car::recode()      masks dplyr::recode()
## ✖ purrr::some()      masks car::some()
## ✖ Hmisc::src()       masks dplyr::src()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)        # Classification and Regression Training 
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(glmnet)       # methods for prediction and plotting, and functions for cross-validation
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7
library(readxl)       # Leer archivos en excel
library(gganimate)
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
library(gapminder)
library(ggcorrplot)
# Uploading the data base. We are going to create to different data bases. One for the first page of the database and another one for the second page. 

# Upload the first page of our data base.
bd1 <- read_excel("/Users/pedrovillanueva/Desktop/SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = "Datos") 

 # Upload the second page of our data base.
bd2 <- read_excel("/Users/pedrovillanueva/Desktop/SP_DataMexicoAtractiveness_alumn-VF.xlsx", sheet = "Datos_Series_de_Tiempo")

# Confirm that our data base was uploaded correctly.
bd1 
## # A tibble: 26 × 15
##      Año IED_Flujos Exportaciones Empleo Educación     Salario_Diario Innovación
##    <dbl>      <dbl>         <dbl> <chr>  <chr>                  <dbl> <chr>     
##  1  1997     12146.         9088. -      7.1979679955…           24.3 11.301406…
##  2  1998      8374.         9875. -      7.3115866164…           31.9 11.372208…
##  3  1999     13960.        10990. -      7.4279238305…           31.9 12.458970…
##  4  2000     18249.        12483. 97.83  7.56                    35.1 13.145562…
##  5  2001     30057.        11300. 97.36  7.6782701458…           37.6 13.468116…
##  6  2002     24099.        11923. 97.66  7.8034178920…           39.7 12.798930…
##  7  2003     18250.        13156  97.06  7.9254003630…           41.5 11.810749…
##  8  2004     25016.        13573. 96.48  8.0438954581…           43.3 12.608412…
##  9  2005     25796.        16466. 97.17  8.14                    45.2 13.414424…
## 10  2006     21233.        17486. 96.53  8.2591355074…           47.0 14.231369…
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <chr>,
## #   Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Población <dbl>,
## #   CO2_Emisiones <chr>, PIB_Per_Cápita <dbl>, INPC <dbl>
bd2
## # A tibble: 96 × 3
##      Año Trimestre IED_Flujos
##    <dbl> <chr>          <dbl>
##  1  1999 I              3596.
##  2  1999 II             3396.
##  3  1999 III            3028.
##  4  1999 IV             3940.
##  5  2000 I              4601.
##  6  2000 II             4857.
##  7  2000 III            3057.
##  8  2000 IV             5734.
##  9  2001 I              3599.
## 10  2001 II             5219.
## # ℹ 86 more rows
# Knowing the variables that we are working with is essential when making an exploratory data analysis. In the next code we take a further look into the different variables that we have in each data base. 

# Creates a variable that stores the column names of each of our databases.
colnames1 <- colnames(bd1) 
colnames2 <- colnames(bd2)

# Shows us the name of the variables in each of our databases. 
colnames1
##  [1] "Año"                   "IED_Flujos"            "Exportaciones"        
##  [4] "Empleo"                "Educación"             "Salario_Diario"       
##  [7] "Innovación"            "Inseguridad_Robo"      "Inseguridad_Homicidio"
## [10] "Tipo_de_Cambio"        "Densidad_Carretera"    "Densidad_Población"   
## [13] "CO2_Emisiones"         "PIB_Per_Cápita"        "INPC"
colnames2
## [1] "Año"        "Trimestre"  "IED_Flujos"
# Descriptive Statistics 

# General descriptive statistics of our first data base. 
descriptive_statistics_1 <-  summary(bd1)
descriptive_statistics_1
##       Año         IED_Flujos    Exportaciones      Empleo         
##  Min.   :1997   Min.   : 8374   Min.   : 9088   Length:26         
##  1st Qu.:2003   1st Qu.:21367   1st Qu.:13260   Class :character  
##  Median :2010   Median :27698   Median :21188   Mode  :character  
##  Mean   :2010   Mean   :26770   Mean   :23601                     
##  3rd Qu.:2016   3rd Qu.:32183   3rd Qu.:31601                     
##  Max.   :2022   Max.   :48354   Max.   :46478                     
##   Educación         Salario_Diario    Innovación        Inseguridad_Robo
##  Length:26          Min.   : 24.30   Length:26          Min.   :120.5   
##  Class :character   1st Qu.: 41.97   Class :character   1st Qu.:148.3   
##  Mode  :character   Median : 54.48   Mode  :character   Median :181.8   
##                     Mean   : 65.16                      Mean   :185.4   
##                     3rd Qu.: 72.31                      3rd Qu.:209.9   
##                     Max.   :172.87                      Max.   :314.8   
##  Inseguridad_Homicidio Tipo_de_Cambio   Densidad_Carretera Densidad_Población
##  Length:26             Min.   : 8.064   Min.   :0.05205    Min.   :47.44     
##  Class :character      1st Qu.:10.752   1st Qu.:0.05954    1st Qu.:52.78     
##  Mode  :character      Median :13.016   Median :0.06989    Median :58.09     
##                        Mean   :13.910   Mean   :0.07106    Mean   :57.33     
##                        3rd Qu.:18.489   3rd Qu.:0.08275    3rd Qu.:61.39     
##                        Max.   :20.664   Max.   :0.09020    Max.   :65.60     
##  CO2_Emisiones      PIB_Per_Cápita        INPC       
##  Length:26          Min.   :126739   Min.   : 33.28  
##  Class :character   1st Qu.:130964   1st Qu.: 56.15  
##  Mode  :character   Median :136845   Median : 73.35  
##                     Mean   :138550   Mean   : 75.17  
##                     3rd Qu.:146148   3rd Qu.: 91.29  
##                     Max.   :153236   Max.   :126.48
# Observations bd1: 
# We can see that there are some variables that are being put as a character when they should be as a float. This means that there are inconsistencies with the data and a process of data cleaning should be done. We first need to transform this variables to numeric, then finding the missing values and lastly implement an imputation process. 
# The variables that need to be examinate are: "Inseguridad_Homicidio", "Innovación", "Educación", "CO2_Emisiones" and "Empleo"

# General descriptive statistics of our second data base. 
descriptive_statistics_2 <- summary(bd2)
descriptive_statistics_2
##       Año        Trimestre           IED_Flujos   
##  Min.   :1999   Length:96          Min.   : 1341  
##  1st Qu.:2005   Class :character   1st Qu.: 4351  
##  Median :2010   Mode  :character   Median : 6238  
##  Mean   :2011                      Mean   : 7036  
##  3rd Qu.:2016                      3rd Qu.: 8053  
##  Max.   :2023                      Max.   :22794
# Observations bd2: 
# We can see that in this case the variable "Trimestre" is in a character form. In this case this is correct and the variable should be in this form. This means that no change in the type of data are needed. 
# Transforming variables that are in a character form to numeric. 

# With this variable we are transforming the character values of the columns to a numeric type in our first base.  
bd1$Inseguridad_Homicidio <- as.numeric(bd1$Inseguridad_Homicidio)
## Warning: NAs introduced by coercion
bd1$Innovación <- as.numeric(bd1$Innovación)
## Warning: NAs introduced by coercion
bd1$Educación <- as.numeric(bd1$Educación)
## Warning: NAs introduced by coercion
bd1$Empleo <-  as.numeric(bd1$Empleo)
## Warning: NAs introduced by coercion
bd1$CO2_Emisiones <- as.numeric(bd1$CO2_Emisiones)
## Warning: NAs introduced by coercion
# Observations: 
# NAs where introduced by coercion. This means that there where measing values in these columns and latter on the process we will need to take attention of this values. 
# Looking for missing values, inconsistencies or NA. 

# We create a variable that stores the amount of NA in our two data bases. 
na_bd1 <- sum(is.na(bd1))
na_bd2 <- sum(is.na(bd2))

# We show the number of NA in each database
na_bd1
## [1] 12
na_bd2
## [1] 0
# Observations: 
# In the first data base we have 12 missing values. We need to proced with an imputation method to take care of these values. For the second database there where no NAs found so no further work needs to be done. 
# Imputation method for missing values. 


# Calculating the mean for each of the variables that have NA. Except for the variable Educación. For this variable, we see that there is a tendency so putting the mean will not take into consideration this trend. So we will implement 
mean_empleo <- mean(bd1$Empleo, na.rm = TRUE)
mean_innovacion <-  mean(bd1$Innovación, na.rm = TRUE)
mean_inseguridad <- mean(bd1$Inseguridad_Homicidio, na.rm = TRUE)
mean_co2 <- mean(bd1$CO2_Emisiones, na.rm = TRUE)


# Imputating the missing values with the mean of each category.
bd1$Empleo[is.na(bd1$Empleo)] <- mean_empleo
bd1$Innovación[is.na(bd1$Innovación)] <- mean_innovacion
bd1$Inseguridad_Homicidio[is.na(bd1$Inseguridad_Homicidio)] <- mean_inseguridad
bd1$CO2_Emisiones[is.na(bd1$CO2_Emisiones)] <- mean_co2

# We look for any missing values in our data. 
na_bd1_new <- sum(is.na(bd1))
na_bd1_new
## [1] 3
# The missing values that we have are from the variable Educación. We detected a trend for this variable so we will use an approx in a sequencue model in order to take into consideration this change. 
imputation_education <- approx(seq_along(bd1$Educación),bd1$Educación, method = "linear", n = length(bd1$Educación))$y
bd1$Educación <- imputation_education
# With this code we can see that our new imputated data follows the existing trend of our data base.
bd1$Educación
##  [1] 7.197968 7.297952 7.400003 7.512453 7.621500 7.728329 7.837573 7.944360
##  [9] 8.047740 8.132312 8.235308 8.326336 8.413348 8.500440 8.579609 8.654893
## [17] 8.762427 8.850012 8.937710 9.025490 9.113350 9.201318 9.289411 9.377608
## [25] 9.468821 9.578832
# We made another count in order to confirm that there are no more missing values. 
sum(is.na(bd1))
## [1] 0

Part 4 - Descriptive Statistics

# Descriptive statistics with full data en b1 
descriptive_statistics_1 <-  summary(bd1)
descriptive_statistics_1
##       Año         IED_Flujos    Exportaciones       Empleo        Educación    
##  Min.   :1997   Min.   : 8374   Min.   : 9088   Min.   :95.06   Min.   :7.198  
##  1st Qu.:2003   1st Qu.:21367   1st Qu.:13260   1st Qu.:96.09   1st Qu.:7.864  
##  Median :2010   Median :27698   Median :21188   Median :96.48   Median :8.457  
##  Mean   :2010   Mean   :26770   Mean   :23601   Mean   :96.47   Mean   :8.424  
##  3rd Qu.:2016   3rd Qu.:32183   3rd Qu.:31601   3rd Qu.:97.01   3rd Qu.:9.004  
##  Max.   :2022   Max.   :48354   Max.   :46478   Max.   :97.83   Max.   :9.579  
##  Salario_Diario     Innovación    Inseguridad_Robo Inseguridad_Homicidio
##  Min.   : 24.30   Min.   :11.28   Min.   :120.5    Min.   : 8.037       
##  1st Qu.: 41.97   1st Qu.:12.60   1st Qu.:148.3    1st Qu.:10.402       
##  Median : 54.48   Median :13.11   Median :181.8    Median :17.110       
##  Mean   : 65.16   Mean   :13.11   Mean   :185.4    Mean   :17.292       
##  3rd Qu.: 72.31   3rd Qu.:13.60   3rd Qu.:209.9    3rd Qu.:22.346       
##  Max.   :172.87   Max.   :15.11   Max.   :314.8    Max.   :29.592       
##  Tipo_de_Cambio   Densidad_Carretera Densidad_Población CO2_Emisiones  
##  Min.   : 8.064   Min.   :0.05205    Min.   :47.44      Min.   :3.592  
##  1st Qu.:10.752   1st Qu.:0.05954    1st Qu.:52.78      1st Qu.:3.843  
##  Median :13.016   Median :0.06989    Median :58.09      Median :3.946  
##  Mean   :13.910   Mean   :0.07106    Mean   :57.33      Mean   :3.946  
##  3rd Qu.:18.489   3rd Qu.:0.08275    3rd Qu.:61.39      3rd Qu.:4.088  
##  Max.   :20.664   Max.   :0.09020    Max.   :65.60      Max.   :4.221  
##  PIB_Per_Cápita        INPC       
##  Min.   :126739   Min.   : 33.28  
##  1st Qu.:130964   1st Qu.: 56.15  
##  Median :136845   Median : 73.35  
##  Mean   :138550   Mean   : 75.17  
##  3rd Qu.:146148   3rd Qu.: 91.29  
##  Max.   :153236   Max.   :126.48
# Another way of looking at the description of each of our data bases. 
describe_1 <- describe(bd1)
describe_2 <- describe(bd2)
describe_1
##                       vars  n      mean       sd    median   trimmed      mad
## Año                      1 26   2009.50     7.65   2009.50   2009.50     9.64
## IED_Flujos               2 26  26770.12  8770.68  27697.59  26860.64  9079.25
## Exportaciones            3 26  23600.91 11346.13  21187.89  22875.85 13370.71
## Empleo                   4 26     96.47     0.72     96.48     96.48     0.75
## Educación                5 26      8.42     0.72      8.46      8.43     0.88
## Salario_Diario           6 26     65.16    35.85     54.48     60.16    22.51
## Innovación               7 26     13.11     1.07     13.11     13.09     0.78
## Inseguridad_Robo         8 26    185.42    47.67    181.83    181.16    47.06
## Inseguridad_Homicidio    9 26     17.29     7.12     17.11     16.99     9.31
## Tipo_de_Cambio          10 26     13.91     4.15     13.02     13.78     4.25
## Densidad_Carretera      11 26      0.07     0.01      0.07      0.07     0.02
## Densidad_Población      12 26     57.33     5.41     58.09     57.44     6.68
## CO2_Emisiones           13 26      3.95     0.18      3.95      3.95     0.17
## PIB_Per_Cápita          14 26 138550.10  8861.10 136845.30 138255.64 11080.42
## INPC                    15 26     75.17    24.81     73.35     74.45    27.14
##                             min       max    range  skew kurtosis      se
## Año                     1997.00   2022.00    25.00  0.00    -1.34    1.50
## IED_Flujos              8373.50  48354.42 39980.92 -0.02    -0.08 1720.07
## Exportaciones           9087.62  46477.58 37389.97  0.46    -1.08 2225.16
## Empleo                    95.06     97.83     2.77 -0.14    -0.74    0.14
## Educación                  7.20      9.58     2.38 -0.09    -1.28    0.14
## Salario_Diario            24.30    172.87   148.57  1.43     1.44    7.03
## Innovación                11.28     15.11     3.83  0.13    -0.70    0.21
## Inseguridad_Robo         120.49    314.78   194.28  0.89     0.30    9.35
## Inseguridad_Homicidio      8.04     29.59    21.56  0.38    -1.28    1.40
## Tipo_de_Cambio             8.06     20.66    12.60  0.44    -1.39    0.81
## Densidad_Carretera         0.05      0.09     0.04  0.18    -1.50    0.00
## Densidad_Población        47.44     65.60    18.16 -0.19    -1.24    1.06
## CO2_Emisiones              3.59      4.22     0.63 -0.15    -0.94    0.04
## PIB_Per_Cápita        126738.75 153235.73 26496.98  0.28    -1.41 1737.81
## INPC                      33.28    126.48    93.20  0.26    -0.95    4.87
describe_2
##            vars  n    mean      sd median trimmed     mad     min      max
## Año           1 96 2010.51    6.98 2010.5 2010.50    8.90 1999.00  2023.00
## Trimestre*    2 96    2.50    1.12    2.5    2.50    1.48    1.00     4.00
## IED_Flujos    3 96 7036.50 3978.53 6237.9 6458.65 2797.97 1340.58 22794.16
##               range skew kurtosis     se
## Año           24.00 0.01    -1.23   0.71
## Trimestre*     3.00 0.00    -1.39   0.11
## IED_Flujos 21453.58 1.60     3.01 406.06
# Visualization of our data - Grpahs

# Loading the data frame 
bd1.1 <-  bd1%>% select(IED_Flujos,Año,Exportaciones,Empleo,Educación,Salario_Diario,Innovación,Inseguridad_Robo,Inseguridad_Homicidio,Tipo_de_Cambio,Densidad_Carretera,Densidad_Población,CO2_Emisiones,PIB_Per_Cápita,INPC) %>% group_by(Año)
bd1.1
## # A tibble: 26 × 15
## # Groups:   Año [26]
##    IED_Flujos   Año Exportaciones Empleo Educación Salario_Diario Innovación
##         <dbl> <dbl>         <dbl>  <dbl>     <dbl>          <dbl>      <dbl>
##  1     12146.  1997         9088.   96.5      7.20           24.3       11.3
##  2      8374.  1998         9875.   96.5      7.30           31.9       11.4
##  3     13960.  1999        10990.   96.5      7.40           31.9       12.5
##  4     18249.  2000        12483.   97.8      7.51           35.1       13.1
##  5     30057.  2001        11300.   97.4      7.62           37.6       13.5
##  6     24099.  2002        11923.   97.7      7.73           39.7       12.8
##  7     18250.  2003        13156    97.1      7.84           41.5       11.8
##  8     25016.  2004        13573.   96.5      7.94           43.3       12.6
##  9     25796.  2005        16466.   97.2      8.05           45.2       13.4
## 10     21233.  2006        17486.   96.5      8.13           47.0       14.2
## # ℹ 16 more rows
## # ℹ 8 more variables: Inseguridad_Robo <dbl>, Inseguridad_Homicidio <dbl>,
## #   Tipo_de_Cambio <dbl>, Densidad_Carretera <dbl>, Densidad_Población <dbl>,
## #   CO2_Emisiones <dbl>, PIB_Per_Cápita <dbl>, INPC <dbl>
# Graph 1:
graph1 <- ggplot(data=bd1.1,aes(x=Año,IED_Flujos,y=IED_Flujos,fill=Educación)) +
  geom_bar(stat="identity") + coord_flip() +
  ggtitle("Effect of education in the foreign investment throught the years") + 
  labs(x = "Foreign Investment",
       fill = "Education Level",
       y = "Year")
graph1

# Analysis: In this graph we can see that the level of education has been increasing through the years. We can see a correlation between the level of education and the foreign investment. We can see that as the education has been getting better, the amount of foreign investment has increase. 


# Graph 2:
a <- ggplot(bd1.1, aes(Densidad_Carretera,IED_Flujos))
graph2 <- a + geom_point() + 
  geom_smooth(method="lm", se=F) +
  labs( 
       y="Total foreign investment", 
       x="Amount of highways", 
       title="Impact of the amount of highways to the foreign investment in Mexico", 
       caption="Source: CIC")
graph2
## `geom_smooth()` using formula = 'y ~ x'

# Analysis: In this graph we can see the relationship between the amount of highways in the country and the total foreign investment. Also we can see a tendency line that crosses the data in order to better view the behavior. We can see in the line that as the amount of highways increases the amount of foreign investment also increases. We can see that our data dosen´t fully behave as our line since we can see an early peak and then the increase becomes slower. 

# Graph 3: For this graph we will create two differnt graphs and compare them between each other in order to determine which factor of insecurity affects more the foreign investment; the amount of thefs or the amount of murders.

#Graph 3,1: 
theme_set(theme_bw())  # pre-set the bw theme.
b <- ggplot(bd1.1, aes(Inseguridad_Homicidio,IED_Flujos))
graph3.1 <- b + geom_count(col="tomato3", show.legend=F) +
  labs(title="Correlation between homicides to the foreign investment", 
       y="Foreign Investment", 
       x="Homicide Rate", 
       subtitle="Counts Plot")
graph3.1

# Analysis graph 3.1: we see that there is not a clear tendency in our data. We see that our dots are skatter all over the graph so we can´t assume that the homicide rate increases or decreases the foreign investment by just looking at the behavior of the graph. 


# Graph 3.2: 
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(bd1.1, aes(Inseguridad_Robo,IED_Flujos))
graph3.2 <- g + geom_count(col="blue", show.legend=F) +
  labs(title="Impact of theft to the foreign investment", 
       y="Foreign Investment", 
       x="Thieves Rate", 
       subtitle="Counts Plot")
graph3.2

# Analysis graph 3.2: in this graph we can clearly see that there is a negative correlation between the two vairables. This means that we can assume that as the thieves rate goes up the amount of foreign investment goes down. 

# Analysis Graph 3_General: We can assume that the amount of theves have a bigger negative correlationship with the foreign investment than the amount of murders. 

# Graph 4: 
corre <- round(cor(subset(bd1,select =  -c(Año,INPC,Exportaciones,Educación,Densidad_Carretera))), 1)

ggcorrplot(corre, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("red", "white", "green"), 
           title="Correlation between our variables", 
           ggtheme=theme_bw)

# In this graph we cann see that there is a positive correlation between the population density and the daily salary and the exchange rate. We can also observe that there is a negative corelationship between the foreign investment and the amount of thefs commited. 



# IMPORTANT INFORMATION: The insights discover in this graphs are NOT the conclusions of the analysis since we haven´t find the concrete correlationship. This is just to have a look at our data and helps us to create hypothesis that will be further prove in our regression analysis. 

Hipothesis:

  • Hypothesis 1: As the amount of thefts increases the number of foreign investment decreases.
  • Hypothesis 2: As the level of education increases the more attractive mexico becomes for foreign investment.
  • Hypothesis 3: As there is more haighways in the country, the foreign investment increases.
# Linear regresion 
practice_linear_regresion <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Salario_Diario+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+Densidad_Población+CO2_Emisiones+PIB_Per_Cápita+INPC, data = bd1)
summary(practice_linear_regresion)
## 
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación + 
##     Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio + 
##     Tipo_de_Cambio + Densidad_Carretera + Densidad_Población + 
##     CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4595.1 -2152.5  -531.4  2152.8  4934.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           -1.027e+06  3.812e+05  -2.695  0.01950 * 
## Exportaciones         -1.083e+00  6.748e-01  -1.605  0.13451   
## Empleo                 5.757e+03  2.371e+03   2.428  0.03184 * 
## Educación              1.779e+05  4.490e+04   3.961  0.00189 **
## Salario_Diario        -1.205e+01  2.044e+02  -0.059  0.95395   
## Innovación             4.849e+03  1.502e+03   3.228  0.00725 **
## Inseguridad_Robo       4.352e+01  5.138e+01   0.847  0.41353   
## Inseguridad_Homicidio  1.385e+03  5.516e+02   2.511  0.02736 * 
## Tipo_de_Cambio        -2.582e+03  8.448e+02  -3.057  0.00996 **
## Densidad_Carretera    -1.240e+06  1.151e+06  -1.078  0.30232   
## Densidad_Población    -1.645e+04  4.287e+03  -3.836  0.00237 **
## CO2_Emisiones          9.462e+03  1.173e+04   0.807  0.43548   
## PIB_Per_Cápita        -3.341e-01  4.966e-01  -0.673  0.51384   
## INPC                   9.457e+01  9.535e+02   0.099  0.92263   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared:  0.9068, Adjusted R-squared:  0.8059 
## F-statistic: 8.985 on 13 and 12 DF,  p-value: 0.0002737
vif(practice_linear_regresion) # We detected that there was multicolinearity. In order to errase this, we will need to do a new imputation system in our data and eliminate some variables. 
##         Exportaciones                Empleo             Educación 
##             98.143017              4.886377           1727.453825 
##        Salario_Diario            Innovación      Inseguridad_Robo 
##             89.896000              4.348086             10.041711 
## Inseguridad_Homicidio        Tipo_de_Cambio    Densidad_Carretera 
##             25.807378             20.578432            396.904949 
##    Densidad_Población         CO2_Emisiones        PIB_Per_Cápita 
##            901.177320              7.492508             32.429142 
##                  INPC 
##            936.922243
# We are going to replace the highest and lowest registers in our data and add the average. This will eliminate multicolinearity. We 

bd2 <- bd1
num_values_to_replace <- 3  

#Exportaciones
median_value_exportaciones <- median(bd2[["Exportaciones"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Exportaciones"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Exportaciones"]][lowest_indices] <- median_value_exportaciones
bd2[["Exportaciones"]][highest_indices] <- median_value_exportaciones

# Empleo
median_value_empleo <- median(bd2[["Empleo"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Empleo"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Empleo"]][lowest_indices] <- median_value_empleo
bd2[["Empleo"]][highest_indices] <- median_value_empleo

# Educación
median_value_educacion <- median(bd2[["Educación"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Educación"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Educación"]][lowest_indices] <- median_value_educacion
bd2[["Educación"]][highest_indices] <- median_value_educacion

# Salario_Diario
median_value_salario <- median(bd2[["Salario_Diario"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Salario_Diario"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Salario_Diario"]][lowest_indices] <- median_value_salario
bd2[["Salario_Diario"]][highest_indices] <- median_value_salario

# Innovación
median_value_innovacion <- median(bd2[["Innovación"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Innovación"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Innovación"]][lowest_indices] <- median_value_innovacion
bd2[["Innovación"]][highest_indices] <- median_value_innovacion

# Inseguridad_Robo
median_value_robo <- median(bd2[["Inseguridad_Robo"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Inseguridad_Robo"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Inseguridad_Robo"]][lowest_indices] <- median_value_robo
bd2[["Inseguridad_Robo"]][highest_indices] <- median_value_robo

# Inseguridad_Homicidio
median_value_homicidio <- median(bd2[["Inseguridad_Homicidio"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Inseguridad_Homicidio"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Inseguridad_Homicidio"]][lowest_indices] <- median_value_homicidio
bd2[["Inseguridad_Homicidio"]][highest_indices] <- median_value_homicidio

# Tipo_de_Cambio
median_value_cambio <- median(bd2[["Tipo_de_Cambio"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Tipo_de_Cambio"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Tipo_de_Cambio"]][lowest_indices] <- median_value_cambio
bd2[["Tipo_de_Cambio"]][highest_indices] <- median_value_cambio

# Densidad_Carretera
median_value_carretera <- median(bd2[["Densidad_Carretera"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Densidad_Carretera"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Densidad_Carretera"]][lowest_indices] <- median_value_carretera
bd2[["Densidad_Carretera"]][highest_indices] <- median_value_carretera

# Densidad_Población
median_value_poblacion <- median(bd2[["Densidad_Población"]], na.rm = TRUE)
sorted_indices <- order(bd2[["Densidad_Población"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["Densidad_Población"]][lowest_indices] <- median_value_poblacion
bd2[["Densidad_Población"]][highest_indices] <- median_value_poblacion

# CO2_Emisiones
median_value_c02 <- median(bd2[["CO2_Emisiones"]], na.rm = TRUE)
sorted_indices <- order(bd2[["CO2_Emisiones"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["CO2_Emisiones"]][lowest_indices] <- median_value_c02
bd2[["CO2_Emisiones"]][highest_indices] <- median_value_c02

# PIB_Per_Cápita
median_value_pib <- median(bd2[["PIB_Per_Cápita"]], na.rm = TRUE)
sorted_indices <- order(bd2[["PIB_Per_Cápita"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["PIB_Per_Cápita"]][lowest_indices] <- median_value_pib
bd2[["PIB_Per_Cápita"]][highest_indices] <- median_value_pib

# INPC
median_value_inpc <- median(bd2[["INPC"]], na.rm = TRUE)
sorted_indices <- order(bd2[["INPC"]], na.last = NA)
lowest_indices <- sorted_indices[1:num_values_to_replace]
highest_indices <- sorted_indices[(length(sorted_indices) - num_values_to_replace + 1):length(sorted_indices)]
bd2[["INPC"]][lowest_indices] <- median_value_inpc
bd2[["INPC"]][highest_indices] <- median_value_inpc




# Performing a new linear regresion 
practice2_linear_regresion <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita, data = bd2)
summary(practice_linear_regresion)
## 
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación + 
##     Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio + 
##     Tipo_de_Cambio + Densidad_Carretera + Densidad_Población + 
##     CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4595.1 -2152.5  -531.4  2152.8  4934.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           -1.027e+06  3.812e+05  -2.695  0.01950 * 
## Exportaciones         -1.083e+00  6.748e-01  -1.605  0.13451   
## Empleo                 5.757e+03  2.371e+03   2.428  0.03184 * 
## Educación              1.779e+05  4.490e+04   3.961  0.00189 **
## Salario_Diario        -1.205e+01  2.044e+02  -0.059  0.95395   
## Innovación             4.849e+03  1.502e+03   3.228  0.00725 **
## Inseguridad_Robo       4.352e+01  5.138e+01   0.847  0.41353   
## Inseguridad_Homicidio  1.385e+03  5.516e+02   2.511  0.02736 * 
## Tipo_de_Cambio        -2.582e+03  8.448e+02  -3.057  0.00996 **
## Densidad_Carretera    -1.240e+06  1.151e+06  -1.078  0.30232   
## Densidad_Población    -1.645e+04  4.287e+03  -3.836  0.00237 **
## CO2_Emisiones          9.462e+03  1.173e+04   0.807  0.43548   
## PIB_Per_Cápita        -3.341e-01  4.966e-01  -0.673  0.51384   
## INPC                   9.457e+01  9.535e+02   0.099  0.92263   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared:  0.9068, Adjusted R-squared:  0.8059 
## F-statistic: 8.985 on 13 and 12 DF,  p-value: 0.0002737
RMSE(practice2_linear_regresion$fitted.values, bd2$IED_Flujos)
## [1] 5486.737
vif(practice2_linear_regresion) # We asjusted the model in order to find no multicolineality. 
##         Exportaciones                Empleo             Educación 
##             33.302113              1.546579             48.825615 
##            Innovación      Inseguridad_Robo Inseguridad_Homicidio 
##              1.452614              2.441999              2.319355 
##        Tipo_de_Cambio    Densidad_Carretera         CO2_Emisiones 
##              4.533531              5.296000              1.526098 
##        PIB_Per_Cápita 
##              2.818299
# Linear regresion with log 
linear_regresion_log <-  lm(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita, data = bd2)

summary(linear_regresion_log)
## 
## Call:
## lm(formula = log(IED_Flujos) ~ Exportaciones + Empleo + Educación + 
##     Innovación + Inseguridad_Robo + Inseguridad_Homicidio + 
##     Tipo_de_Cambio + Densidad_Carretera + CO2_Emisiones + PIB_Per_Cápita, 
##     data = bd2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43995 -0.11727  0.02566  0.10408  0.51909 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            4.078e+01  2.217e+01   1.840   0.0857 .
## Exportaciones          8.867e-05  4.654e-05   1.905   0.0761 .
## Empleo                -2.224e-01  1.771e-01  -1.255   0.2285  
## Educación             -2.092e+00  8.767e-01  -2.387   0.0306 *
## Innovación             7.228e-02  1.236e-01   0.585   0.5674  
## Inseguridad_Robo      -6.271e-03  3.821e-03  -1.641   0.1216  
## Inseguridad_Homicidio  3.180e-02  1.836e-02   1.732   0.1038  
## Tipo_de_Cambio         1.224e-01  4.181e-02   2.927   0.0104 *
## Densidad_Carretera     1.127e+01  1.385e+01   0.814   0.4285  
## CO2_Emisiones          9.001e-01  6.476e-01   1.390   0.1848  
## PIB_Per_Cápita         9.236e-07  1.599e-05   0.058   0.9547  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2979 on 15 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.4017 
## F-statistic: 2.678 on 10 and 15 DF,  p-value: 0.04153
# Validation
RMSE(linear_regresion_log$fitted.values, bd2$IED_Flujos) # We detected a really high level of RMSE compared with the previous model 
## [1] 28107.99
vif(linear_regresion_log) # We can see that our variables have no multicolineality
##         Exportaciones                Empleo             Educación 
##             33.302113              1.546579             48.825615 
##            Innovación      Inseguridad_Robo Inseguridad_Homicidio 
##              1.452614              2.441999              2.319355 
##        Tipo_de_Cambio    Densidad_Carretera         CO2_Emisiones 
##              4.533531              5.296000              1.526098 
##        PIB_Per_Cápita 
##              2.818299
bptest(linear_regresion_log)  # We found out that our model has Heteroestlasticidad 
## 
##  studentized Breusch-Pagan test
## 
## data:  linear_regresion_log
## BP = 11.25, df = 10, p-value = 0.3384
AIC(linear_regresion_log)
## [1] 20.51169
histogram(linear_regresion_log$residuals)

# Polynomial Regression model 

# The first step to make a polynomial regression we need to determine which variables might behave in a polynomial form. 

ggplot(bd1, aes(x =bd1$Exportaciones , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Exportaciones",
       y = "IED Flujos")
## Warning: Use of `bd1$Exportaciones` is discouraged.
## ℹ Use `Exportaciones` instead.
## Warning: Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Empleo , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Empleo",
       y = "IED Flujos")
## Warning: Use of `bd1$Empleo` is discouraged.
## ℹ Use `Empleo` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Educación , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Educacion",
       y = "IED Flujos")
## Warning: Use of `bd1$Educación` is discouraged.
## ℹ Use `Educación` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Salario_Diario , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Salario Diario",
       y = "IED Flujos")
## Warning: Use of `bd1$Salario_Diario` is discouraged.
## ℹ Use `Salario_Diario` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Innovación , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Innovacion",
       y = "IED Flujos")
## Warning: Use of `bd1$Innovación` is discouraged.
## ℹ Use `Innovación` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Inseguridad_Robo , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Inseguridad Robo",
       y = "IED Flujos")
## Warning: Use of `bd1$Inseguridad_Robo` is discouraged.
## ℹ Use `Inseguridad_Robo` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Inseguridad_Homicidio , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Inseguridad Homicidio",
       y = "IED Flujos")
## Warning: Use of `bd1$Inseguridad_Homicidio` is discouraged.
## ℹ Use `Inseguridad_Homicidio` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Tipo_de_Cambio , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Tipo de cambio",
       y = "IED Flujos")
## Warning: Use of `bd1$Tipo_de_Cambio` is discouraged.
## ℹ Use `Tipo_de_Cambio` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Densidad_Carretera , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Densidad Carretera",
       y = "IED Flujos")
## Warning: Use of `bd1$Densidad_Carretera` is discouraged.
## ℹ Use `Densidad_Carretera` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$Densidad_Población , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "Densidad Poblacion",
       y = "IED Flujos")
## Warning: Use of `bd1$Densidad_Población` is discouraged.
## ℹ Use `Densidad_Población` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$CO2_Emisiones , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "C02 Emisiones",
       y = "IED Flujos")
## Warning: Use of `bd1$CO2_Emisiones` is discouraged.
## ℹ Use `CO2_Emisiones` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$PIB_Per_Cápita , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "PIB Per Capita",
       y = "IED Flujos")
## Warning: Use of `bd1$PIB_Per_Cápita` is discouraged.
## ℹ Use `PIB_Per_Cápita` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

ggplot(bd1, aes(x =bd1$INPC , y = bd1$IED_Flujos)) +
  geom_point(color = "blue", size = 3, shape = 16) +
  labs(title = "Gráfico de Dispersión con ggplot2",
       x = "INPC",
       y = "IED Flujos")
## Warning: Use of `bd1$INPC` is discouraged.
## ℹ Use `INPC` instead.
## Use of `bd1$IED_Flujos` is discouraged.
## ℹ Use `IED_Flujos` instead.

# Analysis: we can see that there are some variables that don`t follow a linear tendency. The values that we saw that have a polinomial behavior are: Salario Diario, Tipo de cambio, Densidad Carretera, PIB Per Capita. 

# Polynomial regresion 
polynomial_regresion1 <- lm(IED_Flujos ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+I(Tipo_de_Cambio^2)+Densidad_Carretera+I(Densidad_Carretera^2)+CO2_Emisiones+PIB_Per_Cápita+I(PIB_Per_Cápita^2), data = bd2) 
summary(polynomial_regresion1)
## 
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación + 
##     Innovación + Inseguridad_Robo + Inseguridad_Homicidio + 
##     Tipo_de_Cambio + I(Tipo_de_Cambio^2) + Densidad_Carretera + 
##     I(Densidad_Carretera^2) + CO2_Emisiones + PIB_Per_Cápita + 
##     I(PIB_Per_Cápita^2), data = bd2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8441.7 -3443.8  -979.1  2913.1 15862.4 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)              1.172e+06  1.158e+06   1.013    0.331
## Exportaciones           -1.179e-01  1.510e+00  -0.078    0.939
## Empleo                  -3.586e+03  5.007e+03  -0.716    0.488
## Educación                7.724e+03  3.336e+04   0.232    0.821
## Innovación               4.533e+03  4.205e+03   1.078    0.302
## Inseguridad_Robo        -2.786e+01  1.033e+02  -0.270    0.792
## Inseguridad_Homicidio    7.114e+02  4.855e+02   1.465    0.169
## Tipo_de_Cambio           4.904e+03  7.173e+03   0.684    0.507
## I(Tipo_de_Cambio^2)     -1.268e+02  2.434e+02  -0.521    0.612
## Densidad_Carretera      -6.165e+06  4.852e+06  -1.271    0.228
## I(Densidad_Carretera^2)  3.803e+07  2.999e+07   1.268    0.229
## CO2_Emisiones            2.039e+04  1.720e+04   1.186    0.259
## PIB_Per_Cápita          -1.219e+01  1.576e+01  -0.774    0.454
## I(PIB_Per_Cápita^2)      4.580e-05  5.668e-05   0.808    0.435
## 
## Residual standard error: 7390 on 12 degrees of freedom
## Multiple R-squared:  0.6592, Adjusted R-squared:   0.29 
## F-statistic: 1.785 on 13 and 12 DF,  p-value: 0.1622
# Validation
RMSE(polynomial_regresion1$fitted.values, bd2$IED_Flujos) # We saw that there is a significant numeber of RMSE compared to other models. 
## [1] 5020.822
vif(polynomial_regresion1) # We can see that there is multicolineality in this model. 
##           Exportaciones                  Empleo               Educación 
##               56.925237                2.007776              114.861723 
##              Innovación        Inseguridad_Robo   Inseguridad_Homicidio 
##                2.732010                2.899480                2.635145 
##          Tipo_de_Cambio     I(Tipo_de_Cambio^2)      Densidad_Carretera 
##              216.809834              225.453056             1055.541862 
## I(Densidad_Carretera^2)           CO2_Emisiones          PIB_Per_Cápita 
##              859.760629                1.749050             4451.061848 
##     I(PIB_Per_Cápita^2) 
##             4454.575920
bptest(polynomial_regresion1) # This model has Heteroestlasticity. 
## 
##  studentized Breusch-Pagan test
## 
## data:  polynomial_regresion1
## BP = 7.1632, df = 13, p-value = 0.8935
AIC(polynomial_regresion1)
## [1] 546.8949
histogram(polynomial_regresion1$residuals)

# Lasso regresion model 

### Split the Data in Training Data vs Test Data
set.seed(123)                                  ### sets the random seed for reproducibility of results
training.samples<-bd2$IED_Flujos %>%
  createDataPartition(p=0.75,list=FALSE)       ### Lets consider 75% of the data to build a predictive model

train.data<-bd1[training.samples, ]  ### training data to fit the linear regression model 
test.data<-bd1[-training.samples, ]  ### testing data to test the linear regression model         

#################################
### LASSO REGRESSION ANALYSIS ###
#################################

# The dataset is transformed to model.matrix() format. 
# Independent variables
x<-model.matrix(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,train.data)[,-1] ### OLS model specification
# x<-model.matrix(Weekly_Sales~.,train.data)[,-1] ### matrix of independent variables X's
y<-train.data$IED_Flujos ### dependent variable 

# In estimating LASSO regression it is important to define the lambda that minimizes the prediction error rate. 
# Cross-validation ensures that every data / observation from the original dataset (datains) has a chance of appearing in train and test datasets.
# Find the best lambda using cross-validation.
set.seed(123) 
cv.lasso <- cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO 
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# NO SE PUEDE USAR CON SOLO UNA VARIABLE

# Display the best lambda value
cv.lasso$lambda.min                      
## [1] 577.774
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
lassomodel
## 
## Call:  glmnet(x = x, y = y, alpha = 1, lambda = cv.lasso$lambda.min) 
## 
##   Df  %Dev Lambda
## 1  4 79.49  577.8
# Display regression coefficients
coef(lassomodel)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                  s0
## (Intercept)           -1.689578e+05
## Exportaciones          .           
## Empleo                 1.073709e+03
## Educación              6.902473e+03
## Innovación             2.561121e+03
## Inseguridad_Robo      -2.515549e+00
## Inseguridad_Homicidio  .           
## Tipo_de_Cambio         .           
## Densidad_Carretera     .           
## CO2_Emisiones          .           
## PIB_Per_Cápita         .
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ 
Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,test.data)[,-1] ### OLS model specification
# x.test<-model.matrix(Weekly_Sales~.,test.data)[,-1]
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()



# Model Accuracy
data.frame(
  RMSE = RMSE(lassopredictions, test.data$IED_Flujos),
  Rsquare = R2(lassopredictions, test.data$IED_Flujos))
##       RMSE   Rsquare
## 1 10408.73 0.6924555
### visualizing lasso regression results 
lbs_fun <- function(fit, offset_x=1, ...) {
  L <- length(fit$lambda)
  x <- log(fit$lambda[L])+ offset_x
  y <- fit$beta[, L]
  labs <- names(y)
  text(x, y, labels=labs, ...)
}

lasso<-glmnet(scale(x),y,alpha=1)

plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)

#################################
### RIDGE REGRESSION ANALYSIS ###
#################################

# Find the best lambda using cross-validation
set.seed(123)                           # x: independent variables | y: dependent variable 
cv.ridge <- cv.glmnet(x,y,alpha=0.1)      # alpha = 0 for RIDGE
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per
## fold
# Display the best lambda value
cv.ridge$lambda.min                     
## [1] 2744.894
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)

# Display regression coefficients
coef(ridgemodel)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                  s0
## (Intercept)           -1.134776e+05
## Exportaciones          6.145233e-02
## Empleo                 8.671375e+02
## Educación              1.888556e+03
## Innovación             2.059001e+03
## Inseguridad_Robo      -2.652667e+01
## Inseguridad_Homicidio  3.276834e+01
## Tipo_de_Cambio         9.590785e+01
## Densidad_Carretera     7.478123e+04
## CO2_Emisiones         -1.611175e+03
## PIB_Per_Cápita         1.118728e-01
# Make predictions on the test data
x.test<-model.matrix(log(IED_Flujos) ~ Exportaciones+Empleo+Educación+Innovación+Inseguridad_Robo+Inseguridad_Homicidio+Tipo_de_Cambio+Densidad_Carretera+CO2_Emisiones+PIB_Per_Cápita,test.data)[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()

# Model Accuracy
data.frame(
  RMSE = RMSE(ridgepredictions, test.data$IED_Flujos),
  Rsquare = R2(ridgepredictions, test.data$IED_Flujos)
)
##    RMSE   Rsquare
## 1 10699 0.5946856
### visualizing ridge regression results 
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)

# Resumen de modelos 
tab <- matrix(c(5564,28102,4336,10408,10699,0.8059,0.2445,0.4704,0.692,0.594), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Linear Regresion (Log)','Polynomial ^2', 'Lasso','Ridge')
tab <- as.table(tab)
tab
##                              RMSE         R2
## Linear Regression       5564.0000     0.8059
## Linear Regresion (Log) 28102.0000     0.2445
## Polynomial ^2           4336.0000     0.4704
## Lasso                  10408.0000     0.6920
## Ridge                  10699.0000     0.5940
# We can see that our model with the lowest RMSE is the polynomial model , but it has a R^2 really low so we are going to not choose that model. Our next model is the linear regression model. This has a low RMSE compare to the other models and offers the highest R^2. This means that this models shows a better balance in the parameters and is better suited for our analysis.
# Show the predicted values of the dependent variable
final_model <- practice_linear_regresion
colors <- c("purple","green","blue", "red")
par(mfrow=c(2,2)) 
plot(final_model, which = 1, col = colors[1], pch = 19)  
plot(final_model, which = 2, col = colors[2], pch = 19)
plot(final_model, which = 3, col = colors[3], pch = 19)
plot(final_model, which = 5, col = colors[4], pch = 19)

par(mfrow=c(1,1))

summary(final_model)
## 
## Call:
## lm(formula = IED_Flujos ~ Exportaciones + Empleo + Educación + 
##     Salario_Diario + Innovación + Inseguridad_Robo + Inseguridad_Homicidio + 
##     Tipo_de_Cambio + Densidad_Carretera + Densidad_Población + 
##     CO2_Emisiones + PIB_Per_Cápita + INPC, data = bd1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4595.1 -2152.5  -531.4  2152.8  4934.7 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           -1.027e+06  3.812e+05  -2.695  0.01950 * 
## Exportaciones         -1.083e+00  6.748e-01  -1.605  0.13451   
## Empleo                 5.757e+03  2.371e+03   2.428  0.03184 * 
## Educación              1.779e+05  4.490e+04   3.961  0.00189 **
## Salario_Diario        -1.205e+01  2.044e+02  -0.059  0.95395   
## Innovación             4.849e+03  1.502e+03   3.228  0.00725 **
## Inseguridad_Robo       4.352e+01  5.138e+01   0.847  0.41353   
## Inseguridad_Homicidio  1.385e+03  5.516e+02   2.511  0.02736 * 
## Tipo_de_Cambio        -2.582e+03  8.448e+02  -3.057  0.00996 **
## Densidad_Carretera    -1.240e+06  1.151e+06  -1.078  0.30232   
## Densidad_Población    -1.645e+04  4.287e+03  -3.836  0.00237 **
## CO2_Emisiones          9.462e+03  1.173e+04   0.807  0.43548   
## PIB_Per_Cápita        -3.341e-01  4.966e-01  -0.673  0.51384   
## INPC                   9.457e+01  9.535e+02   0.099  0.92263   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3864 on 12 degrees of freedom
## Multiple R-squared:  0.9068, Adjusted R-squared:  0.8059 
## F-statistic: 8.985 on 13 and 12 DF,  p-value: 0.0002737

Interpret the regression results of selected regression model.

After analyzing the regression model we can see that the variables with the most corelation with the foreign investment is “Tipo de cambio” , “Densidad de Población”, “Innovacion” and “Educación”. Other variables that have a significant impact are: “empleo” and “inseguridad homicidio”.

Part 5 - Conclusions

Testing our hyphotesis:

Part 6 - References