Tidying data
Data Set Issues:
Some of the main problems with the original data set are:
- The data set has observations that are irrelevant for the purpose of the EDA. E.g. the EDA focuses on Andean Countries, but the original data set covers all countries in Latin America.
- There are too many variables: Some of them are irrelevant (e.g. repeated variables). Similarly, even though there might be many that are interesting, it is necessary to narrow them down to not sacrifice quality for quantity.
- The names of the variables are not informative or are in a different language. Some of the names of the variables are in Spanish (e.g. “PAIS”), which would make it harder for English speakers to understand the EDA (given that it is written in English), and some other variables have names that are not intuitive (e.g. “Q1”: whether the surveyed person is Male or Female).
To solve these issues and others, the process of tidying the data is shown below.
Filtering by Countries:
The data set was filtered to keep only the countries of interest: Peru, Bolivia, Ecuador, and Colombia. The country variable is called “PAIS” and the countries are represented by the following numbers: Peru (11), Bolivia (10), Ecuador (9), and Colombia (8). The tables below show the frequency of each value of the variable “PAIS” before and after filtering by countries. After filtering, the data set has 49596 observations and 266 variables (checked using “str()”).
Before filtering, table of frequency for variable “PAIS”:
1 2 3 4 5 6 7 8 9 10 11 12 13
9333 9253 9426 9492 9500 9031 9375 8987 14913 18196 7500 6845 8151
14 15 16 17 21 22 23 24 25 26 27 28 29
7224 8192 7510 5920 12013 8261 7601 8695 7212 6101 7006 3429 3828
40 41
6609 7151
After filtering, table of frequency for variable “PAIS”:
8 9 10 11
8987 14913 18196 7500
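The filtering step described above can be sketched in base R as follows. This is a minimal illustration on a made-up toy data frame, not the original script; the names `survey`, `andean_codes`, and `andean` are assumptions, and only the country codes come from the text.

```r
# Toy stand-in for the survey data; only the PAIS column matters here.
survey <- data.frame(PAIS = c(1, 8, 9, 10, 11, 12, 8, 17))

# Country codes taken from the text:
# Colombia (8), Ecuador (9), Bolivia (10), Peru (11).
andean_codes <- c(8, 9, 10, 11)

# Keep only the rows whose country code is one of the Andean codes.
andean <- survey[survey$PAIS %in% andean_codes, , drop = FALSE]

# Frequency tables before and after filtering.
table(survey$PAIS)
table(andean$PAIS)
```

Comparing the two tables is a quick way to confirm that the counts of the retained countries are unchanged by the filter.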
Subsetting by Variables of Interest:
Not all of the 266 variables are relevant, and using all of them for an EDA would sacrifice specificity. Therefore, each variable in the questionnaire was analyzed and only relevant variables were selected (e.g. the variable on whether a person trusts the president was kept, while the variable on whether a person trusts the mayor was left out). Since these are not the final variables used for the EDA (further subsetting is done later), there is no specific codebook yet, but the reader can find information about each variable in the general codebook of the original survey. Below, the reader can see that the data set now has 49596 observations and 53 variables.
tibble [49,596 × 53] (S3: tbl_df/tbl/data.frame)
$ PAIS : num [1:49596] 10 10 10 10 10 10 10 10 10 10 ...
$ WAVE : num [1:49596] 2004 2004 2004 2004 2004 ...
$ YEAR : num [1:49596] 2004 2004 2004 2004 2004 ...
$ ESTRATOSEC: num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ UR : num [1:49596] 1 1 1 1 2 2 1 1 2 1 ...
$ TAMANO : num [1:49596] 4 1 3 4 5 5 4 3 5 3 ...
$ IDIOMAQ : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ Q1 : num [1:49596] 2 2 1 1 2 1 1 1 1 1 ...
$ LS3 : num [1:49596] 1 3 3 3 2 2 2 3 3 2 ...
$ A4 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SOCT2 : num [1:49596] 3 3 3 3 2 1 3 2 3 1 ...
$ IDIO2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ CP5 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ L1 : num [1:49596] 5 6 8 5 NA 7 6 6 NA 7 ...
$ PROT3 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ JC10 : num [1:49596] 2 2 2 2 NA 2 2 1 2 1 ...
$ JC13 : num [1:49596] 2 1 1 2 NA 1 2 1 2 1 ...
$ JC15A : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXT : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXTA : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1HOGAR : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PESE1 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PESE2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ AOJ12 : num [1:49596] 3 1 4 3 3 3 3 3 1 3 ...
$ B2 : num [1:49596] 3 6 2 4 4 4 3 5 4 5 ...
$ B4 : num [1:49596] 4 5 5 5 NA 6 5 3 3 5 ...
$ B6 : num [1:49596] 4 5 3 6 NA 6 2 3 4 6 ...
$ B10A : num [1:49596] 4 6 5 1 5 4 2 3 4 3 ...
$ B12 : num [1:49596] 6 2 7 6 3 6 3 3 3 3 ...
$ B13 : num [1:49596] 5 1 4 3 4 6 6 2 3 3 ...
$ B18 : num [1:49596] 2 4 6 1 5 4 5 2 3 1 ...
$ B21 : num [1:49596] 3 2 3 2 3 3 2 3 2 2 ...
$ B21A : num [1:49596] 7 5 2 1 4 6 3 6 5 5 ...
$ B47A : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ N9 : num [1:49596] 1 2 5 3 NA 5 6 3 5 5 ...
$ N11 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ N15 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ M1 : num [1:49596] 2 2 4 3 3 2 3 3 3 2 ...
$ SD2NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SD3NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SD6NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ROS4 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ING4 : num [1:49596] 6 4 4 5 7 6 6 4 3 4 ...
$ MIL7 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PN4 : num [1:49596] 3 3 4 3 NA 3 2 3 3 3 ...
$ EXC2 : num [1:49596] 0 1 0 1 0 0 0 0 0 0 ...
$ EXC7 : num [1:49596] 1 3 3 2 NA 3 2 1 4 1 ...
$ POL1 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ED : num [1:49596] 10 10 7 12 4 14 12 16 14 13 ...
$ Q10NEW_12 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ Q10NEW_14 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ Q10D : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ETID : num [1:49596] 2 2 2 1 2 3 2 2 3 2 ...
After analyzing each variable to check whether any of them could cause major problems later on, it was noticed that both Q10NEW_14 and Q10NEW_12 measure monthly household income in dollars. However, because living costs vary from country to country, these variables could lead to misleading results. Even though there are ways to address this problem, considering the number of variables in the data set, it was decided to drop both variables. Another issue noticed was that the variable “WAVE” seemed to have the same values as the variable “YEAR.”
# A tibble: 10 × 2
YEAR WAVE
<dbl> <dbl>
1 2004 2004
2 2004 2004
3 2004 2004
4 2004 2004
5 2004 2004
6 2004 2004
7 2004 2004
8 2004 2004
9 2004 2004
10 2004 2004
After checking whether both columns were equal, the hypothesis was confirmed. Finally, the variable ESTRATOSEC (size of municipality) was considered not relevant, given that other demographic information (sex, country, urbanization, etc.) was more useful. After removing these variables, the data set had 49 variables and 49596 observations.
tibble [49,596 × 49] (S3: tbl_df/tbl/data.frame)
$ PAIS : num [1:49596] 10 10 10 10 10 10 10 10 10 10 ...
$ YEAR : num [1:49596] 2004 2004 2004 2004 2004 ...
$ UR : num [1:49596] 1 1 1 1 2 2 1 1 2 1 ...
$ TAMANO : num [1:49596] 4 1 3 4 5 5 4 3 5 3 ...
$ IDIOMAQ : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ Q1 : num [1:49596] 2 2 1 1 2 1 1 1 1 1 ...
$ LS3 : num [1:49596] 1 3 3 3 2 2 2 3 3 2 ...
$ A4 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SOCT2 : num [1:49596] 3 3 3 3 2 1 3 2 3 1 ...
$ IDIO2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ CP5 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ L1 : num [1:49596] 5 6 8 5 NA 7 6 6 NA 7 ...
$ PROT3 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ JC10 : num [1:49596] 2 2 2 2 NA 2 2 1 2 1 ...
$ JC13 : num [1:49596] 2 1 1 2 NA 1 2 1 2 1 ...
$ JC15A : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXT : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXTA : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1HOGAR: num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PESE1 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PESE2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ AOJ12 : num [1:49596] 3 1 4 3 3 3 3 3 1 3 ...
$ B2 : num [1:49596] 3 6 2 4 4 4 3 5 4 5 ...
$ B4 : num [1:49596] 4 5 5 5 NA 6 5 3 3 5 ...
$ B6 : num [1:49596] 4 5 3 6 NA 6 2 3 4 6 ...
$ B10A : num [1:49596] 4 6 5 1 5 4 2 3 4 3 ...
$ B12 : num [1:49596] 6 2 7 6 3 6 3 3 3 3 ...
$ B13 : num [1:49596] 5 1 4 3 4 6 6 2 3 3 ...
$ B18 : num [1:49596] 2 4 6 1 5 4 5 2 3 1 ...
$ B21 : num [1:49596] 3 2 3 2 3 3 2 3 2 2 ...
$ B21A : num [1:49596] 7 5 2 1 4 6 3 6 5 5 ...
$ B47A : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ N9 : num [1:49596] 1 2 5 3 NA 5 6 3 5 5 ...
$ N11 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ N15 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ M1 : num [1:49596] 2 2 4 3 3 2 3 3 3 2 ...
$ SD2NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SD3NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ SD6NEW2 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ROS4 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ING4 : num [1:49596] 6 4 4 5 7 6 6 4 3 4 ...
$ MIL7 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ PN4 : num [1:49596] 3 3 4 3 NA 3 2 3 3 3 ...
$ EXC2 : num [1:49596] 0 1 0 1 0 0 0 0 0 0 ...
$ EXC7 : num [1:49596] 1 3 3 2 NA 3 2 1 4 1 ...
$ POL1 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ED : num [1:49596] 10 10 7 12 4 14 12 16 14 13 ...
$ Q10D : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
$ ETID : num [1:49596] 2 2 2 1 2 3 2 2 3 2 ...
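The comparison of WAVE with YEAR and the removal of the four variables could look roughly like the sketch below. The toy data frame and the names `same_year`, `drop_cols`, and `trimmed` are assumptions made for illustration; the original script may do this differently.

```r
# Toy stand-in with a few of the columns mentioned in the text.
survey <- data.frame(
  PAIS       = c(10, 10, 9),
  WAVE       = c(2004, 2004, 2006),
  YEAR       = c(2004, 2004, 2006),
  ESTRATOSEC = c(NA, NA, NA),
  Q10NEW_12  = c(NA, NA, NA),
  Q10NEW_14  = c(NA, NA, NA),
  Q1         = c(2, 1, 1)
)

# TRUE only if the two columns are exactly the same vector.
same_year <- identical(survey$WAVE, survey$YEAR)

# Drop the redundant or unwanted columns by name.
drop_cols <- c("WAVE", "ESTRATOSEC", "Q10NEW_12", "Q10NEW_14")
trimmed <- survey[, !(names(survey) %in% drop_cols), drop = FALSE]

names(trimmed)
```

Checking the whole column with `identical()` is stricter than looking at the first ten rows, which only suggests equality.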
Accounting for Irregularities with Data Collection:
Given that the data was collected over a long time span and in different countries, irregularities such as the following were expected:
- Observations not collected in every country for the same years.
- Uneven number of observations collected among countries.
The values below show the unique values of the variable “YEAR” for each country.
Peru:
[1] 2006 2008 2010 2012 2014
Colombia:
[1] 2004 2006 2008 2010 2012 2014
Ecuador:
[1] 2004 2006 2008 2010 2012 2014
Bolivia:
[1] 2004 2006 2008 2010 2012 2014
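One way to list the survey years per country and keep only the years present in every country is sketched below in base R. The toy data and the names `years_by_country`, `common_years`, and `balanced` are assumptions for illustration only.

```r
# Toy stand-in: country 11 (Peru) has no 2004 rows, the others do.
survey <- data.frame(
  PAIS = c(11, 11, 8, 8, 8, 10, 10, 10),
  YEAR = c(2006, 2008, 2004, 2006, 2008, 2004, 2006, 2008)
)

# Sorted unique years for each country code.
years_by_country <- lapply(
  split(survey$YEAR, survey$PAIS),
  function(y) sort(unique(y))
)

# Years that appear in every country.
common_years <- Reduce(intersect, years_by_country)

# Keep only the rows from those shared years.
balanced <- survey[survey$YEAR %in% common_years, , drop = FALSE]

years_by_country
common_years
```

Deriving the shared years with `intersect()` avoids hard-coding the year to exclude.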
The values above show that, in Peru, unlike the other countries, no data was collected during 2004. Therefore, I decided to work only with the years 2006, 2008, 2010, 2012, and 2014. After filtering out 2004, the data set had 42045 observations and 49 variables, as shown below.
tibble [42,045 × 49] (S3: tbl_df/tbl/data.frame)
$ PAIS : num [1:42045] 10 10 10 10 10 10 10 10 10 10 ...
$ YEAR : num [1:42045] 2006 2006 2006 2006 2006 ...
$ UR : num [1:42045] 2 1 2 1 1 2 2 2 2 2 ...
$ TAMANO : num [1:42045] 5 3 5 1 4 5 5 5 5 5 ...
$ IDIOMAQ : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ Q1 : num [1:42045] 2 2 1 2 1 1 1 2 2 1 ...
$ LS3 : num [1:42045] 1 1 2 2 2 1 2 2 3 1 ...
$ A4 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ SOCT2 : num [1:42045] NA 3 1 1 2 NA 2 3 2 2 ...
$ IDIO2 : num [1:42045] 2 3 1 1 3 2 2 2 1 3 ...
$ CP5 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ L1 : num [1:42045] 5 6 5 3 5 9 NA 5 5 3 ...
$ PROT3 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ JC10 : num [1:42045] 1 1 2 1 1 2 2 1 1 2 ...
$ JC13 : num [1:42045] 2 1 2 2 1 2 1 1 1 2 ...
$ JC15A : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXT : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXTA : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1HOGAR: num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ PESE1 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ PESE2 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ AOJ12 : num [1:42045] 3 3 2 4 3 2 3 3 1 3 ...
$ B2 : num [1:42045] 5 4 6 6 3 5 6 3 5 1 ...
$ B4 : num [1:42045] 2 6 6 4 5 5 6 4 3 1 ...
$ B6 : num [1:42045] 2 4 7 5 5 4 7 5 5 NA ...
$ B10A : num [1:42045] 5 6 6 5 3 4 7 3 6 4 ...
$ B12 : num [1:42045] 2 5 6 5 4 4 1 6 5 3 ...
$ B13 : num [1:42045] NA 6 6 6 3 4 6 4 4 3 ...
$ B18 : num [1:42045] 4 4 6 6 2 4 1 5 4 1 ...
$ B21 : num [1:42045] 6 4 4 3 4 4 3 6 4 3 ...
$ B21A : num [1:42045] 2 5 7 5 5 6 3 1 6 4 ...
$ B47A : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ N9 : num [1:42045] 7 6 6 5 3 5 4 2 5 NA ...
$ N11 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ N15 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ M1 : num [1:42045] 2 3 1 2 3 2 3 3 2 2 ...
$ SD2NEW2 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ SD3NEW2 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ SD6NEW2 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ ROS4 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ ING4 : num [1:42045] 3 4 6 4 4 4 5 NA 6 4 ...
$ MIL7 : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ PN4 : num [1:42045] 2 2 2 2 2 1 2 2 2 3 ...
$ EXC2 : num [1:42045] 0 0 0 0 0 0 0 0 0 0 ...
$ EXC7 : num [1:42045] 4 2 1 2 2 3 3 2 3 NA ...
$ POL1 : num [1:42045] 3 4 1 3 4 3 3 1 3 3 ...
$ ED : num [1:42045] 4 5 12 17 11 0 9 9 12 3 ...
$ Q10D : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
$ ETID : num [1:42045] 3 2 2 2 3 2 3 3 4 NA ...
The next step was to check whether the data had been unevenly recorded across countries, because this could distort comparisons based on raw counts. For example, a later visualization could show twice as many people who trust their president in Bolivia as in Ecuador, but this could simply be a consequence of twice as many observations having been recorded in Bolivia. The following plot shows how many observations were recorded per country in a given year. It can be seen that Bolivia had about twice as many observations as Peru and Colombia in every year. Ecuador showed the same pattern in 2006, 2008, and 2010.

To account for this, a sample of every country-year combination was taken, so that further analysis would be more comparable. The size of the sample was set to the size of the smallest country-year combination, in order to keep as many observations as possible while still having comparable groups. The smallest country-year combination had 1489 observations, which corresponds to Ecuador in 2014. This can be verified below.
## # A tibble: 20 × 3
## # Groups: PAIS, YEAR [20]
## PAIS YEAR n
## <dbl> <dbl> <int>
## 1 8 2006 1491
## 2 8 2008 1503
## 3 8 2010 1506
## 4 8 2012 1512
## 5 8 2014 1496
## 6 9 2006 2925
## 7 9 2008 3000
## 8 9 2010 2999
## 9 9 2012 1500
## 10 9 2014 1489
## 11 10 2006 3008
## 12 10 2008 3003
## 13 10 2010 3018
## 14 10 2012 3029
## 15 10 2014 3066
## 16 11 2006 1500
## 17 11 2008 1500
## 18 11 2010 1500
## 19 11 2012 1500
## 20 11 2014 1500
The results of taking a sample of size 1489 per country-year combination are shown below.
# A tibble: 20 × 3
# Groups: PAIS, YEAR [20]
PAIS YEAR n
<dbl> <dbl> <int>
1 8 2006 1489
2 8 2008 1489
3 8 2010 1489
4 8 2012 1489
5 8 2014 1489
6 9 2006 1489
7 9 2008 1489
8 9 2010 1489
9 9 2012 1489
10 9 2014 1489
11 10 2006 1489
12 10 2008 1489
13 10 2010 1489
14 10 2012 1489
15 10 2014 1489
16 11 2006 1489
17 11 2008 1489
18 11 2010 1489
19 11 2012 1489
20 11 2014 1489
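The equal-size sampling per country-year combination can be sketched in base R as follows. The toy data, the arbitrary seed, and the names `group`, `n_min`, `keep`, and `balanced` are assumptions; sampling without replacement is also assumed, since the text does not state how the sample was drawn.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Toy stand-in with unequal group sizes:
# (8, 2006) = 4 rows, (8, 2008) = 3, (9, 2006) = 3, (9, 2008) = 2.
survey <- data.frame(
  PAIS = rep(c(8, 9), times = c(7, 5)),
  YEAR = c(rep(2006, 4), rep(2008, 3), rep(2006, 3), rep(2008, 2))
)

# One group label per row, and the size of the smallest group.
group <- interaction(survey$PAIS, survey$YEAR, drop = TRUE)
n_min <- min(table(group))

# For each group, draw n_min row indices without replacement.
row_ids <- split(seq_len(nrow(survey)), group)
keep <- unlist(lapply(row_ids, function(ids) {
  ids[sample.int(length(ids), n_min)]
}))

balanced <- survey[sort(keep), , drop = FALSE]

# Every country-year cell now has the same count.
table(balanced$PAIS, balanced$YEAR)
```

Indexing with `sample.int()` rather than calling `sample()` on the row ids avoids the surprising behaviour of `sample()` when a group has a single row.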

The data set now has 29,780 observations and 49 variables.
tibble [29,780 × 49] (S3: tbl_df/tbl/data.frame)
$ PAIS : num [1:29780] 8 8 8 8 8 8 8 8 8 8 ...
$ YEAR : num [1:29780] 2006 2006 2006 2006 2006 ...
$ UR : num [1:29780] 1 2 2 1 1 2 2 1 1 1 ...
$ TAMANO : num [1:29780] 3 4 4 3 4 4 4 3 4 4 ...
$ IDIOMAQ : num [1:29780] 1 1 1 1 1 1 1 1 1 1 ...
$ Q1 : num [1:29780] 1 1 2 1 1 1 1 1 1 1 ...
$ LS3 : num [1:29780] 2 1 1 2 1 3 3 2 1 2 ...
$ A4 : num [1:29780] 57 21 57 27 3 4 4 12 4 55 ...
$ SOCT2 : num [1:29780] 3 3 1 2 2 2 2 3 2 3 ...
$ IDIO2 : num [1:29780] 3 2 2 2 1 2 2 3 3 3 ...
$ CP5 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ L1 : num [1:29780] 5 2 NA 4 9 NA 4 9 7 NA ...
$ PROT3 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ JC10 : num [1:29780] 2 2 1 1 1 2 1 2 2 2 ...
$ JC13 : num [1:29780] 2 2 1 1 1 2 2 2 2 2 ...
$ JC15A : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXT : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1EXTA : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ VIC1HOGAR: num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ PESE1 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ PESE2 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ AOJ12 : num [1:29780] 2 3 3 3 3 3 4 4 2 2 ...
$ B2 : num [1:29780] 7 7 7 2 7 6 4 7 5 NA ...
$ B4 : num [1:29780] 6 5 7 3 5 4 5 7 2 NA ...
$ B6 : num [1:29780] 5 7 7 3 6 5 4 7 4 NA ...
$ B10A : num [1:29780] 5 6 7 2 4 6 6 7 3 NA ...
$ B12 : num [1:29780] 5 1 7 3 7 3 5 7 2 6 ...
$ B13 : num [1:29780] 3 1 7 2 5 4 6 7 3 6 ...
$ B18 : num [1:29780] 4 1 7 4 6 4 5 7 5 6 ...
$ B21 : num [1:29780] 2 7 7 2 6 5 4 7 4 5 ...
$ B21A : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ B47A : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ N9 : num [1:29780] 5 1 7 2 1 6 4 3 4 NA ...
$ N11 : num [1:29780] 5 1 7 4 4 3 5 4 5 NA ...
$ N15 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ M1 : num [1:29780] 3 4 2 3 3 3 3 3 2 2 ...
$ SD2NEW2 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ SD3NEW2 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ SD6NEW2 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ ROS4 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ ING4 : num [1:29780] 7 1 NA 5 5 6 5 4 6 6 ...
$ MIL7 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ PN4 : num [1:29780] 2 3 NA 3 3 3 3 2 2 2 ...
$ EXC2 : num [1:29780] 0 0 0 0 0 0 0 0 0 0 ...
$ EXC7 : num [1:29780] 1 4 1 1 2 3 3 1 2 NA ...
$ POL1 : num [1:29780] 2 4 3 3 1 3 3 2 2 1 ...
$ ED : num [1:29780] 16 7 5 17 5 5 7 11 11 2 ...
$ Q10D : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ ETID : num [1:29780] 2 2 2 1 1 2 3 2 2 1 ...
Adjusting values in variables:
The sample had 49 variables and all of them were numeric due to how the data was collected (numbers represented categorical variables). Although for some variables the numeric type made sense (rankings, extent of agreement from 1 to 7, etc.), there were others where categorical values were more appropriate. Thus, most variables were recoded after careful consideration. Below, the reader can see the results for the first ten observations before and after recoding (personal judgment was used to determine where shifting to a categorical variable would be better; however, this was subject to change throughout the EDA).
Before:
# A tibble: 10 × 49
PAIS YEAR UR TAMANO IDIOMAQ Q1 LS3 A4 SOCT2 IDIO2 CP5 L1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 8 2006 1 3 1 1 2 57 3 3 NA 5
2 8 2006 2 4 1 1 1 21 3 2 NA 2
3 8 2006 2 4 1 2 1 57 1 2 NA NA
4 8 2006 1 3 1 1 2 27 2 2 NA 4
5 8 2006 1 4 1 1 1 3 2 1 NA 9
6 8 2006 2 4 1 1 3 4 2 2 NA NA
7 8 2006 2 4 1 1 3 4 2 2 NA 4
8 8 2006 1 3 1 1 2 12 3 3 NA 9
9 8 2006 1 4 1 1 1 4 2 3 NA 7
10 8 2006 1 4 1 1 2 55 3 3 NA NA
# … with 37 more variables: PROT3 <dbl>, JC10 <dbl>, JC13 <dbl>, JC15A <dbl>,
# VIC1EXT <dbl>, VIC1EXTA <dbl>, VIC1HOGAR <dbl>, PESE1 <dbl>, PESE2 <dbl>,
# AOJ12 <dbl>, B2 <dbl>, B4 <dbl>, B6 <dbl>, B10A <dbl>, B12 <dbl>,
# B13 <dbl>, B18 <dbl>, B21 <dbl>, B21A <dbl>, B47A <dbl>, N9 <dbl>,
# N11 <dbl>, N15 <dbl>, M1 <dbl>, SD2NEW2 <dbl>, SD3NEW2 <dbl>,
# SD6NEW2 <dbl>, ROS4 <dbl>, ING4 <dbl>, MIL7 <dbl>, PN4 <dbl>, EXC2 <dbl>,
# EXC7 <dbl>, POL1 <dbl>, ED <dbl>, Q10D <dbl>, ETID <dbl>
After:
# A tibble: 10 × 49
PAIS YEAR UR TAMANO IDIOMAQ Q1 LS3 A4 SOCT2 IDIO2 CP5 L1
<chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Colombia 2006 Urban 3 Spanish Male Some… Viol… Worse Worse No a… 5
2 Colombia 2006 Rural 4 Spanish Male Very… Educ… Worse Same No a… 2
3 Colombia 2006 Rural 4 Spanish Fema… Very… Viol… Bett… Same No a… NA
4 Colombia 2006 Urban 3 Spanish Male Some… Secu… Same Same No a… 4
5 Colombia 2006 Urban 4 Spanish Male Very… Unem… Same Bett… No a… 9
6 Colombia 2006 Rural 4 Spanish Male Some… Pove… Same Same No a… NA
7 Colombia 2006 Rural 4 Spanish Male Some… Pove… Same Same No a… 4
8 Colombia 2006 Urban 3 Spanish Male Some… Drug… Worse Worse No a… 9
9 Colombia 2006 Urban 4 Spanish Male Very… Pove… Same Worse No a… 7
10 Colombia 2006 Urban 4 Spanish Male Some… Hous… Worse Worse No a… NA
# … with 37 more variables: PROT3 <chr>, JC10 <chr>, JC13 <chr>, JC15A <chr>,
# VIC1EXT <chr>, VIC1EXTA <dbl>, VIC1HOGAR <chr>, PESE1 <chr>, PESE2 <chr>,
# AOJ12 <chr>, B2 <dbl>, B4 <dbl>, B6 <dbl>, B10A <dbl>, B12 <dbl>,
# B13 <dbl>, B18 <dbl>, B21 <dbl>, B21A <dbl>, B47A <dbl>, N9 <dbl>,
# N11 <dbl>, N15 <dbl>, M1 <dbl>, SD2NEW2 <chr>, SD3NEW2 <chr>,
# SD6NEW2 <chr>, ROS4 <dbl>, ING4 <dbl>, MIL7 <dbl>, PN4 <chr>, EXC2 <chr>,
# EXC7 <chr>, POL1 <dbl>, ED <dbl>, Q10D <chr>, ETID <chr>
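A minimal sketch of this kind of recoding, using lookup vectors in base R, is shown below. The code-to-label mappings for PAIS, UR, and Q1 are read off the text and the before/after output above; the toy data frame and the name `country_labels` are assumptions. Here missing values stay NA, whereas the output above suggests that NAs were turned into “No answer” for some variables.

```r
# Toy stand-in with three of the coded variables.
survey <- data.frame(
  PAIS = c(8, 9, 10, 11, 8),
  UR   = c(1, 2, 2, 1, NA),
  Q1   = c(1, 2, 1, 1, 2)
)

# Named lookup vector: numeric code (as text) -> country label.
country_labels <- c("8" = "Colombia", "9" = "Ecuador",
                    "10" = "Bolivia", "11" = "Peru")
survey$PAIS <- unname(country_labels[as.character(survey$PAIS)])

# For codes 1 and 2, positional indexing is enough; NA codes stay NA.
survey$UR <- c("Urban", "Rural")[survey$UR]
survey$Q1 <- c("Male", "Female")[survey$Q1]

str(survey)
```

Keeping the mapping in a named vector makes it easy to compare against the codebook in one place.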
Renaming variables:
The names of the variables were not intuitive for a person looking at the data set, which makes manipulation harder because one constantly has to refer back to the codebook. Therefore, all of the variables were renamed.
Before:
[1] "PAIS" "YEAR" "UR" "TAMANO" "IDIOMAQ" "Q1"
[7] "LS3" "A4" "SOCT2" "IDIO2" "CP5" "L1"
[13] "PROT3" "JC10" "JC13" "JC15A" "VIC1EXT" "VIC1EXTA"
[19] "VIC1HOGAR" "PESE1" "PESE2" "AOJ12" "B2" "B4"
[25] "B6" "B10A" "B12" "B13" "B18" "B21"
[31] "B21A" "B47A" "N9" "N11" "N15" "M1"
[37] "SD2NEW2" "SD3NEW2" "SD6NEW2" "ROS4" "ING4" "MIL7"
[43] "PN4" "EXC2" "EXC7" "POL1" "ED" "Q10D"
[49] "ETID"
After:
[1] "country" "year"
[3] "urban_rural" "size_place"
[5] "language_form" "sex"
[7] "life_satisfaction" "country_main_problem"
[9] "economy_compared_12" "personal_economy_12"
[11] "times_solving_community_problem_12" "left_right"
[13] "demonstration_participation_12" "military_takeover_crime"
[15] "military_takeover_corruption" "close_congress_difficult_times"
[17] "victim_crime_12" "times_victim_crime_12"
[19] "household_victim_crime_12" "violence_neighborhoods_compared"
[21] "violece_neighborhood_12" "trust_judicial_punishment"
[23] "respect_political_institutions" "pride_living_political_system"
[25] "should_suppot_political_system" "trust_justice_system"
[27] "trust_armed_forces" "trust_national_congress"
[29] "trust_national_police" "trust_political_parties"
[31] "trust_president" "trust_elections"
[33] "administration_combats_corruption" "administration_imp_safety"
[35] "administration_good_economy_mgm" "rate_president_performance"
[37] "satisf_road_streets_highw" "satisf_public_schools"
[39] "satisf_health_services" "strong_pol_inequality"
[41] "democracy_better" "aaff_should_combate_crime"
[43] "satisf_democracy_country" "police_bribe_12"
[45] "freq_corrup_public_off" "interest_politics"
[47] "schooling_completed" "salary_satisf"
[49] "race_id"
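Renaming through an explicit old-to-new mapping could be sketched as below. The five name pairs are taken from the before and after listings above; the toy data frame and the name `new_names` are assumptions, and the original script may rename the columns in another way.

```r
# Toy stand-in with a handful of the original column names.
survey <- data.frame(
  PAIS = "Colombia", YEAR = 2006, UR = "Urban", Q1 = "Male", B21A = 5
)

# Old name -> new name, following the listings above.
new_names <- c(
  PAIS = "country",
  YEAR = "year",
  UR   = "urban_rural",
  Q1   = "sex",
  B21A = "trust_president"
)

# Rename only the columns that appear in the mapping.
hits <- names(survey) %in% names(new_names)
names(survey)[hits] <- unname(new_names[names(survey)[hits]])

names(survey)
```

Matching by name rather than by position keeps the renaming correct even if the column order changes.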
Final subsetting:
The variables in the data set revolved around the following topics:
Variables not closely connected with the topics above were removed (with the exception of demographic information: sex, country, race, etc.). Those variables were: language_form (irrelevant because over 95% of observations are “Spanish”), times_solving_community_problem_12, demonstration_participation_12, and satisf_road_streets_highw (little or no relevance to the topics above). Thus, the final data set had 29,780 observations and 45 variables. It is important to highlight that “final data set” means the clean version of the original data used as the starting point for the EDA, not that no further data manipulations were performed afterwards.
tibble [29,780 × 45] (S3: tbl_df/tbl/data.frame)
$ country : chr [1:29780] "Colombia" "Colombia" "Colombia" "Colombia" ...
$ year : num [1:29780] 2006 2006 2006 2006 2006 ...
$ urban_rural : chr [1:29780] "Urban" "Rural" "Rural" "Urban" ...
$ size_place : num [1:29780] 3 4 4 3 4 4 4 3 4 4 ...
$ sex : chr [1:29780] "Male" "Male" "Female" "Male" ...
$ life_satisfaction : chr [1:29780] "Somewhat satisfied" "Very satisfied" "Very satisfied" "Somewhat satisfied" ...
$ country_main_problem : chr [1:29780] "Violence" "Education (quality, lack of)" "Violence" "Security (lack of)" ...
$ economy_compared_12 : chr [1:29780] "Worse" "Worse" "Better" "Same" ...
$ personal_economy_12 : chr [1:29780] "Worse" "Same" "Same" "Same" ...
$ left_right : num [1:29780] 5 2 NA 4 9 NA 4 9 7 NA ...
$ military_takeover_crime : chr [1:29780] "Not justified" "Not justified" "Justified" "Justified" ...
$ military_takeover_corruption : chr [1:29780] "Not justified" "Not justified" "Justified" "Justified" ...
$ close_congress_difficult_times : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ victim_crime_12 : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ times_victim_crime_12 : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ household_victim_crime_12 : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ violence_neighborhoods_compared : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ violece_neighborhood_12 : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ trust_judicial_punishment : chr [1:29780] "Some" "Little" "Little" "Little" ...
$ respect_political_institutions : num [1:29780] 7 7 7 2 7 6 4 7 5 NA ...
$ pride_living_political_system : num [1:29780] 6 5 7 3 5 4 5 7 2 NA ...
$ should_suppot_political_system : num [1:29780] 5 7 7 3 6 5 4 7 4 NA ...
$ trust_justice_system : num [1:29780] 5 6 7 2 4 6 6 7 3 NA ...
$ trust_armed_forces : num [1:29780] 5 1 7 3 7 3 5 7 2 6 ...
$ trust_national_congress : num [1:29780] 3 1 7 2 5 4 6 7 3 6 ...
$ trust_national_police : num [1:29780] 4 1 7 4 6 4 5 7 5 6 ...
$ trust_political_parties : num [1:29780] 2 7 7 2 6 5 4 7 4 5 ...
$ trust_president : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ trust_elections : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ administration_combats_corruption: num [1:29780] 5 1 7 2 1 6 4 3 4 NA ...
$ administration_imp_safety : num [1:29780] 5 1 7 4 4 3 5 4 5 NA ...
$ administration_good_economy_mgm : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ rate_president_performance : num [1:29780] 3 4 2 3 3 3 3 3 2 2 ...
$ satisf_public_schools : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ satisf_health_services : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ strong_pol_inequality : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ democracy_better : num [1:29780] 7 1 NA 5 5 6 5 4 6 6 ...
$ aaff_should_combate_crime : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
$ satisf_democracy_country : chr [1:29780] "Somewhat satisfied" "Dissatisfied" "No answer" "Dissatisfied" ...
$ police_bribe_12 : chr [1:29780] "No" "No" "No" "No" ...
$ freq_corrup_public_off : chr [1:29780] "Very common" "Very uncommon" "Very common" "Very common" ...
$ interest_politics : num [1:29780] 2 4 3 3 1 3 3 2 2 1 ...
$ schooling_completed : num [1:29780] 16 7 5 17 5 5 7 11 11 2 ...
$ salary_satisf : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
$ race_id : chr [1:29780] "Mestizo" "Mestizo" "Mestizo" "White" ...
Other: For code and comments regarding the tidying data process, check the R script “Processing data.”
Bivariate
For the bivariate analysis, one variable was selected for each of the main topics and other variables were plotted against that main variable. The selected variables were:
Big Indicators
The plot below shows that:
overall, most people considered economic problems and unemployment to be the main problems facing the country regardless of race, since these are the two columns with the lightest colors across rows.
the lightest square is the intersection of race: Black and issue: unemployment. This means that Black people are proportionally more worried about unemployment than any other race. The hypothesis is that this is a result of systemic racism, which makes it harder for Black people to obtain jobs and therefore makes unemployment their main worry.

The plot below shows that:
the issue that worries the greatest proportion of people in Colombia is violence, while in Ecuador and Bolivia it is the economy. In the case of Peru, it seems to be unemployment.
even though there are common patterns when looking at particular issues across countries (e.g. economy), the graph is heterogeneous.

The plot below shows that:
the columns with the lightest colors are the ones that correspond to the economy and unemployment. This means that, across all educational levels, people are highly worried about these issues.
the column for poverty gets darker as the educational level increases. This means that, at higher educational levels, a smaller proportion of people are worried about poverty. This makes sense given that their education makes them more employable, so they are at lower risk of losing their jobs and, thereby, their income.
the column for corruption shows the opposite trend: it gets lighter at higher educational levels. This means that a greater proportion of people with higher educational backgrounds are worried about corruption compared to people with lower educational backgrounds. The hypothesis is that people with higher education have had the privilege of learning more about politics, so they have a broader understanding of how corruption can negatively impact a country and, therefore, are more worried about it.

Crime
The app below shows different groups within the people who answered “Yes” to having been victims of crime during the past year (the picture below is for reference; the app itself is available at https://xamanthalc.shinyapps.io/EDA1_pie_chart/?_ga=2.239497859.1658449803.1639090319-538650606.1637013074). For the questions where the legend goes from 1 to 7, the values represent extent of agreement, where 1 is “A little” and 7 is “A lot.” Some highlights are that:
most of the victims of crime were from Peru.
most of the victims were male.
the pie chart that shows the breakdown by year has only 2010, 2012, and 2014 as options. Since the three years appear in roughly even proportions, it is likely that crime incidence remained about the same during that period. Moreover, regarding the absence of the years 2006 and 2008, the hypothesis was that the question was not included in the survey until 2010. The process to test the hypothesis is shown below. As can be seen, for the answers “Yes” and “No,” the only years present were 2010, 2012, and 2014. Even though 2006 and 2008 appear under “No answer” (5956 observations each, that is, every observation sampled in those years), it is likely that these were NA values that became “No answer” when recoding. Thus, the tables support the hypothesis.
year
victim_crime_12 2010 2012 2014
Yes 1484 1531 1467
year
victim_crime_12 2010 2012 2014
No 4462 4409 4469
year
victim_crime_12 2006 2008 2010 2012 2014
No answer 5956 5956 10 16 20
most of the victims have little to no trust that the justice system in their countries would punish the guilty. I think that this perception could also be shaped by past experiences in which they were victims of crime and did not find justice.
about half of the people think that it is not justified for the military to take over when crime is too high; however, most people think that the AAFF (Armed Forces) should combat crime. This is interesting because it shows that even though people reject the military being in power (as in a military coup d’état), they still think that it would be beneficial for the armed forces to take on some of the national police’s duties.
about half of the people are in the lower end (1, 2, 3) of agreement regarding to what extent they trust the national police.
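The check on which years contain substantive answers to the crime-victim question can be sketched with a cross-tabulation, as below. The toy data and the names `counts`, `answered`, and `years_asked` are assumptions for illustration; the variable names `year` and `victim_crime_12` follow the renamed data set.

```r
# Toy stand-in: "Yes"/"No" only occur from 2010 onwards.
survey <- data.frame(
  year = c(2006, 2008, 2010, 2010, 2012, 2014, 2014),
  victim_crime_12 = c("No answer", "No answer", "Yes", "No",
                      "No", "Yes", "No answer")
)

# Answers (rows) by year (columns).
counts <- table(survey$victim_crime_12, survey$year)

# Years in which at least one "Yes" or "No" was recorded.
answered <- survey$victim_crime_12 %in% c("Yes", "No")
years_asked <- sort(unique(survey$year[answered]))

counts
years_asked
```

A single two-way table shows all answers and years at once, which makes years that hold only “No answer” easy to spot.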

Corruption
The boxplot below shows that:
- As the perceived frequency of corruption among public officials increases, the median extent of agreement that the current administration combats corruption decreases.

The graph below shows that the lightest squares are both the ones where two 7’s intersect and where 1’s intersect. This means that:
Most of the people who trust political parties very little (1) highly disagree with the view that the current administration combats corruption (1).
Most of the people who trust political parties a lot (7) highly agree with the view that the current administration combats corruption (7).

The hypothesis was that if a person perceives that the current administration is intentionally combating corruption, their trust in political parties increases because they deem the parties to be more honest. However, this hypothesis implies causation, which cannot be proved with these data. Correlation, on the other hand, can be measured. The correlation coefficient was:
[1] 0.3043626
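A correlation of this kind can be computed with `cor()`, as in the sketch below. The toy values are made up, and it is an assumption that the reported figure is a Pearson coefficient on complete pairs (the default method of `cor()`); since both variables are ordinal 1 to 7 scales, a Spearman rank correlation is shown alongside as an alternative.

```r
# Toy stand-in with some missing answers.
survey <- data.frame(
  trust_political_parties           = c(1, 2, 7, 5, NA, 6, 3),
  administration_combats_corruption = c(1, 3, 7, 4, 5, NA, 2)
)

# Pearson correlation, using only rows where both values are present.
r_pearson <- cor(
  survey$trust_political_parties,
  survey$administration_combats_corruption,
  use = "complete.obs"
)

# Rank-based alternative that does not assume equal spacing of the scale.
r_spearman <- cor(
  survey$trust_political_parties,
  survey$administration_combats_corruption,
  use = "complete.obs",
  method = "spearman"
)

c(pearson = r_pearson, spearman = r_spearman)
```

Without `use = "complete.obs"`, a single NA in either variable would make the result NA.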
The boxplot below has a step-like shape. The main takeaways from this are that:
The degree of respect for political institutions and the extent of agreement on whether the current administration combats corruption have a positive relationship.
More respect for political institutions corresponds not only to a higher median of agreement on whether the current administration combats corruption, but also to a box (the range between the first and third quartiles) that sits higher. This means that the middle 50% of observations in each group show higher agreement that the current administration combats corruption than the middle 50% of observations in a group with a lower degree of respect for political institutions.
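The numbers behind a grouped boxplot can be inspected directly by computing the quartiles per group, as sketched below. The toy values and the names `q` and `iqr` are assumptions; the point is only to separate the position of the box (its quartiles) from its width (the IQR).

```r
# Toy stand-in with two levels of respect for political institutions.
survey <- data.frame(
  respect_political_institutions    = c(1, 1, 1, 1, 7, 7, 7, 7),
  administration_combats_corruption = c(1, 2, 2, 3, 4, 5, 6, 7)
)

# Quartiles of agreement within each level of respect:
# rows are the quartiles, columns are the group levels.
q <- sapply(
  split(survey$administration_combats_corruption,
        survey$respect_political_institutions),
  quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE
)

# Width of each box: third quartile minus first quartile.
iqr <- q["75%", ] - q["25%", ]

q
iqr
```

A box can move up without getting wider, so the quartiles and the IQR answer different questions.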

As before, since causation cannot be tested, a correlation coefficient between both variables was calculated. The correlation was:
[1] 0.2649056
Democracy
The first plot is a map of the Andean Region showing the median and mean agreement with the statement that democracy is better than other forms of government despite its issues. Before making the plot, both metrics had to be computed per country; they are shown below.
# A tibble: 4 × 2
# Groups: country [4]
country median
<chr> <dbl>
1 Colombia 6
2 Ecuador 5
3 Bolivia 5
4 Peru 5
# A tibble: 4 × 2
# Groups: country [4]
country mean
<chr> <dbl>
1 Colombia 5.28
2 Ecuador 4.98
3 Bolivia 5.04
4 Peru 4.74
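Per-country medians and means like the ones above can be obtained in base R with `tapply()`, as in this sketch. The toy values are made up, the names `medians`, `means`, and `summary_by_country` are assumptions, and the tibble output above suggests the original used a different (grouped) approach.

```r
# Toy stand-in with one missing answer.
survey <- data.frame(
  country = c("Colombia", "Colombia", "Colombia", "Peru", "Peru", "Peru"),
  democracy_better = c(6, 7, NA, 5, 4, 6)
)

# Median and mean agreement per country, ignoring missing answers.
medians <- tapply(survey$democracy_better, survey$country,
                  median, na.rm = TRUE)
means <- tapply(survey$democracy_better, survey$country,
                mean, na.rm = TRUE)

# Collect both metrics in one table, one row per country.
summary_by_country <- data.frame(
  country = names(medians),
  median  = as.vector(medians),
  mean    = as.vector(means)
)

summary_by_country
```

Passing `na.rm = TRUE` matters here because a single missing answer would otherwise turn a country's mean and median into NA.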
The maps below show that:
the country with the highest median and mean is Colombia. However, given that the other countries are less than one scale point away from Colombia, the difference is small.
in all countries of the Andean community, respondents tend to agree that democracy is the best form of government (medians of 5 or 6 on the 1 to 7 scale).

The plot below shows that:
the trend was almost the same regardless of salary: most people agreed that democracy is the best form of government.
a slightly greater proportion of people in the group with the best salaries thought that democracy is the best form of government compared to the other salary groups. Nonetheless, the difference was small.

A similar plot was made for educational background; however, the visualization was hard to read and the trend was the same for all educational backgrounds: most people thought that democracy was better than other forms of government. Yet, there was one interesting insight in the NA column. The plot below shows that:
- Out of all people who did not answer whether democracy was the best form of government, most were from lower educational backgrounds.
This insight is powerful because it suggests that the lack of a sufficient educational background can keep people from forming an opinion at all. This is concerning because it means that a lack of education may lead people not only to wrong stands, but to no stand at all. This is dangerous to democracy because it means that people with low education do not have the same decision power as their counterparts, which can negatively affect political elections, referendums, etc.
