Introduction

This document presents an Exploratory Data Analysis (EDA) of a survey data set on public opinion in the Andean community (Peru, Colombia, Ecuador, and Bolivia) regarding the major problems that these countries faced over a span of 8 to 10 years. The original data set is made up of 266 variables and 236754 observations; however, this EDA dealt with a subset of that data.

An EDA is the process of investigating a data set to discover patterns, spot anomalies, and test hypotheses, usually with the aid of visualizations (https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15). Thus, the purpose of this document was to carry out such operations on the data set to answer questions such as:

  • Is the data collection biased by not including enough diversity?

  • Does race play a role in people’s perception of the main issue the country is facing in the Andean community?

  • What is the median number of times that someone has been a victim of crime in these countries?

  • Do main concerns differ by country?

  • Is there a correlation between people’s concern about corruption and their trust in political parties?

  • Have people become more concerned about crime over time?

The EDA focused on four criteria of study (Big Indicators, Crime, Corruption, and Democracy). The structure is broken down into:

  • Univariate Analysis: An in-depth study of the distribution of demographic variables to identify bias, and of the distribution of variables related to the criteria of study.

  • Bivariate Analysis: A study of how two variables interact with each other to discover how variables related to the main criteria change according to demographics or according to other variables.

  • Space-Temporal Analysis: An analysis of how a variable has changed over time and by country to spot geographical differences and similarities within the Andean community, as well as trends over time.

  • Multivariate Analysis: Studying more than three variables at the same time to generate insights about one or more criteria of study.

The conclusions can be found at the bottom of the document. For a compact version of this document, the reader can refer to the Executive Summary file.

Considerations:

The options for some demographic categories studied in this data set were taken directly from the options offered in the original surveys. This document acknowledges that the options do not fully reflect how some readers identify themselves. The limitations of the demographic options presented in this EDA are not a reflection of the author’s views, but a reflection of a lack of diversity in the questions of the original survey.

Given that the original survey deals with race, it is important to keep in mind that the treatment of race in Latin America differs in some respects from its treatment in the USA. Thus, language surrounding race that is sensitive in the USA is not necessarily perceived as such in Latin America.

This EDA is likely to present accessibility issues. A version for people with visual impairments is in progress, and its link will be attached here once it is released.

About the data

Source: This project used the data set titled “Latin America Public Opinion Project (LAPOP), 2004 - 2015 [28 COUNTRIES]”, collected by the Latin American Public Opinion Project at Vanderbilt University and retrieved from ICPSR (https://www.icpsr.umich.edu/web/ICPSR/studies/36562).

Composition of the Data Set: The data set has 266 variables and 236754 observations. Due to its large size, functions such as “glimpse()” or “str()” were not used. In the original data set, there were only two character variables (“IDNUM_14”, the questionnaire number stored as text, and “FECHA”, the date). Below, the reader can see the first 10 observations with some of the variables.

# A tibble: 10 × 266
    PAIS  WAVE  YEAR IDNUM IDNUM_14 ESTRATOPRI ESTRATOSEC STRATA   UPM CLUSTER
   <dbl> <dbl> <dbl> <dbl> <chr>         <dbl>      <dbl>  <dbl> <dbl>   <dbl>
 1    17  2008  2008   594 <NA>           1701         NA   1701  1702      30
 2    17  2008  2008    14 <NA>           1705         NA   1717  1759       3
 3    17  2008  2008    84 <NA>           1701         NA   1701  1711       1
 4    17  2008  2008  1110 <NA>           1701         NA   1701  1727       5
 5    17  2008  2008  1296 <NA>           1703         NA   1709  1741       6
 6    17  2008  2008   817 <NA>           1704         NA   1713  1749       3
 7    17  2008  2008  1493 <NA>           1702         NA   1705  1731      15
 8    17  2008  2008   225 <NA>           1702         NA   1705  1731      44
 9    17  2008  2008  1185 <NA>           1701         NA   1701  1725      19
10    17  2008  2008  1045 <NA>           1701         NA   1701  1722       6
# … with 256 more variables: UR <dbl>, TAMANO <dbl>, IDIOMAQ <dbl>,
#   FECHA <chr>, WT <dbl>, WEIGHT1500 <dbl>, ODD <dbl>, X_OR_Y <dbl>, Q1 <dbl>,
#   Q2 <dbl>, A4 <dbl>, A4_06 <dbl>, A4C <dbl>, A4I <dbl>, AB1 <dbl>,
#   AB2 <dbl>, AB5 <dbl>, AOJ8 <dbl>, AOJ11 <dbl>, AOJ12 <dbl>, AOJ17 <dbl>,
#   AOJ18 <dbl>, AOJ21 <dbl>, AOJ22 <dbl>, AUT1 <dbl>, B1 <dbl>, B2 <dbl>,
#   B3 <dbl>, B3MILX <dbl>, B4 <dbl>, B6 <dbl>, B10A <dbl>, B11 <dbl>,
#   B12 <dbl>, B13 <dbl>, B14 <dbl>, B15 <dbl>, B16 <dbl>, B17 <dbl>, …
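
As a minimal sketch of how a very wide data set can be inspected without printing every column, the snippet below checks the dimensions, lists the character columns, and previews the first rows. The object `raw_toy` and its values are made up for illustration and are not the LAPOP data; only the column names are borrowed from the output above.

```r
library(dplyr)

# Toy stand-in for the raw data (values are invented for illustration)
raw_toy <- tibble(
  PAIS     = c(17, 17, 8),
  YEAR     = c(2008, 2008, 2006),
  IDNUM_14 = c(NA_character_, NA_character_, "101"),
  FECHA    = c("2008-01-15", "2008-01-16", "2006-03-02")
)

# Number of observations and variables
dims <- dim(raw_toy)

# Names of the columns stored as character
chr_cols <- names(raw_toy)[sapply(raw_toy, is.character)]

# Preview only the first rows instead of calling str() on every column
preview <- head(raw_toy, 10)
print(preview)
```

On the real data, the same three steps would report the dimensions and the two character variables mentioned above.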

Codebook: The original data set is very untidy. Since the EDA does not cover the whole data set, a codebook for all 266 variables is not provided. For any questions about the data set, please check the folder “Data Information”, where the file “Breakdown of Questions” describes the original data set in depth.

Other: For code and comments regarding the original data set explorations, check the R script “Exploring Raw Data.”

EDA

Tidying data

Data Set Issues:

Some of the main problems with the original data set are:

  • The data set has observations that are irrelevant for the purpose of the EDA. E.g. the EDA focuses on Andean Countries, but the original data set covers all countries in Latin America.
  • There are too many variables: Some of them are irrelevant (e.g. repeated variables). Similarly, even though there might be many that are interesting, it is necessary to narrow them down to not sacrifice quality for quantity.
  • The names of the variables are not informative or are in a different language. Some of the names of the variables are in Spanish (e.g. “PAIS”), which would make it harder for English speakers to understand the EDA (given that it is written in English), and some other variables have names that are not intuitive (e.g. “Q1”: whether the surveyed person is Male or Female).

To solve these issues and others, the process of tidying the data is shown below.

Filtering by Countries:

The data set was filtered to keep only the countries of interest: Peru, Bolivia, Ecuador, and Colombia. The country variable is called “PAIS” and the countries are represented by the following codes: Peru (11), Bolivia (10), Ecuador (9), and Colombia (8). The tables below show the frequency of each value of the variable “PAIS” before and after filtering by countries. After filtering, the data set has 49596 observations and 266 variables (checked with “str()”).

Before filtering, table of frequency for variable “PAIS”:


    1     2     3     4     5     6     7     8     9    10    11    12    13 
 9333  9253  9426  9492  9500  9031  9375  8987 14913 18196  7500  6845  8151 
   14    15    16    17    21    22    23    24    25    26    27    28    29 
 7224  8192  7510  5920 12013  8261  7601  8695  7212  6101  7006  3429  3828 
   40    41 
 6609  7151 

After filtering, table of frequency for variable “PAIS”:


    8     9    10    11 
 8987 14913 18196  7500 
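
The filtering step could be sketched as follows. This is an illustrative example on a made-up tibble (`raw_toy` is an assumption, not the real data); only the country codes come from the text above.

```r
library(dplyr)

# Toy stand-in for the raw data; codes follow the text
# (Colombia = 8, Ecuador = 9, Bolivia = 10, Peru = 11)
raw_toy <- tibble(PAIS = c(1, 8, 8, 9, 10, 10, 10, 11, 17))

andean_codes <- c(8, 9, 10, 11)

# Frequency table before filtering
freq_before <- table(raw_toy$PAIS)

# Keep only the Andean countries
andean_toy <- raw_toy %>%
  filter(PAIS %in% andean_codes)

# Frequency table after filtering
freq_after <- table(andean_toy$PAIS)
print(freq_before)
print(freq_after)
```

Comparing the two tables is a quick check that only the four intended codes remain and that their counts are unchanged.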

Subsetting by Variables of Interest:

Not all of the 266 variables are relevant, and using all of them for an EDA would sacrifice specificity. Therefore, each variable in the questionnaire was analyzed and only relevant variables were selected (e.g. the variable about whether a person trusts the president was kept, but the variable about whether a person trusts the mayor was left out). Since these are not the final variables used for the EDA (further subsetting is done later), there is no specific codebook yet, but the reader can find information about each variable in the general codebook of the original survey. Below, the reader can see that the data set now has 49596 observations and 53 variables.

tibble [49,596 × 53] (S3: tbl_df/tbl/data.frame)
 $ PAIS      : num [1:49596] 10 10 10 10 10 10 10 10 10 10 ...
 $ WAVE      : num [1:49596] 2004 2004 2004 2004 2004 ...
 $ YEAR      : num [1:49596] 2004 2004 2004 2004 2004 ...
 $ ESTRATOSEC: num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ UR        : num [1:49596] 1 1 1 1 2 2 1 1 2 1 ...
 $ TAMANO    : num [1:49596] 4 1 3 4 5 5 4 3 5 3 ...
 $ IDIOMAQ   : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ Q1        : num [1:49596] 2 2 1 1 2 1 1 1 1 1 ...
 $ LS3       : num [1:49596] 1 3 3 3 2 2 2 3 3 2 ...
 $ A4        : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SOCT2     : num [1:49596] 3 3 3 3 2 1 3 2 3 1 ...
 $ IDIO2     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ CP5       : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ L1        : num [1:49596] 5 6 8 5 NA 7 6 6 NA 7 ...
 $ PROT3     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ JC10      : num [1:49596] 2 2 2 2 NA 2 2 1 2 1 ...
 $ JC13      : num [1:49596] 2 1 1 2 NA 1 2 1 2 1 ...
 $ JC15A     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXT   : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXTA  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1HOGAR : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE1     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE2     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ AOJ12     : num [1:49596] 3 1 4 3 3 3 3 3 1 3 ...
 $ B2        : num [1:49596] 3 6 2 4 4 4 3 5 4 5 ...
 $ B4        : num [1:49596] 4 5 5 5 NA 6 5 3 3 5 ...
 $ B6        : num [1:49596] 4 5 3 6 NA 6 2 3 4 6 ...
 $ B10A      : num [1:49596] 4 6 5 1 5 4 2 3 4 3 ...
 $ B12       : num [1:49596] 6 2 7 6 3 6 3 3 3 3 ...
 $ B13       : num [1:49596] 5 1 4 3 4 6 6 2 3 3 ...
 $ B18       : num [1:49596] 2 4 6 1 5 4 5 2 3 1 ...
 $ B21       : num [1:49596] 3 2 3 2 3 3 2 3 2 2 ...
 $ B21A      : num [1:49596] 7 5 2 1 4 6 3 6 5 5 ...
 $ B47A      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ N9        : num [1:49596] 1 2 5 3 NA 5 6 3 5 5 ...
 $ N11       : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ N15       : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ M1        : num [1:49596] 2 2 4 3 3 2 3 3 3 2 ...
 $ SD2NEW2   : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SD3NEW2   : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SD6NEW2   : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ROS4      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ING4      : num [1:49596] 6 4 4 5 7 6 6 4 3 4 ...
 $ MIL7      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PN4       : num [1:49596] 3 3 4 3 NA 3 2 3 3 3 ...
 $ EXC2      : num [1:49596] 0 1 0 1 0 0 0 0 0 0 ...
 $ EXC7      : num [1:49596] 1 3 3 2 NA 3 2 1 4 1 ...
 $ POL1      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ED        : num [1:49596] 10 10 7 12 4 14 12 16 14 13 ...
 $ Q10NEW_12 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ Q10NEW_14 : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ Q10D      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ETID      : num [1:49596] 2 2 2 1 2 3 2 2 3 2 ...

After reviewing each variable for potential problems, it was noticed that both Q10NEW_14 and Q10NEW_12 measure monthly household income in dollars. Because living costs vary from country to country, these variables could lead to misleading results. Even though this problem can be addressed, given the number of variables in the data set it was decided to drop both. Another issue was that the variable “WAVE” seemed to hold the same values as the variable “YEAR.”

# A tibble: 10 × 2
    YEAR  WAVE
   <dbl> <dbl>
 1  2004  2004
 2  2004  2004
 3  2004  2004
 4  2004  2004
 5  2004  2004
 6  2004  2004
 7  2004  2004
 8  2004  2004
 9  2004  2004
10  2004  2004

After checking whether both columns were equal, the hypothesis was confirmed, which makes “WAVE” redundant. Finally, the variable ESTRATOSEC (size of municipality) was considered not relevant given that other demographic information (sex, country, urbanization, etc.) was more useful. After removing these four variables (Q10NEW_12, Q10NEW_14, WAVE, and ESTRATOSEC), the data set had 49 variables and 49596 observations.

tibble [49,596 × 49] (S3: tbl_df/tbl/data.frame)
 $ PAIS     : num [1:49596] 10 10 10 10 10 10 10 10 10 10 ...
 $ YEAR     : num [1:49596] 2004 2004 2004 2004 2004 ...
 $ UR       : num [1:49596] 1 1 1 1 2 2 1 1 2 1 ...
 $ TAMANO   : num [1:49596] 4 1 3 4 5 5 4 3 5 3 ...
 $ IDIOMAQ  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ Q1       : num [1:49596] 2 2 1 1 2 1 1 1 1 1 ...
 $ LS3      : num [1:49596] 1 3 3 3 2 2 2 3 3 2 ...
 $ A4       : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SOCT2    : num [1:49596] 3 3 3 3 2 1 3 2 3 1 ...
 $ IDIO2    : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ CP5      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ L1       : num [1:49596] 5 6 8 5 NA 7 6 6 NA 7 ...
 $ PROT3    : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ JC10     : num [1:49596] 2 2 2 2 NA 2 2 1 2 1 ...
 $ JC13     : num [1:49596] 2 1 1 2 NA 1 2 1 2 1 ...
 $ JC15A    : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXT  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXTA : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1HOGAR: num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE1    : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE2    : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ AOJ12    : num [1:49596] 3 1 4 3 3 3 3 3 1 3 ...
 $ B2       : num [1:49596] 3 6 2 4 4 4 3 5 4 5 ...
 $ B4       : num [1:49596] 4 5 5 5 NA 6 5 3 3 5 ...
 $ B6       : num [1:49596] 4 5 3 6 NA 6 2 3 4 6 ...
 $ B10A     : num [1:49596] 4 6 5 1 5 4 2 3 4 3 ...
 $ B12      : num [1:49596] 6 2 7 6 3 6 3 3 3 3 ...
 $ B13      : num [1:49596] 5 1 4 3 4 6 6 2 3 3 ...
 $ B18      : num [1:49596] 2 4 6 1 5 4 5 2 3 1 ...
 $ B21      : num [1:49596] 3 2 3 2 3 3 2 3 2 2 ...
 $ B21A     : num [1:49596] 7 5 2 1 4 6 3 6 5 5 ...
 $ B47A     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ N9       : num [1:49596] 1 2 5 3 NA 5 6 3 5 5 ...
 $ N11      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ N15      : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ M1       : num [1:49596] 2 2 4 3 3 2 3 3 3 2 ...
 $ SD2NEW2  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SD3NEW2  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ SD6NEW2  : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ROS4     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ING4     : num [1:49596] 6 4 4 5 7 6 6 4 3 4 ...
 $ MIL7     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ PN4      : num [1:49596] 3 3 4 3 NA 3 2 3 3 3 ...
 $ EXC2     : num [1:49596] 0 1 0 1 0 0 0 0 0 0 ...
 $ EXC7     : num [1:49596] 1 3 3 2 NA 3 2 1 4 1 ...
 $ POL1     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ED       : num [1:49596] 10 10 7 12 4 14 12 16 14 13 ...
 $ Q10D     : num [1:49596] NA NA NA NA NA NA NA NA NA NA ...
 $ ETID     : num [1:49596] 2 2 2 1 2 3 2 2 3 2 ...
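
A minimal sketch of the redundancy check and the column removal described above, using a made-up tibble (`toy` and its values are assumptions for illustration). The exact functions used in the original script are not shown in this document, so `identical()` and `select()` are one reasonable choice rather than the author's confirmed method.

```r
library(dplyr)

# Toy data reusing column names from the output above (values invented)
toy <- tibble(
  PAIS       = c(10, 10, 8),
  WAVE       = c(2004, 2004, 2006),
  YEAR       = c(2004, 2004, 2006),
  ESTRATOSEC = c(NA, NA, NA),
  Q10NEW_12  = c(NA, 3, NA),
  Q10NEW_14  = c(NA, NA, 5),
  B2         = c(3, 6, 2)
)

# TRUE only if both columns hold exactly the same values in every row
same_year_wave <- identical(toy$WAVE, toy$YEAR)

# Drop the redundant or unwanted columns
reduced <- toy %>%
  select(-WAVE, -ESTRATOSEC, -Q10NEW_12, -Q10NEW_14)

print(same_year_wave)
print(names(reduced))
```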

Accounting for Irregularities with Data Collection:

Given that the data was collected over a long time span and in different countries, irregularities such as the following were expected:

  • Observations not collected in every country for the same years.

  • An uneven number of observations collected among countries.

The values below show the unique values of the variable “YEAR” for each country.

Peru:

[1] 2006 2008 2010 2012 2014

Colombia:

[1] 2004 2006 2008 2010 2012 2014

Ecuador:

[1] 2004 2006 2008 2010 2012 2014

Bolivia:

[1] 2004 2006 2008 2010 2012 2014

The values above show that, in Peru, unlike the other countries, no data was collected during 2004. Therefore, only the years 2006, 2008, 2010, 2012, and 2014 were kept. After removing the 2004 observations, the data set had 42045 observations and 49 variables, as shown below.

tibble [42,045 × 49] (S3: tbl_df/tbl/data.frame)
 $ PAIS     : num [1:42045] 10 10 10 10 10 10 10 10 10 10 ...
 $ YEAR     : num [1:42045] 2006 2006 2006 2006 2006 ...
 $ UR       : num [1:42045] 2 1 2 1 1 2 2 2 2 2 ...
 $ TAMANO   : num [1:42045] 5 3 5 1 4 5 5 5 5 5 ...
 $ IDIOMAQ  : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ Q1       : num [1:42045] 2 2 1 2 1 1 1 2 2 1 ...
 $ LS3      : num [1:42045] 1 1 2 2 2 1 2 2 3 1 ...
 $ A4       : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ SOCT2    : num [1:42045] NA 3 1 1 2 NA 2 3 2 2 ...
 $ IDIO2    : num [1:42045] 2 3 1 1 3 2 2 2 1 3 ...
 $ CP5      : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ L1       : num [1:42045] 5 6 5 3 5 9 NA 5 5 3 ...
 $ PROT3    : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ JC10     : num [1:42045] 1 1 2 1 1 2 2 1 1 2 ...
 $ JC13     : num [1:42045] 2 1 2 2 1 2 1 1 1 2 ...
 $ JC15A    : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXT  : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXTA : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1HOGAR: num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE1    : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE2    : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ AOJ12    : num [1:42045] 3 3 2 4 3 2 3 3 1 3 ...
 $ B2       : num [1:42045] 5 4 6 6 3 5 6 3 5 1 ...
 $ B4       : num [1:42045] 2 6 6 4 5 5 6 4 3 1 ...
 $ B6       : num [1:42045] 2 4 7 5 5 4 7 5 5 NA ...
 $ B10A     : num [1:42045] 5 6 6 5 3 4 7 3 6 4 ...
 $ B12      : num [1:42045] 2 5 6 5 4 4 1 6 5 3 ...
 $ B13      : num [1:42045] NA 6 6 6 3 4 6 4 4 3 ...
 $ B18      : num [1:42045] 4 4 6 6 2 4 1 5 4 1 ...
 $ B21      : num [1:42045] 6 4 4 3 4 4 3 6 4 3 ...
 $ B21A     : num [1:42045] 2 5 7 5 5 6 3 1 6 4 ...
 $ B47A     : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ N9       : num [1:42045] 7 6 6 5 3 5 4 2 5 NA ...
 $ N11      : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ N15      : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ M1       : num [1:42045] 2 3 1 2 3 2 3 3 2 2 ...
 $ SD2NEW2  : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ SD3NEW2  : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ SD6NEW2  : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ ROS4     : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ ING4     : num [1:42045] 3 4 6 4 4 4 5 NA 6 4 ...
 $ MIL7     : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ PN4      : num [1:42045] 2 2 2 2 2 1 2 2 2 3 ...
 $ EXC2     : num [1:42045] 0 0 0 0 0 0 0 0 0 0 ...
 $ EXC7     : num [1:42045] 4 2 1 2 2 3 3 2 3 NA ...
 $ POL1     : num [1:42045] 3 4 1 3 4 3 3 1 3 3 ...
 $ ED       : num [1:42045] 4 5 12 17 11 0 9 9 12 3 ...
 $ Q10D     : num [1:42045] NA NA NA NA NA NA NA NA NA NA ...
 $ ETID     : num [1:42045] 3 2 2 2 3 2 3 3 4 NA ...
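
The year check and the resulting filter could be sketched as follows. The tibble `toy` is made up for illustration, and deriving the common years with `Reduce(intersect, ...)` is an assumption about one way to do it, not necessarily how the original script proceeded.

```r
library(dplyr)

# Toy data: country 11 has no 2004 observations (values invented)
toy <- tibble(
  PAIS = c(8, 8, 8, 11, 11, 10, 10, 10),
  YEAR = c(2004, 2006, 2008, 2006, 2008, 2004, 2006, 2008)
)

# Survey years available for each country code
years_by_country <- lapply(split(toy$YEAR, toy$PAIS), unique)

# Years that appear in every country
common_years <- Reduce(intersect, years_by_country)

# Keep only observations from those years
balanced_years <- toy %>%
  filter(YEAR %in% common_years)

print(years_by_country)
print(common_years)
```

Computing the common years from the data, rather than hard-coding them, keeps the filter valid if another country or survey wave is added later.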

The next step was to check whether the data had been unevenly recorded across countries, because this could lead to misleading comparisons. For example, a later visualization could show that twice as many people trust their president in Bolivia as in Ecuador, when this would simply be a consequence of twice as many observations having been recorded in Bolivia. The following plot shows how many observations were recorded per country in a given year. It can be seen that Bolivia had about twice as many observations as Peru and Colombia in every year. Ecuador showed the same pattern in 2006, 2008, and 2010.



To account for this, a sample was taken from every country-year combination so that further analysis would be more comparable. The sample size was set to the size of the smallest country-year combination, which keeps as many observations as possible while still making the groups comparable. The smallest country-year combination had 1489 observations and corresponds to Ecuador (code 9) in 2014. This can be verified below.

# A tibble: 20 × 3
# Groups:   PAIS, YEAR [20]
    PAIS  YEAR     n
   <dbl> <dbl> <int>
 1     8  2006  1491
 2     8  2008  1503
 3     8  2010  1506
 4     8  2012  1512
 5     8  2014  1496
 6     9  2006  2925
 7     9  2008  3000
 8     9  2010  2999
 9     9  2012  1500
10     9  2014  1489
11    10  2006  3008
12    10  2008  3003
13    10  2010  3018
14    10  2012  3029
15    10  2014  3066
16    11  2006  1500
17    11  2008  1500
18    11  2010  1500
19    11  2012  1500
20    11  2014  1500
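
A table like the one above, together with the smallest group, could be obtained along these lines. The data in `toy` is invented for illustration; `count()` is used here as one plausible way to tally country-year combinations.

```r
library(dplyr)

# Toy data with uneven country-year group sizes (values invented)
toy <- tibble(
  PAIS = c(8, 8, 8, 9, 9, 10, 10, 10, 10),
  YEAR = c(2006, 2006, 2008, 2006, 2008, 2006, 2006, 2008, 2008)
)

# Number of observations per country-year combination
counts <- toy %>%
  count(PAIS, YEAR)

# Size of the smallest combination, and which combinations have it
min_n <- min(counts$n)
smallest <- counts %>%
  filter(n == min_n)

print(counts)
print(smallest)
```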

The results of taking a sample of size 1489 per country-year combination are shown below.

# A tibble: 20 × 3
# Groups:   PAIS, YEAR [20]
    PAIS  YEAR     n
   <dbl> <dbl> <int>
 1     8  2006  1489
 2     8  2008  1489
 3     8  2010  1489
 4     8  2012  1489
 5     8  2014  1489
 6     9  2006  1489
 7     9  2008  1489
 8     9  2010  1489
 9     9  2012  1489
10     9  2014  1489
11    10  2006  1489
12    10  2008  1489
13    10  2010  1489
14    10  2012  1489
15    10  2014  1489
16    11  2006  1489
17    11  2008  1489
18    11  2010  1489
19    11  2012  1489
20    11  2014  1489
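
The balanced sampling step could be sketched as below. This example assumes dplyr 1.0.0 or later for `slice_sample()`; the original script may have used a different function (for example `sample_n()`), and both the seed and the `toy` data are arbitrary choices made for illustration.

```r
library(dplyr)

set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# Toy data: three groups with 4, 3, and 6 observations (values invented)
toy <- tibble(
  PAIS = rep(c(8, 9, 10), times = c(4, 3, 6)),
  YEAR = 2006,
  id   = seq_len(13)
)

# Size of the smallest country-year group
min_n <- toy %>%
  count(PAIS, YEAR) %>%
  pull(n) %>%
  min()

# Draw min_n observations, without replacement, from every group
balanced <- toy %>%
  group_by(PAIS, YEAR) %>%
  slice_sample(n = min_n) %>%
  ungroup()

print(count(balanced, PAIS, YEAR))
```

Sampling without replacement means the smallest group is kept whole, while larger groups are reduced to the same size.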



The data set now has 29,780 observations and 49 variables.

tibble [29,780 × 49] (S3: tbl_df/tbl/data.frame)
 $ PAIS     : num [1:29780] 8 8 8 8 8 8 8 8 8 8 ...
 $ YEAR     : num [1:29780] 2006 2006 2006 2006 2006 ...
 $ UR       : num [1:29780] 1 2 2 1 1 2 2 1 1 1 ...
 $ TAMANO   : num [1:29780] 3 4 4 3 4 4 4 3 4 4 ...
 $ IDIOMAQ  : num [1:29780] 1 1 1 1 1 1 1 1 1 1 ...
 $ Q1       : num [1:29780] 1 1 2 1 1 1 1 1 1 1 ...
 $ LS3      : num [1:29780] 2 1 1 2 1 3 3 2 1 2 ...
 $ A4       : num [1:29780] 57 21 57 27 3 4 4 12 4 55 ...
 $ SOCT2    : num [1:29780] 3 3 1 2 2 2 2 3 2 3 ...
 $ IDIO2    : num [1:29780] 3 2 2 2 1 2 2 3 3 3 ...
 $ CP5      : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ L1       : num [1:29780] 5 2 NA 4 9 NA 4 9 7 NA ...
 $ PROT3    : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ JC10     : num [1:29780] 2 2 1 1 1 2 1 2 2 2 ...
 $ JC13     : num [1:29780] 2 2 1 1 1 2 2 2 2 2 ...
 $ JC15A    : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXT  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1EXTA : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ VIC1HOGAR: num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE1    : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ PESE2    : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ AOJ12    : num [1:29780] 2 3 3 3 3 3 4 4 2 2 ...
 $ B2       : num [1:29780] 7 7 7 2 7 6 4 7 5 NA ...
 $ B4       : num [1:29780] 6 5 7 3 5 4 5 7 2 NA ...
 $ B6       : num [1:29780] 5 7 7 3 6 5 4 7 4 NA ...
 $ B10A     : num [1:29780] 5 6 7 2 4 6 6 7 3 NA ...
 $ B12      : num [1:29780] 5 1 7 3 7 3 5 7 2 6 ...
 $ B13      : num [1:29780] 3 1 7 2 5 4 6 7 3 6 ...
 $ B18      : num [1:29780] 4 1 7 4 6 4 5 7 5 6 ...
 $ B21      : num [1:29780] 2 7 7 2 6 5 4 7 4 5 ...
 $ B21A     : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ B47A     : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ N9       : num [1:29780] 5 1 7 2 1 6 4 3 4 NA ...
 $ N11      : num [1:29780] 5 1 7 4 4 3 5 4 5 NA ...
 $ N15      : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ M1       : num [1:29780] 3 4 2 3 3 3 3 3 2 2 ...
 $ SD2NEW2  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ SD3NEW2  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ SD6NEW2  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ ROS4     : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ ING4     : num [1:29780] 7 1 NA 5 5 6 5 4 6 6 ...
 $ MIL7     : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ PN4      : num [1:29780] 2 3 NA 3 3 3 3 2 2 2 ...
 $ EXC2     : num [1:29780] 0 0 0 0 0 0 0 0 0 0 ...
 $ EXC7     : num [1:29780] 1 4 1 1 2 3 3 1 2 NA ...
 $ POL1     : num [1:29780] 2 4 3 3 1 3 3 2 2 1 ...
 $ ED       : num [1:29780] 16 7 5 17 5 5 7 11 11 2 ...
 $ Q10D     : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ ETID     : num [1:29780] 2 2 2 1 1 2 3 2 2 1 ...

Adjusting values in variables:

The sample had 49 variables, all of them numeric because of how the data was collected (numbers represented categories). Although the numeric type made sense for some variables (rankings, extent of agreement on a 1-7 scale, etc.), for others categorical values were more appropriate. Thus, most variables were recoded after careful consideration. Below, the reader can see the first ten observations before and after recoding. Personal judgment was used to decide where a categorical variable would work better, and these choices were subject to change throughout the EDA.
Before:

# A tibble: 10 × 49
    PAIS  YEAR    UR TAMANO IDIOMAQ    Q1   LS3    A4 SOCT2 IDIO2   CP5    L1
   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     8  2006     1      3       1     1     2    57     3     3    NA     5
 2     8  2006     2      4       1     1     1    21     3     2    NA     2
 3     8  2006     2      4       1     2     1    57     1     2    NA    NA
 4     8  2006     1      3       1     1     2    27     2     2    NA     4
 5     8  2006     1      4       1     1     1     3     2     1    NA     9
 6     8  2006     2      4       1     1     3     4     2     2    NA    NA
 7     8  2006     2      4       1     1     3     4     2     2    NA     4
 8     8  2006     1      3       1     1     2    12     3     3    NA     9
 9     8  2006     1      4       1     1     1     4     2     3    NA     7
10     8  2006     1      4       1     1     2    55     3     3    NA    NA
# … with 37 more variables: PROT3 <dbl>, JC10 <dbl>, JC13 <dbl>, JC15A <dbl>,
#   VIC1EXT <dbl>, VIC1EXTA <dbl>, VIC1HOGAR <dbl>, PESE1 <dbl>, PESE2 <dbl>,
#   AOJ12 <dbl>, B2 <dbl>, B4 <dbl>, B6 <dbl>, B10A <dbl>, B12 <dbl>,
#   B13 <dbl>, B18 <dbl>, B21 <dbl>, B21A <dbl>, B47A <dbl>, N9 <dbl>,
#   N11 <dbl>, N15 <dbl>, M1 <dbl>, SD2NEW2 <dbl>, SD3NEW2 <dbl>,
#   SD6NEW2 <dbl>, ROS4 <dbl>, ING4 <dbl>, MIL7 <dbl>, PN4 <dbl>, EXC2 <dbl>,
#   EXC7 <dbl>, POL1 <dbl>, ED <dbl>, Q10D <dbl>, ETID <dbl>

After:

# A tibble: 10 × 49
   PAIS      YEAR UR    TAMANO IDIOMAQ Q1    LS3   A4    SOCT2 IDIO2 CP5      L1
   <chr>    <dbl> <chr>  <dbl> <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
 1 Colombia  2006 Urban      3 Spanish Male  Some… Viol… Worse Worse No a…     5
 2 Colombia  2006 Rural      4 Spanish Male  Very… Educ… Worse Same  No a…     2
 3 Colombia  2006 Rural      4 Spanish Fema… Very… Viol… Bett… Same  No a…    NA
 4 Colombia  2006 Urban      3 Spanish Male  Some… Secu… Same  Same  No a…     4
 5 Colombia  2006 Urban      4 Spanish Male  Very… Unem… Same  Bett… No a…     9
 6 Colombia  2006 Rural      4 Spanish Male  Some… Pove… Same  Same  No a…    NA
 7 Colombia  2006 Rural      4 Spanish Male  Some… Pove… Same  Same  No a…     4
 8 Colombia  2006 Urban      3 Spanish Male  Some… Drug… Worse Worse No a…     9
 9 Colombia  2006 Urban      4 Spanish Male  Very… Pove… Same  Worse No a…     7
10 Colombia  2006 Urban      4 Spanish Male  Some… Hous… Worse Worse No a…    NA
# … with 37 more variables: PROT3 <chr>, JC10 <chr>, JC13 <chr>, JC15A <chr>,
#   VIC1EXT <chr>, VIC1EXTA <dbl>, VIC1HOGAR <chr>, PESE1 <chr>, PESE2 <chr>,
#   AOJ12 <chr>, B2 <dbl>, B4 <dbl>, B6 <dbl>, B10A <dbl>, B12 <dbl>,
#   B13 <dbl>, B18 <dbl>, B21 <dbl>, B21A <dbl>, B47A <dbl>, N9 <dbl>,
#   N11 <dbl>, N15 <dbl>, M1 <dbl>, SD2NEW2 <chr>, SD3NEW2 <chr>,
#   SD6NEW2 <chr>, ROS4 <dbl>, ING4 <dbl>, MIL7 <dbl>, PN4 <chr>, EXC2 <chr>,
#   EXC7 <chr>, POL1 <dbl>, ED <dbl>, Q10D <chr>, ETID <chr>
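
A minimal sketch of this kind of recoding is shown below. The code-to-label mappings for “PAIS”, “UR”, and “Q1” are read off the text and the before/after tables above; the use of `case_when()` and `if_else()` is an assumption, since the original recoding code is not shown, and `toy` is a made-up tibble.

```r
library(dplyr)

# Toy data with numeric codes (values invented)
toy <- tibble(
  PAIS = c(8, 9, 10, 11),
  UR   = c(1, 2, 2, 1),
  Q1   = c(1, 1, 2, 1),
  L1   = c(5, 2, NA, 4)  # scale variable, deliberately left numeric
)

recoded <- toy %>%
  mutate(
    # Country codes as given in the filtering step
    PAIS = case_when(
      PAIS == 8  ~ "Colombia",
      PAIS == 9  ~ "Ecuador",
      PAIS == 10 ~ "Bolivia",
      PAIS == 11 ~ "Peru"
    ),
    # 1 = Urban, 2 = Rural; 1 = Male, 2 = Female (as in the tables above)
    UR = if_else(UR == 1, "Urban", "Rural"),
    Q1 = if_else(Q1 == 1, "Male", "Female")
  )

print(recoded)
```

Any code not listed in `case_when()` becomes NA, which makes unexpected values easy to spot after recoding.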

Renaming variables:

The names of the variables were not intuitive, which makes manipulation harder because one always has to refer back to the codebook. Therefore, all of the variables were renamed.
Before:

 [1] "PAIS"      "YEAR"      "UR"        "TAMANO"    "IDIOMAQ"   "Q1"       
 [7] "LS3"       "A4"        "SOCT2"     "IDIO2"     "CP5"       "L1"       
[13] "PROT3"     "JC10"      "JC13"      "JC15A"     "VIC1EXT"   "VIC1EXTA" 
[19] "VIC1HOGAR" "PESE1"     "PESE2"     "AOJ12"     "B2"        "B4"       
[25] "B6"        "B10A"      "B12"       "B13"       "B18"       "B21"      
[31] "B21A"      "B47A"      "N9"        "N11"       "N15"       "M1"       
[37] "SD2NEW2"   "SD3NEW2"   "SD6NEW2"   "ROS4"      "ING4"      "MIL7"     
[43] "PN4"       "EXC2"      "EXC7"      "POL1"      "ED"        "Q10D"     
[49] "ETID"     

After:

 [1] "country"                            "year"                              
 [3] "urban_rural"                        "size_place"                        
 [5] "language_form"                      "sex"                               
 [7] "life_satisfaction"                  "country_main_problem"              
 [9] "economy_compared_12"                "personal_economy_12"               
[11] "times_solving_community_problem_12" "left_right"                        
[13] "demonstration_participation_12"     "military_takeover_crime"           
[15] "military_takeover_corruption"       "close_congress_difficult_times"    
[17] "victim_crime_12"                    "times_victim_crime_12"             
[19] "household_victim_crime_12"          "violence_neighborhoods_compared"   
[21] "violece_neighborhood_12"            "trust_judicial_punishment"         
[23] "respect_political_institutions"     "pride_living_political_system"     
[25] "should_suppot_political_system"     "trust_justice_system"              
[27] "trust_armed_forces"                 "trust_national_congress"           
[29] "trust_national_police"              "trust_political_parties"           
[31] "trust_president"                    "trust_elections"                   
[33] "administration_combats_corruption"  "administration_imp_safety"         
[35] "administration_good_economy_mgm"    "rate_president_performance"        
[37] "satisf_road_streets_highw"          "satisf_public_schools"             
[39] "satisf_health_services"             "strong_pol_inequality"             
[41] "democracy_better"                   "aaff_should_combate_crime"         
[43] "satisf_democracy_country"           "police_bribe_12"                   
[45] "freq_corrup_public_off"             "interest_politics"                 
[47] "schooling_completed"                "salary_satisf"                     
[49] "race_id"                           
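
The renaming could be sketched with `rename()`, shown here for four of the 49 variables. The old and new names are taken from the two listings above; the tibble `toy` itself is made up for illustration.

```r
library(dplyr)

# Toy data with the original, non-descriptive names (values invented)
toy <- tibble(PAIS = "Colombia", YEAR = 2006, UR = "Urban", Q1 = "Male")

# rename() takes new_name = old_name pairs
renamed <- toy %>%
  rename(
    country     = PAIS,
    year        = YEAR,
    urban_rural = UR,
    sex         = Q1
  )

print(names(renamed))
```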

Final subsetting:

The variables in the data set revolved around the following topics:

  • Big indicators: Economy, Education, Health

  • Crime

  • Corruption

  • Democracy

Variables not closely connected with the topics above were removed (with the exception of demographic information: sex, country, race, etc.). Those variables were language_form (irrelevant because over 95% of observations are “Spanish”), as well as times_solving_community_problem_12, demonstration_participation_12, and satisf_road_streets_highw (little or no relevance to the topics above). Thus, the final data set had 29,780 observations and 45 variables. Note that “final data set” means the clean version of the original data used as the starting point for the EDA, not that no further data manipulation was performed afterwards.

tibble [29,780 × 45] (S3: tbl_df/tbl/data.frame)
 $ country                          : chr [1:29780] "Colombia" "Colombia" "Colombia" "Colombia" ...
 $ year                             : num [1:29780] 2006 2006 2006 2006 2006 ...
 $ urban_rural                      : chr [1:29780] "Urban" "Rural" "Rural" "Urban" ...
 $ size_place                       : num [1:29780] 3 4 4 3 4 4 4 3 4 4 ...
 $ sex                              : chr [1:29780] "Male" "Male" "Female" "Male" ...
 $ life_satisfaction                : chr [1:29780] "Somewhat satisfied" "Very satisfied" "Very satisfied" "Somewhat satisfied" ...
 $ country_main_problem             : chr [1:29780] "Violence" "Education (quality, lack of)" "Violence" "Security (lack of)" ...
 $ economy_compared_12              : chr [1:29780] "Worse" "Worse" "Better" "Same" ...
 $ personal_economy_12              : chr [1:29780] "Worse" "Same" "Same" "Same" ...
 $ left_right                       : num [1:29780] 5 2 NA 4 9 NA 4 9 7 NA ...
 $ military_takeover_crime          : chr [1:29780] "Not justified" "Not justified" "Justified" "Justified" ...
 $ military_takeover_corruption     : chr [1:29780] "Not justified" "Not justified" "Justified" "Justified" ...
 $ close_congress_difficult_times   : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ victim_crime_12                  : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ times_victim_crime_12            : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ household_victim_crime_12        : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ violence_neighborhoods_compared  : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ violece_neighborhood_12          : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ trust_judicial_punishment        : chr [1:29780] "Some" "Little" "Little" "Little" ...
 $ respect_political_institutions   : num [1:29780] 7 7 7 2 7 6 4 7 5 NA ...
 $ pride_living_political_system    : num [1:29780] 6 5 7 3 5 4 5 7 2 NA ...
 $ should_suppot_political_system   : num [1:29780] 5 7 7 3 6 5 4 7 4 NA ...
 $ trust_justice_system             : num [1:29780] 5 6 7 2 4 6 6 7 3 NA ...
 $ trust_armed_forces               : num [1:29780] 5 1 7 3 7 3 5 7 2 6 ...
 $ trust_national_congress          : num [1:29780] 3 1 7 2 5 4 6 7 3 6 ...
 $ trust_national_police            : num [1:29780] 4 1 7 4 6 4 5 7 5 6 ...
 $ trust_political_parties          : num [1:29780] 2 7 7 2 6 5 4 7 4 5 ...
 $ trust_president                  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ trust_elections                  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ administration_combats_corruption: num [1:29780] 5 1 7 2 1 6 4 3 4 NA ...
 $ administration_imp_safety        : num [1:29780] 5 1 7 4 4 3 5 4 5 NA ...
 $ administration_good_economy_mgm  : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ rate_president_performance       : num [1:29780] 3 4 2 3 3 3 3 3 2 2 ...
 $ satisf_public_schools            : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ satisf_health_services           : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ strong_pol_inequality            : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ democracy_better                 : num [1:29780] 7 1 NA 5 5 6 5 4 6 6 ...
 $ aaff_should_combate_crime        : num [1:29780] NA NA NA NA NA NA NA NA NA NA ...
 $ satisf_democracy_country         : chr [1:29780] "Somewhat satisfied" "Dissatisfied" "No answer" "Dissatisfied" ...
 $ police_bribe_12                  : chr [1:29780] "No" "No" "No" "No" ...
 $ freq_corrup_public_off           : chr [1:29780] "Very common" "Very uncommon" "Very common" "Very common" ...
 $ interest_politics                : num [1:29780] 2 4 3 3 1 3 3 2 2 1 ...
 $ schooling_completed              : num [1:29780] 16 7 5 17 5 5 7 11 11 2 ...
 $ salary_satisf                    : chr [1:29780] "No answer" "No answer" "No answer" "No answer" ...
 $ race_id                          : chr [1:29780] "Mestizo" "Mestizo" "Mestizo" "White" ...

Other: For code and comments regarding the data tidying process, check the R script “Processing data.”

Univariate Analysis

Since there were 45 variables, univariate analysis was not performed on every one of them. Only the variables that seemed the most relevant or interesting were analyzed.

Demographic Data

The app below (the picture is for reference; the app itself is available at https://xamanthalc.shinyapps.io/EDA1_Andean_perception_demo/?_ga=2.238320000.1658449803.1639090319-538650606.1637013074) shows that, in general, the data set reflects:

  • that both genders surveyed are evenly represented.

  • that more people from urban zones compared to rural zones were surveyed.

  • that most of the people in the survey identify as “Mestizo” (mixed race: of Spanish and indigenous descent).

Regarding schooling level, the plot below shows that:

  • most of the people surveyed finished high school (10 - 12).

  • there is also a relevant number of people who only finished elementary school (6).

Because this is a plot for univariate analysis, it cannot be determined whether this is the case for every country or throughout the years.

The summary statistics below complement the plot and show that:

  • the median schooling completed is 11 (high school).

  • 50% of people completed between 6 and 13 years of schooling (the interquartile range, IQR, from the 1st to the 3rd quartile).

  • the most common number of schooling years completed is 11 (the mode).

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    6.00   11.00   10.08   13.00   18.00     125 
[1] "Mode is 11"
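A minimal sketch of how such summary statistics can be obtained in base R is shown below. The vector `schooling` and the helper `stat_mode` are hypothetical; base R has no built-in function for the statistical mode, so one is defined here.

```r
# Hypothetical years of schooling; NA marks a missing answer.
schooling <- c(0, 6, 6, 9, 11, 11, 11, 12, 13, 16, 18, NA)

# Minimum, quartiles, median, mean, maximum, and NA count.
summary(schooling)

# First and third quartiles: the middle 50% of answers lie between them.
q <- quantile(schooling, probs = c(0.25, 0.75), na.rm = TRUE)

# Share of non-missing answers inside [Q1, Q3]; roughly one half,
# not exactly, because the data are discrete.
inside <- mean(schooling >= q[1] & schooling <= q[2], na.rm = TRUE)

# Count each value and return the most frequent one.
stat_mode <- function(x) {
  counts <- table(x)  # table() drops NA by default
  as.numeric(names(counts)[which.max(counts)])
}
schooling_mode <- stat_mode(schooling)
```

The quartiles bound the central half of the data, which is why the interval from the 1st to the 3rd quartile describes 50% of respondents rather than 75%.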

Main Topics Data

Because there were many variables unrelated to the demographics of the people surveyed, not all of them were subjected to univariate analysis. A few variables were picked for each topic outlined in the introduction (Big indicators: Economy, Education, Health / Crime / Corruption / Democracy) and are studied below.

The variables picked per topic were:

  • Big indicators: Economy, Education, Health:

    • country_main_problem: perception regarding the main problem the country is facing.

    • economy_compared_12: perception on whether the economy is better/worse/same than the previous year.

    • satisf_health_services: satisfaction with health services.

    • satisf_public_schools: satisfaction with the quality of education of public schools.

  • Crime:

    • victim_crime_12: whether a person had been victim of crime during the past 12 months.

    • times_victim_crime_12: how many times a person had been victim of crime during the past 12 months.

  • Corruption:

    • freq_corrup_public_off: perception on how common corruption is among public officials.

  • Democracy:

    • satisf_democracy_country: satisfaction with how democracy operates in their country.

Big Indicators

The tibbles and the bar chart below show that:

  • most of the people surveyed consider that the main problem their country is facing is unemployment, followed by economic problems, crime, poverty, and corruption (“No answer” is the second most frequent category).

  • electricity and human rights violations had the smallest number of people who considered them to be the most relevant issues.

# A tibble: 6 × 2
# Groups:   country_main_problem [6]
  country_main_problem        n
  <chr>                   <int>
1 Unemployment             4539
2 No answer                4507
3 Economy (problems with)  4043
4 Crime                    2907
5 Poverty                  2422
6 Corruption               2189
# A tibble: 5 × 2
# Groups:   country_main_problem [5]
  country_main_problem              n
  <chr>                         <int>
1 Electricity (lack of)            21
2 Human right violations           25
3 Transpotation (problems with)    39
4 External debt                    43
5 Land to farm (lack of)           44

The app below shows that (the picture is for reference; the app itself is available at https://xamanthalc.shinyapps.io/EDA1_big_topics/?_ga=2.235827713.1658449803.1639090319-538650606.1637013074):

  • most people believe that their country’s economy is performing the same as 12 months earlier.

  • more people believe that their country’s economy is doing worse than 12 months earlier than believe it is doing better.

  • there was a high non-response rate for the questions regarding health services and the quality of education in public schools.

  • generally, most people are somewhat satisfied with the quality of education in public schools. However, the table below provides deeper insight: when the people who are “very dissatisfied” and “dissatisfied” are added together (overall_dissatisfied) and compared against the people who are “somewhat satisfied” (kind_of_satisfied), the gap gets much smaller.

    ## # A tibble: 1 × 2
    ##   overall_dissatisfied kind_of_satisfied
    ##                  <int>             <int>
    ## 1                 4379              5990
  • by a small margin, more people are dissatisfied with health services in their countries than somewhat satisfied. As with quality of education, when the answers are grouped into overall_dissatisfied (“dissatisfied” and “very dissatisfied”) and overall_satisfied (“somewhat satisfied” and “very satisfied”), the number of people who are overall dissatisfied is still greater.

    ## # A tibble: 1 × 2
    ##   overall_dissatisfied overall_satisfied
    ##                  <int>             <int>
    ## 1                 6291              5153
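The grouping of answers into overall_dissatisfied and overall_satisfied can be sketched as below. The input vector is hypothetical, and the sketch assumes the answer labels are spelled as in the data preview (“Somewhat satisfied”, “Dissatisfied”, and so on) and that “No answer” is left out of both groups.

```r
# Hypothetical answers on the satisfaction scale used in the survey.
satisf_health_services <- c(
  "Very satisfied", "Somewhat satisfied", "Dissatisfied",
  "Very dissatisfied", "Dissatisfied", "No answer", "Somewhat satisfied"
)

# Group the four substantive answers into two broader categories.
dissatisfied_levels <- c("Dissatisfied", "Very dissatisfied")
satisfied_levels <- c("Somewhat satisfied", "Very satisfied")

# Count how many answers fall into each broader category.
overall <- data.frame(
  overall_dissatisfied = sum(satisf_health_services %in% dissatisfied_levels),
  overall_satisfied = sum(satisf_health_services %in% satisfied_levels)
)
overall
```

Using `%in%` instead of `==` avoids surprises with missing values, since `NA %in% x` is `FALSE` rather than `NA`.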

Crime

The table below shows that 4,482 people in the data set were victims of crime during the 12 months before the survey was taken. This means that:

  • about 15% of all observations (4,482 of 29,780) were victims of crime. Among those who answered “Yes” or “No,” the share is about 25% (4,482 of 17,822).

       No No answer       Yes 
    13340     11958      4482 
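The share of victims depends on the denominator. The short sketch below recomputes it from the counts in the table above, once over all observations and once over only those who answered “Yes” or “No” (the object names are assumptions).

```r
# Counts copied from the table above.
victim_counts <- c(No = 13340, "No answer" = 11958, Yes = 4482)

# Share of all observations, including "No answer".
share_all <- victim_counts["Yes"] / sum(victim_counts)

# Share among respondents who actually answered "Yes" or "No".
answered <- victim_counts[c("No", "Yes")]
share_answered <- victim_counts["Yes"] / sum(answered)

# Both shares as percentages with one decimal.
round(100 * c(all = unname(share_all), answered = unname(share_answered)), 1)
```

Because a large part of the “No answer” rows comes from years in which the question does not seem to have been asked, the second share is arguably the more informative one.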

Furthermore, most of the people who were victims of crime reported that:

  • during the past 12 months, they were victims of crime once (the median and the 1st quartile are both 1).
 times_victim_crime_12
 Min.   : 1.000       
 1st Qu.: 1.000       
 Median : 1.000       
 Mean   : 1.914       
 3rd Qu.: 2.000       
 Max.   :20.000       
 NA's   :44           

The plots below show the same information graphically.

Corruption

The bar chart below shows that:

  • most people think that corruption among public officials is very common.
  • the number of respondents increases with the perceived frequency of corruption: the more common corruption is said to be, the more people chose that answer.

Democracy

The bar chart below shows that:

  • most people are either somewhat satisfied or dissatisfied with how democracy works in their country.
  • there are more people who are very dissatisfied with how democracy operates in their country than people who are very satisfied.

Bivariate Analysis

For the bivariate analysis, one variable was selected for each of the main topics and other variables were plotted against that main variable. The selected variables were:

  • Big indicators: Economy, Education, Health:

    • country_main_problem: perception on what the main problem is that the country is facing.

  • Crime:

    • victim_crime_12: whether a person had been victim of a crime during the past 12 months.

  • Corruption:

    • administration_combats_corruption: extent to which a person perceives the current administration combats corruption (1 - 7, where 7 is “a lot”).

  • Democracy:

    • democracy_better: extent to which a person perceives that democracy is better than other forms of government (1 - 7, where 7 is “a lot”).

Big Indicators

The plot below shows that:

  • overall, most people considered economic problems and unemployment to be the main problems facing the country regardless of race, since these two columns have the lightest colors across rows.

  • the lightest square is the intersection of race “Black” and issue “Unemployment.” This means that Black people are proportionally more worried about unemployment than any other race. The hypothesis is that this results from systemic racism, which makes it harder for Black people to obtain jobs and therefore makes unemployment their main worry.
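Comparing races “proportionally” requires normalizing within each race, because the groups differ greatly in size. A minimal sketch of that normalization is shown below, on hypothetical data and under the assumption that the heat map is based on within-race proportions.

```r
# Hypothetical respondents: race and the main problem they named.
race_id <- c("Black", "Black", "Black", "Mestizo", "Mestizo", "Mestizo",
             "Mestizo", "White", "White")
country_main_problem <- c("Unemployment", "Unemployment", "Crime",
                          "Unemployment", "Crime", "Corruption",
                          "Economy (problems with)", "Crime", "Unemployment")

# Cross-tabulate counts: one row per race, one column per problem.
counts <- table(race_id, country_main_problem)

# margin = 1 divides each row by its row total, so every row sums to 1.
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

With row proportions, a small group such as “Black” can be compared directly with a large group such as “Mestizo.”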


The plot below shows that:

  • the issue that worries the greatest proportion of people in Colombia is violence, while in Ecuador and Bolivia it is the economy. In Peru, it seems to be unemployment.

  • even though there are common patterns when looking at particular issues across countries (e.g. economy), the graph is heterogeneous.


The plot below shows that:

  • the columns with the lightest colors are the ones that correspond to the economy and unemployment. This means that, across all educational levels, people are highly worried about these issues.

  • the poverty column gets darker as the educational level increases. This means that, at higher educational levels, a smaller proportion of people is worried about poverty. This makes sense given that their education makes them more employable, so they are at lower risk of losing their jobs and, thereby, their income.

  • the corruption column shows the opposite trend: it gets lighter at higher educational levels. This means that a greater proportion of people with higher educational backgrounds is worried about corruption compared to people with lower educational backgrounds. The hypothesis is that people with higher education have had the privilege of learning more about politics, so they have a broader understanding of how corruption can negatively impact a country and, therefore, are more worried about it.


Crime

The app below (the picture is for reference; the app itself is available at https://xamanthalc.shinyapps.io/EDA1_pie_chart/?_ga=2.239497859.1658449803.1639090319-538650606.1637013074) shows different groups within the people who answered “Yes” to having been victims of crime during the past year. For the questions where the legend goes from 1 to 7, the values represent extent of agreement, where 1 is “A little” and 7 is “A lot.” Some highlights are that:

  • most of the victims of crime were from Peru.

  • most of the victims were male.

  • the pie chart that shows the breakdown by year only has 2010, 2012, and 2014 as options. Since the three years appear in roughly even proportions, it is likely that crime incidents remained about the same during that period. Regarding the absence of 2006 and 2008, the hypothesis was that the question was not included in the survey until 2010. The check is shown below. As can be seen, for the answers “Yes” and “No,” the only years present are 2010, 2012, and 2014. Even though 2006 and 2008 appear under “No answer,” these are most likely NA values that became “No answer” when recoding. Thus, the data support the hypothesis.

                   year
    victim_crime_12 2010 2012 2014
                Yes 1484 1531 1467
                   year
    victim_crime_12 2010 2012 2014
                 No 4462 4409 4469
                   year
    victim_crime_12 2006 2008 2010 2012 2014
          No answer 5956 5956   10   16   20
  • most of the victims have little or no trust that the justice system in their countries would punish the guilty. A possible explanation is that this perception is shaped by past experiences in which they were victims of crime and did not find justice.

  • about half of the people think that it is not justified for the military to take over when crime is too high; however, most people think that the AAFF (Armed Forces) should combat crime. This is interesting because it shows that, even though people reject the military being in power (as in a military coup d’état), they still think that it would be beneficial for it to take on some of the national police’s duties.

  • about half of the people are in the lower end (1, 2, 3) of agreement regarding to what extent they trust the national police.
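The year check described in the highlights above (whether “Yes” and “No” answers exist for each survey year) can be sketched with a simple cross-tabulation. The data below are hypothetical and only mimic the pattern reported in the tables.

```r
# Hypothetical rows: 2006 and 2008 carry only "No answer".
year <- c(2006, 2006, 2008, 2010, 2010, 2012, 2012, 2014, 2014, 2014)
victim_crime_12 <- c("No answer", "No answer", "No answer", "Yes", "No",
                     "No", "Yes", "No", "No answer", "Yes")

# Count every answer in every survey year.
by_year <- table(victim_crime_12, year)
by_year

# Years in which at least one substantive ("Yes" or "No") answer exists.
asked <- colSums(by_year[c("Yes", "No"), , drop = FALSE]) > 0
names(asked)[asked]
```

A year whose column contains only “No answer” is consistent with the question not having been asked that year, although the table alone cannot rule out other explanations.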

Corruption

The boxplot below shows that:

  • As the perceived frequency of corruption among public officials increases, the median extent of agreement that the current administration combats corruption decreases.

The graph below shows that the lightest squares are both the ones where two 7’s intersect and where 1’s intersect. This means that:

  • Most of the people who trust political parties very little (1) highly disagree with the view that the current administration combats corruption (1).

  • Most of the people who trust political parties a lot (7) highly agree with the view that the current administration combats corruption (7).

The hypothesis was that if a person perceives that the current administration is intentionally combating corruption, their trust in political parties increases because they deem them to be more honest. This hypothesis implies causation, which cannot be proved with these data. Correlation, however, could be computed and is shown below. It was:

[1] 0.3043626
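A correlation coefficient like the one above can be computed with `cor()`. The sketch below uses hypothetical answers and assumes that missing answers are dropped pairwise; whether the original value is a Pearson or a rank correlation is not stated, so both are shown.

```r
# Hypothetical 1-7 answers; NA marks a missing response.
trust_political_parties <- c(1, 2, 2, 4, 5, 7, NA, 6, 3, 7)
administration_combats_corruption <- c(1, 1, 3, 4, 4, 7, 5, NA, 2, 6)

# Pearson correlation on rows where both answers are present.
r_pearson <- cor(trust_political_parties,
                 administration_combats_corruption,
                 use = "complete.obs")

# Spearman's rank correlation is a common alternative for ordinal scales.
r_spearman <- cor(trust_political_parties,
                  administration_combats_corruption,
                  use = "complete.obs", method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Without the `use` argument, a single NA in either vector makes `cor()` return NA.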

The boxplot below has a step-like shape. The main takeaways are that:

  • The degree of respect for political institutions and the extent of agreement on whether the current administration combats corruption have a positive relationship.

  • More respect for political institutions corresponds not only to a higher median of agreement that the current administration combats corruption, but also to a box (the interquartile range, IQR) that sits higher on the scale. This means that the middle 50% of observations in each group show higher agreement that the current administration combats corruption than the middle 50% of observations in a group with a lower degree of respect for political institutions.

Similarly to before, since causation cannot be tested, a correlation coefficient between both variables was calculated. The correlation was:

[1] 0.2649056

Democracy

The first plot is a map of the Andean region showing the median and mean agreement with the statement that democracy is better than other forms of government despite its issues. Before making the plot, both metrics had to be computed per country; they are shown below.

# A tibble: 4 × 2
# Groups:   country [4]
  country  median
  <chr>     <dbl>
1 Colombia      6
2 Ecuador       5
3 Bolivia       5
4 Peru          5
# A tibble: 4 × 2
# Groups:   country [4]
  country   mean
  <chr>    <dbl>
1 Colombia  5.28
2 Ecuador   4.98
3 Bolivia   5.04
4 Peru      4.74
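Per-country medians and means like those above can be obtained with a grouped summary. The sketch below is a base R version on hypothetical data (the original output suggests a grouped tibble, so the actual code probably differs).

```r
# Hypothetical answers on the 1-7 scale; NA marks a missing response.
survey <- data.frame(
  country = c("Colombia", "Colombia", "Colombia", "Peru", "Peru", "Peru"),
  democracy_better = c(6, 7, NA, 5, 4, 6)
)

# Apply each statistic within every country, ignoring missing answers.
med <- tapply(survey$democracy_better, survey$country, median, na.rm = TRUE)
avg <- tapply(survey$democracy_better, survey$country, mean, na.rm = TRUE)

# Combine both statistics into one table, one row per country.
by_country <- data.frame(
  country = names(med),
  median = as.numeric(med),
  mean = as.numeric(avg)
)
by_country
```

Reporting both statistics is useful on a bounded ordinal scale, where the median moves in whole steps while the mean reflects smaller shifts.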

The maps below show that:

  • the country with the highest median and mean is Colombia. However, given that the other countries are less than one scale point away from Colombia, the difference is small.

  • in all countries in the Andean community, people tend to agree that democracy is the best form of government (medians of 5 or 6).

The plot below shows that:

  • the trend was almost the same regardless of salary: most people agreed that democracy is the best form of government.

  • a slightly greater proportion of people in the group with the best salaries think that democracy is the best form of government compared to the other salary groups. Nonetheless, the difference was small.

A similar plot was produced for educational background; however, the visualization was hard to read and the trend was the same for all educational backgrounds: most people thought that democracy was better than other forms of government. Yet, there was one interesting insight in the NA column. The plot below shows that:

  • Out of all people who did not answer whether democracy was the best form of government, most were from lower educational backgrounds.

The insight is very powerful because it suggests that the lack of a sufficient educational background keeps people from forming any opinion at all. This is concerning because it means that a lack of education may lead people not only to wrong stands, but to no stands at all. This is dangerous to democracy because it means that people with low education do not have the same decision power as their counterparts, which can negatively affect political elections, referendums, etc.
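One way to probe the non-response finding further is to compare the rate of missing answers within each educational level, which also accounts for how many people each level contains. The sketch below uses hypothetical data, and the cut points for the levels are an assumption.

```r
# Hypothetical data: years of schooling and the 1-7 democracy answer.
schooling_completed <- c(2, 4, 5, 6, 6, 11, 11, 12, 16, 17)
democracy_better <- c(NA, NA, 5, NA, 6, 7, 5, NA, 6, 7)

# Bucket schooling years into broad levels (cut points are an assumption).
level <- cut(schooling_completed,
             breaks = c(-Inf, 6, 12, Inf),
             labels = c("Elementary or less", "High school", "Higher"))

# Share of missing democracy answers inside each level.
na_rate <- tapply(is.na(democracy_better), level, mean)
round(na_rate, 2)
```

If the non-response rate, and not only the count, is higher at lower levels, the pattern cannot be explained by group size alone.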

Space-Temporal Analysis

Even though there are a few bivariate plots, most of the plots are multivariate to analyze how a variable “X” changes by country across years.

Demographic Data

Gender

The bar plot below shows that over the years and across countries, the observations have been collected evenly from both genders considered in the survey. This is beneficial because it reduces gender bias when analyzing the data.

Race

The plots below showcase both similarities and differences regarding the race of the people surveyed. Similarities:

  • In each country, the race proportions of people surveyed have remained about the same throughout the years.

  • In all countries and years, the main race group was “Mestizo” (mixed race: of Spanish and indigenous descent).

Differences: Most differences come from comparing across countries.

  • Bolivia has very few observations coming from people who identify as Black or Mulatto compared to other countries.

  • Colombia has a much greater proportion of surveys coming from people who identify as White compared to other countries.

School Level

The histogram (bivariate) shows that, across the years, most of the people surveyed completed 11-12 years of schooling, meaning that they finished high school. Colombia and Ecuador have bimodal plots, with modes at 5-6 (finished elementary school) and at 11-12 schooling years (finished high school).

Additionally, the tibble and scatterplot below show how the median schooling years have changed throughout the years in the Andean community. They show that Ecuador and Colombia have had a positive trend in schooling years, which may be explained by a greater emphasis on education, given that they had many people who had only finished elementary school (unlike Peru and Bolivia, whose distributions were more even).

# A tibble: 4 × 6
# Groups:   country [4]
  country  `2006` `2008` `2010` `2012` `2014`
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Bolivia      12     11     12     10     12
2 Colombia      9     10     11     11     11
3 Ecuador      11     11     11     12     12
4 Peru         11     11     11     11     11
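A country-by-year table of medians like the one above can be sketched with a two-way `tapply()`. The data are hypothetical, and the original (a tibble) was presumably produced with a different workflow.

```r
# Hypothetical schooling years for two countries and two survey years.
survey <- data.frame(
  country = c("Colombia", "Colombia", "Colombia", "Colombia",
              "Peru", "Peru", "Peru", "Peru"),
  year = c(2006, 2006, 2014, 2014, 2006, 2006, 2014, 2014),
  schooling_completed = c(8, 10, 11, NA, 11, 11, 10, 12)
)

# One row per country, one column per year; NA answers are ignored.
median_schooling <- tapply(
  survey$schooling_completed,
  list(country = survey$country, year = survey$year),
  median, na.rm = TRUE
)
median_schooling
```

Reading across a row shows the trend within one country; reading down a column compares countries in the same survey year.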

Urban/Rural

The bar plot below shows that, across years and all countries in the Andean Community, most of the observations were recorded in urban areas. It can also be seen that, as years passed, fewer observations came from rural areas. This is consistent with the broader demographic shift in these countries. For example, in Peru, the percentage of the population in rural zones was 29% in 1990; nowadays, it is 21% (http://cuentame.inegi.org.mx/poblacion/rur_urb.aspx?tema=P).

Main Topics

Big Indicators

The plot below shows that:

  • Across years and countries, most people thought that their personal economy had remained the same.

  • In Peru and Ecuador, the number of people who thought that their personal economy was better compared to the past year had a positive trend throughout time. On the other hand, the number of people who thought that their personal economy was worse compared to the past year had a negative trend. This could hint at an improvement in the economy of Peruvians and Ecuadorians.

  • Bolivia showed the same trend from 2012 onward.

  • In Colombia, from 2010 onward, more people perceived their economy to be better than the past year than perceived it to be worse. There had been many fluctuations throughout time, so it was difficult to spot a trend.

Crime

Given that the sample size for each combination of country and year was the same across the data set (enforced during the data tidying process), it is fair to say that the plot below shows that:

  • In all countries but Colombia, the number of people who considered crime to be the main issue facing the country increased throughout the years.

  • Peruvians were more concerned about crime as the main issue facing the country than other nationalities in the Andean community.
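Counts are comparable across groups only when the groups have the same size, as stated above. A sketch that stays valid even without that guarantee computes the share of respondents naming crime within each country and year (hypothetical data; the label “Crime” is taken from the tables earlier in the document).

```r
# Hypothetical answers for two countries and two survey years.
survey <- data.frame(
  country = c("Peru", "Peru", "Peru", "Peru",
              "Ecuador", "Ecuador", "Ecuador", "Ecuador"),
  year = c(2006, 2006, 2014, 2014, 2006, 2006, 2014, 2014),
  country_main_problem = c("Unemployment", "Crime", "Crime", "Crime",
                           "Corruption", "Unemployment", "Crime", "Poverty"),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the share of TRUE values, so this
# gives the fraction naming crime in each country-year cell.
crime_share <- tapply(
  survey$country_main_problem == "Crime",
  list(country = survey$country, year = survey$year),
  mean
)
crime_share
```

With equal cell sizes, the shares and the raw counts lead to the same ranking; the shares simply make that assumption unnecessary.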

Corruption

The bar chart below shows that:

  • Peru and Bolivia were the countries where bribe requests from the police were the most common within the Andean community.

  • The number of bribe requests from the police remained about the same from 2006 to 2014.

Democracy

The density plots below show that:

  • Colombia’s density plot had the most mass on the right-hand (high) side of the scale throughout the years (that is, the most left-skewed distribution), which means that a greater proportion of people in Colombia had higher respect for political institutions relative to the rest of the Andean community.

  • Ecuador’s density plots shift further toward the right-hand (high) side of the scale over the years. This means that, as time passes, the proportion of people with higher respect for political institutions is growing and the proportion of people with lower respect is decreasing.
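The direction of skew can also be checked numerically with the sample skewness: a negative value indicates a longer tail toward low scores (mass at the high end of the scale), and a positive value indicates the opposite. The sketch below defines a hypothetical helper `skewness` on hypothetical answers, since base R does not ship one.

```r
# Hypothetical 1-7 answers with most of the mass at the high end.
respect_political_institutions <- c(7, 7, 7, 6, 6, 6, 5, 5, 4, 2, 1)

# Moment-based sample skewness: mean cubed deviation over cubed SD.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population-style SD (divides by n)
  mean((x - m)^3) / s^3
}

skew_value <- skewness(respect_political_institutions)
skew_value
```

Mirroring the scale (`8 - x` on a 1-7 scale) flips the sign of the statistic, which is a quick sanity check on the helper.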

Multivariate Analysis

The radial plots below show that:

  • Most people are somewhat satisfied with the quality of education in public schools.

  • Race did not appear to be a factor affecting people’s satisfaction with the quality of education in public schools.

  • Most of the people who are dissatisfied or very dissatisfied with the quality of education in public schools are those who at least went to high school. The hypotheses for this were:

    • this group of people had expectations regarding high school education that were not met.

    • some of these people went to private schools, so they have a better baseline to compare quality of education. Thus, they deemed it low.

    • as people who most likely finished high school, they went through all the basic educational experience, so they had more instances to perceive the quality of education as bad.

The animation below shows that:

  • the degree of pride of living under the current political system was correlated with the degree of agreement on whether the current administration imposes safety.

  • As time passed, Ecuador showed a positive trend on both views: its median agreement that the administration imposes safety and its median pride about living under the current political system both increased.

  • Colombia had the opposite trend: their median agreement that the administration imposes safety and their median pride about living under the current political system decreased over time.

  • Peru and Bolivia’s degree of agreement on whether the administration imposes safety fluctuated and did not show an explicit trend.

The correlation plot below gave a large amount of information due to the many variables it holds. Some of the most interesting takeaways were:

  • the rating of the president’s performance is strongly negatively correlated with trust in the president. Even though this is logical and obvious, the strength of the correlation made the statement stand out.

  • the variables regarding the work that the current administration is performing in different areas were strongly correlated with each other. For example, a person who thought that the administration did a good job combating corruption was likely to also think that the administration did a good job at imposing safety.

Conclusion

Regarding the most important question of the study (what is the main issue facing the country right now?), most people in the Andean community were concerned with economic problems and unemployment. This is curious given that, according to World Bank data (https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?end=2014&locations=ZJ&start=2004), GDP per capita increased and unemployment fell from 2006 to 2014. Together with the fact that, despite the gains in GDP per capita, most people perceived their personal economy to be the same (and in some cases, worse), this raises the question of whether macroeconomic gains such as GDP per capita are benefiting everyone in the system or only the higher classes and multinational corporations.

Additionally, both race and education completed emerged as the main demographic drivers of perceptions (or the lack of them) about issues in the Andean community. For example, Black people are proportionally more concerned with unemployment, and people with low educational backgrounds are more concerned with poverty. These findings raise concerns about the systemic barriers that exclude less-privileged people from being active agents in the economy, and about what policies are needed to reverse this.

Finally, crime is increasingly being considered the most important problem facing the country, which is likely related to the fact that about 15% of respondents had been victims of crime during the past 12 months across the 8 years of the survey. Taking into account that perceptions regarding crime are correlated with perceptions regarding politics, the increasing number of people concerned about crime must be understood as an urgent call to action for politicians to address safety concerns in the Andean region.

For further study, it would be relevant to expand this study to the whole Latin American region to identify patterns across economic blocs, geographical locations, etc., and to find out to what extent people’s perceptions are affected by the region they belong to.