ABSTRACT:

Many factors influence voting at the United Nations. In this analysis we explore the voting records and fit simple linear regression models to describe how each country’s share of “Yes” votes has changed over time, which lets us answer questions that are not obvious from the raw data.

1.1 Data Import and Analysis

We start by importing the data set and printing it to get an overall idea of the type of data we are dealing with. We then keep only the actual votes (yes, abstain, or no), add the year and country name for each vote, and compute the fraction of “Yes” votes overall, by year, and by country.

# Load dplyr for %>%, filter(), mutate(), and summarize()
library(dplyr)

# Read the votes dataset
votes <- readRDS("C:/Users/jkevi_000/Downloads/votes.rds")
# Print the votes dataset
print(votes)
## # A tibble: 508,929 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     9    31
##  4    46       2     1    40
##  5    46       2     1    41
##  6    46       2     1    42
##  7    46       2     9    51
##  8    46       2     9    52
##  9    46       2     9    53
## 10    46       2     9    54
## # ... with 508,919 more rows
# Filter for votes that are "yes", "abstain", or "no"
votes %>% filter(vote <= 3)
## # A tibble: 353,547 x 4
##     rcid session  vote ccode
##    <dbl>   <dbl> <dbl> <int>
##  1    46       2     1     2
##  2    46       2     1    20
##  3    46       2     1    40
##  4    46       2     1    41
##  5    46       2     1    42
##  6    46       2     1    70
##  7    46       2     1    90
##  8    46       2     1    91
##  9    46       2     1    92
## 10    46       2     1    93
## # ... with 353,537 more rows
# Add another %>% step to add a year column
votes %>%
  filter(vote <= 3) %>%
  mutate(year = 1945 + session)
## # A tibble: 353,547 x 5
##     rcid session  vote ccode  year
##    <dbl>   <dbl> <dbl> <int> <dbl>
##  1    46       2     1     2  1947
##  2    46       2     1    20  1947
##  3    46       2     1    40  1947
##  4    46       2     1    41  1947
##  5    46       2     1    42  1947
##  6    46       2     1    70  1947
##  7    46       2     1    90  1947
##  8    46       2     1    91  1947
##  9    46       2     1    92  1947
## 10    46       2     1    93  1947
## # ... with 353,537 more rows
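
The filter and mutate steps above can be tried on a small toy table. The sketch below uses base R only, and its five rows are invented for illustration. Reading codes 8 and 9 as "absent" and "not a member" is an assumption about the dataset, not something stated in this analysis.

```r
# Toy stand-in for `votes` (values invented for illustration)
toy_votes <- data.frame(
  rcid    = c(46, 46, 46, 46, 46),
  session = c(2, 2, 2, 4, 4),
  vote    = c(1, 9, 3, 2, 8),  # 1 = yes; 2 and 3 = abstain / no; 8 and 9 assumed non-votes
  ccode   = c(2, 31, 40, 2, 20)
)

# Keep only real votes (codes 1 to 3), as filter(vote <= 3) does
toy_kept <- toy_votes[toy_votes$vote <= 3, ]

# Derive the calendar year from the session number, as mutate() does
toy_kept$year <- 1945 + toy_kept$session

print(toy_kept)
```

Session 2 maps to 1947 here, matching the year column printed above.
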
# Load the countrycode package
library(countrycode)


# Convert country code 100
countrycode(100, "cown", "country.name")
## [1] "Colombia"
# Add a country column within the mutate: votes_processed
votes_processed <- votes %>%
  filter(vote <= 3) %>%
  mutate(year = session + 1945,
         country = countrycode(ccode, "cown", "country.name"))

# Print votes_processed
head(votes_processed)
## # A tibble: 6 x 6
##    rcid session  vote ccode  year                  country
##   <dbl>   <dbl> <dbl> <int> <dbl>                    <chr>
## 1    46       2     1     2  1947 United States of America
## 2    46       2     1    20  1947                   Canada
## 3    46       2     1    40  1947                     Cuba
## 4    46       2     1    41  1947                    Haiti
## 5    46       2     1    42  1947       Dominican Republic
## 6    46       2     1    70  1947                   Mexico
# Find total and fraction of "yes" votes
votes_processed %>% summarize(total = n(),
                              percent_yes = mean(vote == 1))
## # A tibble: 1 x 2
##    total percent_yes
##    <int>       <dbl>
## 1 353547   0.7999248
# Summarize by year
votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))
## # A tibble: 34 x 3
##     year total percent_yes
##    <dbl> <int>       <dbl>
##  1  1947  2039   0.5693968
##  2  1949  3469   0.4375901
##  3  1951  1434   0.5850767
##  4  1953  1537   0.6317502
##  5  1955  2169   0.6947902
##  6  1957  2708   0.6085672
##  7  1959  4326   0.5880721
##  8  1961  7482   0.5729751
##  9  1963  3308   0.7294438
## 10  1965  4382   0.7078959
## # ... with 24 more rows
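
The percent_yes column relies on the fact that the mean of a logical vector is the share of TRUE values. A minimal base R sketch of the same calculation, on invented toy data, with tapply() standing in for group_by() plus summarize():

```r
# Toy processed votes (invented values): two years, a few votes each
toy <- data.frame(
  year = c(1947, 1947, 1947, 1949, 1949),
  vote = c(1, 1, 3, 1, 2)
)

# vote == 1 is TRUE/FALSE; its mean is the fraction of "yes" votes
overall_yes <- mean(toy$vote == 1)

# Same calculation within each year
yes_by_year <- tapply(toy$vote == 1, toy$year, mean)

print(overall_yes)
print(yes_by_year)
```

Despite its name, percent_yes is therefore a fraction between 0 and 1 rather than a percentage.
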
# Summarize by country: by_country
by_country <- votes_processed %>%
  group_by(country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

# Print the by_country dataset
head(by_country)
## # A tibble: 6 x 3
##               country total percent_yes
##                 <chr> <int>       <dbl>
## 1         Afghanistan  2373   0.8592499
## 2             Albania  1695   0.7174041
## 3             Algeria  2213   0.8992318
## 4             Andorra   719   0.6383866
## 5              Angola  1431   0.9238295
## 6 Antigua and Barbuda  1302   0.9124424
# Sort in ascending order of percent_yes
by_country %>% arrange(percent_yes)
## # A tibble: 200 x 3
##                                                 country total percent_yes
##                                                   <chr> <int>       <dbl>
##  1                                             Zanzibar     2   0.0000000
##  2                             United States of America  2568   0.2694704
##  3                                                Palau   369   0.3387534
##  4                                               Israel  2380   0.3407563
##  5                          Federal Republic of Germany  1075   0.3972093
##  6 United Kingdom of Great Britain and Northern Ireland  2558   0.4167318
##  7                                               France  2527   0.4265928
##  8                     Micronesia (Federated States of)   724   0.4419890
##  9                                     Marshall Islands   757   0.4914135
## 10                                              Belgium  2568   0.4922118
## # ... with 190 more rows
# Now sort in descending order
by_country %>% arrange(desc(percent_yes))
## # A tibble: 200 x 3
##                  country total percent_yes
##                    <chr> <int>       <dbl>
##  1 Sao Tome and Principe  1091   0.9761687
##  2            Seychelles   881   0.9750284
##  3              Djibouti  1598   0.9612015
##  4         Guinea Bissau  1538   0.9603381
##  5           Timor-Leste   326   0.9570552
##  6             Mauritius  1831   0.9497542
##  7              Zimbabwe  1361   0.9493020
##  8               Comoros  1133   0.9470432
##  9  United Arab Emirates  1934   0.9467425
## 10            Mozambique  1701   0.9465021
## # ... with 190 more rows
# Filter out countries with fewer than 100 votes
by_country %>%
  arrange(percent_yes) %>%
  filter(total >= 100)
## # A tibble: 197 x 3
##                                                 country total percent_yes
##                                                   <chr> <int>       <dbl>
##  1                             United States of America  2568   0.2694704
##  2                                                Palau   369   0.3387534
##  3                                               Israel  2380   0.3407563
##  4                          Federal Republic of Germany  1075   0.3972093
##  5 United Kingdom of Great Britain and Northern Ireland  2558   0.4167318
##  6                                               France  2527   0.4265928
##  7                     Micronesia (Federated States of)   724   0.4419890
##  8                                     Marshall Islands   757   0.4914135
##  9                                              Belgium  2568   0.4922118
## 10                                               Canada  2576   0.5081522
## # ... with 187 more rows
# Define by_year
by_year <- votes_processed %>%
  group_by(year) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))

1.2 Data Visualization

This is the most important step of our data exploration. Let’s look at a few plots of voting trends, in particular the percentage of “Yes” votes over time, overall and for individual countries, and see what hypotheses they suggest.

# Load the ggplot2 package
library(ggplot2)

# Create line plot
ggplot(by_year, aes(x = year, y = percent_yes))+
  geom_line()

# Change to scatter plot and add smoothing curve
ggplot(by_year, aes(year, percent_yes)) +
  geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'
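
The message above says geom_smooth() picked LOESS (local regression) for the smoothing curve. As a rough sketch of what that smoother does, the block below calls stats::loess directly on invented toy data. The defaults used here are those of loess() and may not match the settings geom_smooth() applies internally.

```r
# Toy yearly series (invented values), one point every two years
toy_year <- data.frame(
  year        = seq(1947, 1985, by = 2),
  percent_yes = c(0.55, 0.45, 0.58, 0.62, 0.68, 0.60, 0.59, 0.57, 0.72, 0.70,
                  0.73, 0.75, 0.78, 0.80, 0.79, 0.83, 0.85, 0.86, 0.88, 0.87)
)

# Fit a local regression with the default span and degree
fit <- loess(percent_yes ~ year, data = toy_year)

# Smoothed values at the observed years
toy_year$smoothed <- predict(fit)

print(head(toy_year))
```
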

Now let’s turn our attention to voting trends in specific countries rather than the overall trend. We will start by looking at the trend for France, then compare a few other countries as well.

# Group by year and country: by_year_country
by_year_country <- votes_processed %>%
  group_by(year, country) %>%
  summarize(total = n(),
            percent_yes = mean(vote == 1))


# Print by_year_country
head(by_year_country)
## # A tibble: 6 x 4
## # Groups:   year [1]
##    year                          country total percent_yes
##   <dbl>                            <chr> <int>       <dbl>
## 1  1947                      Afghanistan    34   0.3823529
## 2  1947                        Argentina    38   0.5789474
## 3  1947                        Australia    38   0.5526316
## 4  1947                          Belarus    38   0.5000000
## 5  1947                          Belgium    38   0.6052632
## 6  1947 Bolivia (Plurinational State of)    37   0.5945946
# Create a filtered version: FR_by_year
FR_by_year <- by_year_country %>%
  filter(country == "France")

# Line plot of percent_yes over time for France only
ggplot(FR_by_year, aes(year, percent_yes)) +
  geom_line()

# Vector of four countries to examine
countries <- c("United States of America", "Cameroon",
               "France", "India")

# Filter by_year_country: filtered_4_countries
filtered_4_countries <- by_year_country %>%
  filter(country %in% countries)


# Line plot of % yes in four countries
ggplot(filtered_4_countries, aes(x = year, y = percent_yes, color = country)) +
  geom_line()

# Vector of six countries to examine
# (names must match the country column exactly)
countries <- c("United States of America", "Cameroon",
               "France", "Japan", "Brazil", "India")

# Filtered by_year_country: filtered_6_countries
filtered_6_countries <- by_year_country %>%
  filter(country %in% countries)

# Line plot of % yes over time faceted by country
ggplot(filtered_6_countries, aes(year, percent_yes)) +
  geom_line() +
  facet_wrap(~ country, scales = "free_y")

1.3 Predictive Modelling Using Linear Regression Models

We will use a linear regression model to examine how one variable changes with respect to another by fitting a line of best fit. In this case, the regression line describes the association between a country’s percentage of “Yes” votes and the year.

# Load the broom package
library(broom)

# Linear regression of percent_yes by year for US
US_by_year <- by_year_country %>%
  filter(country == "United States of America")
US_fit <- lm(percent_yes ~ year, US_by_year)
summary(US_fit)
## 
## Call:
## lm(formula = percent_yes ~ year, data = US_by_year)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.222491 -0.080635 -0.008661  0.081948  0.194307 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.6641455  1.8379743   6.890 8.48e-08 ***
## year        -0.0062393  0.0009282  -6.722 1.37e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1062 on 32 degrees of freedom
## Multiple R-squared:  0.5854, Adjusted R-squared:  0.5724 
## F-statistic: 45.18 on 1 and 32 DF,  p-value: 1.367e-07

The estimated slope of our model is about -0.006, meaning that the fraction of “Yes” votes cast by the United States has fallen by roughly 0.006 (0.6 percentage points) per year on average. We could dig deeper to find out why, but this data set alone cannot answer that question. To check whether the trend could be due to chance, we look at the p-value of the slope, 1.37 × 10^(-7), which is far below 0.05, so the downward trend is statistically significant at the 5% level.
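
To make the slope concrete, the sketch below plugs the coefficients from the summary output into the fitted line. The year 1947 is the first year in the data. The year 2013 is an illustrative later year chosen here, not one read from the output.

```r
# Coefficients copied from the summary(US_fit) output above
intercept <- 12.6641455
slope     <- -0.0062393

# Fitted share of "yes" votes at two years (2013 is an assumed example year)
fitted_1947 <- intercept + slope * 1947
fitted_2013 <- intercept + slope * 2013

# The fitted change over the span equals slope * number of years
change <- fitted_2013 - fitted_1947

print(c(fitted_1947 = fitted_1947, fitted_2013 = fitted_2013, change = change))
```

The slope is a change per year, so over several decades it adds up to a large drop in the fitted share of “Yes” votes.
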

# Create US_tidied 
US_tidied <- tidy(US_fit)
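
tidy() turns the coefficient table of a model into a data frame with one row per term. A base R approximation on invented toy data is sketched below. It only mimics the shape of the result with the same column names and is not how broom is implemented.

```r
# Toy yearly series (invented values)
toy <- data.frame(
  year        = c(1947, 1949, 1951, 1953, 1955, 1957),
  percent_yes = c(0.60, 0.55, 0.57, 0.50, 0.48, 0.45)
)
toy_fit <- lm(percent_yes ~ year, data = toy)

# coef(summary(fit)) is a matrix with one row per term
coef_matrix <- coef(summary(toy_fit))

# Reshape it into a data frame with tidy()-style column names
toy_tidied <- data.frame(
  term      = rownames(coef_matrix),
  estimate  = coef_matrix[, "Estimate"],
  std.error = coef_matrix[, "Std. Error"],
  statistic = coef_matrix[, "t value"],
  p.value   = coef_matrix[, "Pr(>|t|)"],
  row.names = NULL
)

print(toy_tidied)
```

Having the coefficients in a data frame is what makes it possible to stack and filter results from many models in the next section.
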

1.4 Deploying Linear Regression Models for Multiple Countries

Fitting a linear regression to each country by hand would take forever. Instead, let’s fit one model per country in a single pipeline.

# Load the tidyr package
library(tidyr)

# Nest all columns besides country
nested <- by_year_country %>%
  group_by(country) %>%
  nest()


# Unnest the data column to return it to its original form
unnest(nested, data)
## # A tibble: 4,744 x 4
##        country  year total percent_yes
##          <chr> <dbl> <int>       <dbl>
##  1 Afghanistan  1947    34   0.3823529
##  2 Afghanistan  1949    51   0.6078431
##  3 Afghanistan  1951    25   0.7600000
##  4 Afghanistan  1953    26   0.7692308
##  5 Afghanistan  1955    37   0.7297297
##  6 Afghanistan  1957    34   0.5294118
##  7 Afghanistan  1959    54   0.6111111
##  8 Afghanistan  1961    76   0.6052632
##  9 Afghanistan  1963    32   0.7812500
## 10 Afghanistan  1965    40   0.8500000
## # ... with 4,734 more rows
# Perform a linear regression on each nested data frame
# Load purrr for map()
library(purrr)

# Fit a linear model to each item in the data column
nested %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)))
## # A tibble: 200 x 3
##                             country              data    model
##                               <chr>            <list>   <list>
##  1                      Afghanistan <tibble [34 x 3]> <S3: lm>
##  2                        Argentina <tibble [34 x 3]> <S3: lm>
##  3                        Australia <tibble [34 x 3]> <S3: lm>
##  4                          Belarus <tibble [34 x 3]> <S3: lm>
##  5                          Belgium <tibble [34 x 3]> <S3: lm>
##  6 Bolivia (Plurinational State of) <tibble [34 x 3]> <S3: lm>
##  7                           Brazil <tibble [34 x 3]> <S3: lm>
##  8                           Canada <tibble [34 x 3]> <S3: lm>
##  9                            Chile <tibble [34 x 3]> <S3: lm>
## 10                         Colombia <tibble [34 x 3]> <S3: lm>
## # ... with 190 more rows
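
The nest-then-map pattern above amounts to "split the data by country, fit one model per piece". A base R sketch of the same idea on invented toy data, with hypothetical countries "A" and "B":

```r
# Toy version of by_year_country (invented values, two hypothetical countries)
toy <- data.frame(
  country     = rep(c("A", "B"), each = 4),
  year        = rep(c(1947, 1949, 1951, 1953), times = 2),
  percent_yes = c(0.50, 0.55, 0.60, 0.70,   # A trends up
                  0.80, 0.70, 0.65, 0.60)   # B trends down
)

# One data frame per country (the role played by nest())
by_country_list <- split(toy, toy$country)

# One lm fit per data frame (the role played by map())
models <- lapply(by_country_list, function(d) lm(percent_yes ~ year, data = d))

# Pull the slope on year out of each model
slopes <- sapply(models, function(m) coef(m)[["year"]])

print(slopes)
```

The tidyverse version keeps the data, the models, and later the tidied coefficients side by side in one table, which is convenient for the filtering steps that follow.
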
# Add another mutate that applies tidy() to each model
nested %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .))) %>%
  mutate(tidied = map(model, tidy))
## # A tibble: 200 x 4
##                             country              data    model
##                               <chr>            <list>   <list>
##  1                      Afghanistan <tibble [34 x 3]> <S3: lm>
##  2                        Argentina <tibble [34 x 3]> <S3: lm>
##  3                        Australia <tibble [34 x 3]> <S3: lm>
##  4                          Belarus <tibble [34 x 3]> <S3: lm>
##  5                          Belgium <tibble [34 x 3]> <S3: lm>
##  6 Bolivia (Plurinational State of) <tibble [34 x 3]> <S3: lm>
##  7                           Brazil <tibble [34 x 3]> <S3: lm>
##  8                           Canada <tibble [34 x 3]> <S3: lm>
##  9                            Chile <tibble [34 x 3]> <S3: lm>
## 10                         Colombia <tibble [34 x 3]> <S3: lm>
## # ... with 190 more rows, and 1 more variables: tidied <list>
# Add one more step that unnests the tidied column
country_coefficients <- nested %>%
  mutate(model = map(data, ~ lm(percent_yes ~ year, data = .)),
         tidied = map(model, tidy)) %>%
  unnest(tidied)

# Print the resulting country_coefficients variable
head(country_coefficients)
## # A tibble: 6 x 6
##       country        term      estimate    std.error statistic
##         <chr>       <chr>         <dbl>        <dbl>     <dbl>
## 1 Afghanistan (Intercept) -11.063084650 1.4705189228 -7.523252
## 2 Afghanistan        year   0.006009299 0.0007426499  8.091698
## 3   Argentina (Intercept)  -9.464512565 2.1008982371 -4.504984
## 4   Argentina        year   0.005148829 0.0010610076  4.852773
## 5   Australia (Intercept)  -4.545492536 2.1479916283 -2.116159
## 6   Australia        year   0.002567161 0.0010847910  2.366503
## # ... with 1 more variables: p.value <dbl>
# Filter for only the slope terms
country_coefficients %>% filter(term == "year")
## # A tibble: 199 x 6
##                             country  term    estimate    std.error
##                               <chr> <chr>       <dbl>        <dbl>
##  1                      Afghanistan  year 0.006009299 0.0007426499
##  2                        Argentina  year 0.005148829 0.0010610076
##  3                        Australia  year 0.002567161 0.0010847910
##  4                          Belarus  year 0.003907557 0.0007587624
##  5                          Belgium  year 0.003203234 0.0007652852
##  6 Bolivia (Plurinational State of)  year 0.005802864 0.0009657515
##  7                           Brazil  year 0.006107151 0.0008167736
##  8                           Canada  year 0.001515867 0.0009552118
##  9                            Chile  year 0.006775560 0.0008220463
## 10                         Colombia  year 0.006157755 0.0009645084
## # ... with 189 more rows, and 2 more variables: statistic <dbl>,
## #   p.value <dbl>

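Because we are now testing 199 slopes at once, some will look significant by chance alone. We therefore adjust the p-values for multiple testing with p.adjust(), which uses the Holm method by default, and keep only the countries whose adjusted p-value is below 0.05.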

# Filter for only the slope terms
slope_terms <- country_coefficients %>%
  filter(term == "year")

# Add a p.adjusted column (Holm correction, the p.adjust() default)
slope_terms %>% mutate(p.adjusted = p.adjust(p.value))
## # A tibble: 199 x 7
##                             country  term    estimate    std.error
##                               <chr> <chr>       <dbl>        <dbl>
##  1                      Afghanistan  year 0.006009299 0.0007426499
##  2                        Argentina  year 0.005148829 0.0010610076
##  3                        Australia  year 0.002567161 0.0010847910
##  4                          Belarus  year 0.003907557 0.0007587624
##  5                          Belgium  year 0.003203234 0.0007652852
##  6 Bolivia (Plurinational State of)  year 0.005802864 0.0009657515
##  7                           Brazil  year 0.006107151 0.0008167736
##  8                           Canada  year 0.001515867 0.0009552118
##  9                            Chile  year 0.006775560 0.0008220463
## 10                         Colombia  year 0.006157755 0.0009645084
## # ... with 189 more rows, and 3 more variables: statistic <dbl>,
## #   p.value <dbl>, p.adjusted <dbl>
# Filter by adjusted p-values
filtered_countries <- country_coefficients %>%
  filter(term == "year") %>%
  mutate(p.adjusted = p.adjust(p.value)) %>%
  filter(p.adjusted < .05)
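
To see what the adjustment does, here is a small sketch with four invented p-values. p.adjust() is called with its default method (Holm), and the same numbers are recomputed by hand for comparison.

```r
# Four raw p-values (invented), each below 0.05 on its own
p_raw <- c(0.01, 0.02, 0.03, 0.04)

# p.adjust() uses the Holm method unless told otherwise
p_holm <- p.adjust(p_raw)

# Holm by hand for already-sorted p-values: multiply the i-th smallest
# by (n - i + 1), take a running maximum, and cap at 1
n <- length(p_raw)
p_manual <- pmin(1, cummax((n - seq_len(n) + 1) * p_raw))

print(p_holm)
print(sum(p_holm < 0.05))
```

All four raw p-values pass the 0.05 threshold, but only the smallest one survives the correction, which is why filtering on p.adjusted is stricter than filtering on p.value.
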

Now let’s look at the countries where the percentage of “Yes” votes increases the fastest.

# Sort for the countries increasing most quickly
filtered_countries %>% arrange(desc(estimate))
## # A tibble: 61 x 7
##                country  term    estimate    std.error statistic
##                  <chr> <chr>       <dbl>        <dbl>     <dbl>
##  1        South Africa  year 0.011858333 0.0014003768  8.467959
##  2          Kazakhstan  year 0.010955741 0.0019482401  5.623404
##  3 Yemen Arab Republic  year 0.010854882 0.0015869058  6.840281
##  4          Kyrgyzstan  year 0.009725462 0.0009884060  9.839541
##  5              Malawi  year 0.009084873 0.0018111087  5.016194
##  6  Dominican Republic  year 0.008055482 0.0009138578  8.814809
##  7            Portugal  year 0.008020046 0.0017124482  4.683380
##  8            Honduras  year 0.007717977 0.0009214260  8.376123
##  9                Peru  year 0.007299813 0.0009764019  7.476238
## 10           Nicaragua  year 0.007075848 0.0010716402  6.602820
## # ... with 51 more rows, and 2 more variables: p.value <dbl>,
## #   p.adjusted <dbl>

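Now let’s look at the countries where the percentage of “Yes” votes decreases the fastest. Only the first three rows below (Republic of Korea, Israel, and the United States of America) have negative slopes; every other significant trend is upward.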

# Sort for the countries decreasing most quickly
filtered_countries %>% arrange(estimate)
## # A tibble: 61 x 7
##                       country  term     estimate    std.error statistic
##                         <chr> <chr>        <dbl>        <dbl>     <dbl>
##  1          Republic of Korea  year -0.009209912 0.0015453128 -5.959901
##  2                     Israel  year -0.006852921 0.0011718657 -5.847873
##  3   United States of America  year -0.006239305 0.0009282243 -6.721764
##  4                    Belgium  year  0.003203234 0.0007652852  4.185673
##  5                     Guinea  year  0.003621508 0.0008326598  4.349325
##  6                    Morocco  year  0.003798641 0.0008603064  4.415451
##  7                    Belarus  year  0.003907557 0.0007587624  5.149908
##  8 Iran (Islamic Republic of)  year  0.003911100 0.0008558952  4.569602
##  9                      Congo  year  0.003967778 0.0009220262  4.303324
## 10                      Sudan  year  0.003989394 0.0009613894  4.149613
## # ... with 51 more rows, and 2 more variables: p.value <dbl>,
## #   p.adjusted <dbl>

2- Conclusion

Using simple linear regression models, we identified statistically significant trends in how countries vote at the United Nations. The analysis could be extended by merging in data sets describing the factors that might push a country to vote “Yes”, and by using more advanced machine learning methods for more accurate trend analysis.