Background

You will create an R Markdown file that produces an HTML file with your results for your assigned country. Please submit both your .Rmd and your .html file. There will also be a short follow-up quiz in class after the assignment is due that covers basic concepts covered in this assignment. The quiz is pen-and-paper (no notes).

Preparation

  1. Save the data file (“World_Values_subset_Wave_7.csv”) into a subfolder called Data in the folder where you put this .Rmd file
  2. [If you have not already done so] Create a .Rproj file in the folder where you put this .Rmd syntax file. To do so, go to File/New Project in RStudio and browse to this folder.
  3. The codebook for the World Values Survey data is available at https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp
  4. You will be assigned a country in class. In this assignment, you will compare respondents in your assigned country to respondents in the U.S.

1) Load the World Values Survey data into a data object called “wv”. Use the code I have included here; I also do some data cleaning for missing values. Display the first 4 rows and 6 columns of this object to make sure the data loaded properly. Please do not display more than 6 columns (to avoid messiness/length).

wv[1:4,1:6]
##   X   country A_YEAR A_STUDY B_COUNTRY B_COUNTRY_ALPHA
## 1 1 Australia   2018       2        36             AUS
## 2 2 Australia   2018       2        36             AUS
## 3 3 Australia   2018       2        36             AUS
## 4 4 Australia   2018       2        36             AUS

2) How many rows and columns are there in the “wv” dataset?

dim(wv)
## [1] 60735   349

ANSWER: 60735 Rows, 349 Columns

3) Create a new data object that contains only the data for the country you have been assigned. How big is this data object?

bol <- wv[wv$country =='Bolivia', ]
dim(bol)
## [1] 2067  349

ANSWER: 2067 Observations (rows)

4) Create a new data object that contains only the data for the United States. (Be sure it has a different name than the others.) How big is this data object?

us <- wv[wv$country =='United States', ]
dim(us)
## [1] 2596  349

ANSWER: 2596 Observations (rows)

5) Demographics: Calculate raw percentages for gender and (separately) education in each country (using the 3-category version of education). Assuming a 50-50 split by gender, briefly discuss what weights would be if we only used gender to weight in each country.

# Distribution of gender in each country
df1 <- table(bol$ed3.factor)
df2 <- table(us$ed3.factor)
cbind("bolivia"=df1, "US"=df2)
##        bolivia   US
## low        709   45
## medium     633 1200
## high       719 1317
# Weights for gender in each country (estimate)
targetgender <- c(male = .5, female = .5)
#Bolivia weights by gender
round(prop.table(table(bol$gender.factor)) / targetgender, digits = 3)
## 
##   male female 
##  0.991  1.009
#US weights by gender
round(prop.table(table(us$gender.factor)) / targetgender, digits = 3)
## 
##   male female 
##  1.071  0.929
# Compare the distribution of education in each country (using the 3 category version of education)

6) Q49 is a life satisfaction variable. Use the barplot command to create side-by-side histograms of life satisfaction for the country you have been assigned and the United States. Make the figure more appealing to the reader (e.g., a clear title). (Discuss this figure in your answer to the next problem.)

# US 
us_counts <- data.frame(table(us$Q49)) %>%
  rename(score = Var1, count = Freq) %>%
  mutate(perc = count / sum(count) * 100)

# Bolivia 
bol_counts <- data.frame(table(bol$Q49)) %>%
  rename(score = Var1, count = Freq) %>%
  mutate(perc = count / sum(count) * 100)
#combined
combined <- c(us_counts, bol_counts, by = "score", suffix = c("_US", "_Bol"))

#Plot
perc_matrix <- rbind(us_counts$perc, bol_counts$perc)
rownames(perc_matrix) <- c("US", "Bolivia")
barplot(perc_matrix,
        beside = TRUE,           
        col = c("blue", "yellow"),
        names.arg = 1:10,       
        legend.text = TRUE,
        args.legend = list(x = "topright"),
        main = "Life Satisfaction: US vs Bolivia",
        xlab = "Score (1–10)",
        ylab = "Percentage of Observations")

7) Discuss similarities/differences in life satisfaction in the two countries.

DISCUSSION: Life satisfaction for the United States is distributed relatively evenly around 8, but Bolivia has a strikingly larger amount of 10s reported, relative to sample size. The United States has a larger proportion reporting 9s, 8s, and 7s, whereas Bolivia has a larger proportion reporting 5s and 6s.

8) Q164 is “How important is God in your life?” from 0 (not at all important) to 10 (very important). Create side-by-side histograms (using the barplot command) of responses to this question for your assigned country and for the U.S. (after dealing with missing data). Discuss similarities and differences.

#US counts without NAs
us_countsgod <- us %>%
  filter(!is.na(Q164)) %>%           
  count(Q164) %>%                    
  mutate(perc = n / sum(n) * 100)   

# Bolivia counts without NAs
bol_countsgod <- bol %>%
  filter(!is.na(Q164)) %>%
  count(Q164) %>%
  mutate(perc = n / sum(n) * 100)
#Combined
combinedgod <- c(us_countsgod, bol_countsgod, by = "score", suffix = c("_US", "_Bol"))
#plot
perc_matrixgod <- rbind(us_countsgod$perc, bol_countsgod$perc)
rownames(perc_matrixgod) <- c("US", "Bolivia")
barplot(perc_matrixgod,
        beside = TRUE,           
        col = c("#FF70E7", "#FFFFA6"),
        names.arg = 1:10,       
        legend.text = TRUE,
        args.legend = list(x = "topleft"),
        main = "Importance of God: US vs Bolivia",
        xlab = "Score (1–10)",
        ylab = "Percentage of Observations")

DISCUSSION: There is a larger proportion of people that do not think God is important in the United States, with a higher proportion of people saying 1-8, and with 9-10, there is a much higher proportion of Bolivians prioritizing God as important.

9) Q71 is trust in government (with answer categories of “A great deal”, “Quite a lot”, “Not very much”, “None at all”). Create side-by-side histograms (using the barplot command) of responses to this question for your assigned country and for the U.S. Discuss similarities and differences.

ustrust <- table(us$Q71)
boltrust <- table(bol$Q71)
levels <- c("a great deal", "quite a lot", "not very much", "none at all")
combinedtrust <- rbind(US = ustrust,
                       Bolivia = boltrust)
barplot(combinedtrust,
        beside = TRUE,
        col = c("blue", "red"),
        legend = rownames(combinedtrust),
        args.legend = list(x = "topleft"),
        names.arg = c("A great deal", "Quite a lot", "Not very much", "None at all"),
        main = "Trust in Government",
        ylab = "Number of Responses",
        las = 1)

DISCUSSION: Both Americans and Bolivians do not have a great deal of faith in their government. Though notably, Bolivians have significantly more faith than do Americans do. Most respondents though chose “not very much” trust in their government, and notably Americans shared not having any faith in their government more than Bolivians.

10) Create a crosstab of age (using age6.factor created above) and education (using ed3.factor created above) for each country with and without weights. Calculate percentages by row. Explain what each of the four crosstabs shows and comment on differences/similarities between the country you were assigned and the United States.

Comparison by age (rows)

# United States: using unweighted data
us_tab <- table(us$age6.factor, us$ed3.factor)
round(prop.table(us_tab, 1)*100, 2)
##            
##               low medium  high
##   15-24      1.45  55.80 42.75
##   25-34      1.58  39.28 59.14
##   35-44      2.89  43.11 54.00
##   45-54      0.93  48.13 50.93
##   55-64      1.87  55.61 42.51
##   65 years+  1.77  48.67 49.56
# United States: using weighted data
us_tab_w <- xtabs(W_WEIGHT ~ age6.factor + ed3.factor, data = us)
round(prop.table(us_tab_w, 1) *100, 2)
##            ed3.factor
## age6.factor   low medium  high
##   15-24      1.86  63.45 34.69
##   25-34      3.49  49.38 47.14
##   35-44      5.71  47.14 47.14
##   45-54      2.04  54.64 43.33
##   55-64      2.39  60.73 36.88
##   65 years+  2.96  49.07 47.97
chin <- wv[wv$country =='China', ]
# China: using unweighted data
chin_tab <- table(chin$age6.factor, chin$ed3.factor)
round(prop.table(chin_tab, 1)*100, 2)
##            
##               low medium  high
##   15-24     12.69  33.13 54.18
##   25-34     35.47  25.00 39.53
##   35-44     48.21  26.30 25.49
##   45-54     67.81  18.88 13.30
##   55-64     71.00  23.73  5.27
##   65 years+ 84.74  10.28  4.98
# China: using weighted data
chin_tab_w <- xtabs(W_WEIGHT ~ age6.factor + ed3.factor, data = chin)
round(prop.table(chin_tab_w, 1)*100, 2)
##            ed3.factor
## age6.factor   low medium  high
##   15-24     25.26  34.24 40.50
##   25-34     52.99  18.70 28.31
##   35-44     64.08  19.65 16.27
##   45-54     80.10  12.38  7.53
##   55-64     78.86  17.17  3.97
##   65 years+ 89.42   6.85  3.73

The difference between the unweighted and weighted data shows that China’s was relatively representative, whereas the United States underrepresented the younger cohorts significantly, requiring weighting throughout.

11) Create a histogram of weights (the “W_WEIGHT” variable) for each country. Briefly explain what this shows.

# US weights
hist(us$W_WEIGHT,
     main = "Distribution of Weights (US)",
     xlab = "Weight",
     col = "blue",
     xlim = c(0, 10))

# China weights
hist(chin$W_WEIGHT,
     main = "Distribution of Weights (China)",
     xlab = "Weight",
     col = "red",
     xlim = c(0, 10))

par(mfrow = c(1,2))  # 1 row, 2 plots

hist(us$W_WEIGHT,
     main = "US Weights",
     xlab = "Weight",
     col = "blue",
      xlim = c(0, 10))

hist(chin$W_WEIGHT,
     main = "China Weights",
     xlab = "Weight",
     col = "red",
     xlim = c(0, 10))

DISCUSSION: The weights are both skewed to the right, but China’s was more skewed to the right and does not have that small trail leading to 10. This means that their population was largely representative, giving more validity to their data. On the other hand, the data in the US graph shows that there were populations that were not representative in the population, most notably seen on the crosstabs as the younger cohorts were not necessarily as represented in the American poll, requiring heavier weighting

12) Come up with a question and answer it with a figure or description for a subgroup in the population or a cross-tab. Explain why this is an interesting question and explain your results. Did the results surprise you?

Look through the codebook for ideas and guidance on data coding.

# US remove NAs + summary
us_summary <- us %>%
  filter(!is.na(Q49) & !is.na(us$age6.factor)) %>%   # remove missing values
  group_by(age6.factor) %>%                        # group by age
  summarize(mean_life = mean(Q49, na.rm = TRUE))

# Bolivia remove NAs + summary
bol_summary <- bol %>%
  filter(!is.na(Q49) & !is.na(age6.factor)) %>%
  group_by(age6.factor) %>%
  summarize(mean_life = mean(Q49, na.rm = TRUE))

# Combine into a matrix for barplot
bar_data <- rbind(us_summary$mean_life, bol_summary$mean_life)
rownames(bar_data) <- c("US", "Bolivia")
colnames(bar_data) <- us_summary$age  # make sure age groups align
## Warning: Unknown or uninitialised column: `age`.
# Side-by-side barplot
barplot(bar_data,
        beside = TRUE,
        col = c("skyblue", "salmon"),
        legend.text = TRUE,
        args.legend = list(x = "bottomright"),
        main = "Average Life Satisfaction by Age Group",
        xlab = "Age Group",
        ylab = "Average Life Satisfaction (1–10)",
        names.arg = c("15-24","25-34","35-44","45-54","55-64","65+")
)

DISCUSSION: The question I wanted to ask was how age related to life satisfaction in Bolivia versus in the United States. The difference was striking, as the younger cohort in Bolivia tended to be more satisfied than their American counterparts, and as age increased, the trend flipped, suggesting that Americans that were older were more satisfied.

The following item is not required/not for credit

[Not required/not for credit 2] Weighting factors. If you have learned regression, estimate regression models that explain weights as a function of standard demographic variables such as gender, income, education, and age in each country. (Note that the weights may not vary for some countries; they vary for the U.S.)