You will create an R Markdown file that produces an HTML file with your results for your assigned country. Please submit both your .Rmd and your .html file. There will also be a short follow-up quiz in class after the assignment is due that covers basic concepts covered in this assignment. The quiz is pen-and-paper (no notes).
wv[1:4,1:6]
## X country A_YEAR A_STUDY B_COUNTRY B_COUNTRY_ALPHA
## 1 1 Australia 2018 2 36 AUS
## 2 2 Australia 2018 2 36 AUS
## 3 3 Australia 2018 2 36 AUS
## 4 4 Australia 2018 2 36 AUS
dim(wv)
## [1] 60735 349
ANSWER: 60735 Rows, 349 Columns
bol <- wv[wv$country =='Bolivia', ]
dim(bol)
## [1] 2067 349
ANSWER: 2067 Observations (rows)
us <- wv[wv$country =='United States', ]
dim(us)
## [1] 2596 349
ANSWER: 2596 Observations (rows)
# Distribution of gender in each country
df1 <- table(bol$ed3.factor)
df2 <- table(us$ed3.factor)
cbind("bolivia"=df1, "US"=df2)
## bolivia US
## low 709 45
## medium 633 1200
## high 719 1317
# Weights for gender in each country (estimate)
targetgender <- c(male = .5, female = .5)
#Bolivia weights by gender
round(prop.table(table(bol$gender.factor)) / targetgender, digits = 3)
##
## male female
## 0.991 1.009
#US weights by gender
round(prop.table(table(us$gender.factor)) / targetgender, digits = 3)
##
## male female
## 1.071 0.929
# Compare the distribution of education in each country (using the 3 category version of education)
# US
us_counts <- data.frame(table(us$Q49)) %>%
rename(score = Var1, count = Freq) %>%
mutate(perc = count / sum(count) * 100)
# Bolivia
bol_counts <- data.frame(table(bol$Q49)) %>%
rename(score = Var1, count = Freq) %>%
mutate(perc = count / sum(count) * 100)
#combined
combined <- c(us_counts, bol_counts, by = "score", suffix = c("_US", "_Bol"))
#Plot
perc_matrix <- rbind(us_counts$perc, bol_counts$perc)
rownames(perc_matrix) <- c("US", "Bolivia")
barplot(perc_matrix,
beside = TRUE,
col = c("blue", "yellow"),
names.arg = 1:10,
legend.text = TRUE,
args.legend = list(x = "topright"),
main = "Life Satisfaction: US vs Bolivia",
xlab = "Score (1–10)",
ylab = "Percentage of Observations")
DISCUSSION: Life satisfaction for the United States is distributed relatively evenly around 8, but Bolivia has a strikingly larger amount of 10s reported, relative to sample size. The United States has a larger proportion reporting 9s, 8s, and 7s, whereas Bolivia has a larger proportion reporting 5s and 6s.
#US counts without NAs
us_countsgod <- us %>%
filter(!is.na(Q164)) %>%
count(Q164) %>%
mutate(perc = n / sum(n) * 100)
# Bolivia counts without NAs
bol_countsgod <- bol %>%
filter(!is.na(Q164)) %>%
count(Q164) %>%
mutate(perc = n / sum(n) * 100)
#Combined
combinedgod <- c(us_countsgod, bol_countsgod, by = "score", suffix = c("_US", "_Bol"))
#plot
perc_matrixgod <- rbind(us_countsgod$perc, bol_countsgod$perc)
rownames(perc_matrixgod) <- c("US", "Bolivia")
barplot(perc_matrixgod,
beside = TRUE,
col = c("#FF70E7", "#FFFFA6"),
names.arg = 1:10,
legend.text = TRUE,
args.legend = list(x = "topleft"),
main = "Importance of God: US vs Bolivia",
xlab = "Score (1–10)",
ylab = "Percentage of Observations")
DISCUSSION: There is a larger proportion of people that do not think God is important in the United States, with a higher proportion of people saying 1-8, and with 9-10, there is a much higher proportion of Bolivians prioritizing God as important.
ustrust <- table(us$Q71)
boltrust <- table(bol$Q71)
levels <- c("a great deal", "quite a lot", "not very much", "none at all")
combinedtrust <- rbind(US = ustrust,
Bolivia = boltrust)
barplot(combinedtrust,
beside = TRUE,
col = c("blue", "red"),
legend = rownames(combinedtrust),
args.legend = list(x = "topleft"),
names.arg = c("A great deal", "Quite a lot", "Not very much", "None at all"),
main = "Trust in Government",
ylab = "Number of Responses",
las = 1)
DISCUSSION: Both Americans and Bolivians do not have a great deal of faith in their government. Though notably, Bolivians have significantly more faith than do Americans do. Most respondents though chose “not very much” trust in their government, and notably Americans shared not having any faith in their government more than Bolivians.
# United States: using unweighted data
us_tab <- table(us$age6.factor, us$ed3.factor)
round(prop.table(us_tab, 1)*100, 2)
##
## low medium high
## 15-24 1.45 55.80 42.75
## 25-34 1.58 39.28 59.14
## 35-44 2.89 43.11 54.00
## 45-54 0.93 48.13 50.93
## 55-64 1.87 55.61 42.51
## 65 years+ 1.77 48.67 49.56
# United States: using weighted data
us_tab_w <- xtabs(W_WEIGHT ~ age6.factor + ed3.factor, data = us)
round(prop.table(us_tab_w, 1) *100, 2)
## ed3.factor
## age6.factor low medium high
## 15-24 1.86 63.45 34.69
## 25-34 3.49 49.38 47.14
## 35-44 5.71 47.14 47.14
## 45-54 2.04 54.64 43.33
## 55-64 2.39 60.73 36.88
## 65 years+ 2.96 49.07 47.97
chin <- wv[wv$country =='China', ]
# China: using unweighted data
chin_tab <- table(chin$age6.factor, chin$ed3.factor)
round(prop.table(chin_tab, 1)*100, 2)
##
## low medium high
## 15-24 12.69 33.13 54.18
## 25-34 35.47 25.00 39.53
## 35-44 48.21 26.30 25.49
## 45-54 67.81 18.88 13.30
## 55-64 71.00 23.73 5.27
## 65 years+ 84.74 10.28 4.98
# China: using weighted data
chin_tab_w <- xtabs(W_WEIGHT ~ age6.factor + ed3.factor, data = chin)
round(prop.table(chin_tab_w, 1)*100, 2)
## ed3.factor
## age6.factor low medium high
## 15-24 25.26 34.24 40.50
## 25-34 52.99 18.70 28.31
## 35-44 64.08 19.65 16.27
## 45-54 80.10 12.38 7.53
## 55-64 78.86 17.17 3.97
## 65 years+ 89.42 6.85 3.73
The difference between the unweighted and weighted data shows that China’s was relatively representative, whereas the United States underrepresented the younger cohorts significantly, requiring weighting throughout.
# US weights
hist(us$W_WEIGHT,
main = "Distribution of Weights (US)",
xlab = "Weight",
col = "blue",
xlim = c(0, 10))
# China weights
hist(chin$W_WEIGHT,
main = "Distribution of Weights (China)",
xlab = "Weight",
col = "red",
xlim = c(0, 10))
par(mfrow = c(1,2)) # 1 row, 2 plots
hist(us$W_WEIGHT,
main = "US Weights",
xlab = "Weight",
col = "blue",
xlim = c(0, 10))
hist(chin$W_WEIGHT,
main = "China Weights",
xlab = "Weight",
col = "red",
xlim = c(0, 10))
DISCUSSION: The weights are both skewed to the right, but China’s was more skewed to the right and does not have that small trail leading to 10. This means that their population was largely representative, giving more validity to their data. On the other hand, the data in the US graph shows that there were populations that were not representative in the population, most notably seen on the crosstabs as the younger cohorts were not necessarily as represented in the American poll, requiring heavier weighting
Look through the codebook for ideas and guidance on data coding.
# US remove NAs + summary
us_summary <- us %>%
filter(!is.na(Q49) & !is.na(us$age6.factor)) %>% # remove missing values
group_by(age6.factor) %>% # group by age
summarize(mean_life = mean(Q49, na.rm = TRUE))
# Bolivia remove NAs + summary
bol_summary <- bol %>%
filter(!is.na(Q49) & !is.na(age6.factor)) %>%
group_by(age6.factor) %>%
summarize(mean_life = mean(Q49, na.rm = TRUE))
# Combine into a matrix for barplot
bar_data <- rbind(us_summary$mean_life, bol_summary$mean_life)
rownames(bar_data) <- c("US", "Bolivia")
colnames(bar_data) <- us_summary$age # make sure age groups align
## Warning: Unknown or uninitialised column: `age`.
# Side-by-side barplot
barplot(bar_data,
beside = TRUE,
col = c("skyblue", "salmon"),
legend.text = TRUE,
args.legend = list(x = "bottomright"),
main = "Average Life Satisfaction by Age Group",
xlab = "Age Group",
ylab = "Average Life Satisfaction (1–10)",
names.arg = c("15-24","25-34","35-44","45-54","55-64","65+")
)
DISCUSSION: The question I wanted to ask was how age related to life satisfaction in Bolivia versus in the United States. The difference was striking, as the younger cohort in Bolivia tended to be more satisfied than their American counterparts, and as age increased, the trend flipped, suggesting that Americans that were older were more satisfied.