Some of you have Project 2 ideas that combine two layers of statistical stories, for example a density curve layered over a histogram to compare your firm’s age distribution to Germany’s. Here’s how to get it to work.
I tidied the German data from the CIA World Factbook by hand in Excel first and then saved my tidy data frame as a csv. Here’s what the first few rows of my data look like. Note that when I did this in Excel, I made sure the Population column was formatted as Number, not General, so that R reads it in as numeric.
library(tidyverse)  # provides %>%, mutate(), case_when(), filter() and ggplot()
germany <- read.csv("ciadata_germany.csv")
head(germany)
##   AgeGroup Population
## 1    20-24    4553436
## 2    25-29    4823925
## 3    30-34    5441865
## 4    35-39    5430155
## 5    40-44    5059667
## 6    45-49    5183684
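If Excel’s formatting does slip through, Population can arrive as text, for example with thousands separators. Here is a minimal sketch of a check and repair. The small data frame `raw` is made up to stand in for a csv that was read badly; it is not the real file.

```r
# Hypothetical stand-in for a csv whose Population column was read as text
raw <- data.frame(
  AgeGroup   = c("20-24", "25-29"),
  Population = c("4,553,436", "4,823,925"),
  stringsAsFactors = FALSE
)

is.numeric(raw$Population)  # FALSE: the commas forced a character column

# Strip the separators, then convert to numeric
raw$Population <- as.numeric(gsub(",", "", raw$Population))

is.numeric(raw$Population)  # TRUE
```

Running `str(germany)` right after `read.csv()` is a quick way to spot this before computing any shares.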
Then, I created a column of shares that I will eventually plot. Here again are the first few rows.
germany <- germany %>%
  mutate(Share = Population / sum(Population))
head(germany)
##   AgeGroup Population      Share
## 1    20-24    4553436 0.08335769
## 2    25-29    4823925 0.08830941
## 3    30-34    5441865 0.09962176
## 4    35-39    5430155 0.09940739
## 5    40-44    5059667 0.09262503
## 6    45-49    5183684 0.09489535
Now, if I want to plot this data and the data from my firm in the same plot, I need to either create a categorical age variable in Data Preparation that matches the one for Germany OR make an Age variable for Germany that holds a single number for each age group. I’ll show the second strategy here. Note that I’m just assigning the midpoint of each range to do this.
germany <- germany %>%
  mutate(Age = case_when(AgeGroup == "20-24" ~ 22,
                         AgeGroup == "25-29" ~ 27,
                         AgeGroup == "30-34" ~ 32,
                         AgeGroup == "35-39" ~ 37,
                         AgeGroup == "40-44" ~ 42,
                         AgeGroup == "45-49" ~ 47,
                         AgeGroup == "50-54" ~ 52,
                         AgeGroup == "55-59" ~ 57,
                         AgeGroup == "60-64" ~ 62,
                         AgeGroup == "65-69" ~ 67))
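Typing each midpoint by hand makes it easy to skip a group. As an alternative sketch (not the approach used above), the midpoint can be computed from the label itself by splitting on the dash and averaging the two ends. `range_midpoint()` is a helper name made up for this example.

```r
# Hypothetical helper: turn labels like "20-24" into their midpoint (22)
range_midpoint <- function(labels) {
  # Split each label into its lower and upper bound
  bounds <- strsplit(as.character(labels), "-", fixed = TRUE)
  # Average the two bounds for every label
  vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))
}

age_groups <- c("20-24", "25-29", "30-34", "35-39", "40-44")
range_midpoint(age_groups)
```

Inside the pipeline this could be used as `mutate(Age = range_midpoint(AgeGroup))`, assuming every label has the form "low-high".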
Here’s a simple histogram using my data from my firm, filtering for ages 20 to 69. Note that I adjusted the binwidth to 5 so that it is the same width as the space between the midpoints of my ranges in the germany dataframe.
myfirm %>%
  filter(Age >= 20 & Age <= 69) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5)
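One detail worth checking: when only binwidth is set, geom_histogram centers its bins on multiples of the bin width, so the bars here cover roughly 17.5 to 22.5, 22.5 to 27.5, and so on rather than 20-24, 25-29. Below is a sketch of one way to line the bars up with the Germany age groups, assuming Age is recorded in whole years. The `myfirm_demo` data frame is made up so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the firm data
myfirm_demo <- data.frame(Age = c(20, 23, 24, 25, 31, 38, 44, 52, 61, 69))

p <- myfirm_demo %>%
  filter(Age >= 20 & Age <= 69) %>%
  ggplot(aes(x = Age)) +
  # boundary = 20 puts the bin edges at 20, 25, 30, ...
  # closed = "left" makes the bins [20, 25), [25, 30), ...
  geom_histogram(binwidth = 5, boundary = 20, closed = "left")

# Inspect the bin edges and counts ggplot actually used
layer_data(p)[, c("xmin", "xmax", "count")]
```

With these settings a 25-year-old is counted in the 25-29 bar, which matches how the Germany age groups are defined.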
Here’s the curve from the germany data frame that I want to overlay on this.
germany %>%
  ggplot(aes(x = Age, y = Share)) +
  geom_line()
Now I want to combine these. First, I need to get the y axis scaled the same way. The histogram has counts. I want shares. Here’s how to fix that.
myfirm %>%
  filter(Age >= 20 & Age <= 69) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 5)
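To see what that y aesthetic computes, the same shares can be reproduced by hand: count the workers in each bin and divide by the total. Here is a base R sketch on made-up ages. The vector `ages` is illustrative, and the breaks assume 5-year bins that start at 20.

```r
# Made-up ages standing in for myfirm$Age after filtering to 20-69
ages <- c(21, 23, 27, 33, 34, 41, 48, 55, 58, 66)

# Left-closed 5-year bins: [20,25), [25,30), ..., [65,70)
bins <- cut(ages, breaks = seq(20, 70, by = 5), right = FALSE)

counts <- table(bins)           # number of workers per bin
shares <- counts / sum(counts)  # count divided by total count, one value per bar

shares
sum(shares)  # the bar heights add up to 1
```

Because the bar heights sum to 1, they are on the same scale as the Share column in the germany data frame.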
Then I can add the geom_line layer, but I have to pass data = germany to it, because otherwise the layer would use the firm data that the plot started with.
myfirm %>%
  filter(Age >= 20 & Age <= 69) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5, aes(y = after_stat(count / sum(count)))) +
  geom_line(data = germany, aes(x = Age, y = Share))
And here’s a less ugly version.
myfirm %>%
  filter(Age >= 20 & Age <= 69) %>%
  ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 5, aes(y = after_stat(count / sum(count))), fill = "lightblue") +
  geom_line(data = germany, aes(x = Age, y = Share), color = "orange", size = 5) +
  theme_classic() +
  labs(y = "Share",
       title = "Firm 0's workforce is older than Germany as a whole")
Since this approach leaves age as a continuous variable for the firm, for the statistical test I would compare the firm’s median age to the median age reported in the CIA data. If you instead make a categorical age variable for your firm, you can test whether the percentage frequencies for that variable differ from the ones for Germany. This is a variation on the chi-square goodness-of-fit test: the null hypothesis is that the firm has the same age distribution as Germany. Here’s the code to create the age groups; the test against Germany’s percentages comes after it. Don’t skip the last part, where AgeGroup is declared as a factor with all ten levels listed in order. It matters a lot if you have any age groups without any workers, because the factor keeps those groups in the table with a count of zero, so the counts still line up with Germany’s ten shares.
agedist_myfirm <- myfirm %>%
  mutate(AgeGroup = case_when(Age >= 20 & Age <= 24 ~ "20-24",
                              Age >= 25 & Age <= 29 ~ "25-29",
                              Age >= 30 & Age <= 34 ~ "30-34",
                              Age >= 35 & Age <= 39 ~ "35-39",
                              Age >= 40 & Age <= 44 ~ "40-44",
                              Age >= 45 & Age <= 49 ~ "45-49",
                              Age >= 50 & Age <= 54 ~ "50-54",
                              Age >= 55 & Age <= 59 ~ "55-59",
                              Age >= 60 & Age <= 64 ~ "60-64",
                              Age >= 65 & Age <= 69 ~ "65-69"),
         AgeGroup = factor(AgeGroup, levels = c("20-24", "25-29", "30-34", "35-39",
                                                "40-44", "45-49", "50-54", "55-59",
                                                "60-64", "65-69")))
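To see why the factor step matters, compare how table() treats a plain character vector and a factor when one age group has no workers. The vectors below are made up for illustration.

```r
# Made-up age groups for a tiny firm with nobody aged 25-29
groups <- c("20-24", "20-24", "30-34")
all_levels <- c("20-24", "25-29", "30-34")

# Character vector: the empty group disappears from the table
table(groups)

# Factor with explicit levels: the empty group is kept with a count of 0
table(factor(groups, levels = all_levels))
```

With the character version the table would have fewer entries than the vector of Germany shares, and chisq.test would stop with an error because the counts and the proportions must have the same length.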
Now I’m finally ready to run the test. The p argument gives the test the set of proportions from Germany stored in the Share column, and the test checks whether your firm’s counts differ from those hypothesized proportions.
chisq.test(table(agedist_myfirm$AgeGroup), p = germany$Share)
##
## Chi-squared test for given probabilities
##
## data: table(agedist_myfirm$AgeGroup)
## X-squared = 560.82, df = 9, p-value < 2.2e-16
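Two things are worth checking before trusting a result like this. The shares passed to p must sum to 1 and be in the same order as the factor levels, and the expected counts should not be too small. Here is a sketch on made-up numbers; the counts and shares below are illustrative, not the firm or Germany data.

```r
# Made-up observed counts for three age groups, in factor-level order
observed <- c("20-24" = 30, "25-29" = 50, "30-34" = 20)

# Made-up reference shares, named so the order can be matched explicitly
reference <- c("30-34" = 0.3, "20-24" = 0.3, "25-29" = 0.4)

# Reorder the shares to match the observed counts before testing
p_ordered <- reference[names(observed)]
stopifnot(isTRUE(all.equal(sum(p_ordered), 1)))

result <- chisq.test(observed, p = p_ordered)

result$expected  # expected counts under the null: total count times each share
result$p.value
```

R warns that the chi-squared approximation may be incorrect when any expected count is below 5, so `result$expected` is worth a look for small firms. If the shares do not sum to 1, chisq.test stops with an error unless `rescale.p = TRUE` is set.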