Please indicate
What gave you the most trouble:
Any comments you have:
After our discussion about making smaller buckets for the income variable (e.g. high income, medium income, low income), I tried to use fct_recode, as I did with the religion variable, to group the categorical variables. While I understand it conceptually, I wasn’t able to get the code to work.
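One likely reason the fct_recode attempt failed is that income is stored as a number rather than a factor, so there are no levels to recode. This is inferred from how income is filtered and averaged in the code below, not from the actual error message. A minimal sketch of bucketing with base R's cut() instead, where the cutoffs (50,000 and 100,000), the bucket labels, and the toy income vector are assumptions for illustration only:

```r
# Toy incomes using the same coding as the data set (-1 = not reported).
income <- c(-1, 20000, 40000, 60000, 80000, 100000, 250000, 1000000)

# Treat -1 as missing so it does not land in the "low" bucket.
income_clean <- ifelse(income == -1, NA, income)

# Hypothetical cutoffs: low < 50,000 <= medium < 100,000 <= high.
income_bucket <- cut(
  income_clean,
  breaks = c(0, 50000, 100000, Inf),
  labels = c("low", "medium", "high"),
  right = FALSE  # intervals are [lower, upper)
)

# Count users per bucket, keeping the non-reporters visible as NA.
table(income_bucket, useNA = "ifany")
```

The result is a factor, so it could then be passed to fct_recode or used directly as a categorical predictor.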
Perform an Exploratory Data Analysis (EDA) on the profiles data set, specifically on the relationship between gender and income and job, all keeping in mind that in HW-3 you will be fitting a logistic regression to predict a user’s gender based on these variables.
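#Select Only the Needed Variables From the Dataset
# Packages used below: dplyr, tidyr, forcats, ggplot2 (all in the tidyverse) and knitr for kable().
library(tidyverse)
library(knitr)
# Assumes the profiles data frame has already been loaded.
profiles <- profiles %>%
  mutate(is_female = ifelse(sex == "f", 1, 0))
profiles_less <- profiles %>%
  select(religion, is_female, income, job)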
#Gender and Income
#Number of People Who Did Not Report Income
profiles_less %>%
filter(income==-1) %>%
tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 48442
#Number and Proportion Female Who Did Not Report Income
profiles_less %>%
filter(income==-1) %>%
group_by(is_female) %>%
tally() %>%
kable()
| is_female | n |
|---|---|
| 0 | 27438 |
| 1 | 21004 |
profiles_less %>%
filter(income==-1) %>%
summarise(`Prop Female of Non-Reporters`=mean(is_female)) %>%
kable()
| Prop Female of Non-Reporters |
|---|
| 0.4335907 |
#Dataset without -1 Income Values
profiles_less2 <- profiles_less %>%
filter(income != -1)
#Distribution of Income
profiles_less %>%
group_by(income) %>%
tally() %>%
kable()
| income | n |
|---|---|
| -1 | 48442 |
| 20000 | 2952 |
| 30000 | 1048 |
| 40000 | 1005 |
| 50000 | 975 |
| 60000 | 736 |
| 70000 | 707 |
| 80000 | 1111 |
| 100000 | 1621 |
| 150000 | 631 |
| 250000 | 149 |
| 500000 | 48 |
| 1000000 | 521 |
#Distribution of Reported Incomes
ggplot(data=profiles_less2, aes(x=income)) + geom_histogram(binwidth = 10000)
#Average Income by Gender
profiles_less2 %>%
group_by(is_female) %>%
summarise("Avg Income By Gender" = mean(income)) %>%
kable()
| is_female | Avg Income By Gender |
|---|---|
| 0 | 110984.39 |
| 1 | 86633.47 |
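Because 521 users reported an income of 1,000,000, the means above may be pulled upward by that top bucket. A small sketch that rebuilds the reported incomes from the counts in the income distribution table and compares the mean with the median (the variable names are illustrative, not from the original analysis):

```r
# Counts of reported incomes, copied from the income distribution table.
income_levels <- c(20000, 30000, 40000, 50000, 60000, 70000, 80000,
                   100000, 150000, 250000, 500000, 1000000)
counts <- c(2952, 1048, 1005, 975, 736, 707, 1111, 1621, 631, 149, 48, 521)

# Expand the table back into one value per respondent.
reported <- rep(income_levels, times = counts)

length(reported)   # number of users who reported an income
mean(reported)     # pulled upward by the users at 1,000,000
median(reported)   # less sensitive to that top bucket
```

On these counts the median falls well below both group means, which suggests comparing medians by gender as a robustness check.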
#Distribution of Income by Gender
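# after_stat(density) replaces the deprecated ..density.. notation (requires ggplot2 >= 3.3.0).
ggplot(data=profiles_less2, aes(x=income, y=after_stat(density))) +
  geom_histogram(binwidth = 20000) +
  facet_wrap(~is_female) +
  xlab("Income") +
  ylab("Density")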
profiles %>%
group_by(income) %>%
summarise(mean(is_female)) %>%
kable()
| income | mean(is_female) |
|---|---|
| -1 | 0.4335907 |
| 20000 | 0.3492547 |
| 30000 | 0.3053435 |
| 40000 | 0.3373134 |
| 50000 | 0.3189744 |
| 60000 | 0.3002717 |
| 70000 | 0.2347949 |
| 80000 | 0.2349235 |
| 100000 | 0.1579272 |
| 150000 | 0.1188590 |
| 250000 | 0.0335570 |
| 500000 | 0.0625000 |
| 1000000 | 0.2399232 |
#Proportion Female by Income
profiles_less %>%
group_by(income) %>%
summarise(`Proportion Female`=mean(is_female), "Number"= n()) %>%
kable()
| income | Proportion Female | Number |
|---|---|---|
| -1 | 0.4335907 | 48442 |
| 20000 | 0.3492547 | 2952 |
| 30000 | 0.3053435 | 1048 |
| 40000 | 0.3373134 | 1005 |
| 50000 | 0.3189744 | 975 |
| 60000 | 0.3002717 | 736 |
| 70000 | 0.2347949 | 707 |
| 80000 | 0.2349235 | 1111 |
| 100000 | 0.1579272 | 1621 |
| 150000 | 0.1188590 | 631 |
| 250000 | 0.0335570 | 149 |
| 500000 | 0.0625000 | 48 |
| 1000000 | 0.2399232 | 521 |
profiles_new <- profiles_less %>%
group_by(income) %>%
mutate(`Proportion Female`=mean(is_female), "Number"= n())
ggplot(profiles_new, aes(x=income, y=`Proportion Female`)) +
geom_point() +
geom_hline(yintercept= 0.4)
#Gender and Job
#Gender Proportions by Job
profiles_less2 %>%
group_by(job) %>%
summarise(`Gender Prop by Job` = mean(is_female)) %>%
arrange(desc(`Gender Prop by Job`)) %>%
kable()
| job | Gender Prop by Job |
|---|---|
| clerical / administrative | 0.6568627 |
| education / academia | 0.5027100 |
| medicine / health | 0.4776579 |
| unemployed | 0.3960396 |
| student | 0.3286290 |
| retired | 0.3285714 |
| hospitality / travel | 0.3159420 |
| other | 0.3108371 |
| law / legal services | 0.3041475 |
| rather not say | 0.3033708 |
| artistic / musical / writer | 0.3013436 |
| sales / marketing / biz dev | 0.2793103 |
| political / government | 0.2769231 |
| NA | 0.2551320 |
| banking / financial / real estate | 0.2410959 |
| entertainment / media | 0.2147505 |
| executive / management | 0.2082585 |
| transportation | 0.1285714 |
| science / tech / engineering | 0.1116803 |
| military | 0.0714286 |
| computer / hardware / software | 0.0666090 |
| construction / craftsmanship | 0.0448179 |
profiles_less2 <- profiles_less2 %>%
group_by(job) %>%
mutate(`Gender Prop by Job` = mean(is_female))
ggplot(profiles_less2, aes(x=job, y= `Gender Prop by Job`)) +
geom_point() +
coord_flip() +
geom_hline(yintercept= 0.4) +
xlab("Job")
## Warning: Removed 341 rows containing missing values (geom_point).
#Average Income by Job and Gender
profiles_less2 %>%
group_by(job, is_female) %>%
summarise(`Avg Income By Job and Gender` = mean(income)) %>%
spread(key=is_female, value= `Avg Income By Job and Gender`) %>%
arrange(desc(`0`)) %>%
kable()
| job | 0 | 1 |
|---|---|---|
| retired | 367872.34 | 205652.17 |
| rather not say | 214516.13 | 146296.30 |
| NA | 192480.31 | 155402.30 |
| executive / management | 148163.27 | 103275.86 |
| artistic / musical / writer | 134313.19 | 117197.45 |
| banking / financial / real estate | 130469.31 | 98295.45 |
| law / legal services | 124437.09 | 96363.64 |
| medicine / health | 120973.45 | 88483.87 |
| science / tech / engineering | 110115.34 | 99449.54 |
| computer / hardware / software | 109879.52 | 87142.86 |
| political / government | 108652.48 | 92037.04 |
| entertainment / media | 107375.69 | 71111.11 |
| sales / marketing / biz dev | 100143.54 | 74773.66 |
| education / academia | 98610.35 | 72264.15 |
| hospitality / travel | 97669.49 | 109724.77 |
| construction / craftsmanship | 93401.76 | 55000.00 |
| other | 87429.38 | 68622.13 |
| unemployed | 85737.70 | 100000.00 |
| student | 79819.82 | 63803.68 |
| transportation | 75163.93 | 151666.67 |
| military | 65604.40 | 340000.00 |
| clerical / administrative | 60714.29 | 45597.01 |
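Some of the averages above may rest on very few users (the 340,000 figure for women in the military is a likely candidate), but the table does not show cell sizes. A hedged sketch of reporting the count next to each mean, using base R's aggregate() on made-up toy data rather than the real profiles:

```r
# Toy data standing in for profiles_less2 (values are made up).
toy <- data.frame(
  job       = c("military", "military", "military", "student", "student", "student"),
  is_female = c(0, 0, 1, 0, 1, 1),
  income    = c(60000, 70000, 1000000, 20000, 30000, 50000)
)

# Mean income and cell size for every job-by-gender combination.
cell_summary <- aggregate(
  income ~ job + is_female,
  data = toy,
  FUN = function(x) c(avg = mean(x), n = length(x))
)

# aggregate() returns the two statistics as a matrix column; flatten it
# into ordinary columns named income.avg and income.n.
cell_summary <- do.call(data.frame, cell_summary)

cell_summary
```

In the dplyr pipeline used above, the equivalent would be adding n = n() inside summarise().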
#Religion Categories are Grouped By Major Religion
profiles_religion <- profiles_less %>%
mutate(`Grouped Religion` = fct_recode(religion,
"agnosticism" = "agnosticism",
"agnosticism" = "agnosticism and laughing about it",
"agnosticism" = "agnosticism and somewhat serious about it",
"agnosticism" = "agnosticism and very serious about it",
"agnosticism" = "agnosticism but not too serious about it",
"atheism" = "atheism",
"atheism" = "atheism and laughing about it",
"atheism" = "atheism and somewhat serious about it",
"atheism" = "atheism and very serious about it",
"atheism" = "atheism but not too serious about it",
"buddhism" = "buddhism",
"buddhism" = "buddhism and laughing about it",
"buddhism" = "buddhism and somewhat serious about it",
"buddhism" = "buddhism and very serious about it",
"buddhism" = "buddhism but not too serious about it",
"catholism" = "catholicism",
"catholism" = "catholicism and laughing about it",
"catholism" = "catholicism and somewhat serious about it",
"catholism" = "catholicism and very serious about it",
"catholism" = "catholicism but not too serious about it",
"christianity" = "christianity",
"christianity" = "christianity and laughing about it",
"christianity" ="christianity and somewhat serious about it",
"christianity" = "christianity and very serious about it",
"christianity" = "christianity but not too serious about it",
"hinduism" = "hinduism",
"hinduism" = "hinduism and laughing about it",
"hinduism" = "hinduism and somewhat serious about it",
"hinduism" = "hinduism and very serious about it",
"hinduism" = "hinduism but not too serious about it",
"islam" = "islam",
"islam" = "islam and laughing about it",
"islam" = "islam and somewhat serious about it",
"islam" = "islam and very serious about it",
"islam" = "islam but not too serious about it",
"judaism" = "judaism",
"judaism" = "judaism and laughing about it",
"judaism" = "judaism and somewhat serious about it",
"judaism" = "judaism and very serious about it",
"judaism" = "judaism but not too serious about it",
"other" = "other",
"other" = "other and laughing about it",
"other" = "other and somewhat serious about it",
"other" = "other and very serious about it",
"other" = "other but not too serious about it"))
#Proportion Female By Religion
profiles_religion %>%
group_by(`Grouped Religion`) %>%
summarise(`Prop Female` = mean(is_female), `Number`=n()) %>%
arrange(desc(`Prop Female`)) %>%
kable()
| Grouped Religion | Prop Female | Number |
|---|---|---|
| judaism | 0.4887024 | 3098 |
| christianity | 0.4601693 | 5787 |
| catholism | 0.4411517 | 4758 |
| other | 0.4356193 | 7743 |
| buddhism | 0.4214579 | 1948 |
| NA | 0.4076931 | 20226 |
| agnosticism | 0.3702905 | 8812 |
| hinduism | 0.3266667 | 450 |
| atheism | 0.2797423 | 6985 |
| islam | 0.2661871 | 139 |
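The long fct_recode call above could be shortened, since every level listed there is either a base religion or a base religion followed by " and ..." or " but ...". A minimal sketch with base R's sub(), assuming that pattern holds for all levels in the data; the toy religion vector is for illustration only:

```r
# A few raw religion strings in the format used by the data set.
religion <- c("agnosticism and laughing about it",
              "catholicism but not too serious about it",
              "judaism",
              "other and very serious about it",
              NA)

# Drop the " and ..." / " but ..." qualifier, keeping the base religion.
grouped <- sub(" (and|but) .*$", "", religion)

# Convert to a factor so it can be grouped and reordered like before.
grouped_religion <- factor(grouped)
levels(grouped_religion)
```

Missing values stay missing, so the NA group in the table would be unchanged.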
profiles_religion <- profiles_religion%>%
group_by(`Grouped Religion`) %>%
mutate(`Prop Female` = mean(is_female)) %>%
ungroup() %>%
mutate(`Grouped Religion` =fct_reorder(`Grouped Religion`,`Prop Female`))
ggplot(profiles_religion, aes(x=`Grouped Religion`, y= `Prop Female`)) +
geom_point() +
geom_hline(yintercept= 0.4) +
xlab("Religion")
mean(profiles$is_female)
## [1] 0.4023121
#The line at y=0.40 on the plot is added as a reference tool, indicating the proportion of female respondents in the entire sample.
Notes:
-43.4% of respondents who did not report their income were female. This is not drastically different from the overall proportion of female respondents, 40%. Therefore, compared with the sample as a whole, not reporting income is not a strong predictor of gender.
-The average female income is 86,633, substantially lower than the average male income of 110,984. Therefore, income can be a strong predictor of sex. Overall, the distribution of female incomes is shifted slightly farther to the left than the distribution of male incomes. Grouping the income variable into bigger buckets, for example into high, medium, and low income, would simplify the model.
-Many jobs have gender distributions substantially different from the overall proportion of female respondents, 0.40. Therefore, depending on the specific job, job can be a moderate to strong predictor of sex.
-Religion can also be a strong predictor of sex, depending on the specific religion. For religions with proportions of females close to 0.40, religion is not a powerful predictor of gender. For religions with proportions far from 0.40, it is. For example, 26.6% of users who reported Islam are women, compared to the overall 40% of respondents, indicating that religion is a strong predictor of sex for that group. However, only 139 users reported Islam, so it is a strong predictor for a very small subset of the population, and it might not be worth including a separate coefficient for Islam in the model. In contrast, Christianity would be a moderately strong predictor of sex for a larger subset of people, and therefore might be more valuable to include in the model. Finally, 20,226 people did not report a religion, and not reporting religion is not a strong predictor of sex, since the proportion female for that subset is very close to 0.40.
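The first note compares non-reporters with the whole sample. A complementary check is to compare them with the users who did report an income. The sketch below uses only counts that appear in the tables above; the variable names are illustrative, and the odds ratio is included because HW-3 will use logistic regression.

```r
# Counts taken from the tables above.
nonreporters <- c(male = 27438, female = 21004)   # income == -1
total_n      <- 59946                             # sum of the Number column
total_female <- round(0.4023121 * total_n)        # overall proportion x n

# Users who reported an income = everyone minus the non-reporters.
reporters_female <- total_female - nonreporters[["female"]]
reporters_n      <- total_n - sum(nonreporters)

prop_female_nonreporters <- nonreporters[["female"]] / sum(nonreporters)
prop_female_reporters    <- reporters_female / reporters_n

# Odds of being female, non-reporters relative to reporters.
odds <- function(p) p / (1 - p)
odds_ratio <- odds(prop_female_nonreporters) / odds(prop_female_reporters)

c(nonreporters = prop_female_nonreporters,
  reporters    = prop_female_reporters,
  odds_ratio   = odds_ratio)
```

On these counts the share of women among users who reported an income is noticeably lower than among non-reporters, so an indicator for whether income was reported may carry more information than the comparison with the overall 40% suggests.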
In the file HW-2_Shiny_App.Rmd, build the Shiny App discussed in Lec09 on Monday 10/3: Using the movies data set in the ggplot2movies package, make a Shiny app that
Action, Animation, Comedy, etc), have a radio button that allows you to toggle between comedies and non-comedies. This app should be simpler.