Administrative:

Please indicate

  • Who you collaborated with: Amanda Hotvedt
  • Roughly how much time you spent on this HW so far: 7
  • The URL of your published RPubs document here.
  • What gave you the most trouble:

  • Any comments you have:

After our discussion about making smaller buckets for the income variable (e.g., high income, medium income, low income), I tried to use fct_recode, as I did with the religion variable, to group the income categories. While I understand it conceptually, I wasn’t able to get the code to work.
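One possible reason the fct_recode attempt failed is that income is stored as a number (20000, 30000, ...), whereas fct_recode expects a factor or character vector. Below is a minimal sketch of one alternative, using dplyr's case_when on a toy data frame. The cut-offs (40,000 and 100,000), the column name income_group, and the toy values are assumptions chosen for illustration only.

```r
library(dplyr)

# Toy stand-in for the income column; values are illustrative only
toy <- tibble(income = c(-1, 20000, 50000, 80000, 150000, 1000000))

toy <- toy %>%
  mutate(
    # Conditions are checked in order, so -1 (not reported) is caught first
    income_group = case_when(
      income == -1     ~ "not reported",
      income <= 40000  ~ "low",      # assumed cut-off
      income <= 100000 ~ "medium",   # assumed cut-off
      TRUE             ~ "high"
    ),
    # Store as a factor with an explicit level order for tables and models
    income_group = factor(income_group,
                          levels = c("not reported", "low", "medium", "high"))
  )

toy %>% count(income_group)
```

The same mutate() step could be applied to profiles_less in place of toy once the cut-offs are settled.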

Question 1:

Perform an Exploratory Data Analysis (EDA) on the profiles data set, specifically on the relationship between gender and

  • income
  • job
  • One more categorical variable of your choice

all keeping in mind that in HW-3 you will be fitting a logistic regression to predict a user’s gender based on these variables.

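#Load the Packages Used Below

library(dplyr)    # %>%, mutate, select, filter, group_by, summarise
library(ggplot2)  # plots
library(forcats)  # fct_recode, fct_reorder
library(tidyr)    # spread
library(knitr)    # kable

#Select Only the Needed Variables From the Dataset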

profiles <- profiles %>%
  mutate(is_female = ifelse(sex=="f", 1, 0))

profiles_less <- profiles %>% 
  select(religion, is_female, income, job)

#Gender and Income

#Number of People Who Did Not Report Income

profiles_less %>% 
  filter(income==-1) %>% 
  tally()
## # A tibble: 1 × 1
##       n
##   <int>
## 1 48442
#Number and Proportion Female Who Did Not Report Income

profiles_less %>%
  filter(income==-1) %>% 
  group_by(is_female) %>% 
  tally() %>% 
  kable()
| is_female |     n |
|----------:|------:|
|         0 | 27438 |
|         1 | 21004 |
profiles_less %>%
  filter(income==-1) %>% 
  summarise(`Prop Female of Non-Reporters`=mean(is_female)) %>% 
  kable()

Prop Female of Non-Reporters

                0.4335907
#Dataset without -1 Income Values

profiles_less2 <- profiles_less %>% 
  filter(income != -1) 

#Distribution of Income

profiles_less %>% 
  group_by(income) %>% 
  tally() %>% 
  kable()
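|  income |     n |
|--------:|------:|
|      -1 | 48442 |
|   20000 |  2952 |
|   30000 |  1048 |
|   40000 |  1005 |
|   50000 |   975 |
|   60000 |   736 |
|   70000 |   707 |
|   80000 |  1111 |
|  100000 |  1621 |
|  150000 |   631 |
|  250000 |   149 |
|  500000 |    48 |
| 1000000 |   521 |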
#Distribution of Reported Incomes

ggplot(data=profiles_less2, aes(x=income)) + geom_histogram(binwidth = 10000)

#Average Income by Gender

profiles_less2 %>% 
  group_by(is_female) %>% 
  summarise("Avg Income By Gender" = mean(income)) %>% 
  kable()
| is_female | Avg Income By Gender |
|----------:|---------------------:|
|         0 |            110984.39 |
|         1 |             86633.47 |
#Distribution of Income by Gender

ggplot(data=profiles_less2, aes(x=income, y=after_stat(density))) + 
  geom_histogram(binwidth = 20000) + 
  facet_wrap(~is_female) +
  xlab("Income") +
  ylab("Density") 

profiles %>% 
  group_by(income) %>% 
  summarise(mean(is_female)) %>% 
  kable()
|  income | mean(is_female) |
|--------:|----------------:|
|      -1 |       0.4335907 |
|   20000 |       0.3492547 |
|   30000 |       0.3053435 |
|   40000 |       0.3373134 |
|   50000 |       0.3189744 |
|   60000 |       0.3002717 |
|   70000 |       0.2347949 |
|   80000 |       0.2349235 |
|  100000 |       0.1579272 |
|  150000 |       0.1188590 |
|  250000 |       0.0335570 |
|  500000 |       0.0625000 |
| 1000000 |       0.2399232 |
#Proportion Female by Income

profiles_less %>%
  group_by(income) %>% 
  summarise(`Proportion Female`=mean(is_female), "Number"= n()) %>% 
  kable()
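|  income | Proportion Female | Number |
|--------:|------------------:|-------:|
|      -1 |         0.4335907 |  48442 |
|   20000 |         0.3492547 |   2952 |
|   30000 |         0.3053435 |   1048 |
|   40000 |         0.3373134 |   1005 |
|   50000 |         0.3189744 |    975 |
|   60000 |         0.3002717 |    736 |
|   70000 |         0.2347949 |    707 |
|   80000 |         0.2349235 |   1111 |
|  100000 |         0.1579272 |   1621 |
|  150000 |         0.1188590 |    631 |
|  250000 |         0.0335570 |    149 |
|  500000 |         0.0625000 |     48 |
| 1000000 |         0.2399232 |    521 |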
profiles_new <- profiles_less %>%
  group_by(income) %>% 
  mutate(`Proportion Female`=mean(is_female), "Number"= n()) 

ggplot(profiles_new, aes(x=income, y=`Proportion Female`)) +
  geom_point() +
  geom_hline(yintercept= 0.4)
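
Because mutate() keeps one row per user, the plot above draws many overlapping points at each income level, and the reference line is typed in by hand. The following is a minimal sketch of an alternative that plots one point per income level and computes the reference line from the data. The toy data frame and the names overall_prop, by_income, and prop_female are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for profiles_less; values are illustrative only
toy <- tibble(
  income    = c(20000, 20000, 50000, 50000, 50000, 100000),
  is_female = c(1, 0, 1, 0, 0, 0)
)

# Overall proportion female, computed rather than hard-coded
overall_prop <- mean(toy$is_female)

# summarise() collapses to one row per income level
by_income <- toy %>%
  group_by(income) %>%
  summarise(prop_female = mean(is_female), n = n())

p <- ggplot(by_income, aes(x = income, y = prop_female)) +
  geom_point(aes(size = n)) +                        # point size shows group size
  geom_hline(yintercept = overall_prop, linetype = "dashed") +
  labs(x = "Income", y = "Proportion Female", size = "Number")

p
```

Mapping point size to the group count makes it easier to see which proportions rest on only a few users.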

#Gender and Job

#Gender Proportions by Job

profiles_less2 %>% 
  group_by(job) %>% 
  summarise(`Gender Prop by Job` = mean(is_female)) %>% 
  arrange(desc(`Gender Prop by Job`)) %>% 
  kable()
| job                               | Gender Prop by Job |
|:----------------------------------|-------------------:|
| clerical / administrative         |          0.6568627 |
| education / academia              |          0.5027100 |
| medicine / health                 |          0.4776579 |
| unemployed                        |          0.3960396 |
| student                           |          0.3286290 |
| retired                           |          0.3285714 |
| hospitality / travel              |          0.3159420 |
| other                             |          0.3108371 |
| law / legal services              |          0.3041475 |
| rather not say                    |          0.3033708 |
| artistic / musical / writer       |          0.3013436 |
| sales / marketing / biz dev       |          0.2793103 |
| political / government            |          0.2769231 |
| NA                                |          0.2551320 |
| banking / financial / real estate |          0.2410959 |
| entertainment / media             |          0.2147505 |
| executive / management            |          0.2082585 |
| transportation                    |          0.1285714 |
| science / tech / engineering      |          0.1116803 |
| military                          |          0.0714286 |
| computer / hardware / software    |          0.0666090 |
| construction / craftsmanship      |          0.0448179 |
profiles_less2 <- profiles_less2 %>% 
  group_by(job) %>% 
  mutate(`Gender Prop by Job` = mean(is_female)) 

ggplot(profiles_less2, aes(x=job, y= `Gender Prop by Job`))  + 
  geom_point() + 
  coord_flip() +
  geom_hline(yintercept= 0.4) +
  xlab("Job")
## Warning: Removed 341 rows containing missing values (geom_point).

#Average Income by Job and Gender

profiles_less2 %>% 
  group_by(job, is_female) %>% 
  summarise(`Avg Income By Job and Gender` = mean(income)) %>% 
  spread(key=is_female, value= `Avg Income By Job and Gender`) %>% 
  arrange(desc(`0`)) %>% 
  kable()
| job                               | is_female = 0 | is_female = 1 |
|:----------------------------------|--------------:|--------------:|
| retired                           |     367872.34 |     205652.17 |
| rather not say                    |     214516.13 |     146296.30 |
| NA                                |     192480.31 |     155402.30 |
| executive / management            |     148163.27 |     103275.86 |
| artistic / musical / writer       |     134313.19 |     117197.45 |
| banking / financial / real estate |     130469.31 |      98295.45 |
| law / legal services              |     124437.09 |      96363.64 |
| medicine / health                 |     120973.45 |      88483.87 |
| science / tech / engineering      |     110115.34 |      99449.54 |
| computer / hardware / software    |     109879.52 |      87142.86 |
| political / government            |     108652.48 |      92037.04 |
| entertainment / media             |     107375.69 |      71111.11 |
| sales / marketing / biz dev       |     100143.54 |      74773.66 |
| education / academia              |      98610.35 |      72264.15 |
| hospitality / travel              |      97669.49 |     109724.77 |
| construction / craftsmanship      |      93401.76 |      55000.00 |
| other                             |      87429.38 |      68622.13 |
| unemployed                        |      85737.70 |     100000.00 |
| student                           |      79819.82 |      63803.68 |
| transportation                    |      75163.93 |     151666.67 |
| military                          |      65604.40 |     340000.00 |
| clerical / administrative         |      60714.29 |      45597.01 |
#Religion Categories are Grouped By Major Religion

profiles_religion <- profiles_less %>% 
  mutate(`Grouped Religion` = fct_recode(religion,
                  "agnosticism" = "agnosticism",
                  "agnosticism" = "agnosticism and laughing about it",
                  "agnosticism" = "agnosticism and somewhat serious about it",
                  "agnosticism" = "agnosticism and very serious about it",
                  "agnosticism" = "agnosticism but not too serious about it",
                  "atheism" = "atheism",
                  "atheism" = "atheism and laughing about it",
                  "atheism" = "atheism and somewhat serious about it",
                  "atheism" = "atheism and very serious about it",
                  "atheism" = "atheism but not too serious about it",
                  "buddhism" = "buddhism",
                  "buddhism" = "buddhism and laughing about it",
                  "buddhism" = "buddhism and somewhat serious about it",
                  "buddhism" = "buddhism and very serious about it",
                  "buddhism" = "buddhism but not too serious about it",
                  "catholicism" = "catholicism",
                  "catholicism" = "catholicism and laughing about it",
                  "catholicism" = "catholicism and somewhat serious about it",
                  "catholicism" = "catholicism and very serious about it",
                  "catholicism" = "catholicism but not too serious about it",
                  "christianity" = "christianity",
                  "christianity" = "christianity and laughing about it",
                  "christianity" = "christianity and somewhat serious about it",
                  "christianity" = "christianity and very serious about it",
                  "christianity" = "christianity but not too serious about it",
                  "hinduism" = "hinduism",
                  "hinduism" = "hinduism and laughing about it",
                  "hinduism" = "hinduism and somewhat serious about it",
                  "hinduism" = "hinduism and very serious about it",
                  "hinduism" = "hinduism but not too serious about it",
                  "islam" = "islam",
                  "islam" = "islam and laughing about it",
                  "islam" = "islam and somewhat serious about it",
                  "islam" = "islam and very serious about it",
                  "islam" = "islam but not too serious about it",
                  "judaism" = "judaism",
                  "judaism" = "judaism and laughing about it",
                  "judaism" = "judaism and somewhat serious about it",
                  "judaism" = "judaism and very serious about it",
                  "judaism" = "judaism but not too serious about it",
                  "other" = "other",
                  "other" = "other and laughing about it",
                  "other" = "other and somewhat serious about it",
                  "other" = "other and very serious about it",
                  "other" = "other but not too serious about it"))

#Proportion Female By Religion

profiles_religion %>% 
  group_by(`Grouped Religion`) %>% 
  summarise(`Prop Female` = mean(is_female), `Number`=n()) %>% 
  arrange(desc(`Prop Female`)) %>% 
  kable() 
| Grouped Religion | Prop Female | Number |
|:-----------------|------------:|-------:|
| judaism          |   0.4887024 |   3098 |
| christianity     |   0.4601693 |   5787 |
| catholicism      |   0.4411517 |   4758 |
| other            |   0.4356193 |   7743 |
| buddhism         |   0.4214579 |   1948 |
| NA               |   0.4076931 |  20226 |
| agnosticism      |   0.3702905 |   8812 |
| hinduism         |   0.3266667 |    450 |
| atheism          |   0.2797423 |   6985 |
| islam            |   0.2661871 |    139 |
profiles_religion <- profiles_religion %>% 
  group_by(`Grouped Religion`) %>% 
  mutate(`Prop Female` = mean(is_female)) %>% 
  ungroup() %>% 
  mutate(`Grouped Religion` =fct_reorder(`Grouped Religion`,`Prop Female`))

ggplot(profiles_religion, aes(x=`Grouped Religion`, y= `Prop Female`))  + 
  geom_point() + 
  geom_hline(yintercept= 0.4) +
  xlab("Religion")

mean(profiles$is_female)
## [1] 0.4023121
#The line at y=0.40 on the plot is added as a reference tool, indicating the proportion of female respondents in the entire sample. 

Notes:

-43.4% of respondents who did not report their income were female. This is close to the overall proportion of female respondents, 40%. Therefore, whether a user reports income is not a strong predictor of gender.

-The average reported income for women is 86,633, substantially lower than the average for men, 110,984. Therefore, income can be a strong predictor of sex. The proportion of women also tends to fall as reported income rises, from about 35% at 20,000 to about 3% at 250,000, although it rises again to 24% among the 521 users who reported 1,000,000. Overall, the distribution of female incomes is shifted slightly farther to the left than the distribution of male incomes. Grouping the income variable into bigger buckets, for example high, medium, and low income, would simplify the model.

-Many jobs have gender distributions substantially different from the overall proportion of female respondents, 0.40. Therefore, depending on the specific job, job can be a moderate to strong predictor of sex. Note that the job proportions above are computed only among users who reported an income.

-Religion can also be a strong predictor of sex, depending on the specific religion. For religions with proportions of females close to 0.40, religion is not a powerful predictor of gender. For religions with proportions of females far from 0.40, it is. For example, 26.6% of users who reported Islam are women, compared to the overall 40% of respondents, indicating that religion is a strong predictor of sex for those users. However, only 139 users reported Islam, so it is a strong predictor for a very small subset of the population, and it might not be worth including a separate coefficient for Islam in the model. In contrast, Christianity would be a moderately strong predictor of sex for a larger subset of people, and therefore might be more valuable to include in the model. Finally, 20,226 people did not report a religion, and not reporting religion is not a strong predictor of sex, since the gender breakdown for that subset is very close to 0.40.
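
One way to avoid separate coefficients for very small groups, as discussed above for Islam, is to lump rare levels together before fitting the model. Below is a minimal sketch using forcats' fct_lump_min on toy data. The threshold of 2, the label "other (small groups)", and the toy counts are assumptions for illustration; a threshold for the real data would need to be chosen from the counts in the religion table.

```r
library(forcats)

# Toy religion factor: two common levels and two rare ones (illustrative only)
religion <- factor(c(rep("christianity", 5), rep("atheism", 4),
                     "islam", "hinduism"))

# Levels that appear fewer than `min` times are merged into one level
lumped <- fct_lump_min(religion, min = 2,
                       other_level = "other (small groups)")

table(lumped)
```

The lumped factor could then be used in place of the grouped religion variable when fitting the model.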

Question 2:

In the file HW-2_Shiny_App.Rmd, build the Shiny App discussed in Lec09 on Monday 10/3: Using the movies data set in the ggplot2movies package, make a Shiny app that

  • Plots budget on the x-axis and rating on the y-axis
  • Instead of having a radio button to select the genre of movie (Action, Animation, Comedy, etc), have a radio button that allows you to toggle between comedies and non-comedies. This app should be simpler.
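
The app described above could be sketched roughly as follows. This is a standalone app rather than the .Rmd version the assignment asks for, and the input ID "comedy", the output ID "scatter", and the object names are assumptions for illustration. It relies on the budget, rating, and Comedy (0/1) columns of ggplot2movies::movies.

```r
library(shiny)
library(ggplot2)
library(ggplot2movies)

# Keep only movies with a reported budget
movies_budget <- subset(movies, !is.na(budget))

ui <- fluidPage(
  titlePanel("Budget vs. rating"),
  # Radio button toggles between comedies and non-comedies
  radioButtons("comedy", "Type of movie:",
               choices = c("Comedies" = "1", "Non-comedies" = "0")),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # input$comedy is a character string, so convert before comparing
    chosen <- movies_budget[movies_budget$Comedy == as.numeric(input$comedy), ]
    ggplot(chosen, aes(x = budget, y = rating)) +
      geom_point(alpha = 0.3) +
      labs(x = "Budget", y = "Rating")
  })
}

# Build the app object; launch it with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

In an .Rmd file with runtime: shiny, the radioButtons() and renderPlot() calls would typically go directly in code chunks instead of in separate ui and server objects.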