MATH 216 Homework 2

Administrative:

Please indicate

Who you collaborated with: Amanda Hotvedt
Roughly how much time you spent on this HW so far: 7
The URL of the published RPubs page:
What gave you the most trouble:
Any comments you have:

After our discussion about making smaller buckets for the income variable (e.g. high income, medium income, low income), I tried to use fct_recode, as I did with the religion variable, to group the categorical variables. While I understand it conceptually, I wasn’t able to get the code to work.
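One likely cause is that income is stored as a number in this data set (note the -1 code for non-reporters), while fct_recode expects a factor or character vector. A minimal sketch of one alternative, using dplyr's case_when on a toy data frame; the cut points of 50,000 and 100,000 and the bucket names are illustrative assumptions, not values from the assignment:

```r
library(dplyr)

# Toy stand-in for the income column; -1 marks "not reported" as in profiles.
toy <- data.frame(income = c(-1, 20000, 50000, 80000, 150000, 1000000))

# Hypothetical cut points (50,000 and 100,000), chosen only for illustration.
toy <- toy %>%
  mutate(income_bucket = case_when(
    income == -1    ~ "not reported",
    income < 50000  ~ "low",
    income < 100000 ~ "medium",
    TRUE            ~ "high"
  ))

print(toy)
```

The conditions are evaluated in order, so the -1 code is set aside before the numeric comparisons are applied.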

Question 1:

Perform an Exploratory Data Analysis (EDA) on the profiles data set, specifically on the relationship between gender and

income
job
One more categorical variable of your choice

all while keeping in mind that in HW-3, you will be fitting a logistic regression to predict a user’s gender based on these variables.

#Select Only the Needed Variables From the Dataset

profiles <- profiles %>%
  mutate(is_female = ifelse(sex=="f", 1, 0))

profiles_less <- profiles %>% 
  select(religion, is_female, income, job)

#Gender and Income

#Number of People Who Did Not Report Income

profiles_less %>% 
  filter(income==-1) %>% 
  tally()

## # A tibble: 1 × 1
##       n
##   <int>
## 1 48442

#Number and Proportion Female Who Did Not Report Income

profiles_less %>%
  filter(income==-1) %>% 
  group_by(is_female) %>% 
  tally() %>% 
  kable()

is_female	n
0	27438
1	21004

profiles_less %>%
  filter(income==-1) %>% 
  summarise(`Prop Female of Non-Reporters`=mean(is_female)) %>% 
  kable()

Prop Female of Non-Reporters

                0.4335907

#Dataset without -1 Income Values

profiles_less2 <- profiles_less %>% 
  filter(income != -1) 

#Distribution of Income

profiles_less %>% 
  group_by(income) %>% 
  tally() %>% 
  kable()

income	n
-1	48442
20000	2952
30000	1048
40000	1005
50000	975
60000	736
70000	707
80000	1111
100000	1621
150000	631
250000	149
500000	48
1000000	521

#Distribution of Reported Incomes

ggplot(data=profiles_less2, aes(x=income)) + geom_histogram(binwidth = 10000)

#Average Income by Gender

profiles_less2 %>% 
  group_by(is_female) %>% 
  summarise("Avg Income By Gender" = mean(income)) %>% 
  kable()

is_female	Avg Income By Gender
0	110984.39
1	86633.47

#Distribution of Income by Gender

ggplot(data=profiles_less2, aes(x=income, y=..density..)) + 
  geom_histogram(binwidth = 20000) + 
  facet_wrap(~is_female) +
  xlab("Income") +
  ylab("Density")

profiles %>% 
  group_by(income) %>% 
  summarise(mean(is_female)) %>% 
  kable()

income	mean(is_female)
-1	0.4335907
20000	0.3492547
30000	0.3053435
40000	0.3373134
50000	0.3189744
60000	0.3002717
70000	0.2347949
80000	0.2349235
100000	0.1579272
150000	0.1188590
250000	0.0335570
500000	0.0625000
1000000	0.2399232

#Proportion Female by Income

profiles_less %>%
  group_by(income) %>% 
  summarise(`Proportion Female`=mean(is_female), "Number"= n()) %>% 
  kable()

income	Proportion Female	Number
-1	0.4335907	48442
20000	0.3492547	2952
30000	0.3053435	1048
40000	0.3373134	1005
50000	0.3189744	975
60000	0.3002717	736
70000	0.2347949	707
80000	0.2349235	1111
100000	0.1579272	1621
150000	0.1188590	631
250000	0.0335570	149
500000	0.0625000	48
1000000	0.2399232	521

profiles_new <- profiles_less %>%
  group_by(income) %>% 
  mutate(`Proportion Female`=mean(is_female), "Number"= n()) 

ggplot(profiles_new, aes(x=income, y=`Proportion Female`)) +
  geom_point() +
  geom_hline(yintercept= 0.4)

#Gender and Job

#Gender Proportions by Job

profiles_less2 %>% 
  group_by(job) %>% 
  summarise(`Gender Prop by Job` = mean(is_female)) %>% 
  arrange(desc(`Gender Prop by Job`)) %>% 
  kable()

job	Gender Prop by Job
clerical / administrative	0.6568627
education / academia	0.5027100
medicine / health	0.4776579
unemployed	0.3960396
student	0.3286290
retired	0.3285714
hospitality / travel	0.3159420
other	0.3108371
law / legal services	0.3041475
rather not say	0.3033708
artistic / musical / writer	0.3013436
sales / marketing / biz dev	0.2793103
political / government	0.2769231
NA	0.2551320
banking / financial / real estate	0.2410959
entertainment / media	0.2147505
executive / management	0.2082585
transportation	0.1285714
science / tech / engineering	0.1116803
military	0.0714286
computer / hardware / software	0.0666090
construction / craftsmanship	0.0448179

profiles_less2 <- profiles_less2 %>% 
  group_by(job) %>% 
  mutate(`Gender Prop by Job` = mean(is_female)) 

ggplot(profiles_less2, aes(x=job, y= `Gender Prop by Job`))  + 
  geom_point() + 
  coord_flip() +
  geom_hline(yintercept= 0.4) +
  xlab("Job")

## Warning: Removed 341 rows containing missing values (geom_point).

#Average Income by Job and Gender

profiles_less2 %>% 
  group_by(job, is_female) %>% 
  summarise(`Avg Income By Job and Gender` = mean(income)) %>% 
  spread(key=is_female, value= `Avg Income By Job and Gender`) %>% 
  arrange(desc(`0`)) %>% 
  kable()

job	0	1
retired	367872.34	205652.17
rather not say	214516.13	146296.30
NA	192480.31	155402.30
executive / management	148163.27	103275.86
artistic / musical / writer	134313.19	117197.45
banking / financial / real estate	130469.31	98295.45
law / legal services	124437.09	96363.64
medicine / health	120973.45	88483.87
science / tech / engineering	110115.34	99449.54
computer / hardware / software	109879.52	87142.86
political / government	108652.48	92037.04
entertainment / media	107375.69	71111.11
sales / marketing / biz dev	100143.54	74773.66
education / academia	98610.35	72264.15
hospitality / travel	97669.49	109724.77
construction / craftsmanship	93401.76	55000.00
other	87429.38	68622.13
unemployed	85737.70	100000.00
student	79819.82	63803.68
transportation	75163.93	151666.67
military	65604.40	340000.00
clerical / administrative	60714.29	45597.01

#Religion Categories are Grouped By Major Religion

profiles_religion <- profiles_less %>%
  mutate(`Grouped Religion` = fct_recode(religion,
                  "agnosticism" = "agnosticism",
                  "agnosticism" = "agnosticism and laughing about it",
                  "agnosticism" = "agnosticism and somewhat serious about it",
                  "agnosticism" = "agnosticism and very serious about it",
                  "agnosticism" = "agnosticism but not too serious about it",
                  "atheism" = "atheism",
                  "atheism" = "atheism and laughing about it",
                  "atheism" = "atheism and somewhat serious about it",
                  "atheism" = "atheism and very serious about it",
                  "atheism" = "atheism but not too serious about it",
                  "buddhism" = "buddhism",
                  "buddhism" = "buddhism and laughing about it",
                  "buddhism" = "buddhism and somewhat serious about it",
                  "buddhism" = "buddhism and very serious about it",
                  "buddhism" = "buddhism but not too serious about it",
                  "catholicism" = "catholicism",
                  "catholicism" = "catholicism and laughing about it",
                  "catholicism" = "catholicism and somewhat serious about it",
                  "catholicism" = "catholicism and very serious about it",
                  "catholicism" = "catholicism but not too serious about it",
                  "christianity" = "christianity",
                  "christianity" = "christianity and laughing about it",
                  "christianity" = "christianity and somewhat serious about it",
                  "christianity" = "christianity and very serious about it",
                  "christianity" = "christianity but not too serious about it",
                  "hinduism" = "hinduism",
                  "hinduism" = "hinduism and laughing about it",
                  "hinduism" = "hinduism and somewhat serious about it",
                  "hinduism" = "hinduism and very serious about it",
                  "hinduism" = "hinduism but not too serious about it",
                  "islam" = "islam",
                  "islam" = "islam and laughing about it",
                  "islam" = "islam and somewhat serious about it",
                  "islam" = "islam and very serious about it",
                  "islam" = "islam but not too serious about it",
                  "judaism" = "judaism",
                  "judaism" = "judaism and laughing about it",
                  "judaism" = "judaism and somewhat serious about it",
                  "judaism" = "judaism and very serious about it",
                  "judaism" = "judaism but not too serious about it",
                  "other" = "other",
                  "other" = "other and laughing about it",
                  "other" = "other and somewhat serious about it",
                  "other" = "other and very serious about it",
                  "other" = "other but not too serious about it"))

#Proportion Female By Religion

profiles_religion %>%
  group_by(`Grouped Religion`) %>%
  summarise(`Prop Female` = mean(is_female), `Number`=n()) %>%
  arrange(desc(`Prop Female`)) %>%
  kable()

Grouped Religion	Prop Female	Number
judaism	0.4887024	3098
christianity	0.4601693	5787
catholicism	0.4411517	4758
other	0.4356193	7743
buddhism	0.4214579	1948
NA	0.4076931	20226
agnosticism	0.3702905	8812
hinduism	0.3266667	450
atheism	0.2797423	6985
islam	0.2661871	139
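Every religion level listed in the recode above starts with the name of the religion, followed by an optional qualifier. Under that assumption, the grouping could be written more compactly by keeping only the first word. This is a sketch on a toy data frame, not the method used in the analysis; the column name grouped is hypothetical:

```r
library(dplyr)

# Toy stand-in for the religion column, including a missing value.
toy <- data.frame(
  religion = c("agnosticism and laughing about it",
               "catholicism",
               "other but not too serious about it",
               NA),
  stringsAsFactors = FALSE
)

# Drop everything from the first space onward; NA stays NA.
toy <- toy %>%
  mutate(grouped = sub(" .*$", "", religion))

print(toy)
```

The explicit fct_recode version is longer, but it fails loudly on a misspelled level, whereas the first-word shortcut silently accepts whatever text it is given.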

profiles_religion <- profiles_religion%>% 
  group_by(`Grouped Religion`) %>% 
  mutate(`Prop Female` = mean(is_female)) %>% 
  ungroup() %>% 
  mutate(`Grouped Religion` =fct_reorder(`Grouped Religion`,`Prop Female`))

ggplot(profiles_religion, aes(x=`Grouped Religion`, y= `Prop Female`))  + 
  geom_point() + 
  geom_hline(yintercept= 0.4) +
  xlab("Religion")

mean(profiles$is_female)

## [1] 0.4023121

#The line at y=0.40 on the plot is added as a reference tool, indicating the proportion of female respondents in the entire sample.
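The reference line does not have to be typed in by hand. A minimal sketch, using made-up toy data rather than the profiles data, of computing the overall proportion once and passing it to geom_hline; the names toy, by_group, and overall_prop are hypothetical:

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for the profiles data (values are made up).
toy <- data.frame(
  group     = c("a", "a", "b", "b", "b", "c"),
  is_female = c(1, 0, 0, 0, 1, 1)
)

# Overall proportion female, computed from the data instead of hard-coded.
overall_prop <- mean(toy$is_female)

# Proportion female within each group.
by_group <- toy %>%
  group_by(group) %>%
  summarise(prop_female = mean(is_female))

# Points per group, with the overall proportion as a dashed reference line.
p <- ggplot(by_group, aes(x = group, y = prop_female)) +
  geom_point() +
  geom_hline(yintercept = overall_prop, linetype = "dashed")
```

Computing the intercept this way keeps the line aligned with the data if the sample is filtered or updated.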

Notes:

- 43.4% of respondents who did not report their income were female. This is not drastically different from the overall proportion of female respondents, 40%. Therefore, whether a user reported income is not a strong predictor of gender.

- The average female income is 86,633, substantially lower than the average male income of 110,984. Therefore, income can be a strong predictor of sex. Overall, the distribution of female incomes is shifted slightly to the left of the distribution of male incomes. Grouping the income variable into broader buckets, for example high, medium, and low income, would simplify the model.

- Many jobs have gender distributions substantially different from the overall proportion of female respondents, 0.40. Therefore, depending on the specific job, job can be a moderate to strong predictor of sex.

- Religion can also be a strong predictor of sex, depending on the specific religion. For religions with proportions of females close to 0.40, religion is not a powerful predictor of gender; for religions with proportions far from 0.40, it is. For example, 26.6% of users who reported Islam are women, compared to the overall 40% of respondents, which suggests that religion is a strong predictor of sex for that group. However, only 139 users reported Islam, so it is a strong predictor for a very small subset of the population, and it might not be worth including a separate coefficient for Islam in the model. In contrast, Christianity would be a moderately strong predictor of sex for a larger subset of people, and therefore might be more valuable to include in the model. Finally, 20,226 people did not report a religion, and not reporting a religion is not a strong predictor of sex, since the gender breakdown for that subset is very close to 0.40.
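The point about small groups can be made concrete with a rough standard error for each proportion. This is an added illustration rather than part of the original analysis: it uses the simple binomial formula sqrt(p(1 - p)/n) with the proportions and counts from the religion table above, and the helper name prop_se is hypothetical.

```r
# Approximate standard error of a sample proportion (simple binomial formula).
prop_se <- function(p, n) sqrt(p * (1 - p) / n)

# Proportions and counts copied from the religion table above.
groups <- data.frame(
  religion    = c("islam", "christianity", "not reported"),
  prop_female = c(0.2661871, 0.4601693, 0.4076931),
  n           = c(139, 5787, 20226)
)

groups$se <- prop_se(groups$prop_female, groups$n)

print(groups)
```

The standard error for Islam comes out several times larger than for the other two groups (about 0.037 versus under 0.01), which is one way to quantify why a coefficient estimated from 139 users would be much less stable.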

Question 2:

In the file HW-2_Shiny_App.Rmd, build the Shiny App discussed in Lec09 on Monday 10/3: Using the movies data set in the ggplot2movies package, make a Shiny app that

Plots budget on the x-axis and rating on the y-axis
Instead of having a radio button to select the genre of movie (Action, Animation, Comedy, etc), have a radio button that allows you to toggle between comedies and non-comedies. This app should be simpler.
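A minimal sketch of what such an app might look like, not the contents of HW-2_Shiny_App.Rmd. It assumes the shiny, ggplot2, dplyr, and ggplot2movies packages are installed; the names movies_budget, type, and scatter are hypothetical, and movies without a recorded budget are dropped so every plotted point has an x value.

```r
library(shiny)
library(ggplot2)
library(dplyr)
library(ggplot2movies)

# Keep movies with a recorded budget and label each as comedy or non-comedy.
movies_budget <- movies %>%
  filter(!is.na(budget)) %>%
  mutate(type = ifelse(Comedy == 1, "Comedy", "Non-comedy"))

# One radio button with two choices, plus the plot.
ui <- fluidPage(
  radioButtons("type", "Movie type:", choices = c("Comedy", "Non-comedy")),
  plotOutput("scatter")
)

# Redraw the scatterplot whenever the radio button changes.
server <- function(input, output) {
  output$scatter <- renderPlot({
    movies_budget %>%
      filter(type == input$type) %>%
      ggplot(aes(x = budget, y = rating)) +
      geom_point(alpha = 0.3) +
      labs(x = "Budget", y = "Rating")
  })
}

# Build the app object; runApp(app) launches it in an interactive session.
app <- shinyApp(ui = ui, server = server)
```

Because the toggle has only two states, a single derived column is enough, which is what makes this version simpler than one with a button per genre.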

Katherine Hobbs