Please indicate
Perform an Exploratory Data Analysis (EDA) on the profiles data set, specifically on the relationship between gender and
incomejoblocationall keeping in mind in HW-3, you will be fitting a logistic regression to predict a user’s gender based on these variables.
PROFILES
profiles_no_essay <- profiles %>%
select(-essay0, -essay1, -essay2, -essay3, -essay4, -essay5, -essay6, -essay7, -essay8, -essay9)
INCOME
income <- profiles %>%
select(sex, income) %>%
group_by(sex)
summarize(income,
income = mean(income, na.rm=TRUE))
## # A tibble: 2 × 2
## sex income
## <chr> <dbl>
## 1 f 11181.70
## 2 m 25991.31
The mean income for females is 11,181.70 and for males is 25,991.31. However, inherent issues arise when you ask someone to self report their income, particularly on a dating website where the implicit goal is to find someone to date. Therefore, people may not report their income for privacy reasons or to embelish their income to seem more attractive to potential matches on the website.
income_tally <- income %>%
group_by(income) %>%
tally()
income_tally_sex <- income %>%
group_by(income, sex) %>%
tally()
ggplot(data=income_tally, aes(x=income, y=n)) +
geom_bar(stat="identity") +
labs(x="Income", y="Number of Profiles", title="Income Distribution of OkCupid San Francisco Bay Area Users")
Out of 59,946 individual profiles in the data set, 48,442 report having an income of -1, which means that aprroximately 81% of the data set did not report their actual income on their OkCupid profile.
ggplot(data=income_tally_sex, aes(x=income, y=n, fill=sex)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Income", y="Number of Profiles", title="Income Distribution of OkCupid San Francisco Bay Area Users")
ggplot(data=income, aes(x=income)) +
geom_histogram(binwidth=5000) +
facet_wrap(~sex) +
labs(x="Income", y="Number of Profiles", title="Income Distribution of OkCupid San Francisco Bay Area Users")
Out of the 48442 total who reported their income to be -1, 27438 were male and 21004 were female. The income distributions of both sexes look fairly similar, as both are very right-skewed.
income_f <- income_tally_sex %>%
filter(sex=="f") %>%
mutate(prop=n/24117)
income_m <- income_tally_sex %>%
filter(sex=="m") %>%
mutate(prop=n/35829 )
ggplot(data=income_f, aes(x=income, y=prop)) +
geom_bar(stat="identity") +
labs(x="Income", y="Proportion of Females", title=" Income Distribution of Female OkCupid San Francisco Bay Area Users")
ggplot(data=income_m, aes(x=income, y=prop)) +
geom_bar(stat="identity") +
labs(x="Income", y="Proportion of Males", title=" Income Distribution of Male OkCupid San Francisco Bay Area Users")
Despite the fact that a larger sheer number of males reported having an income of -1, a greater proportion of females reported having an income of -1. As we can see in the plot above 87% of females reported having an income of -1, while 76% of males did so. This may indicate that women are less likely to feel comfortable reporting their actual income.
Now just for the sake of curiosity, let’s see what the distribution of income would look like if we remove all the profiles who reported having an income of -1. I will call this “Actual” Income.
income_actual <- income %>%
filter(income>-1)
summarize(income_actual,
income = median(income, na.rm=TRUE))
## # A tibble: 2 × 2
## sex income
## <chr> <int>
## 1 f 40000
## 2 m 60000
summarize(income_actual,
income = mean(income, na.rm=TRUE))
## # A tibble: 2 × 2
## sex income
## <chr> <dbl>
## 1 f 86633.47
## 2 m 110984.39
In order to determine whether or not the mean and median values are realistic, I looked up the actual values in San Francisco in 2012 source here. According to the American Community Survey (ACS) Census data, the median household income in SF in 2012 was 74,922. I felt like household income was more relevant that family income because household income includes individuals living alone which I believe more likely applies to people who are using a dating website. This values compares to the female median income of 40,000 and male median income of 60,000. The per capita income for SF in 2012 was 40,522 which compares to the female mean income of 86,633 and 110,984. So the median values in the data set are less than the actual SF value while the mean values are highers than the actual SF value. This could mean that while more profiles are lowballing their income on their profile, there are enough people who are grossly overexaggerating their income to bring the data set’s mean above the actual value (e.g. there are 7 individual who claim to be unemployed but also earning an income of 1 million dollars).
ggplot(data=income_actual, aes(x=income)) +
geom_histogram(binwidth=10000)
labs(x="'Actual' Income", y="Number of Profiles", title=" 'Actual' Income Distribution of OkCupid San Francisco Bay Area Users")
## $x
## [1] "'Actual' Income"
##
## $y
## [1] "Number of Profiles"
##
## $title
## [1] " 'Actual' Income Distribution of OkCupid San Francisco Bay Area Users"
##
## attr(,"class")
## [1] "labels"
ggplot(data=income_actual, aes(x=income)) +
geom_histogram(binwidth=10000) +
facet_wrap(~sex) +
labs(x="Income", y="Number of Profiles", title=" Income Distribution of OkCupid San Francisco Bay Area Users Faceted by Sex")
ggplot(data=income_actual, aes(x=income, fill=sex)) +
geom_histogram(binwidth=10000, position="dodge") +
labs(x="Income", y="Number of Profiles", title=" Income Distribution of OkCupid San Francisco Bay Area Users Compared by Sex")
Making Income a Categorical Variable: no_response: people who responded -1 low_income: people who make 30,000 or less middle income: people who make between 30,000 and 70,000 high_income: people who make over 70,000
income_levels <- income %>%
mutate(income_level =
ifelse(income %in% -2:0, "No Response",
ifelse(income %in% 0:20000, "Low Income",
ifelse(income %in% 20001:70000, "Middle Income",
ifelse(income %in% 70000:1000000, "High Income", " "))))) %>%
group_by(income_level, sex) %>%
tally() %>%
group_by(income_level) %>%
mutate(prop=n/sum(n))
ggplot(data=income_levels, aes(x=income_level, y=prop, fill=sex))+
geom_bar(stat="identity", position="dodge")
JOB
job <- profiles %>%
select(sex, job) %>%
group_by(job,sex) %>%
tally() %>%
group_by(job) %>%
mutate(prop=n/sum(n))
ggplot(data=job, aes(x=job, y=prop, fill=sex)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Job Type", y="Number of Profiles", title="SF OkCupid Job Type Distribution Compared by Sex") +
geom_hline(yintercept = 0.40) +
coord_flip()
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(data=job, aes(x=job, y=n, fill=sex)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Job Type", y="Number of Profiles", title="SF OkCupid Job Type Distribution Compared by Sex") +
coord_flip()
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(data=job, aes(x=job, y=n, fill=job, fct_reoder(job, n))) +
geom_bar(stat="identity") +
facet_wrap(~sex) +
theme(axis.text.x = element_blank()) +
labs(x="Job Type", y="Number of Profiles", title="SF OkCupid Job Type Distribution Faceted by Sex")
## Warning: Removed 2 rows containing missing values (position_stack).
Now obviously these plots are pretty unwieldy and difficult to interpret due to the number of different jobs. So I tried to gain a better understand of which jobs are most prevelant for females and for males.
job_f <- job %>%
filter(sex=="f") %>%
mutate(prop=n/24117)
job_f_top8 <- job_f %>%
filter(n>1015)
job_f_top8 %>%
knitr::kable()
| job | sex | n | prop |
|---|---|---|---|
| artistic / musical / writer | f | 1882 | 0.0780362 |
| education / academia | f | 2155 | 0.0893561 |
| medicine / health | f | 2187 | 0.0906829 |
| other | f | 3644 | 0.1510967 |
| sales / marketing / biz dev | f | 1916 | 0.0794460 |
| science / tech / engineering | f | 1016 | 0.0421280 |
| student | f | 2151 | 0.0891902 |
| NA | f | 3656 | 0.1515943 |
job_m <- job %>%
filter(sex=="m") %>%
mutate(prop=n/35829)
job_m_top8 <- job_m %>%
filter(n>1617)
job_m_top8 %>%
knitr::kable()
| job | sex | n | prop |
|---|---|---|---|
| artistic / musical / writer | m | 2557 | 0.0713668 |
| computer / hardware / software | m | 4068 | 0.1135393 |
| executive / management | m | 1618 | 0.0451589 |
| other | m | 3945 | 0.1101063 |
| sales / marketing / biz dev | m | 2475 | 0.0690781 |
| science / tech / engineering | m | 3832 | 0.1069525 |
| student | m | 2731 | 0.0762232 |
| NA | m | 4542 | 0.1267688 |
Both sexes have “N/A”" and “Other”" in their top eight job types, further illustrating the point that people prefer to remain fairly private/vague on their profiles. However, men do this less so than women because computer/hardware/software is second highest job for men whereas for women N/A and other are first and second.
LOCATION
location <- profiles %>%
select(sex, location) %>%
group_by(sex, location) %>%
tally() %>%
group_by(location) %>%
mutate(prop=n/sum(n))
location_f <- location %>%
filter(sex=="f") %>%
mutate(prop=n/24117)
location_f_top8 <- location_f %>%
filter(n>268)
location_f_top8 %>%
knitr::kable()
| sex | location | n | prop |
|---|---|---|---|
| f | alameda, california | 367 | 0.0152175 |
| f | berkeley, california | 1757 | 0.0728532 |
| f | emeryville, california | 269 | 0.0111540 |
| f | oakland, california | 3491 | 0.1447527 |
| f | palo alto, california | 369 | 0.0153004 |
| f | san francisco, california | 12265 | 0.5085624 |
| f | san mateo, california | 479 | 0.0198615 |
| f | san rafael, california | 340 | 0.0140979 |
location_m <- location %>%
filter(sex=="m") %>%
mutate(prop=n/35829)
location_m_top8 <- location_m %>%
filter(n>490)
location_m_top8 %>%
knitr::kable()
| sex | location | n | prop |
|---|---|---|---|
| m | alameda, california | 543 | 0.0151553 |
| m | berkeley, california | 2455 | 0.0685199 |
| m | daly city, california | 491 | 0.0137040 |
| m | hayward, california | 498 | 0.0138994 |
| m | oakland, california | 3723 | 0.1039102 |
| m | palo alto, california | 695 | 0.0193977 |
| m | san francisco, california | 18799 | 0.5246867 |
| m | san mateo, california | 852 | 0.0237796 |
| After | tallying, I realized that so | me of lo | cations that profiles report are not in the San Francisco Bay Area, let alone in California. However out of the top 8 locations for each sex, the top 6 out in the exact same order of number of profiles per city. I chose not to display a plot here because I thought it would be even moreunwieldy as the job type plot given how many locations were reported by users. |
AGE
age <- profiles %>%
select(sex, age) %>%
group_by(sex, age) %>%
tally() %>%
group_by(age) %>%
mutate(prop=n/sum(n))
ggplot(data=age, aes(x=age, y=prop, fill=sex)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Job Type", y="Number of Profiles", title="SF OkCupid Job Type Distribution Compared by Sex") +
geom_hline(yintercept = 0.40) +
coord_flip()
age_avg <- profiles %>%
select(sex, age) %>%
group_by(sex, age) %>%
tally()
summarize(age_avg,
age = mean(age, na.rm=TRUE))
## # A tibble: 2 × 2
## sex age
## <chr> <dbl>
## 1 f 44.75472
## 2 m 44.73585
summarize(age_avg,
age = median(age, na.rm=TRUE))
## # A tibble: 2 × 2
## sex age
## <chr> <int>
## 1 f 44
## 2 m 44
ggplot(data=age, aes(x=age, y=n, col=sex)) +
geom_bar(stat="identity", position="dodge") +
labs(x="Age in Years", y="Number of Profiles", title="SF OkCupid Age Distribution Compared by Sex")
age_f <- age %>%
filter(sex=="f") %>%
mutate(prop=n/24117)
age_m <- age %>%
filter(sex=="m") %>%
mutate(prop=n/35829 )
The age distribution of SF OkCupid users is similar between the sexes, both are skewed right, with similar ranges and the age with the highest proportion of users peaking under 30 years old. However, the female distribution appears to be slightly more right skewed than the male population.
AGE AND INCOME
age_income <- profiles %>%
select(sex, income, age) %>%
filter(income>-1) %>%
group_by(age, sex) %>%
summarise(avg_income=mean(income))
ggplot(data=age_income, aes(x=age, y=avg_income, col=sex)) +
geom_point() +
labs(x="Age", y="Income", title="Relationship between Age and Income for SF OkCupid Users, Compared by Sex")
ggplot(data=age_income, aes(x=age, y=avg_income)) +
geom_point() +
geom_smooth() +
facet_wrap(~sex) +
labs(x="Age", y="Income", title="Relationship between Age and Income for SF OkCupid Users, Faceted by Sex")
The relationship between age and income is similar between males and females, both with income peaking around the age of 35. However, females income drops more dramatically after the age of 35 than males income. Additionally the entire female income vs age relationship is (unsurprisingly, yet disapointingly given the wage gap) shifted down about 25,000 dollars at the peak. The male starting income, under the age of 20, starts out higher than the females and similarly the male ending income, around the age of 70, is higher than the females. ## Question 2:
In the file HW-2_Shiny_App.Rmd, build the Shiny App discussed in Lec09 on Monday 10/3: Using the movies data set in the ggplot2movies data set, make a Shiny app that
Action, Animation, Comedy, etc), have a radio button that allows you to toggle between comedies and non-comedies. This app should be simpler.