
Administrative:

Please indicate

  • Who you collaborated with: Brenda, Trisha, and Connor.
  • Roughly how much time you spent on this HW so far: four hours.
  • The URL of the RPubs published URL here.
  • What gave you the most trouble: The Shiny app inputs, and figuring out how to group users into low, middle, and high income to then show in plots (never achieved).
  • Any comments you have: Quizzes should be worth less, homeworks more.

Question 1:

Perform an Exploratory Data Analysis (EDA) on the profiles data set, specifically on the relationship between gender and

  • income
  • job
  • One more categorical variable of your choice

all while keeping in mind that in HW-3, you will be fitting a logistic regression to predict a user’s gender based on these variables.

To begin the exploratory data analysis, let’s look at the profile data stored by OkCupid. Let’s exclude the essays from this dataset, since they are highly subjective and too lengthy for this analysis.

| age | body_type | diet | drinks | drugs | education | ethnicity | height | income | job | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | a little extra | strictly anything | socially | never | working on college/university | asian, white | 75 | -1 | transportation | south san francisco, california | doesn’t have kids, but might want them | straight | likes dogs and likes cats | agnosticism and very serious about it | m | gemini | sometimes | english | single |
| 35 | average | mostly other | often | sometimes | working on space camp | white | 70 | 80000 | hospitality / travel | oakland, california | doesn’t have kids, but might want them | straight | likes dogs and likes cats | agnosticism but not too serious about it | m | cancer | no | english (fluently), spanish (poorly), french (poorly) | single |
| 38 | thin | anything | socially | NA | graduated from masters program | NA | 68 | -1 | NA | san francisco, california | NA | straight | has cats | NA | m | pisces but it doesn’t matter | no | english, french, c++ | available |
| 23 | thin | vegetarian | socially | NA | working on college/university | white | 71 | 20000 | student | berkeley, california | doesn’t want kids | straight | likes cats | NA | m | pisces | no | english, german (poorly) | single |
| 29 | average | mostly anything | socially | NA | graduated from college/university | white | 67 | -1 | computer / hardware / software | san francisco, california | doesn’t have kids, but might want them | straight | likes cats | atheism | m | taurus | no | english (fluently), chinese (okay) | single |
| 32 | fit | strictly anything | socially | never | graduated from college/university | white, other | 65 | -1 | NA | san francisco, california | NA | straight | likes dogs and likes cats | NA | f | virgo | NA | english | single |
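Dropping the essays can be sketched roughly as below. This is only a sketch: `profiles_toy` is a hypothetical stand-in for the real profiles data, and it assumes the essay columns all share the prefix `essay`.

```r
library(dplyr)

# Hypothetical stand-in for the profiles data (not the real rows);
# assumes the essay columns are named with the prefix "essay"
profiles_toy <- data.frame(
  age = c(22, 35),
  sex = c("m", "m"),
  income = c(-1, 80000),
  essay0 = c("about me ...", "i am ..."),
  essay1 = c("currently ...", "chef ..."),
  stringsAsFactors = FALSE
)

# Drop every column whose name starts with "essay"
profiles_small <- profiles_toy %>%
  select(-starts_with("essay"))

names(profiles_small)
```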

Mean Age for both Sexes:

  • The dataset does not include sex categories beyond female and male (e.g., transgender), so we continue the analysis under a binary assumption.

Mean Age for OkCupid
sex MeanAge
f 32
m 32
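A table like this one could come from a grouped summary along these lines. The data frame `toy` is made up for illustration, and rounding the mean to a whole year is an assumption based on how the table looks.

```r
library(dplyr)

# Made-up ages, only to show the shape of the computation
toy <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  age = c(28, 33, 25, 31, 40)
)

# One row per sex with the rounded mean age
mean_age <- toy %>%
  group_by(sex) %>%
  summarise(MeanAge = round(mean(age, na.rm = TRUE)))

mean_age
```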

Age range using a boxplot to explore variation:
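A boxplot of age by sex might be built as in this sketch. The ages in `toy` are invented, and the axis labels are my own choice rather than those of the original figure.

```r
library(ggplot2)

# Invented ages, five per sex, just to have something to draw
toy <- data.frame(
  sex = rep(c("f", "m"), each = 5),
  age = c(22, 27, 31, 38, 55, 19, 26, 30, 35, 62)
)

# One box per sex; points beyond the whiskers are drawn as outliers
age_boxplot <- ggplot(toy, aes(x = sex, y = age)) +
  geom_boxplot() +
  labs(x = "Sex", y = "Age (years)")

# print(age_boxplot) draws the figure
```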

job job_tot job_percent
artistic / musical / writer 2783 7.828190
banking / financial / real estate 1366 3.842367
clerical / administrative 477 1.341734
computer / hardware / software 2894 8.140418
construction / craftsmanship 624 1.755225
education / academia 2120 5.963264
entertainment / media 1324 3.724227
executive / management 1441 4.053332
hospitality / travel 877 2.466879
law / legal services 818 2.300920
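Totals and percentages per job like the ones above can be sketched as follows. The rows in `toy` are hypothetical, and dropping `NA` jobs before computing the percentages is an assumption; the table itself does not say which denominator was used.

```r
library(dplyr)

# Hypothetical job column with one missing value
toy <- data.frame(
  job = c("student", "student", "other", "law / legal services", NA),
  stringsAsFactors = FALSE
)

job_summary <- toy %>%
  filter(!is.na(job)) %>%                  # assumption: NA jobs are excluded
  group_by(job) %>%
  summarise(job_tot = n()) %>%             # profiles per job
  mutate(job_percent = 100 * job_tot / sum(job_tot))

job_summary
```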

We cannot ignore the -1 income category, so let’s review how often users report -1, by gender:

sex job_amt percent_gender
f 12143 43
m 15943 57
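Counting the -1 incomes by sex could look roughly like this. The data in `toy` is made up; the column name `percent_gender` follows the table above, and rounding to whole percents is an assumption.

```r
library(dplyr)

# Made-up incomes; -1 marks the category discussed above
toy <- data.frame(
  sex = c("f", "f", "m", "m", "m", "m"),
  income = c(-1, 40000, -1, -1, 60000, -1)
)

unreported <- toy %>%
  filter(income == -1) %>%                          # keep only the -1 rows
  count(sex) %>%                                    # rows per sex
  mutate(percent_gender = round(100 * n / sum(n)))  # share of all -1 rows

unreported
```

As a check against the table, 12143 / (12143 + 15943) is about 0.43, which matches the 43 reported for `f`.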

Median Incomes of Female and Male Profiles

sex medianIncome
f 40000
m 60000

These are our median incomes, but let’s look at the variation within each group.

sex sd_income median_income
f 196512.7 40000
m 216860.0 60000
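A summary like this one can be sketched as below, assuming the -1 incomes are filtered out first. The incomes in `toy` are invented; a single very large value is included to show how it inflates the standard deviation while leaving the median alone.

```r
library(dplyr)

# Invented incomes; -1 rows are dropped before summarising (assumption)
toy <- data.frame(
  sex = c("f", "f", "f", "f", "m", "m", "m", "m"),
  income = c(-1, 20000, 40000, 1000000, -1, 30000, 60000, 80000)
)

income_summary <- toy %>%
  filter(income != -1) %>%
  group_by(sex) %>%
  summarise(
    sd_income = sd(income),
    median_income = median(income)
  )

income_summary
```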

Overall, the median incomes for males and females look close to the national average. We can question the validity of the data, considering it is all self-reported. These summaries also leave out the -1 values.

Here’s a sample of our users’ most common jobs:

job sex Job_amt prop
computer / hardware / software m 2509 0.8669661
science / tech / engineering m 2443 0.7893376
other m 2432 0.5167871
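The `prop` column appears to be the share of each sex within a job: 2509 / 2894 is about 0.867 for computer / hardware / software. Under that reading, the table could be built roughly as follows, with `toy` standing in for the real data.

```r
library(dplyr)

# Hypothetical job and sex columns
toy <- data.frame(
  job = c("computer", "computer", "computer", "other", "other"),
  sex = c("m", "m", "f", "m", "f"),
  stringsAsFactors = FALSE
)

job_by_sex <- toy %>%
  count(job, sex, name = "Job_amt") %>%     # profiles per job and sex
  group_by(job) %>%
  mutate(prop = Job_amt / sum(Job_amt)) %>% # share of each sex within the job
  ungroup() %>%
  arrange(desc(Job_amt))

job_by_sex
```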

The same data as a bar chart of proportions, which makes the job breakdown by sex easier to see. We have also included a line showing the overall percentage of female profiles, to highlight the disparities between job type and gender.


For another categorical variable, I am interested in diet. First, let’s see whether enough users report it and how frequent each category is.

diet n
mostly anything 3673
anything 1175
strictly anything 1030
mostly vegetarian 660
mostly other 237
strictly vegetarian 174
strictly other 138
mostly vegan 79
vegetarian 77
other 64

Let’s recode these into three similar categories. About one in seven users (1081 of 7465) is vegetarian. Interesting, let’s keep digging.

diet_code n
omnivore 6317
other 67
veggie 1081

diet_code sex n
omnivore f 1626
omnivore m 4691
other f 10
other m 57
veggie f 425
veggie m 656
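A recode into three groups might look like the sketch below. The mapping here is an assumption (anything containing "veg" becomes `veggie`, anything containing "anything" becomes `omnivore`, everything else `other`) and may differ from the mapping that produced the tables above; the values in `toy` are only examples.

```r
library(dplyr)

# Example diet values, including one missing
toy <- data.frame(
  diet = c("mostly anything", "strictly vegetarian", "mostly vegan",
           "anything", "mostly kosher", NA),
  stringsAsFactors = FALSE
)

diet_counts <- toy %>%
  filter(!is.na(diet)) %>%
  mutate(diet_code = case_when(
    grepl("veg", diet) ~ "veggie",         # vegetarian and vegan variants
    grepl("anything", diet) ~ "omnivore",  # "anything" variants
    TRUE ~ "other"                         # whatever is left (assumption)
  )) %>%
  count(diet_code)

diet_counts
```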

I am interested in the diet proportions, so let’s check the share of each sex within each diet group to see if diet is a valuable predictor.

diet_code sex n prop
omnivore f 1626 0.2574007
omnivore m 4691 0.7425993
other f 10 0.1492537
other m 57 0.8507463
veggie f 425 0.3931545
veggie m 656 0.6068455

These proportions and diet totals make me believe diet will be a valuable predictor. For example, about 39% of veggie profiles are female, compared with about 26% of omnivore profiles. Note that this is the share of females among vegetarians; the share of vegetarians among females is lower (425 of 2061, about 21%). Let’s explore this.
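Since the two directions of the proportion are easy to mix up, here is a sketch that computes both from the counts in the table above. The column names `prop_sex_given_diet` and `prop_diet_given_sex` are my own.

```r
library(dplyr)

# Counts copied from the diet_code / sex table above
diet_sex <- data.frame(
  diet_code = rep(c("omnivore", "other", "veggie"), each = 2),
  sex = rep(c("f", "m"), times = 3),
  n = c(1626, 4691, 10, 57, 425, 656),
  stringsAsFactors = FALSE
)

diet_sex <- diet_sex %>%
  group_by(diet_code) %>%
  mutate(prop_sex_given_diet = n / sum(n)) %>%  # share of each sex within a diet
  group_by(sex) %>%
  mutate(prop_diet_given_sex = n / sum(n)) %>%  # share of each diet within a sex
  ungroup()

diet_sex
```

The first new column reproduces the `prop` values in the table; the second answers the question "given a user's sex, how likely is each diet?".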

Let’s plot this to test the hypothesis:

Question 2:

In the file HW-2_Shiny_App.Rmd, build the Shiny App discussed in Lec09 on Monday 10/3: Using the movies data set in the ggplot2movies package, make a Shiny app that

  • Plots budget on the x-axis and rating on the y-axis
  • Instead of having a radio button to select the genre of movie (Action, Animation, Comedy, etc), have a radio button that allows you to toggle between comedies and non-comedies. This app should be simpler.
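One way such an app could be laid out is sketched below. This is not the Lec09 solution, just a minimal version under a few assumptions: movies with a missing budget are dropped, and the input id `comedy`, the output id `scatter`, and the labels are names I picked.

```r
library(shiny)
library(ggplot2)
library(ggplot2movies)  # provides the `movies` data frame

# Assumption: only movies with a reported budget are plotted
movies_budget <- subset(movies, !is.na(budget))

ui <- fluidPage(
  titlePanel("Budget vs. rating"),
  radioButtons(
    inputId = "comedy",
    label = "Movie type:",
    choices = c("Comedies" = "1", "Non-comedies" = "0")
  ),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # radioButtons returns the selected value as a character string;
    # the Comedy column is coded 0/1
    chosen <- subset(movies_budget, Comedy == as.integer(input$comedy))
    ggplot(chosen, aes(x = budget, y = rating)) +
      geom_point(alpha = 0.3) +
      labs(x = "Budget", y = "Rating")
  })
}

app <- shinyApp(ui = ui, server = server)
# runApp(app) starts the app in an interactive session
```

Using a single two-choice radio button keeps the server logic to one filter on the `Comedy` column, which is what makes this simpler than the multi-genre version.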