Introduction

OkCupid is a U.S.-based, internationally operating online dating, friendship, and social networking website that uses multiple-choice questions to match members.

You will work with profile data on 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile.

Data

Below is a preview of the data set and some further details on the variables.

Variable      Description
age           user’s age
body_type     user’s body type
diet          user’s dietary habits
drinks        user’s alcohol drinking habits
drugs         user’s drug habits
education     highest degree or education level
ethnicity     ethnicity
height        height in inches
income        income in dollars
job           job status or profession
location      city/area, state
offspring     child status
orientation   sexual orientation
pets          number of pets
religion      religious affiliation
sex           biological sex
sign          astrological sign
smokes        user’s smoking habits
speaks        languages spoken
status        user’s relationship status
essay0        user’s self summary trimmed to 140 characters

Questions

To get started, read in okcupid.csv (available on Google Classroom) and save it as an object called cupid.

In the questions that follow, you must use functions in the dplyr, tidyr, and ggplot2 packages as often as is practically possible. Be sure to also use the %>% operator.

Question 1

Remove the variable essay0. Separate the variable location into variables area and state. You can save the resulting data frame as an R object named cupid and work with this data frame for the remaining questions.
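
As a rough illustration of the two operations named above, here is a minimal sketch on a made-up three-row data frame (toy and toy_clean are hypothetical names, and the values are invented). It assumes every location has the form "area, state" with a single comma followed by a space, which may not hold for every row of the real data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real data; all values are invented for illustration.
toy <- tibble(
  age      = c(22, 35, 29),
  location = c("san francisco, california",
               "oakland, california",
               "new york, new york"),
  essay0   = c("about me", "hello", NA)
)

toy_clean <- toy %>%
  select(-essay0) %>%                                         # drop the essay column
  separate(location, into = c("area", "state"), sep = ", ")   # split on comma + space

toy_clean
```

If some locations contain more or fewer pieces than expected, separate() will warn, so it is worth checking those warnings on the real data.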

Question 2

For each variable, what is the percentage of missing values?
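
One possible way to approach this, sketched on invented data (toy and pct_missing are hypothetical names): compute the share of NA values in every column and reshape the result into one row per variable. This assumes missing values are stored as NA; a variable that encodes "not reported" with a special value would need to be handled separately.

```r
library(dplyr)
library(tidyr)

# Toy data with some missing entries; values are invented for illustration.
toy <- tibble(
  age    = c(22, 35, NA, 41),
  diet   = c("vegan", NA, NA, "anything"),
  smokes = c("no", "no", "yes", "no")
)

pct_missing <- toy %>%
  # mean() of a logical vector is the proportion of TRUE values
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  # one row per variable instead of one column per variable
  pivot_longer(everything(), names_to = "variable", values_to = "pct_missing")

pct_missing
```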

Question 3

How many profiles exist for each sex? How many profiles for each sex have an income greater than 0 dollars? Which sex reports an income more often?
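
A grouped summary is one way to get all three answers at once. The sketch below uses invented data (toy and sex_summary are hypothetical names) and assumes, purely for illustration, that an unreported income shows up as a non-positive number.

```r
library(dplyr)

# Toy data; the -1 values stand in for "income not reported" (an assumption).
toy <- tibble(
  sex    = c("m", "f", "m", "f", "m"),
  income = c(50000, -1, -1, 20000, 80000)
)

sex_summary <- toy %>%
  group_by(sex) %>%
  summarise(
    profiles        = n(),                            # profiles per sex
    reported_income = sum(income > 0, na.rm = TRUE),  # profiles with income > 0
    prop_reported   = reported_income / profiles      # share reporting an income
  )

sex_summary
```

Comparing prop_reported rather than the raw counts matters when the two groups differ in size.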

Question 4

Recreate the plot below based on the five most reported profile ethnicities: white, asian, hispanic / latin, black, other. Incomes are only for those who reported less than 250,000 dollars.

Question 5

Create a bar plot that depicts the relationship between two categorical variables. These two variables can be from the data set or ones that you created based on the data.

Question 6

What are the 10 most common jobs for females, and what is the corresponding 75th percentile of income for each of the 10 jobs? All of the results should be in one data frame. Do the same for males.
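
The general pattern (count per group, compute a quantile per group, keep the largest groups) can be sketched as follows on invented data. The names toy and top_jobs_f are hypothetical, and the sketch keeps the top 2 jobs only because the toy data is tiny; whether unreported incomes should be excluded before computing the percentile is left for you to decide.

```r
library(dplyr)

# Toy data; values are invented for illustration.
toy <- tibble(
  sex    = c("f", "f", "f", "f", "f", "f"),
  job    = c("student", "student", "student", "artist", "artist", "lawyer"),
  income = c(10000, 20000, 30000, 40000, 60000, 100000)
)

top_jobs_f <- toy %>%
  filter(sex == "f", !is.na(job)) %>%
  group_by(job) %>%
  summarise(
    profiles   = n(),
    # unname() drops the "75%" label that quantile() attaches
    income_q75 = unname(quantile(income, 0.75, na.rm = TRUE))
  ) %>%
  # keep the most common jobs; use n = 10 on the real data
  slice_max(profiles, n = 2, with_ties = FALSE)

top_jobs_f
```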

Question 7

Create any plot to examine the relationship between at least two variables. The plot can be based on variables in the data set or new variables you created.

Question 8

How would you describe the typical (most common) male as per the profile data? What about the typical (most common) female?

Essential details

Deadline and submission

The deadline to submit Homework 2 is 11:59pm on Tuesday, February 19. Submit your work by uploading only your Rmd file through Google Classroom. Late work will not be accepted except under certain extraordinary circumstances.

Help

  • Post your questions in the #hw2 channel on Slack. If you are trying to get help on a code error, explain your error in detail or give a reproducible example that generates the same error. Make use of the code snippet option available in Slack.

  • Feel free to visit Scott or me in office hours or make an appointment.

  • Communicate with your classmates, but do not share large snippets of code.

  • Neither Scott nor I will answer any questions within the first 24 hours of this homework being assigned or within 6 hours of the deadline.

Academic integrity

This is an individual assignment. However, you may discuss ideas, how to debug code, and how to approach a problem with your classmates. You may not copy-and-paste another individual’s code from this class. As a reminder, below is the policy on sharing and using others’ code.

Similar reproducible examples (reprex) exist online that will help you answer many of the questions posed on in-class assignments, pre-class assignments, homework assignments, and midterm exams. Use of these resources is allowed unless the assignment explicitly states otherwise. You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.

Grading

You must use R Markdown. Formatting is at your discretion but is graded. Use the in-class assignments and resources available online for inspiration. Another useful resource for R Markdown formatting is available at: https://holtzy.github.io/Pimp-my-rmd/

Topic                      Points
Questions 1-8              64
R Markdown formatting      9
Communication of results   7
Knit                       7
Code style                 7
Named code chunks          6
Total                      100

A bonus of up to 3 points can be earned for implementing a plot with a geom we did not discuss in class. You can also earn the 3 points if you use one of the ggplot2 extension packages.

References

  1. “OkCupid Profile Data for Introductory Statistics and Data Science Courses”, Journal of Statistics Education, 2015.