OkCupid is an American-based, internationally operating online dating, friendship, and social networking website that features multiple-choice questions in order to match members.
You will work with profile data on 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile.
Below is a preview of the data set and some further details on the variables.
| Variable | Description |
|---|---|
| age | user’s age |
| body_type | user’s body type |
| diet | user’s dietary habits |
| drinks | user’s alcohol drinking habits |
| drugs | user’s drug habits |
| education | highest degree or education level |
| ethnicity | ethnicity |
| height | height in inches |
| income | income in dollars |
| job | job status or profession |
| location | city/area, state |
| offspring | child status |
| orientation | sexual orientation |
| pets | number of pets |
| religion | religious affiliation |
| sex | biological sex |
| sign | astrological sign |
| smokes | user’s smoking habits |
| speaks | languages spoken |
| status | user’s relationship status |
| essay0 | user’s self summary trimmed to 140 characters |
To get started, read in okcupid.csv (available on Google Classroom) and save it as an object called cupid.
In the questions that follow, you must use functions from the dplyr, tidyr, and ggplot2 packages as often as is practical. Be sure to also use the %>% operator.
Remove the variable essay0. Separate the variable location into variables area and state. You can save the resulting data frame as an R object named cupid and work with this data frame for the remaining questions.
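As a rough illustration of the two operations named above, the sketch below drops a column with select() and splits a "city/area, state" string with separate(). The data frame toy and its values are invented for this example, and the ", " separator is an assumption about how location is formatted; check it against the real data before relying on it.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real data; all values are invented for illustration
toy <- tibble(
  age      = c(22, 35, 29),
  location = c("san francisco, california",
               "oakland, california",
               "berkeley, california"),
  essay0   = c("hello", "hi there", NA)
)

toy_clean <- toy %>%
  select(-essay0) %>%                                  # drop the essay column
  separate(location, into = c("area", "state"),
           sep = ", ")                                 # assumed separator

toy_clean
```

Locations that contain more than one comma would need extra handling, for example through the extra argument of separate().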
For each variable, what is the percentage of missing values?
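One possible way to approach a per-variable missingness summary is to take the mean of is.na() over every column. The sketch below uses an invented data frame named toy and assumes missing values are stored as NA; it is an illustration of the technique, not the required solution.

```r
library(dplyr)
library(tidyr)

# Toy data with invented values; NA marks a missing entry
toy <- tibble(
  diet   = c("vegan", NA, "anything", NA),
  drinks = c("socially", "often", NA, "rarely")
)

pct_missing <- toy %>%
  # mean of a logical vector is the proportion of TRUE values
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  # reshape to one row per variable for easier reading
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "pct_missing")

pct_missing
```

The across() helper requires dplyr 1.0.0 or later.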
How many profiles exist for each sex? How many profiles for each sex have an income greater than 0 dollars? Which sex reports an income more often?
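Grouped counts of this kind can be sketched with group_by() and summarise(). The example below runs on an invented data frame named toy and assumes, purely for illustration, that an unreported income is stored as -1; verify how the real data codes unreported incomes.

```r
library(dplyr)

# Toy data with invented values; -1 is an assumed code for "not reported"
toy <- tibble(
  sex    = c("f", "m", "m", "f", "m"),
  income = c(-1, 50000, -1, 20000, 80000)
)

by_sex <- toy %>%
  group_by(sex) %>%
  summarise(
    profiles         = n(),                          # profiles per sex
    with_income      = sum(income > 0, na.rm = TRUE), # incomes above 0
    prop_with_income = with_income / profiles         # reporting rate
  )

by_sex
```

Comparing the reporting rate rather than the raw count accounts for the sexes having different numbers of profiles.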
Recreate the plot below based on the five most reported profile ethnicities: white, asian, hispanic / latin, black, other. Incomes are only for those who reported less than 250,000 dollars.
Create a bar plot that depicts the relationship between two categorical variables. These two variables can be from the data set or ones that you created based on the data.
What are the 10 most common jobs for females, and what is the corresponding 75th percentile of income for each of the 10 jobs? All of the results should be in one data frame. Do the same for males.
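A minimal sketch of the general pattern (count per group, compute a percentile, keep the most frequent groups) is shown below. The data frame toy, its values, and the helper variable top_n_jobs are invented for illustration, and only the top 2 jobs are kept so the toy output stays small. Whether unreported incomes should be excluded before computing the percentile is a decision the sketch leaves open.

```r
library(dplyr)

# Toy data with invented values
toy <- tibble(
  sex    = "f",
  job    = c("student", "student", "student", "artist", "artist", "other"),
  income = c(10000, 20000, 30000, 40000, 60000, 50000)
)

top_n_jobs <- 2  # hypothetical setting; the question asks for 10

female_jobs <- toy %>%
  filter(sex == "f") %>%
  group_by(job) %>%
  summarise(
    profiles   = n(),
    income_p75 = unname(quantile(income, 0.75))  # 75th percentile per job
  ) %>%
  arrange(desc(profiles)) %>%                     # most common jobs first
  slice_head(n = top_n_jobs)

female_jobs
```

Note that quantile() uses its default interpolation method (type 7) here, and slice_head() requires dplyr 1.0.0 or later.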
Create any plot to examine the relationship between at least two variables. The plot can be based on variables in the data set or new variables you created.
How would you describe the typical (most common) male as per the profile data? What about the typical (most common) female?
The deadline to submit Homework 2 is 11:59pm on Tuesday, February 19. Submit your work by uploading only your Rmd file through Google Classroom. Late work will not be accepted except under certain extraordinary circumstances.
Post your questions in the #hw2 channel on Slack. If you are trying to get help on a code error, explain your error in detail or give a reproducible example that generates the same error. Make use of the code snippet option available in Slack.
Feel free to visit Scott or me in office hours or make an appointment.
Communicate with your classmates, but do not share large snippets of code.
Neither Scott nor I will answer questions within the first 24 hours of this homework being assigned or within 6 hours of the deadline.
This is an individual assignment. However, you may discuss ideas, how to debug code, and how to approach a problem with your classmates. You may not copy and paste another individual’s code from this class. As a reminder, below is the policy on sharing and using others’ code.
Similar reproducible examples (reprex) exist online that will help you answer many of the questions posed on in-class assignments, pre-class assignments, homework assignments, and midterm exams. Use of these resources is allowed unless it is written explicitly on the assignment. You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.
You must use R Markdown. Formatting is at your discretion but is graded. Use the in-class assignments and resources available online for inspiration. Another useful resource for R Markdown formatting is available at: https://holtzy.github.io/Pimp-my-rmd/
| Topic | Points |
|---|---|
| Questions 1-8 | 64 |
| R Markdown formatting | 9 |
| Communication of results | 7 |
| Knit | 7 |
| Code style | 7 |
| Named code chunks | 6 |
| Total | 100 |
A bonus of up to 3 points can be earned for implementing a plot with a geom we did not discuss in class. You can also earn the 3 points if you use one of the ggplot2 extension packages.