607 Final Project: OkCupid Analysis

By: Kelly Shaffer

The okc dataset I’ll be using in the analysis contains the profile details from 59,946
individuals active on the online dating site OkCupid as of June 20th, 2012. These individuals
all live within a 25 mile radius of San Francisco, California.

I hypothesize the following things:

-Smokers are less educated than non-smokers.
-Females are more likely to be vegetarian/vegan than males.
-This population will contain more individuals who identify as something
other than straight than the national average of 3.6%.
-This population is more educated than the national average.
-The males in this population are taller than the females.

Final Project Checklist items pertaining to the deliverables in this file are listed below

for simplicity of grading:

Your project has a recognizable “data science workflow,” such as the OSEMN workflow or Hadley
Wickham’s Grammar of Data Science. [Example: First the data is acquired, then necessary
transformations and clean-up are performed, then the analysis and presentation work is performed]

My project has been organized by Data Acquisition/Import, Data 
Transformation/Clean Up, and Data Analysis/Visualization.

Project includes data from at least two different types of data sources (e.g., two or more of
these: relational or CSV, Neo4J, web page [scraped or API], MongoDB, etc.)

Data source 1 is a csv file of OkCupid profile data.
Data source 2 is the "women" dataset from the datasets
package in R.

Project includes at least one data transformation operation. [Examples: transforming from wide
to long; converting columns to date format]

The following transformations have been performed:
Renamed all columns.
Changed -1 to "Rather Not Say" in `Income`
Extracted date from `Last Login` and created a new column
  `Last Login Date`
    **Converting this to date but getting errors right now
Created a new column `Data Pull Date` as of 2012-06-20
  **Trying to convert this to date as well
Created a new column `Days Since Login` which is the difference
  in days between `Last Login Date` and `Data Pull Date`

Project includes at least one statistical analysis and at least one graphics that describes or validates your data.

A confidence interval and margin of error was calculated
for the proportion of females in the population. A bar plot
displays the proportion.

Project includes at least one graphic that supports your conclusion(s).

Many graphics are included within the project that support my conclusions.

Project includes at least one statistical analysis that supports your conclusion(s).

Many analyses have been performed that support my conclusions.

Project includes at least one feature that we did not cover in class! There are many examples: “I used ggmap; I created a decision tree; I ranked the results; I created my presentation slides directly from R; I figured out to use OAuth 2.0.”

I'm figuring out how to load data from a zipped file on GitHub.

Presentation. Was the presentation delivered in the allotted time (3 to 5 minutes)?

I signed up to present early. Therefore, I'll be granted 8 minutes for my presentation.
I plan to stay within these boundaries.

Presentation. Did you show (at least) one challenge you encountered in code and/or data, and what you did when you encountered that challenge?

Challenge: The csv file of data I'm using is too big to upload to GitHub, and
reproducibility is a requirement of this assignment. If I connect to the data on my local
machine, I will have not met this requirement.

Code and data. Have you delivered the submitted code and data where it is self-contained preferably in rpubs.com and github? Am I able to fully reproduce your results with what you’ve delivered? You won’t receive full credit if your code references data on your local machine!

I will be uploading all code to rpubs and github. The data is too large to upload to GitHub,
so I was unable to link to it directly that way. Therefore, I am providing the link to where
the data can be downloaded and saved to your machine. This should suffice as reproducibility
since you'll have what you need to recreate my analysis. The link to the data is below:
https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip

Code and data. Does all of the delivered code run without errors?

Yes. All of the delivered code runs without errors. I have commented out a section where I
tried to link to the zip file on GitHub but couldn't debug the error, despite hours spent
on stackoverflow.

Code and data. Have you delivered your code and conclusions using a “reproducible research” tool such as RMarkdown?

Yes, the file has been delivered in RMarkdown.

Data Acquisition, Import, Validation

library(datasets)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(rio)
## Warning: package 'rio' was built under R version 3.3.3
library(downloader)
library(mosaic)
## Warning: package 'mosaic' was built under R version 3.3.3
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: mosaicData
## Warning: package 'mosaicData' was built under R version 3.3.3
## Loading required package: Matrix
## 
## The 'mosaic' package masks several functions from core packages in order to add additional features.  
## The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cov, D, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(psych)
## Warning: package 'psych' was built under R version 3.3.3
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:mosaic':
## 
##     logit, read.file, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
#Loading the data in locally - this file is too big for GitHub
okc <- import("C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/profiles.csv", stringsAsFactors = FALSE, )

#take a look at the beginning and end of the data to make sure everything loaded in properly
#head(okc)
#tail(okc)

#Trying to connect to the original zip file on github
#Currently getting 'error 1 in extracting from zip file'
#    download_repo <- function(repo, user){
#      require(downloader)
#      url <- sprintf("https://github.com/%s/%s/blob/master/profiles.csv.zip", user, repo)
#      tmp <- tempfile(fileext = ".zip")
#      download(url, tmp)
#      unzip(tmp)  
#    }
#    profiles <- download_repo('JSE_OkCupid','rudeboybert')

Data Transformation, Clean Up

#Rename the columns
names(okc)[names(okc)=="age"] <- "Age"
names(okc)[names(okc)=="body_type"] <- "Body Type"
names(okc)[names(okc)=="diet"] <- "Diet"
names(okc)[names(okc)=="drinks"] <- "Drinking Habits"
names(okc)[names(okc)=="drugs"] <- "Drug Use"
names(okc)[names(okc)=="education"] <- "Education"
names(okc)[names(okc)=="essay0"] <- "Self Summary"
names(okc)[names(okc)=="essay1"] <- "Doing with Life"
names(okc)[names(okc)=="essay2"] <- "Really Good At"
names(okc)[names(okc)=="essay3"] <- "First Thing Noticed"
names(okc)[names(okc)=="essay4"] <- "Favorites"
names(okc)[names(okc)=="essay5"] <- "6 Things"
names(okc)[names(okc)=="essay6"] <- "Thinking About"
names(okc)[names(okc)=="essay7"] <- "Friday Night"
names(okc)[names(okc)=="essay8"] <- "Most Private Thing"
names(okc)[names(okc)=="essay9"] <- "Message Me If"
names(okc)[names(okc)=="ethnicity"] <- "Ethnicity"
names(okc)[names(okc)=="height"] <- "Height (in)"
names(okc)[names(okc)=="income"] <- "Income (in USD)"
names(okc)[names(okc)=="job"] <- "Job"
names(okc)[names(okc)=="last_online"] <- "Last Online"
names(okc)[names(okc)=="location"] <- "Location"
names(okc)[names(okc)=="offspring"] <- "Offspring"
names(okc)[names(okc)=="orientation"] <- "Sexual Orientation"
names(okc)[names(okc)=="pets"] <- "Pets"
names(okc)[names(okc)=="religion"] <- "Religion"
names(okc)[names(okc)=="sex"] <- "Gender"
names(okc)[names(okc)=="sign"] <- "Sign"
names(okc)[names(okc)=="smokes"] <- "Smokes"
names(okc)[names(okc)=="speaks"] <- "Speaks"
names(okc)[names(okc)=="status"] <- "Status"

#Convert numeric variables from int to dbl
okc$`Height (in)` = as.numeric(as.character(okc$`Height (in)`))
okc$`Age` = as.numeric(as.character(okc$Age))

#Change -1 values to "Rather Not Say" for ease of understanding/analysis
okc$`Income (in USD)` <- gsub("-1", "Rather Not Say", okc$`Income (in USD)`)

#Simplify the education values since it was cluttering my mosaic plot
#graduated
okc$Education <- gsub("graduated from high school", "G from HS", okc$Education)
okc$Education <- gsub("graduated from college/university", "G from Uni", okc$Education)
okc$Education <- gsub("graduated from two-year college", "G from 2YR", okc$Education)
okc$Education <- gsub("graduated from masters program", "G from M", okc$Education)
okc$Education <- gsub("graduated from law school", "G from Law", okc$Education)
okc$Education <- gsub("graduated from med school", "G from Med", okc$Education)
okc$Education <- gsub("graduated from ph.d program", "G from PhD", okc$Education)
okc$Education <- gsub("graduated from space camp", "G from SC", okc$Education)

#working on
okc$Education <- gsub("working on high school", "W on HS", okc$Education)
okc$Education <- gsub("working on college/university", "W on Uni", okc$Education)
okc$Education <- gsub("working on two-year college", "W on 2YR", okc$Education)
okc$Education <- gsub("working on masters program", "W on M", okc$Education)
okc$Education <- gsub("working on law school", "W on LS", okc$Education)
okc$Education <- gsub("working on med school", "W on Med", okc$Education)
okc$Education <- gsub("working on ph.d program", "W on PhD", okc$Education)
okc$Education <- gsub("working on space camp", "W on SC", okc$Education)

#dropped out of
okc$Education <- gsub("dropped out of high school", "DO of HS", okc$Education)
okc$Education <- gsub("dropped out of college/university", "DO of Uni", okc$Education)
okc$Education <- gsub("dropped out of two-year college", "DO of 2YR", okc$Education)
okc$Education <- gsub("dropped out of masters program", "DO of M", okc$Education)
okc$Education <- gsub("dropped out of law school", "DO of LS", okc$Education)
okc$Education <- gsub("dropped out of med school", "DO of Med", okc$Education)
okc$Education <- gsub("dropped out of ph.d program", "DO of PhD", okc$Education)
okc$Education <- gsub("dropped out of space camp", "DO of SC", okc$Education)

#graduated
okc$Education <- gsub("high school", "G from HS", okc$Education)
okc$Education <- gsub("college/university", "G from Uni", okc$Education)
okc$Education <- gsub("two-year college", "G from 2YR", okc$Education)
okc$Education <- gsub("masters program", "G from M", okc$Education)
okc$Education <- gsub("law school", "G from Law", okc$Education)
okc$Education <- gsub("med school", "G from Med", okc$Education)
okc$Education <- gsub("ph.d program", "G from PhD", okc$Education)
okc$Education <- gsub("space camp", "G from SC", okc$Education)

#Create a new column with date not including the time
`Last Date Online` <- substr(okc$`Last Online`, 1, 10)
okc$`Last Date Online` <- `Last Date Online`

#create new field with as of date
`Data Pull Date` <- "2012-06-20"
okc$`Data Pull Date` <- `Data Pull Date`

Data Analysis/Visualization

#Divide the data into male/female
okc_f <- subset(okc, Gender == "f")
okc_m <- subset(okc, Gender == "m")

#Create the histogram of female heights of okc users
hist(okc_f$`Height (in)`,  ylim = c(0, 4000), xlim = c(55, 75), main = "Histogram of Female Height of OkCupid Users", xlab = "Height", breaks = 125)

#Look at how normal it really is
qqnorm(okc_f$`Height (in)`)
qqline(okc_f$`Height (in)`)

#Create the histogram of female heights from datasets
data(women)
hist(women$height, main = "Histogram of Female Height", xlab = "Height", breaks = 6)

#Table of aggregated smoker data
Smoke <- table(okc$Smokes)

#Relative frequency table of smoker data
Smoke_Rel <- table(okc$Smokes)/59946
barplot(Smoke)

barplot(Smoke_Rel)

#Comparing gender and smoking
table(okc$Gender, okc$Smokes)/59946
##    
##                          no   sometimes trying to quit when drinking
##   f 0.034230808 0.304624162 0.022787175    0.008007206   0.019384112
##   m 0.057718613 0.427634871 0.040386348    0.016681680   0.031328195
##    
##             yes
##   f 0.013278617
##   m 0.023938211
mosaicplot(table(okc$Gender, okc$Smokes), main = "Mosaic Plot of Gender & Smoking")

((25635+18261)/59946)
## [1] 0.732259
#Comparing gender and sexual orientation
table(okc$Gender, okc$`Sexual Orientation`)/59946
##    
##       bisexual        gay   straight
##   f 0.03329663 0.02649051 0.34252494
##   m 0.01286158 0.06647650 0.51834985
mosaicplot(table(okc$Gender, okc$`Sexual Orientation`), main = "Mosaic Plot of Gender & Sexual Orientation")

#Comparing gender and body type
barplot(table(okc$Gender, okc$`Body Type`), legend.text = TRUE, args.legend = list(x = "topright", bty = "n"), main = "Stacked Bar Chart of Gender & Body Type", xlab = "Body Type", ylab = "Frequency")

(table(okc$Gender, okc$`Body Type`))/59946
##    
##                 a little extra    athletic     average       curvy
##   f 0.045090582    0.013695659 0.038518000 0.093751043 0.063573883
##   m 0.043255597    0.030160478 0.158642779 0.150668935 0.001885030
##    
##             fit full figured      jacked  overweight rather not say
##   f 0.073916525  0.014513062 0.002151937 0.002418844    0.001768258
##   m 0.138124312  0.002318754 0.004871051 0.004987822    0.001534715
##    
##          skinny        thin     used up
##   f 0.010025690 0.041187068 0.001701531
##   m 0.019617656 0.037400327 0.004220465
#Comparing gender and diet
barplot(table(okc$Gender, okc$`Diet`), legend.text = TRUE, args.legend = list(x = "topright", bty = "n"), main = "Stacked Bar Chart of Gender & Diet", xlab = "Diet", ylab = "Frequency")

(table(okc$Gender, okc$`Diet`))
##    
##           anything halal kosher mostly anything mostly halal mostly kosher
##   f  9913     2417     3      6            6358           12            36
##   m 14482     3766     8      5           10227           36            50
##    
##     mostly other mostly vegan mostly vegetarian other strictly anything
##   f          425          181              1913   164              1479
##   m          582          157              1531   167              3634
##    
##     strictly halal strictly kosher strictly other strictly vegan
##   f              2               5            178            107
##   m             16              13            274            121
##    
##     strictly vegetarian vegan vegetarian
##   f                 469    71        378
##   m                 406    65        289
mosaicplot(table(okc$Gender, okc$`Diet`), main = "Mosaic Plot of Gender & Diet")

#Comparing smoking and education
table(okc$Education)
## 
##             DO of 2YR   DO of HS   DO of LS    DO of M  DO of Med 
##       6628        191        102         18        140         12 
##  DO of PhD   DO of SC  DO of Uni G from 2YR  G from HS G from Law 
##        127        523        995       1753       1524       1141 
##   G from M G from Med G from PhD  G from SC G from Uni   W on 2YR 
##       9097        457       1298        715      24760       1074 
##    W on HS    W on LS     W on M   W on Med   W on PhD    W on SC 
##         87        269       1683        212        983        445 
##   W on Uni 
##       5712
table(okc$Smokes, okc$Education)
##                 
##                        DO of 2YR DO of HS DO of LS DO of M DO of Med
##                   1546         8        2        2       7         2
##   no              3666       101       39       11     111         8
##   sometimes        508        19       11        1       6         1
##   trying to quit   216        16        9        1       4         0
##   when drinking    290        13       12        1       9         1
##   yes              402        34       29        2       3         0
##                 
##                  DO of PhD DO of SC DO of Uni G from 2YR G from HS
##                          6       46        39         83        60
##   no                   103      297       600       1169       853
##   sometimes              8       53       111        169       230
##   trying to quit         1       31        59         86       106
##   when drinking          4       42        79        117        69
##   yes                    5       54       107        129       206
##                 
##                  G from Law G from M G from Med G from PhD G from SC
##                          82      578         24         69        42
##   no                    960     7839        408       1157       452
##   sometimes              33      287         13         27        80
##   trying to quit         11       73          4          7        37
##   when drinking          38      263          6         25        53
##   yes                    17       57          2         13        51
##                 
##                  G from Uni W on 2YR W on HS W on LS W on M W on Med
##                        2132       62       3      28    139        9
##   no                  19132      615      45     198   1277      179
##   sometimes            1269      141      19      14     80       11
##   trying to quit        461       56       5       5     32        3
##   when drinking        1257       79       4      18    109        7
##   yes                   509      121      11       6     46        3
##                 
##                  W on PhD W on SC W on Uni
##                        88      44      411
##   no                  798     253     3625
##   sometimes            36      45      615
##   trying to quit        5      22      230
##   when drinking        46      40      458
##   yes                  10      41      373
mosaicplot(table(okc$Smokes, okc$Education), main = "Mosaic Plot of Smoking Habits & Education")

table(okc$Education)/59946
## 
##                 DO of 2YR     DO of HS     DO of LS      DO of M 
## 0.1105661762 0.0031862009 0.0017015314 0.0003002702 0.0023354352 
##    DO of Med    DO of PhD     DO of SC    DO of Uni   G from 2YR 
## 0.0002001802 0.0021185734 0.0087245187 0.0165982718 0.0292429854 
##    G from HS   G from Law     G from M   G from Med   G from PhD 
## 0.0254228806 0.0190337971 0.1517532446 0.0076235278 0.0216528209 
##    G from SC   G from Uni     W on 2YR      W on HS      W on LS 
## 0.0119274013 0.4130384012 0.0179161245 0.0014513062 0.0044873720 
##       W on M     W on Med     W on PhD      W on SC     W on Uni 
## 0.0280752677 0.0035365162 0.0163980916 0.0074233477 0.0952857572
#Male vs. Female Height
#Obviously, we would expect males to be taller in the population
#The same proves true in our sample of okc users
boxplot(okc$`Height (in)` ~ okc$Gender)

histogram(~`Height (in)` | Gender, width=1, layout=c(1,2), xlab="Height in inches", data=okc)

#ci_gender <- inference(okc$Gender, est = "proportion", type = "ci", method = "theoretical", success = "f")

#ME calculation
1.96 * 0.002
## [1] 0.00392

The following line of code runs fine in my R Markdown file, but when I try to knit, it produces the following error:

Error: “Error in eval(expr, envir, enclos) : could not find function”inference" calls: … handle -> withCallingHandler -> withVisible -> eval -> eval"

Code: ci_gender <- inference(okc$Gender, est = “proportion”, type = “ci”, method = “theoretical”, success = “f”)

A screenshot of the output of the function is below for your reference. I have decided to include this analysis in my project despite the error, given the fact that it’s not throwing this error when run in R Markdown.

inference output

inference output

Inference Output: Single proportion – success: f Summary statistics: p_hat = 0.4023 ; n = 59946 Check conditions: number of successes = 24117 ; number of failures = 35829 Standard error = 0.002 95 % Confidence interval = ( 0.3984 , 0.4062 )

The confidence interval for the proportion of females is (0.3984, 0.4062).

Let’s calculate the margin of error for the estimate of the proportion of the proportion of females:

ME = z* x SE
ME = 1.96 x 0.002
ME = 0.00392
ME = 0.392%

Conclusions regarding this population:

Education

41% of users have graduated with their Bachelor’s
15% of users have graduated with their Master’s
66.55% have a Bachelor’s or greater (including those in school for
advanced degrees beyond Bachelor’s)
Individuals with a college degree are most likely to be non-smokers.

Smoking

Overall, 72% of users do not smoke.
42% of male users do not smoke.
30% of female users do not smoke.
This is really interesting, because in data published by the CDC,
less females reporting smoking than males. In the CDC data, it
seemed like about half of the population reported smoking, and half
reported not smoking. This is much different than what we’re seeing
here in the San Francisco population, which I believe speaks to the
local culture. They are much more environmentally friendly and health
conscious in California.
Smoking is more common amongst the undereducated.

Sexual Orientation

86% of users are straight
14% of users are not straight

There are almost three times as many bisexual females than bisexual
males.
3.3% of users are bisexual females
1.2% of users are bisexual males

There are almost three times as many gay males than gay females.
2.6% of users are gay females
6.6% of users are gay males

Body Type

“Curvy” is almost exclusively a body type used by females.
The body types “Athletic” “Skinny” and “Jacked” are used much more
commonly by males to describe themselves.
Males are most commonly describing themselves as “Athletic”,
while females are most commonly describing themselves as “Average”.

Diet

This is much more half and half across gender than I thought it
would be. I expected that more females would be vegetarian/vegan,
but this is not the case. In looking at the mosaic plot of gender
and diet, you can see that they look almost the same with just
slight variations. A fun surprise to have discovered!