Introduction

The OkCupid Profiles dataset contains 60552 observations and 29 columns,which has included typical user demographics information, lifestyle variables, and text responses to 10 essay questions. The original source is ``OkCupid Data for Introductory Statistics and Data Science Courses’’ by Albert Kim and Adriana Escobedo-Land. Data science is gaining more prominence in academia and industry. By analysing the data we hope to give an answer or a suggestion to people’s normal life questions.

In this project, the first topic we are curious to analyze is the relationship between body type, age and the single status. Furthermore, by grouping, clustering, and refining the dataset, we want to see if there is a relationship between heights, income and the single status for the second topic. And the third topic we are looking at is text analysis on essay question 2. We hope to provide users some helpful labels which are the pop up words when they are being asked what are you good at. Such text analysis can show information about how people usually sell themselves to potential lovers.

Import and Preprocess Data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv", 
                col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M"))) #takes about 20 seconds on laptop
colnames(okc) <- tolower(colnames(okc))
original_copy = okc
okc = sqldf("select body_type, age, height, income, status from okc")
okc = sqldf("select * from okc where status = 'available' or status = 'married' or status = 'seeing someone'")

copy <- mutate(okc, age_group =
ifelse(okc$age %in% 18:23, "18-23",
ifelse(okc$age %in% 24:29, "24-29",
ifelse(okc$age %in% 30:35, "30-35",
ifelse(okc$age %in% 36:41, "36-41",
ifelse(okc$age %in% 42:47, "42-47",
ifelse(okc$age %in% 48:53, "48-53",
ifelse(okc$age %in% 54:59, "54-59",
ifelse(okc$age %in% 60:65, "60-65",
"above 65")))))))))

Analysis

Body Type, Age Group, and Single Status

bar <- ggplot(data = okc) +
geom_bar(
mapping = aes(x = body_type, fill = status), position = "fill",
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip() + ggtitle("Percentage Distribution of Different Body_type and Single Status")

bar <- ggplot(data = okc) +
    geom_bar(
        mapping = aes(x = body_type, fill = status),
        width = 1
    ) +
    theme(aspect.ratio = 1) +
    labs(x = NULL, y = NULL)
bar + coord_flip() + ggtitle("Amount of Different Body Shapes and Single Status")

By first graph, we can see that most people filled out average, fit, and athlete. Although it showed the amount of single status in each body type criteria, it is still hard to provide a conclusion. So let’s look at the second graph, jacked body type seems to be the most amount of married.

library('RColorBrewer')

available <- okc[which(okc$status=='available'),]
ggplot(data = available, col= heat.colors(5)) + 
  geom_bar(mapping = aes(x = status, fill = body_type ))

This graph takes out the available stats, we look at the percentage of available people alone, it was surprised to see, fit, and average, and athletes are more likely to being single, whereas jacked are much less likely to being single.

bar <- ggplot(data = copy) +
geom_bar(
mapping = aes(x = age_group, fill = status), position = "fill",
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_polar() + ggtitle("Distribution of Different Age_Group and Single Status")

From the fourth graph, we can see the trends from age 18-35, people getting married, but by the age around 36, there seems to be more pressure in family, or work place, people are seeking to find new relationships.

Hierachical Clustering on Income and Height

There are two continuous variables in the dataset which are income and height. We have investigated two different groups previously, which are body shape and age groups. For this part, we renamed the values in the status column, we combine married and seeing someone as not single and set available as single. We have investigated in two group methods previously, which is body shapes and age groups. We want to further study how will income and height be clustered into these two groups.

copy_2 = sqldf("select * from copy where income <> -1")
temp = copy_2$status
temp = replace(temp, temp=="available", "single")
temp = replace(temp, temp=="seeing someone", "not single")
temp = replace(temp, temp=="married", "not single")
copy_2$status = temp
copy_2 = sqldf("select * from copy_2 where income < 250000")

ggplot(data = copy_2, mapping = aes(x = status, y = height)) +
geom_boxplot() + ggtitle("Boxplot Height vs Single Status")

ggplot(data = copy_2, mapping = aes(x = status, y = income)) +
geom_boxplot() + ggtitle("Boxplot Income vs Single Status")

First we plotted the box plot of income against single status and also the height against single status. We saw that single people have slightly higher average height and higher income.

dataset_1 = sqldf("select body_type, avg(height) as AVG_Height, avg(income) as AVG_Income from copy_2 group by body_type")
dataset_2 = sqldf("select age_group, avg(height) as AVG_Height, avg(income) as AVG_Income from copy_2 group by age_group")
ggplot(data = dataset_1, mapping = aes(x = AVG_Height, y = AVG_Income)) +
geom_point(mapping = aes(color = body_type)) +
geom_smooth()

ggplot(data = dataset_2, mapping = aes(x = AVG_Height, y = AVG_Income)) +
geom_point(mapping = aes(color = age_group)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We plotted the the average income vs average height when group by body type and by age group. We saw some interesting facts here that if you are a tall person, you have a high chance to being either jacked or overweight. As well as that, higher income group seems to spend more time on fitness, where average, fit, and athletic are prevalent body type . Although the fit curves for income vs height have different shapes when group by body type or age group. However, both group methods show a portion of the curve with a positive correlation between height and income.

plot_hclost <- function(copy_2){
small_table = copy_2
z = small_table[,-c(1,1)]
m = apply(z,2,mean)
s = apply(z,2,sd)
z <- scale(z,m,s)
distance = dist(z)
hc_trail = hclust(distance)
plot(hc_trail, labels = small_table$body_type, hang = -1)
title("Body Type                                                       ")
}
plot_hclost(dataset_1)

plot_hclost <- function(copy_2){
small_table = copy_2
z = small_table[,-c(1,1)]
m = apply(z,2,mean)
s = apply(z,2,sd)
z <- scale(z,m,s)
distance = dist(z)
hc_trail = hclust(distance)
plot(hc_trail, labels = small_table$age_group, hang = -1)
title("Age Group                                                       ")
}
plot_hclost(dataset_2)

Now,we have the hierarchical cluster results by body type and by age group. We can see similar body types share similar income and height status. For example, skinny and thin, athletic and fit, and a little extra and average. An interesting fact here is that people who choose rather not answer have similar body type to full figured. Also people who leave their answers blank are most likely skinny or thin. On the right graph, the results are mostly expected that adjacent age groups share similar income and height status. One useful information you can get here is that most people seem to retire around age 65 so that the age group over 65 shared some similar height and income status with the younger generation.

Text Analysis on Essay Question2

First, I took a look at some answers of question 2, which means the original data.

library(tidyverse)
okc <- read_csv("https://uofi.box.com/shared/static/oy32nc373w4jqz3kummksnw6wvhfrl7a.csv", 
    col_types = cols(last_online = col_datetime(format = "%Y-%m-%d-%H-%M")))
head(okc)
## # A tibble: 6 x 31
##     age body_type diet  drinks drugs education essay0 essay1 essay2 essay3
##   <dbl> <chr>     <chr> <chr>  <chr> <chr>     <chr>  <chr>  <chr>  <chr> 
## 1    22 a little… stri… socia… never working … "abou… "curr… "maki… "the …
## 2    35 average   most… often  some… working … "i am… dedic… "bein… <NA>  
## 3    38 thin      anyt… socia… <NA>  graduate… "i'm … "i ma… "impr… "my l…
## 4    23 thin      vege… socia… <NA>  working … i wor… readi… "play… socia…
## 5    29 athletic  <NA>  socia… never graduate… "hey … work … "crea… i smi…
## 6    29 average   most… socia… <NA>  graduate… "i'm … "buil… "imag… "i ha…
## # … with 21 more variables: essay4 <chr>, essay5 <chr>, essay6 <chr>,
## #   essay7 <chr>, essay8 <chr>, essay9 <chr>, ethnicity <chr>, height <dbl>,
## #   income <dbl>, job <chr>, last_online <dttm>, location <chr>,
## #   offspring <chr>, orientation <chr>, pets <chr>, religion <chr>, sex <chr>,
## #   sign <chr>, smokes <chr>, speaks <chr>, status <chr>
colnames(okc) <- tolower(colnames(okc))
set.seed(448)
si <- sample(1:nrow(okc),10000)
okc$essay2[si][1:15]
##  [1] "listening, cooking, and locking myself out of my\napartment/jeep.<br />\n<br />\nalso i can solve a rubix cube :o)"                                                                                                                                                                                                                                                                                                                                                                                  
##  [2] "drawing.<br />\nwriting.<br />\nbullshit persuasion. (never for selfish needs, more for\nenabling)<br />\nrandom acts of kindness.<br />\nadventures!<br />\ncatching people when they fall.<br />\ncoming up with impractical goals (i.e. learn to ride a unicycle)"                                                                                                                                                                                                                                
##  [3] NA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [4] "taking care of people :)!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
##  [5] "the cell phone and the email, cat herding and hand holding,\nenvisioning and inspiring."                                                                                                                                                                                                                                                                                                                                                                                                             
##  [6] "getting things done... cooking... laughing... sleeping... not\nsleeping... procrastinating... meeting great friends.. showing up a\nlittle late but being the last one to leave... making other folks\ncomfortable..."                                                                                                                                                                                                                                                                               
##  [7] "wordplay. puns that are good maybe 30% of the time and the rest are\nso bad that i usually only make myself and a few close friends\nlaugh.<br />\nmaking up songs on the spot.<br />\nseeing things from other people's point of view."                                                                                                                                                                                                                                                             
##  [8] "hmm?"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##  [9] "painting. observational comedy. finding cool shit at goodwill.\nclimbing stairs. whipping up a mean salad. walkie talkie\nimpressions and making kids laugh."                                                                                                                                                                                                                                                                                                                                        
## [10] "making the best of what life brings to you"                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [11] "travelling and learning languages. i have visited many many\ncountries, speak 5.5 languages fluently, and i can pack backpacks,\ncross borders, and chat random people up in public places like i\nwas born to do it :) and i am a singer.<br />\nah, and cooking."                                                                                                                                                                                                                                  
## [12] "being a very good friend. laughing. having compassion.\nunderstanding how people \"work\" and being a great therapist. i take\ngood care of the planet. i am a great kisser. i can hold a plank\nfor over 3 minutes. if we get together you will find that i am\nreally good at making you smile and wanting to see me again."                                                                                                                                                                       
## [13] "neutralizing awkward situations<br />\ngiving massages (touch)<br />\nfalling on my ass on my bike<br />\ngoing to foreign places<br />\nsinging loudly in public<br />\ndancing...in public and private<br />\ndoing accents...<br />\nwalking fast.<br />\ntaking pictures.<br />\nteaching (i would hope so).<br />\ndiscovering new music<br />\ndiscovering old music<br />\ncooking<br />\nbeing positive about life<br />\nstarting dance parties<br />\nword play<br />\nmaking people laugh"
## [14] "having random deep conversations, telling funny stories, getting\nmyself into and out of interesting situations"                                                                                                                                                                                                                                                                                                                                                                                     
## [15] "laughing... i like to talk to people and be social."

Then I cleaned the data by setting all words to lowercase, removing stopwords, extremely rare words, punctuation, unnecessary whitespace and numbers. Afterwards, I made a corpus of the words in question 2.

library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
e2 <- data.frame(doc_id=si,text=okc$essay2[si],stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(e2))
tryTolower <- function(x){
y = NA
try_error = tryCatch(tolower(x), error = function(e) e)
if (!inherits(try_error, 'error'))
y = tolower(x)
return(y)
}

clean.corpus<-function(corpus){
corpus <- tm_map(corpus, content_transformer(tryTolower))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
return(corpus)
}
newcorpus <- clean.corpus(corpus)
tdm<-TermDocumentMatrix(newcorpus, control=list(weighting=weightTf))
tdm.essay2 <- as.matrix(tdm)
rownames(tdm.essay2)[21:120]
##   [1] "absentmindedness" "absolute"         "absolutely"      
##   [4] "absorbed"         "absorbing"        "abstract"        
##   [7] "abstracting"      "abstraction"      "abstractions"    
##  [10] "abstrusely"       "absurd"           "absurdity"       
##  [13] "absurdly"         "absurdnessbr"     "abundance"       
##  [16] "abuse"            "abused"           "abusive"         
##  [19] "abysmal"          "academia"         "academiabr"      
##  [22] "academic"         "academically"     "academics"       
##  [25] "accent"           "accentbr"         "accents"         
##  [28] "accentsbr"        "accentsdialects"  "accep"           
##  [31] "acceptabilitybr"  "acceptable"       "acceptance"      
##  [34] "accepted"         "accepting"        "access"          
##  [37] "accessible"       "accessing"        "accessorizing"   
##  [40] "accident"         "accidental"       "accidentally"    
##  [43] "accidentbr"       "accidents"        "acclimating"     
##  [46] "accommodated"     "accommodating"    "accommodatingbr" 
##  [49] "accommodations"   "accompaniment"    "accompany"       
##  [52] "accomplice"       "accomplish"       "accomplished"    
##  [55] "accomplishing"    "accomplishment"   "accordian"       
##  [58] "according"        "accordingly"      "accordion"       
##  [61] "accountability"   "accountable"      "accountant"      
##  [64] "accountbr"        "accounting"       "accountsbr"      
##  [67] "accumulate"       "accumulating"     "accurate"        
##  [70] "accurately"       "accused"          "ace"             
##  [73] "aced"             "acerbic"          "aces"            
##  [76] "acesbr"           "ache"             "acheive"         
##  [79] "achieve"          "achieved"         "achievement"     
##  [82] "achieving"        "acknowledge"      "acknowledging"   
##  [85] "acoustic"         "acoustics"        "acquired"        
##  [88] "acquiring"        "acrobatic"        "acrobatics"      
##  [91] "acrobats"         "acronyms"         "across"          
##  [94] "acroyoga"         "acrylic"          "acrylicbr"       
##  [97] "act"              "acting"           "actingbr"        
## [100] "actinglooking"
sfq <- data.frame(words=names(sort(rowSums(tdm.essay2),decreasing = TRUE)), freqs=sort(rowSums(tdm.essay2),decreasing = TRUE), row.names = NULL)
sfq[1:100,]
##            words freqs
## 1           good  3267
## 2         making  2510
## 3         people  2381
## 4         things  1812
## 5         really  1255
## 6        cooking   989
## 7      listening   982
## 8           like   954
## 9            can   914
## 10          also   894
## 11    classilink   772
## 12          love   719
## 13         laugh   715
## 14       friends   602
## 15        pretty   602
## 16          time   598
## 17       getting   590
## 18           new   584
## 19       finding   539
## 20          just   534
## 21       playing   484
## 22      laughing   471
## 23          know   460
## 24       writing   460
## 25           one   434
## 26          life   433
## 27       dancing   431
## 28         think   427
## 29           fun   414
## 30          make   411
## 31      anything   408
## 32           get   399
## 33        taking   395
## 34         great   394
## 35        others   383
## 36          href   371
## 37          feel   365
## 38         stuff   356
## 39        giving   349
## 40       talking   338
## 41       reading   335
## 42          well   327
## 43        friend   315
## 44         music   314
## 45           say   306
## 46        around   303
## 47          much   301
## 48      learning   291
## 49           way   286
## 50          cook   285
## 51        eating   283
## 52       working   281
## 53           lot   271
## 54          work   267
## 55        always   265
## 56         games   257
## 57       keeping   256
## 58       singing   255
## 59    everything   254
## 60     something   252
## 61        trying   239
## 62          will   239
## 63        fixing   238
## 64        seeing   233
## 65          find   225
## 66         enjoy   224
## 67        random   221
## 68          best   220
## 69          even   217
## 70          food   217
## 71         going   216
## 72           amp   210
## 73          want   207
## 74          mind   206
## 75          many   205
## 76      thinking   205
## 77           art   204
## 78        sports   204
## 79   remembering   203
## 80           bad   202
## 81          told   202
## 82           try   201
## 83        baking   195
## 84          play   194
## 85        person   193
## 86        little   188
## 87     sometimes   185
## 88       solving   182
## 89        better   180
## 90      enjoying   180
## 91          back   177
## 92           day   177
## 93  conversation   176
## 94       helping   175
## 95          care   173
## 96       telling   172
## 97         jokes   171
## 98           put   170
## 99       staying   170
## 100        never   168

Next, I made a bar plot and a polar plot of the most frequent words of question 2, which is “what are you good at?”

ggplot(sfq[1:18,], mapping = aes(x = reorder(words, freqs), y = freqs, fill = reorder(words, freqs))) +
  geom_bar(stat= "identity") +
  coord_flip() +
  scale_colour_hue() +
  labs(x= "Words", title = "18 Most Frequent Words",y= "Frequence") +
  theme(panel.background = element_blank(), axis.ticks.x = element_blank(),axis.ticks.y = element_blank())

ggplot(sfq[1:18,], mapping = aes(x = reorder(words, freqs), y = freqs, fill = reorder(words, freqs))) +
  geom_bar(stat= "identity") +
  coord_polar() +
  scale_colour_hue() +
  labs(x= "Words", title = "18 Most Frequent Words",y= "Frequence") +
  theme(panel.background = element_blank(), axis.ticks.x = element_blank(),axis.ticks.y = element_blank())

Finally, I made a wordcloud of those words and as we can see from the plot, bigger words mean higher frequency. The word “laughing” probably means people try to show that they have a sense of humor. The words “making” , “cooking” and “good” probably show that people want to show their abilities to others.

library(wordcloud2)
wordcloud2(sfq,size=0.4,shape='pentagon',color="random-dark",backgroundColor="pink")

Conclusion

For the first relationship, we found that people with body type “Athlete”, “Average” and “fit” are more likely to be available. The age group analysis shows younger group are more likely to be in a relationship. What’s more, around 35, a high possibility of divorce may occur because there is an increasing jump in the single percentage.

For hierarchical cluster analysis, similar body shapes show similar height and income status. People who choose not to answer their body type are most likely full-figured. On the other hand, age group clusters show us most people retire around 65.

For text analysis, we found that people usually try to sell themselves to their potential lovers with these words and a sense of humor, also they could use some of these words as their labels when answering “what are you good at?”