This exploratory data analysis report was created for Udacity’s Data Analysis with R project as part of the Data Analyst Nanodegree program.
The data set I selected to explore is:
How Couples Meet and Stay Together (HCMST), Wave 1 2009, Wave 2 2010, Wave 3 2011, Wave 4 2013, United States http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/30103/version/5
This collection consists of a survey given in 2009 and three follow-up surveys in 2010, 2011, and 2013. The subjects were asked how they met their spouse or romantic partner, along with a variety of demographic questions. There were 4002 respondents and 457 attributes in the source data.
Note to Readers: the description of this Udacity project assignment calls for a stream of consciousness mode of exploratory data analysis; as such, the document style may be more casual than if it were a more formal type of project deliverable.
Some significant transformation of the data was needed in order to create a more usable, tidy dataset of manageable scope.
From the original 457 attributes, I subjectively narrowed the set to 37 attributes of interest for analysis. Incomplete observations were excluded, as I am interested in looking specifically at the data for couples that had completed the final wave of surveying. After this filtering, I was left with 2657 of the initial 4002 records. Finally, to aid in readability, I renamed the attributes to make them more intuitive, and transformed categorical variables into factors with representative labels, using the codebook from the authors of the study as my reference.
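The tidying steps described above can be sketched roughly as follows. This is a minimal illustration on a toy data frame, not the code used for this report: the raw column names (`raw_id`, `raw_age`, `raw_gender`, `raw_outcome`) and the numeric codes are hypothetical stand-ins for the codebook entries.

```r
# Toy stand-in for the raw survey extract; names and codes are hypothetical.
raw <- data.frame(
  raw_id      = c(101, 102, 103, 104),
  raw_age     = c(52, 28, 31, 53),
  raw_gender  = c(2, 2, 1, 1),     # assumed coding: 1 = Male, 2 = Female
  raw_outcome = c(0, 0, 1, NA),    # NA = no outcome recorded in the final wave
  raw_unused  = c("a", "b", "c", "d")
)

# 1. Keep only the attributes of interest.
tidy <- raw[, c("raw_id", "raw_age", "raw_gender", "raw_outcome")]

# 2. Drop subjects without a final-wave outcome.
tidy <- tidy[!is.na(tidy$raw_outcome), ]

# 3. Rename the attributes to something more intuitive.
names(tidy) <- c("CaseID", "Age", "Gender", "StillTogether")

# 4. Turn coded categorical variables into labelled factors.
tidy$Gender <- factor(tidy$Gender, levels = c(1, 2),
                      labels = c("Male", "Female"))
tidy$StillTogether <- factor(tidy$StillTogether, levels = c(0, 1),
                             labels = c("together", "split up"))

str(tidy)
```

Fixing the factor levels explicitly keeps the label order stable, which matters later because R's `glm` treats the first level of a factor response as the reference outcome.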
Number of observations
## [1] 2657
Names of dataset features/variables
## [1] "CaseID" "Age" "Education"
## [4] "Ethnicity" "Gender" "Income"
## [7] "MaritalStatus" "Married" "Region"
## [10] "EmploymentStatus" "PoliticalParty" "Religion"
## [13] "GLBStatus" "NumberOfChildren" "SameSexCouple"
## [16] "AgeDiff" "RelationshipYears" "Cohabitating"
## [19] "WhoEarnedMore" "MetByFriends" "MetByFamily"
## [22] "MetByNeighbors" "MetByChurch" "MetByWork"
## [25] "MetBySchool" "MetByOnline" "MetByBar"
## [28] "MetBySocialGroup" "MetByPrivateParty" "MetByOther"
## [31] "StillTogether" "RelationshipQuality" "HomeOwnership"
## [34] "CohabitatationYears" "FirstMetYears" "ParentalApproval"
## [37] "YearMet"
Compact display of data structure
## 'data.frame': 2657 obs. of 37 variables:
## $ CaseID : int 22526 23286 26315 28536 29584 32656 33536 34341 35653 36493 ...
## $ Age : int 52 28 31 53 58 65 53 34 65 49 ...
## $ Education : Factor w/ 4 levels "less than high school",..: 4 4 3 4 4 4 4 4 4 4 ...
## $ Ethnicity : Factor w/ 5 levels "white","black",..: 4 1 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 2 1 2 1 1 ...
## $ Income : int 22250 45000 45000 137250 17250 137250 200000 80000 45000 45000 ...
## $ MaritalStatus : Factor w/ 6 levels "married","widowed",..: 6 6 5 6 4 6 6 1 3 3 ...
## $ Married : Factor w/ 2 levels "not married",..: 1 2 1 1 2 1 1 2 1 1 ...
## $ Region : Factor w/ 9 levels "new england",..: 3 9 5 4 5 2 9 4 8 4 ...
## $ EmploymentStatus : Factor w/ 7 levels "working-paid employee",..: 1 1 1 1 1 5 2 1 1 1 ...
## $ PoliticalParty : Factor w/ 3 levels "republican","other",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Religion : Factor w/ 13 levels "baptist","protestant",..: 3 5 12 2 2 13 2 12 11 2 ...
## $ GLBStatus : Factor w/ 2 levels "not glb","glb": 2 2 2 2 2 2 2 2 2 2 ...
## $ NumberOfChildren : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SameSexCouple : Factor w/ 2 levels "different sex couple",..: 2 2 2 2 1 2 2 1 2 2 ...
## $ AgeDiff : int 4 2 9 2 7 0 9 0 15 3 ...
## $ RelationshipYears : int 7 8 8 12 30 27 15 14 14 0 ...
## $ Cohabitating : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 2 1 1 ...
## $ WhoEarnedMore : Factor w/ 3 levels "subject","same amount",..: 3 1 1 1 3 3 2 1 3 3 ...
## $ MetByFriends : Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 2 1 1 ...
## $ MetByFamily : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ MetByNeighbors : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
## $ MetByChurch : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ MetByWork : Factor w/ 3 levels "-1","0","1": 2 2 2 2 3 2 2 2 2 2 ...
## $ MetBySchool : Factor w/ 3 levels "-1","0","1": 2 3 2 2 2 2 2 3 2 2 ...
## $ MetByOnline : Factor w/ 3 levels "-1","0","1": 3 2 2 2 2 2 2 2 3 3 ...
## $ MetByBar : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 3 2 2 2 ...
## $ MetBySocialGroup : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ MetByPrivateParty : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ MetByOther : Factor w/ 3 levels "-1","0","1": 2 2 3 3 2 3 2 2 2 2 ...
## $ StillTogether : Factor w/ 2 levels "together","split up": 1 1 2 1 1 1 1 1 2 2 ...
## $ RelationshipQuality: Factor w/ 5 levels "very poor","poor",..: 4 4 4 4 4 5 5 5 5 3 ...
## $ HomeOwnership : Factor w/ 3 levels "Own Home","Rent",..: 2 2 1 1 2 1 1 2 1 1 ...
## $ CohabitatationYears: int 7 6 8 12 28 26 15 12 NA NA ...
## $ FirstMetYears : int 7 9 8 12 30 29 15 16 14 0 ...
## $ ParentalApproval : Factor w/ 2 levels "Non-approval",..: NA 1 2 2 2 NA 2 2 NA NA ...
## $ YearMet : num 2002 2001 2001 1997 1979 ...
Data features with basic statistical summaries
## CaseID Age Education
## Min. : 22526 Min. :19.00 less than high school :280
## 1st Qu.:1131421 1st Qu.:34.00 high school :647
## Median :2223581 Median :45.00 some college :753
## Mean :2273922 Mean :46.13 bachelor degree or higher:977
## 3rd Qu.:3408205 3rd Qu.:57.00
## Max. :4628251 Max. :92.00
##
## Ethnicity Gender Income MaritalStatus
## white :1989 Male :1361 Min. : 2500 married :1552
## black : 194 Female:1296 1st Qu.: 37250 widowed : 52
## other : 88 Median : 67250 divorced : 183
## hispanic: 285 Mean : 70243 separated : 36
## 2+ races: 101 3rd Qu.: 92250 never married : 407
## Max. :200000 living with partner: 427
##
## Married Region
## not married: 940 pacific :455
## married :1717 south atlantic :454
## east-north central:416
## mid-atlantic :392
## west-south central:250
## mountain :208
## (Other) :482
## EmploymentStatus PoliticalParty
## working-paid employee :1543 republican:1072
## working-self-employed : 235 other : 61
## not working-temporary layoff: 14 democrat :1524
## not working-looking for work: 99
## not working-retired : 376
## not working-disabled : 188
## not working-other : 202
## Religion GLBStatus NumberOfChildren
## catholic :618 not glb:2071 Min. :0.0000
## protestant :603 glb : 586 1st Qu.:0.0000
## none :402 Median :0.0000
## baptist :341 Mean :0.4896
## other christian:334 3rd Qu.:1.0000
## (Other) :350 Max. :7.0000
## NA's : 9
## SameSexCouple AgeDiff RelationshipYears
## different sex couple:2251 Min. : 0.000 Min. : 0.00
## same-sex couple : 406 1st Qu.: 1.000 1st Qu.: 5.00
## Median : 3.000 Median :13.00
## Mean : 4.703 Mean :17.35
## 3rd Qu.: 6.000 3rd Qu.:25.00
## Max. :70.000 Max. :71.00
## NA's :14 NA's :21
## Cohabitating WhoEarnedMore MetByFriends MetByFamily MetByNeighbors
## 0 : 542 subject :1193 0 :1676 0 :2208 0 :2359
## 1 :2112 same amount: 325 1 : 917 1 : 385 1 : 234
## NA's: 3 partner :1117 NA's: 64 NA's: 64 NA's: 64
## NA's : 22
##
##
##
## MetByChurch MetByWork MetBySchool MetByOnline MetByBar MetBySocialGroup
## 0 :2416 -1: 14 -1: 14 -1: 14 -1: 14 -1: 14
## 1 : 177 0 :2207 0 :2264 0 :2510 0 :2332 0 :2522
## NA's: 64 1 : 436 1 : 379 1 : 133 1 : 311 1 : 121
##
##
##
##
## MetByPrivateParty MetByOther StillTogether RelationshipQuality
## -1: 14 -1: 14 together:2202 very poor: 16
## 0 :2343 0 :1719 split up: 455 poor : 37
## 1 : 300 1 : 924 fair : 207
## good : 817
## excellent:1569
## NA's : 11
##
## HomeOwnership CohabitatationYears FirstMetYears
## Own Home :2125 Min. : 0.00 Min. : 0.0
## Rent : 505 1st Qu.: 6.00 1st Qu.: 7.0
## Occupy without Payment: 27 Median :13.00 Median :14.0
## Mean :18.07 Mean :18.8
## 3rd Qu.:26.00 3rd Qu.:27.0
## Max. :69.00 Max. :75.0
## NA's :436 NA's :12
## ParentalApproval YearMet
## Non-approval: 397 Min. :1938
## Approval :1477 1st Qu.:1984
## NA's : 783 Median :1996
## Mean :1992
## 3rd Qu.:2004
## Max. :2009
## NA's :21
What’s the distribution of ages of subjects?
Subjects’ ages range from 19 to 92 and are slightly bimodal and right-skewed, with a median age of 45.
Continuing to explore subject demographics; what’s the distribution of Education values?
The largest group (977 of 2657, about 37%) has a bachelor’s degree or higher.
What’s the distribution of Ethnicity values?
A large majority of subjects are white (1989 of 2657, about 75%), which likely means they are overrepresented.
Continuing to explore subject demographics; what’s the distribution of Gender values?
Approximately even ratio of males to females.
What’s the distribution of Household Income?
Near normal distribution with some outliers at the high end. Median income is $67,250.
What’s the distribution of geographic Region?
Nothing particularly interesting there.
What’s the distribution of Employment Status?
Most subjects are working as paid employees.
What’s the distribution of Political Party?
Approximately a 3:2 ratio of Democrats to Republicans, with a small number in the Other category.
What’s the distribution of Religion?
A large variety of religious views seems to be represented.
What’s the distribution of Number of Children?
Most subjects do not currently have children.
What’s the distribution of GLB status?
586 subjects (approximately 22%) identify as gay/lesbian/bi.
What’s the distribution of Same Sex Couples?
There are 406 same-sex couples represented (approximately 15%).
What’s the distribution of Cohabitating couples?
2112 couples (79%) live with their partners.
What’s the distribution of Married couples?
1717 of the couples (64.6%) are married.
What’s the distribution of Who Earns More, subject or partner?
A pretty even ratio of the subject earning more vs their partner earning more.
What’s the distribution of Length of the Relationship in years?
The duration of the relationship follows a roughly exponential (strongly right-skewed) distribution ranging from 0 to 71 years, with a median of 13 years.
What’s the distribution of how the subjects rate the Relationship Quality?
About 90% of subjects (2386 of 2657) rated the quality of their relationship favorably (good or excellent).
By the end of the study period (2013), how many of the original subjects are still with their partner?
##
## together split up
## 0.8287542 0.1712458
83% are still coupled; 17% have split up.
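The proportions above are the kind of output that `prop.table()` applied to a frequency table produces. A minimal sketch, assuming a toy outcome vector in place of the real StillTogether column, with a check of the reported shares against the counts in the summary table (2202 together, 455 split up):

```r
# Toy outcome vector standing in for the StillTogether column.
still_together <- factor(
  c("together", "together", "split up", "together", "together"),
  levels = c("together", "split up")
)

counts <- table(still_together)   # frequency of each level
shares <- prop.table(counts)      # frequencies divided by their total
print(shares)

# The shares reported in this document follow from the summary counts.
reported <- c(together = 2202, `split up` = 455) / (2202 + 455)
print(round(reported, 7))
```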
The tidied dataset consists of 2657 observations with 37 variables.
Most variables are categorical. Numerical variables include Age, Income, Number of Children, Age Difference from Partner, Year Met, and Years since subject met/started relationship/started cohabitating with partner.
Other observations: Subjects were age 19 to 92. Mean household income was $70,243. 64.6% were married to their partner. 83% were still in a relationship with their partner at the end of the study.
The most interesting feature is StillTogether, which indicates whether or not the subject is still in a relationship with the partner by the end of the study.
Another feature of interest is how the subject met their partner, which is represented by 11 MetBy* indicator variables (respondents could choose more than one). Seven of these also carry a -1 code for 14 records, so they are not strictly binary.
I’m interested in studying the potential relationship between the non-identifying features (i.e., all but CaseID) and the StillTogether variable.
No.
The data transformation/tidying operations I performed are described above.
Most of the important features are categorical. Age and Income are nearly normal distributions. Relationship Years is roughly exponential.
Let’s start trying to explore relationships between variables; particularly those influencing StillTogether. I wonder if couples that met online are more or less likely to still be together?
So, only 63.9% of couples that met online were still together at the end of the study vs. 83.9% of those who did not meet online. This is interesting, although it only covers couples who met before 2010. As social networking has grown dramatically and become more of a social norm in recent years, I do wonder whether this finding would still hold with more recent data.
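A comparison like this comes from a two-way table normalized within each row. The sketch below uses a small made-up sample (not the survey data) to show the mechanics with `prop.table(..., margin = 1)`:

```r
# Made-up sample: 3 couples who met online, 5 who did not.
met_online <- factor(c(1, 1, 1, 0, 0, 0, 0, 0), levels = c(0, 1))
outcome <- factor(
  c("together", "split up", "together", "together", "together",
    "together", "split up", "together"),
  levels = c("together", "split up")
)

tab <- table(MetByOnline = met_online, StillTogether = outcome)

# margin = 1 divides each row by its own total, so every row sums to 1
# and the rows can be compared directly.
row_shares <- prop.table(tab, margin = 1)
print(round(row_shares, 3))
```

Normalizing by row rather than over the whole table is what makes the two groups comparable despite their very different sizes.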
What about subject Political Party, does that influence relationship viability?
It seems Republicans may be slightly more likely to remain coupled than Democrats or Others.
Does being the major breadwinner of the household influence StillTogether?
There doesn’t seem to be a significant relationship here.
What about household Income?
Couples that were still together at the end of the study have a higher income distribution than those that split up.
OK, here’s an interesting question. Does the couple’s household income seem to affect whether or not the subject’s parents approve of the relationship?
Yes, couples with parental approval have a higher mean household income than those without parental approval.
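A group comparison of this kind can be computed with `tapply()`. The incomes and approval labels below are invented for illustration only; the point is the mechanics, including how subjects with a missing ParentalApproval value are left out of both groups.

```r
# Invented values; NA marks a subject with no recorded parental approval.
income <- c(45000, 80000, 22250, 137250, 67250, 30000)
approval <- factor(
  c("Approval", "Approval", "Non-approval", "Approval", NA, "Non-approval"),
  levels = c("Non-approval", "Approval")
)

# Mean income within each approval group; rows with NA approval are dropped.
mean_income <- tapply(income, approval, mean)
print(mean_income)
```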
Are married couples more likely to stick it out than unmarried couples?
Yes, 96% of married couples were still together at the end of the study. This seems to be a significant relationship (pun intended.)
Is there a relationship between cohabitation and the StillTogether result?
92% of couples that lived together were still together vs. 46% of couples that did not live together.
Is there a difference between Same Sex Couples and Different Sex Couples with respect to the StillTogether outcome?
Different sex couples were slightly more likely to still be together in the study.
Setting aside the StillTogether variable for now, how has the way in which couples meet changed over time?
We can see there is a dramatic increase in the occurrence of people meeting their partners online over the years. It would be interesting to see more recent data on this.
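One way to quantify a trend like this is the share of couples per period who report a given meeting method. The sketch below bins invented YearMet values by decade; the decade bins are my assumption for illustration and are not necessarily how the plot was built.

```r
# Invented data: year the couple met and a 0/1 flag for meeting online.
year_met   <- c(1975, 1982, 1988, 1993, 1997, 2001, 2004, 2006, 2008, 2009)
met_online <- c(0,    0,    0,    0,    1,    0,    1,    1,    0,    1)

# Bin years into decades, then take the mean of the 0/1 flag per decade,
# which is the share of couples in that decade who met online.
decade <- floor(year_met / 10) * 10
share_online <- tapply(met_online, decade, mean)
print(share_online)
```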
Does household income have any interesting effects on subjective relationship quality rankings?
No, not as much as I would have expected. Subjects with the very highest incomes were not represented in the lowest relationship quality categories. However, the mean household income is about equal between those who saw their relationship quality as Excellent and those who saw it as Very Poor.
Couples that met online were less likely to still be together. Couples that were still together had a slightly higher median household income. Couples that were cohabitating were more likely to still be together. Same sex couples were less likely to still be together.
Couples with a higher household income were more likely to receive parental approval. The occurrence of partners meeting online increased dramatically over time.
Married couples were much more likely to “stick it out” and still be together.
Let’s look at the various categories of how the subjects met their partner.
Friends and “Other” are the most frequent methods. I’m a little curious about that Other category, but I don’t have a way to drill down into it further with the current set of variables.
Let’s use some statistical tools to determine if there are any direct correlations between the variables.
It is not surprising that Age is positively correlated with relationship length (RelationshipYears, CohabitatationYears) and inversely correlated with YearMet. SameSexCouple is correlated with GLBStatus. There is a weak correlation between Gender and WhoEarnedMore. Income and Education are correlated, as are ParentalApproval and RelationshipQuality.
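For the numeric variables, a correlation matrix can be obtained with `cor()`. This is a sketch on invented values, not the report's data. Here YearMet is constructed as 2009 minus RelationshipYears, so those two columns are perfectly inversely correlated by design.

```r
# Invented values; YearMet = 2009 - RelationshipYears by construction.
d <- data.frame(
  Age               = c(25, 34, 45, 52, 61, 70),
  RelationshipYears = c(2, 8, 15, 25, 30, 45),
  YearMet           = c(2007, 2001, 1994, 1984, 1979, 1964)
)

# Pairwise-complete observations let each pair of columns use every row
# where both values are present, which helps when NAs are scattered.
cors <- cor(d, use = "pairwise.complete.obs")
print(round(cors, 2))
```

Pearson correlation is only directly meaningful for numeric columns; including factors in such a matrix requires recoding them first, and the resulting coefficients should be read with care.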
We’re going to create a binomial logistic regression model with StillTogether as the dependent variable. For our initial model, we’re going to use all variables except for CaseID (the numeric unique identifier for observations.)
##
## Call:
## glm(formula = StillTogether ~ . - CaseID, family = binomial,
## data = no.na.Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.93683 -0.37956 -0.20410 -0.08566 2.88804
##
## Coefficients: (7 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -8.615e+00 3.956e+03 -0.002
## Age -3.755e-02 1.492e-02 -2.517
## Educationhigh school -3.895e-01 4.727e-01 -0.824
## Educationsome college -2.395e-01 4.635e-01 -0.517
## Educationbachelor degree or higher -5.596e-01 4.927e-01 -1.136
## Ethnicityblack -2.777e-02 4.141e-01 -0.067
## Ethnicityother -1.286e+00 7.668e-01 -1.677
## Ethnicityhispanic -1.541e-01 3.459e-01 -0.445
## Ethnicity2+ races 5.627e-02 5.090e-01 0.111
## GenderFemale 2.428e-01 2.316e-01 1.049
## Income -4.109e-06 2.762e-06 -1.487
## MaritalStatuswidowed -1.551e+01 1.845e+03 -0.008
## MaritalStatusdivorced 1.058e+00 5.369e-01 1.970
## MaritalStatusseparated 6.020e-01 8.316e-01 0.724
## MaritalStatusnever married 6.186e-01 4.406e-01 1.404
## MaritalStatusliving with partner 3.501e-01 3.951e-01 0.886
## Marriedmarried -3.253e-01 3.545e-01 -0.918
## Regionmid-atlantic -1.951e-01 5.274e-01 -0.370
## Regioneast-north central -7.386e-01 5.297e-01 -1.394
## Regionwest-north central -1.232e+00 6.365e-01 -1.936
## Regionsouth atlantic -1.003e-01 5.068e-01 -0.198
## Regioneast-south central -4.424e-01 6.665e-01 -0.664
## Regionwest-south central -5.431e-01 5.871e-01 -0.925
## Regionmountain -3.605e-01 6.084e-01 -0.593
## Regionpacific -1.659e-01 5.106e-01 -0.325
## EmploymentStatusworking-self-employed -3.494e-01 3.852e-01 -0.907
## EmploymentStatusnot working-temporary layoff -1.508e+00 1.301e+00 -1.159
## EmploymentStatusnot working-looking for work -8.451e-02 4.947e-01 -0.171
## EmploymentStatusnot working-retired -2.235e-01 1.127e+00 -0.198
## EmploymentStatusnot working-disabled 2.792e-01 4.195e-01 0.666
## EmploymentStatusnot working-other 8.263e-02 4.479e-01 0.184
## PoliticalPartyother -5.477e-01 8.358e-01 -0.655
## PoliticalPartydemocrat -1.705e-01 2.520e-01 -0.677
## Religionprotestant -4.533e-01 3.808e-01 -1.190
## Religioncatholic -6.589e-01 3.950e-01 -1.668
## Religionmormon -1.992e+00 1.275e+00 -1.562
## Religionjewish 2.512e-01 6.487e-01 0.387
## Religionmuslim -1.644e+01 2.153e+03 -0.008
## Religionhindu 2.040e-01 1.808e+00 0.113
## Religionbuddhist -4.731e-01 1.254e+00 -0.377
## Religionpentecostal -1.649e+01 5.351e+02 -0.031
## Religioneastern orthodox -1.589e+01 1.221e+03 -0.013
## Religionother christian -3.842e-01 3.922e-01 -0.980
## Religionother non-christian -5.559e-01 5.523e-01 -1.007
## Religionnone -3.500e-01 3.799e-01 -0.922
## GLBStatusglb 2.629e-01 3.898e-01 0.674
## NumberOfChildren 1.030e-01 1.133e-01 0.909
## SameSexCouplesame-sex couple 9.017e-01 4.595e-01 1.963
## AgeDiff 1.122e-02 2.020e-02 0.556
## RelationshipYears -1.134e-01 5.999e-02 -1.891
## Cohabitating1 -1.311e+00 3.667e-01 -3.576
## WhoEarnedMoresame amount 3.684e-02 3.344e-01 0.110
## WhoEarnedMorepartner -2.911e-01 2.448e-01 -1.189
## MetByFriends1 -8.792e-02 2.383e-01 -0.369
## MetByFamily1 7.207e-02 2.949e-01 0.244
## MetByNeighbors1 -9.996e-01 4.680e-01 -2.136
## MetByChurch1 -7.967e-01 7.028e-01 -1.134
## MetByWork0 1.789e+01 3.956e+03 0.005
## MetByWork1 1.766e+01 3.956e+03 0.004
## MetBySchool0 -4.807e-01 4.492e-01 -1.070
## MetBySchool1 NA NA NA
## MetByOnline0 -2.224e-01 5.710e-01 -0.389
## MetByOnline1 NA NA NA
## MetByBar0 -4.876e-01 4.139e-01 -1.178
## MetByBar1 NA NA NA
## MetBySocialGroup0 1.639e-01 5.732e-01 0.286
## MetBySocialGroup1 NA NA NA
## MetByPrivateParty0 -1.225e+00 4.354e-01 -2.814
## MetByPrivateParty1 NA NA NA
## MetByOther0 -9.555e-01 3.912e-01 -2.443
## MetByOther1 NA NA NA
## RelationshipQualitypoor -1.098e+00 1.039e+00 -1.057
## RelationshipQualityfair -2.769e+00 9.065e-01 -3.054
## RelationshipQualitygood -3.471e+00 8.911e-01 -3.895
## RelationshipQualityexcellent -4.754e+00 9.059e-01 -5.248
## HomeOwnershipRent -2.705e-01 2.680e-01 -1.009
## HomeOwnershipOccupy without Payment 6.987e-01 8.200e-01 0.852
## CohabitatationYears 1.022e-02 5.669e-02 0.180
## FirstMetYears 2.707e-02 2.645e-02 1.023
## ParentalApprovalApproval -1.269e-01 2.576e-01 -0.492
## YearMet NA NA NA
## Pr(>|z|)
## (Intercept) 0.998263
## Age 0.011820 *
## Educationhigh school 0.409954
## Educationsome college 0.605402
## Educationbachelor degree or higher 0.255991
## Ethnicityblack 0.946526
## Ethnicityother 0.093548 .
## Ethnicityhispanic 0.655990
## Ethnicity2+ races 0.911981
## GenderFemale 0.294389
## Income 0.136924
## MaritalStatuswidowed 0.993292
## MaritalStatusdivorced 0.048793 *
## MaritalStatusseparated 0.469129
## MaritalStatusnever married 0.160313
## MaritalStatusliving with partner 0.375531
## Marriedmarried 0.358870
## Regionmid-atlantic 0.711397
## Regioneast-north central 0.163187
## Regionwest-north central 0.052850 .
## Regionsouth atlantic 0.843116
## Regioneast-south central 0.506793
## Regionwest-south central 0.354916
## Regionmountain 0.553467
## Regionpacific 0.745286
## EmploymentStatusworking-self-employed 0.364287
## EmploymentStatusnot working-temporary layoff 0.246383
## EmploymentStatusnot working-looking for work 0.864361
## EmploymentStatusnot working-retired 0.842873
## EmploymentStatusnot working-disabled 0.505631
## EmploymentStatusnot working-other 0.853637
## PoliticalPartyother 0.512293
## PoliticalPartydemocrat 0.498714
## Religionprotestant 0.233927
## Religioncatholic 0.095279 .
## Religionmormon 0.118199
## Religionjewish 0.698542
## Religionmuslim 0.993907
## Religionhindu 0.910202
## Religionbuddhist 0.705910
## Religionpentecostal 0.975421
## Religioneastern orthodox 0.989612
## Religionother christian 0.327274
## Religionother non-christian 0.314160
## Religionnone 0.356787
## GLBStatusglb 0.500046
## NumberOfChildren 0.363541
## SameSexCouplesame-sex couple 0.049699 *
## AgeDiff 0.578485
## RelationshipYears 0.058656 .
## Cohabitating1 0.000349 ***
## WhoEarnedMoresame amount 0.912288
## WhoEarnedMorepartner 0.234443
## MetByFriends1 0.712206
## MetByFamily1 0.806950
## MetByNeighbors1 0.032673 *
## MetByChurch1 0.256990
## MetByWork0 0.996393
## MetByWork1 0.996438
## MetBySchool0 0.284484
## MetBySchool1 NA
## MetByOnline0 0.696929
## MetByOnline1 NA
## MetByBar0 0.238818
## MetByBar1 NA
## MetBySocialGroup0 0.774885
## MetBySocialGroup1 NA
## MetByPrivateParty0 0.004900 **
## MetByPrivateParty1 NA
## MetByOther0 0.014579 *
## MetByOther1 NA
## RelationshipQualitypoor 0.290605
## RelationshipQualityfair 0.002255 **
## RelationshipQualitygood 9.82e-05 ***
## RelationshipQualityexcellent 1.54e-07 ***
## HomeOwnershipRent 0.312875
## HomeOwnershipOccupy without Payment 0.394166
## CohabitatationYears 0.856935
## FirstMetYears 0.306163
## ParentalApprovalApproval 0.622402
## YearMet NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1066.53 on 1491 degrees of freedom
## Residual deviance: 682.16 on 1418 degrees of freedom
## AIC: 830.16
##
## Number of Fisher Scoring iterations: 16
Our initial model has an AIC of 830.
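When reading the coefficients, it helps to remember that R's `glm` with a factor response treats the first level as the reference outcome, so with levels "together" and "split up" the model estimates the log-odds of splitting up. A negative coefficient therefore means lower odds of a split. The null deviance above has 1491 degrees of freedom, so the fit used 1492 complete cases. The following is a minimal sketch on simulated data (the effect sizes are made up) showing the same call pattern:

```r
set.seed(42)
n <- 400
cohab <- rbinom(n, 1, 0.8)
age <- round(runif(n, 19, 92))

# Simulated log-odds of splitting up: lower for cohabiting and older subjects.
log_odds <- 0.5 - 1.5 * cohab - 0.03 * age
split <- rbinom(n, 1, plogis(log_odds))

toy <- data.frame(
  StillTogether = factor(ifelse(split == 1, "split up", "together"),
                         levels = c("together", "split up")),
  Cohabitating = factor(cohab),
  Age = age
)

# First factor level ("together") is the reference; the model predicts
# the probability of the second level ("split up").
fit <- glm(StillTogether ~ ., family = binomial, data = na.omit(toy))
print(coef(summary(fit)))
print(AIC(fit))
```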
Now let’s use stepwise AIC optimization to find the best model fit.
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## StillTogether ~ (CaseID + Age + Education + Ethnicity + Gender +
## Income + MaritalStatus + Married + Region + EmploymentStatus +
## PoliticalParty + Religion + GLBStatus + NumberOfChildren +
## SameSexCouple + AgeDiff + RelationshipYears + Cohabitating +
## WhoEarnedMore + MetByFriends + MetByFamily + MetByNeighbors +
## MetByChurch + MetByWork + MetBySchool + MetByOnline + MetByBar +
## MetBySocialGroup + MetByPrivateParty + MetByOther + RelationshipQuality +
## HomeOwnership + CohabitatationYears + FirstMetYears + ParentalApproval +
## YearMet) - CaseID
##
## Final Model:
## StillTogether ~ Age + Income + Married + SameSexCouple + RelationshipYears +
## Cohabitating + MetByNeighbors + MetByChurch + MetByBar +
## MetByPrivateParty + MetByOther + RelationshipQuality
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1418 682.1563 830.1563
## 2 - YearMet 0 0.000000000 1418 682.1563 830.1563
## 3 - EmploymentStatus 6 3.235367203 1424 685.3916 821.3916
## 4 - Region 8 9.140655194 1432 694.5323 814.5323
## 5 - Religion 12 16.707935257 1444 711.2402 807.2402
## 6 - Education 3 1.657798241 1447 712.8980 802.8980
## 7 - MaritalStatus 5 6.327018366 1452 719.2250 799.2250
## 8 - PoliticalParty 2 0.910844570 1454 720.1359 796.1359
## 9 - Ethnicity 4 4.992743319 1458 725.1286 793.1286
## 10 - HomeOwnership 2 1.600656859 1460 726.7293 790.7293
## 11 - CohabitatationYears 1 0.006660194 1461 726.7359 788.7359
## 12 - ParentalApproval 1 0.010333973 1462 726.7463 786.7463
## 13 - MetByOnline 1 0.114009190 1463 726.8603 784.8603
## 14 - WhoEarnedMore 2 2.112845231 1465 728.9731 782.9731
## 15 - NumberOfChildren 1 0.133198958 1466 729.1063 781.1063
## 16 - AgeDiff 1 0.136548659 1467 729.2429 779.2429
## 17 - MetBySchool 1 0.328844785 1468 729.5717 777.5717
## 18 - Gender 1 0.502619399 1469 730.0743 776.0743
## 19 - MetByFriends 1 0.513947553 1470 730.5883 774.5883
## 20 - MetBySocialGroup 1 0.508614575 1471 731.0969 773.0969
## 21 - MetByFamily 1 0.809728915 1472 731.9066 771.9066
## 22 - MetByWork 1 0.961437815 1473 732.8681 770.8681
## 23 - GLBStatus 1 1.099131014 1474 733.9672 769.9672
## 24 - FirstMetYears 1 1.886010267 1475 735.8532 769.8532
##
## Call:
## glm(formula = StillTogether ~ Age + Income + Married + SameSexCouple +
## RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch +
## MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality,
## family = binomial, data = no.na.Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1619 -0.4068 -0.2418 -0.1153 2.9905
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.812e+00 5.354e+02 -0.011 0.991340
## Age -3.312e-02 1.227e-02 -2.700 0.006941 **
## Income -4.957e-06 2.295e-06 -2.160 0.030794 *
## Marriedmarried -6.171e-01 2.366e-01 -2.608 0.009111 **
## SameSexCouplesame-sex couple 1.002e+00 2.788e-01 3.594 0.000326 ***
## RelationshipYears -8.068e-02 1.829e-02 -4.411 1.03e-05 ***
## Cohabitating1 -1.614e+00 3.102e-01 -5.203 1.96e-07 ***
## MetByNeighbors1 -8.860e-01 4.285e-01 -2.068 0.038677 *
## MetByChurch1 -1.129e+00 6.818e-01 -1.656 0.097630 .
## MetByBar0 1.253e+01 5.354e+02 0.023 0.981333
## MetByBar1 1.300e+01 5.354e+02 0.024 0.980634
## MetByPrivateParty0 -9.186e-01 3.074e-01 -2.989 0.002803 **
## MetByPrivateParty1 NA NA NA NA
## MetByOther0 -8.946e-01 2.257e-01 -3.963 7.40e-05 ***
## MetByOther1 NA NA NA NA
## RelationshipQualitypoor -7.574e-01 8.805e-01 -0.860 0.389695
## RelationshipQualityfair -2.240e+00 7.609e-01 -2.944 0.003241 **
## RelationshipQualitygood -2.896e+00 7.398e-01 -3.915 9.05e-05 ***
## RelationshipQualityexcellent -4.091e+00 7.460e-01 -5.484 4.15e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1066.53 on 1491 degrees of freedom
## Residual deviance: 735.85 on 1475 degrees of freedom
## AIC: 769.85
##
## Number of Fisher Scoring iterations: 12
Our optimized model has an AIC of about 770 (769.85), a substantial improvement over the initial 830. The new model is also more parsimonious: StillTogether ~ Age + Income + Married + SameSexCouple + RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch + MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality
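Backward elimination by AIC can be sketched with base R's `step()`. Using `step()` here is my assumption for illustration, since the report does not name the function that was called, and the data below are simulated. The last lines check the AIC values in the two summaries above: for ungrouped binary data, AIC equals the residual deviance plus twice the number of estimated coefficients.

```r
set.seed(7)
n <- 300
toy <- data.frame(
  Cohabitating = rbinom(n, 1, 0.8),
  Age = round(runif(n, 19, 92)),
  Noise = rnorm(n)                 # unrelated to the outcome by construction
)
toy$SplitUp <- rbinom(n, 1, plogis(0.5 - 1.5 * toy$Cohabitating - 0.03 * toy$Age))

full <- glm(SplitUp ~ ., family = binomial, data = toy)

# Drop one term at a time for as long as doing so lowers the AIC.
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))

# Arithmetic check against the summaries in this report:
# coefficients estimated = null df - residual df + 1 (for the intercept).
aic_initial <- 682.1563 + 2 * (1491 - 1418 + 1)   # 74 coefficients
aic_final   <- 735.8532 + 2 * (1491 - 1475 + 1)   # 17 coefficients
print(c(initial = aic_initial, final = aic_final))
```

The check shows where the improvement comes from: the final model's deviance is higher (735.85 vs. 682.16), but it spends 57 fewer coefficients, and the AIC penalty of 2 per coefficient more than makes up the difference.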
Let’s look at some diagnostic plots to help evaluate our model.
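The ordered residuals plot and Cook statistic plot seem to indicate that this is a good model but with a fair number of outliers. The ROC (receiver operating characteristic) plot gives a value of 87.29%. A ROC curve is normally summarized by the area under the curve (AUC), which measures how well the model ranks split-up couples above intact ones, so this figure is best read as an AUC rather than as classification accuracy. A value in this range is generally considered “good”.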
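The area under a ROC curve (AUC) and classification accuracy are different quantities, which is worth keeping in mind when reading a ROC summary. The sketch below computes both by hand on six made-up predicted probabilities. The helper `auc_by_rank` is a hypothetical name, and it uses the rank-sum identity rather than whatever plotting package produced the figure in this report.

```r
# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case gets a higher score than a negative one.
auc_by_rank <- function(score, is_positive) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up predicted probabilities of splitting up, and the true outcomes.
score    <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
is_split <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

auc <- auc_by_rank(score, is_split)

# Accuracy depends on a threshold (0.5 here); AUC does not.
accuracy <- mean((score > 0.5) == is_split)

print(c(auc = auc, accuracy = accuracy))
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly (AUC = 8/9), while only 4 of 6 cases are classified correctly at a 0.5 cutoff.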
Using the variables we’ve found to be predictive in our regression model, let’s use a K-means clustering algorithm to define two clusters. The cases within each cluster will be relatively similar in Age + Income + Married + SameSexCouple + RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch + MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality. The cases that are most dissimilar will be in different clusters.
Now let’s create a scatterplot of all couples with their cluster membership (1 or 2) vs. the StillTogether outcome.
We see that cluster membership is strongly associated with the StillTogether variable. Most couples in cluster 1 are still together. Couples in cluster 2 seem to be fairly evenly distributed between together and not together.
The correlation matrix indicates some of the correlated features, such as between the subject’s age and length of relationship. Another example is GLB status of subject and couple’s same sex status. Most of these correlations make intuitive sense and were not surprising.
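A two-cluster K-means run followed by a cross-tabulation against the outcome can be sketched as below. The eight rows are invented, and standardizing the columns and coding the categorical variable as 0/1 are my assumptions, since K-means needs a numeric matrix and the report does not say how the factors were encoded.

```r
set.seed(1)

# Invented couples: four younger, shorter relationships and four older ones.
toy <- data.frame(
  Age               = c(25, 27, 30, 29, 60, 65, 62, 68),
  RelationshipYears = c(1, 2, 3, 2, 30, 35, 28, 40),
  Cohabitating      = c(0, 0, 1, 0, 1, 1, 1, 1)   # 0/1 coding is an assumption
)
outcome <- factor(c("split up", "together", "split up", "together",
                    "together", "together", "together", "together"))

# Standardize so no single variable dominates the Euclidean distances,
# then fit two clusters from several random starts.
km <- kmeans(scale(toy), centers = 2, nstart = 10)

# Cluster labels (1 or 2) are arbitrary; only the grouping matters.
print(table(cluster = km$cluster, StillTogether = outcome))
```

Because the clusters are built without looking at the outcome, a clear difference in outcome rates between them suggests the features carry signal, though it is weaker evidence than validating the model on held-out data.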
I found all of the observations to be interesting, but none particularly surprising.
I constructed a binomial logistic regression model with StillTogether as the dependent variable. I started with all features (aside from CaseID) as the potential independent variables. I used stepwise AIC optimization to find the optimum model for this data set. The following features were found to provide the best model explaining variability of StillTogether: Age, Income, Married, Same Sex Couple, Relationship Years, Cohabitating, Relationship Quality, and how the couple met (5 variables: MetByNeighbors, MetByChurch, MetByBar, MetByPrivateParty, and MetByOther). For this set of data, these attributes distinguished couples who stayed together from those who split up with a ROC score of about 87%.
To help confirm the model, I performed K-means clustering on the couples using the above set of attributes. I created two clusters based upon combined similarity of Age, Income, Married, Same Sex Couple, Relationship Years, Cohabitating, Relationship Quality, and how the couple met.
This plot shows the trends in the manner in which couples meet from year to year. We see that meeting Online is a strong upward trend, and we must also bear in mind that the last data point is for 2009. Meetings via Church or Neighbors were both downward-trending.
This boxplot shows the distribution of subjective RelationshipQuality values grouped by length of relationship (RelationshipYears). The V shape of the different quartiles seems to indicate that subjects have a less extreme qualitative sentiment towards their relationship quality in the relatively earlier years. After the 9-year mark, subjects are more likely to develop a stronger view, either positive or negative. After the 40-year point, all couples rate their relationships as fair, good, or excellent. I found it quite surprising that the mean RelationshipYears value for “very poor” ranked relationships is as high as 16.5 years.
This is a plot of the StillTogether outcome for our two clusters of couples. I created two distinct clusters based upon combined similarity of the predictive features (Age, Income, Married, Same Sex Couple, Relationship Years, Cohabitating, Relationship Quality, and how the couple met). We see from the two plots that one of the clusters is much more likely to stay together than the other. This supports the idea that these factors are indeed predictive of relationship viability.
I tried to select an interesting dataset to explore, and overall it was enjoyable to work with. There were over 400 features initially, so I knew that I would need to narrow down the feature set in order to make the scope of this project manageable. I chose demographic features that seemed interesting, and also reviewed (skimmed) other works in the social science domain to help with feature selection. Tidying the dataset and appropriately labeling the categorical values was a time-consuming, but essential, step in the analysis process. As this was the first time I’ve conducted a binomial logistic regression, rather than an ordinary linear regression, interpreting some of the diagnostic results, such as the residual plots, was a challenge. If I were to expand my research on this data, I would most likely be interested in (1) pulling in more of the 400+ features for examination, and (2) exploring some different methods of predictive model creation, such as random forest and related classification algorithms. I would likely also partition the data into training and test sets to help validate the model.
Rosenfeld, Michael J., Reuben J. Thomas, and Maja Falcon. How Couples Meet and Stay Together (HCMST), Wave 1 2009, Wave 2 2010, Wave 3 2011, Wave 4 2013, United States. ICPSR30103-v7. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-09-02. Persistent URL: http://doi.org/10.3886/ICPSR30103.v7