Introduction

This exploratory data analysis report was created for Udacity’s Data Analysis with R project as part of the Data Analyst Nanodegree program.

The data set I selected to explore is:

How Couples Meet and Stay Together (HCMST), Wave 1 2009, Wave 2 2010, Wave 3 2011, Wave 4 2013, United States http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/30103/version/5

This collection consists of a survey given in 2009 and three follow-up surveys in 2010, 2011, and 2013. The subjects were surveyed as to how they met their spouse/romantic partner, along with a variety of demographic information. There were 4002 respondents and 457 attributes in the source data.

Note to Readers: the description of this Udacity project assignment calls for a stream of consciousness mode of exploratory data analysis; as such, the document style may be more casual than if it were a more formal type of project deliverable.

Data Tidying

Some significant transformation of the data was needed in order to create a more usable, tidy dataset of manageable scope.

  1. Subsetting columns
  2. Renaming columns
  3. Filtering incomplete observations
  4. Transforming categorical variables into factors with representative labels (using codebook)

From the original 457 attributes, I subjectively narrowed the set to 37 attributes of interest for analysis. Incomplete observations were excluded, as I am interested in looking specifically at the data for couples that had completed the final wave of surveying. After this filtering, I was left with 2657 of the initial 4002 records. Finally, to aid in readability, I renamed the attributes to make them more intuitive, and transformed categorical variables into factors with representative labels, using the codebook from the authors of the study as my reference.
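The four tidying steps can be sketched in base R roughly as follows. This is a minimal illustration on a made-up four-row data frame, not the actual tidying script: the raw column names (`id`, `q_age`, `q_gender`, `w4_status`, `unused_var`) and the numeric codes are hypothetical stand-ins for whatever the codebook specifies.

```r
# Toy stand-in for the raw survey extract; all names and codes are hypothetical.
raw <- data.frame(
  id         = c(101, 102, 103, 104),
  q_age      = c(52, 28, 31, 53),
  q_gender   = c(2, 2, 1, 1),
  w4_status  = c(1, NA, 2, 1),   # NA = no final-wave response
  unused_var = c(9, 9, 9, 9)
)

# 1. Subset to the columns of interest.
tidy <- raw[, c("id", "q_age", "q_gender", "w4_status")]

# 2. Rename columns to something more readable.
names(tidy) <- c("CaseID", "Age", "Gender", "StillTogether")

# 3. Keep only respondents with a final-wave outcome.
tidy <- tidy[!is.na(tidy$StillTogether), ]

# 4. Convert coded values to labelled factors (labels assumed to come from a codebook).
tidy$Gender <- factor(tidy$Gender, levels = c(1, 2),
                      labels = c("Male", "Female"))
tidy$StillTogether <- factor(tidy$StillTogether, levels = c(1, 2),
                             labels = c("together", "split up"))

str(tidy)
```

Filtering before labelling keeps the factor conversion from having to deal with the missing outcome codes.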

Initial Exploration

Number of observations

## [1] 2657

Names of dataset features/variables

##  [1] "CaseID"              "Age"                 "Education"          
##  [4] "Ethnicity"           "Gender"              "Income"             
##  [7] "MaritalStatus"       "Married"             "Region"             
## [10] "EmploymentStatus"    "PoliticalParty"      "Religion"           
## [13] "GLBStatus"           "NumberOfChildren"    "SameSexCouple"      
## [16] "AgeDiff"             "RelationshipYears"   "Cohabitating"       
## [19] "WhoEarnedMore"       "MetByFriends"        "MetByFamily"        
## [22] "MetByNeighbors"      "MetByChurch"         "MetByWork"          
## [25] "MetBySchool"         "MetByOnline"         "MetByBar"           
## [28] "MetBySocialGroup"    "MetByPrivateParty"   "MetByOther"         
## [31] "StillTogether"       "RelationshipQuality" "HomeOwnership"      
## [34] "CohabitatationYears" "FirstMetYears"       "ParentalApproval"   
## [37] "YearMet"

Compact display of data structure

## 'data.frame':    2657 obs. of  37 variables:
##  $ CaseID             : int  22526 23286 26315 28536 29584 32656 33536 34341 35653 36493 ...
##  $ Age                : int  52 28 31 53 58 65 53 34 65 49 ...
##  $ Education          : Factor w/ 4 levels "less than high school",..: 4 4 3 4 4 4 4 4 4 4 ...
##  $ Ethnicity          : Factor w/ 5 levels "white","black",..: 4 1 1 1 1 1 1 1 1 1 ...
##  $ Gender             : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 2 1 2 1 1 ...
##  $ Income             : int  22250 45000 45000 137250 17250 137250 200000 80000 45000 45000 ...
##  $ MaritalStatus      : Factor w/ 6 levels "married","widowed",..: 6 6 5 6 4 6 6 1 3 3 ...
##  $ Married            : Factor w/ 2 levels "not married",..: 1 2 1 1 2 1 1 2 1 1 ...
##  $ Region             : Factor w/ 9 levels "new england",..: 3 9 5 4 5 2 9 4 8 4 ...
##  $ EmploymentStatus   : Factor w/ 7 levels "working-paid employee",..: 1 1 1 1 1 5 2 1 1 1 ...
##  $ PoliticalParty     : Factor w/ 3 levels "republican","other",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Religion           : Factor w/ 13 levels "baptist","protestant",..: 3 5 12 2 2 13 2 12 11 2 ...
##  $ GLBStatus          : Factor w/ 2 levels "not glb","glb": 2 2 2 2 2 2 2 2 2 2 ...
##  $ NumberOfChildren   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SameSexCouple      : Factor w/ 2 levels "different sex couple",..: 2 2 2 2 1 2 2 1 2 2 ...
##  $ AgeDiff            : int  4 2 9 2 7 0 9 0 15 3 ...
##  $ RelationshipYears  : int  7 8 8 12 30 27 15 14 14 0 ...
##  $ Cohabitating       : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 2 1 1 ...
##  $ WhoEarnedMore      : Factor w/ 3 levels "subject","same amount",..: 3 1 1 1 3 3 2 1 3 3 ...
##  $ MetByFriends       : Factor w/ 2 levels "0","1": 1 2 2 1 1 2 1 2 1 1 ...
##  $ MetByFamily        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MetByNeighbors     : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 2 1 1 ...
##  $ MetByChurch        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MetByWork          : Factor w/ 3 levels "-1","0","1": 2 2 2 2 3 2 2 2 2 2 ...
##  $ MetBySchool        : Factor w/ 3 levels "-1","0","1": 2 3 2 2 2 2 2 3 2 2 ...
##  $ MetByOnline        : Factor w/ 3 levels "-1","0","1": 3 2 2 2 2 2 2 2 3 3 ...
##  $ MetByBar           : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 3 2 2 2 ...
##  $ MetBySocialGroup   : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ MetByPrivateParty  : Factor w/ 3 levels "-1","0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ MetByOther         : Factor w/ 3 levels "-1","0","1": 2 2 3 3 2 3 2 2 2 2 ...
##  $ StillTogether      : Factor w/ 2 levels "together","split up": 1 1 2 1 1 1 1 1 2 2 ...
##  $ RelationshipQuality: Factor w/ 5 levels "very poor","poor",..: 4 4 4 4 4 5 5 5 5 3 ...
##  $ HomeOwnership      : Factor w/ 3 levels "Own Home","Rent",..: 2 2 1 1 2 1 1 2 1 1 ...
##  $ CohabitatationYears: int  7 6 8 12 28 26 15 12 NA NA ...
##  $ FirstMetYears      : int  7 9 8 12 30 29 15 16 14 0 ...
##  $ ParentalApproval   : Factor w/ 2 levels "Non-approval",..: NA 1 2 2 2 NA 2 2 NA NA ...
##  $ YearMet            : num  2002 2001 2001 1997 1979 ...

Data features with basic statistical summaries

##      CaseID             Age                            Education  
##  Min.   :  22526   Min.   :19.00   less than high school    :280  
##  1st Qu.:1131421   1st Qu.:34.00   high school              :647  
##  Median :2223581   Median :45.00   some college             :753  
##  Mean   :2273922   Mean   :46.13   bachelor degree or higher:977  
##  3rd Qu.:3408205   3rd Qu.:57.00                                  
##  Max.   :4628251   Max.   :92.00                                  
##                                                                   
##     Ethnicity       Gender         Income                   MaritalStatus 
##  white   :1989   Male  :1361   Min.   :  2500   married            :1552  
##  black   : 194   Female:1296   1st Qu.: 37250   widowed            :  52  
##  other   :  88                 Median : 67250   divorced           : 183  
##  hispanic: 285                 Mean   : 70243   separated          :  36  
##  2+ races: 101                 3rd Qu.: 92250   never married      : 407  
##                                Max.   :200000   living with partner: 427  
##                                                                           
##         Married                    Region   
##  not married: 940   pacific           :455  
##  married    :1717   south atlantic    :454  
##                     east-north central:416  
##                     mid-atlantic      :392  
##                     west-south central:250  
##                     mountain          :208  
##                     (Other)           :482  
##                      EmploymentStatus    PoliticalParty
##  working-paid employee       :1543    republican:1072  
##  working-self-employed       : 235    other     :  61  
##  not working-temporary layoff:  14    democrat  :1524  
##  not working-looking for work:  99                     
##  not working-retired         : 376                     
##  not working-disabled        : 188                     
##  not working-other           : 202                     
##             Religion     GLBStatus    NumberOfChildren
##  catholic       :618   not glb:2071   Min.   :0.0000  
##  protestant     :603   glb    : 586   1st Qu.:0.0000  
##  none           :402                  Median :0.0000  
##  baptist        :341                  Mean   :0.4896  
##  other christian:334                  3rd Qu.:1.0000  
##  (Other)        :350                  Max.   :7.0000  
##  NA's           :  9                                  
##               SameSexCouple     AgeDiff       RelationshipYears
##  different sex couple:2251   Min.   : 0.000   Min.   : 0.00    
##  same-sex couple     : 406   1st Qu.: 1.000   1st Qu.: 5.00    
##                              Median : 3.000   Median :13.00    
##                              Mean   : 4.703   Mean   :17.35    
##                              3rd Qu.: 6.000   3rd Qu.:25.00    
##                              Max.   :70.000   Max.   :71.00    
##                              NA's   :14       NA's   :21       
##  Cohabitating     WhoEarnedMore  MetByFriends MetByFamily MetByNeighbors
##  0   : 542    subject    :1193   0   :1676    0   :2208   0   :2359     
##  1   :2112    same amount: 325   1   : 917    1   : 385   1   : 234     
##  NA's:   3    partner    :1117   NA's:  64    NA's:  64   NA's:  64     
##               NA's       :  22                                          
##                                                                         
##                                                                         
##                                                                         
##  MetByChurch MetByWork MetBySchool MetByOnline MetByBar  MetBySocialGroup
##  0   :2416   -1:  14   -1:  14     -1:  14     -1:  14   -1:  14         
##  1   : 177   0 :2207   0 :2264     0 :2510     0 :2332   0 :2522         
##  NA's:  64   1 : 436   1 : 379     1 : 133     1 : 311   1 : 121         
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  MetByPrivateParty MetByOther  StillTogether  RelationshipQuality
##  -1:  14           -1:  14    together:2202   very poor:  16     
##  0 :2343           0 :1719    split up: 455   poor     :  37     
##  1 : 300           1 : 924                    fair     : 207     
##                                               good     : 817     
##                                               excellent:1569     
##                                               NA's     :  11     
##                                                                  
##                 HomeOwnership  CohabitatationYears FirstMetYears 
##  Own Home              :2125   Min.   : 0.00       Min.   : 0.0  
##  Rent                  : 505   1st Qu.: 6.00       1st Qu.: 7.0  
##  Occupy without Payment:  27   Median :13.00       Median :14.0  
##                                Mean   :18.07       Mean   :18.8  
##                                3rd Qu.:26.00       3rd Qu.:27.0  
##                                Max.   :69.00       Max.   :75.0  
##                                NA's   :436         NA's   :12    
##      ParentalApproval    YearMet    
##  Non-approval: 397    Min.   :1938  
##  Approval    :1477    1st Qu.:1984  
##  NA's        : 783    Median :1996  
##                       Mean   :1992  
##                       3rd Qu.:2004  
##                       Max.   :2009  
##                       NA's   :21

Univariate Plots

Beginning univariate exploration

What’s the distribution of ages of subjects?

Subjects’ ages range from 19 to 92 and are slightly bimodal and right-skewed, with a median age of 45.

Continuing to explore subject demographics; what’s the distribution of Education values?

The largest group, 977 of 2657 subjects (about 37%), has a bachelor’s degree or higher.

What’s the distribution of Ethnicity values?

A large majority of subjects are white (likely overrepresented).

Continuing to explore subject demographics; what’s the distribution of Gender values?

Approximately even ratio of males to females.

What’s the distribution of Household Income?

Near normal distribution with some outliers at the high end. Median income is $67,250.

What’s the distribution of geographic Region?

Nothing particularly interesting there.

What’s the distribution of Employment Status?

Most subjects are working as paid employees.

What’s the distribution of Political Party?

Approximately a 3:2 ratio of Democrats to Republicans (1524 to 1072), with a small number (61) in the “other” category.

What’s the distribution of Religion?

A large variety of religious views seem to be represented.

What’s the distribution of Number of Children?

Most subjects do not currently have children.

What’s the distribution of GLB status?

586 of the 2657 subjects (about 22%) identify as gay/lesbian/bi.

What’s the distribution of Same Sex Couples?

There are 406 same-sex couples represented (approximately 15%).

What’s the distribution of Cohabitating couples?

2112 subjects (79%) live with their partners.

What’s the distribution of Married couples?

1717 (64.6%) of the couples are married.

What’s the distribution of Who Earns More, subject or partner?

A pretty even ratio of the subject earning more vs their partner earning more.

What’s the distribution of Length of the Relationship in years?

Relationship duration is roughly exponentially distributed, ranging from 0 to 71 years with a median of 13 years.

What’s the distribution of how the subjects rate the Relationship Quality?

About 90% of subjects (2386 of 2657) rated the quality of their relationship favorably (good or excellent).

By the end of the study period (2013), how many of the original subjects are still with their partner?

## 
##  together  split up 
## 0.8287542 0.1712458

83% are still coupled; 17% have split up.
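Proportions like the ones printed above come from dividing a frequency table by its total. The sketch below rebuilds the StillTogether column from the counts in the summary output (2202 together, 455 split up) rather than from the survey file; the exact call used in the original analysis is an assumption.

```r
# Rebuild the outcome column from the counts shown in the summary output.
still_together <- factor(
  rep(c("together", "split up"), times = c(2202, 455)),
  levels = c("together", "split up")
)

counts <- table(still_together)   # absolute frequencies
shares <- prop.table(counts)      # each count divided by the total (2657)

print(shares)
```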

Univariate Analysis

What is the structure of your dataset?

The tidied dataset consists of 2657 observations with 37 variables.

Most variables are categorical. Numerical variables include Age, Income, Number of Children, Age Difference from Partner, Year Met, and Years since subject met/started relationship/started cohabitating with partner.

Other observations: Subjects were age 19 to 92. Mean household income was $70,243. 64.6% were married to their partner. 83% were still in a relationship with their partner at the end of the study.

What is/are the main feature(s) of interest in your dataset?

The most interesting feature is StillTogether, which indicates whether or not the subject is still in a relationship with the partner by the end of the study.

Another feature of interest is how the subject met their partner, which is represented by 11 binary MetBy* variables (the respondents were able to choose more than one).

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I’m interested in studying the potential relationship between the non-identifying features (i.e., all but CaseID) and the StillTogether variable.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data transformation/tidying operations I performed are described above.

Most of the important features are categorical. Age and Income have nearly normal distributions. Relationship Years is roughly exponential.

Bivariate Plots

Let’s start trying to explore relationships between variables; particularly those influencing StillTogether. I wonder if couples that met online are more or less likely to still be together?

So, only 63.9% of couples that met online were still together at the end of the study vs. 83.9% of those who did not meet online. This is interesting, although it only includes couples who met before 2010. As social networking has grown dramatically and become more of a social norm in recent years, I do wonder if this finding would still hold true with more recent data.
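A comparison like this is a two-way table with row-wise proportions. Here is a minimal sketch on ten made-up couples (not the survey records), assuming the two columns are named as in the tidied dataset:

```r
# Toy data only: 10 hypothetical couples.
toy <- data.frame(
  MetByOnline   = factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)),
  StillTogether = factor(
    c("together", "together", "split up", "split up",
      "together", "together", "together", "together",
      "together", "split up"),
    levels = c("together", "split up")
  )
)

# Cross-tabulate, then convert each row to proportions (margin = 1).
tab <- table(toy$MetByOnline, toy$StillTogether,
             dnn = c("MetByOnline", "StillTogether"))
row_shares <- prop.table(tab, margin = 1)

print(round(row_shares, 3))
```

Using `margin = 1` makes each row sum to 1, so the "together" column reads directly as the share still together within each group.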

What about subject Political Party, does that influence relationship viability?

It seems Republicans may be slightly more likely to remain coupled than Democrats or Others.

Does being the major breadwinner of the household influence StillTogether?

There doesn’t seem to be a significant relationship here.

What about household Income?

Couples that were still together at the end of the study have a higher income distribution than those that split up.
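One simple way to back up a visual comparison like this is to compute a summary statistic per group. The sketch below uses eight made-up income values, so the resulting medians are purely illustrative:

```r
# Toy data only: hypothetical household incomes for two outcome groups.
toy <- data.frame(
  Income = c(22250, 45000, 67250, 92250, 137250, 17250, 37250, 45000),
  StillTogether = factor(
    rep(c("together", "split up"), times = c(5, 3)),
    levels = c("together", "split up")
  )
)

# Median income within each level of StillTogether.
group_medians <- tapply(toy$Income, toy$StillTogether, median)
print(group_medians)
```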

OK, here’s an interesting question. Does the couple’s household income seem to affect whether or not the subject’s parents approve of the relationship?

Yes, couples with parental approval have a higher mean household income than those without parental approval.

Are married couples more likely to stick it out than unmarried couples?

Yes, 96% of married couples were still together at the end of the study. This seems to be a significant relationship (pun intended.)

Is there a relationship between cohabitation and the StillTogether result?

92% of couples that lived together were still together vs. roughly 46% of couples that did not live together.

Is there a difference between Same Sex Couples and Different Sex Couples with respect to the StillTogether outcome?

Different sex couples were slightly more likely to still be together in the study.

Setting aside the StillTogether variable for now, how has the way in which couples meet changed over time?

We can see there is a dramatic increase in the occurrence of people meeting their partners online over the years. It would be interesting to see more recent data on this.
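A trend like this can also be tabulated by grouping YearMet into decades and taking the share of couples who met online in each. The values below are invented for illustration. In the tidied data MetByOnline is a factor with levels "-1", "0" and "1", so it would first need converting (for example with `as.numeric(as.character(x))`) and the -1 codes set aside.

```r
# Toy data only: hypothetical meeting years and a 0/1 online indicator.
toy <- data.frame(
  YearMet     = c(1975, 1982, 1988, 1993, 1997, 1999, 2002, 2005, 2007, 2009),
  MetByOnline = c(0,    0,    0,    0,    0,    1,    0,    1,    1,    0)
)

# Round each year down to the start of its decade.
toy$Decade <- floor(toy$YearMet / 10) * 10

# The mean of a 0/1 indicator is the share of couples who met online.
online_share <- tapply(toy$MetByOnline, toy$Decade, mean)
print(round(online_share, 2))
```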

Does household income have any interesting effects on subjective relationship quality rankings?

No, not as much as I would have expected. Subjects with the very highest incomes were not represented in the lowest relationship quality categories. However, the mean household income is about equal between those who saw their relationship quality as Excellent and those who saw it as Very Poor.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Couples that met online were less likely to still be together. Couples that were still together had a slightly higher median household income. Couples that were cohabitating were more likely to still be together. Same sex couples were less likely to still be together.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Couples with a higher household income were more likely to receive parental approval. The occurrence of partners meeting online increases dramatically over time.

What was the strongest relationship you found?

Married couples were much more likely to “stick it out” and still be together.

Multivariate Plots/Model

Let’s look at the various categories of how the subjects met their partner.

Friends and “Other” are the most frequent methods. I’m a little curious about that Other category, but I don’t have a way to drill down into it further with the current set of variables.

Let’s use some statistical tools to determine if there are any direct correlations between the variables.

Correlation Matrix

It is not surprising that Age is positively correlated with relationship length (RelationshipYears, CohabitatationYears) and negatively correlated with YearMet. SameSexCouple is correlated with GLBStatus. There is a weak correlation between Gender and WhoEarnedMore. Income and Education are correlated, as are ParentalApproval and RelationshipQuality.
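For a dataset that mixes numeric columns and factors, one rough way to get a correlation matrix is to turn the factors into integer codes first. This is a sketch on made-up data and only one of several possible approaches. How the original matrix was built is not stated, and treating factor codes as numbers is crude for unordered categories.

```r
# Toy data only: six hypothetical subjects.
toy <- data.frame(
  Age               = c(25, 34, 45, 52, 61, 70),
  RelationshipYears = c(2, 8, 15, 20, 35, 44),
  YearMet           = c(2007, 2001, 1994, 1989, 1974, 1965),
  Married           = factor(
    c("not married", "married", "married",
      "not married", "married", "married"),
    levels = c("not married", "married")
  )
)

# data.matrix() keeps numeric columns and replaces factors by their integer codes.
num <- data.matrix(toy)

# Pairwise-complete observations so that NAs in one column do not drop whole rows.
cors <- cor(num, use = "pairwise.complete.obs")
print(round(cors, 2))
```

In this toy example YearMet is constructed as 2009 minus RelationshipYears, so those two columns are perfectly negatively correlated.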

Regression Model: Initial

We’re going to create a binomial logistic regression model with StillTogether as the dependent variable. For our initial model, we’re going to use all variables except for CaseID (the numeric unique identifier for observations).

## 
## Call:
## glm(formula = StillTogether ~ . - CaseID, family = binomial, 
##     data = no.na.Data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93683  -0.37956  -0.20410  -0.08566   2.88804  
## 
## Coefficients: (7 not defined because of singularities)
##                                                Estimate Std. Error z value
## (Intercept)                                  -8.615e+00  3.956e+03  -0.002
## Age                                          -3.755e-02  1.492e-02  -2.517
## Educationhigh school                         -3.895e-01  4.727e-01  -0.824
## Educationsome college                        -2.395e-01  4.635e-01  -0.517
## Educationbachelor degree or higher           -5.596e-01  4.927e-01  -1.136
## Ethnicityblack                               -2.777e-02  4.141e-01  -0.067
## Ethnicityother                               -1.286e+00  7.668e-01  -1.677
## Ethnicityhispanic                            -1.541e-01  3.459e-01  -0.445
## Ethnicity2+ races                             5.627e-02  5.090e-01   0.111
## GenderFemale                                  2.428e-01  2.316e-01   1.049
## Income                                       -4.109e-06  2.762e-06  -1.487
## MaritalStatuswidowed                         -1.551e+01  1.845e+03  -0.008
## MaritalStatusdivorced                         1.058e+00  5.369e-01   1.970
## MaritalStatusseparated                        6.020e-01  8.316e-01   0.724
## MaritalStatusnever married                    6.186e-01  4.406e-01   1.404
## MaritalStatusliving with partner              3.501e-01  3.951e-01   0.886
## Marriedmarried                               -3.253e-01  3.545e-01  -0.918
## Regionmid-atlantic                           -1.951e-01  5.274e-01  -0.370
## Regioneast-north central                     -7.386e-01  5.297e-01  -1.394
## Regionwest-north central                     -1.232e+00  6.365e-01  -1.936
## Regionsouth atlantic                         -1.003e-01  5.068e-01  -0.198
## Regioneast-south central                     -4.424e-01  6.665e-01  -0.664
## Regionwest-south central                     -5.431e-01  5.871e-01  -0.925
## Regionmountain                               -3.605e-01  6.084e-01  -0.593
## Regionpacific                                -1.659e-01  5.106e-01  -0.325
## EmploymentStatusworking-self-employed        -3.494e-01  3.852e-01  -0.907
## EmploymentStatusnot working-temporary layoff -1.508e+00  1.301e+00  -1.159
## EmploymentStatusnot working-looking for work -8.451e-02  4.947e-01  -0.171
## EmploymentStatusnot working-retired          -2.235e-01  1.127e+00  -0.198
## EmploymentStatusnot working-disabled          2.792e-01  4.195e-01   0.666
## EmploymentStatusnot working-other             8.263e-02  4.479e-01   0.184
## PoliticalPartyother                          -5.477e-01  8.358e-01  -0.655
## PoliticalPartydemocrat                       -1.705e-01  2.520e-01  -0.677
## Religionprotestant                           -4.533e-01  3.808e-01  -1.190
## Religioncatholic                             -6.589e-01  3.950e-01  -1.668
## Religionmormon                               -1.992e+00  1.275e+00  -1.562
## Religionjewish                                2.512e-01  6.487e-01   0.387
## Religionmuslim                               -1.644e+01  2.153e+03  -0.008
## Religionhindu                                 2.040e-01  1.808e+00   0.113
## Religionbuddhist                             -4.731e-01  1.254e+00  -0.377
## Religionpentecostal                          -1.649e+01  5.351e+02  -0.031
## Religioneastern orthodox                     -1.589e+01  1.221e+03  -0.013
## Religionother christian                      -3.842e-01  3.922e-01  -0.980
## Religionother non-christian                  -5.559e-01  5.523e-01  -1.007
## Religionnone                                 -3.500e-01  3.799e-01  -0.922
## GLBStatusglb                                  2.629e-01  3.898e-01   0.674
## NumberOfChildren                              1.030e-01  1.133e-01   0.909
## SameSexCouplesame-sex couple                  9.017e-01  4.595e-01   1.963
## AgeDiff                                       1.122e-02  2.020e-02   0.556
## RelationshipYears                            -1.134e-01  5.999e-02  -1.891
## Cohabitating1                                -1.311e+00  3.667e-01  -3.576
## WhoEarnedMoresame amount                      3.684e-02  3.344e-01   0.110
## WhoEarnedMorepartner                         -2.911e-01  2.448e-01  -1.189
## MetByFriends1                                -8.792e-02  2.383e-01  -0.369
## MetByFamily1                                  7.207e-02  2.949e-01   0.244
## MetByNeighbors1                              -9.996e-01  4.680e-01  -2.136
## MetByChurch1                                 -7.967e-01  7.028e-01  -1.134
## MetByWork0                                    1.789e+01  3.956e+03   0.005
## MetByWork1                                    1.766e+01  3.956e+03   0.004
## MetBySchool0                                 -4.807e-01  4.492e-01  -1.070
## MetBySchool1                                         NA         NA      NA
## MetByOnline0                                 -2.224e-01  5.710e-01  -0.389
## MetByOnline1                                         NA         NA      NA
## MetByBar0                                    -4.876e-01  4.139e-01  -1.178
## MetByBar1                                            NA         NA      NA
## MetBySocialGroup0                             1.639e-01  5.732e-01   0.286
## MetBySocialGroup1                                    NA         NA      NA
## MetByPrivateParty0                           -1.225e+00  4.354e-01  -2.814
## MetByPrivateParty1                                   NA         NA      NA
## MetByOther0                                  -9.555e-01  3.912e-01  -2.443
## MetByOther1                                          NA         NA      NA
## RelationshipQualitypoor                      -1.098e+00  1.039e+00  -1.057
## RelationshipQualityfair                      -2.769e+00  9.065e-01  -3.054
## RelationshipQualitygood                      -3.471e+00  8.911e-01  -3.895
## RelationshipQualityexcellent                 -4.754e+00  9.059e-01  -5.248
## HomeOwnershipRent                            -2.705e-01  2.680e-01  -1.009
## HomeOwnershipOccupy without Payment           6.987e-01  8.200e-01   0.852
## CohabitatationYears                           1.022e-02  5.669e-02   0.180
## FirstMetYears                                 2.707e-02  2.645e-02   1.023
## ParentalApprovalApproval                     -1.269e-01  2.576e-01  -0.492
## YearMet                                              NA         NA      NA
##                                              Pr(>|z|)    
## (Intercept)                                  0.998263    
## Age                                          0.011820 *  
## Educationhigh school                         0.409954    
## Educationsome college                        0.605402    
## Educationbachelor degree or higher           0.255991    
## Ethnicityblack                               0.946526    
## Ethnicityother                               0.093548 .  
## Ethnicityhispanic                            0.655990    
## Ethnicity2+ races                            0.911981    
## GenderFemale                                 0.294389    
## Income                                       0.136924    
## MaritalStatuswidowed                         0.993292    
## MaritalStatusdivorced                        0.048793 *  
## MaritalStatusseparated                       0.469129    
## MaritalStatusnever married                   0.160313    
## MaritalStatusliving with partner             0.375531    
## Marriedmarried                               0.358870    
## Regionmid-atlantic                           0.711397    
## Regioneast-north central                     0.163187    
## Regionwest-north central                     0.052850 .  
## Regionsouth atlantic                         0.843116    
## Regioneast-south central                     0.506793    
## Regionwest-south central                     0.354916    
## Regionmountain                               0.553467    
## Regionpacific                                0.745286    
## EmploymentStatusworking-self-employed        0.364287    
## EmploymentStatusnot working-temporary layoff 0.246383    
## EmploymentStatusnot working-looking for work 0.864361    
## EmploymentStatusnot working-retired          0.842873    
## EmploymentStatusnot working-disabled         0.505631    
## EmploymentStatusnot working-other            0.853637    
## PoliticalPartyother                          0.512293    
## PoliticalPartydemocrat                       0.498714    
## Religionprotestant                           0.233927    
## Religioncatholic                             0.095279 .  
## Religionmormon                               0.118199    
## Religionjewish                               0.698542    
## Religionmuslim                               0.993907    
## Religionhindu                                0.910202    
## Religionbuddhist                             0.705910    
## Religionpentecostal                          0.975421    
## Religioneastern orthodox                     0.989612    
## Religionother christian                      0.327274    
## Religionother non-christian                  0.314160    
## Religionnone                                 0.356787    
## GLBStatusglb                                 0.500046    
## NumberOfChildren                             0.363541    
## SameSexCouplesame-sex couple                 0.049699 *  
## AgeDiff                                      0.578485    
## RelationshipYears                            0.058656 .  
## Cohabitating1                                0.000349 ***
## WhoEarnedMoresame amount                     0.912288    
## WhoEarnedMorepartner                         0.234443    
## MetByFriends1                                0.712206    
## MetByFamily1                                 0.806950    
## MetByNeighbors1                              0.032673 *  
## MetByChurch1                                 0.256990    
## MetByWork0                                   0.996393    
## MetByWork1                                   0.996438    
## MetBySchool0                                 0.284484    
## MetBySchool1                                       NA    
## MetByOnline0                                 0.696929    
## MetByOnline1                                       NA    
## MetByBar0                                    0.238818    
## MetByBar1                                          NA    
## MetBySocialGroup0                            0.774885    
## MetBySocialGroup1                                  NA    
## MetByPrivateParty0                           0.004900 ** 
## MetByPrivateParty1                                 NA    
## MetByOther0                                  0.014579 *  
## MetByOther1                                        NA    
## RelationshipQualitypoor                      0.290605    
## RelationshipQualityfair                      0.002255 ** 
## RelationshipQualitygood                      9.82e-05 ***
## RelationshipQualityexcellent                 1.54e-07 ***
## HomeOwnershipRent                            0.312875    
## HomeOwnershipOccupy without Payment          0.394166    
## CohabitatationYears                          0.856935    
## FirstMetYears                                0.306163    
## ParentalApprovalApproval                     0.622402    
## YearMet                                            NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1066.53  on 1491  degrees of freedom
## Residual deviance:  682.16  on 1418  degrees of freedom
## AIC: 830.16
## 
## Number of Fisher Scoring iterations: 16

Our initial model has an AIC of 830.

Regression Model: Optimized

Now let’s use stepwise AIC optimization to find the best model fit.
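As a rough illustration of what this step does, here is a minimal sketch using `stepAIC()` from the MASS package on synthetic data. The data frame `toy`, its columns, and the choice of `direction = "backward"` are assumptions made for the example; they are not the settings or data used for the output shown in this report.

```r
# Sketch only: synthetic data stands in for the HCMST subset.
library(MASS)  # provides stepAIC()

set.seed(42)
n <- 500
toy <- data.frame(
  Age               = round(runif(n, 20, 80)),
  RelationshipYears = round(runif(n, 0, 40)),
  Noise1            = rnorm(n),  # deliberately unrelated to the outcome
  Noise2            = rnorm(n)   # deliberately unrelated to the outcome
)

# Simulate a 0/1 outcome that depends on Age and RelationshipYears only
lin <- 1.5 - 0.03 * toy$Age - 0.08 * toy$RelationshipYears
toy$Outcome <- rbinom(n, size = 1, prob = plogis(lin))

# Start from the model with every candidate predictor ...
full_model <- glm(Outcome ~ ., family = binomial, data = toy)

# ... and let stepAIC() drop terms for as long as the AIC keeps falling
step_model <- stepAIC(full_model, direction = "backward", trace = FALSE)

print(step_model$anova)    # the "Stepwise Model Path" table
print(formula(step_model)) # the formula that survived the search
print(c(full = AIC(full_model), stepped = AIC(step_model)))
```

Each row of the path table records the term removed at that step and the AIC after removal, which is the same layout as the table that follows.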

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## StillTogether ~ (CaseID + Age + Education + Ethnicity + Gender + 
##     Income + MaritalStatus + Married + Region + EmploymentStatus + 
##     PoliticalParty + Religion + GLBStatus + NumberOfChildren + 
##     SameSexCouple + AgeDiff + RelationshipYears + Cohabitating + 
##     WhoEarnedMore + MetByFriends + MetByFamily + MetByNeighbors + 
##     MetByChurch + MetByWork + MetBySchool + MetByOnline + MetByBar + 
##     MetBySocialGroup + MetByPrivateParty + MetByOther + RelationshipQuality + 
##     HomeOwnership + CohabitatationYears + FirstMetYears + ParentalApproval + 
##     YearMet) - CaseID
## 
## Final Model:
## StillTogether ~ Age + Income + Married + SameSexCouple + RelationshipYears + 
##     Cohabitating + MetByNeighbors + MetByChurch + MetByBar + 
##     MetByPrivateParty + MetByOther + RelationshipQuality
## 
## 
##                     Step Df     Deviance Resid. Df Resid. Dev      AIC
## 1                                             1418   682.1563 830.1563
## 2              - YearMet  0  0.000000000      1418   682.1563 830.1563
## 3     - EmploymentStatus  6  3.235367203      1424   685.3916 821.3916
## 4               - Region  8  9.140655194      1432   694.5323 814.5323
## 5             - Religion 12 16.707935257      1444   711.2402 807.2402
## 6            - Education  3  1.657798241      1447   712.8980 802.8980
## 7        - MaritalStatus  5  6.327018366      1452   719.2250 799.2250
## 8       - PoliticalParty  2  0.910844570      1454   720.1359 796.1359
## 9            - Ethnicity  4  4.992743319      1458   725.1286 793.1286
## 10       - HomeOwnership  2  1.600656859      1460   726.7293 790.7293
## 11 - CohabitatationYears  1  0.006660194      1461   726.7359 788.7359
## 12    - ParentalApproval  1  0.010333973      1462   726.7463 786.7463
## 13         - MetByOnline  1  0.114009190      1463   726.8603 784.8603
## 14       - WhoEarnedMore  2  2.112845231      1465   728.9731 782.9731
## 15    - NumberOfChildren  1  0.133198958      1466   729.1063 781.1063
## 16             - AgeDiff  1  0.136548659      1467   729.2429 779.2429
## 17         - MetBySchool  1  0.328844785      1468   729.5717 777.5717
## 18              - Gender  1  0.502619399      1469   730.0743 776.0743
## 19        - MetByFriends  1  0.513947553      1470   730.5883 774.5883
## 20    - MetBySocialGroup  1  0.508614575      1471   731.0969 773.0969
## 21         - MetByFamily  1  0.809728915      1472   731.9066 771.9066
## 22           - MetByWork  1  0.961437815      1473   732.8681 770.8681
## 23           - GLBStatus  1  1.099131014      1474   733.9672 769.9672
## 24       - FirstMetYears  1  1.886010267      1475   735.8532 769.8532
## 
## Call:
## glm(formula = StillTogether ~ Age + Income + Married + SameSexCouple + 
##     RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch + 
##     MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality, 
##     family = binomial, data = no.na.Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1619  -0.4068  -0.2418  -0.1153   2.9905  
## 
## Coefficients: (2 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -5.812e+00  5.354e+02  -0.011 0.991340    
## Age                          -3.312e-02  1.227e-02  -2.700 0.006941 ** 
## Income                       -4.957e-06  2.295e-06  -2.160 0.030794 *  
## Marriedmarried               -6.171e-01  2.366e-01  -2.608 0.009111 ** 
## SameSexCouplesame-sex couple  1.002e+00  2.788e-01   3.594 0.000326 ***
## RelationshipYears            -8.068e-02  1.829e-02  -4.411 1.03e-05 ***
## Cohabitating1                -1.614e+00  3.102e-01  -5.203 1.96e-07 ***
## MetByNeighbors1              -8.860e-01  4.285e-01  -2.068 0.038677 *  
## MetByChurch1                 -1.129e+00  6.818e-01  -1.656 0.097630 .  
## MetByBar0                     1.253e+01  5.354e+02   0.023 0.981333    
## MetByBar1                     1.300e+01  5.354e+02   0.024 0.980634    
## MetByPrivateParty0           -9.186e-01  3.074e-01  -2.989 0.002803 ** 
## MetByPrivateParty1                   NA         NA      NA       NA    
## MetByOther0                  -8.946e-01  2.257e-01  -3.963 7.40e-05 ***
## MetByOther1                          NA         NA      NA       NA    
## RelationshipQualitypoor      -7.574e-01  8.805e-01  -0.860 0.389695    
## RelationshipQualityfair      -2.240e+00  7.609e-01  -2.944 0.003241 ** 
## RelationshipQualitygood      -2.896e+00  7.398e-01  -3.915 9.05e-05 ***
## RelationshipQualityexcellent -4.091e+00  7.460e-01  -5.484 4.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1066.53  on 1491  degrees of freedom
## Residual deviance:  735.85  on 1475  degrees of freedom
## AIC: 769.85
## 
## Number of Fisher Scoring iterations: 12

Our optimized model has an AIC of about 770 (769.85), down from 830 for the initial model, which is a substantial improvement. The new model is also more parsimonious: StillTogether ~ Age + Income + Married + SameSexCouple + RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch + MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality
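The AIC values can be cross-checked from the deviance lines in the two model summaries. For a binomial model fitted to 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. The helper name `aic_from_output` is made up for this check; the numbers are copied from the summaries above.

```r
# AIC = residual deviance + 2 * (number of estimated coefficients),
# which holds for a binomial GLM on 0/1 outcomes.
null_df <- 1491  # null model degrees of freedom, from both summaries

aic_from_output <- function(resid_dev, resid_df) {
  # Estimated coefficients, including the intercept
  k <- null_df - resid_df + 1
  resid_dev + 2 * k
}

aic_initial   <- aic_from_output(682.16, 1418)  # 74 coefficients
aic_optimized <- aic_from_output(735.85, 1475)  # 17 coefficients

print(c(initial = aic_initial, optimized = aic_optimized))
```

This reproduces 830.16 and 769.85. The optimized model fits slightly worse in raw deviance but uses 57 fewer coefficients, which is why its AIC is lower.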

Let’s look at some diagnostic plots to help evaluate our model.

The ordered residuals plot and Cook statistic plot seem to indicate that this is a good model but with a fair number of outliers. The ROC (receiver operating characteristic) plot reports a value of 87.29%. This figure is most likely the area under the curve (AUC) rather than classification accuracy in the strict sense; a value in this range is conventionally considered “good”.
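A single number read from an ROC plot is usually the area under the curve (AUC), which is not the same thing as the share of correctly classified cases. The sketch below computes both on a tiny made-up example in base R. The function names `auc_rank` and `accuracy_at` and the toy vectors are assumptions for illustration, not part of the original analysis.

```r
# AUC via the rank (Mann-Whitney) formula: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative one.
auc_rank <- function(score, label) {
  r  <- rank(score)          # average ranks, so ties are handled
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Accuracy: share of cases classified correctly at one fixed cutoff.
accuracy_at <- function(score, label, cutoff = 0.5) {
  mean((score >= cutoff) == (label == 1))
}

# Made-up outcomes and predicted probabilities
label <- c(0, 0, 1, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.45, 0.80)

toy_auc <- auc_rank(score, label)     # 5 of 6 positive/negative pairs ordered correctly
toy_acc <- accuracy_at(score, label)  # 3 of 5 cases correct at cutoff 0.5
print(c(auc = toy_auc, accuracy = toy_acc))
```

Here the AUC is about 0.83 while accuracy at a 0.5 cutoff is 0.60, so the two measures can differ noticeably for the same set of predictions.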

K-means Clustering

Using the variables we’ve found to be predictive in our regression model, let’s use a K-means clustering algorithm to define two clusters. The cases within each cluster will be relatively similar in Age + Income + Married + SameSexCouple + RelationshipYears + Cohabitating + MetByNeighbors + MetByChurch + MetByBar + MetByPrivateParty + MetByOther + RelationshipQuality. The cases that are most dissimilar will be in different clusters.
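K-means works on numeric distances, so a mix of numeric and categorical predictors has to be turned into a numeric matrix first. The sketch below shows one common way to do that (dummy coding followed by standardisation) on synthetic data. The data frame `toy` and the preprocessing choices are assumptions for illustration and may differ from what was done for this report.

```r
set.seed(1)
n <- 200

# Synthetic couples: two well-separated groups, for illustration only
toy <- data.frame(
  Age               = c(rnorm(n / 2, 30, 5), rnorm(n / 2, 60, 5)),
  RelationshipYears = c(rnorm(n / 2, 3, 1),  rnorm(n / 2, 30, 5)),
  Married           = factor(rep(c("not married", "married"), each = n / 2))
)

# Expand factors into 0/1 indicator columns (no intercept column),
# then standardise so no single variable dominates the distances
x <- model.matrix(~ . - 1, data = toy)
x <- scale(x)

# Two clusters; several random starts guard against a poor local optimum
fit <- kmeans(x, centers = 2, nstart = 25)

print(table(fit$cluster))
```

Without the scaling step, a variable measured in large units (Income, for example) would dominate the Euclidean distances and effectively decide the clusters on its own.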

Now let’s create a scatterplot of all couples with their cluster membership (1 or 2) vs. the StillTogether outcome.

We see that cluster membership is strongly associated with the StillTogether variable. Most couples in cluster 1 are still together. Couples in cluster 2 seem to be fairly evenly distributed between together and not together.
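Because both variables are categorical, the same comparison can also be read from a cross-tabulation with row proportions. The vectors below are made-up values chosen only to show the mechanics; they are not the study data.

```r
# Hypothetical cluster labels and outcomes for ten couples
cluster <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
still_together <- factor(
  c("yes", "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no"),
  levels = c("no", "yes")
)

# Counts per cluster and outcome
counts <- table(Cluster = cluster, StillTogether = still_together)

# Row proportions: within each cluster, the share in each outcome
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

In this toy example, 5 of 6 couples in cluster 1 are still together, compared with 2 of 4 in cluster 2.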

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The correlation matrix highlights several correlated features, such as the subject’s age and the length of the relationship. Another example is the GLB status of the subject and the couple’s same-sex status. Most of these correlations make intuitive sense and were not surprising.

Were there any interesting or surprising interactions between features?

I found all of the observations to be interesting, but none particularly surprising.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I constructed a binomial logistic regression model with StillTogether as the dependent variable. I started with all features (aside from CaseID) as the potential independent variables. I used stepwise AIC optimization to find the optimum model for this data set. The following features were found to provide the best model explaining variability of StillTogether: Age, Relationship Quality, Income, Relationship Years, Married, Cohabitating, Same Sex Couple, and how the couple met (5 variables: neighbors, church, bar, private party, and other). For this set of data, these attributes predicted whether or not the couple was still together with a score of about 87% on the ROC plot.

To help confirm the model, I performed K-means clustering on the couples using the above set of attributes. I created two clusters based upon combined similarity of Age, Relationship Quality, Income, Relationship Years, Married, Cohabitating, Same Sex Couple, and how the couple met.


Final Plots and Summary

Plot One

Description One

This plot shows how the ways in which couples meet have changed from year to year. Meeting online shows a strong upward trend, bearing in mind that the last data point is for 2010. Meetings via church or neighbors were both trending downward.

Plot Two

Description Two

This boxplot shows the distribution of subjective RelationshipQuality values grouped by length of relationship (RelationshipYears). The V shape of the different quartiles seems to indicate that subjects have a less extreme qualitative sentiment towards their relationship quality in the relatively earlier years. After the 9-year mark, subjects are more likely to develop a stronger view, either positive or negative. After the 40-year point, all couples rate their relationships as fair, good, or excellent. I found it quite surprising that the mean RelationshipYears value for relationships rated “very poor” is as high as 16.5 years.

Plot Three

Description Three

This is a plot of the StillTogether outcome for our two clusters of couples. I created two distinct clusters based upon combined similarity of predictive features (Age, Relationship Quality, Income, Relationship Years, Married, Cohabitating, Same Sex Couple, and how the couple met). We see from the two plots that couples in one of the clusters are much more likely to stay together than those in the other. This helps validate that these factors are indeed predictive of relationship viability.


Reflection

I tried to select an interesting dataset to explore, and overall it was enjoyable to work with. There were over 400 features initially, so I knew that I would need to narrow down the feature set in order to make the scope of this project manageable. I chose demographic features that seemed interesting, and also reviewed (skimmed) other works in the social science domain to help with feature selection. Tidying the dataset and appropriately labeling the categorical values was a time-consuming, but essential, step in the analysis process. As this was the first time I’ve conducted a binomial logistic regression, rather than an ordinary linear regression, interpreting some of the diagnostic results, such as the residual plots, was a challenge. If I were to expand my research on this data, I would most likely be interested in (1) pulling in more of the 400+ features for examination, and (2) exploring some different methods of predictive model creation, such as random forest and related classification algorithms. I would likely also partition the data into training and test sets to help validate the model.
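As a sketch of what such a partition might look like, the example below fits a logistic regression on 70% of a synthetic data set and scores it on the remaining 30%. The data frame `toy`, the 70/30 split, and the 0.5 cutoff are all assumptions chosen for illustration.

```r
set.seed(7)
n <- 400

# Synthetic stand-in for the couples data
toy <- data.frame(
  Age               = runif(n, 20, 80),
  RelationshipYears = runif(n, 0, 40)
)
toy$Outcome <- rbinom(
  n, size = 1,
  prob = plogis(1.5 - 0.03 * toy$Age - 0.08 * toy$RelationshipYears)
)

# Random 70/30 split into training and test rows
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- glm(Outcome ~ Age + RelationshipYears, family = binomial, data = train)

# Score the held-out rows and measure accuracy at a 0.5 cutoff
p <- predict(fit, newdata = test, type = "response")
holdout_accuracy <- mean((p >= 0.5) == (test$Outcome == 1))
print(holdout_accuracy)
```

Because the test rows play no part in fitting, the holdout figure gives a less optimistic estimate than one computed on the same rows used to build the model.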

Citation

Rosenfeld, Michael J., Reuben J. Thomas, and Maja Falcon. How Couples Meet and Stay Together (HCMST), Wave 1 2009, Wave 2 2010, Wave 3 2011, Wave 4 2013, United States. ICPSR30103-v7. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2014-09-02. http://doi.org/10.3886/ICPSR30103.v7