Quant Final (Fall 2024)

Depending on the options you’ve specified when you read in your csv file, your variables may be factor, numeric or strings, etc. In each of the questions below, check that your variable is in the correct format with class(variable) before running your analysis.

library(readr)
Dating <- read_csv("Dating.csv")

## New names:
## • `` -> `...1`
## Rows: 1428 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): region, usr, sex, life_quality, use_internet, use_email, use_socia...
## dbl (11): ...1, X, state, age, children0_5, children6_11, children12_17, emp...
## lgl  (1): looking_for_partner
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(Dating)
str(Dating)

## spc_tbl_ [1,428 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1                  : num [1:1428] 1 2 3 4 5 6 7 8 9 10 ...
##  $ X                     : num [1:1428] 1 2 3 5 6 7 8 9 10 12 ...
##  $ state                 : num [1:1428] 5 17 34 45 19 34 48 12 55 8 ...
##  $ region                : chr [1:1428] "South" "Midwest" "Northeast" "South" ...
##  $ usr                   : chr [1:1428] "Rural" "Suburban" "Suburban" "Urban" ...
##  $ sex                   : chr [1:1428] "Female" "Female" "Male" "Female" ...
##  $ life_quality          : chr [1:1428] "2" "2" "3" "3" ...
##  $ use_internet          : chr [1:1428] "Yes" "Yes" NA NA ...
##  $ use_email             : chr [1:1428] "Yes" "Yes" NA NA ...
##  $ use_social_networking : chr [1:1428] "No" "Yes" "No" "Yes" ...
##  $ use_twitter           : chr [1:1428] "No" "No" "No" "No" ...
##  $ use_reddit            : chr [1:1428] "No" "No" "No" "No" ...
##  $ googled_own_name      : chr [1:1428] "No" "Yes" "Yes" "Yes" ...
##  $ have_cell_phone       : chr [1:1428] "Yes" "Yes" "Yes" "Yes" ...
##  $ have_tablet           : chr [1:1428] "Yes" "No" "No" "No" ...
##  $ have_smart_phone      : chr [1:1428] "No" "Yes" "Yes" "No" ...
##  $ marital_status        : chr [1:1428] "Married" "Married" "Married" "Married" ...
##  $ in_relationship       : chr [1:1428] "Yes" "Yes" "Yes" "Yes" ...
##  $ years_in_relationship : chr [1:1428] "57" "15" "15" "18" ...
##  $ looking_for_partner   : logi [1:1428] NA NA NA NA NA NA ...
##  $ met_partner_online    : chr [1:1428] "No" "No" "No" "No" ...
##  $ searched_for_ex_online: chr [1:1428] "No" "No" "Yes" "No" ...
##  $ flirted_online        : chr [1:1428] "No" "No" "No" "No" ...
##  $ used_dating_site      : chr [1:1428] "No" "No" "No" "No" ...
##  $ age                   : num [1:1428] 77 38 51 62 48 77 67 48 59 65 ...
##  $ have_children         : chr [1:1428] "No" "Yes" "Yes" "No" ...
##  $ children0_5           : num [1:1428] NA 0 1 NA 0 NA NA 0 NA NA ...
##  $ children6_11          : num [1:1428] NA 1 2 NA 0 NA NA 1 NA NA ...
##  $ children12_17         : num [1:1428] NA 1 0 NA 3 NA NA 1 NA NA ...
##  $ adults_in_household   : chr [1:1428] "2" "2" "2" "2" ...
##  $ educ2                 : chr [1:1428] "8" "6" "4" "6" ...
##  $ emplnw                : num [1:1428] 3 2 1 3 1 3 3 1 1 3 ...
##  $ race                  : chr [1:1428] "White" "White" "Mixed race" "White" ...
##  $ income                : num [1:1428] 5 9 8 99 3 7 2 99 5 4 ...
##  $ lgbt                  : chr [1:1428] "Straight" "Straight" "Straight" "Straight" ...
##  $ weight                : num [1:1428] 1.26 2.35 6.87 1.42 2.42 ...
##  $ standwt               : num [1:1428] 0.397 0.744 2.17 0.448 0.764 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   X = col_double(),
##   ..   state = col_double(),
##   ..   region = col_character(),
##   ..   usr = col_character(),
##   ..   sex = col_character(),
##   ..   life_quality = col_character(),
##   ..   use_internet = col_character(),
##   ..   use_email = col_character(),
##   ..   use_social_networking = col_character(),
##   ..   use_twitter = col_character(),
##   ..   use_reddit = col_character(),
##   ..   googled_own_name = col_character(),
##   ..   have_cell_phone = col_character(),
##   ..   have_tablet = col_character(),
##   ..   have_smart_phone = col_character(),
##   ..   marital_status = col_character(),
##   ..   in_relationship = col_character(),
##   ..   years_in_relationship = col_character(),
##   ..   looking_for_partner = col_logical(),
##   ..   met_partner_online = col_character(),
##   ..   searched_for_ex_online = col_character(),
##   ..   flirted_online = col_character(),
##   ..   used_dating_site = col_character(),
##   ..   age = col_double(),
##   ..   have_children = col_character(),
##   ..   children0_5 = col_double(),
##   ..   children6_11 = col_double(),
##   ..   children12_17 = col_double(),
##   ..   adults_in_household = col_character(),
##   ..   educ2 = col_character(),
##   ..   emplnw = col_double(),
##   ..   race = col_character(),
##   ..   income = col_double(),
##   ..   lgbt = col_character(),
##   ..   weight = col_double(),
##   ..   standwt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
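
The str() output above shows several numeric-looking columns (for example years_in_relationship and life_quality) stored as character. Before converting such a column, it can help to list the values that would not parse as numbers. The following is a minimal sketch on a toy data frame; the data values and the helper name non_numeric_values are made up for illustration and are not part of the exam.

```r
# Toy stand-in for a survey file read from csv (hypothetical values)
toy <- data.frame(
  years = c("57", "15", "Refused", "0", NA),
  age   = c(77, 38, 51, 62, 48),
  stringsAsFactors = FALSE
)

# Report the class of every column in one call
sapply(toy, class)

# Hypothetical helper: values that are present but cannot be parsed as numbers
non_numeric_values <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(x) & is.na(parsed)])
}

# Shows which labels need to be set to NA before calling as.numeric()
non_numeric_values(toy$years)
```

Running the helper on each character column that should be numeric shows which labels (such as "Refused") have to be recoded first.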

Question 1

The years_in_relationship variable measures how long a respondent has been in their current relationship. As you recode it, you may find that R converts each text string to the wrong number; for example, the string “0” may become 2 or some other number. This happens when the variable is stored as a factor: as.numeric() then returns each value's internal integer code rather than the number written in its label. If this happens, convert the variable to a character vector before converting it to a numeric vector, as in the following expression:

as.numeric(as.character(df$years_in_relationship))
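
The difference is easy to see on a small factor. This is only an illustrative sketch with made-up values, not the exam data:

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("0", "15", "57", "15"))

# Levels are sorted as text: "0", "15", "57"
levels(f)

# Converting the factor directly returns the integer codes of the levels
codes <- as.numeric(f)                 # 1 2 3 2

# Going through character first recovers the numbers written in the labels
values <- as.numeric(as.character(f))  # 0 15 57 15
```

If the column is already a character vector, as.character() changes nothing, so the two-step expression is safe in either case.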

You do not need to remove outliers in the years_in_relationship variable.

What is the mean of years_in_relationship in the sample?

# Check variable type
class(Dating$years_in_relationship)

## [1] "character"

# Replace "Refused" with NA
Dating$years_in_relationship[Dating$years_in_relationship == "Refused"] <- NA

# Count the NA values after recoding
sum(is.na(Dating$years_in_relationship))

## [1] 16

# Convert the variable to numeric
Dating$years_in_relationship <- as.numeric(as.character(Dating$years_in_relationship))

# Confirm numeric conversion
class(Dating$years_in_relationship)

## [1] "numeric"

# Calculate the mean, excluding NAs
mean_years <- mean(Dating$years_in_relationship, na.rm = TRUE)
print(mean_years)

## [1] 20.9313

Question 2

The life_quality variable measures one’s perception of quality of life for themselves and their family on a 5-point scale, where 1 = excellent and 5 = poor. As before, you may have to convert the variable to a character string before converting it to a numeric vector, as in the following expression:

as.numeric(as.character(df$life_quality))

Before continuing, make sure your life quality variable is numeric.

Next, create a new variable on a 5-point scale called good_life_quality where 1 = poor and 5 = excellent. We will treat this good_life_quality variable as a metric variable in our analyses. Make sure to set the “Don’t know” and “Refused” values to NA.

What is the mean of good_life_quality in the sample?

## [1] "character"

## [1] "2"          "3"          "4"          "5"          "1"         
## [6] "Refused"    "Don't know"

## [1] 6

## [1] "numeric"

## [1] 2.509142
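
The code behind the output above is not shown. One possible way to do the recode is sketched below on a toy vector (the values are made up): non-response labels are set to NA, the strings are converted to numbers, and the scale is reversed by subtracting from 6 so that 1 becomes 5 and 5 becomes 1.

```r
# Toy responses on the original scale, where 1 = excellent and 5 = poor
life_quality <- c("2", "3", "Refused", "1", "Don't know", "5")

# Set the non-response labels to NA before converting
life_quality[life_quality %in% c("Refused", "Don't know")] <- NA
life_quality_num <- as.numeric(life_quality)

# Reverse the 5-point scale: 1 -> 5, 2 -> 4, ..., 5 -> 1
good_life_quality <- 6 - life_quality_num

# Cross-tabulate old against new to confirm the reversal worked
table(original = life_quality_num, reversed = good_life_quality, useNA = "ifany")

# Mean of the reversed variable, ignoring missing values
mean(good_life_quality, na.rm = TRUE)
```

The cross-tabulation is a quick check: every original 1 should appear as a 5 in the new variable, and so on down the scale.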

Question 3

To run a nested regression in R, you will need to select the rows in your dataset that have no missing values in your final OLS model.

First, check for blank and/or invalid responses in good_life_quality, years_in_relationship, age, and use_twitter and set them to NA. Make sure to set “Refused” and “Don’t know” responses to NA.

Create a new dataframe, Dating_lim, that contains only rows with non-missing values for good_life_quality, years_in_relationship, age and use_twitter.

How many cases does this leave you with in the Dating_lim dataframe?

# check for blank and/or invalid responses in good_life_quality, years_in_relationship, age, and use_twitter and set them to NA

# set “Refused” and “Don’t know” responses to NA

# good_life_quality
life_quality_counts <- table(Dating$good_life_quality, useNA = "ifany")
print(life_quality_counts)

## 
##    1    2    3    4    5 <NA> 
##  292  402  493  182   53    6

# years_in_relationship
yrs_relationship_counts <- table(Dating$years_in_relationship, useNA = "ifany")
print(yrs_relationship_counts)

## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##   51   91   66   52   35   56   37   34   31   26   45   14   41   24   29   43 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##   15   18   28    9   54   15   15   27   24   43   18   15   13   11   39   11 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##   20   10   10   31   12   19   16    6   30    7    9   21   12   20   18   15 
##   48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
##   10    7   22    2   12    6    4    8   10   11    3    3    6    2    6    6 
##   65   66   67   86   97 <NA> 
##    4    1    1    1    1   16

# age
# age is numeric, so compare against numbers; 98 and 99 are treated here as non-response codes
Dating$age[Dating$age %in% c(98, 99)] <- NA
age_counts <- table(Dating$age, useNA = "ifany")
print(age_counts)

## 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##   12   10   12   13   14   19   18   14   17   26   19   22   19   12   19   19 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##   21   21   26   21   28   14   35   18   35   19   19   32   15   25   28   26 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   40   22   35   27   40   19   31   27   38   29   28   25   30   26   31   31 
##   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
##   28   31   14   19   21   14   10   14   11   18   11   10   13    6    9    8 
##   82   83   84   85   86   87   89   90   91   92   97 <NA> 
##    4    4    5    4    4    4    1    2    1    3    2   30

# use_twitter
Dating$use_twitter[Dating$use_twitter == "Refused"] <- NA
Dating$use_twitter[Dating$use_twitter == "Don't know"] <- NA
use_twitter_counts <- table(Dating$use_twitter, useNA = "ifany")
print(use_twitter_counts)

## 
##   No  Yes <NA> 
## 1081  186  161

# Create Dating_lim dataframe with non-missing values for all specified columns
Dating_lim <- Dating[!is.na(Dating$good_life_quality) & 
                     !is.na(Dating$years_in_relationship) & 
                     !is.na(Dating$age) & 
                     !is.na(Dating$use_twitter), ]

# Count the number of cases in Dating_lim
num_cases <- nrow(Dating_lim)
print(paste("Number of cases in Dating_lim:", num_cases))

## [1] "Number of cases in Dating_lim: 1225"
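
A more compact way to write the same filter is complete.cases() restricted to the model variables. The sketch below uses a toy data frame with made-up values; the extra column named other is hypothetical and only shows that missing values outside the model variables do not drop a row.

```r
# Toy data with missing values scattered across columns (made-up values)
toy <- data.frame(
  good_life_quality     = c(4, NA, 3, 5, 2),
  years_in_relationship = c(10, 2, NA, 30, 7),
  age                   = c(40, 25, 33, 61, 52),
  use_twitter           = c("No", "Yes", "No", NA, "Yes"),
  other                 = c(NA, 1, 2, 3, NA)
)

# Only these columns decide whether a row is kept
model_vars <- c("good_life_quality", "years_in_relationship", "age", "use_twitter")

# Keep rows with no missing values in the model variables
toy_lim <- toy[complete.cases(toy[, model_vars]), ]
nrow(toy_lim)
```

Listing the variables once in model_vars keeps the filter and the later model formulas easy to compare.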

Question 3.1

Model 1: Fit an OLS model to Dating_lim (the data frame you made in the previous part of your exam) that predicts good_life_quality (dependent variable) as a function of years_in_relationship (independent variable).

library(car)

## Loading required package: carData

# Estimate the model and save the results in object "model1"
model1 <- lm(good_life_quality ~ years_in_relationship, data = Dating_lim)

# View common diagnostics
plot(model1)

# View summary 
summary(model1)

## 
## Call:
## lm(formula = good_life_quality ~ years_in_relationship, data = Dating_lim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4848 -0.4832 -0.4019  0.5598  2.6061 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.484772   0.047404  52.417   <2e-16 ***
## years_in_relationship -0.001594   0.001869  -0.852    0.394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.069 on 1223 degrees of freedom
## Multiple R-squared:  0.0005938,  Adjusted R-squared:  -0.0002233 
## F-statistic: 0.7267 on 1 and 1223 DF,  p-value: 0.3941

Question 3.2

Model 2: Now fit a second OLS model to Dating_lim. Keep good_life_quality as your dependent variable, but now use both years_in_relationship and age as your explanatory variables.

# Estimate the model and save the results in object "model2"
model2 <- lm(good_life_quality ~ years_in_relationship + age, data = Dating_lim)

# view common diags
plot(model2)

# view summary
summary(model2)

## 
## Call:
## lm(formula = good_life_quality ~ years_in_relationship + age, 
##     data = Dating_lim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8501 -0.5815 -0.3414  0.6059  2.6625 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.169788   0.113163  19.174  < 2e-16 ***
## years_in_relationship -0.008671   0.002968  -2.922  0.00355 ** 
## age                    0.009319   0.003042   3.063  0.00224 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.066 on 1222 degrees of freedom
## Multiple R-squared:  0.008209,   Adjusted R-squared:  0.006586 
## F-statistic: 5.057 on 2 and 1222 DF,  p-value: 0.006496

Question 3.3

Model 3: Time to fit a third OLS model to Dating_lim. Keeping good_life_quality as your dependent variable, and years_in_relationship and age as your explanatory variables, add use_twitter as your third explanatory variable.

# Estimate the model and save the results in object "model3"
model3 <- lm(good_life_quality ~ years_in_relationship + age + use_twitter, data = Dating_lim)

# view common diags
plot(model3)

# view summary
summary(model3)

## 
## Call:
## lm(formula = good_life_quality ~ years_in_relationship + age + 
##     use_twitter, data = Dating_lim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8719 -0.6115 -0.1281  0.5776  2.8557 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.265488   0.115916  19.544  < 2e-16 ***
## years_in_relationship -0.008793   0.002955  -2.976 0.002980 ** 
## age                    0.008306   0.003042   2.730 0.006417 ** 
## use_twitterYes        -0.305947   0.087414  -3.500 0.000482 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.061 on 1221 degrees of freedom
## Multiple R-squared:  0.01806,    Adjusted R-squared:  0.01565 
## F-statistic: 7.486 on 3 and 1221 DF,  p-value: 5.751e-05
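
The coefficient row labelled use_twitterYes appears because lm() turns a character predictor into a factor and uses the alphabetically first level ("No") as the reference category. The sketch below illustrates this on made-up toy data and shows how relevel() changes the reference; the variable names y, tw and tw_f are hypothetical.

```r
# Toy outcome and a Yes/No character predictor (made-up values)
toy <- data.frame(
  y  = c(3, 4, 2, 5, 1, 4, 3, 2),
  tw = c("No", "Yes", "No", "No", "Yes", "No", "Yes", "No")
)

# Character predictor: "No" is the reference, so the coefficient is for "Yes"
fit_chr <- lm(y ~ tw, data = toy)
names(coef(fit_chr))

# Make "Yes" the reference instead; the coefficient is now for "No"
toy$tw_f <- relevel(factor(toy$tw), ref = "Yes")
fit_ref <- lm(y ~ tw_f, data = toy)
names(coef(fit_ref))
```

Changing the reference level flips the sign of the coefficient but leaves the fitted values and the model fit unchanged.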

Question 4

Compute the F-statistics and associated p-values comparing your three nested regression models. Examine the R-squared and AIC values for each of the three models.

# f-stats and associated p-values for all 3 models
anova(model1, model2, model3)

## Analysis of Variance Table
## 
## Model 1: good_life_quality ~ years_in_relationship
## Model 2: good_life_quality ~ years_in_relationship + age
## Model 3: good_life_quality ~ years_in_relationship + age + use_twitter
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1   1223 1398.8                                   
## 2   1222 1388.2  1    10.659  9.4694 0.0021358 ** 
## 3   1221 1374.4  1    13.789 12.2498 0.0004821 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
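
Each F value in the table above is the sum of squares added by the new term, divided by its degrees of freedom, over the residual mean square of the largest model (model 3). The sketch below redoes that arithmetic from the rounded numbers printed in the table, so the results agree with the table only up to rounding.

```r
# Values copied from the printed anova table (rounded as printed)
ss_added <- c(age = 10.659, use_twitter = 13.789)  # "Sum of Sq" column
df_added <- 1                                      # one term added at each step
rss_full <- 1374.4                                 # RSS of model 3
df_full  <- 1221                                   # residual df of model 3

# Residual mean square of the largest model is the common denominator
ms_resid <- rss_full / df_full

# F statistic for each added term and its upper-tail p-value
f_stat <- (ss_added / df_added) / ms_resid
p_val  <- pf(f_stat, df1 = df_added, df2 = df_full, lower.tail = FALSE)

round(f_stat, 2)
signif(p_val, 3)
```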

# AIC values
AIC(model1, model2, model3)

##        df      AIC
## model1  3 3644.936
## model2  4 3637.566
## model3  5 3627.337
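
To examine R-squared alongside AIC, the fit statistics can be collected into one table. The sketch below uses R's built-in mtcars data and three nested models chosen only for illustration; with the exam data, the list would hold model1, model2 and model3 instead.

```r
# Three nested models on built-in data (illustrative stand-ins)
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
m3 <- lm(mpg ~ wt + hp + am, data = mtcars)
models <- list(model1 = m1, model2 = m2, model3 = m3)

# One row per model: R-squared, adjusted R-squared, and AIC
fit_table <- data.frame(
  r_squared     = sapply(models, function(m) summary(m)$r.squared),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  aic           = sapply(models, AIC)
)
fit_table
```

For nested OLS models, R-squared never decreases as terms are added, so adjusted R-squared and AIC are usually the more informative columns when comparing models.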

Question 4.1

ncvTest(model3)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.894009, Df = 1, p = 0.02695

spreadLevelPlot(model3)

## 
## Suggested power transformation:  6.792875e-05