Hwk 7 - Assumptions

THIS CODE IS IDENTICAL TO THE LAST PROJECT

The code below loads the dataset “CPSWaitingForAdoptionFY2014_2023.csv” as an object called “waitadopt”, and deletes the original file name.

library(readr)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

CPSWaitingForAdoptionFY2014_2023 <- read_csv("CPSWaitingForAdoptionFY2014-2023.csv")

## Rows: 12158 Columns: 7

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Region, Gender, Race/Ethnicity, Age Group
## dbl (3): Fiscal Year, Chidlren Waiting on Adoption 31 August, Average Months...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

waitadopt <- CPSWaitingForAdoptionFY2014_2023

rm(CPSWaitingForAdoptionFY2014_2023)

Independent variable “Gender” includes an “unknown” observation. This chunk changes “Unknown” to “NA”:

waitadopt$Gender[waitadopt$Gender == 'Unknown'] <- NA

This chunk recodes “female” as “1”, and “male” as “0” onto a new variable called “Gendervar”.

waitadopt <- waitadopt %>% mutate(Gendervar=ifelse(Gender=='Female',1,0))
waitadopt

## # A tibble: 12,158 × 8
##    `Fiscal Year` Region    Gender `Race/Ethnicity` `Age Group`         
##            <dbl> <chr>     <chr>  <chr>            <chr>               
##  1          2023 1-Lubbock Female African American Birth to 5 Years Old
##  2          2023 1-Lubbock Female African American 6-12 Years Old      
##  3          2023 1-Lubbock Female African American 13-17 Years Old     
##  4          2023 1-Lubbock Female African American 13-17 Years Old     
##  5          2023 1-Lubbock Female African American Birth to 5 Years Old
##  6          2023 1-Lubbock Female African American Birth to 5 Years Old
##  7          2023 1-Lubbock Female African American Birth to 5 Years Old
##  8          2023 1-Lubbock Female African American 6-12 Years Old      
##  9          2023 1-Lubbock Female African American 6-12 Years Old      
## 10          2023 1-Lubbock Female African American 6-12 Years Old      
## # ℹ 12,148 more rows
## # ℹ 3 more variables: `Chidlren Waiting on Adoption 31 August` <dbl>,
## #   `Average Months since Termination of Parental Rights` <dbl>,
## #   Gendervar <dbl>

The second independent variable is age group, and does not need to be recoded to work with R.

The dependent variable is the average number of months since termination of parental rights. This chunk runs a model to examine the relationship between gender, age group, and the aforementioned dependent variable.

waitadopt_model <- lm(`Average Months since Termination of Parental Rights`~`Age Group`+Gender, data = waitadopt)
summary(waitadopt_model)

## 
## Call:
## lm(formula = `Average Months since Termination of Parental Rights` ~ 
##     `Age Group` + Gender, data = waitadopt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.390  -7.565  -1.845   4.636 141.540 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      35.5530     0.2976  119.48   <2e-16 ***
## `Age Group`6-12 Years Old       -19.8078     0.3468  -57.12   <2e-16 ***
## `Age Group`Birth to 5 Years Old -29.9454     0.3540  -84.59   <2e-16 ***
## GenderMale                        3.9366     0.2779   14.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.32 on 12153 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3811, Adjusted R-squared:  0.3809 
## F-statistic:  2494 on 3 and 12153 DF,  p-value: < 2.2e-16

ASSUMPTIONS FOR LINEAR REGRESSION

LINEARITY:

Plot:

plot(waitadopt_model,which=1)

As mentioned in the last project, the results are hard to interpret because observations fall on numbers that do not seem to represent either months, gender, or age, but the red curve seems to be close to the dotted line that represents perfect linearity.

Rainbow test: The chunk below runs a Rainbow test to determine if the “model is consistent across the entire dataset or if it needs to be segmented into different parts to better fit distinct time periods or condition” (Chatgpt).

raintest(waitadopt_model)

## 
##  Rainbow test
## 
## data:  waitadopt_model
## Rain = 0.91449, df1 = 6079, df2 = 6074, p-value = 0.9998

The p-value is extremely high, and most definitely not significant, meaning that “the model is stable across the dataset, and no structural break is detected” (Chatgpt).

INDEPENDENCE OF ERRORS: The Durbin-Watson test is used to determine independence of errors to “detect the presence of autocorrelation,” which “occurs when residuals (errors) in a regression model are correlated with each other” (Chatgpt). The chunk below uses the “car” packages to run a Durbin-Watson test on the current model:

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

durbinWatsonTest(waitadopt_model)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1371266      1.725736       0
##  Alternative hypothesis: rho != 0

According to the p-value of zero (0), there is strong evidence of correlation of errors. For that reason, instead of using a linear model, a “robust” linear model will be created using “RLM” from the “MASS” package, where “independence of errors is not as important”:

waitadopt_robustmodel<-rlm(`Average Months since Termination of Parental Rights`~waitadopt$`Age Group`+Gendervar,data=waitadopt)

summary(waitadopt_robustmodel)

## 
## Call: rlm(formula = `Average Months since Termination of Parental Rights` ~ 
##     waitadopt$`Age Group` + Gendervar, data = waitadopt)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -35.1841  -5.6973  -0.7309   6.0745 145.7459 
## 
## Coefficients:
##                                           Value     Std. Error t value  
## (Intercept)                                 35.2841    0.2006   175.9299
## waitadopt$`Age Group`6-12 Years Old        -18.4313    0.2341   -78.7328
## waitadopt$`Age Group`Birth to 5 Years Old  -26.5532    0.2390  -111.1151
## Gendervar                                   -2.7455    0.1876   -14.6354
## 
## Residual standard error: 8.629 on 12153 degrees of freedom
##   (1 observation deleted due to missingness)

As mentioned during lecture, “robust linear models do not report p-values or r-squared”. In this sense, it is hard to conceptualize what the “robust” model is doing.

HOMOSCEDASTICITY: Per Chatgpt, homoscedasticity is “a condition where the variance of the errors (residuals) is constant across all levels of the independent variable(s). When this assumption holds, the residuals exhibit a consistent spread around the regression line, regardless of the values of the predictors”.

Plot using “Scale-Location”:

plot(waitadopt_model,which=3)

Using the Breusch-Pagan Test in the “lmtest” package homoscedasticity is tested mathematically:

bptest(waitadopt_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  waitadopt_model
## BP = 836.68, df = 3, p-value < 2.2e-16

The p-value in this test is very significant, meaning that the model is homoscedastic, and this condition is met.

NORMALITY OF RESIDUALS: The chunk below runs a Q-Q plot to “test for”make sure that the residuals are normally distributed”:

plot(waitadopt_model,which=2)

The curve seems to deviate significantly from the curve, and “for residuals to be normally distributed, all of the observations would need to lie along the dotted line.” A shapiro-wilk test is used below to “check for normality of residuals” with the code:

shapiro.test(waitadopt_model:residuals), which yields the following message: Error in shapiro.test(waitadopt_model:residuals) : sample size must be between 3 and 5000

Since the Shapiro test can only be used with a sample size only up to 5000, the model is be transformed using the square root function, and a Q-Q plot is called using the log-transformed model:

waitadopt_model_sqrt <- lm(sqrt(`Average Months since Termination of Parental Rights`)~`Age Group`+Gendervar, data = waitadopt)
plot(waitadopt_model_sqrt, which = 2)

The square root transformation results in a curve that more closely follows the “normality” dotted line. However, it is still far from perfect, mostly because it deviates significantly towards the low and high values.

MULTICOLINARITY The chunk below runs a “VIF” function in the car package to the model:

library(car)
vif(waitadopt_model)

##                 GVIF Df GVIF^(1/(2*Df))
## `Age Group` 1.000059  2        1.000015
## Gender      1.000059  1        1.000030

Since “a VIF over 10 means that the variable is strongly correlated with some other variable,” the resulting values of “1” indicate low multicolinearity in the model.

INTERPRETATION

This model is very decent at following the assumptions of linear regression. First, linearity is hard to discern from the plot, but the high p-value indicates that there is high linearity in the dataset. Second, independence of errors as suggested by the Durbin-Watson statistic is promising, with a DW value of 1.72. However, the p-value of “0” suggests strong evidence of correlation of errors. Still, using a “robust” linear model does not result in a different p-value. In terms of Homoscedasticity, it is clear from both the plot and the Breusch-Pagan Test, that there is consistency of errors or residuals. It is hard to tell with precision whether there is normality of residuals from looking at the plots, and the Shapiro test cannot be used for large datasets. Lastly, there is little evidence of multicolinearity, as evidenced by the low factors in the VIF analysis.

There seems to be reasonable evidence to suggests that the model meets most of the criteria for linear regression assumptions. However, most concerning is independence of errors, given that both the regular and the robust linear models returned a p-value of “0”, which indicates a strong correlation of errors, and therefore, it is possible that this may render the model incompatible with linear regression.

Hwk 7 - Assumptions

Lester Velazquez-Po

2024-11-05

THIS CODE IS IDENTICAL TO THE LAST PROJECT

ASSUMPTIONS FOR LINEAR REGRESSION