Brian Surratt

brian.surratt@my.utsa.edu

Here is my screenshot

Including libraries

library(readxl)
library(car)

## Loading required package: carData

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(ggplot2)
library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library(ggpubr)
library(ggrepel)

Reading in the Excel file

Junkins <- read_excel("/Users/briansurratt/Library/CloudStorage/OneDrive-UniversityofTexasatSanAntonio/DEM 7283 Stats II/Homework 1/Junkins Data.xlsx")

Creation of variables

Junkins$south<- recode(Junkins$region, "3=1; else=0")
tabyl(Junkins$south)

##  Junkins$south  n percent
##              0 34    0.68
##              1 16    0.32

Junkins$northeast<- recode(Junkins$region, "1=1; else=0")
tabyl(Junkins$northeast)

##  Junkins$northeast  n percent
##                  0 41    0.82
##                  1  9    0.18

Junkins$midwest<- recode(Junkins$region, "2=1; else=0")
tabyl(Junkins$midwest)

##  Junkins$midwest  n percent
##                0 38    0.76
##                1 12    0.24

Junkins$west<- recode(Junkins$region, "4=1; else=0")
tabyl(Junkins$west)

##  Junkins$west  n percent
##             0 37    0.74
##             1 13    0.26

Junkins$relconssq<-Junkins$relcons*Junkins$relcons

Junkins$relconsln<-log(Junkins$relcons)

Junkins$relconsrec<-1/(Junkins$relcons)

geographic variation in age at marriage

model <-lm(t_ageFM~northeast + midwest + west, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ northeast + midwest + west, data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.43846 -0.55625  0.06563  0.75677  2.23750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.20625    0.26624 102.188   <2e-16 ***
## northeast    1.03264    0.44373   2.327   0.0244 *  
## midwest      0.00625    0.40669   0.015   0.9878    
## west        -0.31779    0.39765  -0.799   0.4283    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.065 on 46 degrees of freedom
## Multiple R-squared:  0.1657, Adjusted R-squared:  0.1113 
## F-statistic: 3.045 on 3 and 46 DF,  p-value: 0.03805

Interpretation: The independent variables are dummy variables for region of residence in the United States (northeast, midwest, and west), with the south as the omitted reference category. A multiple linear regression model is estimated in R using age at first marriage as the dependent variable. The coefficient for the northeast is statistically significant (p<0.05): relative to the south, age at first marriage in the northeast is 1.03 years higher on average. The midwest and west coefficients are not statistically significant, so those regions cannot be distinguished from the south.
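To check the reference-category reading, here is a minimal sketch on simulated data. These are not the Junkins data: the group sizes mirror the tabulations above, but the ages and variable names are made up for illustration. It shows that the intercept equals the mean of the omitted group and each dummy coefficient equals that group's mean minus the omitted group's mean.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(1)
region <- rep(1:4, times = c(9, 12, 16, 13))  # 1 = NE, 2 = MW, 3 = South, 4 = West
age    <- 27 + c(1, 0, 0, -0.3)[region] + rnorm(50)

# Dummy variables, leaving the south (region 3) out as the reference
northeast <- as.integer(region == 1)
midwest   <- as.integer(region == 2)
west      <- as.integer(region == 4)

fit <- lm(age ~ northeast + midwest + west)

south_mean <- mean(age[region == 3])
ne_mean    <- mean(age[region == 1])

# The intercept is the reference-group mean
c(intercept = unname(coef(fit)["(Intercept)"]), south_mean = south_mean)

# The northeast coefficient is the gap between the northeast and south means
c(northeast = unname(coef(fit)["northeast"]), gap = ne_mean - south_mean)
```

Each pair of numbers printed should agree, which is why the 1.03 above is read as a difference from the south rather than from all other regions.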

model <-lm(t_ageFM~northeast, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ northeast, data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65732 -0.50271  0.01768  0.74268  2.34268 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.1073     0.1642 165.054  < 2e-16 ***
## northeast     1.1316     0.3871   2.923  0.00527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.052 on 48 degrees of freedom
## Multiple R-squared:  0.1511, Adjusted R-squared:  0.1334 
## F-statistic: 8.545 on 1 and 48 DF,  p-value: 0.005272

Interpretation: The independent variable is residence in the northeast region of the United States. A simple linear regression model is estimated in R using age at first marriage as the dependent variable. The association between the northeast region and age at first marriage is statistically significant (p<0.01). Here the comparison group is all other regions combined, and age at first marriage in the northeast is 1.13 years higher on average. The adjusted R-squared shows the proportion of the variance that is explained by residing in the northeast is 13.34%.

Questions:

Why are the coefficient and p value different for northeast between the multiple versus simple regression? When interpreting an IV, which coefficient (simple regression or multiple regression) is the better description of the relationship between an IV and the DV?
What is the difference in the R squared for the simple regression model and the multiple regression model? Which R squared is more meaningful in this case, the one from the simple regression or the multiple regression?
Does the summary table for a simple regression give the correlation between the IV and DV? (I guess not, since this is determined in the next r chunk.)
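A possible way to explore the first and third questions, sketched on simulated data (not the Junkins data; the numbers and names are invented). With a single dummy, the slope is the northeast mean minus the mean of all other states, so the comparison group differs from the three-dummy model. The summary table does not print the correlation, but in a simple regression R-squared is the squared Pearson correlation, so for the northeast model above the correlation would be about sqrt(0.1511), or roughly 0.39.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(2)
northeast <- rep(c(1, 0), times = c(9, 41))
age       <- 27 + 1.1 * northeast + rnorm(50)

fit <- lm(age ~ northeast)
ct  <- cor.test(northeast, age)

# With one dummy, the slope is the northeast mean minus the mean of all other states
coef(fit)["northeast"]
mean(age[northeast == 1]) - mean(age[northeast == 0])

# R-squared from the simple regression is the squared Pearson correlation
summary(fit)$r.squared
unname(ct$estimate)^2

# The slope's p-value matches the p-value from the correlation test
summary(fit)$coefficients["northeast", "Pr(>|t|)"]
ct$p.value
```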

scatterplot and correlation of DV with IV

plot(Junkins$relcons,Junkins$t_ageFM)

cor.test(Junkins$relcons,Junkins$t_ageFM)

## 
##  Pearson's product-moment correlation
## 
## data:  Junkins$relcons and Junkins$t_ageFM
## t = -4.8748, df = 48, p-value = 1.233e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7359186 -0.3537616
## sample estimates:
##        cor 
## -0.5754459

Interpretation: The Pearson’s correlation between percent very religious and age at first marriage is negative and moderate to strong (r = -0.575, p<0.001). As percent very religious increases, age at first marriage decreases.

tests of different specifications of religious concentration

model <-lm(t_ageFM~relcons, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ relcons, data = Junkins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8183 -0.4097  0.0782  0.5650  1.8037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.27091    0.23707 119.252  < 2e-16 ***
## relcons     -0.04911    0.01007  -4.875 1.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9335 on 48 degrees of freedom
## Multiple R-squared:  0.3311, Adjusted R-squared:  0.3172 
## F-statistic: 23.76 on 1 and 48 DF,  p-value: 1.233e-05

Interpretation: The independent variable is percent very religious and the dependent variable is age at first marriage. A simple linear regression model is estimated in R. The association between percent very religious and age at first marriage is statistically significant (p<0.001): each additional percentage point very religious is associated with an age at first marriage that is 0.049 years lower. The adjusted R-squared shows the proportion of the variance that is explained by percent very religious is 31.72%.

model <-lm(t_ageFM~relcons + relconssq, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ relcons + relconssq, data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.81884 -0.41111  0.08564  0.56738  1.80413 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.824e+01  3.623e-01  77.952   <2e-16 ***
## relcons     -4.607e-02  2.878e-02  -1.601    0.116    
## relconssq   -5.175e-05  4.593e-04  -0.113    0.911    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9432 on 47 degrees of freedom
## Multiple R-squared:  0.3313, Adjusted R-squared:  0.3029 
## F-statistic: 11.64 on 2 and 47 DF,  p-value: 7.81e-05

Interpretation: The independent variables are percent very religious and percent very religious squared. The dependent variable is age at first marriage. A multiple linear regression model is estimated in R. In this model, neither coefficient is statistically significant, and the R-squared (0.3313) is essentially unchanged from the linear model (0.3311), so the squared term adds nothing.

Questions:

Why was neither IV statistically significant? In the prior model, percent very religious was shown to be statistically significant, so why isn’t it in the multiple regression?
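One likely explanation is collinearity: a variable and its square are very highly correlated, so the model cannot separate their effects and the standard errors inflate (above, the standard error on relcons rose from 0.01007 to 0.02878). The sketch below uses simulated data, not the Junkins data, and the variable names are hypothetical. It also shows an F test for whether the squared term adds anything, and mean-centering as one common way to reduce the collinearity.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(3)
x <- runif(50, min = 5, max = 45)   # hypothetical percent very religious
y <- 28 - 0.05 * x + rnorm(50)      # outcome built to be linear in x

cor(x, x^2)  # x and its square are almost collinear

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

# The standard error on x is much larger once x^2 is in the model
summary(fit_lin)$coefficients
summary(fit_quad)$coefficients

# F test of whether the squared term improves on the linear model
anova(fit_lin, fit_quad)

# Centering x before squaring lowers the correlation between the two terms
xc <- x - mean(x)
cor(xc, xc^2)
fit_cent <- lm(y ~ xc + I(xc^2))
summary(fit_cent)$coefficients
```

The centered and uncentered quadratic models fit the data equally well (same R-squared); centering only changes how the coefficients are expressed.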

model <-lm(t_ageFM~relconsln, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ relconsln, data = Junkins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6441 -0.5908  0.0393  0.6119  1.9690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.6203     0.5419  54.655  < 2e-16 ***
## relconsln    -0.8412     0.1911  -4.402 5.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9633 on 48 degrees of freedom
## Multiple R-squared:  0.2876, Adjusted R-squared:  0.2728 
## F-statistic: 19.38 on 1 and 48 DF,  p-value: 5.958e-05

Interpretation: The independent variable is the natural logarithm of percent very religious and the dependent variable is age at first marriage. A simple linear regression model is estimated in R. The association between the log of percent very religious and age at first marriage is statistically significant (p<0.001). The adjusted R-squared shows the proportion of the variance that is explained by the log of percent very religious is 27.28%.

Questions:

I do not understand R squared in the context of the logarithm of an IV.
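As far as I can tell, R-squared means the same thing here as in any other model: it is the share of variance in the DV (which is not transformed) explained by the fitted values, so it can be compared directly with the linear model's R-squared. The sketch below, on simulated data rather than the Junkins data, computes it by hand. It also shows the usual reading of a log coefficient; applied to the estimate above, a 1% increase in percent very religious would correspond to roughly -0.8412 * log(1.01), or about -0.008 years.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(4)
x <- runif(50, min = 5, max = 45)
y <- 30 - 0.8 * log(x) + rnorm(50)

fit <- lm(y ~ log(x))

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

r2_manual
summary(fit)$r.squared
cor(y, fitted(fit))^2   # same number again

# Predicted change in y for a 1% increase in x
unname(coef(fit)["log(x)"]) * log(1.01)
```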

model <-lm(t_ageFM~relconsrec, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ relconsrec, data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44794 -0.49354 -0.09158  0.80604  2.17950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.7445     0.2233 119.789  < 2e-16 ***
## relconsrec    6.6906     2.0013   3.343  0.00161 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 48 degrees of freedom
## Multiple R-squared:  0.1889, Adjusted R-squared:  0.172 
## F-statistic: 11.18 on 1 and 48 DF,  p-value: 0.001612

Interpretation: The independent variable is the reciprocal of percent very religious and the dependent variable is age at first marriage. A simple linear regression model is estimated in R. The association between the reciprocal of percent very religious and age at first marriage is statistically significant (p<0.01). The coefficient is positive because the reciprocal gets smaller as percent very religious gets larger. The adjusted R-squared shows the proportion of the variance that is explained by the reciprocal of percent very religious is 17.2%.

Questions:

I do not understand R squared in the context of the reciprocal of an IV.
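Since the DV is the same in every specification, one way to use R-squared here is simply to compare it across the linear, log, and reciprocal models and see which shape fits best. A minimal sketch on simulated data (not the Junkins data; the names are hypothetical) follows.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(5)
x <- runif(50, min = 5, max = 45)
y <- 28 - 0.05 * x + rnorm(50, sd = 0.5)  # outcome built to be linear in x

fits <- list(
  linear     = lm(y ~ x),
  log        = lm(y ~ log(x)),
  reciprocal = lm(y ~ I(1 / x))
)

# Adjusted R-squared for each specification, side by side
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
adj_r2

# Slopes: the reciprocal slope has the opposite sign because 1/x falls as x rises
slopes <- sapply(fits, function(f) unname(coef(f)[2]))
slopes
```

In the models above, the linear specification has the highest adjusted R-squared (0.3172, versus 0.2728 for the log and 0.172 for the reciprocal).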

test of mediation

model <-lm(t_ageFM~northeast + relcons, Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ northeast + relcons, data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72217 -0.50792 -0.02583  0.54625  1.90289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.10361    0.30729  91.457  < 2e-16 ***
## northeast    0.34786    0.40488   0.859 0.394607    
## relcons     -0.04375    0.01187  -3.686 0.000589 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.936 on 47 degrees of freedom
## Multiple R-squared:  0.3415, Adjusted R-squared:  0.3135 
## F-statistic: 12.19 on 2 and 47 DF,  p-value: 5.45e-05

Interpretation: The independent variables are residence in the northeast and percent very religious. The dependent variable is age at first marriage. A multiple linear regression model is estimated in R. The coefficient for percent very religious is statistically significant (p<0.001), but the coefficient for residence in the northeast is not. Holding northeast residence constant, a one-percentage-point increase in percent very religious is associated with a decline in age at first marriage of 0.044 years. The northeast coefficient drops from 1.13 in the simple regression to 0.35 here, which is consistent with percent very religious mediating the regional difference.

Questions:

Again, I don’t understand the coefficient for IVs in this context (multiple regression) compared to simple regression.
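One way to think about a multiple regression coefficient is "the effect of this IV after the other IV has been removed from both it and the DV." The sketch below uses simulated data (not the Junkins data; names and effect sizes are invented) in which the northeast differs only because it is less religious. It shows the northeast coefficient shrinking once religiosity is controlled, and then reproduces the multiple regression coefficient from residuals.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(6)
northeast <- rep(c(1, 0), times = c(9, 41))
relig     <- 25 - 12 * northeast + rnorm(50, sd = 6)  # northeast is less religious
age       <- 28 - 0.05 * relig + rnorm(50, sd = 0.9)  # age depends on relig only

# Simple regression: northeast picks up the effect of being less religious
coef(lm(age ~ northeast))["northeast"]

# Multiple regression: northeast coefficient with relig held constant
fit_both <- lm(age ~ northeast + relig)
coef(fit_both)["northeast"]

# Same coefficient from residuals: strip relig out of both age and northeast
age_resid <- residuals(lm(age ~ relig))
ne_resid  <- residuals(lm(northeast ~ relig))
fit_resid <- lm(age_resid ~ ne_resid)
coef(fit_resid)["ne_resid"]
```

The last two coefficients should be identical, which is what "controlling for" the other IV amounts to.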

test of moderation

model <-lm(t_ageFM~northeast + relcons + (northeast*relcons), Junkins)

summary(model)

## 
## Call:
## lm(formula = t_ageFM ~ northeast + relcons + (northeast * relcons), 
##     data = Junkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72617 -0.50502 -0.02841  0.57561  1.89866 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       28.11320    0.30988  90.724  < 2e-16 ***
## northeast         -0.23321    1.06750  -0.218 0.828031    
## relcons           -0.04417    0.01197  -3.689 0.000594 ***
## northeast:relcons  0.11804    0.20041   0.589 0.558754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9426 on 46 degrees of freedom
## Multiple R-squared:  0.3464, Adjusted R-squared:  0.3038 
## F-statistic: 8.127 on 3 and 46 DF,  p-value: 0.0001898

Interpretation: The independent variables are residence in the northeast, percent very religious, and residence in the northeast multiplied by percent very religious. The dependent variable is age at first marriage. A multiple linear regression model is estimated in R. The coefficient for percent very religious is statistically significant (p<0.001), but the coefficients for residence in the northeast and for the product of the two IVs are not statistically significant. Outside the northeast, a one-percentage-point increase in percent very religious is associated with a decline in age at first marriage of 0.044 years. Because the interaction is not significant, there is no evidence that this relationship differs in the northeast.

Questions:

I don’t understand the reasoning behind multiplying the two IVs to create a new IV or how to interpret the results.
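My current understanding is that the product term lets the slope of one IV differ across levels of the other. With a dummy, the coefficient on percent very religious is the slope for non-northeast states, and that coefficient plus the interaction coefficient is the slope for northeast states (above, -0.044 + 0.118, or about 0.074, though the interaction is far from significant). The northeast coefficient itself becomes the regional gap when percent very religious is zero. The sketch below checks this on simulated data (not the Junkins data; names and effect sizes are invented) by comparing against separate regressions for each group.

```r
# Simulated illustration only: these are NOT the Junkins data.
set.seed(7)
northeast <- rep(c(1, 0), times = c(9, 41))
relig     <- runif(50, min = 5, max = 45)
age       <- 28 - 0.05 * relig + 0.5 * northeast +
             0.03 * northeast * relig + rnorm(50, sd = 0.9)

# northeast * relig expands to northeast + relig + northeast:relig
fit <- lm(age ~ northeast * relig)
b   <- coef(fit)

slope_other <- unname(b["relig"])                         # slope when northeast = 0
slope_ne    <- unname(b["relig"] + b["northeast:relig"])  # slope when northeast = 1

# Separate regressions within each group give the same two slopes
sep_other <- unname(coef(lm(age ~ relig, subset = northeast == 0))["relig"])
sep_ne    <- unname(coef(lm(age ~ relig, subset = northeast == 1))["relig"])

c(slope_other = slope_other, separate = sep_other)
c(slope_ne = slope_ne, separate = sep_ne)
```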

creating a nice summary table

Model.1 <- lm(t_ageFM~northeast + relcons, Junkins)
Model.2 <- lm (t_ageFM~northeast + relcons + (northeast*relcons), Junkins)


# https://ademos.people.uic.edu/Chapter13.html

stargazer(Model.1, Model.2,type="text", 
column.labels = c("Main Effects", "Interaction"), 
intercept.bottom = FALSE, 
single.row=FALSE,     
notes.append = FALSE, 
header=FALSE)

## 
## ================================================================
##                                 Dependent variable:             
##                     --------------------------------------------
##                                       t_ageFM                   
##                          Main Effects           Interaction     
##                              (1)                    (2)         
## ----------------------------------------------------------------
## Constant                  28.104***              28.113***      
##                            (0.307)                (0.310)       
##                                                                 
## northeast                   0.348                 -0.233        
##                            (0.405)                (1.067)       
##                                                                 
## relcons                   -0.044***              -0.044***      
##                            (0.012)                (0.012)       
##                                                                 
## northeast:relcons                                  0.118        
##                                                   (0.200)       
##                                                                 
## ----------------------------------------------------------------
## Observations                  50                    50          
## R2                          0.341                  0.346        
## Adjusted R2                 0.313                  0.304        
## Residual Std. Error    0.936 (df = 47)        0.943 (df = 46)   
## F Statistic         12.186*** (df = 2; 47) 8.127*** (df = 3; 46)
## ================================================================
## Note:                                *p<0.1; **p<0.05; ***p<0.01

Homework 1

2023-01-30
