DACSS 603 Homework 4

Homework 4 for DACSS 603

Molly Hackbarth
04-10-2022

PART 1

(Data file: house.selling.price.2 from smss R package) For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

Question 1

(SMSS 14.3, 14.4, merged & modified) Price

#create matrix
tab <- matrix(c(1, 0.899, 0.590, 0.714,  0.357,
                0.899, 1, 0.669,  0.662, 0.176,
                0.590, 0.699,  1, 0.334, 0.267,
                0.714, 0.662,  0.334, 1, 0.182,
                0.357, 0.176, 0.267, 0.182, 1), ncol=5, nrow = 5)

#define column names and row names of matrix
colnames(tab) <- c('Price', 'Size', 'Bed', 'Bath', 'New')
rownames(tab) <- c('Price', 'Size', 'Bed', 'Bath', 'New')

#convert matrix to table 
tab <- as.table(tab)

#view table 
tab
      Price  Size   Bed  Bath   New
Price 1.000 0.899 0.590 0.714 0.357
Size  0.899 1.000 0.699 0.662 0.176
Bed   0.590 0.669 1.000 0.334 0.267
Bath  0.714 0.662 0.334 1.000 0.182
New   0.357 0.176 0.267 0.182 1.000
tab2 <- matrix(c(-41.795, 64.761,-2.766, 19.203, 18.984,
                 12.104, 5.630, 3.960, 5.650, 3.873, 
                 -3.453, 11.504, -0.698,  3.399, 4.902,
                 0.001, 0, 0.487, 0.001, 0.00000), ncol=4, nrow = 5)

#define column names and row names of matrix
colnames(tab2) <- c('Estimate', 'Std. Error', 't value', 'Pr(> | t| )')
rownames(tab2) <- c('Intercept', 'Size', 'Bed', 'Bath', 'New')

#convert matrix to table 
tab2 <- as.table(tab2)

#view table 
tab2
          Estimate Std. Error t value Pr(> | t| )
Intercept  -41.795     12.104  -3.453       0.001
Size        64.761      5.630  11.504       0.000
Bed         -2.766      3.960  -0.698       0.487
Bath        19.203      5.650   3.399       0.001
New         18.984      3.873   4.902       0.000

A

With these four predictors,

A. For backward elimination, which variable would be deleted first? Why?

For backwards elimination the first variable to be deleted would be BED. This is because BED has the largest p-value. Compared to the variables SIZE, BATH, and NEW, the variable BED has the largest p-value and would fail to reject the null hypothesis that BED is statistically significant in determining the selling price of the home.

B

B. For forward selection, which variable would be added first? Why?

For forwards selection we would add the variable SIZE. Due to the p-values for SIZE and NEW both being 0.000 we will look to the correlation matrix. When looking at the correlation matrix we see SIZE has a higher correlation to selling price at .90 out of 1 compared to NEW being .36 out of 1.

C

C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

I believe the variable BEDS has a large p-value in the multiple regression model because the model may be too complex with too many variables. If there are too many variables we are unable to theorize about causality or the relationship between variables. The sample size being very small also can be a factor in having a high p-value and a high correlation. When we run the model we see that there are only 93 observations, which is a very small sample size.

D

D. Using software with these four predictors, find the model that would be selected using each criterion: a. R2 b. Adjusted R2 c. PRESS d. AIC e. BIC

The model for R2 and Adjusted R2 would be a multiple regression model using the function lm() and summary().

In order to find the PRESS we would use our multiple regression model “house” in the function PRESS(). To use the PRESS() we will used the created function.

In order to find the AIC we would use our multiple regression model “house” in the function AIC()

In order to find the BIC we would use our multiple regression model “house” in the function BIC()

data("house.selling.price.2")
house <- lm(P ~ Be + Ba + New + S, data = house.selling.price.2)

summary(house)

Call:
lm(formula = P ~ Be + Ba + New + S, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
S             64.761      5.630  11.504  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
AIC(house)
[1] 790.6225
BIC(house)
[1] 805.8181
PRESS <- function(linear.model) {
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    sum(pr^2)
}

fit <- lm(P ~ Be + Ba + New + S, data = house.selling.price.2)
PRESS(fit)
[1] 28390.22

E

E. Explain which model you prefer and why.

I prefer the AIC and BIC models as they both penalize the addition of new variables. Additionally they both want the lowest value possible to have a better-fit (parsimonious) model. This makes it easy to rerun test with multiple variables to see when adding a new variable would increase the score instead of decreasing it. Seeing the AIC or BIC score increase also lets us know when we should stop adding new variables. Additionally running both the AIC and BIC allow us to receive two scores and pick the lower score of the two.

Question 2

(Data file: trees from base R) From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A

A. fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

To do the multiple regression model I used the function lm().

tree <- lm(Volume ~ Girth + Height, data=trees)

B

B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

To run the diagnostic plots I will use the function plot(). I will also use the function par() to align the graphs.

par(mfrow = c(2,3)); plot(tree, which = 1:6)

Based on the plots I believe that the following regression assumptions are violated for residuals vs fitted plot:

Both of these assumptions are violated in the plot.

I believe the following regression assumptions for Scale-Location plot are also violated:

Both of these assumptions are violated in the plot.

In addition the Cook’s Distance vs Leverage plot shows a highly influential data point of data item 31.

Question 3

(inspired by ALR 9.16) (Data file: florida in alr R package) In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

To run a linear regression I used lm(), for regression diagnostic plots I used plot() along with par() to align the graphics.

data("florida")

vote <- lm(Buchanan ~ Bush, data=florida)

par(mfrow = c(2,3)); plot(vote, which = 1:6)

After looking at the diagnostic plots, Palm Beach County is an outlier. The Palm Beach residual is the only residuals that violates the regression assumptions for most of the plots. One exception is the Dade residual also appears to violate some of the assumptions.

In the Residuals vs Fitted plot Palm Beach County’s residual violates the linearity assumption by being far away from the other residuals. It is much higher up.

In the Normal Q-Q plot Palm Beach County’s residual violates the normality assumption by not falling around the line.

In the Scale-Location plot Palm Beach County’s residual violates the homoskedasticity by the spreading wider and further than the rest of the residuals.

In the Residuals vs Leverage plot Palm Beach County’s residual violates the influential observation by having the residual being outside the red dashed line. Thus Palm Beach County can be influential against a regression line.

In the Cook’s Distance plot, the Cook’s distance for the Palm Beach County observation is larger than 1 and also larger than 4/67 (.06). This indicates that Palm Beach County is an outlier.

PART 2 (Final Project)

1

1. What is your research question for the final project?

My research question: Is age statistically significant to views on homosexual sexual relationships, same sex marriage, and abortions when including if you’re a democrat or republican?

2

2. What is your hypothesis (i.e. an answer to the research question) that you want to test?

My Hypothesis to be tested: My hypothesis to be tested is that age is not statistically significant in predicting views about homosexual sexual relationships, homosexual marriage, and abortions.

3

3. Present some exploratory analysis. In particular:

A

a. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

Below I have numerically summarized the variables. However due to the variables ABANY, HOMOSEX, MARHOMO, and PARTY ID being categorical variables, the summary does not explain as much.

# Check if "drat" package is in your library and if it's not this will download for you
if (!require("drat")) {
    install.packages("drat")
    library("drat")
}

# Finds the repository that "drat" is in
drat::addRepo("kjhealy")

# Installs the package "gssr" and use library function
if (!require("gssr")) {
    install.packages("gssr")
    library("gssr")
}


data(gss_doc)
gss21 <- gss_get_yr(2021)

gssmh <- gss21 %>% 
  dplyr::select(age, partyid, homosex, marhomo, abany) %>% 
  dplyr::filter_at(vars(age, partyid),
            all_vars(!is.na(.))) %>% 
  dplyr::filter(partyid == 5 | partyid <=1 | partyid == 6)

info <- gss_get_marginals(varnames = c("age", "partyid", 
                                       "homosex", "marhomo", "abany"))

paged_table(info)
summary(gssmh)
      age           partyid        homosex         marhomo     
 Min.   :18.00   Min.   :0.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:40.00   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:1.000  
 Median :55.00   Median :1.00   Median :4.000   Median :2.000  
 Mean   :53.81   Mean   :2.46   Mean   :3.059   Mean   :2.139  
 3rd Qu.:67.00   3rd Qu.:5.00   3rd Qu.:4.000   3rd Qu.:3.000  
 Max.   :89.00   Max.   :6.00   Max.   :4.000   Max.   :5.000  
                                NA's   :708     NA's   :687    
     abany      
 Min.   :1.000  
 1st Qu.:1.000  
 Median :1.000  
 Mean   :1.409  
 3rd Qu.:2.000  
 Max.   :2.000  
 NA's   :1368   

For the variable AGE we can see the minimum age is 18, the maximum age is 89 and the mean age is about 54.

For the variable PARTY ID we can see the minimum is 0 (Strongly Democrat), the maximum is 6 (Strongly Republican), and the mean is 2.46 which would be leaning close to “Not Strongly Democratic”. In order to do a study on only Democrats and Republicans, Independent participants were removed (2-4).

For the variable HOMOSEX (Question: “219. What about sexual relations between two adults of the same sex?”) we can see the minimum is 1 (Always Wrong), the maximum is 4 (Not Wrong at All) and the mean is 3.06 which would be leaning close to “Sometimes Wrong”.

For the variable MARHOMO (Question: “1280. Do you agree or disagree? j. Homosexual couples should have the right to marry one another.”) we can see the minimum is 1 (Strongly Agree), the maximum is 5 (Strongly Disagree) and the mean is 2.14 which would be leaning close to “Agree”.

For the variable ABANY (Question: “206. Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if: g. The woman wants it for any reason?”) we can see the minimum is 1 (Yes), the maximum is 2 (No) and the mean is 1.41 which would be leaning slightly closer to “Yes”.

I also ran the following linear models below.

homosexpa <- lm(homosex ~ partyid + age + partyid*age, data = gssmh)

marhomopa <- lm(marhomo ~ partyid + age + partyid*age, data = gssmh)

abandypa <- lm(abany ~ partyid + age + partyid*age, data = gssmh)

stargazer(homosexpa, marhomopa, abandypa, type = 'text', 
          dep.var.labels = c('Homosexual Sex Relationships', 
                             'Homosexual Marriage', 'Abortion'),
          covariate.labels = c('Party ID', 'Age', 'Party ID: Age')
          )

==================================================================================================
                                                 Dependent variable:                              
                    ------------------------------------------------------------------------------
                    Homosexual Sex Relationships    Homosexual Marriage           Abortion        
                                (1)                         (2)                      (3)          
--------------------------------------------------------------------------------------------------
Party ID                     -0.179***                   0.199***                 0.098***        
                              (0.044)                     (0.044)                  (0.022)        
                                                                                                  
Age                          -0.010***                   0.011***                  0.002*         
                              (0.002)                     (0.002)                  (0.001)        
                                                                                                  
Party ID: Age                 -0.0001                      0.001                   -0.0004        
                              (0.001)                     (0.001)                 (0.0004)        
                                                                                                  
Constant                      4.025***                   0.978***                 1.089***        
                              (0.139)                     (0.140)                  (0.072)        
                                                                                                  
--------------------------------------------------------------------------------------------------
Observations                   1,416                       1,437                     756          
R2                             0.153                       0.234                    0.170         
Adjusted R2                    0.151                       0.232                    0.167         
Residual Std. Error      1.209 (df = 1412)           1.220 (df = 1433)        0.449 (df = 752)    
F Statistic           84.900*** (df = 3; 1412)   145.879*** (df = 3; 1433) 51.506*** (df = 3; 752)
==================================================================================================
Note:                                                                  *p<0.1; **p<0.05; ***p<0.01

For these summaries, I used stargazer and a 5% significance level for the analyses.

For homosexual sexual relationships, the p-value for variable for AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual sexual relationships.

For homosexual marriage, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual marriage.

For views on abortion, the p-value for the variable AGE has 1 asterisks. From the function stargazer() we can conclude that the p-value was between 0.05 and 0.1. With the p-value being above the .05 significance level we fail to reject the null hypothesis that age is not statistically significant in predicting views about abortion.

Overall we can see that for views on homosexual sexual relationships, homosexual marriage, and abortion that the variable PARTY ID has 3 asterisks. From the function stargazer() we can conclude that the p-value is less than .01. With the p-value being below the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID has a significant impact.

Comparatively when we look at the interaction term PARTY ID * AGE, the interaction term has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID * AGE has an insignificant impact.

To test further I made a dummy variable where you are either a democrat (0) or republican (1). I then performed lm() again.

gssmh$dr <- ifelse(gssmh$partyid <=3, 0, 1)

homosexpa2 <- lm(homosex ~ dr + age + dr*age , data = gssmh)

marhomopa2 <- lm(marhomo ~ dr + age + dr*age, data = gssmh)

abandypa2 <- lm(abany ~ dr + age + dr*age, data = gssmh)

stargazer(homosexpa2, marhomopa2, abandypa2, type = 'text', 
          dep.var.labels = c('Homosexual Sex Relationships', 
                             'Homosexual Marriage', 'Abortion'), 
          covariate.labels = c('Party ID', 'Age', 'Party ID: Age')
          )

==================================================================================================
                                                 Dependent variable:                              
                    ------------------------------------------------------------------------------
                    Homosexual Sex Relationships    Homosexual Marriage           Abortion        
                                (1)                         (2)                      (3)          
--------------------------------------------------------------------------------------------------
Party ID                     -0.862***                   0.913***                 0.466***        
                              (0.227)                     (0.229)                  (0.114)        
                                                                                                  
Age                          -0.009***                   0.010***                   0.002         
                              (0.002)                     (0.002)                  (0.001)        
                                                                                                  
Party ID: Age                  -0.001                      0.006                   -0.001         
                              (0.004)                     (0.004)                  (0.002)        
                                                                                                  
Constant                      3.915***                   1.118***                 1.153***        
                              (0.128)                     (0.129)                  (0.066)        
                                                                                                  
--------------------------------------------------------------------------------------------------
Observations                   1,416                       1,437                     756          
R2                             0.144                       0.221                    0.164         
Adjusted R2                    0.142                       0.220                    0.161         
Residual Std. Error      1.215 (df = 1412)           1.230 (df = 1433)        0.451 (df = 752)    
F Statistic           79.241*** (df = 3; 1412)   135.741*** (df = 3; 1433) 49.127*** (df = 3; 752)
==================================================================================================
Note:                                                                  *p<0.1; **p<0.05; ***p<0.01

For homosexual sexual relationships, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual sexual relationships.

For homosexual marriage, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual marriage.

For views on abortion, the p-value for the variable AGE has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level we fail to reject the null hypothesis that age is not statistically significant in predicting views about abortion.

Overall we can see that for views on homosexual sexual relationships, homosexual marriage, and abortion that the variable PARTY ID has 3 asterisks. From the function stargazer() we can conclude that the p-value was below 0.1. With the p-value being below the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID has a significant impact.

Comparatively when we look at the interaction term PARTY ID * AGE, the interaction term has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID* AGE has an insignificant impact.

B

b. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots.

Below is the three plots I will use for my final project. The size of each bubble is based on the count of people’s age and the response picked for the variable (HOMOSEX, MARHOMO, ABANY). The color shows the dummy variable DR where 0 is “DEMOCRAT” and 1 is “REPUBLICAN”.

gssmhhs <- gssmh%>%group_by(age, homosex)%>%mutate(count=n())
gssmhms <- gssmh%>%group_by(age, marhomo)%>%mutate(count=n())
gssmhab <- gssmh%>%group_by(age, abany)%>%mutate(count=n())
p1 <- ggplot(gssmhhs, aes(x=age, y=homosex, 
                           color=dr)) +
  geom_point() +
   labs(title= "Age compared to attitudes of homosexual relationships",
        x= "Age of participant", 
        y = "Attitudes towards homosexual relationships") +
   theme_minimal()

p1 + geom_point(aes(size = count))

This graph shows:

x-axis: Age of the respondent

y-axis: Attitudes towards homosexual sexual relationships with the mimimum 1 (Always Wrong) and the maximum is 4 (Not Wrong at All)

fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”

Count: The size of each bubble is based on the count of people’s age and the response picked for HOMOSEX

This is a graph is showing the attitudes towards homosexual sexual relationships by age, count and party id

p2 <- ggplot(gssmhms, aes(x=age, y=marhomo, 
                           color=dr)) +
  geom_point() +
   labs(title= "Age compared to attitudes of homosexual marriage",
        x= "Age of participant", 
        y = "Attitudes towards homosexual marriage") +
   theme_minimal()

p2 + geom_point(aes(size = count))

This graph shows:

x-axis: Age of the respondent

y-axis: Attitudes towards homosexual marriage with the mimimum 1 (Strongly Agree) and the maximum is 5 (Strongly Disagree)

fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”

Count: The size of each bubble is based on the count of people’s age and the response picked for MARHOMO

This is a graph is showing the attitudes towards homosexual marriage by age, count and party id

p3 <- ggplot(gssmhab, aes(x=age, y=abany, color=dr)) +
  geom_point() +
   labs(title= "Age compared to attitudes on abortion",
        x= "Age of participant", 
        y = "Attitudes towards abortion") +
  scale_y_continuous(breaks=c(0,1,2)) +
   theme_minimal()

p3 + geom_point(aes(size = count))

This graph shows:

x-axis: Age of the respondent

y-axis: Attitudes towards abortion with the mimimum 1 (Yes) and the maximum is 2 (No)

fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”

Count: The size of each bubble is based on the count of people’s age and the response picked for ABANY

This is a graph is showing the attitudes towards abortion by age, count and party id