Homework 4 for DACSS 603
(Data file: house.selling.price.2 from smss R package) For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.
(SMSS 14.3, 14.4, merged & modified) Price
#create matrix
tab <- matrix(c(1, 0.899, 0.590, 0.714, 0.357,
0.899, 1, 0.669, 0.662, 0.176,
0.590, 0.699, 1, 0.334, 0.267,
0.714, 0.662, 0.334, 1, 0.182,
0.357, 0.176, 0.267, 0.182, 1), ncol=5, nrow = 5)
#define column names and row names of matrix
colnames(tab) <- c('Price', 'Size', 'Bed', 'Bath', 'New')
rownames(tab) <- c('Price', 'Size', 'Bed', 'Bath', 'New')
#convert matrix to table
tab <- as.table(tab)
#view table
tab
Price Size Bed Bath New
Price 1.000 0.899 0.590 0.714 0.357
Size 0.899 1.000 0.699 0.662 0.176
Bed 0.590 0.669 1.000 0.334 0.267
Bath 0.714 0.662 0.334 1.000 0.182
New 0.357 0.176 0.267 0.182 1.000
tab2 <- matrix(c(-41.795, 64.761,-2.766, 19.203, 18.984,
12.104, 5.630, 3.960, 5.650, 3.873,
-3.453, 11.504, -0.698, 3.399, 4.902,
0.001, 0, 0.487, 0.001, 0.00000), ncol=4, nrow = 5)
#define column names and row names of matrix
colnames(tab2) <- c('Estimate', 'Std. Error', 't value', 'Pr(> | t| )')
rownames(tab2) <- c('Intercept', 'Size', 'Bed', 'Bath', 'New')
#convert matrix to table
tab2 <- as.table(tab2)
#view table
tab2
Estimate Std. Error t value Pr(> | t| )
Intercept -41.795 12.104 -3.453 0.001
Size 64.761 5.630 11.504 0.000
Bed -2.766 3.960 -0.698 0.487
Bath 19.203 5.650 3.399 0.001
New 18.984 3.873 4.902 0.000
With these four predictors,
A. For backward elimination, which variable would be deleted first? Why?
For backwards elimination the first variable to be deleted would be BED. This is because BED has the largest p-value. Compared to the variables SIZE, BATH, and NEW, the variable BED has the largest p-value and would fail to reject the null hypothesis that BED is statistically significant in determining the selling price of the home.
B. For forward selection, which variable would be added first? Why?
For forwards selection we would add the variable SIZE. Due to the p-values for SIZE and NEW both being 0.000 we will look to the correlation matrix. When looking at the correlation matrix we see SIZE has a higher correlation to selling price at .90 out of 1 compared to NEW being .36 out of 1.
C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
I believe the variable BEDS has a large p-value in the multiple regression model because the model may be too complex with too many variables. If there are too many variables we are unable to theorize about causality or the relationship between variables. The sample size being very small also can be a factor in having a high p-value and a high correlation. When we run the model we see that there are only 93 observations, which is a very small sample size.
D. Using software with these four predictors, find the model that would be selected using each criterion: a. R2 b. Adjusted R2 c. PRESS d. AIC e. BIC
The model for R2 and Adjusted R2 would be a multiple regression model using the function lm() and summary().
In order to find the PRESS we would use our multiple regression model “house” in the function PRESS(). To use the PRESS() we will used the created function.
In order to find the AIC we would use our multiple regression model “house” in the function AIC()
In order to find the BIC we would use our multiple regression model “house” in the function BIC()
data("house.selling.price.2")
house <- lm(P ~ Be + Ba + New + S, data = house.selling.price.2)
summary(house)
Call:
lm(formula = P ~ Be + Ba + New + S, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
S 64.761 5.630 11.504 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
AIC(house)
[1] 790.6225
BIC(house)
[1] 805.8181
PRESS <- function(linear.model) {
pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
sum(pr^2)
}
fit <- lm(P ~ Be + Ba + New + S, data = house.selling.price.2)
PRESS(fit)
[1] 28390.22
E. Explain which model you prefer and why.
I prefer the AIC and BIC models as they both penalize the addition of new variables. Additionally they both want the lowest value possible to have a better-fit (parsimonious) model. This makes it easy to rerun test with multiple variables to see when adding a new variable would increase the score instead of decreasing it. Seeing the AIC or BIC score increase also lets us know when we should stop adding new variables. Additionally running both the AIC and BIC allow us to receive two scores and pick the lower score of the two.
(Data file: trees from base R) From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
A. fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
To do the multiple regression model I used the function lm().
tree <- lm(Volume ~ Girth + Height, data=trees)
B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
To run the diagnostic plots I will use the function plot(). I will also use the function par() to align the graphs.
Based on the plots I believe that the following regression assumptions are violated for residuals vs fitted plot:
Both of these assumptions are violated in the plot.
I believe the following regression assumptions for Scale-Location plot are also violated:
Both of these assumptions are violated in the plot.
In addition the Cook’s Distance vs Leverage plot shows a highly influential data point of data item 31.
(inspired by ALR 9.16) (Data file: florida in alr R package) In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
To run a linear regression I used lm(), for regression diagnostic plots I used plot() along with par() to align the graphics.
data("florida")
vote <- lm(Buchanan ~ Bush, data=florida)
par(mfrow = c(2,3)); plot(vote, which = 1:6)
After looking at the diagnostic plots, Palm Beach County is an outlier. The Palm Beach residual is the only residuals that violates the regression assumptions for most of the plots. One exception is the Dade residual also appears to violate some of the assumptions.
In the Residuals vs Fitted plot Palm Beach County’s residual violates the linearity assumption by being far away from the other residuals. It is much higher up.
In the Normal Q-Q plot Palm Beach County’s residual violates the normality assumption by not falling around the line.
In the Scale-Location plot Palm Beach County’s residual violates the homoskedasticity by the spreading wider and further than the rest of the residuals.
In the Residuals vs Leverage plot Palm Beach County’s residual violates the influential observation by having the residual being outside the red dashed line. Thus Palm Beach County can be influential against a regression line.
In the Cook’s Distance plot, the Cook’s distance for the Palm Beach County observation is larger than 1 and also larger than 4/67 (.06). This indicates that Palm Beach County is an outlier.
1. What is your research question for the final project?
My research question: Is age statistically significant to views on homosexual sexual relationships, same sex marriage, and abortions when including if you’re a democrat or republican?
2. What is your hypothesis (i.e. an answer to the research question) that you want to test?
My Hypothesis to be tested: My hypothesis to be tested is that age is not statistically significant in predicting views about homosexual sexual relationships, homosexual marriage, and abortions.
3. Present some exploratory analysis. In particular:
a. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).
Below I have numerically summarized the variables. However due to the variables ABANY, HOMOSEX, MARHOMO, and PARTY ID being categorical variables, the summary does not explain as much.
# Check if "drat" package is in your library and if it's not this will download for you
if (!require("drat")) {
install.packages("drat")
library("drat")
}
# Finds the repository that "drat" is in
drat::addRepo("kjhealy")
# Installs the package "gssr" and use library function
if (!require("gssr")) {
install.packages("gssr")
library("gssr")
}
data(gss_doc)
gss21 <- gss_get_yr(2021)
gssmh <- gss21 %>%
dplyr::select(age, partyid, homosex, marhomo, abany) %>%
dplyr::filter_at(vars(age, partyid),
all_vars(!is.na(.))) %>%
dplyr::filter(partyid == 5 | partyid <=1 | partyid == 6)
info <- gss_get_marginals(varnames = c("age", "partyid",
"homosex", "marhomo", "abany"))
paged_table(info)
summary(gssmh)
age partyid homosex marhomo
Min. :18.00 Min. :0.00 Min. :1.000 Min. :1.000
1st Qu.:40.00 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:1.000
Median :55.00 Median :1.00 Median :4.000 Median :2.000
Mean :53.81 Mean :2.46 Mean :3.059 Mean :2.139
3rd Qu.:67.00 3rd Qu.:5.00 3rd Qu.:4.000 3rd Qu.:3.000
Max. :89.00 Max. :6.00 Max. :4.000 Max. :5.000
NA's :708 NA's :687
abany
Min. :1.000
1st Qu.:1.000
Median :1.000
Mean :1.409
3rd Qu.:2.000
Max. :2.000
NA's :1368
For the variable AGE we can see the minimum age is 18, the maximum age is 89 and the mean age is about 54.
For the variable PARTY ID we can see the minimum is 0 (Strongly Democrat), the maximum is 6 (Strongly Republican), and the mean is 2.46 which would be leaning close to “Not Strongly Democratic”. In order to do a study on only Democrats and Republicans, Independent participants were removed (2-4).
For the variable HOMOSEX (Question: “219. What about sexual relations between two adults of the same sex?”) we can see the minimum is 1 (Always Wrong), the maximum is 4 (Not Wrong at All) and the mean is 3.06 which would be leaning close to “Sometimes Wrong”.
For the variable MARHOMO (Question: “1280. Do you agree or disagree? j. Homosexual couples should have the right to marry one another.”) we can see the minimum is 1 (Strongly Agree), the maximum is 5 (Strongly Disagree) and the mean is 2.14 which would be leaning close to “Agree”.
For the variable ABANY (Question: “206. Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if: g. The woman wants it for any reason?”) we can see the minimum is 1 (Yes), the maximum is 2 (No) and the mean is 1.41 which would be leaning slightly closer to “Yes”.
I also ran the following linear models below.
homosexpa <- lm(homosex ~ partyid + age + partyid*age, data = gssmh)
marhomopa <- lm(marhomo ~ partyid + age + partyid*age, data = gssmh)
abandypa <- lm(abany ~ partyid + age + partyid*age, data = gssmh)
stargazer(homosexpa, marhomopa, abandypa, type = 'text',
dep.var.labels = c('Homosexual Sex Relationships',
'Homosexual Marriage', 'Abortion'),
covariate.labels = c('Party ID', 'Age', 'Party ID: Age')
)
==================================================================================================
Dependent variable:
------------------------------------------------------------------------------
Homosexual Sex Relationships Homosexual Marriage Abortion
(1) (2) (3)
--------------------------------------------------------------------------------------------------
Party ID -0.179*** 0.199*** 0.098***
(0.044) (0.044) (0.022)
Age -0.010*** 0.011*** 0.002*
(0.002) (0.002) (0.001)
Party ID: Age -0.0001 0.001 -0.0004
(0.001) (0.001) (0.0004)
Constant 4.025*** 0.978*** 1.089***
(0.139) (0.140) (0.072)
--------------------------------------------------------------------------------------------------
Observations 1,416 1,437 756
R2 0.153 0.234 0.170
Adjusted R2 0.151 0.232 0.167
Residual Std. Error 1.209 (df = 1412) 1.220 (df = 1433) 0.449 (df = 752)
F Statistic 84.900*** (df = 3; 1412) 145.879*** (df = 3; 1433) 51.506*** (df = 3; 752)
==================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
For these summaries, I used stargazer and a 5% significance level for the analyses.
For homosexual sexual relationships, the p-value for variable for AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual sexual relationships.
For homosexual marriage, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual marriage.
For views on abortion, the p-value for the variable AGE has 1 asterisks. From the function stargazer() we can conclude that the p-value was between 0.05 and 0.1. With the p-value being above the .05 significance level we fail to reject the null hypothesis that age is not statistically significant in predicting views about abortion.
Overall we can see that for views on homosexual sexual relationships, homosexual marriage, and abortion that the variable PARTY ID has 3 asterisks. From the function stargazer() we can conclude that the p-value is less than .01. With the p-value being below the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID has a significant impact.
Comparatively when we look at the interaction term PARTY ID * AGE, the interaction term has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID * AGE has an insignificant impact.
To test further I made a dummy variable where you are either a democrat (0) or republican (1). I then performed lm() again.
gssmh$dr <- ifelse(gssmh$partyid <=3, 0, 1)
homosexpa2 <- lm(homosex ~ dr + age + dr*age , data = gssmh)
marhomopa2 <- lm(marhomo ~ dr + age + dr*age, data = gssmh)
abandypa2 <- lm(abany ~ dr + age + dr*age, data = gssmh)
stargazer(homosexpa2, marhomopa2, abandypa2, type = 'text',
dep.var.labels = c('Homosexual Sex Relationships',
'Homosexual Marriage', 'Abortion'),
covariate.labels = c('Party ID', 'Age', 'Party ID: Age')
)
==================================================================================================
Dependent variable:
------------------------------------------------------------------------------
Homosexual Sex Relationships Homosexual Marriage Abortion
(1) (2) (3)
--------------------------------------------------------------------------------------------------
Party ID -0.862*** 0.913*** 0.466***
(0.227) (0.229) (0.114)
Age -0.009*** 0.010*** 0.002
(0.002) (0.002) (0.001)
Party ID: Age -0.001 0.006 -0.001
(0.004) (0.004) (0.002)
Constant 3.915*** 1.118*** 1.153***
(0.128) (0.129) (0.066)
--------------------------------------------------------------------------------------------------
Observations 1,416 1,437 756
R2 0.144 0.221 0.164
Adjusted R2 0.142 0.220 0.161
Residual Std. Error 1.215 (df = 1412) 1.230 (df = 1433) 0.451 (df = 752)
F Statistic 79.241*** (df = 3; 1412) 135.741*** (df = 3; 1433) 49.127*** (df = 3; 752)
==================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
For homosexual sexual relationships, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual sexual relationships.
For homosexual marriage, the p-value for the variable AGE has 3 asterisks. From the function stargazer() we can conclude that the p-value was between 0 and 0.01. With the p-value being under the .05 significance level we reject the null hypothesis that age is not statistically significant in predicting views about homosexual marriage.
For views on abortion, the p-value for the variable AGE has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level we fail to reject the null hypothesis that age is not statistically significant in predicting views about abortion.
Overall we can see that for views on homosexual sexual relationships, homosexual marriage, and abortion that the variable PARTY ID has 3 asterisks. From the function stargazer() we can conclude that the p-value was below 0.1. With the p-value being below the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID has a significant impact.
Comparatively when we look at the interaction term PARTY ID * AGE, the interaction term has no asterisks. From the function stargazer() we can conclude that the p-value was above 0.1. With the p-value being above the .05 significance level. Thus in relation to views on homosexual sexual relationships, homosexual marriage, and abortion PARTY ID* AGE has an insignificant impact.
b. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots.
Below is the three plots I will use for my final project. The size of each bubble is based on the count of people’s age and the response picked for the variable (HOMOSEX, MARHOMO, ABANY). The color shows the dummy variable DR where 0 is “DEMOCRAT” and 1 is “REPUBLICAN”.
p1 <- ggplot(gssmhhs, aes(x=age, y=homosex,
color=dr)) +
geom_point() +
labs(title= "Age compared to attitudes of homosexual relationships",
x= "Age of participant",
y = "Attitudes towards homosexual relationships") +
theme_minimal()
p1 + geom_point(aes(size = count))
This graph shows:
x-axis: Age of the respondent
y-axis: Attitudes towards homosexual sexual relationships with the mimimum 1 (Always Wrong) and the maximum is 4 (Not Wrong at All)
fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”
Count: The size of each bubble is based on the count of people’s age and the response picked for HOMOSEX
This is a graph is showing the attitudes towards homosexual sexual relationships by age, count and party id
p2 <- ggplot(gssmhms, aes(x=age, y=marhomo,
color=dr)) +
geom_point() +
labs(title= "Age compared to attitudes of homosexual marriage",
x= "Age of participant",
y = "Attitudes towards homosexual marriage") +
theme_minimal()
p2 + geom_point(aes(size = count))
This graph shows:
x-axis: Age of the respondent
y-axis: Attitudes towards homosexual marriage with the mimimum 1 (Strongly Agree) and the maximum is 5 (Strongly Disagree)
fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”
Count: The size of each bubble is based on the count of people’s age and the response picked for MARHOMO
This is a graph is showing the attitudes towards homosexual marriage by age, count and party id
p3 <- ggplot(gssmhab, aes(x=age, y=abany, color=dr)) +
geom_point() +
labs(title= "Age compared to attitudes on abortion",
x= "Age of participant",
y = "Attitudes towards abortion") +
scale_y_continuous(breaks=c(0,1,2)) +
theme_minimal()
p3 + geom_point(aes(size = count))
This graph shows:
x-axis: Age of the respondent
y-axis: Attitudes towards abortion with the mimimum 1 (Yes) and the maximum is 2 (No)
fill colors: Party ID where 0 is “Democrat” and 1 is “Republican”
Count: The size of each bubble is based on the count of people’s age and the response picked for ABANY
This is a graph is showing the attitudes towards abortion by age, count and party id