DACSS 603 Homework 4
(SMSS 14.3, 14.4, merged & modified)
(Data file: house.selling.price.2 from the smss R package)
For the house.selling.price.2 data, the tables below show a correlation matrix and a model fit using four predictors of selling price.
x <- c("Price", "Size", "Beds", "Baths", "New")
Price <- c(1, 0.899, 0.590, 0.714, 0.357)
Size <- c(0.899, 1, 0.669, 0.662, 0.176)
Beds <- c(0.590, 0.669, 1, 0.334, 0.267)
Baths <- c(0.714, 0.662, 0.334, 1, 0.182)
New <- c(0.357, 0.176, 0.267, 0.182, 1)
correlation_matrix <- data.frame(x, Price, Size, Beds, Baths, New)
correlation_matrix
x Price Size Beds Baths New
1 Price 1.000 0.899 0.590 0.714 0.357
2 Size 0.899 1.000 0.669 0.662 0.176
3 Beds 0.590 0.669 1.000 0.334 0.267
4 Baths 0.714 0.662 0.334 1.000 0.182
5 New 0.357 0.176 0.267 0.182 1.000
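For reference, the same matrix can be computed directly from the data; a minimal sketch, assuming the smss column names P, S, Be, Ba, New:
library(smss)
data("house.selling.price.2")
round(cor(house.selling.price.2), 3)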
o <- c("(Intercept)", "Size", "Beds", "Baths", "New")
Estimate <- c(-41.795, 64.761, -2.766, 19.203, 18.984)
Std.Error <- c(12.104, 5.630, 3.960, 5.650, 3.873)
tvalue <- c(-3.453, 11.504, -0.698, 3.399, 4.902)
Pr <- c(0.001, 0.000, 0.487, 0.001, 0.000)
model_fit <- data.frame(o, Estimate, Std.Error, tvalue, Pr)
model_fit
o Estimate Std.Error tvalue Pr
1 (Intercept) -41.795 12.104 -3.453 0.001
2 Size 64.761 5.630 11.504 0.000
3 Beds -2.766 3.960 -0.698 0.487
4 Baths 19.203 5.650 3.399 0.001
5 New 18.984 3.873 4.902 0.000
With these four predictors,
(A) For backward elimination, which variable would be deleted first? Why?
For backward elimination, you start with all the variables in the model and delete the variable with the largest p-value first. In this case, the first variable deleted would be Beds, since it has the largest p-value (0.487).
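This can be checked numerically with base R's drop1(), which reports an F-test p-value for removing each predictor in turn. A minimal sketch (fit_all is a throwaway name of mine, and the column names P, S, Be, Ba, New are the smss ones used in part (D) below):
fit_all <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
drop1(fit_all, test = "F") # Be (Beds) should show the largest p-value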
(B) For forward selection, which variable would be added first? Why?
For forward selection, you add the most significant variable first. So you would add Size first: it has a p-value of 0.000 and the strongest correlation with Price (0.899). (I am assuming that New, whose p-value also prints as 0, is slightly larger at more decimal places.)
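The analogous numeric check uses base R's add1(), which tests each candidate addition to an intercept-only model (again a sketch, with a throwaway name null_fit):
null_fit <- lm(P ~ 1, data = house.selling.price.2)
add1(null_fit, scope = ~ S + Be + Ba + New, test = "F") # S (Size) should have the smallest p-value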
(C) Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
This might be because the Beds and Size variables are very similar (their correlation is 0.669): if a house has more bedrooms, it is bigger, and vice versa; often the reason a house is bigger is that it has more bedrooms. So the information captured in Beds is already captured in Size, and Beds contributes little once Size is in the model. This is a multicollinearity effect.
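One way to quantify this overlap (my addition, not something the question asks for) is the variance inflation factor, e.g. via the car package:
library(car)
vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2)) # values well above 1 indicate a predictor largely explained by the others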
(D) Using software with these four predictors, find the model that would be selected using each criterion:
data("house.selling.price.2")
# head(house.selling.price.2) looking at names of variables
model4 <- lm(P ~ ., data = house.selling.price.2) #Size, New, Baths, Beds
model3 <- lm(P ~ . - Be, data = house.selling.price.2) #Size, New, Baths
model2 <- lm(P ~ . - Be - Ba, data = house.selling.price.2) # Size, New
model1 <- lm(P ~ . - Be - Ba - New, data = house.selling.price.2) #Size
# finding R^2, adjusted R^2, AIC, BIC using the broom package's glance()
broom::glance(model1) %>%
select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
r.squared adj.r.squared AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 0.808 0.806 820. 828.
broom::glance(model2) %>%
select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
r.squared adj.r.squared AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 0.848 0.845 800. 810.
broom::glance(model3) %>%
select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
r.squared adj.r.squared AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 0.868 0.864 789. 802.
broom::glance(model4) %>%
select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
r.squared adj.r.squared AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 0.869 0.863 791. 806.
# finding PRESS for each model
# model1
pr1 <- resid(model1)/(1-lm.influence(model1)$hat) #predicted residuals
PRESS1 <- sum(pr1^2) #PRESS
# model2
pr2 <- resid(model2)/(1-lm.influence(model2)$hat) #predicted residuals
PRESS2 <- sum(pr2^2) #PRESS
# model3
pr3 <- resid(model3)/(1-lm.influence(model3)$hat) #predicted residuals
PRESS3 <- sum(pr3^2) #PRESS
# model4
pr4 <- resid(model4)/(1-lm.influence(model4)$hat) #predicted residuals
PRESS4 <- sum(pr4^2) #PRESS
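Since the same pattern repeats four times, it could be wrapped in a small helper; a sketch (press is my own name for it):
# PRESS from the leave-one-out (predicted) residuals e_i / (1 - h_ii)
press <- function(m) sum((resid(m) / (1 - lm.influence(m)$hat))^2)
sapply(list(model1, model2, model3, model4), press)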
#comparison_table
Model <- c("Model 1", "Model 2", "Model 3", "Model 4")
Variables <- c("Size", "Size, New", "Size, New, Baths", "Size, New, Baths, Beds")
R.Squared <- c(0.807866, 0.8483699, 0.8681361, 0.868863 )
Adj.R.Squared <- c(0.8057546, 0.8450003, 0.8636912, 0.8629022 )
PRESS <- c(PRESS1, PRESS2, PRESS3, PRESS4)
AIC <- c(820.1439, 800.1262, 789.1366, 790.6225)
BIC <- c(827.7417, 810.2566, 801.7996, 805.8181)
Comparison_table <- data.frame(Model, Variables, R.Squared, Adj.R.Squared, PRESS, AIC, BIC)
Comparison_table
Model Variables R.Squared Adj.R.Squared PRESS
1 Model 1 Size 0.8078660 0.8057546 38203.29
2 Model 2 Size, New 0.8483699 0.8450003 31066.00
3 Model 3 Size, New, Baths 0.8681361 0.8636912 27860.05
4 Model 4 Size, New, Baths, Beds 0.8688630 0.8629022 28390.22
AIC BIC
1 820.1439 827.7417
2 800.1262 810.2566
3 789.1366 801.7996
4 790.6225 805.8181
Now that we have all the values, we can select a model under each of the listed criteria.
If we are trying to maximize R^2, we would choose Model 4, since R^2 can only increase as variables are added. Using the criterion of maximizing adjusted R^2, we would select Model 3 (Size, New, Baths), which edges out Model 4 (0.8637 vs. 0.8629).
According to the criterion of minimizing the predicted residual sum of squares (PRESS), we would select Model 3, which has PRESS = 27860.05. This was also the model selected by backward elimination and forward selection.
According to the criterion of minimizing AIC, we would also select Model 3 (789.1 vs. 790.6 for Model 4). Lastly, if we were looking to minimize BIC, we would again select Model 3 (801.8 vs. 805.8).
(E) Explain which model you prefer and why.
I would select Model 3, which was the model chosen by PRESS, AIC, and BIC. AIC and BIC are particularly helpful when choosing a model because they both penalize the addition of new variables, which guards against overfitting; the lowest value indicates the best trade-off between fit and complexity. They are also simple functions in R, so it is easy to rerun the comparison across multiple models and see what adding a new variable does to AIC and BIC.
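For example, base R's AIC() and BIC() accept several fitted models at once, which makes the comparison a one-liner:
AIC(model1, model2, model3, model4) # Model 3 should show the smallest AIC
BIC(model1, model2, model3, model4) # and the smallest BIC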
(Data file: trees from base R)
From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
(A) fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
(B) Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
trees %>%
summarize(n = n())
n
1 31
fit <- lm(Volume ~ Girth + Height, data = trees)
par(mfrow = c(2,2)) # change the panel layout to 2x2
plot(fit, which = 1:6)
Yes, it does look like some of the regression assumptions are violated with this linear model. First, the Residuals vs. Fitted plot speaks to the linearity assumption, under which the residuals should scatter around a roughly straight horizontal line; here the residuals are bowed in shape. The same plot also speaks to the constant-variance assumption (the residuals should form a roughly horizontal band), which is violated as well.
Some assumptions also appear to be violated in the Scale-Location plot. Again, under constant variance the residuals should form a roughly horizontal band with equally spread points, and the red line should be approximately horizontal. Both are violated in the plot.
Finally, the Cook's Distance vs. Leverage plot shows a highly influential data point, observation 31. Its Cook's distance of roughly 0.6 is far above the 4/n = 4/31 = 0.129 rule-of-thumb cutoff, so this observation is flagged as influential.
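A quick numeric check of that rule of thumb (my addition, using base R):
cd <- cooks.distance(fit)
which.max(cd) # should point to observation 31, per the plot
sum(cd > 4 / nrow(trees)) # how many points exceed the 4/n cutoff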
(inspired by ALR 9.16)
(Data file: florida in the alr R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate: Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
data("florida")
florida %>%
summarize(n = n())
n
1 67
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
fit <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2,2)) # change the panel layout to 2x2
plot(fit, which = 1:6)
Based on the diagnostic plots, it looks like Palm Beach County is an outlier. Its residual is the only one that violates the regression assumptions in most of the plots, and in some cases the Dade County residual also appears to violate them. In the Residuals vs. Fitted plot, Palm Beach County's residual violates the linearity assumption. In the Normal Q-Q plot, it violates the normality assumption by falling far from the line. In the Scale-Location plot, it violates homoskedasticity by spreading much wider and further than the rest of the residuals (here Dade County may also violate the assumption). In the Residuals vs. Leverage plot, Palm Beach County's point falls outside the red dashed line, which means it can be influential on the regression line. Finally, in the Cook's Distance plot, the Cook's distance for the Palm Beach County observation is larger than 1 and far larger than the 4/67 = 0.06 cutoff. This indicates that Palm Beach County is an outlier. We also see that Dade County is an outlier (and perhaps Orange County as well).
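The same conclusion can be checked numerically (a sketch; the row labels depend on how the alr package names the counties):
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3) # Palm Beach should top the list, per the plots
sum(cd > 4 / nrow(florida)) # points above the 4/n = 0.06 cutoff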
1. What is your research question for the final project?
My research question takes on the age-old question: what factors contribute to happiness? I started broadly, with a wide set of factors that I thought could play a role.
2. What is your hypothesis (i.e. an answer to the research question) that you want to test?
My hypothesis is that the factors that are most strongly correlated with happiness are marital status (with married men potentially being the happiest group), age (older people being happier than younger people), sex (men are happier than women), physical and mental health, and income (or, more importantly, income satisfaction).
# reading in the data files
library(readxl) # for read_excel()
library(naniar) # for replace_with_na_all()
happiness_data <- read_excel("~/Documents/Eliza Geeslin/DACSS/DACSS semester 2 (603)/GSS_Final_Project.xlsx")
happiness_data2 <- happiness_data %>%
filter(year == 2018) %>%
select(zodiac, wrkstat, marital, childs, age, educ, sex, race, earnrs, income, polviews, relig, happy, hapmar, life, satjob, satfin, relpersn, sexfreq, hlthphys, hlthmntl, satsoc, wtsscomp,sei10)
na_strings <- c(".n: No answer", ".d: Do not Know/Cannot Choose", ".r: Refused", ".i: Inapplicable", ".x: Not available in this release")
happiness_data2 <- happiness_data2 %>%
replace_with_na_all(condition = ~.x %in% na_strings)
happiness_data2
# A tibble: 2,348 × 24
zodiac wrkstat marital childs age educ sex race earnrs income
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Virgo With a… Never … 0 43 14 MALE White 1 .r: …
2 Aquar… Retired Separa… 3 74 10 FEMA… White 1 $25,0…
3 Aries Workin… Married 2 42 16 MALE White 2 $25,0…
4 Aries Workin… Married 2 63 16 FEMA… White 2 .r: …
5 Cancer Retired Divorc… 0 71 18 MALE Black 1 .r: …
6 Scorp… Retired Widowed 2 67 16 FEMA… White 1 .d: …
7 Leo Workin… Divorc… 6 59 13 FEMA… Black 2 $15,0…
8 Pisces Workin… Never … 0 43 12 MALE White 1 $25,0…
9 .n: … Workin… Widowed 4 62 8 FEMA… White 1 $5,00…
10 Scorp… Workin… Married 2 55 12 MALE White 1 $25,0…
# … with 2,338 more rows, and 14 more variables: polviews <chr>,
# relig <chr>, happy <chr>, hapmar <chr>, life <chr>, satjob <chr>,
# satfin <chr>, relpersn <chr>, sexfreq <chr>, hlthphys <chr>,
# hlthmntl <chr>, satsoc <chr>, wtsscomp <dbl>, sei10 <chr>
3. Present some exploratory analysis. In particular: i) Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).
# summarize data
# transform data into numbers
happiness_data2$age <- as.numeric(happiness_data2$age)
happiness_data2$educ <- as.numeric(happiness_data2$educ)
happiness_data2$earnrs <- as.numeric(happiness_data2$earnrs)
happiness_data2$sei10 <- as.numeric(happiness_data2$sei10)
# re-code data into numbers
happiness_data3 <- happiness_data2 %>%
add_column(happy_num = if_else(.$happy == "Not too happy", 1, if_else(.$happy == "Pretty happy", 2, if_else(.$happy == "Very happy", 3,999)))) %>%
add_column(hapmar_num = if_else(.$hapmar == "NOT TOO HAPPY", 1, if_else(.$hapmar == "PRETTY HAPPY", 2, if_else(.$hapmar == "VERY HAPPY", 3, 999)))) %>%
add_column(life_num = if_else(.$life == "Dull", 1, if_else(.$life == "Routine", 2, if_else(.$life == "Exciting", 3, 999)))) %>%
add_column(satjob_num = if_else(.$satjob == "Very dissatisfied", 1, if_else(.$satjob == "A little dissatisfied", 2, if_else(.$satjob == "Moderately satisfied", 3, if_else(.$satjob == "Very satisfied", 4, 999))))) %>%
add_column(satfin_num = if_else(.$satfin == "Not satisfied at all", 1, if_else(.$satfin == "More or less satisfied", 2, if_else(.$satfin == "Pretty well satisfied", 3, 999)))) %>%
add_column(relpersn_num = if_else(.$relpersn == "Not religious at all", 1, if_else(.$relpersn == "Slightly religious", 2, if_else(.$relpersn == "Moderately religious", 3, if_else(.$relpersn == "Very religious", 4, 999))))) %>%
add_column(hlthphys_num = if_else(.$hlthphys == "Poor", 1, if_else(.$hlthphys == "Fair", 2, if_else(.$hlthphys == "Good", 3, if_else(.$hlthphys == "Very good", 4, if_else(.$hlthphys == "Excellent", 5, 999)))))) %>%
add_column(hlthmntl_num = if_else(.$hlthmntl == "Poor", 1, if_else(.$hlthmntl == "Fair", 2, if_else(.$hlthmntl == "Good", 3, if_else(.$hlthmntl == "Very good", 4, if_else(.$hlthmntl == "Excellent", 5, 999)))))) %>%
add_column(satsoc_num = if_else(.$satsoc == "Poor", 1, if_else(.$satsoc == "Fair", 2, if_else(.$satsoc == "Good", 3, if_else(.$satsoc == "Very good", 4, if_else(.$satsoc == "Excellent", 5, 999)))))) %>%
add_column(childs_num = if_else(.$childs == 0, 0, if_else(.$childs == 1, 1, if_else(.$childs == 2, 2, if_else(.$childs == 3, 3, if_else(.$childs == 4, 4, if_else(.$childs == 5, 5, if_else(.$childs == 6, 6, if_else(.$childs == 7, 7, if_else(.$childs == "8 or more", 8, 999)))))))))) %>%
replace_with_na_all(condition = ~.x == 999)
summary(happiness_data3)
zodiac wrkstat marital
Length:2348 Length:2348 Length:2348
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
childs age educ
Length:2348 Min. :18.00 Min. : 1.00
Class :character 1st Qu.:34.00 1st Qu.:12.00
Mode :character Median :48.00 Median :14.00
Mean :48.47 Mean :13.76
3rd Qu.:63.00 3rd Qu.:16.00
Max. :88.00 Max. :20.00
NA's :36 NA's :7
sex race earnrs
Length:2348 Length:2348 Min. :0.000
Class :character Class :character 1st Qu.:1.000
Mode :character Mode :character Median :1.000
Mean :1.408
3rd Qu.:2.000
Max. :7.000
NA's :13
income polviews relig
Length:2348 Length:2348 Length:2348
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
happy hapmar life
Length:2348 Length:2348 Length:2348
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
satjob satfin relpersn
Length:2348 Length:2348 Length:2348
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
sexfreq hlthphys hlthmntl
Length:2348 Length:2348 Length:2348
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
satsoc wtsscomp sei10
Length:2348 Min. :0.4715 Min. :-100.00
Class :character 1st Qu.:0.4715 1st Qu.: 24.20
Mode :character Median :0.9430 Median : 39.70
Mean :1.0000 Mean : 40.69
3rd Qu.:0.9430 3rd Qu.: 65.20
Max. :5.8974 Max. : 92.80
happy_num hapmar_num life_num satjob_num
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000
Median :2.000 Median :3.000 Median :2.000 Median :3.000
Mean :2.156 Mean :2.613 Mean :2.446 Mean :3.311
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
Max. :3.000 Max. :3.000 Max. :3.000 Max. :4.000
NA's :4 NA's :1356 NA's :793 NA's :609
satfin_num relpersn_num hlthphys_num hlthmntl_num
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.:3.000
Median :2.000 Median :3.000 Median :3.000 Median :4.000
Mean :2.078 Mean :2.476 Mean :3.335 Mean :3.663
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :3.000 Max. :4.000 Max. :5.000 Max. :5.000
NA's :10 NA's :20 NA's :19 NA's :19
satsoc_num childs_num
Min. :1.000 Min. :0.000
1st Qu.:3.000 1st Qu.:0.000
Median :4.000 Median :2.000
Mean :3.452 Mean :1.855
3rd Qu.:4.000 3rd Qu.:3.000
Max. :5.000 Max. :8.000
NA's :21 NA's :4
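As an aside, the nested if_else() recodes above could be written more compactly with dplyr::case_when(), which also removes the need for the 999 sentinel and the replace_with_na_all() pass; a sketch for one variable:
happiness_data3 <- happiness_data2 %>%
  mutate(happy_num = case_when(
    happy == "Not too happy" ~ 1,
    happy == "Pretty happy"  ~ 2,
    happy == "Very happy"    ~ 3,
    TRUE ~ NA_real_ # anything else (including the .n/.d/.r codes) becomes NA directly
  ))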
ii) Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.
# did not include hapmar_num bc of so many NAs
happiness_data_num_1 <- happiness_data3 %>%
select(happy_num, age, educ, childs_num, hlthphys_num, hlthmntl_num, relpersn_num)
# happiness_data_num_2 <- happiness_data3 %>%
# select(happy_num, satfin_num, satjob_num, life_num, relpersn_num)
# happiness_data_num_3 <- happiness_data3 %>%
# select(happy_num, hlthphys_num, hlthmntl_num, satsoc_num, childs_num)
pairs(happiness_data_num_1)
#pairs(happiness_data_num_2)
#pairs(happiness_data_num_3)
This is just one summary visual. I think I need a better way to visualize this data, since much of it is categorical. I am also not sure I will be able to find meaningful insights, given how coarsely happiness is measured (just three levels). As a next step, I want to see how other analyses of GSS happiness data have been done, to look for inspiration.
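One option I may try is a proportion bar chart per categorical predictor, sketched here with ggplot2 (the variable choices are just an example):
library(ggplot2)
happiness_data3 %>%
  filter(!is.na(happy), !is.na(marital)) %>%
  ggplot(aes(x = marital, fill = happy)) +
  geom_bar(position = "fill") +
  labs(x = "Marital status", y = "Proportion", title = "Happiness by marital status")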