DACSS603_HW4

DACSS 603 Homework 4

Eliza Geeslin
4/9/2022

Question 1

(SMSS 14.3, 14.4, merged & modified)

(Data file: house.selling.price.2 from smss R package)

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

x <- c("Price", "Size", "Beds", "Baths", "New")
Price <- c(1, 0.899, 0.590, 0.714, 0.357)
Size <- c(0.899, 1, 0.669, 0.662, 0.176)
Beds <- c(0.590, 0.669, 1, 0.334, 0.267)
Baths <- c(0.714, 0.662, 0.334, 1, 0.182)
New <- c(0.357, 0.176, 0.267, 0.182, 1)

correlation_matrix <- data.frame(x, Price, Size, Beds, Baths, New)

correlation_matrix
      x Price  Size  Beds Baths   New
1 Price 1.000 0.899 0.590 0.714 0.357
2  Size 0.899 1.000 0.669 0.662 0.176
3  Beds 0.590 0.669 1.000 0.334 0.267
4 Baths 0.714 0.662 0.334 1.000 0.182
5   New 0.357 0.176 0.267 0.182 1.000
o <- c("(Intercept)", "Size", "Beds", "Baths", "New")
Estimate <- c(-41.795, 64.761, -2.766, 19.203, 18.984)
Std.Error <- c(12.104, 5.630, 3.960, 5.650, 3.873)
tvalue <- c(-3.453, 11.504, -0.698, 3.399, 4.902)
Pr <- c(0.001, 0, 0.487, 0.001, 0.00000)

model_fit <- data.frame(o, Estimate, Std.Error, tvalue, Pr)

model_fit
            o Estimate Std.Error tvalue    Pr
1 (Intercept)  -41.795    12.104 -3.453 0.001
2        Size   64.761     5.630 11.504 0.000
3        Beds   -2.766     3.960 -0.698 0.487
4       Baths   19.203     5.650  3.399 0.001
5         New   18.984     3.873  4.902 0.000

With these four predictors,

(A) For backward elimination, which variable would be deleted first? Why?

For backward elimination, you start with all the variables in the model and delete the one with the largest p-value first. In this case, the first variable deleted would be Beds, which has the largest p-value at 0.487.
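A minimal sketch of one backward-elimination step in base R, using the smss data and the column names (P, S, Be, Ba, New) that appear in part (D) below:

library(smss)   # provides house.selling.price.2
data("house.selling.price.2")

full <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(full)$coefficients          # Be has the largest p-value (0.487)

step1 <- update(full, . ~ . - Be)   # drop Be and refit
summary(step1)$coefficients         # repeat until all p-values are small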

(B) For forward selection, which variable would be added first? Why?

For forward selection, you start with no predictors and add the most significant variable first. Here that would be Size: it has the strongest correlation with Price (0.899) and by far the largest t-statistic (11.504). Both Size and New have p-values that round to 0 in the table, but Size's much larger t-statistic implies the smaller p-value.
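A sketch of the first forward-selection step, comparing the four single-predictor models by their t-statistics (same data and column names as above):

# fit each one-predictor model; the largest |t| identifies the first variable added
sapply(c("S", "Be", "Ba", "New"), function(v) {
  m <- lm(reformulate(v, response = "P"), data = house.selling.price.2)
  summary(m)$coefficients[2, "t value"]
})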

(C) Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

This is likely multicollinearity: Beds and Size are strongly correlated (r = 0.669), since houses with more bedrooms tend to be bigger (and, often, a house is bigger precisely because it has more bedrooms). Once Size is in the model, Beds carries little additional information about Price, so its coefficient is imprecisely estimated and its p-value is large.
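One way to quantify this overlap is with variance inflation factors; a sketch assuming the car package is installed:

library(car)   # not in base R; assumed installed for this check

# VIFs well above 1 indicate predictors that share information with the others
vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2))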

(D) Using software with these four predictors, find the model that would be selected using each criterion:

data("house.selling.price.2")

# head(house.selling.price.2) looking at names of variables

model4 <- lm(P ~ ., data = house.selling.price.2) #Size, New, Baths, Beds

model3 <- lm(P ~ . - Be, data = house.selling.price.2) #Size, New, Baths

model2 <- lm(P ~ . - Be - Ba, data = house.selling.price.2) # Size, New

model1 <- lm(P ~ . - Be - Ba - New, data = house.selling.price.2) #Size
# finding R^2, adjusted R^2, AIC, BIC using the broom package's glance()

broom::glance(model1) %>%
  select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
  r.squared adj.r.squared   AIC   BIC
      <dbl>         <dbl> <dbl> <dbl>
1     0.808         0.806  820.  828.
broom::glance(model2) %>%
  select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
  r.squared adj.r.squared   AIC   BIC
      <dbl>         <dbl> <dbl> <dbl>
1     0.848         0.845  800.  810.
broom::glance(model3) %>%
  select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
  r.squared adj.r.squared   AIC   BIC
      <dbl>         <dbl> <dbl> <dbl>
1     0.868         0.864  789.  802.
broom::glance(model4) %>%
  select(r.squared, adj.r.squared, AIC, BIC)
# A tibble: 1 × 4
  r.squared adj.r.squared   AIC   BIC
      <dbl>         <dbl> <dbl> <dbl>
1     0.869         0.863  791.  806.
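
The four glance() calls can also be collected into a single table; a sketch using purrr (assumed installed):

library(purrr)

# .id labels the rows 1-4, matching model1 through model4
map_dfr(list(model1, model2, model3, model4),
        ~ broom::glance(.x) %>% select(r.squared, adj.r.squared, AIC, BIC),
        .id = "model")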
# finding PRESS for each model

# model1 

r1 <- resid(model1) #residuals 

pr1 <- resid(model1)/(1-lm.influence(model1)$hat) #predicted residuals

PRESS1 <- sum(pr1^2) #PRESS


# model2
pr2 <- resid(model2)/(1-lm.influence(model2)$hat) #predicted residuals

PRESS2 <- sum(pr2^2) #PRESS

# model3
pr3 <- resid(model3)/(1-lm.influence(model3)$hat) #predicted residuals

PRESS3 <- sum(pr3^2) #PRESS

# model4
pr4 <- resid(model4)/(1-lm.influence(model4)$hat) #predicted residuals

PRESS4 <- sum(pr4^2) #PRESS
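
The same computation can be wrapped in a small helper to avoid repeating it four times; a minimal sketch:

press <- function(model) {
  pr <- resid(model) / (1 - lm.influence(model)$hat)   # predicted residuals
  sum(pr^2)                                            # PRESS
}

sapply(list(model1, model2, model3, model4), press)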


#comparison_table

Model <- c("Model 1", "Model 2", "Model 3", "Model 4")
Variables <- c("Size", "Size, New", "Size, New, Baths", "Size, New, Baths, Beds")
R.Squared <- c(0.807866, 0.8483699, 0.8681361, 0.868863 )
Adj.R.Squared <- c(0.8057546, 0.8450003, 0.8636912, 0.8629022 )
PRESS <- c(PRESS1, PRESS2, PRESS3, PRESS4)
AIC <- c(820.1439, 800.1262, 789.1366, 790.6225)
BIC <- c(827.7417, 810.2566, 801.7996, 805.8181)

Comparison_table <- data.frame(Model, Variables, R.Squared, Adj.R.Squared, PRESS, AIC, BIC)

Comparison_table
    Model              Variables R.Squared Adj.R.Squared    PRESS
1 Model 1                   Size 0.8078660     0.8057546 38203.29
2 Model 2              Size, New 0.8483699     0.8450003 31066.00
3 Model 3       Size, New, Baths 0.8681361     0.8636912 27860.05
4 Model 4 Size, New, Baths, Beds 0.8688630     0.8629022 28390.22
       AIC      BIC
1 820.1439 827.7417
2 800.1262 810.2566
3 789.1366 801.7996
4 790.6225 805.8181

Now that we have all the values, we can identify the model each criterion selects.

If the criterion is maximizing R^2, we would choose Model 4, since R^2 can only increase as predictors are added and Model 4's 0.8689 is the largest. If the criterion is maximizing adjusted R^2, which penalizes extra predictors, we would select Model 3 (Size, New, Baths), whose 0.8637 edges out Model 4's 0.8629.

According to the criterion of minimizing the predicted residual sum of squares (PRESS), we would select Model 3, with PRESS = 27860.05. This was also the model selected by backward elimination and forward selection.

According to the criterion of minimizing AIC, we would also select Model 3. Lastly, if we were looking to minimize BIC, we would also select Model 3.

(E) Explain which model you prefer and why.

I would select Model 3, the model chosen by PRESS, AIC, and BIC. AIC and BIC are particularly helpful for model choice because both penalize the addition of new variables, and for both a lower value indicates a better-fitting model. They are also simple functions in R, so it is easy to rerun the comparison across several models and see what adding a variable does to AIC and BIC.
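As a quick cross-check, base R's step() function runs AIC-based backward elimination automatically and lands on the same model; a sketch:

# starts from the full model and drops terms while AIC improves;
# should retain S, Ba, and New (Model 3)
step(lm(P ~ ., data = house.selling.price.2), direction = "backward")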

Question 2

(Data file: trees from base R)

From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

(A) fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

data("trees")

summary(lm(Volume ~ Girth + Height, data = trees))

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
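
To make the fitted equation concrete, a hypothetical prediction (the Girth and Height values below are made up for illustration):

fit <- lm(Volume ~ Girth + Height, data = trees)

# predicted volume, with a prediction interval, for a 12 in. diameter, 80 ft tree
predict(fit, newdata = data.frame(Girth = 12, Height = 80),
        interval = "prediction")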

(B) Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

trees %>%
  summarize(n = n())
   n
1 31
fit <- lm(Volume ~ Girth + Height, data = trees)

par(mfrow = c(2,2)) # change the panel layout to 2x2
plot(fit, which = 1:6)

Yes, it looks like some of the regression assumptions are violated in this model. First, the Residuals vs. Fitted plot speaks to the linearity assumption: the residuals should scatter randomly around zero, with the trend line roughly flat. Here the residuals are bowed in shape, suggesting a nonlinear relationship. The constant-variance assumption (the residuals should form a roughly horizontal band of even width) also appears violated in this plot.

The Scale-Location plot points to the same problem. Under constant variance, the points should be evenly spread and the red line approximately horizontal; neither holds here.

Finally, the Cook's Distance vs. Leverage plot shows a highly influential observation, data point 31. Using the 4/n rule of thumb, 4/31 = 0.129; observation 31's Cook's distance of roughly 0.6 far exceeds that threshold, so it is an influential point worth investigating.
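The visual impression can be checked numerically with the fit object from above; a sketch:

cooks <- cooks.distance(fit)
max(cooks)                       # Cook's distance of the most influential point
which(cooks > 4 / nrow(trees))   # observations above the 4/n rule of thumb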

Question 3

(inspired by ALR 9.16)

(Data file: florida in alr R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

data("florida")

florida %>%
  summarize(n = n())
   n
1 67
summary(lm(Buchanan ~ Bush, data = florida))

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
fit <- lm(Buchanan ~ Bush, data = florida)

par(mfrow = c(2,2)) # change the panel layout to 2x2
plot(fit, which = 1:6)

Based on the diagnostic plots, Palm Beach County is an outlier: it is the one observation that stands apart in nearly every plot, and in some plots Dade County looks problematic as well. In the Residuals vs. Fitted plot, Palm Beach County's residual sits far from the rest, violating the linearity assumption. In the Normal Q-Q plot, its residual departs sharply from the line, violating the normality assumption. In the Scale-Location plot, its residual spreads much wider and farther than the others, violating homoskedasticity (Dade County may violate this assumption too). In the Residuals vs. Leverage plot, Palm Beach County falls outside the red dashed Cook's distance contour, which means it can be influential on the fitted regression line.

Finally, in the Cook's Distance plot, Palm Beach County's Cook's distance is larger than 1 and far larger than the 4/n threshold of 4/67 ≈ 0.06, again flagging it as an outlier. By that threshold Dade County (and perhaps Orange County) is flagged as well.
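A numeric confirmation using the fit object from above; note that the row labels depend on how the florida data names its rows:

cooks <- cooks.distance(fit)
head(sort(cooks, decreasing = TRUE), 3)   # Palm Beach should top the list
which.max(abs(rstandard(fit)))            # largest standardized residual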

PART 2 (Final Project)

1. What is your research question for the final project?

My research question is the age-old one: what factors contribute to happiness? I started broadly, with a wide set of factors that I thought might play a role.

2. What is your hypothesis (i.e. an answer to the research question) that you want to test?

My hypothesis is that the factors most strongly correlated with happiness are marital status (with married men potentially being the happiest group), age (older people being happier than younger people), sex (men being happier than women), physical and mental health, and income (or, more importantly, income satisfaction).

# reading in the data files

library(readxl)   # read_excel()
library(naniar)   # replace_with_na_all()
library(tibble)   # add_column()

happiness_data <- read_excel("~/Documents/Eliza Geeslin/DACSS/DACSS semester 2 (603)/GSS_Final_Project.xlsx")


happiness_data2 <- happiness_data %>%
  filter(year == 2018) %>%
  select(zodiac, wrkstat, marital, childs, age, educ, sex, race, earnrs, income, polviews, relig, happy, hapmar, life, satjob, satfin, relpersn, sexfreq, hlthphys, hlthmntl, satsoc, wtsscomp,sei10)
  
na_strings <- c(".n: No answer", ".d: Do not Know/Cannot Choose", ".r: Refused", ".i: Inapplicable", ".x:  Not available in this release")
  
happiness_data2 <- happiness_data2 %>%
  replace_with_na_all(condition = ~.x %in% na_strings)


happiness_data2
# A tibble: 2,348 × 24
   zodiac wrkstat marital childs age   educ  sex   race  earnrs income
   <chr>  <chr>   <chr>   <chr>  <chr> <chr> <chr> <chr> <chr>  <chr> 
 1 Virgo  With a… Never … 0      43    14    MALE  White 1      .r:  …
 2 Aquar… Retired Separa… 3      74    10    FEMA… White 1      $25,0…
 3 Aries  Workin… Married 2      42    16    MALE  White 2      $25,0…
 4 Aries  Workin… Married 2      63    16    FEMA… White 2      .r:  …
 5 Cancer Retired Divorc… 0      71    18    MALE  Black 1      .r:  …
 6 Scorp… Retired Widowed 2      67    16    FEMA… White 1      .d:  …
 7 Leo    Workin… Divorc… 6      59    13    FEMA… Black 2      $15,0…
 8 Pisces Workin… Never … 0      43    12    MALE  White 1      $25,0…
 9 .n:  … Workin… Widowed 4      62    8     FEMA… White 1      $5,00…
10 Scorp… Workin… Married 2      55    12    MALE  White 1      $25,0…
# … with 2,338 more rows, and 14 more variables: polviews <chr>,
#   relig <chr>, happy <chr>, hapmar <chr>, life <chr>, satjob <chr>,
#   satfin <chr>, relpersn <chr>, sexfreq <chr>, hlthphys <chr>,
#   hlthmntl <chr>, satsoc <chr>, wtsscomp <dbl>, sei10 <chr>

3. Present some exploratory analysis. In particular: i) Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

# summarize data
# happiness_data2 <- happiness_data2 %>%
#  replace_na(replace = list(x = c(".i: Inapplicable", ".r: Refused", ".d: Do not Know/Cannot Choose", ".n:  No answer", ".x: # Not available in this release")))

# transform data into numbers

happiness_data2$age <- as.numeric(happiness_data2$age)

happiness_data2$educ <- as.numeric(happiness_data2$educ)

happiness_data2$earnrs <- as.numeric(happiness_data2$earnrs)

happiness_data2$sei10 <- as.numeric(happiness_data2$sei10)

# re-code data into numbers

happiness_data3 <- happiness_data2 %>%
  add_column(happy_num = if_else(.$happy == "Not too happy", 1, if_else(.$happy == "Pretty happy", 2, if_else(.$happy == "Very happy", 3,999))))  %>%
  add_column(hapmar_num = if_else(.$hapmar == "NOT TOO HAPPY", 1, if_else(.$hapmar == "PRETTY HAPPY", 2, if_else(.$hapmar == "VERY HAPPY", 3, 999)))) %>%
  add_column(life_num = if_else(.$life == "Dull", 1, if_else(.$life == "Routine", 2, if_else(.$life == "Exciting", 3, 999)))) %>%
  add_column(satjob_num = if_else(.$satjob == "Very dissatisfied", 1, if_else(.$satjob == "A little dissatisfied", 2, if_else(.$satjob == "Moderately satisfied", 3, if_else(.$satjob == "Very satisfied", 4, 999))))) %>%
  add_column(satfin_num = if_else(.$satfin == "Not satisfied at all", 1, if_else(.$satfin == "More or less satisfied", 2, if_else(.$satfin == "Pretty well satisfied", 3, 999)))) %>%
  add_column(relpersn_num = if_else(.$relpersn == "Not religious at all", 1, if_else(.$relpersn == "Slightly religious", 2, if_else(.$relpersn == "Moderately religious", 3, if_else(.$relpersn == "Very religious", 4, 999))))) %>%
  add_column(hlthphys_num = if_else(.$hlthphys == "Poor", 1, if_else(.$hlthphys == "Fair", 2, if_else(.$hlthphys == "Good", 3, if_else(.$hlthphys == "Very good", 4, if_else(.$hlthphys == "Excellent", 5, 999)))))) %>%
  add_column(hlthmntl_num = if_else(.$hlthmntl == "Poor", 1, if_else(.$hlthmntl == "Fair", 2, if_else(.$hlthmntl == "Good", 3, if_else(.$hlthmntl == "Very good", 4, if_else(.$hlthmntl == "Excellent", 5, 999)))))) %>%
  add_column(satsoc_num = if_else(.$satsoc == "Poor", 1, if_else(.$satsoc == "Fair", 2, if_else(.$satsoc == "Good", 3, if_else(.$satsoc == "Very good", 4, if_else(.$satsoc == "Excellent", 5, 999)))))) %>%
  add_column(childs_num = if_else(.$childs == 0, 0, if_else(.$childs == 1, 1, if_else(.$childs == 2, 2, if_else(.$childs == 3, 3, if_else(.$childs == 4, 4, if_else(.$childs == 5, 5, if_else(.$childs == 6, 6, if_else(.$childs == 7, 7, if_else(.$childs == "8 or more", 8, 999)))))))))) %>%
  replace_with_na_all(condition = ~.x == 999)
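
The nested if_else() chains above work, but the same recoding reads more cleanly with dplyr::case_when(), which returns NA for unmatched values, so the 999 sentinel and the final replace_with_na_all() step become unnecessary. A sketch for one variable:

library(dplyr)

happiness_data3 <- happiness_data2 %>%
  mutate(happy_num = case_when(
    happy == "Not too happy" ~ 1,
    happy == "Pretty happy"  ~ 2,
    happy == "Very happy"    ~ 3
    # anything else, including NA, falls through to NA automatically
  ))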


summary(happiness_data3)
    zodiac            wrkstat            marital         
 Length:2348        Length:2348        Length:2348       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
    childs               age             educ      
 Length:2348        Min.   :18.00   Min.   : 1.00  
 Class :character   1st Qu.:34.00   1st Qu.:12.00  
 Mode  :character   Median :48.00   Median :14.00  
                    Mean   :48.47   Mean   :13.76  
                    3rd Qu.:63.00   3rd Qu.:16.00  
                    Max.   :88.00   Max.   :20.00  
                    NA's   :36      NA's   :7      
     sex                race               earnrs     
 Length:2348        Length:2348        Min.   :0.000  
 Class :character   Class :character   1st Qu.:1.000  
 Mode  :character   Mode  :character   Median :1.000  
                                       Mean   :1.408  
                                       3rd Qu.:2.000  
                                       Max.   :7.000  
                                       NA's   :13     
    income            polviews            relig          
 Length:2348        Length:2348        Length:2348       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
    happy              hapmar              life          
 Length:2348        Length:2348        Length:2348       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
    satjob             satfin            relpersn        
 Length:2348        Length:2348        Length:2348       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
   sexfreq            hlthphys           hlthmntl        
 Length:2348        Length:2348        Length:2348       
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
    satsoc             wtsscomp          sei10        
 Length:2348        Min.   :0.4715   Min.   :-100.00  
 Class :character   1st Qu.:0.4715   1st Qu.:  24.20  
 Mode  :character   Median :0.9430   Median :  39.70  
                    Mean   :1.0000   Mean   :  40.69  
                    3rd Qu.:0.9430   3rd Qu.:  65.20  
                    Max.   :5.8974   Max.   :  92.80  
                                                      
   happy_num       hapmar_num       life_num       satjob_num   
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000  
 Median :2.000   Median :3.000   Median :2.000   Median :3.000  
 Mean   :2.156   Mean   :2.613   Mean   :2.446   Mean   :3.311  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000  
 Max.   :3.000   Max.   :3.000   Max.   :3.000   Max.   :4.000  
 NA's   :4       NA's   :1356    NA's   :793     NA's   :609    
   satfin_num     relpersn_num    hlthphys_num    hlthmntl_num  
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
 Median :2.000   Median :3.000   Median :3.000   Median :4.000  
 Mean   :2.078   Mean   :2.476   Mean   :3.335   Mean   :3.663  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :3.000   Max.   :4.000   Max.   :5.000   Max.   :5.000  
 NA's   :10      NA's   :20      NA's   :19      NA's   :19     
   satsoc_num      childs_num   
 Min.   :1.000   Min.   :0.000  
 1st Qu.:3.000   1st Qu.:0.000  
 Median :4.000   Median :2.000  
 Mean   :3.452   Mean   :1.855  
 3rd Qu.:4.000   3rd Qu.:3.000  
 Max.   :5.000   Max.   :8.000  
 NA's   :21      NA's   :4      

ii) Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

# did not include hapmar_num bc of so many NAs

happiness_data_num_1 <- happiness_data3 %>%
  select(happy_num, age, educ, childs_num, hlthphys_num, hlthmntl_num, relpersn_num)

# happiness_data_num_2 <- happiness_data3 %>%
#  select(happy_num, satfin_num, satjob_num, life_num, relpersn_num)

# happiness_data_num_3 <- happiness_data3 %>%
#  select(happy_num, hlthphys_num, hlthmntl_num, satsoc_num, childs_num)

pairs(happiness_data_num_1)
#pairs(happiness_data_num_2)

#pairs(happiness_data_num_3)

This is just one summary visual. Since most of these variables are categorical or ordinal, a scatter-plot matrix is not very informative, and I need a better way to visualize the data. I am also not sure I will be able to find meaningful insights, given how coarsely happiness is measured (just three levels). As a next step, I want to see how others have analyzed GSS happiness data for inspiration.
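One option for ordinal outcomes like this is a stacked proportion bar chart rather than scatter plots; a sketch with ggplot2 (assumed installed):

library(ggplot2)
library(dplyr)

happiness_data3 %>%
  filter(!is.na(happy), !is.na(marital)) %>%
  ggplot(aes(x = marital, fill = happy)) +
  geom_bar(position = "fill") +   # shows proportions within each marital status
  labs(x = "Marital status", y = "Proportion", fill = "Happiness")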