DACSS 603 HW#4

Fourth homework for DACSS 603.

Alexander Hong
2022-04-14

Question 1

(SMSS 14.3, 14.4, merged & modified)

(Data file: house.selling.price.2 from smss R package)

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price. 

       Price  Size   Beds   Baths  New
Price  1      0.899  0.590  0.714  0.357
Size   0.899  1      0.669  0.662  0.176
Beds   0.590  0.669  1      0.334  0.267
Baths  0.714  0.662  0.334  1      0.182
New    0.357  0.176  0.267  0.182  1

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  -41.795   12.104      -3.453   0.001
Size          64.761    5.630      11.504   0
Beds          -2.766    3.960      -0.698   0.487
Baths         19.203    5.650       3.399   0.001
New           18.984    3.873       4.902   0.00000



With these four predictors,

  1. For backward elimination, which variable would be deleted first? Why?

Of the four predictors, Beds would be deleted first because it has the highest p-value (0.487).
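As a quick check on this choice, the t statistics can be recomputed from the coefficient table above, since t = Estimate / Std. Error. This is only a sketch built from the rounded values printed in the table, not from the fitted model object; the data frame `coefs` and its column names are my own.

```r
# Coefficient table copied from the model fit above (intercept omitted,
# because backward elimination only considers the predictors)
coefs <- data.frame(
  term     = c("Size", "Beds", "Baths", "New"),
  estimate = c(64.761, -2.766, 19.203, 18.984),
  std_err  = c(5.630, 3.960, 5.650, 3.873),
  stringsAsFactors = FALSE
)

# t statistic = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$std_err

# The smallest |t| corresponds to the largest p-value, so that term goes first
drop_first <- coefs$term[which.min(abs(coefs$t_value))]

print(coefs)
print(drop_first)
```

With the fitted model available, `drop1(hw1_lm_full, test = "F")` from base R should give the same ordering.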

  1. For forward selection, which variable would be added first? Why?

Size would be added first because it has the strongest correlation with Price (0.899) and, correspondingly, the largest t statistic (11.504) and the smallest p-value of the four predictors.
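The first step of forward selection can be sketched directly from the correlation matrix: with a single predictor, the one most correlated with Price gives the largest R2 and the smallest p-value. The vector `price_cor` below simply copies the Price row of the matrix above; the variable names are my own.

```r
# Correlations of each predictor with Price, copied from the matrix above
price_cor <- c(Size = 0.899, Beds = 0.590, Baths = 0.714, New = 0.357)

# In a one-predictor model, R2 is the squared correlation with the response
r2_single <- price_cor^2

# Forward selection adds the predictor with the strongest association first
first_in <- names(which.max(abs(price_cor)))

print(round(r2_single, 3))
print(first_in)
```

With the data loaded, `add1(lm(P ~ 1, data = hw1), scope = ~ S + Be + Ba + New, test = "F")` would run the same first step on the actual observations.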

  1. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
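One way to probe this question is to ask how much Beds is related to Price once Size is held fixed. The sketch below computes the first-order partial correlation from the rounded values in the correlation matrix above, so it is an approximation, and the variable names are my own.

```r
# Pairwise correlations copied from the matrix above
r_price_beds <- 0.590
r_price_size <- 0.899
r_beds_size  <- 0.669

# Partial correlation of Price and Beds, controlling for Size:
# (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
partial_price_beds <- (r_price_beds - r_price_size * r_beds_size) /
  sqrt((1 - r_price_size^2) * (1 - r_beds_size^2))

print(partial_price_beds)
```

The result is close to zero, which suggests that most of what Beds says about Price is already carried by Size. That overlap between predictors would explain the large p-value for Beds in the multiple regression.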
    library(smss)
    data("house.selling.price.2")
    hw1 <- house.selling.price.2

    # Full model; the data use P (Price), S (Size), Be (Beds), Ba (Baths), New
    hw1_lm_full <- lm(P ~ S + Be + Ba + New, data = hw1)

    # Right-hand sides of all 15 non-empty subsets of the four predictors
    model_terms <- c('New', 'Ba', 'Be', 'S', 'Be + Ba + New', 'S + Be + New', 'S + Ba + New', 'S + Be + Ba', 'Ba + New', 'Be + New', 'Be + Ba', 'S + New', 'S + Be', 'S + Ba', 'S + Be + Ba + New')

    # Empty data frame that collects one row of criteria per model
    hw1_model_stats <- data.frame(model = character(),
                                r2 = numeric(),
                                adj_r2 = numeric(),
                                PRESS = numeric(),
                                AIC = numeric(),
                                BIC = numeric(),
                                stringsAsFactors = FALSE)

    # PRESS: sum of squared leave-one-out prediction errors
    # Derived from https://www.statology.org/press-statistic/
    PRESS <- function(model) {
        loo_resid <- residuals(model) / (1 - lm.influence(model)$hat)
        sum(loo_resid^2)
    }

    # Fit each candidate model and record its selection criteria.
    # Passing data = hw1 replaces attach(), which can mask other objects.
    for (i in seq_along(model_terms)) {
      lm_i <- lm(as.formula(paste("P ~", model_terms[i])), data = hw1)
      lm_i_summary <- summary(lm_i)
      stats_row <- c(model_terms[i],
                     signif(lm_i_summary$r.squared, 4),
                     signif(lm_i_summary$adj.r.squared, 4),
                     signif(PRESS(lm_i), 4),
                     signif(AIC(lm_i), 4),
                     signif(BIC(lm_i))  # no digits argument, so the default of 6 applies
      )
      hw1_model_stats <- rbind(hw1_model_stats, stats_row)
    }

    colnames(hw1_model_stats) <- c("Model", "R2", "Adj R2", "PRESS", "AIC", "BIC")

    kbl <- knitr::kable(hw1_model_stats)
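The PRESS helper above relies on the identity that the leave-one-out prediction error for observation i equals e_i / (1 - h_i), where e_i is the ordinary residual and h_i the leverage. The sketch below checks that shortcut against a brute-force refit. To keep it self-contained it uses the built-in `trees` data rather than the housing data; that substitution is my own choice for illustration.

```r
# Shortcut PRESS using residuals and leverages from a single fit
PRESS <- function(model) {
  loo_resid <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(loo_resid^2)
}

fit <- lm(Volume ~ Girth + Height, data = trees)

# Brute force: drop each row in turn, refit, and predict the held-out row
press_loo <- sum(sapply(seq_len(nrow(trees)), function(i) {
  fit_i <- lm(Volume ~ Girth + Height, data = trees[-i, ])
  (trees$Volume[i] - predict(fit_i, newdata = trees[i, ]))^2
}))

# The two values should agree up to floating-point error
print(all.equal(PRESS(fit), press_loo))
```

The shortcut needs one fit instead of n, which is why it is convenient when looping over many candidate models.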
  1. Using software with these four predictors, find the model that would be selected using each criterion:

    1. R2 - The full model with all four predictors has the highest R2 (0.8689), narrowly ahead of Size, Baths, and New (0.8681). This is expected, since R2 never decreases when a predictor is added.

    2. Adjusted R2 - The model with Size, Baths, and New has the highest adjusted R2 (0.8637).

    3. PRESS - The model with Size, Baths, and New has the lowest PRESS (27860).

    4. AIC - The model with Size, Baths, and New has the lowest AIC (789.1).

    5. BIC - The model with Size, Baths, and New has the lowest BIC (801.8).

Model              R2      Adj R2  PRESS   AIC    BIC
New                0.1271  0.1175  164000  960.9  968.506
Ba                 0.5094  0.504   95730   907.3  914.93
Be                 0.3484  0.3413  123000  933.7  941.315
S                  0.8079  0.8058  38200   820.1  827.742
Be + Ba + New      0.6717  0.6606  67940   874    886.642
S + Be + New       0.8516  0.8466  30840   800.1  812.756
S + Ba + New       0.8681  0.8637  27860   789.1  801.8
S + Be + Ba        0.8331  0.8274  35100   811.1  823.739
Ba + New           0.5625  0.5528  87680   898.7  908.807
Be + New           0.391   0.3775  117500  929.4  939.563
Be + Ba            0.6488  0.641   70800   878.2  888.36
S + New            0.8484  0.845   31070   800.1  810.257
S + Be             0.8081  0.8038  38870   822    832.165
S + Ba             0.8328  0.8291  34170   809.2  819.355
S + Be + Ba + New  0.8689  0.8629  28390   790.6  805.818
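Rather than reading the winners off the table by eye, they can be picked programmatically. The sketch below re-enters the table values by hand, so it only checks the table as printed; the data frame `model_stats` and its column names are my own.

```r
# Values copied from the table above
model_stats <- data.frame(
  model  = c("New", "Ba", "Be", "S", "Be + Ba + New", "S + Be + New",
             "S + Ba + New", "S + Be + Ba", "Ba + New", "Be + New",
             "Be + Ba", "S + New", "S + Be", "S + Ba", "S + Be + Ba + New"),
  r2     = c(0.1271, 0.5094, 0.3484, 0.8079, 0.6717, 0.8516, 0.8681, 0.8331,
             0.5625, 0.3910, 0.6488, 0.8484, 0.8081, 0.8328, 0.8689),
  adj_r2 = c(0.1175, 0.5040, 0.3413, 0.8058, 0.6606, 0.8466, 0.8637, 0.8274,
             0.5528, 0.3775, 0.6410, 0.8450, 0.8038, 0.8291, 0.8629),
  press  = c(164000, 95730, 123000, 38200, 67940, 30840, 27860, 35100,
             87680, 117500, 70800, 31070, 38870, 34170, 28390),
  aic    = c(960.9, 907.3, 933.7, 820.1, 874.0, 800.1, 789.1, 811.1,
             898.7, 929.4, 878.2, 800.1, 822.0, 809.2, 790.6),
  bic    = c(968.506, 914.930, 941.315, 827.742, 886.642, 812.756, 801.800,
             823.739, 908.807, 939.563, 888.360, 810.257, 832.165, 819.355,
             805.818),
  stringsAsFactors = FALSE
)

# R2 and adjusted R2 are maximized; PRESS, AIC, and BIC are minimized
best <- c(
  R2     = model_stats$model[which.max(model_stats$r2)],
  Adj_R2 = model_stats$model[which.max(model_stats$adj_r2)],
  PRESS  = model_stats$model[which.min(model_stats$press)],
  AIC    = model_stats$model[which.min(model_stats$aic)],
  BIC    = model_stats$model[which.min(model_stats$bic)]
)

print(best)
```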
  1. Explain which model you prefer and why.

Based on the five criteria from the previous question, I prefer the model with Size, Baths, and New as predictors. It is selected by four of the five criteria, and the one criterion it loses on, R2, always favors the largest model.

Question 2

(Data file: trees from base R)

From the documentation:

“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular, 

fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.

  1. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
hw2 <- trees
hw2_lm <- lm(Volume ~ Height + Girth, data = hw2)
par(mfrow = c(2,3)); plot(hw2_lm, which = 1:6)

The following regression assumptions are being violated:

  1. Homoskedasticity - According to the Residuals vs Fitted plot, the residuals are grouped at various spots rather than scattered evenly around zero. This suggests some heteroskedasticity.
  2. Influential observation - Judging by the Cook's distance and Residuals vs Leverage plots, observation 31 is unusually influential: it sits far from the other observations and falls outside the Cook's distance bound in the Residuals vs Leverage plot. Strictly speaking this is not a violation of independence of errors, which these plots cannot assess, but it does mean that a single observation has an outsized effect on the fitted coefficients.
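The reading of the Cook's distance plot can be backed up numerically. The sketch below refits the same model on the built-in `trees` data and lists the observations whose Cook's distance exceeds 4/n. That cutoff is a common rule of thumb that I am assuming here, not something the assignment specifies.

```r
fit <- lm(Volume ~ Height + Girth, data = trees)

# Influence measures for every observation
cooks_d  <- cooks.distance(fit)
leverage <- hatvalues(fit)

# Observation with the largest Cook's distance
most_influential <- which.max(cooks_d)

# Rule-of-thumb cutoff (assumption): flag Cook's distance above 4 / n
flagged <- which(cooks_d > 4 / nrow(trees))

print(most_influential)
print(round(cooks_d[flagged], 3))
print(round(leverage[flagged], 3))
```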

Question 3

(inspired by ALR 9.16)

(Data file: florida in alr4 R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.


The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

library(alr4)
data(florida)
hw3 <- florida

hw3_lm <- lm(Buchanan ~ Bush, data = hw3)
par(mfrow = c(2,3)); plot(hw3_lm, which = 1:6)

Judging by the plots, it is reasonable to conclude that Palm Beach is an outlier.

  - In the Residuals vs Fitted plot, the vast majority of points are clumped in the middle, while both Palm Beach and Dade County lie far outside all the other points.
  - In the Residuals vs Leverage plot, Palm Beach appears to exert significant leverage, judging by its position far away from the other points.
  - The Cook's distance plot shows that Palm Beach's distance is above 2, which again indicates how far it sits from the other data points.
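A visual judgment like this can be complemented with a formal check. The sketch below implements a Bonferroni-adjusted outlier test on the externally studentized residuals. The helper `bonferroni_outliers` is my own, and the data are a synthetic toy set with one planted outlier, used only so the example runs without the `alr4` package.

```r
# Flag observations whose studentized residual is significant after a
# Bonferroni correction for testing all n observations
bonferroni_outliers <- function(model, alpha = 0.05) {
  t_i   <- rstudent(model)             # externally studentized residuals
  n     <- length(t_i)
  df    <- df.residual(model) - 1      # one fewer df: case i is left out
  p_raw <- 2 * pt(-abs(t_i), df = df)  # two-sided p-value per observation
  p_adj <- pmin(1, n * p_raw)          # Bonferroni adjustment
  which(p_adj < alpha)
}

# Synthetic toy data (not the Florida data): a linear trend plus noise,
# with observation 15 shifted upward to act as a planted outlier
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 + 0.5 * toy$x + rnorm(30)
toy$y[15] <- toy$y[15] + 15

toy_fit <- lm(y ~ x, data = toy)
flagged <- bonferroni_outliers(toy_fit)
print(flagged)
```

Applied to the model above, the call would be `bonferroni_outliers(hw3_lm)`; if Palm Beach appears in the result, that supports the conclusion drawn from the plots.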

PART 2 (Final Project)

The dataset I intend to analyze is used to predict whether a start-up will succeed. It can be found at this website: https://www.kaggle.com/datasets/manishkc06/startup-success-prediction

library(readr)
library(tidyverse)
options(scipen=1, digits=3)

startup_data_2 <- read_csv("/Users/nmb48ayatin_alexanderh/Documents/DACSS Classes/DACSS603/startup data 2.csv")

ff_sum <- summary(startup_data_2$age_first_funding_year)
fr_sum <- summary(startup_data_2$funding_rounds)

funding_1styr_stats <- startup_data_2 %>%
  group_by(status) %>%
  summarise(avg_age_1stfundingyr = mean(age_first_funding_year, na.rm=TRUE),
            sd_age_1stfundingyr = sd(age_first_funding_year, na.rm=TRUE),
            iqr_age_1stfundingyr = IQR(age_first_funding_year, na.rm=TRUE)
            ) 

funding_rounds_stats <- startup_data_2 %>%
  group_by(status) %>%
  summarise(avg_fundingrounds = mean(funding_rounds, na.rm=TRUE),
            sd_fundingrounds = sd(funding_rounds, na.rm=TRUE),
            iqr_fundingrounds = IQR(funding_rounds, na.rm=TRUE)
            )

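# Reshape the two funding variables to long format and draw one box per variable
boxplot <- startup_data_2 %>%
  dplyr::select(status, age_first_funding_year, funding_rounds) %>%
  tidyr::pivot_longer(c(age_first_funding_year, funding_rounds),
                      names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot()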
  1. What is your research question for the final project?

How does funding, particularly when or how often a start-up is funded, determine its success?

  1. What is your hypothesis (i.e. an answer to the research question) that you want to test?

Start-ups that received their first round of funding five or more years after being founded will not be successful.

  1. Present some exploratory analysis. In particular:
  1. Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

Outcome Variable

Status Frequency
acquired 597
closed 326

Explanatory Variables

status    avg_age_1stfundingyr  sd_age_1stfundingyr  iqr_age_1stfundingyr
acquired  2.10                  1.94                 2.81
closed    2.49                  3.30                 3.38

status    avg_fundingrounds  sd_fundingrounds  iqr_fundingrounds
acquired  2.52               1.4               2
closed    1.92               1.3               1
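As a rough first look at whether the difference in age at first funding is more than noise, a Welch two-sample t statistic can be computed from the summary numbers above. This is only a sketch: it uses the rounded table values, and it assumes `age_first_funding_year` has no missing values, so that the group sizes equal the status counts (597 and 326). The variable names are my own.

```r
# Summary values copied from the tables above
n        <- c(acquired = 597, closed = 326)   # assumes no missing ages
mean_age <- c(acquired = 2.10, closed = 2.49)
sd_age   <- c(acquired = 1.94, closed = 3.30)

# Share of start-ups in each outcome class
share <- n / sum(n)

# Welch t statistic: difference in means over its standard error
se2    <- sd_age^2 / n
t_stat <- unname((mean_age["acquired"] - mean_age["closed"]) / sqrt(sum(se2)))

# Welch-Satterthwaite degrees of freedom and two-sided p-value
df_welch <- sum(se2)^2 / sum(se2^2 / (n - 1))
p_value  <- 2 * pt(-abs(t_stat), df = df_welch)

print(round(share, 3))
print(c(t = t_stat, df = df_welch, p = p_value))
```

With the raw data, `t.test(age_first_funding_year ~ status, data = startup_data_2)` would run the same test without the rounding and missing-value assumptions.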
  1. Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

Regardless of Status:

We can see that some start-ups received funding before being founded, as noted by the values and points below. Other start-ups took as many as 21 years to receive funding. On average, start-ups get their first round of funding around the second year and receive around two rounds of funding.