Question 1

(SMSS 14.3, 14.4, merged & modified)

(Data file: house.selling.price.2 from smss R package)

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

	Price	Size	Beds	Baths	New
Price	1	0.899	0.590	0.714	0.357
Size	0.899	1	0.669	0.662	0.176
Beds	0.590	0.669	1	0.334	0.267
Baths	0.714	0.662	0.334	1	0.182
New	0.357	0.176	0.267	0.182	1

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-41.795	12.104	-3.453	0.001
Size	64.761	5.630	11.504	0
Beds	-2.766	3.960	-0.698	0.487
Baths	19.203	5.650	3.399	0.001
New	18.984	3.873	4.902	0.00000

With these four predictors,

For backward elimination, which variable would be deleted first? Why?

Of the four predictors listed, Beds would be deleted first because it has the highest p-value (0.487) and is the only predictor that is not statistically significant.
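As a minimal sketch of that first backward-elimination step, the p-values from the coefficient table above can be entered by hand and the largest one picked out. The name `coef_table` is illustrative, and the zeros stand for p-values that round to 0 in the table.

```r
# p-values copied from the coefficient table above (intercept omitted)
coef_table <- data.frame(
  term    = c("Size", "Beds", "Baths", "New"),
  p_value = c(0, 0.487, 0.001, 0),
  stringsAsFactors = FALSE
)

# Backward elimination drops the least significant predictor first
drop_first <- coef_table$term[which.max(coef_table$p_value)]
drop_first
```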

For forward selection, which variable would be added first? Why?

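Size would be the first variable to be added, because it has the strongest correlation with Price (0.899) and would therefore have the smallest p-value in a one-predictor model. It also has the largest t value in the full model (11.504, versus 4.902 for New).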
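With a single predictor, the size of the t statistic increases with the absolute correlation between predictor and outcome, so the first forward-selection step can be sketched from the first row of the correlation matrix. The name `cor_with_price` is illustrative.

```r
# Correlations with Price, copied from the correlation matrix above
cor_with_price <- c(Size = 0.899, Beds = 0.590, Baths = 0.714, New = 0.357)

# In a one-predictor model, the largest |r| gives the smallest p-value
add_first <- names(which.max(abs(cor_with_price)))
add_first
```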

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
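One plausible explanation is that Beds overlaps heavily with Size (correlation 0.669), so it carries little extra information about Price once Size is in the model. As a rough check under that assumption, the sketch below computes the first-order partial correlation between Price and Beds, controlling for Size only, from the correlation matrix above. The variable names are illustrative.

```r
# Pairwise correlations copied from the correlation matrix above
r_price_beds <- 0.590
r_price_size <- 0.899
r_beds_size  <- 0.669

# Partial correlation of Price and Beds, holding Size fixed
partial_pb_s <- (r_price_beds - r_price_size * r_beds_size) /
  sqrt((1 - r_price_size^2) * (1 - r_beds_size^2))
partial_pb_s
```

The result is close to zero, which is consistent with Beds adding almost nothing once Size is accounted for.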

    library(smss)
    data("house.selling.price.2")
    hw1 <- house.selling.price.2

    hw1_lm_full <- lm(P ~ S + Be + Ba + New, data=hw1)

    model_terms <- c('New', 'Ba', 'Be', 'S', 'Be + Ba + New', 'S + Be + New', 'S + Ba + New', 'S + Be + Ba', 'Ba + New', 'Be + New', 'Be + Ba', 'S + New', 'S + Be', 'S + Ba', 'S + Be + Ba + New')

    hw1_model_stats <- data.frame(model = character(),
                                r2 = numeric(),
                                adj_r2 = numeric(),
                                PRESS = numeric(),
                                AIC = numeric(),
                                BIC = numeric(),
                                stringsAsFactors = FALSE)

    # Derived from https://www.statology.org/press-statistic/
    PRESS <- function(model) {
        i <- residuals(model)/(1 - lm.influence(model)$hat)
        sum(i^2)
    }

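    # Fit each candidate model and record its selection criteria.
    # Passing data = hw1 avoids attach(), which can mask other objects.
    for (i in seq_along(model_terms)) {
      lm.i <- lm(as.formula(paste("P ~", model_terms[i])), data = hw1)
      sul <- summary(lm.i)
      # c() coerces the whole row to character, so the table stores text
      rowsss <- c(model_terms[i],
                  signif(sul$r.squared, 4),
                  signif(sul$adj.r.squared, 4),
                  signif(PRESS(lm.i), 4),
                  signif(AIC(lm.i), 4),
                  signif(BIC(lm.i), 6)  # six significant digits, as in the table below
      )
      hw1_model_stats <- rbind(hw1_model_stats, rowsss)
    }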

    colnames(hw1_model_stats)<-c("Model", "R2", "Adj R2", "PRESS", "AIC", "BIC")

    kbl <- knitr::kable(hw1_model_stats)
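The `PRESS` helper above relies on the identity that a residual divided by one minus its leverage equals the leave-one-out prediction error. The sketch below checks that identity by brute force on the built-in `trees` data, used here only as a stand-in for the housing data. The names `press_shortcut` and `press_loo` are illustrative.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Shortcut: residuals rescaled by leverage, as in the PRESS helper
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without each row and predict the held-out row
loo_errors <- sapply(seq_len(nrow(trees)), function(i) {
  fit_i <- lm(Volume ~ Girth + Height, data = trees[-i, ])
  trees$Volume[i] - predict(fit_i, newdata = trees[i, ])
})
press_loo <- sum(loo_errors^2)

# TRUE when the two agree up to floating-point tolerance
isTRUE(all.equal(press_shortcut, press_loo))
```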

Using software with these four predictors, find the model that would be selected using each criterion:
1. R2 - The model with all four predictors has the highest R2, by a slim margin (0.8689 versus 0.8681 for Size, Baths, and New).
2. Adjusted R2 - Size, Baths, and New wins out, with the highest adjusted R2 (0.8637).
3. PRESS - Size, Baths, and New wins out again, with the lowest PRESS (27860).
4. AIC - Size, Baths, and New wins out again, with the lowest AIC (789.1).
5. BIC - Size, Baths, and New wins out again, with the lowest BIC (801.8).

Model	R2	Adj R2	PRESS	AIC	BIC
New	0.1271	0.1175	164000	960.9	968.506
Ba	0.5094	0.504	95730	907.3	914.93
Be	0.3484	0.3413	123000	933.7	941.315
S	0.8079	0.8058	38200	820.1	827.742
Be + Ba + New	0.6717	0.6606	67940	874	886.642
S + Be + New	0.8516	0.8466	30840	800.1	812.756
S + Ba + New	0.8681	0.8637	27860	789.1	801.8
S + Be + Ba	0.8331	0.8274	35100	811.1	823.739
Ba + New	0.5625	0.5528	87680	898.7	908.807
Be + New	0.391	0.3775	117500	929.4	939.563
Be + Ba	0.6488	0.641	70800	878.2	888.36
S + New	0.8484	0.845	31070	800.1	810.257
S + Be	0.8081	0.8038	38870	822	832.165
S + Ba	0.8328	0.8291	34170	809.2	819.355
S + Be + Ba + New	0.8689	0.8629	28390	790.6	805.818

Explain which model you prefer and why.

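Based on the five criteria listed in the previous question, the model that uses Size, Baths, and New as predictors won out on four of the five, so I would use this model. The only criterion it loses on is R2, which can never decrease when a predictor is added and so always favors the largest model.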
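The tally behind that choice can be reproduced from the table of criteria. This is a sketch that re-enters the table values by hand; the names `crit` and `best` are illustrative. Higher is better for R2 and adjusted R2, lower is better for PRESS, AIC, and BIC.

```r
# Criteria copied from the table above
crit <- data.frame(
  model  = c("New", "Ba", "Be", "S", "Be + Ba + New", "S + Be + New",
             "S + Ba + New", "S + Be + Ba", "Ba + New", "Be + New",
             "Be + Ba", "S + New", "S + Be", "S + Ba", "S + Be + Ba + New"),
  r2     = c(0.1271, 0.5094, 0.3484, 0.8079, 0.6717, 0.8516, 0.8681, 0.8331,
             0.5625, 0.391, 0.6488, 0.8484, 0.8081, 0.8328, 0.8689),
  adj_r2 = c(0.1175, 0.504, 0.3413, 0.8058, 0.6606, 0.8466, 0.8637, 0.8274,
             0.5528, 0.3775, 0.641, 0.845, 0.8038, 0.8291, 0.8629),
  press  = c(164000, 95730, 123000, 38200, 67940, 30840, 27860, 35100,
             87680, 117500, 70800, 31070, 38870, 34170, 28390),
  aic    = c(960.9, 907.3, 933.7, 820.1, 874, 800.1, 789.1, 811.1,
             898.7, 929.4, 878.2, 800.1, 822, 809.2, 790.6),
  bic    = c(968.506, 914.93, 941.315, 827.742, 886.642, 812.756, 801.8,
             823.739, 908.807, 939.563, 888.36, 810.257, 832.165, 819.355,
             805.818),
  stringsAsFactors = FALSE
)

# Winning model under each criterion
best <- c(
  R2     = crit$model[which.max(crit$r2)],
  Adj_R2 = crit$model[which.max(crit$adj_r2)],
  PRESS  = crit$model[which.min(crit$press)],
  AIC    = crit$model[which.min(crit$aic)],
  BIC    = crit$model[which.min(crit$bic)]
)
best
table(best)  # how many criteria each winning model takes
```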

Question 2

(Data file: trees from base R)

From the documentation:

“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

    hw2 <- trees
    hw2_lm <- lm(Volume ~ Height + Girth, data = hw2)
    par(mfrow = c(2,3)); plot(hw2_lm, which = 1:6)

The plots point to the following problems:

Homoskedasticity - According to the Residuals vs Fitted plot, the residuals appear to be grouped at various spots rather than spread evenly across the plot. This suggests some heteroskedasticity.
Influential observations - Judging by the Cook’s Distance and Residuals vs Leverage plots, observation 31 is unusually influential: it sits far from the other observations and falls outside the bound in the Residuals vs Leverage plot. Strictly speaking, this is a problem of influence rather than a violation of independence of errors, but it means the fit depends heavily on a single tree.
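The visual impression about observation 31 can be backed up numerically. A minimal sketch, refitting the same model on the built-in `trees` data and ranking observations by Cook's distance (the names `fit`, `cooks_d`, and `most_influential` are illustrative):

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Cook's distance for every tree, named by row number
cooks_d <- cooks.distance(fit)

# Row with the largest influence on the fitted coefficients
most_influential <- names(which.max(cooks_d))
most_influential

# The three most influential trees, largest first
head(sort(cooks_d, decreasing = TRUE), 3)
```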

Question 3

(inspired by ALR 9.16)

(Data file: florida in alr4 R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

    library(alr4)
    data(florida)
    hw3 <- florida

    hw3_lm <- lm(Buchanan ~ Bush, data = hw3)
    par(mfrow = c(2,3)); plot(hw3_lm, which = 1:6)

Judging by the plots, it is reasonable to conclude that Palm Beach is an outlier:

Residuals vs Fitted - While the vast majority of points are clumped in the middle, both Palm Beach and Dade County are way outside of all the other points.
Residuals vs Leverage - Palm Beach sits far away from the other points, which indicates that it exerts considerable influence on the fit.
Cook’s Distance - Palm Beach’s Cook’s distance is above 2, which again signals an unusually strong influence on the fitted line.
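A numerical complement to reading the plots is to flag observations by their externally studentized residuals. The sketch below defines a hypothetical helper, `flag_outliers`, and demonstrates it on toy data with one planted outlier (not the Florida data). The cutoff of 3 is a common rule of thumb, not a value taken from the assignment.

```r
# Hypothetical helper: names of observations whose externally
# studentized residual exceeds `cutoff` in absolute value
flag_outliers <- function(model, cutoff = 3) {
  r <- rstudent(model)
  names(r)[abs(r) > cutoff]
}

# Toy data: a clean linear trend with one planted outlier at row 10
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rep(c(-0.5, 0.5), 10)
toy$y[10] <- toy$y[10] + 40

toy_lm <- lm(y ~ x, data = toy)
flagged <- flag_outliers(toy_lm)
flagged
```

The same helper could be called on `hw3_lm`, in which case the returned names would be the row names of the `florida` data frame.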

PART 2 (Final Project)

The dataset that I intend to analyze pertains to predicting whether or not a start-up will be a success. The dataset can be found at this website: https://www.kaggle.com/datasets/manishkc06/startup-success-prediction

    library(readr)
    library(tidyverse)
    options(scipen=1, digits=3)

    startup_data_2 <- read_csv("/Users/nmb48ayatin_alexanderh/Documents/DACSS Classes/DACSS603/startup data 2.csv")

    ff_sum <- summary(startup_data_2$age_first_funding_year)
    fr_sum <- summary(startup_data_2$funding_rounds)

    funding_1styr_stats <- startup_data_2 %>%
      group_by(status) %>%
      summarise(avg_age_1stfundingyr = mean(age_first_funding_year, na.rm=TRUE),
                sd_age_1stfundingyr = sd(age_first_funding_year, na.rm=TRUE),
                iqr_age_1stfundingyr = IQR(age_first_funding_year, na.rm=TRUE)
                )

    funding_rounds_stats <- startup_data_2 %>%
      group_by(status) %>%
      summarise(avg_fundingrounds = mean(funding_rounds, na.rm=TRUE),
                sd_fundingrounds = sd(funding_rounds, na.rm=TRUE),
                iqr_fundingrounds = IQR(funding_rounds, na.rm=TRUE)
                )

    boxplot <- startup_data_2 %>%
      group_by(status) %>%
      dplyr::select(status, age_first_funding_year, funding_rounds) %>%
      tidyr::gather("Variable", "Value", 2:3) %>%
      ggplot(., aes(x = Variable, y = Value))+geom_boxplot()

What is your research question for the final project?

How does funding, particularly when or how often a start-up is funded, determine its success?

What is your hypothesis (i.e. an answer to the research question) that you want to test?

Start-ups that received their first round of funding 5 or more years after being founded will not be successful.

Present some exploratory analysis. In particular:

Numerically summarize (e.g. with the summary() function) the variables of interest (the outcome, the explanatory variable, the control variables).

Outcome Variable

Status	Frequency
acquired	597
closed	326

Explanatory Variables

status	avg_age_1stfundingyr	sd_age_1stfundingyr	iqr_age_1stfundingyr
acquired	2.10	1.94	2.81
closed	2.49	3.30	3.38

status	avg_fundingrounds	sd_fundingrounds	iqr_fundingrounds
acquired	2.52	1.4	2
closed	1.92	1.3	1
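As a rough first look at whether the two groups differ in age at first funding, the summary statistics above are enough to compute a Welch t statistic by hand. This is a sketch only: it assumes the group sizes equal the status frequencies (no missing ages), and the variable names are illustrative.

```r
# Summary statistics copied from the tables above
n_acquired    <- 597    # assumes no missing ages in this group
n_closed      <- 326    # assumes no missing ages in this group
mean_acquired <- 2.10
mean_closed   <- 2.49
sd_acquired   <- 1.94
sd_closed     <- 3.30

# Welch t statistic for the difference in mean age at first funding
se_diff <- sqrt(sd_acquired^2 / n_acquired + sd_closed^2 / n_closed)
t_welch <- (mean_closed - mean_acquired) / se_diff
t_welch
```

A value near 2 would be only borderline evidence of a difference in means, and it does not yet address the 5-year threshold in the hypothesis.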

Plot the relationships between key variables. You can do this any way you want, but one straightforward way of doing this would be with the pairs() function or other scatter plots / box plots. Interpret what you see.

Regardless of Status:

We can see that some start-ups received funding before being founded, as noted by the values and points below. Other start-ups took as many as 21 years to receive funding. On average, start-ups get their first round of funding around the 2nd year and receive around 2 rounds of funding.