Fourth homework for DACSS 603.
(SMSS 14.3, 14.4, merged & modified)
(Data file: house.selling.price.2 from smss R package)
For the house.selling.price.2 data, the tables below show a correlation matrix and a model fit using four predictors of selling price.
| | Price | Size | Beds | Baths | New |
|---|---|---|---|---|---|
| Price | 1 | 0.899 | 0.590 | 0.714 | 0.357 |
| Size | 0.899 | 1 | 0.669 | 0.662 | 0.176 |
| Beds | 0.590 | 0.669 | 1 | 0.334 | 0.267 |
| Baths | 0.714 | 0.662 | 0.334 | 1 | 0.182 |
| New | 0.357 | 0.176 | 0.267 | 0.182 | 1 |

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -41.795 | 12.104 | -3.453 | 0.001 |
| Size | 64.761 | 5.630 | 11.504 | 0 |
| Beds | -2.766 | 3.960 | -0.698 | 0.487 |
| Baths | 19.203 | 5.650 | 3.399 | 0.001 |
| New | 18.984 | 3.873 | 4.902 | 0.00000 |
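With these four predictors:
For backward elimination, Beds would be deleted first because it has the highest p-value of the four predictors (0.487).
For forward selection, Size would be the first variable to be added. It has the strongest correlation with Price (0.899) and the largest t statistic (11.504, versus 4.902 for New), so it is the most significant predictor even though the p-values for Size and New both round to 0 in the table.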
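Both choices can be read straight off the tables. The following is a minimal sketch that retypes the reported p-values and correlations by hand; the object names `coef_table` and `cor_with_price` are illustrative and not part of the original analysis.

```r
# Values retyped from the tables above; object names are illustrative.
coef_table <- data.frame(
  term    = c("Size", "Beds", "Baths", "New"),
  t_value = c(11.504, -0.698, 3.399, 4.902),
  p_value = c(0, 0.487, 0.001, 0),
  stringsAsFactors = FALSE
)
cor_with_price <- c(Size = 0.899, Beds = 0.590, Baths = 0.714, New = 0.357)

# Backward elimination: drop the term with the largest p-value first.
drop_first <- coef_table$term[which.max(coef_table$p_value)]

# Forward selection: add the predictor most strongly correlated with Price first.
add_first <- names(cor_with_price)[which.max(abs(cor_with_price))]

print(c(drop_first = drop_first, add_first = add_first))
```

With the fitted data available, `drop1()` and `add1()` from base R would be the usual way to run these steps on the models themselves.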
library(smss)
data("house.selling.price.2")
hw1 <- house.selling.price.2
# Full model with all four predictors (P = price, S = size, Be = beds, Ba = baths)
hw1_lm_full <- lm(P ~ S + Be + Ba + New, data = hw1)
# Right-hand sides of all 15 non-empty subsets of the four predictors
model_terms <- c('New', 'Ba', 'Be', 'S', 'Be + Ba + New', 'S + Be + New', 'S + Ba + New', 'S + Be + Ba', 'Ba + New', 'Be + New', 'Be + Ba', 'S + New', 'S + Be', 'S + Ba', 'S + Be + Ba + New')
# Derived from https://www.statology.org/press-statistic/
# PRESS is the sum of squared leave-one-out prediction errors
PRESS <- function(model) {
  loo_resid <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(loo_resid^2)
}
# Fit each candidate model and collect its selection criteria in one row.
# Passing data = hw1 avoids the need for attach().
hw1_model_stats <- do.call(rbind, lapply(model_terms, function(terms) {
  fit <- lm(as.formula(paste("P ~", terms)), data = hw1)
  fit_summary <- summary(fit)
  data.frame(Model = terms,
             R2 = signif(fit_summary$r.squared, 4),
             `Adj R2` = signif(fit_summary$adj.r.squared, 4),
             PRESS = signif(PRESS(fit), 4),
             AIC = signif(AIC(fit), 4),
             BIC = signif(BIC(fit), 6),  # BIC keeps six significant digits, as in the table
             check.names = FALSE,
             stringsAsFactors = FALSE)
}))
kbl <- knitr::kable(hw1_model_stats)
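The PRESS helper relies on the identity that, for a linear model, the leave-one-out prediction error for observation i equals its ordinary residual divided by one minus its leverage. Below is a rough check of that shortcut against brute-force refitting. It uses the built-in `trees` data only because it ships with R, and the function names `press_shortcut` and `press_brute_force` are illustrative.

```r
# Shortcut: leave-one-out residual = ordinary residual / (1 - leverage).
press_shortcut <- function(model) {
  loo_resid <- residuals(model) / (1 - hatvalues(model))
  sum(loo_resid^2)
}

# Brute force: refit without row i, then predict the held-out row i.
press_brute_force <- function(formula, data) {
  errors <- vapply(seq_len(nrow(data)), function(i) {
    held_out <- data[i, , drop = FALSE]
    fit_without_i <- lm(formula, data = data[-i, , drop = FALSE])
    observed <- model.response(model.frame(formula, held_out))
    as.numeric(observed - predict(fit_without_i, newdata = held_out))
  }, numeric(1))
  sum(errors^2)
}

trees_formula <- Volume ~ Girth + Height
press_a <- press_shortcut(lm(trees_formula, data = trees))
press_b <- press_brute_force(trees_formula, trees)
print(c(shortcut = press_a, brute_force = press_b))
```

The two numbers should agree up to floating-point error, which is why the shortcut avoids refitting the model once per observation.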
Using software with these four predictors, find the model that would be selected using each criterion:
R2 - Using all four predictors wins out by a slim margin (0.8689, versus 0.8681 for Size, Baths, and New).
Adjusted R2 - Using Size, Baths, and New wins out (0.8637).
PRESS - Again, using Size, Baths, and New wins out (27860).
AIC - Again, using Size, Baths, and New wins out (789.1).
BIC - Again, using Size, Baths, and New wins out (801.8).
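For reference, these are the standard definitions of the five criteria for a model with p predictors fit to n observations. The notation is mine rather than the textbook's: e_i is the i-th residual, h_ii its leverage, and L-hat the maximized likelihood. For `lm` fits, R's `AIC()` and `BIC()` count k = p + 2 parameters (the intercept, the p slopes, and the error variance).

```latex
\begin{aligned}
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},\\
\mathrm{PRESS} &= \sum_{i=1}^{n} \left(\frac{e_i}{1-h_{ii}}\right)^{2},\\
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n .
\end{aligned}
```

Larger values are better for R2 and adjusted R2, and smaller values are better for PRESS, AIC, and BIC. Because SSE cannot increase when a predictor is added, R2 alone always favors the largest model, while the other four criteria penalize extra predictors.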
| Model | R2 | Adj R2 | PRESS | AIC | BIC |
|---|---|---|---|---|---|
| New | 0.1271 | 0.1175 | 164000 | 960.9 | 968.506 |
| Ba | 0.5094 | 0.504 | 95730 | 907.3 | 914.93 |
| Be | 0.3484 | 0.3413 | 123000 | 933.7 | 941.315 |
| S | 0.8079 | 0.8058 | 38200 | 820.1 | 827.742 |
| Be + Ba + New | 0.6717 | 0.6606 | 67940 | 874 | 886.642 |
| S + Be + New | 0.8516 | 0.8466 | 30840 | 800.1 | 812.756 |
| S + Ba + New | 0.8681 | 0.8637 | 27860 | 789.1 | 801.8 |
| S + Be + Ba | 0.8331 | 0.8274 | 35100 | 811.1 | 823.739 |
| Ba + New | 0.5625 | 0.5528 | 87680 | 898.7 | 908.807 |
| Be + New | 0.391 | 0.3775 | 117500 | 929.4 | 939.563 |
| Be + Ba | 0.6488 | 0.641 | 70800 | 878.2 | 888.36 |
| S + New | 0.8484 | 0.845 | 31070 | 800.1 | 810.257 |
| S + Be | 0.8081 | 0.8038 | 38870 | 822 | 832.165 |
| S + Ba | 0.8328 | 0.8291 | 34170 | 809.2 | 819.355 |
| S + Be + Ba + New | 0.8689 | 0.8629 | 28390 | 790.6 | 805.818 |
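Based on the five criteria listed in the previous question, the model that uses Size, Baths, and New as predictors won out on four of the five, so I would use this model. The only exception is R2, which cannot decrease when a predictor is added and therefore always favors the full model.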
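The winner under each criterion can also be picked out programmatically. The sketch below retypes the table above by hand instead of refitting the models; `criteria` and `best_model` are illustrative names.

```r
# Values retyped from the table above; object names are illustrative.
criteria <- data.frame(
  model = c("New", "Ba", "Be", "S", "Be + Ba + New", "S + Be + New",
            "S + Ba + New", "S + Be + Ba", "Ba + New", "Be + New",
            "Be + Ba", "S + New", "S + Be", "S + Ba", "S + Be + Ba + New"),
  r2     = c(0.1271, 0.5094, 0.3484, 0.8079, 0.6717, 0.8516, 0.8681, 0.8331,
             0.5625, 0.391, 0.6488, 0.8484, 0.8081, 0.8328, 0.8689),
  adj_r2 = c(0.1175, 0.504, 0.3413, 0.8058, 0.6606, 0.8466, 0.8637, 0.8274,
             0.5528, 0.3775, 0.641, 0.845, 0.8038, 0.8291, 0.8629),
  press  = c(164000, 95730, 123000, 38200, 67940, 30840, 27860, 35100,
             87680, 117500, 70800, 31070, 38870, 34170, 28390),
  aic    = c(960.9, 907.3, 933.7, 820.1, 874, 800.1, 789.1, 811.1,
             898.7, 929.4, 878.2, 800.1, 822, 809.2, 790.6),
  bic    = c(968.506, 914.93, 941.315, 827.742, 886.642, 812.756, 801.8,
             823.739, 908.807, 939.563, 888.36, 810.257, 832.165, 819.355,
             805.818),
  stringsAsFactors = FALSE
)

# Larger is better for R2 and adjusted R2; smaller is better for the rest.
best_model <- c(
  r2     = criteria$model[which.max(criteria$r2)],
  adj_r2 = criteria$model[which.max(criteria$adj_r2)],
  press  = criteria$model[which.min(criteria$press)],
  aic    = criteria$model[which.min(criteria$aic)],
  bic    = criteria$model[which.min(criteria$bic)]
)
print(best_model)
```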
(Data file: trees from base R)
From the documentation:
“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
hw2 <- trees
hw2_lm <- lm(Volume ~ Height + Girth, data = hw2)
par(mfrow = c(2,3)); plot(hw2_lm, which = 1:6)
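The diagnostic plots can be backed up with a few numeric checks. The sketch below assumes a Shapiro-Wilk test as a companion to the Q-Q plot and an added squared `Girth` term as a probe for curvature. These checks are my own choice and not part of the assignment, and the object names are illustrative.

```r
trees_lm <- lm(Volume ~ Height + Girth, data = trees)

# Normality of residuals (companion to the Normal Q-Q plot).
normality_p <- shapiro.test(residuals(trees_lm))$p.value

# Curvature probe (companion to Residuals vs Fitted): does a squared
# Girth term significantly improve the fit?
trees_lm_quad <- lm(Volume ~ Height + Girth + I(Girth^2), data = trees)
curvature_p <- anova(trees_lm, trees_lm_quad)$`Pr(>F)`[2]

# Influence (companion to the Cook's distance plot).
cooks <- cooks.distance(trees_lm)
most_influential <- which.max(cooks)

print(c(normality_p = normality_p, curvature_p = curvature_p))
print(cooks[most_influential])
```

A small `curvature_p` would suggest that the linearity assumption is questionable, and a Cook's distance that stands well apart from the rest would point to an influential tree.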
The following regression assumptions are being violated:
(inspired by ALR 9.16)
(Data file: florida in alr4 R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
library(alr4)
data(florida)
hw3 <- florida
hw3_lm <- lm(Buchanan ~ Bush, data = hw3)
par(mfrow = c(2,3)); plot(hw3_lm, which = 1:6)
Judging by the plots, it is reasonable to conclude that Palm Beach County is an outlier:
- In the Residuals vs Fitted plot, the vast majority of points are clumped in the middle, while both Palm Beach and Dade County sit far outside all the other points.
- In the Residuals vs Leverage plot, Palm Beach appears to exert significant influence, judging by its position far away from the other points.
- In the Cook's distance plot, Palm Beach's distance is above 2, which again marks it as an unusually influential point.
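The visual reading can be turned into a numeric screen. The sketch below uses a hypothetical helper, `flag_outliers`, with rule-of-thumb cutoffs (absolute studentized residual above 3, Cook's distance above 4/n) that are my assumptions and not values from the textbook. It runs on made-up toy data so that it is self-contained.

```r
# Flag observations that are extreme by studentized residual or Cook's distance.
# The cutoffs are common rules of thumb, not fixed standards.
flag_outliers <- function(model, resid_cutoff = 3) {
  n <- length(residuals(model))
  stats <- data.frame(
    rstudent = rstudent(model),
    cooks_d  = cooks.distance(model),
    leverage = hatvalues(model)
  )
  stats[abs(stats$rstudent) > resid_cutoff | stats$cooks_d > 4 / n, ]
}

# Made-up toy data: a straight line with small wiggles and one planted outlier.
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rep(c(0.3, -0.2, 0.1, -0.4), times = 5)
toy$y[10] <- toy$y[10] + 30
flagged <- flag_outliers(lm(y ~ x, data = toy))
print(flagged)
```

Applied to the Florida fit, the call would be `flag_outliers(hw3_lm)`, and the row names of the result would show which counties cross either cutoff.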
The dataset that I intend to analyze concerns predicting whether a start-up will succeed. It can be found at this website: https://www.kaggle.com/datasets/manishkc06/startup-success-prediction
library(readr)
library(tidyverse)
options(scipen=1, digits=3)
startup_data_2 <- read_csv("/Users/nmb48ayatin_alexanderh/Documents/DACSS Classes/DACSS603/startup data 2.csv")
ff_sum <- summary(startup_data_2$age_first_funding_year)
fr_sum <- summary(startup_data_2$funding_rounds)
funding_1styr_stats <- startup_data_2 %>%
group_by(status) %>%
summarise(avg_age_1stfundingyr = mean(age_first_funding_year, na.rm=TRUE),
sd_age_1stfundingyr = sd(age_first_funding_year, na.rm=TRUE),
iqr_age_1stfundingyr = IQR(age_first_funding_year, na.rm=TRUE)
)
funding_rounds_stats <- startup_data_2 %>%
group_by(status) %>%
summarise(avg_fundingrounds = mean(funding_rounds, na.rm=TRUE),
sd_fundingrounds = sd(funding_rounds, na.rm=TRUE),
iqr_fundingrounds = IQR(funding_rounds, na.rm=TRUE)
)
# Boxplots of both funding variables, pooled across status
boxplot <- startup_data_2 %>%
  dplyr::select(status, age_first_funding_year, funding_rounds) %>%
  tidyr::pivot_longer(cols = -status, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) + geom_boxplot()
How does funding, particularly when or how often a start-up is funded, determine its success?
Start-ups that receive their first round of funding five or more years after being founded will not be successful.
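One simple way to examine this hypothesis is to cross-tabulate status against an indicator for late first funding. The sketch below uses a handful of made-up rows purely for illustration. The column names follow the data set, while `toy_startups` and `late_first_funding` are hypothetical names.

```r
# Made-up rows for illustration only; these are not real start-ups.
toy_startups <- data.frame(
  status = c("acquired", "closed", "acquired", "closed", "closed", "acquired"),
  age_first_funding_year = c(1.2, 6.5, 0.4, 5.0, 2.3, 7.1),
  stringsAsFactors = FALSE
)

# Indicator for the hypothesis: first funding five or more years after founding.
toy_startups$late_first_funding <- toy_startups$age_first_funding_year >= 5

# Counts and row proportions of each outcome for early vs late first funding.
late_by_status <- table(late = toy_startups$late_first_funding,
                        status = toy_startups$status)
print(late_by_status)
print(prop.table(late_by_status, margin = 1))
```

On the real data, the same table built from `startup_data_2` would show whether the closed share is noticeably higher among late-funded start-ups.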
| Status | Frequency |
|---|---|
| acquired | 597 |
| closed | 326 |
| status | avg_age_1stfundingyr | sd_age_1stfundingyr | iqr_age_1stfundingyr |
|---|---|---|---|
| acquired | 2.10 | 1.94 | 2.81 |
| closed | 2.49 | 3.30 | 3.38 |
| status | avg_fundingrounds | sd_fundingrounds | iqr_fundingrounds |
|---|---|---|---|
| acquired | 2.52 | 1.4 | 2 |
| closed | 1.92 | 1.3 | 1 |
Regardless of status:
We can see that some start-ups received funding before being founded, as noted by the values and points below. Others took as many as 21 years to receive funding. On average, start-ups get their first round of funding around their second year and receive around two rounds of funding.