library(ggplot2)  # used for the residual plots throughout

fevdata <- read.csv("fevdata.txt", header = TRUE, sep = "")
HAMILTON <- read.csv("HAMILTON.txt", header = TRUE, sep = "\t")
basketball <- read.csv("Basketball.txt")
names(basketball)[names(basketball) == 'X.1'] <- 'Win'

# Recode the factor levels of sex and smoke to 0/1
levels(fevdata$sex) <- c(0, 1)
levels(fevdata$smoke) <- c(0, 1)
model_a <- lm(fevdata$fev ~ fevdata$age)
bc <- MASS::boxcox(model_a, plotit = FALSE)
best_lm <- bc$x[which.max(bc$y)]
best_lm
## [1] 0.2
The estimated \(\lambda\) is 0.2, which is close to 0, so a log transformation of fev seems best.
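For reference, the Box-Cox family that MASS::boxcox profiles over is
\[
\begin{align*}
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(y), & \lambda = 0
\end{cases}
\end{align*}
\]
so an estimated \(\lambda\) near 0 points to the log transformation.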
model_b <- lm(I(log(fevdata$fev)) ~ fevdata$age)
p1<-ggplot(model_b, aes(fevdata$age, resid(model_b))) + geom_point()
p1<-p1+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p1<-p1+xlab("Age")+ylab("Residuals")
p1<-p1+ggtitle("Residual vs Age")+theme_bw()
p1
The transformation has not improved adherence to the constant-variance assumption. This linear model is not acceptable because the residual variance is not constant along the x axis.
model_log_a <- lm(fevdata$fev ~ I(log(fevdata$age)))
bc <- MASS::boxcox(model_log_a, plotit = FALSE)
best_lm <- bc$x[which.max(bc$y)]
best_lm
## [1] 0
log_a <- lm(I(log(fevdata$fev)) ~ I(log(fevdata$age)))
p2<-ggplot(log_a, aes(log(fevdata$age), resid(log_a))) + geom_point()
p2<-p2+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p2<-p2+xlab("log(age)")+ylab("Residuals")
p2<-p2+ggtitle("Residual vs log(age)")+theme_bw()
p2
The shape of the loess curve is different but seems equally bad. In particular, it exhibits a similar sine-wave-like shape along the x axis. This time, the curvature is worst at low and high x values.
model_d_1 <- lm(fevdata$fev ~ I(log(fevdata$age)) + fevdata$ht)
p3<-ggplot(model_d_1, aes(.fitted, .resid)) + geom_point()
p3<-p3+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p3<-p3+xlab("Fitted value")+ylab("Residuals")
p3<-p3+ggtitle("Residuals: fev ~ log(age) + ht")+theme_bw()
model_d_2 <- lm(I(log(fevdata$fev)) ~ I(log(fevdata$age)) + fevdata$ht)
p4<-ggplot(model_d_2, aes(.fitted, .resid)) + geom_point()
p4<-p4+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p4<-p4+xlab("Fitted value")+ylab("Residuals")
p4<-p4+ggtitle("Residuals: log(fev) ~ log(age) + ht")+theme_bw()
model_d_all_log <- lm(I(log(fevdata$fev)) ~ I(log(fevdata$age)) + I(log(fevdata$ht)))
p5<-ggplot(model_d_all_log, aes(.fitted, .resid)) + geom_point()
p5<-p5+stat_smooth(method="loess")+geom_hline(yintercept=0, col="red", linetype="dashed")
p5<-p5+xlab("Fitted value")+ylab("Residuals")
p5<-p5+ggtitle("Residuals: log(fev) ~ log(age) + log(ht)")+theme_bw()
cowplot::plot_grid(p3, p4, p5, ncol = 3, nrow = 1)
According to the residual plots, the model with ht on its original scale and log(fev) as the response adheres best to the model assumptions. There is slightly less upward curvature away from the dashed zero line for the model using ht on its original scale. The fitted equation is \[
\begin{align*}
\log(fev) = -2.1705476 + 0.1945704 \; \log(Age) + 0.0433138 \; HT \\
\end{align*}
\]
CI <- confint(model_d_2)
CI
## 2.5 % 97.5 %
## (Intercept) -2.29651917 -2.04457606
## I(log(fevdata$age)) 0.13069595 0.25844479
## fevdata$ht 0.03981037 0.04681717
We are 95% confident that the true intercept of the model is between -2.2965192 and -2.0445761. That is, on average, log(fev) will be somewhere between -2.2965192 and -2.0445761 when both log(age) and ht are zero.
We are 95% confident that the true elasticity of the variable age is between 0.130696 and 0.2584448. That is, on average, for every 1% increase in age, fev will increase by somewhere between 0.13% and 0.26%.
We are 95% confident that the true coefficient of the variable ht is between 0.0398104 and 0.0468172. That is, for every unit increase in ht, fev will increase by approximately 3.98% to 4.68%, on average.
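The 3.98% to 4.68% reading uses the usual \(100\beta\) approximation for a log response; the exact percentage change is \(100(e^{\beta} - 1)\). A minimal sketch recovering it from the interval above:
# The 100*beta reading is an approximation; the exact percent change is
# 100 * (exp(beta) - 1), here roughly 4.06% to 4.79% for ht.
100 * (exp(CI["fevdata$ht", ]) - 1)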
cor(HAMILTON$Y, HAMILTON$X1)
## [1] 0.002497966
cor(HAMILTON$Y, HAMILTON$X2)
## [1] 0.4340688
y_cor_x1 <- cor.test(HAMILTON$Y, HAMILTON$X1)
y_cor_x1
##
## Pearson's product-moment correlation
##
## data: HAMILTON$Y and HAMILTON$X1
## t = 0.0090066, df = 13, p-value = 0.993
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.510418 0.514103
## sample estimates:
## cor
## 0.002497966
The p-value of the Pearson correlation test is 0.9929506. There is no evidence of a linear relation between sale price and appraised land value.
y_cor_x2 <- cor.test(HAMILTON$Y, HAMILTON$X2)
y_cor_x2
##
## Pearson's product-moment correlation
##
## data: HAMILTON$Y and HAMILTON$X2
## t = 1.7373, df = 13, p-value = 0.106
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1005527 0.7741858
## sample estimates:
## cor
## 0.4340688
The p-value of the Pearson correlation test is 0.1059593. At the 5% level, there is insufficient evidence of a linear relation between sale price and the appraised value of the improvements.
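As a sanity check, the reported t statistic can be reproduced from the sample correlation, since \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) and df = 13 implies n = 15. A minimal sketch:
# t = r * sqrt(n - 2) / sqrt(1 - r^2); df = 13, so n - 2 = 13
r <- cor(HAMILTON$Y, HAMILTON$X2)
r * sqrt(13) / sqrt(1 - r^2)  # matches the reported t = 1.7373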
\[ \begin{align*} \operatorname{E} (y \mid X ) = \beta_0 + \beta_1 x_1 + \beta_2 x_2\\ \end{align*} \]
No. Neither variable is significantly correlated with the response.
HAMILTON_model <- lm(HAMILTON$Y ~ HAMILTON$X1 + HAMILTON$X2)
summary(HAMILTON_model)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.154136 0.611418065 -73.85149 2.527961e-17
## HAMILTON$X1 3.097008 0.012274433 252.31373 1.010771e-23
## HAMILTON$X2 1.031859 0.003684173 280.07891 2.888447e-24
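Written out in the notation of the model in (b), the fitted equation from the summary above is \[
\begin{align*}
\hat{y} = -45.154136 + 3.097008 \; x_1 + 1.031859 \; x_2 \\
\end{align*}
\]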
result <- data.frame(resid = MASS::studres(HAMILTON_model), fitted = HAMILTON_model$fitted.values,
                     X1 = HAMILTON$X1, X2 = HAMILTON$X2)
HAMILTON_resid_fitted <- ggplot(result, aes(fitted, resid)) +
geom_point()+
stat_smooth(method="loess")+
geom_hline(yintercept=0, col="red", linetype="dashed")+
xlab("Fitted value")+
ylab("Studentized\nresiduals")+
ggtitle("Studentized residuals vs Fitted")+
theme_bw()
HAMILTON_resid_X1 <- ggplot(result, aes(X1, resid)) +
geom_point()+
stat_smooth(method="loess")+
geom_hline(yintercept=0, col="red", linetype="dashed")+
xlab("Land Value")+
ylab("Studentized\nresiduals")+
ggtitle("Studentized residuals vs Land Value")+
theme_bw()
HAMILTON_resid_X2 <- ggplot(result, aes(X2, resid)) +
geom_point()+
stat_smooth(method="loess")+
geom_hline(yintercept=0, col="red", linetype="dashed")+
xlab("Appraised Improvements Value")+
ylab("Studentized\nresiduals")+
ggtitle("Studentized residuals vs Appraised Improvements Value")+
theme_bw()
HAMILTON_qq <- ggplot(result, aes(sample = resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ plot")
cowplot::plot_grid(HAMILTON_resid_fitted, HAMILTON_resid_X1, HAMILTON_resid_X2, HAMILTON_qq, ncol = 2, nrow = 2)
The studentized residual plots show a sine-like pattern in the residuals against the fitted values and both predictors; although the loess curve tends to stay near zero, the oscillation suggests a possible correlation with the fitted values/predictors. The QQ plot tells us that the residuals are left-tailed, suggesting non-normality.
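For reference, MASS::studres computes externally studentized residuals, scaling each residual by a leave-one-out estimate of \(\sigma\): \[
\begin{align*}
t_i = \frac{\hat{e}_i}{\hat{\sigma}_{(i)} \sqrt{1 - h_{ii}}} \\
\end{align*}
\] where \(h_{ii}\) is the leverage of observation \(i\).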
The \(R^2\) of the model is 0.999847, which is abnormally high. It is possible that there is a garbage predictor in the model boosting the \(R^2\).
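For reproducibility, the quoted value comes straight from the model summary:
summary(HAMILTON_model)$r.squared  # 0.999847, as quoted above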
The result does not match my prediction made in (b). In particular, I did not expect both coefficients to be significantly different from zero, since neither predictor is significantly correlated with the response.
cor(HAMILTON$X2, HAMILTON$X1)
## [1] -0.8997765
The two predictors are highly correlated. This implies collinearity.
car::vif(HAMILTON_model)
## HAMILTON$X1 HAMILTON$X2
## 5.252038 5.252038
The VIF for both variables is over 5: each can explain a significant portion of the variance of the other. One of the two is redundant in a multiple regression model.
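With only two predictors, the VIF can be reproduced directly from the correlation reported above, since \(VIF = 1/(1 - R_j^2)\) and \(R_j^2\) here is just the squared pairwise correlation:
# VIF_j = 1 / (1 - R_j^2); with two predictors, R_j^2 = cor(X1, X2)^2
1 / (1 - cor(HAMILTON$X1, HAMILTON$X2)^2)  # reproduces the reported 5.252038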
basketball$Win <- sub("^W.*", 1, basketball$Win)
basketball$Win <- sub("^L.*", 0, basketball$Win)
basketball$Win = as.numeric(basketball$Win)
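Instead of the three steps above, an equivalent one-line recode (a sketch, assuming every Win entry begins with either "W" or "L", as the patterns above imply):
# Alternative to the sub()/as.numeric() steps above, not to be run after them:
# TRUE/FALSE from the leading letter, coerced to 1/0
basketball$Win <- as.integer(grepl("^W", basketball$Win))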
logit_model <- glm(Win ~ AST, data = basketball, family = "binomial")
coef(logit_model)
## (Intercept) AST
## -2.2352669 0.2937756
predict(logit_model, newdata = data.frame(AST = c(3, 24)), type="response")
## 1 2
## 0.2052270 0.9919608
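These predicted probabilities can be verified by hand from the coefficients using the inverse-logit link:
# P(Win) = plogis(b0 + b1 * AST) = 1 / (1 + exp(-(b0 + b1 * AST)))
plogis(coef(logit_model)[1] + coef(logit_model)[2] * c(3, 24))  # ~0.205 and ~0.992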
exp(confint.default(logit_model))
## 2.5 % 97.5 %
## (Intercept) 0.02094995 0.5461208
## AST 1.13952611 1.5792322
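On the odds-ratio scale, the point estimates sit inside these intervals: exp(0.2937756) is about 1.34, so each additional assist is estimated to multiply the odds of a win by roughly 1.34.
exp(coef(logit_model))  # point-estimate odds ratios, ~0.107 and ~1.34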