Data is from blackboard.ttu.edu, course 5371 regression analysis, provided by professor Dr. Su Jingyong. Data comes as .dat file. Our goal is to explore the relationship between variables ppTV and ppDr.
#path <- "STAT5371/TV.dat"
#df <- read.table(path)
df <- read.table("TV.dat")
names(df); dim(df)
## [1] "life" "ppTV" "ppDr" "flife" "mlife"
## [1] 40 5
par(mfrow=c(1,3))
plot(df$ppDr, df$ppTV, xlab="people per Dr",ylab="people per TV", main="The scatter plot of original data")
hist(df$ppDr, breaks=50, main="distribution of people per Dr")
hist(df$ppTV, breaks=50, main="distribution of people per Tv")
According to above figuers, both ppTV and ppDr are strongly right skewed, so transformations on both variables needed, logarithm were used here
df$ppt <- log(df$ppTV)
df$ppd <- log(df$ppDr)
#Check distribution again
par(mfrow=c(1,3))
plot(df$ppd, df$ppt, col="red", xlab="people per Dr",ylab="people per TV", main="The scatter plot of tranformed data")
hist(df$ppd, breaks=50, main="Transformed people per Dr")
hist(df$ppt, breaks=50, main="Transformed people per TV")
Still skewed but much better and acceptible. And sort of linear relationship shows up
Our model will be used here is simple linear regression model as followed
\[ log(ppt_{i}) = \beta_{0} + \beta_{1} \times log(ppd_{i}) + \epsilon_{i}\]
Log(people per Doctor) is independent variable and log(people per TV) is dependent variable and we are assuming the simple linear relationship between them.
y<-df$ppt
x<-df$ppd
fit <- lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5139 -0.7092 -0.0871 0.3575 3.2077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3417 0.9933 -4.371 0.000101 ***
## x 0.9527 0.1388 6.864 4.95e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.045 on 36 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5669, Adjusted R-squared: 0.5549
## F-statistic: 47.12 on 1 and 36 DF, p-value: 4.949e-08
#Get rid of missing observations from x and y
y2<-y[!is.na(y)]
x2<-x[!is.na(y)]
sigmahat <- sqrt(sum((y2-predict(fit))^2)/36)
confidence_interval <- coef(fit)[2] + c(-1,1)*qt(0.975, 36)*(sigmahat/sqrt(sum((x2-mean(x2))^2)))
First glance of the summary, 2 missing values are not used, \[R^2\] = 0.5669. All p-values reported are pretty significant. Looks like we get a decent model. The confidence interval of Beta1 is 0.6712262, 1.2341878
plot(x,y, pch=19, col="blue", xlab="Log(people per Dr)",ylab="Log(people per TV)", main="The scatter plot")
abline(fit, lwd=3, col="red")
2. Diagnostic plots
#Calculate needed values first
Sxx <- sum((x2-mean(x2))^2)
Syy <- sum((y2-mean(y2))^2)
Sxy <- sum((x2-mean(x2))*(y2-mean(y2)))
k <- (x2-mean(x2))/Sxx
n <- length(x2)
h <- 1/n + k*(x2-mean(x2))
loob1 <- numeric()
looQ <- numeric()
for(i in 1:length(x2)){
tempx <- x2[-i]
tempy <- y2[-i]
tempfit <- lm(tempy~tempx)
loob1[i] <- coef(tempfit)[2]
looQ[i] <- sum((predict(tempfit)-tempy)^2)
}
par(mfrow=c(2,2))
#Residual plot
plot(x2, resid(fit), pch=19, main="Residual plot", ylab="Residuals", xlab="Log(people per Doctor")
abline(0,0, col="blue")
#Leverage plot
plot(x2, h, pch=19, col=rgb(red=0, green=0, blue=0, alpha=0.4), main="Leverage plot", ylab="Leverage", xlab="Log(people per Doctor)")
#Effects on Beta1, leave one out plot
plot(seq(length(x2)), loob1, pch=19, main="Effects on Beta1", ylab="Leverage", xlab="Index of data point been left out")
abline(coef(fit)[2],0, col="blue")
#Effects on Q, leave one out plot
plot(seq(length(x2)), looQ, pch=19, main="Effects on Q", ylab="Leverage", xlab="Index of data point been left out")
The last portion of my report comes to variance decomposition.
#Total variance in y
SST <- sum((y2-mean(y2))^2)
#Variance explained by x
SSreg <- sum((predict(fit)-mean(y2))^2)
#Variance not explained by x
RSS <- sum((y2-predict(fit))^2)
data.frame("RSS"=RSS, "SSreg"=SSreg, "RSS plus SSreg"=RSS+SSreg, "SST"=SST)
## RSS SSreg RSS.plus.SSreg SST
## 1 39.31282 51.45517 90.76799 90.76799
So SST = SSreg + RSS
Comments: