Data reading

Data is from blackboard.ttu.edu, course 5371 regression analysis, provided by professor Dr. Su Jingyong. Data comes as .dat file. Our goal is to explore the relationship between variables ppTV and ppDr.

#path <- "STAT5371/TV.dat"
#df <- read.table(path)
df <- read.table("TV.dat")
names(df); dim(df)
## [1] "life"  "ppTV"  "ppDr"  "flife" "mlife"
## [1] 40  5

Exploratory analysis and processing

par(mfrow=c(1,3))
plot(df$ppDr, df$ppTV, xlab="people per Dr",ylab="people per TV", main="The scatter plot of original data")
hist(df$ppDr, breaks=50, main="distribution of people per Dr")
hist(df$ppTV, breaks=50, main="distribution of people per Tv")

According to above figuers, both ppTV and ppDr are strongly right skewed, so transformations on both variables needed, logarithm were used here

df$ppt <- log(df$ppTV)
df$ppd <- log(df$ppDr)
#Check distribution again
par(mfrow=c(1,3))
plot(df$ppd, df$ppt, col="red", xlab="people per Dr",ylab="people per TV", main="The scatter plot of tranformed data")
hist(df$ppd, breaks=50, main="Transformed people per Dr")
hist(df$ppt, breaks=50, main="Transformed people per TV")

Still skewed but much better and acceptible. And sort of linear relationship shows up

Fit model

Our model will be used here is simple linear regression model as followed

\[ log(ppt_{i}) = \beta_{0} + \beta_{1} \times log(ppd_{i}) + \epsilon_{i}\]

Log(people per Doctor) is independent variable and log(people per TV) is dependent variable and we are assuming the simple linear relationship between them.

y<-df$ppt
x<-df$ppd
fit <- lm(y~x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5139 -0.7092 -0.0871  0.3575  3.2077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.3417     0.9933  -4.371 0.000101 ***
## x             0.9527     0.1388   6.864 4.95e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.045 on 36 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5669, Adjusted R-squared:  0.5549 
## F-statistic: 47.12 on 1 and 36 DF,  p-value: 4.949e-08
#Get rid of missing observations from x and y
y2<-y[!is.na(y)]
x2<-x[!is.na(y)]

sigmahat <- sqrt(sum((y2-predict(fit))^2)/36)
confidence_interval <- coef(fit)[2] + c(-1,1)*qt(0.975, 36)*(sigmahat/sqrt(sum((x2-mean(x2))^2)))

First glance of the summary, 2 missing values are not used, \[R^2\] = 0.5669. All p-values reported are pretty significant. Looks like we get a decent model. The confidence interval of Beta1 is 0.6712262, 1.2341878

Graphs to further check assamptions and outliers.

  1. Scatter plot with regression lines and residual plots
plot(x,y, pch=19, col="blue", xlab="Log(people per Dr)",ylab="Log(people per TV)", main="The scatter plot")
abline(fit, lwd=3, col="red")

2. Diagnostic plots

#Calculate needed values first
Sxx <- sum((x2-mean(x2))^2)
Syy <- sum((y2-mean(y2))^2)
Sxy <- sum((x2-mean(x2))*(y2-mean(y2)))
k <- (x2-mean(x2))/Sxx
n <- length(x2)
h <- 1/n + k*(x2-mean(x2))

loob1 <- numeric()
looQ <- numeric()
for(i in 1:length(x2)){
  tempx <- x2[-i]
  tempy <- y2[-i]
  tempfit <- lm(tempy~tempx)
  loob1[i] <- coef(tempfit)[2]
  looQ[i] <- sum((predict(tempfit)-tempy)^2)
}



par(mfrow=c(2,2))
#Residual plot
plot(x2, resid(fit), pch=19, main="Residual plot", ylab="Residuals", xlab="Log(people per Doctor")
abline(0,0, col="blue")
#Leverage plot
plot(x2, h, pch=19, col=rgb(red=0, green=0, blue=0, alpha=0.4), main="Leverage plot", ylab="Leverage", xlab="Log(people per Doctor)")
#Effects on Beta1, leave one out plot
plot(seq(length(x2)), loob1, pch=19, main="Effects on Beta1", ylab="Leverage", xlab="Index of data point been left out")
abline(coef(fit)[2],0, col="blue")
#Effects on Q, leave one out plot
plot(seq(length(x2)), looQ, pch=19, main="Effects on Q", ylab="Leverage", xlab="Index of data point been left out")

Comments:

    1. According to residual plots, the distribution of errors are roughly symetric around 0 and it is not notably corelatd with independent variable because no obvious pattern noticed. Further analysis could be done by using QQ plot to check normality of errors
    1. Leverage plot told us there are at least one point with very high leverage, because its corresponding x value are far away from mean of x, However, we cannot tell if this point really has some impact on our estimations. Because leverage plot only tells the leverage distribution of the data, it is a function of x only. So from this figuer, we should be careful about the high leverage points.
    1. The last two “leave one out” plots are both used to tell us how much each data point is affecting our estimated slope and cost function respectively. Big change caused by removal of any point indicates probable outlier. So from these two figuer, we do find some outliers.

Variance decomposition

The last portion of my report comes to variance decomposition.

#Total variance in y
SST <- sum((y2-mean(y2))^2)
#Variance explained by x
SSreg <- sum((predict(fit)-mean(y2))^2)
#Variance not explained by x
RSS <- sum((y2-predict(fit))^2)

data.frame("RSS"=RSS, "SSreg"=SSreg, "RSS plus SSreg"=RSS+SSreg, "SST"=SST)
##        RSS    SSreg RSS.plus.SSreg      SST
## 1 39.31282 51.45517       90.76799 90.76799

So SST = SSreg + RSS

Thank you for reading!