Setup:

knitr::opts_chunk$set(echo = TRUE)
library(alr4)
library(tidyverse)
library(readr)
library(broom)



Question 1

Here consider salary is the response variable and year is the predictor variable for the analysis.

1.1 Calculate the total sum of squares, the regression sum of squares and the residual sum of squares for the fitted regression model. Use these to calculate the coefficient of determination (\(R^2\)) for the model. What do these values tell you about the strength of the linear relationship between salary and year?

salary1<-salary$salary
years <- salary$year


xbar=mean(years)
ybar=mean(salary1)

#Total Sum of Squares
SYY <- sum((salary1 - ybar)^2)  

#Residual Sum of Squares
SXX <- sum((years - xbar)^2)
SXY <- sum((years - xbar) * (salary1 - ybar))
b1hat <- SXY/SXX          
b0hat<- ybar - b1hat*xbar

#SSREG
SSReg<- (SXY^2)/SXX

RSS <- sum((salary1- (b0hat + (b1hat * salary1)))^2)

#R^2
R2<-SSReg/SYY
print(R2)
## [1] 0.490937

The \(R^2\) value of .49 indicated there is some linear relatinship between year and salary but it is not that strong. Also with the high RSS and SYY indicates a lot of values were far from the linear model.

1.2 Use appropriate residual plots to assess whether the fitted mean function is appropriate. Do the plots indicate any lack of fit?

smod<- lm(salary1 ~ years, data = salary)

# Linear regression plot 
  plot(years, salary1, xlab = "Year", ylab = "Salary ($)")
abline(reg = smod)

#Plot of Residuals Plot vs Fitted Values 


(eplot = augment(smod) %>% ggplot(aes(x = .fitted, y = .resid)) + geom_point(size = 2) + geom_hline(yintercept =0) + theme_bw(18) + xlab("fitted values") + ylab("residuals")) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


There is a slight lack of fit as the data is scattered way more from fitted values less than 30000 instead of being randomly scattered throughout.

1.3 Use appropriate residual plots to assess the assumption that the variance of employer salary is the same at all years.

# Linear regression plot 
  plot(years, salary1, xlab = "Year", ylab = "Salary ($)")
abline(reg = smod)

## check constant variance with standardized residuals

plot(salary1, resid(smod), xlab = "fitted values", 
     ylab =  "residuals", main = "residuals vs fitted values"); abline(h = 0)


There appears to be a linear pattern of the residuals vs fitted values, indicating the variance is not constant for all years.

1.4 Use appropriate residual plots to assess the assumption that the error terms in the model are normally distributed.

## Checking normal errors: Histogram
hist(resid(smod), xlab = "residuals", main = "Histogram of Normal Errors")

## Checking normal errors: Normal probability plot

qqnorm(resid(smod)); qqline(resid(smod), main= "Normality Plot of e_is")


Although for the most part the expected value of the errors falls on the line, the data starts to fall of the line around x=1 and x=-1. Also the histogram of residuals is not a unimodal distribution centered at zero. These results lead go the conclusion that the errors are not normally distributed, and a linear model assumption would not be appropriate



Question 2

This question relates to the UN11 data set which can be found in alr4 package. Simply load the package and access UN11 date set.

Here we consider the following two variables.

Let’s take fertility as the response variable and ppgdp as the predictor variables.

  1. Make four plots:

2.1 Explain why transforming both fertility and ppgdp to log scale is useful for building a linear regression model that uses ppgdp to predict fertility.

#(i) fertility vs. ppgdp, 
plot(x=UN11$ppgdp, y=UN11$fertility, xlab = "ppgdp ($)", ylab= "fertility", main = "fertility vs ppgdp")

#(ii) fertility vs. log(ppgdp), 
plot(UN11$ppgdp, UN11$fertility, xlab = "log ppgdp ($)", ylab = " fertility", main = "fertility vs log ppgdp")

# (iii) log(fertility) vs. ppgdp, 
plot(UN11$ppgdp, log10(UN11$fertility), xlab = "ppgdp ($)", ylab = "log fertility", main = "log fertility vs ppgdp")

# (iv) log(fertility) vs. log(ppgdp). 

plot(log10(UN11$ppgdp), log10(UN11$fertility), xlab = "log ppgdp ($)", ylab = "log fertility", main = "log fertility vs log ppgdp")


The magnitudes for both ppgpd and fertility were pretty large, so using the log scale was appropriate. Transforming to a log scale allows fitted effect that change most rapidly when the predictor is small and less rapidly when the predictor is large. So taking the log brings the value closer together and in this case taking the log of for both ppgpd and fertility helped show the linear relationship

2.2. Recall the book’s definition of the “log rule” (p188): “If the values of a variable range over more than one order of magnitude and the variable is strictly positive, then replacing the variable by its logarithm is likely to be helpful” in linearizing relationships. Can this rule be applied meaningfully to any of these two variables.

#checking if variable is strictly positive
if (all(UN11$fertility > 0)){
   print("All values of fertility are positive")
  
}else{
  print("All values of fertility are not positive")
}
## [1] "All values of fertility are positive"
if (all(UN11$ppgdp > 0)){
   print("All values of ppgdp are positive")
  
}else{
  print("All values of ppgdp are not positive")
}
## [1] "All values of ppgdp are positive"
# checking span of order of magnitude
minFertility<- min(UN11$fertility)
maxFertility<- max(UN11$fertility)

minPPGDP<- min(UN11$ppgdp)
maxPPGDP<- max(UN11$ppgdp)

log10(maxFertility) - log10(minFertility)
## [1] 0.7858067
log10(maxPPGDP) - log10(minPPGDP)
## [1] 2.961642


This rule can only be applied to the ppgdp variable. Although both variables are strictly positive, only the values of PPGDP span over more then one order of magnitude with the order being close to 3. The fertility values only span over less than 1 order of magnitude.