Question 1

Sent as a separate scanned PDF.

Question 2

Parts a) and b) were also sent as a separate scanned PDF.

sexsalary <- read.table("sexsalary.txt", header=TRUE)

## Male is 0, Female is 1
f_0 <- factor(sexsalary$sex, labels = c(1, 0))
sex <- as.numeric(levels(f_0))[f_0]
salary <- sexsalary$bsal
n <- length(f_0)

salary_male <- salary[sex == 0]
salary_female <- salary[sex == 1]

mean_msalary <- mean(salary_male)
mean_fsalary <- mean(salary_female)

intercept <- mean_msalary
slope <- mean_fsalary - mean_msalary 

predicted_salary <- intercept + slope * sex
RSS <- sum((salary - predicted_salary)^2)
sigma_hat_squared = RSS/(n - 2)
SXX <- sum((sex - mean(sex))^2)

plot(sex, salary, main = "{0,1} Parametrization")
abline(a = intercept, b = slope, col = "red")

The regression line for the {0,1} parametrization is \[\widehat{Salary} = 5956.875 - 818.0225 \cdot Sex\]
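As a quick cross-check (a minimal sketch; fit01 is a hypothetical name), R's built-in lm() should reproduce the hand-computed coefficients:

# lm() cross-check for the {0,1} coding (sketch)
fit01 <- lm(salary ~ sex)
coef(fit01)  # intercept = male mean, slope = female mean minus male mean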

We perform a hypothesis test at the 0.05 significance level to determine whether there is a significant difference in salary between males and females, based on our linear model of salary on sex under the {0, 1} parametrization. We assume that the relationship between sex and salary is linear and that the errors are independent and normally distributed.

The t value is

# Hypothesis test for the slope (note the square root in the standard error)
SE_slope <- sqrt(sigma_hat_squared/SXX)
(tvalue_slope <- slope/SE_slope)
## [1] -6.292605

And the Pr(>|t|)

2*(1 - pt(abs(tvalue_slope), df = n - 2))
## [1] 1.07e-08

Because our p-value is far below 0.05, we reject the null hypothesis of no linear relationship between salary and sex: the data provide strong evidence that mean beginning salary differs between men and women at the bank.

The regression line for the {-1,1} parametrization is \[\widehat{Salary} = 5547.8637 - 409.0113 \cdot Sex\]
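The code for this part mirrors the {0,1} code; a minimal sketch of the recoding (sex_pm and fit_pm are hypothetical names):

# Recode {0,1} to {-1,1}: male -> -1, female -> +1 (sketch)
sex_pm <- 2*sex - 1
fit_pm <- lm(salary ~ sex_pm)
coef(fit_pm)  # intercept = midpoint of the group means, slope = half their difference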

We perform the same hypothesis test at the 0.05 significance level under the {-1, 1} parametrization, with the same assumptions of linearity and of independent, normally distributed errors.

The t value is

## [1] -6.292605

And the Pr(>|t|)

## [1] 1.07e-08

Because our p-value is again far below 0.05, we reject the null hypothesis of no linear relationship between salary and sex; the {-1, 1} parametrization leads to the same conclusion.

The t-statistic is identical under the two parametrizations, so the p-values agree exactly and the two tests necessarily reach the same conclusion. The {0,1} parametrization's intercept represents the average salary of men in the sample, while its slope is the difference between the average salary of women and the average salary of men. For the {-1,1} parametrization, the intercept represents the midpoint of the two group averages (the average of the men's and women's mean salaries), and the slope represents one-half of the difference between the average salary of women and the average salary of men.
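To see why the t-statistic is unchanged, note that recoding \(x' = 2x - 1\) halves the estimated slope while quadrupling \(S_{xx}\), so the standard error is halved as well:

\[t' = \frac{\hat{\beta}_1/2}{\hat{\sigma}\big/\sqrt{4\,S_{xx}}} = \frac{\hat{\beta}_1}{\hat{\sigma}\big/\sqrt{S_{xx}}} = t.\]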

Question 3

  1. Plot of data with regression line and 95% CI
nyc <- read.csv("nyc.csv") 

service <- nyc$Service
price <- nyc$Price

mean_service <- mean(service)
mean_price <- mean(price)

SXX <- sum((service - mean_service)^2)
SYY <- sum((price - mean_price)^2)
SXY <- sum((service - mean_service)*(price - mean_price))

# OLS Point Estimators for slope and intercept of population regression line 
slope <- SXY/SXX 
intercept <- mean_price - slope*mean_service 

# All predicted prices from the SLRM 
predicted_price <- intercept + slope*service 

# Plot the sample points with the sample regression line 
plot(service, price, main = "Service and Price")
abline(a = intercept, b = slope, col = "red")

## Confidence Intervals for the population regression line
n <- length(service)
RSS <- sum((price - predicted_price)^2)
df <- n - 2
alpha <- 0.05

sigma_hat <- sqrt(RSS/(n - 2))

SE_predicted <- sigma_hat * sqrt(1/n + (service - mean_service)^2/SXX)

upper_predicted <- predicted_price + SE_predicted*qt(1 - alpha/2, df)
lower_predicted <- predicted_price - SE_predicted*qt(1 - alpha/2, df)

# Plot CI bounds; sort by service so the lines draw left to right
ord <- order(service)
lines(service[ord], upper_predicted[ord])
lines(service[ord], lower_predicted[ord])

# Standard error for the slope and intercept 
SE_intercept <- sigma_hat * sqrt(1/n + mean_service^2/SXX)
SE_slope <- sigma_hat/sqrt(SXX)
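As a sanity check on the hand computations (a sketch; fit_service is a hypothetical name, and Price and Service are the column names used above), lm() should agree:

# Cross-check estimates, SEs, t values, and p-values with lm() (sketch)
fit_service <- lm(Price ~ Service, data = nyc)
summary(fit_service)$coefficients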
  2. Point estimates for regression coefficients and SE
  3. Hypothesis Tests for the two coefficients. We assume that the relationship between the variables is linear and that the errors are independent and normally distributed.
(t_intercept <- intercept / SE_intercept)
## [1] -2.344327
(t_slope <- slope / SE_slope)
## [1] 10.76395
2*(1 - pt(abs(t_intercept), df))
## [1] 0.0202451
2*(1 - pt(abs(t_slope), df))
## [1] 0
  4. Confidence Intervals for the two coefficients and interpretation
# Upper and lower bounds for beta0 confidence interval 
upper_intercept = intercept + SE_intercept * qt(1 - alpha/2, df)
lower_intercept = intercept - SE_intercept * qt(1 - alpha/2, df)

# Upper and lower bounds for beta1 confidence interval 
upper_slope = slope + SE_slope * qt(1 - alpha/2, df)
lower_slope = slope - SE_slope * qt(1 - alpha/2, df)

We are 95% confident that the intercept of the population regression line for price on service is between -22.065 and -1.890. We are also 95% confident that the slope of the population regression line is between 2.301 and 3.335.
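R's confint() should reproduce these bounds (a sketch, reusing the hypothetical fit_service from above):

# Built-in 95% confidence intervals for the coefficients (sketch)
confint(fit_service, level = 0.95)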

  5. ANOVA Regression Table
| Source     | Sum of Squares (SS) | df      | Mean Square (MS)          | F-ratio                   |
|------------|---------------------|---------|---------------------------|---------------------------|
| Regression | \(SS_{reg}\)        | \(1\)   | \(MS_{reg} = SS_{reg}/1\) | \(F = MS_{reg}/MS_{res}\) |
| Residual   | \(RSS\)             | \(n-2\) | \(MS_{res} = RSS/(n-2)\)  |                           |
| Total      | \(SST\)             | \(n-1\) |                           |                           |
SST <- sum((price - mean_price)^2)  # total sum of squares (equals SYY)
SS_reg <- SST - RSS                 # regression (explained) sum of squares

df_reg <- 1
df_res <- n - 2

MS_reg <- SS_reg/df_reg
MS_res <- RSS/df_res

F_ratio <- MS_reg/MS_res
p_value <- 1 - pf(F_ratio, df_reg, df_res)
| Source     | Sum of Squares (SS) | df  | Mean Square (MS) | F-ratio        |
|------------|---------------------|-----|------------------|----------------|
| Regression | 5928.12             | 1   | 5928.12          | \(F = 115.86\) |
| Residual   | 8493.40             | 166 | 51.17            |                |
| Total      | 14421.52            | 167 |                  |                |

p-value: effectively 0 (smaller than machine precision, so R prints 0)
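The built-in ANOVA table should match (a sketch, reusing the hypothetical fit_service from above):

# Built-in ANOVA table: SS, df, MS, F, and p-value (sketch)
anova(fit_service)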

  6. \(R^2\)
(R2 <- SS_reg/SST)
## [1] 0.4110608

Now the same operations on decor instead of service!
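Since the code for this part is not echoed, here is a minimal sketch of the same pipeline with decor as the predictor (fit_decor is a hypothetical name, and the Decor column name is an assumption about nyc.csv):

# Same analysis with decor in place of service (sketch)
decor <- nyc$Decor
fit_decor <- lm(price ~ decor)
summary(fit_decor)  # coefficients, SEs, t tests, and R-squared
confint(fit_decor)  # 95% CIs for the coefficients
anova(fit_decor)    # ANOVA table and F-ratio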

  1. Plot of data with regression line and 95% CI
  2. Point estimates for regression coefficients and SE
  3. Hypothesis Tests for the two coefficients. We assume that the relationship between the variables is linear and that the errors are independent and normally distributed.
## [1] -0.4137853
## [1] 13.53674
## [1] 0.6795654
## [1] 0
  4. Confidence Intervals for the two coefficients and interpretation

We are 95% confident that the intercept of the population regression line for price on decor is between -7.862 and 5.138. We are also 95% confident that the slope of the population regression line is between 2.127 and 2.854.

  5. ANOVA Regression Table
| Source     | Sum of Squares (SS) | df  | Mean Square (MS) | F-ratio        |
|------------|---------------------|-----|------------------|----------------|
| Regression | 7566.78             | 1   | 7566.78          | \(F = 183.24\) |
| Residual   | 6854.74             | 166 | 41.29            |                |
| Total      | 14421.52            | 167 |                  |                |

p-value: effectively 0 (smaller than machine precision, so R prints 0)

  6. \(R^2\)
## [1] 0.5246865
  7. I think that decor is a better predictor of price than service. I base this on the ANOVA tables, in particular the regression (explained) sum of squares and the residual sum of squares (RSS). Since the decor model had a smaller RSS than the service model, it left less unexplained variation in price. In addition, the R-squared value was higher for the decor model, indicating that it explained more of the variance in price than the service model did. The larger F-ratio and smaller residual mean square likewise support the claim that decor is the better predictor and that its model fits the data better; a quick numerical comparison is sketched below.
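A minimal sketch of that comparison, reusing the hypothetical fit_service and fit_decor objects from above:

# Side-by-side R-squared comparison (sketch)
c(service = summary(fit_service)$r.squared,
  decor   = summary(fit_decor)$r.squared)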