Sent as a separate scanned PDF.
Parts a) and b) were also sent as a separate scanned PDF.
sexsalary <- read.table("sexsalary.txt", header=TRUE)
## In the raw data, Female is 0 and Male is 1; the relabeling below flips this so that Male is 0 and Female is 1
f_0 = factor(sexsalary$sex, labels = c(1,0))
sex <- as.numeric(levels(f_0))[f_0]
salary <- sexsalary$bsal
n <- length(f_0)
salary_male <- salary[sex == 0]
salary_female <- salary[sex == 1]
mean_msalary <- mean(salary_male)
mean_fsalary <- mean(salary_female)
# With a 0/1 predictor, the OLS intercept is the mean of the group coded 0 (men)
intercept <- mean_msalary
# and the OLS slope is the difference in group means (women minus men)
slope <- mean_fsalary - mean_msalary
predicted_salary <- intercept + slope * sex
RSS <- sum((salary - predicted_salary)^2)
sigma_hat_squared = RSS/(n - 2)
SXX <- sum((sex - mean(sex))^2)
plot(sex, salary, main = "{0,1} Parametrization")
abline(a = intercept, b = slope, col = "red")
The regression line for the {0,1} parametrization is \[Salary = 5956.875 - 818.022541 \times Sex\]
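The group-mean construction above coincides with the ordinary least squares fit. As a quick sanity check (using the salary and sex vectors constructed above), R's built-in lm() should reproduce the same intercept and slope:
# Cross-check of the hand-computed fit against lm() (same salary and sex as above)
fit01 <- lm(salary ~ sex)
coef(fit01)  # (Intercept) should equal mean_msalary; the sex coefficient should equal mean_fsalary - mean_msalary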
We perform a hypothesis test at the 0.05 significance level to determine whether there is a significant difference in salary between males and females, based on our linear model of salary on sex with the {0, 1} parametrization. We assume that the relationship between sex and salary is linear and that the errors are independent and normally distributed.
The t value is
# Hypothesis test for the slope
SE_slope = (sigma_hat_squared/SXX)
(tvalue_slope = slope/SE_slope)
## [1] -0.04840558
And the corresponding Pr(>|t|) is
2*(1 - pt(abs(tvalue_slope), df = n - 2))
## [1] 0.9614991
Because our p-value is greater than 0.05, we fail to reject the null hypothesis of no linear relationship between salary and sex at the bank.
The regression line for the {-1,1} parametrization is \[Salary = 5547.8637295 - 409.0112705 \times Sex\]
We perform a hypothesis test at the 0.05 significance level to determine whether there is a significant difference in salary between males and females, based on our linear model of salary on sex with the {-1, 1} parametrization. We assume that the relationship between sex and salary is linear and that the errors are independent and normally distributed.
The t value is
## [1] -0.09681117
And the corresponding Pr(>|t|) is
## [1] 0.9230893
Because our p-value is greater than 0.05, we fail to reject the null hypothesis of no linear relationship between salary and sex at the bank.
The t-statistic for the {0,1} parametrization is half of the t-statistic for the {-1,1} parametrization, so the p-values differ slightly, but both tests reach the same conclusion: we fail to reject the null hypothesis. In the {0,1} parametrization, the intercept represents the average salary of the men in the sample and the slope represents the difference between the average salary of the women and the average salary of the men. In the {-1,1} parametrization, the intercept represents the midpoint of the two group averages (the average of the men's and women's mean salaries), and the slope represents one-half of the difference between the average salary of the women and the average salary of the men.
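These interpretations can be verified directly from the group means; a minimal sketch using the quantities already computed above (mean_msalary and mean_fsalary):
# {0,1} parametrization: intercept is the mean for the group coded 0 (men),
# slope is the difference in group means (women minus men)
intercept_01 <- mean_msalary
slope_01 <- mean_fsalary - mean_msalary
# {-1,1} parametrization: intercept is the midpoint of the two group means,
# slope is one-half of the difference in group means
intercept_pm <- (mean_msalary + mean_fsalary)/2
slope_pm <- (mean_fsalary - mean_msalary)/2
c(intercept_01, slope_01, intercept_pm, slope_pm)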
nyc <- read.csv("nyc.csv")
service <- nyc$Service
price <- nyc$Price
mean_service <- mean(service)
mean_price <- mean(price)
SXX <- sum((service - mean_service)^2)
SYY <- sum((price - mean_price)^2)
SXY <- sum((service - mean_service)*(price - mean_price))
# OLS Point Estimators for slope and intercept of population regression line
slope <- SXY/SXX
intercept <- mean_price - slope*mean_service
# All predicted prices from the SLRM
predicted_price <- intercept + slope*service
# Plot the sample points with the sample regression line
plot(service, price, main = "Service and Price")
abline(a = intercept, b = slope, col = "red")
## Confidence Intervals for the population regression line
n <- length(service)
RSS <- sum((price - predicted_price)^2)
df <- n - 2
alpha <- 0.05
sigma_hat <- sqrt(RSS/(n - 2))
SE_predicted <- sigma_hat * sqrt(1/n + (service - mean_service)^2/SXX)
upper_predicted <- predicted_price + SE_predicted*qt(1 - alpha/2, df)
lower_predicted <- predicted_price - SE_predicted*qt(1 - alpha/2, df)
# Plot the CI bounds at each observed value of service (sorted so the lines draw left to right)
ord <- order(service)
lines(service[ord], upper_predicted[ord])
lines(service[ord], lower_predicted[ord])
# Standard error for the slope and intercept
SE_intercept <- sigma_hat * sqrt(1/n + mean_service^2/SXX)
SE_slope <- sigma_hat/sqrt(SXX)
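The pointwise interval computed above is the standard confidence interval for the mean response, \[\hat{y}(x) \pm t_{1-\alpha/2,\, n-2}\, \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{XX}}}\] As a cross-check (assuming the same nyc data frame), R's predict() with interval = "confidence" should give matching bounds:
# Cross-check of the hand-computed CI for the mean response against predict()
fit_service <- lm(Price ~ Service, data = nyc)
ci_line <- predict(fit_service, interval = "confidence", level = 0.95)
head(ci_line)  # columns fit, lwr, upr should match predicted_price, lower_predicted, upper_predicted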
Point estimate for intercept = -11.9778106
Point estimate for slope = 2.8184327
Standard error for intercept = 5.1092741
Standard error for slope = 0.2618399
(t_intercept <- intercept / SE_intercept)
## [1] -2.344327
(t_slope <- slope / SE_slope)
## [1] 10.76395
2*(1 - pt(abs(t_intercept), df))
## [1] 0.0202451
2*(1 - pt(abs(t_slope), df))
## [1] 0
# Upper and lower bounds for beta0 confidence interval
upper_intercept = intercept + SE_intercept * qt(1 - alpha/2, df)
lower_intercept = intercept - SE_intercept * qt(1 - alpha/2, df)
# Upper and lower bounds for beta1 confidence interval
upper_slope = slope + SE_slope * qt(1 - alpha/2, df)
lower_slope = slope - SE_slope * qt(1 - alpha/2, df)
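These hand-computed intervals should agree with R's confint(); a quick check on the same model (refit here so the snippet stands alone):
# 95% confidence intervals for the intercept and slope via confint()
fit_service <- lm(Price ~ Service, data = nyc)
confint(fit_service, level = 0.95)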
We are 95% confident that the intercept of the population regression line for price on service is between -22.0653456 and -1.8902756. We are also 95% confident that the slope of the population regression line is between 2.301467 and 3.3353984.
| Source | Sum of Squares(SS) | df | Mean Square(MS) | F-Ratio |
|---|---|---|---|---|
| Regression | \[SS_{reg}\] | \[1\] | \[MS_{reg}\] | \[F = MS_{reg}/MS_{res}\] |
| Residual | \[RSS\] | \[n-2\] | \[MS_{res}\] | |
| Total | \[SST\] | \[n-1\] | | |
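The table rests on the decomposition of the total sum of squares, which is exactly what the code below uses when it sets SS_reg = SST - RSS: \[\sum_{i}(y_i - \bar{y})^2 = \sum_{i}(\hat{y}_i - \bar{y})^2 + \sum_{i}(y_i - \hat{y}_i)^2, \qquad SST = SS_{reg} + RSS\]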
SST = sum((price - mean_price)^2)
SS_reg = SST - RSS
df_reg = 1
df_res = n-2
MS_reg = SS_reg/df_reg
MS_res = RSS/df_res
F_ratio = MS_reg/MS_res
p_value <- 1 - pf(F_ratio, df_reg, df_res)
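As a cross-check (same model as above), R's anova() should produce the same sums of squares, degrees of freedom, F-ratio, and p-value:
# ANOVA table for the simple regression of Price on Service
anova(lm(Price ~ Service, data = nyc))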
| Source | Sum of Squares(SS) | df | Mean Square(MS) | F-ratio |
|---|---|---|---|---|
| Regression | \[5928.120226\] | \[1\] | \[5928.120226\] | \[F = 115.8626972\] |
| Residual | \[8493.3976311\] | \[166\] | \[51.165046\] | |
| Total | \[1.4421518\times 10^{4}\] | \[167\] | | |
p-value: 0 (effectively zero; smaller than machine precision)
(R2 <- SS_reg/SST)
## [1] 0.4110608
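Equivalently, for simple linear regression R-squared equals the squared sample correlation between predictor and response; a quick check with the vectors defined above:
# R^2 from the ANOVA decomposition equals the squared correlation
cor(service, price)^2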
We now repeat the same analysis with decor in place of service.
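The code is the same as above with decor substituted for service; as a condensed sketch, the built-in equivalents would be (assuming the nyc data frame has a Decor column, as in the standard nyc restaurant data):
# Same pipeline with Decor as the predictor (assumes a Decor column in nyc)
decor <- nyc$Decor
fit_decor <- lm(price ~ decor)
summary(fit_decor)                # point estimates, standard errors, t values, Pr(>|t|)
confint(fit_decor, level = 0.95)  # 95% CIs for the intercept and slope
anova(fit_decor)                  # ANOVA table, F-ratio, and p-value
summary(fit_decor)$r.squared      # coefficient of determination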
Point estimate for intercept = -1.3623038
Point estimate for slope = 2.490534
Standard error for intercept = 3.2922962
Standard error for slope = 0.1839834
The t-statistics for the intercept and the slope are
## [1] -0.4137853
## [1] 13.53674
and the corresponding Pr(>|t|) values are
## [1] 0.6795654
## [1] 0
We are 95% confident that the intercept of the population regression line for price on decor is between -7.8624744 and 5.1378667. We are also 95% confident that the slope of the population regression line is between 2.127285 and 2.853783.
| Source | Sum of Squares(SS) | df | Mean Square(MS) | F-ratio |
|---|---|---|---|---|
| Regression | \[7566.7759764\] | \[1\] | \[7566.7759764\] | \[F = 183.2431963\] |
| Residual | \[6854.7418807\] | \[166\] | \[41.2936258\] | |
| Total | \[1.4421518\times 10^{4}\] | \[167\] | | |
p-value: 0 (effectively zero; smaller than machine precision)
R-squared for the decor model:
## [1] 0.5246865