Part 1:
BOD Data set
# generate descriptive stats
str(BOD)
## 'data.frame': 6 obs. of 2 variables:
## $ Time : num 1 2 3 4 5 7
## $ demand: num 8.3 10.3 19 16 15.6 19.8
## - attr(*, "reference")= chr "A1.4, p. 270"
library("psych")
describe(BOD)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Time 1 6 3.67 2.16 3.5 3.67 2.22 1.0 7.0 6.0 0.26 -1.58 0.88
## demand 2 6 14.83 4.63 15.8 14.83 5.34 8.3 19.8 11.5 -0.29 -1.86 1.89
A. Identify the dependent and independent variables.
I picked the BOD data set. The dependent variable (y) is the biochemical oxygen demand (mg/l) and the independent variable (x) is the time of the measurement in days. Biochemical oxygen demand can be affected by time, while time is independent of the observed values of biochemical oxygen demand.
Linear Regression Formula:
\[ Y = \beta_0 + \beta_1 X + \epsilon \]
\[ \epsilon \sim N(0,\sigma^2) \]
Linear Regression Formula for BOD Dataset:
\[ Y_i = \beta_0 + \beta_1 \, \text{time}_i + \epsilon_i \]
\[ Y_i = \text{biochemical oxygen demand (mg/l)} \]
\[ X_i = \text{time (days)} \]
\[ \epsilon_i \sim N(0,\sigma^2) \]
B. Estimate the linear regression in R using the lm() command.
lm(BOD) # given a data frame, lm() uses the first column as the response, so this fits Time ~ demand
##
## Call:
## lm(formula = BOD)
##
## Coefficients:
## (Intercept) demand
## -1.8905 0.3746
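Part A names demand as the dependent variable and Time as the independent variable. A minimal sketch of a fit in that direction, with the formula written out explicitly, might look like the following. The object name `fit_demand` is illustrative and not part of the original analysis.

```r
# Fit demand (mg/l) as a function of Time (days), as described in part A.
# BOD ships with base R (package 'datasets'); fit_demand is an illustrative name.
fit_demand <- lm(demand ~ Time, data = BOD)

# Named vector holding the estimated intercept and slope
coef(fit_demand)
```

In this direction the slope is about 1.72 mg/l per day and the intercept is about 8.52 mg/l, so the coefficients differ from the ones printed above, which come from regressing Time on demand.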
C. Interpret the slope and intercept parameters.
The slope is the change in y per one-unit change in x. Because lm(BOD) takes the first column of the data frame as the response, the fitted model has Time (days) as y and demand (mg/l) as x. The slope of 0.3746 therefore means that a 1 mg/l increase in demand is associated with an increase of 0.3746 days in Time.
The y-intercept is the predicted value of y when x is 0. The intercept of -1.8905 is the point where the fitted line crosses the y-axis: the predicted Time is -1.8905 days when demand is 0 mg/l. A negative time has no physical meaning here. A demand of 0 lies far outside the observed range (8.3 to 19.8 mg/l), so the intercept is an extrapolation rather than a meaningful prediction.
Below is the graph of the BOD data with the fitted line, which has a slope of 0.3746 and an intercept of -1.8905 (rounded to 0.375 and -1.9 in the plotting code).
# plot BOD data and add line of best fit
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(BOD, aes(x = demand, y = Time)) +
  geom_point(color = "blue") +
  xlim(-30, 30) + ylim(-10, 10) +
  geom_hline(yintercept = 0) + geom_vline(xintercept = 0) +
  geom_abline(intercept = -1.9, slope = 0.375, color = "red")
D. Replicate the slope and intercept parameters using the covariance/variance formulas.
# slope = cov(x, y) / var(x), with x = demand and y = Time (the same direction as lm(BOD))
B1 <- cov(BOD$Time, BOD$demand)/var(BOD$demand)
B1 # print BOD slope
## [1] 0.3746425
# intercept = mean(y) - slope * mean(x), with x = demand and y = Time
B0 <- mean(BOD$Time) - B1 * mean(BOD$demand)
B0 # print the value of B0
## [1] -1.89053
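The same covariance/variance formulas can be applied with demand as the response and Time as the regressor, which is the direction described in part A. The sketch below assumes that direction. The names `b1_demand` and `b0_demand` are illustrative and do not appear in the original code.

```r
# Slope and intercept of demand regressed on Time via covariance/variance.
# b1_demand and b0_demand are illustrative names, not from the original code.
b1_demand <- cov(BOD$Time, BOD$demand) / var(BOD$Time)
b0_demand <- mean(BOD$demand) - b1_demand * mean(BOD$Time)

# Compare against lm(); all.equal() tolerates floating-point rounding
all.equal(unname(coef(lm(demand ~ Time, data = BOD))), c(b0_demand, b1_demand))
```

Only the denominator of the slope and the roles of the two means change. The covariance is symmetric, so the numerator is the same in both directions.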
Part 2:
The Gauss-Markov theorem states that when certain conditions are met, the OLS estimator of the regression coefficients is BLUE. BLUE stands for Best Linear Unbiased Estimator: among all linear unbiased estimators, OLS has the smallest variance. These conditions, otherwise known as the Gauss-Markov conditions or OLS assumptions, justify using OLS to estimate regression coefficients. There are four main conditions: (1) linearity, (2) non-collinearity, (3) exogeneity, and (4) homoscedasticity. Linearity assumes the relationship between the dependent variable and the regressors is linear in the parameters. Non-collinearity assumes that no regressor is a perfect linear combination of the others. Exogeneity assumes the regressors are uncorrelated with the error term. Lastly, homoscedasticity assumes the error variance is constant regardless of the values of the regressors.
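The unbiasedness part of BLUE can be illustrated with a small simulation. The sketch below is not part of the original analysis: the true parameters, sample size, seed, and number of replications are all assumed values chosen for illustration. It generates data that satisfy the conditions above and checks that the OLS slope estimates average out close to the true slope.

```r
# Monte Carlo sketch: OLS slope estimates centre on the true slope when the
# Gauss-Markov conditions hold. Every value below is an assumption.
set.seed(1)                        # arbitrary seed for reproducibility
beta0 <- 2                         # hypothetical true intercept
beta1 <- 0.5                       # hypothetical true slope
sigma <- 1                         # constant error sd (homoscedasticity)
x <- seq(1, 10, length.out = 50)   # fixed regressor values

slopes <- replicate(2000, {
  # errors are drawn independently of x (exogeneity)
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  coef(lm(y ~ x))[["x"]]
})

mean(slopes)  # average estimate, to be compared with beta1
```

The average of the simulated slopes should sit close to `beta1`, while any single estimate can be off by much more. The simulation does not demonstrate the "best" (minimum variance) part of the theorem, which would require comparing OLS against another linear unbiased estimator.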