I am going to revisit the penguins dataset, which is the same dataset I used in my week 1 discussion post. I will also use the same variables that I used in that post:
head(penguins)
## species island bill_len bill_dep flipper_len body_mass sex year
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
## 3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
Dependent variable (Y): body_mass - integer, body mass of the penguin in grams
Independent variable (X): bill_len - numeric, length of the penguin bill in millimeters
The estimating equation is:
\[ body\_mass_i = \beta_0 + \beta_1 \cdot bill\_len_i + \epsilon_i \]
# removing rows that contain missing values (NA) in any column
new_penguins <- na.omit(penguins)
# estimating the linear regression in R using lm()
lr <- lm(new_penguins$body_mass ~ new_penguins$bill_len)
summary(lr)
##
## Call:
## lm(formula = new_penguins$body_mass ~ new_penguins$bill_len)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1759.38 -468.82 27.79 464.20 1641.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 388.845 289.817 1.342 0.181
## new_penguins$bill_len 86.792 6.538 13.276 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
## F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
lr$coefficients
## (Intercept) new_penguins$bill_len
## 388.84516 86.79176
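The intercept tells us that when the bill length is 0 mm, the predicted body mass is 388.85 grams. A bill length of 0 mm is far outside the observed data, so the intercept mainly anchors the line rather than describing a real penguin. The slope tells us that for every additional millimeter of bill length, the predicted body mass increases by about 86.79 grams on average.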
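To see how the two coefficients combine, the fitted line can be evaluated at a bill length inside the observed range. The value of 40 mm below is chosen only for illustration and is not a row taken from the data:

```latex
\[ \widehat{body\_mass} = 388.845 + 86.792 \times 40 \approx 3860.5 \text{ grams} \]
```

A penguin with a bill one millimeter longer (41 mm) would have a predicted mass about 86.79 grams higher.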
# setting my variables
x <- new_penguins$bill_len
y <- new_penguins$body_mass
# calculate slope
slope <- cov(x, y) / var(x)
# calculate intercept
intercept <- mean(y) - slope * mean(x)
# results
slope
## [1] 86.79176
intercept
## [1] 388.8452
Using the covariance/variance formulas, we get a slope of 86.79 and an intercept of 388.85, which match the estimates from lm().
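The manual calculation agrees with lm() because, for a regression with a single predictor, the least squares estimates have a closed form. A short sketch of those formulas follows, where \(\bar{x}\) and \(\bar{y}\) are the sample means (notation introduced here, not in the original post):

```latex
\[
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\widehat{Cov}(x, y)}{\widehat{Var}(x)},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```

The \(n - 1\) divisors in the sample covariance and sample variance cancel, which is why cov(x, y) / var(x) gives exactly the same slope.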
Fitting a least squares line in regression means finding the line that best matches the data points by minimizing the sum of squared differences between the observed and predicted values. We typically require linearity, nearly normal residuals, constant variability, and independent observations when fitting a least squares line.
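One way to see the "minimizing" part is to compute the sum of squared errors for the fitted line and for a couple of slightly different lines. This is only a sketch: it assumes R 4.5.0 or later (where penguins ships with the datasets package), the helper sse() is a hypothetical name of my own, and the nudges of 50 grams and 1 gram per millimeter are arbitrary.

```r
# Assumes R >= 4.5.0, where `penguins` is part of the datasets package
new_penguins <- na.omit(penguins)
fit <- lm(body_mass ~ bill_len, data = new_penguins)

# Hypothetical helper: sum of squared errors for any intercept/slope pair
sse <- function(b0, b1) {
  predicted <- b0 + b1 * new_penguins$bill_len
  sum((new_penguins$body_mass - predicted)^2)
}

b <- unname(coef(fit))  # b[1] = intercept, b[2] = slope

sse_ols     <- sse(b[1], b[2])       # the least squares line
sse_shifted <- sse(b[1] + 50, b[2])  # arbitrary nudge to the intercept
sse_tilted  <- sse(b[1], b[2] + 1)   # arbitrary nudge to the slope

# The least squares line should have the smallest sum of squared errors
sse_ols < sse_shifted
sse_ols < sse_tilted
```

Both comparisons should come back TRUE, since any line other than the least squares line has a larger sum of squared errors on the same data.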
The Gauss-Markov theorem tells us that if certain assumptions are met, the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE), meaning it has the smallest variance among all linear unbiased estimators. This is where the phrase "OLS is BLUE" comes from. The assumptions typically required for OLS to be BLUE are linearity in the parameters, random sampling of the data, no perfect collinearity (regressors aren’t perfectly correlated with each other), exogeneity (regressors aren’t correlated with the error term), and homoscedasticity (the variance of the error term is constant).
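For the simple regression of body_mass on bill_len, the exogeneity and homoscedasticity assumptions can be written in symbols as follows, where \(\sigma^2\) is notation introduced here for the constant error variance:

```latex
\[
E[\epsilon_i \mid bill\_len_i] = 0,
\qquad
Var(\epsilon_i \mid bill\_len_i) = \sigma^2
\]
```

Normality of the errors is not on this list. It matters for the t-tests and p-values in the summary output, but it is not needed for OLS to be BLUE.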