1. Pick any two quantitative variables from a data set that interests you.

I am going to revisit the penguins dataset, which is the same dataset I used in my week 1 discussion post. I will also use the same variables that I used in that post:

head(penguins)
##   species    island bill_len bill_dep flipper_len body_mass    sex year
## 1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
## 2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
## 3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
## 4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
## 5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
## 6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

A. Tell us what the dependent and independent variables are

Dependent variable (Y): body_mass - integer, body mass of the penguin in grams

Independent variable (X): bill_len - numeric, length of the penguin bill in millimeters

The estimating equation is: \[ body\_mass_i = \beta_0 + \beta_1 \cdot bill\_len_i + \epsilon_i \]

B. Estimate the linear regression in R using the lm() command

# removing rows with missing values
new_penguins <- na.omit(penguins)

# estimating the linear regression in R using lm()
lr <- lm(new_penguins$body_mass ~ new_penguins$bill_len)
summary(lr)
## 
## Call:
## lm(formula = new_penguins$body_mass ~ new_penguins$bill_len)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1759.38  -468.82    27.79   464.20  1641.00 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            388.845    289.817   1.342    0.181    
## new_penguins$bill_len   86.792      6.538  13.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3455 
## F-statistic: 176.2 on 1 and 331 DF,  p-value: < 2.2e-16

C. Interpret the slope and intercept parameters

lr$coefficients
##           (Intercept) new_penguins$bill_len 
##             388.84516              86.79176

The intercept tells us that when the bill length is 0 mm, the predicted body mass is 388.85 grams; since no real penguin has a bill length of 0 mm, the intercept mainly anchors the line rather than giving a meaningful prediction on its own. The slope tells us that for every additional millimeter of bill length, the predicted body mass increases by about 86.79 grams.
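As a quick illustration of the slope interpretation (the 45 mm and 46 mm bill lengths below are arbitrary values chosen only for this example), we can compare the fitted body mass at two bill lengths one millimeter apart:

# hypothetical bill lengths of 45 mm and 46 mm, used only to illustrate the slope
b <- unname(coef(lr))
pred_45 <- b[1] + b[2] * 45
pred_46 <- b[1] + b[2] * 46
pred_46 - pred_45  # the difference equals the slope, about 86.79 grams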

D. Replicate the slope and intercept parameters using the covariance/variance formulas

# setting my variables
x <- new_penguins$bill_len
y <- new_penguins$body_mass

# calculate slope 
slope <- cov(x, y) / var(x)

# calculate intercept
intercept <- mean(y) - slope * mean(x)

# results
slope
## [1] 86.79176
intercept
## [1] 388.8452

Using the covariance/variance formulas we get a slope of 86.79 and an intercept of 388.85, which matches the coefficients from lm().
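As a sanity check, the manual estimates can be compared directly with the lm() coefficients; all.equal() should report that they agree up to numerical tolerance:

# comparing the covariance/variance estimates with the lm() coefficients
all.equal(unname(coef(lr)), c(intercept, slope))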

2. In fewer than 20 lines, summarize your findings about OLS

Fitting a least squares line in regression means finding the line that best matches the data points by minimizing the squared differences between the observed and predicted values. When fitting a least squares line we typically require linearity, nearly normal residuals, constant variability, and independent observations.
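To make the "minimizing the squared differences" idea concrete, here is a small sketch that reuses x and y from Part D and treats optim() as a generic numerical minimizer (the starting values c(0, 0) are arbitrary); the pair it finds should land close to the lm() estimates:

# sum of squared residuals for a candidate (intercept, slope) pair
sse <- function(par) sum((y - (par[1] + par[2] * x))^2)

# numerically search for the pair with the smallest sum of squared residuals
optim(c(0, 0), sse, method = "BFGS")$par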

The Gauss-Markov theorem tells us that if certain assumptions are met, the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE); this is where the phrase "OLS is BLUE" comes from. The assumptions typically required for OLS to be BLUE are linearity between the variables, random sampling of the data, non-collinearity (the regressors are not perfectly correlated with each other), exogeneity (the regressors are not correlated with the error term), and homoscedasticity (the variance of the error term is constant).
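A quick, informal way to eyeball some of these assumptions for the penguin model is to look at base R's standard diagnostic plots for the fitted object (a sketch, not a formal test):

# residuals vs. fitted values (checks linearity and constant variance) and
# a normal Q-Q plot of the residuals
par(mfrow = c(1, 2))
plot(lr, which = 1)
plot(lr, which = 2)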