Discussion Week 6

#1. Run a regression on a dataset of my choosing

I am going to use the pirates dataset, which I am recycling from my week 3 post. This dataset lists characteristics of various pirates, such as sex, age, height, number of tattoos, favorite pirate besides themselves, and so on.

# Install the yarrr package
#install.packages('yarrr')

# Load the package
library(yarrr)

## Loading required package: jpeg

## Loading required package: BayesFactor

## Loading required package: coda

## Loading required package: Matrix

## ************
## Welcome to BayesFactor 0.9.12-4.7. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************

## Loading required package: circlize

## ========================================
## circlize version 0.4.16
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================

## yarrr v0.1.5. Citation info at citation('yarrr'). Package guide at yarrr.guide()

## Email me at Nathaniel.D.Phillips.is@gmail.com

pirates <- data.frame(pirates)

A. The two variables of interest I will choose are sword.time, which measures how many seconds it takes for the pirate to draw their sword, and grogg, the count of grogg drinks the pirate imbibes on average per day. I will argue that grogg is the independent variable, and sword.time is the dependent variable. In other words, the number of drinks a pirate has affects their sword drawing time, but not vice versa (I’ll ignore the potential for endogeneity here for the sake of a fun argument). My equation will look like:

\[ sword.time_i = \alpha + grogg_i\beta_1 + \epsilon_i \] \[ \epsilon_i \sim N(0,\sigma^2) \]

B. Use the lm function

# fit linear model 
linear_model <- lm(sword.time ~ grogg, data=pirates) 
  
# view summary of linear model 
summary(linear_model)

## 
## Call:
## lm(formula = sword.time ~ grogg, data = pirates)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -2.731  -2.326  -1.972  -1.231 167.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.82435    1.02082   2.767  0.00577 **
## grogg       -0.02777    0.09632  -0.288  0.77320   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.334 on 998 degrees of freedom
## Multiple R-squared:  8.326e-05,  Adjusted R-squared:  -0.0009187 
## F-statistic: 0.0831 on 1 and 998 DF,  p-value: 0.7732

C. My results are not statistically significant (the p-value for grogg is 0.77), but here is what they say: the intercept is about 2.8, and the slope is about -0.03. The intercept tells me that a pirate who has no drinks has an expected sword draw time of 2.8 seconds (not bad!). The negative slope means that each additional daily drink of grogg is associated with a 0.03-second decrease in expected sword draw time, although an effect this small relative to its standard error cannot be distinguished from zero.
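The t value and p-value for grogg can be reproduced by hand from the estimate and standard error in the summary table. This is a minimal sketch that uses only the rounded numbers printed in the output, so it should agree with the table only to about three decimals. The variable names are my own.

```r
# Values copied from the summary(linear_model) output (rounded as printed)
estimate  <- -0.02777
std_error <- 0.09632
df_resid  <- 998

# t statistic: the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 998 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * crit * std_error

print(round(t_value, 3))
print(round(p_value, 3))
print(round(ci, 3))
```

The interval contains zero, which is consistent with the large p-value. Calling confint(linear_model) on the fitted model should give the same interval at full precision.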

D. Replicating the slope and intercept

# the slope is given by cov(x, y) / var(x)
slope <- cov(pirates$grogg, pirates$sword.time) / var(pirates$grogg)

# round to 5 decimals to compare with the lm() output
print(round(slope, 5))

## [1] -0.02777

# the intercept is given by mean(y) - (slope * mean(x))
intercept <- mean(pirates$sword.time) - (slope * mean(pirates$grogg))

print(round(intercept, 5))

## [1] 2.82435
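The slope and intercept formulas in the comments above come from minimizing the sum of squared residuals. Here is a short sketch of the derivation, writing \(x_i\) for grogg and \(y_i\) for sword.time (shorthand I am assuming here, not notation from the assignment):

\[ \min_{\alpha,\beta_1} \sum_i (y_i - \alpha - \beta_1 x_i)^2 \]

Setting the derivative with respect to \(\alpha\) to zero gives the intercept:

\[ \hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x} \]

Substituting this back in and setting the derivative with respect to \(\beta_1\) to zero gives the slope:

\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{Cov(x,y)}{Var(x)} \]

R's cov() and var() both divide by n - 1, so that factor cancels in the ratio and the result matches the lm() estimate.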

#2. Gauss-Markov Assumptions

BLUE stands for best linear unbiased estimator. By the Gauss-Markov theorem, the OLS estimator is BLUE when the four Gauss-Markov assumptions are met. The four assumptions are: 1. The relationship between the dependent and independent variables is linear in the parameters, so it can be plotted as a straight line on a graph, 2. The expected value of the error term is zero, 3. The errors are homoskedastic; their variance is constant across observations, and 4. The error terms are uncorrelated across observations.

If these conditions are not met, estimates from a regression output should be treated with caution because they may be biased, inconsistent, or inefficient.
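A few of these assumptions can be checked informally from the residuals. The sketch below uses simulated data rather than the pirates data, and the names x, y, and fit are placeholders of my own. The heteroskedasticity check is a hand-computed Breusch-Pagan-style statistic and the autocorrelation check is the Durbin-Watson statistic, both written in base R rather than taken from a package.

```r
# Simulated data (an assumption for illustration, not the pirates dataset)
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n, sd = 1)

fit <- lm(y ~ x)
res <- resid(fit)

# Assumption 2: with an intercept, OLS residuals average to zero by
# construction, so this confirms the fit rather than testing the assumption
mean_res <- mean(res)

# Assumption 3: regress squared residuals on x; n * R^2 is approximately
# chi-squared with 1 degree of freedom if the errors are homoskedastic
aux <- lm(res^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Assumption 4: Durbin-Watson statistic; values near 2 suggest no
# first-order autocorrelation (only meaningful if the rows are ordered)
dw <- sum(diff(res)^2) / sum(res^2)

print(mean_res)
print(bp_p)
print(dw)
```

For the linearity assumption, a residuals-versus-fitted plot such as plot(fit, which = 1) is a common visual check; a curved pattern would suggest the straight-line form is a poor fit.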

Emily Ward

2025-02-20