2025-03-17

Intro

In this presentation, I’ll go over an example of Simple Linear Regression using a small dataset I made up about house prices and square footage.

What is Simple Linear Regression?

Simple linear regression is a way to model the relationship between two variables. One variable (the predictor) is used to predict the other (the response).

The model equation is:

\[ Y = \beta_0 + \beta_1 X + \epsilon \]

Where:

  • \(Y\): house price
  • \(X\): square footage
  • \(\beta_0, \beta_1\): intercept and slope
  • \(\epsilon\): error term

Dataset: House Prices and Size

Square Footage | House Price
1000 | 200,000
1500 | 250,000
2000 | 300,000
2500 | 350,000
3000 | 400,000

Why Use Simple Linear Regression?

  • It’s simple and easy to understand.
  • It helps show how two variables are related.
  • It makes predictions based on patterns in the data.

Summary of Dataset

data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)
summary(data)
##       sqft          price       
##  Min.   :1000   Min.   :200000  
##  1st Qu.:1500   1st Qu.:250000  
##  Median :2000   Median :300000  
##  Mean   :2000   Mean   :300000  
##  3rd Qu.:2500   3rd Qu.:350000  
##  Max.   :3000   Max.   :400000

Scatterplot: Price vs. Square Footage

The relationship looks linear: in this made-up dataset, every additional 500 square feet corresponds to a 50,000 increase in price.
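A plot like the one on this slide could be produced with base R graphics. This is a minimal sketch and not necessarily the code behind the original figure; the axis labels, title, and point style are assumptions.

```r
# Recreate the made-up dataset from the summary slide
data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)

# Scatterplot of price against square footage (labels are illustrative)
plot(data$sqft, data$price,
     xlab = "Square footage", ylab = "House price",
     main = "Price vs. Square Footage", pch = 19)

# Correlation as a quick numeric check on how linear the relationship is
r <- cor(data$sqft, data$price)
r
```

A correlation close to 1 backs up what the scatterplot suggests visually.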

Regression Line on the Plot

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]
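To see where the estimates in this equation come from, the slope and intercept can be computed by hand with the standard least-squares formulas. The sketch below assumes those textbook formulas; the variable names `x`, `y`, `b0`, and `b1` are only illustrative.

```r
# Recreate the made-up dataset from the summary slide
data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)

x <- data$sqft
y <- data$price

# Slope: sum of cross-deviations divided by sum of squared deviations in x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)
```

For this dataset the hand calculation gives an intercept of 100000 and a slope of 100.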

Fitting the Model in R

## Warning in summary.lm(model): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = price ~ sqft, data = data)
## 
## Residuals:
##          1          2          3          4          5 
## -2.867e-11  1.456e-11  3.071e-11  9.552e-12 -2.616e-11 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.000e+05  4.064e-11 2.461e+15   <2e-16 ***
## sqft        1.000e+02  1.916e-14 5.220e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.029e-11 on 3 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.724e+31 on 1 and 3 DF,  p-value: < 2.2e-16

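This gives us the equation of the line: \(\hat{Y} = 100000 + 100X\). summary() shows the slope, intercept, R-squared, and p-values. The warning appears because the made-up data fall exactly on a line: the residuals are only floating-point noise (around 1e-11), so the standard errors, t-values, and p-values are not meaningful here.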
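The output above matches a call to `lm(price ~ sqft, data = data)` stored in an object named `model`. A minimal sketch of that step, plus a prediction for a hypothetical 1,800 square foot house (the `new_house` value is an assumption for illustration, not part of the dataset), might look like this:

```r
# Recreate the made-up dataset from the summary slide
data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)

# Fit the simple linear regression of price on square footage
model <- lm(price ~ sqft, data = data)

# Estimated intercept and slope
coef(model)

# Predict the price of a hypothetical house that is not in the data
new_house <- data.frame(sqft = 1800)
pred <- predict(model, newdata = new_house)
pred
```

With an intercept of 100000 and a slope of 100, the predicted price works out to 100000 + 100 × 1800 = 280000.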

3D Plot of the Errors

What Did I Learn?

  • Linear regression is a great way to start understanding relationships between variables.
  • R makes it easy to fit the model and see results.

Things to watch out for:

  • If data isn’t linear, this won’t work well.
  • Outliers can really affect the line.
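The effect of an outlier can be illustrated by changing one price and refitting. In this sketch the 800000 value is a hypothetical outlier invented for the example, not part of the original dataset.

```r
# Recreate the made-up dataset from the summary slide
data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)

# Fit on the original data
fit_clean <- lm(price ~ sqft, data = data)

# Copy the data and replace the last price with a hypothetical outlier
data_outlier <- data
data_outlier$price[5] <- 800000
fit_outlier <- lm(price ~ sqft, data = data_outlier)

# Compare the two slopes side by side
slopes <- c(
  clean = unname(coef(fit_clean)["sqft"]),
  with_outlier = unname(coef(fit_outlier)["sqft"])
)
slopes
```

A single changed point moves the slope from 100 to 260, which is why it is worth checking a scatterplot before trusting the fitted line.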

Final Equation and Residual Sum of Squares

\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]

\[ \text{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

When we measure how well the line fits, a smaller RSS means a better fit.
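The RSS can be computed directly from a fitted model. The sketch below refits the model on the same made-up data and calculates RSS two ways; the names `rss` and `rss_manual` are illustrative.

```r
# Recreate the made-up dataset from the summary slide
data <- data.frame(
  sqft = c(1000, 1500, 2000, 2500, 3000),
  price = c(200000, 250000, 300000, 350000, 400000)
)

model <- lm(price ~ sqft, data = data)

# RSS from the model's residuals
rss <- sum(residuals(model)^2)

# The same quantity written out from the definition: sum of (Y_i - Y_hat_i)^2
rss_manual <- sum((data$price - fitted(model))^2)

c(rss = rss, rss_manual = rss_manual)
```

Because these points lie exactly on a line, both values are essentially zero, differing from zero only by floating-point rounding.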