2025-03-16

R Markdown

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks.

Outline of Presentation

  • Introduction to Simple Linear Regression
  • The Mathematical Equation
  • An Example Dataset
  • Fitting The Model
  • Visualization With ggplot
  • Residual Analysis
  • 3D Visualization With Plotly
  • R Code
  • Analysis of the Results and Conclusion

Introduction to Simple Linear Regression

  • Simple linear regression is a statistical method for modeling the relationship between one independent variable (X) and one dependent variable (Y)
  • It assumes that the relationship between the two is linear
  • It predicts Y from X

The Mathematical Equation

  • For Linear Regression: \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
  • \(Y_i\) is the dependent variable
  • \(X_i\) is the independent variable
  • \(\beta_0\) is the y-intercept
  • \(\beta_1\) is the slope
  • \(\epsilon_i\) is the random error
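
As a minimal sketch of how the two coefficients can be estimated, the ordinary least-squares formulas \(\hat\beta_1 = \mathrm{cov}(X, Y) / \mathrm{var}(X)\) and \(\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}\) can be applied by hand. The example assumes R's built-in cars dataset; the variable names (x, y, b0, b1, by_hand, by_lm) are illustrative only.

```r
# Closed-form least-squares estimates for Y = b0 + b1 * X,
# computed on R's built-in cars dataset (speed in mph, dist in ft).
x <- cars$speed
y <- cars$dist

b1 <- cov(x, y) / var(x)       # slope: covariance over variance of X
b0 <- mean(y) - b1 * mean(x)   # intercept: line passes through the means

by_hand <- c(intercept = b0, slope = b1)
by_lm   <- unname(coef(lm(dist ~ speed, data = cars)))

print(by_hand)
print(by_lm)   # lm() should report the same two numbers
```

Both approaches should agree up to floating-point rounding, since lm() minimizes the same sum of squared errors.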

An Example Dataset

We can explore the relationship between speed and stopping distance in R's built-in cars dataset.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Fitting the Model

car_model = lm(dist ~ speed, data = cars)
summary(car_model)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
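
A short sketch of how the fitted model might be used once it exists: the coefficients and R-squared can be pulled out of the model object, and predict() gives fitted stopping distances for new speeds. The speeds of 10 and 20 mph are arbitrary values chosen for illustration. From the printed coefficients, the predictions should be roughly 21.7 and 61.1 ft.

```r
# Refit the model so this example is self-contained.
car_model <- lm(dist ~ speed, data = cars)

# Extract the estimated intercept and slope as a named vector.
estimates <- coef(car_model)
print(estimates)

# Proportion of variance in dist explained by speed.
r_squared <- summary(car_model)$r.squared
print(r_squared)

# Predict stopping distance (ft) for illustrative speeds (mph).
new_speeds <- data.frame(speed = c(10, 20))
preds <- predict(car_model, newdata = new_speeds)
print(preds)
```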

Visualizing the relationship with ggplot

library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) + 
  geom_point(color = "#8C1D40", alpha = 0.7) + 
  geom_smooth(method = "lm", color = "blue") + 
  labs(title = "Car Speed VS Distance", x = "Speed in mph", y = "Stopping Distance in ft", caption = "Source: 'cars' dataset in R") + 
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Residual Analysis

model_data = data.frame(Speed = cars$speed, Fitted = fitted(car_model), Residuals = residuals(car_model))

ggplot(model_data, aes(x = Fitted, y = Residuals)) +
  geom_point(color = "#8C1D40") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Residuals VS. Fitted Values",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal()
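
As a numerical companion to the plot, the sketch below checks two properties of the residuals: with an intercept in the model they sum to (numerically) zero, and the residual standard error reported by summary() is the square root of the residual sum of squares divided by the residual degrees of freedom. The names res and rse are illustrative.

```r
# Refit the model so this example is self-contained.
car_model <- lm(dist ~ speed, data = cars)
res <- residuals(car_model)

# Least-squares residuals sum to zero (up to floating-point error)
# when the model includes an intercept.
print(sum(res))

# Residual standard error by hand: sqrt(RSS / residual df).
rse <- sqrt(sum(res^2) / df.residual(car_model))
print(rse)

# sigma() returns the same quantity directly from the model object.
print(sigma(car_model))
```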

3D Visualization With Plotly

library(plotly)
cars_data = cars
cars_data$fitted = fitted(car_model)

plot_ly(cars_data, x = ~speed, y = ~dist, z = ~fitted, type = 'scatter3d', mode = 'markers', marker=list(color = '#8C1D40', size = 5)) %>%
  layout(
    scene = list(
      # Plotly's scene axes are named xaxis, yaxis and zaxis (no underscore)
      xaxis = list(title = 'Speed in mph'),
      yaxis = list(title = 'Actual Distance in ft'),
      zaxis = list(title = 'Fitted Distance in ft')
    ),
    title = '3D Visualization of Linear Regression'
  )

R Output Demonstration

library(ggplot2)
ggplot(cars, aes(x=speed, y = dist)) +
  geom_point(color = '#8C1D40') +
  geom_smooth(method = 'lm', color = 'blue') +
  ggtitle("Scatter Plot with Regression Line") +
  xlab("Speed in mph") +
  ylab("Stopping Distance in ft")

Analysis of the Results

The regression model gives us the following equation: \[\text{Distance} = -17.579 + 3.932 \times \text{Speed}\]

Some things we can gather from these results:

  • Statistical significance: The p-value for the speed coefficient is extremely small (1.49e-12), meaning that there is a statistically significant relationship between speed and stopping distance
  • R-squared: The R-squared of 0.6511 means that speed explains about 65% of the variation in stopping distance
  • Intercept: The negative intercept (-17.579) means the model predicts negative stopping distances at very low speeds, which is physically impossible, so the intercept has no meaningful physical interpretation
  • Practical interpretation: For each 1 mph increase in speed, the predicted stopping distance increases by approximately 3.9 feet
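
Two of these points can be checked with a few lines of R. This is a sketch rather than part of the original analysis: in simple linear regression, R-squared equals the squared Pearson correlation between the two variables, and the speed below which the fitted line predicts a negative distance is the intercept divided by the slope, with the sign flipped. The names r and zero_speed are illustrative.

```r
# Refit the model so this example is self-contained.
car_model <- lm(dist ~ speed, data = cars)

# In simple linear regression, R-squared is the squared correlation.
r <- cor(cars$speed, cars$dist)
r_squared <- summary(car_model)$r.squared
print(c(r = r, r_squared_from_r = r^2, r_squared_from_model = r_squared))

# Speed at which the fitted line crosses zero distance:
# 0 = b0 + b1 * speed  =>  speed = -b0 / b1
b <- coef(car_model)
zero_speed <- unname(-b["(Intercept)"] / b["speed"])
print(zero_speed)

# Compare with the slowest speed actually observed in the data.
print(min(cars$speed))
```

The crossing point works out to about 4.5 mph, slightly above the slowest observed speed of 4 mph, so the negative predictions occur only at the very bottom of the observed range and below it.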

In conclusion, the quality of the model's predictions depends on how well the data satisfy the assumption of a linear relationship. The model performed reasonably well and showed statistical significance. However, the residuals suggest that the relationship between speed and stopping distance is not perfectly linear.