library(plotly)
library(ggplot2)
library(dplyr)
set.seed(12)
2026-03-08
In statistics, we use Simple Linear Regression to model the linear relationship between two numeric variables.
Independent Variable (X): This is the variable we use to make the prediction.
Dependent Variable (Y): This is the variable we want to predict.
We try to find a straight line that fits the data points the best.
We can write the relationship using a mathematical formula. In LaTeX, it looks like this:
\[Y = \beta_0 + \beta_1 X + \epsilon\]
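Where:
\(\beta_0\): The intercept, the value of Y when X is 0.
\(\beta_1\): The slope, the change in Y for a one-unit increase in X.
\(\epsilon\): The error term, the part of Y that the line does not explain.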
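To make the roles of the terms in the formula concrete, here is a minimal sketch that simulates data from the model and then fits a line to it. The values `beta0 = 2`, `beta1 = 0.5`, the sample size, and the error standard deviation are hypothetical and chosen only for illustration.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(12)
beta0 <- 2      # intercept
beta1 <- 0.5    # slope
n <- 200        # number of simulated observations

x <- runif(n, min = 0, max = 10)       # independent variable
epsilon <- rnorm(n, mean = 0, sd = 1)  # random error term
y <- beta0 + beta1 * x + epsilon       # dependent variable

# Fit a straight line to the simulated data
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients should land close to the values used to generate the data, but not exactly on them, because of the random error.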
To find the best line, we need to calculate the slope \(\beta_1\). The formula is:
\[\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\]
This formula gives the slope that minimizes the sum of the squared vertical distances (the residuals) between the actual data points and our line. This method is called Ordinary Least Squares (OLS).
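The slope formula can be checked by hand in R. The sketch below applies it to the built-in `trees` data and compares the result with `lm()`. The intercept line uses the standard OLS result \(\beta_0 = \bar{y} - \beta_1 \bar{x}\), which is not written out above; the names `b0` and `b1` are only illustrative.

```r
x <- trees$Girth
y <- trees$Volume

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Compare with the coefficients that lm() computes
coef(lm(Volume ~ Girth, data = trees))
```

Both approaches should return the same two numbers, up to floating-point rounding.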
We will use the trees dataset in R. We want to see if the Girth of a tree can predict its Volume.
Here is the R code to create a simple model:
# Create the linear model
model <- lm(Volume ~ Girth, data = trees)
# Show the summary of the model
summary(model)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -36.943459   3.365145 -10.97827 7.621449e-12
## Girth         5.065856   0.247377  20.47829 8.644334e-19
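The coefficients above define the fitted line, so they can be used to predict the volume of a tree that is not in the data. This is a minimal sketch; the girth of 12 is a hypothetical value picked for illustration, and `new_tree` is just an illustrative name.

```r
model <- lm(Volume ~ Girth, data = trees)

# Hypothetical new tree with a girth of 12
new_tree <- data.frame(Girth = 12)

# Point prediction: intercept + slope * 12
predict(model, newdata = new_tree)

# Same prediction with a 95% prediction interval
predict(model, newdata = new_tree, interval = "prediction", level = 0.95)
```

With the estimates shown above, the point prediction is -36.943459 + 5.065856 × 12, which is about 23.85.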
This plot shows the raw data points of Tree Girth versus Volume.
Now we add the Regression Line to the plot to see the trend.
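One way to draw such a plot is sketched below with `geom_smooth(method = "lm")`. The color of the line, the title, and the choice to hide the confidence band with `se = FALSE` are assumptions, not necessarily the settings used for the original figure.

```r
library(ggplot2)

# Scatter plot of the tree data with a fitted least-squares line on top
p_line <- ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +  # assumed styling
  theme_minimal() +
  labs(title = "Tree Data with Regression Line", x = "Girth of Tree",
       y = "Volume")

p_line
```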
## `geom_smooth()` using formula = 'y ~ x'
Below is an interactive plot using plotly. You can hover over the points to see the values.
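A plot like this can be built with `plot_ly()`. The sketch below is one possible version; the marker color, title, and axis titles are assumptions chosen to match the static plot, not necessarily the code behind the original figure.

```r
library(plotly)

# Interactive scatter plot: hovering over a point shows its Girth and Volume
fig <- plot_ly(
  data = trees,
  x = ~Girth,
  y = ~Volume,
  type = "scatter",
  mode = "markers",
  marker = list(color = "darkblue")
)

# Add a title and axis labels (assumed wording)
fig <- plotly::layout(
  fig,
  title = "Interactive Plot of Tree Data",
  xaxis = list(title = "Girth of Tree"),
  yaxis = list(title = "Volume")
)

fig
```

An existing ggplot2 object can also be converted with `ggplotly()`, which is convenient when the static plot is already written.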
This is the code I used to visualize the tree data with the ggplot2 package.
ggplot(trees, aes(x = Girth, y = Volume)) +
  geom_point(color = "darkblue") +
  theme_minimal() +
  labs(title = "Scatter plot of Tree Data", x = "Girth of Tree",
       y = "Volume")
Simple Linear Regression is a simple but powerful tool.
It summarizes the relationship between two variables with a single straight line.
It lets us predict the dependent variable for new values of the independent variable.
It is the first step toward many advanced data science methods.