2024-03-21

Introduction to Simple Linear Regression

Simple linear regression is a statistical technique used to understand the relationship between two variables: a dependent variable and an independent variable. This method helps in predicting the value of the dependent variable based on the value of the independent variable, using a straight line known as the regression line.

Understanding this relationship is crucial because it allows us to make predictions and decisions based on data, a cornerstone of data analysis across various fields.

Mathematical Expression

The mathematical formula for a simple linear regression is: \[Y = \beta_0 + \beta_1X\] where \(Y\) is the dependent variable we aim to predict, \(X\) is the independent variable we use for prediction, \(\beta_0\) is the y-intercept, indicating the value of \(Y\) when \(X\) is 0, and \(\beta_1\) is the slope, showing the change in \(Y\) for a one-unit change in \(X\).
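As a minimal sketch of how \(\beta_0\) and \(\beta_1\) can be estimated in practice, the R snippet below computes the usual least-squares estimates by hand and compares them with `lm()`. The vectors `x` and `y` are made-up toy values (not from any dataset in this lecture), chosen so that the points lie exactly on the line y = 2 + 3x.

```r
# Toy data (hypothetical): points that lie exactly on y = 2 + 3x
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

# Least-squares estimates computed by hand
beta1_hat <- cov(x, y) / var(x)             # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)  # intercept

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)
coef(fit)
```

Because the toy points sit exactly on a line, both approaches should recover an intercept of 2 and a slope of 3. With real, noisy data the estimates describe the best-fitting line rather than an exact relationship.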

Application of Simple Linear Regression

Simple linear regression is versatile, finding applications in economics for predicting demand based on price, in meteorology for forecasting weather, in health sciences for estimating the impact of lifestyle choices on health outcomes, and more. These real-world applications demonstrate the power of understanding the linear relationships between variables.

Correlation vs. Causation

Understanding the strength and direction of a relationship between two variables is key. However, it’s crucial to remember that correlation does not imply causation. A strong linear relationship does not necessarily mean that changes in the independent variable cause changes in the dependent variable.
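One way to see why correlation does not imply causation is a small simulation in which two variables are both driven by a hidden third variable. The sketch below is illustrative only: the variables `z`, `x`, and `y` are simulated, and the noise level is an arbitrary assumption.

```r
set.seed(1)  # make the simulation reproducible

# z is a hidden common cause; x and y each depend on z but not on each other
z <- rnorm(200)
x <- z + rnorm(200, sd = 0.3)
y <- z + rnorm(200, sd = 0.3)

# Strong correlation between x and y despite no causal link between them
r_xy <- cor(x, y)

# After removing the part explained by z, the leftover association is weak
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = r_xy, adjusted_for_z = r_partial)
```

Here `x` and `y` move together only because both follow `z`, so a regression of `y` on `x` would fit well even though changing `x` would not change `y`.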

Example: Valentine’s Day Spending

We will use a dataset from Kaggle showing the historical averages of spending on Valentine’s Day. Our goal is to find the relationship between the year and the average spending per person.

Code:

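library(ggplot2)  # provides ggplot(), geom_point(), geom_smooth()

# Assumes historical_spending is already loaded with columns Year and PerPerson
ggplot(historical_spending, aes(x = Year, y = PerPerson)) +
  geom_point() +                                  # Plot the data points
  geom_smooth(method = "lm", color = "maroon") +  # Add a linear regression line
  theme_minimal() +                               # Use a minimal theme for a clean look
  labs(title = "Average Spending on Valentine's Day Over the Years",
       x = "Year",
       y = "Average Spending ($)")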

This plot shows the relationship between the year and average spending per person on Valentine’s Day.

Graph (with ggplot)

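[Figure: scatter plot titled "Average Spending on Valentine's Day Over the Years", with Year on the x-axis and Average Spending ($) on the y-axis, plus a fitted linear regression line]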

Interpretation of Coefficients

Understanding the coefficients in a linear regression model allows us to interpret the relationship between variables.

## [1] "y = -11650.92 + 5.85x"

\(\beta_0\) and \(\beta_1\) in our regression model have specific meanings. \(\beta_0\) (the intercept) is the model’s predicted average spending per person when the year is 0. That is far outside the range of the data, so the value (about -11650.92) has no practical meaning on its own; it only positions the line. \(\beta_1\) (the slope) indicates how much the average spending per person changes for each additional year.

Because \(\beta_1\) is positive (about 5.85), the model suggests that average spending per person rises by roughly $5.85 per year.
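To make the role of the intercept concrete, the sketch below plugs years into the fitted line using the coefficients printed in the model output above. The helper `fitted_line()` is a hypothetical name introduced here for illustration, and because the printed coefficients are rounded to two decimals, the results are approximate.

```r
# Coefficients as printed in the model output above (rounded to two decimals)
beta0 <- -11650.92
beta1 <- 5.85

# Value of the fitted line at a given year
fitted_line <- function(year) beta0 + beta1 * year

fitted_line(0)     # the intercept itself, far outside the observed years
fitted_line(2010)  # roughly 107.58 with these rounded coefficients
```

The line only takes sensible dollar values once the year is in the range the data covers, which is why the intercept should not be read as a spending figure.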

Making Predictions

Simple linear regression not only helps us understand relationships but also allows us to make predictions about future data. Using our model, we can predict how much an average person might spend on their partner for Valentine’s Day in 2024.

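# Assumes the model was fit as PerPerson ~ Year on historical_spending
lm_model <- lm(PerPerson ~ Year, data = historical_spending)
predicted_spending <- predict(lm_model, newdata = data.frame(Year = 2024))
sprintf("Predicted: $%.2f", predicted_spending)
## [1] "Predicted: $191.26"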

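Based on the fitted model, a person is predicted to spend about $191.26 on average on their partner for Valentine’s Day in 2024. Because 2024 lies beyond the years in the data, this is an extrapolation and assumes the linear trend continues.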
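Readers who plug 2024 into the printed equation by hand will get a slightly different number than `predict()` reports. The sketch below shows the arithmetic. The gap is consistent with the printed coefficients being rounded to two decimals, although that explanation is an assumption, since the unrounded coefficients are not shown in the lecture.

```r
# Rounded coefficients as printed earlier in the lecture output
beta0 <- -11650.92
beta1 <- 5.85

by_hand  <- beta0 + beta1 * 2024  # about 189.48
reported <- 191.26                # value returned by predict() in the lecture

reported - by_hand                # gap of about 1.78

# A slope error of only 0.001 shifts the 2024 value by about 2.02,
# so rounding the slope to two decimals is enough to explain the gap
0.001 * 2024
```

For reporting, it is safer to use `predict()` on the fitted model object than to retype rounded coefficients.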

Exploring Multivariable Linear Regression

While simple linear regression involves two variables, multivariable linear regression allows us to consider multiple independent variables. This can provide a more detailed analysis in complex datasets.
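A minimal sketch of a model with more than one predictor is shown below. It uses R's built-in `mtcars` dataset rather than the lecture's data. The choice of dataset and of `wt` and `hp` as predictors is an assumption made purely for illustration.

```r
# Fuel efficiency (mpg) modeled from weight (wt) and horsepower (hp)
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

# One intercept plus one slope per predictor
coef(multi_fit)

# Each slope is the expected change in mpg for a one-unit change in that
# predictor while the other predictor is held fixed
predict(multi_fit, newdata = data.frame(wt = 3, hp = 110))
```

The formula interface is the same as in the simple case; extra predictors are added on the right-hand side with `+`.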

## [1] "Females with name 'Minnie': y = 48086.02 + -24.09x"

Conclusion and Further Learning

This lecture has explored simple and multivariable linear regression, emphasizing their role in analyzing relationships between variables, making predictions, and interpreting data. Engaging with diverse datasets and applying these concepts is key to unlocking insights across various fields. Linear regression’s simplicity and wide applicability make it a fundamental tool for data analysis, offering a statistical perspective on economic, health, and historical data trends.