This presentation covers the basics of Simple Linear Regression, focusing on the relationship between hair color and frequency among females.
2024-09-25
Simple Linear Regression is a statistical method for modeling the relationship between two variables by fitting a straight line to the data. One variable is the independent variable (the predictor) and the other is the dependent variable (the response). In this dataset, we focus on hair color and how often each color is observed, specifically among females.
Simple Linear Regression uses the equation \[ Y = mx + b \]. You might have seen this equation often since high school, and it seems we just can’t get rid of it! The Y represents the dependent variable. The m is the slope. The x is the independent variable. And the b is the y-intercept.
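To make the roles of m and b concrete, here is a minimal sketch on made-up numbers. The data frame `toy` and every name in it are hypothetical and are not part of the HairEyeColor analysis; y is constructed as exactly 2x + 1, so a straight-line fit should recover a slope near 2 and an intercept near 1.

```r
# Hypothetical toy data (not from the presentation): y is exactly 2 * x + 1.
toy <- data.frame(x = 1:10)
toy$y <- 2 * toy$x + 1

# Fit the straight line y = m * x + b with lm().
toy_fit <- lm(y ~ x, data = toy)
b <- coef(toy_fit)[["(Intercept)"]]  # y-intercept
m <- coef(toy_fit)[["x"]]            # slope

# Predict y for a new x, once by hand and once with predict().
new_x <- 12
by_hand <- m * new_x + b
by_predict <- predict(toy_fit, newdata = data.frame(x = new_x))

print(c(slope = m, intercept = b, by_hand = by_hand))
```

Plugging a new x into m * x + b by hand gives the same value as `predict()`, which is all the fitted equation is doing.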
Simple linear regression matters because it lets us quantify the relationship between the two variables and draw conclusions from the data. With only one predictor, the model is also easier to understand and interpret than models with many predictors.
Based on this dataset:
- Independent variable: Hair color
- Dependent variable: Frequency (number of females observed with each hair color)
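A quick way to see these two variables is to look at the female subset directly. This is a small sketch using base R only; the names `hair_df` and `female_df` are placeholders chosen for this example, not objects from the presentation's own code.

```r
# HairEyeColor ships with base R as a 3-way table (Hair x Eye x Sex).
hair_df <- as.data.frame(HairEyeColor)          # columns: Hair, Eye, Sex, Freq
female_df <- subset(hair_df, Sex == "Female")   # keep only the female counts

# Independent variable: a factor with one level per hair color.
print(levels(female_df$Hair))

# Dependent variable: the observed counts (one per hair/eye combination).
print(female_df$Freq)

# Number of rows available to the regression.
print(nrow(female_df))
```

Note that hair color is stored as a factor (a categorical variable), so `lm()` will estimate one coefficient per non-baseline hair color rather than a single slope.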
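The fitted equation, using the coefficients from the model summary in this presentation (Black hair is the baseline level, and each hair color term equals 1 for that color and 0 otherwise): \[ \widehat{\text{Freq}} = 13.00 + 22.75\,\text{Brown} - 3.75\,\text{Red} + 7.25\,\text{Blond} \]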
Here’s how we perform simple linear regression using the HairEyeColor dataset.
Loading the packages:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code:
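# Convert the HairEyeColor contingency table into a data frame of counts
HairFreqdf <- as.data.frame(HairEyeColor)
# Keep only the rows for females
FemaleData <- HairFreqdf %>% filter(Sex == "Female")
# Regress frequency on hair color and print the model summary
model <- lm(Freq ~ Hair, data = FemaleData)
summary(model)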
##
## Call:
## lm(formula = Freq ~ Hair, data = FemaleData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.750 -11.312  -3.125   0.375  43.750
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.000      9.988   1.302    0.217
## HairBrown     22.750     14.125   1.611    0.133
## HairRed       -3.750     14.125  -0.265    0.795
## HairBlond      7.250     14.125   0.513    0.617
##
## Residual standard error: 19.98 on 12 degrees of freedom
## Multiple R-squared:  0.256,  Adjusted R-squared:  0.07002
## F-statistic: 1.376 on 3 and 12 DF,  p-value: 0.2972
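One way to read the coefficient table: because Hair is a factor, the intercept should be the mean frequency for the baseline level (Black), and each other coefficient should be the difference between that color's mean and the baseline mean. The sketch below checks this with base R only; `hair_df`, `female_df`, `fit`, `group_means`, and `implied_means` are names introduced for this illustration.

```r
# Rebuild the female subset with base R so the check is self-contained.
hair_df <- as.data.frame(HairEyeColor)
female_df <- subset(hair_df, Sex == "Female")

# Mean frequency per hair color, computed directly from the data.
group_means <- tapply(female_df$Freq, female_df$Hair, mean)

# Same model as in the presentation: frequency regressed on hair color.
fit <- lm(Freq ~ Hair, data = female_df)
coefs <- coef(fit)

# Means implied by the coefficients: baseline, then baseline + difference.
implied_means <- c(
  Black = coefs[["(Intercept)"]],
  Brown = coefs[["(Intercept)"]] + coefs[["HairBrown"]],
  Red   = coefs[["(Intercept)"]] + coefs[["HairRed"]],
  Blond = coefs[["(Intercept)"]] + coefs[["HairBlond"]]
)

print(group_means)
print(implied_means)
```

If the two printed vectors agree, the model is describing differences in average counts between hair colors rather than a single slope along a numeric axis. The p-values in the summary are all above 0.05, so none of these differences is statistically distinguishable from zero at that level with only 16 observations.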
suppressMessages(library(plotly))
p <- plot_ly(data = FemaleData,
             x = ~Hair, y = ~Freq,
             type = 'scatter', mode = 'markers',
             text = ~paste("Count:", Freq), hoverinfo = 'text') %>%
  layout(title = "Frequency of Females by Hair Color",
         xaxis = list(title = "Hair Color"),
         yaxis = list(title = "Frequency"))
p