Simple linear regression is a statistical method for modeling the relationship between two variables: one independent variable and one dependent variable. It fits a straight line to the observed data to describe and measure that relationship. In this dataset, we focus on the hair color of the individuals observed, specifically the females.
Simple linear regression uses the equation \[ Y = mx + b \] You have probably seen this equation since high school, and it seems we just can’t get rid of it! Here Y is the dependent variable, m is the slope, x is the independent variable, and b is the y-intercept.
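As a minimal sketch of where m and b come from, the least-squares slope and intercept can be computed by hand and compared with what `lm()` reports. The five (x, y) points below are made up purely for illustration and are not part of the HairEyeColor data.

```r
# Hypothetical toy data, invented for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least-squares slope: co-variation of x and y divided by the variation of x
m <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the fitted line passes through the point (mean(x), mean(y))
b <- mean(y) - m * mean(x)

# lm() estimates the same two numbers
fit <- lm(y ~ x)

c(slope = m, intercept = b)   # slope 0.6, intercept 2.2
coef(fit)                     # (Intercept) 2.2, x 0.6
```

For these toy points the fitted line is Y = 0.6x + 2.2, so each one-unit increase in x raises the predicted Y by 0.6.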
The purpose of simple linear regression is to observe the
relationship between the two variables so that we can draw conclusions
from our data. Simple linear regressions are also much simpler to
understand and interpret than models with several predictors.
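## Slide 5
Based on this dataset:
Independent variable: Hair color
Dependent variable: Frequency (number of females observed with each
hair color)
The equation for this: \[ Y = 5.50x + 10.00 \]
Because hair color is a categorical variable, lm() represents it with
indicator (dummy) variables, so the summary reports an intercept and one
coefficient per non-baseline hair color rather than a single slope.
Here’s how we perform simple linear regression using the
HairEyeColor dataset: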
library(tidyverse)
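# dplyr is already attached by library(tidyverse); this call is redundant but harmless
library(dplyr)
# Flatten the HairEyeColor contingency table into one row per Hair/Eye/Sex combination
HairFreqdf <- as.data.frame(HairEyeColor)
# Keep the 16 rows for females (4 hair colors x 4 eye colors)
FemaleData <- HairFreqdf %>% filter(Sex == "Female")
# Hair is a factor, so lm() fits one coefficient per non-baseline hair color
model <- lm(Freq ~ Hair, data = FemaleData)
summary(model)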
##
## Call:
## lm(formula = Freq ~ Hair, data = FemaleData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.750 -11.312  -3.125   0.375  43.750
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.000      9.988   1.302    0.217
## HairBrown     22.750     14.125   1.611    0.133
## HairRed       -3.750     14.125  -0.265    0.795
## HairBlond      7.250     14.125   0.513    0.617
##
## Residual standard error: 19.98 on 12 degrees of freedom
## Multiple R-squared: 0.256, Adjusted R-squared: 0.07002
## F-statistic: 1.376 on 3 and 12 DF, p-value: 0.2972
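To help read the coefficient table, here is a small sketch showing that, with a categorical predictor, the intercept is the mean frequency of the baseline hair color (Black) and each other coefficient is that hair color's mean minus the baseline mean. The variable name `group_means` is introduced here for illustration, and the subset is rebuilt with base R so the block runs on its own.

```r
# Rebuild the female subset with base R so the block is self-contained
FemaleData <- subset(as.data.frame(HairEyeColor), Sex == "Female")

# Mean frequency for each hair color (Black is the baseline level)
group_means <- tapply(FemaleData$Freq, FemaleData$Hair, mean)
group_means   # Black 13.00, Brown 35.75, Red 9.25, Blond 20.25

# Differences from the baseline mean; these match HairBrown, HairRed, HairBlond
group_means[-1] - group_means[["Black"]]   # 22.75, -3.75, 7.25

# Compare with the fitted coefficients
model <- lm(Freq ~ Hair, data = FemaleData)
coef(model)
```

For example, the predicted frequency for brown hair is 13.000 + 22.750 = 35.750, the mean of the four brown-hair rows. The p-values in the summary are all above 0.05, so this fit does not show a statistically significant difference between hair colors.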
## Slide 6: Plotly
library(plotly)
# Interactive scatter plot: one marker per hair/eye-color combination among females
p <- plot_ly(data = FemaleData,
             x = ~Hair,
             y = ~Freq,
             type = 'scatter', mode = 'markers',
             text = ~paste("Count:", Freq),
             hoverinfo = 'text') %>%
  layout(title = "Frequency of Females by Hair Color",
         xaxis = list(title = "Hair Color"),
         yaxis = list(title = "Frequency"))
p
## Slide 7: ggplot 1
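# Stacked bars: the four eye-color rows of each hair color add up to its total count
ggplot(FemaleData, aes(x = Hair, y = Freq, fill = Hair)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  labs(title = "Frequency of Females by Hair Color",
       x = "Hair Color",
       y = "Frequency") +
  theme_minimal()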
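Because the female subset has four rows per hair color (one per eye color), each bar in the chart above stacks those rows, so its height is the total count for that hair color. A short sketch of the same totals as a table, with the helper name `totals` introduced here for illustration:

```r
# Rebuild the female subset with base R so the block is self-contained
FemaleData <- subset(as.data.frame(HairEyeColor), Sex == "Female")

# Sum the four eye-color rows within each hair color
totals <- aggregate(Freq ~ Hair, data = FemaleData, FUN = sum)
totals
#    Hair Freq
# 1 Black   52
# 2 Brown  143
# 3   Red   37
# 4 Blond   81
```

These totals are four times the group means that the regression works with, since every hair color has exactly four rows.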
## Slide 8: ggplot 2
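ggplot(FemaleData, aes(x = Hair, y = Freq)) +
  geom_point(aes(color = Hair), size = 3) +
  # group = 1 fits one line across all hair colors; with a discrete x,
  # ggplot2 otherwise groups by hair color and no line can be fit.
  # The line treats the hair levels as equally spaced positions 1 to 4.
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "blue") +
  labs(title = "Linear Regression of Frequency by Hair Color",
       x = "Hair Color",
       y = "Frequency") +
  theme_minimal()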
## `geom_smooth()` using formula = 'y ~ x'
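A straight line of the form Y = mx + b needs a numeric x. One way to obtain a single slope and intercept, sketched here under the assumption that hair color is coded by its factor position (Black = 1, Brown = 2, Red = 3, Blond = 4), is shown below. This mirrors how ggplot2 places the levels of a discrete x axis at positions 1 to 4. The column name `HairCode` and the object `line_fit` are hypothetical names introduced for this example.

```r
# Rebuild the female subset with base R so the block is self-contained
FemaleData <- subset(as.data.frame(HairEyeColor), Sex == "Female")

# Assumption for illustration: code hair color by its factor position
# (Black = 1, Brown = 2, Red = 3, Blond = 4). This ordering is arbitrary.
FemaleData$HairCode <- as.numeric(FemaleData$Hair)

# Fit a single straight line through the numeric codes
line_fit <- lm(Freq ~ HairCode, data = FemaleData)
coef(line_fit)   # intercept b = 20.75, slope m = -0.475
```

Since the ordering of hair colors is arbitrary, the slope from this coding has no substantive meaning and would change if the levels were reordered. The indicator-variable fit from `lm(Freq ~ Hair)` is usually the more appropriate summary for a categorical predictor.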