2024-04-05

Simple Linear Regression

Definition

“The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The greater the linear relationship between the independent variable and the dependent variable, the more accurate is the prediction.” (DataTab)

Linear Regression Math Formula

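\[ \hat{y} = b \cdot x + a \]

\(\hat{y}\) = Estimated value of the dependent variable

b = Slope (the change in \(\hat{y}\) for a one-unit increase in x)

x = Independent variable

a = y-intercept (the value of \(\hat{y}\) when x = 0)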
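As a minimal sketch of where the slope b and intercept a come from, the least-squares estimates can be computed by hand and compared with lm(). The five data points below are invented for illustration and do not come from any dataset used in this document.

```r
# Hypothetical toy data (invented for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates:
#   slope b     = cov(x, y) / var(x)
#   intercept a = mean(y) - b * mean(x)
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)

# lm() should report the same intercept and slope
fit <- lm(y ~ x)
coef(fit)

# Prediction for a new value x = 6, using b * x + a
y_hat <- b * 6 + a
y_hat
```

The hand-computed a and b should match coef(fit) up to floating-point error, which shows that lm() solves the same least-squares problem.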

Example 1: Data Scientist Salary vs. Experience

Calculate Coefficient of Determination

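\[ R^2 = 1 - \frac{RSS}{TSS} \]

\[ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

\[ TSS = \sum_{i=1}^n (y_i - \bar{y})^2 \]

RSS = the residual sum of squares

TSS = the total sum of squares

n = the number of observations

\(y_i\) = the observed value of the dependent variable for observation i

\(\hat{y}_i\) = the predicted value of y for observation i

\(\bar{y}\) = the mean of the dependent variable y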

Example of calculating R^2 from the prior dataset

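# Fit a simple linear regression of Salary on Experiences
model <- lm(Salary ~ Experiences, data = df)

# Fitted values and residuals (observed minus fitted)
df$fitted <- predict(model)
df$residual <- df$Salary - df$fitted

# TSS: squared deviations of Salary from its mean
total_sum_of_squares <- sum((df$Salary - mean(df$Salary))^2)

# RSS: squared residuals
sum_of_squares_residuals <- sum(df$residual^2)

R_squared <- 1 - (sum_of_squares_residuals / total_sum_of_squares)

R_squared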
## [1] 0.7770042
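As a sanity check on the manual calculation, the same quantity can be read from summary() of the fitted model. The sketch below uses R's built-in cars dataset as a stand-in, since the Salary data frame is not reproduced here. The variable names r_squared_manual and r_squared_lm are illustrative.

```r
# Built-in dataset, used only as a stand-in for the document's df
model <- lm(dist ~ speed, data = cars)

# Manual calculation: R^2 = 1 - RSS / TSS
rss <- sum(residuals(model)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r_squared_manual <- 1 - rss / tss

# Value reported by summary()
r_squared_lm <- summary(model)$r.squared

# The two should agree up to floating-point tolerance
all.equal(r_squared_manual, r_squared_lm)

# In simple linear regression, R^2 also equals the squared
# correlation between the two variables
all.equal(r_squared_manual, cor(cars$speed, cars$dist)^2)
```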

Example 2: Does sleep time affect physical health and heart disease?

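# Load the heart disease dataset and the plotting library
data <- read.csv("C:\\Users\\ASUS\\OneDrive\\Desktop\\ASU Online\\DAT301\\heart_2020_cleaned.csv")
library(ggplot2)

# Scatter plot with one fitted regression line per HeartDisease group
ggplot(data, aes(x = SleepTime, y = PhysicalHealth, color = HeartDisease)) +
  geom_point(alpha = 1/2) +                 # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", se = FALSE) +  # linear fit, no confidence band
  labs(x = "Sleep Time", y = "Physical Health") +
  ggtitle("Scatter Plot of Sleep Time vs Physical Health")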
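The plot shows the trend visually. To judge whether sleep time is related to physical health, the slope of the fitted line can also be inspected numerically. The sketch below uses simulated stand-in data with the same column names. The values are generated, not taken from heart_2020_cleaned.csv, so the numbers it prints say nothing about the real dataset.

```r
# Simulated stand-in data (NOT the heart_2020_cleaned.csv values)
set.seed(1)
sim <- data.frame(SleepTime = sample(4:10, 200, replace = TRUE))
sim$PhysicalHealth <- 10 - 0.8 * sim$SleepTime + rnorm(200, sd = 3)

# Simple linear regression of PhysicalHealth on SleepTime
fit <- lm(PhysicalHealth ~ SleepTime, data = sim)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))
coef_table

# Share of the variance in PhysicalHealth explained by SleepTime
summary(fit)$r.squared
```

With the real data, the same two calls would apply after replacing sim with the data frame loaded above. A small p-value for the SleepTime row suggests a nonzero slope, while a low R^2 would indicate that sleep time alone explains little of the variation.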

Example 3: ggplot2 with a medical insurance dataset

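# Load the insurance dataset and the plotting library
data <- read.csv("C:\\Users\\ASUS\\OneDrive\\Desktop\\ASU Online\\DAT301\\insurance.csv")
library(ggplot2)

# Scatter plot of charges against age, with one fitted line per smoker group
ggplot(data, aes(x = age, y = charges, color = smoker)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +                 # linear fit, no confidence band
  scale_color_manual(values = c("orange", "darkgreen")) +  # colors assigned in level order
  labs(x = "Age", y = "Charges") +
  ggtitle("Linear Regression: Age vs Charges by Smoking Status")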
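Because color is mapped to smoker, geom_smooth() fits a separate regression line for each group. A rough equivalent outside ggplot2 is to fit one simple linear regression per group, sketched below on simulated stand-in data. The values are generated, not taken from insurance.csv, and the names sim, fits and group_coefs are illustrative.

```r
# Simulated stand-in data (NOT the insurance.csv values)
set.seed(42)
n <- 100
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  smoker = rep(c("no", "yes"), each = n / 2)
)
sim$charges <- 2000 + 250 * sim$age +
  ifelse(sim$smoker == "yes", 20000, 0) + rnorm(n, sd = 3000)

# Split the data by smoker status and fit charges ~ age within each group
fits <- lapply(split(sim, sim$smoker), function(d) lm(charges ~ age, data = d))

# Intercept and slope for each group (one column per group)
group_coefs <- sapply(fits, coef)
group_coefs
```

Each column of group_coefs holds the intercept and slope of one of the lines that geom_smooth(method = "lm") would draw for that group.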

Reference