In this analysis, we use an employee dataset to predict salaries based on years of experience and the well known Iris dataset for demonstration with binary variables. Our approach begins with simple linear regression and progressively explores more advanced techniques, offering both practical demonstrations and theoretical insights.
The primary goal is to bridge the gap between theory and practice in data science. Many professionals in the field originate from either a statistics/mathematics background or a computer science/programming background. These distinct foundations often influence their approach to further studies and problem-solving. Those with a programming background may excel in optimization and automation but often struggle with statistical concepts. Conversely, individuals with a strong statistical background may grasp data relationships intuitively but find programming challenges more demanding.
This article aims to address these gaps by presenting a structured blend of practical coding demonstrations in R and rigorous mathematical derivations. The content is organized as follows:
Usage and Demonstration:
Practical applications using R programming.
Proofs and Theory: Foundations in econometrics, statistics, and linear algebra.
Suppose we have a variable \(X\) which denotes the age of a person and another variable \(Y\) which denotes the salary the person is receiving.
\(X_1\) denotes the age of the first person and similarly \(Y_1\) represents the salary of the first person
data <- na.omit(read.csv("Salary_Data.csv"))
head(data)
## Age Gender Education.Level Job.Title Years.of.Experience Salary
## 1 32 Male Bachelor's Software Engineer 5 90000
## 2 28 Female Master's Data Analyst 3 65000
## 3 45 Male PhD Senior Manager 15 150000
## 4 36 Female Bachelor's Sales Associate 7 60000
## 5 52 Male Master's Director 20 200000
## 6 29 Male Bachelor's Marketing Analyst 2 55000
X <- data$Years.of.Experience
y <- data$Salary
library(ggplot2)
ggplot(data, aes(Years.of.Experience, Salary)) +
geom_point() +
labs(title = "Salary vs Years of Experience")
print(dim(data))
## [1] 6699 6
Through simple inspection there is an obvious positive LINEAR relation between the salary of a person and the years of experience. To numerically represent this observation we have the sample covariance which is defined as \[ s_{xy} = \frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y}) \]
We can calculate it simply by
cov(X, y)
## [1] 258770.2
From the sign of the calculated statistic, we can infer that there is a positive relation but the magnitude of the statistic doesn’t tell about the strength of the relation. Yet it’s arguably more useful than correlation coefficient as we will see that the optimal regression parameter (\(\beta\)) can be derived using the covariance.
So instead by using
\[r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}\sqrt{\frac{1}{n-1}\sum (y_i - \bar{y})^2}} = \frac{\frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sqrt{\sum (x_i - \bar{x})^2\sum (y_i - \bar{y})^2}} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2\sum (y_i - \bar{y})^2}}\] which we can calculate with
cor(X, y)
## [1] 0.8089689
we can have a scaled measure for the correlation, i.e. “correlation coefficient”
When we consider one variable \(X\) to predict \(Y\) then:
Supposing the true relation follows the form of: \[y_i = \alpha_0 + \beta_0 x_i + \varepsilon_i \; \; \; where \ i =1, 2, 3, ..., n\] And having the prediction function: \[\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i \; \; \; where \ i =1, 2, 3, ..., n\] Then, the following least squares (LS) function, and the optimal values for estimators can be modeled under OLS model as:
\[(\hat{\alpha}, \hat{\beta}) = argmin_{(\alpha, \beta)} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = argmin_{(\alpha, \beta)} \underbrace{\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2}_{LS_n(\alpha, \beta)}\] Notice that the \(\alpha\) and \(\beta\) for estimators of true coefficients instead of \(\hat{\alpha}\) and \(\hat{\beta}\). The reason is, as the estimators are not “fitted” yet, they don’t represent the optimal estimators. But estimators resulting in the minimal cost of the LS cost/loss function, indeed represent the optimal estimators under the OLS model.
By setting the partial derivatives of the LS function, w.r.t. each estimator, the so called Normal Equations are obtained:
\[\frac{\partial LS(\alpha, \beta)}{\partial\alpha} = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) \stackrel{(f.o.c.)}{=} 0 \Longrightarrow \sum_{i=1}^n(y_i - \hat{\alpha} - \hat{\beta} x_i ) = 0 \quad \quad \quad (1)\]
\[\frac{\partial LS(\alpha, \beta)}{\partial\beta} = -2\sum_{i=1}^n x_i(y_i - \alpha - \beta x_i) \stackrel{(f.o.c.)}{=} 0 \Longrightarrow \sum_{i=1}^n x_i (y_i - \hat{\alpha} - \hat{\beta} x_i ) = 0 \quad \quad \quad (2)\]
following from 1 and 2, we derive the optimal estimators of simple OLS:
\[ (1): \sum_{i=1}^n (y_i - \hat{\alpha} - \hat{\beta} x_i) = \sum_{i=1}^n{y_i} - \sum_{i=1}^n{\hat{\alpha}} - \sum_{i=1}^n{\hat{\beta} x_i} = \overbrace{\sum_{i=1}^n{y_i}}^{n \bar{y}} - \hat{\alpha}\sum_{i=1}^n{1} - \hat{\beta}\overbrace{\sum_{i=1}^n{x_i}}^{n \bar{x}} = n \bar{y} - n \hat{\alpha} - n\hat{\beta} \bar{x} = 0 \] \[ \Leftrightarrow n \hat{\alpha} = n(\bar{y} - \hat{\beta} \bar{x}) \Leftrightarrow \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \quad (3)\] \[(2): \sum_{i=1}^n x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) \stackrel{(3)}{=} \sum_{i=1}^n x_i(y_i - (\bar{y} - \hat{\beta} \bar{x} ) - \hat{\beta} x_i) = \sum_{i=1}^n x_i((y_i - \bar{y}) - \hat{\beta}(x_i - \bar{x})) = 0\] \[\Leftrightarrow \sum_{i=1}^n x_i (y_i - \bar{y}) = \hat{\beta} \sum_{i=1}^n x_i (x_i - \bar{x})\] \[ \Leftrightarrow \hat{\beta} = \frac{\sum_{i=1}^n x_i(y_i - \bar{y})}{\sum_{i=1}^n x_i(x_i - \bar{x})} \stackrel{(*)}{=} \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}\] the common form \(\frac{s_{xy}}{s_x}\) where \(s_{xy}\) denotes the sample covariance between the two variables and \(s^2_x\) denotes the sample variance x, can be shown to be an equivalent of our derivations through showing that left hand side (our derivation) is equivalent to right hand following:
\[*\] \[LHS: \frac{\sum_{i=1}^n x_i(y_i - \bar{y})}{\sum_{i=1}^n x_i(x_i - \bar{x})} = \frac{\sum_{i=1}^n x_iy_i - \sum_{i=1}^n x_i \bar{y}}{\sum_{i=1}^nx_i^2 - \sum_{i=1}^n \bar{x} x_i} = \frac{\sum_{i=1}^n x_iy_i - n \bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n\bar{x}^2}\] \[RHS: \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n x_iy_i - \overbrace{\sum_{i=1}^n x_i \bar{y}}^{n\bar{x}\bar{y}} - \overbrace{\sum_{i=1}^n y_i \bar{x}}^{n\bar{x}\bar{y}} + \overbrace{\sum_{i=1}^n \bar{x} \bar{y}}^{n\bar{x}\bar{y}}}{\sum_{i=1}^nx_i^2 - \underbrace{\sum_{i=1}^n x_i \bar{x}}_{n\bar{x}^2} - \underbrace{\sum_{i=1}^n x_i \bar{x}}_{n\bar{x}^2} + \underbrace{\sum_{i=1}^n \bar{x}^2}_{n\bar{x}^2}} = \frac{\sum_{i=1}^n x_iy_i - n \bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n\bar{x}^2}\]
set.seed(42)
x <- rnorm(50, 10, 2)
# Generating y with the true model
alpha0 <- 32
beta0 <- 2.3
error <- rnorm(50, 0, 1)
y <- alpha0 + x*beta0 + error
beta_hat <- cov(x, y)/var(x)
alpha_hat <- mean(y) - beta_hat*mean(x)
sprintf("Alpha: %.2f, Beta: %.2f", alpha_hat, beta_hat)
## [1] "Alpha: 33.05, Beta: 2.20"
Checking whether we have the same solution with the built-in lm() function.
reg = lm(y ~ x)
sprintf("Alpha: %.2f, Beta: %.2f", reg$coefficients[1], reg$coefficients[2])
## [1] "Alpha: 33.05, Beta: 2.20"
library(ggplot2)
library(tibble)
data <- tibble(x, y)
# Calculate predicted y values using the regression equation
data$y_pred <- alpha_hat + beta_hat * data$x
# Create the plot
ggplot(data=data, aes(x=x, y=y)) +
geom_point() + # Plot actual data points
geom_line(aes(x=x, y=y_pred), color="blue") + # Plot regression line
geom_segment(aes(xend=x, yend=y_pred), color="red", linetype="dashed") + # Draw residuals
theme_minimal() + # Use a minimal theme
labs(title="Scatterplot with Regression Line and Residuals",
x="X Variable",
y="Y Variable")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris %>% group_by(Species) %>% summarize(mean=mean(Petal.Length))
## # A tibble: 3 × 2
## Species mean
## <fct> <dbl>
## 1 setosa 1.46
## 2 versicolor 4.26
## 3 virginica 5.55
library(dplyr)
library(ggplot2)
data <- iris %>% select(c(Petal.Width, Petal.Length))
data$is_Setosa <- ifelse(iris$Species == "setosa", 1, 0)
head(data[47:52,])
## Petal.Width Petal.Length is_Setosa
## 47 0.2 1.6 1
## 48 0.2 1.4 1
## 49 0.2 1.5 1
## 50 0.2 1.4 1
## 51 1.4 4.7 0
## 52 1.5 4.5 0
Considering the data above, if we want to predict the Petal.Length by the variable \(D\) which denotes whether the observed flower is Setosa. So our prediction function which only takes the binary variable into account can be modeled as \[\hat{y} = \hat{\alpha} + \hat{\beta} D_i\]
We can instinctively expect that the prediction function \(\hat{y}\) will be equal to the corresponding group’s mean Petal width.
So in practice, our naive estimator will give the mean petal width of Setosa species if the observed flower is a setosa, or give the mean petal width of the other species that are NOT Setosa.
Then, by our instincts we expect the following: \(\hat{y}(D_i) = \begin{cases} \hat{y_1} & D_i=1 \\ \hat{y}_0 & D_i=0 \end{cases}\) just by that we can actually write out \(\hat{y}(D_i=1)\) and \[\hat{y}(D_i=0)\] and solve for \(\hat{\alpha}\) and \(\hat{\beta}\)
\[y(D_i=0) = \hat{\alpha_0} + \hat{\beta} \cdot 0 = \hat{y_0} \ \Rightarrow \ \hat{\alpha} = \bar{y}_0\] \[y(D_i=1) = \hat{\alpha_0} + \hat{\beta} \cdot 1 = \bar{y_0} + \hat{\beta} = \bar{y}_1 \Rightarrow \hat{\beta} = (\bar{y}_1 - \bar{y}_0)\] So we have \(\hat{\alpha} = \bar{y}_0\) and \(\hat{\beta} = \bar{y}_1 - \bar{y}_0\)
## (Intercept) IsSetosa
## 6.262 -1.256
Starting off with the variance of our target variable \(y\) we have
\[\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2\] if we ignore the scaling term \(\frac{1}{n-1}\), we have the Total Sum of Squares (SST) of the target variable. \[SST = \sum_{i=1}^n (y_i - \bar{y})^2\] With rewriting using \(y_i = \hat{y_i} + \hat{e}_i\), and using the fact that \(\sum y_i = \sum \hat{y}_i\) following from the first Normal Equation. We have:
\[\underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{SST} = \sum_{i=1}^n (\hat{y}_i + \hat{e} - \bar{\hat{y}})^2 \stackrel{*}{=} \sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2 + \sum_{i=1}^n (\hat{e})^2 = \underbrace{\sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2}_{SSE} + \underbrace{\sum_{i=1}^n (y_i - \hat{y}_i)^2}_{SSR} \] \[SST = SSE + SSR\] As the sum of SSE and SSR are bounded between (0 and SST) we have \[\frac{SST}{SST} = 1= \frac{SSE}{SST} + \frac{SSR}{SST}\] then the \(R^2\) which tells about the goodness of fit is demonstrated as \(R^2 = \frac{SSE}{SST}\) which can also be expressed as \(R^2 = 1 - \frac{SSR}{SST}\) So intuitively, we are trying to maximize R^2 \(\in [0, 1]\) and trying to minimize \(\frac{SSR}{SST}\) = \(\frac{\sum \hat{e}^2}{\sum (y_i - \bar{y})^2} = \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}\)
The R^2 is interpreted as the ratio of Explained variance, and \(1 - R^2\) as the unexplained (residual) variance of the target variable, but basically \(R^2\) denotes how close our predictions are to the actual values of the target variable.
\(R^2\) isn’t a good metric, which we will discuss this later on (when dealing with multiple variables). but before that, a quick look into data transformations which improve model performance, interpretability, or compliance with statistical assumptions.
Starting off with a simple example, if we have: \[X \sim N(\mu, \sigma^2)\]
Then if we define \[Y = 2X\] What is distr. of \(Y\)?
By rewriting the Cumulative Distribution Function of the newly created \(Y\) variable in terms of \(X\), pdf and CDF of \(Y\) can be derived, Therefore the distribution of \(Y\) can be recognized. \[F_Y(y) = P[Y \leq y] = P[2X \leq y] = P[X \leq y/2] = F_X(y/2)\]
By rewriting the Moment Generating Function of the newly created \(Y\) variable in terms of Expectations and \(X\), mgf of \(Y\) can be derived, Therefore the distribution of \(Y\) can be recognized. \[M_y(t) = E[e^{tY}] = E[e^{2tX}] = E[e^{(2t)X}] = M_X(2t) = e^{\mu \cdot 2t + \sigma^2 \cdot (2t)^2/2} = e^{2\mu \cdot t + 4\sigma^2t^2/2}\]
Starting off with calculating the Expected value and Variance of the new variable Y \[ E[Y] = E[2X] = 2 \cdot E[X] = 2 \mu \\ \, \, \, Var[Y] = Var[2X] = 4 \cdot Var[X] = 4 \sigma^2 \] This is quite straightforward as we now resulting distributions from linear transformations of normal variables are also normal
In the wages dataset, we have more than one variable to build the estimator, to move our regression to the next level, we need to look at multiple features/columns to better estimate the wage.
The good thing is that the process will remain the same, but since there will be multiple variables/dimensions we will use Linear Algebra
Here is a quick hint for the topics to be discussed
\[\hat{\beta}_2 = (X_2'M_{X_1}X_2)^{-1}(X_2'M_{X_1}y)\]
Stay tuned for the next articles for more advanced techniques and building our way up to great regressions, All the best!