# noisy gamma distribution plot
library(ggplot2)   # ggplot() and geom_point() are used below
x <- seq(from = 0, to = 20, by = 0.1)
y.gamma <- dgamma(x, shape = 2, scale = 2)
y.gamma.scaled <- y.gamma * 100
y.norm <- vector(length = 201)
for (i in 1:201) {
  y.norm[i] <- rnorm(n = 1, mean = y.gamma.scaled[i], sd = 2)
}
data <- data.frame(x, y.norm)
data |>
  ggplot(mapping = aes(x, y.norm)) +
  geom_point()

Data Science Concepts
1 Regression (from StatQuest)
1.1 Linear Regression
We want to minimize the sum of the squared distances between the observed values and the line.
1.1.1 lowess vs loess smoothing
- Use a type of sliding window to divide the data into smaller blobs.
- At each data point, use a type of least squares to fit a line to the points in the window.
| lowess | loess |
|---|---|
| Only fits a line | Can fit a line or a parabola; the default is a parabola |
| Doesn't draw a confidence interval around the curve | Draws a confidence interval around the curve |
# by default "lowess()" fits a line in each window using 2/3's
# of the data points
# y.norm ~ x says that y.norm is being modeled by x
# f is the fraction of points to use in each window
lo.fit.gamma <- lowess(y.norm ~ x, f = 1/5)
plot(data,
frame.plot = FALSE,
xlab = "",
ylab = "",
col = "#d95f0e",
lwd = 1.5
)
lines(x, lo.fit.gamma$y, col = "black", lwd = 3)
title("lowess() smoothing")# by default "loess()" fits a parabola in each window using
# 75% of the data points
plx <- predict(
  loess(y.norm ~ x,
        span = 1/5,
        degree = 2,
        family = "symmetric",
        iterations = 4
  ),
  se = TRUE
)
## Now let's add a confidence interval to the loess() fit...
plot(
data,
type="n",
frame.plot=FALSE,
xlab="",
ylab="",
col="#d95f0e",
lwd=1.5
)
polygon(
c(x, rev(x)),
c(
plx$fit + qt(0.975,plx$df)*plx$se,
rev(plx$fit - qt(0.975,plx$df)*plx$se)
),
col="#99999977"
)
points(
data,
col="#d95f0e",
lwd=1.5
)
lines(
x,
plx$fit,
col="black",
lwd=3
)

Above are the technical details of the method. Below is the practical way of fitting the curve,
data |>
ggplot(aes(x, y.norm)) +
geom_point() +
geom_smooth(span = 1/5)

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
1.2 General Linear Models - Part 1
There are at least three parts to it,
- Use least squares to fit a line to the data
- Calculate R^2
- Calculate a p-value for the R^2
The vertical distance from the line to a data point is called a “residual”. Fitting the line to the data is fairly straightforward: pick a starting line, then adjust the line’s parameters (intercept and slope) until it fits the data as well as possible.
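To make “pick a line, then adjust its parameters” concrete, here is a minimal sketch (not from the original notes) that minimizes the sum of squared residuals directly with optim() on a small made-up data set and checks the answer against lm(); the names toy and ss.residuals are just for illustration,
## Sketch: least squares by direct minimization of the residuals
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 5.9)   # made-up data
)
## sum of squared residuals for a candidate line y = a + b*x
ss.residuals <- function(params, data) {
  a <- params[1]
  b <- params[2]
  sum((data$y - (a + b * data$x))^2)
}
## start from an arbitrary line (a = 0, b = 0) and let optim()
## adjust the intercept and slope until the residuals are minimized
fit.by.hand <- optim(par = c(0, 0), fn = ss.residuals, data = toy)
fit.by.hand$par               # estimated intercept and slope
coef(lm(y ~ x, data = toy))   # lm() solves the same problem analytically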
For a line of the form y = a + bx, if b is non-zero, then knowing an observation's x value makes it possible to guess its y value. How good that guess is can be quantified by two values,
- R^2
- p-value for the R^2
SS(\text{mean}) \rightarrow Sum of Squares around the mean (of the responses, y) = \sum (\text{data} - \text{mean})^2
Var(\text{mean}) = \frac{SS(\text{mean})}{n}
SS(\text{fit}) \rightarrow Sum of Squares around the least-squares fit = \sum (\text{data} - \text{fit})^2
Var(\text{fit}) = \frac{SS(\text{fit})}{n}
If and when Var(\text{mean}) > Var(\text{fit}), some of the variation is explained by the least-squares line. R^2 tells us how much of the variation in the response (e.g., mouse size in the example below) can be explained by taking the predictor (mouse weight) into account. Because n cancels, R^2 = \frac{Var(\text{mean}) - Var(\text{fit})}{Var(\text{mean})} = \frac{SS(\text{mean}) - SS(\text{fit})}{SS(\text{mean})}. For example, if R^2 = 60\%, it is said that x explains 60\% of the variation in y.
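As a quick numerical check of that formula, here is a minimal sketch that computes SS(mean), SS(fit) and R^2 for the noisy gamma data generated at the top of this section (it assumes x and y.norm still exist; gamma.fit, ss.mean.gamma and ss.fit.gamma are just illustrative names). The value itself will be modest, since a straight line is a poor fit for hump-shaped data, but the two calculations should agree,
## Sketch: R^2 from the sums of squares, using the noisy gamma data
gamma.fit <- lm(y.norm ~ x)                        # least-squares line
ss.mean.gamma <- sum((y.norm - mean(y.norm))^2)    # SS(mean)
ss.fit.gamma  <- sum(gamma.fit$residuals^2)        # SS(fit)
(ss.mean.gamma - ss.fit.gamma) / ss.mean.gamma     # R^2 by hand
summary(gamma.fit)$r.squared                       # should match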
mouse.data <- data.frame(
size = c(1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3),
weight = c(0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3),
tail = c(0.7, 1.3, 0.7, 2.0, 3.6, 3.0, 2.9, 3.9, 4.0))
mouse.data # print the data to the screen in a nice format

  size weight tail
1  1.4    0.9  0.7
2  2.6    1.8  1.3
3  1.0    2.4  0.7
4  3.7    3.5  2.0
5  5.5    3.9  3.6
6  3.2    4.4  3.0
7  3.0    5.1  2.9
8  4.9    5.6  3.9
9  6.3    6.3  4.0
1.2.1 Simple Linear Regression
## STEP 1: Draw a graph of the data to make sure the relationship makes sense
plot(mouse.data$weight, mouse.data$size)
## STEP 2: Do the regression
simple.regression <- lm(size ~ weight, data=mouse.data)
## STEP 3: Look at the R^2, F-value and p-value
summary(simple.regression)
Call:
lm(formula = size ~ weight, data = mouse.data)
Residuals:
Min 1Q Median 3Q Max
-1.5482 -0.8037 0.1186 0.6186 1.8852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5813 0.9647 0.603 0.5658
weight 0.7778 0.2334 3.332 0.0126 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 7 degrees of freedom
Multiple R-squared: 0.6133, Adjusted R-squared: 0.558
F-statistic: 11.1 on 1 and 7 DF, p-value: 0.01256
## add the regression line to our x/y scatter plot
abline(simple.regression, col="purple")

## now let's verify that our formula for R^2 is correct...
ss.mean <- sum((mouse.data$size - mean(mouse.data$size))^2)
ss.simple <- sum(simple.regression$residuals^2)
(ss.mean - ss.simple) / ss.mean # this is the R^2 value

[1] 0.6132867
## now let's verify that our formula for F is correct...
f.simple <- ((ss.mean - ss.simple) / (2 - 1)) /
(ss.simple / (nrow(mouse.data) - 2))
f.simple # this is the F-value

[1] 11.10127
It is also possible to calculate the p-value from the F-distribution curve manually like so,
## First, draw the correct f-distribution curve with df1=1 and df2=7
x <- seq(from=0, to=15, by=0.1)
y <- df(x, df1=1, df2=7)
plot(x, y, type="l")
## now draw a vertical line where our F-value, f.simple, is.
abline(v=f.simple, col="purple")
## color the graph on the left side of the line blue
x.zero.to.line <- seq(from=0, to=f.simple, by=0.1)
y.zero.to.line <- df(x.zero.to.line, df1=1, df2=7)
polygon(x=c(x.zero.to.line, 0), y=c(y.zero.to.line, 0), col="lightblue")
## color the graph on the right side of the line purple
x.line.to.20 <- seq(from=f.simple, to=20, by=0.1)
y.line.to.20 <- df(x.line.to.20, df1=1, df2=7)
polygon(x=c(x.line.to.20, f.simple), y=c(y.line.to.20, 0), col="purple")

pf(f.simple, df1=1, df2=7) ## the area under the curve that is blue

[1] 0.9874409

1-pf(f.simple, df1=1, df2=7) ## the area under the curve that is purple

[1] 0.01255906
Compare the above p-value to the one from the original regression.
Equations with more parameters will never make SS(fit) worse than equations with fewer parameters.
R^2 is the proportion of variance explained by the model. The p-value for R^2 comes from the F-statistic, F = \frac{\text{The variation of } y \text{ explained by } x}{\text{The variation of } y \text{ not explained by } x}
F = \frac{(SS(\text{mean}) - SS(\text{fit}))/(p_{\text{fit}} - p_{\text{mean}})}{SS(\text{fit}) / (n - p_{\text{fit}})}
If the fit is good, the F-statistic will be large; if the fit is not much better than the mean, the F-statistic will be small. To get the p-value from this F-statistic, look at the F-curve with DF = p_{\text{fit}} - p_{\text{mean}} and DF = n - p_{\text{fit}}. The area of the more extreme portion is the p-value. For example, the F-curve with DF = 1 and 7 is the one plotted above.
The p-value will be smaller when there are more samples relative to the number of parameters in the fit equation.
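That claim can be checked with a small sketch (the R^2 value of 0.6 below is an assumption for illustration, not a result from the mouse data): hold R^2 fixed and increase n, and the p-value from pf() shrinks because the denominator degrees of freedom grow,
## Sketch: for a fixed R^2, more samples => larger F => smaller p-value
r2 <- 0.6        # assumed R^2, for illustration only
p.fit <- 2       # parameters in the fitted line (intercept + slope)
p.mean <- 1      # parameters in the mean-only model
for (n in c(5, 10, 20, 50)) {
  f <- (r2 / (p.fit - p.mean)) / ((1 - r2) / (n - p.fit))
  p <- 1 - pf(f, df1 = p.fit - p.mean, df2 = n - p.fit)
  cat("n =", n, " F =", round(f, 2), " p-value =", signif(p, 3), "\n")
}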
1.2.2 Multiple Regression
Compare the simple regression with multiple regression to find out whether it is worth the time and trouble to collect the additional data. Calculate the following F-statistic,
F = \frac{(SS(\text{simple}) - SS(\text{multiple})) / (p_{\text{multiple}} - p_{\text{simple}})}{SS(\text{multiple}) / (n - p_{\text{multiple}})}
If the p-value of this F-statistic is significant, it is worth it to add the extra parameter, otherwise not.
## STEP 1: Draw a graph of the data to make sure the relationship makes sense
## This graph is more complex because it shows the relationships between all
## of the columns in "mouse.data".
plot(mouse.data)

## STEP 2: Do the regression
multiple.regression <- lm(size ~ weight + tail, data=mouse.data)
## STEP 3: Look at the R^2, F-value and p-value
summary(multiple.regression)
Call:
lm(formula = size ~ weight + tail, data = mouse.data)
Residuals:
Min 1Q Median 3Q Max
-0.99928 -0.38648 -0.06967 0.34454 1.07932
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7070 0.6510 1.086 0.3192
weight -0.3293 0.3933 -0.837 0.4345
tail 1.6470 0.5363 3.071 0.0219 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8017 on 6 degrees of freedom
Multiple R-squared: 0.8496, Adjusted R-squared: 0.7995
F-statistic: 16.95 on 2 and 6 DF, p-value: 0.003399
## again, we can verify that our R^2 value is what we think it is
ss.multiple <- sum(multiple.regression$residuals^2)
(ss.mean - ss.multiple) / ss.mean

[1] 0.8496438
## we can also verify that the F-value is what we think it is
f.multiple <- ((ss.mean - ss.multiple) / (3 - 1)) /
(ss.multiple / (nrow(mouse.data) - 3))
f.multiple

[1] 16.95262
## Again let's draw a figure that shows how to calculate the p-value from the
## F-value
##
## First, draw the correct f-distribution curve with df1=2 and df2=6
x <- seq(from=0, to=20, by=0.1)
y <- df(x, df1=2, df2=6)
plot(x, y, type="l")
## now draw a vertical line where our f.value is for this test
abline(v=f.multiple, col="purple")
## color the graph on the left side of the line blue
x.zero.to.line <- seq(from=0, to=f.multiple, by=0.1)
y.zero.to.line <- df(x.zero.to.line, df1=2, df2=6)
polygon(x=c(x.zero.to.line, 0), y=c(y.zero.to.line, 0), col="lightblue")
## color the graph on the right side of the line purple
x.line.to.20 <- seq(from=f.multiple, to=20, by=0.1)
y.line.to.20 <- df(x.line.to.20, df1=2, df2=6)
polygon(x=c(x.line.to.20, f.multiple), y=c(y.line.to.20, 0), col="purple")

pf(f.multiple, df1=2, df2=6) ## the area under the curve that is blue

[1] 0.9966009

1-pf(f.multiple, df1=2, df2=6) ## the area under the curve that is purple

[1] 0.003399103
## lastly, let's compare this p-value to the one from the
## original regression
summary(multiple.regression)
Call:
lm(formula = size ~ weight + tail, data = mouse.data)
Residuals:
Min 1Q Median 3Q Max
-0.99928 -0.38648 -0.06967 0.34454 1.07932
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7070 0.6510 1.086 0.3192
weight -0.3293 0.3933 -0.837 0.4345
tail 1.6470 0.5363 3.071 0.0219 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8017 on 6 degrees of freedom
Multiple R-squared: 0.8496, Adjusted R-squared: 0.7995
F-statistic: 16.95 on 2 and 6 DF, p-value: 0.003399
In the above call to summary(), the rightmost column contains two p-values of interest (for weight and tail). The p-value next to weight is 0.4345, which is not statistically significant. This means that using weight and tail isn't significantly better than using tail alone to predict size. But the p-value next to tail (0.0219) is statistically significant, which means that using weight and tail is significantly better than using weight alone to predict size. It should also be noted that the overall p-value is 0.003399, which is significant.
As a whole, we can use both weight and tail to predict size, but if for some reason weight data is not available, it is okay to use tail alone to predict size.
Whether the additional feature makes a significant contribution can also be found manually,
f.simple.v.multiple <- ((ss.simple - ss.multiple) / (3-2)) /
(ss.multiple / (nrow(mouse.data) - 3))
1-pf(f.simple.v.multiple, df1=1, df2=6)

[1] 0.02191014
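The same comparison is built into R: calling anova() on the two nested models fit above should report the same F-value and p-value for adding tail to the model (shown here as a convenience, not as part of the original notes),
## Built-in alternative to the manual calculation above:
## compares the nested simple and multiple regressions
anova(simple.regression, multiple.regression)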
When doing multiple regression, it is the adjusted R^2 that matters, because it penalizes the fit for the number of parameters used.
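A sketch of what "adjusted" means: the standard formula is adjusted R^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}, where k is the number of predictors (not counting the intercept), so adding a useless predictor no longer inflates the value. Using the objects defined above (r2.multiple, n and k are illustrative names),
## Sketch: verify the "Adjusted R-squared" reported by summary()
r2.multiple <- (ss.mean - ss.multiple) / ss.mean
n <- nrow(mouse.data)
k <- 2                                        # predictors: weight and tail
1 - (1 - r2.multiple) * (n - 1) / (n - k - 1) # adjusted R^2 by hand
summary(multiple.regression)$adj.r.squared    # should match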
1.3 General Linear Models Part 2 - t-test and ANOVA
The goal of a t-test is to compare means of two datasets and see if they are significantly different from each other.
Step 1: Ignore the x-axis (the group labels) and find the overall mean.
Step 2: Calculate SS(\text{mean}), the sum of squared residuals around that overall mean.
Step 3: Fit a line (or two lines, if there are two data groups) to the data.
Step 4: Calculate SS(\text{fit}), the sum of squares of the residuals around the fitted line(s). When there are two groups, fit each group with its own horizontal line (its group mean) and use those lines to calculate SS(\text{fit}).
Step 5: Find the F-statistic and the p-value corresponding to that F-statistic.
If there are more than two groups, use an ANOVA test. The process is identical except that there are more than two groups, each fitted with its own mean. Refer to this video for a more elaborate explanation.
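As a sketch of the two-group case (the data below is made up and the names group.data and group.fit are illustrative), the t-test can be run as a linear model with a group factor, and the same call extends to ANOVA when the factor has more than two levels,
## Sketch: a t-test expressed as a general linear model
group.data <- data.frame(
  group = factor(rep(c("control", "treated"), each = 5)),
  value = c(2.1, 1.8, 2.4, 2.0, 2.3,    # control (made-up values)
            3.0, 3.4, 2.9, 3.6, 3.1)    # treated (made-up values)
)
## lm() fits one horizontal line (a mean) per group; the p-value on the
## group coefficient tests whether the two means differ
group.fit <- lm(value ~ group, data = group.data)
summary(group.fit)
## the classic t-test (equal variances, to match the linear-model setup)
t.test(value ~ group, data = group.data, var.equal = TRUE)
## with more than two groups, the same idea gives an ANOVA:
## summary(aov(value ~ group, data = group.data))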