Outlier Treatment

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.

The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.

Extreme values in both tails of the distribution are similarly unlikely. In the histogram below these extreme values, the outliers, can be seen.

Function to find outliers by group and eliminate them
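
The original function is not shown here; below is a minimal sketch of the idea in base R, using the 1.5 * IQR rule within each group (the function and column names are illustrative, not from the source):

remove_outliers_by_group <- function(data, value_col, group_col) {
  # split row indices by group and keep only rows inside the 1.5 * IQR fences of their group
  keep <- unlist(lapply(split(seq_len(nrow(data)), data[[group_col]]), function(idx) {
    x <- data[[value_col]][idx]
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
    fence <- 1.5 * (q[2] - q[1])
    idx[which(x >= q[1] - fence & x <= q[2] + fence)]
  }))
  data[sort(keep), , drop = FALSE]
}

# example: drop mpg outliers within each cylinder group of mtcars
mtcars_clean <- remove_outliers_by_group(mtcars, "mpg", "cyl")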

Before applying the function

After applying the function

And the same approach for removing outliers, but this time using the merge() function
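
Again a hypothetical sketch: the same idea, but computing the per-group fences first and attaching them to the data with merge() before filtering (variable names are illustrative):

# per-group lower/upper fences for mpg within each cyl group
bounds <- do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(d) {
  q <- quantile(d$mpg, c(0.25, 0.75))
  data.frame(cyl   = d$cyl[1],
             lower = q[1] - 1.5 * diff(q),
             upper = q[2] + 1.5 * diff(q))
}))

# attach the fences with merge() and keep only the rows inside them
merged <- merge(mtcars, bounds, by = "cyl")
mtcars_clean <- merged[merged$mpg >= merged$lower & merged$mpg <= merged$upper, ]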

Nonlinear regression problem

The linear form of a model is the simplest form for data prediction, but in most cases the data do not follow this form. One way to fix the formula of the prediction model is to apply a logarithmic transformation or to raise a variable to a power (Tukey's transformation).
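
For reference, the summary below looks like a plain linear regression of mpg on hp; a sketch of the call that would produce it (the use of mtcars is an assumption based on the variable names):

fit_lin <- lm(mpg ~ hp, data = mtcars)      # baseline linear model
round(summary(fit_lin)$coefficients, 3)     # coefficient table
round(summary(fit_lin)$r.squared, 3)        # R-squared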

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   30.099      1.634  18.421        0
## hp            -0.068      0.010  -6.742        0
## [1] 0.602

Tukey’s transformation
Raising to the power

Raising the variable to a power may be the best option for increasing R-squared, but the resulting estimate for the independent variable is hard to interpret.
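
Judging by the term name I(-hp^-0.7) in the output below, the model was probably fitted like this (mtcars assumed):

fit_pow <- lm(mpg ~ I(-hp^-0.7), data = mtcars)  # mpg on the power-transformed hp
round(summary(fit_pow)$coefficients, 3)
round(summary(fit_pow)$r.squared, 3)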

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    4.703      1.735   2.710    0.011
## I(-hp^-0.7) -444.584     47.580  -9.344    0.000
## [1] 0.744

\(Y = b_{1} \cdot (-X^{-0.7}) + b_{0}\)
\(b_{1} = -444.58\) means that a one-unit increase in the transformed variable \(-hp^{-0.7}\) changes the dependent variable Y (mpg) by -444.58 points.

Logarithmic transformation

A logarithmic transformation can say more about the structural relationship between the variables. For example:

  • In the model \(log(Y) = b_{1} \cdot X + b_{0}\), the slope coefficient means that a one-unit change in X changes Y by \(100 \cdot b_{1}\) percent on average.
  • In the model \(Y = b_{1} \cdot log(X) + b_{0}\), the slope coefficient means that a 1% change in X changes Y by \(0.01 \cdot b_{1}\) on average.
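
The summary below appears to come from a model with both sides log-transformed (the log-log case discussed just after it), fitted on mtcars; a sketch of the call:

fit_log <- lm(log(mpg) ~ log(hp), data = mtcars)  # log-log (elasticity) model
round(summary(fit_log)$coefficients, 3)
round(summary(fit_log)$r.squared, 3)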

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    5.545      0.299  18.538        0
## log(hp)       -0.530      0.061  -8.691        0
## [1] 0.716

In the model \(log(Y) = b_{1} \cdot log(X) + b_{0}\), the slope coefficient means that a 1% change in X changes Y by \(b_{1}\) percent on average. In this case \(b_{1} = -0.53\): changing hp by 1% changes mpg by about -0.53% on average.

How to find the best solution?
Function for transforming a nonlinear regression variable
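
The function itself is not shown; below is a rough sketch of the idea, assuming it walks Tukey's ladder of powers (with log for lambda = 0) and keeps the transformation with the highest R-squared. The name best_transform, the lambdas argument, and the returned structure are illustrative, not the author's code:

best_transform <- function(y, x, lambdas = seq(-3, 3, by = 0.1)) {
  stopifnot(all(x > 0))                       # the log/power ladder needs positive x
  tukey <- function(x, l) if (abs(l) < 1e-8) log(x) else x^l
  # R-squared of a simple regression for every candidate power
  r2 <- sapply(lambdas, function(l) summary(lm(y ~ tukey(x, l)))$r.squared)
  best_l    <- lambdas[which.max(r2)]
  origin_lm <- summary(lm(y ~ x))
  best_lm   <- summary(lm(y ~ tukey(x, best_l)))
  list(test = cbind(origin_lm = c(origin_lm$coefficients[, "Std. Error"], r.squ = origin_lm$r.squared),
                    best_lm   = c(best_lm$coefficients[, "Std. Error"],   r.squ = best_lm$r.squared)),
       lambda        = best_l,
       original_x    = x,
       transformed_x = tukey(x, best_l))
}

In the output below the chosen transformation is the logarithm: the transformed_x values are the logs of the original ones.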
## $test
##                  origin_lm    best_lm
## std.error y 0.082385029080 0.01373725
## std.error x 0.000005205172 0.00146045
## r.squ       0.900568916996 0.99996160
## 
## $original_x
##  [1]  9927.17  5401.99  6985.04 11639.93 36296.68 11526.84  5172.36
##  [8]  9971.65 16592.56 27271.26  9711.96  9733.95 16806.18 16648.66
## [15]  7365.96 21291.80 17843.58  4644.63 15094.09 16690.57
## 
## $transformed_x
##  [1]  9.203031  8.594523  8.851526  9.362197 10.499482  9.352434  8.551084
##  [8]  9.207501  9.716710 10.213589  9.181113  9.183375  9.729502  9.720085
## [15]  8.904625  9.966077  9.789399  8.443467  9.622059  9.722599

As seen, the transformation reduced the standard error and increased the R-squared.

And the same, but done in a more elegant way

Heteroscedasticity test

Heteroscedasticity (the opposite being homoscedasticity) is used in statistics, especially in the context of linear regression or time series analysis, to describe the case where the variance of the errors of the model is not the same for all observations, while one of the basic assumptions in modeling is that the variances are homogeneous and the errors of the model are identically distributed.

How does it look?

In linear regression analysis, the fact that the errors of the model (also called residuals) are not homoscedastic has the consequence that the coefficients estimated using ordinary least squares (OLS) are no longer the minimum-variance estimates, and the estimation of their variance is not reliable.

How to test?
Function for testing heteroscedasticity
##     var_names R-squared p-value shapiro p bart stat bartlett p
##  1:       mpg     0.466   0.116     0.226    14.914      0.135
##  2:       cyl     0.443   0.155     0.916    14.175      0.165
##  3:      disp     0.225   0.789     0.897     7.194      0.707
##  4:        hp     0.571   0.023     0.107    18.275      0.050
##  5:      drat     0.462   0.123     0.289    14.769      0.141
##  6:        wt     0.257   0.693     0.274     8.212      0.608
##  7:      qsec     0.361   0.352     0.002    11.568      0.315
##  8:        vs     0.487   0.088     0.361    15.586      0.112
##  9:        am     0.596   0.014     0.468    19.064      0.039
## 10:      gear     0.714   0.001     0.578    22.834      0.011
## 11:      carb     0.559   0.028     0.569    17.881      0.057

The higher the R-squared, the higher the probability of heteroscedasticity in the model.
\(p-value < 0.05\) means the R-squared is meaningful (this is a self-made White's test for heteroscedasticity).
Shapiro test: \(p-value > 0.05\) means the distribution of the residuals is normal.
Bartlett test: the higher the statistic, the stronger the heteroscedasticity; if the Bartlett \(p-value < 0.05\), the Bartlett statistic is meaningful.
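
The author's function is not shown; a simplified self-made White's test for a single fitted model might look like the sketch below (the table above apparently applies something similar to every mtcars variable; shapiro.test(residuals(model)) would give the "shapiro p" column and bartlett.test() the Bartlett columns):

white_test <- function(model) {
  u2  <- residuals(model)^2            # squared residuals
  f   <- fitted(model)
  aux <- lm(u2 ~ f + I(f^2))           # auxiliary regression on the fitted values
  r2  <- summary(aux)$r.squared
  stat <- length(u2) * r2              # LM statistic, approx. chi-squared under homoscedasticity
  c(r.squared = r2, statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

white_test(lm(mpg ~ hp + wt, data = mtcars))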

How to fix?

BoxCox or log transformation

  • BoxCox — boxcox {MASS} (sketched below)
  • log — lm(log(price) ~ log(carat), data = diamonds)
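
A minimal sketch of the Box-Cox option from the list above (boxcox() is in MASS; the diamonds data is assumed to come from ggplot2):

library(MASS)
library(ggplot2)

bc <- boxcox(lm(price ~ carat, data = diamonds))           # profile log-likelihood over lambda
lambda_best <- bc$x[which.max(bc$y)]                       # lambda with the highest likelihood
fit_bc <- lm(price^lambda_best ~ carat, data = diamonds)   # refit with the transformed response
# (if lambda_best is close to 0, use log(price) instead)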

Multicollinearity

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable.

Why is it the problem?

It is important for testing hypotheses about relations between variables, but not critical for general prediction, because \(R^{2}\) stays almost the same.

In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t-statistics tend to be very small. It becomes difficult to reject the null hypothesis of any study when multicollinearity is present in the data under study.

Multicollinearity also has a large negative impact on the p-value of each variable, depriving them of significance.

How to identify and fix it?
First case – perfect correlation

##                Estimate Std. Error    t value  Pr(>|t|)
## (Intercept)  0.05335452  0.2000167  0.2667504 0.7916882
## x_1          0.22654906  0.1608418  1.4085211 0.1703888
## x_3         -0.18195099  0.1871420 -0.9722615 0.3395508

It can be seen that summary() automatically excludes x_2 from the report because its correlation with x_1 is equal to 1.
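
A minimal reproduction of this case on simulated data (the variable names x_1, x_2, x_3 follow the table above, but the numbers will differ from it):

set.seed(1)
x_1 <- rnorm(30)
x_2 <- 2 * x_1                      # perfectly correlated with x_1
x_3 <- rnorm(30)
y   <- 0.2 * x_1 - 0.1 * x_3 + rnorm(30)
summary(lm(y ~ x_1 + x_2 + x_3))    # x_2 is dropped ("not defined because of singularities")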

Second case – significant correlation

We need to find how well each variable is explained by the other variables: the VIF (variance inflation factor), \(VIF_{j} = \frac{1}{1-R^{2}_{j}}\)

If the VIF is greater than 10 for a particular independent variable, it should be excluded from the model; a VIF of 10 means that \(R^{2}_{j}\) is greater than 0.9.
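
A hand-rolled VIF check that follows the formula above (car::vif() does the same job); the model used here is only an example:

vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # regressors without the intercept
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared  # R^2_j of variable j on the others
    1 / (1 - r2)
  })
}

vif_manual(lm(mpg ~ disp + hp + wt, data = mtcars))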

## $data_orign
##             Estimate Pr(>|t|)
## (Intercept)   4.0008   0.0000
## dist          0.0031   0.3225
## speed_2       0.0719   0.0000
## speed_3      -0.0016   0.0000
## 
## $data_fixed
##             Estimate Pr(>|t|)
## (Intercept)   9.0739   0.0000
## dist          0.0299   0.0809
## speed_3       0.0010   0.0000

As seen, this function has found the variables with high VIF (greater than 10) and removed them from the model, yielding better estimates.

Independence of observations

The assumption of independence means that your data isn’t connected in any way (at least, in ways that you haven’t accounted for in your model).

There are two assumptions:

  • The observations between groups should be independent, which basically means the groups are made up of different people. You don’t want one person appearing twice in two different groups as it could skew your results.
  • The observations within each group must be independent. If two or more data points in one group are connected in some way, this could also skew your data.

Main effect (also called a simple effect) is the effect of one independent variable on the dependent variable. It ignores the effects of any other independent variables.

Random effect – "if an effect is assumed to be a realized value of a random variable, it is called a random effect" (LaMotte, 1983).

Example: in medicine, the main effect is the drug, and the random effects are sex, age, location, doctor, and patient.

In most cases it is impossible to reach full independence of observations. In practice the observer is constantly faced with random effects, which affect the prediction model.

Pseudoreplication

Pseudoreplication occurs when you analyse the data as if you had more degrees of freedom than you really have. There are two kinds of pseudoreplication:

  • temporal pseudoreplication, involving repeated measurements from the same individual.
  • spatial pseudoreplication, involving several measurements taken from the same vicinity.

Pseudoreplication is a problem because one of the most important assumptions of standard statistical analysis is independence of errors. Repeated measures through time on the same individual will have non-independent errors because peculiarities of the individual will be reflected in all of the measurements made on it (the repeated measures will be temporally correlated with one another).

How to deal with this? – linear mixed models (LMMs)

Linear mixed models are an extension of simple linear models to allow both fixed and random effects, and are particularly used when there is non independence in the data, such as arises from a hierarchical structure.

In R this is done with the lmer() function: lmer(DV ~ IV + (1 + IV | RV), data = my_data)

Let’s see how it works

## Packages
library(lme4)       # lmer()
# library(lmerTest) # optionally adds p-values (Satterthwaite approximation) to the model summary
library(ggplot2)    # ggplot()
library(ggpubr)     # as_ggplot()
library(gridExtra)  # arrangeGrob()
library(grid)       # textGrob()

## Random-effects operators in the lmer() formula
# (1 + IV | RV) - random intercept and random slope
# (0 + IV | RV) - random slope only
# By default the summary is fitted by REML (restricted maximum likelihood)

## Data
data(Exam, package = "mlmRev")  # exam scores of 4,059 students from 65 schools in Inner London
# normexam - normalized exam score - final exam
# standLRT - standardised LR test score - entrance exam (English language)

## Model2 random free element
# Main effect 
#+ random free element 
Model2_free <- lmer(normexam ~ standLRT + (1|school), data=Exam)
Exam$Model2_free_pred <- predict(Model2_free)
g1 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) + 
  geom_point(alpha = 0.2) + 
  geom_line(data = Exam, aes(x = standLRT, y = Model2_free_pred, col = school)) +
  theme(legend.position = "none") +
  ggtitle("Model2 random free element")

## Model2 random angle coefficient
# Main effect 
#+ random angle coefficient 
Model2_angle <- lmer(normexam ~ standLRT + (0 + standLRT|school), data=Exam)
Exam$Model2_angle_pred <- predict(Model2_angle)
g2 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) + 
  geom_point(alpha = 0.2) + 
  geom_line(data = Exam, aes(x = standLRT, y = Model2_angle_pred, col = school)) +
  theme(legend.position = "none") +
  ggtitle("Model2 random angle coefficient") 

## Model1 two random corr elements
# Main effect 
#+ random free element 
#+ random angle coefficient 
#+ random effects correlated with each other (the higher the regression line, the steeper the slope) 
Model1_cor <- lmer(normexam ~ standLRT + (1 + standLRT|school), data=Exam)
Exam$Model1_cor_pred <- predict(Model1_cor)
g3 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) + 
  geom_point(alpha = 0.2) + 
  geom_line(data = Exam, aes(x = standLRT, y = Model1_cor_pred, col = school)) +
  theme(legend.position = "none") +
  ggtitle("Model1 two random corr elements") 

## Model1 two random uncorr elements
# Main effect 
#+ random free element 
#+ random angle coefficient 
#+ random effects uncorrelated with each other 
Model1_uncor <- lmer(normexam ~ standLRT + (1|school) + (0 + standLRT|school), data=Exam)
Exam$Model1_uncor_pred <- predict(Model1_uncor)
g4 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) + 
  geom_point(alpha = 0.2) + 
  geom_line(data = Exam, aes(x = standLRT, y = Model1_uncor_pred, col = school)) +
  theme(legend.position = "none") +
  ggtitle("Model1 two random uncorr elements") 

#Graphics
as_ggplot(arrangeGrob(g1, g2, g3, g4, ncol = 2, top = textGrob("Comparison of different types of linear mixed models"), bottom = textGrob("legend: colored lines means schools")))

These graphs show what the regression lines look like under different conditions. In the first graph the random slope coefficient is assumed to be unimportant; the second is the opposite, with the random intercept assumed unimportant. The third and fourth graphs show what the lines look like when the correlation between the two random coefficients is or is not allowed.

How to choose the best linear mixed model (LMM)?

ANOVA is one of the methods for comparing the models.
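
The comparison below can be reproduced roughly as follows. Model0 is an assumption here (a null model with a random intercept only, matching the formula printed in the output), and "Model2" corresponds to Model2_free from the code above; anova() refits both models with ML before comparing them.

Model0 <- lmer(normexam ~ 1 + (1 | school), data = Exam)
anova(Model0, Model2_free)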

## Data: Exam
## Models:
## Model0: normexam ~ 1 + (1 | school)
## Model2: normexam ~ standLRT + (1 | school)
##        Df     AIC     BIC  logLik deviance  Chisq Chi Df
## Model0  3 11016.6 11035.6 -5505.3  11010.6              
## Model2  4  9365.2  9390.5 -4678.6   9357.2 1653.4      1
##                   Pr(>Chisq)    
## Model0                          
## Model2 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The models are statistically significantly different from each other; the second model is much better, as indicated by the p-value and by the much lower AIC and BIC.