Outlier Treatment
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.
The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.
Extreme values in both tails of the distribution are similarly unlikely. In the histogram below these extreme values – outliers – can be seen.
# simulate a numeric variable with two grouping factors
data_out <- data.frame(num = rnorm(n = 500, 0, 30),
                       factor1 = factor(sample(c(1, 2), 500, replace = T)),
                       factor2 = factor(sample(c("A", "B"), 500, replace = T)))
# inject 40 artificial outliers far to the right of the distribution
data_out$num[sample(length(data_out$num), 40, replace = F)] <- runif(n = 40, min = 95, max = 500)
hist(data_out$num)
Function to search for outliers by group and eliminate them
find_outliers <- function(t){
  # names of the grouping (factor) and numeric columns
  factor_vars <- names(which(sapply(t, is.factor)))
  num_var <- names(which(sapply(t, is.numeric)))
  # flag observations further than 1.5 standard deviations from the group mean
  t %>%
    group_by_at(factor_vars) %>%
    mutate_at(vars(all_of(num_var)),
              list(is_outlier = ~ ifelse(abs(. - mean(.)) > 1.5 * sd(.), 1, 0)))
}
data_out <- find_outliers(data_out)
data_norm <- subset(data_out, is_outlier == 0)
Before applying the function
data_out$groups <- factor(group_indices(data_out))
ggplot(data_out, aes(x = num)) +
geom_histogram(aes(color = groups, fill = groups),
position = "identity", bins = 30, alpha = 0.4)+
facet_wrap(.~groups)
After applying the function
data_norm$groups <- factor(group_indices(data_norm))
ggplot(data_norm, aes(x = num)) +
geom_histogram(aes(color = groups, fill = groups),
position = "identity", bins = 30, alpha = 0.4)+
facet_wrap(.~groups)
The same approach for removing outliers, but implemented with the base R function merge()
find_outliers <- function(t){
  # logical index of the factor columns
  factor_item <- sapply(t, is.factor)
  # rename the numeric column to "num"
  names(t)[sapply(t, is.numeric)] <- "num"
  # group means (one row per factor combination), renamed to "mean"
  t_mean <- aggregate(x = t$num, by = t[factor_item], mean)
  names(t_mean)[sapply(t_mean, is.numeric)] <- "mean"
  # group standard deviations, renamed to "sd"
  t_sd <- aggregate(x = t$num, by = t[factor_item], sd)
  names(t_sd)[sapply(t_sd, is.numeric)] <- "sd"
  # merge the group mean and sd back into the data frame by the factor columns
  t <- merge(t, t_mean, by = names(t)[sapply(t, is.factor)])
  t <- merge(t, t_sd, by = names(t)[sapply(t, is.factor)])
  # flag observations further than 2 standard deviations from the group mean
  t$is_outlier <- ifelse((abs(t$num - t$mean) > 2 * t$sd), 1, 0)
  t
}
find_outliers(ToothGrowth)
Nonlinear regression problem
The linear form of the model is the simplest form for prediction, but in most cases the relationship between the variables is not linear. One way to fix the prediction formula is to apply a logarithmic transformation or to raise a variable to a power – Tukey’s transformation (the ladder of powers).
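The summary output below was presumably produced by a baseline linear model of mpg on hp; a minimal sketch (the object name fit is an assumption):
fit <- lm(mpg ~ hp, data = mtcars)   # baseline linear model
round(summary(fit)$coefficients, 3)  # estimates, standard errors, t values, p values
round(summary(fit)$r.squared, 3)     # R-squared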
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.099 1.634 18.421 0
## hp -0.068 0.010 -6.742 0
## [1] 0.602
plot(mtcars$mpg ~ mtcars$hp)
abline(lm(mpg ~ hp, data = mtcars), col = "blue")
lines(lowess(mtcars$hp, mtcars$mpg), col = "red")
Tukey’s transformation
Raising to a power
Raising the variable to a power can do the most to increase R-squared, but the resulting estimate for the independent variable is hard to interpret.
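The output below presumably comes from refitting the model with the power-transformed predictor; a minimal sketch (fit_pow is an assumed name):
fit_pow <- lm(mpg ~ I(-hp^-0.7), data = mtcars)  # Tukey's transformation with lambda = -0.7
round(summary(fit_pow)$coefficients, 3)
round(summary(fit_pow)$r.squared, 3)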
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.703 1.735 2.710 0.011
## I(-hp^-0.7) -444.584 47.580 -9.344 0.000
## [1] 0.744
plot(mtcars$mpg ~ I(-mtcars$hp^-0.7))
abline(lm(mpg ~ I(-hp^-0.7), data = mtcars), col = "blue")
lines(lowess(I(-mtcars$hp^-0.7), mtcars$mpg), col = "red")
\(Y = b_{1} \cdot (-X^{-0.7}) + b_{0}\)
\(b_{1} = -444.58\) means that a one-unit change in the transformed variable \(-hp^{-0.7}\) changes the dependent variable Y (mpg) by -444.58 points on average.
Logarithmic transformation
Making a logarithmic transformation of the variables says more about the structural relations between them. For example:
- In the model \(log(Y) = b_{1} \cdot X + b_{0}\), the slope coefficient means that a one-unit change in X changes Y by \(100 \cdot b_{1}\) percent on average.
- In the model \(Y = b_{1} \cdot log(X) + b_{0}\), the slope coefficient means that a 1% change in X changes Y by \(0.01 \cdot b_{1}\) on average.
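The output below presumably comes from a log-log specification; a minimal sketch (fit_log is an assumed name):
fit_log <- lm(log(mpg) ~ log(hp), data = mtcars)  # log-log model
round(summary(fit_log)$coefficients, 3)
round(summary(fit_log)$r.squared, 3)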
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.545 0.299 18.538 0
## log(hp) -0.530 0.061 -8.691 0
## [1] 0.716
plot(log(mtcars$mpg) ~ log(mtcars$hp))
abline(lm(log(mpg) ~ log(hp), mtcars), col = "blue")
lines(lowess(log(mtcars$hp), log(mtcars$mpg)), col = "red")
In the model \(log(Y) = b_{1} \cdot log(X) + b_{0}\), the slope coefficient means that a 1% change in X changes Y by \(b_{1}\) percent on average. Here \(b_{1} = -0.53\), so increasing hp by 1% decreases mpg by about 0.5% on average.
How to find the best solution?
Function for transforming the variable of a nonlinear regression
transform_x <- function(test_data){
  # grid of candidate exponents for Tukey's ladder of powers
  calc <- data.frame(exp = seq(-2, 2, 0.1), cor = 0)
  # correlation of y with each transformed version of x
  calc$cor <- sapply(calc$exp, function(a){
    if (a == 0) {
      cor.test(test_data[,1], log(test_data[,2]))$estimate
    } else if (a > 0) {
      cor.test(test_data[,1], test_data[,2]^a)$estimate
    } else {
      cor.test(test_data[,1], -test_data[,2]^a)$estimate
    }
  })
  # exponent with the strongest absolute correlation
  best_a <- calc$exp[which.max(abs(calc$cor))]
  # build the transformed predictor and the corresponding model
  if (best_a == 0) {
    best_x <- log(test_data[,2])
  } else if (best_a > 0) {
    best_x <- test_data[,2]^best_a
  } else {
    best_x <- -test_data[,2]^best_a
  }
  best_lm <- lm(test_data[,1] ~ best_x)
  # compare standard errors of the original and the transformed model
  result.std.err <- data.frame(
    "origin_lm" = summary(lm(test_data[,1] ~ test_data[,2]))$coefficients[,2],
    "best_lm" = summary(best_lm)$coefficients[,2])
  row.names(result.std.err) <- c("std.error y", "std.error x")
  # compare R-squared of the two models
  result.r.squared <- data.frame(
    "origin_lm" = summary(lm(test_data[,1] ~ test_data[,2]))$r.squared,
    "best_lm" = summary(best_lm)$r.squared)
  row.names(result.r.squared) <- "r.squ"
  return(list(
    "test" = rbind(result.std.err, result.r.squared),
    "original_x" = test_data[,2],
    "transformed_x" = best_x))
}
transform_x(test_data)
## $test
## origin_lm best_lm
## std.error y 0.082385029080 0.01373725
## std.error x 0.000005205172 0.00146045
## r.squ 0.900568916996 0.99996160
##
## $original_x
## [1] 9927.17 5401.99 6985.04 11639.93 36296.68 11526.84 5172.36
## [8] 9971.65 16592.56 27271.26 9711.96 9733.95 16806.18 16648.66
## [15] 7365.96 21291.80 17843.58 4644.63 15094.09 16690.57
##
## $transformed_x
## [1] 9.203031 8.594523 8.851526 9.362197 10.499482 9.352434 8.551084
## [8] 9.207501 9.716710 10.213589 9.181113 9.183375 9.729502 9.720085
## [15] 8.904625 9.966077 9.789399 8.443467 9.622059 9.722599
As seen, the transformation reduced the standard errors and increased R-squared.
The same, in a more elegant way:
transform_x <- function(test_data) {
  y <- test_data[[1]]
  x <- test_data[[2]]
  # grid of candidate exponents (lambda)
  l <- seq(-2, 2, 0.1)
  # apply every power transformation at once: one column per lambda
  d <- outer(x, l, "^")
  # change the sign for negative lambdas (Tukey's ladder keeps monotonicity)
  d[, l < 0] <- -d[, l < 0]
  # special case: lambda equal to zero means a log transformation
  d[, l == 0] <- log(x)
  # correlation of each transformed column with y
  r <- cor(d, y)[, 1]
  # return the transformation with the strongest absolute correlation
  d[, which.max(abs(r))]
}
Heteroscedasticity test
Heteroscedasticity – the opposite of homoscedasticity – is used in statistics, especially in the context of linear regression or time series analysis, to describe the case where the variance of the model errors is not the same for all observations, whereas one of the basic assumptions in modeling is that the variances are homogeneous and the errors of the model are identically distributed.
How does it look? In linear regression analysis, the fact that the errors of the model (also called residuals) are not homoscedastic has the consequence that the coefficients estimated by ordinary least squares (OLS), while still unbiased, no longer have minimum variance, and the estimation of their variance is not reliable.
How to test?
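For a single model the check can be done directly with the Breusch-Pagan test before building a function for a whole data frame; a minimal sketch, assuming the lmtest package is installed (the model formula is arbitrary):
library(lmtest)
fit_bp <- lm(mpg ~ hp + wt, data = mtcars)  # an arbitrary example model
bptest(fit_bp)  # H0: the residual variance is constant (homoscedasticity)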
Function for testing heteroscedasticity
# requires lmtest (for bptest) and data.table
hetero_test <- function(d){
  r <- vector(mode = "numeric")
  p <- vector(mode = "numeric")
  s <- vector(mode = "numeric")
  bps <- vector(mode = "numeric")
  bpp <- vector(mode = "numeric")
  for(i in seq_along(d)){
    # residuals of the regression of variable i on all the others
    fit <- lm(d[,i] ~ ., d[,-i])$res
    # auxiliary regression of the squared residuals (White-style test)
    fit.res <- lm(fit^2 ~ ., d[,-i])
    s[i] <- round(shapiro.test(fit)$p, 3)
    r[i] <- round(summary(fit.res)$r.squared, 3)
    f <- summary(fit.res)$fstatistic
    p[i] <- round(pf(f[1], f[2], f[3], lower.tail = F), 3)
    # Breusch-Pagan test of the original regression
    bps[i] <- round(bptest(lm(d[,i] ~ ., d[,-i]))$stat, 3)
    bpp[i] <- round(bptest(lm(d[,i] ~ ., d[,-i]))$p.value, 3)
    attributes(s) <- NULL
    attributes(bps) <- NULL
    attributes(bpp) <- NULL
  }
  result <- data.table("R-squared" = r,
                       "p-value" = p,
                       "shapiro p" = s,
                       "bart stat" = bps,   # Breusch-Pagan statistic
                       "bartlett p" = bpp   # Breusch-Pagan p-value
                       )
  result <- cbind("var_names" = colnames(d), result)
  return(result)
}
hetero_test(mtcars)
## var_names R-squared p-value shapiro p bart stat bartlett p
## 1: mpg 0.466 0.116 0.226 14.914 0.135
## 2: cyl 0.443 0.155 0.916 14.175 0.165
## 3: disp 0.225 0.789 0.897 7.194 0.707
## 4: hp 0.571 0.023 0.107 18.275 0.050
## 5: drat 0.462 0.123 0.289 14.769 0.141
## 6: wt 0.257 0.693 0.274 8.212 0.608
## 7: qsec 0.361 0.352 0.002 11.568 0.315
## 8: vs 0.487 0.088 0.361 15.586 0.112
## 9: am 0.596 0.014 0.468 19.064 0.039
## 10: gear 0.714 0.001 0.578 22.834 0.011
## 11: carb 0.559 0.028 0.569 17.881 0.057
The larger the R-squared of the auxiliary regression, the higher the probability of heteroscedasticity in the model.
If its \(p-value < 0.05\), the R-squared is meaningful. This is a self-made White’s test for heteroscedasticity.
Shapiro test: \(p-value > 0.05\) means the distribution of the residuals is normal.
The "bart stat" and "bartlett p" columns hold the Breusch-Pagan test (bptest): the larger the statistic, the stronger the heteroscedasticity, and the statistic is meaningful when its \(p-value < 0.05\).
How to fix?
BoxCox or log transformation
- Box-Cox: boxcox {MASS}
- log: lm(log(price) ~ log(carat), diamonds)
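A minimal sketch of the Box-Cox approach, assuming MASS and ggplot2 (only for the diamonds dataset) are available; the object names are arbitrary:
library(MASS)
library(ggplot2)  # only for the diamonds dataset
fit_diam <- lm(price ~ carat, data = diamonds)
bc <- boxcox(fit_diam, lambda = seq(-2, 2, 0.1))  # profile log-likelihood over lambda
best_lambda <- bc$x[which.max(bc$y)]              # lambda with the highest likelihood
# refit with the transformed response (if best_lambda were 0, log(price) would be used instead)
fit_bc <- lm((price^best_lambda - 1) / best_lambda ~ carat, data = diamonds)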
Multicollinearity
Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable.
Why is it the problem?
It matters for testing hypotheses about the relations between variables, but it is not critical for pure prediction, because \(R^{2}\) stays almost the same.
In the presence of high multicollinearity, the confidence intervals of the coefficients tend to become very wide and the t-statistics tend to be very small, so it becomes difficult to reject the null hypothesis for any coefficient.
Multicollinearity therefore has a strong negative impact on the p-value of each affected variable, depriving it of significance.
How to identify and fix it?
First case – perfect correlation
d <- tibble(y = rnorm(30),
x_1 = rnorm(30),
x_2 = x_1,
x_3 = rnorm(30))
fit <- lm(y ~ ., d)
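The coefficient table shown after pairs(d) below was presumably produced by a call such as:
summary(fit)$coefficients  # the aliased x_2 (NA coefficient) is dropped from the table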
pairs(d)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05335452 0.2000167 0.2667504 0.7916882
## x_1 0.22654906 0.1608418 1.4085211 0.1703888
## x_3 -0.18195099 0.1871420 -0.9722615 0.3395508
As seen, summary() automatically excludes x_2 from the report because its correlation with x_1 equals 1.
Second case – significant correlation
We need to find how well each independent variable is explained by the other independent variables – the VIF (variance inflation factor): \(VIF = \frac{1}{1-R^{2}_{j}}\)
If the VIF is greater than 10 for a particular independent variable, it should be excluded from the model; a VIF of 10 means that \(R^{2}_{j}\) is greater than 0.9.
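The same quantity can be computed with vif() from the car package (assuming car is installed); the custom smart_model() below re-implements the idea step by step:
library(car)
m <- lm(dist ~ speed + I(speed^2) + I(speed^3), data = cars)  # deliberately collinear predictors
vif(m)  # values above 10 indicate problematic multicollinearity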
test_data_orign <- mutate(cars, speed_2 = speed^2, speed_3 = speed^3)
# note: regressing the first column on all the others can be written as lm(data[, 1] ~ ., data[-1])
smart_model <- function(test_data_orign){
  test_data <- test_data_orign[-1]
  VIFd <- 11
  # drop the predictor with the largest VIF until all VIFs are below 10
  while (any(VIFd > 10) & length(test_data) > 1) {
    VIFd <- sapply(1:ncol(test_data), function(x) {
      data_reg <- lm(test_data[,x] ~ ., test_data[-x])
      r.sq <- summary(data_reg)$r.squared
      1 / (1 - r.sq)
    })
    if(any(VIFd > 10)){
      test_data[which.max(VIFd)] <- list(NULL)
    }
  }
  # coefficients and p-values before and after removing collinear predictors
  data_orign <- round(summary(
    lm(test_data_orign[,1] ~ ., test_data_orign[,-1]))$coefficients[,c(1,4)], 4)
  data_fixed <- round(summary(
    lm(test_data_orign[,1] ~ ., test_data))$coefficients[,c(1,4)], 4)
  return(list(data_orign = data_orign, data_fixed = data_fixed))
}
smart_model(test_data_orign)
## $data_orign
## Estimate Pr(>|t|)
## (Intercept) 4.0008 0.0000
## dist 0.0031 0.3225
## speed_2 0.0719 0.0000
## speed_3 -0.0016 0.0000
##
## $data_fixed
## Estimate Pr(>|t|)
## (Intercept) 9.0739 0.0000
## dist 0.0299 0.0809
## speed_3 0.0010 0.0000
As seen, the function found the variables with a high VIF (more than 10) and removed them from the model, producing better estimates.
Independence of observations
The assumption of independence means that your data isn’t connected in any way (at least, in ways that you haven’t accounted for in your model).
There are two assumptions:
- The observations between groups should be independent, which basically means the groups are made up of different people. You don’t want one person appearing twice in two different groups as it could skew your results.
- The observations within each group must be independent. If two or more data points in one group are connected in some way, this could also skew your data.
Main effect – the effect of one independent variable on the dependent variable, ignoring the effects of all other independent variables.
Random effect – "if an effect is assumed to be a realized value of a random variable, it is called a random effect" (LaMotte, 1983).
Example: in medicine, the main effect is the drug, and the random effects are sex, age, location, doctor, and patient.
In most cases it is impossible to reach full independence of observations. In practice the observer is constantly faced with random effects that influence the prediction model.
Pseudoreplication
Pseudoreplication occurs when you analyse the data as if you had more degrees of freedom than you really have. There are two kinds of pseudoreplication:
- temporal pseudoreplication, involving repeated measurements from the same individual.
- spatial pseudoreplication, involving several measurements taken from the same vicinity.
Pseudoreplication is a problem because one of the most important assumptions of standard statistical analysis is independence of errors. Repeated measures through time on the same individual will have non-independent errors because peculiarities of the individual will be reflected in all of the measurements made on it (the repeated measures will be temporally correlated with one another).
How to deal with this? – linear mixed models (LMMs)
Linear mixed models are an extension of simple linear models to allow both fixed and random effects, and are particularly used when there is non independence in the data, such as arises from a hierarchical structure.
In R this is the lmer() function from the lme4 package: lmer(DV ~ IV + (1 + IV | RV), data = my_data)
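As a concrete illustration of repeated measures handled with random effects, a minimal sketch using the sleepstudy dataset that ships with lme4 (assuming lme4 is loaded):
library(lme4)
# repeated reaction-time measurements per subject: observations within a
# subject are not independent, so Subject enters as a random effect
fit_sleep <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)
summary(fit_sleep)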
Let’s see how it works
## Random-effects operators
# 1 + IV - both the intercept (free coefficient) and the slope (angle coefficient) are random
# 0 + IV - only the slope (angle coefficient) is random
# in summary: REML - restricted maximum likelihood
# library(lmerTest) - adds p-values (Satterthwaite approximation) to the model summary
## Data
# Exam scores of 4,059 students from 65 schools in Inner London.
# normexam - Normalized exam score - final exam
# standLRT - Standardised LR test score - entrance exam (English language)
## Model2 random free element
# Main effect
#+ random free element
Model2_free <- lmer(normexam ~ standLRT + (1|school), data=Exam)
Exam$Model2_free_pred <- predict(Model2_free)
g1 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model2_free_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model2 random free element")
## Model2 random angle coefficient
# Main effect
#+ random angle coefficient
Model2_angle <- lmer(normexam ~ standLRT + (0 + standLRT|school), data=Exam)
Exam$Model2_angle_pred <- predict(Model2_angle)
g2 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model2_angle_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model2 random angle coefficient")
## Model1 two random corr elements
# Main effect
#+ random free element
#+ random angle coefficient
#+ random effects correlated with each other (the higher the regression line, the steeper the slope)
Model1_cor <- lmer(normexam ~ standLRT + (1 + standLRT|school), data=Exam)
Exam$Model1_cor_pred <- predict(Model1_cor)
g3 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model1_cor_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model1 two random corr elements")
## Model1 two random uncorr elements
# Main effect
#+ random free element
#+ random angle coefficient
#+ random effects uncorrelated with each other
Model1_uncor <- lmer(normexam ~ standLRT + (1|school) + (0 + standLRT|school), data=Exam)
Exam$Model1_uncor_pred <- predict(Model1_uncor)
g4 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model1_uncor_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model1 two random uncorr elements")
#Graphics
as_ggplot(arrangeGrob(g1, g2, g3, g4, ncol = 2, top = textGrob("Comparison of different types of linear mixed models"), bottom = textGrob("legend: colored lines means schools")))
These graphs show how the regression lines look under different conditions. In the first graph the random slope (angle) coefficient is assumed to be unimportant; in the second it is the opposite – the random intercept (free coefficient) is unimportant. The third and fourth graphs show how the lines look when the correlation between the two random coefficients is allowed or not.
How to choose best linear mixed models (LMMs)?
ANOVA is one of the methods for comparing the models.
# Build the null model - a model with zero slope, i.e. predictions are just the (group) means
Model0 <- lmer(normexam ~ 1 + (1|school), REML = FALSE, data = Exam)
#summary(Model0)
Exam$Model0_pred <- predict(Model0)
g0 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model0_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model0 - model with 0 - slope line")
# Model with main effect and random free element (without random slope)
Model2 <- lmer(normexam ~ standLRT + (1|school), REML = FALSE, data=Exam)
#summary(Model2)
Exam$Model2_pred <- predict(Model2)
g1 <- ggplot(data = Exam, aes(x = standLRT, y = normexam)) +
geom_point(alpha = 0.2) +
geom_line(data = Exam, aes(x = standLRT, y = Model2_pred, col = school)) +
theme(legend.position = "none") +
ggtitle("Model1 - with main effect and random free element")
# Graphics
as_ggplot(arrangeGrob(g0, g1, ncol = 2, top = textGrob("Comparison of linear mixed models"), bottom = textGrob("legend: colored lines means schools")))
## Data: Exam
## Models:
## Model0: normexam ~ 1 + (1 | school)
## Model2: normexam ~ standLRT + (1 | school)
## Df AIC BIC logLik deviance Chisq Chi Df
## Model0 3 11016.6 11035.6 -5505.3 11010.6
## Model2 4 9365.2 9390.5 -4678.6 9357.2 1653.4 1
## Pr(>Chisq)
## Model0
## Model2 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
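The comparison table above was presumably produced by a likelihood-ratio test of the two nested models, e.g.:
anova(Model0, Model2)  # chi-squared test of the nested models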
The models are statistically significantly different from each other; the second model is much better, as indicated by the p-value.
Build logarithmic model
Prediction on the new data
Using the allow.new.levels argument to control how random effects are handled in the prediction.
# modify the data for the prediction example
new_Exam <- Exam[sample(1:nrow(Exam), 100), ]
new_Exam$school <- sample(101:200)
# Model2 - the fitted model
# new_Exam - the new data for prediction
# allow.new.levels - apply the model to a new dataset that may not have
# the same levels of the random effect; when predict() encounters
# new levels, the prediction uses only the fixed coefficients.
predict(Model2, new_Exam, allow.new.levels = T)
# Examining the random effects
# fixef - fixed effects
# ranef - random effects; the value of a random effect is the
# deviation from the fixed effect. A positive random slope (angle)
# coefficient means that the relation between the entrance and final
# exams is stronger for that particular school.
fixef(Model3)
ranef(Model3)