The main goal of this report is to understand how normalizing a predictor affects the mean squared error.
The mean squared error (MSE) tells you how close a regression line is to a set of points.
The general steps to calculate it are the following:
* Find the regression line;
* Insert your X values into the linear regression equation to find the new Y values (Y’);
* Subtract the new Y value from the original to get the error;
* Square the errors;
* Add up the squared errors;
* Find the mean.
(extracted from http://www.statisticshowto.com/mean-squared-error/)
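The steps above can be condensed into a single formula. This is a sketch in the notation of the list, where n (the number of observations) and the index i are symbols introduced here for illustration:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - Y'_i \right)^2
```

Here Y_i is the original value and Y'_i is the value predicted by the regression line for the same observation.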
In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging
(extracted from https://en.wikipedia.org/wiki/Normalization_(statistics))
Here I’ll use the Z-score normalization given by:
(X-mu)/s, where:
X = Observed data;
mu = mean of the data;
s = standard deviation of the data;
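As a minimal sketch of the formula above, the Z-score can be wrapped in a small function and compared against base R's `scale()`. The helper name `z_score_of` is hypothetical and not part of the original analysis. The sketch assumes s is the sample standard deviation, which is what `sd()` returns.

```r
# Hypothetical helper (not from the original report):
# Z-score of a numeric vector, using the sample standard deviation.
z_score_of <- function(x) {
  (x - mean(x)) / sd(x)
}

wt_z <- z_score_of(mtcars$wt)

# scale() centers by the mean and divides by the sample standard
# deviation by default, so it should agree with the manual formula.
# as.numeric() drops the matrix shape and attributes scale() returns.
wt_scaled <- as.numeric(scale(mtcars$wt))

all.equal(wt_z, wt_scaled)
```

If the two vectors agree, `all.equal()` returns `TRUE`, which suggests `scale()` can be used interchangeably with the manual formula here.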
For this analysis I’ll load the built-in mtcars dataset and normalize the wt variable:
ds <- mtcars[,c("mpg","wt")]
z_score = (mtcars$wt - mean(mtcars$wt))/sd(mtcars$wt)
ds$n_wt <- z_score
head(ds)
## mpg wt n_wt
## Mazda RX4 21.0 2.620 -0.610399567
## Mazda RX4 Wag 21.0 2.875 -0.349785269
## Datsun 710 22.8 2.320 -0.917004624
## Hornet 4 Drive 21.4 3.215 -0.002299538
## Hornet Sportabout 18.7 3.440 0.227654255
## Valiant 18.1 3.460 0.248094592
## mpg wt n_wt
## Min. :10.40 Min. :1.513 Min. :-1.7418
## 1st Qu.:15.43 1st Qu.:2.581 1st Qu.:-0.6500
## Median :19.20 Median :3.325 Median : 0.1101
## Mean :20.09 Mean :3.217 Mean : 0.0000
## 3rd Qu.:22.80 3rd Qu.:3.610 3rd Qu.: 0.4014
## Max. :33.90 Max. :5.424 Max. : 2.2553
## [1] "Standard deviation of original variable: 0.978457442989697"
## [1] "Standard deviation of normalized variable: 1"
The scaling worked as expected: the normalized variable now has a mean of 0 and a standard deviation of 1.
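The code that produced the summary table and the two standard deviation lines is not shown in the report. The following is one way such output could be generated. It is an assumption about the hidden chunk, not the author's confirmed code.

```r
# Rebuild the data frame so this sketch is self-contained.
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)

# Five-number summary plus mean for each column.
summary(ds)

# Standard deviation before and after normalization
# (assumed form of the hidden print statements).
print(paste("Standard deviation of original variable:", sd(ds$wt)))
print(paste("Standard deviation of normalized variable:", sd(ds$n_wt)))
```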
par(mfrow=c(1,2))
boxplot(ds$wt, main="Box Plot of Original Data")
boxplot(ds$n_wt, main="Box Plot on Normalized Data")
The plots above show that the shape of the distribution of wt was not affected by the normalization; only its location and scale changed.
Creating the models:
fit1 <- lm(mpg~wt, data = ds)
fit2 <- lm(mpg~n_wt, data=ds)
Checking the models:
## Original Data
summary(fit1)
##
## Call:
## lm(formula = mpg ~ wt, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
## Normalized Data
summary(fit2)
##
## Call:
## lm(formula = mpg ~ n_wt, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.0906 0.5384 37.313 < 2e-16 ***
## n_wt -5.2293 0.5471 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
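The two summaries differ only in the coefficient estimates and their standard errors; the residuals, t value of the slope, and R-squared are identical. The sketch below checks how the coefficients relate, assuming the same `ds`, `fit1`, and `fit2` as above (rebuilt here so the block runs on its own). The names `b_orig` and `b_norm` are introduced only for illustration.

```r
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)

fit1 <- lm(mpg ~ wt, data = ds)
fit2 <- lm(mpg ~ n_wt, data = ds)

b_orig <- unname(coef(fit1))  # intercept and slope on the original scale
b_norm <- unname(coef(fit2))  # intercept and slope on the Z-score scale

# Since wt = mean(wt) + sd(wt) * n_wt, the slope is multiplied by sd(wt).
all.equal(b_norm[2], b_orig[2] * sd(ds$wt))

# The new intercept is the original prediction at the mean of wt,
# which in simple linear regression equals the mean of mpg.
all.equal(b_norm[1], b_orig[1] + b_orig[2] * mean(ds$wt))
all.equal(b_norm[1], mean(ds$mpg))
```

All three comparisons should return `TRUE`: the normalized model describes the same line, expressed in different units of the predictor.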
## Original Data
# Insert the X values into the regression equation to find the new Y values (Y');
y_hat_origin = predict(fit1)
# Subtract the new Y value from the original to get the error;
e_origin = ds$mpg - y_hat_origin
# Square the errors;
e_origin_sqr = e_origin^2
# Add up the squared errors;
sum_e_origin_sqr = sum(e_origin_sqr)
# Find the mean:
sum_e_origin_sqr/NROW(ds)
## [1] 8.697561
## Rescaled Data
# Insert the normalized X values into the regression equation to find the new Y values (Y');
y_hat_norm = predict(fit2)
# Subtract the new Y value from the original to get the error;
e_norm = ds$mpg - y_hat_norm
# Square the errors;
e_norm_sqr = e_norm^2
# Add up the squared errors;
sum_e_norm_sqr = sum(e_norm_sqr)
# Find the mean:
sum_e_norm_sqr/NROW(ds)
## [1] 8.697561
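For a simple linear regression on the mtcars dataset, using mpg as the outcome and wt as the predictor, the mean squared error is not affected by the Z-score normalization of the wt variable. This is expected: the Z-score is a linear transformation of wt, so the intercept and slope adjust to compensate, and the fitted values, residuals, and MSE stay the same (the residual summaries of the two models above are also identical).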
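The step-by-step calculation can also be written more compactly. The sketch below uses a hypothetical helper `mse()` (not part of the original report) built on `residuals()`, and rebuilds the models so it runs on its own.

```r
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)

fit1 <- lm(mpg ~ wt, data = ds)
fit2 <- lm(mpg ~ n_wt, data = ds)

# Hypothetical helper: mean of the squared residuals (divides by n).
mse <- function(fit) mean(residuals(fit)^2)

mse(fit1)
mse(fit2)

# The fitted values themselves agree, which is why the MSE does too.
all.equal(fitted(fit1), fitted(fit2))

# Note: the "Residual standard error" in summary() divides by the
# residual degrees of freedom (n - 2 here), not by n, so its square
# is slightly larger than the MSE computed above.
rse <- sqrt(sum(residuals(fit1)^2) / df.residual(fit1))
all.equal(rse, summary(fit1)$sigma)
```

Dividing by n matches the steps used in this report; dividing by the residual degrees of freedom, as `summary()` does, gives a different but equally scale-invariant quantity.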