Summary

The goal of this report is to understand whether, and how, normalizing a predictor affects the mean squared error (MSE) of a linear regression.

Definitions



MSE

The mean squared error (MSE) tells you how close a regression line is to a set of points.

The general steps to calculate it are the following:

  • Find the regression line;

  • Insert your X values into the linear regression equation to find the new Y values (Y’);

  • Subtract the new Y value from the original to get the error;

  • Square the errors;

  • Add up the squared errors;

  • Find the mean (divide the sum by the number of points);

(extracted from http://www.statisticshowto.com/mean-squared-error/)
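The steps above can be sketched as a small R function. This is only an illustration: the helper name `mse_by_steps` is hypothetical (it is not part of base R or of this report), and the example fits the same `mpg ~ wt` model on the built-in `mtcars` data that is used later on.

```r
# Hypothetical helper: computes the MSE of a fitted lm model step by step.
mse_by_steps <- function(fit) {
  y       <- model.response(model.frame(fit))  # original Y values
  y_hat   <- fitted(fit)                       # new Y values (Y') from the regression line
  errors  <- y - y_hat                         # subtract the new Y from the original
  squared <- errors^2                          # square the errors
  total   <- sum(squared)                      # add up the squared errors
  total / length(y)                            # find the mean
}

# Find the regression line, then apply the remaining steps.
fit <- lm(mpg ~ wt, data = mtcars)
mse_by_steps(fit)
```

The result should match the shorter expression `mean(residuals(fit)^2)`, since the residuals of an `lm` fit are the original values minus the fitted values.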

Normalization

In the simplest cases, normalization of ratings means adjusting values measured on different scales to a notionally common scale, often prior to averaging.

(extracted from https://en.wikipedia.org/wiki/Normalization_(statistics))

Here I’ll use the Z-score normalization given by:

Z = (X - mu) / s, where:

  • X = Observed data;

  • mu = mean of the data;

  • s = sample standard deviation of the data (as computed by sd() in R);
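As a minimal sketch of the formula above, the function below applies (X - mu) / s to a numeric vector. The name `z_normalize` is hypothetical and chosen only for this illustration; the comparison with `scale()` relies on its default behaviour of centering on the mean and dividing by the sample standard deviation.

```r
# Hypothetical helper implementing (X - mu) / s.
z_normalize <- function(x) {
  (x - mean(x)) / sd(x)  # sd() uses the n - 1 denominator
}

x <- mtcars$wt
z <- z_normalize(x)

# scale() returns a one-column matrix, so convert it back to a plain vector
# before comparing it with the manual result.
all.equal(z, as.numeric(scale(x)))

# After normalization the mean should be (numerically) 0 and the sd 1.
c(mean = mean(z), sd = sd(z))
```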

Loading and Normalizing Datasets

For this analysis I’ll load the built-in R dataset mtcars, keep the mpg and wt columns, and add a normalized copy of wt (n_wt):

ds <- mtcars[,c("mpg","wt")]

z_score = (mtcars$wt - mean(mtcars$wt))/sd(mtcars$wt)

ds$n_wt <- z_score

head(ds)
##                    mpg    wt         n_wt
## Mazda RX4         21.0 2.620 -0.610399567
## Mazda RX4 Wag     21.0 2.875 -0.349785269
## Datsun 710        22.8 2.320 -0.917004624
## Hornet 4 Drive    21.4 3.215 -0.002299538
## Hornet Sportabout 18.7 3.440  0.227654255
## Valiant           18.1 3.460  0.248094592

Part-1 Some Summary Statistics of my dataset

##       mpg              wt             n_wt        
##  Min.   :10.40   Min.   :1.513   Min.   :-1.7418  
##  1st Qu.:15.43   1st Qu.:2.581   1st Qu.:-0.6500  
##  Median :19.20   Median :3.325   Median : 0.1101  
##  Mean   :20.09   Mean   :3.217   Mean   : 0.0000  
##  3rd Qu.:22.80   3rd Qu.:3.610   3rd Qu.: 0.4014  
##  Max.   :33.90   Max.   :5.424   Max.   : 2.2553
## [1] "Standard deviation of original variable: 0.978457442989697"
## [1] "Standard deviation of normalized variable: 1"
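The code that produced this output is not shown in the report. A sketch that should give equivalent output, assuming the `ds` data frame built in the previous section, could look like this (the data frame is rebuilt here so the chunk runs on its own):

```r
# Rebuild the data frame from the previous section.
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)

# Five-number summary plus mean for every column.
summary(ds)

# Standard deviation before and after normalization.
paste("Standard deviation of original variable:", sd(ds$wt))
paste("Standard deviation of normalized variable:", sd(ds$n_wt))
```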

As expected, the scaling worked: the standard deviation of the normalized variable is now 1 and its mean is 0.

par(mfrow=c(1,2))
boxplot(ds$wt, main="Box Plot of Original Data")
boxplot(ds$n_wt, main="Box Plot on Normalized Data")

The plots above show that the shape of the distribution of wt was not affected by the normalization; only the scale of the axis changed.
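A numeric way to check the same thing, offered here as a sketch rather than as part of the original analysis: a box plot is drawn from quartiles, and a Z-score is a linear transformation with a positive slope, so normalizing the quartiles of wt should give the quartiles of n_wt.

```r
wt   <- mtcars$wt
n_wt <- (wt - mean(wt)) / sd(wt)

q_wt   <- quantile(wt)    # min, quartiles and max of the original data
q_n_wt <- quantile(n_wt)  # the same statistics for the normalized data

# Normalizing the original quantiles should reproduce the normalized quantiles.
all.equal((q_wt - mean(wt)) / sd(wt), q_n_wt)

# A linear transformation with positive slope keeps the correlation at 1.
all.equal(cor(wt, n_wt), 1)
```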

Part-2 Checking if MSE is affected by the normalization

Creating the models:

fit1 <- lm(mpg~wt, data = ds)

fit2 <- lm(mpg~n_wt, data=ds)

Checking the models:

## Original Data
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ wt, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

## Normalized Data
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ n_wt, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0906     0.5384  37.313  < 2e-16 ***
## n_wt         -5.2293     0.5471  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
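The two summaries differ only in the coefficients and their standard errors; the residuals, R-squared and F-statistic are identical. The sketch below checks why: since wt = mean(wt) + sd(wt) * n_wt, the second model should have slope b1 * sd(wt) and intercept b0 + b1 * mean(wt), where b0 and b1 are the coefficients of the first model. The variable names `b` and `expected` are introduced only for this illustration.

```r
# Rebuild the data and both models so the chunk runs on its own.
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)
fit1 <- lm(mpg ~ wt, data = ds)
fit2 <- lm(mpg ~ n_wt, data = ds)

# Coefficients of the original model: b[1] = intercept, b[2] = slope.
b <- unname(coef(fit1))

# Coefficients implied for the normalized model by substituting
# wt = mean(wt) + sd(wt) * n_wt into mpg = b0 + b1 * wt.
expected <- c(b[1] + b[2] * mean(ds$wt), b[2] * sd(ds$wt))
all.equal(expected, unname(coef(fit2)))

# Both models describe the same line, so the fitted values agree.
all.equal(fitted(fit1), fitted(fit2))
```

The intercept of the normalized model is also the mean of mpg (20.0906 in the summary above), because n_wt has mean 0.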

Calculating the MSE:

## Original Data

# Insert the X values into the fitted model to get the new Y values (Y');
x_origin = predict(fit1)

# Subtract the new Y value from the original to get the error;
e_origin = ds$mpg - x_origin

# Square the errors;
e_origin_sqr = e_origin^2

# Add up the squared errors;
sum_e_origin_sqr = sum(e_origin_sqr)

# Find the mean:
sum_e_origin_sqr/NROW(ds)
## [1] 8.697561

## Rescaled Data

# Insert the X values into the fitted model to get the new Y values (Y');
x_norm = predict(fit2)

# Subtract the new Y value from the original to get the error;
e_norm = ds$mpg - x_norm

# Square the errors;
e_norm_sqr = e_norm^2

# Add up the squared errors;
sum_e_norm_sqr = sum(e_norm_sqr)

# Find the mean:
sum_e_norm_sqr/NROW(ds)
## [1] 8.697561
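The same comparison can be written more compactly with `residuals()`. The sketch below also relates the MSE to the "Residual standard error" printed by `summary()`, which divides the sum of squared errors by the residual degrees of freedom (30 here) instead of by the number of rows (32). The names `mse1`, `mse2` and `rse` are introduced only for this illustration.

```r
# Rebuild the data and both models so the chunk runs on its own.
ds <- mtcars[, c("mpg", "wt")]
ds$n_wt <- (ds$wt - mean(ds$wt)) / sd(ds$wt)
fit1 <- lm(mpg ~ wt, data = ds)
fit2 <- lm(mpg ~ n_wt, data = ds)

# MSE as the mean of the squared residuals.
mse1 <- mean(residuals(fit1)^2)
mse2 <- mean(residuals(fit2)^2)
all.equal(mse1, mse2)

# Residual standard error: square root of SSE / residual degrees of freedom.
rse <- sqrt(sum(residuals(fit1)^2) / df.residual(fit1))
all.equal(rse, summary(fit1)$sigma)
```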

Conclusions

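For a simple linear regression on the mtcars dataset, using mpg as the outcome and wt as the predictor, the mean squared error is not affected by the Z-score normalization of wt: both models give an MSE of 8.697561. This is expected, because the Z-score is a linear transformation of the predictor, so the fitted values and residuals are identical and only the coefficients change (the slope from -5.3445 to -5.2293 and the intercept from 37.2851 to 20.0906).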