Lesson 14 | 24 June 2020

Data Science Basics Continued: Revisiting Functions and Statistical Applications

I’ve shown you some basic descriptive statistics and a couple of lessons on functions, so let’s revisit the two together in a more hands-on way.

library(tidyverse) # tibble() comes from the tidyverse

data_frame <- tibble(
  c1 = rnorm(50, 5, 1.5),
  c2 = rnorm(50, 5, 1.5),
  c3 = rnorm(50, 5, 1.5)
)
data_frame
## # A tibble: 50 x 3
##       c1    c2    c3
##    <dbl> <dbl> <dbl>
##  1  2.61  8.62  4.15
##  2  2.94  5.61  4.85
##  3  5.43  7.62  5.32
##  4  6.48  3.99  3.24
##  5  6.33  4.32  5.04
##  6  4.20  7.35  3.24
##  7  5.40  6.95  2.64
##  8  6.17  5.47  4.70
##  9  3.49  6.19  3.95
## 10  5.01  4.46  5.45
## # … with 40 more rows

Centering, standardizing, or normalizing your data can help manage the varying scales of your variables. We already went through centering data in the last lesson, so let’s go through standardizing and normalizing data while building functions for each.

Standardization transforms data to have a mean of zero and a standard deviation of one. It assumes that your data follow a Gaussian (bell-curve) distribution. Standardized values, also known as z-scores, are very common in statistics: they let you compare different sets of data and look up probabilities for your data in standardized tables (called z-tables).
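
For a quick feel of the z-score before we write our own function, base R’s built-in scale() performs the same transformation; the vector below is made up purely for illustration.

# z-scores with base R's scale(), which by default subtracts the mean
# and divides by the standard deviation
x <- c(52, 61, 48, 70, 55)    # made-up scores on an arbitrary scale
z <- as.numeric(scale(x))     # same as (x - mean(x)) / sd(x)
round(mean(z), 10)            # effectively 0
sd(z)                         # exactly 1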

Normalization usually means scaling a variable so its values fall between 0 and 1. It is a good choice when you know the distribution of your data is not Gaussian. Normalizing eliminates the units of measurement from your data, so variables measured on different scales become directly comparable, including data from different places. For example, you may run into scaling issues in your regressions, and normalizing the data (or transforming it with a z-score or t-score) can help. Normalization is especially useful when your data have varying scales and the regression or algorithm you are using makes no assumptions about their distribution, such as k-nearest neighbors or artificial neural networks.
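
To see how normalization puts wildly different scales on a common footing, here is a minimal sketch; both vectors are invented for illustration.

# Min-max normalization by hand: both variables end up on a 0-1 scale,
# even though their original units are completely different
height_cm <- c(150, 165, 180, 195)   # made-up heights in centimeters
income_k  <- c(20, 55, 90, 300)      # made-up incomes in thousands
(height_cm - min(height_cm)) / (max(height_cm) - min(height_cm))
(income_k - min(income_k)) / (max(income_k) - min(income_k))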

# Centering data: subtract the mean so the variable is centered at zero
center <- function(x) {
  centered <- x - mean(x)
  return(centered)
}

# Standardizing data: the z-score, mean 0 and standard deviation 1
standardize <- function(x) {
  standardized <- (x - mean(x)) / sd(x)
  return(standardized)
}

# Normalizing data: min-max normalization, rescales to the range 0-1
normalize <- function(x) {
  normalized <- (x - min(x)) / (max(x) - min(x))
  return(normalized)
}
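
A quick sanity check confirms the three functions behave as advertised (any tiny nonzero means are just floating-point noise):

x <- data_frame$c1
round(mean(center(x)), 10)       # ~0: the mean has been removed
round(mean(standardize(x)), 10)  # ~0
sd(standardize(x))               # 1
range(normalize(x))              # 0 1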

data_frame$c1_c <- center(data_frame$c1)
data_frame$c1_s <- standardize(data_frame$c1)
data_frame$c1_n <- normalize(data_frame$c1)
par(mfrow = c(2, 2))
plot(data_frame$c2, data_frame$c1, ylab = "c1")
plot(data_frame$c2, data_frame$c1_c, ylab = "c1 centered")
plot(data_frame$c2, data_frame$c1_s, ylab = "c1 standardized")
plot(data_frame$c2, data_frame$c1_n, ylab = "c1 normalized")
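
As an aside, if you want to apply one of these functions to several columns at once, a tidyverse-style sketch using dplyr’s across() (available in dplyr 1.0 and later) might look like this:

# Standardize c1 through c3 in one step; .names appends "_s" to each
# new column name (c1_s, c2_s, c3_s)
data_frame <- data_frame %>%
  mutate(across(c1:c3, standardize, .names = "{.col}_s"))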