1 Z-Scaling

Z-scaling, also known as standardization or z-score normalization, is a common preprocessing step in statistics and machine learning. It involves transforming the features of a dataset so that they have a mean of 0 and a standard deviation of 1. The formula for z-scaling is:

\[z=\dfrac{x-\mu}{\sigma}\]

where:

  • z is the standardized value,

  • x is the original value,

  • \(\mu\) is the mean of the feature,

  • \(\sigma\) is the standard deviation of the feature.

Here are some reasons why z-scaling is used:

  1. Comparability: Z-scaling makes different features more comparable by putting them on the same scale. This is particularly important when features have different units or scales. Standardizing the features ensures that no single feature dominates the others in terms of scale.

  2. Gradient Descent Convergence: In machine learning algorithms that use gradient descent for optimization, having features on a similar scale can help the algorithm converge faster. Features with large scales might cause the algorithm to take longer to find the optimal solution.

  3. Regularization: Regularization techniques, such as L1 and L2 regularization, penalize large coefficients in regression models. Standardizing the features helps prevent one feature with a large scale from disproportionately influencing the model’s coefficients.

  4. Some Machine Learning Algorithms: Certain machine learning algorithms, such as k-nearest neighbors and support vector machines, rely on distance measures between data points. Standardizing features ensures that each feature contributes equally to the distance computation.

  5. Principal Component Analysis (PCA): In dimensionality reduction techniques like PCA, where the goal is to capture the most significant variance in the data, z-scaling is often applied to ensure that all features contribute equally to the principal components.

It’s important to note that z-scaling is not always necessary or beneficial. For algorithms that are not sensitive to feature scales (e.g., tree-based models like decision trees and random forests), standardization might not be required. The decision to z-scale features depends on the specific characteristics of the data and the requirements of the algorithm being used.
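As a quick illustration of points 1 and 5, the sketch below uses R's built-in scale() function on a few columns of the built-in mtcars data set, which are measured on very different scales, and then runs PCA on the standardized result; the data set and column choice are only placeholders for whatever multi-feature data you are working with.

# Columns of mtcars are measured in very different units (miles per gallon,
# cubic inches, horsepower, 1000s of lbs), so standardize before comparing them.
features <- mtcars[, c("mpg", "disp", "hp", "wt")]

scaled_features <- scale(features)   # z-scales each column: (x - mean(x)) / sd(x)

round(colMeans(scaled_features), 10) # each column now has mean ~ 0
apply(scaled_features, 2, sd)        # ... and standard deviation 1

# PCA on the standardized features, so no single variable dominates the components
pca <- prcomp(scaled_features)
summary(pca)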

1.1 Implementation

Let's first create normally distributed data with a mean of 108 and a standard deviation of 7.2.

# Clear the workspace
  rm(list = ls()) # Clear environment
  gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531073 28.4    1182718 63.2         NA   669265 35.8
## Vcells 975163  7.5    8388608 64.0      16384  1840452 14.1
  cat("\f")       # Clear the console
  graphics.off()  # Clear the charts (closes any open graphics devices)
# Generate a dataset 'data_values' of 1000 observations from a normal distribution
# with a mean 'mu' and a standard deviation 'sigma'.

set.seed(123)  # Set seed for reproducibility

mu    <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, 
                     mean = mu, 
                     sd = sigma
                     )

head(data_values)
## [1] 103.9646 106.3427 119.2227 108.5077 108.9309 120.3485
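As a quick sanity check on the simulation, the sample mean and standard deviation of the 1000 draws should land close to the chosen mu = 108 and sigma = 7.2.

# Sample statistics should be close to the parameters used to simulate the data
mean(data_values)  # roughly 108
sd(data_values)    # roughly 7.2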
hist(data_values, 
     main = "Original Data", 
     xlab = "Data values")

Your task is to write R code to convert each value in data_values to its corresponding z-score, which is the standard score in a standard normal distribution.

The z-score is calculated using the formula:

\[ z = \dfrac{x-\mu}{\sigma} \]

z_scores <- (data_values - mu)/sigma
hist(z_scores, 
     main = "Z Scores", 
     xlab = "Transformed values")

# Install and load ggplot2 if not already installed
# install.packages("ggplot2")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
# Create a data frame for visualization
df <- data.frame(Data = c(data_values, z_scores),
                 Type = rep(c("Original Data", "Z-Score"), each = length(data_values)))

# Plot using ggplot2
ggplot(df, aes(x = Data, fill = Type)) +
  geom_density() +
  facet_wrap(~Type, scales = "free") +
  labs(title = "Distribution of Normally Distributed Data and Z-Scores",
       x = "Value or Z-Score", y = "Frequency") +
  theme_minimal()

2 P-value

The p-value, or probability value, is a measure used in statistical hypothesis testing to determine the evidence against a null hypothesis. In statistical hypothesis testing, you start with a null hypothesis (often denoted as H0) that there is no effect or no difference. The alternative hypothesis (often denoted as H1 or Ha) suggests the presence of an effect or a difference.

The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one observed in your data, assuming the null hypothesis is true. In simpler terms, it helps you assess the strength of the evidence against the null hypothesis.

Here’s a general idea of how p-values work:

  • Low p-value (typically ≤ 0.05): If the p-value is below a certain threshold (often 0.05), you reject the null hypothesis. This suggests that the observed data is unlikely if the null hypothesis is true, providing evidence in favor of the alternative hypothesis.

  • High p-value (> 0.05): If the p-value is above the threshold, you fail to reject the null hypothesis. This implies that the observed data is reasonably likely to occur if the null hypothesis is true.

It’s important to note that a p-value does not provide the probability of the null hypothesis being true or false; it only gives the probability of obtaining the observed data if the null hypothesis is true.

Researchers and statisticians use p-values along with other information to make decisions about hypotheses. It’s also crucial to interpret p-values cautiously and consider them in the context of the study design, effect size, and other relevant factors. The threshold of 0.05 is commonly used but should not be seen as a strict rule; the interpretation of p-values depends on the specific field and context. Additionally, statistical significance does not necessarily imply practical or scientific significance.
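To connect the two sections, here is a small sketch of how a p-value can be computed in R, assuming the data_values vector and sigma from the z-scaling section are still in the workspace. It tests the null hypothesis that the true mean is 108 (the value used to simulate the data), first with a one-sample t-test and then by hand from a z statistic using pnorm(); the null value is illustrative only.

# One-sample t-test of H0: true mean = 108. Since the data were simulated with
# mean 108, we expect a large p-value and therefore fail to reject H0.
t.test(data_values, mu = 108)

# The same idea by hand: a z statistic and its two-sided p-value under H0,
# using the known sigma from the simulation.
z_stat <- (mean(data_values) - 108) / (sigma / sqrt(length(data_values)))
2 * pnorm(-abs(z_stat))  # probability of a statistic at least this extreme if H0 is true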