Previously on STAT 412:

  • Confounding & Interaction

Today’s focus is on TRANSFORMATION & IMPUTATION!

Transformation:

Data transformation is the process of converting data into new values via a mathematical function. In a stronger sense, it is a replacement that changes the shape of a distribution or of a relationship between variables.

Reasons for transformation:

  • Normalization: Data transformation allows for the normalization of data, which helps in comparing variables that originally have different scales or units. Normalizing the data can make it easier to interpret and analyze.

  • Statistical Assumptions: Many statistical methods assume that the data follow certain distributions or have specific properties. Data transformation can help meet these assumptions, thus ensuring the validity of the statistical analysis. For example, transforming data to achieve normality is often necessary for many parametric tests such as t-tests and ANOVA.

  • Outlier Handling: Data transformation can reduce the impact of outliers, making the data more robust to extreme values.

  • Linear Relationships: In regression analysis, transforming the data can help linearize relationships between variables. For example, if the relationship between variables is not linear, applying a transformation such as logarithmic or exponential can make the relationship linear, making it easier to model and interpret.

  • Homogeneity of Variance: Some statistical tests, such as ANOVA, assume homogeneity of variance among groups. Data transformation can help stabilize variance across groups, making the assumption more likely to hold and improving the validity of the analysis.

  • Data Interpretation: Transforming the data can sometimes reveal patterns or relationships that were not apparent in the original form.

Overall, data transformation is an essential tool for preparing data for analysis, meeting statistical assumptions, improving interpretability, and enhancing the validity of statistical results.

CAUTION!:

Transformation, standardization, and normalization are related but distinct concepts:

Transformation: Adjusts the scale or distribution of data using functions like logarithmic or exponential transformations.

Standardization: Rescales data to have a mean of 0 and a standard deviation of 1, aiding comparison of variables with different scales.

Normalization: Scales data to a specific range, often between 0 and 1, preserving relative differences while ensuring consistency.
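For example, a minimal min-max normalization in R (a sketch; v stands for any numeric vector):

v = c(2, 5, 9, 4, 7)                       # any numeric vector
v_norm = (v - min(v)) / (max(v) - min(v))  # rescale to the [0, 1] range
v_norm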

Let’s examine how centering, scaling, and transformation affect the data:

set.seed(1) #to make random number generation consistent
x = rnorm(100, 10, 4)  #random number generation from normal dist
x2 = x - mean(x) #center values
x3 = (x - mean(x)) / sd(x) #scale values
df=data.frame(x,x2,x3)
head(df)
x x2 x3
7.494185 -2.9413647 -0.8186837
10.734573 0.2990238 0.0832287
6.657486 -3.7780639 -1.0515661
16.381123 5.9455737 1.6548592
11.318031 0.8824816 0.2456252
6.718127 -3.7174230 -1.0346876

To see the difference, we’ll construct the plots for each modification:

library(ggplot2)
library(gridExtra)

original=ggplot(df,aes(x=x))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of Original X",y="Density",x="x")+geom_density(col="royalblue")
centered=ggplot(df,aes(x=x2))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of Centered X",y="Density",x="centered x")+geom_density(col="royalblue") # Center of x has changed
scaled=ggplot(df,aes(x=x3))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of Scaled X",y="Density",x="scaled x")+geom_density(col="royalblue") # Both the location and scale of x changed

grid.arrange(original,centered,scaled,nrow=3)

Centering adjusts the center of the data, while scaling changes both the scale and center of the data. Moreover, centering does not affect the spread or shape of the distribution; it simply changes the location of the distribution along the scale.
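To verify this numerically (using the df created above), check the column means and standard deviations: the centered and scaled versions should have mean 0, and the scaled version should also have standard deviation 1.

round(colMeans(df), 3)        # x2 and x3 have mean 0
round(apply(df, 2, sd), 3)    # x3 additionally has standard deviation 1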

Now, let’s try to change the shape of the distribution:

df$x4 = log(x)  # Be careful on ln or log transformation. Data may contain non-positive values
df$x5 = x^2
df$x6 = sqrt(x) # Be careful on square root transformation. Data may contain negative values


log_trans=ggplot(df,aes(x=x4))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of log(X)",y="Density",x="log X")+geom_density(col="royalblue") # Shape has changed to a left-skewed type
square_trans=ggplot(df,aes(x=x5))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of X^2",y="Density",x="Square of X")+geom_density(col="royalblue")  # Shape has changed to a right-skewed type
root_trans=ggplot(df,aes(x=x6))+geom_histogram(aes(y=stat(density)))+labs(title="Histogram of sqrt(X)",y="Density",x="Square-root of X")+geom_density(col="royalblue") # Again, shape has changed to a left-skewed type

grid.arrange(original,log_trans,square_trans,root_trans,nrow=4)

Transformation changes the form or distribution of the data using mathematical functions like logarithmic, square root, or exponential transformations. These functions alter the shape or spread of the data, but they may or may not affect the center depending on the transformation applied.

Box-Cox (Power) Transformation:

A power transform will make the probability distribution of a variable more Gaussian.

This is often described as removing skew from the distribution, although more generally it can be described as stabilizing the variance of the distribution.

Although the most common transformations are the logarithmic, square-root, and exponential transformations, we can use a generalized version of the transform that finds the parameter (lambda) that best transforms a variable toward a Gaussian probability distribution.
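Formally, the one-parameter Box-Cox transform is defined as \(y(\lambda) = \frac{x^\lambda - 1}{\lambda}\) for \(\lambda \neq 0\), and \(y(\lambda) = \log(x)\) for \(\lambda = 0\), where \(\lambda\) is typically chosen by maximum likelihood so that the transformed values are as close to Gaussian as possible (note that x must be strictly positive).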

Create a data set:

mydata= data.frame(x=c(0.2, 0.528, 0.11, 0.260, 0.091, 1.314, 1.52, 0.244, 1.981, 0.273,
            0.461, 0.366, 1.407, 0.79, 2.266))
str(mydata)
## 'data.frame':    15 obs. of  1 variable:
##  $ x: num  0.2 0.528 0.11 0.26 0.091 ...
summary(mydata)
##        x         
##  Min.   :0.0910  
##  1st Qu.:0.2520  
##  Median :0.4610  
##  Mean   :0.7874  
##  3rd Qu.:1.3605  
##  Max.   :2.2660

According to the summary statistics, the mean is greater than the median, which suggests that the distribution may be right-skewed.

REMEMBER:

Determine the skewness with the plot:

library(ggplot2)
ggplot(mydata, aes(x=x)) + geom_density()

It has a right-skewed distribution. To be sure, check the skewness.

library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
skewness_value = skewness(mydata$x)
skewness_value
## [1] 0.7613376

The value is positive (skewness > 0), so we can be confident that x has a positively (right-) skewed distribution.
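For reference, the moment-based sample skewness is \(g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}\); e1071::skewness() computes a version of this quantity (the exact small-sample adjustment depends on its type argument). Positive values indicate right skew, negative values left skew.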

What about normality? Does x follow normal distribution?

To check normality, we can examine the QQ-plot. If the points fall along the straight reference line, the data are consistent with normality.

ggplot(mydata, aes(sample = x))+ stat_qq() + stat_qq_line()

#OR


qqnorm(mydata$x)
qqline(mydata$x)

In the first plot, the x-axis represents the theoretical quantiles while the y-axis shows the sample quantiles. Since the points deviate from the reference line, normality appears to be violated.

To be sure, let’s run a formal normality test:

shapiro.test(mydata$x)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$x
## W = 0.84522, p-value = 0.01487

REMEMBER:

\(H_0:\) Sample distribution is normal. \(H_A:\) Sample distribution is not normal.

Since the p-value is smaller than 0.05, we reject \(H_0\) and conclude that the data do not satisfy normality.

Apply the transformation to make it more Gaussian:

library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
trf=boxcox(lm(mydata$x~1))

The two vertical lines mark a 95% confidence interval for \(\lambda\); since this interval includes 0, we can apply the log transformation.

REMEMBER:

To find the best lambda:

lambda = trf$x[which.max(trf$y)]
lambda
## [1] 0.02020202

Guide with respect to lambda:

The figure below shows common values of lambda and the corresponding transformation methods.
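A commonly cited version of this guide (the original figure may differ slightly in its cut-offs) is:

  • \(\lambda = -2\): \(1/x^2\)

  • \(\lambda = -1\): \(1/x\) (reciprocal)

  • \(\lambda = -0.5\): \(1/\sqrt{x}\)

  • \(\lambda = 0\): \(\log(x)\)

  • \(\lambda = 0.5\): \(\sqrt{x}\)

  • \(\lambda = 1\): no transformation

  • \(\lambda = 2\): \(x^2\)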

Let’s apply the log transformation:

mydata$log_x=log(mydata$x)

Use a formal test to check normality:

shapiro.test(mydata$log_x)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$log_x
## W = 0.94531, p-value = 0.4538

After the transformation, we fail to reject normality since the p-value is greater than 0.05; the data can now be treated as normally distributed.

It is nice to see the difference between the original and transformed x values.

require(gridExtra)
original=ggplot(mydata, aes(sample = x))+ stat_qq() + stat_qq_line()
transformed=ggplot(mydata, aes(sample = log_x))+ stat_qq() + stat_qq_line()
grid.arrange(original, transformed, ncol=2)

If the estimated lambda does not correspond to one of the transformations in the guide above, we can still transform the values directly using the rule \(\frac{x^\lambda - 1}{\lambda}\):

lambda = trf$x[which.max(trf$y)] #Determine the exact lambda
trnsfrmed_x = (mydata$x ^ lambda - 1) / lambda
trnsfrmed_x
##  [1] -1.5835546 -0.6345566 -2.1587856 -1.3289094 -2.3397898  0.2738305
##  [7]  0.4204862 -1.3906781  0.6883439 -1.2814057 -0.7683318 -0.9949859
## [13]  0.3426402 -0.2351620  0.8248126

CAUTION!: You have to provide summary statistics, visualizations, etc. for the transformed values as well!

Transformation via “bestNormalize”:

Create a data set from an exponential distribution (a gamma distribution with shape parameter 1 and rate 5):

set.seed(1)
data2 = data.frame(x=rgamma(100,1,5))

Check the QQ-plot for normality:

ggplot(data2, aes(sample = x))+ stat_qq() + stat_qq_line()

It is obvious that the data does not follow a normal distribution.

shapiro.test(data2$x)
## 
##  Shapiro-Wilk normality test
## 
## data:  data2$x
## W = 0.83634, p-value = 3.831e-09

Since p-value is smaller than 0.05, we can reject \(H_0\). Therefore, we conclude that the data is not normally distributed.

library(bestNormalize)
## Warning: package 'bestNormalize' was built under R version 4.3.3
## 
## Attaching package: 'bestNormalize'
## The following object is masked from 'package:MASS':
## 
##     boxcox
best_x= bestNormalize(data2$x)
best_xdata=predict(best_x, newdata = data2$x)  #transform the original values
shapiro.test(best_xdata)
## 
##  Shapiro-Wilk normality test
## 
## data:  best_xdata
## W = 0.98923, p-value = 0.6027

The Shapiro-Wilk test indicates that the transformed data can now be treated as normally distributed (p-value > 0.05).
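To see which transformation bestNormalize selected (the exact choice depends on the package version and its internal cross-validation), simply print the fitted object:

best_x   # prints the candidate methods that were compared and the transformation that was chosen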

Missingness

Missing data refers to the absence of values in a dataset where information is expected or required.

Handling missing data is essential in data analysis because it can affect the validity and reliability of statistical conclusions drawn from the dataset. Ignoring missing data or improperly handling it can lead to biased results, reduced statistical power, and inaccurate conclusions.

There are several approaches to dealing with missing data, including:

  • Complete Case Analysis: Discarding observations with missing values.
  • Imputation: Replacing missing values with estimated values based on the available data.
  • Model-based Methods: Incorporating missing data mechanisms into statistical models.

Missing Data Mechanism:

Missing Completely at Random (MCAR): In this mechanism, the probability of a data point being missing is unrelated to both observed and unobserved data. Essentially, missingness is random and occurs independently of any other variables.

Missing at Random (MAR): In this mechanism, the probability of missingness may depend on observed data but not on unobserved data. MCAR is a special case of MAR. That is, if the data are MCAR, they are also MAR.

Missing Not at Random (MNAR): In this mechanism, missingness depends on unobserved data or the values of the missing variable itself.
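A small simulation (hypothetical variables, not part of the lab data) can make the distinction concrete:

set.seed(412)
n = 200
u = rnorm(n)                 # fully observed covariate
w = 2 * u + rnorm(n)         # variable that will receive missing values

w_mcar = w; w_mcar[runif(n) < 0.2] = NA         # MCAR: missingness unrelated to any variable
w_mar  = w; w_mar[runif(n) < plogis(u)] = NA    # MAR: missingness depends only on the observed u
w_mnar = w; w_mnar[runif(n) < plogis(w)] = NA   # MNAR: missingness depends on w itself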

Handling missingness

Read the data into R.

iris = read.table("iris_mis.txt",sep=",")
colnames(iris)=c("var1","var2","var3","var4","var5")
dim(iris)
## [1] 150   5

The original data includes 150 observations with 5 variables.

sum(is.na(iris))
## [1] 42

There is a total of 42 missing values in our data set.

colSums(is.na(iris))
## var1 var2 var3 var4 var5 
##   12   13   10    7    0

The variables have 12, 13, 10, 7, and 0 missing values, respectively.

which(is.na(iris)) #see index
##  [1]  31  32  33  53  64  67  68  88  93 121 138 148 181 182 183 203 214 217 218
## [20] 224 238 269 279 281 298 309 320 323 334 354 410 417 427 442 450 554 557 582
## [39] 583 593 594 595

Note that these indices count down the columns of the data frame, so the first three missing values correspond to rows 31, 32, and 33 of var1.
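To get the positions as row/column pairs instead (is.na() on a data frame returns a matrix, so arr.ind works here):

head(which(is.na(iris), arr.ind = TRUE))   # row and column of each missing cell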

Row-wise Deletion (List-wise Deletion)

#create new data frame that only contains rows with no missing values
iris_complete = iris[complete.cases(iris), ]   #known as complete case
head(iris,10)
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 NA 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
head(iris_complete,10)
var1 var2 var3 var4 var5
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
11 5.4 3.7 1.5 0.2 Iris-setosa
dim(iris_complete)
## [1] 117   5
#OR

iris_withoutna=na.omit(iris)
head(iris_withoutna,10)
var1 var2 var3 var4 var5
1 5.1 3.5 1.4 0.2 Iris-setosa
2 4.9 3.0 1.4 0.2 Iris-setosa
3 4.7 3.2 1.3 0.2 Iris-setosa
4 4.6 3.1 1.5 0.2 Iris-setosa
5 5.0 3.6 1.4 0.2 Iris-setosa
6 5.4 3.9 1.7 0.4 Iris-setosa
7 4.6 3.4 1.4 0.3 Iris-setosa
8 5.0 3.4 1.5 0.2 Iris-setosa
10 4.9 3.1 1.5 0.1 Iris-setosa
11 5.4 3.7 1.5 0.2 Iris-setosa
dim(iris_withoutna)
## [1] 117   5

Pairwise deletion

average = colMeans(iris[,1:4], na.rm = TRUE) # na.rm = TRUE skips the missing values when computing each column mean
average
##     var1     var2     var3     var4 
## 5.831884 3.062044 3.770714 1.156643
covar = cov(iris[,1:4], use = "pairwise") # use = "pairwise" computes each covariance from all complete pairs of observations
covar
##             var1        var2       var3       var4
## var1  0.70904898 -0.02503759  1.2924348  0.5240763
## var2 -0.02503759  0.18869579 -0.2886689 -0.1182260
## var3  1.29243479 -0.28866892  3.0701434  1.2488722
## var4  0.52407634 -0.11822600  1.2488722  0.5695154

Variable Deletion

If too much data is missing for a variable (say, more than 60%), deleting that variable (column) from the data set may be an option. However, this causes the loss of a significant amount of information. (A programmatic way to flag such columns is sketched after the code below.)

iris2 = read.table("iris_mis2.txt",sep=",")  #there is a small change between "iris" and "iris2"
dim(iris2)
## [1] 150   5
sum(is.na(iris2))
## [1] 132
colSums(is.na(iris2))/nrow(iris2)
##         V1         V2         V3         V4         V5 
## 0.08000000 0.08666667 0.06666667 0.64666667 0.00000000
iris2_dropvar=iris2[,-4]  #V4 is missing for almost 65% of observations, so drop it
sum(is.na(iris2_dropvar))
## [1] 35
iris2_withoutna=na.omit(iris2_dropvar)
dim(iris2_withoutna) 
## [1] 124   4
##what if we prefer to apply list-wise deletion...
iris2_withoutna2=na.omit(iris2)
dim(iris2_withoutna2)
## [1] 30  5
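A programmatic way to flag columns above a chosen threshold (a sketch using iris2 from above; the 60% cutoff is just a convention):

prop_missing = colMeans(is.na(iris2))   # proportion of missing values per column
names(which(prop_missing > 0.6))        # columns to consider dropping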

Imputation

First, examine the missingness pattern. The most commonly used R package for this is “mice”.

library(mice)
## Warning: package 'mice' was built under R version 4.3.3
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
md.pattern(iris)

##     var5 var4 var3 var1 var2   
## 117    1    1    1    1    1  0
## 4      1    1    1    1    0  1
## 3      1    1    1    0    1  1
## 9      1    1    1    0    0  2
## 10     1    1    0    1    1  1
## 7      1    0    1    1    1  1
##        0    7   10   12   13 42

Each row represents a different pattern of missingness and each column represents a variable in the data set; the first column gives the number of rows with that pattern, the last column gives the number of variables missing in that pattern, and the bottom row gives the number of missing values per variable.

This information can be useful for understanding the nature and extent of missing data in our data set and guiding the imputation process if necessary.

md.pairs(iris)$mm  #number of cells where both variables of each pair are missing; the diagonal gives each variable's own missing count
##      var1 var2 var3 var4 var5
## var1   12    9    0    0    0
## var2    9   13    0    0    0
## var3    0    0   10    0    0
## var4    0    0    0    7    0
## var5    0    0    0    0    0

Single Imputation (mean/median imputation)

#apply mean imputation:
iris_copy=iris
iris_copy$var1[is.na(iris_copy$var1)] = round(mean(iris_copy$var1,na.rm=T))
iris_copy$var2[is.na(iris_copy$var2)] = round(mean(iris_copy$var2,na.rm=T))
iris_copy$var3[is.na(iris_copy$var3)] = round(mean(iris_copy$var3,na.rm=T))
iris_copy$var4[is.na(iris_copy$var4)] = round(mean(iris_copy$var4,na.rm=T))

summary(iris_copy)
##       var1            var2            var3            var4      
##  Min.   :4.300   Min.   :2.000   Min.   :1.100   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.900   Median :3.000   Median :4.200   Median :1.300  
##  Mean   :5.845   Mean   :3.057   Mean   :3.786   Mean   :1.149  
##  3rd Qu.:6.375   3rd Qu.:3.300   3rd Qu.:5.075   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##      var5          
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
head(iris,25)  #9th, 20th, 23rd obs in var 3 are missing
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 NA 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 NA 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 NA 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
round(mean(iris_copy$var3,na.rm=T)) #what is the mean value of var3?
## [1] 4
head(iris_copy,25) #Check whether 9th, 20th, 23rd obs in var 3 are the value of 4.
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 4.0 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 4.0 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 4.0 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
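
The heading above also mentions median imputation; a minimal sketch of the same idea using the median (shown for var1 only, the other columns would be handled analogously):

iris_median = iris
iris_median$var1[is.na(iris_median$var1)] = median(iris_median$var1, na.rm = TRUE)
sum(is.na(iris_median$var1))   # no missing values remain in var1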

LOCF (Last Observation Carried Forward)

It is a method used to impute missing values by carrying forward the last observed value. It is commonly used in longitudinal studies or time-series data where observations are collected over time.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
iris_locf=na.locf(iris)  

sum(is.na(iris_locf))
## [1] 0
head(iris,25)
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 NA 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 NA 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 NA 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
head(iris_locf,25)  #see 23rd obs
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.5 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 1.7 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 1.5 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa

NOCB (Next Observation Carried Backward)

Missing values are filled by carrying backward the next observed value.

iris_nocb=na.locf(iris,fromLast=TRUE) 


head(iris,25)
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 NA 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 NA 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 NA 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
head(iris_nocb,25) #23rd obs
var1 var2 var3 var4 var5
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.5 0.2 Iris-setosa
4.9 3.1 1.5 0.1 Iris-setosa
5.4 3.7 1.5 0.2 Iris-setosa
4.8 3.4 1.6 0.2 Iris-setosa
4.8 3.0 1.4 0.1 Iris-setosa
4.3 3.0 1.1 0.1 Iris-setosa
5.8 4.0 1.2 0.2 Iris-setosa
5.7 4.4 1.5 0.4 Iris-setosa
5.4 3.9 1.3 0.4 Iris-setosa
5.1 3.5 1.4 0.3 Iris-setosa
5.7 3.8 1.7 0.3 Iris-setosa
5.1 3.8 1.7 0.3 Iris-setosa
5.4 3.4 1.7 0.2 Iris-setosa
5.1 3.7 1.5 0.4 Iris-setosa
4.6 3.6 1.7 0.2 Iris-setosa
5.1 3.3 1.7 0.5 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa

Regression Imputation

Here’s how regression imputation works:

  1. Identify the variable with missing values that you want to impute (the dependent variable).

  2. Select a set of predictor variables (independent variables) that are strongly correlated with the dependent variable and do not contain missing values.

  3. Fit a regression model using the observations where the dependent variable is not missing.

  4. Use the fitted regression model to predict the missing values of the dependent variable from the values of the predictor variables.

  5. Replace the missing values with the predicted values.

iris3 = read.table("iris_mis3.txt",sep=",")  #there is a small change between "iris" and "iris3"
dim(iris3)
## [1] 150   5
sum(is.na(iris3))
## [1] 35
colSums(is.na(iris3)) #no missing value in var4
## V1 V2 V3 V4 V5 
## 12 13 10  0  0
library(mice)
iris_reg=iris3[,c(2,4)]  #keep only var2 (the variable with the most missing values) and var4 (no missing values)
colnames(iris_reg)=c("var2","var4")
fit = lm(var2 ~ var4, data = iris_reg)  #let's say var2 is dep. var. while var4 is indep.
pred = predict(fit, newdata = ic(iris_reg))   #ic(): extracts incomplete cases from a data set.
pred
##       31       32       33       53       64       67       68       74 
## 3.229337 3.191918 3.248047 2.986113 3.004823 2.986113 3.079661 3.042242 
##       88      119      129      131      148 
## 3.023532 2.836437 2.873856 2.911275 2.892565
iris_reg$var2[as.numeric(names(pred))]=pred
head(iris_reg)
var2 var4
3.5 0.2
3.0 0.2
3.2 0.2
3.1 0.2
3.6 0.2
3.9 0.4
sum(is.na(iris_reg))
## [1] 0
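
Finally, the mice package can also perform multiple imputation directly. A minimal sketch (not part of the output above; the number of imputations and the seed are illustrative choices, and only the numeric columns are used here):

imp = mice(iris3[,1:4], m = 5, seed = 412, printFlag = FALSE)  # 5 imputed data sets (default method for numeric data is predictive mean matching)
iris3_complete1 = complete(imp, 1)   # extract the first completed data set
sum(is.na(iris3_complete1))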

References: https://machinelearningmastery.com/power-transforms-with-scikit-learn/

Ref for figure: https://ledidi.com/academy/measures-of-central-tendency-mean-median-and-mode