By Ebenezer Akpati

July 6, 2023

Introduction

A clean, consistent dataset helps the analyst reach more accurate results. Two major threats to accuracy are outliers and missing values that are not handled well. Pre-processing is therefore a crucial step in any analysis, and a focal point for any analyst who wants accurate insight from the data.

A statement attributed to Donald Rubin reads: “Missing data are just another example of outliers.” It highlights the connection between missing data and outliers: missing data are viewed as a form of outlier, observations that differ from the complete data. On this view, missing data should be treated as a distinct type of outlier in data analysis.

Tukey’s rule

Tukey’s rule says that outliers are values more than 1.5 times the interquartile range (IQR) beyond the quartiles: either below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
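
As a minimal sketch of the rule, the fences can be computed directly in base R. The vector below is made-up toy data, not values from the pima dataset, and the variable names are only illustrative.

```r
# Toy data (hypothetical values, not taken from the pima dataset)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# Quartiles and interquartile range
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Tukey's fences
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
tukey_outliers <- x[x < lower_fence | x > upper_fence]
tukey_outliers
```

With these toy values the fences are -2 and 14, so only the value 30 is flagged.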

I shall look at two methods in this article: bagImpute, which handles missing values, and imputate_outlier, which handles outliers. The choice between the “bagImpute” method and the “imputate_outlier” method (specifically using “capping” as the imputation approach) depends on the nature of your data and the specific requirements of your research.

The “bagImpute” method, from the caret package, is a more comprehensive approach that takes into account the overall patterns and relationships in the data to impute missing values. For each variable it fits bagged trees (trees built on bootstrap samples) using the other variables as predictors, and uses their predictions to fill in the missing entries. This method is generally more suitable when you have a larger dataset and want the imputed values to reflect the entire dataset.

On the other hand, “imputate_outlier” with “capping” is a simpler approach that specifically focuses on handling outliers by replacing them with values within a predefined range. This method is useful when you have identified outliers in your data and want to replace them with more reasonable values. It is particularly suitable when you have a small number of outliers and want a quick and straightforward way to address them.

Data Source:

Loading the faraway package with library(faraway) makes the data used in this analysis available, and data(pima) loads this particular dataset. The pima dataset is not built into base R; it ships with the faraway package, which provides datasets used in Julian Faraway's books, including “Faraway, J.J. (2006). Extending the Linear Model with R”. It contains health measurements for 768 women together with the result of a diabetes test.

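Loading the required packages

library(faraway)   # provides the pima dataset
library(tidyr)
library(ggplot2)   # plotting
library(caret)     # preProcess() with method = "bagImpute"
data(pima)
library(reshape2)  # melt()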

Attaching package: ‘reshape2’

The following object is masked from ‘package:tidyr’:

    smiths

Summary information of the dataset

summary() is a quick way to get the usual univariate summary information. At this stage, we are looking for anything unusual or unexpected, perhaps indicating a data entry error. Five variables (glucose, diastolic, triceps, insulin and bmi) have a minimum value of zero where zero is not plausible. Looking at what these variables represent, something must be wrong: nobody can have a blood pressure of zero or a BMI of zero.

summary(pima)
    pregnant         glucose        diastolic         triceps         insulin     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0  
      bmi           diabetes           age             test      
 Min.   : 0.00   Min.   :0.0780   Min.   :21.00   Min.   :0.000  
 1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   1st Qu.:0.000  
 Median :32.00   Median :0.3725   Median :29.00   Median :0.000  
 Mean   :31.99   Mean   :0.4719   Mean   :33.24   Mean   :0.349  
 3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00   3rd Qu.:1.000  
 Max.   :67.10   Max.   :2.4200   Max.   :81.00   Max.   :1.000  
dim(pima)
[1] 768   9

Sorting the BMI column

We see that the first 11 values are zero. The description that comes with the data says nothing about it, but it seems likely that zero has been used as a missing value code. For one reason or another, the researchers did not obtain the bmi of 11 patients. In a real investigation, one would likely be able to ask the researchers what really happened, because a bmi reading cannot be zero. Nevertheless, this does illustrate the kind of misunderstanding a data analyst encounters.

#sort(pima$bmi)
sorted_bmi <- sort(pima$bmi)
sorted_bmi
  [1]  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 18.2 18.2 18.2 18.4 19.1 19.3 19.4
 [19] 19.5 19.5 19.6 19.6 19.6 19.9 20.0 20.1 20.4 20.4 20.8 20.8 21.0 21.0 21.1 21.1 21.1 21.1
 [37] 21.2 21.7 21.8 21.8 21.8 21.8 21.8 21.9 21.9 21.9 22.1 22.1 22.2 22.2 22.3 22.4 22.4 22.5
 [55] 22.5 22.5 22.6 22.6 22.7 22.9 22.9 23.0 23.0 23.1 23.1 23.1 23.1 23.2 23.2 23.2 23.3 23.3
 [73] 23.4 23.5 23.5 23.5 23.6 23.6 23.6 23.7 23.7 23.8 23.8 23.9 23.9 24.0 24.0 24.0 24.0 24.1
 [91] 24.2 24.2 24.2 24.2 24.2 24.2 24.3 24.3 24.3 24.3 24.4 24.4 24.4 24.5 24.6 24.6 24.6 24.6
[109] 24.7 24.7 24.7 24.7 24.7 24.8 24.8 24.8 24.9 25.0 25.0 25.0 25.0 25.0 25.0 25.1 25.1 25.1
[127] 25.2 25.2 25.2 25.2 25.2 25.2 25.3 25.3 25.4 25.4 25.4 25.4 25.5 25.5 25.6 25.6 25.6 25.6
[145] 25.6 25.6 25.8 25.8 25.9 25.9 25.9 25.9 25.9 25.9 25.9 26.0 26.0 26.0 26.0 26.1 26.1 26.1
[163] 26.2 26.2 26.2 26.2 26.3 26.4 26.4 26.4 26.5 26.5 26.5 26.6 26.6 26.6 26.6 26.7 26.8 26.8
[181] 26.8 26.8 26.9 27.0 27.0 27.1 27.1 27.1 27.2 27.2 27.3 27.3 27.3 27.3 27.4 27.4 27.4 27.4
[199] 27.4 27.5 27.5 27.5 27.5 27.5 27.6 27.6 27.6 27.6 27.6 27.6 27.6 27.7 27.7 27.7 27.7 27.8
[217] 27.8 27.8 27.8 27.8 27.8 27.8 27.9 27.9 28.0 28.0 28.0 28.0 28.0 28.1 28.2 28.2 28.3 28.3
[235] 28.4 28.4 28.4 28.4 28.4 28.4 28.5 28.5 28.5 28.6 28.6 28.7 28.7 28.7 28.7 28.7 28.7 28.7
[253] 28.8 28.8 28.9 28.9 28.9 28.9 28.9 28.9 29.0 29.0 29.0 29.0 29.0 29.2 29.3 29.3 29.3 29.3
[271] 29.3 29.5 29.5 29.5 29.5 29.5 29.6 29.6 29.6 29.6 29.7 29.7 29.7 29.7 29.7 29.7 29.7 29.7
[289] 29.8 29.8 29.8 29.9 29.9 29.9 29.9 29.9 30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.1 30.1 30.1
[307] 30.1 30.1 30.1 30.1 30.1 30.1 30.2 30.3 30.4 30.4 30.4 30.4 30.4 30.4 30.4 30.5 30.5 30.5
[325] 30.5 30.5 30.5 30.5 30.7 30.8 30.8 30.8 30.8 30.8 30.8 30.8 30.8 30.8 30.9 30.9 30.9 30.9
[343] 30.9 31.0 31.0 31.1 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.2 31.3 31.6
[361] 31.6 31.6 31.6 31.6 31.6 31.6 31.6 31.6 31.6 31.6 31.6 31.9 31.9 32.0 32.0 32.0 32.0 32.0
[379] 32.0 32.0 32.0 32.0 32.0 32.0 32.0 32.0 32.1 32.2 32.3 32.3 32.3 32.4 32.4 32.4 32.4 32.4
[397] 32.4 32.4 32.4 32.4 32.4 32.5 32.5 32.5 32.5 32.5 32.5 32.6 32.7 32.7 32.7 32.8 32.8 32.8
[415] 32.8 32.8 32.8 32.8 32.8 32.8 32.9 32.9 32.9 32.9 32.9 32.9 32.9 32.9 32.9 33.1 33.1 33.1
[433] 33.2 33.2 33.2 33.2 33.2 33.2 33.2 33.3 33.3 33.3 33.3 33.3 33.3 33.3 33.3 33.3 33.3 33.5
[451] 33.6 33.6 33.6 33.6 33.6 33.6 33.6 33.6 33.7 33.7 33.7 33.7 33.7 33.8 33.8 33.8 33.8 33.8
[469] 33.9 33.9 34.0 34.0 34.0 34.0 34.0 34.0 34.1 34.1 34.1 34.1 34.2 34.2 34.2 34.2 34.2 34.2
[487] 34.2 34.2 34.3 34.3 34.3 34.3 34.3 34.3 34.4 34.4 34.4 34.4 34.5 34.5 34.5 34.5 34.5 34.6
[505] 34.6 34.6 34.6 34.6 34.7 34.7 34.7 34.7 34.8 34.8 34.9 34.9 34.9 34.9 34.9 34.9 35.0 35.0
[523] 35.0 35.0 35.1 35.1 35.1 35.2 35.2 35.3 35.3 35.3 35.3 35.4 35.4 35.4 35.4 35.5 35.5 35.5
[541] 35.5 35.5 35.5 35.5 35.6 35.6 35.7 35.7 35.7 35.7 35.8 35.8 35.8 35.8 35.8 35.9 35.9 35.9
[559] 35.9 35.9 36.0 36.0 36.1 36.1 36.1 36.2 36.3 36.3 36.3 36.4 36.4 36.5 36.5 36.5 36.5 36.6
[577] 36.6 36.6 36.6 36.6 36.7 36.8 36.8 36.8 36.8 36.8 36.8 36.9 36.9 36.9 37.0 37.1 37.1 37.2
[595] 37.2 37.2 37.2 37.3 37.4 37.4 37.4 37.5 37.5 37.6 37.6 37.6 37.6 37.6 37.7 37.7 37.7 37.7
[613] 37.7 37.8 37.8 37.8 37.9 37.9 38.0 38.0 38.1 38.1 38.1 38.2 38.2 38.2 38.2 38.3 38.4 38.4
[631] 38.5 38.5 38.5 38.5 38.5 38.5 38.6 38.7 38.7 38.7 38.8 38.9 39.0 39.0 39.0 39.0 39.1 39.1
[649] 39.1 39.1 39.2 39.2 39.3 39.4 39.4 39.4 39.4 39.4 39.4 39.4 39.5 39.5 39.5 39.6 39.7 39.8
[667] 39.8 39.9 39.9 39.9 40.0 40.0 40.1 40.2 40.5 40.5 40.5 40.6 40.6 40.6 40.6 40.7 40.8 40.9
[685] 40.9 41.0 41.2 41.3 41.3 41.3 41.5 41.5 41.8 42.0 42.1 42.1 42.2 42.3 42.3 42.3 42.4 42.4
[703] 42.4 42.6 42.7 42.7 42.8 42.9 42.9 42.9 42.9 43.1 43.2 43.3 43.3 43.3 43.3 43.3 43.4 43.4
[721] 43.5 43.5 43.6 43.6 44.0 44.0 44.1 44.2 44.2 44.5 44.5 44.6 45.0 45.2 45.3 45.3 45.3 45.4
[739] 45.5 45.6 45.6 45.7 45.8 46.1 46.1 46.2 46.2 46.3 46.5 46.7 46.8 46.8 47.9 47.9 48.3 48.8
[757] 49.3 49.6 49.7 50.0 52.3 52.3 52.9 53.2 55.0 57.3 59.4 67.1

A careless statistician might overlook these presumed missing values and complete an analysis assuming that these were real observed zeroes. If the error was later discovered, they might then blame the researchers for using 0 as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it in their data description. Unfortunately such oversights are not uncommon, particularly with datasets of any size or complexity. The statistician bears some share of responsibility for spotting these mistakes. We set all zero values of the five variables to NA, which is the missing value code used by R. (Julian J. Faraway, July 2002)

pima$diastolic[pima$diastolic == 0] <- NA
pima$glucose[pima$glucose == 0] <- NA
pima$triceps[pima$triceps == 0] <- NA
pima$insulin[pima$insulin == 0] <- NA
pima$bmi[pima$bmi == 0] <- NA
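
The same recoding can be written as a loop and then checked by counting the NA values per column. This is only a sketch on a made-up mini data frame; the column list `zero_as_na` is an assumption for the illustration and deliberately leaves out a column where zero is a valid value.

```r
# Hypothetical mini data frame standing in for pima
toy <- data.frame(
  glucose   = c(148, 0, 183, 89),
  diastolic = c(72, 66, 0, 0),
  pregnant  = c(6, 1, 0, 0)   # zero pregnancies is a valid value
)

# Columns where zero is treated as a missing-value code (assumed list)
zero_as_na <- c("glucose", "diastolic")

for (col in zero_as_na) {
  toy[[col]][toy[[col]] == 0] <- NA
}

# Count the missing values per column to confirm the recoding
na_counts <- colSums(is.na(toy))
na_counts
```

Counting the NA values right after recoding is a cheap check that only the intended columns were changed.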

Imputing missing values

We can choose any approach to impute the missing data. There are packages like mice and caret, for example, that can handle this for you. Before imputing, it helps to look at the missingness: in the visdat package, vis_dat() visualizes the entire dataset, while vis_miss() shows only the missing values.

library(dlookr)

Attaching package: ‘dlookr’

The following object is masked from ‘package:tidyr’:

    extract

The following object is masked from ‘package:base’:

    transform
plot_na_pareto(pima, only_na = TRUE)

plot_na_pareto() is from the dlookr package. It ranks the variables by their share of missing data and grades how tolerable that share is for an analysis: up to 10% is OK and up to 20% is not bad, but 21% to 50% is bad. By this guide, the missing records for insulin and triceps are in the bad range. You can also use the vis_dat() and vis_miss() functions to see the state of the missing data in your dataset; note that you will have to install the visdat package. (missRanger(), from the missRanger package, is an imputation function rather than a visualization tool.)

library(visdat)
vis_miss(pima)

Removing Missing Values:

Pros: Removing missing values can simplify the dataset and eliminate potential bias introduced by imputation methods. It can also make certain analyses or models easier to implement.

Cons: Removing missing values can result in a reduction of sample size, potentially leading to loss of information and statistical power. It may also introduce bias if the missing values are not missing completely at random (MCAR).

Replacing Missing Values:

Pros: Replacing missing values allows you to retain the complete dataset and avoid sample size reduction. Imputation techniques can help preserve statistical power and reduce bias when missing values are not MCAR.

Cons: Imputation introduces uncertainty and potential bias depending on the chosen imputation method. The imputed values may not accurately reflect the true missing values, leading to distorted results. Imputation methods can also be sensitive to the specific characteristics of the dataset.

The choice between removing or replacing missing values depends on various factors such as the nature and extent of missingness, the analysis objectives, the underlying assumptions of the data, and the specific techniques available for imputation. It is essential to carefully consider the potential effects and limitations of each approach before making a decision.
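
The trade-off can be seen on a small example. The sketch below uses a made-up data frame and a deliberately simple median imputation; it is not the method used later in this article, only an illustration of how removal shrinks the sample while replacement keeps every row.

```r
# Hypothetical data with missing values in both columns
toy <- data.frame(
  bmi     = c(33.6, NA, 23.3, 28.1, 43.1, NA),
  glucose = c(148, 85, NA, 89, 137, 116)
)

# Option 1: remove every row that has at least one NA
removed <- na.omit(toy)
nrow(removed)

# Option 2: replace each NA with the column median (simple single imputation)
replaced <- toy
for (col in names(replaced)) {
  med <- median(replaced[[col]], na.rm = TRUE)
  replaced[[col]][is.na(replaced[[col]])] <- med
}
nrow(replaced)
sum(is.na(replaced))
```

Here removal keeps only 3 of the 6 rows, while replacement keeps all 6 at the cost of inserting values that were never observed.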

BagImpute Method of Handling Missing Values

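# Impute on the eight measurement columns only; the test result (column 9) is left out
sub_pima <- pima[1:8]
# Fit the bagImpute model from caret (bagging is random, so call set.seed() first for reproducible results)
pre_proc <- preProcess(sub_pima, method = "bagImpute")
# Predict the missing values
train_pima <- predict(pre_proc, sub_pima)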

# Include the 9th column back into the dataset
imputed_pima <- cbind(train_pima, pima[9])

# Assign the imputed dataset back to the original variable
pima <- imputed_pima
# Check if there are any missing values in pima
sum(is.na(pima))
[1] 0
dim(pima)
[1] 768   9

No data has been lost: the dataset still has 768 rows and 9 columns.

The variable ‘test’ is not quantitative but categorical. Such variables are also called factors. However, because of the numerical coding, this variable has been treated as if it were quantitative. It’s best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as “average zip code”. (Julian J. Faraway, July 2002)

pima$test <- factor(pima$test)
summary(pima$test)
  0   1 
500 268 

We now see that 500 cases were negative and 268 positive. Even better is to use descriptive labels:

levels(pima$test) <- c("negative","positive")
summary(pima)
    pregnant         glucose        diastolic         triceps         insulin     
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00   Min.   : 14.0  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00   1st Qu.: 87.0  
 Median : 3.000   Median :117.0   Median : 72.00   Median :28.74   Median :135.3  
 Mean   : 3.845   Mean   :121.7   Mean   : 72.32   Mean   :28.84   Mean   :155.4  
 3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:35.11   3rd Qu.:191.2  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0  
      bmi           diabetes           age              test    
 Min.   :18.20   Min.   :0.0780   Min.   :21.00   negative:500  
 1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00   positive:268  
 Median :32.26   Median :0.3725   Median :29.00                 
 Mean   :32.46   Mean   :0.4719   Mean   :33.24                 
 3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00                 
 Max.   :67.10   Max.   :2.4200   Max.   :81.00                 

Nonparametric test

Nonparametric tests, also known as distribution-free tests, are statistical tests that do not assume a particular distribution, such as the normal, for the data. These tests are useful when the data do not meet the assumptions required by parametric tests, such as normality or equal variances. In this analysis I will use bmi and diastolic to assess the normality of the pima dataset.

# Plot histogram with density
ggplot(data = pima, aes(x = bmi, y = after_stat(density))) +
  geom_histogram(colour = "black", bins = 25) +
  geom_density(colour = "blue", linewidth = 1.2)


# Plot histogram with density
ggplot(data = pima, aes(x = diastolic, y = after_stat(density))) +
  geom_histogram(colour = "black", bins = 25) +
  geom_density(colour = "blue", linewidth = 1.2)

In terms of normality, diastolic appears to be closer to a normal distribution than bmi. To check this further, the next two plots overlay a normal curve (in red) that has the same mean and standard deviation as each variable, so the data can be compared directly with the normal distribution they would follow.


# Plot histogram and density, with a normal curve (red) using the sample mean and sd
ggplot(data = pima, aes(x = diastolic)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25, colour = "black") +
  geom_density(colour = "blue", linewidth = 1.3) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(pima$diastolic), sd = sd(pima$diastolic)),
    colour = "red",
    linewidth = 1
  ) +
  labs(x = "Diastolic", y = "Density") +
  ggtitle("Histogram and Density of Diastolic") +
  theme_minimal()



# Plot histogram and density, with a normal curve (red) using the sample mean and sd
ggplot(data = pima, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25, colour = "black") +
  geom_density(colour = "blue", linewidth = 1.3) +
  stat_function(
    fun = dnorm,
    args = list(mean = mean(pima$bmi), sd = sd(pima$bmi)),
    colour = "red",
    linewidth = 1
  ) +
  labs(x = "BMI", y = "Density") +
  ggtitle("Histogram and Density of BMI") +
  theme_minimal()

Quantitative Normality

Quantitative normality refers to the assumption or condition that a quantitative variable follows a normal distribution. The normal distribution, also known as the Gaussian distribution or bell curve, is a symmetric probability distribution characterized by its bell-shaped curve.

When a quantitative variable follows a normal distribution, it exhibits specific properties and characteristics. These include:

Symmetry: The distribution is symmetric around its mean, with equal probabilities of values occurring on either side of the mean.

Unimodality: The distribution has a single peak, indicating a central value around which the data cluster.

Spread set by the standard deviation: The width of the bell is determined entirely by the standard deviation, so values become rapidly less likely the further they lie from the mean.
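
These properties can be checked numerically with pnorm(), the cumulative distribution function of the normal distribution in base R. The short sketch below uses the standard normal (mean 0, standard deviation 1); the variable names are only illustrative.

```r
# Probability of falling within 1 and 2 standard deviations of the mean
within_1_sd <- pnorm(1) - pnorm(-1)
within_2_sd <- pnorm(2) - pnorm(-2)
round(c(within_1_sd, within_2_sd), 3)

# Symmetry: the lower tail below -1.5 equals the upper tail above 1.5
lower_tail <- pnorm(-1.5)
upper_tail <- 1 - pnorm(1.5)
all.equal(lower_tail, upper_tail)
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, which is a useful yardstick when judging the histograms above.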

library(summarytools)

Attaching package: ‘summarytools’

The following object is masked from ‘package:tibble’:

    view
descr(pima$bmi)
Descriptive Statistics  
pima$bmi  
N: 768  

                       bmi
----------------- --------
             Mean    32.46
          Std.Dev     6.88
              Min    18.20
               Q1    27.50
           Median    32.26
               Q3    36.60
              Max    67.10
              MAD     6.77
              IQR     9.10
               CV     0.21
         Skewness     0.59
      SE.Skewness     0.09
         Kurtosis     0.88
          N.Valid   768.00
        Pct.Valid   100.00
library(knitr)
# Create a subset of the pima dataset with the eight numeric columns
subset_data <- pima[, 1:8]

# Generate the descriptive statistics table using descr() from summarytools
# (named desc_table so that it does not mask base R's table() function)
desc_table <- descr(subset_data)

# Print the table
#kable(desc_table)
desc_table
Descriptive Statistics  
subset_data  
N: 768  

                       age      bmi   diabetes   diastolic   glucose   insulin   pregnant   triceps
----------------- -------- -------- ---------- ----------- --------- --------- ---------- ---------
             Mean    33.24    32.46       0.47       72.32    121.65    155.40       3.85     28.84
          Std.Dev    11.76     6.88       0.33       12.16     30.47     97.34       3.37      9.55
              Min    21.00    18.20       0.08       24.00     44.00     14.00       0.00      7.00
               Q1    24.00    27.50       0.24       64.00     99.00     87.00       1.00     22.00
           Median    29.00    32.26       0.37       72.00    117.00    135.30       3.00     28.74
               Q3    41.00    36.60       0.63       80.00    141.00    191.48       6.00     35.22
              Max    81.00    67.10       2.42      122.00    199.00    846.00      17.00     99.00
              MAD    10.38     6.77       0.25       11.86     29.65     78.58       2.97      9.99
              IQR    17.00     9.10       0.38       16.00     42.00    104.24       5.00     13.11
               CV     0.35     0.21       0.70        0.17      0.25      0.63       0.88      0.33
         Skewness     1.13     0.59       1.91        0.15      0.53      2.17       0.90      0.70
      SE.Skewness     0.09     0.09       0.09        0.09      0.09      0.09       0.09      0.09
         Kurtosis     0.62     0.88       5.53        1.00     -0.28      7.95       0.14      3.24
          N.Valid   768.00   768.00     768.00      768.00    768.00    768.00     768.00    768.00
        Pct.Valid   100.00   100.00     100.00      100.00    100.00    100.00     100.00    100.00

The table displays the descriptive statistics for the numeric variables in the pima dataset: age, bmi, diabetes, diastolic, glucose, insulin, pregnant, and triceps. N.Valid remains 768 in every column, which confirms that the NA values were taken care of by the “bagImpute” method from the caret package. Here is an explanation of some of the measures:

IQR: The interquartile range, calculated as the difference between the third and first quartiles.

CV: The coefficient of variation, which is the ratio of the standard deviation to the mean.

Skewness: A measure of the asymmetry of the distribution. Positive skewness indicates a longer tail on the right side.

SE.Skewness: The standard error of skewness.

Kurtosis: A measure of the “tailedness” of the distribution. Positive kurtosis indicates heavier tails compared to a normal distribution.

N.Valid: The number of valid (non-missing) values in each variable.

Pct.Valid: The percentage of valid values out of the total observations.

These statistics provide information about the central tendency, spread, skewness, kurtosis, and validity of the variables in the dataset.
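
To make these definitions concrete, here is a minimal sketch that computes some of them by hand on a made-up vector. The skewness and kurtosis lines use one common moment-based definition; it is an assumption that this matches summarytools exactly, since packages differ in their small-sample corrections.

```r
# Toy data with a long right tail (hypothetical values)
x <- c(18, 22, 25, 27, 30, 33, 36, 60)

# Interquartile range and coefficient of variation
iqr_x <- IQR(x)
cv_x <- sd(x) / mean(x)

# Moment-based skewness and excess kurtosis (one common definition)
n <- length(x)
m <- mean(x)
m2 <- sum((x - m)^2) / n
skew_x <- (sum((x - m)^3) / n) / m2^1.5
kurt_x <- (sum((x - m)^4) / n) / m2^2 - 3

round(c(IQR = iqr_x, CV = cv_x, Skewness = skew_x, Kurtosis = kurt_x), 2)
```

The single large value (60) pulls the right tail out, so the skewness comes out positive, as the definition above leads one to expect.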

Analyze the p-values of the Shapiro-Wilk normality test


# Perform the Shapiro-Wilk test on each numeric column
shapiro_results <- lapply(pima[1:8], shapiro.test)

# Extract p-values from the test results
p_values <- sapply(shapiro_results, function(x) x$p.value)

# Combine the column names and p-values into a data frame
result_df <- data.frame(Column = names(p_values),
                        p_value = p_values,
                        row.names = NULL)

# Print the result table using kable
#kable(result_df)
result_df

The p-values for the pregnant, glucose, diastolic, triceps, insulin, bmi, and age columns are all less than 0.05. This suggests that there is sufficient evidence to reject the null hypothesis of normality for these variables, indicating that they may not follow a normal distribution.
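
To show how such p-values are read, the sketch below runs the same test on two simulated samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The samples, the seed and the parameter values are assumptions made purely for illustration.

```r
set.seed(42)

# Simulated samples (hypothetical, not the pima data)
normal_sample <- rnorm(200, mean = 32, sd = 7)
skewed_sample <- rexp(200, rate = 1 / 32)

# Shapiro-Wilk test on each sample
tests <- lapply(list(normal = normal_sample, skewed = skewed_sample),
                shapiro.test)
p_values <- sapply(tests, function(res) res$p.value)

# A p-value below 0.05 leads to rejecting the hypothesis of normality
data.frame(sample = names(p_values),
           p_value = p_values,
           reject_normality = p_values < 0.05,
           row.names = NULL)
```

The skewed sample is rejected decisively. Bear in mind that with several hundred observations the test can also flag small departures from normality, so it is best read alongside the histograms.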

Multivariate normality is assessed in a similar way: when the p-values associated with the multivariate skewness and kurtosis statistics are both greater than 0.05, the data are assumed to follow a multivariate normal distribution (Korkmaz, Goksuluk, & Zararsiz, 2014, 2019).

Handling outliers in the dataset

We have already seen that the dataset is skewed, which is a pointer that it may contain outliers. I am going to plot one more graph to make this clearer, and then use two methods to handle them.

# Boxplot of bmi for each test result; points beyond the whiskers are drawn in red
ggplot(pima, mapping = aes(x = test, y = bmi, fill = test)) + 
  geom_boxplot(outlier.colour = "red", outlier.shape = 5, outlier.size = 4)

An outlier is an observation that significantly deviates from the other observations in a dataset. It is a value that is unusually large or small compared to the majority of the data points. Outliers can arise due to various reasons such as measurement errors, data entry errors, natural variation, or truly extreme values.

Outliers can have a significant impact on data analysis and statistical modeling. They can distort statistical measures such as the mean and standard deviation, as well as affect the results of certain statistical techniques. Therefore, it is important to identify and handle outliers appropriately based on the context of the analysis.
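
A small made-up example shows the size of this distortion. The sketch below compares the mean and standard deviation with the median and MAD, before and after a single extreme value is added; the numbers are hypothetical.

```r
# Hypothetical measurements, then the same data with one extreme value added
clean <- c(28, 30, 31, 32, 33, 34, 36)
with_outlier <- c(clean, 95)

# Helper returning both classical and robust summaries
summary_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), mad = mad(x))
}

comparison <- rbind(clean = summary_stats(clean),
                    with_outlier = summary_stats(with_outlier))
round(comparison, 2)
```

One extreme value moves the mean from 32 to almost 40 and inflates the standard deviation several times over, while the median only shifts from 32 to 32.5.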

Method one

# Select the eight numeric columns
selected_columns <- 1:8
pima_selected <- pima[, selected_columns]

# Outlier capping function using Tukey's rule: values beyond
# Q1 - 1.5 * IQR or Q3 + 1.5 * IQR are replaced with the
# 5th or 95th percentile respectively
outlier_norm <- function(x) {
  qntile <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  caps <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qntile[1] - H)] <- caps[1]
  x[x > (qntile[2] + H)] <- caps[2]
  return(x)
}

# Apply outlier detection and capping to the selected columns
for (col in names(pima_selected)) {
  pima_selected[[col]] <- outlier_norm(pima_selected[[col]])
}

# Melt the data for plotting

melted_pima <- melt(pima_selected)
No id variables; using all as measure variables
# Plot boxplots for each column
ggplot(melted_pima, aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4) +
  labs(x = "Variable", y = "Value") +
  theme_minimal()

Comparison

For bmi, the mean, standard deviation and median are 32.45961, 6.880279 and 32.3 before the outlier treatment, and 32.33677, 6.534469 and 32.3 after it. The differences are small: capping the outliers left the centre of the distribution almost unchanged and reduced the spread only slightly.

mean(pima$bmi)
[1] 32.45961
sd(pima$bmi)
[1] 6.880279
median(pima$bmi)
[1] 32.3
mean(pima_selected$bmi)
[1] 32.33677
sd(pima_selected$bmi)
[1] 6.534469
median(pima_selected$bmi)
[1] 32.3

Method two

I am going to keep this simple and concise and use only the bmi column to demonstrate this method. First I will diagnose the dataset.

pima_meth2 <- pima
library(dlookr)
outlier_diag <- diagnose_outlier(pima_meth2)
print(outlier_diag)

Capping refers to replacing extreme values (outliers) with less extreme values that are within a specified range.

In the case of the imputate_outlier(pima, xvar = bmi, method = "capping") function call, dlookr identifies outliers in the “bmi” variable of the “pima” dataset and replaces them with percentile values: outliers at the upper end are replaced with the 95th percentile and outliers at the lower end with the 5th percentile.

The specific range or threshold for capping outliers can be adjusted depending on the context and the desired treatment of outliers in the analysis.
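
The idea behind capping can be sketched in base R. The function below is a hypothetical helper written for this illustration, not dlookr's actual implementation: it assumes outliers are the points flagged by boxplot.stats() and that the caps are the 5th and 95th percentiles, with the `probs` argument exposing the range so it can be adjusted.

```r
# Hypothetical helper: cap boxplot outliers at chosen percentiles
cap_outliers <- function(x, probs = c(0.05, 0.95)) {
  out <- boxplot.stats(x)$out                      # points beyond the whiskers
  caps <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  centre <- median(x, na.rm = TRUE)
  x[x %in% out & x < centre] <- caps[1]            # low outliers -> lower cap
  x[x %in% out & x > centre] <- caps[2]            # high outliers -> upper cap
  x
}

# Toy data with one extreme value (hypothetical)
bmi_toy <- c(21, 24, 26, 27, 29, 30, 31, 33, 35, 70)
capped <- cap_outliers(bmi_toy)
capped
```

Only the extreme value 70 is changed; it is pulled in to the 95th percentile of the toy data while every other value is left as it was.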

Applying method two

imputate_outlier(pima_meth2, xvar = bmi, method = "capping")
  [1] 33.60000 26.60000 23.30000 28.10000 43.10000 25.60000 31.00000 35.30000 30.50000 38.68713
 [11] 37.60000 38.00000 27.10000 30.10000 25.80000 30.00000 45.80000 29.60000 43.30000 34.60000
 [21] 39.30000 35.40000 39.80000 29.00000 36.60000 31.10000 39.40000 23.20000 22.20000 34.10000
 [31] 36.00000 31.60000 24.80000 19.90000 27.60000 24.00000 33.20000 32.90000 38.20000 37.10000
 [41] 34.00000 40.20000 22.70000 45.40000 27.40000 42.00000 29.70000 28.00000 39.10000 31.70443
 [51] 19.40000 24.20000 24.40000 33.70000 34.70000 23.00000 37.70000 46.80000 40.50000 41.50000
 [61] 29.15403 32.90000 25.00000 25.40000 32.80000 29.00000 32.50000 42.70000 19.60000 28.90000
 [71] 32.90000 28.60000 43.40000 35.10000 32.00000 24.70000 32.60000 37.70000 43.20000 25.00000
 [81] 22.40000 30.75815 29.30000 24.60000 48.80000 32.40000 36.60000 38.50000 37.10000 26.50000
 [91] 19.10000 32.00000 46.70000 23.80000 24.70000 33.90000 31.60000 20.40000 28.70000 49.70000
[101] 39.00000 26.10000 22.50000 26.60000 39.60000 28.70000 22.40000 29.50000 34.30000 37.40000
[111] 33.30000 34.00000 31.20000 34.00000 30.50000 31.20000 34.00000 33.70000 28.20000 23.20000
[121] 44.39500 34.20000 33.60000 26.80000 33.30000 44.39500 42.90000 33.30000 34.50000 27.90000
[131] 29.70000 33.30000 34.50000 38.30000 21.10000 33.80000 30.80000 28.70000 31.20000 36.90000
[141] 21.10000 39.50000 32.50000 32.40000 32.80000 31.62951 32.80000 30.50000 33.70000 27.30000
[151] 37.40000 21.90000 34.30000 40.60000 47.90000 50.00000 24.60000 25.20000 29.00000 40.90000
[161] 29.70000 37.20000 44.20000 29.70000 31.60000 29.90000 32.50000 29.60000 31.90000 28.40000
[171] 30.80000 35.40000 28.90000 43.50000 29.70000 32.70000 31.20000 44.39500 45.00000 39.10000
[181] 23.20000 34.90000 27.70000 26.80000 27.60000 35.90000 30.10000 32.00000 27.90000 31.60000
[191] 22.60000 33.10000 30.40000 44.39500 24.40000 39.40000 24.30000 22.90000 34.80000 30.90000
[201] 31.00000 40.10000 27.30000 20.40000 37.70000 23.90000 37.50000 37.70000 33.20000 35.50000
[211] 27.70000 42.80000 34.20000 42.60000 34.20000 41.80000 35.80000 30.00000 29.00000 37.80000
[221] 34.60000 31.60000 25.20000 28.80000 23.60000 34.60000 35.70000 37.20000 36.70000 45.20000
[231] 44.00000 46.20000 25.40000 35.00000 29.70000 43.60000 35.90000 44.10000 30.80000 18.40000
[241] 29.20000 33.10000 25.60000 27.10000 38.20000 30.00000 31.20000 44.39500 35.40000 30.10000
[251] 31.20000 28.00000 24.40000 35.80000 27.60000 33.60000 30.10000 28.70000 25.90000 33.30000
[261] 30.90000 30.00000 32.10000 32.40000 32.00000 33.60000 36.30000 40.00000 25.10000 27.50000
[271] 45.60000 25.20000 23.00000 33.20000 34.20000 40.50000 26.50000 27.80000 24.90000 25.30000
[281] 37.90000 35.90000 32.40000 30.40000 27.00000 26.00000 38.70000 45.60000 20.80000 36.10000
[291] 36.90000 36.60000 43.30000 40.50000 21.90000 35.50000 28.00000 30.70000 36.60000 23.60000
[301] 32.30000 31.60000 35.80000 44.39500 21.00000 39.70000 25.50000 24.80000 30.50000 32.90000
[311] 26.20000 39.40000 26.60000 29.50000 35.90000 34.10000 19.30000 30.50000 38.10000 23.50000
[321] 27.50000 31.60000 27.40000 26.80000 35.70000 25.60000 35.10000 35.10000 45.50000 30.80000
[331] 23.10000 32.70000 43.30000 23.60000 23.90000 47.90000 33.80000 31.20000 34.20000 39.90000
[341] 25.90000 25.90000 32.00000 34.70000 36.80000 38.50000 28.70000 23.50000 21.80000 41.00000
[351] 42.20000 31.20000 34.40000 27.20000 42.70000 30.40000 33.30000 39.90000 35.30000 36.50000
[361] 31.20000 29.80000 39.20000 38.50000 34.90000 34.00000 27.60000 21.00000 27.50000 32.80000
[371] 38.40000 30.68725 35.80000 34.90000 36.20000 39.20000 25.20000 37.20000 48.30000 43.40000
[381] 30.80000 20.00000 25.40000 25.10000 24.30000 22.30000 32.30000 43.30000 32.00000 31.60000
[391] 32.00000 45.70000 23.70000 22.10000 32.90000 27.70000 24.70000 34.30000 21.10000 34.90000
[401] 32.00000 24.20000 35.00000 31.60000 32.90000 42.10000 28.90000 21.90000 25.90000 42.40000
[411] 35.70000 34.40000 42.40000 26.20000 34.60000 35.70000 27.20000 38.50000 18.20000 26.40000
[421] 45.30000 26.00000 40.60000 30.80000 42.90000 37.00000 37.09534 34.10000 40.60000 35.00000
[431] 22.20000 30.40000 30.00000 25.60000 24.50000 42.40000 37.40000 29.90000 18.20000 36.80000
[441] 34.30000 32.20000 33.20000 30.50000 29.70000 44.39500 25.30000 36.50000 33.60000 30.50000
[451] 21.20000 28.90000 39.90000 19.60000 37.80000 33.60000 26.70000 30.20000 37.60000 25.90000
[461] 20.80000 21.80000 35.30000 27.60000 24.00000 21.80000 27.80000 36.80000 30.00000 46.10000
[471] 41.30000 33.20000 38.80000 29.90000 28.90000 27.30000 33.70000 23.80000 25.90000 28.00000
[481] 35.50000 35.20000 27.80000 38.20000 44.20000 42.30000 40.70000 46.50000 25.60000 26.10000
[491] 36.80000 33.50000 32.80000 28.90000 30.08729 26.60000 26.00000 30.10000 25.10000 29.30000
[501] 25.20000 37.20000 39.00000 33.30000 37.30000 33.30000 36.50000 28.60000 30.40000 25.00000
[511] 29.70000 22.10000 24.20000 27.30000 25.60000 31.60000 30.30000 37.60000 32.80000 19.60000
[521] 25.00000 33.20000 32.22871 34.20000 31.60000 21.80000 18.20000 26.30000 30.80000 24.60000
[531] 29.80000 45.30000 41.30000 29.80000 33.30000 32.90000 29.60000 21.70000 36.30000 36.40000
[541] 39.40000 32.40000 34.90000 39.50000 32.00000 34.50000 43.60000 33.10000 32.80000 28.50000
[551] 27.40000 31.90000 27.80000 29.90000 36.90000 25.50000 38.10000 27.80000 46.20000 30.10000
[561] 33.80000 41.30000 37.60000 26.90000 32.40000 26.10000 38.60000 32.00000 31.30000 34.30000
[571] 32.50000 22.60000 29.50000 34.70000 30.10000 35.50000 24.00000 42.90000 27.00000 34.70000
[581] 42.10000 25.00000 26.50000 38.70000 28.70000 22.50000 34.90000 24.30000 33.30000 21.10000
[591] 46.80000 39.40000 34.40000 28.50000 33.60000 32.00000 45.30000 27.80000 36.80000 23.10000
[601] 27.10000 23.70000 27.80000 35.20000 28.40000 35.80000 40.00000 19.50000 41.50000 24.00000
[611] 30.90000 32.90000 38.20000 32.50000 36.10000 25.80000 28.70000 20.10000 28.20000 32.40000
[621] 38.40000 24.20000 40.80000 43.50000 30.80000 37.70000 24.70000 32.40000 34.60000 24.70000
[631] 27.40000 34.50000 26.20000 27.50000 25.90000 31.20000 28.80000 31.60000 40.90000 19.50000
[641] 29.30000 34.30000 29.50000 28.00000 27.60000 39.40000 23.40000 37.80000 28.30000 26.40000
[651] 25.20000 33.80000 34.10000 26.80000 34.20000 38.70000 21.80000 38.90000 39.00000 34.20000
[661] 27.70000 42.90000 37.60000 37.90000 33.70000 34.80000 32.50000 27.50000 34.00000 30.90000
[671] 33.60000 25.40000 35.50000 44.39500 35.60000 30.90000 24.80000 35.30000 36.00000 24.20000
[681] 24.20000 49.60000 44.60000 32.30000 34.89673 33.20000 23.10000 28.30000 24.10000 46.10000
[691] 24.60000 42.30000 39.10000 38.50000 23.50000 30.40000 29.90000 25.00000 34.50000 44.50000
[701] 35.90000 27.60000 35.00000 38.50000 28.40000 39.80000 33.43828 34.40000 32.80000 38.00000
[711] 31.20000 29.60000 41.20000 26.40000 29.50000 33.90000 33.80000 23.10000 35.50000 35.60000
[721] 29.30000 38.10000 29.30000 39.10000 32.80000 39.40000 36.10000 32.40000 22.90000 30.10000
[731] 28.40000 28.40000 44.50000 29.00000 23.30000 35.40000 27.40000 32.00000 36.60000 39.50000
[741] 42.30000 30.80000 28.50000 32.70000 40.60000 30.00000 49.30000 46.30000 36.40000 24.30000
[751] 31.20000 39.00000 26.00000 43.30000 32.40000 36.50000 32.00000 36.30000 37.50000 35.50000
[761] 28.40000 44.00000 22.50000 32.90000 36.80000 26.20000 30.10000 30.40000
attr(,"method")
[1] "capping"
attr(,"var_type")
[1] "numerical"
attr(,"outlier_pos")
[1] 121 126 178 194 248 304 446 674
attr(,"outliers")
[1] 53.2 55.0 67.1 52.3 52.3 52.9 59.4 57.3
attr(,"type")
[1] "outliers"
attr(,"message")
[1] "complete imputation"
attr(,"success")
[1] TRUE
attr(,"class")
[1] "imputation" "numeric"   
bmi_no_outlier <- imputate_outlier(pima_meth2, xvar = bmi, method = "capping") # save the capped values as a numeric vector

After the outliers were capped, the distribution of the bmi column looks closer to normal.


hist(bmi_no_outlier)

In summary, if you are primarily concerned with imputing missing values in your dataset and capturing the overall patterns, “bagImpute” method may be more appropriate. If your main focus is on handling outliers in specific variables, then “imputate_outlier” with “capping” can be a useful approach. Ultimately, the choice depends on your research goals, the characteristics of your data, and the specific context of your analysis.

