Remove Outliers and Perform Data Cleaning in RStudio


The mtcars dataset in R is a great example to learn basic exploratory data analysis techniques. In this blog post, we will use the mtcars dataset to understand how to make basic plots like boxplots and histograms, identify outliers, remove outliers, impute missing values, and encode categorical variables.

Introduction

The mtcars dataset contains fuel consumption, horsepower, and other specifications for 32 automobiles manufactured between 1973 and 1974. It’s a tidy dataset with 11 variables, all stored as numeric, although some of them (such as vs and am) are really categorical codes. The variables are:

mpg - Miles Per Gallon
cyl - Number of cylinders
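disp - Displacement (cubic inches)
hp - Gross horsepower
drat - Rear axle ratio
wt - Weight (1000 lbs)
qsec - 1/4 mile time (seconds)
vs - Engine shape (0 = V-shaped, 1 = straight)
am - Transmission (0 = automatic, 1 = manual)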
gear - Number of forward gears
carb - Number of carburetors

This dataset lends itself nicely to basic EDA techniques like making plots, identifying outliers, imputing missing values, and encoding variables. The code snippets in this post illustrate how to perform these tasks in R.

Boxplots and Histograms

We start by loading the mtcars dataset and creating a boxplot for the hp variable. Boxplots are great for visualizing the distribution of a numeric variable:
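A minimal sketch of this step, assuming base R graphics; the plot title and axis label are illustrative choices, not taken from the original post:

```r
# mtcars ships with base R (datasets package), so nothing needs installing
data(mtcars)

# Draw the boxplot and keep the summary statistics that boxplot() returns
bp <- boxplot(mtcars$hp,
              main = "Gross horsepower (hp)",  # illustrative title
              ylab = "hp")

range(mtcars$hp)  # smallest and largest hp in the data
bp$out            # points drawn beyond the whiskers
```

The object returned by boxplot() is handy later on, because bp$out lists the points that the plot draws as individual dots.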

The boxplot shows hp values ranging from 52 to 335. The distribution is right-skewed, with most values concentrated at the lower end.

Next, we create a histogram which is another useful plot for understanding the distribution of a numeric variable:
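One way to draw it in base R is sketched here; the number of breaks and the labels are assumptions chosen for illustration:

```r
data(mtcars)

# hist() treats `breaks` as a suggestion and picks "pretty" bin edges near it
h <- hist(mtcars$hp,
          breaks = 10,                                # assumed bin count
          main = "Distribution of gross horsepower",  # illustrative title
          xlab = "hp")

h$counts               # number of cars in each bin
sum(mtcars$hp < 150)   # how many cars have less than 150 hp
```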

The histogram displays the frequency distribution of hp values. It is right-skewed again: just over half of the cars (17 of 32) have less than 150 hp, and a thin tail extends toward the high-powered cars.

Identifying Outliers

Now let’s examine how to identify potential outliers in the hp variable. Outliers are data points that differ significantly from the other observations. They can skew the results of an analysis, so it’s useful to detect them.

We will use three methods: z-scores, the interquartile range, and statistical tests.

Z-Scores

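We can calculate z-scores, which measure how many standard deviations each value lies from the mean, and flag the values whose absolute z-score exceeds a chosen cutoff: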

## [1] 335
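The value above can be reproduced with a few lines of base R. This is a minimal sketch that assumes a cutoff of |z| > 2.5; the post does not state which cutoff was used, and with the common cutoff of 3 no hp value would be flagged, since the largest z-score is about 2.75.

```r
hp <- mtcars$hp

# Standardize: distance from the mean in units of standard deviation
z <- (hp - mean(hp)) / sd(hp)

cutoff <- 2.5          # assumed threshold, not taken from the post
hp[abs(z) > cutoff]    # values flagged as potential outliers

round(max(abs(z)), 2)  # largest absolute z-score in the data
```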

Interquartile Range

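Another method is the interquartile range (IQR) rule, which flags values that lie more than a chosen multiple of the IQR below the first quartile or above the third quartile: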

## [1] 264 335
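A minimal sketch of the IQR rule follows. The helper name iqr_outliers and its k argument are hypothetical. With the conventional multiplier of 1.5, only 335 falls outside the fences for hp; the two values printed above are reproduced by a tighter multiplier of 1, which is an assumption about how that output was generated.

```r
# Hypothetical helper: return the values outside [Q1 - k*IQR, Q3 + k*IQR]
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])  # q[2] - q[1] is the interquartile range
  x[x < q[1] - fence | x > q[2] + fence]
}

iqr_outliers(mtcars$hp)         # conventional 1.5 x IQR fences
iqr_outliers(mtcars$hp, k = 1)  # tighter fences (assumed multiplier)
```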

Statistical Tests

We can also use statistical tests like Dixon’s test and Rosner’s test to detect outliers:

Dixon’s test suggests 335 is a potential outlier while Rosner’s test does not detect any outliers at the default significance level of 0.05.
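A hedged sketch of how such tests might be run, assuming the EnvStats package for Rosner's test (rosnerTest) and the outliers package for Dixon's test (dixon.test); the post does not say which packages it used. As far as I know, dixon.test is designed for small samples of 3 to 30 values, so with all 32 hp values it may stop with an error; the sketch catches that instead of assuming a result.

```r
x <- mtcars$hp

# Rosner's generalized ESD test; k is the maximum number of outliers to test for
if (requireNamespace("EnvStats", quietly = TRUE)) {
  rosner <- EnvStats::rosnerTest(x, k = 3, alpha = 0.05)  # k = 3 is an assumption
  print(rosner$all.stats)  # one row per candidate outlier
} else {
  message("Package 'EnvStats' is not installed; skipping Rosner's test")
}

# Dixon's test; wrapped in tryCatch in case the sample size is not accepted
if (requireNamespace("outliers", quietly = TRUE)) {
  dixon <- tryCatch(outliers::dixon.test(x), error = function(e) e)
  print(dixon)  # either the test result or the captured error
} else {
  message("Package 'outliers' is not installed; skipping Dixon's test")
}
```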

Removing Outliers

Once we’ve identified potential outliers, we may want to remove them from the dataset before further analysis. Here are some ways to remove outliers:

The subset, filter, and logical operator approaches all create a new dataset without the outlier observations.
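The three approaches can be sketched as follows. This assumes outliers are defined by the conventional 1.5 × IQR fences (which remove only the 335 hp car) and that filter refers to dplyr::filter; the dplyr step is skipped if that package is not installed.

```r
data(mtcars)

# Fences from the 1.5 x IQR rule (assumed outlier definition)
q <- quantile(mtcars$hp, probs = c(0.25, 0.75), names = FALSE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])

# 1. subset(): column names can be used directly in the condition
clean_subset <- subset(mtcars, hp >= lower & hp <= upper)

# 2. Logical indexing: keep the rows where the condition is TRUE
keep <- mtcars$hp >= lower & mtcars$hp <= upper
clean_index <- mtcars[keep, ]

# 3. dplyr::filter(): same condition, only if dplyr is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  clean_filter <- dplyr::filter(mtcars, hp >= lower, hp <= upper)
}

c(before = nrow(mtcars), after = nrow(clean_subset))
```

Each variant leaves mtcars itself untouched, so the original data stays available for comparison.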

Imputing Missing Values

Another common task is dealing with missing values encoded as NA in R. We will create a simulated dataset and introduce some missing values:
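A minimal sketch of such a dataset is shown here. The seed, the distributions, and the choice of ten missing entries per column are assumptions, so the values will not necessarily match the table printed later in this post.

```r
set.seed(123)  # assumed seed, for reproducibility only
n <- 100

df <- data.frame(
  x = rnorm(n, mean = 50, sd = 10),                 # numeric
  y = sample(c("A", "B", "C"), n, replace = TRUE),  # categorical
  z = runif(n, min = 0, max = 100)                  # numeric
)

# Blank out ten randomly chosen entries in each column
df$x[sample(n, 10)] <- NA
df$y[sample(n, 10)] <- NA
df$z[sample(n, 10)] <- NA

colSums(is.na(df))  # number of missing values per column
```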

We can impute the missing numeric variables x and z with mean imputation:
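A small sketch of mean imputation, using a tiny hypothetical data frame so the effect is easy to check by eye:

```r
# Hypothetical toy data with missing values in both numeric columns
df <- data.frame(x = c(10, NA, 30, 20),
                 z = c(NA, 4, 8, NA))

for (col in c("x", "z")) {
  col_mean <- mean(df[[col]], na.rm = TRUE)   # mean of the observed values
  df[[col]][is.na(df[[col]])] <- col_mean     # fill the gaps with that mean
}

df
```

Here the missing x becomes 20 and the missing z values become 6, the means of the observed entries. Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind before modelling.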

For categorical variables like y, we can impute the mode:
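Base R has no built-in function for the statistical mode, so one common approach is to tabulate the values and pick the most frequent one. A minimal sketch on a hypothetical vector:

```r
# Hypothetical toy vector with two missing entries
y <- c("A", "B", NA, "B", "C", NA, "A", "B")

# table() ignores NA by default; which.max() picks the most frequent level
mode_value <- names(which.max(table(y)))

y[is.na(y)] <- mode_value  # replace the missing entries with the mode
y
```

If two categories tie, which.max() returns the first one in alphabetical order, so ties are broken arbitrarily.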

##            x y         z
## 1   44.39524 B 91.506376
## 2   47.69823 B 49.308800
## 3   65.58708 A 28.628535
## 4   50.70508 B 73.779740
## 5   51.29288 C 83.405431
## 6   67.15065 C 31.427078
## 7   54.60916 A 49.256655
## 8   37.34939 B 49.308800
## 9   43.13147 A 64.146235
## 10  45.54338 B 64.392291
## 11  62.24082 C 49.308800
## 12  53.59814 C 41.473533
## 13  54.00771 B 11.940478
## 14  51.10683 C 52.602966
## 15  44.44159 B 22.507335
## 16  67.86913 A 48.641176
## 17  54.97850 A 37.021480
## 18  30.33383 B 98.335018
## 19  57.01356 C 38.831912
## 20  45.27209 B 22.924484
## 21  39.32176 B 62.329755
## 22  47.82025 B 13.654020
## 23  39.73996 A 96.746949
## 24  42.71109 B 51.507181
## 25  43.74961 A 16.307033
## 26  33.13307 B 62.190230
## 27  58.37787 B 98.595417
## 28  51.53373 A 66.877152
## 29  38.61863 C 41.891590
## 30  50.72672 A 32.334499
## 31  54.26464 A 83.525532
## 32  47.04929 B 14.381704
## 33  58.95126 A 19.281595
## 34  58.78133 B 89.673868
## 35  50.72672 A 30.811955
## 36  56.88640 A 36.330054
## 37  55.53918 A 78.394648
## 38  49.38088 C 19.337868
## 39  46.94037 C 49.308800
## 40  46.19529 B 40.660787
## 41  43.05293 A 49.308800
## 42  47.92083 B 42.184495
## 43  37.34604 C 34.280880
## 44  71.68956 B 86.648331
## 45  62.07962 C 45.510805
## 46  38.76891 B 53.376487
## 47  50.72672 C 96.384333
## 48  45.33345 A 77.459154
## 49  57.79965 A 20.887635
## 50  50.72672 A 30.878683
## 51  52.53319 A 97.134245
## 52  49.71453 C 58.490009
## 53  49.57130 B 76.082363
## 54  63.68602 A 37.270939
## 55  47.74229 B 76.919391
## 56  65.16471 A 53.767718
## 57  34.51247 C 91.399545
## 58  55.84614 C 18.529644
## 59  51.23854 A 28.221842
## 60  52.15942 B  9.496241
## 61  53.79639 C 21.048708
## 62  44.97677 A 97.709900
## 63  46.66793 C 49.308800
## 64  39.81425 B 72.598303
## 65  39.28209 A 78.568783
## 66  53.03529 A 10.541775
## 67  50.72672 A 23.959463
## 68  50.53004 A 27.054487
## 69  59.22267 A 10.105849
## 70  70.50085 C 11.791384
## 71  45.08969 B 99.123656
## 72  26.90831 B 98.605430
## 73  60.05739 B 13.706747
## 74  42.90799 C 49.308800
## 75  43.11991 A 57.630184
## 76  50.72672 B 39.544886
## 77  47.15227 B 44.980248
## 78  37.79282 A 70.650190
## 79  51.81303 B  8.250275
## 80  48.61109 A 33.931258
## 81  50.72672 B 49.308800
## 82  50.72672 B 49.308800
## 83  46.29340 C 83.156860
## 84  56.44377 B 21.517208
## 85  47.79513 C 49.794894
## 86  53.31782 A 27.604967
## 87  60.96839 B 19.202332
## 88  54.35181 A 95.062126
## 89  50.72672 B 32.172554
## 90  61.48808 B 47.845638
## 91  59.93504 B  2.799257
## 92  55.48397 C 54.745947
## 93  52.38732 B 64.424022
## 94  50.72672 A 49.308800
## 95  63.60652 A 32.193738
## 96  43.99740 B 89.111431
## 97  71.87333 C 62.625695
## 98  65.32611 B 30.290492
## 99  47.64300 B 38.820466
## 100 39.73579 A 16.047509

More sophisticated methods are available in packages such as mice and DMwR, but mean/mode imputation provides a simple option for treating missing values.

Encoding Categorical Variables

In mtcars, the transmission type am is a categorical variable that is already stored as 0/1. Categorical variables stored as text, such as y in the simulated dataset, usually need to be encoded in a numeric format before modelling.

Common encoding methods include:

Label Encoding: Assign numeric values to categories
One-Hot Encoding: Create dummy variables for each category
Ordinal Encoding: Encode logical order between categories
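The three encodings can be sketched in base R as follows. The toy data frame is hypothetical, and the column names y_label, y_onehot and y_ordinal are chosen to mirror the table printed in this post; treating A < B < C as the natural order is an assumption.

```r
# Hypothetical toy data with one categorical column
df <- data.frame(y = c("B", "A", "C", "B"))

# Label encoding: factor levels sort alphabetically, so A = 0, B = 1, C = 2
df$y_label <- as.integer(factor(df$y)) - 1L

# One-hot encoding: "- 1" drops the intercept so every level gets a column
onehot <- model.matrix(~ y - 1, data = df)
df$y_onehot <- onehot  # prints as y_onehot.yA, y_onehot.yB, y_onehot.yC

# Ordinal encoding: an ordered factor with an explicit level order
df$y_ordinal <- factor(df$y, levels = c("A", "B", "C"), ordered = TRUE)

df
```

Label encoding implies an order and spacing between categories that may not exist, so one-hot encoding is usually the safer choice for unordered categories.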

##            x y         z y_label y_onehot.yA y_onehot.yB y_onehot.yC y_ordinal
## 1   44.39524 B 91.506376       1           0           1           0         B
## 2   47.69823 B 49.308800       1           0           1           0         B
## 3   65.58708 A 28.628535       0           1           0           0         A
## 4   50.70508 B 73.779740       1           0           1           0         B
## 5   51.29288 C 83.405431       2           0           0           1         C
## 6   67.15065 C 31.427078       2           0           0           1         C
## 7   54.60916 A 49.256655       0           1           0           0         A
## 8   37.34939 B 49.308800       1           0           1           0         B
## 9   43.13147 A 64.146235       0           1           0           0         A
## 10  45.54338 B 64.392291       1           0           1           0         B
## 11  62.24082 C 49.308800       2           0           0           1         C
## 12  53.59814 C 41.473533       2           0           0           1         C
## 13  54.00771 B 11.940478       1           0           1           0         B
## 14  51.10683 C 52.602966       2           0           0           1         C
## 15  44.44159 B 22.507335       1           0           1           0         B
## 16  67.86913 A 48.641176       0           1           0           0         A
## 17  54.97850 A 37.021480       0           1           0           0         A
## 18  30.33383 B 98.335018       1           0           1           0         B
## 19  57.01356 C 38.831912       2           0           0           1         C
## 20  45.27209 B 22.924484       1           0           1           0         B
## 21  39.32176 B 62.329755       1           0           1           0         B
## 22  47.82025 B 13.654020       1           0           1           0         B
## 23  39.73996 A 96.746949       0           1           0           0         A
## 24  42.71109 B 51.507181       1           0           1           0         B
## 25  43.74961 A 16.307033       0           1           0           0         A
## 26  33.13307 B 62.190230       1           0           1           0         B
## 27  58.37787 B 98.595417       1           0           1           0         B
## 28  51.53373 A 66.877152       0           1           0           0         A
## 29  38.61863 C 41.891590       2           0           0           1         C
## 30  50.72672 A 32.334499       0           1           0           0         A
## 31  54.26464 A 83.525532       0           1           0           0         A
## 32  47.04929 B 14.381704       1           0           1           0         B
## 33  58.95126 A 19.281595       0           1           0           0         A
## 34  58.78133 B 89.673868       1           0           1           0         B
## 35  50.72672 A 30.811955       0           1           0           0         A
## 36  56.88640 A 36.330054       0           1           0           0         A
## 37  55.53918 A 78.394648       0           1           0           0         A
## 38  49.38088 C 19.337868       2           0           0           1         C
## 39  46.94037 C 49.308800       2           0           0           1         C
## 40  46.19529 B 40.660787       1           0           1           0         B
## 41  43.05293 A 49.308800       0           1           0           0         A
## 42  47.92083 B 42.184495       1           0           1           0         B
## 43  37.34604 C 34.280880       2           0           0           1         C
## 44  71.68956 B 86.648331       1           0           1           0         B
## 45  62.07962 C 45.510805       2           0           0           1         C
## 46  38.76891 B 53.376487       1           0           1           0         B
## 47  50.72672 C 96.384333       2           0           0           1         C
## 48  45.33345 A 77.459154       0           1           0           0         A
## 49  57.79965 A 20.887635       0           1           0           0         A
## 50  50.72672 A 30.878683       0           1           0           0         A
## 51  52.53319 A 97.134245       0           1           0           0         A
## 52  49.71453 C 58.490009       2           0           0           1         C
## 53  49.57130 B 76.082363       1           0           1           0         B
## 54  63.68602 A 37.270939       0           1           0           0         A
## 55  47.74229 B 76.919391       1           0           1           0         B
## 56  65.16471 A 53.767718       0           1           0           0         A
## 57  34.51247 C 91.399545       2           0           0           1         C
## 58  55.84614 C 18.529644       2           0           0           1         C
## 59  51.23854 A 28.221842       0           1           0           0         A
## 60  52.15942 B  9.496241       1           0           1           0         B
## 61  53.79639 C 21.048708       2           0           0           1         C
## 62  44.97677 A 97.709900       0           1           0           0         A
## 63  46.66793 C 49.308800       2           0           0           1         C
## 64  39.81425 B 72.598303       1           0           1           0         B
## 65  39.28209 A 78.568783       0           1           0           0         A
## 66  53.03529 A 10.541775       0           1           0           0         A
## 67  50.72672 A 23.959463       0           1           0           0         A
## 68  50.53004 A 27.054487       0           1           0           0         A
## 69  59.22267 A 10.105849       0           1           0           0         A
## 70  70.50085 C 11.791384       2           0           0           1         C
## 71  45.08969 B 99.123656       1           0           1           0         B
## 72  26.90831 B 98.605430       1           0           1           0         B
## 73  60.05739 B 13.706747       1           0           1           0         B
## 74  42.90799 C 49.308800       2           0           0           1         C
## 75  43.11991 A 57.630184       0           1           0           0         A
## 76  50.72672 B 39.544886       1           0           1           0         B
## 77  47.15227 B 44.980248       1           0           1           0         B
## 78  37.79282 A 70.650190       0           1           0           0         A
## 79  51.81303 B  8.250275       1           0           1           0         B
## 80  48.61109 A 33.931258       0           1           0           0         A
## 81  50.72672 B 49.308800       1           0           1           0         B
## 82  50.72672 B 49.308800       1           0           1           0         B
## 83  46.29340 C 83.156860       2           0           0           1         C
## 84  56.44377 B 21.517208       1           0           1           0         B
## 85  47.79513 C 49.794894       2           0           0           1         C
## 86  53.31782 A 27.604967       0           1           0           0         A
## 87  60.96839 B 19.202332       1           0           1           0         B
## 88  54.35181 A 95.062126       0           1           0           0         A
## 89  50.72672 B 32.172554       1           0           1           0         B
## 90  61.48808 B 47.845638       1           0           1           0         B
## 91  59.93504 B  2.799257       1           0           1           0         B
## 92  55.48397 C 54.745947       2           0           0           1         C
## 93  52.38732 B 64.424022       1           0           1           0         B
## 94  50.72672 A 49.308800       0           1           0           0         A
## 95  63.60652 A 32.193738       0           1           0           0         A
## 96  43.99740 B 89.111431       1           0           1           0         B
## 97  71.87333 C 62.625695       2           0           0           1         C
## 98  65.32611 B 30.290492       1           0           1           0         B
## 99  47.64300 B 38.820466       1           0           1           0         B
## 100 39.73579 A 16.047509       0           1           0           0         A

Summary

In this post, we explored the mtcars dataset in R to understand exploratory data analysis techniques like:

Creating boxplots and histograms
Identifying outliers using z-scores, IQR, and tests
Removing outliers by subsetting
Imputing missing values with mean/mode
Encoding categorical variables

The mtcars dataset provides a nice way to get hands-on practice with cleaning and preparing data for analysis. The same principles can be applied to other real-world datasets.

Some next steps for a more thorough analysis:

Explore relationships between variables using correlations and PCA
Build models to predict mpg (fuel economy) from variables such as hp, wt, and cyl
Use cross-validation to evaluate model performance
Tune models by removing outliers, transforming variables, and so on

I hope you found this example useful! Let me know if you have any other questions.