Remove Outliers and Perform Data Cleaning in RStudio
The mtcars dataset in R is a great example to learn basic exploratory data analysis techniques. In this blog post, we will use the mtcars dataset to understand how to make basic plots like boxplots and histograms, identify outliers, remove outliers, impute missing values, and encode categorical variables.
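The mtcars dataset contains fuel consumption, horsepower, and other specifications for 32 automobiles (1973-74 models). It’s a tidy dataset with 11 variables, all stored as numeric, although some of them (vs and am in particular) are really categorical codes. The variables are: mpg (miles per US gallon), cyl (number of cylinders), disp (displacement in cubic inches), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (quarter-mile time in seconds), vs (engine shape: 0 = V-shaped, 1 = straight), am (transmission: 0 = automatic, 1 = manual), gear (number of forward gears), and carb (number of carburetors).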
This dataset lends itself nicely to basic EDA techniques like making plots, identifying outliers, imputing missing values, and encoding variables. The code snippets in this post illustrate how to perform these tasks in R.
We start by loading the mtcars dataset and creating a boxplot for the hp variable. Boxplots are great for visualizing the distribution of a numeric variable:
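A minimal sketch of this step, using only base R. The plot title and axis label are illustrative choices, not taken from the original post.

```r
# mtcars ships with base R (package 'datasets'), so nothing needs installing.
data(mtcars)

# Boxplot of horsepower; title and axis label are illustrative.
boxplot(mtcars$hp,
        main = "Boxplot of horsepower (hp)",
        ylab = "Gross horsepower")

# The numbers behind the box: minimum, lower hinge, median, upper hinge, maximum.
hp_five <- fivenum(mtcars$hp)
print(hp_five)
```

Printing the five-number summary next to the plot makes it easier to read exact values off the box and whiskers.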
The boxplot shows the distribution of hp values ranging from 52 to 335, with a right-skewed distribution where most values are concentrated on the lower end.
Next, we create a histogram which is another useful plot for understanding the distribution of a numeric variable:
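One way to draw it in base R is sketched below. The number of breaks, the colour, and the labels are assumptions made for illustration.

```r
data(mtcars)

# Histogram of horsepower; 'breaks = 10' is only a suggestion to hist(),
# which picks "pretty" bin edges close to that number.
h <- hist(mtcars$hp,
          breaks = 10,
          main = "Histogram of horsepower (hp)",
          xlab = "Gross horsepower",
          col = "lightgray")

# hist() also returns the bin edges and counts, useful for checking the plot.
bins <- data.frame(lower = head(h$breaks, -1),
                   upper = h$breaks[-1],
                   count = h$counts)
print(bins)
```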
The histogram displays the frequency distribution of hp values. The right skew shows up again: just over half of the cars (17 of 32) have less than 150 hp.
Now let’s examine how to identify potential outliers in the hp variable. Outliers are data points that differ markedly from the other observations. They can skew the results of an analysis, so it’s useful to detect them.
We will use three methods: z-scores, the interquartile range, and statistical tests.
We can calculate z-scores to identify potential outliers:
## [1] 335
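The z-score calculation can be sketched as follows. The cutoff of 2.5 is an assumption: the original code is not shown, and 2.5 is simply one value that flags 335 and nothing else. The z-score of 335 is about 2.75, so the stricter and also common cutoff of 3 would flag no values at all.

```r
data(mtcars)
hp <- mtcars$hp

# z-score: distance from the mean in units of the sample standard deviation.
z <- (hp - mean(hp)) / sd(hp)

# Assumed cutoff (not from the original post).
z_cutoff <- 2.5
z_outliers <- hp[abs(z) > z_cutoff]
print(z_outliers)
```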
Another method is to use the interquartile range (IQR):
## [1] 264 335
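For reference, here is a sketch of the usual IQR rule (Tukey's fences) with the conventional multiplier k = 1.5. The multiplier is an assumption, since the original code is not shown.

```r
data(mtcars)
hp <- mtcars$hp

# First and third quartiles; their difference is the IQR (same as IQR(hp)).
q <- quantile(hp, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey's fences with an assumed multiplier of 1.5.
k <- 1.5
lower_fence <- q[1] - k * iqr
upper_fence <- q[2] + k * iqr

iqr_outliers <- hp[hp < lower_fence | hp > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(iqr_outliers)
```

With k = 1.5 the upper fence is 305.25, so only 335 is flagged. The value 264 in the output above sits below that fence, which suggests the original code used a tighter rule; k = 1, for example, gives an upper fence of 263.5 and flags both 264 and 335.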
We can also use statistical tests like Dixon’s test and Rosner’s test to detect outliers:
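Dixon’s test suggests 335 is a potential outlier, while Rosner’s test does not detect any outliers at the default significance level of 0.05. Note that Dixon’s test is meant for small samples: if dixon.test from the outliers package is used, it accepts only 3 to 30 observations, so the 32 hp values have to be reduced to a subset first.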
Once we’ve identified potential outliers, we may want to remove them from the dataset before further analysis. Here are some ways to remove outliers:
The subset, filter, and logical operator approaches all create a new dataset without the outlier observations.
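The three approaches might look like this. The rule used to define an outlier (hp above the 1.5 × IQR upper fence) is an assumption for illustration, and the dplyr variant only runs if that package is installed.

```r
data(mtcars)

# Assumed outlier rule: hp above the 1.5 * IQR upper fence.
upper_fence <- quantile(mtcars$hp, 0.75, names = FALSE) + 1.5 * IQR(mtcars$hp)

# 1) subset(): keep the rows that satisfy the condition.
clean_subset <- subset(mtcars, hp <= upper_fence)

# 2) Logical indexing: the same filter written with [ ].
clean_index <- mtcars[mtcars$hp <= upper_fence, ]

# 3) dplyr::filter(), guarded so the script still runs without dplyr.
if (requireNamespace("dplyr", quietly = TRUE)) {
  clean_filter <- dplyr::filter(mtcars, hp <= upper_fence)
}

print(c(before = nrow(mtcars), after = nrow(clean_subset)))
```

None of these modify mtcars itself, so the original data stays available for comparison.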
Another common task is dealing with missing values encoded as NA in R. We will create a simulated dataset and introduce some missing values:
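A sketch of such a simulation is shown below. The distributions, the sample size, and the number of missing cells per column are all assumptions, so the values will not match the output printed later in this post.

```r
set.seed(123)  # fixed only so that the run is repeatable

n <- 100
sim <- data.frame(
  x = rnorm(n, mean = 50, sd = 10),                 # numeric
  y = sample(c("A", "B", "C"), n, replace = TRUE),  # categorical
  z = runif(n, min = 0, max = 100),                 # numeric
  stringsAsFactors = FALSE
)

# Blank out 10 randomly chosen cells in each column (an arbitrary count).
for (col in names(sim)) {
  sim[sample(n, 10), col] <- NA
}

# Number of missing values per column.
print(colSums(is.na(sim)))
```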
We can fill in the missing values in the numeric variables x and z with mean imputation, replacing each NA with the mean of the observed values in that column:
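A minimal sketch of mean imputation on a small stand-in data frame. The helper name impute_mean and the toy values are hypothetical; only the column names x and z come from the post.

```r
# Hypothetical helper: replace NAs with the mean of the observed values.
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Small stand-in data frame using the post's column names.
df <- data.frame(x = c(10, NA, 20, 30),
                 z = c(NA, 4, 6, NA))

df$x <- impute_mean(df$x)
df$z <- impute_mean(df$z)
print(df)
```

Mean imputation leaves each column's mean unchanged but shrinks its variance, because every filled-in cell gets the same value. The repeated 50.72672 and 49.308800 entries in the post's output are exactly these filled-in cells.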
For categorical variables like y, we can impute the mode:
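Base R has no built-in function for the statistical mode (mode() reports the storage type), so a small helper is needed. The sketch below uses a hypothetical helper named get_mode and a toy vector rather than the post's data.

```r
# Hypothetical helper: most frequent non-missing value.
# table() drops NA by default and sorts its names, and which.max() returns
# the first maximum, so ties go to the value that sorts first.
get_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

y <- c("B", "A", NA, "B", "C", NA, "A", "B")
y[is.na(y)] <- get_mode(y)
print(y)
```

The table that follows is the post's own output for its 100-row data frame after imputation.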
## x y z
## 1 44.39524 B 91.506376
## 2 47.69823 B 49.308800
## 3 65.58708 A 28.628535
## 4 50.70508 B 73.779740
## 5 51.29288 C 83.405431
## 6 67.15065 C 31.427078
## 7 54.60916 A 49.256655
## 8 37.34939 B 49.308800
## 9 43.13147 A 64.146235
## 10 45.54338 B 64.392291
## 11 62.24082 C 49.308800
## 12 53.59814 C 41.473533
## 13 54.00771 B 11.940478
## 14 51.10683 C 52.602966
## 15 44.44159 B 22.507335
## 16 67.86913 A 48.641176
## 17 54.97850 A 37.021480
## 18 30.33383 B 98.335018
## 19 57.01356 C 38.831912
## 20 45.27209 B 22.924484
## 21 39.32176 B 62.329755
## 22 47.82025 B 13.654020
## 23 39.73996 A 96.746949
## 24 42.71109 B 51.507181
## 25 43.74961 A 16.307033
## 26 33.13307 B 62.190230
## 27 58.37787 B 98.595417
## 28 51.53373 A 66.877152
## 29 38.61863 C 41.891590
## 30 50.72672 A 32.334499
## 31 54.26464 A 83.525532
## 32 47.04929 B 14.381704
## 33 58.95126 A 19.281595
## 34 58.78133 B 89.673868
## 35 50.72672 A 30.811955
## 36 56.88640 A 36.330054
## 37 55.53918 A 78.394648
## 38 49.38088 C 19.337868
## 39 46.94037 C 49.308800
## 40 46.19529 B 40.660787
## 41 43.05293 A 49.308800
## 42 47.92083 B 42.184495
## 43 37.34604 C 34.280880
## 44 71.68956 B 86.648331
## 45 62.07962 C 45.510805
## 46 38.76891 B 53.376487
## 47 50.72672 C 96.384333
## 48 45.33345 A 77.459154
## 49 57.79965 A 20.887635
## 50 50.72672 A 30.878683
## 51 52.53319 A 97.134245
## 52 49.71453 C 58.490009
## 53 49.57130 B 76.082363
## 54 63.68602 A 37.270939
## 55 47.74229 B 76.919391
## 56 65.16471 A 53.767718
## 57 34.51247 C 91.399545
## 58 55.84614 C 18.529644
## 59 51.23854 A 28.221842
## 60 52.15942 B 9.496241
## 61 53.79639 C 21.048708
## 62 44.97677 A 97.709900
## 63 46.66793 C 49.308800
## 64 39.81425 B 72.598303
## 65 39.28209 A 78.568783
## 66 53.03529 A 10.541775
## 67 50.72672 A 23.959463
## 68 50.53004 A 27.054487
## 69 59.22267 A 10.105849
## 70 70.50085 C 11.791384
## 71 45.08969 B 99.123656
## 72 26.90831 B 98.605430
## 73 60.05739 B 13.706747
## 74 42.90799 C 49.308800
## 75 43.11991 A 57.630184
## 76 50.72672 B 39.544886
## 77 47.15227 B 44.980248
## 78 37.79282 A 70.650190
## 79 51.81303 B 8.250275
## 80 48.61109 A 33.931258
## 81 50.72672 B 49.308800
## 82 50.72672 B 49.308800
## 83 46.29340 C 83.156860
## 84 56.44377 B 21.517208
## 85 47.79513 C 49.794894
## 86 53.31782 A 27.604967
## 87 60.96839 B 19.202332
## 88 54.35181 A 95.062126
## 89 50.72672 B 32.172554
## 90 61.48808 B 47.845638
## 91 59.93504 B 2.799257
## 92 55.48397 C 54.745947
## 93 52.38732 B 64.424022
## 94 50.72672 A 49.308800
## 95 63.60652 A 32.193738
## 96 43.99740 B 89.111431
## 97 71.87333 C 62.625695
## 98 65.32611 B 30.290492
## 99 47.64300 B 38.820466
## 100 39.73579 A 16.047509
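More sophisticated methods are available in packages such as mice and DMwR, but mean/mode imputation provides a simple option for treating missing values.
The mtcars dataset has a categorical variable, am, representing the transmission type (0 = automatic, 1 = manual). It is already stored as a number, but categorical variables held as text or factors, such as y in the simulated data, usually need to be encoded into numeric formats before modelling.
Common encoding methods include label encoding, one-hot encoding, and ordinal encoding: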
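The three encodings named by the y_label, y_onehot, and y_ordinal columns in the output can be sketched in base R as follows. The toy values and the assumed order A < B < C are illustrative, not taken from the original code.

```r
df <- data.frame(y = c("B", "B", "A", "C", "A"), stringsAsFactors = FALSE)

# Label encoding: integer codes 0, 1, 2 following the level order A, B, C.
y_factor <- factor(df$y, levels = c("A", "B", "C"))
df$y_label <- as.integer(y_factor) - 1L

# One-hot encoding: one 0/1 indicator column per level.
# The "- 1" removes the intercept so that every level gets its own column.
onehot <- model.matrix(~ y - 1, data = data.frame(y = y_factor))
df <- cbind(df, onehot)

# Ordinal encoding: an ordered factor, assuming the order A < B < C.
df$y_ordinal <- factor(df$y, levels = c("A", "B", "C"), ordered = TRUE)

print(df)
```

Here the indicator columns are named yA, yB, and yC. In the post's output below they appear as y_onehot.yA and so on, presumably because the indicator matrix was stored as a single data frame column.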
## x y z y_label y_onehot.yA y_onehot.yB y_onehot.yC y_ordinal
## 1 44.39524 B 91.506376 1 0 1 0 B
## 2 47.69823 B 49.308800 1 0 1 0 B
## 3 65.58708 A 28.628535 0 1 0 0 A
## 4 50.70508 B 73.779740 1 0 1 0 B
## 5 51.29288 C 83.405431 2 0 0 1 C
## 6 67.15065 C 31.427078 2 0 0 1 C
## 7 54.60916 A 49.256655 0 1 0 0 A
## 8 37.34939 B 49.308800 1 0 1 0 B
## 9 43.13147 A 64.146235 0 1 0 0 A
## 10 45.54338 B 64.392291 1 0 1 0 B
## 11 62.24082 C 49.308800 2 0 0 1 C
## 12 53.59814 C 41.473533 2 0 0 1 C
## 13 54.00771 B 11.940478 1 0 1 0 B
## 14 51.10683 C 52.602966 2 0 0 1 C
## 15 44.44159 B 22.507335 1 0 1 0 B
## 16 67.86913 A 48.641176 0 1 0 0 A
## 17 54.97850 A 37.021480 0 1 0 0 A
## 18 30.33383 B 98.335018 1 0 1 0 B
## 19 57.01356 C 38.831912 2 0 0 1 C
## 20 45.27209 B 22.924484 1 0 1 0 B
## 21 39.32176 B 62.329755 1 0 1 0 B
## 22 47.82025 B 13.654020 1 0 1 0 B
## 23 39.73996 A 96.746949 0 1 0 0 A
## 24 42.71109 B 51.507181 1 0 1 0 B
## 25 43.74961 A 16.307033 0 1 0 0 A
## 26 33.13307 B 62.190230 1 0 1 0 B
## 27 58.37787 B 98.595417 1 0 1 0 B
## 28 51.53373 A 66.877152 0 1 0 0 A
## 29 38.61863 C 41.891590 2 0 0 1 C
## 30 50.72672 A 32.334499 0 1 0 0 A
## 31 54.26464 A 83.525532 0 1 0 0 A
## 32 47.04929 B 14.381704 1 0 1 0 B
## 33 58.95126 A 19.281595 0 1 0 0 A
## 34 58.78133 B 89.673868 1 0 1 0 B
## 35 50.72672 A 30.811955 0 1 0 0 A
## 36 56.88640 A 36.330054 0 1 0 0 A
## 37 55.53918 A 78.394648 0 1 0 0 A
## 38 49.38088 C 19.337868 2 0 0 1 C
## 39 46.94037 C 49.308800 2 0 0 1 C
## 40 46.19529 B 40.660787 1 0 1 0 B
## 41 43.05293 A 49.308800 0 1 0 0 A
## 42 47.92083 B 42.184495 1 0 1 0 B
## 43 37.34604 C 34.280880 2 0 0 1 C
## 44 71.68956 B 86.648331 1 0 1 0 B
## 45 62.07962 C 45.510805 2 0 0 1 C
## 46 38.76891 B 53.376487 1 0 1 0 B
## 47 50.72672 C 96.384333 2 0 0 1 C
## 48 45.33345 A 77.459154 0 1 0 0 A
## 49 57.79965 A 20.887635 0 1 0 0 A
## 50 50.72672 A 30.878683 0 1 0 0 A
## 51 52.53319 A 97.134245 0 1 0 0 A
## 52 49.71453 C 58.490009 2 0 0 1 C
## 53 49.57130 B 76.082363 1 0 1 0 B
## 54 63.68602 A 37.270939 0 1 0 0 A
## 55 47.74229 B 76.919391 1 0 1 0 B
## 56 65.16471 A 53.767718 0 1 0 0 A
## 57 34.51247 C 91.399545 2 0 0 1 C
## 58 55.84614 C 18.529644 2 0 0 1 C
## 59 51.23854 A 28.221842 0 1 0 0 A
## 60 52.15942 B 9.496241 1 0 1 0 B
## 61 53.79639 C 21.048708 2 0 0 1 C
## 62 44.97677 A 97.709900 0 1 0 0 A
## 63 46.66793 C 49.308800 2 0 0 1 C
## 64 39.81425 B 72.598303 1 0 1 0 B
## 65 39.28209 A 78.568783 0 1 0 0 A
## 66 53.03529 A 10.541775 0 1 0 0 A
## 67 50.72672 A 23.959463 0 1 0 0 A
## 68 50.53004 A 27.054487 0 1 0 0 A
## 69 59.22267 A 10.105849 0 1 0 0 A
## 70 70.50085 C 11.791384 2 0 0 1 C
## 71 45.08969 B 99.123656 1 0 1 0 B
## 72 26.90831 B 98.605430 1 0 1 0 B
## 73 60.05739 B 13.706747 1 0 1 0 B
## 74 42.90799 C 49.308800 2 0 0 1 C
## 75 43.11991 A 57.630184 0 1 0 0 A
## 76 50.72672 B 39.544886 1 0 1 0 B
## 77 47.15227 B 44.980248 1 0 1 0 B
## 78 37.79282 A 70.650190 0 1 0 0 A
## 79 51.81303 B 8.250275 1 0 1 0 B
## 80 48.61109 A 33.931258 0 1 0 0 A
## 81 50.72672 B 49.308800 1 0 1 0 B
## 82 50.72672 B 49.308800 1 0 1 0 B
## 83 46.29340 C 83.156860 2 0 0 1 C
## 84 56.44377 B 21.517208 1 0 1 0 B
## 85 47.79513 C 49.794894 2 0 0 1 C
## 86 53.31782 A 27.604967 0 1 0 0 A
## 87 60.96839 B 19.202332 1 0 1 0 B
## 88 54.35181 A 95.062126 0 1 0 0 A
## 89 50.72672 B 32.172554 1 0 1 0 B
## 90 61.48808 B 47.845638 1 0 1 0 B
## 91 59.93504 B 2.799257 1 0 1 0 B
## 92 55.48397 C 54.745947 2 0 0 1 C
## 93 52.38732 B 64.424022 1 0 1 0 B
## 94 50.72672 A 49.308800 0 1 0 0 A
## 95 63.60652 A 32.193738 0 1 0 0 A
## 96 43.99740 B 89.111431 1 0 1 0 B
## 97 71.87333 C 62.625695 2 0 0 1 C
## 98 65.32611 B 30.290492 1 0 1 0 B
## 99 47.64300 B 38.820466 1 0 1 0 B
## 100 39.73579 A 16.047509 0 1 0 0 A
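In this post, we explored the mtcars dataset in R to understand exploratory data analysis techniques like making boxplots and histograms, identifying and removing outliers, imputing missing values, and encoding categorical variables.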
The mtcars dataset provides a nice way to get hands-on practice with cleaning and preparing data for analysis. The same principles can be applied to other real-world datasets.
Some next steps for a more thorough analysis:
I hope you found this example useful! Let me know if you have any other questions.