Pre-processing in R

Here we perform some basic data pre-processing. First, let us read the data and summarize each column.

cars <- read.csv("cars.csv")
summary(cars)
##      Sl..No         mpg          cylinders     cubicinches          hp       
##  Min.   :  1   Min.   :10.00   Min.   :3.00   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.: 66   1st Qu.:16.90   1st Qu.:4.00   1st Qu.:101.0   1st Qu.: 75.0  
##  Median :131   Median :22.00   Median :6.00   Median :156.0   Median : 95.0  
##  Mean   :131   Mean   :23.14   Mean   :5.59   Mean   :201.1   Mean   :106.4  
##  3rd Qu.:196   3rd Qu.:28.80   3rd Qu.:8.00   3rd Qu.:302.0   3rd Qu.:138.0  
##  Max.   :261   Max.   :46.60   Max.   :8.00   Max.   :455.0   Max.   :230.0  
##    weightlbs      time.to.60         year         brand          
##  Min.   :1613   Min.   : 8.00   Min.   :1971   Length:261        
##  1st Qu.:2246   1st Qu.:14.00   1st Qu.:1974   Class :character  
##  Median :2835   Median :16.00   Median :1977   Mode  :character  
##  Mean   :3005   Mean   :15.55   Mean   :1977                     
##  3rd Qu.:3664   3rd Qu.:17.00   3rd Qu.:1980                     
##  Max.   :4997   Max.   :25.00   Max.   :1983
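
Before looking for outliers, it can help to confirm the column types and to count missing values, since both affect the plots that follow. The sketch below is only an illustration: it uses a small made-up data frame called toy (a hypothetical stand-in, not the contents of cars.csv) so that it runs on its own.

```r
# Toy stand-in for the cars data (all values are made up for illustration)
toy <- data.frame(
  mpg       = c(14.0, 31.9, 17.0, NA, 22.0),
  cylinders = c(8, 4, 8, 6, 4),
  brand     = c("A", "B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Column types: numeric columns versus character columns
col_types <- sapply(toy, class)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

print(col_types)
print(na_counts)
```

The same two calls could be applied to cars directly once the file has been read.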

GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

Histogram

Here we look for outliers using a histogram, which is an informal way to spot unusual values within a single variable. Bars that sit far from the bulk of the distribution, separated by empty bins, point to candidate outliers.

hist(cars$weightlbs,
     breaks = 30,
     xlim = c(1500, 5000),
     col = "light blue",
     border = "black",
     ylim = c(0, 20),
     xlab = "Weight",
     ylab = "Counts",
     main = "Histogram of Car Weights")
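
The bin counts behind a histogram can also be inspected numerically, which makes isolated bars easier to confirm. The following is a minimal sketch on made-up weights (the vector weights and the 500-unit bin width are assumptions for illustration, not values from cars.csv).

```r
# Made-up weights (hypothetical values, not taken from cars.csv)
weights <- c(1800, 2100, 2150, 2300, 2900, 3100, 3150, 3600, 4900)

# plot = FALSE returns the binning without drawing anything
h <- hist(weights, breaks = seq(1500, 5000, by = 500), plot = FALSE)

# One row per bin: lower edge, upper edge, and count
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = tail(h$breaks, -1),
                   count = h$counts)
print(bins)
```

In this toy example the last bin is separated from the rest by an empty bin, which is the kind of gap that suggests a value worth checking.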

Scatterplot

Here we use a scatter plot to look for outliers in the relationship between two variables.

plot(cars$weightlbs,
     cars$mpg,
     xlim = c(1500, 5000),
     ylim = c(0, 50),
     xlab = "Weight",
     ylab = "MPG",
     main = "Scatterplot of MPG by Weight",
     type = "p",
     pch = 20,
     col = "blue")

Results: scatter plot

The scatter plot shows that most of the data points cluster together and that no point stands clearly apart from the rest, so there is no evidence of outliers in the relationship between mpg and weight.

Box Plot

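Here we create box plots, which flag outliers with an IQR-based rule. By default, R draws any point lying more than 1.5 times the interquartile range beyond the box as an individual point.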

boxplot(mpg ~ cylinders, data = cars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")
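
The IQR-based rule behind the box plot can be reproduced by hand. This is a sketch on made-up mpg values (the vector x is hypothetical). It uses fivenum() because the box plot is built on Tukey's hinges, which can differ slightly from the quartiles that quantile() returns.

```r
# Made-up mpg values (hypothetical), with one deliberately extreme value
x <- c(14, 16, 18, 20, 22, 24, 26, 28, 30, 58)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
lower_hinge <- five[2]
upper_hinge <- five[4]
spread <- upper_hinge - lower_hinge

# Fences at 1.5 times the hinge spread (the default coef in boxplot.stats)
lower_fence <- lower_hinge - 1.5 * spread
upper_fence <- upper_hinge + 1.5 * spread

# Values outside the fences are the ones a box plot draws as single points
manual_out <- x[x < lower_fence | x > upper_fence]
print(manual_out)

# Same result without any manual arithmetic
print(boxplot.stats(x)$out)
```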

Box plot of MPG by brand

box <- boxplot(mpg ~ brand, data = cars, xlab = "brand",
        ylab = "Miles Per Gallon", main = "Mileage Data")
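
The object returned by boxplot() holds more than the drawing. For grouped box plots, its out, group, and names components can be combined to see which group each flagged value belongs to. A minimal sketch on made-up data (the data frame toy and its brand labels are hypothetical):

```r
# Made-up data (hypothetical values): two brands, one extreme mpg in brand "B"
toy <- data.frame(
  mpg   = c(20, 21, 22, 23, 24, 30, 31, 32, 33, 60),
  brand = rep(c("A", "B"), each = 5)
)

# plot = FALSE returns the same list that boxplot() returns after drawing
box <- boxplot(mpg ~ brand, data = toy, plot = FALSE)

# Pair each flagged value with the name of the group it came from
flagged <- data.frame(brand = box$names[box$group], mpg = box$out)
print(flagged)
```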

Box plot of MPG by HP

box <- boxplot(mpg ~ hp, data = cars, xlab = "hp",
        ylab = "Miles Per Gallon", main = "Mileage Data")

Box plot for mileage only

# $out keeps only the values flagged as outliers, so box is a numeric vector
box <- boxplot(cars$mpg, xlab = "box",
               ylab = "Miles Per Gallon", main = "Mileage Data")$out

New data: creating a dataset with outliers

Run library(mice) separately. Here we copy the data and overwrite two mpg values (column 2, rows 8 and 18) with artificially high values.

cars1 <- cars
cars1[8, 2] <- 56
cars1[18, 2] <- 58

Box plot of mpg for all cars

This shows the outliers (if any) in mpg across all cars.

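# Copy the data and inject two artificially high mpg values
cars1 <- cars
cars1[8, 2] <- 56
cars1[18, 2] <- 58
# Draw the box plot and keep only the flagged values
box1 <- boxplot(cars1$mpg)$out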

The rows that contain the outliers are given below:

outliers_data1 <- cars1[which(cars1$mpg %in% box1),]
outliers_data1
##    Sl..No mpg cylinders cubicinches  hp weightlbs time.to.60 year brand
## 8       8  56         8         440 215      4312          9 1971   US.
## 18     18  58         8         304 150      3433         12 1971   US.
data2 <- cars1[-which(cars1$mpg %in% box1),]

Deleting the outliers from the modified data (cars1) using -which

# Note: -which() only behaves as intended when at least one outlier was found
cars2 <- cars1[-which(cars1$mpg %in% box1), ]
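
One caveat with the -which idiom: when no value is flagged, which() returns an empty vector and the subset drops every row instead of keeping them all. Logical negation avoids this. The sketch below shows the difference on a made-up data frame (toy is hypothetical and contains no outliers by construction).

```r
# Made-up data with no extreme values (hypothetical)
toy <- data.frame(id = 1:6, mpg = c(20, 22, 24, 26, 28, 30))

# Nothing is flagged here, so out_vals is an empty numeric vector
out_vals <- boxplot.stats(toy$mpg)$out

# -which() on an empty match selects zero rows
dropped_all <- toy[-which(toy$mpg %in% out_vals), ]

# Logical negation keeps every row that is not an outlier
kept_all <- toy[!(toy$mpg %in% out_vals), ]

print(nrow(dropped_all))
print(nrow(kept_all))
```

With cars1 the two forms give the same result, because two outliers are present; the logical form is simply the safer default when the same code is reused on other data.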