Pre-processing in R

Here we do some basic data pre-processing, focusing on finding and removing outliers.

setwd("/Users/sajithamarnath/Documents/IIM/R/Dataset")
cars <- read.csv("cars.csv")

GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

We first plot a histogram of car weights to look at the distribution

hist(cars$weightlbs,
     breaks = 30,
     xlim = c(1500, 5000),
     col = "blue",
     border = "black",
     ylim = c(0, 20),
     xlab = "Weight",
     ylab = "Counts",
     main = "Histogram of Car Weights")

Scatter plot

We examine a scatter plot of MPG against weight to look for outliers

plot(cars$weightlbs,
     cars$mpg,
     xlim = c(1500, 5000),
     ylim = c(0, 50),
     xlab = "Weight",
     ylab = "MPG",
     main = "Scatterplot of MPG by Weight",
     type = "p",
     pch = 1,
     col = "blue")

Results: Scatter plot

In the scatter plot, most of the data points cluster together along the overall trend, and there are not many outliers.

Box plot by number of cylinders

Here we create a box plot of mileage for each cylinder count

boxplot(mpg ~ cylinders, data = cars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")

Box plot by brand

Here we create a box plot of mileage for each brand

box <- boxplot(mpg ~ brand, data = cars, xlab = "brand",
        ylab = "Miles Per Gallon", main = "Mileage Data")

Box plot for mileage only

Here we create a box plot of mpg alone and keep the values it flags as outliers ($out)

box <- boxplot(cars$mpg,  xlab = "box",
               ylab = "Miles Per Gallon", main = "Mileage Data")$out
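
The `$out` component holds the values that `boxplot()` draws as individual points. With the default settings, these are the values lying more than 1.5 times the hinge spread (roughly the interquartile range) beyond the hinges. The following is a minimal sketch of that rule on a made-up vector; the values in `x` are invented for illustration and do not come from cars.csv. It uses `boxplot.stats()`, which computes the same statistics without drawing a plot.

```r
# Toy vector, invented for illustration only (not from cars.csv)
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# boxplot.stats() returns the statistics boxplot() uses, without plotting
stats <- boxplot.stats(x, coef = 1.5)

hinges <- stats$stats[c(2, 4)]          # lower and upper hinge (roughly Q1 and Q3)
spread <- diff(hinges)                  # hinge spread (roughly the IQR)
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

# Values outside the fences are the ones reported in $out
manual_out <- x[x < lower_fence | x > upper_fence]
stats$out    # 40
manual_out   # 40
```

Computing the fences by hand like this makes it easier to see why a given value was flagged, and the `coef` argument can be raised if the 1.5 multiplier is too strict for the data at hand.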

New Data

Creating a copy of the data with two artificial outliers

cars1 <- cars

# Overwrite two mpg values with unrealistically high numbers (column 2 is mpg)
cars1[8, 2] <- 56
cars1$mpg[19] <- 59
box1 <- boxplot(cars1$mpg,  xlab = "box",
               ylab = "Miles Per Gallon", main = "Mileage Data")$out

Checking outliers for new data

# Compute the outlying values without drawing the plot
box1 <- boxplot(cars1$mpg, plot = FALSE)$out
#edit(cars1)
outliers_data <- cars1[which(cars1$mpg %in% box1),]
outliers_data
##    Sl..No mpg cylinders cubicinches  hp weightlbs time.to.60 year   brand
## 8       8  56         8         440 215      4312          9 1971     US.
## 19     19  59         4         113  95      2278         16 1973  Japan.
box1 <- boxplot(cars1$mpg)$out

write.csv(outliers_data, "outlier.csv")

Deleting outliers from data

# Keep only the rows whose mpg is not among the flagged outliers
cars2 <- cars1[!(cars1$mpg %in% box1), ]
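
A note on how rows are dropped in this step: negative indexing with `which()` behaves badly when no value is flagged, because `which()` then returns an empty index and the subset ends up with zero rows. Negating the logical vector keeps every row in that case. A minimal sketch on a toy data frame (the names `df` and `no_outliers` are hypothetical and the values are invented):

```r
# Toy data frame, invented for illustration (not the cars data)
df <- data.frame(id = 1:5, mpg = c(20, 22, 25, 21, 23))
no_outliers <- numeric(0)   # pretend boxplot(...)$out found nothing

# which() returns integer(0), and df[-integer(0), ] selects zero rows
dropped_all <- df[-which(df$mpg %in% no_outliers), ]
nrow(dropped_all)   # 0

# Logical negation keeps every row when nothing matches
kept_all <- df[!(df$mpg %in% no_outliers), ]
nrow(kept_all)      # 5
```

When at least one outlier is present, both forms give the same result, so the logical form is simply the safer habit.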

summary(cars2)
##      Sl..No           mpg          cylinders      cubicinches   
##  Min.   :  1.0   Min.   :10.00   Min.   :3.000   Min.   : 68.0  
##  1st Qu.: 67.5   1st Qu.:16.95   1st Qu.:4.000   1st Qu.: 99.5  
##  Median :132.0   Median :22.00   Median :6.000   Median :156.0  
##  Mean   :131.9   Mean   :23.18   Mean   :5.587   Mean   :200.5  
##  3rd Qu.:196.5   3rd Qu.:28.90   3rd Qu.:8.000   3rd Qu.:302.0  
##  Max.   :261.0   Max.   :46.60   Max.   :8.000   Max.   :455.0  
##        hp          weightlbs      time.to.60         year     
##  Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   :1971  
##  1st Qu.: 75.0   1st Qu.:2246   1st Qu.:14.00   1st Qu.:1974  
##  Median : 95.0   Median :2835   Median :16.00   Median :1977  
##  Mean   :106.0   Mean   :3003   Mean   :15.57   Mean   :1977  
##  3rd Qu.:137.5   3rd Qu.:3654   3rd Qu.:17.00   3rd Qu.:1980  
##  Max.   :230.0   Max.   :4997   Max.   :25.00   Max.   :1983  
##     brand          
##  Length:259        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
box2 <- boxplot(cars2$mpg)$out
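
Re-running the box plot on the cleaned data is a useful check, because the hinges and fences are recomputed once the extreme values are gone, and a value that was previously inside the fences can then be flagged. A minimal sketch of this effect on a made-up vector (the values are invented and are not from cars.csv):

```r
# Toy vector, invented for illustration only
x <- c(10, 11, 12, 13, 14, 15, 16, 24, 50, 55)

# First pass: the two large values stretch the upper hinge, so 24 is not flagged
first_out <- boxplot.stats(x)$out
first_out    # 50 55

# Remove the flagged values and check again
x_clean <- x[!(x %in% first_out)]
second_out <- boxplot.stats(x_clean)$out
second_out   # 24
```

Whether to remove such second-round values is a judgment call that depends on the data; the sketch only shows why the second check can return something.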

Result:

The two artificial outliers have been removed: cars2 has 259 rows and its maximum mpg is 46.60.