This section walks through some basic data pre-processing: loading the data, plotting it, and handling outliers.
setwd("/Users/sajithamarnath/Documents/IIM/R/Dataset")
cars <- read.csv("cars.csv")
hist(cars$weightlbs,
breaks = 30,
xlim = c(1500, 5000),
col = "blue",
border = "black",
ylim = c(0, 20),
xlab = "Weight",
ylab = "Counts",
main = "Histogram of Car Weights")
Next, we examine a scatter plot of MPG against weight to check for outliers.
plot(cars$weightlbs,
cars$mpg,
xlim = c(1500, 5000),
ylim = c(0, 50),
xlab = "Weight",
ylab = "MPG",
main = "Scatterplot of MPG by Weight",
type = "p",
pch = 1,
col = "blue")
The scatter plot shows that most data points lie close to the bulk of the data, with few apparent outliers.
Here we create a box plot of MPG by number of cylinders.
boxplot(mpg ~ cylinders, data = cars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
Here we create a box plot of MPG by brand.
box <- boxplot(mpg ~ brand, data = cars, xlab = "brand",
ylab = "Miles Per Gallon", main = "Mileage Data")
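When a grouped box plot is stored in a variable, as with `box` here, the returned list also records which values were flagged and in which group. The sketch below shows one way to read that out. The data frame `demo` and its values are hypothetical and are not taken from cars.csv; `plot = FALSE` is used only to skip drawing.

```r
# Hypothetical data: two brands, one unusually high mpg value in brand "A"
demo <- data.frame(
  mpg   = c(20, 21, 22, 23, 40, 30, 31, 32, 33, 34),
  brand = rep(c("A", "B"), each = 5)
)

# Compute the box plot statistics without drawing the plot
res <- boxplot(mpg ~ brand, data = demo, plot = FALSE)

# res$out holds the flagged values; res$group holds the index of the
# group each flagged value belongs to; res$names holds the group labels
flagged <- data.frame(brand = res$names[res$group], mpg = res$out)
flagged
```

The same components should be available on `box` for the real data, so a similar lookup can show which brands contribute the flagged values.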
Here we create a box plot of MPG for all cars and keep the outlier values it reports.
box <- boxplot(cars$mpg, xlab = "box",
ylab = "Miles Per Gallon", main = "Mileage Data")$out
Next, we create a copy of the data and inject two outliers into mpg.
cars1 <- cars
cars1[8,2] <- 56
cars1$mpg[19] <- 59
box1 <- boxplot(cars1$mpg, xlab = "box",
ylab = "Miles Per Gallon", main = "Mileage Data")$out
box1 <- boxplot(cars1$mpg, plot = FALSE)$out
#edit(cars1)
outliers_data <- cars1[which(cars1$mpg %in% box1),]
outliers_data
##    Sl..No mpg cylinders cubicinches  hp weightlbs time.to.60 year  brand
## 8       8  56         8         440 215      4312          9 1971    US.
## 19     19  59         4         113  95      2278         16 1973 Japan.
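The `$out` component holds the values that `boxplot()` treats as outliers. With the default settings, these are points lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The following is a minimal sketch of that rule using `boxplot.stats()`; the vector `mpg_demo` is made up for illustration and is not taken from cars.csv.

```r
# Hypothetical mpg-like values, with two deliberately extreme entries
mpg_demo <- c(14, 16, 18, 21, 22, 24, 27, 29, 31, 56, 59)

# boxplot.stats() is the function boxplot() relies on for its statistics
stats <- boxplot.stats(mpg_demo, coef = 1.5)

# Elements 2 and 4 of stats$stats are the lower and upper hinges
hinges <- stats$stats[c(2, 4)]
fence_width <- 1.5 * diff(hinges)
lower_fence <- hinges[1] - fence_width
upper_fence <- hinges[2] + fence_width

# Values outside the fences are the ones reported in $out
manual_out <- mpg_demo[mpg_demo < lower_fence | mpg_demo > upper_fence]
identical(manual_out, stats$out)
```

Changing `coef` (or the `range` argument of `boxplot()`) widens or narrows the fences, which changes how many points are flagged.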
box1 <- boxplot(cars1$mpg)$out
write.csv(outliers_data, "outlier.csv")
# Logical negation keeps every row when nothing is flagged; -which() would drop them all
cars2 <- cars1[!(cars1$mpg %in% box1), ]
summary(cars2)
##      Sl..No           mpg          cylinders      cubicinches
##  Min.   :  1.0   Min.   :10.00   Min.   :3.000   Min.   : 68.0
##  1st Qu.: 67.5   1st Qu.:16.95   1st Qu.:4.000   1st Qu.: 99.5
##  Median :132.0   Median :22.00   Median :6.000   Median :156.0
##  Mean   :131.9   Mean   :23.18   Mean   :5.587   Mean   :200.5
##  3rd Qu.:196.5   3rd Qu.:28.90   3rd Qu.:8.000   3rd Qu.:302.0
##  Max.   :261.0   Max.   :46.60   Max.   :8.000   Max.   :455.0
##        hp          weightlbs      time.to.60         year
##  Min.   : 46.0   Min.   :1613   Min.   : 8.00   Min.   :1971
##  1st Qu.: 75.0   1st Qu.:2246   1st Qu.:14.00   1st Qu.:1974
##  Median : 95.0   Median :2835   Median :16.00   Median :1977
##  Mean   :106.0   Mean   :3003   Mean   :15.57   Mean   :1977
##  3rd Qu.:137.5   3rd Qu.:3654   3rd Qu.:17.00   3rd Qu.:1980
##  Max.   :230.0   Max.   :4997   Max.   :25.00   Max.   :1983
##     brand
##  Length:259
##  Class :character
##  Mode  :character
##
##
##
box2 <- boxplot(cars2$mpg)$out
We now have a dataset, cars2, with the flagged outliers removed.
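One thing to watch when removing rows by index: `-which(...)` drops every row if nothing matches, whereas logical negation with `!` keeps them all. The sketch below illustrates the difference on a small hypothetical data frame (`demo` and `flagged` are illustrative names, not part of the analysis above).

```r
# Hypothetical data frame, for illustration only
demo <- data.frame(id = 1:5, mpg = c(18, 22, 25, 27, 30))

# Pretend the box plot flagged no values at all
flagged <- numeric(0)

# which() returns integer(0), and negating it is still integer(0),
# so this selects zero rows instead of keeping everything
dropped_all <- demo[-which(demo$mpg %in% flagged), ]

# Logical indexing keeps all rows when nothing is flagged
kept_all <- demo[!(demo$mpg %in% flagged), ]

nrow(dropped_all)
nrow(kept_all)
```

When at least one value is flagged, both forms return the same rows, so the logical form is the safer default.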