Histograms and Outliers

Histogram

h = hist(data$annual_inc, main="Histogram of Annual Income", xlab="Annual Income")

Changing the value of breaks can give a more informative histogram. A common rule of thumb is to set the number of breaks to the square root of the number of rows in the dataset. Even after doing this, most of the histogram is still empty.

n_breaks <- sqrt(nrow(data))
h = hist(data$annual_inc, main="Histogram of Annual Income", xlab="Annual Income",breaks = n_breaks)
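As a small illustration of the square-root rule, the sketch below uses a simulated income vector (income_demo is made up for this example and is not the loan data) and rounds the result up to a whole number. Keep in mind that when breaks is a single number, hist() treats it only as a suggestion and chooses "pretty" breakpoints, so the number of bins actually drawn may differ from the number requested.

```r
# Simulated stand-in for data$annual_inc (hypothetical values, illustration only)
set.seed(1)
income_demo <- rlnorm(400, meanlog = 11, sdlog = 0.5)

# Square-root rule of thumb, rounded up to a whole number of bins
n_breaks_demo <- ceiling(sqrt(length(income_demo)))

# plot = FALSE returns the histogram object without drawing it
h_demo <- hist(income_demo, breaks = n_breaks_demo, plot = FALSE)

n_breaks_demo            # number of bins suggested to hist()
length(h_demo$counts)    # number of bins hist() actually used
```

Comparing the two printed numbers shows how far the drawn histogram is from the requested bin count.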

Next, we use a scatter plot to see what is going on. The plot shows one huge salary of 6M dollars, whereas none of the others is bigger than 2M dollars. We consider this value an outlier.

plot(data$annual_inc, xlab="Annual Income", main="Scatter Plot of Annual Income")
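What the plot suggests can also be checked numerically by looking at the largest values directly. A minimal sketch on a made-up vector (inc_demo is hypothetical; with the real data you would pass data$annual_inc instead):

```r
# Hypothetical incomes with one extreme value (not the real data)
inc_demo <- c(45000, 60000, 52000, 6000000, 80000, 1200000)

# Position and size of the largest value
top_index <- which.max(inc_demo)
top_value <- max(inc_demo)

# The three largest values, in decreasing order
top_three <- head(sort(inc_demo, decreasing = TRUE), 3)

top_index
top_value
top_three
```

A large gap between the first and second entries of top_three is the numerical counterpart of the isolated point in the scatter plot.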

Outliers

Outliers are values that are very different from the others. There are three ways to decide whether a specific value is an outlier:

1. Expert judgement

# Cutoff of 3M dollars: above every ordinary salary (all under 2M) but below the 6M value
index_outlier_expert = which(data$annual_inc > 3000000)
data_expert = data[-index_outlier_expert,]
hist(data_expert$annual_inc, main="Histogram of Annual Income", xlab="Annual Income")
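One caveat with the which() plus negative-index pattern: if no value exceeds the cutoff, which() returns an empty vector, and negative indexing with an empty vector selects zero rows rather than all of them. The sketch below shows this on a tiny made-up data frame (toy is hypothetical) together with a logical-index alternative. The alternative assumes there are no missing incomes, since NA comparisons would produce NA rows.

```r
# Tiny hypothetical data frame in which no income exceeds the cutoff
toy <- data.frame(annual_inc = c(40000, 55000, 70000))

# which() finds nothing, so idx is integer(0)
idx <- which(toy$annual_inc > 3000000)

# Negative indexing with an empty vector drops every row
dropped <- toy[-idx, , drop = FALSE]
nrow(dropped)

# Logical indexing keeps all rows that are at or below the cutoff
kept <- toy[toy$annual_inc <= 3000000, , drop = FALSE]
nrow(kept)
```

With the income data used here the pattern works because at least one value is above the cutoff, but the logical form is safer if the cutoff or the data change.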

2. Rules of thumb

# Cutoff: third quartile plus 1.5 times the interquartile range (IQR)
outlier_cutoff = quantile(data$annual_inc,0.75) + 1.5 * IQR(data$annual_inc)
index_outlier_ROT = which(data$annual_inc>outlier_cutoff)
data_ROT = data[-index_outlier_ROT,]
hist(data_ROT$annual_inc, main="Histogram of Annual Income", xlab="Annual Income")
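To make the arithmetic of this rule concrete, here is a worked sketch on a small made-up vector (inc_demo is hypothetical, in thousands of dollars). It uses R's default quantile definition. Only the upper cutoff is computed, as in the code above. The matching lower cutoff would be the first quartile minus 1.5 times the IQR.

```r
# Hypothetical incomes in thousands of dollars (not the real data)
inc_demo <- c(10, 20, 30, 40, 50, 60, 70, 80, 1000)

q3  <- quantile(inc_demo, 0.75)   # third quartile
iqr <- IQR(inc_demo)              # third quartile minus first quartile

# Upper cutoff; unname() drops the "75%" label that quantile() attaches
cutoff_demo <- unname(q3 + 1.5 * iqr)

# Values flagged as outliers by the rule
flagged <- inc_demo[inc_demo > cutoff_demo]

cutoff_demo
flagged
```

Here the third quartile is 70 and the IQR is 40, so the cutoff is 70 + 1.5 * 40 = 130, and only the value 1000 is flagged.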

3. Combination of both