Measures of central tendency answer the most basic question about a variable: which value is the most “typical”? The mean, the median, and the mode each give a different answer.
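# Base R has no function for the statistical mode, so define one.
# Ties are resolved in favour of the value that appears first in x.
my_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}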
mean(numeric_train$SalePrice)
## [1] 180921.2
median(numeric_train$SalePrice)
## [1] 163000
my_mode(numeric_train$SalePrice)
## [1] 140000
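As a small illustration of how the three measures react to an extreme value, the sketch below uses a made-up vector (`toy_prices` is hypothetical, not part of the housing data) and repeats the `my_mode()` definition so that it runs on its own.

```r
# Hypothetical toy data: five ordinary prices and one extreme one
toy_prices <- c(100, 120, 120, 130, 150, 900)

# Same helper as above, repeated so this snippet is self-contained
my_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

toy_mean   <- mean(toy_prices)     # pulled upward by the 900
toy_median <- median(toy_prices)   # middle of the sorted values
toy_mode   <- my_mode(toy_prices)  # most frequent value
```

The single large value drags the mean to about 253, while the median (125) and the mode (120) stay close to the bulk of the data. SalePrice shows the same ordering: mean above median above mode.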
The variability measures describe how widely the data are dispersed.
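## Minimum sale price
min(numeric_train$SalePrice)
## [1] 34900
## Maximum sale price
max(numeric_train$SalePrice)
## [1] 755000
## Range of sale price
range(numeric_train$SalePrice)
## [1] 34900 755000
diff(range(numeric_train$SalePrice))
## [1] 720100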
## Percentiles
## fivenum() returns Tukey's five-number summary: minimum, lower hinge, median, upper hinge, and maximum
fivenum(numeric_train$SalePrice)
## [1] 34900 129950 163000 214000 755000
# quantile() returns the sample quantiles for a given set of probabilities
quantile(numeric_train$SalePrice, probs = c(0.1,0.45,0.67,0.98))
## 10% 45% 67% 98%
## 106475.0 155000.0 191000.0 394931.1
IQR(numeric_train$SalePrice)
## [1] 84025
## summary() reports the minimum, quartiles, median, mean, and maximum in one call. Its 1st and 3rd quartiles come from quantile(), so they can differ slightly from the hinges returned by fivenum() (129975 vs 129950 here)
summary(numeric_train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
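The gap between the hinges from `fivenum()` and the quartiles from `quantile()` (which `summary()` uses) is easiest to see on a tiny vector. The following is a minimal sketch on made-up data; `x`, `hinges`, and `quartiles` are illustrative names only.

```r
# Hypothetical toy data with an even number of observations
x <- 1:8

# Tukey's hinges: medians of the lower and upper halves of the data
hinges <- fivenum(x)[c(2, 4)]

# Quartiles with quantile()'s default interpolation (type = 7)
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

hinges     # 2.5 6.5
quartiles  # 2.75 6.25
```

On large samples such as SalePrice the two definitions usually land close together, which is why the difference there is only 25.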
var(numeric_train$SalePrice)
## [1] 6311111264
sd(numeric_train$SalePrice)
## [1] 79442.5
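# mad() always takes a median of absolute deviations and scales it by 1.4826.
# With center = mean(x) it returns the scaled median of the absolute deviations from the mean, not the mean absolute deviation.
mad(numeric_train$SalePrice,
    center = mean(numeric_train$SalePrice),
    na.rm = TRUE)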
## [1] 68082.77
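# Median absolute deviation. The default constant = 1.4826 rescales the result; pass constant = 1 for the raw median of absolute deviations.
mad(numeric_train$SalePrice,
    center = median(numeric_train$SalePrice),
    na.rm = TRUE)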
## [1] 56338.8
\[ \text{Mean absolute deviation} = \frac{1}{n}\sum_{i=1}^n|x_i-\mu| \]
\[ \text{Median absolute deviation} = \operatorname{median}(|x_i-\operatorname{median}(x)|) \]
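Because `mad()` always takes a median and applies a scaling constant of 1.4826 by default, it does not reproduce either formula out of the box. A minimal base R sketch on made-up data (all variable names here are illustrative) shows how the pieces relate:

```r
# Hypothetical toy data with one extreme value
x <- c(1, 2, 3, 4, 100)

# First formula: mean of the absolute deviations from the mean
mean_abs_dev <- mean(abs(x - mean(x)))

# Second formula: median of the absolute deviations from the median
median_abs_dev <- median(abs(x - median(x)))

# mad() matches the second formula only when the scaling constant is 1
mad_raw     <- mad(x, constant = 1)
mad_default <- mad(x)  # 1.4826 * median_abs_dev
```

The default constant makes `mad()` comparable to the standard deviation for normally distributed data, so it is worth stating which version is being reported.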
Two further measures of a distribution's shape are skewness and kurtosis. Skewness measures asymmetry. Negative values indicate a left-skewed distribution: the extreme values lie to the left, which typically pulls the mean below the median. Positive values indicate a right-skewed distribution: the extreme values lie to the right, which typically pulls the mean above the median. Kurtosis measures how heavy the tails are relative to a normal distribution.
## Negative values indicate a left-skewed distribution (mean below median); positive values indicate a right-skewed distribution (mean above median)
skewness(numeric_train$SalePrice,
         na.rm = TRUE)
## [1] 1.880941
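## Assuming kurtosis() comes from the moments package (the library call is not shown here), it returns Pearson's kurtosis, for which a normal distribution scores 3.
## Values below 3 indicate a platykurtic distribution and values above 3 a leptokurtic one; subtract 3 to get excess kurtosis, where the threshold is 0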
kurtosis(numeric_train$SalePrice)
## [1] 9.509812
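To make the two statistics less of a black box, here is a minimal base R sketch of the moment-based definitions. It assumes the plain (biased) central moments, which is what the moments package appears to use; `skew_manual()` and `kurt_manual()` are hypothetical helper names, and other packages apply different corrections, so results may differ slightly.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skew_manual <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (Pearson) kurtosis: fourth central moment over the squared second
kurt_manual <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

symmetric    <- c(1, 2, 3, 4, 5)    # hypothetical symmetric data
right_skewed <- c(1, 2, 3, 4, 100)  # hypothetical data with a long right tail

skew_manual(symmetric)     # 0: no asymmetry
skew_manual(right_skewed)  # positive: right-skewed
kurt_manual(symmetric)     # 1.7: below 3, flatter than a normal distribution
```

Under this definition a kurtosis of about 9.5 for SalePrice points to much heavier tails than a normal distribution would have.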
Outliers can distort a model's predictions and reduce their accuracy. Consequently, it's important to understand whether outliers are present and, if so, which observations they are.
outlier(numeric_train$SalePrice)
## [1] 755000
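# Outliers based on z-scores (scores() also supports "t", "chisq", "iqr" and "mad")
# |z| > 1.96 flags values outside the central 95% of a normal distribution
z_scores <- scores(numeric_train$SalePrice, type = "z")
outliers_z <- which(abs(z_scores) > 1.96)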
## Outliers based on the 1.5 * IQR rule
outliers_iqr <- which(scores(numeric_train$SalePrice,
                             type = "iqr", lim = 1.5))
## rm.outlier() handles only the single most extreme value; fill = TRUE replaces it with the mean (add median = TRUE to use the median instead)
NO_numeric_train <- rm.outlier(numeric_train$SalePrice, fill = TRUE)
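The two detection rules can also be written out in base R, which makes the thresholds explicit. This is a minimal sketch on made-up data, not a reimplementation of the package functions; all variable names are illustrative.

```r
# Hypothetical toy data: six similar values and one extreme one
x <- c(10, 12, 11, 13, 12, 11, 95)

# z-score rule: flag values more than 1.96 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
outliers_z <- which(abs(z) > 1.96)

# 1.5 * IQR rule: flag values beyond the quartiles by more than 1.5 IQRs
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
outliers_iqr <- which(x < q[1] - fence | x > q[2] + fence)

outliers_z    # 7
outliers_iqr  # 7
```

Both rules flag the seventh observation here. On skewed data they can disagree, because the mean and standard deviation are themselves inflated by extreme values while the quartiles are not.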
ggplot(numeric_train, aes(x = SalePrice)) +
  geom_histogram(bins = 25, colour = "#ffffff", fill = "#4f81c7") +
  theme_classic() +
  ggtitle("House prices distribution") +
  geom_vline(aes(xintercept = mean(SalePrice)),
             color = "red", linetype = "dashed") +
  # Label the mean in thousands; dividing by 1e6 would round the ~$181K average to "$0.2M"
  annotate("text", x = mean(numeric_train$SalePrice) * 1.5, y = 250,
           label = paste0("Avg: $",
                          round(mean(numeric_train$SalePrice) / 1000, 1), "K"))
ggplot(numeric_train, aes(y = SalePrice)) +
  geom_boxplot(outlier.colour = "#eda593", colour = "#4f81c7") +
  coord_flip() +
  theme_classic() +
  ggtitle("House prices boxplot")