January 26, 2017

Understanding the data

-We often deal with high-dimensional datasets
-To make sense of the data, it is often necessary to reduce the dimensionality
-One way is to remove variables that show little variation
-For example, if a variable takes the same value in 99 out of 100 observations, it may be useless
-I have used the frequency of the mode instead of the variance, as outliers might affect the variance
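
The last point can be illustrated with a small sketch. The vector below is made up for illustration (it is not part of the diamonds data): 99 identical values plus one extreme outlier. The outlier inflates the variance, so a variance-based filter might keep the variable, while the share of the mode still shows that it is nearly constant.

```r
## Hypothetical example: 99 identical values plus one extreme outlier
x <- c(rep(5, 99), 1000)

## The single outlier makes the variance look large
variance <- var(x)

## Share of observations that take the most frequent value
mode_share <- max(table(x, useNA = "always")) / length(x)

variance
mode_share
```

Here `mode_share` comes out at 0.99 even though the variance is in the thousands.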

Writing the function

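##Calculating the number of times the mode (most frequent value) of a variable appears
repvalues=function(x){
     ##Frequency table of x; NA is counted as a value of its own
     check=table(x, useNA = "always")
     ##Keep only the count of the most frequent value
     check2=check[which.max(check)]
     ##Return the mode count alongside the total number of observations
     return (cbind(check2,len=length(x)))
}
##We now apply this function to every variable (column) of a data frame
useless_var=function(x){
     ##sapply gives one column per variable; transpose to get one row per variable
     mode_count=t(sapply(x,function(y)repvalues(y)))
     return (mode_count)
 }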
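
One detail of the function above is worth a quick check: because of useNA = "always", missing values compete with real values for the mode. The sketch below uses a made-up vector (an assumption for illustration) in which NA is the most frequent entry.

```r
## Hypothetical vector in which NA is the most frequent entry
x <- c(NA, NA, NA, 1, 2)

## Same counting step as in repvalues: NA gets its own cell in the table
check <- table(x, useNA = "always")
top <- check[which.max(check)]

## The count of the most frequent entry, and whether that entry is NA
top_count <- as.integer(top)
top_is_na <- is.na(names(top))
top_count
top_is_na
```

So a column that is mostly missing is reported with a high mode count, which is usually what we want when looking for uninformative variables.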

Loading sample data

##Loading the dataset and looking at the first few rows
load(file="diamonds.RData")
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Testing the function

##Running the function on the diamonds dataset
##The first column is the count of the mode, the second is the total number of rows
diamonds_example=useless_var(diamonds)
diamonds_example
##          [,1]  [,2]
## carat    2604 53940
## cut     21551 53940
## color   11292 53940
## clarity 13065 53940
## depth    2239 53940
## table    9881 53940
## price     132 53940
## x         448 53940
## y         437 53940
## z         767 53940
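
The raw counts are easier to compare as shares of the total. The sketch below simply retypes the counts printed above into a vector (the names mode_count, total and mode_share are my own, not part of the original functions) and divides by the number of rows.

```r
## Mode counts copied from the output above
mode_count <- c(carat = 2604, cut = 21551, color = 11292, clarity = 13065,
                depth = 2239, table = 9881, price = 132,
                x = 448, y = 437, z = 767)
total <- 53940

## Share of rows taken by the most frequent value of each variable
mode_share <- mode_count / total
round(sort(mode_share, decreasing = TRUE), 3)
```

The variable with the largest share is cut, and even there the mode covers well under half of the rows.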

Interpretation

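Since no variable is severely imbalanced, it may not make sense to drop any of them. Even the most concentrated variable, cut, takes its mode in only 21,551 of 53,940 rows (about 40%).
If one of the variables had a single value repeating 50,000+ times, it might make sense to drop it.
The threshold would also depend on the target variable's distribution.
A highly imbalanced target variable may call for a higher threshold, since a predictor that is almost constant can still be informative about a rare class.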
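
As a rough sketch of how a threshold could be applied, the code below flags the columns whose mode covers more than a chosen share of the rows. The helper mode_share_by_column, the toy data frame and the 0.95 cut-off are all assumptions made for illustration; the right threshold depends on the data and the target variable, as noted above.

```r
## Hypothetical helper: share of rows taken by the mode of each column
mode_share_by_column <- function(df) {
     sapply(df, function(y) max(table(y, useNA = "always")) / length(y))
}

## Toy data: one nearly constant column and one column with all distinct values
toy <- data.frame(nearly_constant = c(rep("a", 99), "b"),
                  varied = 1:100,
                  stringsAsFactors = FALSE)

## Assumed cut-off, chosen only for this example
threshold <- 0.95

shares <- mode_share_by_column(toy)
flagged <- names(shares)[shares > threshold]
flagged
```

Only nearly_constant is flagged here; whether such a column is actually dropped should still be a judgment call rather than an automatic step.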