Generating random skewed data
set.seed(25)
# 100 adults
data <- data.frame(age = rnorm(100, 25, 3), weight = rnorm(100, 50, 10), iq = rnorm(100, 30, 5), infant = 0)
# 20 teens (all three rnorm calls need n = 20 here, otherwise data.frame() silently recycles the shorter vectors)
data <- rbind(data, data.frame(age = rnorm(20, 15, 1), weight = rnorm(20, 20, 5), iq = rnorm(20, 30, 5), infant = 0))
# 10 infants
data <- rbind(data, data.frame(age = rnorm(10, 5, 1), weight = rnorm(10, .5, .01), iq = rnorm(10, 30, 5), infant = 1))
We have created three groups: adults, teens and infants, with INFANT as the dependent variable. AGE and WEIGHT are deliberately skewed across the groups, while IQ is drawn from the same distribution (mean 30, sd 5) in every group, so IQ carries no information about INFANT.
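As a quick sanity check of that design (an illustrative addition, not in the original), comparing group means by the infant flag should show age and weight separating the groups while iq stays flat:
# Illustrative check: mean age/weight/iq for infants vs everyone else
aggregate(cbind(age, weight, iq) ~ infant, data = data, FUN = mean)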
Visualizing the data
We will keep “infant” as the dependent variable and “age”, “weight” and “iq” as the independent variables.
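The original plot is not reproduced here; a minimal sketch for eyeballing the groups is a pairs plot of the predictors, colored by the infant flag:
# Illustrative visualization: infants in red, everyone else in blue
pairs(data[, c("age", "weight", "iq")],
      col = ifelse(data$infant == 1, "red", "blue"))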
Using hierarchical clustering
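The clustering call is not shown in the original; a plausible sketch is complete-linkage clustering (hclust's default) on the scaled predictors, followed by a dendrogram:
# Hierarchical clustering on scaled predictors (assumed call, for illustration)
hc <- hclust(dist(scale(data[, c("age", "weight", "iq")])))
plot(hc, labels = FALSE, main = "Hierarchical clustering")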
The dendrogram looks messy; it is hard to infer anything from it.
Performing regression on our data
y <- lm(infant ~ ., data = data)
summary(y)
Call:
lm(formula = infant ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.34443 -0.06922  0.00267  0.08674  0.44259

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.756838   0.113701   6.656 7.76e-10 ***
age         -0.024556   0.004333  -5.667 9.40e-08 ***
weight      -0.002672   0.001547  -1.727   0.0867 .
iq          -0.001348   0.003151  -0.428   0.6696
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.181 on 126 degrees of freedom
Multiple R-squared:  0.5529,  Adjusted R-squared:  0.5422
F-statistic: 51.94 on 3 and 126 DF,  p-value: < 2.2e-16
So we have a highly significant p-value, but the R-squared is only about 0.55. Still, the regression has rightly identified AGE as the variable with the highest influence.
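One likely reason for the modest fit (my observation, not part of the original analysis): a linear model does not respect the binary nature of INFANT, so its predictions are not constrained to the interval [0, 1].
range(fitted(y))  # the linear model's fitted values are not confined to 0/1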
Now, analyzing the same data using k-means clustering (the fitting call below is presumed; 2 centers matches the reported cluster sizes)
km <- kmeans(data, centers = 2)  # assumed call: 2 centers, clustering on all four columns
km
K-means clustering with 2 clusters of sizes 30, 100

Cluster means:
       age   weight       iq    infant
1 11.66294 13.61844 30.81073 0.3333333
2 24.45353 49.99159 30.09028 0.0000000

Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [70] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1]  4049.157 11970.891
 (between_SS / total_SS =  68.2 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"
K-means could not segregate accurately in the grey area between teens and infants: cluster 1 lumps the two groups together (its mean of INFANT is 0.33). This is also reflected in the between_SS / total_SS ratio of only 68.2%.
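A quick cross-tabulation (an illustrative check using the km object above) makes the mixing explicit:
# Rows are k-means clusters, columns are the true infant flag
table(cluster = km$cluster, infant = data$infant)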
Coming to the main part: the decision tree
library(rpart)       # for rpart()
library(rpart.plot)  # for rpart.plot()
clust <- rpart(infant ~ ., data = data)
rpart.plot(clust)
Interesting. Let's look at the MSE:
summary(clust)
Call:
rpart(formula = infant ~ ., data = data)
  n= 130

    CP nsplit rel error   xerror      xstd
1 1.00      0         1 1.019547 0.2837117
2 0.01      1         0 0.000000 0.0000000

Variable importance
   age weight
    50     50

Node number 1: 130 observations,    complexity param=1
  mean=0.07692308, MSE=0.07100592
  left son=2 (120 obs) right son=3 (10 obs)
  Primary splits:
      age    < 9.904064 to the right, improve=1.00000000, (0 missing)
      weight < 6.940662 to the right, improve=1.00000000, (0 missing)
      iq     < 32.52711 to the right, improve=0.02116367, (0 missing)
  Surrogate splits:
      weight < 6.940662 to the right, agree=1, adj=1, (0 split)

Node number 2: 120 observations
  mean=0, MSE=0

Node number 3: 10 observations
  mean=1, MSE=0
Wow, the decision tree nailed it. It correctly picked AGE as the splitting variable (age < 9.9 cleanly separates the infants), and the MSE in both terminal nodes is 0.
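As an illustrative confirmation, the in-sample MSE can be computed directly from the tree's predictions:
mean((predict(clust, newdata = data) - data$infant)^2)  # equals 0 for this tree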
Hence, when analyzing skewed data, one has to experiment with various techniques to find the best-suited one; in our case the decision tree outperformed the others.
Also, where the two kinds of errors have unequal consequences, I would supply a cost (loss) matrix, as below, to minimize the expected loss.
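A minimal sketch of what that could look like with rpart: a loss matrix applies only to classification trees, so INFANT is first converted to a factor, and the 10x penalty for missing a true infant is an assumption chosen purely for illustration.
# Classification tree with a loss matrix (illustrative penalties):
# rows are true classes, columns are predicted classes; the diagonal must be 0.
data_cls <- transform(data, infant = factor(infant))
loss <- matrix(c(0, 1,    # true 0: predicting 1 costs 1
                 10, 0),  # true 1: predicting 0 costs 10 (assumed 10x penalty)
               nrow = 2, byrow = TRUE)
clust_cost <- rpart(infant ~ ., data = data_cls, method = "class",
                    parms = list(loss = loss))
rpart.plot(clust_cost)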