Generating random skewed data
set.seed(25)
# 100 adults: age with mean 25 / sd 3, weight with mean 50 / sd 10, iq with mean 30 / sd 5
data <- data.frame(age=rnorm(100,25,3), weight=rnorm(100,50,10), iq=rnorm(100,30,5), infant=0)
# 20 teens: note the weight and iq draws of length 10 are recycled by data.frame() to fill the 20 rows
data <- rbind(data, data.frame(age=rnorm(20,15,1), weight=rnorm(10,20,5), iq=rnorm(10,30,5), infant=0))
# 10 infants: very low age and weight, same iq distribution as the other groups
data <- rbind(data, data.frame(age=rnorm(10,5,1), weight=rnorm(10,.5,.01), iq=rnorm(10,30,5), infant=1))
We have created three groups: adults, teens and infants. AGE and WEIGHT are skewed across the groups, INFANT is the dependent variable, and IQ is drawn from the same distribution for every group, so the dependent variable does not depend on IQ.
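As a quick sanity check (not part of the original code), we can compare the group means by the infant label; age and weight clearly separate infants from the rest, while iq does not:
# Optional sanity check: compare predictor means for infants (infant = 1) vs. everyone else
aggregate(cbind(age, weight, iq) ~ infant, data = data, FUN = mean)
# The overall age distribution is visibly skewed because of the three groups
hist(data$age, breaks = 20, main = "Distribution of age", xlab = "age")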
Visualizing the data
plot(data,col=3)
We will keep “infant” as the dependent variable and “age”, “weight” and “IQ” as the independent variables.
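If you want the groups to stand out in the scatterplot matrix, one optional tweak (not in the original) is to colour the points by the infant label:
# Optional: colour the scatterplot matrix by the infant label (infant = 0 vs. infant = 1)
plot(data[, c("age", "weight", "iq")], col = data$infant + 2)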
Using hierarchical clustering
# Hierarchical clustering on the Euclidean distances between all rows (using all four columns)
vclust <- hclust(dist(as.matrix(data)))
plot(vclust, col=2)
The dendrogram looks messy; it is hard to infer anything from it.
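One way to quantify how poorly it separates the groups (an extra check, not in the original) is to cut the tree into two clusters and cross-tabulate them with the infant label:
# Optional: cut the dendrogram into two groups and compare them with the true label
groups <- cutree(vclust, k = 2)
table(cluster = groups, infant = data$infant)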
Performing regression on our data
y<-lm(infant~.,data=data)
summary(y)
Call:
lm(formula = infant ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-0.34443 -0.06922 0.00267 0.08674 0.44259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.756838 0.113701 6.656 7.76e-10 ***
age -0.024556 0.004333 -5.667 9.40e-08 ***
weight -0.002672 0.001547 -1.727 0.0867 .
iq -0.001348 0.003151 -0.428 0.6696
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.181 on 126 degrees of freedom
Multiple R-squared: 0.5529, Adjusted R-squared: 0.5422
F-statistic: 51.94 on 3 and 126 DF, p-value: < 2.2e-16
So we got a good p-value, but our R-squared is low. Still, the regression has rightly identified AGE as the variable with the highest influence.
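Since infant is a 0/1 label, one simple way to see how the linear model performs as a classifier (an extra check, not in the original) is to threshold its fitted values at 0.5:
# Optional: threshold the fitted values at 0.5 and compare with the true label
pred_infant <- as.numeric(predict(y) > 0.5)
table(predicted = pred_infant, actual = data$infant)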
Now, analyzing the same data using k-means clustering
# k-means with 2 centers on all four columns (the infant label is included as a feature here)
km <- kmeans(data, centers = 2)
km
K-means clustering with 2 clusters of sizes 100, 30
Cluster means:
age weight iq infant
1 24.45353 49.99159 30.09028 0.0000000
2 11.66294 13.61844 30.81073 0.3333333
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[70] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 11970.891 4049.157
(between_SS / total_SS = 68.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
K-means could not segregate accurately in the grey area between teens and infants; this can also be observed in the between_SS / total_SS ratio, which is only 68.2%.
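Cross-tabulating the cluster assignments with the infant label shows the mix-up directly. Scaling the predictors first (k-means is sensitive to the very different variable ranges) is one thing worth trying, though it is not part of the original run:
# Cross-tabulate the k-means clusters with the true infant label
table(cluster = km$cluster, infant = data$infant)
# Optional variant: cluster only the scaled predictors, dropping the label
km_scaled <- kmeans(scale(data[, c("age", "weight", "iq")]), centers = 2)
table(cluster = km_scaled$cluster, infant = data$infant)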
Coming to the main part, the decision tree
library(rpart)        # rpart and rpart.plot are needed for the tree below
library(rpart.plot)
clust <- rpart(infant~., data=data)   # regression tree: infant on all other variables
rpart.plot(clust)
Wow, the decision tree nailed it. Interesting; let's look at the MSE.
summary(clust)
Call:
rpart(formula = infant ~ ., data = data)
n= 130
CP nsplit rel error xerror xstd
1 1.00 0 1 1.019547 0.2837117
2 0.01 1 0 0.000000 0.0000000
Variable importance
age weight
50 50
Node number 1: 130 observations, complexity param=1
mean=0.07692308, MSE=0.07100592
left son=2 (120 obs) right son=3 (10 obs)
Primary splits:
age < 9.904064 to the right, improve=1.00000000, (0 missing)
weight < 6.940662 to the right, improve=1.00000000, (0 missing)
iq < 32.52711 to the right, improve=0.02116367, (0 missing)
Surrogate splits:
weight < 6.940662 to the right, agree=1, adj=1, (0 split)
Node number 2: 120 observations
mean=0, MSE=0
Node number 3: 10 observations
mean=1, MSE=0
The tree correctly picked AGE as the splitting variable, and the obtained MSE is 0.
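The MSE can also be computed directly from the tree's fitted values (a small extra check, not in the original):
# Compute the MSE of the tree's fitted values directly
mean((predict(clust) - data$infant)^2)
# The fitted values match the infant label exactly
table(predicted = predict(clust), actual = data$infant)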
Hence, when analysing skewed data, one has to experiment with various techniques to find the best-suited one; in our case, the decision tree outperformed the others.