STAT 545A Homework 6 Final Report

Jinyuan Zhang

2013-10-19

Introduction
Description of Dataset
Detect Outliers and Construct Clean Dataset
Clean Data Analysis
Conclusion

Introduction

Abalone, an excellent source of iron and pantothenic acid, is a nutritious food resource and farming in Australia, America and East Asia. 100 grams of abalone yields more than 20% recommended daily intake of these nutrients. The economic value of abalone is positively correlated with its age. Therefore, to detect the age of abalone accurately is important for both farmers and customers to determine its price. However, the current technology to decide the age is quite costly and inefficient. Farmers usually cut the shells and count the rings through microscopes to estimate the abalone¡¯s age. This complex method increases the cost and limits its popularity. Our target is to find out the best indicators to forecast the rings, then the age of abalones.

Description of Dataset

Description of Dataset In this project, the data set Abalone is obtained from UCI Machine Learning Repository (1995). The data set contains physical measurements of 4177 abalones recorded in December 1995 by Marine Research Laboratories Taroona, Department of Primary Industry and Fisheries, Tasmania, Australia. There are nine variables, namely, Sex, Length, Diameter, and Height, Whole weight, Shucked weight, Viscera weight, Shell weight and Rings.The variable Rings is linearly related to the age of an abalone, as age equals to number of rings plus 1.5.

Detect Outliers and Construct Clean Dataset

library(ggplot2)
library(plyr)
library(nnet)
library(MASS)
library(gridExtra)

## Loading required package: grid

library(lattice)
library(RColorBrewer)
library(xtable)
Data = read.csv("abalone.csv")  # Import Data
print(str(Data))  # Structure of the Data

## 'data.frame':    4177 obs. of  9 variables:
##  $ Sex           : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
##  $ Length        : num  0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
##  $ Diameter      : num  0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
##  $ Height        : num  0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
##  $ Whole.weight  : num  0.514 0.226 0.677 0.516 0.205 ...
##  $ Shucked.weight: num  0.2245 0.0995 0.2565 0.2155 0.0895 ...
##  $ Viscera.weight: num  0.101 0.0485 0.1415 0.114 0.0395 ...
##  $ Shell.weight  : num  0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
##  $ Rings         : int  15 7 9 10 7 8 20 16 9 19 ...
## NULL

There are 4 different measures for weight i.e. Whole.weight, Shucked.weight, Viscera.weight and Shell.weight. And Whole.weight should be the easiest one to measure. So I drop out all the other measures.

Data = subset(Data, select = -c(Shucked.weight, Viscera.weight, Shell.weight))

Plot number of abalone with different rings.

ggplot(Data, aes(x = Rings, fill = Sex)) + geom_bar(binwidth = 1, color = "blue", 
    origin = min(Data$Rings), position = "identity") + scale_x_continuous(name = "Rings", 
    breaks = seq(0, 30, by = 2)) + theme(axis.text.x = element_text(angle = 90)) + 
    scale_fill_brewer("Location", type = "qual", palette = 3) + ylab("Number of abalone") + 
    ggtitle("Number of abalone with different Rings") + facet_wrap(~Sex, ncol = 3)

plot of chunk unnamed-chunk-3

From the graph above, we can see that the range of the Rings is from 1 to 29, which maybe too much to measure. In reality, people may not require so detailed category. So I group abalones with less than 6 rings (<7.5 years old), from 6 to 13 rings (7.5 to 14.5 years old) and more than 13 rings (>14.5 years old), indicating young, adult and old abalones correspondingly, and label them as 1,2,3.

Age = c(rep(0, nrow(Data)))
for (i in 1:nrow(Data)) {
    if (Data[i, ]$Rings < 7) 
        Age[i] = 1
    if (Data[i, ]$Rings >= 7 & Data[i, ]$Rings <= 13) 
        Age[i] = 2
    if (Data[i, ]$Rings > 13) 
        Age[i] = 3
}
gData = cbind(Data, Age)

Roughly plot the graph and get little taste about data

ggplot(gData, aes(x = Height, y = factor(Rings), colour = factor(Sex))) + geom_jitter(position = position_jitter(width = 0.3)) + 
    geom_point() + ggtitle("Original Data: Height vs Rings") + scale_colour_brewer(type = "seq", 
    palette = "Set1")

plot of chunk unnamed-chunk-5

From the graph above we can see that there are some outlines in Female. I want to get rid of them.

jData = subset(gData, Height < 0.4)
ggplot(jData, aes(x = Height, y = factor(Rings), colour = factor(Sex))) + geom_jitter(position = position_jitter(width = 0.1)) + 
    geom_point() + ggtitle("Reduced Data: Height vs Rings") + scale_colour_brewer(type = "seq", 
    palette = "Set1")

plot of chunk unnamed-chunk-6

Reorder data according to Sex and Rings

Order_Data = arrange(jData, Sex, Rings, Length)
write.table(Order_Data, "abalone_clean.csv", quote = FALSE, sep = ",", row.names = FALSE)

Clean Data Analysis

Data = read.csv("abalone_clean.csv")  # Import Data
head(Data)

##   Sex Length Diameter Height Whole.weight Rings Age
## 1   F  0.275    0.195  0.070       0.0800     5   1
## 2   F  0.290    0.225  0.075       0.1400     5   1
## 3   F  0.360    0.270  0.090       0.1885     5   1
## 4   F  0.370    0.275  0.085       0.2405     5   1
## 5   F  0.290    0.210  0.075       0.2750     6   1
## 6   F  0.335    0.220  0.070       0.1700     6   1

Number of abalone of different Sex in different Age group.

with(Data, table(Sex, Age))

##    Age
## Sex    1    2    3
##   F   20 1067  219
##   I  381  913   48
##   M   47 1257  223

First of all, I want to test whether Observation data like Height, Whole.weight can be helpful to decide the abalone to different Age group. Logistic regression is applied to do this analysis.

ggplot(Data, aes(x = Whole.weight, y = Height)) + geom_point(aes(colour = Rings)) + 
    scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Whole.weight vs Height")

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-10


ggplot(Data, aes(x = Length, y = Height)) + geom_point(aes(colour = Rings)) + 
    scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Length vs Height")

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-10


ggplot(Data, aes(x = Length, y = Diameter)) + geom_point(aes(colour = Rings)) + 
    scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Length vs Diameter")

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-10

From above graphs, it is obvious that Length, Height and Diameter are relatively linear correlated. In fact, it is natural to think that the bigger the abalone, the heavier they are, so I want to check whether it is perfectly correlated between Whole.weight and Volume.

Construct Volume and Whole.weight dataset

yData = ddply(Data, ~Sex + Age, summarize, Volume = Length * Diameter * Height, 
    Whole.weight = Whole.weight, Rings = Rings)

Draw the point and regression lines

cols <- c(Linear = "#f04546", Cubic = "#3591d1")
ggplot(data = yData, aes(x = Whole.weight, y = Volume)) + geom_smooth(method = "lm", 
    aes(colour = "Linear"), lwd = 1) + geom_smooth(method = "lm", formula = y ~ 
    poly(x, 3), aes(colour = "Cubic"), lwd = 1) + geom_point(alpha = 1/10, colour = "purple") + 
    scale_colour_manual("Type of Regression", values = cols) + ggtitle("Volume vs Whole.weight") + 
    coord_flip() + scale_y_continuous(name = "Volume", breaks = seq(0, 0.15, 
    by = 0.03))

plot of chunk unnamed-chunk-12

From the graph above, it can be seen that Cubic regression and Linear regression is almost same in the beginning, but diverge when Volume is large, which means that Volume and Whole.weight is highly correlated for small abalone.

Why not just use Whole.weight to predict the Rings of abalone?

Calculate mean and variance of Whole.weight in different Sex.

with(Data, do.call(rbind, tapply(Whole.weight, Sex, function(x) c(M = mean(x), 
    SD = sd(x)))))

##        M     SD
## F 1.0469 0.4303
## I 0.4314 0.2863
## M 0.9907 0.4697

It seems that F and M are more heavier than I. Does that mean I group do not have large rings?

ggplot(data = Data, aes(x = Rings, group = Sex, colour = Sex)) + geom_density() + 
    ggtitle("Density of Rings of each Sex")

plot of chunk unnamed-chunk-14

Sex I group abalone is younger than F and M group, but it do have old abalone with Sex I.

ggplot(data = Data, aes(x = factor(Age), y = Whole.weight)) + geom_boxplot(outlier.colour = "purple", 
    outlier.size = 3, aes(fill = Sex)) + facet_wrap(~Sex, ncol = 3) + ggtitle("Whole.weight vs Age for different Sex") + 
    xlab("Age")

plot of chunk unnamed-chunk-15

Compare to M and F, it seems that abalone with sex I have less weight.

Fit Multinational Regression Model

logit <- multinom(Age ~ Whole.weight, data = yData)

## # weights:  9 (4 variable)
## initial  value 4586.706305 
## iter  10 value 2099.141042
## final  value 2088.269847 
## converged

summary(logit)

## Call:
## multinom(formula = Age ~ Whole.weight, data = yData)
## 
## Coefficients:
##   (Intercept) Whole.weight
## 2      -1.826        9.362
## 3      -4.900       10.551
## 
## Std. Errors:
##   (Intercept) Whole.weight
## 2      0.1353       0.4488
## 3      0.1781       0.4600
## 
## Residual Deviance: 4177 
## AIC: 4185

pp = fitted(logit)
pred <- predict(logit)
table(pred, Data$Age)

##     
## pred    1    2    3
##    1  268  108    0
##    2  180 3127  489
##    3    0    2    1

Construct a new dataset with information above.

newdata = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Pred_Prob = pp, 
    Age = Age, Pred_Age = pred))
head(newdata)

##   Weight Sex Pred_Prob.1 Pred_Prob.2 Pred_Prob.3 Age Pred_Age
## 1 0.0800   F      0.7365      0.2507     0.01276   1        1
## 2 0.1400   F      0.6136      0.3664     0.02002   1        1
## 3 0.1885   F      0.5013      0.4714     0.02728   1        1
## 4 0.2405   F      0.3811      0.5830     0.03590   1        2
## 5 0.2750   F      0.3078      0.6505     0.04173   1        2
## 6 0.1700   F      0.5448      0.4308     0.02439   1        1

ggplot(newdata, aes(x = Pred_Age, y = Age, colour = factor(Pred_Age == Age))) + 
    geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2", 
    "3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification")

plot of chunk unnamed-chunk-17

Plot the Predicted Probability for different age group

cols <- c(Young = "blue", Adult = "#f04546", Old = "green")
ggplot(newdata, aes(x = Weight)) + geom_line(aes(y = Pred_Prob.1, colour = "Young")) + 
    geom_line(aes(y = Pred_Prob.2, colour = "Adult")) + geom_line(aes(y = Pred_Prob.3, 
    colour = "Old")) + ylab("Predicted Probability") + facet_wrap(~Sex, ncol = 3) + 
    scale_colour_manual("Age Group", values = cols) + ggtitle("Predicted Probability")

plot of chunk unnamed-chunk-18

From the graph, we can tell that prediction for Young group should be good,but it may difficult to distinguish Old and Adult when Weight is heavy.

It seems that more Sex I abalone is mis-classified to Young and Adult group, but more Sex F abalone is mis-classified to Old grope. And the biggest problem is in Old group. One reason maybe that we adopt linear combination of variables in multinational regression instead.

How about using Linear Discriminant Analysis?

Only using Whole.weight

LDA1 = lda(Age ~ Whole.weight, yData)
pred1 = predict(LDA1)$class
newdataL1 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age, 
    Pred_Age1 = pred1))
Data_misL1 = subset(newdataL1, Age != Pred_Age1)

Using both Whole.weight and Volume.

LDA2 = lda(Age ~ ., yData)
pred2 = predict(LDA2)$class
newdataL2 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age, 
    Pred_Age2 = pred2))
Data_misL2 = subset(newdataL2, Age != Pred_Age2)

Plot the graph:

p1 = ggplot(Data_misL1, aes(x = Age, fill = Sex)) + geom_bar(binwidth = 1/2, 
    color = "darkgrey") + ylab("Number of mis-classfied data") + ggtitle(" Using Whole.weight only") + 
    theme(axis.text.x = element_text(angle = 90)) + scale_fill_brewer("Sex", 
    type = "qual", palette = 3)
p2 = ggplot(Data_misL2, aes(x = Age, fill = Sex)) + geom_bar(binwidth = 1/2, 
    color = "darkgrey") + ylab("Number of mis-classfied data") + ggtitle("Using Whole.weight & Volume") + 
    theme(axis.text.x = element_text(angle = 90)) + scale_fill_brewer("Sex", 
    type = "qual", palette = 3)
grid.arrange(p1, p2, main = "Mis Classified Data for Different Sex__LDA")

plot of chunk unnamed-chunk-21

From above analysis, we find out that even though Whole.weight is highly correlated to Volume, but Volume have significant influence on the classification problem. Does this mean Volume is more important to do classification?

Only use Volume

LDA3 = lda(Age ~ Volume, yData)
pred3 = predict(LDA3)$class
newdataL3 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age, 
    Pred_Age3 = pred3))
Data_misL3 = subset(newdataL3, Age != Pred_Age3)

Plot the graph:

ggplot(newdataL2, aes(x = Pred_Age2, y = Age, colour = factor(Pred_Age2 == Age))) + 
    geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2", 
    "3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification using Whole.weight & Volume")

plot of chunk unnamed-chunk-23

ggplot(newdataL3, aes(x = Pred_Age3, y = Age, colour = factor(Pred_Age3 == Age))) + 
    geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2", 
    "3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification Using Volume")

plot of chunk unnamed-chunk-23

The answer is NO. Using Volume and Whole.weight separately will not give a good classification.

Little conclusion: Using LDA is more efficient to classify the data then multinational regression. It is sure that using observations ie, Whole.weight and Volume together, we can already classify the abalone into different Age group well.

Then, which variable is more important, Length, Diameter, Height, or Whole.weight? We do not want to consider Volume now because this measure can not be obtained directly. So we will go back to above four easily measured variables.

New data frame which record the minimum value and maximum value of each variable.

jData <- ddply(Data, ~Sex + Rings, function(x) {
    jLevels <- c("min", "max")
    Length = range(x$Length)
    Height = range(x$Height)
    Diameter = range(x$Diameter)
    Whole.weight = range(x$Whole.weight)
    return(data.frame(Length, Height, Diameter, Whole.weight, stat = jLevels))
})

Plot the graph

p_Height = ggplot(jData, aes(x = Rings, y = Height, group = stat, colour = stat)) + 
    geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Height: Min vs Min") + 
    theme(legend.position = "none")
p_Length = ggplot(jData, aes(x = Rings, y = Length, group = stat, colour = stat)) + 
    geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Length: Min vs Max") + 
    theme(legend.position = "none")
p_Diameter = ggplot(jData, aes(x = Rings, y = Diameter, group = stat, colour = stat)) + 
    geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Diameter: Min vs Max") + 
    theme(legend.position = "none")
p_Whole.weight = ggplot(jData, aes(x = Rings, y = Whole.weight, group = stat, 
    colour = stat)) + geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + 
    ggtitle("Whole.weight: Min vs Max")

tmp <- ggplot_gtable(ggplot_build(p_Whole.weight))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]

grid.arrange(arrangeGrob(p_Whole.weight + theme(legend.position = "none"), p_Length, 
    p_Height, p_Diameter))

plot of chunk unnamed-chunk-25

Try the similar thing in xyplot

px_Height = xyplot(Height ~ Rings | Sex, jData, group = stat, auto.key = list(columns = 2, 
    x = 0.35, y = 0.85, corner = c(0, 1)), type = c("p", "l"), par.settings = list(superpose.line = list(col = c("red", 
    "blue")), superpose.symbol = list(col = c("red", "blue"))), grid = "h")
grid.arrange(arrangeGrob(p_Height + xlab(""), px_Height))

plot of chunk unnamed-chunk-26

From the graph above, we can see that Whole.weight and Height share the similar trend of range, while Length and Diameter share similar trend of range. Does it mean Whole.weight is more close related to Height, and Length is more related to Diameter?

Regression.


jFun <- function(x) {
    name = c()
    Coefs = c()
    for (i in 1:4) {
        temp = c()
        temp <- rbind(temp, as.numeric(coef(lm(x[, i + 1] ~ Rings, x))))
        Coefs = rbind(Coefs, temp)
        name = rbind(name, names(x[i + 1]))
    }
    return(data.frame(Intercept = Coefs[, 1], Slope = Coefs[, 2], Variable = name))
}
jCoefs <- ddply(Data, ~Sex, jFun)

foo <- ddply(jCoefs, ~Variable, function(x) {
    lerange <- c(which.min(x$Intercept), which.max(x$Intercept))
    cbind(x[lerange, c("Variable", "Sex", "Intercept", "Slope")], stat = c("min_slope", 
        "max_slope"))
})
foo = xtable(foo)
print(foo, type = "html", include.rownames = FALSE)

Variable	Sex	Intercept	Slope	stat
Diameter	I	0.13	0.02	min_slope
Diameter	F	0.39	0.01	max_slope
Height	I	0.04	0.01	min_slope
Height	F	0.12	0.00	max_slope
Length	I	0.19	0.03	min_slope
Length	F	0.51	0.01	max_slope
Whole.weight	I	-0.19	0.08	min_slope
Whole.weight	F	0.64	0.04	max_slope

From above table, it shows that Sex I have min_slope for all Variables, and Sex F have max_slope for all variables.

Plot the regression line of Height

ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Height)) + facet_wrap(~Sex, 
    ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) + 
    ggtitle("Max_slope vs Min_slope of Height") + scale_colour_manual("Age Group", 
    values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))

plot of chunk unnamed-chunk-28

Plot the regression line of Diameter

ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Diameter)) + facet_wrap(~Sex, 
    ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) + 
    ggtitle("Max_slope vs Min_slope of Diameter") + scale_colour_manual("Age Group", 
    values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))

plot of chunk unnamed-chunk-29

Plot the regression line of Length

ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Length)) + facet_wrap(~Sex, 
    ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) + 
    ggtitle("Max_slope vs Min_slope of Length") + scale_colour_manual("Age Group", 
    values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))

plot of chunk unnamed-chunk-30

Plot the regression line of Whole.weight

ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Whole.weight)) + 
    facet_wrap(~Sex, ncol = 2) + geom_smooth(method = "lm", colour = "purple") + 
    geom_point(aes(colour = factor(Age))) + ggtitle("Max_slope vs Min_slope of Whole.weight") + 
    scale_colour_manual("Age Group", values = c("4", "3", "5"), labels = c("Young", 
        "Adult", "Old"))

plot of chunk qplot

From the above graphs, it is obvious that the quantity increase of Sex I along with Rings is much greater than Sex F, which means that it may easy to classify the Sex to different Age group by quantity measure like Height, Weight etc.

Conclusion

From the analysis above, we can conclude that quantity measure Volume (body size) and Whole.weight (weight) can classify abalone to different age group, especially for Sex I. In practice, breeders do not need to adopt complicated method to detect the Rings of abalone, however, they can know abalone’s age only based on its body size and weight. This method is simple but highly accurate, which should be recommended. Nevertheless, if more detailed age information is desired, the method may fail; thus further research on the topic is required.