2013-10-19
Abalone, an excellent source of iron and pantothenic acid, is a nutritious food resource and farming in Australia, America and East Asia. 100 grams of abalone yields more than 20% recommended daily intake of these nutrients. The economic value of abalone is positively correlated with its age. Therefore, to detect the age of abalone accurately is important for both farmers and customers to determine its price. However, the current technology to decide the age is quite costly and inefficient. Farmers usually cut the shells and count the rings through microscopes to estimate the abalone¡¯s age. This complex method increases the cost and limits its popularity. Our target is to find out the best indicators to forecast the rings, then the age of abalones.
Description of Dataset In this project, the data set Abalone is obtained from UCI Machine Learning Repository (1995). The data set contains physical measurements of 4177 abalones recorded in December 1995 by Marine Research Laboratories Taroona, Department of Primary Industry and Fisheries, Tasmania, Australia. There are nine variables, namely, Sex, Length, Diameter, and Height, Whole weight, Shucked weight, Viscera weight, Shell weight and Rings.The variable Rings is linearly related to the age of an abalone, as age equals to number of rings plus 1.5.
library(ggplot2)
library(plyr)
library(nnet)
library(MASS)
library(gridExtra)
## Loading required package: grid
library(lattice)
library(RColorBrewer)
library(xtable)
Data = read.csv("abalone.csv") # Import Data
print(str(Data)) # Structure of the Data
## 'data.frame': 4177 obs. of 9 variables:
## $ Sex : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ Length : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ Whole.weight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ Shucked.weight: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ Viscera.weight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ Shell.weight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
## NULL
Data = subset(Data, select = -c(Shucked.weight, Viscera.weight, Shell.weight))
Plot number of abalone with different rings.
ggplot(Data, aes(x = Rings, fill = Sex)) + geom_bar(binwidth = 1, color = "blue",
origin = min(Data$Rings), position = "identity") + scale_x_continuous(name = "Rings",
breaks = seq(0, 30, by = 2)) + theme(axis.text.x = element_text(angle = 90)) +
scale_fill_brewer("Location", type = "qual", palette = 3) + ylab("Number of abalone") +
ggtitle("Number of abalone with different Rings") + facet_wrap(~Sex, ncol = 3)
Age = c(rep(0, nrow(Data)))
for (i in 1:nrow(Data)) {
if (Data[i, ]$Rings < 7)
Age[i] = 1
if (Data[i, ]$Rings >= 7 & Data[i, ]$Rings <= 13)
Age[i] = 2
if (Data[i, ]$Rings > 13)
Age[i] = 3
}
gData = cbind(Data, Age)
Roughly plot the graph and get little taste about data
ggplot(gData, aes(x = Height, y = factor(Rings), colour = factor(Sex))) + geom_jitter(position = position_jitter(width = 0.3)) +
geom_point() + ggtitle("Original Data: Height vs Rings") + scale_colour_brewer(type = "seq",
palette = "Set1")
jData = subset(gData, Height < 0.4)
ggplot(jData, aes(x = Height, y = factor(Rings), colour = factor(Sex))) + geom_jitter(position = position_jitter(width = 0.1)) +
geom_point() + ggtitle("Reduced Data: Height vs Rings") + scale_colour_brewer(type = "seq",
palette = "Set1")
Order_Data = arrange(jData, Sex, Rings, Length)
write.table(Order_Data, "abalone_clean.csv", quote = FALSE, sep = ",", row.names = FALSE)
Data = read.csv("abalone_clean.csv") # Import Data
head(Data)
## Sex Length Diameter Height Whole.weight Rings Age
## 1 F 0.275 0.195 0.070 0.0800 5 1
## 2 F 0.290 0.225 0.075 0.1400 5 1
## 3 F 0.360 0.270 0.090 0.1885 5 1
## 4 F 0.370 0.275 0.085 0.2405 5 1
## 5 F 0.290 0.210 0.075 0.2750 6 1
## 6 F 0.335 0.220 0.070 0.1700 6 1
Number of abalone of different Sex in different Age group.
with(Data, table(Sex, Age))
## Age
## Sex 1 2 3
## F 20 1067 219
## I 381 913 48
## M 47 1257 223
ggplot(Data, aes(x = Whole.weight, y = Height)) + geom_point(aes(colour = Rings)) +
scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Whole.weight vs Height")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
ggplot(Data, aes(x = Length, y = Height)) + geom_point(aes(colour = Rings)) +
scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Length vs Height")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
ggplot(Data, aes(x = Length, y = Diameter)) + geom_point(aes(colour = Rings)) +
scale_colour_gradient(low = "purple") + stat_smooth(colour = "red") + ggtitle("Length vs Diameter")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Construct Volume and Whole.weight dataset
yData = ddply(Data, ~Sex + Age, summarize, Volume = Length * Diameter * Height,
Whole.weight = Whole.weight, Rings = Rings)
Draw the point and regression lines
cols <- c(Linear = "#f04546", Cubic = "#3591d1")
ggplot(data = yData, aes(x = Whole.weight, y = Volume)) + geom_smooth(method = "lm",
aes(colour = "Linear"), lwd = 1) + geom_smooth(method = "lm", formula = y ~
poly(x, 3), aes(colour = "Cubic"), lwd = 1) + geom_point(alpha = 1/10, colour = "purple") +
scale_colour_manual("Type of Regression", values = cols) + ggtitle("Volume vs Whole.weight") +
coord_flip() + scale_y_continuous(name = "Volume", breaks = seq(0, 0.15,
by = 0.03))
Calculate mean and variance of Whole.weight in different Sex.
with(Data, do.call(rbind, tapply(Whole.weight, Sex, function(x) c(M = mean(x),
SD = sd(x)))))
## M SD
## F 1.0469 0.4303
## I 0.4314 0.2863
## M 0.9907 0.4697
It seems that F and M are more heavier than I. Does that mean I group do not have large rings?
ggplot(data = Data, aes(x = Rings, group = Sex, colour = Sex)) + geom_density() +
ggtitle("Density of Rings of each Sex")
Sex I group abalone is younger than F and M group, but it do have old abalone with Sex I.
ggplot(data = Data, aes(x = factor(Age), y = Whole.weight)) + geom_boxplot(outlier.colour = "purple",
outlier.size = 3, aes(fill = Sex)) + facet_wrap(~Sex, ncol = 3) + ggtitle("Whole.weight vs Age for different Sex") +
xlab("Age")
Compare to M and F, it seems that abalone with sex I have less weight.
logit <- multinom(Age ~ Whole.weight, data = yData)
## # weights: 9 (4 variable)
## initial value 4586.706305
## iter 10 value 2099.141042
## final value 2088.269847
## converged
summary(logit)
## Call:
## multinom(formula = Age ~ Whole.weight, data = yData)
##
## Coefficients:
## (Intercept) Whole.weight
## 2 -1.826 9.362
## 3 -4.900 10.551
##
## Std. Errors:
## (Intercept) Whole.weight
## 2 0.1353 0.4488
## 3 0.1781 0.4600
##
## Residual Deviance: 4177
## AIC: 4185
pp = fitted(logit)
pred <- predict(logit)
table(pred, Data$Age)
##
## pred 1 2 3
## 1 268 108 0
## 2 180 3127 489
## 3 0 2 1
Construct a new dataset with information above.
newdata = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Pred_Prob = pp,
Age = Age, Pred_Age = pred))
head(newdata)
## Weight Sex Pred_Prob.1 Pred_Prob.2 Pred_Prob.3 Age Pred_Age
## 1 0.0800 F 0.7365 0.2507 0.01276 1 1
## 2 0.1400 F 0.6136 0.3664 0.02002 1 1
## 3 0.1885 F 0.5013 0.4714 0.02728 1 1
## 4 0.2405 F 0.3811 0.5830 0.03590 1 2
## 5 0.2750 F 0.3078 0.6505 0.04173 1 2
## 6 0.1700 F 0.5448 0.4308 0.02439 1 1
ggplot(newdata, aes(x = Pred_Age, y = Age, colour = factor(Pred_Age == Age))) +
geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2",
"3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification")
Plot the Predicted Probability for different age group
cols <- c(Young = "blue", Adult = "#f04546", Old = "green")
ggplot(newdata, aes(x = Weight)) + geom_line(aes(y = Pred_Prob.1, colour = "Young")) +
geom_line(aes(y = Pred_Prob.2, colour = "Adult")) + geom_line(aes(y = Pred_Prob.3,
colour = "Old")) + ylab("Predicted Probability") + facet_wrap(~Sex, ncol = 3) +
scale_colour_manual("Age Group", values = cols) + ggtitle("Predicted Probability")
Only using Whole.weight
LDA1 = lda(Age ~ Whole.weight, yData)
pred1 = predict(LDA1)$class
newdataL1 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age,
Pred_Age1 = pred1))
Data_misL1 = subset(newdataL1, Age != Pred_Age1)
Using both Whole.weight and Volume.
LDA2 = lda(Age ~ ., yData)
pred2 = predict(LDA2)$class
newdataL2 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age,
Pred_Age2 = pred2))
Data_misL2 = subset(newdataL2, Age != Pred_Age2)
Plot the graph:
p1 = ggplot(Data_misL1, aes(x = Age, fill = Sex)) + geom_bar(binwidth = 1/2,
color = "darkgrey") + ylab("Number of mis-classfied data") + ggtitle(" Using Whole.weight only") +
theme(axis.text.x = element_text(angle = 90)) + scale_fill_brewer("Sex",
type = "qual", palette = 3)
p2 = ggplot(Data_misL2, aes(x = Age, fill = Sex)) + geom_bar(binwidth = 1/2,
color = "darkgrey") + ylab("Number of mis-classfied data") + ggtitle("Using Whole.weight & Volume") +
theme(axis.text.x = element_text(angle = 90)) + scale_fill_brewer("Sex",
type = "qual", palette = 3)
grid.arrange(p1, p2, main = "Mis Classified Data for Different Sex__LDA")
Only use Volume
LDA3 = lda(Age ~ Volume, yData)
pred3 = predict(LDA3)$class
newdataL3 = with(yData, data.frame(Weight = Whole.weight, Sex = Sex, Age = Age,
Pred_Age3 = pred3))
Data_misL3 = subset(newdataL3, Age != Pred_Age3)
Plot the graph:
ggplot(newdataL2, aes(x = Pred_Age2, y = Age, colour = factor(Pred_Age2 == Age))) +
geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2",
"3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification using Whole.weight & Volume")
ggplot(newdataL3, aes(x = Pred_Age3, y = Age, colour = factor(Pred_Age3 == Age))) +
geom_point(position = "jitter") + scale_colour_manual("Age Group", values = c("2",
"3"), labels = c("Wong Classification", "Right Classification")) + ggtitle("Classification Using Volume")
New data frame which record the minimum value and maximum value of each variable.
jData <- ddply(Data, ~Sex + Rings, function(x) {
jLevels <- c("min", "max")
Length = range(x$Length)
Height = range(x$Height)
Diameter = range(x$Diameter)
Whole.weight = range(x$Whole.weight)
return(data.frame(Length, Height, Diameter, Whole.weight, stat = jLevels))
})
Plot the graph
p_Height = ggplot(jData, aes(x = Rings, y = Height, group = stat, colour = stat)) +
geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Height: Min vs Min") +
theme(legend.position = "none")
p_Length = ggplot(jData, aes(x = Rings, y = Length, group = stat, colour = stat)) +
geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Length: Min vs Max") +
theme(legend.position = "none")
p_Diameter = ggplot(jData, aes(x = Rings, y = Diameter, group = stat, colour = stat)) +
geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) + ggtitle("Diameter: Min vs Max") +
theme(legend.position = "none")
p_Whole.weight = ggplot(jData, aes(x = Rings, y = Whole.weight, group = stat,
colour = stat)) + geom_line() + geom_point() + facet_wrap(~Sex, ncol = 3) +
ggtitle("Whole.weight: Min vs Max")
tmp <- ggplot_gtable(ggplot_build(p_Whole.weight))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
grid.arrange(arrangeGrob(p_Whole.weight + theme(legend.position = "none"), p_Length,
p_Height, p_Diameter))
Try the similar thing in xyplot
px_Height = xyplot(Height ~ Rings | Sex, jData, group = stat, auto.key = list(columns = 2,
x = 0.35, y = 0.85, corner = c(0, 1)), type = c("p", "l"), par.settings = list(superpose.line = list(col = c("red",
"blue")), superpose.symbol = list(col = c("red", "blue"))), grid = "h")
grid.arrange(arrangeGrob(p_Height + xlab(""), px_Height))
jFun <- function(x) {
name = c()
Coefs = c()
for (i in 1:4) {
temp = c()
temp <- rbind(temp, as.numeric(coef(lm(x[, i + 1] ~ Rings, x))))
Coefs = rbind(Coefs, temp)
name = rbind(name, names(x[i + 1]))
}
return(data.frame(Intercept = Coefs[, 1], Slope = Coefs[, 2], Variable = name))
}
jCoefs <- ddply(Data, ~Sex, jFun)
foo <- ddply(jCoefs, ~Variable, function(x) {
lerange <- c(which.min(x$Intercept), which.max(x$Intercept))
cbind(x[lerange, c("Variable", "Sex", "Intercept", "Slope")], stat = c("min_slope",
"max_slope"))
})
foo = xtable(foo)
print(foo, type = "html", include.rownames = FALSE)
| Variable | Sex | Intercept | Slope | stat |
|---|---|---|---|---|
| Diameter | I | 0.13 | 0.02 | min_slope |
| Diameter | F | 0.39 | 0.01 | max_slope |
| Height | I | 0.04 | 0.01 | min_slope |
| Height | F | 0.12 | 0.00 | max_slope |
| Length | I | 0.19 | 0.03 | min_slope |
| Length | F | 0.51 | 0.01 | max_slope |
| Whole.weight | I | -0.19 | 0.08 | min_slope |
| Whole.weight | F | 0.64 | 0.04 | max_slope |
Plot the regression line of Height
ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Height)) + facet_wrap(~Sex,
ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) +
ggtitle("Max_slope vs Min_slope of Height") + scale_colour_manual("Age Group",
values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))
Plot the regression line of Diameter
ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Diameter)) + facet_wrap(~Sex,
ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) +
ggtitle("Max_slope vs Min_slope of Diameter") + scale_colour_manual("Age Group",
values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))
Plot the regression line of Length
ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Length)) + facet_wrap(~Sex,
ncol = 2) + geom_smooth(method = "lm", colour = "purple") + geom_point(aes(colour = factor(Age))) +
ggtitle("Max_slope vs Min_slope of Length") + scale_colour_manual("Age Group",
values = c("4", "3", "5"), labels = c("Young", "Adult", "Old"))
Plot the regression line of Whole.weight
ggplot(subset(Data, Sex %in% c("F", "I")), aes(x = Rings, y = Whole.weight)) +
facet_wrap(~Sex, ncol = 2) + geom_smooth(method = "lm", colour = "purple") +
geom_point(aes(colour = factor(Age))) + ggtitle("Max_slope vs Min_slope of Whole.weight") +
scale_colour_manual("Age Group", values = c("4", "3", "5"), labels = c("Young",
"Adult", "Old"))