Abalone are marine snails. Their taxonomy puts them in the family Haliotidae which contains only one genus, Haliotis, which once contained six subgenera. These subgenera have become alternate representations of Haliotis.[4] The number of species recognized worldwide ranges between 30[6] and 130[7] with over 230 species-level taxa described. The most comprehensive treatment of the family considers 56 species valid, with 18 additional subspecies.[8]
The shells of abalones have a low, open spiral structure, and are characterized by several open respiratory pores in a row near the shell’s outer edge. The thick inner layer of the shell is composed of nacre (mother-of-pearl), which in many species is highly iridescent, giving rise to a range of strong, changeable colors, which make the shells attractive to humans as decorative objects, jewelry, and as a source of colorful mother-of-pearl.
The flesh of abalones is widely considered to be a desirable food, and is consumed raw or cooked by a variety of cultures.
The dataset that I’m using comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/abalone
The variables in this dataset include:
1. Sex: male, female, or infant
2. Length (in mm): longest shell measurement
3. Diameter (in mm): perpendicular to length
4. Height (in mm): with meat in shell
5. Whole weight (in grams): weight of the whole abalone
6. Shucked weight (in grams): weight of the meat
7. Viscera weight (in grams): gut weight after bleeding
8. Shell weight (in grams): after being dried
9. Rings (integer): number of shell rings; adding 1.5 gives the age in years
The UCI file stores the continuous measurements divided by 200, which is why the values printed below are fractions rather than millimetres or grams.
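The analysis below uses the ring count directly as the age column. Per the UCI description, the age in years is the ring count plus 1.5, so converting only shifts the response by a constant (the regression slopes stay the same and only the intercept moves). A minimal sketch of the conversion, where `rings_to_age` is a hypothetical helper name and the ring counts are the first six values of the dataset:

```r
# Hypothetical helper (not part of the original analysis):
# convert a ring count to an approximate age in years using the "+1.5" rule.
rings_to_age <- function(rings) rings + 1.5

rings <- c(15, 7, 9, 10, 7, 8)   # ring counts of the first six rows of the dataset
age_years <- rings_to_age(rings)
print(age_years)
```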
The main goal of this project is to predict age from the physical measurements of the abalone using linear regression, with age (measured by the ring count) as the dependent variable and the measurements as independent variables. I want to see which independent variable has the most effect on age.
The dataset on the UCI website contains 4177 rows. For this project I decided to cut the data down and use only the first 1500 rows. I’m doing this for convenience, and because I expect that the additional rows would not change the regression equation or the calculations much.
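Taking the first 1500 rows implicitly assumes the file is in no particular order. If that is in doubt, a random sample of the same size is one alternative. The sketch below only illustrates the mechanics on a stand-in table with an `id` column (an assumption for illustration, not the real data); the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed so the random sample is reproducible

# Stand-in for the full 4177-row table; only the row count matters here.
full <- data.frame(id = seq_len(4177))

# What the analysis does: keep the first 1500 rows.
first_1500 <- head(full, 1500)

# Alternative: draw 1500 rows at random, without replacement.
random_1500 <- full[sample(nrow(full), 1500), , drop = FALSE]

print(range(first_1500$id))
print(range(random_1500$id))
```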
I took the data from the UCI ML website, converted it to Excel, and then imported it into SQLite before loading it into R for analysis. I also created the column names for this dataset. Here is the R code:
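setwd("C:/Documents/My Excel")  # folder that holds Abalone.db
library(RSQLite)
# Open the SQLite database and pull the first 1500 rows
db<-dbConnect(SQLite(),dbname="Abalone.db")
df<-dbGetQuery(db,'SELECT * from Abalone LIMIT 1500 ')
dbDisconnect(db)  # the rows are now in df, so the connection can be closed
# Assign readable column names
colnames(df)=c("sex","length","diameter","height","total.weight",
"meat.weight","viscera.weight","shell.weight","age")
head(df)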
## sex length diameter height total.weight meat.weight viscera.weight
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## shell.weight age
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
## 6 0.120 8
tail(df)
## sex length diameter height total.weight meat.weight viscera.weight
## 1495 M 0.620 0.485 0.155 1.0490 0.4620 0.2310
## 1496 F 0.620 0.435 0.155 1.0120 0.4770 0.2360
## 1497 M 0.620 0.480 0.165 1.0725 0.4815 0.2350
## 1498 M 0.625 0.520 0.175 1.4105 0.6910 0.3220
## 1499 M 0.625 0.470 0.180 1.1360 0.4510 0.3245
## 1500 M 0.630 0.470 0.145 1.1005 0.5200 0.2600
## shell.weight age
## 1495 0.2500 10
## 1496 0.2750 8
## 1497 0.3120 9
## 1498 0.3465 10
## 1499 0.3050 11
## 1500 0.2760 9
abalone=df
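As a possible shortcut for the Excel and SQLite steps, the raw UCI file is comma-separated with no header row, so base R can read it directly and assign the column names in one call. The sketch below is self-contained: it parses the first three records (typed in from the output above) through the `text=` argument instead of a file. Pointing `read.csv` at a local copy of the file, e.g. one named `abalone.data`, would be an assumption about where the download was saved.

```r
# First three records, in the comma-separated, header-less layout of the raw file.
raw_text <- "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9"

cols <- c("sex", "length", "diameter", "height", "total.weight",
          "meat.weight", "viscera.weight", "shell.weight", "age")

# header = FALSE because the file has no header; col.names supplies the names.
toy <- read.csv(text = raw_text, header = FALSE, col.names = cols)
str(toy)
```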
Next, I split the data by sex (male, female, infant), because that keeps the analysis more organized. After fitting the regressions by sex, I plan on fitting one for the whole dataset.
male=dplyr::filter(abalone,sex=='M')
female=dplyr::filter(abalone,sex=='F')
infant=dplyr::filter(abalone,sex=='I')
summary(male)
## sex length diameter height
## Length:558 Min. :0.1550 Min. :0.1100 Min. :0.0400
## Class :character 1st Qu.:0.4850 1st Qu.:0.3800 1st Qu.:0.1250
## Mode :character Median :0.5600 Median :0.4400 Median :0.1500
## Mean :0.5391 Mean :0.4228 Mean :0.1472
## 3rd Qu.:0.6100 3rd Qu.:0.4800 3rd Qu.:0.1700
## Max. :0.7650 Max. :0.6000 Max. :0.5150
## total.weight meat.weight viscera.weight shell.weight
## Min. :0.0155 Min. :0.0065 Min. :0.0030 Min. :0.0050
## 1st Qu.:0.5873 1st Qu.:0.2352 1st Qu.:0.1296 1st Qu.:0.1653
## Median :0.8965 Median :0.3782 Median :0.1900 Median :0.2550
## Mean :0.9064 Mean :0.3867 Mean :0.1952 Mean :0.2628
## 3rd Qu.:1.1823 3rd Qu.:0.5101 3rd Qu.:0.2539 3rd Qu.:0.3400
## Max. :2.8255 Max. :1.1465 Max. :0.5640 Max. :0.8970
## age
## Min. : 3.00
## 1st Qu.: 9.00
## Median :10.00
## Mean :10.88
## 3rd Qu.:13.00
## Max. :26.00
summary(female)
## sex length diameter height
## Length:488 Min. :0.2750 Min. :0.1950 Min. :0.0150
## Class :character 1st Qu.:0.5050 1st Qu.:0.4000 1st Qu.:0.1350
## Mode :character Median :0.5750 Median :0.4500 Median :0.1550
## Mean :0.5649 Mean :0.4453 Mean :0.1534
## 3rd Qu.:0.6250 3rd Qu.:0.4950 3rd Qu.:0.1700
## Max. :0.8150 Max. :0.6500 Max. :0.2500
## total.weight meat.weight viscera.weight shell.weight
## Min. :0.0800 Min. :0.031 Min. :0.0215 Min. :0.0250
## 1st Qu.:0.6541 1st Qu.:0.261 1st Qu.:0.1434 1st Qu.:0.1988
## Median :0.9630 Median :0.392 Median :0.2095 Median :0.2750
## Mean :0.9880 Mean :0.415 Mean :0.2164 Mean :0.2880
## 3rd Qu.:1.2416 3rd Qu.:0.540 3rd Qu.:0.2765 3rd Qu.:0.3571
## Max. :2.6570 Max. :1.488 Max. :0.5195 Max. :1.0050
## age
## Min. : 5.00
## 1st Qu.: 9.00
## Median :11.00
## Mean :11.53
## 3rd Qu.:13.00
## Max. :29.00
summary(infant)
## sex length diameter height
## Length:454 Min. :0.0750 Min. :0.0550 Min. :0.0000
## Class :character 1st Qu.:0.3350 1st Qu.:0.2512 1st Qu.:0.0850
## Mode :character Median :0.4150 Median :0.3150 Median :0.1000
## Mean :0.4064 Mean :0.3101 Mean :0.1033
## 3rd Qu.:0.4838 3rd Qu.:0.3750 3rd Qu.:0.1250
## Max. :0.6800 Max. :0.5300 Max. :0.1950
## total.weight meat.weight viscera.weight shell.weight
## Min. :0.0020 Min. :0.00100 Min. :0.00050 Min. :0.00150
## 1st Qu.:0.1792 1st Qu.:0.07575 1st Qu.:0.03800 1st Qu.:0.05500
## Median :0.3330 Median :0.15075 Median :0.06700 Median :0.09775
## Mean :0.3807 Mean :0.16958 Mean :0.08098 Mean :0.11205
## 3rd Qu.:0.5221 3rd Qu.:0.23500 3rd Qu.:0.11175 3rd Qu.:0.15500
## Max. :1.6260 Max. :0.63150 Max. :0.34450 Max. :0.53000
## age
## Min. : 1.000
## 1st Qu.: 6.000
## Median : 7.000
## Mean : 7.731
## 3rd Qu.: 9.000
## Max. :21.000
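The three summaries above can also be condensed into one table of group means, which makes the differences between males, females, and infants easier to compare side by side. A minimal base R sketch, using only the six rows shown earlier by `head(df)` as toy data (so the numbers it prints are not the real group means):

```r
# Toy data: sex, length and age of the first six rows of the dataset.
toy <- data.frame(
  sex    = c("M", "M", "F", "M", "I", "I"),
  length = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425),
  age    = c(15, 7, 9, 10, 7, 8)
)

# One row per sex, with the mean of each listed column.
by_sex <- aggregate(cbind(length, age) ~ sex, data = toy, FUN = mean)
print(by_sex)

# split() returns the per-sex subsets as a named list, similar to the three filter() calls.
groups <- split(toy, toy$sex)
print(names(groups))
```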
library(ggplot2)
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
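# Bar charts of shell length for each group; the object names avoid reusing lm, R's model-fitting function
p_male<-ggplot(male,aes(x=length))+geom_bar()
p_female<-ggplot(female,aes(x=length))+geom_bar()
p_infant<-ggplot(infant,aes(x=length))+geom_bar()
p_all<-ggplot(abalone,aes(x=length))+geom_bar()
plot_grid(p_male,p_female,p_infant,p_all,labels="AUTO")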
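A bar chart counts every distinct length value separately. For a continuous measurement, grouping the values into bins may give a clearer picture of the distribution; in ggplot2 that would be `geom_histogram()` with a `binwidth`. The sketch below shows the same binning idea in base R on simulated lengths (the values are made up, and the 0.05 bin width is an arbitrary choice), with `plot = FALSE` so that it just returns the counts.

```r
set.seed(1)

# Simulated stand-in for the length column (made-up values between 0.1 and 0.8).
len <- round(runif(200, min = 0.1, max = 0.8), 3)

# Group the values into 0.05-wide bins; plot = FALSE returns counts without drawing.
h <- hist(len, breaks = seq(0, 1, by = 0.05), plot = FALSE)

print(data.frame(bin_start = head(h$breaks, -1), count = h$counts))
```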
malemodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=male)
summary(malemodel)
##
## Call:
## lm(formula = age ~ length + diameter + height + total.weight +
## meat.weight + viscera.weight + shell.weight, data = male)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8040 -1.3865 -0.2947 1.1090 10.8962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0802 0.8788 3.505 0.000493 ***
## length 0.2886 5.3736 0.054 0.957192
## diameter 14.0174 6.3572 2.205 0.027870 *
## height 8.1004 4.8503 1.670 0.095472 .
## total.weight 9.3349 2.3135 4.035 6.24e-05 ***
## meat.weight -23.5585 2.5107 -9.383 < 2e-16 ***
## viscera.weight -9.4436 3.6745 -2.570 0.010431 *
## shell.weight 11.4701 3.5513 3.230 0.001313 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.314 on 550 degrees of freedom
## Multiple R-squared: 0.5429, Adjusted R-squared: 0.5371
## F-statistic: 93.31 on 7 and 550 DF, p-value: < 2.2e-16
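To judge which independent variable has the most effect, the raw estimates above are hard to compare directly, because each one is measured per unit of its own predictor and the predictors have different spreads. One common option is to standardize all variables first, so each coefficient is expressed per standard deviation. A minimal sketch on simulated data (the variables `x1`, `x2`, `y` and all numbers are made up for illustration):

```r
set.seed(7)
n <- 200

# Made-up data: x1 has a small spread, x2 a large one.
x1 <- rnorm(n, mean = 0.5, sd = 0.1)
x2 <- rnorm(n, mean = 1.0, sd = 0.5)
y  <- 3 + 10 * x1 + 4 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

# Raw coefficients: change in y per one unit of each predictor.
raw <- coef(lm(y ~ x1 + x2, data = toy))

# scale() centers each column and divides it by its standard deviation,
# so these coefficients are per standard deviation of each predictor.
std <- coef(lm(y ~ x1 + x2, data = as.data.frame(scale(toy))))

print(round(raw, 2))
print(round(std, 2))
```

In this toy example `x1` has the larger raw coefficient but `x2` has the larger standardized one, which is the kind of reversal that makes raw estimates a shaky basis for ranking predictors.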
femalemodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=female)
summary(femalemodel)
##
## Call:
## lm(formula = age ~ length + diameter + height + total.weight +
## meat.weight + viscera.weight + shell.weight, data = female)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8584 -1.8418 -0.5048 1.3792 12.8455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.730 1.740 3.868 0.000125 ***
## length -7.896 6.856 -1.152 0.250073
## diameter 12.395 8.314 1.491 0.136661
## height 16.402 7.664 2.140 0.032842 *
## total.weight 14.914 2.284 6.529 1.69e-10 ***
## meat.weight -25.779 2.593 -9.941 < 2e-16 ***
## viscera.weight -15.133 4.275 -3.540 0.000440 ***
## shell.weight 1.604 3.388 0.473 0.636135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.788 on 480 degrees of freedom
## Multiple R-squared: 0.3832, Adjusted R-squared: 0.3742
## F-statistic: 42.59 on 7 and 480 DF, p-value: < 2.2e-16
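Fitting a separate model on each subset is one way to compare the groups. Another common option is a single model on all rows with sex included as a factor, which estimates a shift for each group relative to a baseline. The sketch below uses simulated data only (the column `len`, the effect sizes, and the seed are all made up) to show the shape of such a model.

```r
set.seed(99)
n <- 150

# Simulated data: infants are built to be about 2 "years" younger at the same length.
sex <- factor(sample(c("F", "I", "M"), n, replace = TRUE))
len <- runif(n, min = 0.1, max = 0.8)
age <- 2 + 15 * len + ifelse(sex == "I", -2, 0) + rnorm(n)
toy <- data.frame(sex, len, age)

# One model for all rows; R uses the first factor level ("F") as the baseline,
# so sexI and sexM are shifts relative to females.
fit <- lm(age ~ len + sex, data = toy)
print(round(coef(fit), 2))
```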
totalmodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=abalone)
summary(totalmodel)
##
## Call:
## lm(formula = age ~ length + diameter + height + total.weight +
## meat.weight + viscera.weight + shell.weight, data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2855 -1.4827 -0.4077 1.0578 12.7417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.670 0.460 3.631 0.000292 ***
## length -1.100 3.195 -0.344 0.730728
## diameter 16.745 3.999 4.187 2.99e-05 ***
## height 17.765 3.561 4.989 6.79e-07 ***
## total.weight 12.160 1.408 8.635 < 2e-16 ***
## meat.weight -25.421 1.551 -16.386 < 2e-16 ***
## viscera.weight -13.219 2.469 -5.354 9.94e-08 ***
## shell.weight 5.223 2.144 2.436 0.014957 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.404 on 1492 degrees of freedom
## Multiple R-squared: 0.5697, Adjusted R-squared: 0.5677
## F-statistic: 282.2 on 7 and 1492 DF, p-value: < 2.2e-16
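In the output above, total.weight has a positive coefficient while meat.weight and viscera.weight are negative. Because the meat, viscera, and shell weights are all parts of the whole weight, these predictors are likely to be strongly correlated, and that can make individual coefficients (and their signs) hard to interpret. One way to check is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²) from regressing it on the other predictors. The sketch below does this by hand on simulated weights; `vif_by_hand` is a hypothetical helper and all numbers are made up.

```r
set.seed(123)
n <- 300

# Simulated, made-up data: three weight-like predictors that move together.
total <- runif(n, min = 0.2, max = 2.0)
meat  <- 0.43 * total + rnorm(n, sd = 0.03)
shell <- 0.29 * total + rnorm(n, sd = 0.03)
toy <- data.frame(total, meat, shell)

# Hypothetical helper: VIF of each column, from regressing it on all other columns.
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(toy)
print(round(vifs, 1))
```

Values far above 1 indicate that a predictor is largely explained by the others; running the same helper on the seven measurement columns of the real data would show whether that is the case here.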
x<-lm(age~length,data=male)
summary(x)
##
## Call:
## lm(formula = age ~ length, data = male)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5803 -2.1571 -0.6384 1.6313 14.2122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8253 0.6503 4.344 1.66e-05 ***
## length 14.9374 1.1831 12.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.001 on 556 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2214
## F-statistic: 159.4 on 1 and 556 DF, p-value: < 2.2e-16
y<-lm(age~length,data=female)
summary(y)
##
## Call:
## lm(formula = age ~ length, data = female)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4670 -2.3822 -0.9522 1.3713 16.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.958 1.006 5.922 6.02e-09 ***
## length 9.862 1.760 5.603 3.52e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.419 on 486 degrees of freedom
## Multiple R-squared: 0.06069, Adjusted R-squared: 0.05875
## F-statistic: 31.4 on 1 and 486 DF, p-value: 3.525e-08
z<-lm(age~length,data=infant)
summary(z)
##
## Call:
## lm(formula = age ~ length, data = infant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3059 -1.3059 -0.3968 0.6335 11.1569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7907 0.3844 2.057 0.0403 *
## length 17.0801 0.9133 18.702 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.137 on 452 degrees of freedom
## Multiple R-squared: 0.4363, Adjusted R-squared: 0.435
## F-statistic: 349.8 on 1 and 452 DF, p-value: < 2.2e-16
a<-lm(age~length,data=abalone)
summary(a)
##
## Call:
## lm(formula = age ~ length, data = abalone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.781 -2.020 -0.742 1.326 15.527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3554 0.3278 4.135 3.75e-05 ***
## length 17.3109 0.6281 27.561 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.979 on 1498 degrees of freedom
## Multiple R-squared: 0.3365, Adjusted R-squared: 0.336
## F-statistic: 759.6 on 1 and 1498 DF, p-value: < 2.2e-16
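The single-predictor fits above can be repeated for every measurement with a short loop, collecting the R-squared of each one-variable model so the predictors can be ranked by how much of the variation in age each explains on its own. A minimal sketch on simulated data (the columns `len` and `shell` and all coefficients are made up; on the real data the predictor vector would list the seven measurement columns):

```r
set.seed(2024)
n <- 250

# Simulated, made-up data with two candidate predictors.
toy <- data.frame(len = runif(n, min = 0.1, max = 0.8))
toy$shell <- 0.4 * toy$len + rnorm(n, sd = 0.05)
toy$age   <- 2 + 20 * toy$shell + rnorm(n, sd = 1.5)

predictors <- c("len", "shell")

# Fit age ~ <predictor> for each name and keep only the R-squared.
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "age"), data = toy)
  summary(fit)$r.squared
})

print(round(sort(r2, decreasing = TRUE), 3))
```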