Determining whether a trend exists for abalone with regard to age

Introduction (from Wikipedia):

Abalone are marine snails. Their taxonomy puts them in the family Haliotidae which contains only one genus, Haliotis, which once contained six subgenera. These subgenera have become alternate representations of Haliotis.[4] The number of species recognized worldwide ranges between 30[6] and 130[7] with over 230 species-level taxa described. The most comprehensive treatment of the family considers 56 species valid, with 18 additional subspecies.[8]

The shells of abalones have a low, open spiral structure, and are characterized by several open respiratory pores in a row near the shell’s outer edge. The thick inner layer of the shell is composed of nacre (mother-of-pearl), which in many species is highly iridescent, giving rise to a range of strong, changeable colors, which make the shells attractive to humans as decorative objects, jewelry, and as a source of colorful mother-of-pearl.

The flesh of abalones is widely considered to be a desirable food, and is consumed raw or cooked by a variety of cultures.

The dataset I'm using comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/abalone

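The variables in this dataset include:

1. Sex: male, female, or infant
2. Length (in mm): longest shell measurement
3. Diameter (in mm): perpendicular to length
4. Height (in mm): with meat in shell
5. Whole weight (in grams): weight of the whole abalone
6. Shucked weight (in grams): weight of the meat
7. Viscera weight (in grams): gut weight after bleeding out
8. Shell weight (in grams): weight after being dried
9. Rings (integer): number of shell rings, used here as the age in years

According to the UCI documentation, the continuous values in the file have been scaled by dividing by 200, which is why lengths appear as values such as 0.455, and adding 1.5 to the ring count gives the age in years. In this project I use the ring count directly as age.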

What is the goal of this project?

The main goal of this project is to predict age from physical measurements of the abalone using linear regression across different independent variables, with age as the dependent variable. I want to see which independent variable has the most effect on age.
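
Written out, a multiple linear regression of this kind takes the form below. The symbols are notation introduced here for illustration (the variable names match the column names assigned later in the code); the coefficients are estimated from the data and the last term is the error:

```latex
\[
\text{age} = \beta_0 + \beta_1\,\text{length} + \beta_2\,\text{diameter} + \beta_3\,\text{height}
  + \beta_4\,\text{total.weight} + \beta_5\,\text{meat.weight}
  + \beta_6\,\text{viscera.weight} + \beta_7\,\text{shell.weight} + \varepsilon
\]
```

Each coefficient is the expected change in age for a one-unit change in that variable with the others held fixed.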

What am I doing with this data?

The dataset on the UCI website contains 4177 rows. For this project, I use only the first 1500 rows instead of all 4177. I do this for convenience, and because I assume the remaining rows would not change the regression equation or the calculations much. The first 1500 rows are not a random sample, so this assumption is not tested here.
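
One way to check whether a 1500-row subset gives roughly the same fit as the full data is to compare the coefficients side by side. The following is a minimal sketch on simulated data; the data frame `sim`, its columns, and the true slope of 15 are all made up for illustration and are not the abalone data:

```r
# Simulated stand-in for a 4177-row dataset (hypothetical values)
set.seed(42)
n <- 4177
sim <- data.frame(length = runif(n, 0.1, 0.8))
sim$age <- 2 + 15 * sim$length + rnorm(n, sd = 2)

# First 1500 rows vs. a random sample of 1500 rows vs. all rows
first_rows  <- sim[1:1500, ]
random_rows <- sim[sample(nrow(sim), 1500), ]

fits <- list(
  first  = lm(age ~ length, data = first_rows),
  random = lm(age ~ length, data = random_rows),
  full   = lm(age ~ length, data = sim)
)

# One column per fit, one row per coefficient
coef_table <- sapply(fits, coef)
print(round(coef_table, 3))
```

In simulated data the row order is already random, so the first rows behave like a random sample. In a real file the rows may be ordered in some way, which is why comparing against a random sample is the more informative check.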

I took the data from the UCI ML website, converted it to Excel, and then imported it into SQLite before loading it into R for analysis. I also created the column names for this dataset. Here is the R code, which pulls the rows with a SQL query:

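setwd("C:/Documents/My Excel")
library(RSQLite)
# Open the SQLite database that holds the imported spreadsheet
db <- dbConnect(SQLite(), dbname = "Abalone.db")
# Pull the first 1500 rows returned from the table, then close the connection
df <- dbGetQuery(db, 'SELECT * from Abalone LIMIT 1500 ')
dbDisconnect(db)
# Assign readable column names
colnames(df) <- c("sex", "length", "diameter", "height", "total.weight",
                  "meat.weight", "viscera.weight", "shell.weight", "age")
head(df)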
##   sex length diameter height total.weight meat.weight viscera.weight
## 1   M  0.455    0.365  0.095       0.5140      0.2245         0.1010
## 2   M  0.350    0.265  0.090       0.2255      0.0995         0.0485
## 3   F  0.530    0.420  0.135       0.6770      0.2565         0.1415
## 4   M  0.440    0.365  0.125       0.5160      0.2155         0.1140
## 5   I  0.330    0.255  0.080       0.2050      0.0895         0.0395
## 6   I  0.425    0.300  0.095       0.3515      0.1410         0.0775
##   shell.weight age
## 1        0.150  15
## 2        0.070   7
## 3        0.210   9
## 4        0.155  10
## 5        0.055   7
## 6        0.120   8
tail(df)
##      sex length diameter height total.weight meat.weight viscera.weight
## 1495   M  0.620    0.485  0.155       1.0490      0.4620         0.2310
## 1496   F  0.620    0.435  0.155       1.0120      0.4770         0.2360
## 1497   M  0.620    0.480  0.165       1.0725      0.4815         0.2350
## 1498   M  0.625    0.520  0.175       1.4105      0.6910         0.3220
## 1499   M  0.625    0.470  0.180       1.1360      0.4510         0.3245
## 1500   M  0.630    0.470  0.145       1.1005      0.5200         0.2600
##      shell.weight age
## 1495       0.2500  10
## 1496       0.2750   8
## 1497       0.3120   9
## 1498       0.3465  10
## 1499       0.3050  11
## 1500       0.2760   9
abalone <- df
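
The SQLite round trip above can be tried without any file on disk. The sketch below is an assumption-laden illustration: it uses an in-memory database, a five-row table whose lengths are copied from the `head(df)` output, and a hypothetical `id` column that the original table may not have:

```r
library(DBI)

# In-memory database, so the sketch leaves no file behind
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Tiny stand-in table; `id` is a hypothetical column added for ordering
toy <- data.frame(id = 1:5, len = c(0.455, 0.350, 0.530, 0.440, 0.330))
dbWriteTable(con, "Abalone", toy)

# ORDER BY makes "the first n rows" well defined; LIMIT alone does not
first3 <- dbGetQuery(con, "SELECT * FROM Abalone ORDER BY id LIMIT 3")
dbDisconnect(con)

print(first3)
```

SQL does not guarantee row order without `ORDER BY`, so a query that relies on `LIMIT` alone returns whatever rows the engine happens to produce first.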

Filtering and summarizing the data by sex category

Next, I split the data by sex (male, female, infant) so that each group can be summarized and modeled separately, which keeps the analysis organized. After calculating the regressions for each group, I calculate them for the whole dataset.

male=dplyr::filter(abalone,sex=='M')
female=dplyr::filter(abalone,sex=='F')
infant=dplyr::filter(abalone,sex=='I')

summary(male)
##      sex                length          diameter          height      
##  Length:558         Min.   :0.1550   Min.   :0.1100   Min.   :0.0400  
##  Class :character   1st Qu.:0.4850   1st Qu.:0.3800   1st Qu.:0.1250  
##  Mode  :character   Median :0.5600   Median :0.4400   Median :0.1500  
##                     Mean   :0.5391   Mean   :0.4228   Mean   :0.1472  
##                     3rd Qu.:0.6100   3rd Qu.:0.4800   3rd Qu.:0.1700  
##                     Max.   :0.7650   Max.   :0.6000   Max.   :0.5150  
##   total.weight     meat.weight     viscera.weight    shell.weight   
##  Min.   :0.0155   Min.   :0.0065   Min.   :0.0030   Min.   :0.0050  
##  1st Qu.:0.5873   1st Qu.:0.2352   1st Qu.:0.1296   1st Qu.:0.1653  
##  Median :0.8965   Median :0.3782   Median :0.1900   Median :0.2550  
##  Mean   :0.9064   Mean   :0.3867   Mean   :0.1952   Mean   :0.2628  
##  3rd Qu.:1.1823   3rd Qu.:0.5101   3rd Qu.:0.2539   3rd Qu.:0.3400  
##  Max.   :2.8255   Max.   :1.1465   Max.   :0.5640   Max.   :0.8970  
##       age       
##  Min.   : 3.00  
##  1st Qu.: 9.00  
##  Median :10.00  
##  Mean   :10.88  
##  3rd Qu.:13.00  
##  Max.   :26.00
summary(female)
##      sex                length          diameter          height      
##  Length:488         Min.   :0.2750   Min.   :0.1950   Min.   :0.0150  
##  Class :character   1st Qu.:0.5050   1st Qu.:0.4000   1st Qu.:0.1350  
##  Mode  :character   Median :0.5750   Median :0.4500   Median :0.1550  
##                     Mean   :0.5649   Mean   :0.4453   Mean   :0.1534  
##                     3rd Qu.:0.6250   3rd Qu.:0.4950   3rd Qu.:0.1700  
##                     Max.   :0.8150   Max.   :0.6500   Max.   :0.2500  
##   total.weight     meat.weight    viscera.weight    shell.weight   
##  Min.   :0.0800   Min.   :0.031   Min.   :0.0215   Min.   :0.0250  
##  1st Qu.:0.6541   1st Qu.:0.261   1st Qu.:0.1434   1st Qu.:0.1988  
##  Median :0.9630   Median :0.392   Median :0.2095   Median :0.2750  
##  Mean   :0.9880   Mean   :0.415   Mean   :0.2164   Mean   :0.2880  
##  3rd Qu.:1.2416   3rd Qu.:0.540   3rd Qu.:0.2765   3rd Qu.:0.3571  
##  Max.   :2.6570   Max.   :1.488   Max.   :0.5195   Max.   :1.0050  
##       age       
##  Min.   : 5.00  
##  1st Qu.: 9.00  
##  Median :11.00  
##  Mean   :11.53  
##  3rd Qu.:13.00  
##  Max.   :29.00
summary(infant)
##      sex                length          diameter          height      
##  Length:454         Min.   :0.0750   Min.   :0.0550   Min.   :0.0000  
##  Class :character   1st Qu.:0.3350   1st Qu.:0.2512   1st Qu.:0.0850  
##  Mode  :character   Median :0.4150   Median :0.3150   Median :0.1000  
##                     Mean   :0.4064   Mean   :0.3101   Mean   :0.1033  
##                     3rd Qu.:0.4838   3rd Qu.:0.3750   3rd Qu.:0.1250  
##                     Max.   :0.6800   Max.   :0.5300   Max.   :0.1950  
##   total.weight     meat.weight      viscera.weight     shell.weight    
##  Min.   :0.0020   Min.   :0.00100   Min.   :0.00050   Min.   :0.00150  
##  1st Qu.:0.1792   1st Qu.:0.07575   1st Qu.:0.03800   1st Qu.:0.05500  
##  Median :0.3330   Median :0.15075   Median :0.06700   Median :0.09775  
##  Mean   :0.3807   Mean   :0.16958   Mean   :0.08098   Mean   :0.11205  
##  3rd Qu.:0.5221   3rd Qu.:0.23500   3rd Qu.:0.11175   3rd Qu.:0.15500  
##  Max.   :1.6260   Max.   :0.63150   Max.   :0.34450   Max.   :0.53000  
##       age        
##  Min.   : 1.000  
##  1st Qu.: 6.000  
##  Median : 7.000  
##  Mean   : 7.731  
##  3rd Qu.: 9.000  
##  Max.   :21.000
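
The per-group summaries above can also be condensed into a single table. Here is a minimal base-R sketch using only the six rows printed by `head(df)` earlier; the object names `toy`, `group_means`, and `group_counts` are made up for illustration:

```r
# First six rows of the data, copied from the head(df) output
toy <- data.frame(
  sex    = c("M", "M", "F", "M", "I", "I"),
  length = c(0.455, 0.350, 0.530, 0.440, 0.330, 0.425),
  age    = c(15, 7, 9, 10, 7, 8)
)

# Mean length and mean age within each sex, in one table
group_means <- aggregate(cbind(length, age) ~ sex, data = toy, FUN = mean)
print(group_means)

# Number of rows in each group
group_counts <- table(toy$sex)
print(group_counts)
```

Applied to the full `abalone` data frame, the same `aggregate()` call would give one row per sex, which makes the male, female, and infant means easier to compare than three separate `summary()` printouts.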

Graphing count vs. length

library(ggplot2)
library(cowplot)
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
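# Named p_* so the plots do not share a name with the lm() function used later
p_male <- ggplot(male, aes(x = length)) + geom_bar()
p_female <- ggplot(female, aes(x = length)) + geom_bar()
p_infant <- ggplot(infant, aes(x = length)) + geom_bar()
p_all <- ggplot(abalone, aes(x = length)) + geom_bar()
plot_grid(p_male, p_female, p_infant, p_all, labels = "AUTO")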
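
Since length is a continuous measurement, a histogram with an explicit bin width is a common alternative to `geom_bar()`, which draws one bar per distinct value. The sketch below is only an illustration: the data are simulated (group means loosely follow the summaries above, the spreads are invented), and the bin width of 0.025 is an arbitrary choice:

```r
library(ggplot2)

# Simulated lengths per group (hypothetical values, not the abalone data)
set.seed(1)
toy <- data.frame(
  sex    = rep(c("M", "F", "I"), each = 100),
  length = c(rnorm(100, 0.54, 0.10),
             rnorm(100, 0.56, 0.09),
             rnorm(100, 0.41, 0.10))
)

# One histogram panel per sex; call print(p) to draw it
p <- ggplot(toy, aes(x = length)) +
  geom_histogram(binwidth = 0.025) +
  facet_wrap(~ sex) +
  labs(x = "length", y = "count")
```

`facet_wrap()` puts the groups on shared axes in one figure, so the distributions can be compared directly without assembling separate plots.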

Calculating the regression on all predictors (male, female, and entire dataset)

malemodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=male)
summary(malemodel)
## 
## Call:
## lm(formula = age ~ length + diameter + height + total.weight + 
##     meat.weight + viscera.weight + shell.weight, data = male)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8040 -1.3865 -0.2947  1.1090 10.8962 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.0802     0.8788   3.505 0.000493 ***
## length           0.2886     5.3736   0.054 0.957192    
## diameter        14.0174     6.3572   2.205 0.027870 *  
## height           8.1004     4.8503   1.670 0.095472 .  
## total.weight     9.3349     2.3135   4.035 6.24e-05 ***
## meat.weight    -23.5585     2.5107  -9.383  < 2e-16 ***
## viscera.weight  -9.4436     3.6745  -2.570 0.010431 *  
## shell.weight    11.4701     3.5513   3.230 0.001313 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.314 on 550 degrees of freedom
## Multiple R-squared:  0.5429, Adjusted R-squared:  0.5371 
## F-statistic: 93.31 on 7 and 550 DF,  p-value: < 2.2e-16
femalemodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=female)
summary(femalemodel)
## 
## Call:
## lm(formula = age ~ length + diameter + height + total.weight + 
##     meat.weight + viscera.weight + shell.weight, data = female)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8584 -1.8418 -0.5048  1.3792 12.8455 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       6.730      1.740   3.868 0.000125 ***
## length           -7.896      6.856  -1.152 0.250073    
## diameter         12.395      8.314   1.491 0.136661    
## height           16.402      7.664   2.140 0.032842 *  
## total.weight     14.914      2.284   6.529 1.69e-10 ***
## meat.weight     -25.779      2.593  -9.941  < 2e-16 ***
## viscera.weight  -15.133      4.275  -3.540 0.000440 ***
## shell.weight      1.604      3.388   0.473 0.636135    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.788 on 480 degrees of freedom
## Multiple R-squared:  0.3832, Adjusted R-squared:  0.3742 
## F-statistic: 42.59 on 7 and 480 DF,  p-value: < 2.2e-16
totalmodel<-lm(age~length+diameter+height+total.weight+meat.weight+viscera.weight+shell.weight,data=abalone)
summary(totalmodel)
## 
## Call:
## lm(formula = age ~ length + diameter + height + total.weight + 
##     meat.weight + viscera.weight + shell.weight, data = abalone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2855 -1.4827 -0.4077  1.0578 12.7417 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.670      0.460   3.631 0.000292 ***
## length           -1.100      3.195  -0.344 0.730728    
## diameter         16.745      3.999   4.187 2.99e-05 ***
## height           17.765      3.561   4.989 6.79e-07 ***
## total.weight     12.160      1.408   8.635  < 2e-16 ***
## meat.weight     -25.421      1.551 -16.386  < 2e-16 ***
## viscera.weight  -13.219      2.469  -5.354 9.94e-08 ***
## shell.weight      5.223      2.144   2.436 0.014957 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.404 on 1492 degrees of freedom
## Multiple R-squared:  0.5697, Adjusted R-squared:  0.5677 
## F-statistic: 282.2 on 7 and 1492 DF,  p-value: < 2.2e-16
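
Raw coefficients like the ones above are expressed per unit of each predictor, so their sizes are not directly comparable when the predictors have different spreads. One common option is to standardize the predictors before fitting. The following is a minimal sketch on simulated data; the columns `small_scale` and `large_scale` and all numbers are invented for illustration:

```r
set.seed(7)
n <- 500
# Made-up predictors with very different spreads (not the abalone data)
sim <- data.frame(
  small_scale = rnorm(n, mean = 0.15, sd = 0.04),
  large_scale = rnorm(n, mean = 0.90, sd = 0.45)
)
sim$y <- 3 + 20 * sim$small_scale + 4 * sim$large_scale + rnorm(n)

# Raw fit: coefficients are per unit of each predictor
raw_fit <- lm(y ~ small_scale + large_scale, data = sim)

# Standardized fit: each predictor rescaled to mean 0 and sd 1,
# so coefficients are per standard deviation
cols <- c("small_scale", "large_scale")
std <- sim
std[cols] <- lapply(std[cols], function(v) (v - mean(v)) / sd(v))
std_fit <- lm(y ~ small_scale + large_scale, data = std)

print(round(coef(raw_fit), 2))
print(round(coef(std_fit), 2))
```

In this simulation the raw coefficient of `small_scale` is the larger one, but after standardizing, `large_scale` has the larger coefficient. Predictors that are strongly correlated with one another, as shell measurements may well be, also inflate standard errors, so even standardized coefficients should be read with caution.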

Seeing if there is a relationship between shell length and age (x = male, y = female, z = infant, a = entire dataset)

x<-lm(age~length,data=male)
summary(x)
## 
## Call:
## lm(formula = age ~ length, data = male)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5803 -2.1571 -0.6384  1.6313 14.2122 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8253     0.6503   4.344 1.66e-05 ***
## length       14.9374     1.1831  12.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.001 on 556 degrees of freedom
## Multiple R-squared:  0.2228, Adjusted R-squared:  0.2214 
## F-statistic: 159.4 on 1 and 556 DF,  p-value: < 2.2e-16
y<-lm(age~length,data=female)
summary(y)
## 
## Call:
## lm(formula = age ~ length, data = female)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4670 -2.3822 -0.9522  1.3713 16.1386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.958      1.006   5.922 6.02e-09 ***
## length         9.862      1.760   5.603 3.52e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.419 on 486 degrees of freedom
## Multiple R-squared:  0.06069,    Adjusted R-squared:  0.05875 
## F-statistic:  31.4 on 1 and 486 DF,  p-value: 3.525e-08
z<-lm(age~length,data=infant)
summary(z)
## 
## Call:
## lm(formula = age ~ length, data = infant)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3059 -1.3059 -0.3968  0.6335 11.1569 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7907     0.3844   2.057   0.0403 *  
## length       17.0801     0.9133  18.702   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.137 on 452 degrees of freedom
## Multiple R-squared:  0.4363, Adjusted R-squared:  0.435 
## F-statistic: 349.8 on 1 and 452 DF,  p-value: < 2.2e-16
a<-lm(age~length,data=abalone)
summary(a)
## 
## Call:
## lm(formula = age ~ length, data = abalone)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.781 -2.020 -0.742  1.326 15.527 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.3554     0.3278   4.135 3.75e-05 ***
## length       17.3109     0.6281  27.561  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.979 on 1498 degrees of freedom
## Multiple R-squared:  0.3365, Adjusted R-squared:  0.336 
## F-statistic: 759.6 on 1 and 1498 DF,  p-value: < 2.2e-16
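
The fitted line for the entire dataset can be read off the output above as age = 1.3554 + 17.3109 × length. As a small worked example, the sketch below plugs a few lengths into that equation; `predict_age` is a hypothetical helper and the example lengths are arbitrary values within the observed range:

```r
# Coefficients copied from the summary(a) output
intercept <- 1.3554
slope     <- 17.3109

# Hypothetical helper: predicted age for a given length (in the file's units)
predict_age <- function(length) intercept + slope * length

example_lengths <- c(0.3, 0.5, 0.7)
predicted <- predict_age(example_lengths)
print(data.frame(length = example_lengths, predicted_age = predicted))
```

For a length of 0.5 this gives 1.3554 + 17.3109 × 0.5 = 10.01085. With an R-squared of 0.3365 for this model, length alone accounts for about a third of the variation in age, so individual predictions can be far off.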