Examining the chewiness and springiness of berries in relation to sugar and NaCl concentration

For this project, I have several goals that I will address later on. The project combines 2 datasets from the website http://users.stat.ufl.edu/~winner/datasets.html, listed under the “LR 1a) Linear Regression” section. Both datasets relate sugar equivalent to a texture property of berries: chewiness in one and springiness in the other. The factors from these 2 datasets include:

1. NaCl concentration, i.e. salt

2. Sugar equivalent (in g/L). This is the same quantity as the sugar content (sugCont) in the springiness dataset

3. Chewiness (mJ) of berries at 6 sugar equivalent levels (15 berries per level)

4. Springiness (mm) of berries, i.e. elasticity: how well a berry returns to its original length after being stretched

The other dataset, which deals with springiness, has the variables nacl, sugTrt (sugar treatment level, 1 to 6), sugCont (sugar content), and springiness.

Looking at the data

library(readr)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
chewy<-read_csv("C:/Users/kevin/Downloads/berry_sugar_chewy.csv")
## Parsed with column specification:
## cols(
##   nacl = col_integer(),
##   sugar = col_double(),
##   chewiness = col_double()
## )
springy<-read_csv("C:/Users/kevin/Downloads/berry_sugar_springy.csv")
## Parsed with column specification:
## cols(
##   nacl = col_integer(),
##   sugTrt = col_integer(),
##   sugCont = col_double(),
##   springiness = col_double()
## )
summary(chewy)
##       nacl         sugar         chewiness     
##  Min.   :130   Min.   :176.5   Min.   :0.4185  
##  1st Qu.:140   1st Qu.:192.6   1st Qu.:1.8694  
##  Median :155   Median :217.2   Median :2.6538  
##  Mean   :155   Mean   :217.3   Mean   :2.7083  
##  3rd Qu.:170   3rd Qu.:242.1   3rd Qu.:3.4667  
##  Max.   :180   Max.   :258.5   Max.   :5.5386
summary(springy)
##       nacl         sugTrt       sugCont       springiness   
##  Min.   :130   Min.   :1.0   Min.   :176.5   Min.   :1.100  
##  1st Qu.:140   1st Qu.:2.0   1st Qu.:192.6   1st Qu.:1.500  
##  Median :155   Median :3.5   Median :217.2   Median :1.764  
##  Mean   :155   Mean   :3.5   Mean   :217.5   Mean   :1.768  
##  3rd Qu.:170   3rd Qu.:5.0   3rd Qu.:242.1   3rd Qu.:1.999  
##  Max.   :180   Max.   :6.0   Max.   :259.5   Max.   :2.523
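The summaries suggest a balanced design: six sugar treatment levels with 15 berries each, for 90 rows in total. Here is a minimal sketch of how that could be checked with base R's table(). The data frame `toy` is made up for illustration and only mimics the layout of `springy`. The NaCl values of 130 to 180 in steps of 10 are my assumption, inferred from the quartiles, not read from the file.

```r
# Hypothetical stand-in for `springy`: 6 treatment levels, 15 berries each.
# The NaCl levels are an assumption based on the summary() quartiles above.
toy <- data.frame(
  sugTrt = rep(1:6, each = 15),
  nacl   = rep(seq(130, 180, by = 10), each = 15)
)

# Count the rows per treatment level; a balanced design gives equal counts.
counts <- table(toy$sugTrt)
print(counts)

# Each treatment level should map to exactly one NaCl concentration.
levels_per_trt <- tapply(toy$nacl, toy$sugTrt, function(x) length(unique(x)))
print(levels_per_trt)
```

On the real data, the same two calls would be run on `springy` instead of `toy`.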

Merging the data

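I decided to combine the 2 datasets into one. Both datasets share the NaCl column, and the head() and tail() output below suggests that their rows are listed in the same order, so I bound them side by side with bind_cols(). Note that bind_cols() matches rows by position, not by NaCl value, so this only works because the two files are ordered identically.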

entiredata <- dplyr::bind_cols(chewy,springy)
nrow(entiredata)
## [1] 90
head(entiredata)
## # A tibble: 6 x 7
##    nacl sugar chewiness nacl1 sugTrt sugCont springiness
##   <int> <dbl>     <dbl> <int>  <int>   <dbl>       <dbl>
## 1   130  176.      2.14   130      1    176.        1.64
## 2   130  176.      3.66   130      1    176.        2.19
## 3   130  176.      4.77   130      1    176.        2.18
## 4   130  176.      1.18   130      1    176.        2.19
## 5   130  176.      3.75   130      1    176.        2.45
## 6   130  176.      1.84   130      1    176.        2.21
tail(entiredata)
## # A tibble: 6 x 7
##    nacl sugar chewiness nacl1 sugTrt sugCont springiness
##   <int> <dbl>     <dbl> <int>  <int>   <dbl>       <dbl>
## 1   180  258.     1.54    180      6    260.        1.19
## 2   180  258.     1.71    180      6    260.        1.72
## 3   180  258.     1.30    180      6    260.        1.34
## 4   180  258.     1.02    180      6    260.        1.28
## 5   180  258.     0.418   180      6    260.        1.17
## 6   180  258.     2.24    180      6    260.        1.71
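Because bind_cols() pairs rows purely by position, it is worth confirming that the two tables are in the same order before binding. Here is a minimal sketch in base R. The small data frames `a` and `b` are invented, and cbind() stands in for bind_cols() only so that the example needs no packages.

```r
# Invented miniature versions of the two tables (values are not real data).
a <- data.frame(nacl = c(130L, 130L, 140L), chewiness   = c(2.1, 3.7, 1.2))
b <- data.frame(nacl = c(130L, 130L, 140L), springiness = c(1.6, 2.2, 1.9))

# Positional binding is only safe if the shared key agrees row by row.
stopifnot(nrow(a) == nrow(b), identical(a$nacl, b$nacl))

# Bind side by side, dropping the duplicated key from the second table.
both <- cbind(a, b[, setdiff(names(b), "nacl"), drop = FALSE])
print(both)
```

If the stopifnot() check failed on the real data, a key-based join would probably be the safer route than binding by position.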

You can see above that the merged data repeats information: the 4th column (nacl1) duplicates nacl, and the 6th column (sugCont) is almost the same as sugar (the summaries differ only slightly, for example a maximum of 258.5 versus 259.5). To make the dataset more organized, I am going to drop those two columns. Also, since chewiness and springiness are the dependent variables, I decided to put those columns first in the dataset.

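# Drop the redundant columns nacl1 (4) and sugCont (6)
entiredata <- entiredata[, -c(4, 6)]
# Reorder to: chewiness, springiness, nacl, sugar, sugTrt
entiredata <- entiredata[, c(3, 5, 1, 2, 4)]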
head(entiredata)
## # A tibble: 6 x 5
##   chewiness springiness  nacl sugar sugTrt
##       <dbl>       <dbl> <int> <dbl>  <int>
## 1      2.14        1.64   130  176.      1
## 2      3.66        2.19   130  176.      1
## 3      4.77        2.18   130  176.      1
## 4      1.18        2.19   130  176.      1
## 5      3.75        2.45   130  176.      1
## 6      1.84        2.21   130  176.      1
tail(entiredata)
## # A tibble: 6 x 5
##   chewiness springiness  nacl sugar sugTrt
##       <dbl>       <dbl> <int> <dbl>  <int>
## 1     1.54         1.19   180  258.      6
## 2     1.71         1.72   180  258.      6
## 3     1.30         1.34   180  258.      6
## 4     1.02         1.28   180  258.      6
## 5     0.418        1.17   180  258.      6
## 6     2.24         1.71   180  258.      6

Now this looks nicer!
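Selecting columns by position, as above, can break silently if the column order ever changes. Here is a sketch of the same tidy-up using column names in base R. `wide` is a made-up one-row data frame that only borrows the seven column names from the merged output.

```r
# Made-up single row that mimics the seven merged columns shown earlier.
wide <- data.frame(nacl = 130L, sugar = 176.5, chewiness = 2.14,
                   nacl1 = 130L, sugTrt = 1L, sugCont = 176.5,
                   springiness = 1.64)

# Keep the two responses first, then the predictors; nacl1 and sugCont
# are left out because they repeat the information in nacl and sugar.
keep <- c("chewiness", "springiness", "nacl", "sugar", "sugTrt")
tidy <- wide[, keep]
print(names(tidy))
```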

Plotting several factors against the chewiness of the berries

xx<-ggplot(data=entiredata,aes(x=chewiness,y=sugar))+geom_jitter()
xy<-ggplot(data=entiredata,aes(x=chewiness,y=springiness))+geom_jitter()
xz<-ggplot(data=entiredata,aes(x=chewiness,y=nacl))+geom_jitter()

figure<-ggarrange(xx,xy,xz,labels=c("a","b","c"), ncol=2, nrow=2)
figure

# Here is a graph of sugTrt vs chewiness. The colors mark the sugar treatment level (1 to 6)

tt<-ggplot(data=entiredata,aes(x=chewiness,y=sugTrt,
                               color=factor(sugTrt)))+geom_jitter()
tt

Looking at this colored graph, it appears that the berries with the least sugar (the orange ones) were the chewiest. There seems to be a negative association between sugar and chewiness: the less sugar, the chewier the berry. From graph “b”, there appears to be a positive association between chewiness and elasticity.
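The visual impressions above can be backed with numbers using cor(). Here is a hedged sketch on simulated data. `sim` is generated here and is not the berry data, so the printed values say nothing about the real berries. The slope and noise level are arbitrary assumptions.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated data: sugar at six levels, chewiness falling as sugar rises.
# The slope (-0.03) and noise (sd = 0.9) are arbitrary choices.
sugar <- rep(seq(176, 258, length.out = 6), each = 15)
sim <- data.frame(
  sugar     = sugar,
  chewiness = 9 - 0.03 * sugar + rnorm(length(sugar), sd = 0.9)
)

# Pearson correlation: the sign gives the direction, the size the strength.
r <- cor(sim$sugar, sim$chewiness)
print(round(r, 2))

# cor.test() adds a confidence interval and a p-value for the same estimate.
print(cor.test(sim$sugar, sim$chewiness)$conf.int)
```

On the real data, the equivalent call would be cor(entiredata$sugar, entiredata$chewiness).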

Checking the accuracy of the models that I will create

The goal now is to see whether chewiness and springiness depend on several independent variables. Since chewiness and springiness are both dependent variables, I will fit the regressions separately.

chewiness<-lm(entiredata$chewiness~entiredata$springiness+entiredata$nacl+entiredata$sugar)
summary(chewiness)
## 
## Call:
## lm(formula = entiredata$chewiness ~ entiredata$springiness + 
##     entiredata$nacl + entiredata$sugar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3803 -0.6323  0.1253  0.5269  1.9726 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)            -7.09916   13.72720  -0.517    0.606
## entiredata$springiness -0.01934    0.42407  -0.046    0.964
## entiredata$nacl         0.65813    0.61074   1.078    0.284
## entiredata$sugar       -0.42409    0.37262  -1.138    0.258
## 
## Residual standard error: 0.9222 on 86 degrees of freedom
## Multiple R-squared:  0.3402, Adjusted R-squared:  0.3172 
## F-statistic: 14.78 on 3 and 86 DF,  p-value: 7.632e-08
springiness<-lm(entiredata$springiness~entiredata$chewiness+entiredata$nacl+entiredata$sugar)
summary(springiness)
## 
## Call:
## lm(formula = entiredata$springiness ~ entiredata$chewiness + 
##     entiredata$nacl + entiredata$sugar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5836 -0.1442 -0.0101  0.1432  0.4974 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)           0.59026    3.49533   0.169    0.866
## entiredata$chewiness -0.00125    0.02742  -0.046    0.964
## entiredata$nacl       0.13590    0.15565   0.873    0.385
## entiredata$sugar     -0.09149    0.09495  -0.964    0.338
## 
## Residual standard error: 0.2345 on 86 degrees of freedom
## Multiple R-squared:  0.5286, Adjusted R-squared:  0.5122 
## F-statistic: 32.14 on 3 and 86 DF,  p-value: 4.95e-14
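The coefficient tables above carry the `entiredata$` prefix because the columns were passed to lm() with `$`. Here is a sketch of the same kind of model written with bare column names and the `data` argument, which gives shorter coefficient names and lets predict() work on new rows. `sim` is simulated with an arbitrary assumed relationship. It is not the berry data, and it uses only two predictors to keep the example small.

```r
set.seed(2)  # reproducible simulated data, not the berry measurements

n <- 90
sim <- data.frame(
  sugar       = rep(seq(176, 258, length.out = 6), each = 15),
  springiness = runif(n, min = 1.1, max = 2.5)
)
# Arbitrary assumed relationship, used only to have something to fit.
sim$chewiness <- 9 - 0.03 * sim$sugar + rnorm(n, sd = 0.9)

# Bare column names plus `data =` instead of sim$column.
fit <- lm(chewiness ~ springiness + sugar, data = sim)
print(names(coef(fit)))

# Because the formula uses bare names, predict() accepts new rows.
new_berry <- data.frame(springiness = 1.8, sugar = 200)
print(predict(fit, newdata = new_berry))
```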

What can I say about these 2 models that I created?

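Comparing the two, the second model (springiness), with an adjusted \(R^2\) of about 51%, explains considerably more of the variation in its response than the first model (chewiness), with an adjusted \(R^2\) of about 32%. Note that in both models no single coefficient is significant at the 5% level even though the overall F-test is, which is what one would expect if nacl and sugar are strongly correlated with each other. There may be good reasons why the springiness of the berries is modeled more accurately than their chewiness.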
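For reference, each summary reports two \(R^2\) values. The adjusted one penalizes the number of predictors. Here is a small sketch of the standard adjustment formula, applied to the multiple \(R^2\) values printed above. The helper name `adj_r2` is my own and not part of any package.

```r
# Adjusted R-squared from multiple R-squared, n observations, p predictors.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from the two summaries: 90 rows, 3 predictors each
# (86 residual degrees of freedom = 90 - 3 - 1).
chewy_adj   <- adj_r2(0.3402, n = 90, p = 3)
springy_adj <- adj_r2(0.5286, n = 90, p = 3)

print(round(c(chewiness = chewy_adj, springiness = springy_adj), 4))
```

The results agree with the adjusted values that summary() printed (0.3172 and 0.5122).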