For this project, I have several goals that I will address later. The project combines 2 datasets from the website http://users.stat.ufl.edu/~winner/datasets.html, listed under the “linear regression” section, which is abbreviated as “LR 1a) Linear Regression”. The 2 datasets deal with the sugar equivalent of berries together with their chewiness and their springiness, respectively. The dataset that deals with chewiness has the variables nacl, sugar, and chewiness.
The other dataset, which deals with springiness, has the variables nacl, sugTrt, sugCont, and springiness (elasticity: how well the berry can be stretched and return to its original length).
library(readr)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
chewy<-read_csv("C:/Users/kevin/Downloads/berry_sugar_chewy.csv")
## Parsed with column specification:
## cols(
## nacl = col_integer(),
## sugar = col_double(),
## chewiness = col_double()
## )
springy<-read_csv("C:/Users/kevin/Downloads/berry_sugar_springy.csv")
## Parsed with column specification:
## cols(
## nacl = col_integer(),
## sugTrt = col_integer(),
## sugCont = col_double(),
## springiness = col_double()
## )
summary(chewy)
## nacl sugar chewiness
## Min. :130 Min. :176.5 Min. :0.4185
## 1st Qu.:140 1st Qu.:192.6 1st Qu.:1.8694
## Median :155 Median :217.2 Median :2.6538
## Mean :155 Mean :217.3 Mean :2.7083
## 3rd Qu.:170 3rd Qu.:242.1 3rd Qu.:3.4667
## Max. :180 Max. :258.5 Max. :5.5386
summary(springy)
## nacl sugTrt sugCont springiness
## Min. :130 Min. :1.0 Min. :176.5 Min. :1.100
## 1st Qu.:140 1st Qu.:2.0 1st Qu.:192.6 1st Qu.:1.500
## Median :155 Median :3.5 Median :217.2 Median :1.764
## Mean :155 Mean :3.5 Mean :217.5 Mean :1.768
## 3rd Qu.:170 3rd Qu.:5.0 3rd Qu.:242.1 3rd Qu.:1.999
## Max. :180 Max. :6.0 Max. :259.5 Max. :2.523
I decided to combine the 2 datasets into one. Both datasets have nacl in common, but note that bind_cols() does not match on it: it pairs rows by position, so this step relies on both files listing their rows in the same order.
entiredata <- dplyr::bind_cols(chewy,springy)
nrow(entiredata)
## [1] 90
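Because bind_cols() pairs rows by position rather than by a key, it may be worth confirming that the shared column agrees row by row before combining. The following is a minimal sketch of that check, not the code used in this project: chewy_toy and springy_toy are made-up stand-ins with invented values, and dropping the second nacl before binding is just one possible choice.

```r
library(dplyr)

# Hypothetical stand-ins for the two datasets (values are invented)
chewy_toy <- tibble(
  nacl      = c(130L, 130L, 180L),
  sugar     = c(176.5, 176.5, 258.5),
  chewiness = c(2.1, 3.7, 1.5)
)
springy_toy <- tibble(
  nacl        = c(130L, 130L, 180L),
  sugTrt      = c(1L, 1L, 6L),
  springiness = c(1.6, 2.2, 1.2)
)

# Stop early if the two tables do not line up row by row on nacl
stopifnot(
  nrow(chewy_toy) == nrow(springy_toy),
  all(chewy_toy$nacl == springy_toy$nacl)
)

# Drop the duplicate nacl column before binding so no renamed copy appears
combined_toy <- bind_cols(chewy_toy, select(springy_toy, -nacl))
names(combined_toy)
```

If the check fails, the rows are not aligned and a positional bind would silently pair unrelated measurements.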
head(entiredata)
## # A tibble: 6 x 7
## nacl sugar chewiness nacl1 sugTrt sugCont springiness
## <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 130 176. 2.14 130 1 176. 1.64
## 2 130 176. 3.66 130 1 176. 2.19
## 3 130 176. 4.77 130 1 176. 2.18
## 4 130 176. 1.18 130 1 176. 2.19
## 5 130 176. 3.75 130 1 176. 2.45
## 6 130 176. 1.84 130 1 176. 2.21
tail(entiredata)
## # A tibble: 6 x 7
## nacl sugar chewiness nacl1 sugTrt sugCont springiness
## <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 180 258. 1.54 180 6 260. 1.19
## 2 180 258. 1.71 180 6 260. 1.72
## 3 180 258. 1.30 180 6 260. 1.34
## 4 180 258. 1.02 180 6 260. 1.28
## 5 180 258. 0.418 180 6 260. 1.17
## 6 180 258. 2.24 180 6 260. 1.71
You can see above that the 4th column, nacl1, is a second copy of nacl that was created when I combined the datasets, and that the 6th column, sugCont, carries nearly the same values as sugar. To make the dataset look more organized, I am going to drop both of those columns. Also, since chewiness and springiness are the dependent variables, I decided to put those columns first in the dataset.
entiredata=entiredata[,-c(4,6)]
entiredata=entiredata[,c(3,5,1,2,4)]
head(entiredata)
## # A tibble: 6 x 5
## chewiness springiness nacl sugar sugTrt
## <dbl> <dbl> <int> <dbl> <int>
## 1 2.14 1.64 130 176. 1
## 2 3.66 2.19 130 176. 1
## 3 4.77 2.18 130 176. 1
## 4 1.18 2.19 130 176. 1
## 5 3.75 2.45 130 176. 1
## 6 1.84 2.21 130 176. 1
tail(entiredata)
## # A tibble: 6 x 5
## chewiness springiness nacl sugar sugTrt
## <dbl> <dbl> <int> <dbl> <int>
## 1 1.54 1.19 180 258. 6
## 2 1.71 1.72 180 258. 6
## 3 1.30 1.34 180 258. 6
## 4 1.02 1.28 180 258. 6
## 5 0.418 1.17 180 258. 6
## 6 2.24 1.71 180 258. 6
Now this looks nicer!
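Dropping and reordering columns by position works here, but it depends on the columns staying in exactly this order. As a sketch of an alternative, the same result can be written with column names using dplyr's select(). The one-row tibble toy below is a made-up stand-in for the combined data, with the column names copied from the output above.

```r
library(dplyr)

# Hypothetical one-row stand-in with the same 7 column names as the combined data
toy <- tibble(
  nacl = 130L, sugar = 176.5, chewiness = 2.14,
  nacl1 = 130L, sugTrt = 1L, sugCont = 176.5, springiness = 1.64
)

# Keep only the wanted columns, in the wanted order, by name
tidy_toy <- toy %>%
  select(chewiness, springiness, nacl, sugar, sugTrt)

names(tidy_toy)
```

Selecting by name drops nacl1 and sugCont implicitly and keeps working if the column order changes upstream.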
# Inside aes(), refer to columns by bare name; ggplot looks them up in `data`
xx<-ggplot(data=entiredata,aes(x=chewiness,y=sugar))+geom_jitter()
xy<-ggplot(data=entiredata,aes(x=chewiness,y=springiness))+geom_jitter()
xz<-ggplot(data=entiredata,aes(x=chewiness,y=nacl))+geom_jitter()
figure<-ggarrange(xx,xy,xz,labels=c("a","b","c"), ncol=2, nrow=2)
figure
# Here is a graph of sugTrt vs. chewiness. The colors mark the sugar treatment level (sugTrt)
tt<-ggplot(data=entiredata,aes(x=chewiness,y=sugTrt,
color=factor(sugTrt)))+geom_jitter()
tt
Looking at this colored graph, it appears that the berries with the least sugar (the orange ones) were the chewiest. We can see a negative correlation between sugar and chewiness: the less sugar, the chewier the berry. From graph “b”, it appears that there is an upward correlation between chewiness and elasticity.
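A relationship read off a jittered plot can be backed up with a number. The sketch below shows how cor() from base R could be used for that; the vectors sugar_toy and chewiness_toy are invented values chosen only to illustrate a downward trend, not the project's data.

```r
# Hypothetical values: chewiness falls as sugar rises
sugar_toy     <- c(176, 192, 217, 242, 258)
chewiness_toy <- c(4.0, 3.5, 2.6, 2.0, 1.2)

# Pearson correlation; the sign gives the direction, the size gives the strength
r <- cor(sugar_toy, chewiness_toy)
r
```

On the real data the equivalent call would be cor(entiredata$sugar, entiredata$chewiness), and the same idea applies to chewiness and springiness.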
The goal now is to see whether chewiness and springiness depend on several independent variables. Since chewiness and springiness are both dependent variables, I will fit the regressions separately.
chewiness<-lm(entiredata$chewiness~entiredata$springiness+entiredata$nacl+entiredata$sugar)
summary(chewiness)
##
## Call:
## lm(formula = entiredata$chewiness ~ entiredata$springiness +
## entiredata$nacl + entiredata$sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3803 -0.6323 0.1253 0.5269 1.9726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.09916 13.72720 -0.517 0.606
## entiredata$springiness -0.01934 0.42407 -0.046 0.964
## entiredata$nacl 0.65813 0.61074 1.078 0.284
## entiredata$sugar -0.42409 0.37262 -1.138 0.258
##
## Residual standard error: 0.9222 on 86 degrees of freedom
## Multiple R-squared: 0.3402, Adjusted R-squared: 0.3172
## F-statistic: 14.78 on 3 and 86 DF, p-value: 7.632e-08
springiness<-lm(entiredata$springiness~entiredata$chewiness+entiredata$nacl+entiredata$sugar)
summary(springiness)
##
## Call:
## lm(formula = entiredata$springiness ~ entiredata$chewiness +
## entiredata$nacl + entiredata$sugar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5836 -0.1442 -0.0101 0.1432 0.4974
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59026 3.49533 0.169 0.866
## entiredata$chewiness -0.00125 0.02742 -0.046 0.964
## entiredata$nacl 0.13590 0.15565 0.873 0.385
## entiredata$sugar -0.09149 0.09495 -0.964 0.338
##
## Residual standard error: 0.2345 on 86 degrees of freedom
## Multiple R-squared: 0.5286, Adjusted R-squared: 0.5122
## F-statistic: 32.14 on 3 and 86 DF, p-value: 4.95e-14
Looking at both of them, it appears that the second model, which had an adjusted \(R^2\) of close to 51%, fit the data better than the first model, which had an adjusted \(R^2\) of about 32%. There are some good reasons why the model for the elasticity of the berries fits better than the model for their chewiness.
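The adjusted \(R^2\) values can also be pulled out of the fitted models directly instead of being read from the printed summaries. The sketch below does this on simulated data (the data frame toy and all its coefficients are invented), and it also computes the correlation between the two predictors. That second check is only a suggestion: when a model is significant overall but none of its coefficients are, strongly correlated predictors are one possible explanation.

```r
set.seed(1)

# Simulated stand-in data: sugar is deliberately built to track nacl closely
toy <- data.frame(nacl = rep(c(130, 140, 150, 160, 170, 180), each = 5))
toy$sugar       <- 1.6 * toy$nacl - 30 + rnorm(30, sd = 2)
toy$chewiness   <- 8 - 0.025 * toy$sugar + rnorm(30, sd = 0.9)
toy$springiness <- 4 - 0.010 * toy$sugar + rnorm(30, sd = 0.2)

# Passing data = keeps the formula short and the coefficient names readable
fit_chewy   <- lm(chewiness ~ nacl + sugar, data = toy)
fit_springy <- lm(springiness ~ nacl + sugar, data = toy)

# Adjusted R-squared for each model, side by side
adj_r2 <- c(
  chewiness   = summary(fit_chewy)$adj.r.squared,
  springiness = summary(fit_springy)$adj.r.squared
)
adj_r2

# Correlation between the predictors; values near 1 inflate the standard errors
predictor_cor <- cor(toy$nacl, toy$sugar)
predictor_cor
```

Whether the real nacl and sugar columns are this strongly related would need to be checked on entiredata itself.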