DSlab Assignment
Loading Required Libraries
library(dslabs)
data("olive")
Explore the olive dataset by viewing its first six rows
head(olive)
          region         area palmitic palmitoleic stearic oleic linoleic
1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
  linolenic arachidic eicosenoic
1      0.36      0.60       0.29
2      0.31      0.61       0.29
3      0.31      0.63       0.29
4      0.50      0.78       0.35
5      0.50      0.80       0.46
6      0.51      0.70       0.44
Structure of the dataset
str(olive)
'data.frame': 572 obs. of 10 variables:
$ region : Factor w/ 3 levels "Northern Italy",..: 3 3 3 3 3 3 3 3 3 3 ...
$ area : Factor w/ 9 levels "Calabria","Coast-Sardinia",..: 5 5 5 5 5 5 5 5 5 5 ...
$ palmitic : num 10.75 10.88 9.11 9.66 10.51 ...
$ palmitoleic: num 0.75 0.73 0.54 0.57 0.67 0.49 0.66 0.61 0.6 0.55 ...
$ stearic : num 2.26 2.24 2.46 2.4 2.59 2.68 2.64 2.35 2.39 2.13 ...
$ oleic : num 78.2 77.1 81.1 79.5 77.7 ...
$ linoleic : num 6.72 7.81 5.49 6.19 6.72 6.78 6.18 7.34 7.09 6.33 ...
$ linolenic : num 0.36 0.31 0.31 0.5 0.5 0.51 0.49 0.39 0.46 0.26 ...
$ arachidic : num 0.6 0.61 0.63 0.78 0.8 0.7 0.56 0.64 0.83 0.52 ...
$ eicosenoic : num 0.29 0.29 0.29 0.35 0.46 0.44 0.29 0.35 0.33 0.3 ...
Summary statistics of the above dataset
summary(olive)
            region                 area        palmitic      palmitoleic
 Northern Italy:151   South-Apulia   :206   Min.   : 6.10   Min.   :0.1500
 Sardinia      : 98   Inland-Sardinia: 65   1st Qu.:10.95   1st Qu.:0.8775
 Southern Italy:323   Calabria       : 56   Median :12.01   Median :1.1000
                      Umbria         : 51   Mean   :12.32   Mean   :1.2609
                      East-Liguria   : 50   3rd Qu.:13.60   3rd Qu.:1.6925
                      West-Liguria   : 50   Max.   :17.53   Max.   :2.8000
                      (Other)        : 94
    stearic          oleic          linoleic        linolenic
 Min.   :1.520   Min.   :63.00   Min.   : 4.480   Min.   :0.0000
 1st Qu.:2.050   1st Qu.:70.00   1st Qu.: 7.707   1st Qu.:0.2600
 Median :2.230   Median :73.03   Median :10.300   Median :0.3300
 Mean   :2.289   Mean   :73.12   Mean   : 9.805   Mean   :0.3189
 3rd Qu.:2.490   3rd Qu.:76.80   3rd Qu.:11.807   3rd Qu.:0.4025
 Max.   :3.750   Max.   :84.10   Max.   :14.700   Max.   :0.7400
   arachidic       eicosenoic
 Min.   :0.000   Min.   :0.0100
 1st Qu.:0.500   1st Qu.:0.0200
 Median :0.610   Median :0.1700
 Mean   :0.581   Mean   :0.1628
 3rd Qu.:0.700   3rd Qu.:0.2800
 Max.   :1.050   Max.   :0.5800
Creating the Multivariable Graph: Create a scatterplot using ggplot2 to visualize the relationship between palmitic acid (x-axis) and oleic acid (y-axis), with the points colored by region. I chose these two acids because they have the highest mean percentages of the eight fatty acids measured in the olive oils (oleic about 73%, palmitic about 12%).
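The choice of acids can be checked by ranking the mean percentage of each fatty acid. The sketch below assumes the dslabs package is installed; `acid_means` is just an illustrative name.

```r
library(dslabs)
data("olive")

# Keep only the numeric columns (the eight fatty acids),
# then rank their mean percentages from largest to smallest
acid_cols <- olive[sapply(olive, is.numeric)]
acid_means <- sort(colMeans(acid_cols), decreasing = TRUE)

print(round(acid_means, 2))
```

The first two entries of `acid_means` are the acids with the largest average share, which is what the scatterplot should use if the goal is to show the dominant components.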
Scatterplot with ggplot2
library(ggplot2)
# Build the scatterplot; every layer must be chained with `+`
Plot1 <- ggplot(olive, aes(x = palmitic, y = oleic, color = region)) +
  geom_point(size = 3) +
  geom_smooth() +
  labs(x = "Palmitic Acid (%)",
       y = "Oleic Acid (%)",
       color = "Region",
       title = "Fatty Acid Composition of Olive Oil by Region",
       caption = "Data source: dslabs package") +
  theme_gray()
Plot1
Fit a linear regression model
model <- lm(oleic ~ palmitic + region, data = olive)
Summarize the model
summary(model)
Call:
lm(formula = oleic ~ palmitic + region, data = olive)
Residuals:
Min 1Q Median 3Q Max
-4.7267 -1.1399 0.0959 1.2679 3.7668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 98.53696 0.58587 168.19 <2e-16 ***
palmitic -1.88221 0.05227 -36.01 <2e-16 ***
regionSardinia -4.93890 0.20032 -24.66 <2e-16 ***
regionSouthern Italy -2.46045 0.19632 -12.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.543 on 568 degrees of freedom
Multiple R-squared: 0.8562, Adjusted R-squared: 0.8555
F-statistic: 1127 on 3 and 568 DF, p-value: < 2.2e-16
Observations
Negative Correlation: There is a negative association between the palmitic and oleic acid percentages. As the percentage of palmitic acid increases, the percentage of oleic acid tends to decrease: the fitted palmitic coefficient is -1.88, so within a region each additional percentage point of palmitic acid is associated with about 1.88 percentage points less oleic acid. The relationship appears to be roughly linear, which supports using a linear model to describe this association.
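The direction and strength of the palmitic and oleic relationship can also be checked numerically rather than read off the plot. This is a minimal sketch, assuming dslabs is installed; `overall_r` and `by_region_r` are illustrative names.

```r
library(dslabs)
data("olive")

# Pearson correlation across all oils, ignoring region
overall_r <- cor(olive$palmitic, olive$oleic)

# The same correlation computed separately within each region
by_region_r <- sapply(split(olive, olive$region),
                      function(d) cor(d$palmitic, d$oleic))

print(overall_r)
print(by_region_r)
```

The sign of each value gives the direction of the association. Comparing the overall value with the per-region values shows whether the pattern holds inside every region or only when the regions are pooled.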
Cluster Identification: The data points form three distinct clusters, which correspond to the different regions. This indicates regional differences in the fatty acid composition of olive oil. The clustering suggests that each region has a unique profile for palmitic and oleic acid compositions.
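One way to put numbers on the regional profiles is to compare the mean palmitic and oleic percentages per region. The sketch below assumes dslabs is installed and uses base R's `aggregate()`; `region_means` is an illustrative name.

```r
library(dslabs)
data("olive")

# Mean palmitic and oleic percentage for each of the three regions
region_means <- aggregate(cbind(palmitic, oleic) ~ region,
                          data = olive, FUN = mean)

print(region_means)
```

Clearly separated means would be consistent with the clusters seen in the scatterplot, although means alone do not show how much the groups overlap.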
Outliers: There appears to be an outlier in the data, in the Northern Italy region. This point deviates noticeably from the general trend in the rest of the data. Investigating it could show whether it is due to measurement error, a unique environmental factor, or a different olive variety.
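A possible way to locate candidate outliers is to flag observations with large standardized residuals from the fitted model. This is only a sketch: the cutoff of 3 is a common rule of thumb chosen here as an assumption, not something taken from the analysis above, and `flagged` is an illustrative name.

```r
library(dslabs)
data("olive")

# Re-fit the model used in this report
model <- lm(oleic ~ palmitic + region, data = olive)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(model)

# Assumed cutoff: flag points more than 3 standard deviations from the fit
flagged <- olive[abs(std_res) > 3, c("region", "area", "palmitic", "oleic")]

print(flagged)
```

Any rows printed are worth inspecting by hand. A flagged point is a candidate for follow-up, not proof of a measurement error.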
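Model Fit: The residual standard error is 1.543 on 568 degrees of freedom, so fitted oleic values are typically off by about 1.5 percentage points. The multiple R-squared of 0.8562 (adjusted R-squared: 0.8555) means that palmitic acid and region together explain about 86% of the variation in oleic acid. The F-statistic of 1127 on 3 and 568 DF (p-value: < 2.2e-16) indicates that the model as a whole is statistically significant, and each coefficient is individually significant as well (p < 2e-16).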
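The fit statistics quoted above can be pulled out of the model object instead of being copied from the printed summary. The sketch below assumes dslabs is installed and re-fits the same model; `fit_summary`, `r2`, `sigma_hat`, and `ci` are illustrative names.

```r
library(dslabs)
data("olive")

model <- lm(oleic ~ palmitic + region, data = olive)
fit_summary <- summary(model)

# Proportion of variance in oleic explained by the model
r2 <- fit_summary$r.squared

# Residual standard error, in percentage points of oleic acid
sigma_hat <- fit_summary$sigma

# 95% confidence intervals for the coefficients
ci <- confint(model)

print(r2)
print(sigma_hat)
print(ci)
```

The confidence intervals add information that the p-values alone do not: they show the plausible range of each effect, such as how much oleic acid changes per percentage point of palmitic acid.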