DsLab Assignment

Author

Paul Daniel-Orie

DSlab Assignment

Loading Required Libraries

library(dslabs)
data("olive")

Explore the olive dataset by viewing its first six rows

head(olive)
          region         area palmitic palmitoleic stearic oleic linoleic
1 Southern Italy North-Apulia    10.75        0.75    2.26 78.23     6.72
2 Southern Italy North-Apulia    10.88        0.73    2.24 77.09     7.81
3 Southern Italy North-Apulia     9.11        0.54    2.46 81.13     5.49
4 Southern Italy North-Apulia     9.66        0.57    2.40 79.52     6.19
5 Southern Italy North-Apulia    10.51        0.67    2.59 77.71     6.72
6 Southern Italy North-Apulia     9.11        0.49    2.68 79.24     6.78
  linolenic arachidic eicosenoic
1      0.36      0.60       0.29
2      0.31      0.61       0.29
3      0.31      0.63       0.29
4      0.50      0.78       0.35
5      0.50      0.80       0.46
6      0.51      0.70       0.44

Inspect the structure of the dataset

str(olive)
'data.frame':   572 obs. of  10 variables:
 $ region     : Factor w/ 3 levels "Northern Italy",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ area       : Factor w/ 9 levels "Calabria","Coast-Sardinia",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ palmitic   : num  10.75 10.88 9.11 9.66 10.51 ...
 $ palmitoleic: num  0.75 0.73 0.54 0.57 0.67 0.49 0.66 0.61 0.6 0.55 ...
 $ stearic    : num  2.26 2.24 2.46 2.4 2.59 2.68 2.64 2.35 2.39 2.13 ...
 $ oleic      : num  78.2 77.1 81.1 79.5 77.7 ...
 $ linoleic   : num  6.72 7.81 5.49 6.19 6.72 6.78 6.18 7.34 7.09 6.33 ...
 $ linolenic  : num  0.36 0.31 0.31 0.5 0.5 0.51 0.49 0.39 0.46 0.26 ...
 $ arachidic  : num  0.6 0.61 0.63 0.78 0.8 0.7 0.56 0.64 0.83 0.52 ...
 $ eicosenoic : num  0.29 0.29 0.29 0.35 0.46 0.44 0.29 0.35 0.33 0.3 ...

Summary statistics of the olive dataset

summary(olive)
            region                 area        palmitic      palmitoleic    
 Northern Italy:151   South-Apulia   :206   Min.   : 6.10   Min.   :0.1500  
 Sardinia      : 98   Inland-Sardinia: 65   1st Qu.:10.95   1st Qu.:0.8775  
 Southern Italy:323   Calabria       : 56   Median :12.01   Median :1.1000  
                      Umbria         : 51   Mean   :12.32   Mean   :1.2609  
                      East-Liguria   : 50   3rd Qu.:13.60   3rd Qu.:1.6925  
                      West-Liguria   : 50   Max.   :17.53   Max.   :2.8000  
                      (Other)        : 94                                   
    stearic          oleic          linoleic        linolenic     
 Min.   :1.520   Min.   :63.00   Min.   : 4.480   Min.   :0.0000  
 1st Qu.:2.050   1st Qu.:70.00   1st Qu.: 7.707   1st Qu.:0.2600  
 Median :2.230   Median :73.03   Median :10.300   Median :0.3300  
 Mean   :2.289   Mean   :73.12   Mean   : 9.805   Mean   :0.3189  
 3rd Qu.:2.490   3rd Qu.:76.80   3rd Qu.:11.807   3rd Qu.:0.4025  
 Max.   :3.750   Max.   :84.10   Max.   :14.700   Max.   :0.7400  
                                                                  
   arachidic       eicosenoic    
 Min.   :0.000   Min.   :0.0100  
 1st Qu.:0.500   1st Qu.:0.0200  
 Median :0.610   Median :0.1700  
 Mean   :0.581   Mean   :0.1628  
 3rd Qu.:0.700   3rd Qu.:0.2800  
 Max.   :1.050   Max.   :0.5800  
                                 

Creating the Multivariable Graph: Create a scatterplot using ggplot2 to visualize the relationship between palmitic acid (x-axis) and oleic acid (y-axis), with the points colored by region. I chose these two fatty acids because they appear to be the two largest components, by percentage, of the olive oils studied.
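The choice of palmitic and oleic acid can be checked against the data. The following is a minimal sketch, assuming the dslabs package is installed; the names `acids` and `acid_means` are illustrative and not part of the original analysis. It ranks the eight fatty acids by their mean percentage across all samples.

```r
library(dslabs)
data("olive")

# Keep only the numeric fatty-acid columns (drops region and area)
acids <- olive[sapply(olive, is.numeric)]

# Mean percentage of each fatty acid, largest first
acid_means <- sort(colMeans(acids), decreasing = TRUE)
round(acid_means, 2)
```

The first two names in `acid_means` should match the two acids with the highest means in the summary table above.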

Scatterplot with ggplot2

library(ggplot2)
# Scatterplot of oleic vs. palmitic acid, colored by region
Plot1 <- ggplot(olive, aes(x = palmitic, y = oleic, color = region)) +
  geom_point(size = 3) +
  # One smoothed trend per region; the trailing `+` keeps labs() and the theme in the plot
  geom_smooth() +
  labs(x = "Palmitic Acid (%)",
       y = "Oleic Acid (%)",
       color = "Region",
       title = "Fatty Acid Composition of Olive Oil by Region",
       caption = "Data source: dslabs package") +
  theme_gray()
Plot1

Fit a linear regression model

model <- lm(oleic ~ palmitic + region, data = olive)

Summarize the model

summary(model)

Call:
lm(formula = oleic ~ palmitic + region, data = olive)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7267 -1.1399  0.0959  1.2679  3.7668 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          98.53696    0.58587  168.19   <2e-16 ***
palmitic             -1.88221    0.05227  -36.01   <2e-16 ***
regionSardinia       -4.93890    0.20032  -24.66   <2e-16 ***
regionSouthern Italy -2.46045    0.19632  -12.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.543 on 568 degrees of freedom
Multiple R-squared:  0.8562,    Adjusted R-squared:  0.8555 
F-statistic:  1127 on 3 and 568 DF,  p-value: < 2.2e-16
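The coefficient table can also be read as a prediction rule. As a rough sketch, assuming the same model formula as above, the code below predicts oleic acid for hypothetical oils with 12% palmitic acid in each region; the data frame `new_oils` and the value 12 are illustrative assumptions, not part of the original analysis.

```r
library(dslabs)
data("olive")

# Same model as in the report
model <- lm(oleic ~ palmitic + region, data = olive)

# Hypothetical oils: 12% palmitic acid, one row per region
new_oils <- data.frame(
  palmitic = 12,
  region   = factor(levels(olive$region), levels = levels(olive$region))
)

# Predicted oleic acid percentage for each hypothetical oil
new_oils$predicted_oleic <- predict(model, newdata = new_oils)
new_oils
```

Because Northern Italy is the reference level, its prediction uses only the intercept and the palmitic slope, while the other two regions are shifted by their region coefficients.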

Observations

  1. Negative Correlation: There is a visible negative correlation between the palmitic and oleic acid compositions. As the percentage of palmitic acid increases, the percentage of oleic acid tends to decrease; in the fitted model, each one-point increase in palmitic acid is associated with a decrease of about 1.88 points in oleic acid, holding region fixed. This relationship appears to be roughly linear, suggesting that a simple linear model might be appropriate to describe this association.

  2. Cluster Identification: The data points form three distinct clusters, which correspond to the different regions. This indicates regional differences in the fatty acid composition of olive oil. The clustering suggests that each region has a unique profile for palmitic and oleic acid compositions.

  3. Outliers: There appears to be at least one outlier in the data, notably in the Northern Italy region. This point deviates significantly from the general trend observed in the rest of the data. Investigating this outlier could provide insights into whether it’s due to measurement error, a unique environmental factor, or a different olive variety.
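One way to follow up on the outlier observation is to look at standardized residuals from the regression. This is a sketch only: the cutoff of 3 is a common rule of thumb rather than something the analysis prescribes, and the names `std_res` and `flagged` are illustrative.

```r
library(dslabs)
data("olive")

# Same model as in the report
model <- lm(oleic ~ palmitic + region, data = olive)

# Standardized residuals: residuals scaled by their estimated standard deviation
std_res <- rstandard(model)

# Flag observations more than 3 standard deviations from their fitted value
# (the cutoff of 3 is a rule-of-thumb assumption)
is_extreme <- abs(std_res) > 3
flagged <- olive[is_extreme, c("region", "area", "palmitic", "oleic")]
flagged$std_res <- std_res[is_extreme]
flagged
```

Any rows returned are candidates for closer inspection, not confirmed errors.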

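Model fit: The residual standard error is 1.543 on 568 degrees of freedom. The multiple R-squared is 0.8562 (adjusted R-squared: 0.8555), so palmitic acid and region together explain about 86% of the variance in oleic acid. The F-statistic of 1127 on 3 and 568 DF (p-value < 2.2e-16) suggests that the association between the two acids, after accounting for region, is statistically significant.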
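To look at the association within each region directly, rather than only through the pooled model, a minimal sketch using base R is shown below. It computes the Pearson correlation between palmitic and oleic acid overall and per region; the names `overall_cor` and `region_cor` are illustrative.

```r
library(dslabs)
data("olive")

# Pearson correlation across all 572 samples
overall_cor <- cor(olive$palmitic, olive$oleic)

# Pearson correlation within each of the three regions
region_cor <- sapply(
  split(olive, olive$region),
  function(d) cor(d$palmitic, d$oleic)
)

overall_cor
region_cor
```

Comparing `region_cor` with `overall_cor` shows whether the pooled trend also holds inside each region or is partly driven by differences between regions.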