Choosing a dataset

After running the data() function, I saw that dslabs had a dataset about tissue gene function, but when I loaded it, it was a list. I tried the brca dataset next, and it was also a list. I then chose one at random, which turned out to be the olive dataset. Luckily, the olive dataset was a data frame.
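A quick way to check this without loading each dataset by hand is sketched below. It assumes the dslabs package is installed and that the tissue gene dataset mentioned above is the one named tissue_gene_expression; the candidates list is just a helper name for this example.

```r
library(dslabs)

# Collect the three datasets mentioned above into a named list
# (tissue_gene_expression is assumed to be the "tissue gene" dataset)
candidates <- list(
  tissue_gene_expression = tissue_gene_expression,
  brca = brca,
  olive = olive
)

# Show the class of each dataset, then flag the ones that are data frames
sapply(candidates, class)
sapply(candidates, is.data.frame)
```

Only the datasets flagged TRUE can go straight into dplyr and ggplot2 without reshaping first.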

Saving the dataset to my files

library(tidyverse); library(dslabs)  # readr, dplyr, tidyr, ggplot2, plus the olive data
write_csv(olive, "olive.csv", na = "")

Configuring the dataframe to show me something I can work with

The original dataset has one row per olive, with the variables as columns. The variables are the region of Italy the olive came from, the area within that region, and the percent of each fatty acid present in the olive, with one column per fatty acid.

I then thought that we could compare the olives by how healthy they are to eat, based on the fatty acid composition of each olive. To do that, I calculated the percent of saturated, monounsaturated, and polyunsaturated fatty acids by adding the percents of the fatty acids that fall into each category.

Stearic acid, palmitic acid, and arachidic acid are saturated fatty acids. Linoleic acid and linolenic acid are both polyunsaturated fatty acids. Palmitoleic acid, oleic acid, and eicosenoic acid are all monounsaturated fatty acids.

olive_by_sat <- olive %>%
  rowwise() %>%
  mutate(percent_saturated = sum(palmitic, stearic, arachidic),
         percent_monounsaturated = sum(palmitoleic, oleic, eicosenoic),
         percent_polyunsaturated = sum(linoleic, linolenic))
head(olive_by_sat)
## # A tibble: 6 × 13
## # Rowwise: 
##   region     area  palmi…¹ palmi…² stearic oleic linol…³ linol…⁴ arach…⁵ eicos…⁶
##   <fct>      <fct>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Southern … Nort…   10.8     0.75    2.26  78.2    6.72    0.36    0.6     0.29
## 2 Southern … Nort…   10.9     0.73    2.24  77.1    7.81    0.31    0.61    0.29
## 3 Southern … Nort…    9.11    0.54    2.46  81.1    5.49    0.31    0.63    0.29
## 4 Southern … Nort…    9.66    0.57    2.4   79.5    6.19    0.5     0.78    0.35
## 5 Southern … Nort…   10.5     0.67    2.59  77.7    6.72    0.5     0.8     0.46
## 6 Southern … Nort…    9.11    0.49    2.68  79.2    6.78    0.51    0.7     0.44
## # … with 3 more variables: percent_saturated <dbl>,
## #   percent_monounsaturated <dbl>, percent_polyunsaturated <dbl>, and
## #   abbreviated variable names ¹​palmitic, ²​palmitoleic, ³​linoleic, ⁴​linolenic,
## #   ⁵​arachidic, ⁶​eicosenoic
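The same sums can also be computed without rowwise(), since `+` already works row by row on whole columns. This is only a sketch of that alternative, not a change to the code above. The olive_sample data frame is a made-up helper that holds the first two rows of the printed output, using the rounded values as displayed.

```r
library(dplyr)

# First two rows of the printed output above (rounded values as displayed)
olive_sample <- data.frame(
  palmitic = c(10.8, 10.9), palmitoleic = c(0.75, 0.73),
  stearic = c(2.26, 2.24), oleic = c(78.2, 77.1),
  linoleic = c(6.72, 7.81), linolenic = c(0.36, 0.31),
  arachidic = c(0.6, 0.61), eicosenoic = c(0.29, 0.29)
)

# Vectorized version: each sum is computed per row, no rowwise() needed
olive_sample_by_sat <- olive_sample %>%
  mutate(percent_saturated = palmitic + stearic + arachidic,
         percent_monounsaturated = palmitoleic + oleic + eicosenoic,
         percent_polyunsaturated = linoleic + linolenic)

# Sanity check: the three categories should add up to roughly 100 percent
olive_sample_by_sat %>%
  mutate(total = percent_saturated + percent_monounsaturated + percent_polyunsaturated) %>%
  select(starts_with("percent_"), total)
```

The totals come out just under 100 because the printed values are rounded.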

I then decided to calculate the mean percent of saturated, polyunsaturated, and monounsaturated fatty acids for the olives from each area of Italy.

mean_area <- olive_by_sat %>%
  group_by(area) %>%
  mutate(mean_polyunsaturated_area = mean(percent_polyunsaturated), 
         mean_saturated_by_area = mean(percent_saturated),
         mean_monounsaturated_by_area = mean(percent_monounsaturated)) %>%
  select(region, area, mean_polyunsaturated_area, mean_saturated_by_area, mean_monounsaturated_by_area) %>%
  distinct()
head(mean_area)
## # A tibble: 6 × 5
## # Groups:   area [6]
##   region         area            mean_polyunsaturated_area mean_satura…¹ mean_…²
##   <fct>          <fct>                               <dbl>         <dbl>   <dbl>
## 1 Southern Italy North-Apulia                         7.48          13.3    79.2
## 2 Southern Italy Calabria                             8.65          16.3    74.6
## 3 Southern Italy South-Apulia                        12.0           16.7    71.2
## 4 Southern Italy Sicily                               8.77          15.8    75.0
## 5 Sardinia       Inland-Sardinia                     11.5           13.9    74.6
## 6 Sardinia       Coast-Sardinia                      13.6           14.5    71.9
## # … with abbreviated variable names ¹​mean_saturated_by_area,
## #   ²​mean_monounsaturated_by_area
cols_area <- c("orange", "springgreen", "hotpink", "yellow", "skyblue", "grey", "tomato", "rosybrown", "peachpuff")
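
The per-area means can also be computed with summarise(), which returns one row per group directly, so distinct() is not needed. The sketch below assumes dplyr 1.0 or later (for across()) and the dslabs olive data. The name mean_area_alt and the mean_percent_* column names are made up for this example and differ from the column names used above.

```r
library(dplyr)
library(dslabs)

mean_area_alt <- olive %>%
  # same category sums as before, written without rowwise()
  mutate(percent_saturated = palmitic + stearic + arachidic,
         percent_monounsaturated = palmitoleic + oleic + eicosenoic,
         percent_polyunsaturated = linoleic + linolenic) %>%
  # grouping by region as well keeps the region column in the result
  group_by(region, area) %>%
  summarise(across(starts_with("percent_"), mean, .names = "mean_{.col}"),
            .groups = "drop")

# Areas ordered from highest to lowest mean polyunsaturated percent
mean_area_alt %>%
  arrange(desc(mean_percent_polyunsaturated))
```

Sorting the table like this is a quick way to cross-check what the chart shows.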

Visualizing the data!!!

For my main visualization, I created a stacked column pyramid graph that breaks the olives down by the area where they were grown and compares the mean percent composition of saturated, polyunsaturated, and monounsaturated fatty acids. Polyunsaturated fatty acids are an essential part of our diet that we need to get from food, and we are better off avoiding saturated fatty acids, so I concluded from this data that the healthiest Italian olives to eat come from West-Liguria or the Sardinia Coast.

library(highcharter)

# one stacked series per fatty acid type, with one column per area
highchart() %>%
  hc_add_series(data = mean_area,
                type = "columnpyramid",
                name = "Percent Monounsaturated Fatty Acid",
                hcaes(x = area,
                      y = mean_monounsaturated_by_area)) %>%
  hc_add_series(data = mean_area,
                type = "columnpyramid",
                name = "Percent Polyunsaturated Fatty Acid",
                hcaes(x = area,
                      y = mean_polyunsaturated_area)) %>%
  hc_add_series(data = mean_area,
                type = "columnpyramid",
                name = "Percent Saturated Fatty Acid",
                hcaes(x = area,
                      y = mean_saturated_by_area)) %>%
  hc_colors(cols_area) %>%
  hc_plotOptions(series = list(stacking = "normal")) %>%
  hc_xAxis(categories = mean_area$area, title = list(text = "Area in Italy")) %>%
  hc_yAxis(title = list(text = "Percent"), max = 100) %>%
  hc_title(text = "Mean Percent of Each Fatty Acid Type in Olives Grown in Different Areas in Italy") %>%
  # hc_theme_db(chart = list(backgroundColor = "#15C0DE")) %>%
  hc_tooltip(enabled = TRUE)

I also wanted to experiment with some other plots, so I made a boxplot of percent polyunsaturated and percent saturated fatty acids in olives by area. It also showed me that the Inland-Sardinia olives fall between the West-Liguria and Sardinia Coast olives in terms of healthiness.

for_plot <- olive_by_sat %>%
  select(area, percent_polyunsaturated, percent_saturated, percent_monounsaturated)
ggplot(for_plot, aes(x = percent_saturated, y = percent_polyunsaturated, fill = area)) +
  labs(title = "Percent of Saturation in Olives by Area") +
  geom_boxplot() +
  theme_light()
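
A different layout for the same comparison is sketched below: the two percent columns are stacked into one long column, so each area gets one box per fatty acid type. This is an alternative under my own assumptions, not the plot above. The names for_plot_long, fat_type, and percent are made up for the example, and it assumes the dslabs olive data plus dplyr, tidyr, and ggplot2.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(dslabs)

for_plot_long <- olive %>%
  mutate(percent_saturated = palmitic + stearic + arachidic,
         percent_polyunsaturated = linoleic + linolenic) %>%
  select(area, percent_saturated, percent_polyunsaturated) %>%
  # one row per olive and fatty acid type
  pivot_longer(cols = -area, names_to = "fat_type", values_to = "percent")

# area on the x-axis, with a pair of boxes (one per fatty acid type) for each area
p <- ggplot(for_plot_long, aes(x = area, y = percent, fill = fat_type)) +
  geom_boxplot() +
  labs(title = "Percent Saturated and Polyunsaturated Fatty Acids in Olives by Area",
       x = "Area in Italy", y = "Percent") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```

With area on a discrete axis, the spread within each area is easier to compare side by side.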

My last visualization is something I just wanted to see. Each vertical column of dots is one olive recorded from the Inland-Sardinia area. The plot breaks down the percent composition per fatty acid. The x-axis shows the olive's percent of saturated fatty acid, and each line connects the points for one fatty acid.

smooth_scatter <- olive_by_sat %>%
  ungroup() %>%  # drop the rowwise grouping before reshaping
  filter(area == "Inland-Sardinia") %>%
  # one row per olive and fatty acid (pivot_longer supersedes gather)
  pivot_longer(cols = c(palmitoleic, oleic, linoleic, linolenic, eicosenoic, arachidic, stearic, palmitic),
               names_to = "Fatty Acid", values_to = "Percent Composition")
ggplot(smooth_scatter, aes(x = percent_saturated, y = `Percent Composition`, color = `Fatty Acid`)) +
  geom_point() +
  geom_line() +  # x, y, and color are inherited from ggplot()
  labs(title = "Percent Composition of Olives by Fatty Acid in Inland-Sardinia") +
  theme_light()