Data Dive - Documentation

For this week’s data dive, we took a deeper look at the documentation for our data. Good documentation is very important for a dataset, as it gives the user a clearer explanation of how the data might have been collected and what it can represent.

For my Data Dives, I’ve been using the “diamonds” dataset from the ggplot2 library in R. Here is the link to the documentation: https://ggplot2.tidyverse.org/reference/diamonds.html

Columns that are unclear until you read the documentation

I didn’t know much about diamonds when choosing this dataset, so there was a lot that was confusing to me until I read the documentation.

The first variable that confused me was carat. I had heard the term before, but I had no clue what it meant. Upon reading the documentation, I found that the carat of a diamond represents its weight.
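To make the weight more concrete, here is a minimal sketch that converts carats to grams. The diamonds documentation does not give a conversion; the sketch assumes the standard metric carat of 0.2 g, and `carat_to_grams` is a hypothetical helper name, not part of ggplot2.

```r
# Hypothetical helper: convert a weight in carats to grams,
# assuming the metric carat (1 carat = 0.2 g)
carat_to_grams <- function(carat) carat * 0.2

# Smallest and largest carat values listed in the documentation, plus 1 carat
carat_to_grams(c(0.2, 1, 5.01))
```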

Second, I looked at the depth variable. I didn’t quite understand what it was, as I was just seeing what were seemingly random numbers. After reading the documentation, I found that depth represents the total depth percentage, which is calculated from the x, y, and z variables (the length, width, and depth of the diamond in mm). The formula is as follows:

total depth percentage = z / mean(x, y) = 2 * z / (x + y), with values ranging from 43 to 79
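One way to check this formula is to recompute it from x, y, and z and compare the result with the stored depth column. The sketch below is only a rough check: `depth_pct` is a hypothetical helper, and the factor of 100 is my assumption, based on the stored values (43–79) being expressed as percentages rather than ratios.

```r
library(ggplot2)

# Hypothetical helper: total depth percentage from x, y, z (all in mm).
# The factor of 100 is an assumption so the result is on the same
# percentage scale as the depth column.
depth_pct <- function(x, y, z) 100 * 2 * z / (x + y)

# Keep only rows with positive dimensions, in case any are recorded as 0
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Difference between the recomputed value and the stored depth column
depth_diff <- depth_pct(d$x, d$y, d$z) - d$depth
summary(depth_diff)
```

If the formula holds, most of the differences should be close to zero, with small gaps coming from rounding in the recorded measurements.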

Another column that initially confused me was the table variable. I had no clue what it could be. I checked the documentation, and it turns out that table measures the width of the top of the diamond relative to its widest point.

Elements that are unclear even after reading the documentation

I didn’t really understand what the values of the clarity variable mean. The documentation lists each of the possible values and identifies the worst (I1) and best (IF), but it doesn’t really help with understanding what each label means.
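The ordering itself can be inspected directly in R, even though the meaning of each label cannot. This is a small sketch, assuming the order of the factor levels matches the worst-to-best order given in the documentation; `clarity_levels` is just an illustrative variable name.

```r
library(ggplot2)

# clarity is stored as an ordered factor, so its levels carry a ranking
clarity_levels <- levels(diamonds$clarity)
print(clarity_levels)
print(is.ordered(diamonds$clarity))

# Number of diamonds at each level, listed in the factor's own order
print(table(diamonds$clarity))
```

This shows where each label sits on the scale, but it still says nothing about what a label such as VS2 or SI1 actually describes.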

Visualization

I created a visualization to illustrate the confusion around the clarity variable in the diamonds dataset.

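library(ggplot2)
data(diamonds)  # load the diamonds dataset that ships with ggplot2
# Bar plot of the number of diamonds at each clarity level
ggplot(data = diamonds, aes(x = clarity)) +
  geom_bar(fill = "darkseagreen", color = "azure") +
  ggtitle("Bar Plot for Clarity")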

We see what looks like a roughly bell-shaped distribution here, but since there is confusion about which clarities are better or worse, we don’t quite know whether the bulk of our data really comes from the middle of the clarity scale.