The diamonds dataset is a structured dataset of information about individual diamonds and is included in the ggplot2 package. Each row describes one diamond. There are 53,940 rows and 10 variables, including price, carat, cut, clarity, and color. The goal of this report is to build a better understanding of diamonds, to examine how the data are distributed within the dataset, and to identify which quantitative and qualitative variables are most strongly related to one another and to the price of a diamond.
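# Load packages and data set
library(ggplot2) # diamonds data and plotting
library(dplyr) # %>%, distinct(), group_by(), summarise()
library(visdat) # vis_miss(), vis_cor()
library(treemap) # treemap()
data("diamonds")
# Initial inspection
summary(diamonds)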
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
The initial inspection summarizes the distribution of each variable. For
numeric and integer variables, such as carat and price, it reports the
minimum, the first quartile, the median, the mean, the third quartile,
and the maximum. For ordinal variables, such as cut, color, and clarity,
it reports the number of diamonds at each level. From this information,
we can read off the range and the central tendency of each numeric
variable. For example, the carat variable ranges from 0.2 to 5.01
carats, with a mean of 0.7979 and a median of 0.7000.
According to the R documentation for the dataset, cut, color, and clarity are each recorded as one of several ordered levels, and several further variables describe the size of the diamond. Cut, color, clarity, and carat are known as “the four Cs.” Together they are the key benchmark descriptors of a diamond: they give customers context when shopping for stones and give retailers an objective basis for pricing them.
The x, y, z, depth, and table variables all describe the size and
proportions of the diamond. The x, y, and z variables are the length,
width, and depth of the stone in millimeters. The depth variable is
calculated from x, y, and z and gives the total depth percentage: the
depth z divided by the mean of x and y, or 2 * z / (x + y), multiplied
by 100 to express it as a percentage. The table variable gives the width
of the top of the diamond relative to its widest point. For ease of
understanding in later analysis and to aid in communicating the data,
some variables will later be renamed to be more descriptive. Figure 1.1
illustrates the table and depth percentage.
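As a rough check on the depth percentage definition, the sketch below computes it from hypothetical measurements. The factor of 100 converts the ratio into a percentage, which matches the scale of the depth values in the summary output (a median of 61.80). The function name depth_pct and the sample measurements are assumptions made for this illustration and are not part of the dataset or of ggplot2.

```r
# Hypothetical helper: total depth percentage from length (x), width (y),
# and depth (z), all measured in millimeters.
depth_pct <- function(x, y, z) {
  100 * 2 * z / (x + y) # z divided by the mean of x and y, as a percentage
}

# Illustrative measurements (not taken from the dataset)
example_depth <- depth_pct(x = 4.0, y = 4.0, z = 2.4)
example_depth
```

Recorded values in the dataset may differ slightly from this calculation because the published measurements are rounded.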
Rather than working with length, width, and height separately, a simpler
measure of the size of a stone is its carat. The carat is a unit of
weight (one carat is 200 milligrams) whose name derives from the carob
seed, which was used as an ancient unit of measure because it was
regarded as uniform in size and weight. Because weight grows with all
three dimensions, the carat value reflects the stone’s length, width,
and height together. Figure 1.2 provides a graphic of carat weights.
Cut is a qualitative descriptor ranging in quality from “Fair” to
“Ideal,” with “Fair” being the worst cut and “Ideal” the best cut in
the dataset. The cut determines how light reflects within the stone’s
interior. Diamonds with an “Ideal” cut are highly reflective because of
the quality of their shaping, while diamonds with a “Fair” rating
return less brilliance to the observer because of their external
geometry. Figure 1.3 outlines how light interacts with diamonds based on
the cut quality.
The color variable is also a qualitative rating, following an
alphabetic sequence from “D” as the best to “J” as the worst. However,
diamonds can receive ratings beyond “J.” Figure 1.4 illustrates all
color ratings for diamonds. As evident from the figure, the colors in
the dataset do not cover the full range of color ratings. Instead, the
diamonds in the dataset fall at the most colorless end of the scale.
Therefore, it is likely that the diamonds in the dataset are intended
for jewelry and not industrial use.
Clarity is slightly more complex in that it does not follow a numeric or
alphabetic sequence to rank the clarity grades. The sequence from best
to worst within the dataset is “IF”, “VVS1”, “VVS2”, “VS1”, “VS2”,
“SI1”, “SI2”, “I1”. Each rating is an abbreviation describing the
clarity of the diamond. For example, “IF” stands for internally
flawless. That grades below “I1” do not appear in the dataset further
reinforces the notion that the diamonds it contains are of higher
quality. Figure 1.5 provides a visual representation of the appearance
of each clarity rating a diamond can possess.
The R documentation for the diamonds dataset can be opened by running ?diamonds once ggplot2 is loaded.
The first step in the data cleaning process is to rename the variables so that they are more descriptive of their content.
# Rename variables
names(diamonds)[10] <- "depth"
names(diamonds)[9] <- "width"
names(diamonds)[8] <- "length"
names(diamonds)[5] <- "depth percentage"
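Renaming columns by position depends on the column order staying fixed. As a minimal sketch of an alternative, the example below renames columns by matching their existing names instead. The data frame toy and the lookup vector new_names are hypothetical and exist only for this illustration; the renames themselves mirror the ones used in the report.

```r
# Toy data frame standing in for the size columns of the dataset
toy <- data.frame(depth = 61.5, x = 3.95, y = 3.98, z = 2.43)

# Lookup of old name -> new name (same renames as in the report)
new_names <- c(z = "depth", y = "width", x = "length",
               depth = "depth percentage")

# Replace each matching column name; names absent from the lookup are kept
hits <- names(toy) %in% names(new_names)
names(toy)[hits] <- unname(new_names[names(toy)[hits]])
names(toy)
```

Because every new name is looked up from the original name in a single step, the result does not depend on the order in which the renames are listed.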
The next step in the cleaning process is to assess data quality. Understanding the frequency and location of missing information is crucial for future analysis. The visdat package is used below to visualize missing values in each variable. The visualization shows that no values are missing, but completeness alone only partially ensures that the data are of high enough quality to begin analysis.
# Identify missing information
vis_miss(diamonds)
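A numeric count of missing values can complement the visual check. The sketch below uses base R on a small, made-up data frame (toy, with two deliberately missing values) and is offered only as an illustration of the idea, not as part of the original analysis.

```r
# Toy data frame with two deliberately missing values (illustration only)
toy <- data.frame(carat = c(0.23, NA, 0.31),
                  price = c(326L, 327L, NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE only when no value is missing anywhere in the data frame
is_complete <- !anyNA(toy)
is_complete
```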
In addition to searching for missing information, the data should be checked for duplicated rows, which the code snippet below identifies. While two diamonds could plausibly sell for the same price, the dataset contains six quantitative variables describing size and weight (carat, depth percentage, table, length, width, depth) and three qualitative variables describing quality (cut, clarity, color). It is unlikely that two distinct diamonds would match exactly on all nine of those variables and also on price. The original dataset contains 146 duplicated rows, which must be removed.
# Show duplicate rows
diamonds[duplicated(diamonds), ]
## # A tibble: 146 × 10
## carat cut color clarity `depth percentage` table price length width depth
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 2 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 3 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 4 0.79 Ideal G SI1 62.3 57 2898 5.9 5.85 3.66
## 5 1.52 Good E I1 57.3 58 3105 7.53 7.42 4.28
## 6 1 Fair E SI2 67 53 3136 6.19 6.13 4.13
## 7 1 Fair F SI2 65.1 55 3265 6.26 6.23 4.07
## 8 0.9 Very G… I VS2 58.4 62 3334 6.29 6.35 3.69
## 9 1 Ideal E SI2 62.9 56 3450 6.32 6.3 3.97
## 10 1 Fair H SI1 65.5 57 3511 6.26 6.21 4.08
## # ℹ 136 more rows
# Create dataframe removing duplicate rows
diamonds <- diamonds %>% distinct()
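To make the duplicate count easier to interpret, the sketch below shows on a made-up data frame that duplicated() flags only the repeats of a row and not its first occurrence, so the number of rows removed equals the number of flagged rows. The data frame toy is hypothetical, and the base R subsetting shown here is intended as an equivalent of distinct() for this simple case.

```r
# Toy data: the first row appears three times in total
toy <- data.frame(carat = c(0.79, 0.79, 1.00, 0.79),
                  price = c(2898L, 2898L, 3136L, 2898L))

# duplicated() flags every repeat of an earlier row; the first copy is kept
n_dupes <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

# The row counts before and after differ by exactly the number of repeats
nrow(toy) - nrow(toy_unique) == n_dupes
```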
The summary function earlier in the analysis showed that several of the variables describing the size of the diamond contain values of 0.000, which are not physically possible. The analysis must either impute those values or drop the affected observations to support more accurate conclusions, especially about the relationships between variables. To identify the rows containing 0.000 values, the data will be subset on each size column, and the rows will be counted to find the column with the largest number of 0.000 values. Then, the data will be subset to exclude rows based on the values in that column.
# Count number of rows in the length variable less than 0.001
subdat_length <- subset(diamonds, subset = !(length > 0.001))
nrow(subdat_length)
## [1] 7
# Count number of rows in the width variable less than 0.001
subdat_width <- subset(diamonds, subset = !(width > 0.001))
nrow(subdat_width)
## [1] 6
# Count number of rows in the depth variable less than 0.001
subdat_depth <- subset(diamonds, subset = !(depth > 0.001))
nrow(subdat_depth)
## [1] 19
The depth column contains the highest number of 0.000 values. Additionally, inspecting the subset shows that every row with 0.000 in width or length also has 0.000 in depth. Therefore, removing the rows with 0.000 in the depth column removes all rows that lack size measurements, and analysis of the data quality can continue. Measures of central tendency such as the median and mean barely change when these rows are removed, but future analysis will now describe the connections between variables more accurately.
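The claim that filtering on depth alone is sufficient can also be checked programmatically rather than by eye. The sketch below does so on a small made-up data frame; the object names toy, zero_any, and zero_depth are hypothetical and chosen only for this illustration.

```r
# Toy size measurements: three rows contain at least one zero
toy <- data.frame(length = c(6.55, 0.00, 0.00, 5.70),
                  width  = c(6.48, 6.62, 0.00, 5.71),
                  depth  = c(0.00, 0.00, 0.00, 3.53))

# Rows with a zero in any size column, and rows with a zero depth
zero_any   <- toy$length == 0 | toy$width == 0 | toy$depth == 0
zero_depth <- toy$depth == 0

# TRUE when every row with a zero anywhere also has a zero depth,
# so dropping zero-depth rows removes all incomplete rows
depth_covers_all <- all(zero_depth[zero_any])
depth_covers_all
```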
# View rows with 0.000 in size variables
subdat_depth
## # A tibble: 19 × 10
## carat cut color clarity `depth percentage` table price length width depth
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 1 Premium G SI2 59.1 59 3142 6.55 6.48 0
## 2 1.01 Premium H I1 58.1 59 3167 6.66 6.6 0
## 3 1.1 Premium G SI2 63 59 3696 6.5 6.47 0
## 4 1.01 Premium F SI2 59.2 58 3837 6.5 6.47 0
## 5 1.5 Good G I1 64 61 4731 7.15 7.04 0
## 6 1.07 Ideal F SI2 61.6 56 4954 0 6.62 0
## 7 1 Very G… H VS2 63.3 53 5139 0 0 0
## 8 1.15 Ideal G VS2 59.2 56 5564 6.88 6.83 0
## 9 1.14 Fair G VS1 57.5 67 6381 0 0 0
## 10 2.18 Premium H SI2 59.4 61 12631 8.49 8.45 0
## 11 1.56 Ideal G VS2 62.2 54 12800 0 0 0
## 12 2.25 Premium I SI1 61.3 58 15397 8.52 8.42 0
## 13 1.2 Premium D VVS1 62.1 59 15686 0 0 0
## 14 2.2 Premium H SI1 61.2 59 17265 8.42 8.37 0
## 15 2.25 Premium H SI2 62.8 59 18034 0 0 0
## 16 2.02 Premium H VS2 62.7 53 18207 8.02 7.95 0
## 17 2.8 Good G SI2 63.8 58 18788 8.9 8.85 0
## 18 0.71 Good F SI2 64.1 60 2130 0 0 0
## 19 1.12 Premium G I1 60.4 59 2383 6.71 6.67 0
# Remove rows with 0.000 in size variable, confirm removal, and observe changes to measures of central tendency
diamonds_complete <- diamonds[diamonds$depth > 0.001, ]
summary(diamonds_complete)
## carat cut color clarity depth percentage
## Min. :0.2000 Fair : 1597 D: 6754 SI1 :13030 Min. :43.00
## 1st Qu.:0.4000 Good : 4888 E: 9776 VS2 :12225 1st Qu.:61.00
## Median :0.7000 Very Good:12068 F: 9517 SI2 : 9142 Median :61.80
## Mean :0.7975 Premium :13737 G:11254 VS1 : 8155 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21485 H: 8266 VVS2 : 5056 3rd Qu.:62.50
## Max. :5.0100 I: 5406 VVS1 : 3646 Max. :79.00
## J: 2802 (Other): 2521
## table price length width
## Min. :43.00 Min. : 326 Min. : 3.730 Min. : 3.680
## 1st Qu.:56.00 1st Qu.: 951 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3931 Mean : 5.732 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## depth
## Min. : 1.07
## 1st Qu.: 2.91
## Median : 3.53
## Mean : 3.54
## 3rd Qu.: 4.03
## Max. :31.80
##
Each row within the dataset represents a single diamond with several descriptive factors that could be represented as subgroups. Earlier in the report, summary counts per variable were given, but they were only counted per column. Further analysis is required to understand the contents of the dataset more deeply. For example, it is currently known that there are 21,485 “Ideal” cut diamonds in the dataset, but the distribution of clarity and color ratings within the “Ideal” group is unknown. Several useful visualizations can describe grouped data. This report provides a faceted grid of pie charts (named dendrogram in the code), heat maps, and a treemap to represent the number of diamonds per subgroup visually.
# Create aggregated data frame that counts the number of diamonds in each subgroup.
diamonds_agg <- diamonds_complete %>%
group_by(cut, color, clarity) %>%
summarise(count = n(), .groups = "drop") # return an ungrouped result
# Create a faceted grid of pie charts (stored as dendrogram) to visually describe the total number of diamonds per subgroup.
dendrogram <- diamonds_agg %>%
ggplot(aes(x = "", y = count, fill = cut)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
facet_grid(color ~ clarity) +
theme_void() +
theme(legend.position = "right") +
labs(fill = "Cut",
y = "Total Diamonds",
title = "Number of Diamonds per Cut, Color, and Clarity",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package")
dendrogram
Along the right side vertical axis is the diamond color rating, and across the top of the chart is the diamond clarity rating. The slices of each individual pie chart show the number of diamonds per cut rating. The chart, named dendrogram in the code, serves several purposes for this report. First, the extent of each pie illustrates the total number of diamonds in that color and clarity subgroup. Second, the pie charts and their slices show how the diamonds are concentrated across subgroups. For example, it is apparent that a clarity rating of “I1” occurs less frequently than any other clarity rating. The majority of the diamonds in the dataset have clarity ratings of “SI2”, “SI1”, and “VS2”, with “Ideal” cut diamonds making up the largest share.
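# Create heatmap to visualize the number of diamonds with specific cut and color ratings.
# Counts are totalled over clarity so that each cut and color tile is drawn once.
heatmap <- ggplot(count(diamonds_complete, cut, color, name = "count"),
aes(x = cut, y = color, fill = count)) +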
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Cut Rating",
y = "Color Rating",
fill = "Total Diamonds",
title = "Number of Diamonds by Cut and Color",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package") +
theme_minimal()
heatmap
The heat map shows the number of diamonds at each intersection of cut and color ratings. It is apparent that the combination of an “Ideal” cut with a color rating of “G” occurs more frequently in the dataset than any other pairing.
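The most frequent pairing can also be read from a two-way frequency table instead of from tile shading. The sketch below uses base R on a small made-up data frame, so the object names toy, counts_df, and top, as well as the values, are hypothetical. The same approach would total each cut and color pairing across all clarity grades when applied to the full data.

```r
# Toy data: "Ideal" with "G" is deliberately the most common pairing
toy <- data.frame(
  cut   = c("Ideal", "Ideal", "Ideal", "Fair", "Ideal", "Fair"),
  color = c("G", "G", "E", "G", "G", "E")
)

# One row per cut/color pairing with its frequency in the Freq column
counts_df <- as.data.frame(table(cut = toy$cut, color = toy$color))

# The pairing with the largest count
top <- counts_df[which.max(counts_df$Freq), ]
top
```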
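# Create heatmap to visualize the number of diamonds with specific clarity and color ratings.
# Counts are totalled over cut so that each clarity and color tile is drawn once.
heatmap_2 <- ggplot(count(diamonds_complete, clarity, color, name = "count"),
aes(x = clarity, y = color, fill = count)) +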
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Clarity Rating",
y = "Color Rating",
fill = "Total Diamonds",
title = "Number of Diamonds by Clarity and Color",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package") +
theme_minimal()
heatmap_2
The heat map above represents the frequency of diamonds grouped by color and clarity ratings. The counts are spread across more clarity and color combinations than in the cut and color heat map, where most diamonds were tightly grouped.
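# Create heatmap to visualize the number of diamonds with specific cut and clarity ratings.
# Counts are totalled over color so that each cut and clarity tile is drawn once.
heatmap_3 <- ggplot(count(diamonds_complete, cut, clarity, name = "count"),
aes(x = cut, y = clarity, fill = count)) +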
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "blue") +
labs(x = "Cut Rating",
y = "Clarity Rating",
fill = "Total Diamonds",
title = "Number of Diamonds by Cut and Clarity",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package") +
theme_minimal()
heatmap_3
The final heatmap presents the intersection of cut and clarity ratings. While there is more variation than in the cut and color heatmap, the dataset still shows trends. The combinations with the largest number of diamonds occur with clarity ratings of “VS1” to “SI1” and cut ratings of “Very Good” to “Ideal.”
# Create treemap to understand the distribution of cut, color, and clarity.
treemap(diamonds_agg,
index = c("cut", "color", "clarity"),
vSize = "count",
title = "Number of Diamonds by Cut, Color, and Clarity")
The treemap aims to provide an understanding of the distribution of diamond qualities by visualizing the total number of diamonds across the categorical values of cut, color, and clarity. The first level of the treemap separates the diamonds according to their cut and assigns each cut a color within the visualization. Next, within each cut, there is another separation based on the color of the diamond. Finally, the smallest rectangles within the treemap are sized by the count of diamonds per clarity. An immediate conclusion from the treemap is that the dataset consists largely of higher quality diamonds, as the “Ideal” rated diamonds form the largest section of the treemap and the “Fair” rated stones occur at the lowest frequency.
Understanding the dataset also requires examining the distribution of values within each variable and identifying any abnormal observations. In this report, boxplots are used to identify outliers within the dataset.
# Create boxplot to visualize the distribution of price per clarity rating
diamonds_complete %>%
ggplot(aes(x = clarity, y = price, fill = clarity)) +
geom_boxplot() +
labs(title = "Boxplot of Clarity",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Clarity Rating",
y = "Price ($)") +
theme_minimal() +
theme(legend.position = "none") # applied after theme_minimal() so it is not overridden
The boxplot above presents some interesting information when inspected closely. The clarity rating runs along the x-axis from left to right, with the lowest quality stones on the far left and the highest quality stones on the far right, yet the median price (the center line of each box) tends to decrease as clarity improves. Because other factors, such as cut and color, are also involved, further analysis is needed to understand the relationship between clarity and price.
# Create boxplot to visualize the distribution of price per color rating.
diamonds_complete %>%
ggplot(aes(x = color, y = price, fill = color)) +
geom_boxplot() +
labs(title = "Boxplot of Color",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Color Rating",
y = "Price ($)") +
theme_minimal() +
theme(legend.position = "none") # applied after theme_minimal() so it is not overridden
In the color and price box plot, the color ratings run from best to worst, moving left to right along the x-axis. The box plot of color rating again shows an interesting distribution: the median price increases as the color quality decreases.
# Create boxplot to understand the distribution of price per cut rating.
diamonds_complete %>%
ggplot(aes(x = cut, y = price, fill = cut)) +
geom_boxplot() +
labs(title = "Boxplot of Cut",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Cut Rating",
y = "Price ($)") +
theme_minimal() +
theme(legend.position = "none") # applied after theme_minimal() so it is not overridden
Following the same perplexing trend, the diamonds with the best cut rating also have the lowest median price.
The box plots of cut, color, and clarity bring an interesting trend to the surface. In every rating category, as the quality of the stone increased, the median price within the dataset tended to decrease. Additionally, there is a significant number of outliers within each quality descriptor. The logical conclusion when faced with unexpected trends and such a high number of outliers is that no single factor explains the price of a diamond. Instead, price depends on a combination of factors, likely including one that is more influential than cut, color, and clarity.
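One way such a reversal can arise is when a better rated group happens to contain smaller stones. The sketch below illustrates the mechanism with invented numbers: the values in toy are hypothetical and say nothing about the actual dataset. It compares group medians overall and then again among stones of the same carat weight.

```r
# Invented data: the "Ideal" stones are smaller on the whole
toy <- data.frame(
  cut   = c("Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"),
  carat = c(1.0, 1.5, 2.0, 0.3, 0.5, 1.0),
  price = c(3500, 7000, 12000, 600, 1500, 5000)
)

# Overall medians, as a boxplot would show them
median_price <- tapply(toy$price, toy$cut, median)
median_carat <- tapply(toy$carat, toy$cut, median)

# Medians among stones of the same weight (1.0 carat)
same_size <- toy[toy$carat == 1.0, ]
price_at_one_carat <- tapply(same_size$price, same_size$cut, median)

median_price
price_at_one_carat
```

In this toy example the “Fair” group has the higher overall median price only because its stones are larger; at equal weight, the “Ideal” stone is the more expensive one.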
Because the combination of a diamond’s length, width, and height informs the carat weight, the analysis will begin by visualizing the distribution of carat weight throughout the dataset.
# Create histogram of carat
diamonds_complete %>%
ggplot(aes(x = carat)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Carat",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Carat Size",
y = "Number of Diamonds") +
theme_minimal()
The histogram shows a significant right skew in the data distribution, with most diamonds in the dataset weighing less than 1.2 carats. Because higher-carat diamonds will generally be taller, wider, and deeper, carat serves as a summary of size, and further exploration of the individual size variables is unnecessary.
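A right skew also shows up numerically as a mean that exceeds the median, which agrees with the earlier summary output for carat (a mean of about 0.80 against a median of 0.70). The sketch below demonstrates the check on a short, made-up vector of carat weights; the values are hypothetical.

```r
# Made-up carat weights with a long right tail
carat <- c(0.3, 0.3, 0.4, 0.5, 0.7, 0.7, 1.0, 1.2, 2.0, 3.0)

# In a right-skewed distribution the mean is pulled above the median
right_skewed <- mean(carat) > median(carat)

# Share of stones lighter than 1.2 carats
share_below <- mean(carat < 1.2)

right_skewed
share_below
```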
Through knowledge gained during research into the “four Cs” of diamonds, it is understood that the main factors describing a diamond’s quality include both qualitative and quantitative variables. To understand the relationships between these variables and how they relate to price, the analysis will begin with scatter plots and then explore correlations between the variables in the dataset.
# Create scatter plot of carat and price with grouping by cut
diamonds_complete %>%
ggplot(aes(x = carat, y = price, color = cut)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter plot of Carat and Price",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Carat Size",
y = "Price ($)")
In contrast to the box plots of price against a single categorical variable, the relationship between carat and price follows a more expected path: as carat weight increases, the price also increases. While there are instances of higher priced stones with lower cut ratings, across the scatter plot there seems to be a trend in which, at a given carat weight, stones with higher cut ratings command a higher price.
# Create scatterplot of carat and price with grouping by color
diamonds_complete %>%
ggplot(aes(x = carat, y = price, color = color)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter plot of Carat and Price",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Carat Size",
y = "Price ($)")
Based on the scatter plot of carat and price with points colored by the color rating of the diamond, the same trend is apparent: once carat weight is accounted for, the color rating affects the price of the stone.
# Create scatterplot of carat and price with grouping by clarity
diamonds_complete %>%
ggplot(aes(x = carat, y = price, color = clarity)) +
geom_point() +
theme_minimal() +
labs(title = "Scatter plot of Carat and Price",
subtitle = "Diamonds Dataset",
caption = "Data Source: ggplot2 package",
x = "Carat Size",
y = "Price ($)")
The same trend is apparent in the above scatter plot: the qualitative rating aligns with price, but only once the diamond’s carat weight is taken into account. To quantify the observations from the scatter plots, the next step of the analysis is to create a correlation matrix and plot the correlations. It is expected that carat weight is the variable most closely related to price and that the quality ratings are a secondary concern when pricing stones.
# Subset data into variables of cut, color, clarity, carat, and price
corr_diamonds <- diamonds_complete[-c(5,6,8:10)]
# Create factor levels for mapping of qualitative values. All values will rank from 1 to n in the order listed. For cut and clarity, 1 is the lowest quality and n the highest; for color, 1 is "D" (the best rating) and 7 is "J" (the worst).
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
color_levels <- c("D", "E", "F", "G", "H", "I", "J")
clarity_level <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")
# Create factors for cut values
corr_diamonds$cut <- factor(corr_diamonds$cut, levels = cut_levels)
# Create factors for color values
corr_diamonds$color <- factor(corr_diamonds$color, levels = color_levels)
# Create factors for clarity values
corr_diamonds$clarity <- factor(corr_diamonds$clarity, levels = clarity_level)
# Create mapping for cut values
cut_mapping <- c("Fair" = 1, "Good" = 2, "Very Good" = 3, "Premium" = 4, "Ideal" = 5)
# Apply mapping to cut variable
corr_diamonds$cut <- cut_mapping[corr_diamonds$cut]
# Convert the factor to numeric
corr_diamonds$color <- as.numeric(corr_diamonds$color)
corr_diamonds$clarity <- as.numeric(corr_diamonds$clarity)
# Create correlation matrix of the four Cs
cor(corr_diamonds)
## carat cut color clarity price
## carat 1.0000000 -0.13335113 0.29094288 -0.35219384 0.9215485
## cut -0.1333511 1.00000000 -0.02014858 0.18833670 -0.0522251
## color 0.2909429 -0.02014858 1.00000000 0.02570228 0.1717463
## clarity -0.3521938 0.18833670 0.02570228 1.00000000 -0.1461243
## price 0.9215485 -0.05222510 0.17174629 -0.14612430 1.0000000
# Create correlation visualization
vis_cor(corr_diamonds)
The correlation matrix and visualization give statistical context to the visual observations. The box plots suggested that more than a single factor informs price, but the degree to which each factor is related to price was unknown. The correlation matrix shows that carat has a very strong positive relationship with price (a correlation of about 0.92), meaning that as carat weight increases, price tends to increase as well. The qualitative factors that describe a diamond’s quality have very weak relationships with price, with correlations between about -0.15 and 0.17. Within this dataset, the variable most closely related to the price of the stone is carat weight.
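The matrix above relies on the default Pearson method, which treats the integer codes of the ordinal ratings as evenly spaced. As a hedged alternative that is not part of the original analysis, a rank-based Spearman correlation uses only the ordering of the levels. The sketch below compares the two on invented values; the vectors clarity and price are hypothetical.

```r
# Invented ratings and prices; price rises with every step in clarity
clarity <- factor(c("I1", "SI2", "SI1", "VS2", "VS1", "IF"),
                  levels = c("I1", "SI2", "SI1", "VS2",
                             "VS1", "VVS2", "VVS1", "IF"),
                  ordered = TRUE)
price <- c(900, 1100, 1500, 2400, 4000, 9000)

# as.integer() on a factor returns the level position (I1 = 1, ..., IF = 8)
clarity_code <- as.integer(clarity)

# Pearson measures linear association; Spearman measures monotonic association
pearson  <- cor(clarity_code, price, method = "pearson")
spearman <- cor(clarity_code, price, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Because the invented prices increase at every step but not by equal amounts, the Spearman coefficient is exactly 1 while the Pearson coefficient is lower.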