Introduction

The diamonds dataset is a structured dataset of information gathered about diamonds and is included as part of the ggplot2 package. Each row in the dataset is a single entry describing one diamond. There are 53,940 rows and 10 descriptive variables, including price, carat, cut, clarity, and color. The goals of the analysis in this report are to build a better understanding of diamonds, to understand the distribution of data within the dataset, and to identify the strongest relationships among the quantitative and qualitative variables and between those variables and the price of the diamonds.

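# Load packages used throughout the report
library(ggplot2)
library(dplyr)
library(visdat)
library(treemap)
# Load data set
data("diamonds")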
# Initial inspection
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 


Initial inspection of the dataset shows how the information is distributed. For each numeric or integer variable, such as carat and price, the summary gives the minimum, the first and third quartiles, the median, the mean, and the maximum. For each ordinal variable, such as color and clarity, it gives the number of diamonds at each level. From this information, we can see the range of values within each numeric variable and its measures of central tendency. For example, the carat variable ranges from 0.2 carats to 5.01 carats, with a mean carat weight of 0.7979 and a median carat weight of 0.7000.
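The difference between the two kinds of summary can be seen on a small example. The sketch below uses a made-up toy data frame (not the diamonds data) and only base R; the object names are illustrative.

```r
# Toy data (made-up values), not the diamonds dataset itself
toy <- data.frame(
  carat = c(0.23, 0.31, 0.70, 1.01, 1.52),
  cut   = factor(c("Ideal", "Good", "Ideal", "Fair", "Ideal"),
                 levels = c("Fair", "Good", "Ideal"), ordered = TRUE)
)

# Numeric column: minimum, quartiles, median, mean, maximum
carat_summary <- summary(toy$carat)

# Ordered factor: one count per level
cut_summary <- summary(toy$cut)

print(carat_summary)
print(cut_summary)
```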

Diamond Ratings Explained

According to the R documentation for the dataset, there are multiple levels describing the cut, color, and clarity of each diamond and several variables describing its size. Cut, color, clarity, and carat are known as “the four Cs.” Together they are the key benchmark descriptors of a diamond: they give customers context when shopping for stones and give retailers an objective basis for pricing them.

The x, y, z, depth, and table variables all describe the size and proportions of the diamond. The depth variable is calculated from the x, y, and z values and defines the total depth percentage. The formula used to create the depth value for each observation is 2 * z / (x + y), expressed as a percentage. The table variable gives the width of the top of the diamond relative to its widest point. To make later analysis easier to follow and communicate, some variables will be renamed to be more descriptive. Figure 1.1 visually explains the table and depth percentage.
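As a quick check of the depth formula, the sketch below recomputes the depth percentage for three stones whose measurements appear in the duplicate-row listing later in this report. It assumes the ratio is multiplied by 100 to give a percentage, which is consistent with the reported values; the column names in the toy data frame are illustrative.

```r
# Three rows copied from the duplicate-row listing in this report
# (x = length, y = width, z = depth, all in mm)
stones <- data.frame(
  x = c(5.90, 7.53, 6.19),
  y = c(5.85, 7.42, 6.13),
  z = c(3.66, 4.28, 4.13),
  depth_reported = c(62.3, 57.3, 67.0)
)

# Total depth percentage: z divided by the mean of x and y, times 100
stones$depth_calc <- round(100 * 2 * stones$z / (stones$x + stones$y), 1)

print(stones)
```

The recomputed values agree with the reported depth percentages to one decimal place.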

Figure 1.1


Rather than measuring length, width, and height separately, a simpler way to describe the size of a stone is by carat. The carat derives from the carob seed, which was historically used as a unit of weight because the seeds were considered uniform in size and weight. Because the carat is a measure of weight, it reflects the stone’s length, width, and height in a single value. Figure 1.2 provides a graphic of carat weights.

Figure 1.2


Cut is a qualitative descriptor ranging in quality from “Fair” to “Ideal,” with “Fair” being the worst cut and “Ideal” the best cut in the dataset. The cut determines how light reflects within the stone’s interior. Diamonds with “Ideal” cuts are highly reflective because of the quality of their shaping, while diamonds with a “Fair” rating return less brilliance to the observer because of their external geometry. Figure 1.3 outlines how light interacts with diamonds based on the cut quality.

Figure 1.3


The color variable is also a qualitative rating. It follows an alphabetic sequence, with “D” as the best rating and “J” as the worst rating in the dataset. However, diamonds can receive ratings beyond “J.” Figure 1.4 illustrates all color ratings for diamonds. As is evident from Figure 1.4, the colors in the dataset do not cover the full range of color ratings. Instead, the diamonds in the dataset fall at the colorless to near-colorless end of the scale. Therefore, it is likely that the diamonds in the dataset are intended for jewelry and not industrial use.

Figure 1.4


Clarity is slightly more complex in that it does not follow a numeric or alphabetic sequence to rank the clarity grades. The sequence from best to worst within the dataset is “IF”, “VVS1”, “VVS2”, “VS1”, “VS2”, “SI1”, “SI2”, “I1”. Each rating is an abbreviation describing the clarity of the diamond. For example, a clarity rating of “IF” stands for internally flawless. The absence of grades below “I1” further reinforces the notion that the diamonds in the dataset are of higher quality. Figure 1.5 provides a visual representation of the appearance of each clarity rating a diamond can possess.

Figure 1.5


The R documentation about the diamonds dataset can be viewed by running ?diamonds in R after loading ggplot2.

Data Cleaning and Manipulation

Readability

The first step in the data cleaning process is to give the size variables names that describe their content more clearly.

# Rename variables
names(diamonds)[10] <- "depth"             # z: depth in mm
names(diamonds)[9] <- "width"              # y: width in mm
names(diamonds)[8] <- "length"             # x: length in mm
names(diamonds)[5] <- "depth percentage"   # depth: total depth percentage

Assessment of Data Quality

The next step in the cleaning process is to assess the data quality. Understanding the frequency and location of missing information is crucial for future analysis. The visdat package is used below to visualize missing values in each of the variables. The visualization shows that no values are recorded as missing, but this alone does not guarantee that the data is of high enough quality to begin analysis.

# Identify missing information
vis_miss(diamonds)
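The visual check can be complemented with a numeric count of missing cells per column. The sketch below uses base R on a made-up toy data frame with two deliberately missing values; it is an illustration, not output from the diamonds data.

```r
# Toy data with two deliberately missing cells (made-up values)
toy <- data.frame(
  carat = c(0.23, NA, 0.70, 1.01),
  price = c(326, 950, NA, 5324),
  cut   = c("Ideal", "Good", "Fair", "Ideal")
)

# Number of missing cells in each column
na_per_column <- colSums(is.na(toy))

# TRUE if at least one cell anywhere is missing
any_missing <- any(is.na(toy))

print(na_per_column)
print(any_missing)
```

Note that a check like this only finds values stored as NA. Placeholder values such as 0 are not counted as missing and need a separate check.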

In addition to searching for missing information, the data should be checked for duplicated rows. The code snippet below identifies any rows that exactly repeat an earlier row. While it is possible for two diamonds to be sold for the same price, the dataset contains six quantitative variables describing size and weight (carat, depth percentage, table, length, width, depth) and three qualitative variables describing quality (cut, clarity, color). It is unlikely that two diamonds would match exactly on all of those descriptive variables and also on price. In the original dataset, there are 146 duplicated rows, which must be removed.

# Show duplicate rows
diamonds[duplicated(diamonds), ]
## # A tibble: 146 × 10
##    carat cut     color clarity `depth percentage` table price length width depth
##    <dbl> <ord>   <ord> <ord>                <dbl> <dbl> <int>  <dbl> <dbl> <dbl>
##  1  0.79 Ideal   G     SI1                   62.3    57  2898   5.9   5.85  3.66
##  2  0.79 Ideal   G     SI1                   62.3    57  2898   5.9   5.85  3.66
##  3  0.79 Ideal   G     SI1                   62.3    57  2898   5.9   5.85  3.66
##  4  0.79 Ideal   G     SI1                   62.3    57  2898   5.9   5.85  3.66
##  5  1.52 Good    E     I1                    57.3    58  3105   7.53  7.42  4.28
##  6  1    Fair    E     SI2                   67      53  3136   6.19  6.13  4.13
##  7  1    Fair    F     SI2                   65.1    55  3265   6.26  6.23  4.07
##  8  0.9  Very G… I     VS2                   58.4    62  3334   6.29  6.35  3.69
##  9  1    Ideal   E     SI2                   62.9    56  3450   6.32  6.3   3.97
## 10  1    Fair    H     SI1                   65.5    57  3511   6.26  6.21  4.08
## # ℹ 136 more rows
# Create dataframe removing duplicate rows
diamonds <- diamonds %>% distinct()
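The effect of removing duplicates can be confirmed by comparing row counts before and after. The sketch below shows the idea in base R on a made-up toy data frame; unique() is used here as a base R counterpart to distinct(), and the object names are illustrative.

```r
# Toy data: the first row appears three times (made-up example)
toy <- data.frame(
  carat = c(0.79, 0.79, 0.79, 1.00),
  cut   = c("Ideal", "Ideal", "Ideal", "Fair"),
  price = c(2898, 2898, 2898, 3136)
)

# duplicated() flags every row that repeats an earlier row
n_dupes <- sum(duplicated(toy))

# unique() keeps the first occurrence of each distinct row
toy_clean <- unique(toy)

# The cleaned data should be shorter by exactly the number of flagged rows
rows_removed <- nrow(toy) - nrow(toy_clean)
print(c(duplicates = n_dupes, removed = rows_removed))
```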

Completeness of Data

The summary function earlier in the analysis showed that several of the variables describing the size of the diamond contain values of 0.000. Because a diamond cannot have a dimension of zero, these values represent missing measurements. The analysis must either impute these values or drop those observations to ensure more accurate conclusions, especially about the relationships between variables. To identify the rows containing 0.000 values, the data will be subset per size column, and a row count will be conducted to find the column with the largest number of 0.000 values. Then, the data will be subset to exclude rows based on the values in that column.

# Count number of rows in the length variable less than 0.001
subdat_length <- subset(diamonds, subset = !(length > 0.001))
nrow(subdat_length)
## [1] 7
# Count number of rows in the width variable less than 0.001
subdat_width <- subset(diamonds, subset = !(width > 0.001))
nrow(subdat_width)
## [1] 6
# Count number of rows in the depth variable less than 0.001
subdat_depth <- subset(diamonds, subset = !(depth > 0.001))
nrow(subdat_depth)
## [1] 19

The depth column contains the largest number of 0.000 values. Inspecting the subset below also shows that every row with 0.000 in length or width has 0.000 in depth as well. Therefore, removing the rows containing 0.000 in the depth column removes all rows with missing size measurements, and the data is now as complete as possible. While measures of central tendency such as the median and mean changed only marginally after the removal of these rows, future analysis will now describe the connections between variables more accurately.

# View rows with 0.000 in size variables
subdat_depth
## # A tibble: 19 × 10
##    carat cut     color clarity `depth percentage` table price length width depth
##    <dbl> <ord>   <ord> <ord>                <dbl> <dbl> <int>  <dbl> <dbl> <dbl>
##  1  1    Premium G     SI2                   59.1    59  3142   6.55  6.48     0
##  2  1.01 Premium H     I1                    58.1    59  3167   6.66  6.6      0
##  3  1.1  Premium G     SI2                   63      59  3696   6.5   6.47     0
##  4  1.01 Premium F     SI2                   59.2    58  3837   6.5   6.47     0
##  5  1.5  Good    G     I1                    64      61  4731   7.15  7.04     0
##  6  1.07 Ideal   F     SI2                   61.6    56  4954   0     6.62     0
##  7  1    Very G… H     VS2                   63.3    53  5139   0     0        0
##  8  1.15 Ideal   G     VS2                   59.2    56  5564   6.88  6.83     0
##  9  1.14 Fair    G     VS1                   57.5    67  6381   0     0        0
## 10  2.18 Premium H     SI2                   59.4    61 12631   8.49  8.45     0
## 11  1.56 Ideal   G     VS2                   62.2    54 12800   0     0        0
## 12  2.25 Premium I     SI1                   61.3    58 15397   8.52  8.42     0
## 13  1.2  Premium D     VVS1                  62.1    59 15686   0     0        0
## 14  2.2  Premium H     SI1                   61.2    59 17265   8.42  8.37     0
## 15  2.25 Premium H     SI2                   62.8    59 18034   0     0        0
## 16  2.02 Premium H     VS2                   62.7    53 18207   8.02  7.95     0
## 17  2.8  Good    G     SI2                   63.8    58 18788   8.9   8.85     0
## 18  0.71 Good    F     SI2                   64.1    60  2130   0     0        0
## 19  1.12 Premium G     I1                    60.4    59  2383   6.71  6.67     0
# Remove rows with 0.000 in size variable, confirm removal, and observe changes to measures of central tendency
diamonds_complete <- diamonds[diamonds$depth > 0.001, ]
summary(diamonds_complete)
##      carat               cut        color        clarity      depth percentage
##  Min.   :0.2000   Fair     : 1597   D: 6754   SI1    :13030   Min.   :43.00   
##  1st Qu.:0.4000   Good     : 4888   E: 9776   VS2    :12225   1st Qu.:61.00   
##  Median :0.7000   Very Good:12068   F: 9517   SI2    : 9142   Median :61.80   
##  Mean   :0.7975   Premium  :13737   G:11254   VS1    : 8155   Mean   :61.75   
##  3rd Qu.:1.0400   Ideal    :21485   H: 8266   VVS2   : 5056   3rd Qu.:62.50   
##  Max.   :5.0100                     I: 5406   VVS1   : 3646   Max.   :79.00   
##                                     J: 2802   (Other): 2521                   
##      table           price           length           width       
##  Min.   :43.00   Min.   :  326   Min.   : 3.730   Min.   : 3.680  
##  1st Qu.:56.00   1st Qu.:  951   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3931   Mean   : 5.732   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##      depth      
##  Min.   : 1.07  
##  1st Qu.: 2.91  
##  Median : 3.53  
##  Mean   : 3.54  
##  3rd Qu.: 4.03  
##  Max.   :31.80  
## 
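The claim that filtering on depth also removes the zero values in length and width can be checked programmatically rather than by eye. The sketch below uses a toy data frame built from three of the zero-depth rows listed above plus one complete row from the duplicate listing; the helper object names are illustrative.

```r
# Three zero-depth rows and one complete row, copied from listings in this report
toy <- data.frame(
  length = c(6.55, 0.00, 0.00, 5.90),
  width  = c(6.48, 6.62, 0.00, 5.85),
  depth  = c(0.00, 0.00, 0.00, 3.66)
)

size_cols <- c("length", "width", "depth")

# TRUE for rows with a zero in at least one size column
zero_any <- rowSums(toy[size_cols] == 0) > 0

# TRUE for rows with a zero in the depth column
zero_depth <- toy$depth == 0

# If every row with any zero also has zero depth, filtering on depth is enough
covered <- all(zero_depth[zero_any])

# Keep only rows with no zero size values
toy_complete <- toy[!zero_any, ]

print(covered)
print(toy_complete)
```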

Hierarchical Visualization

Each row within the dataset represents a single diamond with several descriptive factors that can be treated as subgroups. Earlier in the report, summary counts were given, but only per column. Further analysis is required to understand the contents of the dataset more deeply. For example, it is currently known that there are 21,485 “Ideal” cut diamonds in the dataset, but the clarity and color ratings within the “Ideal” group are unknown. Several visualizations can describe grouped data. This report uses a faceted grid of pie charts (stored in the code as the dendrogram object), heat maps, and a treemap to represent the number of diamonds per subgroup visually.

# Create aggregated data frame that counts the number of diamonds in each subgroup.
diamonds_agg <- diamonds_complete %>% group_by(cut, color, clarity) %>% 
  summarise(count = n(), .groups = "drop")
# Create a faceted grid of pie charts (stored as dendrogram) showing the number of diamonds per subgroup.
dendrogram <- diamonds_agg %>%
  ggplot(aes(x = "", y = count, fill = cut)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  facet_grid(color ~ clarity) +
  theme_void() +
  theme(legend.position = "right") +
  labs(fill = "Cut", 
       y = "Total Diamonds",
       title = "Number of Diamonds per Cut, Color, and Clarity",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package")

dendrogram

Along the right-hand vertical axis is the diamond color rating, and across the top of the chart is the diamond clarity rating. Each slice of an individual pie chart represents the number of diamonds with a given cut rating. This chart, referred to as the dendrogram in the code, serves several purposes for this report. First, the size of each pie reflects the total number of diamonds in that subgroup. Second, the pie charts and their slices show how the diamonds are distributed across all subgroups. For example, it is apparent that diamonds with a clarity rating of “I1” occur less frequently than those with any other clarity rating. The majority of the diamonds in the dataset have clarity ratings of “SI2”, “SI1”, and “VS2”, with “Ideal” cut diamonds making up the largest share.

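# Count diamonds per cut and color pair so each tile maps to exactly one row
cut_color_counts <- diamonds_complete %>% count(cut, color, name = "count")
# Create heatmap to visualize the number of diamonds with specific cut and color ratings
heatmap <- ggplot(cut_color_counts, aes(x = cut, y = color, fill = count)) +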
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Cut Rating", 
       y = "Color Rating", 
       fill = "Total Diamonds",
       title = "Number of Diamonds by Cut and Color",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package") +
  theme_minimal()

heatmap

The heat map shows the number of diamonds at each combination of cut and color rating. From the heat map above, it is apparent that the combination of “Ideal” cut with a color rating of “G” occurs more frequently in the dataset than any other pairing.
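A heat map of two variables needs exactly one count per pair of ratings, and the most frequent pair can be read from those counts directly. The sketch below shows both steps in base R on a made-up toy data frame; the object names are illustrative and the counts are not taken from the diamonds data.

```r
# Toy data: six stones with a cut and a color rating (made-up values)
toy <- data.frame(
  cut   = c("Ideal", "Ideal", "Ideal", "Premium", "Premium", "Good"),
  color = c("G", "G", "E", "G", "H", "E")
)

# Cross-tabulate, then reshape to one row per (cut, color) pair
pair_counts <- as.data.frame(
  table(cut = toy$cut, color = toy$color),
  responseName = "count"
)

# Row with the largest count, i.e. the most frequent pairing
top_pair <- pair_counts[which.max(pair_counts$count), ]

print(top_pair)
```

Counting on only the two plotted variables matters because a table that is still split by a third variable would contain several rows for each tile.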

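# Count diamonds per clarity and color pair so each tile maps to exactly one row
clarity_color_counts <- diamonds_complete %>% count(clarity, color, name = "count")
# Create heatmap to visualize the number of diamonds with specific clarity and color ratings.
heatmap_2 <- ggplot(clarity_color_counts, aes(x = clarity, y = color, fill = count)) +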
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Clarity Rating", 
       y = "Color Rating", 
       fill = "Total Diamonds",
       title = "Number of Diamonds by Clarity and Color",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package") +
  theme_minimal()

heatmap_2

The heat map above represents the frequency of diamonds grouped by color and clarity ratings. It shows more variation among clarity and color combinations than the cut and color heat map, where most diamonds were tightly grouped.

# Count diamonds per cut and clarity pair so each tile maps to exactly one row
cut_clarity_counts <- diamonds_complete %>% count(cut, clarity, name = "count")
# Create heatmap to visualize the number of diamonds with specific cut and clarity ratings.
heatmap_3 <- ggplot(cut_clarity_counts, aes(x = cut, y = clarity, fill = count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Cut Rating", 
       y = "Clarity Rating", 
       fill = "Total Diamonds",
       title = "Number of Diamonds by Cut and Clarity",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package") +
  theme_minimal()

heatmap_3

The final heat map presents the intersection of cut and clarity ratings. While there is more variation than in the cut and color heat map, clear trends remain. The combinations with the largest number of diamonds in the dataset occur with clarity ratings of “VS1” to “SI1” and cut ratings of “Very Good” to “Ideal.”

# Create treemap to understand the distribution of cut, color, and clarity.
treemap(diamonds_agg,
        index = c("cut", "color", "clarity"),
        vSize = "count",
        title = "Number of Diamonds by Cut, Color, and Clarity")

The treemap shows the distribution of diamond qualities by visualizing the total number of diamonds across the categorical variables of cut, color, and clarity. The first level of the treemap separates the diamonds according to their cut and assigns each cut a fill color within the visualization. Next, within each cut, there is another separation based on the color rating of the diamond. Finally, the smallest rectangles within the treemap are sized by the count of diamonds per clarity rating. An immediate conclusion from the treemap is that the dataset skews toward better-cut diamonds: the “Ideal” rated diamonds form the largest section, and the “Fair” rated stones occur at the lowest frequency.

Qualitative Analysis

Understanding the descriptive aspects of the dataset requires examining how values are distributed within the variables and identifying any abnormal observations. In this report, boxplots are used to identify outliers within the dataset.
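The points a boxplot draws as outliers follow a simple rule: by default, any value more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to a made-up vector of prices and compares the result with boxplot.stats() from base R. The prices are illustrative, not drawn from the dataset.

```r
# Made-up prices with one clearly extreme value
prices <- c(400, 900, 1500, 2400, 3100, 5300, 18800)

# Quartiles and interquartile range
q1  <- unname(quantile(prices, 0.25))
q3  <- unname(quantile(prices, 0.75))
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
manual_outliers <- prices[prices < lower_fence | prices > upper_fence]

# boxplot.stats() uses hinges rather than quantile(), which can differ
# slightly for some sample sizes but gives the same result here
builtin_outliers <- boxplot.stats(prices)$out

print(manual_outliers)
print(builtin_outliers)
```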

# Create boxplot to visualize the distribution of price per clarity rating
diamonds_complete %>%
  ggplot(aes(x = clarity, y = price, fill = clarity)) +
  geom_boxplot() +
  labs(title = "Boxplot of Clarity",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Clarity Rating",
       y = "Price ($)") +
  theme_minimal() +
  theme(legend.position = "none")

The boxplot above presents some interesting information when inspected closely. The clarity rating runs along the x-axis from left to right, with the lowest quality stones on the far left and the highest quality stones on the far right, yet the median price decreases as clarity improves. Because other factors, such as cut and color, also play a role, further analysis is needed to understand the relationship between clarity and price.

# Create boxplot to visualize the distribution of price per color rating.
diamonds_complete %>%
  ggplot(aes(x = color, y = price, fill = color)) +
  geom_boxplot() +
  labs(title = "Boxplot of Color",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Color Rating",
       y = "Price ($)") +
  theme_minimal() +
  theme(legend.position = "none")

In the color and price box plot, the color ratings run from best to worst, moving left to right along the x-axis. The box plot of color rating again shows an unexpected distribution: the median price increases as the color rating worsens.

# Create boxplot to understand the distribution of price per cut rating.
diamonds_complete %>%
  ggplot(aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  labs(title = "Boxplot of Cut",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Cut Rating",
       y = "Price ($)") +
  theme_minimal() +
  theme(legend.position = "none")

Following the same perplexing trend, the diamonds with the best cut rating also have the lowest median price.

Conclusions

The box plots of cut, color, and clarity reveal an interesting trend. In every rating category, as the quality of the stone increased, the median price within the dataset decreased. Additionally, there is a large number of outliers within each quality descriptor. The logical conclusion when faced with unexpected trends and such a high number of outliers is that no single factor explains the price of a diamond. Price instead reflects a combination of factors, likely including one that is more influential than cut, color, or clarity.
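One way such a reversal can arise is when lower-rated stones tend to be larger. The sketch below illustrates the mechanism with entirely made-up numbers: overall, the better cut has the lower median price, but at the same carat weight it is the more expensive stone. It is a toy illustration of the reasoning, not a result from the diamonds data.

```r
# Made-up data: "Fair" stones are mostly large, "Ideal" stones mostly small
toy <- data.frame(
  cut   = c("Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"),
  carat = c(1.0, 1.2, 1.5, 0.3, 0.5, 1.0),
  price = c(4000, 5000, 6500, 1000, 1800, 4500)
)

# Median price and median carat per cut
median_price <- tapply(toy$price, toy$cut, median)
median_carat <- tapply(toy$carat, toy$cut, median)

# Compare stones of the same size (1.0 carat) across cuts
same_size <- toy[toy$carat == 1.0, ]

print(median_price)  # "Ideal" is cheaper overall
print(median_carat)  # but "Ideal" stones are also much smaller
print(same_size)     # at equal carat, "Ideal" costs more
```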

Quantitative Analysis

Because the combination of a diamond’s length, width, and height determines its carat weight, the analysis begins with a visualization of the distribution of carat weight throughout the dataset.

# Create histogram of carat
diamonds_complete %>%
  ggplot(aes(x = carat)) +
  geom_histogram(bins = 30, color = "steelblue4", fill = "lightblue2") +
  labs(title = "Histogram of Carat",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Carat Size",
       y = "Number of Diamonds") +
  theme_minimal()

Conclusions

The histogram shows a pronounced right skew in the data distribution, with most diamonds in the dataset weighing less than 1.2 carats. Because higher-carat diamonds will generally be taller, wider, and deeper, carat serves as a single measure of size in this report, and the individual size variables are not explored further.
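A right skew can also be confirmed numerically: the mean sits above the median, and the moment-based skewness coefficient is positive. The sketch below uses a made-up vector of carat weights, and the skewness() helper is written here for illustration rather than taken from a package.

```r
# Made-up carat weights with a long right tail
carat <- c(0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.5)

# Moment-based sample skewness (illustrative helper, not a package function)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

carat_mean   <- mean(carat)
carat_median <- median(carat)
carat_skew   <- skewness(carat)

# Mean above median and positive skewness both indicate a right skew
print(c(mean = carat_mean, median = carat_median, skewness = carat_skew))
```

The same pattern appears in the summary output earlier in this report, where the mean carat weight is above the median.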

Correlation Analysis

Through knowledge gained during the research of the “four Cs” of diamonds, it is understood that the main factors describing a diamond’s quality include both qualitative and quantitative variables. To understand the relationship between these variables and how they are related to price, analysis will begin with scatter plots and then explore correlations between the variables in the dataset.

# Create scatter plot of carat and price with grouping by cut
diamonds_complete %>%
  ggplot(aes(x = carat, y = price, color = cut)) +
  geom_point() +
  theme_minimal() +     
  labs(title = "Scatter plot of Carat and Price",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Carat Size",
       y = "Price ($)")

In contrast to the box plots, which compare price against a single categorical variable, the relationship between carat and price follows a more expected path. As carat weight increases, the price also increases. While there are some higher-priced stones with lower cut ratings, across the scatter plot there seems to be a trend in which stones with higher cut ratings command a higher price at a given carat weight.

# Create scatterplot of carat and price with grouping by color
diamonds_complete %>% 
  ggplot(aes(x = carat, y = price, color = color)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of Carat and Price",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Carat Size",
       y = "Price ($)")

In the scatter plot of carat and price with points colored by the color rating of the diamond, the same trend is apparent: once stones are compared at a similar carat weight, the color rating is visibly associated with the price of the stone.

# Create scatterplot of carat and price with grouping by clarity
diamonds_complete %>% 
  ggplot(aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of Carat and Price",
       subtitle = "Diamonds Dataset",
       caption = "Data Source: ggplot2 package",
       x = "Carat Size",
       y = "Price ($)")

The same trend is apparent in the scatter plot above: the qualitative rating aligns with price only once the diamond’s carat weight is taken into account. To quantify the observations from the scatter plots, the next step of analysis is to create a correlation matrix and plot the correlations. It is expected that carat weight is the variable most closely related to price and that all other quality ratings are a secondary concern when pricing stones.

# Subset data into variables of cut, color, clarity, carat, and price
corr_diamonds <- diamonds_complete[-c(5,6,8:10)]
# Create factor levels for mapping of qualitative values. Cut and clarity rank from 1 (lowest quality) to n (highest quality).
# Color is coded in alphabetical order from D = 1 to J = 7, so a higher color number does not mean higher quality.
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
color_levels <- c("D", "E", "F", "G", "H", "I", "J")
clarity_level <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Create factors for cut values
corr_diamonds$cut <- factor(corr_diamonds$cut, levels = cut_levels)

# Create factors for color values
corr_diamonds$color <- factor(corr_diamonds$color, levels = color_levels)

# Create factors for clarity values
corr_diamonds$clarity <- factor(corr_diamonds$clarity, levels = clarity_level)

# Create mapping for cut values
cut_mapping <- c("Fair" = 1, "Good" = 2, "Very Good" = 3, "Premium" = 4, "Ideal" = 5)

# Apply mapping to cut variable
corr_diamonds$cut <- cut_mapping[corr_diamonds$cut]

# Convert the factor to numeric
corr_diamonds$color <- as.numeric(corr_diamonds$color)
corr_diamonds$clarity <- as.numeric(corr_diamonds$clarity)
# Create correlation matrix of the four Cs
cor(corr_diamonds)
##              carat         cut       color     clarity      price
## carat    1.0000000 -0.13335113  0.29094288 -0.35219384  0.9215485
## cut     -0.1333511  1.00000000 -0.02014858  0.18833670 -0.0522251
## color    0.2909429 -0.02014858  1.00000000  0.02570228  0.1717463
## clarity -0.3521938  0.18833670  0.02570228  1.00000000 -0.1461243
## price    0.9215485 -0.05222510  0.17174629 -0.14612430  1.0000000
# Create correlation visualization
vis_cor(corr_diamonds)

Correlation Conclusions

The correlation matrix and visualization put numbers behind the earlier visual observations. The box plots suggested that more than a single factor informs price, but the degree to which each factor is related to price was unknown. The correlation matrix shows that carat has a very strong positive relationship with price (a coefficient of about 0.92), meaning that as carat weight increases, price tends to increase as well. The qualitative factors that describe a diamond’s quality have very weak relationships with price, with coefficients of about -0.05 for cut, 0.17 for color, and -0.15 for clarity. Within this dataset, the variable most closely related to the price of the stone is carat weight.
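Because cut, color, and clarity are ordered ratings rather than measurements, a rank-based coefficient is a common alternative to the default Pearson coefficient. The sketch below converts a rating to integer codes and computes both coefficients with cor() on made-up data; using Spearman here is a suggestion for comparison, not the method used in this report.

```r
# Made-up data: six stones with carat, color rating, and price
toy <- data.frame(
  carat = c(0.3, 0.5, 0.7, 1.0, 1.5, 2.0),
  color = c("D", "E", "G", "H", "I", "J"),
  price = c(600, 1500, 2500, 5000, 9000, 15000)
)

# Convert the ordered rating to integer codes (D = 1, ..., J = 7)
color_levels <- c("D", "E", "F", "G", "H", "I", "J")
toy$color_code <- as.integer(factor(toy$color, levels = color_levels))

# Pearson measures linear association between two numeric variables
pearson_carat_price <- cor(toy$carat, toy$price)

# Spearman works on ranks, so it only assumes the codes are ordered
spearman_color_price <- cor(toy$color_code, toy$price, method = "spearman")

print(c(pearson = pearson_carat_price, spearman = spearman_color_price))
```

The sign of a coefficient for a coded rating depends on the direction of the coding, so the level order should be checked before interpreting a positive or negative value.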