Examples of Modeling & Visualization with diamonds

I created the .Rmd file that generated this RPubs document so that you can examine the RMarkdown code that I used to format this page. Download the original RMarkdown document from a Piazza note I just published. Double-click the file you download, Diamonds_Plot_Demo.Rmd; it should open in RStudio. Compare the code I wrote with the information on the RMarkdown Cheat Sheet. ~dlp

The `diamonds` Data Frame

Source

The diamonds data frame is available when the ggplot2 package is loaded. Data extracted for the diamonds data frame include 10 characteristics (variables in columns) of 53,940 diamonds (observations in rows).
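As a quick check of the description above, the following minimal sketch loads ggplot2 and confirms the size of the data frame. It assumes only that the ggplot2 package is installed.

```r
# Loading ggplot2 makes the diamonds data frame available
library(ggplot2)

# Number of rows (diamonds) and columns (variables)
dim(diamonds)

# Names of the 10 variables
names(diamonds)
```

`dim()` returns the row count first and the column count second, so the result should match the 53,940 observations and 10 variables described above.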

Variables

Characteristics of variables in diamonds include:

carat - The weight of a diamond is measured in carats. In fact, all gemstones are measured in this fashion. Carat weight is made up of points, like ounces to a pound. It takes 100 points to equal 1 carat. For example, 25 points = 1/4 carat, 50 points = 1/2 carat, etc. Of course, the higher the carat weight of the diamond, the more you can expect to pay for it. However, the price does not increase on an even scale. A 2 carat diamond will not simply be twice the cost of a 1 carat diamond, despite being twice the weight. The larger the diamond, the rarer it becomes, so the price rises faster than the weight.
cut - A diamond cut is a style or design guide used when shaping a diamond for polishing. Cut does not refer to shape (pear, oval), but the symmetry, proportioning, and polish of a diamond. The cut of a diamond greatly affects a diamond’s brilliance; this means if it is cut poorly, it will be less luminous. This variable focuses on a judgment about the quality of the diamond’s cut: Fair; Good; Very Good; Premium; Ideal.
color - Most commercially available diamonds are classified by color, or more appropriately, the lack of color. The most valuable diamonds are those classified as colorless, yet there are stones with rich colors, including yellow, red, green, and even black, that are extremely rare and valuable. Color is graded on a letter scale from D to Z, with D representing a colorless diamond.

Diamond Color Chart (image: http://www.diamondse.info/images/diamond-color-chart.gif): colorless = D, E, F; near colorless = G, H, I, J; faint yellow = K, L, M; very light yellow = N, O, P, Q, R; light yellow = S to Z; fancy yellow = FANCY.

clarity - The clarity of a diamond is determined by the number, location, and type of inclusions it contains. Inclusions can be microscopic cracks, mineral deposits, or external markings. Clarity is rated using a scale that combines letters and numbers to signify the amount and type of inclusions. This scale ranges from FL to I3, FL being Flawless and the most valuable. The Gemological Institute of America has designated that clarity of diamonds is graded under the following guidelines:

FL - Diamond free from internal and external flaws under 10X magnification.
IF - Absolutely free from internal faults under 10X magnification.
VVS1 - Very, very small inclusions in the stone, very difficult to recognize under 10X magnification.
VVS2 - Very, very small inclusions anywhere in the stone, only smallest external defects allowed.
VS1 - Only the smallest inclusions are allowable in the field of the table and only small faults elsewhere in the stone.
VS2 - Very small internal faults and small external defects.
SI1 - Small internal faults, not visible to the naked eye.
SI2 - Small, easily seen inclusions under magnification in the table, but still not visible to the naked eye.
I1 - Inclusions easily seen under magnification, but difficult to see with the naked eye (inclusions do not influence brilliance).
I2 - Large and numerous inclusions, just barely visible to the naked eye.
I3 - Large and numerous inclusions, easily visible to the naked eye, diminishing brilliance.

x - Length of the diamond in millimeters.
y - Width of the diamond in millimeters.
z - This variable is a measure of the height in millimeters measured from the bottom of the diamond to its table (the flat surface on the top of the diamond); also called depth of the diamond.
depth - This variable is the total depth percentage of the diamond: z divided by the mean of x and y, that is, 2(z) / (x + y), expressed as a percentage.
table - This variable is a measure of table width: the width of the top of the diamond relative to its widest point.
price - The retail price of the diamond in U.S. dollars.
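The depth definition in the list above can be checked against the data. This is a minimal sketch, assuming ggplot2 is installed; the names `first_rows` and `depth_check` are introduced here only for illustration. It recomputes 2(z) / (x + y) as a percentage for the first few diamonds and shows it next to the recorded depth. Small differences are expected because x, y, and z are rounded to two decimals.

```r
library(ggplot2)

# Work with the first six diamonds only
first_rows <- head(diamonds)

# Total depth percentage: z divided by the mean of x and y, times 100
depth_check <- 100 * 2 * first_rows$z / (first_rows$x + first_rows$y)

# Recomputed values beside the recorded depth column
data.frame(recomputed = round(depth_check, 1), recorded = first_rows$depth)
```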

Data types

Here is a listing of the first 10 lines in diamonds:

  # A tibble: 53,940 x 10
     carat       cut color clarity depth table price     x     y     z
     <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
   1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
   2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
   3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
   4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
   5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
   6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
   7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
   8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
   9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
  10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
  # ... with 53,930 more rows

Notice that the data types of variables in diamonds include dbl, int, and ord. dbl (double-precision) and int (integer) both indicate numeric variables, but we have not seen an ord data type yet. Here is a list of the variables in the data frame, along with their data types:

  Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of  10 variables:
   $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
   $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
   $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
   $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
   $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
   $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
   $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
   $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
   $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
   $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The ord data type indicates an ordered factor. It is a factor, meaning that the variable has a small number of values that represent categories. The values are ordered, meaning that, for example, cut classifies diamonds in an order ranging from "Fair," the lowest quality cut, to "Ideal," the highest quality cut.
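The two listings above, and the meaning of an ordered factor, can be explored with a few base R calls. This is a minimal sketch, assuming ggplot2 is installed; it is not necessarily the code used in the .Rmd file.

```r
library(ggplot2)

# First 10 rows, with the abbreviated type of each column
head(diamonds, 10)

# Compact listing of every variable and its data type
str(diamonds)

# An ordered factor has class "ordered" "factor"
class(diamonds$cut)

# Levels are stored from lowest to highest quality
levels(diamonds$cut)

# Because the levels are ordered, comparisons are meaningful:
# count the diamonds whose cut is Premium or better
sum(diamonds$cut >= "Premium")
```

A comparison such as `>=` would produce NA with a warning for an unordered factor, which is one practical difference between the two kinds of factor.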

Frequency Distributions

A variable is an entity that has two or more mutually exclusive values. A frequency distribution displays the number of observations for each value of a variable. Here are a few examples. Try others on your own. Start by copying my code; then, modify the statements to include other variables.

Some discrete variables with small number of values

Color & clarity

    color.freq
  D       6775
  E       9797
  F       9542
  G      11292
  H       8304
  I       5422
  J       2808

       clarity.freq
  I1            741
  SI2          9194
  SI1         13065
  VS2         12258
  VS1          8171
  VVS2         5066
  VVS1         3655
  IF           1790
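One way to produce tables shaped like the two above is `table()` followed by `cbind()`. This is a sketch of one possible approach, assuming ggplot2 is installed; the code in the .Rmd file may differ. The object names are taken from the column headers in the output.

```r
library(ggplot2)

# Count the diamonds at each level of color
color.freq <- table(diamonds$color)

# cbind() turns the counts into a one-column layout, one row per level
cbind(color.freq)

# Same approach for clarity
clarity.freq <- table(diamonds$clarity)
cbind(clarity.freq)
```

To try another variable, replace `diamonds$color` with, for example, `diamonds$cut`.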

A continuous variable with many values

Price

The range of the prices (variable price) of the 53,940 diamonds:

  [1]   326 18823

The frequency distribution of prices grouped in $1,000 intervals, starting at the minimum price of $326 (interval bounds are printed in scientific notation):

                      price.freq
  [326,1.33e+03)           18790
  [1.33e+03,2.33e+03)       7517
  [2.33e+03,3.33e+03)       5466
  [3.33e+03,4.33e+03)       4381
  [4.33e+03,5.33e+03)       4303
  [5.33e+03,6.33e+03)       2743
  [6.33e+03,7.33e+03)       2057
  [7.33e+03,8.33e+03)       1502
  [8.33e+03,9.33e+03)       1257
  [9.33e+03,1.03e+04)       1015
  [1.03e+04,1.13e+04)        935
  [1.13e+04,1.23e+04)        737
  [1.23e+04,1.33e+04)        685
  [1.33e+04,1.43e+04)        561
  [1.43e+04,1.53e+04)        512
  [1.53e+04,1.63e+04)        479
  [1.63e+04,1.73e+04)        439
  [1.73e+04,1.83e+04)        393
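A grouped frequency distribution like the one above can be sketched with `cut()` and `table()`. This is an assumed approach, not necessarily the code in the .Rmd file, and `breaks` and `price.group` are names introduced here for illustration. Note that the counts listed above sum to 53,772, so the 168 diamonds priced at $18,326 or more fall outside the last interval shown; the sketch extends the breaks past the maximum price so that every diamond is counted.

```r
library(ggplot2)

# Minimum and maximum price
range(diamonds$price)

# Interval boundaries every $1,000, starting at the minimum price and
# extending beyond the maximum so that no price is left out
breaks <- seq(from = min(diamonds$price),
              to = max(diamonds$price) + 1000,
              by = 1000)

# Assign each price to a left-closed interval [lower, upper)
price.group <- cut(diamonds$price, breaks = breaks, right = FALSE)

# Count the diamonds in each interval
price.freq <- table(price.group)
cbind(price.freq)
```

Passing `dig.lab = 5` to `cut()` would print the interval bounds as ordinary numbers instead of scientific notation.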

Summary Statistics

I computed all of these statistics with dplyr. Look at the code in the .Rmd file, my friends... look at the code.

Price by color

  # A tibble: 7 x 3
    color average_price number
    <ord>         <dbl>  <int>
  1     D      3169.954   6775
  2     E      3076.752   9797
  3     F      3724.886   9542
  4     G      3999.136  11292
  5     H      4486.669   8304
  6     I      5091.875   5422
  7     J      5323.818   2808
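A table like the one above can be built with `group_by()` and `summarise()`. This is a minimal sketch, assuming the ggplot2 and dplyr packages are installed; the column names follow the output, while `price_by_color` is a name introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

# One row per color: mean price and number of diamonds
price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(average_price = mean(price),
            number = n())

price_by_color
```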

Price by clarity

Average price

  # A tibble: 8 x 3
    clarity average_price number
      <ord>         <dbl>  <int>
  1      I1      3924.169    741
  2     SI2      5063.029   9194
  3     SI1      3996.001  13065
  4     VS2      3924.989  12258
  5     VS1      3839.455   8171
  6    VVS2      3283.737   5066
  7    VVS1      2523.115   3655
  8      IF      2864.839   1790

Average, minimum, and maximum price

  # A tibble: 8 x 5
    clarity average_price minimum maximum number
      <ord>         <dbl>   <dbl>   <dbl>  <int>
  1      I1      3924.169     345   18531    741
  2     SI2      5063.029     326   18804   9194
  3     SI1      3996.001     326   18818  13065
  4     VS2      3924.989     334   18823  12258
  5     VS1      3839.455     327   18795   8171
  6    VVS2      3283.737     336   18768   5066
  7    VVS1      2523.115     336   18777   3655
  8      IF      2864.839     369   18806   1790

Average, minimum, and maximum price sorted in descending order by average price

  # A tibble: 8 x 5
    clarity average_price minimum maximum number
      <ord>         <dbl>   <dbl>   <dbl>  <int>
  1     SI2      5063.029     326   18804   9194
  2     SI1      3996.001     326   18818  13065
  3     VS2      3924.989     334   18823  12258
  4      I1      3924.169     345   18531    741
  5     VS1      3839.455     327   18795   8171
  6    VVS2      3283.737     336   18768   5066
  7      IF      2864.839     369   18806   1790
  8    VVS1      2523.115     336   18777   3655
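The three clarity tables above build on one another: add more summaries inside `summarise()`, then sort with `arrange()`. The following sketch shows the last step, again assuming ggplot2 and dplyr are installed and using the hypothetical name `price_by_clarity`.

```r
library(ggplot2)
library(dplyr)

price_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(average_price = mean(price),
            minimum = min(price),
            maximum = max(price),
            number = n()) %>%
  # Sort from highest to lowest average price
  arrange(desc(average_price))

price_by_clarity
```

Dropping the `arrange()` line leaves the rows in the order of the clarity levels, as in the second table; keeping only `average_price` and `number` gives the first.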

Carat by cut

  # A tibble: 5 x 3
          cut average_carat number
        <ord>         <dbl>  <int>
  1      Fair     1.0461366   1610
  2      Good     0.8491847   4906
  3 Very Good     0.8063814  12082
  4   Premium     0.8919549  13791
  5     Ideal     0.7028370  21551

A Hypothesis Test

Test of the hypothesis that there is no difference in the price of diamonds with clarity = VVS1 and clarity = IF.

Examine the code in the .Rmd file, and try this yourself with other variables. ~ dlp

The calculation

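# I will show the R code in the RPubs document
library(ggplot2)  # provides the diamonds data frame
# Prices of the diamonds in each of the two clarity groups
price.vvs1 <- diamonds$price[diamonds$clarity == "VVS1"]
price.if <- diamonds$price[diamonds$clarity == "IF"]
# Two-sample t-test that assumes equal variances in the two groups
t.test(price.vvs1, price.if, var.equal = TRUE)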

  
    Two Sample t-test
  
  data:  price.vvs1 and price.if
  t = -3.3481, df = 5443, p-value = 0.0008193
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   -541.8145 -141.6344
  sample estimates:
  mean of x mean of y 
   2523.115  2864.839

Explanation of the result

The difference in the average price of diamonds with clarity VVS1 (M = $2,523) and diamonds with clarity IF (M = $2,865) was -$342, which was statistically significant at the .05 level, t(5443) = -3.35, p < .001, 95% CI [-$542, -$142].
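Each number in the sentence above can be pulled from the object that `t.test()` returns rather than copied from the printed output. This is a minimal sketch, assuming ggplot2 is installed; `result` and `mean_difference` are names introduced here for illustration.

```r
library(ggplot2)

price.vvs1 <- diamonds$price[diamonds$clarity == "VVS1"]
price.if <- diamonds$price[diamonds$clarity == "IF"]

# Store the test result instead of only printing it
result <- t.test(price.vvs1, price.if, var.equal = TRUE)

# Difference between the two group means (VVS1 minus IF)
mean_difference <- mean(price.vvs1) - mean(price.if)
round(mean_difference)

# Degrees of freedom, t statistic, and p-value
unname(result$parameter)
round(unname(result$statistic), 2)
result$p.value

# 95% confidence interval for the difference in means
round(result$conf.int)
```

The degrees of freedom equal the two group sizes minus 2, that is, 3655 + 1790 - 2 = 5443.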

Plots through `ggplot2`

Again, examine the R code in the .Rmd file to see how these plots were executed. Compare my code with the ggplot2 Cheat Sheet.

David L. Passmore copied by Renee Ford

October 9, 2017

Histogram of diamond prices

Histogram of diamond prices by cut

Diamond frequency by carat

Diamond count by cut

Diamond count by clarity for levels of cut

Diamond price by carat
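The six plots named in the headings above could be drawn along the following lines. This is only a sketch under stated assumptions: the geoms, bin widths, and aesthetic mappings are choices made here for illustration, and the code in the .Rmd file may differ. It assumes ggplot2 is installed.

```r
library(ggplot2)

plots <- list(
  # Histogram of diamond prices (bin width of $500 is an assumption)
  price_hist = ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = 500),

  # Histogram of diamond prices by cut: one panel per cut level
  price_hist_by_cut = ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = 500) +
    facet_wrap(~ cut),

  # Diamond frequency by carat, drawn as a frequency polygon
  carat_freq = ggplot(diamonds, aes(x = carat)) +
    geom_freqpoly(binwidth = 0.1),

  # Diamond count by cut
  cut_count = ggplot(diamonds, aes(x = cut)) +
    geom_bar(),

  # Diamond count by clarity for levels of cut, bars side by side
  clarity_by_cut = ggplot(diamonds, aes(x = clarity, fill = cut)) +
    geom_bar(position = "dodge"),

  # Diamond price by carat; transparency reduces overplotting
  price_by_carat = ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(alpha = 0.1)
)

# Display one plot by name
plots$price_by_carat
```

Storing the plots in a named list keeps them together so each can be printed or modified separately, for example `plots$cut_count + labs(title = "Diamond count by cut")`.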
