Overview, Cleaning, and EDA of the Diamonds Dataset

First, we load the diamonds dataset in RStudio. It ships with ggplot2, which is part of the tidyverse.

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("ggplot2")
data("diamonds")

Next, we install and load the other packages we need.

We install the “here” package to make file referencing easier.

install.packages("here")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("here")

## here() starts at /cloud/project

We install the “skimr” package to summarize and skim through the data quickly.

install.packages("skimr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("skimr")

We install the “janitor” package to do some cleaning.

install.packages("janitor")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("janitor")

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

We also install and load the “dplyr” package, since we will be using some of its functions.

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
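Calling install.packages() on every run reinstalls packages that are already present. As a minimal sketch, assuming the same five packages used above, the following installs only what is missing and then loads everything. The variable names pkgs, installed, and missing_pkgs are illustrative.

```r
# Packages used in this walkthrough
pkgs <- c("ggplot2", "here", "skimr", "janitor", "dplyr")

# requireNamespace() returns FALSE for packages that are not installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach every package; character.only = TRUE lets library() take a string
invisible(lapply(pkgs, library, character.only = TRUE))
```

This keeps repeated runs fast while still working on a fresh machine.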

Now we get an overview of the diamonds dataset, including basic summary statistics for each column.

Data summary
Name	diamonds
Number of rows	53940
Number of columns	10
_______________________
Column type frequency:
factor	3
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cut	1	TRUE	5	Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color	1	TRUE	7	G: 11292, E: 9797, F: 9542, H: 8304
clarity	1	TRUE	8	SI1: 13065, VS2: 12258, SI2: 9194, VS1: 8171

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
carat	1	0.80	0.47	0.2	0.40	0.70	1.04	5.01
depth	1	61.75	1.43	43.0	61.00	61.80	62.50	79.00
table	1	57.46	2.23	43.0	56.00	57.00	59.00	95.00
price	1	3932.80	3989.44	326.0	950.00	2401.00	5324.25	18823.00
x	1	5.73	1.12	0.0	4.71	5.70	6.54	10.74
y	1	5.73	1.14	0.0	4.72	5.71	6.54	58.90
z	1	3.54	0.71	0.0	2.91	3.53	4.04	31.80
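The summary above has the layout of skimr output, but the call that produced it is not shown. As an assumption, a call such as skim_without_charts() yields a table of this shape (skim() would add an inline histogram column for the numeric variables). The name overview is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(skimr)

data("diamonds")

# Summarize every column: types, completeness, and basic statistics
overview <- skim_without_charts(diamonds)
overview
```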

Let’s look at the structure of the dataset.

str(diamonds)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
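str() truncates the level lists of the ordered factors. A short sketch that prints the full ordering of each one with levels():

```r
library(ggplot2)
data("diamonds")

# Full level ordering, lowest to highest, for each ordered factor
levels(diamonds$cut)
levels(diamonds$color)
levels(diamonds$clarity)

# is.ordered() confirms that the factor carries an ordering
is.ordered(diamonds$cut)
```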

Next, we take a glimpse of the diamonds dataset to see the total number of rows and columns, along with some sample values from each column.

glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
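If only the dimensions are needed, base R reports them directly. A small sketch:

```r
library(ggplot2)
data("diamonds")

dim(diamonds)   # number of rows, then number of columns
nrow(diamonds)  # rows only
ncol(diamonds)  # columns only
```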

If we only want the column names, colnames() returns them directly.

colnames(diamonds)

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

Now, let’s fetch the first few rows of the diamonds dataset.

head(diamonds)

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
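head() returns six rows by default. As a sketch, the n argument changes that, tail() shows the end of the data, and dplyr's slice_sample() draws random rows, which can be more representative since this dataset appears to be sorted by price. The seed value and the name sampled are arbitrary.

```r
library(ggplot2)
library(dplyr)
data("diamonds")

head(diamonds, n = 10)   # first 10 rows instead of the default 6
tail(diamonds, n = 3)    # last 3 rows

# Random rows give a less order-dependent impression of the data
set.seed(1)              # arbitrary seed, only for reproducibility
sampled <- slice_sample(diamonds, n = 5)
sampled
```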

To make the column names stand out, we convert them to uppercase.

DIAMOND <- rename_with(diamonds,toupper)
head(DIAMOND)

## # A tibble: 6 × 10
##   CARAT CUT       COLOR CLARITY DEPTH TABLE PRICE     X     Y     Z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
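rename_with() applies a function to column names, and its .cols argument restricts which names are affected. As an illustration (the choice of x, y, and z is arbitrary, and partial is a made-up name), this uppercases only the dimension columns:

```r
library(ggplot2)
library(dplyr)
data("diamonds")

# Apply toupper() only to the selected columns; the rest keep their names
partial <- rename_with(diamonds, toupper, .cols = c(x, y, z))
names(partial)
```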

Now we rename the column “COLOR” to the British spelling “COLOUR”.

DIAMOND <- DIAMOND %>% rename(COLOUR=COLOR) 
head(DIAMOND)

## # A tibble: 6 × 10
##   CARAT CUT       COLOUR CLARITY DEPTH TABLE PRICE     X     Y     Z
##   <dbl> <ord>     <ord>  <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E      SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E      SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E      VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I      VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J      SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J      VVS2     62.8    57   336  3.94  3.96  2.48

We keep only the columns we need, which drops TABLE, X, Y, and Z.

DIAMOND<-DIAMOND %>% select(CARAT,CUT,COLOUR,CLARITY,DEPTH,PRICE)
head(DIAMOND)

## # A tibble: 6 × 6
##   CARAT CUT       COLOUR CLARITY DEPTH PRICE
##   <dbl> <ord>     <ord>  <ord>   <dbl> <int>
## 1  0.23 Ideal     E      SI2      61.5   326
## 2  0.21 Premium   E      SI1      59.8   326
## 3  0.23 Good      E      VS1      56.9   327
## 4  0.29 Premium   I      VS2      62.4   334
## 5  0.31 Good      J      SI2      63.3   335
## 6  0.24 Very Good J      VVS2     62.8   336
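The three steps so far (uppercasing, renaming, selecting) can also be chained into one pipeline that starts from the original data. A sketch, which here drops the unwanted columns by name instead of listing the ones to keep:

```r
library(ggplot2)
library(dplyr)
data("diamonds")

DIAMOND <- diamonds %>%
  rename_with(toupper) %>%      # all column names to uppercase
  rename(COLOUR = COLOR) %>%    # syntax is new_name = old_name
  select(-c(TABLE, X, Y, Z))    # drop the unneeded columns

names(DIAMOND)
```

Writing it as a single pipeline avoids reassigning DIAMOND several times and makes the cleaning steps easy to rerun from scratch.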

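Let’s ensure that the column names contain only letters, numbers, and underscores. Note that clean_names() also converts the names to lowercase snake_case, and because the result is not assigned here, DIAMOND keeps its uppercase names.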

clean_names(DIAMOND)

## # A tibble: 53,940 × 6
##    carat cut       colour clarity depth price
##    <dbl> <ord>     <ord>  <ord>   <dbl> <int>
##  1  0.23 Ideal     E      SI2      61.5   326
##  2  0.21 Premium   E      SI1      59.8   326
##  3  0.23 Good      E      VS1      56.9   327
##  4  0.29 Premium   I      VS2      62.4   334
##  5  0.31 Good      J      SI2      63.3   335
##  6  0.24 Very Good J      VVS2     62.8   336
##  7  0.24 Very Good I      VVS1     62.3   336
##  8  0.26 Very Good H      SI1      61.9   337
##  9  0.22 Fair      E      VS2      65.1   337
## 10  0.23 Very Good H      VS1      59.4   338
## # … with 53,930 more rows
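The column names here are already clean, so the effect of clean_names() is easier to see on a made-up data frame. In this sketch the data frame messy and its column names are hypothetical:

```r
library(janitor)

# Hypothetical data frame with a space, a dot, and all-caps in its names
messy <- data.frame(
  `Carat Weight` = 0.23,
  `Cut.Grade`    = "Ideal",
  PRICE          = 326,
  check.names    = FALSE   # keep the names exactly as written
)

# Assign the result; clean_names() returns a new data frame
cleaned <- clean_names(messy)
names(cleaned)
```

Without the assignment, the cleaned names are only printed and the original object is left unchanged.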

Let’s visualize this data.

library("ggplot2")

We will visualise the number of diamonds of each cut, with each bar broken down by clarity.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=CUT, fill=CLARITY))
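The bar heights come from counting rows per combination of cut and clarity. As a sketch, the same numbers can be tabulated with dplyr's count(). DIAMOND is rebuilt here so that the block runs on its own, and cut_clarity_counts is an illustrative name.

```r
library(ggplot2)
library(dplyr)
data("diamonds")

DIAMOND <- diamonds %>%
  rename_with(toupper) %>%
  rename(COLOUR = COLOR)

# One row per CUT/CLARITY combination, with the count in column n
cut_clarity_counts <- count(DIAMOND, CUT, CLARITY)
cut_clarity_counts
```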

Next, we plot the number of diamonds of each cut again, this time broken down by colour.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=CUT, fill=COLOUR))
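Because the cut types differ a lot in size, stacked counts make it hard to compare the colour mix between them. One option, shown here as a sketch, is position = "fill", which scales every bar to the same height so the segments show proportions. DIAMOND is rebuilt so the block runs on its own, and p is an illustrative name.

```r
library(ggplot2)
library(dplyr)
data("diamonds")

DIAMOND <- diamonds %>%
  rename_with(toupper) %>%
  rename(COLOUR = COLOR)

# position = "fill" normalizes each bar to 1, so segments are proportions
p <- ggplot(data = DIAMOND) +
  geom_bar(mapping = aes(x = CUT, fill = COLOUR), position = "fill") +
  labs(y = "proportion")
p
```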

Let’s split the data into multiple charts, one per cut type, with colour on the x-axis.

ggplot(data=DIAMOND) + 
  geom_bar(mapping=aes(x=COLOUR, fill = CUT)) + 
  facet_wrap(~CUT)

We will limit the cut type to “Ideal” (the most preferred cut) and arrange the rows by carat in descending order.

Ideal_diamonds <- DIAMOND %>%
  filter(CUT == "Ideal") %>%
  arrange(desc(CARAT))
head(Ideal_diamonds)

## # A tibble: 6 × 6
##   CARAT CUT   COLOUR CLARITY DEPTH PRICE
##   <dbl> <ord> <ord>  <ord>   <dbl> <int>
## 1  3.5  Ideal H      I1       62.8 12587
## 2  3.22 Ideal I      I1       62.6 12545
## 3  3.01 Ideal J      SI2      61.7 16037
## 4  3.01 Ideal J      I1       65.4 16538
## 5  2.75 Ideal D      I1       60.9 13156
## 6  2.72 Ideal H      I1       59.6 11594
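An alternative sketch for the same top rows uses slice_max(), which ranks and trims in one step. Setting with_ties = FALSE caps the result at exactly n rows even when carat values tie. DIAMOND is rebuilt so the block runs on its own, and top_ideal is an illustrative name.

```r
library(ggplot2)
library(dplyr)
data("diamonds")

DIAMOND <- diamonds %>%
  rename_with(toupper) %>%
  rename(COLOUR = COLOR) %>%
  select(CARAT, CUT, COLOUR, CLARITY, DEPTH, PRICE)

# Keep Ideal cuts only, then take the 6 rows with the largest CARAT
top_ideal <- DIAMOND %>%
  filter(CUT == "Ideal") %>%
  slice_max(CARAT, n = 6, with_ties = FALSE)
top_ideal
```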