Week 2: Data Dive

Objectives:

Library Installation and Data Loading
Data Inspection and Summary
Insight Generation from Data Summary and Visualization

Library Installation and Data Loading

Before we get started, we will load the necessary libraries and then load the data.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly) #we can perform powerful visualizations

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(skimr) #for a quick overview of the data

We will use the “diamonds” dataset, which ships with ggplot2, as a consistent reference point. It serves as a standard foundation for illustrating the data analysis and visualization techniques used throughout the exercise.

#loading the dataset
data(diamonds)

#Now that the dataset has been successfully loaded, 
#let's see what's in the diamonds dataset.
view(diamonds)  #view function shows the dataset in a new tab
str(diamonds)   #str function shows the structure of each variable in the dataset

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

We can see that the diamonds dataset has 53,940 rows and 10 columns (we can also check this with the dim() function). It has three ordered factor variables, one integer variable, and six numeric variables. The columns are:
carat : weight of the diamond
cut : quality of the cut
color : diamond color
clarity : measurement of how clear the diamond is
depth : total depth percentage
table : width of the top of the diamond relative to the widest point
price : price in US dollars
x : length in mm
y : width in mm
z : depth in mm
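
The row, column, and type counts described above can also be checked directly. The snippet below is a small sketch that assumes only that ggplot2 (which provides diamonds) and dplyr are installed; the name col_classes is introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides glimpse()

dim(diamonds)     # rows and columns: 53940 10
nrow(diamonds)    # number of rows only
ncol(diamonds)    # number of columns only

# glimpse() is a compact alternative to str()
glimpse(diamonds)

# Count how many columns there are of each type,
# using the first class of every column
col_classes <- sapply(diamonds, function(col) class(col)[1])
table(col_classes)
```

The table of column classes should agree with the str() output: three ordered factors, one integer column, and six numeric columns.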

Data Inspection and Summary

With almost 54,000 rows, we cannot inspect the data row by row, so we summarize it to understand it better.

summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

For each of the numerical variables, summary() reports the following:
Min. : the minimum value
1st Qu. : the first quartile (25th percentile)
Median : the middle value
Mean : the average value
3rd Qu. : the third quartile (75th percentile)
Max. : the maximum value

For the categorical variables (cut, color, and clarity), summary() reports a frequency count for each level.

For example, for the cut variable:
Fair : occurs 1,610 times
Good : occurs 4,906 times
Very Good : occurs 12,082 times
Premium : occurs 13,791 times
Ideal : occurs 21,551 times

There are 3 variables with an ordered factor structure. An ordered factor arranges the categorical values in low-to-high rank order. For example, there are 5 categories of diamond cut, from “Fair”, the lowest grade, to “Ideal”, the highest.
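
To make the idea of an ordered factor concrete, the sketch below inspects the cut column and uses a comparison operator on it. It assumes ggplot2 is installed; n_premium_or_better is a name chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

is.ordered(diamonds$cut)  # TRUE for an ordered factor
levels(diamonds$cut)      # levels listed from lowest to highest grade

# Because the levels are ranked, comparison operators are meaningful:
# count the diamonds whose cut is "Premium" or better.
n_premium_or_better <- sum(diamonds$cut >= "Premium")
n_premium_or_better       # 13791 Premium + 21551 Ideal = 35342
```

The same comparison on an unordered factor would return NA with a warning, which is why the ordering matters.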

Observation 1
We can see that the minimum values for x, y, and z are 0, which makes no sense: a physical dimension of a diamond is never zero. These rows are likely data-entry errors or missing measurements, so we can omit them.
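
One way to omit the rows with a zero dimension is sketched below. It assumes ggplot2 and dplyr are installed; zero_dims and diamonds_clean are hypothetical names, and the rest of this exercise continues to use the full dataset.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Rows where at least one physical dimension is recorded as 0
zero_dims <- dplyr::filter(diamonds, x == 0 | y == 0 | z == 0)
nrow(zero_dims)   # how many rows would be dropped

# Keep only rows where all three dimensions are positive
diamonds_clean <- dplyr::filter(diamonds, x > 0, y > 0, z > 0)

# Confirm the minimums are no longer 0
summary(select(diamonds_clean, x, y, z))
```

dplyr::filter() is written out in full because plotly and stats also define a filter() function, as the startup messages above show.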

For this exercise, we are going to take 5 variables from the diamonds dataset.

selected_columns <- c('carat','cut','color','depth','price')

# Creating a new dataframe with the selected columns
diamond_1 <- diamonds[selected_columns]
diamond_1

## # A tibble: 53,940 × 5
##    carat cut       color depth price
##    <dbl> <ord>     <ord> <dbl> <int>
##  1  0.23 Ideal     E      61.5   326
##  2  0.21 Premium   E      59.8   326
##  3  0.23 Good      E      56.9   327
##  4  0.29 Premium   I      62.4   334
##  5  0.31 Good      J      63.3   335
##  6  0.24 Very Good J      62.8   336
##  7  0.24 Very Good I      62.3   336
##  8  0.26 Very Good H      61.9   337
##  9  0.22 Fair      E      65.1   337
## 10  0.23 Very Good H      59.4   338
## # ℹ 53,930 more rows

names(diamond_1) #list the column names

## [1] "carat" "cut"   "color" "depth" "price"

summary(diamond_1)

##      carat               cut        color         depth           price      
##  Min.   :0.2000   Fair     : 1610   D: 6775   Min.   :43.00   Min.   :  326  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   1st Qu.:61.00   1st Qu.:  950  
##  Median :0.7000   Very Good:12082   F: 9542   Median :61.80   Median : 2401  
##  Mean   :0.7979   Premium  :13791   G:11292   Mean   :61.75   Mean   : 3933  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   3rd Qu.:62.50   3rd Qu.: 5324  
##  Max.   :5.0100                     I: 5422   Max.   :79.00   Max.   :18823  
##                                     J: 2808

We can also get an overview of the data using the skim() function from the ‘skimr’ package, which we loaded earlier (the package must be installed first). skim() is a versatile function that gives a quick overview of your data: it provides summary statistics for various data types within data frames, tibbles, data tables, and even vectors.

skim(diamond_1)

Data summary
Name	diamond_1
Number of rows	53940
Number of columns	5
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
cut	0	1	TRUE	5	Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906
color	0	1	TRUE	7	G: 11292, E: 9797, F: 9542, H: 8304

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
carat	1	0.80	0.47	0.2	0.4	0.7	1.04	5.01	▇▂▁▁▁
depth	1	61.75	1.43	43.0	61.0	61.8	62.50	79.00	▁▁▇▁▁
price	1	3932.80	3989.44	326.0	950.0	2401.0	5324.25	18823.00	▇▂▁▁▁

Observation 2
We can see that there are 5 unique values for the variable cut and 7 unique values for the variable color.
There are no missing values in our data frame (which is great).

Insight Generation from Data Summary and Visualization

Let’s look at the summary of each column and visualize it.

summary(diamond_1$carat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2000  0.4000  0.7000  0.7979  1.0400  5.0100

Min.: The minimum value observed in the dataset for the ‘carat’ variable is 0.2000.
1st Qu. (First Quartile): 25% of the data falls below this value, and it is 0.4000.
Median (or 2nd Qu.): The middle value of the dataset when ordered. In this case, it is 0.7000.
Mean: The average value of the ‘carat’ variable is 0.7979.
3rd Qu. (Third Quartile): 75% of the data falls below this value, and it is 1.0400.
Max.: The maximum value observed in the dataset for the ‘carat’ variable is 5.0100.

#visualizing the carat column
ggplot(diamonds, aes(carat)) +
    geom_bar(fill='orange')

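Most of the diamonds in our dataset are small: the summary above shows that half of them fall between 0.40 and 1.04 carats, and the distribution is right-skewed, with a thin tail reaching 5.01 carats.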
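
For a continuous variable like carat, a histogram is a common alternative to a bar chart because it groups nearby values into bins. The sketch below assumes ggplot2 is installed; the bin width of 0.1 and the 0.5 to 1.5 range are illustrative choices, and carat_hist and share_mid are names introduced here.

```r
library(ggplot2)  # provides the diamonds dataset

# Histogram of carat with bins that are 0.1 carat wide
carat_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "orange", color = "white") +
  labs(title = "Distribution of carat", x = "carat", y = "count")
carat_hist

# Share of diamonds whose carat lies in a chosen range
share_mid <- mean(diamonds$carat >= 0.5 & diamonds$carat <= 1.5)
share_mid
```

Computing the share directly is a quick way to check a statement read off a plot.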

Q1. Is there any relationship between carat and price?
Let’s see.

summarise(diamond_1, correlation = cor(x = carat, y = price))

## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.922

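We can see that there is a strong positive correlation between carat and price: as carat increases, the price of a diamond tends to increase as well. Note that a high correlation indicates a strong linear association, not that price is strictly proportional to carat.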
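
The correlation can also be computed with base R, and comparing it with a rank-based measure is one way to check whether the relationship is monotonic rather than strictly linear. This is a sketch that assumes ggplot2 is installed; it uses the full diamonds data, whose carat and price columns are the same as in diamond_1.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation: strength of the linear association
pearson <- cor(diamonds$carat, diamonds$price)

# Spearman correlation: strength of the monotonic (rank) association
spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

round(pearson, 3)   # 0.922, matching the summarise() result above
round(spearman, 3)
```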

Q2. What is the average price of the diamond?
Let’s see.

#we can find the average using the mean function
avg_price <- mean(diamond_1$price)
cat("The Average Price of the diamonds is", avg_price, "\n")

## The Average Price of the diamonds is 3932.8

unique(diamond_1$cut) #unique values in column cut

## [1] Ideal     Premium   Good      Very Good Fair     
## Levels: Fair < Good < Very Good < Premium < Ideal

We have 5 unique values: Ideal, Premium, Good, Very Good, and Fair, with “Fair” being the lowest grade of cut and “Ideal” the highest. Let’s analyze it further by visualizing it.

#visualizing the cut column
ggplot(diamonds, aes(cut)) +
    geom_bar(fill='blue')

We observe that Ideal cuts make up the largest share of our data and Fair cuts the smallest. (The counts also increase in the same order as the factor levels above.)

Q3. How do carat and cut affect the price?
We can visualize their relationship for a better understanding.

#plotting the graph
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = cut)) +
    labs(title = "carat and cut vs  price")

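Premium-cut diamonds reach high prices in the plot, but the points for the different cuts overlap heavily at every carat value, so the scatter plot alone does not tell us how average prices compare across cuts.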
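
Averages are hard to read from a scatter plot with heavily overlapping points, so a grouped summary is a useful check on any statement about average price per cut. The sketch below assumes ggplot2 and dplyr are installed; price_by_cut is a name introduced here.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One row per cut: count, average carat, and average/median price
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n            = n(),
    mean_carat   = mean(carat),
    mean_price   = mean(price),
    median_price = median(price)
  )
price_by_cut
```

Including mean_carat alongside the prices helps show whether a difference in average price between cuts may simply reflect a difference in typical diamond size.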

Observation 3
In the graph above, we can observe that fair-cut diamonds with high carat have a higher price than premium cut diamonds with lower carat. We need to do more EDA in order to understand the data better.

Let’s see how carat and color affect the price, but first we will see how many unique values there are in the color column.

unique(diamond_1$color)

## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J

#plotting the graph 
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = color)) +
    labs(title = "carat and color vs  price")

We can see that there are 7 unique values for the column color, with D being the best and J being the worst (according to the documentation).
GIA Color Scale:
DEF - colorless very rare,
GHIJ - near colorless (tints of yellow and brown),
KLM - Faint,
NOPQR - Very light,
STUVWXYZ - Light
The diamond_1 dataset contains only colorless and near-colorless diamonds. In general, the more colorless a diamond is, the higher its price, all else (especially carat) being equal.
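
Whether the data reflects this can be checked by comparing average prices per color, both overall and among diamonds of similar weight. This is a sketch under stated assumptions: ggplot2 and dplyr are installed, the 0.9 to 1.1 carat band is an arbitrary illustrative choice, and price_by_color and price_by_color_band are names introduced here.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Average carat and price for every color grade
price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(n = n(), mean_carat = mean(carat), mean_price = mean(price))
price_by_color

# Same comparison restricted to diamonds of similar weight,
# so that differences in carat do not dominate the result
price_by_color_band <- diamonds %>%
  dplyr::filter(carat >= 0.9, carat <= 1.1) %>%
  group_by(color) %>%
  summarise(n = n(), mean_price = mean(price))
price_by_color_band
```

If the two tables rank the colors differently, carat is likely confounding the overall comparison.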

Let’s understand how carat and depth affect the price

#plotting the 3D-Scatterplot using plotly
plot_ly(data = diamond_1, x = ~carat, y = ~depth, z = ~price, type = "scatter3d", mode = "markers",
        marker = list(size = 5, color = ~price, colorscale = "Viridis"),
        text = ~paste("Carat: ", carat, "<br>Depth: ", depth, "<br>Price: ", price)) %>%
  layout(scene = list(xaxis = list(title = 'Carat'),
                      yaxis = list(title = 'Depth'),
                      zaxis = list(title = 'Price')),
         title = "3D Scatter Plot of Carat, Depth, and Price of Diamonds")

#plotting the data using ggplot
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = depth)) +
    labs(title = "carat and depth vs price")

#checking whether there is a strong or weak relationship
summarise(diamond_1, correlation = cor(x = depth, y = price))

## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1     -0.0106

The depth values span 43 to 79, but half of the diamonds fall between 61.0 and 62.5, and there is a very weak negative correlation (about -0.01) between depth and price.

Observation 4
Depth shows almost no linear relationship with price, so on its own it tells us very little about a diamond’s price.
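
To put a number on how weak the relationship is, the sketch below uses cor.test() from base R's stats package, which reports the correlation together with a confidence interval. It assumes ggplot2 is installed; depth_test and r_squared are names introduced here. A near-zero correlation only rules out a linear relationship, not every possible relationship.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between depth and price, with a 95% confidence interval
depth_test <- cor.test(diamonds$depth, diamonds$price)
depth_test$estimate   # correlation coefficient, about -0.0106
depth_test$conf.int   # 95% confidence interval for the correlation

# Squaring the correlation gives the share of price variance
# that a straight-line fit on depth alone would explain
r_squared <- unname(depth_test$estimate)^2
r_squared
```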