Title: “Data_Dive_W2” Output : html_document
Objectives:
Before we get started, we will load the necessary libraries and then the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly) # interactive visualizations
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(skimr) # quick overview of the data
We will use the “diamonds” dataset, which ships with ggplot2, as a consistent reference point. It serves as a standard foundation for illustrating the data analysis and visualization techniques used throughout the exercise.
#loading the dataset
data(diamonds)
# Now that the dataset has been loaded,
# let's see what's in it.
view(diamonds) # view() shows the dataset in a new tab
str(diamonds) # str() shows the structure of each variable in the dataset
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
We can see that the diamonds dataset has 53,940 rows and 10
columns (the dim() function reports the same). It has three
ordered factor variables, one integer variable, and six
numeric variables. The description of each column is:
carat : weight of the diamond
cut : quality of the cut
color : diamond color
clarity : measurement of how clear the diamond is
depth : total depth percentage
table : width of the top of the diamond relative to the widest point
price : price in US dollars
x : length in mm
y : width in mm
z : depth in mm
Given the volume of data, a summary helps us understand it better.
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
For each of the numeric variables, summary() reports the following
information:
Min. : the minimum value
1st Qu. : the first quartile (25th percentile)
Median : the median value
Mean : the average value
3rd Qu. : the third quartile (75th percentile)
Max. : the maximum value
For the categorical variables (cut, color, and clarity), it reports a
frequency count for each value.
For example, for the cut variable:
Fair : occurs 1,610 times
Good : occurs 4,906 times
Very Good : occurs 12,082 times
Premium : occurs 13,791 times
Ideal : occurs 21,551 times
Three variables are ordered factors. An ordered factor arranges categorical values in rank order from low to high. For example, cut has 5 categories, with “Fair” being the lowest grade and “Ideal” being the highest.
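To make the idea of an ordered factor concrete, here is a minimal sketch using the cut column. The variable name better_than_good is our own choice for illustration; the rest is base R applied to the diamonds data.

```r
library(ggplot2) # provides the diamonds dataset

# Levels are stored from the lowest grade to the highest
levels(diamonds$cut)
is.ordered(diamonds$cut)

# Because the factor is ordered, comparison operators work on it:
# TRUE for diamonds graded above "Good" (Very Good, Premium, Ideal)
better_than_good <- diamonds$cut > "Good"
table(better_than_good)
```

An unordered factor would return NA with a warning for the same comparison, which is why the ordering matters when we filter or sort by grade.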
Observation 1
We can see that the minimum values for x, y, and z are 0, which
makes no sense: a dimension of a diamond can never be zero. We can omit
these rows.
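One possible way to omit those rows is sketched below. The name diamonds_clean is hypothetical and is not used elsewhere in this report; the rest of the analysis continues with the full dataset, so this is only an illustration of the filtering step.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Keep only rows where all three dimensions are strictly positive
diamonds_clean <- diamonds %>%
  filter(x > 0, y > 0, z > 0)

# Number of rows dropped by the filter
nrow(diamonds) - nrow(diamonds_clean)

# The minimum of each dimension should now be above zero
summary(diamonds_clean[c("x", "y", "z")])
```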
For this exercise, we are going to take 5 variables from the diamonds dataset.
selected_columns <- c('carat','cut','color','depth','price')
# Creating a new dataframe with the selected columns
diamond_1 <- diamonds[selected_columns]
diamond_1
## # A tibble: 53,940 × 5
## carat cut color depth price
## <dbl> <ord> <ord> <dbl> <int>
## 1 0.23 Ideal E 61.5 326
## 2 0.21 Premium E 59.8 326
## 3 0.23 Good E 56.9 327
## 4 0.29 Premium I 62.4 334
## 5 0.31 Good J 63.3 335
## 6 0.24 Very Good J 62.8 336
## 7 0.24 Very Good I 62.3 336
## 8 0.26 Very Good H 61.9 337
## 9 0.22 Fair E 65.1 337
## 10 0.23 Very Good H 59.4 338
## # ℹ 53,930 more rows
names(diamond_1) # list the column names
## [1] "carat" "cut" "color" "depth" "price"
summary(diamond_1)
## carat cut color depth price
## Min. :0.2000 Fair : 1610 D: 6775 Min. :43.00 Min. : 326
## 1st Qu.:0.4000 Good : 4906 E: 9797 1st Qu.:61.00 1st Qu.: 950
## Median :0.7000 Very Good:12082 F: 9542 Median :61.80 Median : 2401
## Mean :0.7979 Premium :13791 G:11292 Mean :61.75 Mean : 3933
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 3rd Qu.:62.50 3rd Qu.: 5324
## Max. :5.0100 I: 5422 Max. :79.00 Max. :18823
## J: 2808
We can also get an overview of the data using the skim() function from the ‘skimr’ package, which we loaded above (the package must be installed first). skim() gives a quick overview of the data: it provides summary statistics for various data types within data frames, tibbles, data tables, and even vectors.
skim(diamond_1)
| Name | diamond_1 |
| Number of rows | 53940 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cut | 0 | 1 | TRUE | 5 | Ide: 21551, Pre: 13791, Ver: 12082, Goo: 4906 |
| color | 0 | 1 | TRUE | 7 | G: 11292, E: 9797, F: 9542, H: 8304 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| carat | 0 | 1 | 0.80 | 0.47 | 0.2 | 0.4 | 0.7 | 1.04 | 5.01 | ▇▂▁▁▁ |
| depth | 0 | 1 | 61.75 | 1.43 | 43.0 | 61.0 | 61.8 | 62.50 | 79.00 | ▁▁▇▁▁ |
| price | 0 | 1 | 3932.80 | 3989.44 | 326.0 | 950.0 | 2401.0 | 5324.25 | 18823.00 | ▇▂▁▁▁ |
Observation 2
We can see that there are 5 unique values for the variable cut
and 7 unique values for the variable color.
There are no missing values in our data frame (which is great).
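The same two facts can be cross-checked without skimr. This is a small sketch in base R; it rebuilds diamond_1 so that the snippet runs on its own, and the names missing_per_column and levels_per_factor are our own.

```r
library(ggplot2) # provides the diamonds dataset

# Rebuild the 5-column subset used in this report
diamond_1 <- diamonds[c("carat", "cut", "color", "depth", "price")]

# Count missing values in each column
missing_per_column <- colSums(is.na(diamond_1))
missing_per_column

# Count the distinct levels of each factor column
levels_per_factor <- sapply(diamond_1[c("cut", "color")], nlevels)
levels_per_factor
```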
Let’s look at the summary of individual columns and visualize them.
summary(diamond_1$carat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.4000 0.7000 0.7979 1.0400 5.0100
#visualizing the carat column
ggplot(diamonds, aes(carat)) +
geom_bar(fill='orange')
Half of the diamonds in our dataset have carat values between 0.40 and 1.04 (the first and third quartiles), and the distribution has a long right tail reaching 5.01.
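Since carat is a continuous variable, a histogram with an explicit bin width is a common alternative to a bar chart. The sketch below is one way to draw it; the bin width of 0.1 carat and the name p_carat are illustrative choices, not something fixed by the data.

```r
library(ggplot2) # provides the diamonds dataset

# Quartiles of carat: the middle 50% lies between the first and last value
carat_quartiles <- quantile(diamonds$carat, probs = c(0.25, 0.5, 0.75))
carat_quartiles

# Histogram with bins that are 0.1 carat wide
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "orange", color = "white") +
  labs(title = "Distribution of carat", x = "carat", y = "count")
p_carat
```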
Q1. Is there any relationship between carat and price?
Let’s see.
summarise(diamond_1, correlation = cor(x = carat, y = price))
## # A tibble: 1 × 1
## correlation
## <dbl>
## 1 0.922
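We can see that there is a strong positive correlation between carat and price: as carat increases, the price of a diamond tends to increase. Note that a correlation coefficient measures the strength of a linear association; it does not by itself show that price rises in exact proportion to carat.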
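To probe the shape of the relationship a little further, we can compare the Pearson correlation with two variants. This is only a sketch: the Spearman and log-scale correlations are additions for illustration and were not part of the original analysis. It uses diamonds directly so that it runs on its own; the carat and price columns are the same as in diamond_1.

```r
library(ggplot2) # provides the diamonds dataset

# Pearson correlation: strength of the linear association
r_pearson <- cor(diamonds$carat, diamonds$price)

# Spearman rank correlation: strength of any monotonic association
r_spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

# Pearson correlation after taking logs of both variables
r_log <- cor(log(diamonds$carat), log(diamonds$price))

c(pearson = r_pearson, spearman = r_spearman, log_scale = r_log)
```

If the rank or log-scale correlation is higher than the plain Pearson value, that suggests the relationship is monotonic but not a straight line on the original scale.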
Q2. What is the average price of a diamond?
Let’s see.
# we can find the average using the mean() function
avg_price <- mean(diamond_1$price)
cat("The Average Price of the diamonds is", avg_price, "\n")
## The Average Price of the diamonds is 3932.8
unique(diamond_1$cut) #unique values in column cut
## [1] Ideal Premium Good Very Good Fair
## Levels: Fair < Good < Very Good < Premium < Ideal
We have 5 unique values: Ideal, Premium, Good, Very Good, and Fair, with “Fair” being the lowest grade of cut and “Ideal” being the highest. Let’s analyze this further by visualizing it.
ggplot(diamonds, aes(cut)) + geom_bar(fill='blue')
We observe that our data contains the most Ideal cuts and the fewest Fair cuts. The counts increase with the grade, in the same order as the levels above.
Q3. How do carat and cut affect the price?
We can visualize their relationship for a better understanding.
#plotting the graph
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = cut)) +
labs(title = "carat and cut vs price")
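On average, Premium-cut diamonds are the most expensive. Fair-cut diamonds, however, are not the cheapest on average, because they tend to be heavier stones; cut and carat have to be considered together.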
Observation 3
In the graph above, we can observe that Fair-cut diamonds with a
high carat have a higher price than Premium-cut diamonds with a lower
carat. We need to do more EDA in order to understand the data
better.
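A scatter plot does not show group averages directly, so one natural next step in the EDA is a grouped summary. The sketch below computes the average carat and price for each cut; price_by_cut and its column names are our own choices.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Average carat and price for each cut grade
price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),
    mean_carat = mean(carat),
    mean_price = mean(price),
    median_price = median(price)
  )
price_by_cut
```

Reading mean_carat next to mean_price helps separate the effect of cut from the effect of size.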
Let’s see how carat and color affect the price, but first we will see how many unique values there are in the color column.
unique(diamond_1$color)
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J
#plotting the graph
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = color)) +
labs(title = "carat and color vs price")
We can see that there are 7 unique values for the column color, with
color D being the best and color J being the worst (according to the
documentation).
GIA Color Scale:
DEF - colorless very rare,
GHIJ - near colorless (tints of yellow and brown),
KLM - Faint,
NOPQR - Very light,
STUVWXYZ - Light
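The diamond_1 dataset contains only colorless and near-colorless
diamonds (D to J). For a given carat weight, a more colorless diamond
tends to be priced higher. Color grade is also related to carat in this
data, so colors should be compared at similar carat weights.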
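Because carat is so strongly related to price, it helps to look at color and carat side by side. The following sketch summarizes both by color grade; price_by_color and its column names are hypothetical names chosen for this example.

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Average carat and price for each color grade (D is best, J is worst)
price_by_color <- diamonds %>%
  group_by(color) %>%
  summarise(
    n = n(),
    mean_carat = mean(carat),
    mean_price = mean(price)
  )
price_by_color
```

If the lower color grades turn out to be heavier on average, their raw average price can be higher even though color itself lowers the price at a fixed carat weight.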
Let’s understand how carat and depth affect the price.
#plotting the 3D-Scatterplot using plotly
plot_ly(data = diamond_1, x = ~carat, y = ~depth, z = ~price, type = "scatter3d", mode = "markers",
marker = list(size = 5, color = ~price, colorscale = "Viridis"),
text = ~paste("Carat: ", carat, "<br>Depth: ", depth, "<br>Price: ", price)) %>%
layout(scene = list(xaxis = list(title = 'Carat'),
yaxis = list(title = 'Depth'),
zaxis = list(title = 'Price')),
title = "3D Scatter Plot of Carat, Depth, and Price of Diamonds")
#plotting the data using ggplot
ggplot(data = diamond_1) + geom_point(mapping = aes(x = carat, y = price, color = depth))+labs(title = "carat and depth vs price")
# checking whether there is a strong or weak relationship
summarise(diamond_1, correlation = cor(x = depth, y = price))
## # A tibble: 1 × 1
## correlation
## <dbl>
## 1 -0.0106
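Depth is concentrated in a narrow band: half of the diamonds have a depth between 61.0 and 62.5, within an overall range of 43 to 79. The correlation between depth and price is -0.0106, a very weak negative value that indicates essentially no linear relationship.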
Observation 4
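Depth has almost no linear relationship with price (a correlation of about -0.01), so depth alone tells us very little about the price of a diamond.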
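To put a number on how little depth explains, we can look at the correlation test and the squared correlation. This is a sketch using base R's cor.test(); the names depth_test and r_squared are our own.

```r
library(ggplot2) # provides the diamonds dataset

# Pearson correlation test between depth and price
depth_test <- cor.test(diamonds$depth, diamonds$price)
depth_test$estimate # correlation coefficient
depth_test$conf.int # 95% confidence interval for the correlation

# Share of the variance in price that a straight-line fit on depth would explain
r_squared <- unname(depth_test$estimate)^2
r_squared
```

With a sample this large, even a tiny correlation can be statistically distinguishable from zero, so the size of the coefficient is more informative here than the p-value.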