## Setup

This chunk sets a global option for the code chunks in the document: echo = TRUE, so the R code of each chunk is displayed alongside its output when the document is rendered. The setup chunk itself is not shown in the rendered document.
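The setup chunk is hidden in the rendered output, so its code does not appear here. A minimal sketch of what such a chunk commonly contains, assuming the standard R Markdown template (the chunk label and the include = FALSE header option are assumptions, not taken from this document):

```r
# Assumed chunk header in the .Rmd source: {r setup, include=FALSE}
# echo = TRUE makes every later chunk show its R code in the rendered document
knitr::opts_chunk$set(echo = TRUE)
```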

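## Load Packages

This section loads the required R packages: tidyverse, ggplot2, and scales. These packages are commonly used for data manipulation and visualization in R. Note that ggplot2 is already attached by tidyverse, as the startup message below shows, so the separate library(ggplot2) call is redundant but harmless.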

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

## Read Data

Here, the code reads data from a CSV file hosted online and assigns it to a dataframe called df. The head(df) function displays the first few rows of the dataframe (six by default), allowing you to inspect the data.

df <- read.csv("https://raw.githubusercontent.com/jewelercart/Data607/main/data.csv")
head(df)

## Select Columns

The original dataframe has some unnecessary columns, so the columns named in the select_col vector are extracted from df and stored in a new dataframe called df2. Again, head(df2) displays the first few rows of this subset.

I am considering the variables owner, shape, carat, color, clarity, cert, Depth, Girdle, Polish, Sym, Meas, Cert_n, City, State, country, cut, pricepercrt, price, retail. Of these, the select_col vector below keeps all except Cert_n, pricepercrt, and retail, and it uses the capitalized column names Cert and Country.

select_col <- c("owner", "shape", "carat", "color", "clarity", "Cert", "Depth", "Girdle",
                "Polish", "Sym", "Meas", "City", "State", "Country", "cut", "price")
df2 <- df[select_col]
head(df2)
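Indexing a dataframe with a vector of names stops with an "undefined columns selected" error if any requested name is missing, and column names are case-sensitive (Cert is not cert). A small sketch of checking the names first, using a hypothetical data frame toy and vector wanted that stand in for df and select_col:

```r
# Hypothetical data frame standing in for df (values are made up)
toy <- data.frame(owner = "a", shape = "Round", carat = 1.2, Cert = "GIA")

# "cert" is deliberately lower-case to show a case mismatch
wanted <- c("owner", "shape", "carat", "cert")

# Names that were requested but do not exist in the data
missing_cols <- setdiff(wanted, names(toy))

# Keep only the requested columns that are actually present
subset_toy <- toy[intersect(wanted, names(toy))]
```

Inspecting missing_cols before subsetting makes a misspelled or differently capitalized name easy to spot.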

## Data Transformation

This code further filters df2 to create a new dataframe df3. It keeps rows where the “carat” column is less than or equal to 4 and the “clarity” column matches the pattern passed to the str_detect function, that is, contains one of the grades IF, FL, VVS1, VVS2, VS1, VS2, SI1, SI2, or SI3. This transforms the required columns into a suitable format.
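str_detect() with an unanchored pattern returns TRUE for any string that contains one of the alternatives, not only for strings that equal one. The sketch below contrasts this with an anchored pattern and with %in%, using hypothetical clarity values (the real column may contain different labels):

```r
library(stringr)

# Hypothetical clarity values, for illustration only
clarity <- c("VS1", "VVS1", "I1", "SI2", "Not graded SI1")

# Unanchored pattern, as in the filter: TRUE when the string contains a grade
loose <- str_detect(clarity, "IF|FL|VVS1|VVS2|VS1|VS2|SI1|SI2|SI3")

# Anchored pattern: the whole string must equal one of the grades
strict <- str_detect(clarity, "^(IF|FL|VVS1|VVS2|VS1|VS2|SI1|SI2|SI3)$")

# Exact matching without regular expressions
grades <- c("IF", "FL", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "SI3")
exact <- clarity %in% grades
```

If the clarity column only ever holds clean grade codes, the three approaches select the same rows; they differ when a value has extra text around the grade.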

## Data Visualization

This code uses ggplot2 to create a scatter plot. It visualizes the relationship between “carat” and “price”, with points colored by “clarity”, and adds a smoothed curve (geom_smooth) for each clarity group.

df3 <- filter(df2,
              carat <= 4, # keep diamonds of at most 4 carats; change the limit if needed, e.g. to 5
              # price <= 50000,
              str_detect(clarity, "IF|FL|VVS1|VVS2|VS1|VS2|SI1|SI2|SI3"))

ggplot(df3, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  scale_y_continuous(labels = dollar) +
  # seq(a, b, c) generates numbers from a up to b in steps of c; here the step is 0.25 (try 0.5, for example)
  scale_x_continuous(breaks = seq(0, 4, 0.25)) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Summary Statistics

This code calculates summary statistics of “price” (minimum, mean, maximum, and standard deviation) by clarity for the diamonds in df3 that weigh between 3 and 4 carats.

df3 |>
  filter(carat >= 3 & carat <= 4) |>  # keep rows with carat between 3 and 4, both inclusive
  group_by(clarity) |>
  summarize(
    minimum_price = min(price),
    mean_price = mean(price),
    maximum_price = max(price),
    standard_deviation = sd(price)
  )
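When a clarity group contains a single diamond, sd() returns NA because a standard deviation needs at least two values. A sketch on a hypothetical data frame toy (made-up prices, not the diamond data) that adds a group size column n so such groups are easy to recognize:

```r
library(dplyr)

# Hypothetical prices: three VS1 diamonds and a single SI1 diamond
toy <- data.frame(
  clarity = c("VS1", "VS1", "VS1", "SI1"),
  price = c(100, 200, 300, 500)
)

stats <- toy |>
  group_by(clarity) |>
  summarize(
    n = n(),                        # number of diamonds in the group
    mean_price = mean(price),
    standard_deviation = sd(price)  # NA when the group has only one row
  )
```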

## Second Visualization

This code creates another scatter plot similar to the one above, but this time points are colored by “Country.” It plots “price” against “carat” so that the relationship can be compared across countries.

ggplot(df3, aes(x = carat, y = price, color = Country)) +
  geom_point() +
  scale_y_continuous(labels = dollar) +
  scale_x_continuous(breaks = seq(0, 4, 0.25)) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Direction of Relationship

The slope of the fitted line tells us about the direction of the relationship between “carat” and “price.” A positive slope means that, on average, price increases as the carat weight of gemstones increases. A negative slope means that price tends to decrease as carat weight increases. For gemstones, a positive slope is expected in most cases. Note that geom_smooth() here fits a smooth curve (method = 'gam') rather than a straight line, so the slope can change along the carat axis.

## Strength of Relationship

The steepness of the slope shows how much the price changes for each additional carat: a steeper slope means a larger price change per carat, and a shallower slope a smaller one. Steepness alone does not measure how strong the relationship is. Strength depends on how closely the data points follow the line.

## Scatter around the Line

The scatter of data points around the line represents the variability in prices for a given carat weight. If the data points are tightly clustered around the line, carat weight is a strong predictor of price. If the points are more spread out, prices vary more for the same carat weight.
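How tightly points cluster around a fitted line can be quantified with R-squared from a linear model. A sketch on made-up numbers (carat, tight, and loose are hypothetical vectors, not the diamond data), assuming a straight-line fit with lm() rather than the GAM smooth drawn in the plots:

```r
# Hypothetical data: same underlying price per carat, different amounts of scatter
carat <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
tight <- 4000 * carat + c(-50, 40, -30, 20, -10, 30)
loose <- 4000 * carat + c(-3000, 2500, -2000, 3000, -2500, 2000)

# R-squared is the share of price variation explained by carat
r2_tight <- summary(lm(tight ~ carat))$r.squared
r2_loose <- summary(lm(loose ~ carat))$r.squared
```

The tightly clustered prices give an R-squared close to 1, while the widely scattered prices give a clearly lower value even though both were built from the same price per carat.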

To obtain specific insights about the UAE, filter the data for the UAE and examine the fitted line for that subset. The current plot gives a general picture of the relationship between carat and price across the countries in the dataset, but it is not a UAE-specific analysis.
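A country-specific analysis could be sketched as follows. The data frame toy is a hypothetical stand-in for df3, and the label "UAE" is an assumption about how the country is spelled in the Country column. The sketch fits a straight line with lm(), which is a different model from the GAM smooth used in the plots:

```r
library(dplyr)

# Hypothetical stand-in for df3; real values and Country labels may differ
toy <- data.frame(
  carat = c(0.5, 1.0, 1.5, 2.0, 0.5, 1.0, 1.5),
  price = c(1000, 3000, 5000, 7000, 1200, 2500, 4100),
  Country = c("UAE", "UAE", "UAE", "UAE", "USA", "USA", "USA")
)

# Keep only the rows for the country of interest
uae <- filter(toy, Country == "UAE")

# Fit price as a linear function of carat for that subset
fit <- lm(price ~ carat, data = uae)

# Estimated price change per additional carat
uae_slope <- coef(fit)[["carat"]]
```

On the real data, the same filter followed by the ggplot call from the second visualization would draw the scatter plot and curve for that country alone.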