setwd("~/CST-425")

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
diamonds <- read.csv("~/CST-425/diamonds.csv")

#Visualization 1

The visualization below is a scatter plot of a random sample of 2,000 diamonds with a fitted linear regression line. Representing the data as a scatter plot can help with statistical analysis later on, because it lets an analyst see how price relates to the carat of the diamond. Here price is on the x-axis and carat is on the y-axis. An analyst can also form hypotheses from this type of visualization. For example, an analyst might say that the higher the carat of the diamond, the more expensive it will be. Looking at the plot, that seems to be true.

randomcarat <- diamonds %>%
  sample_n(2000)
           

ggplot(randomcarat, aes(x = price, y = carat)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
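
To put a number on the trend line, the same model can be fitted directly with `lm()`. The block below is only a sketch: it assumes the built-in `ggplot2::diamonds` data set is an acceptable stand-in for `diamonds.csv`, and the seed value and the names `sample_fit` and `fit` are hypothetical, introduced here for illustration.

```r
library(ggplot2)  # provides the built-in `diamonds` data set
library(dplyr)

set.seed(425)  # hypothetical seed, used only so the sample is repeatable

# Assumption: the built-in data stands in for diamonds.csv
sample_fit <- ggplot2::diamonds %>%
  sample_n(2000)

# Same formula geom_smooth(method = "lm") uses for aes(x = price, y = carat)
fit <- lm(carat ~ price, data = sample_fit)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of the variance in carat explained by price
```

A positive slope for `price` would support the reading of the plot, and the R-squared value gives a rough sense of how tightly the points follow the line.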

#Visualization 2

The histogram below is used to compare the distribution of price across cuts. The count on the y-axis is the default statistic that geom_histogram computes: the number of diamonds that fall into each price bin, here 200 wide. This type of visualization is useful when comparing the price of the diamond to the cut. It stacks the bars for each cut, each filled with that cut's respective color. The legend on the right-hand side of the graph shows which color belongs to which cut.

ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 200)
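
The counts behind the bars can also be tabulated by hand, which helps show where the y-axis comes from. This is a rough sketch rather than a copy of what `geom_histogram` does internally: it assumes the built-in `ggplot2::diamonds` data in place of `diamonds.csv`, uses `cut_width()` to approximate the binning, and the names `binned` and `price_bin` are hypothetical.

```r
library(ggplot2)  # cut_width() and the built-in `diamonds` data set
library(dplyr)

# Assign each diamond to a price bin 200 wide, then count rows per cut and bin
binned <- ggplot2::diamonds %>%
  mutate(price_bin = cut_width(price, width = 200)) %>%
  count(cut, price_bin, name = "n")

head(binned)
```

Each row of `binned` corresponds to one colored segment of a stacked bar, so the segment heights in a bar sum to that bin's total count.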

#Visualization 3

The graph below is an easier-to-read alternative to the histogram above. Instead of stacking bars, a frequency polygon draws one line per cut, so the distributions can be compared directly. Note that it uses a wider bin width (500) than the histogram (200). With this visualization you can form better hypotheses about the data before doing a statistical analysis of the data set.

ggplot(diamonds, aes(x = price, colour = cut)) + geom_freqpoly(binwidth = 500)
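
Because the cuts contain different numbers of diamonds, raw counts can hide differences in the shape of each distribution. One possible variation, sketched below, plots density instead of count so that the area under each line is the same. It assumes the built-in `ggplot2::diamonds` data in place of `diamonds.csv` and a ggplot2 version that provides `after_stat()` (3.3.0 or later). The name `p` is hypothetical.

```r
library(ggplot2)

# Same frequency polygon, but with density on the y-axis instead of count
p <- ggplot(ggplot2::diamonds, aes(x = price, colour = cut)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 500)

p
```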