Load libraries

descr: describes attributes of objects/variables

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(descr)

Examine the diamonds dataset variables using the class function

diamonds: a dataset containing the prices and other attributes of almost 54,000 diamonds.

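# diamonds is not attached yet, so the bare name cut resolves to base::cut()
class(cut)
## [1] "function"
# Apply class() to every column of diamonds
sapply(diamonds, class)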
## $carat
## [1] "numeric"
## 
## $cut
## [1] "ordered" "factor" 
## 
## $color
## [1] "ordered" "factor" 
## 
## $clarity
## [1] "ordered" "factor" 
## 
## $depth
## [1] "numeric"
## 
## $table
## [1] "numeric"
## 
## $price
## [1] "integer"
## 
## $x
## [1] "numeric"
## 
## $y
## [1] "numeric"
## 
## $z
## [1] "numeric"
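
sapply() returns a list above because the ordered factors carry two classes each. A minimal sketch of a more compact summary, assuming only the first class of each column is of interest (the name first_class is illustrative, not part of the original notebook):

```r
library(ggplot2)  # provides the diamonds dataset

# Keep only the first class of each column so the result is a
# named character vector rather than a list.
first_class <- vapply(diamonds, function(col) class(col)[1], character(1))
print(first_class)
```

vapply() is used instead of sapply() so that the return type is fixed to one character value per column.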

Frequency table from the descr package

freq() can also draw a plot of the frequencies, and does so by default.

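# attach() puts the columns of diamonds on the search path,
# so cut now refers to the column rather than base::cut()
attach(diamonds)
freq(cut, plot = TRUE)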

## cut 
##           Frequency Percent Cum Percent
## Fair           1610   2.985       2.985
## Good           4906   9.095      12.080
## Very Good     12082  22.399      34.479
## Premium       13791  25.567      60.046
## Ideal         21551  39.954     100.000
## Total         53940 100.000

Frequency table, alternative method

with(diamonds, freq(cut, plot = TRUE))

## cut 
##           Frequency Percent Cum Percent
## Fair           1610   2.985       2.985
## Good           4906   9.095      12.080
## Very Good     12082  22.399      34.479
## Premium       13791  25.567      60.046
## Ideal         21551  39.954     100.000
## Total         53940 100.000
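
The same frequency, percent and cumulative percent columns can probably be reproduced without descr. A minimal sketch using dplyr; the object name cut_freq and its column names are illustrative choices, not part of the original notebook:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count each level of cut, then derive percent and cumulative percent
cut_freq <- diamonds %>%
  count(cut, name = "frequency") %>%
  mutate(
    percent     = 100 * frequency / sum(frequency),
    cum_percent = cumsum(percent)
  )
print(cut_freq)
```

Unlike freq(), this returns an ordinary tibble, so the result can be filtered or plotted directly.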

Create a tibble object from the diamonds dataset

Tibble: a reworked version of the data frame that never changes the type of its inputs, allows list columns, allows non-standard variable names (names that start with a number or contain spaces), and never creates row names.

diam_tb <- as_tibble(diamonds)
class(diam_tb)
## [1] "tbl_df"     "tbl"        "data.frame"
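
A small sketch of the tibble properties described above, using a made-up two-row tibble (the name tb and its contents are purely illustrative):

```r
library(tibble)

tb <- tibble(
  `2 groups` = c("a", "b"),        # non-standard name: starts with a digit, contains a space
  values     = list(1:3, letters)  # list column: each cell holds a whole vector
)

class(tb$`2 groups`)  # strings stay character; they are not converted to factors
names(tb)             # the non-standard name is kept as written
```

Non-standard names have to be wrapped in backticks when they are referred to in code.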

Plot the bar chart using ggplot2

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) 

Better plot with percentage on y axis

# after_stat() replaces the older ..count.. notation (ggplot2 >= 3.3.0)
cut_plot <- ggplot(data = diamonds, aes(cut)) +
  geom_bar(mapping = aes(y = after_stat(count / sum(count)))) +
  theme_bw() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percent")

Exercise 1

Save the chart in PNG, PDF and SVG formats. What are the differences among these formats?

PNG is a raster format: it stores a fixed grid of pixels, so it loses sharpness when enlarged. PDF and SVG are vector formats, so they can be resized without loss of quality, which generally makes them better than PNG for resizing and other visual aspects. PDF files, however, are generally not supported in web design software and are thus generally not useful for logos. SVG files are equally well suited to resizing, but as a newer file type they can run into compatibility issues with older operating systems. SVG can be preferable to PDF because the files are often smaller; most of their other qualities are fairly similar.

png(filename = "PNG Example.png", width = 600, height = 350)
cut_plot
dev.off()
## quartz_off_screen 
##                 2
pdf(file="PDF Example.pdf")
cut_plot
dev.off()
## quartz_off_screen 
##                 2
svg(filename = "SVG Example.svg")
cut_plot
dev.off()
## quartz_off_screen 
##                 2
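
As an alternative to opening and closing graphics devices by hand, ggplot2's ggsave() picks the device from the file extension. A minimal sketch, assuming a temporary output directory and illustrative file names; SVG output through ggsave() additionally needs the svglite package, so it is left out here:

```r
library(ggplot2)

# A simple stand-in plot for illustration
p <- ggplot(diamonds, aes(cut)) + geom_bar()

out_dir  <- tempdir()  # assumed output location, not the notebook's own
png_path <- file.path(out_dir, "cut_plot.png")
pdf_path <- file.path(out_dir, "cut_plot.pdf")

# Width and height are in inches by default; dpi only affects raster output
ggsave(png_path, plot = p, width = 6, height = 3.5, dpi = 100)
ggsave(pdf_path, plot = p, width = 6, height = 3.5)

file.exists(c(png_path, pdf_path))
```

Passing the plot explicitly avoids relying on whichever plot happened to be displayed last.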

Plot another variable: carat

ggplot(data = diam_tb) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Exercise 2

Can you improve the visualization of this chart?

I think a density plot gives a cleaner impression of the trends in the data, because the data at the edges of the histogram are too sparse to read easily.

What is the difference between a bar chart and a histogram?

A bar chart shows the number of observations at each value of a categorical or factor variable that has a limited number of possible values. A histogram is used for continuous data: the values are put into bins, and each bar shows the frequency of values within that bin's range.

ggplot(data = diam_tb) +
  geom_density(mapping = aes(x = carat), fill = "pink", alpha = 0.5)
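
The binning that a histogram performs can be made explicit by counting the bins by hand. A rough sketch using ggplot2's cut_width() with the same 0.5 width as the histogram above (the names carat_bins and bin are illustrative):

```r
library(dplyr)
library(ggplot2)  # provides diamonds and cut_width()

# Assign each carat value to a bin of width 0.5, then count per bin
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, width = 0.5))
print(carat_bins)
```

Each row corresponds to one range of carat values, which is what distinguishes this from counting the levels of a factor such as cut.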

Subsetting the dataset

smaller <- diamonds %>% 
  filter(carat < 3)

Histogram of smaller dataset

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

Frequency polygon

ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1) + theme_bw() 

Exercise 3

Can you change colors? Hint: use the RColorBrewer package

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "pink")
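
To follow the hint more directly, a ColorBrewer palette can be applied to the frequency polygon through ggplot2's scale_colour_brewer(). A minimal sketch; the choice of the "Dark2" palette and the name brewer_plot are assumptions made for illustration:

```r
library(dplyr)
library(ggplot2)

smaller <- diamonds %>%
  filter(carat < 3)

# One line per level of cut, coloured with a ColorBrewer palette
brewer_plot <- ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1) +
  scale_colour_brewer(palette = "Dark2") +
  theme_bw()
print(brewer_plot)
```

RColorBrewer::display.brewer.all() shows the available palettes if a different one is preferred.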

Continuous variables

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

Exercise 4

What method would you use to analyze each of the following?

Dependent variable: continuous, independent variable: discrete. A bar plot, with a separate plot for each level of the discrete variable.

Dependent variable: discrete, independent variable: continuous. A scatterplot, with the levels of the dependent variable on the y-axis.

Dependent variable: continuous, independent variable: continuous. A regression on the continuous variables.
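
For the last case in Exercise 4, where both variables are continuous, a simple linear regression on the faithful data plotted earlier might look like the sketch below. Treating waiting as the dependent variable is an assumption made for illustration:

```r
library(ggplot2)

# Fit waiting time as a linear function of eruption duration
fit <- lm(waiting ~ eruptions, data = faithful)
coef(fit)

# Overlay the fitted line on the scatterplot
fit_plot <- ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(fit_plot)
```

summary(fit) gives standard errors and R-squared if more than the coefficients is needed.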