Load libraries
descr: describe attributes of objects/variables
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(descr)
Examine the diamonds dataset variables using the class function
diamonds: a dataset containing the prices and other attributes of almost 54,000 diamonds.
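# A bare `cut` resolves to base R's cut() function at this point, so refer to the column explicitly
class(diamonds$cut)
## [1] "ordered" "factor"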
sapply(diamonds,class)
## $carat
## [1] "numeric"
##
## $cut
## [1] "ordered" "factor"
##
## $color
## [1] "ordered" "factor"
##
## $clarity
## [1] "ordered" "factor"
##
## $depth
## [1] "numeric"
##
## $table
## [1] "numeric"
##
## $price
## [1] "integer"
##
## $x
## [1] "numeric"
##
## $y
## [1] "numeric"
##
## $z
## [1] "numeric"
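Because class() can return more than one value (the ordered factors report both "ordered" and "factor"), sapply() returns a list here rather than a vector. The following is a minimal sketch, assuming only the diamonds data shipped with ggplot2, of collapsing the classes into one string per column and inspecting the levels of an ordered factor. The name `col_classes` is introduced purely for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Collapse multi-valued classes into a single string per column,
# so the result is a named character vector instead of a list.
col_classes <- vapply(
  diamonds,
  function(col) paste(class(col), collapse = "/"),
  character(1)
)
col_classes

# For an ordered factor, levels() lists the categories from lowest to highest.
levels(diamonds$cut)
is.ordered(diamonds$cut)
```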
Frequency table from descr package
freq() can also draw a bar plot of the frequencies, which it does by default.
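# attach() puts the columns of diamonds on the search path, so `cut` now refers to diamonds$cut
attach(diamonds)
freq(cut, plot = TRUE)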
## cut
##           Frequency Percent Cum Percent
## Fair           1610   2.985       2.985
## Good           4906   9.095      12.080
## Very Good     12082  22.399      34.479
## Premium       13791  25.567      60.046
## Ideal         21551  39.954     100.000
## Total         53940 100.000
Frequency table, alternative method (with() evaluates the call inside diamonds, so attach() is not needed)
with(diamonds, freq(cut, plot = TRUE))
## cut
##           Frequency Percent Cum Percent
## Fair           1610   2.985       2.985
## Good           4906   9.095      12.080
## Very Good     12082  22.399      34.479
## Premium       13791  25.567      60.046
## Ideal         21551  39.954     100.000
## Total         53940 100.000
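The same frequency, percent and cumulative percent columns can be reproduced without descr. This is a rough sketch in base R, not what freq() does internally; the object names (`counts`, `percent`, `freq_tab`) are made up for the example.

```r
library(ggplot2)  # provides the diamonds dataset

counts  <- table(diamonds$cut)       # frequency of each level
percent <- 100 * prop.table(counts)  # share of each level, in percent

# Assemble the three columns that freq() reports.
freq_tab <- data.frame(
  cut         = names(counts),
  frequency   = as.vector(counts),
  percent     = round(as.vector(percent), 3),
  cum_percent = round(cumsum(as.vector(percent)), 3)
)
freq_tab
```

Referring to the column as diamonds$cut keeps the example independent of attach().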
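Create tibble object out of diamonds dataset
Tibble: a modern take on the data frame. It never changes the type of its inputs (for example, strings are not converted to factors), its columns can be lists, its variable names can be non-standard (they may start with a number or contain spaces), and it never creates row names. Note that diamonds, as shipped with ggplot2, is already a tibble, so as_tibble() returns it unchanged here.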
diam_tb <- as_tibble(diamonds)
class(diam_tb)
## [1] "tbl_df" "tbl" "data.frame"
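A few of the tibble properties listed above can be checked directly. The sketch below uses a small made-up tibble (`tb`) and data frame (`df`) rather than diamonds, since diamonds has only standard column names.

```r
library(tibble)

# Non-standard column names and list columns are allowed in a tibble.
tb <- tibble(
  `2 value` = c("a", "b"),        # name starts with a digit and contains a space
  items     = list(1:3, "text")   # list column
)
tb

# data.frame() rewrites the same name by default (check.names = TRUE).
df <- data.frame(`2 value` = c("a", "b"))
names(df)

# Character input stays character in the tibble.
class(tb$`2 value`)
```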
Plot the bar chart using ggplot2
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
Better plot with percentage on y axis
# after_stat() replaces the older ..count.. notation (ggplot2 >= 3.3.0)
cut_plot <- ggplot(data = diamonds, aes(cut)) +
  geom_bar(mapping = aes(y = after_stat(count / sum(count)))) +
  theme_bw() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percent")
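An alternative to computing the percentage inside the plot is to compute it first and plot the result. This is a sketch using dplyr, with `cut_share` and `cut_share_plot` as illustrative names; it should give the same picture as the chart above.

```r
library(dplyr)
library(ggplot2)

# Compute the share of each cut explicitly.
cut_share <- diamonds %>%
  count(cut) %>%
  mutate(share = n / sum(n))
cut_share

# geom_col() plots the y values as given, without counting rows.
cut_share_plot <- ggplot(cut_share, aes(x = cut, y = share)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  ylab("Percent") +
  theme_bw()
cut_share_plot
```

Having the shares in a table makes it easy to check them against the frequency table before plotting.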
Exercise 1
Save the chart into PNG, PDF and SVG formats. What are the differences among these formats?
PNG is a raster format: the image is stored as a fixed grid of pixels, so it loses sharpness when enlarged. PDF and SVG are vector formats, so they can be resized without loss of quality, which generally makes them better than PNG for resizing and other visual aspects. PDF suits print and standalone documents, but it is usually not accepted as an image in web design software, which limits its use for things like logos on web pages. SVG was designed for the web and is supported by current browsers, although it can run into compatibility issues with older software. For simple graphics an SVG file is often smaller than the equivalent PDF; most of the other qualities are fairly similar.
png(filename = "PNG Example.png",
    width = 600, height = 350)
print(cut_plot)  # explicit print() also works when the code is run via source()
dev.off()
## quartz_off_screen
## 2
pdf(file="PDF Example.pdf")
print(cut_plot)
dev.off()
## quartz_off_screen
## 2
svg(filename = "SVG Example.svg")
print(cut_plot)
dev.off()
## quartz_off_screen
## 2
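ggplot2 also provides ggsave(), which opens and closes the graphics device itself and picks the format from the file extension. The sketch below is one possible alternative to the device calls above. The plot `p`, the temporary output directory, and the file names are assumptions made for the example.

```r
library(ggplot2)

# Hypothetical plot object standing in for cut_plot.
p <- ggplot(diamonds, aes(x = cut)) + geom_bar()

out_dir  <- tempdir()  # assumption: write to a temporary directory
png_path <- file.path(out_dir, "cut_plot.png")
pdf_path <- file.path(out_dir, "cut_plot.pdf")

# Width and height are in inches by default.
ggsave(png_path, plot = p, width = 6, height = 3.5, dpi = 100)  # raster: 600 x 350 pixels
ggsave(pdf_path, plot = p, width = 6, height = 3.5)              # vector

file.exists(c(png_path, pdf_path))
```

Saving to an .svg file with ggsave() additionally requires the svglite package, so it is left out of the sketch.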
Plot another variable: carat
ggplot(data = diam_tb) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Exercise 2
Can you improve the visualization of this chart?
I think a density plot gives a cleaner impression of the trends in the data, because the data at the edges of the histogram are too sparse to read easily.
What is the difference between a bar chart and a histogram?
A bar chart shows the number of observations at each value of a categorical (factor) variable that has a limited number of possible values. A histogram is used for continuous data: the values are grouped into bins, and each bar shows the frequency of values within that range.
ggplot(data = diam_tb) +
  geom_density(mapping = aes(x = carat), fill = "pink", alpha = 0.5)
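The bar chart versus histogram distinction can be made concrete by computing the counts behind each chart by hand. This is a small sketch with dplyr and ggplot2's cut_width(); `bar_counts` and `hist_counts` are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Bar chart: one count per level of a categorical variable.
bar_counts <- diamonds %>% count(cut)

# Histogram: the continuous variable is first divided into bins of width 0.5,
# and the observations in each bin are then counted.
hist_counts <- diamonds %>% count(bin = cut_width(carat, 0.5))

bar_counts
hist_counts
```

The sparse bins at the high end of `hist_counts` are the ones that are hard to read in the histogram.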
Subsetting the dataset
smaller <- diamonds %>%
  filter(carat < 3)
Histogram of smaller dataset
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
Frequency polygon
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1) +
  theme_bw()
Exercise 3
Can you change colors? Hint: use the RColorBrewer package
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "pink")
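Following the hint, a ColorBrewer palette can be applied through ggplot2's scale_colour_brewer(), which draws on RColorBrewer. The sketch below recolors the frequency polygon by cut. The palette "Set1" is an arbitrary choice for illustration, and `brewer_plot` is a made-up name.

```r
library(dplyr)
library(ggplot2)

smaller <- diamonds %>%
  filter(carat < 3)

# scale_colour_brewer() maps the levels of cut to a ColorBrewer palette.
brewer_plot <- ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1) +
  scale_colour_brewer(palette = "Set1") +  # assumption: any qualitative palette would do
  theme_bw()
brewer_plot

# The hex codes behind the palette can be inspected directly.
RColorBrewer::brewer.pal(5, "Set1")
```

For a single-colored histogram, one of these hex codes could be passed as the fill argument in place of "pink".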
Continuous variables
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))
Exercise 4
What method would you use to analyze each of the following cases?
Dependent variable: continuous, independent variable: discrete. Bar plot, with a separate plot for each level of the discrete variable.
Dependent variable: discrete, independent variable: continuous. Scatterplot with the levels of the dependent variable on the y-axis.
Dependent variable: continuous, independent variable: continuous. Regression with continuous variables.
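The three answers can be sketched with the datasets already used in this document. The variable pairings (price by cut, cut against carat, waiting on eruptions) are chosen only for illustration, and the first plot is one possible reading of "a separate plot for each level".

```r
library(ggplot2)

# Continuous dependent, discrete independent:
# one panel of price counts per level of cut.
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  facet_wrap(~ cut)

# Discrete dependent, continuous independent:
# levels of cut on the y-axis, carat on the x-axis.
ggplot(diamonds, aes(x = carat, y = cut)) +
  geom_point(alpha = 0.1)

# Continuous dependent and independent: simple linear regression.
fit <- lm(waiting ~ eruptions, data = faithful)
coef(fit)
```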