The Iris dataset

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

data(iris)
iris <- as_tibble(iris)
iris

# what the code does

# The above code uses the tidyverse to work with the iris dataset. The first line loads the tidyverse, a collection of packages for manipulating and visualizing data; the startup message lists the attached packages and the base R functions they mask. The second line uses the data() function to load the built-in iris dataset into the workspace. The third line converts the data frame into a tibble. Finally, the fourth line prints the tibble to the screen.

print(iris, n = 3, width = Inf)

## # A tibble: 150 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## # … with 147 more rows

# what the code does

# The above line of code prints the iris tibble with two arguments, n and width. The n argument specifies how many rows are printed, and the width argument specifies the maximum width of the output. Here n is set to 3 and width is set to Inf, so the first three rows are shown and all columns are printed regardless of the console width. The header also reports the full size of the tibble (150 rows and 5 columns) and the type of each column.

Data Quality

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# what the code does

# The above code generates a summary of the iris dataset. For each numeric column, summary() reports the minimum, first quartile, median, mean, third quartile, and maximum. For the factor column Species, it reports the number of observations at each level (50 per species). The output is a table that provides a quick overview of the data.

iris %>% summarize_if(is.numeric, mean)

# what the code does


# The above code uses the summarize_if() function to calculate the mean of all numeric columns in the iris dataset. The function takes two arguments: a predicate function that is applied to each column, and a summary function. Here the predicate is is.numeric, which returns TRUE for numeric columns, and the summary function is mean, which is applied to every column that passes the predicate. The result is a one-row table holding the mean of each of the four numeric columns; the Species column is dropped because it is a factor.
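
# In current dplyr releases, summarize_if() is superseded by across(). The following is a minimal sketch of the same computation, assuming dplyr 1.0.0 or later is installed; the name numeric_means is only illustrative.

```r
library(dplyr)

# Mean of every numeric column: where(is.numeric) selects the columns,
# and mean is applied to each selected column.
numeric_means <- iris %>%
  summarize(across(where(is.numeric), mean))

numeric_means
```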

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# what the code does

# The above code loads the GGally library. GGally is an extension of the ggplot2 library for R. It provides functions that combine several plots into one figure, including scatterplot matrices (ggpairs), correlation plots, and parallel coordinate plots. The message printed on load reports that GGally overrides the `+` method for ggplot2 objects.

ggpairs(iris, aes(color = Species))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what the code does


# The above code uses the ggpairs() function from the GGally package (which builds on ggplot2) to create a pairs plot. This type of plot visualizes the relationships between all pairs of variables in a dataset. The points are colored by the Species column, so each species has its own color. For pairs of numeric variables, the lower panels show scatterplots, the diagonal shows the density of each variable, and the upper panels show correlation coefficients. Panels that involve the Species factor show box plots and histograms; the histograms are the source of the `stat_bin()` messages. The plot makes it easy to see how the three species differ across the variables.

clean.data <- iris %>% drop_na() %>% unique()
summary(clean.data)

##   Sepal.Length    Sepal.Width    Petal.Length    Petal.Width          Species  
##  Min.   :4.300   Min.   :2.00   Min.   :1.000   Min.   :0.100   setosa    :50  
##  1st Qu.:5.100   1st Qu.:2.80   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
##  Median :5.800   Median :3.00   Median :4.300   Median :1.300   virginica :49  
##  Mean   :5.844   Mean   :3.06   Mean   :3.749   Mean   :1.195                  
##  3rd Qu.:6.400   3rd Qu.:3.30   3rd Qu.:5.100   3rd Qu.:1.800                  
##  Max.   :7.900   Max.   :4.40   Max.   :6.900   Max.   :2.500

# what the code does

# The above code first removes any rows with missing values from the iris dataset using the drop_na() function. Then it uses the unique() function to remove duplicate rows. Lastly, it summarizes the cleaned data using the summary() function. Comparing the two summaries shows that one duplicate row was removed: virginica now has 49 observations instead of 50, and the means and medians shift slightly.
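
# Before cleaning, it can help to check how many rows are affected. The following is a small sketch in base R; the names n_missing and n_duplicated are only illustrative.

```r
# Count missing values and fully duplicated rows in the original data.
n_missing    <- sum(is.na(iris))
n_duplicated <- sum(duplicated(iris))

n_missing
n_duplicated

# Show the rows that unique() would remove.
iris[duplicated(iris), ]
```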

Aggregation

iris %>% group_by(Species) %>% summarize_all(mean)

# what the code does

# The above code uses the group_by() and summarize_all() functions from the dplyr package to compute the mean of each variable in the iris dataset for each species. First, the group_by() function is used to group the data by the Species column, so that the mean for each variable can be calculated for each species. Then, the summarize_all() function is used to calculate the mean for each variable. The result is a new data frame with mean values for each variable in the iris dataset, grouped by species.

iris %>% group_by(Species) %>% summarize_all(median)

# what the code does

# The above code groups the iris dataset by species and then summarizes every remaining variable with the median. The result is a table with one row per species and the median of each variable in the columns.
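
# The two aggregations above can also be combined into one table. This is a sketch that assumes dplyr 1.0.0 or later and uses across() instead of summarize_all(); the name species_stats is only illustrative.

```r
library(dplyr)

# One row per species, with a mean and a median column for each numeric
# variable (columns are named like Sepal.Length_mean, Sepal.Length_median).
species_stats <- iris %>%
  group_by(Species) %>%
  summarize(across(where(is.numeric), list(mean = mean, median = median)))

species_stats
```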

Sampling

Random Sampling

sample(c("A", "B", "C"), size = 10, replace = TRUE)

##  [1] "A" "C" "B" "B" "C" "A" "C" "C" "B" "A"

# what the code does


# The above code is used to generate a random sample of size 10 from the set of elements "A", "B", and "C". The argument size = 10 indicates that 10 elements should be drawn from the set. The argument replace = TRUE indicates that the same element can be sampled more than once. The code will generate a random sample of 10 elements from the set, potentially with some elements repeated.

take <- sample(seq(nrow(iris)), size = 15)
take

##  [1] 140 124  36  82 137  66 142  39 128 143 126  81  28  71  54

# what the code does

# The above code draws a random sample of row numbers from the iris dataset. The nrow() function counts the rows in the dataset (150), and seq() turns that count into the sequence 1, 2, ..., 150. The sample() function then draws 15 values from this sequence. Because replace is not set, sampling is done without replacement, so no row number appears twice. The sampled row numbers are stored in the take vector.

iris[take, ]

# what the code does

# The above code uses the take vector created earlier to select a subset of the iris dataset. Each integer in take is a row number. The expression iris[take, ] selects the rows listed in take, and the empty position after the comma selects all columns. The result is a tibble with the 15 sampled rows.
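
# Because sample() is random, the selected rows change on every run unless the random seed is fixed. The following sketch shows how set.seed() makes the sample reproducible; the seed value and the names take_a, take_b, and sampled_rows are arbitrary choices for illustration.

```r
# Fixing the seed before sampling gives the same row numbers each time.
set.seed(1000)
take_a <- sample(seq_len(nrow(iris)), size = 15)

set.seed(1000)
take_b <- sample(seq_len(nrow(iris)), size = 15)

# Both draws are identical because they start from the same seed.
identical(take_a, take_b)

# Select the sampled rows and all columns.
sampled_rows <- iris[take_a, ]
sampled_rows
```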

set.seed(1000)

s <- iris %>% slice_sample(n = 15)
ggpairs(s, aes(color = Species))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what the code does

# The above code first calls set.seed(1000) to fix the random number generator, so that the same rows are sampled every time the code is run. It then uses the slice_sample() function to draw 15 random rows from the iris dataset and passes them to the ggpairs() function, which creates a scatterplot matrix of the sampled data. The color of each point is based on the Species of the row.

Stratified Sampling

library(sampling)
id2 <- strata(iris, stratanames = "Species", size = c(5,5,5), method = "srswor")
id2

# what the code does

# This code uses the sampling library to draw a stratified sample. The dataset is iris and the strata are defined by the Species column. The size argument requests 5 observations from each of the three strata. The method "srswor" stands for simple random sampling without replacement: within each stratum, 5 rows are selected at random and no row is selected more than once. The output is a data frame with one row per selected observation, containing the stratum variable (Species), the row number in the original data (ID_unit), the inclusion probability (Prob), and the stratum number (Stratum).
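
# A similar stratified sample can be drawn with dplyr alone. This is a sketch of an alternative, not the method used above: it assumes dplyr 1.0.0 or later, and the seed and the name stratified are only illustrative.

```r
library(dplyr)

set.seed(1000)  # arbitrary seed so the sample is reproducible

# Group by the stratum variable, then sample 5 rows within each group
# (without replacement, which is the default for slice_sample()).
stratified <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Check that every species contributes exactly 5 rows.
count(stratified, Species)
```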

s2 <- iris %>% slice(id2$ID_unit)
ggpairs(s2, aes(color = Species))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what the code does


# The above code uses the 'slice' function from the 'dplyr' package to select the rows of the iris dataset whose row numbers are stored in 'id2$ID_unit', which gives the stratified sample. The 'ggpairs' function from the 'GGally' package then creates a scatterplot matrix of this subset. The 'Species' variable is mapped to the color aesthetic in order to visualize the differences between the species.

Features

Dimensionality Reduction

Principal Components Analysis (PCA)

# library(plotly) # I don't load the package because its namespace clashes with select in dplyr.
plotly::plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Length, z = ~Sepal.Width,
  size = ~Petal.Width, color = ~Species, type="scatter3d")

## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

## Warning: `line.width` does not currently support multiple values.

# what the code does

# The above code uses the plotly package to create a 3D scatterplot of the iris dataset. The code first specifies the x, y, and z axes as the Sepal.Length, Petal.Length, and Sepal.Width columns of the iris dataset, respectively. It then specifies the size of the points on the plot to be determined by the Petal.Width column of the dataset, and the color of the points to be determined by the Species column. Finally, the type is set to "scatter3d" so that the plot is a 3D scatterplot. The code then creates the 3D scatterplot and displays it on the screen.

pc <- iris %>% select(-Species) %>% as.matrix() %>% prcomp()
summary(pc)

## Importance of components:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000

# what the code does


# The code above performs a Principal Component Analysis (PCA) on the iris dataset. PCA reduces the dimensionality of the data by transforming it into a new set of uncorrelated variables (the principal components), ordered so that the first components explain as much of the variance as possible.
# 
# The code begins by selecting all variables from the iris dataset except for the Species variable and converting this subset into a matrix. This matrix is then passed to the prcomp() function, which performs the PCA. By default prcomp() centers the variables but does not scale them. The summary() function then reports, for each principal component, its standard deviation, the proportion of variance it explains, and the cumulative proportion. Here the first component alone explains about 92% of the variance, and the first two together explain about 98%.
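
# The proportions in the summary can be reproduced from the standard deviations stored in the prcomp object. The following is a small sketch in base R; the names variances, prop_var, and cum_var are only illustrative.

```r
# Run the same PCA using base R only.
pc <- prcomp(as.matrix(iris[, 1:4]))

# The variance of each component is its squared standard deviation.
variances <- pc$sdev^2

# Share of the total variance per component, and the running total.
prop_var <- variances / sum(variances)
cum_var  <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)
```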

plot(pc, type = "line")

# what the code does

# The above code creates a scree plot of the PCA result. The 'pc' object is the output of prcomp(), and calling the 'plot' function on it plots the variance of each principal component. Setting the type to "line" draws the variances as a connected line instead of bars. A sharp drop after the first component shows that most of the variance is captured by PC1.

str(pc)

## List of 5
##  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
##  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
##   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
##  $ scale   : logi FALSE
##  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"

# what the code does

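# The code above uses the 'str()' function to display the internal structure of the 'pc' object. The object is a list of class "prcomp" with five elements: 'sdev' holds the standard deviations of the four principal components, 'rotation' is the 4 x 4 matrix of loadings that maps the original variables to the components, 'center' holds the column means that were subtracted before the analysis, 'scale' is FALSE because the variables were not scaled, and 'x' is the 150 x 4 matrix of the observations projected onto the principal components.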
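
# The elements of a prcomp object fit together in a simple way: the projected data is the centered data multiplied by the rotation matrix. The following sketch checks this in base R; the names X, centered, and scores are only illustrative.

```r
# Recreate the PCA on the four numeric columns.
X  <- as.matrix(iris[, 1:4])
pc <- prcomp(X)

# Subtract the stored column means from each column.
centered <- sweep(X, 2, pc$center)

# Project the centered data onto the principal components.
scores <- centered %*% pc$rotation

# The result matches the scores stored in the prcomp object.
all.equal(scores, pc$x, check.attributes = FALSE)
```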

iris_projected <- as_tibble(pc$x) %>% add_column(Species = iris$Species)
ggplot(iris_projected, aes(x = PC1, y = PC2, color = Species)) + 
  geom_point()

# what the code does

# The above code plots the iris data in the space of the first two principal components. The PCA itself was computed earlier with prcomp(); its result 'pc' contains the element 'x', which holds the projection of all 150 observations onto the four principal components.
# 
# First, the code creates a tibble from the projected data by using the as_tibble() function on pc$x, and then adds the Species column from the original iris dataset with add_column(). The resulting tibble has the columns PC1 to PC4 and Species.
# 
# Finally, the code creates a scatter plot using ggplot() by mapping the PC1 and PC2 variables to the x- and y-axes respectively, and mapping the Species variable to the color. The result is a scatter plot showing the three species of iris flowers (setosa, versicolor, and virginica) in two dimensions.

ggplot(iris_projected, 
  aes(x = PC1, y = 0, color = Species)) + 
  geom_point() +
  scale_y_continuous(expand=c(0,0)) +
  theme(axis.text.y = element_blank(),
      axis.title.y = element_blank()
  )

# what the code does

# The above code creates a scatterplot using the iris_projected dataset, which contains the Iris data projected onto its principal components. The x axis is set to the PC1 variable and the y axis is set to the constant 0, so all points lie on a single horizontal line. The color aesthetic is set to Species, so that the points are colored according to their species. The y axis is given a continuous scale with no expansion in either direction. Finally, the text and title on the y axis are removed. This creates a one-dimensional plot that shows how well the first principal component alone separates the three species of iris.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

# what the code does
# 
# The above code loads the factoextra package (it does not install it; the package must already be installed). The factoextra package provides functions to extract and visualize the results of multivariate data analyses such as PCA.

fviz_pca(pc)

# what the code does

# When the above code is run, it creates a visualization of the PCA result using the fviz_pca function from the factoextra package. This function takes the principal component object and draws a biplot: the observations are shown as points in the plane of the first two principal components, and the original variables are shown as arrows in the same plane. The axis labels report the percentage of variance explained by each of the two components.

fviz_pca_var(pc)

# what the code does

# The above code uses the fviz_pca_var function from the factoextra package to visualize only the variables of the PCA. Each original variable (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) is drawn as an arrow in the plane of the first two principal components. The direction and length of an arrow indicate how strongly the variable contributes to each of these components.

d <- iris %>% select(-Species) %>% dist()

# what the code does


# The above code begins by taking the iris dataset and removing the Species column. This is done using the select function, which takes the argument -Species to indicate that the column should be removed. After this, the dist function calculates the Euclidean distance (its default method) between each pair of observations. The result is an object of class "dist" that stores the lower triangle of the distance matrix, which is saved in the object d.
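
# To make the content of the distance object concrete, the following sketch compares one entry with a manual calculation. It uses base R only; the names X and manual are illustrative.

```r
# Compute the pairwise Euclidean distances on the four numeric columns.
X <- as.matrix(iris[, 1:4])
d <- dist(X)

# A dist object stores one value per pair of rows: n * (n - 1) / 2.
length(d)

# Euclidean distance between the first two flowers, computed by hand.
manual <- sqrt(sum((X[1, ] - X[2, ])^2))
manual

# The same value taken from the full distance matrix.
as.matrix(d)[1, 2]
```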

fit <- cmdscale(d, k = 2)
colnames(fit) <- c("comp1", "comp2")
fit <- as_tibble(fit) %>% add_column(Species = iris$Species)

ggplot(fit, aes(x = comp1, y = comp2, color = Species)) + geom_point()

# what the code does

# The above code uses the cmdscale() function to perform classical multidimensional scaling (MDS) of the iris dataset. The function takes two arguments: the distance object (d) and the number of desired dimensions (k). It finds coordinates in k dimensions whose pairwise distances approximate the distances in d. Here k is set to 2, and the two resulting columns are named comp1 and comp2. The result is then converted into a tibble and the Species column from the iris dataset is added. Finally, a scatter plot is created with comp1 on the x axis, comp2 on the y axis, and the color of the points based on the species.
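
# When the distances are Euclidean, classical MDS gives the same coordinates as the PCA scores, up to the sign of each axis. The following sketch checks this for the iris data using base R; the names mds, pca, and max_diff are only illustrative.

```r
X <- as.matrix(iris[, 1:4])

# Classical MDS on Euclidean distances, keeping two dimensions.
mds <- cmdscale(dist(X), k = 2)

# First two principal component scores of the same data.
pca <- prcomp(X)$x[, 1:2]

# Compare absolute values, because each axis may be flipped in sign.
max_diff <- max(abs(abs(mds) - abs(pca)))
max_diff
```

# A value of max_diff close to zero indicates that the MDS plot and the PCA plot show the same configuration of points, possibly mirrored.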

Feature selection

Discrete Features

ggplot(iris, aes(x = Petal.Width)) + geom_histogram(binwidth = .2)

# what the code does

# The above code creates a histogram using the ggplot2 package. It uses the iris dataset as the data source and maps Petal.Width to the x axis. The binwidth is set to .2, so each bar covers a range of 0.2. The resulting histogram displays how often the different petal widths occur in the iris dataset.

iris %>% pull(Sepal.Width) %>% cut(breaks = 3)

##   [1] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6]
##   [8] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [15] (3.6,4.4] (3.6,4.4] (3.6,4.4] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
##  [22] (3.6,4.4] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [29] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (3.6,4.4] (3.6,4.4] (2.8,3.6]
##  [36] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]  
##  [43] (2.8,3.6] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4] (2.8,3.6] (3.6,4.4]
##  [50] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [57] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]   (2.8,3.6] (2,2.8]  
##  [64] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]  
##  [71] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2,2.8]  
##  [78] (2.8,3.6] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2,2.8]   (2,2.8]  
##  [85] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]  
##  [92] (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
##  [99] (2,2.8]   (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [106] (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2,2.8]  
## [113] (2.8,3.6] (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (3.6,4.4] (2,2.8]  
## [120] (2,2.8]   (2.8,3.6] (2,2.8]   (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6]
## [127] (2,2.8]   (2.8,3.6] (2,2.8]   (2.8,3.6] (2,2.8]   (3.6,4.4] (2,2.8]  
## [134] (2,2.8]   (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## [141] (2.8,3.6] (2.8,3.6] (2,2.8]   (2.8,3.6] (2.8,3.6] (2.8,3.6] (2,2.8]  
## [148] (2.8,3.6] (2.8,3.6] (2.8,3.6]
## Levels: (2,2.8] (2.8,3.6] (3.6,4.4]

# what the code does

# The above code uses the pipe operator to take the iris dataset and extract the column "Sepal.Width" as a vector. This vector is then passed to the cut() function, which with breaks = 3 divides the range of the values into three intervals of equal width. The cut() function returns a factor that indicates which interval each value belongs to. This is a simple way to discretize a continuous feature.
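
# The interval labels produced by cut() can be replaced with more readable names, and table() shows how many values fall into each interval. This is a small sketch in base R; the labels "narrow", "medium", and "wide" are made up for illustration.

```r
widths <- iris$Sepal.Width

# Three equal-width intervals with custom labels instead of "(2,2.8]" etc.
bins <- cut(widths, breaks = 3, labels = c("narrow", "medium", "wide"))

# Number of flowers in each interval.
table(bins)
```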

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

# what the code does

# The above code loads the arules library, an R package for mining association rules and frequent itemsets from transaction data. It is loaded here because it also provides the discretize() function used below. The messages show that loading arules and its dependency Matrix masks several functions from tidyr, dplyr, and base R.

iris %>% pull(Petal.Width) %>% discretize(method = "interval", breaks = 3)

##   [1] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##   [8] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [15] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [22] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [29] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [36] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [43] [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9) [0.1,0.9)
##  [50] [0.1,0.9) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [57] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [64] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [71] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [78] [1.7,2.5] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [85] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [92] [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7) [0.9,1.7)
##  [99] [0.9,1.7) [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [106] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [113] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [120] [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [127] [1.7,2.5] [1.7,2.5] [1.7,2.5] [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [134] [0.9,1.7) [0.9,1.7) [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [141] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## [148] [1.7,2.5] [1.7,2.5] [1.7,2.5]
## attr(,"discretized:breaks")
## [1] 0.1 0.9 1.7 2.5
## attr(,"discretized:method")
## [1] interval
## Levels: [0.1,0.9) [0.9,1.7) [1.7,2.5]

# what the code does

# The above code uses the 'pull' function to extract the 'Petal.Width' column from the 'iris' dataset. Next, the 'discretize' function from arules converts the continuous values into discrete intervals. The 'method' argument is set to "interval" and the 'breaks' argument is set to 3, which divides the range of the data into 3 intervals of equal width. The output shows the interval assigned to each value, followed by the boundaries that were used: 0.1, 0.9, 1.7, and 2.5.
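
# The equal-width boundaries can be reproduced in base R by spacing them evenly between the minimum and the maximum. This is a sketch of the idea rather than the arules implementation; the names interval_cuts and interval_bins are only illustrative.

```r
x <- iris$Petal.Width

# Four evenly spaced boundaries define three equal-width intervals.
interval_cuts <- seq(min(x), max(x), length.out = 4)
interval_cuts

# Assign each value to an interval; intervals are closed on the left,
# and include.lowest = TRUE keeps the maximum in the last interval.
interval_bins <- cut(x, breaks = interval_cuts,
                     include.lowest = TRUE, right = FALSE)
table(interval_bins)
```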

iris %>% pull(Petal.Width) %>% discretize(method = "frequency", breaks = 3)

##   [1] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##   [7] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [13] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [19] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [25] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [31] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [37] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [43] [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867) [0.1,0.867)
##  [49] [0.1,0.867) [0.1,0.867) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [55] [0.867,1.6) [0.867,1.6) [1.6,2.5]   [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [61] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [67] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]   [0.867,1.6)
##  [73] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]  
##  [79] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]  
##  [85] [0.867,1.6) [1.6,2.5]   [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [91] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6)
##  [97] [0.867,1.6) [0.867,1.6) [0.867,1.6) [0.867,1.6) [1.6,2.5]   [1.6,2.5]  
## [103] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [109] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [115] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [0.867,1.6)
## [121] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [127] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [133] [1.6,2.5]   [0.867,1.6) [0.867,1.6) [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [139] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## [145] [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]   [1.6,2.5]  
## attr(,"discretized:breaks")
## [1] 0.1000000 0.8666667 1.6000000 2.5000000
## attr(,"discretized:method")
## [1] frequency
## Levels: [0.1,0.867) [0.867,1.6) [1.6,2.5]

# what the code does

# The code above uses the pull() function from dplyr and the discretize() function from arules to perform a frequency-based discretization of the Petal.Width column of the iris dataset. First, the 'pull' function extracts the Petal.Width data from the iris dataset. This data is then passed to the 'discretize' function, which uses the "frequency" method to divide the data into three intervals that contain approximately the same number of values. The output is a factor with an interval label for each value in the Petal.Width column, together with the boundaries that were used.
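
# Equal-frequency boundaries correspond to quantiles of the data. The following base R sketch computes them with quantile(); it illustrates the idea and is not taken from the arules source. The names freq_cuts and freq_bins are only illustrative.

```r
x <- iris$Petal.Width

# The 0%, 33.3%, 66.7%, and 100% quantiles split the data into thirds.
freq_cuts <- quantile(x, probs = seq(0, 1, length.out = 4))
freq_cuts

# Assign each value to one of the three intervals.
freq_bins <- cut(x, breaks = freq_cuts,
                 include.lowest = TRUE, right = FALSE)

# The counts are only approximately equal because of tied values.
table(freq_bins)
```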

iris %>% pull(Petal.Width) %>% discretize(method = "cluster", breaks = 3)

##   [1] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##   [6] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [11] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [16] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [21] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [26] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [31] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [36] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [41] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [46] [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792)  [0.1,0.792) 
##  [51] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [56] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [61] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [66] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [71] [1.71,2.5]   [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [76] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [81] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [86] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [91] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
##  [96] [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71) [0.792,1.71)
## [101] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [106] [1.71,2.5]   [0.792,1.71) [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [111] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [116] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71)
## [121] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [126] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71)
## [131] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [0.792,1.71) [0.792,1.71)
## [136] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [141] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## [146] [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]   [1.71,2.5]  
## attr(,"discretized:breaks")
## [1] 0.1000000 0.7915185 1.7054750 2.5000000
## attr(,"discretized:method")
## [1] cluster
## Levels: [0.1,0.792) [0.792,1.71) [1.71,2.5]

# what the code does

# The above code first extracts the Petal.Width column from the iris dataset. Then it discretizes the values with the "cluster" method and 3 breaks. This method runs a k-means clustering on the values, so that similar values are grouped together, and places the interval boundaries between the resulting clusters. The end result is a factor with 3 intervals, which can then be used for further analysis.
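
# The idea behind cluster-based discretization can be sketched with kmeans() from base R. The sketch assumes that boundaries are placed halfway between neighboring cluster centers; this is an assumption for illustration, so the boundaries may differ slightly from the ones printed above. The seed and all names are arbitrary.

```r
set.seed(1000)  # k-means starts from random centers, so fix the seed

x <- iris$Petal.Width

# Cluster the one-dimensional values into three groups.
km <- kmeans(x, centers = 3, nstart = 10)

# Sort the cluster centers and put a boundary midway between neighbors.
centers <- sort(km$centers[, 1])
cluster_cuts <- c(min(x), head(centers, -1) + diff(centers) / 2, max(x))
cluster_cuts

# Assign each value to the interval it falls into.
cluster_bins <- cut(x, breaks = cluster_cuts,
                    include.lowest = TRUE, right = FALSE)
table(cluster_bins)
```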

ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "interval", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: interval", subtitle = "Blue lines are boundaries")

# what the code does

# The above code uses ggplot2 to draw a histogram of the Petal.Width variable from the iris dataset, with binwidth = 0.2 so that each bar covers a range of 0.2. geom_vline() then draws a vertical line at each cut point returned by discretize(). With method = "interval" and breaks = 3, the range of Petal.Width is split into three intervals of equal width, and onlycuts = TRUE makes discretize() return only the cut points instead of the discretized values. The lines are colored blue, and labs() adds a title and subtitle that explain what the plot shows.
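
# As a rough illustration of equal-width discretization, the cut points can be reproduced with base R alone. This is a sketch of the idea, not the implementation behind discretize(); the names x, k and cuts_interval are chosen here for illustration.

```r
# Equal-width cut points for Petal.Width, using only base R.
x <- iris$Petal.Width
k <- 3                                                    # number of intervals

# k + 1 evenly spaced boundaries from the minimum to the maximum
cuts_interval <- seq(min(x), max(x), length.out = k + 1)
cuts_interval

# Number of observations that fall into each interval
table(cut(x, breaks = cuts_interval, include.lowest = TRUE))
```

# The intervals have the same width, but the counts per interval can differ a lot, which is the main drawback of this method for skewed or clustered data.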

ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "frequency", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: frequency", subtitle = "Blue lines are boundaries")

# what the code does


# The above code draws the same histogram of Petal.Width with a binwidth of 0.2 and again uses geom_vline() to mark the cut points returned by discretize(). With method = "frequency" and breaks = 3, the cut points are chosen so that the three intervals contain roughly the same number of observations, which makes the intervals differ in width.
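
# Equal-frequency cut points are essentially quantiles. The following base R sketch shows the idea under that assumption; it is not taken from discretize() itself, and the names x, k and cuts_frequency are illustrative.

```r
# Equal-frequency cut points for Petal.Width via quantiles (base R only).
x <- iris$Petal.Width
k <- 3                                                    # number of intervals

# Boundaries at the 0, 1/3, 2/3 and 1 quantiles
cuts_frequency <- quantile(x, probs = seq(0, 1, length.out = k + 1), names = FALSE)
cuts_frequency
```

# Because Petal.Width contains many tied values, the resulting groups are only approximately equal in size.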

ggplot(iris, aes(Petal.Width)) + geom_histogram(binwidth = .2) +
  geom_vline(xintercept =
      iris %>% pull(Petal.Width) %>% discretize(method = "cluster", breaks = 3, onlycuts = TRUE),
    color = "blue") +
  labs(title = "Discretization: cluster", subtitle = "Blue lines are boundaries")

# what the code does

# The above code produces a histogram of the Petal Width column from the iris dataset. This histogram is created using the ggplot2 package, with the Petal Width as the x axis. The binwidth is set to 0.2, meaning that each bin in the histogram will represent a range of 0.2. The geom_vline function is used to draw a vertical line at the boundaries of the clusters that are created using the discretize function. This function groups the Petal Width values into three clusters, and the lines are drawn at the boundaries of these clusters.

Standardize Data

scale_numeric <- function(x) x %>% mutate_if(is.numeric, function(y) as.vector(scale(y)))

iris.scaled <- iris %>% scale_numeric()
iris.scaled

# what the code does


# This code standardizes the numeric columns of the iris dataset. The first line defines the function scale_numeric(), which takes a data frame and uses the dplyr function mutate_if() to apply a transformation to every column for which is.numeric() is TRUE. The transformation calls the base R function scale(), which subtracts the column mean and divides by the column standard deviation (a z-score), and as.vector() drops the matrix structure that scale() returns. Each numeric column therefore ends up with mean 0 and standard deviation 1; the values are not restricted to the range -1 to 1. The next line applies the function to iris and stores the result as iris.scaled, and the last line prints it.
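
# To make the standardization concrete, the following base R sketch computes the z-scores for one column by hand and compares them with scale(). The variable names are illustrative.

```r
# Standardize Sepal.Length by hand and compare with scale().
x <- iris$Sepal.Length

z_manual <- (x - mean(x)) / sd(x)   # subtract the mean, divide by the standard deviation
z_scale  <- as.vector(scale(x))     # the transformation scale_numeric() applies per column

all.equal(z_manual, z_scale)        # the two versions agree
c(mean = mean(z_manual), sd = sd(z_manual))
range(z_manual)                     # values are not limited to [-1, 1]
```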

summary(iris.scaled)

##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# what the code does

# The above code summarizes the scaled version of the iris dataset. For each numeric column, summary() reports the minimum, first quartile, median, mean, third quartile, and maximum; for the factor column Species it reports the number of observations per level. Every mean is 0 because the columns were standardized, while the minimum and maximum values show that standardized values can fall well outside -1 to 1.

Proximities: Similarities and Distances

Minkowski Distances

iris_sample <- iris.scaled %>% select(-Species) %>% slice(1:5)
iris_sample

# what the code does

# The above code starts from iris.scaled, the standardized version of the iris data. It uses select() to remove the Species column and slice() to keep the first five rows. The result, iris_sample, is a tibble with the four scaled measurement columns for the first five flowers.

dist(iris_sample, method = "euclidean")

##           1         2         3         4
## 2 1.1722914                              
## 3 0.8427840 0.5216255                    
## 4 1.0999999 0.4325508 0.2829432          
## 5 0.2592702 1.3818560 0.9882608 1.2459861

# what the code does

# The above code calculates the Euclidean distance between every pair of the five observations in iris_sample. The Euclidean distance is a measure of dissimilarity between two points in a multi-dimensional space: it is the square root of the sum of the squared differences between the corresponding coordinates of the two points. Here the coordinates are the four scaled attributes: sepal length, sepal width, petal length, and petal width.
# 
# dist() returns an object of class "dist" that stores only the lower triangle of the distance matrix, because the matrix is symmetric and its diagonal is zero. In the output above, the entry in row 2, column 1 (1.1722914) is the distance between observations 1 and 2, and so on for the other pairs.
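
# The formula can be checked by hand for one pair of rows. The sketch below rebuilds the scaled sample with base R (scale() on the four measurement columns, then the first five rows) and compares a manual calculation with stats::dist(). The names m, d12_manual and d12_dist are illustrative.

```r
# Euclidean distance between observations 1 and 2, by hand and via dist().
m <- scale(iris[, 1:4])[1:5, ]                  # standardized measurements, first five rows

d12_manual <- sqrt(sum((m[1, ] - m[2, ])^2))    # square root of the sum of squared differences
d12_dist   <- as.matrix(stats::dist(m, method = "euclidean"))[2, 1]

c(manual = d12_manual, dist = d12_dist)
```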

dist(iris_sample, method = "manhattan")

##           1         2         3         4
## 2 1.3886674                              
## 3 1.2279853 0.7570306                    
## 4 1.5781768 0.6483657 0.4634868          
## 5 0.3501915 1.4973323 1.3366502 1.6868417

# what the code does

# The above code computes the Manhattan distance between each pair of observations in the iris_sample data frame. The Manhattan distance (also called city-block or L1 distance) is the sum of the absolute differences between the corresponding coordinates of the two points. It is never smaller than the Euclidean distance between the same two points, which the two outputs confirm.

dist(iris_sample, method = "maximum")

##           1         2         3         4
## 2 1.1471408                              
## 3 0.6882845 0.4588563                    
## 4 0.9177126 0.3622899 0.2294282          
## 5 0.2294282 1.3765690 0.9177126 1.1471408

# what the code does

# The above code calculates the maximum distance (also called the supremum or Chebyshev distance) between each pair of rows in iris_sample. dist() takes the dataset (iris_sample) and the method ("maximum"). With this method, the distance between two rows is the largest absolute difference over all coordinates, so only the attribute on which the two rows differ most contributes.
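
# Manhattan, Euclidean, and maximum distance are all members of the Minkowski family with p = 1, p = 2, and p = Inf. The sketch below uses a small helper, minkowski(), which is a hypothetical function written for this illustration and not part of any package.

```r
# Minkowski distance with exponent p; p = Inf gives the maximum distance.
minkowski <- function(a, b, p) {
  if (is.infinite(p)) {
    max(abs(a - b))                   # limit case: largest coordinate difference
  } else {
    sum(abs(a - b)^p)^(1 / p)
  }
}

m <- scale(iris[, 1:4])[1:5, ]        # standardized measurements, first five rows
a <- m[1, ]
b <- m[2, ]

c(manhattan = minkowski(a, b, 1),
  euclidean = minkowski(a, b, 2),
  maximum   = minkowski(a, b, Inf))
```

# The three values correspond to the row 2, column 1 entries of the three dist() outputs above.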

Distances for Binary Data

b <- rbind(
  c(0,0,0,1,1,1,1,0,0,1),
  c(0,0,1,1,1,0,0,1,0,0)
  )
b

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    1    1    1    1    0    0     1
## [2,]    0    0    1    1    1    0    0    1    0     0

# what the code does
# 
# The code above uses rbind() to create a matrix with two rows and ten columns. The first row contains the values 0, 0, 0, 1, 1, 1, 1, 0, 0, 1 and the second row contains 0, 0, 1, 1, 1, 0, 0, 1, 0, 0. The two rows serve as binary vectors for the distance calculations that follow.

b_logical <- apply(b, MARGIN = 2, as.logical)
b_logical

##       [,1]  [,2]  [,3] [,4] [,5]  [,6]  [,7]  [,8]  [,9] [,10]
## [1,] FALSE FALSE FALSE TRUE TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [2,] FALSE FALSE  TRUE TRUE TRUE FALSE FALSE  TRUE FALSE FALSE

# what the code does

# The above code takes a matrix, b, and applies the as.logical function over each column of the matrix. This transforms each element of the matrix into a logical value. A logical value is either TRUE or FALSE, and this transformation creates a new matrix, b_logical, with logical values in place of the original elements of b. This new matrix is then returned and stored in the b_logical variable.

dist(b, method = "manhattan")

##   1
## 2 5

# what the code does

# The above line of code calculates the Manhattan distance between the two rows of the matrix b. The Manhattan distance is the sum of the absolute differences between the two vectors. For 0/1 data each differing position contributes exactly 1, so the result is the number of positions in which the rows differ (the Hamming distance), here 5.

dist(b, method = "euclidean")^2

##   1
## 2 5

# what the code does

# The above line of code calculates the squared Euclidean distance between the two rows of b. dist() first computes the Euclidean distance: it subtracts corresponding elements, squares the differences, adds them up, and takes the square root of the sum. The trailing ^2 then squares the result again, which undoes the square root and leaves the sum of squared differences. For 0/1 data every squared difference is either 0 or 1, so this sum is again the number of differing positions, 5, the same value as the Manhattan distance.

dist(b, method = "binary")

##           1
## 2 0.7142857

# what the code does

# The above code uses dist() with method = "binary" to calculate the binary distance between the row vectors b[1,] and b[2,]. This is not a Euclidean distance and not a count of differing elements. Columns in which both rows are 0 are ignored; among the remaining columns, the distance is the proportion in which only one of the two rows is 1 (the Jaccard distance). Here seven columns have at least one 1 and the rows differ in five of them, so the distance is 5/7 = 0.7142857.
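
# The three results for binary data can be reproduced by simple counting. The following base R sketch uses the two rows of b as plain vectors; the variable names are illustrative.

```r
# Hamming and Jaccard (binary) distance for two 0/1 vectors, by counting.
b1 <- c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1)
b2 <- c(0, 0, 1, 1, 1, 0, 0, 1, 0, 0)

hamming     <- sum(b1 != b2)                    # positions that differ
either_on   <- sum(b1 == 1 | b2 == 1)           # positions where at least one vector is 1
only_one_on <- sum(xor(b1 == 1, b2 == 1))       # positions where exactly one vector is 1

jaccard_dist <- only_one_on / either_on

c(hamming = hamming, jaccard = jaccard_dist)
```

# hamming matches the Manhattan and squared Euclidean results (5), and jaccard_dist matches the output of dist(b, method = "binary").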

people <- tibble(
  height = c(      160,    185,    170),
  weight = c(       52,     90,     75),
  sex    = c( "female", "male", "male")
)
people

# what the code does

# The above code creates a tibble called "people" that contains three variables: height, weight, and sex. The height variable (presumably in centimeters) contains the values 160, 185, and 170. The weight variable (presumably in kilograms) contains the values 52, 90, and 75. The sex variable is categorical with two categories and contains the values "female", "male", and "male". The tibble has three observations, one per person.

library(proxy)

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

# what the code does

# This code loads the proxy package, which provides a framework for computing proximities, that is, distances and similarities, between observations. It has nothing to do with web proxy servers. As the messages above show, attaching the package masks dist() and as.dist() from stats (and as.matrix() from base and Matrix), so later calls to dist() use the proxy version, which supports many more measures, including Gower.
# 
# To use the package, it must first be installed, for example with install.packages("proxy").

# d_Gower <- dist(people, method = "Gower")
# d_Gower


# what the code does

# The above code is commented out, so it does not run here. If it were run, it would compute a distance matrix for the people dataset using the Gower method. This method calculates the distance between objects with mixed feature types (numerical, categorical, binary, ordinal): each feature contributes a value between 0 and 1 that depends on its type, and the contributions are combined into a single distance.
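
# Since the call above is commented out, here is a minimal base R sketch of the usual textbook definition of the Gower distance: numeric features contribute the absolute difference divided by the feature's range, categorical features contribute 0 for a match and 1 for a mismatch, and the contributions are averaged. gower_pair() and people_df are hypothetical names introduced for this example, and the result may differ in detail from what the proxy package computes.

```r
# Gower distance between two rows of a mixed-type data frame (textbook definition).
people_df <- data.frame(
  height = c(160, 185, 170),
  weight = c(52, 90, 75),
  sex    = c("female", "male", "male")
)

gower_pair <- function(i, j, df) {
  parts <- sapply(names(df), function(v) {
    col <- df[[v]]
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))   # range-normalized difference in [0, 1]
    } else {
      as.numeric(col[i] != col[j])              # 0 if equal, 1 if different
    }
  })
  mean(parts)                                   # average over all features
}

c(d12 = gower_pair(1, 2, people_df),
  d13 = gower_pair(1, 3, people_df),
  d23 = gower_pair(2, 3, people_df))
```

# Persons 1 and 2 differ maximally on every feature, so their distance under this definition is 1.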

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:sampling':
## 
##     cluster

## The following object is masked from 'package:purrr':
## 
##     lift

# what the code does

# The above code loads the caret package, a toolkit for building predictive models. caret provides functions to preprocess data, tune model parameters, and evaluate model performance. Here it is loaded for its dummyVars() function, which is used in the next step. The messages show that caret also attaches lattice and masks cluster() from sampling and lift() from purrr.

data_dummy <- dummyVars(~., people) %>% predict(people)
data_dummy

##   height weight sexfemale sexmale
## 1    160     52         1       0
## 2    185     90         0       1
## 3    170     75         0       1

# what the code does
# The above code uses the dummyVars() function from the caret package. dummyVars(~., people) builds an encoding for all variables in people, and predict() applies it to the data. Numeric columns are passed through unchanged, while each categorical variable is replaced by one 0/1 indicator column per level (1 for that level and 0 for all other levels). The result, data_dummy, is a numeric matrix in which the sex column has been replaced by sexfemale and sexmale.
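
# To show what the encoding does, the same matrix can be built by hand in base R. This is an illustrative sketch, not how dummyVars() is implemented; the names sex, dummies and encoded are chosen here.

```r
# One 0/1 indicator column per level of a categorical variable.
sex <- factor(c("female", "male", "male"))

dummies <- sapply(levels(sex), function(l) as.integer(sex == l))
colnames(dummies) <- paste0("sex", levels(sex))   # sexfemale, sexmale

encoded <- cbind(height = c(160, 185, 170),
                 weight = c(52, 90, 75),
                 dummies)
encoded
```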

weight_matrix <- matrix(c(1, 1, 1/2, 1/2), ncol = 4, nrow = nrow(data_dummy), byrow = TRUE)
data_dummy_scaled <- scale(data_dummy) * weight_matrix

d_dummy <- dist(data_dummy_scaled)
d_dummy

##          1        2
## 2 3.064169         
## 3 1.890931 1.426621

# what the code does

# The above code computes a weighted distance matrix from the dummy-coded data in data_dummy.
# 
# First, matrix() creates weight_matrix with 4 columns and as many rows as data_dummy. Because byrow = TRUE, every row is (1, 1, 1/2, 1/2): height and weight get weight 1, and the two indicator columns for sex get weight 1/2 each.
# 
# Next, scale() standardizes each column of data_dummy to mean 0 and standard deviation 1, and the element-wise multiplication with weight_matrix applies the weights. The two sex indicators carry the same information, so the weight of 1/2 for each presumably keeps the single variable sex from counting twice as much as height or weight.
# 
# Finally, dist() calculates the distance between each pair of rows of the weighted data (Euclidean by default). The result is a distance matrix for the three people.
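
# The same weighting can be written without building a full weight matrix. The sketch below recreates the dummy-coded data by hand and uses sweep() to multiply each standardized column by its weight; x, w and x_weighted are illustrative names, and sweep() is an alternative chosen here, not the approach used above.

```r
# Weighted Euclidean distances on standardized, dummy-coded data.
x <- cbind(height    = c(160, 185, 170),
           weight    = c(52, 90, 75),
           sexfemale = c(1, 0, 0),
           sexmale   = c(0, 1, 1))

w <- c(1, 1, 1/2, 1/2)                        # one weight per column

x_weighted <- sweep(scale(x), 2, w, "*")      # standardize, then weight each column
d <- stats::dist(x_weighted)                  # Euclidean by default
d
```

# The result should match d_dummy above, because multiplying by a matrix whose rows all equal w is the same as weighting each column by the corresponding entry of w.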

# ggplot(tibble(d_dummy, d_Gower), aes(x = d_dummy, y = d_Gower)) +
#   geom_point() +
#   geom_smooth(method = "lm", se = FALSE)


# what the code does
# The above code is commented out, presumably because d_Gower is not computed above, so it does not run here. If it were run, it would create a scatterplot that compares the two sets of distances, d_dummy and d_Gower. It first combines the two into a tibble. ggplot() then maps d_dummy to the x-axis and d_Gower to the y-axis, geom_point() adds one point per pair of people, and geom_smooth() adds a linear regression line. The method is "lm", and se = FALSE suppresses the confidence band around the line. The plot would show how closely the two distance measures agree.

Additional Proximity Measures Available in Package proxy

library(proxy)
pr_DB$get_entry_names()

##  [1] "Jaccard"         "Kulczynski1"     "Kulczynski2"     "Mountford"      
##  [5] "Fager"           "Russel"          "simple matching" "Hamman"         
##  [9] "Faith"           "Tanimoto"        "Dice"            "Phi"            
## [13] "Stiles"          "Michael"         "Mozley"          "Yule"           
## [17] "Yule2"           "Ochiai"          "Simpson"         "Braun-Blanquet" 
## [21] "cosine"          "angular"         "eJaccard"        "eDice"          
## [25] "correlation"     "Chi-squared"     "Phi-squared"     "Tschuprow"      
## [29] "Cramer"          "Pearson"         "Gower"           "Euclidean"      
## [33] "Mahalanobis"     "Bhjattacharyya"  "Manhattan"       "supremum"       
## [37] "Minkowski"       "Canberra"        "Wave"            "divergence"     
## [41] "Kullback"        "Bray"            "Soergel"         "Levenshtein"    
## [45] "Podani"          "Chord"           "Geodesic"        "Whittaker"      
## [49] "Hellinger"       "fJaccard"

# what the code does

# The above code loads the proxy package and calls pr_DB$get_entry_names(). pr_DB is the package's local registry of proximity measures, not a remote database. get_entry_names() returns the names of all 50 registered distance and similarity measures, which can be selected by name through the method argument of the package's dist() and simil() functions.

Relationships Between Features

Correlation

cc <- iris %>% select(-Species) %>% cor()
cc

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

# what the code does

# The above code is written in R and is used to calculate the correlation between the variables in the Iris dataset. The code begins by using the 'iris' dataset and using the 'select' function to remove the Species variable from the dataset. Then the 'cor' function is used to calculate the correlation between the remaining variables. The output of this code is a correlation matrix that shows the correlation between each of the variables in the dataset.

ggplot(iris, aes(Petal.Length, Petal.Width)) + 
  geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

# what the code does

# The above code creates a scatterplot of the Petal Length and Width of the iris dataset. The data for the plot is taken from the iris dataset, which includes measurements for the sepal and petal length and width of three different species of iris flowers. First, the ggplot() function is used to create the basic plot structure and to specify the data and variables to be used in the plot. The aes() argument is used to specify the x and y values, which are the Petal Length and Width of the iris dataset, respectively. The geom_point() function is then used to create the points on the scatterplot, and the geom_smooth() function is used to fit a linear regression model and to draw the line of best fit. This produces a scatterplot of Petal Length and Width with a linear regression line, providing a visual representation of the relationship between the two variables.

with(iris, cor(Petal.Length, Petal.Width))

## [1] 0.9628654

# what the code does

# The above code computes the correlation between the Petal.Length and Petal.Width variables of the iris dataset. with() evaluates the expression inside the data frame, and cor() calculates Pearson's correlation coefficient by default. The coefficient ranges from -1 to 1, where -1 indicates a perfect negative linear relationship and 1 a perfect positive one. The value of 0.96 shows a very strong positive linear relationship.
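
# Pearson's coefficient is the covariance of the two variables divided by the product of their standard deviations. A short base R check, with illustrative variable names:

```r
# Pearson correlation from its definition, compared with cor().
x <- iris$Petal.Length
y <- iris$Petal.Width

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```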

Rank Correlation

iris_ord <- iris %>% mutate_if(is.numeric,
  function(x) cut(x, 3, labels = c("short", "medium", "long"), ordered = TRUE))

iris_ord

# what the code does

# The above code uses mutate_if() to convert the numeric variables in the iris data frame into three ordered categories: short, medium, and long. cut(x, 3) splits the range of each variable into three intervals of equal width, and labels assigns the category names. ordered = TRUE makes the result an ordered factor, so the categories are ordered from short to long. The output is a new data frame, iris_ord, with the same variables as iris, but with the numeric variables replaced by ordered factors.

summary(iris_ord)

##  Sepal.Length Sepal.Width Petal.Length Petal.Width       Species  
##  short :59    short :47   short :50    short :50   setosa    :50  
##  medium:71    medium:88   medium:54    medium:54   versicolor:50  
##  long  :20    long  :15   long  :46    long  :46   virginica :50

# what the code does


# This code summarizes the iris_ord dataset. Because Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are now ordered factors, summary() does not report means, medians, or quartiles for them. Instead, it counts the observations in each level, as it does for Species. For example, 59 flowers have a short, 71 a medium, and 20 a long sepal length.

iris_ord %>% pull(Sepal.Length)

##   [1] short  short  short  short  short  short  short  short  short  short 
##  [11] short  short  short  short  medium medium short  short  medium short 
##  [21] short  short  short  short  short  short  short  short  short  short 
##  [31] short  short  short  short  short  short  short  short  short  short 
##  [41] short  short  short  short  short  short  short  short  short  short 
##  [51] long   medium long   short  medium medium medium short  medium short 
##  [61] short  medium medium medium medium medium medium medium medium medium
##  [71] medium medium medium medium medium medium long   medium medium medium
##  [81] short  short  medium medium short  medium medium medium medium short 
##  [91] short  medium medium short  medium medium medium medium short  medium
## [101] medium medium long   medium medium long   short  long   medium long  
## [111] medium medium long   medium medium medium medium long   long   medium
## [121] long   medium long   medium medium long   medium medium medium long  
## [131] long   long   medium medium medium long   medium medium medium long  
## [141] medium long   medium long   medium medium medium medium medium medium
## Levels: short < medium < long

# what the code does

# The above code uses the dplyr function pull() to extract the variable Sepal.Length from the dataset iris_ord. pull() takes a dataset and a variable name as arguments and returns that column as a vector with one entry per observation. Here the result is an ordered factor with the levels short < medium < long.

iris_ord %>% select(-Species) %>% sapply(xtfrm) %>% cor(method = "kendall")

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1437985    0.7418595   0.7295139
## Sepal.Width    -0.1437985   1.0000000   -0.3298796  -0.3154474
## Petal.Length    0.7418595  -0.3298796    1.0000000   0.9198290
## Petal.Width     0.7295139  -0.3154474    0.9198290   1.0000000

# what the code does

# The above code begins by selecting all columns in the iris_ord data frame except for the Species column. It then uses sapply() to apply xtfrm() to each column, which converts the ordered factors into their integer codes (1 for short, 2 for medium, 3 for long) and returns a numeric matrix. Finally, cor() with method = "kendall" calculates Kendall's rank correlation coefficient (tau) for every pair of columns. Kendall's tau depends only on the order of the values, so it is suitable for ordinal data.

iris_ord %>% select(-Species) %>% sapply(xtfrm) %>% cor(method = "spearman")

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1569659    0.7937613   0.7843406
## Sepal.Width    -0.1569659   1.0000000   -0.3662775  -0.3517262
## Petal.Length    0.7937613  -0.3662775    1.0000000   0.9399038
## Petal.Width     0.7843406  -0.3517262    0.9399038   1.0000000

# what the code does

# The above code first selects all columns of the iris_ord dataset except for the Species column and applies xtfrm() to each of them, which turns the ordered factors into their integer codes. cor() with method = "spearman" then calculates Spearman's rank correlation coefficient (rho) for every pair of columns. Spearman's rho is Pearson's correlation computed on the ranks of the values, so like Kendall's tau it depends only on the order of the values.
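
# The relationship between Spearman's rho and ranks can be verified directly. The sketch below uses the original numeric columns of iris rather than the three-level ordered factors, so its value differs from the matrix above; the variable names are illustrative.

```r
# Spearman's rho equals Pearson's correlation of the ranks.
x <- iris$Petal.Length
y <- iris$Petal.Width

rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y))          # rank() assigns average ranks to ties

c(builtin = rho_builtin, via_ranks = rho_ranks)
```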

iris %>% select(-Species) %>% cor()

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

# what the code does
# 
# The above code repeats the Pearson correlation matrix of the original numeric columns for comparison with the rank correlations. select() excludes the Species column and cor() calculates the correlation between all remaining columns. The pattern is the same as in the rank correlations above, even though those were computed after each variable had been reduced to three ordered categories.

Density Estimation

ggplot(iris, aes(x = Petal.Length, y = 0)) + geom_point()

# what the code does

# The above code plots the Petal.Length of each flower as a point along the x-axis. Because y is fixed at 0, all points lie on a single horizontal line, so this is a one-dimensional plot rather than a scatter plot of two variables. Many points overlap, which makes it hard to judge how many flowers have a given petal length. Species is not mapped to any aesthetic, so the plot cannot distinguish the species. The histograms and density estimates below show the distribution more clearly.

Histograms

ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram() +
  geom_rug(alpha = 1/2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what the code does

# The above code creates a histogram of the Petal.Length variable of the iris dataset. ggplot() creates the base plot with iris as the data frame and Petal.Length as the x variable. geom_histogram() then counts the observations per bin and draws the bars; because no binwidth is given, ggplot2 uses 30 bins by default, as the message above notes. Finally, geom_rug() adds a rug plot with alpha set to 1/2. The rug draws a small tick on the x-axis for each data point, which shows where the individual observations lie underneath the bars.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_bin2d(bins = 10) +
  geom_jitter(color = "red")

# what the code does


# The above code uses the ggplot2 package to visualize the relationship between sepal length and sepal width in the iris dataset. ggplot() maps Sepal.Length to the x-axis and Sepal.Width to the y-axis. geom_bin2d() then divides the plane into a grid of rectangles, with 10 bins along each axis, and colors each rectangle by the number of observations it contains. Finally, geom_jitter() draws the individual data points in red with a small amount of random noise added, so that overlapping points become visible.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_hex(bins = 10) +
  geom_jitter(color = "red")

## Warning: Computation failed in `stat_binhex()`
## Caused by error in `compute_group()`:
## ! The package `hexbin` is required for `stat_binhex()`

# what the code does

# The above code uses the ggplot2 library to create a hexbin plot of the iris dataset, with Sepal.Length on the x-axis and Sepal.Width on the y-axis. A hexbin plot shows the density of points by dividing the plane into hexagonal bins and coloring each bin by its count. The code also adds a jitter layer, which draws the individual points in red with a small amount of random noise so that overlapping points can be distinguished. In the run shown above, however, the hexbin layer could not be computed because the hexbin package is not installed, so only the jittered points are drawn. Installing the package, for example with install.packages("hexbin"), resolves the warning.

Kernel Density Estimate (KDE)

ggplot(iris, aes(Petal.Length)) +
  geom_density(bw = .2) +
  geom_rug(alpha = 1/2)

# what the code does

# The above code creates a density plot for the Petal Length variable from the iris dataset. The geom_density function plots the kernel density estimate of the variable, using a bandwidth (bw) of 0.2. The geom_rug function then adds a rug plot to the density plot, which serves to show the actual values of the variable on the x-axis. The alpha parameter is set to 1/2 to make the rug plot slightly transparent. This allows the rug plot and density plot to be visible at the same time, making the visualization more informative.
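
# A kernel density estimate places a small bump (here a Gaussian with standard deviation bw) on every data point and averages the bumps. The following base R sketch evaluates that definition directly; kde() is a hypothetical helper written for this illustration, and it is slower than the routine ggplot2 relies on.

```r
# Gaussian kernel density estimate of Petal.Length, straight from the definition.
x  <- iris$Petal.Length
bw <- 0.2                                      # kernel standard deviation (bandwidth)

kde <- function(x0) {
  sapply(x0, function(p) mean(dnorm((p - x) / bw)) / bw)
}

step <- 0.01
grid <- seq(min(x) - 1, max(x) + 1, by = step) # evaluation points
dens <- kde(grid)

area <- sum(dens) * step                       # numerical check: a density integrates to about 1
area
```

# A smaller bw produces a more jagged curve that follows individual points, while a larger bw smooths over the gap between the short and long petals.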

ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_density_2d_filled() +
  geom_jitter()

# what the code does

# The above code uses the ggplot2 package in R to create a visual representation of the data from the iris dataset. The code starts by calling the ggplot function to create a graph with the Sepal Length and Width variables from the iris dataset. Then it adds the geom_density_2d_filled function to create a filled density plot. The geom_jitter function is then used to add a jitter effect to the graph to make it easier to see the data points. The resulting graph provides a visual representation of the data and makes it easier to identify patterns and correlations between the two variables.

Exploring Data

Basic statistics

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

# what the code does

# The above code uses the summary() function to summarize the iris dataset. For each numeric variable, summary() reports the minimum, first quartile, median, mean, third quartile, and maximum. For the factor Species, it reports the number of observations in each level. It does not report the mode or the total number of observations.

iris %>% pull(Sepal.Length) %>% mean()

## [1] 5.843333

# what the code does

# The above code calculates the mean of the sepal length of the iris dataset. First, the code uses the 'iris' dataset and uses the 'pull' function to select the Sepal.Length column. Then, the mean() function is used to calculate the average of the Sepal.Length column. The result of this code is the mean of the Sepal.Length column of the iris dataset.

iris %>% pull(Sepal.Length) %>% sd()

## [1] 0.8280661

# what the code does

# The above code uses the pipe operator (%>%) to read the iris data set and pull out the Sepal Length column. It then uses the sd() function to calculate the standard deviation of the Sepal Length column. The end result is the standard deviation of the Sepal Length in the iris dataset.

mean(c(1, 2, NA, 3, 4, 5))

## [1] NA

# what the code does

# The above code tries to compute the mean of a vector of six values, the third of which is NA (a missing value). c() creates the vector 1, 2, NA, 3, 4, 5, and mean() is applied to it. By default mean() does not ignore missing values: because one value is unknown, the mean is unknown as well, and the result is NA.

mean(c(1, 2, NA, 3, 4, 5),  na.rm = TRUE)

## [1] 3

# what the code does

# The above code computes the mean of the same vector, this time with na.rm = TRUE, which removes the NA (not available) value before the calculation. The mean is then the sum of the five remaining values divided by five: (1 + 2 + 3 + 4 + 5) / 5 = 3.
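
# The effect of na.rm = TRUE can be checked by hand. The following is a minimal sketch; the variable names x, n_valid, and manual_mean are chosen here for illustration only.

```r
x <- c(1, 2, NA, 3, 4, 5)

# Count only the values that are not missing
n_valid <- sum(!is.na(x))

# Sum of the non-missing values divided by their count
manual_mean <- sum(x, na.rm = TRUE) / n_valid

manual_mean
mean(x, na.rm = TRUE)
```

# Both expressions give the same value, because the divisor is the number of non-missing values and not the length of the vector.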

iris %>% pull(Sepal.Length) %>% mean()

## [1] 5.843333

# what the code does
 
# The above code repeats the earlier calculation of the mean of Sepal.Length, so that it can be compared with the trimmed mean that follows. pull() extracts the Sepal.Length column from the iris tibble and mean() averages its values.

iris %>% pull(Sepal.Length) %>% mean(trim = .1)

## [1] 5.808333

# what the code does

# The above code calculates a trimmed mean of the Sepal.Length column. The pipe passes the iris dataset to pull(), which extracts Sepal.Length, and the result is passed to mean() with trim = .1. This tells mean() to sort the values and discard the lowest 10% and the highest 10% of the observations before averaging, which makes the result less sensitive to outliers. The trimmed mean (5.808333) is slightly lower than the ordinary mean (5.843333).
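
# The trimming can be reproduced by hand. The sketch below uses base R only and assumes the built-in iris data; the names k, trimmed, and manual_trimmed_mean are illustrative.

```r
data(iris)

x <- sort(iris$Sepal.Length)
n <- length(x)

# Number of observations dropped at each end of the sorted vector
k <- floor(n * 0.1)

# Keep only the middle part and average it
trimmed <- x[(k + 1):(n - k)]
manual_trimmed_mean <- mean(trimmed)

manual_trimmed_mean
mean(iris$Sepal.Length, trim = 0.1)
```

# With 150 observations, 15 values are dropped at each end and the mean is taken over the remaining 120.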

iris %>% summarize_if(is.numeric, mean)

# what the code does


# The above code uses the summarize_if() function from the dplyr package to calculate the mean of every numeric column. Besides the data, summarize_if() takes two arguments: a predicate function that is applied to each column to decide whether it should be summarized, and the function used to summarize the selected columns. Here the predicate is is.numeric, which returns TRUE for the four measurement columns and FALSE for Species, and the summary function is mean. The result is a tibble with a single row that holds the mean of each numeric column.
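
# In current versions of dplyr, the scoped verbs such as summarize_if() are superseded by across(). The sketch below shows what should be an equivalent formulation; it assumes dplyr 1.0.0 or later, and the names means_if and means_across are illustrative.

```r
library(dplyr)

data(iris)

# Scoped verb, as used in the text
means_if <- iris %>% summarize_if(is.numeric, mean)

# The same summary written with across() and where()
means_across <- iris %>% summarize(across(where(is.numeric), mean))

means_across
nrow(means_across)
```

# Both versions collapse the 150 rows into a single row with one mean per numeric column.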

iris %>% summarize_if(is.numeric, sd)

# what the code does

# The above code calculates the standard deviation of all the numeric variables in the Iris dataset. It uses the summarize_if() function to accomplish this task. The first argument of the summarize_if() function is the dataset, which in this case is the Iris dataset. The second argument is the condition which is the is.numeric function. This function will return TRUE if the data is numeric, and FALSE if it is not. The last argument of the summarize_if() function is the function that will be applied to the numeric data, which in this case is the standard deviation (sd). Therefore, the above code will calculate the standard deviation of all the numeric variables in the Iris dataset.

iris %>% summarize_if(is.numeric, list(min = min, median = median, max = max))

# what the code does

# The above code uses summarize_if() to calculate the minimum, median, and maximum of every numeric column. The predicate is.numeric selects the numeric columns of the iris dataset. The next argument is a named list of the functions to apply to each selected column, in this case min, median, and max. The result is a one-row summary table with one column for each combination of variable and function, named after both (for example Sepal.Length_min).

iris %>% summarize_if(is.numeric, mad)

# what the code does


# The above code uses summarize_if() from the dplyr package to summarize the numeric columns of the data frame. The predicate is.numeric selects the numeric columns, and the function that follows it is applied to each of them. Here that function is mad(), which computes the median absolute deviation (MAD): the median of the absolute deviations from the median, multiplied by default by the constant 1.4826 so that it is comparable to the standard deviation for normally distributed data. The MAD is a robust measure of spread that is less affected by outliers than the standard deviation.
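
# What mad() computes can be written out in base R. This is a small sketch for a single column; the name manual_mad is illustrative, and 1.4826 is the default value of the constant argument of mad().

```r
data(iris)

x <- iris$Sepal.Length

# Median of the absolute deviations from the median, scaled by 1.4826
manual_mad <- 1.4826 * median(abs(x - median(x)))

manual_mad
mad(x)
```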

Grouping

iris %>% group_by(Species) %>% summarize(across(Sepal.Length, mean))

# what the code does

# The above code uses the dplyr library to group the data by species and then calculate the mean of the Sepal Length for each species. The group_by() function takes the Species column and groups the data in the iris data frame into different species. The summarize() function then calculates the mean of each group (i.e., the mean Sepal Length for each species). The across() function is used to specify the column to be used in the mean calculation.

iris %>% group_by(Species) %>% summarize_all(mean)

# what the code does


# The above code uses the dplyr library to group and summarize the iris dataset. The code begins by using the group_by() function to group the dataset by the Species column. This will group the data into three separate groups, one for each species of iris. Next, the summarize_all() function is used to calculate the mean of each numerical column for each group. This will create a summary table with the mean of each numerical column for each species of iris.

res.aov <- aov(Sepal.Length ~ Species, data = iris)
summary(res.aov)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Species       2  63.21  31.606   119.3 <2e-16 ***
## Residuals   147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# what the code does

# The above code performs an analysis of variance (ANOVA) test on the Sepal Length of the Iris species dataset. The purpose of the ANOVA is to determine if there is a statistically significant difference between the mean Sepal Lengths of the different species. The aov() function performs the ANOVA test, with Sepal Length being the dependent variable and Species being the independent variable. The summary of the results is then printed, which provides the F statistic, the degrees of freedom, the p-value, and the mean squares for the model and the residuals.
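
# The F statistic in the table above can be reproduced from the sums of squares. The following is a sketch in base R; all variable names are illustrative.

```r
data(iris)

y <- iris$Sepal.Length
g <- iris$Species

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)     # one mean per species
group_sizes <- tapply(y, g, length)   # 50 flowers per species

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: spread of the observations around their group mean
ss_within <- sum((y - ave(y, g))^2)

df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
f_value
```

# The two sums of squares correspond to the Species and Residuals rows of the ANOVA table, and their mean squares give the F value.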

TukeyHSD(res.aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
## 
## $Species
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0

# what the code does


# The code above performs a TukeyHSD test on the results from an Analysis of Variance (ANOVA) test. The TukeyHSD test is used to identify which pairs of means are statistically different from each other. Specifically, it looks for differences between groups that are statistically significant. The TukeyHSD test provides a p-value for each pair of means, which can be used to assess the level of statistical significance. If the p-value is less than the predetermined significance level (e.g. 0.05), then the two means are deemed significantly different and the groups are said to be different.

Tabulate data

iris %>% group_by(Species) %>% summarize(n())

# what the code does
# 
# The above code works by using dplyr functions to group and summarize data in the iris dataset. First, the code uses the group_by function to group the data by Species. Then, the summarize function is used to calculate the number of observations (n()) for each of the Species.

iris_ord <- iris %>% mutate_if(is.numeric,
  function(x) cut(x, 3, labels = c("short", "medium", "long"), ordered = TRUE))
iris_ord

# what the code does

# The code above uses the mutate_if() function from the dplyr package to transform every column that satisfies a condition, here is.numeric. Each numeric column is passed to cut(), which divides the range of the values into three intervals of equal width and labels them short, medium, and long. Setting ordered = TRUE makes the result an ordered factor, so that short < medium < long. The resulting tibble, stored in iris_ord and then printed, has the same rows as iris, but its four measurement columns are now ordered factors with three levels.
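
# The behavior of cut() can be inspected on a single column. The sketch below uses base R; the name x_ord is illustrative. It spells out the argument as ordered_result, which is the full name that ordered = TRUE in the text matches through R's partial argument matching.

```r
data(iris)

x <- iris$Sepal.Length

# Without labels, cut() names each level after the interval it covers
levels(cut(x, 3))

# With labels and an ordering, as in the text
x_ord <- cut(x, 3, labels = c("short", "medium", "long"), ordered_result = TRUE)
table(x_ord)

# Ordered factors support comparisons between levels
sum(x_ord > "short")
```

# Because the intervals have equal width rather than equal counts, the three levels generally contain different numbers of observations.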

summary(iris_ord)

##  Sepal.Length Sepal.Width Petal.Length Petal.Width       Species  
##  short :59    short :47   short :50    short :50   setosa    :50  
##  medium:71    medium:88   medium:54    medium:54   versicolor:50  
##  long  :20    long  :15   long  :46    long  :46   virginica :50

# what the code does

# The above code calls summary() on iris_ord. Because every column is now a factor, summary() reports the number of observations in each level. For example, 59 flowers have a short sepal length, 71 a medium one, and 20 a long one. The Species column is unchanged, with 50 observations per species.

tbl <- iris_ord %>% select(Sepal.Length, Species) %>% table()
tbl

##             Species
## Sepal.Length setosa versicolor virginica
##       short      47         11         1
##       medium      3         36        32
##       long        0          3        17

# what the code does

# The above code works by selecting the Sepal Length and Species columns from the iris_ord data frame, and then creating a table of the selected columns. The resulting table shows the counts of observations for each unique combination of Sepal Length and Species.

# Relationship between nominal and ordinal features
iris_ord %>%
  select(Species, Sepal.Length) %>%
  pivot_longer(cols = Sepal.Length) %>%
  group_by(Species, value) %>% count() %>% ungroup() %>%
  pivot_wider(names_from = Species, values_from = n)

# what the code does

# The above code works to explore the relationship between nominal and ordinal features in the iris dataset. First, the select() function is used to isolate the Species and Sepal.Length columns. Then, the pivot_longer() function is used to pivot the Sepal.Length column so that each row contains a single value. Following this, the group_by() function is used to group the data by Species and value, and the count() function is used to count the number of observations for each group. The ungroup() function is then used to remove the grouping, and the pivot_wider() function is used to create a table with Species as the columns and the count as the values.

tbl %>% chisq.test()

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 111.63, df = 4, p-value < 2.2e-16

# what the code does

# The above code performs Pearson's chi-squared test of independence on the contingency table tbl. The test asks whether the two categorical variables, here the discretized Sepal.Length and Species, are independent. It compares the observed count in each cell with the count expected under independence and combines the differences into the chi-squared statistic. The statistic (111.63 with 4 degrees of freedom) is then compared with the chi-squared distribution to obtain a p-value. Because the p-value is far below common significance levels such as 0.05, the hypothesis of independence is rejected: sepal length category and species are related.
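
# The chi-squared statistic can be recomputed from observed and expected counts. The sketch below rebuilds the contingency table with base R so that it runs on its own; the names sepal_ord, expected, chi_sq, and p_value are illustrative.

```r
data(iris)

# Rebuild the contingency table used in the text
sepal_ord <- cut(iris$Sepal.Length, 3,
                 labels = c("short", "medium", "long"), ordered_result = TRUE)
tbl <- table(Sepal.Length = sepal_ord, Species = iris$Species)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)

# Pearson's chi-squared statistic and its degrees of freedom
chi_sq <- sum((tbl - expected)^2 / expected)
df <- (nrow(tbl) - 1) * (ncol(tbl) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

chi_sq
df
p_value
```

# Large differences between observed and expected counts, such as the many short sepals among setosa flowers, are what drive the statistic up.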

fisher.test(tbl)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tbl
## p-value < 2.2e-16
## alternative hypothesis: two.sided

# what the code does

# The above code applies Fisher's exact test to the same contingency table. Like the chi-squared test, it tests whether two categorical variables are independent, but it computes the p-value exactly from the table counts instead of relying on a large-sample approximation, which makes it suitable for tables with small cell counts. The input is a contingency table of counts, with one variable in the rows and the other in the columns. Here the p-value is again below 2.2e-16, so the result agrees with the chi-squared test.

Percentiles (Quantiles)

iris %>% pull(Petal.Length) %>% quantile()

##   0%  25%  50%  75% 100% 
## 1.00 1.60 4.35 5.10 6.90

# what the code does

# The above code uses the 'iris' data set to calculate the quantiles of the Petal Length column. The 'iris' data set is imported and then the 'pull' function is used to extract only the Petal Length column. This column is then piped into the 'quantile' function, which calculates the quantiles of the data.

iris %>% summarize(IQR = quantile(Petal.Length, probs = 0.75) - quantile(Petal.Length, probs = 0.25))

# what the code does

# The code above calculates the interquartile range (IQR) of the variable Petal.Length in the iris dataset. The IQR is the 75th percentile minus the 25th percentile, so it measures the spread of the middle half of the data. quantile() computes the two percentiles, and summarize() returns their difference in a one-row tibble with a single column named IQR.
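
# Base R also provides the IQR() function for this calculation. The sketch below compares it with the manual difference of quantiles; the names q and manual_iqr are illustrative.

```r
data(iris)

x <- iris$Petal.Length

# 25th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.75))

# Difference of the two, with the percentile name dropped
manual_iqr <- unname(q[2] - q[1])

manual_iqr
IQR(x)
```

# The value matches the quantiles shown earlier: 5.10 - 1.60 = 3.5.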

Visualization

Histogram

ggplot(iris, aes(Petal.Width)) + geom_histogram(bins = 20)

# what the code does

# The above code uses the ggplot2 package to create a histogram of the petal width in the iris dataset. The ggplot() call sets up the plot, using the iris dataset and mapping Petal.Width to the x-axis. geom_histogram() draws the histogram, with the bins argument setting the number of bins (here 20) into which the range of the data is divided.

Boxplot

ggplot(iris, aes(Species, Sepal.Length)) + 
  geom_boxplot()

# what the code does


# The above code is using the ggplot2 package to generate a boxplot of the Sepal Length against Species of the Iris dataset. The code begins by setting the Iris dataset as the data source for the graph. The next part of the code uses the aes() function to assign the Species and Sepal.Length variables as the x and y axes. The last part of the code uses the geom_boxplot() function to generate the boxplot.

iris %>% group_by(Species) %>% summarize_if(is.numeric, median)

# what the code does

# This code uses the dplyr package to summarize the iris dataset. It starts by grouping the dataset by species. Then, it uses the summarize_if function to calculate the median for all numeric columns. This will provide a summary of the median for each of the numeric columns for each species in the dataset.

library(tidyr)
iris_long <- iris %>% mutate(id = row_number()) %>% pivot_longer(1:4)
ggplot(iris_long, aes(name, value)) + 
  geom_boxplot() +
  labs(y = "Original value")

# what the code does

# The above code is used to generate a boxplot of the original values for each of the four features in the Iris dataset. The code begins by using the tidyr library, which provides functions to manipulate data frames. The code then uses the mutate and pivot_longer functions to create a long-form version of the Iris data frame, with an "id" column to identify each row. 
# 
# The code then uses ggplot to create a boxplot of the original values for each of the four features. The code sets the "name" column as the x-axis, and the "value" column as the y-axis. Finally, the code adds a label to the y-axis, indicating that it represents the original values.

library(tidyr)
iris_long_scaled <- iris %>% scale_numeric() %>% mutate(id = row_number()) %>% pivot_longer(1:4)
ggplot(iris_long_scaled, aes(name, value)) + 
  geom_boxplot() +
  labs(y = "Scaled value")

# what the code does

# The above code creates a boxplot of the scaled values of the four numeric variables in the iris dataset. It begins by scaling the numeric columns with scale_numeric() and then adds an id column. Note that scale_numeric() is not a tidyverse function, so it is assumed to be a helper defined earlier in the document. The data is then reshaped into long format using pivot_longer(). Finally, the boxplot is created with the ggplot2 package, with the name variable on the x-axis and the scaled value on the y-axis.

Scatterplot

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) + 
  geom_point()

# what the code does


# The above code creates a scatterplot of the Iris dataset using ggplot. The x-axis is set to Petal Length, and the y-axis is set to Petal Width. The color of each point is determined by the Species variable, which allows us to easily distinguish between the three species of Iris. The geom_point() function then plots each point in the scatterplot, making it easier to visualize the data.

Scatterplot matrix

library("GGally")
ggpairs(iris,  aes(color = Species))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# what the code does

# The above code generates a pairs plot displaying the relationships between different variables in the Iris data set. The code uses the GGally library to generate the plot. The ggpairs() function takes two parameters: the data set and an aesthetic, which in this case is the Species column. The function then creates a scatterplot matrix, with each variable in the data set plotted against each other variable. The color of each point is determined by the Species column.

Data matrix visualization

iris_matrix <- iris %>% select(-Species) %>% as.matrix()

# what the code does

# The code above takes the dataset "iris" and selects all columns except the "Species" column, then it converts the selected columns into a matrix. The matrix is then assigned to the variable "iris_matrix". This code is useful for creating a matrix of the data without the "Species" column. This could be used for further analysis on the other columns of data.

iris_long <- as_tibble(iris_matrix) %>% mutate(id = row_number()) %>% pivot_longer(1:4)
head(iris_long)

# what the code does

# The above code starts by converting the matrix iris_matrix into a tibble with as_tibble(). It then adds a column called id that holds the row number of each observation. Finally, pivot_longer(1:4) reshapes the four measurement columns from wide to long format: every original row becomes four rows, one per measurement, with the column name stored in name and the measured value stored in value. The result, a tibble with 600 rows and the columns id, name, and value, is stored in iris_long. head() then displays its first six rows.
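
# The reshaping is easier to follow on a very small table. The sketch below uses a made-up tibble named wide with two rows and two measurement columns; the data and names are purely illustrative.

```r
library(dplyr)
library(tidyr)

# A tiny, made-up wide table with two measurement columns
wide <- tibble(a = c(1, 2), b = c(10, 20))

# Add a row id, then stack columns a and b into name/value pairs
long <- wide %>%
  mutate(id = row_number()) %>%
  pivot_longer(1:2)

long
```

# Each of the two original rows turns into two rows, one for column a and one for column b, and the id column records which original row a value came from.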

ggplot(iris_long,
  aes(x = name, y = id, fill = value)) + geom_tile()

# what the code does


# The above code creates a plot using the ggplot2 package. It takes the dataset iris_long as the data source and maps x to the variable name, y to the variable id, and the fill color to the variable value. geom_tile() then adds a tile layer to the plot. The result is a grid of colored rectangles in which each rectangle represents one measurement: its position is given by name and id, and its color by value.

library(seriation)

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

## 
## Attaching package: 'seriation'

## The following object is masked from 'package:lattice':
## 
##     panel.lines

# what the code does
# 
# 
# The code above loads the seriation package; the messages that follow the call are printed while the package is attached. Seriation arranges a set of objects in a linear order based on their similarities or differences. The package provides functions to compute such orderings and to visualize reordered matrices, which can help uncover patterns in the data.

ggpimage(iris_matrix, prop = FALSE)

# what the code does

# The code above draws the matrix iris_matrix as an image using ggpimage() from the seriation package, which returns a ggplot2 plot. Each cell of the image corresponds to one entry of the matrix: the rows are the 150 flowers, the columns are the four measurements, and the color of a cell encodes its value.
# 
# Setting prop = FALSE means that the cells are not forced to be square, so the tall and narrow 150 × 4 matrix is stretched to fill the plotting area.

iris_scaled <- scale(iris_matrix)
ggpimage(iris_scaled, prop = FALSE)

# what the code does

# The above code standardizes the matrix iris_matrix created earlier, which holds the four numeric measurements (sepal and petal lengths and widths). scale() centers each column to a mean of 0 and rescales it to a standard deviation of 1, so that all columns are on a comparable scale. ggpimage() is then used to draw the scaled matrix as an image.
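
# What scale() does to each column can be written out with sweep(). The following is a sketch in base R that rebuilds the matrix so that it runs on its own; the names col_means, col_sds, centered, and manual_scaled are illustrative.

```r
data(iris)

iris_matrix <- as.matrix(iris[, 1:4])

col_means <- colMeans(iris_matrix)
col_sds   <- apply(iris_matrix, 2, sd)

# Subtract each column's mean, then divide by each column's standard deviation
centered      <- sweep(iris_matrix, 2, col_means, "-")
manual_scaled <- sweep(centered, 2, col_sds, "/")

# After scaling, every column has mean 0 (up to rounding) and standard deviation 1
round(colMeans(manual_scaled), 10)
apply(manual_scaled, 2, sd)
```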

ggpimage(iris_scaled, order = seriate(iris_scaled), prop = FALSE)

# what the code does


# The above code draws the scaled matrix again, this time with its rows and columns reordered. seriate() is called on iris_scaled and returns an ordering of the rows and of the columns that places similar rows and similar columns next to each other, which can reveal patterns that are not visible in the original order. ggpimage() receives the matrix together with this ordering through its order argument and draws the reordered image. As before, prop = FALSE means that the cells are not forced to be square.

Correlation Matrix

cm1 <- iris %>% select(-Species) %>% as.matrix %>% cor()
cm1

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

# what the code does

# The above code works by first selecting the columns of the iris dataset that do not contain the Species column. It then converts the remaining columns into a matrix format. Finally, it calculates the correlation between the columns and stores the result in the cm1 matrix.

library(ggcorrplot)
ggcorrplot(cm1)

# what the code does

# 
# The above code creates a correlation plot using the ggcorrplot package. It takes the correlation matrix cm1 as input and, with the default settings used here, draws one colored square per pair of variables. The color encodes the sign and the strength of the correlation, so strongly positive and strongly negative correlations stand out from values near zero.

gghmap(cm1, prop = TRUE)

# what the code does

# The above code draws the correlation matrix cm1 as a heatmap using gghmap() from the seriation package. Unlike a plain image of the matrix, gghmap() reorders the rows and columns with a seriation method so that similar variables are placed next to each other. Setting prop = TRUE asks for cells of equal width and height, which suits a square matrix such as cm1.

cm2 <- iris %>% select(-Species) %>% as.matrix() %>% t() %>% cor()

ggcorrplot(cm2)

# what the code does

# The above code computes and plots the correlations between the observations (flowers) instead of between the features. It starts by selecting all columns of the iris dataset except Species and converting them to a matrix. t() then transposes the matrix, so that the 150 flowers become the columns and the four measurements become the rows. Because cor() correlates the columns of its input, the result cm2 is a 150 × 150 matrix that holds the correlation between every pair of flowers. Lastly, ggcorrplot() draws this matrix as a heatmap.
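
# The effect of the transpose shows up in the dimensions of the result. The short sketch below uses base R; the name m is illustrative.

```r
data(iris)

m <- as.matrix(iris[, 1:4])

# cor() correlates columns: here the 4 features
dim(cor(m))

# After transposing, the columns are the 150 flowers
dim(cor(t(m)))
```

# Each correlation between two flowers is computed from only four values, so it describes how similar the shapes of the two measurement profiles are.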

Parallel Coordinates Plot

library(GGally)
ggparcoord(iris, columns = 1:4, groupColumn = 5)

# what the code does

# The above code is used to create a parallel coordinates plot from the iris data set. The code first imports the GGally library and then calls the ggparcoord function. The first argument is the iris dataset and the second argument is a vector of the first four columns of the dataset. The third argument is the group column, which is set to the fifth column of the dataset.

o <- seriate(as.dist(1-cor(iris[,1:4])), method = "BBURCG")
get_order(o)

## Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
##            3            4            1            2

# what the code does

# The above code first calculates the correlation matrix of the first four columns of the iris dataset. The correlations are turned into dissimilarities by subtracting each entry from 1, so that strongly correlated features are close to each other, and as.dist() converts the result into a distance object. seriate() then orders the features using the "BBURCG" method, a branch-and-bound method that optimizes the unweighted row/column gradient measure, so that dissimilarities tend to increase with distance from the diagonal of the reordered matrix. Finally, get_order() retrieves the resulting order as a vector of column positions: Petal.Length (3), Petal.Width (4), Sepal.Length (1), Sepal.Width (2).
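
# The dissimilarities that are passed to seriate() can be inspected directly. The sketch below uses base R only; the names cm, d, and d_mat are illustrative.

```r
data(iris)

cm <- cor(iris[, 1:4])

# 1 - correlation: 0 for perfectly correlated features, up to 2 for perfectly anti-correlated ones
d <- as.dist(1 - cm)
d_mat <- as.matrix(d)

round(d_mat, 3)

# The pair of features with the smallest dissimilarity
d_mat["Petal.Length", "Petal.Width"]
```

# Petal.Length and Petal.Width are the closest pair, which is consistent with them ending up next to each other in the order returned by get_order().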

ggparcoord(iris, columns = get_order(o), groupColumn = 5)

# what the code does


# The above code visualizes the iris dataset using the ggparcoord() function from the GGally package. The columns argument specifies which columns are drawn as axes and in which order; here the order comes from get_order(o), so that strongly correlated features are placed on neighboring axes. The groupColumn argument indicates which column is used to group and color the lines, in this case column 5 (Species). ggparcoord() then creates a parallel coordinates plot in which each flower is drawn as one line across the reordered axes.
