ASSIGNMENT 5
Due Apr 12 by 11:59pm
Data Engineering and Mining I
Spring 2025
Instructor: C. Pierre, Ph.D., M.Sc. in Analytics
##Q. 1: What is data visualization? This is a big area, so try and give an overview.
Data visualization presents data through graphs, charts, or maps, which are usually easier to understand than raw numbers. It can be more effective than other ways of presenting information, such as tables.
##Q. 2: List the two main graphics systems of R.
(1) Standard graphics (canonical graphics), implemented by the package graphics.
(2) Grid graphics, provided by the functions of the package grid.
Both packages are part of any base R installation.
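As a rough illustration of the two systems, the sketch below draws one plot with each. The choice of the iris data and of the shapes drawn with grid is my own assumption, made only for the example.

```r
# Standard (base) graphics: one high-level call draws a complete plot.
library(graphics)  # attached by default; loaded here only for clarity
hist(iris$Sepal.Length,
     main = "Histogram drawn with the graphics package",
     xlab = "Sepal.Length")

# Grid graphics: low-level building blocks placed on a blank page.
library(grid)
grid.newpage()                                   # start an empty page
grid.rect(width = 0.8, height = 0.8,
          gp = gpar(col = "steelblue"))          # a rectangle outline
grid.text("Drawn with the grid package")         # text in the page centre
```

Packages such as ggplot2 are built on top of grid, which is why both systems matter in practice.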
##Q. 3: List the tools for visualizing
R has many excellent tools for visual representations of a dataset.
There are tools for visualizing:
(1) a single variable,
(2) two variables, and
(3) several variables at once (multivariate plots).
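The three cases in the list can be sketched with ggplot2 as follows. The iris variables and the bin width of 0.25 are assumptions chosen only for illustration.

```r
library(ggplot2)

# (1) A single variable: the distribution of one continuous variable.
p_one <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

# (2) Two variables: the relationship between two continuous variables.
p_two <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# (3) Several variables: a third variable is shown through color.
p_multi <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                            color = Species)) +
  geom_point()

print(p_one)
print(p_two)
print(p_multi)
```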
##Q. 4: Explain faceting/facets
Faceting splits a plot into a set of subplots (facets), one for each value of a nominal variable, so that the same graph can be compared across subsets of the data.
##Q. 5(a): What are the two basic problems of barplots or barcharts?
* (i). If a barplot has too many categories, it becomes hard to read.
* (ii). If the frequencies are similar, the differences between the bars are hard to see.
##Q. 5(b): What are the solutions of each of the problems in (a) above?
* (i). Group similar (homogeneous) categories together.
* (ii). A point plot can help, because it makes small differences easier to see.
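Both solutions can be sketched on the mpg data that ships with ggplot2. The threshold of 10 cars used to decide which manufacturers are grouped into "other" is an arbitrary assumption for this example.

```r
library(ggplot2)

# Frequency of each manufacturer in ggplot2's mpg data (15 categories).
freq <- as.data.frame(table(manufacturer = mpg$manufacturer))

# (i) Group the categories with few observations into one "other" level.
#     The cut-off of 10 is an assumption made for illustration.
freq$group <- ifelse(freq$Freq < 10, "other", as.character(freq$manufacturer))
grouped <- aggregate(Freq ~ group, data = freq, FUN = sum)

# (ii) Show the frequencies as points instead of bars.
p_points <- ggplot(grouped, aes(x = Freq, y = reorder(group, Freq))) +
  geom_point() +
  labs(x = "Number of cars", y = "Manufacturer")
print(p_points)
```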
##Q. 6: It is generally a good idea to use the information provided by barcharts to create, or show, as a pie chart.
FALSE
##Q. 7: When analyzing data, explain what barcharts would be used for.
Barcharts are used to visualize frequencies, percentages, or summary statistics for discrete data.
##Q. 8(a): What does the package, GGally, provide? (e.g., what functions does it provide, etc.)
Package GGally (Schloerke et al., 2016) provides a series of interesting additions to the graphs available in package ggplot2. Among these are scatterplot matrices obtained
with function ggpairs().
#install.packages("GGally")
#install.packages("DMwR2")
data(algae, package="DMwR2")
library(GGally)
ggpairs(algae,columns=12:16)
ggpairs(algae,columns=2:5)
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 1 row containing non-finite outside the scale range (`stat_bin()`).
## Removed 1 row containing non-finite outside the scale range (`stat_bin()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removed 3 rows containing missing values
## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 2 rows containing non-finite outside the scale range (`stat_bin()`).
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
Another statistical graph provided by package GGally that also involves the visualization
of several variables at the same time, is the *parallel coordinate plot*. These graphs try
to show all observations of a dataset in a single plot, each being represented by a line. This
is clearly not feasible for large datasets, but sometimes these graphs are very interesting for
detecting strongly deviating cases in a dataset.
The figure below shows an example of this type of graph, where we compare the frequency values of all 200 observations of the algae dataset, with the color of the lines representing the season of the year when the respective water sample was collected. The following code produced that figure:
library(GGally)
dim(algae)
## [1] 200 18
head(as.data.frame(algae[, c(1, 12:18)]))
ggparcoord(algae,columns=12:18,groupColumn="season")
##Q. 8(b): What does the function, ggpairs, do?
The ggpairs() function from the *GGally* package creates a pairwise plot matrix from a data frame. It displays every pairwise relationship between the selected variables, together with the distribution of each variable, which makes correlations and distributions easy to inspect.
##Q. 9(a): Explain the function, facet_wrap().
The function facet_wrap() takes a nominal variable and creates one subplot for each of its values. The subplots are presented sequentially.
##Q. 9(b): Explain the function, facet_grid().
The function facet_grid() allows us to set up a matrix of plots with each dimension of
the matrix getting as many plots as there are values of the respective variable.
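The difference between the two faceting functions can be sketched on the mtcars data. The choice of cyl and am as faceting variables is an assumption made only for this example.

```r
library(ggplot2)

# A base scatter plot shared by both examples.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# facet_wrap(): one subplot per value of cyl, laid out one after another.
p_wrap <- base_plot + facet_wrap(~ cyl)

# facet_grid(): a matrix of subplots, rows for am and columns for cyl.
p_grid <- base_plot + facet_grid(am ~ cyl)

print(p_wrap)
print(p_grid)
```

With three values of cyl and two values of am, facet_wrap() gives three panels and facet_grid() gives a 2 x 3 matrix of panels.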
##Q. 10: Explain the following argument:
**aes(x = Sepal.Length, y = Sepal.Width)**
This is an aesthetic mapping passed to the ggplot() function: the variable "Sepal.Length" is mapped to the x-axis and "Sepal.Width" to the y-axis.
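A small sketch of this mapping in use is given below. Adding geom_point() is my own choice for illustration; aes() only records which variable goes with which aesthetic and draws nothing by itself.

```r
library(ggplot2)

# aes() stores the mapping; it does not look at the data yet.
m <- aes(x = Sepal.Length, y = Sepal.Width)
names(m)  # the aesthetics that were mapped

# The mapping becomes visible once a geom is added to the plot.
p <- ggplot(iris, m) + geom_point()
print(p)
```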
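##Q. 11: Give an interpretation of the symbol, “ ~ ”
In R, the symbol "~" defines a formula: the variable on the left-hand side is described by the variables on the right-hand side.
The following code computes the mean of the variable "mpg" for each group of the variable "cyl".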
aggregate(mpg~cyl,mtcars,mean)
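To make the role of "~" more concrete, the sketch below stores the formula in an object and then extends it with a second grouping variable. The use of "am" as the extra variable is an assumption made for illustration.

```r
# A formula is an ordinary R object that can be stored and inspected.
f <- mpg ~ cyl
class(f)     # the class of the object
all.vars(f)  # the variables named on both sides of "~"

# Mean of mpg for each value of cyl.
by_cyl <- aggregate(f, data = mtcars, FUN = mean)

# Mean of mpg for each combination of cyl and am.
by_cyl_am <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)

print(by_cyl)
print(by_cyl_am)
```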
##Q. 12: What are the aesthetics in a plot?
In ggplot2, an "aesthetic" is a visual property of the plotted objects to which data can be mapped, such as the position on the x-axis and y-axis, color, size, shape, or line type.
##Q. 13(a): What are layers?
Layers are responsible for creating the objects that we perceive on the plot.
##Q. 13(b): What are the five components of layers?
(1) Data
(2) Aesthetic mappings
(3) A statistical transformation (stat)
(4) A geometric object (geom)
(5) A position adjustment
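The five components can be written out explicitly with the ggplot2 function layer(). This is only a sketch; in everyday code a shortcut such as geom_point() fills in the stat and the position for us. The iris variables are an assumption chosen for the example.

```r
library(ggplot2)

p <- ggplot() +
  layer(
    data     = iris,                                   # (1) data
    mapping  = aes(x = Sepal.Length, y = Sepal.Width), # (2) aesthetic mappings
    stat     = "identity",                             # (3) statistical transformation
    geom     = "point",                                # (4) geometric object
    position = "identity"                              # (5) position adjustment
  )
print(p)
```

The result is the same scatter plot that ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() would draw.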
##Q. 14: What is scaling?
Scaling changes the scale of a variable so that the visualization is more effective. For example, if there are some extreme values, we can use a log scale. If we would like to compare incomes recorded in different currencies, we can convert the monetary values to one scale, such as US dollars.
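The log-scale case can be sketched as follows. The income values are made up for illustration and do not come from any dataset used in this assignment.

```r
library(ggplot2)

# Made-up incomes with one extreme value (illustrative data only).
incomes <- data.frame(
  person = c("A", "B", "C", "D"),
  income = c(20000, 35000, 50000, 5000000)
)

# On a linear scale the extreme value squeezes the other points together.
p_linear <- ggplot(incomes, aes(x = person, y = income)) + geom_point()

# On a log10 scale all four values remain distinguishable.
p_log <- p_linear + scale_y_log10()

print(p_linear)
print(p_log)
```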
##Q. 15: Explain layers and what they are used for
A layer in a ggplot2 plot combines data, aesthetic mappings, a geometric object, a statistical transformation, and a position adjustment. Layers are added on top of one another to produce the visualization.
##Q. 16: What are themes?
Themes control the appearance of the elements of a visualization that do not display the data, such as titles, axis labels, legends, fonts, and the background. These elements are not part of the layers.
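A short sketch of a theme in use is given below. The title text, the choice of theme_minimal(), and the legend position are assumptions made for illustration.

```r
library(ggplot2)

# The data layer and the label text.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Iris sepals", x = "Sepal length", y = "Sepal width")

# The theme changes only how the non-data elements look.
p_themed <- p +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))

print(p_themed)
```

The points drawn are identical in p and p_themed; only the background, the legend placement, and the title font differ.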
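##Q. 17: What are APIs?
An application programming interface (API) is a set of rules through which one program requests data or services from another program. Many APIs require an API key, which is a code used to identify and authenticate a user. Example: a Google API key is used to access Google Maps.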
##Q. 18: Give two examples of “geom,” and explain what they do.
A "geom" is short for "geometric object", the visual mark that represents the data points. For example, geom_point() draws points for a scatter plot and geom_bar() draws bars for a bar plot. Other examples are geom_polygon() for polygons, geom_histogram() for histograms, geom_boxplot() for boxplots, and geom_map() for maps.
##Q. 19: Explain the following function and all of its arguments, etc. ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot( )
It draws a boxplot of Sepal.Length for each Species.
Data: iris
x-axis = Species
y-axis = Sepal.Length
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot( )
##Q. 20: Explain the R graphics layered architecture.
The layered architecture in R graphics involves building a plot by adding distinct layers, each contributing to the final plot. A layer is made up of components such as data, aesthetic mappings, statistical transformations, and geometric objects.
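The idea of adding one layer at a time can be sketched as follows. The linear smoother in the second layer is an assumption chosen only to have a second layer to add.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layer has been added yet.
p0 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Layer 1: points.
p1 <- p0 + geom_point()

# Layer 2: a straight-line fit drawn on top of the points.
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The number of layers grows with each addition.
length(p0$layers)
length(p1$layers)
length(p2$layers)

print(p2)
```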
##Q. 21: In “ggplot” what does “gg” stand for?
Grammar of Graphics
##Q. 22(a): List some aesthetic attributes.
Color, shape, size, x position, y position
##Q. 22(b): List some geometric objects that are defined by the grammar for graphics.
Points, Lines, Bars
##Q. 23: What statistical plot can we use to explore the distribution of the values of a nominal variable?
Bar plot, Pie chart
##Q. 24: Use the ggplot2 package to write an algorithm, or a chunk of code, that will create a plot of the distribution of the values of a continuous variable (use any geom except the histogram). Choose the correct geom. Use the iris dataset.
##(a) Then, explain why you chose your algorithm,
##(b) explain why you chose the functions you used,
##(c) explain why you chose your geom, and
##(d) show your plot.
Ans:
(a). I chose a boxplot of the continuous variable Sepal.Width, grouped by Species.
(b). These functions display the distribution of continuous data.
(c). geom_boxplot() suits a single continuous variable.
(d). Plot below.
library(tidyverse)
iris %>% ggplot(aes(x = Species, y = Sepal.Width)) + geom_boxplot()
##Q. 25: Use the ggplot2 package to write an algorithm, or a chunk of
code, that will create a plot of the distribution of the values of a
continuous variable using a histogram. Use the iris dataset.
##(a) Then, explain why you chose to use your aes( ) function,
##(b) explain why you chose to use your particular geom, and
##(c) show your plot.
Ans:
(a) aes(x = Sepal.Width) was used to draw a histogram of the continuous variable Sepal.Width.
(b) geom_histogram() was chosen because this function draws a histogram.
(c) Plot below.
iris %>% ggplot(aes(x = Sepal.Width)) + geom_histogram() + facet_wrap( ~ Species)+
coord_flip()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
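The message above suggests choosing a bin width instead of relying on the default of 30 bins. A possible variant is sketched below; the value 0.2 is my own assumption and other values may suit the data better.

```r
library(ggplot2)

# Same histogram with an explicit bin width, so stat_bin() does not
# fall back on its default of 30 bins. The width 0.2 is an assumption.
p_hist <- ggplot(iris, aes(x = Sepal.Width)) +
  geom_histogram(binwidth = 0.2) +
  facet_wrap(~ Species)

print(p_hist)
```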