R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

## This is Homework #5, following the textbook from page 96.
library(ggplot2)
# #1. What is data visualization? This is a big area, so try and give an overview.

# Data visualization is the practice of using visual elements such as charts, graphs, maps, and dashboards to represent data and information. Its core purpose is to simplify complex datasets so that a broad audience can understand them, regardless of technical expertise. It turns raw data into intuitive visual formats that let viewers quickly identify trends, patterns, relationships, and outliers.
# #2. List the two main graphics systems of R.

# R has two main graphics systems: (i) the standard graphics and (ii) the grid graphics. The first is implemented by the package graphics, while the second is provided by functions of the package grid.
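
# A minimal sketch of the two systems, assuming the built-in iris data set.
# plot() comes from the graphics package; grid.newpage(), grid.rect(), and
# grid.text() come from the grid package. Both packages ship with R.
library(grid)

# Standard graphics: one high-level call draws a complete scatterplot
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal Length", ylab = "Sepal Width")

# grid graphics: start a blank page and add low-level objects one at a time
grid.newpage()
grid.rect(width = 0.8, height = 0.8)
grid.text("Drawn with grid")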
# #3. List the tools for visualizing.

# The tools are grouped by the number of variables they display: (i) plots of a single variable; (ii) plots of two variables; and (iii) multivariate plots.
# #4. Explain faceting/facets.

# Faceting is a visualization technique used to create multiple subplots (facets) from a single plot, each displaying a subset of the data based on a categorical variable. This allows for easy comparison of trends and patterns across different groups within the data. In R, particularly within the ggplot2 package, faceting is achieved using the facet_wrap() and facet_grid() functions.
# #5. 

# (a) What are the two basic problems of barplots or barcharts?

# When the labels of the values of the nominal variable are too long, the graph can be hard to read unless the bars are plotted horizontally. A second problem is that the differences between a few of the bars can be almost indistinguishable.

# Two further problems are oversimplification and misinterpretation due to scales. Bar charts can oversimplify data by showing only aggregated values (like means) without displaying the underlying variability or distribution within categories, which can lead to an inaccurate or incomplete understanding of the data. Bar charts are also susceptible to misinterpretation because of how viewers perceive and process visual information, such as errors in judging the height of bars or focusing on individual bars rather than the overall comparison.

# (b) What are the solutions of each of the problems in (a) above?

# For long labels, one could use the coord_flip() function to swap the x and y coordinates so that the bars are drawn horizontally. When a bar chart is not the most appropriate visualization, consider alternative chart types such as histograms, dot plots, or line charts to better represent the data.
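
# A minimal sketch of the horizontal-bar solution, assuming the mpg data set
# that ships with ggplot2; p_flip is a hypothetical name used only for this
# illustration.
library(ggplot2)

# One bar per vehicle class; coord_flip() swaps the axes so that the class
# labels are written horizontally along the vertical axis
p_flip <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()
p_flip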
# #6. It is generally a good idea to use the information provided by barcharts to create, or show, as a pie chart. TRUE or  FALSE

# False. Comparing the heights of the bars is much easier for the eyes than comparing areas of slices, particularly if there are many slices/values. 
# #7. When analyzing data, explain what barcharts would be used for.

# They are used for the visualization of the values of a nominal variable. We explore the distribution of these values and the graphs will show as many bars as there are different values of the variable, with the height of the bars corresponding to the frequency of the values. 
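
# The link between bar heights and frequencies can be checked with a small
# sketch, assuming the mpg data set that ships with ggplot2. The names
# class_counts and p_bar are hypothetical and used only for this illustration.
library(ggplot2)

# Frequency of each value of the nominal variable class
class_counts <- table(mpg$class)
class_counts

# geom_bar() counts the rows per class and draws one bar per value
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()
p_bar

# layer_data() returns the values that ggplot2 computed for the bars
layer_data(p_bar)$count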
# #8. 

# (a) What does the package, GGally, provide? (e.g., what functions does it provide, etc.)

# GGally provides a series of additions to the graphs available in package ggplot2. Among these are scatterplot matrices, obtained with the function ggpairs(), and parallel coordinate plots, obtained with the function ggparcoord().

# (b) What does the function, ggpairs, do?

# It draws a scatterplot matrix: a grid of plots that shows every pairwise combination of the variables in a data frame.
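
# A minimal sketch of ggpairs(), assuming the GGally package is installed;
# p_pairs is a hypothetical name used only for this illustration.
library(ggplot2)
library(GGally)

# Scatterplot matrix of the four numeric iris columns, colored by Species
p_pairs <- ggpairs(iris, mapping = aes(color = Species), columns = 1:4)
p_pairs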
# #9. 
# (a)  Explain the function, facet_wrap().

# This function allows you to indicate a nominal variable whose values will create a set of subplots that will be presented sequentially with reasonable wrapping around the screen space.

#  (b)  Explain the function, facet_grid().

# The function facet_grid() allows one to set up a matrix of plots with each dimension of the matrix getting as many plots as there are values of the respective variable. For each cell of this matrix the graph specified before the facet is shown using only the subset of rows that have the respective values on the variables defining the grid.
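
# A minimal sketch of both faceting functions, assuming the mpg data set that
# ships with ggplot2. The names p_base, p_wrap, and p_grid are hypothetical and
# used only for this illustration.
library(ggplot2)

# The plot that is repeated in every facet
p_base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_wrap(): one subplot per value of class, wrapped across the screen
p_wrap <- p_base + facet_wrap(~ class)
p_wrap

# facet_grid(): a matrix of plots, with rows for drv and columns for cyl
p_grid <- p_base + facet_grid(drv ~ cyl)
p_grid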
# #10. Explain the following argument:
# >   aes(x = Sepal.Length, y = Sepal.Width)

# This argument maps the variable Sepal.Length to the x-axis and the variable Sepal.Width to the y-axis.
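
# A minimal sketch of this mapping in use, assuming the built-in iris data
# set. The names xy_mapping and p_scatter are hypothetical and used only for
# this illustration.
library(ggplot2)

# aes() only describes the mapping; it draws nothing by itself
xy_mapping <- aes(x = Sepal.Length, y = Sepal.Width)

# A geom is needed to draw the mapped variables, here as points
p_scatter <- ggplot(iris, xy_mapping) +
  geom_point()
p_scatter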
# #11. Give an interpretation of the symbol, “ ~ ”? 

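# The tilde symbol separates the response and predictor variables in the specification of a model. In faceting functions such as facet_wrap() and facet_grid(), the same formula notation names the variables that define the facets.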
# #12. What are the aesthetics in a plot?  

# Aesthetics are the visual properties of the objects in a plot, such as color, shape, size, and position, to which variables in the data can be mapped.
# #13. 
# (a) What are layers?

# A layer is one component of a plot that is drawn on top of the others; together, the layers create what we see in the finished plot.
 
# (b) What are the five components of layers?
      # 1. Data
      # 2. Aesthetic mappings
      # 3. Statistical transformation 
      # 4. Geometric object 
      # 5. Position adjustment
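
# The five components can be spelled out with the ggplot2 function layer().
# This is a minimal sketch, assuming the built-in iris data set; p_layer is a
# hypothetical name used only for this illustration.
library(ggplot2)

p_layer <- ggplot() +
  layer(
    data = iris,                                       # 1. data
    mapping = aes(x = Sepal.Length, y = Sepal.Width),  # 2. aesthetic mappings
    stat = "identity",                                 # 3. statistical transformation
    geom = "point",                                    # 4. geometric object
    position = "identity"                              # 5. position adjustment
  )
p_layer

# In everyday code, a shortcut such as geom_point() builds the same kind of layer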
# #14. What is scaling?

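# Scaling is a process that involves adjusting the range and distribution of numerical data so that all features or variables are presented on a comparable scale. In ggplot2, scales play a related role: they control how data values are converted to aesthetic values such as positions, colors, and sizes.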
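
# A minimal sketch of two common ways to put a numeric variable on a
# comparable scale, assuming the built-in iris data set. The names x,
# x_minmax, and x_z are hypothetical and used only for this illustration.
x <- iris$Sepal.Length

# Min-max rescaling: the smallest value becomes 0 and the largest becomes 1
x_minmax <- (x - min(x)) / (max(x) - min(x))
range(x_minmax)

# Standardization with base R scale(): subtract the mean, divide by the
# standard deviation
x_z <- as.vector(scale(x))
mean(x_z)
sd(x_z)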
# #15. Explain layers and what they are used for.

# Layers are building blocks that allow one to construct and customize complex visualizations. Each layer is one level of the drawing and holds a specific visual element or aspect of the data.
# #16. What are themes?

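# Themes refer to the overall aesthetic and visual style of a plot or visualization. They are akin to design templates and control the elements that do not depend on the data, such as fonts, background, grid lines, and legend placement.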
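
# A minimal sketch of applying a theme, assuming the built-in iris data set.
# The names p_plain and p_themed are hypothetical and used only for this
# illustration.
library(ggplot2)

# The same data and layers, first with the default theme
p_plain <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
p_plain

# theme_bw() swaps in a complete built-in theme; theme() then adjusts a single
# element, here the position of the legend
p_themed <- p_plain +
  theme_bw() +
  theme(legend.position = "bottom")
p_themed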
# #17. What are APIs?  

# An application programming interface (API) is a set of rules and tools that allow different software applications to communicate and exchange data. An API key is a code used to identify and authenticate a user.
# #18. Give two examples of “geom,” and explain what they do.

# One example is the geom_point() function, which is used to draw points on a plot. Another example is the geom_bar() function, which is used to create the bars of a bar graph.
# #19. Explain the following function and all of its arguments, etc.
 
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

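# ggplot() takes the iris data set as its data argument, and aes() maps the variable Species to the x-axis and Sepal.Length to the y-axis. geom_boxplot() then adds a layer that draws one boxplot per species, showing the median, the first and third quartiles, whiskers that reach the most extreme values within 1.5 times the interquartile range, and any outliers beyond the whiskers as points.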
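
# The values behind each box can be inspected with the ggplot2 function
# layer_data(). This is a minimal sketch; p_box and box_stats are hypothetical
# names used only for this illustration.
library(ggplot2)

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# One row per species: whisker ends (ymin, ymax), quartiles (lower, upper),
# and the median (middle)
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats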
# #20. Explain the R graphics layered architecture.

# In the layered architecture, a plot is built by stacking layers, and the layers together create what we see. Each layer is structured as data, aesthetic mappings, a statistical transformation, a geometric object, and a position adjustment.
# #21. In “ggplot” what does “gg” stand for?

# Grammar of Graphics
# #22. 
# (a) List some aesthetic attributes.

# Color, size, shape, line type, position

# (b) List some geometric objects that are defined by the grammar for graphics.

# Bar graph, point, histogram, boxplot, polygon
# #23. What statistical plot can we use to explore the distribution of the values of a nominal variable?

# Either a bar graph or a pie chart can be used when there is a nominal variable, although the bar graph is easier to read (see #6).
# #24. Use the ggplot2 package to write an algorithm, or a chunk of code, that will create a plot of the distribution of the values of a continuous variable (use any geom except the histogram). Choose the correct geom.  Use the iris dataset.

library(ggplot2)
data(iris)

# Create a frequency polygon of Sepal.Length
# binwidth is set explicitly so that stat_bin() does not fall back to its
# default of 30 bins
ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_freqpoly(binwidth = 0.5, color = "red", linewidth = 1) +
  labs(title = "Sepal Length Frequency Polygon",
       x = "Sepal Length",
       y = "Count") +
  theme_minimal()

# (a) Then, explain why you chose your algorithm,

# I chose this algorithm because it is simple and builds the plot layer by layer.

# (b) explain why you chose the functions you used,

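# I chose these functions because ggplot() sets up the data and the aesthetic mapping, geom_freqpoly() draws the distribution, labs() adds the title and axis labels, and theme_minimal() gives the plot a clean look.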

# (c) explain why you chose your geom, and

# I chose geom_freqpoly() because a frequency polygon is appropriate for a continuous variable: it bins the values and connects the bin counts with a line.


# (d) show your plot.
# #25. Use the ggplot2 package to write an algorithm, or a chunk of code, that will create a plot of the distribution of the values of a continuous variable using a histogram. Use the iris dataset.

# This line will create a histogram of Sepal.Length
ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.5, fill = "pink", color = "black") +
  labs(title = "Sepal Length Histogram",
       x = "Sepal Length",
       y = "Frequency") +
  theme_minimal()

# (a) Then, explain why you chose to use your aes( ) function,

# The aes() function is needed to assign the variable Sepal.Length to the x-axis.

# (b) explain why you chose to use your particular geom, and

# The instructions called for a histogram, which suits a continuous variable because it groups the values into bins and shows how many observations fall in each bin.

# (c) show your plot.