Assignment No. 1. Data Visualization from R for Data Science

# Source of the information and exercises: https://r4ds.hadley.nz/data-visualize

# We will work with palmerpenguins
remove(list = ls())

library(palmerpenguins)
data(penguins)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#1.2.5 Exercises ####
#1. How many rows are in penguins? How many columns?

dim(penguins)

## [1] 344   8

# 344 rows and 8 columns (344 observations and 8 variables)
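
# As a small complement to dim(), the two numbers can also be read separately.
# This is only a sketch; the names n_rows and n_cols are illustrative and not part of the exercise.

```r
library(palmerpenguins)

# nrow() and ncol() return the two components of dim() separately.
n_rows <- nrow(penguins)
n_cols <- ncol(penguins)
cat("rows:", n_rows, "columns:", n_cols, "\n")
```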

#2. What does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.

?penguins

## starting httpd help server ... done

# a number denoting bill depth (millimeters)
# It describes the depth (height) of the penguins' bill, in millimeters.

# 3. Make a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.

gg1 <- ggplot(data = penguins,
              mapping = aes(x = bill_length_mm,
                            y = bill_depth_mm, colour = species)) +
  geom_point() + 
  labs(title = "Relationship between bill length and bill depth",
       subtitle = "Palmer Penguins",
       caption = "Source: palmerpenguins data")
plot(gg1)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

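# Within each species, the typical relationship is that longer bills are also deeper, and vice versa. Pooled across species the trend is much less clear, so the colouring by species matters.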
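
# A quick numerical check of the relationship between the two bill measurements (a sketch, not part of the original exercise):
# compare the correlation for all penguins pooled together with the correlation inside each species.
# The names overall and by_species are illustrative.

```r
library(palmerpenguins)

# Pooled correlation, ignoring species (rows with NA are skipped).
overall <- cor(penguins$bill_length_mm, penguins$bill_depth_mm,
               use = "complete.obs")

# The same correlation computed separately for each species.
by_species <- sapply(
  split(penguins, penguins$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
)

print(overall)
print(by_species)
```

# The pooled value and the per-species values can have opposite signs, which is why the species grouping should be kept in mind when reading the scatterplot.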

# 4. What happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?

gg2 <- ggplot(data = penguins,
              mapping = aes(x = species,
                            y = bill_depth_mm, colour = species)) + 
  geom_point()
plot(gg2)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

# The points line up in three vertical strips, one per species, and overlap heavily. Visually, Adelie has the penguins with the deepest bills. A boxplot (geom_boxplot()) would be a better choice of geom.
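
# A possible alternative geom for this exercise, sketched with geom_boxplot().
# The object name gg2_box is illustrative; na.rm = TRUE only silences the missing-value warning.

```r
library(ggplot2)
library(palmerpenguins)

# One box per species summarises the distribution of bill depth
# instead of stacking overlapping points in a vertical strip.
gg2_box <- ggplot(data = penguins,
                  mapping = aes(x = species, y = bill_depth_mm)) +
  geom_boxplot(na.rm = TRUE)
plot(gg2_box)
```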

# 5. Why does the following give an error and how would you fix it?

# ggplot(data = penguins) + 
#  geom_point()

# The problem with this code is that it never says what to plot on x and y: geom_point() requires both aesthetics, so ggplot raises an error for lack of information.
# Example of how it could be written without errors:
plot(ggplot(data = penguins,
       mapping = aes(x = sex,
                     y = island)) + 
  geom_point())

# 6. What does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.

gg3 <- ggplot(data = penguins,
              mapping = aes(x = bill_length_mm,
                            y = bill_depth_mm, colour = species)) + 
  geom_point(na.rm = TRUE)
plot(gg3)

# The na.rm argument controls how missing values (NA) are handled.
# The default is FALSE: rows with missing values are removed with a warning. With na.rm = TRUE they are removed silently.
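
# To see what na.rm acts on, the sketch below counts the rows with a missing bill measurement and builds the same plot with and without the argument.
# The names n_missing, p_default and p_silent are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Rows that cannot be drawn because a bill measurement is missing.
n_missing <- sum(is.na(penguins$bill_length_mm) | is.na(penguins$bill_depth_mm))
print(n_missing)

# Default (na.rm = FALSE): the rows are dropped and a warning reports it.
p_default <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

# na.rm = TRUE: the same rows are dropped, but silently.
p_silent <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(na.rm = TRUE)

plot(p_default)
plot(p_silent)
```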

# 7. Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().

gg4 <- ggplot(data = penguins,
              mapping = aes(x = bill_length_mm,
                            y = bill_depth_mm, colour = species)) + 
  geom_point(na.rm = TRUE) + 
  labs(caption = "Data come from the palmerpenguins package.")
plot(gg4)

# 8. Recreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?

gg5 <- ggplot(penguins,
              aes(x = flipper_length_mm,
                  y = body_mass_g,
                  colour = bill_depth_mm)) + 
  geom_point(na.rm = TRUE) + geom_smooth(method = "loess")
plot(gg5)

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

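# bill_depth_mm should be mapped to colour. Here it is mapped globally, so geom_smooth() inherits it as well and drops it, which explains the warning above; mapping it inside geom_point() (geom level) avoids that.
# The only new element used is "loess".
# LOESS is a nonparametric regression method that fits a smooth curve to the data by running small local regressions along the data points.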
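
# A sketch of the same plot with colour mapped at the geom level instead of globally.
# The object name gg5_geom is illustrative; formula = y ~ x is written out only to avoid the informational message.

```r
library(ggplot2)
library(palmerpenguins)

# colour is mapped only inside geom_point(), so geom_smooth() sees just
# x and y and draws a single curve for all penguins.
gg5_geom <- ggplot(penguins,
                   aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(colour = bill_depth_mm), na.rm = TRUE) +
  geom_smooth(method = "loess", formula = y ~ x, na.rm = TRUE)
plot(gg5_geom)
```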

# 9. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

# 10. Will these two graphs look different? Why/why not?

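# They will look the same because both specify the same data and mappings. The only difference is that gg6 declares them once at the global level, which is tidier, while gg7 repeats them in each geom.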

gg6 <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()
plot(gg6)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

gg7 <- ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )
plot(gg7)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 2 rows containing non-finite outside the scale range (`stat_smooth()`).
## Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
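
# One way to check that global and per-geom mappings lead to the same result is to compare the data ggplot2 computes for a layer.
# This is a sketch limited to the point layer; p_global, p_local and same_points are illustrative names.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping declared once, at the global level.
p_global <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Same data and mapping, declared inside the geom.
p_local <- ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = flipper_length_mm, y = body_mass_g),
             na.rm = TRUE)

# layer_data() returns the data frame ggplot2 computed for one layer.
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
print(same_points)
```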

# 1.4.3 Exercises ####

# 1. Make a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?
gg8 <- ggplot(
  data = penguins, aes(y = species)) +
  geom_bar()
plot(gg8)

# The bars are now horizontal: species is on the y-axis and the count of each species is on the x-axis.

# 2. How are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?

gg9 <- ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red") +
  labs(title = "gg9.")
plot(gg9)

gg10 <- ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red") +
  labs(title = "gg10.")
plot(gg10)

# They differ because gg9 only outlines the bars in red, while gg10 fills the bars with red. The fill aesthetic used in gg10 is more useful than color for changing the color of bars.

# 3. What does the bins argument in geom_histogram() do?
# It specifies the number of intervals (bins) into which the data are divided for the plot.
gg11 <- ggplot(data = penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 1000) +
  labs(title ="gg11")
plot(gg11)

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
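
# For comparison, a sketch of the two ways to control histogram bins: bins sets how many there are, binwidth sets how wide each one is.
# The values 20 and 200 are arbitrary choices for illustration, as are the names p_bins and p_width.

```r
library(ggplot2)
library(palmerpenguins)

# bins fixes how many intervals are drawn.
p_bins <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 20, na.rm = TRUE)

# binwidth fixes how wide each interval is, in data units (here 200 grams);
# the number of bins then follows from the range of the data.
p_width <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200, na.rm = TRUE)

plot(p_bins)
plot(p_width)
```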

# 4. Make a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?

data("diamonds")
ggdiamonds <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram() + labs(title = "ggdiamonds")
plot(ggdiamonds)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggdiamonds2 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01) + labs(title = "ggdiamonds2")
plot(ggdiamonds2)

ggdiamonds3 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5) + labs(title = "ggdiamonds3")
plot(ggdiamonds3)

ggdiamonds4 <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 1) + labs(title = "ggdiamonds4")
plot(ggdiamonds4)

# ggdiamonds2 (binwidth = 0.01) seems the most interesting to me: each bin covers a smaller range, so it shows more detail than the wider bins.
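
# The detail that a narrow binwidth exposes can also be inspected as a table.
# This sketch counts how often each carat value occurs; carat_counts and top_carats are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset

# Count how often each carat value occurs and keep the five most common.
carat_counts <- sort(table(diamonds$carat), decreasing = TRUE)
top_carats <- head(carat_counts, 5)
print(top_carats)
```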

# 1.5.5 Exercises ####

#1. The mpg data frame that is bundled with the ggplot2 package contains 234 observations collected 
#by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? 
#(Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?

str(mpg)

## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

# Categorical (chr): manufacturer, model, trans, drv, fl, class.
# Numerical (num or int): displ, year, cyl, cty, hwy.
# When mpg is printed, the type abbreviation shown under each column name (chr, int or dbl) gives this information.
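
# The same classification can be derived from the column types instead of being read off by eye.
# This is a sketch that treats every character column as categorical; the names is_chr, categorical and numerical are illustrative.
# Integer columns such as year or cyl could also be argued to be categorical, which this simple rule does not capture.

```r
library(ggplot2)  # provides the mpg dataset

# Character columns are treated as categorical; numeric and integer
# columns as numerical.
is_chr <- vapply(mpg, is.character, logical(1))
categorical <- names(mpg)[is_chr]
numerical <- names(mpg)[!is_chr]

print(categorical)
print(numerical)
```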


#2. Make a scatterplot of hwy vs. displ using the mpg data frame. 
#Next, map a third, numerical variable to color, then size, then both color and size, 
#then shape. How do these aesthetics behave differently for categorical vs. numerical variables?

gg155 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
plot(gg155)

gg1551 <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point()
plot(gg1551)

gg1552 <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()
plot(gg1552)

gg1553 <- ggplot(mpg, aes(x = displ, y = hwy, color = cty, size = cty)) +
  geom_point()
plot(gg1553)

gg1554 <- ggplot(mpg, aes(x = displ, y = hwy, shape = as.factor(cty))) +
  geom_point()
plot(gg1554)

## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 21 values. Consider specifying shapes manually if you need
##   that many have them.

## Warning: Removed 137 rows containing missing values or values outside the scale range
## (`geom_point()`).

#3. In the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?

ggplot(mpg, aes(x = displ, y = hwy, linewidth = cty)) +
  geom_point()

#4. What happens if you map the same variable to multiple aesthetics?

# Two properties of the plot change together: for example, both the shape and the color of the points encode the same variable, which is redundant. For shape there is a maximum of 6 discrete values.

#5. Make a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. 
#What does adding coloring by species reveal about the relationship between these two variables? 
#What about faceting by species?

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point()

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ species)

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Faceting creates three separate panels, one per species.

#6. Why does the following yield two separate legends? How would you fix it to combine the two legends?

ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm, 
    color = species, shape = species
  )
) +
  geom_point() +
  labs(color = "Species")

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
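
# A possible fix for the two legends: labs(color = "Species") renames only the colour legend, so the shape legend keeps its default title and the two are drawn separately.
# The sketch below gives both scales the same title; gg_legend is an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# Giving the colour and shape scales the same title lets ggplot2
# merge them into a single legend.
gg_legend <- ggplot(
  data = penguins,
  mapping = aes(
    x = bill_length_mm, y = bill_depth_mm,
    color = species, shape = species
  )
) +
  geom_point(na.rm = TRUE) +
  labs(color = "Species", shape = "Species")
plot(gg_legend)
```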

#7. Create the two following stacked bar plots. Which question can you answer with the first one? 
#Which question can you answer with the second one?

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")

ggplot(penguins, aes(x = species, fill = island)) +
  geom_bar(position = "fill")

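# The first plot shows the proportion of each penguin species on each island. The second shows, for each species, the proportion living on each island.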
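
# The proportions drawn by position = "fill" can also be computed as tables.
# This is a sketch using base R's table() and prop.table(); species_by_island and island_by_species are illustrative names.

```r
library(palmerpenguins)

# Rows are islands, columns are species; margin = 1 makes each row
# (island) sum to 1, matching the first stacked bar plot.
species_by_island <- prop.table(table(penguins$island, penguins$species), margin = 1)

# margin = 2 makes each column (species) sum to 1, matching the second plot.
island_by_species <- prop.table(table(penguins$island, penguins$species), margin = 2)

print(round(species_by_island, 2))
print(round(island_by_species, 2))
```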

#1.6.1 Exercises ####

#1. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?

ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

ggsave("mpg-plot.png")

## Saving 7 x 5 in image

# The second one, because ggsave() saves the last plot displayed by default.

#2. What do you need to change in the code above to save the plot as a PDF instead of a PNG? 
#How could you find out what types of image files would work in ggsave()?

ggsave("mpg-plot.pdf")

## Saving 7 x 5 in image

?ggsave
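
# To save a plot other than the last one displayed, ggsave() accepts the plot explicitly.
# A minimal sketch: the file name, the temporary directory and the object names p_bar and out_file are assumptions made for illustration.

```r
library(ggplot2)

p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Passing plot = saves this plot even if it was not the last one displayed.
# The .pdf extension selects the PDF device.
out_file <- file.path(tempdir(), "mpg-bar-plot.pdf")
ggsave(out_file, plot = p_bar, width = 7, height = 5)
print(file.exists(out_file))
```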

Natalia Torrico Saavedra

Start: Monday, August 13, 2024