HW04

Author

Xiangzhe Li

Question 1

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

exercise_data <- read_csv("https://raw.githubusercontent.com/vaiseys/dav-course/main/Data/visualize_data.csv")

New names:
• `` -> `...1`
• `...1` -> `...2`
Rows: 142 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): ...1, ...2, Exercise, BMI

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(exercise_data)

Rows: 142
Columns: 4
$ ...1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ ...2     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ Exercise <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.7692, 38.7179, 35.6410…
$ BMI      <dbl> 1.8320590, 1.7892194, 1.7321050, 1.6178724, 1.5036362, 1.3751…

What do you expect the relationship to look like?

ANS: I predict that exercise time and BMI are negatively correlated. At relatively low BMI, however, the trend should flatten, because people who exercise a lot do not keep getting thinner; they tend to keep their weight in a healthy range.

cor(exercise_data$Exercise, exercise_data$BMI)

[1] -0.06447185

What does the output indicate?

ANS: A very weak negative correlation, almost none.

ggplot(exercise_data, aes(x = Exercise, y = BMI)) +
  geom_point(alpha = 0.7) +
  labs(x = "Exercise", y = "BMI")

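ANS: The points trace the outline of a dinosaur. The near-zero correlation says nothing about the shape of the data.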
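A near-zero correlation does not rule out structure in the data, because Pearson's r only measures linear association. The following is a minimal sketch using made-up points on a parabola (the vectors `x` and `y` are hypothetical and are not the homework data):

```r
# Hypothetical data: y is fully determined by x, but not linearly.
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Pearson's r captures linear association only, so the symmetric
# parabola gives a coefficient that is zero up to rounding error.
r <- cor(x, y)
print(round(r, 10))
```

This is why plotting the data before trusting a summary statistic is worthwhile.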

Question 2

library(causact)

WARNING: The 'r-causact' Conda environment does not exist. To use the 'dag_numpyro()' function, you need to set up the 'r-causact' environment. Run install_causact_deps() when ready to set up the 'r-causact' environment.


Attaching package: 'causact'

The following objects are masked from 'package:stats':

    binomial, poisson

The following objects are masked from 'package:base':

    beta, gamma

CPI2017: An integer on a scale of 0 to 100; the smaller the value, the more corrupt a country is perceived to be.

HDI2017: A measure of a nation's level of development, built from several criteria such as education and the economy.

Question 3

ggplot(corruptDF, aes(x = HDI2017, y = CPI2017)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "HDI vs CPI (2017)",
    x = "Human Development Index (2017)",
    y = "Corruption Perceptions Index (2017)"
  )

Describe the relationship that you see.

ANS: There is a strong positive correlation between HDI2017 and CPI2017.

Question 4

ggplot(corruptDF, aes(x = HDI2017, y = CPI2017)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 5), se = FALSE, linewidth = 1) +
  labs(
    title = "HDI vs CPI (2017)",
    x = "Human Development Index (2017)",
    y = "Corruption Perceptions Index (2017)"
  )

`geom_smooth()` using formula = 'y ~ x'

What are the differences?

ANS: method = "lm" fits a straight line, while method = "gam" fits a smooth curve that can bend to follow the data.

Which one do you prefer?

ANS: I prefer the GAM because its flatter sections capture the range where the relationship is relatively weak.
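The difference between the two smoothers can also be checked numerically. The sketch below is only an illustration under stated assumptions: it uses simulated data (the vectors `x` and `y` are hypothetical, not `corruptDF`) that rise and then level off, and it compares the residual sum of squares of a straight-line fit with that of a GAM using the same `s(x, k = 5)` smooth as in the plot. It assumes the mgcv package is installed, which is normally the case because it ships with R as a recommended package.

```r
library(mgcv)  # provides gam() and s()

set.seed(1)
# Hypothetical data: a curve that rises quickly and then flattens, plus noise.
x <- runif(150)
y <- 1 - exp(-5 * x) + rnorm(150, sd = 0.05)

fit_lm  <- lm(y ~ x)             # straight line
fit_gam <- gam(y ~ s(x, k = 5))  # penalized smooth curve

# Residual sum of squares: smaller means the fit tracks the points more closely.
rss_lm  <- sum(residuals(fit_lm)^2)
rss_gam <- sum(residuals(fit_gam)^2)
print(c(lm = rss_lm, gam = rss_gam))
```

On curved data like this, the GAM should leave a smaller residual sum of squares than the line, which matches what the flatter sections of the smooth show visually.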

Question 5

ggplot(corruptDF, aes(x = HDI2017, y = CPI2017, color = region)) +
  geom_point(alpha = 0.6) +
  geom_smooth(aes(fill = region),
              method = "gam", formula = y ~ s(x, k = 5),
              se = TRUE, alpha = 0.2, linewidth = 1) +
  labs(title = "HDI vs CPI (2017) by Region",
       x = "HDI2017", y = "CPI2017")

What do you see?

ANS: Colored points with correspondingly colored smooth lines, all overlapping one another.

Are patterns clear or is the graph too cluttered?

ANS: Too cluttered; the GAM lines overlap, making the patterns hard to recognize.

What would be another way to show these trends by region that would be more legible?

ANS: Faceting, because each region gets its own panel while all panels share the same axes, which makes it easier to compare shape and strength.

ggplot(corruptDF, aes(x = HDI2017, y = CPI2017)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 5), se = FALSE, linewidth = 1) +
  facet_wrap(~ region, ncol = 3, scales = "fixed") +
  labs(title = "HDI vs CPI (2017) by Region",  x = "HDI2017", y = "CPI2017")

Question 6

ggplot(corruptDF, aes(x = HDI2017, y = CPI2017)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 5), se = FALSE, linewidth = 1) +
  scale_x_reverse() +
  facet_wrap(~ region, ncol = 3) +
  labs(title = "HDI vs CPI (2017) — Faceted, X-axis Reversed",
       x = "HDI2017 (reversed)", y = "CPI2017")

Question 7

final_plot <-
  ggplot(corruptDF, aes(x = HDI2017, y = CPI2017)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "gam", formula = y ~ s(x, k = 5), se = FALSE, linewidth = 1) +
  facet_wrap(~ region, ncol = 3, scales = "fixed") +
  labs(
    title    = "Human Development and Corruption Perception (2017)",
    subtitle = "Trends by Region",
    x        = "Human Development Index",
    y        = "Corruption Perceptions Index",
    caption  = "Sources:\nTransparency International CPI 2017 (CC BY-ND 4.0)\nUNDP HDI (accessed Oct 1, 2018)\nWorld Bank population data (accessed Oct 1, 2018)."
  ) +
  theme(
    plot.caption = element_text(hjust = 0),
    plot.caption.position = "plot"
  )

Question 8

ggsave("hdi_cpi_2017.png", plot = final_plot, width = 10, height = 7, dpi = 300)