Lab 02

Author: Moisieiev Vasyl

Affiliation: Kyiv School of Economics

Part 1: Nobel laureates

Import

library(tidyverse)
library(ggplot2)

nobel_df <- read_csv("https://raw.githubusercontent.com/Aranaur/aranaur.rbind.io/main/datasets/nobel/nobel.csv")

nobel_df

Exercise 1

The dataset contains 935 observations and 26 variables.
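One way to check counts like these is to ask R for the dimensions directly. The sketch below uses a small stand-in data frame (`toy_df` is hypothetical, not the Nobel data); the same calls should work on `nobel_df`.

```r
# Hypothetical stand-in for nobel_df, used only to illustrate the calls
toy_df <- data.frame(
  id = 1:4,
  category = c("Physics", "Chemistry", "Peace", "Medicine"),
  year = c(2001, 2005, 2010, 2015)
)

n_obs <- nrow(toy_df)   # number of observations (rows)
n_vars <- ncol(toy_df)  # number of variables (columns)
dim(toy_df)             # both at once: rows first, then columns
```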

Exercise 2

nobel_living <- subset(nobel_df,
                      !is.na(country)&
                      gender != "org"&
                      is.na(died_date))

After filtering the dataset to living laureates (individuals rather than organizations, with a known country and no recorded death date), 228 observations remain.
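To see how the three conditions combine, here is a minimal sketch on a toy data frame. The rows are made up for illustration; only the column names mirror the ones used above.

```r
# Hypothetical toy data; column names mirror those used in the filter above
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  country = c("USA", NA, "France", "USA", "Japan"),
  gender = c("male", "female", "org", "female", "male"),
  died_date = c(NA, NA, NA, "1999-01-01", NA),
  stringsAsFactors = FALSE
)

# Keep rows with a known country, that are not organizations,
# and that have no recorded death date
toy_living <- subset(
  toy,
  !is.na(country) & gender != "org" & is.na(died_date)
)

nrow(toy_living)   # number of rows that pass all three conditions
toy_living$name    # which rows they are
```

Row B is dropped for a missing country, C for being an organization, and D for having a death date, so only A and E remain.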

nobel_living <- nobel_living %>%
  mutate(country_us = ifelse(country == "USA", "USA", "Other"))

nobel_living_science <- nobel_living %>%
  filter(category %in% c("Physics", "Chemistry", "Medicine", "Economics"))

nobel_living_science

Exercise 3

ggplot(nobel_living_science, aes(x = country_us, fill = country_us)) +
  geom_bar() + 
  facet_wrap(~ category) +
  coord_flip() + 
  labs(
    title = "Distribution of Nobel Laureates by Category and Country",
    x = "Country (USA or Other)",
    y = "Count of Laureates",
    fill = "Country"
  ) + 
  theme_minimal()

Exercise 4

nobel_living_science <- nobel_living_science %>%
  mutate(born_country_us = ifelse(born_country == "USA", "USA", "Other"))

born_counts <- table(nobel_living_science$born_country_us)

born_counts


Other   USA 
  123   105

Exercise 5

ggplot(nobel_living_science, aes(x = country_us, fill = born_country_us)) +
  geom_bar() +
  facet_wrap(~ category) +
  coord_flip() +
  labs(
    title = "Segmented Frequency Bar Plot",
    x = "Country (USA or Other)",
    y = "Count of Laureates",
    fill = "Born in"
  ) +
  theme_minimal()

ggplot(nobel_living_science, aes(x = country_us, fill = born_country_us)) +
  geom_bar(position = "fill") +
  facet_wrap(~ category) +
  coord_flip() +
  labs(
    title = "Segmented Relative Frequency Bar Plot",
    x = "Country (USA or Other)",
    y = "Proportion of Laureates",
    fill = "Born in"
  ) +
  theme_minimal()

The relative frequency bar plot (the second plot) is better suited to answering the question because it shows, within each category, the proportion of laureates born in the USA versus elsewhere. This makes it easier to evaluate BuzzFeed’s claim that a significant number of US-based laureates were born in other countries.
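The proportions that the relative frequency plot displays can also be computed as numbers. The sketch below uses base R's `table()` and `prop.table()` on a hypothetical toy data frame (the values are invented, not taken from the Nobel data).

```r
# Hypothetical rows, for illustration only (not the Nobel data)
toy <- data.frame(
  country_us = c("USA", "USA", "USA", "USA", "Other", "Other"),
  born_country_us = c("USA", "USA", "USA", "Other", "Other", "Other")
)

# Rows: where the laureate was based; columns: where they were born
counts <- table(toy$country_us, toy$born_country_us)

# margin = 1 divides each row by its row total, so each row sums to 1,
# which matches what each bar shows under position = "fill"
props <- prop.table(counts, margin = 1)
props
```

In this toy example, one of the four US-based rows was born elsewhere, so `props["USA", "Other"]` is 0.25.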

Part 2: IMS Exercises

Exercise 6

birth_country_counts <- nobel_living_science %>%
  filter(country_us == "USA" & born_country_us == "Other") %>%
  count(born_country, sort = TRUE)

birth_country_counts

Exercise 7

Democrats overwhelmingly support raising taxes on the rich, with minimal support for raising taxes on the poor.

Republicans show a more divided opinion, with significant support for raising taxes on the poor and less emphasis on taxing the rich.

Independents/Other lie in between, with mixed opinions across all three categories.

Other variables could also influence these statistics, for example the respondents’ age, income, education level, or location. Each of these could affect a respondent’s opinion on tax policy.

Exercise 8

The relationship between stress and productivity can be visualized as an inverted U-shaped curve. At low levels of stress, productivity is low. As stress increases, productivity also increases, up to a certain point. Beyond that point, productivity decreases as stress continues to rise.

stress <- seq(0, 100, by = 1)
productivity <- -0.01 * (stress - 50)^2 + 50 

data <- data.frame(stress = stress, productivity = productivity)

ggplot(data, aes(x = stress, y = productivity)) +
  geom_line(color = "blue", linewidth = 1.2) +
  labs(
    title = "Relationship Between Stress and Productivity",
    x = "Stress Level",
    y = "Productivity Level"
  ) +
  theme_minimal()
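The turning point of the curve above can be read off numerically. This sketch reuses the same illustrative quadratic; the coefficients are the ones chosen for the plot, not estimates from real data.

```r
# Same illustrative model as in the plot above
stress <- seq(0, 100, by = 1)
productivity <- -0.01 * (stress - 50)^2 + 50

# Stress level at which the illustrative curve peaks
peak_stress <- stress[which.max(productivity)]
peak_productivity <- max(productivity)

# Productivity at the two extremes of the stress range
range_ends <- productivity[c(1, length(productivity))]

peak_stress
peak_productivity
range_ends
```

Under this model, productivity peaks at a stress level of 50 and is equally low at both extremes.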

Exercise 9

Distribution: Likely right skewed, as most households have few or no pets, while a small number of households may have many pets.

Best measure for center: Median, as the right skew can disproportionately affect the mean.

Best measure for variability: IQR, as it is less sensitive to extreme values.

Distribution: Likely right skewed, as most people have relatively short commutes, but a minority may travel long distances.

Best measure for center: Median, to account for outliers like people commuting very long distances.

Best measure for variability: IQR, as it is robust to extreme values and provides a better sense of typical variability.

Distribution: Likely symmetric, as human heights usually follow a normal distribution with slight variation around a central value.

Best measure for center: Mean, as the distribution is symmetric, and the mean represents the central tendency well.

Best measure for variability: Standard deviation, as it is most appropriate for symmetric, normal-like distributions.

Distribution: Likely left skewed, as most people live to old age, but some die prematurely, creating a long tail on the left.

Best measure for center: Median, as the left skew can distort the mean.

Best measure for variability: IQR, as it is more robust against the influence of outliers.

Distribution: Likely left skewed, as most students would score high marks, but a few might perform poorly, creating a tail on the left.

Best measure for center: Median, as it better represents the typical score when skewed distributions are present.

Best measure for variability: IQR, as it is less influenced by the few low outliers.
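The preference for the median and IQR in the skewed cases above can be illustrated with a small example. The sample is hypothetical (a made-up set of pet counts with one extreme household); the sketch compares the summaries with and without the extreme value.

```r
# Hypothetical right-skewed sample, e.g. pets per household
pets <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 20)

# Same sample with the extreme value replaced by a more typical one
pets_typical <- replace(pets, pets == 20, 3)

# Mean and standard deviation react strongly to the single extreme value
c(mean(pets), mean(pets_typical))
c(sd(pets), sd(pets_typical))

# Median and IQR are identical for both versions of the sample
c(median(pets), median(pets_typical))
c(IQR(pets), IQR(pets_typical))
```

The mean drops from 3 to 1.3 once the extreme value is replaced, while the median stays at 1 and the IQR stays the same.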

Exercise 10

The histogram shows a bimodal distribution with two peaks, indicating that the data may come from two different populations or have two distinct groups within it.

The histogram also provides frequency information, showing how many observations fall within each bin or interval.

The box plot displays the spread of the data, including the median, quartiles, and any outliers.

The bimodal distribution likely arises due to the inclusion of both male and female marathon winners in the combined data. Historically, men and women have different average marathon finishing times, with men typically finishing faster on average than women. These differences in finishing times for the two groups result in two distinct peaks in the data.

Men generally have faster marathon times, as their distribution is concentrated at lower values.

Women have slower marathon times on average, with their distribution centered around higher values.

Both men’s and women’s marathon times have decreased since the 1970s, reflecting improvements in training, technology, and competition levels over time.

Women’s times show a more significant improvement initially, likely due to the increased participation and professionalization of women’s running during this period.
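The way pooling two groups produces a bimodal shape can be sketched with simulated data. The group means, spreads, and sample sizes below are invented for illustration and are not the actual marathon times.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Hypothetical finishing times in minutes; means and spreads are made up
men   <- rnorm(500, mean = 130, sd = 3)
women <- rnorm(500, mean = 150, sd = 4)
combined <- c(men, women)

# Kernel density estimate of the pooled data
d <- density(combined)

# Estimated density near each assumed group mean and in the gap between them
dens_at <- approx(d$x, d$y, xout = c(130, 140, 150))$y
dens_at
```

The estimated density is lower in the gap at 140 than near either group mean, which is the dip between two peaks that a histogram of the pooled data would show.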