POL 9590 - Assignment 1

Question 1

If 50% of families subscribe to Disney+, 65% of families subscribe to Netflix, and 85% of families subscribe to at least one of the two, what percentage of the families subscribe to both Disney+ and Netflix?

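Let A be the event that a family subscribes to Disney+ and B the event that it subscribes to Netflix.

P(A) = 0.5, P(B) = 0.65
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.85
P(A ∩ B) = 0.5 + 0.65 - 0.85 = 0.3, so 30% of families subscribe to both.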

p_disney   <- 0.50   
p_netflix  <- 0.65   
p_union    <- 0.85
# Share subscribing to both services, expressed as a percentage
p_both <- p_disney + p_netflix - p_union
p_both * 100

## [1] 30
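
As a sanity check on the inclusion-exclusion step, the three disjoint pieces (Disney+ only, Netflix only, both) should add back up to the 85% union. This is a minimal sketch; the names `p_only_disney` and `p_only_netflix` are introduced here for illustration only.

```r
p_disney  <- 0.50
p_netflix <- 0.65
p_union   <- 0.85
p_both    <- p_disney + p_netflix - p_union

# Split the union into three disjoint pieces
p_only_disney  <- p_disney  - p_both
p_only_netflix <- p_netflix - p_both

# all.equal() compares with a tolerance, which sidesteps floating-point rounding
all.equal(p_only_disney + p_only_netflix + p_both, p_union)
```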

Question 2

If two dice are rolled, what is the probability that the sum of the two numbers that appear will be even? Will be odd? What is the probability that the difference between the two numbers on the dice will be less than 3? Show your work.

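Total number of outcomes = 6 * 6 = 36
Even sums (both dice even or both dice odd): (3 * 3) * 2 = 18
Odd sums: 36 - 18 = 18
P(sum_even) = 18/36 = 1/2
P(sum_odd) = 18/36 = 1/2
P(difference < 3) = (6 + 5 * 2 + 4 * 2) / 36 = 24/36 = 2/3, counting 6 outcomes with difference 0, 5 * 2 with difference 1, and 4 * 2 with difference 2.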

dice <- expand.grid(d1 = 1:6, d2 = 1:6)
p_even <- mean((dice$d1 + dice$d2) %% 2 == 0)
p_odd  <- mean((dice$d1 + dice$d2) %% 2 == 1)
p_diff <- mean(abs(dice$d1 - dice$d2) < 3)
c(p_even = p_even, p_odd = p_odd, p_diff_lt3 = p_diff)

##     p_even      p_odd p_diff_lt3 
##  0.5000000  0.5000000  0.6666667
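
The hand count of 6, 5 * 2 and 4 * 2 outcomes for differences of 0, 1 and 2 can be checked by tabulating the absolute differences directly. A small sketch, where `diff_counts` is a name introduced here:

```r
dice <- expand.grid(d1 = 1:6, d2 = 1:6)

# Count the outcomes for each absolute difference 0, 1, ..., 5
diff_counts <- table(abs(dice$d1 - dice$d2))
diff_counts

# Differences of 0, 1 and 2 are the ones "less than 3"
sum(diff_counts[c("0", "1", "2")]) / nrow(dice)
```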

Question 3

Three classes contain 20, 18, and 25 students, respectively, and no student is a member of more than one class. If a team is to be composed of one student from each of the three classes, in how many different ways can the members of the team be chosen?

(20!/(1! * 19!)) * (18!/(1! * 17!)) * (25!/(1! * 24!)) = 20 * 18 * 25 = 9000

ways <- choose(20, 1) * choose(18, 1) * choose(25, 1)
ways

## [1] 9000
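
The multiplication rule can also be confirmed by brute force, listing every possible team as one row. This is only an illustrative sketch; the column names `class1`, `class2` and `class3` are made up here.

```r
# Every possible team is one row: one student index from each class
teams <- expand.grid(class1 = 1:20, class2 = 1:18, class3 = 1:25)
nrow(teams)
```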

Question 4

Suppose that three runners from Team A and three runners from Team B participate in a race. If all six runners have equal ability and there are no ties, what is the probability that the three runners from Team A will finish first, second, and third and the three runners from Team B will finish fourth, fifth, and sixth?

P = 3! * 3! / 6! = 36/720 = 1/20

factorial(3) * factorial(3) / factorial(6)

## [1] 0.05
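
The same value can be reached by counting only which three of the six finishing positions go to Team A: all `choose(6, 3)` sets of positions are equally likely, and exactly one of them is {1, 2, 3}. A brute-force sketch of that reading, with `positions_a` and `a_sweeps` as names introduced here:

```r
# Each column is one possible set of finishing positions for Team A's runners
positions_a <- combn(6, 3)

# Flag the single set in which Team A takes places 1, 2 and 3
a_sweeps <- apply(positions_a, 2, function(p) all(p == 1:3))

# Share of equally likely position sets in which Team A sweeps the top three
mean(a_sweeps)
```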

Question 5

Suppose that a box contains one blue card and four red cards, which are labeled A,B,C, and D. Suppose also that two of these five cards are selected at random, without replacement.

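(a) If it is known that card A has been selected, what is the probability that both cards are red? P(both_red | A) = 3/4, because the other card is equally likely to be any of the four remaining cards and three of them are red.
(b) If it is known that at least one red card has been selected, what is the probability that both cards are red? With only one blue card in the box, every pair contains at least one red card, so P(both_red | at_least_one_red) = (4!/(2!2!)) / (5!/(2!3!)) = 6/10 = 3/5.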

# (a) 
p_a <- 3/4

# (b) 
total <- choose(5, 2)
both_red <- choose(4, 2)
at_least1_red <- total - choose(1, 2)  # choose(1, 2) is 0: no pair can consist of two blue cards
p_b <- both_red / at_least1_red

p_a

## [1] 0.75

p_b

## [1] 0.6
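
Both conditional probabilities can be checked by listing all ten unordered pairs and conditioning by hand. A minimal sketch, assuming the blue card is simply labelled "blue"; the helper vectors `has_a`, `both_red` and `any_red` are names introduced here.

```r
cards <- c("blue", "A", "B", "C", "D")   # A to D are the red cards
pairs <- combn(cards, 2)                 # 10 equally likely unordered pairs

has_a    <- apply(pairs, 2, function(p) "A" %in% p)
both_red <- apply(pairs, 2, function(p) !("blue" %in% p))
any_red  <- apply(pairs, 2, function(p) any(p != "blue"))

# (a) condition on card A being in the pair
sum(both_red & has_a) / sum(has_a)

# (b) condition on at least one red card; every pair qualifies
sum(both_red & any_red) / sum(any_red)
```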

Question 6

In a certain city, 30% of the people are Conservatives, 50% are Liberals, and 20% are Independents. Records show that in a particular election, 65% of the Conservatives voted, 82% of the Liberals voted, and 50% of the Independents voted. If a person in the city is selected at random and it is learned that she did not vote in the last election, what is the probability that she is a Liberal?

Let NV denote the event that the selected person did not vote.

\[ P(C) = 0.30, \quad P(L) = 0.50, \quad P(I) = 0.20 \]

\[ P(\text{NV} \mid C) = 1 - 0.65 = 0.35, \quad P(\text{NV} \mid L) = 1 - 0.82 = 0.18, \quad P(\text{NV} \mid I) = 1 - 0.50 = 0.50 \]

\[ P(\text{NV}) = 0.30(0.35) + 0.50(0.18) + 0.20(0.50) = 0.295 \]

\[ P(L \mid \text{NV}) = \frac{P(L)\,P(\text{NV}\mid L)}{P(\text{NV})} = \frac{0.50 \times 0.18}{0.295} = \frac{0.09}{0.295} \approx 0.305 \]

Question 7

Attached to the assignment is the reproduction dataset from Elkjær and Iversen (2023), named data_analysis_final.dta. Here are some variables of interest:

Variable name	Description
`year`	Year
`country`	Country
`red_t1ls`	Relative transfer rate to the top 1%
`red_m20ls`	Relative transfer rate to the middle 20%
`red_b20ls`	Relative transfer rate to the bottom 20%
`pre_p99p100`	Pretax income share of the top 1%
`pre_p40p60`	Pretax income share of the middle 20%
`pre_p0p20`	Pretax income share of the bottom 20%
`cum_left`	Cumulative share of government-controlled parliamentary seats held by left parties since 1980
`educatt_minimal`	Share of population attending no more than secondary education

Create a new dataframe such that:

Only the variables described above are included;
The names of the relative transfer rate and pretax income share variables represent what they are;
A new variable corresponding to the ratio between the pretax income of the top 1% and middle 20% is created, call it inequality_top;
A new variable corresponding to the ratio between the pretax income of the middle 20% and the bottom 20% is created, call it inequality_bottom;
There are no missing values in inequality_top and educatt_minimal.

library(haven)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)  
data <- read_dta("data_analysis_final.dta")
df <- data %>%
  # Keep only the variables listed in the table above
  select(year, country, red_t1ls, red_m20ls, red_b20ls,
         pre_p99p100, pre_p40p60, pre_p0p20, cum_left, educatt_minimal) %>%
  # Give the transfer-rate and income-share variables descriptive names
  rename(
    transfer_top1  = red_t1ls,
    transfer_mid20 = red_m20ls,
    transfer_bot20 = red_b20ls,
    pretax_top1    = pre_p99p100,
    pretax_mid20   = pre_p40p60,
    pretax_bot20   = pre_p0p20
  ) %>%
  # Ratios of pretax income shares
  mutate(
    inequality_top    = pretax_top1 / pretax_mid20,
    inequality_bottom = pretax_mid20 / pretax_bot20
  ) %>%
  # Drop rows with missing values in either variable
  drop_na(inequality_top, educatt_minimal)
head(df)

## # A tibble: 6 × 12
##    year country transfer_top1 transfer_mid20 transfer_bot20 pretax_top1
##   <dbl> <chr>           <dbl>          <dbl>          <dbl>       <dbl>
## 1  2004 Austria         -36.5           12.7           47.6       0.112
## 2  2005 Austria         -38.8           13.5           50.0       0.110
## 3  2006 Austria         -35.5           13.9           50.6       0.130
## 4  2007 Austria         -40.9           12.5           49.5       0.112
## 5  2008 Austria         -38.0           12.9           50.5       0.120
## 6  2009 Austria         -45.9           13.6           52.1       0.106
## # ℹ 6 more variables: pretax_mid20 <dbl>, pretax_bot20 <dbl>, cum_left <dbl>,
## #   educatt_minimal <dbl>, inequality_top <dbl>, inequality_bottom <dbl>
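
One way to confirm the last requirement is to test for missing values explicitly. The sketch below runs on a made-up three-row data frame (`toy`, with invented values), not on data_analysis_final.dta, and the helper name `meets_requirements` is hypothetical.

```r
# Made-up rows for illustration only; these are not values from the real dataset
toy <- data.frame(
  country         = c("X", "X", "Y"),
  pretax_top1     = c(0.10, NA, 0.12),
  pretax_mid20    = c(0.15, 0.16, 0.14),
  educatt_minimal = c(25, 30, NA)
)
toy$inequality_top <- toy$pretax_top1 / toy$pretax_mid20

# Hypothetical helper: TRUE when neither key variable has missing values
meets_requirements <- function(d) {
  !anyNA(d$inequality_top) && !anyNA(d$educatt_minimal)
}

meets_requirements(toy)   # rows 2 and 3 each contain a missing value

# Base-R counterpart of drop_na(inequality_top, educatt_minimal)
toy_clean <- toy[complete.cases(toy[, c("inequality_top", "educatt_minimal")]), ]
meets_requirements(toy_clean)
```

On the real data, the same check would be called as `meets_requirements(df)` after the `drop_na()` step.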

Question 8

From the dataset you created in Question 7, compute the following:

The average share of the population attending no more than secondary education (educatt_minimal) for France;
The average share of the population attending no more than secondary education (educatt_minimal) for Finland;
The average, standard deviation, interquartile range, minimum value, first quartile, median, third quartile, and maximum value of inequality_top for each country included in the dataset. Make sure to save these results in an object named gdat. (Hint: use the byvar argument in the DAMisc::sumStats() function).

library(DAMisc)

## Registered S3 method overwritten by 'broom':
##   method        from      
##   nobs.multinom clarkeTest

library(tidyr)
library(dplyr) # already attached in Question 7; repeated so this chunk runs on its own

df_means <- df %>%
  dplyr::filter(country %in% c("Finland", "France")) %>%
  group_by(country) %>%
  summarize(mean_educatt = mean(educatt_minimal, na.rm = TRUE))
print(df_means)

## # A tibble: 2 × 2
##   country mean_educatt
##   <chr>          <dbl>
## 1 Finland         23.3
## 2 France          30.8

gdat <- DAMisc::sumStats(data  = df, vars  = "inequality_top", byvar = "country")
print(gdat)

## # A tibble: 17 × 12
##    variable      country  mean     sd    iqr   min   q25   q50   q75   max     n
##    <chr>         <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
##  1 inequality_t… Austria 0.719 0.102  0.0399 0.520 0.713 0.718 0.753 0.918    13
##  2 inequality_t… Belgium 0.538 0.0385 0.0283 0.457 0.521 0.535 0.549 0.627    13
##  3 inequality_t… Denmark 0.687 0.0800 0.116  0.543 0.641 0.687 0.757 0.812    13
##  4 inequality_t… Finland 0.621 0.0755 0.141  0.518 0.559 0.602 0.700 0.755    13
##  5 inequality_t… France  0.623 0.0470 0.0601 0.541 0.594 0.620 0.654 0.704    14
##  6 inequality_t… Germany 0.758 0.145  0.259  0.526 0.639 0.703 0.898 0.943    25
##  7 inequality_t… Greece  0.696 0.152  0.259  0.459 0.575 0.740 0.834 0.889    13
##  8 inequality_t… Ireland 0.741 0.0984 0.159  0.583 0.657 0.745 0.816 0.886    14
##  9 inequality_t… Italy   0.490 0.0216 0.0227 0.459 0.480 0.484 0.502 0.545    13
## 10 inequality_t… Nether… 0.409 0.0275 0.0221 0.375 0.391 0.410 0.413 0.483    13
## 11 inequality_t… Norway  0.728 0.0898 0.108  0.571 0.689 0.749 0.797 0.844    13
## 12 inequality_t… Portug… 0.788 0.0488 0.0671 0.700 0.756 0.796 0.823 0.851    13
## 13 inequality_t… Spain   0.744 0.0328 0.0269 0.692 0.732 0.739 0.759 0.807    13
## 14 inequality_t… Sweden  0.611 0.0814 0.132  0.514 0.537 0.587 0.670 0.754    13
## 15 inequality_t… Switze… 0.735 0.0730 0.115  0.612 0.683 0.711 0.798 0.867    13
## 16 inequality_t… United… 0.968 0.0786 0.139  0.860 0.905 0.950 1.04  1.10     14
## 17 inequality_t… United… 1.22  0.295  0.483  0.701 1.01  1.25  1.49  1.64     38
## # ℹ 1 more variable: nNA <int>
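
In the table above, the iqr column is the gap between q75 and q25 (for Austria, 0.753 - 0.713 ≈ 0.040). The base-R sketch below computes the same kinds of per-group summaries on made-up numbers; `toy`, `summarise_group` and `toy_stats` are names introduced here, and `quantile()` uses R's default method, which is assumed rather than confirmed to match what `sumStats()` uses internally.

```r
# Made-up values for two hypothetical groups; not taken from the real dataset
toy <- data.frame(
  country        = rep(c("X", "Y"), each = 5),
  inequality_top = c(0.50, 0.55, 0.60, 0.65, 0.70,
                     0.90, 1.00, 1.10, 1.20, 1.30)
)

# Summary statistics for one group's values
summarise_group <- function(x) {
  q <- quantile(x, c(0.25, 0.50, 0.75), names = FALSE)
  c(mean = mean(x), sd = sd(x), iqr = q[3] - q[1],
    min = min(x), q25 = q[1], q50 = q[2], q75 = q[3], max = max(x))
}

# One row per country, one column per statistic
toy_stats <- t(sapply(split(toy$inequality_top, toy$country), summarise_group))
toy_stats
```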

Question 9

Produce a bar plot representing the average top-end inequality rate per country. Use the gdat object you produced in the previous question.

Make sure to label your axes appropriately;
Put the countries on the y axis.

Since the gdat object was created with DAMisc::sumStats(), it already contains a mean column holding the average of inequality_top for each country, so it can be plotted directly.

library(ggplot2)
ggplot(gdat, aes(x = mean, y = country)) +
  geom_col() +
  labs(
    x = "Average top-end inequality (pretax income share: top 1% / middle 20%)",
    y = "Country",
    title = "Average Top-End Inequality Rate by Country"
  ) +
  theme_light()
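
With 17 countries, the bars are often easier to compare when sorted by their value. One optional variation, sketched on made-up numbers (`toy` and its values are invented; on the real data `gdat` would take its place), wraps the country variable in `reorder()`:

```r
library(ggplot2)

# Made-up means for three hypothetical countries
toy <- data.frame(country = c("X", "Y", "Z"), mean = c(0.7, 0.4, 1.2))

# reorder() turns country into a factor whose levels follow the mean,
# so the bars are drawn from the smallest to the largest average
p <- ggplot(toy, aes(x = mean, y = reorder(country, mean))) +
  geom_col() +
  labs(x = "Average top-end inequality", y = "Country") +
  theme_light()
p
```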