hw-3.knit

Title: Homework 3

Name : Stephen Mustard

Date : 2-10-2023

Subject: Econometrics

rm(list = ls(all.names = TRUE))
setwd("C:/Users/Stephen/OneDrive/Econometrics")

library('dplyr')

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library('haven')
library('tidyverse')

## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──

## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

age_earnings <- read_dta("Age_HourlyEarnings_STATA (4).dta")

##question 1

age_earnings_marg <- age_earnings %>%
  count(age) %>%
  mutate(marg_dist = n/nrow(age_earnings))
age_earnings_marg

## # A tibble: 10 × 3
##      age     n marg_dist
##    <dbl> <int>     <dbl>
##  1    25   725    0.0908
##  2    26   687    0.0860
##  3    27   758    0.0949
##  4    28   750    0.0939
##  5    29   793    0.0993
##  6    30   800    0.100 
##  7    31   779    0.0975
##  8    32   821    0.103 
##  9    33   934    0.117 
## 10    34   939    0.118

age_earnings_AHE <- age_earnings %>%
  group_by(age) %>%
  summarize (mean = mean(earnings))
age_earnings_AHE

## # A tibble: 10 × 2
##      age  mean
##    <dbl> <dbl>
##  1    25  14.4
##  2    26  14.9
##  3    27  15.3
##  4    28  15.9
##  5    29  16.7
##  6    30  17.5
##  7    31  17.6
##  8    32  17.9
##  9    33  18.1
## 10    34  18.2

plot(age_earnings_AHE$age,age_earnings_AHE$mean)

Average hourly earnings and age are related. The plot of mean earnings by age shows a positive relationship: mean hourly earnings rise from about 14.4 at age 25 to about 18.2 at age 34. On average, older workers in this sample make more per hour.
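One caveat when reading a plot of group means: averaging within each age removes the person-to-person variation, so the means can line up far more tightly than the underlying individual data. The sketch below illustrates this with simulated data. The sample size, slope, and noise level are assumptions chosen only for illustration, not estimates from the homework dataset.

```r
# Simulated data only: all parameter values here are illustrative assumptions.
set.seed(1)
n    <- 8000
age  <- sample(25:34, n, replace = TRUE)
earn <- 5 + 0.4 * age + rnorm(n, sd = 8.5)   # weak individual-level signal

# Correlation across individuals
cor_individual <- cor(age, earn)

# Correlation across the ten age-group means
group_means <- tapply(earn, age, mean)
cor_means   <- cor(as.numeric(names(group_means)), as.numeric(group_means))

c(individual = cor_individual, group_means = cor_means)
```

In a setup like this, the correlation across group means is typically much closer to 1 than the individual-level correlation, even though both are computed from the same data.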

mean_AHE <- age_earnings %>%
  summarize (mean = mean(earnings))
mean_AHE

## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1  16.8
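As a consistency check, the overall mean can be approximately rebuilt from the two tables above: weight each conditional mean by the marginal probability of that age. The sketch below uses the rounded values printed in the tables, so the result should only match the reported mean up to rounding.

```r
# Counts per age and conditional mean earnings, copied from the tables above
n_by_age    <- c(725, 687, 758, 750, 793, 800, 779, 821, 934, 939)
mean_by_age <- c(14.4, 14.9, 15.3, 15.9, 16.7, 17.5, 17.6, 17.9, 18.1, 18.2)

# Marginal distribution of age: proportions that sum to 1
marg_dist <- n_by_age / sum(n_by_age)
sum(marg_dist)

# Law of iterated expectations:
# E[earnings] = sum over ages of P(age) * E[earnings | age]
overall_mean <- sum(marg_dist * mean_by_age)
overall_mean   # close to 16.8; small differences come from the rounded inputs
```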

var(age_earnings$earnings)

## [1] 76.71475

cov(age_earnings$earnings, age_earnings$age)

## [1] 3.777516

cor(age_earnings$earnings, age_earnings$age)

## [1] 0.1491763
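The three statistics above are tied together by cor(X, Y) = cov(X, Y) / (sd(X) * sd(Y)). As a quick check, the sketch below backs out the standard deviation of age implied by the reported numbers. The value of sd(age) is not printed in the output above, so it is derived here, not quoted.

```r
# Values reported above
var_earn <- 76.71475
cov_ea   <- 3.777516
cor_ea   <- 0.1491763

# cor(X, Y) = cov(X, Y) / (sd(X) * sd(Y)), so sd(age) can be backed out
sd_earn <- sqrt(var_earn)
sd_age  <- cov_ea / (cor_ea * sd_earn)
sd_age

# Plugging it back in reproduces the reported correlation
cor_check <- cov_ea / (sd_earn * sd_age)
cor_check
```

The implied standard deviation of age is a little under 3 years, which is plausible for ages spread fairly evenly between 25 and 34.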

Since the covariance is positive, earnings tend to increase as age increases. If we look at the graph in part c we can see this positive relationship. Correlation measures the strength of the linear relationship between the two variables. With a correlation of 0.15, we can say that there is a weak relationship between age and earnings. This surprised me, as the graph seemed to show a much stronger correlation than 0.15. The likely reason is that the graph plots average earnings at each age, which hides the large variation in earnings among people of the same age.

##question 2

growth <- read_dta("Growth.dta")

plot(growth$growth, growth$tradeshare)

There seems to be a weak positive relationship between average trade share and annual growth rate.

Malta has a growth rate of about 6.6 and a trade share of about 2.0, so it appears in the upper right part of the graph. Malta does look like an outlier.

growth_tradeshare_reg <- lm(growth ~ tradeshare, data = growth)
summary(growth_tradeshare_reg)

## 
## Call:
## lm(formula = growth ~ tradeshare, data = growth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3739 -0.8864  0.2329  0.9248  5.3889 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.6403     0.4900   1.307  0.19606   
## tradeshare    2.3064     0.7735   2.982  0.00407 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.79 on 63 degrees of freedom
## Multiple R-squared:  0.1237, Adjusted R-squared:  0.1098 
## F-statistic: 8.892 on 1 and 63 DF,  p-value: 0.00407

The estimated intercept is 0.6403 and the estimated slope is 2.3064. The formula to estimate growth rate is Y = 0.6403 + 2.3064(tradeshare).
A country with a tradeshare of 0.5 would be estimated to have a growth rate of 1.7935.
A country with a tradeshare of 1.0 would be estimated to have a growth rate of 2.9467.

remove_malta <- growth %>% filter(country_name != "Malta")
growth_tradeshare_reg2 <- lm(growth ~ tradeshare, data = remove_malta)
summary(growth_tradeshare_reg2)

## 
## Call:
## lm(formula = growth ~ tradeshare, data = remove_malta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4247 -0.9383  0.2091  0.9265  5.3776 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.9574     0.5804   1.650   0.1041  
## tradeshare    1.6809     0.9874   1.702   0.0937 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.789 on 62 degrees of freedom
## Multiple R-squared:  0.04466,    Adjusted R-squared:  0.02925 
## F-statistic: 2.898 on 1 and 62 DF,  p-value: 0.09369

The estimated intercept is 0.9574 and the estimated slope is 1.6809. The formula to estimate growth rate is Y = 0.9574 + 1.6809(tradeshare).
A country with a tradeshare of 0.5 would be estimated to have a growth rate of 1.79785.
A country with a tradeshare of 1.0 would be estimated to have a growth rate of 2.6383.
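The predictions for both regressions can be reproduced directly from the reported coefficients. In the sketch below, `predict_growth` is a hypothetical helper written for this check. It is not part of the original analysis, which could equally use `predict()` on the fitted models.

```r
# Coefficients copied from the two summary() outputs above
coef_with_malta    <- c(intercept = 0.6403, slope = 2.3064)
coef_without_malta <- c(intercept = 0.9574, slope = 1.6809)

# Hypothetical helper: predicted growth for a given trade share
predict_growth <- function(coefs, tradeshare) {
  unname(coefs["intercept"] + coefs["slope"] * tradeshare)
}

tradeshare   <- c(0.5, 1.0)
pred_with    <- predict_growth(coef_with_malta, tradeshare)     # about 1.7935 and 2.9467
pred_without <- predict_growth(coef_without_malta, tradeshare)  # about 1.79785 and 2.6383
data.frame(tradeshare, pred_with, pred_without)
```

At a trade share of 0.5 the two models give nearly the same prediction. The difference only becomes noticeable at higher trade shares, where the steeper slope of the model with Malta matters more.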

ggplot(growth, aes(x = tradeshare, y = growth)) + geom_point() + geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(remove_malta, aes(x = tradeshare, y = growth)) + geom_point() + geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'

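The regression line that includes Malta is steeper (slope 2.31 versus 1.68 without Malta) because Malta is an outlier. It combines a very high trade share with a high growth rate, so it sits far to the right of the other countries and pulls the right end of the fitted line upward, making the slope steeper than it would otherwise be.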
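The effect of a single point like this can be illustrated with simulated data. In the sketch below every number is an assumption made up for the illustration (it does not use the Growth dataset): 64 points follow a mild upward trend, and one extra point is added far to the right and well above that trend.

```r
# Simulated data only: values are illustrative assumptions, not the Growth dataset.
set.seed(42)
x <- runif(64, min = 0.1, max = 1.2)
y <- 1 + 1.5 * x + rnorm(64, sd = 0.5)
base_data <- data.frame(x = x, y = y)

# Add one observation far to the right and well above the trend
with_outlier <- rbind(base_data, data.frame(x = 2.0, y = 8.0))

slope_without <- coef(lm(y ~ x, data = base_data))[["x"]]
slope_with    <- coef(lm(y ~ x, data = with_outlier))[["x"]]
c(without = slope_without, with = slope_with)
```

Because the added point lies far from the mean of x, it has high leverage, so one observation is enough to tilt the fitted line noticeably.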

Malta is an island in the Mediterranean Sea. Its trade share is so large because Malta is very well positioned for sea trade. It also lacks many necessities and relies on trade to survive. Malta should be excluded from this analysis. It is an anomaly and does not represent the vast majority of countries.

##question 3

earnings_height <- read_dta("Earnings_and_Height.dta")

median(earnings_height$height)

## [1] 67

short <- earnings_height %>% filter(height <= 67)
mean(short$earnings)

## [1] 44488.44

tall <- earnings_height %>% filter(height > 67)
mean(tall$earnings)

## [1] 49987.88
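The same comparison can be done in one step by grouping on a tall/short indicator instead of building two filtered data frames. The sketch below uses a small made-up data frame (the heights and earnings are hypothetical) and base R's `tapply`.

```r
# Toy data, made up for illustration
toy <- data.frame(
  height   = c(62, 65, 67, 68, 70, 72),
  earnings = c(30000, 40000, 50000, 45000, 55000, 65000)
)

# Label each person relative to the median height
cutoff <- median(toy$height)
group  <- ifelse(toy$height > cutoff, "tall", "short")

# Mean earnings within each group, and the gap between them
group_means <- tapply(toy$earnings, group, mean)
gap <- unname(group_means["tall"] - group_means["short"])
group_means
gap
```

Applied to the homework data, the gap between the two reported means is 49987.88 - 44488.44 = 5499.44.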

ggplot(data=earnings_height, aes(x = height, y = earnings))+ geom_point()

The points on the plot fall on horizontal lines because earnings are recorded in 23 distinct brackets. For each bracket, the researchers in this study calculated an average value to represent the group.

earnings_height_reg <- lm(earnings ~ height, data = earnings_height)
summary(earnings_height_reg)

## 
## Call:
## lm(formula = earnings ~ height, data = earnings_height)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47836 -21879  -7976  34323  50599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -512.73    3386.86  -0.151     0.88    
## height        707.67      50.49  14.016   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared:  0.01088,    Adjusted R-squared:  0.01082 
## F-statistic: 196.5 on 1 and 17868 DF,  p-value: < 2.2e-16

The estimated slope is 707.67. The estimated regression line is Y = -512.73 + 707.67(height).
A worker who is 70 in tall is predicted to earn $49,024.17.
A worker who is 67 in tall is predicted to earn $46,901.16.
A worker who is 65 in tall is predicted to earn $45,485.82.
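These predictions follow directly from the reported coefficients, as the short check below shows. It uses only the intercept and slope printed in the summary output above.

```r
# Coefficients copied from the summary() output above
b0 <- -512.73
b1 <- 707.67

heights <- c(70, 67, 65)
pred    <- b0 + b1 * heights
data.frame(height = heights, predicted_earnings = pred)

# Each extra inch adds b1 dollars, so 70 in and 65 in differ by 5 * b1
diff_70_65 <- pred[1] - pred[3]
diff_70_65
```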

remove_males <- earnings_height %>% 
  filter(sex == 0)
earnings_height_regf <- lm(earnings ~ height, data = remove_males)
summary(earnings_height_regf)

## 
## Call:
## lm(formula = earnings ~ height, data = remove_males)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42748 -22006  -7466  36641  46865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12650.9     6383.7   1.982   0.0475 *  
## height         511.2       98.9   5.169  2.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared:  0.002672,   Adjusted R-squared:  0.002572 
## F-statistic: 26.72 on 1 and 9972 DF,  p-value: 2.396e-07

The estimated slope is 511.2.
A woman who is one inch taller than the average woman in the sample is predicted to earn $511.20 more than the average earnings for women.

remove_females <- earnings_height %>% 
  filter(sex == 1)
earnings_height_regm <- lm(earnings ~ height, data = remove_females)
summary(earnings_height_regm)

## 
## Call:
## lm(formula = earnings ~ height, data = remove_females)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50158 -22373  -8118  33091  59228 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43130.3     7068.5  -6.102  1.1e-09 ***
## height        1306.9      100.8  12.969  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared:  0.02086,    Adjusted R-squared:  0.02074 
## F-statistic: 168.2 on 1 and 7894 DF,  p-value: < 2.2e-16

The estimated slope is 1306.9.
A man who is one inch taller than the average man in the sample is predicted to earn $1,306.90 more than the average earnings for men.
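Both of these answers rely on a property of least squares with an intercept: the fitted line passes through the point of sample means, so someone one unit above the mean of x is predicted to be exactly one slope above the mean of y. The sketch below checks this on a small made-up data frame (the values are hypothetical).

```r
# Toy data, made up for illustration
toy <- data.frame(
  height   = c(60, 63, 65, 66, 69, 71),
  earnings = c(28000, 35000, 41000, 39000, 50000, 52000)
)
fit   <- lm(earnings ~ height, data = toy)
slope <- coef(fit)[["height"]]

# Prediction for someone exactly one inch above the sample mean height
pred_plus_one <- predict(fit, newdata = data.frame(height = mean(toy$height) + 1))

# The gap to the sample mean of earnings equals the slope
gap <- unname(pred_plus_one) - mean(toy$earnings)
all.equal(gap, slope)
```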

Sex is a potential confounder because it is related to both height and earnings. Women are, on average, shorter than men, and the two regressions above show that the relationship between height and earnings differs by sex: an extra inch of height is associated with about $1,306.9 more in earnings for men but only about $511.2 more for women. Because sex is correlated with height and also affects earnings, leaving it out of the regression can bias the estimated effect of height.
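The mechanism can be sketched with simulated data. Every parameter below is an assumption chosen for illustration, not an estimate from the Earnings_and_Height data, and the simulation uses a simpler setup than the homework results suggest: sex shifts both height and earnings, while the true effect of height is the same for everyone.

```r
# Simulated data only: all parameter values are illustrative assumptions.
set.seed(7)
n   <- 5000
sex <- rbinom(n, size = 1, prob = 0.5)   # 1 = male, 0 = female, as in the filters above

# Sex shifts height and earnings; the true height effect is 500 per inch
height   <- 64 + 6 * sex + rnorm(n, sd = 2.5)
earnings <- 10000 + 500 * height + 8000 * sex + rnorm(n, sd = 10000)

# Omitting sex loads part of its effect onto height
slope_pooled     <- coef(lm(earnings ~ height))[["height"]]
slope_controlled <- coef(lm(earnings ~ height + sex))[["height"]]
c(pooled = slope_pooled, controlled = slope_controlled)
```

In this setup the pooled slope overstates the height effect, while adding sex as a regressor brings the estimate back near the value used to generate the data.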