Title: Homework 3
Name : Stephen Mustard
Date : 2-10-2023
Subject: Econometrics
rm(list = ls(all.names = TRUE))
setwd("C:/Users/Stephen/OneDrive/Econometrics")
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('haven')
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
age_earnings <- read_dta("Age_HourlyEarnings_STATA (4).dta")
#question 1
#a
age_earnings_marg <- age_earnings %>%
count(age) %>%
mutate(marg_dist = n/nrow(age_earnings))
age_earnings_marg
## # A tibble: 10 × 3
## age n marg_dist
## <dbl> <int> <dbl>
## 1 25 725 0.0908
## 2 26 687 0.0860
## 3 27 758 0.0949
## 4 28 750 0.0939
## 5 29 793 0.0993
## 6 30 800 0.100
## 7 31 779 0.0975
## 8 32 821 0.103
## 9 33 934 0.117
## 10 34 939 0.118
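As a quick sanity check on part a, the marginal probabilities should sum to 1. The sketch below re-enters the counts from the table above by hand, so it does not need the .dta file. The vector names are illustrative and not part of the original script.

```r
# Counts by age, copied from the table printed in part a
age <- 25:34
n   <- c(725, 687, 758, 750, 793, 800, 779, 821, 934, 939)

# Marginal distribution: each count divided by the total sample size
marg_dist <- n / sum(n)

total_n <- sum(n)          # total number of observations
total_p <- sum(marg_dist)  # should equal 1 up to floating-point error

round(marg_dist, 4)
```

If the probabilities did not sum to 1, some ages would have been dropped or double counted.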
#b
age_earnings_AHE <- age_earnings %>%
group_by(age) %>%
summarize (mean = mean(earnings))
age_earnings_AHE
## # A tibble: 10 × 2
## age mean
## <dbl> <dbl>
## 1 25 14.4
## 2 26 14.9
## 3 27 15.3
## 4 28 15.9
## 5 29 16.7
## 6 30 17.5
## 7 31 17.6
## 8 32 17.9
## 9 33 18.1
## 10 34 18.2
#c
plot(age_earnings_AHE$age,age_earnings_AHE$mean)
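Average hourly earnings and age are related. The plot shows a clear positive relationship: mean hourly earnings rise at every age, from 14.4 at age 25 to 18.2 at age 34, although the increases get smaller after about age 30. As people get older, they tend to make more per hour.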
#d
mean_AHE <- age_earnings %>%
summarize (mean = mean(earnings))
mean_AHE
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 16.8
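The overall mean in part d can be cross-checked against parts a and b. By the law of iterated expectations, the overall mean equals the average of the conditional means, weighted by the marginal distribution of age. A rough sketch using the rounded values printed above, so the result is only approximate:

```r
# Counts (part a) and conditional means (part b), copied from the printed tables
n        <- c(725, 687, 758, 750, 793, 800, 779, 821, 934, 939)
mean_ahe <- c(14.4, 14.9, 15.3, 15.9, 16.7, 17.5, 17.6, 17.9, 18.1, 18.2)

# Weight each conditional mean by the share of the sample at that age
weights      <- n / sum(n)
overall_mean <- sum(weights * mean_ahe)

overall_mean
```

Because the conditional means are rounded to one decimal place, the result should agree with the printed 16.8 only to about one decimal place.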
#e
var(age_earnings$earnings)
## [1] 76.71475
#f
cov(age_earnings$earnings, age_earnings$age)
## [1] 3.777516
#g
cor(age_earnings$earnings, age_earnings$age)
## [1] 0.1491763
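The results of parts e, f, and g are tied together by the identity cor(X, Y) = cov(X, Y) / (sd(X) sd(Y)). The sketch below uses only the numbers printed above to back out the standard deviation of age, which is not reported in the output. The variable names are illustrative.

```r
cov_xy       <- 3.777516   # covariance of earnings and age (part f)
var_earnings <- 76.71475   # variance of earnings (part e)
cor_xy       <- 0.1491763  # correlation (part g)

sd_earnings <- sqrt(var_earnings)

# Rearranging cor = cov / (sd_earnings * sd_age) to solve for sd_age
implied_sd_age <- cov_xy / (cor_xy * sd_earnings)

implied_sd_age
```

The implied standard deviation of age is about 2.9 years, which is plausible for a sample spread across ages 25 to 34.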
#h
Since the covariance is positive, age and earnings tend to move together: older workers tend to earn more per hour, which matches the upward pattern in the graph from part c. Correlation measures the strength of the linear relationship between the two variables. With a correlation of 0.15, the relationship between age and earnings is weak. This surprised me, as the graph seemed to show a much stronger relationship. The graph plots only the mean earnings at each age, which averages away the large variation in earnings among people of the same age. The correlation is computed from the individual observations, so it is much lower.
##question 2
growth <- read_dta("Growth.dta")
#a
plot(growth$growth, growth$tradeshare)
There appears to be a weak positive relationship between average trade share and annual growth rate.
#b
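Malta is at coordinates (6.6, 2.0), that is, a growth rate of about 6.6 on the horizontal axis and a trade share of about 2.0 on the vertical axis. It sits in the upper right part of the graph, well away from the other countries, so Malta does look like an outlier.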
#c
growth_tradeshare_reg <- lm(growth ~ tradeshare, data = growth)
summary(growth_tradeshare_reg)
##
## Call:
## lm(formula = growth ~ tradeshare, data = growth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3739 -0.8864 0.2329 0.9248 5.3889
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6403 0.4900 1.307 0.19606
## tradeshare 2.3064 0.7735 2.982 0.00407 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.79 on 63 degrees of freedom
## Multiple R-squared: 0.1237, Adjusted R-squared: 0.1098
## F-statistic: 8.892 on 1 and 63 DF, p-value: 0.00407
The estimated intercept is 0.6403 and the estimated slope is 2.3064. The formula to estimate growth rate is Y = 0.6403 + 2.3064(tradeshare). A country with a tradeshare of 0.5 would be estimated to have a growth rate of 1.7935. A country with a tradeshare of 1.0 would be estimated to have a growth rate of 2.9467.
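The predictions can be reproduced directly from the printed coefficients. This is a minimal sketch that hard-codes the rounded estimates from the summary output. `predict_growth` is a helper name introduced here for illustration. With the data loaded, `predict(growth_tradeshare_reg, newdata = data.frame(tradeshare = c(0.5, 1.0)))` should give the same values up to rounding.

```r
# Rounded coefficient estimates from summary(growth_tradeshare_reg)
b0 <- 0.6403  # intercept
b1 <- 2.3064  # slope on tradeshare

# Fitted value of the regression line at a given trade share
predict_growth <- function(tradeshare) b0 + b1 * tradeshare

predicted <- predict_growth(c(0.5, 1.0))
predicted
```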
#d
remove_malta <- growth %>% filter(country_name != "Malta")
growth_tradeshare_reg2 <- lm(growth ~ tradeshare, data = remove_malta)
summary(growth_tradeshare_reg2)
##
## Call:
## lm(formula = growth ~ tradeshare, data = remove_malta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4247 -0.9383 0.2091 0.9265 5.3776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9574 0.5804 1.650 0.1041
## tradeshare 1.6809 0.9874 1.702 0.0937 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.789 on 62 degrees of freedom
## Multiple R-squared: 0.04466, Adjusted R-squared: 0.02925
## F-statistic: 2.898 on 1 and 62 DF, p-value: 0.09369
The estimated intercept is 0.9574 and the estimated slope is 1.6809. The formula to estimate growth rate is Y = 0.9574 + 1.6809(tradeshare). A country with a tradeshare of 0.5 would be estimated to have a growth rate of 1.79785. A country with a tradeshare of 1.0 would be estimated to have a growth rate of 2.6383.
#e
ggplot(growth, aes (x = tradeshare, y = growth))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(remove_malta, aes (x = tradeshare, y = growth))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
The regression line that includes Malta is steeper (slope 2.3064 versus 1.6809) because Malta is an outlier. Malta combines a very high trade share with a high growth rate, so it sits far to the right of the other countries and pulls the fitted line upward, making the slope much steeper than it is for the rest of the sample.
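The pull of a single extreme point can be illustrated with a small made-up data set. The numbers below are not from Growth.dta. Only the added point is loosely modeled on Malta's position, and all object names are illustrative.

```r
# Made-up illustration data (NOT the Growth data)
x <- c(0.2, 0.4, 0.6, 0.8, 1.0)
y <- c(1.0, 2.5, 1.5, 3.0, 2.0)

# Slope of the fitted line without the extreme point
slope_without <- coef(lm(y ~ x))[["x"]]

# Add one point with an unusually large x and a large y
x_all   <- c(x, 2.0)
y_all   <- c(y, 6.6)
fit_all <- lm(y_all ~ x_all)
slope_with <- coef(fit_all)[["x_all"]]

# Leverage (hat value) of each observation; the added point is observation 6
leverage         <- hatvalues(fit_all)
most_influential <- unname(which.max(leverage))

c(without = slope_without, with = slope_with)
```

In this toy example the added point has the highest leverage and the slope increases once it is included, which is the same mechanism described above for Malta.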
#f
Malta is an island in the Mediterranean Sea. Its trade share is so large because Malta has one of the most favorable positions for sea trade. It also lacks many necessities and relies on trade to survive. Malta should be excluded from this analysis. It is an anomaly and does not represent the vast majority of countries.
##question 3
earnings_height <- read_dta("Earnings_and_Height.dta")
#a
median(earnings_height$height)
## [1] 67
#b
short <- earnings_height %>% filter(height <= 67)
mean(short$earnings)
## [1] 44488.44
tall <- earnings_height %>% filter(height > 67)
mean(tall$earnings)
## [1] 49987.88
#c
ggplot(data=earnings_height, aes(x = height, y = earnings))+ geom_point()
The points on the plot fall on horizontal lines because earnings were recorded in 23 distinct brackets. For each bracket, the professors in this study calculated a single average value to represent everyone in that group, so earnings can take only 23 different values.
#d
earnings_height_reg <- lm(earnings ~ height, data = earnings_height)
summary(earnings_height_reg)
##
## Call:
## lm(formula = earnings ~ height, data = earnings_height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47836 -21879 -7976 34323 50599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -512.73 3386.86 -0.151 0.88
## height 707.67 50.49 14.016 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26780 on 17868 degrees of freedom
## Multiple R-squared: 0.01088, Adjusted R-squared: 0.01082
## F-statistic: 196.5 on 1 and 17868 DF, p-value: < 2.2e-16
The estimated slope is 707.67. The estimated regression line is Y = -512.73 + 707.67(height). A worker who is 70 in tall is predicted to earn $49,024.17. A worker who is 67 in tall is predicted to earn $46,901.16. A worker who is 65 in tall is predicted to earn $45,485.82.
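These predictions follow from plugging each height into the estimated line. A minimal sketch using the rounded coefficients printed in the summary, with illustrative variable names:

```r
# Rounded coefficient estimates from summary(earnings_height_reg)
b0 <- -512.73  # intercept
b1 <- 707.67   # slope: predicted change in earnings per extra inch

heights   <- c(70, 67, 65)
predicted <- b0 + b1 * heights
names(predicted) <- paste0(heights, " in")

round(predicted, 2)

# Predicted gap between a 70 in and a 65 in worker: five inches times the slope
gap_70_65 <- b1 * (70 - 65)
gap_70_65
```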
#e
remove_males <- earnings_height %>%
filter(sex == 0)
earnings_height_regf <- lm(earnings ~ height, data = remove_males)
summary(earnings_height_regf)
##
## Call:
## lm(formula = earnings ~ height, data = remove_males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42748 -22006 -7466 36641 46865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12650.9 6383.7 1.982 0.0475 *
## height 511.2 98.9 5.169 2.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26800 on 9972 degrees of freedom
## Multiple R-squared: 0.002672, Adjusted R-squared: 0.002572
## F-statistic: 26.72 on 1 and 9972 DF, p-value: 2.396e-07
remove_females <- earnings_height %>%
filter(sex == 1)
earnings_height_regm <- lm(earnings ~ height, data = remove_females)
summary(earnings_height_regm)
##
## Call:
## lm(formula = earnings ~ height, data = remove_females)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50158 -22373 -8118 33091 59228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43130.3 7068.5 -6.102 1.1e-09 ***
## height 1306.9 100.8 12.969 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26670 on 7894 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.02074
## F-statistic: 168.2 on 1 and 7894 DF, p-value: < 2.2e-16
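The two slope estimates above can be compared more formally. The sketch below is a rough check rather than part of the assignment. It assumes the female and male subsamples are independent and takes the standard errors printed by `summary()` at face value. All names are illustrative.

```r
# Slopes and standard errors copied from the two summaries above
slope_f <- 511.2;  se_f <- 98.9    # sex == 0 subsample
slope_m <- 1306.9; se_m <- 100.8   # sex == 1 subsample

# Difference in slopes and its standard error under independence
diff_slopes <- slope_m - slope_f
se_diff     <- sqrt(se_f^2 + se_m^2)

# Approximate z statistic and two-sided p-value (large-sample normal approximation)
z       <- diff_slopes / se_diff
p_value <- 2 * pnorm(-abs(z))

c(difference = diff_slopes, z = z, p_value = p_value)
```

Under these assumptions the gap of about 796 per inch is several standard errors away from zero.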
#g
Sex is a potential confounder. A person's sex is related to both their height and their earnings. Women are, on average, shorter than men. The separate regressions also show that men earn more per extra inch of height than women (a slope of 1306.9 versus 511.2), so sex has an impact on pay too. Therefore, sex should be treated as a potential confounder of the relationship between height and earnings, since it affects both variables.