Econometrics HW1

Author

Vladyslava Bondarenko

Task 1

mydata<- read.csv('C:/Users/User/Desktop/income.data.csv')
head(mydata)

  X   income happiness
1 1 3.862647  2.314489
2 2 4.979381  3.433490
3 3 4.923957  4.599373
4 4 3.214372  2.791114
5 5 7.196409  5.596398
6 6 3.729643  2.458556

str(mydata)

'data.frame':   498 obs. of  3 variables:
 $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ income   : num  3.86 4.98 4.92 3.21 7.2 ...
 $ happiness: num  2.31 3.43 4.6 2.79 5.6 ...

summary(mydata)

       X             income        happiness    
 Min.   :  1.0   Min.   :1.506   Min.   :0.266  
 1st Qu.:125.2   1st Qu.:3.006   1st Qu.:2.266  
 Median :249.5   Median :4.424   Median :3.473  
 Mean   :249.5   Mean   :4.467   Mean   :3.393  
 3rd Qu.:373.8   3rd Qu.:5.992   3rd Qu.:4.503  
 Max.   :498.0   Max.   :7.482   Max.   :6.863

library(psych)
describe(mydata)

          vars   n   mean     sd median trimmed    mad  min    max  range  skew
X            1 498 249.50 143.90 249.50  249.50 184.58 1.00 498.00 497.00  0.00
income       2 498   4.47   1.74   4.42    4.46   2.25 1.51   7.48   5.98  0.02
happiness    3 498   3.39   1.43   3.47    3.39   1.66 0.27   6.86   6.60 -0.01
          kurtosis   se
X            -1.21 6.45
income       -1.24 0.08
happiness    -0.78 0.06

Average income is about 4.47 (roughly $4,470, since income is measured in thousands), with a standard deviation of 1.74, so variability is moderate. The income distribution is nearly symmetric (skewness = 0.02) and has lighter tails than a normal distribution (kurtosis = -1.24).
The average happiness score is 3.39, also with moderate variability (standard deviation = 1.43). Its distribution is nearly symmetric as well (skewness = -0.01) and flatter than a normal distribution (kurtosis = -0.78).

In general, both income and happiness are roughly symmetric, with minimal skew and a low likelihood of extreme outliers. Their negative kurtosis means they are flatter than a normal distribution; for income, the value of -1.24 is close to the -1.2 of a uniform distribution.
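The kurtosis values reported by describe() are excess kurtosis, so a normal distribution scores about 0 and a uniform distribution about -1.2. The sketch below illustrates these two benchmarks on simulated samples. The helper excess_kurtosis() is a hypothetical moment-based function written for this illustration, not the exact estimator used by psych, and the simulated vectors are not the homework data.

```r
# Hypothetical helper: moment-based excess kurtosis (normal distribution -> about 0)
excess_kurtosis <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment
  m4 <- mean((x - m)^4)  # fourth central moment
  m4 / m2^2 - 3
}

set.seed(42)
u <- runif(1e5)  # simulated uniform sample (illustration only)
z <- rnorm(1e5)  # simulated normal sample (illustration only)

k_unif <- excess_kurtosis(u)
k_norm <- excess_kurtosis(z)
print(c(uniform = k_unif, normal = k_norm))
```

Comparing the reported -1.24 for income with these two benchmarks suggests that income is spread fairly evenly over its range rather than concentrated around the mean.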

plot(mydata$income, mydata$happiness, 
     xlab = "Income (in thousands)", 
     ylab = "Happiness Score", 
     main = "Scatter Plot of Income vs Happiness")
abline(lm(happiness ~ income, data = mydata), col = "blue")
lines(lowess(mydata$income, mydata$happiness), col = "red", lty = 2)

To check the linearity assumption between income and happiness visually, we draw a scatter plot where:

The blue line shows the linear regression line.
The red dashed line shows a smoothed (LOWESS) trend line that would highlight any non-linear relationship.

The two lines stay close to each other, so the relationship looks approximately linear.
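The visual check can be complemented by a simple numerical one: add a squared income term and test whether it improves the fit. The sketch below runs on simulated data, because the original CSV is not part of this report. The data frame sim and the parameters used to generate it are assumptions chosen only to resemble the scale of the homework data.

```r
# Simulated stand-in for the income data (assumed parameters, NOT the real dataset)
set.seed(1)
n <- 498
income <- runif(n, min = 1.5, max = 7.5)
happiness <- 0.2 + 0.7 * income + rnorm(n, sd = 0.7)
sim <- data.frame(income = income, happiness = happiness)

# Nested models: linear versus linear plus a quadratic term
linear    <- lm(happiness ~ income, data = sim)
quadratic <- lm(happiness ~ income + I(income^2), data = sim)

# F-test for the added quadratic term; a large p-value gives no evidence of curvature
cmp <- anova(linear, quadratic)
print(cmp)
```

On the real data the same two lm() calls could be run with data = mydata, and the p-value in the second row of the table would indicate whether the straight-line specification is adequate.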

model <- lm(happiness ~ income, data = mydata)
summary(model)


Call:
lm(formula = happiness ~ income, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.02479 -0.48526  0.04078  0.45898  2.37805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.20427    0.08884   2.299   0.0219 *  
income       0.71383    0.01854  38.505   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7181 on 496 degrees of freedom
Multiple R-squared:  0.7493,    Adjusted R-squared:  0.7488 
F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16

Since the p-value for income is below 0.001, income has a strong, positive, and statistically significant association with happiness in this dataset.

For each additional $1,000 in income, happiness is expected to increase by about 0.71 points.

Our model also explains 74.9% of the variation in happiness (R-squared = 0.7493), so it describes the dependence well. The residual standard error is 0.72, so a typical prediction misses the observed happiness score by about 0.72 points, which is small relative to the observed range of 0.27 to 6.86.
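As a small worked example, the reported coefficients can be used directly to compute a fitted value and to check the test statistics. The numbers below are copied from the summary(model) output; the income value of 5 is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary(model) output above
b0    <- 0.20427   # intercept
b1    <- 0.71383   # slope on income
se_b1 <- 0.01854   # standard error of the slope

# Hypothetical income of 5 (thousands), chosen only for illustration
income_new <- 5
predicted_happiness <- b0 + b1 * income_new

# t statistic = estimate / standard error; with one regressor, F = t^2
t_income <- b1 / se_b1
f_stat   <- t_income^2

print(c(prediction = predicted_happiness, t = t_income, F = f_stat))
```

Small differences from the printed t value and F-statistic come from the coefficients being rounded in the output.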

Task 2

library(wooldridge)
data("ceosal1")
head(ceosal1, 5)

  salary pcsalary   sales  roe pcroe ros indus finance consprod utility
1   1095       20 27595.0 14.1 106.4 191     1       0        0       0
2   1001       32  9958.0 10.9 -30.6  13     1       0        0       0
3   1122        9  6125.9 23.5 -16.3  14     1       0        0       0
4    578       -9 16246.0  5.9 -25.7 -21     1       0        0       0
5   1368        7 21783.2 13.8  -3.0  56     1       0        0       0
   lsalary    lsales
1 6.998509 10.225389
2 6.908755  9.206132
3 7.022868  8.720281
4 6.359574  9.695602
5 7.221105  9.988894

model2 <- lm(salary ~ sales, data = ceosal1)
summary(model2)


Call:
lm(formula = salary ~ sales, data = ceosal1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1463.7  -478.9  -241.4   123.1 13614.6 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.174e+03  1.128e+02  10.407   <2e-16 ***
sales       1.547e-02  8.906e-03   1.737   0.0838 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1366 on 207 degrees of freedom
Multiple R-squared:  0.01437,   Adjusted R-squared:  0.009607 
F-statistic: 3.018 on 1 and 207 DF,  p-value: 0.08385

The model explains only 1.44% of the variation in salary (R-squared = 0.01437), indicating poor predictive power. The p-value for sales is 0.0838, so the coefficient is not statistically significant at the 5% level.

The residual standard error is high (1366), so fitted salaries are far from the actual ones, and the F-statistic (3.018, p-value = 0.08385) confirms that the model does not fit the data well.

# Level-level model
model_level_level <- lm(salary ~ sales, data = ceosal1)
summary(model_level_level)


Call:
lm(formula = salary ~ sales, data = ceosal1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1463.7  -478.9  -241.4   123.1 13614.6 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.174e+03  1.128e+02  10.407   <2e-16 ***
sales       1.547e-02  8.906e-03   1.737   0.0838 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1366 on 207 degrees of freedom
Multiple R-squared:  0.01437,   Adjusted R-squared:  0.009607 
F-statistic: 3.018 on 1 and 207 DF,  p-value: 0.08385

The level-level model shows a positive relationship between sales and salary that is significant only at the 10% level (p-value = 0.0838), not at the 5% level. In the wooldridge documentation salary is measured in thousands of dollars and sales in millions of dollars, so each additional $1 million in sales is associated with a salary that is higher by 0.01547 thousand dollars, or about $15.47. The model explains only 1.44% of the variation in salary (R-squared = 0.01437), so sales has a small effect on salary in this specification.
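The size of the level-level slope is easier to read after a unit conversion. The sketch below assumes, following the wooldridge package documentation for ceosal1, that salary is in thousands of dollars and sales is in millions of dollars; the coefficient is copied from the output above.

```r
# Slope copied from summary(model_level_level)
beta_sales <- 0.01547  # change in salary (thousand $) per 1 unit of sales (million $)

# Effect of an extra $1 million in sales, expressed in dollars
dollars_per_million <- beta_sales * 1000

# Effect of an extra $1 billion (1000 units of sales), expressed in thousand dollars
thousands_per_billion <- beta_sales * 1000

print(c(dollars_per_million = dollars_per_million,
        thousands_per_billion = thousands_per_billion))
```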

# Log-level model
model_log_level <- lm(log(salary) ~ sales, data = ceosal1)
summary(model_log_level)


Call:
lm(formula = log(salary) ~ sales, data = ceosal1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.44220 -0.29159 -0.02837  0.28323  2.72487 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.847e+00  4.500e-02 152.138  < 2e-16 ***
sales       1.498e-05  3.553e-06   4.217  3.7e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5448 on 207 degrees of freedom
Multiple R-squared:  0.07912,   Adjusted R-squared:  0.07467 
F-statistic: 17.79 on 1 and 207 DF,  p-value: 3.696e-05

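The log-level model shows a positive relationship between sales and salary that is statistically significant (p-value = 3.7e-05). Each additional unit of sales is associated with a salary that is higher by about 100 * 1.498e-05 = 0.0015%. However, the model explains only 7.91% of the variation in log(salary), so it is very inaccurate. The effect of sales on salary is statistically significant but very small.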
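In a log-level model the slope times 100 approximates the percentage change in salary per unit of sales. The sketch below compares that approximation with the exact percentage change for a hypothetical increase of 1000 units of sales (which the wooldridge documentation records in millions of dollars); the coefficient is copied from the output above.

```r
# Slope copied from summary(model_log_level)
beta_sales <- 1.498e-05

# Hypothetical change in sales, chosen only for illustration
delta_sales <- 1000

# Approximate percentage change in salary: 100 * beta * delta
pct_approx <- 100 * beta_sales * delta_sales

# Exact percentage change implied by the log specification
pct_exact <- 100 * (exp(beta_sales * delta_sales) - 1)

print(c(approx = pct_approx, exact = pct_exact))
```

For small changes the two numbers nearly coincide, which is why the quick "100 times the coefficient" reading is usually sufficient.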

# Level-log model
model_level_log <- lm(salary ~ log(sales), data = ceosal1)
summary(model_level_log)


Call:
lm(formula = salary ~ log(sales), data = ceosal1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1072.1  -447.6  -222.8    41.7 13702.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -898.93     771.50  -1.165  0.24529   
log(sales)    262.90      92.36   2.847  0.00486 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1349 on 207 degrees of freedom
Multiple R-squared:  0.03767,   Adjusted R-squared:  0.03302 
F-statistic: 8.103 on 1 and 207 DF,  p-value: 0.004863

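The level-log model shows a significant positive relationship between log(sales) and salary (p-value = 0.00486). A one-unit increase in log(sales) corresponds to 262.90 units of salary, so each 1% increase in sales is associated with a salary that is higher by about 262.90 / 100 = 2.63 units, roughly $2,630 given that salary is recorded in thousands of dollars. However, the model explains only 3.77% of the variation in salary, indicating a weak fit. The effect is statistically significant, but the model's explanatory power is low.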
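In a level-log model the slope divided by 100 approximates the change in salary for a 1% increase in sales. The sketch below shows that rule next to the exact change, using the coefficient copied from the output above.

```r
# Slope copied from summary(model_level_log)
beta_log_sales <- 262.90

# Approximate change in salary (in salary units) for a 1% increase in sales
change_approx <- beta_log_sales / 100

# Exact change: beta * log(1.01), since log(sales) rises by log(1.01)
change_exact <- beta_log_sales * log(1.01)

print(c(approx = change_approx, exact = change_exact))
```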

# Log-log model
model_log_log <- lm(log(salary) ~ log(sales), data = ceosal1)
summary(model_log_log)


Call:
lm(formula = log(salary) ~ log(sales), data = ceosal1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.01038 -0.28140 -0.02723  0.21222  2.81128 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.82200    0.28834  16.723  < 2e-16 ***
log(sales)   0.25667    0.03452   7.436  2.7e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5044 on 207 degrees of freedom
Multiple R-squared:  0.2108,    Adjusted R-squared:  0.207 
F-statistic:  55.3 on 1 and 207 DF,  p-value: 2.703e-12

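The log-log model shows a positive, highly significant relationship between sales and salary (p-value = 2.7e-12), with each 1% increase in sales corresponding to a salary increase of about 0.257%. The model explains 21.08% of the variation in log(salary), the best fit among the four specifications, though still modest.

To sum up, the log-log model fits best, but with an R-squared of 0.21 it is still too weak to be a good predictor. Note that R-squared values are directly comparable only between models with the same dependent variable (salary or log(salary)).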
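The log-log slope is an elasticity, so it can be applied to larger percentage changes as well. The sketch below computes the salary change implied by a hypothetical 10% increase in sales, using the coefficient copied from the output above.

```r
# Elasticity copied from summary(model_log_log)
elasticity <- 0.25667

# Hypothetical increase in sales of 10%, chosen only for illustration
sales_growth <- 0.10

# Approximate percentage change in salary: elasticity * percentage change in sales
pct_approx <- elasticity * 100 * sales_growth

# Exact percentage change implied by the log-log specification
pct_exact <- 100 * ((1 + sales_growth)^elasticity - 1)

print(c(approx = pct_approx, exact = pct_exact))
```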

Task 3

data(k401k)
model3 <- lm(prate ~ mrate + age, data = k401k)
summary(model3)


Call:
lm(formula = prate ~ mrate + age, data = k401k)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.162  -8.067   4.787  12.474  18.256 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  80.1191     0.7790  102.85  < 2e-16 ***
mrate         5.5213     0.5259   10.50  < 2e-16 ***
age           0.2432     0.0447    5.44 6.21e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.94 on 1531 degrees of freedom
Multiple R-squared:  0.09225,   Adjusted R-squared:  0.09106 
F-statistic: 77.79 on 2 and 1531 DF,  p-value: < 2.2e-16

We examined the effect of a 401k plan's match rate (mrate) and the age of the plan (age) on the participation rate (prate). The estimated coefficients are positive for both mrate and age, suggesting that higher match rates and older plans are associated with higher participation rates. The intercept of 80.1191 represents the estimated baseline participation rate when both mrate and age are zero, though it is unlikely that a real plan would have these values.

The coefficient for mrate is 5.5213, so for each one-unit increase in the match rate, holding plan age fixed, the participation rate is expected to increase by about 5.52 percentage points. The coefficient for age is 0.2432, suggesting that for each additional year a plan has been in place, holding the match rate fixed, the participation rate is expected to increase by about 0.24 percentage points. Both coefficients are statistically significant at conventional levels, as shown by the very low p-values (p < 2e-16 for mrate and p = 6.21e-08 for age), meaning we have strong evidence against the null hypothesis of no effect for both variables.
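To make the coefficients concrete, the sketch below computes a fitted participation rate and a 95% confidence interval for the mrate coefficient. All estimates are copied from the summary(model3) output; the plan with a match rate of 0.5 and an age of 10 years is a hypothetical example.

```r
# Estimates copied from summary(model3)
b0       <- 80.1191
b_mrate  <- 5.5213
b_age    <- 0.2432
se_mrate <- 0.5259
df_resid <- 1531

# Hypothetical plan, chosen only for illustration
mrate_new <- 0.5
age_new   <- 10
predicted_prate <- b0 + b_mrate * mrate_new + b_age * age_new

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit   <- qt(0.975, df = df_resid)
ci_mrate <- b_mrate + c(-1, 1) * t_crit * se_mrate

print(predicted_prate)
print(ci_mrate)
```

The interval lies well above zero, which agrees with the very small p-value reported for mrate.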

The residual standard error of 15.94 shows the typical deviation of the observed prate values from the values predicted by this model, in percentage points, indicating substantial variability that is not captured. The R^2 value of 0.09225 tells us that about 9.23% of the variation in prate is explained by the model, which is relatively low. This low R^2 suggests that other factors not included in this model may play a significant role in determining participation rates. Finally, the overall F-statistic of 77.79 and its very small p-value (< 2.2e-16) indicate that the model is statistically significant, meaning that at least one of the predictors (mrate or age) is significantly associated with prate.
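A low R^2 and a highly significant F-statistic are compatible because the F-statistic also depends on the sample size. The sketch below reconstructs F from the reported R^2 using the standard formula F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), with the values copied from the output above.

```r
# Values copied from summary(model3)
r_squared <- 0.09225
k         <- 2      # number of predictors (mrate, age)
df_resid  <- 1531   # residual degrees of freedom, n - k - 1

# Overall F statistic implied by R-squared
f_stat <- (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail p-value of the F distribution with (k, df_resid) degrees of freedom
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

With more than 1500 observations, even a model that explains about 9% of the variation produces a large F-statistic.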