Midterm

Chapter 2 : C9

Use the data in COUNTYMURDERS to answer these questions. Use only the data for 1996.

##    arrests countyid   density  popul perc1019 perc2029 percblack percmale
## 1        8     1001  67.21535  40061 15.89077 13.17491 20.975510 48.70073
## 2        6     1003  77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 3        1     1005  29.91548  26475 15.06327 13.69972 46.190750 49.15203
## 4        0     1009  67.20457  43392 14.17542 12.99318  1.415007 48.97446
## 5        1     1011  17.89899  11188 14.98927 14.13121 72.756520 49.91956
## 6        2     1013  27.71148  21530 15.68509 11.25871 41.384110 46.81839
## 7       20     1015 186.53970 113511 14.71135 14.28936 19.096830 47.99447
## 8        4     1017  61.51258  36748 14.65386 13.13813 37.253730 47.31142
## 9        2     1019  38.27024  21170 14.13321 12.13037  7.042985 49.22060
## 10       0     1021  50.89291  35323 14.80339 12.64332 11.921410 48.60006
##    rpcincmaint rpcpersinc rpcunemins year murders  murdrate arrestrate
## 1      192.038  11852.760     26.796 1996       7 1.7473350  1.9969550
## 2      139.084  13583.020     28.710 1996       6 0.4877137  0.4877137
## 3      405.768  10760.510     63.162 1996       1 0.3777148  0.3777148
## 4      184.382  11094.820     21.692 1996       2 0.4609145  0.0000000
## 5      485.518   8349.506     63.162 1996       0 0.0000000  0.8938148
## 6      357.918   9947.058     54.868 1996       2 0.9289364  0.9289364
## 7      248.820  11536.320     35.090 1996      14 1.2333610  1.7619440
## 8      243.078  10899.590     41.470 1996       3 0.8163710  1.0884950
## 9      200.970   9806.698     26.796 1996       0 0.0000000  0.9447331
## 10     231.594  10819.840     40.194 1996       0 0.0000000  0.0000000
##    statefips countyfips execs    lpopul execrate
## 1          1          1     0 10.598160        0
## 2          1          3     0 11.720130        0
## 3          1          5     0 10.183960        0
## 4          1          9     0 10.678030        0
## 5          1         11     0  9.322598        0
## 6          1         13     0  9.977202        0
## 7          1         15     0 11.639660        0
## 8          1         17     0 10.511840        0
## 9          1         19     0  9.960340        0
## 10         1         21     0 10.472290        0

(i) Number of counties with zero murders, number of counties with at least one execution, and the largest number of executions.

## Counties with zero murders in 1996: 1051

## Counties with at least one execution in 1996: 31

## Largest number of executions in 1996: 3
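The three counts can be obtained with logical comparisons. As a minimal sketch, the block below applies them to only the ten rows printed in the data preview above (the `preview` data frame is a hand-typed copy of the `murders` and `execs` columns, so its results describe those ten counties, not the full 1996 sample).

```r
# First ten 1996 rows as printed in the data preview (murders and execs only).
preview <- data.frame(
  murders = c(7, 6, 1, 2, 0, 2, 14, 3, 0, 0),
  execs   = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Summing a logical vector counts the TRUE values.
zero_murders   <- sum(preview$murders == 0)
with_execution <- sum(preview$execs >= 1)
max_executions <- max(preview$execs)

cat("Counties with zero murders:", zero_murders, "\n")
cat("Counties with at least one execution:", with_execution, "\n")
cat("Largest number of executions:", max_executions, "\n")
```

Running the same three expressions on the full 1996 subset would presumably reproduce the figures reported above.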

(ii) Estimate the equation murders = β0 + β1execs + u by OLS and report the results in the usual way, including sample size and R-squared.

## 
## Call:
## lm(formula = murders ~ execs, data = subset(countymurders1996))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.12   -5.46   -4.46   -2.46 1338.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4572     0.8348   6.537 7.79e-11 ***
## execs        58.5555     5.8333  10.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared:  0.04389,    Adjusted R-squared:  0.04346 
## F-statistic: 100.8 on 1 and 2195 DF,  p-value: < 2.2e-16

(iii) Interpret the slope coefficient reported in part (ii). Does the estimated equation suggest a deterrent effect of capital punishment?

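## The slope coefficient (β1 = 58.56) is the estimated change in murders for a one-unit change in executions: one more execution is associated with about 58.6 more murders.

## Because the estimate is positive rather than negative, the estimated equation does not suggest a deterrent effect of capital punishment.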

(iv) What is the smallest number of murders that can be predicted by the equation? What is the residual for a county with zero executions and zero murders?

## Smallest number of murders predicted (at execs = 0): 5.457241

## Residual for a county with zero executions and zero murders: -5.457241
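The arithmetic behind part (iv) can be sketched directly from the coefficients in the regression output. The helper `predict_murders` is a hypothetical name introduced here for illustration; the residual is the actual value minus the fitted value, which is why it is negative for a county with no murders.

```r
b0 <- 5.4572    # intercept from the regression output
b1 <- 58.5555   # slope on execs from the regression output

# Fitted number of murders for a given number of executions.
predict_murders <- function(execs) b0 + b1 * execs

# execs cannot be negative and the slope is positive,
# so the smallest fitted value occurs at execs = 0.
smallest_prediction <- predict_murders(0)

# Residual = actual - fitted, for a county with 0 executions and 0 murders.
residual_zero_zero <- 0 - predict_murders(0)

cat("Smallest prediction:", smallest_prediction, "\n")
cat("Residual:", residual_zero_zero, "\n")
```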

(v) Explain why a simple regression analysis is not well suited for determining whether capital punishment has a deterrent effect on murders.

## A simple regression analysis may suffer from omitted variable bias and endogeneity issues.

## Factors other than executions could influence the murder rate, leading to biased estimates.

## Additionally, the decision to implement capital punishment may be influenced by the crime rate,

## creating endogeneity problems and making causal inference challenging.

Chapter 3 : Q5

In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student, the sum of hours in the four activities must be 168.

##    age soph junior senior senior5 male campus business engineer colGPA hsGPA
## 1   21    0      0      1       0    0      0        1        0    3.0   3.0
## 2   21    0      0      1       0    0      0        1        0    3.4   3.2
## 3   20    0      1      0       0    0      0        1        0    3.0   3.6
## 4   19    1      0      0       0    1      1        1        0    3.5   3.5
## 5   20    0      1      0       0    0      0        1        0    3.6   3.9
## 6   20    0      0      1       0    1      1        1        0    3.0   3.4
## 7   22    0      0      0       1    0      0        1        0    2.7   3.5
## 8   22    0      0      0       1    0      0        0        0    2.7   3.0
## 9   22    0      0      0       1    0      0        0        0    2.7   3.0
## 10  19    1      0      0       0    0      0        1        0    3.8   4.0
##    ACT job19 job20 drive bike walk voluntr PC greek car siblings bgfriend clubs
## 1   21     0     1     1    0    0       0  0     0   1        1        0     0
## 2   24     0     1     1    0    0       0  0     0   1        0        1     1
## 3   26     1     0     0    0    1       0  0     0   1        1        0     1
## 4   27     1     0     0    0    1       0  0     0   0        1        0     0
## 5   28     0     1     0    1    0       0  0     0   1        1        1     0
## 6   25     0     0     0    0    1       0  0     0   1        1        0     0
## 7   25     0     0     0    1    0       0  0     1   1        1        0     1
## 8   22     1     0     1    0    0       0  1     0   0        1        1     0
## 9   21     1     0     1    0    0       0  0     0   1        1        1     1
## 10  27     1     0     0    0    1       0  1     0   0        1        0     1
##    skipped alcohol gradMI fathcoll mothcoll
## 1      2.0    1.00      1        0        0
## 2      0.0    1.00      1        1        1
## 3      0.0    1.00      1        1        1
## 4      0.0    0.00      0        0        0
## 5      0.0    1.50      1        1        0
## 6      0.0    0.00      0        1        0
## 7      0.0    2.00      1        0        1
## 8      3.0    3.00      1        1        1
## 9      2.0    2.50      1        1        1
## 10     0.5    0.75      1        0        1

(i) In the model GPA = β0 + β1study + β2sleep + β3work + β4leisure + u, does it make sense to hold sleep, work, and leisure fixed while changing study?

## It doesn't make sense to hold the variables fixed while changing study because all the variables must total up to 168.

## Changing the hours spent on studying would inherently change the hours available for other activities.

(ii) Explain why this model violates Assumption MLR.3.

## study, sleep, work, and leisure always sum to 168, so each one is an exact linear function of the other three: the regressors are perfectly collinear. Assumption MLR.3 rules out perfect collinearity; when it fails, the OLS estimates cannot be computed uniquely.

(iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?

## You could drop one of the variables. For example, dropping work gives GPA = β0 + β1study + β2sleep + β3leisure + u. Now β1 is the effect on GPA of one more hour of studying, holding sleep and leisure fixed. Because the hours must still total 168, that extra hour of studying comes out of work, so β1 measures the effect of shifting one hour from work to studying.
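A small simulation can illustrate both points. This is a sketch on made-up data (the hours, the GPA equation, and the seed are all assumptions chosen for illustration, not survey results): when the four activities sum to 168, `lm()` cannot estimate all four slopes and reports `NA` for one of them, while the model without work is estimated normally.

```r
set.seed(1)
n <- 200

# Hypothetical weekly hours; leisure is whatever remains of the 168 hours.
study   <- sample(10:40, n, replace = TRUE)
sleep   <- sample(40:60, n, replace = TRUE)
work    <- sample(0:30,  n, replace = TRUE)
leisure <- 168 - study - sleep - work

# Hypothetical GPA equation, used only to have a dependent variable.
gpa <- 2 + 0.02 * study + 0.005 * sleep + rnorm(n, sd = 0.3)

# All four activities: perfectly collinear with the intercept.
full <- lm(gpa ~ study + sleep + work + leisure)

# Dropping work restores a full-rank design matrix.
reduced <- lm(gpa ~ study + sleep + leisure)

print(coef(full))     # one slope is NA
print(coef(reduced))  # all coefficients estimated
```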

Chapter 3 : Q10

Suppose that you are interested in estimating the ceteris paribus relationship between y and x1. For this purpose, you can collect data on two control variables, x2 and x3. (For concreteness, you might think of y as final exam score, x1 as class attendance, x2 as GPA up through the previous semester, and x3 as SAT or ACT score.) Let β̃1 be the simple regression estimate from y on x1 and let β̂1 be the multiple regression estimate from y on x1, x2, x3.

(i) High correlation between x1, x2, x3; large partial effects on y

Explanation: β1 from simple and multiple regression may be very different because of high collinearity and large partial effects.

## High correlation between x1, x2, x3 with large partial effects on y:

## [1] "Simple regression β1: 3.816 SE: 0.146"
## [1] "Multiple regression β1: 1.934 SE: 0.292"

(ii) x1 uncorrelated with x2, x3; x2 and x3 highly correlated

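Explanation: β1 from simple and multiple regression should be similar. Because x1 is uncorrelated with x2 and x3, omitting them does not bias the simple regression estimate, regardless of how strongly x2 and x3 are correlated with each other.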

## x1 uncorrelated with x2 and x3; x2 and x3 highly correlated:

## [1] "Simple regression β1: 2.526 SE: 0.214"
## [1] "Multiple regression β1: 2.139 SE: 0.11"

(iii) High correlation between x1, x2, x3; small partial effects on y

Explanation: the SE of β1 should be larger in the multiple regression due to collinearity, but the two β1 estimates may not be too different.

## High correlation between x1, x2, x3 with small partial effects on y:

## [1] "Simple regression β1: 2.157 SE: 0.098"
## [1] "Multiple regression β1: 1.901 SE: 0.24"

(iv) x1 uncorrelated with x2, x3; large partial effects of x2, x3 on y

Explanation: the SE of β1 should be smaller in the multiple regression. x1 is uncorrelated with x2 and x3, so including them adds no collinearity, while their large effects on y reduce the error variance.

## x1 uncorrelated with x2 and x3; large partial effects of x2, x3 on y:

## [1] "Simple regression β1: 2.021 SE: 0.191"
## [1] "Multiple regression β1: 2.039 SE: 0.089"
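The kind of simulation behind scenario (i) can be sketched as follows. The data-generating values here (a common factor `z`, slopes of 2, the seed) are assumptions made for this illustration and are not the settings that produced the numbers above, so the printed values will differ.

```r
set.seed(42)
n <- 1000

# A common factor makes x1, x2, and x3 highly correlated.
z  <- rnorm(n)
x1 <- z + 0.3 * rnorm(n)
x2 <- z + 0.3 * rnorm(n)
x3 <- z + 0.3 * rnorm(n)

# Large partial effects of x2 and x3 on y.
y <- 2 * x1 + 2 * x2 + 2 * x3 + rnorm(n)

# Slope on x1 and its standard error, without and with the controls.
simple   <- summary(lm(y ~ x1))$coefficients["x1", c("Estimate", "Std. Error")]
multiple <- summary(lm(y ~ x1 + x2 + x3))$coefficients["x1", c("Estimate", "Std. Error")]

print(round(rbind(simple, multiple), 3))
```

In this setup the simple regression slope absorbs part of the effects of x2 and x3, so it lies well above the multiple regression slope. Changing the correlation or the slopes on x2 and x3 gives the other three scenarios.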

Chapter 3 : C8

Use the data in DISCRIM to answer this question. These are zip code-level data on prices for various items at fast-food restaurants, along with characteristics of the zip code population, in New Jersey and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of blacks.

(i) Find the average values of prpblck and income in the sample, along with their standard deviations. What are the units of measurement of prpblck and income?

## Average prpblck: 0.1134864 
## Standard deviation prpblck: 0.1824165

## Average income: 47053.78 
## Standard deviation income: 13179.29

The averages of “prpblck” and “income” are about 0.113 and 47,053.78, respectively, with standard deviations of about 0.182 and 13,179.29. prpblck is the proportion of the population that is black (a fraction between 0 and 1, not a percentage), while income is measured in dollars.

(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of the population that is black and median income: psoda = b0 + b1prpblck + b2income + u. Estimate this model by OLS and report the results in equation form, including the sample size and R-squared. (Do not use scientific notation when reporting the estimates.) Interpret the coefficient on prpblck. Do you think it is economically large?

## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06

The resulting regression is psoda.hat = 0.956 + 0.115 prpblck + 0.0000016 income. The sample size is n = 401 (398 residual degrees of freedom plus 3 estimated parameters, after 9 observations were deleted for missing values), R^2 is 0.0642, and the adjusted R^2 is 0.0595. The coefficient on prpblck indicates that, holding income fixed, an increase in prpblck of 0.10 (10 percentage points) raises the predicted price of soda by about 0.0115 dollars, or roughly 1.2 cents, which is not economically large.
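Both the sample size and the size of the prpblck effect follow from numbers in the output above. A short sketch of that arithmetic, using the reported coefficient and residual degrees of freedom:

```r
b_prpblck <- 0.1150   # coefficient on prpblck from the output
resid_df  <- 398      # residual degrees of freedom from the output
k         <- 2        # number of slope coefficients (prpblck, income)

# Residual df = n - k - 1, so n = df + k + 1.
n <- resid_df + k + 1

# Change in predicted price (dollars) when prpblck rises by 0.10.
price_change <- b_prpblck * 0.10

cat("Sample size:", n, "\n")
cat("Price change in cents:", 100 * price_change, "\n")
```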

(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck. Is the discrimination effect larger or smaller when you control for income?

## 
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
## 
## Coefficients:
## (Intercept)      prpblck  
##     1.03740      0.06493

## 
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015

The simple regression estimate of the coefficient on prpblck is 0.065, lower than the estimate of 0.115 from part (ii). The discrimination effect is therefore larger when income is controlled for.
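The gap between the two estimates is what the omitted variable algebra predicts: the simple slope equals the multiple regression slope plus the income coefficient times the slope from regressing income on prpblck. As a sketch, assuming both regressions use the same 401 observations, the reported coefficients can be used to back out that auxiliary slope (it is implied by rounded output, not estimated directly here).

```r
b1_simple   <- 0.06493     # psoda on prpblck only
b1_multiple <- 0.1150      # psoda on prpblck and income
b2_income   <- 1.603e-06   # coefficient on income in the multiple regression

# Identity: b1_simple = b1_multiple + b2_income * delta1,
# where delta1 is the slope from regressing income on prpblck.
delta1 <- (b1_simple - b1_multiple) / b2_income

cat("Implied slope of income on prpblck:", round(delta1), "\n")
```

The implied slope is negative: areas with a higher proportion of black residents have lower income in this sample, and since income raises psoda, leaving income out pulls the prpblck coefficient down.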

(iv) A model with a constant price elasticity with respect to income may be more appropriate. Report estimates of the model log(psoda) = b0 + b1prpblck + b2log(income) + u. If prpblck increases by .20 (20 percentage points), what is the estimated percentage change in psoda? (Hint: The answer is 2.xx, where you fill in the “xx.”)

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
## 
## Coefficients:
## (Intercept)      prpblck  log(income)  
##    -0.79377      0.12158      0.07651

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33563 -0.04695  0.00658  0.04334  0.35413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0821 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06809,    Adjusted R-squared:  0.06341 
## F-statistic: 14.54 on 2 and 398 DF,  p-value: 8.039e-07

## [1] "2.44 percent increase"

If “prpblck” increases by 0.20 (20 percentage points), estimated psoda will increase by about 2.44% (100 × 0.122 × 0.20).
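The 2.44 figure uses the usual approximation that 100 times the change in log(psoda) is the percentage change. A brief sketch comparing it with the exact change implied by the log model, using the prpblck coefficient rounded to three decimals (0.122):

```r
b_prpblck <- 0.122   # coefficient on prpblck, rounded to three decimals
delta     <- 0.20    # increase in prpblck (20 percentage points)

# Approximation: %change in psoda is about 100 * (change in log(psoda)).
approx_pct <- 100 * b_prpblck * delta

# Exact percentage change implied by the log model.
exact_pct <- 100 * (exp(b_prpblck * delta) - 1)

cat("Approximate % change:", round(approx_pct, 2), "\n")
cat("Exact % change:", round(exact_pct, 2), "\n")
```

For a change this small the two figures differ only in the second decimal place.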

(v) Now add the variable prppov to the regression in part (iv). What happens to the estimated coefficient on prpblck?

## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
## 
## Coefficients:
## (Intercept)      prpblck  log(income)       prppov  
##    -1.46333      0.07281      0.13696      0.38036

Adding prppov causes the prpblck coefficient to fall from 0.1216 to 0.0728.

(vi) Find the correlation between log(income) and prppov. Is it roughly what you expected?

## [1] -0.838467

The correlation is approximately -0.838. This makes sense, because one would expect that declines in income would result in higher poverty rates.

(vii) Evaluate the following statement: “Because log(income) and prppov are so highly correlated, they have no business being in the same regression.”

Although they are highly correlated, including both does not create perfect collinearity; instead it complements the model by adding another control variable that helps to isolate the discrimination effect.

Chapter 4 : Q3

The variable rdintens is expenditures on research and development (R&D) as a percentage of sales. Sales are measured in millions of dollars. The variable profmarg is profits as a percentage of sales. Using the data in RDCHEM for 32 firms in the chemical industry, the following equation is estimated: rdintens = .472 + .321 log(sales) + .050 profmarg, with standard errors (1.369), (.216), and (.046) on the three coefficients in order; n = 32, R2 = .099.

## 
## Call:
## lm(formula = rdintens ~ log(sales) + profmarg, data = rdchem)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3016 -1.2707 -0.6895  0.8785  6.0369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.47225    1.67606   0.282    0.780
## log(sales)   0.32135    0.21557   1.491    0.147
## profmarg     0.05004    0.04578   1.093    0.283
## 
## Residual standard error: 1.839 on 29 degrees of freedom
## Multiple R-squared:  0.09847,    Adjusted R-squared:  0.0363 
## F-statistic: 1.584 on 2 and 29 DF,  p-value: 0.2224

(i) Interpretation of the coefficient on log(sales)

## The estimated percentage point change in rdintens for a 10% increase in sales is: 0.03213484

(ii) Hypothesis test for log(sales)

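## Two-sided p-value for log(sales): 0.1468382

## Against a two-sided alternative, fail to reject the null hypothesis at both the 5% and 10% levels.

## Against the one-sided alternative that the coefficient is positive, the p-value is half as large (about 0.073), so the null is rejected at the 10% level but not at the 5% level.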
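The p-values for the log(sales) test can be recomputed from the estimate and standard error in the regression output. The sketch below uses the rounded values printed there, so the results agree with the reported p-value only to about three decimals.

```r
t_stat <- 0.32135 / 0.21557   # estimate divided by its standard error
df     <- 29                  # residual degrees of freedom

# Two-sided test: H1 is that the coefficient differs from zero.
p_two_sided <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# One-sided test: H1 is that the coefficient is positive.
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

cat("t statistic:", round(t_stat, 3), "\n")
cat("Two-sided p-value:", round(p_two_sided, 4), "\n")
cat("One-sided p-value:", round(p_one_sided, 4), "\n")
```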

(iii) Interpretation of the coefficient on profmarg

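## The coefficient for profmarg is: 0.0500367, so a one percentage point increase in profmarg is associated with an increase of about 0.05 percentage points in rdintens, holding sales fixed.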

(iv) Statistical significance of profmarg

## P-value for profmarg: 0.2833658

## The coefficient for profmarg is not statistically significant.

Chapter 4 : C8

(i) How many single-person households are there in the data set?

## Number of single-person households: 2017

(ii) Estimate the OLS model and interpret the coefficients

The model to estimate is: nettfa = β0 + β1inc + β2age + u

## 
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.95  -14.16   -3.42    6.03 1113.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.03981    4.08039 -10.548   <2e-16 ***
## inc           0.79932    0.05973  13.382   <2e-16 ***
## age           0.84266    0.09202   9.158   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1185 
## F-statistic: 136.5 on 2 and 2014 DF,  p-value: < 2.2e-16

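## A positive coefficient for inc suggests that as income increases, net financial wealth also tends to increase: holding age fixed, one more unit of inc is associated with about 0.80 more units of nettfa. Holding income fixed, one more year of age is associated with about 0.84 more units of nettfa.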

(iii) Interpret the intercept

## [1] "Intercept: -43.0398119486705"

## The intercept from the regression represents the expected net financial wealth (nettfa) for a single-person household with both income (inc) and age (age) equal to zero.

## However, this value does not have a meaningful real-world interpretation in this context because it is unrealistic to have both income and age as zero for an adult.

(iv) Hypothesis test for β2 = 1 vs β2 < 1

We need to perform a one-sided t-test of the null hypothesis H₀: β₂ = 1 against the alternative hypothesis H₁: β₂ < 1.

## [1] "t-statistic: -1.70994388324452"

## [1] "p-value: 0.0437151388035654"

## [1] "Fail to reject H0 at the 1% significance level."
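The test statistic measures how far the age coefficient is from the hypothesized value of 1, in standard-error units. A sketch of the calculation from the rounded estimate and standard error in the regression output (so the result matches the reported values to about three decimals):

```r
b_age  <- 0.84266   # estimated coefficient on age
se_age <- 0.09202   # its standard error
df     <- 2014      # residual degrees of freedom

# t statistic for H0: beta2 = 1.
t_stat <- (b_age - 1) / se_age

# Lower-tail p-value, because the alternative is beta2 < 1.
p_value <- pt(t_stat, df)

cat("t statistic:", round(t_stat, 4), "\n")
cat("p-value:", round(p_value, 4), "\n")
```

With a p-value of about 0.044, the null is rejected at the 5% level but not at the 1% level.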

(v) Simple regression of nettfa on inc

To compare the simple regression of nettfa on inc with the earlier multivariate model:

## 
## Call:
## lm(formula = nettfa ~ inc, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -185.12  -12.85   -4.85    1.78 1112.66 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.5709     2.0607   -5.13 3.18e-07 ***
## inc           0.8207     0.0609   13.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared:  0.08267,    Adjusted R-squared:  0.08222 
## F-statistic: 181.6 on 1 and 2015 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179.95  -14.16   -3.42    6.03 1113.94 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -43.03981    4.08039 -10.548   <2e-16 ***
## inc           0.79932    0.05973  13.382   <2e-16 ***
## age           0.84266    0.09202   9.158   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared:  0.1193, Adjusted R-squared:  0.1185 
## F-statistic: 136.5 on 2 and 2014 DF,  p-value: < 2.2e-16

## The coefficient of inc changed slightly, from 0.8207 in the simple model to 0.7993 in the multivariate model.

## This small change suggests that adding age as a predictor did not substantially alter the relationship between inc and nettfa.

## Therefore, age does not strongly confound the effect of inc on nettfa.

Chapter 5 : Q5

The following histogram was created using the variable score in the data file ECONMATH. Thirty bins were used to create the histogram, and the height of each cell is the proportion of observations falling within the corresponding interval. The best-fitting normal distribution-that is, using the sample mean and sample standard deviation-has been superimposed on the histogram.

(i) If you use the normal distribution to estimate the probability that score exceeds 100, would the answer be zero? Why does your answer contradict the assumption of a normal distribution for score?

## The estimated probability of score exceeding 100 is: 0.01883527

## This probability is not zero, which contradicts the assumption of a normal distribution,

## because scores are bounded by 0 and 100 in reality, whereas a normal distribution assumes no such bounds.

(ii) Visualize the left tail of the histogram and compare with normal distribution

## Observing the left tail of the histogram:

## The normal distribution does not fit well on the left tail, as it assigns more probability to

## values below 20 than is observed in the data. This discrepancy suggests a poor fit on the left side,

## where real scores are restricted to be at least 20, while a normal distribution would allow negative values.

Chapter 5 : C1

Use the data in WAGE1 for this exercise.

(i) Estimate the equation wage = β0 + β1educ + β2exper + β3tenure + u. Save the residuals and plot a histogram.

(ii) Repeat part (i), but with log(wage) as the dependent variable.

(iii) Would you say that Assumption MLR.6 is closer to being satisfied for the level-level model or the log-level model?

## [1] "Breusch-Pagan Test for Level-Level Model:"

## 
##  studentized Breusch-Pagan test
## 
## data:  model_wage
## BP = 43.096, df = 3, p-value = 2.349e-09

## [1] "Breusch-Pagan Test for Log-Level Model:"

## 
##  studentized Breusch-Pagan test
## 
## data:  model_logwage
## BP = 10.761, df = 3, p-value = 0.01309

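## The log-level model is closer to satisfying Assumption MLR.6 than the level-level model; this is judged from the residual histograms in parts (i) and (ii), because MLR.6 concerns normality of the errors.

## The Breusch-Pagan tests speak to homoskedasticity (Assumption MLR.5) instead: the log-level model still shows some evidence of heteroskedasticity (p-value = 0.01309), but far less than the level-level model (p-value = 2.349e-09).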
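One way to put a number on how far residuals are from normal is their sample skewness, which is zero for a symmetric distribution. The sketch below uses simulated data, not WAGE1: the wage equation, the seed, and the `skewness` helper are assumptions made for illustration. Wages are generated so that log(wage) has normal errors, which mimics the pattern usually seen in wage data.

```r
set.seed(123)
n <- 2000

# Hypothetical years of schooling and a wage with normal errors in logs.
educ <- sample(8:18, n, replace = TRUE)
wage <- exp(0.5 + 0.09 * educ + rnorm(n, sd = 0.5))

# Sample skewness: third central moment scaled by the cubed standard deviation.
skewness <- function(e) mean((e - mean(e))^3) / sd(e)^3

res_level <- resid(lm(wage ~ educ))        # level-level residuals
res_log   <- resid(lm(log(wage) ~ educ))   # log-level residuals

cat("Skewness, level-level residuals:", round(skewness(res_level), 2), "\n")
cat("Skewness, log-level residuals:", round(skewness(res_log), 2), "\n")
```

In this simulation the level-level residuals are strongly right-skewed while the log-level residuals are close to symmetric. Calling `hist()` on each residual vector shows the same contrast graphically.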