Use the data in COUNTYMURDERS to answer this question. Use only the data for 1996.
## arrests countyid density popul perc1019 perc2029 percblack percmale
## 1 8 1001 67.21535 40061 15.89077 13.17491 20.975510 48.70073
## 2 6 1003 77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 3 1 1005 29.91548 26475 15.06327 13.69972 46.190750 49.15203
## 4 0 1009 67.20457 43392 14.17542 12.99318 1.415007 48.97446
## 5 1 1011 17.89899 11188 14.98927 14.13121 72.756520 49.91956
## 6 2 1013 27.71148 21530 15.68509 11.25871 41.384110 46.81839
## 7 20 1015 186.53970 113511 14.71135 14.28936 19.096830 47.99447
## 8 4 1017 61.51258 36748 14.65386 13.13813 37.253730 47.31142
## 9 2 1019 38.27024 21170 14.13321 12.13037 7.042985 49.22060
## 10 0 1021 50.89291 35323 14.80339 12.64332 11.921410 48.60006
## rpcincmaint rpcpersinc rpcunemins year murders murdrate arrestrate
## 1 192.038 11852.760 26.796 1996 7 1.7473350 1.9969550
## 2 139.084 13583.020 28.710 1996 6 0.4877137 0.4877137
## 3 405.768 10760.510 63.162 1996 1 0.3777148 0.3777148
## 4 184.382 11094.820 21.692 1996 2 0.4609145 0.0000000
## 5 485.518 8349.506 63.162 1996 0 0.0000000 0.8938148
## 6 357.918 9947.058 54.868 1996 2 0.9289364 0.9289364
## 7 248.820 11536.320 35.090 1996 14 1.2333610 1.7619440
## 8 243.078 10899.590 41.470 1996 3 0.8163710 1.0884950
## 9 200.970 9806.698 26.796 1996 0 0.0000000 0.9447331
## 10 231.594 10819.840 40.194 1996 0 0.0000000 0.0000000
## statefips countyfips execs lpopul execrate
## 1 1 1 0 10.598160 0
## 2 1 3 0 11.720130 0
## 3 1 5 0 10.183960 0
## 4 1 9 0 10.678030 0
## 5 1 11 0 9.322598 0
## 6 1 13 0 9.977202 0
## 7 1 15 0 11.639660 0
## 8 1 17 0 10.511840 0
## 9 1 19 0 9.960340 0
## 10 1 21 0 10.472290 0
## Counties with zero murders in 1996: 1051
## Counties with at least one execution in 1996: 31
## Largest number of executions in 1996: 3
##
## Call:
## lm(formula = murders ~ execs, data = subset(countymurders1996))
##
## Residuals:
## Min 1Q Median 3Q Max
## -149.12 -5.46 -4.46 -2.46 1338.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.4572 0.8348 6.537 7.79e-11 ***
## execs 58.5555 5.8333 10.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared: 0.04389, Adjusted R-squared: 0.04346
## F-statistic: 100.8 on 1 and 2195 DF, p-value: < 2.2e-16
## The slope coefficient (β1) represents the change in the number of murders for a one-unit change in executions.
## A negative β1 would suggest a deterrent effect of capital punishment; here the estimate is positive (58.56), so these data provide no evidence of deterrence.
## Smallest number of murders predicted: 5.457241
## Residual for a county with zero executions and zero murders: -5.457241
## A simple regression analysis may suffer from omitted variable bias and endogeneity issues.
## Factors other than executions could influence the murder rate, leading to biased estimates.
## Additionally, the decision to implement capital punishment may be influenced by the crime rate,
## creating endogeneity problems and making causal inference challenging.
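A minimal sketch of R code that could reproduce the counts and regression above, assuming the countymurders data from the wooldridge package (the exact code behind this output is not shown):

```r
library(dplyr)
library(wooldridge)

data("countymurders")
cm96 <- filter(countymurders, year == 1996)   # 1996 observations only

sum(cm96$murders == 0)   # counties with zero murders
sum(cm96$execs >= 1)     # counties with at least one execution
max(cm96$execs)          # largest number of executions

fit <- lm(murders ~ execs, data = cm96)
summary(fit)

min(fitted(fit))         # smallest predicted number of murders
0 - coef(fit)[1]         # residual when execs = 0 and murders = 0
```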
In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student, the sum of hours in the four activities must be 168.
## age soph junior senior senior5 male campus business engineer colGPA hsGPA
## 1 21 0 0 1 0 0 0 1 0 3.0 3.0
## 2 21 0 0 1 0 0 0 1 0 3.4 3.2
## 3 20 0 1 0 0 0 0 1 0 3.0 3.6
## 4 19 1 0 0 0 1 1 1 0 3.5 3.5
## 5 20 0 1 0 0 0 0 1 0 3.6 3.9
## 6 20 0 0 1 0 1 1 1 0 3.0 3.4
## 7 22 0 0 0 1 0 0 1 0 2.7 3.5
## 8 22 0 0 0 1 0 0 0 0 2.7 3.0
## 9 22 0 0 0 1 0 0 0 0 2.7 3.0
## 10 19 1 0 0 0 0 0 1 0 3.8 4.0
## ACT job19 job20 drive bike walk voluntr PC greek car siblings bgfriend clubs
## 1 21 0 1 1 0 0 0 0 0 1 1 0 0
## 2 24 0 1 1 0 0 0 0 0 1 0 1 1
## 3 26 1 0 0 0 1 0 0 0 1 1 0 1
## 4 27 1 0 0 0 1 0 0 0 0 1 0 0
## 5 28 0 1 0 1 0 0 0 0 1 1 1 0
## 6 25 0 0 0 0 1 0 0 0 1 1 0 0
## 7 25 0 0 0 1 0 0 0 1 1 1 0 1
## 8 22 1 0 1 0 0 0 1 0 0 1 1 0
## 9 21 1 0 1 0 0 0 0 0 1 1 1 1
## 10 27 1 0 0 0 1 0 1 0 0 1 0 1
## skipped alcohol gradMI fathcoll mothcoll
## 1 2.0 1.00 1 0 0
## 2 0.0 1.00 1 1 1
## 3 0.0 1.00 1 1 1
## 4 0.0 0.00 0 0 0
## 5 0.0 1.50 1 1 0
## 6 0.0 0.00 0 1 0
## 7 0.0 2.00 1 0 1
## 8 3.0 3.00 1 1 1
## 9 2.0 2.50 1 1 1
## 10 0.5 0.75 1 0 1
## It doesn't make sense to hold sleep, work, and leisure fixed while changing study, because the four variables must sum to 168 for every student:
## changing the hours spent studying necessarily changes the hours available for the other activities.
## The model violates Assumption MLR.3 (no perfect collinearity): since study + sleep + work + leisure = 168, each regressor is an exact linear function of the other three, so OLS cannot be estimated.
## The fix is to drop one of the variables. For example, dropping work gives colGPA = β0 + β1study + β2sleep + β3leisure + u. Now β1 measures the effect on GPA of one more hour of studying, holding sleep and leisure fixed, which implicitly means one less hour of work. The sketch below demonstrates the collinearity problem.
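A small hypothetical simulation (not part of the original assignment) showing the mechanics: when the four time-use variables sum to 168, lm() detects the perfect collinearity and reports NA for the redundant regressor.

```r
set.seed(1)                                   # hypothetical simulated data
n       <- 100
study   <- runif(n, 10, 40)
sleep   <- runif(n, 40, 60)
work    <- runif(n, 0, 40)
leisure <- 168 - study - sleep - work         # forces perfect collinearity
gpa     <- 2 + 0.02 * study + rnorm(n, sd = 0.3)

coef(lm(gpa ~ study + sleep + work + leisure))  # leisure is reported as NA
```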
Suppose that you are interested in estimating the ceteris paribus relationship between y and x1. For this purpose, you can collect data on two control variables, x2 and x3. (For concreteness, you might think of y as final exam score, x1 as class attendance, x2 as GPA up through the previous semester, and x3 as SAT or ACT score.) Let β̃1 be the simple regression estimate from y on x1 and let β̂1 be the multiple regression estimate from y on x1, x2, x3.
Explanation: when x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, the simple and multiple regression estimates of β1 can be very different.
## High correlation between x1, x2, x3 with large partial effects on y:
## [1] "Simple regression β1: 3.816 SE: 0.146"
## [1] "Multiple regression β1: 1.934 SE: 0.292"
Explanation: when x2 and x3 have small partial effects on y, the simple and multiple regression estimates of β1 should be similar, but the standard error of β1 from the multiple regression will be larger because of the high collinearity.
## High correlation between x1, x2, x3 with small partial effects on y:
## [1] "Simple regression β1: 2.157 SE: 0.098"
## [1] "Multiple regression β1: 1.901 SE: 0.24"
Use the data in DISCRIM to answer this question. These are zip code-level data on prices for various items at fast-food restaurants, along with characteristics of the zip code population, in New Jersey and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a larger concentration of blacks.
## Average prpblck: 0.1134864
## Standard deviation prpblck: 0.1824165
## Average income: 47053.78
## Standard deviation income: 13179.29
The averages of prpblck and income are 0.113 and 47,053.78, and the standard deviations are 0.1824 and 13,179.29, respectively. prpblck is a proportion (a number between 0 and 1), while income is measured in dollars.
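A sketch of the computation, assuming the discrim data from the wooldridge package (missing values must be dropped):

```r
library(wooldridge)
data("discrim")

mean(discrim$prpblck, na.rm = TRUE)   # 0.1135
sd(discrim$prpblck,   na.rm = TRUE)   # 0.1824
mean(discrim$income,  na.rm = TRUE)   # 47053.78
sd(discrim$income,    na.rm = TRUE)   # 13179.29
```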
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
The estimated regression is psoda-hat = 0.956 + 0.115 prpblck + 0.0000016 income. The sample size is 401 (398 residual degrees of freedom plus three estimated parameters, with 9 observations dropped because of missing values), and the adjusted R² is 0.0595. The coefficient on prpblck implies that, other things equal, a 10 percentage point increase in prpblck raises the price of soda by about 1.2 cents, an effect that is economically small.
##
## Call:
## lm(formula = psoda ~ prpblck, data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
The coefficient on prpblck in the simple regression is 0.065, compared with 0.115 in the multiple regression. The estimated discrimination effect is therefore smaller when income is excluded: since income enters the multiple regression positively and prpblck and income are negatively correlated, omitting income biases the prpblck coefficient downward.
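The move from 0.115 to 0.065 is exactly what the omitted-variable formula predicts: β̃1 = β̂1 + β̂2·δ̃1, where δ̃1 is the slope from regressing income on prpblck. A sketch of the check (the identity is exact only when all regressions use the same observations, so rows with missing values are dropped first):

```r
used <- subset(discrim, !is.na(psoda) & !is.na(prpblck) & !is.na(income))

fit_mult <- lm(psoda ~ prpblck + income, data = used)
delta1   <- coef(lm(income ~ prpblck, data = used))["prpblck"]

# Recover the simple-regression coefficient from the multiple regression:
unname(coef(fit_mult)["prpblck"] + coef(fit_mult)["income"] * delta1)  # ~0.065
```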
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = discrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
## [1] "2.44 percent increase"
If prpblck increases by 20 percentage points (0.20), estimated psoda increases by roughly 2.4% (100 × 0.1216 × 0.20 ≈ 2.43).
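The arithmetic behind this figure (the output above reports 2.44; small differences come from rounding the coefficient):

```r
fit_log <- lm(log(psoda) ~ prpblck + log(income), data = discrim)

# Approximate percent change in psoda for a 0.20 increase in prpblck
100 * coef(fit_log)["prpblck"] * 0.20   # about 2.43
```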
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = discrim)
##
## Coefficients:
## (Intercept) prpblck log(income) prppov
## -1.46333 0.07281 0.13696 0.38036
Adding prppov causes the prpblck coefficient to fall from 0.122 to about 0.073.
## [1] -0.838467
The correlation is approximately -0.838. This makes sense: lower incomes go hand in hand with higher poverty rates.
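A sketch of the computation; following the textbook question, I assume the reported value is the correlation between log(income) and prppov:

```r
cor(log(discrim$income), discrim$prppov, use = "complete.obs")   # about -0.84
```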
The variable rdintens is expenditures on research and development (R&D) as a percentage of sales. Sales are measured in millions of dollars. The variable profmarg is profits as a percentage of sales. Using the data in RDCHEM for 32 firms in the chemical industry, the following equation is estimated (standard errors in parentheses):

rdintens = 0.472 + 0.321 log(sales) + 0.050 profmarg
          (1.369)  (0.216)            (0.046)

n = 32, R² = 0.099.
##
## Call:
## lm(formula = rdintens ~ log(sales) + profmarg, data = rdchem)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3016 -1.2707 -0.6895 0.8785 6.0369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.47225 1.67606 0.282 0.780
## log(sales) 0.32135 0.21557 1.491 0.147
## profmarg 0.05004 0.04578 1.093 0.283
##
## Residual standard error: 1.839 on 29 degrees of freedom
## Multiple R-squared: 0.09847, Adjusted R-squared: 0.0363
## F-statistic: 1.584 on 2 and 29 DF, p-value: 0.2224
## The estimated percentage point change in rdintens for a 10% increase in sales is: 0.03213484
## P-value for log(sales): 0.1468382
## Fail to reject the null hypothesis at both the 5% and 10% levels.
## b) p-value for the test on log(sales) coefficient: 0.1468382
## The coefficient for profmarg is: 0.0500367
## P-value for profmarg: 0.2833658
## The coefficient for profmarg is not statistically significant.
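A sketch of how these quantities could be computed, assuming the rdchem data from the wooldridge package:

```r
library(wooldridge)
data("rdchem")

fit <- lm(rdintens ~ log(sales) + profmarg, data = rdchem)

# Percentage-point change in rdintens for a 10% increase in sales
coef(fit)["log(sales)"] / 10        # the usual approximation, ~0.032

# t statistics and two-sided p-values for the individual coefficients
summary(fit)$coefficients
```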
## Number of single-person households: 2017
The model to estimate is: nettfa = β0 + β1 inc + β2 age + u.
##
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
## A positive coefficient for inc suggests that as income increases, net financial wealth also tends to increase.
## [1] "Intercept: -43.0398119486705"
## The intercept from the regression represents the expected net financial wealth (nettfa) for a single-person household with both income (inc) and age equal to zero.
## This value has no meaningful real-world interpretation, because no adult in the sample has both zero income and an age of zero.
We perform a one-sided t test of the null hypothesis H₀: β₂ = 1 against the alternative H₁: β₂ < 1.
## [1] "t-statistic: -1.70994388324452"
## [1] "p-value: 0.0437151388035654"
## [1] "Fail to reject H0 at the 1% significance level."
To compare the simple regression of nettfa on inc with the earlier multiple regression model:
##
## Call:
## lm(formula = nettfa ~ inc, data = single_person_households)
##
## Residuals:
## Min 1Q Median 3Q Max
## -185.12 -12.85 -4.85 1.78 1112.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.5709 2.0607 -5.13 3.18e-07 ***
## inc 0.8207 0.0609 13.48 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.59 on 2015 degrees of freedom
## Multiple R-squared: 0.08267, Adjusted R-squared: 0.08222
## F-statistic: 181.6 on 1 and 2015 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = nettfa ~ inc + age, data = single_person_households)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179.95 -14.16 -3.42 6.03 1113.94
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.03981 4.08039 -10.548 <2e-16 ***
## inc 0.79932 0.05973 13.382 <2e-16 ***
## age 0.84266 0.09202 9.158 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.68 on 2014 degrees of freedom
## Multiple R-squared: 0.1193, Adjusted R-squared: 0.1185
## F-statistic: 136.5 on 2 and 2014 DF, p-value: < 2.2e-16
## The coefficient of inc changed slightly, from 0.8207 in the simple model to 0.7993 in the multivariate model.
## This small change suggests that adding age as a predictor did not substantially alter the relationship between inc and nettfa.
## Therefore, age does not strongly confound the effect of inc on nettfa.
The following histogram was created using the variable score in the data file ECONMATH. Thirty bins were used to create the histogram, and the height of each cell is the proportion of observations falling within the corresponding interval. The best-fitting normal distribution, that is, the one using the sample mean and sample standard deviation, has been superimposed on the histogram.
## The estimated probability of score exceeding 100 is: 0.01883527
## Under the fitted normal distribution this probability is positive (about 1.9%), but actual scores cannot exceed 100,
## so a normal distribution cannot describe the data exactly: scores are bounded between 0 and 100, while a normal random variable is unbounded.
## Observing the left tail of the histogram:
## The normal distribution fits poorly in the left tail as well: it assigns noticeably more probability to
## values below 20 than is observed in the data. Essentially no scores fall below 20 in the sample,
## while the fitted normal distribution places probability there (and even on negative values).
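A sketch of the tail-probability calculation, assuming the econmath data from the wooldridge package:

```r
library(wooldridge)
data("econmath")

m <- mean(econmath$score, na.rm = TRUE)
s <- sd(econmath$score,   na.rm = TRUE)

# P(score > 100) under the best-fitting normal distribution
1 - pnorm(100, mean = m, sd = s)
```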
Use the data in WAGE1 for this exercise.
## [1] "Breusch-Pagan Test for Level-Level Model:"
##
## studentized Breusch-Pagan test
##
## data: model_wage
## BP = 43.096, df = 3, p-value = 2.349e-09
## [1] "Breusch-Pagan Test for Log-Level Model:"
##
## studentized Breusch-Pagan test
##
## data: model_logwage
## BP = 10.761, df = 3, p-value = 0.01309
## Strictly speaking, the Breusch-Pagan test addresses heteroskedasticity (Assumption MLR.5) rather than normality (MLR.6), but by this criterion the log-level model is clearly closer to satisfying the classical assumptions.
## The log-level model still shows some evidence of heteroskedasticity (p = 0.013), but the problem is far less severe than in the level-level model (p = 2.3e-09).
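A sketch of the two tests, assuming the standard WAGE1 specification wage ~ educ + exper + tenure (consistent with df = 3 in the output) and lmtest::bptest():

```r
library(wooldridge)
library(lmtest)   # bptest(); attaches zoo

data("wage1")

model_wage    <- lm(wage  ~ educ + exper + tenure, data = wage1)
model_logwage <- lm(lwage ~ educ + exper + tenure, data = wage1)

bptest(model_wage)      # strong evidence of heteroskedasticity
bptest(model_logwage)   # much weaker evidence
```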