Chapter 2:
Loading COUNTYMURDERS data
library("wooldridge")
## Warning: package 'wooldridge' was built under R version 4.2.3
data <- wooldridge::countymurders
data <- subset(data, year==1996)
head(data)
##     arrests countyid  density  popul perc1019 perc2029 percblack percmale
## 17        8     1001 67.21535  40061 15.89077 13.17491 20.975510 48.70073
## 34        6     1003 77.05643 123023 13.93886 11.63929 13.496660 48.83233
## 51        1     1005 29.91548  26475 15.06327 13.69972 46.190750 49.15203
## 68        0     1009 67.20457  43392 14.17542 12.99318  1.415007 48.97446
## 85        1     1011 17.89899  11188 14.98927 14.13121 72.756520 49.91956
## 102       2     1013 27.71148  21530 15.68509 11.25871 41.384110 46.81839
##     rpcincmaint rpcpersinc rpcunemins year murders  murdrate arrestrate
## 17      192.038  11852.760     26.796 1996       7 1.7473350  1.9969550
## 34      139.084  13583.020     28.710 1996       6 0.4877137  0.4877137
## 51      405.768  10760.510     63.162 1996       1 0.3777148  0.3777148
## 68      184.382  11094.820     21.692 1996       2 0.4609145  0.0000000
## 85      485.518   8349.506     63.162 1996       0 0.0000000  0.8938148
## 102     357.918   9947.058     54.868 1996       2 0.9289364  0.9289364
##     statefips countyfips execs    lpopul execrate
## 17          1          1     0 10.598160        0
## 34          1          3     0 11.720130        0
## 51          1          5     0 10.183960        0
## 68          1          9     0 10.678030        0
## 85          1         11     0  9.322598        0
## 102         1         13     0  9.977202        0
Question i). How many counties had zero murders? How many counties had at least one execution? What is the largest number of executions?
zero <- sum(data$murders == 0)
zero <- paste("The number of counties that had zero murders in 1996 is", zero)
zero
## [1] "The number of counties that had zero murders in 1996 is 1051"
one <- sum(data$execs >= 1)
one <- paste("The number of counties that had at least one execution in 1996 is", one)
one
## [1] "The number of counties that had at least one execution in 1996 is 31"
largest <- max(data$execs)
largest <- paste("The largest number of executions in 1996 is", largest)
largest
## [1] "The largest number of executions in 1996 is 3"
Question ii). Estimate the equation
model1 <- lm(murders ~ execs, data = data)
a <- summary(model1)
a
## 
## Call:
## lm(formula = murders ~ execs, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149.12   -5.46   -4.46   -2.46 1338.99 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4572     0.8348   6.537 7.79e-11 ***
## execs        58.5555     5.8333  10.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.89 on 2195 degrees of freedom
## Multiple R-squared:  0.04389,    Adjusted R-squared:  0.04346 
## F-statistic: 100.8 on 1 and 2195 DF,  p-value: < 2.2e-16
sample_size <- length(model1$residuals)
sample_size <- paste("The sample size is", sample_size)
sample_size
## [1] "The sample size is 2197"
r_squared  <- a$r.squared
r_squared<- paste("The r.squared is", r_squared)
r_squared
## [1] "The r.squared is 0.0438920144130939"
The sample size is 2197. The p-value on execs is less than 2e-16, so execs is a statistically significant variable. Thus, the number of executions is strongly associated with the number of murders in these data.
Question iii). Interpret the slope coefficient reported in the part ii). 
murders = b0 + b1*execs
murders = 5.4572 + 58.5555*execs
ii). The coefficient on execs is 58.5555: one additional execution is associated with about 58.56 more murders. The multiple R-squared is 0.04389, which means the independent variable (execs) explains about 4.39% of the variation in the dependent variable (murders). The adjusted R-squared is 0.04346, which is close to the multiple R-squared. Executions is a statistically significant variable for the number of murders, since the p-value is lower than 0.05. Thus, the number of executions has a considerable association with the number of murders.
Question iv). What is the smallest number of murders that can be predicted by the equation? What is the residual for a county with zero executions and zero murders?
murders <- 5.4572 + 58.5555*0
murders <- paste("The smallest number of murders that can be predicted by the equation is", murders)
murders
## [1] "The smallest number of murders that can be predicted by the equation is 5.4572"
new_data <- data.frame(execs = 0)
predicted_value <- predict(model1, newdata = new_data)
residual <- 0 - predicted_value
residual <- paste("The residual for a county with zero executions and zero murders is", residual)
residual
## [1] "The residual for a county with zero executions and zero murders is -5.4572408848833"
The smallest predicted number of murders is 5.4572, which occurs when execs equals 0: the slope on execs is positive, so the fitted value rises with executions. The residual for a county with zero executions and zero murders is 0 − 5.4572 = −5.457241.
Question v). Explain why a simple regression analysis is not well suited for determining whether capital punishment has a deterrent effect on murders. 
plot(model1)

A simple linear regression is not well suited for assessing whether executions (execs) deter murders (murders). The equation omits many county-level factors that influence murders, and causality likely runs in the other direction as well: counties with more murders are more likely to carry out executions, so the positive coefficient cannot be read as a deterrent effect. The diagnostic plots also show clear departures from the linear-model assumptions, including non-normal residuals.
Chapter 3: 
Question 5
Question i). Does it make sense to hold sleep, work, and leisure fixed while changing study?
GPA = b0 + b1*study + b2*sleep + b3*work + b4*leisure + u
It does not make sense to hold sleep, work, and leisure fixed while changing study, because the total time available in a week is fixed at 168 hours: if a student increases the time spent studying, the time spent on at least one of the other activities must fall by the same amount.
ii) Explain why this model violates Assumption MLR.3.
Assumption MLR.3 is the no-perfect-collinearity assumption: none of the independent variables can be an exact linear combination of the others. In this model, study + sleep + work + leisure = 168 for every student, so the four regressors (together with the intercept) are perfectly collinear, and MLR.3 is violated; OLS cannot estimate all four coefficients separately.
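A minimal simulated sketch (hypothetical data, not from the textbook) of how R reacts to this violation: because the four time-use variables sum to 168, one regressor is an exact linear combination of the others and lm() reports NA for it.
set.seed(1)
n <- 100
study   <- runif(n, 10, 40)
sleep   <- runif(n, 40, 60)
work    <- runif(n, 0, 40)
leisure <- 168 - study - sleep - work              # hours are forced to sum to 168
gpa     <- 2 + 0.02*study + rnorm(n, sd = 0.3)     # illustrative GPA equation
coef(lm(gpa ~ study + sleep + work + leisure))     # leisure comes back NA: perfect collinearity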
iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.3?
We can reformulate the model by dropping one of the activities, say leisure: GPA = b0 + b1*study + b2*sleep + b3*work + u. Since the four activities sum to 168, leisure is determined once the other three are known, so dropping it removes the perfect collinearity. Each coefficient then has a useful interpretation: b1 is the effect of one more hour of study, holding sleep and work fixed, with the extra hour implicitly taken from leisure.
Question 10

i) If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y, would you expect b1(~) and b1(hat) to be similar or very different? Explain.

If x1 is highly correlated with x2 and x3, and x2 and x3 have large partial effects on y, I would expect b1(~) and b1(hat) to be very different. Omitting x2 and x3 puts their large effects into the omitted-variable bias of b1(~), while the multicollinearity inflates the standard error of b1(hat) and makes it harder to isolate the individual effect of x1. b1(hat) is less biased but less precise; the two estimates would typically diverge.
ii) If x1 is almost uncorrelated with x2 and x3 in the sample, but x2 and x3 are highly correlated, would you expect b1(~) and b1(hat) to be similar or very different? Explain.

If x1 is almost uncorrelated with x2 and x3, b1(~) and b1(hat) are likely to be similar: the omitted-variable bias in b1(~) depends on the correlation between x1 and the omitted regressors, which is small here. The high correlation between x2 and x3 creates multicollinearity between those two variables, but it has little effect on the estimate for x1.
iii). If x1 is highly correlated with x2 and x3 in the sample, and x2 and x3 have small partial effects on y, would you expect b1(~) and b1(hat) to be similar or very different? Explain.

If x1 is highly correlated with x2 and x3, and x2 and x3 have small partial effects on y, the two estimates would likely be similar: the omitted-variable bias in b1(~) is small because the partial effects of x2 and x3 are small. However, because of the multicollinearity, b1(hat) can have a substantially larger standard error than b1(~).
iv). If x1 is almost uncorrelated with x2 and x3 in the sample, and x2 and x3 have large partial effects on y, would you expect b1(~) and b1(hat) to be similar or very different? Explain.

If x1 is almost uncorrelated with x2 and x3, and x2 and x3 have large partial effects on y, the point estimates should be similar, since there is little omitted-variable bias. But b1(hat) would typically be more precise: including x2 and x3 removes their large effects from the error term, reducing the error variance without introducing collinearity with x1. A simulation sketch of case (i) follows.
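A small simulation sketch (hypothetical data) illustrating case (i): with x1 highly correlated with x2 and x3, and large partial effects for x2 and x3, the simple-regression estimate b1(~) is badly biased while the multiple-regression estimate b1(hat) stays near the truth.
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9*x1 + 0.1*rnorm(n)                 # highly correlated with x1
x3 <- 0.9*x1 + 0.1*rnorm(n)
y  <- 1 + 1*x1 + 5*x2 + 5*x3 + rnorm(n)     # x2, x3 have large partial effects
coef(lm(y ~ x1))[2]                         # b1(~): far from 1, absorbs the omitted effects
coef(lm(y ~ x1 + x2 + x3))[2]               # b1(hat): close to the true value of 1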
Question C8:
Loading data
data2 <- wooldridge::discrim
head(data2)
##   psoda pfries pentree wagest nmgrs nregs hrsopen  emp psoda2 pfries2 pentree2
## 1  1.12   1.06    1.02   4.25     3     5    16.0 27.5   1.11    1.11     1.05
## 2  1.06   0.91    0.95   4.75     3     3    16.5 21.5   1.05    0.89     0.95
## 3  1.06   0.91    0.98   4.25     3     5    18.0 30.0   1.05    0.94     0.98
## 4  1.12   1.02    1.06   5.00     4     5    16.0 27.5   1.15    1.05     1.05
## 5  1.12     NA    0.49   5.00     3     3    16.0  5.0   1.04    1.01     0.58
## 6  1.06   0.95    1.01   4.25     4     4    15.0 17.5   1.05    0.94     1.00
##   wagest2 nmgrs2 nregs2 hrsopen2 emp2 compown chain density    crmrte state
## 1    5.05      5      5     15.0 27.0       1     3    4030 0.0528866     1
## 2    5.05      4      3     17.5 24.5       0     1    4030 0.0528866     1
## 3    5.05      4      5     17.5 25.0       0     1   11400 0.0360003     1
## 4    5.05      4      5     16.0   NA       0     3    8345 0.0484232     1
## 5    5.05      3      3     16.0 12.0       0     1     720 0.0615890     1
## 6    5.05      3      4     15.0 28.0       0     1    4424 0.0334823     1
##     prpblck    prppov   prpncar hseval nstores income county     lpsoda
## 1 0.1711542 0.0365789 0.0788428 148300       3  44534     18 0.11332869
## 2 0.1711542 0.0365789 0.0788428 148300       3  44534     18 0.05826885
## 3 0.0473602 0.0879072 0.2694298 169200       3  41164     12 0.05826885
## 4 0.0528394 0.0591227 0.1366903 171600       3  50366     10 0.11332869
## 5 0.0344800 0.0254145 0.0738020 249100       1  72287     10 0.11332869
## 6 0.0591327 0.0835001 0.1151341 148000       2  44515     18 0.05826885
##       lpfries  lhseval  lincome ldensity NJ BK KFC RR
## 1  0.05826885 11.90699 10.70401 8.301521  1  0   0  1
## 2 -0.09431065 11.90699 10.70401 8.301521  1  1   0  0
## 3 -0.09431065 12.03884 10.62532 9.341369  1  1   0  0
## 4  0.01980261 12.05292 10.82707 9.029418  1  0   0  1
## 5          NA 12.42561 11.18840 6.579251  1  1   0  0
## 6 -0.05129331 11.90497 10.70358 8.394799  1  1   0  0
i). Find the average values and sd of prpblck and income. What are the units of measurement of prpblck and income?
c <- na.omit(data2$prpblck)
a <- mean(c)
a <- paste("The average number of prpblck is", a)
a
## [1] "The average number of prpblck is 0.113486396497833"
a1 <- sd(c)
a1 <- paste("The standard deviation of prpblck is", a1)
a1
## [1] "The standard deviation of prpblck is 0.182416467486231"
c1 <- na.omit(data2$income)
b <- mean(c1)
b <- paste("The average value of income is", b)
b
## [1] "The average value of income is 47053.7848410758"
b1 <- sd(c1)
b1 <- paste("The standard deviation of income is", b1)
b1
## [1] "The standard deviation of income is 13179.2860689389"
ii).Estimate equation
model2 <- lm(psoda ~ prpblck + income, data=data2)
summary(model2)
## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06
psoda = 0.9563 + 0.1150*prpblck + 0.0000016*income
The coefficient on prpblck is 0.115: holding income fixed, moving from a zip code with prpblck = 0 to one with prpblck = 1 is associated with a soda price about 11.5 cents higher, so a 0.10 increase in prpblck corresponds to roughly a 1.2-cent increase. I think this is economically meaningful relative to an average price of about one dollar. prpblck and income are both significant variables, since their p-values are below 0.05. The R-squared is about 6.4%, so the independent variables (prpblck and income) explain only 6.4% of the variation in the price of soda.
iii). Compare results in ii). and iii). Is there discrimination effect?
model3 <- lm(psoda ~ prpblck, data=data2)
summary(model3)
## 
## Call:
## lm(formula = psoda ~ prpblck, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015
psoda = 1.03740 + 0.06493*prpblck
The coefficient on prpblck is 0.06493: a one-unit increase in prpblck (going from 0 to 1) is associated with a soda price about 6.5 cents higher. This is smaller than the coefficient in part (ii), so the estimated discrimination effect is larger when we control for income.
iv). Estimate the model.
model4 <- glm(log(psoda) ~ prpblck + log(income), data=data2)
summary(model4)
## 
## Call:
## glm(formula = log(psoda) ~ prpblck + log(income), data = data2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.33563  -0.04695   0.00658   0.04334   0.35413  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.006740526)
## 
##     Null deviance: 2.8788  on 400  degrees of freedom
## Residual deviance: 2.6827  on 398  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: -861.87
## 
## Number of Fisher Scoring iterations: 2
log(psoda) = -0.79377 + 0.12158*prpblck + 0.07651*log(income)
The coefficient on prpblck is 0.12158. Because the dependent variable is in logs, a one-unit increase in prpblck is associated with approximately a 12.2% increase in the price of soda.
If prpblck increases by 0.20 (20 percentage points), what is the estimated change in psoda?
p1 <- -0.79377+1*0.12158+1*0.07651    # fitted log(psoda) at prpblck = 1, log(income) held at 1
p1
## [1] -0.59568
p2 <- -0.79377+1.2*0.12158+1*0.07651  # fitted log(psoda) after a 0.2 increase in prpblck
p2
## [1] -0.571364
change <- (p2-p1)*100                 # difference in log points, times 100 ~ percent change
change
## [1] 2.4316
The estimated increase in psoda is about 2.43% (0.12158 × 0.20 ≈ 0.0243 log points).
v). Add prppov to the regression.
model5 <- glm(log(psoda) ~ prpblck + log(income) + prppov, data=data2)
summary(model5)
## 
## Call:
## glm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.32218  -0.04648   0.00651   0.04272   0.35622  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.006620679)
## 
##     Null deviance: 2.8788  on 400  degrees of freedom
## Residual deviance: 2.6284  on 397  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: -868.07
## 
## Number of Fisher Scoring iterations: 2
log(psoda) = -1.46333 + 0.07281*prpblck + 0.13696*log(income) + 0.38036*prppov
The coefficient on prpblck is 0.07281, so a one-unit increase in prpblck is now associated with about a 7.3% increase in the price of soda. The coefficient has fallen noticeably compared with part (iv).
vi). Find the correlation between log(income) and prppov.
correlation <- cor(data2$lincome, data2$prppov, use = "complete.obs")
print(correlation)
## [1] -0.838467
The correlation between log(income) and prppov is -0.838467, meaning they are strongly negatively correlated.

vii). The correlation coefficient of -0.838467 between log(income) and prppov indicates a strong negative correlation. A correlation of
-0.838467 is considered quite high. High correlation between independent
variables can lead to multicollinearity, which might pose challenges in
estimating the individual coefficients' precise effects.
Multicollinearity can lead to inflated standard errors, making it
difficult to identify the individual impact of each variable. The
estimated coefficients may be sensitive to small changes in the data.

Chapter 4

i). The coefficient on log(sales) is 0.321. Since rdintens is measured in percentage points and sales enters in logs, a 1% increase in sales is associated with a 0.321/100 ≈ 0.0032 percentage-point increase in rdintens; a 10% increase in sales raises rdintens by only about 0.032 percentage points, a small effect.
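A sketch of the estimation behind these numbers, assuming the equation is the textbook's rdintens model fit on wooldridge::rdchem (whose lsales and profmarg columns appear in the head() output later in this document):
model_rd <- lm(rdintens ~ lsales + profmarg, data = wooldridge::rdchem)
summary(model_rd)   # should recover the quoted coefficients if this is the right specification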

ii). Test the hypothesis that R&D intensity does not change with sales:

The hypothesis test for the coefficient on log(sales) involves examining whether the coefficient is significantly different from zero. The test statistic is calculated by dividing the coefficient estimate by its standard error. The values in parentheses represent standard errors:

t = 0.321/0.216 ≈ 1.486

For a two-tailed test at the 5% significance level, you compare the absolute value of the test statistic to the critical t-value. If the absolute value of the test statistic is greater than the critical t-value, you reject the null hypothesis.

For 29 degrees of freedom (n − k − 1 = 32 − 2 − 1), the critical t-values are approximately ±2.045 at the 5% level and ±1.699 at the 10% level.

At the 5% level:
|1.486| < 2.045 (fail to reject the null hypothesis)
At the 10% level:
|1.486| < 1.699 (fail to reject the null hypothesis)
Therefore, based on the test, there is insufficient evidence to conclude that R&D intensity changes significantly with sales.
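The critical values can be checked in R (a quick sketch using the t distribution with 29 degrees of freedom):
qt(0.975, df = 29)  # two-tailed 5% critical value, about 2.045
qt(0.950, df = 29)  # two-tailed 10% critical value, about 1.699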

(iii) Interpretation of the coefficient on profmarg:

The coefficient on profmarg is 0.050. This means that a one percentage point increase in profit margin is associated with a 0.050 percent increase in rdintens. Whether this effect is economically large depends on industry norms and the context of the chemical industry.

(iv) Does profmarg have a statistically significant effect on rdintens?

To test the significance of the coefficient on profmarg, you can use a similar approach as in part (ii). Calculate the t-statistic:

t = 0.050/0.046 ≈ 1.087. Since |1.087| is well below the 5% critical value of about 2.045, profmarg does not have a statistically significant effect on rdintens.
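A quick check of the corresponding two-sided p-value (same 29 degrees of freedom as above):
2 * pt(-abs(0.050 / 0.046), df = 29)  # about 0.29, far above 0.05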
Question C8
(i) How many single-person households are there in the data set?

This information can be obtained by filtering the dataset to include only single-person households (where fsize is 1) and then counting the number of observations.
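A minimal sketch of that count, assuming the data set is wooldridge::k401ksubs with family size recorded in fsize:
data3 <- wooldridge::k401ksubs
singles <- subset(data3, fsize == 1)   # single-person households
nrow(singles)                          # number of observations in the subsample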

(ii) Use OLS to estimate the model and report the results:

The model to estimate is:
nettfa = b0 + b1*inc + b2*age + u
Regression analysis gives the estimates; a sketch of the estimation appears below.
b1 represents the estimated change in net financial wealth for a one-unit increase in annual family income, holding age constant. If b1 is positive, it suggests that higher income is associated with higher net financial wealth.

b2 represents the estimated change in net financial wealth for a one-year increase in age, holding annual family income constant. If b2 is positive, it suggests that older individuals tend to have higher net financial wealth.
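A sketch of the estimation on the single-person subsample from part (i), assuming nettfa, inc, and age are the relevant variables in wooldridge::k401ksubs:
model_c8 <- lm(nettfa ~ inc + age, data = singles)
summary(model_c8)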

(iii) Does the intercept from the regression have an interesting meaning?

The intercept (b0) represents the estimated net financial wealth when both annual family income (inc) and age (age) are zero. However, the interpretation might not be meaningful in the context of this model. In reality, it's unlikely to have individuals with zero age and zero income.
(iv) Find the p-value for the test H0: b2=1 against H1: b2 not equal to 1. Do you reject H0 at the 1% significance level?

To find the p-value, compute the t-statistic t = (b2(hat) − 1)/se(b2(hat)) and compare it to the t distribution with the model's residual degrees of freedom. If the p-value is less than 0.01, H0 is rejected at the 1% significance level; a sketch of the calculation follows.
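A hedged sketch of that calculation, reusing model_c8 from part (ii):
est  <- summary(model_c8)$coefficients
t_b2 <- (est["age", "Estimate"] - 1) / est["age", "Std. Error"]  # t statistic for H0: b2 = 1
p_val <- 2 * pt(-abs(t_b2), df = df.residual(model_c8))          # two-sided p-value
p_val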

(v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different from the estimate in part (ii)? Why or why not?

In a simple regression of nettfa on inc, you are estimating a model without including age. The estimated coefficient on inc in the simple regression represents the change in net financial wealth for a one-unit increase in annual family income, without considering the effect of age.

If the estimated coefficient on inc in the simple regression is not significantly different from the estimate in part (ii), it suggests that age may not have a substantial impact on the relationship between annual family income and net financial wealth. The difference in the estimates would indicate the added explanatory power of including age in the model.
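The comparison can be sketched by running the simple regression alongside the multiple regression from part (ii):
model_c8_simple <- lm(nettfa ~ inc, data = singles)
coef(model_c8_simple)["inc"]   # compare with coef(model_c8)["inc"]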

Chapter 5

i). P(score > 100) is zero, because score is bounded between 0 and 100, whereas a normally distributed variable would place positive probability above 100; score therefore cannot be exactly normally distributed. ii). Most students scored above 60, so the distribution of score is skewed rather than symmetric, another departure from normality.
data4 <- wooldridge::wage1
head(data4)
##   wage educ exper tenure nonwhite female married numdep smsa northcen south
## 1 3.10   11     2      0        0      1       0      2    1        0     0
## 2 3.24   12    22      2        0      1       1      3    1        0     0
## 3 3.00   11     2      0        0      0       0      2    0        0     0
## 4 6.00    8    44     28        0      0       1      0    1        0     0
## 5 5.30   12     7      2        0      0       1      1    0        0     0
## 6 8.75   16     9      8        0      0       1      0    1        0     0
##   west construc ndurman trcommpu trade services profserv profocc clerocc
## 1    1        0       0        0     0        0        0       0       0
## 2    1        0       0        0     0        1        0       0       0
## 3    1        0       0        0     1        0        0       0       0
## 4    1        0       0        0     0        0        0       0       1
## 5    1        0       0        0     0        0        0       0       0
## 6    1        0       0        0     0        0        1       1       0
##   servocc    lwage expersq tenursq
## 1       0 1.131402       4       0
## 2       1 1.175573     484       4
## 3       0 1.098612       4       0
## 4       0 1.791759    1936     784
## 5       0 1.667707      49       4
## 6       0 2.169054      81      64

i). Regression

model6 <- lm(wage ~ educ + exper + tenure, data = data4)
summary(model6)
## 
## Call:
## lm(formula = wage ~ educ + exper + tenure, data = data4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6068 -1.7747 -0.6279  1.1969 14.6536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
## educ         0.59897    0.05128  11.679  < 2e-16 ***
## exper        0.02234    0.01206   1.853   0.0645 .  
## tenure       0.16927    0.02164   7.820 2.93e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 522 degrees of freedom
## Multiple R-squared:  0.3064, Adjusted R-squared:  0.3024 
## F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16
wage = -2.87273 + 0.59897*educ + 0.02234*exper + 0.16927*tenure
The coefficient on educ is 0.59897: one more year of education is associated with an hourly wage about $0.60 higher, holding exper and tenure fixed.
plot(model6)

model7 <- glm(log(wage) ~ educ + exper + tenure, data = data4)
summary(model7)
## 
## Call:
## glm(formula = log(wage) ~ educ + exper + tenure, data = data4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.05802  -0.29645  -0.03265   0.28788   1.42809  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.284360   0.104190   2.729  0.00656 ** 
## educ        0.092029   0.007330  12.555  < 2e-16 ***
## exper       0.004121   0.001723   2.391  0.01714 *  
## tenure      0.022067   0.003094   7.133 3.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1943593)
## 
##     Null deviance: 148.33  on 525  degrees of freedom
## Residual deviance: 101.46  on 522  degrees of freedom
## AIC: 637.1
## 
## Number of Fisher Scoring iterations: 2
plot(model7)

Assumption MLR.6 is the normality assumption: the errors are normally distributed, independently of the explanatory variables. Comparing the residual diagnostic plots of the two models, the residuals from the log(wage) specification are closer to normal, so the log-level model comes closer to satisfying MLR.6.
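One way to supplement the plots is a formal normality test on each set of residuals; a sketch using base R's shapiro.test (my addition, not part of the assignment; n = 526 is within its 5,000-observation limit):
shapiro.test(resid(model6))  # level-level residuals
shapiro.test(resid(model7))  # log-level residuals; a W statistic closer to 1 means closer to normal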

Chapter 6

Question 3
data5 <- wooldridge::rdchem
head(data5)
##      rd  sales profits rdintens  profmarg     salessq   lsales       lrd
## 1 430.6 4570.2   186.9 9.421906  4.089536 20886730.00 8.427312 6.0651798
## 2  59.0 2830.0   467.0 2.084806 16.501766  8008900.00 7.948032 4.0775375
## 3  23.5  596.8   107.4 3.937668 17.995979   356170.22 6.391582 3.1570003
## 4   3.5  133.6    -4.3 2.619760 -3.218563    17848.96 4.894850 1.2527629
## 5   1.7   42.0     8.0 4.047619 19.047619     1764.00 3.737670 0.5306283
## 6   8.4  390.0    47.3 2.153846 12.128205   152100.00 5.966147 2.1282318

i).

model9 <- lm(rdintens ~ sales + salessq, data = data5)
summary(model9)
## 
## Call:
## lm(formula = rdintens ~ sales + salessq, data = data5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1418 -1.3630 -0.2257  1.0688  5.5808 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.613e+00  4.294e-01   6.084 1.27e-06 ***
## sales        3.006e-04  1.393e-04   2.158   0.0394 *  
## salessq     -6.946e-09  3.726e-09  -1.864   0.0725 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.788 on 29 degrees of freedom
## Multiple R-squared:  0.1484, Adjusted R-squared:  0.08969 
## F-statistic: 2.527 on 2 and 29 DF,  p-value: 0.09733
plot(model9)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

i). At what point does the marginal effect of sales on rdintens become negative?
library(Deriv)
## Warning: package 'Deriv' was built under R version 4.2.3
f <- function(x) 2.613 + 0.00030*x - 0.0000000070*x^2
derivative <- Deriv(f, "x")
derivative
## function (x) 
## 3e-04 - 1.4e-08 * x
Setting the derivative to zero and solving:
0.00030 − 2(0.0000000070)·sales = 0
sales = 0.00030/0.000000014 ≈ 21,428.57
Since sales is measured in millions of dollars, the marginal effect of sales on rdintens turns negative at roughly $21.4 billion in sales.
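The rounded coefficients give sales ≈ 21,428.57; a quick check with the unrounded estimates stored in model9 gives a slightly different turning point:
-coef(model9)["sales"] / (2 * coef(model9)["salessq"])  # turning point -b1/(2*b2), about 21,600 (millions)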
ii) Would you keep the quadratic term in the model? Explain.

Whether to keep the quadratic term depends on the fit of the model and the results of hypothesis tests. Here the coefficient on the quadratic term is negative, suggesting a downward curvature in the relationship between sales and rdintens, and its p-value is 0.0725: statistically significant at the 10% level but not at the 5% level. Keeping the quadratic term is therefore defensible, since it captures the diminishing effect of sales, but the evidence for it is not strong.
iii). Rewrite the estimated equation with salesbil = sales/1,000 and salesbil^2:
rdintens = 2.613 + 0.00030·(1,000·salesbil) − 0.0000000070·(1,000·salesbil)^2
rdintens = 2.613 + 0.30·salesbil − 0.0070·salesbil^2
iv) For the purpose of reporting the results, which equation do you prefer?

The choice between the linear and quadratic model depends on the goals of the analysis, the fit of the model to the data, and the significance of the coefficients. If the quadratic term is statistically significant and improves the model fit, it might provide a more accurate representation of the relationship between sales and rdintens. However, the final decision should also consider the interpretability of the model and its implications for the research question.

Question 10

library(wooldridge)
data("meapsingle")
model10 <- lm(math4 ~ lexppp + free + lmedinc + pctsgle, data = meapsingle)
summary(model10)
## 
## Call:
## lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle, data = meapsingle)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.259  -7.422   1.615   7.274  49.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.48949   59.23781   0.413   0.6797    
## lexppp       9.00648    4.03530   2.232   0.0266 *  
## free        -0.42164    0.07064  -5.969 9.27e-09 ***
## lmedinc     -0.75221    5.35816  -0.140   0.8885    
## pctsgle     -0.27444    0.16086  -1.706   0.0894 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.59 on 224 degrees of freedom
## Multiple R-squared:  0.4716, Adjusted R-squared:  0.4622 
## F-statistic: 49.98 on 4 and 224 DF,  p-value: < 2.2e-16
model11 <- lm(math4 ~ lexppp + free + lmedinc + pctsgle + read4, data = meapsingle)
summary(model11)
## 
## Call:
## lm(formula = math4 ~ lexppp + free + lmedinc + pctsgle + read4, 
##     data = meapsingle)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.5690  -4.6729  -0.0349   4.3644  24.8425 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 149.37870   41.70293   3.582 0.000419 ***
## lexppp        1.93215    2.82480   0.684 0.494688    
## free         -0.06004    0.05399  -1.112 0.267297    
## lmedinc     -10.77595    3.75746  -2.868 0.004529 ** 
## pctsgle      -0.39663    0.11143  -3.559 0.000454 ***
## read4         0.66656    0.04249  15.687  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.012 on 223 degrees of freedom
## Multiple R-squared:  0.7488, Adjusted R-squared:  0.7432 
## F-statistic: 132.9 on 5 and 223 DF,  p-value: < 2.2e-16
i). The R-squared is higher in the second equation, so it fits the data better in-sample. In the first equation, a 10% increase in expenditure per pupil is associated with roughly a 0.9 percentage-point increase in the math pass rate (9.00648 × log(1.1) ≈ 0.86); see the check below.
ii). In the first equation, only lexppp and free are significant. In the second equation, lmedinc, pctsgle, and read4 are the significant variables.
iii). Adding read4 raises the R-squared substantially, but read4 is itself a test outcome driven by the same school and family factors as math4, so including it does not help us measure the effects of spending and demographics on math performance. For that purpose, the first model is preferable even though its (adjusted) R-squared is smaller.
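A quick check of the 10% spending effect cited in part i), using the coefficient stored in model10 (a level-log relationship):
coef(model10)["lexppp"] * log(1.1)  # about 0.86 percentage points on math4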