Chapter 2

C6

  1. According to the model, the coefficient on log(dist) is 0.312. Because both price and dist enter in logs, this is an elasticity: a 1% increase in distance from the incinerator is associated with roughly a 0.312% increase in house price. The positive sign means that the farther a house is from the garbage incinerator, the higher its price, which is what I expected, since people do not want to live near a garbage incinerator. (A short R sketch reproducing this regression follows this list.)

  2. It depends on how the sample was constructed, in particular whether the sample of houses is representative of the population of interest.

  3. Many other factors affect house prices but are not correlated with distance from the incinerator, such as the size and age of the house, its furnishings, and the quality of nearby amenities.
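
A minimal sketch of how the regression in part 1 could be reproduced, assuming the incinerator data are the KIELMC dataset shipped as kielmc in the wooldridge package (with variables price, dist, and year) and that the model is fit on the 1981 cross-section only; these assumptions are not confirmed by the write-up above and should be checked against the assignment:

library(wooldridge)
# Hypothetical reproduction: log(price) on log(dist), 1981 houses only
kiel81 <- subset(kielmc, year == 1981)  # assumes kielmc has a `year` variable
summary(lm(log(price) ~ log(dist), data = kiel81))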

C4

  1. Mean and standard deviation
library(wooldridge)  # wage2, discrim, etc. come from the wooldridge package
library(dplyr)       # provides the %>% pipe, select(), and mutate() used below
data <- wage2
print(paste("The mean of salary will be:", mean(data$wage)))
## [1] "The mean of salary will be: 957.945454545455"
print(paste("The mean of IQ will be:", mean(data$IQ)))
## [1] "The mean of IQ will be: 101.282352941176"
print(paste("The standard deviation of IQ will be:", sd(data$IQ)))
## [1] "The standard deviation of IQ will be: 15.0526363702651"
  2. Model

A one-point increase in IQ is predicted to raise the monthly wage by about 8.30 USD; a 15-point increase is therefore predicted to raise it by about 15 × 8.3031 ≈ 124.55 USD (checked in R after the regression output below).

IQ does not explain most of the variation in wage: R-squared = 0.0955, so IQ accounts for only about 9.6% of the sample variation in wage.

model <- lm(wage~IQ, data= data)
summary(model)
## 
## Call:
## lm(formula = wage ~ IQ, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -898.7 -256.5  -47.3  201.1 2072.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.9916    85.6415   1.366    0.172    
## IQ            8.3031     0.8364   9.927   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 384.8 on 933 degrees of freedom
## Multiple R-squared:  0.09554,    Adjusted R-squared:  0.09457 
## F-statistic: 98.55 on 1 and 933 DF,  p-value: < 2.2e-16
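
As a quick check, the predicted wage change for a 15-point increase in IQ can be read directly off the fitted model:

# Predicted change in monthly wage for a 15-point increase in IQ, using the slope estimated above
15 * coef(model)["IQ"]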
  3. log(wage) ~ IQ

The coefficient is 0.0088, meaning that a one-point increase in IQ is associated with an increase in wage of approximately 0.88%.

model <- lm(log(wage)~IQ, data= data)
summary(model)
## 
## Call:
## lm(formula = log(wage) ~ IQ, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09324 -0.25547  0.02261  0.27544  1.21486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.8869943  0.0890206   66.13   <2e-16 ***
## IQ          0.0088072  0.0008694   10.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3999 on 933 degrees of freedom
## Multiple R-squared:  0.09909,    Adjusted R-squared:  0.09813 
## F-statistic: 102.6 on 1 and 933 DF,  p-value: < 2.2e-16
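
Since the 0.88%-per-point reading relies on the log approximation, the exact percentage changes implied by the model for a 1-point and a 15-point increase in IQ can be computed as:

# Exact % change in wage implied by the log-level model: 100 * (exp(beta1 * dIQ) - 1)
b_iq <- coef(model)["IQ"]
100 * (exp(b_iq * c(1, 15)) - 1)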

Chapter 3

C6

  1. IQ ~ educ

The coefficient (theta1) is 3.53, meaning that one more year of education is associated with a 3.53-point increase in IQ.

model <- lm(IQ~educ, data= data)
summary(model)
## 
## Call:
## lm(formula = IQ ~ educ, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.228  -7.262   0.907   8.772  37.373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.6872     2.6229   20.47   <2e-16 ***
## educ          3.5338     0.1922   18.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.9 on 933 degrees of freedom
## Multiple R-squared:  0.2659, Adjusted R-squared:  0.2652 
## F-statistic:   338 on 1 and 933 DF,  p-value: < 2.2e-16
theta1= model$coefficients[2]
  2. log(wage) ~ educ

The coefficient is 0.0598, meaning that one more year of education is associated with an increase in wage of approximately 6%.

model <- lm(log(wage)~educ, data= data)
summary(model)
## 
## Call:
## lm(formula = log(wage) ~ educ, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.94620 -0.24832  0.03507  0.27440  1.28106 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.973062   0.081374   73.40   <2e-16 ***
## educ        0.059839   0.005963   10.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4003 on 933 degrees of freedom
## Multiple R-squared:  0.09742,    Adjusted R-squared:  0.09645 
## F-statistic: 100.7 on 1 and 933 DF,  p-value: < 2.2e-16
beta1= model$coefficients[2]
  3. log(wage) ~ educ + IQ

The coefficients are 0.0391 for educ and 0.0059 for IQ. Holding years of education fixed, a one-point increase in IQ is associated with roughly a 0.59% increase in wage; holding IQ fixed, one more year of education is associated with roughly a 3.9% increase in wage.

model <- lm(log(wage)~educ+IQ, data= data)
summary(model)
## 
## Call:
## lm(formula = log(wage) ~ educ + IQ, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01601 -0.24367  0.03359  0.27960  1.23783 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.6582876  0.0962408  58.793  < 2e-16 ***
## educ        0.0391199  0.0068382   5.721 1.43e-08 ***
## IQ          0.0058631  0.0009979   5.875 5.87e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3933 on 932 degrees of freedom
## Multiple R-squared:  0.1297, Adjusted R-squared:  0.1278 
## F-statistic: 69.42 on 2 and 932 DF,  p-value: < 2.2e-16
beta1_2= model$coefficients[2]
beta2= model$coefficients[3]
  4. Verification. The omitted-variable algebra implies beta1 = beta1_2 + beta2*theta1 exactly, so the difference computed below is zero (up to rounding in the stored coefficients).
beta1_2+beta2*theta1 - beta1
## educ 
##    0

C8

  1. Mean and standard deviation

income is measured in US dollars, while prpblck is a proportion between 0 and 1 (the fraction of the zip code's population that is black), not a percentage.

data8 <- discrim
print(paste("The mean of income will be:", mean(data8$income, na.rm=TRUE)))
## [1] "The mean of income will be: 47053.7848410758"
print(paste("The standard deviation of income will be:", sd(data8$income, na.rm=TRUE)))
## [1] "The standard deviation of income will be: 13179.2860689389"
print(paste("The mean of people black will be:", mean(data8$prpblck,na.rm=TRUE)))
## [1] "The mean of people black will be: 0.113486396497833"
print(paste("The standard deviation of people black will be:", sd(data8$prpblck, na.rm=TRUE)))
## [1] "The standard deviation of people black will be: 0.182416467486231"
  2. psoda ~ prpblck + income

The coefficient on prpblck is 0.115, so, holding income fixed, a 10 percentage-point increase in the proportion of black residents in a zip code is associated with a price increase of about 0.0115 USD, roughly 1% of a typical soda price. The effect is statistically significant but not economically large.

model <- lm(psoda ~ prpblck + income, data = data8)
summary <- summary(model)
# Number of observations actually used by lm (rows with any missing value are dropped)
sample_size <- nobs(model)
summary
## 
## Call:
## lm(formula = psoda ~ prpblck + income, data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29401 -0.05242  0.00333  0.04231  0.44322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.563e-01  1.899e-02  50.354  < 2e-16 ***
## prpblck     1.150e-01  2.600e-02   4.423 1.26e-05 ***
## income      1.603e-06  3.618e-07   4.430 1.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08611 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06422,    Adjusted R-squared:  0.05952 
## F-statistic: 13.66 on 2 and 398 DF,  p-value: 1.835e-06
print(paste("The sample size is:",sample_size))
## [1] "The sample size is: 402"
print(paste("The Rsquared is:",summary$r.squared))
## [1] "The Rsquared is: 0.0642203910903628"
  3. psoda ~ prpblck

The coefficient is 0.0649, smaller than the 0.115 obtained with income in the model. Hence the estimated effect of prpblck is larger when we control for income.

model <- lm(psoda~prpblck, data=data8)
summary(model)
## 
## Call:
## lm(formula = psoda ~ prpblck, data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30884 -0.05963  0.01135  0.03206  0.44840 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03740    0.00519  199.87  < 2e-16 ***
## prpblck      0.06493    0.02396    2.71  0.00702 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0881 on 399 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.01808,    Adjusted R-squared:  0.01561 
## F-statistic: 7.345 on 1 and 399 DF,  p-value: 0.007015
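
The drop in the coefficient is consistent with omitted-variable bias if prpblck and income are negatively correlated, which can be checked directly on the complete cases:

# Correlation between prpblck and income (complete cases only);
# a negative value explains why omitting income shrinks the prpblck coefficient
cor(data8$prpblck, data8$income, use = "complete.obs")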
  4. log(psoda) ~ prpblck + log(income)

The coefficient on prpblck is 0.1216, so a 0.20 (20 percentage-point) increase in prpblck is associated with an estimated increase in psoda of about 0.1216 × 0.20 ≈ 2.4% (see the quick computation after the output below).

model <- lm(log(psoda)~prpblck+ log(income), data=data8)
summary(model)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33563 -0.04695  0.00658  0.04334  0.35413 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.79377    0.17943  -4.424 1.25e-05 ***
## prpblck      0.12158    0.02575   4.722 3.24e-06 ***
## log(income)  0.07651    0.01660   4.610 5.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0821 on 398 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.06809,    Adjusted R-squared:  0.06341 
## F-statistic: 14.54 on 2 and 398 DF,  p-value: 8.039e-07
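
The 2.4% figure quoted above comes directly from the estimated coefficient of the model just fitted:

# Approximate % change in psoda for a 0.20 increase in prpblck
100 * 0.20 * coef(model)["prpblck"]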
  5. log(psoda) ~ prpblck + log(income) + prppov

The coefficient on prpblck decreases to 0.0728 and its statistical significance also decreases (p = 0.018).

model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08
  6. Correlation

The correlation is not what I expected: the correlation between the proportion in poverty and the soda price is only 0.026, i.e. essentially no linear relationship. I expected that the poorer the population, the lower the price.

# Remove missing values from the vectors
data8_6 <- data8 %>% select(psoda,prppov) %>% na.omit 
cor(data8_6)
##             psoda     prppov
## psoda  1.00000000 0.02598077
## prppov 0.02598077 1.00000000
  7. Statement evaluation

The correlation is very high, at -0.84, which means prppov adds little information once log(income) is already in the model; including two highly correlated variables can also cause multicollinearity. Hence, the statement is true.

# Remove missing values from the vectors
data8_7 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data8_7)
##           income    prppov
## income  1.000000 -0.838467
## prppov -0.838467  1.000000

Chapter 4

C10

  1. t-tests. We set the significance level at 5% (one-sided test, critical value ≈ 1.64):

t(dkr)= 0.321/0.201 = 1.59 < 1.64

t(eps)= 0.043/0.078 = 0.55 < 1.64

t(netinc)= 0.0051/0.0047= 1.085 < 1.64

t(salary)= 0.0035/0.0022 = 1.59 < 1.64

We don’t have any statistically significant variables for this model.

  2. t-tests with log(netinc) and log(salary). Again at the 5% significance level (one-sided test, critical value ≈ 1.64):

t(dkr)= 0.327/0.203= 1.61 < 1.64

t(eps)= 0.069/0.08= 0.8625 < 1.64

t(log(netinc))= 4.74/3.39= 1.39 < 1.64

t(log(salary))= 7.24/ 6.31= 1.14 < 1.64

We don’t have any statistically significant variables for this model.
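
As a cross-check, the t statistics for both specifications can be recomputed from the coefficients and standard errors reported above:

# t statistics = coefficient / standard error, levels model (part 1) and log model (part 2)
t_levels <- c(dkr = 0.321, eps = 0.043, netinc = 0.0051, salary = 0.0035) /
  c(0.201, 0.078, 0.0047, 0.0022)
t_logs <- c(dkr = 0.327, eps = 0.069, lnetinc = 4.74, lsalary = 7.24) /
  c(0.203, 0.080, 3.39, 6.31)
round(t_levels, 2)
round(t_logs, 2)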

  3. Logarithmic transformation

The logarithmic transformation is not recommended, as the natural log does not exist for zero or negative values. So the transformation does not improve the fit.

  4. Model performance

The predictive power of the model for stock returns is very weak: the R-squared is only about 3%.

C9

  1. Model report

Beta1, the coefficient on prpblck, has t = 2.373 and p = 0.0181, so it is statistically significant at the 5% level but not at the 1% level (the critical values are computed after the output below).

model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32218 -0.04648  0.00651  0.04272  0.35622 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.46333    0.29371  -4.982  9.4e-07 ***
## prpblck      0.07281    0.03068   2.373   0.0181 *  
## log(income)  0.13696    0.02676   5.119  4.8e-07 ***
## prppov       0.38036    0.13279   2.864   0.0044 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08137 on 397 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.08696,    Adjusted R-squared:  0.08006 
## F-statistic:  12.6 on 3 and 397 DF,  p-value: 6.917e-08
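
The two-sided critical values for 397 degrees of freedom confirm this: the reported t statistic for prpblck (2.373) exceeds the 5% critical value but not the 1% one.

# Two-sided critical t values at the 5% and 1% levels, df = 397
qt(c(0.975, 0.995), df = 397)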
  2. Correlation

The correlation between log(income) and prppov is -0.838.

Despite this high correlation, both variables are individually statistically significant in the regression of part 1, with p(log(income)) ≈ 0 and p(prppov) = 0.0044.

data9_2 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data9_2)
##           income    prppov
## income  1.000000 -0.838467
## prppov -0.838467  1.000000
  3. Model modification

The coefficient on log(hseval) is 0.121: a 1% increase in median housing value is associated with roughly a 0.12% increase in the price of soda. The null hypothesis that the coefficient is zero is strongly rejected, with a p-value of essentially zero.

model <- lm(log(psoda)~prpblck+ log(income)+prppov+log(hseval), data=data8)
summary(model)
## 
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov + log(hseval), 
##     data = data8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30652 -0.04380  0.00701  0.04332  0.35272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.84151    0.29243  -2.878 0.004224 ** 
## prpblck      0.09755    0.02926   3.334 0.000937 ***
## log(income) -0.05299    0.03753  -1.412 0.158706    
## prppov       0.05212    0.13450   0.388 0.698571    
## log(hseval)  0.12131    0.01768   6.860 2.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07702 on 396 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1839, Adjusted R-squared:  0.1757 
## F-statistic: 22.31 on 4 and 396 DF,  p-value: < 2.2e-16
  4. Joint significance of log(income) and prppov

Individually, log(income) and prppov become insignificant once log(hseval) is included, but the F test below shows that they are jointly significant at the 5% level (p ≈ 0.030).

modelur <- lm(log(psoda)~prpblck+ log(income)+prppov+log(hseval), data=data8)
modelr <- lm(log(psoda)~prpblck+log(hseval), data=data8)
ftest <- anova(modelur, modelr)
print(ftest)
## Analysis of Variance Table
## 
## Model 1: log(psoda) ~ prpblck + log(income) + prppov + log(hseval)
## Model 2: log(psoda) ~ prpblck + log(hseval)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    396 2.3493                              
## 2    398 2.3911 -2 -0.041797 3.5227 0.03045 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  5. Model interpretation

Based on the model in part 3, prpblck and log(hseval) are the two most reliable explanatory variables: both coefficients are statistically significant with very small p-values.