According to the model, the coefficient on log(dist) is 0.312. Since both variables are in logs, a 1% increase in distance from the incinerator is associated with approximately a 0.312% increase in house price. The sign is positive, meaning that the farther a house is from the garbage incinerator, the higher its price. This is what I expected, because people do not want to live near a garbage incinerator.
It depends on how the data were collected: in particular, whether the sample is representative.
There are many other factors that can affect the house price but are not related to the incinerator, such as the size of the house, its age, the furnishings inside, and the amenities of the surrounding neighborhood.
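For reference, a minimal sketch of how this kind of log-log regression could be run, assuming the kielmc data set from the wooldridge package (the exact estimation sample, e.g. restricting to the 1981 cross-section with y81 == 1, depends on the problem statement):
library(wooldridge)  # assumed source of the kielmc housing data
# Log-log regression of house price on distance to the incinerator;
# restrict to kielmc[kielmc$y81 == 1, ] if only the 1981 cross-section is wanted.
inc_model <- lm(log(price) ~ log(dist), data = kielmc)
summary(inc_model)$coefficients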
data <- wage2
print(paste("The mean of salary will be:", mean(data$wage)))
## [1] "The mean of salary will be: 957.945454545455"
print(paste("The mean of IQ will be:", mean(data$IQ)))
## [1] "The mean of IQ will be: 101.282352941176"
print(paste("The standard deviation of IQ will be:", sd(data$IQ)))
## [1] "The standard deviation of IQ will be: 15.0526363702651"
A one-point increase in IQ is associated with an increase of about 8.30 USD in monthly wage; hence, a 15-point increase in IQ is associated with an increase of about 124.55 USD.
IQ does not explain most of the variation in wage: the R-squared is 0.09554, so IQ explains only about 9.6% of the variation in wage.
model <- lm(wage~IQ, data= data)
summary(model)
##
## Call:
## lm(formula = wage ~ IQ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -898.7 -256.5 -47.3 201.1 2072.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.9916 85.6415 1.366 0.172
## IQ 8.3031 0.8364 9.927 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 384.8 on 933 degrees of freedom
## Multiple R-squared: 0.09554, Adjusted R-squared: 0.09457
## F-statistic: 98.55 on 1 and 933 DF, p-value: < 2.2e-16
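As a quick check, the 15-point effect quoted above can also be computed directly from the fitted model object:
15 * coef(model)["IQ"]  # approximately 15 * 8.3031 = 124.55 USD per month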
The coefficient is 0.0088, meaning that a one-point increase in IQ is associated with roughly a 0.88% increase in wage.
model <- lm(log(wage)~IQ, data= data)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ IQ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.09324 -0.25547 0.02261 0.27544 1.21486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8869943 0.0890206 66.13 <2e-16 ***
## IQ 0.0088072 0.0008694 10.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3999 on 933 degrees of freedom
## Multiple R-squared: 0.09909, Adjusted R-squared: 0.09813
## F-statistic: 102.6 on 1 and 933 DF, p-value: < 2.2e-16
The coefficient is 3.53, meaning that one more year of education is associated with a 3.53-point increase in IQ.
model <- lm(IQ~educ, data= data)
summary(model)
##
## Call:
## lm(formula = IQ ~ educ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.228 -7.262 0.907 8.772 37.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.6872 2.6229 20.47 <2e-16 ***
## educ 3.5338 0.1922 18.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 933 degrees of freedom
## Multiple R-squared: 0.2659, Adjusted R-squared: 0.2652
## F-statistic: 338 on 1 and 933 DF, p-value: < 2.2e-16
theta1 <- model$coefficients[2]  # slope from the regression of IQ on educ
The coefficient is 0.059839, meaning that one more year of education is associated with an increase of about 0.06 in log(wage), i.e., roughly a 6% increase in wage.
model <- lm(log(wage)~educ, data= data)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.94620 -0.24832 0.03507 0.27440 1.28106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.973062 0.081374 73.40 <2e-16 ***
## educ 0.059839 0.005963 10.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4003 on 933 degrees of freedom
## Multiple R-squared: 0.09742, Adjusted R-squared: 0.09645
## F-statistic: 100.7 on 1 and 933 DF, p-value: < 2.2e-16
beta1 <- model$coefficients[2]  # slope from the simple regression of log(wage) on educ
The coefficients are 0.0391199 on educ and 0.0058631 on IQ. Holding years of education fixed, a one-point increase in IQ is associated with roughly a 0.59% increase in wage; holding IQ fixed, one more year of education is associated with roughly a 3.9% increase in wage.
model <- lm(log(wage)~educ+IQ, data= data)
summary(model)
##
## Call:
## lm(formula = log(wage) ~ educ + IQ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.01601 -0.24367 0.03359 0.27960 1.23783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6582876 0.0962408 58.793 < 2e-16 ***
## educ 0.0391199 0.0068382 5.721 1.43e-08 ***
## IQ 0.0058631 0.0009979 5.875 5.87e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3933 on 932 degrees of freedom
## Multiple R-squared: 0.1297, Adjusted R-squared: 0.1278
## F-statistic: 69.42 on 2 and 932 DF, p-value: < 2.2e-16
beta1_2 <- model$coefficients[2]  # coefficient on educ in the multiple regression
beta2 <- model$coefficients[3]    # coefficient on IQ in the multiple regression
# Check the omitted-variable relationship: the simple-regression slope satisfies
# beta1 = beta1_2 + beta2*theta1, so this difference should be exactly zero.
beta1_2 + beta2*theta1 - beta1
## educ
## 0
Income is measured in US dollars, while prpblck is a proportion between 0 and 1 (not a percentage).
data8 <- discrim
print(paste("The mean of income will be:", mean(data8$income, na.rm=TRUE)))
## [1] "The mean of income will be: 47053.7848410758"
print(paste("The standard deviation of income will be:", sd(data8$income, na.rm=TRUE)))
## [1] "The standard deviation of income will be: 13179.2860689389"
print(paste("The mean of people black will be:", mean(data8$prpblck,na.rm=TRUE)))
## [1] "The mean of people black will be: 0.113486396497833"
print(paste("The standard deviation of people black will be:", sd(data8$prpblck, na.rm=TRUE)))
## [1] "The standard deviation of people black will be: 0.182416467486231"
The coefficient on prpblck is 0.115, meaning that, holding income fixed, a 0.10 increase in the proportion black in a zip code is associated with an increase of about 0.0115 USD (just over one cent) in the price of a medium soda. The effect is statistically significant but not economically large.
# Sample size actually used in the regression: complete cases on all three variables
sample_size <- data8 %>% select(psoda, prpblck, income) %>% na.omit() %>% nrow()
model <- lm(psoda~ prpblck+income, data=data8)
model_summary <- summary(model)  # avoid shadowing base::summary()
model_summary
##
## Call:
## lm(formula = psoda ~ prpblck + income, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29401 -0.05242 0.00333 0.04231 0.44322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.563e-01 1.899e-02 50.354 < 2e-16 ***
## prpblck 1.150e-01 2.600e-02 4.423 1.26e-05 ***
## income 1.603e-06 3.618e-07 4.430 1.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08611 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06422, Adjusted R-squared: 0.05952
## F-statistic: 13.66 on 2 and 398 DF, p-value: 1.835e-06
print(paste("The sample size is:",sample_size))
## [1] "The sample size is: 402"
print(paste("The Rsquared is:",summary$r.squared))
## [1] "The Rsquared is: 0.0642203910903628"
The coefficient in the simple regression is 0.0649, compared with 0.115 when income is included. Hence, the estimated discrimination effect is larger, not smaller, when we control for income.
model <- lm(psoda~prpblck, data=data8)
summary(model)
##
## Call:
## lm(formula = psoda ~ prpblck, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30884 -0.05963 0.01135 0.03206 0.44840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.03740 0.00519 199.87 < 2e-16 ***
## prpblck 0.06493 0.02396 2.71 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0881 on 399 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.01808, Adjusted R-squared: 0.01561
## F-statistic: 7.345 on 1 and 399 DF, p-value: 0.007015
The coefficient on prpblck is 0.1216, meaning that, controlling for log(income), a one-unit increase in prpblck is associated with roughly a 12.2% higher soda price; a 0.20 increase in prpblck corresponds to an increase of about 2.4%.
model <- lm(log(psoda)~prpblck+ log(income), data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income), data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33563 -0.04695 0.00658 0.04334 0.35413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.79377 0.17943 -4.424 1.25e-05 ***
## prpblck 0.12158 0.02575 4.722 3.24e-06 ***
## log(income) 0.07651 0.01660 4.610 5.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0821 on 398 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.06809, Adjusted R-squared: 0.06341
## F-statistic: 14.54 on 2 and 398 DF, p-value: 8.039e-07
When prppov is added, the coefficient on prpblck decreases to 0.07281 and its statistical significance also falls (p = 0.0181).
model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
The correlation is not what I expected. The correlation between the proportion in poverty and the price of soda is only 0.026, which suggests essentially no linear relationship. I expected that the poorer the population, the lower the price would be.
# Remove missing values from the vectors
data8_6 <- data8 %>% select(psoda,prppov) %>% na.omit
cor(data8_6)
## psoda prppov
## psoda 1.00000000 0.02598077
## prppov 0.02598077 1.00000000
Because the correlation between log(income) and prppov is very high in magnitude, at about -0.84, prppov adds little information once income is already in the model. Moreover, including two highly correlated variables in the same model can cause multicollinearity. Hence, the statement is true.
# Remove missing values from the vectors
data8_7 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data8_7)
## income prppov
## income 1.000000 -0.838467
## prppov -0.838467 1.000000
t(dkr) = 0.321/0.201 = 1.59 < 1.64
t(eps) = 0.043/0.078 = 0.55 < 1.64
t(netinc) = 0.0051/0.0047 = 1.085 < 1.64
t(salary) = 0.0035/0.0022 = 1.59 < 1.64
None of the explanatory variables is statistically significant in this model: every t statistic is below the critical value of 1.64.
t(dkr) = 0.327/0.203 = 1.61 < 1.64
t(eps) = 0.069/0.08 = 0.8625 < 1.64
t(log(netinc)) = 4.74/3.39 = 1.39 < 1.64
t(log(salary)) = 7.24/6.31 = 1.14 < 1.64
Again, none of the explanatory variables is statistically significant in this model.
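For completeness, a minimal sketch of how these two regressions (and hence the t statistics, which are simply Estimate divided by Std. Error) could be reproduced, assuming the RETURN data are available as `return` in the wooldridge package with the variable names used above:
ret_data <- wooldridge::return  # assumed name and source of the RETURN data set
# Level and log specifications; summary() reports the t statistics directly.
m_level <- lm(return ~ dkr + eps + netinc + salary, data = ret_data)
m_log <- lm(return ~ dkr + eps + log(netinc) + log(salary), data = ret_data)
summary(m_level)$coefficients[, "t value"]
summary(m_log)$coefficients[, "t value"]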
The logarithmic transformation is not appropriate here because the natural log does not exist for zero or negative values, so the transformation cannot be used to improve the fit.
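A one-line illustration of why the log cannot be applied to such values:
log(c(-0.05, 0, 0.05))  # NaN for the negative value, -Inf for zero (with a warning)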
The predictability of stock returns is very weak: the model explains only about 3% of the variation in returns.
Beta1 is statistically significant at both the 5% and the 1% significance levels.
model <- lm(log(psoda)~prpblck+ log(income)+prppov, data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov, data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32218 -0.04648 0.00651 0.04272 0.35622
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.46333 0.29371 -4.982 9.4e-07 ***
## prpblck 0.07281 0.03068 2.373 0.0181 *
## log(income) 0.13696 0.02676 5.119 4.8e-07 ***
## prppov 0.38036 0.13279 2.864 0.0044 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08137 on 397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08696, Adjusted R-squared: 0.08006
## F-statistic: 12.6 on 3 and 397 DF, p-value: 6.917e-08
The correlation between log(income) and prppov is -0.838.
Both variables are nevertheless statistically significant in the regression above: the p-value for log(income) is essentially zero and the p-value for prppov is 0.0044.
data9_2 <- data8 %>% select(income,prppov) %>% na.omit %>% mutate(income= log(income))
cor(data9_2)
## income prppov
## income 1.000000 -0.838467
## prppov -0.838467 1.000000
The coefficient on log(hseval) is 0.121, meaning that a 1% increase in housing value is associated with roughly a 0.121% increase in the price of soda. The null hypothesis that this coefficient is zero is strongly rejected, with a p-value close to zero.
model <- lm(log(psoda)~prpblck+ log(income)+prppov+log(hseval), data=data8)
summary(model)
##
## Call:
## lm(formula = log(psoda) ~ prpblck + log(income) + prppov + log(hseval),
## data = data8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30652 -0.04380 0.00701 0.04332 0.35272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.84151 0.29243 -2.878 0.004224 **
## prpblck 0.09755 0.02926 3.334 0.000937 ***
## log(income) -0.05299 0.03753 -1.412 0.158706
## prppov 0.05212 0.13450 0.388 0.698571
## log(hseval) 0.12131 0.01768 6.860 2.67e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07702 on 396 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.1839, Adjusted R-squared: 0.1757
## F-statistic: 22.31 on 4 and 396 DF, p-value: < 2.2e-16
At the 5% significance level, log(income) and prppov are jointly significant (F = 3.52, p = 0.030), even though neither is individually significant once log(hseval) is included.
modelur <- lm(log(psoda)~prpblck+ log(income)+prppov+log(hseval), data=data8)  # unrestricted model
modelr <- lm(log(psoda)~prpblck+log(hseval), data=data8)  # restricted model: log(income) and prppov dropped
ftest <- anova(modelur, modelr)  # F test of the joint significance of the two dropped variables
print(ftest)
## Analysis of Variance Table
##
## Model 1: log(psoda) ~ prpblck + log(income) + prppov + log(hseval)
## Model 2: log(psoda) ~ prpblck + log(hseval)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2.3493
## 2 398 2.3911 -2 -0.041797 3.5227 0.03045 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the model in part 9.3, we conclude that prpblck and log(hseval) are the two most reliable explanatory variables: both have statistically significant coefficients with very small p-values, while log(income) and prppov are not individually significant once log(hseval) is included.