In 1967, two Canadian professors, Peter Pineo and John Porter, published the results of a 1965 study in which Canadians were asked to rank a limited number of occupations. The paper presented the prestige scores assigned to 204 jobs by the Canadian sample. The Pineo-Porter method was intended, in part, as an evaluative test of the 1971 census codes. As a method of socioeconomic status (SES) estimation, it attaches prestige scores to 16 occupational categories and is the basis for the Blishen scale. It assigns SES codes to the occupations listed in the 1981 Canadian Classification and Dictionary of Occupations.
The following regression was used to calculate the prestige score:
\(Y = a + b_1\,\text{Education} + b_2\,\text{Income}\), where \(Y\) is the prestige score.
The score was a reliable measure of socioeconomic status at the time of its creation, and it was used to compare occupations in the 1971 and 1981 censuses. Today it is no longer a reliable indicator. The definitions and terminology used for occupations have changed over time. Over the past 50 years, the economy has shifted from being primarily manufacturing based to primarily service based, and income levels for certain occupations have changed. Some occupations (garment factory workers, butchers, etc.) have almost vanished, and new occupations (technology and internet-based jobs, etc.) have emerged.
The output shows that blue-collar and white-collar occupations have lower prestige scores than professional occupations. Income and level of education seem to explain these trends. From the plot of income against percentage of women (third column, second row of the scatterplot matrix), we can see that as the percentage of women increases, average income in the profession declines. The plot of prestige against education (fourth column, first row) shows that higher levels of education go with higher prestige scores. The scatterplot matrix also suggests that there is no clear linear relationship between the percentage of women in an occupation and its prestige rating. Therefore, we will not include women in the regression model.
Does restricting our regression to only the income, education, and type variables make sense given your exploratory analysis? Yes. Since there is no linear relationship between the percentage of women in an occupation and the prestige rating, we will not use women as an explanatory variable. We can use income, education, and type as the explanatory variables for the remainder of the question.
## Number of missing type are 4
row | occupation.group | education | income | women | prestige | census | type |
---|---|---|---|---|---|---|---|
34 | athletes | 11.44 | 8206 | 8.13 | 54.1 | 3373 | NA |
53 | newsboys | 9.62 | 918 | 7.00 | 14.8 | 5143 | NA |
63 | babysitters | 9.46 | 611 | 96.53 | 25.9 | 6147 | NA |
67 | farmers | 6.84 | 3643 | 3.60 | 44.1 | 7112 | NA |
In our dataset, type is a factor variable that refers to three occupational levels: bc for Blue Collar; prof for Professional, Managerial, and Technical; and wc for White Collar. The output shows that the survey contains more blue-collar occupations than any other occupational group. From the output above and the histogram, we can see that four professions are missing a type: athletes, newsboys, babysitters and farmers.
It is not advisable to group these professions into a single category, for the following reasons:
We assume that the propensity for a data point to be missing is completely random, so that the missing data are just a random subset of the data. Under this assumption, the fact that a value is missing has nothing to do with its hypothetical value or with the values of the other variables.
The number of cases with missing values is extremely small: 4 out of 102 (about 3.9%). According to expert researchers, such values may be dropped or omitted from the analysis; as a rule of thumb, if the missing cases make up less than 5% of the sample, they can be dropped. (http://www.statisticssolutions.com/missing-values-in-data/)
The professions with a missing type are athletes, newsboys, babysitters and farmers. They do not form a homogeneous category, so combining them into a fourth occupational category would not add anything meaningful to our model.
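The two options discussed here (dropping the rows, or keeping them under an extra level) can be sketched in base R. This is only an illustration, not the code behind the report: it rebuilds a ten-row excerpt from the tables printed in this section, and the object names `excerpt`, `dropped` and `recoded` are placeholders introduced for the example.

```r
# Ten-row excerpt copied from the tables in this report (not the full data set).
excerpt <- data.frame(
  occupation.group = c("gov.administrators", "general.managers", "accountants",
                       "purchasing.officers", "chemists", "physicists",
                       "athletes", "newsboys", "babysitters", "farmers"),
  education = c(13.11, 12.26, 12.77, 11.42, 14.62, 15.64, 11.44, 9.62, 9.46, 6.84),
  income    = c(12351, 25879, 9271, 8865, 8403, 11030, 8206, 918, 611, 3643),
  prestige  = c(68.8, 69.1, 63.4, 56.8, 73.5, 77.6, 54.1, 14.8, 25.9, 44.1),
  type      = factor(c(rep("prof", 6), rep(NA, 4)), levels = c("bc", "wc", "prof")),
  stringsAsFactors = FALSE
)

# Count and list the occupations with no type.
n_missing      <- sum(is.na(excerpt$type))
missing_groups <- excerpt$occupation.group[is.na(excerpt$type)]

# Option 1: drop the rows with a missing type.
dropped <- excerpt[!is.na(excerpt$type), ]

# Option 2: keep them under a fourth level, "other".
recoded <- excerpt
levels(recoded$type) <- c(levels(recoded$type), "other")
recoded$type[is.na(recoded$type)] <- "other"

print(n_missing)
print(missing_groups)
print(table(recoded$type))
```

On the full data set the same two steps would produce the 98-row and 102-row data frames behind the two model summaries that follow.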
Now, we will clean the data.
## [1] "Cleaned Data"
## occupation.group education income women prestige census type
## 1 gov.administrators 13.11 12351 11.16 68.8 1113 prof
## 2 general.managers 12.26 25879 4.02 69.1 1130 prof
## 3 accountants 12.77 9271 15.70 63.4 1171 prof
## 4 purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## 5 chemists 14.62 8403 11.68 73.5 2111 prof
## 6 physicists 15.64 11030 5.13 77.6 2113 prof
The output below shows the summary of the model fitted after dropping the rows with missing values.
##
## Call:
## lm(formula = prestige ~ education + income + type, data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9529 -4.4486 0.1678 5.0566 18.6320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6229292 5.2275255 -0.119 0.905
## education 3.6731661 0.6405016 5.735 1.21e-07 ***
## income 0.0010132 0.0002209 4.586 1.40e-05 ***
## typewc -2.7372307 2.5139324 -1.089 0.279
## typeprof 6.0389707 3.8668551 1.562 0.122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.095 on 93 degrees of freedom
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8278
## F-statistic: 117.5 on 4 and 93 DF, p-value: < 2.2e-16
The output below is the summary of the model fitted to the imputed data, with “other” as a fourth occupational category.
##
## Call:
## lm(formula = prestige ~ education + income + type, data = imputed_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0864 -4.8662 0.1436 5.3524 19.2652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7693934 5.3361820 0.332 0.7409
## education 3.3059733 0.6537085 5.057 2.04e-06 ***
## income 0.0011392 0.0002305 4.942 3.28e-06 ***
## typewc -1.7190573 2.6174653 -0.657 0.5129
## typeprof 7.4877370 3.9698331 1.886 0.0623 .
## typeother -1.7322264 4.0258430 -0.430 0.6680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.512 on 96 degrees of freedom
## Multiple R-squared: 0.8188, Adjusted R-squared: 0.8093
## F-statistic: 86.75 on 5 and 96 DF, p-value: < 2.2e-16
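We can see that the “other” type in the imputed_data model does not change the fit in any meaningful way compared with the original model: its coefficient is not significant (p = 0.668), and the adjusted R-squared is slightly lower (0.8093 versus 0.8278). Hence there is no benefit in making “other” a fourth occupational category, and we drop the rows with missing values from the data.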
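As a small cross-check, the adjusted R-squared of each fit can be recomputed from the multiple R-squared, the sample size and the number of slope coefficients reported in the two summaries. The helper name `adj_r2` is introduced here only for illustration.

```r
# Adjusted R-squared from R-squared, sample size n and number of slopes p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values read off the two summaries above.
adj_dropped <- adj_r2(r2 = 0.8349, n = 98,  p = 4)  # rows with missing type dropped
adj_other   <- adj_r2(r2 = 0.8188, n = 102, p = 5)  # missing type recoded as "other"

print(round(c(dropped = adj_dropped, other = adj_other), 4))
```

Both results agree with the printed values up to the rounding of the R-squared inputs.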
Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable.
From the plots above, there appears to be a possibility of an interaction. Let’s test each candidate interaction term to see which ones improve the performance of our model significantly.
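In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, so each candidate formula tested below adds one or two interaction terms to the same main effects. The sketch below lists the terms that each formula expands to; it needs no data, and the helper name `expand_terms` is an assumption made for this example.

```r
# Term labels implied by a model formula (no data needed).
expand_terms <- function(f) attr(terms(f), "term.labels")

f_edu  <- prestige ~ education + income + type + education * type
f_inc  <- prestige ~ education + income + type + income * type
f_both <- prestige ~ education + income + type * (education + income)

print(expand_terms(f_edu))
print(expand_terms(f_inc))
print(expand_terms(f_both))
```

Because type has three levels with bc as the baseline, each interaction term contributes two coefficients (one for wc and one for prof) to the summaries.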
First, we will check if education*type is significant.
##
## Call:
## lm(formula = prestige ~ education + income + type + education *
## type, data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1168 -4.1751 0.4384 5.1625 15.2362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.331e+00 7.783e+00 -0.299 0.765
## education 3.852e+00 9.406e-01 4.096 9.12e-05 ***
## income 1.052e-03 2.201e-04 4.782 6.66e-06 ***
## typewc -2.822e+01 1.959e+01 -1.440 0.153
## typeprof 2.209e+01 1.520e+01 1.454 0.149
## education:typewc 2.270e+00 1.872e+00 1.213 0.228
## education:typeprof -1.227e+00 1.304e+00 -0.941 0.349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.036 on 91 degrees of freedom
## Multiple R-squared: 0.8411, Adjusted R-squared: 0.8306
## F-statistic: 80.27 on 6 and 91 DF, p-value: < 2.2e-16
From the summary output above, the interaction term (education*type) is not significant (p = 0.228 and p = 0.349 for its two coefficients), so we do not add this term to the model at this stage.
Let’s check if the interaction term (income*type) is significant or not.
##
## Call:
## lm(formula = prestige ~ education + income + type + (income *
## type), data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8720 -4.8321 0.8534 4.1425 19.6710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7272633 4.9515480 -1.359 0.1776
## education 3.0396961 0.6003699 5.063 2.14e-06 ***
## income 0.0031344 0.0005215 6.010 3.79e-08 ***
## typewc 7.1375093 5.2898177 1.349 0.1806
## typeprof 25.1723873 5.4669586 4.604 1.34e-05 ***
## income:typewc -0.0014856 0.0008720 -1.704 0.0919 .
## income:typeprof -0.0025102 0.0005530 -4.539 1.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.455 on 91 degrees of freedom
## Multiple R-squared: 0.8663, Adjusted R-squared: 0.8574
## F-statistic: 98.23 on 6 and 91 DF, p-value: < 2.2e-16
After adding the interaction term (income*type), income:typeprof is statistically significant (p = 1.72e-05), while income:typewc is significant only at the 0.10 level (p = 0.0919).
Now we will check if the interaction term type*(education+income) is significant or not.
##
## Call:
## lm(formula = prestige ~ education + income + type * (education +
## income), data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.462 -4.225 1.346 3.826 19.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.276e+00 7.057e+00 0.323 0.7478
## education 1.713e+00 9.572e-01 1.790 0.0769 .
## income 3.522e-03 5.563e-04 6.332 9.62e-09 ***
## typewc -3.354e+01 1.765e+01 -1.900 0.0607 .
## typeprof 1.535e+01 1.372e+01 1.119 0.2660
## education:typewc 4.291e+00 1.757e+00 2.442 0.0166 *
## education:typeprof 1.388e+00 1.289e+00 1.077 0.2844
## income:typewc -2.072e-03 8.940e-04 -2.318 0.0228 *
## income:typeprof -2.903e-03 5.989e-04 -4.847 5.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.318 on 89 degrees of freedom
## Multiple R-squared: 0.8747, Adjusted R-squared: 0.8634
## F-statistic: 77.64 on 8 and 89 DF, p-value: < 2.2e-16
The formula type * (education + income) specifies terms for education, income and type, plus the interactions between education and type and between income and type. With both interactions in the model, the interactions of income with type “wc” and “prof” are statistically significant, as is education:typewc. Let’s do a partial F-test to compare this model with the income*type model.
The following output shows the partial F-test:
## Analysis of Variance Table
##
## Model 1: prestige ~ education + income + type * (education + income)
## Model 2: prestige ~ education + income + type + (income * type)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 89 3552.9
## 2 91 3791.3 -2 -238.4 2.9859 0.05557 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
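The output shows the results of the partial F-test, with F = 2.986 and p-value = 0.0556. Since the p-value is slightly above α = 0.05, we cannot reject the null hypothesis at the 0.05 level; the additional education:type terms are jointly significant only at the 0.10 level.
Because the evidence is borderline and the education:typewc coefficient is individually significant (p = 0.0166), we will go ahead with the fuller model: prestige ~ education + income + type * (education + income)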
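The partial F statistic can be reproduced by hand from the residual sums of squares and residual degrees of freedom in the analysis of variance table above. The sketch uses only those printed numbers; the variable names are placeholders.

```r
# Residual sums of squares and residual df from the anova table above.
rss_full    <- 3552.9; df_full    <- 89  # type * (education + income)
rss_reduced <- 3791.3; df_reduced <- 91  # income * type only

# Partial F: extra variation explained per added parameter,
# divided by the residual mean square of the full model.
df_extra <- df_reduced - df_full
f_stat   <- ((rss_reduced - rss_full) / df_extra) / (rss_full / df_full)
p_value  <- pf(f_stat, df_extra, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The p-value falls between 0.05 and 0.06, which is why the result is borderline at the 0.05 level.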
We now fit a model to predict prestige using: income, education, type, and our interaction term “type * (education + income)” based on our answer to part (d).
lm(prestige ~ education + income + type*(education+income), data=newPrestige_cleaned)
We will now evaluate our model by checking the regression assumptions as follows:
1) Fixed x and measurement error: The data used in the analysis were collected from reliable sources, so we treat the x values as fixed. There could still be measurement error, since the data for all the variables were collected at the national level. Respondents may have declined to complete the task, some of the people surveyed may have been illiterate or may have given false information, or the person in charge of data collection may have changed without the occupational prestige packet being handed over to the new person correctly.
2) Linearity: The data appear to be well modeled by a linear relationship between y and x, and the points appear to be randomly spread about the line, with no discernible non-linear trends or changes in variability. In the “Residuals vs Fitted” plot (Figure 3), the red line is approximately flat, which tells us that there is no discernible non-linear trend in the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values, so there is no indication of non-constant variance.
3) Homoscedasticity of residuals (equal variance): Ideally, residuals are randomly scattered around 0 (the horizontal line), giving a relatively even distribution. The standard linear regression assumption is that the variance is constant across the entire range. Here the points appear random and the line looks fairly flat, with no increasing or decreasing trend, so the assumption of homoscedasticity can be accepted.
## education income women
## [1,] 0.8664798 0.7033094 -0.1101426
Collinearity/Multicollinearity: In our multiple regression model output above, education no longer has a significant p-value. Here, the education coefficient represents the average effect of education while holding income and the interaction term “type*(education+income)” constant. The correlations shown above highlight the situation we encountered with the model output. Note also that the correlation between education and income is fairly high, about 0.6. From the scatterplot matrix shown above, we can see that prestige follows closely aligned patterns when plotted against education and against income. In essence, when the two are put in the model together, education is no longer significant.
We can also check this using the VIF (Variance Inflation Factor), as follows:
## Variables VIF
## 1 income 1.491621
## 2 education 1.491621
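The VIF of about 1.49 for both variables is well below the cut-offs commonly used to flag severe multicollinearity (5 or 10), so income and education overlap only moderately and do not carry exactly the same information. Even so, because education is no longer significant once income and the interaction terms are in the model, we choose to keep only one of the two.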
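With exactly two predictors, the VIF is tied directly to their correlation through VIF = 1 / (1 − r²). As a rough cross-check, the reported VIF can be converted back into the size of the education-income correlation. The sketch assumes the VIF table was computed on education and income alone, as its two rows suggest.

```r
# With two predictors, VIF = 1 / (1 - r^2), where r is their correlation.
vif_reported <- 1.491621

r_squared <- 1 - 1 / vif_reported  # share of one predictor explained by the other
r_implied <- sqrt(r_squared)       # magnitude of the education-income correlation

print(round(c(r_squared = r_squared, r = r_implied), 3))
```

The implied correlation is close to the value of about 0.6 quoted in the text.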
Hence, we will drop education from our final model. The final model, with the interaction term (type*income), is as follows.
Model Evaluation:
Given that the model assumptions are satisfied, we want to determine how good our model is. Below is the summary and anova results for our model.
##
## Call:
## lm(formula = prestige ~ income + type + type * income, data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.2669 -5.2956 0.3125 4.3392 25.0200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.9045168 3.1671787 4.390 3.02e-05 ***
## income 0.0040235 0.0005530 7.276 1.12e-10 ***
## typewc 18.9807386 5.3421020 3.553 0.000603 ***
## typeprof 45.0190221 4.2907398 10.492 < 2e-16 ***
## income:typewc -0.0021712 0.0009700 -2.238 0.027603 *
## income:typeprof -0.0031783 0.0006047 -5.256 9.48e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.268 on 92 degrees of freedom
## Multiple R-squared: 0.8286, Adjusted R-squared: 0.8193
## F-statistic: 88.94 on 5 and 92 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: prestige
## Df Sum Sq Mean Sq F value Pr(>F)
## income 1 14021.6 14021.6 265.471 < 2.2e-16 ***
## type 2 7988.5 3994.3 75.623 < 2.2e-16 ***
## income:type 2 1477.5 738.8 13.987 4.969e-06 ***
## Residuals 92 4859.2 52.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model | r.squared | adj.r.squared |
---|---|---|
prestige vs. income | 0.4946 | 0.4894 |
prestige vs. income, type | 0.7765 | 0.7693 |
prestige vs. income, type,income*type | 0.8286 | 0.8193 |

First, we perform the overall F-test on the model as follows.
α = 0.05
hypotheses:
H0: βincome = βtype = βincome*type = 0
Ha: at least one slope is not zero
test statistic:
Fc = (23487.6/5)/(4859.2/92)
Fc = 88.94
with p = 5 and n − (p + 1) = 98 − 6 = 92 degrees of freedom
p-value < α = 0.05
conclusion: reject the null hypothesis; the model is adequate.
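The overall F statistic can be rebuilt from the sums of squares in the analysis of variance table above. The sketch below uses only those printed values; the variable names are placeholders.

```r
# Sums of squares from the anova table above.
ss_model <- 14021.6 + 7988.5 + 1477.5  # income + type + income:type
ss_resid <- 4859.2
p <- 5   # number of slope coefficients
n <- 98  # occupations left after dropping missing types

# Mean square for the model divided by the residual mean square.
f_overall <- (ss_model / p) / (ss_resid / (n - p - 1))
p_value   <- pf(f_overall, p, n - p - 1, lower.tail = FALSE)

print(c(F = f_overall, p = p_value))
```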
Next we perform the individual t-tests.
Testing βincome: α = 0.05
H0: βincome = 0 (assuming type and income*type are already in the model)
Ha: βincome ≠ 0
The t-statistic is t = 7.276, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 1.12e-10 < 0.05, so we reject the null hypothesis and conclude that income is statistically significant when type and income*type are already in the model.
Testing βtypewc: α = 0.05
H0: βtypewc = 0 (assuming income and income*type are already in the model)
Ha: βtypewc ≠ 0
The t-statistic is t = 3.553, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.000603 < 0.05, so we reject the null hypothesis and conclude that typewc is statistically significant when income and income*type are already in the model.
Testing βtypeprof: α = 0.05
H0: βtypeprof = 0 (assuming income and income*type are already in the model)
Ha: βtypeprof ≠ 0
The t-statistic is t = 10.492, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is < 2e-16, which is below 0.05, so we reject the null hypothesis and conclude that typeprof is statistically significant when income and income*type are already in the model.
Testing βincome:typewc: α = 0.05
H0: βincome:typewc = 0 (assuming income and type are already in the model)
Ha: βincome:typewc ≠ 0
The t-statistic is t = -2.238, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.027603 < 0.05, so we reject the null hypothesis and conclude that income:typewc is statistically significant when income and type are already in the model.
Testing βincome:typeprof: α = 0.05
H0: βincome:typeprof = 0 (assuming income and type are already in the model)
Ha: βincome:typeprof ≠ 0
The t-statistic is t = -5.256, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 9.48e-07 < 0.05, so we reject the null hypothesis and conclude that income:typeprof is statistically significant when income and type are already in the model.
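Each of these t-tests follows the same recipe: divide the estimate by its standard error and compare the result with a t distribution on 92 degrees of freedom. As an illustration, the sketch below redoes the income:typewc test from the rounded values in the coefficient table, so the result matches the printed output only up to rounding.

```r
# Estimate and standard error for income:typewc, from the coefficient table.
estimate  <- -0.0021712
std_error <-  0.0009700
df_resid  <- 92

t_stat  <- estimate / std_error
p_value <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)  # two-sided test

print(c(t = t_stat, p = p_value))
```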
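Final regression model: \(\hat y = 13.9045168 + 0.0040235\,\text{income} + 18.9807386\,\text{typewc} + 45.0190221\,\text{typeprof} - 0.0021712\,(\text{income} \times \text{typewc}) - 0.0031783\,(\text{income} \times \text{typeprof})\), where income is measured in dollars.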
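Because of the interaction terms, the effect of income on prestige differs by occupation type: the baseline slope applies to blue-collar occupations, and the interaction coefficients shift it for the other two groups. A short sketch of that arithmetic, using the estimates above (the variable names are placeholders):

```r
# Coefficients from the final model.
b_income      <-  0.0040235  # income slope for the baseline type (bc)
b_income_wc   <- -0.0021712  # shift in the income slope for wc
b_income_prof <- -0.0031783  # shift in the income slope for prof

# Income slope within each occupation type.
slopes <- c(bc   = b_income,
            wc   = b_income + b_income_wc,
            prof = b_income + b_income_prof)

# Predicted change in prestige for a $1,000 increase in income.
per_1000 <- slopes * 1000
print(round(per_1000, 2))
```

The slope is steepest for blue-collar occupations and flattest for professional ones, which is what the negative interaction coefficients indicate.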
From our answer to part(c), we will use type “wc” and “prof” in our analysis.
The log transformation of income reduces the variability of the data and makes it conform more closely to a normal distribution.
##
## Call:
## lm(formula = prestige ~ log(income) + type + type * log(income),
## data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9201 -5.2785 -0.1848 5.5620 24.3118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.353 23.204 -5.359 6.16e-07 ***
## log(income) 18.782 2.723 6.898 6.55e-10 ***
## typewc 83.702 41.469 2.018 0.04646 *
## typeprof 102.630 35.688 2.876 0.00501 **
## log(income):typewc -8.979 4.889 -1.837 0.06947 .
## log(income):typeprof -9.001 4.020 -2.239 0.02756 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.404 on 92 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8124
## F-statistic: 85.02 on 5 and 92 DF, p-value: < 2.2e-16
Now, let’s evaluate our model with log income by checking the regression assumptions as follows:
1) Fixed x and measurement error: The data used in the analysis were collected from reliable sources, so we treat the x values as fixed. There could still be measurement error, since the data for all the variables were collected at the national level. Respondents may have declined to complete the task, some of the people surveyed may have been illiterate or may have given false information, or the person in charge of data collection may have changed without the occupational prestige packet being handed over to the new person correctly.
2) Linearity: The data appear to be well modeled by a linear relationship between y and x, and the points appear to be randomly spread about the line, with no discernible non-linear trends or changes in variability. In the “Residuals vs Fitted” plot (Figure 3), the red line is approximately flat, which tells us that there is no discernible non-linear trend in the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values, so there is no indication of non-constant variance.
3) Homoscedasticity of residuals (equal variance): Ideally, residuals are randomly scattered around 0 (the horizontal line), giving a relatively even distribution. The standard linear regression assumption is that the variance is constant across the entire range. Here the points appear random and the line looks fairly flat, with no increasing or decreasing trend, so the assumption of homoscedasticity can be accepted.
## 21 82 31
## 96 97 98
Since log(income) is the only continuous predictor in this model, and the remaining terms involve the categorical variable type, we won’t check multicollinearity.
Model Evaluation:
Given that the model assumptions are satisfied, we want to determine how good our model is. Below is the summary and anova results for our model.
##
## Call:
## lm(formula = prestige ~ log(income) + type + type * log(income),
## data = newPrestige_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9201 -5.2785 -0.1848 5.5620 24.3118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -124.353 23.204 -5.359 6.16e-07 ***
## log(income) 18.782 2.723 6.898 6.55e-10 ***
## typewc 83.702 41.469 2.018 0.04646 *
## typeprof 102.630 35.688 2.876 0.00501 **
## log(income):typewc -8.979 4.889 -1.837 0.06947 .
## log(income):typeprof -9.001 4.020 -2.239 0.02756 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.404 on 92 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.8124
## F-statistic: 85.02 on 5 and 92 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: prestige
## Df Sum Sq Mean Sq F value Pr(>F)
## log(income) 1 15998.5 15998.5 291.8325 <2e-16 ***
## type 2 6967.2 3483.6 63.5449 <2e-16 ***
## log(income):type 2 337.8 168.9 3.0806 0.0507 .
## Residuals 92 5043.5 54.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model | r.squared | adj.r.squared |
---|---|---|
prestige vs. log(income) | 0.5644 | 0.5598 |
prestige vs. log(income), type | 0.8102 | 0.8041 |
prestige vs. log(income), type, log(income)*type | 0.8221 | 0.8124 |

First, we perform the overall F-test on the model as follows.
α = 0.05
hypotheses:
H0: βlog(income) = βtype = βlog(income)*type = 0
Ha: at least one slope is not zero
test statistic:
Fc = (23303.5/5)/(5043.5/92)
Fc = 85.02
with p = 5 and n − (p + 1) = 98 − 6 = 92 degrees of freedom
p-value < α = 0.05
conclusion: reject the null hypothesis; the model is adequate.
Next we perform the individual t-tests.
Testing βlog(income): α = 0.05
H0: βlog(income) = 0 (assuming type and log(income)*type are already in the model)
Ha: βlog(income) ≠ 0
The t-statistic is t = 6.898, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 6.55e-10 < 0.05, so we reject the null hypothesis and conclude that log(income) is statistically significant when type and log(income)*type are already in the model.
Testing βtypewc: α = 0.05
H0: βtypewc = 0 (assuming log(income) and log(income)*type are already in the model)
Ha: βtypewc ≠ 0
The t-statistic is t = 2.018, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.04646 < 0.05, so we reject the null hypothesis and conclude that typewc is statistically significant when log(income) and log(income)*type are already in the model.
Testing βtypeprof: α = 0.05
H0: βtypeprof = 0 (assuming log(income) and log(income)*type are already in the model)
Ha: βtypeprof ≠ 0
The t-statistic is t = 2.876, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.00501 < 0.05, so we reject the null hypothesis and conclude that typeprof is statistically significant when log(income) and log(income)*type are already in the model.
Testing βlog(income):typewc: α = 0.05
H0: βlog(income):typewc = 0 (assuming log(income) and type are already in the model)
Ha: βlog(income):typewc ≠ 0
The t-statistic is t = -1.837, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.06947 > 0.05, so we cannot reject the null hypothesis, and we conclude that log(income):typewc is not statistically significant when log(income) and type are already in the model.
Testing βlog(income):typeprof: α = 0.05
H0: βlog(income):typeprof = 0 (assuming log(income) and type are already in the model)
Ha: βlog(income):typeprof ≠ 0
The t-statistic is t = -2.239, which has a t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.02756 < 0.05, so we reject the null hypothesis and conclude that log(income):typeprof is statistically significant when log(income) and type are already in the model.
Final regression model:
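\(\hat y = -124.3525838 + 18.7820949\,\log(\text{income}) + 83.7020124\,\text{typewc} + 102.6304773\,\text{typeprof} - 8.9793239\,(\log(\text{income}) \times \text{typewc}) - 9.0010905\,(\log(\text{income}) \times \text{typeprof})\), where income is measured in dollars and log is the natural logarithm.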
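With income on the log scale, a coefficient is easiest to read as the change in predicted prestige when income is multiplied by some factor k, which equals the slope times log(k). The sketch below works this out for each occupation type, using the estimates above; the choices of a 10% increase and a doubling are illustrative.

```r
# Coefficients from the log(income) model.
b_log      <- 18.7820949  # log(income) slope for the baseline type (bc)
b_log_wc   <- -8.9793239  # shift in the slope for wc
b_log_prof <- -9.0010905  # shift in the slope for prof

log_slopes <- c(bc   = b_log,
                wc   = b_log + b_log_wc,
                prof = b_log + b_log_prof)

# Change in predicted prestige when income is multiplied by k is slope * log(k).
effect_10pct  <- log_slopes * log(1.10)  # income up by 10%
effect_double <- log_slopes * log(2)     # income doubled

print(round(rbind(effect_10pct, effect_double), 2))
```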
We cannot compare the two models using a partial F-test, since they are not nested; the partial F-test applies only to nested models.
Instead, we can compare the two models using RMSE.
## RMSE for model(e) 7.041597
## RMSE for model(g) 7.173863
Comparing the RMSE of model(e) (RMSE = 7.041597) and model(g) (RMSE = 7.173863), model(e) has the lower error and is therefore the better of the two.
There is another way in R to compare two non-nested models: we can check the BIC (Bayesian Information Criterion).
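The RMSE values reported here are consistent with taking the square root of the residual sum of squares divided by the number of observations. A minimal sketch using the residual sums of squares from the two analysis of variance tables (the helper name `rmse_from_rss` is an assumption for this example):

```r
# Residual sums of squares from the two anova tables; n = 98 occupations.
n     <- 98
rss_e <- 4859.2  # model (e): income
rss_g <- 5043.5  # model (g): log(income)

# RMSE divides by n; the residual standard error divides by n - p - 1 instead.
rmse_from_rss <- function(rss, n) sqrt(rss / n)

rmse_e <- rmse_from_rss(rss_e, n)
rmse_g <- rmse_from_rss(rss_g, n)
print(c(model_e = rmse_e, model_g = rmse_g))
```

This also explains why the RMSE is slightly smaller than the residual standard error printed in each model summary.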
## BIC for model(e) 692.7664
## and BIC for model(g) 696.4138
Since the BIC for model(e) is lower, this model is preferable.
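For a linear model with normal errors, the BIC can be written in terms of the residual sum of squares, which makes the comparison easy to verify by hand. The sketch below assumes the usual Gaussian log-likelihood and counts seven parameters per model (six regression coefficients plus the error variance); the function name `bic_from_rss` is introduced here for illustration.

```r
# BIC of a Gaussian linear model from its residual sum of squares.
bic_from_rss <- function(rss, n, k) {
  # Log-likelihood at the maximum-likelihood estimates.
  log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  -2 * log_lik + k * log(n)
}

n <- 98
k <- 7  # six regression coefficients plus the error variance

bic_e <- bic_from_rss(4859.2, n, k)  # model (e): income
bic_g <- bic_from_rss(5043.5, n, k)  # model (g): log(income)
print(c(model_e = bic_e, model_g = bic_g))
```

Because both models have the same number of parameters and the same n, the BIC ranking here is driven entirely by the residual sum of squares, so it agrees with the RMSE comparison.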