Q1. Regression Model for the Prestige Level of Occupations

(a) The Pineo-Porter prestige score

In 1967, two Canadian professors, Peter Pineo and John Porter, published the results of a 1965 study in which Canadians were asked to rank a limited number of occupations. The study presented the assigned values (prestige scores) for 204 jobs across the Canadian sample. The Pineo-Porter method was intended, in part, to be an evaluative test of the 1971 census codes. The Pineo-Porter method of socioeconomic status (SES) estimation attaches prestige scores to 16 occupational categories and is the basis for the Blishen scale. It assigns SES codes to the occupations listed in the 1981 Canadian Classification and Dictionary of Occupations.
The following regression model was used to calculate the prestige score.

Y (Prestige score) = a + b1 Education + b2 Income

The score was a reliable measure of socioeconomic status at the time of its creation, and it was used to compare occupations during the censuses of 1971 and 1981. Today, however, it is no longer a reliable indicator. The definitions and terminology used for different occupations have changed over time. In the past 50 years, the economy has shifted from being primarily manufacturing based to primarily service based, and income levels for certain occupations have changed. Some occupations (garment factory workers, butchers, etc.) have almost vanished, and new occupations (technology and internet based jobs, etc.) have emerged.

(b) A scatterplot matrix of all the quantitative variables.

The output shows that blue-collar and white-collar workers have lower prestige scores than professionals. Income and level of education seem to explain these trends. From the relationship between income and percentage of women (third column left to right, second row top to bottom), we can see that as the percentage of women increases, the average income in the profession declines. The relationship between prestige and education (fourth column left to right, first row top to bottom) shows that a higher level of education goes with a higher prestige score. The scatterplot also suggests that there is no clear linear relationship between the percentage of women in the occupation and the prestige rating. Therefore, we won’t be including women in the regression model.
Does restricting our regression to only income, education, and type variables make sense given your exploratory analysis? Yes. Since there is no clear linear relationship between the percentage of women in an occupation and the prestige rating, we will not use women as an explanatory variable. We can use the explanatory variables income, education, and type for the remainder of the question.

(c) Missing values analysis

## Number of missing type are 4

The professions which are missing type
	occupation.group	education	income	women	prestige	census	type
34	athletes	11.44	8206	8.13	54.1	3373	NA
53	newsboys	9.62	918	7.00	14.8	5143	NA
63	babysitters	9.46	611	96.53	25.9	6147	NA
67	farmers	6.84	3643	3.60	44.1	7112	NA

In our dataset, we have type, a factor variable that refers to three occupational levels: bc for Blue Collar; prof for Professional, Managerial, and Technical; and wc for White Collar. The output shows that there are more blue-collar occupations in the survey than occupations of any other group. From the above output and histogram, we can see that 4 professions are missing type: athletes, newsboys, babysitters and farmers.

It is not advisable to group these professions into one category, for the following reasons:

1) We assume that the propensity for a data point to be missing is completely random, so the missing data are just a random subset of the data. Under this assumption, the fact that a certain value is missing has nothing to do with its hypothetical value or with the values of the other variables.
2) The number of cases with missing values is very small: 4 out of 102, or about 3.9% of the sample. A common rule of thumb is that such cases may be dropped from the analysis when they make up less than 5% of the sample. (http://www.statisticssolutions.com/missing-values-in-data/)
3) The professions that are missing values for the variable type are athletes, newsboys, babysitters and farmers. These professions do not belong to a homogeneous category, so combining them into a fourth professional category would not add anything meaningful to our model.

Now, we will clean the data.

## [1] "Cleaned Data"

##      occupation.group education income women prestige census type
## 1  gov.administrators     13.11  12351 11.16     68.8   1113 prof
## 2    general.managers     12.26  25879  4.02     69.1   1130 prof
## 3         accountants     12.77   9271 15.70     63.4   1171 prof
## 4 purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## 5            chemists     14.62   8403 11.68     73.5   2111 prof
## 6          physicists     15.64  11030  5.13     77.6   2113 prof
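The counting and dropping of rows with a missing type could be sketched as follows. This is only an illustration: `toy` is a hypothetical data frame assembled from a few rows printed in this report, not the full 102-row dataset, and the report does not show the code that was actually used.

```r
# Toy subset assembled from rows printed in this report (assumption: it only
# illustrates the mechanics; it is NOT the full 102-row dataset).
toy <- data.frame(
  occupation.group = c("gov.administrators", "general.managers", "accountants",
                       "athletes", "newsboys", "babysitters", "farmers"),
  education = c(13.11, 12.26, 12.77, 11.44, 9.62, 9.46, 6.84),
  income    = c(12351, 25879, 9271, 8206, 918, 611, 3643),
  women     = c(11.16, 4.02, 15.70, 8.13, 7.00, 96.53, 3.60),
  prestige  = c(68.8, 69.1, 63.4, 54.1, 14.8, 25.9, 44.1),
  type      = factor(c("prof", "prof", "prof", NA, NA, NA, NA),
                     levels = c("bc", "wc", "prof"))
)

# Count the rows whose type is missing
n_missing <- sum(is.na(toy$type))

# Share of missing rows in the full data, using the counts reported above
share_missing <- 4 / 102

# Keep only the rows where type is known
toy_cleaned <- toy[!is.na(toy$type), ]
```

With the counts reported above, `share_missing` is about 0.039, which is below the 5% rule of thumb.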

The output below shows the model summary after dropping the rows with missing values.

## 
## Call:
## lm(formula = prestige ~ education + income + type, data = newPrestige_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9529  -4.4486   0.1678   5.0566  18.6320 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.6229292  5.2275255  -0.119    0.905    
## education    3.6731661  0.6405016   5.735 1.21e-07 ***
## income       0.0010132  0.0002209   4.586 1.40e-05 ***
## typewc      -2.7372307  2.5139324  -1.089    0.279    
## typeprof     6.0389707  3.8668551   1.562    0.122    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.095 on 93 degrees of freedom
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.8278 
## F-statistic: 117.5 on 4 and 93 DF,  p-value: < 2.2e-16

The output below summarizes the model fitted to the imputed data, with “other” as a fourth professional category.

## 
## Call:
## lm(formula = prestige ~ education + income + type, data = imputed_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0864  -4.8662   0.1436   5.3524  19.2652 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.7693934  5.3361820   0.332   0.7409    
## education    3.3059733  0.6537085   5.057 2.04e-06 ***
## income       0.0011392  0.0002305   4.942 3.28e-06 ***
## typewc      -1.7190573  2.6174653  -0.657   0.5129    
## typeprof     7.4877370  3.9698331   1.886   0.0623 .  
## typeother   -1.7322264  4.0258430  -0.430   0.6680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.512 on 96 degrees of freedom
## Multiple R-squared:  0.8188, Adjusted R-squared:  0.8093 
## F-statistic: 86.75 on 5 and 96 DF,  p-value: < 2.2e-16

We can see that the coefficient for type “other” in the imputed_data model is not significant (p = 0.668) and that the remaining estimates change little compared to the original model. Hence, there is no benefit in making “other” a fourth professional category, and we drop the rows with missing values from the data.
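The report does not show how the “other” category was created. One possible way to recode missing factor values into an explicit level is sketched below; the vector `type` here is a small hypothetical example, not the real column.

```r
# Hypothetical factor with missing values (assumption: illustration only)
type <- factor(c("prof", "bc", "wc", NA, NA), levels = c("bc", "wc", "prof"))

# Recode NA as an explicit "other" level
type_chr <- as.character(type)
type_chr[is.na(type_chr)] <- "other"
type_other <- factor(type_chr, levels = c("bc", "wc", "prof", "other"))

# Frequency of each level after recoding
print(table(type_other))
```

Keeping "bc" as the first level preserves it as the reference category, so the other coefficients remain comparable with the model fitted on the cleaned data.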

(d) Interaction terms

Interaction terms are needed whenever there is reason to believe that the effect of one independent variable depends on the value of another independent variable.
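In R's formula notation, an interaction is requested with `*` (main effects plus interaction) or `:` (interaction only). As a small sketch, the expansion of a formula can be inspected without fitting anything; the variable names follow this report, while `f_full` and `f_long` are hypothetical object names.

```r
# Formula with type interacting with both education and income
f_full <- prestige ~ education + income + type * (education + income)

# terms() expands the formula symbolically; no data are needed
labels_full <- attr(terms(f_full), "term.labels")
print(labels_full)

# The same model written out term by term
f_long <- prestige ~ education + income + type + education:type + income:type
same_terms <- setequal(labels_full, attr(terms(f_long), "term.labels"))
```

Duplicate main effects are dropped during expansion, so repeating education and income in the formula does no harm.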

From the plots above, we can say there is a possibility of an interaction term. Let’s test the candidate interaction terms to see which of them improve the performance of our model significantly.

First, we will check if education*type is significant.

## 
## Call:
## lm(formula = prestige ~ education + income + type + education * 
##     type, data = newPrestige_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1168  -4.1751   0.4384   5.1625  15.2362 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -2.331e+00  7.783e+00  -0.299    0.765    
## education           3.852e+00  9.406e-01   4.096 9.12e-05 ***
## income              1.052e-03  2.201e-04   4.782 6.66e-06 ***
## typewc             -2.822e+01  1.959e+01  -1.440    0.153    
## typeprof            2.209e+01  1.520e+01   1.454    0.149    
## education:typewc    2.270e+00  1.872e+00   1.213    0.228    
## education:typeprof -1.227e+00  1.304e+00  -0.941    0.349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.036 on 91 degrees of freedom
## Multiple R-squared:  0.8411, Adjusted R-squared:  0.8306 
## F-statistic: 80.27 on 6 and 91 DF,  p-value: < 2.2e-16

From the summary output above, neither education:typewc (p = 0.228) nor education:typeprof (p = 0.349) is significant, so the interaction term (education*type) is not significant. So we will not include this term on its own in our model.

Let’s check if the interaction term (income*type) is significant or not.

## 
## Call:
## lm(formula = prestige ~ education + income + type + (income * 
##     type), data = newPrestige_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8720  -4.8321   0.8534   4.1425  19.6710 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.7272633  4.9515480  -1.359   0.1776    
## education        3.0396961  0.6003699   5.063 2.14e-06 ***
## income           0.0031344  0.0005215   6.010 3.79e-08 ***
## typewc           7.1375093  5.2898177   1.349   0.1806    
## typeprof        25.1723873  5.4669586   4.604 1.34e-05 ***
## income:typewc   -0.0014856  0.0008720  -1.704   0.0919 .  
## income:typeprof -0.0025102  0.0005530  -4.539 1.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.455 on 91 degrees of freedom
## Multiple R-squared:  0.8663, Adjusted R-squared:  0.8574 
## F-statistic: 98.23 on 6 and 91 DF,  p-value: < 2.2e-16

By adding the interaction term (income*type), income:typeprof becomes statistically significant.

Now we will check if the interaction term type*(education+income) is significant or not.

## 
## Call:
## lm(formula = prestige ~ education + income + type * (education + 
##     income), data = newPrestige_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.462  -4.225   1.346   3.826  19.631 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.276e+00  7.057e+00   0.323   0.7478    
## education           1.713e+00  9.572e-01   1.790   0.0769 .  
## income              3.522e-03  5.563e-04   6.332 9.62e-09 ***
## typewc             -3.354e+01  1.765e+01  -1.900   0.0607 .  
## typeprof            1.535e+01  1.372e+01   1.119   0.2660    
## education:typewc    4.291e+00  1.757e+00   2.442   0.0166 *  
## education:typeprof  1.388e+00  1.289e+00   1.077   0.2844    
## income:typewc      -2.072e-03  8.940e-04  -2.318   0.0228 *  
## income:typeprof    -2.903e-03  5.989e-04  -4.847 5.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.318 on 89 degrees of freedom
## Multiple R-squared:  0.8747, Adjusted R-squared:  0.8634 
## F-statistic: 77.64 on 8 and 89 DF,  p-value: < 2.2e-16

The (education + income)*type term specifies terms for education, income, and type, plus the interactions between education and type and between income and type. With this term in the model, the interactions of income with type “wc” and “prof” are both statistically significant. Let’s do a partial F-test to compare this model with the previous one.

The following output shows the partial F-test:

## Analysis of Variance Table
## 
## Model 1: prestige ~ education + income + type * (education + income)
## Model 2: prestige ~ education + income + type + (income * type)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     89 3552.9                              
## 2     91 3791.3 -2    -238.4 2.9859 0.05557 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

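The output shows the results of the partial F-test, which compares the model with only income*type (Model 2) against the model with type * (education + income) (Model 1). Here F = 2.9859 with a p-value of 0.05557. Since the p-value is slightly above 0.05, we cannot reject the null hypothesis at the α = 0.05 level; the additional education:type terms are significant only at the α = 0.10 level.

Because the result is borderline and Model 1 has the higher adjusted R-squared (0.8634 vs. 0.8574), we will provisionally go ahead with the model: prestige ~ education + income + type * (education + income)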
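The F statistic and p-value in the analysis of variance table can be reproduced by hand from the residual sums of squares and degrees of freedom it reports. The sketch below uses only those printed numbers; the object names are hypothetical.

```r
# Values copied from the analysis of variance table above
rss_full    <- 3552.9  # Model 1: type * (education + income)
df_full     <- 89
rss_reduced <- 3791.3  # Model 2: income * type only
df_reduced  <- 91

# Partial F statistic: extra sum of squares per dropped parameter,
# divided by the residual mean square of the full model
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

The p-value comes out at about 0.0556, which lies between 0.05 and 0.10.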

(e) Fitting a model

We now fit a model to predict prestige using: income, education, type, and our interaction term “type * (education + income)” based on our answer to part (d).

lm(prestige ~ education + income + type*(education+income), data=newPrestige_cleaned)

We will now evaluate our model by checking the regression assumptions as follows:

1) Fixed x and measurement error: The data used in the analysis were collected from reliable sources, so we treat the x values as fixed. However, measurement error is possible because the data for all variables were collected at the country level. Respondents may have declined to do the task, may have been illiterate, or may have given false information. It is also possible that the person in charge of data collection changed and the occupational prestige packet was not transferred to the new person correctly.

2) Linearity: The data appear to be well modeled by a linear relationship between y and x, and the points appear to be randomly spread out about the line, with no discernible non-linear trends or changes in variability. Looking at the “Residuals vs Fitted” plot (Figure 3), we see that the red line is approximately flat. This tells us that there is no discernible non-linear trend in the residuals. Furthermore, the residuals appear to be equally variable across the entire range of fitted values. There is no indication of non-constant variance.

3)Homoscedasticity of residuals or equal variance: Ideally, residuals are randomly scattered around 0 (the horizontal line) providing a relatively even distribution. The standard linear regression assumption is that the variance is constant across the entire range. Here the points appear random and the line looks pretty flat, with no increasing or decreasing trend. So, the condition of homoscedasticity can be accepted.

4) Normality: In our Q-Q plot, most points lie close to the line, which suggests that the residuals are approximately normally distributed. Some deviation can be seen, particularly near the ends (note the upper right), but the deviations seem to be small.

5) Collinearity/multicollinearity:

##      education    income      women
## [1,] 0.8664798 0.7033094 -0.1101426

In our multiple regression model output above, education no longer displays a significant p-value (p = 0.0769). Because the model contains interaction terms, the education coefficient represents the effect of education for the baseline type (bc) while holding income constant. The correlations shown above indicate that prestige is strongly correlated with both education (0.87) and income (0.70), and education and income are themselves correlated at about 0.6. From the scatterplot matrix shown above, we can also see that prestige follows a closely aligned pattern when plotted against education and against income. So, in essence, when they are put together in the model, education is no longer significant.

We can also check this using the VIF (Variance Inflation Factor) as follows:

##   Variables      VIF
## 1    income 1.491621
## 2 education 1.491621

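A VIF of 1.49 is well below the common rule-of-thumb thresholds of 5 or 10, so the VIF results on their own do not indicate severe multicollinearity. Still, since education and income carry overlapping information about prestige and education is not significant in the model above, we prefer a more parsimonious model that includes either “income” or “education”, but not both.

Hence, we will drop education from our final model. The final model, with the interaction term (type*income), is as follows.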
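With exactly two predictors, the VIF of each is 1/(1 − r²), where r is the correlation between them. As a sketch, the reported VIF can be inverted to recover the size of that correlation; only the printed VIF value is used, and the object names are hypothetical.

```r
# VIF reported above for both income and education
vif_reported <- 1.491621

# With two predictors, VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r_squared <- 1 - 1 / vif_reported
abs_r     <- sqrt(r_squared)  # the VIF gives the magnitude of r, not its sign

# The forward direction, for checking
vif_from_r <- function(r) 1 / (1 - r^2)

print(c(r_squared = r_squared, abs_r = abs_r))
```

The implied correlation magnitude is about 0.57, in line with the value of roughly 0.6 quoted above.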

Model Evaluation:
Given that the model assumptions are satisfied, we want to determine how good our model is. Below is the summary and anova results for our model.

## 
## Call:
## lm(formula = prestige ~ income + type + type * income, data = newPrestige_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2669  -5.2956   0.3125   4.3392  25.0200 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     13.9045168  3.1671787   4.390 3.02e-05 ***
## income           0.0040235  0.0005530   7.276 1.12e-10 ***
## typewc          18.9807386  5.3421020   3.553 0.000603 ***
## typeprof        45.0190221  4.2907398  10.492  < 2e-16 ***
## income:typewc   -0.0021712  0.0009700  -2.238 0.027603 *  
## income:typeprof -0.0031783  0.0006047  -5.256 9.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.268 on 92 degrees of freedom
## Multiple R-squared:  0.8286, Adjusted R-squared:  0.8193 
## F-statistic: 88.94 on 5 and 92 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: prestige
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## income       1 14021.6 14021.6 265.471 < 2.2e-16 ***
## type         2  7988.5  3994.3  75.623 < 2.2e-16 ***
## income:type  2  1477.5   738.8  13.987 4.969e-06 ***
## Residuals   92  4859.2    52.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R squared and Adj.R squared
	r.squared	adj.r.squared
prestige vs. income	0.4946	0.4894
prestige vs. income, type	0.7765	0.7693
prestige vs. income, type,income*type	0.8286	0.8193
First, we perform the overall F-test on the model as follows.
α = 0.05
hypotheses:
H0: βincome = βtype = βincome*type = 0
Ha: at least one slope is not zero
test statistic:
Fc = (23487.6/5)/(4859.2/92)
Fc = 88.94
with p = 5 and n − (p + 1) = 98 − 6 = 92 degrees of freedom
p-value < α = 0.05
conclusion: reject the null hypothesis; the model is adequate.
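The overall F statistic can be recomputed from the sums of squares in the analysis of variance table. The sketch below uses only those printed values; the object names are hypothetical.

```r
# Sums of squares copied from the analysis of variance table above
ss_terms <- c(income = 14021.6, type = 7988.5, "income:type" = 1477.5)
ss_resid <- 4859.2

p <- 5    # number of slope coefficients
n <- 98   # observations after dropping the rows with missing type
df_resid <- n - (p + 1)

# Model sum of squares is the total over all terms
ss_model <- sum(ss_terms)

# Overall F statistic and its upper-tail p-value
f_c     <- (ss_model / p) / (ss_resid / df_resid)
p_value <- pf(f_c, p, df_resid, lower.tail = FALSE)

# R-squared follows from the same quantities
r_squared <- ss_model / (ss_model + ss_resid)

print(c(F = f_c, R2 = r_squared))
```

Up to rounding of the printed sums of squares, this gives F of about 88.94 and an R-squared of about 0.8286, matching the summary output.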

Next we perform t-test.
Testing βincome: α = 0.05
H0 : βincome=0 (Assuming type and income*type are already in the model)
Ha : βincome not =0
t-statistics t1 = 7.276 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 1.12e-10 < 0.05. So we reject the null hypothesis and conclude that income is statistically significant when type and income*type are already in the model.

Testing βtypewc: α = 0.05
H0 : βtypewc=0 (Assuming income and income*type are already in the model)
Ha : βtypewc not =0
t-statistics t1 = 3.553 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.000603 < 0.05. So we reject the null hypothesis and conclude that typewc is statistically significant when income and income*type are already in the model.

Testing βtypeprof: α = 0.05
H0 : βtypeprof=0 (Assuming income and income*type are already in the model)
Ha : βtypeprof not =0
t-statistics t1 = 10.492 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is < 2e-16, which is below 0.05. So we reject the null hypothesis and conclude that typeprof is statistically significant when income and income*type are already in the model.

Testing βincome:typewc: α = 0.05
H0 : βincome:typewc=0 (Assuming income and type are already in the model)
Ha : βincome:typewc not =0
t-statistics t1 = -2.238 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.027603 < 0.05. So we reject the null hypothesis and conclude that income:typewc is statistically significant when income and type are already in the model.

Testing βincome:typeprof: α = 0.05
H0 : βincome:typeprof=0 (Assuming income and type are already in the model)
Ha : βincome:typeprof not =0
t-statistics t1 = -5.256 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 9.48e-07 < 0.05. So we reject the null hypothesis and conclude that income:typeprof is statistically significant when income and type are already in the model.
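Each t statistic is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution with 92 degrees of freedom. As a sketch, the values can be recomputed from the printed coefficient table; the data frame `coefs` is a hypothetical container for those numbers.

```r
# Estimates and standard errors copied from the coefficient table above
coefs <- data.frame(
  term      = c("(Intercept)", "income", "typewc", "typeprof",
                "income:typewc", "income:typeprof"),
  estimate  = c(13.9045168, 0.0040235, 18.9807386, 45.0190221,
                -0.0021712, -0.0031783),
  std_error = c(3.1671787, 0.0005530, 5.3421020, 4.2907398,
                0.0009700, 0.0006047)
)

df_resid <- 98 - 6  # n - (p + 1)

# t statistic and two-sided p-value for each coefficient
coefs$t_value <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pt(-abs(coefs$t_value), df = df_resid)

print(coefs)
```

Small differences from the printed table in the last digits are expected, because the inputs here are already rounded.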

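Final regression model: $\hat y = 13.9045168 + 0.0040235\,\text{income} + 18.9807386\,\text{typewc} + 45.0190221\,\text{typeprof} - 0.0021712\,(\text{income} \times \text{typewc}) - 0.0031783\,(\text{income} \times \text{typeprof})$, where income is measured in dollars.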
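To make the interaction concrete, the fitted equation can be written as a small function with a separate income slope for each type. This is a sketch built from the printed coefficients; `predict_prestige` is a hypothetical helper, not a function used in the analysis.

```r
# Prediction from the fitted equation; "bc" is the reference level of type
predict_prestige <- function(income, type = c("bc", "wc", "prof")) {
  type    <- match.arg(type)
  is_wc   <- as.numeric(type == "wc")
  is_prof <- as.numeric(type == "prof")
  13.9045168 + 0.0040235 * income +
    18.9807386 * is_wc + 45.0190221 * is_prof -
    0.0021712 * income * is_wc - 0.0031783 * income * is_prof
}

# Implied income slope (prestige points per dollar) within each type
slopes <- c(bc   = 0.0040235,
            wc   = 0.0040235 - 0.0021712,
            prof = 0.0040235 - 0.0031783)

print(slopes)
print(predict_prestige(10000, "prof"))
```

The slopes show that an extra dollar of income is associated with the largest gain in prestige for blue-collar occupations and the smallest for professional ones.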

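From our answer to part (c), the rows with missing type are dropped, so only the type levels “wc” and “prof” appear as indicators in our analysis, with “bc” as the reference level.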

(f) Histogram of income and a histogram of log(income)

The log transformation of income decreased the variability of the data and made the data conform more closely to the normal distribution.

(g) Fit the model in (e) but this time use log(income) (i.e., natural logarithm) instead of income.

## 
## Call:
## lm(formula = prestige ~ log(income) + type + type * log(income), 
##     data = newPrestige_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9201  -5.2785  -0.1848   5.5620  24.3118 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -124.353     23.204  -5.359 6.16e-07 ***
## log(income)            18.782      2.723   6.898 6.55e-10 ***
## typewc                 83.702     41.469   2.018  0.04646 *  
## typeprof              102.630     35.688   2.876  0.00501 ** 
## log(income):typewc     -8.979      4.889  -1.837  0.06947 .  
## log(income):typeprof   -9.001      4.020  -2.239  0.02756 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.404 on 92 degrees of freedom
## Multiple R-squared:  0.8221, Adjusted R-squared:  0.8124 
## F-statistic: 85.02 on 5 and 92 DF,  p-value: < 2.2e-16

Now, let’s evaluate our model with log income by checking the regression assumptions as follows:

Normality: Compared to our previous model, the points in the Q-Q plot for the model with log income follow the diagonal line more closely, although the tails are a little heavier at the ends (note the upper right and lower left). The deviations seem to be small.

## 21 82 31 
## 96 97 98

collinearity/multicollinearity:

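Since log(income) is the only quantitative predictor in this model (the other terms are the categorical variable type and its interaction with log(income)), we won’t check multicollinearity.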

Model Evaluation:
Given that the model assumptions are satisfied, we want to determine how good our model is. The model summary is the one shown at the start of this part; the anova results are given below.

## Analysis of Variance Table
## 
## Response: prestige
##                  Df  Sum Sq Mean Sq  F value Pr(>F)    
## log(income)       1 15998.5 15998.5 291.8325 <2e-16 ***
## type              2  6967.2  3483.6  63.5449 <2e-16 ***
## log(income):type  2   337.8   168.9   3.0806 0.0507 .  
## Residuals        92  5043.5    54.8                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R squared and Adj.R squared
	r.squared	adj.r.squared
prestige vs. log(income)	0.5644	0.5598
prestige vs. log(income), type	0.8102	0.8041
prestige vs. log(income), type, log(income)*type	0.8221	0.8124

First, we perform the overall F-test on the model as follows.
α = 0.05
hypotheses:
H0: βlog(income) = βtype = βlog(income)*type = 0
Ha: at least one slope is not zero
test statistic:
Fc = (23303.5/5)/(5043.5/92)
Fc = 85.02
with p = 5 and n − (p + 1) = 98 − 6 = 92 degrees of freedom
p-value < α = 0.05
conclusion: reject the null hypothesis; the model is adequate.

Next we perform t-test.
Testing βlog(income): α = 0.05
H0 : βlog(income)=0 (Assuming type and log(income)*type are already in the model)
Ha : βlog(income) not =0
t-statistics t1 = 6.898 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 6.55e-10 < 0.05. So we reject the null hypothesis and conclude that log(income) is statistically significant when type and log(income)*type are already in the model.

Testing βtypewc: α = 0.05
H0 : βtypewc=0 (Assuming log(income) and log(income)*type are already in the model)
Ha : βtypewc not =0
t-statistics t1 = 2.018 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.04646 < 0.05. So we reject the null hypothesis and conclude that typewc is statistically significant when log(income) and log(income)*type are already in the model.

Testing βtypeprof: α = 0.05
H0 : βtypeprof=0 (Assuming log(income) and log(income)*type are already in the model)
Ha : βtypeprof not =0
t-statistics t1 = 2.876 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.00501 < 0.05. So we reject the null hypothesis and conclude that typeprof is statistically significant when log(income) and log(income)*type are already in the model.

Testing βlog(income):typewc: α = 0.05
H0 : βlog(income):typewc=0 (Assuming log(income) and type are already in the model)
Ha : βlog(income):typewc not =0
t-statistics t1 = -1.837 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.06947 > 0.05. So we cannot reject the null hypothesis, and we conclude that log(income):typewc is not statistically significant when log(income) and type are already in the model.

Testing βlog(income):typeprof: α = 0.05
H0 : βlog(income):typeprof=0 (Assuming log(income) and type are already in the model)
Ha : βlog(income):typeprof not =0
t-statistics t1 =-2.239 which has t distribution with n − (p + 1) = 98 − 6 = 92 degrees of freedom.
Here the p-value is 0.02756 < 0.05. So we reject the null hypothesis and conclude that log(income):typeprof is statistically significant when log(income) and type are already in the model.

Final regression model:

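$\hat y = -124.3525838 + 18.7820949\,\log(\text{income}) + 83.7020124\,\text{typewc} + 102.6304773\,\text{typeprof} - 8.9793239\,(\log(\text{income}) \times \text{typewc}) - 9.0010905\,(\log(\text{income}) \times \text{typeprof})$, where income is measured in dollars and log is the natural logarithm.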
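Because income enters on the natural log scale, a coefficient is easier to read as the change in predicted prestige when income is multiplied by a constant factor. The sketch below works out the effect of doubling income within each type from the printed coefficients; the object names are hypothetical.

```r
# Coefficients copied from the log(income) model summary above
b_log_income  <- 18.7820949
b_interaction <- c(bc = 0, wc = -8.9793239, prof = -9.0010905)

# Slope on log(income) within each type ("bc" is the reference level)
slope_by_type <- b_log_income + b_interaction

# Doubling income adds log(2) to log(income), so the predicted change
# in prestige is slope * log(2)
effect_doubling <- slope_by_type * log(2)

print(round(effect_doubling, 2))
```

Under this model, doubling income is associated with roughly 13 more prestige points for blue-collar occupations and a little under 7 for the other two types.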

(h) Is the model in (e) or (g) better? Justify your answer. Why can’t we use a partial F-test here?

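We cannot compare the two models using a partial F-test because they are not nested: model (g) uses log(income) in place of income, so neither model is a special case of the other. A partial F-test applies only when one model is nested within the other.

Instead, we can compare both models using RMSE.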

## RMSE for model(e) 7.041597

##   RMSE for model(g) 7.173863

By comparing the RMSE of model (e) (RMSE = 7.041597) and model (g) (RMSE = 7.173863), we can say that model (e) is better, since its RMSE is lower.
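The report does not show how the RMSE was computed. One definition that reproduces the reported values is the square root of the residual sum of squares divided by the number of observations, using the residual sums of squares from the two anova tables; the sketch below assumes that definition.

```r
# Residual sums of squares from the anova tables of the two models
rss <- c(model_e = 4859.2, model_g = 5043.5)
n   <- 98  # observations used in both fits

# RMSE as sqrt(RSS / n); assumption: this is the definition used in the report
rmse <- sqrt(rss / n)

print(round(rmse, 4))
```

Note that this differs from the residual standard error in the summaries (7.268 and 7.404), which divides by the residual degrees of freedom (92) rather than by n.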

There is another way to compare two non-nested models in R: we can check the BIC (Bayesian Information Criterion).

## BIC for model(e) 692.7664

## and BIC for model(g) 696.4138

Since the BIC for model (e) is lower, this model is preferable.
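For a linear model with normal errors, the BIC can be written in terms of the residual sum of squares, which makes it possible to check the reported values by hand. The sketch below uses the residual sums of squares from the anova tables; `bic_from_rss` is a hypothetical helper.

```r
# BIC of a Gaussian linear model from its residual sum of squares.
# The parameter count is the number of regression coefficients plus one
# for the error variance.
bic_from_rss <- function(rss, n, n_coef) {
  k <- n_coef + 1
  n * (log(2 * pi) + log(rss / n) + 1) + k * log(n)
}

# Both models have 6 coefficients and were fitted to the same 98 rows
bic <- c(model_e = bic_from_rss(4859.2, n = 98, n_coef = 6),
         model_g = bic_from_rss(5043.5, n = 98, n_coef = 6))

print(round(bic, 2))
```

Since both models have the same number of parameters and the same response, the BIC ranking here is driven entirely by the residual sums of squares, so it agrees with the RMSE comparison.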

Regression Model for the Prestige Level of Occupations.

Rohini Mandge

3/26/2018