PhD assignment

Adnan Ovcina

Model specification

The model has two predictors: the number of hours spent exercising for the exam (exercise), which is a continuous variable, and whether the student regularly attended lectures (attend). The variable attend is coded as a dummy variable, where 0 means that the student did not attend lectures and 1 means that the student attended lectures. The outcome variable is the exam score (score). The overall F test shows that the model fits the data well, F(2, 97) = 235.3, p < .001. Furthermore, adjusted R2 = 0.83 shows that the model explains a large proportion of the variance in the outcome variable (83%).

## 
## Call:
## lm(formula = score ~ exercise + attend, data = regression_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4528  -2.7730   0.4746   2.7135   7.8428 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.58155    1.32591  -3.455 0.000817 ***
## exercise     1.28817    0.08027  16.049  < 2e-16 ***
## attendYes   10.99254    0.74681  14.719  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.697 on 97 degrees of freedom
## Multiple R-squared:  0.8291, Adjusted R-squared:  0.8256 
## F-statistic: 235.3 on 2 and 97 DF,  p-value: < 2.2e-16
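
As a quick consistency check, the adjusted R-squared and the F-statistic can be recomputed from the multiple R-squared, the sample size and the number of predictors. The Python sketch below is not part of the original R analysis. Its variable names are illustrative and its inputs are copied from the rounded output above, so the results agree only up to rounding.

```python
# Inputs copied from the lm() summary above.
n = 100       # observations: 97 residual df + 3 estimated coefficients
k = 2         # predictors: exercise and attend
r2 = 0.8291   # multiple R-squared (already rounded in the output)

df_resid = n - k - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(f"adjusted R2 = {adj_r2:.4f}")
print(f"F({k}, {df_resid}) = {f_stat:.1f}")
```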

term	estimate	std.error	statistic	p.value	Beta
(Intercept)	-4.581552	1.32591023	-3.455401	<0.001	NA
Exercise	1.288170	0.08026669	16.048620	<0.001	0.6736443
Attend	10.992541	0.74681296	14.719269	<0.001	0.6178445

## 
##  studentized Breusch-Pagan test
## 
## data:  reg_model
## BP = 0.82343, df = 2, p-value = 0.6625
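
The p-value of the Breusch-Pagan test can be checked by hand, because for a chi-square distribution with 2 degrees of freedom the upper-tail probability has the closed form exp(-x / 2). The Python sketch below (not part of the original R analysis, names illustrative) applies that formula to the reported statistic.

```python
import math

bp = 0.82343  # studentized Breusch-Pagan statistic from the output above

# Chi-square with df = 2: P(X > x) = exp(-x / 2).
# This shortcut holds only for df = 2, which is the case here.
p_value = math.exp(-bp / 2)

print(f"p = {p_value:.4f}")
```

A p-value this large gives no evidence against constant residual variance.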

The table below shows descriptive statistics for the data. It can be seen that there are 100 scores in the sample (N = 100) and that there are no missing values.

	N	Mean	Min	SD	Max
ID	100	50.50	1	29.0114920	100
Score	100	19.88	0	8.8526433	40
Exercise	100	15.32	1	4.6294621	26
Attend	100	0.43	0	0.4975699	1

Considering that there are two independent (predictor) variables in the dataset and that the desirable sample size is at least 15 observations per predictor (2 x 15 = 30), we can conclude that the sample of N = 100 is adequate for running a regression analysis.

There are no cases with a Cook's distance higher than 1. The highest Cook's distance is .147, hence we can conclude that there are no influential outliers that could distort the regression model fitted to the data.

score	exercise	attend	Cooksd
14	23	0	0.1473
1	14	0	0.0722
4	1	1	0.0514
31	13	1	0.0406
33	24	1	0.0305
11	7	1	0.0298

The mean variance inflation factor (VIF) is 1, hence we can conclude that there is no multicollinearity in our data. This finding is further supported by the correlation coefficient between the predictors, which is far below 0.9, the value considered to be the threshold for multicollinearity.

## exercise   attend 
##  1.00006  1.00006
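
With exactly two predictors, the VIF is 1 / (1 - r^2), where r is the correlation between the predictors, so the size of that correlation can be recovered from the VIF. The Python sketch below is not part of the original R analysis and its names are illustrative. It recovers only the absolute value of r, since the sign cannot be determined from the VIF.

```python
import math

vif = 1.00006  # VIF reported above for both exercise and attend

# Two-predictor case: VIF = 1 / (1 - r^2)  =>  r^2 = 1 - 1 / VIF
r_squared = 1 - 1 / vif
abs_r = math.sqrt(r_squared)

print(f"|r| between predictors = {abs_r:.4f}")
```

The implied correlation is below .01 in absolute value, far from the 0.9 threshold.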

From the augmented component-plus-residual (acpr) plot it can be concluded that the assumption of linearity is satisfied. The residuals follow a linear trend and there are no outliers that could exert substantial influence on the fitted model.

The assumption of homoscedasticity is satisfied. In the plot of residuals against fitted values the residuals are roughly equally scattered and do not form any pattern, hence we can conclude that the assumption of homoscedasticity is not violated in our model: at each level of the predictors the variance of the residuals is constant. The studentized Breusch-Pagan test supports this conclusion, BP(2) = 0.82, p = .66.

The Durbin-Watson test for independence of errors gives d(3,100) = 1.88, p = .528. Errors are considered independent if the Durbin-Watson statistic is close to 2, hence the value of 1.88 shows that the assumption of independence of errors is satisfied.

##  lag Autocorrelation D-W Statistic p-value
##    1        0.059744      1.879085   0.528
##  Alternative hypothesis: rho != 0
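
The rule that values close to 2 indicate independent errors follows from the approximation d ≈ 2(1 - r), where r is the lag-1 autocorrelation of the residuals. The Python sketch below (not part of the original R analysis, names illustrative) applies this approximation to the reported autocorrelation and compares it with the reported statistic.

```python
r1 = 0.059744           # lag-1 autocorrelation from the output above
dw_reported = 1.879085  # Durbin-Watson statistic from the output above

# Large-sample approximation: d is roughly 2 * (1 - r1).
# r1 = 0 gives d = 2, which is why 2 is the reference value.
dw_approx = 2 * (1 - r1)

print(f"approximate d = {dw_approx:.3f}, reported d = {dw_reported:.3f}")
```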

The kernel density estimate graph shows that the distribution of the residuals deviates only slightly from the normal density curve. The result of the Shapiro-Wilk test, W = .96, p = .006, indicates that the residuals are not normally distributed. However, the Shapiro-Wilk test is sensitive to sample size: when the sample is large enough, even small deviations from normality are statistically significant. From the plot one could argue that the residual distribution and the normal distribution do not differ substantially. Also, if we take into account the sample size of n = 100, non-normality of the residuals will not have a substantial effect on the estimation of the regression coefficients. With smaller samples, a marked violation of the assumption of normality of residuals has a negative influence on the model estimators. Most importantly, violation of this assumption influences the estimation of the standard errors, and hence all other test statistics for which the standard error is used. In small samples the results of the Shapiro-Wilk test are also more reliable.

## 
##  Shapiro-Wilk normality test
## 
## data:  regression_data$stud_res
## W = 0.96248, p-value = 0.006065

Respecification of the model

The analysis of the regression assumptions did not show a need for respecification of the model.

Model functional form

The model is found to fit the data well. The result of the Ramsey RESET test, RESET(2, 95) = 0.24, p = .786, shows that the functional form of the model is well specified. The null hypothesis of the Ramsey test is that the model has no omitted variables.

## 
##  RESET test
## 
## data:  reg_model
## RESET = 0.24125, df1 = 2, df2 = 95, p-value = 0.7861

Interpretation of coefficients

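The number of hours spent exercising and lecture attendance are both statistically significant predictors of the exam score: for hours spent exercising, b = 1.29, SE = .08, p < .001, 95% CI [1.13, 1.45], and for lecture attendance, b = 10.99, SE = .75, p < .001, 95% CI [9.51, 12.47]. Each additional hour of exercising is associated with 1.29 more points on the exam when attendance is held constant, and students who attended lectures score 10.99 points more than students who did not attend but exercised for the same number of hours.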
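
The confidence intervals can be approximately reproduced from the estimates and standard errors as estimate ± t × SE. The Python sketch below is not part of the original R analysis. The critical value 1.9847 is an assumption taken from a t table (two-sided 95%, 97 degrees of freedom), and the function and variable names are illustrative.

```python
# Assumed critical value: two-sided 95% t quantile for 97 residual df.
T_CRIT = 1.9847

# (estimate, standard error) copied from the coefficient table above.
coefs = {
    "exercise": (1.28817, 0.08027),
    "attend": (10.99254, 0.74681),
}

def conf_int(estimate, se, t_crit=T_CRIT):
    """Return the (lower, upper) bounds of estimate +/- t_crit * se."""
    half_width = t_crit * se
    return estimate - half_width, estimate + half_width

intervals = {name: conf_int(b, se) for name, (b, se) in coefs.items()}

for name, (lower, upper) in intervals.items():
    print(f"{name}: 95% CI [{lower:.2f}, {upper:.2f}]")
```

Neither interval contains zero, which is consistent with both predictors being significant at the .05 level.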

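From the unstandardized coefficients we can conclude that lecture attendance has a larger impact on the exam score than a single hour of exercising: attending lectures is associated with about 11 additional points, which corresponds to roughly 8.5 additional hours of exercising (10.99 / 1.29). The standardized coefficients of the two predictors, however, are of similar size (Beta = 0.67 for exercise and Beta = 0.62 for attendance).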
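
Because the two predictors are measured on different scales, they can also be compared through standardized coefficients, Beta = b × SD(x) / SD(y). The Python sketch below is not part of the original R analysis and its names are illustrative. It recomputes the Beta column of the coefficient table from the unstandardized estimates and the standard deviations in the descriptive statistics table.

```python
sd_score = 8.8526433  # SD of the outcome (Score) from the descriptive table

# (unstandardized estimate, SD of predictor) copied from the tables above.
predictors = {
    "exercise": (1.288170, 4.6294621),
    "attend": (10.992541, 0.4975699),
}

# Standardized coefficient: expected change in SDs of score
# per one-SD change in the predictor.
betas = {name: b * sd_x / sd_score for name, (b, sd_x) in predictors.items()}

for name, beta in betas.items():
    print(f"{name}: Beta = {beta:.4f}")
```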

Below is the calculation of the score predicted by the model for a student who has spent 20 hours exercising and has attended lectures. It can be seen that such a student is expected to score approximately 32 points on the exam.

Score = -4.58 + 20 × 1.29 + 1 × 11 + error
Score = -4.58 + 25.8 + 11 + error
Score = 32.22 + error
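
The same calculation can be written as a small prediction function. The Python sketch below is not part of the original R analysis, and the function name and constants are illustrative. It uses the unrounded coefficients from the model output, so it gives 32.17 rather than the 32.22 obtained with rounded coefficients.

```python
# Coefficients copied from the lm() summary (unrounded).
INTERCEPT = -4.58155
B_EXERCISE = 1.28817
B_ATTEND = 10.99254

def predict_score(hours, attended):
    """Predicted exam score for given exercise hours and attendance."""
    attend = 1 if attended else 0  # dummy coding: 1 = attended, 0 = did not
    return INTERCEPT + B_EXERCISE * hours + B_ATTEND * attend

with_lectures = predict_score(20, True)
without_lectures = predict_score(20, False)

print(f"20 hours, attended:     {with_lectures:.2f}")
print(f"20 hours, not attended: {without_lectures:.2f}")
```

The difference between the two predictions equals the attend coefficient, because the model has no interaction term.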

The graph below shows our model with the dummy variable. Two regression lines are fitted to the relation between exam scores and the number of hours spent exercising, one for students who attended lectures and one for those who did not.