MATH1324 Introduction to Statistics Assignment 3

Simple Linear Regression Model- Can happiness be predicted?

Shel Nee Gan (s3746473) Haojun Xu (s3685256) Tianbao Jin (s3696594)

Last updated: 28 October, 2018

RPubs link information

RPubs link (see here)
http://rpubs.com/sngan/434019

Introduction

This report conducts a statistical investigation using a linear regression model.
Two variables are selected from the data to investigate their relationship.
RStudio is used to fit a simple linear regression to the dataset.
Statistical hypotheses about the model components are tested.
The assumptions of the linear regression model are checked.
The findings are then discussed.

Problem Statement

Question: Can a country's GDP per capita be used to predict the happiness of its people?

Method:

Apply a linear regression model to examine the relationship between the two selected variables.
Test hypotheses for the overall linear regression model and its parameters (\(\alpha\) and \(\beta\)).
Check the assumptions of independence, linearity, normality of residuals and homoscedasticity.
Apply the Pearson correlation coefficient to measure the strength of the linear relationship.

Data

#load magrittr, which provides the %>% pipe used in later slides
library(magrittr)

#read data
World_Happiness_Report<-read.csv("2017.csv")

The data is taken from the website https://www.kaggle.com/unsdsn/world-happiness.
The dataset consists of 155 observations (representing 155 countries) and 12 variables.
The variables Economy (GDP), Family, Health, Freedom, Generosity, Trust and Dystopia Residual are collected from the population (by survey) and used to determine a country’s happiness score.
The happiness score data was taken from the Gallup World Poll.

Data Cont.

#Subset the data
happiness<- World_Happiness_Report[,c(1,3,6)]

#changing column names
colnames(happiness)[c(2,3)]<- c("Happiness Score", "Economy")

Variables Used in Data:

Happiness Score: How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?
Economy: GDP per capita
The columns Country, Happiness.Score and Economy..GDP.per.Capita. are subset into a new data frame, and the last two are renamed (Happiness Score, Economy) to make the data easier to read.
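
Selecting columns by position, as in c(1,3,6), depends on the column order of the CSV staying the same. The sketch below shows the same subset done by name. The data frame `toy` and all its values are invented for illustration; only the three column names come from the description above, and the renamed columns use underscores (an assumption, not the names used in this report) so that no backticks are needed.

```r
# Hypothetical stand-in for the CSV: the values are made up, the column
# names follow the ones listed above.
toy <- data.frame(
  Country = c("A", "B", "C"),
  Happiness.Rank = 1:3,
  Happiness.Score = c(7.5, 6.0, 4.2),
  Economy..GDP.per.Capita. = c(1.6, 1.1, 0.4)
)

# Select the three columns by name instead of by position
keep <- c("Country", "Happiness.Score", "Economy..GDP.per.Capita.")
happiness_toy <- toy[, keep]

# Rename to syntactic names so later code can refer to them without backticks
names(happiness_toy) <- c("Country", "Happiness_Score", "Economy")

str(happiness_toy)
```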

Descriptive Statistics and Visualisation

Use a scatter plot to visualise the relationship between Economy and Happiness Score.

plot(`Happiness Score` ~ Economy, data = happiness, xlab = "Economy", ylab = "Happiness Score")

As the scatter plot shows, the Happiness Score increases as Economy increases, so the relationship is positive.
Had the Happiness Score decreased with increasing values of Economy, the relationship would have been negative.

Descriptive Statistics Cont.

#Use lm() function to fit the linear regression model
happiness_model<- lm(happiness$`Happiness Score`~happiness$Economy, data = happiness)
happiness_model %>% summary()

## 
## Call:
## lm(formula = happiness$`Happiness Score` ~ happiness$Economy, 
##     data = happiness)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.88807 -0.45200 -0.05328  0.49425  1.89833 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.2032     0.1356   23.62   <2e-16 ***
## happiness$Economy   2.1842     0.1267   17.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6617 on 153 degrees of freedom
## Multiple R-squared:  0.6601, Adjusted R-squared:  0.6579 
## F-statistic: 297.1 on 1 and 153 DF,  p-value: < 2.2e-16

The line of best fit is Happiness Score = 3.203 + 2.184 × Economy
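
As a small illustration of how the fitted line is used, the sketch below turns it into a function. The coefficients are copied from the summary() output; the function name `predict_happiness` and the Economy values passed to it are hypothetical and not taken from the dataset.

```r
# Coefficients copied from the summary() output above
intercept <- 3.2032
slope <- 2.1842

# Fitted happiness score for a given Economy value
predict_happiness <- function(economy) {
  intercept + slope * economy
}

# Illustrative Economy values, not taken from the dataset
predict_happiness(c(0.5, 1.0, 1.5))
```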

Hypothesis Testing for the overall linear regression model

\[H_0: \text{The data do not fit the linear regression model}\]
\[H_A: \text{The data fit the linear regression model}\]

#calculate the p-value
pf(q = 297.1, df1 = 1, df2 = 153, lower.tail = FALSE)

## [1] 1.117922e-37

p< 0.001
Reject \(H_0\) as the p-value is less than the 0.05 level of significance.
There was statistically significant evidence that the data fit a linear regression model.
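
In simple linear regression the model F statistic equals the square of the slope's t statistic, so the overall test and the slope test agree. The sketch below checks this using the rounded values printed by summary(); because the inputs are rounded, the match is approximate. The variable names are illustrative.

```r
t_slope <- 17.24   # t value for the slope, from the summary() output
df_resid <- 153    # residual degrees of freedom

# With a single predictor, F = t^2 (compare with the reported F of 297.1)
f_from_t <- t_slope^2
f_from_t

# Upper-tail F p-value and two-sided t p-value describe the same test
p_from_f <- pf(f_from_t, df1 = 1, df2 = df_resid, lower.tail = FALSE)
p_from_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)
c(p_from_f = p_from_f, p_from_t = p_from_t)
```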

Hypothesis Testing - Model parameters (\(\alpha\)):

\[H_0:\alpha = 0 \] \[H_A: \alpha \ne 0\]

This hypothesis is tested using a t statistic, reported as t = 23.62, p < 0.001.
There was statistically significant evidence that the constant is not 0.
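
The reported t statistics can be reproduced by dividing each estimate by its standard error. The sketch below does this with the rounded values from the summary() output, so small rounding differences are expected; the vector names are illustrative.

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = 3.2032, slope = 2.1842)
se  <- c(intercept = 0.1356, slope = 0.1267)
df_resid <- 153

# t statistic for H0: parameter = 0
t_stat <- est / se

# Two-sided p-values from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

round(t_stat, 2)
p_val
```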

To confirm the decision to reject \(H_0\) at the 0.05 level, we calculate the 95% CI for \(\alpha\) using the confint() function:

happiness_model %>% confint()

##                      2.5 %   97.5 %
## (Intercept)       2.935283 3.471143
## happiness$Economy 1.933859 2.434511

The 95% CI for \(\alpha\) is [2.935, 3.471].
\(H_0\): \(\alpha\) = 0 is clearly not captured by this interval, therefore it was rejected.
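
The intervals returned by confint() follow the usual form estimate ± critical t × standard error. The sketch below rebuilds both rows from the rounded summary() values, so it should agree with the output above to about three decimal places; the object names are illustrative.

```r
# Estimates and standard errors copied from the summary() output
est <- c(intercept = 3.2032, slope = 2.1842)
se  <- c(intercept = 0.1356, slope = 0.1267)

# Critical t value for a 95% interval with 153 residual degrees of freedom
t_crit <- qt(0.975, df = 153)

# Lower and upper limits for each parameter
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 3)
```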

Hypothesis Testing - Model parameters (\(\beta\)):

The slope of the regression line was reported as \(\beta\) = 2.18. This means that a one-unit increase in Economy (GDP per capita) is associated with an average increase in happiness score of 2.18 units. This is a positive change.

\[H_0: \beta = 0 \] \[H_A: \beta \ne 0 \]

The slope is tested using a t statistic, reported as t = 17.24, p < 0.001.
There was statistically significant evidence that \(\beta\) is not 0.
The confint() function used in the previous slide shows the 95% CI for the slope to be [1.934, 2.435].
The 95% CI does not capture \(H_0\): \(\beta\) = 0, so it is rejected.
There is a statistically significant positive relationship between GDP and happiness score.

Assumptions

Independence

Independence is assessed through the research design: each observation is a different country.

Linearity

We checked linearity at the beginning of the report using the scatter plot.

The scatter plot shows a positive relationship: as GDP increases, so do happiness scores.

Normality of residuals and Homoscedasticity

The plot() function is used in the following slides to obtain a series of diagnostic plots for the fitted regression model.

Testing Assumptions- Residuals vs. Fitted

The relationship between fitted values and residuals is flat, which is a good indication that the relationship is linear.
The variability on the y axis is constant across the range of values on the x axis and there is no distinct pattern in variability; this is a sign of homoscedasticity (constant variance).

happiness_model %>% plot(which =1)

Testing Assumptions- Normal Q-Q

The residuals fall close to the line and there are no major deviations from normality.
So it is safe to assume the residuals are approximately normally distributed.

happiness_model %>% plot(which =2)
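
A Q-Q plot is judged by eye, so a numeric check can complement it. The sketch below applies the Shapiro-Wilk test, which is not used in this report, to the residuals of a model fitted to simulated data. The seed, the Economy range and the coefficients are assumptions chosen only to resemble the fitted model; with the real data the same call would be made on `residuals(happiness_model)`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the real data (all values are assumptions)
n <- 155
economy_sim <- runif(n, min = 0, max = 1.9)
score_sim <- 3.2 + 2.2 * economy_sim + rnorm(n, sd = 0.66)

sim_model <- lm(score_sim ~ economy_sim)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed,
# so a large p-value gives no evidence against normality
sw <- shapiro.test(residuals(sim_model))
sw$p.value
```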

Testing Assumptions- Scale-Location

The red line in the plot below is close to flat and the variance in the square root of the standardised residuals is consistent across fitted values.
Therefore, this is a sign of homoscedasticity.

happiness_model %>% plot(which =3)

Testing Assumptions- Residuals vs. Leverage

No values fall in the upper or lower right-hand side of the plot beyond the red bands, which suggests there are no influential cases.
In fact, the bands are not even visible in the plot below.

happiness_model %>% plot(which =5)
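
The red bands in this plot are Cook's distance contours, drawn by default at 0.5 and 1. The same check can be made numerically with cooks.distance(). The sketch below uses simulated data as a stand-in (seed, range and coefficients are assumptions); with the real data the calls would be made on `happiness_model`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the real data (all values are assumptions)
n <- 155
economy_sim <- runif(n, min = 0, max = 1.9)
score_sim <- 3.2 + 2.2 * economy_sim + rnorm(n, sd = 0.66)
sim_model <- lm(score_sim ~ economy_sim)

# Cook's distance and leverage for every observation
cd <- cooks.distance(sim_model)
lev <- hatvalues(sim_model)

# Observations beyond the first contour would be listed here
which(cd > 0.5)

# Largest Cook's distance in the sample
max(cd)
```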

Linear Regression- \(R^2\)

\(R^2\) = 0.66.
66% of the variability in the happiness score can be explained by a linear relationship with economy.

r <- cor(happiness$`Happiness Score`,happiness$Economy)
r

## [1] 0.8124688

The correlation between Happiness Score and GDP is 0.81, which indicates a strong positive correlation.
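
With a single predictor, \(R^2\) is the square of the Pearson correlation, and the slope's t statistic can be recovered from r and the sample size. The sketch below checks both using the values reported in the output above; the variable names are illustrative.

```r
r <- 0.8124688  # Pearson correlation from cor()
n <- 155        # number of countries

# R-squared of the simple regression (compare with 0.6601 from summary())
r_squared <- r^2

# t statistic for H0: correlation = 0, on n - 2 degrees of freedom
# (compare with the slope t value of 17.24)
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

c(r_squared = r_squared, t_from_r = t_from_r)
```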

Interpretation

Summary:

Independence was assumed and linearity was checked with the scatter plot; normality of residuals OK, homoscedasticity OK, no influential cases.
\(r\) = 0.81, \(R^2\) = 0.66
Model ANOVA, \(F\)(1,153) = 297.1, \(p\) < 0.001
\(\alpha\) = 3.203, \(p\) < 0.001, 95% CI [2.935, 3.471]
\(\beta\) = 2.184, \(p\) < 0.001, 95% CI [1.934, 2.435]

Decision:

Overall model: Reject \(H_0\)
Intercept: Reject \(H_0\)
Slope: Reject \(H_0\)
Happiness Score = 3.203 + 2.184 × Economy

Discussion

There was a statistically significant positive linear relationship between Happiness Score and Economy.
Economy was estimated to explain 66% of the variability in Happiness Score.

\[\text{Happiness Score} = 3.203 + 2.184 \times \text{Economy}\]

Strengths: The two variables chosen are well suited to a simple linear regression model.
Limitation: Only one variable (Economy) from the dataset is used to investigate whether the happiness score of each country can be determined.

Directions for future investigations

The simple regression model should be refitted in 2-3 years for comparison.

We recommend using the happiness score to assess the happiness level within a nation rather than to compare one nation with another.

Messages for audiences

Use the estimated linear regression model to predict the level of happiness of the countries that are not included in this report and compare the predicted happiness score with the actual happiness score.
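
One way to carry out this comparison is predict() with a prediction interval, which shows how far an actual score may reasonably fall from the predicted one. The sketch below uses simulated data as a stand-in for the real dataset: the seed, the coefficients, the new country's Economy value and its "actual" score are all hypothetical. The model is fitted with plain column names in the formula, because predict() needs them to match the columns of `newdata`.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the real data (all values are assumptions)
sim <- data.frame(economy = runif(155, min = 0, max = 1.9))
sim$score <- 3.2 + 2.2 * sim$economy + rnorm(155, sd = 0.66)

sim_model <- lm(score ~ economy, data = sim)

# A hypothetical country that is not in the data
new_country <- data.frame(economy = 1.0)

# Point prediction with a 95% prediction interval
pred <- predict(sim_model, newdata = new_country,
                interval = "prediction", level = 0.95)
pred

# Compare with a hypothetical actual score for that country
actual_score <- 5.5
actual_score - pred[1, "fit"]
```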