Shel Nee Gan (s3746473) Haojun Xu (s3685256) Tianbao Jin (s3696594)
Last updated: 28 October, 2018
RPubs link (see here)
This report conducts a statistical investigation using linear regression models.
Two variables are selected from the data to investigate their relationship.
RStudio is used to fit a simple linear regression to the dataset.
Statistical hypotheses about the model components are tested.
The assumptions of the linear regression model are checked.
The findings are then discussed.
Question: Can a country's GDP per capita be used to predict the happiness of its people?
Method:
Apply a linear regression model to examine the relationship between the two selected variables.
Test hypotheses for the overall linear regression model and for its parameters (\(\alpha\) and \(\beta\)).
Check the assumptions of independence, linearity, normality of residuals and homoscedasticity.
Use the Pearson correlation coefficient to measure the strength of the linear relationship.
#load magrittr for the %>% pipe used in later code
library(magrittr)
#read data
World_Happiness_Report<-read.csv("2017.csv")
The data is taken from the website https://www.kaggle.com/unsdsn/world-happiness.
The dataset consists of 155 observations (one per country) and 12 variables.
The variables Economy (GDP), Family, Health, Freedom, Generosity, Trust and Dystopia Residual are taken from the population (by survey) to determine a country’s happiness score.
The happiness score data was taken from the Gallup World Poll.
#Subset the data
happiness<- World_Happiness_Report[,c(1,3,6)]
#changing column names
colnames(happiness)[c(2,3)]<- c("Happiness Score", "Economy")
Variables Used in Data:
Happiness Score: How would you rate your happiness on a scale of 0 to 10, where 10 is the happiest?
Economy: GDP per capita
Country, Happiness.Score and Economy..GDP.per.Capita. are subset into a new data frame, and the last two columns are renamed (Happiness Score, Economy) to make the data easier to read.
A scatter plot is used to visualise the relationship between Economy and Happiness Score.
plot(`Happiness Score` ~ Economy, data = happiness, xlab = "Economy", ylab = "Happiness Score")
The scatter plot shows that as Economy increases, Happiness Score also increases, so the relationship is positive.
Had Happiness Score decreased as Economy increased, the relationship would have been negative.
#Use lm() function to fit the linear regression model
happiness_model<- lm(happiness$`Happiness Score`~happiness$Economy, data = happiness)
happiness_model %>% summary()
##
## Call:
## lm(formula = happiness$`Happiness Score` ~ happiness$Economy,
## data = happiness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88807 -0.45200 -0.05328 0.49425 1.89833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2032 0.1356 23.62 <2e-16 ***
## happiness$Economy 2.1842 0.1267 17.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6617 on 153 degrees of freedom
## Multiple R-squared: 0.6601, Adjusted R-squared: 0.6579
## F-statistic: 297.1 on 1 and 153 DF, p-value: < 2.2e-16
The line of best fit is Happiness Score = 3.203 + 2.184 x Economy
\[H_0: \text{The data do not fit the linear regression model}\] \[H_A: \text{The data fit the linear regression model}\]
#calculate the p-value
pf( q =297.1,1,153,lower.tail = FALSE)
## [1] 1.117922e-37
p< 0.001
Reject \(H_0\) as the p-value is less than the 0.05 level of significance.
There was statistically significant evidence that the data fit a linear regression model.
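As a cross-check on the overall model test, the F statistic can be recovered from other quantities in the summary() output. The sketch below uses only numbers printed in that output; the variable names are illustrative. In a regression with a single predictor, F equals the squared t statistic of the slope, so both routes should agree with the reported 297.1 up to rounding.

```r
# Values copied from the summary() output above.
r_squared   <- 0.6601
df_model    <- 1      # numerator degrees of freedom
df_residual <- 153    # denominator degrees of freedom
t_slope     <- 17.24  # t value reported for the slope

# F statistic recovered from R-squared.
f_from_r2 <- (r_squared / df_model) / ((1 - r_squared) / df_residual)

# With a single predictor, F equals the squared slope t statistic.
f_from_t <- t_slope^2

# p-value of the overall test, computed the same way as the pf() call above.
p_value <- pf(f_from_r2, df_model, df_residual, lower.tail = FALSE)

round(c(f_from_r2 = f_from_r2, f_from_t = f_from_t), 1)
p_value
```

Small differences between the two values come from the rounding of the printed inputs, not from the model.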
\[H_0:\alpha = 0 \] \[H_A: \alpha \ne 0\]
This hypothesis is tested using a t statistic, reported as t = 23.62, p < 0.001.
There was statistically significant evidence that the constant is not 0.
To corroborate this result, we calculate the 95% CI for \(\alpha\) using the confint() function:
happiness_model %>% confint()
## 2.5 % 97.5 %
## (Intercept) 2.935283 3.471143
## happiness$Economy 1.933859 2.434511
The 95% CI for \(\alpha\) is [2.935, 3.471].
\(H_0\): \(\alpha\) = 0 is clearly not captured by this interval, so \(H_0\) was rejected.
The slope of the regression line was reported as \(\beta\) = 2.18. This means that a one-unit increase in Economy (GDP per capita) is associated with an average increase in happiness score of 2.18 units. This is a positive change.
\[H_0: \beta = 0 \] \[H_A: \beta \ne 0 \]
The slope is tested using a t statistic, reported as t = 17.24, p < 0.001.
There was statistically significant evidence that \(\beta\) is not 0.
The confint() function used in the previous slide shows the 95% CI for the slope to be [1.934, 2.435].
The 95% CI does not capture \(H_0\), so it is rejected.
There is a statistically significant positive relationship between GDP and happiness score.
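The t statistics, p-values and confidence intervals for both parameters can be reproduced by hand from the coefficient table. The sketch below is illustrative rather than part of the original analysis: it takes the estimates and standard errors as printed by summary() and uses the residual degrees of freedom (153) for the t distribution.

```r
# Estimates and standard errors copied from the coefficient table above.
estimate    <- c(intercept = 3.2032, slope = 2.1842)
std_error   <- c(intercept = 0.1356, slope = 0.1267)
df_residual <- 153

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df_residual, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical t value * standard error.
t_crit   <- qt(0.975, df_residual)
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

round(cbind(t_value, ci_lower, ci_upper), 3)
p_value
```

The results should match the summary() and confint() output up to the rounding of the printed inputs.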
Independence is checked through the research design.
We checked linearity at the beginning of the report using the scatter plot.
The scatter plot shows a positive relationship: as GDP increases, so do happiness scores.
The plot() function is used to obtain a series of diagnostic plots for the fitted regression model in the following slides.
The relationship between fitted values and residuals is flat, which is a good indication that the relationship is linear.
The variability on the y axis is constant across the range of values on the x axis and there is no distinct pattern in the variability; this is a sign of homoscedasticity (constant variance).
happiness_model %>% plot(which =1)
The residuals fall close to the line and there are no major deviations from normality.
So it is safe to assume the residuals are approximately normally distributed.
happiness_model %>% plot(which =2)
The red line in the plot below is close to flat and the variance of the square root of the standardised residuals is consistent across fitted values.
Therefore, this is a sign of homoscedasticity.
happiness_model %>% plot(which =3)
No values fall in the upper or lower right-hand side of the plot beyond the red bands.
In fact, the bands are not even visible in the plot below, so there is no evidence of influential cases.
happiness_model %>% plot(which =5)
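The visual checks can be complemented with numeric ones. The sketch below is an assumption-laden illustration, not the report's analysis: the data are simulated (the predictor range and noise level are rough guesses loosely based on the fitted model) so that the code runs without 2017.csv. It applies the Shapiro-Wilk test to the residuals and counts observations whose Cook's distance exceeds 0.5, the lower of the two default contour levels drawn by plot(which = 5).

```r
# Simulated data only: these are NOT the World Happiness Report values.
set.seed(1)
n       <- 155
economy <- runif(n, min = 0, max = 1.9)                # hypothetical predictor range
score   <- 3.2 + 2.18 * economy + rnorm(n, sd = 0.66)  # loosely based on the fitted model
sim_model <- lm(score ~ economy)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normally distributed).
shapiro_result <- shapiro.test(residuals(sim_model))

# Influence: Cook's distance for every observation.
cooks_d <- cooks.distance(sim_model)

# Count observations beyond the 0.5 contour shown in the residuals vs leverage plot.
n_influential <- sum(cooks_d > 0.5)

shapiro_result$p.value
max(cooks_d)
n_influential
```

To run the same checks on the real model, replace sim_model with happiness_model.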
\(R^2\) = 0.66.
66% of the variability in the happiness score can be explained by a linear relationship with economy.
r <- cor(happiness$`Happiness Score`,happiness$Economy)
r
## [1] 0.8124688
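In simple linear regression the squared Pearson correlation equals the model's \(R^2\), and the sign of \(r\) matches the sign of the slope. A quick check using only the values printed above (the variable names are illustrative):

```r
r                  <- 0.8124688  # Pearson correlation from cor() above
r_squared_reported <- 0.6601     # Multiple R-squared from summary()
slope              <- 2.1842     # slope estimate from summary()

# Squared correlation, to compare with the reported R-squared.
r_squared_from_r <- r^2

# The correlation and the slope should share the same sign.
same_sign <- sign(r) == sign(slope)

round(r_squared_from_r, 4)
same_sign
```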
Summary:
Independence and linearity were assumed; normality of residuals OK, homoscedasticity OK, no influential cases.
\(r\) = 0.81, \(r^2\) = 0.66
Model ANOVA, \(F\)(1,153) = 297.1, \(p\) < 0.001
\(\alpha\) = 3.203, \(p\) < 0.001, 95% CI [2.935, 3.471]
\(\beta\) = 2.184, \(p\) < 0.001, 95% CI [1.934, 2.435]
Decision:
Overall model: Reject \(H_0\)
Intercept: Reject \(H_0\)
Slope: Reject \(H_0\)
Happiness Score = 3.203 + 2.184 x Economy
There was a statistically significant positive linear relationship between Happiness Score and Economy.
Economy was estimated to explain 66% of the variability in Happiness Score.
\[Happiness\ Score = 3.203 + 2.184 * Economy\]
Strengths: The two variables chosen are well suited to a simple linear regression model.
Limitation: Only one variable (Economy) from the dataset is used to investigate whether the happiness score of each country can be determined.
Directions for future investigations
The simple regression model should be refitted in 2-3 years for comparison.
The happiness score is better suited to tracking the happiness level within a nation than to comparing it with other nations.
Use the estimated linear regression model to predict the level of happiness of the countries that are not included in this report and compare the predicted happiness score with the actual happiness score.
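The last suggestion could be sketched as follows. Everything about the new countries here is hypothetical: the names, Economy values and actual scores are made-up placeholders that only show the mechanics, and predict_happiness() is an illustrative helper that applies the fitted equation with the coefficients reported above.

```r
# Illustrative helper: apply the fitted equation from this report.
predict_happiness <- function(economy, intercept = 3.2032, slope = 2.1842) {
  intercept + slope * economy
}

# Hypothetical countries; every value below is a made-up placeholder.
new_countries <- data.frame(
  country      = c("Country A", "Country B", "Country C"),
  economy      = c(0.5, 1.0, 1.5),
  actual_score = c(4.1, 5.6, 6.3)
)

# Predicted score and prediction error (actual minus predicted).
new_countries$predicted_score <- predict_happiness(new_countries$economy)
new_countries$error <- new_countries$actual_score - new_countries$predicted_score

# Root mean squared error summarises how far predictions are from actual scores.
rmse <- sqrt(mean(new_countries$error^2))

new_countries
rmse
```

With the fitted model object available, predict(happiness_model, newdata = ...) with interval = "prediction" would also give prediction intervals, provided the model is fitted with plain column names in the formula.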