Assignment 6: Statistical Computing
Simple Linear Regression in R
| Contact | \(\downarrow\) |
| naftaligunawan@gmail.com | |
| https://www.instagram.com/nbrigittag/ | |
| RPubs | https://rpubs.com/naftalibrigitta/ |
| Name | Naftali Brigitta Gunawan |
| Student ID (NIM) | 20214920002 |
Step 1 : Load the data into R
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
income <- read.csv("income.data.csv")
summary(income)
##        X             income        happiness    
##  Min.   :  1.0   Min.   :1.506   Min.   :0.266  
##  1st Qu.:125.2   1st Qu.:3.006   1st Qu.:2.266  
##  Median :249.5   Median :4.424   Median :3.473  
##  Mean   :249.5   Mean   :4.467   Mean   :3.393  
##  3rd Qu.:373.8   3rd Qu.:5.992   3rd Qu.:4.503  
##  Max.   :498.0   Max.   :7.482   Max.   :6.863
From summary(income), the variables are:
Dependent variable = happiness
Independent variable = income
Step 2 : Make sure your data meet the assumptions
There are four main assumptions for linear regression.
1. Independence of observations (or no autocorrelation)
Because we have only one independent variable and one dependent variable, we don't need to test for hidden relationships among the variables, so we can move to the next step.
2. Normality
hist(income$happiness)
The histogram is roughly bell-shaped (most observations in the middle and fewer in the tails), so we can move to the next step.
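As a numeric complement to the histogram, base R's shapiro.test() can be used. The sketch below is an illustration under stated assumptions rather than part of the original analysis: income.data.csv is not bundled here, so it runs on a simulated stand-in vector (sim_happiness, a hypothetical name) instead of income$happiness.

```r
# Simulated stand-in for income$happiness
# (assumption: the real CSV is not available in this sketch).
set.seed(42)
sim_happiness <- rnorm(498, mean = 3.4, sd = 1.4)

# Shapiro-Wilk test: the null hypothesis is that the sample
# comes from a normal distribution.
sw <- shapiro.test(sim_happiness)

# W close to 1 together with a large p-value gives no evidence
# against normality.
print(sw$statistic)
print(sw$p.value)
```

With the real data loaded, sim_happiness would be replaced by income$happiness.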
3. Linearity
The relationship between the independent and dependent variables must be linear. We can check this visually with a scatter plot.
plot(happiness ~ income, data = income)
The relationship in the plot looks roughly linear, so we can move to the next step.
4. Homoscedasticity (or homogeneity of variance)
This means that the size of the prediction error doesn't change significantly across the range of the independent variable. We can test this assumption later, after fitting the linear model.
Step 3 : Perform the linear regression analysis
income.happiness.lm <- lm(happiness ~ income, data = income)
summary(income.happiness.lm)
##
## Call:
## lm(formula = happiness ~ income, data = income)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared: 0.7493, Adjusted R-squared: 0.7488
## F-statistic: 1483 on 1 and 496 DF, p-value: < 2.2e-16
The results are:
The estimates (Estimate) for the model parameters: the y-intercept (0.204) and the estimated effect of income on happiness (0.714).
The p-value for income (Pr(>|t|)) is below 2e-16, so we reject the null hypothesis that income has no effect on happiness in favor of the alternative hypothesis.
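To make the coefficients concrete, the sketch below does the arithmetic by hand using the values printed in the summary() output above. The income value of 5 (that is, $50,000) is a hypothetical input chosen only for illustration, and the variable names are not from the original script.

```r
# Coefficients copied from the summary() output above.
b0 <- 0.20427      # intercept
b1 <- 0.71383      # slope for income
se_b1 <- 0.01854   # standard error of the slope
df_resid <- 496    # residual degrees of freedom

# Predicted happiness for a hypothetical income of 5 (i.e. $50,000).
pred_at_5 <- b0 + b1 * 5
print(pred_at_5)  # 3.77342

# Approximate 95% confidence interval for the slope: estimate +/- t * SE.
t_crit <- qt(0.975, df_resid)
ci_slope <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)
print(round(ci_slope, 3))

# In simple regression the F statistic equals the squared t value of the
# slope; with rounded inputs this lands near the reported F of 1483.
print((b1 / se_b1)^2)
```

With the fitted model in memory, predict(income.happiness.lm, data.frame(income = 5)) and confint(income.happiness.lm) would give the same quantities without rounding.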
Step 4 : Check for homoscedasticity
par(mfrow=c(2,2))
plot(income.happiness.lm)
par(mfrow=c(1,1))
The residuals follow the reference lines in the diagnostic plots almost perfectly (the lines are close to straight). Based on these residual plots, we can say that our model meets the assumption of homoscedasticity.
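The visual check can be complemented with a numeric one. The sketch below is not the method used in this report: it swaps in a studentized (Koenker) Breusch-Pagan test built from base R only, and it runs on simulated stand-in data (sim, fit, and aux are hypothetical names) because income.data.csv is not bundled here.

```r
# Simulated stand-in data (assumption: the real CSV is not available here).
set.seed(1)
n <- 498
sim <- data.frame(income = runif(n, 1.5, 7.5))
sim$happiness <- 0.2 + 0.71 * sim$income + rnorm(n, sd = 0.72)

fit <- lm(happiness ~ income, data = sim)

# Auxiliary regression: do the squared residuals depend on income?
sim$res2 <- resid(fit)^2
aux <- lm(res2 ~ income, data = sim)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with 1 degree of freedom
# (one predictor in the auxiliary regression).
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A large p-value here would be consistent with constant error variance; a small one would suggest the spread of the residuals changes with income.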
Step 5 : Visualize the results with a graph
1. Plot the data points on a graph
income.graph<-ggplot(income, aes(x=income, y=happiness))+
geom_point()
income.graph
2. Add the linear regression line to the plotted data
We can use geom_smooth() with method = "lm" to show the linear regression line.
income.graph <- income.graph + geom_smooth(method="lm", col="red")
income.graph
3. Add the equation for the regression line
income.graph <- income.graph +
stat_regline_equation(label.x = 3, label.y = 7)
income.graph
4. Make the graph ready for publication
We can use theme_bw() to apply a clean style and labs() to set custom labels.
income.graph +
theme_bw() +
labs(title = "Reported happiness as a function of income",
x = "Income (x$10,000)",
y = "Happiness score (0 to 10)")
Step 6 : Report your results
We found a significant relationship between income and happiness (p < 0.001, R² = 0.75), with a 0.71-unit (± 0.02 standard error) increase in reported happiness for every $10,000 increase in income.
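The numbers in a report sentence like this can be pulled from the fitted model rather than copied by hand. The sketch below shows one way to do that with base R; it uses simulated stand-in data (sim, fit, and report are hypothetical names) because income.data.csv is not bundled here, so its printed values will differ from the ones reported above.

```r
# Simulated stand-in data (assumption: the real CSV is not available here).
set.seed(1)
n <- 498
sim <- data.frame(income = runif(n, 1.5, 7.5))
sim$happiness <- 0.2 + 0.71 * sim$income + rnorm(n, sd = 0.72)

fit <- lm(happiness ~ income, data = sim)
s <- summary(fit)

# Pull the slope, its standard error, its p-value, and R-squared.
slope    <- s$coefficients["income", "Estimate"]
slope_se <- s$coefficients["income", "Std. Error"]
slope_p  <- s$coefficients["income", "Pr(>|t|)"]
r2       <- s$r.squared

# Format the p-value the way it is usually reported.
p_text <- if (slope_p < 0.001) "p < 0.001" else sprintf("p = %.3f", slope_p)

report <- sprintf("slope = %.2f (SE %.3f), R-squared = %.2f, %s",
                  slope, slope_se, r2, p_text)
print(report)
```

Applied to the real model, fit would be replaced by income.happiness.lm.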