Download data

The dataset contains observations about income (ranging from $\$15k$ to $\$75k$) and happiness (rated on a scale of 1 to 10) in an imaginary sample of 500 people. The income values are divided by 10,000 so that the income data are on a scale similar to the happiness scores (a value of 2 represents $\$20,000$, 3 represents $\$30,000$, etc.). Download the data here: https://github.com/Tongtt33/SLR/blob/main.

Packages needed (install using install.packages())

‘ggplot2’ is a plotting package that provides helpful commands to create complex plots from data in a data frame.

The ‘ggpubr’ package provides easy-to-use functions for creating and customizing ‘ggplot2’-based plots.

library(ggplot2)
library(ggpubr)
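If you are not sure whether the packages are already installed, a small check like the one below can help. This is only a sketch: the helper name `missing_pkgs` is made up for this example, and it only reports what is missing instead of installing anything.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("ggplot2", "ggpubr"))

if (length(to_install) > 0) {
  # Install these with install.packages(to_install) before calling library().
  message("Packages still to install: ", paste(to_install, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```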

Step 1: Read the data into R

Before reading any data, set the R working directory to the location of the data.

setwd("D:/AUM interview/Teaching demo")
getwd()

## [1] "D:/AUM interview/Teaching demo"

income.data <- read.csv(file = "income.data.csv", header = TRUE)
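If read.csv() fails with a "cannot open file" error, the working directory is usually the cause. The sketch below checks for the file before reading it; the variable name `data_file` is an assumption made for this example.

```r
# Assumed file name, matching the read.csv() call in this handout.
data_file <- "income.data.csv"

if (file.exists(data_file)) {
  income.data <- read.csv(file = data_file, header = TRUE)
} else {
  # Show where R is currently looking so the path can be corrected.
  message("Could not find '", data_file, "' in ", getwd())
}
```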

After you’ve loaded the data, check that it has been read in correctly using “head()”

head(income.data,5)

##     income happiness
## 1 3.862647  2.314489
## 2 4.979381  3.433490
## 3 4.923957  4.599373
## 4 3.214372  2.791114
## 5 7.196409  5.596398
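Beyond head(), it can be worth checking the column types and looking for missing values. The sketch below does this on a small stand-in data frame built from the five rows printed above, so that it runs without the real file; the names `demo_lines` and `demo.data` are made up for this example. With the real data you would call the same functions on income.data.

```r
# The five rows shown by head(), written to a temporary CSV file.
demo_lines <- c(
  "income,happiness",
  "3.862647,2.314489",
  "4.979381,3.433490",
  "4.923957,4.599373",
  "3.214372,2.791114",
  "7.196409,5.596398"
)
tmp <- tempfile(fileext = ".csv")
writeLines(demo_lines, tmp)

demo.data <- read.csv(file = tmp, header = TRUE)

str(demo.data)              # both columns should be numeric
summary(demo.data)          # ranges of income and happiness
colSums(is.na(demo.data))   # number of missing values per column
```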

Step 2: Visualize the Data (Assumption check - Lecture 2)

The first step in this single-predictor modeling process is to determine whether a linear relationship appears to exist between the two variables. We can check this visually with a scatter plot to see whether the pattern of data points could be described by a straight line.

plot(happiness ~ income, data = income.data)

The relationship looks roughly linear, so we can proceed with the simple linear regression model.
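As a rough complement to judging the scatter plot by eye, you can overlay a smooth curve and compute the correlation. The sketch below uses simulated stand-in data, not the real dataset: the data frame `sim` and the values used to generate it (0.2, 0.7, sd = 0.7) are assumptions chosen only so the example runs on its own. If the smooth curve stays close to a straight line, a linear model is a reasonable starting point.

```r
# Simulated stand-in data (NOT the real income data).
set.seed(1)
sim <- data.frame(income = runif(200, min = 1.5, max = 7.5))
sim$happiness <- 0.2 + 0.7 * sim$income + rnorm(200, sd = 0.7)

# Scatter plot with a lowess smooth curve overlaid.
plot(happiness ~ income, data = sim)
lines(lowess(sim$income, sim$happiness), col = "red")

# Pearson correlation: values near 1 or -1 suggest a strong linear trend.
r <- cor(sim$income, sim$happiness)
r
```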

Step 3: Perform the linear regression analysis

To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line fits the linear model, and the second line prints a summary of the fitted model:

income.lm <- lm(happiness ~ income, data = income.data)

summary(income.lm)

## 
## Call:
## lm(formula = happiness ~ income, data = income.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16

In this case, the estimated y-intercept is $\hat\beta_0 = 0.2043$ and the estimated slope is $\hat\beta_1 = 0.7138$. Thus, the fitted regression line is $\widehat{happiness} = 0.2043 + 0.7138 \times income$.
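To see how the fitted line is used, the sketch below plugs income values into the equation by hand, using the coefficients from the summary output above. The function name `predict_happiness` is made up for this example. With the fitted model object available, predict(income.lm, newdata = data.frame(income = 5)) should give the same value up to rounding.

```r
# Coefficients copied from the summary(income.lm) output above.
b0 <- 0.20427   # intercept
b1 <- 0.71383   # slope for income

# Hypothetical helper: predicted happiness for income in units of $10,000.
predict_happiness <- function(income) {
  b0 + b1 * income
}

# Predicted happiness at incomes of $20,000 and $50,000.
predict_happiness(c(2, 5))   # about 1.63 and 3.77
```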

Step 4: Visualize the results with a graph

Basic plot:

plot(happiness ~ income, data = income.data)
abline(income.lm)

Follow these four steps to visualize the results using ‘ggplot2’:

1. Plot the data points on a graph

income.graph<-ggplot(income.data, aes(x=income, y=happiness))+
                     geom_point(col="orange")
income.graph

2. Add the linear regression line to the plotted data

Add the regression line using geom_smooth(), specifying "lm" as the method for creating the line. This adds the fitted regression line along with a light grey band around it, which by default shows the 95% confidence interval for the fitted line:

income.graph <- income.graph + geom_smooth(method="lm", col="black")

income.graph
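To see what the grey band represents (by default, geom_smooth(method = "lm") draws a 95% confidence interval for the fitted line), the sketch below computes the same kind of band in base R with predict(). It uses simulated stand-in data, not the real dataset: `sim`, `sim.lm`, `grid` and `band` are names made up for this example.

```r
# Simulated stand-in data (NOT the real income data).
set.seed(1)
sim <- data.frame(income = runif(200, min = 1.5, max = 7.5))
sim$happiness <- 0.2 + 0.7 * sim$income + rnorm(200, sd = 0.7)

sim.lm <- lm(happiness ~ income, data = sim)

# Fitted values and 95% confidence limits on a grid of income values.
grid <- data.frame(income = seq(1.5, 7.5, length.out = 50))
band <- predict(sim.lm, newdata = grid, interval = "confidence", level = 0.95)

# Draw the points, the fitted line, and the confidence limits (dashed).
plot(happiness ~ income, data = sim, col = "orange")
lines(grid$income, band[, "fit"])
lines(grid$income, band[, "lwr"], lty = 2)
lines(grid$income, band[, "upr"], lty = 2)
```

The band is narrowest near the mean income and widens toward the ends of the range, which is the same shape the grey stripe has in the ggplot2 graph.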

3. Add the equation for the regression line.

income.graph <- income.graph +
  stat_regline_equation(label.x = 3, label.y = 7)

income.graph
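If you want control over how the equation is rounded or worded, one option is to build the label text yourself from the coefficients. The sketch below uses the estimates reported in the summary output; the names `b` and `eq_label` are made up for this example, and the formatting is a choice made here, not necessarily what stat_regline_equation() prints.

```r
# Coefficients copied from the summary(income.lm) output
# (with the model object available, coef(income.lm) returns them).
b <- c(0.20427, 0.71383)

# Build an equation label rounded to two decimal places.
eq_label <- sprintf("y = %.2f + %.2f x", b[1], b[2])
eq_label
```

A string like this could then be placed on the graph, for example with annotate("text", x = 3, y = 7, label = eq_label) in ‘ggplot2’.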

4. Make the graph ready for publication

We can make custom labels using labs().

income.graph +
  labs(title = "Reported happiness as a function of income",
      x = "Income (x$10,000)",
      y = "Happiness score (1 to 10)")

Simple Linear Regression Example 1

Tingting Tong

2024-01-17