This experiment follows the linear regression tutorial on Scribbr: https://www.scribbr.com/statistics/linear-regression-in-r

The dataset contains observations of income and happiness taken from a sample of roughly 500 people (the CSV loaded below has 498 rows).

# Import the CSV dataset into R.
income_dataset_url <- 'https://raw.githubusercontent.com/stephen-haslett/data605/data605-week-11/income_to_happy.csv'
income_dataset <- read.csv(income_dataset_url)
head(income_dataset)
##   X   income happiness
## 1 1 3.862647  2.314489
## 2 2 4.979381  3.433490
## 3 3 4.923957  4.599373
## 4 4 3.214372  2.791114
## 5 5 7.196409  5.596398
## 6 6 3.729643  2.458556

Take a quick look at the data.

summary(income_dataset)
##        X             income        happiness    
##  Min.   :  1.0   Min.   :1.506   Min.   :0.266  
##  1st Qu.:125.2   1st Qu.:3.006   1st Qu.:2.266  
##  Median :249.5   Median :4.424   Median :3.473  
##  Mean   :249.5   Mean   :4.467   Mean   :3.393  
##  3rd Qu.:373.8   3rd Qu.:5.992   3rd Qu.:4.503  
##  Max.   :498.0   Max.   :7.482   Max.   :6.863

Does the dependent variable follow a normal distribution?

hist(income_dataset$happiness)
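
A histogram gives only a visual impression, so a Q-Q plot and a Shapiro-Wilk test can serve as a complementary check. The following is a minimal sketch on simulated stand-in values, not the real data: `sim_happiness` and its parameters are assumptions chosen only to resemble the summary above.

```r
# Simulated stand-in for income_dataset$happiness (hypothetical values).
set.seed(42)
sim_happiness <- rnorm(498, mean = 3.4, sd = 1.4)

# Q-Q plot: points lying close to the reference line suggest approximate normality.
qqnorm(sim_happiness)
qqline(sim_happiness)

# Shapiro-Wilk test: a large p-value gives no evidence against normality.
sw <- shapiro.test(sim_happiness)
sw$p.value
```

To run the same check on the real data, replace `sim_happiness` with `income_dataset$happiness`.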

Is the relationship between the independent and dependent variables linear?

plot(happiness ~ income, data = income_dataset)
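
The scatter plot can be backed up with a number: the Pearson correlation coefficient measures the strength of a linear association. This is a rough sketch on simulated data; `sim_income`, `sim_happiness`, and the generating parameters are assumptions, not values taken from the dataset.

```r
# Simulated stand-in data with a roughly linear relationship (hypothetical).
set.seed(1)
sim_income <- runif(498, min = 1.5, max = 7.5)
sim_happiness <- 0.2 + 0.7 * sim_income + rnorm(498, sd = 0.7)

# Pearson correlation: values near +1 or -1 indicate a strong linear association.
r <- cor(sim_income, sim_happiness)
r
```

On the real data the equivalent call would be `cor(income_dataset$income, income_dataset$happiness)`.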

Linear Regression Analysis

Is there a linear relationship between income and happiness?

income_dataset_happiness_lm <- lm(happiness ~ income, data = income_dataset)
summary(income_dataset_happiness_lm)
## 
## Call:
## lm(formula = happiness ~ income, data = income_dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16
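
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the output above; `predict_happiness` is a hypothetical helper introduced only for illustration.

```r
# Coefficients copied from the summary() output above.
intercept <- 0.20427
slope <- 0.71383

# Hypothetical helper: predicted happiness for a given income value.
predict_happiness <- function(income) intercept + slope * income

# Prediction at income = 5, and the change for a one-unit rise in income.
pred_at_5 <- predict_happiness(5)
one_unit_change <- predict_happiness(6) - predict_happiness(5)
pred_at_5
one_unit_change
```

Each additional unit of income raises predicted happiness by the slope, about 0.71. With the fitted model object, `predict(income_dataset_happiness_lm, newdata = data.frame(income = 5))` should give the same prediction up to rounding of the printed coefficients.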

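Check whether the residuals are centered on zero across the range of fitted values. In this case they are: the red lines in the diagnostic plots, which trace the average of the residuals, stay roughly horizontal and close to zero. This gives no sign that the model's assumptions are violated, so we can continue with our study.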

par(mfrow = c(2,2))
plot(income_dataset_happiness_lm)
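
A numeric companion to the diagnostic plots, as a sketch on simulated data (`sim_data` and `sim_lm` are hypothetical stand-ins for the real dataset and model). For a least-squares fit with an intercept, the overall mean of the residuals is zero by construction, so the more informative check is whether the mean stays near zero within different ranges of the fitted values.

```r
# Simulated stand-in data and model (hypothetical, not the real dataset).
set.seed(7)
sim_data <- data.frame(income = runif(498, min = 1.5, max = 7.5))
sim_data$happiness <- 0.2 + 0.7 * sim_data$income + rnorm(498, sd = 0.7)
sim_lm <- lm(happiness ~ income, data = sim_data)

res <- residuals(sim_lm)

# Overall mean of the residuals: zero up to floating-point error.
overall_mean <- mean(res)

# Mean residual within each quartile of the fitted values.
fitted_bin <- cut(fitted(sim_lm),
                  breaks = quantile(fitted(sim_lm), probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
bin_means <- tapply(res, fitted_bin, mean)
overall_mean
bin_means
```

Bin means that drift steadily away from zero would point to a nonlinear pattern the straight line misses.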

par(mfrow = c(1, 1))
# ggplot2 must be loaded before ggplot() is called.
library(ggplot2)
income_dataset_graph <- ggplot(income_dataset, aes(x = income, y = happiness)) +
  geom_point()
income_dataset_graph

income_dataset_graph <- income_dataset_graph + geom_smooth(method="lm", col="black")

income_dataset_graph