Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

For this regression model I chose the income.data file from RStudio.

# Load the packages needed
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(broom)
## Warning: package 'broom' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2

The numeric summary below shows the minimum, quartiles, median, mean, and maximum values of the independent variable (income) and the dependent variable (happiness). Both variables are quantitative.

# Create a data frame by reading the csv file.
# head() shows the first six rows of the dataset.
income.data <- read.csv("income.data.csv")
head(income.data)
##   X   income happiness
## 1 1 3.862647  2.314489
## 2 2 4.979381  3.433490
## 3 3 4.923957  4.599373
## 4 4 3.214372  2.791114
## 5 5 7.196409  5.596398
## 6 6 3.729643  2.458556
# Numeric summary of the data.
summary(income.data)
##        X             income        happiness    
##  Min.   :  1.0   Min.   :1.506   Min.   :0.266  
##  1st Qu.:125.2   1st Qu.:3.006   1st Qu.:2.266  
##  Median :249.5   Median :4.424   Median :3.473  
##  Mean   :249.5   Mean   :4.467   Mean   :3.393  
##  3rd Qu.:373.8   3rd Qu.:5.992   3rd Qu.:4.503  
##  Max.   :498.0   Max.   :7.482   Max.   :6.863
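
Before checking assumptions, a couple of quick sanity checks on the data can be useful; this is a minimal sketch using base R to confirm the number of observations and that there are no missing values (results not shown here).

# Number of rows and count of missing values in the data frame.
nrow(income.data)
sum(is.na(income.data))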

We first check that our data meet the four main assumptions for linear regression:

  1. Independence of observations (aka no autocorrelation)

Because we only have one independent variable and one dependent variable, we don’t need to test for any hidden relationships among variables.

  2. Normality

To check whether the dependent variable follows a normal distribution, we use the hist() function.

The observations are roughly bell-shaped (more observations in the middle of the distribution, fewer on the tails), so we can proceed with the linear regression.

hist(income.data$happiness)
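
If a more formal check than the histogram is wanted, a normality test can also be run on the dependent variable. This is a minimal sketch using base R's shapiro.test(); a large p-value would be consistent with normality (result not shown here).

shapiro.test(income.data$happiness)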

  3. Linearity

The relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see if the distribution of data points could be described with a straight line.

The relationship looks roughly linear, so we can proceed with the linear model.

plot(happiness ~ income, data = income.data)
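
As a numeric complement to the scatter plot, we could also compute the correlation between the two variables; this is a minimal sketch using base R's cor(), where a value close to 1 would support a strong positive linear relationship (result not shown here).

cor(income.data$income, income.data$happiness)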

  4. Homoscedasticity (aka homogeneity of variance)

This means that the prediction error doesn’t change significantly over the range of prediction of the model. We can test this assumption later, after fitting the linear model.

Having determined that the data meet the assumptions, we can now perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.

Simple regression: income and happiness

Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.

The Coefficients section shows:

The estimates (Estimate) for the model parameters – the value of the y-intercept (in this case 0.204) and the estimated effect of income on happiness (0.713).

The standard error of the estimated values (Std. Error).

The test statistic (t value), which for linear regression is the t statistic.

The p value (Pr(>|t|)), i.e. the probability of finding the given t statistic if the null hypothesis of no relationship were true.

The most important thing to note is the p value (here < 2.2e-16, or essentially zero), which indicates whether there is a statistically significant relationship between income and happiness.

From these results, we can say that there is a significant positive relationship between income and happiness (p value < 0.001), with a 0.713-unit (± 0.019 standard error) increase in happiness for every unit ($10,000) increase in income.

# To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line of code makes the linear model, and the second line prints out the summary of the model:

income.happiness.lm <- lm(happiness ~ income, data = income.data)

summary(income.happiness.lm)
## 
## Call:
## lm(formula = happiness ~ income, data = income.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02479 -0.48526  0.04078  0.45898  2.37805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20427    0.08884   2.299   0.0219 *  
## income       0.71383    0.01854  38.505   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7181 on 496 degrees of freedom
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7488 
## F-statistic:  1483 on 1 and 496 DF,  p-value: < 2.2e-16
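
Because broom is already loaded, the same coefficient table can also be pulled out as a tidy data frame, and confint() returns confidence intervals for the intercept and slope. This is a minimal sketch; the values would match the summary output above (results not shown here).

# Coefficient table as a data frame, plus 95% confidence intervals.
tidy(income.happiness.lm)
confint(income.happiness.lm)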

Before proceeding with data visualization, we should make sure that our model meets the homoscedasticity assumption of the linear model.

We can run plot(income.happiness.lm) to check whether the observed data meet our model assumptions:

Residuals are the unexplained variation: the differences between the observed values and the values predicted by the model. They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.

The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.

In the Normal Q-Q plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.

Because the residuals show no bias, we can say that our model meets the assumption of homoscedasticity.

# Arrange the four diagnostic plots in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(income.happiness.lm)

# Reset the plotting layout to a single panel.
par(mfrow = c(1, 1))
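
If a formal test of homoscedasticity is wanted alongside the diagnostic plots, a Breusch-Pagan test could be run on the fitted model. This is a minimal sketch that assumes the lmtest package is installed (it is not loaded above); a large p-value would be consistent with constant error variance (result not shown here).

# Breusch-Pagan test for non-constant variance (requires the lmtest package).
lmtest::bptest(income.happiness.lm)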

We can plot the data and the regression line from our linear regression model:

# Plot the data points on a graph.
# Add the regression line using geom_smooth() with method = "lm". This adds the fitted line along with its 95% confidence band, shown as a light grey ribbon around the line.
# Add the equation for the regression line.
# Add some style parameters using theme_bw() and making custom labels using labs().

income.graph <- ggplot(income.data, aes(x = income, y = happiness)) +
  geom_point() +
  geom_smooth(method = "lm", col = "black") +
  stat_regline_equation(label.x = 3, label.y = 7) +
  theme_bw() +
  labs(title = "Reported happiness as a function of income",
       x = "Income (x$10,000)",
       y = "Happiness score (0 to 10)")

income.graph
## `geom_smooth()` using formula = 'y ~ x'
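
As a quick illustration of how the fitted model can be used, predict() returns the expected happiness score for a given income. This is a minimal sketch; the income value of 5 (i.e. $50,000, since income is in units of $10,000) is just an example.

# Predicted happiness for an income of $50,000.
predict(income.happiness.lm, newdata = data.frame(income = 5))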

This linear model was appropriate because the data met the four main assumptions for linear regression: independence of observations, normality, linearity, and homoscedasticity.

In conclusion, we found a significant positive relationship between income and happiness (p < 0.001, R2 = 0.75), with a 0.71-unit (± 0.019) increase in reported happiness for every $10,000 increase in income.