r195334vncube@gmail.com
+ 263 77 497 9514
Github profile
Rpubs
Personal website
Linkedin

Introduction

This presentation provides hands-on instruction covering basic statistical analysis in R, including descriptive statistics, t-tests, linear models, and chi-square tests. We will also cover methods for “tidying” model results for downstream visualization and summarization.

Import & inspect

library(tidyverse)
library(skimr)

my_df <- readr::read_csv("loan_data_cleaned.csv")
my_df

Notice that the data are already in tidy format - meaning that:

  • Each variable is a column

  • Each observation is a row

  • Each value is in its own cell

factors vs characters

A note on characters versus factors: one thing you will immediately notice is that all the categorical variables are read in as character data types. This data type is used for storing strings of text, for example IDs, names, descriptive text, etc. There is another related data type called factors. Factor variables are used to represent categorical variables with two or more levels, e.g., “RENT” or “OWN” for home_ownership, or 0 versus 1 for loan_status. For the most part, statistical analysis treats these two data types the same. It is often easier to leave categorical variables as characters. However, in some cases you may get a warning message alerting you that a character variable was converted into a factor variable during analysis. Generally, these warnings are nothing to worry about. You can, if you like, convert individual variables to factor variables, or simply use dplyr’s mutate_if to convert all character vectors to factor variables:

Let's see

my_df <- my_df |> mutate_if(is.character, as.factor) |> 
  mutate(loan_status=as.factor(loan_status))
my_df

useful functions to remember

Recall several built-in functions that are useful for working with data frames.

  • Content:
    • head(): shows the first few rows
    • tail(): shows the last few rows
  • Size:
    • dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(): returns the number of rows
    • ncol(): returns the number of columns
  • Summary:
    • colnames() (or just names()): returns the column names
    • glimpse() (from dplyr): Returns a glimpse of your data, telling you the structure of the dataset and information about the class, length and content of each column

heads and tails

head(my_df)
tail(my_df)

exploring the data

dim(my_df)
#> [1] 29091     9
names(my_df)
#> [1] "...1"           "loan_status"    "loan_amnt"      "grade"         
#> [5] "home_ownership" "annual_inc"     "age"            "emp_cat"       
#> [9] "ir_cat"
glimpse(my_df)
#> Rows: 29,091
#> Columns: 9
#> $ ...1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
#> $ loan_status    <fct> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0~
#> $ loan_amnt      <dbl> 5000, 2400, 10000, 5000, 3000, 12000, 9000, 3000, 10000~
#> $ grade          <fct> B, C, C, A, E, B, C, B, B, D, C, A, B, A, B, B, B, B, B~
#> $ home_ownership <fct> RENT, RENT, RENT, RENT, RENT, OWN, RENT, RENT, RENT, RE~
#> $ annual_inc     <dbl> 24000.00, 12252.00, 49200.00, 36000.00, 48000.00, 75000~
#> $ age            <dbl> 33, 31, 24, 39, 24, 28, 22, 22, 28, 22, 23, 27, 30, 24,~
#> $ emp_cat        <fct> 0-15, 15-30, 0-15, 0-15, 0-15, 0-15, 0-15, 0-15, 0-15, ~
#> $ ir_cat         <fct> 8-11, Missing, 11-13.5, Missing, Missing, 11-13.5, 11-1~

data types

  • It is important to look at the data types before any analysis
  • We can use purrr::map_chr() or base R's sapply() (the sapply() equivalent is shown after the output below)
map_chr(my_df,class)
#>           ...1    loan_status      loan_amnt          grade home_ownership 
#>      "numeric"       "factor"      "numeric"       "factor"       "factor" 
#>     annual_inc            age        emp_cat         ir_cat 
#>      "numeric"      "numeric"       "factor"       "factor"
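
For reference, the base R sapply() call mentioned above returns the same information:

# Base R alternative to purrr::map_chr()
sapply(my_df, class)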

Descriptive statistics

We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable. Let’s print out all the loan_status values in the data, then see what the unique values are. Then let’s calculate the mean, median, and range of the age variable.

ideal procedures

# Extract loan_status with `$`, then find its unique values and how many there are
my_df$loan_status |> unique()
#> [1] 0 1
#> Levels: 0 1
my_df$loan_status |> unique() |> length()
#> [1] 2

Measures of central tendency

To understand the distribution better, we can examine so-called measures of central tendency, which is a fancy way of describing statistics that represent the “middle” of the data. The goal is to find a “typical” value. Common ways to define the middle of the data include:

  • The mean: A simple average based on adding together all of the values in the sample set, and then dividing the total by the number of samples.

  • The median: The value in the middle of the range of all of the sample values.

  • The mode: The most commonly occurring value in the sample set*.

Let’s calculate these values, along with the minimum and maximum values for comparison, and show them on the histogram.

*Of course, in some sample sets, there may be a tie for the most common value - in which case the dataset is described as bimodal or even multimodal.
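
As a quick sketch, the median and mode of age can also be computed directly; this assumes the statip package (used later in this document for mfv()) is installed:

# Median and most frequent value (mode) of age
median(my_df$age)
statip::mfv(my_df$age)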

# Get the mean loan amount using the accessor `$` 
mean_loan_amount <- mean(my_df$loan_amnt)


# Get the mean age sequentially using ` %>% ` and dplyr::pull
mean_age <- my_df %>% 
                pull(age) %>% 
                  mean()

look at the results


# Glue it all together
glue::glue(
  'Average loan amount: {format(round(mean_loan_amount, 2), nsmall = 2)}
   Average age: {format(round(mean_age, 2), nsmall = 2)}'
)
#> Average loan amount: 9593.66
#> Average age: 27.70

Measures of variance

how much variability is there in the data?

Typical statistics that measure variability in the data include:

  • Range: The difference between the maximum and minimum. There’s no built-in function for this, but it’s easy to calculate using the min and max functions. Another approach would be to use Base R’s base::range() which returns a vector containing the minimum and maximum of all the given arguments. Wrapping this in base::diff() will get you well on your way to finding the range.

  • Variance: The average of the squared difference from the mean. You can use the built-in var function to find this.

  • Standard Deviation: The square root of the variance. You can use the built-in sd function to find this.

Let's answer a few questions here; one way to compute the answers is sketched after them.

  1. What’s the range of values for age in all participants? (Hint: see help for min(), max(), and range() functions, e.g., enter ?range without the parentheses to get help).
#> [1] 20 94

The oldest participant is 94 and the youngest is 20.

  2. What are the median, lower, and upper quartiles for the age of all participants? (Hint: see help for median, or better yet, quantile).
#>   0%  25%  50%  75% 100% 
#>   20   23   26   30   94

The output displays the five-number summary (minimum, lower quartile, median, upper quartile, maximum).

  3. What’s the variance and standard deviation for age among all participants?
#> [1] 38.37329
#> [1] 6.194617
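
One way these answers can be produced, using base R functions on the age variable:

# Range of age: min and max, or diff(range()) for the spread
range(my_df$age)
diff(range(my_df$age))

# Five-number summary
quantile(my_df$age)

# Variance and standard deviation
var(my_df$age)
sd(my_df$age)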

NB: the variance displayed by R is the sample variance (with an n − 1 denominator); as the sample size grows, it becomes practically indistinguishable from the population variance.

We know that \[s^2 = \frac{\sum(x-\bar{x})^2}{n-1}\]
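
We can verify that var() uses this n − 1 denominator by computing the formula by hand:

# Sample variance from the formula, compared with var()
x <- my_df$age
sum((x - mean(x))^2) / (length(x) - 1)
var(x)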

Of these statistics, the standard deviation is generally the most useful. It provides a measure of variance in the data on the same scale as the data itself. The higher the standard deviation, the more variance there is when comparing values in the distribution to the distribution mean - in other words, the data is more spread out.

When working with a normal distribution, the standard deviation works with the particular characteristics of a normal distribution to provide even greater insight. This can be summarized using the 68–95–99.7 rule, also known as the empirical rule, which is described as follows (a quick check on the age variable is shown after the list):

In any normal distribution:

  • Approximately 68.26% of values fall within one standard deviation from the mean.

  • Approximately 95.45% of values fall within two standard deviations from the mean.

  • Approximately 99.73% of values fall within three standard deviations from the mean.
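
As a rough check (remembering that age is right-skewed, so we should not expect a close match), we can compute the proportion of age values within one, two, and three standard deviations of the mean:

# Proportion of age values within k standard deviations of the mean
m <- mean(my_df$age)
s <- sd(my_df$age)
sapply(1:3, function(k) mean(abs(my_df$age - m) <= k * s))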

Getting started with statistical analysis 🎉

A lot of data science is rooted in statistics, so we’ll explore some basic statistical techniques.

Note: This is not intended to teach you statistics - that’s much too big a topic for this notebook. It will however introduce you to some statistical concepts and techniques that data scientists use as they explore data in preparation for machine learning modeling.

Descriptive statistics and data distribution

When examining a variable (for example, a sample of loan amounts), data scientists are particularly interested in its distribution (in other words, how the different loan values are spread across the sample). The starting point for this exploration is often to visualize the data as a histogram and see how frequently each value of the variable occurs.

# Get the variable to examine
var_data <- my_df %>% select(loan_amnt)

# Plot a histogram
var_data %>% 
  ggplot(aes(x = loan_amnt))+
  geom_histogram(fill = "midnightblue", alpha = 0.7, boundary = 0.5)+
  ggtitle('Data Distribution')+
  xlab('Value')+
  ylab('Frequency')+
  theme(plot.title = element_text(hjust = 0.5))

library(patchwork)
p1 <- ggplot(my_df, aes(loan_amnt)) + geom_histogram(bins = 30)
p2 <- ggplot(my_df, aes(age)) + geom_histogram(bins = 30)

# Combine the two histograms side by side with patchwork
p1 + p2

plot

Histograms are important for inspecting the distribution of the data:

  • age is right-skewed and hence does not appear to follow a normal distribution
  • loan_amnt is also right-skewed

library(statip)
# Get the variable to examine
var <-  my_df %>% select(loan_amnt)

# Get statistics
min_val <- min(var$loan_amnt)
max_val <- max(var$loan_amnt)
mean_val <- mean(var$loan_amnt)
med_val <- median(var$loan_amnt)
mod_val <- mfv(var$loan_amnt)

# Print the stats
glue::glue(
  'Minimum: {format(round(min_val, 2), nsmall = 2)}
   Mean: {format(round(mean_val, 2), nsmall = 2)}
   Median: {format(round(med_val, 2), nsmall = 2)}
   Mode: {format(round(mod_val, 2), nsmall = 2)}
   Maximum: {format(round(max_val, 2), nsmall = 2)}'
)
#> Minimum: 500.00
#> Mean: 9593.66
#> Median: 8000.00
#> Mode: 10000.00
#> Maximum: 35000.00


# Plot a histogram
var %>% 
  ggplot(aes(x = loan_amnt))+
  geom_histogram( fill = "midnightblue", alpha = 0.7, boundary = 0.5)+
  
# Add lines for the statistics
  geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 )+
  
# Add titles and labels
  ggtitle('Data Distribution')+
  xlab('Value')+
  ylab('Frequency')+
  theme(plot.title = element_text(hjust = 0.5))

Another way to visualize the distribution of a variable is to use a box plot (sometimes called a box-and-whiskers plot). Let’s create one for the loan_amnt data.

# Get the variable to examine
var <-  my_df %>% select(loan_amnt)

# Plot a box plot
var %>% 
  ggplot(aes(x = 1, y = loan_amnt))+
  geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7)+
# Add titles and labels
  ggtitle("Data Distribution")+
  xlab("")+
  ylab("loan_amnt")+
  theme(plot.title = element_text(hjust = 0.5))

The box plot shows the distribution of the loan amount values in a different format from the histogram. The box part of the plot shows where the inner two quartiles of the data reside, the whiskers extending from the box show the outer two quartiles, and the line in the box indicates the median value.
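
The same quartiles can be read off numerically, for example with quantile() and IQR():

# Quartiles and interquartile range of loan_amnt
quantile(my_df$loan_amnt)
IQR(my_df$loan_amnt)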

Outliers

# Create a function that we can reuse
show_distribution <- function(var_data, binwidth) {
  library(ggplot2)
  library(dplyr)
  library(glue)
  library(patchwork)
  
  # Get statistics
  min_val <- min(pull(var_data))
  max_val <- max(pull(var_data))
  mean_val <- mean(pull(var_data))
  med_val <- median(pull(var_data))
  mod_val <- statip::mfv(pull(var_data))

  # Print the stats
  stats <- glue::glue(
  'Minimum: {format(round(min_val, 2), nsmall = 2)}
   Mean: {format(round(mean_val, 2), nsmall = 2)}
   Median: {format(round(med_val, 2), nsmall = 2)}
   Mode: {format(round(mod_val, 2), nsmall = 2)}
   Maximum: {format(round(max_val, 2), nsmall = 2)}'
  )
  
  # Plot the histogram
  hist_gram <- var_data %>% 
  ggplot(aes(x = pull(var_data)))+
  geom_histogram(fill = "midnightblue", alpha = 0.7, boundary = 0.4)+
  # Add lines for the statistics
  geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3)+
  geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 )+
  geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 )+
  # Add titles and labels
  ggtitle('Data Distribution')+
  xlab('')+
  ylab('Frequency')+
  theme(plot.title = element_text(hjust = 0.5))
  
  # Plot the box plot
  bx_plt <- var_data %>% 
  ggplot(aes(x = pull(var_data), y = 1))+
  geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7
               )+
    # Add titles and labels
  xlab("Value")+
  ylab("")+
  theme(plot.title = element_text(hjust = 0.5))
  
  
  # To return multiple outputs, use a `list`
  return(list(stats,
              hist_gram / bx_plt)
        ) # End of returned outputs
  
} # End of function

In most real-world cases, it’s easier to consider outliers as being values that fall below or above the percentiles within which most of the data lie. For example, the following code uses the stats::quantile() function to exclude observations below the 1st percentile (the value above which 99% of the data reside).

# Produce a quantile corresponding to 1%
q01 <- my_df$loan_amnt %>% 
  quantile(probs = 1/100, names = FALSE)

# Get the variable to examine
col <- my_df %>% 
  select(loan_amnt) %>% 
  filter(loan_amnt > q01)
  
# Call the function
show_distribution(var_data = col, binwidth = 0.5)
#> [[1]]
#> Minimum: 1050.00
#> Mean: 9681.53
#> Median: 8000.00
#> Mode: 10000.00
#> Maximum: 35000.00
#> 
#> [[2]]

T-tests

Let’s do a few two-sample t-tests to test for differences in means between two groups. The function for a t-test is t.test(). See the help for ?t.test. We’ll be using the formula method. The usage is t.test(response~group, data=myDataFrame).

  1. Are there differences in age between the loan_status groups (defaulters and non-defaulters) in this dataset?
  2. Does loan_amnt differ between defaulters and non_defaulters?

the code

t.test(age~loan_status, data=my_df)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  age by loan_status
#> t = 2.8422, df = 4113.6, p-value = 0.004503
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>  0.1002091 0.5458840
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        27.73395        27.41091

loan amount

t.test(loan_amnt~loan_status, data=my_df)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  loan_amnt by loan_status
#> t = 1.9199, df = 4035.6, p-value = 0.05494
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>   -4.882188 465.910368
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        9619.234        9388.720

annual income

t.test(annual_inc~loan_status, data=my_df)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  annual_inc by loan_status
#> t = 10.033, df = 4424.8, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>   7055.507 10482.649
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        67937.63        59168.55

Note

A note on one-tailed versus two-tailed tests: A two-tailed test is almost always more appropriate. The hypothesis you’re testing here is spelled out in the results (“alternative hypothesis: true difference in means is not equal to 0”). If the p-value is very low, you can reject the null hypothesis that there’s no difference in means. Because you typically don’t know a priori whether the difference in means will be positive or negative (e.g., we don’t know a priori whether defaulters would be expected to have higher or lower annual income than non-defaulters), we want to do the two-tailed test. However, if we only wanted to test a very specific directionality of effect, we could use a one-tailed test and specify which direction we expect. This is more powerful if we “get it right”, but much less powerful for the opposite effect. Notice how the p-value changes depending on how we specify the hypothesis. Again, the two-tailed test is almost always more appropriate.

lower and upper tail

# Two tailed
t.test(annual_inc~loan_status, data=my_df)

# Difference in means is >0 
t.test(annual_inc~loan_status, data=my_df, alternative="greater")

# Difference in means is <0
t.test(annual_inc~loan_status, data=my_df, alternative="less")

note

A note on paired versus unpaired t-tests: The t-test we performed here was an unpaired test: the defaulters and non-defaulters are different people, so an unpaired test is appropriate. An alternative design might be one where data are derived from samples measured at two different time points or locations, e.g., before versus after treatment, left versus right hand, etc. In this case, a paired t-test would be more appropriate. A paired test takes into consideration the intra- and inter-subject variability, and is more powerful than the unpaired test. See the help for ?t.test for more information on how to do this; a sketch of the syntax follows.
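
As a sketch of the paired syntax only (the before/after vectors below are made-up values, not from this dataset):

# Hypothetical measurements on the same 10 subjects before and after a treatment
before <- c(200, 190, 210, 205, 198, 220, 215, 207, 199, 212)
after  <- c(195, 188, 205, 200, 195, 210, 212, 201, 196, 208)

# Paired t-test on the within-subject differences
t.test(before, after, paired = TRUE)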

Wilcoxon test

Another assumption of the t-test is that data is normally distributed. Looking at the histogram for age shows that this data clearly isn’t.

the test

The Wilcoxon rank-sum test (a.k.a. Mann-Whitney U test) is a nonparametric alternative to the t-test that does not require normally distributed data. When the data are perfectly normal, the t-test is somewhat more powerful. But when this assumption is violated, the t-test is unreliable. This test is called in a similar way to the t-test.

wilcox.test(age~loan_status, data=my_df)
#> 
#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  age by loan_status
#> W = 43348257, p-value = 0.0003112
#> alternative hypothesis: true location shift is not equal to 0

The result is still significant; here the p-value is in fact even smaller than the one reported by the (incorrect) t-test above. Also note in the help for ?wilcox.test that there’s a paired option here too.

Linear models

Where t-tests and their nonparametric substitutes are used for assessing the differences in means between two groups, ANOVA is used to assess the significance of differences in means between multiple groups. In fact, a t-test is just a specific case of ANOVA when you only have two groups. And both t-tests and ANOVA are just specific cases of linear regression, where you’re trying to fit a model describing how a continuous outcome (e.g., annual_inc) changes with some predictor variable (e.g., loan_status, home_ownership, age, etc.). The distinction is largely semantic – with a linear model you’re asking, “do levels of a categorical variable affect the response?” whereas with ANOVA or t-tests you’re asking, “does the mean response differ between levels of a categorical variable?”

previously….

Let’s examine the relationship between annual_inc and loan_status. Let’s first do this with a t-test, and for now, let’s assume that the variances between groups are equal.

t.test(annual_inc~loan_status, data=my_df, var.equal=TRUE)
#> 
#>  Two Sample t-test
#> 
#> data:  annual_inc by loan_status
#> t = 8.8318, df = 29089, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>   6822.958 10715.199
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        67937.63        59168.55

using lm()

Now, let’s do the same test in a linear modeling framework. First, let’s create the fitted model and store it in an object called fit.

fit <- lm(annual_inc~loan_status, data=my_df)

You can display the object itself, but that isn’t too interesting. You can get the more familiar ANOVA table by calling the anova() function on the fit object. More generally, the summary() function on a linear model object will tell you much more. (Note this is different from dplyr’s summarize function).

output

fit
#> 
#> Call:
#> lm(formula = annual_inc ~ loan_status, data = my_df)
#> 
#> Coefficients:
#>  (Intercept)  loan_status1  
#>        67938         -8769
anova(fit)

using summary()

summary(fit)
#> 
#> Call:
#> lm(formula = annual_inc ~ loan_status, data = my_df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#>  -63938  -27938   -9938   12831 1971846 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   67937.6      330.7 205.441   <2e-16 ***
#> loan_status1  -8769.1      992.9  -8.832   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 53180 on 29089 degrees of freedom
#> Multiple R-squared:  0.002674,   Adjusted R-squared:  0.00264 
#> F-statistic:    78 on 1 and 29089 DF,  p-value: < 2.2e-16

going back to previous t.test()

Go back and re-run the t-test assuming equal variances as we did before. Now notice a few things:

t.test(annual_inc~loan_status, data=my_df, var.equal=TRUE)

ok….what’s happening

  1. The p-values from all three tests (t-test, ANOVA, and linear regression) are identical (< 2.2e-16). This is because they are fundamentally the same test: a t-test is a specific case of ANOVA, which is a specific case of linear regression. There may be some rounding error, but we’ll talk about extracting the exact values from a model object later on.
  2. The test statistics are all related. The t statistic from the equal-variance t-test is 8.83; if you square that, you get 78, the F statistic reported by the ANOVA and by summary(fit).
  3. The t.test() output shows you the means for the two groups, non-default (0) and default (1). Just displaying the fit object itself or running summary(fit) shows you the coefficients of the linear model. Here, the model treats the first level of loan_status (0) as the baseline, so the intercept in the regression model (e.g., \(\beta_{0}\) in the model \(Y = \beta_{0} + \beta_{1}X\)) is the mean of the baseline group, and the loan_status1 coefficient is the difference in means between the default group and that baseline (we can verify this below).
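
A quick check of this correspondence, comparing the raw group means with the fitted coefficients:

# Mean annual income per loan_status group
my_df |> group_by(loan_status) |> summarise(mean_inc = mean(annual_inc))

# Intercept = mean of the baseline group (0); intercept + slope = mean of group 1
coef(fit)
coef(fit)[1] + coef(fit)[2]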

ANOVA

Recap: t-tests are for assessing the differences in means between two groups. A t-test is a specific case of ANOVA, which is a specific case of a linear model. Let’s run ANOVA, but this time looking for differences in means between more than two groups.

the relationship

Let’s look at the relationship between home_ownership (RENT, OWN, etc.) and annual_inc.

fit <- lm(annual_inc~home_ownership, data=my_df)
anova(fit)

summary()

summary(fit)
#> 
#> Call:
#> lm(formula = annual_inc ~ home_ownership, data = my_df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#>  -73988  -25832   -9176   13424 1983608 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          81892.2      472.5 173.335   <2e-16 ***
#> home_ownershipOTHER -10177.3     5276.3  -1.929   0.0538 .  
#> home_ownershipOWN   -24094.7     1177.9 -20.456   <2e-16 ***
#> home_ownershipRENT  -25716.2      636.8 -40.382   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 51760 on 29087 degrees of freedom
#> Multiple R-squared:  0.05552,    Adjusted R-squared:  0.05542 
#> F-statistic: 569.9 on 3 and 29087 DF,  p-value: < 2.2e-16

interpreting the output

The F-test on the ANOVA table tells us that there is a significant difference in means between the groups. However, the linear model output might not be what we wanted. Because the default handling of categorical variables is to treat the alphabetically first level as the baseline, “MORTGAGE” is treated as the baseline: its mean becomes the intercept, and the coefficients on “OTHER”, “OWN”, and “RENT” describe how those groups’ means differ from the MORTGAGE group.

change levels

What if we wanted “RENT” to be the baseline, followed by other levels? Have a look at ?factor to relevel the factor levels.

# Look at my_df$home_ownership
#my_df$home_ownership

# What happens if we relevel it? Let's see what that looks like.
#relevel(my_df$home_ownership, ref="RENT")

# If we're happy with that, let's change the value of my_df$home_ownership in place
my_df$home_ownership <- relevel(my_df$home_ownership, ref="RENT")

# Or we could do this the dplyr way
my_df <- my_df |> 
  mutate(home_ownership=relevel(home_ownership, ref="RENT"))

Fit model again

# Re-fit the model
fit <- lm(annual_inc~home_ownership, data=my_df)

# Optionally, show the ANOVA table
# anova(fit)

# Print the full model statistics
summary(fit)
#> 
#> Call:
#> lm(formula = annual_inc ~ home_ownership, data = my_df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#>  -73988  -25832   -9176   13424 1983608 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             56176.1      427.0 131.561  < 2e-16 ***
#> home_ownershipMORTGAGE  25716.2      636.8  40.382  < 2e-16 ***
#> home_ownershipOTHER     15538.9     5272.4   2.947  0.00321 ** 
#> home_ownershipOWN        1621.5     1160.4   1.397  0.16230    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 51760 on 29087 degrees of freedom
#> Multiple R-squared:  0.05552,    Adjusted R-squared:  0.05542 
#> F-statistic: 569.9 on 3 and 29087 DF,  p-value: < 2.2e-16

Notice that the p-value on the ANOVA/regression didn’t change, but the coefficients did. RENT is now treated as the baseline. The intercept coefficient (56176.1) is now the mean annual income of the RENT group. The MORTGAGE coefficient of 25716.2 shows the apparent increase in annual_inc that those with a mortgage have compared to those who rent, and that difference is significant (p < 2e-16). You can interpret the other levels in the same way; a quick check against the raw group means is shown below.
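
As before, a quick sanity check comparing the coefficients with the raw group means:

# Mean annual income by home ownership category
my_df |> group_by(home_ownership) |> summarise(mean_inc = mean(annual_inc))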

TukeyHSD

Finally, you can do the typical post-hoc ANOVA procedures on the fit object. For example, the TukeyHSD() function will run Tukey’s test (also known as Tukey’s range test, the Tukey method, Tukey’s honest significance test, Tukey’s HSD test (honest significant difference), or the Tukey-Kramer method). Tukey’s test computes all pairwise mean differences, comparing each group to each other group and identifying any difference between two groups that’s greater than the standard error, while controlling the type I error rate across all the multiple comparisons. First run aov() (not anova()) on the fitted linear model object, then run TukeyHSD() on the resulting analysis of variance fit.

the fit

TukeyHSD(aov(fit))
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#> 
#> Fit: aov(formula = fit)
#> 
#> $home_ownership
#>                     diff        lwr         upr     p adj
#> MORTGAGE-RENT   25716.17  24080.169  27352.1764 0.0000000
#> OTHER-RENT      15538.92   1993.953  29083.8831 0.0169135
#> OWN-RENT         1621.52  -1359.542   4602.5828 0.5009343
#> OTHER-MORTGAGE -10177.25 -23732.176   3377.6673 0.2158710
#> OWN-MORTGAGE   -24094.65 -27120.633 -21068.6713 0.0000000
#> OWN-OTHER      -13917.40 -27699.492   -135.3033 0.0467451

visualise

Finally, let’s visualize the differences in means between these groups with a box plot.

p <- ggplot(my_df, aes(home_ownership, annual_inc)) + geom_boxplot() + theme_classic()

the plot