This presentation provides hands-on instruction covering basic statistical analysis in R: descriptive statistics, t-tests, linear models, and chi-square tests. We will also cover methods for "tidying" model results for downstream visualization and summarization.
library(tidyverse)
library(skimr)
my_df <- readr::read_csv("loan_data_cleaned.csv")
my_df
Notice that the data are already in tidy format - meaning that:
Each variable is a column
Each observation is a row
Each value is in its own cell
A note on characters versus factors: One thing that you immediately notice is that all the categorical variables are read in as character data types. This data type is used for storing strings of text, for example, IDs, names, descriptive text, etc. There's another related data type called factors. Factor variables are used to represent categorical variables with two or more levels, e.g., "RENT" or "OWN" for home_ownership, or "0" versus "1" for loan_status. For the most part, statistical analysis treats these two data types the same. It's often easier to leave categorical variables as characters. However, in some cases you may get a warning message alerting you that a character variable was converted into a factor variable during analysis. Generally, these warnings are nothing to worry about. You can, if you like, convert individual variables to factor variables, or simply use dplyr's
mutate_if
to convert all character vectors to factor variables:
my_df <- my_df |>
  mutate_if(is.character, as.factor) |>
  mutate(loan_status = as.factor(loan_status))
my_df
Recall several built-in functions that are useful for working with data frames.
head(): shows the first few rows
tail(): shows the last few rows
dim(): returns a 2-element vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
nrow(): returns the number of rows
ncol(): returns the number of columns
colnames() (or just names()): returns the column names
glimpse() (from dplyr): returns a glimpse of your data, telling you the structure of the dataset and information about the class, length and content of each column
head(my_df)
tail(my_df)
dim(my_df)
#> [1] 29091 9
names(my_df)
#> [1] "...1" "loan_status" "loan_amnt" "grade"
#> [5] "home_ownership" "annual_inc" "age" "emp_cat"
#> [9] "ir_cat"
glimpse(my_df)
#> Rows: 29,091
#> Columns: 9
#> $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ~
#> $ loan_status <fct> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0~
#> $ loan_amnt <dbl> 5000, 2400, 10000, 5000, 3000, 12000, 9000, 3000, 10000~
#> $ grade <fct> B, C, C, A, E, B, C, B, B, D, C, A, B, A, B, B, B, B, B~
#> $ home_ownership <fct> RENT, RENT, RENT, RENT, RENT, OWN, RENT, RENT, RENT, RE~
#> $ annual_inc <dbl> 24000.00, 12252.00, 49200.00, 36000.00, 48000.00, 75000~
#> $ age <dbl> 33, 31, 24, 39, 24, 28, 22, 22, 28, 22, 23, 27, 30, 24,~
#> $ emp_cat <fct> 0-15, 15-30, 0-15, 0-15, 0-15, 0-15, 0-15, 0-15, 0-15, ~
#> $ ir_cat <fct> 8-11, Missing, 11-13.5, Missing, Missing, 11-13.5, 11-1~
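Since skimr is loaded above, you can also get a quick descriptive overview of every column in a single call (not shown in the original output; a suggestion worth trying):

# One-line-per-variable summary: type, missingness, and basic statistics
skimr::skim(my_df)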
We can also get the class of every column at once with purrr's map_chr() (or base R's sapply()):
map_chr(my_df, class)
#> ...1 loan_status loan_amnt grade home_ownership
#> "numeric" "factor" "numeric" "factor" "factor"
#> annual_inc age emp_cat ir_cat
#> "numeric" "numeric" "factor" "factor"
We can access individual variables within a data frame using the $ operator, e.g., mydataframe$specificVariable. Let's print out all the loan_status values in the data, then see what the unique values are. Then let's calculate the mean, median, and range of the age variable.
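The base R chunk for these steps isn't shown in this extract; a minimal sketch would be:

# Base R: unique loan_status values, then summary statistics for age
unique(my_df$loan_status)
mean(my_df$age)
median(my_df$age)
range(my_df$age)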
# Do the same thing the dplyr way
my_df$loan_status |> unique()
#> [1] 0 1
#> Levels: 0 1
my_df$loan_status |> unique() |> length()
#> [1] 2
To understand the distribution better, we can examine so-called measures of central tendency; which is a fancy way of describing statistics that represent the “middle” of the data. The goal of this is to try to find a “typical” value. Common ways to define the middle of the data include:
The mean: A simple average based on adding together all of the values in the sample set, and then dividing the total by the number of samples.
The median: The value in the middle of the range of all of the sample values.
The mode: The most commonly occurring value in the sample set*.
Let’s calculate these values, along with the minimum and maximum values for comparison, and show them on the histogram.
*Of course, in some sample sets there may be a tie for the most common value, in which case the dataset is described as bimodal or even multimodal.
# Get the mean loan amount using the accessor `$`
mean_loan_amount <- mean(my_df$loan_amnt)

# Get the mean age sequentially using `%>%` and dplyr::pull
mean_age <- my_df %>%
  pull(age) %>%
  mean()
Now let's look at the results:
# Glue it all together
glue(
'Average loan amount: {format(round(mean_loan_amount, 2), nsmall = 2)}
Average age: {format(round(mean_age, 2), nsmall = 2)}'
)
#> Average loan amount: 9593.66
#> Average age: 27.70
How much variability is there in the data? Typical statistics that measure variability in the data include the following (a sketch of the calculations appears after the list):
Range: The difference between the maximum and
minimum. There’s no built-in function for this, but it’s easy to
calculate using the min and max
functions. Another approach would be to use Base R’s
base::range()
which returns a vector containing the
minimum and maximum of all the given arguments.
Wrapping this in base::diff()
will get you well on your way
to finding the range.
Variance: The average of the squared difference from the mean. You can use the built-in var function to find this.
Standard Deviation: The square root of the variance. You can use the built-in sd function to find this.
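The code chunks that produced the outputs below aren't shown in this extract; a minimal sketch of the calculations on the age variable would be:

# Range of ages (minimum and maximum values)
range(my_df$age)
# Width of the range: difference between max and min
diff(range(my_df$age))

# Quantiles (the five-number summary)
quantile(my_df$age)

# Variance and standard deviation
var(my_df$age)
sd(my_df$age)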
Let's start with the range of the age variable (see the help for the min(), max(), and range() functions; e.g., enter ?range without the parentheses to get help).
#> [1] 20 94
The oldest person is 94 and the youngest is 20.
We can also look at the quartiles (use median, or better yet, quantile).
#> 0% 25% 50% 75% 100%
#> 20 23 26 30 94
This output displays the five-number summary of age.
The variance and standard deviation of age:
#> [1] 38.37329
#> [1] 6.194617
NB: the variance returned by R's var() is the sample variance, \[s^2 = \frac{\sum(x-\bar{x})^2}{n-1}\] For large samples this is nearly identical to the population variance, which divides by \(n\) rather than \(n-1\).
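To see the difference in practice (a quick illustration, not part of the original analysis):

# var() uses the sample formula (denominator n - 1);
# rescale by (n - 1)/n to obtain the population variance (denominator n)
n <- length(my_df$age)
var(my_df$age)
var(my_df$age) * (n - 1) / n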
Of these statistics, the standard deviation is generally the most useful. It provides a measure of variance in the data on the same scale as the data itself. The higher the standard deviation, the more variance there is when comparing values in the distribution to the distribution mean - in other words, the data is more spread out.
When working with a normal distribution, the standard deviation works with the particular characteristics of a normal distribution to provide even greater insight. This can be summarized using the 68–95–99.7 rule, also known as the empirical rule, which is described as follows (a quick numerical check appears after the list):
In any normal distribution:
Approximately 68.26% of values fall within one standard deviation from the mean.
Approximately 95.45% of values fall within two standard deviations from the mean.
Approximately 99.73% of values fall within three standard deviations from the mean.
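A quick numerical check of the rule (a sketch using simulated normal data rather than the loan data, which is not normal):

# Proportion of simulated normal values within 1, 2, and 3 standard deviations of the mean
set.seed(123)
x <- rnorm(1e5)
mean(abs(x - mean(x)) <= 1 * sd(x))  # approximately 0.68
mean(abs(x - mean(x)) <= 2 * sd(x))  # approximately 0.95
mean(abs(x - mean(x)) <= 3 * sd(x))  # approximately 0.997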
A lot of data science is rooted in statistics, so we’ll explore some basic statistical techniques.
Note: This is not intended to teach you statistics - that’s much too big a topic for this notebook. It will however introduce you to some statistical concepts and techniques that data scientists use as they explore data in preparation for machine learning modeling.
When examining a variable (for example, a sample of loan amounts), data scientists are particularly interested in its distribution (in other words, how all the different loan values are spread across the sample). The starting point for this exploration is often to visualize the data as a histogram and see how frequently each value for the variable occurs.
# Get the variable to examine
var_data <- my_df %>% select(loan_amnt)

# Plot a histogram
var_data %>%
  ggplot(aes(x = loan_amnt)) +
geom_histogram(fill = "midnightblue", alpha = 0.7, boundary = 0.5)+
ggtitle('Data Distribution')+
xlab('Value')+
ylab('Frequency')+
theme(plot.title = element_text(hjust = 0.5))
library(patchwork)
p1 <- ggplot(my_df, aes(loan_amnt)) + geom_histogram(bins = 30)
p2 <- ggplot(my_df, aes(age)) + geom_histogram(bins = 30)
Histograms are important for inspecting the distribution of the data.
library(statip)
# Get the variable to examine
var <- my_df %>% select(loan_amnt)

# Get statistics
min_val <- min(var$loan_amnt)
max_val <- max(var$loan_amnt)
mean_val <- mean(var$loan_amnt)
med_val <- median(var$loan_amnt)
mod_val <- mfv(var$loan_amnt)

# Print the stats
glue::glue(
'Minimum: {format(round(min_val, 2), nsmall = 2)}
Mean: {format(round(mean_val, 2), nsmall = 2)}
Median: {format(round(med_val, 2), nsmall = 2)}
Mode: {format(round(mod_val, 2), nsmall = 2)}
Maximum: {format(round(max_val, 2), nsmall = 2)}'
)
#> Minimum: 500.00
#> Mean: 9593.66
#> Median: 8000.00
#> Mode: 10000.00
#> Maximum: 35000.00
# Plot a histogram
var %>%
  ggplot(aes(x = loan_amnt)) +
geom_histogram( fill = "midnightblue", alpha = 0.7, boundary = 0.5)+
# Add lines for the statistics
geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3)+
geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3)+
geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 )+
geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 )+
geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 )+
# Add titles and labels
ggtitle('Data Distribution')+
xlab('Value')+
ylab('Frequency')+
theme(plot.title = element_text(hjust = 0.5))
Another way to visualize the distribution of a variable is to use a box plot (sometimes called a box-and-whiskers plot). Let's create one for the loan amount data.
# Get the variable to examine
var <- my_df %>% select(loan_amnt)

# Plot a box plot
var %>%
  ggplot(aes(x = 1, y = loan_amnt)) +
geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7)+
# Add titles and labels
ggtitle("Data Distribution")+
xlab("")+
ylab("loan_amnt")+
theme(plot.title = element_text(hjust = 0.5))
The box plot shows the distribution of the loan amount values in a different format from the histogram. The box part of the plot shows where the middle 50% of the data (the inner two quartiles) resides, and the line in the box indicates the median value. The whiskers extending from the box cover (by ggplot2's default) the points within 1.5 times the interquartile range of the box; anything beyond that is drawn as an outlier.
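You can check the numbers behind the box directly; a quick sketch:

# Quartiles of loan_amnt (the box spans the 25th to 75th percentile, with the line at the median)
quantile(my_df$loan_amnt)
# Width of the box: the interquartile range
IQR(my_df$loan_amnt)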
# Create a function that we can reuse
show_distribution <- function(var_data, binwidth) {
  library(ggplot2)
  library(dplyr)
  library(glue)
  library(patchwork)

  # Get statistics
  min_val <- min(pull(var_data))
  max_val <- max(pull(var_data))
  mean_val <- mean(pull(var_data))
  med_val <- median(pull(var_data))
  mod_val <- statip::mfv(pull(var_data))

  # Print the stats
  stats <- glue::glue(
    'Minimum: {format(round(min_val, 2), nsmall = 2)}
    Mean: {format(round(mean_val, 2), nsmall = 2)}
    Median: {format(round(med_val, 2), nsmall = 2)}
    Mode: {format(round(mod_val, 2), nsmall = 2)}
    Maximum: {format(round(max_val, 2), nsmall = 2)}'
  )

  # Plot the histogram
  hist_gram <- var_data %>%
    ggplot(aes(x = pull(var_data))) +
    geom_histogram(fill = "midnightblue", alpha = 0.7, boundary = 0.4) +
    # Add lines for the statistics
    geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3) +
    geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3) +
    geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3) +
    geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3) +
    geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3) +
    # Add titles and labels
    ggtitle('Data Distribution') +
    xlab('') +
    ylab('Frequency') +
    theme(plot.title = element_text(hjust = 0.5))

  # Plot the box plot
  bx_plt <- var_data %>%
    ggplot(aes(x = pull(var_data), y = 1)) +
    geom_boxplot(fill = "#E69F00", color = "gray23", alpha = 0.7) +
    # Add titles and labels
    xlab("Value") +
    ylab("") +
    theme(plot.title = element_text(hjust = 0.5))

  # To return multiple outputs, use a `list`
  return(list(stats,
              hist_gram / bx_plt)  # End of returned outputs
  )
}  # End of function
In most real-world cases, it’s easier to consider outliers as being
values that fall below or above percentiles within which most of the
data lie. For example, the following code uses the
stats::quantile()
function to exclude observations below the 1st percentile (the value above which 99% of the data reside).
# Produce a quantile corresponding to 1%
q01 <- my_df$loan_amnt %>%
  quantile(probs = 1/100, names = FALSE)

# Get the variable to examine
col <- my_df %>%
  select(loan_amnt) %>%
  filter(loan_amnt > q01)
# Call the function
show_distribution(var_data = col, binwidth = 0.5)
#> [[1]]
#> Minimum: 1050.00
#> Mean: 9681.53
#> Median: 8000.00
#> Mode: 10000.00
#> Maximum: 35000.00
#>
#> [[2]]
Let’s do a few two-sample t-tests to test for differences in
means between two groups. The function for a t-test is
t.test()
. See the help for ?t.test
. We’ll be
using the formula method. The usage is
t.test(response~group, data=myDataFrame)
.
t.test(age~loan_status, data=my_df)
#>
#> Welch Two Sample t-test
#>
#> data: age by loan_status
#> t = 2.8422, df = 4113.6, p-value = 0.004503
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> 0.1002091 0.5458840
#> sample estimates:
#> mean in group 0 mean in group 1
#> 27.73395 27.41091
t.test(loan_amnt~loan_status, data=my_df)
#>
#> Welch Two Sample t-test
#>
#> data: loan_amnt by loan_status
#> t = 1.9199, df = 4035.6, p-value = 0.05494
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> -4.882188 465.910368
#> sample estimates:
#> mean in group 0 mean in group 1
#> 9619.234 9388.720
t.test(annual_inc~loan_status, data=my_df)
#>
#> Welch Two Sample t-test
#>
#> data: annual_inc by loan_status
#> t = 10.033, df = 4424.8, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> 7055.507 10482.649
#> sample estimates:
#> mean in group 0 mean in group 1
#> 67937.63 59168.55
A note on one-tailed versus two-tailed tests: A two-tailed test is almost always more appropriate. The hypothesis you're testing here is spelled out in the results ("alternative hypothesis: true difference in means is not equal to 0"). If the p-value is very low, you can reject the null hypothesis that there's no difference in means. Because you typically don't know a priori whether the difference in means will be positive or negative (e.g., we don't know a priori whether borrowers who default would be expected to earn more or less than those who don't), we want to do the two-tailed test. However, if we only wanted to test a very specific directionality of effect, we could use a one-tailed test and specify which direction we expect. This is more powerful if we "get it right", but much less powerful for the opposite effect. Notice how the p-value changes depending on how we specify the hypothesis. Again, the two-tailed test is almost always more appropriate.
# Two tailed
t.test(annual_inc~loan_status, data=my_df)
# Difference in means is >0
t.test(annual_inc~loan_status, data=my_df, alternative="greater")
# Difference in means is <0
t.test(annual_inc~loan_status, data=my_df, alternative="less")
A note on paired versus unpaired t-tests: The t-test we performed here was an unpaired test: the defaulters and non-defaulters are different people, so an unpaired test is appropriate. An alternative design might be when data are derived from samples that have been measured at two different time points or locations, e.g., before versus after treatment, left versus right hand, etc. In this case, a paired t-test would be more appropriate. A paired test takes into consideration the intra- and inter-subject variability, and is more powerful than the unpaired test. See the help for
?t.test
for more information on how to do this.
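As a sketch of the syntax (the before/after vectors here are made up for illustration; they are not part of the loan data):

# Paired t-test: each subject measured twice, before and after some treatment
before <- c(200, 185, 210, 190, 205)
after <- c(195, 180, 200, 188, 199)
t.test(before, after, paired = TRUE)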
Another assumption of the t-test is that data is normally distributed. Looking at the histogram for age shows that this data clearly isn’t.
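Besides the histogram, a quantile-quantile plot is a common visual check of normality (a sketch, not in the original):

# QQ plot of age against a theoretical normal distribution;
# systematic departure from the line indicates non-normality
ggplot(my_df, aes(sample = age)) +
  stat_qq() +
  stat_qq_line()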
The Wilcoxon rank-sum test (a.k.a. Mann-Whitney U test) is a nonparametric test for a difference in location (a shift in the distribution) that does not require normally distributed data. When data is perfectly normal, the t-test is uniformly more powerful. But when this assumption is violated, the t-test is unreliable. This test is called in a similar way to the t-test.
wilcox.test(age~loan_status, data=my_df)
#>
#> Wilcoxon rank sum test with continuity correction
#>
#> data: age by loan_status
#> W = 43348257, p-value = 0.0003112
#> alternative hypothesis: true location shift is not equal to 0
The results are still significant; in fact, the p-value here is smaller than the one reported for the (questionable) t-test above. Also note in the help for ?wilcox.test that there's a paired option here too.
Where t-tests and their nonparametric substitutes are used for assessing the differences in means between two groups, ANOVA is used to assess the significance of differences in means between multiple groups. In fact, a t-test is just a specific case of ANOVA when you only have two groups. And both t-tests and ANOVA are just specific cases of linear regression, where you're trying to fit a model describing how a continuous outcome (e.g., annual_inc) changes with some predictor variable (e.g., loan_status, home_ownership, age, etc.). The distinction is largely semantic - with a linear model you're asking, "do levels of a categorical variable affect the response?" where with ANOVA or t-tests you're asking, "does the mean response differ between levels of a categorical variable?"
Let’s examine the relationship between annual_inc and loan_status. Let’s first do this with a t-test, and for now, let’s assume that the variances between groups are equal.
t.test(annual_inc~loan_status, data=my_df, var.equal=TRUE)
#>
#> Two Sample t-test
#>
#> data: annual_inc by loan_status
#> t = 8.8318, df = 29089, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#> 6822.958 10715.199
#> sample estimates:
#> mean in group 0 mean in group 1
#> 67937.63 59168.55
Now, let’s do the same test in a linear modeling framework. First,
let’s create the fitted model and store it in an object called
fit
.
fit <- lm(annual_inc~loan_status, data=my_df)
You can display the object itself, but that isn’t too interesting.
You can get the more familiar ANOVA table by calling the
anova()
function on the fit
object. More
generally, the summary()
function on a linear model object
will tell you much more. (Note this is different from dplyr’s
summarize function).
fit
#>
#> Call:
#> lm(formula = annual_inc ~ loan_status, data = my_df)
#>
#> Coefficients:
#> (Intercept) loan_status1
#> 67938 -8769
anova(fit)
summary(fit)
#>
#> Call:
#> lm(formula = annual_inc ~ loan_status, data = my_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -63938 -27938 -9938 12831 1971846
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 67937.6 330.7 205.441 <2e-16 ***
#> loan_status1 -8769.1 992.9 -8.832 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 53180 on 29089 degrees of freedom
#> Multiple R-squared: 0.002674, Adjusted R-squared: 0.00264
#> F-statistic: 78 on 1 and 29089 DF, p-value: < 2.2e-16
Go back and re-run the t-test assuming equal variances as we did before. Now notice a few things:
t.test(annual_inc~loan_status, data=my_df, var.equal=TRUE)
The t.test() output shows you the means for the two groups, default and no default. Just displaying the fit object itself or running summary(fit) shows you the coefficients for a linear model. Here, the model treats the first level of loan_status (0, no default) as the "baseline", and the intercept in a regression model (e.g., \(\beta_{0}\) in the model \(Y = \beta_{0} + \beta_{1}X\)) is the mean of that baseline group; the loan_status1 coefficient is the difference in means between the two groups.
Recap: t-tests are for assessing the differences in means between two groups. A t-test is a specific case of ANOVA, which is a specific case of a linear model. Let's run ANOVA, but this time looking for differences in means between more than two groups.
Let's look at the relationship between home_ownership (RENT, OWN, etc.) and annual_inc.
fit <- lm(annual_inc~home_ownership, data=my_df)
anova(fit)
summary(fit)
#>
#> Call:
#> lm(formula = annual_inc ~ home_ownership, data = my_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -73988 -25832 -9176 13424 1983608
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 81892.2 472.5 173.335 <2e-16 ***
#> home_ownershipOTHER -10177.3 5276.3 -1.929 0.0538 .
#> home_ownershipOWN -24094.7 1177.9 -20.456 <2e-16 ***
#> home_ownershipRENT -25716.2 636.8 -40.382 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 51760 on 29087 degrees of freedom
#> Multiple R-squared: 0.05552, Adjusted R-squared: 0.05542
#> F-statistic: 569.9 on 3 and 29087 DF, p-value: < 2.2e-16
The F-test on the ANOVA table tells us that there is a significant difference in means between the groups. However, the linear model output might not have been what we wanted. Because the default handling of categorical variables is to treat the alphabetically first level as the baseline, "MORTGAGE" is treated as baseline, its mean becomes the intercept, and the coefficients on "OTHER", "OWN", and "RENT" describe how those groups' means differ from the MORTGAGE group.
What if we wanted “RENT” to be the baseline, followed by other
levels? Have a look at ?factor
to relevel the factor
levels.
# Look at my_df$home_ownership
#my_df$home_ownership
# What happens if we relevel it? Let's see what that looks like.
#relevel(my_df$home_ownership, ref="RENT")
# If we're happy with that, let's change the value of my_df$home_ownership in place
my_df$home_ownership <- relevel(my_df$home_ownership, ref="RENT")

# Or we could do this the dplyr way
my_df <- my_df |>
  mutate(home_ownership = relevel(home_ownership, ref="RENT"))
# Re-fit the model
fit <- lm(annual_inc~home_ownership, data=my_df)
# Optionally, show the ANOVA table
# anova(fit)
# Print the full model statistics
summary(fit)
#>
#> Call:
#> lm(formula = annual_inc ~ home_ownership, data = my_df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -73988 -25832 -9176 13424 1983608
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 56176.1 427.0 131.561 < 2e-16 ***
#> home_ownershipMORTGAGE 25716.2 636.8 40.382 < 2e-16 ***
#> home_ownershipOTHER 15538.9 5272.4 2.947 0.00321 **
#> home_ownershipOWN 1621.5 1160.4 1.397 0.16230
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 51760 on 29087 degrees of freedom
#> Multiple R-squared: 0.05552, Adjusted R-squared: 0.05542
#> F-statistic: 569.9 on 3 and 29087 DF, p-value: < 2.2e-16
Notice that the p-value on the ANOVA/regression didn't change, but the coefficients did. RENT is now treated as baseline. The intercept coefficient (56176.1) is now the mean for the RENT group. The MORTGAGE coefficient of 25716.2 shows the apparent increase in annual_inc for those with a mortgage compared to those who rent, and that difference is significant (p < 2e-16). You can read the other levels the same way.
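A quick way to check these interpretations is to compare the coefficients against the raw group means (a sketch, not in the original):

# Mean annual income by home ownership: the RENT mean should match the intercept,
# and each other group mean should equal the intercept plus that group's coefficient
my_df %>%
  group_by(home_ownership) %>%
  summarize(mean_annual_inc = mean(annual_inc))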
Finally, you can do the typical post-hoc ANOVA procedures on the fit
object. For example, the TukeyHSD()
function will run Tukey’s
test (also known as Tukey’s range test, the Tukey
method, Tukey’s honest significance test, Tukey’s HSD
test (honest significant difference), or the Tukey-Kramer
method). Tukey's test computes all pairwise mean differences, comparing each group to each other group, identifying any
difference between two groups that’s greater than the standard error,
while controlling the type I error for all multiple comparisons. First
run aov()
(not anova()
) on
the fitted linear model object, then run TukeyHSD()
on the
resulting analysis of variance fit.
TukeyHSD(aov(fit))
#> Tukey multiple comparisons of means
#> 95% family-wise confidence level
#>
#> Fit: aov(formula = fit)
#>
#> $home_ownership
#> diff lwr upr p adj
#> MORTGAGE-RENT 25716.17 24080.169 27352.1764 0.0000000
#> OTHER-RENT 15538.92 1993.953 29083.8831 0.0169135
#> OWN-RENT 1621.52 -1359.542 4602.5828 0.5009343
#> OTHER-MORTGAGE -10177.25 -23732.176 3377.6673 0.2158710
#> OWN-MORTGAGE -24094.65 -27120.633 -21068.6713 0.0000000
#> OWN-OTHER -13917.40 -27699.492 -135.3033 0.0467451
Finally, let's visualize the differences in means between these groups with a box plot of annual_inc by home_ownership.
p <- ggplot(my_df, aes(home_ownership, annual_inc)) + geom_boxplot() + theme_classic()