As we move from week to week and task to task, the code that you have already completed will stay in the template but will not run. This is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries need to be loaded in this program as well.
# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the package, comment this line out by entering a # in front of the command
#install.packages("scales") # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2) # Load ggplot2 library
library(scales) # Load scales library
library(moments)
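As an alternative to commenting the install lines in and out by hand, the following is a minimal sketch that checks which packages are missing and installs only those. The helper name `missing_packages` is hypothetical and not part of the template.

```r
# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "scales", "moments")
to_install <- missing_packages(needed)
to_install  # empty when everything is already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that knitting the file never triggers a download unexpectedly.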
The working directory also needs to be set so that the program can find the data file. This is addressed here.
# Check the current working directory
getwd()
## [1] "C:/Users/cdaniels/OneDrive - National University/Documents"
# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look similar to this: setwd("C:/Users/tsapara/Documents")
# Set the working directory to where the data file is located
# This ensures the program can access the file correctly
setwd("C:/Users/cdaniels/OneDrive - National University/Documents")
### Choose an already existing directory in your computer.
# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts strings to factors automatically
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
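If `read.csv()` fails with a "cannot open file" error, the file is usually not in the working directory. The sketch below shows one way to check for the file before reading it. It assumes `tempdir()` as a stand-in for your own folder and writes a tiny demo file so that it runs anywhere; `data_dir`, `data_path`, and `demo` are illustrative names only.

```r
# Assumption: tempdir() stands in for the folder where you saved data.csv
data_dir  <- tempdir()
data_path <- file.path(data_dir, "data.csv")

# Create a tiny stand-in file so the sketch runs on any machine
write.csv(data.frame(ID = 1:3, Gender = c("Female", "Male", "")),
          data_path, row.names = FALSE)

# Stop early with a readable message if the file is not where we expect it
if (!file.exists(data_path)) {
  stop("data.csv not found in: ", data_dir)
}
demo <- read.csv(data_path, header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(demo)
```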
##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility
# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset
write.csv(df, file = "data.csv", row.names = FALSE) # This command may take some time to run. Once it is done, it will create the desired data file locally in your directory
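To see why the seed matters, here is a small sketch on a made-up data frame (`demo_df` is hypothetical). It uses the seed 133, which is the value of A + B above, and shows that resetting the seed reproduces the same sample.

```r
# Made-up data frame standing in for the real dataset
demo_df <- data.frame(ID = 1:20)

set.seed(133)  # 34 + 99, the Randomizer value from the template
first_draw <- sample(nrow(demo_df), 5, replace = TRUE)

set.seed(133)  # resetting the seed restarts the random sequence
second_draw <- sample(nrow(demo_df), 5, replace = TRUE)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Note that the template writes the sample back to data.csv, which replaces the original file. Re-running that chunk therefore samples from the already sampled data, so you may want to keep a backup copy of the original file.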
As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML will open for you to review once it is done.
It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML
In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
Step 0. Now that you have read the file, you want to learn a few things about your data.
The following commands will not be explained here. Do your research, review your CSV file, and answer the questions related to this part of your code.
# Basic exploratory commands
nrow(df) # Number of rows in the dataset
## [1] 500
length(df) # Number of columns (or variables) in the dataset
## [1] 17
str(df) # Structure of the dataset (data types and a preview)
## 'data.frame': 500 obs. of 17 variables:
## $ ID : int 952 977 1085 1004 990 1077 6 1089 29 1036 ...
## $ Gender : Factor w/ 3 levels "Female","Male",..: 2 2 2 1 2 2 2 1 3 1 ...
## $ Age : int 30 30 29 25 32 29 34 26 NA 27 ...
## $ Height : int 180 180 175 165 175 175 190 160 165 168 ...
## $ Weight : int 75 75 70 58 70 70 90 55 56 65 ...
## $ Education : Factor w/ 5 levels "","Bachelor's",..: 4 4 5 2 5 5 4 1 5 2 ...
## $ Income : int NA 60000 60000 45000 60000 60000 60000 48000 55000 45000 ...
## $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 3 2 1 3 2 2 2 3 3 3 ...
## $ Employment : Factor w/ 3 levels "","Employed",..: 2 2 2 1 2 2 2 2 3 2 ...
## $ Score : num 7.5 7.5 7.8 6.2 7.8 7.8 8.2 6.1 7.8 6.2 ...
## $ Category : Factor w/ 4 levels "","A","B","C": 4 4 4 3 4 4 3 3 3 3 ...
## $ Color : Factor w/ 3 levels "Art","Music",..: 2 2 3 2 3 3 2 1 2 1 ...
## $ Hobby : Factor w/ 3 levels "Blue","Green",..: 1 1 3 2 3 3 1 1 1 1 ...
## $ Happiness : Factor w/ 5 levels "","Photography",..: 3 3 5 5 5 5 5 4 5 3 ...
## $ Location : num 8 8 9 7 9 9 7.5 7 8.5 7 ...
## $ X : Factor w/ 4 levels "","City","Rural",..: 2 2 4 3 4 4 4 3 4 2 ...
## $ X.1 : logi NA NA NA NA NA NA ...
summary(df) # Summary statistics for each column
## ID Gender Age Height Weight
## Min. : 1.00 Female:189 Min. :25.00 Min. :155.0 Min. :54.0
## 1st Qu.: 48.75 Male :212 1st Qu.:27.00 1st Qu.:165.0 1st Qu.:65.0
## Median : 991.50 Other : 99 Median :28.00 Median :175.0 Median :70.0
## Mean : 717.23 Mean :28.46 Mean :173.2 Mean :70.5
## 3rd Qu.:1060.25 3rd Qu.:30.00 3rd Qu.:182.0 3rd Qu.:80.0
## Max. :1117.00 Max. :34.00 Max. :190.0 Max. :90.0
## NA's :37 NA's :22 NA's :40
## Education Income MaritalStatus Employment
## : 18 Min. :32000 : 22 : 15
## Bachelor's :178 1st Qu.:45000 Married:194 Employed :373
## High School:103 Median :49000 Single :284 Unemployed:112
## Master's :105 Mean :51762
## PhD : 94 3rd Qu.:60000
## NA's : 2 Max. :70000
## NA's :84
## Score Category Color Hobby Happiness
## Min. :5.500 : 8 Art :177 Blue :209 : 2
## 1st Qu.:6.100 A:160 Music :141 Green:138 Photography:101
## Median :6.200 B:193 Sports:182 Red :153 Reading :139
## Mean :6.745 C:139 Swimming : 96
## 3rd Qu.:7.500 Traveling :162
## Max. :8.900
## NA's :63
## Location X X.1
## Min. :6.000 : 3 Mode:logical
## 1st Qu.:7.000 City :205 NA's:500
## Median :7.000 Rural :186
## Mean :7.521 Suburb:106
## 3rd Qu.:8.500
## Max. :9.000
## NA's :20
Please answer the following questions, by typing information after the question.
Question 1
What type of variables does your file include?
Answer 1:
Question 2
Specific data types?
Answer 2:
Question 3
Are they read properly?
Answer 3:
Question 4
Are there any issues?
Answer 4:
Question 5
Does your file include both NAs and blanks?
Answer 5:
Question 6
How many NAs do you have?
Answer 6:
Question 7
How many blanks?
Answer 7:
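The distinction between NAs and blanks can be checked in code. The following is a minimal sketch on a made-up data frame (`toy` and its columns are hypothetical) that counts both per column. The same two lines should work on `df`, but verify the results against your CSV file.

```r
# Made-up data with both true NAs and empty strings
toy <- data.frame(
  Age    = c(30, NA, 29, 25),
  Gender = factor(c("Male", "", "Female", NA)),
  City   = factor(c("", "", "Rural", "City"))
)

# NAs are values R already treats as missing
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Blanks are empty strings that R still treats as valid values
blank_counts <- sapply(toy, function(x) sum(!is.na(x) & as.character(x) == ""))

na_counts
blank_counts
```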
factor_columns <- names(df)[sapply(df, function(x) is.character(x) | is.factor(x))]
if (length(factor_columns) > 0) {
df[factor_columns] <- lapply(df[factor_columns], as.factor)
}
names(df)
## [1] "ID" "Gender" "Age" "Height"
## [5] "Weight" "Education" "Income" "MaritalStatus"
## [9] "Employment" "Score" "Category" "Color"
## [13] "Hobby" "Happiness" "Location" "X"
## [17] "X.1"
factor_columns
## [1] "Gender" "Education" "MaritalStatus" "Employment"
## [5] "Category" "Color" "Hobby" "Happiness"
## [9] "X"
###################################################################
# Step X: Cleanup and variable typing
###################################################################
# --- Define categorical variables (strings/factors) ---
factor_columns <- c("Gender", "Education", "MaritalStatus",
"Employment", "Category", "Color",
"Hobby", "Location")
# Keep only ones that exist in df
factor_columns <- factor_columns[factor_columns %in% names(df)]
# Convert them to factors
if (length(factor_columns) > 0) {
df[factor_columns] <- lapply(df[factor_columns], as.factor)
}
# --- Define numeric variables ---
numeric_columns <- c("Age", "Height", "Weight", "Income", "Score", "Happiness", "Rating")
# Keep only ones that exist in df
numeric_columns <- numeric_columns[numeric_columns %in% names(df)]
# Convert to numeric (if not already)
if (length(numeric_columns) > 0) {
df[numeric_columns] <- lapply(df[numeric_columns], function(x) as.numeric(as.character(x)))
}
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
# --- Confirm results ---
cat("Categorical variables:\n")
## Categorical variables:
print(factor_columns)
## [1] "Gender" "Education" "MaritalStatus" "Employment"
## [5] "Category" "Color" "Hobby" "Location"
cat("\nNumeric variables:\n")
##
## Numeric variables:
print(numeric_columns)
## [1] "Age" "Height" "Weight" "Income" "Score" "Happiness"
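The warning "NAs introduced by coercion" appears when a column holds text that cannot be read as a number. The sketch below, on a made-up factor, shows what `as.numeric(as.character(x))` does with numeric-looking and non-numeric labels, and why calling `as.numeric()` directly on a factor is not a substitute.

```r
# Made-up factor mixing numeric-looking labels with a text label
f <- factor(c("7.5", "8", "Reading"))

# Labels that are not numbers become NA (this is what triggers the warning)
converted <- suppressWarnings(as.numeric(as.character(f)))
converted

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(f)
codes
```

If an entire column turns into NA after this conversion, it is worth checking whether the column name matches the values stored in it.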
Step 1: Handling both blanks and NAs at once is not simple, so first we want to eliminate one of them. Let’s change the blanks to NAs.
# Define which variables *could* be factors
factor_columns <- c("Gender", "Education", "MaritalStatus",
"Employment", "Category", "Color",
"Hobby", "Location")
# Keep only the ones that exist in df
factor_columns <- factor_columns[factor_columns %in% names(df)]
# Only convert if there are any left
if (length(factor_columns) > 0) {
df[factor_columns] <- lapply(df[factor_columns], as.factor)
}
cat("Factor columns actually found in df:\n")
## Factor columns actually found in df:
print(factor_columns)
## [1] "Gender" "Education" "MaritalStatus" "Employment"
## [5] "Category" "Color" "Hobby" "Location"
# Step 1: Replace blanks with NAs
df[df == ""] <- NA
# Define possible categorical columns
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus",
"Category", "Employment", "Color", "Hobby", "Location")
# Keep only those that actually exist in df
factor_columns <- factor_columns[factor_columns %in% names(df)]
# Convert them to factors (only if any exist)
if (length(factor_columns) > 0) {
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
}
# Debug: print which factor columns were found
cat("Factor columns in df:\n")
## Factor columns in df:
print(factor_columns)
## [1] "Gender" "Education" "MaritalStatus" "Category"
## [5] "Employment" "Color" "Hobby" "Location"
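The reason the code converts each column with `as.factor(as.character(col))` after replacing blanks can be seen in a small sketch (`toy` is a made-up data frame): replacing a blank with NA leaves the empty level behind, and rebuilding the factor drops it.

```r
# Made-up factor column with one blank entry
toy <- data.frame(Category = factor(c("A", "", "B", "A")))

# Replace blanks with NA, as in Step 1
toy[toy == ""] <- NA
levels_before <- levels(toy$Category)  # the empty level "" is still listed
levels_before

# Rebuild the factor so that unused levels disappear
toy$Category <- as.factor(as.character(toy$Category))
levels(toy$Category)
```

Base R's `droplevels()` is another way to remove unused levels.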
Step 2: Count NAs in the entire dataset
#
# Step 2: Count NAs in the entire dataset
# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 1334
Please answer the following questions, by typing information after the question.
Question 8
Explain what the printed number is, what information it conveys, and how you can use it in your analysis.
Answer 8:
Step 3: Count rows with NAs.
#
# Step 3: Count rows with NAs
#
# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 500
Percent_row_NA
## [1] "100%"
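A result of 100% usually means that at least one column is missing in every row. The sketch below uses a made-up data frame (`toy`, with a hypothetical all-NA column `empty`) to show how a single empty column drives the row percentage to 100%, and how the figure changes once such columns are set aside.

```r
# Made-up data: one partly missing column and one completely empty column
toy <- data.frame(
  a     = c(1, NA, 3, 4),
  b     = c("x", "y", "z", "w"),
  empty = NA
)

# Share of rows with at least one NA, all columns included
share_all <- mean(rowSums(is.na(toy)) > 0)
share_all

# Identify columns that are NA in every row and set them aside
all_na_cols <- colSums(is.na(toy)) == nrow(toy)
trimmed <- toy[, !all_na_cols, drop = FALSE]

# Share of rows with at least one NA among the remaining columns
share_trimmed <- mean(rowSums(is.na(trimmed)) > 0)
share_trimmed
```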
Question 9
How large is the proportion of rows with NAs? Keep in mind that we can drop up to 5% of the rows.
Answer 9:
Question 10
Do you think it would be wise to drop the above percentage?
Answer 10:
Question 11
How will this affect your dataset?
Answer 11:
Step 4: Count columns with NAs
#
# Step 4: Count columns with NAs
# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 13
Percent_col_NA
## [1] "76%"
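The count of affected columns does not show how badly each column is affected. As a minimal sketch on made-up data (`toy` is hypothetical), the NA count and NA percentage per column can be listed like this:

```r
# Made-up data frame with different amounts of missing data per column
toy <- data.frame(
  Age   = c(30, NA, 29, NA),
  Score = c(7.5, 7.8, NA, 6.2),
  ID    = 1:4
)

na_per_column <- colSums(is.na(toy))               # NA count per column
na_share <- round(100 * colMeans(is.na(toy)), 1)   # NA percentage per column

sort(na_share, decreasing = TRUE)  # most affected columns first
```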
Question 12
How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?
Answer 12:
Question 13
How will this affect your dataset?
Answer 13:
Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)
In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that using the mean of the variable for numeric data and the mode for categorical data is correct. This is not always the case, of course, but the code would become much more complicated otherwise.
# Step 5: Replace NAs with appropriate values
df <- lapply(df, function(col) {
if (is.numeric(col) || is.integer(col)) {
non_na_count <- sum(!is.na(col))
if (non_na_count > 10) {
# Plenty of data → use mean
col[is.na(col)] <- mean(col, na.rm = TRUE)
} else if (non_na_count >= 2) {
# Small sample → interpolate linearly at the missing positions (ends use the nearest value)
col[is.na(col)] <- approx(seq_along(col), col, xout = which(is.na(col)), rule = 2)[["y"]]
} else if (non_na_count == 1) {
# Only one value → just use that value
col[is.na(col)] <- unique(na.omit(col))
} else {
# All values NA → leave as NA (or could set to 0)
col[is.na(col)] <- NA
}
} else if (is.factor(col)) {
# Mode for factors
mode_val <- names(sort(-table(col)))[1]
col[is.na(col)] <- mode_val
} else if (is.character(col)) {
# Replace missing strings with "NA"
col[is.na(col)] <- "NA"
}
return(col)
})
df <- as.data.frame(df)
# Check updated dataset
summary(df)
## ID Gender Age Height Weight
## Min. : 1.00 Female:189 Min. :25.00 Min. :155.0 Min. :54.0
## 1st Qu.: 48.75 Male :212 1st Qu.:27.00 1st Qu.:165.0 1st Qu.:65.0
## Median : 991.50 Other : 99 Median :28.00 Median :175.0 Median :70.0
## Mean : 717.23 Mean :28.46 Mean :173.2 Mean :70.5
## 3rd Qu.:1060.25 3rd Qu.:30.00 3rd Qu.:182.0 3rd Qu.:80.0
## Max. :1117.00 Max. :34.00 Max. :190.0 Max. :90.0
##
## Education Income MaritalStatus Employment
## Bachelor's :198 Min. :32000 Married:194 Employed :388
## High School:103 1st Qu.:45000 Single :306 Unemployed:112
## Master's :105 Median :51762
## PhD : 94 Mean :51762
## 3rd Qu.:60000
## Max. :70000
##
## Score Category Color Hobby Happiness Location
## Min. :5.500 A:160 Art :177 Blue :209 Min. : NA 7 :176
## 1st Qu.:6.100 B:201 Music :141 Green:138 1st Qu.: NA 6 : 77
## Median :6.700 C:139 Sports:182 Red :153 Median : NA 9 : 77
## Mean :6.745 Mean :NaN 8 : 74
## 3rd Qu.:7.500 3rd Qu.: NA 8.5 : 61
## Max. :8.900 Max. : NA 7.5 : 15
## NA's :500 (Other): 20
## X X.1
## : 0 Mode:logical
## City :208 NA's:500
## Rural :186
## Suburb:106
##
##
##
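In the summaries above, the Income median moved from 49000 to 51762, which is the mean, once the NAs were filled in. The sketch below reproduces this kind of effect on a made-up vector and adds a small mode helper for factors. `mode_value` is a hypothetical name, and the helper is a simplified stand-in for the mode line in Step 5, not the same code.

```r
# Made-up numeric vector with two missing values
x <- c(10, 20, 30, NA, NA, 100)

# Mean imputation: fill every NA with the mean of the observed values
imputed <- x
imputed[is.na(imputed)] <- mean(x, na.rm = TRUE)

mean(x, na.rm = TRUE); mean(imputed)      # the mean is unchanged
median(x, na.rm = TRUE); median(imputed)  # the median moves toward the mean
sd(x, na.rm = TRUE); sd(imputed)          # the spread shrinks

# Hypothetical helper: most frequent level of a factor, ignoring NAs
mode_value <- function(f) names(which.max(table(f)))
mode_value(factor(c("Single", "Married", "Single", NA)))
```

Filling many NAs with one value makes a variable look less variable than it really is, which is worth keeping in mind when comparing the two summaries.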
Essay Question
Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NAs in your file? What information is printed by the summary? How can this be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note changes. Explain everything that you observe.
Answer
Step 6: Create descriptive statistics for all variables
We run all the descriptive statistics for all the numeric variables
###################################################################
#
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
if (is.numeric(column) || is.integer(column)) {
data.frame(
Variable = name,
Mean = round(mean(column, na.rm = TRUE), 2),
Median = round(median(column, na.rm = TRUE), 2),
St.Deviation = round(sd(column, na.rm = TRUE), 2),
Range = round(diff(range(column, na.rm = TRUE)), 2),
IQR = round(IQR(column, na.rm = TRUE), 2),
Skewness = round(skewness(column, na.rm = TRUE), 2),
Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
stringsAsFactors = FALSE
)
} else {
NULL
}
}
# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
rbind,
lapply(names(df), function(col) compute_stats(df[[col]], col))
)
## Warning in min(x): no non-missing arguments to min; returning Inf
## Warning in max(x): no non-missing arguments to max; returning -Inf
# Print the descriptive statistics dataframe
descriptive_stats
## Variable Mean Median St.Deviation Range IQR Skewness Kurtosis
## 1 ID 717.23 991.50 467.74 1116.0 1011.5 -0.77 1.63
## 2 Age 28.46 28.00 2.01 9.0 3.0 0.61 3.26
## 3 Height 173.24 175.00 9.03 35.0 17.0 -0.12 1.87
## 4 Weight 70.50 70.00 10.16 36.0 15.0 0.01 1.97
## 5 Income 51762.02 51762.02 8299.68 38000.0 15000.0 0.16 2.63
## 6 Score 6.74 6.70 0.86 3.4 1.4 0.62 2.51
## 7 Happiness NaN NA NA -Inf NA NaN NaN
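To help interpret the Skewness and Kurtosis columns, here is a minimal sketch of moment-based formulas written in base R. To my understanding these mirror what the moments package computes, but treat that as an assumption and compare against `skewness()` and `kurtosis()` on your own data. The function names `moment_skewness` and `pearson_kurtosis` are hypothetical.

```r
# Hypothetical helper: skewness as the third central moment over sd^3
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Hypothetical helper: Pearson kurtosis (not excess kurtosis)
pearson_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

x <- c(1, 2, 3, 4, 5)
moment_skewness(x)   # 0 for a symmetric sample
pearson_kurtosis(x)  # 1.7
```

On the Pearson scale a normal distribution has a kurtosis of about 3, so values below 3 (such as Height and Weight above) suggest flatter tails than normal rather than negative kurtosis.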
Step 7: Print Descriptive Statistics
Now you have all the descriptive statistics for all numeric variables. Create a professional table in your paper. The kableExtra library can help you create the table here. If you have no programming experience, you can copy and paste the output into Excel and format the table there.
#############################################################
#
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables.
# Create a professional table in your paper.
# The kableExtra library can help you create the table here.
# If you have no programming experience, you can copy and paste the output
# into Excel and format the table there.
#############################################################
print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
print(descriptive_stats)
## Variable Mean Median St.Deviation Range IQR Skewness Kurtosis
## 1 ID 717.23 991.50 467.74 1116.0 1011.5 -0.77 1.63
## 2 Age 28.46 28.00 2.01 9.0 3.0 0.61 3.26
## 3 Height 173.24 175.00 9.03 35.0 17.0 -0.12 1.87
## 4 Weight 70.50 70.00 10.16 36.0 15.0 0.01 1.97
## 5 Income 51762.02 51762.02 8299.68 38000.0 15000.0 0.16 2.63
## 6 Score 6.74 6.70 0.86 3.4 1.4 0.62 2.51
## 7 Happiness NaN NA NA -Inf NA NaN NaN
Essay Question
Review and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. How can this be interpreted? What are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you observe. Complete your research, compare your variables, and complete your paper.
Answer
Step 8: Create graphs using ggplot2
In this step, there are parts of the code that you will need to change to create your graphs. The example is set to work with Income. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions, etc.
#######################################################################
#
# Step 8: Create graphs using ggplot2
# For this part there are parts that you will need to change to create
# your graphs.
# The example is set to work with Income
# Make the necessary changes to create the rest of the graphs
# You may also want to change the colors, the dimensions etc...
#############################################################
#############################################################
#
# STEP 8a: Create a bargraph or a histogram
# Explain what graph was that and why?
# Set col to the desired column name
#############################################################
#
##
# In this code we start you off with an example of Income; later in the code
# you should replace this with your desired variable.
#
col <- "Income" # This is an example, try to do the same with a different variable
# Assume df is your dataframe and col is the column name (as string)
if (is.factor(df[[col]])) {
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
}
You can also copy the chunk and create more graphs by resetting the col variable appropriately
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# Histogram for numeric variables
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(bins = 30, fill = "steelblue", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
Essay Question
Now that you can observe your data graphically, explain the importance of graphical representations and how they help to communicate data to other parties. Explain which graph was created and why.
Answer
STEP 8b: Create a boxplot and a histogram for numeric variables. Note that the bin width cannot be set the same way for Age or Happiness, which have a small range, and for Income, whose range is in the thousands. Change this appropriately.
Please note that this part of the code will not run in the demo code. You will need to change eval=FALSE to eval=TRUE after you introduce your code, to run it and add it to your knitted file.
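One way to pick a bin width that adapts to the scale of each variable is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). This rule is a suggestion swapped in here, not something the template prescribes. The sketch below uses simulated stand-ins for Age and Income (`age_like` and `income_like` are made up), and `fd_binwidth` is a hypothetical helper name.

```r
# Hypothetical helper: Freedman-Diaconis bin width, ignoring NAs
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1/3)
}

# Simulated stand-ins with very different scales
set.seed(1)
age_like    <- round(rnorm(500, mean = 28, sd = 2))
income_like <- round(rnorm(500, mean = 52000, sd = 8000))

fd_binwidth(age_like)     # around one year or less
fd_binwidth(income_like)  # in the thousands

# Possible use in the histogram code (assumes ggplot2 is loaded):
# geom_histogram(binwidth = fd_binwidth(df[[col]]))
```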
#############################################################
#
# STEP 8b: Create a boxplot and Histogram for numeric variables
# Note that the bin width cannot be set the same way for Age or Happiness,
# which have a small range, and for Income, whose range is in the thousands.
# Change this appropriately
#############################################################
#
# Choose a numeric variable (e.g., Age) and set col to the name of the column
# as a string, then uncomment and rerun the code below.
# col <- "Income"
# Uncommenting the code creates a bar graph (if col is categorical) or a
# histogram (if col is numeric) of the chosen variable.
# Do not forget to change eval=FALSE to eval=TRUE to run and knit this chunk.
# Highlight and run until the line that starts with "Boxplot for numeric variables".
#
# if (is.factor(df[[col]])) {
# # Bar graph for factors
# ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
# geom_bar() +
# labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
# theme_minimal() +
# theme(legend.position = "right")
# } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# # Histogram for numeric variables; adjust binwidth to the range of col
# ggplot(df, aes(x = .data[[col]])) +
# geom_histogram(binwidth = 0.3) +
# labs(title = paste("Histogram for", col), x = col, y = "Count") +
# theme_minimal()
# }
Essay Question
Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.
Answer
Step 8c: Note that you should run this part with the latest value of col. Do not forget to change eval=FALSE to eval=TRUE to knit it.
Boxplot for numeric variables
#############################################################
#
# Step 8c
# NOTE that you should run this part of the code after you
# copy the graph that the previous code creates. Boxplot for numeric variables
#############################################################
# The following ggplot call will run only if col is numeric; otherwise it will give you an error.
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
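The red points in a boxplot are values that fall outside the whiskers, which by default extend 1.5 times the IQR beyond the quartiles. The sketch below applies that rule by hand to a made-up vector. `iqr_outliers` is a hypothetical helper, and ggplot2 may compute the quartiles slightly differently, so treat the result as a cross-check rather than an exact match.

```r
# Hypothetical helper: values beyond 1.5 * IQR from the quartiles
iqr_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]                # the IQR
  lower <- q[1] - coef * spread        # lower fence
  upper <- q[2] + coef * spread        # upper fence
  x[x < lower | x > upper]
}

x <- c(1, 2, 3, 4, 5, 100)
IQR(x)           # 2.5
iqr_outliers(x)  # 100
```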
Essay Question
Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.
Answer
Step 9: Tables
Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables, choose two others to get more valuable insights.
#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################
Gender_Education <- table(df$Education, df$Gender)
Gender_Education # What does this information tell you?
##
## Female Male Other
## Bachelor's 179 10 9
## High School 4 26 73
## Master's 6 99 0
## PhD 0 77 17
# How many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table
##
## Female Male Other Sum
## Bachelor's 179 10 9 198
## High School 4 26 73 103
## Master's 6 99 0 105
## PhD 0 77 17 94
## Sum 189 212 99 500
color <- c("red","blue","yellow","green")
education_levels <- c("Bachelor's","High School", "Master's","PhD") # named to avoid masking base::names()
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright", education_levels, fill=color, cex=0.5)
# "topright" is the position of the legend; it can be moved to "top", "left", "bottom", etc.
# You do not need to change the rest of the parameters here
print(addmargins(Gender_Education))
##
## Female Male Other Sum
## Bachelor's 179 10 9 198
## High School 4 26 73 103
## Master's 6 99 0 105
## PhD 0 77 17 94
## Sum 189 212 99 500
library(knitr)
library(kableExtra)
# Create the contingency table
Gender_Education <- table(df$Education, df$Gender)
# Add row and column totals
Gender_Education_margins <- addmargins(Gender_Education)
# Make a clean and beautiful table with kable
kable(Gender_Education_margins, caption = "Gender by Education Level", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | Female | Male | Other | Sum |
|---|---|---|---|---|
| Bachelor’s | 179 | 10 | 9 | 198 |
| High School | 4 | 26 | 73 | 103 |
| Master’s | 6 | 99 | 0 | 105 |
| PhD | 0 | 77 | 17 | 94 |
| Sum | 189 | 212 | 99 | 500 |
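Raw counts are hard to compare when the groups differ in size, so proportions within each group can help. The sketch below rebuilds the contingency table from the counts printed above and uses `prop.table()` to show the share of each education level within each gender. The object names `counts`, `tab`, and `col_shares` are illustrative only.

```r
# Counts copied from the table above (rows: Education, columns: Gender)
counts <- matrix(
  c(179, 4, 6, 0,      # Female
    10, 26, 99, 77,    # Male
    9, 73, 0, 17),     # Other
  nrow = 4,
  dimnames = list(
    Education = c("Bachelor's", "High School", "Master's", "PhD"),
    Gender    = c("Female", "Male", "Other")
  )
)
tab <- as.table(counts)

# Share of each education level within each gender (columns sum to 1)
col_shares <- prop.table(tab, margin = 2)
round(col_shares, 3)
```

With the real data, the same result should come from `prop.table(Gender_Education, margin = 2)`. Use `margin = 1` to get the shares within each education level instead.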
Essay Question
Explain the table in detail. Focus on the information extracted, anomalies, outliers, relationships.
Answer