Already Completed Code

As we move from week to week and task to task, the code that you have already completed stays in the template but does not run. This is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting. If you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Setting up your directory in your computer.

This needs to be addressed here.

# Check the current working directory
getwd()
## [1] "C:/Users/benke/Downloads"
# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/benke/Downloads")

### Choose an already existing directory in your computer.
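Before changing the working directory, it can help to confirm that the folder and the data file are really there. The following is a minimal sketch, not part of the original template: `data_dir`, `old_dir`, and `has_data` are hypothetical names, and `tempdir()` stands in for your own folder only so that the sketch runs on any computer.

```r
# Hypothetical folder; replace tempdir() with the folder that holds your data.csv
data_dir <- tempdir()

# dir.exists() is TRUE only when the folder is already on disk
if (dir.exists(data_dir)) {
  old_dir <- setwd(data_dir)  # setwd() returns the previous directory
  message("Working directory is now: ", getwd())
} else {
  stop("Folder not found: ", data_dir)
}

# file.exists() tells you whether read.csv("data.csv") would find the file
has_data <- file.exists("data.csv")
has_data

# Restore the original directory so the sketch leaves no side effects
setwd(old_dir)
```

If `has_data` is FALSE, move data.csv into the folder or point `data_dir` at the folder where you saved it.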

Setting up your personalized data

# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts string columns to factors automatically
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # This command may take some time to run. Once it is done, it creates the data file in your working directory
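To see why the seed matters, here is a small sketch on a toy data frame. The `toy` data and the names `seed_value`, `rows_first`, and `rows_second` are illustrative assumptions, not part of the template. It draws the same sample twice and checks that the rows match.

```r
# Toy data frame standing in for the course data (hypothetical values)
toy <- data.frame(ID = 1:20, Score = seq(5, 9, length.out = 20))

A <- 34  # same illustrative values as in the template
B <- 99
seed_value <- A + B

# Draw the same sample twice, resetting the seed before each draw
set.seed(seed_value)
rows_first <- sample(nrow(toy), 10, replace = TRUE)

set.seed(seed_value)
rows_second <- sample(nrow(toy), 10, replace = TRUE)

# Same seed, same rows
identical(rows_first, rows_second)

# replace = TRUE allows a row to be drawn more than once,
# so the sampled data can contain duplicated IDs
any(duplicated(rows_first))
```

Sampling with replacement is also why the str() output of the course data can list the same ID more than once.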

Knit your file

As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML file will open for you to review once it is done.

It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you have read the file, you want to learn a few things about your data.

The following commands will not be explained here. Do your research, review your csv file, and answer the questions related to this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the dataset
## [1] 500
length(df)     # Number of columns (or variables) in the dataset
## [1] 16
str(df)        # Structure of the dataset (data types and a preview)
## 'data.frame':    500 obs. of  16 variables:
##  $ ID           : int  31 957 964 34 968 1040 31 1051 32 1082 ...
##  $ Gender       : Factor w/ 3 levels "Female","Male",..: 1 3 2 1 2 2 1 1 2 3 ...
##  $ Age          : int  29 30 31 27 30 29 29 26 31 28 ...
##  $ Height       : int  155 165 180 168 182 175 155 160 180 185 ...
##  $ Weight       : int  54 NA NA 65 80 70 54 55 NA 85 ...
##  $ Education    : Factor w/ 5 levels "","Bachelor's",..: 4 3 2 2 4 5 4 2 2 3 ...
##  $ Income       : int  55000 NA 50000 45000 65000 60000 55000 48000 50000 NA ...
##  $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 2 3 2 3 2 2 2 3 2 3 ...
##  $ Employment   : Factor w/ 3 levels "","Employed",..: 2 3 2 2 2 2 2 2 2 3 ...
##  $ Score        : num  7.3 5.5 6.5 6.2 NA 7.8 7.3 6.1 6.5 5.7 ...
##  $ Rating       : Factor w/ 4 levels "","A","B","C": 3 2 2 3 4 4 3 3 2 2 ...
##  $ Category     : Factor w/ 4 levels "","Art","Music",..: 3 4 2 2 3 4 3 2 2 4 ...
##  $ Color        : Factor w/ 4 levels "","Blue","Green",..: 2 3 3 2 4 4 2 2 3 3 ...
##  $ Hobby        : Factor w/ 5 levels "","Photography",..: 4 3 3 3 5 5 4 4 3 2 ...
##  $ Happiness    : num  8.2 NA 8 7 8.5 9 8.2 7 8 6 ...
##  $ Location     : Factor w/ 4 levels "","City","Rural",..: 2 3 3 2 2 4 2 3 3 3 ...
summary(df)    # Summary statistics for each column
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:221   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  59.0   Male  :184   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:60.00  
##  Median :1001.0   Other : 95   Median :28.00   Median :175.0   Median :70.00  
##  Mean   : 766.7                Mean   :28.15   Mean   :172.3   Mean   :69.38  
##  3rd Qu.:1059.2                3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##                                NA's   :27      NA's   :18      NA's   :28     
##        Education       Income      MaritalStatus      Employment 
##             : 20   Min.   :32000          : 16             : 21  
##  Bachelor's :189   1st Qu.:45000   Married:192   Employed  :364  
##  High School:104   Median :48000   Single :292   Unemployed:115  
##  Master's   : 96   Mean   :51551                                 
##  PhD        : 88   3rd Qu.:60000                                 
##  NA's       :  3   Max.   :70000                                 
##                    NA's   :77                                    
##      Score      Rating    Category     Color             Hobby    
##  Min.   :5.50    :  4         :  6        :  1              :  4  
##  1st Qu.:6.10   A:147   Art   :173   Blue :207   Photography: 90  
##  Median :6.20   B:215   Music :145   Green:135   Reading    :125  
##  Mean   :6.63   C:134   Sports:176   Red  :157   Swimming   :100  
##  3rd Qu.:7.50                                    Traveling  :181  
##  Max.   :8.90                                                     
##  NA's   :67                                                       
##    Happiness       Location  
##  Min.   :6.000         :  5  
##  1st Qu.:7.000   City  :209  
##  Median :7.000   Rural :191  
##  Mean   :7.502   Suburb: 95  
##  3rd Qu.:8.500               
##  Max.   :9.000               
##  NA's   :13
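NAs and blanks are counted differently: summary() reports NAs on a row labelled NA's, while blanks show up as an unnamed factor level. The sketch below, on hypothetical toy data, shows one possible way to count both per column. The names `toy`, `na_per_column`, and `blank_per_column` are assumptions for illustration only.

```r
# Toy data with both NA values and blank strings (hypothetical values)
toy <- data.frame(
  Age   = c(29, NA, 31, 27),
  Color = c("Blue", "", "Green", NA),
  Hobby = c("", "Reading", "", "Swimming"),
  stringsAsFactors = TRUE
)

# NA count per column: is.na() marks missing cells, colSums() adds them up
na_per_column <- colSums(is.na(toy))

# Blank count per column: compare each column as text against "".
# %in% is used so that NA cells count as "not blank" instead of returning NA.
blank_per_column <- sapply(toy, function(col) sum(as.character(col) %in% ""))

na_per_column
blank_per_column
```

Running the same two commands on df gives the per-column counts that the questions below ask about.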

Your Turn

Please answer the following questions, by typing information after the question.

Question 1

What type of variables does your file include?

Answer 1:

Question 2

Specific data types?

Answer 2:

Question 3

Are they read properly?

Answer 3:

Question 4

Are there any issues?

Answer 4:

Question 5

Does your file include both NAs and blanks?

Answer 5:

Question 6

How many NAs do you have?

Answer 6:

Question 7

How many blanks?

Answer 7:

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple, so we first eliminate the blanks by changing them to NAs.

#
# Step 1: Handling both blanks and NAs is not simple, so first we eliminate
# the blanks by changing them to NAs
#


# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus", "Category", 
                    "Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
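The reason for rebuilding the factors can be seen on a toy column. This is a sketch with hypothetical data; `toy` and `stale_levels` are illustrative names. After the blank is replaced with NA, the empty string is still listed as a factor level until the column is converted again.

```r
# Toy factor column containing one blank (hypothetical values)
toy <- data.frame(
  Rating = c("A", "", "B", "A"),
  stringsAsFactors = TRUE
)
levels(toy$Rating)  # the blank "" is one of the levels

# Replace blanks with NA, as in Step 1
toy[toy == ""] <- NA
sum(is.na(toy$Rating))

# The blank level is now unused, but it is still listed
stale_levels <- levels(toy$Rating)
stale_levels

# Rebuilding the factor from its character values drops the unused level
toy$Rating <- as.factor(as.character(toy$Rating))
levels(toy$Rating)
```

Base R's droplevels() is another way to remove unused levels. The template's as.factor(as.character(...)) approach gives the same levels here.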

Cleanup Continued

Step 2: Count NAs in the entire dataset

#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 310

Your Turn

Please answer the following questions, by typing information after the question.

Question 8

Explain what the printed number is. What information does it relay, and how can you use it in your analysis?

Answer 8:

Clean Up Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 255
Percent_row_NA
## [1] "51%"
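As a cross-check on the row count, complete.cases() gives the opposite view: the rows with no NAs at all. The following sketch uses hypothetical toy data, and the names `rows_complete`, `share_lost`, and `toy_complete` are assumptions for illustration.

```r
# Toy data with NAs scattered across rows (hypothetical values)
toy <- data.frame(
  Age    = c(29, NA, 31, 27, 30),
  Weight = c(54, 60, NA, 65, 80),
  Score  = c(7.3, NA, 6.5, 6.2, 7.8)
)

# Rows with at least one NA, computed as in Step 3
rows_with_nas <- sum(rowSums(is.na(toy)) > 0)

# Rows with no NA at all
rows_complete <- sum(complete.cases(toy))

# The two counts must add up to the number of rows
rows_with_nas + rows_complete == nrow(toy)

# Share of rows that would be lost by dropping every incomplete row
share_lost <- rows_with_nas / nrow(toy)
share_lost

# na.omit() performs the drop
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Comparing `share_lost` with the 5% guideline shows how much data a row-wise drop would cost.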

Your Turn

Question 9

How large is the proportion of rows with NAs? As a guideline, we can drop up to 5% of the rows.

Answer 9:

Question 10

Do you think it would be wise to drop the above percentage of rows?

Answer 10:

Question 11

How will this affect your dataset?

Answer 11:

CleanUp Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 14
Percent_col_NA
## [1] "88%"
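The single count above says how many columns are affected, but not how badly. The sketch below, on hypothetical toy data, lists the NA count and NA share for each column. `na_table` and its column names are illustrative assumptions.

```r
# Toy data with NAs in some columns (hypothetical values)
toy <- data.frame(
  ID     = 1:5,
  Income = c(55000, NA, 50000, NA, 65000),
  Score  = c(7.3, 5.5, NA, 6.2, 7.8)
)

na_count <- colSums(is.na(toy))   # number of NAs in each column
na_share <- colMeans(is.na(toy))  # proportion of NAs in each column

na_table <- data.frame(
  Variable = names(toy),
  NA_Count = as.integer(na_count),
  NA_Share = round(na_share, 2),
  row.names = NULL
)

# Show the columns with the most missing values first
na_table[order(-na_table$NA_Count), ]
```

A column with only a few NAs is a very different case from a column that is mostly empty, even though both count as a column with NAs.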

Your Turn

Question 12

How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?

Answer 12:

Question 13

How will this affect your dataset?

Answer 13:

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that replacing NAs with the mean for numeric variables and the mode for categorical variables is correct. This is not always the case, of course, but the code would become much more complicated otherwise.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, xout = seq_along(col), rule = 2)[["y"]][is.na(col)] # Linear interpolation evaluated at the original row positions
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# Imputing with the above method has changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:221   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  59.0   Male  :184   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00  
##  Median :1001.0   Other : 95   Median :28.00   Median :172.3   Median :69.38  
##  Mean   : 766.7                Mean   :28.15   Mean   :172.3   Mean   :69.38  
##  3rd Qu.:1059.2                3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##        Education       Income      MaritalStatus      Employment 
##  Bachelor's :212   Min.   :32000   Married:192   Employed  :385  
##  High School:104   1st Qu.:45000   Single :308   Unemployed:115  
##  Master's   : 96   Median :51551                                 
##  PhD        : 88   Mean   :51551                                 
##                    3rd Qu.:60000                                 
##                    Max.   :70000                                 
##      Score      Rating    Category     Color             Hobby    
##  Min.   :5.50   A:147   Art   :173   Blue :208   Photography: 90  
##  1st Qu.:6.10   B:219   Music :145   Green:135   Reading    :125  
##  Median :6.20   C:134   Sports:182   Red  :157   Swimming   :100  
##  Mean   :6.63                                    Traveling  :185  
##  3rd Qu.:7.30                                                     
##  Max.   :8.90                                                     
##    Happiness       Location  
##  Min.   :6.000   City  :214  
##  1st Qu.:7.000   Rural :191  
##  Median :7.000   Suburb: 95  
##  Mean   :7.502               
##  3rd Qu.:8.500               
##  Max.   :9.000
summary(df) %>%
  kbl(caption = "Table 1. Summary of Data Frame Characteristics") %>%
  kable_classic()
Table 1. Summary of Data Frame Characteristics
| ID | Gender | Age | Height | Weight | Education | Income | MaritalStatus | Employment | Score | Rating | Category | Color | Hobby | Happiness | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Female:221 | Min. :25.00 | Min. :155.0 | Min. :54.00 | Bachelor’s :212 | Min. :32000 | Married:192 | Employed :385 | Min. :5.50 | A:147 | Art :173 | Blue :208 | Photography: 90 | Min. :6.000 | City :214 |
| 1st Qu.: 59.0 | Male :184 | 1st Qu.:27.00 | 1st Qu.:165.0 | 1st Qu.:62.00 | High School:104 | 1st Qu.:45000 | Single :308 | Unemployed:115 | 1st Qu.:6.10 | B:219 | Music :145 | Green:135 | Reading :125 | 1st Qu.:7.000 | Rural :191 |
| Median :1001.0 | Other : 95 | Median :28.00 | Median :172.3 | Median :69.38 | Master’s : 96 | Median :51551 | NA | NA | Median :6.20 | C:134 | Sports:182 | Red :157 | Swimming :100 | Median :7.000 | Suburb: 95 |
| Mean : 766.7 | NA | Mean :28.15 | Mean :172.3 | Mean :69.38 | PhD : 88 | Mean :51551 | NA | NA | Mean :6.63 | NA | NA | NA | Traveling :185 | Mean :7.502 | NA |
| 3rd Qu.:1059.2 | NA | 3rd Qu.:29.00 | 3rd Qu.:182.0 | 3rd Qu.:80.00 | NA | 3rd Qu.:60000 | NA | NA | 3rd Qu.:7.30 | NA | NA | NA | NA | 3rd Qu.:8.500 | NA |
| Max. :1117.0 | NA | Max. :34.00 | Max. :190.0 | Max. :90.00 | NA | Max. :70000 | NA | NA | Max. :8.90 | NA | NA | NA | NA | Max. :9.000 | NA |
head(df) %>%
  kbl(caption = "Table 1. Head of Data Frame Characteristics") %>%
  kable_classic()
Table 1. Head of Data Frame Characteristics
| ID | Gender | Age | Height | Weight | Education | Income | MaritalStatus | Employment | Score | Rating | Category | Color | Hobby | Happiness | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | Female | 29 | 155 | 54.00000 | Master’s | 55000.00 | Married | Employed | 7.300000 | B | Music | Blue | Swimming | 8.200000 | City |
| 957 | Other | 30 | 165 | 69.37924 | High School | 51550.83 | Single | Unemployed | 5.500000 | A | Sports | Green | Reading | 7.502259 | Rural |
| 964 | Male | 31 | 180 | 69.37924 | Bachelor’s | 50000.00 | Married | Employed | 6.500000 | A | Art | Green | Reading | 8.000000 | Rural |
| 34 | Female | 27 | 168 | 65.00000 | Bachelor’s | 45000.00 | Single | Employed | 6.200000 | B | Art | Blue | Reading | 7.000000 | City |
| 968 | Male | 30 | 182 | 80.00000 | Master’s | 65000.00 | Married | Employed | 6.630485 | C | Music | Red | Traveling | 8.500000 | City |
| 1040 | Male | 29 | 175 | 70.00000 | PhD | 60000.00 | Married | Employed | 7.800000 | C | Sports | Red | Traveling | 9.000000 | Suburb |
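To see what mean imputation does to the statistics, it helps to try it on a small vector first. This is a sketch with hypothetical income values; `income`, `imputed`, `stats_before`, and `stats_after` are illustrative names, not part of the template.

```r
# Toy income values with three NAs (hypothetical values)
income <- c(45000, 48000, NA, 55000, NA, 60000, 65000, NA, 50000, 70000)

# Mean imputation: every NA is replaced with the mean of the observed values
imputed <- income
imputed[is.na(imputed)] <- mean(income, na.rm = TRUE)

stats_before <- c(
  mean   = mean(income, na.rm = TRUE),
  median = median(income, na.rm = TRUE),
  sd     = sd(income, na.rm = TRUE)
)
stats_after <- c(
  mean   = mean(imputed),
  median = median(imputed),
  sd     = sd(imputed)
)

# Side-by-side comparison of the statistics before and after imputation
round(rbind(before = stats_before, after = stats_after), 2)
```

The mean stays the same, the standard deviation shrinks because the filled-in values sit exactly at the mean, and the median can move toward the mean. The same pattern is visible in the two summary(df) outputs in this document: the Income median changes from 48000 to 51551 and the Height median from 175.0 to 172.3.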

Your Turn

Essay Question

Run summary(df) and compare it with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NAs in your file? What information does the summary print? How can this be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note the changes. Explain everything that you observe.

Answer

Descriptive Statistics

Step 6: Create descriptive statistics for all variables

We run all the descriptive statistics for all the numeric variables

################################################################### 
# 
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
  rbind,
  lapply(names(df), function(col) compute_stats(df[[col]], col))
)

# Print the descriptive statistics dataframe
descriptive_stats
##    Variable     Mean   Median St.Deviation   Range      IQR Skewness Kurtosis
## 1        ID   766.68  1001.00       443.87  1116.0  1000.25    -1.03     2.11
## 2       Age    28.15    28.00         1.84     9.0     2.00     0.71     3.96
## 3    Height   172.30   172.30         9.19    35.0    17.00     0.03     1.84
## 4    Weight    69.38    69.38        10.41    36.0    18.00     0.18     1.88
## 5    Income 51550.83 51550.83      7926.35 38000.0 15000.00     0.22     2.41
## 6     Score     6.63     6.20         0.81     3.4     1.20     0.72     2.59
## 7 Happiness     7.50     7.00         0.98     3.0     1.50     0.06     1.85
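When reading the Skewness and Kurtosis columns, it helps to know the scale they are on. The sketch below writes both measures in base R using the moment-based formulas, assuming (as we understand the moments package) that kurtosis() reports Pearson's kurtosis, for which a normal distribution scores about 3, and not excess kurtosis. The helper names `skew_by_hand` and `kurt_by_hand` and the toy vectors are hypothetical.

```r
# Moment-based skewness: third central moment over variance^(3/2)
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson kurtosis: fourth central moment over variance^2 (normal is about 3)
kurt_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

symmetric  <- c(1, 2, 3, 4, 5)   # evenly spread, no tail
right_tail <- c(1, 1, 1, 2, 10)  # one large value pulls the tail to the right

skew_by_hand(symmetric)          # 0: perfectly symmetric
kurt_by_hand(symmetric)          # 1.7: flatter than a normal distribution
skew_by_hand(right_tail) > 0     # TRUE: positive skew means a right tail
```

On this scale, a kurtosis below 3 points to a flatter distribution with lighter tails than the normal, and a value above 3 points to a sharper peak with heavier tails.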

Descriptive Statistics Continued

Step 7: Print Descriptive Statistics

Now you have all the descriptive statistics for all numeric variables. Create a professional table in your paper. The kableExtra package, loaded with library(kableExtra), can help you create the table here. If you have no programming experience, you can copy and paste the output into Excel and format the table there.

#############################################################
# 
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables
# Create a professional table in your paper.
# library(kableExtra) can help you create the table here.
# if you have no programming experience you can cut and paste in Excel
# and beautify the table in Excel
#############################################################
  
  print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
  print(descriptive_stats)
##    Variable     Mean   Median St.Deviation   Range      IQR Skewness Kurtosis
## 1        ID   766.68  1001.00       443.87  1116.0  1000.25    -1.03     2.11
## 2       Age    28.15    28.00         1.84     9.0     2.00     0.71     3.96
## 3    Height   172.30   172.30         9.19    35.0    17.00     0.03     1.84
## 4    Weight    69.38    69.38        10.41    36.0    18.00     0.18     1.88
## 5    Income 51550.83 51550.83      7926.35 38000.0 15000.00     0.22     2.41
## 6     Score     6.63     6.20         0.81     3.4     1.20     0.72     2.59
## 7 Happiness     7.50     7.00         0.98     3.0     1.50     0.06     1.85
descriptive_stats %>%
  kbl(caption = "Table 2. Descriptive Statistics") %>%
  kable_classic()
Table 2. Descriptive Statistics
| Variable | Mean | Median | St.Deviation | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|
| ID | 766.68 | 1001.00 | 443.87 | 1116.0 | 1000.25 | -1.03 | 2.11 |
| Age | 28.15 | 28.00 | 1.84 | 9.0 | 2.00 | 0.71 | 3.96 |
| Height | 172.30 | 172.30 | 9.19 | 35.0 | 17.00 | 0.03 | 1.84 |
| Weight | 69.38 | 69.38 | 10.41 | 36.0 | 18.00 | 0.18 | 1.88 |
| Income | 51550.83 | 51550.83 | 7926.35 | 38000.0 | 15000.00 | 0.22 | 2.41 |
| Score | 6.63 | 6.20 | 0.81 | 3.4 | 1.20 | 0.72 | 2.59 |
| Happiness | 7.50 | 7.00 | 0.98 | 3.0 | 1.50 | 0.06 | 1.85 |

Your Turn

Essay Question

Review the statistics and compare them with the previous ones. Do you observe any undesired changes? Explain in detail. How can this be interpreted? What are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you observe. Complete your research, compare your variables, and complete your paper.

The descriptive statistics provide numerical information about the trends in the data. The mean and median are measures of central tendency and provide a general idea of the typical value: the median is the midpoint of a set of values and the mean is the mathematical average (Black, 2012). These numbers are important when considering how specific values compare with the average. The largest values belong to the Income variable, with a mean and median of 51,550.83. The size of this mean indicates that Income may represent a yearly income in a currency. The smallest values belong to the Score and Happiness variables, with a mean of 6.63 for Score and a mean of 7.5 for Happiness. Because the means of these variables are less than 10, they likely represent scores on a survey or questionnaire. The means for age, height, and weight are 28.15, 172.30, and 69.38, respectively, and indicate that this sample includes young adults.

Standard deviation, range, and the interquartile range (IQR) provide information about the variability and range in values. Measures of variability show the amount of similarity in individual values to deduce how representative the sample is of the population of interest (Black, 2012). Age has a range of 9, a standard deviation of 1.84, and an IQR of 2. From the descriptive statistics, age has low variability with a low standard deviation, although there are likely a few outliers causing the range to be much larger than the IQR and standard deviation. Score and Happiness both have a standard deviation of less than 1.0 and a range of less than 3.5, indicating low variability. The weight variable has higher variability, with a high standard deviation of 10.41 and a range of 36.0 compared to the mean of 69.38. The height variable, on the other hand, has a mean of 172.30 with a standard deviation of 9.19 and a range of 35, indicating a moderate amount of variability across the range, but not as high as for the Weight variable.

Skewness and kurtosis describe the shape of a distribution and show whether values are spread evenly across the range. High skewness and kurtosis may signal bias, with values clustered around one point and low-frequency outliers affecting the interpretation. The Score, Income, and Age variables have the highest kurtosis: 2.59 for Score, 2.41 for Income, and 3.96 for Age. Because the kurtosis function in the moments package reports Pearson's kurtosis, for which a normal distribution scores 3, only Age exceeds the normal benchmark, which suggests a sharper peak and heavier tails, while Score and Income fall below it. The Score and Age variables also have the highest skewness (0.72 and 0.71), indicating that their values cluster toward the lower end of the range with a tail extending toward higher values. For the remaining variables (Height, Weight, and Happiness), skewness is close to zero and kurtosis is below 2, indicating a relatively even distribution across the range of values.

From the measures of variability, the age and height variables have the highest consistency and may yield meaningful insights with further analysis. However, weight was chosen for further investigation due to the predicted relationship between weight and happiness. Income is highly meaningful when considering relationships with happiness scores or with categorical variables such as level of education. However, the high kurtosis and high standard deviation for income may indicate bias. A larger sample size may reduce variability and increase the reliability of the values, and could be considered for future analysis.

Based on the descriptive statistics, relationships between individual characteristics (Age, Height, Weight, and Income) and Happiness score could provide meaningful information about the impact of these characteristics on happiness scores.

The descriptive research questions are:

What is the relationship between age, weight, income and happiness values?

What is the relationship between location and hobby?

What is the relationship between education and hobby?

What is the relationship between education level and location?

Visual Representations

Step 8: Create graphs using ggplot2

For this step, you will need to change parts of the code to create your graphs. The example is set to work with Happiness. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions, etc.

  #######################################################################
  # 
  # Step 8: Create graphs using ggplot2
  # For this part there are parts that you will need to change to create 
  # your graphs.
  # The example is set to work with Happiness
  # Make the necessary changes to create the rest of the graphs
  # You may also want to change the colors, the dimensions etc...
  #############################################################
  
  #############################################################
  #
  # STEP 8a: Create a bargraph or a histogram
  # Explain what graph was that and why?
  # Set col to the desired column name
  #############################################################
  #
  ##
  # In this code we start you off with an example using Happiness. Later in the
  # code you should replace this with your desired variable.
  #
  
    col = "Happiness"  # This is an example, try to do the same with a different variable

Bargraph - if the variable of your choice is categorical

#Bargraph for Education, Category, Hobby, and Location

  # Assume df is your dataframe and col is the column name (as string)
col = "Education"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
} 

col = "Category"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
} 

col = "Hobby"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
} 

col = "Location"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
} 
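The four bar graph chunks above differ only in the column name, so the repeated code can be wrapped in a function. The following is a sketch, not part of the original template: `plot_bar` and `bar_plots` are hypothetical names, and the toy data frame stands in for df.

```r
library(ggplot2)

# Hypothetical helper: builds the same bar graph for any factor column
plot_bar <- function(data, col) {
  stopifnot(is.factor(data[[col]]))  # bar graphs are for categorical variables
  ggplot(data, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count", fill = col) +
    theme_minimal() +
    theme(legend.position = "right")
}

# Toy data standing in for df (hypothetical values)
toy <- data.frame(
  Education = factor(c("Bachelor's", "PhD", "Bachelor's", "Master's")),
  Location  = factor(c("City", "Rural", "City", "Suburb"))
)

# Build one plot per column name
bar_plots <- lapply(c("Education", "Location"), function(col) plot_bar(toy, col))

# Inside lapply() or a loop, plots are not drawn automatically, so print them
invisible(lapply(bar_plots, print))
```

With the course data, the call would use df and the column names from the chunks above, for example plot_bar(df, "Hobby").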

Histogram - if the variable of your choice is numerical

You can also copy the chunk and create more graphs by resetting the col variable appropriately

#Histogram for Happiness

col = "Happiness"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

#Histogram for Age

col = "Age"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "orange2", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

#Histogram for Weight

col = "Weight"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

#Histogram for Income

col = "Income"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

Your Turn

Essay Question

Now that you can observe your data graphically, explain the importance of graphical representations and how they help to communicate data to other parties. Explain which graph you used and why.

To fully comprehend the data and relationships, graphic visualization is required. Data arrangement and sorting for graphic representation which aligns with the research question provides an opportunity to compare individual data points with overall trends (Bradstreet and Palcza, 2011). Interpretation of relationships between variables, groups, and individual data points is easily understood by viewers when categories are clearly detected.

Two of the categorical variables represented indicate a larger volume in one category. For the Education variable, the count for Bachelor’s (212) is at least double the count for each of High School (104), Master’s (96), and PhD (88). In the Hobby variable, the count for traveling is higher than for photography, swimming, and reading, although reading is higher than both swimming and photography. In the Location variable, both the city and rural categories have a higher count than suburb. Viewing results in this manner provides information about the data for later analysis. In considering relationships between variables, the high volume of values in the Bachelor’s category may lead to bias when determining relationships between education level and other variables.

Histograms provide a visual representation of the distribution of numerical data, allowing rapid detection of trends. Histograms were developed for happiness, age, weight, and income. The histograms for happiness and age have the highest frequency of values close to the mean, with a small frequency of values spreading to the right of the graph, which indicates a right skew with most values concentrated on the left. Age has outlier values at 31.75 and 33.25, which do not occur with sufficient frequency to offset the concentration of lower ages in the graph. The Weight variable also has a higher frequency of values close to the mean, with frequently occurring weights at 80 and 85 and an outlier at 90. The income histogram, on the other hand, has increased values near the mean and to the right of the graph, indicating skewness toward higher income values, with an outlier at 70,000. A small frequency of values is observed from 31,000 to 42,000. The information from this graph shows that the income for most of this sample falls between 45,000 and 65,000, although the range is much wider. This is important when making inferences regarding income, as the small frequency of values in the lower range may indicate bias in the results if the sample does not match the income of the population investigated.

Your Turn

STEP 8b: Create a boxplot and a histogram for numeric variables. Note that the bin width cannot be set the same way for Age or Happiness, which have a small range, as for Income, whose range is in the thousands. Change this appropriately.

Please note that this part of the code will not run for the demo code. You will need to change the value of eval=FALSE to eval=TRUE, after you introduce your code, to run it and add it to your knitted file.
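
One possible way to choose a bin width that adapts to the scale of each variable is the Freedman-Diaconis rule, 2 × IQR / n^(1/3). The sketch below is only an illustration: `fd_binwidth()` is a hypothetical helper, and the two demo vectors are evenly spaced stand-ins for Happiness and Income, not the course data.

```r
# Hypothetical helper: Freedman-Diaconis bin width, 2 * IQR / n^(1/3)
fd_binwidth <- function(v) {
  v <- v[!is.na(v)]                      # ignore missing values
  2 * IQR(v) / length(v)^(1 / 3)
}

# Stand-in vectors on two very different scales (assumed, not real data)
happiness_demo <- seq(1, 10, length.out = 500)
income_demo    <- seq(30000, 70000, length.out = 500)

bw_happiness <- fd_binwidth(happiness_demo)
bw_income    <- fd_binwidth(income_demo)

print(c(happiness = bw_happiness, income = bw_income))
```

The result could then be passed to the histogram layer, for example `geom_histogram(binwidth = fd_binwidth(df[[col]]))`. This is a suggestion rather than what the template prescribes; a hand-picked round number per variable works as well.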

 #############################################################
      #
      # STEP 8b: Create a boxplot and a histogram for numeric variables.
      # Note that the bin width cannot be set the same way for
      # Age or Happiness, which have a small range, as for Income, whose range is in the thousands.
      # Change it appropriately.
      #############################################################
      #
      # Choose a numeric variable (e.g., Age), set the col variable to the name of that column, then rerun the code that is commented out here.

#col = ____ Add the variable of your choice  

# Uncomment the code and you will create a Bar graph or a Histogram of a different variable here.
# Do not forget to change the value of eval=TRUE to run and knit this chunk
    
#  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
#      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
#        geom_bar() +
#        labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
#        theme_minimal() +
#        theme(legend.position = "right")
#    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
#      ggplot(df, aes(x = .data[[col]])) +
#        geom_histogram(binwidth = 0.3) +
#        labs(title = paste("Histogram for", col), x = col, y = "Count") +
#        theme_minimal()
#    }
col = "Happiness"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }

col = "Education"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }

Your Turn

Essay Question

Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.

Answer

Your Turn

Step 8c: Note that you should run this part with the latest value of col. Do not forget to change eval=FALSE to eval=TRUE to knit it.

Boxplot for numeric variables

#Box plot for Happiness

       #############################################################
       #
       # Step 8c
       # NOTE that you should run this part of the code after you 
       #  copy the graph that the previous code creates. Boxplot for numeric variables
       #############################################################
      # The next 5 lines will run only if the col is numeric, otherwise will give you an error.
    col = "Happiness"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }  

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

#Box plot for Age

    col = "Age"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }  

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = .3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

#Box plot for Weight, Height, and Income

    col = "Weight"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }  

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

    col = "Height"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }  

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

    col = "Income"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 2500) +  # Income is in the thousands, so 0.3 is far too narrow; 2500 is an assumed starting value
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }  

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = .25, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

Your Turn

Essay Question

Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.

Box plots provide a visual representation of the distribution and skewness of values. The median, the interquartile range (IQR), the upper and lower ranges, and any outliers are all represented. Box plots were created for happiness, age, weight, height, and income. In the box plot for happiness, the median sits at the first quartile, indicating that the values in the second and third quartiles are highly skewed. Additionally, the upper range ends closer to the third quartile than the lower range does to the first quartile. Results from the happiness values must be interpreted with caution due to this skewness.

In the box plot for age, an outlier is visible above the upper range at a value above 33. The upper range extends further from the third quartile than the lower range does from the first quartile. The median is in the center of the box, indicating that values are evenly spread within the interquartile range. The box plots for weight and height show values skewed toward the third quartile. For weight, the highest value in the range is further from the third quartile than the lowest value is from the first quartile. Income also shows slight skewness toward the third quartile, although the lowest value in the range is further from the first quartile than the highest value is from the third quartile.

Given how often skewness was identified, further analysis with a larger sample may make the insights gained more reliable. In comparing the box plots, age appears to have the least variability, although it includes an outlier. Weight and height show similar skewness toward the higher values, as would be expected, but further analysis could determine whether the assumption that taller individuals weigh more holds for this data set. Income is also skewed toward the higher values, although it has a wider range in the lower values.
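
The IQR and the outlier cutoffs behind a box plot can also be computed directly. The following is a minimal sketch using the usual 1.5 × IQR rule, which is the default whisker rule in `geom_boxplot()`. The `ages` vector is hypothetical and is not taken from the data file.

```r
# Hypothetical ages; NOT values from the course data set
ages <- c(22, 24, 25, 25, 26, 27, 27, 28, 29, 30, 34)

q1  <- unname(quantile(ages, 0.25))   # first quartile
q3  <- unname(quantile(ages, 0.75))   # third quartile
iqr <- q3 - q1                        # interquartile range, same as IQR(ages)

lower_fence <- q1 - 1.5 * iqr         # values below this are flagged
upper_fence <- q3 + 1.5 * iqr         # values above this are flagged

outliers <- ages[ages < lower_fence | ages > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(outliers)                       # 34 is the only value beyond the fences
```

On the real data the same steps would apply to `df[[col]]` for each numeric column, which gives the essay a concrete IQR value to report alongside the plot.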

Tabular Representations

Step 9: Tables

Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables, choose two others to get more valuable insights.
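
As a small illustration of what a two-way table does before it is applied to the full data set, the sketch below cross-tabulates a tiny, made-up data frame. The name `demo` and its six rows are assumptions for illustration only.

```r
# Tiny made-up data frame; NOT the course data set
demo <- data.frame(
  Education = factor(c("Bachelor's", "Bachelor's", "PhD",
                       "High School", "PhD", "Bachelor's")),
  Location  = factor(c("City", "Rural", "Suburb",
                       "Rural", "Suburb", "City"))
)

# Rows are the first argument, columns the second
demo_tab <- table(demo$Education, demo$Location)

# addmargins() appends a "Sum" row and a "Sum" column
demo_tab_totals <- addmargins(demo_tab)
print(demo_tab_totals)
```

Each cell counts the rows that share that pair of categories, and the bottom-right "Sum" cell equals the number of rows in the data frame.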

#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################

Gender_Education <- table(df$Education, df$Gender)
Gender_Education # what does this information tell you?
##              
##               Female Male Other
##   Bachelor's     197    8     7
##   High School     11   22    71
##   Master's        13   83     0
##   PhD              0   71    17
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table
##              
##               Female Male Other Sum
##   Bachelor's     197    8     7 212
##   High School     11   22    71 104
##   Master's        13   83     0  96
##   PhD              0   71    17  88
##   Sum            221  184    95 500
color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Gender_Education))
##              
##               Female Male Other Sum
##   Bachelor's     197    8     7 212
##   High School     11   22    71 104
##   Master's        13   83     0  96
##   PhD              0   71    17  88
##   Sum            221  184    95 500
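
The legend vector does not have to be typed by hand. The sketch below is one possible alternative: it rebuilds the Education by Gender counts printed above as a table and takes the legend labels from `rownames()`, so the label order always matches the bar order. The helper name `draw_clustered()` is hypothetical and not part of the template.

```r
# Counts copied from the Education by Gender table printed above
Gender_Education <- as.table(matrix(
  c(197,  8,  7,
     11, 22, 71,
     13, 83,  0,
      0, 71, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Bachelor's", "High School", "Master's", "PhD"),
                  c("Female", "Male", "Other"))
))

# Legend labels come straight from the table rows (top to bottom)
legend_labels <- rownames(Gender_Education)

# One color per row; the palette is the one used in the example above
bar_colors <- c("red", "blue", "yellow", "green")[seq_along(legend_labels)]

# Hypothetical helper wrapping the barplot() and legend() calls
draw_clustered <- function(tab, colors, title) {
  barplot(tab, col = colors, beside = TRUE, main = title,
          ylim = c(0, max(tab) * 1.25))
  legend("topright", legend = rownames(tab), fill = colors, cex = 0.5)
}
```

Calling `draw_clustered(Gender_Education, bar_colors, "Education by Gender")` would draw the same kind of clustered bar plot as above, with the legend derived from the table itself.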

#Clustered bar plot for Hobby by Location

Hobby_Location <- table(df$Hobby, df$Location)
Hobby_Location # what does this information tell you?
##              
##               City Rural Suburb
##   Photography    0    81      9
##   Reading      108    17      0
##   Swimming      42    58      0
##   Traveling     64    35     86
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Location) # Add totals to your table
##              
##               City Rural Suburb Sum
##   Photography    0    81      9  90
##   Reading      108    17      0 125
##   Swimming      42    58      0 100
##   Traveling     64    35     86 185
##   Sum          214   191     95 500
color <- c("orangered3","navyblue","yellow2","palegreen4")
names <- c("Photography","Reading", "Swimming","Traveling")
barplot(Hobby_Location, col=color, beside= TRUE, main = "Hobby by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Location))
##              
##               City Rural Suburb Sum
##   Photography    0    81      9  90
##   Reading      108    17      0 125
##   Swimming      42    58      0 100
##   Traveling     64    35     86 185
##   Sum          214   191     95 500

#Clustered bar plot for Hobby by Education

Hobby_Education <- table(df$Education, df$Hobby)
Hobby_Education # what does this information tell you?
##              
##               Photography Reading Swimming Traveling
##   Bachelor's            0      94       86        32
##   High School          81      12        1        10
##   Master's              0      19       13        64
##   PhD                   9       0        0        79
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Education) # Add totals to your table
##              
##               Photography Reading Swimming Traveling Sum
##   Bachelor's            0      94       86        32 212
##   High School          81      12        1        10 104
##   Master's              0      19       13        64  96
##   PhD                   9       0        0        79  88
##   Sum                  90     125      100       185 500
color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Hobby_Education, col=color, beside= TRUE, main = "Hobby by Education", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Education))
##              
##               Photography Reading Swimming Traveling Sum
##   Bachelor's            0      94       86        32 212
##   High School          81      12        1        10 104
##   Master's              0      19       13        64  96
##   PhD                   9       0        0        79  88
##   Sum                  90     125      100       185 500

#Clustered bar plot for Hobby by Category

Hobby_Category <- table(df$Hobby, df$Category)
Hobby_Category # what does this information tell you?
##              
##               Art Music Sports
##   Photography   9     0     81
##   Reading      90     8     27
##   Swimming     70    30      0
##   Traveling     4   107     74
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Category) # Add totals to your table
##              
##               Art Music Sports Sum
##   Photography   9     0     81  90
##   Reading      90     8     27 125
##   Swimming     70    30      0 100
##   Traveling     4   107     74 185
##   Sum         173   145    182 500
color <- c("orangered3","navyblue","yellow2","darkgreen")
names <- c("Photography","Reading", "Swimming", "Traveling")
barplot(Hobby_Category, col=color, beside= TRUE, main = "Hobby by Category", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Category))
##              
##               Art Music Sports Sum
##   Photography   9     0     81  90
##   Reading      90     8     27 125
##   Swimming     70    30      0 100
##   Traveling     4   107     74 185
##   Sum         173   145    182 500

#Clustered bar plot for Education by Location

Location_Education <- table(df$Education, df$Location)
Location_Education # what does this information tell you?
##              
##               City Rural Suburb
##   Bachelor's   124    88      0
##   High School    1   103      0
##   Master's      84     0     12
##   PhD            5     0     83
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Location_Education) # Add totals to your table
##              
##               City Rural Suburb Sum
##   Bachelor's   124    88      0 212
##   High School    1   103      0 104
##   Master's      84     0     12  96
##   PhD            5     0     83  88
##   Sum          214   191     95 500
color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's", "PhD")
barplot(Location_Education, col=color, beside= TRUE, main = "Education by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Location_Education))
##              
##               City Rural Suburb Sum
##   Bachelor's   124    88      0 212
##   High School    1   103      0 104
##   Master's      84     0     12  96
##   PhD            5     0     83  88
##   Sum          214   191     95 500

Making Pretty Tables

#Contingency tables for Hobby by Location, Hobby by Education, and Hobby by Category and Education by Location

library(knitr)
## Warning: package 'knitr' was built under R version 4.5.1
library(kableExtra)

# Create the contingency table
Gender_Education <- table(df$Education, df$Gender)

# Add row and column totals
Gender_Education_margins <- addmargins(Gender_Education)

# Make a clean and beautiful table with kable
kable(Gender_Education_margins, caption = "Gender by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Gender by Education Level
               Female   Male   Other   Sum
Bachelor’s       197      8       7    212
High School       11     22      71    104
Master’s          13     83       0     96
PhD                0     71      17     88
Sum              221    184      95    500
# Create the contingency table
Hobby_Location <- table(df$Location, df$Hobby)

# Add row and column totals
Hobby_Location_margins <- addmargins(Hobby_Location)

# Make a clean and beautiful table with kable
kable(Hobby_Location_margins, caption = "Table 3. Hobby by Location", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Table 3. Hobby by Location
          Photography   Reading   Swimming   Traveling   Sum
City           0          108        42         64       214
Rural         81           17        58         35       191
Suburb         9            0         0         86        95
Sum           90          125       100        185       500
# Create the contingency table
Hobby_Education <- table(df$Education, df$Hobby)

# Add row and column totals
Hobby_Education_margins <- addmargins(Hobby_Education)

# Make a clean and beautiful table with kable
kable(Hobby_Education_margins, caption = "Table 4. Hobby by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Table 4. Hobby by Education Level
               Photography   Reading   Swimming   Traveling   Sum
Bachelor’s          0           94        86         32       212
High School        81           12         1         10       104
Master’s            0           19        13         64        96
PhD                 9            0         0         79        88
Sum                90          125       100        185       500
# Create the contingency table
Hobby_Category <- table(df$Category, df$Hobby)

# Add row and column totals
Hobby_Category_margins <- addmargins(Hobby_Category)

# Make a clean and beautiful table with kable
kable(Hobby_Category_margins, caption = "Hobby by Category", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Hobby by Category
          Photography   Reading   Swimming   Traveling   Sum
Art            9           90        70          4       173
Music          0            8        30        107       145
Sports        81           27         0         74       182
Sum           90          125       100        185       500
# Create the contingency table
Location_Education <- table(df$Education, df$Location)

# Add row and column totals
Location_Education_margins <- addmargins(Location_Education)

# Make a clean and beautiful table with kable
kable(Location_Education_margins, caption = "Table 5. Education Level by Location", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Table 5. Education Level by Location
               City   Rural   Suburb   Sum
Bachelor’s     124      88       0     212
High School      1     103       0     104
Master’s        84       0      12      96
PhD              5       0      83      88
Sum            214     191      95     500

Your Turn

Essay Question

Explain the table in detail. Focus on the information extracted, anomalies, outliers, relationships.

Clustered bar plots and contingency tables were created for the Hobby variable to identify trends in hobby values based on location, education, and category. Clustered bar plots and contingency tables were also created for hobby by category, but not included in this EDA as the insights from these relationships were not as meaningful as the relationships between location, education level, and hobby.

When comparing the hobby variable with location, the suburb category had the least variety, with only photography and traveling represented. City included reading, swimming, and traveling. Rural had representation in all of the categories. As shown in the contingency table, rural accounted for 90% of the values for the hobby of photography and city accounted for 86% of the values for the hobby of reading. Swimming was more equally distributed between the city and rural categories. All locations included traveling as a hobby, although the suburb location had the highest count of traveling and rural had the lowest. (It would be interesting to look at hobby by income to see if higher income is related to travel, especially since Master’s and PhD had higher counts of travel than Bachelor’s and high school.)

When comparing hobby by education in the clustered bar graph, several trends are visible. Only the hobby of traveling included counts from all education categories. Observations with a Bachelor’s had the highest counts for reading and swimming, with a smaller but visible number for traveling. The hobby of photography was largely made up of observations with high school as the education category, with a small number of observations with PhD. No observations with an education of PhD listed reading or swimming as hobbies. Observations with a Master’s degree fell primarily under traveling and, to a lesser degree, reading and swimming. In the contingency table, 90% of the observations for the hobby of photography listed high school as the educational level. For the hobby of reading, the highest count was for the educational level of Bachelor’s, with a small number in high school and Master’s. Swimming was also dominated by the educational level of Bachelor’s, with only one observation in high school, a few in Master’s, and zero in PhD. The educational level of PhD had the highest count in traveling, followed by a slightly smaller count for Master’s, a smaller count for Bachelor’s, and the smallest count for high school.

A clustered bar graph and contingency table were also created to identify the relationship between the education and location variables. From the clustered bar graph, it is evident that educational levels are not equally represented across locations. The educational level of high school had almost all of its values in the rural location, with what appears to be a tiny count in the city location. The educational level of Bachelor’s had a count slightly less than high school in the rural location, with a slightly higher count in the city than in the rural location and no values in the suburb. The educational level of Master’s had its highest count in the city, with a small number in the suburb location. Finally, the educational level of PhD had its highest count in the suburb location, with a small number in the city. No values with graduate degree education levels were found in the rural location. No high school or Bachelor’s education level values were found in the suburb location. In the contingency table, 87% of values for the suburb location had the education level of PhD. In the rural location, a little over half (54%) had the education level of high school. For the city location, 58% of values had the education level of Bachelor’s and 39% had the education level of Master’s, with a small count for PhD and only 1 for the high school education level.
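
Percentages like these can be read off a contingency table with `prop.table()` instead of being worked out by hand. The following is a minimal sketch: the counts are copied from the Education Level by Location table above, and `margin = 2` gives the share of each education level within a location (each column sums to 100).

```r
# Counts copied from the Education Level by Location table above
Location_Education <- as.table(matrix(
  c(124,  88,  0,
      1, 103,  0,
     84,   0, 12,
      5,   0, 83),
  nrow = 4, byrow = TRUE,
  dimnames = list(Education = c("Bachelor's", "High School", "Master's", "PhD"),
                  Location  = c("City", "Rural", "Suburb"))
))

# Share of each education level within each location, in percent
col_pct <- round(100 * prop.table(Location_Education, margin = 2))
print(col_pct)

# Share of each location within each education level, in percent
row_pct <- round(100 * prop.table(Location_Education, margin = 1))
print(row_pct)
```

Using `margin = 1` instead answers the reverse question, for example what share of PhD observations are in the suburb location. Stating which margin a percentage refers to avoids ambiguity in the write-up.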

Based on the descriptive statistics, relationships between the variables can be determined and visualized. A comparison of the histograms provides insights about the trends of individual characteristics and happiness values.

The histograms for age and weight were more similar in shape to the happiness histogram than the income histogram was. Happiness peaked at 7.5, with more values between 7 and 9 than between 5 and 7, and was skewed toward the higher numbers. Age had its highest frequencies between 25 and 30, with outliers older than 30, a central median, and a wide spread toward the higher end. Weight ranged from 45 to 85, with most values between 65 and 70 and an outlier at 90; it was skewed toward the higher end, with a wider spread beyond the third quartile than between the minimum and the first quartile. Further analysis is required to determine whether the similarity in the shape of the graphs indicates a positive correlation between these variables and happiness scores.
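
Similar histogram shapes do not by themselves imply a correlation, which is why the further analysis is needed. The sketch below illustrates this with two hypothetical vectors that contain exactly the same values, and therefore produce identical histograms, yet are negatively correlated because the values are paired in opposite order.

```r
# Two hypothetical variables; NOT values from the course data set
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)
y <- rev(x)                      # same values, paired in reverse order

# Identical distributions, so the two histograms would look the same
same_distribution <- identical(sort(x), sort(y))

# Pearson correlation depends on how observations are paired,
# not on the shape of each variable's histogram
r <- cor(x, y)

print(same_distribution)
print(r)                         # negative despite identical histograms
```

On the real data, a call such as `cor(df$Age, df$Happiness)` (assuming both columns are numeric and complete) or a scatter plot would address the question directly.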

Kovner et al. (2012) identified age among the factors that indicate pursuit of graduate education. In this data set, the age histogram and the happiness histogram were similar in shape. The trends identified in the clustered bar plots demonstrated differences in location and hobby for graduate education levels (Master’s and PhD). Further analysis of correlations between happiness scores and education levels may provide insights into the happiness values based on education level.

Based on the data analysis and visualization, there appears to be a relationship between the education, location, and hobby variables. The education level with the fewest counts was PhD, making up 18% of the sample. There appears to be a relationship between the education level of PhD, location (suburb), and hobby (traveling). A relationship was also found between the education level of Bachelor’s, location (city and rural), and hobby (reading and swimming). Observations with a Master’s were concentrated in the city location, with the highest counts in the traveling category of the hobby variable, although a few were included in reading and swimming. Finally, the high school education level made up 21% of the sample. Almost all of the observations with high school as the education level were in the rural location category, and most of them listed photography as the hobby.

Works Cited

Black, Ken. Business Statistics for Contemporary Decision Making 7E + WileyPlus Registration Card. 16 Mar. 2012.

Bradstreet, Thomas E., and John S. Palcza. “Digging into Data with Graphics.” Teaching Statistics Trust, vol. 34, no. 2, 2011, pp. 68–74.

Kovner, Christine T., et al. “Charting the Course for Nurses’ Achievement of Higher Education Levels.” Journal of Professional Nursing, vol. 28, no. 6, Nov. 2012, pp. 333–343, https://doi.org/10.1016/j.profnurs.2012.04.021. Accessed 10 Jan. 2020.

Medley-Rath, Stephanie, et al. “Figures and Charts and Tables, Oh My!: A Content Analysis of Textbook Data Visualizations.” Teaching Sociology, vol. 52, no. 3, 29 Nov. 2023, https://doi.org/10.1177/0092055x231214006. Accessed 26 Apr. 2024.