As we move from week to week and task to task, the code that you have already completed stays in the template but does not run; this is done by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.
# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting; if you have already installed the packages, comment this out by entering a # in front of this command
#install.packages("scales") # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2) # Load ggplot2 library
library(scales) # Load scales library
library(moments)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
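The commented-out install.packages() lines above can be replaced by a check that installs only what is missing. The following is a minimal sketch, assuming a helper named missing_packages (a hypothetical name, not part of the template) and a package list that mirrors the libraries loaded above.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "scales", "moments", "dplyr", "kableExtra")
to_install <- missing_packages(needed)

# Install only what is missing (left commented so knitting never triggers a download).
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

An empty result means every package is already available and the library() calls will succeed.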
This needs to be addressed here.
# Check the current working directory
getwd()
## [1] "C:/Users/benke/Downloads"
# In the next line, change the directory to the place where you saved the
# data file. If you prefer, you can save your data.csv file in the directory
# that getwd() reported above.
# For example, your next line should look something like this: setwd("C:/Users/tsapara/Documents")
# Set the working directory to where the data file is located
# This ensures the program can access the file correctly
setwd("C:/Users/benke/Downloads")
### Choose an already existing directory in your computer.
# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors = TRUE converts string columns to factors
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility
# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset
write.csv(df, file = "my_data.csv", row.names = FALSE) # This command may take some time to run; once it is done, it will have created the data file locally in your working directory
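The role of set.seed() and of replace = TRUE in the sampling step can be checked on a small example. This is a sketch under stated assumptions: the data frame toy is invented for illustration, and the seed 133 is simply A + B from the values above.

```r
# Toy stand-in for the real dataset (assumption: 20 rows with an ID column).
toy <- data.frame(ID = 1:20)

# Drawing twice with the same seed gives the same rows.
set.seed(133)
first_draw <- toy[sample(nrow(toy), 10, replace = TRUE), , drop = FALSE]
set.seed(133)
second_draw <- toy[sample(nrow(toy), 10, replace = TRUE), , drop = FALSE]
print(identical(first_draw$ID, second_draw$ID))

# With replace = TRUE the same row can be drawn more than once,
# so repeated IDs in the sample are expected rather than an error.
print(sum(duplicated(first_draw$ID)))
```

This is why the same ID can appear several times in the sampled file.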
As practice, you may now want to knit your file to HTML. To do this, click the Knit button on the top panel and wait for the file to render. The HTML will open once it is done so that you can review it.
It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML
In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html
df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
Step 0. Now that you have read the file, you want to learn some basic information about your data
The following commands will not be explained here. Do your research, review your csv file, and answer the questions related to this part of your code.
# Basic exploratory commands
nrow(df) # Number of rows in the dataset
## [1] 500
length(df) # Number of columns (or variables) in the dataset
## [1] 16
str(df) # Structure of the dataset (data types and a preview)
## 'data.frame': 500 obs. of 16 variables:
## $ ID : int 31 957 964 34 968 1040 31 1051 32 1082 ...
## $ Gender : Factor w/ 3 levels "Female","Male",..: 1 3 2 1 2 2 1 1 2 3 ...
## $ Age : int 29 30 31 27 30 29 29 26 31 28 ...
## $ Height : int 155 165 180 168 182 175 155 160 180 185 ...
## $ Weight : int 54 NA NA 65 80 70 54 55 NA 85 ...
## $ Education : Factor w/ 5 levels "","Bachelor's",..: 4 3 2 2 4 5 4 2 2 3 ...
## $ Income : int 55000 NA 50000 45000 65000 60000 55000 48000 50000 NA ...
## $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 2 3 2 3 2 2 2 3 2 3 ...
## $ Employment : Factor w/ 3 levels "","Employed",..: 2 3 2 2 2 2 2 2 2 3 ...
## $ Score : num 7.3 5.5 6.5 6.2 NA 7.8 7.3 6.1 6.5 5.7 ...
## $ Rating : Factor w/ 4 levels "","A","B","C": 3 2 2 3 4 4 3 3 2 2 ...
## $ Category : Factor w/ 4 levels "","Art","Music",..: 3 4 2 2 3 4 3 2 2 4 ...
## $ Color : Factor w/ 4 levels "","Blue","Green",..: 2 3 3 2 4 4 2 2 3 3 ...
## $ Hobby : Factor w/ 5 levels "","Photography",..: 4 3 3 3 5 5 4 4 3 2 ...
## $ Happiness : num 8.2 NA 8 7 8.5 9 8.2 7 8 6 ...
## $ Location : Factor w/ 4 levels "","City","Rural",..: 2 3 3 2 2 4 2 3 3 3 ...
summary(df) # Summary statistics for each column
## ID Gender Age Height Weight
## Min. : 1.0 Female:221 Min. :25.00 Min. :155.0 Min. :54.00
## 1st Qu.: 59.0 Male :184 1st Qu.:27.00 1st Qu.:165.0 1st Qu.:60.00
## Median :1001.0 Other : 95 Median :28.00 Median :175.0 Median :70.00
## Mean : 766.7 Mean :28.15 Mean :172.3 Mean :69.38
## 3rd Qu.:1059.2 3rd Qu.:29.00 3rd Qu.:182.0 3rd Qu.:80.00
## Max. :1117.0 Max. :34.00 Max. :190.0 Max. :90.00
## NA's :27 NA's :18 NA's :28
## Education Income MaritalStatus Employment
## : 20 Min. :32000 : 16 : 21
## Bachelor's :189 1st Qu.:45000 Married:192 Employed :364
## High School:104 Median :48000 Single :292 Unemployed:115
## Master's : 96 Mean :51551
## PhD : 88 3rd Qu.:60000
## NA's : 3 Max. :70000
## NA's :77
## Score Rating Category Color Hobby
## Min. :5.50 : 4 : 6 : 1 : 4
## 1st Qu.:6.10 A:147 Art :173 Blue :207 Photography: 90
## Median :6.20 B:215 Music :145 Green:135 Reading :125
## Mean :6.63 C:134 Sports:176 Red :157 Swimming :100
## 3rd Qu.:7.50 Traveling :181
## Max. :8.90
## NA's :67
## Happiness Location
## Min. :6.000 : 5
## 1st Qu.:7.000 City :209
## Median :7.000 Rural :191
## Mean :7.502 Suburb: 95
## 3rd Qu.:8.500
## Max. :9.000
## NA's :13
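Blanks and NAs are counted differently: summary() lists NAs explicitly, while blanks appear as an unnamed factor level. One possible way to count both per column is sketched below; the data frame toy and its values are invented for illustration.

```r
# Toy data with one NA in a numeric column and one blank in a factor column.
toy <- data.frame(
  Weight = c(54, NA, 65),
  Education = factor(c("PhD", "", "Master's"))
)

# NAs per column
na_counts <- sapply(toy, function(col) sum(is.na(col)))

# Blanks per column (compare as character so factors and numbers behave alike)
blank_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

The same two sapply() calls can be applied to df to support the answers to the questions that follow.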
Please answer the following questions, by typing information after the question.
Question 1
What type of variables does your file include?
Answer 1:
Question 2
Specific data types?
Answer 2:
Question 3
Are they read properly?
Answer 3:
Question 4
Are there any issues?
Answer 4:
Question 5
Does your file include both NAs and blanks?
Answer 5:
Question 6
How many NAs do you have?
Answer 6:
Question 7
How many blanks?
Answer 7:
Step 1: Handling both blanks and NAs is not simple, so first we want to eliminate one of the two. Let’s eliminate the blanks by changing them to NAs.
#
# Step 1: Handling both blanks and NAs is not simple, so first we want to
# eliminate one of the two: let's change the blanks to NAs
#
# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA
# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus", "Category",
"Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))
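The reason for rebuilding the factors after replacing blanks can be seen on a single column. This is a small sketch with an invented vector edu: setting values to NA does not remove the empty level, so the factor has to be rebuilt (or passed through droplevels(), which has the same effect here).

```r
# A factor that contains blanks, as read.csv() would produce with stringsAsFactors = TRUE.
edu <- factor(c("PhD", "", "Master's", ""))

# Replace blanks with NA; the "" level is still listed afterwards.
edu[edu == ""] <- NA
print(levels(edu))

# Rebuilding the factor drops the unused "" level.
edu_clean <- as.factor(as.character(edu))
print(levels(edu_clean))

# droplevels() is an alternative that gives the same levels.
print(levels(droplevels(edu)))
```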
Step 2: Count NAs in the entire dataset
#
# Step 2: Count NAs in the entire dataset
# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 310
Please answer the following questions, by typing information after the question.
Question 8
Explain what the printed number is, what information it conveys, and how you can use it in your analysis.
Answer 8:
Step 3: Count rows with NAs.
#
# Step 3: Count rows with NAs
#
# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 255
Percent_row_NA
## [1] "51%"
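The row count above can be cross-checked with complete.cases(), which flags rows that have no missing values. A minimal sketch on an invented data frame toy:

```r
# Toy data: rows 2 and 3 each contain one NA.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, TRUE, TRUE, TRUE)
)

# Two equivalent ways to count rows with at least one NA
rows_with_na <- sum(rowSums(is.na(toy)) > 0)
rows_incomplete <- sum(!complete.cases(toy))

# Share of incomplete rows as a plain proportion
share_incomplete <- mean(!complete.cases(toy))

print(c(rows_with_na, rows_incomplete))
print(share_incomplete)
```

Keeping the share as a number, rather than the formatted string that percent() returns, makes it easier to compare against a threshold such as 0.05.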
Question 9
How large is the proportion of rows with NAs, given that we can drop up to 5%?
Answer 9:
Question 10
Do you think it would be wise to drop the above percentage?
Answer 10:
Question 11
How will this affect your dataset?
Answer 11:
Step 4: Count columns with NAs
#
# Step 4: Count columns with NAs
# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 14
Percent_col_NA
## [1] "88%"
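Counting the columns with NAs does not show how many values are missing in each one. A per-column share can be obtained with colMeans(is.na(...)); the sketch below uses an invented data frame toy.

```r
# Toy data: columns a and b each have one NA out of four values.
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(TRUE, TRUE, TRUE, TRUE)
)

# Proportion of missing values in each column
na_share <- colMeans(is.na(toy))
print(na_share)

# Names of the columns that contain at least one NA
cols_with_na <- names(na_share)[na_share > 0]
print(cols_with_na)
```

A column with only a small share of NAs is a candidate for imputation rather than removal.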
Question 12
How large is the proportion of columns with NAs? We never want to drop entire columns, as this would mean losing variables and associations, but do you think it would be wise to drop the above percentage?
Answer 12:
Question 13
How will this affect your dataset?
Answer 13:
Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, “NA” for character)
In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that replacing NAs with the mean for numeric variables and the mode for categorical variables is correct. This is not always the case, of course, but the code would become much more complicated otherwise.
#
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.
# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
if (sum(!is.na(col)) > 10) {
col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
} else {
col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
}
} else if (is.factor(col)) { # Factor columns
mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
col[is.na(col)] <- mode_val
} else if (is.character(col)) { # Character columns
col[is.na(col)] <- "NA" # Replace with "NA"
}
return(col) # Return the modified column
})
df <- as.data.frame(df) # Convert the list back to a dataframe
#
# Imputing with the above method has now changed some of the statistics
# Check the updated dataset and ensure no remaining NAs
summary(df)
## ID Gender Age Height Weight
## Min. : 1.0 Female:221 Min. :25.00 Min. :155.0 Min. :54.00
## 1st Qu.: 59.0 Male :184 1st Qu.:27.00 1st Qu.:165.0 1st Qu.:62.00
## Median :1001.0 Other : 95 Median :28.00 Median :172.3 Median :69.38
## Mean : 766.7 Mean :28.15 Mean :172.3 Mean :69.38
## 3rd Qu.:1059.2 3rd Qu.:29.00 3rd Qu.:182.0 3rd Qu.:80.00
## Max. :1117.0 Max. :34.00 Max. :190.0 Max. :90.00
## Education Income MaritalStatus Employment
## Bachelor's :212 Min. :32000 Married:192 Employed :385
## High School:104 1st Qu.:45000 Single :308 Unemployed:115
## Master's : 96 Median :51551
## PhD : 88 Mean :51551
## 3rd Qu.:60000
## Max. :70000
## Score Rating Category Color Hobby
## Min. :5.50 A:147 Art :173 Blue :208 Photography: 90
## 1st Qu.:6.10 B:219 Music :145 Green:135 Reading :125
## Median :6.20 C:134 Sports:182 Red :157 Swimming :100
## Mean :6.63 Traveling :185
## 3rd Qu.:7.30
## Max. :8.90
## Happiness Location
## Min. :6.000 City :214
## 1st Qu.:7.000 Rural :191
## Median :7.000 Suburb: 95
## Mean :7.502
## 3rd Qu.:8.500
## Max. :9.000
summary(df) %>%
kbl(caption = "Table 1. Summary of Data Frame Characteristics") %>%
kable_classic()
ID | Gender | Age | Height | Weight | Education | Income | MaritalStatus | Employment | Score | Rating | Category | Color | Hobby | Happiness | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.0 | Female:221 | Min. :25.00 | Min. :155.0 | Min. :54.00 | Bachelor’s :212 | Min. :32000 | Married:192 | Employed :385 | Min. :5.50 | A:147 | Art :173 | Blue :208 | Photography: 90 | Min. :6.000 | City :214 |
1st Qu.: 59.0 | Male :184 | 1st Qu.:27.00 | 1st Qu.:165.0 | 1st Qu.:62.00 | High School:104 | 1st Qu.:45000 | Single :308 | Unemployed:115 | 1st Qu.:6.10 | B:219 | Music :145 | Green:135 | Reading :125 | 1st Qu.:7.000 | Rural :191 |
Median :1001.0 | Other : 95 | Median :28.00 | Median :172.3 | Median :69.38 | Master’s : 96 | Median :51551 | NA | NA | Median :6.20 | C:134 | Sports:182 | Red :157 | Swimming :100 | Median :7.000 | Suburb: 95 |
Mean : 766.7 | NA | Mean :28.15 | Mean :172.3 | Mean :69.38 | PhD : 88 | Mean :51551 | NA | NA | Mean :6.63 | NA | NA | NA | Traveling :185 | Mean :7.502 | NA |
3rd Qu.:1059.2 | NA | 3rd Qu.:29.00 | 3rd Qu.:182.0 | 3rd Qu.:80.00 | NA | 3rd Qu.:60000 | NA | NA | 3rd Qu.:7.30 | NA | NA | NA | NA | 3rd Qu.:8.500 | NA |
Max. :1117.0 | NA | Max. :34.00 | Max. :190.0 | Max. :90.00 | NA | Max. :70000 | NA | NA | Max. :8.90 | NA | NA | NA | NA | Max. :9.000 | NA |
head(df) %>%
kbl(caption = "Table 1. Head of Data Frame Characteristics") %>%
kable_classic()
ID | Gender | Age | Height | Weight | Education | Income | MaritalStatus | Employment | Score | Rating | Category | Color | Hobby | Happiness | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31 | Female | 29 | 155 | 54.00000 | Master’s | 55000.00 | Married | Employed | 7.300000 | B | Music | Blue | Swimming | 8.200000 | City |
957 | Other | 30 | 165 | 69.37924 | High School | 51550.83 | Single | Unemployed | 5.500000 | A | Sports | Green | Reading | 7.502259 | Rural |
964 | Male | 31 | 180 | 69.37924 | Bachelor’s | 50000.00 | Married | Employed | 6.500000 | A | Art | Green | Reading | 8.000000 | Rural |
34 | Female | 27 | 168 | 65.00000 | Bachelor’s | 45000.00 | Single | Employed | 6.200000 | B | Art | Blue | Reading | 7.000000 | City |
968 | Male | 30 | 182 | 80.00000 | Master’s | 65000.00 | Married | Employed | 6.630485 | C | Music | Red | Traveling | 8.500000 | City |
1040 | Male | 29 | 175 | 70.00000 | PhD | 60000.00 | Married | Employed | 7.800000 | C | Sports | Red | Traveling | 9.000000 | Suburb |
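Mean imputation leaves the mean unchanged but makes the variable look less spread out, and it can pull the median toward the mean. The sketch below illustrates this on an invented vector x; the values are not taken from the dataset.

```r
# Toy numeric variable with three missing values.
x <- c(54, 60, 70, 80, 90, NA, NA, NA)

# Replace each NA with the mean of the observed values.
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

# Compare the statistics before and after imputation.
before <- c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
after <- c(mean = mean(x_imputed), median = median(x_imputed), sd = sd(x_imputed))
print(round(rbind(before, after), 2))
```

The same comparison on df explains why several medians in the updated summary equal the mean.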
Essay Question
Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NAs in your file? What information is printed by the summary? How can this be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note changes. Explain everything that you observe.
Answer
Step 6: Create descriptive statistics for all variables
We run all the descriptive statistics for all the numeric variables
###################################################################
#
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
if (is.numeric(column) || is.integer(column)) {
data.frame(
Variable = name,
Mean = round(mean(column, na.rm = TRUE), 2),
Median = round(median(column, na.rm = TRUE), 2),
St.Deviation = round(sd(column, na.rm = TRUE), 2),
Range = round(diff(range(column, na.rm = TRUE)), 2),
IQR = round(IQR(column, na.rm = TRUE), 2),
Skewness = round(skewness(column, na.rm = TRUE), 2),
Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
stringsAsFactors = FALSE
)
} else {
NULL
}
}
# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
rbind,
lapply(names(df), function(col) compute_stats(df[[col]], col))
)
# Print the descriptive statistics dataframe
descriptive_stats
## Variable Mean Median St.Deviation Range IQR Skewness Kurtosis
## 1 ID 766.68 1001.00 443.87 1116.0 1000.25 -1.03 2.11
## 2 Age 28.15 28.00 1.84 9.0 2.00 0.71 3.96
## 3 Height 172.30 172.30 9.19 35.0 17.00 0.03 1.84
## 4 Weight 69.38 69.38 10.41 36.0 18.00 0.18 1.88
## 5 Income 51550.83 51550.83 7926.35 38000.0 15000.00 0.22 2.41
## 6 Score 6.63 6.20 0.81 3.4 1.20 0.72 2.59
## 7 Happiness 7.50 7.00 0.98 3.0 1.50 0.06 1.85
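To interpret the Skewness and Kurtosis columns it helps to see the moment-based formulas written out. The sketch below uses base R only and assumes the population-moment definitions (dividing by n), which to my understanding are the ones the moments package uses; the function names skew_by_hand and kurt_by_hand are hypothetical.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Pearson kurtosis: fourth central moment over the squared second moment.
# Under this definition a normal distribution has a kurtosis of 3.
kurt_by_hand <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

x <- c(1, 2, 3, 4, 5)
print(skew_by_hand(x))
print(kurt_by_hand(x))
```

A symmetric vector such as this one has zero skewness, and its kurtosis below 3 reflects a flat shape with light tails.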
Step 7: Print Descriptive Statistics
Now you have all the descriptive statistics for all numeric variables. Create a professional table in your paper. The kableExtra library can help you create the table here. If you have no programming experience, you can cut and paste into Excel and beautify the table there.
#############################################################
#
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables
# Create a professional table in your paper.
# The kableExtra library can help you create the table here.
# if you have no programming experience you can cut and paste in Excel
# and beautify the table in Excel
#############################################################
print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
print(descriptive_stats)
## Variable Mean Median St.Deviation Range IQR Skewness Kurtosis
## 1 ID 766.68 1001.00 443.87 1116.0 1000.25 -1.03 2.11
## 2 Age 28.15 28.00 1.84 9.0 2.00 0.71 3.96
## 3 Height 172.30 172.30 9.19 35.0 17.00 0.03 1.84
## 4 Weight 69.38 69.38 10.41 36.0 18.00 0.18 1.88
## 5 Income 51550.83 51550.83 7926.35 38000.0 15000.00 0.22 2.41
## 6 Score 6.63 6.20 0.81 3.4 1.20 0.72 2.59
## 7 Happiness 7.50 7.00 0.98 3.0 1.50 0.06 1.85
descriptive_stats %>%
kbl(caption = "Table 2. Descriptive Statistics") %>%
kable_classic()
Variable | Mean | Median | St.Deviation | Range | IQR | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|
ID | 766.68 | 1001.00 | 443.87 | 1116.0 | 1000.25 | -1.03 | 2.11 |
Age | 28.15 | 28.00 | 1.84 | 9.0 | 2.00 | 0.71 | 3.96 |
Height | 172.30 | 172.30 | 9.19 | 35.0 | 17.00 | 0.03 | 1.84 |
Weight | 69.38 | 69.38 | 10.41 | 36.0 | 18.00 | 0.18 | 1.88 |
Income | 51550.83 | 51550.83 | 7926.35 | 38000.0 | 15000.00 | 0.22 | 2.41 |
Score | 6.63 | 6.20 | 0.81 | 3.4 | 1.20 | 0.72 | 2.59 |
Happiness | 7.50 | 7.00 | 0.98 | 3.0 | 1.50 | 0.06 | 1.85 |
Essay Question
Review and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. How can this be interpreted? What are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you observe. Complete your research, compare your variables, and complete your paper.
The descriptive statistics provide numerical information about the trends in the data. The mean and median are measures of central tendency and provide a general idea of the numerical value by providing the midpoint of a set of values (median) and the mathematical average (mean) (Black, 2012). These numbers are important when considering how specific values compare with the average. The largest number is related to the Income variable, with a mean and median of 51,550.83. The size of the mean of this variable indicates Income may relate to a yearly income in a currency. The smallest numbers relate to the Score and Happiness variables, with a mean of 6.63 for Score and a mean of 7.5 for Happiness. Because the means of these variables are less than 10, they likely relate to scores on a survey or questionnaire. The means for age, height, and weight are 28.15, 172.30, and 69.38, respectively, and indicate this sample includes young adults.
Standard deviation, range, and the interquartile range (IQR) provide information about the variability and range in values. Measures of variability show the amount of similarity in individual values to deduce how representative the sample is of the population of interest (Black, 2012). Age has a range of 9, a standard deviation of 1.84, and an IQR of 2. From the descriptive statistics, age has low variability with a low standard deviation, although there are likely a few outliers causing the range to be quite different from the IQR and standard deviation. Score and Happiness both have a standard deviation of less than 1.0 and a range of less than 3.5, indicating low variability. The weight variable has higher variability, with a high standard deviation of 10.41 and a range of 36.0 compared to the mean of 69.38. The height variable, on the other hand, has a mean of 172.30 with a standard deviation of 9.19 and a range of 35, indicating a moderate amount of variability across the range but not as high as for the Weight variable.
Skewness and kurtosis describe the shape of the distribution: skewness measures asymmetry, and kurtosis measures how heavy the tails are relative to a normal distribution. High skewness and kurtosis may signal bias, with values clustered around one value and low-frequency outliers affecting the interpretation. Among the measured variables, Age (3.96), Score (2.59), and Income (2.41) have the highest kurtosis. The kurtosis() function in the moments package reports Pearson’s kurtosis, for which a normal distribution scores 3, so only Age exceeds the normal benchmark and suggests heavier tails, while Score and Income fall below it. Score (0.72) and Age (0.71) have the highest positive skewness, indicating that values cluster at the lower end of the range with a tail toward higher values. For Height, Weight, Income, and Happiness, skewness is close to zero (0.03 to 0.22), indicating roughly symmetric distributions.
From the measures of variability, age and height have the highest consistency and may yield meaningful insights with further analysis. However, weight was chosen for further investigation due to the predicted relationship between weight and happiness. Income is highly meaningful when considering relationships with happiness scores or with categorical variables such as level of education. However, the large standard deviation for income (7,926.35) may indicate bias. A larger sample size may increase the reliability of the values and could be considered for future analysis.
Based on the descriptive statistics, relationships between individual characteristics (Age, Height, Weight, and Income) and Happiness score could provide meaningful information about the impact of these characteristics on happiness scores.
The descriptive research questions are:
What is the relationship between age, weight, income and happiness values?
What is the relationship between location and hobby?
What is the relationship between education and hobby?
What is the relationship between education level and location?
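As a first look at questions of this kind, numeric variables can be compared with a correlation matrix and categorical variables with a cross-tabulation. The sketch below is only an illustration on an invented data frame toy; it is not the analysis method prescribed by the assignment.

```r
# Toy data with the same column names as the dataset (values invented).
toy <- data.frame(
  Age = c(25, 27, 29, 31, 34),
  Income = c(32000, 45000, 48000, 60000, 70000),
  Happiness = c(6, 7, 7, 8.5, 9),
  Location = factor(c("City", "Rural", "City", "Suburb", "Rural")),
  Hobby = factor(c("Reading", "Reading", "Swimming", "Traveling", "Traveling"))
)

# Pairwise Pearson correlations between numeric variables
cor_matrix <- cor(toy[, c("Age", "Income", "Happiness")])
print(round(cor_matrix, 2))

# Counts for each combination of two categorical variables
cross_tab <- table(toy$Location, toy$Hobby)
print(cross_tab)
```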
Step 8: Create graphs using ggplot2
For this part there are parts that you will need to change to create your graphs. The example is set to work with Income. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions etc…
#######################################################################
#
# Step 8: Create graphs using ggplot2
# For this part there are parts that you will need to change to create
# your graphs.
# The example is set to work with Income
# Make the necessary changes to create the rest of the graphs
# You may also want to change the colors, the dimensions etc...
#############################################################
#############################################################
#
# STEP 8a: Create a bargraph or a histogram
# Explain what graph was that and why?
# Set col to the desired column name
#############################################################
#
##
# In this code we start you off with an example of Happiness; later in the code
# you should replace this with your desired variable.
#
col = "Happiness" # This is an example, try to do the same with a different variable
#Bargraph for Education, Category, Hobby, and Location
# Assume df is your dataframe and col is the column name (as string)
col = "Education"
if (is.factor(df[[col]])) {
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
}
col = "Category"
if (is.factor(df[[col]])) {
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
}
col = "Hobby"
if (is.factor(df[[col]])) {
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
}
col = "Location"
if (is.factor(df[[col]])) {
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
}
You can also copy the chunk and create more graphs by resetting the col variable appropriately
#Histogram for Happiness
col = "Happiness"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# Histogram for numeric variables
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
#Histogram for Age
col = "Age"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# Histogram for numeric variables
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "orange2", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
#Histogram for Weight
col = "Weight"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# Histogram for numeric variables
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkblue", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
#Histogram for Income
col = "Income"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# Histogram for numeric variables
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
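The repeated chunks above differ only in the value of col, so they can be wrapped in one function. The following is a sketch, assuming a helper named plot_variable (a hypothetical name) and an invented data frame toy; colors and bin count are placeholders.

```r
library(ggplot2)

# Hypothetical helper: bar graph for factors, histogram for numeric columns.
plot_variable <- function(data, col, bins = 15) {
  if (is.factor(data[[col]])) {
    ggplot(data, aes(x = .data[[col]], fill = .data[[col]])) +
      geom_bar() +
      labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
      theme_minimal()
  } else if (is.numeric(data[[col]])) {
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(bins = bins, fill = "grey40", color = "black") +
      labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
      theme_minimal()
  } else {
    stop("Column must be a factor or numeric: ", col)
  }
}

toy <- data.frame(
  Education = factor(c("PhD", "Master's", "PhD", "High School")),
  Happiness = c(6, 7, 8.5, 9)
)

# Build one plot per column; inside lapply() or a for loop,
# each plot must be printed explicitly to appear in the knitted file.
plots <- lapply(c("Education", "Happiness"), function(col) plot_variable(toy, col))
for (p in plots) print(p)
```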
Essay Question
Now that you can observe your data graphically, explain the importance of graphical representations and how they help to communicate data to other parties. Explain what type of graph you created and why.
To fully comprehend the data and relationships, graphic visualization is required. Data arrangement and sorting for graphic representation which aligns with the research question provides an opportunity to compare individual data points with overall trends (Bradstreet and Palcza, 2011). Interpretation of relationships between variables, groups, and individual data points is easily understood by viewers when categories are clearly detected.
Two of the categorical variables represented indicate a larger volume in one category. For the Education variable, the count in Bachelor’s is almost double compared to high school, Master’s and PhD. In the variable of hobby, the category of traveling is higher than photography, swimming, and reading, although reading is higher than both swimming and photography. In the variable of location, both categories city and rural have a higher count than suburb. Viewing results in this manner provides information about the data for later analysis. In considering relationships between variables, the high volume of values in the Bachelor’s category may lead to bias when determining relationships between education level and other variables.
Histograms provide a visual representation of the distribution of numerical data, allowing rapid detection of trends. Histograms were developed for happiness, age, weight, and income. The histograms for happiness and age have the highest frequency of values close to the mean, with a small frequency of values spreading to the right of the graph, indicating a right (positive) skew in which most values sit on the left. Age has outlier values at 31.75 and 33.25, which do not occur with sufficient frequency to affect the skewness toward lower ages in the graph. The variable for Weight also has a higher frequency of values close to the mean, with frequently occurring weights at 80 and 85 and an outlier at 90. The income histogram, on the other hand, has increased values near the mean and to the right of the graph, indicating skewness toward higher income values, with an outlier at 70,000. A small frequency of values is observed from 31,000 to 42,000. The information from this graph shows that the income for this sample mostly falls between 45,000 and 65,000, although the range is much wider. This is important when making inferences regarding income, as the small frequency of values in the lower range may indicate bias in the results if the sample does not match the income of the population investigated.
STEP 8b: Create a boxplot and a histogram for numeric variables. Note that the bin width cannot be set in the same way for Age or Happiness, which have a small range, and for Income, whose range is in the thousands. Change this appropriately.
Please note that this part of the code will not run for the demo code. You will need to change the value of eval=FALSE to eval=TRUE, after you introduce your code, to run it and add it to your knitted file.
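One simple way to pick a bin width that adapts to the scale of the variable is to divide its range by the desired number of bins. This is a sketch of that rule of thumb; the function name suggest_binwidth and the example vectors are assumptions for illustration.

```r
# Hypothetical helper: bin width that yields roughly `bins` bars across the range.
suggest_binwidth <- function(x, bins = 15) {
  diff(range(x, na.rm = TRUE)) / bins
}

happiness <- c(6, 7, 7, 8.5, 9)
income <- c(32000, 45000, 48000, 60000, 70000)

# A small-range variable gets a small bin width, a large-range one a large bin width.
print(suggest_binwidth(happiness))
print(suggest_binwidth(income))
```

The result can be passed to geom_histogram(binwidth = ...) in place of the fixed 0.3 used in the template.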
#############################################################
#
# STEP 8b: Create a boxplot and histogram for numeric variables
# Note that the bin width cannot be set in the same way for
# Age or Happiness, which have a small range, and Income, whose range is in thousands
# Change this appropriately
#############################################################
#
# Choose a numeric variable (i.e., Age) set the col variable to the name of the column then you rerun the code that is commented out here.
#col = ____ Add the variable of your choice
# Uncomment the code and you will create a Bar graph or a Histogram of a different variable here.
# Do not forget to change the value of eval=TRUE to run and knit this chunk
# if (is.factor(df[[col]])) {
# # If col is categorical, this branch creates a bar graph;
# # if col is numeric, the else branch below creates a histogram.
# # Bar graph for factors
# ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
# geom_bar() +
# labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
# theme_minimal() +
# theme(legend.position = "right")
# } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
# ggplot(df, aes(x = .data[[col]])) +
# geom_histogram(binwidth = 0.3) +
# labs(title = paste("Histogram for", col), x = col, y = "Count") +
# theme_minimal()
# }
col = "Happiness"
if (is.factor(df[[col]])) {
# If col is categorical, this branch creates a bar graph;
# if col is numeric, the else branch below creates a histogram.
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
col = "Education"
if (is.factor(df[[col]])) {
# If col is categorical, this branch creates a bar graph;
# if col is numeric, the else branch below creates a histogram.
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
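The same if/else block is repeated above for each value of col. One way to avoid copying it is to wrap it in a small function. The sketch below is only an illustration: the name `plot_distribution` and the `binwidth` argument are assumptions, not part of the course template. A fixed bin width of 0.3 may suit a 0-10 score such as Happiness but is likely too narrow for a variable on a much larger scale, which is why it is exposed as an argument here.

```r
library(ggplot2)

# Hypothetical helper (not part of the original template):
# returns a bar graph for categorical columns and a histogram for numeric ones.
plot_distribution <- function(data, col, binwidth = 0.3) {
  values <- data[[col]]
  if (is.factor(values) || is.character(values)) {
    # Bar graph for categorical columns
    ggplot(data, aes(x = .data[[col]], fill = .data[[col]])) +
      geom_bar() +
      labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
      theme_minimal() +
      theme(legend.position = "right")
  } else if (is.numeric(values)) {
    # is.numeric() is already TRUE for integer columns,
    # so a separate is.integer() check is not needed
    ggplot(data, aes(x = .data[[col]])) +
      geom_histogram(binwidth = binwidth) +
      labs(title = paste("Histogram for", col), x = col, y = "Count") +
      theme_minimal()
  } else {
    stop("Column '", col, "' is neither categorical nor numeric.")
  }
}

# Made-up rows for demonstration only (not taken from data.csv)
toy <- data.frame(
  Happiness = c(5.1, 6.2, 7.5, 7.8, 8.4),
  Education = factor(c("PhD", "PhD", "Master's", "High School", "Master's"))
)
plot_distribution(toy, "Happiness")
plot_distribution(toy, "Education")
```

With the real data this would be called as `plot_distribution(df, "Happiness")` or `plot_distribution(df, "Education")`.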
Essay Question
Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.
Answer
Step 8c: NOTE that you should run this part with the latest value of col. Do not forget to change eval=FALSE to eval=TRUE to knit it.
Boxplot for numeric variables
#Box plot for Happiness
#############################################################
#
# Step 8c
# NOTE that you should run this part of the code after you
# copy the graph that the previous code creates. Boxplot for numeric variables
#############################################################
# The next 5 lines will run only if the col is numeric, otherwise will give you an error.
col = "Happiness"
if (is.factor(df[[col]])) { # if the col is categorical, then the code will
# create two graphs the Bar graph
# Highlight and run until the line that start with `# Boxplot for numeric variables
#
# If the col is numeric, then it will create the histogram
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
#Box plot for Age
col = "Age"
if (is.factor(df[[col]])) { # if the col is categorical, then the code will
# create two graphs the Bar graph
# Highlight and run until the line that start with `# Boxplot for numeric variables
#
# If the col is numeric, then it will create the histogram
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = .3, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
#Box plot for Weight, Height, and Income
col = "Weight"
if (is.factor(df[[col]])) { # if the col is categorical, then the code will
# create two graphs the Bar graph
# Highlight and run until the line that start with `# Boxplot for numeric variables
#
# If the col is numeric, then it will create the histogram
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
col = "Height"
if (is.factor(df[[col]])) { # if the col is categorical, then the code will
# create two graphs the Bar graph
# Highlight and run until the line that start with `# Boxplot for numeric variables
#
# If the col is numeric, then it will create the histogram
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
col = "Income"
if (is.factor(df[[col]])) { # if the col is categorical, then the code will
# create two graphs the Bar graph
# Highlight and run until the line that start with `# Boxplot for numeric variables
#
# If the col is numeric, then it will create the histogram
# Bar graph for factors
ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar() +
labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
theme_minimal() +
theme(legend.position = "right")
} else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
ggplot(df, aes(x = .data[[col]])) +
geom_histogram(binwidth = 0.3) +
labs(title = paste("Histogram for", col), x = col, y = "Count") +
theme_minimal()
}
ggplot(df, aes(x = "", y = .data[[col]])) +
geom_boxplot(fill = "skyblue", color = "darkblue", width = .25, outlier.color = "red", outlier.size = 2) +
labs(
title = paste("Box Plot for", col),
x = NULL,
y = "Value"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
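The box plot code above is repeated for each of the five numeric variables. As a sketch of a more compact alternative, the plotting code can be placed in a function and called in a loop. The function name `make_boxplot` and its `box_width` argument are hypothetical. Note that inside a `for` loop a ggplot object is not drawn automatically, so it has to be wrapped in `print()`.

```r
library(ggplot2)

# Hypothetical helper: one box plot for a single numeric column.
make_boxplot <- function(data, col, box_width = 0.3) {
  stopifnot(is.numeric(data[[col]]))  # box plots only make sense for numeric columns
  ggplot(data, aes(x = "", y = .data[[col]])) +
    geom_boxplot(fill = "skyblue", color = "darkblue", width = box_width,
                 outlier.color = "red", outlier.size = 2) +
    # the y axis is labelled with the column name here instead of "Value"
    labs(title = paste("Box Plot for", col), x = NULL, y = col) +
    theme_minimal() +
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
    )
}

# Made-up rows for demonstration only (not taken from data.csv)
toy <- data.frame(
  Age = c(22, 24, 25, 25, 26, 27, 28, 29, 30, 41),
  Income = c(31, 35, 36, 40, 42, 47, 51, 55, 58, 60)
)

for (col in c("Age", "Income")) {
  print(make_boxplot(toy, col))  # print() is required inside a loop
}
```

With the real data the loop would run over `c("Happiness", "Age", "Weight", "Height", "Income")` and use `df` instead of `toy`.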
Essay Question
Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.
A box plot provides a visual representation of the distribution and skewness of a variable's values. It shows the median, the interquartile range (IQR), the upper and lower whiskers, and any outliers. Box plots were created for happiness, age, weight, height, and income. In the box plot for happiness, the median sits at the first quartile, indicating highly skewed values in the second and third quartiles. Additionally, the end of the upper whisker is closer to the third quartile than the end of the lower whisker is to the first quartile. Results from the happiness values must be interpreted with caution due to this skewness.
In the box plot for age, an outlier is visible above the upper whisker, at a value above 33. The upper whisker is longer than the lower whisker. The median is in the center of the box, indicating that values within the interquartile range are spread evenly around it. The box plots for weight and height show values skewed toward the third quartile. For weight, the upper whisker extends further from the third quartile than the lower whisker does from the first quartile. Income also shows slight skewness toward the third quartile, although its lower whisker extends further from the first quartile than its upper whisker does from the third quartile.
Given how often skewness appears, further analysis with a larger sample may make the insights more reliable. Comparing the box plots, age appears to have the least variability, although it includes an outlier. Weight and height show similar skewness toward the higher values, as would be expected, but further analysis is needed to determine whether the assumption that taller individuals weigh more holds in this data set. Income is also skewed toward the higher values, although it has a wider range in the lower values.
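The essay question also asks for the IQR, which can be computed directly instead of being read off the plot. The sketch below uses base R only. The function name `iqr_summary` is hypothetical and the input values are made up for illustration, not taken from data.csv. It applies the usual 1.5 × IQR rule, which is also the rule `geom_boxplot()` uses by default to decide which points are drawn as outliers.

```r
# Hypothetical helper for illustration; not part of the course template.
iqr_summary <- function(x) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  lower_fence <- q[1] - 1.5 * iqr  # points below this are flagged as outliers
  upper_fence <- q[3] + 1.5 * iqr  # points above this are flagged as outliers
  list(
    q1 = q[1], median = q[2], q3 = q[3], iqr = iqr,
    lower_fence = lower_fence, upper_fence = upper_fence,
    outliers = x[x < lower_fence | x > upper_fence]
  )
}

# Made-up values for demonstration only
toy_age <- c(22, 24, 25, 25, 26, 27, 28, 29, 30, 41)
res <- iqr_summary(toy_age)
res$iqr       # 3.75  (Q3 = 28.75, Q1 = 25)
res$outliers  # 41    (above the upper fence of 34.375)
```

For the real data the call would be, for example, `iqr_summary(df$Age)`.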
Step 9: Tables
Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables, choose two others to get more valuable insights.
#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################
Gender_Education <- table(df$Education, df$Gender)
Gender_Education # what does this information tell you?
##
## Female Male Other
## Bachelor's 197 8 7
## High School 11 22 71
## Master's 13 83 0
## PhD 0 71 17
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table
##
## Female Male Other Sum
## Bachelor's 197 8 7 212
## High School 11 22 71 104
## Master's 13 83 0 96
## PhD 0 71 17 88
## Sum 221 184 95 500
color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)
# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Gender_Education))
##
## Female Male Other Sum
## Bachelor's 197 8 7 212
## High School 11 22 71 104
## Master's 13 83 0 96
## PhD 0 71 17 88
## Sum 221 184 95 500
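The legend above relies on a hand-typed vector whose order must match the rows of the table. A possible alternative, sketched below, is to take the legend labels from `rownames()` of the table so the two cannot drift apart. To keep the example self-contained, the table is rebuilt from the counts printed above rather than read from data.csv. The variable names `bar_colors` and `bar_midpoints` are assumptions; using a name other than `names` also avoids masking the base R function `names()`.

```r
# Rebuild the Education x Gender table from the counts printed above
Gender_Education <- as.table(matrix(
  c(197,  8,  7,
     11, 22, 71,
     13, 83,  0,
      0, 71, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Education = c("Bachelor's", "High School", "Master's", "PhD"),
    Gender = c("Female", "Male", "Other")
  )
))

bar_colors <- c("red", "blue", "yellow", "green")
# One color per table row (education level)
stopifnot(length(bar_colors) == nrow(Gender_Education))

bar_midpoints <- barplot(
  Gender_Education,
  beside = TRUE,
  col = bar_colors,
  main = "Education by Gender",
  ylim = c(0, max(Gender_Education) * 1.25),      # headroom above the tallest bar
  legend.text = rownames(Gender_Education),       # labels come from the table itself
  args.legend = list(x = "topright", cex = 0.5)
)
```

With the real data the first statement would simply be `Gender_Education <- table(df$Education, df$Gender)`.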
#Clustered bar plot for Hobby by Location
Hobby_Location <- table(df$Hobby, df$Location)
Hobby_Location # what does this information tell you?
##
## City Rural Suburb
## Photography 0 81 9
## Reading 108 17 0
## Swimming 42 58 0
## Traveling 64 35 86
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Location) # Add totals to your table
##
## City Rural Suburb Sum
## Photography 0 81 9 90
## Reading 108 17 0 125
## Swimming 42 58 0 100
## Traveling 64 35 86 185
## Sum 214 191 95 500
color <- c("orangered3","navyblue","yellow2","palegreen4")
names <- c("Photography","Reading", "Swimming","Traveling")
barplot(Hobby_Location, col=color, beside= TRUE, main = "Hobby by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)
# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Location))
##
## City Rural Suburb Sum
## Photography 0 81 9 90
## Reading 108 17 0 125
## Swimming 42 58 0 100
## Traveling 64 35 86 185
## Sum 214 191 95 500
#Clustered bar plot for Hobby by Education
Hobby_Education <- table(df$Education, df$Hobby)
Hobby_Education # what does this information tell you?
##
## Photography Reading Swimming Traveling
## Bachelor's 0 94 86 32
## High School 81 12 1 10
## Master's 0 19 13 64
## PhD 9 0 0 79
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Education) # Add totals to your table
##
## Photography Reading Swimming Traveling Sum
## Bachelor's 0 94 86 32 212
## High School 81 12 1 10 104
## Master's 0 19 13 64 96
## PhD 9 0 0 79 88
## Sum 90 125 100 185 500
color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Hobby_Education, col=color, beside= TRUE, main = "Hobby by Education", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)
# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Education))
##
## Photography Reading Swimming Traveling Sum
## Bachelor's 0 94 86 32 212
## High School 81 12 1 10 104
## Master's 0 19 13 64 96
## PhD 9 0 0 79 88
## Sum 90 125 100 185 500
#Clustered bar plot for Hobby by Category
Hobby_Category <- table(df$Hobby, df$Category)
Hobby_Category # what does this information tell you?
##
## Art Music Sports
## Photography 9 0 81
## Reading 90 8 27
## Swimming 70 30 0
## Traveling 4 107 74
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Category) # Add totals to your table
##
## Art Music Sports Sum
## Photography 9 0 81 90
## Reading 90 8 27 125
## Swimming 70 30 0 100
## Traveling 4 107 74 185
## Sum 173 145 182 500
color <- c("orangered3","navyblue","yellow2","darkgreen")
names <- c("Photography","Reading", "Swimming", "Traveling")
barplot(Hobby_Category, col=color, beside= TRUE, main = "Hobby by Category", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)
# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Category))
##
## Art Music Sports Sum
## Photography 9 0 81 90
## Reading 90 8 27 125
## Swimming 70 30 0 100
## Traveling 4 107 74 185
## Sum 173 145 182 500
#Clustered bar plot for Education by Location
Location_Education <- table(df$Education, df$Location)
Location_Education # what does this information tell you?
##
## City Rural Suburb
## Bachelor's 124 88 0
## High School 1 103 0
## Master's 84 0 12
## PhD 5 0 83
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Location_Education) # Add totals to your table
##
## City Rural Suburb Sum
## Bachelor's 124 88 0 212
## High School 1 103 0 104
## Master's 84 0 12 96
## PhD 5 0 83 88
## Sum 214 191 95 500
color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's", "PhD")
barplot(Location_Education, col=color, beside= TRUE, main = "Education by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)
# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Location_Education))
##
## City Rural Suburb Sum
## Bachelor's 124 88 0 212
## High School 1 103 0 104
## Master's 84 0 12 96
## PhD 5 0 83 88
## Sum 214 191 95 500
#Contingency tables for Hobby by Location, Hobby by Education, and Hobby by Category and Education by Location
library(knitr)
## Warning: package 'knitr' was built under R version 4.5.1
library(kableExtra)
# Create the contingency table
Gender_Education <- table(df$Education, df$Gender)
# Add row and column totals
Gender_Education_margins <- addmargins(Gender_Education)
# Make a clean and beautiful table with kable
kable(Gender_Education_margins, caption = "Gender by Education Level", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | Female | Male | Other | Sum |
|---|---|---|---|---|
| Bachelor’s | 197 | 8 | 7 | 212 |
| High School | 11 | 22 | 71 | 104 |
| Master’s | 13 | 83 | 0 | 96 |
| PhD | 0 | 71 | 17 | 88 |
| Sum | 221 | 184 | 95 | 500 |
# Create the contingency table
Hobby_Location <- table(df$Location, df$Hobby)
# Add row and column totals
Hobby_Location_margins <- addmargins(Hobby_Location)
# Make a clean and beautiful table with kable
kable(Hobby_Location_margins, caption = "Table 3. Hobby by Location", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | Photography | Reading | Swimming | Traveling | Sum |
|---|---|---|---|---|---|
| City | 0 | 108 | 42 | 64 | 214 |
| Rural | 81 | 17 | 58 | 35 | 191 |
| Suburb | 9 | 0 | 0 | 86 | 95 |
| Sum | 90 | 125 | 100 | 185 | 500 |
# Create the contingency table
Hobby_Education <- table(df$Education, df$Hobby)
# Add row and column totals
Hobby_Education_margins <- addmargins(Hobby_Education)
# Make a clean and beautiful table with kable
kable(Hobby_Education_margins, caption = "Table 4. Hobby by Education Level", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | Photography | Reading | Swimming | Traveling | Sum |
|---|---|---|---|---|---|
| Bachelor’s | 0 | 94 | 86 | 32 | 212 |
| High School | 81 | 12 | 1 | 10 | 104 |
| Master’s | 0 | 19 | 13 | 64 | 96 |
| PhD | 9 | 0 | 0 | 79 | 88 |
| Sum | 90 | 125 | 100 | 185 | 500 |
# Create the contingency table
Hobby_Category <- table(df$Category, df$Hobby)
# Add row and column totals
Hobby_Category_margins <- addmargins(Hobby_Category)
# Make a clean and beautiful table with kable
kable(Hobby_Category_margins, caption = "Hobby by Category", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | Photography | Reading | Swimming | Traveling | Sum |
|---|---|---|---|---|---|
| Art | 9 | 90 | 70 | 4 | 173 |
| Music | 0 | 8 | 30 | 107 | 145 |
| Sports | 81 | 27 | 0 | 74 | 182 |
| Sum | 90 | 125 | 100 | 185 | 500 |
# Create the contingency table
Location_Education <- table(df$Education, df$Location)
# Add row and column totals
Location_Education_margins <- addmargins(Location_Education)
# Make a clean and beautiful table with kable
kable(Location_Education_margins, caption = "Table 5. Education Level by Location", align = 'c') %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(0, bold = TRUE, background = "#D3D3D3") # Highlight header
| | City | Rural | Suburb | Sum |
|---|---|---|---|---|
| Bachelor’s | 124 | 88 | 0 | 212 |
| High School | 1 | 103 | 0 | 104 |
| Master’s | 84 | 0 | 12 | 96 |
| PhD | 5 | 0 | 83 | 88 |
| Sum | 214 | 191 | 95 | 500 |
Essay Question
Explain the table in detail. Focus on the information extracted, anomalies, outliers, relationships.
Clustered bar plots and contingency tables were created to identify trends in the Hobby variable by location, education, and category. The hobby-by-category results are not discussed in this EDA, as the insights from that relationship were not as meaningful as the relationships among location, education level, and hobby.
When comparing the hobby variable with location, the suburb category had the least variety, with only photography and traveling represented. City included reading, swimming, and traveling. Rural had representation in all of the hobby categories. As shown in the contingency table, rural accounted for 90% of the photography values and city accounted for 86% of the reading values. Swimming was more evenly distributed between the city and rural categories. All locations included traveling as a hobby, although the suburb location had the highest count and rural had the lowest. (It would be interesting to look at hobby by income to see whether higher income is related to travel, especially since Master’s and PhD had higher counts of travel than Bachelor’s and High School.)
The clustered bar graph of hobby by education indicates several trends. Only traveling included counts from all education categories. Observations with a Bachelor’s had the highest counts for reading and swimming, with a smaller but visible number for traveling. Photography was largely made up of observations with high school as the education category, plus a small number with a PhD. No observations with a PhD listed reading or swimming as hobbies. Observations with a Master’s degree fell primarily under traveling and, to a lesser degree, reading and swimming. In the contingency table, 90% of the observations in photography listed high school as the educational level. For reading, the highest count was for Bachelor’s, with small numbers for high school and Master’s. Swimming was dominated by Bachelor’s, with only one observation for high school, a few for Master’s, and zero for PhD. For traveling, PhD had the highest count, followed by a slightly smaller count for Master’s, a smaller count for Bachelor’s, and the smallest count for high school.
A clustered bar graph and contingency table were also created to examine the relationship between the education and location variables. The clustered bar graph shows that educational levels are not equally represented across locations. High school had almost all of its values in the rural location, with a tiny count in the city. Bachelor’s had a slightly lower count than high school in the rural location, a higher count in the city than in the rural location, and no values in the suburb. Master’s had its highest count in the city, with a small number in the suburb. Finally, PhD had its highest count in the suburb, with a small number in the city. No graduate-level (Master’s or PhD) values were found in the rural location, and no high school or Bachelor’s values were found in the suburb. In the contingency table, 87% of the values for the suburb location had the education level of PhD. In the rural location, a little over half (54%) had the education level of high school. For the city location, 58% of values had the education level of Bachelor’s and 39% had Master’s, with a small count for PhD and only 1 count for high school.
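The within-location percentages quoted above can be checked with `prop.table()`. The sketch below rebuilds the Education by Location table from the counts printed in Step 9 so that it runs without data.csv. The object name `col_pct` is an assumption. `margin = 2` divides each cell by its column (location) total; `margin = 1` would instead give percentages within each education level.

```r
# Rebuild the Education x Location table from the counts printed in Step 9
Location_Education <- as.table(matrix(
  c(124,  88,  0,
      1, 103,  0,
     84,   0, 12,
      5,   0, 83),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Education = c("Bachelor's", "High School", "Master's", "PhD"),
    Location = c("City", "Rural", "Suburb")
  )
))

# Percentage of each location's observations that fall in each education level
col_pct <- round(100 * prop.table(Location_Education, margin = 2), 1)
col_pct

col_pct["PhD", "Suburb"]         # 87.4
col_pct["High School", "Rural"]  # 53.9
col_pct["Bachelor's", "City"]    # 57.9
col_pct["Master's", "City"]      # 39.3
```

With the real data, `prop.table(table(df$Education, df$Location), margin = 2)` gives the same proportions.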
Based on the descriptive statistics, relationships between the variables can be determined and visualized. A comparison of the histograms provides insights about the trends of individual characteristics and happiness values.
The histograms for age and weight were more similar in shape to the happiness histogram than the income histogram was. The happiness histogram peaked at 7.5, with more values between 7 and 9 than between 5 and 7, and was skewed toward the higher numbers. Age values were most frequent between 25 and 30, with outliers older than 30; the median was central, but the spread was wide toward the higher end. Weight values were concentrated between 45 and 85, with most between 65 and 70 and an outlier at 90; the distribution was skewed toward the higher end, with a wider spread above the third quartile than below the first quartile. Further analysis is required to determine whether the similarity in the shapes of the graphs indicates a positive correlation between these variables and happiness scores.
Kovner et al. (2012) identified age among the factors that indicate pursuit of graduate education. In this data set, the age histogram and the happiness histogram were similar in shape. The trends identified in the clustered bar plots demonstrated differences in location and hobby for graduate education levels (Master’s and PhD). Further analysis of correlations between happiness scores and education levels may provide insights into how happiness values vary by education level.
Based on the data analysis and visualization, there appears to be a relationship between the education, location, and hobby variables. The PhD category had the fewest observations, making up 18% of the sample. There appears to be a relationship between the PhD education level, location (suburb), and hobby (traveling). A relationship was also found between the Bachelor’s education level, location (city and rural), and hobby (reading and swimming). Master’s observations were concentrated in the city and had their highest counts in the traveling category of the hobby variable, although a few were in reading and swimming. Finally, 21% of the sample had the high school education level. Almost all of these observations were in the rural location category, and most listed photography as the hobby.
Works Cited
Black, Ken. Business Statistics for Contemporary Decision Making 7E + WileyPlus Registration Card. 16 Mar. 2012.
Bradstreet, Thomas E, and John S Palcza. “Digging into Data with Graphics.” Teaching Statistics Trust, vol. 34, no. 2, 2011, pp. 68–74.
Kovner, Christine T., et al. “Charting the Course for Nurses’ Achievement of Higher Education Levels.” Journal of Professional Nursing, vol. 28, no. 6, Nov. 2012, pp. 333–343, https://doi.org/10.1016/j.profnurs.2012.04.021. Accessed 10 Jan. 2020.
Medley-Rath, Stephanie, et al. “Figures and Charts and Tables, Oh My!: A Content Analysis of Textbook Data Visualizations.” Teaching Sociology, vol. 52, no. 3, 29 Nov. 2023, https://doi.org/10.1177/0092055x231214006. Accessed 26 Apr. 2024.