Lab Homework 7 - EDA Part I

Part I: Lab Exercise

Lab 1:

1. How many samples are there? How many variables are there?

Answer: There are 53940 samples and 10 variables.

2. Understand the meaning and the data type for each variable.

3. Create a plot to study the relationship between price and carat. How do you understand this plot?

Code:

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "orange2") +
  geom_smooth(linetype = "dotdash", color = "purple3") +
  labs(title = "Diamonds' Carat vs Price",
       x = "Carat",
       y = "Price") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.text = element_text(size = rel(1.1)))

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

4. Create plots to study the relationship between price and cut, color, clarity, respectively. How do you understand these plots?

Code:

ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price, fill = cut)) +
  labs(title = "Diamonds' Cut vs Price",
       x = "Cut Quality",
       y = "Price") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4", 
                                  margin = margin(15,15,15,15)),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.1)))

ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = color, y = price, fill = color)) +
  labs(title = "Diamonds' Color vs Price",
       x = "Color of diamonds",
       y = "Price") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4", 
                                  margin = margin(15,15,15,15)),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.1)))

ggplot(diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price, fill = clarity)) +
  labs(title = "Diamonds' Clarity vs Price",
       x = "Clarity of diamonds",
       y = "Price") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4", 
                                  margin = margin(15,15,15,15)),
        plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)),
        axis.text = element_text(size = rel(1.1)))

Answer: The first plot shows that while diamond prices generally increase from “Fair” to “Premium” cuts, the “Ideal” cut does not have the highest median price. The second plot reveals that, contrary to expectations, diamonds with more visible color (J) tend to be more expensive than colorless ones, possibly due to factors like carat weight. The third plot indicates that price decreases with clarity, with IF and VVS1 diamonds having lower median prices.

Across all three plots, there are numerous high-priced outliers, indicating that other factors—such as carat weight—may also be influencing price variations. Understanding these relationships can help buyers and sellers make more informed decisions in the diamond market.

5. After finishing 3 and 4, do you have more questions raised in your mind?

My question is: in the similar weight, is better quality diamond still cheaper than the lower quality ones?

Code:

ggplot(data = diamonds) + 
  stat_summary(mapping = aes(x = cut, y = carat, fill = cut), fun = "mean", geom = "bar") + 
  labs(title = "Mean Diamond Weight by Cut", 
       x = "Cut Quality", 
       y = "Mean Weight (carat)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), margin = margin(15,15,15,15), color = "green4"), 
        axis.title = element_text(size = rel(1.4), color = "orange4"), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.1)))

Lab 2:

1. In `mpg’ data set, which manufacturer has the best average mpg across all its models in the data sets?

Code:

ggplot(mpg) +
  stat_summary(mapping = aes(x = manufacturer, y = hwy, fill = manufacturer), fun = "mean", geom = "bar") +
  labs(title = "Highway Mean of Manufacturers",
       x = "Manufacturer",
       y = "Highway mileage/gallon mean") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

ggplot(mpg) +
  stat_summary(mapping = aes(x = manufacturer, y = cty, fill = manufacturer), fun = "mean", geom = "bar") +
  labs(title = "City Mean of Manufacturers",
       x = "Manufacturer",
       y = "City mileage/gallon mean") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

Answer: Honda has the best MPG across all the models in the mpg data set.

2. Do you think that the conclusion is more generally valid. That is, can we say that manufacturer produces cars that are more fuel economic than other ones? Think carefully and try to find the answer from the data set itself, rather than from your own opinion or external research.

Code & Graphs:

ggplot(mpg) +
  stat_summary(mapping = aes(x = manufacturer, y = displ, fill = manufacturer), 
               fun = "mean", geom = "bar") +
  labs(title = "Engine Displacements vs Manufacturers",
       x = "Manufacturer",
       y = "Engine displacement in liters") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

ggplot(mpg) +
  stat_summary(mapping = aes(x = manufacturer, y = cyl, fill = manufacturer), 
               fun = "mean", geom = "bar") +
   labs(title = "Cylinders vs Manufacturers",
       x = "Manufacturer",
       y = "Number of cylinders") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

ggplot(mpg) +
  geom_bar(mapping = aes(x = manufacturer, fill = drv), position = "dodge") +
   labs(title = "Drive Train vs Manufacturers",
       x = "Manufacturer",
       y = "Type of drive train") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

Answer: We cannot evaluate fuel economy based solely on one or two factors, such as city and highway mileage. Other key factors that directly impact fuel consumption must also be considered, including engine displacement, the number of cylinders in a vehicle, and the drive train configuration.

Equation of displacement \[displacement=4/pi×b^2×s×c\] Where:
- b represents the bore size,
- s is the stroke length,
- c is the number of cylinders.
Therefore, there is a positive correlation between the number of cylinders and engine displacement. The more cylinders a car has, the greater its engine displacement, which in turn leads to higher fuel consumption.

According to the graphs:
1. The first graph shows that Honda has the lowest mean engine displacement.
2. The second graph indicates that Honda also has the lowest mean number of cylinders.
3. The final graph reveals that all the Honda vehicles in this data set have a front-wheel-drive configuration.

In conclusion, Honda demonstrates the best overall fuel economy, as it has the lowest average number of cylinders, the smallest mean engine displacement, and an efficient drive train in all samples in the data set–feature front-wheel drive. However, it is important to note that the comparison may not be entirely fair, as the samples from other manufacturers may represent different types of vehicles.

Lab 3: Create a frequency table and a bar plot for variables Gender and Card_Category. Summarize your findings

Code:

getwd()

## [1] "/Users/shoshow/Desktop/Data Anal/HW"

bank_data <- (read_csv("~/Desktop/BankChurners.csv"))

## Rows: 10127 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Ca...
## dbl (17): CLIENTNUM, Customer_Age, Dependent_count, Months_on_book, Total_Re...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

prop.table(table(bank_data$Gender))

## 
##         F         M 
## 0.5290807 0.4709193

ggplot(data = bank_data) +
  geom_bar(mapping = aes(Gender), fill = "powderblue") +
  labs(title = "Gender of Card Holder",
       x = "Gender",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "blue4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "slateblue"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

prop.table(table(bank_data$Card_Category))

## 
##        Blue        Gold    Platinum      Silver 
## 0.931766565 0.011454528 0.001974919 0.054803989

ggplot(bank_data) +
  geom_bar(aes(Card_Category), fill = "powderblue") +
  labs(title = "Categories of Card",
       x = "Type of card",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "blue4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "slateblue"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

Answer: The first graph shows the number of female (F) cardholders is slightly higher than the number of male (M) cardholders. However, the difference is not very large, indicating a relatively balanced distribution of card ownership between genders.

The “Blue” card type is by far the most common, with a significantly higher count compared to other card types. “Silver” cards have a much lower count, followed by “Gold” and “Platinum,” which are the least common. This suggests that most customers use “Blue” cards, while premium card types like “Platinum” and “Gold” are used by a smaller segment of the customer base.

Lab 4: Create a histogram and a summary for variables Customer_Age and Total_Trans_Amt. Summarize your findings.

Code:

summary(bank_data$Customer_Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   41.00   46.00   46.33   52.00   73.00

ggplot(bank_data) +
  geom_histogram(aes(Customer_Age), fill = "yellow3", binwidth = 5) +
  labs(title = "Age of Card Customers",
       x = "Age",
       y = "Count") +
  scale_x_continuous(limits = c(20, 70)) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "hotpink4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "hotpink3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).

summary(bank_data$Total_Trans_Amt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     510    2156    3899    4404    4741   18484

ggplot(bank_data) +
  geom_histogram(aes(Total_Trans_Amt), fill = "yellow3", bins = 30, binwidth = 1000) +
  labs(title = "Total Transaction Amount for Last 12 Months",
       x = "Transaction amount (USD)",
       y = "Count") +
  scale_x_continuous(labels = scales::dollar) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "hotpink4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "hotpink3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

Answer: The data appears to follow a roughly normal distribution, with most customers concentrated between the ages of approximately 35 and 55. The peak customer count occurs around the mid-40s to early 50s. There are fewer customers at both younger and older age extremes, with the number significantly declining past 60. The distribution suggests that the majority of cardholders are middle-aged.

On the other hand, the second graph’s distribution is right-skewed, with most transactions concentrated between $0 and $5,000. There are two noticeable peaks in this range, suggesting common spending patterns among customers. A smaller number of customers have significantly higher transaction amounts, with some exceeding $10,000 and even $15,000, though these cases are less frequent. This suggests that while most customers spend moderately, segment of high spenders exists.

Lab 5: Find another discrete random variable from the bank data set and explore its distribution with table and graph.

Code:

table(bank_data$Total_Relationship_Count)

## 
##    1    2    3    4    5    6 
##  910 1243 2305 1912 1891 1866

ggplot(bank_data) +
  geom_bar(aes(x = as.factor(Total_Relationship_Count)), fill = "orchid4") +
  labs(title = "Products Held by Customers",
       x = "Number of products",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "darkorange4",
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "darkorange3"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(0.9)))

Lab 6: Check how many NA values are there in each column of the flights data set.

Code:

library(nycflights13)
map(flights, ~ sum(is.na(.)))

## $year
## [1] 0
## 
## $month
## [1] 0
## 
## $day
## [1] 0
## 
## $dep_time
## [1] 8255
## 
## $sched_dep_time
## [1] 0
## 
## $dep_delay
## [1] 8255
## 
## $arr_time
## [1] 8713
## 
## $sched_arr_time
## [1] 0
## 
## $arr_delay
## [1] 9430
## 
## $carrier
## [1] 0
## 
## $flight
## [1] 0
## 
## $tailnum
## [1] 2512
## 
## $origin
## [1] 0
## 
## $dest
## [1] 0
## 
## $air_time
## [1] 9430
## 
## $distance
## [1] 0
## 
## $hour
## [1] 0
## 
## $minute
## [1] 0
## 
## $time_hour
## [1] 0

Part II: Explore the loans_full_schema data set in openintro by performing EDA

Ask a question of your interest
Question: Is the job tenure correlated with annual income?
Visualize data to answer your question

Code:

ggplot(loans_full_schema, mapping = aes(x = as.factor(emp_length), y = annual_income)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(aes(fill = as.factor(emp_length))) +
  scale_y_continuous(limits = c(0, 200000), labels = scales::dollar) +
  labs(title = "Correlation between Income and Job Length",
       x = "Number of years in job",
       y = "Annual income (USD)")+
  theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "darkviolet", 
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orchid4"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(1.1)))

## Warning: Removed 286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Answer: The median annual income seems to increase modestly with more years of job tenure, indicating a weak positive correlation between tenure and income. However, the presence of many missing (NA) values in the data complicates drawing a clear conclusion from the graph.

Try to raise new questions from your plot
New Question: Will the NA values affect the trend? Is all the source from annual income reliable?
Visualize data to answer the new question

Code:

map(loans_full_schema, ~ sum(is.na(.)))

## $emp_title
## [1] 0
## 
## $emp_length
## [1] 817
## 
## $state
## [1] 0
## 
## $homeownership
## [1] 0
## 
## $annual_income
## [1] 0
## 
## $verified_income
## [1] 0
## 
## $debt_to_income
## [1] 24
## 
## $annual_income_joint
## [1] 8505
## 
## $verification_income_joint
## [1] 0
## 
## $debt_to_income_joint
## [1] 8505
## 
## $delinq_2y
## [1] 0
## 
## $months_since_last_delinq
## [1] 5658
## 
## $earliest_credit_line
## [1] 0
## 
## $inquiries_last_12m
## [1] 0
## 
## $total_credit_lines
## [1] 0
## 
## $open_credit_lines
## [1] 0
## 
## $total_credit_limit
## [1] 0
## 
## $total_credit_utilized
## [1] 0
## 
## $num_collections_last_12m
## [1] 0
## 
## $num_historical_failed_to_pay
## [1] 0
## 
## $months_since_90d_late
## [1] 7715
## 
## $current_accounts_delinq
## [1] 0
## 
## $total_collection_amount_ever
## [1] 0
## 
## $current_installment_accounts
## [1] 0
## 
## $accounts_opened_24m
## [1] 0
## 
## $months_since_last_credit_inquiry
## [1] 1271
## 
## $num_satisfactory_accounts
## [1] 0
## 
## $num_accounts_120d_past_due
## [1] 318
## 
## $num_accounts_30d_past_due
## [1] 0
## 
## $num_active_debit_accounts
## [1] 0
## 
## $total_debit_limit
## [1] 0
## 
## $num_total_cc_accounts
## [1] 0
## 
## $num_open_cc_accounts
## [1] 0
## 
## $num_cc_carrying_balance
## [1] 0
## 
## $num_mort_accounts
## [1] 0
## 
## $account_never_delinq_percent
## [1] 0
## 
## $tax_liens
## [1] 0
## 
## $public_record_bankrupt
## [1] 0
## 
## $loan_purpose
## [1] 0
## 
## $application_type
## [1] 0
## 
## $loan_amount
## [1] 0
## 
## $term
## [1] 0
## 
## $interest_rate
## [1] 0
## 
## $installment
## [1] 0
## 
## $grade
## [1] 0
## 
## $sub_grade
## [1] 0
## 
## $issue_month
## [1] 0
## 
## $loan_status
## [1] 0
## 
## $initial_listing_status
## [1] 0
## 
## $disbursement_method
## [1] 0
## 
## $balance
## [1] 0
## 
## $paid_total
## [1] 0
## 
## $paid_principal
## [1] 0
## 
## $paid_interest
## [1] 0
## 
## $paid_late_fees
## [1] 0

ggplot(loans_full_schema, mapping = aes(x = as.factor(emp_length), 
                                        fill = verified_income)) +
  geom_bar(position = "dodge") +
  labs(title = "Correlation between Income and Job Length",
       x = "Number of years in job",
       y = "Verified annual income (USD)")+
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.7), color = "darkviolet", 
                                  margin = margin(15, 15, 15, 15)),
        axis.title = element_text(hjust = 0.5, size = rel(1.4), color = "orchid4"),
        axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
        axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
        axis.text = element_text(size = rel(1.1)))

Answer: After using the map() function to examine the data, we found 817 unknown values in emp_length (job tenure). These missing values could impact the observed trend if they are considered alongside other factors.

According to the dodge plot, since a significant portion of the data falls under “Not Verified,” caution should be exercised when interpreting trends. The presence of missing values (NA) further complicates the analysis.

Repeat cycle 1-4

Lab Homework 7 - EDA Part I

Angela C

2025-02-15

Part I: Lab Exercise

Lab 1:

1. How many samples are there? How many variables are there?

2. Understand the meaning and the data type for each variable.

3. Create a plot to study the relationship between price and carat. How do you understand this plot?

4. Create plots to study the relationship between price and cut, color, clarity, respectively. How do you understand these plots?

5. After finishing 3 and 4, do you have more questions raised in your mind?

Lab 2:

1. In `mpg’ data set, which manufacturer has the best average mpg across all its models in the data sets?

2. Do you think that the conclusion is more generally valid. That is, can we say that manufacturer produces cars that are more fuel economic than other ones? Think carefully and try to find the answer from the data set itself, rather than from your own opinion or external research.

Lab 3: Create a frequency table and a bar plot for variables Gender and Card_Category. Summarize your findings

Lab 4: Create a histogram and a summary for variables Customer_Age and Total_Trans_Amt. Summarize your findings.

Lab 5: Find another discrete random variable from the bank data set and explore its distribution with table and graph.

Lab 6: Check how many NA values are there in each column of the flights data set.

Part II: Explore the loans_full_schema data set in openintro by performing EDA