Answer: There are 53940 samples and 10
variables.
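The counts above can be reproduced with a short check. This is a minimal sketch, assuming the `diamonds` data set is the one shipped with ggplot2.

```r
library(ggplot2)  # provides the diamonds data set

# dim() returns c(number of rows, number of columns)
dims <- dim(diamonds)
dims

# Column names, to see which variables are included
names(diamonds)
```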
Code:
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point(color = "orange2") +
geom_smooth(linetype = "dotdash", color = "purple3") +
labs(title = "Diamonds' Carat vs Price",
x = "Carat",
y = "Price") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4"),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.text = element_text(size = rel(1.1)))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Code:
ggplot(diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price, fill = cut)) +
labs(title = "Diamonds' Cut vs Price",
x = "Cut Quality",
y = "Price") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4",
margin = margin(15,15,15,15)),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
ggplot(diamonds) +
geom_boxplot(mapping = aes(x = color, y = price, fill = color)) +
labs(title = "Diamonds' Color vs Price",
x = "Color of diamonds",
y = "Price") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4",
margin = margin(15,15,15,15)),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
ggplot(diamonds) +
geom_boxplot(mapping = aes(x = clarity, y = price, fill = clarity)) +
labs(title = "Diamonds' Clarity vs Price",
x = "Clarity of diamonds",
y = "Price") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "green4",
margin = margin(15,15,15,15)),
plot.margin = margin(1, 1, 0.5, 0.5, "cm"),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
Answer: The first plot shows that while diamond prices
generally increase from “Fair” to “Premium” cuts, the “Ideal” cut does
not have the highest median price. The second plot reveals that,
contrary to expectations, diamonds with more visible color (J) tend to
be more expensive than colorless ones, possibly due to factors like
carat weight. The third plot indicates that median price tends to decrease as
clarity improves, with IF and VVS1 diamonds having lower median prices.
Across all three plots, there are numerous high-priced outliers,
indicating that other factors, such as carat weight, may also be
influencing price variations. Understanding these relationships can help
buyers and sellers make more informed decisions in the diamond
market.
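The carat-weight explanation suggested above can be checked directly. The sketch below is one possible way to do it, not part of the original analysis: it compares mean carat weight across color and clarity grades using base R's `tapply()`.

```r
library(ggplot2)  # provides the diamonds data set

# Mean carat weight per color grade (D = least color, J = most color)
mean_carat_by_color <- tapply(diamonds$carat, diamonds$color, mean)
round(mean_carat_by_color, 2)

# Mean carat weight per clarity grade (I1 = lowest clarity, IF = highest)
mean_carat_by_clarity <- tapply(diamonds$carat, diamonds$clarity, mean)
round(mean_carat_by_clarity, 2)
```

If the lower-quality grades are heavier on average, the price pattern in the box plots may reflect weight rather than quality.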
My question is: at similar weights, are better-quality diamonds
still cheaper than lower-quality ones?
Code:
ggplot(data = diamonds) +
stat_summary(mapping = aes(x = cut, y = carat, fill = cut), fun = "mean", geom = "bar") +
labs(title = "Mean Diamond Weight by Cut",
x = "Cut Quality",
y = "Mean Weight (carat)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.8), margin = margin(15,15,15,15), color = "green4"),
axis.title = element_text(size = rel(1.4), color = "orange4"),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.1)))
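One way to approach the question more directly is to hold weight roughly constant and compare prices by cut within that band. This is a sketch only: the 0.9 to 1.1 carat band is an arbitrary choice for illustration, and other bands may behave differently.

```r
library(ggplot2)  # provides the diamonds data set

# Keep only diamonds in a narrow weight band so carat is roughly constant.
# The band limits (0.9 and 1.1) are an assumption made for illustration.
similar_weight <- subset(diamonds, carat >= 0.9 & carat <= 1.1)

# Median price for each cut within the band
median_price_by_cut <- tapply(similar_weight$price, similar_weight$cut, median)
median_price_by_cut

# Box plot of price by cut for the same subset
ggplot(similar_weight, aes(x = cut, y = price, fill = cut)) +
  geom_boxplot() +
  labs(title = "Price by Cut for 0.9-1.1 Carat Diamonds",
       x = "Cut Quality",
       y = "Price")
```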
Code:
ggplot(mpg) +
stat_summary(mapping = aes(x = manufacturer, y = hwy, fill = manufacturer), fun = "mean", geom = "bar") +
labs(title = "Highway Mean of Manufacturers",
x = "Manufacturer",
y = "Highway mileage/gallon mean") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
ggplot(mpg) +
stat_summary(mapping = aes(x = manufacturer, y = cty, fill = manufacturer), fun = "mean", geom = "bar") +
labs(title = "City Mean of Manufacturers",
x = "Manufacturer",
y = "City mileage/gallon mean") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
Answer: Honda has the highest mean city and highway MPG of all
manufacturers in the mpg data set.
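The ranking read off the bar charts can also be confirmed numerically. A minimal sketch, using base R's `aggregate()` on the `mpg` data set from ggplot2:

```r
library(ggplot2)  # provides the mpg data set

# Mean city and highway mileage per manufacturer
mean_mpg <- aggregate(cbind(cty, hwy) ~ manufacturer, data = mpg, FUN = mean)

# Sort from highest to lowest mean highway mileage
mean_mpg <- mean_mpg[order(-mean_mpg$hwy), ]
head(mean_mpg, 3)
```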
Code & Graphs:
ggplot(mpg) +
stat_summary(mapping = aes(x = manufacturer, y = displ, fill = manufacturer),
fun = "mean", geom = "bar") +
labs(title = "Engine Displacements vs Manufacturers",
x = "Manufacturer",
y = "Engine displacement in liters") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
ggplot(mpg) +
stat_summary(mapping = aes(x = manufacturer, y = cyl, fill = manufacturer),
fun = "mean", geom = "bar") +
labs(title = "Cylinders vs Manufacturers",
x = "Manufacturer",
y = "Number of cylinders") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
ggplot(mpg) +
geom_bar(mapping = aes(x = manufacturer, fill = drv), position = "dodge") +
labs(title = "Drive Train vs Manufacturers",
x = "Manufacturer",
y = "Number of models",   # geom_bar() counts rows, so the y-axis is a count
fill = "Drive train") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "purple4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
Answer: We cannot evaluate fuel economy based solely on
one or two factors, such as city and highway mileage. Other key factors
that directly impact fuel consumption must also be considered, including
engine displacement, the number of cylinders in a vehicle, and the drive
train configuration.
Equation of displacement: \[\text{displacement} = \frac{\pi}{4} \times b^2 \times s \times c\]
Where:
- b represents the bore size,
- s is the stroke length,
- c is the number of cylinders.
Therefore, for a fixed bore and stroke, displacement is proportional to
the number of cylinders, so the two are positively correlated. The more
cylinders a car has, the greater its engine displacement tends to be, which
in turn generally leads to higher fuel consumption.
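The formula and the claimed correlation can be illustrated as follows. This is a sketch: `engine_displacement()` is a helper name introduced here, and the bore and stroke values are hypothetical, chosen only to give a round example.

```r
library(ggplot2)  # provides the mpg data set

# Displacement from bore, stroke and cylinder count.
# Bore and stroke are in centimetres, so the result is in cubic centimetres.
engine_displacement <- function(bore_cm, stroke_cm, cylinders) {
  pi / 4 * bore_cm^2 * stroke_cm * cylinders
}

# Hypothetical engine: 8.6 cm bore, 8.6 cm stroke, 4 cylinders
example_cc <- engine_displacement(8.6, 8.6, 4)
example_cc / 1000  # in litres, roughly 2.0

# Empirical check on the mpg data: cylinders vs displacement,
# and displacement vs highway mileage
cyl_displ_cor <- cor(mpg$cyl, mpg$displ)
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
cyl_displ_cor
displ_hwy_cor
```

A strongly positive first correlation and a negative second one would be consistent with the reasoning above, although correlation alone does not show causation.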
According to the graphs:
1. The first graph shows that Honda has the lowest mean engine
displacement.
2. The second graph indicates that Honda also has the lowest mean number
of cylinders.
3. The final graph reveals that all the Honda vehicles in this data set
have a front-wheel-drive configuration.
In conclusion, Honda demonstrates the best overall fuel economy, as
it has the lowest average number of cylinders, the smallest mean engine
displacement, and front-wheel drive, an efficient drive train
configuration, in every Honda sample in the data set. However, it is
important to note that the comparison may not be entirely fair, as the
samples from other manufacturers may represent different types of vehicles.
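The fairness caveat can be examined by tabulating vehicle classes per manufacturer. A minimal sketch using the `class` column of `mpg`:

```r
library(ggplot2)  # provides the mpg data set

# Cross-tabulate manufacturer against vehicle class
class_by_maker <- table(mpg$manufacturer, mpg$class)
class_by_maker

# Vehicle classes that appear among the Honda rows
honda_classes <- unique(mpg$class[mpg$manufacturer == "honda"])
honda_classes
```

If one manufacturer's rows cover only small cars while others include SUVs and pickups, a comparison restricted to a single class would be a fairer test.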
Code:
getwd()
## [1] "/Users/shoshow/Desktop/Data Anal/HW"
bank_data <- read_csv("~/Desktop/BankChurners.csv")
## Rows: 10127 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Ca...
## dbl (17): CLIENTNUM, Customer_Age, Dependent_count, Months_on_book, Total_Re...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prop.table(table(bank_data$Gender))
##
## F M
## 0.5290807 0.4709193
ggplot(data = bank_data) +
geom_bar(mapping = aes(Gender), fill = "powderblue") +
labs(title = "Gender of Card Holder",
x = "Gender",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "blue4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "slateblue"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
prop.table(table(bank_data$Card_Category))
##
## Blue Gold Platinum Silver
## 0.931766565 0.011454528 0.001974919 0.054803989
ggplot(bank_data) +
geom_bar(aes(Card_Category), fill = "powderblue") +
labs(title = "Categories of Card",
x = "Type of card",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "blue4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "slateblue"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
Answer: The first graph shows that the number of female (F)
cardholders (about 52.9%) is slightly higher than the number of male (M)
cardholders (about 47.1%). The difference is small, indicating a relatively
balanced distribution of card ownership between genders.
The “Blue” card type is by far the most common, accounting for about 93.2%
of all cards. “Silver” cards are a distant second (about 5.5%), followed by
“Gold” (about 1.1%) and “Platinum” (about 0.2%). This suggests that most
customers use “Blue” cards, while premium card types like “Platinum” and
“Gold” are used by a small segment of the customer base.
Code:
summary(bank_data$Customer_Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 41.00 46.00 46.33 52.00 73.00
ggplot(bank_data) +
geom_histogram(aes(Customer_Age), fill = "yellow3", binwidth = 5) +
labs(title = "Age of Card Customers",
x = "Age",
y = "Count") +
scale_x_continuous(limits = c(20, 70)) +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "hotpink4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "hotpink3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
summary(bank_data$Total_Trans_Amt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 510 2156 3899 4404 4741 18484
ggplot(bank_data) +
geom_histogram(aes(Total_Trans_Amt), fill = "yellow3", binwidth = 1000) +  # binwidth overrides bins, so only one is given
labs(title = "Total Transaction Amount for Last 12 Months",
x = "Transaction amount (USD)",
y = "Count") +
scale_x_continuous(labels = scales::dollar) +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "hotpink4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "hotpink3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
Answer: In the first graph, customer age appears to follow a roughly normal
distribution, with most customers concentrated between the ages of
approximately 35 and 55. The peak customer count occurs around the
mid-40s to early 50s. There are fewer customers at both younger and
older age extremes, with the number significantly declining past 60. The
distribution suggests that the majority of cardholders are
middle-aged.
On the other hand, the second graph’s distribution is right-skewed,
with most customers' totals falling below $5,000 (the median is $3,899 and
the mean is $4,404). There are two noticeable peaks in this range,
suggesting common spending patterns among customers. A smaller number of
customers have significantly higher transaction amounts, with some
exceeding $10,000 and even $15,000 (the maximum is $18,484), though these
cases are less frequent. This suggests that while most customers spend
moderately, a segment of high spenders exists.
Code:
table(bank_data$Total_Relationship_Count)
##
## 1 2 3 4 5 6
## 910 1243 2305 1912 1891 1866
ggplot(bank_data) +
geom_bar(aes(x = as.factor(Total_Relationship_Count)), fill = "orchid4") +
labs(title = "Products Held by Customers",
x = "Number of products",
y = "Count") +
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "darkorange4",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "darkorange3"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(0.9)))
Code:
library(nycflights13)
map(flights, ~ sum(is.na(.)))
## $year
## [1] 0
##
## $month
## [1] 0
##
## $day
## [1] 0
##
## $dep_time
## [1] 8255
##
## $sched_dep_time
## [1] 0
##
## $dep_delay
## [1] 8255
##
## $arr_time
## [1] 8713
##
## $sched_arr_time
## [1] 0
##
## $arr_delay
## [1] 9430
##
## $carrier
## [1] 0
##
## $flight
## [1] 0
##
## $tailnum
## [1] 2512
##
## $origin
## [1] 0
##
## $dest
## [1] 0
##
## $air_time
## [1] 9430
##
## $distance
## [1] 0
##
## $hour
## [1] 0
##
## $minute
## [1] 0
##
## $time_hour
## [1] 0
Ask a question of your interest
Question: Is the job tenure correlated with annual
income?
Visualize data to answer your question
Code:
ggplot(loans_full_schema, mapping = aes(x = as.factor(emp_length), y = annual_income)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot(aes(fill = as.factor(emp_length))) +
scale_y_continuous(limits = c(0, 200000), labels = scales::dollar) +
labs(title = "Correlation between Income and Job Length",
x = "Number of years in job",
y = "Annual income (USD)")+
theme(plot.title = element_text(hjust = 0.5, size = rel(2), color = "darkviolet",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.6), color = "orchid4"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(1.1)))
## Warning: Removed 286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 286 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Answer: The median annual income seems to increase
modestly with more years of job tenure, indicating a weak positive
correlation between tenure and income. However, the presence of many
missing (NA) values in emp_length complicates drawing a clear conclusion
from the graph. Note also that the y-axis limits drop the 286 rows with
incomes outside the $0 to $200,000 range, as the warning above reports.
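The visual impression can be backed by a rank correlation and per-group medians. The sketch below is an illustration under stated assumptions: `tenure_income_summary()` is a helper name introduced here, and `toy_loans` is made-up data used only to show the call. It is not drawn from `loans_full_schema`.

```r
# Toy data, invented for illustration only (not from loans_full_schema)
toy_loans <- data.frame(
  emp_length    = c(0, 1, 3, 5, 10, NA, 10, 2),
  annual_income = c(40000, 45000, 52000, 60000, 80000, 30000, 75000, 48000)
)

# Median income per tenure group and Spearman rank correlation,
# computed on rows where tenure is known
tenure_income_summary <- function(df) {
  known <- df[!is.na(df$emp_length), ]
  list(
    median_by_tenure = tapply(known$annual_income, known$emp_length, median),
    spearman_rho     = cor(known$emp_length, known$annual_income,
                           method = "spearman")
  )
}

toy_result <- tenure_income_summary(toy_loans)
toy_result
```

Assuming `loans_full_schema` is loaded with the `emp_length` and `annual_income` columns used above, the same function could be called as `tenure_income_summary(loans_full_schema)`. Spearman's method is used here because income is heavily skewed.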
Try to raise new questions from your plot
New Question: Will the NA values affect the trend? Are
all the reported annual incomes reliable, that is, verified?
Visualize data to answer the new question
Code:
map(loans_full_schema, ~ sum(is.na(.)))
## $emp_title
## [1] 0
##
## $emp_length
## [1] 817
##
## $state
## [1] 0
##
## $homeownership
## [1] 0
##
## $annual_income
## [1] 0
##
## $verified_income
## [1] 0
##
## $debt_to_income
## [1] 24
##
## $annual_income_joint
## [1] 8505
##
## $verification_income_joint
## [1] 0
##
## $debt_to_income_joint
## [1] 8505
##
## $delinq_2y
## [1] 0
##
## $months_since_last_delinq
## [1] 5658
##
## $earliest_credit_line
## [1] 0
##
## $inquiries_last_12m
## [1] 0
##
## $total_credit_lines
## [1] 0
##
## $open_credit_lines
## [1] 0
##
## $total_credit_limit
## [1] 0
##
## $total_credit_utilized
## [1] 0
##
## $num_collections_last_12m
## [1] 0
##
## $num_historical_failed_to_pay
## [1] 0
##
## $months_since_90d_late
## [1] 7715
##
## $current_accounts_delinq
## [1] 0
##
## $total_collection_amount_ever
## [1] 0
##
## $current_installment_accounts
## [1] 0
##
## $accounts_opened_24m
## [1] 0
##
## $months_since_last_credit_inquiry
## [1] 1271
##
## $num_satisfactory_accounts
## [1] 0
##
## $num_accounts_120d_past_due
## [1] 318
##
## $num_accounts_30d_past_due
## [1] 0
##
## $num_active_debit_accounts
## [1] 0
##
## $total_debit_limit
## [1] 0
##
## $num_total_cc_accounts
## [1] 0
##
## $num_open_cc_accounts
## [1] 0
##
## $num_cc_carrying_balance
## [1] 0
##
## $num_mort_accounts
## [1] 0
##
## $account_never_delinq_percent
## [1] 0
##
## $tax_liens
## [1] 0
##
## $public_record_bankrupt
## [1] 0
##
## $loan_purpose
## [1] 0
##
## $application_type
## [1] 0
##
## $loan_amount
## [1] 0
##
## $term
## [1] 0
##
## $interest_rate
## [1] 0
##
## $installment
## [1] 0
##
## $grade
## [1] 0
##
## $sub_grade
## [1] 0
##
## $issue_month
## [1] 0
##
## $loan_status
## [1] 0
##
## $initial_listing_status
## [1] 0
##
## $disbursement_method
## [1] 0
##
## $balance
## [1] 0
##
## $paid_total
## [1] 0
##
## $paid_principal
## [1] 0
##
## $paid_interest
## [1] 0
##
## $paid_late_fees
## [1] 0
ggplot(loans_full_schema, mapping = aes(x = as.factor(emp_length),
fill = verified_income)) +
geom_bar(position = "dodge") +
labs(title = "Income Verification Status by Job Length",
x = "Number of years in job",
y = "Number of loans",   # geom_bar() counts rows, so the y-axis is a count
fill = "Income verification")+
theme(plot.title = element_text(hjust = 0.5, size = rel(1.7), color = "darkviolet",
margin = margin(15, 15, 15, 15)),
axis.title = element_text(hjust = 0.5, size = rel(1.4), color = "orchid4"),
axis.title.x = element_text(margin = margin(10, 5, 5, 5)),
axis.title.y = element_text(margin = margin(5, 10, 5, 5)),
axis.text = element_text(size = rel(1.1)))
Answer: After using the map() function to count the missing values in
each column, we found 817 missing (NA) values in emp_length (job tenure).
If borrowers with missing tenure differ systematically in income from the
rest, the trend seen across the known tenure groups may not be
representative.
According to the dodged bar plot, a significant portion of the loans
falls under “Not Verified,” so caution should be exercised when
interpreting income trends. The missing values (NA) further
complicate the analysis.
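Whether the missing tenure values matter can be probed by comparing incomes between rows with and without a known tenure. The sketch below rests on assumptions: `missing_tenure_check()` is a helper name introduced here, and `toy_loans` is invented data used only to demonstrate the call.

```r
# Toy data, invented for illustration only (not from loans_full_schema)
toy_loans <- data.frame(
  emp_length      = c(1, 5, NA, 10, NA, 3),
  annual_income   = c(45000, 60000, 20000, 80000, 25000, 52000),
  verified_income = c("Not Verified", "Verified", "Not Verified",
                      "Source Verified", "Not Verified", "Verified")
)

# Share of rows with missing tenure, median income by missingness,
# and the share of each income verification status
missing_tenure_check <- function(df) {
  is_missing <- is.na(df$emp_length)
  group <- ifelse(is_missing, "tenure missing", "tenure known")
  list(
    share_missing      = mean(is_missing),
    median_income      = tapply(df$annual_income, group, median),
    verification_share = prop.table(table(df$verified_income))
  )
}

toy_check <- missing_tenure_check(toy_loans)
toy_check
```

Assuming `loans_full_schema` is loaded with these three columns, `missing_tenure_check(loans_full_schema)` would give the corresponding figures. A large gap between the two medians would suggest that the missing values are not missing at random.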