The gpa data set is available through the openintro package in R. Answer the following questions with an appropriate graph. Summarize your findings in plain text for each graph to answer the question.
Task list:
By doing your own research, give the precise meaning of each variable.
Visualize the relationship between studyweek and gpa. What does your graph indicate?
library(openintro)  # provides the gpa, loans_full_schema and ames data sets
library(ggplot2)    # provides ggplot() and the geoms used throughout
ggplot(data=gpa, mapping=aes(x = studyweek, y = gpa)) +
geom_point() + geom_smooth() +
labs(title = "Duke University Students Weekly Study Hours vs GPA", x = "Hours Studied Per Week", y = "GPA") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Answer: The graph shows no clear correlation
between weekly study hours and GPA among the Duke University students.
This may be due to the small sample of only 55 students, or because
performance depends less on how long students study than on how
effectively they study.
ggplot(data=gpa, mapping=aes(x = out, y = gpa)) +
geom_point(position = "jitter") + geom_smooth() +
labs(title = "Duke University Students Weekly Nights Out vs GPA", x = "Nights Out Per Week", y = "GPA") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Answer: The graph also shows no clear
correlation between the number of nights out per week and GPA among the
Duke University students. This may be due to the small sample of only 55
students, or because performance does not decline with more nights out.
ggplot(data=gpa, mapping=aes(x = out, y = sleepnight)) +
geom_point(position = "jitter") + geom_smooth() +
labs(title = "Duke University Students Weekly Nights Out vs Sleeping Hours", x = "Nights Out Per Week", y = "Average Sleep Hours Per Night") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Answer: The graph shows a moderate positive
correlation between the number of nights out per week and the average
hours of sleep per night among the Duke University students, which
suggests that students who go out more actually get more sleep.
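The three scatterplot readings above can be cross-checked numerically. The following is a minimal sketch, assuming the openintro package is installed and that `gpa`, `studyweek`, `sleepnight` and `out` are the numeric columns of the gpa data set, as used in the plots. Reporting both Pearson and Spearman coefficients is an illustrative choice, not part of the original assignment.

```r
library(openintro)  # provides the gpa data set

# Numeric columns used in the scatterplots above
num_cols <- c("gpa", "studyweek", "sleepnight", "out")

# Pearson correlation measures linear association
pearson <- cor(gpa[, num_cols], use = "complete.obs")

# Spearman (rank) correlation is less sensitive to outliers and to
# coarse, discrete variables such as nights out per week
spearman <- cor(gpa[, num_cols], method = "spearman", use = "complete.obs")

print(round(pearson, 2))
print(round(spearman, 2))
```

Coefficients close to 0 would support the "no clear correlation" reading, while a clearly positive entry for `out` against `sleepnight` would support the third answer. With only 55 students, any coefficient carries a wide margin of uncertainty.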
ggplot(data=gpa, mapping=aes(x = gender, y = studyweek)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Duke University Students Gender vs Weekly Study Hours", x = "Gender", y = "Hours Studied Per Week") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: The graph shows that, on average, the
female students study longer per week than the male students at Duke
University.
ggplot(data=gpa, mapping=aes(x = gender, y = out)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Duke University Students Gender vs Weekly Nights Out", x = "Gender", y = "Nights Out Per Week") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: The graph shows that, on average, the
male students go out at night more often than the female students at
Duke University.
Present a question of your own interest related to this data set. Answer your question with analysis or visualization.
Question: Is there a difference in average GPA
between male and female students?
ggplot(data=gpa, mapping=aes(x = gender, y = gpa)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Duke University Students Gender vs GPA", x = "Gender", y = "GPA") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: The graph shows that, on average, the
female students have a higher GPA than the male students at Duke
University.
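A boxplot draws the median rather than the mean, so "on average" in the gender comparisons above can be made precise by computing both. This is a minimal sketch, assuming the column names `gender`, `gpa`, `studyweek` and `out` used in the plots. The copy named `students` is a hypothetical helper that avoids confusing the data set `gpa` with its column `gpa`.

```r
library(openintro)  # provides the gpa data set

# Hypothetical copy so the data set and its gpa column have different names
students <- gpa

# Median by gender: the centre line of each boxplot
medians <- aggregate(cbind(gpa, studyweek, out) ~ gender,
                     data = students, FUN = median)

# Mean by gender: the usual sense of "on average"
means <- aggregate(cbind(gpa, studyweek, out) ~ gender,
                   data = students, FUN = mean)

print(medians)
print(means)
```

If the means and medians rank the two groups in the same order, the wording of the answers holds under either reading.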
Finish the following data visualization tasks using the full loans_full_schema data set (55 columns) in the openintro library. For each task, you need to summarize what you learn from the graph accurately and concisely.
ggplot(data=loans_full_schema, mapping=aes(x = loan_amount)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5000, boundary = 0) +
geom_density(adjust = 30/8, linewidth = 1.2, colour = 'red') +
labs(title = "Distribution of Loan Amounts through the Lending Club", x = "The amount of the loan", y = "Density") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: The number of loans the applicants
received increases as the amount approaches 10,000, then decreases as
the amount rises past 10,000.
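The shape described above can be checked by counting loans per bin directly. This sketch assumes the same bins as the histogram (width 5000, starting at 0, closed on the right, which is the default for both `geom_histogram()` and `cut()`). The helper names `amount`, `upper`, `breaks` and `counts` are illustrative.

```r
library(openintro)  # provides loans_full_schema

amount <- loans_full_schema$loan_amount

# Upper edge rounded up to the next multiple of 5000 so every loan is covered
upper <- ceiling(max(amount, na.rm = TRUE) / 5000) * 5000
breaks <- seq(0, upper, by = 5000)

# Count loans per bin; dig.lab = 5 keeps labels like (5000,10000] readable
counts <- table(cut(amount, breaks = breaks, dig.lab = 5))
print(counts)

# Label of the fullest bin
print(names(counts)[which.max(counts)])
```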
ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = loan_purpose)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Distribution of Loan through the Lending Club", x = "The amount of the loan", y = "The purpose of the loan") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: This graph indicates that a
categorical variable such as the purpose of the loan does affect the
distribution of a numeric variable, in this case the amount of the
loan.
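To make the comparison across purposes concrete, the group medians and group sizes can be tabulated. This is a rough sketch, assuming the columns `loan_amount` and `loan_purpose` used in the plot. Sorting by median is an illustrative choice.

```r
library(openintro)  # provides loans_full_schema

# Median loan amount for each stated purpose
med_by_purpose <- with(loans_full_schema,
                       tapply(loan_amount, loan_purpose, median, na.rm = TRUE))

# Number of loans behind each median; small groups are less reliable
n_by_purpose <- table(loans_full_schema$loan_purpose)

print(sort(med_by_purpose))
print(n_by_purpose)
```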
ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = interest_rate)) +
geom_bin_2d() +
labs(title = "Relationship Between Loan Amount and Interest Rate", x = "The amount of the loan", y = "Interest Rate") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.
Answer: This graph shows that, in general,
loans with lower amounts and lower interest rates are the most
frequent.
ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = interest_rate)) +
geom_point() +
facet_wrap(~ grade, nrow = 2) +
labs(title = "Interest Rate vs Loan Amount by Grade", x = "Loan Amount", y = "Interest Rate") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: This graph shows that higher grades go
with lower interest rates, regardless of the amount of the loan.
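The reading of the faceted plot can be checked with two numbers per grade: the median interest rate, and the correlation between loan amount and interest rate within the grade. This is a minimal sketch under the assumption that `grade`, `interest_rate` and `loan_amount` are the columns used in the plot. The names `med_rate` and `r_within` are illustrative.

```r
library(openintro)  # provides loans_full_schema

# Median interest rate within each grade
med_rate <- with(loans_full_schema,
                 tapply(interest_rate, grade, median, na.rm = TRUE))

# Correlation between loan amount and interest rate inside each grade;
# drop = TRUE skips grade levels that contain no loans
r_within <- sapply(
  split(loans_full_schema, loans_full_schema$grade, drop = TRUE),
  function(d) cor(d$loan_amount, d$interest_rate, use = "complete.obs")
)

print(round(med_rate, 2))
print(round(r_within, 2))
```

Medians that rise steadily from one grade to the next, together with within-grade correlations near 0, would match the statement that the interest rate follows the grade rather than the loan amount.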
ggplot(data=loans_full_schema, mapping=aes(x = interest_rate, y = loan_amount)) +
geom_point() +
facet_grid(term ~ grade) +
labs(title = "Interest Rate vs Loan Amount by Grade and Term", x = "Interest Rate", y = "Loan Amount") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: This graph shows that higher grades go
with lower interest rates, regardless of the amount of the loan. However,
higher grades see more short-term loans, while lower grades see more
long-term loans.
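The claim about loan terms can be quantified with a cross-tabulation of term by grade. This sketch assumes the `term` and `grade` columns used in `facet_grid()`. The names `tab` and `share` are illustrative.

```r
library(openintro)  # provides loans_full_schema

# Number of loans for each combination of term and grade
tab <- with(loans_full_schema, table(term, grade))

# Share of each term within a grade (each column sums to 1)
share <- prop.table(tab, margin = 2)

print(tab)
print(round(share, 2))
```

A share of short-term loans that falls from the higher grades to the lower grades would support the second sentence of the answer.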
Present a question of your own interest related to this data set. Answer your question with analysis or visualization.
Question: How does homeownership status
affect the amount of the loan?
ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = homeownership)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Distribution of Loan through the Lending Club", x = "The amount of the loan", y = "The ownership status") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: This graph shows that people who rent
their residence tend to apply for lower loan amounts than people who
own their residence, while people with a mortgage on their residence
tend to apply for higher loan amounts than people who own their
residence.
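The box edges and centre lines that this answer relies on are the quartiles of loan amount within each ownership group, which can be printed directly. This is a sketch assuming the `loan_amount` and `homeownership` columns used in the plot. The name `quartiles` is illustrative.

```r
library(openintro)  # provides loans_full_schema

# 25th, 50th and 75th percentile of loan amount per ownership status;
# the extra probs argument is passed through to quantile()
quartiles <- aggregate(loan_amount ~ homeownership,
                       data = loans_full_schema,
                       FUN = quantile, probs = c(0.25, 0.5, 0.75))

print(quartiles)
```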
The ames data set is available through the openintro package in R.
Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.
Answer: The ames data set is about housing
prices in Ames, Iowa, from 2006 to 2010. The data set has 2930 samples
and 82 features, which cover the location, property specifications, and
condition of the houses as well as their sale information.
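The sample and feature counts quoted above can be read off the data set itself. This is a short sketch assuming the openintro package is installed. Looking at only the first ten column names is an arbitrary choice for brevity.

```r
library(openintro)  # provides the ames data set

# Number of rows (houses) and columns (features)
print(dim(ames))

# First few feature names, to get a sense of the scope
print(head(names(ames), 10))

# How many features fall into each storage type (numeric, factor, ...)
print(table(sapply(ames, function(col) class(col)[1])))
```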
Use a plot to analyze how area correlates with price. Summarize your finding from the graph.
ggplot(ames, aes(x = area, y = price)) +
geom_point() + geom_smooth() +
labs(title = "Correlation Between Living Area and Sale Price", x = "Above Grade Living Area (sq ft)", y = "Sale Price (USD)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Answer: The graph shows that the larger the
house, the higher the sale price, but only up to 300 square feet,
probably due to a lack of data beyond that point.
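Whether the smooth curve is poorly supported at large areas can be checked by counting houses per area range. This is a minimal sketch, assuming the `area` and `price` columns used in the plot. The cut points of 1000, 2000, 3000 and 4000 square feet are hypothetical and chosen only for illustration.

```r
library(openintro)  # provides the ames data set

# Hypothetical area ranges in square feet; the last bin is open-ended
area_bins <- cut(ames$area,
                 breaks = c(0, 1000, 2000, 3000, 4000, Inf),
                 dig.lab = 5)

# Number of houses in each range; sparse bins mean an unreliable smooth
print(table(area_bins))

# Overall linear association between living area and sale price
print(cor(ames$area, ames$price, use = "complete.obs"))
```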
ggplot(data=ames, mapping=aes(x = Bldg.Type, y = price)) +
stat_boxplot(geom = "errorbar", width = 0.5) +
geom_boxplot() +
labs(title = "Correlation Between Bldg.Type and Sale Price", x = "Type of dwelling", y = "Sale Price (USD)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: 1Fam = single-family house, 2fmCon =
two-family house, Duplex = a building divided into two separate houses,
Twnhs = townhouse, TwnhsE = townhouse end unit. The graph shows that the
sale price does not differ much between the different types of
dwelling.
ggplot(data=ames, mapping=aes(x = area, y = price)) +
geom_point() +
facet_wrap(~ Bldg.Type, nrow = 2) +
labs(title = "Correlation Between Living Area and Sale Price by Bldg.Type", x = "Above Grade Living Area (sq ft)", y = "Sale Price (USD)") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
Answer: The graph shows that for most types of
dwelling the sale price increases as the area increases, except for
two-family houses and duplexes.
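The per-panel reading above can be backed by a count and a correlation for each dwelling type. This sketch assumes the column names `Bldg.Type`, `area` and `price` as used in the plots. The name `by_type` is illustrative.

```r
library(openintro)  # provides the ames data set

# For each dwelling type: number of houses and area-price correlation
by_type <- sapply(
  split(ames, ames$Bldg.Type, drop = TRUE),
  function(d) c(n = nrow(d),
                r = cor(d$area, d$price, use = "complete.obs"))
)

# One row per dwelling type
print(round(t(by_type), 2))
```

Types with few houses give noisy correlations, so a weak trend in a small panel is weaker evidence than the same trend in the single-family panel.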