1. Data set gpa


Question 1: Do more study hours lead to a higher GPA in this data set?

library(tidyverse)
library(openintro)

ggplot(gpa) +
  geom_point(aes(studyweek, gpa, color = gender)) +
  labs(title = "Study time vs GPA", x = "Study time (hrs)", y = "GPA") +
  theme(plot.title = element_text(hjust = 0.5))

Answer: No, the plot shows no clear relationship between study hours and GPA.

Reasoning: The points in the scatter plot show no upward or downward trend.
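As a rough numeric check on the visual impression, the Pearson correlation between the two variables can be computed. This is only a sketch: the variable name `r_study_gpa` is introduced here for illustration, and a coefficient near zero would rule out a linear trend only, not every kind of relationship.

```r
library(openintro)

# Pearson correlation between weekly study hours and GPA.
# use = "complete.obs" drops rows with missing values, if there are any.
r_study_gpa <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")
r_study_gpa
```

A value close to 0 would agree with the answer above; a value clearly away from 0 would call for a second look at the plot.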


Question 2: Do more going-out nights lead to a lower GPA in this data set?

ggplot(gpa, aes(out, gpa, color = gender)) +
  geom_point() +
  labs(title = "Nights out vs GPA", x = "Nights out (per week)", y = "GPA") +
  theme(plot.title = element_text(hjust = 0.5))

Answer: No, more going-out nights do not appear to lead to a lower GPA.

Reasoning: The graph shows no visible relationship between the two variables.


Question 3: Is there a correlation between sleeping hours and the number of going-out nights?

ggplot(gpa, aes(out, sleepnight, color = gender)) +
  geom_point() +
  labs(title = "Sleep vs nights out", x = "Nights out (per week)", y = "Sleep per night (hrs)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Answer: According to the graph, there is a positive correlation between sleeping hours and the number of going-out nights.

Reasoning: The scatter plot shows an upward pattern, and the smoothed curve for female students suggests a somewhat stronger relationship than the curve for male students. One possible explanation is that students who go out more need more rest.
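The comparison between the two gender curves can be sketched numerically with a per-group correlation. The names `cor_by_gender` and `r_out_sleep` are introduced here for illustration only, and with a small group any coefficient should be read with caution.

```r
library(dplyr)
library(openintro)

# Pearson correlation between nights out and sleep, computed per gender.
cor_by_gender <- gpa %>%
  group_by(gender) %>%
  summarise(
    n = n(),                                                # group size
    r_out_sleep = cor(out, sleepnight, use = "complete.obs")
  )
cor_by_gender
```

Positive values in both rows would agree with the answer above, and the larger of the two would indicate which group shows the stronger linear association.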


Question 4: How many male students and female students are there in the data set?

ggplot(gpa)+ geom_bar(mapping=aes(x = gender, fill= gender))+
  labs(title = "Numbers of students by gender",x = "Gender", y = "Count")+
  theme(plot.title = (element_text(hjust= 0.5)))

Answer: There are 12 male students and 43 female students.

Reasoning: We can read the counts from the heights of the bars.
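Reading exact counts off a bar plot is error-prone, so a count table is a useful cross-check. A minimal sketch, with `gender_counts` as an illustrative name:

```r
library(dplyr)
library(openintro)

# One row per gender, with the number of students in column n.
gender_counts <- count(gpa, gender)
gender_counts
```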


Question 5: Do female students study more hours than male students in the data set?

ggplot(gpa)  + geom_boxplot(aes(studyweek,gender),color = "orange") +
  labs(title="Study hours difference", y= "Gender", x= "Study times (hrs)") +
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13) ,
        axis.text=element_text(size=12), plot.margin = margin(15,24,20,14))

Answer: Yes, in terms of the median, female students study more hours than male students in this data set.

Reasoning: The box plot shows a median of about 15 study hours for female students and about 13 hours for male students. One male student is an outlier at about 42 hours, but this single extreme case does not change the comparison, because the median is not sensitive to outliers.
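To illustrate why the comparison uses the median rather than the mean, both can be computed side by side. This is a sketch; `study_by_gender`, `median_study` and `mean_study` are names chosen here for illustration.

```r
library(dplyr)
library(openintro)

# Median and mean weekly study hours per gender.
# A single extreme value pulls the mean but barely moves the median.
study_by_gender <- gpa %>%
  group_by(gender) %>%
  summarise(
    median_study = median(studyweek, na.rm = TRUE),
    mean_study   = mean(studyweek, na.rm = TRUE)
  )
study_by_gender
```

If the outlier matters, it should show up as a gap between the mean and the median for male students.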


Question 6: Do male students go out more than female students in the data set?

ggplot(gpa)  + geom_boxplot(aes(out,gender),color = "orange") +
  labs(title="Nights out by gender", y= "Gender", x= "Nights out (per week)") +
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13) ,
        axis.text=element_text(size=12), plot.margin = margin(15,24,20,14))

Answer: Yes, according to the data, male students go out more often than female students.

Reasoning: In the box plot, the median for male students is about 2.5 nights out, while the median for female students is 2.


Question 7: Do female students have a higher GPA than male students in the data set?

ggplot(gpa)  + geom_boxplot(aes(gpa,gender),color = "orange") +
  labs(title="GPA difference by gender", y= "Gender", x= "GPA") +
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13) ,
        axis.text=element_text(size=12), plot.margin = margin(15,24,20,14))

Answer: Yes, according to the graph, female students generally have a higher GPA than male students.

Reasoning: The median GPA is about 3.7 for female students and about 3.5 for male students. One male student is an outlier with an extremely high GPA, but that is a single special case.


2. Data set loans_full_schema


Question 1: Study the variable loan_purpose. How many loan purposes are there in the data set? List all of them.

ggplot(loans_full_schema) + geom_bar(aes(loan_purpose),fill = "lightblue") + coord_flip() +
  labs(title = "The distribution of Loan purpose") +
  theme(plot.title = element_text(hjust = 0.5,size = 18), axis.text=element_text(size = 8))

Answer: There are 12 loan purposes in total: moving, debt_consolidation, other, credit_card, home_improvement, medical, house, small_business, car, major_purchase, renewable_energy, and vacation.
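The list of purposes can also be pulled directly from the data instead of being read off the axis. A small sketch, where `purposes` is an illustrative name:

```r
library(openintro)

# Distinct loan purposes, sorted alphabetically.
purposes <- sort(unique(as.character(loans_full_schema$loan_purpose)))
length(purposes)  # number of distinct purposes
purposes
```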


Question 2: Make a histogram of loan_purpose with x-axis being the counts and y-axis being the purposes. You can use + coord_flip() to flip the x- and y-axis. What are the three most common reasons for making the loan in the data set?

# labs() names follow the aesthetics, not the flipped positions:
# x is loan_purpose (drawn vertically after coord_flip), y is the count.
ggplot(loans_full_schema) + geom_bar(aes(loan_purpose),fill="orange")+coord_flip()+
  labs(title = "The distribution of Loan purpose", x = "Loan Purpose", y="Count") + 
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13), 
        axis.text=element_text(size=10), plot.margin = margin(15,24,20,14)) 

Answer: debt_consolidation, credit_card and other are the three most common reasons for making the loan in the data set.
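The ranking can be checked with a sorted count table rather than by comparing bar lengths. A minimal sketch, with `top_purposes` as an illustrative name:

```r
library(dplyr)
library(openintro)

# Count loans per purpose, most frequent first, and keep the top three.
top_purposes <- loans_full_schema %>%
  count(loan_purpose, sort = TRUE) %>%
  head(3)
top_purposes
```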


Question 3: In terms of loan amount, which loan purpose results in the highest loan amount? Which results in the lowest? Make sense of your answer.

ggplot(loans_full_schema) + geom_boxplot(aes(loan_amount,loan_purpose),fill="lightyellow")+
  labs(title = "Loan amount vs Loan purpose", y = "Loan Purpose", x="Loan amount ($)") + 
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13), 
        axis.text=element_text(size=10), plot.margin = margin(15,24,20,14)) 

Answer: In terms of the median, small business has the highest loan amount and vacation has the lowest.

Reasoning: According to the graph, the median small business loan is about $18k, while the median vacation loan is about $5k. This makes sense to me: starting a business of any kind costs a substantial amount of money. The plot “The distribution of Loan purpose” shows that small business loans are relatively rare, yet they still have the highest median amount, which also supports the conclusion. By contrast, people normally don’t take out a loan for a vacation.
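The medians read from the box plot can be checked with a summary table. This is a sketch under the assumption that the median is the right measure of a "typical" loan; `median_amount` is an illustrative name.

```r
library(dplyr)
library(openintro)

# Median loan amount and number of loans per purpose, highest median first.
median_amount <- loans_full_schema %>%
  group_by(loan_purpose) %>%
  summarise(n = n(), median_amount = median(loan_amount, na.rm = TRUE)) %>%
  arrange(desc(median_amount))

# First row: highest median loan amount; last row: lowest.
bind_rows(head(median_amount, 1), tail(median_amount, 1))
```

The `n` column shows how many loans each median is based on, which matters when a purpose is rare.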


Question 4: Borrowers with which loan purpose have the highest median annual income? Does this make sense to you?

ggplot(loans_full_schema) + geom_boxplot(aes(annual_income,loan_purpose),fill="lightyellow")+
  labs(title = "Annual income vs Loan purpose", y = "Loan Purpose", x="Annual income ($)") + 
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13), 
        axis.text=element_text(size=10), plot.margin = margin(15,24,20,14)) 

# Zoom with coord_cartesian() instead of xlim(): xlim() drops the rows outside
# the range before the box plot statistics are computed, which shifts the medians.
ggplot(loans_full_schema) + geom_boxplot(aes(annual_income,loan_purpose),fill="lightyellow")+
  labs(title = "Annual income vs Loan purpose", y = "Loan Purpose", x="Annual income ($)") + 
  theme(plot.title = element_text(hjust = 0.5, size = 18), axis.title = element_text(size = 13), 
        axis.text=element_text(size=10), plot.margin = margin(15,24,20,14)) +
  coord_cartesian(xlim = c(0,150000))

Answer: According to the plot, borrowers with small business loans have the highest median annual income.

Reasoning: The graph shows a median annual income of about $75k for small business borrowers. This makes sense to me because business owners generally earn more than average.
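Because the medians are close together and read by eye, a table computed on every row, including incomes above $150k, can confirm the ranking. This is a sketch; `income_by_purpose` and `n_above_150k` are names introduced here for illustration.

```r
library(dplyr)
library(openintro)

# Median annual income per purpose on the full data, highest first,
# plus the number of borrowers per purpose earning more than $150k.
income_by_purpose <- loans_full_schema %>%
  group_by(loan_purpose) %>%
  summarise(
    median_income = median(annual_income, na.rm = TRUE),
    n_above_150k  = sum(annual_income > 150000, na.rm = TRUE)
  ) %>%
  arrange(desc(median_income))
income_by_purpose
```

The top row of this table is the purpose with the highest median income; the `n_above_150k` column shows how many borrowers a plot limited to $150k would hide in each group.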


Question 5: Is there a relationship between annual income and loan amount? Why?

ggplot(loans_full_schema) + geom_point(aes(annual_income,loan_amount ), color = "blue") +
  labs(title = "Loan amount vs Annual income", x = "Annual income ($)", y="Loan amount ($)") + 
  theme(plot.title = element_text(hjust = 0.5, size = 16), axis.title = element_text(size = 13), 
        axis.text=element_text(size=10), plot.margin = margin(15,24,20,14))

Answer: The graph shows no direct relationship between the two variables.

Reasoning: According to the graph, most points are concentrated at annual incomes between 0 and 250k, and the loan amounts there show no trend with income. There are a few outliers, but they do not change the overall picture. This did not make sense to me at first. Then I saw in the plot “The distribution of Loan purpose” that most borrowers in this data set take loans for debt consolidation and credit cards, which may explain why the loan amount shows no relationship with annual income.
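With this many overlapping points and a strongly right-skewed income axis, a trend can be hard to see by eye. A rank-based (Spearman) correlation is one hedged way to check, since it is less affected by the extreme incomes; the name `rho` is introduced here for illustration.

```r
library(openintro)

# Spearman rank correlation between annual income and loan amount.
# Ranks limit the influence of the few very large incomes.
rho <- cor(loans_full_schema$annual_income, loans_full_schema$loan_amount,
           method = "spearman", use = "complete.obs")
rho
```

A value near 0 would support the answer above; a clearly positive value would suggest a relationship that the scatter plot hides.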