1. Data visualization and exploration tasks with gpa data set

The gpa data set is available through the openintro package in R. Answer the following questions with an appropriate graph. For each graph, summarize your findings in plain text to answer the question.

Task list:

  1. By doing your own research, give the precise meaning of each variable.

    • gpa: Grade Point Average of Duke University students (0-4.0 scale).
    • studyweek: Number of hours the student spends studying at night per week.
    • sleepnight: Average number of hours the student sleeps per night.
    • out: Number of nights the students go out per week.
    • gender: Gender of the student (male or female).


  2. Visualize the relationship between studyweek and gpa. What does your graph indicate?

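library(openintro)  # provides the gpa, loans_full_schema, and ames data sets
library(ggplot2)    # plotting functions used throughout this report

ggplot(data=gpa, mapping=aes(x = studyweek, y = gpa)) +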
  geom_point() + geom_smooth() +
  labs(title = "Duke University Students Weekly Study Hours vs GPA", x = "Hours Studied Per Week", y = "GPA") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Answer: The graph shows no clear relationship between weekly study hours and GPA among the Duke University students surveyed. This may be because the sample is small (only 55 students), or because performance depends less on how long students study than on how effectively they study.
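
As a rough numeric check on the scatterplot, the strength of the association can be estimated with a correlation coefficient. The sketch below is an illustrative addition rather than part of the original analysis; it assumes the openintro package is installed and that neither column requires special handling beyond dropping missing values.

```r
library(openintro)  # provides the gpa data set

# Pearson measures linear association; Spearman is rank-based and is
# less sensitive to a few students with unusually long study hours.
r_pearson  <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")
r_spearman <- cor(gpa$studyweek, gpa$gpa, method = "spearman",
                  use = "complete.obs")

round(c(pearson = r_pearson, spearman = r_spearman), 3)
```

Values near 0 would be consistent with the flat trend in the graph; with only 55 students, either estimate carries considerable uncertainty.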


  3. Visualize the relationship between out and gpa. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x = out, y = gpa)) +
  geom_point(position = "jitter") + geom_smooth() +
  labs(title = "Duke University Students Weekly Night Out vs GPA", x = "Night Out Per Week", y = "GPA") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Answer: The graph shows no clear relationship between the number of nights out per week and GPA. This may be because the sample is small (only 55 students), or because students' performance does not decline as they go out more often.


  4. Visualize the relationship between out and sleepnight. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x = out, y = sleepnight)) +
  geom_point(position = "jitter") + geom_smooth() +
  labs(title = "Duke University Students Weekly Nights Out vs Sleeping Hours", x = "Nights Out Per Week", y = "Average Sleep Hours Per Night") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Answer: The graph shows a moderate positive relationship between the number of nights out per week and average hours of sleep per night, which suggests that students who go out more often actually sleep more.

  5. Visualize the relationship between gender and studyweek. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x = gender, y = studyweek)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Duke University Students Gender vs Weekly Study Hours", x = "Gender", y = "Hours Studied Per Week") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The graph shows that female students typically study more hours per week than male students; the boxplot compares medians, so this describes the typical student in each group.


  6. Visualize the relationship between gender and out. What does your graph indicate?

ggplot(data=gpa, mapping=aes(x = gender, y = out)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Duke University Students Gender vs Weekly Night Out", x = "Gender", y = "Night Out Per Week") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The graph shows that male students typically go out at night more often than female students.


  7. Present a question of your own interest related to this data set. Answer your question with analysis or visualization.


    Question: Does GPA differ, on average, between male and female students?

ggplot(data=gpa, mapping=aes(x = gender, y = gpa)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Duke University Students Gender vs GPA", x = "Gender", y = "GPA") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The graph shows that female students typically have a higher GPA than male students.
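
The three gender boxplots above compare medians visually. A small summary table can make the same comparison explicit. This is a sketch added for illustration, assuming the openintro package is installed; `by_gender` is a name introduced here and is not part of the original report.

```r
library(openintro)  # provides the gpa data set

# Median GPA, weekly study hours, and nights out for each gender.
# Inside the formula, gpa, studyweek, and out refer to columns of the
# gpa data frame passed via data =.
by_gender <- aggregate(cbind(gpa, studyweek, out) ~ gender,
                       data = gpa, FUN = median)
by_gender
```

Replacing `FUN = median` with `FUN = mean` gives group means, which can differ from the medians when a group contains a few extreme values.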


2. Data visualization tasks with loans_full_schema data set

Finish the following data visualization tasks using the full loans_full_schema data set (55 columns) in the openintro package. For each task, you need to summarize what you learn from the graph accurately and concisely.

  1. Create a histogram of a numeric variable that you select and plot a density curve on top of the histogram. Carefully select bin numbers/sizes/boundaries to make the plot informative. What does this graph indicate?

ggplot(data=loans_full_schema, mapping=aes(x = loan_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000, boundary = 0) +
  geom_density(adjust = 30/8, linewidth = 1.2, colour = 'red') +
  labs(title = "Distribution of Loan Amounts through Lending Club", x = "The amount of the loan", y = "Density") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The number of loans increases as the loan amount approaches 10,000 and then decreases as the amount rises beyond 10,000.
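
To see what the histogram's `binwidth = 5000, boundary = 0` settings count, the same bins can be tabulated directly. This is an illustrative sketch, assuming the openintro package is installed; `breaks`, `bins`, and `bin_counts` are names introduced here.

```r
library(openintro)  # provides the loans_full_schema data set

amount <- loans_full_schema$loan_amount

# Bin edges every 5000 starting at 0, extended far enough to cover the maximum.
breaks <- seq(0, ceiling(max(amount, na.rm = TRUE) / 5000) * 5000, by = 5000)

# Right-closed intervals (a, b], matching the default used by geom_histogram().
bins <- cut(amount, breaks = breaks, include.lowest = TRUE)
bin_counts <- table(bins)

bin_counts
names(bin_counts)[which.max(bin_counts)]  # the bin holding the most loans
```

Aligning bin edges to multiples of 5000 keeps round loan amounts on bin boundaries instead of splitting them arbitrarily.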


  2. Create a graph to study the effect of a categorical/discrete variable on the distributions of a numeric variable. What does this graph indicate?

ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = loan_purpose)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Loan Amount by Loan Purpose", x = "The amount of the loan", y = "The purpose of the loan") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: This graph indicates that the distribution of the loan amount differs depending on the purpose of the loan, so a categorical variable can affect the distribution of a numeric variable.


  3. Create a bin heatmap (2d density plot) to study the relationship between two numeric variables that you select. Summarize the findings from the graph.

ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = interest_rate)) +
  geom_bin_2d() +
  labs(title = "Relationship Between Loan Amount and Interest Rate", x = "The amount of the loan", y = "Interest Rate") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.


Answer: This graph shows that loans are most concentrated at lower loan amounts and lower interest rates; combinations of a high amount and a high rate are less frequent.
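
The console message above asks for an explicit bin size in place of the default `bins = 30`. One way to respond is to pass `binwidth` as a pair of widths for x and y. In the sketch below, 2500 for the loan amount and 2 percentage points for the interest rate are assumed values chosen for illustration, not values from the original report.

```r
library(openintro)  # provides the loans_full_schema data set
library(ggplot2)

# binwidth = c(x width, y width); both values are illustrative assumptions.
p <- ggplot(loans_full_schema, aes(x = loan_amount, y = interest_rate)) +
  geom_bin_2d(binwidth = c(2500, 2)) +
  labs(title = "Relationship Between Loan Amount and Interest Rate",
       x = "The amount of the loan", y = "Interest Rate", fill = "Loans")

p
```

Wider bins smooth out the pattern, while narrower bins show more detail but leave many nearly empty cells.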


  4. Use facet_wrap to create an informative plot. Summarize the findings from the graph.

ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = interest_rate)) + 
  geom_point() + 
  facet_wrap(~ grade, nrow = 2) +
  labs(title = "Interest Rate vs Loan Amount by Grade", x = "Loan Amount", y = "Interest Rate") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: This graph shows that higher grades are associated with lower interest rates, regardless of the amount of the loan.


  5. Use facet_grid to create an informative plot. Summarize the findings from the graph.

ggplot(data=loans_full_schema, mapping=aes(x = interest_rate, y = loan_amount)) + 
  geom_point() + 
  facet_grid(term ~ grade) +
  labs(title = "Loan Amount vs Interest Rate by Grade and Term", x = "Interest Rate", y = "Loan Amount") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: This graph shows that higher grades are associated with lower interest rates, regardless of the amount of the loan. In addition, higher grades have more short-term loans, while lower grades have more long-term loans.
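
The claim about term and grade is hard to judge from overlapping points alone, so a cross-tabulation can serve as a check. This sketch is an illustrative addition, assuming the openintro package is installed; `term_by_grade` is a name introduced here.

```r
library(openintro)  # provides the loans_full_schema data set

# Count loans for every combination of term (rows) and grade (columns).
term_by_grade <- table(term  = loans_full_schema$term,
                       grade = loans_full_schema$grade)
term_by_grade

# Share of each term within a grade; each column sums to 1.
round(prop.table(term_by_grade, margin = 2), 2)
```

Comparing the column proportions across grades shows whether the share of long-term loans changes with grade.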


  6. Present a question of your own interest related to this data set. Answer your question with analysis or visualization.


    Question: How does homeownership status affect the amount of the loan?

ggplot(data=loans_full_schema, mapping=aes(x = loan_amount, y = homeownership)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Loan Amount by Homeownership Status", x = "The amount of the loan", y = "The ownership status") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: This graph shows that people who rent their residence tend to apply for smaller loans than people who own their residence, while people with a mortgage tend to apply for larger loans than people who own their residence.


3. Data visualization and exploration tasks with ames data set

The ames data set is available through the openintro package in R.

  1. Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.


    Answer: The ames data set is about housing prices in Ames, Iowa from 2006 to 2010. The data set has 2930 samples and 82 features, which cover the location, property specifications, condition, and sale information of the houses.
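
The size and time span stated above can be checked directly against the data. This is a minimal sketch, assuming the openintro package is installed and that the sale year is stored in a column named `Yr.Sold` (an assumption about the column name).

```r
library(openintro)  # provides the ames data set

dim(ames)            # number of rows (houses) and columns (features)
range(ames$Yr.Sold)  # earliest and latest sale year; column name assumed
```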


  2. Use a plot to analyze how area correlates with price. Summarize your finding from the graph.

ggplot(ames, aes(x = area, y = price)) +
  geom_point() + geom_smooth() +
  labs(title = "Correlation Between Living Area and Sale Price", x = "Above Grade Living Area (sq ft)", y = "Sale Price (USD)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'


Answer: The graph shows that the larger the house, the higher the sale price. This pattern holds only up to a point and breaks down for the largest houses, probably because there is little data in that range.


  3. Use a plot to analyze how Bldg.Type correlates with price. Explain the meaning of each label for Bldg.Type and summarize your finding from the graph.

ggplot(data=ames, mapping=aes(x = Bldg.Type, y = price)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot() +
  labs(title = "Correlation Between Bldg.Type and Sale Price", x = "Type of dwelling", y = "Sale Price (USD)") + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The Bldg.Type labels have the following meanings:

    • 1Fam: Single-family detached house.
    • 2fmCon: Two-family conversion (originally built as a one-family dwelling).
    • Duplex: A building divided into two separate dwellings.
    • Twnhs: Townhouse inside unit.
    • TwnhsE: Townhouse end unit.

The graph shows that the sale price does not differ much between the different types of dwelling.


  4. Use a plot to analyze how Bldg.Type and area altogether correlate with price. Summarize your finding from the graph.

ggplot(data=ames, mapping=aes(x = area, y = price)) + 
  geom_point() + 
  facet_wrap(~ Bldg.Type, nrow = 2) +
  labs(title = "Correlation Between Living Area and Sale Price by Bldg.Type", x = "Above Grade Living Area (sq ft)", y = "Sale Price (USD)") +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), face ='bold'), axis.title = element_text(size = rel(1.2), face ='bold'))


Answer: The graph shows that sale price increases with living area for most types of dwelling, except for the 2fmCon (two-family) and Duplex types.
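
The faceted scatterplots can be complemented by a correlation computed separately for each dwelling type, together with group sizes, since a weak pattern in a small group may simply reflect too few houses. This sketch is an illustrative addition, assuming the openintro package is installed; `cor_by_type` and `n_by_type` are names introduced here.

```r
library(openintro)  # provides the ames data set

# Split the houses by dwelling type, then correlate area with price in each group.
cor_by_type <- sapply(split(ames, ames$Bldg.Type),
                      function(d) cor(d$area, d$price))

# Number of houses of each type, for context when reading the correlations.
n_by_type <- table(ames$Bldg.Type)

round(cor_by_type, 2)
n_by_type
```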