1. Data visualization and exploration tasks with the gpa data set

The gpa data set is available through the openintro package in R. Answer the following questions with an appropriate graph. Summarize your findings in plain text for each graph to answer the question.

1. By doing your own research, give the precise meaning of each variable.

Answer: gpa: The student’s grade point average (GPA)

studyweek: Students’ study hours per week

sleepnight: Students’ sleep hours per night

out: The number of nights students go out per week (average)

gender: students’ gender (male/female)

2. Visualize the relationship between studyweek and gpa. What does your graph indicate?

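# Load the plotting library and the package that provides the data sets
library(ggplot2)
library(openintro)

ggplot(gpa, aes(x = studyweek, y = gpa)) +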
  geom_density2d_filled() +
  labs(title = "Students GPA from study hours per week", 
       x = "Study Hours per week", 
       y = "GPA") +
  ylim(2.5,4) +
    theme(plot.title = element_text(hjust = 0.5, size = 20),
          text = element_text(size = 15))

Comments: Almost all students have a GPA above 3, and many study between 10 and 20 hours per week. Based on the graph, the correlation between weekly study hours and GPA is weak. One possible explanation is that most students at Duke University perform well regardless of how long they study; other factors not captured in the data may also play a role.
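The visual impression of a weak relationship can be checked numerically. The following is a minimal sketch, assuming the openintro package is installed; the variable names r_pearson and r_spearman are introduced here for illustration only and are not part of the original analysis.

```r
library(openintro)

# Pearson (linear) and Spearman (rank-based) correlations between
# weekly study hours and GPA. use = "complete.obs" drops any rows
# with missing values before computing the coefficient.
r_pearson  <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs")
r_spearman <- cor(gpa$studyweek, gpa$gpa, use = "complete.obs",
                  method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

A coefficient close to 0 would be consistent with the weak relationship seen in the density plot; the same call with out in place of studyweek applies to question 3.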

3. Visualize the relationship between out and gpa. What does your graph indicate?

ggplot(gpa, aes(x = out, y = gpa)) +
  geom_density2d_filled() +
  labs(title = "Graph of Students social time and GPA", 
       x = "Average nights students go out", 
       y = "GPA") +
  ylim(2.5,4) +
    theme(plot.title = element_text(hjust = 0.5, size = 20),
          text = element_text(size = 15))

Comments: Most students go out about 1.5 to 2 nights per week on average and have a GPA of around 3.5 or higher. Some students go out about 3 nights per week and still have a GPA close to 4.0. The correlation between out and gpa is again weak, so it is hard to predict a student’s GPA from the number of nights they go out per week.

4. Visualize the relationship between out and sleepnight. What does your graph indicate?

ggplot(gpa, aes(x = out, y = sleepnight)) +
  geom_density2d_filled() +
  labs(title = "Graph of Students social time and sleep hours per night", 
       x = "Average nights students go out", 
       y = "Sleep hours per night") +
    theme(plot.title = element_text(hjust = 0.5, size = 15),
          text = element_text(size = 12))

ggplot(gpa, aes(x = out, y = sleepnight)) +
  geom_point(position = "jitter") +
  geom_smooth() +
  labs(title = "Graph of Students social time and sleep hours per night", 
       x = "Average nights students go out", 
       y = "Sleep hours per night") +
    theme(plot.title = element_text(hjust = 0.5, size = 15),
          text = element_text(size = 12))

Comments: The graphs indicate that the number of nights students go out correlates more clearly with sleep hours than it did with GPA. Both graphs show that students who go out more nights per week tend to sleep more hours per night. However, this relationship weakens once sleep reaches around 7.5 hours. Many students go out about 1.5 to 2 nights per week and sleep around 7 hours per night.

5. Visualize the relationship between gender and studyweek. What does your graph indicate?

ggplot(gpa, aes(studyweek, gender)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  labs(title = "Graph of Students' gender and study hours per week", 
       x = "Study hours per week", 
       y = "Gender") +
    theme(plot.title = element_text(hjust = 0.5, size = 15),
          text = element_text(size = 12))

Comments: The graph indicates that female students have a wider range of study hours and are more likely to study more hours per week than male students. Male students’ study hours are more concentrated and, as a group, lower than female students’ study hours.

6. Visualize the relationship between gender and out. What does your graph indicate?

ggplot(gpa, aes(out, gender)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  labs(title = "Graph of Students' gender and nights go out per week", 
       x = "Nights go out per week", 
       y = "Gender") +
    theme(plot.title = element_text(hjust = 0.5, size = 15),
          text = element_text(size = 12))

Comments: Based on the graph, male students spend more nights going out, and their values are fairly concentrated. Female students go out fewer nights on average, but their values are more spread out than the male students’.
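The box-plot readings for questions 5 and 6 can be backed up with group summaries. This is a sketch only, assuming the openintro package is installed; by_gender and spread_by_gender are illustrative names that do not appear in the original analysis.

```r
library(openintro)

# Median study hours and nights out for each gender
by_gender <- aggregate(cbind(studyweek, out) ~ gender,
                       data = gpa, FUN = median)

# Interquartile range of the same two variables, as a measure of spread
spread_by_gender <- aggregate(cbind(studyweek, out) ~ gender,
                              data = gpa, FUN = IQR)

print(by_gender)
print(spread_by_gender)
```

The medians correspond to the centre lines of the boxes and the interquartile ranges to the box widths, so the two tables should agree with what the plots show.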

2. Data visualization tasks with the loans_full_schema data set

Finish the following data visualization tasks using the full loans_full_schema data set (55 columns) in the openintro library. For each task, you need to summarize what you learn from the graph accurately and concisely.

1. Create a histogram of a numeric variable that you select and plot a density curve on top of the histogram. Carefully select bin numbers/sizes/boundaries to make the plot informative. What does this graph indicate?

ggplot(loans_full_schema, aes(annual_income)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, boundary = 0, fill = "skyblue", colour = "white") +
  geom_density(linewidth = 1.2, colour = "red", adjust = 30/8) +
  scale_x_continuous(labels = scales::dollar, limits = c(0, 300000)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Density plot of Annual income", 
       x = "Annual income (USD)",
       y = "Density") +
  theme(plot.title = element_text(hjust = 0.5, size = 20),
        text = element_text(size = 15))

Comments: The graph shows that many loan applicants have an annual income of around $50k. They may need the loan to buy a house or other expensive items that they cannot currently afford. It also shows that only a few applicants have very high incomes, while many have relatively low incomes.
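The shape described above (a peak at moderate incomes with a long right tail) can be summarized with a few quantiles. A minimal sketch, assuming the openintro package is installed; income_q is an illustrative name.

```r
library(openintro)

# Selected quantiles of annual income; na.rm = TRUE guards against
# missing values in the column.
income_q <- quantile(loans_full_schema$annual_income,
                     probs = c(0.10, 0.25, 0.50, 0.75, 0.90),
                     na.rm = TRUE)

print(income_q)
```

If the distance from the median to the 90th percentile is much larger than the distance from the 10th percentile to the median, the distribution is right-skewed, which matches the histogram.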

2. Create a graph to study the effect of a categorical/discrete variable on the distributions of a numeric variable. What does this graph indicate?

ggplot(loans_full_schema) +
  geom_boxplot(aes(homeownership, debt_to_income/100)) +
  scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
  labs(title = "Loan Data of debt-to-income and homeownership",
       x = "Homeownership",
       y = "Debt to income ratio") +
  theme(plot.title = element_text(hjust = 0.5, size = 15), 
        text = element_text(size = 12))

Comments: In general, homeownership does not appear to affect a person’s debt-to-income ratio. However, the mortgage group contains more people with a high debt-to-income ratio than the other groups.
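The comparison across homeownership groups can also be made with quartiles per group. This is a sketch under the assumption that the openintro package is installed; dti_by_home is an illustrative name.

```r
library(openintro)

# Quartiles of the debt-to-income ratio within each homeownership group.
# The formula interface of aggregate() drops rows with missing values
# by default, and extra arguments (probs) are passed on to quantile().
dti_by_home <- aggregate(debt_to_income ~ homeownership,
                         data = loans_full_schema,
                         FUN = quantile, probs = c(0.25, 0.50, 0.75))

print(dti_by_home)
```

Similar medians across groups would support the first sentence of the comment, while a larger upper quartile for one group would indicate more borrowers with a high ratio in that group.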

3. Create a bin heatmap (2d density plot) to study the relationship between two numeric variables that you select. Summarize the findings from the graph.

ggplot(loans_full_schema, aes(total_credit_limit, annual_income)) +
  geom_density2d_filled() +
  scale_x_continuous(name = "Total credit limit (USD)", labels = scales::dollar, limits = c(0, 500000)) +
  scale_y_continuous(name = "Annual Income (USD)", labels = scales::dollar, limits = c(0, 150000)) +
  labs(title = "2D density plot of annual income and total credit limit") +
  theme(plot.title = element_text(hjust = 0.5, size = 15), 
        text = element_text(size = 12))

Comments: Based on the plot, annual income and total credit limit are positively correlated: in general, the more a person earns, the higher their credit limit. Many people have an annual income of around $30k and a total credit limit of about the same amount.

4. Use facet_wrap to create an informative plot. Summarize the findings from the graph.

ggplot(data = loans_full_schema) + 
  geom_bin2d(mapping = aes(x = total_credit_limit, y = debt_to_income/100)) +
  facet_wrap(~ homeownership, nrow = 3) +
  labs(title = "Lending Club loans data by homeownership", 
       x = "Total credit limit (USD)", 
       y = "Debt-to-income ratio") + 
  scale_x_continuous(labels = scales::dollar, limits = c(0, 500000)) +
  scale_y_continuous(labels = scales::percent, limits = c(0,1)) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = 15), 
        axis.text = element_text(size = 10))

Comments: Surprisingly, some people with a mortgage have a higher total credit limit than the other groups while carrying a fairly heavy debt-to-income ratio. Owners and renters have similar total credit limits, but more renters have a large debt-to-income ratio.

5. Use facet_grid to create an informative plot. Summarize the findings from the graph.

ggplot(data = loans_full_schema) + 
  geom_bin_2d(mapping = aes(x = loan_amount, y = interest_rate/100), bins = 20) +
  facet_grid(verified_income ~ grade) +
  labs(title = "Lending Club loans data", 
       x = "Loan amount (USD)", 
       y = "Interest rate") + 
  scale_x_continuous(labels = scales::dollar) +
  scale_y_continuous(labels = scales::percent, limits = c(0,0.5)) +
  theme(plot.title = element_text(hjust = 0.5, size = 20, margin = margin(15,15,15,15)), 
        text = element_text(size = 15),
        axis.text.x = element_text(angle = 60, size = 10, hjust = 0.5, vjust = 0.5))

Comments: The better the loan grade (A is the best), the lower the interest rate. There are fewer income-verified applicants than applicants in the other categories. Income verification status also influences the loan amount an applicant can get.
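The three observations above can each be checked with a one-line summary. A minimal sketch, assuming the openintro package is installed; verified_counts, rate_by_grade, and amount_by_verified are illustrative names.

```r
library(openintro)

# Number of loans in each income verification category
verified_counts <- table(loans_full_schema$verified_income)

# Median interest rate (in percent) for each loan grade
rate_by_grade <- tapply(loans_full_schema$interest_rate,
                        loans_full_schema$grade,
                        median, na.rm = TRUE)

# Median loan amount for each income verification category
amount_by_verified <- tapply(loans_full_schema$loan_amount,
                             loans_full_schema$verified_income,
                             median, na.rm = TRUE)

print(verified_counts)
print(rate_by_grade)
print(amount_by_verified)
```

These medians are descriptive only; they show how the groups differ but do not establish that verification status causes the difference in loan amount.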

3. Data visualization and exploration tasks with the ames data set

The ames data set is available through the openintro package in R.

1. Write an introductory paragraph to the data set which provides the basic information - what the data set is about; the number of samples and features; the scope that the features cover.

Answer: The ames data set contains data on individual residential properties sold in Ames, Iowa, from 2006 to 2010. It has 2930 rows and 82 variables, comprising a mix of nominal, ordinal, discrete, and continuous data types. The data set includes detailed measurements of each house, such as total living area, basement square footage, location, garage size, and porch types, along with quality ratings for each house. It also contains information on heating and cooling systems, electrical systems, and the presence of amenities such as fireplaces, pools, and masonry veneers.

2. Use a plot to analyze how area correlates with price. Summarize your finding from the graph.

ggplot(ames, aes(x = area, y = price)) +
  geom_density2d_filled() +
  labs(title = "2D density plot for house living area and price", 
       x = "House living (ground) area (square feet)", 
       y = "Price (USD)") +
  scale_y_continuous(labels = scales::dollar, limits = c(0,400000)) +
  xlim(0,3000) +
    theme(plot.title = element_text(hjust = 0.5, size = 20),
          text = element_text(size = 15))

Comments: Based on the graph, price is strongly correlated with ground living area: the larger the area, the higher the price. Most houses in the data have an area of around 1000 square feet.
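The strength of the relationship between area and price can be quantified with a correlation coefficient and a simple linear fit. This is a sketch only, assuming the openintro package is installed; r_area_price, fit, and slope are illustrative names, and a straight-line model is an assumption made here for simplicity.

```r
library(openintro)

# Pearson correlation between living area and sale price
r_area_price <- cor(ames$area, ames$price, use = "complete.obs")

# Simple linear regression of price on area; the slope estimates the
# average change in price (USD) per additional square foot.
fit   <- lm(price ~ area, data = ames)
slope <- unname(coef(fit)["area"])

print(r_area_price)
print(slope)
```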

3. Use a plot to analyze how Bldg.Type correlates with price. Explain the meaning of each label for Bldg.Type and summarize your finding from the graph.

ggplot(ames, aes(x = Bldg.Type, y = price)) +
  geom_boxplot() +
  stat_boxplot(geom = 'errorbar', width = 0.5) + 
  labs(title = "Box plot for dwelling type and price", 
       x = "Type of dwelling", 
       y = "Price (USD)") +
  scale_y_continuous(labels = scales::dollar, limits = c(0,400000)) +
    theme(plot.title = element_text(hjust = 0.5, size = 20),
          text = element_text(size = 15))

Comments: In Bldg.Type, 1Fam means single-family detached; 2fmCon means two-family conversion (originally built as a one-family dwelling); Duplex means duplex; TwnhsE means townhouse end unit; Twnhs means townhouse inside unit. Single-family detached dwellings have a wide price range, and most of them are more expensive than the other types. Townhouse end units are similar to single-family detached dwellings but share a wall with neighbors, and they are also priced fairly high; their lowest price is higher than the lowest price of single-family detached dwellings. 2fmCon, Duplex, and Twnhs dwellings have narrow price ranges and are cheaper than most single-family detached dwellings and townhouse end units.

4. Use a plot to analyze how Bldg.Type and area altogether correlates with price. Summarize your finding from the graph.

ggplot(data = ames, aes(x = area, y = price)) + 
  geom_bin_2d() +
  geom_smooth(color = "red") +
  facet_wrap(~ Bldg.Type, nrow = 2) +
  labs(title = "2D bin plot of the effect of house type and area on price", 
       x = "House Area (Square Feet)", 
       y = "Price (USD)") + 
  scale_y_continuous(labels = scales::dollar, limits = c(0, 600000)) +
  xlim(0, 4000) +
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = 15), 
        axis.text = element_text(size = 10))

Comments: Townhouse end units show the fastest price increase with area compared with the other types, if we ignore the outliers with an area larger than 2000 square feet. Single-family detached dwellings show the second-fastest price increase.

5. (Bonus - 5 Points) You may need to self-study to fulfill this task: use a plot to study how area and Year.Built together correlates with price. Summarize your finding from the graph.
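One possible approach to this task, offered as a hedged sketch rather than a definitive answer, is a scatter plot of price against area with the points colored by Year.Built. It assumes the ggplot2 and openintro packages are installed; the object name p and the choice of a viridis color scale are illustrative assumptions, and no findings are claimed here.

```r
library(ggplot2)
library(openintro)

# Scatter plot of price against area, with a continuous colour scale
# showing the year each house was built.
p <- ggplot(ames, aes(x = area, y = price, colour = Year.Built)) +
  geom_point(alpha = 0.5) +
  scale_colour_viridis_c(name = "Year built") +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "House price by living area and year built",
       x = "House living (ground) area (square feet)",
       y = "Price (USD)") +
  theme(plot.title = element_text(hjust = 0.5, size = 15),
        text = element_text(size = 12))

print(p)
```

Mapping the third variable to color keeps both numeric predictors on one panel; an alternative would be to bin Year.Built into periods and use facet_wrap, as in the earlier tasks.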