#4.9 For the sahp dataset, answer the following questions.

### Q1 Create a scatterplot between bedroom and bathroom.

#install.packages("ggplot2")
library(ggplot2)
#install.packages("r02pro")
library(r02pro)
ggplot(data= sahp) + geom_point(mapping = aes(x = bedroom, y = bathroom))

### Q2 What problem do you think this plot has? Provide two different plots to address this issue.

Jittering

ggplot(data= sahp) + geom_jitter(mapping = aes(x = bedroom, y = bathroom))

Count plot

ggplot(data= sahp) + geom_count(mapping = aes(x = bedroom, y = bathroom))

The sahp data set has 165 observations, each with a bedroom and a bathroom value. The scatterplot, however, shows only a few points, which means that many observations overlap (overplotting). Jittering and count plots both address this problem.
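To see how much overlap a scatterplot of two integer-valued variables can hide, one can count the distinct (x, y) pairs. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the sahp data) and base R only:

```r
# Hypothetical toy data: integer-valued pairs with many repeats
toy <- data.frame(
  bedroom  = c(2, 3, 3, 3, 4, 4, 2, 3),
  bathroom = c(1, 2, 2, 2, 2, 2, 1, 1)
)

n_obs      <- nrow(toy)          # number of observations
n_distinct <- nrow(unique(toy))  # number of distinct points a scatterplot can show

# Count how many observations share each (bedroom, bathroom) pair;
# a count plot encodes these counts as point size
pair_counts <- as.data.frame(table(toy), stringsAsFactors = FALSE)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]

n_obs
n_distinct
pair_counts
```

When `n_distinct` is much smaller than `n_obs`, a plain scatterplot understates how much data sits behind each point.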

#4.10

Use the sahp data set to answer the following questions.

###1) Create a bar chart to represent the distribution of the number of available car spaces in the garage (gar_car).

ggplot(data = sahp) + geom_bar(mapping = aes( x = gar_car))

## Warning: Removed 1 rows containing non-finite values (stat_count).

###2) For the bar chart in Q1, divide each bar into sub-bars according to whether oa_qual > 5. What findings do you have in this plot?

ggplot(data = sahp) + geom_bar(mapping = aes( x = gar_car , fill = oa_qual >5))

## Warning: Removed 1 rows containing non-finite values (stat_count).

The bar chart shows that the larger the garage capacity, the higher the proportion of houses with above-average overall material and finish quality (oa_qual > 5). All houses with the largest garage capacity (4 cars) have oa_qual > 5.

###3) For the bar chart in Q2, change the position of the sub-bar to reflect the proportion for oa_qual > 5 for each value of gar_car.

ggplot(data = sahp) + geom_bar(mapping = aes( x = gar_car, fill = oa_qual >5), position = "fill")

## Warning: Removed 1 rows containing non-finite values (stat_count).
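The proportions that a filled bar chart displays can also be computed directly. A minimal sketch in base R, assuming small made-up vectors that stand in for gar_car and oa_qual (the values are hypothetical, not the sahp data):

```r
# Hypothetical toy data standing in for gar_car and oa_qual
gar_car <- c(1, 1, 2, 2, 2, 2, 3, 3)
oa_qual <- c(4, 6, 5, 6, 7, 4, 8, 7)

# Counts per garage size, split by whether oa_qual > 5 (the stacked bars)
counts <- table(gar_car, above_5 = oa_qual > 5)

# Row-wise proportions: each garage size sums to 1,
# which is what position = "fill" shows on the y-axis
props <- prop.table(counts, margin = 1)
props
```

Each row of `props` corresponds to one bar, and the `TRUE` column gives the share of houses with oa_qual > 5 for that garage size.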

#4.11

Use the sahp data set to answer the following questions.

###1) Create a scatterplot between lot_area (x-axis) and sale_price (y-axis), with the breaks on the x-axis being an equally-spaced sequence from 0 to 40000 with increment 5000, and the breaks on the y-axis being (0, 200, 300, 550).

ggplot(data = sahp) + geom_point(mapping = aes(x = lot_area, y = sale_price)) + scale_x_continuous(breaks = seq(from = 0, to = 40000, by = 5000)) + scale_y_continuous(breaks = c(0, 200, 300, 550))

## Warning: Removed 1 rows containing missing values (geom_point).
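The two kinds of break vectors asked for here are built differently. A short base R sketch (the names `x_breaks` and `y_breaks` are chosen only for illustration):

```r
# Equally spaced breaks: 0, 5000, ..., 40000
x_breaks <- seq(from = 0, to = 40000, by = 5000)

# Irregular breaks are given directly as a numeric vector
y_breaks <- c(0, 200, 300, 550)

# Caution: seq() applied to a single vector returns its index positions,
# not the values themselves
seq(y_breaks)

length(x_breaks)
```

Because `seq(y_breaks)` yields 1, 2, 3, 4, irregular breaks should be passed to `breaks =` as a plain vector built with `c()`.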

###2) For the plot in Q1, create a zoom-in plot where lot_area is between 10000 and 15000, and sale_price is between 200 and 300.

ggplot(data = sahp) + geom_point(mapping = aes(x = lot_area, y = sale_price)) +
  coord_cartesian(xlim = c(10000, 15000), ylim = c(200, 300))

## Warning: Removed 1 rows containing missing values (geom_point).

###3) For the plot in Q1, create a corresponding log-log plot, where both the x-axis and y-axis are in log-scale.

ggplot(data = sahp) + geom_point(mapping = aes(x = lot_area, y = sale_price)) +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

## Warning: Removed 1 rows containing missing values (geom_point).

#4.12 Use the sahp data set to answer the following questions.

###1) Create histograms on the living area (liv_area) for each of the following settings: use 10 bins; set the binwidth to be 300; set the bins manually to an equally-spaced sequence from 0 to 3500 with increment 500.

Use 10 bins

ggplot(data = sahp) + geom_histogram(mapping = aes(x = liv_area), bins = 10)

Set the binwidth to be 300

ggplot(data = sahp) + geom_histogram(mapping = aes(x = liv_area), binwidth = 300)

Set the bins manually to an equally-spaced sequence from 0 to 3500 with increment 500.

ggplot(data = sahp) + geom_histogram(mapping = aes(x = liv_area), breaks = seq(from = 0, to = 3500, by = 500))
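To check what a histogram with manual breaks counts in each bin, the same bin edges can be passed to base R's `hist()` without drawing anything. A minimal sketch, assuming made-up living-area values (`area` is hypothetical, not the sahp data):

```r
# Hypothetical living-area values (not the sahp data)
area <- c(480, 900, 1100, 1450, 1500, 1710, 2100, 2600, 3400)

# Explicit bin edges: 0, 500, ..., 3500 (7 bins)
edges <- seq(from = 0, to = 3500, by = 500)

# hist(..., plot = FALSE) returns the bin counts without drawing;
# bins are right-closed by default, e.g. (1000, 1500]
h <- hist(area, breaks = edges, plot = FALSE)
h$counts
```

With right-closed bins, the value 1500 falls into the (1000, 1500] bin rather than the next one. As far as I know, geom_histogram() also closes bins on the right by default, but that is worth confirming for the installed ggplot2 version.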

###2) Create histograms on the living area (liv_area) with 5 bins, and show the information of different kit_qual values in each bar. What conclusions can you draw from this plot?

ggplot(data = sahp) + geom_histogram(mapping = aes(x = liv_area, fill = kit_qual),bins = 5)

The histogram shows that the smaller the living area, the higher the proportion of houses with fair kitchen quality. In the bin with the largest living area, every house has excellent kitchen quality. As living area increases, good and excellent kitchens make up a larger share of each bar.

#4.13 Use the sahp data set to answer the following questions.

###1) Create density plot on the living area (liv_area) with dashed lines and different colors for different values of kit_qual. What conclusions can you draw from the plot?

ggplot(data = remove_missing(sahp, vars = "oa_qual")) +  geom_density(aes(x = liv_area, color = kit_qual ), lty = 2)

## Warning: Removed 1 rows containing missing values.

From the plot, we can see that houses with “fair” kitchen quality are concentrated at the smallest living areas, while houses with “excellent” kitchen quality are concentrated at the largest living areas.

###2) Try to create density plot for kit_qual. Do you think this plot is informative? If not, create a plot that captures the distribution of kit_qual.

ggplot(data = sahp) + geom_density(mapping = aes (x = kit_qual ))

ggplot(data = sahp) + geom_bar(mapping = aes (x = kit_qual ))

This density plot is not informative, because a density plot describes the distribution of a continuous variable, while kit_qual is a categorical variable with four quality levels. A bar chart, which shows the count of each level, is much more suitable.

#4.14

###1) Create a boxplot on the living area (liv_area) and find out the following values on the boxplot using R codes.

library(r02pro)
#install.packages("tidyverse")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

liv_area <- na.omit(sahp$liv_area)
boxplot(liv_area)

solid line in the middle

median(liv_area)

## [1] 1450

lower hinge

quantile(liv_area, 0.25)

##  25% 
## 1116

upper hinge

quantile(liv_area,0.75)

##  75% 
## 1707

lower fence (Q1 - 1.5 * IQR), the furthest the lower whisker can reach

quantile(liv_area, 0.25) - 1.5 * IQR(liv_area)

##   25% 
## 229.5

upper fence (Q3 + 1.5 * IQR), the furthest the upper whisker can reach

quantile(liv_area, 0.75) + 1.5 * IQR(liv_area)

##    75% 
## 2593.5

lower whisker

max(min(liv_area), quantile(liv_area, 0.25) - 1.5*IQR(liv_area))

## [1] 438

upper whisker

min(max(liv_area), quantile(liv_area, 0.75) + 1.5*IQR(liv_area))

## [1] 2593.5
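A boxplot whisker ends at the most extreme observation that still lies within 1.5 * IQR of the box, so it does not always equal the fence value. A minimal sketch on made-up numbers (the vector `x` is hypothetical, not the sahp data):

```r
# Hypothetical values with one large outlier (not the sahp data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)

# Fences: the furthest a whisker is allowed to reach
lower_fence <- q1 - 1.5 * (q3 - q1)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Whisker ends: the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

c(upper_fence = upper_fence, upper_whisker = upper_whisker)
```

Here the upper fence is 14.5, but the whisker stops at 9, the largest observation below the fence. Taking `min(max(x), upper_fence)` would return 14.5, which is not a data point. When there are outliers beyond a fence, the subsetting form above gives the value that is actually drawn.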

###2) Create a boxplot to compare the distribution of living area (liv_area) for different values of kitchen quality (kit_qual). What conclusions can you draw from the plot?

ggplot(data = sahp) + geom_boxplot(aes(x = kit_qual, y = liv_area))

From the boxplot, we can see that houses with better kitchen quality (from fair and average to good and excellent) tend to have a larger living area.

###3) For the boxplot in Q2, for different kit_qual values, add the following three points to the plot: the minimum liv_area value (in red), the maximum liv_area value (in blue), and the mean liv_area value (in green).

ggplot(data = sahp, aes(x = kit_qual, y = liv_area)) + geom_boxplot() +
  geom_point(stat = "summary", fun = "mean", color = "green") +
  geom_point(stat = "summary", fun = "max", color = "blue") +
  geom_point(stat = "summary", fun = "min", color = "red")

###4) For the boxplot in Q2, order it by the mean lot_area value in ascending order.

ggplot(data = remove_missing (sahp, vars = "liv_area")) + geom_boxplot(aes(x = fct_reorder(kit_qual, liv_area, mean), y = liv_area))
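Reordering factor levels by a group summary can be checked outside of a plot. The sketch below uses base R's `reorder()`, which plays a similar role to forcats' `fct_reorder()`, on made-up vectors (the values are hypothetical, not the sahp data):

```r
# Hypothetical toy data standing in for kit_qual and liv_area
kit_qual <- c("Good", "Fair", "Excellent", "Fair", "Good", "Average", "Excellent")
liv_area <- c(1600, 900, 2400, 1000, 1500, 1300, 2600)

# Mean living area per kitchen-quality group
group_means <- tapply(liv_area, kit_qual, mean)
group_means

# reorder() sorts the factor levels by a summary of a second variable,
# here the group mean in ascending order
ordered_qual <- reorder(factor(kit_qual), liv_area, FUN = mean)
levels(ordered_qual)
```

The level order of `ordered_qual` is the order in which the boxes would appear along the x-axis.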

###5) For the boxplot in Q2, use different colors to represent whether oa_qual is larger than 5.

ggplot( data = na.omit(sahp)) + geom_boxplot(aes(x = kit_qual, y = liv_area,color = oa_qual > 5))

#4.17

###1) Create a boxplot for liv_area and assign it to the name my_boxplot.

my_boxplot <- ggplot(data = na.omit(sahp)) + geom_boxplot(mapping = aes(y = liv_area))
my_boxplot

###2) Using my_boxplot in Q1, generate separate plots according to the value of bedroom. What conclusions can you draw from the plot?

my_boxplot + facet_wrap(vars(bedroom))

From the plots, we can see that houses with more bedrooms tend to have a larger living area.

###3) Using my_boxplot in Q1, generate a matrix of subplots with kitchen quality (kit_qual) as the row and whether the house has central ac (central_air) as the column. Do you see a subplot for all combinations of kit_qual and central_air? If not, explain the reason.

my_boxplot + facet_grid(rows = vars(kit_qual), cols = vars(central_air), scales = "free")

From the result, no data appear for the combination of no central air and “excellent” kitchen quality, because no house in the data set has an excellent kitchen without central air. The better the kitchen quality, the higher the rate of central air-conditioning installation, and “Excellent” is the best level of kit_qual.
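Which combinations of two categorical variables never occur can be confirmed with a cross-tabulation. A minimal sketch in base R, assuming made-up vectors (the values, including the "Y"/"N" coding of central_air, are hypothetical and not taken from the sahp data):

```r
# Hypothetical toy data (not the sahp data)
kit_qual    <- c("Fair", "Average", "Good", "Excellent",
                 "Average", "Fair", "Excellent", "Good")
central_air <- c("N", "Y", "Y", "Y", "N", "Y", "Y", "N")

# Cross-tabulate the two variables; a zero cell is an empty facet
combos <- table(kit_qual, central_air)
combos

# List the combinations that never occur
combo_df <- as.data.frame(combos, stringsAsFactors = FALSE)
missing_combos <- combo_df[combo_df$Freq == 0, ]
missing_combos
```

In this toy example the only zero cell is excellent kitchen quality without central air, which mirrors the kind of gap described above.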

HW 6

Siqi Huang

10/15/2021
