STAT 155 Topic 1 In-class Activity

Load some necessary packages:

library(ggplot2)

library(dplyr)

Question 1

The file homes.csv contains data on the prices and other attributes for 1728 homes in upstate New York.

We can read this data into R using the following command:

homes = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/homes.csv')

There are 16 variables in this dataset, but we’ll focus on the following ones for now:

Price: home price (dollars)

Living.Area: size of living area of home (square feet)

Pct.College: percent of neighborhood residents with a college degree

Central.Air: an indicator of whether or not the house has central air conditioning

New.Construct: an indicator of whether or not the house is new construction

(a) Make a boxplot (or violin plot) and histogram (or density plot) of the Price variable.

# Enter your R commands below:
ggplot(data=homes,aes(x=Price))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=homes,aes(y=Price))+geom_boxplot()
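The `stat_bin()` message above asks for an explicit bin width. A minimal sketch of one way to set it, using a small made-up data frame (`toy_homes` and the $50,000 bin width are assumptions for illustration, not values from the activity):

```r
library(ggplot2)

# Toy stand-in for the homes data (made-up prices, for illustration only)
toy_homes <- data.frame(
  Price = c(85000, 120000, 150000, 175000, 189900,
            210000, 240000, 300000, 450000, 775000)
)

# Setting binwidth explicitly (an assumed $50,000 per bar) avoids the
# default of 30 bins and gives each bar a meaningful width
p <- ggplot(data = toy_homes, aes(x = Price)) +
  geom_histogram(binwidth = 50000)
p
```

With the real data, the same `binwidth` argument can be added to the `geom_histogram()` call above; trying a few widths helps show whether the right tail is an artifact of the binning.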

(b) Your graph from (a) should display some typical indications of what is called right-skewed data, or a right-skewed distribution. Briefly describe in your own words some of the distinguishing characteristics of right-skewed data from a boxplot, violin plot, histogram or density plot.

Boxplot: in right-skewed data the box (the middle 50% of the data) sits at the lower values, the upper whisker is longer than the lower one, and there are many high outliers.

Histogram: the bulk of the data (the “bump”) sits on the left, with a long tail stretching to the right.

(c) Find the mean and median sale price of the homes in this data.

# Enter your R commands below:
homes %>% summarize (median(Price))

##   median(Price)
## 1        189900

homes %>% summarize (mean(Price))

##   mean(Price)
## 1    211966.7

(d) If the price distribution were “symmetric” instead of “skewed”, what relation would we observe between the mean and median?

If the data were symmetric, the mean and median would be approximately equal. In skewed data such as this, they differ: here the mean (211,966.7) is higher than the median (189,900).

(e) How does right-skewness manifest itself in the mean and median of the Price variable? How would left-skewness manifest itself in the mean and median?

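When the data is right-skewed, there is a long tail to the right, so a few extremely high prices pull the mean upward while barely moving the median. Here the mean (211,966.7) sits above the median (189,900). Left-skewness would do the opposite: a long tail of low values would pull the mean below the median.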
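The effect of a tail on the mean and median can be seen with three tiny made-up vectors (the numbers are invented for illustration and are not from the homes data):

```r
# Three toy vectors that share the same median but have different tails
symmetric    <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 2, 3, 4, 50)   # one extreme high value
left_skewed  <- c(-44, 2, 3, 4, 5)  # one extreme low value

mean(symmetric);    median(symmetric)     # 3 and 3
mean(right_skewed); median(right_skewed)  # 12 and 3: mean pulled up
mean(left_skewed);  median(left_skewed)   # -6 and 3: mean pulled down
```

Only one value changes in each case, yet the mean moves a lot while the median stays at 3.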

(f) Find the 2.5th and 97.5th quantile of Price. This is called a 95% coverage interval.

homes%>%summarize(quantile(Price,0.025))

##   quantile(Price, 0.025)
## 1                  85000

homes%>%summarize(quantile(Price,0.975))

##   quantile(Price, 0.975)
## 1               453421.4

2.5th quantile: 85000; 97.5th quantile: 453421.4

(g) Provide an interpretation of the interval from the 2.5th to the 97.5th quantile that you obtained in (f).

About 95% of the homes in this data sold for between $85,000 and $453,421.40; about 2.5% sold for less and about 2.5% sold for more.
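Both ends of a coverage interval can be computed in one call by passing a vector of probabilities to `quantile()`. A small sketch on a made-up vector (`x` is just the integers 1 to 1000, not the Price data):

```r
# Toy data: the integers 1 through 1000
x <- 1:1000

# Both quantiles at once; the result is a named vector of length 2
interval <- quantile(x, probs = c(0.025, 0.975))
interval

# Fraction of observations that fall inside the interval
inside <- mean(x >= interval[1] & x <= interval[2])
inside  # 0.95

# For the homes data the analogous call would be:
# quantile(homes$Price, probs = c(0.025, 0.975))
```

Checking the fraction inside the interval is a quick way to confirm the "95% coverage" reading.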

(h) Find the variance of the Price variable.

var(homes$Price)

## [1] 9690707465

9690707465

(i) The variance and the standard deviation both measure spread. What is the relationship between variance and standard deviation?

the standard deviation is the square root of the variance

(j) What is the advantage of using the standard deviation to measure spread instead of the variance?

Variance is in squared units (here, squared dollars), which are hard to interpret. The standard deviation is the square root of the variance, so it is in the same units as the data itself (dollars).
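As a quick check of the units argument, the variance reported in (h) can be converted to a standard deviation by hand. The vector `x` below is made up and only confirms that `sd()` and `sqrt(var())` agree:

```r
# Variance of Price reported in part (h), in squared dollars
price_var <- 9690707465

# Square root brings it back to dollars
price_sd <- sqrt(price_var)
price_sd  # about 98441

# On any numeric vector, sd() is the square root of var()
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
all.equal(sd(x), sqrt(var(x)))  # TRUE
```

A spread of roughly $98,000 is directly comparable to the prices themselves, which is not true of a number near 9.7 billion.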

(k) Central.Air and New.Construct are both categorical variables. Their values, however, were recorded in numerical fashion (i.e., 0 and 1). We need to make sure that R treats these variables as categorical rather than quantitative. There are many ways to do this! Below, we accomplish this by creating 2 new variables, Central.Air.Cat and New.Construct.Cat, using the `mutate` function:

homes = homes %>%
  mutate( Central.Air.Cat = case_when( Central.Air == 0 ~ "No" ,
                                       Central.Air == 1 ~ "Yes"
                                       ) ,
         New.Construct.Cat = case_when( New.Construct == 0 ~ "No" ,
                                        New.Construct == 1 ~ "Yes"
                                       )
         )

Try to describe in your own words what the above R commands are doing.

The R commands use mutate to add two new columns to homes, Central.Air.Cat and New.Construct.Cat. In each, case_when recodes the numeric indicator so that 0 becomes "No" and 1 becomes "Yes", which makes R treat the variable as categorical.

(l) Make side-by-side boxplots of the prices by whether the home has central air conditioning.

ggplot(data=homes,aes(x=Central.Air.Cat,y=Price))+geom_boxplot()

(m) Does it seem the presence of central air conditioning is a good explanatory variable for modeling home prices? Briefly explain your reasoning.

No, because it does not appear that the presence of central air significantly increases or decreases the price of a home.

(n) Now, try to assess which of the Living.Area and Pct.College variables is a better predictor of home prices by creating appropriate graphs.

ggplot(data=homes,aes(x=Pct.College,y=Price))+geom_point()

ggplot(data=homes,aes(x=Living.Area,y=Price))+geom_point()

(o) Do you think that new construction homes tend to be larger or smaller than homes that are not new construction? Try to answer this question by creating an appropriate graph.

ggplot(data=homes,aes(x=Living.Area,y=New.Construct.Cat))+geom_boxplot()

(p) To further support your answer to (o), calculate some stratified numerical summaries (i.e., summaries of a quantitative variable within each of the categories of a categorical variable).

There is some R code below to help you; delete the # at the start of each line below, and complete the missing R code portions by replacing the ??? with your code:

homes %>%
  group_by(New.Construct.Cat) %>%
  summarize(mean(Living.Area))

## # A tibble: 2 × 2
##   New.Construct.Cat `mean(Living.Area)`
##   <chr>                           <dbl>
## 1 No                              1718.
## 2 Yes                             2497.
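The same `group_by()` and `summarize()` pattern can return several named summaries at once, which makes the output easier to read. A minimal sketch on a made-up data frame (`toy_homes` and the column names `n_homes`, `mean_area`, `median_area` and `sd_area` are assumptions for illustration):

```r
library(dplyr)

# Toy stand-in for the homes data (made-up living areas)
toy_homes <- data.frame(
  New.Construct.Cat = c("No", "No", "No", "Yes", "Yes"),
  Living.Area       = c(1200, 1600, 2000, 2400, 2600)
)

# One row per category, with a count and three summaries of Living.Area
size_summary <- toy_homes %>%
  group_by(New.Construct.Cat) %>%
  summarize(
    n_homes     = n(),
    mean_area   = mean(Living.Area),
    median_area = median(Living.Area),
    sd_area     = sd(Living.Area)
  )
size_summary
```

Including the group sizes matters here: a large difference in means is less convincing if one category contains only a handful of homes.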

Question 2

Diamonds.csv has data on the price, carat, color, certification, and clarity of 308 diamonds. Read this data into RStudio using the following command:

Diamonds = read.csv("https://raw.githubusercontent.com/vittorioaddona/data/main/Diamonds.csv")

(a) Make a scatter plot of the diamond prices (y-axis) vs. the carat sizes (x-axis).

Hint: You can search our Topic 1 Notes (the html or rmd file) for an example that you can mimic.

ggplot( data=Diamonds , aes( x=carat , y=price ) )+geom_point()

(c) Make side-by-side boxplots to explore the relationships between diamond prices and color, and diamond prices and certification.

ggplot(data=Diamonds,aes(x=price,y=color))+geom_boxplot()

ggplot(data=Diamonds,aes(x=price,y=certification))+geom_boxplot()

(d) Which of color or certification, if either, do you think is a good explanatory variable for price? Briefly explain what you are looking for in order to make your decision.

I would say that certification is a better explanatory variable because the interquartile range (the width of each box) is much smaller. Within a given certification, prices vary much less than they do within a given color.

(e) Now, investigate the relationship between price and clarity.

ggplot(data=Diamonds,aes(x=price,y=clarity))+geom_boxplot()

(f) There is something odd going on in the graph from (e). Use Google to learn a little about the 5 clarity levels in the data (IF, VS1, VS2, VVS1, and VVS2), and try to identify what is “weird”.

The highest-clarity diamonds (IF) are also the cheapest.

(g) Try to think of a reason for the odd graph you produced in (e) (this will be an important point going forward in our course!)

It’s possible that flawless diamonds have the lowest median price because so few of them are sold that it is very hard to see a reliable trend. This can be seen in the graph from the large number of extremely high-priced outliers.
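One way to probe this explanation is to count the diamonds at each clarity level and, as an extra check that is not part of the answer above, compare typical carat sizes across levels. The sketch below uses made-up rows (`toy_diamonds` is hypothetical and says nothing about the real data); with the real data, `Diamonds` would take its place.

```r
# Toy stand-in for the Diamonds data (made-up rows, illustration only)
toy_diamonds <- data.frame(
  clarity = c("IF", "IF", "VS1", "VS1", "VS1", "VS2", "VS2", "VS2"),
  carat   = c(0.30, 0.35, 0.70, 0.90, 1.00, 0.80, 1.00, 1.10),
  price   = c(1500, 1700, 4000, 5500, 6500, 4500, 6000, 7000)
)

# How many diamonds are there at each clarity level?
counts <- table(toy_diamonds$clarity)
counts

# Median carat and median price within each clarity level
by_clarity <- aggregate(cbind(carat, price) ~ clarity,
                        data = toy_diamonds, FUN = median)
by_clarity
```

If one clarity level turned out to have much smaller diamonds on average, carat size rather than clarity could be driving the price pattern.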

Question 3

Below are a series of graphical representations of data. For each, comment on how you think each might be misleading or not ideal, and how you would suggest presenting the information instead (or what other information you might present):

GRAPH 1:

This graph is one-sided: it shows only jobs lost, which produces a startling uptick. A better statistic would be net job change, combining jobs created and jobs lost, which might better represent the truth.

GRAPH 2:

This graph might be misleading because it shows gas prices without showing the value of the US dollar over the same period. I suspect that the value of the dollar decreased at a similar rate over that period.

GRAPH 3:

This graph shows annual revenue for something like a business or corporation increasing year by year, but we don’t know why the increase occurs. Including data such as inflation might make it more meaningful.

GRAPH 4:

This graph shows that the rising cost of a degree is not matched by the earnings obtained from that degree. However, it shows earnings without saying how many years after the degree they were measured or over what period. Anyone, even someone with a bachelor’s degree, is unlikely to make a large amount of money right away, so a better metric might be earnings after 4 years.

Question 4

Let us return to the homes.csv data, which contains prices and other attributes for 1728 homes in upstate New York.

dplyr is a very useful package which makes it easy to add variables to a data frame, use only certain variables/observations, or manipulate data in countless other valuable ways.

For each of the following parts, explain in words what the dplyr commands are doing:

(a)

homes %>% 
  count( Waterfront ) %>% 
  mutate( Percentage = prop.table(n)*100  )

It counts the homes with and without a waterfront and calculates the percentage of homes in each group.

(b)

homes %>%
  mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
                                     Waterfront == 1 ~ "Yes"
                                     )  ) %>%
  select( WaterfrontCat , New.Construct.Cat ) %>%
  table( )

This command builds a two-way table of counts that cross-classifies homes by whether they are on a waterfront and whether they are new construction.

(c)

homes %>%
  filter( Living.Area < 1600 ) %>%
  summarize( max(Price)  )

This command gives the maximum price among homes with less than 1600 square feet of living area.

(d)

homes %>%
  group_by( Bedrooms ) %>%
  summarize( mean(Living.Area) , median(Living.Area) , sd(Living.Area) )

For each number of bedrooms, it calculates the mean, median, and standard deviation of living area.

(e)

homes %>%
  mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
                                     Waterfront == 1 ~ "Yes"
                                     )  ) %>%
  select( WaterfrontCat , New.Construct.Cat ) %>%
  table( ) %>%
  prop.table( margin=1 )*100  ### HINT: margin=1 does things by rows

Row percentages: within each waterfront category (No or Yes), the percentage of homes that are and are not new construction. Each row adds up to 100.

(f)

homes %>%
  mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
                                     Waterfront == 1 ~ "Yes"
                                     )  ) %>%
  select( WaterfrontCat , New.Construct.Cat ) %>%
  table( ) %>%
  prop.table( margin=2 )*100  ### HINT: margin=2 does things by columns

Column percentages: within each new-construction category (No or Yes), the percentage of homes that are and are not on a waterfront. Each column adds up to 100.
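The difference between no margin, `margin=1` and `margin=2` is easiest to see on a small table. The counts below are made up for illustration and are not the real homes counts:

```r
# Toy 2x2 table of counts (rows: waterfront, columns: new construction)
counts <- as.table(matrix(
  c(8, 2,
    1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(WaterfrontCat     = c("No", "Yes"),
                  New.Construct.Cat = c("No", "Yes"))
))

# No margin: each cell as a percentage of all homes (cells sum to 100)
all_pct <- prop.table(counts) * 100

# margin = 1: percentages within each row (each row sums to 100)
row_pct <- prop.table(counts, margin = 1) * 100

# margin = 2: percentages within each column (each column sums to 100)
col_pct <- prop.table(counts, margin = 2) * 100

row_pct
col_pct
```

In this toy table, 20% of non-waterfront homes are new construction (a row percentage), while about 67% of new-construction homes are not on a waterfront (a column percentage), so the two margins answer different questions.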

(g)

homes %>%
  filter( Living.Area < 1600 ) %>%
  ggplot( aes(x=Age,y=Price)  ) +
  geom_point( ) +
  ggtitle( "Question 4, Part (g); Give a better title" ) +
  xlab( "Age of the home (in years)" ) +
  ylab( "Price of the home (in US dollars)" )

Scatter plot of price versus age, restricted to homes with less than 1600 square feet of living area.

(h)

homes %>%
  mutate( BedBathRatio = Bedrooms / Bathrooms ) %>%
  ggplot( aes(x="",y=BedBathRatio)  ) +
  geom_violin() +
  xlab("")

## Warning: Removed 1 rows containing non-finite values (`stat_ydensity()`).

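Violin plot showing the distribution of the bedroom-to-bathroom ratio across homes. The warning indicates that one home with a non-finite ratio was dropped.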
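A ratio can be non-finite when the denominator is zero. The output above does not show which home was dropped, so treating a zero in Bathrooms as the cause is only an assumption; the made-up vectors below just illustrate how R handles such divisions and how to find the affected values:

```r
# Toy values (made up); the last two have zero bathrooms
bedrooms  <- c(3, 4, 2, 0)
bathrooms <- c(1.5, 2, 0, 0)

ratio <- bedrooms / bathrooms
ratio               # 2 2 Inf NaN

# is.finite() is FALSE for Inf, -Inf, NaN and NA
is.finite(ratio)    # TRUE TRUE FALSE FALSE

# Keep only the usable ratios
finite_ratio <- ratio[is.finite(ratio)]
finite_ratio        # 2 2
```

With the real data, filtering on `is.finite(BedBathRatio)` before plotting would make the dropped row explicit instead of leaving it to the warning.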