Homes Data

Import the data file

Download the file Saratoga.csv from Canvas. Then, move this file to the same folder that contains this document.

Now you’re ready to begin!

Click on the green arrow below to load the data. If you get an error, you’ll need to install the readr package. Check in with the instructor or a TA for how to do that.

library(readr)
homes_data <- read_csv("Saratoga.csv")

Rows: 1063 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Acres
dbl (6): Price, Living Area, Baths, Bedrooms, Fireplace, Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
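
The column specification above lists Acres as a character (chr) column, even though lot size is a numeric quantity. The worksheet does not say why. The sketch below shows one way to look for entries that cannot be read as numbers. It uses a small made-up data frame named toy (an assumption for illustration, not the real file), and the "n/a" entry is invented to show what the check reports.

```r
# Hypothetical stand-in for homes_data; these values are made up for illustration.
toy <- data.frame(
  Price = c(150000, 210000, 185000),
  Acres = c("0.25", "1.5", "n/a"),  # one entry is deliberately not a number
  stringsAsFactors = FALSE
)

# Try to convert the text to numbers. Entries that are not numbers become NA
# (R would normally warn about this, so the warning is silenced here).
acres_num <- suppressWarnings(as.numeric(toy$Acres))

# Show the original text of every entry that failed to convert.
bad_entries <- toy$Acres[is.na(acres_num) & !is.na(toy$Acres)]
print(bad_entries)
```

Running the same check on homes_data$Acres would show which entries, if any, caused readr to treat the column as text.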

Information about this data

This file includes information about houses that were for sale in and around Saratoga Springs, New York in 2006. Among the variables collected are the sale price (dollars), the total living area (square feet), whether or not the house has a fireplace (0 = No, 1 = Yes), the number of bathrooms, the number of bedrooms, the size of the lot (acres), and the age of the house (years).

Quantitative variables

We’ve already seen that variables can be classified as quantitative or categorical.

Discrete and continuous quantitative variables

New! Quantitative variables can be classified as discrete or continuous. A quantitative variable is discrete if it can only take a countable number of values. For example, "number of siblings" is a discrete quantitative variable. On the other hand, "the number of seconds I can hold my breath" is a continuous quantitative variable. Time can be measured to any level of precision, so the number of values it can take is not countable.
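
A rough way to see the difference in R is to count how often values repeat. The sketch below uses made-up values (siblings and breath_seconds are hypothetical, not from the Saratoga data). Counting distinct values is only a heuristic, not the definition: whether a variable is discrete or continuous depends on what it measures.

```r
# Made-up example values (not from the Saratoga data).
siblings <- c(0, 1, 1, 2, 3, 1, 0, 2)  # counts of people: discrete
breath_seconds <- c(31.2, 45.87, 28.4, 52.03, 39.5, 61.7, 33.33, 47.1)  # measured time: continuous

# Discrete data reuse a small set of values, so a frequency table is short.
print(table(siblings))

# Continuous measurements rarely repeat exactly, so the number of distinct
# values tends to be close to the number of observations.
n_distinct_siblings <- length(unique(siblings))
n_distinct_breath <- length(unique(breath_seconds))
print(c(siblings = n_distinct_siblings, breath_seconds = n_distinct_breath))
```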

Explanatory and response variables

Explanatory and response variables can be quantitative or categorical. When we suspect that variation in one variable might cause variation in another variable (for example, Year and Winning Race Time for the Kentucky Derby data), we call the first variable the explanatory variable and the second variable the response variable.

Scatterplots

We can visualize the association between two quantitative variables with a scatterplot. When we make a scatterplot, the explanatory variable is on the horizontal axis and the response variable is on the vertical axis.

Example

It’s possible that the selling price of a house (dollars) depends on the amount of living area (the number of square feet). If this association is strong enough, perhaps we could use the living area of a house to predict its selling price.

This R code will create a scatterplot of price versus living area.

plot(homes_data$`Living Area`, homes_data$Price)
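
The plot above uses R's default axis labels. As a minimal sketch, the same call can take the xlab, ylab, and main arguments of plot() to label the explanatory and response variables with their units. The data frame toy_homes below is a made-up stand-in (an assumption for illustration) so that the example runs on its own. Backticks are needed because the column name Living Area contains a space.

```r
# Hypothetical stand-in for homes_data; the values are made up.
# check.names = FALSE keeps the space in the column name "Living Area".
toy_homes <- data.frame(
  `Living Area` = c(900, 1500, 2100, 2800),
  Price = c(110000, 160000, 240000, 310000),
  check.names = FALSE
)

# Explanatory variable first (horizontal axis), response second (vertical axis).
plot(toy_homes$`Living Area`, toy_homes$Price,
     xlab = "Living area (square feet)",
     ylab = "Price (dollars)",
     main = "Price versus living area")
```

Replacing toy_homes with homes_data should give a labeled version of the scatterplot above.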

Do something!

Create a scatterplot using baths and bedrooms. Be sure to assign explanatory and response variables logically.

plot(homes_data$`Baths`, homes_data$Bedrooms)

The scatterplot you just created looks a lot different than the scatterplot of price versus living area. Explain why.

This isn't a comparison of price against living area; it compares the number of bedrooms with the number of bathrooms. Both are discrete quantitative variables, so the points fall on a regular grid and the plot looks very orderly.

Create a scatterplot to display the association between two other quantitative variables from the Saratoga Homes data.
```
plot(homes_data$`Age`, homes_data$Price)
```
Is the association between these variables linear, nonlinear, or no association? If linear, describe the strength and direction.

It's nonlinear, but there is a clear association between greater age and lower price. Generally, older homes sell for less than brand-new ones.

Given your answer to prompt 3, what research question about the association between these two variables can you ask? What is the answer to your question?

I could ask, ‘How does age affect the price of a house?’ The answer is that older houses generally sell for less.

Use the hist() function to create one histogram of your explanatory variable and one histogram of your response variable.
```
hist(homes_data$Age)
```
```
hist(homes_data$Price)
```
Which of these variables would you say is more “spread out” and why?

Price is more spread out: only about 50% of its data falls in the tallest (modal) bar, as opposed to what seems to be 60-70% in the modal bar for Age.

Use the summary() function to calculate statistics of your explanatory variable.

summary(homes_data$Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    6.00   18.00   28.25   34.00  247.00

Describe the shape (modality, skew/symmetry, unusual observations) of your explanatory variable and, given the shape, report what is an appropriate typical value or values (mode, median, or mean) as well as this numerical value (with units of measurement).

This is a unimodal, right-skewed distribution with unusually large observations (the maximum is 247 years). The typical value should be the median (18 years), because the outliers make the mean (28.25 years) unreliable.

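The reasoning above, that outliers pull the mean more than the median, can be checked with a small sketch. The ages vector is made up for illustration; only the value 247 comes from the summary() output above.

```r
# Made-up ages in years (not the real data).
ages <- c(2, 5, 8, 12, 18, 25, 30)

# Add one very old house, using the maximum reported by summary() above.
ages_with_outlier <- c(ages, 247)

# Compare how much each statistic moves when the outlier is added.
print(c(mean = mean(ages), median = median(ages)))
print(c(mean = mean(ages_with_outlier), median = median(ages_with_outlier)))
```

In this sketch the median moves from 12 to 15, while the mean jumps from about 14.3 to 43.375, which is why the median is the safer typical value for a right-skewed variable.
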
Use the sd() function to obtain the standard deviation of your explanatory variable. Also, obtain the standard deviation of your response variable.
```
sd(homes_data$Age, na.rm = FALSE)
```
```
[1] 35.12754
```
```
sd(homes_data$Price, na.rm = FALSE)
```
```
[1] 84476.61
```
The standard deviation is a numerical measure of the variation of a distribution, also called the spread of a distribution. It tells us how far an observation tends to be from the mean. The units of measurement for the standard deviation are the same as the units of measurement for the variable. Larger standard deviations indicate that data is spread out farther from the mean (there is more variation around the mean).

Compare the standard deviations of your explanatory variable and response variable. Report their values (include units of measurement). Which variable has the larger variation, and how do you know? Is this surprising given your response to prompt 6 (when you compared the histograms)? If so, why? If not, why not?

The standard deviations are about 35.13 years for Age and $84,476.61 for Price. Price has the larger standard deviation, since roughly $84,000 is a much bigger number than 35, although the two values are in different units. This isn't surprising, as Price is a much less stable thing to measure than Age and is more likely to fluctuate, creating larger disparities. This lines up with my answer to prompt 6: Age has a more concentrated mode, and therefore it would have less variation in its data.
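
To make the description of the standard deviation in the prompt above concrete, here is a minimal sketch that computes it step by step and compares the result with sd(). The vector x is made up for illustration. R's sd() divides by n - 1 (the sample standard deviation), so the manual version does the same.

```r
# Made-up values for illustration (not from the Saratoga data).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Step 1: how far is each observation from the mean?
deviations <- x - mean(x)

# Step 2: square the deviations, add them up, and divide by n - 1.
variance <- sum(deviations^2) / (n - 1)

# Step 3: take the square root to get back to the original units.
manual_sd <- sqrt(variance)

# The built-in function should agree with the manual calculation.
builtin_sd <- sd(x)
print(c(manual = manual_sd, builtin = builtin_sd))
```

Because the result is in the same units as x, a standard deviation in years and one in dollars describe spread on different scales.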