The dataset provided contains information such as asking price (in $1,000s), size (sq. ft.), number of bedrooms, and number of bathrooms. The homes come from four states (California, New Jersey, New York, and Pennsylvania), and the state is also recorded in the dataset.
A sample of the dataset is shown below.
home_data = read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
head(home_data)
##   State Price Size Beds Baths
## 1    CA   533 1589    3   2.5
## 2    CA   610 2008    3   2.0
## 3    CA   899 2380    5   3.0
## 4    CA   929 1868    3   3.0
## 5    CA   210 1360    2   2.0
## 6    CA   268 2131    3   2.0
Based on the data shown, the following 5 questions are proposed:
Q1. Use the data only for California. How much does the size of a home influence its price?
Q2. Use the data only for California. How does the number of bedrooms of a home influence its price?
Q3. Use the data only for California. How does the number of bathrooms of a home influence its price?
Q4. Use the data only for California. How do the size, the number of bedrooms, and the number of bathrooms of a home jointly influence its price?
Q5. Are there significant differences in home prices among the four states (CA, NY, NJ, PA)? This will help you determine if the state in which a home is located has a significant impact on its price. All data should be used.
We will analyze each question in detail.
We are looking specifically at homes in California.
CA_data <- home_data[home_data$State == "CA", ]
Let x represent size in sq. ft. and y represent price.
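The fitting step itself does not appear in the report. A minimal sketch of what it may have looked like, assuming the predictor and response are taken from the `Size` and `Price` columns and stored in vectors named `x` and `y` (the names are inferred from the `lm(formula = y ~ x)` call in the summary, not confirmed by the source):

```r
# Assumption: x and y are plain vectors pulled from the California subset.
home_data <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
CA_data <- home_data[home_data$State == "CA", ]

x <- CA_data$Size   # size in sq. ft.
y <- CA_data$Price  # asking price in $1,000s

# Simple linear regression of price on size
model <- lm(y ~ x)
```

The same pattern would presumably serve the later single-predictor questions by assigning `CA_data$Beds` or `CA_data$Baths` to `x` before refitting.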
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462.55 -139.69 39.24 147.65 352.21
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.81675  154.68102  -0.367 0.716145
## x             0.33919    0.08558   3.963 0.000463 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219.3 on 28 degrees of freedom
## Multiple R-squared: 0.3594, Adjusted R-squared: 0.3365
## F-statistic: 15.71 on 1 and 28 DF, p-value: 0.0004634
We are still looking at homes in California.
Let x represent the number of bedrooms, and y represent the price of a home.
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413.83 -236.62 29.94 197.69 570.94
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   269.76     233.62   1.155    0.258
## x              84.77      72.91   1.163    0.255
##
## Residual standard error: 267.6 on 28 degrees of freedom
## Multiple R-squared: 0.04605, Adjusted R-squared: 0.01198
## F-statistic: 1.352 on 1 and 28 DF, p-value: 0.2548
Once again, we are looking at homes only in California.
We fit a linear regression model.
Let x represent number of bathrooms in a home, and y represent the price of a home.
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -374.93 -181.56 -2.74 152.31 614.81
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    90.71     148.57   0.611  0.54641
## x             194.74      62.28   3.127  0.00409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235.8 on 28 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.2324
## F-statistic: 9.779 on 1 and 28 DF, p-value: 0.004092
We are still using data from California only.
We fit a multiple linear regression model.
Let x0 represent the number of bedrooms, x1 represent the number of bathrooms, and y represent price.
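The call that produced this model is not shown. A minimal sketch, assuming `x0`, `x1`, and `y` are vectors taken from the `Beds`, `Baths`, and `Price` columns (names inferred from the `lm(formula = y ~ x0 + x1)` call in the summary). Since Q4 also asks about size, the sketch adds a hypothetical three-predictor fit, `model_full`, which is not part of the original analysis and whose output is not reported here:

```r
# Assumption: column-to-variable mapping is inferred from the text, not shown in the source.
home_data <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")
CA_data <- home_data[home_data$State == "CA", ]

x0 <- CA_data$Beds   # number of bedrooms
x1 <- CA_data$Baths  # number of bathrooms
y  <- CA_data$Price  # asking price in $1,000s

# Two-predictor model matching the summary in this section
model <- lm(y ~ x0 + x1)

# Hypothetical extension: include size as well, as Q4 is worded
model_full <- lm(Price ~ Size + Beds + Baths, data = CA_data)
```

Calling `summary(model_full)` would show whether bedrooms and bathrooms still matter once size is accounted for.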
summary(model)
##
## Call:
## lm(formula = y ~ x0 + x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -368.75 -182.03 -6.16 152.92 615.42
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    52.03     223.48   0.233  0.81767
## x0             16.40      69.80   0.235  0.81598
## x1            189.17      67.64   2.797  0.00939 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239.9 on 27 degrees of freedom
## Multiple R-squared: 0.2604, Adjusted R-squared: 0.2056
## F-statistic: 4.752 on 2 and 27 DF, p-value: 0.01705
We want to determine whether the average home price differs significantly between states. Our null hypothesis is that the average price is the same in all four states.
\(H_0: \mu_{CA} = \mu_{NY} = \mu_{NJ} = \mu_{PA}\)
\(H_a:\) At least one state's average home price differs from the others.
We will conduct an ANOVA test.
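The call that creates `aov_results` is not shown. A minimal sketch, assuming a one-way ANOVA of `Price` on `State` over all rows (the formula is an inference from the `State` row with 3 degrees of freedom in the table, not something the source states):

```r
# Assumption: the ANOVA formula is Price ~ State; only the summary table appears in the source.
home_data <- read.csv("https://www.lock5stat.com/datasets3e/HomesForSale.csv")

# One-way ANOVA comparing mean asking price across the four states
aov_results <- aov(Price ~ State, data = home_data)
```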
summary(aov_results)
##              Df  Sum Sq Mean Sq F value   Pr(>F)
## State         3 1198169  399390   7.355 0.000148 ***
## Residuals   116 6299266   54304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on our analysis for each question…
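For Q1, according to the summary, the fitted model for the relationship between size and price is y = -56.82 + 0.34x (y = price in $1,000s, x = size in sq. ft.). Based on the data, we can expect the asking price to increase by about 0.34 thousand dollars (roughly $339) for each additional square foot. The slope is statistically significant (t = 3.963, p = 0.00046), and size explains about 36% of the variation in price (R-squared = 0.3594).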
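As a worked illustration of the fitted line, take a hypothetical 2,000 sq. ft. home (an example value, not a row from the dataset) and the unrounded coefficients from the summary, with price in $1,000s:

\[
\hat{y} = -56.817 + 0.33919 \times 2000 \approx 621.6
\]

That corresponds to a predicted asking price of roughly $622,000.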
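For Q2, according to the linear regression summary in the analysis section, the fitted model for the relationship between the number of bedrooms and price is y = 269.8 + 84.8x (y = price in $1,000s, x = number of bedrooms). The model suggests that each additional bedroom is associated with an increase of 84.8 thousand dollars in price, so a 1-bedroom home is predicted to cost about 354.5 thousand dollars. The intercept, 269.8 thousand, is the predicted price at zero bedrooms and has no practical meaning on its own.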
We test whether there is significant evidence of a linear relationship between the number of bedrooms and the price of a home.
\(H_0: \beta_1 = 0\) - “There is no linear relationship between bedrooms and price” \(H_a: \beta_1 \neq 0\)
The t-statistic for the slope is 1.16, which gives a p-value of 0.255. If there were truly no linear relationship, a slope at least this far from zero would arise by chance about 25% of the time, so we fail to reject the null hypothesis at the 0.05 level. That is, there is not sufficient evidence of a linear relationship between the number of bedrooms and the price of a home.
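For reference, the t-statistic in the bedrooms summary is the slope estimate divided by its standard error, compared against a t distribution with 28 degrees of freedom:

\[
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} = \frac{84.77}{72.91} \approx 1.163
\]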
For Q3, according to the linear regression summary, the fitted model for the relationship between the number of bathrooms and price is y = 90.7 + 194.7x (y = price in $1,000s, x = number of bathrooms). This tells us that each additional bathroom is associated with a price increase of 194.7 thousand dollars.
\(H_0: \beta_1 = 0\) - “There is no linear relationship between bathrooms and price.” \(H_a: \beta_1 \neq 0\) t-test - t-statistic: 3.13 (p-value: 0.004). Because the p-value is below 0.05, the observed slope is unlikely to be due to random variation alone, so we reject the null hypothesis. That is, there is significant evidence of a linear relationship between the number of bathrooms and the price of a home.
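For Q4, according to the summary shown, the fitted model for the relationship between bedrooms, bathrooms, and price is y = 52.0 + 16.4x_0 + 189.2x_1 + e (x_0 = number of bedrooms, x_1 = number of bathrooms, y = price in $1,000s). This means we can expect the price to increase by 189.2 thousand dollars for each additional bathroom, holding bedrooms fixed, and by 16.4 thousand dollars for each additional bedroom, holding bathrooms fixed, on top of a base of 52.0 thousand dollars. Only the bathroom coefficient is statistically significant (p = 0.009); the bedroom coefficient is not (p = 0.816).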
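As a worked illustration of the two-predictor model, take a hypothetical home with 3 bedrooms and 2 bathrooms (example values, not a specific row from the dataset), using the coefficients from the summary with price in $1,000s:

\[
\hat{y} = 52.03 + 16.40 \times 3 + 189.17 \times 2 = 52.03 + 49.20 + 378.34 = 479.57
\]

That is a predicted asking price of about $480,000.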
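For Q5, the ANOVA summary gives an F value of 7.355 with Pr(>F) = 0.00015. Since this is well below 0.05, we reject the null hypothesis: average home prices differ significantly between at least some of the four states. The ANOVA alone does not tell us which states differ.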
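The F value can be checked from the other columns of the ANOVA table: each mean square is the sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares.

\[
MS_{State} = \frac{1198169}{3} \approx 399390, \qquad
MS_{Resid} = \frac{6299266}{116} \approx 54304, \qquad
F = \frac{MS_{State}}{MS_{Resid}} = \frac{399390}{54304} \approx 7.355
\]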
I used “https://www.w3schools.com/R”, as well as OpenAI to help debug and understand R concepts.