According to the textbook, a record with an area of 0 is a vacant lot with no house on it, so it contributes nothing to an analysis of house prices. We therefore first remove all records with an area of 0 from the data set.

# Keep only records with a positive area (area 0 marks a vacant lot)
index <- house$area > 0

house <- house[ index, ]
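One detail of this kind of filter is worth illustrating. The sketch below uses a made-up data frame `toy`, not the textbook's `house` data, and the source does not say whether `house$area` has missing values. If it does, a logical index leaves a row of NAs for each missing area, whereas wrapping the test in `which()` drops those rows.

```r
# Toy data frame standing in for `house`; the values are made up.
toy <- data.frame(area  = c(1200, 0, NA, 850),
                  price = c(300000, 50000, 210000, 180000))

# A logical index keeps a placeholder row of NAs wherever the test is NA.
by_logical <- toy[toy$area > 0, ]

# which() returns only the positions where the test is TRUE, so NA is dropped.
by_which <- toy[which(toy$area > 0), ]

nrow(by_logical)
nrow(by_which)
```

Which behaviour is wanted depends on whether records with an unknown area should remain in the analysis.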

summary(house$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1896    1979    1995    1990    2006    2015     221

We take the variable age, which records the year in which the house was built, as the x-axis and the price of the house as the y-axis, and draw the scatter plot shown below.

## Warning: Removed 232 rows containing missing values (geom_point).

The plot suggests that newer houses tend to have higher prices.
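The warning above reports 232 removed rows, while summary(house$age) reports only 221 NA's. One plausible explanation, which is an assumption and not something the source states, is that a row is dropped when either age or price is missing. The sketch below shows how the two counts can be compared on a made-up data frame `toy`.

```r
# Toy data standing in for `house`; the values are made up.
toy <- data.frame(age   = c(1990, NA, 2005, NA, 2010),
                  price = c(250000, 300000, NA, 410000, 420000))

# Rows with a missing age
sum(is.na(toy$age))

# Rows with a missing age or a missing price: these cannot be plotted
sum(!complete.cases(toy[, c("age", "price")]))
```

Running the same two lines on `house` would show whether missing prices account for the difference.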

## 
## Call:
## lm(formula = price ~ age, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -387509 -142729  -37316  103754  598689 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4004422.1  1224810.4  -3.269 0.001234 ** 
## age             2199.7      615.7   3.573 0.000426 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216100 on 243 degrees of freedom
##   (232 observations deleted due to missingness)
## Multiple R-squared:  0.0499, Adjusted R-squared:  0.04599 
## F-statistic: 12.76 on 1 and 243 DF,  p-value: 0.0004261

y = -4004422.1 + 2199.7x + e

Next, we add a geom_smooth layer to draw the fitted regression line. The figure shows that at x = 2010 the fitted house price is close to 410000, which agrees with the estimated coefficients: -4004422.1 + 2199.7 × 2010 ≈ 416975. On average, then, newer houses have higher prices: the greater the value of the variable “age,” the higher the expected price. The relationship is weak, however, since the model explains only about 5% of the variation in price (R-squared = 0.0499).
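The check against the fitted formula can be reproduced with a few lines of R. This is a minimal sketch: the coefficients are copied from the summary output above, and `predict_price` is a hypothetical helper name introduced here for illustration.

```r
# Coefficients copied from the summary of lm(price ~ age) above.
intercept <- -4004422.1
slope_age <- 2199.7

# Hypothetical helper: expected price for a given construction year.
predict_price <- function(year) intercept + slope_age * year

# Fitted price for a house built in 2010
predict_price(2010)

# Difference in fitted price between houses built ten years apart (10 * slope)
predict_price(2010) - predict_price(2000)
```

With the fitted model object at hand, predict() with a newdata argument would give the same fitted value without retyping the coefficients.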

## Warning: Removed 232 rows containing non-finite values (stat_smooth).
## Warning: Removed 232 rows containing missing values (geom_point).

Next, we use the same method to fit a simple linear regression of price on each of the variables “taxes” and “bed”. The resulting prediction formulas and regression lines are shown below:

## Warning: Removed 132 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = price ~ taxes, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -408214 -112316  -28914   67032  709869 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 95838.413  20404.245   4.697 3.83e-06 ***
## taxes         103.164      7.018  14.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 165400 on 343 degrees of freedom
##   (132 observations deleted due to missingness)
## Multiple R-squared:  0.3865, Adjusted R-squared:  0.3847 
## F-statistic: 216.1 on 1 and 343 DF,  p-value: < 2.2e-16

y = 95838.41 + 103.16x + e

## Warning: Removed 132 rows containing non-finite values (stat_smooth).
## Warning: Removed 132 rows containing missing values (geom_point).

## Warning: Removed 104 rows containing missing values (geom_point).

## 
## Call:
## lm(formula = price ~ bed, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -330223 -145207  -45739  110177  640245 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   204771      29228   7.006 1.16e-11 ***
## bed            54984       9370   5.868 9.79e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 201200 on 371 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.08493,    Adjusted R-squared:  0.08246 
## F-statistic: 34.43 on 1 and 371 DF,  p-value: 9.793e-09
## Warning: Removed 104 rows containing non-finite values (stat_smooth).
## Warning: Removed 104 rows containing missing values (geom_point).

y = 204771 + 54984x + e

The graphs show that house price tends to rise as taxes and bed increase. That is, both taxes and bed are positively correlated with house price.
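The strength of these positive relationships can be compared using the R-squared values reported above. In a simple linear regression, R-squared equals the squared Pearson correlation between the predictor and the response, so its square root gives the size of the correlation. The sketch below only reuses numbers from the three summaries; the variable names are chosen here for illustration.

```r
# R-squared values copied from the three model summaries above.
r_squared <- c(age = 0.0499, taxes = 0.3865, bed = 0.08493)

# In simple linear regression, R^2 is the squared Pearson correlation.
# All three slopes are positive, so the correlations are positive too.
correlation <- sqrt(r_squared)

# Predictors ordered from strongest to weakest linear association with price
round(sort(correlation, decreasing = TRUE), 3)
```

On these figures, taxes has by far the strongest linear association with price, while age and bed are only weakly related to it.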

We want to explore further how several variables relate to price at the same time. In particular, we are interested in how age and the number of bedrooms together relate to house price. We define a large unit as a house with more than three bedrooms and store this indicator in house$bigunit. Fitting a multiple linear regression with an interaction between age and bigunit then gives the following results:

house$bigunit <- factor( ifelse( house$bed > 3, 1, 0 ) )



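# bigunit * age expands to bigunit + age + bigunit:age,
# so each group gets its own intercept and its own slope in age
m2 <- lm( formula = price ~ bigunit * age,
         data = house )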

ggplot( data = house, mapping = aes( x = age, y = price, colour =  bigunit  ) ) +
    geom_point( alpha = .4 ) +
    geom_smooth( method = 'lm', se = FALSE )
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 232 rows containing non-finite values (stat_smooth).
## Warning: Removed 232 rows containing missing values (geom_point).

As can be seen from the figure above, for the same increase in age, the price of large units rises more than the price of small units. That is, the green regression line, which corresponds to large units, has the steeper slope.
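The two slopes in the figure can also be read directly from the coefficients of an interaction model. The sketch below uses simulated data, not the textbook's house data, with a data-generating process invented here so that large units have the steeper slope. The slope for small units is the age coefficient, and the slope for large units is the age coefficient plus the interaction coefficient.

```r
set.seed(1)  # simulated data, not the textbook's house data

n   <- 200
age <- sample(1950:2015, n, replace = TRUE)  # construction year
bed <- sample(1:6, n, replace = TRUE)        # number of bedrooms
bigunit <- factor(ifelse(bed > 3, 1, 0))

# Made-up data-generating process: big units gain more per year.
price <- 100000 + 1000 * (age - 1950) +
  ifelse(bigunit == 1, 2000 * (age - 1950), 0) +
  rnorm(n, sd = 20000)
sim <- data.frame(price, age, bigunit)

fit <- lm(price ~ bigunit * age, data = sim)
b   <- coef(fit)

slope_small <- b[["age"]]                        # slope when bigunit = 0
slope_big   <- b[["age"]] + b[["bigunit1:age"]]  # slope when bigunit = 1
c(small = slope_small, big = slope_big)
```

Applied to m2, the same two expressions would give the slopes of the two lines drawn by geom_smooth, since a full interaction model fits a separate line to each group.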

Thank you!
