library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
Houses10 <- read_csv("Houses10.csv")
## Rows: 33656 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ADDRESS, SUBURB, Bedroom, Bathroom, GARAGE, NEAREST_STN, NEAREST_SCH
## dbl (7): PRICE, LAND_AREA, FLOOR_AREA, CBD_DIST, NEAREST_STN_DIST, POSTCODE,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You can use the glimpse function from the tidyverse package to peek into the dataset: it lists the number of rows and columns, and the type and first few values of each variable.
glimpse(Houses10)
## Rows: 33,656
## Columns: 14
## $ ADDRESS <chr> "1 Acorn Place", "1 Addis Way", "1 Ainsley Court", "1…
## $ SUBURB <chr> "South Lake", "Wandi", "Camillo", "Bellevue", "Lockri…
## $ PRICE <dbl> 565000, 365000, 287000, 255000, 325000, 409000, 40000…
## $ Bedroom <chr> "More_bedroom", "Small_Bedroom", "Small_Bedroom", "Sm…
## $ Bathroom <chr> "More_bathroom", "More_bathroom", "Small_bathroom", "…
## $ GARAGE <chr> "2", "2", "1", "2", "2", "1", "2", "2", "3", "8", "6"…
## $ LAND_AREA <dbl> 600, 351, 719, 651, 466, 759, 386, 468, 875, 552, 253…
## $ FLOOR_AREA <dbl> 160, 139, 86, 59, 131, 118, 132, 158, 168, 126, 241, …
## $ CBD_DIST <dbl> 18300, 26900, 22600, 17900, 11200, 27300, 28200, 4170…
## $ NEAREST_STN <chr> "Cockburn Central Station", "Kwinana Station", "Chall…
## $ NEAREST_STN_DIST <dbl> 1800, 4900, 1900, 3600, 2000, 1000, 3700, 1100, 2500,…
## $ POSTCODE <dbl> 6164, 6167, 6111, 6056, 6054, 6112, 6112, 6169, 6022,…
## $ NEAREST_SCH <chr> "LAKELAND SENIOR HIGH SCHOOL", "ATWELL COLLEGE", "KEL…
## $ NEAREST_SCH_DIST <dbl> 0.8283386, 5.5243244, 1.6491782, 1.5714009, 1.5149216…
My dataset has 33,656 rows.
My dataset has 14 columns.
My target variable is PRICE.
anyNA(Houses10)
## [1] FALSE
colSums(is.na(Houses10))
## ADDRESS SUBURB PRICE Bedroom
## 0 0 0 0
## Bathroom GARAGE LAND_AREA FLOOR_AREA
## 0 0 0 0
## CBD_DIST NEAREST_STN NEAREST_STN_DIST POSTCODE
## 0 0 0 0
## NEAREST_SCH NEAREST_SCH_DIST
## 0 0
No, my data set does not contain missing values.
Does the amount of land area significantly affect the price of a house?
H0: The amount of land area has no significant effect on house price.
Ha: The amount of land area has a significant effect on house price.
Land area and price.
My main response variable is numerical.
Houses10 %>%
ggplot(aes(PRICE)) +
geom_histogram(binwidth = 100000, colour="black",fill="pink")
Houses10 %>%
ggplot(aes(PRICE)) +
geom_histogram(binwidth = 50000, colour="black",fill="purple")
Houses10 %>%
ggplot(aes(PRICE)) +
geom_density()
The plot is unimodal and right-skewed, with potential outliers at the high end where prices exceed 2,000,000.
Because the distribution is right-skewed, I will measure the center of my response variable with the median and the spread with the IQR.
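The choice of median and IQR can be illustrated with a small sketch. The prices below are hypothetical toy values, not taken from Houses10; they only show how a single expensive house pulls the mean and standard deviation upward while the median and IQR barely move.

```r
# Toy right-skewed prices (hypothetical values, for illustration only)
toy_price <- c(250000, 300000, 350000, 400000, 450000, 500000, 2500000)

toy_mean   <- mean(toy_price)    # pulled upward by the one expensive house
toy_median <- median(toy_price)  # middle value, unaffected by the outlier
toy_sd     <- sd(toy_price)      # inflated by the outlier
toy_iqr    <- IQR(toy_price)     # spread of the middle 50% of houses

c(Mean = toy_mean, Median = toy_median, SD = toy_sd, IQR = toy_iqr)
```

On the real data the same summaries would be computed on `Houses10$PRICE`.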
Houses10 %>%
ggplot(aes(CBD_DIST)) +
geom_histogram(binwidth= 1000,colour= "black", fill="red")
Houses10 %>%
ggplot(aes(Bedroom)) +
geom_bar(colour = "black", fill = "yellow") # geom_bar() counts categories, so binwidth does not apply
Houses10 %>%
ggplot(aes(y=FLOOR_AREA)) +
geom_boxplot(fill="lightblue")
Histogram (CBD_DIST): Most houses are clustered between 10,000 and 15,000 units from the CBD, where the tallest bins contain around 1,500 houses each, and few houses are as far as 60,000 units away.
Bar chart (Bedroom): There is a large gap between the two categories. Around 20,000 houses fall in the More_bedroom category while only around 15,000 fall in the Small_Bedroom category.
Boxplot (FLOOR_AREA): The median floor area is around 200 units.
Houses10 %>%
ggplot(aes(PRICE, LAND_AREA)) +
geom_point()
Houses10 %>%
ggplot(aes(PRICE,fill= Bedroom)) +
geom_boxplot()
The relationship between price and land area is very weak and non-linear: most houses have a low land area regardless of their price.
The boxplot of price by bedroom category suggests that prices tend to be higher for houses with more bedrooms.
Houses10 %>%
summarise(Mean= mean(PRICE),
SD= sd(PRICE),
Min= min(PRICE),
Q1= quantile(PRICE,0.25),
Median= median (PRICE),
Q3= quantile(PRICE,0.75),
Max= max(PRICE))
The average house price is around 637,072, with a standard deviation of 355,825.6 and a median of 535,500.
Houses10 %>%
ggplot(aes(PRICE)) +
geom_histogram(binwidth=100000, fill= "lightpink", color= "black")
The histogram shows a unimodal distribution that is skewed to the right. The mean (637,072) is higher than the median (535,500), which is consistent with high-priced outliers pulling the mean upward, and the large standard deviation (355,825.6) shows that there is a wide variety of home prices.
x_bar= mean(Houses10$PRICE)
SD=sd(Houses10$PRICE)
n = nrow(Houses10)
tibble(n = n,
Mean = x_bar,
Sd=SD)
SE = SD/sqrt(n)
zstar = 1.96
SE
## [1] 1939.572
ME = zstar * SE
ME
## [1] 3801.562
tibble( SE = SE,
ME = ME)
library(infer)
Houses10 %>%
t_test(response = PRICE,
conf_int = TRUE,
conf_level = 0.95)
We are 95% confident that the true mean of PRICE is between 633,270.4 (lower limit) and 640,873.6 (upper limit).
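The interval limits can be reproduced by hand as mean ± margin of error. The sketch below uses the rounded values reported earlier in this document (mean 637,072, SE 1939.572), so the last decimal may differ slightly from the t_test output. It also checks that the t critical value for this sample size is practically the same as the 1.96 used above.

```r
# Values as reported earlier in this document (rounded)
x_bar <- 637072      # sample mean of PRICE
se    <- 1939.572    # standard error, SD / sqrt(n)
zstar <- 1.96        # critical value for 95% confidence
n     <- 33656       # number of rows

me    <- zstar * se  # margin of error
lower <- x_bar - me
upper <- x_bar + me
round(c(lower = lower, upper = upper), 1)

# t critical value with n - 1 degrees of freedom, for comparison with 1.96
tstar <- qt(0.975, df = n - 1)
tstar
```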
Houses10%>%
filter(!is.na(PRICE), !is.na(Bedroom)) %>%
group_by(Bedroom) %>%
summarise(Mean = mean(PRICE),
SD = sd(PRICE))
The average price of houses in the More_bedroom category is higher than in the Small_Bedroom category. The same applies to the standard deviation.
Houses10 %>%
ggplot(aes(x= Bedroom)) +
geom_bar(fill= "blue", color= "black")
The bar for houses with more bedrooms is taller than the bar for houses with small bedrooms, so the More_bedroom category contains more houses.
# Scatterplot 1
Houses10 %>%
ggplot(aes(x= FLOOR_AREA, y= PRICE)) +
geom_point()+
labs(title = "PRICE vs. FLOOR_AREA", x = "FLOOR_AREA", y = "PRICE")
# Scatterplot 2
Houses10 %>%
ggplot(aes(x= NEAREST_SCH_DIST, y= PRICE)) +
geom_point()+
labs(title = "PRICE vs. NEAREST_SCH_DIST", x = "NEAREST_SCH_DIST", y = "PRICE")
• Describe the direction, form, and strength of the relationship in each
plot.
Scatterplot 1: As the floor area increases, the price also increases, so there is a positive relationship between floor area and price. The strength is moderate and the form is linear.
Scatterplot 2: As the distance to the nearest school increases, the price decreases slightly, so there is a negative relationship between school distance and price. The strength is very weak and the form is non-linear.
lm_FA <- lm(PRICE ~ FLOOR_AREA, data = Houses10)
summary(lm_FA)
##
## Call:
## lm(formula = PRICE ~ FLOOR_AREA, data = Houses10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2023603 -167448 -63342 98386 2037674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 140367.71 4434.73 31.65 <2e-16 ***
## FLOOR_AREA 2706.81 22.49 120.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 297500 on 33654 degrees of freedom
## Multiple R-squared: 0.3008, Adjusted R-squared: 0.3008
## F-statistic: 1.448e+04 on 1 and 33654 DF, p-value: < 2.2e-16
lm_NSCHD <- lm(PRICE ~ NEAREST_SCH_DIST, data = Houses10)
summary(lm_NSCHD)
##
## Call:
## lm(formula = PRICE ~ NEAREST_SCH_DIST, data = Houses10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -590331 -228900 -100902 126304 1803469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 645189 2797 230.647 < 2e-16 ***
## NEAREST_SCH_DIST -4472 1111 -4.026 5.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 355700 on 33654 degrees of freedom
## Multiple R-squared: 0.0004814, Adjusted R-squared: 0.0004517
## F-statistic: 16.21 on 1 and 33654 DF, p-value: 5.683e-05
• Report the correlation coefficient (r) and coefficient of determination (R2) for each model.
#coefficient of determination for FLOOR_AREA
summary(lm_FA)$r.squared
## [1] 0.3008489
#correlation coefficient (r)
cor(Houses10$PRICE,Houses10$FLOOR_AREA, use = "complete.obs")
## [1] 0.5484969
#coefficient of determination for NEAREST_SCH_DIST
summary(lm_NSCHD)$r.squared
## [1] 0.0004814228
#correlation coefficient (r)
cor(Houses10$PRICE,Houses10$NEAREST_SCH_DIST, use = "complete.obs")
## [1] -0.02194135
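In a simple linear regression, R² equals the square of the correlation coefficient r, which gives a quick consistency check on the four numbers above. A minimal sketch using the values copied from the output:

```r
# Values copied from the output above
r_FA     <- 0.5484969     # cor(PRICE, FLOOR_AREA)
r2_FA    <- 0.3008489     # summary(lm_FA)$r.squared
r_NSCHD  <- -0.02194135   # cor(PRICE, NEAREST_SCH_DIST)
r2_NSCHD <- 0.0004814228  # summary(lm_NSCHD)$r.squared

# Squaring r should reproduce R-squared up to rounding
diff_FA    <- abs(r_FA^2 - r2_FA)
diff_NSCHD <- abs(r_NSCHD^2 - r2_NSCHD)
c(FLOOR_AREA = diff_FA, NEAREST_SCH_DIST = diff_NSCHD)
```

Both differences are negligible, so the reported r and R² values agree. The sign of r matches the sign of each model's slope.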
• Write out each model equation and explain which variable produces the better model and why.
\(\widehat{PRICE} = 140367.71 + 2706.81 \times FLOOR\_AREA\)
\(\widehat{PRICE} = 645189 - 4472 \times NEAREST\_SCH\_DIST\)
The FLOOR_AREA model is the better model because its adjusted R-squared (0.3008) is much higher than that of the NEAREST_SCH_DIST model (0.0004517).
lm_FA <- lm(PRICE ~ FLOOR_AREA, data = Houses10)
new <- tibble(FLOOR_AREA = 5000)
predict_PRICE <- predict(lm_FA, newdata = new)
print(predict_PRICE)
## 1
## 13674431
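The prediction can be checked by plugging FLOOR_AREA = 5000 into the fitted equation. The sketch below uses the rounded coefficients from the summary output, so it differs from predict() by a few units. Note that the floor areas visible in the glimpse output lie between 59 and 241, so a value of 5000 is probably far outside the observed range; this makes the prediction an extrapolation that should be treated with caution.

```r
# Rounded coefficients from summary(lm_FA)
intercept <- 140367.71
slope     <- 2706.81

floor_area <- 5000
by_hand    <- intercept + slope * floor_area
by_hand

# Difference from the predict() output shown above, caused by coefficient rounding
from_predict <- 13674431
from_predict - by_hand
```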
My dataset contains information about houses that were sold in Perth, Australia. It contains variables such as price, bedroom and bathroom categories, garage, land area, floor area, and distance to schools and train stations. These variables describe different features of each property. The goal is to examine how some of these variables impact house price.
Research question: Does the amount of land area significantly affect the price of a house?
Hypotheses: H0: The amount of land area has no significant effect on house price. Ha: The amount of land area has a significant effect on house price.
The response variable is PRICE, which is numerical, unimodal, and right-skewed. Descriptive statistics: The mean house price is approximately 637,072, which is higher than the median of 535,500. The histograms and other plots above illustrate that most houses are clustered in the lower price brackets, with a wide spread indicated by the standard deviation of 355,826.
Floor Area vs. Price: This proved to be a much stronger predictor than school distance. The correlation coefficient (r) was 0.548, indicating a moderate positive linear relationship.
Model Performance: The regression model using FLOOR_AREA (R2 = 0.3008) clearly outperformed the model using school distance (R2 = 0.00048).
The model used was the simple linear regression model.
\(\widehat{PRICE} = 140367.71 + 2706.81 \times FLOOR\_AREA\)
The FLOOR_AREA model is the better of the two because its adjusted R-squared (0.3008) is much higher than that of the NEAREST_SCH_DIST model (0.0004517).
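While the initial research question focused on LAND_AREA, the scatterplot showed only a very weak relationship between land area and price, whereas FLOOR_AREA had a moderate positive relationship with price. No regression on LAND_AREA was fitted, so the hypotheses about land area were not formally tested. Categorical analysis showed that houses in the "More_bedroom" category had higher average prices than "Small_Bedroom" homes.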
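The research question about LAND_AREA could be tested with a simple linear regression, in the same way as the FLOOR_AREA and NEAREST_SCH_DIST models. The sketch below runs on simulated data (the `toy` data frame and its numbers are hypothetical, created only so the example is self-contained); on the real data the call would be `lm(PRICE ~ LAND_AREA, data = Houses10)`, and its results are not shown here.

```r
set.seed(1)

# Simulated stand-in for Houses10 (hypothetical data, for illustration only)
toy <- data.frame(LAND_AREA = runif(200, min = 200, max = 1000))
toy$PRICE <- 400000 + 150 * toy$LAND_AREA + rnorm(200, sd = 100000)

# Fit the model and pull out the slope estimate and its p-value
lm_LA <- lm(PRICE ~ LAND_AREA, data = toy)
coefs <- summary(lm_LA)$coefficients
coefs["LAND_AREA", c("Estimate", "Pr(>|t|)")]

# Reject H0 at the 5% level when the slope's p-value is below 0.05
reject_H0 <- coefs["LAND_AREA", "Pr(>|t|)"] < 0.05
reject_H0
```

With a sample as large as Houses10, even a very weak relationship can give a small p-value, so the R-squared of the model should be reported alongside the test.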
A possible improvement is a multiple regression model that combines floor area, suburb location, and bedroom category to explain more of the variance in house prices.
This data was scraped from http://house.speakingsame.com/ and includes data from 322 Perth suburbs, resulting in an average of about 100 rows per suburb. Longitude and Latitude data was obtained from data.gov.au. School ranking data was obtained from bettereducation.