PART 1

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)

Houses10 <- read_csv("Houses10.csv")
## Rows: 33656 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ADDRESS, SUBURB, Bedroom, Bathroom, GARAGE, NEAREST_STN, NEAREST_SCH
## dbl (7): PRICE, LAND_AREA, FLOOR_AREA, CBD_DIST, NEAREST_STN_DIST, POSTCODE,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1. (8 points) Load and explore the dataset assigned to you for your project to answer the following questions. You can use the glimpse function from the tidyverse package to peek into the dataset and find answers to these questions.

glimpse(Houses10)
## Rows: 33,656
## Columns: 14
## $ ADDRESS          <chr> "1 Acorn Place", "1 Addis Way", "1 Ainsley Court", "1…
## $ SUBURB           <chr> "South Lake", "Wandi", "Camillo", "Bellevue", "Lockri…
## $ PRICE            <dbl> 565000, 365000, 287000, 255000, 325000, 409000, 40000…
## $ Bedroom          <chr> "More_bedroom", "Small_Bedroom", "Small_Bedroom", "Sm…
## $ Bathroom         <chr> "More_bathroom", "More_bathroom", "Small_bathroom", "…
## $ GARAGE           <chr> "2", "2", "1", "2", "2", "1", "2", "2", "3", "8", "6"…
## $ LAND_AREA        <dbl> 600, 351, 719, 651, 466, 759, 386, 468, 875, 552, 253…
## $ FLOOR_AREA       <dbl> 160, 139, 86, 59, 131, 118, 132, 158, 168, 126, 241, …
## $ CBD_DIST         <dbl> 18300, 26900, 22600, 17900, 11200, 27300, 28200, 4170…
## $ NEAREST_STN      <chr> "Cockburn Central Station", "Kwinana Station", "Chall…
## $ NEAREST_STN_DIST <dbl> 1800, 4900, 1900, 3600, 2000, 1000, 3700, 1100, 2500,…
## $ POSTCODE         <dbl> 6164, 6167, 6111, 6056, 6054, 6112, 6112, 6169, 6022,…
## $ NEAREST_SCH      <chr> "LAKELAND SENIOR HIGH SCHOOL", "ATWELL COLLEGE", "KEL…
## $ NEAREST_SCH_DIST <dbl> 0.8283386, 5.5243244, 1.6491782, 1.5714009, 1.5149216…

a. (2 points) How many cases (instances/rows) are in your dataset?

My dataset has 33,656 rows.

b. (2 points) How many variables (attributes/columns) are in your dataset?

My dataset has 14 columns.

c. (2 points) Identify the main response (also known as dependent) variable in the data.

My target variable is PRICE.

d. (2 points) Does your dataset contain missing values? Which variables contain missing values?

anyNA(Houses10)
## [1] FALSE
colSums(is.na(Houses10))
##          ADDRESS           SUBURB            PRICE          Bedroom 
##                0                0                0                0 
##         Bathroom           GARAGE        LAND_AREA       FLOOR_AREA 
##                0                0                0                0 
##         CBD_DIST      NEAREST_STN NEAREST_STN_DIST         POSTCODE 
##                0                0                0                0 
##      NEAREST_SCH NEAREST_SCH_DIST 
##                0                0

No, my data set does not contain missing values.
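
`GARAGE` was read as a character column even though its visible values are digits, so it may be worth checking whether missing values are stored as text placeholders that `is.na()` does not catch. The sketch below uses a small made-up tibble (`toy`) with a hypothetical `"NULL"` placeholder. Whether Houses10 actually contains such placeholders is an assumption that would need to be checked on the real data.

```r
library(dplyr)

# Toy data standing in for Houses10; the "NULL" placeholder is hypothetical
toy <- tibble(GARAGE = c("2", "1", "NULL", "3", "NULL"))

# is.na() finds nothing because "NULL" is an ordinary string
n_true_na <- sum(is.na(toy$GARAGE))

# Count entries that cannot be parsed as a number
n_non_numeric <- toy %>%
  filter(is.na(suppressWarnings(as.numeric(GARAGE)))) %>%
  nrow()

n_true_na
n_non_numeric
```

Running the same two counts on `Houses10$GARAGE` would show whether the "no missing values" answer also holds for text placeholders.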

2. (6 points) Identify one research question that you plan to answer using your dataset.

Does the amount of land area significantly affect the price of a house?

a. (3 points) Write your research question and the corresponding hypothesis to be tested.

Does the amount of land area significantly affect the price of a house? H0: The amount of land area has no significant effect on house price. Ha: The amount of land area has a significant effect on house price.

b. (3 points) Identify and list the relevant variables that will be used in the analysis.

Land area and price.

3. (12 points)

a. (2 points) What type of variable is your main response variable? (Categorical/Numerical)

My main response variable is numerical.

b. (4 points) Create 2 different histograms of your response variable with 2 different bin widths.

Houses10 %>%
  ggplot(aes(PRICE)) +
  geom_histogram(binwidth = 100000, colour="black",fill="pink")

Houses10 %>%
  ggplot(aes(PRICE)) +
  geom_histogram(binwidth = 50000, colour="black",fill="purple")

c. (2 points) Create a density plot of your response variable.

Houses10 %>%
  ggplot(aes(PRICE)) +
  geom_density()

d. (2 points) Comment on the shape, modality and potential outliers.

The distribution is unimodal and right-skewed, with potential outliers at the high end where prices are above 2,000,000.

e. (2 points) What measures should you use to describe the center and the spread of your response variable?

Because the distribution is right-skewed with potential high-end outliers, I will describe the center of my response variable with the median and the spread with the IQR.
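
As a minimal sketch of why the median and IQR suit skewed data, the code below computes them next to the mean and standard deviation on a small made-up price vector (`toy`, not the real Houses10 data) containing one very expensive house.

```r
library(dplyr)

# Toy prices standing in for Houses10$PRICE (values are made up)
toy <- tibble(PRICE = c(250000, 300000, 400000, 550000, 2500000))

robust_summary <- toy %>%
  summarise(Mean   = mean(PRICE),
            SD     = sd(PRICE),
            Median = median(PRICE),
            IQR    = IQR(PRICE))

robust_summary
```

In this toy example the single high price pulls the mean well above the median, while the median and IQR are unaffected by how extreme that one value is.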

4. (12 points) Graphing some variables in your dataset.

a. (9 points) Create at least 3 graphs displaying the distributions of at least three different variables in your data set NOT including your response variable that relate to research question(s) you might have for your dataset. For example, if the variable is categorical, report a bar chart, while for quantitative variables, report histogram, dotplot or boxplot. Note: You need to make a graph for each variable. You shouldn’t use the same variable for different graphs.

Houses10 %>%
  ggplot(aes(CBD_DIST)) +
  geom_histogram(binwidth= 1000,colour= "black", fill="red")

Houses10 %>%
  ggplot(aes(Bedroom)) +
  geom_bar(colour= "black", fill="yellow")

Houses10 %>%
  ggplot(aes(y=FLOOR_AREA)) +
  geom_boxplot(fill="lightblue")

b. (3 points) Write a few sentences about what you have learned from your graphs.

Histogram (CBD_DIST): Most houses are clustered between 10,000 and 15,000 units from the CBD, with bin counts of around 1,500 houses, and only a few houses are as far as 60,000 units away.

Bar chart (Bedroom): There is a clear gap between the two bedroom categories. Around 20,000 houses are in the More_bedroom category, while only about 15,000 are in the Small_Bedroom category.

Boxplot (FLOOR_AREA): The median floor area is around 200 units.

5. (12 points)

a. (8 points) Create at least 2 graphs that include at least 2 variables in each, one of which is the response variable identified in 1c.

Houses10 %>%
  ggplot(aes(PRICE, LAND_AREA)) +
  geom_point()

Houses10 %>%
  ggplot(aes(PRICE,fill= Bedroom)) +
  geom_boxplot()

b. (4 points) Write a few sentences about what you have learned from your graphs.

The relationship between price and land area is very weak and non-linear: most houses have a low land area regardless of their price.

The boxplot of price by bedroom category shows that house prices tend to be higher for houses with more bedrooms.

PART 2

1. (8 points) Summary of the Response Variable. Compute and report descriptive statistics (mean, standard deviation, and five-number summary) for your response variable identified in Part I. Create appropriate visualizations (histogram or boxplot) and describe the shape of the distribution. Discuss what these summaries reveal about your response variable.

Houses10 %>%
  summarise(Mean= mean(PRICE),
            SD= sd(PRICE),
            Min= min(PRICE),
            Q1= quantile(PRICE,0.25),
            Median= median (PRICE),
            Q3= quantile(PRICE,0.75),
            Max= max(PRICE))

The average house price is around 637,072, with a standard deviation of 355,825.6 and a median of 535,500.

Houses10 %>%
  ggplot(aes(PRICE)) +
  geom_histogram(binwidth=100000, fill= "lightpink", color= "black")

The histogram shows a unimodal distribution that is skewed to the right. The mean (637,072) is higher than the median (535,500), which indicates that high-priced outliers pull the mean upward, and the large standard deviation (355,826) shows that house prices vary widely.

2. (12 points) Confidence Interval for the Population Mean. Construct and interpret a 95% confidence interval for the population mean of your response variable. Assume your sample size is large enough for the Central Limit Theorem to apply, even if the data are not perfectly normal.

a. (4 points) Report the sample mean, standard deviation, and sample size (n) for your response variable.

x_bar= mean(Houses10$PRICE)
SD=sd(Houses10$PRICE)
n = nrow(Houses10)

tibble(n = n,
       Mean = x_bar,
       Sd=SD)

b. (4 points) Compute the standard error (SE) and the margin of error (ME) manually using R arithmetic.

SE = SD/sqrt(n)

zstar = 1.96

SE
## [1] 1939.572
ME = zstar * SE
ME
## [1] 3801.562
tibble( SE = SE,
        ME = ME)

c. (4 points) Report and interpret your confidence interval in the context of your dataset (e.g., “We are 95% confident that the true mean house price is captured between . . .”).

library(infer)

Houses10 %>%
  t_test(response = PRICE,
         conf_int = TRUE,
         conf_level = 0.95)

We are 95% confident that the true mean of PRICE is between 633,270.4 (the lower bound of the confidence interval) and 640,873.6 (the upper bound).
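
The interval can also be reproduced by hand from the standard error computed earlier, as the sample mean plus or minus the margin of error. The sketch below hard-codes the rounded values reported in this report (`x_bar`, `SE`) and uses the z critical value 1.96. `t_test()` uses a t critical value instead, which should be nearly identical with this many rows.

```r
# Values reported earlier in this report
x_bar <- 637072     # sample mean of PRICE (rounded)
SE    <- 1939.572   # standard error, SD / sqrt(n)
zstar <- 1.96       # critical value for 95% confidence

# Margin of error and interval bounds
ME <- zstar * SE
ci <- c(lower = x_bar - ME, upper = x_bar + ME)
round(ci, 1)
```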

3. (8 points) Comparing Groups. Select one categorical variable (e.g., gender, region, smoker/nonsmoker, etc.) and summarize your response variable by the groups defined by this variable.

• Report and compare the group means and standard deviations.

• Create side-by-side boxplots or bar charts to visualize group differences.

• Comment on any patterns or differences between the groups.

Houses10%>%
  filter(!is.na(PRICE), !is.na(Bedroom)) %>%
  group_by(Bedroom) %>%
  summarise(Mean = mean(PRICE),
            SD = sd(PRICE))

The average price of houses in the More_bedroom group is higher than in the Small_Bedroom group. The same applies to the standard deviation.

Houses10 %>%
  ggplot(aes(x= Bedroom)) +
  geom_bar(fill= "blue", color= "black")

The bar for the More_bedroom group is taller than the bar for the Small_Bedroom group, so there are more houses in the More_bedroom group.
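
A bar chart of counts shows how many houses fall in each group but not how their prices differ. A side-by-side boxplot of PRICE by Bedroom is one way to show the group difference in the response itself. The sketch below uses a small made-up data frame (`toy`) in place of Houses10, and the object name `p` is only illustrative.

```r
library(ggplot2)

# Toy data standing in for Houses10 (values are made up)
toy <- data.frame(
  Bedroom = rep(c("More_bedroom", "Small_Bedroom"), each = 4),
  PRICE   = c(600000, 750000, 820000, 1200000,
              300000, 380000, 450000, 700000)
)

# One box per Bedroom group, with PRICE on the vertical axis
p <- ggplot(toy, aes(x = Bedroom, y = PRICE, fill = Bedroom)) +
  geom_boxplot() +
  labs(x = "Bedroom group", y = "PRICE")
p
```

Replacing `toy` with `Houses10` should give the corresponding plot for the real data.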

4. (17 points) Exploring Relationships Using Correlation and Regression. Use correlation and regression to describe and model the association between your response variable and two numerical explanatory variables.

a. (6 points) Create two scatterplots displaying the relationship between the response variable and two different numerical explanatory variables that you think might be related.

# Scatterplot 1

Houses10 %>%
  ggplot(aes(x= FLOOR_AREA, y= PRICE)) +
  geom_point()+
  labs(title = "PRICE vs. FLOOR_AREA", x = "FLOOR_AREA", y= "PRICE")

# Scatterplot 2

Houses10 %>%
  ggplot(aes(x= NEAREST_SCH_DIST, y= PRICE)) +
  geom_point()+
  labs(title = "PRICE vs. NEAREST_SCH_DIST", x = "NEAREST_SCH_DIST", y= "PRICE")

• Describe the direction, form, and strength of the relationship in each plot.

Scatterplot 1

As the floor area increases the price also increases so there is a positive relationship between floor area and price. The strength is moderate and the form is linear.

Scatterplot 2

As the distance to the nearest school increases, the price of the houses decreases slightly, so there is a negative relationship between distance to the nearest school and price. The strength is weak and the form is non-linear.


b. (7 points) Fit two simple linear regression models (one for each explanatory variable) using the lm() function.

lm_FA <- lm(PRICE ~ FLOOR_AREA, data = Houses10)

summary(lm_FA)
## 
## Call:
## lm(formula = PRICE ~ FLOOR_AREA, data = Houses10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2023603  -167448   -63342    98386  2037674 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 140367.71    4434.73   31.65   <2e-16 ***
## FLOOR_AREA    2706.81      22.49  120.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 297500 on 33654 degrees of freedom
## Multiple R-squared:  0.3008, Adjusted R-squared:  0.3008 
## F-statistic: 1.448e+04 on 1 and 33654 DF,  p-value: < 2.2e-16
lm_NSCHD <- lm(PRICE ~ NEAREST_SCH_DIST, data = Houses10)

summary(lm_NSCHD)
## 
## Call:
## lm(formula = PRICE ~ NEAREST_SCH_DIST, data = Houses10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -590331 -228900 -100902  126304 1803469 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        645189       2797 230.647  < 2e-16 ***
## NEAREST_SCH_DIST    -4472       1111  -4.026 5.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 355700 on 33654 degrees of freedom
## Multiple R-squared:  0.0004814,  Adjusted R-squared:  0.0004517 
## F-statistic: 16.21 on 1 and 33654 DF,  p-value: 5.683e-05

• Report the correlation coefficient (r) and coefficient of determination (R2) for each model.

#coefficient of determination for FLOOR_AREA
summary(lm_FA)$r.squared
## [1] 0.3008489
#correlation coefficient (r)
 cor(Houses10$PRICE,Houses10$FLOOR_AREA, use = "complete.obs")
## [1] 0.5484969
#coefficient of determination for NEAREST_SCH_DIST
summary(lm_NSCHD)$r.squared
## [1] 0.0004814228
#correlation coefficient (r)
 cor(Houses10$PRICE,Houses10$NEAREST_SCH_DIST, use = "complete.obs")
## [1] -0.02194135
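
In a simple linear regression with one explanatory variable, R2 equals the square of the correlation coefficient r, which gives a quick consistency check on the four numbers above. The sketch below hard-codes the two reported correlations. The names `r_FA` and `r_NSCHD` are only illustrative.

```r
# Correlations reported above
r_FA    <- 0.5484969     # cor(PRICE, FLOOR_AREA)
r_NSCHD <- -0.02194135   # cor(PRICE, NEAREST_SCH_DIST)

# Squaring r should reproduce the R-squared of each simple regression
r_FA^2
r_NSCHD^2
```

The squared values can be compared with `summary(lm_FA)$r.squared` and `summary(lm_NSCHD)$r.squared`.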

• Write out each model equation and explain which variable produces the better model and why.

\(\widehat{\text{PRICE}} = 140367.71 + 2706.81 \times \text{FLOOR\_AREA}\)

\(\widehat{\text{PRICE}} = 645189 - 4472 \times \text{NEAREST\_SCH\_DIST}\)

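The model of PRICE on FLOOR_AREA is the better model because its adjusted R-squared (0.3008) is much higher than that of the model of PRICE on NEAREST_SCH_DIST (0.0004517). Floor area explains about 30% of the variation in price, while distance to the nearest school explains almost none.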


c. (4 points) Using the better model, make a prediction for your response variable based on a specific explanatory value.

• Show your R code and interpret your result in the context of your dataset.

lm_FA <- lm(PRICE ~ FLOOR_AREA, data = Houses10)
new <- tibble(FLOOR_AREA = 5000)

predict_PRICE <- predict(lm_FA, newdata = new)

print(predict_PRICE)
##        1 
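## 13674431

Using the FLOOR_AREA model, a house with a floor area of 5000 units has a predicted price of about 13,674,431. This floor area is far larger than the values visible in the glimpse output (59 to 241), so the prediction is likely an extrapolation and should be treated with caution.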
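
Before relying on a prediction, it can help to check whether the chosen explanatory value lies inside the observed range and to report a prediction interval rather than a single number. The sketch below does both on a small made-up data frame (`toy`, not the real Houses10 data). The names `fit`, `new_area`, `in_range`, and `pred` are only illustrative.

```r
# Toy data standing in for Houses10 (values are made up)
toy <- data.frame(
  FLOOR_AREA = c(60, 90, 120, 150, 180, 240),
  PRICE      = c(250000, 330000, 410000, 520000, 600000, 790000)
)

fit <- lm(PRICE ~ FLOOR_AREA, data = toy)

# Check whether the new value lies inside the observed range
new_area <- 200
in_range <- new_area >= min(toy$FLOOR_AREA) && new_area <= max(toy$FLOOR_AREA)

# Point prediction with a 95% prediction interval
pred <- predict(fit, newdata = data.frame(FLOOR_AREA = new_area),
                interval = "prediction", level = 0.95)

in_range
pred
```

On the real data, `range(Houses10$FLOOR_AREA)` would show whether a value such as 5000 falls outside what the model was fitted on.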

5. (5 points) The Whole Story. Combine your results from Part I and Part II into a single, cohesive story-telling report that summarizes your findings. Your final report should:

• Restate your research questions.

• Summarize your descriptive, inferential, and regression findings.

• Include figures and tables as supporting evidence.

• Provide a clear and meaningful conclusion connecting your analysis to your research questions.

Introduction

My dataset contains information about houses that were sold in Perth, Australia. It contains variables such as price, number of bedrooms and bathrooms, garage, land area, floor area, and distance to schools and train stations. These variables describe different features of each property. The goal is to examine how some of these variables impact house price.

Research Question and Hypotheses

Research question: Does the amount of land area significantly affect the price of a house? Hypotheses: H0: The amount of land area has no significant effect on house price. Ha: The amount of land area has a significant effect on house price.

Exploratory Data Analysis

The response variable is PRICE, which is numerical, unimodal, and right-skewed. Descriptive statistics: the mean house price is approximately 637,072, which is higher than the median of 535,500. The histograms and other plots above illustrate that most houses are clustered in the lower price brackets, with a wide spread indicated by the standard deviation of 355,826.

Correlation

Floor Area vs. Price: This proved to be a much stronger predictor than school distance. The correlation coefficient (r) was 0.548, indicating a moderate positive linear relationship.

Predictive Modeling

Model Performance: The regression model using FLOOR_AREA (R2 = 0.3008) substantially outperformed the model using school distance (R2 = 0.00048).

Methodology

The model used was the simple linear regression model.

Results

\(\widehat{\text{PRICE}} = 140367.71 + 2706.81 \times \text{FLOOR\_AREA}\)

Interpretation

The model of PRICE on FLOOR_AREA is the better model because its adjusted R-squared (0.3008) is much higher than that of the model of PRICE on NEAREST_SCH_DIST (0.0004517).

Conclusion

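While the initial research question focused on LAND_AREA, the scatterplot of price against land area showed only a very weak relationship, whereas FLOOR_AREA had a moderate positive linear relationship with price (r = 0.548). This suggests that floor area is more closely related to price in the Perth market than total lot size, although a formal test of the hypotheses about LAND_AREA remains to be done. Categorical analysis showed that “More_bedroom” houses had a higher average price than “Small_Bedroom” houses.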
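
One way to test the hypotheses about LAND_AREA directly is to fit the same kind of simple linear regression used for FLOOR_AREA and read off the slope's p-value. The sketch below runs on a small made-up data frame (`toy`), so its numbers say nothing about the real data. The names `lm_LA`, `slope_row`, and `r_LA` are only illustrative.

```r
# Toy data standing in for Houses10 (values are made up)
toy <- data.frame(
  LAND_AREA = c(300, 450, 500, 600, 720, 900, 1500, 4000),
  PRICE     = c(420000, 380000, 610000, 450000, 530000, 700000, 480000, 650000)
)

# Simple linear regression of PRICE on LAND_AREA
lm_LA <- lm(PRICE ~ LAND_AREA, data = toy)

# Slope estimate, standard error, t statistic and p-value for H0: slope = 0
slope_row <- summary(lm_LA)$coefficients["LAND_AREA", ]
slope_row

# Correlation and R-squared, comparable with the FLOOR_AREA results
r_LA <- cor(toy$PRICE, toy$LAND_AREA)
c(r = r_LA, R2 = summary(lm_LA)$r.squared)
```

With `data = Houses10`, a slope p-value below 0.05 would be evidence against H0 at the 5% level, and R2 would show how much of the variation in price land area accounts for.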

Discussion

Possible improvements include combining floor area, suburb location, and bedroom count in one model to explain more of the variance in house prices.
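
As a minimal sketch of that improvement, the code below compares a simple regression with a multiple regression that adds the Bedroom category, again on a small made-up data frame (`toy`). SUBURB is left out only to keep the example short and could be added as another term in the formula. The names `lm_simple` and `lm_multi` are only illustrative.

```r
# Toy data standing in for Houses10 (values are made up)
toy <- data.frame(
  FLOOR_AREA = c(80, 110, 130, 150, 170, 200, 220, 260),
  Bedroom    = c("Small_Bedroom", "Small_Bedroom", "More_bedroom", "Small_Bedroom",
                 "More_bedroom", "Small_Bedroom", "More_bedroom", "More_bedroom"),
  PRICE      = c(300000, 360000, 450000, 470000, 560000, 640000, 690000, 850000)
)

# One explanatory variable versus two
lm_simple <- lm(PRICE ~ FLOOR_AREA, data = toy)
lm_multi  <- lm(PRICE ~ FLOOR_AREA + Bedroom, data = toy)

# Adjusted R-squared penalises extra terms, so it is the fairer comparison
c(simple   = summary(lm_simple)$adj.r.squared,
  multiple = summary(lm_multi)$adj.r.squared)
```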

Reference

This data was scraped from http://house.speakingsame.com/ and includes data from 322 Perth suburbs, resulting in an average of about 100 rows per suburb. Longitude and Latitude data was obtained from data.gov.au. School ranking data was obtained from bettereducation.