Data Dive - Confidence Intervals

# Loading necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")


# Pair 1: Property Size vs. Property Price
# Create a scatter plot
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point() +
  labs(x = "Property Size (sqft)", y = "Property Price ($)") +
  ggtitle("Relationship between Property Size and Price")

# Calculate correlation coefficient
correlation1 <- cor(NY_House_Dataset$PROPERTYSQFT, NY_House_Dataset$PRICE)
print(paste("Correlation coefficient between Property Size and Price:", correlation1))

## [1] "Correlation coefficient between Property Size and Price: 0.110888768508902"

# Pair 2: Number of Bedrooms vs. Property Price
# Create a boxplot
ggplot(NY_House_Dataset, aes(x = factor(BEDS), y = PRICE)) +
  geom_boxplot() +
  labs(x = "Number of Bedrooms", y = "Property Price ($)") +
  ggtitle("Relationship between Number of Bedrooms and Price")

# Calculate correlation coefficient
correlation2 <- cor(NY_House_Dataset$BEDS, NY_House_Dataset$PRICE)
print(paste("Correlation coefficient between Number of Bedrooms and Price:", correlation2))

## [1] "Correlation coefficient between Number of Bedrooms and Price: 0.0521891291488738"

# Pair 3: Number of Bathrooms vs. Property Price
# Create a boxplot
ggplot(NY_House_Dataset, aes(x = factor(BATH), y = PRICE)) +
  geom_boxplot() +
  labs(x = "Number of Bathrooms", y = "Property Price ($)") +
  ggtitle("Relationship between Number of Bathrooms and Price")

# Calculate correlation coefficient
correlation3 <- cor(NY_House_Dataset$BATH, NY_House_Dataset$PRICE)
print(paste("Correlation coefficient between Number of Bathrooms and Price:", correlation3))

## [1] "Correlation coefficient between Number of Bathrooms and Price: 0.0793705765090217"

# 95% confidence interval for the mean property price (t-based)
price_mean <- mean(NY_House_Dataset$PRICE)
price_sd <- sd(NY_House_Dataset$PRICE)
n <- length(NY_House_Dataset$PRICE)
margin_of_error <- qt(0.975, df = n-1) * (price_sd / sqrt(n))
confidence_interval <- c(price_mean - margin_of_error, price_mean + margin_of_error)
# Collapse the two bounds into one string so the interval prints on a single line
print(paste("Confidence Interval for Property Price (95%):",
            paste(confidence_interval, collapse = " to ")))

## [1] "Confidence Interval for Property Price (95%): 1469780.11677317 to 3244100.22523891"

# Explanation to the reader:

# For Pair 1 (Property Size vs. Property Price), the correlation coefficient is about 0.11, which indicates only a weak positive linear relationship: larger properties tend to have somewhat higher prices, but size alone explains very little of the variation in price (r squared is roughly 0.01). Because Pearson's correlation is sensitive to extreme values, further investigation should examine outliers and potentially influential points.
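
# One way to follow up on the outlier concern is to compare Pearson's correlation with a rank-based (Spearman) correlation and with a correlation on the log scale. The sketch below runs on simulated data, not on NY_House_Dataset: the data frame `toy`, its values, and the injected extreme price are all hypothetical and only illustrate the technique.

```r
# Hypothetical stand-in for NY_House_Dataset; values are simulated, not real listings
set.seed(42)
sqft  <- round(runif(200, 500, 5000))
price <- sqft * 400 * exp(rnorm(200, sd = 0.3))
price[1] <- 2e9  # one extreme listing, to mimic a heavy right tail
toy <- data.frame(PROPERTYSQFT = sqft, PRICE = price)

# Pearson is pulled around by the single extreme value
pearson <- cor(toy$PROPERTYSQFT, toy$PRICE)

# Spearman works on ranks, so one extreme value has little effect
spearman <- cor(toy$PROPERTYSQFT, toy$PRICE, method = "spearman")

# A log transform compresses the right tail before computing Pearson
log_pearson <- cor(log(toy$PROPERTYSQFT), log(toy$PRICE))

print(c(pearson = pearson, spearman = spearman, log_pearson = log_pearson))
```

# If the rank-based or log-scale correlation on the real data turns out much larger than 0.11, that would suggest a few extreme listings are masking the relationship.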

# For Pair 2 (Number of Bedrooms vs. Property Price), the boxplot shows median property prices that vary across different numbers of bedrooms. However, the correlation coefficient (about 0.05) indicates a very weak positive linear relationship, so the number of bedrooms alone is not a strong predictor of property price.

# For Pair 3 (Number of Bathrooms vs. Property Price), as with bedrooms, median property prices vary across different numbers of bathrooms. The correlation coefficient (about 0.08) indicates a very weak positive linear relationship, so the number of bathrooms alone is not a strong predictor of property price.
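
# Since the correlations are small, it can help to report a confidence interval for each coefficient rather than a point estimate alone. Base R's cor.test() returns one for Pearson's r. This is a minimal sketch on simulated data: `toy` and its columns are hypothetical and only mirror the column names used above.

```r
# Hypothetical data with a deliberately weak bedrooms-price relationship
set.seed(1)
toy <- data.frame(BEDS = sample(1:6, 300, replace = TRUE))
toy$PRICE <- 500000 + 20000 * toy$BEDS + rnorm(300, sd = 400000)

# cor.test() gives the estimate, a p-value, and a 95% confidence interval
ct <- cor.test(toy$BEDS, toy$PRICE, conf.level = 0.95)
print(ct$estimate)  # sample correlation
print(ct$conf.int)  # interval for the population correlation
```

# An interval that includes zero would mean the data are compatible with no linear relationship at all.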

# The 95% confidence interval for the mean property price runs from roughly $1.47 million to $3.24 million. It comes from a procedure that captures the true population mean in 95% of repeated samples, so it describes the uncertainty in our estimate of the mean, not the range of individual property prices. The interval is wide, which reflects how much prices vary in this dataset.
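
# As a cross-check, the manual t-based interval can be compared with the one reported by t.test(), and with a percentile bootstrap interval, which does not rely on the t approximation. The sketch below uses simulated right-skewed prices: the vector `price` and its parameters are hypothetical, not taken from the dataset.

```r
# Hypothetical right-skewed prices (log-normal), standing in for NY_House_Dataset$PRICE
set.seed(7)
price <- rlnorm(500, meanlog = 13.5, sdlog = 1)

# Manual t-based interval, same formula as in the analysis above
n <- length(price)
moe <- qt(0.975, df = n - 1) * sd(price) / sqrt(n)
manual_ci <- c(mean(price) - moe, mean(price) + moe)

# The same interval from t.test()
builtin_ci <- as.numeric(t.test(price, conf.level = 0.95)$conf.int)
print(all.equal(manual_ci, builtin_ci))

# Percentile bootstrap: resample with replacement and take the middle 95% of the means
boot_means <- replicate(2000, mean(sample(price, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))
print(boot_ci)
```

# With strongly skewed data, a noticeable gap between the t-based and bootstrap intervals is a hint that the t approximation deserves a closer look.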

# Further questions for investigation could include exploring interactions between variables (e.g., bedrooms and bathrooms) and their combined effects on property price, as well as considering additional explanatory variables that may influence property price. Additionally, analyzing outliers and influential points in more detail could provide insights into unusual patterns in the data.
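
# The interaction question could be explored with a linear model that includes a BEDS-by-BATH term. The following is a sketch under stated assumptions, not a fitted model of the real data: `toy` is simulated, and the coefficients used to generate it are made up for illustration.

```r
# Hypothetical data in which bedrooms and bathrooms interact by construction
set.seed(3)
toy <- data.frame(
  BEDS = sample(1:5, 400, replace = TRUE),
  BATH = sample(1:4, 400, replace = TRUE)
)
toy$PRICE <- 200000 + 50000 * toy$BEDS + 80000 * toy$BATH +
  30000 * toy$BEDS * toy$BATH + rnorm(400, sd = 100000)

# BEDS * BATH expands to BEDS + BATH + BEDS:BATH
fit <- lm(PRICE ~ BEDS * BATH, data = toy)

# 95% confidence intervals for all coefficients, including the interaction
print(confint(fit, level = 0.95))
```

# On the real dataset, an interval for BEDS:BATH that excludes zero would suggest the effect of an extra bedroom depends on the number of bathrooms.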

Abhinandhan Velagapudi

2024-02-19