# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Display summary statistics for the entire data frame
summary(NY_House_Dataset)
##  BROKERTITLE            TYPE               PRICE                BEDS       
##  Length:4801        Length:4801        Min.   :2.494e+03   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.:4.990e+05   1st Qu.: 2.000  
##  Mode  :character   Mode  :character   Median :8.250e+05   Median : 3.000  
##                                        Mean   :2.357e+06   Mean   : 3.357  
##                                        3rd Qu.:1.495e+06   3rd Qu.: 4.000  
##                                        Max.   :2.147e+09   Max.   :50.000  
##       BATH         PROPERTYSQFT     ADDRESS             STATE          
##  Min.   : 0.000   Min.   :  230   Length:4801        Length:4801       
##  1st Qu.: 1.000   1st Qu.: 1200   Class :character   Class :character  
##  Median : 2.000   Median : 2184   Mode  :character   Mode  :character  
##  Mean   : 2.374   Mean   : 2184                                        
##  3rd Qu.: 3.000   3rd Qu.: 2184                                        
##  Max.   :50.000   Max.   :65535                                        
##  ADMINISTRATIVE_AREA_LEVEL_2   LOCALITY         SUBLOCALITY       
##  Length:4801                 Length:4801        Length:4801       
##  Class :character            Class :character   Class :character  
##  Mode  :character            Mode  :character   Mode  :character  
##                                                                   
##                                                                   
##                                                                   
##  STREET_NAME         LONG_NAME         FORMATTED_ADDRESS     LATITUDE    
##  Length:4801        Length:4801        Length:4801        Min.   :40.50  
##  Class :character   Class :character   Class :character   1st Qu.:40.64  
##  Mode  :character   Mode  :character   Mode  :character   Median :40.73  
##                                                           Mean   :40.71  
##                                                           3rd Qu.:40.77  
##                                                           Max.   :40.91  
##    LONGITUDE     
##  Min.   :-74.25  
##  1st Qu.:-73.99  
##  Median :-73.95  
##  Mean   :-73.94  
##  3rd Qu.:-73.87  
##  Max.   :-73.70
# "NY_House_Dataset" is the dataset
# Summary of two columns: BROKERTITLE (character) and PRICE (numeric)
numeric_summary <- summary(NY_House_Dataset[c("BROKERTITLE", "PRICE")])

# Count the distinct values in every column (numeric columns included, not only categorical ones)
categorical_summary <- sapply(NY_House_Dataset, function(x) length(unique(x)))

# Combine the numeric and categorical summaries
combined_summary <- list(Numeric_Summary = numeric_summary, Categorical_Summary = categorical_summary)

# Print the combined summary
combined_summary
## $Numeric_Summary
##  BROKERTITLE            PRICE          
##  Length:4801        Min.   :2.494e+03  
##  Class :character   1st Qu.:4.990e+05  
##  Mode  :character   Median :8.250e+05  
##                     Mean   :2.357e+06  
##                     3rd Qu.:1.495e+06  
##                     Max.   :2.147e+09  
## 
## $Categorical_Summary
##                 BROKERTITLE                        TYPE 
##                        1036                          13 
##                       PRICE                        BEDS 
##                        1274                          27 
##                        BATH                PROPERTYSQFT 
##                          22                        1445 
##                     ADDRESS                       STATE 
##                        4582                         308 
## ADMINISTRATIVE_AREA_LEVEL_2                    LOCALITY 
##                          29                          11 
##                 SUBLOCALITY                 STREET_NAME 
##                          21                         174 
##                   LONG_NAME           FORMATTED_ADDRESS 
##                        2730                        4550 
##                    LATITUDE                   LONGITUDE 
##                        4196                        4118
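# The distinct-value counts above cover every column, including numeric ones such as PRICE and LATITUDE. A minimal sketch of restricting the count to character columns follows. The data frame `toy` and its rows are made up for illustration; only the column names mirror the dataset.

```r
# Toy data frame with made-up rows; only the column names mirror the dataset.
toy <- data.frame(
  TYPE     = c("Condo", "House", "Condo", "Co-op"),
  LOCALITY = c("New York", "Kings County", "New York", "Queens County"),
  PRICE    = c(315000, 1250000, 699000, 450000),
  BEDS     = c(2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Flag the character columns, then count distinct values in each of them.
is_categorical <- vapply(toy, is.character, logical(1))
categorical_counts <- vapply(
  toy[is_categorical],
  function(x) length(unique(x)),
  integer(1)
)
print(categorical_counts)

# Frequency table for one categorical column, most common value first.
type_counts <- sort(table(toy$TYPE), decreasing = TRUE)
print(type_counts)
```

# On the real data, replacing `toy` with `NY_House_Dataset` should give counts for the character columns only, which is closer to what the name `categorical_summary` suggests.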
# Novel Questions

## Question 1: Relationship between Property Size and Price
# **Context:** Given the dataset includes information on property square footage and price, one might wonder about the relationship between the size of a property and its price.
# **Question:** "Is there a discernible relationship between the square footage of a property and its corresponding price? Does the price tend to increase linearly with the size of the property?"
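# One way to approach Question 1 is to compute a correlation coefficient and fit a simple linear model. The sketch below uses made-up numbers in a small data frame called `toy`; only the column names PROPERTYSQFT and PRICE come from the dataset, and the results say nothing about the real data.

```r
# Made-up rows for illustration; only the column names come from the dataset.
toy <- data.frame(
  PROPERTYSQFT = c(600, 850, 1200, 1500, 2200, 3000),
  PRICE        = c(350000, 480000, 700000, 820000, 1500000, 2600000)
)

# Pearson measures linear association; Spearman only assumes a monotonic
# relationship and is less sensitive to extreme prices.
pearson  <- cor(toy$PROPERTYSQFT, toy$PRICE, method = "pearson")
spearman <- cor(toy$PROPERTYSQFT, toy$PRICE, method = "spearman")

# Simple linear model: the slope estimates the price change per extra square foot.
fit   <- lm(PRICE ~ PROPERTYSQFT, data = toy)
slope <- coef(fit)[["PROPERTYSQFT"]]

print(c(pearson = pearson, spearman = spearman, slope = slope))
```

# A Spearman value well above the Pearson value would hint that price rises with size but not linearly, which is the second half of the question.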

## Question 2: Neighborhood-wise Property Price Distribution
# **Context:** The dataset records location through columns such as LOCALITY and SUBLOCALITY. Exploring how property prices are distributed across these areas could provide insight into regional real estate trends.
# **Question:** "What is the distribution of property prices in different neighborhoods? Are there specific neighborhoods where property prices are consistently higher or lower?"
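# A distribution is described better by quartiles than by a single average. The sketch below computes price quartiles per locality with `tapply()` and `quantile()`. The prices are made up; the two locality labels are borrowed from the dataset only to make the output readable.

```r
# Made-up prices; the locality labels are used only as group names.
toy <- data.frame(
  LOCALITY = c("Queens County", "Queens County", "Queens County",
               "New York County", "New York County", "New York County"),
  PRICE    = c(300000, 400000, 500000, 900000, 1000000, 5000000),
  stringsAsFactors = FALSE
)

# tapply() returns one vector of quartiles per locality.
price_quartiles <- tapply(
  toy$PRICE, toy$LOCALITY,
  quantile, probs = c(0.25, 0.5, 0.75)
)

# Stack the per-locality vectors into a matrix: one row per locality.
quartile_table <- do.call(rbind, price_quartiles)
print(quartile_table)
```

# Comparing the 25% and 75% columns across rows shows whether a locality is consistently expensive or only has a few expensive listings.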

## Question 3: Temporal Trends in Property Prices
# **Context:** Understanding how property prices change over time would be valuable, but it requires a timestamp or other date field. None of the columns listed by summary() above holds dates, so this question cannot be answered from this dataset alone.
# **Question:** "Are there noticeable trends or patterns in property prices over time? Have there been periods of significant increase or decrease in property prices, and can these be attributed to external factors or market conditions?"
# Average property price per locality (the LOCALITY column stands in for neighborhood)
average_price_by_neighborhood <- aggregate(PRICE ~ LOCALITY, data = NY_House_Dataset, FUN = mean)
print(average_price_by_neighborhood)
##           LOCALITY     PRICE
## 1     Bronx County  337656.5
## 2         Brooklyn 1426166.7
## 3         Flatbush  650000.0
## 4     Kings County  864643.5
## 5         New York 3190146.5
## 6  New York County 2579619.2
## 7           Queens  517333.3
## 8    Queens County  443008.5
## 9  Richmond County  447581.9
## 10       The Bronx  330600.0
## 11   United States 1327848.3
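# The table above treats overlapping labels such as "Bronx County" and "The Bronx" as separate groups. A minimal sketch of merging such labels before aggregating follows. The lookup `borough_lookup`, the column name BOROUGH, and the prices are all assumptions made for illustration; the right mapping for the real data would need to be checked against the source.

```r
# Made-up prices; the labels are taken from the aggregate() output.
toy <- data.frame(
  LOCALITY = c("Bronx County", "The Bronx", "Queens", "Queens County", "United States"),
  PRICE    = c(300000, 340000, 500000, 460000, 1300000),
  stringsAsFactors = FALSE
)

# Hypothetical lookup that maps overlapping labels to one borough name.
borough_lookup <- c(
  "Bronx County"  = "Bronx",
  "The Bronx"     = "Bronx",
  "Queens"        = "Queens",
  "Queens County" = "Queens"
)

# Labels missing from the lookup (here "United States") become NA and are
# dropped by the default na.action of the formula interface of aggregate().
toy$BOROUGH <- unname(borough_lookup[toy$LOCALITY])

price_by_borough <- aggregate(PRICE ~ BOROUGH, data = toy, FUN = mean)
print(price_by_borough)
```

# Dropping unmapped rows silently is a design choice; counting them first with `sum(is.na(toy$BOROUGH))` shows how much data the recoding leaves out.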
# "NY_House_Dataset" is the dataset
# Visualizations: a histogram of PRICE and a scatter plot of BEDS vs. PROPERTYSQFT
library(ggplot2)

# Histogram of property prices
ggplot(NY_House_Dataset, aes(x = PRICE)) +
  geom_histogram(binwidth = 100000, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Property Prices", x = "Property Price", y = "Frequency")
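# With prices ranging from about 2.5e+03 to about 2.1e+09, a fixed bin width of 100,000 implies more than 21,000 bins, most of them empty. A common alternative is to bin on a log10 scale. The sketch below shows the idea with base R and made-up prices so that it has no package dependency; in ggplot2 the same effect is usually obtained by adding `scale_x_log10()` to the plot.

```r
# Made-up prices spanning several orders of magnitude.
prices <- c(250000, 400000, 550000, 825000, 900000,
            1500000, 3200000, 12000000, 195000000)

# Number of fixed-width bins of 100,000 needed to cover the range.
n_linear_bins <- ceiling(diff(range(prices)) / 100000)

# Binning log10(price) gives every order of magnitude the same width.
log_hist <- hist(log10(prices), breaks = seq(5, 9, by = 0.5), plot = FALSE)

print(n_linear_bins)
print(data.frame(
  lower_price = 10^head(log_hist$breaks, -1),
  count       = log_hist$counts
))
```

# Eight log-scale bins cover the same range that needs nearly two thousand fixed-width bins here, and the counts remain readable.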

# Scatter plot of bedrooms vs. square footage
ggplot(NY_House_Dataset, aes(x = BEDS, y = PROPERTYSQFT)) +
  geom_point() +
  labs(title = "Scatter Plot of Bedrooms vs. Square Footage", x = "Number of Bedrooms", y = "Property Square Footage")

# **Numeric Summary Explanation:**
# The summary statistics give a snapshot of the central tendency and range of the key variables. The mean property price is about $2.36 million, well above the median of $825,000, and prices range from about $2,494 to about $2.15 billion. A gap this wide between mean and median indicates that a few very expensive listings pull the average up, so the median is the better baseline, and the extreme values should be checked as possible outliers before further analysis.
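# One conventional way to flag candidate outliers is the 1.5 * IQR rule. The sketch below applies it to made-up prices; the rule and its threshold are a common convention, not something the analysis above prescribes.

```r
# Made-up prices with one extreme value.
prices <- c(300000, 450000, 500000, 700000, 825000,
            1200000, 1500000, 2000000000)

# Quartiles and interquartile range.
q   <- quantile(prices, probs = c(0.25, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Values beyond 1.5 * IQR from the quartiles are flagged as candidates.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
is_outlier  <- prices < lower_fence | prices > upper_fence

print(prices[is_outlier])

# The extreme value pulls the mean far above the median.
print(c(mean = mean(prices), median = median(prices)))
```

# Flagged values are candidates for inspection, not automatic deletions; a genuine luxury listing and a data-entry error look the same to this rule.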

# **Novel Questions Explanation:**
# - **Question 1 (Relationship between Property Size and Price):** The script above does not compute a correlation between PROPERTYSQFT and PRICE, so a positive relationship (larger properties commanding higher prices) remains an expectation to test rather than a result. If it holds, it matters to both buyers and sellers because it quantifies how much property size contributes to market value.
# - **Question 2 (Neighborhood-wise Property Price Distribution):** Average prices differ widely by locality, from about $330,600 in The Bronx to about $3.19 million in New York, which indicates that real estate markets differ substantially by location. These differences inform decisions such as where to invest or which areas fit a given budget. Averages alone do not describe the full distribution, so medians and quartiles per locality would be needed to answer the question completely.

# **Aggregation Function Explanation:**
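# Aggregating average property prices by LOCALITY shows pricing patterns at a local level. Localities with higher averages may be more affluent or desirable, while those with lower averages may offer more affordable housing. The grouping needs care, though: the output lists overlapping labels (for example "Bronx County" and "The Bronx", or "Queens" and "Queens County") as separate groups, and "United States" is not a neighborhood at all.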

# **Visual Summaries Explanation:**
# - **Histogram of Property Prices:** The right-skewed distribution shows that most properties fall in the lower price range, with a long tail of expensive listings. This is consistent with the median ($825,000) sitting far below the mean (about $2.36 million).
# - **Scatter Plot of Bedrooms vs. Square Footage:** The upward trend in the scatter plot matches the intuitive expectation that larger properties tend to have more bedrooms. The plot shows this only visually; a correlation coefficient would be needed to quantify it. This relationship can influence property valuations and buyer preferences.
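# In the summary() output, the median, mean, and third quartile of PROPERTYSQFT all print as 2184, which may indicate that many rows share a single value. If so, the scatter plot would show a horizontal band at that value. The sketch below, on made-up rows, measures how common the most frequent value is and compares the correlation with and without those rows. Whether the repeated value is a placeholder in the real data is an assumption that would need checking.

```r
# Made-up rows in which one square-footage value repeats often.
toy <- data.frame(
  BEDS         = c(1, 3, 2, 5, 2, 4, 1, 1, 6, 3),
  PROPERTYSQFT = c(900, 2184, 1400, 2184, 2184, 3100, 2184, 750, 2184, 2184)
)

# Most frequent value and the share of rows that carry it.
value_counts      <- sort(table(toy$PROPERTYSQFT), decreasing = TRUE)
most_common_value <- as.numeric(names(value_counts)[1])
most_common_share <- value_counts[[1]] / nrow(toy)

# Correlation on all rows, then on rows without the repeated value.
cor_all          <- cor(toy$BEDS, toy$PROPERTYSQFT)
keep             <- toy$PROPERTYSQFT != most_common_value
cor_without_mode <- cor(toy$BEDS[keep], toy$PROPERTYSQFT[keep])

print(c(value = most_common_value, share = most_common_share))
print(c(all_rows = cor_all, without_repeated_value = cor_without_mode))
```

# A large difference between the two coefficients would suggest that the repeated value, not the underlying relationship, drives much of what the plot shows.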

# **Further Questions based on Insights:**
# - **For the Relationship between Property Size and Price:**
#   - What other factors, such as location or property features, contribute to variations in property prices?
#   - How does the relationship between size and price differ for different property types (e.g., condos, houses)?
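# The question about property types can be explored by fitting one linear model per type. The sketch below does this with `split()` and `lm()` on made-up rows; the type labels "Condo" and "House" are placeholders and may not match the values stored in the TYPE column.

```r
# Made-up rows; the type labels are placeholders.
toy <- data.frame(
  TYPE         = rep(c("Condo", "House"), each = 4),
  PROPERTYSQFT = c(500, 700, 900, 1100, 1200, 1800, 2400, 3000),
  PRICE        = c(400000, 560000, 700000, 900000,
                   600000, 850000, 1150000, 1400000),
  stringsAsFactors = FALSE
)

# One data frame per property type.
by_type <- split(toy, toy$TYPE)

# Slope of PRICE on PROPERTYSQFT within each type (price per extra square foot).
slopes <- vapply(
  by_type,
  function(d) coef(lm(PRICE ~ PROPERTYSQFT, data = d))[["PROPERTYSQFT"]],
  numeric(1)
)
print(slopes)
```

# Different slopes across types would mean that an extra square foot is valued differently depending on the kind of property, which a single pooled model would hide.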
  
# - **For Neighborhood-wise Property Price Distribution:**
#   - What socio-economic factors or amenities might explain the observed neighborhood-wise price differences?
#   - Are there any historical trends in property prices within specific neighborhoods that could impact future predictions?

# - **For the Aggregation by Neighborhood:**
#   - What are the characteristics of neighborhoods with exceptionally high or low average property prices?
#   - How stable are the average prices over time for different neighborhoods?

# - **For Visual Summaries:**
#   - Can we identify specific price ranges that dominate the market, and what types of properties fall within these ranges?
#   - How does the correlation between bedrooms and square footage vary across different property categories?