DARP ASSIGNMENT

Author

NAVNEETH D - CB.BU.P2ASB22106

MARKET ANALYSIS

The historical market data set for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.

The inputs are as follows:
- X1 = the transaction date (for example, 2013.250 = 2013 March, 2013.500 = 2013 June, etc.)
- X2 = the house age (unit: year)
- X3 = the distance to the nearest MRT station (unit: meter)
- X4 = the number of convenience stores in the living circle on foot (integer)
- X5 = the geographic coordinate, latitude (unit: degree)
- X6 = the geographic coordinate, longitude (unit: degree)

The output is as follows: Y = house price per unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters)
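To make the unit of Y concrete, the following is a minimal sketch that converts a price quoted in 10000 New Taiwan Dollar/Ping into New Taiwan Dollar per square meter. The helper name price_per_sqm and the example value 37.9 are illustrative assumptions; the conversion factor is the 1 Ping = 3.3 square meters stated above.

```r
# Assumption: 1 Ping = 3.3 square meters, as stated in the data description.
PING_IN_SQM <- 3.3

# Hypothetical helper: convert 10000 NTD per Ping into NTD per square meter.
price_per_sqm <- function(price_10k_per_ping) {
  price_10k_per_ping * 10000 / PING_IN_SQM
}

# Example with an illustrative price of 37.9 (in 10000 NTD per Ping).
price_per_sqm(37.9)
```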

EVALUATION OF THE DATASET

REA <- read.csv("C:/AMRITA SCHOOL OF BUSINESS/Trimister 4/ANALYTICS - DATA ANALYSIS USING R AND PYTHON (DARP)/DARP ASSIGNMENT/Real estate valuation data set.csv", header = TRUE)  # read the data set into a data frame named REA
attach(REA)  # attach REA so that its columns can be referenced by name

library(car)

SUMMARY OF THE DATA

Here is the summary of the provided data. It shows the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable.

summary(REA)

       No             DATE        HOUSE_AGE      NEAREST_MTR_STN  
 Min.   :  1.0   Min.   :2013   Min.   : 0.000   Min.   :  23.38  
 1st Qu.:104.2   1st Qu.:2013   1st Qu.: 9.025   1st Qu.: 289.32  
 Median :207.5   Median :2013   Median :16.100   Median : 492.23  
 Mean   :207.5   Mean   :2013   Mean   :17.713   Mean   :1083.89  
 3rd Qu.:310.8   3rd Qu.:2013   3rd Qu.:28.150   3rd Qu.:1454.28  
 Max.   :414.0   Max.   :2014   Max.   :43.800   Max.   :6488.02  
 CONVENIENCE_STR_NEAR HOUSE_PRICE_PER_AREA
 Min.   : 0.000       Min.   :  7.60      
 1st Qu.: 1.000       1st Qu.: 27.70      
 Median : 4.000       Median : 38.45      
 Mean   : 4.094       Mean   : 37.98      
 3rd Qu.: 6.000       3rd Qu.: 46.60      
 Max.   :10.000       Max.   :117.50

The summary provides a snapshot of the key statistics for each variable in the real estate dataset. The breakdown for each variable is as follows:

DATE:
- The “DATE” variable is the transaction date expressed as a decimal year. summary() displays it rounded to four significant digits, which is why every value appears as 2013 or 2014.
- The box plot statistics later in this report show the actual range, 2012.667 to 2013.583, with a median of 2013.167, so the transactions run from late 2012 to mid 2013.
HOUSE_AGE:
- “HOUSE_AGE” represents the age of the houses in years.
- The minimum house age is 0 years, which might indicate newly constructed houses.
- The average (mean) house age is approximately 17.713 years and the median is 16.100 years, with a wide range of ages going up to a maximum of 43.800 years.
NEAREST_MTR_STN:
- “NEAREST_MTR_STN” denotes the distance to the nearest MRT station in meters.
- The distance varies significantly across observations, with a minimum of 23.38 meters and a maximum of 6488.02 meters.
- The mean distance of approximately 1083.89 meters is more than double the median of 492.23 meters. The distribution is therefore right-skewed: half of the properties lie within about 492 meters of a station, while a smaller group of distant properties pulls the mean up.
CONVENIENCE_STR_NEAR:
- “CONVENIENCE_STR_NEAR” is the number of convenience stores within walking distance of the property (an integer count).
- The count ranges from 0 stores to 10 stores.
- The average is approximately 4.094 stores and the median is 4 stores.
HOUSE_PRICE_PER_AREA:
- “HOUSE_PRICE_PER_AREA” is the house price per unit area, measured in 10000 New Taiwan Dollar/Ping.
- The house prices per unit area vary widely, with a minimum value of 7.60 and a maximum value of 117.50.
- The mean of approximately 37.98 is close to the median of 38.45, and the middle half of the properties falls between 27.70 and 46.60. The maximum of 117.50 lies far above the 3rd quartile.
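The rounding in the summary display can be reproduced with a short sketch. It assumes the decimal-year values reported by boxplot.stats(DATE) later in this report and formats them to four significant digits, which is the default that summary() uses when printing a data frame under the standard digits option. The vector name dates is an illustrative choice.

```r
# Decimal-year values taken from the box plot statistics for DATE in this report.
dates <- c(2012.667, 2012.917, 2013.167, 2013.417, 2013.583)

# Four significant digits hide the fractional part of the year.
shown <- format(dates, digits = 4)
shown

# The unrounded range shows the real span of the transaction dates.
range(dates)
```

Calling range(DATE) or quantile(DATE) on the attached data is one way to see the unrounded values directly.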

HISTOGRAM

Histograms provide a visual representation of the distribution of each variable in the dataset. They are useful for identifying patterns and characteristics within the data. The shape of each histogram can provide insights into the data distribution, such as skewness or concentration of values in certain ranges, which can be valuable for understanding the dataset’s underlying trends and variations.

hist(DATE)

hist(HOUSE_AGE)

hist(NEAREST_MTR_STN)

hist(CONVENIENCE_STR_NEAR)

hist(HOUSE_PRICE_PER_AREA)

GRAPHICAL REPRESENTATION

Graphical representation of data using various graphs like scatter plots, box plots, and more in R programming serves multiple crucial purposes. Firstly, it aids in data exploration by unveiling patterns, outliers, and potential errors not readily apparent in raw data. Secondly, it facilitates effective data communication, making findings more accessible to various audiences. Thirdly, it assists in hypothesis testing by visually assessing assumptions for statistical tests. Additionally, these visualizations help with comparing data across categories, detecting outliers, validating predictive models, recognizing hidden patterns, and simplifying complex relationships. They are invaluable in storytelling, decision-making, and explaining intricate data interactions. R offers a wide range of libraries for creating diverse visualizations, empowering users to gain deeper insights and make informed, data-driven decisions.

SCATTER PLOT

The scatter plots below show the house price per unit area against each predictor. scatterplot() from the car package also draws a line of best fit, which is the least-squares line for predicting the house price from that predictor. The formula is written as response ~ predictor, so the house price goes on the left-hand side.

scatterplot(HOUSE_PRICE_PER_AREA~DATE)

scatterplot(HOUSE_PRICE_PER_AREA~HOUSE_AGE)

scatterplot(HOUSE_PRICE_PER_AREA~NEAREST_MTR_STN)

scatterplot(HOUSE_PRICE_PER_AREA~CONVENIENCE_STR_NEAR)

Here are the same relationships drawn with the base plot() function, with the least-squares line from lsfit() added by abline(). The axis limits are chosen so that every data point is visible.

par(mfrow=c(2,2))
plot(DATE,HOUSE_PRICE_PER_AREA,xlim=c(2012,2015),ylim=c(6,120),main='DATA SET 1')
abline(lsfit(DATE,HOUSE_PRICE_PER_AREA))
plot(HOUSE_AGE,HOUSE_PRICE_PER_AREA,xlim=c(0,45),ylim=c(6,120),main='DATA SET 2')
abline(lsfit(HOUSE_AGE,HOUSE_PRICE_PER_AREA))
plot(NEAREST_MTR_STN,HOUSE_PRICE_PER_AREA,xlim=c(20,7000),ylim=c(6,120),main='DATA SET 3')
abline(lsfit(NEAREST_MTR_STN,HOUSE_PRICE_PER_AREA))
plot(CONVENIENCE_STR_NEAR,HOUSE_PRICE_PER_AREA,xlim=c(0,10),ylim=c(6,120),main='DATA SET 4')
abline(lsfit(CONVENIENCE_STR_NEAR,HOUSE_PRICE_PER_AREA))
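The line drawn by abline(lsfit(x, y)) is the same least-squares line that lm(y ~ x) estimates in the regression section. The sketch below checks this on synthetic data; the variables x and y and their generating values are made up for illustration and are not the real estate data.

```r
set.seed(1)                               # synthetic data, for illustration only
x <- runif(50, min = 0, max = 45)
y <- 42 - 0.25 * x + rnorm(50, sd = 5)

fit_lsfit <- lsfit(x, y)                  # interface used with abline() above
fit_lm    <- lm(y ~ x)                    # formula interface used for the regressions

# Both fits give the same intercept and slope.
coef_lsfit <- unname(fit_lsfit$coefficients)
coef_lm    <- unname(coef(fit_lm))
all.equal(coef_lsfit, coef_lm)
```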

BOXPLOT

Boxplot analysis in R programming is crucial as it provides a succinct visual summary of a dataset’s distribution and central tendencies, aiding in the quick identification of outliers and potential skewness. These plots are particularly valuable when comparing data among different groups and serve as a diagnostic tool for assessing assumptions required for statistical tests, helping to ensure the validity of analytical results. In essence, boxplots offer a comprehensive and intuitive way to grasp essential aspects of data, making them an indispensable tool for data exploration and hypothesis testing in R.

boxplot(DATE)

boxplot(HOUSE_AGE)

boxplot(NEAREST_MTR_STN)

boxplot(CONVENIENCE_STR_NEAR)

boxplot(HOUSE_PRICE_PER_AREA)

BOXPLOT STATS

Here are the box plot statistics for each variable in the given data.

boxplot.stats(DATE)

$stats
[1] 2012.667 2012.917 2013.167 2013.417 2013.583

$n
[1] 414

$conf
[1] 2013.128 2013.206

$out
numeric(0)

Data Set 1 (DATE):

Stats: The five values are the lower whisker end, the lower hinge (approximately Q1), the median (Q2), the upper hinge (approximately Q3), and the upper whisker end. Because there are no outliers here, the whisker ends coincide with the minimum (2012.667) and the maximum (2013.583).
n: The sample size for this data set is 414.
Confidence Interval (conf): The range [2013.128, 2013.206] gives the ends of the box plot notch, computed as median ± 1.58 × IQR / sqrt(n). It serves as an approximate 95% confidence interval for the median (Q2).
Outliers (out): There are no outliers detected in this data set.
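The conf values can be reproduced by hand. The sketch below applies the notch formula that boxplot.stats() uses, median ± 1.58 × IQR / sqrt(n) with the IQR taken between the hinges, to the DATE statistics printed above. The helper name notch_interval and the small vector x are illustrative assumptions.

```r
# Hypothetical helper implementing the notch formula used by boxplot.stats().
notch_interval <- function(med, lower_hinge, upper_hinge, n) {
  med + c(-1.58, 1.58) * (upper_hinge - lower_hinge) / sqrt(n)
}

# Median, hinges, and n copied from the boxplot.stats(DATE) output above.
date_notch <- notch_interval(2013.167, 2012.917, 2013.417, 414)
date_notch

# Cross-check the helper against boxplot.stats() on a small synthetic vector.
x <- c(1, 3, 4, 7, 8, 12, 15, 21)
s <- boxplot.stats(x)
manual <- notch_interval(s$stats[3], s$stats[2], s$stats[4], s$n)
all.equal(manual, s$conf)
```

Up to the rounding of the printed inputs, date_notch agrees with the $conf values reported for DATE.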

boxplot.stats(HOUSE_AGE)

$stats
[1]  0.0  9.0 16.1 28.2 43.8

$n
[1] 414

$conf
[1] 14.60907 17.59093

$out
numeric(0)

Data Set 2 (HOUSE_AGE):

Stats: House age runs from a lower whisker end of 0.0 years to an upper whisker end of 43.8 years, with hinges at 9.0 and 28.2 and a median of 16.1.
n: The sample size is still 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [14.60907, 17.59093].
Outliers (out): No outliers are present in this data set either.

boxplot.stats(NEAREST_MTR_STN)

$stats
[1]   23.38284  289.32480  492.23130 1455.79800 3171.32900

$n
[1] 414

$conf
[1] 401.6514 582.8112

$out
 [1] 5512.038 4519.690 4079.418 4082.015 4066.587 4605.749 4510.359 4510.359
 [9] 4082.015 4066.587 3947.945 6396.283 4197.349 3780.590 4066.587 4082.015
[17] 4066.587 4527.687 4573.779 4449.270 4082.015 4066.587 3771.895 4082.015
[25] 4074.736 4412.765 6306.153 5512.038 4082.015 4197.349 4197.349 4519.690
[33] 6488.021 3529.564 4066.587 4136.271 4082.015

Data Set 3 (NEAREST_MTR_STN):

Stats: The whiskers run from 23.38284 to 3171.32900 meters, with hinges at 289.32480 and 1455.79800 and a median of 492.23130. The upper whisker end is the largest value that is not flagged as an outlier; it is not the maximum of the data, which is 6488.021.
n: The sample size is 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [401.6514, 582.8112].
Outliers (out): 37 values are flagged as outliers, all of them above the upper whisker. These values lie more than 1.5 times the interquartile range above the upper hinge, which is the default rule used by boxplot.stats().

boxplot.stats(CONVENIENCE_STR_NEAR)

$stats
[1]  0  1  4  6 10

$n
[1] 414

$conf
[1] 3.611736 4.388264

$out
integer(0)

Data Set 4 (CONVENIENCE_STR_NEAR):

Stats: The whiskers run from 0 to 10 stores, with hinges at 1 and 6 and a median of 4.
n: The sample size is 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [3.611736, 4.388264].
Outliers (out): The result integer(0) is an empty vector, so there are no outliers in this data set.

boxplot.stats(HOUSE_PRICE_PER_AREA)

$stats
[1]  7.60 27.70 38.45 46.60 73.60

$n
[1] 414

$conf
[1] 36.98236 39.91764

$out
[1]  78.3 117.5  78.0

Data Set 5 (HOUSE_PRICE_PER_AREA):

Stats: The whiskers run from 7.60 to 73.60, with hinges at 27.70 and 46.60 and a median of 38.45. The upper whisker end of 73.60 is the largest price that is not flagged as an outlier; the maximum of the data is 117.5.
n: The sample size is 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [36.98236, 39.91764].
Outliers (out): Outliers are detected and listed as [78.3, 117.5, 78.0]. These values lie beyond the upper whisker end of 73.60.
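The outlier rule can be checked against the house price statistics. The sketch below assumes the default coefficient of 1.5 used by boxplot.stats() and takes the hinges and flagged values from the output above; the variable names are illustrative.

```r
# Hinges copied from the boxplot.stats(HOUSE_PRICE_PER_AREA) output above.
lower_hinge <- 27.70
upper_hinge <- 46.60
iqr <- upper_hinge - lower_hinge

# Values beyond these fences are reported in $out (default coef = 1.5).
upper_fence <- upper_hinge + 1.5 * iqr
lower_fence <- lower_hinge - 1.5 * iqr
c(lower_fence, upper_fence)

# The three flagged prices exceed the upper fence; the whisker end does not.
flagged <- c(78.3, 117.5, 78.0)
all(flagged > upper_fence)
73.60 <= upper_fence
```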

Each box plot summarizes the distribution of a data set with its key statistics.

LINEAR REGRESSION MODELING

Here are the summaries of the simple linear regression models fitted to the above data, one predictor at a time.

X1 <- lm(HOUSE_PRICE_PER_AREA~HOUSE_AGE)
summary(X1)


Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ HOUSE_AGE)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.113 -10.738   1.626   8.199  77.781 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.43470    1.21098  35.042  < 2e-16 ***
HOUSE_AGE   -0.25149    0.05752  -4.372 1.56e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.32 on 412 degrees of freedom
Multiple R-squared:  0.04434,   Adjusted R-squared:  0.04202 
F-statistic: 19.11 on 1 and 412 DF,  p-value: 1.56e-05

anova(X1)

Analysis of Variance Table

Response: HOUSE_PRICE_PER_AREA
           Df Sum Sq Mean Sq F value   Pr(>F)    
HOUSE_AGE   1   3390  3390.2  19.115 1.56e-05 ***
Residuals 412  73071   177.4                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Analysis for HOUSE_AGE:

In this analysis, the independent variable is “HOUSE_AGE,” which represents the age of houses.
The coefficient for “HOUSE_AGE” is estimated at -0.25149, and it is statistically significant (p-value = 1.56e-05). This suggests that as the age of a house increases by one year, the house price per unit area decreases by approximately 0.25149 units (10000 New Taiwan Dollar/Ping) on average.
The R-squared value of 0.04434 indicates that only 4.434% of the variation in house prices can be explained by the age of houses. This suggests that “HOUSE_AGE” alone does not provide a strong explanation for variations in house prices.
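The R-squared and F-statistic reported by summary(X1) follow from the sums of squares in the anova(X1) table. The sketch below recomputes both from the printed values. The variable names are illustrative, and small differences from the reported figures come from rounding in the printed table.

```r
# Sums of squares and degrees of freedom copied from the anova(X1) table above.
ss_regression <- 3390.2
ss_residual   <- 73071
df_regression <- 1
df_residual   <- 412

# R-squared is the share of the total sum of squares explained by the model.
r_squared <- ss_regression / (ss_regression + ss_residual)

# The F-statistic is the ratio of the regression and residual mean squares.
f_stat <- (ss_regression / df_regression) / (ss_residual / df_residual)

round(r_squared, 5)
round(f_stat, 2)
```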

X2 <- lm(HOUSE_PRICE_PER_AREA~NEAREST_MTR_STN)
summary(X2)


Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ NEAREST_MTR_STN)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.396  -6.007  -1.195   4.831  73.483 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     45.8514271  0.6526105   70.26   <2e-16 ***
NEAREST_MTR_STN -0.0072621  0.0003925  -18.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.07 on 412 degrees of freedom
Multiple R-squared:  0.4538,    Adjusted R-squared:  0.4524 
F-statistic: 342.2 on 1 and 412 DF,  p-value: < 2.2e-16

anova(X2)

Analysis of Variance Table

Response: HOUSE_PRICE_PER_AREA
                 Df Sum Sq Mean Sq F value    Pr(>F)    
NEAREST_MTR_STN   1  34695   34695  342.24 < 2.2e-16 ***
Residuals       412  41767     101                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Analysis for NEAREST_MTR_STN:

In this analysis, the independent variable is “NEAREST_MTR_STN,” which is the distance to the nearest MRT station in meters.
The coefficient for “NEAREST_MTR_STN” is estimated at -0.0072621, and it is highly statistically significant (p-value < 2e-16). This suggests that for each meter closer to the nearest MRT station, the house price per unit area increases by approximately 0.0072621 units on average.
The R-squared value of 0.4538 indicates that 45.38% of the variation in house prices can be explained by the distance to the nearest MRT station. This makes “NEAREST_MTR_STN” the strongest single predictor of house prices among the three variables examined here.

X3 <- lm(HOUSE_PRICE_PER_AREA~CONVENIENCE_STR_NEAR)
summary(X3)


Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ CONVENIENCE_STR_NEAR)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.407  -7.341  -1.788   5.984  87.681 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           27.1811     0.9419   28.86   <2e-16 ***
CONVENIENCE_STR_NEAR   2.6377     0.1868   14.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.18 on 412 degrees of freedom
Multiple R-squared:  0.326, Adjusted R-squared:  0.3244 
F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16

anova(X3)

Analysis of Variance Table

Response: HOUSE_PRICE_PER_AREA
                      Df Sum Sq Mean Sq F value    Pr(>F)    
CONVENIENCE_STR_NEAR   1  24930 24930.0  199.32 < 2.2e-16 ***
Residuals            412  51531   125.1                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression Analysis for CONVENIENCE_STR_NEAR:

In this analysis, the independent variable is “CONVENIENCE_STR_NEAR,” which is the number of convenience stores within walking distance.
The coefficient for “CONVENIENCE_STR_NEAR” is estimated at 2.6377, and it is highly statistically significant (p-value < 2e-16). This suggests that each additional convenience store nearby is associated with an increase of approximately 2.6377 units in the house price per unit area.
The R-squared value of 0.326 indicates that 32.6% of the variation in house prices can be explained by the number of nearby convenience stores. This suggests that “CONVENIENCE_STR_NEAR” is a moderately strong predictor of house prices in this dataset.

CO-EFFICIENT OF THE LINEAR REGRESSION

These are the coefficients of the fitted models, which can be used for further predictions.

The formula to use is Y = (intercept) + (slope) × X

coef(X1)

(Intercept)   HOUSE_AGE 
 42.4346970  -0.2514884

coef(X2)

    (Intercept) NEAREST_MTR_STN 
   45.851427058    -0.007262052

coef(X3)

         (Intercept) CONVENIENCE_STR_NEAR 
           27.181105             2.637653

So, for example, for X1 the intercept is 42.4346970 and the slope is -0.2514884. Therefore the equation is Y = 42.4346970 + (-0.2514884)X, where X is the house age in years and Y is the predicted house price per unit area.
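As a worked example of the equation for X1, the sketch below plugs a few house ages into the fitted line. The function name predict_price_from_age and the chosen ages are illustrative assumptions; the coefficients are copied from coef(X1) above.

```r
# Coefficients copied from coef(X1) above.
intercept <- 42.4346970
slope     <- -0.2514884

# Hypothetical helper: predicted house price per unit area for a given house age.
predict_price_from_age <- function(house_age) {
  intercept + slope * house_age
}

# Predictions for a new house, a 10-year-old house, and a 30-year-old house.
predict_price_from_age(c(0, 10, 30))
```

With the fitted model object available, predict(X1, newdata = data.frame(HOUSE_AGE = 10)) should give the same kind of result without copying the coefficients by hand.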

RESIDUAL PLOT

Here is the residual plot for the given data

plot(HOUSE_AGE,X1$residuals,ylab="Residuals",xlim=c(0,45),ylim=c(-40,90),main="Residual Data Sheet 1")

A residual is the observed value minus the fitted value. The minimum residual is -31.113, so for one observation the observed house price per unit area is 31.113 units below the model’s prediction.
The maximum residual is 77.781, so for one observation the observed value is 77.781 units above the prediction.
The median (1.626) and the quartiles (-10.738 and 8.199) are roughly balanced around zero, but the maximum is far larger in magnitude than the minimum, which points to a long right tail.
The plot should be checked for a vertical spread that changes with house age. Such a pattern would indicate heteroscedasticity, meaning that the variance of the residuals is not constant across all levels of the independent variable.

plot(NEAREST_MTR_STN,X2$residuals,ylab="Residuals",xlim=c(20,7000),ylim=c(-40,90),main="Residual Data Sheet 2")

The minimum residual is -35.396, so for one observation the observed house price per unit area is 35.396 units below the model’s prediction.
The maximum residual is 73.483, so for one observation the observed value is 73.483 units above the prediction.
The median (-1.195) and the quartiles (-6.007 and 4.831) are roughly balanced around zero, but the largest positive residual is again about twice the size of the largest negative one.
A vertical spread that changes with the distance to the station would indicate heteroscedasticity.

plot(CONVENIENCE_STR_NEAR,X3$residuals,ylab="Residuals",xlim=c(0,10),ylim=c(-40,90),main="Residual Data Sheet 3")

The minimum residual is -35.407, so for one observation the observed house price per unit area is 35.407 units below the model’s prediction.
The maximum residual is 87.681, so for one observation the observed value is 87.681 units above the prediction.
The median (-1.788) and the quartiles (-7.341 and 5.984) are roughly balanced around zero, with the same long right tail as in the other two models.
As with the other models, a vertical spread that changes with the number of stores would indicate heteroscedasticity.
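The sign convention for residuals can be confirmed with a small sketch on synthetic data. The variables x, y, and fit are made up for illustration and are not the real estate models.

```r
set.seed(42)                         # synthetic data, for illustration only
x <- 1:20
y <- 5 + 2 * x + rnorm(20)
fit <- lm(y ~ x)

# A residual is the observed value minus the fitted value.
manual_residuals <- y - fitted(fit)
all.equal(unname(residuals(fit)), unname(manual_residuals))

# At the largest residual, the observed value lies above the fitted value.
i <- unname(which.max(residuals(fit)))
observed_above_fit <- y[i] > unname(fitted(fit))[i]
observed_above_fit
```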

OVERALL ANALYSIS

# Load required libraries
library(ggplot2)

library(lmtest)

# Create a linear regression model
model <- lm(HOUSE_PRICE_PER_AREA ~ HOUSE_AGE + NEAREST_MTR_STN + CONVENIENCE_STR_NEAR, data = REA)

# Summarize the regression model
summary(model)


Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ HOUSE_AGE + NEAREST_MTR_STN + 
    CONVENIENCE_STR_NEAR, data = REA)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.304  -5.430  -1.738   4.325  77.315 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          42.977286   1.384542  31.041  < 2e-16 ***
HOUSE_AGE            -0.252856   0.040105  -6.305 7.47e-10 ***
NEAREST_MTR_STN      -0.005379   0.000453 -11.874  < 2e-16 ***
CONVENIENCE_STR_NEAR  1.297443   0.194290   6.678 7.91e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.251 on 410 degrees of freedom
Multiple R-squared:  0.5411,    Adjusted R-squared:  0.5377 
F-statistic: 161.1 on 3 and 410 DF,  p-value: < 2.2e-16

# Perform diagnostic tests (optional)
# Test for heteroscedasticity
bptest(model)


    studentized Breusch-Pagan test

data:  model
BP = 3.2911, df = 3, p-value = 0.3489

# Test for normality of residuals
shapiro.test(model$residuals)


    Shapiro-Wilk normality test

data:  model$residuals
W = 0.87858, p-value < 2.2e-16

# Make predictions
predicted <- predict(model, newdata = REA)

# Add predictions to the dataframe
REA$Predicted_House_Price <- predicted

# View the results
head(REA)

  No     DATE HOUSE_AGE NEAREST_MTR_STN CONVENIENCE_STR_NEAR
1  1 2012.917      32.0        84.87882                   10
2  2 2012.917      19.5       306.59470                    9
3  3 2013.583      13.3       561.98450                    5
4  4 2013.500      13.3       561.98450                    5
5  5 2012.833       5.0       390.56840                    5
6  6 2012.667       7.1      2175.03000                    3
  HOUSE_PRICE_PER_AREA Predicted_House_Price
1                 37.9              47.40375
2                 42.2              48.07437
3                 47.3              43.07853
4                 54.8              43.07853
5                 43.1              46.09930
6                 32.1              33.37457

The linear regression model aimed to predict “HOUSE_PRICE_PER_AREA” using three key independent variables: “HOUSE_AGE,” “NEAREST_MTR_STN,” and “CONVENIENCE_STR_NEAR.” The coefficients revealed valuable insights into these relationships. The intercept (constant) was estimated at 42.977286, indicating that when all independent variables are zero, the predicted house price per unit area is around 42.98.

The coefficients for the independent variables were as follows, each interpreted with the other two variables held constant: “HOUSE_AGE” had a coefficient of -0.252856, suggesting that for each additional year of house age, the predicted house price per unit area decreases by approximately 0.25 units. “NEAREST_MTR_STN” had a coefficient of -0.005379, signifying that for each additional meter of distance to the nearest MRT station, the predicted house price per unit area decreases by roughly 0.005 units. Lastly, “CONVENIENCE_STR_NEAR” had a coefficient of 1.297443, implying that each additional convenience store within walking distance is associated with an increase in the predicted house price per unit area of about 1.30 units.
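The predicted values shown by head(REA) can be reproduced by hand from these coefficients. The sketch below does so for the first two rows of the data. The names coefs, first_rows, and manual_prediction are illustrative, and the results match the Predicted_House_Price column only up to the rounding of the printed coefficients.

```r
# Coefficients copied from summary(model) above.
coefs <- c(intercept = 42.977286, house_age = -0.252856,
           nearest_mtr_stn = -0.005379, convenience_str_near = 1.297443)

# First two rows of the data, copied from the head(REA) output above.
first_rows <- data.frame(
  HOUSE_AGE            = c(32.0, 19.5),
  NEAREST_MTR_STN      = c(84.87882, 306.59470),
  CONVENIENCE_STR_NEAR = c(10, 9)
)

# Apply the fitted equation term by term.
manual_prediction <- coefs[["intercept"]] +
  coefs[["house_age"]]            * first_rows$HOUSE_AGE +
  coefs[["nearest_mtr_stn"]]      * first_rows$NEAREST_MTR_STN +
  coefs[["convenience_str_near"]] * first_rows$CONVENIENCE_STR_NEAR
manual_prediction
```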

All coefficients were statistically significant, as indicated by their low p-values. The model’s overall significance was confirmed by the F-statistic (161.1 on 3 and 410 degrees of freedom), which had a very small p-value. The multiple R-squared of 0.5411 indicates that the three variables together explain about 54% of the variation in house prices.

Diagnostic Tests:

Two diagnostic tests were performed. The first, a test for heteroscedasticity (Breusch-Pagan), yielded a p-value of 0.3489, so the null hypothesis of constant residual variance is not rejected, which is a positive result.

The second test, for the normality of residuals (Shapiro-Wilk), resulted in a very small p-value, implying that the residuals were not normally distributed. However, this can be acceptable in large samples such as the 414 observations used here.

Predictions:

The model was used to make predictions on the same dataset (“REA”). These predictions were added to the dataframe as “Predicted_House_Price,” providing estimates of house prices per unit area based on the values of the independent variables for each data point.

In conclusion, the regression model sheds light on how “HOUSE_PRICE_PER_AREA” is influenced by “HOUSE_AGE,” “NEAREST_MTR_STN,” and “CONVENIENCE_STR_NEAR.” The coefficients elucidate the direction and strength of these relationships. The diagnostic tests give a mixed picture: constant residual variance is not rejected, but the residuals depart from normality. The predictions offer a basis for understanding and potentially predicting house prices in the given context.