The historical market data set for real estate valuation was collected from Sindian Dist., New Taipei City, Taiwan.
The inputs are as follows:
X1 = the transaction date (for example, 2013.250 = 2013 March, 2013.500 = 2013 June, etc.)
X2 = the house age (unit: year)
X3 = the distance to the nearest MRT station (unit: meter)
X4 = the number of convenience stores in the living circle on foot (integer)
X5 = the geographic coordinate, latitude (unit: degree)
X6 = the geographic coordinate, longitude (unit: degree)
The output is as follows:
Y = house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared)
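The unit of Y can be hard to read at a glance, so here is a minimal sketch of converting it to New Taiwan Dollar per square meter. The function name `price_per_m2` and the constant `PING_IN_M2` are hypothetical names introduced only for this illustration; the conversion factor is the one given in the definition of Y.

```r
# 1 Ping = 3.3 square meters, as stated in the definition of Y.
PING_IN_M2 <- 3.3

# Hypothetical helper: convert a price in 10000 NTD/Ping to NTD per square meter.
price_per_m2 <- function(price_10k_ntd_per_ping) {
  price_10k_ntd_per_ping * 10000 / PING_IN_M2
}

# Example: a value of 37.98 is 379800 NTD per Ping.
price_per_m2(37.98)
```

A value of 37.98 therefore corresponds to roughly 115091 New Taiwan Dollar per square meter.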
EVALUATION OF THE DATASET
REA <- read.csv("C:/AMRITA SCHOOL OF BUSINESS/Trimister 4/ANALYTICS - DATA ANALYSIS USING R AND PYTHON (DARP)/DARP ASSIGNMENT/Real estate valuation data set.csv", header = TRUE)  # NAME OF THE DATASET GIVEN
attach(REA)  # ATTACH THE DATA TO THE PROGRAM
library(car)
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3
SUMMARY OF THE DATA
Here is the summary of the provided data. It reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable.
summary(REA)
       No             DATE        HOUSE_AGE      NEAREST_MTR_STN
 Min.   :  1.0   Min.   :2013   Min.   : 0.000   Min.   :  23.38
 1st Qu.:104.2   1st Qu.:2013   1st Qu.: 9.025   1st Qu.: 289.32
 Median :207.5   Median :2013   Median :16.100   Median : 492.23
 Mean   :207.5   Mean   :2013   Mean   :17.713   Mean   :1083.89
 3rd Qu.:310.8   3rd Qu.:2013   3rd Qu.:28.150   3rd Qu.:1454.28
 Max.   :414.0   Max.   :2014   Max.   :43.800   Max.   :6488.02
 CONVENIENCE_STR_NEAR HOUSE_PRICE_PER_AREA
 Min.   : 0.000       Min.   :  7.60
 1st Qu.: 1.000       1st Qu.: 27.70
 Median : 4.000       Median : 38.45
 Mean   : 4.094       Mean   : 37.98
 3rd Qu.: 6.000       3rd Qu.: 46.60
 Max.   :10.000       Max.   :117.50
The summary provides a snapshot of the key statistics for each variable in the real estate data set. Let’s break down the summary statistics for each variable:
DATE:
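The “DATE” variable is the transaction date expressed as a decimal year (X1 above), with printed values from 2013 to 2014.
The mean and median both print as 2013 because summary() shows only four significant digits here, which hides the fractional part of the year; the observations are centered in 2013.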
HOUSE_AGE:
“HOUSE_AGE” represents the age of the houses in years.
The minimum house age is 0 years, which might indicate newly constructed houses.
The average (mean) house age is approximately 17.713 years, with a wide range of ages, going up to a maximum of 43.800 years.
NEAREST_MTR_STN:
“NEAREST_MTR_STN” is the distance to the nearest MRT station in meters (X3 above).
The distance varies significantly across observations, with a minimum of 23.38 meters and a maximum of 6488.02 meters.
The mean distance is approximately 1083.89 meters, more than twice the median of 492.23 meters. This gap indicates a right-skewed distribution: half of the properties are within about 500 meters of a station, while a smaller number lie several kilometers away.
CONVENIENCE_STR_NEAR:
“CONVENIENCE_STR_NEAR” is the number of convenience stores within walking distance of the property (X4 above).
The count ranges from 0 to 10 stores.
The average is approximately 4.094 stores, close to the median of 4.
HOUSE_PRICE_PER_AREA:
“HOUSE_PRICE_PER_AREA” is the house price per unit area (Y above), measured in 10000 New Taiwan Dollar/Ping.
The house prices per unit area vary widely, with a minimum value of 7.60 and a maximum value of 117.50.
The average house price per unit area is approximately 37.98, close to the median of 38.45, and the middle half of the properties fall between 27.70 and 46.60.
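Since DATE is a decimal year, the default summary output hides its fractional part. The following sketch shows one way to display more digits. The vector `dates` is a hypothetical sample in the style of X1, not the real data.

```r
# Hypothetical decimal-year values in the style of X1 (not the real data).
dates <- c(2012.667, 2012.917, 2013.250, 2013.500, 2013.583)

# Default printing uses few significant digits, so the fractional year is lost.
summary(dates)

# Requesting more digits when printing exposes the fractional part.
print(summary(dates), digits = 8)

# range() and median() return the unrounded values directly.
range(dates)
median(dates)
```

The same `print(summary(...), digits = 8)` call could be applied to `REA$DATE`, assuming the column holds decimal years as described for X1.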
HISTOGRAM
Histograms provide a visual representation of the distribution of each variable in the dataset. They are useful for identifying patterns and characteristics within the data. The shape of each histogram can provide insights into the data distribution, such as skewness or concentration of values in certain ranges, which can be valuable for understanding the dataset’s underlying trends and variations.
hist(DATE)
hist(HOUSE_AGE)
hist(NEAREST_MTR_STN)
hist(CONVENIENCE_STR_NEAR)
hist(HOUSE_PRICE_PER_AREA)
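To read a histogram numerically rather than visually, the bin counts can be extracted without drawing anything. This is a minimal sketch on a simulated right-skewed sample; the vector `x` is hypothetical and only stands in for a variable such as NEAREST_MTR_STN.

```r
# Hypothetical right-skewed sample (not the real data); fixed seed for repeatability.
set.seed(1)
x <- rexp(414, rate = 1 / 1000)

# plot = FALSE returns the bin edges and counts instead of drawing the chart.
h <- hist(x, plot = FALSE)
bins <- data.frame(from = head(h$breaks, -1), to = h$breaks[-1], count = h$counts)
bins

# A mean well above the median is a quick numeric sign of right skew.
c(mean = mean(x), median = median(x))
```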
GRAPHICAL REPRESENTATION
Graphical representation of data using various graphs like scatter plots, box plots, and more in R programming serves multiple crucial purposes. Firstly, it aids in data exploration by unveiling patterns, outliers, and potential errors not readily apparent in raw data. Secondly, it facilitates effective data communication, making findings more accessible to various audiences. Thirdly, it assists in hypothesis testing by visually assessing assumptions for statistical tests. Additionally, these visualizations help with comparing data across categories, detecting outliers, validating predictive models, recognizing hidden patterns, and simplifying complex relationships. They are invaluable in storytelling, decision-making, and explaining intricate data interactions. R offers a wide range of libraries for creating diverse visualizations, empowering users to gain deeper insights and make informed, data-driven decisions.
SCATTER PLOT
Here are the scatter plots of HOUSE_PRICE_PER_AREA against each predictor. Each plot includes a line of best fit, which is the least-squares line for predicting HOUSE_PRICE_PER_AREA from that predictor.
The line is fitted with lsfit() and drawn with abline(), which adds a straight line to an existing plot. The four panels are arranged in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(DATE, HOUSE_PRICE_PER_AREA, xlim = c(2012, 2015), ylim = c(6, 120), main = 'DATA SET 1')
abline(lsfit(DATE, HOUSE_PRICE_PER_AREA))
plot(HOUSE_AGE, HOUSE_PRICE_PER_AREA, xlim = c(0, 45), ylim = c(6, 120), main = 'DATA SET 2')
abline(lsfit(HOUSE_AGE, HOUSE_PRICE_PER_AREA))
plot(NEAREST_MTR_STN, HOUSE_PRICE_PER_AREA, xlim = c(20, 7000), ylim = c(6, 120), main = 'DATA SET 3')
abline(lsfit(NEAREST_MTR_STN, HOUSE_PRICE_PER_AREA))
plot(CONVENIENCE_STR_NEAR, HOUSE_PRICE_PER_AREA, xlim = c(0, 10), ylim = c(6, 120), main = 'DATA SET 4')
abline(lsfit(CONVENIENCE_STR_NEAR, HOUSE_PRICE_PER_AREA))
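The lines drawn by lsfit() above are the same least-squares lines that lm() produces. The sketch below checks this on simulated data; `x` and `y` are hypothetical values, not columns of REA.

```r
# Simulated data (hypothetical, not the real data set).
set.seed(42)
x <- runif(50, 0, 45)
y <- 42 - 0.25 * x + rnorm(50, sd = 5)

fit_lsfit <- lsfit(x, y)   # least-squares fit, returns a list
fit_lm <- lm(y ~ x)        # the same fit through the modelling interface

# Both report the intercept first and the slope second.
coef_lsfit <- unname(fit_lsfit$coefficients)
coef_lm <- unname(coef(fit_lm))
all.equal(coef_lsfit, coef_lm)
```

Using lm() has the advantage that the fitted object can be passed directly to summary() and anova().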
BOXPLOT
Boxplot analysis in R programming is crucial as it provides a succinct visual summary of a dataset’s distribution and central tendencies, aiding in the quick identification of outliers and potential skewness. These plots are particularly valuable when comparing data among different groups and serve as a diagnostic tool for assessing assumptions required for statistical tests, helping to ensure the validity of analytical results. In essence, boxplots offer a comprehensive and intuitive way to grasp essential aspects of data, making them an indispensable tool for data exploration and hypothesis testing in R.
Stats (first box plot, whose values are decimal years as in DATE): The box plot summarizes the data’s distribution with five key statistics: the lower whisker, lower quartile (Q1), median (Q2), upper quartile (Q3), and upper whisker. When there are no outliers, the whiskers coincide with the minimum and maximum.
n: The sample size for this data set is 414.
Confidence Interval (conf): The range [2013.128, 2013.206] is the notch interval, an approximate 95% confidence interval for the median (Q2).
Outliers (out): There are no outliers detected in this data set.
Stats (second box plot, whose values match NEAREST_MTR_STN): The whiskers extend from 23.38284 to 3171.32900. The upper whisker is below the maximum of 6488.02 shown in the summary because larger values are treated as outliers.
n: The sample size is 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [401.6514, 582.8112].
Outliers (out): There are multiple outliers present in this data set, as listed in the “out” section. These outliers are values that lie beyond the whiskers of the box plot.
Stats (third box plot, whose values match HOUSE_PRICE_PER_AREA): The whiskers extend from 7.60 to 73.60.
n: The sample size is 414.
Confidence Interval (conf): The approximate 95% confidence interval for the median is [36.98236, 39.91764].
Outliers (out): Outliers are detected and listed as [78.3, 117.5, 78.0]. These values all lie above the upper whisker of 73.60.
Each box plot summarizes the distribution of a data set with key statistics.
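The interval for the last data set can be reproduced from the summary statistics. R's boxplot.stats() documents the notch as the median plus or minus 1.58 times the interquartile range divided by the square root of n, and by default flags points more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes the quartiles printed by summary() (27.70 and 46.60) equal the hinges used by the box plot, which may not hold exactly for every data set.

```r
# Values taken from the summary output for HOUSE_PRICE_PER_AREA.
med <- 38.45
q1 <- 27.70   # assumed equal to the lower hinge
q3 <- 46.60   # assumed equal to the upper hinge
n <- 414

iqr <- q3 - q1

# Notch interval: median +/- 1.58 * IQR / sqrt(n).
half_width <- 1.58 * iqr / sqrt(n)
conf <- c(lower = med - half_width, upper = med + half_width)
round(conf, 5)

# Points above this fence are reported as outliers.
upper_fence <- q3 + 1.5 * iqr
outliers <- c(78.3, 117.5, 78.0)
all(outliers > upper_fence)
```

Under these assumptions the interval agrees with the reported [36.98236, 39.91764], and all three listed outliers lie above the fence of 74.95.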
LINEAR REGRESSION MODELING
Here is the summary of the linear regression models fitted to the data above. Each model regresses HOUSE_PRICE_PER_AREA on a single predictor.
Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ HOUSE_AGE)
Residuals:
    Min      1Q  Median      3Q     Max
-31.113 -10.738   1.626   8.199  77.781
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.43470    1.21098  35.042  < 2e-16 ***
HOUSE_AGE   -0.25149    0.05752  -4.372 1.56e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.32 on 412 degrees of freedom
Multiple R-squared: 0.04434, Adjusted R-squared: 0.04202
F-statistic: 19.11 on 1 and 412 DF, p-value: 1.56e-05
anova(X1)
Analysis of Variance Table
Response: HOUSE_PRICE_PER_AREA
           Df Sum Sq Mean Sq F value   Pr(>F)
HOUSE_AGE   1   3390  3390.2  19.115 1.56e-05 ***
Residuals 412  73071   177.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression Analysis for HOUSE_AGE:
In this analysis, the independent variable is “HOUSE_AGE,” which represents the age of houses.
The coefficient for “HOUSE_AGE” is estimated at -0.25149, and it is statistically significant (p-value < 0.05). This suggests that as the age of houses increases by one year, house prices per unit area decrease by approximately 0.25149 units.
The R-squared value of 0.04434 indicates that only 4.434% of the variation in house prices can be explained by the age of houses. This suggests that “HOUSE_AGE” alone does not provide a strong explanation for variations in house prices.
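The object X1 passed to anova() is not defined in the code shown above. It was presumably created with a call like the one below. This sketch uses a simulated data frame, `REA_demo`, and a model object, `X1_demo`, which are both hypothetical names, so that it runs without the CSV file.

```r
# Hypothetical stand-in for REA (simulated, not the real data).
set.seed(123)
REA_demo <- data.frame(HOUSE_AGE = runif(100, 0, 44))
REA_demo$HOUSE_PRICE_PER_AREA <- 42 - 0.25 * REA_demo$HOUSE_AGE + rnorm(100, sd = 13)

# Fit the simple regression, then print the two tables used in this section.
X1_demo <- lm(HOUSE_PRICE_PER_AREA ~ HOUSE_AGE, data = REA_demo)
summary(X1_demo)   # coefficients, residual summary, R-squared
anova(X1_demo)     # sums of squares and F test
```

Passing `data = REA_demo` avoids relying on attach(), which keeps the variable lookup explicit.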
Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ NEAREST_MTR_STN)
Residuals:
    Min      1Q  Median      3Q     Max
-35.396  -6.007  -1.195   4.831  73.483
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     45.8514271  0.6526105   70.26   <2e-16 ***
NEAREST_MTR_STN -0.0072621  0.0003925  -18.50   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.07 on 412 degrees of freedom
Multiple R-squared: 0.4538, Adjusted R-squared: 0.4524
F-statistic: 342.2 on 1 and 412 DF, p-value: < 2.2e-16
anova(X2)
Analysis of Variance Table
Response: HOUSE_PRICE_PER_AREA
                 Df Sum Sq Mean Sq F value    Pr(>F)
NEAREST_MTR_STN   1  34695   34695  342.24 < 2.2e-16 ***
Residuals       412  41767     101
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression Analysis for NEAREST_MTR_STN:
In this analysis, the independent variable is “NEAREST_MTR_STN,” which represents the distance to the nearest MRT station in meters.
The coefficient for “NEAREST_MTR_STN” is estimated at -0.0072621, and it is highly statistically significant (p-value < 2e-16). This suggests that as the distance to the nearest station decreases by one meter, house prices per unit area increase by approximately 0.0072621 units.
The R-squared value of 0.4538 indicates that 45.38% of the variation in house prices can be explained by the distance to the nearest station. This suggests that “NEAREST_MTR_STN” is a strong predictor of house prices in this dataset.
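A per-meter slope is small and hard to interpret, so it can help to rescale it. The sketch below expresses the same fitted line per kilometer; the function `price_at` is a hypothetical helper, and the coefficients are copied from the output above.

```r
# Coefficients copied from the NEAREST_MTR_STN regression output.
intercept <- 45.8514271
slope_per_meter <- -0.0072621

# The same effect expressed per kilometer.
slope_per_km <- slope_per_meter * 1000
slope_per_km

# Hypothetical helper: fitted price at a given distance in meters.
price_at <- function(distance_m) intercept + slope_per_meter * distance_m

# Fitted difference between a property 500 m away and one 1500 m away.
price_at(500) - price_at(1500)
```

Under this model, each additional kilometer from the station lowers the fitted price by about 7.26 units.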
Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ CONVENIENCE_STR_NEAR)
Residuals:
    Min      1Q  Median      3Q     Max
-35.407  -7.341  -1.788   5.984  87.681
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           27.1811     0.9419   28.86   <2e-16 ***
CONVENIENCE_STR_NEAR   2.6377     0.1868   14.12   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.18 on 412 degrees of freedom
Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
anova(X3)
Analysis of Variance Table
Response: HOUSE_PRICE_PER_AREA
                      Df Sum Sq Mean Sq F value    Pr(>F)
CONVENIENCE_STR_NEAR   1  24930 24930.0  199.32 < 2.2e-16 ***
Residuals            412  51531   125.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Regression Analysis for CONVENIENCE_STR_NEAR:
In this analysis, the independent variable is “CONVENIENCE_STR_NEAR,” which represents the number of convenience stores within walking distance.
The coefficient for “CONVENIENCE_STR_NEAR” is estimated at 2.6377, and it is highly statistically significant (p-value < 2e-16). This suggests that each additional nearby convenience store is associated with an increase in house price per unit area of approximately 2.6377 units.
The R-squared value of 0.326 indicates that 32.6% of the variation in house prices can be explained by the number of nearby convenience stores. This suggests that “CONVENIENCE_STR_NEAR” is a moderately strong predictor of house prices in this dataset.
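The R-squared values quoted for the three models follow directly from the ANOVA tables: R-squared is the regression sum of squares divided by the total sum of squares. The sketch below recomputes them from the printed (rounded) sums of squares. The data frame `ss` is a hypothetical container for those numbers.

```r
# Sums of squares copied from the three ANOVA tables above.
ss <- data.frame(
  predictor = c("HOUSE_AGE", "NEAREST_MTR_STN", "CONVENIENCE_STR_NEAR"),
  ss_regression = c(3390, 34695, 24930),
  ss_residual = c(73071, 41767, 51531)
)

# R-squared = explained variation / total variation.
ss$r_squared <- ss$ss_regression / (ss$ss_regression + ss$ss_residual)
ss
```

Up to rounding in the printed sums of squares, the results agree with the reported 0.04434, 0.4538, and 0.326.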
CO-EFFICIENT OF THE LINEAR REGRESSION
These are the coefficients of the fitted models, which can be used for further predictions.
The formula is Y = (intercept) + (slope) × X.
For example, for model X1 the intercept is 42.4346970 and the slope for HOUSE_AGE is -0.2514884. Therefore the equation is Y = 42.4346970 + (-0.2514884)X, where X is the house age in years. Substituting a value of X gives the predicted house price per unit area.
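As a worked example of the equation for X1, the sketch below evaluates it for a few house ages. The function name `predict_price_from_age` is hypothetical, and the ages are arbitrary illustrative values.

```r
# Coefficients of model X1 as quoted above.
intercept <- 42.4346970
slope <- -0.2514884

# Hypothetical helper implementing Y = intercept + slope * X.
predict_price_from_age <- function(house_age) {
  intercept + slope * house_age
}

# Predicted price per unit area for houses aged 0, 10, and 30 years.
predict_price_from_age(c(0, 10, 30))
```

For a 10-year-old house the equation gives 42.4346970 - 2.514884 = 39.919813.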
RESIDUAL PLOT
Here is the residual plot for the given data
plot(HOUSE_AGE,X1$residuals,ylab="Residuals",xlim=c(0,45),ylim=c(-15,15),main="Residual Data Sheet 1")
The minimum residual is -31.113, indicating that there is an observation whose actual house price per unit area is 31.113 units lower than the predicted value.
The maximum residual is 77.781, indicating that there is an observation whose actual value is 77.781 units higher than the predicted value. Both extremes fall outside the plot's ylim of -15 to 15, so they do not appear in the plot.
The middle half of the residuals is roughly centered on zero (1Q -10.738, median 1.626, 3Q 8.199), but the largest positive residual is far larger than the largest negative one, so the tails are not symmetric.
The residuals show considerable spread. If the spread changes across levels of the independent variable, that would indicate heteroscedasticity, meaning that the variance of the residuals is not constant.
plot(NEAREST_MTR_STN,X2$residuals,ylab="Residuals",xlim=c(20,7000),ylim=c(-12,12),main="Residual Data Sheet 2")
The minimum residual is -35.396, indicating that there is an observation whose actual house price per unit area is 35.396 units lower than the predicted value.
The maximum residual is 73.483, indicating that there is an observation whose actual value is 73.483 units higher than the predicted value. Both extremes fall outside the plot's ylim of -12 to 12.
As in the previous analysis, the middle half of the residuals is roughly centered on zero (1Q -6.007, median -1.195, 3Q 4.831), while the positive tail is much longer than the negative one.
The residuals also show considerable spread, which would indicate heteroscedasticity if it varies with the distance to the station.
plot(CONVENIENCE_STR_NEAR,X3$residuals,ylab="Residuals",xlim=c(0,10),ylim=c(-13,13),main="Residual Data Sheet 3")
The minimum residual is -35.407, indicating that there is an observation whose actual house price per unit area is 35.407 units lower than the predicted value.
The maximum residual is 87.681, indicating that there is an observation whose actual value is 87.681 units higher than the predicted value. Both extremes fall outside the plot's ylim of -13 to 13.
Once again, the middle half of the residuals is roughly centered on zero (1Q -7.341, median -1.788, 3Q 5.984), while the positive tail is much longer than the negative one.
Like the other models, there is considerable spread in the residuals, which would indicate heteroscedasticity if it varies with the number of stores.
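The sign convention matters when reading these numbers: R defines a residual as the observed value minus the fitted value, so a positive residual means the model under-predicted. The sketch below confirms this on simulated data; `x`, `y`, and `fit` are hypothetical and unrelated to REA.

```r
# Simulated data (hypothetical, not the real data set).
set.seed(7)
x <- runif(60, 0, 45)
y <- 42 - 0.25 * x + rnorm(60, sd = 10)
fit <- lm(y ~ x)

# Residual = observed - fitted.
res <- residuals(fit)
all.equal(unname(res), unname(y - fitted(fit)))

# At the largest residual, the observed value exceeds the fitted value.
i_max <- which.max(res)
y[i_max] > fitted(fit)[i_max]

# range(res) gives axis limits that show every residual,
# for example plot(x, res, ylim = range(res)).
range(res)
```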
OVERALL ANALYSIS
# Load required libraries
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.3
library(lmtest)
Warning: package 'lmtest' was built under R version 4.2.3
Loading required package: zoo
Warning: package 'zoo' was built under R version 4.2.3
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
# Create a linear regression model
model <- lm(HOUSE_PRICE_PER_AREA ~ HOUSE_AGE + NEAREST_MTR_STN + CONVENIENCE_STR_NEAR, data = REA)
# Summarize the regression model
summary(model)
Call:
lm(formula = HOUSE_PRICE_PER_AREA ~ HOUSE_AGE + NEAREST_MTR_STN +
CONVENIENCE_STR_NEAR, data = REA)
Residuals:
    Min      1Q  Median      3Q     Max
-37.304  -5.430  -1.738   4.325  77.315
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          42.977286   1.384542  31.041  < 2e-16 ***
HOUSE_AGE            -0.252856   0.040105  -6.305 7.47e-10 ***
NEAREST_MTR_STN      -0.005379   0.000453 -11.874  < 2e-16 ***
CONVENIENCE_STR_NEAR  1.297443   0.194290   6.678 7.91e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.251 on 410 degrees of freedom
Multiple R-squared: 0.5411, Adjusted R-squared: 0.5377
F-statistic: 161.1 on 3 and 410 DF, p-value: < 2.2e-16
# Perform diagnostic tests (optional)
# Test for heteroscedasticity
bptest(model)
studentized Breusch-Pagan test
data: model
BP = 3.2911, df = 3, p-value = 0.3489
# Test for normality of residuals
shapiro.test(model$residuals)
Shapiro-Wilk normality test
data: model$residuals
W = 0.87858, p-value < 2.2e-16
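The Breusch-Pagan p-value reported above can be checked from the test statistic alone, since the statistic is compared against a chi-squared distribution with the stated degrees of freedom. The variable names in this sketch are hypothetical; the numbers are copied from the bptest() output.

```r
# Values copied from the bptest() output above.
bp_statistic <- 3.2911
bp_df <- 3

# Upper-tail probability of the chi-squared distribution.
bp_p_value <- pchisq(bp_statistic, df = bp_df, lower.tail = FALSE)
round(bp_p_value, 4)

# At the conventional 0.05 level, constant variance is not rejected.
bp_p_value > 0.05
```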
# Make predictions
predicted <- predict(model, newdata = REA)
# Add predictions to the dataframe
REA$Predicted_House_Price <- predicted
# View the results
head(REA)
The linear regression model aimed to predict “HOUSE_PRICE_PER_AREA” using three key independent variables: “HOUSE_AGE,” “NEAREST_MTR_STN,” and “CONVENIENCE_STR_NEAR.” The coefficients revealed valuable insights into these relationships. The intercept (constant) was estimated at 42.977286, indicating that when all independent variables are zero, the predicted house price per unit area is around 42.98.
The coefficients for the independent variables were as follows, each holding the other two variables constant: “HOUSE_AGE” had a coefficient of -0.252856, suggesting that for each additional year of house age, the predicted house price per unit area decreases by approximately 0.25 units. “NEAREST_MTR_STN” had a coefficient of -0.005379, signifying that for each additional meter of distance to the nearest MRT station, the predicted house price per unit area decreases by roughly 0.005 units. Lastly, “CONVENIENCE_STR_NEAR” had a coefficient of 1.297443, implying that each additional convenience store nearby was associated with an increase in predicted house price per unit area of about 1.30 units.
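To make the combined equation concrete, the sketch below evaluates it for a single illustrative property. The function `predict_price` and the example inputs (a 10-year-old house, 500 meters from a station, with 5 stores nearby) are hypothetical; the coefficients are copied from the model summary.

```r
# Coefficients copied from the multiple regression summary above.
b0 <- 42.977286        # intercept
b_age <- -0.252856     # HOUSE_AGE
b_dist <- -0.005379    # NEAREST_MTR_STN
b_store <- 1.297443    # CONVENIENCE_STR_NEAR

# Hypothetical helper implementing the fitted equation.
predict_price <- function(house_age, distance_m, n_stores) {
  b0 + b_age * house_age + b_dist * distance_m + b_store * n_stores
}

# Illustrative property: 10 years old, 500 m from a station, 5 stores nearby.
predict_price(10, 500, 5)
```

The terms are 42.977286 - 2.52856 - 2.6895 + 6.487215, giving a predicted price per unit area of about 44.25.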
All coefficients were statistically significant, as indicated by their low p-values. The model’s overall significance was confirmed by the F-statistic, which had a very small p-value, suggesting that the model as a whole is statistically significant.
Diagnostic Tests: Two diagnostic tests were performed. The first, a test for heteroscedasticity (Breusch-Pagan), yielded a p-value of 0.3489, so the hypothesis of constant residual variance was not rejected, which is a positive result. The second test, for the normality of residuals (Shapiro-Wilk), resulted in a very small p-value, implying that the residuals were not normally distributed. However, this can be acceptable in large samples such as this one (n = 414).
Predictions: The model was used to make predictions on the same dataset (“REA”) that it was fitted to. These predictions were added to the dataframe as “Predicted_House_Price,” providing fitted estimates of house prices per unit area based on the values of the independent variables for each data point.
In conclusion, the regression model sheds light on how “HOUSE_PRICE_PER_AREA” is influenced by “HOUSE_AGE,” “NEAREST_MTR_STN,” and “CONVENIENCE_STR_NEAR.” The coefficients elucidate the direction and strength of these relationships. The diagnostic tests give a mixed picture: the Breusch-Pagan test found no evidence of heteroscedasticity, while the Shapiro-Wilk test rejected normality of the residuals. Because the predictions were made on the data used to fit the model, they describe in-sample fit; predicting prices for new properties would require their values of the three independent variables.