This document comprises one of the data science projects done for different use cases. Dataset source is from public domain and curated to showcase different techniques and their business value(s).
Primary objectives are to demonstrate following knowledge:
Explore, understand & cleanse data with explained reasoning (Exploratory Data Analysis)
Identify, implement, test & select the best possible statistics model based on data-based proof
Systematic documentation with reproducible R coding for others to reproduce, understand & improve
Summarise observed findings with concise business values & insights through executive summary
Truly understand data project cycle to create a commercial data product with ability to execute statistics through R application
Complete data product via dashboard: Visualise processed output data through visualisation tools such as Qlik or Tableau - trained in Data Visualisation ;)
Deep business domain knowledge in the dataset
External dataset(s) identified to enrich or give another dimensions to tackle the business question
Working style or collaboration methods with other data team members (Range from data analyst, business analyst, data scientist, data engineer, visualisation engineer, business stakeholder, project manager)
Other statistics modeling which might be better or can be used as alternative.
Justification:
All data sources used are taken from public domain and projects are implemented using Rstudio. For full markdown version, kindly contact me via teochunwey@gmail.com
In this project, use case is to determine which is the predictor variable (X) to forecast the resales pricing (response variable/ Y).
Dataset in this context - Singapore HDB resale data from Jan 2015 till July 2018.
Reasoning:
Housing is one of the main topics in Singapore. The resales pricing greatly impact the people’s quality of life and their children upbringing due to the high resales pricing.
Another consideration is that the dataset variables can be easily identified and reproduce in other countries for a similar use case. Examples such as resale price and floor area.
As these modeling are based on historical data with certain assumption. This project did not take into other external factors such as GDP, average household income, politician parties, election period, land of sale price and etc…
Name Title Type Unit of Measure Description 1 month Month Datetime (Month) “YYYY-MM” - - 2 town Town Text (General) - - 3 flat_type Flat Type Text (General) - - 4 block Block Text (General) - - 5 street_name Street Name Text (General) - - 6 storey_range Storey Range Text (General) - - 7 floor_area_sqm Floor Area Numeric (General) Sqm - 8 flat_model Flat Model Text (General) - - 9 lease_commence_date Lease Commencement Date Datetime (Year) “YYYY” - - 10 remaining_lease Remaining Lease Numeric (General) Years - 11 resale_price Resale Price Numeric (General) $ -
Data Source: https://data.gov.sg/dataset/resale-flat-prices
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(stats)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(corrplot)
## corrplot 0.84 loaded
library(ggfortify)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:randomForest':
##
## outlier
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(caret)
## Loading required package: lattice
df.original<-read.csv('resales 2015.csv')
set.seed(1) #Ensure reproducible code
df<- sample_frac(df.original,0.7) #split into test and train data by 7:3 ratio
df.index<- as.numeric(rownames(df))
df.test<- df.original[-df.index,]
head(df)
## month town flat_type block street_name storey_range
## 18524 2016-01 PUNGGOL 3 ROOM 306A PUNGGOL PL 16 TO 18
## 25962 2016-06 BUKIT PANJANG 5 ROOM 505 JELAPANG RD 04 TO 06
## 39966 2017-03 CLEMENTI 5 ROOM 337 CLEMENTI AVE 2 16 TO 18
## 63361 2018-04 WOODLANDS 5 ROOM 717 WOODLANDS DR 70 07 TO 09
## 14070 2015-10 KALLANG/WHAMPOA 3 ROOM 28 JLN BAHAGIA 01 TO 03
## 62674 2018-04 PASIR RIS 5 ROOM 226 PASIR RIS ST 21 10 TO 12
## floor_area_sqm flat_model lease_commence_date remaining_lease
## 18524 67 Premium Apartment 2012 95
## 25962 123 Improved 1998 80
## 39966 119 Improved 1978 60
## 63361 120 Improved 1997 78
## 14070 53 Standard 1971 54
## 62674 126 Improved 1993 74
## resale_price
## 18524 407000
## 25962 446888
## 39966 718000
## 63361 395000
## 14070 268500
## 62674 425000
Investigate and compare sample records against data dictionary.
str(df)
## 'data.frame': 48837 obs. of 11 variables:
## $ month : Factor w/ 43 levels "2015-01","2015-02",..: 13 18 27 40 10 40 42 30 29 4 ...
## $ town : Factor w/ 26 levels "ANG MO KIO","BEDOK",..: 18 6 10 25 15 17 11 18 13 11 ...
## $ flat_type : Factor w/ 7 levels "1 ROOM","2 ROOM",..: 3 5 5 5 3 5 4 4 3 3 ...
## $ block : Factor w/ 2207 levels "1","10","100",..: 734 1250 839 1768 617 443 771 561 491 482 ...
## $ street_name : Factor w/ 525 levels "ADMIRALTY DR",..: 341 207 128 485 210 321 453 339 234 160 ...
## $ storey_range : Factor w/ 17 levels "01 TO 03","04 TO 06",..: 6 2 6 3 1 4 1 5 3 3 ...
## $ floor_area_sqm : num 67 123 119 120 53 126 84 93 67 59 ...
## $ flat_model : Factor w/ 21 levels "2-room","Adjoined flat",..: 13 5 5 5 18 5 17 13 12 5 ...
## $ lease_commence_date: int 2012 1998 1978 1997 1971 1993 1985 2013 1983 1977 ...
## $ remaining_lease : int 95 80 60 78 54 74 66 94 64 61 ...
## $ resale_price : num 407000 446888 718000 395000 268500 ...
Investigate the data types in data frame (df) data structure to ensure no changes is required.
summary(df)
## month town flat_type
## 2018-07: 1775 JURONG WEST: 3778 1 ROOM : 12
## 2017-11: 1406 SENGKANG : 3502 2 ROOM : 504
## 2018-06: 1393 WOODLANDS : 3499 3 ROOM :12517
## 2017-05: 1386 TAMPINES : 3299 4 ROOM :19953
## 2017-08: 1362 BEDOK : 2972 5 ROOM :11912
## 2018-03: 1332 YISHUN : 2852 EXECUTIVE : 3931
## (Other):40183 (Other) :28935 MULTI-GENERATION: 8
## block street_name storey_range
## 2 : 194 YISHUN RING RD : 805 04 TO 06:11423
## 1 : 165 BEDOK RESERVOIR RD: 623 07 TO 09:10679
## 8 : 163 ANG MO KIO AVE 10 : 589 10 TO 12: 9347
## 101 : 153 PUNGGOL CTRL : 545 01 TO 03: 8897
## 113 : 146 ANG MO KIO AVE 3 : 537 13 TO 15: 4385
## 110 : 145 PUNGGOL FIELD : 509 16 TO 18: 1867
## (Other):47871 (Other) :45229 (Other) : 2239
## floor_area_sqm flat_model lease_commence_date
## Min. : 31.00 Model A :14783 Min. :1966
## 1st Qu.: 76.00 Improved :12458 1st Qu.:1984
## Median : 97.00 New Generation : 7629 Median :1990
## Mean : 97.66 Premium Apartment: 5118 Mean :1992
## 3rd Qu.:112.00 Simplified : 2302 3rd Qu.:2001
## Max. :259.00 Apartment : 1978 Max. :2016
## (Other) : 4569
## remaining_lease resale_price
## Min. :24.00 Min. : 170000
## 1st Qu.:66.00 1st Qu.: 340000
## Median :73.00 Median : 410000
## Mean :74.06 Mean : 441000
## 3rd Qu.:83.00 3rd Qu.: 505000
## Max. :97.00 Max. :1160000
##
After which, a summary on each variables to understand the number of observation, distribution on the measures and outliers.
summary(is.na(df))
## month town flat_type block
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:48837 FALSE:48837 FALSE:48837 FALSE:48837
## street_name storey_range floor_area_sqm flat_model
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:48837 FALSE:48837 FALSE:48837 FALSE:48837
## lease_commence_date remaining_lease resale_price
## Mode :logical Mode :logical Mode :logical
## FALSE:48837 FALSE:48837 FALSE:48837
Investigate if there are any missing values (NA) in each variables. For this dataset, luckily is quited clean.
However, if there are NA values, following are the recommended approach.
Exlcude them using R code: na.rm=TRUE -> Generally used on vector
Use default values that is pre-defined based on domain knowledge. Usual suspects are 0, -1 or 9999999999
Recode the missing value with a mean/ median value of that variable
Omit the records with missing value: na.omit(df) -> Subset the dataset
Before jumping into visual analysis, sometimes it is useful to look at frequency of important variables. In this context, town, flat_type, floor area sqm and storey_range are identified.
Based on basic domain understanding and assumption, these are important factors that are often considered which may affect the resale pricing. For illustration sake, we run the results below to show the frequency and distribution for each variables.
table(df$town)
##
## ANG MO KIO BEDOK BISHAN BUKIT BATOK
## 2370 2972 960 1934
## BUKIT MERAH BUKIT PANJANG BUKIT TIMAH CENTRAL AREA
## 1884 1704 137 504
## CHOA CHU KANG CLEMENTI GEYLANG HOUGANG
## 2192 1179 1306 2408
## JURONG EAST JURONG WEST KALLANG/WHAMPOA MARINE PARADE
## 1092 3778 1520 299
## PASIR RIS PUNGGOL QUEENSTOWN SEMBAWANG
## 1530 2647 1298 1325
## SENGKANG SERANGOON TAMPINES TOA PAYOH
## 3502 1075 3299 1571
## WOODLANDS YISHUN
## 3499 2852
table(df$flat_type)
##
## 1 ROOM 2 ROOM 3 ROOM 4 ROOM
## 12 504 12517 19953
## 5 ROOM EXECUTIVE MULTI-GENERATION
## 11912 3931 8
table(df$storey_range)
##
## 01 TO 03 04 TO 06 07 TO 09 10 TO 12 13 TO 15 16 TO 18 19 TO 21 22 TO 24
## 8897 11423 10679 9347 4385 1867 807 578
## 25 TO 27 28 TO 30 31 TO 33 34 TO 36 37 TO 39 40 TO 42 43 TO 45 46 TO 48
## 313 194 83 101 88 48 8 13
## 49 TO 51
## 6
Based on above findings, these will give some directions on the distribution and potential outliers (such as outlier or influential outlier).
Next, visualise analysis - number of transactions by month.
ggplot(df,aes(x=month))+geom_bar(color='darkblue', fill='lightblue') + theme(axis.text.x= element_text(angle=90,hjust=1)) + ggtitle("No. of transactions by month")
Above plot observed the following findings:
For the past 3 years+, there seem to have a seasonal trend for resale housing.
Generally, there will be a dip during Chinese New Year (CNY) period and a spike around October.
On a year-on-year basis, annual transaction for each year is trending higher.
During July 2018, there is a big spike of around 2.5k transactions. Most likely due to some external factors or news such as new policy. Required further investigation.
df$year<- as.factor(substr(df$month,1,4))
ggplot(df,aes(x=year))+geom_bar(color='darkblue', fill='lightblue') + theme(axis.text.x= element_text(hjust=1)) + ggtitle("No. of transactions by year")+ geom_text(stat='count',aes(label=..count..),vjust=2)
Moving next, above plot gives a holistic view of the resale transaction by year.
Findings:
This validate the assumption finding on point #3 in the previous plot.
2018 data only consists till July data. Based on past data, total transactions at year end may trend higher barring any unforeseen circumstances.
Based on temporal plots observation, this give us a good understand of what is happening on a macro view. Next will be on the visual analysis - histogram.
ggplot(df,aes(resale_price))+geom_histogram(color='darkblue', fill='lightblue') + ggtitle("Resale pricing histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Above plot is a right-skewed (positive-skew) distribution with a long right tail.
Findings:
Mean is to the right of the peak.
R default no. of bins is 30. We will re-define this bins and replot the histogram.
Based on summary table shown above, min and max of resale price is 170000 and 1180000 respectively. As such, following are computed to get the bin width:
ggplot(df,aes(resale_price))+geom_histogram(color='grey',fill='peachpuff',binwidth = 3811.321) + ggtitle("Resale pricing histogram") + coord_cartesian(xlim = c(230000, 900000)) + geom_vline(aes(xintercept=mean(resale_price)),color="blue", linetype="dashed", size=1, alpha = 1) + geom_vline(aes(xintercept=median(resale_price)),color="chocolate3", linetype="dashed", size=1, alpha = 1) + geom_text(x = median(df$resale_price), y = 200, label = paste("Median", round(median(df$resale_price), digits = 2), sep = '\n'), color = "chocolate3")+
geom_text(x = mean(df$resale_price), y = 200, label = paste("Mean", round(mean(df$resale_price), digits = 2), sep = '\n'), color = "blue",hjust=-1)
Findings:
Right-skewed distribtion plot has a mean of $440000 resale price and a median of $410000.
The mode (peak) for resale price hover around $420000 mark having a count of nearing 1500 transactions.
Distribution is concentrated mainly on left of the plot.
The right tail is longer.
This gives a distribution pattern of frequency on transaction prices generally falls under.
Note: Regression model is not valid due to the normality distribution on response variable is not met. Either log is square to get a normal distribution.
Steps which can be further taken (not done in this analysis):
log.x <- log(df$resale_price)
ggplot(df,aes(log.x))+geom_histogram(color='grey',fill='peachpuff') + ggtitle("Resale pricing histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Log to assess normality distribution.
Findings:
Bimodal distribution with 2 peaks
Right Skewness still valid
For illustration sake, we will continue the linear regression analysis even the distribution do not valid for a parametric test.
Next step, visual analysis - boxplot on resale price distribution by town, storey range, flat types, etc…
ggplot(data=df, aes(x=town, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by town (Boxplot 1)') + theme(axis.text.x= element_text(angle=90,hjust=1)) +
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(2,mean(resale_price),label = 'mean', vjust = -1),color='blue')
ggplot(data=df, aes(x=storey_range, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by storey range (Boxplot 2)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(2,mean(resale_price),label = 'mean', vjust = -1),color='blue')
ggplot(data=df, aes(x=flat_type, y=resale_price)) + geom_boxplot() +ggtitle('Distribution of resale price by flat type (Boxplot 3)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(1,mean(resale_price),label = 'mean', vjust = -1),color='blue')
ggplot(data=df, aes(x=lease_commence_date, y=resale_price,group=lease_commence_date)) + geom_boxplot() +ggtitle('Distribution of resale price by lease commence date (Boxplot 4)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(1966,mean(resale_price),label = 'mean', vjust = -1),color='blue')
ggplot(data=df, aes(x=remaining_lease, y=resale_price,group=remaining_lease)) + geom_boxplot() +ggtitle('Distribution of resale price by remaining lease (Boxplot 5)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(24,mean(resale_price),label = 'mean', vjust = -1),color='blue') + coord_cartesian(
xlim = c(23, 100))
Based on boxplots above (non exhaustive EDA), below are the findings:
For resale pricing on towm, the higher median (50th percentile) falls onto Bishan, Bukit Timah, Central Area, Bukit Merah, Kallang/ Whampoa, Queenstown, Serangoon. This concides with the domain knowledge as these are the mature estate.
For resale pricing on storey range. Generally, the higher the storey range, the higher the median (50th percentile) for resale pricing.
Most of the outliers for resale pricing fall between 01 to 18 storey. Based on domain knowledge, this may be due to the flats constructed during the early days (1966 to 1990) where the highest is 12 to 18 storey. To add on, these naturally will be the current mature estate where the median is higher the mean (shown in boxplot 1). This explained why the mature estate sell at at higher price.
For resale pricing on flat type, 3/4/5/ executive rooms have the most outliers.
There is a required needs to remove the large amount of outliers as this greatly affect the linear model estimation. This can be checked using cook’s distance plot.
cor.result<- cor(df[,c(7,10,11)])
corrplot(cor.result, method="ellipse")
Findings show that resale price and floor area sqm have the highest correlation due to color intensity.
#Create quantile on dataset group by remaining lease
leasePeriod.Quantile<-ddply(df,'remaining_lease',function(x) quantile(x$resale_price))
colnames(leasePeriod.Quantile)<-c("remaining_lease","minValue", "Q1", "Q2", "Q3","maxValue")
#create IQR & Range
leasePeriod.Quantile$IQR <- leasePeriod.Quantile$Q3-leasePeriod.Quantile$Q1
leasePeriod.Quantile$Range <- leasePeriod.Quantile$IQR*1.5
leasePeriod.Quantile$lowOutlier<- (leasePeriod.Quantile$Q1-leasePeriod.Quantile$Range)
leasePeriod.Quantile$upperOutlier<- (leasePeriod.Quantile$Q3 + leasePeriod.Quantile$Range)
#create variables
OBS<- data.table(df)
SAM<- data.table(leasePeriod.Quantile)
#Merge the quantile and dataset together
df100<-merge(OBS,SAM, allow.cartesian = TRUE, by='remaining_lease')
#Remove Upper Outlier & Lower Outlier
dfsubset2<-df100[!df100$resale_price > df100$upperOutlier,]
dfsubset2<-dfsubset2[!dfsubset2$resale_price<dfsubset2$lowOutlier,]
#Boxplot to have a visual inspection
ggplot(data=dfsubset2, aes(x=remaining_lease, y=resale_price,group=remaining_lease)) + geom_boxplot() +ggtitle('Distribution of resale price by remaining lease (Boxplot 6)') + theme(axis.text.x= element_text(angle=90,hjust=1))+
geom_hline(aes(yintercept=mean(dfsubset2$resale_price)),color="blue", linetype="dashed",size=1, alpha = 1) + geom_text(aes(24,mean(dfsubset2$resale_price),label = 'mean', vjust = -1),color='blue') + coord_cartesian(
xlim = c(23, 100))+stat_boxplot(geom ='errorbar')
cor.result<- cor(df[,c(7,10,11)])
corrplot(cor.result, method="ellipse")
Rerun the corelation plot without outliers. The previous correlation conclusion stays.
Above EDA give a better understanding of data on distribution, outliers in each variables and resale price distribution. With this information, following steps will be performed under this section:
Theory - Using others research findings to determine the important independent variables to specify the best model.
Automatic Tools - Principal component regression, Stepwise regression & Best subsets regression.
Domain knowledge - Deep domain knowledge through observation and research will equip one to identify imporant independent variables.
For this project, we are leveraging on point #3 with assumption that floor area sqm identified are important in a resale pricing.
Best model is only as good as the variables identified and measured during the study.
Data sample might be either by chance or data collection methodology. This resulted in false positive & false negative.
Multicollinearity can impact significance and difficulty in assessing predictor/ independent variable significant.
correlation does not imply causation.
Below modeling will be used:
regressionLease <- lm(formula = resale_price ~remaining_lease, data=dfsubset2)
summary(regressionLease)
##
## Call:
## lm(formula = resale_price ~ remaining_lease, data = dfsubset2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -313111 -80551 -28551 53937 633801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94247.64 3481.05 27.07 <2e-16 ***
## remaining_lease 4455.99 46.46 95.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 113800 on 46489 degrees of freedom
## Multiple R-squared: 0.1652, Adjusted R-squared: 0.1651
## F-statistic: 9197 on 1 and 46489 DF, p-value: < 2.2e-16
p value - to assess statistical significance
t value - determine whether the mean of a population significantly differs from a specific value (hypothesized mean) or from the mean of another population.
Coefficient estimate
Diagnostic plots - to assess if the model fit the data
This means the sample size collected to calculate the t value is more clustered which is statistically significant (not by chance) with a ration of higher signal to lower noise.
In short, the higher the better as it is formulated based on Ratio of Signal(the difference of mean of sample size and hypothesized mean of population) to Noise (the variation).
Difference in signal isn’t large enough
Variation of noise is too big: Due to either special-cause variation which can be identify using control chart
Sample size is too small: This also explains why an extremely large sample can produce statistically significant results even when a difference is very small and has no practical consequence.
p value : less than 0.05 (Statistically significant) t value : more than 2 (statistically difference)
Third, visit the coefficients (estimate) for associated changes between dependent (y) and independent variable (x/ predictor).
First row is the intercept, expected resale price of $59528.85 for 0 sqm floor area which in our context doesn’t make sense. You can’t have no house to sell and still have a price tag of $60k.
Second row is the coefficient, for every 1 sqm floor area increase, the resale price increase by $3735.28.
autoplot(regressionLease)
The ideal result for above plots should be unbiased (residual randomly distributed) and homoscedastic (spread uniformly around the red line)
Residuals = Observation - fitted value (y value from linear regression formula)
Unbiased: have an average value of zero in any thin vertical strip, and
homoscedastic, which means “same stretch”: the spread of the residuals should be the same in any thin vertical strip.
Top Left: The plot is biased and Heteroscedasticity. As such, linear regression do not fit for this dataset.
Top Right: The distribution is skewed right based on the tail. The deviation become larger when the resale price increases.
Bottom Left: Square the residual. Similar to bottom left, this is biased and Heteroscedasticity.
Bottom Right: there is no influential observation that is outside the cook’s distance (not seen in the plot).
For sake of illustration, if the residual plot is unbiased and homoscedastic. Next 2 values to assess will be R squared (linear regression), Predicted R squared and F test.
R squared is used as a statistical measure to determine how close the data are to the fitted regression line. The downside is that in a multiple regression, any increase in one independent variable, R square percentage will be increased. Thus, adjusted R square is used.
Adjusted R square: 0.5217 Predicted R square: Predicted R square is to determine how well a regression model makes predictions.
pr <- residuals(regressionLease)/(1 - lm.influence(regressionLease)$hat)
PRESS <- sum(pr^2)
PRESS #predictive power of regression (smaller better)
## [1] 6.021147e+14
# anova to calculate residual sum of squares
my.anova <- anova(regressionLease)
tss <- sum(my.anova$"Sum Sq")
# predictive R^2
pred.r.squared <- 1 - PRESS/(tss)
pred.r.squared
## [1] 0.165088
Predicted R Sq & predicted residual sum of squares(PRESS) are used to judge the power of regression. PRESS is preferred here.
RMSE<-sqrt(mean(regressionLease$residuals^2))
RMSE
## [1] 113798.1
#predict.model <- predict(regressionFloor,df.test)
#new.modelvalues <- data.frame(obs=df.test$floor_area_sqm, pred='prediction')
#defaultSummary(new.modelvalues)
RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values.
In this context, it is not consider a good fit (lower values of RMSE better).
Conclusion: Based on the diagnotics and conclusion, the linear model do not have a good fit.
Kendall-Theil-Siegel regression, and quantile regression are suggested models.