I found this dataset on Kaggle while searching Google for a dataset I could build a linear regression on. Each row describes a house in Taiwan: its age, its distance from the nearest MRT station, the number of nearby convenience stores, its latitude and longitude, and its price per unit area. I created a simple regression model that takes the age of the house as the input and predicts the price of the house as the output.
Data <- read.csv("https://raw.githubusercontent.com/AldataSci/LinearRegressionAnalysis/main/Real%20estate.csv",header=TRUE)
head(Data)
## No X1.transaction.date X2.house.age X3.distance.to.the.nearest.MRT.station
## 1 1 2012.917 32.0 84.87882
## 2 2 2012.917 19.5 306.59470
## 3 3 2013.583 13.3 561.98450
## 4 4 2013.500 13.3 561.98450
## 5 5 2012.833 5.0 390.56840
## 6 6 2012.667 7.1 2175.03000
## X4.number.of.convenience.stores X5.latitude X6.longitude
## 1 10 24.98298 121.5402
## 2 9 24.98034 121.5395
## 3 5 24.98746 121.5439
## 4 5 24.98746 121.5439
## 5 5 24.97937 121.5425
## 6 3 24.96305 121.5125
## Y.house.price.of.unit.area
## 1 37.9
## 2 42.2
## 3 47.3
## 4 54.8
## 5 43.1
## 6 32.1
## Let's build a simple regression model that predicts the price of the house, using the age of the house as the predictor.
## Let's make meaningful column names!
colnames(Data) <- c("Index","Transaction_Date","House_Age","distance_to_nearest_MRT_station","num_of_convenience_stores","latitude","longitude","Price_of_Unit")
head(Data)
## Index Transaction_Date House_Age distance_to_nearest_MRT_station
## 1 1 2012.917 32.0 84.87882
## 2 2 2012.917 19.5 306.59470
## 3 3 2013.583 13.3 561.98450
## 4 4 2013.500 13.3 561.98450
## 5 5 2012.833 5.0 390.56840
## 6 6 2012.667 7.1 2175.03000
## num_of_convenience_stores latitude longitude Price_of_Unit
## 1 10 24.98298 121.5402 37.9
## 2 9 24.98034 121.5395 42.2
## 3 5 24.98746 121.5439 47.3
## 4 5 24.98746 121.5439 54.8
## 5 5 24.97937 121.5425 43.1
## 6 3 24.96305 121.5125 32.1
## Let's create a simple linear regression model where the input is the age of the house and the output is the price of the unit.
house_model <- lm(Data$Price_of_Unit ~ Data$House_Age,data = Data)
## The equation of the regression model is: price of the house = 42.4347 - 0.2515 * House_Age
house_model
##
## Call:
## lm(formula = Data$Price_of_Unit ~ Data$House_Age, data = Data)
##
## Coefficients:
## (Intercept) Data$House_Age
## 42.4347 -0.2515
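To make the fitted equation concrete, here is a small worked example that plugs a house age into it by hand. The coefficients are copied from the output above; the helper name `predict_price` and the example age of 10 years are my own assumptions, used only for illustration.

```r
# Coefficients copied from the fitted model output above.
intercept <- 42.4347
slope     <- -0.2515

# Hypothetical helper: apply the regression equation to a house age in years.
predict_price <- function(house_age) {
  intercept + slope * house_age
}

# Example: a 10-year-old house (the age is an arbitrary illustrative value).
price_at_10 <- predict_price(10)
print(price_at_10)
```

Each extra year of age lowers the predicted price by about 0.25, so the prediction for a 10-year-old house sits roughly 2.5 below the intercept.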
plot(Data$Price_of_Unit ~ Data$House_Age,data = Data)
abline(house_model)
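One caveat with writing the formula as `Data$Price_of_Unit ~ Data$House_Age`: `predict()` cannot match those terms to the columns of a new data frame, so it falls back to the original rows and issues a warning. A minimal sketch of the alternative, using bare column names, is below. The `toy` data frame and its values are made up for illustration and are not the Taiwan dataset.

```r
# Hypothetical toy data (NOT the Taiwan dataset), only to illustrate the formula style.
toy <- data.frame(
  House_Age     = c(5, 10, 15, 20, 30),
  Price_of_Unit = c(45, 42, 40, 37, 33)
)

# Bare column names in the formula; lm() looks them up in `data`.
toy_model <- lm(Price_of_Unit ~ House_Age, data = toy)

# predict() can now match House_Age in newdata to the model term.
new_houses <- data.frame(House_Age = c(8, 25))
predicted  <- predict(toy_model, newdata = new_houses)
print(predicted)
```

The same pattern should carry over to the real data as `lm(Price_of_Unit ~ House_Age, data = Data)`, which gives the same coefficients but labels the slope `House_Age` instead of `Data$House_Age`.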
## The median residual (1.626) is fairly close to 0, and the House_Age coefficient is statistically significant (p = 1.56e-05),
## which suggests that the age of the house plays a role in its price. However, R^2 is very small (0.044): age alone explains
## only about 4% of the variation in price, so the model fits the data poorly.
summary(house_model)
##
## Call:
## lm(formula = Data$Price_of_Unit ~ Data$House_Age, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.113 -10.738 1.626 8.199 77.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.43470 1.21098 35.042 < 2e-16 ***
## Data$House_Age -0.25149 0.05752 -4.372 1.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.32 on 412 degrees of freedom
## Multiple R-squared: 0.04434, Adjusted R-squared: 0.04202
## F-statistic: 19.11 on 1 and 412 DF, p-value: 1.56e-05
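As a sanity check on the summary above, the adjusted R-squared can be recomputed by hand from the multiple R-squared and the degrees of freedom. This is a sketch using the standard adjustment formula; the variable names are my own, and n = 414 is inferred from the 412 residual degrees of freedom plus the two estimated coefficients.

```r
r_squared <- 0.04434   # Multiple R-squared from the summary output
df_resid  <- 412       # residual degrees of freedom from the summary output
n_coef    <- 2         # intercept + slope
n         <- df_resid + n_coef   # inferred number of observations

# Standard adjustment: penalise R^2 for the number of estimated coefficients.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid
print(round(adj_r_squared, 5))
```

With only one predictor and 414 rows the penalty is tiny, which is why the adjusted value is so close to the unadjusted one.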
## There is no clear trend in the residuals, which may suggest that the age of the house is a sufficient predictor
## for explaining the data (pg 23 in textbook).
plot(fitted(house_model),resid(house_model))
## We can create a Q-Q plot to check whether the model fits the data well; if it does, we expect the residuals to be normally
## distributed.
qqnorm(resid(house_model))
qqline(resid(house_model))
There is some slight divergence at both ends of the Q-Q plot: the left tail is “lighter” than expected, and some of the observations in the right tail are “heavier” than expected.
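A numeric test can complement the visual Q-Q check. The sketch below uses simulated right-skewed values as a stand-in for `resid(house_model)`, so the number it prints says nothing about the Taiwan data. With the real model, one would pass `resid(house_model)` to the same function.

```r
set.seed(1)  # make the simulated stand-in residuals reproducible

# Hypothetical stand-in for resid(house_model): centred, right-skewed values.
fake_resid <- rexp(200, rate = 1) - 1

# Shapiro-Wilk test from base R's stats package; the null hypothesis is normality.
sw <- shapiro.test(fake_resid)
print(sw$p.value)

# A small p-value is evidence against normally distributed residuals.
non_normal <- sw$p.value < 0.05
```

A formal test like this is quite sensitive at larger sample sizes, so it is probably best read alongside the Q-Q plot and not in place of it.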