Reading the data

I found this dataset on Kaggle while searching Google for data suitable for a linear regression. Each row describes a house in Taiwan: its age, its distance from the nearest MRT station, the number of nearby convenience stores, its latitude and longitude, and its price per unit area. I built a simple regression model that takes the age of the house as the input and predicts the price of the house as the output.

Data <- read.csv("https://raw.githubusercontent.com/AldataSci/LinearRegressionAnalysis/main/Real%20estate.csv",header=TRUE)
head(Data)
##   No X1.transaction.date X2.house.age X3.distance.to.the.nearest.MRT.station
## 1  1            2012.917         32.0                               84.87882
## 2  2            2012.917         19.5                              306.59470
## 3  3            2013.583         13.3                              561.98450
## 4  4            2013.500         13.3                              561.98450
## 5  5            2012.833          5.0                              390.56840
## 6  6            2012.667          7.1                             2175.03000
##   X4.number.of.convenience.stores X5.latitude X6.longitude
## 1                              10    24.98298     121.5402
## 2                               9    24.98034     121.5395
## 3                               5    24.98746     121.5439
## 4                               5    24.98746     121.5439
## 5                               5    24.97937     121.5425
## 6                               3    24.96305     121.5125
##   Y.house.price.of.unit.area
## 1                       37.9
## 2                       42.2
## 3                       47.3
## 4                       54.8
## 5                       43.1
## 6                       32.1

Building the Model

## Let's build a simple regression model that predicts the price of a house from its age.
## First, let's give the columns meaningful names.
colnames(Data) <- c("Index","Transaction_Date","House_Age","distance_to_nearest MRT station","#of_convience_stores","latitude","longitude","Price_of_Unit")
head(Data)
##   Index Transaction_Date House_Age distance_to_nearest MRT station
## 1     1         2012.917      32.0                        84.87882
## 2     2         2012.917      19.5                       306.59470
## 3     3         2013.583      13.3                       561.98450
## 4     4         2013.500      13.3                       561.98450
## 5     5         2012.833       5.0                       390.56840
## 6     6         2012.667       7.1                      2175.03000
##   #of_convience_stores latitude longitude Price_of_Unit
## 1                   10 24.98298  121.5402          37.9
## 2                    9 24.98034  121.5395          42.2
## 3                    5 24.98746  121.5439          47.3
## 4                    5 24.98746  121.5439          54.8
## 5                    5 24.97937  121.5425          43.1
## 6                    3 24.96305  121.5125          32.1
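
Two of the names assigned above (the one containing spaces and the one starting with `#`) are not syntactic R names, so they have to be wrapped in backticks whenever they are used with `$` or inside a formula. The sketch below illustrates this on a hypothetical three-row data frame called `toy`, with values copied from the `head()` output. It also shows `make.names()` as one possible way to get syntactic names; this is an optional alternative, not what the analysis above does.

```r
# Hypothetical toy data frame; values copied from the first rows shown above.
toy <- data.frame(
  House_Age = c(32.0, 19.5, 13.3),
  stores    = c(10, 9, 5)
)

# Reuse one of the non-syntactic names assigned in the analysis.
colnames(toy) <- c("House_Age", "#of_convience_stores")

# A non-syntactic name must be quoted with backticks to be used with `$`.
store_counts <- toy$`#of_convience_stores`

# make.names() converts arbitrary strings into syntactic R names.
syntactic_names <- make.names(colnames(toy))
print(syntactic_names)
```
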
## Let's create a simple linear regression model where the input is the age of the house and the output is the price of the unit.
house_model <- lm(Data$Price_of_Unit ~ Data$House_Age,data = Data)
## The equation of the fitted regression model is: Price_of_Unit = 42.4347 - 0.2515 * House_Age
house_model
## 
## Call:
## lm(formula = Data$Price_of_Unit ~ Data$House_Age, data = Data)
## 
## Coefficients:
##    (Intercept)  Data$House_Age  
##        42.4347         -0.2515
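
As a hedged aside on the formula: writing `Data$House_Age` inside `lm()` works for fitting, but `predict()` can then no longer match a `newdata` column by name, so it returns fitted values for the original rows (with a warning) instead of predictions for the new houses. A minimal sketch of the bare-column form follows. The data frame `toy` holds only the six rows printed by `head()`, so `toy_model` and its coefficients are illustrative and will not match the full-data fit above. The last lines plug two hypothetical ages (10 and 20 years) into the reported equation.

```r
# Toy subset: the six rows printed by head(Data) above (not the full dataset).
toy <- data.frame(
  House_Age     = c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1),
  Price_of_Unit = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)

# Bare column names in the formula; the data frame is supplied via `data`.
toy_model <- lm(Price_of_Unit ~ House_Age, data = toy)

# Because the formula refers to House_Age by name, predict() can look it up
# in newdata. The ages 10 and 20 are hypothetical.
new_houses <- data.frame(House_Age = c(10, 20))
toy_pred <- predict(toy_model, newdata = new_houses)
print(toy_pred)

# The same two ages plugged into the equation reported for the full data.
reported_pred <- 42.4347 - 0.2515 * new_houses$House_Age
print(reported_pred)
```
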

Visualizing the Data

plot(Data$Price_of_Unit ~ Data$House_Age,data = Data)
abline(house_model)

Interpreting the model

## The median residual (1.626) is fairly close to 0, and the slope for house age is statistically significant (p = 1.56e-05),
## which suggests that the age of the house plays a role in its price. However, R^2 is only 0.04434, so age explains
## about 4.4% of the variation in price and the model fits the data poorly.
summary(house_model)
## 
## Call:
## lm(formula = Data$Price_of_Unit ~ Data$House_Age, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.113 -10.738   1.626   8.199  77.781 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    42.43470    1.21098  35.042  < 2e-16 ***
## Data$House_Age -0.25149    0.05752  -4.372 1.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.32 on 412 degrees of freedom
## Multiple R-squared:  0.04434,    Adjusted R-squared:  0.04202 
## F-statistic: 19.11 on 1 and 412 DF,  p-value: 1.56e-05
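
To make the R^2 value less of a black box, the sketch below recomputes it by hand as 1 - SS_res / SS_tot and, since there is a single predictor, as the squared correlation between age and price. It uses the hypothetical `toy` subset (the six rows printed by `head()`), so the number it produces is illustrative and is not the 0.04434 reported for the full data.

```r
# Toy subset: the six rows printed by head(Data) above (not the full dataset).
toy <- data.frame(
  House_Age     = c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1),
  Price_of_Unit = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)
toy_model <- lm(Price_of_Unit ~ House_Age, data = toy)

# Residual sum of squares: variation left unexplained by the model.
ss_res <- sum(resid(toy_model)^2)

# Total sum of squares: variation of the response around its mean.
ss_tot <- sum((toy$Price_of_Unit - mean(toy$Price_of_Unit))^2)

# R^2 is the share of the total variation explained by the model.
r2_manual <- 1 - ss_res / ss_tot

# With one predictor, R^2 equals the squared Pearson correlation.
r2_cor <- cor(toy$House_Age, toy$Price_of_Unit)^2

# The value reported by summary() for comparison.
r2_summary <- summary(toy_model)$r.squared
print(c(manual = r2_manual, cor_squared = r2_cor, summary = r2_summary))
```
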

Residual Analysis

## There is no clear trend in the residuals, which may suggest that a straight-line relationship with the age of the house
## is adequate for describing the data. (pg 23 in textbook)
plot(fitted(house_model),resid(house_model))
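
A dashed horizontal line at zero can make it easier to judge whether the residuals scatter evenly around it. The sketch below adds that line and also checks two properties that hold for any least-squares fit with an intercept: the residuals sum to (numerically) zero and are uncorrelated with the fitted values. It uses the hypothetical `toy` subset built from the six rows printed by `head()`, so the plot is only illustrative; the axis labels are my own wording.

```r
# Toy subset: the six rows printed by head(Data) above (not the full dataset).
toy <- data.frame(
  House_Age     = c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1),
  Price_of_Unit = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)
toy_model <- lm(Price_of_Unit ~ House_Age, data = toy)

res <- resid(toy_model)
fit <- fitted(toy_model)

# Least-squares residuals sum to zero (up to floating-point error) ...
resid_sum <- sum(res)

# ... and are uncorrelated with the fitted values.
resid_fit_cor <- cor(fit, res)

# Residuals against fitted values, with a dashed reference line at zero.
plot(fit, res, xlab = "Fitted price", ylab = "Residual")
abline(h = 0, lty = 2)
```
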


Q-Q plots

## We can create a Q-Q plot to check whether the residuals are approximately normally distributed, as the model assumes.
## If they are, the points should fall close to the reference line.
qqnorm(resid(house_model))
qqline(resid(house_model))

There is some slight divergence at both ends of the plot: the left tail is “lighter” than expected, and some of the observations in the right tail are “heavier” than expected. This matches the residual summary, where the largest residual (77.781) lies much farther from 0 than the smallest (-31.113).
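
To see what the Q-Q plot is comparing, the sketch below extracts the plotted coordinates with `qqnorm(..., plot.it = FALSE)`: the sorted residuals on one axis and the matching standard normal quantiles on the other. It again uses the hypothetical `toy` subset (the six rows printed by `head()`), so the numbers are illustrative only.

```r
# Toy subset: the six rows printed by head(Data) above (not the full dataset).
toy <- data.frame(
  House_Age     = c(32.0, 19.5, 13.3, 13.3, 5.0, 7.1),
  Price_of_Unit = c(37.9, 42.2, 47.3, 54.8, 43.1, 32.1)
)
toy_model <- lm(Price_of_Unit ~ House_Age, data = toy)
res <- resid(toy_model)

# plot.it = FALSE returns the coordinates instead of drawing the plot.
qq <- qqnorm(res, plot.it = FALSE)

# x holds the theoretical normal quantiles, y the observed residuals.
theoretical <- sort(unname(qq$x))
observed    <- sort(unname(qq$y))

# The theoretical quantiles are normal quantiles at evenly spread
# probability points, one per residual.
expected_theoretical <- qnorm(ppoints(length(res)))

print(cbind(theoretical, observed))
```

Points whose observed value sits well above the theoretical value at the upper end correspond to a heavier right tail than the normal distribution would give.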