Read Data

#Data source: https://www.kaggle.com/farhankarim1/usa-house-prices
housing_db = read.csv(file = '/Users/anjalibm.com/Documents/DataScience/DATA_605/Simple_Linear_regression/USA_Housing.csv')
head(housing_db)
##   Avg..Area.Income Avg..Area.House.Age Avg..Area.Number.of.Rooms
## 1         79545.46            5.682861                  7.009188
## 2         79248.64            6.002900                  6.730821
## 3         61287.07            5.865890                  8.512727
## 4         63345.24            7.188236                  5.586729
## 5         59982.20            5.040555                  7.839388
## 6         80175.75            4.988408                  6.104512
##   Avg..Area.Number.of.Bedrooms Area.Population     Price
## 1                         4.09        23086.80 1059033.6
## 2                         3.09        40173.07 1505890.9
## 3                         5.13        36882.16 1058988.0
## 4                         3.26        34310.24 1260616.8
## 5                         4.23        26354.11  630943.5
## 6                         4.04        26748.43 1068138.1
##                                                Address
## 1 208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101
## 2 188 Johnson Views Suite 079\nLake Kathleen, CA 48958
## 3  9127 Elizabeth Stravenue\nDanieltown, WI 06482-3489
## 4                            USS Barnett\nFPO AP 44820
## 5                           USNS Raymond\nFPO AE 09386
## 6 06039 Jennifer Islands Apt. 443\nTracyport, KS 16077
library(psych)  # describe() comes from the psych package
describe(housing_db)
##                              vars    n       mean        sd     median
## Avg..Area.Income                1 5000   68583.11  10657.99   68804.29
## Avg..Area.House.Age             2 5000       5.98      0.99       5.97
## Avg..Area.Number.of.Rooms       3 5000       6.99      1.01       7.00
## Avg..Area.Number.of.Bedrooms    4 5000       3.98      1.23       4.05
## Area.Population                 5 5000   36163.52   9925.65   36199.41
## Price                           6 5000 1232072.65 353117.63 1232669.38
## Address*                        7 5000    2500.50   1443.52    2500.50
##                                 trimmed       mad      min        max
## Avg..Area.Income               68611.84  10598.27 17796.63  107701.75
## Avg..Area.House.Age                5.98      0.99     2.64       9.52
## Avg..Area.Number.of.Rooms          6.99      1.01     3.24      10.76
## Avg..Area.Number.of.Bedrooms       3.92      1.33     2.00       6.50
## Area.Population                36112.49   9997.21   172.61   69621.71
## Price                        1232159.69 350330.42 15938.66 2469065.59
## Address*                        2500.50   1853.25     1.00    5000.00
##                                   range  skew kurtosis      se
## Avg..Area.Income               89905.12 -0.03     0.04  150.73
## Avg..Area.House.Age                6.87 -0.01    -0.09    0.01
## Avg..Area.Number.of.Rooms          7.52 -0.04    -0.08    0.01
## Avg..Area.Number.of.Bedrooms       4.50  0.38    -0.70    0.02
## Area.Population                69449.10  0.05    -0.01  140.37
## Price                        2453126.94  0.00    -0.06 4993.84
## Address*                        4999.00  0.00    -1.20   20.41
colnames(housing_db)
## [1] "Avg..Area.Income"             "Avg..Area.House.Age"         
## [3] "Avg..Area.Number.of.Rooms"    "Avg..Area.Number.of.Bedrooms"
## [5] "Area.Population"              "Price"                       
## [7] "Address"

Initial visualization

In this simple linear regression, we want to find out whether Avg..Area.Income has a linear relationship with the price of houses in the area. Let’s first visualize income and price separately to understand their distributions.

income = housing_db$Avg..Area.Income
price = housing_db$Price
hist(income)

hist(price)

We should also draw a scatter plot to see whether the relationship between the independent variable (income) and the dependent variable (price) is linear.

plot(price ~ income)

From the scatter plot, it is not obvious whether the relationship is linear.
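When a scatter plot alone is hard to read, one option is to overlay a smoothed trend and the least-squares line, and to compute the correlation coefficient. The sketch below uses simulated stand-in data rather than the housing file so that it runs on its own; the names `income_sim`, `price_sim`, and `r_sim` are hypothetical, and the simulation parameters are only loosely chosen to resemble the scale of the real data.

```r
# Simulated stand-in data (NOT the Kaggle housing data); names are hypothetical
set.seed(1)
income_sim <- rnorm(500, mean = 68000, sd = 10000)
price_sim  <- -220000 + 21 * income_sim + rnorm(500, sd = 270000)

# Scatter plot with two overlays to help judge linearity
plot(price_sim ~ income_sim, col = "grey50")
lines(lowess(income_sim, price_sim), col = "red", lwd = 2)   # smoothed local trend
abline(lm(price_sim ~ income_sim), col = "blue", lty = 2)    # least-squares line

# Pearson correlation: strength of the linear association
r_sim <- cor(income_sim, price_sim)
r_sim
```

If the smoothed curve stays close to the straight line, a linear model is a reasonable first choice. Replacing `income_sim` and `price_sim` with `income` and `price` would apply the same check to the housing data.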

Training a linear Regression Model

To perform a simple linear regression analysis and check the results, we need to use the lm function.

regression_model = lm(price ~ income)
summary(regression_model)
## 
## Call:
## lm(formula = price ~ income)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -965112 -187163   -2365  183084  983680 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.216e+05  2.500e+04  -8.863   <2e-16 ***
## income       2.120e+01  3.602e-01  58.844   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 271400 on 4998 degrees of freedom
## Multiple R-squared:  0.4093, Adjusted R-squared:  0.4091 
## F-statistic:  3463 on 1 and 4998 DF,  p-value: < 2.2e-16
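The coefficient table defines the fitted line, price = intercept + slope * income. As a rough illustration, the sketch below plugs in the rounded estimates printed above; `predict_price` is a hypothetical helper written for this example, and the income values are arbitrary.

```r
# Rounded estimates copied from the coefficient table above
intercept <- -2.216e+05
slope     <- 2.120e+01

# Hypothetical helper: predicted price for a given average area income
predict_price <- function(inc) intercept + slope * inc

# Predictions for three arbitrary income levels
predict_price(c(60000, 70000, 80000))
```

The slope reads as about 21 units of price for each additional unit of average area income. With the fitted object available, `predict(regression_model, newdata = data.frame(income = 70000))` should give a similar value using the unrounded coefficients.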

Plot the regression model

plot(regression_model)
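Calling plot() on an lm object draws several diagnostic plots one after another. A common convenience, sketched here on simulated stand-in data with hypothetical names (`x_sim`, `y_sim`, `fit_sim`), is to place them in a single 2 x 2 grid.

```r
# Simulated stand-in data (NOT the Kaggle housing data); names are hypothetical
set.seed(7)
x_sim <- rnorm(300, mean = 68000, sd = 10000)
y_sim <- -220000 + 21 * x_sim + rnorm(300, sd = 270000)
fit_sim <- lm(y_sim ~ x_sim)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid; par() returns the previous settings
plot(fit_sim)                    # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(old_par)                     # restore the previous layout
```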

Residual Analysis:

Residuals

Residuals are the differences between the observed response values and the values the model predicted. The Residuals section of the model output summarizes them with five points: the minimum, first quartile, median, third quartile, and maximum. When assessing how well the model fits the data, we look for a distribution that is symmetric around zero. In our example, the summary is close to symmetric: the median (-2365) is small relative to the quartiles (-187163 and 183084), and the minimum (-965112) and maximum (983680) are of similar magnitude. The spread is large, however: the residual standard error is 271400, so individual predictions can fall far from the observed prices.
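One simple way to make the symmetry check concrete is to compare the two sides of the five-number summary. The sketch below copies the values from the Residuals section of the summary output; the names `res`, `tail_ratio`, `quartile_ratio`, and `median_offset` are hypothetical, and these ratios are an informal check rather than a formal test.

```r
# Five-number summary copied from the Residuals section of summary(regression_model)
res <- c(min = -965112, q1 = -187163, median = -2365, q3 = 183084, max = 983680)

# Ratios close to 1 suggest the two sides mirror each other
tail_ratio     <- abs(res[["min"]]) / res[["max"]]
quartile_ratio <- abs(res[["q1"]]) / res[["q3"]]

# Median relative to the interquartile range; values near 0 suggest centring on zero
median_offset <- res[["median"]] / (res[["q3"]] - res[["q1"]])

round(c(tail = tail_ratio, quartile = quartile_ratio, median_offset = median_offset), 3)
```

For a fuller picture, `hist(resid(regression_model))` or a normal Q-Q plot of the residuals shows the whole distribution rather than five points.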

R-Squared

The R-squared (R^2) statistic measures how well the model fits the data. It is the proportion of the variance in the response variable (price) that is explained by the predictor variable (income). For a linear model with an intercept, it always lies between 0 and 1: a value near 0 means the regression explains little of the variance in the response, and a value near 1 means it explains most of it. In our case R^2 is 0.4093, so roughly 41% of the variance in the response variable (price) is explained by the predictor variable (income).
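To show where the number comes from, the sketch below computes R^2 by hand as 1 - SS_res / SS_tot and compares it with the value reported by summary() and with the squared correlation, which coincide in simple linear regression. It runs on simulated stand-in data, not the housing file, and the names `x`, `y`, and `fit` are hypothetical.

```r
# Simulated stand-in data (NOT the Kaggle housing data); names are hypothetical
set.seed(42)
x <- rnorm(200, mean = 68000, sd = 10000)
y <- -220000 + 21 * x + rnorm(200, sd = 270000)

fit <- lm(y ~ x)

ss_res <- sum(resid(fit)^2)      # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)   # total variation around the mean of y
r2_manual <- 1 - ss_res / ss_tot

r2_summary <- summary(fit)$r.squared  # value reported by summary()
r2_cor     <- cor(x, y)^2             # squared Pearson correlation

# The three quantities should agree up to floating-point error
all.equal(r2_manual, r2_summary)
all.equal(r2_manual, r2_cor)
```

Applied to the housing model, the same identity implies a correlation between income and price of about sqrt(0.4093), or roughly 0.64.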