#Data source: https://www.kaggle.com/farhankarim1/usa-house-prices
housing_db = read.csv(file = '/Users/anjalibm.com/Documents/DataScience/DATA_605/Simple_Linear_regression/USA_Housing.csv')
head(housing_db)
## Avg..Area.Income Avg..Area.House.Age Avg..Area.Number.of.Rooms
## 1 79545.46 5.682861 7.009188
## 2 79248.64 6.002900 6.730821
## 3 61287.07 5.865890 8.512727
## 4 63345.24 7.188236 5.586729
## 5 59982.20 5.040555 7.839388
## 6 80175.75 4.988408 6.104512
## Avg..Area.Number.of.Bedrooms Area.Population Price
## 1 4.09 23086.80 1059033.6
## 2 3.09 40173.07 1505890.9
## 3 5.13 36882.16 1058988.0
## 4 3.26 34310.24 1260616.8
## 5 4.23 26354.11 630943.5
## 6 4.04 26748.43 1068138.1
## Address
## 1 208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101
## 2 188 Johnson Views Suite 079\nLake Kathleen, CA 48958
## 3 9127 Elizabeth Stravenue\nDanieltown, WI 06482-3489
## 4 USS Barnett\nFPO AP 44820
## 5 USNS Raymond\nFPO AE 09386
## 6 06039 Jennifer Islands Apt. 443\nTracyport, KS 16077
library(psych)  # describe() is provided by the psych package
describe(housing_db)
## vars n mean sd median
## Avg..Area.Income 1 5000 68583.11 10657.99 68804.29
## Avg..Area.House.Age 2 5000 5.98 0.99 5.97
## Avg..Area.Number.of.Rooms 3 5000 6.99 1.01 7.00
## Avg..Area.Number.of.Bedrooms 4 5000 3.98 1.23 4.05
## Area.Population 5 5000 36163.52 9925.65 36199.41
## Price 6 5000 1232072.65 353117.63 1232669.38
## Address* 7 5000 2500.50 1443.52 2500.50
## trimmed mad min max
## Avg..Area.Income 68611.84 10598.27 17796.63 107701.75
## Avg..Area.House.Age 5.98 0.99 2.64 9.52
## Avg..Area.Number.of.Rooms 6.99 1.01 3.24 10.76
## Avg..Area.Number.of.Bedrooms 3.92 1.33 2.00 6.50
## Area.Population 36112.49 9997.21 172.61 69621.71
## Price 1232159.69 350330.42 15938.66 2469065.59
## Address* 2500.50 1853.25 1.00 5000.00
## range skew kurtosis se
## Avg..Area.Income 89905.12 -0.03 0.04 150.73
## Avg..Area.House.Age 6.87 -0.01 -0.09 0.01
## Avg..Area.Number.of.Rooms 7.52 -0.04 -0.08 0.01
## Avg..Area.Number.of.Bedrooms 4.50 0.38 -0.70 0.02
## Area.Population 69449.10 0.05 -0.01 140.37
## Price 2453126.94 0.00 -0.06 4993.84
## Address* 4999.00 0.00 -1.20 20.41
colnames(housing_db)
## [1] "Avg..Area.Income" "Avg..Area.House.Age"
## [3] "Avg..Area.Number.of.Rooms" "Avg..Area.Number.of.Bedrooms"
## [5] "Area.Population" "Price"
## [7] "Address"
In this simple linear regression we want to find out whether Avg..Area.Income has a linear relationship with the price of houses in the area. Let's first visualize income and price separately to understand their distributions.
income = housing_db$Avg..Area.Income
price = housing_db$Price
hist(income)
hist(price)
We should also draw a scatter plot to see whether the relationship between the independent variable (income) and the dependent variable (price) is linear.
plot(price ~ income)
From the scatter plot alone it is not obvious whether the relationship is linear.
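A correlation coefficient can complement the visual check by putting a number on the strength of the linear association. The sketch below is only an illustration: it runs on simulated stand-in data, and the names sim_income and sim_price are assumptions that do not appear in the analysis. With the real data, income and price would be passed instead.

```r
# Simulated stand-in data (assumption): in the analysis the real vectors are
# housing_db$Avg..Area.Income and housing_db$Price.
set.seed(42)
sim_income <- rnorm(500, mean = 68000, sd = 10000)
sim_price  <- -220000 + 21 * sim_income + rnorm(500, mean = 0, sd = 270000)

# Pearson correlation: values near +1 or -1 indicate a strong linear trend,
# values near 0 a weak one.
r <- cor(sim_income, sim_price)
print(r)

# cor.test() adds a confidence interval and a test of H0: correlation = 0.
ct <- cor.test(sim_income, sim_price)
print(ct$conf.int)
```

A clearly positive coefficient with a confidence interval that excludes zero would support fitting a straight line even when the scatter plot looks noisy.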
To perform a simple linear regression analysis and check the results, we use the lm function.
regression_model = lm(price ~ income)
summary(regression_model)
##
## Call:
## lm(formula = price ~ income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -965112 -187163 -2365 183084 983680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.216e+05 2.500e+04 -8.863 <2e-16 ***
## income 2.120e+01 3.602e-01 58.844 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 271400 on 4998 degrees of freedom
## Multiple R-squared: 0.4093, Adjusted R-squared: 0.4091
## F-statistic: 3463 on 1 and 4998 DF, p-value: < 2.2e-16
plot(regression_model)
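The coefficient estimates in the summary output define the fitted line: predicted price = intercept + slope × income. The sketch below copies the rounded estimates from that output and evaluates the line at the mean area income reported by describe(). The helper predict_price is hypothetical and exists only for this illustration.

```r
# Rounded estimates copied from the summary() output above.
intercept <- -2.216e+05
slope     <-  2.120e+01

# Hypothetical helper (not part of the original analysis): the fitted line.
predict_price <- function(income) intercept + slope * income

# Prediction at the mean area income (68583.11) reported by describe().
predict_price(68583.11)

# Each additional unit of income raises the predicted price by the slope.
predict_price(1001) - predict_price(1000)

# With the fitted model object in memory, the equivalent calls would be:
# coef(regression_model)
# predict(regression_model, newdata = data.frame(income = 68583.11))
```

The prediction at the mean income comes out at about 1.232 million, close to the mean price of 1232072.65. This is expected, because a least-squares line with an intercept passes through the point of means; the small gap comes from using rounded coefficients. The negative intercept is an extrapolation to an income of zero, far below the smallest income in the data (17796.63), so it should not be read as a realistic price.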
Residuals are the differences between the observed response values and the values the model predicted. The Residuals section of the model output summarizes them with five points: minimum, first quartile, median, third quartile, and maximum. When assessing how well the model fits the data, we look for a distribution that is symmetric around zero. In our example the summary is close to symmetric: the median (-2365) is near zero relative to the spread, the quartiles (-187163 and 183084) are nearly mirror images, and so are the extremes (-965112 and 983680). The spread, however, is wide. The residual standard error is 271400, so individual predictions often fall far from the observed prices.
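To make the definition concrete, the residuals can be recomputed by hand as observed minus fitted values and compared with what R stores in the model. This is a minimal sketch on simulated stand-in data; x, y, and fit are assumed names standing in for income, price, and regression_model.

```r
# Simulated stand-in data (assumption), loosely shaped like the housing data.
set.seed(1)
x <- rnorm(200, mean = 68000, sd = 10000)
y <- -220000 + 21 * x + rnorm(200, mean = 0, sd = 270000)
fit <- lm(y ~ x)

# Residual = observed response - fitted (predicted) response.
manual_resid <- y - fitted(fit)

# The manual values should agree with the residuals stored in the model.
print(all.equal(unname(manual_resid), unname(residuals(fit))))

# Five-number summary, the same quantities shown in the Residuals
# section of summary(fit).
print(quantile(manual_resid))

# For a least-squares fit with an intercept, the residuals average to
# zero up to floating-point error.
print(mean(manual_resid))
```

Because the mean of the residuals is zero by construction, symmetry is judged from the quartiles and extremes rather than from the mean.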
The R-squared (R^2) statistic measures how well the model fits the data. It is the proportion of the variance in the response variable that the model explains, and in a simple linear regression it reflects the strength of the linear relationship between the predictor (income) and the response (price). It always lies between 0 and 1: a value near 0 means the regression explains little of the variance in the response, and a value near 1 means it explains most of it. In our case R^2 is 0.4093, so roughly 41% of the variance in the response variable (Price) can be explained by the predictor variable (income).
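R^2 can be reproduced from its definition, 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the response. With a single predictor it also equals the squared Pearson correlation between predictor and response. The sketch below checks both identities on simulated stand-in data; x, y, and fit are assumed names, not objects from the analysis.

```r
# Simulated stand-in data (assumption), loosely shaped like the housing data.
set.seed(7)
x <- rnorm(300, mean = 68000, sd = 10000)
y <- -220000 + 21 * x + rnorm(300, mean = 0, sd = 270000)
fit <- lm(y ~ x)

# Variance left unexplained by the model vs. total variance in the response.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# The value reported by summary(), and the squared correlation.
r2_summary <- summary(fit)$r.squared
r2_cor     <- cor(x, y)^2

print(c(manual = r2_manual, summary = r2_summary, cor_squared = r2_cor))
```

All three numbers should agree up to floating-point error. Applied to the housing model, the same relationship implies a correlation between income and price of about sqrt(0.4093), or roughly 0.64.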