This data set includes prices and characteristics of 128 houses. The variables include Price, SqFt (size), Bedrooms, Bathrooms, Offers, Brick, and Neighborhood.
HP <- read.csv("~/Business Analytics/HousePrices.csv")
plot(Price~SqFt,data=HP)
plot(Price~Bedrooms,data=HP)
plot(Price~Bathrooms,data=HP)
plot(Price~Offers,data=HP)
plot(Price~Brick,data=HP)
plot(Price~Neighborhood,data=HP)
Above are the plots of Price against each predictor. None of them looks problematic. Note that these plots show how Price relates to each X variable; collinearity among the X variables has to be checked separately.
The correlation matrix supports our assumption that the X variables are not strongly correlated with one another. Next, we look at the regression output.
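The correlation matrix itself is not shown in this write-up. As a rough sketch of how such a check might be run, the block below computes pairwise correlations and variance inflation factors. The data frame `toy` is simulated and only stands in for the numeric columns of HousePrices.csv; its values are assumptions, not the real data.

```r
# Simulated stand-in for the numeric predictors (NOT the real HousePrices data)
set.seed(42)
n <- 128
toy <- data.frame(
  SqFt      = round(rnorm(n, mean = 2000, sd = 200)),
  Bedrooms  = sample(2:5, n, replace = TRUE),
  Bathrooms = sample(2:4, n, replace = TRUE),
  Offers    = sample(1:6, n, replace = TRUE)
)

# Pairwise correlations among the predictors
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Variance inflation factor: 1 / (1 - R^2), where R^2 comes from
# regressing each predictor on all of the other predictors
vif <- sapply(names(toy), function(v) {
  f  <- reformulate(setdiff(names(toy), v), response = v)
  r2 <- summary(lm(f, data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

Correlations near 0 and VIF values near 1 would point to little collinearity; a common rule of thumb treats VIF values well above 5 to 10 as a warning sign.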
m1=lm(Price~.,data=HP)
summary(m1)
##
## Call:
## lm(formula = Price ~ ., data = HP)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -27897.8  -6074.8    -48.7   5551.8  27536.4
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         308.114   9605.692   0.032 0.974465
## HomeID              -11.456     25.387  -0.451 0.652616
## SqFt                 53.634      5.926   9.051 3.30e-15 ***
## Bedrooms           4136.461   1621.775   2.551 0.012023 *
## Bathrooms          7975.157   2133.831   3.737 0.000287 ***
## Offers            -8350.128   1103.693  -7.566 8.96e-12 ***
## BrickYes          17313.540   1988.548   8.707 2.12e-14 ***
## NeighborhoodNorth  1729.613   2433.756   0.711 0.478675
## NeighborhoodWest  22264.319   2540.699   8.763 1.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared: 0.8688, Adjusted R-squared: 0.86
## F-statistic: 98.54 on 8 and 119 DF, p-value: < 2.2e-16
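Our regression output gives an adjusted R^2 of 0.86, so the model explains about 86% of the variation in Price after adjusting for the number of predictors. SqFt, Bedrooms, Bathrooms, Offers, BrickYes, and NeighborhoodWest are all significant at the 5% level. HomeID and NeighborhoodNorth are not; HomeID is only a row identifier, so it would normally be left out of the model.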
The equation for our line is:
Price = 308.11 - 11.46(HomeID) + 53.63(SqFt) + 4136.46(Bedrooms) + 7975.16(Bathrooms) - 8350.13(Offers) + 17313.54(BrickYes) + 1729.61(NeighborhoodNorth) + 22264.32(NeighborhoodWest)
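To make the fitted equation concrete, here is a small sketch that plugs one house into it. The coefficients are copied from the summary output; the house itself (HomeID 1, 2,000 sq ft, 3 bedrooms, 2 bathrooms, 2 offers, brick, West neighborhood) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(m1) output
b <- c(Intercept = 308.114, HomeID = -11.456, SqFt = 53.634,
       Bedrooms = 4136.461, Bathrooms = 7975.157, Offers = -8350.128,
       BrickYes = 17313.540, NeighborhoodNorth = 1729.613,
       NeighborhoodWest = 22264.319)

# Hypothetical house; dummy variables are 1 when the category applies, else 0.
# East is the baseline neighborhood, so both Neighborhood dummies are 0 for it.
x <- c(Intercept = 1, HomeID = 1, SqFt = 2000, Bedrooms = 3, Bathrooms = 2,
       Offers = 2, BrickYes = 1, NeighborhoodNorth = 0, NeighborhoodWest = 1)

# Predicted price is the sum of coefficient * value over all terms
price_hat <- sum(b * x[names(b)])
print(round(price_hat, 2))
```

Setting BrickYes to 0 would lower this prediction by exactly the BrickYes coefficient, which is how each dummy coefficient should be read: a shift relative to the baseline category with everything else held fixed.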
Next, I will fit the model on a random training set of 90 cases and use it to predict the remaining 38 test cases. Then I will examine the prediction error.
set.seed(1)
n=length(HP$Price)
n1=90
n2=n-n1
train=sample(1:n,n1)
m1=lm(Price~.,data=HP[train,])
summary(m1)
##
## Call:
## lm(formula = Price ~ ., data = HP[train, ])
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -29267  -4859  -1429   5576  28896
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)        1782.970  11103.095   0.161  0.87282
## HomeID              -18.470     29.611  -0.624  0.53453
## SqFt                 51.466      6.655   7.734 2.51e-11 ***
## Bedrooms           4692.374   1854.417   2.530  0.01333 *
## Bathrooms          7309.842   2387.949   3.061  0.00299 **
## Offers            -7737.062   1229.031  -6.295 1.49e-08 ***
## BrickYes          17915.620   2434.550   7.359 1.36e-10 ***
## NeighborhoodNorth  3941.988   3001.474   1.313  0.19277
## NeighborhoodWest  22785.492   2919.501   7.805 1.82e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9586 on 81 degrees of freedom
## Multiple R-squared: 0.8777, Adjusted R-squared: 0.8656
## F-statistic: 72.68 on 8 and 81 DF, p-value: < 2.2e-16
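# Predict the 38 held-out cases with the model fitted on the training set
pred=predict(m1,newdata=HP[-train,])
obs=HP$Price[-train]
err=obs-pred # prediction errors (named err to avoid masking base::diff)
percdiff=abs(err)/obs
me=mean(err)
rmse=sqrt(sum(err^2)/n2)
mape=100*(mean(percdiff))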
me # mean error
## [1] 1904.989
rmse # root mean square error
## [1] 11328.64
mape # mean absolute percent error
## [1] 6.882381
With a mean absolute percent error of roughly 7%, the model fitted on the training set predicts the test set well. The test RMSE (11,329) is somewhat higher than the training residual standard error (9,586), as expected for held-out data, but the gap is modest. This suggests the model should carry over reasonably well to new data of the same nature.
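The three error measures used above can be wrapped in a small helper so the same calculation can be reused for other models or other splits. This is only a sketch: the function name `holdout_metrics` and the toy numbers are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: mean error, root mean square error, and
# mean absolute percent error for observed vs. predicted values
holdout_metrics <- function(obs, pred) {
  err <- obs - pred
  c(me   = mean(err),
    rmse = sqrt(mean(err^2)),
    mape = 100 * mean(abs(err) / obs))
}

# Toy check with two made-up cases: errors are 10 and -20
toy_obs  <- c(100, 200)
toy_pred <- c(90, 220)
res <- holdout_metrics(toy_obs, toy_pred)
print(res)
```

With the objects from the analysis, the call would be `holdout_metrics(obs, pred)`. A positive mean error means the model under-predicts on average, since the error is defined as observed minus predicted.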
This data set includes data on gender, work experience, and salary.
GD <- read.csv("~/Business Analytics/GenderDiscrimination.csv")
plot(Salary~Gender,data=GD)
plot(Salary~Experience,data=GD)
Above are the plots of Salary against each predictor. Nothing in them looks problematic, though plots of the response against each X variable cannot by themselves reveal collinearity between Gender and Experience.
m2=lm(Salary~.,data=GD)
summary(m2)
##
## Call:
## lm(formula = Salary ~ ., data = GD)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -52779  -9806   -121   8347  60913
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
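Our regression output gives an adjusted R^2 of 0.44, so the model explains less than half of the variation in Salary and is not a strong fit. Both GenderMale and Experience are nevertheless highly significant: holding Experience fixed, the model estimates that men earn about 17,021 more than women, and each additional unit of Experience adds about 1,745 to Salary.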
The equation for our line is:
Salary = 53260.0 + 17020.6(GenderMale) + 1744.6(Experience)
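As a quick illustration of how the GenderMale dummy works in this equation, the sketch below evaluates it for a woman and a man with the same experience. The coefficients come from the output above; the function name `predict_salary` and the experience value of 10 are hypothetical.

```r
# Coefficients copied from the summary(m2) output
b0     <- 53260.0   # intercept (baseline: female, zero experience)
b_male <- 17020.6   # shift for GenderMale = 1
b_exp  <- 1744.6    # change per unit of Experience

# Hypothetical helper that evaluates the fitted equation
predict_salary <- function(male, experience) {
  b0 + b_male * male + b_exp * experience
}

female_10 <- predict_salary(male = 0, experience = 10)
male_10   <- predict_salary(male = 1, experience = 10)
print(c(female = female_10, male = male_10, gap = male_10 - female_10))
```

Because the model has no interaction term, the predicted gap equals the GenderMale coefficient at every level of Experience.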
This data set includes data from a direct marketer who sells his products via mail.
DM <- read.csv("~/Business Analytics/DirectMarketing.csv")
plot(AmountSpent~Age,data=DM)
plot(AmountSpent~Gender,data=DM)
plot(AmountSpent~OwnHome,data=DM)
plot(AmountSpent~Married,data=DM)
plot(AmountSpent~Location,data=DM)
plot(AmountSpent~Children,data=DM)
plot(AmountSpent~History,data=DM)
plot(AmountSpent~Catalogs,data=DM)
plot(AmountSpent~Salary,data=DM)
Above are the plots of AmountSpent against each predictor. Nothing in them looks problematic, though plots of the response against each X variable cannot by themselves reveal collinearity among the X variables.
m3=lm(AmountSpent~.,data=DM)
summary(m3)
##
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1648.11  -286.72   -12.63   218.21  2771.25
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 **
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
Our regression output gives an adjusted R^2 of 0.785, which indicates a good fit. LocationFar, Salary, Children, HistoryLow, HistoryMedium, and Catalogs are all significant. Note that 303 observations were dropped because of missing values, so the model was fitted on 697 cases (685 residual degrees of freedom plus 12 estimated coefficients).
The equation for our line is:
AmountSpent = -249.6 + 41.39(AgeOld) + 89.65(AgeYoung) - 53.70(GenderMale) - 18.29(OwnHomeRent) + 19.50(MarriedSingle) + 609.0(LocationFar) + 0.0188(Salary) - 268.3(Children) - 267.5(HistoryLow) - 344.6(HistoryMedium) + 40.52(Catalogs)
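The summary output reports "303 observations deleted due to missingness", which means lm() silently dropped every row with a missing value in any model variable. A sketch of how one might find out which columns are responsible is shown below. The data frame `toy` and its NA pattern are invented for illustration; which columns are actually missing in DirectMarketing.csv is not stated in this report.

```r
# Invented toy data with two missing History values (NOT the real data)
toy <- data.frame(
  AmountSpent = c(120, 560, 300, 80, 900, 410, 650, 230),
  Salary      = c(30000, 75000, 52000, 21000, 98000, 61000, 83000, 44000),
  History     = c("Low", NA, "Medium", "High", NA, "Low", "High", "Medium"),
  stringsAsFactors = TRUE
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no missing values are the only ones lm() will use by default
complete_rows <- sum(complete.cases(toy))
print(complete_rows)

# The fitted model confirms how many observations were actually used
fit <- lm(AmountSpent ~ ., data = toy)
print(nobs(fit))
```

On the real data the same calls would be `colSums(is.na(DM))` and `sum(complete.cases(DM))`. If most of the missing values sit in a single column, it is worth considering whether dropping nearly a third of the cases is preferable to treating the missing value as its own category.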