House Prices is a data set of 128 observations (houses) that records the prices and characteristics of homes in a major US metro area.
HP <- read.csv("~/DataMining/Data/HousePrices.csv")
plot(Price~SqFt, data=HP)
plot(Price~Bathrooms, data=HP)
plot(Price~Offers, data=HP)
boxplot(Price~Brick, data=HP)
boxplot(Price~Neighborhood, data=HP)
The plots above show Price against SqFt, Bathrooms, Offers, Brick, and Neighborhood. I have no concerns with these plots.
Regression output and summary:
m1 = lm(Price~., data=HP)
summary(m1)
##
## Call:
## lm(formula = Price ~ ., data = HP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27897.8 -6074.8 -48.7 5551.8 27536.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 308.114 9605.692 0.032 0.974465
## HomeID -11.456 25.387 -0.451 0.652616
## SqFt 53.634 5.926 9.051 3.30e-15 ***
## Bedrooms 4136.461 1621.775 2.551 0.012023 *
## Bathrooms 7975.157 2133.831 3.737 0.000287 ***
## Offers -8350.128 1103.693 -7.566 8.96e-12 ***
## BrickYes 17313.540 1988.548 8.707 2.12e-14 ***
## NeighborhoodNorth 1729.613 2433.756 0.711 0.478675
## NeighborhoodWest 22264.319 2540.699 8.763 1.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared: 0.8688, Adjusted R-squared: 0.86
## F-statistic: 98.54 on 8 and 119 DF, p-value: < 2.2e-16
Equation of the regression (Raw Form): Price = 308.114 - 11.456HomeID + 53.634SqFt + 4136.461Bedrooms + 7975.157Bathrooms - 8350.128Offers + 17313.540BrickYes + 1729.613NeighborhoodNorth + 22264.319NeighborhoodWest
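Based on this regression, the significant predictors (p < 0.05) are SqFt, Bedrooms, Bathrooms, Offers, BrickYes, and NeighborhoodWest. HomeID and NeighborhoodNorth are not significant; HomeID is only a row identifier, so it should be left out of the model. The R^2 is 0.87 (adjusted 0.86) and the overall F-test p-value is below 2.2e-16, which indicates a good fit.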
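As a rough illustration of how the fitted equation turns into a prediction, the sketch below plugs a made-up house into the coefficients printed in the summary. The house's characteristics (HomeID 1, 2000 square feet, 3 bedrooms, 2 bathrooms, 2 offers, brick, West neighborhood) are hypothetical values chosen only for the arithmetic; they are not a row from the data.

```r
# Coefficients copied from the summary(m1) output
coefs <- c(Intercept = 308.114, HomeID = -11.456, SqFt = 53.634,
           Bedrooms = 4136.461, Bathrooms = 7975.157, Offers = -8350.128,
           BrickYes = 17313.540, NeighborhoodNorth = 1729.613,
           NeighborhoodWest = 22264.319)

# Hypothetical house (assumed values, not taken from HousePrices.csv).
# Dummy variables are 1 when the category applies and 0 otherwise.
house <- c(Intercept = 1, HomeID = 1, SqFt = 2000,
           Bedrooms = 3, Bathrooms = 2, Offers = 2,
           BrickYes = 1, NeighborhoodNorth = 0, NeighborhoodWest = 1)

# The predicted price is the sum of coefficient * value over all terms
predicted_price <- sum(coefs * house[names(coefs)])
predicted_price
```

With the model object in memory, predict(m1, newdata = ...) performs the same calculation without copying coefficients by hand.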
Direct Marketing is a data set from a direct marketer who sells products only via direct mail.
DM <- read.csv("~/DataMining/Data/DirectMarketing.csv")
plot(AmountSpent~Age, data = DM)
plot(AmountSpent~Gender, data = DM)
plot(AmountSpent~OwnHome, data = DM)
plot(AmountSpent~Married, data = DM)
plot(AmountSpent~Location, data = DM)
plot(AmountSpent~Salary, data = DM)
plot(AmountSpent~Children, data = DM)
plot(AmountSpent~History, data = DM)
plot(AmountSpent~Catalogs, data = DM)
The plots above show AmountSpent against each of the nine predictors. I have no concerns with these plots.
Regression and output summary:
m2 = lm(AmountSpent~., data=DM)
summary(m2)
##
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.496e+02 1.340e+02 -1.862 0.06302 .
## AgeOld 4.139e+01 5.276e+01 0.784 0.43311
## AgeYoung 8.965e+01 5.874e+01 1.526 0.12740
## GenderMale -5.370e+01 3.802e+01 -1.413 0.15823
## OwnHomeRent -1.829e+01 4.151e+01 -0.441 0.65967
## MarriedSingle 1.950e+01 4.981e+01 0.392 0.69553
## LocationFar 6.090e+02 4.399e+01 13.845 < 2e-16 ***
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## HistoryLow -2.675e+02 8.862e+01 -3.019 0.00263 **
## HistoryMedium -3.446e+02 5.996e+01 -5.746 1.38e-08 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
Equation of Regression line (Raw Form): AmountSpent = -249.6 + 41.39AgeOld + 89.65AgeYoung - 53.70GenderMale - 18.29OwnHomeRent + 19.50MarriedSingle + 609.0LocationFar + 0.0188Salary - 268.3Children - 267.5HistoryLow - 344.6HistoryMedium + 40.52Catalogs
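This regression gives us significant variables of LocationFar, Salary, Children, HistoryLow, HistoryMedium and Catalogs; Age, Gender, OwnHome and Married are not significant. The R^2 of 0.79 and the overall F-test p-value below 2.2e-16 indicate a fairly good model. Note that 303 observations were dropped because of missing values, so the fit is based on 697 rows.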
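Because lm() drops rows with missing values without stopping, it can help to count them before interpreting a fit. The sketch below uses a small made-up data frame (the name toy and its columns are assumptions, not the contents of DirectMarketing.csv) to show how complete.cases() and nobs() report the number of rows a model actually uses.

```r
# Toy data with missing values in one column (hypothetical, illustration only)
set.seed(1)
toy <- data.frame(
  spent   = rnorm(20, mean = 1000, sd = 200),
  salary  = rnorm(20, mean = 50000, sd = 10000),
  history = rep(c("Low", "Medium", "High", NA), times = 5)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Rows with no missing value in any column
n_complete <- sum(complete.cases(toy))

# With the default na.action (na.omit), lm() fits on the complete rows only
fit <- lm(spent ~ ., data = toy)
n_used <- nobs(fit)

na_counts
c(total = nrow(toy), complete = n_complete, used = n_used)
```

Running colSums(is.na(DM)) in the same way would show which columns account for the 303 dropped observations.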
The Gender Discrimination data looks at variables Salary, Gender and Experience.
GD <- read.csv("~/DataMining/Data/GenderDiscrimination.csv")
plot(Salary~Gender, data = GD)
plot(Salary~Experience, data = GD)
The plots above show Salary against Gender and Experience. I have no concerns with these plots.
Regression and output summary:
m3= lm(Salary~., data = GD)
summary(m3)
##
## Call:
## lm(formula = Salary ~ ., data = GD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53260.0 2416.6 22.039 < 2e-16 ***
## GenderMale 17020.6 2499.6 6.809 1.06e-10 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
Equation of Regression line (Raw Form): Salary = 53260.0 + 17020.6GenderMale + 1744.6Experience
In this regression, both predictors are significant (p < 0.001). The R^2 is 0.44, so the model leaves about 56% of the variance in Salary unexplained, which makes it a weak model for prediction.
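The link between R^2 and unexplained variance can be checked directly: R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on simulated data; the data frame sim and the numbers used to generate it are assumptions for illustration, not the GenderDiscrimination.csv values.

```r
# Simulated salary data (hypothetical, not the real data set)
set.seed(42)
n <- 100
experience <- runif(n, 0, 30)
gender <- factor(rep(c("Female", "Male"), each = n / 2))
salary <- 50000 + 1500 * experience + 15000 * (gender == "Male") +
  rnorm(n, sd = 15000)
sim <- data.frame(salary, gender, experience)

fit <- lm(salary ~ ., data = sim)

# Residual and total sums of squares
rss <- sum(resid(fit)^2)
tss <- sum((sim$salary - mean(sim$salary))^2)

r2_manual   <- 1 - rss / tss  # share of variance explained
unexplained <- rss / tss      # share of variance left unexplained

c(r2_manual = r2_manual,
  r2_summary = summary(fit)$r.squared,
  unexplained = unexplained)
```

Applied to the output for m3, 1 - 0.4413 = 0.5587, which is where the figure of roughly 56% unexplained variance comes from.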
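For this data set I also ran a leave-one-out check, refitting the model n times with one observation held out as the test case each time:
n=nrow(GD)
diff=numeric(n)      # held-out error for each observation
percdiff=numeric(n)  # held-out absolute percentage error for each observation
for(k in 1:n)
{
train=setdiff(1:n,k)              # every row except k
fit=lm(Salary~.,data=GD[train,])  # separate name so that m3 is not overwritten
pred=predict(fit,newdata=GD[k,])  # newdata (not data) is needed to predict the held-out row
diff[k]=GD$Salary[k]-pred
percdiff[k]=abs(diff[k])/GD$Salary[k]
}
me=mean(diff)
rmse=sqrt(mean(diff^2))  # root of the mean squared error, not of the sum
mape=100*mean(percdiff)
me
rmse
mape
The values originally printed for this section were me = 108678.6, rmse = 1575709 and mape = 57.80779. They came from a loop that called predict() with data= instead of newdata=, which returns the fitted values for the training rows, and that overwrote diff on every pass, so they describe only the final iteration and are not valid test-set errors. Note too that 57.8% would be a mean absolute percentage error (MAPE), not a mean absolute error. The loop above has to be rerun to get usable figures, so this check does not yet confirm or contradict the weak fit indicated by the R^2 of 0.44.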
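For ordinary least squares, leave-one-out errors can also be obtained without refitting: the held-out error for row i equals the ordinary residual divided by 1 - h_i, where h_i is the leverage returned by hatvalues(). The sketch below checks this identity against an explicit loop. It uses R's built-in mtcars data as a stand-in, and the model mpg ~ wt + hp is an arbitrary choice for illustration, not the salary model.

```r
dat <- mtcars
n <- nrow(dat)

# Explicit leave-one-out loop: refit without row k, then predict row k
loo_err <- numeric(n)
for (k in seq_len(n)) {
  fit_k <- lm(mpg ~ wt + hp, data = dat[-k, ])
  loo_err[k] <- dat$mpg[k] - predict(fit_k, newdata = dat[k, ])
}

# Shortcut: a single fit on all rows, residuals rescaled by leverage
fit_all <- lm(mpg ~ wt + hp, data = dat)
loo_short <- unname(resid(fit_all) / (1 - hatvalues(fit_all)))

# Summary metrics over all held-out errors
me   <- mean(loo_err)
rmse <- sqrt(mean(loo_err^2))
mape <- 100 * mean(abs(loo_err) / dat$mpg)

c(ME = me, RMSE = rmse, MAPE = mape)
max(abs(loo_err - loo_short))  # difference between loop and shortcut
```

The same shortcut should apply to the salary model, since it is also a plain lm() fit; the explicit loop remains the more general approach for models where no such identity exists.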