Data Set 1: HousePrices.csv
The data set includes prices and characteristics of n=128 houses. The following analysis attempts to explain the sale prices of the houses in relation to their characteristics.
HPrice <- read.csv("C:/DataMining/Data/HousePrices.csv")
# create 0/1 indicator variables for the three-level Neighborhood factor
# (West is left as the baseline category)
HPrice$NeighborhoodNorth=ifelse(HPrice$Neighborhood=="North",1,0)
HPrice$NeighborhoodEast=ifelse(HPrice$Neighborhood=="East",1,0)
# recode Brick from Yes/No to a 0/1 indicator
HPrice$Brick=ifelse(HPrice$Brick=="Yes",1,0)
# drop the original Neighborhood column (column 8) and the identifier in column 1
hp=HPrice[-8]
house=hp[-1]
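The exploratory plots discussed next are not reproduced in this write-up; assuming they were drawn in base R, plotting price against each candidate predictor along these lines would produce them:
# scatter plots of price against each candidate predictor
plot(Price~SqFt,data=house)
plot(Price~Bedrooms,data=house)
plot(Price~Bathrooms,data=house)
plot(Price~Offers,data=house)
plot(Price~Brick,data=house)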
From these graphs it would be reasonable to expect a strong positive correlation between price and square footage, and a strong negative correlation between price and the number of offers a house receives. It would also be fair to expect some multicollinearity among the variables related to size: square footage, number of bedrooms, and number of bathrooms.
The following is the output for a regression fitting all the data provided. Note that the variable Neighborhood, which has three possible values (East, North, or West), has been converted into indicator variables; two of them (NeighborhoodNorth and NeighborhoodEast) are included in the model, with West serving as the baseline category.
model1=lm(Price~.,data=house)
summary(model1)
##
## Call:
## lm(formula = Price ~ ., data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27337.3 -6549.5 -41.7 5803.4 27359.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22840.536 10236.302 2.231 0.02752 *
## SqFt 52.994 5.734 9.242 1.10e-15 ***
## Bedrooms 4246.794 1597.911 2.658 0.00894 **
## Bathrooms 7883.278 2117.035 3.724 0.00030 ***
## Offers -8267.488 1084.777 -7.621 6.47e-12 ***
## Brick 17297.350 1981.616 8.729 1.78e-14 ***
## NeighborhoodNorth -20681.037 3148.954 -6.568 1.38e-09 ***
## NeighborhoodEast -22241.616 2531.758 -8.785 1.32e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.861
## F-statistic: 113.3 on 7 and 120 DF, p-value: < 2.2e-16
The raw equation for the above regression: Price = 22840.5 + 53.0 SqFt + 4246.8 Bedrooms + 7883.3 Bathrooms - 8267.5 Offers + 17297.4 Brick - 20681.0 NeighborhoodNorth - 22241.6 NeighborhoodEast
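As an illustration of how this equation is used, the fitted model can score a new house with predict(); the characteristics below are made up purely for the example:
# hypothetical 2000 sq ft brick house in the West neighborhood
newhouse=data.frame(SqFt=2000,Bedrooms=3,Bathrooms=2,Offers=2,
                    Brick=1,NeighborhoodNorth=0,NeighborhoodEast=0)
predict(model1,newdata=newhouse)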
The model fits well, explaining 86.1% of the variation in house prices (adjusted R-squared), and the overall F-statistic is highly significant (p < 2.2e-16). All of the predictors are significant at the 5% level, although Bedrooms has the largest p-value (about 0.009), so the model could perhaps be simplified.
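One quick way to check whether simplification is worthwhile is drop1(), which re-tests each predictor by removing it from the full model; a minimal sketch:
# F-tests for dropping each predictor in turn
drop1(model1,test="F")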
To investigate whether multicollinearity might be inflating the standard error of the Bedrooms coefficient, the following correlation matrix is computed.
cor(house)
## Price SqFt Bedrooms Bathrooms
## Price 1.0000000 0.55298224 0.52592606 0.523257758
## SqFt 0.5529822 1.00000000 0.48380711 0.522745301
## Bedrooms 0.5259261 0.48380711 1.00000000 0.414555956
## Bathrooms 0.5232578 0.52274530 0.41455596 1.000000000
## Offers -0.3136359 0.33692335 0.11427061 0.143793404
## Brick 0.4528168 0.07979216 0.04638008 0.171976913
## NeighborhoodNorth -0.5482211 -0.28888599 -0.36466744 -0.275829702
## NeighborhoodEast -0.1429589 0.04563915 -0.09175034 -0.001247208
## Offers Brick NeighborhoodNorth
## Price -0.31363588 0.45281679 -0.5482211
## SqFt 0.33692335 0.07979216 -0.2888860
## Bedrooms 0.11427061 0.04638008 -0.3646674
## Bathrooms 0.14379340 0.17197691 -0.2758297
## Offers 1.00000000 -0.14498606 0.3329866
## Brick -0.14498606 1.00000000 -0.2605536
## NeighborhoodNorth 0.33298661 -0.26055361 1.0000000
## NeighborhoodEast -0.01560205 0.14756390 -0.5329100
## NeighborhoodEast
## Price -0.142958878
## SqFt 0.045639151
## Bedrooms -0.091750338
## Bathrooms -0.001247208
## Offers -0.015602052
## Brick 0.147563896
## NeighborhoodNorth -0.532910044
## NeighborhoodEast 1.000000000
This output shows no severe multicollinearity among the variables; the largest correlations are the ones you might expect, between square footage and the numbers of bedrooms and bathrooms (roughly 0.48 and 0.52). These are moderate and likely of little concern, and when the regression model is refit without the Bathrooms variable, the adjusted R-squared falls, indicating that the original model explains the variation in prices better.
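The refit without Bathrooms is not reproduced above; a minimal sketch of that comparison would be:
# refit without Bathrooms and compare adjusted R-squared with the full model
model1b=update(model1,.~.-Bathrooms)
summary(model1b)$adj.r.squared
summary(model1)$adj.r.squared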
A training sample of 100 houses is drawn and used to predict the prices of the remaining 28 houses.
set.seed(1)
n=length(house$Price)
n1=100
n2=n-n1
train=sample(1:n,n1)
m4=lm(Price~.,data=house[train,])
summary(m4)
##
## Call:
## lm(formula = Price ~ ., data = house[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -27542 -5669 -1185 5893 28238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22546.310 11203.793 2.012 0.04710 *
## SqFt 52.768 6.075 8.687 1.32e-13 ***
## Bedrooms 4701.107 1678.575 2.801 0.00622 **
## Bathrooms 6836.246 2227.577 3.069 0.00282 **
## Offers -8044.981 1126.298 -7.143 2.09e-10 ***
## Brick 17143.525 2197.482 7.801 9.30e-12 ***
## NeighborhoodNorth -18844.946 3377.426 -5.580 2.41e-07 ***
## NeighborhoodEast -21709.011 2652.542 -8.184 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9401 on 92 degrees of freedom
## Multiple R-squared: 0.8768, Adjusted R-squared: 0.8674
## F-statistic: 93.52 on 7 and 92 DF, p-value: < 2.2e-16
The regression output above is similar to the first one, with a slightly higher R-squared; the coefficient estimates are close to those from the full sample, although Bathrooms is now significant at the 1% rather than the 0.1% level.
pred=predict(m4,newdata=house[-train,])
obs=house$Price[-train]
diff=obs-pred
percdiff=abs(diff)/obs
# mean error
me=mean(diff)
me
## [1] 2703.466
# root mean square error
rmse=sqrt(sum(diff**2)/n2)
rmse
## [1] 12106.82
# mean absolute percent error
mape=100*(mean(percdiff))
mape
## [1] 7.527531
The mean absolute percent error of roughly 7.5% is reasonably low, suggesting that the model predicts out-of-sample prices quite accurately.
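A single random split can be lucky or unlucky; as a rough check, the split-and-predict exercise could be repeated several times and the MAPE averaged. The seed and the number of repeats below are arbitrary choices:
set.seed(2)
mapereps=replicate(10,{
  tr=sample(1:n,n1)
  fit=lm(Price~.,data=house[tr,])
  p=predict(fit,newdata=house[-tr,])
  100*mean(abs(house$Price[-tr]-p)/house$Price[-tr])
})
mean(mapereps)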
Data Set 2: DirectMarketing.csv
This data set includes data from a direct marketer who sells products only via direct mail. The following analysis attempts to explain the amount purchased as a function of the buyer’s characteristics.
dmkt<-read.csv("c:/DataMining/Data/DirectMarketing.csv")
Plots of amount spent against salary, number of children, number of catalogs sent, age, gender, home ownership, marital status, location relative to a brick-and-mortar store, and previous purchase-history volume are shown as follows.
plot(AmountSpent~Salary,data=dmkt)

plot(AmountSpent~Children,data=dmkt)

plot(AmountSpent~Catalogs,data=dmkt)

plot(AmountSpent~Age,data=dmkt)

plot(AmountSpent~Gender,data=dmkt)

plot(AmountSpent~OwnHome,data=dmkt)

plot(AmountSpent~Married,data=dmkt)

plot(AmountSpent~Location,data=dmkt)

plot(AmountSpent~History,data=dmkt)

From these graphs it would be reasonable to expect the amount spent to be correlated with most of these variables, with salary and purchase history likely carrying more weight in the model than the others.
The following is the output of a regression model fit to all of the variables provided.
model3=lm(AmountSpent~.,data=dmkt)
summary(model3)
##
## Call:
## lm(formula = AmountSpent ~ ., data = dmkt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.496e+02 1.340e+02 -1.862 0.06302 .
## AgeOld 4.139e+01 5.276e+01 0.784 0.43311
## AgeYoung 8.965e+01 5.874e+01 1.526 0.12740
## GenderMale -5.370e+01 3.802e+01 -1.413 0.15823
## OwnHomeRent -1.829e+01 4.151e+01 -0.441 0.65967
## MarriedSingle 1.950e+01 4.981e+01 0.392 0.69553
## LocationFar 6.090e+02 4.399e+01 13.845 < 2e-16 ***
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## HistoryLow -2.675e+02 8.862e+01 -3.019 0.00263 **
## HistoryMedium -3.446e+02 5.996e+01 -5.746 1.38e-08 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
The raw equation for the above regression: AmountSpent = -2.496e+02 + 4.139e+01 AgeOld + 8.965e+01 AgeYoung - 5.370e+01 GenderMale - 1.829e+01 OwnHomeRent + 1.950e+01 MarriedSingle + 6.090e+02 LocationFar + 1.883e-02 Salary - 2.683e+02 Children - 2.675e+02 HistoryLow - 3.446e+02 HistoryMedium + 4.052e+01 Catalogs
Without scientific notation: AmountSpent = -249.6 + 41.4 AgeOld + 89.7 AgeYoung - 53.7 GenderMale - 18.3 OwnHomeRent + 19.5 MarriedSingle + 609.0 LocationFar + 0.0188 Salary - 268.3 Children - 267.5 HistoryLow - 344.6 HistoryMedium + 40.5 Catalogs
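As with the house-price model, the fitted equation can be used to score a hypothetical customer with predict(); the values below are invented for the example and use factor levels visible in the output above:
# hypothetical customer
newcust=data.frame(Age="Old",Gender="Male",OwnHome="Rent",Married="Single",
                   Location="Far",Salary=60000,Children=1,History="Medium",Catalogs=12)
predict(model3,newdata=newcust)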
This model has an adjusted R-squared of 78.5% and a very small overall p-value, so the fit is quite good. However, only Location, Salary, Children, History, and Catalogs appear statistically significant; the demographic variables (Age, Gender, OwnHome, and Married) do not.
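As a check, a reduced model keeping only these predictors could be fit, and it is also worth confirming which column carries the missing values behind the 303 deleted observations; a minimal sketch:
# count missing values in each column
colSums(is.na(dmkt))
# reduced model with only the predictors that were significant above
model3b=lm(AmountSpent~Location+Salary+Children+History+Catalogs,data=dmkt)
summary(model3b)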
Data Set 3: GenderDiscrimination.csv
ged<-read.csv("c:/DataMining/Data/GenderDiscrimination.csv")
The data set records each employee's salary, gender, and years of experience. Plots of salary against years of experience and gender are shown as follows.
plot(Salary~Experience,data=ged)

plot(Salary~Gender,data=ged)

From these graphs one can reasonably expect a moderately strong, positive relationship between salary and years of experience, and a regression model should predict lower salaries for female employees.
The following is the output of a regression model fit to all of the variables provided.
model2=lm(Salary~.,data=ged)
summary(model2)
##
## Call:
## lm(formula = Salary ~ ., data = ged)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53260.0 2416.6 22.039 < 2e-16 ***
## GenderMale 17020.6 2499.6 6.809 1.06e-10 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
The raw equation for the above regression: Salary = 53260.0 + 17020.6 GenderMale + 1744.6 Experience
If the goal of this model were accurate prediction of individual salaries, it would perform poorly, as suggested by an adjusted R-squared of only 43.6%. However, the output does show that both gender and experience have statistically significant relationships with salary (p-values < 0.001).
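To make the gender effect concrete, the model can be used to predict salaries for a male and a female employee with the same experience; the 10-year figure below is an arbitrary example:
# predicted salaries at 10 years of experience; the gap equals the GenderMale
# coefficient (about $17,021) at any experience level
predict(model2,newdata=data.frame(Gender=c("Male","Female"),Experience=c(10,10)))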