Data Set 1: HousePrices.csv

The data set includes prices and characteristics of n=128 houses. The following analysis attempts to explain the sale prices of the houses in relation to their characteristics.

HPrice <- read.csv("C:/DataMining/Data/HousePrices.csv")
# Indicator variables for Neighborhood; West is left out as the baseline
HPrice$NeighborhoodNorth <- ifelse(HPrice$Neighborhood == "North", 1, 0)
HPrice$NeighborhoodEast <- ifelse(HPrice$Neighborhood == "East", 1, 0)
# Recode Brick from Yes/No to a 1/0 indicator
HPrice$Brick <- ifelse(HPrice$Brick == "Yes", 1, 0)
# Drop columns 8 and 1 so only Price, the numeric predictors, and the indicators remain
hp <- HPrice[-8]
house <- hp[-1]
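
As a cross-check on the manual indicator columns created above, the following sketch shows that R builds the same indicators when Neighborhood is stored as a factor with West as the reference level. The five rows of `toy` are made up for illustration; only the level names come from the data set.

```r
# Toy data: level names mirror the data set, the rows themselves are invented
toy <- data.frame(Neighborhood = c("East", "North", "West", "North", "East"))

# Manual indicators, built the same way as in the analysis
north <- ifelse(toy$Neighborhood == "North", 1, 0)
east  <- ifelse(toy$Neighborhood == "East", 1, 0)

# Equivalent: a factor whose reference (baseline) level is West
toy$Neighborhood <- relevel(factor(toy$Neighborhood), ref = "West")
X <- model.matrix(~ Neighborhood, data = toy)
print(X)

# The generated columns match the manual indicators
print(all(X[, "NeighborhoodNorth"] == north))
print(all(X[, "NeighborhoodEast"] == east))
```

With a factor coded this way, `lm()` would create the indicator columns itself, so the manual recoding is a matter of preference rather than necessity.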

Plots of price against square footage, number of bedrooms, number of bathrooms, number of offers, and brick construction (1 = brick, 0 = not brick) are shown as follows.

plot(Price~SqFt,data=house)

plot(Price~Bedrooms,data=house)

plot(Price~Bathrooms,data=house)

plot(Price~Offers,data=house)

plot(Price~Brick,data=house)

From these graphs it would be reasonable to expect a strong positive correlation between price and square footage and a strong negative correlation between price and the number of offers a house receives. It would also be fair to expect some multicollinearity among the size-related variables: square footage, number of bedrooms, and number of bathrooms.

The following is the output of a regression fitted to all the data provided. Note that the variable Neighborhood, which has three possible values (East, North, or West), has been converted into indicator variables. Two of them (North and East) are included in the model, so West serves as the baseline.

model1=lm(Price~.,data=house)
summary(model1)
## 
## Call:
## lm(formula = Price ~ ., data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27337.3  -6549.5    -41.7   5803.4  27359.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        22840.536  10236.302   2.231  0.02752 *  
## SqFt                  52.994      5.734   9.242 1.10e-15 ***
## Bedrooms            4246.794   1597.911   2.658  0.00894 ** 
## Bathrooms           7883.278   2117.035   3.724  0.00030 ***
## Offers             -8267.488   1084.777  -7.621 6.47e-12 ***
## Brick              17297.350   1981.616   8.729 1.78e-14 ***
## NeighborhoodNorth -20681.037   3148.954  -6.568 1.38e-09 ***
## NeighborhoodEast  -22241.616   2531.758  -8.785 1.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.861 
## F-statistic: 113.3 on 7 and 120 DF,  p-value: < 2.2e-16

The raw equation for the above regression: Price = 22840.5 + 53.0SqFt + 4246.8Bedrooms + 7883.3Bathrooms - 8267.5Offers + 17297.4Brick - 20681.0NeighborhoodNorth - 22241.6NeighborhoodEast
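
To make the equation concrete, here is a small sketch that applies the fitted coefficients from the summary output to one house. The house itself (2,000 square feet, 3 bedrooms, 2 bathrooms, 2 offers, brick, in the West baseline neighborhood) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model1) output
b <- c("(Intercept)" = 22840.536, SqFt = 52.994, Bedrooms = 4246.794,
       Bathrooms = 7883.278, Offers = -8267.488, Brick = 17297.350,
       NeighborhoodNorth = -20681.037, NeighborhoodEast = -22241.616)

# Hypothetical house (an assumption, not a row from the data set)
house_x <- c("(Intercept)" = 1, SqFt = 2000, Bedrooms = 3, Bathrooms = 2,
             Offers = 2, Brick = 1, NeighborhoodNorth = 0, NeighborhoodEast = 0)

# Predicted price = sum of coefficient * value, matched by name
predicted_price <- sum(b * house_x[names(b)])
print(round(predicted_price))  # 158098
```

Setting `NeighborhoodEast = 1` instead would lower the prediction by about 22,242, which is how the neighborhood coefficients should be read: as differences from the West baseline.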

The model fits well, with an adjusted R-squared of 0.861, meaning it explains roughly 86% of the variation in house prices. The p-value of the F-statistic shows that the model as a whole is statistically significant. Every predictor is also significant at the 1% level; Bedrooms has the largest p-value among them (0.0089), so it is the first candidate to examine if the model were to be simplified.
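
One way to judge whether a model can be simplified is to drop a single predictor and compare adjusted R-squared, optionally with a partial F-test. The sketch below illustrates the mechanics on R's built-in `mtcars` data as a stand-in; the variables `wt`, `hp`, and `qsec` have nothing to do with the house data, and the same pattern would apply to `model1` with a term such as Bedrooms.

```r
# Stand-in model on built-in data; NOT the house-price model
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Refit without one predictor
reduced <- update(full, . ~ . - qsec)

# Compare adjusted R-squared of the two fits
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test: does the dropped term add explanatory power?
print(anova(reduced, full))
```

If the reduced model has an adjusted R-squared that is about as high and the F-test is not significant, the simpler model is usually preferable.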

To investigate possible concerns with multicollinearity affecting the Bedrooms variable the following correlation matrix is run.

cor(house)
##                        Price        SqFt    Bedrooms    Bathrooms
## Price              1.0000000  0.55298224  0.52592606  0.523257758
## SqFt               0.5529822  1.00000000  0.48380711  0.522745301
## Bedrooms           0.5259261  0.48380711  1.00000000  0.414555956
## Bathrooms          0.5232578  0.52274530  0.41455596  1.000000000
## Offers            -0.3136359  0.33692335  0.11427061  0.143793404
## Brick              0.4528168  0.07979216  0.04638008  0.171976913
## NeighborhoodNorth -0.5482211 -0.28888599 -0.36466744 -0.275829702
## NeighborhoodEast  -0.1429589  0.04563915 -0.09175034 -0.001247208
##                        Offers       Brick NeighborhoodNorth
## Price             -0.31363588  0.45281679        -0.5482211
## SqFt               0.33692335  0.07979216        -0.2888860
## Bedrooms           0.11427061  0.04638008        -0.3646674
## Bathrooms          0.14379340  0.17197691        -0.2758297
## Offers             1.00000000 -0.14498606         0.3329866
## Brick             -0.14498606  1.00000000        -0.2605536
## NeighborhoodNorth  0.33298661 -0.26055361         1.0000000
## NeighborhoodEast  -0.01560205  0.14756390        -0.5329100
##                   NeighborhoodEast
## Price                 -0.142958878
## SqFt                   0.045639151
## Bedrooms              -0.091750338
## Bathrooms             -0.001247208
## Offers                -0.015602052
## Brick                  0.147563896
## NeighborhoodNorth     -0.532910044
## NeighborhoodEast       1.000000000

This output shows no severe multicollinearity among the predictors. The largest correlations among the size-related variables are the ones you might expect: SqFt with Bathrooms (0.52) and SqFt with Bedrooms (0.48), both moderate. The two neighborhood indicators are also negatively correlated (-0.53), which follows from their being mutually exclusive. These correlations are likely to be of little concern, and when the regression is rerun without the Bathrooms variable, the adjusted R-squared shows that the original model explains the variation in prices better.
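
Pairwise correlations can miss multicollinearity that involves several predictors at once, so a common complement is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The following is a minimal base-R sketch; the helper name `vif_manual` is made up here, and `mtcars` is used only as stand-in data so the example runs on its own.

```r
# VIF for each column of a data frame of predictors:
# regress each predictor on all the others and use 1 / (1 - R^2)
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Stand-in data: three numeric columns from the built-in mtcars data set
vifs <- vif_manual(mtcars[, c("wt", "hp", "disp")])
print(round(vifs, 2))
```

For the house data, the call would presumably be `vif_manual(house[, names(house) != "Price"])`. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.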

A training sample of 100 houses is drawn at random and used to fit a model, which is then used to predict the prices of the remaining 28 houses.

set.seed(1)
n=length(house$Price)
n1=100
n2=n-n1
train=sample(1:n,n1)
m4=lm(Price~.,data=house[train,]) 
summary(m4)
## 
## Call:
## lm(formula = Price ~ ., data = house[train, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27542  -5669  -1185   5893  28238 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        22546.310  11203.793   2.012  0.04710 *  
## SqFt                  52.768      6.075   8.687 1.32e-13 ***
## Bedrooms            4701.107   1678.575   2.801  0.00622 ** 
## Bathrooms           6836.246   2227.577   3.069  0.00282 ** 
## Offers             -8044.981   1126.298  -7.143 2.09e-10 ***
## Brick              17143.525   2197.482   7.801 9.30e-12 ***
## NeighborhoodNorth -18844.946   3377.426  -5.580 2.41e-07 ***
## NeighborhoodEast  -21709.011   2652.542  -8.184 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9401 on 92 degrees of freedom
## Multiple R-squared:  0.8768, Adjusted R-squared:  0.8674 
## F-statistic: 93.52 on 7 and 92 DF,  p-value: < 2.2e-16

The regression output above is similar to the first one, with a slightly higher R-squared (0.8768 versus 0.8686). Bathrooms remains significant, though now at the 1% level rather than the 0.1% level.

pred=predict(m4,newdata=house[-train,])
obs=house$Price[-train]
diff=obs-pred
percdiff=abs(diff)/obs

mean error

me=mean(diff)
me 
## [1] 2703.466

root mean square error

rmse=sqrt(sum(diff^2)/n2)
rmse
## [1] 12106.82

mean absolute percent error

mape=100*(mean(percdiff)) 
mape 
## [1] 7.527531

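A mean absolute percent error of about 7.5% is reasonably low, which suggests (though it does not prove) that the model predicts the prices of held-out houses fairly accurately. The positive mean error indicates that the model underpredicts these 28 prices by about 2,700 on average.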
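
The three error measures used above can be collected into one helper so they are computed the same way for every model. This is a sketch only: the function name `holdout_errors` and the three observed/predicted pairs are invented for illustration and are not taken from the house data.

```r
# Mean error, root mean square error, and mean absolute percent error
# for observed values and their out-of-sample predictions
holdout_errors <- function(obs, pred) {
  err <- obs - pred
  c(me   = mean(err),
    rmse = sqrt(mean(err^2)),
    mape = 100 * mean(abs(err) / obs))
}

# Made-up values: errors are 10, -20, and 0
res <- holdout_errors(obs = c(100, 200, 400), pred = c(90, 220, 400))
print(round(res, 3))
# me = -3.333, rmse = 12.910, mape = 6.667
```

For the analysis above, the equivalent call would be `holdout_errors(house$Price[-train], pred)`.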

Data Set 2: DirectMarketing.csv

This data set includes data from a direct marketer who sells products only via direct mail. The following analysis attempts to explain the amount purchased as a function of the buyer’s characteristics.

dmkt<-read.csv("c:/DataMining/Data/DirectMarketing.csv")

Plots of amount spent against salary, number of children, number of catalogs sent, age, gender, home ownership, marital status, location relative to a brick-and-mortar store, and history of previous purchase volume are shown as follows.

plot(AmountSpent~Salary,data=dmkt)

plot(AmountSpent~Children,data=dmkt)

plot(AmountSpent~Catalogs,data=dmkt)

plot(AmountSpent~Age,data=dmkt)

plot(AmountSpent~Gender,data=dmkt)

plot(AmountSpent~OwnHome,data=dmkt)

plot(AmountSpent~Married,data=dmkt)

plot(AmountSpent~Location,data=dmkt)

plot(AmountSpent~History,data=dmkt)

From these graphs it would be reasonable to expect correlation between the amount spent and the other variables, with variables such as salary and history having greater influence over the model than others.
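
The plots against Gender, OwnHome, Married, Location, History, and Age rely on those columns being factors, in which case the formula interface of `plot()` draws box plots. In R 4.0 and later, `read.csv()` keeps text columns as character strings by default, so a conversion step may be needed. The sketch below uses a made-up six-row data frame; only the column names follow the data set.

```r
# Made-up rows; column names follow the DirectMarketing data
toy <- data.frame(
  AmountSpent = c(120, 340, 90, 410, 275, 60),
  Gender = c("Female", "Male", "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Convert the text column to a factor before plotting or modelling
toy$Gender <- factor(toy$Gender)

# Numeric summary of what the box plot would show
group_medians <- tapply(toy$AmountSpent, toy$Gender, median)
print(group_medians)

# Box plot of AmountSpent by Gender (drawn only in an interactive session)
if (interactive()) plot(AmountSpent ~ Gender, data = toy)
```

Alternatively, passing `stringsAsFactors = TRUE` to `read.csv()` converts every text column at load time.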

The following output is a regression model fitting all the data provided.

model3=lm(AmountSpent~.,data=dmkt)
summary(model3)
## 
## Call:
## lm(formula = AmountSpent ~ ., data = dmkt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1648.11  -286.72   -12.63   218.21  2771.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .  
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311    
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740    
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823    
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967    
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553    
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 ** 
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 685 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7887, Adjusted R-squared:  0.7853 
## F-statistic: 232.5 on 11 and 685 DF,  p-value: < 2.2e-16

The raw equation for the above regression: AmountSpent = -2.496e+02 + 4.139e+01AgeOld + 8.965e+01AgeYoung - 5.370e+01GenderMale - 1.829e+01OwnHomeRent + 1.950e+01MarriedSingle + 6.090e+02LocationFar + 1.883e-02Salary - 2.683e+02Children - 2.675e+02HistoryLow - 3.446e+02HistoryMedium + 4.052e+01Catalogs

Without scientific notation: AmountSpent = -249.6 + 41.4AgeOld + 89.7AgeYoung - 53.7GenderMale - 18.3OwnHomeRent + 19.5MarriedSingle + 609.0LocationFar + 0.01883Salary - 268.3Children - 267.5HistoryLow - 344.6HistoryMedium + 40.5Catalogs

This model has an adjusted R-squared of 0.785 (78.5%) and a very low F-test p-value, so it fits quite well. Note that it was fitted to 697 of the 1,000 observations, because 303 were dropped for missing values. Only the independent variables location, salary, number of children, history, and number of catalogs appear statistically significant in this model.
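
Because the output reports 303 observations deleted due to missingness, it is worth checking where the missing values sit before trusting the fit. The sketch below uses an invented seven-row data frame; placing the missing values in History is an assumption made for illustration, and the replacement label "None" is hypothetical.

```r
# Invented rows; which column has NAs in the real data is an assumption here
toy <- data.frame(
  AmountSpent = c(500, 1200, 300, 800, 150, 950, 420),
  History = c("High", NA, "Low", "Medium", NA, "High", "Low")
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# lm() silently drops incomplete rows (default na.action is na.omit)
fit <- lm(AmountSpent ~ History, data = toy)
print(nobs(fit))  # 5 of the 7 rows are used

# One option: keep those rows by treating "missing" as its own category
toy$History2 <- ifelse(is.na(toy$History), "None", toy$History)
fit2 <- lm(AmountSpent ~ History2, data = toy)
print(nobs(fit2))  # all 7 rows are used
```

Whether recoding is appropriate depends on why the values are missing; dropping nearly a third of the sample, as the default does, can bias the estimates if the missing rows differ systematically from the rest.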

Data Set 3: GenderDiscrimination.csv

ged<-read.csv("c:/DataMining/Data/GenderDiscrimination.csv")

Plots of salary against years of experience and gender are shown as follows.

plot(Salary~Experience,data=ged)

plot(Salary~Gender,data=ged)

From these graphs one can reasonably expect to see a moderately strong, positive correlation between salary and years of experience in a predictive model. A regression model should predict a lower salary if the gender of the individual is female.

The following output is a regression model fitting all the data provided.

model2=lm(Salary~.,data=ged)
summary(model2)
## 
## Call:
## lm(formula = Salary ~ ., data = ged)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52779  -9806   -121   8347  60913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4359 
## F-statistic: 80.98 on 2 and 205 DF,  p-value: < 2.2e-16

The raw equation for the above regression: Salary = 53260.0 + 17020.6GenderMale + 1744.6Experience
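
As a worked example of reading this equation, the sketch below compares the predicted salaries of a woman and a man who each have 10 years of experience. The coefficients are copied from the output above; the choice of 10 years and the helper name `predict_salary` are illustrative assumptions.

```r
# Coefficients copied from the summary(model2) output
b <- c(intercept = 53260.0, male = 17020.6, experience = 1744.6)

# Predicted salary for a given gender indicator (1 = male) and years of experience
predict_salary <- function(male, years) {
  b[["intercept"]] + b[["male"]] * male + b[["experience"]] * years
}

female_10 <- predict_salary(male = 0, years = 10)
male_10   <- predict_salary(male = 1, years = 10)
print(c(female = female_10, male = male_10, gap = male_10 - female_10))
# female = 70706.0, male = 87726.6, gap = 17020.6
```

Because the model has no interaction term, the predicted gap equals the GenderMale coefficient at every level of experience.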

If the goal of this model were accurate prediction of salary, it would perform poorly, as suggested by an adjusted R-squared of 0.436 (43.6%) and a residual standard error of about 16,900. However, what one can conclude from this output is that gender and experience each have a statistically significant relationship with salary (p-values < 0.001): holding experience constant, the model estimates that men earn about 17,000 more than women.