Homework 2

House Prices

House Prices is a data set that has 128 observations (houses) and includes prices and characteristics of those houses in a major US metro area.

HP <- read.csv("~/DataMining/Data/HousePrices.csv")

plot(Price~SqFt, data=HP)

plot(Price~Bathrooms, data=HP)

plot(Price~Offers, data=HP)

boxplot(Price~Brick, data=HP)

boxplot(Price~Neighborhood, data=HP)

These plots show Price against several of the predictors, and I see nothing concerning in them.

Regression output and summary:

m1 = lm(Price~., data=HP)
summary(m1)

## 
## Call:
## lm(formula = Price ~ ., data = HP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27897.8  -6074.8    -48.7   5551.8  27536.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         308.114   9605.692   0.032 0.974465    
## HomeID              -11.456     25.387  -0.451 0.652616    
## SqFt                 53.634      5.926   9.051 3.30e-15 ***
## Bedrooms           4136.461   1621.775   2.551 0.012023 *  
## Bathrooms          7975.157   2133.831   3.737 0.000287 ***
## Offers            -8350.128   1103.693  -7.566 8.96e-12 ***
## BrickYes          17313.540   1988.548   8.707 2.12e-14 ***
## NeighborhoodNorth  1729.613   2433.756   0.711 0.478675    
## NeighborhoodWest  22264.319   2540.699   8.763 1.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared:  0.8688, Adjusted R-squared:   0.86 
## F-statistic: 98.54 on 8 and 119 DF,  p-value: < 2.2e-16

Equation of the regression (Raw Form): Price = 308.114 -11.456HomeID + 53.634SqFt + 4136.461Bedrooms + 7975.157Bathrooms -8350.128Offers + 17313.540BrickYes + 1729.613NeighborhoodNorth + 22264.319NeighborhoodWest
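To make the equation concrete, the sketch below plugs one hypothetical house into the fitted coefficients. The house's characteristics (2,000 square feet, 3 bedrooms, 2 bathrooms, 2 offers, brick, West neighborhood, HomeID 1) are invented for illustration and are not a row of the data set; only the coefficients come from the summary output.

```r
# Coefficients copied from the summary(m1) output.
b <- c(Intercept = 308.114, HomeID = -11.456, SqFt = 53.634,
       Bedrooms = 4136.461, Bathrooms = 7975.157, Offers = -8350.128,
       BrickYes = 17313.540, NeighborhoodNorth = 1729.613,
       NeighborhoodWest = 22264.319)

# Hypothetical house (assumed values, not taken from HP).
# Dummy variables are 1 when the level applies and 0 otherwise.
x <- c(Intercept = 1, HomeID = 1, SqFt = 2000,
       Bedrooms = 3, Bathrooms = 2, Offers = 2,
       BrickYes = 1, NeighborhoodNorth = 0,
       NeighborhoodWest = 1)

# The prediction is the sum of coefficient * value over all terms.
predicted_price <- sum(b * x[names(b)])
predicted_price
```

For this made-up house the equation gives a predicted price of about 158,802.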

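Based on this regression, the significant variables (at the 5% level) are SqFt, Bedrooms, Bathrooms, Offers, BrickYes, and NeighborhoodWest; HomeID and NeighborhoodNorth are not significant. The R^2 of 0.87 indicates that the model explains most of the variation in price, and the overall F-test p-value is less than 2.2e-16.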
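HomeID looks like a row identifier rather than a house characteristic, yet the formula Price ~ . includes it as a predictor. One way to leave it out is sketched below. The sketch uses a small simulated data frame (hp_demo and all of its values are assumptions, not the real data); with the real data the same formula would be applied to HP.

```r
set.seed(1)
# Simulated stand-in for HP (hypothetical values, for illustration only).
hp_demo <- data.frame(
  HomeID = 1:40,
  SqFt   = round(runif(40, 1400, 2600)),
  Offers = sample(1:5, 40, replace = TRUE),
  Brick  = factor(rep(c("No", "Yes"), 20))
)
hp_demo$Price <- 20000 + 50 * hp_demo$SqFt - 8000 * hp_demo$Offers +
  15000 * (hp_demo$Brick == "Yes") + rnorm(40, sd = 8000)

# "." expands to every other column; "- HomeID" removes the identifier.
m_demo <- lm(Price ~ . - HomeID, data = hp_demo)
names(coef(m_demo))
```

The coefficient names confirm that HomeID no longer enters the model while the other columns still do.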

Direct Marketing

Direct Marketing is a data set that is from a direct marketer who sells products only via direct mail.

DM <- read.csv("~/DataMining/Data/DirectMarketing.csv")

plot(AmountSpent~Age, data = DM)

plot(AmountSpent~Gender, data = DM)

plot(AmountSpent~OwnHome, data = DM)

plot(AmountSpent~Married, data = DM)

plot(AmountSpent~Location, data = DM)

plot(AmountSpent~Salary, data = DM)

plot(AmountSpent~Children, data = DM)

plot(AmountSpent~History, data = DM)

plot(AmountSpent~Catalogs, data = DM)

These plots show AmountSpent against each predictor, and I see nothing concerning in them.
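Several predictors here (Age, Gender, OwnHome, Married, Location, History) are categorical. Since R 4.0.0, read.csv() leaves text columns as character vectors by default, and plot() with a character predictor may fail rather than draw box plots. A minimal sketch of converting such columns to factors first, using a simulated data frame (dm_demo and its values are assumptions, not the real data):

```r
set.seed(2)
# Simulated stand-in for DM with one text column (hypothetical values).
dm_demo <- data.frame(
  Gender   = rep(c("Female", "Male"), 15),
  Catalogs = sample(c(6, 12, 18, 24), 30, replace = TRUE),
  stringsAsFactors = FALSE
)
dm_demo$AmountSpent <- 200 + 40 * dm_demo$Catalogs + rnorm(30, sd = 150)

# Convert every character column to a factor.
is_text <- vapply(dm_demo, is.character, logical(1))
dm_demo[is_text] <- lapply(dm_demo[is_text], factor)

# With a factor on the right-hand side, plot() draws one box per level.
plot(AmountSpent ~ Gender, data = dm_demo)
```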

Regression and output summary:

m2 = lm(AmountSpent~., data=DM)
summary(m2)

## 
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1648.11  -286.72   -12.63   218.21  2771.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .  
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311    
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740    
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823    
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967    
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553    
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 ** 
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 685 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7887, Adjusted R-squared:  0.7853 
## F-statistic: 232.5 on 11 and 685 DF,  p-value: < 2.2e-16

Equation of Regression line (Raw Form): AmountSpent = -249.6 + 41.39AgeOld + 89.65AgeYoung - 53.7GenderMale - 18.29OwnHomeRent + 19.5MarriedSingle + 609LocationFar + .0188Salary - 268.3Children - 267.5HistoryLow - 344.6HistoryMedium + 40.52Catalogs

This regression gives significant variables of LocationFar, Salary, Children, HistoryLow, HistoryMedium, and Catalogs. Its R^2 of 0.79 indicates a fairly good model, and the overall p-value is less than 2.2e-16. Note that 303 observations were dropped because of missing values, so the model is fit on 697 of the 1,000 rows.
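By default lm() drops any row with an NA in a model variable, which is where the "observations deleted due to missingness" line in the summary comes from. Counting NAs per column is a quick way to see which variables are responsible. The sketch below uses a small simulated data frame (dm_na and its values are assumptions, not the real data).

```r
# Simulated stand-in with some missing values (hypothetical data).
dm_na <- data.frame(
  AmountSpent = c(500, 1200, 300, 900, 1500, 700),
  Salary      = c(40000, 80000, 25000, NA, 95000, 50000),
  History     = factor(c("Low", NA, "Low", "High", NA, "Medium"))
)

# Number of missing values in each column.
colSums(is.na(dm_na))

# Rows with no NA in any column, i.e. the rows a model on all columns keeps.
sum(complete.cases(dm_na))

# A model that uses only Salary drops only the row with a missing Salary.
m_na <- lm(AmountSpent ~ Salary, data = dm_na)
nobs(m_na)
```

On the real data, colSums(is.na(DM)) would show which columns account for the 303 dropped rows.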

Gender Discrimination

The Gender Discrimination data looks at variables Salary, Gender and Experience.

GD <- read.csv("~/DataMining/Data/GenderDiscrimination.csv")

plot(Salary~Gender, data = GD)

plot(Salary~Experience, data = GD)

These plots show Salary against each predictor, and I see nothing concerning in them.

Regression and output summary:

m3= lm(Salary~., data = GD)
summary(m3)

## 
## Call:
## lm(formula = Salary ~ ., data = GD)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52779  -9806   -121   8347  60913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4359 
## F-statistic: 80.98 on 2 and 205 DF,  p-value: < 2.2e-16

Equation of Regression line (Raw Form): Salary = 53260.0 + 17020.6GenderMale + 1744.6Experience

In this regression, both predictors are significant. The R^2 of 0.44 tells us that this is not a strong model: about 56% of the variance in Salary is not explained by it.
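R^2 is the share of the variation in the response that the model accounts for, computed as 1 - RSS/TSS. The sketch below checks that formula against summary() on simulated data (gd_demo and its values are assumptions, not the real data set).

```r
set.seed(3)
# Simulated stand-in for GD (hypothetical values, for illustration only).
gd_demo <- data.frame(
  Gender     = factor(rep(c("Female", "Male"), 25)),
  Experience = round(runif(50, 0, 30))
)
gd_demo$Salary <- 50000 + 15000 * (gd_demo$Gender == "Male") +
  1700 * gd_demo$Experience + rnorm(50, sd = 17000)

fit <- lm(Salary ~ ., data = gd_demo)

rss <- sum(residuals(fit)^2)                           # unexplained variation
tss <- sum((gd_demo$Salary - mean(gd_demo$Salary))^2)  # total variation
r2_by_hand <- 1 - rss / tss

c(by_hand = r2_by_hand, from_summary = summary(fit)$r.squared)
```

The two values agree, and 1 minus either of them is the unexplained share of the variation.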

For this data set I also ran a leave-one-out test, holding out one observation at a time as the test set:

n=length(GD$Salary)
for(k in 1:n)
{
  train1=c(1:n)
  train=train1[train1!=k]
  m3=lm(Salary~.,data=GD[train,])
  pred=predict(m3,data=GD[-train,])
  obs=GD$Salary[-train]
  diff=obs-pred
  percdiff=abs(diff)/obs
}
me=mean(diff)
rmse=sqrt(sum(diff**2))
mape=100*(mean(percdiff))
me

## [1] 108678.6

rmse

## [1] 1575709

mape

## [1] 57.80779

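This test gave a mean absolute percentage error (MAPE) of 57.8%, which is consistent with the low R^2 and points to a model that predicts poorly. These figures should be read with caution, though. The loop overwrites diff and percdiff on every pass, so only the last iteration (k = n) contributes to them. predict() is given data= rather than newdata=, so it returns fitted values for the training rows instead of a prediction for the held-out row. Finally, rmse takes the square root of a sum of squares rather than of their mean.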
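A leave-one-out loop that stores the error for every held-out row and passes that row to predict() through newdata might look like the sketch below. It runs on simulated data (gd_sim and its values are assumptions), so its numbers say nothing about the real data set; with the real data, GD would take the place of gd_sim.

```r
set.seed(4)
# Simulated stand-in for GD (hypothetical values, for illustration only).
gd_sim <- data.frame(
  Gender     = factor(rep(c("Female", "Male"), 30)),
  Experience = round(runif(60, 0, 30))
)
gd_sim$Salary <- 50000 + 15000 * (gd_sim$Gender == "Male") +
  1700 * gd_sim$Experience + rnorm(60, sd = 10000)

n <- nrow(gd_sim)
diff <- numeric(n)      # one prediction error per held-out row
percdiff <- numeric(n)  # one absolute percentage error per held-out row

for (k in seq_len(n)) {
  fit_k  <- lm(Salary ~ ., data = gd_sim[-k, ])                  # train without row k
  pred_k <- predict(fit_k, newdata = gd_sim[k, , drop = FALSE])  # predict row k only
  diff[k] <- gd_sim$Salary[k] - pred_k
  percdiff[k] <- abs(diff[k]) / abs(gd_sim$Salary[k])
}

me   <- mean(diff)            # mean error
rmse <- sqrt(mean(diff^2))    # root mean squared error
mape <- 100 * mean(percdiff)  # mean absolute percentage error
c(me = me, rmse = rmse, mape = mape)
```

Storing the errors in vectors indexed by k is what lets the summary statistics cover all n held-out rows rather than only the last one.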

Jones

October 5, 2017
