House Prices

This data set includes prices and characteristics of 128 houses. The variables include Price, SqFt, Bedrooms, Bathrooms, Offers, Brick, and Neighborhood.

HP <- read.csv("~/Business Analytics/HousePrices.csv", stringsAsFactors = TRUE)  # factors needed for the categorical plots in R >= 4.0
plot(Price~SqFt,data=HP)

plot(Price~Bedrooms,data=HP)

plot(Price~Bathrooms,data=HP)

plot(Price~Offers,data=HP)

plot(Price~Brick,data=HP)

plot(Price~Neighborhood,data=HP)

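Above are the plots of Price against each X variable in our data set. None of them shows anything unusual. Note that each plot relates Price to a single predictor, so these plots cannot by themselves show whether the X variables are collinear; that has to be checked on the predictors themselves.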

The correlation matrix supports our assumption that the X variables are not strongly correlated with one another. Next, we look at our regression output.
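A correlation check of this kind could be sketched in base R as follows. The small `toy` data frame and all of its values are invented stand-ins for `HP`, not rows from HousePrices.csv, and the 0.8 cutoff is only a common rule of thumb.

```r
# Hypothetical stand-in for HP; every value here is made up for illustration.
toy <- data.frame(
  Price     = c(100000, 120000, 150000, 90000, 130000, 160000),
  SqFt      = c(1500, 1800, 2200, 1400, 2000, 2100),
  Bedrooms  = c(2, 3, 4, 2, 3, 3),
  Bathrooms = c(2, 2, 3, 2, 3, 2),
  Brick     = c("No", "Yes", "No", "No", "Yes", "Yes")
)

# Keep only the numeric predictors; cor() does not accept character columns.
num_cols <- sapply(toy, is.numeric)
X <- toy[, num_cols & names(toy) != "Price", drop = FALSE]

# Pairwise Pearson correlations between the predictors.
cmat <- cor(X)
print(round(cmat, 2))

# List predictor pairs whose absolute correlation exceeds the assumed cutoff.
high <- which(abs(cmat) > 0.8 & upper.tri(cmat), arr.ind = TRUE)
print(high)
```

On the real data the same two calls would be applied to `HP` instead of `toy`. Categorical variables such as Brick and Neighborhood are left out here because Pearson correlation is defined only for numeric columns.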

m1=lm(Price~.,data=HP)
summary(m1)
## 
## Call:
## lm(formula = Price ~ ., data = HP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27897.8  -6074.8    -48.7   5551.8  27536.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         308.114   9605.692   0.032 0.974465    
## HomeID              -11.456     25.387  -0.451 0.652616    
## SqFt                 53.634      5.926   9.051 3.30e-15 ***
## Bedrooms           4136.461   1621.775   2.551 0.012023 *  
## Bathrooms          7975.157   2133.831   3.737 0.000287 ***
## Offers            -8350.128   1103.693  -7.566 8.96e-12 ***
## BrickYes          17313.540   1988.548   8.707 2.12e-14 ***
## NeighborhoodNorth  1729.613   2433.756   0.711 0.478675    
## NeighborhoodWest  22264.319   2540.699   8.763 1.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared:  0.8688, Adjusted R-squared:   0.86 
## F-statistic: 98.54 on 8 and 119 DF,  p-value: < 2.2e-16

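Our regression output gives us an adjusted R^2 of 0.86, which tells us that the model explains about 86% of the variation in Price and is a good fit for our data set. SqFt, Bedrooms, Bathrooms, Offers, BrickYes, and NeighborhoodWest are all significant at the 5% level, while HomeID and NeighborhoodNorth are not. HomeID appears to be a row identifier rather than a characteristic of the house, so it arguably does not belong in the model.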

The equation for our line is:

Price = 308.11 - 11.46*HomeID + 53.63*SqFt + 4136.46*Bedrooms + 7975.16*Bathrooms - 8350.13*Offers + 17313.54*BrickYes + 1729.61*NeighborhoodNorth + 22264.32*NeighborhoodWest
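To make the equation concrete, here is a small sketch that plugs one house into it. The coefficients are copied from the summary output above; the house itself (HomeID 10, 2000 square feet, and so on) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the full-data summary(m1) output above.
coefs <- c(
  "(Intercept)"     = 308.114,
  HomeID            = -11.456,
  SqFt              = 53.634,
  Bedrooms          = 4136.461,
  Bathrooms         = 7975.157,
  Offers            = -8350.128,
  BrickYes          = 17313.540,
  NeighborhoodNorth = 1729.613,
  NeighborhoodWest  = 22264.319
)

# Hypothetical house: brick, East neighborhood (both neighborhood dummies 0).
house <- c(
  "(Intercept)"     = 1,
  HomeID            = 10,
  SqFt              = 2000,
  Bedrooms          = 3,
  Bathrooms         = 2,
  Offers            = 2,
  BrickYes          = 1,
  NeighborhoodNorth = 0,
  NeighborhoodWest  = 0
)

# Guard against a mismatch between the two named vectors.
stopifnot(identical(names(coefs), names(house)))

# The predicted price is the sum of coefficient times value.
pred_price <- sum(coefs * house)
print(pred_price)
```

East is the baseline neighborhood because it has no dummy variable in the output, so a house in the East sets both neighborhood terms to zero.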

Next, I will take 90 random cases and create a model with them and use this model to predict a test set of 38 cases. Then I will examine the error.

set.seed(1)
n=length(HP$Price)
n1=90
n2=n-n1
train=sample(1:n,n1)
m1=lm(Price~.,data=HP[train,])
summary(m1)
## 
## Call:
## lm(formula = Price ~ ., data = HP[train, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29267  -4859  -1429   5576  28896 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1782.970  11103.095   0.161  0.87282    
## HomeID              -18.470     29.611  -0.624  0.53453    
## SqFt                 51.466      6.655   7.734 2.51e-11 ***
## Bedrooms           4692.374   1854.417   2.530  0.01333 *  
## Bathrooms          7309.842   2387.949   3.061  0.00299 ** 
## Offers            -7737.062   1229.031  -6.295 1.49e-08 ***
## BrickYes          17915.620   2434.550   7.359 1.36e-10 ***
## NeighborhoodNorth  3941.988   3001.474   1.313  0.19277    
## NeighborhoodWest  22785.492   2919.501   7.805 1.82e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9586 on 81 degrees of freedom
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8656 
## F-statistic: 72.68 on 8 and 81 DF,  p-value: < 2.2e-16
pred=predict(m1,newdata=HP[-train,])
obs=HP$Price[-train]
diff=obs-pred
percdiff=abs(diff)/obs
me=mean(diff)
rmse=sqrt(sum(diff**2)/n2)
mape=100*(mean(percdiff))
me   # mean error
## [1] 1904.989
rmse # root mean square error
## [1] 11328.64
mape # mean absolute percent error
## [1] 6.882381

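With a mean absolute percent error of roughly 7%, the model fit on the 90 training cases predicts the 38 test cases well. The test RMSE (11,329) is somewhat higher than the residual standard error on the training set (9,586), which is expected for cases the model has not seen, and the positive mean error (1,905) indicates that the model slightly underpredicts price on average. This tells us the data set is a reasonable basis for building models, and the model should translate to more data of the same nature.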
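The three error measures used above can be wrapped in a small helper so the same calculation can be reused for other train/test splits. This is only a sketch: `holdout_metrics` is a hypothetical name, not part of any package, and the numbers in the check are invented.

```r
# Hold-out error metrics for observed vs. predicted values.
# holdout_metrics is a hypothetical helper name used for illustration.
holdout_metrics <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  err <- obs - pred
  c(
    me   = mean(err),                       # mean error (bias)
    rmse = sqrt(mean(err^2)),               # root mean square error
    mape = 100 * mean(abs(err) / abs(obs))  # mean absolute percent error
  )
}

# Toy check with made-up numbers (not taken from HousePrices.csv).
obs  <- c(100, 200, 400)
pred <- c(110, 190, 380)
res  <- holdout_metrics(obs, pred)
print(res)
```

With the objects from the analysis above, the call would be `holdout_metrics(HP$Price[-train], pred)`.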

Gender Discrimination

This data set includes data on gender, work experience, and salary.

GD <- read.csv("~/Business Analytics/GenderDiscrimination.csv", stringsAsFactors = TRUE)  # factors needed for the categorical plot in R >= 4.0
plot(Salary~Gender,data=GD)

plot(Salary~Experience,data=GD)

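Above are the plots of Salary against each X variable in our data set. Neither plot shows anything unusual. Each plot relates Salary to a single predictor, so they do not by themselves show whether Gender and Experience are collinear.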

m2=lm(Salary~.,data=GD)
summary(m2)
## 
## Call:
## lm(formula = Salary ~ ., data = GD)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52779  -9806   -121   8347  60913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4359 
## F-statistic: 80.98 on 2 and 205 DF,  p-value: < 2.2e-16

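Our regression output gives us an adjusted R^2 of 0.44, which tells us that the model is not a strong fit for our data set: less than 50% of the variation in Salary is explained by the model. Both variables are nonetheless significant. Holding experience constant, the model estimates that men earn about 17,021 more than women, and each additional unit of Experience adds about 1,745 to Salary.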
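Because the GenderMale coefficient is the quantity of interest in this data set, it may help to put an interval around it. The sketch below rebuilds a 95% confidence interval by hand from the estimate, standard error, and residual degrees of freedom printed in the output above; the 95% level is an assumption, not something stated in the original analysis.

```r
# Values copied from the summary(m2) output above.
est <- 17020.6   # GenderMale estimate
se  <- 2499.6    # its standard error
df  <- 205       # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t quantile * standard error.
tcrit <- qt(0.975, df)
ci <- c(lower = est - tcrit * se, upper = est + tcrit * se)
print(round(ci, 1))
```

When the fitted object is available, `confint(m2)` should return the same interval without the manual arithmetic. An interval that stays well above zero is consistent with the very small p-value for GenderMale.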

The equation for our line is:

Salary = 53260.0 + 17020.6*GenderMale + 1744.6*Experience

Direct Marketing

This data set includes data from a direct marketer who sells his products via mail.

DM <- read.csv("~/Business Analytics/DirectMarketing.csv", stringsAsFactors = TRUE)  # factors needed for the categorical plots in R >= 4.0
plot(AmountSpent~Age,data=DM)

plot(AmountSpent~Gender,data=DM)

plot(AmountSpent~OwnHome,data=DM)

plot(AmountSpent~Married,data=DM)

plot(AmountSpent~Location,data=DM)

plot(AmountSpent~Children,data=DM)

plot(AmountSpent~History,data=DM)

plot(AmountSpent~Catalogs,data=DM)

plot(AmountSpent~Salary,data=DM)

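Above are the plots of AmountSpent against each X variable in our data set. None of them shows anything unusual. Each plot relates AmountSpent to a single predictor, so these plots cannot by themselves show whether the X variables are collinear.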

m3=lm(AmountSpent~.,data=DM)
summary(m3)
## 
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1648.11  -286.72   -12.63   218.21  2771.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .  
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311    
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740    
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823    
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967    
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553    
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 ** 
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 685 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7887, Adjusted R-squared:  0.7853 
## F-statistic: 232.5 on 11 and 685 DF,  p-value: < 2.2e-16

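Our regression output gives us an adjusted R^2 of 0.785, which tells us that the model explains about 79% of the variation in AmountSpent and is a good fit for our data set. LocationFar, Salary, Children, HistoryLow, HistoryMedium, and Catalogs are all significant, while Age, Gender, OwnHome, and Married are not. Note that 303 observations were deleted due to missing values, so the model is fit on only 697 of the 1,000 cases.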
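Since `lm` silently drops any row with a missing value, it is worth checking where the missing values sit before trusting the fit. A minimal sketch in base R follows; the `toy` data frame is invented for illustration and is not taken from DirectMarketing.csv, and which column is missing in the real data is not stated in this report.

```r
# Hypothetical stand-in for DM; all values are made up for illustration.
toy <- data.frame(
  AmountSpent = c(500, 1000, 250, 2000, 1200),
  History     = c("High", NA, "Low", NA, "Medium"),
  Catalogs    = c(6, 12, 18, 24, 12)
)

# Number of missing values in each column.
na_per_col <- colSums(is.na(toy))
print(na_per_col)

# Number of rows lm() would keep under its default na.action (na.omit).
complete_n <- sum(complete.cases(toy))
print(complete_n)
```

Applied to `DM`, `colSums(is.na(DM))` would show which variable accounts for the 303 dropped observations, which in turn indicates whether the remaining 697 cases are still representative.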

The equation for our line is:

AmountSpent = -249.6 + 41.39*AgeOld + 89.65*AgeYoung - 53.70*GenderMale - 18.29*OwnHomeRent + 19.50*MarriedSingle + 609.0*LocationFar + 0.0188*Salary - 268.3*Children - 267.5*HistoryLow - 344.6*HistoryMedium + 40.52*Catalogs