download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv", "HousePrices.csv",method="curl")
hp<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv")
attach(hp)
Based on these plots, it seems like the variables that will be most significant in a linear regression are square feet, offers, and brick. This is because these variables show linear correlations when plotted against the dependent variable (house price). In addition, the neighborhood variable needs to be made into dummy variables because it has three categories and it is possibly significant because the box-and-whisker plot of “West” only slightly overlaps with the other two box-and-whisker plots.
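For reference, plots of the kind described above can be produced with calls like the following (a sketch, using the same plot() formula style as the marketing plots later in this document; plotting Price against a factor such as Brick or Neighborhood gives box-and-whisker plots):
# Exploratory plots of Price against each candidate predictor
plot(Price~SqFt,data=hp)
plot(Price~Bedrooms,data=hp)
plot(Price~Bathrooms,data=hp)
plot(Price~Offers,data=hp)
plot(Price~Brick,data=hp)
plot(Price~Neighborhood,data=hp)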
# 0/1 dummy variables for the three-level Neighborhood factor
# (the remaining neighborhood is the baseline captured by the intercept)
hp$WestNeighborhood=ifelse(hp$Neighborhood=="West",1,0)
hp$NorthNeighborhood=ifelse(hp$Neighborhood=="North",1,0)
h1=lm(hp$Price~hp$SqFt+hp$HomeID+hp$Bedrooms+hp$Bathrooms+hp$Offers+hp$Brick+hp$WestNeighborhood+hp$NorthNeighborhood,data=hp)
summary(h1)
##
## Call:
## lm(formula = hp$Price ~ hp$SqFt + hp$HomeID + hp$Bedrooms + hp$Bathrooms +
## hp$Offers + hp$Brick + hp$WestNeighborhood + hp$NorthNeighborhood,
## data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27897.8 -6074.8 -48.7 5551.8 27536.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 308.114 9605.692 0.032 0.974465
## hp$SqFt 53.634 5.926 9.051 3.30e-15 ***
## hp$HomeID -11.456 25.387 -0.451 0.652616
## hp$Bedrooms 4136.461 1621.775 2.551 0.012023 *
## hp$Bathrooms 7975.157 2133.831 3.737 0.000287 ***
## hp$Offers -8350.128 1103.693 -7.566 8.96e-12 ***
## hp$BrickYes 17313.540 1988.548 8.707 2.12e-14 ***
## hp$WestNeighborhood 22264.319 2540.699 8.763 1.56e-14 ***
## hp$NorthNeighborhood 1729.613 2433.756 0.711 0.478675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared: 0.8688, Adjusted R-squared: 0.86
## F-statistic: 98.54 on 8 and 119 DF, p-value: < 2.2e-16
This model fits reasonably well, as evidenced by its adjusted R-squared of 0.86, and the overall F-statistic is highly significant. That only tells us the predictors jointly matter, however; individually, HomeID and NorthNeighborhood are nowhere near significant (p = 0.65 and p = 0.48) and should not be in the model. HomeID in particular is just an identifier and carries no real information about price.
Based on the raw coefficients from the regression, you get the equation: y = 308.11 + 53.63(SqFt) - 11.46(HomeID) + 4136.46(Bedrooms) + 7975.16(Bathrooms) - 8350.13(Offers) + 17313.54(BrickYes) + 22264.32(WestNeighborhood) + 1729.61(NorthNeighborhood)
h2=lm(hp$Price~hp$SqFt+hp$Bedrooms+hp$Bathrooms+hp$Offers+hp$Brick+hp$WestNeighborhood,data=hp)
summary(h2)
##
## Call:
## lm(formula = hp$Price ~ hp$SqFt + hp$Bedrooms + hp$Bathrooms +
## hp$Offers + hp$Brick + hp$WestNeighborhood, data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26810.5 -5953.6 -266.5 5662.9 26793.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3067.471 8746.712 0.351 0.726423
## hp$SqFt 52.149 5.572 9.359 5.44e-16 ***
## hp$Bedrooms 4070.005 1570.921 2.591 0.010751 *
## hp$Bathrooms 7810.698 2109.060 3.703 0.000322 ***
## hp$Offers -8019.003 1013.011 -7.916 1.32e-12 ***
## hp$BrickYes 17058.771 1942.805 8.780 1.28e-14 ***
## hp$WestNeighborhood 21937.572 2482.393 8.837 9.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9995 on 121 degrees of freedom
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8616
## F-statistic: 132.8 on 6 and 121 DF, p-value: < 2.2e-16
This model is better: the insignificant variables HomeID and NorthNeighborhood have been removed, the adjusted R-squared rose slightly (from 0.860 to 0.8616), and every remaining coefficient is statistically significant. The larger F-statistic reflects the fact that essentially the same explanatory power is now achieved with fewer predictors.
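A partial F-test on the two nested fits makes this comparison formal; a quick check:
# Nested-model comparison: a large p-value means HomeID and NorthNeighborhood
# add no significant explanatory power beyond the reduced model
anova(h2,h1)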
Based on the raw coefficients from this new regression, the equation becomes: y = 3067.47 + 52.15(SqFt) + 4070.01(Bedrooms) + 7810.70(Bathrooms) - 8019.00(Offers) + 17058.77(BrickYes) + 21937.57(WestNeighborhood)
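To illustrate the fitted equation, the model can be used to score a hypothetical house. Because h2 was specified with hp$ prefixes in the formula, predict() cannot make use of a newdata frame, so this sketch refits the same model with bare variable names; the house characteristics below are made up:
# Refit the reduced model so predict() can accept new data, then price a
# hypothetical 2000 sq ft brick house in the West neighborhood
h2b=lm(Price~SqFt+Bedrooms+Bathrooms+Offers+Brick+WestNeighborhood,data=hp)
newhouse=data.frame(SqFt=2000,Bedrooms=3,Bathrooms=2,Offers=2,Brick="Yes",WestNeighborhood=1)
predict(h2b,newdata=newhouse)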
# Drop the factor columns (Brick, then Neighborhood) so cor() runs on numeric data only
hp=hp[-7]
hp=hp[-7]
cor(hp)
## HomeID Price SqFt Bedrooms Bathrooms
## HomeID 1.00000000 0.1081900 0.1685533 -0.06856814 0.1276935
## Price 0.10818998 1.0000000 0.5529822 0.52592606 0.5232578
## SqFt 0.16855330 0.5529822 1.0000000 0.48380711 0.5227453
## Bedrooms -0.06856814 0.5259261 0.4838071 1.00000000 0.4145560
## Bathrooms 0.12769353 0.5232578 0.5227453 0.41455596 1.0000000
## Offers -0.05359711 -0.3136359 0.3369234 0.11427061 0.1437934
## WestNeighborhood 0.02687339 0.7140066 0.2507592 0.47147686 0.2859231
## NorthNeighborhood 0.04985927 -0.5482211 -0.2888860 -0.36466744 -0.2758297
## Offers WestNeighborhood NorthNeighborhood
## HomeID -0.05359711 0.02687339 0.04985927
## Price -0.31363588 0.71400660 -0.54822108
## SqFt 0.33692335 0.25075921 -0.28888599
## Bedrooms 0.11427061 0.47147686 -0.36466744
## Bathrooms 0.14379340 0.28592314 -0.27582970
## Offers 1.00000000 -0.32742521 0.33298661
## WestNeighborhood -0.32742521 1.00000000 -0.47909760
## NorthNeighborhood 0.33298661 -0.47909760 1.00000000
In this correlation matrix, a few variables are noticeably correlated with one another. First, SqFt is moderately correlated with Bathrooms, Bedrooms, and Offers. This is not surprising, because a larger house tends to have more bedrooms and bathrooms. Since SqFt is more significant in the regression than any of those three variables, I would keep SqFt in the model. WestNeighborhood is also correlated with several variables, including Bedrooms, Offers, and NorthNeighborhood; because WestNeighborhood is the most significant of these in the regression, I would keep it in the model as well.
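Variance inflation factors give a more direct read on whether this correlation among predictors is a problem for the regression; a quick check, assuming the car package is installed:
# Variance inflation factors for the reduced model; values well below 5
# suggest the multicollinearity is tolerable
library(car)
vif(h2)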
n=length(hp$Price)
diff=numeric(n)      # prediction error for each left-out observation
percdiff=numeric(n)  # absolute percentage error for each left-out observation
# Leave-one-out cross-validation: fit the SqFt-only model to all rows except k,
# then predict the held-out row k
for (k in 1:n) {
  train=(1:n)[-k]
  hp10=lm(Price~SqFt,data=hp[train,])
  pred=predict(hp10,newdata=hp[-train,])
  obs=hp$Price[-train]
  diff[k]=obs-pred
  percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)              # mean error
rmse=sqrt(mean(diff^2))    # root mean squared error
mape=100*mean(percdiff)    # mean absolute percentage error
me
## [1] 14813.36
rmse
## [1] 30589.65
mape
## [1] 16.57394
This loop carries out leave-one-out cross-validation for a model that predicts Price from SqFt alone. The resulting out-of-sample error measures are high: when the SqFt-only model is used to predict the price of a house that was held out of the fit, its predictions miss by a wide margin. SqFt by itself is therefore not enough to reliably predict house prices.
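For comparison, the same leave-one-out loop can be run with the richer set of predictors still present in hp (Brick was dropped from the data frame above, so it is omitted here); a sketch:
# Leave-one-out cross-validation for a multi-predictor model, for comparison
diff2=numeric(n)
percdiff2=numeric(n)
for (k in 1:n) {
  train=(1:n)[-k]
  m=lm(Price~SqFt+Bedrooms+Bathrooms+Offers+WestNeighborhood,data=hp[train,])
  pred=predict(m,newdata=hp[-train,])
  diff2[k]=hp$Price[k]-pred
  percdiff2[k]=abs(diff2[k])/hp$Price[k]
}
sqrt(mean(diff2^2))   # cross-validated RMSE
100*mean(percdiff2)   # cross-validated MAPE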
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/DirectMarketing.csv", "DirectMarketing.csv",method="curl")
DirectMk<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/DirectMarketing.csv")
attach(DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Age,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Gender,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$OwnHome,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Married,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Location,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Salary,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Children,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$History,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Catalogs,data=DirectMk)
Based on these plots, a number of variables need to be converted to dummy variables: History, Location, Married, OwnHome, Gender, and Age. The plots also suggest that Children, Salary, History, and Catalogs may be significant predictors. The numerical plots (Salary and Children) show a roughly linear relationship with the dependent variable (AmountSpent), and in the other promising plots (History and Catalogs) the categories of the independent variable differ noticeably in typical AmountSpent.
This code will make dummy variables for the following independent variables: Age, Gender, OwnHome, Married, History, and Location.
# 0/1 dummy variables for the categorical predictors; the omitted level of each
# factor serves as the baseline captured by the intercept
DirectMk$Old=ifelse(DirectMk$Age=="Old",1,0)
DirectMk$Middle=ifelse(DirectMk$Age=="Middle",1,0)
DirectMk$Male=ifelse(DirectMk$Gender=="Male",1,0)
DirectMk$Own=ifelse(DirectMk$OwnHome=="Own",1,0)
DirectMk$Single=ifelse(DirectMk$Married=="Single",1,0)
# History is missing for some customers, so these two dummies are NA for those rows
DirectMk$HighHistory=ifelse(DirectMk$History=="High",1,0)
DirectMk$LowHistory=ifelse(DirectMk$History=="Low",1,0)
DirectMk$Close=ifelse(DirectMk$Location=="Close",1,0)
# Drop the original History factor, then keep the numeric variables and the dummies
market=DirectMk[-8]
market=market[,6:17]
head(market)
## Salary Children Catalogs AmountSpent Old Middle Male Own Single
## 1 47500 0 6 755 1 0 0 1 1
## 2 63600 0 6 1318 0 1 1 0 1
## 3 13500 0 18 296 0 0 0 0 1
## 4 85600 1 18 2436 0 1 1 1 0
## 5 68400 0 12 1304 0 1 0 1 1
## 6 30400 0 6 495 0 0 1 1 0
## HighHistory LowHistory Close
## 1 1 0 0
## 2 1 0 1
## 3 0 1 1
## 4 1 0 1
## 5 1 0 1
## 6 0 1 1
d1=lm(market$AmountSpent~.,data=market)
summary(d1)
##
## Call:
## lm(formula = market$AmountSpent ~ ., data = market)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.623e+01 9.344e+01 0.923 0.356
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## Old -4.827e+01 6.189e+01 -0.780 0.436
## Middle -8.965e+01 5.874e+01 -1.526 0.127
## Male -5.370e+01 3.802e+01 -1.413 0.158
## Own 1.829e+01 4.151e+01 0.441 0.660
## Single 1.950e+01 4.981e+01 0.392 0.696
## HighHistory 3.446e+02 5.996e+01 5.746 1.38e-08 ***
## LowHistory 7.704e+01 5.889e+01 1.308 0.191
## Close -6.090e+02 4.399e+01 -13.845 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
This is not a good regression: several of the variables are not significant, including the Age, Gender, OwnHome, and Married dummies. In addition, 303 observations were dropped because of missing data (History is not recorded for some customers, so its dummies are NA there), and the overall F-statistic is diluted by carrying so many predictors that contribute nothing.
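The source of those dropped observations can be confirmed by counting missing values per column; a quick check:
# Count NA values in each column of the modeling data frame
colSums(is.na(market))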
d2=lm(market$AmountSpent~market$Salary+market$Children+market$Catalogs+market$HighHistory+market$Close,data=market)
summary(d2)
##
## Call:
## lm(formula = market$AmountSpent ~ market$Salary + market$Children +
## market$Catalogs + market$HighHistory + market$Close, data = market)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1496.89 -292.01 -20.42 207.65 2854.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.425e+02 6.290e+01 2.266 0.0238 *
## market$Salary 1.698e-02 8.446e-04 20.101 < 2e-16 ***
## market$Children -2.529e+02 1.977e+01 -12.791 < 2e-16 ***
## market$Catalogs 3.932e+01 2.833e+00 13.879 < 2e-16 ***
## market$HighHistory 3.600e+02 5.948e+01 6.052 2.35e-09 ***
## market$Close -5.952e+02 4.254e+01 -13.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 464.2 on 691 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7847
## F-statistic: 508.3 on 5 and 691 DF, p-value: < 2.2e-16
This is a better model. Although the adjusted R-squared decreased very slightly (0.7853 to 0.7847), the F-statistic increased sharply because essentially the same explanatory power is now achieved with five predictors instead of eleven. All of the remaining variables have very small p-values, so each one contributes significantly to predicting AmountSpent.
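As with the house-price models, a partial F-test can confirm that the six dropped dummies add no real explanatory power (both fits use the same complete cases, so the comparison is valid); a quick check:
# Nested-model comparison of the reduced (d2) and full (d1) marketing models
anova(d2,d1)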
library(lm.beta)
lm.beta(d2)
##
## Call:
## lm(formula = market$AmountSpent ~ market$Salary + market$Children +
## market$Catalogs + market$HighHistory + market$Close, data = market)
##
## Standardized Coefficients::
## (Intercept) market$Salary market$Children
## 0.0000000 0.5210862 -0.2642102
## market$Catalogs market$HighHistory market$Close
## 0.2597363 0.1734627 -0.2717141
Because the predictors are measured on very different scales (salary in dollars, catalogs in counts, dummies as 0/1), the raw coefficients are hard to compare directly, so we use the standardized coefficients to gauge relative importance. They give the following equation, in standard-deviation units:
y = 0.52(Salary) - 0.26(Children) + 0.26(Catalogs) + 0.17(HighHistory) - 0.27(Close)
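These values can be cross-checked by refitting the model on z-scored variables (complete cases only), which should closely reproduce the lm.beta output; a sketch:
# Refit on standardized (z-scored) variables to reproduce the standardized coefficients
cc=na.omit(market[,c("AmountSpent","Salary","Children","Catalogs","HighHistory","Close")])
coef(lm(AmountSpent~.,data=data.frame(scale(cc))))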
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/GenderDiscrimination.csv", "GenderDiscrimination.csv",method="curl")
GenDsc<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/GenderDiscrimination.csv")
attach(GenDsc)
plot(GenDsc$Salary~GenDsc$Experience,data=GenDsc)
plot(GenDsc$Salary~GenDsc$Gender,data=GenDsc)
These plots show Salary against Experience and against Gender. There may be a weak positive relationship between Salary and Experience, although a large cluster of points in one region of the plot makes it hard to judge. Gender does not look particularly important on its own, because the male and female box plots overlap substantially.
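To put rough numbers on that overlap, the group medians can be compared; a quick check:
# Median salary and experience by gender
tapply(GenDsc$Salary,GenDsc$Gender,median)
tapply(GenDsc$Experience,GenDsc$Gender,median)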
# 0/1 dummy variable for Gender (1 = Female, 0 = Male)
GenDsc$Female=ifelse(GenDsc$Gender=="Female",1,0)
# Keep Experience, Salary, and the Female dummy
gd=GenDsc[,2:4]
head(gd)
## Experience Salary Female
## 1 15 78200 1
## 2 12 66400 1
## 3 15 61200 1
## 4 3 61000 1
## 5 4 60000 1
## 6 4 68000 1
g1=lm(gd$Salary~.,data=gd)
summary(g1)
##
## Call:
## lm(formula = gd$Salary ~ ., data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70280.6 2801.7 25.085 < 2e-16 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## Female -17020.6 2499.6 -6.809 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
This is not a strong regression. Even though both independent variables have very small p-values, the adjusted R-squared is only about 0.44, so Experience and Gender together explain less than half of the variation in Salary; they are not enough to determine salary on their own. It is worth noting, though, that once Experience is held constant the Female coefficient is large, negative, and highly significant, even though the raw box plots overlapped.
Based on the raw coefficients from the regression, you get the equation: y = 70280.6 + 1744.6(Experience) - 17020.6(Female)
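To illustrate the fitted equation, the model can score hypothetical employees; the experience value below is made up:
# Predicted salary for a woman versus a man, each with 10 years of experience
predict(g1,newdata=data.frame(Experience=10,Female=c(1,0)))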