Data Set #1
HP <- read.csv("/Users/hannahpeterson/Documents/R stuff/HousePrices.csv")
plot(Price~SqFt,data=HP)
plot(Price~Bedrooms,data=HP)
plot(Price~Bathrooms,data=HP)
plot(Price~Offers,data=HP)
plot(Price~Brick,data=HP)
plot(Price~Neighborhood,data=HP)
HP=HP[-1] # drop the first (ID) column, leaving the variables shown below
HP[1:3,]
## Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 114300 1790 2 2 2 No East
## 2 114200 2030 4 2 3 No East
## 3 114800 1740 3 2 1 No East
HP2=HP[,1:5] # numeric variables only, for the correlation matrix
head(HP2)
## Price SqFt Bedrooms Bathrooms Offers
## 1 114300 1790 2 2 2
## 2 114200 2030 4 2 3
## 3 114800 1740 3 2 1
## 4 94700 1980 3 2 3
## 5 119800 2130 3 3 3
## 6 114600 1780 3 2 2
cor(HP2)
## Price SqFt Bedrooms Bathrooms Offers
## Price 1.0000000 0.5529822 0.5259261 0.5232578 -0.3136359
## SqFt 0.5529822 1.0000000 0.4838071 0.5227453 0.3369234
## Bedrooms 0.5259261 0.4838071 1.0000000 0.4145560 0.1142706
## Bathrooms 0.5232578 0.5227453 0.4145560 1.0000000 0.1437934
## Offers -0.3136359 0.3369234 0.1142706 0.1437934 1.0000000
m1=lm(Price~.,data=HP)
summary(m1)
##
## Call:
## lm(formula = Price ~ ., data = HP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27337.3 -6549.5 -41.7 5803.4 27359.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 598.919 9552.197 0.063 0.95011
## SqFt 52.994 5.734 9.242 1.10e-15 ***
## Bedrooms 4246.794 1597.911 2.658 0.00894 **
## Bathrooms 7883.278 2117.035 3.724 0.00030 ***
## Offers -8267.488 1084.777 -7.621 6.47e-12 ***
## BrickYes 17297.350 1981.616 8.729 1.78e-14 ***
## NeighborhoodNorth 1560.579 2396.765 0.651 0.51621
## NeighborhoodWest 22241.616 2531.758 8.785 1.32e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.861
## F-statistic: 113.3 on 7 and 120 DF, p-value: < 2.2e-16
According to the adjusted R-squared, the variables in this model account for about 86% of the variation in home prices. All of the predictors are statistically significant except NeighborhoodNorth, which means that, holding the other variables constant, prices in the North neighborhood do not differ significantly from the baseline (East) neighborhood. The regression equation is Price = 598.92 + 52.99(SqFt) + 4246.79(Bedrooms) + 7883.28(Bathrooms) - 8267.49(Offers) + 17297.35(BrickYes) + 1560.58(NeighborhoodNorth) + 22241.62(NeighborhoodWest).
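As a quick check on the equation, the fitted model can be applied to a new observation with predict(). The house below is a made-up example, not a row from the data set; predict() simply multiplies each value by its estimated coefficient and adds the intercept.
# Price of a hypothetical 2000 sq ft, 3-bed, 2-bath brick home in the West
# neighborhood that has received 2 offers (values invented for illustration)
newhouse=data.frame(SqFt=2000,Bedrooms=3,Bathrooms=2,Offers=2,
                    Brick="Yes",Neighborhood="West")
predict(m1,newdata=newhouse)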
Data Set #2
DM <- read.csv("/Users/hannahpeterson/Documents/R stuff/DirectMarketing.csv")
plot(AmountSpent~Age,data=DM)
plot(AmountSpent~Gender,data=DM)
plot(AmountSpent~OwnHome,data=DM)
plot(AmountSpent~Married,data=DM)
plot(AmountSpent~Location,data=DM)
plot(AmountSpent~Salary,data=DM)
plot(AmountSpent~Children,data=DM)
plot(AmountSpent~History,data=DM)
plot(AmountSpent~Catalogs,data=DM)
m1=lm(AmountSpent~.,data=DM)
summary(m1)
##
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.496e+02 1.340e+02 -1.862 0.06302 .
## AgeOld 4.139e+01 5.276e+01 0.784 0.43311
## AgeYoung 8.965e+01 5.874e+01 1.526 0.12740
## GenderMale -5.370e+01 3.802e+01 -1.413 0.15823
## OwnHomeRent -1.829e+01 4.151e+01 -0.441 0.65967
## MarriedSingle 1.950e+01 4.981e+01 0.392 0.69553
## LocationFar 6.090e+02 4.399e+01 13.845 < 2e-16 ***
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## HistoryLow -2.675e+02 8.862e+01 -3.019 0.00263 **
## HistoryMedium -3.446e+02 5.996e+01 -5.746 1.38e-08 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
m2=lm(AmountSpent~Location+Salary+Children+History+Catalogs,data=DM)
summary(m2)
##
## Call:
## lm(formula = AmountSpent ~ Location + Salary + Children + History +
## Catalogs, data = DM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1582.30 -281.82 -19.75 216.38 2796.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.994e+02 1.075e+02 -1.854 0.06414 .
## LocationFar 6.154e+02 4.378e+01 14.057 < 2e-16 ***
## Salary 1.809e-02 1.027e-03 17.603 < 2e-16 ***
## Children -2.742e+02 2.274e+01 -12.059 < 2e-16 ***
## HistoryLow -2.405e+02 8.676e+01 -2.772 0.00572 **
## HistoryMedium -3.468e+02 5.978e+01 -5.801 1e-08 ***
## Catalogs 4.013e+01 2.860e+00 14.031 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.3 on 690 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7873, Adjusted R-squared: 0.7855
## F-statistic: 425.8 on 6 and 690 DF, p-value: < 2.2e-16
For this data set, I decided to focus only on the variables that were statistically significant. The first regression model showed that Location, Salary, Children, History, and Catalogs were the significant predictors, and together they account for about 79% of the variation in the amount spent. The regression equation is AmountSpent = -199.4 + 615.4(LocationFar) + 0.0181(Salary) - 274.2(Children) - 240.5(HistoryLow) - 346.8(HistoryMedium) + 40.13(Catalogs).
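Because the reduced model is nested in the full model, and both report the same 303 rows deleted for missing values, a partial F-test can confirm that dropping Age, Gender, OwnHome, and Married does not hurt the fit. This comparison is a sketch added for illustration, not part of the original output.
# Partial F-test comparing the reduced model (m2) to the full model (m1);
# a large p-value supports keeping the simpler model
anova(m2,m1)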
Data Set #3
GD <- read.csv("/Users/hannahpeterson/Documents/R stuff/GenderDiscrimination.csv")
plot(Salary~Gender,data=GD)
plot(Salary~Experience,data=GD)
m1=lm(Salary~.,data=GD)
summary(m1)
##
## Call:
## lm(formula = Salary ~ ., data = GD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53260.0 2416.6 22.039 < 2e-16 ***
## GenderMale 17020.6 2499.6 6.809 1.06e-10 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
In this sample, gender and experience both significantly affect a person's salary. These two variables alone account for about 44% of the variation in salary. The regression equation is Salary = 53260 + 17020.6(GenderMale) + 1744.6(Experience). Holding experience constant, males in this sample earn about $17,021 more on average, while each additional year of experience adds about $1,745, so roughly ten years of experience would be needed to offset the estimated gender gap.
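To illustrate the GenderMale coefficient, the model can predict salaries for a hypothetical female and male employee with the same experience (the 10 years of experience below is a made-up value); the two predictions differ by the estimated gap of about $17,021.
# Predicted salary for a female vs. a male employee, each with 10 years of
# experience (hypothetical values for illustration)
predict(m1,newdata=data.frame(Gender=c("Female","Male"),Experience=10))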
set.seed(1)
n=length(GD$Salary) # n is the total number of observations
n1=100
n1
## [1] 100
n2=n-n1 # size of the holdout sample
n2
## [1] 108
train=sample(1:n,n1) # randomly chosen indices of the training cases
m2=lm(Salary~.,data=GD[train,])
summary(m2)
##
## Call:
## lm(formula = Salary ~ ., data = GD[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -34665 -10717 34 8730 49337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51642.4 3285.4 15.719 < 2e-16 ***
## GenderMale 20123.7 3551.1 5.667 1.49e-07 ***
## Experience 2024.9 242.1 8.365 4.48e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15930 on 97 degrees of freedom
## Multiple R-squared: 0.5239, Adjusted R-squared: 0.5141
## F-statistic: 53.36 on 2 and 97 DF, p-value: 2.343e-16
pred=predict(m2,newdata=GD[-train,]) # predictions for the holdout cases
obs=GD$Salary[-train]
diff=obs-pred
percdiff=abs(diff)/obs
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me # mean error
## [1] -5342.269
rmse # root mean square error
## [1] 18482.52
mape # mean absolute percent error
## [1] 18.79173
For the holdout evaluation, I randomly selected 100 observations to train the model and used the remaining 108 as the holdout sample. In the training sample, gender and experience alone account for about 51% of the variation in salary. The mean absolute percent error on the holdout sample was about 19%, so the model appears to give reasonably reliable predictions for cases it was not fit on.
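One way to check that this holdout error is not specific to the single random split above is to repeat the split a few times and look at the spread of the MAPE values. The code below is a sketch added for illustration, reusing the n, n1, and GD objects defined earlier.
# Repeat the random 100/108 split ten times and recompute the holdout MAPE
set.seed(2)
mapes=replicate(10,{
  tr=sample(1:n,n1)                # new training indices
  fit=lm(Salary~.,data=GD[tr,])    # refit on the training cases
  p=predict(fit,newdata=GD[-tr,])  # predict the holdout cases
  o=GD$Salary[-tr]
  100*mean(abs(o-p)/o)             # holdout MAPE for this split
})
summary(mapes)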