The goal of this project is to apply the material introduced in this class to a real-world dataset and to write a formal report presenting all of your work.
Your grade on this project will be based on the written report. Write up your analysis in the form of a short technical or research article, preferably in the format of a high-quality statistics or mathematics journal.
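library(dplyr)    # select(), filter(), %>%
library(ggplot2)  # ggplot(), map_data() (map_data() also needs the maps package installed)
dat <- read.csv("/Users/Patrick/Dropbox/data/mvar/kc_house_data.csv", header = TRUE)
# Keep the numeric columns used in the analysis
df.num <- dat %>% select(price, floors, bedrooms, bathrooms, sqft_living, sqft_living15,
    sqft_lot, sqft_lot15, sqft_above, sqft_basement, lat, long, zipcode, yr_built,
    yr_renovated, waterfront, view, condition, grade)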
## Partitioning numeric data to explore more thoroughly
# Home Attributes
df.home <- df.num %>% select(price, floors, bedrooms, bathrooms, yr_built, yr_renovated,
view)
# Square Footage Measures
df.sqft <- df.num %>% select(price, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
sqft_above, sqft_basement)
# Location info
df.loc <- df.num %>% select(price, lat, long, zipcode, waterfront, condition,
grade)
| price | floors | bedrooms | bathrooms | yr_built | yr_renovated | view |
|---|---|---|---|---|---|---|
| Min. : 75000 | Min. :1.000 | Min. : 0.000 | Min. :0.000 | Min. :1900 | Min. : 0.0 | Min. :0.0000 |
| 1st Qu.: 321950 | 1st Qu.:1.000 | 1st Qu.: 3.000 | 1st Qu.:1.750 | 1st Qu.:1951 | 1st Qu.: 0.0 | 1st Qu.:0.0000 |
| Median : 450000 | Median :1.500 | Median : 3.000 | Median :2.250 | Median :1975 | Median : 0.0 | Median :0.0000 |
| Mean : 540088 | Mean :1.494 | Mean : 3.371 | Mean :2.115 | Mean :1971 | Mean : 84.4 | Mean :0.2343 |
| 3rd Qu.: 645000 | 3rd Qu.:2.000 | 3rd Qu.: 4.000 | 3rd Qu.:2.500 | 3rd Qu.:1997 | 3rd Qu.: 0.0 | 3rd Qu.:0.0000 |
| Max. :7700000 | Max. :3.500 | Max. :33.000 | Max. :8.000 | Max. :2015 | Max. :2015.0 | Max. :4.0000 |
| price | sqft_living | sqft_living15 | sqft_lot | sqft_lot15 | sqft_above | sqft_basement |
|---|---|---|---|---|---|---|
| Min. : 75000 | Min. : 290 | Min. : 399 | Min. : 520 | Min. : 651 | Min. : 290 | Min. : 0.0 |
| 1st Qu.: 321950 | 1st Qu.: 1427 | 1st Qu.:1490 | 1st Qu.: 5040 | 1st Qu.: 5100 | 1st Qu.:1190 | 1st Qu.: 0.0 |
| Median : 450000 | Median : 1910 | Median :1840 | Median : 7618 | Median : 7620 | Median :1560 | Median : 0.0 |
| Mean : 540088 | Mean : 2080 | Mean :1987 | Mean : 15107 | Mean : 12768 | Mean :1788 | Mean : 291.5 |
| 3rd Qu.: 645000 | 3rd Qu.: 2550 | 3rd Qu.:2360 | 3rd Qu.: 10688 | 3rd Qu.: 10083 | 3rd Qu.:2210 | 3rd Qu.: 560.0 |
| Max. :7700000 | Max. :13540 | Max. :6210 | Max. :1651359 | Max. :871200 | Max. :9410 | Max. :4820.0 |
| price | lat | long | zipcode | waterfront | condition | grade |
|---|---|---|---|---|---|---|
| Min. : 75000 | Min. :47.16 | Min. :-122.5 | Min. :98001 | Min. :0.000000 | Min. :1.000 | Min. : 1.000 |
| 1st Qu.: 321950 | 1st Qu.:47.47 | 1st Qu.:-122.3 | 1st Qu.:98033 | 1st Qu.:0.000000 | 1st Qu.:3.000 | 1st Qu.: 7.000 |
| Median : 450000 | Median :47.57 | Median :-122.2 | Median :98065 | Median :0.000000 | Median :3.000 | Median : 7.000 |
| Mean : 540088 | Mean :47.56 | Mean :-122.2 | Mean :98078 | Mean :0.007542 | Mean :3.409 | Mean : 7.657 |
| 3rd Qu.: 645000 | 3rd Qu.:47.68 | 3rd Qu.:-122.1 | 3rd Qu.:98118 | 3rd Qu.:0.000000 | 3rd Qu.:4.000 | 3rd Qu.: 8.000 |
| Max. :7700000 | Max. :47.78 | Max. :-121.3 | Max. :98199 | Max. :1.000000 | Max. :5.000 | Max. :13.000 |
king_county <- map_data(map = "county", region = "washington") %>% filter(subregion ==
"king")
king_base <- ggplot(data = king_county, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) + geom_polygon(color = "black", fill = NA)
king_base + geom_point(data = dat, mapping = aes(x = long, y = lat, group = zipcode,
    color = price))
Call:
princomp(x = df.num)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
367145.9651446508 46451.9474231992 16856.7971618432 943.1351452520
Comp.5 Comp.6 Comp.7 Comp.8
513.4688851286 408.6524693460 383.6711990638 50.4180676714
Comp.9 Comp.10 Comp.11 Comp.12
23.8452181540 0.7576857097 0.6752474139 0.6251892634
Comp.13 Comp.14 Comp.15 Comp.16
0.5828901824 0.4600428983 0.3527710841 0.1220866594
Comp.17 Comp.18 Comp.19
0.1021258504 0.0765978564 0.0001628953
19 variables and 21613 observations.
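The standard deviations above span many orders of magnitude because `princomp(x = df.num)` works on the covariance matrix by default, so a variable measured in dollars such as `price` can dominate the first component. The sketch below illustrates the effect and the usual remedy (`cor = TRUE`) on a small synthetic data frame. The `toy` data and its column scales are assumptions made for illustration, not the King County data, and its numbers are not results from this report.

```r
# Synthetic stand-in for df.num (hypothetical data): three columns that share
# one underlying "size" signal but live on very different scales.
set.seed(1)
n <- 500
size <- rnorm(n)
toy <- data.frame(
  price       = 500000 + 300000 * size + rnorm(n, sd = 100000),
  sqft_living = 2000 + 800 * size + rnorm(n, sd = 300),
  bathrooms   = 2 + 0.5 * size + rnorm(n, sd = 0.4)
)

pca_cov <- princomp(toy)              # covariance matrix (the default)
pca_cor <- princomp(toy, cor = TRUE)  # correlation matrix (standardized variables)

# Share of total variance carried by the first component in each case
share_cov <- unname(pca_cov$sdev[1]^2 / sum(pca_cov$sdev^2))
share_cor <- unname(pca_cor$sdev[1]^2 / sum(pca_cor$sdev^2))
print(round(c(covariance = share_cov, correlation = share_cor), 3))
```

With `cor = TRUE` the component variances sum to the number of variables, so the proportions are comparable across variables regardless of their units.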
equation <- formula(~price + floors + bedrooms + bathrooms + sqft_living + sqft_lot +
sqft_basement + zipcode + yr_built + yr_renovated + waterfront + view +
condition + grade)
FA.norot <- factanal(equation, factors = 3, data = df.num, rotation = "none")
FA.vmax <- factanal(equation, factors = 3, data = df.num, rotation = "varimax")
# Note: FA.pmax is fit with rotation = "varimax", so it reproduces FA.vmax;
# an oblique fit would use rotation = "promax".
FA.pmax <- factanal(equation, factors = 3, data = df.num, rotation = "varimax")
Rotation Sums of Squared Loadings: the values in this table show how the explained variance is distributed across the factors after the varimax rotation.
Factor Matrix: this table contains the unrotated factor loadings, which are the correlations between each variable and each factor.
Rotated Factor Matrix: this table contains the rotated factor loadings (factor pattern matrix). For an orthogonal rotation such as varimax, they represent both how the variables are weighted on each factor and the correlation between each variable and the factor.
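How these tables relate to each other can be checked numerically. In a `factanal` fit, the "SS loadings" row is the column sum of squared loadings, each variable's communality is its row sum of squared loadings, and the uniqueness is one minus the communality. An orthogonal rotation redistributes variance across factors but leaves communalities and uniquenesses unchanged. The sketch below uses a synthetic six-variable data set; `toy` and the names `x1` to `x6` are hypothetical stand-ins, not the housing data.

```r
# Hypothetical data: two independent latent factors, three indicators each.
set.seed(42)
n <- 1000
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

fit_none <- factanal(toy, factors = 2, rotation = "none")
fit_vmax <- factanal(toy, factors = 2, rotation = "varimax")

# Plain numeric matrices of loadings (variables in rows, factors in columns)
L_none <- unclass(loadings(fit_none))
L_vmax <- unclass(loadings(fit_vmax))

communality_none <- rowSums(L_none^2)  # variance of each variable explained by the factors
communality_vmax <- rowSums(L_vmax^2)  # same quantity after rotation
ss_loadings_vmax <- colSums(L_vmax^2)  # the "SS loadings" row of the printed output

# Uniqueness is (up to optimizer tolerance) one minus the communality
print(round(cbind(uniqueness = fit_vmax$uniquenesses,
                  one_minus_communality = 1 - communality_vmax), 3))
```

The same arithmetic can be applied by hand to any row of the loadings tables in this report, for example by squaring and summing the three loadings of `price` and comparing the result with one minus its uniqueness.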
Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "none")
Uniquenesses:
price floors bedrooms bathrooms sqft_living
0.192 0.506 0.561 0.282 0.089
sqft_lot sqft_basement zipcode yr_built yr_renovated
0.976 0.514 0.873 0.315 0.942
waterfront view condition grade
0.871 0.762 0.823 0.245
Loadings:
Factor1 Factor2 Factor3
price 0.769 0.331 0.327
floors 0.439 -0.484 0.259
bedrooms 0.551 -0.368
bathrooms 0.811 -0.224
sqft_living 0.943 -0.131
sqft_lot 0.149
sqft_basement 0.380 0.460 -0.361
zipcode -0.204 0.266 0.121
yr_built 0.396 -0.727
yr_renovated 0.218
waterfront 0.142 0.199 0.264
view 0.317 0.308 0.205
condition 0.393 -0.118
grade 0.835 -0.141 0.194
Factor1 Factor2 Factor3
SS loadings 3.830 1.565 0.654
Proportion Var 0.274 0.112 0.047
Cumulative Var 0.274 0.385 0.432
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 12931.95 on 52 degrees of freedom.
The p-value is 0
Length Class Mode
converged 1 -none- logical
loadings 42 loadings numeric
uniquenesses 14 -none- numeric
correlation 196 -none- numeric
criteria 3 -none- numeric
factors 1 -none- numeric
dof 1 -none- numeric
method 1 -none- character
STATISTIC 1 -none- numeric
PVAL 1 -none- numeric
n.obs 1 -none- numeric
call 5 -none- call
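The 52 degrees of freedom reported in the test above can be reproduced from the model dimensions. For a maximum-likelihood factor model with p observed variables and k factors, the degrees of freedom are ((p - k)^2 - (p + k)) / 2. The helper name `factanal_dof` below is hypothetical and introduced only for this check.

```r
# Degrees of freedom of the likelihood-ratio test for a k-factor model
# fitted to p observed variables (hypothetical helper, not part of stats).
factanal_dof <- function(p, k) {
  ((p - k)^2 - (p + k)) / 2
}

# `equation` lists 14 variables and the fits use 3 factors
dof_report <- factanal_dof(p = 14, k = 3)
print(dof_report)  # 52
```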
Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")
Uniquenesses:
price floors bedrooms bathrooms sqft_living
0.192 0.506 0.561 0.282 0.089
sqft_lot sqft_basement zipcode yr_built yr_renovated
0.976 0.514 0.873 0.315 0.942
waterfront view condition grade
0.871 0.762 0.823 0.245
Loadings:
Factor1 Factor2 Factor3
price 0.482 0.130 0.748
floors 0.157 0.672 0.133
bedrooms 0.661
bathrooms 0.693 0.458 0.166
sqft_living 0.852 0.251 0.349
sqft_lot 0.145
sqft_basement 0.567 -0.382 0.137
zipcode -0.205 -0.257 0.137
yr_built 0.243 0.764 -0.206
yr_renovated -0.132 0.201
waterfront 0.359
view 0.179 0.449
condition -0.412
grade 0.557 0.506 0.435
Factor1 Factor2 Factor3
SS loadings 2.688 1.987 1.374
Proportion Var 0.192 0.142 0.098
Cumulative Var 0.192 0.334 0.432
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 12931.95 on 52 degrees of freedom.
The p-value is 0
Length Class Mode
converged 1 -none- logical
loadings 42 loadings numeric
uniquenesses 14 -none- numeric
correlation 196 -none- numeric
criteria 3 -none- numeric
factors 1 -none- numeric
dof 1 -none- numeric
method 1 -none- character
rotmat 9 -none- numeric
STATISTIC 1 -none- numeric
PVAL 1 -none- numeric
n.obs 1 -none- numeric
call 5 -none- call
Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")
Uniquenesses:
price floors bedrooms bathrooms sqft_living
0.192 0.506 0.561 0.282 0.089
sqft_lot sqft_basement zipcode yr_built yr_renovated
0.976 0.514 0.873 0.315 0.942
waterfront view condition grade
0.871 0.762 0.823 0.245
Loadings:
Factor1 Factor2 Factor3
price 0.482 0.130 0.748
floors 0.157 0.672 0.133
bedrooms 0.661
bathrooms 0.693 0.458 0.166
sqft_living 0.852 0.251 0.349
sqft_lot 0.145
sqft_basement 0.567 -0.382 0.137
zipcode -0.205 -0.257 0.137
yr_built 0.243 0.764 -0.206
yr_renovated -0.132 0.201
waterfront 0.359
view 0.179 0.449
condition -0.412
grade 0.557 0.506 0.435
Factor1 Factor2 Factor3
SS loadings 2.688 1.987 1.374
Proportion Var 0.192 0.142 0.098
Cumulative Var 0.192 0.334 0.432
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 12931.95 on 52 degrees of freedom.
The p-value is 0
Length Class Mode
converged 1 -none- logical
loadings 42 loadings numeric
uniquenesses 14 -none- numeric
correlation 196 -none- numeric
criteria 3 -none- numeric
factors 1 -none- numeric
dof 1 -none- numeric
method 1 -none- character
rotmat 9 -none- numeric
STATISTIC 1 -none- numeric
PVAL 1 -none- numeric
n.obs 1 -none- numeric
call 5 -none- call
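The `Call:` line of this last fit shows `rotation = "varimax"`, which is why its output matches the varimax fit exactly. As a minimal sketch of how an oblique promax fit differs, the example below fits both rotations to a synthetic data set with two correlated factors. The `toy` data is an assumption made for illustration, and nothing here is a result for the housing data.

```r
# Hypothetical data: two latent factors with correlation of about 0.5.
set.seed(7)
n <- 1000
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(0.75) * rnorm(n)
toy <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

fit_vmax <- factanal(toy, factors = 2, rotation = "varimax")  # orthogonal rotation
fit_pmax <- factanal(toy, factors = 2, rotation = "promax")   # oblique rotation

L_vmax <- unclass(loadings(fit_vmax))
L_pmax <- unclass(loadings(fit_pmax))

# The rotation changes the loadings but not the uniquenesses
print(round(cbind(varimax = L_vmax, promax = L_pmax), 2))
print(round(cbind(varimax = fit_vmax$uniquenesses,
                  promax  = fit_pmax$uniquenesses), 3))
```

Because promax allows the factors to correlate, its loadings are pattern coefficients and no longer equal the variable-factor correlations, so they should be read differently from the varimax loadings.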