The goal of this project is to apply the material introduced in this class to a real-world dataset and to produce a formal report that presents all of your work.
Your grade on this project will be based on the written report. Write up your analysis in the form of a short technical or research article, preferably in the format of a high-quality statistics or mathematics journal.
library(dplyr)  # provides %>%, select() and filter()
dat <- read.csv("C:/Users/poster/Desktop/kc_house_data.csv", header = TRUE)
df.num <- dat %>% select(price, floors, bedrooms, bathrooms, sqft_living, sqft_living15,
sqft_lot, sqft_lot15, sqft_above, sqft_basement, lat, long, zipcode, yr_built,
yr_renovated, waterfront, view, condition, grade)
## Partitioning numeric data to explore more thoroughly
# Home Attributes
df.home <- df.num %>% select(price, floors, bedrooms, bathrooms, yr_built, yr_renovated,
view)
# Square Footage Measures
df.sqft <- df.num %>% select(price, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
sqft_above, sqft_basement)
# Location info
df.loc <- df.num %>% select(price, lat, long, zipcode, waterfront, condition,
grade)

| Statistic | price | floors | bedrooms | bathrooms | yr_built | yr_renovated | view |
|---|---|---|---|---|---|---|---|
| Min. | 75000 | 1.000 | 0.000 | 0.000 | 1900 | 0.0 | 0.0000 |
| 1st Qu. | 321950 | 1.000 | 3.000 | 1.750 | 1951 | 0.0 | 0.0000 |
| Median | 450000 | 1.500 | 3.000 | 2.250 | 1975 | 0.0 | 0.0000 |
| Mean | 540088 | 1.494 | 3.371 | 2.115 | 1971 | 84.4 | 0.2343 |
| 3rd Qu. | 645000 | 2.000 | 4.000 | 2.500 | 1997 | 0.0 | 0.0000 |
| Max. | 7700000 | 3.500 | 33.000 | 8.000 | 2015 | 2015.0 | 4.0000 |

| Statistic | price | sqft_living | sqft_living15 | sqft_lot | sqft_lot15 | sqft_above | sqft_basement |
|---|---|---|---|---|---|---|---|
| Min. | 75000 | 290 | 399 | 520 | 651 | 290 | 0.0 |
| 1st Qu. | 321950 | 1427 | 1490 | 5040 | 5100 | 1190 | 0.0 |
| Median | 450000 | 1910 | 1840 | 7618 | 7620 | 1560 | 0.0 |
| Mean | 540088 | 2080 | 1987 | 15107 | 12768 | 1788 | 291.5 |
| 3rd Qu. | 645000 | 2550 | 2360 | 10688 | 10083 | 2210 | 560.0 |
| Max. | 7700000 | 13540 | 6210 | 1651359 | 871200 | 9410 | 4820.0 |

| Statistic | price | lat | long | zipcode | waterfront | condition | grade |
|---|---|---|---|---|---|---|---|
| Min. | 75000 | 47.16 | -122.5 | 98001 | 0.000000 | 1.000 | 1.000 |
| 1st Qu. | 321950 | 47.47 | -122.3 | 98033 | 0.000000 | 3.000 | 7.000 |
| Median | 450000 | 47.57 | -122.2 | 98065 | 0.000000 | 3.000 | 7.000 |
| Mean | 540088 | 47.56 | -122.2 | 98078 | 0.007542 | 3.409 | 7.657 |
| 3rd Qu. | 645000 | 47.68 | -122.1 | 98118 | 0.000000 | 4.000 | 8.000 |
| Max. | 7700000 | 47.78 | -121.3 | 98199 | 1.000000 | 5.000 | 13.000 |

library(ggplot2)  # map_data() also requires the maps package to be installed
king_county <- map_data(map = "county", region = "washington") %>% filter(subregion ==
"king")
king_base <- ggplot(data = king_county, mapping = aes(x = long, y = lat, group = group)) +
coord_fixed(1.3) + geom_polygon(color = "black", fill = NA)
king_base + geom_point(data = dat, mapping = aes(x = long, y = lat, group = zipcode,
color = price))

equation <- formula(~price + floors + bedrooms + bathrooms + sqft_living + sqft_lot +
sqft_basement + zipcode + yr_built + yr_renovated + waterfront + view +
condition + grade)
FA.norot <- factanal(equation, factors = 3, data = df.num, rotation = "none")
FA.vmax <- factanal(equation, factors = 3, data = df.num, rotation = "varimax")
# Note: despite its name, FA.pmax is also fitted with rotation = "varimax";
# an oblique promax rotation would require rotation = "promax".
FA.pmax <- factanal(equation, factors = 3, data = df.num, rotation = "varimax")

Rotation Sums of Squared Loadings: the values in this table show how the explained variance is distributed across the factors after the varimax rotation.

Factor Matrix: this table contains the unrotated factor loadings, which are the correlations between each variable and each factor.

Rotated Factor Matrix: this table contains the rotated factor loadings (the factor pattern matrix), which represent both how the variables are weighted on each factor and the correlation between each variable and the factor.
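As a rough check on how these quantities relate, the sketch below recomputes communalities, uniquenesses, and the sums of squared loadings from the unrotated loadings printed in the `factanal` output. The matrix is typed in by hand, and entries that the printout leaves blank are set to zero; that is an assumption, so the sums come out slightly below the printed values. The call to base R's `varimax()` is only meant to illustrate that an orthogonal rotation preserves communalities, and it is not expected to reproduce the rotated table digit for digit.

```r
# Unrotated loadings as printed in the factanal output. Blank (suppressed)
# entries are set to 0 here: an assumption made only for this illustration.
vars <- c("price", "floors", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",
          "sqft_basement", "zipcode", "yr_built", "yr_renovated", "waterfront",
          "view", "condition", "grade")
L <- matrix(c(
   0.769,  0.331,  0.327,
   0.439, -0.484,  0.259,
   0.551,  0.000, -0.368,
   0.811, -0.224,  0.000,
   0.943,  0.000, -0.131,
   0.149,  0.000,  0.000,
   0.380,  0.460, -0.361,
  -0.204,  0.266,  0.121,
   0.396, -0.727,  0.000,
   0.000,  0.218,  0.000,
   0.142,  0.199,  0.264,
   0.317,  0.308,  0.205,
   0.000,  0.393, -0.118,
   0.835, -0.141,  0.194
), ncol = 3, byrow = TRUE,
dimnames = list(vars, c("Factor1", "Factor2", "Factor3")))

communality <- rowSums(L^2)     # variance of each variable explained by the factors
uniqueness  <- 1 - communality  # variance left unexplained
ss_loadings <- colSums(L^2)     # the "SS loadings" row of the output

# An orthogonal rotation redistributes variance across the factors but leaves
# each variable's communality, and therefore the total, unchanged.
rot   <- varimax(L)
L_rot <- unclass(rot$loadings)

print(round(uniqueness, 3))
print(round(ss_loadings, 3))
print(round(colSums(L_rot^2), 3))
```

Comparing `round(uniqueness, 3)` with the Uniquenesses printed by `factanal` is a quick way to confirm that the loadings were read into the right columns.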
Call: factanal(x = equation, factors = 3, data = df.num, rotation = "none")

Uniquenesses:

| Variable | Uniqueness |
|---|---|
| price | 0.192 |
| floors | 0.506 |
| bedrooms | 0.561 |
| bathrooms | 0.282 |
| sqft_living | 0.089 |
| sqft_lot | 0.976 |
| sqft_basement | 0.514 |
| zipcode | 0.873 |
| yr_built | 0.315 |
| yr_renovated | 0.942 |
| waterfront | 0.871 |
| view | 0.762 |
| condition | 0.823 |
| grade | 0.245 |

Loadings (blank cells are loadings with absolute value below 0.1, which the print method omits):

| Variable | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| price | 0.769 | 0.331 | 0.327 |
| floors | 0.439 | -0.484 | 0.259 |
| bedrooms | 0.551 | | -0.368 |
| bathrooms | 0.811 | -0.224 | |
| sqft_living | 0.943 | | -0.131 |
| sqft_lot | 0.149 | | |
| sqft_basement | 0.380 | 0.460 | -0.361 |
| zipcode | -0.204 | 0.266 | 0.121 |
| yr_built | 0.396 | -0.727 | |
| yr_renovated | | 0.218 | |
| waterfront | 0.142 | 0.199 | 0.264 |
| view | 0.317 | 0.308 | 0.205 |
| condition | | 0.393 | -0.118 |
| grade | 0.835 | -0.141 | 0.194 |

| | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| SS loadings | 3.830 | 1.565 | 0.654 |
| Proportion Var | 0.274 | 0.112 | 0.047 |
| Cumulative Var | 0.274 | 0.385 | 0.432 |

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0

| Component | Length | Class | Mode |
|---|---|---|---|
| converged | 1 | -none- | logical |
| loadings | 42 | loadings | numeric |
| uniquenesses | 14 | -none- | numeric |
| correlation | 196 | -none- | numeric |
| criteria | 3 | -none- | numeric |
| factors | 1 | -none- | numeric |
| dof | 1 | -none- | numeric |
| method | 1 | -none- | character |
| STATISTIC | 1 | -none- | numeric |
| PVAL | 1 | -none- | numeric |
| n.obs | 1 | -none- | numeric |
| call | 5 | -none- | call |
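The goodness-of-fit line in the output above can be checked by hand. The sketch below assumes the usual degrees-of-freedom formula for a maximum-likelihood factor model with `p` observed variables and `k` factors, and takes the chi-square statistic directly from the printed output; the variable names are illustrative.

```r
p <- 14  # observed variables in the model formula
k <- 3   # number of factors

# Degrees of freedom of the likelihood-ratio test for k factors.
dof <- ((p - k)^2 - (p + k)) / 2

# Upper-tail probability of the printed statistic under a chi-square
# distribution with dof degrees of freedom.
chisq   <- 12931.95
p_value <- pchisq(chisq, df = dof, lower.tail = FALSE)

print(dof)
print(p_value)
```

The resulting p-value is far too small to display, which is consistent with the printed value of 0: the hypothesis that three factors are sufficient is rejected for this model.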
Call: factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

Uniquenesses:

| Variable | Uniqueness |
|---|---|
| price | 0.192 |
| floors | 0.506 |
| bedrooms | 0.561 |
| bathrooms | 0.282 |
| sqft_living | 0.089 |
| sqft_lot | 0.976 |
| sqft_basement | 0.514 |
| zipcode | 0.873 |
| yr_built | 0.315 |
| yr_renovated | 0.942 |
| waterfront | 0.871 |
| view | 0.762 |
| condition | 0.823 |
| grade | 0.245 |

Loadings (blank cells are loadings with absolute value below 0.1, which the print method omits):

| Variable | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| price | 0.482 | 0.130 | 0.748 |
| floors | 0.157 | 0.672 | 0.133 |
| bedrooms | 0.661 | | |
| bathrooms | 0.693 | 0.458 | 0.166 |
| sqft_living | 0.852 | 0.251 | 0.349 |
| sqft_lot | 0.145 | | |
| sqft_basement | 0.567 | -0.382 | 0.137 |
| zipcode | -0.205 | -0.257 | 0.137 |
| yr_built | 0.243 | 0.764 | -0.206 |
| yr_renovated | | -0.132 | 0.201 |
| waterfront | | | 0.359 |
| view | 0.179 | | 0.449 |
| condition | | -0.412 | |
| grade | 0.557 | 0.506 | 0.435 |

| | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| SS loadings | 2.688 | 1.987 | 1.374 |
| Proportion Var | 0.192 | 0.142 | 0.098 |
| Cumulative Var | 0.192 | 0.334 | 0.432 |

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0

| Component | Length | Class | Mode |
|---|---|---|---|
| converged | 1 | -none- | logical |
| loadings | 42 | loadings | numeric |
| uniquenesses | 14 | -none- | numeric |
| correlation | 196 | -none- | numeric |
| criteria | 3 | -none- | numeric |
| factors | 1 | -none- | numeric |
| dof | 1 | -none- | numeric |
| method | 1 | -none- | character |
| rotmat | 9 | -none- | numeric |
| STATISTIC | 1 | -none- | numeric |
| PVAL | 1 | -none- | numeric |
| n.obs | 1 | -none- | numeric |
| call | 5 | -none- | call |
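The Proportion Var and Cumulative Var rows follow directly from the SS loadings: each proportion is the sum of squared loadings divided by the number of variables (14 here, since the factors are fitted to standardized variables). The short sketch below redoes that arithmetic for both the unrotated and the varimax output, using the printed SS loadings as inputs; the object names are illustrative.

```r
p <- 14  # number of observed variables

ss_none    <- c(Factor1 = 3.830, Factor2 = 1.565, Factor3 = 0.654)
ss_varimax <- c(Factor1 = 2.688, Factor2 = 1.987, Factor3 = 1.374)

# Proportion of total variance attributed to each factor, and its running sum.
prop_none    <- ss_none / p
prop_varimax <- ss_varimax / p
cum_none     <- cumsum(prop_none)
cum_varimax  <- cumsum(prop_varimax)

print(round(rbind(prop_none, cum_none), 3))
print(round(rbind(prop_varimax, cum_varimax), 3))
```

The rotation spreads the explained variance more evenly across the three factors, but the cumulative total stays at about 43% in both cases.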
Call: factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

Uniquenesses:

| Variable | Uniqueness |
|---|---|
| price | 0.192 |
| floors | 0.506 |
| bedrooms | 0.561 |
| bathrooms | 0.282 |
| sqft_living | 0.089 |
| sqft_lot | 0.976 |
| sqft_basement | 0.514 |
| zipcode | 0.873 |
| yr_built | 0.315 |
| yr_renovated | 0.942 |
| waterfront | 0.871 |
| view | 0.762 |
| condition | 0.823 |
| grade | 0.245 |

Loadings (blank cells are loadings with absolute value below 0.1, which the print method omits):

| Variable | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| price | 0.482 | 0.130 | 0.748 |
| floors | 0.157 | 0.672 | 0.133 |
| bedrooms | 0.661 | | |
| bathrooms | 0.693 | 0.458 | 0.166 |
| sqft_living | 0.852 | 0.251 | 0.349 |
| sqft_lot | 0.145 | | |
| sqft_basement | 0.567 | -0.382 | 0.137 |
| zipcode | -0.205 | -0.257 | 0.137 |
| yr_built | 0.243 | 0.764 | -0.206 |
| yr_renovated | | -0.132 | 0.201 |
| waterfront | | | 0.359 |
| view | 0.179 | | 0.449 |
| condition | | -0.412 | |
| grade | 0.557 | 0.506 | 0.435 |

| | Factor1 | Factor2 | Factor3 |
|---|---|---|---|
| SS loadings | 2.688 | 1.987 | 1.374 |
| Proportion Var | 0.192 | 0.142 | 0.098 |
| Cumulative Var | 0.192 | 0.334 | 0.432 |

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0

| Component | Length | Class | Mode |
|---|---|---|---|
| converged | 1 | -none- | logical |
| loadings | 42 | loadings | numeric |
| uniquenesses | 14 | -none- | numeric |
| correlation | 196 | -none- | numeric |
| criteria | 3 | -none- | numeric |
| factors | 1 | -none- | numeric |
| dof | 1 | -none- | numeric |
| method | 1 | -none- | character |
| rotmat | 9 | -none- | numeric |
| STATISTIC | 1 | -none- | numeric |
| PVAL | 1 | -none- | numeric |
| n.obs | 1 | -none- | numeric |
| call | 5 | -none- | call |