Project Goal

The goal of this project is to apply the material introduced in this class to a real-world dataset and to write a formal report presenting all your work.

Instructions

  1. Write a summary and description of the dataset you use. You may use summary statistics, descriptive statistics, figures, etc.
  2. Clearly state the research questions and your analysis plan of the project.
  3. State at least two multivariate analysis methods you plan to use in the analysis. You may also use multivariate analysis methods that are not covered in the book.
  4. State whether your data satisfy the assumptions, how you fit the models, and how you use the output from these models to answer your research questions (include numerical summaries, graphics, and interpretations).
  5. Compare the models or discuss them further, and provide suggestions for others who want to use similar methods in their analysis.
  6. Address any problems you encounter while analyzing the dataset.
  7. Discuss further questions raised by your study (that might be investigated in future research or analysis).

Report

Your grade on this project will be based on a written report. You should write up your analysis and report in the form of a short technical/research article, preferably in the format of a high-quality statistics or mathematics journal.

  • The main part of your report should be 4-7 pages. Your writeup should briefly address the points listed in the previous section.
  • An abstract highlighting the results from your analysis should be included before the body of the work.
  • Graphs, figures, and tables can help readers understand your findings. An appendix should include your code and/or computer output. (Extra figures, tables, and the appendix do not count toward the 4-7 page limit.)
  • Interesting computer results can be included in the body of the report.
  • A bibliography is necessary for any references you make.

Grading

  • 20% Description of the data, summary of the possible research questions
  • 40% Appropriate and correct analysis procedures
  • 20% Complexity of your decisions, suggestions, extensions
  • 10% Well-written and attractive report meeting the guidelines set above
  • 10% Spelling, grammar and punctuation

Dataset: House Price Data

Variable Descriptions

  • id: identifier for a house
  • date: date the house was sold
  • price: prediction target
  • bedrooms: number of bedrooms per house
  • bathrooms: number of bathrooms per house
  • sqft_living: square footage of the home
  • sqft_lot: square footage of the lot
  • floors: total floors (levels) in house
  • waterfront: whether the house has a view of a waterfront
  • view: has been viewed
  • condition: how good the condition is
  • grade: overall grade given to the housing unit based on the King County grading system
  • sqft_above: square footage of house apart from basement
  • sqft_basement: square footage of the basement
  • yr_built: year the house was built
  • yr_renovated: year when the house was renovated
  • zipcode: ZIP code
  • lat: latitude coordinate
  • long: longitude coordinate
  • sqft_living15: living room area in 2015 (implies some renovations), which may have affected the lot size area
  • sqft_lot15: lot size area in 2015 (implies some renovations)

Research Questions

Methodology of Analysis

Initial Exploratory Data Analysis

Summary Statistics

            price  floors  bedrooms  bathrooms  yr_built  yr_renovated    view
Min.        75000   1.000     0.000      0.000      1900           0.0  0.0000
1st Qu.    321950   1.000     3.000      1.750      1951           0.0  0.0000
Median     450000   1.500     3.000      2.250      1975           0.0  0.0000
Mean       540088   1.494     3.371      2.115      1971          84.4  0.2343
3rd Qu.    645000   2.000     4.000      2.500      1997           0.0  0.0000
Max.      7700000   3.500    33.000      8.000      2015        2015.0  4.0000

            price  sqft_living  sqft_living15  sqft_lot  sqft_lot15  sqft_above  sqft_basement
Min.        75000          290            399       520         651         290            0.0
1st Qu.    321950         1427           1490      5040        5100        1190            0.0
Median     450000         1910           1840      7618        7620        1560            0.0
Mean       540088         2080           1987     15107       12768        1788          291.5
3rd Qu.    645000         2550           2360     10688       10083        2210          560.0
Max.      7700000        13540           6210   1651359      871200        9410         4820.0

            price    lat    long  zipcode  waterfront  condition   grade
Min.        75000  47.16  -122.5    98001    0.000000      1.000   1.000
1st Qu.    321950  47.47  -122.3    98033    0.000000      3.000   7.000
Median     450000  47.57  -122.2    98065    0.000000      3.000   7.000
Mean       540088  47.56  -122.2    98078    0.007542      3.409   7.657
3rd Qu.    645000  47.68  -122.1    98118    0.000000      4.000   8.000
Max.      7700000  47.78  -121.3    98199    1.000000      5.000  13.000
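
Tables of this form come from R's summary() applied to a data frame. The sketch below shows the call on a small toy data frame; the rows are hypothetical stand-ins built from values that appear in the tables, not the real data. It also illustrates one possible reading of yr_renovated: its quartiles are all 0 while its mean is 84.4, which suggests that 0 marks houses with no recorded renovation. That reading is an assumption, not something the variable description states.

```r
# Toy stand-in for the house price data: these five rows are illustrative only.
houses <- data.frame(
  price        = c(75000, 321950, 450000, 645000, 7700000),
  bedrooms     = c(0, 3, 3, 4, 33),
  yr_built     = c(1900, 1951, 1975, 1997, 2015),
  yr_renovated = c(0, 0, 0, 0, 2015)
)

# Min, quartiles, median, mean and max for every column.
print(summary(houses))

# Assumption: yr_renovated == 0 means "no recorded renovation", not the year 0.
renovated        <- houses$yr_renovated > 0
share_renovated  <- mean(renovated)                          # fraction of renovated houses
median_reno_year <- median(houses$yr_renovated[renovated])   # computed on renovated houses only
print(share_renovated)
print(median_reno_year)
```

Summarizing the renovation year only over renovated houses keeps the zeros from pulling the mean toward a value that is not a year at all.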

Principal Component Analysis

Assumptions for PCA

Performing PCA

Factor Analysis

Assumptions for Factor Analysis

Performing Factor Analysis

Rotation Sums of Squared Loadings: the values in this table show how the explained variance is distributed across the factors after the varimax rotation.
Factor Matrix: this table contains the unrotated factor loadings, which are the correlations between each variable and each factor.
Rotated Factor Matrix: this table contains the rotated factor loadings (factor pattern matrix), which represent both how the variables are weighted for each factor and the correlation between the variables and the factor.

No Rotation

Call: factanal(x = equation, factors = 3, data = df.num, rotation = "none")

Uniquenesses:
        price        floors      bedrooms     bathrooms   sqft_living
        0.192         0.506         0.561         0.282         0.089
     sqft_lot sqft_basement       zipcode      yr_built  yr_renovated
        0.976         0.514         0.873         0.315         0.942
   waterfront          view     condition         grade
        0.871         0.762         0.823         0.245

Loadings (Factor1, Factor2, Factor3; blank entries in the printed output are not recoverable, so rows with fewer than three values are listed in printed order):
price           0.769   0.331   0.327
floors          0.439  -0.484   0.259
bedrooms        0.551  -0.368
bathrooms       0.811  -0.224
sqft_living     0.943  -0.131
sqft_lot        0.149
sqft_basement   0.380   0.460  -0.361
zipcode        -0.204   0.266   0.121
yr_built        0.396  -0.727
yr_renovated    0.218
waterfront      0.142   0.199   0.264
view            0.317   0.308   0.205
condition       0.393  -0.118
grade           0.835  -0.141   0.194

               Factor1 Factor2 Factor3
SS loadings      3.830   1.565   0.654
Proportion Var   0.274   0.112   0.047
Cumulative Var   0.274   0.385   0.432
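
The quantities in the No Rotation output are linked by simple arithmetic: a variable's communality is the sum of its squared loadings, its uniqueness is one minus the communality, and each Proportion Var entry is the factor's SS loadings divided by the number of variables (14 here). The sketch below checks these relations. It uses only the seven variables whose three loadings are all printed, and the object names are illustrative.

```r
# Rows of the unrotated loadings for which all three values are printed.
L <- matrix(
  c( 0.769,  0.331,  0.327,   # price
     0.439, -0.484,  0.259,   # floors
     0.380,  0.460, -0.361,   # sqft_basement
    -0.204,  0.266,  0.121,   # zipcode
     0.142,  0.199,  0.264,   # waterfront
     0.317,  0.308,  0.205,   # view
     0.835, -0.141,  0.194),  # grade
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("price", "floors", "sqft_basement", "zipcode",
      "waterfront", "view", "grade"),
    c("Factor1", "Factor2", "Factor3")
  )
)

# Uniquenesses reported in the output for the same variables.
reported <- c(price = 0.192, floors = 0.506, sqft_basement = 0.514,
              zipcode = 0.873, waterfront = 0.871, view = 0.762,
              grade = 0.245)

communality <- rowSums(L^2)      # variance explained by the three factors
uniqueness  <- 1 - communality   # variance not explained by the factors
print(round(cbind(communality, uniqueness, reported), 3))

# Proportion of variance per factor from the reported SS loadings.
ss_loadings <- c(Factor1 = 3.830, Factor2 = 1.565, Factor3 = 0.654)
prop_var    <- ss_loadings / 14
print(round(prop_var, 3))
print(round(cumsum(prop_var), 3))
```

The recomputed uniquenesses agree with the reported ones up to the rounding of the printed loadings. Price and grade have low uniqueness, so the three factors explain most of their variance, while zipcode and waterfront are largely unexplained.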

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0
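
The 52 degrees of freedom follow from the number of variables p and the number of factors k through ((p - k)^2 - (p + k)) / 2. The reported p-value of 0 is the upper tail of a chi-square distribution evaluated at a very large statistic. A minimal sketch of both calculations, with p = 14 taken from the number of uniquenesses listed in the output:

```r
p <- 14   # number of variables in the fit (14 uniquenesses are reported)
k <- 3    # number of factors

# Degrees of freedom of the likelihood-ratio test that k factors are sufficient.
dof <- ((p - k)^2 - (p + k)) / 2
print(dof)

# Upper-tail chi-square probability of the reported test statistic.
p_value <- pchisq(12931.95, df = dof, lower.tail = FALSE)
print(p_value)
```

A p-value this small rejects the hypothesis that three factors are sufficient. This agrees with the cumulative variance of 0.432, which leaves more than half of the total variance unexplained by the three factors.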

             Length Class    Mode
converged      1    -none-   logical
loadings      42    loadings numeric
uniquenesses  14    -none-   numeric
correlation  196    -none-   numeric
criteria       3    -none-   numeric
factors        1    -none-   numeric
dof            1    -none-   numeric
method         1    -none-   character
STATISTIC      1    -none-   numeric
PVAL           1    -none-   numeric
n.obs          1    -none-   numeric
call           5    -none-   call

Varimax Rotation

Call: factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

Uniquenesses:
        price        floors      bedrooms     bathrooms   sqft_living
        0.192         0.506         0.561         0.282         0.089
     sqft_lot sqft_basement       zipcode      yr_built  yr_renovated
        0.976         0.514         0.873         0.315         0.942
   waterfront          view     condition         grade
        0.871         0.762         0.823         0.245

Loadings (Factor1, Factor2, Factor3; blank entries in the printed output are not recoverable, so rows with fewer than three values are listed in printed order):
price           0.482   0.130   0.748
floors          0.157   0.672   0.133
bedrooms        0.661
bathrooms       0.693   0.458   0.166
sqft_living     0.852   0.251   0.349
sqft_lot        0.145
sqft_basement   0.567  -0.382   0.137
zipcode        -0.205  -0.257   0.137
yr_built        0.243   0.764  -0.206
yr_renovated   -0.132   0.201
waterfront      0.359
view            0.179   0.449
condition      -0.412
grade           0.557   0.506   0.435

               Factor1 Factor2 Factor3
SS loadings      2.688   1.987   1.374
Proportion Var   0.192   0.142   0.098
Cumulative Var   0.192   0.334   0.432

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0

             Length Class    Mode
converged      1    -none-   logical
loadings      42    loadings numeric
uniquenesses  14    -none-   numeric
correlation  196    -none-   numeric
criteria       3    -none-   numeric
factors        1    -none-   numeric
dof            1    -none-   numeric
method         1    -none-   character
rotmat         9    -none-   numeric
STATISTIC      1    -none-   numeric
PVAL           1    -none-   numeric
n.obs          1    -none-   numeric
call           5    -none-   call

Promax Rotation

Call: factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

Uniquenesses:
        price        floors      bedrooms     bathrooms   sqft_living
        0.192         0.506         0.561         0.282         0.089
     sqft_lot sqft_basement       zipcode      yr_built  yr_renovated
        0.976         0.514         0.873         0.315         0.942
   waterfront          view     condition         grade
        0.871         0.762         0.823         0.245

Loadings (Factor1, Factor2, Factor3; blank entries in the printed output are not recoverable, so rows with fewer than three values are listed in printed order):
price           0.482   0.130   0.748
floors          0.157   0.672   0.133
bedrooms        0.661
bathrooms       0.693   0.458   0.166
sqft_living     0.852   0.251   0.349
sqft_lot        0.145
sqft_basement   0.567  -0.382   0.137
zipcode        -0.205  -0.257   0.137
yr_built        0.243   0.764  -0.206
yr_renovated   -0.132   0.201
waterfront      0.359
view            0.179   0.449
condition      -0.412
grade           0.557   0.506   0.435

               Factor1 Factor2 Factor3
SS loadings      2.688   1.987   1.374
Proportion Var   0.192   0.142   0.098
Cumulative Var   0.192   0.334   0.432

Test of the hypothesis that 3 factors are sufficient. The chi square statistic is 12931.95 on 52 degrees of freedom. The p-value is 0

             Length Class    Mode
converged      1    -none-   logical
loadings      42    loadings numeric
uniquenesses  14    -none-   numeric
correlation  196    -none-   numeric
criteria       3    -none-   numeric
factors        1    -none-   numeric
dof            1    -none-   numeric
method         1    -none-   character
rotmat         9    -none-   numeric
STATISTIC      1    -none-   numeric
PVAL           1    -none-   numeric
n.obs          1    -none-   numeric
call           5    -none-   call
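
The call recorded in this section specifies rotation = "varimax", and its output matches the Varimax Rotation section. A promax fit would instead be requested with rotation = "promax". Promax is an oblique rotation, so the factors are allowed to correlate. The sketch below shows the difference on simulated data, since the house data are not reproduced here. The simulated variables x1 to x6 and the two-factor structure are hypothetical and chosen only for illustration.

```r
set.seed(1)
n <- 500

# Two correlated latent factors (hypothetical simulation, not the house data).
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(1 - 0.5^2) * rnorm(n)

# Six observed variables: three driven by each factor, plus noise.
sim <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

# Same maximum-likelihood fit, two different rotations of the loadings.
fit_varimax <- factanal(sim, factors = 2, rotation = "varimax")
fit_promax  <- factanal(sim, factors = 2, rotation = "promax")

print(fit_varimax$loadings, cutoff = 0.1)
print(fit_promax$loadings, cutoff = 0.1)

# Rotation changes the loadings but not the uniquenesses.
same_uniq <- isTRUE(all.equal(fit_varimax$uniquenesses,
                              fit_promax$uniquenesses))
same_load <- isTRUE(all.equal(unclass(fit_varimax$loadings),
                              unclass(fit_promax$loadings)))
print(c(same_uniq = same_uniq, same_load = same_load))
```

Because the rotation is applied after the model is fitted, the uniquenesses and the chi-square test are the same for every rotation. Only the loadings, and therefore the interpretation of the factors, change.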