Project Goal

The goal of this project is to apply the material introduced in this class to a real-world dataset and to write a formal report presenting all of your work.

Instructions

  1. Write a summary and description of the dataset you use. You may use summary statistics, descriptive statistics, figures, etc.
  2. Clearly state the research questions and your analysis plan of the project.
  3. State at least two possible multivariate analysis methods you want to use in the analysis. You may also use other multivariate analysis methods that are not covered in the book.
  4. State whether your data satisfy the assumptions, how you fit the models, and how you use the output from these models to answer your research questions (include numerical summaries, graphics, and interpretations).
  5. Compare the models or discuss them further, and provide suggestions for other people who want to use similar methods in their analysis.
  6. Address any problems you encounter while analyzing the dataset.
  7. Discuss further questions raised by your study (that might be investigated in future research or analysis).

Report

Your grade on this project will be based on a written report. You should write up your analysis and report in the form of a short technical/research article, preferably in the format of a high-quality statistics or mathematics journal.

  • The main part of your report should be 4-7 pages. Your write-up should briefly address the points listed in the previous section.
  • An abstract highlighting the results from your analysis should be included before the body of the work.
  • Graphs, figures, and tables can help readers understand your findings. An appendix should include your code and/or computer output. (Extra figures/tables and the appendix do not count toward the 4-7 page limit.)
  • Interesting computer results can be included in the body of the report.
  • A bibliography is necessary for any references you make.

Grading

  • 20% Description of the data, summary of the possible research questions
  • 40% Appropriate and correct analysis procedures
  • 20% Complexity of your decisions, suggestions, extensions
  • 10% Well-written and attractive report meeting the guidelines set above
  • 10% Spelling, grammar and punctuation

Dataset: House Price Data

Variable Descriptions

  • id: identifier for a house
  • date: date the house was sold
  • price: sale price of the house (prediction target)
  • bedrooms: number of bedrooms per house
  • bathrooms: number of bathrooms per house
  • sqft_living: square footage of the home
  • sqft_lot: square footage of the lot
  • floors: total floors (levels) in house
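  • waterfront: whether the house has a view of a waterfront (coded 0/1 in the data)
  • view: has been viewed (recorded on a 0 to 4 scale in the data)
  • condition: how good the condition is (recorded on a 1 to 5 scale in the data)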
  • grade: overall grade given to the housing unit based on the King County grading system
  • sqft_above: square footage of house apart from basement
  • sqft_basement: square footage of the basement
  • yr_built: year the house was built
  • yr_renovated: year when the house was renovated (0 for most houses in the data, presumably meaning no renovation is recorded)
  • zipcode: ZIP code
  • lat: latitude coordinate
  • long: longitude coordinate
  • sqft_living15: living room area in 2015 (implies some renovations), which may have affected the lot size area
  • sqft_lot15: lot size area in 2015 (implies some renovations)

Research Questions

Methodology of Analysis

Initial Exploratory Data Analysis

Missingness
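
A missingness check of this kind can be sketched in base R. The example is a minimal sketch on a small made-up data frame (`toy`, with hypothetical values that are not taken from the house data); only the column names mirror the variable list.

```r
# Toy stand-in for the house data (hypothetical values, not from the dataset)
toy <- data.frame(
  price    = c(221900, 538000, NA, 604000),
  bedrooms = c(3, 3, 2, NA),
  sqft_lot = c(5650, 7242, 10000, 5000)
)

# Number and share of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))
print(data.frame(na_count, na_share))

# Number of rows without any missing value
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to the real data frame, a column of zeros in `na_count` would indicate that no imputation step is needed before PCA or factor analysis.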



Summary Statistics

           price  floors  bedrooms  bathrooms  yr_built  yr_renovated    view
Min.       75000   1.000     0.000      0.000      1900           0.0  0.0000
1st Qu.   321950   1.000     3.000      1.750      1951           0.0  0.0000
Median    450000   1.500     3.000      2.250      1975           0.0  0.0000
Mean      540088   1.494     3.371      2.115      1971          84.4  0.2343
3rd Qu.   645000   2.000     4.000      2.500      1997           0.0  0.0000
Max.     7700000   3.500    33.000      8.000      2015        2015.0  4.0000

           price  sqft_living  sqft_living15  sqft_lot  sqft_lot15  sqft_above  sqft_basement
Min.       75000          290            399       520         651         290            0.0
1st Qu.   321950         1427           1490      5040        5100        1190            0.0
Median    450000         1910           1840      7618        7620        1560            0.0
Mean      540088         2080           1987     15107       12768        1788          291.5
3rd Qu.   645000         2550           2360     10688       10083        2210          560.0
Max.     7700000        13540           6210   1651359      871200        9410         4820.0

           price    lat    long  zipcode  waterfront  condition   grade
Min.       75000  47.16  -122.5    98001    0.000000      1.000   1.000
1st Qu.   321950  47.47  -122.3    98033    0.000000      3.000   7.000
Median    450000  47.57  -122.2    98065    0.000000      3.000   7.000
Mean      540088  47.56  -122.2    98078    0.007542      3.409   7.657
3rd Qu.   645000  47.68  -122.1    98118    0.000000      4.000   8.000
Max.     7700000  47.78  -121.3    98199    1.000000      5.000  13.000
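
Two features of the table may deserve a closer look before modeling: yr_renovated is 0 at the median and at both quartiles, and bedrooms has a maximum of 33. The sketch below shows one way to handle both. It uses a made-up data frame (`toy`, hypothetical rows), assumes that yr_renovated == 0 means no renovation is recorded, and uses a "3 times the interquartile range" cutoff that is an illustrative choice rather than a rule from the report.

```r
# Toy stand-in for the house data (hypothetical rows, not from the dataset)
toy <- data.frame(
  price        = c(75000, 221900, 321950, 450000, 540000, 645000, 7700000),
  bedrooms     = c(2, 3, 3, 3, 4, 4, 33),
  yr_renovated = c(0, 0, 0, 1991, 0, 0, 2015)
)
print(summary(toy))

# Assumption: yr_renovated == 0 means "no renovation recorded",
# so the column is recoded as an indicator instead of being averaged as a year
toy$renovated <- toy$yr_renovated > 0
renovated_share <- mean(toy$renovated)

# Flag bedroom counts far above the upper quartile (illustrative cutoff)
q <- quantile(toy$bedrooms, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
flagged <- which(toy$bedrooms > q[[2]] + 3 * iqr)
print(toy[flagged, ])
```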

Principal Component Analysis

Assumptions for PCA

Performing PCA

Call:
princomp(x = df.num)

Standard deviations:
           Comp.1            Comp.2            Comp.3            Comp.4 
367145.9651446508  46451.9474231992  16856.7971618432    943.1351452520 
           Comp.5            Comp.6            Comp.7            Comp.8 
   513.4688851286    408.6524693460    383.6711990638     50.4180676714 
           Comp.9           Comp.10           Comp.11           Comp.12 
    23.8452181540      0.7576857097      0.6752474139      0.6251892634 
          Comp.13           Comp.14           Comp.15           Comp.16 
     0.5828901824      0.4600428983      0.3527710841      0.1220866594 
          Comp.17           Comp.18           Comp.19 
     0.1021258504      0.0765978564      0.0001628953 

 19  variables and  21613 observations.
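
The printed standard deviations can be turned into proportions of explained variance. The sketch below copies the 19 values from the output above; the first component carries about 98% of the total variance, which is what one would expect when princomp() decomposes the covariance matrix and one variable (here most likely price) is measured on a much larger scale than the others. The closing comment about cor = TRUE is a suggestion, not something the report ran.

```r
# Standard deviations copied from the princomp() output above
sdev <- c(367145.9651446508, 46451.9474231992, 16856.7971618432,
          943.1351452520, 513.4688851286, 408.6524693460,
          383.6711990638, 50.4180676714, 23.8452181540,
          0.7576857097, 0.6752474139, 0.6251892634,
          0.5828901824, 0.4600428983, 0.3527710841,
          0.1220866594, 0.1021258504, 0.0765978564, 0.0001628953)
names(sdev) <- paste0("Comp.", seq_along(sdev))

# The variance of each component is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)
print(round(head(cbind(prop_var, cum_var), 4), 4))

# To decompose the correlation matrix instead, so that every variable
# enters with unit variance, the call would take cor = TRUE, e.g.
#   princomp(x = df.num, cor = TRUE)   # df.num as in the call above
```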

Factor Analysis

Assumptions for Factor Analysis

Performing Factor Analysis

  • Rotation Sums of Squared Loadings (the "SS loadings" rows in the rotated output): the distribution of the explained variance across the factors after rotation, for example after the varimax rotation.
  • Factor Matrix (the "Loadings" table without rotation): the unrotated factor loadings, which are the correlations between each variable and each factor.
  • Rotated Factor Matrix (the "Loadings" table after rotation): the rotated factor loadings (factor pattern matrix), which show how the variables are weighted on each factor. For an orthogonal rotation such as varimax they are also the correlations between the variables and the factors.
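
The relationship between these quantities can be checked by hand. The sketch below uses three rows of unrotated loadings and the SS loadings copied from the No Rotation output in this report: the communality of a variable is the sum of its squared loadings, the uniqueness is one minus the communality, and the proportion of variance of a factor is its SS loading divided by the number of variables (14).

```r
# Unrotated loadings for three variables, copied from the No Rotation output
L <- rbind(
  price  = c(0.769,  0.331, 0.327),
  floors = c(0.439, -0.484, 0.259),
  grade  = c(0.835, -0.141, 0.194)
)
colnames(L) <- paste0("Factor", 1:3)

# Communality = row sum of squared loadings; uniqueness = 1 - communality
communality <- rowSums(L^2)
uniqueness  <- 1 - communality
print(round(uniqueness, 3))   # compare with the printed Uniquenesses

# Proportion Var = SS loadings / number of variables
ss_loadings <- c(Factor1 = 3.830, Factor2 = 1.565, Factor3 = 0.654)
prop_var <- ss_loadings / 14
cum_var  <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 3))
```

The computed values agree with the printed uniquenesses (0.192, 0.506, 0.245) and with the Proportion Var and Cumulative Var rows up to rounding.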

No Rotation


Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "none")

Uniquenesses:
        price        floors      bedrooms     bathrooms   sqft_living 
        0.192         0.506         0.561         0.282         0.089 
     sqft_lot sqft_basement       zipcode      yr_built  yr_renovated 
        0.976         0.514         0.873         0.315         0.942 
   waterfront          view     condition         grade 
        0.871         0.762         0.823         0.245 

Loadings:
              Factor1 Factor2 Factor3
price          0.769   0.331   0.327 
floors         0.439  -0.484   0.259 
bedrooms       0.551          -0.368 
bathrooms      0.811  -0.224         
sqft_living    0.943          -0.131 
sqft_lot       0.149                 
sqft_basement  0.380   0.460  -0.361 
zipcode       -0.204   0.266   0.121 
yr_built       0.396  -0.727         
yr_renovated           0.218         
waterfront     0.142   0.199   0.264 
view           0.317   0.308   0.205 
condition              0.393  -0.118 
grade          0.835  -0.141   0.194 

               Factor1 Factor2 Factor3
SS loadings      3.830   1.565   0.654
Proportion Var   0.274   0.112   0.047
Cumulative Var   0.274   0.385   0.432

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 12931.95 on 52 degrees of freedom.
The p-value is 0 
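
The degrees of freedom and the p-value of this test can be reproduced from the printed numbers. The sketch assumes the usual degrees-of-freedom formula for a maximum likelihood factor model with p variables and k factors, ((p - k)^2 - (p + k)) / 2. The printed p-value of 0 is a numerical underflow, so the log of the p-value is requested as well.

```r
p <- 14   # observed variables in the factanal() call
k <- 3    # number of factors

# Degrees of freedom of the likelihood ratio test
dof <- ((p - k)^2 - (p + k)) / 2
print(dof)

# Upper-tail probability of the printed chi-square statistic
chi_sq <- 12931.95
p_val  <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
log_p  <- pchisq(chi_sq, df = dof, lower.tail = FALSE, log.p = TRUE)
print(p_val)
print(log_p)
```

With 21613 observations this test has enough power to reject for even small departures from the model, so the rejection by itself says little about whether three factors are a useful summary.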

Varimax Rotation


Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

Uniquenesses:
        price        floors      bedrooms     bathrooms   sqft_living 
        0.192         0.506         0.561         0.282         0.089 
     sqft_lot sqft_basement       zipcode      yr_built  yr_renovated 
        0.976         0.514         0.873         0.315         0.942 
   waterfront          view     condition         grade 
        0.871         0.762         0.823         0.245 

Loadings:
              Factor1 Factor2 Factor3
price          0.482   0.130   0.748 
floors         0.157   0.672   0.133 
bedrooms       0.661                 
bathrooms      0.693   0.458   0.166 
sqft_living    0.852   0.251   0.349 
sqft_lot       0.145                 
sqft_basement  0.567  -0.382   0.137 
zipcode       -0.205  -0.257   0.137 
yr_built       0.243   0.764  -0.206 
yr_renovated          -0.132   0.201 
waterfront                     0.359 
view           0.179           0.449 
condition             -0.412         
grade          0.557   0.506   0.435 

               Factor1 Factor2 Factor3
SS loadings      2.688   1.987   1.374
Proportion Var   0.192   0.142   0.098
Cumulative Var   0.192   0.334   0.432

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 12931.95 on 52 degrees of freedom.
The p-value is 0 
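
The uniquenesses, cumulative variance, and test statistic are the same as in the unrotated fit because an orthogonal rotation only redistributes variance among the factors. The sketch below illustrates this with stats::varimax() on seven rows of the unrotated loadings copied from this report. Because the rotation is computed from these seven rows only, the rotated values will not match the printed varimax loadings, which were obtained from all 14 variables; the point is that the communalities do not change.

```r
# Seven rows of unrotated loadings copied from the No Rotation output
L <- rbind(
  price         = c( 0.769,  0.331,  0.327),
  floors        = c( 0.439, -0.484,  0.259),
  sqft_basement = c( 0.380,  0.460, -0.361),
  zipcode       = c(-0.204,  0.266,  0.121),
  waterfront    = c( 0.142,  0.199,  0.264),
  view          = c( 0.317,  0.308,  0.205),
  grade         = c( 0.835, -0.141,  0.194)
)

# Orthogonal varimax rotation of this subset (illustration only)
rot   <- varimax(L)
L_rot <- unclass(rot$loadings)

# Communalities before and after rotation
h2_before <- rowSums(L^2)
h2_after  <- rowSums(L_rot^2)
print(round(cbind(h2_before, h2_after), 4))

# The rotation matrix is orthogonal: t(T) %*% T is the identity
print(round(crossprod(rot$rotmat), 6))

# Same check on the printed loadings for price (unrotated vs. varimax)
h2_price_printed <- sum(c(0.482, 0.130, 0.748)^2)
print(c(unrotated = h2_before[["price"]], varimax = h2_price_printed))
```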

Promax Rotation


Call:
factanal(x = equation, factors = 3, data = df.num, rotation = "varimax")

This call specifies rotation = "varimax", not rotation = "promax", so the fit it produces is the varimax solution already reported under Varimax Rotation: the uniquenesses, loadings, SS loadings (2.688, 1.987, 1.374), and test statistic (12931.95 on 52 degrees of freedom) are identical. No promax results are available from this run.
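
A promax fit differs from a varimax fit only in the rotation argument passed to factanal(). The sketch below shows this on simulated data (`toy`, six hypothetical indicators driven by two correlated latent factors), since the house data are not reproduced here. For the report itself the corresponding call would presumably be the one shown in this section with rotation = "promax", using the `equation` and `df.num` objects from that call.

```r
set.seed(42)
n  <- 500
f1 <- rnorm(n)
f2 <- 0.4 * f1 + rnorm(n)   # second latent factor, correlated with the first

# Six simulated indicators (hypothetical, not the house data)
toy <- data.frame(
  x1 = f1 + rnorm(n, sd = 0.5),
  x2 = f1 + rnorm(n, sd = 0.5),
  x3 = f1 + rnorm(n, sd = 0.5),
  x4 = f2 + rnorm(n, sd = 0.5),
  x5 = f2 + rnorm(n, sd = 0.5),
  x6 = f2 + rnorm(n, sd = 0.5)
)

# Same model, two rotations: orthogonal (varimax) and oblique (promax)
fa_varimax <- factanal(toy, factors = 2, rotation = "varimax")
fa_promax  <- factanal(toy, factors = 2, rotation = "promax")

# The rotation changes the loadings but not the uniquenesses
print(fa_varimax$loadings)
print(fa_promax$loadings)
print(all.equal(fa_varimax$uniquenesses, fa_promax$uniquenesses))
```

Promax is an oblique rotation, so its factors are allowed to correlate and its loadings are pattern coefficients rather than variable-factor correlations.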