HOUSE SALES IN KING COUNTY, WASHINGTON
This dataset contains sale prices for houses sold in King County, Washington (which includes Seattle) between May 2014 and May 2015. Its columns include zipcode, longitude, latitude, square footage, and number of bedrooms and bathrooms, among others. I investigate which features influence the average sale price of a house. The columns I use include:
1. price
2. condition: how good the overall condition is, on a scale of 1-5
3. grade: overall grade based on the King County grading system
4. view: whether the house has been viewed
5. sqft_living: square footage of the house
6. zipcode
7. waterfront (where 0 = no view and 1 = view)

source: https://www.kaggle.com/harlfoxem/housesalesprediction

DESCRIPTIVE STATISTICS
Data type
Head
Tail
Row and column count
Column Names

## id                 int64
## date              object
## price            float64
## bedrooms           int64
## bathrooms        float64
## sqft_living        int64
## sqft_lot           int64
## floors           float64
## waterfront         int64
## view               int64
## condition          int64
## grade              int64
## sqft_above         int64
## sqft_basement      int64
## yr_built           int64
## yr_renovated       int64
## zipcode            int64
## lat              float64
## long             float64
## sqft_living15      int64
## sqft_lot15         int64
## dtype: object
##            id             date     price  ...     long  sqft_living15  sqft_lot15
## 0  7129300520  20141013T000000  221900.0  ... -122.257           1340        5650
## 1  6414100192  20141209T000000  538000.0  ... -122.319           1690        7639
## 2  5631500400  20150225T000000  180000.0  ... -122.233           2720        8062
## 3  2487200875  20141209T000000  604000.0  ... -122.393           1360        5000
## 4  1954400510  20150218T000000  510000.0  ... -122.045           1800        7503
## 
## [5 rows x 21 columns]
##                id             date  ...  sqft_living15  sqft_lot15
## 21608   263000018  20140521T000000  ...           1530        1509
## 21609  6600060120  20150223T000000  ...           1830        7200
## 21610  1523300141  20140623T000000  ...           1020        2007
## 21611   291310100  20150116T000000  ...           1410        1287
## 21612  1523300157  20141015T000000  ...           1020        1357
## 
## [5 rows x 21 columns]
## (21613, 21)
## Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
##        'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
##        'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
##        'lat', 'long', 'sqft_living15', 'sqft_lot15'],
##       dtype='object')
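
The five outputs above (data types, head, tail, shape, and column names) correspond to standard pandas calls. The following is a minimal sketch rather than the original code: the variable name `houses`, the CSV file name in the comment, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
# With the Kaggle file on disk you would instead load it, for example:
#   houses = pd.read_csv("kc_house_data.csv")  # file name is an assumption
houses = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["20141013T000000", "20141209T000000", "20150225T000000"],
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
})

print(houses.dtypes)   # data type of each column
print(houses.head())   # first five rows (fewer if the frame is shorter)
print(houses.tail())   # last five rows
print(houses.shape)    # (row count, column count)
print(houses.columns)  # column names
```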

CLEANING UP DATA

Are there any NAs in any column? The per-column counts below show none.

## id               0
## date             0
## price            0
## bedrooms         0
## bathrooms        0
## sqft_living      0
## sqft_lot         0
## floors           0
## waterfront       0
## view             0
## condition        0
## grade            0
## sqft_above       0
## sqft_basement    0
## yr_built         0
## yr_renovated     0
## zipcode          0
## lat              0
## long             0
## sqft_living15    0
## sqft_lot15       0
## dtype: int64
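
A per-column count of missing values like the one above can be produced with `isna().sum()`. This is a sketch on a toy frame (made-up values, with one missing price added on purpose so the count is visible); the name `houses` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing price; values are made up.
houses = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan],
    "bedrooms": [3, 3, 2],
})

# Number of missing values in each column.
na_counts = houses.isna().sum()
print(na_counts)

# Single yes/no answer: is anything missing anywhere in the frame?
any_missing = houses.isna().any().any()
print(any_missing)
```

On the real dataset every count is 0, as the output above shows, so no rows need to be dropped or imputed.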

Convert the data type of price from float to integer for formatting purposes, then print the data types again to confirm that price is now an integer.

## id                 int64
## date              object
## price              int32
## bedrooms           int64
## bathrooms        float64
## sqft_living        int64
## sqft_lot           int64
## floors           float64
## waterfront         int64
## view               int64
## condition          int64
## grade              int64
## sqft_above         int64
## sqft_basement      int64
## yr_built           int64
## yr_renovated       int64
## zipcode            int64
## lat              float64
## long             float64
## sqft_living15      int64
## sqft_lot15         int64
## dtype: object
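
The conversion can be sketched with `astype`. The `int32` in the output above is probably the platform's default integer width at the time the report was run; the sketch below requests `int64` explicitly so the result does not depend on the platform. The name `houses` and the toy values are assumptions.

```python
import pandas as pd

# Toy prices stored as floats, as in the raw data; values are illustrative.
houses = pd.DataFrame({"price": [221900.0, 538000.0, 180000.0]})

# Cast to a 64-bit integer. astype truncates any fractional part,
# which is harmless here because the prices are whole dollars.
houses["price"] = houses["price"].astype("int64")

print(houses.dtypes)  # price should now be listed as int64
```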

PLOTS
1. Create a bar chart showing the top ten zip codes with the highest average selling price.
Conclusion: the most expensive zip code in King County was 98039 (with an average price over $2 million), followed by 98004 and 98040.
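
One way to build the top-ten zip code chart is to group by `zipcode`, average `price`, sort, and keep the first ten rows. This is a sketch on made-up toy data, not the original plotting code; the name `houses` and the labels are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; the prices are made up and only the column names match the dataset.
houses = pd.DataFrame({
    "zipcode": [98039, 98039, 98004, 98004, 98040, 98001],
    "price": [2000000, 2400000, 1300000, 1500000, 1200000, 300000],
})

# Average price per zip code, highest first, limited to the top ten.
top_zips = (
    houses.groupby("zipcode")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

fig, ax = plt.subplots()
top_zips.plot.bar(ax=ax)
ax.set_xlabel("zipcode")
ax.set_ylabel("average price ($)")
ax.set_title("Top zip codes by average selling price")
fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the figure.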

2. Create a bar chart showing average sale price by the condition of the house and whether it had a waterfront view.
Conclusion: homes with a waterfront view and a better condition rating sold for much more than homes without. However, for homes with a waterfront view, condition did not seem to matter as much: waterfront houses with a condition rating of 2 sold for more on average than those with a rating of 3 or higher.
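
The condition-by-waterfront chart can be sketched as a grouped bar chart built from a two-key `groupby`. The sketch also counts the sales in each group, because an average over very few sales (which may be the case for low-condition waterfront homes) is easy to over-read. The toy values and the name `houses` are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; values are made up for illustration only.
houses = pd.DataFrame({
    "condition": [3, 3, 4, 4, 3, 3, 4, 4],
    "waterfront": [0, 0, 0, 0, 1, 1, 1, 1],
    "price": [400000, 500000, 500000, 700000,
              1400000, 1600000, 1700000, 1900000],
})

# Mean price and number of sales for each (condition, waterfront) pair.
summary = houses.groupby(["condition", "waterfront"])["price"].agg(["mean", "count"])
print(summary)

# Pivot so each waterfront value becomes its own column (one bar per group).
avg_price = summary["mean"].unstack("waterfront")

fig, ax = plt.subplots()
avg_price.plot.bar(ax=ax)
ax.set_xlabel("condition")
ax.set_ylabel("average price ($)")
ax.legend(title="waterfront")
fig.tight_layout()
```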

3. Show a side-by-side bar chart with average sale price and average square footage of living space plotted on the y-axis, to see whether more living space correlates with a higher selling price.
Conclusion: there seems to be a positive relationship between average sale price and average square footage of living space, where homes with higher prices had more living space, though this relationship seems to be weaker for zip codes with lower prices.
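
A side-by-side comparison of this kind can be sketched with two subplots that share the same ordering. Grouping by `zipcode` is an assumption based on the conclusion's mention of zip codes, as are the toy values and the name `houses`. The correlation coefficient at the end is an extra numeric check, not part of the original analysis.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; values are made up for illustration only.
houses = pd.DataFrame({
    "zipcode": [98039, 98039, 98004, 98004, 98001],
    "price": [2000000, 2400000, 1300000, 1500000, 300000],
    "sqft_living": [3000, 4000, 2800, 3000, 2000],
})

# Average price and living space per zip code, most expensive first.
by_zip = (
    houses.groupby("zipcode")[["price", "sqft_living"]]
    .mean()
    .sort_values("price", ascending=False)
)

# Two panels side by side, because the two measures have very different scales.
fig, (ax_price, ax_sqft) = plt.subplots(1, 2, figsize=(10, 4))
by_zip["price"].plot.bar(ax=ax_price, title="Average price ($)")
by_zip["sqft_living"].plot.bar(ax=ax_sqft, title="Average sqft_living")
fig.tight_layout()

# Pearson correlation between the two per-zip averages.
corr = by_zip["price"].corr(by_zip["sqft_living"])
print(corr)
```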

4. Create a line graph showing average sale price by the grade of the house and how many times it had been viewed.
Conclusion: houses with a higher grade sold for more, especially those that had been viewed 3 or 4 times.

Overall, all the variables studied (zipcode, sqft_living, grade, condition, waterfront, and view) seemed to have an influence on the selling price of houses in King County. This type of analysis would be helpful for homeowners who want to put a value on their house. Other variables that could be included in further research are the number of bedrooms, bathrooms, and floors, and whether the house has been renovated.
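
The grade-and-view line graph described above can be sketched by averaging `price` over both columns and drawing one line per `view` value. The toy values and the name `houses` are assumptions; this is not the original plotting code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; values are made up for illustration only.
houses = pd.DataFrame({
    "grade": [7, 7, 8, 8, 9, 9],
    "view": [0, 3, 0, 3, 0, 3],
    "price": [400000, 600000, 550000, 800000, 750000, 1200000],
})

# Rows are grades, columns are view values, cells are average prices.
avg_price = houses.groupby(["grade", "view"])["price"].mean().unstack("view")

fig, ax = plt.subplots()
avg_price.plot.line(ax=ax, marker="o")  # one line per view value
ax.set_xlabel("grade")
ax.set_ylabel("average price ($)")
ax.legend(title="view")
fig.tight_layout()
```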
