The data for this assignment describe the sale prices of houses along with the features of each property.
Here are the packages needed for running this assignment:

| Package | Explanation |
|---|---|
| tidyverse | Group of packages that includes readr, dplyr, ggplot2, etc. |
| sqldf | Allows SQL code to be used to query against a data table |
| DT | Allows user to create tables that can be interacted with |

This subset selects just four columns of interest from the data. It then filters the subset to list only properties with a lot size greater than 60,000.
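
As a minimal sketch of this select-then-filter step, the example below uses a small invented data frame in place of the assignment data. The name `houses`, the particular four columns, and every value are assumptions made for illustration; only `LotArea` and the 60000 cutoff come from the description above.

```r
library(dplyr)

# Toy stand-in for the assignment data; all values are invented.
houses <- tibble(
  Id           = 1:5,
  Neighborhood = c("NAmes", "Timber", "ClearCr", "NAmes", "Timber"),
  LotArea      = c(8450, 70761, 115149, 9600, 63887),
  SalePrice    = c(208500, 280000, 302000, 181500, 274970),
  YrSold       = c(2008, 2006, 2007, 2007, 2006)
)

# Keep four columns (a hypothetical choice), then keep only the very large lots.
big_lots <- houses %>%
  select(Id, Neighborhood, LotArea, SalePrice) %>%
  filter(LotArea > 60000)

print(big_lots)
```

Selecting before filtering keeps the intermediate table narrow, though the two steps could be swapped here without changing the result.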
This subset selects information relevant to determining how overall quality and renovations affect the sale price of a house.
This subset organizes the data into groups of each Neighborhood and finds the average lot size for each group.
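
A grouped average like this one can be sketched with `group_by()` and `summarise()`. The data frame below is invented for illustration, and the output column name `AvgLotArea` is an assumption rather than the name used in the assignment.

```r
library(dplyr)

# Toy data; the neighborhood labels and lot sizes are invented.
houses <- tibble(
  Neighborhood = c("NAmes", "NAmes", "Timber", "Timber", "ClearCr"),
  LotArea      = c(8000, 10000, 20000, 30000, 40000)
)

# One row per neighborhood, holding the mean lot size of its properties.
avg_lot <- houses %>%
  group_by(Neighborhood) %>%
  summarise(AvgLotArea = mean(LotArea))

print(avg_lot)
```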
This subset adds a column (using mutate) to show how many years occurred between when the house was built and when it was remodeled. It also selects information about property quality, sale price, and the year sold.
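
The derived column could be computed roughly as follows. This is a sketch on invented rows: the source column names (`YearBuilt`, `YearRemodAdd`, `OverallQual`, `SalePrice`, `YrSold`) are assumed, and only `YearsTilRemodel` and `Remodeled` are names that appear elsewhere in this write-up.

```r
library(dplyr)

# Toy data; column names are assumed and values are invented.
houses <- tibble(
  YearBuilt    = c(1960, 1995, 2003),
  YearRemodAdd = c(1985, 1995, 2005),
  OverallQual  = c(5, 7, 8),
  SalePrice    = c(140000, 215000, 260000),
  YrSold       = c(2008, 2007, 2009)
)

# Years between construction and remodel, kept alongside quality, price, and year sold.
Remodeled <- houses %>%
  mutate(YearsTilRemodel = YearRemodAdd - YearBuilt) %>%
  select(YearsTilRemodel, OverallQual, SalePrice, YrSold)

print(Remodeled)
```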
This subset calculates the total square footage of the house on each property. It then selects just the properties in the neighborhood called NAmes that have a home square footage over 3,000 and an overall quality above 4. Essentially, this looks for large, quality homes in this neighborhood, which could be helpful for someone who wants data about homes similar to one they own or are looking to purchase.
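
A sketch of this step is shown below. How total square footage is built is an assumption here (basement plus first and second floor), and the column names `BsmtSF`, `FirstFlrSF`, `SecondFlrSF`, and `TotalSF` are hypothetical; the NAmes, 3000, and quality-above-4 conditions come from the description above.

```r
library(dplyr)

# Toy data; area columns are hypothetical names and all values are invented.
houses <- tibble(
  Neighborhood = c("NAmes", "NAmes", "NAmes", "Timber"),
  OverallQual  = c(6, 4, 7, 8),
  BsmtSF       = c(1200, 1500, 900, 1600),
  FirstFlrSF   = c(1300, 1600, 1000, 1700),
  SecondFlrSF  = c(700, 400, 0, 900)
)

# Add up the assumed area components, then keep large, quality homes in NAmes.
large_quality <- houses %>%
  mutate(TotalSF = BsmtSF + FirstFlrSF + SecondFlrSF) %>%
  filter(Neighborhood == "NAmes", TotalSF > 3000, OverallQual > 4)

print(large_quality)
```

In the toy data only the first row passes all three conditions: the second has a quality of exactly 4, the third is too small, and the fourth is in a different neighborhood.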
This visualization is a faceted scatterplot of LotArea by SalesPrice, with a fitted linear model overlaid on the panel for each year of property sales.
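
A plot of this kind might be built along these lines with ggplot2. The data are invented, the column names `SalePrice` and `YrSold` are assumed, and putting lot area on the x-axis is a guess at the layout rather than a statement of how the original figure was drawn.

```r
library(ggplot2)

# Toy data; two sale years with a few invented points each.
houses <- data.frame(
  LotArea   = c(8000, 9500, 12000, 7000, 10500, 15000),
  SalePrice = c(150000, 180000, 230000, 140000, 200000, 260000),
  YrSold    = c(2006, 2006, 2006, 2007, 2007, 2007)
)

# Scatterplot with a straight-line fit, drawn separately for each sale year.
p <- ggplot(houses, aes(x = LotArea, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ YrSold)

# print(p) renders the figure in an interactive session or a knitted document.
```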
This visualization shows how overall quality affects the sale price of a property. Note that as the quality increases, the price does as well. This also suggests that overall quality might be a good predictor of sale price in a multiple regression.
This visualization takes the third subset, which shows each neighborhood’s average lot size, and plots the neighborhoods with an average lot size over 10,000 in a bar graph. It makes it easy to compare the neighborhoods with larger property sizes against each other.
This visualization uses the Remodeled data from Subset 4, but it turns YearsTilRemodel into a dummy variable indicating whether or not the house was remodeled (if the remodel happened within 4 years of the original build, I did not count that as a remodel). It then shows the average sale price for houses that were remodeled versus those that were not. It is interesting to note that the houses that were not remodeled sold for more on average.
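
The dummy variable and the averages behind the chart could be sketched as below. The rows are invented, the names `WasRemodeled` and `AvgSalePrice` are hypothetical, and reading "within 4 years" as "4 years or fewer" is an interpretation of the rule above.

```r
library(dplyr)
library(ggplot2)

# Toy version of the Remodeled data; values are invented.
Remodeled <- tibble(
  YearsTilRemodel = c(0, 2, 4, 10, 30),
  SalePrice       = c(200000, 220000, 210000, 150000, 170000)
)

# A gap of 4 years or fewer is treated as "not remodeled" (assumed reading).
price_by_remodel <- Remodeled %>%
  mutate(WasRemodeled = if_else(YearsTilRemodel > 4, "Remodeled", "Not remodeled")) %>%
  group_by(WasRemodeled) %>%
  summarise(AvgSalePrice = mean(SalePrice))

# One bar per group, with height equal to the average sale price.
p <- ggplot(price_by_remodel, aes(x = WasRemodeled, y = AvgSalePrice)) +
  geom_col()

print(price_by_remodel)
```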
This visualization shows the data from Subset 5, where only the large, quality houses in the NAmes neighborhood are selected. The color of the points on the scatterplot is based on the overall quality, with the lighter points being higher-quality properties. This color difference shows that the better-quality properties sell for more, at least in this neighborhood.