Three columns were unclear to me until I read the documentation:
--> in_sf: Stands for “in San Francisco”. This is a binary indicator, with 1 meaning the home is in San Francisco and 0 meaning it is not.
--> price_per_sqft: The home price divided by its square footage. It was not obvious from the name alone that this is a derived value.
--> elevation: The height of the house above sea level, in feet. Without the documentation, I wouldn’t have recognised that this refers to height.
The documentation clarifies these abbreviations and derived values. Without it, I might have misinterpreted parts of the data. For example, I might have read in_sf as a count of bedrooms or bathrooms rather than a location flag, which would have led to incorrect analysis.
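The column definitions above can be checked directly against the data. The following is a minimal sketch on a few made-up rows, not the real file. It assumes the dataset also has `price` and `sqft` columns, which are not named in the text above, and the data frame `homes_demo` and all its values are hypothetical.

```r
# Toy rows standing in for Homes.csv. `price` and `sqft` are ASSUMED column
# names; every value here is invented for illustration.
homes_demo <- data.frame(
  in_sf          = c(0, 1, 1, 0),
  price          = c(999000, 1200000, 850000, 2750000),
  sqft           = c(1000, 1500, 850, 2200),
  price_per_sqft = c(999, 800, 1000, 1250),
  elevation      = c(10, 73, 120, 4)
)

# If in_sf is a binary indicator, it should contain only 0 and 1.
in_sf_is_binary <- all(homes_demo$in_sf %in% c(0, 1))

# If price_per_sqft is price / sqft, recomputing it should agree with the
# stored column (allowing a difference of less than 1 for rounding).
recomputed   <- homes_demo$price / homes_demo$sqft
ppsf_matches <- all(abs(recomputed - homes_demo$price_per_sqft) < 1)

# The elevation range gives a quick plausibility check on the stated unit.
elevation_range <- range(homes_demo$elevation)
```

Running the same three checks on the real `Homes` data frame would confirm, or contradict, what the documentation says about each column.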
Even after reading the documentation, the year_built field remains ambiguous. The documentation does not say whether this is the exact year the house was built or only an approximation. Any analysis of home age could therefore be inaccurate.
Here is a visualization highlighting the potential issue with year_built:
Homes <- read.csv('D:/DataSet/Homes.csv')
library(ggthemes)
library(ggrepel)
library(ggplot2)
library(dplyr)
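# Count of homes per construction year; geom_bar() counts rows by default.
ggplot(data = Homes, aes(x = year_built)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Unclear Whether Year Built Is Actual or Estimated",
       x = "Year Built", y = "Number of Homes") +
  # Note placed on the plot to flag the documentation gap.
  annotate("text", x = 1925, y = 20,
           label = "No explanation of whether year is actual or estimated",
           color = "red")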
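A bar chart of counts cannot show on its own whether the years are estimated. One rough heuristic, sketched here on made-up values, is to check whether years ending in 0 or 5 are over-represented, since estimated years tend to be rounded. The vector `year_built_demo` is hypothetical, and the 20% baseline assumes construction years are spread evenly across final digits. That assumption may not hold, so a high share is a hint, not proof.

```r
# Hypothetical year_built values; the real ones would come from Homes$year_built.
year_built_demo <- c(1900, 1920, 1925, 1931, 1950, 1960, 1987, 1990, 2000, 2014)

# Flag years that end in 0 or 5 (divisible by 5).
is_round <- year_built_demo %% 5 == 0

# Share of "round" years in the sample.
share_round <- mean(is_round)

# Under the even-spread assumption, about 2 in 10 final digits are 0 or 5.
expected_share <- 0.2

# A share far above the baseline suggests some years may have been rounded.
looks_heaped <- share_round > expected_share
```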
Risk: The documentation does not say whether year_built is the actual or an estimated year of construction. An analysis could therefore make incorrect assumptions about the age of a home. To lower this risk, I would advise the data suppliers to state clearly what year_built represents. If some values are estimates, they could also add a separate column that marks each year as estimated or actual.
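The suggested extra column could look like the following minimal sketch. The column name `year_built_is_estimate`, the data frame `homes_flagged`, the reference year, and all values are hypothetical. Nothing like this exists in the current dataset.

```r
# Made-up rows with a HYPOTHETICAL flag marking estimated construction years.
homes_flagged <- data.frame(
  year_built             = c(1925, 1987, 2004),
  year_built_is_estimate = c(TRUE, FALSE, FALSE)
)

# Assumed reference year for computing home age.
reference_year <- 2024
homes_flagged$age <- reference_year - homes_flagged$year_built

# Restrict age analysis to homes whose construction year is known exactly.
exact_only     <- homes_flagged[!homes_flagged$year_built_is_estimate, ]
mean_age_exact <- mean(exact_only$age)
```

With such a flag, an analyst could report results for all homes and for exact-year homes separately, and see how much the estimated years affect the conclusions.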