Documentation

California Housing Dataset.

The data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data.

The dataset consists of 20,640 rows and 10 columns.

Description of columns:

longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west
latitude: A measure of how far north a house is; a higher value is farther north
housingMedianAge: Median age of a house within a block; a lower number is a newer building
totalRooms: Total number of rooms within a block
totalBedrooms: Total number of bedrooms within a block
population: Total number of people residing within a block
households: Total number of households, a group of people residing within a home unit, for a block
medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
medianHouseValue: Median house value for households within a block (measured in US Dollars)
oceanProximity: Location of the house relative to the ocean

# Load the dataset with base R's read.csv(); adjust the path to your local copy of housing.csv
housing_data <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

View(housing_data)

Task 1: Identify Unclear Columns/Values

Let’s start by identifying three columns or values in the California Housing dataset that are unclear until you read the documentation:

total_rooms: From the name alone, this could be read as the number of rooms in a single house. The documentation defines it as the total number of rooms across all houses within a block, so the value depends on how many houses the block contains and cannot be compared between blocks without normalizing, for example by the number of households.

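median_income: The name suggests median income, but it does not say whether the income is measured per household, per capita, or some other way, nor in what units. The documentation clarifies that it is the median household income within a block, measured in tens of thousands of US dollars. Understanding this is crucial for interpreting the values.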

median_house_value: The name does not reveal how this value is calculated, and the documentation only states that it is a block-level median measured in US dollars. It is essential to know whether it is adjusted for inflation or other factors.
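A quick way to spot columns whose meaning is not obvious from the name is to print their ranges. The sketch below uses a small hypothetical data frame (the three rows are made up for illustration and are not taken from housing.csv) with the same column names as the code in this report. A maximum of about 8 for median_income is a hint that the column cannot be in raw dollars.

```r
# Hypothetical toy rows for illustration only; replace `toy` with housing_data
toy <- data.frame(
  total_rooms        = c(880, 7099, 1467),
  median_income      = c(8.3, 7.2, 3.8),
  median_house_value = c(452600, 358500, 226700)
)

# One row per column, with its minimum and maximum value
ranges <- t(sapply(toy, range))
colnames(ranges) <- c("min", "max")
print(ranges)
```

Running the same two lines on housing_data (numeric columns only) gives a first impression of the scale of each column before reading the documentation.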

Task 2: Data Unclear Even After Reading Documentation

The definition of the target variable, median_house_value (Median House Value), remains unclear even after reading the documentation. Is this value adjusted for inflation or other factors? The documentation doesn’t provide details on how it is calculated.
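One thing the data itself can reveal is whether the value was capped when it was recorded. The following is a minimal sketch, assuming that a cap would show up as many blocks sharing exactly the same maximum value. The helper share_at_max and the toy values are hypothetical and only illustrate the check.

```r
# Hypothetical toy values for illustration; use housing_data$median_house_value in practice
toy_values <- c(452600, 358500, 500001, 500001, 226700)

# Share of observations that sit exactly at the maximum.
# A large share suggests the variable may have been capped (top-coded).
share_at_max <- function(x) {
  mean(x == max(x, na.rm = TRUE), na.rm = TRUE)
}

share_at_max(toy_values)
```

A share close to zero means the maximum is just an ordinary extreme value. A noticeably larger share would be worth raising with the data provider.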

Task 3: Build a Visualization

To visualize the issue with the median_income column (median income), let’s create a boxplot:

library(ggplot2)

# Create a boxplot of Median Income
ggplot(data = housing_data, aes(y = median_income)) +
  geom_boxplot() +
  labs(title = "Boxplot of Median Income in California Housing Data")

This boxplot shows the distribution of median income. However, without clear documentation, it’s challenging to interpret what “median income” means in this context, including its units and whether it refers to households or individuals. This matters because any analysis or model that uses the variable inherits that uncertainty.

# Load necessary libraries
library(ggplot2)

# Create a histogram of Median Income
ggplot(data = housing_data, aes(x = median_income)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Histogram of Median Income in California Housing Data",
    x = "Median Income",
    y = "Frequency"
  )

Task 4: Identify Significant Risks

The main risk of using this data is misinterpreting the columns because of unclear documentation, which can lead to biased or inaccurate analysis and modeling results. For instance, assuming that median_income is per capita income when it actually refers to household income could lead to incorrect estimates.
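The size of such an error can be illustrated with simple arithmetic. In the sketch below, the median_income value and the household size are hypothetical assumptions. Only the unit (tens of thousands of US dollars) comes from the documentation.

```r
# Hypothetical block-level value, in tens of thousands of US dollars
median_income <- 3.5

# Correct reading per the documentation: income per household
household_income_usd <- median_income * 10000

# Assumed average household size (an illustrative assumption, not from the data)
assumed_household_size <- 3

# Wrong reading: treating the value as per capita implies this household income
implied_household_income_usd <- household_income_usd * assumed_household_size

# Factor by which the misreading inflates household income
implied_household_income_usd / household_income_usd
```

Under these assumptions the misreading overstates household income by a factor equal to the household size, which would carry through to any affordability estimate built on it.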

Reducing Negative Consequences:

You can take the steps listed below to reduce these negative consequences:

Contact the Data Source: Get in touch with the data supplier or the dataset’s creator to ask for clarification on the ambiguous columns.

Feature Engineering: If the documentation is still unclear, consider deriving new features or transforming existing ones so that they are better suited to your research question. You might, for instance, estimate per capita income from median_income, households, and population if necessary.

Sensitivity Analysis: Repeat the analysis under several plausible interpretations of the ambiguous columns and compare the outcomes. This shows how much these uncertainties could affect your models.
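The feature engineering and sensitivity analysis steps can be sketched together. The example below runs on simulated data, not on housing.csv, and every number in it is an assumption made for illustration. The derived column income_per_person is a rough proxy (median household income times households, divided by population), not an exact per capita figure.

```r
set.seed(42)  # make the simulated data reproducible
n <- 200

# Simulated blocks; column names mirror the dataset, values are made up
toy <- data.frame(
  median_income = runif(n, 1, 10),              # tens of thousands of USD
  population    = round(runif(n, 300, 3000)),
  households    = round(runif(n, 100, 1000))
)
toy$median_house_value <- 40000 * toy$median_income + rnorm(n, sd = 30000)

# Feature engineering: rough income per person, in US dollars
toy$income_per_person <- toy$median_income * 10000 * toy$households / toy$population

# Sensitivity analysis: fit the same model under both interpretations
fit_household  <- lm(median_house_value ~ median_income, data = toy)
fit_per_person <- lm(median_house_value ~ income_per_person, data = toy)

# Compare how well each interpretation explains house values
r2 <- c(
  household  = summary(fit_household)$r.squared,
  per_person = summary(fit_per_person)$r.squared
)
print(round(r2, 3))
```

If the two fits lead to similar conclusions on the real data, the ambiguity matters little for that analysis. If they diverge, the interpretation of the column needs to be settled before the results are reported.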

In conclusion, clear documentation and a sound understanding of the data are essential for effective analysis and modeling, even for a detail as small as whether a column describes households or individuals. The California Housing dataset highlights the value of precise documentation and the dangers of misinterpretation. Resolving these problems may require further research or consultation with the data suppliers.

Data dive week 5

Sharmista Kothavadla

2023-09-25
