Data Preparation

The data set describes housing in California. Its columns are longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity (a categorical measure of distance to the ocean, with levels <1H OCEAN, INLAND, ISLAND, NEAR BAY, and NEAR OCEAN).

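# Load the data. stringsAsFactors = TRUE reads ocean_proximity as a factor
# (since R 4.0, read.csv returns character columns by default).
getwd()
housingdata <- read.csv("housing.csv", stringsAsFactors = TRUE)
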
str(housingdata)

dim(housingdata)

head(housingdata)
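
Beyond str(), dim(), and head(), it can help to confirm column types and count missing values before computing any statistics. The following is a minimal sketch on a small made-up data frame; `toy` and all of its values are hypothetical stand-ins, and with the real data `housingdata` would take its place.

```r
# Toy stand-in for housingdata (values are made up for illustration only)
toy <- data.frame(
  total_bedrooms = c(129, NA, 190, 235),
  median_house_value = c(452600, 358500, 352100, 341300),
  ocean_proximity = factor(c("NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN"))
)

# Class of every column: shows which variables are numeric and which are factors
col_classes <- sapply(toy, class)
print(col_classes)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Columns with a nonzero count need a decision (drop the rows or pass na.rm = TRUE) before functions such as mean() or cor() are applied to them.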

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

The research question: is median house value associated with the total number of rooms and with ocean proximity?
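
One possible way to examine a question of this form is a correlation for the quantitative predictor plus a linear model that includes the categorical one. This is only a sketch under assumptions: the data frame `toy` and its values are made up, and the choice of lm() is an illustration, not a method stated in this proposal.

```r
# Toy stand-in for housingdata (made-up values, illustration only)
toy <- data.frame(
  median_house_value = c(100000, 150000, 210000, 240000, 310000, 350000),
  total_rooms = c(800, 1200, 1500, 2100, 2600, 3000),
  ocean_proximity = factor(c("INLAND", "INLAND", "INLAND",
                             "NEAR BAY", "NEAR BAY", "NEAR BAY"))
)

# Pearson correlation between the response and the quantitative predictor
r <- cor(toy$median_house_value, toy$total_rooms)
print(r)

# Linear model with one quantitative and one categorical predictor.
# R expands the factor into indicator variables automatically.
fit <- lm(median_house_value ~ total_rooms + ocean_proximity, data = toy)
print(coef(fit))
```

Because the study is observational, any association found this way describes the data and does not by itself establish causation.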

Cases

What are the cases, and how many are there? There are 20640 cases with 10 variables. Each case describes a group of houses in California (the median and total columns indicate that rows are area-level aggregates rather than individual sales) and records longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

Data collection

Describe the method of data collection. The data set was downloaded from Kaggle, where it is publicly available.

Type of study

What type of study is this (observational/experiment)? This is an observational study: we draw inferences from previously collected data and look for associations, without assigning any treatment.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link. The data was taken from Kaggle: https://www.kaggle.com/harrywang/housing/downloads/housing.csv/4

Dependent Variable

What is the response variable? Is it quantitative or qualitative? The response variable is median_house_value, which is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative. The independent variables are housing_median_age and ocean_proximity: housing_median_age is quantitative and ocean_proximity is qualitative (categorical).

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Housing median age: the histogram appears bimodal. The mean (28.64) and median (29) are nearly equal, so the distribution is not strongly skewed overall, and at least three quarters of the values are 18 years or more (first quartile = 18).

summary(housingdata$housing_median_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   29.00   28.64   37.00   52.00
hist(housingdata$housing_median_age)
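
A reading of skewness from the histogram can be cross-checked against the printed summary. The sketch below uses only the quartiles and mean shown above; the choice of Bowley's quartile skewness as the check is an assumption made for illustration, and the variable names are hypothetical.

```r
# Quartiles and mean as printed by summary(housingdata$housing_median_age)
q1 <- 18
med <- 29
q3 <- 37
avg <- 28.64

# Bowley (quartile) skewness: 0 when the quartiles are symmetric around
# the median, positive for right skew, negative for left skew
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)
print(bowley)

# A mean below the median also points away from right skew
mean_minus_median <- avg - med
print(mean_minus_median)
```

Both quantities describe only the overall shape; they say nothing about the number of peaks, which still has to be judged from the histogram.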

Ocean proximity: the <1H OCEAN category has the most cases (9136), followed by INLAND (6551), NEAR OCEAN (2658), and NEAR BAY (2290); ISLAND has the fewest (5).

# Bar chart of the number of cases in each ocean_proximity category
barplot(table(housingdata$ocean_proximity))

summary(housingdata$ocean_proximity)
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658
# Side-by-side boxplots of house value for each ocean_proximity category
boxplot(median_house_value ~ ocean_proximity, data = housingdata)
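
To put numbers next to the plot of house value against ocean proximity, the response can be summarized within each category. This is a minimal sketch on made-up data; `toy` and its values are hypothetical, and `housingdata` would replace it in practice.

```r
# Toy stand-in for housingdata (made-up values, illustration only)
toy <- data.frame(
  median_house_value = c(120000, 140000, 260000, 300000, 280000),
  ocean_proximity = factor(c("INLAND", "INLAND", "NEAR BAY",
                             "NEAR BAY", "NEAR OCEAN"))
)

# Median house value within each ocean_proximity category
group_medians <- tapply(toy$median_house_value, toy$ocean_proximity, median)
print(group_medians)

# Number of cases per category, to flag groups too small to compare
group_counts <- table(toy$ocean_proximity)
print(group_counts)
```

Checking the counts matters here because the ISLAND category has only 5 cases, so its summary statistics are far less stable than those of the larger groups.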