The data set describes housing in California. Its columns are longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity (a categorical indicator of closeness to the ocean: <1H OCEAN / INLAND / ISLAND / NEAR BAY / NEAR OCEAN).
# load data
getwd()
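# stringsAsFactors = TRUE keeps ocean_proximity a factor (no longer the default since R 4.0)
housingdata <- read.csv("housing.csv", stringsAsFactors = TRUE)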
str(housingdata)
dim(housingdata)
head(housingdata)
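Before going further, it may help to confirm that the loaded data frame has the columns listed in the description and to see how many values are missing in each. The helper below is only a minimal sketch: `check_housing` and `expected_cols` are names introduced here for illustration and are not part of the original analysis.

```r
# Column names as listed in the data description.
expected_cols <- c("longitude", "latitude", "housing_median_age",
                   "total_rooms", "total_bedrooms", "population",
                   "households", "median_income", "median_house_value",
                   "ocean_proximity")

# Hypothetical helper: reports expected columns that are absent
# and the number of NA values in each column that is present.
check_housing <- function(df) {
  list(
    missing_columns = setdiff(expected_cols, names(df)),
    na_counts = colSums(is.na(df))
  )
}

# Usage, assuming housingdata has been loaded with read.csv():
# check_housing(housingdata)
```

An empty `missing_columns` and the per-column `na_counts` give a quick indication of whether any variable needs cleaning before it is summarized.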
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
The research question is: is house value associated with total rooms and ocean proximity?
What are the cases, and how many are there? There are 20640 cases with 10 variables. Because the variables are aggregates (medians and totals such as housing_median_age, total_rooms, and households), each case appears to describe a group of houses in California rather than a single house sale. Each case is described by longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
Describe the method of data collection. The data set was downloaded from Kaggle, where it is publicly available.
What type of study is this (observational/experiment)? This is an observational study: we draw inferences from data that were already collected and look for associations between variables.
If you collected the data, state self-collected. If not, provide a citation/link. The data were taken from Kaggle. The URL for the data set is https://www.kaggle.com/harrywang/housing/downloads/housing.csv/4
What is the response variable? Is it quantitative or qualitative? The response variable is house value (median_house_value). It is a quantitative variable.
You should have two independent variables, one quantitative and one qualitative. The independent variables are housing_median_age, which is numerical, and ocean_proximity, which is categorical.
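With one quantitative and one qualitative explanatory variable, one possible way to relate them to the response is a linear model with both terms. The sketch below is an assumption about how such an analysis could look, not the method used in this proposal. `fit_value_model` is a hypothetical helper name.

```r
# Hypothetical helper: regress house value on house age and ocean proximity.
# Column names match the data set; the modelling choice itself is an assumption.
fit_value_model <- function(df) {
  df$ocean_proximity <- factor(df$ocean_proximity)  # ensure it is categorical
  lm(median_house_value ~ housing_median_age + ocean_proximity, data = df)
}

# Usage, assuming housingdata has been loaded with read.csv():
# fit <- fit_value_model(housingdata)
# summary(fit)  # coefficients and R-squared
# anova(fit)    # sequential F-tests for each term
```

In such a model the ocean_proximity coefficients are differences from a baseline category, while the housing_median_age coefficient is a slope shared by all categories.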
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
House median age: The histogram looks bimodal and somewhat right skewed, although the mean (28.64) and median (29) are nearly equal, so any skew is mild. Most cases have a median house age above 15 years: the first quartile is 18.
summary(housingdata$housing_median_age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 29.00 28.64 37.00 52.00
hist(housingdata$housing_median_age)
Ocean proximity: The largest category is <1H OCEAN (9136 cases), followed by INLAND (6551), NEAR OCEAN (2658), and NEAR BAY (2290). ISLAND has the fewest, with only 5 cases.
barplot(table(housingdata$ocean_proximity))
summary(housingdata$ocean_proximity)
## <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
## 9136 6551 5 2290 2658
boxplot(median_house_value ~ ocean_proximity, data = housingdata)
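The plot of house value against ocean proximity can be complemented with per-category summary statistics of the response variable. The following is a minimal sketch. `value_by_proximity` is a hypothetical helper introduced here, and the choice of statistics (count, median, mean) is an assumption.

```r
# Hypothetical helper: count, median, and mean of house value
# within each ocean_proximity category.
value_by_proximity <- function(df) {
  groups <- split(df$median_house_value, df$ocean_proximity)
  data.frame(
    ocean_proximity = names(groups),
    n = sapply(groups, length),
    median = sapply(groups, median),
    mean = sapply(groups, mean),
    row.names = NULL
  )
}

# Usage, assuming housingdata has been loaded with read.csv():
# summary(housingdata$median_house_value)
# value_by_proximity(housingdata)
# plot(median_house_value ~ housing_median_age, data = housingdata)
```

The commented `plot()` call would draw a scatter plot of the response against the quantitative explanatory variable.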