Final report

library(tidyverse)
library(plotly)

df <- read_csv("C:/Users/ronald.hernandez/Documents/tareas coursera/Data Visualization Capstone/Tarea 1/housing_price_King_County_extended.csv")

This project analyzes house sale prices in King County. The dataset is available at: https://www.kaggle.com/datasets/vaseline555/house-price-of-king-county-extended

The purpose of the project is to identify the zipcodes with the cheapest houses and to determine which of them offers the best conditions.

The dataset contains features such as price, number of bedrooms and bathrooms, living area in square feet, number of floors, whether the house is on the waterfront, zipcode, whether the house has been renovated, and features related to weather and crime.

head(df)

## # A tibble: 6 x 26
##           id date                 price bedrooms bathrooms sqft_living sqft_lot
##        <dbl> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
## 1 7129300520 2014-10-13 00:00:00 221900        3      1           1180     5650
## 2 6126500060 2014-11-24 00:00:00 329950        3      1.75        2080     5969
## 3 4060000240 2014-06-23 00:00:00 205425        2      1            880     6780
## 4 3454800060 2015-01-08 00:00:00 171800        4      2           1570     9600
## 5 4058801670 2014-07-17 00:00:00 445000        3      2.25        2100     8201
## 6 7549802535 2014-11-11 00:00:00 423000        4      2           1970     6480
## # ... with 19 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## #   condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## #   yr_built <dbl>, is_renovated <dbl>, zipcode <chr>, lat <dbl>, long <dbl>,
## #   sqft_living15 <dbl>, sqft_lot15 <dbl>, summer_high <dbl>, winter_low <dbl>,
## #   precipitation <dbl>, violent_crime <dbl>, property_crime <dbl>

First, we use a boxplot to examine house prices grouped by zipcode:

As we can see, the cheapest zones are: ALGONA, AUBURN, MIDWAY, TUKWILA, WABASH, WHITE CENTER, WILDERNESS VILLAGE

Now we visualize these zones separately in the following graph:
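The code below uses a data frame named df_compra that is not defined in the visible part of this report. A minimal sketch of how such a subset could be built, assuming it simply keeps the rows whose zipcode is one of the seven zones listed above. The toy data frame, its prices, and the OTHER_ZONE labels are made up for illustration, and the result is named df_compra_sketch so it is not confused with the real object:

```r
library(dplyr)

# Made-up stand-in for df (the real data has 26 columns); values are illustrative only
toy_df <- data.frame(
  zipcode = c("ALGONA", "AUBURN", "OTHER_ZONE_1", "MIDWAY", "OTHER_ZONE_2"),
  price   = c(210000, 250000, 600000, 230000, 900000)
)

# The seven cheapest zones named in the text
cheap_zones <- c("ALGONA", "AUBURN", "MIDWAY", "TUKWILA", "WABASH",
                 "WHITE CENTER", "WILDERNESS VILLAGE")

# Assumption: the subset is a plain filter on zipcode
df_compra_sketch <- toy_df %>%
  filter(zipcode %in% cheap_zones)
```

With the real data, the same filter would be applied to df instead of toy_df.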

It is difficult to appreciate differences between the prices in the samples of each zipcode, so the next chart plots the median price. We use the median because the data contain outliers, as the previous graph shows.
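The code for the median chart is not shown in this report. A minimal sketch of how the medians could be computed and plotted, on made-up toy data standing in for df_compra. The zone labels, prices, and object names here are hypothetical. The toy values also show why the median is preferred: one expensive house pulls the mean of ZONE_A far above its median.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data; ZONE_A contains one outlier (900000)
toy_prices <- data.frame(
  zipcode = c("ZONE_A", "ZONE_A", "ZONE_A", "ZONE_B", "ZONE_B", "ZONE_B"),
  price   = c(100000, 200000, 900000, 150000, 250000, 300000)
)

# Median and mean price per zone, to compare their sensitivity to outliers
median_by_zone <- toy_prices %>%
  group_by(zipcode) %>%
  summarize(median_price = median(price),
            mean_price   = mean(price))

# Bar chart of the median price per zone
p_median <- ggplot(median_by_zone, aes(x = zipcode, y = median_price)) +
  geom_col() +
  labs(x = "zipcode", y = "Median price")
```

Printing p_median (or wrapping it in ggplotly(), as elsewhere in this report) displays the chart.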

To verify whether there are significant differences between the median prices, we conduct a Kruskal-Wallis test:

kruskal.test(df_compra$price~df_compra$zipcode)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  df_compra$price by df_compra$zipcode
## Kruskal-Wallis chi-squared = 314.11, df = 6, p-value < 2.2e-16

The p-value is less than 0.05, so we conclude that there are significant differences in the median prices of houses among the zones.
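As a small illustration of how the test output is read, the sketch below runs the same test on made-up toy data with three clearly separated groups, using the formula interface with a data argument. The zone labels, prices, and object names are hypothetical. The degrees of freedom equal the number of groups minus one, which matches df = 6 for the seven zones above.

```r
# Made-up toy data: three zones with non-overlapping price ranges
toy_kw <- data.frame(
  zipcode = rep(c("ZONE_A", "ZONE_B", "ZONE_C"), each = 5),
  price   = c(1:5, 6:10, 11:15) * 1000
)

# Same test as above, written with a formula and a data argument
kw <- kruskal.test(price ~ zipcode, data = toy_kw)

kw$statistic  # Kruskal-Wallis chi-squared
kw$parameter  # degrees of freedom (number of groups - 1)
kw$p.value    # compared against the 0.05 threshold
```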

The following comparisons show the pairwise statistical differences. Combining them with the bar chart, we take AUBURN, MIDWAY, TUKWILA, and WABASH as the cheapest zones.

pairwise.wilcox.test(df_compra$price, df_compra$zipcode, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  df_compra$price and df_compra$zipcode 
## 
##                    ALGONA  AUBURN  MIDWAY  TUKWILA WABASH  WHITE CENTER
## AUBURN             0.3804  -       -       -       -       -           
## MIDWAY             1.0000  0.0016  -       -       -       -           
## TUKWILA            1.0000  1.0000  0.0071  -       -       -           
## WABASH             0.0074  0.3483  4.4e-05 0.4875  -       -           
## WHITE CENTER       3.3e-10 2.7e-11 1.7e-11 3.8e-08 0.5624  -           
## WILDERNESS VILLAGE < 2e-16 < 2e-16 < 2e-16 < 2e-16 9.9e-11 8.9e-08     
## 
## P value adjustment method: bonferroni

The next plot shows the distribution of sqft_living: in this sample, houses in MIDWAY and AUBURN have more living area.

The following plot shows that AUBURN and MIDWAY have a better distribution of the number of bedrooms.

The next plot shows that AUBURN has a better distribution of the number of bathrooms.

The crime values are very similar across these zones.

# Step 1: Compute the medians per "zipcode"
df_medianas <- df_compra %>%
  group_by(zipcode) %>%
  summarize(mediana_precipitation = median(precipitation),
            mediana_summer_high = median(summer_high),
            mediana_winter_low = median(winter_low))

# Step 2: Convert the data frame to long format
df_medianas_long <- df_medianas %>%
  pivot_longer(cols = c(mediana_precipitation, mediana_summer_high, mediana_winter_low),
               names_to = "Variable",
               values_to = "Mediana")

# Step 3: Create the grouped bar chart with ggplot2
ggplotly(ggplot(df_medianas_long, aes(x = zipcode, y = Mediana, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "zipcode", y = "Median") +
  ggtitle("Median values of Precipitation, Summer High and Winter Low per zipcode") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)))

Now we investigate the relationship between house price and sqft_living (our variable of interest).

The relationships are linear, but AUBURN has a better slope than MIDWAY. These are the two zones that stood out in the previous analysis.
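The slope comparison can also be made numerically rather than only by eye. A minimal sketch, assuming one simple linear regression of price on sqft_living per zone. The toy data are made up and noise-free, with slopes of 100 and 60 chosen for illustration, and the zone labels and object names are hypothetical:

```r
# Made-up toy data with an exact linear relation in each zone
toy_lm <- data.frame(
  zipcode     = rep(c("ZONE_A", "ZONE_B"), each = 4),
  sqft_living = rep(c(1000, 1500, 2000, 2500), times = 2)
)
toy_lm$price <- ifelse(toy_lm$zipcode == "ZONE_A",
                       50000 + 100 * toy_lm$sqft_living,
                       80000 + 60 * toy_lm$sqft_living)

# Fit price ~ sqft_living separately per zone and keep the slope (price per extra square foot)
slopes <- sapply(split(toy_lm, toy_lm$zipcode), function(d) {
  coef(lm(price ~ sqft_living, data = d))[["sqft_living"]]
})

slopes
```

Applied to the real data, the same pattern would return one slope per zone in df_compra, which can then be compared directly.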

##

## `geom_smooth()` using formula = 'y ~ x'

Finally, we investigate the relationship between price and sqft_living using a GAM model. The smooth effect of sqft_living on price varies little between 4,000 and 5,000 square feet, which suggests that this is the optimal house size for the price in AUBURN, King County.

# Keep only the AUBURN observations
df_compra2 <- df_compra %>% filter(zipcode == "AUBURN")

library(mgcv)

# GAM: smooth term for sqft_living, linear terms for the remaining covariates
modelo_gam <- gam(price ~ s(sqft_living) + bedrooms + bathrooms + is_renovated + yr_built + grade + condition + sqft_lot, data = df_compra2)

plot(modelo_gam, main = "Smooth function of relationship between price and sqft_living")
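A flat region of the smooth can also be checked numerically with predict(). The sketch below is only an illustration on simulated data. The saturating price curve, the noise level, and the object names are assumptions, and the model has a single smooth term, whereas the model above would also need values for the other covariates in newdata.

```r
library(mgcv)

set.seed(1)

# Simulated data: price rises quickly for small houses and levels off for large ones
n <- 200
toy_gam <- data.frame(sqft_living = runif(n, 500, 6000))
toy_gam$price <- 100000 + 300000 * (1 - exp(-toy_gam$sqft_living / 1500)) +
  rnorm(n, sd = 10000)

# GAM with one smooth term
fit <- gam(price ~ s(sqft_living), data = toy_gam)

# Predicted price at three sizes; a small change between 4,000 and 5,000
# indicates that the smooth is nearly flat in that range
grid <- data.frame(sqft_living = c(1000, 4000, 5000))
pred <- predict(fit, newdata = grid)

diff(pred)
```

summary(fit) additionally reports the effective degrees of freedom of the smooth term, which indicates how far the fitted relationship departs from a straight line.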