library(ggplot2)
library(dplyr)
homes = read.csv('https://raw.githubusercontent.com/vittorioaddona/data/main/homes.csv')
# Enter your R commands below:
ggplot(data=homes,aes(x=Price))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=homes,aes(y=Price))+geom_boxplot()
Boxplot: right-skewed data will have the box (the middle 50% of the
data) at lower values, with a lot of high outliers.
Histogram: right-skewed data will show a tail towards the right and
the “bump” towards the left.
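A minimal sketch of the boxplot pattern, using simulated right-skewed values rather than the homes data (the exponential sample and the seed are assumptions made for illustration). It counts the points that a boxplot would flag as outliers on each side.

```r
# Simulated right-skewed sample (hypothetical; not the homes data)
set.seed(42)
x <- rexp(500)

# boxplot.stats() returns the five whisker/hinge values and the flagged outliers
stats <- boxplot.stats(x)

# Outliers above the upper whisker versus below the lower whisker
high_out <- sum(stats$out > stats$stats[5])
low_out  <- sum(stats$out < stats$stats[1])

print(c(high = high_out, low = low_out))
```

For a right-skewed sample like this one, the flagged points should all sit on the high side.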
# Enter your R commands below:
homes %>% summarize (median(Price))
## median(Price)
## 1 189900
homes %>% summarize (mean(Price))
## mean(Price)
## 1 211966.7
Median price: 189900
Mean price: 211966.7
If the data were symmetric, the mean and median would be about the same. In skewed data such as these, the two values differ: here the mean (211966.7) is higher than the median (189900).
When data are right skewed, there is a tail off to the right, meaning the most extreme values are on the high side. Those values affect the mean more than the median.
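A small sketch of why one extreme value moves the mean but not the median. The price vectors are made up for illustration and are not taken from homes.

```r
# Hypothetical prices: mostly moderate, one very high value
prices <- c(100, 120, 130, 150, 160, 900)

mean_price   <- mean(prices)    # pulled upward by the single extreme value
median_price <- median(prices)  # depends only on the middle values

# Same data with the extreme value replaced by a typical one
prices_trimmed <- c(100, 120, 130, 150, 160, 170)
mean_trimmed   <- mean(prices_trimmed)
median_trimmed <- median(prices_trimmed)

print(c(mean = mean_price, median = median_price))
print(c(mean = mean_trimmed, median = median_trimmed))
```

The median is the same in both vectors, while the mean changes a great deal.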
homes%>%summarize(quantile(Price,0.025))
## quantile(Price, 0.025)
## 1 85000
homes%>%summarize(quantile(Price,0.975))
## quantile(Price, 0.975)
## 1 453421.4
2.5th quantile: 85000
97.5th quantile: 453421.4
The middle 95% of the data lies between the 2.5th quantile and the 97.5th quantile.
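One way to check this claim is to compute the two quantiles and then count the share of values that fall between them. The sketch below does this on simulated right-skewed values (an assumption; the same steps could be applied to homes$Price).

```r
# Simulated right-skewed values (hypothetical; not the homes data)
set.seed(1)
x <- rexp(1000, rate = 1 / 200000)

lower <- quantile(x, 0.025)
upper <- quantile(x, 0.975)

# Proportion of values between the two quantiles
inside <- mean(x >= lower & x <= upper)
print(inside)
```

The proportion should come out at, or very close to, 0.95.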
var(homes$Price)
## [1] 9690707465
Variance: 9690707465
The standard deviation is the square root of the variance.
Variance is a squared value, so it is measured in squared units (here, dollars squared). The standard deviation fits better with the data because it is the square root of that squared value, so it is in the same units as the data itself.
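A short sketch of the relationship, assuming a small made-up price vector for the first check. The last line applies the square root to the variance reported above.

```r
# Hypothetical prices in dollars (not the homes data)
x <- c(150000, 180000, 210000, 240000, 400000)

v <- var(x)  # in dollars squared
s <- sd(x)   # in dollars

# sd() agrees with the square root of var()
print(all.equal(sqrt(v), s))

# Square root of the variance reported for homes$Price
sd_homes <- sqrt(9690707465)
print(sd_homes)
```

The second value is the standard deviation of Price in dollars, which can be compared directly with the mean and median.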
mutate function:
homes = homes %>%
  mutate( Central.Air.Cat = case_when( Central.Air == 0 ~ "No" ,
                                       Central.Air == 1 ~ "Yes"
                                     ) ,
          New.Construct.Cat = case_when( New.Construct == 0 ~ "No" ,
                                         New.Construct == 1 ~ "Yes"
                                       )
        )
ggplot(data=homes,aes(x=Central.Air.Cat,y=Price))+geom_boxplot()
The R commands create two new categorical variables, Central.Air.Cat
and New.Construct.Cat, by recoding the 0/1 values of Central.Air and
New.Construct as "No" and "Yes".
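The same kind of recoding can be sketched in base R with factor() instead of case_when(). This is an alternative technique, shown on a made-up 0/1 vector, and it produces a factor rather than a character column.

```r
# Hypothetical 0/1 indicator (not the homes data)
central_air <- c(0, 1, 1, 0, 1)

# Map 0 to "No" and 1 to "Yes"
central_air_cat <- factor(central_air, levels = c(0, 1), labels = c("No", "Yes"))

# Cross-check the original codes against the new labels
print(table(central_air, central_air_cat))
```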
No, because the presence of central air does not appear to substantially increase or decrease the price of a home.
ggplot(data=homes,aes(x=Price,y=Pct.College))+geom_point()
ggplot(data=homes,aes(x=Price,y=Living.Area))+geom_point()
ggplot(data=homes,aes(x=Living.Area,y=New.Construct.Cat))+geom_boxplot()
homes %>%
group_by(New.Construct.Cat) %>%
summarize(mean(Living.Area))
## # A tibble: 2 × 2
## New.Construct.Cat `mean(Living.Area)`
## <chr> <dbl>
## 1 No 1718.
## 2 Yes 2497.
Diamonds = read.csv("https://raw.githubusercontent.com/vittorioaddona/data/main/Diamonds.csv")
ggplot( data=Diamonds , aes( x=carat , y=price ) )+geom_point()
ggplot(data=Diamonds,aes(x=price,y=color))+geom_boxplot()
ggplot(data=Diamonds,aes(x=price,y=certification))+geom_boxplot()
I would say that certification is a better explanatory variable
because the interquartile range of the data (the distance between the
two sides of the box) is much smaller. This means that within a given
certification, the price varies much less than it does within a given
color.
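The box widths can also be compared numerically by computing the interquartile range within each group. A minimal sketch on a made-up data frame follows; the column names price and group and all of the values are assumptions for illustration.

```r
# Hypothetical data: group A has tightly clustered prices, group B is spread out
toy <- data.frame(
  price = c(1000, 1100, 1200, 1300, 500, 1500, 2500, 3500),
  group = rep(c("A", "B"), each = 4)
)

# Interquartile range of price within each group
iqr_by_group <- tapply(toy$price, toy$group, IQR)
print(iqr_by_group)
```

A smaller interquartile range within groups suggests that knowing the group narrows down the price more.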
ggplot(data=Diamonds,aes(x=price,y=clarity))+geom_boxplot()
The highest-clarity diamonds also have the lowest median price.
It’s possible that flawless diamonds have the lowest median price
because so few of them are bought that it is very hard to get a good
trend. The large number of extremely high-priced outliers in the graph
is consistent with this.
homes %>%
count( Waterfront ) %>%
mutate( Percentage = prop.table(n)*100 )
It's calculating the count and percentage of homes with and without a waterfront.
### (b)
homes %>%
mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
Waterfront == 1 ~ "Yes"
) ) %>%
select( WaterfrontCat , New.Construct.Cat ) %>%
table( )
This command builds a two-way table of counts that cross-classifies
homes by whether they are on a waterfront (rows) and whether they are
a new construction (columns).
homes %>%
filter( Living.Area < 1600 ) %>%
summarize( max(Price) )
This command gives the maximum price among homes with a living area
of less than 1600 square feet.
homes %>%
group_by( Bedrooms ) %>%
summarize( mean(Living.Area) , median(Living.Area) , sd(Living.Area) )
This command computes the mean, median, and standard deviation of living area separately for each number of bedrooms.
homes %>%
mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
Waterfront == 1 ~ "Yes"
) ) %>%
select( WaterfrontCat , New.Construct.Cat ) %>%
table( ) %>%
prop.table( margin=1 )*100 ### HINT: margin=1 does things by rows
Row percentages: within each waterfront category (No or Yes), the
percentage of homes that are and are not a new construction. Each row
sums to 100.
homes %>%
mutate( WaterfrontCat = case_when( Waterfront == 0 ~ "No" ,
Waterfront == 1 ~ "Yes"
) ) %>%
select( WaterfrontCat , New.Construct.Cat ) %>%
table( ) %>%
prop.table( margin=2 )*100 ### HINT: margin=2 does things by columns
Column percentages: within each new-construction category (No or Yes),
the percentage of homes that are and are not on the water. Each column
sums to 100.
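The difference between margin=1 and margin=2 can be sketched on a small table of made-up counts. The counts and the dimension names below are assumptions and are not the homes results.

```r
# Hypothetical 2 x 2 table of counts
tab <- as.table(matrix(
  c(40, 10,
    30, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(Waterfront = c("No", "Yes"), NewConstruct = c("No", "Yes"))
))

row_pct <- prop.table(tab, margin = 1) * 100  # percentages within each row
col_pct <- prop.table(tab, margin = 2) * 100  # percentages within each column

print(row_pct)
print(rowSums(row_pct))  # each row totals 100
print(col_pct)
print(colSums(col_pct))  # each column totals 100
```

With margin=1 the first row reads 80 and 20, because 40 and 10 are divided by the row total of 50.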
homes %>%
  filter( Living.Area < 1600 ) %>%
  ggplot( aes(x=Age,y=Price) ) +
  geom_point( ) +
  ggtitle( "Price versus age for homes with living area under 1600" ) +
  xlab( "Age of the home (in years)" ) +
  ylab( "Price of the home (in US dollars)" )
Scatter plot showing the price of a home versus its age, for homes with a living area of less than 1600 square feet.
homes %>%
mutate( BedBathRatio = Bedrooms / Bathrooms ) %>%
ggplot( aes(x="",y=BedBathRatio) ) +
geom_violin() +
xlab("")
## Warning: Removed 1 rows containing non-finite values (`stat_ydensity()`).
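Violin plot showing the distribution of the bedroom-to-bathroom ratio. The warning reports that one home was dropped because its ratio was not a finite number.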
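One way a non-finite ratio can arise is division by zero. The sketch below uses made-up bedroom and bathroom counts; whether this is what happened in homes is an assumption, since a missing value would trigger the same warning.

```r
# Hypothetical counts; the second home has zero bathrooms
bedrooms  <- c(3, 4, 2)
bathrooms <- c(2, 0, 1)

ratio <- bedrooms / bathrooms  # 4 / 0 gives Inf in R
print(ratio)

# Flag which ratios a plot could actually use
print(is.finite(ratio))
```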