Problem 1

For a data set of your choosing, make a faceted plot using the trelliscopejs package. You may make any type of plot (scatter plot, histogram, etc.), but, as mentioned in the discussion below, you must explain why you chose this plot and what you are investigating about the variable you are graphing.

The trelliscope plot must include one cognostic measure of your own. Include a description of what it is and what information this measure gives.

# Read in data
alleg <- read.csv("allegheny_county_master_file (2).csv")
# Check structure and the unique values of MUNIDESC
#unique(alleg$MUNIDESC)
str(alleg$LOTAREA)
##  int [1:584107] 8320 282835 106321 9000 1800 54582 54581 5768 9568 21388 ...
str(alleg$MUNIDESC)
##  chr [1:584107] "1st Ward  - PITTSBURGH" "1st Ward  - PITTSBURGH" ...
# Try to make scatterplot of SALEPRICE vs LOTAREA faceted by municipality
# Change MUNIDESC to factor
alleg$MUNIDESC <- as.factor(alleg$MUNIDESC)
str(alleg$MUNIDESC)
##  Factor w/ 175 levels "10th Ward -  McKEESPORT",..: 17 17 17 17 17 17 17 17 17 17 ...
# Check values of usedesc
#unique(alleg$USEDESC)
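# Filter to only single family homes (subset() is base R, so no extra package is needed)
alleg_new <- subset(alleg, USEDESC == "SINGLE FAMILY")
# Remove the double-space padding from MUNIDESC so exact matches such as "Shaler" work
# Note: str_replace() removes only the first "  " in each value and returns character
library(stringr)
alleg_new$MUNIDESC <- str_replace(alleg_new$MUNIDESC, "  ", "")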

# Make one scatterplot to make sure it works
# Test on Shaler
shaler_test <- subset(alleg_new, MUNIDESC == "Shaler")
# Filter where SALEPRICE < 75000 and LOTAREA < 500000
library(ggplot2)
shaler_filter <- subset(shaler_test, SALEPRICE < 75000 & LOTAREA < 500000)
                        
# Test scatterplot for Shaler
ggplot(shaler_test, aes(x = LOTAREA, y = SALEPRICE)) + 
  geom_point()
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Scatterplot of the filtered Shaler data
ggplot(shaler_filter, aes(x = LOTAREA, y = SALEPRICE)) + 
  geom_point()

# Now filter every house by those conditions
alleg_filter <- subset(alleg_new, SALEPRICE < 75000 & LOTAREA < 500000)

alleg_filter$LOTAREA <- as.numeric(alleg_filter$LOTAREA)
# Add LOTAREA SD as new label and cognostic
library(trelliscopejs)
## Warning: package 'trelliscopejs' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
alleg_filter_cog <- alleg_filter %>%
                       group_by(MUNIDESC) %>% 
                       mutate(LOTAREA_SD = sd(LOTAREA)) 


# Description to cog
alleg_filter_cog$LOTAREA_SD <- cog(alleg_filter_cog$LOTAREA_SD, 
                                   desc = "Standard Deviation of LOTAREA", default_label = TRUE)
# Create trelliscope graph 
# Scatterplot of SALEPRICE vs LOTAREA for single family homes faceted by municipality
# ggplot2 and the tidyverse are already attached above
alleg_filter_cog %>%
  ggplot(aes(x = LOTAREA, y = SALEPRICE)) +
  geom_point() +
  facet_trelliscope(~ MUNIDESC,
                    name = "Single Family Houses",
                    desc = "Scatterplot for Single Family Houses\nIn Allegheny County by Municipality",
                    nrow = 2,
                    ncol = 3,
                    scales = c("free", "free"),
                    path = ".",
                    self_contained = TRUE)
## using data from the first layer


Description 2-3 paragraphs.

Describe the data set. Explain the variable you are graphing in your plots and the reason you are investigating with it. Discuss the reason/motivation you chose the variable to facet on, and what insight or trend you are attempting to investigate. Discuss any challenges you had in making the graphs and how you dealt with these challenges. Name at least one cognostic measure (this can include the cognostic you created or be different) the reader could investigate, and explain any insight they might gain from it.

I thought it would be insightful to restrict the analysis to single family homes in each municipality, which removes the other property types such as other kinds of buildings and businesses. This gives a clearer picture of how housing is set up in each municipality. As in every county and city, some areas are better off than others, and I expect that to show up in the housing market. I am curious whether wealthier municipalities show a stronger positive relationship between lot area and sale price, or whether their points simply sit higher on the plot, with sale prices that are high relative to lot area. In the code I converted LOTAREA to numeric and made the municipality variable (MUNIDESC) a factor so that I could facet and group by it. I started with the municipality where I live, Shaler. When I first tried to filter on Shaler it did not work, so I looked at the structure of MUNIDESC and saw that the values contained extra white space. I used the str_replace() function from the stringr package to remove it so that I could filter on the name directly. Making sure the strings matched was the main challenge I faced in making this plot. I also saw that the data had some very high outliers, so I filtered them out and redrew the scatterplot. I then applied the same limits on LOTAREA and SALEPRICE to the whole data set.
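The whitespace problem described above can be reproduced on a small example. This is only a minimal sketch in base R: the vector `muni` is made-up data rather than values taken from the county file, and `trimws()` is used here as one possible alternative to the `str_replace()` call in the analysis.

```r
# Hypothetical municipality labels with padding (invented for illustration)
muni <- c("Shaler  ", "Fox Chapel  ", "1st Ward  - PITTSBURGH")

# An exact comparison finds nothing while the trailing spaces are present
sum(muni == "Shaler")

# trimws() strips leading and trailing whitespace only,
# so the spacing inside a label is left untouched
muni_clean <- trimws(muni)
sum(muni_clean == "Shaler")

# Replacing the first double space instead also changes interior spacing
sub("  ", "", muni)
```

Trimming only the ends keeps labels such as the ward names unchanged, whereas replacing the first double space removes the gap inside them.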

Next I made the trelliscope graph: a scatterplot of SALEPRICE (y) against LOTAREA (x) on the filtered data set, faceted by municipality. I set both axes to ‘free’ scaling so that each panel uses the range that best fits its own data. The drawback is that I cannot compare panels directly, for example by saying that one municipality looks more strongly correlated than another, because each panel has different axis scales. I accepted this because it makes the individual panels easier to read. When I first made the trelliscope and looked through the available cognostics in the labels menu, I did not see an option for the standard deviation of LOTAREA, so I went back and created it myself. I used mutate() from dplyr after grouping by municipality to add a column holding the standard deviation of LOTAREA, so every property in a municipality carries the same value while the value differs between municipalities.

I think having this number on each panel is important because it partly offsets the drawback of free axis scales: it gives me one quantity, measured in the same units everywhere, that I can compare across panels. The standard deviation describes how spread out the lot areas are around the municipality's mean lot area. For example, if a municipality had a LOTAREA standard deviation of 1000, its lots typically differ from the municipality's mean by roughly 1000 square feet. If a municipality had a standard deviation of 10000, lot sizes vary much more widely, most likely because of some very large properties, which might be more common in wealthier municipalities such as Fox Chapel or Sewickley than in less well-off areas such as the city wards. The standard deviation measures spread rather than skew, but an unusually large value is a hint that a few very large lots may be pulling the distribution to the right. Overall, I found this analysis very interesting. It gives insight into the housing market in each municipality and can help answer questions about local economics, such as which municipalities have the largest properties and which have properties that look overpriced or underpriced for their lot size.
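To make the interpretation of the cognostic concrete, here is a small sketch on invented numbers. The data frame `toy` and all of its values are assumptions for illustration, not figures from the Allegheny County file. It computes a per-municipality standard deviation with base R, and also a per-municipality correlation, which does not depend on axis scaling and so might serve as another cognostic for comparing panels.

```r
# Invented data: two municipalities with five homes each
toy <- data.frame(
  MUNIDESC  = rep(c("A", "B"), each = 5),
  LOTAREA   = c(5000, 5500, 6000, 6500, 7000,
                4000, 5000, 6000, 7000, 40000),
  SALEPRICE = c(50000, 52000, 55000, 60000, 61000,
                30000, 45000, 40000, 60000, 70000)
)

# Standard deviation of lot area within each municipality
lot_sd <- tapply(toy$LOTAREA, toy$MUNIDESC, sd)
lot_sd

# Correlation of lot area and sale price within each municipality
lot_cor <- sapply(split(toy, toy$MUNIDESC),
                  function(d) cor(d$LOTAREA, d$SALEPRICE))
lot_cor

# sd() is the square root of the summed squared deviations divided by n - 1,
# so it is close to, but not the same as, the mean absolute distance from the mean
a <- toy$LOTAREA[toy$MUNIDESC == "A"]
sqrt(sum((a - mean(a))^2) / (length(a) - 1))
mean(abs(a - mean(a)))
```

In this toy data, municipality B contains a single very large lot, which is enough to make its standard deviation far larger than A's. That is the kind of pattern a large LOTAREA_SD label could point to in the real panels.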

Published Trelliscope Page

  • knit the file to an html document

  • publish this to an RPubs page.


grading: trelliscope plot[25 points], discussion[25 points]


Note: you can add a url directly to the text and it will be active in the html (and word document if you knit to that)

Example: https://www.google.com

If you want to be fancy and make your url active text, you can do this