Project 2 Mandate
Choose any three of the “wide” datasets . For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.
Dataset 1 - Median Square Foot Price
Having moved to NYC only a few years ago, I struggle to justify the cost of housing in this area. I wanted to examine the median square foot price comparison over the years between NYC, a few metro regions of my choice and US as a whole. The dataset was sourced from the Home Listsings and Sales section of Zillow.
# Library Imports
library(dplyr)
library(tidyr)
library(stringr)
library(knitr)
library(ggplot2)
library(zoo)
library(plotly)# Dataset Import
mdnprice <- read.csv('MedianHousePrices.csv', header=T, stringsAsFactors = F, check.names = F )
head(mdnprice[,0:4])## RegionName SizeRank 2010-01 2010-02
## 1 United States 0 116.98501 116.9276
## 2 New York, NY 1 235.61896 234.2708
## 3 Los Angeles-Long Beach-Anaheim, CA 2 280.95984 277.4083
## 4 Chicago, IL 3 153.13139 153.0168
## 5 Dallas-Fort Worth, TX 4 83.96947 84.2555
## 6 Philadelphia, PA 5 143.19367 143.9976
As we can see, the data is in a “wide” format with 119 columns.Since the data ranges from Jan 2010 through Sep 2019, it needs quite a bit of tidying up.
# Using gather to move data from a "wide" to a "long" format
mdnprice2 <-gather(mdnprice, year ,mdnsqftprice,3:119)
# Using seperate to create one column for City and another for State
mdnprice3<-tidyr::separate(mdnprice2,RegionName,c("City","State"),sep=",")## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 117 rows
## [1, 844, 1687, 2530, 3373, 4216, 5059, 5902, 6745, 7588, 8431, 9274, 10117,
## 10960, 11803, 12646, 13489, 14332, 15175, 16018, ...].
## City State SizeRank year mdnsqftprice
## 1 United States <NA> 0 2010-01 116.98501
## 2 New York NY 1 2010-01 235.61896
## 3 Los Angeles-Long Beach-Anaheim CA 2 2010-01 280.95984
## 4 Chicago IL 3 2010-01 153.13139
## 5 Dallas-Fort Worth TX 4 2010-01 83.96947
## 6 Philadelphia PA 5 2010-01 143.19367
The two metros I picked to compare were LA and Washington DC
# Creating Subsets for median sq ft price of the US, NYC, LA and DC
USmdn <- subset(mdnprice3, City=="United States")
NYCmdn<- subset(mdnprice3, City=="New York")
DCmdn<- subset(mdnprice3, City=="Washington" & State==' DC')
LAmdn<- subset(mdnprice3, City=="Los Angeles-Long Beach-Anaheim")Compare median Sq ft prices growth between NYC,LA,DC,Aggregated US
# Using plotly to make an interactive visualization to compare month over month price increases
plot_ly(data= NYCmdn, x= ~year,y=~mdnsqftprice,type = 'scatter', mode='lines',name='NYC' ) %>%
add_trace(y =LAmdn$mdnsqftprice, name = 'LA',mode = 'lines')%>%
add_trace(y =DCmdn$mdnsqftprice, name = 'DC',mode = 'lines')%>%
add_trace(y =USmdn$mdnsqftprice, name = 'US',mode = 'lines')%>%
layout(title = 'Growth of Median Sqft Price',
autosize=F,width=900,
yaxis = list(zeroline = FALSE,title='Median Sqft Price'),
xaxis = list(title='Year-Month'),
hovermode = 'compare',
paper_bgcolor='rgb(245,252,255)',
plot_bgcolor='rgba(0,0,0,0)')## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
Analysis to calculate average change in price per city over the duration of the dataset
#Aggregating by city and year. Calculating average yearly price per sq ft
aggr.tbl <- mdnprice4 %>% group_by(City,year) %>%
summarize(Yrly_avg = mean(mdnsqftprice)) # Creating a table that calculates the percentage difference in average yearly prices between 2010 and 2019
pct_chg.tbl<-aggr.tbl %>%
group_by(City) %>%
dplyr::mutate(
pct_change= -(dplyr::first(Yrly_avg)-dplyr::last(Yrly_avg))/first(Yrly_avg))%>%
select(-year,-Yrly_avg)%>%
group_by(City)%>%
summarize(price_change_pct = round(mean(pct_change)*100))## Selecting by price_change_pct
## # A tibble: 7 x 2
## City price_change_pct
## <chr> <dbl>
## 1 Merced 136
## 2 Modesto 121
## 3 Blackfoot 118
## 4 Fernley 109
## 5 Summit Park 106
## 6 Rapid City 100
## 7 Reno 100
## Selecting by price_change_pct
## # A tibble: 7 x 2
## City price_change_pct
## <chr> <dbl>
## 1 Sayre -35
## 2 Atlantic City -23
## 3 Vineland -22
## 4 Roanoke Rapids -20
## 5 Coffeyville -17
## 6 Amsterdam -13
## 7 Bennington -13
Histogram to look at the distribution of price change percent
plot_ly(x=pct_chg.tbl$price_change_pct,type="histogram", color='rgba(255, 182, 193)',nbinsx = 20)%>%
layout(title = 'Distribution of Price Changes since 2010',
autosize=F,width=900,
hovermode = 'compare',
yaxis = list(title='Frequency'),
xaxis = list(title='Percent Change'),
paper_bgcolor='rgb(245,252,255)',
plot_bgcolor='rgba(0,0,0,0)')## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
## Warning: Ignoring 107 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Conclusion
I was surprised to see that:
- LA Real estate prices has increased significantly faster than NYC.
- NYC is not even in the top 1% of cities when it comes to percentage of increase in real estate prices.
- Real estate in the majority of the cities increased (between 0-40%).
- Price dropped in 51 cities
Dataset 2- Religion and Education
For my second Dataset I decided to look at the education breakdown of members from different religions. This dataset was sourced from the Pew Research Organization.
# Dataset Import
religion <- read.csv('religion.csv', header=T, stringsAsFactors = F, check.names = F )
head(religion)#[,0:5]## Religion High_school_or_less Some_college College
## 1 Buddhist 0.20 0.33 0.28
## 2 Catholic 0.46 0.27 0.16
## 3 Evangelical Protestant 0.43 0.35 0.14
## 4 Hindu 0.12 0.11 0.29
## 5 Historically Black Protestant 0.52 0.33 0.09
## 6 Jehovah's Witness 0.63 0.25 0.09
## Post_grad Sample_Size
## 1 0.20 262
## 2 0.10 7157
## 3 0.07 8545
## 4 0.48 197
## 5 0.06 1905
## 6 0.03 241
Yet again the data is in a “wide” format. The columns with suffixes “Bill” & Pay" are histories of bills sent to and payments made by the customer respectively.
## Religion Sample_Size education percentage
## 1 Buddhist 262 High_school_or_less 0.20
## 2 Catholic 7157 High_school_or_less 0.46
## 3 Evangelical Protestant 8545 High_school_or_less 0.43
## 4 Hindu 197 High_school_or_less 0.12
## 5 Historically Black Protestant 1905 High_school_or_less 0.52
## 6 Jehovah's Witness 241 High_school_or_less 0.63
# Figuring out top 5 religions by sample size
religion1 %>%
group_by(Religion)%>%
summarise(Total=mean(Sample_Size))%>%
arrange(-Total)%>%
top_n(5)## Selecting by Total
## # A tibble: 5 x 2
## Religion Total
## <chr> <dbl>
## 1 Evangelical Protestant 8545
## 2 Unaffiliated 7532
## 3 Catholic 7157
## 4 Mainline Protestant 6045
## 5 Historically Black Protestant 1905
# Figuring out bottom 5 religions by sample size
religion1 %>%
group_by(Religion)%>%
summarise(Total=mean(Sample_Size))%>%
arrange(-Total)%>%
top_n(-5)## Selecting by Total
## # A tibble: 5 x 2
## Religion Total
## <chr> <dbl>
## 1 Buddhist 262
## 2 Jehovah's Witness 241
## 3 Muslim 237
## 4 Hindu 197
## 5 Orthodox Christian 185
# aggregated dataset
aggr<-religion1 %>%
group_by(Religion,education,percentage)%>%
summarise(Total=mean(Sample_Size))%>%
arrange(-Total)
head(aggr)## # A tibble: 6 x 4
## # Groups: Religion, education [6]
## Religion education percentage Total
## <chr> <chr> <dbl> <dbl>
## 1 Evangelical Protestant College 0.14 8545
## 2 Evangelical Protestant High_school_or_less 0.43 8545
## 3 Evangelical Protestant Post_grad 0.07 8545
## 4 Evangelical Protestant Some_college 0.35 8545
## 5 Unaffiliated College 0.18 7532
## 6 Unaffiliated High_school_or_less 0.38 7532
# Creating a column to conver percentages to numbers
aggr1<- aggr%>%
group_by(Religion,education,percentage)%>%
mutate(counts=round(percentage*Total))
head(aggr1)## # A tibble: 6 x 5
## # Groups: Religion, education, percentage [6]
## Religion education percentage Total counts
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Evangelical Protestant College 0.14 8545 1196
## 2 Evangelical Protestant High_school_or_less 0.43 8545 3674
## 3 Evangelical Protestant Post_grad 0.07 8545 598
## 4 Evangelical Protestant Some_college 0.35 8545 2991
## 5 Unaffiliated College 0.18 7532 1356
## 6 Unaffiliated High_school_or_less 0.38 7532 2862
# Top 3 religions
aggr3<-subset(aggr1,Religion %in% c("Evangelical Protestant", "Unaffiliated","Catholic") )
head(aggr3)## # A tibble: 6 x 5
## # Groups: Religion, education, percentage [6]
## Religion education percentage Total counts
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Evangelical Protestant College 0.14 8545 1196
## 2 Evangelical Protestant High_school_or_less 0.43 8545 3674
## 3 Evangelical Protestant Post_grad 0.07 8545 598
## 4 Evangelical Protestant Some_college 0.35 8545 2991
## 5 Unaffiliated College 0.18 7532 1356
## 6 Unaffiliated High_school_or_less 0.38 7532 2862
ggplot(data= aggr3, aes(x = Religion, y = counts, fill = education)) + geom_bar(stat = "identity",
position = "stack") +ggtitle("Education breakdown top 5 religions")Conclusion
- I was surprised by how many individuals identified themselves as “Unaffiliated”
- The education breakdown of the top 3 religions is fairly similar. No correlation can be identified with this level of analysis
Dataset 3- Seasonality in unemployment
My last dataset was obtained from the US Bureau of Labor Statistics here. The data contains monthly unemployment numbers from 2009 onwards
# Dataset Import
unemp <- read.csv('unemployment.csv', header=T, stringsAsFactors = T, check.names = F )
head(unemp[,0:4])## Year Jan Feb Mar
## 1 2009 7.8 8.3 8.7
## 2 2010 9.8 9.8 9.9
## 3 2011 9.1 9.0 9.0
## 4 2012 8.3 8.3 8.2
## 5 2013 8.0 7.7 7.5
## 6 2014 6.6 6.7 6.7
unemp1<-unemp%>%
gather(month,unemployment,2:13)%>%
arrange(Year)%>%
unite(Period,Year,month,sep='-')
head(unemp1)## Period unemployment
## 1 2009-Jan 7.8
## 2 2009-Feb 8.3
## 3 2009-Mar 8.7
## 4 2009-Apr 9.0
## 5 2009-May 9.4
## 6 2009-Jun 9.5
# Looking at seasonality of Unemployment
ggplot(data=unemp1, aes(x=Period, y=unemployment, group=1)) +
geom_line()+
geom_point()## Warning: Removed 3 rows containing missing values (geom_point).