Project 2 Mandate

Choose any three of the “wide” datasets . For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Please include in your homework submission, for each of the three chosen datasets:

The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.

Dataset 1 - Median Square Foot Price

Having moved to NYC only a few years ago, I struggle to justify the cost of housing in this area. I wanted to examine the median square foot price comparison over the years between NYC, a few metro regions of my choice and US as a whole. The dataset was sourced from the Home Listsings and Sales section of Zillow.

# Library Imports
library(dplyr)
library(tidyr)
library(stringr)
library(knitr)
library(ggplot2)
library(zoo)
library(plotly)

# Dataset Import 
mdnprice <- read.csv('MedianHousePrices.csv', header=T, stringsAsFactors = F, check.names = F ) 
head(mdnprice[,0:4])

##                           RegionName SizeRank   2010-01  2010-02
## 1                      United States        0 116.98501 116.9276
## 2                       New York, NY        1 235.61896 234.2708
## 3 Los Angeles-Long Beach-Anaheim, CA        2 280.95984 277.4083
## 4                        Chicago, IL        3 153.13139 153.0168
## 5              Dallas-Fort Worth, TX        4  83.96947  84.2555
## 6                   Philadelphia, PA        5 143.19367 143.9976

As we can see, the data is in a “wide” format with 119 columns.Since the data ranges from Jan 2010 through Sep 2019, it needs quite a bit of tidying up.

# Using gather to move data from a "wide" to a "long" format
mdnprice2 <-gather(mdnprice, year ,mdnsqftprice,3:119)

# Using seperate to create one column for City and another for State
mdnprice3<-tidyr::separate(mdnprice2,RegionName,c("City","State"),sep=",")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 117 rows
## [1, 844, 1687, 2530, 3373, 4216, 5059, 5902, 6745, 7588, 8431, 9274, 10117,
## 10960, 11803, 12646, 13489, 14332, 15175, 16018, ...].

head(mdnprice3)

##                             City State SizeRank    year mdnsqftprice
## 1                  United States  <NA>        0 2010-01    116.98501
## 2                       New York    NY        1 2010-01    235.61896
## 3 Los Angeles-Long Beach-Anaheim    CA        2 2010-01    280.95984
## 4                        Chicago    IL        3 2010-01    153.13139
## 5              Dallas-Fort Worth    TX        4 2010-01     83.96947
## 6                   Philadelphia    PA        5 2010-01    143.19367

The two metros I picked to compare were LA and Washington DC

# Creating Subsets for median sq ft price of the US, NYC, LA and DC
USmdn <- subset(mdnprice3, City=="United States")
NYCmdn<- subset(mdnprice3, City=="New York")
DCmdn<- subset(mdnprice3, City=="Washington" & State==' DC')
LAmdn<- subset(mdnprice3, City=="Los Angeles-Long Beach-Anaheim")

Compare median Sq ft prices growth between NYC,LA,DC,Aggregated US

# Using plotly to make an interactive visualization to compare month over month price increases
plot_ly(data= NYCmdn, x= ~year,y=~mdnsqftprice,type = 'scatter', mode='lines',name='NYC' )  %>%
  add_trace(y =LAmdn$mdnsqftprice, name = 'LA',mode = 'lines')%>%
  add_trace(y =DCmdn$mdnsqftprice, name = 'DC',mode = 'lines')%>%
  add_trace(y =USmdn$mdnsqftprice, name = 'US',mode = 'lines')%>%
  layout(title = 'Growth of Median Sqft Price',
         autosize=F,width=900,
         yaxis = list(zeroline = FALSE,title='Median Sqft Price'),
         xaxis = list(title='Year-Month'),
         hovermode = 'compare',
         paper_bgcolor='rgb(245,252,255)',
         plot_bgcolor='rgba(0,0,0,0)')

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Analysis to calculate average change in price per city over the duration of the dataset

#Seperating months and years
mdnprice4 <- separate(mdnprice3,year,c("year","month"),sep='-')

#Aggregating by city and year. Calculating average yearly price per sq ft
aggr.tbl <- mdnprice4 %>% group_by(City,year) %>% 
              summarize(Yrly_avg = mean(mdnsqftprice))

# Creating a table that calculates the percentage difference in average yearly prices between 2010 and 2019
pct_chg.tbl<-aggr.tbl %>% 
    group_by(City) %>%
    dplyr::mutate(
        pct_change= -(dplyr::first(Yrly_avg)-dplyr::last(Yrly_avg))/first(Yrly_avg))%>%
        select(-year,-Yrly_avg)%>%
        group_by(City)%>% 
        summarize(price_change_pct = round(mean(pct_change)*100))

# Top 1% price change
pct_chg.tbl %>% 
              arrange(-price_change_pct) %>%
                top_frac(.01)

## Selecting by price_change_pct

## # A tibble: 7 x 2
##   City        price_change_pct
##   <chr>                  <dbl>
## 1 Merced                   136
## 2 Modesto                  121
## 3 Blackfoot                118
## 4 Fernley                  109
## 5 Summit Park              106
## 6 Rapid City               100
## 7 Reno                     100

# Bottom 1% price change
pct_chg.tbl %>% 
              arrange(price_change_pct) %>%
                top_frac(-.01)

## Selecting by price_change_pct

## # A tibble: 7 x 2
##   City           price_change_pct
##   <chr>                     <dbl>
## 1 Sayre                       -35
## 2 Atlantic City               -23
## 3 Vineland                    -22
## 4 Roanoke Rapids              -20
## 5 Coffeyville                 -17
## 6 Amsterdam                   -13
## 7 Bennington                  -13

Histogram to look at the distribution of price change percent

plot_ly(x=pct_chg.tbl$price_change_pct,type="histogram", color='rgba(255, 182, 193)',nbinsx = 20)%>% 
        layout(title = 'Distribution of Price Changes since 2010',
         autosize=F,width=900,
         hovermode = 'compare',
         yaxis = list(title='Frequency'),
         xaxis = list(title='Percent Change'),
         paper_bgcolor='rgb(245,252,255)',
         plot_bgcolor='rgba(0,0,0,0)')

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

## Warning: Ignoring 107 observations

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Conclusion

I was surprised to see that:

LA Real estate prices has increased significantly faster than NYC.
NYC is not even in the top 1% of cities when it comes to percentage of increase in real estate prices.
Real estate in the majority of the cities increased (between 0-40%).
Price dropped in 51 cities

Dataset 2- Religion and Education

For my second Dataset I decided to look at the education breakdown of members from different religions. This dataset was sourced from the Pew Research Organization.

# Dataset Import 
religion <- read.csv('religion.csv', header=T, stringsAsFactors = F, check.names = F ) 
head(religion)#[,0:5]

##                        Religion High_school_or_less Some_college College
## 1                      Buddhist                0.20         0.33    0.28
## 2                      Catholic                0.46         0.27    0.16
## 3        Evangelical Protestant                0.43         0.35    0.14
## 4                         Hindu                0.12         0.11    0.29
## 5 Historically Black Protestant                0.52         0.33    0.09
## 6             Jehovah's Witness                0.63         0.25    0.09
##   Post_grad Sample_Size
## 1      0.20         262
## 2      0.10        7157
## 3      0.07        8545
## 4      0.48         197
## 5      0.06        1905
## 6      0.03         241

Yet again the data is in a “wide” format. The columns with suffixes “Bill” & Pay" are histories of bills sent to and payments made by the customer respectively.

# Dataset Import 
religion1<-gather(religion,education,percentage,2:5)
head(religion1)

##                        Religion Sample_Size           education percentage
## 1                      Buddhist         262 High_school_or_less       0.20
## 2                      Catholic        7157 High_school_or_less       0.46
## 3        Evangelical Protestant        8545 High_school_or_less       0.43
## 4                         Hindu         197 High_school_or_less       0.12
## 5 Historically Black Protestant        1905 High_school_or_less       0.52
## 6             Jehovah's Witness         241 High_school_or_less       0.63

# Figuring out top 5 religions by sample size
 religion1 %>% 
  group_by(Religion)%>% 
  summarise(Total=mean(Sample_Size))%>% 
  arrange(-Total)%>% 
  top_n(5)

## Selecting by Total

## # A tibble: 5 x 2
##   Religion                      Total
##   <chr>                         <dbl>
## 1 Evangelical Protestant         8545
## 2 Unaffiliated                   7532
## 3 Catholic                       7157
## 4 Mainline Protestant            6045
## 5 Historically Black Protestant  1905

# Figuring out bottom 5 religions by sample size
 religion1 %>% 
  group_by(Religion)%>% 
  summarise(Total=mean(Sample_Size))%>% 
  arrange(-Total)%>% 
  top_n(-5)

## Selecting by Total

## # A tibble: 5 x 2
##   Religion           Total
##   <chr>              <dbl>
## 1 Buddhist             262
## 2 Jehovah's Witness    241
## 3 Muslim               237
## 4 Hindu                197
## 5 Orthodox Christian   185

# aggregated dataset
 aggr<-religion1 %>% 
  group_by(Religion,education,percentage)%>% 
  summarise(Total=mean(Sample_Size))%>% 
  arrange(-Total)
head(aggr)

## # A tibble: 6 x 4
## # Groups:   Religion, education [6]
##   Religion               education           percentage Total
##   <chr>                  <chr>                    <dbl> <dbl>
## 1 Evangelical Protestant College                   0.14  8545
## 2 Evangelical Protestant High_school_or_less       0.43  8545
## 3 Evangelical Protestant Post_grad                 0.07  8545
## 4 Evangelical Protestant Some_college              0.35  8545
## 5 Unaffiliated           College                   0.18  7532
## 6 Unaffiliated           High_school_or_less       0.38  7532

# Creating a column to conver percentages to numbers
aggr1<- aggr%>% 
  group_by(Religion,education,percentage)%>% 
  mutate(counts=round(percentage*Total))
head(aggr1)

## # A tibble: 6 x 5
## # Groups:   Religion, education, percentage [6]
##   Religion               education           percentage Total counts
##   <chr>                  <chr>                    <dbl> <dbl>  <dbl>
## 1 Evangelical Protestant College                   0.14  8545   1196
## 2 Evangelical Protestant High_school_or_less       0.43  8545   3674
## 3 Evangelical Protestant Post_grad                 0.07  8545    598
## 4 Evangelical Protestant Some_college              0.35  8545   2991
## 5 Unaffiliated           College                   0.18  7532   1356
## 6 Unaffiliated           High_school_or_less       0.38  7532   2862

# Top 3 religions
aggr3<-subset(aggr1,Religion %in% c("Evangelical Protestant", "Unaffiliated","Catholic") )
head(aggr3)

## # A tibble: 6 x 5
## # Groups:   Religion, education, percentage [6]
##   Religion               education           percentage Total counts
##   <chr>                  <chr>                    <dbl> <dbl>  <dbl>
## 1 Evangelical Protestant College                   0.14  8545   1196
## 2 Evangelical Protestant High_school_or_less       0.43  8545   3674
## 3 Evangelical Protestant Post_grad                 0.07  8545    598
## 4 Evangelical Protestant Some_college              0.35  8545   2991
## 5 Unaffiliated           College                   0.18  7532   1356
## 6 Unaffiliated           High_school_or_less       0.38  7532   2862

ggplot(data= aggr3, aes(x = Religion, y = counts, fill = education)) + geom_bar(stat = "identity", 
    position = "stack") +ggtitle("Education breakdown top 5 religions")

Conclusion

I was surprised by how many individuals identified themselves as “Unaffiliated”
The education breakdown of the top 3 religions is fairly similar. No correlation can be identified with this level of analysis

Dataset 3- Seasonality in unemployment

My last dataset was obtained from the US Bureau of Labor Statistics here. The data contains monthly unemployment numbers from 2009 onwards

# Dataset Import 
unemp <- read.csv('unemployment.csv', header=T, stringsAsFactors = T, check.names = F ) 
head(unemp[,0:4])

##   Year Jan Feb Mar
## 1 2009 7.8 8.3 8.7
## 2 2010 9.8 9.8 9.9
## 3 2011 9.1 9.0 9.0
## 4 2012 8.3 8.3 8.2
## 5 2013 8.0 7.7 7.5
## 6 2014 6.6 6.7 6.7

unemp1<-unemp%>%
    gather(month,unemployment,2:13)%>%
    arrange(Year)%>% 
    unite(Period,Year,month,sep='-')
head(unemp1)

##     Period unemployment
## 1 2009-Jan          7.8
## 2 2009-Feb          8.3
## 3 2009-Mar          8.7
## 4 2009-Apr          9.0
## 5 2009-May          9.4
## 6 2009-Jun          9.5

# Looking at seasonality of Unemployment
ggplot(data=unemp1, aes(x=Period, y=unemployment, group=1)) +
  geom_line()+
  geom_point()

## Warning: Removed 3 rows containing missing values (geom_point).

DATA 607 - Project 2

Devanshu Mehrotra

2019-10-31

Project 2 Mandate

Dataset 1 - Median Square Foot Price

Compare median Sq ft prices growth between NYC,LA,DC,Aggregated US

Analysis to calculate average change in price per city over the duration of the dataset

Histogram to look at the distribution of price change percent

Conclusion

Dataset 2- Religion and Education

Conclusion

Dataset 3- Seasonality in unemployment