Project 2 Mandate

Choose any three of the “wide” datasets . For each of the three chosen datasets:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

  3. Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Please include in your homework submission, for each of the three chosen datasets:

The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.

Dataset 1 - Median Square Foot Price

Having moved to NYC only a few years ago, I struggle to justify the cost of housing in this area. I wanted to examine the median square foot price comparison over the years between NYC, a few metro regions of my choice and US as a whole. The dataset was sourced from the Home Listsings and Sales section of Zillow.

##                           RegionName SizeRank   2010-01  2010-02
## 1                      United States        0 116.98501 116.9276
## 2                       New York, NY        1 235.61896 234.2708
## 3 Los Angeles-Long Beach-Anaheim, CA        2 280.95984 277.4083
## 4                        Chicago, IL        3 153.13139 153.0168
## 5              Dallas-Fort Worth, TX        4  83.96947  84.2555
## 6                   Philadelphia, PA        5 143.19367 143.9976

As we can see, the data is in a “wide” format with 119 columns.Since the data ranges from Jan 2010 through Sep 2019, it needs quite a bit of tidying up.

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 117 rows
## [1, 844, 1687, 2530, 3373, 4216, 5059, 5902, 6745, 7588, 8431, 9274, 10117,
## 10960, 11803, 12646, 13489, 14332, 15175, 16018, ...].
##                             City State SizeRank    year mdnsqftprice
## 1                  United States  <NA>        0 2010-01    116.98501
## 2                       New York    NY        1 2010-01    235.61896
## 3 Los Angeles-Long Beach-Anaheim    CA        2 2010-01    280.95984
## 4                        Chicago    IL        3 2010-01    153.13139
## 5              Dallas-Fort Worth    TX        4 2010-01     83.96947
## 6                   Philadelphia    PA        5 2010-01    143.19367

The two metros I picked to compare were LA and Washington DC

Compare median Sq ft prices growth between NYC,LA,DC,Aggregated US

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()

Analysis to calculate average change in price per city over the duration of the dataset

## Selecting by price_change_pct
## # A tibble: 7 x 2
##   City        price_change_pct
##   <chr>                  <dbl>
## 1 Merced                   136
## 2 Modesto                  121
## 3 Blackfoot                118
## 4 Fernley                  109
## 5 Summit Park              106
## 6 Rapid City               100
## 7 Reno                     100
## Selecting by price_change_pct
## # A tibble: 7 x 2
##   City           price_change_pct
##   <chr>                     <dbl>
## 1 Sayre                       -35
## 2 Atlantic City               -23
## 3 Vineland                    -22
## 4 Roanoke Rapids              -20
## 5 Coffeyville                 -17
## 6 Amsterdam                   -13
## 7 Bennington                  -13

Histogram to look at the distribution of price change percent

## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
## Warning: Ignoring 107 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Conclusion

I was surprised to see that:

  1. LA Real estate prices has increased significantly faster than NYC.
  2. NYC is not even in the top 1% of cities when it comes to percentage of increase in real estate prices.
  3. Real estate in the majority of the cities increased (between 0-40%).
  4. Price dropped in 51 cities

Dataset 2- Religion and Education

For my second Dataset I decided to look at the education breakdown of members from different religions. This dataset was sourced from the Pew Research Organization.

##                        Religion High_school_or_less Some_college College
## 1                      Buddhist                0.20         0.33    0.28
## 2                      Catholic                0.46         0.27    0.16
## 3        Evangelical Protestant                0.43         0.35    0.14
## 4                         Hindu                0.12         0.11    0.29
## 5 Historically Black Protestant                0.52         0.33    0.09
## 6             Jehovah's Witness                0.63         0.25    0.09
##   Post_grad Sample_Size
## 1      0.20         262
## 2      0.10        7157
## 3      0.07        8545
## 4      0.48         197
## 5      0.06        1905
## 6      0.03         241

Yet again the data is in a “wide” format. The columns with suffixes “Bill” & Pay" are histories of bills sent to and payments made by the customer respectively.

##                        Religion Sample_Size           education percentage
## 1                      Buddhist         262 High_school_or_less       0.20
## 2                      Catholic        7157 High_school_or_less       0.46
## 3        Evangelical Protestant        8545 High_school_or_less       0.43
## 4                         Hindu         197 High_school_or_less       0.12
## 5 Historically Black Protestant        1905 High_school_or_less       0.52
## 6             Jehovah's Witness         241 High_school_or_less       0.63
## Selecting by Total
## # A tibble: 5 x 2
##   Religion                      Total
##   <chr>                         <dbl>
## 1 Evangelical Protestant         8545
## 2 Unaffiliated                   7532
## 3 Catholic                       7157
## 4 Mainline Protestant            6045
## 5 Historically Black Protestant  1905
## Selecting by Total
## # A tibble: 5 x 2
##   Religion           Total
##   <chr>              <dbl>
## 1 Buddhist             262
## 2 Jehovah's Witness    241
## 3 Muslim               237
## 4 Hindu                197
## 5 Orthodox Christian   185
## # A tibble: 6 x 4
## # Groups:   Religion, education [6]
##   Religion               education           percentage Total
##   <chr>                  <chr>                    <dbl> <dbl>
## 1 Evangelical Protestant College                   0.14  8545
## 2 Evangelical Protestant High_school_or_less       0.43  8545
## 3 Evangelical Protestant Post_grad                 0.07  8545
## 4 Evangelical Protestant Some_college              0.35  8545
## 5 Unaffiliated           College                   0.18  7532
## 6 Unaffiliated           High_school_or_less       0.38  7532
## # A tibble: 6 x 5
## # Groups:   Religion, education, percentage [6]
##   Religion               education           percentage Total counts
##   <chr>                  <chr>                    <dbl> <dbl>  <dbl>
## 1 Evangelical Protestant College                   0.14  8545   1196
## 2 Evangelical Protestant High_school_or_less       0.43  8545   3674
## 3 Evangelical Protestant Post_grad                 0.07  8545    598
## 4 Evangelical Protestant Some_college              0.35  8545   2991
## 5 Unaffiliated           College                   0.18  7532   1356
## 6 Unaffiliated           High_school_or_less       0.38  7532   2862
## # A tibble: 6 x 5
## # Groups:   Religion, education, percentage [6]
##   Religion               education           percentage Total counts
##   <chr>                  <chr>                    <dbl> <dbl>  <dbl>
## 1 Evangelical Protestant College                   0.14  8545   1196
## 2 Evangelical Protestant High_school_or_less       0.43  8545   3674
## 3 Evangelical Protestant Post_grad                 0.07  8545    598
## 4 Evangelical Protestant Some_college              0.35  8545   2991
## 5 Unaffiliated           College                   0.18  7532   1356
## 6 Unaffiliated           High_school_or_less       0.38  7532   2862

Conclusion

  1. I was surprised by how many individuals identified themselves as “Unaffiliated”
  2. The education breakdown of the top 3 religions is fairly similar. No correlation can be identified with this level of analysis

Dataset 3- Seasonality in unemployment

My last dataset was obtained from the US Bureau of Labor Statistics here. The data contains monthly unemployment numbers from 2009 onwards

##   Year Jan Feb Mar
## 1 2009 7.8 8.3 8.7
## 2 2010 9.8 9.8 9.9
## 3 2011 9.1 9.0 9.0
## 4 2012 8.3 8.3 8.2
## 5 2013 8.0 7.7 7.5
## 6 2014 6.6 6.7 6.7
##     Period unemployment
## 1 2009-Jan          7.8
## 2 2009-Feb          8.3
## 3 2009-Mar          8.7
## 4 2009-Apr          9.0
## 5 2009-May          9.4
## 6 2009-Jun          9.5
## Warning: Removed 3 rows containing missing values (geom_point).