GDP by County from 2012 to 2015

Data description

The source of this data set is from the website https://github.com/kwstat/untidydata2/blob/main/inst/messydata/gdp_by_county.xlsx. In this dataset I have used Current Dollar GDP sheet.

The purpose of this dataset is to analyse how GDP of different county’s in USA is varying from 2012 to 2014.

Variable description is as shown below:

  1. FIPS - It is a five digit numeric code which uniquely represents geographic areas.
  2. Countyname - Specifies the division or specific region of a state.
  3. Postal - Unique number for postal services
  4. Linecode - which gives numeric representation to types of Industryname
  5. Industryname - In the dataset we have four types of industries like All Industries,Private goods producing industries,Private services producing industries,Government and government enterprises among which GDP of county’s are shown.

Characteristics of Dataset There are numeric and char data types. Via summary of dataset we found there are null values in linecode and year columns.

Coding

Loading all the required libraries and packages

readxl - to read xlsx file

tidyverse - to transform and manipulate data

ggplot2 - for visualisation plots

library("readxl")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Loading and cleaning messy data

Data summary - str(), summary()

colnames - to change the column names

rm(list = ls())
library("readxl")
gdp <- read_excel("Assignment1-Kandru-gdp-by-county-Data-Wrangling-and-Visualization.xlsx",col_names=TRUE,skip=3,sheet=1, na = "(D)")
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
colnames(gdp)[c(1,2,3,4,5)] = c('FIPS','Countyname','postalcode','linecode','industry_name')
str(gdp)
## tibble [12,456 × 9] (S3: tbl_df/tbl/data.frame)
##  $ FIPS         : chr [1:12456] "01001" "01001" "01001" "01001" ...
##  $ Countyname   : chr [1:12456] "Autauga" "Autauga" "Autauga" "Autauga" ...
##  $ postalcode   : chr [1:12456] "AL" "AL" "AL" "AL" ...
##  $ linecode     : num [1:12456] 1 2 3 4 1 2 3 4 1 2 ...
##  $ industry_name: chr [1:12456] "All Industries" "Private goods-producing industries" "Private services-providing industries" "Government and government enterprises" ...
##  $ 2012         : num [1:12456] 1383941 286396 948490 149055 5599194 ...
##  $ 2013         : num [1:12456] 1363368 310468 904599 148301 6365080 ...
##  $ 2014         : num [1:12456] 1402516 323582 928438 150496 6547396 ...
##  $ 2015         : num [1:12456] 1539406 346355 1037309 155742 6436107 ...
summary(gdp)
##      FIPS            Countyname         postalcode           linecode   
##  Length:12456       Length:12456       Length:12456       Min.   :1.00  
##  Class :character   Class :character   Class :character   1st Qu.:1.75  
##  Mode  :character   Mode  :character   Mode  :character   Median :2.50  
##                                                           Mean   :2.50  
##                                                           3rd Qu.:3.25  
##                                                           Max.   :4.00  
##                                                           NA's   :4     
##  industry_name           2012                2013                2014          
##  Length:12456       Min.   :      735   Min.   :     1069   Min.   :     1088  
##  Class :character   1st Qu.:    99618   1st Qu.:   107671   1st Qu.:   107457  
##  Mode  :character   Median :   300890   Median :   323052   Median :   329138  
##                     Mean   :  2590566   Mean   :  2716786   Mean   :  2833624  
##                     3rd Qu.:  1038078   3rd Qu.:  1101818   3rd Qu.:  1134311  
##                     Max.   :610107941   Max.   :619538778   Max.   :657009064  
##                     NA's   :46          NA's   :202         NA's   :192        
##       2015          
##  Min.   :      366  
##  1st Qu.:   106921  
##  Median :   326972  
##  Mean   :  2932933  
##  3rd Qu.:  1136410  
##  Max.   :691948578  
##  NA's   :124
kbl(gdp[1:30,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode linecode industry_name 2012 2013 2014 2015
01001 Autauga AL 1 All Industries 1383941 1363368 1402516 1539406
01001 Autauga AL 2 Private goods-producing industries 286396 310468 323582 346355
01001 Autauga AL 3 Private services-providing industries 948490 904599 928438 1037309
01001 Autauga AL 4 Government and government enterprises 149055 148301 150496 155742
01003 Baldwin AL 1 All Industries 5599194 6365080 6547396 6436107
01003 Baldwin AL 2 Private goods-producing industries 681871 698500 711443 735432
01003 Baldwin AL 3 Private services-providing industries 4306173 5037325 5194273 5023903
01003 Baldwin AL 4 Government and government enterprises 611151 629255 641681 676772
01005 Barbour AL 1 All Industries 639833 701750 689212 743779
01005 Barbour AL 2 Private goods-producing industries 253232 303099 286467 307299
01005 Barbour AL 3 Private services-providing industries 274894 288270 291688 324052
01005 Barbour AL 4 Government and government enterprises 111707 110381 111057 112428
01007 Bibb AL 1 All Industries 297560 325906 329087 322307
01007 Bibb AL 2 Private goods-producing industries 98048 113691 108170 102984
01007 Bibb AL 3 Private services-providing industries 116767 129718 138327 133371
01007 Bibb AL 4 Government and government enterprises 82746 82498 82591 85952
01009 Blount AL 1 All Industries 632761 701145 688525 819608
01009 Blount AL 2 Private goods-producing industries 187947 235705 206791 201373
01009 Blount AL 3 Private services-providing industries 310042 330581 347601 481303
01009 Blount AL 4 Government and government enterprises 134773 134859 134133 136932
01011 Bullock AL 1 All Industries 191052 190103 178408 178902
01011 Bullock AL 2 Private goods-producing industries 67063 71650 60743 61008
01011 Bullock AL 3 Private services-providing industries 78889 72767 71542 71217
01011 Bullock AL 4 Government and government enterprises 45101 45686 46123 46678
01013 Butler AL 1 All Industries 514216 522547 512826 542129
01013 Butler AL 2 Private goods-producing industries 189427 166962 155406 156906
01013 Butler AL 3 Private services-providing industries 263801 294015 294944 321026
01013 Butler AL 4 Government and government enterprises 60988 61569 62476 64197
01015 Calhoun AL 1 All Industries 4073476 3988587 3943024 3999134
01015 Calhoun AL 2 Private goods-producing industries 1051376 990407 868396 849069

Data Cleaning

Removing null values and drop one column

gdp <- na.omit(gdp)
colSums(is.na(gdp))
##          FIPS    Countyname    postalcode      linecode industry_name 
##             0             0             0             0             0 
##          2012          2013          2014          2015 
##             0             0             0             0

Removing column (Linecode) from data

gdp <- subset(gdp, select = -c(linecode))
kbl(gdp[1:30,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode industry_name 2012 2013 2014 2015
01001 Autauga AL All Industries 1383941 1363368 1402516 1539406
01001 Autauga AL Private goods-producing industries 286396 310468 323582 346355
01001 Autauga AL Private services-providing industries 948490 904599 928438 1037309
01001 Autauga AL Government and government enterprises 149055 148301 150496 155742
01003 Baldwin AL All Industries 5599194 6365080 6547396 6436107
01003 Baldwin AL Private goods-producing industries 681871 698500 711443 735432
01003 Baldwin AL Private services-providing industries 4306173 5037325 5194273 5023903
01003 Baldwin AL Government and government enterprises 611151 629255 641681 676772
01005 Barbour AL All Industries 639833 701750 689212 743779
01005 Barbour AL Private goods-producing industries 253232 303099 286467 307299
01005 Barbour AL Private services-providing industries 274894 288270 291688 324052
01005 Barbour AL Government and government enterprises 111707 110381 111057 112428
01007 Bibb AL All Industries 297560 325906 329087 322307
01007 Bibb AL Private goods-producing industries 98048 113691 108170 102984
01007 Bibb AL Private services-providing industries 116767 129718 138327 133371
01007 Bibb AL Government and government enterprises 82746 82498 82591 85952
01009 Blount AL All Industries 632761 701145 688525 819608
01009 Blount AL Private goods-producing industries 187947 235705 206791 201373
01009 Blount AL Private services-providing industries 310042 330581 347601 481303
01009 Blount AL Government and government enterprises 134773 134859 134133 136932
01011 Bullock AL All Industries 191052 190103 178408 178902
01011 Bullock AL Private goods-producing industries 67063 71650 60743 61008
01011 Bullock AL Private services-providing industries 78889 72767 71542 71217
01011 Bullock AL Government and government enterprises 45101 45686 46123 46678
01013 Butler AL All Industries 514216 522547 512826 542129
01013 Butler AL Private goods-producing industries 189427 166962 155406 156906
01013 Butler AL Private services-providing industries 263801 294015 294944 321026
01013 Butler AL Government and government enterprises 60988 61569 62476 64197
01015 Calhoun AL All Industries 4073476 3988587 3943024 3999134
01015 Calhoun AL Private goods-producing industries 1051376 990407 868396 849069

Data Wrangling

This is untidy data as it violates below mentioned tidy principles

  • Not Each variable have its own column.

  • Not Each observation have its own row.

Tidy up the data

Using Pivot_longer

Here in this dataset, we have a set of columns(5,6,7,8) with names (2012,2013,2014,2015) which are values but not variables.

So applying Pivot_longer function, converting values to variables by creating another new columns named ‘Years’ and ‘GDP’

gdp_updated = gdp %>% pivot_longer(cols=c(starts_with('2012'),starts_with('2013'),starts_with('2014'),starts_with('2015')), names_to =  "year", values_to = "GDP")
kbl(gdp_updated[1:35,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode industry_name year GDP
01001 Autauga AL All Industries 2012 1383941
01001 Autauga AL All Industries 2013 1363368
01001 Autauga AL All Industries 2014 1402516
01001 Autauga AL All Industries 2015 1539406
01001 Autauga AL Private goods-producing industries 2012 286396
01001 Autauga AL Private goods-producing industries 2013 310468
01001 Autauga AL Private goods-producing industries 2014 323582
01001 Autauga AL Private goods-producing industries 2015 346355
01001 Autauga AL Private services-providing industries 2012 948490
01001 Autauga AL Private services-providing industries 2013 904599
01001 Autauga AL Private services-providing industries 2014 928438
01001 Autauga AL Private services-providing industries 2015 1037309
01001 Autauga AL Government and government enterprises 2012 149055
01001 Autauga AL Government and government enterprises 2013 148301
01001 Autauga AL Government and government enterprises 2014 150496
01001 Autauga AL Government and government enterprises 2015 155742
01003 Baldwin AL All Industries 2012 5599194
01003 Baldwin AL All Industries 2013 6365080
01003 Baldwin AL All Industries 2014 6547396
01003 Baldwin AL All Industries 2015 6436107
01003 Baldwin AL Private goods-producing industries 2012 681871
01003 Baldwin AL Private goods-producing industries 2013 698500
01003 Baldwin AL Private goods-producing industries 2014 711443
01003 Baldwin AL Private goods-producing industries 2015 735432
01003 Baldwin AL Private services-providing industries 2012 4306173
01003 Baldwin AL Private services-providing industries 2013 5037325
01003 Baldwin AL Private services-providing industries 2014 5194273
01003 Baldwin AL Private services-providing industries 2015 5023903
01003 Baldwin AL Government and government enterprises 2012 611151
01003 Baldwin AL Government and government enterprises 2013 629255
01003 Baldwin AL Government and government enterprises 2014 641681
01003 Baldwin AL Government and government enterprises 2015 676772
01005 Barbour AL All Industries 2012 639833
01005 Barbour AL All Industries 2013 701750
01005 Barbour AL All Industries 2014 689212

##Using Pivot_wider

Here in this dataset, different industry names are scattered accross multiple rows(which violates tidy priciple).

So by applying pivot_wider function to scatter all four types of industry names(column)

gdp_final = gdp_updated %>% pivot_wider(names_from =  "industry_name", values_from = "GDP")
kbl(gdp_final[1:35,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)                  
FIPS Countyname postalcode year All Industries Private goods-producing industries Private services-providing industries Government and government enterprises
01001 Autauga AL 2012 1383941 286396 948490 149055
01001 Autauga AL 2013 1363368 310468 904599 148301
01001 Autauga AL 2014 1402516 323582 928438 150496
01001 Autauga AL 2015 1539406 346355 1037309 155742
01003 Baldwin AL 2012 5599194 681871 4306173 611151
01003 Baldwin AL 2013 6365080 698500 5037325 629255
01003 Baldwin AL 2014 6547396 711443 5194273 641681
01003 Baldwin AL 2015 6436107 735432 5023903 676772
01005 Barbour AL 2012 639833 253232 274894 111707
01005 Barbour AL 2013 701750 303099 288270 110381
01005 Barbour AL 2014 689212 286467 291688 111057
01005 Barbour AL 2015 743779 307299 324052 112428
01007 Bibb AL 2012 297560 98048 116767 82746
01007 Bibb AL 2013 325906 113691 129718 82498
01007 Bibb AL 2014 329087 108170 138327 82591
01007 Bibb AL 2015 322307 102984 133371 85952
01009 Blount AL 2012 632761 187947 310042 134773
01009 Blount AL 2013 701145 235705 330581 134859
01009 Blount AL 2014 688525 206791 347601 134133
01009 Blount AL 2015 819608 201373 481303 136932
01011 Bullock AL 2012 191052 67063 78889 45101
01011 Bullock AL 2013 190103 71650 72767 45686
01011 Bullock AL 2014 178408 60743 71542 46123
01011 Bullock AL 2015 178902 61008 71217 46678
01013 Butler AL 2012 514216 189427 263801 60988
01013 Butler AL 2013 522547 166962 294015 61569
01013 Butler AL 2014 512826 155406 294944 62476
01013 Butler AL 2015 542129 156906 321026 64197
01015 Calhoun AL 2012 4073476 1051376 1951870 1070229
01015 Calhoun AL 2013 3988587 990407 1995846 1002334
01015 Calhoun AL 2014 3943024 868396 2036779 1037850
01015 Calhoun AL 2015 3999134 849069 2086608 1063457
01017 Chambers AL 2012 567650 177977 301425 88248
01017 Chambers AL 2013 629384 214956 325399 89029
01017 Chambers AL 2014 687210 220871 346159 120180

Now our data obeys tidy data principles as :

Each variable have its own column (achieved by using pivot_longer)

Each observation must have its own row (acheived by using pivot_wider)

No cell has has two values

Data Visualization

Before ploting removing spaces in column names with ’_’ for better readability

colnames(gdp_final)[c(5,6,7,8)] = c('All_Industries','Private_goods_producing_industries',                                'Private_services_producing_industries','Government_and_government_enterprises')
kbl(gdp_final[1:35,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode year All_Industries Private_goods_producing_industries Private_services_producing_industries Government_and_government_enterprises
01001 Autauga AL 2012 1383941 286396 948490 149055
01001 Autauga AL 2013 1363368 310468 904599 148301
01001 Autauga AL 2014 1402516 323582 928438 150496
01001 Autauga AL 2015 1539406 346355 1037309 155742
01003 Baldwin AL 2012 5599194 681871 4306173 611151
01003 Baldwin AL 2013 6365080 698500 5037325 629255
01003 Baldwin AL 2014 6547396 711443 5194273 641681
01003 Baldwin AL 2015 6436107 735432 5023903 676772
01005 Barbour AL 2012 639833 253232 274894 111707
01005 Barbour AL 2013 701750 303099 288270 110381
01005 Barbour AL 2014 689212 286467 291688 111057
01005 Barbour AL 2015 743779 307299 324052 112428
01007 Bibb AL 2012 297560 98048 116767 82746
01007 Bibb AL 2013 325906 113691 129718 82498
01007 Bibb AL 2014 329087 108170 138327 82591
01007 Bibb AL 2015 322307 102984 133371 85952
01009 Blount AL 2012 632761 187947 310042 134773
01009 Blount AL 2013 701145 235705 330581 134859
01009 Blount AL 2014 688525 206791 347601 134133
01009 Blount AL 2015 819608 201373 481303 136932
01011 Bullock AL 2012 191052 67063 78889 45101
01011 Bullock AL 2013 190103 71650 72767 45686
01011 Bullock AL 2014 178408 60743 71542 46123
01011 Bullock AL 2015 178902 61008 71217 46678
01013 Butler AL 2012 514216 189427 263801 60988
01013 Butler AL 2013 522547 166962 294015 61569
01013 Butler AL 2014 512826 155406 294944 62476
01013 Butler AL 2015 542129 156906 321026 64197
01015 Calhoun AL 2012 4073476 1051376 1951870 1070229
01015 Calhoun AL 2013 3988587 990407 1995846 1002334
01015 Calhoun AL 2014 3943024 868396 2036779 1037850
01015 Calhoun AL 2015 3999134 849069 2086608 1063457
01017 Chambers AL 2012 567650 177977 301425 88248
01017 Chambers AL 2013 629384 214956 325399 89029
01017 Chambers AL 2014 687210 220871 346159 120180

Considering highest gdp countys to plot the GDP variations for those countys

major_county <- filter(gdp_final,Countyname %in% c("Los Angeles", "New York", "Harris", "Santa Clara") & postalcode %in% c("CA", "NY", "TX", "CA"))
kbl(major_county[1:15,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode year All_Industries Private_goods_producing_industries Private_services_producing_industries Government_and_government_enterprises
06037 Los Angeles CA 2012 593671075 70641406 464547602 58482067
06037 Los Angeles CA 2013 617139519 74129717 482219285 60790517
06037 Los Angeles CA 2014 654038890 75338855 514271617 64428418
06037 Los Angeles CA 2015 691948578 75466445 547957509 68524623
06085 Santa Clara CA 2012 195550832 55682457 130284946 9583429
06085 Santa Clara CA 2013 213384895 58899838 144409875 10075181
06085 Santa Clara CA 2014 233442273 62954143 159775321 10712810
06085 Santa Clara CA 2015 262053352 69535372 181130859 11387121
36061 New York NY 2012 610107941 7877198 574476082 27754660
36061 New York NY 2013 619538778 8217717 581904074 29416987
36061 New York NY 2014 657009064 8988338 617415093 30605633
36061 New York NY 2015 690040650 10101652 648976531 30962467
48201 Harris TX 2012 377095884 147166060 207419397 22510427
48201 Harris TX 2013 395892886 158361188 213813227 23718471
48201 Harris TX 2014 402388140 151076820 226394473 24916846

Scatter plot of year v/s All_industries grouped by county

ggplot(data = major_county) +
geom_point(mapping = aes(x = year, y = All_Industries,color=Countyname))

By observing scatter plot we can conclude:

  • Santa Clara county has low GDP from 2012 to 2015. However, it is increasing gradually.

  • New York and Los Angeles county are very competitive and have almost negligible gdp difference.

MUTATE and ploting box plot

major_county_mutated = major_county %>%
 mutate(
  avg_gdp = All_Industries+Private_goods_producing_industries+Private_services_producing_industries+Government_and_government_enterprises
 )
kbl(major_county_mutated[1:15,]) %>%
  kable_paper(bootstrap_options = "striped", full_width = F)
FIPS Countyname postalcode year All_Industries Private_goods_producing_industries Private_services_producing_industries Government_and_government_enterprises avg_gdp
06037 Los Angeles CA 2012 593671075 70641406 464547602 58482067 1187342150
06037 Los Angeles CA 2013 617139519 74129717 482219285 60790517 1234279038
06037 Los Angeles CA 2014 654038890 75338855 514271617 64428418 1308077780
06037 Los Angeles CA 2015 691948578 75466445 547957509 68524623 1383897155
06085 Santa Clara CA 2012 195550832 55682457 130284946 9583429 391101664
06085 Santa Clara CA 2013 213384895 58899838 144409875 10075181 426769789
06085 Santa Clara CA 2014 233442273 62954143 159775321 10712810 466884547
06085 Santa Clara CA 2015 262053352 69535372 181130859 11387121 524106704
36061 New York NY 2012 610107941 7877198 574476082 27754660 1220215881
36061 New York NY 2013 619538778 8217717 581904074 29416987 1239077556
36061 New York NY 2014 657009064 8988338 617415093 30605633 1314018128
36061 New York NY 2015 690040650 10101652 648976531 30962467 1380081300
48201 Harris TX 2012 377095884 147166060 207419397 22510427 754191768
48201 Harris TX 2013 395892886 158361188 213813227 23718471 791785772
48201 Harris TX 2014 402388140 151076820 226394473 24916846 804776279

Box plot of average GDP by year

theme_set(theme_bw())
ggplot(data = major_county_mutated, mapping = aes(x = year, y = avg_gdp)) +
  geom_boxplot()+ labs(x="Year", y="avg_adp",title = "Box Plot of average gdp by year")

By obesrving Box plot, we can say

  • The average GDP(Gross domestic product) of USA is increasing year by year. As a result, the economy of USA is raising with time.

Summary

From the plots we can conclude that USA has been increased its GDP from 2012 to 2015. Apparently, New York and Los Angles are one of the top countys which have high GDP growth.