R Project #2 10.06.2018

Data #1

Source of data is International Monetary Fund; “World Economic Outlook”; GDP by country; https://www.imf.org/external/pubs/ft/weo/2018/01/weodata/

The data is wide.

The data layout is a typical wide one. Columns are years. I will transpose years to make data long.

First step would be to read data in

gdp_data<-read.csv(file="C://Data/WEO_Data GDP.csv")

head(gdp_data)

##            ï..Country                     Subject.Descriptor        Units
## 1         Afghanistan Gross domestic product, current prices U.S. dollars
## 2             Albania Gross domestic product, current prices U.S. dollars
## 3             Algeria Gross domestic product, current prices U.S. dollars
## 4              Angola Gross domestic product, current prices U.S. dollars
## 5 Antigua and Barbuda Gross domestic product, current prices U.S. dollars
## 6           Argentina Gross domestic product, current prices U.S. dollars
##      Scale
## 1 Billions
## 2 Billions
## 3 Billions
## 4 Billions
## 5 Billions
## 6 Billions
##                                                 Country.Series.specific.Notes
## 1 See notes for:  Gross domestic product, current prices (National currency).
## 2 See notes for:  Gross domestic product, current prices (National currency).
## 3 See notes for:  Gross domestic product, current prices (National currency).
## 4 See notes for:  Gross domestic product, current prices (National currency).
## 5 See notes for:  Gross domestic product, current prices (National currency).
## 6 See notes for:  Gross domestic product, current prices (National currency).
##     X2015   X2016   X2017   X2018   X2019   X2020   X2021   X2022
## 1  19.687  18.886   20.57  21.706  23.233   24.93  26.857  29.113
## 2  11.393  12.126  12.294   12.92  13.838   14.87  15.932  17.141
## 3 164.779 160.784 173.947 177.682 181.891 187.567 194.882 202.242
## 4 102.962  95.821 122.365 134.305  138.18 142.733 147.525  150.23
## 5   1.353   1.398   1.454   1.514   1.578   1.646   1.716    1.79
## 6 631.621 545.124 628.935 658.129 711.893 775.217 840.353 908.327
##   Estimates.Start.After
## 1                  2014
## 2                  2015
## 3                  2015
## 4                  2015
## 5                  2016
## 6                  2016

Next, let’s install tidyr and dplyr libraries.

#install.packages("tidyr")

library("tidyr")

library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Next we will rename the first and the last columns and delete 2, 3, 4, and 5 columns, which do not contain critical information (columns are repetetive and rudant). Getting rid of these columns will allow us to make dataset more managable. We also need to convert GDP values to numeric, but we have to be careful with commas.

gdp_data1<-rename(gdp_data, "Country"="ï..Country","EstimateStartYr"="Estimates.Start.After")

gdp_data2<-select(gdp_data1,-c("Subject.Descriptor","Units","Scale","Country.Series.specific.Notes"))

gdp_data2[2:9] <- lapply(gdp_data2[2:9], function(x) as.numeric(as.character(gsub("\\,", "",x))))

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

## Warning in FUN(X[[i]], ...): NAs introduced by coercion

head(gdp_data2)

##               Country   X2015   X2016   X2017   X2018   X2019   X2020
## 1         Afghanistan  19.687  18.886  20.570  21.706  23.233  24.930
## 2             Albania  11.393  12.126  12.294  12.920  13.838  14.870
## 3             Algeria 164.779 160.784 173.947 177.682 181.891 187.567
## 4              Angola 102.962  95.821 122.365 134.305 138.180 142.733
## 5 Antigua and Barbuda   1.353   1.398   1.454   1.514   1.578   1.646
## 6           Argentina 631.621 545.124 628.935 658.129 711.893 775.217
##     X2021   X2022 EstimateStartYr
## 1  26.857  29.113            2014
## 2  15.932  17.141            2015
## 3 194.882 202.242            2015
## 4 147.525 150.230            2015
## 5   1.716   1.790            2016
## 6 840.353 908.327            2016

Let’s transpose years into a row.

gdp_data3 <- gdp_data2 %>% gather(Year, GDP, -c("Country","EstimateStartYr"))

head(gdp_data1)

##               Country                     Subject.Descriptor        Units
## 1         Afghanistan Gross domestic product, current prices U.S. dollars
## 2             Albania Gross domestic product, current prices U.S. dollars
## 3             Algeria Gross domestic product, current prices U.S. dollars
## 4              Angola Gross domestic product, current prices U.S. dollars
## 5 Antigua and Barbuda Gross domestic product, current prices U.S. dollars
## 6           Argentina Gross domestic product, current prices U.S. dollars
##      Scale
## 1 Billions
## 2 Billions
## 3 Billions
## 4 Billions
## 5 Billions
## 6 Billions
##                                                 Country.Series.specific.Notes
## 1 See notes for:  Gross domestic product, current prices (National currency).
## 2 See notes for:  Gross domestic product, current prices (National currency).
## 3 See notes for:  Gross domestic product, current prices (National currency).
## 4 See notes for:  Gross domestic product, current prices (National currency).
## 5 See notes for:  Gross domestic product, current prices (National currency).
## 6 See notes for:  Gross domestic product, current prices (National currency).
##     X2015   X2016   X2017   X2018   X2019   X2020   X2021   X2022
## 1  19.687  18.886   20.57  21.706  23.233   24.93  26.857  29.113
## 2  11.393  12.126  12.294   12.92  13.838   14.87  15.932  17.141
## 3 164.779 160.784 173.947 177.682 181.891 187.567 194.882 202.242
## 4 102.962  95.821 122.365 134.305  138.18 142.733 147.525  150.23
## 5   1.353   1.398   1.454   1.514   1.578   1.646   1.716    1.79
## 6 631.621 545.124 628.935 658.129 711.893 775.217 840.353 908.327
##   EstimateStartYr
## 1            2014
## 2            2015
## 3            2015
## 4            2015
## 5            2016
## 6            2016

Next we will reformat Year column by getiing rid of X in front. We will also delete NA in GDP column. And we will resort by country and year

gdp_data3$Year<-substring(gdp_data3$Year,2,5)

gdp_data3<-filter(gdp_data3,GDP!="NA")

gdp_data4<-gdp_data3 %>% arrange(Country, Year)

head(gdp_data4)

##       Country EstimateStartYr Year    GDP
## 1 Afghanistan            2014 2015 19.687
## 2 Afghanistan            2014 2016 18.886
## 3 Afghanistan            2014 2017 20.570
## 4 Afghanistan            2014 2018 21.706
## 5 Afghanistan            2014 2019 23.233
## 6 Afghanistan            2014 2020 24.930

We will calculate mean of gdp, and mean of gdp by country, and mean of gdp by year, min values, max values, count of countries

summarize(gdp_data4, Mean_GDP_billions = mean(GDP, na.rm = T))

##   Mean_GDP_billions
## 1          446.8422

summarize(gdp_data4, Max_GDP_billions = max(GDP))

##   Max_GDP_billions
## 1         23760.33

GroupCountry <- group_by(gdp_data4, Country)

summarize(GroupCountry,mean_GDP_billions=mean(GDP,na.rm=T))

## # A tibble: 191 x 2
##    Country             mean_GDP_billions
##    <fct>                           <dbl>
##  1 Afghanistan                     23.1 
##  2 Albania                         13.8 
##  3 Algeria                        180.  
##  4 Angola                         129.  
##  5 Antigua and Barbuda              1.56
##  6 Argentina                      712.  
##  7 Armenia                         11.8 
##  8 Australia                     1461.  
##  9 Austria                        407.  
## 10 Azerbaijan                      46.9 
## # ... with 181 more rows

GroupYear <- group_by(gdp_data4, Year)

summarize(GroupYear,mean_GDP_billions=mean(GDP,na.rm=T))

## # A tibble: 8 x 2
##   Year  mean_GDP_billions
##   <chr>             <dbl>
## 1 2015               388.
## 2 2016               394.
## 3 2017               410.
## 4 2018               431.
## 5 2019               453.
## 6 2020               476.
## 7 2021               500.
## 8 2022               525.

summarize(GroupYear,number_year=n())

## # A tibble: 8 x 2
##   Year  number_year
##   <chr>       <int>
## 1 2015          191
## 2 2016          191
## 3 2017          189
## 4 2018          189
## 5 2019          189
## 6 2020          189
## 7 2021          189
## 8 2022          189

summarize(GroupYear,min_year=min(GDP))

## # A tibble: 8 x 2
##   Year  min_year
##   <chr>    <dbl>
## 1 2015     0.033
## 2 2016     0.034
## 3 2017     0.036
## 4 2018     0.037
## 5 2019     0.039
## 6 2020     0.04 
## 7 2021     0.042
## 8 2022     0.043

summarize(GroupYear,max_year=max(GDP))

## # A tibble: 8 x 2
##   Year  max_year
##   <chr>    <dbl>
## 1 2015    18037.
## 2 2016    18569.
## 3 2017    19417.
## 4 2018    20352.
## 5 2019    21239.
## 6 2020    22063.
## 7 2021    22886.
## 8 2022    23760.

Mean GDP for all countries for all years in the dataset (2015-2022) is less than GDP of NYC

Max GDP is GDP of USA, of course. Even so China is catching up fast

Mean GDP will increase from 388 billion in 2015 to projected 525 in 2022, a pretty sharp increase which would indicate strong world economy (it could also be due to the weaking of dollar)

Data #2

Source of data is Football (soccer) data used for betting. http://www.football-data.co.uk/englandm.php

The data is wide. Please note that I have deleted most columns to make data more managable.

Columns represent measurments of match. I will transpose them to make data long.

First step would be to read data in

soccer_data<-read.csv(file="C://Data/Soccer.csv")

head(soccer_data)

##   Div      Date     HomeTeam       AwayTeam FTHG FTAG FTR HTHG HTAG HTR
## 1  E0 10/8/2018   Man United      Leicester    2    1   H    1    0   H
## 2  E0 11/8/2018  Bournemouth        Cardiff    2    0   H    1    0   H
## 3  E0 11/8/2018       Fulham Crystal Palace    0    2   A    0    1   A
## 4  E0 11/8/2018 Huddersfield        Chelsea    0    3   A    0    2   A
## 5  E0 11/8/2018    Newcastle      Tottenham    1    2   A    1    2   A
## 6  E0 11/8/2018      Watford       Brighton    2    0   H    1    0   H
##      Referee
## 1 A Marriner
## 2   K Friend
## 3     M Dean
## 4 C Kavanagh
## 5 M Atkinson
## 6     J Moss

Next we will rename columns to make dataset more readible. We will delete columns HTHG, HTAG, and HTR, which show Half Time results, as well as column referee. For our analysis, it does not matter who referee is.

soccer_data1<-rename(soccer_data,"HomeTeamGoals"="FTHG", "AwayTeamGoals"="FTAG", "WinningTeam"="FTR")

soccer_data2<-select(soccer_data1,-c("HTHG","HTAG","HTR","Referee"))

head(soccer_data2)

##   Div      Date     HomeTeam       AwayTeam HomeTeamGoals AwayTeamGoals
## 1  E0 10/8/2018   Man United      Leicester             2             1
## 2  E0 11/8/2018  Bournemouth        Cardiff             2             0
## 3  E0 11/8/2018       Fulham Crystal Palace             0             2
## 4  E0 11/8/2018 Huddersfield        Chelsea             0             3
## 5  E0 11/8/2018    Newcastle      Tottenham             1             2
## 6  E0 11/8/2018      Watford       Brighton             2             0
##   WinningTeam
## 1           H
## 2           H
## 3           A
## 4           A
## 5           A
## 6           H

Now it is time to transpose our data. We will transpose teams.

soccer_data3 <- soccer_data2 %>% gather(TeamOrigin, Team, c("HomeTeam","AwayTeam"))

head(soccer_data3)

##   Div      Date HomeTeamGoals AwayTeamGoals WinningTeam TeamOrigin
## 1  E0 10/8/2018             2             1           H   HomeTeam
## 2  E0 11/8/2018             2             0           H   HomeTeam
## 3  E0 11/8/2018             0             2           A   HomeTeam
## 4  E0 11/8/2018             0             3           A   HomeTeam
## 5  E0 11/8/2018             1             2           A   HomeTeam
## 6  E0 11/8/2018             2             0           H   HomeTeam
##           Team
## 1   Man United
## 2  Bournemouth
## 3       Fulham
## 4 Huddersfield
## 5    Newcastle
## 6      Watford

Next we will create new column which will show how many goals a team scored. And also we will create a column to show if a team won (1) or lost (0).

soccer_data4 <- mutate(soccer_data3, GoalScored = ifelse(TeamOrigin == 'HomeTeam', HomeTeamGoals,         AwayTeamGoals))

soccer_data5 <- mutate(soccer_data4, Outcome = ifelse(TeamOrigin == 'HomeTeam' & WinningTeam=='H',1,    ifelse(TeamOrigin == 'HomeTeam' & WinningTeam=='A',0,ifelse(WinningTeam=='A',1,0))))


head(soccer_data5)

##   Div      Date HomeTeamGoals AwayTeamGoals WinningTeam TeamOrigin
## 1  E0 10/8/2018             2             1           H   HomeTeam
## 2  E0 11/8/2018             2             0           H   HomeTeam
## 3  E0 11/8/2018             0             2           A   HomeTeam
## 4  E0 11/8/2018             0             3           A   HomeTeam
## 5  E0 11/8/2018             1             2           A   HomeTeam
## 6  E0 11/8/2018             2             0           H   HomeTeam
##           Team GoalScored Outcome
## 1   Man United          2       1
## 2  Bournemouth          2       1
## 3       Fulham          0       0
## 4 Huddersfield          0       0
## 5    Newcastle          1       0
## 6      Watford          2       1

Now, after we created new columns, we no more need columns HomeTeamGoals, AwayTeamGoals, WinningTeam, and TeamOrigin, so we will delete them

soccer_data6<-select(soccer_data5,-c("HomeTeamGoals","AwayTeamGoals","WinningTeam","TeamOrigin"))

head(soccer_data6)

##   Div      Date         Team GoalScored Outcome
## 1  E0 10/8/2018   Man United          2       1
## 2  E0 11/8/2018  Bournemouth          2       1
## 3  E0 11/8/2018       Fulham          0       0
## 4  E0 11/8/2018 Huddersfield          0       0
## 5  E0 11/8/2018    Newcastle          1       0
## 6  E0 11/8/2018      Watford          2       1

Now it is time to analyze the data. We will calculate the mean of goals for all teams for all games. Also, we will calculate maximim goals scored by any time for any match. We will calculate mean by outcome (win/loss). And finally we will calculate mean by team.

summarize(soccer_data6, Mean_goals = mean(GoalScored, na.rm = T))

##   Mean_goals
## 1   1.421429

summarize(soccer_data6, Max_goals = max(GoalScored))

##   Max_goals
## 1         6

GroupOutcome <- group_by(soccer_data6, Outcome)

summarize(GroupOutcome,mean_goals_by_outcome=mean(GoalScored,na.rm=T))

## # A tibble: 2 x 2
##   Outcome mean_goals_by_outcome
##     <dbl>                 <dbl>
## 1       0                 0.714
## 2       1                 2.48

GroupTeam <- group_by(soccer_data6, Team)

summarize(GroupTeam,mean_goals_by_team=mean(GoalScored,na.rm=T))

## # A tibble: 20 x 2
##    Team           mean_goals_by_team
##    <chr>                       <dbl>
##  1 Arsenal                     2    
##  2 Bournemouth                 1.71 
##  3 Brighton                    1.14 
##  4 Burnley                     1.29 
##  5 Cardiff                     0.571
##  6 Chelsea                     2.14 
##  7 Crystal Palace              0.714
##  8 Everton                     1.57 
##  9 Fulham                      1.14 
## 10 Huddersfield                0.429
## 11 Leicester                   1.86 
## 12 Liverpool                   2.14 
## 13 Man City                    3    
## 14 Man United                  1.43 
## 15 Newcastle                   0.571
## 16 Southampton                 0.857
## 17 Tottenham                   2    
## 18 Watford                     1.57 
## 19 West Ham                    1.14 
## 20 Wolves                      1.14

So, on avarage a team scores 1.4 goals per game. It looks reasonable, soccer has low output.

Maximum goals scored by any team was 6. It looks right.

Losing team scored on average 0.7 goals and winning 2.5.

Manchaster City is doing good they are scoring 3 goals per game, while Huddersfield is not performing well with only 0.4 goals per game

Data #3

Source of data is US Census, USA Counties Data File Downloads. https://www.census.gov/support/USACdataDownloads.html

The data is wide. Please note that I have deleted most columns to make data more managable.

Columns represent number of farms by year. I will transpose them to make data long.

First step would be to read data in:

agri_data<-read.csv(file="C://Data/Agriculture.csv")

head(agri_data)

##     ï..Areaname STCOU AGN020197F AGN020197D AGN020197N1 AGN020197N2
## 1 UNITED STATES     0          0    1911859           0           0
## 2       ALABAMA  1000          0      41384           0           0
## 3   Autauga, AL  1001          0        348           0           0
## 4   Baldwin, AL  1003          0        977           0           0
## 5   Barbour, AL  1005          0        417           0           0
## 6      Bibb, AL  1007          0        177           0           0
##   AGN020202F AGN020202D AGN020202N1 AGN020202N2 AGN020207F
## 1          0    2128982           0           0          0
## 2          0      45126           0           0          0
## 3          0        373           0           0          0
## 4          0       1062           0           0          0
## 5          0        531           0           0          0
## 6          0        187           0           0          0

Data is unreadible. So we will rename columns. We will delete columns ending with F, N1, and N2 which indicate additional information we do not need for our analysis. We would need column ending with D (data)

agri_data1<-rename(agri_data,"1997"="AGN020197D","2002"="AGN020202D","StateCountyCode"="STCOU", "AreaName"="ï..Areaname")

agri_data2<-select(agri_data1,c("AreaName","StateCountyCode","1997","2002"))

head(agri_data2)

##        AreaName StateCountyCode    1997    2002
## 1 UNITED STATES               0 1911859 2128982
## 2       ALABAMA            1000   41384   45126
## 3   Autauga, AL            1001     348     373
## 4   Baldwin, AL            1003     977    1062
## 5   Barbour, AL            1005     417     531
## 6      Bibb, AL            1007     177     187

Our data has totals and subtotals, we need to delete them

agri_data3<-filter(agri_data2, AreaName!=toupper(AreaName))

head(agri_data3)

##      AreaName StateCountyCode 1997 2002
## 1 Autauga, AL            1001  348  373
## 2 Baldwin, AL            1003  977 1062
## 3 Barbour, AL            1005  417  531
## 4    Bibb, AL            1007  177  187
## 5  Blount, AL            1009 1191 1248
## 6 Bullock, AL            1011  277  273

Now it is time to transpose our data. We will transpose years.

agri_data4 <- agri_data3 %>% gather(Year, Farms, c("1997","2002"))

head(agri_data4)

##      AreaName StateCountyCode Year Farms
## 1 Autauga, AL            1001 1997   348
## 2 Baldwin, AL            1003 1997   977
## 3 Barbour, AL            1005 1997   417
## 4    Bibb, AL            1007 1997   177
## 5  Blount, AL            1009 1997  1191
## 6 Bullock, AL            1011 1997   277

It looks like we need to widen our data - we will split area into State and County

agri_data5 <- agri_data4 %>% separate(AreaName, c("County","State"),",")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [320,
## 3466].

head(agri_data5)

##    County State StateCountyCode Year Farms
## 1 Autauga    AL            1001 1997   348
## 2 Baldwin    AL            1003 1997   977
## 3 Barbour    AL            1005 1997   417
## 4    Bibb    AL            1007 1997   177
## 5  Blount    AL            1009 1997  1191
## 6 Bullock    AL            1011 1997   277

We will calculate:

farms by state in 2002
average farms by year
the lowest number of farms in county
number of counties with 0 farms in 2002
the higest number of farm in county
total # of farms by year

GroupState <- group_by(filter(agri_data5,Year==2002), State)

arrange(summarize(GroupState,farms_by_state=sum(Farms)),farms_by_state)

## # A tibble: 51 x 2
##    State farms_by_state
##    <chr>          <int>
##  1 <NA>               0
##  2 " AK"            609
##  3 " RI"            858
##  4 " DE"           2391
##  5 " NV"           2989
##  6 " NH"           3363
##  7 " CT"           4191
##  8 " HI"           5398
##  9 " MA"           6075
## 10 " VT"           6571
## # ... with 41 more rows

GroupYear <- group_by(agri_data5, Year)

summarize(GroupYear,mean_farms_by_year=mean(Farms))

## # A tibble: 2 x 2
##   Year  mean_farms_by_year
##   <chr>              <dbl>
## 1 1997                608.
## 2 2002                677.

summarize(agri_data5,min_farms_county=min(Farms))

##   min_farms_county
## 1                0

summarize(filter(agri_data5,Year==2002&Farms==0),num_county=n())

##   num_county
## 1         70

summarize(filter(agri_data5,Year==2002),max_farms_county=max(Farms))

##   max_farms_county
## 1             6281

summarize(GroupYear,farms_by_year=sum(Farms))

## # A tibble: 2 x 2
##   Year  farms_by_year
##   <chr>         <int>
## 1 1997        1911859
## 2 2002        2128982

tail(arrange(filter(agri_data5,Year==2002), Farms))

##          County State StateCountyCode Year Farms
## 3141 Stanislaus    CA            6099 2002  4267
## 3142  Clackamas    OR           41005 2002  4676
## 3143  San Diego    CA            6073 2002  5255
## 3144  Lancaster    PA           42071 2002  5293
## 3145     Tulare    CA            6107 2002  5738
## 3146     Fresno    CA            6019 2002  6281

In 2002, Alaska had 609 farms, and Texas had almost 229K

On average, a county in USA had 608 farms in 1997 and suprisingly increasing 677 farms in 2002

There were counties with no farms in USA. Actually in 2002, there were 70 counties with no farms

In 2002, there was one county with 6,281 counties (Fresno, CA)