Data #1
Source of data is International Monetary Fund; “World Economic Outlook”; GDP by country; https://www.imf.org/external/pubs/ft/weo/2018/01/weodata/
The data is wide.
The data layout is a typical wide one. Columns are years. I will transpose years to make data long.
First step would be to read data in
gdp_data<-read.csv(file="C://Data/WEO_Data GDP.csv")
head(gdp_data)
## ï..Country Subject.Descriptor Units
## 1 Afghanistan Gross domestic product, current prices U.S. dollars
## 2 Albania Gross domestic product, current prices U.S. dollars
## 3 Algeria Gross domestic product, current prices U.S. dollars
## 4 Angola Gross domestic product, current prices U.S. dollars
## 5 Antigua and Barbuda Gross domestic product, current prices U.S. dollars
## 6 Argentina Gross domestic product, current prices U.S. dollars
## Scale
## 1 Billions
## 2 Billions
## 3 Billions
## 4 Billions
## 5 Billions
## 6 Billions
## Country.Series.specific.Notes
## 1 See notes for: Gross domestic product, current prices (National currency).
## 2 See notes for: Gross domestic product, current prices (National currency).
## 3 See notes for: Gross domestic product, current prices (National currency).
## 4 See notes for: Gross domestic product, current prices (National currency).
## 5 See notes for: Gross domestic product, current prices (National currency).
## 6 See notes for: Gross domestic product, current prices (National currency).
## X2015 X2016 X2017 X2018 X2019 X2020 X2021 X2022
## 1 19.687 18.886 20.57 21.706 23.233 24.93 26.857 29.113
## 2 11.393 12.126 12.294 12.92 13.838 14.87 15.932 17.141
## 3 164.779 160.784 173.947 177.682 181.891 187.567 194.882 202.242
## 4 102.962 95.821 122.365 134.305 138.18 142.733 147.525 150.23
## 5 1.353 1.398 1.454 1.514 1.578 1.646 1.716 1.79
## 6 631.621 545.124 628.935 658.129 711.893 775.217 840.353 908.327
## Estimates.Start.After
## 1 2014
## 2 2015
## 3 2015
## 4 2015
## 5 2016
## 6 2016
Next, let’s install tidyr and dplyr libraries.
#install.packages("tidyr")
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next we will rename the first and the last columns and delete 2, 3, 4, and 5 columns, which do not contain critical information (columns are repetetive and rudant). Getting rid of these columns will allow us to make dataset more managable. We also need to convert GDP values to numeric, but we have to be careful with commas.
gdp_data1<-rename(gdp_data, "Country"="ï..Country","EstimateStartYr"="Estimates.Start.After")
gdp_data2<-select(gdp_data1,-c("Subject.Descriptor","Units","Scale","Country.Series.specific.Notes"))
gdp_data2[2:9] <- lapply(gdp_data2[2:9], function(x) as.numeric(as.character(gsub("\\,", "",x))))
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
head(gdp_data2)
## Country X2015 X2016 X2017 X2018 X2019 X2020
## 1 Afghanistan 19.687 18.886 20.570 21.706 23.233 24.930
## 2 Albania 11.393 12.126 12.294 12.920 13.838 14.870
## 3 Algeria 164.779 160.784 173.947 177.682 181.891 187.567
## 4 Angola 102.962 95.821 122.365 134.305 138.180 142.733
## 5 Antigua and Barbuda 1.353 1.398 1.454 1.514 1.578 1.646
## 6 Argentina 631.621 545.124 628.935 658.129 711.893 775.217
## X2021 X2022 EstimateStartYr
## 1 26.857 29.113 2014
## 2 15.932 17.141 2015
## 3 194.882 202.242 2015
## 4 147.525 150.230 2015
## 5 1.716 1.790 2016
## 6 840.353 908.327 2016
Let’s transpose years into a row.
gdp_data3 <- gdp_data2 %>% gather(Year, GDP, -c("Country","EstimateStartYr"))
head(gdp_data1)
## Country Subject.Descriptor Units
## 1 Afghanistan Gross domestic product, current prices U.S. dollars
## 2 Albania Gross domestic product, current prices U.S. dollars
## 3 Algeria Gross domestic product, current prices U.S. dollars
## 4 Angola Gross domestic product, current prices U.S. dollars
## 5 Antigua and Barbuda Gross domestic product, current prices U.S. dollars
## 6 Argentina Gross domestic product, current prices U.S. dollars
## Scale
## 1 Billions
## 2 Billions
## 3 Billions
## 4 Billions
## 5 Billions
## 6 Billions
## Country.Series.specific.Notes
## 1 See notes for: Gross domestic product, current prices (National currency).
## 2 See notes for: Gross domestic product, current prices (National currency).
## 3 See notes for: Gross domestic product, current prices (National currency).
## 4 See notes for: Gross domestic product, current prices (National currency).
## 5 See notes for: Gross domestic product, current prices (National currency).
## 6 See notes for: Gross domestic product, current prices (National currency).
## X2015 X2016 X2017 X2018 X2019 X2020 X2021 X2022
## 1 19.687 18.886 20.57 21.706 23.233 24.93 26.857 29.113
## 2 11.393 12.126 12.294 12.92 13.838 14.87 15.932 17.141
## 3 164.779 160.784 173.947 177.682 181.891 187.567 194.882 202.242
## 4 102.962 95.821 122.365 134.305 138.18 142.733 147.525 150.23
## 5 1.353 1.398 1.454 1.514 1.578 1.646 1.716 1.79
## 6 631.621 545.124 628.935 658.129 711.893 775.217 840.353 908.327
## EstimateStartYr
## 1 2014
## 2 2015
## 3 2015
## 4 2015
## 5 2016
## 6 2016
Next we will reformat Year column by getiing rid of X in front. We will also delete NA in GDP column. And we will resort by country and year
gdp_data3$Year<-substring(gdp_data3$Year,2,5)
gdp_data3<-filter(gdp_data3,GDP!="NA")
gdp_data4<-gdp_data3 %>% arrange(Country, Year)
head(gdp_data4)
## Country EstimateStartYr Year GDP
## 1 Afghanistan 2014 2015 19.687
## 2 Afghanistan 2014 2016 18.886
## 3 Afghanistan 2014 2017 20.570
## 4 Afghanistan 2014 2018 21.706
## 5 Afghanistan 2014 2019 23.233
## 6 Afghanistan 2014 2020 24.930
We will calculate mean of gdp, and mean of gdp by country, and mean of gdp by year, min values, max values, count of countries
summarize(gdp_data4, Mean_GDP_billions = mean(GDP, na.rm = T))
## Mean_GDP_billions
## 1 446.8422
summarize(gdp_data4, Max_GDP_billions = max(GDP))
## Max_GDP_billions
## 1 23760.33
GroupCountry <- group_by(gdp_data4, Country)
summarize(GroupCountry,mean_GDP_billions=mean(GDP,na.rm=T))
## # A tibble: 191 x 2
## Country mean_GDP_billions
## <fct> <dbl>
## 1 Afghanistan 23.1
## 2 Albania 13.8
## 3 Algeria 180.
## 4 Angola 129.
## 5 Antigua and Barbuda 1.56
## 6 Argentina 712.
## 7 Armenia 11.8
## 8 Australia 1461.
## 9 Austria 407.
## 10 Azerbaijan 46.9
## # ... with 181 more rows
GroupYear <- group_by(gdp_data4, Year)
summarize(GroupYear,mean_GDP_billions=mean(GDP,na.rm=T))
## # A tibble: 8 x 2
## Year mean_GDP_billions
## <chr> <dbl>
## 1 2015 388.
## 2 2016 394.
## 3 2017 410.
## 4 2018 431.
## 5 2019 453.
## 6 2020 476.
## 7 2021 500.
## 8 2022 525.
summarize(GroupYear,number_year=n())
## # A tibble: 8 x 2
## Year number_year
## <chr> <int>
## 1 2015 191
## 2 2016 191
## 3 2017 189
## 4 2018 189
## 5 2019 189
## 6 2020 189
## 7 2021 189
## 8 2022 189
summarize(GroupYear,min_year=min(GDP))
## # A tibble: 8 x 2
## Year min_year
## <chr> <dbl>
## 1 2015 0.033
## 2 2016 0.034
## 3 2017 0.036
## 4 2018 0.037
## 5 2019 0.039
## 6 2020 0.04
## 7 2021 0.042
## 8 2022 0.043
summarize(GroupYear,max_year=max(GDP))
## # A tibble: 8 x 2
## Year max_year
## <chr> <dbl>
## 1 2015 18037.
## 2 2016 18569.
## 3 2017 19417.
## 4 2018 20352.
## 5 2019 21239.
## 6 2020 22063.
## 7 2021 22886.
## 8 2022 23760.
Mean GDP for all countries for all years in the dataset (2015-2022) is less than GDP of NYC
Max GDP is GDP of USA, of course. Even so China is catching up fast
Mean GDP will increase from 388 billion in 2015 to projected 525 in 2022, a pretty sharp increase which would indicate strong world economy (it could also be due to the weaking of dollar)
Data #2
Source of data is Football (soccer) data used for betting. http://www.football-data.co.uk/englandm.php
The data is wide. Please note that I have deleted most columns to make data more managable.
Columns represent measurments of match. I will transpose them to make data long.
First step would be to read data in
soccer_data<-read.csv(file="C://Data/Soccer.csv")
head(soccer_data)
## Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR
## 1 E0 10/8/2018 Man United Leicester 2 1 H 1 0 H
## 2 E0 11/8/2018 Bournemouth Cardiff 2 0 H 1 0 H
## 3 E0 11/8/2018 Fulham Crystal Palace 0 2 A 0 1 A
## 4 E0 11/8/2018 Huddersfield Chelsea 0 3 A 0 2 A
## 5 E0 11/8/2018 Newcastle Tottenham 1 2 A 1 2 A
## 6 E0 11/8/2018 Watford Brighton 2 0 H 1 0 H
## Referee
## 1 A Marriner
## 2 K Friend
## 3 M Dean
## 4 C Kavanagh
## 5 M Atkinson
## 6 J Moss
Next we will rename columns to make dataset more readible. We will delete columns HTHG, HTAG, and HTR, which show Half Time results, as well as column referee. For our analysis, it does not matter who referee is.
soccer_data1<-rename(soccer_data,"HomeTeamGoals"="FTHG", "AwayTeamGoals"="FTAG", "WinningTeam"="FTR")
soccer_data2<-select(soccer_data1,-c("HTHG","HTAG","HTR","Referee"))
head(soccer_data2)
## Div Date HomeTeam AwayTeam HomeTeamGoals AwayTeamGoals
## 1 E0 10/8/2018 Man United Leicester 2 1
## 2 E0 11/8/2018 Bournemouth Cardiff 2 0
## 3 E0 11/8/2018 Fulham Crystal Palace 0 2
## 4 E0 11/8/2018 Huddersfield Chelsea 0 3
## 5 E0 11/8/2018 Newcastle Tottenham 1 2
## 6 E0 11/8/2018 Watford Brighton 2 0
## WinningTeam
## 1 H
## 2 H
## 3 A
## 4 A
## 5 A
## 6 H
Now it is time to transpose our data. We will transpose teams.
soccer_data3 <- soccer_data2 %>% gather(TeamOrigin, Team, c("HomeTeam","AwayTeam"))
head(soccer_data3)
## Div Date HomeTeamGoals AwayTeamGoals WinningTeam TeamOrigin
## 1 E0 10/8/2018 2 1 H HomeTeam
## 2 E0 11/8/2018 2 0 H HomeTeam
## 3 E0 11/8/2018 0 2 A HomeTeam
## 4 E0 11/8/2018 0 3 A HomeTeam
## 5 E0 11/8/2018 1 2 A HomeTeam
## 6 E0 11/8/2018 2 0 H HomeTeam
## Team
## 1 Man United
## 2 Bournemouth
## 3 Fulham
## 4 Huddersfield
## 5 Newcastle
## 6 Watford
Next we will create new column which will show how many goals a team scored. And also we will create a column to show if a team won (1) or lost (0).
soccer_data4 <- mutate(soccer_data3, GoalScored = ifelse(TeamOrigin == 'HomeTeam', HomeTeamGoals, AwayTeamGoals))
soccer_data5 <- mutate(soccer_data4, Outcome = ifelse(TeamOrigin == 'HomeTeam' & WinningTeam=='H',1, ifelse(TeamOrigin == 'HomeTeam' & WinningTeam=='A',0,ifelse(WinningTeam=='A',1,0))))
head(soccer_data5)
## Div Date HomeTeamGoals AwayTeamGoals WinningTeam TeamOrigin
## 1 E0 10/8/2018 2 1 H HomeTeam
## 2 E0 11/8/2018 2 0 H HomeTeam
## 3 E0 11/8/2018 0 2 A HomeTeam
## 4 E0 11/8/2018 0 3 A HomeTeam
## 5 E0 11/8/2018 1 2 A HomeTeam
## 6 E0 11/8/2018 2 0 H HomeTeam
## Team GoalScored Outcome
## 1 Man United 2 1
## 2 Bournemouth 2 1
## 3 Fulham 0 0
## 4 Huddersfield 0 0
## 5 Newcastle 1 0
## 6 Watford 2 1
Now, after we created new columns, we no more need columns HomeTeamGoals, AwayTeamGoals, WinningTeam, and TeamOrigin, so we will delete them
soccer_data6<-select(soccer_data5,-c("HomeTeamGoals","AwayTeamGoals","WinningTeam","TeamOrigin"))
head(soccer_data6)
## Div Date Team GoalScored Outcome
## 1 E0 10/8/2018 Man United 2 1
## 2 E0 11/8/2018 Bournemouth 2 1
## 3 E0 11/8/2018 Fulham 0 0
## 4 E0 11/8/2018 Huddersfield 0 0
## 5 E0 11/8/2018 Newcastle 1 0
## 6 E0 11/8/2018 Watford 2 1
Now it is time to analyze the data. We will calculate the mean of goals for all teams for all games. Also, we will calculate maximim goals scored by any time for any match. We will calculate mean by outcome (win/loss). And finally we will calculate mean by team.
summarize(soccer_data6, Mean_goals = mean(GoalScored, na.rm = T))
## Mean_goals
## 1 1.421429
summarize(soccer_data6, Max_goals = max(GoalScored))
## Max_goals
## 1 6
GroupOutcome <- group_by(soccer_data6, Outcome)
summarize(GroupOutcome,mean_goals_by_outcome=mean(GoalScored,na.rm=T))
## # A tibble: 2 x 2
## Outcome mean_goals_by_outcome
## <dbl> <dbl>
## 1 0 0.714
## 2 1 2.48
GroupTeam <- group_by(soccer_data6, Team)
summarize(GroupTeam,mean_goals_by_team=mean(GoalScored,na.rm=T))
## # A tibble: 20 x 2
## Team mean_goals_by_team
## <chr> <dbl>
## 1 Arsenal 2
## 2 Bournemouth 1.71
## 3 Brighton 1.14
## 4 Burnley 1.29
## 5 Cardiff 0.571
## 6 Chelsea 2.14
## 7 Crystal Palace 0.714
## 8 Everton 1.57
## 9 Fulham 1.14
## 10 Huddersfield 0.429
## 11 Leicester 1.86
## 12 Liverpool 2.14
## 13 Man City 3
## 14 Man United 1.43
## 15 Newcastle 0.571
## 16 Southampton 0.857
## 17 Tottenham 2
## 18 Watford 1.57
## 19 West Ham 1.14
## 20 Wolves 1.14
So, on avarage a team scores 1.4 goals per game. It looks reasonable, soccer has low output.
Maximum goals scored by any team was 6. It looks right.
Losing team scored on average 0.7 goals and winning 2.5.
Manchaster City is doing good they are scoring 3 goals per game, while Huddersfield is not performing well with only 0.4 goals per game
Data #3
Source of data is US Census, USA Counties Data File Downloads. https://www.census.gov/support/USACdataDownloads.html
The data is wide. Please note that I have deleted most columns to make data more managable.
Columns represent number of farms by year. I will transpose them to make data long.
First step would be to read data in:
agri_data<-read.csv(file="C://Data/Agriculture.csv")
head(agri_data)
## ï..Areaname STCOU AGN020197F AGN020197D AGN020197N1 AGN020197N2
## 1 UNITED STATES 0 0 1911859 0 0
## 2 ALABAMA 1000 0 41384 0 0
## 3 Autauga, AL 1001 0 348 0 0
## 4 Baldwin, AL 1003 0 977 0 0
## 5 Barbour, AL 1005 0 417 0 0
## 6 Bibb, AL 1007 0 177 0 0
## AGN020202F AGN020202D AGN020202N1 AGN020202N2 AGN020207F
## 1 0 2128982 0 0 0
## 2 0 45126 0 0 0
## 3 0 373 0 0 0
## 4 0 1062 0 0 0
## 5 0 531 0 0 0
## 6 0 187 0 0 0
Data is unreadible. So we will rename columns. We will delete columns ending with F, N1, and N2 which indicate additional information we do not need for our analysis. We would need column ending with D (data)
agri_data1<-rename(agri_data,"1997"="AGN020197D","2002"="AGN020202D","StateCountyCode"="STCOU", "AreaName"="ï..Areaname")
agri_data2<-select(agri_data1,c("AreaName","StateCountyCode","1997","2002"))
head(agri_data2)
## AreaName StateCountyCode 1997 2002
## 1 UNITED STATES 0 1911859 2128982
## 2 ALABAMA 1000 41384 45126
## 3 Autauga, AL 1001 348 373
## 4 Baldwin, AL 1003 977 1062
## 5 Barbour, AL 1005 417 531
## 6 Bibb, AL 1007 177 187
Our data has totals and subtotals, we need to delete them
agri_data3<-filter(agri_data2, AreaName!=toupper(AreaName))
head(agri_data3)
## AreaName StateCountyCode 1997 2002
## 1 Autauga, AL 1001 348 373
## 2 Baldwin, AL 1003 977 1062
## 3 Barbour, AL 1005 417 531
## 4 Bibb, AL 1007 177 187
## 5 Blount, AL 1009 1191 1248
## 6 Bullock, AL 1011 277 273
Now it is time to transpose our data. We will transpose years.
agri_data4 <- agri_data3 %>% gather(Year, Farms, c("1997","2002"))
head(agri_data4)
## AreaName StateCountyCode Year Farms
## 1 Autauga, AL 1001 1997 348
## 2 Baldwin, AL 1003 1997 977
## 3 Barbour, AL 1005 1997 417
## 4 Bibb, AL 1007 1997 177
## 5 Blount, AL 1009 1997 1191
## 6 Bullock, AL 1011 1997 277
It looks like we need to widen our data - we will split area into State and County
agri_data5 <- agri_data4 %>% separate(AreaName, c("County","State"),",")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [320,
## 3466].
head(agri_data5)
## County State StateCountyCode Year Farms
## 1 Autauga AL 1001 1997 348
## 2 Baldwin AL 1003 1997 977
## 3 Barbour AL 1005 1997 417
## 4 Bibb AL 1007 1997 177
## 5 Blount AL 1009 1997 1191
## 6 Bullock AL 1011 1997 277
We will calculate:
GroupState <- group_by(filter(agri_data5,Year==2002), State)
arrange(summarize(GroupState,farms_by_state=sum(Farms)),farms_by_state)
## # A tibble: 51 x 2
## State farms_by_state
## <chr> <int>
## 1 <NA> 0
## 2 " AK" 609
## 3 " RI" 858
## 4 " DE" 2391
## 5 " NV" 2989
## 6 " NH" 3363
## 7 " CT" 4191
## 8 " HI" 5398
## 9 " MA" 6075
## 10 " VT" 6571
## # ... with 41 more rows
GroupYear <- group_by(agri_data5, Year)
summarize(GroupYear,mean_farms_by_year=mean(Farms))
## # A tibble: 2 x 2
## Year mean_farms_by_year
## <chr> <dbl>
## 1 1997 608.
## 2 2002 677.
summarize(agri_data5,min_farms_county=min(Farms))
## min_farms_county
## 1 0
summarize(filter(agri_data5,Year==2002&Farms==0),num_county=n())
## num_county
## 1 70
summarize(filter(agri_data5,Year==2002),max_farms_county=max(Farms))
## max_farms_county
## 1 6281
summarize(GroupYear,farms_by_year=sum(Farms))
## # A tibble: 2 x 2
## Year farms_by_year
## <chr> <int>
## 1 1997 1911859
## 2 2002 2128982
tail(arrange(filter(agri_data5,Year==2002), Farms))
## County State StateCountyCode Year Farms
## 3141 Stanislaus CA 6099 2002 4267
## 3142 Clackamas OR 41005 2002 4676
## 3143 San Diego CA 6073 2002 5255
## 3144 Lancaster PA 42071 2002 5293
## 3145 Tulare CA 6107 2002 5738
## 3146 Fresno CA 6019 2002 6281
In 2002, Alaska had 609 farms, and Texas had almost 229K
On average, a county in USA had 608 farms in 1997 and suprisingly increasing 677 farms in 2002
There were counties with no farms in USA. Actually in 2002, there were 70 counties with no farms
In 2002, there was one county with 6,281 counties (Fresno, CA)