Overview:
In this assignment we are asked to choose three wide, structured datasets and to create a .CSV file from each. The goal of the assignment is to tidy and transform the data as described below:
Create a .CSV file that includes all of the information in the dataset, and read the .CSV file into R.
Use tidyr and dplyr to tidy and transform the data and perform our analysis.
Code should be in an R Markdown file, and should include narrative descriptions of the data cleanup work, analysis, and conclusions.
To begin, we will create a .CSV file, upload it to GitHub, and then load it from GitHub into R. From R, we will use the necessary libraries to tidy and transform the data.
Load required libraries:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
This dataset contains U.S. Imports for Consumption of Steel Products, measured by quantity in metric tons and value in thousands of dollars, for November, December, and the annual totals of 2021 and 2022.
First we will load the data:
Obtain data from Github:
Steel<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/US%20imports%20for%20Consumption%20of%20Steel%20Products%20Dec%202022%20.csv", skip = 6L)
head(Steel)
## X X.1 X.2 X.3 X.4 X.5 X.6 X.7 X.8
## 1 December Nov-22 December November
## 2 Final Final Final Final
## 3 Commodity
## 4 Grouping Quantity Value Quantity Value Quantity Value Quantity Value
## 5
## 6
## X.9 X.10 X.11 X.12
## 1 2022 Final 2021 Final
## 2
## 3
## 4 Quantity Value Quantity Value
## 5
## 6
The data above is untidy, so we will transform it for further analysis. We will add appropriate column names to identify our data and remove a few rows that we do not need:
colnames(Steel)[1:13] <- c("Commodity Group", "Quantity December 2022", "Value December 2022", "Quantity November 2022", "Value November 2022", "Quantity December 2021", "Value December 2021", "Quantity November 2021", "Value November 2021", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021")
Steel2 <- Steel[-c(1:7, 9, 49, 50),]
head(Steel2)
## Commodity Group Quantity December 2022 Value December 2022
## 8 Total Selected Commodities 1,999,419 2,962,664
## 10 Ingots And Steel For Castings 496 2,775
## 11 Blooms, Billets And Slabs 307,434 274,080
## 12 Wire Rods 62,430 74,175
## 13 Structural Shapes Heavy 33,489 44,633
## 14 Steel Piling 16,617 20,727
## Quantity November 2022 Value November 2022 Quantity December 2021
## 8 1,826,547 2,785,960 2,620,324
## 10 317 2,430 1,119
## 11 224,505 183,759 570,331
## 12 66,820 86,418 103,409
## 13 54,108 71,357 38,912
## 14 3,280 4,293 3,614
## Value December 2021 Quantity November 2021 Value November 2021
## 8 3,821,610 2,626,154 3,653,916
## 10 3,671 1,310 3,936
## 11 508,257 504,511 446,403
## 12 114,330 97,781 106,199
## 13 51,210 66,487 74,199
## 14 3,711 3,867 3,378
## Quantity Final 2022 Value Final 2022 Quantity Final 2021 Value Final 2021
## 8 28,014,754 41,944,101 28,576,768 33,456,282
## 10 13,345 61,808 16,432 48,905
## 11 5,053,020 4,601,400 7,881,262 6,143,157
## 12 1,352,232 1,554,663 1,042,755 1,011,789
## 13 723,674 950,645 530,946 539,266
## 14 109,589 129,247 77,291 68,115
Above we can see that all the columns now have appropriate names instead of X placeholders, and the empty rows from the original .CSV file have been removed.
Now that our data looks much cleaner, we can start analyzing:
First we’ll look at each commodity by quantity at the end of 2021 and 2022:
commodity <- Steel2[-c(1),]
commodity <- select(commodity, "Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021")
commodityq <- commodity %>% gather("Annual", "Quantity", 2:5) %>% filter(Annual %in% c("Quantity Final 2022", "Quantity Final 2021"))
#Our quantities were being read as strings because they included commas. We have changed them to numeric in this step:
commodityq$Quantity <- as.numeric(gsub(",", "", commodityq$Quantity))
Visual Comparison of Commodity Imports by Quantity:
ggplot(data=commodityq, aes(x=`Commodity Group`, y=Quantity, fill = factor(`Commodity Group`))) + geom_bar(stat = "identity") + theme(legend.position = "none", axis.text.y = element_text(size=4), axis.text.x = element_text(angle = 90, hjust = 1, size=5)) + labs(y="Quantity (Metric Tons)") + coord_flip() + facet_wrap(~Annual, ncol=2)
From the graph above, we can see that overall quantities of steel imports for consumption increased between 2021 and 2022, with a few variations. Notably, the largest import by quantity, Blooms, Billets and Slabs, actually decreased from 2021 to 2022 by about 2,828,242 metric tons. From this we might speculate that the COVID pandemic negatively impacted the consumption of Blooms, Billets and Slabs.
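As a sanity check on that figure, we can compute the year-over-year difference directly from the tidied commodityq table (the commodity name is copied exactly from the table output above):
bbs <- commodityq %>% filter(`Commodity Group` == "Blooms, Billets And Slabs")
bbs$Quantity[bbs$Annual == "Quantity Final 2021"] -
  bbs$Quantity[bbs$Annual == "Quantity Final 2022"]
## [1] 2828242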
Visual Comparison of Commodity Imports by Value
commodityv <- commodity %>% gather("Annual", "Value", 2:5) %>% filter(Annual %in% c("Value Final 2022", "Value Final 2021"))
#Our values were being read as strings because they included a comma in them. We have changed them to numeric in this step:
commodityv$Value <- as.numeric(gsub(",", "", commodityv$Value))
ggplot(data=commodityv, aes(x=`Commodity Group`, y=Value, fill = factor(`Commodity Group`))) + geom_bar(stat = "identity") + theme(legend.position = "none", axis.text.y = element_text(size=4), axis.text.x = element_text(angle = 90, hjust = 1, size=5)) + labs(y="Value (Thousands of Dollars)") + coord_flip() + facet_wrap(~Annual, ncol=2)
From the graph above, we can see that the values of the imports generally correlate with the quantities being imported.
One noticeable difference is between two of the largest steel imports, Oil Country Goods and Blooms, Billets and Slabs. Oil Country Goods are apparently more expensive per ton than Blooms, Billets and Slabs, which is why in 2021, even though the quantity was higher for Blooms, Billets and Slabs, the value was higher for Oil Country Goods. From this we might speculate that the war between Russia and Ukraine negatively impacted steel prices, especially for Oil Country Goods.
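One way to check that reading is to derive an implied unit value for each commodity. Here is a minimal sketch that divides 2021 value by 2021 quantity (value is reported in thousands of dollars, so we multiply by 1,000 to get dollars per metric ton):
unit_value <- commodity %>%
  mutate(Q2021 = as.numeric(gsub(",", "", `Quantity Final 2021`)),
         V2021 = as.numeric(gsub(",", "", `Value Final 2021`)),
         DollarsPerTon = round(1000 * V2021 / Q2021)) %>%   # implied $ per metric ton
  select(`Commodity Group`, DollarsPerTon) %>%
  arrange(desc(DollarsPerTon))
head(unit_value)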
Total import for consumption of steel:
Finally we will look at the total import for consumption of steel products by quantity and value for the years 2021 and 2022.
totalq <- Steel2 %>% select("Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021") %>% gather("Annual", "Quantity", 2:5) %>% filter(`Commodity Group` == "Total Selected Commodities") %>% filter(Annual %in% c("Quantity Final 2022", "Quantity Final 2021"))
totalv <- Steel2 %>% select("Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021") %>% gather("Annual", "Value", 2:5) %>% filter(`Commodity Group` == "Total Selected Commodities" & Annual %in% c("Value Final 2022", "Value Final 2021"))
ggplot(totalq, aes(Annual, Quantity)) + geom_bar(stat="identity", width = 0.7, color = "steelblue4", fill = "steelblue1") + labs(y="Quantity (Metric Tons)", title="Total Import of Steel by Quantity for 2021 and 2022")
At first glance, this graph suggests that the overall quantity decreased by almost 50%, which is misleading. If we look at the actual numbers, the total quantity of steel imports for consumption at the end of 2022 was only about 2% lower than in 2021.
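The distortion is likely an artifact of the data types: Quantity in totalq is still a comma-formatted character column, so ggplot treats the two totals as discrete levels rather than as magnitudes. A minimal fix (an assumption about the cause, but consistent with the numbers below) is to convert before plotting:
# Quantity is still a character column here; converting it makes the bar
# heights reflect the actual totals rather than factor levels:
totalq$Quantity <- as.numeric(gsub(",", "", totalq$Quantity))
(28576768 - 28014754) / 28576768  # ~0.0197, about a 2% decrease from 2021 to 2022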
Further, we would expect the total value to correlate with these quantities. Let's find out what the total value of steel imports for consumption looks like:
ggplot(totalv, aes(Annual, Value)) + geom_bar(stat="identity", width = 0.7, color = "steelblue4", fill = "steelblue1") + labs(y="Value (Thousands of Dollars)", title="Total Import of Steel by Value for 2021 and 2022")
From the graph above we can see that although the import quantity decreased slightly, the overall value of steel imports was higher in 2022 than in 2021.
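The same caveat applies here: Value in totalv is also a comma-formatted character column. Converting it and computing the change from the annual totals in the table above:
totalv$Value <- as.numeric(gsub(",", "", totalv$Value))
(41944101 - 33456282) / 33456282  # ~0.254, roughly a 25% increase in value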
Conclusion:
In this project we have taken an untidy dataset and tidied it up. From our analysis we can see that U.S. imports for consumption of steel products decreased slightly in quantity while increasing in value, which shows the relationship between value and quantity. What also stands out is that the largest import by quantity, Blooms, Billets and Slabs, decreased significantly from 2021 to 2022, which I assume could be due to the COVID pandemic. The overall price increase may have been driven by the rising value of Oil Country Goods, which I assume could be due to the recent war between Russia and Ukraine.
This is a dataset that contains the 2016 US presidential vote counts for two U.S. states: California and Florida.
First we will load the data:
Obtain data from Github:
Election<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/Election%20Result%20for%202016%20U.S.%20Presidential%20Vote.csv")
Election
## X Candidate CA FL
## 1 1 Hillary Clinton 5931283 4485745
## 2 2 Donald Trump 3184721 4605515
## 3 3 Gary Johnson 308392 206007
## 4 4 Jill Stein 166311 64019
In this table, the column names CA and FL are values of the variable State. Therefore, this table is untidy, and its format is wide.
Tidying the data:
We will remove the X column, which adds no value to our data.
Election2 <- Election[,-1]
Election2
## Candidate CA FL
## 1 Hillary Clinton 5931283 4485745
## 2 Donald Trump 3184721 4605515
## 3 Gary Johnson 308392 206007
## 4 Jill Stein 166311 64019
Long Format
We are going to melt this table from wide to long format below using library(data.table):
library(data.table)
Election2<- melt(as.data.table(Election2),
id.vars = "Candidate",
measure.vars = c("CA", "FL"),
variable.name = "State",
value.name = "Votes")
Election2
## Candidate State Votes
## 1: Hillary Clinton CA 5931283
## 2: Donald Trump CA 3184721
## 3: Gary Johnson CA 308392
## 4: Jill Stein CA 166311
## 5: Hillary Clinton FL 4485745
## 6: Donald Trump FL 4605515
## 7: Gary Johnson FL 206007
## 8: Jill Stein FL 64019
From the table above we can see that the data is now in long format and the extra X column has been removed. Now that we have a tidy dataset, let's do some analysis on it.
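As an aside, the same wide-to-long reshape can be written with tidyr, which is already loaded. An equivalent sketch applied to the wide table (shown for comparison, not run here):
# tidyr equivalent of the melt() call above, starting from the wide Election table:
Election_long <- pivot_longer(Election[, -1], cols = c("CA", "FL"),
                              names_to = "State", values_to = "Votes")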
Analysis:
Let's calculate the total votes for each candidate:
Total_Votes <- Election2 %>%
group_by(Candidate) %>%
summarize(Total_Votes = sum(Votes))
Total_Votes
## # A tibble: 4 × 2
## Candidate Total_Votes
## <chr> <int>
## 1 Donald Trump 7790236
## 2 Gary Johnson 514399
## 3 Hillary Clinton 10417028
## 4 Jill Stein 230330
From this table we can see that Hillary Clinton has the most votes in CA and FL combined.
Now we are going to add the total votes to the long format data frame:
Election3 <- merge(Election2, Total_Votes, by = "Candidate")
Election3
## Candidate State Votes Total_Votes
## 1: Donald Trump CA 3184721 7790236
## 2: Donald Trump FL 4605515 7790236
## 3: Gary Johnson CA 308392 514399
## 4: Gary Johnson FL 206007 514399
## 5: Hillary Clinton CA 5931283 10417028
## 6: Hillary Clinton FL 4485745 10417028
## 7: Jill Stein CA 166311 230330
## 8: Jill Stein FL 64019 230330
Above we can see the per-state votes and the total votes in one table.
Let's calculate the percentage of total votes for each candidate:
Vote_Percentage <- Election3 %>%
group_by(Candidate, State)%>%
summarize(Percentage = Votes/Total_Votes)
## `summarise()` has grouped output by 'Candidate'. You can override using the
## `.groups` argument.
Vote_Percentage$Percentage <- label_percent()(Vote_Percentage$Percentage)
Vote_Percentage
## # A tibble: 8 × 3
## # Groups: Candidate [4]
## Candidate State Percentage
## <chr> <fct> <chr>
## 1 Donald Trump CA 40.88%
## 2 Donald Trump FL 59.12%
## 3 Gary Johnson CA 59.95%
## 4 Gary Johnson FL 40.05%
## 5 Hillary Clinton CA 56.94%
## 6 Hillary Clinton FL 43.06%
## 7 Jill Stein CA 72.21%
## 8 Jill Stein FL 27.79%
From the table above, we can see the percentage split between the two states for each candidate. This does not give a clear picture of who is leading, because each percentage is calculated out of that candidate's own total vote count. For example, Jill Stein's 72.21% in CA does not mean she is leading in CA; it means that 72.21% of her own votes came from CA.
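A clearer comparison is each candidate's share of the votes cast within a state. A short sketch using the long Election2 table (a follow-up calculation, not part of the original analysis):
# Share of all votes cast in each state, by candidate:
Election2 %>%
  group_by(State) %>%
  mutate(StateShare = label_percent(accuracy = 0.01)(Votes / sum(Votes)))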
Visualization
library(ggplot2)
ggplot(Election3, aes(x = Candidate, y = Votes, fill = State)) +
geom_col(position = "dodge") +
labs(title = "Vote Counts in Each State Per Candidate",
x = "Candidate", y = "Votes",
fill = "State")
The graph above shows the number of votes for each candidate in each state, CA and FL, from the 2016 election.
Conclusion:
In this project we have taken an untidy dataset containing the 2016 US presidential vote counts for two U.S. states, California and Florida, and tidied it up. We calculated the total number of votes and the percentage for each candidate. From our analysis we can see that although Donald Trump was leading in FL, Hillary Clinton was quite close, within about 1.3% of the state's votes among these four candidates. Additionally, Hillary Clinton had the most votes in CA, ahead of Donald Trump as well as the other candidates. As for the percentage analysis, further analysis could be done by calculating each candidate's percentage of the total votes cast as a whole.
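That follow-up could be as simple as dividing each candidate's combined total by all votes cast across both states; a minimal sketch using the Total_Votes table from above:
Total_Votes %>%
  mutate(OverallShare = label_percent(accuracy = 0.1)(Total_Votes / sum(Total_Votes)))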
This is a dataset that contains the total output of farms by state from 1960-2004. We will tidy the data in order to find some patterns.
First we will load the data:
Obtain data from Github:
Note that we skip the first five rows when loading, so that the sixth row of the file is used as the column names.
Farm<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/Farm%20Output%20by%20State.CSV", skip = 5L)
head(Farm)
## Year AL AR AZ CA CO CT DE FL GA IA
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA
## 3 1960 0.5668 0.6085 0.4063 3.3565 0.6274 0.1205 0.0889 0.8115 0.7119 2.8691
## 4 1961 0.5765 0.6561 0.4236 3.3836 0.6363 0.1191 0.0867 0.8756 0.7621 2.9787
## 5 1962 0.5604 0.6731 0.4392 3.5495 0.6212 0.1168 0.0865 0.9719 0.7421 2.9869
## 6 1963 0.6288 0.7117 0.4559 3.5883 0.6260 0.1183 0.0916 0.8532 0.8442 3.2091
## ID IL IN KS KY LA MA MD ME MI MN
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 0.4713 2.0900 1.1926 1.3509 0.6772 0.3159 0.1586 0.2653 0.1615 0.8521 1.8182
## 4 0.5004 2.1441 1.1894 1.3667 0.7308 0.3254 0.1557 0.2657 0.1743 0.8963 1.8957
## 5 0.5288 2.1843 1.2265 1.3114 0.7283 0.3409 0.1548 0.2690 0.1754 0.9003 1.7694
## 6 0.5503 2.2805 1.2853 1.3163 0.7705 0.3762 0.1528 0.2702 0.1799 0.9073 1.9349
## MO MS MT NC ND NE NH NJ NM NV NY
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 1.2814 0.6156 0.4086 1.0808 0.6278 1.3339 0.0547 0.2913 0.2268 0.0549 1.0470
## 4 1.3055 0.6624 0.3547 1.0994 0.4852 1.3145 0.0556 0.2770 0.2333 0.0524 1.0757
## 5 1.2508 0.6601 0.4309 1.1417 0.7527 1.3634 0.0544 0.2769 0.2464 0.0578 1.0510
## 6 1.3501 0.7595 0.4634 1.1709 0.6995 1.3916 0.0541 0.2617 0.2554 0.0599 1.0784
## OH OK OR PA RI SC SD TN TX UT VA
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 1.1778 0.8314 0.4206 0.8766 0.0190 0.3862 0.8235 0.6198 2.2307 0.1885 0.5463
## 4 1.1700 0.8108 0.4338 0.8895 0.0187 0.3974 0.7932 0.6418 2.2370 0.1871 0.5628
## 5 1.1774 0.7242 0.4572 0.8394 0.0191 0.4088 0.8547 0.6172 2.1889 0.1975 0.5653
## 6 1.2121 0.7349 0.4600 0.8579 0.0189 0.4094 0.8948 0.6555 2.3235 0.1971 0.5008
## VT WA WI WV WY
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 0.1571 0.5693 1.6851 0.1393 0.1755
## 4 0.1620 0.5761 1.7360 0.1363 0.1763
## 5 0.1588 0.6061 1.7720 0.1290 0.1807
## 6 0.1575 0.6409 1.7240 0.1230 0.2003
Tidying the Data:
Let’s eliminate all empty rows:
Farm2 <- na.omit(Farm)
head (Farm2)
## Year AL AR AZ CA CO CT DE FL GA IA
## 3 1960 0.5668 0.6085 0.4063 3.3565 0.6274 0.1205 0.0889 0.8115 0.7119 2.8691
## 4 1961 0.5765 0.6561 0.4236 3.3836 0.6363 0.1191 0.0867 0.8756 0.7621 2.9787
## 5 1962 0.5604 0.6731 0.4392 3.5495 0.6212 0.1168 0.0865 0.9719 0.7421 2.9869
## 6 1963 0.6288 0.7117 0.4559 3.5883 0.6260 0.1183 0.0916 0.8532 0.8442 3.2091
## 7 1964 0.6314 0.7517 0.4292 3.6871 0.6235 0.1175 0.0902 0.8508 0.8278 3.2115
## 8 1965 0.6643 0.8145 0.4602 3.6855 0.6213 0.1189 0.1002 0.9743 0.8739 3.1501
## ID IL IN KS KY LA MA MD ME MI MN
## 3 0.4713 2.0900 1.1926 1.3509 0.6772 0.3159 0.1586 0.2653 0.1615 0.8521 1.8182
## 4 0.5004 2.1441 1.1894 1.3667 0.7308 0.3254 0.1557 0.2657 0.1743 0.8963 1.8957
## 5 0.5288 2.1843 1.2265 1.3114 0.7283 0.3409 0.1548 0.2690 0.1754 0.9003 1.7694
## 6 0.5503 2.2805 1.2853 1.3163 0.7705 0.3762 0.1528 0.2702 0.1799 0.9073 1.9349
## 7 0.5460 2.1810 1.1761 1.2692 0.8157 0.3738 0.1473 0.2742 0.1792 0.9466 1.8176
## 8 0.5452 2.3400 1.2316 1.3680 0.7482 0.3908 0.1434 0.2944 0.1740 0.8724 1.7931
## MO MS MT NC ND NE NH NJ NM NV NY
## 3 1.2814 0.6156 0.4086 1.0808 0.6278 1.3339 0.0547 0.2913 0.2268 0.0549 1.0470
## 4 1.3055 0.6624 0.3547 1.0994 0.4852 1.3145 0.0556 0.2770 0.2333 0.0524 1.0757
## 5 1.2508 0.6601 0.4309 1.1417 0.7527 1.3634 0.0544 0.2769 0.2464 0.0578 1.0510
## 6 1.3501 0.7595 0.4634 1.1709 0.6995 1.3916 0.0541 0.2617 0.2554 0.0599 1.0784
## 7 1.2891 0.7818 0.4641 1.2103 0.7110 1.3560 0.0514 0.2477 0.2300 0.0608 1.0594
## 8 1.3551 0.7578 0.4940 1.1144 0.7736 1.4025 0.0525 0.2511 0.2445 0.0672 1.0805
## OH OK OR PA RI SC SD TN TX UT VA
## 3 1.1778 0.8314 0.4206 0.8766 0.0190 0.3862 0.8235 0.6198 2.2307 0.1885 0.5463
## 4 1.1700 0.8108 0.4338 0.8895 0.0187 0.3974 0.7932 0.6418 2.2370 0.1871 0.5628
## 5 1.1774 0.7242 0.4572 0.8394 0.0191 0.4088 0.8547 0.6172 2.1889 0.1975 0.5653
## 6 1.2121 0.7349 0.4600 0.8579 0.0189 0.4094 0.8948 0.6555 2.3235 0.1971 0.5008
## 7 1.1678 0.7515 0.4630 0.8591 0.0183 0.4040 0.8302 0.6837 2.2159 0.1932 0.5369
## 8 1.1749 0.8508 0.4706 0.8736 0.0175 0.4113 0.8937 0.6678 2.3917 0.1970 0.5337
## VT WA WI WV WY
## 3 0.1571 0.5693 1.6851 0.1393 0.1755
## 4 0.1620 0.5761 1.7360 0.1363 0.1763
## 5 0.1588 0.6061 1.7720 0.1290 0.1807
## 6 0.1575 0.6409 1.7240 0.1230 0.2003
## 7 0.1565 0.6587 1.7165 0.1224 0.1968
## 8 0.1519 0.6599 1.7089 0.1175 0.1927
Long Format
The data will be transformed from wide to long and we will make our new “State” and “Output” columns:
Farm_long <- Farm2[-c(46:54),]
Farm_long <- gather(Farm_long, "State", "Output", 2:49)
head(Farm_long)
## Year State Output
## 1 1960 AL 0.5668
## 2 1961 AL 0.5765
## 3 1962 AL 0.5604
## 4 1963 AL 0.6288
## 5 1964 AL 0.6314
## 6 1965 AL 0.6643
tail(Farm_long)
## Year State Output
## 2155 1999 WY 0.3004
## 2156 2000 WY 0.2790
## 2157 2001 WY 0.2740
## 2158 2002 WY 0.2420
## 2159 2003 WY 0.2725
## 2160 2004 WY 0.2618
Analysis
We will now summarize some of our data to see a graph that shows the total output by state from 1960 to 2004.
TotalState <- Farm_long %>% group_by(State) %>% summarise(Total = sum(Output))
head(TotalState)
## # A tibble: 6 × 2
## State Total
## <chr> <dbl>
## 1 AL 37.6
## 2 AR 58.7
## 3 AZ 26.5
## 4 CA 261.
## 5 CO 48.0
## 6 CT 5.37
The table above shows the total farm output by state from 1960 to 2004.
Visualization
Total output by state from 1960 to 2004:
ggplot(TotalState, aes(State, Total)) + geom_bar(stat="identity", width = 0.5, color="coral4", fill="coral1") + labs(y="Output", title="Total Farm Output by State: 1960-2004") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the graph above, we can see that CA had the largest farm output from 1960 to 2004. This is understandable, as California leads the country as the largest producer of agricultural products (crops and livestock) and is one of the largest states in the country.
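To read the ranking off directly rather than from the bars, we can sort the summary table (a quick sketch using TotalState from above):
# Top five states by total output, 1960-2004:
TotalState %>% arrange(desc(Total)) %>% head(5)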
Now let’s see which year had the highest farm output:
TotalYear <- Farm_long %>% group_by(Year) %>% summarise(Total = sum(Output))
ggplot(TotalYear, aes(Year, Total)) + geom_bar(stat="identity", width = 0.5, color="salmon4", fill="salmon1") + labs(y="Output", title="Total Farm Output in the US: 1960-2004") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the graph above, we can see that farm output has been increasing at a steady rate from 1960-2004.
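The peak year can also be read directly from the summary table (a quick sketch using TotalYear from above):
# Year with the highest total farm output:
TotalYear %>% slice_max(Total, n = 1)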
Let us also take a look at the average annual growth rates over the years:
Farm3 <- Farm2[c(47:54),]
Farm3_long <- gather(Farm3, "State", "AVG Annual Growth", 2:49)
totalGrowth <- Farm3_long %>% group_by(Year) %>% summarise(Total = sum(`AVG Annual Growth`))
ggplot(totalGrowth, aes(Year, Total)) + geom_bar(stat="identity", width = 0.5, color= "burlywood4", fill="burlywood1") + labs(y="Rate", title="Average Annual Growth Rates (percent): 1960-2004")
In the graph above, we can see that the highest growth rate in farm output occurred between 1973 and 1979. This could be due to the farm boom of the 1970s; we can see a similar boom happening during the 1990s.
Conclusion:
In this project we have taken an untidy, wide dataset containing the total U.S. farm output by state from 1960 to 2004. We tidied it by removing unwanted rows, transforming the dataset from wide to long format, and giving the columns proper names. We also calculated totals and created graphs to analyze the total farm output for each state. By tidying, visualizing, and analyzing the dataset, we were able to find out which states were outperforming (like CA), and to see the steady growth of farm output over the years as well as the overall average growth.
Sources:
https://www.ers.usda.gov/data-products/agricultural-productivity-in-the-u-s/agricultural-productivity-in-the-u-s/#State-Level%20Tables,%20Relative%20Level%20Indices%20and%20Growth,%201960-2004-Outputs
https://www.ers.usda.gov/data-products/chart-gallery/gallery/chart-detail/?chartId=58315
https://aei.ag/2020/11/30/the-farm-economy-inflation-and-the-1970s/
http://sape.inf.usi.ch/quick-reference/ggplot2/colour
https://bookdown.org/yihui/rmarkdown/html-document.html#tabbed-sections
https://www.census.gov/foreign-trade/Press-Release/steel_index.html
https://stackoverflow.com/questions/1523126/how-to-read-data-when-some-numbers-contain-commas-as-thousand-separator
https://gagneurlab.github.io/dataviz/tidy-data-and-combining-tables.html