DATA 607 Project 2

Overview:

In this assignment we are asked to choose 3 wide structured data sets and to create a .CSV file that includes data from our findings. The goal of this assignment is to tidy and transform the data as described below:

Create a .CSV file that includes all of the information included in the dataset and read the information from .CSV file into R.
Use tidyr and dplyr to tidy and transform data and perform our analysis.
Code should be in an R Markdown file, and should include narrative descriptions of the data cleanup work, analysis, and conclusions.

To begin, we will create a .CSV file, upload load to GitHub, then load from GitHub to R. From R, we will use the necessary libraries to tidy and transform the data.

Load required libraries:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

Datasets

U.S Imports for Consumption of Steel Products:

This is a dataset that contains U.S. Imports for Consumption of Steel Products measured by quantity in metric tons and value in thousands of dollars as of November, December and Annual for the year 2021 - 2022.

First we will load the data:

Obtain data from Github:

Steel<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/US%20imports%20for%20Consumption%20of%20Steel%20Products%20Dec%202022%20.csv", skip = 6L)

head(Steel)

##           X      X.1   X.2      X.3   X.4      X.5   X.6      X.7   X.8
## 1           December         Nov-22       December       November      
## 2              Final          Final          Final          Final      
## 3 Commodity                                                            
## 4  Grouping Quantity Value Quantity Value Quantity Value Quantity Value
## 5                                                                      
## 6                                                                      
##          X.9  X.10       X.11  X.12
## 1 2022 Final       2021 Final      
## 2                                  
## 3                                  
## 4   Quantity Value   Quantity Value
## 5                                  
## 6

Above data is untidy. Now we will transform the data for further analysis. We will add appropriate column names to identify our data and get rid of a few rows that we not need for our analysis:

colnames(Steel)[1:13] <- c("Commodity Group", "Quantity December 2022", "Value December 2022", "Quantity November 2022", "Value November 2022", "Quantity December 2021", "Value December 2021", "Quantity November 2021", "Value November 2021", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021")
Steel2 <- Steel[-c(1:7, 9, 49, 50),]

head(Steel2)

##                  Commodity Group Quantity December 2022 Value December 2022
## 8     Total Selected Commodities              1,999,419           2,962,664
## 10 Ingots And Steel For Castings                    496               2,775
## 11     Blooms, Billets And Slabs                307,434             274,080
## 12                     Wire Rods                 62,430              74,175
## 13       Structural Shapes Heavy                 33,489              44,633
## 14                  Steel Piling                 16,617              20,727
##    Quantity November 2022 Value November 2022 Quantity December 2021
## 8               1,826,547           2,785,960              2,620,324
## 10                    317               2,430                  1,119
## 11                224,505             183,759                570,331
## 12                 66,820              86,418                103,409
## 13                 54,108              71,357                 38,912
## 14                  3,280               4,293                  3,614
##    Value December 2021 Quantity November 2021 Value November 2021
## 8            3,821,610              2,626,154           3,653,916
## 10               3,671                  1,310               3,936
## 11             508,257                504,511             446,403
## 12             114,330                 97,781             106,199
## 13              51,210                 66,487              74,199
## 14               3,711                  3,867               3,378
##    Quantity Final 2022 Value Final 2022 Quantity Final 2021 Value Final 2021
## 8           28,014,754       41,944,101          28,576,768       33,456,282
## 10              13,345           61,808              16,432           48,905
## 11           5,053,020        4,601,400           7,881,262        6,143,157
## 12           1,352,232        1,554,663           1,042,755        1,011,789
## 13             723,674          950,645             530,946          539,266
## 14             109,589          129,247              77,291           68,115

Above we can see that all the columns have appropriate names instead of x and we have removed all the empty rows from the regional .CSV file.

Now that our data looks much cleaner, we can start analyzing:

First we’ll look at each commodity by quantity at the end of 2021 and 2022:

commodity <- Steel2[-c(1),]
commodity <- select(commodity, "Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021")
commodityq <- commodity %>% gather("Annual", "Quantity", 2:5) %>% filter(Annual %in% c("Quantity Final 2022", "Quantity Final 2021"))

#Our quantities were being read as strings because they included a comma in them. I have changed them to numeric in this step:

commodityq$Quantity <- as.numeric(gsub(",", "", commodityq$Quantity))

Visual Comparison of Commodity Imports by Quantity:

ggplot(data=commodityq, aes(x=`Commodity Group`, y=Quantity, fill = factor(`Commodity Group`))) + geom_bar(stat = "identity") + theme(legend.position = "none", axis.text.y = element_text(size=4), axis.text.x = element_text(angle = 90, hjust = 1, size=5)) + labs(y="Quantity (Metric Tons)") + coord_flip() + facet_wrap(~Annual, ncol=2)

From the graph above, We can see that overall quantities of steel import for consumption have increased between 2021 and 2022 with a few variations. On the contrary, it looks like the largest quantity import Blooms, Billets and Slabs, has actually decreased it’s imports from 2021 to 2022 with a difference of about 2,828,242 metric tons. Form this we can assume that the COVID pandemic may have impacted negativity the consumption of steel product for Blooms, Billets and Slabs.

Visual Comparison of Commodity of Imports by Value

commodityv <- commodity %>% gather("Annual", "Value", 2:5) %>% filter(Annual %in% c("Value Final 2022", "Value Final 2021"))
#Our values were being read as strings because they included a comma in them. We have changed them to numeric in this step:

commodityv$Value <- as.numeric(gsub(",", "", commodityv$Value))

ggplot(data=commodityv, aes(x=`Commodity Group`, y=Value, fill = factor(`Commodity Group`))) + geom_bar(stat = "identity") + theme(legend.position = "none", axis.text.y = element_text(size=4), axis.text.x = element_text(angle = 90, hjust = 1, size=5)) + labs(y="Value (Thousands of Dollars)") + coord_flip() + facet_wrap(~Annual, ncol=2)

From the graph above, We can see that the value for the imports correlate with the quantities being imported.

One big difference that is noticeable is the value between two of the largest imports of steel for Oil Country Goods and Blooms, Billets and Slabs. Apparently, Oil Country Goods are more expensive than Blooms, Billets and Slabs, which is why in 2021 we can see that even though quantity is higher for Blooms, Billets and Slabs, the value is higher for Oil Country Goods. Form this we can assume that probably the war between Russia and Ukraine has impacted negativity the price of steel product specially for Oil Country Goods.

Total import for consumption of steel:

Finally we will look at the total import for consumption of steel products by quantity and value for the years 2021 and 2022.

totalq <- Steel2 %>% select("Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021") %>% gather("Annual", "Quantity", 2:5) %>% filter(`Commodity Group` == "Total Selected Commodities") %>% filter(Annual %in% c("Quantity Final 2022", "Quantity Final 2021")) 

totalv <- Steel2 %>% select("Commodity Group", "Quantity Final 2022", "Value Final 2022", "Quantity Final 2021", "Value Final 2021") %>% gather("Annual", "Value", 2:5) %>% filter(`Commodity Group` == "Total Selected Commodities" & Annual %in% c("Value Final 2022", "Value Final 2021"))

ggplot(totalq, aes(Annual, Quantity)) + geom_bar(stat="identity", width = 0.7,  color = "steelblue4", fill = "steelblue1") + labs(y="Quantity (Metric Tons)", title="Total Import of Steel by Quantity for 2021 and 2022")

By looking at this graph, it looks like the overall quantity of the consumption has decreased almost 50% which is a bit misleading. However, if we look at the number of quantity, We can see that the total quantity of steel import for consumption of steel products was only 1.9% less at the end of 2022 than it was in 2021.

Further, we would expect for the total value to correlate with these quantities. Let’s find out what the total value of steel import for consumption of steel products looks like:

ggplot(totalv, aes(Annual, Value)) + geom_bar(stat="identity", width = 0.7, color = "steelblue4", fill = "steelblue1") + labs(y="Value (Thousands of Dollars)", title="Total Import of Steel by Value for 2021 and 2022")

From the graph above we can see that although the consumption quantity has decreased from the last few years, the price for over all steel product has increased in 2022 then 2021.

Conclusion:

In this project we have taken an untidy data and have tidy it up. From our analysis we can see that the U.S. imports for consumption of steel products has decreased in last few years where the price has increased. Thus, we can see the correlation between the value and quantities. What also stands out is that the largest quantity import Blooms, Billets and Slabs, has decreased it’s imports significant from 2021 to 202 which, I assumed could be due to COVID pandemic. Also, the over all price hike may caused by the value increase of Oil Country Goods which, I assumed could be due to the recent war between Russia and Ukraine.

U.S. Presidential Vote 2016: CA and FL

This is a dataset that contains the 2016 US presidential vote counts for two U.S. states: California and Florida.

First we will load the data:

Obtain data from Github:

Election<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/Election%20Result%20for%202016%20U.S.%20Presidential%20Vote.csv")
Election

##   X       Candidate      CA      FL
## 1 1 Hillary Clinton 5931283 4485745
## 2 2    Donald Trump 3184721 4605515
## 3 3    Gary Johnson  308392  206007
## 4 4      Jill Stein  166311   64019

In this table, the column names CA and FL are values of the variable state. Therefore, we can say that this table is in an untidy and table format is wide.

Tidying the date:

We will remove the X column which, adds no values in our data.

Election2 <- Election[,-1]
Election2

##         Candidate      CA      FL
## 1 Hillary Clinton 5931283 4485745
## 2    Donald Trump 3184721 4605515
## 3    Gary Johnson  308392  206007
## 4      Jill Stein  166311   64019

Long Format

We are gong melt this table from Wide to long format below using library(data.table)

library(data.table)
Election2<- melt(as.data.table(Election2),
     id.vars = "Candidate",
     measure.vars = c("CA", "FL"),
     variable.name = "State",
     value.name = "Votes")
Election2

##          Candidate State   Votes
## 1: Hillary Clinton    CA 5931283
## 2:    Donald Trump    CA 3184721
## 3:    Gary Johnson    CA  308392
## 4:      Jill Stein    CA  166311
## 5: Hillary Clinton    FL 4485745
## 6:    Donald Trump    FL 4605515
## 7:    Gary Johnson    FL  206007
## 8:      Jill Stein    FL   64019

From the table above we can see that data is in long format and we have removed extra x column. Now that we have tidy dataset lets do some analysis in this.

Analysis:

Lets calculate the total votes for each candidate

Total_Votes <- Election2 %>%
  group_by(Candidate) %>%
  summarize(Total_Votes = sum(Votes))
Total_Votes

## # A tibble: 4 × 2
##   Candidate       Total_Votes
##   <chr>                 <int>
## 1 Donald Trump        7790236
## 2 Gary Johnson         514399
## 3 Hillary Clinton    10417028
## 4 Jill Stein           230330

From this table we can see that Hillary Clinton has a most votes combined in CA and FL.

Now we are going to add the total votes to the long format data frame:

Election3 <- merge(Election2, Total_Votes, by = "Candidate")
Election3

##          Candidate State   Votes Total_Votes
## 1:    Donald Trump    CA 3184721     7790236
## 2:    Donald Trump    FL 4605515     7790236
## 3:    Gary Johnson    CA  308392      514399
## 4:    Gary Johnson    FL  206007      514399
## 5: Hillary Clinton    CA 5931283    10417028
## 6: Hillary Clinton    FL 4485745    10417028
## 7:      Jill Stein    CA  166311      230330
## 8:      Jill Stein    FL   64019      230330

Above we can see that total votes in one table

Lets calculate the percentage of total Votes for each candidate:

Vote_Percentage <- Election3 %>%
  group_by(Candidate, State)%>%
  summarize(Percentage = Votes/Total_Votes)

## `summarise()` has grouped output by 'Candidate'. You can override using the
## `.groups` argument.

  Vote_Percentage$Percentage <- label_percent()(Vote_Percentage$Percentage)
Vote_Percentage

## # A tibble: 8 × 3
## # Groups:   Candidate [4]
##   Candidate       State Percentage
##   <chr>           <fct> <chr>     
## 1 Donald Trump    CA    40.88%    
## 2 Donald Trump    FL    59.12%    
## 3 Gary Johnson    CA    59.95%    
## 4 Gary Johnson    FL    40.05%    
## 5 Hillary Clinton CA    56.94%    
## 6 Hillary Clinton FL    43.06%    
## 7 Jill Stein      CA    72.21%    
## 8 Jill Stein      FL    27.79%

From the table above, we can see that total percentage for each state per candidate. This data does not gives a clear picture of the candidates’ lead as the percentage are calculated for each candidates with their own vote count. So, if we look at Jill Stein vote percentage in CA it looks like Jill is leading (which is not really true) but it’s because we are calculating the percentage for each candidate with their own votes for each states.

Visualization

library(ggplot2)

ggplot(Election3, aes(x = Candidate, y = Votes, fill = State)) +
  geom_col(position = "dodge") +
  labs(title = "Vote Counts in Each State Per Candidate",
       x = "Candidate", y = "Votes",
       fill = "State")

The graph above shows the number of votes for each candidate for each states: CA and FL from 2016 election

Conclusion:

In this project we have taken an untidy data that contains the 2016 US presidential vote counts for two U.S. states: California and Florida and have tidy it up. We have calculated total number votes and percentage for each candidates. From our analysis we can see that although Donald Trump was leading in FL, Hillary Clinton was pretty close too about 1.7% difference. Additionally, Hillary Clinton had the most vote counts in CA compared to Donald Trump as well as the other candidates. As per the percentage analysis, I think further analysis can be done by calculating the percentage based on the total number of vote as a whole.

Farm Output by state

This is a dataset that contains the total output of farms by state from 1960-2004. We will tidy the data in order to find some patterns.

First we will load the data:

Obtain data from Github:

Note that we eliminate the first five rows when loading in order to make the column names the 6th row in the dataset.

Farm<-read.csv("https://raw.githubusercontent.com/FarhanaAkther23/DATA607/main/Project%202/Farm%20Output%20by%20State.CSV", skip = 5L)

head(Farm)

##   Year     AL     AR     AZ     CA     CO     CT     DE     FL     GA     IA
## 1          NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2          NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3 1960 0.5668 0.6085 0.4063 3.3565 0.6274 0.1205 0.0889 0.8115 0.7119 2.8691
## 4 1961 0.5765 0.6561 0.4236 3.3836 0.6363 0.1191 0.0867 0.8756 0.7621 2.9787
## 5 1962 0.5604 0.6731 0.4392 3.5495 0.6212 0.1168 0.0865 0.9719 0.7421 2.9869
## 6 1963 0.6288 0.7117 0.4559 3.5883 0.6260 0.1183 0.0916 0.8532 0.8442 3.2091
##       ID     IL     IN     KS     KY     LA     MA     MD     ME     MI     MN
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3 0.4713 2.0900 1.1926 1.3509 0.6772 0.3159 0.1586 0.2653 0.1615 0.8521 1.8182
## 4 0.5004 2.1441 1.1894 1.3667 0.7308 0.3254 0.1557 0.2657 0.1743 0.8963 1.8957
## 5 0.5288 2.1843 1.2265 1.3114 0.7283 0.3409 0.1548 0.2690 0.1754 0.9003 1.7694
## 6 0.5503 2.2805 1.2853 1.3163 0.7705 0.3762 0.1528 0.2702 0.1799 0.9073 1.9349
##       MO     MS     MT     NC     ND     NE     NH     NJ     NM     NV     NY
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3 1.2814 0.6156 0.4086 1.0808 0.6278 1.3339 0.0547 0.2913 0.2268 0.0549 1.0470
## 4 1.3055 0.6624 0.3547 1.0994 0.4852 1.3145 0.0556 0.2770 0.2333 0.0524 1.0757
## 5 1.2508 0.6601 0.4309 1.1417 0.7527 1.3634 0.0544 0.2769 0.2464 0.0578 1.0510
## 6 1.3501 0.7595 0.4634 1.1709 0.6995 1.3916 0.0541 0.2617 0.2554 0.0599 1.0784
##       OH     OK     OR     PA     RI     SC     SD     TN     TX     UT     VA
## 1     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3 1.1778 0.8314 0.4206 0.8766 0.0190 0.3862 0.8235 0.6198 2.2307 0.1885 0.5463
## 4 1.1700 0.8108 0.4338 0.8895 0.0187 0.3974 0.7932 0.6418 2.2370 0.1871 0.5628
## 5 1.1774 0.7242 0.4572 0.8394 0.0191 0.4088 0.8547 0.6172 2.1889 0.1975 0.5653
## 6 1.2121 0.7349 0.4600 0.8579 0.0189 0.4094 0.8948 0.6555 2.3235 0.1971 0.5008
##       VT     WA     WI     WV     WY
## 1     NA     NA     NA     NA     NA
## 2     NA     NA     NA     NA     NA
## 3 0.1571 0.5693 1.6851 0.1393 0.1755
## 4 0.1620 0.5761 1.7360 0.1363 0.1763
## 5 0.1588 0.6061 1.7720 0.1290 0.1807
## 6 0.1575 0.6409 1.7240 0.1230 0.2003

Tidying the Date:

Let’s eliminate all empty rows:

Farm2 <- na.omit(Farm)
head (Farm2)

##   Year     AL     AR     AZ     CA     CO     CT     DE     FL     GA     IA
## 3 1960 0.5668 0.6085 0.4063 3.3565 0.6274 0.1205 0.0889 0.8115 0.7119 2.8691
## 4 1961 0.5765 0.6561 0.4236 3.3836 0.6363 0.1191 0.0867 0.8756 0.7621 2.9787
## 5 1962 0.5604 0.6731 0.4392 3.5495 0.6212 0.1168 0.0865 0.9719 0.7421 2.9869
## 6 1963 0.6288 0.7117 0.4559 3.5883 0.6260 0.1183 0.0916 0.8532 0.8442 3.2091
## 7 1964 0.6314 0.7517 0.4292 3.6871 0.6235 0.1175 0.0902 0.8508 0.8278 3.2115
## 8 1965 0.6643 0.8145 0.4602 3.6855 0.6213 0.1189 0.1002 0.9743 0.8739 3.1501
##       ID     IL     IN     KS     KY     LA     MA     MD     ME     MI     MN
## 3 0.4713 2.0900 1.1926 1.3509 0.6772 0.3159 0.1586 0.2653 0.1615 0.8521 1.8182
## 4 0.5004 2.1441 1.1894 1.3667 0.7308 0.3254 0.1557 0.2657 0.1743 0.8963 1.8957
## 5 0.5288 2.1843 1.2265 1.3114 0.7283 0.3409 0.1548 0.2690 0.1754 0.9003 1.7694
## 6 0.5503 2.2805 1.2853 1.3163 0.7705 0.3762 0.1528 0.2702 0.1799 0.9073 1.9349
## 7 0.5460 2.1810 1.1761 1.2692 0.8157 0.3738 0.1473 0.2742 0.1792 0.9466 1.8176
## 8 0.5452 2.3400 1.2316 1.3680 0.7482 0.3908 0.1434 0.2944 0.1740 0.8724 1.7931
##       MO     MS     MT     NC     ND     NE     NH     NJ     NM     NV     NY
## 3 1.2814 0.6156 0.4086 1.0808 0.6278 1.3339 0.0547 0.2913 0.2268 0.0549 1.0470
## 4 1.3055 0.6624 0.3547 1.0994 0.4852 1.3145 0.0556 0.2770 0.2333 0.0524 1.0757
## 5 1.2508 0.6601 0.4309 1.1417 0.7527 1.3634 0.0544 0.2769 0.2464 0.0578 1.0510
## 6 1.3501 0.7595 0.4634 1.1709 0.6995 1.3916 0.0541 0.2617 0.2554 0.0599 1.0784
## 7 1.2891 0.7818 0.4641 1.2103 0.7110 1.3560 0.0514 0.2477 0.2300 0.0608 1.0594
## 8 1.3551 0.7578 0.4940 1.1144 0.7736 1.4025 0.0525 0.2511 0.2445 0.0672 1.0805
##       OH     OK     OR     PA     RI     SC     SD     TN     TX     UT     VA
## 3 1.1778 0.8314 0.4206 0.8766 0.0190 0.3862 0.8235 0.6198 2.2307 0.1885 0.5463
## 4 1.1700 0.8108 0.4338 0.8895 0.0187 0.3974 0.7932 0.6418 2.2370 0.1871 0.5628
## 5 1.1774 0.7242 0.4572 0.8394 0.0191 0.4088 0.8547 0.6172 2.1889 0.1975 0.5653
## 6 1.2121 0.7349 0.4600 0.8579 0.0189 0.4094 0.8948 0.6555 2.3235 0.1971 0.5008
## 7 1.1678 0.7515 0.4630 0.8591 0.0183 0.4040 0.8302 0.6837 2.2159 0.1932 0.5369
## 8 1.1749 0.8508 0.4706 0.8736 0.0175 0.4113 0.8937 0.6678 2.3917 0.1970 0.5337
##       VT     WA     WI     WV     WY
## 3 0.1571 0.5693 1.6851 0.1393 0.1755
## 4 0.1620 0.5761 1.7360 0.1363 0.1763
## 5 0.1588 0.6061 1.7720 0.1290 0.1807
## 6 0.1575 0.6409 1.7240 0.1230 0.2003
## 7 0.1565 0.6587 1.7165 0.1224 0.1968
## 8 0.1519 0.6599 1.7089 0.1175 0.1927

Long Format

The data will be transformed from wide to long and we will make our new “State” and “Output” columns:

Farm_long <- Farm2[-c(46:54),]
Farm_long <- gather(Farm_long, "State", "Output", 2:49)
head(Farm_long)

##   Year State Output
## 1 1960    AL 0.5668
## 2 1961    AL 0.5765
## 3 1962    AL 0.5604
## 4 1963    AL 0.6288
## 5 1964    AL 0.6314
## 6 1965    AL 0.6643

tail(Farm_long)

##      Year State Output
## 2155 1999    WY 0.3004
## 2156 2000    WY 0.2790
## 2157 2001    WY 0.2740
## 2158 2002    WY 0.2420
## 2159 2003    WY 0.2725
## 2160 2004    WY 0.2618

Analysis

We will now summarize some of our data to see a graph that shows the total output by state from 1960 to 2004.

TotalState <- Farm_long %>% group_by(State) %>% summarise(Total = sum(Output))
head(TotalState)

## # A tibble: 6 × 2
##   State  Total
##   <chr>  <dbl>
## 1 AL     37.6 
## 2 AR     58.7 
## 3 AZ     26.5 
## 4 CA    261.  
## 5 CO     48.0 
## 6 CT      5.37

Above table shows total farm output by state from 1960-2004

Visualization

Total output by state from 1960 to 2004:

ggplot(TotalState, aes(State, Total)) + geom_bar(stat="identity", width = 0.5, color="coral4", fill="coral1") + labs(y="Output", title="Total Farm Output by State: 1960-2004") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

From this graph above, We can see above that CA had the largest farm output from 1960 to 2004. This is understandable as California leads the country as the largest producer of agricultural products (crops and livestock) and one of the top largest states in the country.

Now let’s see which year had the highest farm output:

TotalYear <- Farm_long %>% group_by(Year) %>% summarise(Total = sum(Output))
ggplot(TotalYear, aes(Year, Total)) + geom_bar(stat="identity", width = 0.5, color="salmon4", fill="salmon1") + labs(y="Output", title="Total Farm Output in the US: 1960-2004") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

From the graph above, we can see that farm output has been increasing at a steady rate from 1960-2004.

lets us also take a look at the average annual growth rate over the years:

Farm3 <- Farm2[c(47:54),]
Farm3_long <- gather(Farm3, "State", "AVG Annual Growth", 2:49)
totalGrowth <- Farm3_long %>% group_by(Year) %>% summarise(Total = sum(`AVG Annual Growth`))
ggplot(totalGrowth, aes(Year, Total)) + geom_bar(stat="identity", width = 0.5, color= "burlywood4", fill="burlywood1") + labs(y="Rate", title="Average Annual Growth Rates (percent): 1960-2004")

In the graph above, we can see that the highest growth rate in farm output happened between the years 1973 and 1979. This could be due to the farm boom of the 1970s, we can see a similar boom happening during the 1990s.

Conclusion:

In this project we have taken an untidy wide dataset that contains the total U.S farm output by State from 1960-2004. we have tidy it by removing unwanted rows, transform the dataset from wide to long format and gave proper names for the column. We have also calculated totals and created visual graphs to be able to analyze the performance of total farm outputs for each state. By tidying, visualizing and analyzing the dataset, we were able to find out which stated were out perfuming (like CA), the steady growth of farm output over the years as well as the total average growth.

Source:

https://www.ers.usda.gov/data-products/agricultural-productivity-in-the-u-s/agricultural-productivity-in-the-u-s/#State-Level%20Tables,%20Relative%20Level%20Indices%20and%20Growth,%201960-2004-Outputs https://www.ers.usda.gov/data-products/chart-gallery/gallery/chart-detail/?chartId=58315 https://aei.ag/2020/11/30/the-farm-economy-inflation-and-the-1970s/ http://sape.inf.usi.ch/quick-reference/ggplot2/colour https://bookdown.org/yihui/rmarkdown/html-document.html#tabbed-sections https://www.census.gov/foreign-trade/Press-Release/steel_index.html https://stackoverflow.com/questions/1523126/how-to-read-data-when-some-numbers-contain-commas-as-thousand-separator https://gagneurlab.github.io/dataviz/tidy-data-and-combining-tables.html

DATA 607 Project 2

Farhana Akther

2023-03-03

Datasets

U.S Imports for Consumption of Steel Products:

U.S. Presidential Vote 2016: CA and FL

Farm Output by state