---
title: "Lab 2: Merging and Analyzing Data"
author: "Grayson Grabois"
output: html_document
editor_options:
  chunk_output_type: console
---
Question 1: Loading and Exploring the Dataset
First, we will load both datasets. Because they are CSV files, I will use the read.csv function.
ag_output <- read.csv("ag_output.csv")
gdppc <- read.csv("gdppc.csv")
# Show the first few rows of each dataset to see how it is structured
head(ag_output)
## Country Country.Code Outall_Q Land_Q Labor_Q
## 1 Nigeria NGA 67238499.9 65431.1311 26809.343
## 2 Benin BEN 4644196.3 4658.0306 1341.492
## 3 Cote d'Ivoire CIV 13861494.7 14790.5359 4715.421
## 4 Ghana GHA 15922946.4 9405.3255 5578.694
## 5 Guinea GIN 5451421.8 6555.7681 2432.623
## 6 Guinea-Bissau GNB 519897.2 699.9943 343.297
head(gdppc)
## Country.Name Country.Code GDPPerCapita
## 1 Aruba ABW 30559.5335
## 2 Africa Eastern and Southern AFE 1628.0245
## 3 Afghanistan AFG 357.2612
## 4 Africa Western and Central AFW 1777.2350
## 5 Angola AGO 2929.6945
## 6 Albania ALB 6846.4261
# Check the dimensions of each dataset: each row is an observation (a country) and each column is a variable
dim(ag_output)
## [1] 179 5
dim(gdppc)
## [1] 266 3
Question 2: Merging Data
# Merges datasets by their common Country.Code
ag_merge <- merge(ag_output, gdppc, by = "Country.Code")
# Check the dimensions of the merged dataset to confirm the merge worked as expected
dim(ag_merge)
## [1] 174 7
# Attach the merged dataset so its columns can be referenced directly by name.
attach(ag_merge)
head(ag_merge)
## Country.Code Country Outall_Q Land_Q Labor_Q
## 1 AFG Afghanistan 6736355 11677.0378 3523.311
## 2 AGO Angola 7820210 6755.0958 7102.524
## 3 ALB Albania 2219637 932.8246 443.723
## 4 ARE United Arab Emirates 1302831 247.4919 88.776
## 5 ARG Argentina 76191086 48593.2519 1421.964
## 6 ARM Armenia 1675969 872.1348 750.752
## Country.Name GDPPerCapita
## 1 Afghanistan 357.2612
## 2 Angola 2929.6945
## 3 Albania 6846.4261
## 4 United Arab Emirates 49899.0653
## 5 Argentina 13935.6811
## 6 Armenia 6571.9745
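The merged data has 174 rows, fewer than the 179 rows in ag_output, which is consistent with merge() keeping only the codes found in both tables by default. A minimal sketch of how unmatched rows could be listed, using a few values printed earlier in this lab. The names ag_small, gdp_small, left_join, and unmatched are hypothetical, and gdp_small deliberately leaves out NGA and BEN so that the example has something to drop:

```r
# Toy tables built from values printed earlier in this lab.
# gdp_small intentionally omits NGA and BEN (an assumption for illustration only).
ag_small <- data.frame(
  Country.Code = c("NGA", "BEN", "AFG"),
  Outall_Q     = c(67238499.9, 4644196.3, 6736355)
)
gdp_small <- data.frame(
  Country.Code = c("AFG", "AGO"),
  GDPPerCapita = c(357.2612, 2929.6945)
)

# Default merge: only codes present in both tables survive.
inner <- merge(ag_small, gdp_small, by = "Country.Code")

# all.x = TRUE keeps every ag_small row and fills the missing GDP with NA.
left_join <- merge(ag_small, gdp_small, by = "Country.Code", all.x = TRUE)

# Codes that the default merge would have dropped.
unmatched <- left_join$Country.Code[is.na(left_join$GDPPerCapita)]
print(unmatched)
```

Running the same check on the full ag_output and gdppc tables would show which countries account for the difference in row counts.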
Question 3: Examining Association Between Variables
# This creates a scatterplot with Outall_Q (agricultural output) on the y axis and Land_Q (total agricultural land) on the x axis.
# I notice that most countries have relatively little agricultural land and similar outputs. Among countries with more than 50,000 units of land, output is much more scattered.
plot(Outall_Q ~ Land_Q,
main = "Agricultural Output vs. Land Area",
xlab = "Agricultural Land",
ylab = "Agricultural Output",
col = "blue")
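Because most points sit near the origin, a log scale on both axes may make the pattern easier to read. This is only a sketch using the six rows printed by head(ag_merge); the data frame name d and the variable log_cor are hypothetical, and the full ag_merge would be used in practice:

```r
# Six rows copied from the head(ag_merge) output above.
d <- data.frame(
  Outall_Q = c(6736355, 7820210, 2219637, 1302831, 76191086, 1675969),
  Land_Q   = c(11677.0378, 6755.0958, 932.8246, 247.4919, 48593.2519, 872.1348)
)

# log = "xy" draws both axes on a log scale, which spreads out the small values.
plot(Outall_Q ~ Land_Q, data = d, log = "xy",
     main = "Agricultural Output vs. Land Area (log scale)",
     xlab = "Agricultural Land",
     ylab = "Agricultural Output",
     col = "blue")

# Correlation of the logged values summarizes the same relationship in one number.
log_cor <- cor(log10(d$Land_Q), log10(d$Outall_Q))
```

A log scale only works when every value is positive, which holds for the rows shown here.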
Question 4: Creating New Variables
# This creates a new variable called OutPerHec, which divides agricultural output by agricultural land to give the output per hectare.
OutPerHec <- Outall_Q / Land_Q
# Show the first few values of this new variable
head(OutPerHec)
## [1] 576.8891 1157.6756 2379.4799 5264.1367 1567.9355 1921.6855
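Because attach() only exposes columns that already exist, OutPerHec as created above is a separate vector in the workspace rather than a column of ag_merge. One alternative, sketched here under the assumption that keeping everything in a single data frame is preferred, is to store the ratio as a new column. The data frame d is a hypothetical stand-in built from the first three rows of head(ag_merge):

```r
# Three rows copied from the head(ag_merge) output above.
d <- data.frame(
  Country.Code = c("AFG", "AGO", "ALB"),
  Outall_Q     = c(6736355, 7820210, 2219637),
  Land_Q       = c(11677.0378, 6755.0958, 932.8246)
)

# Store the ratio as a column so it stays aligned with its country.
d$OutPerHec <- d$Outall_Q / d$Land_Q

# Look up one country by code without relying on attach().
alb <- d$OutPerHec[d$Country.Code == "ALB"]
print(d)
```

Keeping the new variable inside the data frame means sorting or filtering the rows later cannot put the ratio out of step with the country it belongs to.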
# Creates a histogram of the output variable.
hist(OutPerHec,
main = "Agricultural Output per Hectare",
xlab = "Output per Hectare",
col = "blue",
border = "black")
# I observe that the histogram is right-skewed: most of the data is clustered on the left side, with a few much larger values in the right tail. This indicates that most countries have a similar output per hectare. This aligns with the scatterplot above, which shows that most countries have less land and lower output than the few countries with more.
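One quick numeric check for right skew is to compare the mean with the median, since a long right tail pulls the mean upward. The sketch below uses only the six values printed by head(OutPerHec), so it illustrates the check rather than describing the full distribution; the variable names are hypothetical:

```r
# The six values printed by head(OutPerHec) above (not the full variable).
out_per_hec_head <- c(576.8891, 1157.6756, 2379.4799,
                      5264.1367, 1567.9355, 1921.6855)

m   <- mean(out_per_hec_head)
med <- median(out_per_hec_head)

# A mean above the median is a rough sign of right skew.
mean_above_median <- m > med
print(c(mean = m, median = med))
```

On the real data, the same comparison would be run on the whole OutPerHec vector.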
Question 5: Associations Between Variables
# This creates a scatterplot with output per hectare on the y axis and GDP per capita on the x axis.
plot(OutPerHec ~ GDPPerCapita,
main = "Output per Hectare vs. GDP Per Capita",
xlab = "GDP Per Capita",
ylab = "Output per Hectare",
col = "blue" )
# This plot is more spread out but still mostly clustered near the origin. It suggests that countries with high output per hectare also tend to have high GDP per capita, so the two variables appear to rise and fall together.
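A correlation coefficient can put a number on how strongly the two variables move together. The sketch below uses the six rows printed earlier, pairing head(OutPerHec) with the GDPPerCapita values from head(ag_merge); the data frame d is hypothetical. Spearman's rank correlation is included as an assumption-light alternative because both variables are skewed:

```r
# Six paired values copied from the printed output above.
d <- data.frame(
  OutPerHec    = c(576.8891, 1157.6756, 2379.4799,
                   5264.1367, 1567.9355, 1921.6855),
  GDPPerCapita = c(357.2612, 2929.6945, 6846.4261,
                   49899.0653, 13935.6811, 6571.9745)
)

# Pearson measures linear association and is sensitive to extreme values.
pearson <- cor(d$OutPerHec, d$GDPPerCapita)

# Spearman works on ranks, so a few very large values matter less.
spearman <- cor(d$OutPerHec, d$GDPPerCapita, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

With only six countries these numbers are illustrative; the full merged data would give the meaningful estimate.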
Question 6: More Associations Between Variables
# This creates a new variable that is the amount of output divided by the amount of labor involved in production, to show how much output each laborer produces.
OutPerLab <- Outall_Q / Labor_Q
# This shows the first few values of OutPerLab
head(OutPerLab)
## [1] 1911.939 1101.047 5002.304 14675.489 53581.586 2232.387
# Create histogram of OutPerLab
hist(OutPerLab,
main = "Agricultural Output per Laborer",
xlab = "Output per Laborer",
col = "blue")
# This is probably the most right-skewed of all the graphs, showing that a few countries are far more productive per laborer than the rest.
# This creates a scatterplot comparing the agricultural output per laborer on the y axis and the GDP per capita on the x axis.
plot(OutPerLab ~ GDPPerCapita,
main = "Output per Laborer vs. GDP Per Capita",
xlab = "GDP Per Capita ($)",
ylab = "Output per Laborer ($1000s per 1000 workers)",
col = "blue")
# Again, this is mostly clustered toward the origin but is more spread out than the other plots. This could indicate that countries with higher GDP per capita are able to be more productive with each worker.
Question 7: What about the United States?
# This finds the agricultural output per hectare for the USA
OutPerHec[Country.Code == "USA"]
## [1] 1734.344
# This finds the agricultural output per laborer for the USA
OutPerLab[Country.Code == "USA"]
## [1] 155777.8
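To put the two USA figures in context, they can be divided by the global figures that this lab reports under Question 8 (2038.442 per hectare and 5028.743 per laborer). A small worked calculation with those printed values typed in by hand; the variable names are hypothetical:

```r
# Values copied from the printed output in this lab.
usa_outperhec    <- 1734.344
usa_outperlab    <- 155777.8
global_outperhec <- 2038.442
global_outperlab <- 5028.743

# Ratio of the USA figure to the global figure for each measure.
ratio_hec <- usa_outperhec / global_outperhec   # roughly 0.85
ratio_lab <- usa_outperlab / global_outperlab   # roughly 31
print(c(per_hectare = ratio_hec, per_laborer = ratio_lab))
```

On these numbers the USA is somewhat below the global figure per hectare but far above it per laborer.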
Question 8: What countries maximize their land?
# This returns the countries where OutPerHec > 10,000
Country.Name[OutPerHec > 10000]
## [1] "Bahrain" "Brunei Darussalam" "Ireland"
## [4] "Kuwait" "Malta" "Netherlands"
## [7] "Norway"
# This makes a new variable that sums total global output and divides it by the total amount of agricultural land, giving the global output per hectare.
global_outperhec <- sum(Outall_Q) / sum(Land_Q)
# This makes a new variable that sums total global output and divides it by total labor, giving the global output per laborer.
global_outperlab <- sum(Outall_Q) / sum(Labor_Q)
# Print the results
global_outperhec
## [1] 2038.442
global_outperlab
## [1] 5028.743
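Dividing summed output by summed land is not the same as averaging each country's OutPerHec: the pooled ratio is a land-weighted average, so large countries count for more. The sketch below illustrates this with the six rows printed by head(ag_merge); the data frame d and the names pooled, weighted, and simple are hypothetical:

```r
# Six rows copied from the head(ag_merge) output above.
d <- data.frame(
  Outall_Q = c(6736355, 7820210, 2219637, 1302831, 76191086, 1675969),
  Land_Q   = c(11677.0378, 6755.0958, 932.8246, 247.4919, 48593.2519, 872.1348)
)
ratio <- d$Outall_Q / d$Land_Q

# Pooled ratio, as computed for global_outperhec above.
pooled <- sum(d$Outall_Q) / sum(d$Land_Q)

# The same quantity written as a land-weighted mean of the country ratios.
weighted <- weighted.mean(ratio, w = d$Land_Q)

# Unweighted mean of the country ratios, which treats every country equally.
simple <- mean(ratio)
print(c(pooled = pooled, weighted = weighted, simple = simple))
```

Which version is appropriate depends on the question: the pooled ratio describes the average hectare, while the simple mean describes the average country.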