Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
The three datasets picked are: (1) FIFA21 Player Information (2) Surface Temperature by Country (3) Cost of Scientific Publications in 2012 - 13
The FIFA21 Player Information dataset was web-scraped. It is in long format but contains a lot of incomplete data and many special characters, so it needs to be cleaned before it can be analyzed. The main question here is: are players paid more if they have been with a club longer, holding skill constant (i.e., as a covariate)?
library(tidyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
fifa_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/fifa21_male2.csv')
summary(fifa_raw)
## ID Name Age OVA
## Min. : 2 Length:17125 Min. :16.00 Min. :38.00
## 1st Qu.:204082 Class :character 1st Qu.:21.00 1st Qu.:62.00
## Median :228961 Mode :character Median :25.00 Median :67.00
## Mean :219389 Mean :25.27 Mean :66.97
## 3rd Qu.:243911 3rd Qu.:29.00 3rd Qu.:72.00
## Max. :259105 Max. :53.00 Max. :93.00
##
## Nationality Club BOV BP
## Length:17125 Length:17125 Min. :42.0 Length:17125
## Class :character Class :character 1st Qu.:64.0 Class :character
## Mode :character Mode :character Median :68.0 Mode :character
## Mean :67.9
## 3rd Qu.:72.0
## Max. :93.0
##
## Position Player.Photo Club.Logo Flag.Photo
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## POT Team...Contract Height Weight
## Min. :47.00 Length:17125 Length:17125 Length:17125
## 1st Qu.:69.00 Class :character Class :character Class :character
## Median :72.00 Mode :character Mode :character Mode :character
## Mean :72.49
## 3rd Qu.:76.00
## Max. :95.00
##
## foot Growth Joined Loan.Date.End
## Length:17125 Min. :-1.000 Length:17125 Length:17125
## Class :character 1st Qu.: 0.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.525
## 3rd Qu.: 9.000
## Max. :26.000
##
## Value Wage Release.Clause Contract
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Attacking Crossing Finishing Heading.Accuracy
## Min. : 33.0 Min. : 6.00 Min. : 3.00 Min. : 5.0
## 1st Qu.:232.0 1st Qu.:41.00 1st Qu.:33.00 1st Qu.:46.0
## Median :271.0 Median :56.00 Median :52.00 Median :57.0
## Mean :258.5 Mean :51.62 Mean :47.96 Mean :53.6
## 3rd Qu.:306.0 3rd Qu.:65.00 3rd Qu.:64.00 3rd Qu.:65.0
## Max. :437.0 Max. :94.00 Max. :95.00 Max. :93.0
##
## Short.Passing Volleys Skill Dribbling
## Min. : 7.00 Min. : 4.00 Min. : 43.0 Min. : 5.00
## 1st Qu.:56.00 1st Qu.:32.00 1st Qu.:232.0 1st Qu.:53.00
## Median :63.00 Median :46.00 Median :279.0 Median :63.00
## Mean :60.51 Mean :45.01 Mean :266.6 Mean :57.85
## 3rd Qu.:69.00 3rd Qu.:59.00 3rd Qu.:319.0 3rd Qu.:70.00
## Max. :94.00 Max. :90.00 Max. :470.0 Max. :96.00
## NA's :58
## Curve FK.Accuracy Long.Passing Ball.Control
## Min. : 4.00 Min. : 5.00 Min. : 9.00 Min. : 5.00
## 1st Qu.:37.00 1st Qu.:32.00 1st Qu.:45.00 1st Qu.:57.00
## Median :51.00 Median :43.00 Median :57.00 Median :64.00
## Mean :49.57 Mean :44.39 Mean :54.32 Mean :60.64
## 3rd Qu.:64.00 3rd Qu.:58.00 3rd Qu.:65.00 3rd Qu.:70.00
## Max. :94.00 Max. :94.00 Max. :93.00 Max. :96.00
## NA's :58
## Movement Acceleration Sprint.Speed Agility Reactions
## Min. :113.0 Min. :12.00 Min. :11.00 Min. :14.0 Min. :24.00
## 1st Qu.:294.0 1st Qu.:58.00 1st Qu.:59.00 1st Qu.:57.0 1st Qu.:57.00
## Median :331.0 Median :68.00 Median :68.00 Median :67.0 Median :63.00
## Mean :322.7 Mean :65.45 Mean :65.44 Mean :64.6 Mean :62.92
## 3rd Qu.:360.0 3rd Qu.:75.00 3rd Qu.:75.00 3rd Qu.:75.0 3rd Qu.:69.00
## Max. :464.0 Max. :97.00 Max. :96.00 Max. :96.0 Max. :96.00
## NA's :58
## Balance Power Shot.Power Jumping
## Min. :17.00 Min. :128.0 Min. :12.00 Min. :22.00
## 1st Qu.:57.00 1st Qu.:272.0 1st Qu.:50.00 1st Qu.:58.00
## Median :67.00 Median :308.0 Median :61.00 Median :66.00
## Mean :64.72 Mean :302.4 Mean :59.71 Mean :65.17
## 3rd Qu.:75.00 3rd Qu.:339.0 3rd Qu.:70.00 3rd Qu.:73.00
## Max. :97.00 Max. :444.0 Max. :95.00 Max. :95.00
## NA's :58 NA's :58
## Stamina Strength Long.Shots Mentality Aggression
## Min. :11.00 Min. :16.00 Min. : 4.00 Min. : 50.0 Min. : 9
## 1st Qu.:56.00 1st Qu.:58.00 1st Qu.:35.00 1st Qu.:235.0 1st Qu.:45
## Median :66.00 Median :67.00 Median :53.00 Median :269.0 Median :60
## Mean :63.31 Mean :65.31 Mean :49.14 Mean :261.9 Mean :57
## 3rd Qu.:73.00 3rd Qu.:74.00 3rd Qu.:64.00 3rd Qu.:304.0 3rd Qu.:70
## Max. :97.00 Max. :97.00 Max. :94.00 Max. :421.0 Max. :96
##
## Interceptions Positioning Vision Penalties
## Min. : 4.00 Min. : 2.00 Min. :10.00 Min. : 7.00
## 1st Qu.:26.00 1st Qu.:43.00 1st Qu.:47.00 1st Qu.:40.00
## Median :53.00 Median :57.00 Median :57.00 Median :51.00
## Mean :47.09 Mean :52.37 Mean :55.44 Mean :50.25
## 3rd Qu.:65.00 3rd Qu.:66.00 3rd Qu.:65.00 3rd Qu.:62.00
## Max. :95.00 Max. :96.00 Max. :95.00 Max. :94.00
## NA's :7 NA's :7 NA's :58
## Composure Defending Marking Standing.Tackle Sliding.Tackle
## Min. :12.00 Min. : 17.0 Min. : 3.00 Min. : 5.00 Min. : 6.0
## 1st Qu.:53.00 1st Qu.: 84.0 1st Qu.:29.00 1st Qu.:28.00 1st Qu.:25.0
## Median :61.00 Median :158.0 Median :52.00 Median :55.00 Median :52.0
## Mean :59.94 Mean :141.5 Mean :47.25 Mean :48.28 Mean :46.1
## 3rd Qu.:68.00 3rd Qu.:194.0 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.:64.0
## Max. :96.00 Max. :272.0 Max. :94.00 Max. :93.00 Max. :95.0
## NA's :423 NA's :58
## Goalkeeping GK.Diving GK.Handling GK.Kicking
## Min. : 5.00 Min. : 1.0 Min. : 1.00 Min. : 1.00
## 1st Qu.: 48.00 1st Qu.: 8.0 1st Qu.: 8.00 1st Qu.: 8.00
## Median : 53.00 Median :11.0 Median :11.00 Median :11.00
## Mean : 77.61 Mean :15.6 Mean :15.48 Mean :15.47
## 3rd Qu.: 59.00 3rd Qu.:14.0 3rd Qu.:14.00 3rd Qu.:14.00
## Max. :440.00 Max. :90.0 Max. :92.00 Max. :93.00
##
## GK.Positioning GK.Reflexes Total.Stats Base.Stats
## Min. : 1.00 Min. : 1.00 Min. : 731 Min. :228.0
## 1st Qu.: 8.00 1st Qu.: 8.00 1st Qu.:1492 1st Qu.:333.0
## Median :11.00 Median :11.00 Median :1659 Median :362.0
## Mean :15.51 Mean :15.74 Mean :1631 Mean :361.4
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:1812 3rd Qu.:390.0
## Max. :93.00 Max. :90.00 Max. :2316 Max. :498.0
##
## W.F SM A.W D.W
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## IR PAC SHO PAS
## Length:17125 Min. :25.00 Min. :16.00 Min. :25.00
## Class :character 1st Qu.:62.00 1st Qu.:46.00 1st Qu.:52.00
## Mode :character Median :69.00 Median :58.00 Median :60.00
## Mean :68.09 Mean :54.97 Mean :58.93
## 3rd Qu.:75.00 3rd Qu.:65.00 3rd Qu.:66.00
## Max. :96.00 Max. :93.00 Max. :93.00
##
## DRI DEF PHY Hits
## Min. :28.00 Min. :12.00 Min. :27.00 Length:17125
## 1st Qu.:59.00 1st Qu.:35.00 1st Qu.:59.00 Class :character
## Median :65.00 Median :53.00 Median :66.00 Mode :character
## Mean :64.21 Mean :50.27 Mean :64.91
## 3rd Qu.:71.00 3rd Qu.:64.00 3rd Qu.:72.00
## Max. :95.00 Max. :91.00 Max. :93.00
##
## LS ST RS LW
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## LF CF RF RW
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## LAM CAM RAM LM
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## LCM CM RCM RM
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## LWB LDM CDM RDM
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## RWB LB LCB CB
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## RCB RB GK Gender
## Length:17125 Length:17125 Length:17125 Length:17125
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
As the summary() output shows, there is a lot of information in this spreadsheet, and most of the columns are character type even though they contain numbers. The first step is therefore to select the columns of interest, followed by adjusting the data type of each column. There is also some missing data, which is stored as empty strings rather than NA, so these values need to be set to NA and then removed. The missing data seems to come from retired players: because they have no current club, they are missing a Joined date.
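As a minimal sketch of the empty-string-to-NA step, the snippet below applies the same two functions to a made-up three-row data frame. The `toy` object and its values are illustrative only and are not taken from the FIFA file.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative only): one player has an empty Joined field
toy <- data.frame(
  Name   = c("Player A", "Player B", "Player C"),
  Joined = c("Jul 1, 2018", "", "Jan 15, 2020"),
  stringsAsFactors = FALSE
)

toy_clean <- toy %>%
  mutate(Joined = na_if(Joined, "")) %>%  # "" becomes NA
  drop_na(Joined)                         # rows with NA in Joined are dropped

nrow(toy_clean)  # two rows remain
```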
fifa = fifa_raw %>%
select(Name, Age, OVA, Club, Joined, Value, Wage, Contract) %>%
mutate(Joined = na_if(Joined, '')) %>%
drop_na(Joined)
fifa = fifa %>%
filter(!grepl('Free', Contract)) %>%
filter(!grepl('On Loan', Contract)) %>%
filter(!Value == '€0')
The code above retained only the relevant columns and removed rows with NA in the Joined date. Visual inspection, combined with domain knowledge, showed that the Contract column contains more information than needed, namely whether a player is on loan to another club or a free agent. Both cases are outside the scope of the question, so the code removed these rows, using grepl() for partial matching. The code also removed players with a value of €0. Next, the Contract column needs to be split into two, contract start and end year, and the Value and Wage columns need to be converted to numeric.
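To illustrate how partial matching with grepl() behaves, here is a small base R sketch on made-up contract strings. The exact format of the loan entry is an assumption for illustration, not a value copied from the data.

```r
# Made-up contract strings (illustrative only)
contracts <- c("2008 ~ 2010", "Jun 30, 2021 On Loan", "Free", "2019 ~ 2024")

# grepl() returns TRUE wherever the pattern occurs anywhere in the string
is_excluded <- grepl("Free", contracts) | grepl("On Loan", contracts)

# Keep only regular contracts
contracts[!is_excluded]
```

Because the match is partial, any string that merely contains "Free" or "On Loan" is excluded, which is the behavior the filter relies on.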
fifa = fifa %>%
separate(Contract, into = c('Contract_Start', 'Contract_End'), sep = '~') %>%
mutate(Contract_End = as.numeric(Contract_End)) %>%
mutate(Contract_Start = as.numeric(str_sub(Contract_Start, start = -5)))
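# Convert Value strings (e.g. "€1.5M", "€500K") to numbers.
# The pattern keeps any decimal part, so "€1.5M" is read as 1.5 rather than 1.
fifa = fifa %>%
mutate(Value = case_when(
str_detect(Value, 'K$') ~ as.numeric(str_extract(Value, '\\d+\\.?\\d*')) * 1000,
str_detect(Value, 'M$') ~ as.numeric(str_extract(Value, '\\d+\\.?\\d*')) * 1000000
))
# Same conversion for Wage; values without a K or M suffix become NA.
fifa = fifa %>%
mutate(Wage = case_when(
str_detect(Wage, 'K$') ~ as.numeric(str_extract(Wage, '\\d+\\.?\\d*')) * 1000,
str_detect(Wage, 'M$') ~ as.numeric(str_extract(Wage, '\\d+\\.?\\d*')) * 1000000
))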
fifa = fifa %>%
mutate(years = Contract_End - Contract_Start) %>%
drop_na(Wage)
The code above split the Contract column into two, using ~ as the separator (e.g., 2008 ~ 2010). Inspection showed that some rows had additional characters before the contract start year, so the code keeps only the last five characters of the Contract_Start column. Both columns were converted to numeric at the same time. Following that, the Value and Wage columns were converted to numeric. The stringr function str_detect() serves as the condition inside case_when(), checking whether a value ends in K or M (for thousand and million). Depending on the suffix, the numeric part was extracted and multiplied by either one thousand or one million.
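The suffix logic can also be sketched as a small standalone function. `parse_euro()` is a hypothetical helper written for illustration, not part of the analysis above. Unlike the case_when() approach, it assumes that an amount without a suffix is already in euros and keeps it instead of returning NA.

```r
# Hypothetical helper (illustration only): convert strings such as
# "€1.5M", "€560K", or "€500" to plain numbers.
parse_euro <- function(x) {
  # Numeric part, keeping any decimal point
  amount <- as.numeric(gsub("[^0-9.]", "", x))
  # Multiplier from the trailing suffix; no suffix is assumed to mean euros
  multiplier <- ifelse(grepl("M$", x), 1e6,
                       ifelse(grepl("K$", x), 1e3, 1))
  amount * multiplier
}

parse_euro(c("€1.5M", "€560K", "€500"))
```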
Lastly, a new column was created that shows the number of years a player is with a club, calculated by subtracting the contract start year from the end year. Now the data is ready to be analyzed.
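The split-and-subtract step can be checked on a toy example. The two contract strings below are made up, and the sketch assumes the simple "start ~ end" format without the extra leading characters mentioned above.

```r
library(dplyr)
library(tidyr)

# Made-up contract strings in the "start ~ end" format
toy <- data.frame(Contract = c("2008 ~ 2010", "2016 ~ 2023"),
                  stringsAsFactors = FALSE)

toy_years <- toy %>%
  separate(Contract, into = c("Contract_Start", "Contract_End"), sep = "~") %>%
  mutate(
    Contract_Start = as.numeric(Contract_Start),  # surrounding spaces are ignored
    Contract_End   = as.numeric(Contract_End),
    years          = Contract_End - Contract_Start
  )

toy_years$years
```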
ggplot(data = fifa, aes(x = years, y = Wage)) +
geom_point(color = '#289c60') +
geom_smooth(method = "lm", se = FALSE, color = '#637069') +
theme_minimal() +
theme(panel.grid = element_blank()) +
labs(y = 'Weekly Wage (€)',
x = 'Years') +
scale_y_continuous(labels = scales::number_format(scale = 1))
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data = fifa, aes(x = Wage)) +
geom_histogram() +
scale_x_continuous(labels = scales::number_format(scale = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
fifa = fifa %>%
mutate(Wage_ln = log(Wage))
ggplot(data = fifa, aes(x = Wage_ln)) +
geom_histogram() +
scale_x_continuous(labels = scales::number_format(scale = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = fifa, aes(x = years, y = Wage_ln)) +
geom_point(color = '#289c60') +
geom_smooth(method = "lm", se = FALSE, color = '#637069') +
theme_minimal() +
theme(panel.grid = element_blank()) +
labs(y = 'Weekly Wage (ln, €)',
x = 'Years') +
scale_y_continuous(labels = scales::number_format(scale = 1))
## `geom_smooth()` using formula = 'y ~ x'
fifa_lm = lm(Wage_ln ~ years + OVA, data = fifa)
summary(fifa_lm)
##
## Call:
## lm(formula = Wage_ln ~ years + OVA, data = fifa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2144 -0.4460 -0.0115 0.4426 2.5910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5602050 0.0608233 -25.65 <2e-16 ***
## years 0.0364326 0.0024821 14.68 <2e-16 ***
## OVA 0.1487736 0.0009161 162.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6483 on 13094 degrees of freedom
## Multiple R-squared: 0.6954, Adjusted R-squared: 0.6953
## F-statistic: 1.495e+04 on 2 and 13094 DF, p-value: < 2.2e-16
First, the scatterplot suggests a positive relationship between weekly wage and the number of years a player is with a club, with a noticeable bump between 5 and 10 years. There is also one very obvious outlier at about 16 years: Lionel Messi, who has since moved on to Inter Miami in MLS. The histogram confirms that the wage data contain outliers, as the distribution is extremely right-skewed. It therefore makes sense to transform Wage with a natural log to reduce this skew. The log transformation mitigates the outliers somewhat, but not fully, and the scatterplot of the transformed data shows a steeper positive relationship with fewer outliers.
Using the transformed data, a linear model can be run that includes OVA, the player's overall rating, as a covariate. This can indicate whether it pays for a player to stay loyal to a club. The model output shows a statistically significant positive relationship between log weekly wage and years with the club. OVA has a much larger t-value, which unsurprisingly suggests it is the stronger predictor of the two. Overall, the results suggest that, holding rating constant, staying loyal is associated with higher pay. Further analyses could examine each player's position and skill set, and whether these affect pay.
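Because the outcome is log-transformed, the coefficients can be read as approximate percentage changes in wage. The sketch below copies the two estimates from the model summary above and converts them with exp(beta) - 1. It assumes the usual interpretation of a model with a logged outcome.

```r
# Coefficients copied from the model summary above
beta_years <- 0.0364326
beta_ova   <- 0.1487736

# With a logged outcome, a one-unit increase in x multiplies y by exp(beta)
pct_years <- (exp(beta_years) - 1) * 100
pct_ova   <- (exp(beta_ova) - 1) * 100

round(c(years = pct_years, OVA = pct_ova), 1)
```

Under this reading, one additional year corresponds to roughly 3.7% higher weekly wage, and one additional rating point to roughly 16%, each holding the other variable constant.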
This data set contains the annual mean surface temperature change by country for the years 1961 to 2022. It is a simple data set that is well suited to analyzing and visualizing climate change in general, and to understanding which countries are most affected. First, the data is loaded, followed by cleaning and preparing it.
climate_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/Indicator_3_1_Climate_Indicators_Annual_Mean_Global_Surface_Temperature_577579683071085080.csv')
climate = climate_raw %>%
select(ISO3, X1961:X2022) %>%
rename(Country = ISO3)
After importing the data set above, the code selects only the necessary columns: ISO3, which is the country short code, and each year's data. The column ISO3 is renamed to Country. Not much cleaning was needed for this data set. Since there are 225 countries, it would not be worthwhile to plot them all at once. To analyze the data more efficiently, the mean temperature change over all years can be calculated for each country, and the five highest and five lowest can be used to get a good understanding. A time series for these can also be plotted, alongside a worldwide average, to understand the trajectories.
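The row-mean-then-rank idea can be sketched on a toy wide table. The country codes and values below are made up. slice_max() and slice_min() are used here as one possible way to pick the extremes, not the method used in this report.

```r
library(dplyr)

# Toy wide table (made-up values): one row per country, one column per year
toy <- data.frame(
  Country = c("AAA", "BBB", "CCC", "DDD"),
  X1961 = c(0.1, -0.2, 0.5, NA),
  X1962 = c(0.3,  0.0, 0.7, 0.4)
)

# Row-wise mean across the year columns, ignoring missing years
year_cols <- startsWith(names(toy), "X")
toy$average <- rowMeans(toy[, year_cols], na.rm = TRUE)

highest <- toy %>% slice_max(average, n = 2)  # largest averages first
lowest  <- toy %>% slice_min(average, n = 2)  # smallest averages first

highest$Country
lowest$Country
```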
climate <- climate %>%
mutate(average = rowMeans(select(., X1961:X2022), na.rm = TRUE))
top10 = climate %>%
arrange(desc(average)) %>%
head(5)
low10 = climate %>%
arrange(average) %>%
head(5)
toplow10 = rbind(top10, low10)
print(top10$average)
## [1] 1.584348 1.555581 1.541941 1.526348 1.513419
print(low10$average)
## [1] -0.10559184 -0.03678947 0.00800000 0.13909677 0.13943548
toplow10 = toplow10 %>%
gather(key = "Year", value = "Value", starts_with("X"))
toplow10$Year = as.numeric(sub("X", "", toplow10$Year))
mean_row = climate %>%
summarise(across(X1961:X2022, \(x) mean(x, na.rm = TRUE)))
mean_row = mean_row %>%
mutate(Country = "Mean") %>%
relocate(Country, .before = 1)
mean_row = mean_row %>%
gather(key = "Year", value = "Value", starts_with("X"))
mean_row$Year = as.numeric(sub("X", "", mean_row$Year))
ggplot() +
geom_line(data = toplow10, aes(x = Year, y = Value, group = Country, color = Country), linetype = 'solid') +
labs(title = "Over-the-year Temperature Change by Countries",
x = "Year",
y = "Temperature Change (°C)",
color = "Country") +
theme_minimal() +
theme(panel.grid = element_blank()) +
geom_line(data = mean_row, aes(x = Year, y = Value), linetype = "dashed")
## Warning: Removed 299 rows containing missing values (`geom_line()`).
The code above calculates the row-wise mean for each country and then selects the five countries with the highest and the five with the lowest averages, in order to show the extreme values. It also calculates the column-wise means, which give a global average for each year. Both of these new data frames are then converted into long format so that they can be plotted as time series showing the trajectories. Another informative plot is a world map that shows the average temperature change by country. Since the ISO3 code is available, this is quite simple (see below).
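The wide-to-long conversion can also be written with pivot_longer(), the current tidyr replacement for gather(). The sketch below is an alternative shown on made-up data, not the code used above. It strips the "X" prefix and converts the year to an integer in the same call.

```r
library(tidyr)

# Toy wide table (made-up values)
toy_wide <- data.frame(
  Country = c("AAA", "BBB"),
  X1961 = c(0.1, -0.2),
  X1962 = c(0.3, 0.0)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("X"),
  names_to = "Year",
  names_prefix = "X",                         # drop the leading "X"
  names_transform = list(Year = as.integer),  # "1961" becomes 1961L
  values_to = "Value"
)

toy_long
```

The result has one row per country and year, which is the shape geom_line() expects.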
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(rnaturalearth)
spatial_world <- ne_countries(returnclass = "sf")
spatial_climate <- merge(spatial_world, climate, by.x = "iso_a3", by.y = "Country", all.x = TRUE)
ggplot(spatial_climate) +
geom_sf(aes(fill = average)) +
scale_fill_gradient(low = "#81c8db", high = "#e30e15", name = "average") +
labs(title = "Avg. Temp. Changes (Last 60 yrs)", fill = "average") +
theme_minimal() +
theme(panel.grid = element_blank())
The code above loads two packages, sf and rnaturalearth. sf provides tools for working with and plotting spatial data, and rnaturalearth supplies spatial information for countries. Since the column Country contains the ISO3 codes, the code simply matches them against the spatial country data frame, then draws the map with geom_sf(). The map shows some interesting patterns, namely that most countries average an increase of 0.5 degrees or more over the last 60 years. It is also notable that some northern countries, such as Russia and Canada, report the highest average increases. This may be related to melting polar ice, which is associated with steeper temperature increases. Some equatorial countries also seem to experience steep increases in temperature. It would be interesting to look at this map with absolute average temperatures from 1961 and 2022, as it would surely give a different picture.
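The matching step is a left join on the ISO3 code. The base R sketch below reproduces it with two made-up plain data frames in place of the spatial object, to show what all.x = TRUE does with countries that have no temperature record.

```r
# Toy lookup tables (made-up values); "iso_a3" mirrors the key column used above
world <- data.frame(iso_a3 = c("AAA", "BBB", "CCC"),
                    name   = c("Country A", "Country B", "Country C"))
temps <- data.frame(Country = c("AAA", "CCC"),
                    average = c(0.8, 1.2))

# all.x = TRUE keeps every row of `world`; unmatched rows get NA for `average`
joined <- merge(world, temps, by.x = "iso_a3", by.y = "Country", all.x = TRUE)

joined
```

Countries left with NA would typically be drawn in the map's default missing-value color rather than dropped.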
This data set shows the cost of publishing scientific research in peer-reviewed journals, an endeavour that has become increasingly expensive for scientists and governments (as the primary funders of scientific research). With this data set, the question of which journals are most expensive can be answered. The data set is imported first and then cleaned.
research_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/University%20returns_for_figshare_FINAL.csv')
research = research_raw %>%
select(Publisher, Journal.title, COST.....charged.to.Wellcome..inc.VAT.when.charged., Article.title)
research <- research %>%
mutate(Cost = parse_number(COST.....charged.to.Wellcome..inc.VAT.when.charged.))
research$Cost = as.numeric(research$Cost)
The code above imported the dataset and cleaned the cost column. The raw column included the GBP symbol (£), which is not suitable for a numeric data type. parse_number() removed it, and the result was stored in a new column with the shorter name Cost. The two main questions concern the distribution of publishing costs and which publishers are the most and least expensive.
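As a quick check of what parse_number() does, the sketch below runs it on a few made-up cost strings. The exact formatting of the raw column, including any thousands separators, is an assumption here.

```r
library(readr)

# Made-up cost strings (illustrative only)
raw_cost <- c("£2,381.04", "£642.89", "£0.00")

# parse_number() drops the currency symbol and, with the default locale,
# treats "," as a grouping mark
cost <- parse_number(raw_cost)

is.numeric(cost)
cost
```

Since the result is already numeric, a further as.numeric() conversion is not needed.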
ggplot(data = research, aes(x = Cost)) +
geom_histogram() +
labs(x = "Cost (£)", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mean_costs <- research %>%
group_by(Publisher) %>%
summarise(mean_cost = mean(Cost, na.rm = TRUE))
print(max(mean_costs$mean_cost))
## [1] 13200
print(min(mean_costs$mean_cost))
## [1] 45.94
As the histogram above shows, most publication costs fall between £0 and £5,000, but there are a few more expensive outliers. The code computes the mean cost per publisher; the highest value, £13,200, belongs to MacMillan and appears to correspond to a single publication with no journal name indicated. It is more than twice as expensive as the second most expensive publication, so it is possible that this is a book. The lowest mean cost, £45.94, belongs to the publisher American Society for Nutrition. This dataset is challenging because it is not well recorded (e.g., the publisher names contain spelling errors), so it is hard to analyze further without extensive manual cleaning. While this dataset contains interesting information, it is a perfect example of how important data quality is.
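One light-touch cleaning step would be to normalize case and whitespace in the publisher names before grouping. The sketch below shows this on made-up rows. It only merges variants that differ in capitalization or stray spaces, and genuine misspellings would still need manual correction or fuzzy matching.

```r
library(dplyr)

# Made-up rows (illustrative only): the same publisher entered inconsistently
toy <- data.frame(
  Publisher = c("Elsevier", "ELSEVIER ", "elsevier", "Wiley", "wiley"),
  Cost = c(1000, 2000, 3000, 500, 1500)
)

mean_by_publisher <- toy %>%
  mutate(Publisher_clean = tolower(trimws(Publisher))) %>%  # normalize names
  group_by(Publisher_clean) %>%
  summarise(mean_cost = mean(Cost), n = n())

mean_by_publisher
```

Without the normalization step, the five rows would form five separate groups instead of two.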