DATA607 Project 2

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] (3) Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

The three datasets picked are: (1) FIFA21 Player Information (2) Surface Temperature by Country (3) Cost of Scientific Publications in 2012 - 13

FIFA21 Player Information

The FIFA21 Player Information dataset was scraped from the web. It is in long format but contains many incomplete records and special characters, so it needs to be cleaned before it can be analyzed. The main question here is: are players paid more the longer they stay with a club, holding skill constant (i.e., with skill as a covariate)?

library(tidyr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(ggplot2)
fifa_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/fifa21_male2.csv')
summary(fifa_raw)

##        ID             Name                Age             OVA       
##  Min.   :     2   Length:17125       Min.   :16.00   Min.   :38.00  
##  1st Qu.:204082   Class :character   1st Qu.:21.00   1st Qu.:62.00  
##  Median :228961   Mode  :character   Median :25.00   Median :67.00  
##  Mean   :219389                      Mean   :25.27   Mean   :66.97  
##  3rd Qu.:243911                      3rd Qu.:29.00   3rd Qu.:72.00  
##  Max.   :259105                      Max.   :53.00   Max.   :93.00  
##                                                                     
##  Nationality            Club                BOV            BP           
##  Length:17125       Length:17125       Min.   :42.0   Length:17125      
##  Class :character   Class :character   1st Qu.:64.0   Class :character  
##  Mode  :character   Mode  :character   Median :68.0   Mode  :character  
##                                        Mean   :67.9                     
##                                        3rd Qu.:72.0                     
##                                        Max.   :93.0                     
##                                                                         
##    Position         Player.Photo        Club.Logo          Flag.Photo       
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       POT        Team...Contract       Height             Weight         
##  Min.   :47.00   Length:17125       Length:17125       Length:17125      
##  1st Qu.:69.00   Class :character   Class :character   Class :character  
##  Median :72.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :72.49                                                           
##  3rd Qu.:76.00                                                           
##  Max.   :95.00                                                           
##                                                                          
##      foot               Growth          Joined          Loan.Date.End     
##  Length:17125       Min.   :-1.000   Length:17125       Length:17125      
##  Class :character   1st Qu.: 0.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.525                                        
##                     3rd Qu.: 9.000                                        
##                     Max.   :26.000                                        
##                                                                           
##     Value               Wage           Release.Clause       Contract        
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    Attacking        Crossing       Finishing     Heading.Accuracy
##  Min.   : 33.0   Min.   : 6.00   Min.   : 3.00   Min.   : 5.0    
##  1st Qu.:232.0   1st Qu.:41.00   1st Qu.:33.00   1st Qu.:46.0    
##  Median :271.0   Median :56.00   Median :52.00   Median :57.0    
##  Mean   :258.5   Mean   :51.62   Mean   :47.96   Mean   :53.6    
##  3rd Qu.:306.0   3rd Qu.:65.00   3rd Qu.:64.00   3rd Qu.:65.0    
##  Max.   :437.0   Max.   :94.00   Max.   :95.00   Max.   :93.0    
##                                                                  
##  Short.Passing      Volleys          Skill         Dribbling    
##  Min.   : 7.00   Min.   : 4.00   Min.   : 43.0   Min.   : 5.00  
##  1st Qu.:56.00   1st Qu.:32.00   1st Qu.:232.0   1st Qu.:53.00  
##  Median :63.00   Median :46.00   Median :279.0   Median :63.00  
##  Mean   :60.51   Mean   :45.01   Mean   :266.6   Mean   :57.85  
##  3rd Qu.:69.00   3rd Qu.:59.00   3rd Qu.:319.0   3rd Qu.:70.00  
##  Max.   :94.00   Max.   :90.00   Max.   :470.0   Max.   :96.00  
##                  NA's   :58                                     
##      Curve        FK.Accuracy     Long.Passing    Ball.Control  
##  Min.   : 4.00   Min.   : 5.00   Min.   : 9.00   Min.   : 5.00  
##  1st Qu.:37.00   1st Qu.:32.00   1st Qu.:45.00   1st Qu.:57.00  
##  Median :51.00   Median :43.00   Median :57.00   Median :64.00  
##  Mean   :49.57   Mean   :44.39   Mean   :54.32   Mean   :60.64  
##  3rd Qu.:64.00   3rd Qu.:58.00   3rd Qu.:65.00   3rd Qu.:70.00  
##  Max.   :94.00   Max.   :94.00   Max.   :93.00   Max.   :96.00  
##  NA's   :58                                                     
##     Movement      Acceleration    Sprint.Speed      Agility       Reactions    
##  Min.   :113.0   Min.   :12.00   Min.   :11.00   Min.   :14.0   Min.   :24.00  
##  1st Qu.:294.0   1st Qu.:58.00   1st Qu.:59.00   1st Qu.:57.0   1st Qu.:57.00  
##  Median :331.0   Median :68.00   Median :68.00   Median :67.0   Median :63.00  
##  Mean   :322.7   Mean   :65.45   Mean   :65.44   Mean   :64.6   Mean   :62.92  
##  3rd Qu.:360.0   3rd Qu.:75.00   3rd Qu.:75.00   3rd Qu.:75.0   3rd Qu.:69.00  
##  Max.   :464.0   Max.   :97.00   Max.   :96.00   Max.   :96.0   Max.   :96.00  
##                                                  NA's   :58                    
##     Balance          Power         Shot.Power       Jumping     
##  Min.   :17.00   Min.   :128.0   Min.   :12.00   Min.   :22.00  
##  1st Qu.:57.00   1st Qu.:272.0   1st Qu.:50.00   1st Qu.:58.00  
##  Median :67.00   Median :308.0   Median :61.00   Median :66.00  
##  Mean   :64.72   Mean   :302.4   Mean   :59.71   Mean   :65.17  
##  3rd Qu.:75.00   3rd Qu.:339.0   3rd Qu.:70.00   3rd Qu.:73.00  
##  Max.   :97.00   Max.   :444.0   Max.   :95.00   Max.   :95.00  
##  NA's   :58                                      NA's   :58     
##     Stamina         Strength       Long.Shots      Mentality       Aggression
##  Min.   :11.00   Min.   :16.00   Min.   : 4.00   Min.   : 50.0   Min.   : 9  
##  1st Qu.:56.00   1st Qu.:58.00   1st Qu.:35.00   1st Qu.:235.0   1st Qu.:45  
##  Median :66.00   Median :67.00   Median :53.00   Median :269.0   Median :60  
##  Mean   :63.31   Mean   :65.31   Mean   :49.14   Mean   :261.9   Mean   :57  
##  3rd Qu.:73.00   3rd Qu.:74.00   3rd Qu.:64.00   3rd Qu.:304.0   3rd Qu.:70  
##  Max.   :97.00   Max.   :97.00   Max.   :94.00   Max.   :421.0   Max.   :96  
##                                                                              
##  Interceptions    Positioning        Vision        Penalties    
##  Min.   : 4.00   Min.   : 2.00   Min.   :10.00   Min.   : 7.00  
##  1st Qu.:26.00   1st Qu.:43.00   1st Qu.:47.00   1st Qu.:40.00  
##  Median :53.00   Median :57.00   Median :57.00   Median :51.00  
##  Mean   :47.09   Mean   :52.37   Mean   :55.44   Mean   :50.25  
##  3rd Qu.:65.00   3rd Qu.:66.00   3rd Qu.:65.00   3rd Qu.:62.00  
##  Max.   :95.00   Max.   :96.00   Max.   :95.00   Max.   :94.00  
##  NA's   :7       NA's   :7       NA's   :58                     
##    Composure       Defending        Marking      Standing.Tackle Sliding.Tackle
##  Min.   :12.00   Min.   : 17.0   Min.   : 3.00   Min.   : 5.00   Min.   : 6.0  
##  1st Qu.:53.00   1st Qu.: 84.0   1st Qu.:29.00   1st Qu.:28.00   1st Qu.:25.0  
##  Median :61.00   Median :158.0   Median :52.00   Median :55.00   Median :52.0  
##  Mean   :59.94   Mean   :141.5   Mean   :47.25   Mean   :48.28   Mean   :46.1  
##  3rd Qu.:68.00   3rd Qu.:194.0   3rd Qu.:64.00   3rd Qu.:66.00   3rd Qu.:64.0  
##  Max.   :96.00   Max.   :272.0   Max.   :94.00   Max.   :93.00   Max.   :95.0  
##  NA's   :423                                                     NA's   :58    
##   Goalkeeping       GK.Diving     GK.Handling      GK.Kicking   
##  Min.   :  5.00   Min.   : 1.0   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.: 48.00   1st Qu.: 8.0   1st Qu.: 8.00   1st Qu.: 8.00  
##  Median : 53.00   Median :11.0   Median :11.00   Median :11.00  
##  Mean   : 77.61   Mean   :15.6   Mean   :15.48   Mean   :15.47  
##  3rd Qu.: 59.00   3rd Qu.:14.0   3rd Qu.:14.00   3rd Qu.:14.00  
##  Max.   :440.00   Max.   :90.0   Max.   :92.00   Max.   :93.00  
##                                                                 
##  GK.Positioning   GK.Reflexes     Total.Stats     Base.Stats   
##  Min.   : 1.00   Min.   : 1.00   Min.   : 731   Min.   :228.0  
##  1st Qu.: 8.00   1st Qu.: 8.00   1st Qu.:1492   1st Qu.:333.0  
##  Median :11.00   Median :11.00   Median :1659   Median :362.0  
##  Mean   :15.51   Mean   :15.74   Mean   :1631   Mean   :361.4  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:1812   3rd Qu.:390.0  
##  Max.   :93.00   Max.   :90.00   Max.   :2316   Max.   :498.0  
##                                                                
##      W.F                 SM                A.W                D.W           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       IR                 PAC             SHO             PAS       
##  Length:17125       Min.   :25.00   Min.   :16.00   Min.   :25.00  
##  Class :character   1st Qu.:62.00   1st Qu.:46.00   1st Qu.:52.00  
##  Mode  :character   Median :69.00   Median :58.00   Median :60.00  
##                     Mean   :68.09   Mean   :54.97   Mean   :58.93  
##                     3rd Qu.:75.00   3rd Qu.:65.00   3rd Qu.:66.00  
##                     Max.   :96.00   Max.   :93.00   Max.   :93.00  
##                                                                    
##       DRI             DEF             PHY            Hits          
##  Min.   :28.00   Min.   :12.00   Min.   :27.00   Length:17125      
##  1st Qu.:59.00   1st Qu.:35.00   1st Qu.:59.00   Class :character  
##  Median :65.00   Median :53.00   Median :66.00   Mode  :character  
##  Mean   :64.21   Mean   :50.27   Mean   :64.91                     
##  3rd Qu.:71.00   3rd Qu.:64.00   3rd Qu.:72.00                     
##  Max.   :95.00   Max.   :91.00   Max.   :93.00                     
##                                                                    
##       LS                 ST                 RS                 LW           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       LF                 CF                 RF                 RW           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      LAM                CAM                RAM                 LM           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      LCM                 CM                RCM                 RM           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      LWB                LDM                CDM                RDM           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      RWB                 LB                LCB                 CB           
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      RCB                 RB                 GK               Gender         
##  Length:17125       Length:17125       Length:17125       Length:17125      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##

As the summary() output shows, this spreadsheet contains a lot of information, and most columns are stored as character even though they contain numbers. The first step is therefore to select the columns of interest and then adjust their data types. Some data is also missing, but the missing entries are stored as empty strings rather than NA, so they need to be set to NA and then removed. The missing data seems to stem from retired players: because they have no current club, they have no Joined date.

fifa = fifa_raw %>%
  select(Name, Age, OVA, Club, Joined, Value, Wage, Contract) %>%
  mutate(Joined = na_if(Joined, '')) %>%
  drop_na(Joined)

fifa = fifa %>% 
  filter(!grepl('Free', Contract)) %>%
  filter(!grepl('On Loan', Contract)) %>%
  filter(!Value == '€0')

The code above retains only the relevant columns and removes rows with a missing Joined date. Visual inspection, combined with domain knowledge, showed that the Contract column contains more information than needed, namely whether a player is on loan to another club or is a free agent. Both cases are outside the scope of the question, so the code removes them using grepl() for partial matching. The code also removes players with a value of €0. Next, the Contract column needs to be split into contract start and end dates, and the Value and Wage columns need to be converted to numeric.
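The filtering steps described above can be tried on a few made-up rows. This is only a sketch: the toy values below are hypothetical and are not taken from the real file, and the two grepl() filters are combined into one pattern for brevity.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy rows that mimic the scraped columns
toy <- tibble(
  Name     = c("A", "B", "C", "D", "E"),
  Joined   = c("Jul 1, 2004", "", "Aug 3, 2018", "Jan 1, 2019", "Jul 1, 2015"),
  Value    = c("€67.5M", "€1M", "€0", "€500K", "€2M"),
  Contract = c("2004 ~ 2021", "Free", "2018 ~ 2022",
               "Jun 30, 2021 On Loan", "2015 ~ 2023")
)

cleaned <- toy %>%
  mutate(Joined = na_if(Joined, "")) %>%          # empty string becomes NA
  drop_na(Joined) %>%                             # removes player B
  filter(!grepl("Free|On Loan", Contract)) %>%    # partial match removes player D
  filter(Value != "€0")                           # removes player C

cleaned$Name
```

Only players A and E remain, which makes it easy to see which rule removed which row.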

fifa = fifa %>%
  separate(Contract, into = c('Contract_Start', 'Contract_End'), sep = '~') %>%
  mutate(Contract_End = as.numeric(Contract_End)) %>%
  mutate(Contract_Start = as.numeric(str_sub(Contract_Start, start = -5)))

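# Convert strings such as "€1.5M" or "€500K" to numbers.
# The pattern '\\d+\\.?\\d*' keeps the decimal part (a plain '\\d+' would turn 1.5 into 1).
fifa = fifa %>%
  mutate(Value = case_when(
    str_detect(Value, 'K$') ~ as.numeric(str_extract(Value, '\\d+\\.?\\d*')) * 1000,
    str_detect(Value, 'M$') ~ as.numeric(str_extract(Value, '\\d+\\.?\\d*')) * 1000000
  ))

# Same conversion for Wage; entries without a K or M suffix become NA
fifa = fifa %>%
  mutate(Wage = case_when(
    str_detect(Wage, 'K$') ~ as.numeric(str_extract(Wage, '\\d+\\.?\\d*')) * 1000,
    str_detect(Wage, 'M$') ~ as.numeric(str_extract(Wage, '\\d+\\.?\\d*')) * 1000000
  ))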

fifa = fifa %>%
  mutate(years = Contract_End - Contract_Start) %>%
  drop_na(Wage)

The code above splits the Contract column into two, using ~ as the separator (e.g., 2008 ~ 2010). Inspection showed that some rows had additional characters before the contract start year, so only the last five characters of the Contract_Start column are kept. Both columns are converted to numeric at the same time. Following that, the Value and Wage columns are converted to numeric. The stringr function str_detect() serves as the condition inside case_when(), checking whether a value ends in K or M (for thousand and million). Depending on which suffix is found, the numeric part is extracted and multiplied by one thousand or one million.

Lastly, a new column is created that holds the number of years covered by a player's contract with the club, obtained by subtracting the contract start year from the end year. Now the data is ready to be analyzed.
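As a small illustration of the splitting and currency parsing described above, the sketch below runs the same idea on two made-up rows. The helper `parse_euro()` is hypothetical (it is not part of the original code), and it assumes that a value without a K or M suffix is already in plain euros.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical helper: "€1.5M" -> 1500000, "€500K" -> 500000, "€900" -> 900
parse_euro <- function(x) {
  amount <- as.numeric(str_extract(x, "\\d+\\.?\\d*"))  # keeps the decimal part
  multiplier <- case_when(
    str_detect(x, "M$") ~ 1e6,
    str_detect(x, "K$") ~ 1e3,
    TRUE                ~ 1       # assumption: no suffix means plain euros
  )
  amount * multiplier
}

# Hypothetical toy rows, including one with extra text before the start year
toy <- tibble(
  Contract = c("2004 ~ 2021", "Real Madrid 2018 ~ 2023"),
  Value    = c("€67.5M", "€500K"),
  Wage     = c("€560K", "€900")
)

parsed <- toy %>%
  separate(Contract, into = c("Contract_Start", "Contract_End"), sep = "~") %>%
  mutate(
    Contract_Start = as.numeric(str_sub(Contract_Start, start = -5)),  # last 5 chars
    Contract_End   = as.numeric(Contract_End),
    years          = Contract_End - Contract_Start,
    Value          = parse_euro(Value),
    Wage           = parse_euro(Wage)
  )

parsed
```

Keeping the conversion in one function means Value and Wage are treated identically, and the fallback branch avoids silently turning unsuffixed amounts into NA.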

ggplot(data = fifa, aes(x = years, y = Wage)) +
  geom_point(color = '#289c60') +
  geom_smooth(method = "lm", se = FALSE, color = '#637069') +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  labs(y = 'Weekly Wage (€)',
       x = 'Years') +
  scale_y_continuous(labels = scales::number_format(scale = 1))

## `geom_smooth()` using formula = 'y ~ x'

ggplot(data = fifa, aes(x = Wage)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::number_format(scale = 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

fifa = fifa %>%
  mutate(Wage_ln = log(Wage))

ggplot(data = fifa, aes(x = Wage_ln)) +
  geom_histogram() +
  scale_x_continuous(labels = scales::number_format(scale = 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = fifa, aes(x = years, y = Wage_ln)) +
  geom_point(color = '#289c60') +
  geom_smooth(method = "lm", se = FALSE, color = '#637069') +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  labs(y = 'Weekly Wage (ln, €)',
       x = 'Years') +
  scale_y_continuous(labels = scales::number_format(scale = 1))

## `geom_smooth()` using formula = 'y ~ x'

fifa_lm = lm(Wage_ln ~ years + OVA, data = fifa)
summary(fifa_lm)

## 
## Call:
## lm(formula = Wage_ln ~ years + OVA, data = fifa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2144 -0.4460 -0.0115  0.4426  2.5910 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.5602050  0.0608233  -25.65   <2e-16 ***
## years        0.0364326  0.0024821   14.68   <2e-16 ***
## OVA          0.1487736  0.0009161  162.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6483 on 13094 degrees of freedom
## Multiple R-squared:  0.6954, Adjusted R-squared:  0.6953 
## F-statistic: 1.495e+04 on 2 and 13094 DF,  p-value: < 2.2e-16

Looking at the first scatterplot, there appears to be a positive relationship between weekly wage and the number of years a player is with a club, with a noticeable bump between 5 and 10 years. There is also one very obvious outlier at about 16 years: Lionel Messi, who has since moved on to Inter Miami in MLS. The histogram confirms that the wage data contains outliers, as it is extremely right-skewed. It therefore makes sense to transform Wage with the natural log to reduce this skew. The transformation mitigates the outliers somewhat, though not fully, and the scatterplot of the log-transformed wage shows a steeper positive relationship with fewer outliers.

Using the transformed data, a linear model is fitted that includes OVA, the player’s overall rating, as a covariate. This can indicate whether it pays for a player to stay loyal to a club. The model output shows a statistically significant relationship between weekly wage and years at the club. OVA has the much larger t-value (162.40 versus 14.68), so, unsurprisingly, it is the stronger predictor of the two. Overall, the model suggests that staying loyal is associated with higher pay. A further analysis could examine each player’s position and skill set and whether these affect pay.
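Because the outcome is the natural log of the wage, the coefficients can be read as approximate percentage changes. The short calculation below is a sketch of that reading, using the two estimates copied from the model summary; the variable names are chosen here for illustration.

```r
# Coefficients copied from the lm() summary output
b_years <- 0.0364326
b_ova   <- 0.1487736

# In a log-linear model, a one-unit increase in a predictor multiplies the
# outcome by exp(b), so (exp(b) - 1) * 100 is the percentage change.
pct_years <- (exp(b_years) - 1) * 100
pct_ova   <- (exp(b_ova) - 1) * 100

round(c(years = pct_years, OVA = pct_ova), 2)
```

Under this reading, one additional contract year is associated with roughly 3.7% higher weekly wage at a fixed overall rating, while one additional rating point is associated with roughly 16% higher wage.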

Surface Temperature by Country

This data set contains the annual mean surface temperature change by country for the years 1961 to 2022. It is a simple data set that is well suited to analyzing and visualizing climate change in general, and to understanding which countries are most affected. First, the data is loaded, then cleaned and prepared.

climate_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/Indicator_3_1_Climate_Indicators_Annual_Mean_Global_Surface_Temperature_577579683071085080.csv')

climate = climate_raw %>%
  select(ISO3, X1961:X2022) %>%
  rename(Country = ISO3)

After importing the data set, the code selects only the necessary columns: ISO3, the country short code, and the columns for each year. The ISO3 column is renamed to Country. Not much cleaning was needed for this data set. Since there are 225 countries, it would not be worthwhile to plot them all at once. To analyze the data more efficiently, the mean temperature change over all years can be calculated for each country, and the five highest and five lowest can be used to get a good understanding. Additionally, a time series for these can be plotted alongside a worldwide average to understand the trajectories.

climate <- climate %>%
  mutate(average = rowMeans(select(., X1961:X2022), na.rm = TRUE))

# Note: despite the "10" in the names, head(5) keeps five countries per group
top10 = climate %>%
  arrange(desc(average)) %>%
  head(5)
low10 = climate %>%
  arrange(average) %>%
  head(5)
toplow10 = rbind(top10, low10)
print(top10$average)

## [1] 1.584348 1.555581 1.541941 1.526348 1.513419

print(low10$average)

## [1] -0.10559184 -0.03678947  0.00800000  0.13909677  0.13943548

toplow10 = toplow10 %>%
  gather(key = "Year", value = "Value", starts_with("X"))
toplow10$Year = as.numeric(sub("X", "", toplow10$Year))

mean_row = climate %>%
  summarise(across(X1961:X2022, \(x) mean(x, na.rm = TRUE)))

mean_row = mean_row %>%
  mutate(Country = "Mean") %>%
  relocate(Country, .before = 1)
mean_row = mean_row %>%
  gather(key = "Year", value = "Value", starts_with("X"))
mean_row$Year = as.numeric(sub("X", "", mean_row$Year))
  

ggplot() +
  geom_line(data = toplow10, aes(x = Year, y = Value, group = Country, color = Country), linetype = 'solid') +
  labs(title = "Over-the-year Temperature Change by Countries",
       x = "Year",
       y = "Temperature Change (°C)",
       color = "Country") +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  geom_line(data = mean_row, aes(x = Year, y = Value), linetype = "dashed")

## Warning: Removed 299 rows containing missing values (`geom_line()`).

The code above calculates the row-wise mean for each country and then extracts the five countries with the highest and the five with the lowest average in order to show the extremes. It also calculates the column-wise means to obtain a global average per year. Both new data frames are then converted into long format so that they can be plotted as a time series showing the trajectories. Another informative plot is a world heatmap of the average temperature change by country. Since the ISO3 code is available, this is quite simple; see below.
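The row-wise mean, column-wise mean, and wide-to-long steps can be sketched on a tiny made-up table. The country codes and values below are hypothetical, and the sketch uses pivot_longer(), the current tidyr replacement for gather(), together with the anonymous-function form of across().

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same wide layout (one column per year)
toy <- tibble(
  Country = c("AAA", "BBB", "CCC"),
  X1961   = c(0.1, -0.2, NA),
  X1962   = c(0.3,  0.0, 0.5),
  X1963   = c(0.5,  0.2, 0.7)
)

# Row-wise mean per country, ignoring missing years
toy <- toy %>%
  mutate(average = rowMeans(across(X1961:X1963), na.rm = TRUE))

# Column-wise mean per year, labelled as its own "country"
world <- toy %>%
  summarise(across(X1961:X1963, \(x) mean(x, na.rm = TRUE))) %>%
  mutate(Country = "Mean", .before = 1)

# Wide -> long: one row per country and year
long <- bind_rows(select(toy, -average), world) %>%
  pivot_longer(starts_with("X"), names_to = "Year", values_to = "Value") %>%
  mutate(Year = as.numeric(sub("X", "", Year)))

long
```

The long table has one row per country-year pair, which is the shape geom_line() expects when it groups and colors lines by Country.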

library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

library(rnaturalearth)
spatial_world <- ne_countries(returnclass = "sf")
spatial_climate <- merge(spatial_world, climate, by.x = "iso_a3", by.y = "Country", all.x = TRUE)

ggplot(spatial_climate) +
  geom_sf(aes(fill = average)) +
  scale_fill_gradient(low = "#81c8db", high = "#e30e15", name = "average") +
  labs(title = "Avg. Temp. Changes (Last 60 yrs)", fill = "average") +
  theme_minimal() +
  theme(panel.grid = element_blank())

The code above loads the two packages sf and rnaturalearth. sf makes it possible to work with and plot spatial data, and rnaturalearth provides spatial information for countries. Since the Country column contains ISO3 codes, the code simply merges it with the spatial country data frame and then draws the map with geom_sf(). The map shows some interesting patterns, namely that most countries have an average change of 0.5 degrees or more over the last 60 years. It is also notable that some northern countries, such as Russia and Canada, report the largest average increases. This is likely explained by the melting of polar ice, which results in steeper temperature increases. Some equatorial countries also seem to experience steep temperature increases. It would be interesting to look at this map with absolute average temperatures from 1961 and 2022, as it would surely give a different picture.
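The merge step keeps every country shape and attaches the climate average where the ISO3 code matches. The sketch below shows that behaviour with plain data frames instead of sf objects; both tables and all values are hypothetical stand-ins.

```r
# Hypothetical stand-ins: `shapes` plays the role of the spatial table,
# `temps` the role of the climate table keyed by ISO3 code
shapes <- data.frame(iso_a3 = c("CAN", "RUS", "XXX"),
                     name   = c("Canada", "Russia", "No match"))
temps  <- data.frame(Country = c("CAN", "RUS", "BRA"),
                     average = c(1.2, 1.4, 0.8))

# all.x = TRUE keeps every row of `shapes`; unmatched codes get NA
joined <- merge(shapes, temps, by.x = "iso_a3", by.y = "Country", all.x = TRUE)

# Codes without a match would appear as unfilled (NA) polygons on the map
unmatched <- joined$iso_a3[is.na(joined$average)]
unmatched
```

Listing the unmatched codes after a merge like this is a quick way to check whether blank areas on the map come from missing data or from mismatched country codes.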

Cost of Scientific Publications in 2012 - 13

This data set shows the cost of publishing scientific research in peer-reviewed journals, an endeavour that has become increasingly expensive for scientists and governments (as the primary funders of scientific research). With this data set, the question of which journals are most expensive can be answered. For this, the data set is first imported and then cleaned.

research_raw = read.csv('https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/University%20returns_for_figshare_FINAL.csv')
research = research_raw %>%
  select(Publisher, Journal.title, COST.....charged.to.Wellcome..inc.VAT.when.charged., Article.title)

# parse_number() drops the currency symbol and already returns a numeric vector
research <- research %>%
  mutate(Cost = parse_number(COST.....charged.to.Wellcome..inc.VAT.when.charged.))

The code above imports the dataset and cleans the cost column. The raw column includes the GBP symbol (£), which is not suitable for a numeric data type. parse_number() strips it, and the result is stored in a new column with the shorter name Cost. The two main questions concern the distribution of publishing costs and which publishers are the most and least expensive.
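A quick way to see what parse_number() does is to run it on a few cost strings. The strings below are made up in the style of the raw column and are not taken from the file.

```r
library(readr)

# Hypothetical cost strings: with symbol and thousands separator, with symbol, plain
raw_cost <- c("£1,234.56", "£825.68", "2381.04")

# parse_number() drops non-numeric characters around the number and, by
# default, treats "," as a grouping mark
cost <- parse_number(raw_cost)

is.numeric(cost)
cost
```

The result is already a double vector, so no further as.numeric() conversion is needed.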

ggplot(data = research, aes(x = Cost)) +
  geom_histogram() +
  labs(x = "Cost (£)", y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean_costs <- research %>%
  group_by(Publisher) %>%
  summarise(mean_cost = mean(Cost, na.rm = TRUE))
print(max(mean_costs$mean_cost))

## [1] 13200

print(min(mean_costs$mean_cost))

## [1] 45.94

As the histogram above shows, most publication costs fall between £0 and £5,000, but there are a few more expensive outliers. Specifically, the most expensive entry is a publication with the publisher MacMillan, for which no journal name is indicated. It cost £13,200, more than twice as much as the second most expensive publication, so it is possible that this is a book. The least expensive entry is with the publisher American Society for Nutrition and cost £45.94. This dataset is challenging because it is not well recorded (e.g., the publisher names contain spelling errors), so it is hard to analyze further without extensive manual cleaning. While the dataset contains interesting information, it is a perfect example of how important data quality is.
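One light-touch cleaning step that could reduce the duplicate publisher groups is to normalize case and whitespace before grouping. The sketch below uses hypothetical names and costs; it only merges variants that differ in capitalization or spacing and would not fix genuine misspellings.

```r
library(dplyr)
library(stringr)

# Hypothetical publisher variants and costs
toy <- tibble(
  Publisher = c("Elsevier", "ELSEVIER", " Elsevier ", "Wiley", "wiley"),
  Cost      = c(2000, 2400, 1600, 1500, 1700)
)

tidy_names <- toy %>%
  mutate(Publisher = str_to_title(str_squish(Publisher))) %>%  # trim, then Title Case
  group_by(Publisher) %>%
  summarise(n = n(), mean_cost = mean(Cost))

tidy_names
```

After normalization the five rows collapse into two publishers, so group means are no longer split across spelling variants of the same name.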

Lucas Weyrich

2024-03-01
