In this project, we’re going to work through three datasets, cleaning and analyzing them to uncover interesting insights. We’ll be focusing on:
- Fake Untidy Data – tidying up messy mental health survey data and seeing what trends we can find.
- World Population Data – looking at how populations have changed over time and which countries have seen the fastest growth.
- University Returns Data – examining the article processing charges (APCs) paid by the Wellcome Trust to see which publishers account for the most publication spending.
First up is the untidy dataset. This report analyzes data from mental health surveys that include the 7-item Generalized Anxiety Disorder scale (GAD-7) and the 9-item Patient Health Questionnaire (PHQ-9). The dataset is untidy and needs cleaning before analysis.
# Read the untidy dataset
data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/Fake_untidy_data.csv")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## Rows: 87 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ...2, ...3, GAD 7 Questions, GAD 7, gad response, PHQ 9 Questions, ...
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Keep only rows that contain actual GAD-7 or PHQ-9 data
gad_phq_data <- data %>%
filter(!is.na(`GAD 7 Questions`) | !is.na(`PHQ 9 Questions`))
# Keep only the relevant columns
tidy_data <- gad_phq_data %>%
select(id, `GAD 7 Questions`, `gad response`, `PHQ 9 Questions`, `phq response`)
# Fill forward the 'id' column for missing IDs
tidy_data <- tidy_data %>%
fill(id)
# Show the cleaned data
head(tidy_data)
## # A tibble: 6 × 5
## id `GAD 7 Questions` `gad response` `PHQ 9 Questions` `phq response`
## <dbl> <chr> <chr> <chr> <chr>
## 1 202 how often have they bee… <NA> Over the last 2 … 2
## 2 202 Feeling nervous, anxiou… 2 Little interest … 0
## 3 202 Not being able to stop … 0 Feeling down, de… 0
## 4 202 Worrying too much about… 2 Trouble falling … 0
## 5 202 Trouble relaxing 3 Feeling tired or… 2
## 6 202 Being so restless that … 2 Poor appetite or… 3
We will now separate the GAD-7 and PHQ-9 responses for further analysis.
# Separate GAD-7 and PHQ-9 data
gad7_data <- tidy_data %>%
filter(!is.na(`GAD 7 Questions`)) %>%
select(id, `GAD 7 Questions`, `gad response`)
phq9_data <- tidy_data %>%
filter(!is.na(`PHQ 9 Questions`)) %>%
select(id, `PHQ 9 Questions`, `phq response`)
# Display the separated data
head(gad7_data)
## # A tibble: 6 × 3
## id `GAD 7 Questions` `gad response`
## <dbl> <chr> <chr>
## 1 202 how often have they been bothered by the following over … <NA>
## 2 202 Feeling nervous, anxious, or on edge 2
## 3 202 Not being able to stop or control worrying 0
## 4 202 Worrying too much about different things 2
## 5 202 Trouble relaxing 3
## 6 202 Being so restless that it's hard to sit still 2
head(phq9_data)
## # A tibble: 6 × 3
## id `PHQ 9 Questions` `phq response`
## <dbl> <chr> <chr>
## 1 202 Over the last 2 weeks, how often have you been bothered … 2
## 2 202 Little interest or pleasure in doing things 0
## 3 202 Feeling down, depressed, or hopeless 0
## 4 202 Trouble falling or staying asleep, or sleeping too much 0
## 5 202 Feeling tired or having little energy 2
## 6 202 Poor appetite or overeating 3
Let’s calculate summary statistics for the GAD-7 scores.
# Convert responses to numeric
gad7_data$`gad response` <- as.numeric(gad7_data$`gad response`)
## Warning: NAs introduced by coercion
# Calculate mean and standard deviation for GAD-7 scores
gad7_summary <- gad7_data %>%
group_by(id) %>%
summarise(mean_gad7 = mean(`gad response`, na.rm = TRUE),
sd_gad7 = sd(`gad response`, na.rm = TRUE))
# Show the summary statistics
gad7_summary
## # A tibble: 4 × 3
## id mean_gad7 sd_gad7
## <dbl> <dbl> <dbl>
## 1 202 1.64 1.08
## 2 207 2 1
## 3 211 1.2 1.32
## 4 242 1.5 0.760
Similarly, we will calculate the summary statistics for the PHQ-9 scores.
# Convert responses to numeric
phq9_data$`phq response` <- as.numeric(phq9_data$`phq response`)
## Warning: NAs introduced by coercion
# Calculate mean and standard deviation for PHQ-9 scores
phq9_summary <- phq9_data %>%
group_by(id) %>%
summarise(mean_phq9 = mean(`phq response`, na.rm = TRUE),
sd_phq9 = sd(`phq response`, na.rm = TRUE))
# Show the summary statistics
phq9_summary
## # A tibble: 4 × 3
## id mean_phq9 sd_phq9
## <dbl> <dbl> <dbl>
## 1 202 0.875 1.02
## 2 207 0.588 0.939
## 3 211 1.72 1.18
## 4 242 1.33 1.08
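For context (not part of the original report), GAD-7 and PHQ-9 are conventionally scored as per-respondent totals: GAD-7 ranges 0–21 with cut-offs at 5, 10, and 15 for mild, moderate, and severe anxiety, and PHQ-9 ranges 0–27 with cut-offs at 5, 10, 15, and 20. A minimal sketch of that scoring, reusing the numeric response columns created above:
# Sketch: conventional total scores with standard severity bands (assumes the
# instruction row with an NA response is the only non-item row per respondent)
gad7_totals <- gad7_data %>%
  group_by(id) %>%
  summarise(total_gad7 = sum(`gad response`, na.rm = TRUE)) %>%
  mutate(gad7_severity = cut(total_gad7,
                             breaks = c(-Inf, 4, 9, 14, Inf),
                             labels = c("minimal", "mild", "moderate", "severe")))
phq9_totals <- phq9_data %>%
  group_by(id) %>%
  summarise(total_phq9 = sum(`phq response`, na.rm = TRUE)) %>%
  mutate(phq9_severity = cut(total_phq9,
                             breaks = c(-Inf, 4, 9, 14, 19, Inf),
                             labels = c("minimal", "mild", "moderate",
                                        "moderately severe", "severe")))
gad7_totals
phq9_totals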
We will visualize the distribution of GAD-7 and PHQ-9 scores.
# Plot GAD-7 scores
ggplot(gad7_summary, aes(x = mean_gad7)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Distribution of GAD-7 Scores", x = "Mean GAD-7 Score", y = "Frequency")
# Plot PHQ-9 scores
ggplot(phq9_summary, aes(x = mean_phq9)) +
geom_histogram(binwidth = 1, fill = "red", color = "black") +
labs(title = "Distribution of PHQ-9 Scores", x = "Mean PHQ-9 Score", y = "Frequency")
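As a small follow-up (a sketch, not in the original report), the two per-respondent summaries can be joined on id to compare anxiety and depression levels side by side:
# Sketch: join the GAD-7 and PHQ-9 summaries on id to compare the two scales
scores_by_id <- gad7_summary %>%
  inner_join(phq9_summary, by = "id")
scores_by_id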
In this analysis, we tidied an initially untidy dataset containing GAD-7 (Generalized Anxiety Disorder) and PHQ-9 (Patient Health Questionnaire) responses. The resulting summaries give an overview of the respondents' mental health status and can be used to identify trends in anxiety and depression levels, informing further investigation or potential interventions.
This dataset tracks population by country across various years. To ensure the accuracy of the analysis, we’ll clean the data, convert necessary columns to numeric formats, and remove missing values.
# Read the world population dataset
population_data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/world_population.csv")
## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the column names to verify how they are formatted
colnames(population_data)
## [1] "Rank" "CCA3"
## [3] "Country/Territory" "Capital"
## [5] "Continent" "2022 Population"
## [7] "2020 Population" "2015 Population"
## [9] "2010 Population" "2000 Population"
## [11] "1990 Population" "1980 Population"
## [13] "1970 Population" "Area (km²)"
## [15] "Density (per km²)" "Growth Rate"
## [17] "World Population Percentage"
# Check for missing values
sum(is.na(population_data))
## [1] 0
# Convert relevant columns to numeric
population_data <- population_data %>%
mutate_at(vars(contains("Population"), `Density (per km²)`, `Growth Rate`, `World Population Percentage`), as.numeric)
# Ensure data is clean and formatted
head(population_data)
## # A tibble: 6 × 17
## Rank CCA3 `Country/Territory` Capital Continent `2022 Population`
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## # ℹ 11 more variables: `2020 Population` <dbl>, `2015 Population` <dbl>,
## # `2010 Population` <dbl>, `2000 Population` <dbl>, `1990 Population` <dbl>,
## # `1980 Population` <dbl>, `1970 Population` <dbl>, `Area (km²)` <dbl>,
## # `Density (per km²)` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>
We’ll analyze how population growth varies across continents by calculating the total population and average growth rate in 2022.
# Summarize population by continent for the year 2022
population_by_continent <- population_data %>%
group_by(Continent) %>%
summarise(total_population_2022 = sum(`2022 Population`, na.rm = TRUE),
mean_growth_rate = mean(`Growth Rate`, na.rm = TRUE))
# Show the summary
population_by_continent
## # A tibble: 6 × 3
## Continent total_population_2022 mean_growth_rate
## <chr> <dbl> <dbl>
## 1 Africa 1426730932 1.02
## 2 Asia 4721383274 1.01
## 3 Europe 743147538 1.00
## 4 North America 600296136 1.00
## 5 Oceania 45038554 1.01
## 6 South America 436816608 1.01
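The Growth Rate column appears to be a year-over-year multiplier rather than a percentage (values cluster around 1.0). Under that assumption, a quick sketch converts the continent means into percentage growth for readability:
# Sketch (assumes Growth Rate is a multiplier, e.g. 1.02 ~ 2% annual growth)
population_by_continent %>%
  mutate(mean_growth_pct = (mean_growth_rate - 1) * 100) %>%
  arrange(desc(mean_growth_pct))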
Next, we identify the top 10 countries with the highest population growth rates in 2022.
# Identify top 10 countries by growth rate
top_growth_countries <- population_data %>%
arrange(desc(`Growth Rate`)) %>%
select(`Country/Territory`, `Growth Rate`) %>%
head(10)
# Display the top growth countries
top_growth_countries
## # A tibble: 10 × 2
## `Country/Territory` `Growth Rate`
## <chr> <dbl>
## 1 Moldova 1.07
## 2 Poland 1.04
## 3 Niger 1.04
## 4 Syria 1.04
## 5 Slovakia 1.04
## 6 DR Congo 1.03
## 7 Mayotte 1.03
## 8 Chad 1.03
## 9 Angola 1.03
## 10 Mali 1.03
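The Growth Rate column only reflects recent growth. As a rough longer-run complement (a sketch using the historical columns already in the dataset), we can rank countries by their relative population change since 1970:
# Sketch: relative population change from 1970 to 2022 as a long-run growth measure
population_data %>%
  mutate(pct_change_1970_2022 =
           (`2022 Population` - `1970 Population`) / `1970 Population` * 100) %>%
  arrange(desc(pct_change_1970_2022)) %>%
  select(`Country/Territory`, `1970 Population`, `2022 Population`,
         pct_change_1970_2022) %>%
  head(10)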
Next, we will visualize key aspects of population growth and density, starting with a histogram showing the distribution of population density across countries.
# Plot population density across countries
ggplot(population_data, aes(x = `Density (per km²)`)) +
geom_histogram(binwidth = 1000, fill = "skyblue", color = "black") +
labs(title = "Distribution of Population Density", x = "Population Density (per km²)", y = "Number of Countries")
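Population density is heavily right-skewed, so with a binwidth of 1000 most countries land in the first few bars. An optional variant (not in the original report) puts the x-axis on a log10 scale to spread the distribution out:
# Sketch: same histogram on a log10 x-axis to handle the right skew
ggplot(population_data, aes(x = `Density (per km²)`)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  scale_x_log10() +
  labs(title = "Distribution of Population Density (log scale)",
       x = "Population Density (per km², log10)", y = "Number of Countries")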
##### Population Growth vs. Density
A scatter plot to examine the relationship between population growth rate and population density.
# Scatter plot of growth rate vs population density
ggplot(population_data, aes(x = `Density (per km²)`, y = `Growth Rate`)) +
geom_point(alpha = 0.5) +
labs(title = "Population Growth Rate vs. Density", x = "Population Density (per km²)", y = "Growth Rate")
#### Conclusion
This dataset analysis provided insights into global population growth, population density distribution, and country rankings in terms of growth rates. We observed that certain continents, particularly Africa and Asia, showed the highest population growth rates, reflecting ongoing demographic expansion. The top 10 countries by population growth highlight regions experiencing rapid population increases, possibly driven by factors such as high birth rates and improving life expectancy. Population density varies significantly across countries, with some small, densely populated nations and vast countries with low population density.
This dataset, sourced from Wellcome Trust-supported publications for the 2012–2013 period, includes details such as the article processing charge (APC) paid, journal titles, publishers, and associated article information. Before proceeding with the analysis, we'll clean the dataset to make it usable: convert the COST (£) column to numeric by removing any non-numeric characters, and handle missing values by filtering out rows without valid cost data.
# Read the Wellcome Trust APC dataset
data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/University_returns_for_figshare_FINAL.csv")
## Rows: 2127 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): PMID/PMCID, Publisher, Journal title, Article title
## dbl (1): COST (£) charged to Wellcome (inc VAT when charged)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean the COST column (convert to numeric)
data <- data %>%
mutate(COST = as.numeric(gsub("[^0-9.]", "", `COST (£) charged to Wellcome (inc VAT when charged)`)))
# Remove rows with missing or invalid COST
data <- data %>%
filter(!is.na(COST))
# Preview cleaned data
head(data)
## # A tibble: 6 × 6
## `PMID/PMCID` Publisher `Journal title` `Article title` COST (£) charged to …¹
## <chr> <chr> <chr> <chr> <dbl>
## 1 "PMC3378987\… Elsevier Academy of Nut… Parent support… 2380.
## 2 "PMCID: PMC3… ACS (Ame… ACS Chemical B… A Novel Allost… 1295.
## 3 "PMCID: PMC3… ACS (Ame… ACS Chemical B… Chemical prote… 1295.
## 4 <NA> American… ACS Chemical B… Discovery of β… 947.
## 5 "PMID: 24015… American… ACS Chemical B… Discovery of a… 1268.
## 6 ": PMC380533… American… ACS Chemical B… Synthesis of a… 2287.
## # ℹ abbreviated name: ¹`COST (£) charged to Wellcome (inc VAT when charged)`
## # ℹ 1 more variable: COST <dbl>
We calculate summary statistics for the APC costs, including the minimum, maximum, median, and mean.
# Summary statistics of the COST
summary(data$COST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.94 1261.17 1850.00 1826.42 2301.26 13200.00
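The maximum (£13,200) sits far above the third quartile (£2,301), indicating a long right tail. A quick sketch (not in the original report) lists the most expensive articles to see what drives it:
# Sketch: the five highest APCs in the cleaned data
data %>%
  arrange(desc(COST)) %>%
  select(Publisher, `Journal title`, COST) %>%
  head(5)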
To better understand the distribution of APC costs, we’ll visualize the data using a histogram.
# Plot cost distribution
ggplot(data, aes(x = COST)) +
geom_histogram(binwidth = 100, fill = "blue", color = "black") +
labs(title = "Distribution of Article Costs", x = "Cost (£)", y = "Frequency")
##### Publisher-wise Cost Analysis
Now we group the data by publisher to find out which publishers incurred the highest costs and what the average APC per publisher was.
# Group by Publisher and summarize costs
publisher_summary <- data %>%
group_by(Publisher) %>%
summarize(
total_cost = sum(COST, na.rm = TRUE),
avg_cost = mean(COST, na.rm = TRUE),
count = n()
) %>%
arrange(desc(total_cost))
# Display the summary
print(publisher_summary)
## # A tibble: 280 × 4
## Publisher total_cost avg_cost count
## <chr> <dbl> <dbl> <int>
## 1 Elsevier 955231. 2418. 395
## 2 Public Library of Science 318307. 1145. 278
## 3 Wiley 286597. 1936. 148
## 4 Springer 166719. 2033. 82
## 5 Oxford University Press 143709. 1796. 80
## 6 Nature Publishing Group 133416. 2839. 47
## 7 Wiley-Blackwell 121651. 2172. 56
## 8 OUP 107935. 1927. 56
## 9 BioMed Central 67452. 1377. 49
## 10 ASBMB 66392. 1443. 46
## # ℹ 270 more rows
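Note that the summary treats name variants such as "Oxford University Press" and "OUP", or "Wiley" and "Wiley-Blackwell", as separate publishers. A hedged sketch (assuming these variants refer to the same publishers) consolidates a few obvious duplicates before re-summarising:
# Sketch: merge a few obvious publisher name variants, then re-summarise
publisher_summary_clean <- data %>%
  mutate(Publisher = case_when(
    Publisher %in% c("OUP", "Oxford University Press") ~ "Oxford University Press",
    grepl("^Wiley", Publisher) ~ "Wiley",
    TRUE ~ Publisher
  )) %>%
  group_by(Publisher) %>%
  summarize(total_cost = sum(COST, na.rm = TRUE),
            avg_cost = mean(COST, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(total_cost))
head(publisher_summary_clean)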
The distribution of APCs shows that most fall in the £1–£5,000 range. Elsevier accounts for the highest total APC costs, followed by the Public Library of Science, while the highest average cost per article was for publishing with Macmillan. These insights can help universities and funding agencies make more informed decisions about their academic publishing spend.