In this project, we’re going to work through three datasets, cleaning and analyzing them to uncover interesting insights. We’ll be focusing on:
- Fake Untidy Data – tidying up messy mental health survey data and seeing what trends we can find.
- World Population Data – looking at how populations have changed over time and which countries have seen the fastest growth.
- University Returns Data – examining the article processing charges (APCs) paid by the Wellcome Trust to see which publishers account for the most publication spending.
First up is the untidy dataset. This report analyzes data from mental health surveys that include the 7-item Generalized Anxiety Disorder scale (GAD-7) and the 9-item Patient Health Questionnaire (PHQ-9). The dataset is untidy and needs cleaning before analysis.
# Read the untidy dataset
data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/Fake_untidy_data.csv")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## Rows: 87 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): ...2, ...3, GAD 7 Questions, GAD 7, gad response, PHQ 9 Questions, ...
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Keep only rows that contain actual GAD-7 or PHQ-9 data
gad_phq_data <- data %>%
filter(!is.na(`GAD 7 Questions`) | !is.na(`PHQ 9 Questions`))
# Keep only the relevant columns
tidy_data <- gad_phq_data %>%
select(id, `GAD 7 Questions`, `gad response`, `PHQ 9 Questions`, `phq response`)
# Fill forward the 'id' column for missing IDs
tidy_data <- tidy_data %>%
fill(id)
# Show the cleaned data
head(tidy_data)
## # A tibble: 6 × 5
## id `GAD 7 Questions` `gad response` `PHQ 9 Questions` `phq response`
## <dbl> <chr> <chr> <chr> <chr>
## 1 202 how often have they bee… <NA> Over the last 2 … 2
## 2 202 Feeling nervous, anxiou… 2 Little interest … 0
## 3 202 Not being able to stop … 0 Feeling down, de… 0
## 4 202 Worrying too much about… 2 Trouble falling … 0
## 5 202 Trouble relaxing 3 Feeling tired or… 2
## 6 202 Being so restless that … 2 Poor appetite or… 3
We will now separate the GAD-7 and PHQ-9 responses for further analysis.
# Separate GAD-7 and PHQ-9 data
gad7_data <- tidy_data %>%
filter(!is.na(`GAD 7 Questions`)) %>%
select(id, `GAD 7 Questions`, `gad response`)
phq9_data <- tidy_data %>%
filter(!is.na(`PHQ 9 Questions`)) %>%
select(id, `PHQ 9 Questions`, `phq response`)
# Display the separated data
head(gad7_data)
## # A tibble: 6 × 3
## id `GAD 7 Questions` `gad response`
## <dbl> <chr> <chr>
## 1 202 how often have they been bothered by the following over … <NA>
## 2 202 Feeling nervous, anxious, or on edge 2
## 3 202 Not being able to stop or control worrying 0
## 4 202 Worrying too much about different things 2
## 5 202 Trouble relaxing 3
## 6 202 Being so restless that it's hard to sit still 2
head(phq9_data)
## # A tibble: 6 × 3
## id `PHQ 9 Questions` `phq response`
## <dbl> <chr> <chr>
## 1 202 Over the last 2 weeks, how often have you been bothered … 2
## 2 202 Little interest or pleasure in doing things 0
## 3 202 Feeling down, depressed, or hopeless 0
## 4 202 Trouble falling or staying asleep, or sleeping too much 0
## 5 202 Feeling tired or having little energy 2
## 6 202 Poor appetite or overeating 3
Let’s calculate summary statistics for the GAD-7 scores.
# Convert responses to numeric
gad7_data$`gad response` <- as.numeric(gad7_data$`gad response`)
## Warning: NAs introduced by coercion
# Calculate mean and standard deviation for GAD-7 scores
gad7_summary <- gad7_data %>%
group_by(id) %>%
summarise(mean_gad7 = mean(`gad response`, na.rm = TRUE),
sd_gad7 = sd(`gad response`, na.rm = TRUE))
# Show the summary statistics
gad7_summary
## # A tibble: 4 × 3
## id mean_gad7 sd_gad7
## <dbl> <dbl> <dbl>
## 1 202 1.64 1.08
## 2 207 2 1
## 3 211 1.2 1.32
## 4 242 1.5 0.760
Similarly, we will calculate the summary statistics for the PHQ-9 scores.
# Convert responses to numeric
phq9_data$`phq response` <- as.numeric(phq9_data$`phq response`)
## Warning: NAs introduced by coercion
# Calculate mean and standard deviation for PHQ-9 scores
phq9_summary <- phq9_data %>%
group_by(id) %>%
summarise(mean_phq9 = mean(`phq response`, na.rm = TRUE),
sd_phq9 = sd(`phq response`, na.rm = TRUE))
# Show the summary statistics
phq9_summary
## # A tibble: 4 × 3
## id mean_phq9 sd_phq9
## <dbl> <dbl> <dbl>
## 1 202 0.875 1.02
## 2 207 0.588 0.939
## 3 211 1.72 1.18
## 4 242 1.33 1.08
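For context (not part of the original report), GAD-7 and PHQ-9 are conventionally scored as per-respondent totals: GAD-7 ranges 0–21 with cut-offs at 5, 10, and 15 for mild, moderate, and severe anxiety, and PHQ-9 ranges 0–27 with cut-offs at 5, 10, 15, and 20. A minimal sketch of that scoring, reusing the numeric response columns created above:
# Sketch: conventional total scores with standard severity bands (assumes the
# instruction row with an NA response is the only non-item row per respondent)
gad7_totals <- gad7_data %>%
  group_by(id) %>%
  summarise(total_gad7 = sum(`gad response`, na.rm = TRUE)) %>%
  mutate(gad7_severity = cut(total_gad7,
                             breaks = c(-Inf, 4, 9, 14, Inf),
                             labels = c("minimal", "mild", "moderate", "severe")))
phq9_totals <- phq9_data %>%
  group_by(id) %>%
  summarise(total_phq9 = sum(`phq response`, na.rm = TRUE)) %>%
  mutate(phq9_severity = cut(total_phq9,
                             breaks = c(-Inf, 4, 9, 14, 19, Inf),
                             labels = c("minimal", "mild", "moderate",
                                        "moderately severe", "severe")))
gad7_totals
phq9_totals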
We will visualize the distribution of GAD-7 and PHQ-9 scores.
# Plot GAD-7 scores
ggplot(gad7_summary, aes(x = mean_gad7)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Distribution of GAD-7 Scores", x = "Mean GAD-7 Score", y = "Frequency")
# Plot PHQ-9 scores
ggplot(phq9_summary, aes(x = mean_phq9)) +
geom_histogram(binwidth = 1, fill = "red", color = "black") +
labs(title = "Distribution of PHQ-9 Scores", x = "Mean PHQ-9 Score", y = "Frequency")
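As a small follow-up (a sketch, not in the original report), the two per-respondent summaries can be joined on id to compare anxiety and depression levels side by side:
# Sketch: join the GAD-7 and PHQ-9 summaries on id to compare the two scales
scores_by_id <- gad7_summary %>%
  inner_join(phq9_summary, by = "id")
scores_by_id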
In this analysis, we tidied an initially untidy dataset containing GAD-7 (Generalized Anxiety Disorder) and PHQ-9 (Patient Health Questionnaire) responses. The resulting summaries give an overview of the respondents' mental health status and can be used to identify trends in anxiety and depression levels, informing further investigation or potential interventions.
This dataset tracks population by country across various years. To ensure the accuracy of the analysis, we’ll clean the data, convert necessary columns to numeric formats, and remove missing values.
# Read the world population dataset
population_data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/world_population.csv")
## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check the column names to verify how they are formatted
colnames(population_data)
## [1] "Rank" "CCA3"
## [3] "Country/Territory" "Capital"
## [5] "Continent" "2022 Population"
## [7] "2020 Population" "2015 Population"
## [9] "2010 Population" "2000 Population"
## [11] "1990 Population" "1980 Population"
## [13] "1970 Population" "Area (km²)"
## [15] "Density (per km²)" "Growth Rate"
## [17] "World Population Percentage"
# Check for missing values
sum(is.na(population_data))
## [1] 0
# Convert relevant columns to numeric
population_data <- population_data %>%
mutate_at(vars(contains("Population"), `Density (per km²)`, `Growth Rate`, `World Population Percentage`), as.numeric)
# Ensure data is clean and formatted
head(population_data)
## # A tibble: 6 × 17
## Rank CCA3 `Country/Territory` Capital Continent `2022 Population`
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## # ℹ 11 more variables: `2020 Population` <dbl>, `2015 Population` <dbl>,
## # `2010 Population` <dbl>, `2000 Population` <dbl>, `1990 Population` <dbl>,
## # `1980 Population` <dbl>, `1970 Population` <dbl>, `Area (km²)` <dbl>,
## # `Density (per km²)` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>
We’ll analyze how population growth varies across continents by calculating the total population and average growth rate in 2022.
# Summarize population by continent for the year 2022
population_by_continent <- population_data %>%
group_by(Continent) %>%
summarise(total_population_2022 = sum(`2022 Population`, na.rm = TRUE),
mean_growth_rate = mean(`Growth Rate`, na.rm = TRUE))
# Show the summary
population_by_continent
## # A tibble: 6 × 3
## Continent total_population_2022 mean_growth_rate
## <chr> <dbl> <dbl>
## 1 Africa 1426730932 1.02
## 2 Asia 4721383274 1.01
## 3 Europe 743147538 1.00
## 4 North America 600296136 1.00
## 5 Oceania 45038554 1.01
## 6 South America 436816608 1.01
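The Growth Rate column appears to be a year-over-year multiplier rather than a percentage (values cluster around 1.0). Under that assumption, a quick sketch converts the continent means into percentage growth for readability:
# Sketch (assumes Growth Rate is a multiplier, e.g. 1.02 ~ 2% annual growth)
population_by_continent %>%
  mutate(mean_growth_pct = (mean_growth_rate - 1) * 100) %>%
  arrange(desc(mean_growth_pct))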
Next, we identify the top 10 countries with the highest population growth rates in 2022.
# Identify top 10 countries by growth rate
top_growth_countries <- population_data %>%
arrange(desc(`Growth Rate`)) %>%
select(`Country/Territory`, `Growth Rate`) %>%
head(10)
# Display the top growth countries
top_growth_countries
## # A tibble: 10 × 2
## `Country/Territory` `Growth Rate`
## <chr> <dbl>
## 1 Moldova 1.07
## 2 Poland 1.04
## 3 Niger 1.04
## 4 Syria 1.04
## 5 Slovakia 1.04
## 6 DR Congo 1.03
## 7 Mayotte 1.03
## 8 Chad 1.03
## 9 Angola 1.03
## 10 Mali 1.03
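The Growth Rate column only reflects recent growth. As a rough longer-run complement (a sketch using the historical columns already in the dataset), we can rank countries by their relative population change since 1970:
# Sketch: relative population change from 1970 to 2022 as a long-run growth measure
population_data %>%
  mutate(pct_change_1970_2022 =
           (`2022 Population` - `1970 Population`) / `1970 Population` * 100) %>%
  arrange(desc(pct_change_1970_2022)) %>%
  select(`Country/Territory`, `1970 Population`, `2022 Population`,
         pct_change_1970_2022) %>%
  head(10)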
Next, we will visualize key aspects of population growth and density, starting with a histogram showing the distribution of population density across countries.
# Plot population density across countries
ggplot(population_data, aes(x = `Density (per km²)`)) +
geom_histogram(binwidth = 1000, fill = "skyblue", color = "black") +
labs(title = "Distribution of Population Density", x = "Population Density (per km²)", y = "Number of Countries")
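Population density is heavily right-skewed, so with a binwidth of 1000 most countries land in the first few bars. An optional variant (not in the original report) puts the x-axis on a log10 scale to spread the distribution out:
# Sketch: same histogram on a log10 x-axis to handle the right skew
ggplot(population_data, aes(x = `Density (per km²)`)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  scale_x_log10() +
  labs(title = "Distribution of Population Density (log scale)",
       x = "Population Density (per km², log10)", y = "Number of Countries")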
##### Population Growth vs. Density
A scatter plot to examine the relationship between population growth rate and population density.
# Scatter plot of growth rate vs population density
ggplot(population_data, aes(x = `Density (per km²)`, y = `Growth Rate`)) +
geom_point(alpha = 0.5) +
labs(title = "Population Growth Rate vs. Density", x = "Population Density (per km²)", y = "Growth Rate")
#### Conclusion
This dataset analysis provided insights into global population growth, population density distribution, and country rankings in terms of growth rates. We observed that certain continents, particularly Africa and Asia, showed the highest population growth rates, reflecting ongoing demographic expansion. The top 10 countries by population growth highlight regions experiencing rapid population increases, possibly driven by factors such as high birth rates and improving life expectancy. Population density varies significantly across countries, with some small, densely populated nations and vast countries with low population density.
This dataset, sourced from Wellcome Trust-supported publications for the 2012–2013 period, includes details such as the article processing charge (APC) paid, journal titles, publishers, and associated article information. Before proceeding with the analysis, we'll clean the dataset to make it usable: convert the COST (£) column to numeric by removing any non-numeric characters, and handle missing values by filtering out rows without valid cost data.
# Read the Wellcome Trust APC dataset
data <- read_csv("https://raw.githubusercontent.com/simonchy/DATA607/refs/heads/main/Week%206/University_returns_for_figshare_FINAL.csv")
## Rows: 2127 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): PMID/PMCID, Publisher, Journal title, Article title
## dbl (1): COST (£) charged to Wellcome (inc VAT when charged)
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean the COST column (convert to numeric)
data <- data %>%
mutate(COST = as.numeric(gsub("[^0-9.]", "", `COST (£) charged to Wellcome (inc VAT when charged)`)))
# Remove rows with missing or invalid COST
data <- data %>%
filter(!is.na(COST))
# Preview cleaned data
head(data)
## # A tibble: 6 × 6
## `PMID/PMCID` Publisher `Journal title` `Article title` COST (£) charged to …¹
## <chr> <chr> <chr> <chr> <dbl>
## 1 "PMC3378987\… Elsevier Academy of Nut… Parent support… 2380.
## 2 "PMCID: PMC3… ACS (Ame… ACS Chemical B… A Novel Allost… 1295.
## 3 "PMCID: PMC3… ACS (Ame… ACS Chemical B… Chemical prote… 1295.
## 4 <NA> American… ACS Chemical B… Discovery of β… 947.
## 5 "PMID: 24015… American… ACS Chemical B… Discovery of a… 1268.
## 6 ": PMC380533… American… ACS Chemical B… Synthesis of a… 2287.
## # ℹ abbreviated name: ¹`COST (£) charged to Wellcome (inc VAT when charged)`
## # ℹ 1 more variable: COST <dbl>
We calculate summary statistics for the APC costs, including the minimum, maximum, median, and mean.
# Summary statistics of the COST
summary(data$COST)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.94 1261.17 1850.00 1826.42 2301.26 13200.00
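The maximum (£13,200) sits far above the third quartile (£2,301), indicating a long right tail. A quick sketch (not in the original report) lists the most expensive articles to see what drives it:
# Sketch: the five highest APCs in the cleaned data
data %>%
  arrange(desc(COST)) %>%
  select(Publisher, `Journal title`, COST) %>%
  head(5)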
To better understand the distribution of APC costs, we’ll visualize the data using a histogram.
# Plot cost distribution
ggplot(data, aes(x = COST)) +
geom_histogram(binwidth = 100, fill = "blue", color = "black") +
labs(title = "Distribution of Article Costs", x = "Cost (£)", y = "Frequency")
##### Publisher-wise Cost Analysis
Now we group the data by publisher to find out which publishers incurred the highest costs and what the average APC per publisher was.
# Group by Publisher and summarize costs
publisher_summary <- data %>%
group_by(Publisher) %>%
summarize(
total_cost = sum(COST, na.rm = TRUE),
avg_cost = mean(COST, na.rm = TRUE),
count = n()
) %>%
arrange(desc(total_cost))
# Display the summary
print(publisher_summary)
## # A tibble: 280 × 4
## Publisher total_cost avg_cost count
## <chr> <dbl> <dbl> <int>
## 1 Elsevier 955231. 2418. 395
## 2 Public Library of Science 318307. 1145. 278
## 3 Wiley 286597. 1936. 148
## 4 Springer 166719. 2033. 82
## 5 Oxford University Press 143709. 1796. 80
## 6 Nature Publishing Group 133416. 2839. 47
## 7 Wiley-Blackwell 121651. 2172. 56
## 8 OUP 107935. 1927. 56
## 9 BioMed Central 67452. 1377. 49
## 10 ASBMB 66392. 1443. 46
## # ℹ 270 more rows
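Note that the summary treats name variants such as "Oxford University Press" and "OUP", or "Wiley" and "Wiley-Blackwell", as separate publishers. A hedged sketch (assuming these variants refer to the same publishers) consolidates a few obvious duplicates before re-summarising:
# Sketch: merge a few obvious publisher name variants, then re-summarise
publisher_summary_clean <- data %>%
  mutate(Publisher = case_when(
    Publisher %in% c("OUP", "Oxford University Press") ~ "Oxford University Press",
    grepl("^Wiley", Publisher) ~ "Wiley",
    TRUE ~ Publisher
  )) %>%
  group_by(Publisher) %>%
  summarize(total_cost = sum(COST, na.rm = TRUE),
            avg_cost = mean(COST, na.rm = TRUE),
            count = n()) %>%
  arrange(desc(total_cost))
head(publisher_summary_clean)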
The distribution of APCs shows that most fall in the £1–£5,000 range. Elsevier accounts for the highest total APC costs, followed by the Public Library of Science, while the highest average cost per article was for publishing with Macmillan. These insights can help universities and funding agencies make more informed decisions about their academic publishing spend.