Loading the Dataset

data <- read.csv("world-data-2023.csv")
library("ggplot2")

Converting char values to num values

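The Co2.Emissions column is read in as character because its values contain commas, so I had to strip the commas and convert it to numeric before the histogram could plot it.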

data$Co2.Emissions <- gsub(",", "", data$Co2.Emissions)
data$Co2.Emissions <- as.numeric(data$Co2.Emissions)
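A quick sanity check after a conversion like this is to see which entries turned into NA that were not NA before. The sketch below uses made-up strings rather than the real file, and the names raw, converted, and newly_missing are only for illustration.

```r
# Made-up stand-in for the raw character column
raw <- c("8,672", "4,536", "", "150,006", NA)

cleaned <- gsub(",", "", raw)        # drop the commas
converted <- as.numeric(cleaned)     # empty strings and NA both end up as NA

# Positions that are NA after conversion but were not NA in the raw data
newly_missing <- which(is.na(converted) & !is.na(raw))
raw[newly_missing]                   # the original text of those entries
```

If newly_missing is longer than expected, the raw strings at those positions show what else needs cleaning.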

Creating a histogram!

I did it both ways (base R and ggplot2) so Dr. Gunay wouldn’t yell at me.

This histogram shows the distribution of CO2 Emissions per capita. One value is extremely high, while most of the countries are packed into the lowest bin, so the plot says very little about them. I may need to experiment with bin numbers here and, if that fails, then perhaps choose a different column to glean more information from.

hist(data$Co2.Emissions, main = "Distribution of CO2 Emissions", xlab = "CO2 Emissions (tons per capita)", ylab = "Frequency")
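Besides changing the bin count, one option for data this skewed is to plot the values on a log scale. This is only a sketch on made-up numbers, assuming the real column has no zero or negative values that matter; emissions and positive are illustrative names.

```r
# Made-up, strongly right-skewed values standing in for data$Co2.Emissions
emissions <- c(12, 85, 140, 520, 900, 3400, 8700, 25000, 410000, 9900000, NA)

# Keep finite, positive values only; log10() is undefined at 0 and below
positive <- emissions[is.finite(emissions) & emissions > 0]

# Histogram of the log-transformed values; breaks is only a suggestion to hist()
h <- hist(log10(positive),
          breaks = 10,
          main = "Distribution of CO2 Emissions (log scale)",
          xlab = "log10(CO2 Emissions)",
          ylab = "Frequency")
```

On a log scale each step along the x axis is a factor of ten, so one huge value no longer squeezes everything else into a single bin.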

## Histogram with ggplot

ggplot(data, aes(x = Co2.Emissions)) +
  geom_histogram(bins = 20, fill = "blue", color = "black") +  # Adjust number of bins if necessary
  ggtitle("Distribution of CO2 Emissions") +
  xlab("CO2 Emissions (tons per capita)") +
  ylab("Frequency") +
  theme_minimal()
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

## CO2 Emissions by Country with a Bar Graph

This bar chart displays the top 20 countries with the highest CO2 Emissions. I first had to create a top_countries variable that held the top 20 countries before creating the chart; it was much too cluttered beforehand. This gives us a clearer view than the histogram did.

top_countries <- data[order(-data$Co2.Emissions), ][1:20, ]
ggplot(top_countries, aes(x = reorder(Country, Co2.Emissions), y = Co2.Emissions)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Top 20 Countries by CO2 Emissions",
       x = "Country",
       y = "CO2 Emissions (tons per capita)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
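The top-N selection can be checked on a tiny example. This sketch uses a made-up data frame (toy and top3 are illustrative names) to show that order() puts missing values last, and that head() is a slightly safer way to take the first rows than indexing with 1:n.

```r
# Made-up data frame with one missing emissions value
toy <- data.frame(
  Country = c("A", "B", "C", "D", "E"),
  Co2.Emissions = c(50, NA, 900, 10, 300),
  stringsAsFactors = FALSE
)

# Sort descending; NA goes to the end by default, so it never reaches the top rows
sorted <- toy[order(toy$Co2.Emissions, decreasing = TRUE), ]
top3 <- head(sorted, 3)
top3$Country

# Asking for more rows than exist: 1:n indexing pads with all-NA rows, head() does not
nrow(toy[1:7, ])    # 7
nrow(head(toy, 7))  # 5
```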

## Investigating Birth Rates and Infant Mortality Rates

Calculating the mean birth rate and mean infant mortality rate. I would like to compare birth rate and GDP, but my GDP column contains both $ and , characters, and I can’t seem to convert it to numeric without turning every value in the GDP column into NA :-(
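A likely cause of the all-NA result (an assumption, since the failing code isn't shown here) is that $ is a regular-expression anchor meaning "end of string", so gsub("$", "", x) removes nothing and as.numeric() then fails on every value. Putting both characters in a character class avoids that. The values below are made up, and gdp_raw, gdp_clean, and gdp_num are illustrative names.

```r
# Made-up strings formatted with a dollar sign and commas
gdp_raw <- c("$19,101,353,833", "$15,278,077,447", "", NA)

# Inside [ ], "$" is a literal character, so this strips both "$" and ","
gdp_clean <- gsub("[$,]", "", gdp_raw)
gdp_num <- as.numeric(gdp_clean)

# For comparison: an unescaped "$" only matches the end of the string
unchanged <- gsub("$", "", gdp_raw[1])
identical(unchanged, gdp_raw[1])   # TRUE, nothing was removed
```

Assuming the column is called GDP after read.csv(), the same two steps would be applied to data$GDP.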

There is a positive correlation between the birth rate and the infant mortality rate: countries with higher birth rates tend to have higher infant mortality rates. The correlation alone does not explain why, and other factors such as GDP, minimum wage, maternal mortality, and healthcare could contribute to this, so those columns need further investigation.

mean_birth_rate <- mean(data$Birth.Rate, na.rm = TRUE)
mean_infant_mortality <- mean(data$Infant.mortality, na.rm = TRUE)
correlation_birth_infant <- cor(data$Birth.Rate, data$Infant.mortality, use = "complete.obs")
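cor() returns only the coefficient. If a confidence interval and p-value are wanted as well, cor.test() is one way to get them. The sketch below runs on made-up numbers, not the real columns; birth_rate and infant_mortality are stand-in vectors.

```r
# Made-up values standing in for data$Birth.Rate and data$Infant.mortality
birth_rate <- c(10.1, 12.3, 18.5, 25.0, 31.2, 36.4, NA, 40.1)
infant_mortality <- c(3.2, 4.1, 12.0, 24.5, 38.0, 51.3, 20.0, NA)

# cor.test() uses only the complete pairs and reports r, a 95% CI, and a p-value
result <- cor.test(birth_rate, infant_mortality, method = "pearson")

result$estimate   # Pearson's r
result$conf.int   # 95% confidence interval for r
result$p.value    # p-value for the test that r is 0
```

The estimate matches what cor() gives with use = "complete.obs" on the same vectors.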

Statistical Investigation

First, dividing the dataset:

I calculate the median birth rate, then create two groups: countries at or below the median and countries above it.

median_birth_rate <- median(data$Birth.Rate, na.rm = TRUE)
# which() drops rows where Birth.Rate is NA instead of returning all-NA rows
low_birth_rate <- data[which(data$Birth.Rate <= median_birth_rate), ]
high_birth_rate <- data[which(data$Birth.Rate > median_birth_rate), ]
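One thing to watch with a split like this: subsetting with a logical vector that contains NA returns an all-NA row for every missing comparison. The toy example below (made-up numbers, illustrative names) compares plain logical indexing with wrapping the condition in which().

```r
# Made-up data with one missing birth rate
toy <- data.frame(
  Birth.Rate = c(10, 20, NA, 30, 40),
  Infant.mortality = c(3, 8, 15, 25, 40)
)
med <- median(toy$Birth.Rate, na.rm = TRUE)   # median of 10, 20, 30, 40

# Plain logical indexing keeps an all-NA row for the missing comparison
low_logical <- toy[toy$Birth.Rate <= med, ]

# which() returns only the positions where the condition is TRUE
low_which <- toy[which(toy$Birth.Rate <= med), ]

nrow(low_logical)   # 3
nrow(low_which)     # 2
```

The extra NA rows do not change a t-test on the groups, because t.test() drops missing values, but they do inflate row counts.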

Visualizing the Distributions

ggplot(data, aes(x = Infant.mortality, fill = factor(Birth.Rate > median_birth_rate))) +
  geom_histogram(alpha = 0.5, position = 'identity', bins = 30) +
  labs(title = "Distribution of Infant Mortality by Birth Rate Groups",
       x = "Infant Mortality Rate",
       fill = "Group") +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

Perform the T-test

# t.test() drops missing values itself; it has no na.rm argument
t_test_results <- t.test(low_birth_rate$Infant.mortality, high_birth_rate$Infant.mortality)
print(t_test_results)
## 
##  Welch Two Sample t-test
## 
## data:  low_birth_rate$Infant.mortality and high_birth_rate$Infant.mortality
## t = -13.127, df = 116.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31.20224 -23.02116
## sample estimates:
## mean of x mean of y 
##  7.780851 34.892553
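In the output above, the difference in means is 7.780851 - 34.892553, about -27.1, which sits in the middle of the reported confidence interval. The same pieces can be pulled out of the result object instead of being read off the printout. This sketch uses made-up samples; low, high, res, and mean_diff are illustrative names.

```r
# Made-up samples standing in for the two infant mortality groups
low <- c(3.1, 5.4, 7.8, 9.0, 6.2, NA)
high <- c(28.5, 35.1, 41.0, 30.2, 38.7)

# Welch test by default; missing values are dropped from each sample
res <- t.test(low, high)

# Difference in sample means (first group minus second group)
mean_diff <- unname(res$estimate[1] - res$estimate[2])

mean_diff      # estimated difference in means
res$conf.int   # 95% confidence interval for that difference
res$p.value    # p-value for the two-sided test
```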