R Notebook

install.packages("ggplot2")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

install.packages("dplyr")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

install.packages("GGally")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

Data Visualizations with the dataset infections

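The following lines create three numeric vectors of 29 values each: infections, ufo2010 and pop. They become the columns of the data frame built further down.

# Create the vectors that will become the data frame columns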
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
ufo2010 <- c(2, 6, 2, 59, 0, 1, 1, 0, 115, 0, 0, 0, 0, 0, 0, 0, 6, 4, 2, 7, 2, 9, 2, 29, 10, 169, 1, 40, 16)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

The following cell builds the data frame and draws a bar graph.

Observation: This bar graph compares two variables, Infections and UFO Sightings (2010), across 29 data points.

The two variables differ greatly in magnitude. Seven data points have infection counts between roughly 2,000 and 7,000, while UFO sightings never exceed 169, so the UFO bars are barely visible above the baseline of the chart.

The infection counts vary widely across the indices. Most data points remain below 1,000, but there are sharp spikes, the largest at indices 4, 9, 19, and 26.

The salmon-colored bars for UFO sightings are nearly flat across the entire X-axis, because the counts for this variable are small on this axis and lower than the infection counts at every data point.

df <- data.frame(infections, ufo2010, pop)

# Load necessary libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# --- 1. Bar Graph: Comparing Infections and UFO Sightings ---
ggplot(df, aes(x = seq_len(nrow(df)))) +
  geom_col(aes(y = infections, fill = "Infections"), position = "dodge") +
  geom_col(aes(y = ufo2010, fill = "UFO Sightings (2010)"), position = "dodge", alpha = 0.7) +
  scale_fill_manual("Variables", values = c("Infections" = "skyblue", "UFO Sightings (2010)" = "salmon")) +
  labs(x = "Data Point Index", y = "Count", title = "Comparison of Infections and UFO Sightings") +
  theme_minimal() +
  theme(legend.position = "top")
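
Because the two bar layers above are separate, `position = "dodge"` has nothing to dodge against within each layer, so the UFO bars are drawn on top of the infection bars. A minimal sketch of a side-by-side version follows, assuming the data is first reshaped to long format. The names `long_df`, `index`, `variable`, and `count` are illustrative and not part of the original notebook.

```r
library(ggplot2)

infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
ufo2010 <- c(2, 6, 2, 59, 0, 1, 1, 0, 115, 0, 0, 0, 0, 0, 0, 0, 6, 4, 2, 7, 2, 9, 2, 29, 10, 169, 1, 40, 16)

# Stack the two variables into one column so a single layer can dodge them
n <- length(infections)
long_df <- data.frame(
  index = rep(seq_len(n), times = 2),
  variable = rep(c("Infections", "UFO Sightings (2010)"), each = n),
  count = c(infections, ufo2010)
)

# One geom_col() layer: the two bars at each index are placed side by side
p <- ggplot(long_df, aes(x = index, y = count, fill = variable)) +
  geom_col(position = "dodge") +
  scale_fill_manual("Variables", values = c("Infections" = "skyblue", "UFO Sightings (2010)" = "salmon")) +
  labs(x = "Data Point Index", y = "Count", title = "Comparison of Infections and UFO Sightings") +
  theme_minimal() +
  theme(legend.position = "top")
p
```

The UFO bars remain very short in this version too, since the scale gap between the variables is unchanged.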

The following line chart plots Infections and Population across the 29 data points.

“Infections” is represented by a solid green line, while “Population” is shown as a dashed purple line.

Because the population values are so large (peaking over 1.5 million), the infection counts, which never exceed 7,000, appear as an almost flat green line at the very bottom of the graph.

Population varies enormously. Most data points represent smaller populations (under 100,000), but there are large spikes, particularly at index 26, which is the largest population in the set (1,554,720).

# --- 2. Line Chart: Trends in Infections and Population ---
ggplot(df, aes(x = seq_len(nrow(df)))) +
  geom_line(aes(y = infections, color = "Infections"), linewidth = 1) +
  geom_line(aes(y = pop, color = "Population"), linewidth = 1, linetype = "dashed") +
  scale_color_manual("Variables", values = c("Infections" = "green", "Population" = "purple")) +
  labs(x = "Data Point Index", y = "Count", title = "Trends in Infections and Population") +
  theme_minimal() +
  theme(legend.position = "top")
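
To put a number on the scale gap that flattens the infections line, the short base R sketch below compares the two maxima. The names `scale_ratio` and `axis_share` are illustrative only.

```r
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

# How many times larger the biggest population is than the biggest infection count
scale_ratio <- max(pop) / max(infections)

# Fraction of a linear y-axis (spanning 0 to max(pop)) that the infections line can reach
axis_share <- max(infections) / max(pop)

round(scale_ratio)
round(100 * axis_share, 2)  # as a percentage of the axis height
```

When the ratio runs into the hundreds, the smaller series occupies well under one percent of a shared linear axis.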

Below, I use a log scale to see the relative “shape” of the trends regardless of magnitude.

# --- IMPROVED Line Chart: Using a Log Scale ---
# Observation: Raw population and infection counts shouldn't share the same linear Y-axis.
ggplot(df, aes(x = seq_len(nrow(df)))) +
  geom_line(aes(y = infections, color = "Infections"), linewidth = 1) +
  geom_line(aes(y = pop, color = "Population"), linewidth = 1, linetype = "dashed") +
  scale_y_log10(labels = scales::comma) + # Log scale allows us to see both lines clearly
  scale_color_manual("Variables", values = c("Infections" = "#2ecc71", "Population" = "#9b59b6")) +
  labs(x = "Data Point Index", y = "Count (Log Scale)", 
       title = "Relative Trends: Infections vs Population") +
  theme_minimal()

This scatter plot explores the potential correlation between Population size and the Number of Infections.

Each blue dot represents a single observation (a specific location or entity).

The vast majority of data points are clustered in the bottom-left corner. This indicates that most observations in the dataset involve relatively small populations (under 250,000) with low infection counts (under 1,000).

The relationship is positive but not strictly linear: the highest infection count (6,781) occurs at a population of only 19,005.

# --- 3. Scatter Plot: Relationship between Population and Infections ---
ggplot(df, aes(x = pop, y = infections)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(x = "Population", y = "Number of Infections", title = "Relationship between Population and Number of Infections") +
  theme_minimal()

# Observation: This scatter plot explores the relationship between population size and the number
# of infections. There doesn't appear to be a strong linear correlation. While some high-population
# areas have high infection counts, this is not consistently the case.
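
One way to back this visual impression with numbers is to compute correlation coefficients. The sketch below uses base R's `cor()` with both the Pearson (linear) and Spearman (rank-based) methods. Comparing the two is a suggestion here, not something the original notebook does.

```r
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

# Pearson measures linear association and is sensitive to extreme values
pearson <- cor(pop, infections, method = "pearson")

# Spearman works on ranks, so a few very large observations carry less weight
spearman <- cor(pop, infections, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 2)
```

A noticeable gap between the two coefficients would hint that a handful of large observations is driving the linear estimate.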

Below, I use a log-log plot, which spreads out the points bunched in the corner and makes correlations in skewed data easier to see.

# --- IMPROVED Scatter Plot: Log-Log Relationship ---
# Observation: The original scatter plot has most points bunched in the corner.
ggplot(df, aes(x = pop, y = infections)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "dotted") + # Added trend line
  scale_x_log10(labels = scales::comma) + 
  scale_y_log10(labels = scales::comma) +
  labs(x = "Population (Log)", y = "Infections (Log)", 
       title = "Correlation: Population vs. Infections",
       subtitle = "Logarithmic scales help reveal patterns in skewed data") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'
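
Because ggplot2 applies scale transformations before computing statistics, the dotted trend line above is a linear model fitted to the log10-transformed values. A minimal base R sketch of the equivalent fit is shown below. The name `log_fit` is illustrative.

```r
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

# Fit a straight line in log-log space; both vectors are strictly positive,
# so log10() is defined for every observation
log_fit <- lm(log10(infections) ~ log10(pop))

# The slope is the change in log10(infections) per unit change in log10(pop)
coef(log_fit)
summary(log_fit)$r.squared
```

A slope near 1 in log-log space would mean infections scale roughly in proportion to population, while a slope below 1 would mean they grow more slowly than population.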

This box plot provides a statistical summary of the infections variable, highlighting its central tendency, spread, and the presence of extreme values.

The lightcoral box represents the Interquartile Range (IQR), which contains the middle 50% of the data.

The thick horizontal line inside the box shows the median number of infections, which is low (245).

The black dots extending high above the box are outliers: seven data points, each with more than 2,000 infections, far above the rest of the group.

The box is compressed at the bottom of the scale, indicating that most of the data points have low infection counts (22 of the 29 are below 1,000).

# --- 4. Box Plot: Distribution of Infections ---
ggplot(df, aes(y = infections)) +
  geom_boxplot(fill = "lightcoral") +
  labs(y = "Number of Infections", title = "Distribution of Number of Infections") +
  theme_minimal()

# Observation: This box plot summarizes the distribution of the 'infections' variable. It shows the
# median, quartiles, and potential outliers. The plot indicates that the majority of infection
# counts are relatively low, with some higher values identified as outliers.
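
The numbers behind the box plot can be checked with base R. The sketch below uses `boxplot.stats()`, which applies the same 1.5 × IQR rule for flagging outliers. ggplot2 computes its hinges from quantiles, which can differ slightly from `boxplot.stats()` on other data, so treat this as an approximation of what the plot shows.

```r
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)

bp <- boxplot.stats(infections)

# Five values: lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats

# Observations beyond 1.5 * IQR from the hinges, drawn as individual dots
sort(bp$out)

# Upper fence used to decide what counts as an outlier
upper_fence <- bp$stats[4] + 1.5 * (bp$stats[4] - bp$stats[2])
upper_fence
```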

The following histogram visualizes the frequency distribution of UFO sightings in 2010 across the provided data points. It shows how many locations fall into specific “bins” or ranges of sighting counts.

The distribution is heavily skewed to the right. This means that the vast majority of observations are concentrated at the very low end of the scale.

The tallest bar is at the zero mark, with a frequency of over 15. With a bin width of 5, ggplot2 centers this bin on zero by default, so it covers counts of 0 to 2: more than half of the 29 locations reported two or fewer UFO sightings.

There are several “islands” or isolated bars on the far right of the chart. These represent rare locations with unusually high sighting counts, including one location with 115 sightings and another outlier with 169.

# --- 5. Histogram: Frequency Distribution of UFO Sightings ---
ggplot(df, aes(x = ufo2010)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "black", alpha = 0.7) +
  labs(x = "Number of UFO Sightings (2010)", y = "Frequency", title = "Frequency Distribution of UFO Sightings (2010)") +
  theme_minimal()

# Observation: This histogram shows the frequency distribution of UFO sightings in 2010. The
# distribution is heavily skewed towards zero, indicating that most data points have very few or
# no reported UFO sightings.
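
The bar heights can be reproduced without plotting. The sketch below bins the counts with base R's `cut()`, assuming ggplot2's default of centering bins on multiples of the bin width (so the first bin spans -2.5 to 2.5). The names `breaks` and `bin_counts` are illustrative.

```r
ufo2010 <- c(2, 6, 2, 59, 0, 1, 1, 0, 115, 0, 0, 0, 0, 0, 0, 0, 6, 4, 2, 7, 2, 9, 2, 29, 10, 169, 1, 40, 16)

# Bin edges for width-5 bins centered on 0, 5, 10, ...
breaks <- seq(-2.5, 172.5, by = 5)

# cut() uses right-closed intervals by default, e.g. (-2.5, 2.5]
bin_counts <- table(cut(ufo2010, breaks = breaks))

# Show only the non-empty bins
bin_counts[bin_counts > 0]

# Locations with no sightings at all
sum(ufo2010 == 0)
```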

The following scatter plot examines the correlation between Population size and the Number of UFO Sightings (2010) across the 29 data points.

As the population increases, the number of UFO sightings generally tends to increase as well.

The majority of the data points are clustered near the bottom-left corner (lower population and lower sightings). This confirms that most locations in the dataset are smaller areas with very few reported sightings.

# --- 6. Scatter Plot: Relationship between Population and UFO Sightings ---
ggplot(df, aes(x = pop, y = ufo2010)) +
  geom_point(color = "purple", alpha = 0.6) +
  labs(x = "Population", y = "Number of UFO Sightings (2010)", title = "Relationship between Population and UFO Sightings (2010)") +
  theme_minimal()

# Observation: This scatter plot examines the relationship between population size and the number
# of UFO sightings. Sightings rise with population, and the largest populations report the most sightings.
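
One way to check the strength of this relationship numerically is a simple linear regression of sightings on population. The sketch below is a rough check, not a full model: with 29 points and one very large population, the fit is heavily influenced by that single observation. The names `ufo_fit` and `slope_per_100k` are illustrative.

```r
ufo2010 <- c(2, 6, 2, 59, 0, 1, 1, 0, 115, 0, 0, 0, 0, 0, 0, 0, 6, 4, 2, 7, 2, 9, 2, 29, 10, 169, 1, 40, 16)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

# Ordinary least squares: sightings as a linear function of population
ufo_fit <- lm(ufo2010 ~ pop)

# Rescale the slope to "additional sightings per 100,000 residents"
slope_per_100k <- coef(ufo_fit)[["pop"]] * 1e5

# Share of the variance in sightings explained by population
r_squared <- summary(ufo_fit)$r.squared

round(c(slope_per_100k = slope_per_100k, r_squared = r_squared), 2)
```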

The following bubble chart visualizes the relationship between three variables: Number of UFO Sightings (2010), Number of Infections, and Population Size.

There is no clear linear relationship between UFO sightings and infections. If a strong correlation existed, the points would form a distinct diagonal line. Instead, the points are scattered throughout the plot.

One data point in the top-left corner has the highest number of infections (6,781) despite only 2 UFO sightings and a relatively small population of 19,005 (indicated by the small bubble).

Most data points are clustered near the bottom-left corner, representing areas with low infections, few UFO sightings, and smaller populations.

# --- 7. Scatter Plot: Infections vs. UFOs with Population Size ---
ggplot(df, aes(x = ufo2010, y = infections, size = pop)) +
  geom_point(alpha = 0.6, color = "maroon") +
  scale_size_continuous(name = "Population Size") +
  labs(x = "Number of UFO Sightings (2010)", y = "Number of Infections", title = "Infections vs. UFO Sightings, Size by Population") +
  theme_minimal()

# Observation: This scatter plot shows the relationship between infections and UFO sightings, with
# the size of each point representing the population size. It helps to visualize if areas with higher
# infections or UFO sightings also tend to have larger populations. No strong pattern is immediately
# apparent.

The following pair plot provides a comprehensive matrix of the relationships between infections, ufo2010, and pop.

All three variables (infections, UFO sightings, and population) are heavily right-skewed.

The peaks are at the far left, indicating that the majority of the data points have low values, with long “tails” representing a few extreme outliers.

There is a very strong positive correlation between population and UFO sightings.

Infections have a moderate positive correlation with both UFO sightings and population.

# --- 8. Pair Plot: Overview of Relationships ---
library(GGally)
ggpairs(df) +
  ggtitle("Pair Plot of Infections, UFO Sightings, and Population") +
  theme_minimal()

# Observation: The pair plot provides a matrix of scatter plots for each pair of variables and
# density plots for the distribution of each individual variable. This gives a quick overview of
# potential linear relationships and the shape of the distributions. All three distributions are
# right-skewed. Population and UFO sightings are strongly correlated, while infections are only
# moderately correlated with the other two variables.
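
The correlation values shown in the upper panels of the pair plot are Pearson coefficients by default, and the same matrix can be produced directly with base R's `cor()`. The name `cor_matrix` is illustrative.

```r
infections <- c(245, 215, 2076, 5023, 189, 195, 123, 116, 3298, 430, 502, 126, 112, 67, 52, 39, 54, 2356, 6781, 120, 2389, 279, 257, 290, 234, 5689, 261, 672, 205)
ufo2010 <- c(2, 6, 2, 59, 0, 1, 1, 0, 115, 0, 0, 0, 0, 0, 0, 0, 6, 4, 2, 7, 2, 9, 2, 29, 10, 169, 1, 40, 16)
pop <- c(25101, 61912, 33341, 409061, 7481, 18675, 25581, 22286, 459598, 3915, 67197, 34365, 3911, 32122, 31459, 2311, 28350, 101482, 19005, 20679, 36745, 162812, 15927, 251417, 153920, 1554720, 16148, 305455, 37276)

df <- data.frame(infections, ufo2010, pop)

# Pairwise Pearson correlations for all three columns, rounded for reading
cor_matrix <- round(cor(df), 2)
cor_matrix
```

Reading the matrix alongside the scatter panels helps separate relationships that hold across the data from ones driven by a few extreme points.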

Summary

These visualizations reveal a dataset characterized by strong right-skewness and large disparities in scale: population totals vastly outweigh infection counts and UFO sightings. The initial bar and line charts show how these scale differences can obscure smaller trends. The scatter and pair plots uncover a very strong positive correlation between population size and UFO sightings, suggesting that the number of sighting reports largely tracks population size. In contrast, population and infections are only moderately correlated: the box plot flags several high infection counts as outliers, and the scatter and bubble charts show that some of these spikes occur in areas with relatively small populations.