Introduction

The dataset I am using is ‘Global Airports’ from the World Bank catalog, which provides data on the locations of airports that facilitate international travel. It is publicly accessible under the Creative Commons Attribution 4.0 license, which allows users to freely use, share, and adapt the data with appropriate attribution. This dataset is relevant for topics in transport and infrastructure, specifically focusing on airports and aviation.

The question I set out to investigate is: ‘How does geographic location (latitude) influence airport capacity (TotalSeats) across different regions?’

The variables used from this dataset are:

-TotalSeats: Represents the total number of seats available for flights at each airport, likely aggregated over a certain period. Numerical variable. Useful for understanding the passenger or seat capacity of each airport, allowing comparisons of airport sizes, traffic volumes, or potential demand.

-Airport1Latitude: The latitude coordinate of the airport, indicating its geographic location in terms of north-south positioning. Numerical variable. Essential for spatial analysis or mapping. When combined with longitude, it helps place the airport accurately on a map.

-Airport1Longitude: The longitude coordinate of the airport, indicating its geographic location in terms of east-west positioning. Numerical variable. Used in conjunction with latitude for spatial analyses or mapping. Allows users to identify airport locations precisely on a global scale.

Methodology

The key technique I am using is simple linear regression, which is used when the goal is to examine the relationship between one predictor and one response. In this case, latitude is the independent variable and airport capacity (TotalSeats) is the dependent variable. The model fitted in the results also includes a squared latitude term to test for curvature, which makes it a quadratic (polynomial) regression that is still linear in its coefficients. Alongside regression, I am using correlation analysis, another technique I learned in this class. Correlation measures the strength of association between two variables but does not model or predict the relationship, which is why I am using the two techniques together for this research.
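As a minimal sketch of how correlation analysis and simple linear regression fit together, the R code below uses simulated data rather than the World Bank file. The data frame `sim`, the sample size, and the numbers used to generate it are all assumptions made for illustration; only the column names mirror the report.

```r
# Simulated stand-in data (NOT the World Bank file); column names mirror the report.
set.seed(1)
n <- 200
sim <- data.frame(Airport1Latitude = runif(n, min = -50, max = 70))
# Hypothetical relationship, chosen only for illustration.
sim$TotalSeats <- 1e6 + 5000 * sim$Airport1Latitude + rnorm(n, sd = 5e5)

# Correlation analysis: strength and direction of the linear association.
r <- cor(sim$Airport1Latitude, sim$TotalSeats)

# Simple linear regression: one predictor, one response.
fit <- lm(TotalSeats ~ Airport1Latitude, data = sim)
slope <- unname(coef(fit)["Airport1Latitude"])

# In simple linear regression, R-squared equals the squared Pearson correlation.
r_squared <- summary(fit)$r.squared
print(c(r = r, slope = slope, r_squared = r_squared))
```

The correlation gives a unit-free measure of association, while the regression slope expresses the same relationship in seats per degree of latitude. The identity between R-squared and the squared correlation holds only when there is a single predictor.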

Assumptions

Linearity: The relationship between latitude and TotalSeats must be linear.

Independence: Observations are independent of each other.

Homoscedasticity: The variance of the residuals is constant across all levels of latitude.

Normality: Residuals (differences between observed and predicted values) should be approximately normally distributed.

Results and conclusions

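# Load the airport data (the file is assumed to be in the working directory)
data <- read.csv("airport_volume_airport_locations.csv")
# Regress TotalSeats on latitude and its square to allow for curvature
model <- lm(TotalSeats ~ Airport1Latitude + I(Airport1Latitude^2), data = data)
# Coefficient estimates, standard errors, and overall fit statistics
summary(model)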
## 
## Call:
## lm(formula = TotalSeats ~ Airport1Latitude + I(Airport1Latitude^2), 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1225339 -1193206 -1075916  -587813 54470728 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           956864.4   145822.2   6.562 6.63e-11 ***
## Airport1Latitude       13009.8     5722.4   2.274   0.0231 *  
## I(Airport1Latitude^2)   -157.5      123.6  -1.275   0.2025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4082000 on 2170 degrees of freedom
## Multiple R-squared:  0.002615,   Adjusted R-squared:  0.001696 
## F-statistic: 2.845 on 2 and 2170 DF,  p-value: 0.05838

From the summary, the coefficient of the linear term (13,009.8) is significant at the 5% level (p = 0.0231), which indicates that latitude has a small but statistically significant linear effect on airport capacity. Because the model also contains a squared term, this coefficient is the slope at latitude 0: near the equator, a one-degree increase in latitude corresponds to an increase of approximately 13,010 seats, and the slope changes as latitude moves away from 0.

For the quadratic term, the coefficient (-157.5) is not significant (p = 0.2025), suggesting there is little evidence of curvature in the relationship between latitude and airport capacity.
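To make the interpretation of the two coefficients concrete, the short R sketch below computes the slope of the fitted curve at a few latitudes using the estimates reported in the summary output. The latitudes 0, 20, and 40 and the helper `marginal_effect` are illustrative choices, not part of the original analysis. Since the quadratic term is not significant, the turning point it implies is only loosely determined.

```r
# Coefficients copied from the summary() output in this report.
b1 <- 13009.8   # Airport1Latitude
b2 <- -157.5    # I(Airport1Latitude^2)

# Slope of the fitted curve at a given latitude:
# d(TotalSeats) / d(latitude) = b1 + 2 * b2 * latitude
marginal_effect <- function(lat) b1 + 2 * b2 * lat

lats <- c(0, 20, 40)   # illustrative latitudes, chosen arbitrarily
effects <- marginal_effect(lats)
print(data.frame(latitude = lats, seats_per_degree = effects))

# Latitude at which the fitted parabola peaks (where the slope is zero).
peak_lat <- -b1 / (2 * b2)
print(peak_lat)
```

Under these estimates the fitted slope falls from about 13,010 seats per degree at the equator to about 410 seats per degree at 40 degrees north, and the fitted curve peaks near 41.3 degrees north.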

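The residual standard error of 4,082,000 indicates substantial unexplained variability. The multiple R-squared of 0.002615 means the model explains only about 0.26% of the variation in TotalSeats, and the overall F-test (p = 0.05838) is not significant at the 5% level.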

plot(model, which = 1, main = "Residuals vs Fitted", col = "blue", pch = 19)

The Residuals vs. Fitted plot shows some extreme residuals, which are data points where the model’s predictions deviate substantially from the observed values.

The upward pattern and increasing spread suggest that the relationship between the predictors and the response might not be adequately captured by the current model.

This pattern points to heteroscedasticity, that is, non-constant variance of the residuals.
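A visual impression of heteroscedasticity can be backed up with a formal check. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R, which is a technique not used in the original analysis. It runs on simulated data whose error spread is deliberately made to grow with latitude; the data frame `sim` and every number used to generate it are assumptions for illustration only.

```r
# Simulated stand-in data (NOT the World Bank file).
set.seed(2)
n <- 300
sim <- data.frame(Airport1Latitude = runif(n, min = -50, max = 70))
# Hypothetical errors whose spread grows with latitude, so the check has something to find.
noise_sd <- 1e5 * (1 + (sim$Airport1Latitude + 50) / 20)
sim$TotalSeats <- 1e6 + 5000 * sim$Airport1Latitude + rnorm(n, sd = noise_sd)

fit <- lm(TotalSeats ~ Airport1Latitude + I(Airport1Latitude^2), data = sim)

# Auxiliary regression: squared residuals on the same predictors.
aux <- lm(resid(fit)^2 ~ Airport1Latitude + I(Airport1Latitude^2), data = sim)

# Studentized Breusch-Pagan statistic: n times the R-squared of the auxiliary
# regression, compared with a chi-square distribution (df = number of predictors).
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value is evidence against constant variance. Applying the same steps to the fitted `model` from this report would be the natural next step.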

plot(model, which = 2, main = "Normal Q-Q Plot", col = "blue", pch = 19)

The Normal Q-Q plot shows outliers in the upper tail: unusually high residuals that are not consistent with a normal distribution. Non-normal residuals may affect the validity of statistical inference for the model.

On the left side, the points generally align with the line, suggesting no significant issues with the lower tail of the residual distribution.
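The reading of the Q-Q plot can be complemented with a numerical check. The sketch below applies the Shapiro-Wilk test (`shapiro.test` from base R) and a hand-computed skewness to the residuals of a model fitted on simulated data. The log-normal errors are an assumption chosen to mimic a handful of very large airports and are not taken from the World Bank file.

```r
# Simulated stand-in data (NOT the World Bank file).
set.seed(3)
n <- 500
sim <- data.frame(Airport1Latitude = runif(n, min = -50, max = 70))
# Hypothetical right-skewed errors (log-normal), mimicking a few very large airports.
sim$TotalSeats <- 1e6 + 5000 * sim$Airport1Latitude +
  1e5 * rlnorm(n, meanlog = 0, sdlog = 1.5)

fit <- lm(TotalSeats ~ Airport1Latitude + I(Airport1Latitude^2), data = sim)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value is evidence against normally distributed residuals.
sw <- shapiro.test(res)
print(sw)

# Sample skewness computed by hand; positive values indicate a long upper tail.
skewness <- mean((res - mean(res))^3) / sd(res)^3
print(skewness)
```

`shapiro.test` accepts between 3 and 5000 observations, so it could also be applied to the residuals of the model in this report, which has 2170 residual degrees of freedom.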

Discussion and critique

From this analysis, I learned that latitude has a small linear effect on airport capacity. The quadratic relationship is weak, and its term does not significantly improve the model. The diagnostic plots reveal non-constant variance and non-normal residuals, which violate the assumptions of linear regression. Airport capacity is therefore likely influenced by other factors.

One strength is the quadratic model testing, which went beyond a simple linear model by checking for a non-linear relationship. Another is visualization: scatter plots and fitted curves effectively illustrated the weak relationship between latitude and capacity.
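Testing the quadratic term can also be framed as a comparison of two nested models. The sketch below, which again uses simulated data rather than the World Bank file, fits a linear and a quadratic model and compares them with `anova()`. The names `linear_fit` and `quadratic_fit` and the data-generating numbers are assumptions for illustration.

```r
# Simulated stand-in data (NOT the World Bank file).
set.seed(4)
n <- 250
sim <- data.frame(Airport1Latitude = runif(n, min = -50, max = 70))
# Hypothetical purely linear relationship, chosen only for illustration.
sim$TotalSeats <- 1e6 + 5000 * sim$Airport1Latitude + rnorm(n, sd = 5e5)

linear_fit <- lm(TotalSeats ~ Airport1Latitude, data = sim)
quadratic_fit <- lm(TotalSeats ~ Airport1Latitude + I(Airport1Latitude^2), data = sim)

# Nested-model F-test: does adding the squared term reduce the residual sum of squares
# by more than would be expected by chance?
cmp <- anova(linear_fit, quadratic_fit)
print(cmp)

# With a single added term, the F statistic equals the squared t statistic
# of that term in the larger model.
f_stat <- cmp$F[2]
t_stat <- coef(summary(quadratic_fit))["I(Airport1Latitude^2)", "t value"]
print(c(F = f_stat, t_squared = t_stat^2))
```

Because only one term is added, this comparison gives the same p-value as the t-test on the squared term in the quadratic model's summary.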

One weakness is poor handling of spatial data, since latitude alone does not capture the complex spatial and geographic relationships that likely influence airport capacity. Another is low flexibility, because the method does not account for potential interactions or non-linearity beyond the specified quadratic term.

To improve on this research, I would incorporate more variables, such as population density.