In this report, we will perform exploratory data analysis (EDA) on two datasets. The first is the NYC Taxi Trips Dataset, a large dataset with over 1.4 million records, sourced from Kaggle, which provides access to NYC taxi trip data. The second is a Sales Dataset with 10,000 records, obtained from excelbianalytics (10,000 Sales Records), designed for testing and analysis purposes. Our goal is to compare these datasets, explore their correlations, predictability, and assess the applicability of various machine learning algorithms.
NYC Taxi Trips Dataset:
Sales Dataset:
This analysis will allow us to gain insights into the structure and complexity of both datasets and evaluate their suitability for machine learning applications.
# Load the small dataset (Sales Data)
small_data <- read.csv("/Users/souleymanedoumbia/Library/Mobile Documents/com~apple~CloudDocs/CUNY SPS CLASSES/MSDS CLASSES/Data 622 Fall 2024/Week 8/Homework 1 _ Exploratory Data Analysis/10000 Sales Records.csv")
# Load the large dataset (NYC Taxi Trips)
large_data <- read.csv("/Users/souleymanedoumbia/Library/Mobile Documents/com~apple~CloudDocs/CUNY SPS CLASSES/MSDS CLASSES/Data 622 Fall 2024/Week 8/Homework 1 _ Exploratory Data Analysis/train_augmented.csv")
The following code summarizes the sales data and provides a basic overview of the NYC taxi trip dataset.
# Summary and structure of small dataset (Sales Data)
summary(small_data)
## Region Country Item.Type Sales.Channel
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Order.Priority Order.Date Order.ID Ship.Date
## Length:10000 Length:10000 Min. :100089156 Length:10000
## Class :character Class :character 1st Qu.:321806669 Class :character
## Mode :character Mode :character Median :548566305 Mode :character
## Mean :549871874
## 3rd Qu.:775998104
## Max. :999934232
## Units.Sold Unit.Price Unit.Cost Total.Revenue
## Min. : 2 Min. : 9.33 Min. : 6.92 Min. : 168
## 1st Qu.: 2531 1st Qu.:109.28 1st Qu.: 56.67 1st Qu.: 288551
## Median : 4962 Median :205.70 Median :117.11 Median : 800051
## Mean : 5003 Mean :268.14 Mean :188.81 Mean :1333355
## 3rd Qu.: 7472 3rd Qu.:437.20 3rd Qu.:364.69 3rd Qu.:1819143
## Max. :10000 Max. :668.27 Max. :524.96 Max. :6680027
## Total.Cost Total.Profit
## Min. : 125 Min. : 43.4
## 1st Qu.: 164786 1st Qu.: 98329.1
## Median : 481606 Median : 289099.0
## Mean : 938266 Mean : 395089.3
## 3rd Qu.:1183822 3rd Qu.: 566422.7
## Max. :5241726 Max. :1738178.4
str(small_data)
## 'data.frame': 10000 obs. of 14 variables:
## $ Region : chr "Sub-Saharan Africa" "Europe" "Middle East and North Africa" "Sub-Saharan Africa" ...
## $ Country : chr "Chad" "Latvia" "Pakistan" "Democratic Republic of the Congo" ...
## $ Item.Type : chr "Office Supplies" "Beverages" "Vegetables" "Household" ...
## $ Sales.Channel : chr "Online" "Online" "Offline" "Online" ...
## $ Order.Priority: chr "L" "C" "C" "C" ...
## $ Order.Date : chr "1/27/2011" "12/28/2015" "1/13/2011" "9/11/2012" ...
## $ Order.ID : int 292494523 361825549 141515767 500364005 127481591 482292354 844532620 564251220 411809480 327881228 ...
## $ Ship.Date : chr "2/12/2011" "1/23/2016" "2/1/2011" "10/6/2012" ...
## $ Units.Sold : int 4484 1075 6515 7683 3491 9880 4825 3330 2431 6197 ...
## $ Unit.Price : num 651.2 47.5 154.1 668.3 47.5 ...
## $ Unit.Cost : num 525 31.8 90.9 502.5 31.8 ...
## $ Total.Revenue : num 2920026 51009 1003701 5134318 165648 ...
## $ Total.Cost : num 2353921 34174 592409 3861015 110979 ...
## $ Total.Profit : num 566105 16834 411292 1273304 54669 ...
# Summary and structure of large dataset (NYC Taxi Trips)
summary(large_data)
## id distance duration motorway
## Length:1458643 Min. : 0 Min. : 0.0 Min. :0.0000000
## Class :character 1st Qu.: 1654 1st Qu.: 155.4 1st Qu.:0.0000000
## Mode :character Median : 2730 Median : 249.2 Median :0.0000000
## Mean : 4569 Mean : 343.5 Mean :0.0004421
## 3rd Qu.: 5049 3rd Qu.: 424.0 3rd Qu.:0.0000000
## Max. :1038310 Max. :46645.3 Max. :1.0000000
##
## trunk primary secondary tertiary
## Min. :0.000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.08329
## Median :0.000 Median :0.00000 Median :0.1314 Median :0.34729
## Mean :0.126 Mean :0.03019 Mean :0.2665 Mean :0.41943
## 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.4580 3rd Qu.:0.75704
## Max. :1.000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## unclassified residential nTrafficSignals nCrossing
## Min. :0.00000 Min. :0.00000 Min. : 0.0 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 14.0 1st Qu.: 2.000
## Median :0.00000 Median :0.00000 Median : 21.0 Median : 6.000
## Mean :0.01338 Mean :0.04399 Mean : 24.5 Mean : 8.812
## 3rd Qu.:0.00000 3rd Qu.:0.02493 3rd Qu.: 32.0 3rd Qu.:13.000
## Max. :1.00000 Max. :1.00000 Max. :112.0 Max. :91.000
##
## nStop nIntersection srcCounty dstCounty
## Min. : 0.0000 Min. : 0.00 Min. :1.000 Min. :1.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 0.0000 Median : 1.00 Median :1.000 Median :1.000
## Mean : 0.1153 Mean : 2.52 Mean :1.214 Mean :1.271
## 3rd Qu.: 0.0000 3rd Qu.: 3.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :10.0000 Max. :132.00 Max. :5.000 Max. :5.000
## NA's :1227 NA's :6338
str(large_data)
## 'data.frame': 1458643 obs. of 16 variables:
## $ id : chr "id2875421" "id2377394" "id3858529" "id3504673" ...
## $ distance : num 2009 2513 9911 1779 1615 ...
## $ duration : num 161 256 680 182 132 ...
## $ motorway : num 0 0 0 0 0 0 0 0 0 0 ...
## $ trunk : num 0 0 0.543 0 0 ...
## $ primary : num 0 0 0 0 0 0 0 0 0 0 ...
## $ secondary : num 0 0.349 0.373 0 0.637 ...
## $ tertiary : num 1 0.1748 0.0398 0.4245 0.3627 ...
## $ unclassified : num 0 0 0 0 0 ...
## $ residential : num 0 0.1439 0.00686 0.03974 0 ...
## $ nTrafficSignals: int 14 25 38 18 17 12 19 33 9 58 ...
## $ nCrossing : int 5 13 12 6 2 3 0 18 2 11 ...
## $ nStop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nIntersection : int 4 0 3 1 2 0 2 23 0 6 ...
## $ srcCounty : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dstCounty : int 1 1 1 1 1 1 1 4 1 1 ...
We’ll first check if there are missing values in the datasets and decide how to handle them.
# Check for missing values in the Sales Data
sum(is.na(small_data))
## [1] 0
# Identify columns with missing data
colSums(is.na(small_data))
## Region Country Item.Type Sales.Channel Order.Priority
## 0 0 0 0 0
## Order.Date Order.ID Ship.Date Units.Sold Unit.Price
## 0 0 0 0 0
## Unit.Cost Total.Revenue Total.Cost Total.Profit
## 0 0 0 0
# Handling missing values (if any)
# we would remove rows with missing values for simplicity
small_data_clean <- small_data %>%
drop_na()
# Verify if missing values have been removed
sum(is.na(small_data_clean))
## [1] 0
# Check for missing values in the NYC Taxi Trips Dataset
sum(is.na(large_data))
## [1] 7565
# Identify columns with missing data
colSums(is.na(large_data))
## id distance duration motorway trunk
## 0 0 0 0 0
## primary secondary tertiary unclassified residential
## 0 0 0 0 0
## nTrafficSignals nCrossing nStop nIntersection srcCounty
## 0 0 0 0 1227
## dstCounty
## 6338
# Handling missing values (if any)
# we would remove rows with missing values for simplicity
large_data_clean <- large_data %>%
drop_na()
# Verify if missing values have been removed
sum(is.na(large_data_clean))
## [1] 0
Once the data is cleaned, we can start with basic statistics and visualizing the distributions of key variables.
# Summary statistics for some key variables
summary(small_data_clean[, c("Units.Sold", "Unit.Price", "Total.Revenue", "Total.Profit")])
## Units.Sold Unit.Price Total.Revenue Total.Profit
## Min. : 2 Min. : 9.33 Min. : 168 Min. : 43.4
## 1st Qu.: 2531 1st Qu.:109.28 1st Qu.: 288551 1st Qu.: 98329.1
## Median : 4962 Median :205.70 Median : 800051 Median : 289099.0
## Mean : 5003 Mean :268.14 Mean :1333355 Mean : 395089.3
## 3rd Qu.: 7472 3rd Qu.:437.20 3rd Qu.:1819143 3rd Qu.: 566422.7
## Max. :10000 Max. :668.27 Max. :6680027 Max. :1738178.4
# Visualize distributions of the main predictors and target variables
ggplot(small_data_clean, aes(x = Units.Sold)) +
geom_histogram(binwidth = 500, fill = "blue", color = "white") +
ggtitle("Distribution of Units Sold")
ggplot(small_data_clean, aes(x = Unit.Price)) +
geom_histogram(binwidth = 10, fill = "orange", color = "white") +
ggtitle("Distribution of Unit Price")
ggplot(small_data_clean, aes(x = Total.Revenue)) +
geom_histogram(binwidth = 50000, fill = "green", color = "white") +
ggtitle("Distribution of Total Revenue")
ggplot(small_data_clean, aes(x = Total.Profit)) +
geom_histogram(binwidth = 50000, fill = "purple", color = "white") +
ggtitle("Distribution of Total Profit")
#### 1.4.1.1 Summary Statistics: Sales Data The summary statistics for
key variables in the Sales Data dataset reveal patterns and distribution
characteristics:
Units Sold shows a fairly uniform distribution, reflecting consistent sales volume.
Total Revenue and Total Profit distributions are right-skewed, pointing to a few high-value transactions.
The multimodal distribution of Unit Price suggests different price categories, possibly for varied product lines.
These observations emphasize the relevance of focusing on Total Revenue and Total Profit for profitability analysis, given their variability and significant impact within the dataset.
# Summary statistics for some key variables
summary(large_data_clean[, c("distance", "duration", "nIntersection", "nTrafficSignals")])
## distance duration nIntersection nTrafficSignals
## Min. : 0 Min. : 0.0 Min. : 0.000 Min. : 0.00
## 1st Qu.: 1652 1st Qu.: 155.2 1st Qu.: 0.000 1st Qu.: 14.00
## Median : 2721 Median : 248.5 Median : 1.000 Median : 21.00
## Mean : 4494 Mean : 340.0 Mean : 2.505 Mean : 24.53
## 3rd Qu.: 5005 3rd Qu.: 421.3 3rd Qu.: 3.000 3rd Qu.: 32.00
## Max. :52309 Max. :2725.1 Max. :99.000 Max. :112.00
# Visualize distributions of the main predictors and target variable
ggplot(large_data_clean, aes(x = distance)) +
geom_histogram(binwidth = 500, fill = "purple", color = "white") +
ggtitle("Distribution of Trip Distance")
ggplot(large_data_clean, aes(x = duration)) +
geom_histogram(binwidth = 50, fill = "red", color = "white") +
ggtitle("Distribution of Trip Duration")
ggplot(large_data_clean, aes(x = nIntersection)) +
geom_histogram(binwidth = 1, fill = "blue", color = "white") +
ggtitle("Distribution of Number of Intersections")
ggplot(large_data_clean, aes(x = nTrafficSignals)) +
geom_histogram(binwidth = 1, fill = "green", color = "white") +
ggtitle("Distribution of Number of Traffic Signals")
distance
is highly skewed, with the
majority of trips covering short distances (median: 2721).duration
distribution is also right-skewed, with
most trips being relatively short (median: 248.5 seconds).nIntersection
):
nTrafficSignals
):
nTrafficSignals
distribution follows a skewed
pattern, with a mean of 24.53.The skewed distributions in both distance
and
duration
suggest variability in trip lengths, likely due to
urban vs. suburban trip characteristics.
High variability in intersections and traffic signals aligns with different traffic conditions across routes, highlighting the importance of these features in the model.
This analysis supports the selection of features such as
distance
, duration
,
nIntersection
, and nTrafficSignals
as
predictive variables for modeling trip characteristics.
# Correlation matrix for numeric variables in Sales Data
sales_data_numeric <- small_data_clean %>% select_if(is.numeric)
# Calculate and plot the correlation matrix
sales_corr_matrix <- cor(sales_data_numeric, use = "complete.obs")
corrplot::corrplot(sales_corr_matrix, method = "circle", type = "upper",
addCoef.col = "black", tl.col = "black", tl.srt = 45,
title = "Sales Data: Correlation Heatmap (All Numeric Variables)", mar = c(0, 0, 1, 0))
# Correlation matrix for numeric variables in NYC Taxi Data
taxi_data_numeric <- large_data_clean %>% select_if(is.numeric)
# Calculate and plot the correlation matrix
taxi_corr_matrix <- cor(taxi_data_numeric, use = "complete.obs")
# Plot correlation heatmap with smaller text for coefficients
corrplot::corrplot(taxi_corr_matrix, method = "circle", type = "upper",
addCoef.col = "black", tl.col = "black", tl.srt = 45,
number.cex = 0.4, # By adjusting this value we can to resize the coef. text in the circle
title = "NYC Taxi Data: Correlation Heatmap (All Numeric Variables)",
mar = c(0, 0, 1, 0))
We can use Linear Regression to predict Total Profit based on variables like Units Sold, Total Revenue, and Unit Price.
# Linear Regression on Sales Data
lm_model_sales <- lm(Total.Profit ~ Units.Sold + Unit.Price + Total.Revenue, data = small_data_clean)
# Summary of the model
summary(lm_model_sales)
##
## Call:
## lm(formula = Total.Profit ~ Units.Sold + Unit.Price + Total.Revenue,
## data = small_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531043 -95592 1994 71189 604582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.786e+02 5.333e+03 0.127 0.899
## Units.Sold 2.510e+01 9.221e-01 27.226 <2e-16 ***
## Unit.Price -1.763e+00 1.529e+01 -0.115 0.908
## Total.Revenue 2.020e-01 2.659e-03 75.945 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166800 on 9996 degrees of freedom
## Multiple R-squared: 0.8048, Adjusted R-squared: 0.8047
## F-statistic: 1.374e+04 on 3 and 9996 DF, p-value: < 2.2e-16
Model Summary:
Statistical Significance:
Model Fit:
F-statistic: The high F-statistic value (1.374e+04) with a p-value of \(< 2.2e-16\) indicates that the model as a whole is statistically significant.
The model demonstrates that Units Sold and Total Revenue are strong predictors of Total Profit, while Unit Price has limited predictive power in this context.
For the NYC Taxi Data, we can use KNN to predict Trip Duration based on features like Distance*, nIntersection, and nTrafficSignals.
In this section, we encountered an error while attempting to apply the K-Nearest Neighbors (KNN) algorithm to predict the Trip Duration from features like Distance, nIntersection, and nTrafficSignals. The error we encountered was: . Below is the initial code used, which led to this issue:
### KNN Code Leading to "too many ties in knn" issues
# Prepare data (scale features)
scaled_data <- scale(large_data_clean[, c("distance", "nIntersection", "nTrafficSignals")])
# Create training and testing datasets
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_indices, ]
train_labels <- large_data_clean$duration[train_indices]
test_data <- scaled_data[-train_indices, ]
test_labels <- large_data_clean$duration[-train_indices]
## Running KNN = Leading to "too many ties in knn" Error
#k_value <- 7
#knn_model <- knn(train = train_data, test = test_data, cl = train_labels, k = k_value, use.all = FALSE)
# Evaluate the model
#table(predicted = knn_model, actual = test_labels)
# Sample a subset of the data
set.seed(123)
subset_indices <- sample(1:nrow(large_data_clean), size = 500000)
large_data_sample <- large_data_clean[subset_indices, ]
# Scale features
scaled_data <- scale(large_data_sample[, c("distance", "nIntersection", "nTrafficSignals")])
# Split into training and testing datasets
set.seed(123)
train_indices <- sample(1:nrow(scaled_data), size = 0.8 * nrow(scaled_data))
train_data <- scaled_data[train_indices, ]
train_labels <- large_data_sample$duration[train_indices]
test_data <- scaled_data[-train_indices, ]
test_labels <- large_data_sample$duration[-train_indices]
# Add random noise to break ties
train_data_noisy <- train_data + matrix(rnorm(n = nrow(train_data) * ncol(train_data), mean = 0, sd = 0.0001), nrow = nrow(train_data))
test_data_noisy <- test_data + matrix(rnorm(n = nrow(test_data) * ncol(test_data), mean = 0, sd = 0.0001), nrow = nrow(test_data))
# Run KNN with noisy data
k_value <- 7
knn_model_noisy <- knn(train = train_data_noisy, test = test_data_noisy, cl = train_labels, k = k_value, use.all = FALSE)
# Calculate accuracy
accuracy <- sum(knn_model_noisy == test_labels) / length(test_labels)
accuracy
## [1] 0.00679
# Fit Linear Regression model on the NYC Taxi data
linear_model <- lm(duration ~ distance + nIntersection + nTrafficSignals, data = large_data_clean)
# View the summary of the model
summary(linear_model)
##
## Call:
## lm(formula = duration ~ distance + nIntersection + nTrafficSignals,
## data = large_data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -261.590 -21.778 -2.401 19.675 252.303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.648e+01 5.890e-02 279.7 <2e-16 ***
## distance 5.128e-02 6.689e-06 7665.9 <2e-16 ***
## nIntersection 2.530e+00 6.437e-03 393.1 <2e-16 ***
## nTrafficSignals 3.537e+00 2.072e-03 1707.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.38 on 1452045 degrees of freedom
## Multiple R-squared: 0.9848, Adjusted R-squared: 0.9848
## F-statistic: 3.126e+07 on 3 and 1452045 DF, p-value: < 2.2e-16
# Predict on the dataset
predicted_duration <- predict(linear_model, large_data_clean)
# Calculate RMSE to evaluate model performance
actual_duration <- large_data_clean$duration
rmse <- sqrt(mean((predicted_duration - actual_duration)^2))
rmse
## [1] 35.37794
With this strong performance, Linear Regression appears well-suited for predicting trip duration in the NYC Taxi data.
In summary, Linear Regression proved effective for both datasets, providing interpretable and useful insights. While the current analysis is informative, further exploration with additional features and advanced models could yield even more accurate predictions, especially for datasets with complex relationships or outliers.