Introduction and Setup¶

Continuing from Project 1, this project analyzes the battery recycling output further and examines some of the challenges that arise in output recycling. Using the Material Flow Analysis CSV dataset and linear regression analysis, including a scatterplot and residual plots, we ask what relationship exists between the kg of batteries retired and the kg of recycled weight produced, and whether any external parameters influence the data.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Introducing our Data/Importing Libraries¶

In R, we use ggplot2, plotly, tidyr, and dplyr to analyze and visualize our data. In Python, we likewise import libraries to read the data file and to plot it: "pandas" to read the CSV data, and "matplotlib.pyplot" and "seaborn" for plotting. First we load the original file and display its first five rows with head().

In [2]:
df_mfa=pd.read_csv("OUTPUT_MFA.csv")
df_mfa.head()
Out[2]:
Year scenario.percent.repurposed scenario.reuse.lifespan material recycling.process Sales.Scenario Cathode.Scenario kWh.retired kg.retired kg.recycled.weight kWh kg.demand circularity (%) collection_rate
0 2020 high reuse.high Aluminum direct SDS LFP 2417264.362 7192595.151 6473335.636 20735767.32 68338836.46 0.094724 0.6
1 2020 high reuse.high Aluminum direct STEPS LFP 2417264.362 7192595.151 6473335.636 20735767.32 68338836.46 0.094724 0.6
2 2020 high reuse.high Aluminum hydro SDS LFP 2417264.362 7192595.151 6473335.636 20735767.32 68338836.46 0.094724 0.6
3 2020 high reuse.high Aluminum hydro STEPS LFP 2417264.362 7192595.151 6473335.636 20735767.32 68338836.46 0.094724 0.6
4 2020 high reuse.high Aluminum pyro SDS LFP 2417264.362 7192595.151 0.000 20735767.32 68338836.46 0.000000 0.6
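
Note that head() only previews the first rows, while a statistical summary comes from describe(). The difference can be sketched on a tiny CSV held in memory. The values and the two-column layout below are made up for illustration and are not taken from OUTPUT_MFA.csv.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for OUTPUT_MFA.csv (values are made up).
csv_text = """kg.retired,kg.recycled.weight
100.0,90.0
200.0,180.0
300.0,0.0
"""
toy = pd.read_csv(io.StringIO(csv_text))

preview = toy.head(2)      # first two rows only
summary = toy.describe()   # count, mean, std, min, quartiles, max per numeric column

print(preview)
print(summary)
```

On the real data, the same two calls would be made on df_mfa.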

Based on this preview, the dataset has more variables than we need. To clean it up and pick the columns we want, we use a boolean mask and column selection in pandas. For this case, we keep only the kg.retired and kg.recycled.weight columns. To preview a few rows of the filtered data, we call head() on it; head(20) returns the first 20 rows.

Inspecting the CSV file, the relationship between kg.retired and kg.recycled.weight looks close to linear. However, there are a few rows where kg.recycled.weight drops to zero. The ones we could identify had "pyro" as the recycling process. Using a further data cleaning step, we removed those problematic rows from our new dataset.

In [3]:
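# Flag rows where the recycling process is "pyro" and the recycled weight is zero
mask = df_mfa['recycling.process'].str.contains('pyro', case=False, na=False) & (df_mfa['kg.recycled.weight'] == 0)
# Drop the flagged rows, then keep only the two columns of interest
df_filtered = df_mfa[~mask]
df_filtered_mfa = df_filtered[['kg.retired', 'kg.recycled.weight']]
print(df_filtered_mfa.head())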
     kg.retired  kg.recycled.weight
0  7.192595e+06        6.473336e+06
1  7.192595e+06        6.473336e+06
2  7.192595e+06        6.473336e+06
3  7.192595e+06        6.473336e+06
6  3.697164e+05        3.327448e+05
In [4]:
print(df_filtered_mfa.head(20))
      kg.retired  kg.recycled.weight
0   7.192595e+06        6.473336e+06
1   7.192595e+06        6.473336e+06
2   7.192595e+06        6.473336e+06
3   7.192595e+06        6.473336e+06
6   3.697164e+05        3.327448e+05
7   3.697164e+05        3.327448e+05
8   3.697164e+05        3.623221e+05
9   3.697164e+05        3.623221e+05
10  3.697164e+05        3.327448e+05
11  3.697164e+05        3.327448e+05
12  1.444868e+06        1.300381e+06
13  1.444868e+06        1.300381e+06
14  1.444868e+06        1.300381e+06
15  1.444868e+06        1.300381e+06
16  1.444868e+06        1.300381e+06
17  1.444868e+06        1.300381e+06
18  2.364594e+06        2.128135e+06
19  2.364594e+06        2.128135e+06
20  2.364594e+06        2.128135e+06
21  2.364594e+06        2.128135e+06
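
The observation that the zero values belong to the "pyro" process can be checked by counting zero-weight rows per recycling process before filtering. The following is a minimal sketch on a toy DataFrame; the rows are invented for illustration, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df_mfa; the rows are invented for illustration only.
toy = pd.DataFrame({
    "recycling.process": ["direct", "hydro", "pyro", "pyro", "direct"],
    "kg.retired": [100.0, 100.0, 100.0, 250.0, 250.0],
    "kg.recycled.weight": [90.0, 90.0, 0.0, 225.0, 225.0],
})

# Count zero-weight rows per recycling process.
is_zero = toy["kg.recycled.weight"].eq(0)
zero_counts = is_zero.groupby(toy["recycling.process"]).sum()
print(zero_counts)

# Same filter as in the cell above: drop only rows that are both pyro and zero.
mask = toy["recycling.process"].str.contains("pyro", case=False, na=False) & is_zero
toy_filtered = toy[~mask]
print(len(toy), "->", len(toy_filtered))
```

On the real data, df_mfa would replace toy. If every zero-weight row falls under "pyro", the counts for the other processes are all zero.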

Graph Time¶

Now that we have our dataset, how do we graph it in Python? We use two libraries, matplotlib.pyplot and seaborn. To demonstrate both, we plot histograms of our data. Both histograms show a right skew: most observations fall in the lowest bin, with a frequency of roughly 160000 for kg.retired between 0 and 0.2*1e9.

Note that matplotlib.pyplot does not add a title or axis labels by default; they must be set by calling plt.title(), plt.xlabel(), and plt.ylabel(). seaborn labels the axes automatically from the column names.

In [29]:
import matplotlib.pyplot as plt
plt.hist(df_filtered_mfa['kg.recycled.weight'], bins=15, color='red', edgecolor='pink')
# Call the labelling functions; comparing (==) or assigning (=) to them sets nothing
plt.title('Frequency of kg.recycled.weight')
plt.xlabel('kg.recycled.weight')
plt.ylabel('frequency')
plt.show()
[Figure: histogram of kg.recycled.weight]
In [6]:
import seaborn as sns
sns.histplot(data=df_filtered_mfa, x="kg.retired", bins=15, color='red', edgecolor='pink')
# Call the pyplot functions rather than assigning to them
plt.title('Frequency of kg batteries retired')
plt.xlabel('kg retired')
plt.ylabel('frequency')
plt.show()
[Figure: histogram of kg.retired]
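
An alternative that avoids the module-level plt.title, plt.xlabel, and plt.ylabel functions is matplotlib's Axes interface, where the title and labels are set on the axes object itself. A minimal sketch with made-up values follows; the output file name is hypothetical, and the Agg backend line is only there so the sketch runs without a display (it can be omitted in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 8]  # made-up data for illustration

fig, ax = plt.subplots()
ax.hist(values, bins=5, color="red", edgecolor="pink")

# Labels are set through methods on the Axes object.
ax.set_title("Frequency of made-up values")
ax.set_xlabel("value")
ax.set_ylabel("frequency")

fig.savefig("histogram_sketch.png")  # hypothetical output file name
```

Because the labels are set through methods on ax, there is no module attribute that an accidental assignment could overwrite.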

Scatterplot¶

As promised, we now plot a scatterplot. Throughout the course we used RStudio and the R language to plot scatterplots with a linear trendline. Doing the same in Python is quite different: here we use seaborn's regplot, which draws the points and the fitted line together, with matplotlib.pyplot for the title and labels.

In [7]:
import seaborn as sns
import matplotlib.pyplot as plt

# regplot draws the scatter points and a fitted linear trendline
sns.regplot(data=df_filtered_mfa, x='kg.retired', y='kg.recycled.weight',
            line_kws={"color": "blue"}, scatter_kws={"alpha": 0.4})
plt.title("Scatterplot of kg.retired vs kg.recycled.weight")
plt.xlabel("kg retired")
plt.ylabel("kg.recycled.weight")
plt.show()
[Figure: scatterplot of kg.retired vs kg.recycled.weight with linear trendline]

Normality/Data Analysis¶

One important aspect of data analysis that was not present in Project 1 is the normality assessment. We used two methods: a histogram and a quantile-quantile (QQ) plot. The residuals here are computed as kg.recycled.weight minus kg.retired. The histogram gives us moderate confidence, as the residuals are concentrated around 0.0. The QQ plot, however, shows most residuals concentrated at around -2.0 on the y axis. The points between -2 and 2 on the x axis lie close to the reference line, so the residuals in that range are approximately normal. Anything below -2 on the x axis deviates strongly from the line, which indicates a heavy left tail and lowers our confidence in normality there.

In [22]:
import seaborn as sns
import matplotlib.pyplot as plt

# Difference between recycled weight and retired weight for each row
# (not the residuals of the fitted regression line)
residuals = df_filtered_mfa['kg.recycled.weight'] - df_filtered_mfa['kg.retired']
sns.histplot(residuals, bins=20, kde=True, color='orange')
plt.title("Histogram of Residuals of Kg Retired vs Kg Recycled Weight")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
[Figure: histogram of residuals]
In [23]:
import scipy.stats as stats
import matplotlib.pyplot as plt
residuals = df_filtered_mfa['kg.recycled.weight'] - df_filtered_mfa['kg.retired']
# probplot plots the ordered residuals against theoretical normal quantiles
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("QQ Plot of Residuals")
plt.show()
[Figure: QQ plot of residuals]
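
The residuals plotted above are the raw difference between the two columns. Regression residuals are usually defined instead as the observed values minus the values predicted by the fitted line. The sketch below shows that calculation on synthetic data; the slope, intercept, and noise level are arbitrary assumptions and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic data: a noisy line with arbitrary, made-up parameters.
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 50)
y = 0.9 * x - 5 + rng.normal(scale=2.0, size=x.size)

# Fit the line, then subtract the fitted values from the observations.
fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted   # regression residuals
raw_diff = y - x         # plain column difference, as in the cells above

print(f"mean of regression residuals: {residuals.mean():.6f}")
print(f"mean of raw differences:      {raw_diff.mean():.6f}")
```

With an intercept in the model, least-squares residuals average to zero, whereas the raw difference carries the offset between the two columns. On the real data, x and y would be the kg.retired and kg.recycled.weight columns, and the resulting residuals could be passed to the same histogram and probplot calls.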

Additional Notes about Data Analysis¶

To perform additional analysis, including obtaining the correlation coefficient, we use the scipy library. We wanted the slope, the r coefficient, the standard error, and the intercept. We then fitted the same model with statsmodels, whose summary reports a very large condition number (3.31e+08) and warns of possible strong multicollinearity or other numerical problems. Because the model has only one predictor, this warning more likely reflects the very large scale of kg.retired than true multicollinearity between predictors. A separate issue in the data is the zero values: where the kg recycled weight was 0, the recycling process was "pyro", regardless of the material of the battery being recycled. The recycling process therefore influences the data, since the pyro process is associated with a kg.recycled.weight of 0 despite large kg.retired values.

In [25]:
from scipy import stats

x=df_filtered_mfa['kg.retired']
y=df_filtered_mfa['kg.recycled.weight']
slope,intercept,r_value,p_value,std_err=stats.linregress(x, y)
print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared: {r_value**2:.2f}")
print(f"P-value: {p_value:.4f}")
Slope: 0.90
Intercept: -105169.79
R-squared: 1.00
P-value: 0.0000
In [28]:
import statsmodels.api as sm
X = df_filtered_mfa['kg.retired']
y = df_filtered_mfa['kg.recycled.weight']
# Add a constant column so the model fits an intercept
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     kg.recycled.weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.623e+08
Date:                Thu, 26 Jun 2025   Prob (F-statistic):               0.00
Time:                        23:07:21   Log-Likelihood:            -2.5278e+06
No. Observations:              151776   AIC:                         5.056e+06
Df Residuals:                  151774   BIC:                         5.056e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.052e+05   1.16e+04     -9.094      0.000   -1.28e+05   -8.25e+04
kg.retired     0.9007    3.8e-05   2.37e+04      0.000       0.901       0.901
==============================================================================
Omnibus:                   107137.686   Durbin-Watson:                   1.256
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         13828849.858
Skew:                          -2.543   Prob(JB):                         0.00
Kurtosis:                      49.485   Cond. No.                     3.31e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
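
One way to explore the condition number warning is to compare the condition number of a design matrix built from a large-scale predictor with that of the same predictor after standardizing. This is only a sketch: the predictor values are made up, and numpy.linalg.cond is used here as a stand-in for the figure that statsmodels reports.

```python
import numpy as np

# Made-up predictor on a scale loosely resembling kg.retired.
x = np.array([3.7e5, 1.4e6, 2.4e6, 7.2e6, 5.0e7, 2.0e8])

# Design matrix with a constant column, analogous to what sm.add_constant builds.
X_raw = np.column_stack([np.ones_like(x), x])

# Same predictor, centred and scaled to unit standard deviation.
x_std = (x - x.mean()) / x.std()
X_std = np.column_stack([np.ones_like(x_std), x_std])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
print(f"condition number, raw predictor:          {cond_raw:.3g}")
print(f"condition number, standardized predictor: {cond_std:.3g}")
```

If the condition number falls sharply after standardizing, the warning comes from the scale of the predictor rather than from correlation between predictors.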

Conclusion¶

Even after removing the problematic rows, the regression summary still flags a large condition number, which may point to multicollinearity or, more likely with a single predictor, to the scale of the data. There is a strong linear relationship between the kg of batteries retired and the kg of recycled weight produced (slope 0.90, R-squared 1.00), but the recycling process type also affects the data. Rows with the "pyro" recycling process and zero recycled weight made our graphs look strange: one cluster of points followed the trendline, while another formed a flat horizontal line. By removing the rows where the process was pyro and the recycled weight was zero, we obtained a dataset that fits a line well in our scatterplot.

Reference of Data Set used:¶

Jessica Dunn, Alissa Kendall, Margaret Slattery, Electric vehicle lithium-ion battery recycled content standards for the US – targets, costs, and environmental impacts, Resources, Conservation and Recycling, Volume 185, 2022, ISSN 0921-3449, https://doi.org/10.1016/j.resconrec.2022.106488