Introduction and Setup¶
Continuing from Project 1, this project analyzes the battery recycling output further and examines some of the challenges in recycling that output. We use the Material Flow Analysis (MFA) CSV dataset. Using linear regression, together with a scatterplot and residual plots, we ask two questions: what is the relationship between the kg of batteries retired and the kg of recycled weight produced, and do any external parameters influence the data?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Introducing our Data/Importing Libraries¶
In R, we used ggplot2, plotly, tidyr, and dplyr to analyze and visualize our data. In Python, we likewise import libraries to read the data file and to plot it: pandas to read the CSV file, and matplotlib.pyplot and seaborn for the plots. First we load the original file and display its first few rows as a preview of the data.
df_mfa=pd.read_csv("OUTPUT_MFA.csv")
df_mfa.head()
| | Year | scenario.percent.repurposed | scenario.reuse.lifespan | material | recycling.process | Sales.Scenario | Cathode.Scenario | kWh.retired | kg.retired | kg.recycled.weight | kWh | kg.demand | circularity (%) | collection_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | high | reuse.high | Aluminum | direct | SDS | LFP | 2417264.362 | 7192595.151 | 6473335.636 | 20735767.32 | 68338836.46 | 0.094724 | 0.6 |
| 1 | 2020 | high | reuse.high | Aluminum | direct | STEPS | LFP | 2417264.362 | 7192595.151 | 6473335.636 | 20735767.32 | 68338836.46 | 0.094724 | 0.6 |
| 2 | 2020 | high | reuse.high | Aluminum | hydro | SDS | LFP | 2417264.362 | 7192595.151 | 6473335.636 | 20735767.32 | 68338836.46 | 0.094724 | 0.6 |
| 3 | 2020 | high | reuse.high | Aluminum | hydro | STEPS | LFP | 2417264.362 | 7192595.151 | 6473335.636 | 20735767.32 | 68338836.46 | 0.094724 | 0.6 |
| 4 | 2020 | high | reuse.high | Aluminum | pyro | SDS | LFP | 2417264.362 | 7192595.151 | 0.000 | 20735767.32 | 68338836.46 | 0.000000 | 0.6 |
Based on this preview, the dataset is large and has more variables than we need. To narrow it down, we select columns by name in pandas. Here we want only the kg.retired and kg.recycled.weight columns. To preview the selection, we call head() on the filtered dataset: head() returns the first 5 rows by default, and head(20) returns the first 20.
Inspecting the CSV file, the relationship between kg.retired and kg.recycled.weight looks close to linear. However, at a few points kg.recycled.weight drops to zero. The rows we could identify all had "pyro" as the recycling process. We removed those rows (pyro with zero recycled weight) from our new dataset.
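import pandas as pd
# Flag rows where the process is pyro and the recycled weight is zero
mask=df_mfa['recycling.process'].str.contains('pyro', case=False, na=False)&(df_mfa['kg.recycled.weight']==0)
# Keep everything except the flagged rows
df_filtered=df_mfa[~mask]
# Select only the two columns needed for the regression
df_filtered_mfa=df_filtered[['kg.retired', 'kg.recycled.weight']]
print(df_filtered_mfa.head())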
     kg.retired  kg.recycled.weight
0  7.192595e+06        6.473336e+06
1  7.192595e+06        6.473336e+06
2  7.192595e+06        6.473336e+06
3  7.192595e+06        6.473336e+06
6  3.697164e+05        3.327448e+05
print(df_filtered_mfa.head(20))
      kg.retired  kg.recycled.weight
0   7.192595e+06        6.473336e+06
1   7.192595e+06        6.473336e+06
2   7.192595e+06        6.473336e+06
3   7.192595e+06        6.473336e+06
6   3.697164e+05        3.327448e+05
7   3.697164e+05        3.327448e+05
8   3.697164e+05        3.623221e+05
9   3.697164e+05        3.623221e+05
10  3.697164e+05        3.327448e+05
11  3.697164e+05        3.327448e+05
12  1.444868e+06        1.300381e+06
13  1.444868e+06        1.300381e+06
14  1.444868e+06        1.300381e+06
15  1.444868e+06        1.300381e+06
16  1.444868e+06        1.300381e+06
17  1.444868e+06        1.300381e+06
18  2.364594e+06        2.128135e+06
19  2.364594e+06        2.128135e+06
20  2.364594e+06        2.128135e+06
21  2.364594e+06        2.128135e+06
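Before dropping rows, it can help to check which recycling processes actually produce a zero recycled weight. The sketch below does this on a small made-up DataFrame that reuses the column names of the MFA file; the values are illustrative assumptions and are not taken from OUTPUT_MFA.csv.

```python
import pandas as pd

# Toy data with the same column names as the MFA file; values are made up.
toy = pd.DataFrame({
    "recycling.process": ["direct", "hydro", "pyro", "pyro", "direct", "hydro"],
    "kg.retired": [100.0, 100.0, 100.0, 200.0, 200.0, 200.0],
    "kg.recycled.weight": [90.0, 90.0, 0.0, 0.0, 180.0, 180.0],
})

# Count the zero-weight rows within each recycling process.
zero_counts = (
    toy["kg.recycled.weight"].eq(0)
    .groupby(toy["recycling.process"])
    .sum()
)
print(zero_counts)

# Same mask as in the analysis: pyro rows with zero recycled weight.
mask = (
    toy["recycling.process"].str.contains("pyro", case=False, na=False)
    & toy["kg.recycled.weight"].eq(0)
)
toy_filtered = toy[~mask]
print(len(toy), "rows before,", len(toy_filtered), "rows after")
```

Running the same count on df_mfa would show whether any process other than pyro contributes zero-weight rows.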
Graph Time¶
Now that we have our dataset, how do we graph it in Python? We use two libraries, matplotlib.pyplot and seaborn. To demonstrate both, we plot histograms of our data. Both histograms are strongly right-skewed: most observations fall in the lowest bin, with a frequency of about 160000 for kg.retired between 0 and 0.2*1e9, and a long tail extends toward larger values.
Note that matplotlib does not add axis labels or a title by default. They appear only when plt.title(), plt.xlabel(), and plt.ylabel() are called as functions. Seaborn labels the axes automatically from the column name, but a title still has to be set explicitly.
import matplotlib.pyplot as plt
plt.hist(df_filtered_mfa['kg.recycled.weight'],bins=15, color='red', edgecolor='pink')
# Call the labelling functions; comparing them with == only prints False
plt.title('Frequency of kg recycled weight')
plt.xlabel('kg.recycled.weight')
plt.ylabel('frequency')
plt.show()
import seaborn as sns
sns.histplot(data=df_filtered_mfa,x="kg.retired",bins=15,color='red',edgecolor='pink')
# Call plt.title(...) rather than assigning to it, which would overwrite the function
plt.title('Frequency of kg batteries retired')
plt.xlabel('kg retired')
plt.ylabel('frequency')
plt.show()
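An alternative way to label a plot is matplotlib's object-oriented interface, where the title and labels are set on the Axes object instead of through the plt module. The sketch below is a minimal example on a made-up, right-skewed sample (an assumption standing in for kg.retired); the Agg backend is chosen only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up, right-skewed sample standing in for kg.retired (illustrative only).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1e6, size=500)

fig, ax = plt.subplots()
ax.hist(sample, bins=15, color="red", edgecolor="pink")

# Labels are set through methods on the Axes object.
ax.set_title("Frequency of kg batteries retired")
ax.set_xlabel("kg retired")
ax.set_ylabel("frequency")

print(ax.get_title(), "|", ax.get_xlabel(), "|", ax.get_ylabel())
```

Seaborn functions such as histplot accept an ax argument, so the same Axes object can be passed to them and labelled in the same way.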
Scatterplot¶
As promised, we now draw a scatterplot. Throughout the course we used RStudio and the R language to draw scatterplots with a linear trendline. Doing the same in Python works differently: we use seaborn's regplot, which draws the points and the fitted regression line, together with matplotlib.pyplot for the title and labels.
import seaborn as sns
import matplotlib.pyplot as plt
# regplot draws the scatter points and overlays a fitted regression line
sns.regplot(data=df_filtered_mfa, x='kg.retired',y='kg.recycled.weight',
line_kws={"color":"blue"},scatter_kws={"alpha":0.4})
plt.title("Scatterplot of Kg.retired vs Kg.Recycled Weight")
plt.xlabel("kg retired")
plt.ylabel("kg.recycled.weight")
plt.show()
Normality/Data Analysis¶
One important aspect of data analysis that was not covered in Project 1 is the normality assessment. We used two methods: a histogram and a quantile-quantile (QQ) plot of the residuals, computed here as kg.recycled.weight minus kg.retired. The histogram is somewhat reassuring, as the residuals are concentrated around 0.0. The QQ plot, however, shows most residuals concentrated at around -2.0 on the y axis. Points between -2 and 2 on the x axis lie close to the reference line, so that part of the data is consistent with normality. Anything below -2 on the x axis deviates strongly from the line, so the normality assumption is not reliable in that tail.
# Residuals here are the raw difference between recycled and retired weight
residuals=df_filtered_mfa['kg.recycled.weight']-df_filtered_mfa['kg.retired']
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(residuals, bins=20, kde=True, color='orange')
plt.title("Histogram of Residuals of Kg Retired vs Kg Recycled Weight")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
import scipy.stats as stats
import matplotlib.pyplot as plt
residuals=df_filtered_mfa['kg.recycled.weight']-df_filtered_mfa['kg.retired']
# probplot plots the ordered residuals against theoretical normal quantiles
stats.probplot(residuals,dist="norm",plot=plt)
plt.title("QQ Plot of Residuals")
plt.show()
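The residuals in the cells above are the raw difference between the two columns, which measures deviation from the line y = x. A common alternative is to take residuals about the fitted regression line. The sketch below compares the two on made-up data (the 0.9 slope and the noise level are illustrative assumptions, not values from the dataset), using scipy.stats.linregress for the fit.

```python
import numpy as np
from scipy import stats

# Made-up data: y is roughly 0.9 * x plus noise (illustrative only).
rng = np.random.default_rng(42)
x = rng.uniform(1e5, 1e7, size=200)
y = 0.9 * x + rng.normal(0, 1e4, size=200)

fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x

residuals_fit = y - fitted  # residuals about the fitted regression line
difference = y - x          # raw difference, as used in the cells above

print(f"mean fitted residual: {residuals_fit.mean():.3e}")
print(f"mean raw difference:  {difference.mean():.3e}")
```

Residuals about the fitted line average to zero by construction, whereas the raw difference is shifted by however far the slope is from 1, so the two can give different impressions in a histogram or QQ plot.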
Additional Notes about Data Analysis¶
To perform additional analysis, including obtaining the correlation coefficient, we use the SciPy library. We wanted the slope, the r coefficient, the standard error, and the intercept. We then fit the same model with statsmodels, whose summary raised an issue we did not resolve: it reports a very large condition number (3.31e+08) and warns of possible strong multicollinearity or other numerical problems. Since the model has a single predictor, this warning may reflect the large scale of kg.retired rather than true multicollinearity. Before filtering, the data also appeared to follow two distinct output patterns. When we identified the rows causing this, it turned out that wherever kg.recycled.weight was 0, the recycling process was "pyro", regardless of the battery material. The recycling process therefore influences the data: the pyro process can drive kg.recycled.weight to 0 despite large kg.retired values.
from scipy import stats
x=df_filtered_mfa['kg.retired']
y=df_filtered_mfa['kg.recycled.weight']
slope,intercept,r_value,p_value,std_err=stats.linregress(x, y)
print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-squared: {r_value**2:.2f}")
print(f"P-value: {p_value:.4f}")
Slope: 0.90
Intercept: -105169.79
R-squared: 1.00
P-value: 0.0000
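As a rough reading of these coefficients, a slope near 0.90 suggests that about 90% of the retired mass shows up as recycled weight. The sketch below applies the reported coefficients to the first row shown by df_mfa.head(); the slope is taken as 0.9007 from the statsmodels summary, since linregress prints it rounded to 0.90, and the variable names are chosen here for illustration.

```python
# Coefficients as reported by the regression output (rounded values).
slope = 0.9007
intercept = -105169.79

# First row of the dataset as shown by df_mfa.head().
kg_retired = 7192595.151
kg_recycled_actual = 6473335.636

# Prediction from the fitted line, and the observed recovery ratio.
kg_recycled_predicted = intercept + slope * kg_retired
recovery_ratio = kg_recycled_actual / kg_retired

print(f"predicted kg recycled: {kg_recycled_predicted:,.0f}")
print(f"actual kg recycled:    {kg_recycled_actual:,.0f}")
print(f"actual recovery ratio: {recovery_ratio:.3f}")
```

For this row the fitted line predicts slightly less than the observed value, because the negative intercept outweighs the small excess of the slope over 0.9.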
import statsmodels.api as sm
X=df_filtered_mfa['kg.retired']
y=df_filtered_mfa['kg.recycled.weight']
# Add an intercept column, since sm.OLS does not include one by default
X=sm.add_constant(X)
model=sm.OLS(y,X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: kg.recycled.weight R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 5.623e+08
Date: Thu, 26 Jun 2025 Prob (F-statistic): 0.00
Time: 23:07:21 Log-Likelihood: -2.5278e+06
No. Observations: 151776 AIC: 5.056e+06
Df Residuals: 151774 BIC: 5.056e+06
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -1.052e+05 1.16e+04 -9.094 0.000 -1.28e+05 -8.25e+04
kg.retired 0.9007 3.8e-05 2.37e+04 0.000 0.901 0.901
==============================================================================
Omnibus: 107137.686 Durbin-Watson: 1.256
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13828849.858
Skew: -2.543 Prob(JB): 0.00
Kurtosis: 49.485 Cond. No. 3.31e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
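With a constant and a single predictor, a large condition number often comes from the difference in scale between the column of ones and the predictor. The sketch below illustrates this with NumPy on a made-up predictor of roughly the same magnitude as kg.retired (an assumption, not the real data), comparing the condition number of the design matrix before and after standardizing the predictor.

```python
import numpy as np

# Made-up predictor on a scale similar to kg.retired (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(1e5, 1e9, size=1000)

# Design matrix [1, x], the layout that sm.add_constant produces.
X_raw = np.column_stack([np.ones_like(x), x])

# Same model with the predictor standardized to zero mean and unit variance.
x_std = (x - x.mean()) / x.std()
X_std = np.column_stack([np.ones_like(x_std), x_std])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)

print(f"condition number, raw x:          {cond_raw:.3e}")
print(f"condition number, standardized x: {cond_std:.3e}")
```

Standardizing changes the units of the slope but not the quality of the fit, so it is one way to check whether a condition-number warning points to a scaling issue or to a genuine problem in the data.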
Conclusion¶
Even after purging the problematic rows, the regression summary still reports a large condition number, which statsmodels flags as possible strong multicollinearity or other numerical problems. There is a strong linear relationship between the kg of batteries retired and the kg of recycled weight obtained from them (slope 0.90, R-squared 1.00), but the recycling process type also affects the data. Rows classified as "pyro" with zero recycled weight made our graphs look very strange: one cluster of points followed the trendline, while another formed a flat horizontal line. By removing those rows, we obtained a dataset that fits a line well in our scatterplot.
Reference of Data Set used:¶
Jessica Dunn, Alissa Kendall, Margaret Slattery, Electric vehicle lithium-ion battery recycled content standards for the US – targets, costs, and environmental impacts, Resources, Conservation and Recycling, Volume 185, 2022, ISSN 0921-3449, https://doi.org/10.1016/j.resconrec.2022.106488