HW KNN WNBA NBA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (12, 10)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
Big picture
In this hw we train a KNN model to predict a basketball player's league based on their height.
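To make the idea concrete, here is a minimal hand-rolled sketch of a one-dimensional KNN vote. The function name `knn_predict_1d` and the toy heights are made up for illustration; the homework itself uses scikit-learn's `KNeighborsClassifier`, which may break ties differently.

```python
from collections import Counter

def knn_predict_1d(train_heights, train_leagues, query_height, k=3):
    """Predict a league by majority vote among the k nearest training heights."""
    # Order the training points by absolute distance to the query height
    by_distance = sorted(
        zip(train_heights, train_leagues),
        key=lambda pair: abs(pair[0] - query_height),
    )
    # Keep the labels of the k closest points and return the most common one
    nearest_labels = [league for _, league in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up heights (inches) and leagues, for illustration only
toy_heights = [66.0, 68.5, 70.0, 72.0, 77.0, 79.5, 81.0, 83.0]
toy_leagues = ["WNBA", "WNBA", "WNBA", "WNBA", "NBA", "NBA", "NBA", "NBA"]

print(knn_predict_1d(toy_heights, toy_leagues, 69.0))
print(knn_predict_1d(toy_heights, toy_leagues, 80.0))
```

With these toy values, a short query height is surrounded by WNBA neighbors and a tall one by NBA neighbors, which is the whole mechanism the classifier relies on.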
Instructions
- Import the data file all528.csv as a pandas DataFrame. Write a paragraph describing the data set.
- We will use HEIGHT to predict LEAGUE. What kind of machine learning problem is this? We usually say $n \times p$ to represent the dimensions of the $X$ matrix. What are $n$ and $p$ for this problem?
- Split the data into training and test sets. Ask for a 70/30 split and use the value 20240923 to set the random_state. How many observations are in each set? What are the frequencies for WNBA and NBA in the training data?
- Make a scatter plot of the training data with a different color for each league. Explain why using marker=2 and marker=3 is a good idea. See the Matplotlib marker type documentation.
- With $k=3$, fit a KNN classifier to the training data, compute the $\hat y$'s for the training data, and make a scatter plot of the $\hat y$'s with a different color for each league. The model partitions the x-axis (the HEIGHT values) into WNBA regions and NBA regions. (Some regions are big; some are tiny. Two WNBA predictions are in the same region if there are no NBA predictions between them. Same for NBA predictions.) How many regions are there in all?
- Make a confusion matrix for this fit to the training data. Use the code in Train_versus_test.ipynb to display the matrix as a "nicely labeled dataframe" . Use Ground Truth and Model to label the axes. Use WNBA and NBA to label the rows and columns. Think a little to get the right label in the right place. Explain how you decided where to place the WNBA and NBA labels.
- Compute the accuracy of this fit to the training data.
- Using some larger values for $k$, fit the KNN model to the training data, compute the $\hat y$'s, and make the scatter plot of the predictions. What do you notice about the plot as $k$ increases? You don't have to show all of these plots in your final submission.
- What is the smallest value of $k$ which produces a scatter plot with exactly two regions (keep using the training data for this part)? Estimate the HEIGHT value which represents the boundary between WNBA and NBA. Display a confusion matrix for this fit and compute the accuracy.
- Use this same classifier to compute the $\hat y$'s for the test data. Produce a scatter plot, a confusion matrix and compute the accuracy. Compare the accuracy to the value obtained for the training data.
Code
Your code for this hw goes in the cells below. You will have to add more cells as you go. Include comments and/or markdown cells to relate your code to the problems above.
All of your plots should appear in this section.
You don't have to discuss results here because there's a section for that at the end of the hw.
#Step 1 Import the data
all = pd.read_csv('all528.csv')
# Display a numerical summary
summary_num = all.describe()
summary_num
# Display a summary of the DataFrame
# info() prints its report directly and returns None, so there is nothing to assign
all.info()
| | HEIGHT |
|---|---|
| count | 528.000000 |
| mean | 75.923977 |
| std | 4.715112 |
| min | 61.470000 |
| 25% | 72.895000 |
| 50% | 76.550000 |
| 75% | 79.432500 |
| max | 85.800000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   HEIGHT  528 non-null    float64
 1   LEAGUE  528 non-null    object
dtypes: float64(1), object(1)
memory usage: 8.4+ KB
#Step 2 Display the shape and first few rows of the dataset
all.shape
all.head()
# Sort the data by LEAGUE and HEIGHT
all_sort = all.sort_values(by=['LEAGUE', 'HEIGHT'], axis=0, ascending=[False, False], inplace=False)
# Display the sorted data
all_sort
(528, 2)
| | HEIGHT | LEAGUE |
|---|---|---|
| 0 | 74.95 | WNBA |
| 1 | 81.11 | NBA |
| 2 | 74.82 | NBA |
| 3 | 65.41 | WNBA |
| 4 | 72.05 | WNBA |
| | HEIGHT | LEAGUE |
|---|---|---|
| 426 | 80.71 | WNBA |
| 457 | 80.56 | WNBA |
| 462 | 79.78 | WNBA |
| 491 | 79.74 | WNBA |
| 500 | 79.09 | WNBA |
| ... | ... | ... |
| 25 | 71.59 | NBA |
| 254 | 71.56 | NBA |
| 83 | 71.55 | NBA |
| 340 | 71.53 | NBA |
| 258 | 71.27 | NBA |
528 rows × 2 columns
#Step 3 Split the data
X_train, X_test, y_train, y_test = train_test_split(all[['HEIGHT']], all['LEAGUE'], test_size=0.3, random_state=20240923)
# Number of observations in each set
n_train = X_train.shape[0]
n_test = X_test.shape[0]
# Frequencies of WNBA and NBA in the training data
freq_train = y_train.value_counts()
print('Number of observations in the training data:',n_train)
print('Number of observations in the testing data:',n_test)
print('Frequencies in the training data:\n',freq_train)
#X_train; y_train
#X_test; y_test
Number of observations in the training data: 369
Number of observations in the testing data: 159
Frequencies in the training data:
 LEAGUE
NBA     250
WNBA    119
Name: count, dtype: int64
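#Step 4 Scatter plot
# Select the heights for each league; sizing the y-vector from the data avoids hard-coded counts
wnba_train = X_train[y_train == 'WNBA']
nba_train = X_train[y_train == 'NBA']
plt.scatter(wnba_train, np.zeros(len(wnba_train)), color='Plum', marker=2, s=500, label='WNBA', linewidth=.75)
plt.scatter(nba_train, np.zeros(len(nba_train)), color='Lime', marker=3, s=500, label='NBA', linewidth=.75)
plt.xlabel('Height')
plt.title('Training Data')
plt.grid(color = 'black')
# Show the label= entries so the colors can be matched to leagues
plt.legend()
plt.show();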
#Step 5 Fit the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict on the training data
yhat = knn.predict(X_train)
predictions=knn.predict(X_train)
#Vectors of predictions
WNBA=X_train[predictions=='WNBA']
NBA=X_train[predictions=='NBA']
# Scatter plot of predictions
plt.scatter(x=WNBA,y=np.zeros(len(WNBA)), color='plum', marker=2, s=500, label='WNBA',linewidth=.75)
plt.scatter(x=NBA,y= np.zeros(len(NBA)), color='lime', marker=3, s=500,label='NBA',linewidth=.75)
plt.xlabel('Height')
plt.grid(color = 'Black')
plt.title('Predictions on Training Data')
plt.show();
# Fit KNN with larger k values
knn = KNeighborsClassifier(n_neighbors=150)
knn.fit(X_train, y_train)
# Predict on the training data
yhat = knn.predict(X_train)
predictions=knn.predict(X_train)
#Vectors of predictions
WNBA=X_train[predictions=='WNBA']
NBA=X_train[predictions=='NBA']
# Scatter plot of predictions
plt.scatter(x=WNBA, y= np.zeros(len(WNBA)), color='plum', marker=2, s=500, label='WNBA',linewidth=.75)
plt.scatter(x=NBA, y= np.zeros(len(NBA)), color='lime', marker=3, s=500,label='NBA',linewidth=.75)
plt.xlabel('Height')
plt.title('Predictions: k = 150')
plt.grid(color = 'black')
plt.show();
# Fit the KNN classifier - smallest k value that separates the data into exactly two regions
knn = KNeighborsClassifier(n_neighbors = 55)
knn.fit(X_train, y_train)
# Predict on the training data
yhat = knn.predict(X_train)
predictions=knn.predict(X_train)
#Vectors of predictions
WNBA=X_train[predictions=='WNBA']
NBA=X_train[predictions=='NBA']
# Scatter plot of predictions
plt.scatter(x=WNBA, y= np.zeros(len(WNBA)), color='plum', marker=2, s=500,label='WNBA',linewidth=.75)
plt.scatter(x=NBA, y= np.zeros(len(NBA)), color='lime', marker=3, s=500,label='NBA',linewidth=.75)
plt.xlabel('Height')
plt.title('Predictions: k = 55')
plt.grid(color = 'black')
plt.show();
Results and discussion
Use the cells below to report your results for each problem.
Each question has a markdown cell for you to fill. Write sentences and paragraphs. If the problem asks for numbers, include them in your text.
Some questions also have a markdown cell where you can display larger python objects (on this hw they are all confusion matrices).
Don't put plots in this section. They should be displayed above with the rest of your code.
Problem 1 The all528 dataset contains 528 rows and two columns, HEIGHT and LEAGUE, with no missing values. HEIGHT is a continuous quantitative variable that represents the player's height in inches; it ranges from 61.47 to 85.80 with a mean of about 75.92. LEAGUE is a qualitative variable that represents which professional basketball league the player belongs to (WNBA or NBA). HEIGHT is used as the predictor and LEAGUE is the target variable ($y$). The goal is to predict the league a player belongs to based on their height.
Problem 2 This is a supervised machine learning problem, specifically binary classification, because the target LEAGUE has 2 categories (WNBA and NBA). $n$ is 528, the number of observations (the players). $p$ is 1, the number of features (HEIGHT).
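As a quick check of how $n$ and $p$ are read off the feature matrix, the sketch below uses only the five rows shown in the `head()` output above, so it reports 5 rows rather than 528. The variable name `toy` is made up for illustration.

```python
import pandas as pd

# The five rows from the head() output above; the real file has 528 rows
toy = pd.DataFrame({
    "HEIGHT": [74.95, 81.11, 74.82, 65.41, 72.05],
    "LEAGUE": ["WNBA", "NBA", "NBA", "WNBA", "WNBA"],
})

X = toy[["HEIGHT"]]   # double brackets keep X two-dimensional (n rows, p columns)
y = toy["LEAGUE"]     # the target is a one-dimensional Series

n, p = X.shape
n_classes = y.nunique()
print(n, p, n_classes)
```

On the full data set the same two lines would give n = 528 and p = 1, matching the `(528, 2)` shape minus the target column.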
Problem 3 The data is split 70/30 into training and test sets. The training data has 369 observations and the test data has 159. The frequencies in the training data are WNBA: 119 and NBA: 250.
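The split sizes can be checked with a little arithmetic. This sketch assumes, as I understand scikit-learn's behavior for a fractional `test_size`, that the test count is rounded up and the training set receives the remainder.

```python
import math

n_total = 528
test_size = 0.3

# Assumption: the test count is the ceiling of test_size * n_total
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test

# The reported training frequencies should add up to the training size
nba_train, wnba_train = 250, 119

print(n_train, n_test, nba_train + wnba_train)
```

The result agrees with the counts printed by the notebook, and the two class frequencies sum to the training size.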
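Problem 4 Using different marker types gives each league a distinct shape, which makes it easier to see which data point belongs to which category. Markers 2 and 3 are tick marks that point up and down from the same baseline, so WNBA and NBA points at nearly the same height do not cover each other.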
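A small sketch of what the integer markers mean: in matplotlib, 2 and 3 correspond to the `TICKUP` and `TICKDOWN` constants in `matplotlib.markers`. The single height of 75.0 is a made-up value, and the `Agg` backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.markers import TICKUP, TICKDOWN

# Integer markers 2 and 3 are the tick-up and tick-down markers
print(TICKUP, TICKDOWN)

fig, ax = plt.subplots()
# Two made-up points at the same height: one tick points up, the other down,
# so both stay visible even though they share the same (x, y) position
ax.scatter([75.0], [0], marker=TICKUP, s=500, color="plum", label="WNBA")
ax.scatter([75.0], [0], marker=TICKDOWN, s=500, color="lime", label="NBA")
ax.set_xlabel("Height")
ax.legend()
plt.close(fig)
```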
Problem 5 There are 14 distinct regions when $k = 3$. As $k$ increases, the number of regions decreases.
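Counting regions by eye is error-prone, so here is a minimal sketch of one way to count them programmatically: sort the points by height and count the runs of identical predictions. The helper name `count_regions` and the toy values are made up for illustration.

```python
import numpy as np

def count_regions(heights, predictions):
    """Count maximal runs of identical predictions when points are ordered by height."""
    heights = np.asarray(heights, dtype=float)
    predictions = np.asarray(predictions)
    if predictions.size == 0:
        return 0
    # Reorder the predictions from shortest to tallest player
    sorted_preds = predictions[np.argsort(heights, kind="stable")]
    # A new region starts wherever a prediction differs from the one before it
    changes = np.sum(sorted_preds[1:] != sorted_preds[:-1])
    return int(changes) + 1

# Made-up heights and predictions, for illustration only
toy_heights = [70, 65, 80, 72, 78, 75]
toy_preds = ["WNBA", "WNBA", "NBA", "NBA", "NBA", "WNBA"]
print(count_regions(toy_heights, toy_preds))
```

In the notebook this could be called as `count_regions(X_train['HEIGHT'], yhat)` right after a fit, assuming `yhat` holds the predictions for that value of $k$.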
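Problem 6 Ground Truth is each player's actual league (WNBA or NBA) and Model is the league the classifier predicts. In the array returned by confusion_matrix, rows are the ground truth and columns are the predictions. When no labels argument is given, the classes appear in sorted order, so NBA comes before WNBA. The row sums confirm this: the first row sums to 250 (the NBA count in the training data) and the second to 119 (the WNBA count). I placed the labels to match that order.
# use this code cell to display results for Problem 6
# NOTE: yhat must hold the k = 3 predictions when this cell runs; later cells reassign it
## Confusion matrix (rows and columns follow sorted label order: NBA, then WNBA)
cmat = confusion_matrix(y_train, yhat)
cmat
pd.DataFrame(cmat, columns = pd.MultiIndex.from_tuples([('Model','NBA'),('Model','WNBA')]),
index=pd.MultiIndex.from_tuples([('Ground Truth','NBA'),('Ground Truth','WNBA')]))
array([[236, 14],
[ 27, 92]], dtype=int64)
| | | Model | |
|---|---|---|---|
| | | NBA | WNBA |
| Ground Truth | NBA | 236 | 14 |
| | WNBA | 27 | 92 |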
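Label placement depends on the order in which `confusion_matrix` arranges the classes. The sketch below, on made-up labels, shows the default sorted order and how passing `labels=` fixes the order explicitly so the dataframe labels can reuse the same list.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = ["WNBA", "NBA", "NBA", "WNBA", "NBA", "NBA"]
y_pred = ["WNBA", "NBA", "WNBA", "NBA", "NBA", "NBA"]

# Without labels=, rows and columns follow sorted label order: NBA, then WNBA
default_cm = confusion_matrix(y_true, y_pred)

# With labels=, the same list drives both the matrix and the dataframe labels
order = ["WNBA", "NBA"]
explicit_cm = confusion_matrix(y_true, y_pred, labels=order)
labeled = pd.DataFrame(
    explicit_cm,
    index=pd.MultiIndex.from_product([["Ground Truth"], order]),
    columns=pd.MultiIndex.from_product([["Model"], order]),
)

print(default_cm)
print(labeled)
```

Building the row and column labels from the same `order` list used in `labels=` removes the need to reason about which class comes first.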
Problem 7 The training accuracy is 0.8889 (rounded to 4 decimal places).
# Accuracy_score computes the accuracy of the model’s predictions
accuracy = accuracy_score(y_train, yhat)
print('Training Accuracy',accuracy)
Training Accuracy 0.8888888888888888
Problem 8 As $k$ increases, the number of regions shrinks and the region predicted as NBA extends further left along the x-axis, so more heights are classified as NBA. The WNBA region shrinks accordingly, and more WNBA players are misclassified as NBA.
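The limiting case of this trend can be sketched on made-up data: when $k$ equals the number of training points, every vote uses the whole training set, so the majority class is predicted everywhere. The arrays `X_toy` and `y_toy` are illustrative and not taken from all528.csv.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: 3 WNBA heights and 7 NBA heights (inches)
X_toy = np.array([[66.0], [68.0], [70.0], [74.0], [76.0],
                  [77.0], [78.0], [79.0], [80.0], [82.0]])
y_toy = np.array(["WNBA"] * 3 + ["NBA"] * 7)

# With k equal to the training size, each prediction is a vote over all points
knn_all = KNeighborsClassifier(n_neighbors=len(X_toy)).fit(X_toy, y_toy)
predicted_classes = set(map(str, knn_all.predict(X_toy)))
print(predicted_classes)
```

Because NBA players are the majority in the training data here as well, a very large $k$ pushes the model toward predicting NBA for every height.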
Problem 9 The smallest k value to produce exactly 2 regions is 55.
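One way to search for this value instead of refitting by hand is sketched below. The helper names `n_regions` and `smallest_k_with_two_regions` and the toy data are made up for illustration; like the scatter plots, the search only looks at predictions at the training heights.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def n_regions(heights, predictions):
    """Number of runs of identical predictions along the sorted height axis."""
    sorted_preds = np.asarray(predictions)[np.argsort(heights, kind="stable")]
    return int(np.sum(sorted_preds[1:] != sorted_preds[:-1])) + 1

def smallest_k_with_two_regions(X, y, k_values):
    """Return the first k in k_values whose training predictions form exactly two regions."""
    heights = np.asarray(X, dtype=float).ravel()
    for k in k_values:
        preds = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X)
        if n_regions(heights, preds) == 2:
            return k
    return None  # no candidate k produced exactly two regions

# Made-up data: heights 60..69 are WNBA except one NBA label at 65; 80..89 are NBA
X_toy = np.array([[float(h)] for h in list(range(60, 70)) + list(range(80, 90))])
y_toy = np.array(["WNBA"] * 10 + ["NBA"] * 10)
y_toy[5] = "NBA"

print(smallest_k_with_two_regions(X_toy, y_toy, k_values=[1, 3, 5]))
```

In the toy data, $k=1$ reproduces the stray NBA label as its own region, while $k=3$ outvotes it. On the homework data the call would look like `smallest_k_with_two_regions(X_train, y_train, range(1, 151))`, where the upper bound is an arbitrary choice.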
# Refit the KNN classifier with the smallest k that gives exactly two regions
knn = KNeighborsClassifier(n_neighbors = 55)
knn.fit(X_train, y_train)
# Predict on the training data
yhat = knn.predict(X_train)
## Confusion matrix (rows and columns follow sorted label order: NBA, then WNBA)
cmat = confusion_matrix(y_train, yhat)
cmat
pd.DataFrame(cmat, columns = pd.MultiIndex.from_tuples([('Model','NBA'),('Model','WNBA')]),
index=pd.MultiIndex.from_tuples([('Ground Truth','NBA'),('Ground Truth','WNBA')]))
# accuracy_score computes the fraction of predictions that match the true labels
accuracy = accuracy_score(y_train, yhat)
print('Training Accuracy for k = 55',accuracy)
KNeighborsClassifier(n_neighbors=55)
array([[236, 14],
[ 27, 92]], dtype=int64)
| | | Model | |
|---|---|---|---|
| | | NBA | WNBA |
| Ground Truth | NBA | 236 | 14 |
| | WNBA | 27 | 92 |
Training Accuracy for k = 55 0.8888888888888888
Training Accuracy for k = 55 0.8888888888888888
Problem 10 The training accuracy was 0.8889 and the test accuracy was 0.8679 (both rounded to 4 decimal places), so the test accuracy is about 0.02 lower. A small drop on unseen data is expected because the model was fit to the training data. A gap this small suggests the classifier generalizes reasonably well to new players rather than overfitting the training set.
# use this code cell to display results for Problem 10
# Predict on the test data
yhat_pred = knn.predict(X_test)
# Confusion matrix for test data
cmat_test = confusion_matrix(y_test, yhat_pred, labels=['WNBA', 'NBA'])
pd.DataFrame(cmat_test, columns = pd.MultiIndex.from_tuples([('Model','WNBA'),('Model','NBA')]),
index=pd.MultiIndex.from_tuples([('Ground Truth','WNBA'),('Ground Truth','NBA')]))
# Accuracy on test data
accuracy = accuracy_score(y_test, yhat_pred)
print('Test Accuracy', accuracy)
| | | Model | |
|---|---|---|---|
| | | WNBA | NBA |
| Ground Truth | WNBA | 37 | 12 |
| | NBA | 9 | 101 |
Test Accuracy 0.8679245283018868
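The two accuracies can be recovered directly from the confusion matrices, since accuracy is the sum of the diagonal divided by the total count. This sketch copies the counts from the matrices displayed above; the helper name `accuracy_from_cm` is made up for illustration.

```python
import numpy as np

def accuracy_from_cm(cm):
    """Accuracy is the number of correct predictions (the diagonal) over all predictions."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Counts copied from the training and test confusion matrices above
train_cm = [[236, 14], [27, 92]]
test_cm = [[37, 12], [9, 101]]

train_acc = accuracy_from_cm(train_cm)
test_acc = accuracy_from_cm(test_cm)
gap = train_acc - test_acc
print(round(train_acc, 4), round(test_acc, 4), round(gap, 4))
```

The diagonal does not depend on which class is listed first, so the result is the same for either label order.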
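One more question: How does Calvin Murphy relate to this hw? He was a basketball player and later a commentator for the Houston Rockets. At 5 ft 9 in (69 inches), he was one of the shortest players in NBA history, so a classifier based only on height would most likely predict WNBA for him.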