Linear Regression


The data set: USA_Housing.csv.

The data contains the following columns:

  • Avg. Area Income: average income of residents of the city the house is located in
  • Avg. Area House Age: average age of houses in the same city
  • Avg. Area Number of Rooms: average number of rooms for houses in the same city
  • Avg. Area Number of Bedrooms: average number of bedrooms for houses in the same city
  • Area Population: population of the city the house is located in
  • Price: price the house sold at
  • Address: address of the house

Import Libraries

> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

> USAhousing = pd.read_csv('USA_Housing.csv')
+ USAhousing.head()
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price Address
79545.46 5.682861 7.009188 4.09 23086.80 1059033.6 208 Michael Ferry Apt. 674 Laurabury, NE 37010-5101
79248.64 6.002900 6.730821 3.09 40173.07 1505890.9 188 Johnson Views Suite 079 Lake Kathleen, CA 48958
61287.07 5.865890 8.512727 5.13 36882.16 1058988.0 9127 Elizabeth Stravenue Danieltown, WI 06482-3489
63345.24 7.188236 5.586729 3.26 34310.24 1260616.8 USS Barnett FPO AP 44820
59982.20 5.040554 7.839388 4.23 26354.11 630943.5 USNS Raymond FPO AE 09386
> USAhousing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
> housd = USAhousing.describe()
+ housd
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
count 5000.00 5000.0000000 5000.000000 5000.000000 5000.0000 5000.00
mean 68583.11 5.9772220 6.987792 3.981330 36163.5160 1232072.65
std 10657.99 0.9914562 1.005833 1.234137 9925.6501 353117.63
min 17796.63 2.6443042 3.236194 2.000000 172.6107 15938.66
25% 61480.56 5.3222830 6.299250 3.140000 29403.9287 997577.14
50% 68804.29 5.9704289 7.002902 4.050000 36199.4067 1232669.38
> USAhousing.columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

Exploratory Data Analysis

> sns.set_style('darkgrid')
+ sns.pairplot(USAhousing);
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.distplot(USAhousing['Price']);
+ plt.show()

> houscorr = USAhousing.corr()
+ houscorr
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
Avg. Area Income 1.0000000 -0.0020068 -0.0110317 0.0197882 -0.0162337 0.6397338
Avg. Area House Age -0.0020068 1.0000000 -0.0094283 0.0061489 -0.0187428 0.4525425
Avg. Area Number of Rooms -0.0110317 -0.0094283 1.0000000 0.4626949 0.0020399 0.3356645
Avg. Area Number of Bedrooms 0.0197882 0.0061489 0.4626949 1.0000000 -0.0221676 0.1710710
Area Population -0.0162337 -0.0187428 0.0020399 -0.0221676 1.0000000 0.4085559
Price 0.6397338 0.4525425 0.3356645 0.1710710 0.4085559 1.0000000
> plt.figure(figsize=(8,6))
+ sns.heatmap(USAhousing.corr(), 
+             annot=True, cmap='coolwarm');
+ plt.tight_layout()
+ plt.show()

Training the Model

First we need to split the data into an X array containing the features to train on, and a y array with the target variable, in this case the Price column. We will drop the Address column because it only contains text that the linear regression model can’t use.

> X = USAhousing[['Avg. Area Income',
+                 'Avg. Area House Age', 
+                 'Avg. Area Number of Rooms',
+                 'Avg. Area Number of Bedrooms', 
+                 'Area Population']]
+ y = USAhousing['Price']

Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

> from sklearn.model_selection import train_test_split
> # tuple unpacking
+ # test_size = fraction of the data allocated to the test set
+ # random_state = seed for reproducible splits
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.4, random_state=101)

Creating the Model

> from sklearn.linear_model import LinearRegression
+ lm = LinearRegression()
+ lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Model Evaluation

Let’s evaluate the model by checking out its coefficients and how we can interpret them.

> # print the intercept
+ print(lm.intercept_)
-2640159.796851911
> # index by feature name, one column of coefficients
+ coeff_df = pd.DataFrame(lm.coef_,
+             X.columns,columns=['Coefficient'])
+ coeff_df
Coefficient
Avg. Area Income 21.52828
Avg. Area House Age 164883.28203
Avg. Area Number of Rooms 122368.67803
Avg. Area Number of Bedrooms 2233.80186
Area Population 15.15042

Interpreting the coefficients (a quick numerical check follows this list):

  • Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with a price increase of $21.53.
  • Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with a price increase of $164883.28.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with a price increase of $122368.68.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with a price increase of $2233.80.
  • Holding all other features fixed, a 1 unit increase in Area Population is associated with a price increase of $15.15.
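
Because the model is linear, bumping a single feature by one unit while holding the rest fixed must move the prediction by exactly that feature’s coefficient. A minimal sketch of this sanity check, reusing the lm and X_test objects from above:

> # bump Avg. Area Income by one unit and compare predictions
+ row = X_test.iloc[[0]].copy()
+ bumped = row.copy()
+ bumped['Avg. Area Income'] += 1
+ print(lm.predict(bumped)[0] - lm.predict(row)[0])  # ~ 21.53, i.e. lm.coef_[0]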

Does this make sense? Probably not, because this data is artificial. If you want real data to repeat this sort of analysis with, check out the Boston housing dataset: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html (note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2).

> from sklearn.datasets import load_boston
+ boston = load_boston()
+ boston.keys()
+ print(boston.DESCR)
+ print(boston['data'])
+ print(boston['feature_names'])
+ print(boston['target'])
+ # boston.data is a NumPy array; wrap it in a DataFrame
+ boston_df = pd.DataFrame(boston.data,
+                          columns=boston.feature_names)

Model Predictions

> predictions = lm.predict(X_test)
+ predictions
array([1260960.70567626,  827588.75560352, 1742421.24254328, ...,
        372191.40626952, 1365217.15140895, 1914519.54178824])
> plt.figure(figsize=(8,6))
+ plt.scatter(y_test,predictions)
+ plt.xlabel('Actual Price')
+ plt.ylabel('Predicted Price');
+ plt.show()

Residual Histogram

> # residuals should look roughly normally distributed
+ plt.figure(figsize=(8,6))
+ sns.distplot((y_test-predictions),bins=50);
+ plt.show()
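
Beyond eyeballing the histogram, a Q-Q plot gives a sharper visual check of normality: if the residuals are normal, their quantiles fall on a straight line. A minimal sketch, assuming SciPy is installed:

> from scipy import stats
+ fig, ax = plt.subplots(figsize=(8,6))
+ # residual quantiles vs. theoretical normal quantiles
+ stats.probplot(y_test - predictions, dist='norm', plot=ax)
+ plt.show()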

Regression Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

\[\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|\]

Mean Squared Error (MSE) is the mean of the squared errors:

\[\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2\]

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

\[\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}\]

Comparing these metrics:

  • MAE is the easiest to understand, because it’s the average error.
  • MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.

All of these are loss functions, because we want to minimize them.
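
All three formulas translate directly into NumPy one-liners, which makes a useful sanity check against the library functions. A minimal sketch, reusing the y_test and predictions arrays from above:

> errors = y_test - predictions
+ print('MAE: ', np.mean(np.abs(errors)))      # mean absolute error
+ print('MSE: ', np.mean(errors**2))           # mean squared error
+ print('RMSE:', np.sqrt(np.mean(errors**2)))  # root mean squared error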

> from sklearn import metrics
> print('MAE:', metrics.mean_absolute_error(
+     y_test, predictions))
MAE: 82288.22251914957
> print('MSE:', metrics.mean_squared_error(
+     y_test, predictions))
MSE: 10460958907.209507
> print('RMSE:', np.sqrt(metrics.mean_squared_error(
+     y_test, predictions)))
RMSE: 102278.82922291156

In addition to the three error metrics, we can report the R² score: the fraction of variance in the target that the model explains.

> print('Rsq.:', metrics.r2_score(
+     y_test, predictions))
Rsq.: 0.91768240096492

Regression Exercise


The Data

We’ll work with the Ecommerce Customers csv file. It has customer info, such as Email, Address, and their color Avatar. It also has these numerical columns:

  • Avg. Session Length: average length of in-store style advice sessions
  • Time on App: average time spent on the app, in minutes
  • Time on Website: average time spent on the website, in minutes
  • Length of Membership: how many years the customer has been a member
  1. Read in the Ecommerce Customers csv file as a DataFrame called customers.
> customers = pd.read_csv("Ecommerce Customers")
  2. Check the head of customers, and check out its info() and describe() methods.
> customers.head()
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
835 Frank Tunnel Wrightmouth, MI 82180-9605 Violet 34.49727 12.65565 39.57767 4.082621 587.9511
4547 Archer Common Diazchester, CA 06566-8576 DarkGreen 31.92627 11.10946 37.26896 2.664034 392.2049
24645 Valerie Unions Suite 582 Cobbborough, DC 99414-7564 Bisque 33.00091 11.33028 37.11060 4.104543 487.5475
1414 David Throughway Port Jason, OH 22070-1220 SaddleBrown 34.30556 13.71751 36.72128 3.120179 581.8523
14023 Rodriguez Passage Port Jacobville, PR 37242-1057 MediumAquaMarine 33.33067 12.79519 37.53665 4.446308 599.4061
> customers.describe()
Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
count 500.0000000 500.0000000 500.000000 500.0000000 500.00000
mean 33.0531935 12.0524879 37.060445 3.5334616 499.31404
std 0.9925631 0.9942156 1.010489 0.9992775 79.31478
min 29.5324290 8.5081522 33.913847 0.2699011 256.67058
25% 32.3418220 11.3881534 36.349257 2.9304497 445.03828
50% 33.0820076 11.9832313 37.069367 3.5339750 498.88788
> customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email                   500 non-null object
Address                 500 non-null object
Avatar                  500 non-null object
Avg. Session Length     500 non-null float64
Time on App             500 non-null float64
Time on Website         500 non-null float64
Length of Membership    500 non-null float64
Yearly Amount Spent     500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

Exploratory Data Analysis

> sns.set_palette("dark")
+ sns.set_style('whitegrid')
  3. Use seaborn to create a jointplot to compare the Time on Website and Yearly Amount Spent columns. Does the correlation make sense?
> # Time on Website shows little to no correlation with spending.
+ sns.jointplot(x='Time on Website',
+               y='Yearly Amount Spent',data=customers);
+ plt.show()

  4. Do the same but with the Time on App column instead.
> sns.jointplot(x='Time on App',
+               y='Yearly Amount Spent',data=customers);
+ plt.show()

  5. Use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.
> sns.jointplot(x='Time on App',
+               y='Length of Membership',
+               kind='hex',data=customers);
+ plt.show()

  6. Let’s explore these types of relationships across the entire data set with a pairplot.
> sns.pairplot(customers);
+ plt.show()

  7. Based on this plot, what looks to be the most correlated feature with Yearly Amount Spent?

Length of Membership
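
One way to confirm that read of the pairplot is to compute each numeric column’s correlation with Yearly Amount Spent directly; a quick sketch using the customers DataFrame from above (Length of Membership should come out highest):

> # correlation of each numeric column with the target, sorted ascending
+ num = customers.select_dtypes(include='number')
+ num.corr()['Yearly Amount Spent'].sort_values()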

  8. Create a linear model plot (using seaborn’s lmplot) of Yearly Amount Spent vs. Length of Membership.
> # lmplot is figure-level, so it manages its own figure
+ sns.lmplot(x='Length of Membership',
+            y='Yearly Amount Spent',data=customers);
+ plt.show()

Training the Model

  9. Set a variable X equal to the numerical features of the customers and a variable y equal to the “Yearly Amount Spent” column.
> y = customers['Yearly Amount Spent']
> X = customers[['Avg. Session Length',
+                'Time on App',
+                'Time on Website',
+                'Length of Membership']]
  10. Use model_selection.train_test_split from sklearn to split the data into training and testing sets. Set test_size=0.3 and random_state=101.
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.3, random_state=101)
  11. Create an instance of a LinearRegression() model named lm.
> lm = LinearRegression()
  12. Train/fit lm on the training data.
> lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
  13. Print out the coefficients of the model.
> # The coefficients
+ print('Coefficients: \n', lm.coef_)
Coefficients: 
 [25.98154972 38.59015875  0.19040528 61.27909654]

Predicting Test Data

  14. Use lm.predict() to predict on the X_test set of the data.
> predictions = lm.predict(X_test)
  15. Create a scatterplot of the real test values versus the predicted values.
> plt.figure(figsize=(8,6))
+ plt.scatter(y_test,predictions)
+ plt.xlabel('Y Test')
+ plt.ylabel('Predicted Y');
+ plt.show()

Evaluating the Model

  16. Calculate the Mean Absolute Error, Mean Squared Error, and the Root Mean Squared Error.
> print('MAE:', metrics.mean_absolute_error(
+     y_test, predictions))
MAE: 7.228148653430838
> print('MSE:', metrics.mean_squared_error(
+     y_test, predictions))
MSE: 79.81305165097461
> print('RMSE:', np.sqrt(metrics.mean_squared_error(
+     y_test, predictions)))
RMSE: 8.933815066978642
> print('Rsq.:', metrics.r2_score(
+     y_test, predictions))
Rsq.: 0.9890046246741234
  17. Plot a histogram of the residuals and make sure it looks normally distributed. Use either seaborn distplot, or just plt.hist().
> plt.figure(figsize=(8,6))
+ sns.distplot((y_test-predictions),bins=50);
+ plt.show()

Conclusion

> coefficients = pd.DataFrame(lm.coef_,X.columns)
+ coefficients.columns = ['Coefficient']
+ coefficients
                      Coefficient
Avg. Session Length     25.981550
Time on App             38.590159
Time on Website          0.190405
Length of Membership    61.279097
  18. How can you interpret these coefficients?
  • Holding all other features fixed, a 1 unit increase in Avg. Session Length is associated with an increase of 25.98 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Time on App is associated with an increase of 38.59 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Time on Website is associated with an increase of 0.19 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Length of Membership is associated with an increase of 61.28 total dollars spent.