Linear Regression


The data set: USA_Housing.csv.

The data contains the following columns:

  • Avg. Area Income: average income of residents of the city the house is located in
  • Avg. Area House Age: average age of houses in the same city
  • Avg. Area Number of Rooms: average number of rooms for houses in the same city
  • Avg. Area Number of Bedrooms: average number of bedrooms for houses in the same city
  • Area Population: population of the city the house is located in
  • Price: price the house sold at
  • Address: address of the house

Import Libraries

> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

> USAhousing = pd.read_csv('USA_Housing.csv')
+ USAhousing.head()
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price Address
79545.46 5.682861 7.009188 4.09 23086.80 1059033.6 208 Michael Ferry Apt. 674 Laurabury, NE 37010-5101
79248.64 6.002900 6.730821 3.09 40173.07 1505890.9 188 Johnson Views Suite 079 Lake Kathleen, CA 48958
61287.07 5.865890 8.512727 5.13 36882.16 1058988.0 9127 Elizabeth Stravenue Danieltown, WI 06482-3489
63345.24 7.188236 5.586729 3.26 34310.24 1260616.8 USS Barnett FPO AP 44820
59982.20 5.040554 7.839388 4.23 26354.11 630943.5 USNS Raymond FPO AE 09386
> USAhousing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
> housd = USAhousing.describe()
+ housd
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
count 5000.00 5000.0000000 5000.000000 5000.000000 5000.0000 5000.00
mean 68583.11 5.9772220 6.987792 3.981330 36163.5160 1232072.65
std 10657.99 0.9914562 1.005833 1.234137 9925.6501 353117.63
min 17796.63 2.6443042 3.236194 2.000000 172.6107 15938.66
25% 61480.56 5.3222830 6.299250 3.140000 29403.9287 997577.14
50% 68804.29 5.9704289 7.002902 4.050000 36199.4067 1232669.38
> USAhousing.columns
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')

Exploratory Data Analysis

> sns.set_style('darkgrid')
+ sns.pairplot(USAhousing);
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.distplot(USAhousing['Price']);
+ plt.show()

> houscorr = USAhousing.corr()
+ houscorr
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
Avg. Area Income 1.0000000 -0.0020068 -0.0110317 0.0197882 -0.0162337 0.6397338
Avg. Area House Age -0.0020068 1.0000000 -0.0094283 0.0061489 -0.0187428 0.4525425
Avg. Area Number of Rooms -0.0110317 -0.0094283 1.0000000 0.4626949 0.0020399 0.3356645
Avg. Area Number of Bedrooms 0.0197882 0.0061489 0.4626949 1.0000000 -0.0221676 0.1710710
Area Population -0.0162337 -0.0187428 0.0020399 -0.0221676 1.0000000 0.4085559
Price 0.6397338 0.4525425 0.3356645 0.1710710 0.4085559 1.0000000
> plt.figure(figsize=(8,6))
+ sns.heatmap(USAhousing.corr(), 
+             annot=True, cmap='coolwarm');
+ plt.tight_layout()
+ plt.show()

Training the Model

First we need to split the data into an X array containing the features to train on, and a y array with the target variable, in this case the Price column. We will drop the Address column because it only contains text that the linear regression model can’t use.

> X = USAhousing[['Avg. Area Income',
+                 'Avg. Area House Age', 
+                 'Avg. Area Number of Rooms',
+                 'Avg. Area Number of Bedrooms', 
+                 'Area Population']]
+ y = USAhousing['Price']

Train Test Split

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

> from sklearn.model_selection import train_test_split
> # tuple unpacking
+ # test_size = fraction of the data allocated to the test set
+ # random_state = seed for reproducible splits
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.4, random_state=101)

Creating the Model

> from sklearn.linear_model import LinearRegression
+ lm = LinearRegression()
+ lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Model Evaluation

Let’s evaluate the model by checking out its coefficients and how we can interpret them.

> # print the intercept
+ print(lm.intercept_)
-2640159.796851911
> # index by feature name, one column of coefficients
+ coeff_df = pd.DataFrame(lm.coef_,
+             X.columns,columns=['Coefficient'])
+ coeff_df
Coefficient
Avg. Area Income 21.52828
Avg. Area House Age 164883.28203
Avg. Area Number of Rooms 122368.67803
Avg. Area Number of Bedrooms 2233.80186
Area Population 15.15042

Interpreting the coefficients (a quick numerical check follows this list):

  • Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with a price increase of $21.53.
  • Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with a price increase of $164883.28.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with a price increase of $122368.68.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with a price increase of $2233.80.
  • Holding all other features fixed, a 1 unit increase in Area Population is associated with a price increase of $15.15.
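
Because the model is linear, bumping a single feature by one unit while holding the rest fixed must move the prediction by exactly that feature’s coefficient. A minimal sketch of this sanity check, reusing the lm and X_test objects from above:

> # bump Avg. Area Income by one unit and compare predictions
+ row = X_test.iloc[[0]].copy()
+ bumped = row.copy()
+ bumped['Avg. Area Income'] += 1
+ print(lm.predict(bumped)[0] - lm.predict(row)[0])  # ~ 21.53, i.e. lm.coef_[0]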

Does this make sense? Probably not, because this data is artificial. If you want real data to repeat this sort of analysis with, check out the Boston housing dataset: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html (note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2).

> from sklearn.datasets import load_boston
+ boston = load_boston()
+ boston.keys()
+ print(boston.DESCR)
+ print(boston['data'])
+ print(boston['feature_names'])
+ print(boston['target'])
+ # boston.data is a NumPy array; wrap it in a DataFrame
+ boston_df = pd.DataFrame(boston.data,
+                          columns=boston.feature_names)

Model Predictions

> predictions = lm.predict(X_test)
+ predictions
array([1260960.70567626,  827588.75560352, 1742421.24254328, ...,
        372191.40626952, 1365217.15140895, 1914519.54178824])
> plt.figure(figsize=(8,6))
+ plt.scatter(y_test,predictions)
+ plt.xlabel('Actual Price')
+ plt.ylabel('Predicted Price');
+ plt.show()

Residual Histogram

> # residuals should look roughly normally distributed
+ plt.figure(figsize=(8,6))
+ sns.distplot((y_test-predictions),bins=50);
+ plt.show()
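
Beyond eyeballing the histogram, a Q-Q plot gives a sharper visual check of normality: if the residuals are normal, their quantiles fall on a straight line. A minimal sketch, assuming SciPy is installed:

> from scipy import stats
+ fig, ax = plt.subplots(figsize=(8,6))
+ # residual quantiles vs. theoretical normal quantiles
+ stats.probplot(y_test - predictions, dist='norm', plot=ax)
+ plt.show()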

Regression Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

\[\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|\]

Mean Squared Error (MSE) is the mean of the squared errors:

\[\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2\]

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

\[\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}\]

Comparing these metrics:

  • MAE is the easiest to understand, because it’s the average error.
  • MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.

All of these are loss functions, because we want to minimize them.
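
All three formulas translate directly into NumPy one-liners, which makes a useful sanity check against the library functions. A minimal sketch, reusing the y_test and predictions arrays from above:

> errors = y_test - predictions
+ print('MAE: ', np.mean(np.abs(errors)))      # mean absolute error
+ print('MSE: ', np.mean(errors**2))           # mean squared error
+ print('RMSE:', np.sqrt(np.mean(errors**2)))  # root mean squared error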

> from sklearn import metrics
> print('MAE:', metrics.mean_absolute_error(
+     y_test, predictions))
MAE: 82288.22251914957
> print('MSE:', metrics.mean_squared_error(
+     y_test, predictions))
MSE: 10460958907.209507
> print('RMSE:', np.sqrt(metrics.mean_squared_error(
+     y_test, predictions)))
RMSE: 102278.82922291156

In addition to the three error metrics, we can report the R² score: the fraction of variance in the target that the model explains.

> print('Rsq.:', metrics.r2_score(
+     y_test, predictions))
Rsq.: 0.91768240096492

Regression Exercise


The Data

We’ll work with the Ecommerce Customers csv file. It has customer info, such as Email, Address, and their color Avatar. It also has these numerical columns:

  • Avg. Session Length: average length of in-store style advice sessions
  • Time on App: average time spent on the app, in minutes
  • Time on Website: average time spent on the website, in minutes
  • Length of Membership: how many years the customer has been a member
  1. Read in the Ecommerce Customers csv file as a DataFrame called customers.
> customers = pd.read_csv("Ecommerce Customers")
  2. Check the head of customers, and check out its info() and describe() methods.
> customers.head()
Email Address Avatar Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
835 Frank Tunnel Wrightmouth, MI 82180-9605 Violet 34.49727 12.65565 39.57767 4.082621 587.9511
4547 Archer Common Diazchester, CA 06566-8576 DarkGreen 31.92627 11.10946 37.26896 2.664034 392.2049
24645 Valerie Unions Suite 582 Cobbborough, DC 99414-7564 Bisque 33.00091 11.33028 37.11060 4.104543 487.5475
1414 David Throughway Port Jason, OH 22070-1220 SaddleBrown 34.30556 13.71751 36.72128 3.120179 581.8523
14023 Rodriguez Passage Port Jacobville, PR 37242-1057 MediumAquaMarine 33.33067 12.79519 37.53665 4.446308 599.4061
> customers.describe()
Avg. Session Length Time on App Time on Website Length of Membership Yearly Amount Spent
count 500.0000000 500.0000000 500.000000 500.0000000 500.00000
mean 33.0531935 12.0524879 37.060445 3.5334616 499.31404
std 0.9925631 0.9942156 1.010489 0.9992775 79.31478
min 29.5324290 8.5081522 33.913847 0.2699011 256.67058
25% 32.3418220 11.3881534 36.349257 2.9304497 445.03828
50% 33.0820076 11.9832313 37.069367 3.5339750 498.88788
> customers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email                   500 non-null object
Address                 500 non-null object
Avatar                  500 non-null object
Avg. Session Length     500 non-null float64
Time on App             500 non-null float64
Time on Website         500 non-null float64
Length of Membership    500 non-null float64
Yearly Amount Spent     500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

Exploratory Data Analysis

> sns.set_palette("dark")
+ sns.set_style('whitegrid')
  3. Use seaborn to create a jointplot to compare the Time on Website and Yearly Amount Spent columns. Does the correlation make sense?
> # Time on Website shows little to no correlation with spending.
+ sns.jointplot(x='Time on Website',
+               y='Yearly Amount Spent',data=customers);
+ plt.show()

  4. Do the same but with the Time on App column instead.
> sns.jointplot(x='Time on App',
+               y='Yearly Amount Spent',data=customers);
+ plt.show()

  5. Use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership.
> sns.jointplot(x='Time on App',
+               y='Length of Membership',
+               kind='hex',data=customers);
+ plt.show()

  6. Let’s explore these types of relationships across the entire data set with a pairplot.
> sns.pairplot(customers);
+ plt.show()

  7. Based on this plot, what looks to be the most correlated feature with Yearly Amount Spent?

Length of Membership
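
One way to confirm that read of the pairplot is to compute each numeric column’s correlation with Yearly Amount Spent directly; a quick sketch using the customers DataFrame from above (Length of Membership should come out highest):

> # correlation of each numeric column with the target, sorted ascending
+ num = customers.select_dtypes(include='number')
+ num.corr()['Yearly Amount Spent'].sort_values()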

  8. Create a linear model plot (using seaborn’s lmplot) of Yearly Amount Spent vs. Length of Membership.
> # lmplot is figure-level, so it manages its own figure
+ sns.lmplot(x='Length of Membership',
+            y='Yearly Amount Spent',data=customers);
+ plt.show()

Training the Model

  9. Set a variable X equal to the numerical features of the customers and a variable y equal to the “Yearly Amount Spent” column.
> y = customers['Yearly Amount Spent']
> X = customers[['Avg. Session Length',
+                'Time on App',
+                'Time on Website',
+                'Length of Membership']]
  10. Use model_selection.train_test_split from sklearn to split the data into training and testing sets. Set test_size=0.3 and random_state=101.
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.3, random_state=101)
  11. Create an instance of a LinearRegression() model named lm.
> lm = LinearRegression()
  12. Train/fit lm on the training data.
> lm.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
  13. Print out the coefficients of the model.
> # The coefficients
+ print('Coefficients: \n', lm.coef_)
Coefficients: 
 [25.98154972 38.59015875  0.19040528 61.27909654]

Predicting Test Data

  14. Use lm.predict() to predict on the X_test set of the data.
> predictions = lm.predict(X_test)
  15. Create a scatterplot of the real test values versus the predicted values.
> plt.figure(figsize=(8,6))
+ plt.scatter(y_test,predictions)
+ plt.xlabel('Y Test')
+ plt.ylabel('Predicted Y');
+ plt.show()

Evaluating the Model

  16. Calculate the Mean Absolute Error, Mean Squared Error, and the Root Mean Squared Error.
> print('MAE:', metrics.mean_absolute_error(
+     y_test, predictions))
MAE: 7.228148653430838
> print('MSE:', metrics.mean_squared_error(
+     y_test, predictions))
MSE: 79.81305165097461
> print('RMSE:', np.sqrt(metrics.mean_squared_error(
+     y_test, predictions)))
RMSE: 8.933815066978642
> print('Rsq.:', metrics.r2_score(
+     y_test, predictions))
Rsq.: 0.9890046246741234
  17. Plot a histogram of the residuals and make sure it looks normally distributed. Use either seaborn distplot, or just plt.hist().
> plt.figure(figsize=(8,6))
+ sns.distplot((y_test-predictions),bins=50);
+ plt.show()

Conclusion

> coefficients = pd.DataFrame(lm.coef_,X.columns)
+ coefficients.columns = ['Coefficient']
+ coefficients
                      Coefficient
Avg. Session Length     25.981550
Time on App             38.590159
Time on Website          0.190405
Length of Membership    61.279097
  18. How can you interpret these coefficients?
  • Holding all other features fixed, a 1 unit increase in Avg. Session Length is associated with an increase of 25.98 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Time on App is associated with an increase of 38.59 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Time on Website is associated with an increase of 0.19 total dollars spent.
  • Holding all other features fixed, a 1 unit increase in Length of Membership is associated with an increase of 61.28 total dollars spent.