The data set: USA_Housing.csv.
The data contains the following columns:
- Avg. Area Income: Avg. Income of residents of the city house is located in.
- Avg. Area House Age: Avg Age of Houses in same city
- Avg. Area Number of Rooms: Avg Number of Rooms for Houses in same city
- Avg. Area Number of Bedrooms: Avg Number of Bedrooms for Houses in same city
- Area Population: Population of city house is located in
- Price: Price that the house sold at
- Address: Address for the house

| Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|
| 79545.46 | 5.682861 | 7.009188 | 4.09 | 23086.80 | 1059033.6 | 208 Michael Ferry Apt. 674 Laurabury, NE 37010-5101 |
| 79248.64 | 6.002900 | 6.730821 | 3.09 | 40173.07 | 1505890.9 | 188 Johnson Views Suite 079 Lake Kathleen, CA 48958 |
| 61287.07 | 5.865890 | 8.512727 | 5.13 | 36882.16 | 1058988.0 | 9127 Elizabeth Stravenue Danieltown, WI 06482-3489 |
| 63345.24 | 7.188236 | 5.586729 | 3.26 | 34310.24 | 1260616.8 | USS Barnett FPO AP 44820 |
| 59982.20 | 5.040554 | 7.839388 | 4.23 | 26354.11 | 630943.5 | USNS Raymond FPO AE 09386 |
| 80175.75 | 4.988408 | 6.104512 | 4.04 | 26748.43 | 1068138.1 | 06039 Jennifer Islands Apt. 443 Tracyport, KS 16077 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income 5000 non-null float64
Avg. Area House Age 5000 non-null float64
Avg. Area Number of Rooms 5000 non-null float64
Avg. Area Number of Bedrooms 5000 non-null float64
Area Population 5000 non-null float64
Price 5000 non-null float64
Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price |
|---|---|---|---|---|---|---|
| count | 5000.00 | 5000.0000000 | 5000.000000 | 5000.000000 | 5000.0000 | 5000.00 |
| mean | 68583.11 | 5.9772220 | 6.987792 | 3.981330 | 36163.5160 | 1232072.65 |
| std | 10657.99 | 0.9914562 | 1.005833 | 1.234137 | 9925.6501 | 353117.63 |
| min | 17796.63 | 2.6443042 | 3.236194 | 2.000000 | 172.6107 | 15938.66 |
| 25% | 61480.56 | 5.3222830 | 6.299250 | 3.140000 | 29403.9287 | 997577.14 |
| 50% | 68804.29 | 5.9704289 | 7.002902 | 4.050000 | 36199.4067 | 1232669.38 |
Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
dtype='object')
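The outputs above are what the usual pandas inspection calls print. A minimal sketch, assuming the file is read with `pd.read_csv` into a DataFrame named `USAhousing` (the name used in the code further down); to keep the block self-contained, two of the rows shown above are parsed from an in-memory string rather than from USA_Housing.csv:

```python
import io
import pandas as pd

# Two rows copied from the table above stand in for USA_Housing.csv;
# with the real file this would be pd.read_csv("USA_Housing.csv").
csv_text = """Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
79545.46,5.682861,7.009188,4.09,23086.80,1059033.6,"208 Michael Ferry Apt. 674 Laurabury, NE 37010-5101"
79248.64,6.002900,6.730821,3.09,40173.07,1505890.9,"188 Johnson Views Suite 079 Lake Kathleen, CA 48958"
"""
USAhousing = pd.read_csv(io.StringIO(csv_text))

print(USAhousing.head())      # first rows
USAhousing.info()             # dtypes and non-null counts
print(USAhousing.describe())  # summary statistics of the numeric columns
print(USAhousing.columns)     # column labels
```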
| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price |
|---|---|---|---|---|---|---|
| Avg. Area Income | 1.0000000 | -0.0020068 | -0.0110317 | 0.0197882 | -0.0162337 | 0.6397338 |
| Avg. Area House Age | -0.0020068 | 1.0000000 | -0.0094283 | 0.0061489 | -0.0187428 | 0.4525425 |
| Avg. Area Number of Rooms | -0.0110317 | -0.0094283 | 1.0000000 | 0.4626949 | 0.0020399 | 0.3356645 |
| Avg. Area Number of Bedrooms | 0.0197882 | 0.0061489 | 0.4626949 | 1.0000000 | -0.0221676 | 0.1710710 |
| Area Population | -0.0162337 | -0.0187428 | 0.0020399 | -0.0221676 | 1.0000000 | 0.4085559 |
| Price | 0.6397338 | 0.4525425 | 0.3356645 | 0.1710710 | 0.4085559 | 1.0000000 |
> plt.figure(figsize=(8,6))
+ # Address is text, so restrict the correlation matrix to the numeric columns
+ sns.heatmap(USAhousing.select_dtypes('number').corr(),
+ annot=True, cmap='coolwarm');
+ plt.tight_layout()
+ plt.show()

We first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will drop the Address column because it only contains text that the linear regression model can’t use.
> X = USAhousing[['Avg. Area Income',
+ 'Avg. Area House Age',
+ 'Avg. Area Number of Rooms',
+ 'Avg. Area Number of Bedrooms',
+ 'Area Population']]
+ y = USAhousing['Price']

Now let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.
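A hedged sketch of the split-and-fit step with scikit-learn. Synthetic data with known coefficients stands in for X and y so the block runs on its own; the names `X_train`, `X_test`, `y_train`, `y_test`, `lm` and `predictions`, and the values `test_size=0.3` and `random_state=101`, are assumptions for illustration rather than settings stated for the housing model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and price:
# y = 3*x0 - 2*x1 + 5 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.01, size=200)

# Hold out 30% of the rows for evaluation (illustrative values).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)          # learn intercept and coefficients from the training set only

print(lm.intercept_)              # close to 5
print(lm.coef_)                   # close to [3, -2]
predictions = lm.predict(X_test)  # predictions for rows the model has not seen
```

Because the model never sees the test rows while fitting, errors measured on `predictions` estimate how it behaves on new data.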
Let’s evaluate the model by checking out its coefficients and how we can interpret them.
-2640159.796851911
| | Coefficient |
|---|---|
| Avg. Area Income | 21.52828 |
| Avg. Area House Age | 164883.28203 |
| Avg. Area Number of Rooms | 122368.67803 |
| Avg. Area Number of Bedrooms | 2233.80186 |
| Area Population | 15.15042 |
Interpreting the coefficients:
Holding all other features fixed, a one-unit increase in:

- Avg. Area Income is associated with an increase of $21.53.
- Avg. Area House Age is associated with an increase of $164883.28.
- Avg. Area Number of Rooms is associated with an increase of $122368.68.
- Avg. Area Number of Bedrooms is associated with an increase of $2233.80.
- Area Population is associated with an increase of $15.15.

Does this make sense? Probably not, because this data is artificial. If you want real data to repeat this sort of analysis, check out the boston dataset http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html (note that load_boston was removed in scikit-learn 1.2).
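Each coefficient is the change in the predicted price when that one feature rises by one unit and the others stay where they are. A small sketch to make this concrete: it plugs the coefficients from the table into the linear formula and raises a single feature by one. It assumes that the value -2640159.796851911 printed above is the model's intercept; the helper name `predict_price` is hypothetical, and the feature values are those of the first row of the data shown earlier.

```python
# Coefficients copied from the table above.
coefficients = {
    "Avg. Area Income": 21.52828,
    "Avg. Area House Age": 164883.28203,
    "Avg. Area Number of Rooms": 122368.67803,
    "Avg. Area Number of Bedrooms": 2233.80186,
    "Area Population": 15.15042,
}
intercept = -2640159.796851911  # assumed to be the fitted intercept


def predict_price(features):
    """Linear model: intercept + sum(coefficient * feature value)."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())


# Feature values of the first house in the data.
house = {
    "Avg. Area Income": 79545.46,
    "Avg. Area House Age": 5.682861,
    "Avg. Area Number of Rooms": 7.009188,
    "Avg. Area Number of Bedrooms": 4.09,
    "Area Population": 23086.80,
}

# Same house, but with the average house age one unit higher.
older = dict(house)
older["Avg. Area House Age"] += 1

difference = predict_price(older) - predict_price(house)
print(round(difference, 2))  # the House Age coefficient, 164883.28
```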
array([1260960.70567626, 827588.75560352, 1742421.24254328, ...,
372191.40626952, 1365217.15140895, 1914519.54178824])
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
\[\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|\]
Mean Squared Error (MSE) is the mean of the squared errors:
\[\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2\]
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
\[\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}\]
Comparing these metrics:
- MAE is the easiest to understand, because it’s the average error.
- MSE is more popular than MAE, because MSE “punishes” larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the “y” units.

All of these are loss functions, because we want to minimize them.
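The three formulas can be computed directly. A minimal sketch in plain Python on made-up numbers (the lists `y_true` and `y_pred` are illustrative, not taken from the housing data):

```python
import math

y_true = [3.0, -0.5, 2.0, 7.0]   # illustrative actual values
y_pred = [2.5, 0.0, 2.0, 8.0]    # illustrative predictions

errors = [t - p for t, p in zip(y_true, y_pred)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error, in the units of y

print("MAE:", mae)    # 0.5
print("MSE:", mse)    # 0.375
print("RMSE:", rmse)
```

In this toy case the single error of size 1 accounts for half of the MAE but two thirds of the MSE, which is the sense in which squaring weights larger errors more heavily.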
MAE: 82288.22251914957
MSE: 10460958907.209507
RMSE: 102278.82922291156
Rsq.: 0.91768240096492
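The last line, Rsq., is not defined above. Assuming it is the usual coefficient of determination (the quantity scikit-learn's `r2_score` returns), it compares the model's squared error with that of always predicting the mean of y:

\[R^2 = 1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}\]

A plain-Python sketch on made-up numbers (`y_true` and `y_pred` are illustrative):

```python
y_true = [3.0, -0.5, 2.0, 7.0]   # illustrative actual values
y_pred = [2.5, 0.0, 2.0, 8.0]    # illustrative predictions

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared error of the model
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # squared error of predicting the mean

r_squared = 1 - ss_res / ss_tot
print("Rsq.:", round(r_squared, 3))  # 0.949
```

Under this definition a value near 1 means the model's squared error is small relative to the spread of y around its mean.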
We’ll work with the Ecommerce Customers csv file. It has customer info, such as Email, Address, and their color Avatar. It also has numerical value columns: Avg. Session Length, Time on App, Time on Website, Length of Membership, and Yearly Amount Spent.

Check the head of customers, and check out its info() and describe() methods.

| Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|---|---|
| mstephenson@fernandez.com | 835 Frank Tunnel Wrightmouth, MI 82180-9605 | Violet | 34.49727 | 12.65565 | 39.57767 | 4.082621 | 587.9511 |
| hduke@hotmail.com | 4547 Archer Common Diazchester, CA 06566-8576 | DarkGreen | 31.92627 | 11.10946 | 37.26896 | 2.664034 | 392.2049 |
| pallen@yahoo.com | 24645 Valerie Unions Suite 582 Cobbborough, DC 99414-7564 | Bisque | 33.00091 | 11.33028 | 37.11060 | 4.104543 | 487.5475 |
| riverarebecca@gmail.com | 1414 David Throughway Port Jason, OH 22070-1220 | SaddleBrown | 34.30556 | 13.71751 | 36.72128 | 3.120179 | 581.8523 |
| mstephens@davidson-herman.com | 14023 Rodriguez Passage Port Jacobville, PR 37242-1057 | MediumAquaMarine | 33.33067 | 12.79519 | 37.53665 | 4.446308 | 599.4061 |
| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|
| count | 500.0000000 | 500.0000000 | 500.000000 | 500.0000000 | 500.00000 |
| mean | 33.0531935 | 12.0524879 | 37.060445 | 3.5334616 | 499.31404 |
| std | 0.9925631 | 0.9942156 | 1.010489 | 0.9992775 | 79.31478 |
| min | 29.5324290 | 8.5081522 | 33.913847 | 0.2699011 | 256.67058 |
| 25% | 32.3418220 | 11.3881534 | 36.349257 | 2.9304497 | 445.03828 |
| 50% | 33.0820076 | 11.9832313 | 37.069367 | 3.5339750 | 498.88788 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email 500 non-null object
Address 500 non-null object
Avatar 500 non-null object
Avg. Session Length 500 non-null float64
Time on App 500 non-null float64
Time on Website 500 non-null float64
Length of Membership 500 non-null float64
Yearly Amount Spent 500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB
Use jointplot to compare the Time on Website and Yearly Amount Spent columns. Does the correlation make sense?
> # More time on site, more money spent.
+ sns.jointplot(x='Time on Website',
+ y='Yearly Amount Spent',data=customers);
+ plt.show()

Use jointplot with kind='hex' to compare Time on App and Length of Membership.
> sns.jointplot(x='Time on App',
+ y='Length of Membership',
+ kind='hex',data=customers);
+ plt.show()

Use pairplot to recreate the plot below.

Next, plot Yearly Amount Spent against Length of Membership with lmplot.
> plt.figure(figsize=(8,6))
+ sns.lmplot(x='Length of Membership',
+ y='Yearly Amount Spent',data=customers);
+ plt.show()

> X = customers[['Avg. Session Length',
+ 'Time on App',
+ 'Time on Website',
+ 'Length of Membership']]
> y = customers['Yearly Amount Spent']

Use model_selection.train_test_split from sklearn to split the data into training and testing sets. Set test_size=0.3 and random_state=101.

Create a LinearRegression() model named lm and fit it on the training data.
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
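A hedged sketch of the split and fit with the settings named here. A small random DataFrame with the same column names stands in for `customers` so the block runs on its own; taking Yearly Amount Spent as the target `y` is an assumption based on the rest of this section.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Random stand-in for the customers DataFrame (500 rows, same column names).
rng = np.random.default_rng(0)
feature_names = ['Avg. Session Length', 'Time on App',
                 'Time on Website', 'Length of Membership']
customers = pd.DataFrame(rng.normal(size=(500, 4)), columns=feature_names)
customers['Yearly Amount Spent'] = rng.normal(size=500)

X = customers[feature_names]
y = customers['Yearly Amount Spent']  # assumed target

# test_size=0.3 keeps 30% of the rows for testing; a fixed random_state
# makes the shuffle, and therefore the split, reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)

print(X_train.shape, X_test.shape)  # (350, 4) (150, 4)
```

Re-running the split with the same `random_state` selects the same rows, which is what makes results such as the coefficients comparable between runs.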
Coefficients:
[25.98154972 38.59015875 0.19040528 61.27909654]
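Use lm.predict() to predict off the X_test set of the data, then plot the real test values against the predicted values.
> predictions = lm.predict(X_test)
> plt.figure(figsize=(8,6))
+ plt.scatter(y_test,predictions)
+ plt.xlabel('Y Test')
+ plt.ylabel('Predicted Y');
+ plt.show()

MAE: 7.228148653430838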
MSE: 79.81305165097461
RMSE: 8.933815066978642
Rsq.: 0.9890046246741234
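Numbers like the four above can be obtained from `sklearn.metrics`. A minimal sketch on made-up arrays (`y_test` and `predictions` here are illustrative stand-ins, not the tutorial's data); RMSE is taken as the square root of the MSE with NumPy:

```python
import numpy as np
from sklearn import metrics

y_test = np.array([3.0, -0.5, 2.0, 7.0])       # illustrative actual values
predictions = np.array([2.5, 0.0, 2.0, 8.0])   # illustrative predictions

mae = metrics.mean_absolute_error(y_test, predictions)
mse = metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)                             # square root of the MSE
rsq = metrics.r2_score(y_test, predictions)

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
print('Rsq.:', rsq)
```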
Plot a histogram of the residuals using seaborn’s distplot, or just plt.hist().
> coefficients = pd.DataFrame(lm.coef_,X.columns)
+ coefficients.columns = ['Coefficient']
+ coefficients
Coefficient
Avg. Session Length 25.981550
Time on App 38.590159
Time on Website 0.190405
Length of Membership 61.279097
Holding all other features fixed, a one-unit increase in:

- Avg. Session Length is associated with an increase of 25.98 total dollars spent.
- Time on App is associated with an increase of 38.59 total dollars spent.
- Time on Website is associated with an increase of 0.19 total dollars spent.
- Length of Membership is associated with an increase of 61.28 total dollars spent.