Introduction - working with numpy

In this session, you will learn about some further techniques in Python. Last week's session covered the basics of how to code in Python. This week you will look at more data types and tools in Python for advanced data analysis. In particular you will see how Python can handle matrices, together with some basic 2D and 3D graphics tools. One of the key packages enabling this is called numpy. This package should be available in a standard Anaconda Python installation. One of the key things that numpy provides is matrix and array data types. Although similar to the lists seen last week, these are in many ways more powerful. The following code demonstrates this:

# Firstly import numpy
import numpy as np
# Create two 1D arrays using np.array
x = np.array([5.0,6.0,8.0,10.0])
y = np.array([8.5,8.4,8.1,7.2])
# Note that adding arrays works element by element
print(x + y)
## [ 13.5  14.4  16.1  17.2]
# numpy functions also work like this - so e.g. np.log
# is better suited to arrays than math.log
print(np.log(x))
## [ 1.60943791  1.79175947  2.07944154  2.30258509]
# Two-dimensional arrays (i.e. matrices) are also possible
z1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3]])
z2 = np.array([[-2.0, 4.7, 3.1],[-1.0,-1.4,1.8]])
print(z1 + z2)
## [[ 0.   7.8  1.1]
##  [ 0.  -2.4  8.1]]
print(np.sin(z2))
## [[-0.90929743 -0.99992326  0.04158066]
##  [-0.84147098 -0.98544973  0.97384763]]

The data types provided by numpy are quite similar to the arrays and matrices in R - and are useful for data analysis in the same way. Some of the statistical functions are also provided:

# mean, std are supported
print(np.mean(x))
## 7.25
z_score = (x - np.mean(x))/np.std(x)
print(z_score)
## [-1.1717002  -0.65094455  0.39056673  1.43207802]

You can use these functions in your own functions:

def zscore(dat):
    return (dat - np.mean(dat)) / np.std(dat)
print(zscore(y))
## [ 0.87831007  0.68313005  0.09759001 -1.65903012]

2D Arrays and Matrices

It is also possible to use np.mean and similar functions with higher-dimensional data. Functions that work in this way include np.mean, np.std, np.sum, np.min and np.max. You can use them to compute a value for the entire matrix, or specify axis=0 for column-wise results, or axis=1 for row-wise results:

print(np.mean(z1))
## 1.56666666667
print(np.sum(z1,axis=0))
## [ 3.   2.1  4.3]
print(np.min(z1,axis=1))
## [-2. -1.]

Vector and array objects also have attributes - these are essentially properties of the objects - such as the number of rows and columns a matrix has. That particular attribute is called shape:

print(y.shape)
## (4,)
print(z2.shape)
## (2, 3)
print(np.min(z1,axis=0).shape)
## (3,)

The result of each of these expressions is a tuple - a Python object similar to a list. One useful property of tuples is that although it is possible to assign a whole tuple to a variable, it is also possible to do this element by element:

num_rows, num_cols = z1.shape
print("Rows =", num_rows, "; Cols=", num_cols)
## Rows = 2 ; Cols= 3

It is also possible to use slicing on arrays (cf. last week). For example, to pick out columns 0 and 1 from z1, use:

part_z1 = z1[:,0:2]
print(part_z1)
## [[ 2.   3.1]
##  [ 1.  -1. ]]

Note that a slice expression is needed in both dimensions - hence the : in the row dimension, stating that all rows are wanted. Also remember that the right-hand value in a slice expression refers to the element after the last one to be selected.
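
The same idea works for rows, or for single rows and columns. A minimal check, using the z1 array defined above (the exact spacing of the printed output may vary between numpy versions):

# Pick out a single row, and then a single column
print(z1[1,:])
## [ 1.  -1.   6.3]
print(z1[:,2])
## [-2.   6.3]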

Note that for 2D arrays, such as z1 and z2, arithmetic operators +, -, *, / and so on work on an element by element basis. Thus, for example:

v1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
v2 = np.array([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(v1 * v2)
## [[-14.6   18.91   4.4 ]
##  [ -1.3   -1.4   10.71]
##  [ -6.    -2.    -3.5 ]]

Thus, the [0,0] element of the result above is v1[0,0] times v2[0,0] - this is different from matrix multiplication. numpy also offers a data type called matrix, for which * actually performs matrix multiplication:

m1 = np.matrix([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
m2 = np.matrix([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(m1 * m2)
## [[-12.63  12.54   2.87]
##  [-24.9   17.3  -10.2 ]
##  [-23.8   17.8   -9.6 ]]
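
Incidentally, you do not have to use the matrix type to obtain matrix multiplication - ordinary 2D arrays can be multiplied in this way using np.dot (and, in Python 3.5 and later, the @ operator does the same job). A minimal sketch, re-using v1 and v2 from above:

# Matrix multiplication with ordinary 2D arrays
print(np.dot(v1, v2))  # same result as m1 * m2 above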

Data via pandas

Another helpful Python package is pandas. This provides a new data type which is very similar to an R data frame. The package is used for a number of data manipulation tasks - one of the most practical is its ability to read in and manipulate data stored in csv files. For this part of the practical, you will need some data relating to house prices. Currently this is available in the NCG603 directory on Dropbox, and can be downloaded from https://www.dropbox.com/s/2bfz210216dw6cw/hpdemo.csv?dl=0. If you open this with a text editor, the first few lines look like this:

"ID","east","north","price","fl_area"
"1",523800,179700,107000,50
"2",533200,170900,55500,66
"3",514600,175800,103000,90
"4",516000,171000,187000,125
"5",533700,169200,43000,50

Respectively, the columns are:

Variable   Description
ID         An ID number for each house
east       The easting (x-coordinate) of the house
north      The northing (y-coordinate) of the house
price      The sale price of the house
fl_area    The floor area of the house

The pandas package has a useful function for reading csv files. Make sure the house price file is in the folder you are working in, and then enter:

import pandas as pd
hp = pd.read_csv('hpdemo.csv')
print(hp)
##         ID    east   north   price  fl_area
## 0        1  523800  179700  107000       50
## 1        2  533200  170900   55500       66
## 2        3  514600  175800  103000       90
## 3        4  516000  171000  187000      125
## 4        5  533700  169200   43000       50
## 5        6  547900  189600   69995       95
## 6        7  531900  189900   80000       79
## 7        8  514600  190300   57500       44
## 8        9  525700  172900  104000       68
## 9       10  536700  188100   69000       87
## 10      11  529000  171300   88000       76
## 11      12  523900  191200  123000      107
## 12      13  546800  169600   83000       92
## 13      14  547100  187600   52995       53
## 14      15  531900  185800  255000      146
## 15      16  528700  173400  185000      125
## 16      17  529200  164400  152000      157
## 17      18  525000  190200   85000       79
## 18      19  514600  189300  120000      108
## 19      20  524200  174600  135000       91
## 20      21  516300  176700  143000      100
## 21      22  505100  183400   59950       39
## 22      23  523800  184600  285000      198
## 23      24  526400  177100  272500      155
## 24      25  532800  167200   73950       87
## 25      26  533700  166500   67500       59
## 26      27  514200  172200  174000      114
## 27      28  541200  178200   65000       92
## 28      29  538400  191500   50300       61
## 29      30  539900  190200   63200       62
## ...    ...     ...     ...     ...      ...
## 1375  1376  528000  176400   52000       83
## 1376  1377  539300  169400  142000      127
## 1377  1378  537100  182700   70000       60
## 1378  1379  531200  194600  243000      129
## 1379  1380  527900  166100   58000       48
## 1380  1381  522900  168500  139950      120
## 1381  1382  518800  163600  125000       98
## 1382  1383  522200  177600  134000       70
## 1383  1384  526700  175100   85000       46
## 1384  1385  528900  194900   99000       96
## 1385  1386  526700  165900   77000       80
## 1386  1387  527000  171600   76000       80
## 1387  1388  538700  185600   64000       76
## 1388  1389  540900  173700   64000       77
## 1389  1390  526000  173800   59999       52
## 1390  1391  552000  190400  246000      155
## 1391  1392  536000  169200  122000      127
## 1392  1393  532000  172200   79000       75
## 1393  1394  536600  165300  139950      110
## 1394  1395  527600  193300   65000       54
## 1395  1396  520000  186900  250000      136
## 1396  1397  511300  176400   66950      105
## 1397  1398  522400  166400  119950       85
## 1398  1399  528300  175900  151500       75
## 1399  1400  528800  171300  230000      178
## 1400  1401  515600  173100   68500       44
## 1401  1402  513200  186500   58500       59
## 1402  1403  542900  189500  247000      185
## 1403  1404  524900  185300  153000       96
## 1404  1405  522000  185400  146250      111
## 
## [1405 rows x 5 columns]

You can access the contents of the columns in the same way as you access items in dictionaries:

print(hp['price'])
## 0       107000
## 1        55500
## 2       103000
## 3       187000
## 4        43000
## 5        69995
## 6        80000
## 7        57500
## 8       104000
## 9        69000
## 10       88000
## 11      123000
## 12       83000
## 13       52995
## 14      255000
## 15      185000
## 16      152000
## 17       85000
## 18      120000
## 19      135000
## 20      143000
## 21       59950
## 22      285000
## 23      272500
## 24       73950
## 25       67500
## 26      174000
## 27       65000
## 28       50300
## 29       63200
##          ...  
## 1375     52000
## 1376    142000
## 1377     70000
## 1378    243000
## 1379     58000
## 1380    139950
## 1381    125000
## 1382    134000
## 1383     85000
## 1384     99000
## 1385     77000
## 1386     76000
## 1387     64000
## 1388     64000
## 1389     59999
## 1390    246000
## 1391    122000
## 1392     79000
## 1393    139950
## 1394     65000
## 1395    250000
## 1396     66950
## 1397    119950
## 1398    151500
## 1399    230000
## 1400     68500
## 1401     58500
## 1402    247000
## 1403    153000
## 1404    146250
## Name: price, Length: 1405, dtype: int64

This lists all of the items in the column price. Sometimes rescaling is useful - particularly to obtain ‘neater’ axes.

print(hp['price']/1000)
## 0       107.000
## 1        55.500
## 2       103.000
## 3       187.000
## 4        43.000
## 5        69.995
## 6        80.000
## 7        57.500
## 8       104.000
## 9        69.000
## 10       88.000
## 11      123.000
## 12       83.000
## 13       52.995
## 14      255.000
## 15      185.000
## 16      152.000
## 17       85.000
## 18      120.000
## 19      135.000
## 20      143.000
## 21       59.950
## 22      285.000
## 23      272.500
## 24       73.950
## 25       67.500
## 26      174.000
## 27       65.000
## 28       50.300
## 29       63.200
##          ...   
## 1375     52.000
## 1376    142.000
## 1377     70.000
## 1378    243.000
## 1379     58.000
## 1380    139.950
## 1381    125.000
## 1382    134.000
## 1383     85.000
## 1384     99.000
## 1385     77.000
## 1386     76.000
## 1387     64.000
## 1388     64.000
## 1389     59.999
## 1390    246.000
## 1391    122.000
## 1392     79.000
## 1393    139.950
## 1394     65.000
## 1395    250.000
## 1396     66.950
## 1397    119.950
## 1398    151.500
## 1399    230.000
## 1400     68.500
## 1401     58.500
## 1402    247.000
## 1403    153.000
## 1404    146.250
## Name: price, Length: 1405, dtype: float64
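
As an aside, pandas columns also provide their own summary methods, which can be handy before plotting. A minimal sketch (outputs omitted):

# pandas Series have built-in summary methods
print(hp['price'].mean())     # average sale price
print(hp['price'].median())   # median sale price
print(hp['fl_area'].max())    # largest floor area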

Basic Statistical Graphics

Having read in this data, we can start to analyse and visualise it in Python. Firstly, to visualise it, the pyplot package (a sub-package of matplotlib)[1] can be used - this produces plots that look quite like those produced by MATLAB.

import matplotlib.pyplot as pl
# Try to draw a histogram
pl.hist(hp['price'])
pl.show()

You will see that at the pl.hist stage no graphic appears. pyplot works by building up the layers in an image, and then, when everything is in place, using the show function to actually display the plot.

This would perhaps look better if the prices were in thousands of pounds and some titles and labels were provided. A nicer result is obtained using:

# Try to draw a neater histogram
pl.close()
pl.hist(hp['price']/1000)
pl.xlabel('Price (1000s Pounds)')
pl.ylabel('Frequency')
pl.title('London House Prices')
pl.show()

Note the way the graphic is ‘built up’ by adding the data and then the various text labels step by step. Finally, the show command is added, to draw the completed graphic. The initial pl.close() call closes the previous plot (otherwise the new plot would be drawn over the original axes).

You can use some of the numpy operations on data frames - for example, it might be useful to see the distribution of log house prices. House prices are often closer to log-normal than normal, so taking logs can make the distribution look more symmetrical.

# histogram of log house price
pl.close()
pl.hist(np.log10(hp['price']))
pl.xlabel('Logarithm of Price (Pounds)')
pl.ylabel('Frequency')
pl.title('Logged house prices')
pl.show()

Here the function log10 is used - taking logs to the base 10 of the prices - so that \(\log_{10}(10,000) = 4\), \(\log_{10}(100,000) = 5\) and \(\log_{10}(1,000,000) = 6\).
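
A quick check of these values (again, the printed spacing may vary between numpy versions):

print(np.log10(np.array([1e4, 1e5, 1e6])))
## [ 4.  5.  6.]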

It is also possible to create scatter plots - using scatter:

pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Floor Area (London)')
pl.show()

Interestingly, although there is some linkage, the relationship is not especially strong. Note that the dot density is quite high in the lower left-hand corner, so it is hard to see the relative numbers of points. One way of addressing this is to use semi-transparent points. In the scatter function, the colour was specified as r (for red), but it is also possible to denote colour as a red, green, blue, transparency 4-tuple. For the three colours, levels range from 0 to 1 - for transparency the level also runs from 0 to 1, with 0 being completely transparent, and 1 not transparent at all.

pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color=(1,0,0,0.1),marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Floor Area (London)')
pl.grid(True)
pl.show()

Note that a grid was also added here. The log transform is also of interest. Suppose we take logs of both the price and the floor area and plot the result:

pl.close()
pl.scatter(np.log10(hp['fl_area']),np.log10(hp['price']),color='r',marker='.')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.grid(True)
pl.show()

Here it is worth noting that the distribution is elliptical - suggesting a correlation, but also a possible bivariate normal shape. Another way to visualise this is to leave the data untransformed, but to use log-scale axes. This is achieved with the loglog function:

pl.close()
pl.scatter(hp['fl_area'],hp['price'],color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price')
pl.title('House Price v. Floor Area (London)')
pl.grid()
pl.loglog()
pl.show()

Some Statistical Analysis

Having plotted this data, some statistical analysis may be appropriate. A further package, called scipy (Scientific Python), is useful here. A natural next step is to fit a regression line to the data in the last figure you drew. Within scipy there is a sub-package called stats, which has a function called linregress:

import scipy.stats as st
print(st.linregress(np.log10(hp['fl_area']),np.log10(hp['price'])))
## LinregressResult(slope=0.92700369750028555, intercept=3.2248787881527665, rvalue=0.68269377157031175, pvalue=2.106487886894245e-193, stderr=0.026489181088576768)

The five numbers returned by this function as a 5-tuple are as follows:

  1. The slope of the regression line
  2. The intercept of the regression line
  3. The correlation coefficient between the two variables
  4. The \(p\)-value of a significance test for the null hypothesis that the slope is zero
  5. The standard error of the estimated slope

As with the dimensions of arrays with shape earlier, you can assign all of these to five variables in a single statement:

# Here are the 5 variables being assigned
a,b,r,pval,sderr = st.linregress(np.log10(hp['fl_area']),
    np.log10(hp['price']))
# Print some out neatly...
print("a = %6.3f (slope)" % a)
## a =  0.927 (slope)
print("b = %6.3f (Intercept)" % b)
## b =  3.225 (Intercept)
print("r = %6.3f (correlation)" % r)
## r =  0.683 (correlation)

This just fitted a model of the form \(\log_{10}(\texttt{price}) = b + a\log_{10}(\texttt{fl\_area})\), where \(b\) is the intercept and \(a\) is the slope - matching the variable names used in the code above. The fit is reasonably good, suggesting a power law relation between price and floor area. This can also be written in the form \[\begin{equation} \texttt{price} = A(\texttt{fl\_area})^{a} \end{equation}\] where \(A = 10^{b}\). You can use Python to see the value of \(A\):

print(10 ** b)
## 1678.33...

Since \(a\) is slightly less than one, this suggests a slight diminishing return on the value of a house given its floor area - doubling the floor area doesn't quite double the value of the house.
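
As a quick sanity check of the fitted relation (a sketch - the exact figure depends on the fit above, but it comes out at around 120,000 pounds):

# Predicted price for a house with a floor area of 100 sq. metres
fl = 100.0
print(10 ** (b + a * np.log10(fl)))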

You can also add the fitted line to the scatter plot. The plot command in pyplot can do this. As before, the plot is ‘built up’, and the line is added after the scatter points and labels. plot is a general line-drawing function which joins together a set of \((x,y)\) points, but here the \(x\)-values are a set of regular points along the line, and the \(y\)-values are the corresponding fitted regression values. The function np.linspace(lowest,highest,n) creates a set of n linearly-spaced values running from a lowest to a highest value. In the plot function, the colour 'b' means blue. Also, the lw parameter controls the line width.
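
A quick check of what linspace produces (printed spacing may vary):

print(np.linspace(0.0,1.0,5))
## [ 0.    0.25  0.5   0.75  1.  ]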

pl.close()
pl.scatter(np.log10(hp['fl_area']),np.log10(hp['price']),
    color='r',marker='.')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.grid(True)
x_reg = np.linspace(1.3,2.5,100)
y_reg = b + a * x_reg
pl.plot(x_reg,y_reg,color='b',lw=2.0)
pl.show()

To see the power law relation, it may be useful to plot the untransformed variables - the following code applies the inverse of the \(\log_{10}(x)\) transform (\(10^{x}\)) to the points on the fitted line:

pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Floor Area (London)')
pl.grid(True)
pl.plot(10.0**x_reg,10.0**y_reg/1000,color='b',lw=2.0)
pl.show()

However, it should be noted that the deviation from linearity is quite small.

Kernel Density Estimation

In addition to creating a scatter plot of price and floor area, another idea might be to create a kernel density estimate of these two variables. There is a function for this in scipy.stats as well, called gaussian_kde.

One complication here is that this function expects data in a different format from much of the numpy, scipy and pyplot material. In particular, the input to the KDE function needs to be a 2 \(\times\) n matrix, where n is the number of data points. Note this is the opposite order to a data frame, as variables correspond to rows, not columns. Indexing hp with a list of column names will extract the corresponding columns - but here, the data will also need to be transposed. Fortunately this is fairly easy in numpy - the transpose of a matrix x is just x.T. In a single expression it is possible to extract the price and floor area columns, transpose them, take logs of them and assign the result to a new variable called log_data.

log_data = np.log10(hp[['fl_area','price']].T) 
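
It is worth checking that the result has the expected shape - two rows (one per variable), and one column per house:

print(log_data.shape)
## (2, 1405)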

Next, this data is fed into the gaussian_kde function - this is an unusual function in that it doesn't return a value; instead it returns another function. The returned function is then used to compute the density estimate at new data points. Here, the newly created function is called f_hat. This name comes from the fact that, in texts about kernel density estimation, the estimate is often referred to as \(\hat{f}(x,y)\).

f_hat = st.gaussian_kde(log_data)
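
As a quick check, f_hat can be evaluated at a single point - for example, a log floor area of 2 (i.e. 100 square metres) and a log price of 5 (i.e. 100,000 pounds). The exact density value returned depends on the data, so no output is shown here:

# Density estimate at a single (log fl_area, log price) point
print(f_hat([2.0, 5.0]))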

This creates a new function that takes a matrix of data points, and works out the value of the kernel density estimate at those points. A typical use of this is to evaluate f_hat over a regular grid of \((x,y)\) points, which can then be used to draw contour images, or plot 3D surfaces. To do this, you need to create a 2D array of \(x\)-points and another of the corresponding \(y\)-points, and then step through each grid location and compute the density with f_hat. The grid values for the \(x\) and \(y\) points can be created using the numpy function meshgrid - this takes a list of \(x\)-values and another of \(y\)-values, and creates a pair of 2D arrays covering all possible combinations. linspace can be used to generate evenly spaced sequences in the \(x\) and \(y\) directions.

x_grd, y_grd = np.meshgrid(np.linspace(1.3,2.5,100),
    np.linspace(4.0,6.5,100))

Note that this is another function that returns multiple values, in this case two 2D arrays, one for the \(x\)-values and one for the \(y\)’s. The next step is to create a third 2D array in which the corresponding density estimates should be stored. This can be done via the zeros_like function - it creates a 2D array that has the same number of rows and columns as its argument, but every element is zero. Essentially this just creates a 2D array of place holders. Once this is done, you can loop through the grid of \(x\) and \(y\) values, and compute the density estimates:

dens_grd = np.zeros_like(x_grd)
for i in range(len(x_grd)):
    for j in range(len(y_grd)):
        dens_grd[i,j] = f_hat([x_grd[i,j],y_grd[i,j]])
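
The double loop above works, but can be slow for larger grids. As an aside, a vectorised equivalent (a sketch - it should produce the same dens_grd) stacks the flattened grids into a single 2 \(\times\) n array, evaluates f_hat once, and reshapes the result:

# Vectorised equivalent of the double loop
coords = np.vstack([x_grd.ravel(), y_grd.ravel()])
dens_grd = f_hat(coords).reshape(x_grd.shape)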

Now that the array dens_grd contains the grid of kernel density estimates, it may be used to create contour plots, via the pyplot function contour:

pl.close()
pl.contour(x_grd,y_grd,dens_grd,colors='DarkGreen')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()

Note that the colors argument here specified the colour of the contour lines. The colour 'DarkGreen' is a standard HTML colour - as listed at http://www.w3schools.com/html/html_colornames.asp.

A related function, contourf, plots filled contours, where the regions in between the contour lines have a solid colour.

pl.close()
pl.contourf(x_grd,y_grd,dens_grd)
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()

The result with the default colour scheme works well here - this is essentially the viridis palette. However, other colour schemes are possible. pyplot refers to these as color maps - each one has a name, with many of the names corresponding to the ColorBrewer palettes. All of the standard palettes are available via the cmap argument - plus a few others - see https://matplotlib.org/users/colormaps.html for a list. The ones in the group labelled ‘Miscellaneous Colormaps’ on this page are best avoided[2]. Here the Greens ColorBrewer-based color map is used:

pl.close()
pl.contourf(x_grd,y_grd,dens_grd,cmap='Greens')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()

Finally, a good effect can be created by combining the filled contours and the contour lines:

pl.close()
pl.contourf(x_grd,y_grd,dens_grd,cmap='Greens')
pl.contour(x_grd,y_grd,dens_grd,colors='DarkGreen')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()

Triangulated Contours

A further approach to creating contour plots is the tricontour function. This creates contours without the need to create regular grids. Essentially, the function name is an abbreviation of triangular contour. It takes an irregular set of (\(x\),\(y\),\(z\)) points, creates a triangulated network (a Delaunay triangulation) over the \((x,y)\) points, and then adds the \(z\) values to create a surface made of irregular triangular faces - which it then uses to create the contours. The useful thing about this approach is that it saves having to compute all of the grid points. In pyplot the functions tricontour and tricontourf are the respective triangulation-based versions of contour and contourf. In the code below, the \((x,y)\) data points (i.e. the floor area and price) are ‘fed’ into f_hat to estimate the density at the data points themselves. Then the \(x\) and \(y\) values are extracted from log_data for the triangulated contour functions, using the expressions log_data.T['fl_area'] and log_data.T['price'].

pl.close()
dens =  f_hat(log_data)
pl.tricontourf(log_data.T['fl_area'],log_data.T['price'],dens,cmap='Blues')
pl.tricontour(log_data.T['fl_area'],log_data.T['price'],dens,colors='Indigo')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()

Note that in this case, the outer limit of the contour shading is not the entire graph area, but the convex hull of the points on the plot. This is an artefact of the Delaunay triangulation algorithm. Also note that the contours associated with the lower densities are more ‘wobbly’. This is essentially because - fairly obviously - lower densities occur where there are fewer points, and so the triangulation is cruder in those areas.

Three-Dimensional Graphics

As well as using contour-based methods, it is also possible to create 3D plots in Python. This is done using the mpl_toolkits.mplot3d package. This provides a class called Axes3D that takes a figure object and creates a set of 3D axes (actually a coordinate mapping from 3D to 2D) that projects 3D images into a 2D window. This sounds complex, but in reality it just means that you create an object called ax and call the 3D graphics functions through it. One such function is plot_surface, which creates surfaces from rectangular grid data:

from mpl_toolkits.mplot3d import Axes3D
pl.close()
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1, 
    cmap='Reds',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()

A triangulated version also exists - using the function plot_trisurf.

pl.close()
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_trisurf(log_data.T['fl_area'],log_data.T['price'],dens,cmap='Blues',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()

Putting it all together

Although you have been working with chunks of code up to this point, it is helpful to be able to create a whole program. An example is given at https://www.dropbox.com/s/jr9qn3yghnwtptd/housing.py?dl=0 - this reproduces the grid-based 3D density surface. The file is listed below:

# Import the various packages as toolkits
import numpy as np
import pylab as pl
import scipy.stats as st
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
 
 
# Read in the house price data 
hp = pd.read_csv('hpdemo.csv')
 
 
# Take logs of price and floor area
log_data = np.log10(hp[['fl_area','price']].T) 
 
# Create the kernel function
f_hat = st.gaussian_kde(log_data)
 
# Create the mesh points
x_grd, y_grd = np.meshgrid(np.linspace(1.3,2.5,100),
    np.linspace(4.0,6.5,100))
 
# Create the densities on the mesh point grid
dens_grd = np.zeros_like(x_grd)
for i in range(len(x_grd)):
    for j in range(len(y_grd)):
        dens_grd[i,j] = f_hat([x_grd[i,j],y_grd[i,j]])
 
# Create the 3D figure
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1, 
    cmap='Reds',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()

If you have downloaded this file, and your command line is working in the same folder, you can run the file as a single job via the command

python housing.py
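
Alternatively, if you are working inside an ipython session, the %run magic command does the same job:

%run housing.py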

A few hints on creating program files follow:

  • Don’t forget to include the import statements - I usually put all the ones I need at the top of the program.
  • Comment statements are helpful:
    • They help split the program into its component parts.
    • They help explain what is going on when you come back to the file after 18 months.
    • You can use them to include citations to papers explaining algorithms you use.
    • You can use them to cite sources of data.

Self-Test Exercises

These are all exercises that involve modifying the file you have just downloaded.

Exercise 1

Modify the program to produce a 2D filled contour plot instead of a 3D surface.

Exercise 2

It is possible to save the diagrams to a file (a pdf for example), using pl.savefig('myfile.pdf') instead of pl.show(). Modify your code from the previous exercise to do this. Also, use a different colour map for the contour shading.

Exercise 3

Go back to the original 3D plot, and replace the original ax.plot_surface line with the following two lines:

ax.contour3D(x_grd, y_grd, dens_grd, rstride=1, cstride=1, 
    colors='k', linewidth=2)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1, 
    cmap='Reds',linewidth=0,alpha=0.5)

Can you explain what is going on here?


  [1] You won't have seen sub-packages yet, but it really just means a package called matplotlib.pyplot that can be imported without the rest of matplotlib.

  [2] It's not just me saying this - the one called ‘jet’ used to be the default, leading to this discussion: https://github.com/matplotlib/matplotlib/issues/875 .