numpy
In this session, you will learn about some further techniques in Python. Last week's session covered the basics of how to code in Python. This week you will look at more data types and tools in Python for advanced data analysis. In particular you will see how Python can handle matrices, and some basic 2D and 3D graphics tools. One of the key packages that enables this is called numpy. This package should be available in your standard Anaconda Python installation. One of the key things that numpy does is to provide matrix and array data types. Although similar to the lists seen last week, in some ways they are more powerful. The following code demonstrates this:
# Firstly import numpy
import numpy as np
# Create two 1D arrays using np.array
x = np.array([5.0,6.0,8.0,10.0])
y = np.array([8.5,8.4,8.1,7.2])
# Note that adding up arrays works element by element
print(x + y)
## [ 13.5 14.4 16.1 17.2]
# numpy functions also work element by element - so e.g. np.log
# is better with arrays than math.log
print(np.log(x))
## [ 1.60943791 1.79175947 2.07944154 2.30258509]
# Two dimensional arrays (i.e. matrices) are also possible
z1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3]])
z2 = np.array([[-2.0, 4.7, 3.1],[-1.0,-1.4,1.8]])
print(z1 + z2)
## [[ 0. 7.8 1.1]
## [ 0. -2.4 8.1]]
print(np.sin(z2))
## [[-0.90929743 -0.99992326 0.04158066]
## [-0.84147098 -0.98544973 0.97384763]]
The data types provided by numpy are quite similar to the arrays and matrices in R - and are useful for data analysis in the same way. Some of the statistical functions are also provided:
# mean, std are supported
print(np.mean(x))
## 7.25
z_score = (x - np.mean(x))/np.std(x)
print(z_score)
## [-1.1717002 -0.65094455 0.39056673 1.43207802]
You can use these functions in your own functions:
def zscore(dat):
    return (dat - np.mean(dat)) / np.std(dat)
print(zscore(y))
## [ 0.87831007 0.68313005 0.09759001 -1.65903012]
It is also possible to use np.mean and so on with higher dimensional data. Functions that work in this way include np.mean, np.std, np.sum, np.min and np.max. You can use them to compute values for the entire matrix, or, by specifying axis=0, obtain column-wise results (axis=1 gives row-wise results):
print(np.mean(z1))
## 1.56666666667
print(np.sum(z1,axis=0))
## [ 3. 2.1 4.3]
print(np.min(z1,axis=1))
## [-2. -1.]
Vector and array objects also have attributes - these are basically properties of the objects, such as the number of rows or columns a matrix has. The attribute holding this information is called shape:
print(y.shape)
## (4,)
print(z2.shape)
## (2, 3)
print(np.min(z1,axis=0).shape)
## (3,)
The result of these expressions is a tuple - a Python object similar to a list. One useful property of tuples is that although it is possible to assign a whole tuple to a variable, it is also possible to do this element by element:
num_rows, num_cols = z1.shape
print("Rows =", num_rows, "; Cols=", num_cols)
## Rows = 2 ; Cols= 3
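As a small additional sketch (not in the original listing), the whole tuple can also be kept in a single variable first and unpacked later:
dims = z1.shape               # dims holds the whole tuple (2, 3)
num_rows, num_cols = dims     # ...and can be unpacked afterwards
print(num_rows * num_cols)    # total number of elements in z1
## 6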
It is also possible to use slicing on arrays (cf. last week). For example, to pick out columns 0 and 1 from z1, use:
part_z1 = z1[:,0:2]
print(part_z1)
## [[ 2. 3.1]
## [ 1. -1. ]]
Note - a slice expression is needed in both dimensions, hence the : in the row dimension, stating that all rows are wanted. Also remember that the right-hand value in a slice expression refers to the element after the last one to be selected.
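As a further illustration of slicing (an added example, not part of the original notes - the exact spacing of the printed output may vary between numpy versions), single rows or columns can be picked out in the same way, and negative indices count back from the end:
print(z1[0,:])     # all columns of row 0
## [ 2.   3.1  -2. ]
print(z1[:,-1])    # the last column of every row
## [-2.   6.3]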
Note that for 2D arrays, such as z1 and z2, the arithmetic operators +, -, *, / and so on work on an element by element basis. Thus, for example:
v1 = np.array([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
v2 = np.array([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(v1 * v2)
## [[-14.6 18.91 4.4 ]
## [ -1.3 -1.4 10.71]
## [ -6. -2. -3.5 ]]
Thus, the [0,0] element of the result above is v1[0,0] times v2[0,0] - this is different from matrix multiplication. numpy also offers a data type called matrix for which multiplication is actually matrix multiplication:
m1 = np.matrix([[2.0, 3.1, -2.0],[1.0,-1.0,6.3],[2.0,-1.0,3.5]])
m2 = np.matrix([[-7.3, 6.1, -2.2],[-1.3,1.4,1.7],[-3.0,2.0,-1.0]])
print(m1 * m2)
## [[-12.63 12.54 2.87]
## [-24.9 17.3 -10.2 ]
## [-23.8 17.8 -9.6 ]]
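As an aside (not in the original notes), you do not have to switch to the matrix type to get a matrix product - np.dot, or the @ operator in Python 3.5 and later, performs matrix multiplication on ordinary 2D arrays and gives the same result as above:
print(np.dot(v1, v2))    # matrix product of the ordinary arrays v1 and v2
print(v1 @ v2)           # the @ operator is equivalent (Python 3.5+)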
pandas
Another helpful Python package is pandas. This provides a new data type which is very similar to an R data frame. This package is used for a number of data manipulation tasks - one of the most practical is its ability to read in and manipulate data stored in csv files. For this part of the practical, you will need some data relating to house prices. Currently this is available in the NCG603 directory on Dropbox, and can be downloaded from https://www.dropbox.com/s/2bfz210216dw6cw/hpdemo.csv?dl=0. If you open this with a text editor, the first few lines look like this:
"ID","east","north","price","fl_area"
"1",523800,179700,107000,50
"2",533200,170900,55500,66
"3",514600,175800,103000,90
"4",516000,171000,187000,125
"5",533700,169200,43000,50
Respectively, the columns are:
Variable | Description |
---|---|
ID | An ID number for each house |
east | The easting (x-coordinate) of the house |
north | The northing (y-coordinate) of the house |
price | The sale price of the house |
fl_area | The floor area of the house |
The pandas package has a useful function for reading csv files. Make sure the house price file is in the folder you are working in, and then enter:
import pandas as pd
hp = pd.read_csv('hpdemo.csv')
print(hp)
## ID east north price fl_area
## 0 1 523800 179700 107000 50
## 1 2 533200 170900 55500 66
## 2 3 514600 175800 103000 90
## 3 4 516000 171000 187000 125
## 4 5 533700 169200 43000 50
## 5 6 547900 189600 69995 95
## 6 7 531900 189900 80000 79
## 7 8 514600 190300 57500 44
## 8 9 525700 172900 104000 68
## 9 10 536700 188100 69000 87
## 10 11 529000 171300 88000 76
## 11 12 523900 191200 123000 107
## 12 13 546800 169600 83000 92
## 13 14 547100 187600 52995 53
## 14 15 531900 185800 255000 146
## 15 16 528700 173400 185000 125
## 16 17 529200 164400 152000 157
## 17 18 525000 190200 85000 79
## 18 19 514600 189300 120000 108
## 19 20 524200 174600 135000 91
## 20 21 516300 176700 143000 100
## 21 22 505100 183400 59950 39
## 22 23 523800 184600 285000 198
## 23 24 526400 177100 272500 155
## 24 25 532800 167200 73950 87
## 25 26 533700 166500 67500 59
## 26 27 514200 172200 174000 114
## 27 28 541200 178200 65000 92
## 28 29 538400 191500 50300 61
## 29 30 539900 190200 63200 62
## ... ... ... ... ... ...
## 1375 1376 528000 176400 52000 83
## 1376 1377 539300 169400 142000 127
## 1377 1378 537100 182700 70000 60
## 1378 1379 531200 194600 243000 129
## 1379 1380 527900 166100 58000 48
## 1380 1381 522900 168500 139950 120
## 1381 1382 518800 163600 125000 98
## 1382 1383 522200 177600 134000 70
## 1383 1384 526700 175100 85000 46
## 1384 1385 528900 194900 99000 96
## 1385 1386 526700 165900 77000 80
## 1386 1387 527000 171600 76000 80
## 1387 1388 538700 185600 64000 76
## 1388 1389 540900 173700 64000 77
## 1389 1390 526000 173800 59999 52
## 1390 1391 552000 190400 246000 155
## 1391 1392 536000 169200 122000 127
## 1392 1393 532000 172200 79000 75
## 1393 1394 536600 165300 139950 110
## 1394 1395 527600 193300 65000 54
## 1395 1396 520000 186900 250000 136
## 1396 1397 511300 176400 66950 105
## 1397 1398 522400 166400 119950 85
## 1398 1399 528300 175900 151500 75
## 1399 1400 528800 171300 230000 178
## 1400 1401 515600 173100 68500 44
## 1401 1402 513200 186500 58500 59
## 1402 1403 542900 189500 247000 185
## 1403 1404 524900 185300 153000 96
## 1404 1405 522000 185400 146250 111
##
## [1405 rows x 5 columns]
You can access the contents of the columns in the same way as you access items in dictionaries:
print(hp['price'])
## 0 107000
## 1 55500
## 2 103000
## 3 187000
## 4 43000
## 5 69995
## 6 80000
## 7 57500
## 8 104000
## 9 69000
## 10 88000
## 11 123000
## 12 83000
## 13 52995
## 14 255000
## 15 185000
## 16 152000
## 17 85000
## 18 120000
## 19 135000
## 20 143000
## 21 59950
## 22 285000
## 23 272500
## 24 73950
## 25 67500
## 26 174000
## 27 65000
## 28 50300
## 29 63200
## ...
## 1375 52000
## 1376 142000
## 1377 70000
## 1378 243000
## 1379 58000
## 1380 139950
## 1381 125000
## 1382 134000
## 1383 85000
## 1384 99000
## 1385 77000
## 1386 76000
## 1387 64000
## 1388 64000
## 1389 59999
## 1390 246000
## 1391 122000
## 1392 79000
## 1393 139950
## 1394 65000
## 1395 250000
## 1396 66950
## 1397 119950
## 1398 151500
## 1399 230000
## 1400 68500
## 1401 58500
## 1402 247000
## 1403 153000
## 1404 146250
## Name: price, Length: 1405, dtype: int64
This lists all of the items in the column price. Sometimes rescaling is useful - particularly to obtain ‘neater’ axes.
print(hp['price']/1000)
## 0 107.000
## 1 55.500
## 2 103.000
## 3 187.000
## 4 43.000
## 5 69.995
## 6 80.000
## 7 57.500
## 8 104.000
## 9 69.000
## 10 88.000
## 11 123.000
## 12 83.000
## 13 52.995
## 14 255.000
## 15 185.000
## 16 152.000
## 17 85.000
## 18 120.000
## 19 135.000
## 20 143.000
## 21 59.950
## 22 285.000
## 23 272.500
## 24 73.950
## 25 67.500
## 26 174.000
## 27 65.000
## 28 50.300
## 29 63.200
## ...
## 1375 52.000
## 1376 142.000
## 1377 70.000
## 1378 243.000
## 1379 58.000
## 1380 139.950
## 1381 125.000
## 1382 134.000
## 1383 85.000
## 1384 99.000
## 1385 77.000
## 1386 76.000
## 1387 64.000
## 1388 64.000
## 1389 59.999
## 1390 246.000
## 1391 122.000
## 1392 79.000
## 1393 139.950
## 1394 65.000
## 1395 250.000
## 1396 66.950
## 1397 119.950
## 1398 151.500
## 1399 230.000
## 1400 68.500
## 1401 58.500
## 1402 247.000
## 1403 153.000
## 1404 146.250
## Name: price, Length: 1405, dtype: float64
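The numpy summary functions seen earlier also work directly on a pandas column, and pandas columns have a describe method that gives several summaries at once. A quick sketch (not part of the original notes - output omitted here):
# Summary statistics for the price column
print(np.mean(hp['price']))        # mean price
print(np.std(hp['price']))         # standard deviation of price
print(hp['price'].describe())      # count, mean, quartiles and so on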
Having read in this data, we can start to analyse and visualise it in Python. Firstly, to visualise it the pyplot package (a sub-package of matplotlib)1 can be used - this produces Python plots that look quite like the ones that MATLAB produces.
import matplotlib.pyplot as pl
# Try to draw a histogram
pl.hist(hp['price'])
pl.show()
You will see that at the pl.hist stage no graphic appears. pyplot works by building up the layers of an image and then, when everything is in place, using the show function to actually create the plot.
This would perhaps look better if the prices were in thousands of pounds and some titles and labels were provided. A nicer result is obtained using:
# Try to draw a neater histogram
pl.close()
pl.hist(hp['price']/1000)
pl.xlabel('Price (1000s Pounds)')
pl.ylabel('Frequency')
pl.title('London House Prices')
pl.show()
Note the way the graphic is ‘built up’ by putting the data, and then the various text labels, on step by step. Finally, the show command is added, to draw the completed graphic. The initial pl.close() call closes the previous plot (otherwise the new plot would be overplotted on the original axes).
You can use some of the numpy operations on data frames - for example it might be useful to see the distribution of log house prices. House prices are often roughly log-normal, so the logged values have a distribution closer to a normal one.
# histogram of log house price
pl.close()
pl.hist(np.log10(hp['price']))
pl.xlabel('Logarithm of Price (1000s Pounds)')
pl.ylabel('Frequency')
pl.title('Logged house prices')
pl.show()
Here the function log10 is used - taking logs to base 10 of the prices - so that \(\log_{10}(10,000) = 4\), \(\log_{10}(100,000) = 5\) and \(\log_{10}(1,000,000) = 6\).
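A quick way to confirm these values (an added check, not in the original notes):
print(np.log10([10000.0, 100000.0, 1000000.0]))   # gives 4.0, 5.0 and 6.0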
It is also possible to create scatter plots, using scatter:
pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Floor Area (London)')
pl.show()
Interestingly, although there is some linkage, the relationship is not especially strong. Note that the dot density is quite high in the lower left-hand corner, so it is hard to see the relative numbers of points. One way of addressing this is to use semi-transparent points. In the scatter function, the colour was specified as r (for red), but it is also possible to give the colour as a (red, green, blue, transparency) 4-tuple. The three colour levels range from 0 to 1; the transparency level also runs from 0 to 1, with 0 being completely transparent and 1 not transparent at all.
pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color=(1,0,0,0.1),marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price (1000s Pounds)')
pl.title('House Price v. Floor Area (London)')
pl.grid(True)
pl.show()
Note that a grid was also added here. The log transform is also of interest. Suppose we take logs of both the price and the floor area and plot the result:
pl.close()
pl.scatter(np.log10(hp['fl_area']),np.log10(hp['price']),color='r',marker='.')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.grid(True)
pl.show()
Here it is worth noting that the distribution is elliptical - suggesting a correlation, but also a possible bivariate normal shape. Another way to visualise this is to leave the data untransformed, but to use log-scale axes. This is achieved with the loglog function:
pl.close()
pl.scatter(hp['fl_area'],hp['price'],color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price')
pl.title('House Price v. Floor Area (London)')
pl.grid()
pl.loglog()
pl.show()
Having plotted this data, some statistical analysis may be appropriate. A further package, called scipy (Scientific Python), can be useful here. It is probably useful to fit a regression line to the data in the last figure you drew. In scipy there is a sub-package called stats. This has a function called linregress:
import scipy.stats as st
print(st.linregress(np.log10(hp['fl_area']),np.log10(hp['price'])))
## LinregressResult(slope=0.92700369750028555, intercept=3.2248787881527665, rvalue=0.68269377157031175, pvalue=2.106487886894245e-193, stderr=0.026489181088576768)
The five numbers returned by this function as a 5-tuple are, in order, the slope, the intercept, the correlation coefficient (rvalue), the p-value for a test of zero slope, and the standard error of the estimated slope.
As with the dimensions of arrays obtained with shape earlier, you can assign all of these to five variables in a single statement:
# Here are the 5 variables being assigned
a,b,r,pval,sderr = st.linregress(np.log10(hp['fl_area']),
np.log10(hp['price']))
# Print some out neatly...
print("a = %6.3f (slope)" % a)
## a = 0.927 (slope)
print("b = %6.3f (Intercept)" % b)
## b = 3.225 (Intercept)
print("r = %6.3f (correlation)" % r)
## r = 0.683 (correlation)
This just fitted a model of the form \(\log_{10}(\texttt{price}) = b + a\log_{10}(\texttt{fl_area})\), where \(b\) is the intercept and \(a\) is the slope (matching the order in which linregress returned them). The fit is reasonably good, suggesting a power law relation between price and floor area. This can also be written in the form \[\begin{equation} \texttt{price} = A(\texttt{fl_area})^{a} \end{equation}\] where \(A = 10^{b}\). Since the exponent here is the slope \(a\), it is informative to compute \(10^{a}\) in Python - the factor by which the fitted price is multiplied when the floor area increases tenfold:
print(10 ** a)
## 8.45286041734
Since the slope \(a\) is slightly less than one, this suggests a slight diminishing return on the value of a house given its floor area - a tenfold increase in floor area multiplies the fitted price by about 8.45 rather than 10, and doubling the floor area doesn't quite double the value of the house.
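As a quick check of this (an added sketch, not in the original notes), the factor by which the fitted price changes when the floor area doubles is 2 raised to the power of the slope:
print(2 ** a)    # roughly 1.90 - doubling the floor area multiplies the fitted price by about 1.9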
You can also add the fitted line to the scatter plot. The plot command in pyplot can do this. As before, the plot is ‘built up’, and the line is added after the scatter points and labels. plot is a general line drawing function which will join together a set of \((x,y)\) points, but here the \(x\)-values are a set of regularly spaced points along the line, and the \(y\)-values are the corresponding fitted regression values. The function np.linspace(lowest,highest,n) creates a set of linearly-spaced values from a lowest to a highest value, incorporating n points. In the plot function, the colour 'b' means blue. Also, the lw parameter controls the line width.
pl.close()
pl.scatter(np.log10(hp['fl_area']),np.log10(hp['price']),
color='r',marker='.')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.grid(True)
x_reg = np.linspace(1.3,2.5,100)
y_reg = b + a * x_reg
pl.plot(x_reg,y_reg,color='b',lw=2.0)
pl.show()
To see the power law relation, it may be useful to plot the untransformed variables - the following code applies the inverse of the \(\log_{10}(x)\) transform (\(10^{x}\)) to the points on the fitted line:
pl.close()
pl.scatter(hp['fl_area'],hp['price']/1000,color='r',marker='.')
pl.xlabel('Floor Area (sq. metres)')
pl.ylabel('Price')
pl.title('House Price v. Floor Area (London)')
pl.grid(True)
pl.plot(10.0**x_reg,10.0**y_reg/1000,color='b',lw=2.0)
pl.show()
However, it should be noted that the deviation from linearity is quite small.
In addition to creating a scatter plot of price and floor area, another idea might be to create a kernel density estimate of these two variables. There is a function to do this in scipy.stats as well, called gaussian_kde.
One complication here is that this function expects data in a different format to much of the numpy, scipy and pyplot material. In particular, the input to the KDE function needs to be a 2 \(\times\) n matrix, where n is the number of data points. Note this is the opposite orientation to a data frame, as variables correspond to rows, not columns. Giving hp a list of column names will extract the corresponding columns - but here, the data will also need to be transposed. Fortunately this is fairly easy in numpy - the transpose of a matrix x is just x.T. In a single expression it is possible to extract the price and floor area columns, transpose them, take logs of them and assign the result to a new variable called log_data.
log_data = np.log10(hp[['fl_area','price']].T)
Next, this data is fed into the gaussian_kde function - this is an unusual function in that it doesn't return a single value - instead it returns another function. The returned function is actually used to compute the density estimate at new data points. Here, the newly created function is called f_hat. This name comes from the fact that in texts about kernel density estimation, the estimate is often referred to as \(\hat{f}(x,y)\).
f_hat = st.gaussian_kde(log_data)
This creates a new function that takes a matrix of data points, and works out the value of the kernel density estimate at those points. A typical use of this is to evaluate f_hat over a regular grid of \((x,y)\) points, which can then be used to draw contour images, or plot 3D surfaces. To do this, you need to create a 2D array of \(x\)-points and another of the corresponding \(y\)-points, and then step through each grid location and compute the density with f_hat. The grid values for the \(x\) and \(y\) points can be created using the numpy function meshgrid - this takes a list of \(x\)-values and another of \(y\)-values and creates 2D arrays of all possible combinations. linspace can be used to generate evenly spaced sequences in the \(x\) and \(y\) directions.
x_grd, y_grd = np.meshgrid(np.linspace(1.3,2.5,100),
np.linspace(4.0,6.5,100))
Note that this is another function that returns multiple values, in this case two 2D arrays, one for the \(x\)-values and one for the \(y\)'s. The next step is to create a third 2D array in which the corresponding density estimates will be stored. This can be done via the zeros_like function - it creates a 2D array that has the same number of rows and columns as its argument, but every element is zero. Essentially this just creates a 2D array of placeholders. Once this is done, you can loop through the grid of \(x\) and \(y\) values, and compute the density estimates:
dens_grd = np.zeros_like(x_grd)
for i in range(len(x_grd)):
    for j in range(len(y_grd)):
        dens_grd[i,j] = f_hat([x_grd[i,j],y_grd[i,j]])
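Incidentally, looping over every grid cell can be slow for larger grids. The function returned by gaussian_kde will accept a whole 2 × n array of points in one call, so the same grid of densities can be computed without an explicit loop - a sketch of this (not in the original notes):
# Flatten the x and y grids into a single 2 x n array of points,
# evaluate the density at all of them, then reshape back into a grid
pts = np.vstack([x_grd.ravel(), y_grd.ravel()])
dens_grd = f_hat(pts).reshape(x_grd.shape)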
Now that the array dens_grd contains the grid of kernel density estimates, it may be used to create contour images. Contour plots can be created with the pyplot function contour:
pl.close()
pl.contour(x_grd,y_grd,dens_grd,colors='DarkGreen')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()
Note that the colors argument here specified the colour of the contour lines. The colour 'DarkGreen' is a standard HTML colour - as listed at http://www.w3schools.com/html/html_colornames.asp.
A related function, contourf, plots filled contours, where the regions in between the contour lines have a solid colour.
pl.close()
pl.contourf(x_grd,y_grd,dens_grd)
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()
The result with the default colour scheme works well here - this is basically the viridis palette. However, other colour schemes are possible. pyplot refers to these as color maps - each one has a name, with many of the names corresponding to the ColorBrewer palettes. All of the standard palettes are available via the cmap argument - plus a few other ones - see https://matplotlib.org/users/colormaps.html for a list. The ones in the group labelled ‘Miscellaneous Colormaps’ on this page are best avoided2. Here the Greens ColorBrewer-based color map is used:
pl.close()
pl.contourf(x_grd,y_grd,dens_grd,cmap='Greens')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()
Finally, a good effect can be created by plotting both the filled contours and the contour lines:
pl.close()
pl.contourf(x_grd,y_grd,dens_grd,cmap='Greens')
pl.contour(x_grd,y_grd,dens_grd,colors='DarkGreen')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()
A further approach to creating contour plots is the tricontour function. This creates contours without the need to build a regular grid. Essentially, the function name is an abbreviation of triangular contour. It takes an irregular set of \((x,y,z)\) points, creates a triangulated network (a Delaunay triangulation) over the \((x,y)\) points, and then adds the \(z\) values to create a surface made of irregular triangular faces - which it then uses to create the contours. The useful thing about this approach is that it saves having to compute all of the grid points. In pyplot the functions tricontour and tricontourf are the respective triangulation-based versions of contour and contourf. In the code below, the \((x,y)\) data points (i.e. the log floor area and log price) are ‘fed’ into f_hat to estimate the density at the data points. Then the \(x\) and \(y\) values are extracted from log_data for the triangulated contour functions, using the expressions log_data.T['fl_area'] and log_data.T['price'].
pl.close()
dens = f_hat(log_data)
pl.tricontourf(log_data.T['fl_area'],log_data.T['price'],dens,cmap='Blues')
pl.tricontour(log_data.T['fl_area'],log_data.T['price'],dens,colors='Indigo')
pl.xlabel('Log Floor Area (sq. metres)')
pl.ylabel('Log Price')
pl.title('House Price v. Floor Area (London) - all data log transformed')
pl.show()
Note that in this case, the outer limit of the contour shading is not the entire graph area, but the convex hull of the points on the plot. This is an artefact of the Delaunay triangulation algorithm. Also note that the contours associated with the lower densities are more ‘wobbly’. This is essentially because - fairly obviously - lower densities occur where there are fewer points, and so the triangulation is cruder in these areas.
As well as using contour-based methods, it is also possible to create 3D plots in Python. This is done using the mpl_toolkits.mplot3d package. This provides Axes3D, which takes a figure object and creates a set of 3D axes (actually a coordinate mapping from 3D to 2D) that projects 3D images into a 2D window. This sounds complex, but in reality it just means that you create an object called ax that you place in front of the 3D graphics functions. One such function is plot_surface, which creates surfaces from rectangular grid data:
from mpl_toolkits.mplot3d import Axes3D
pl.close()
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1,
cmap='Reds',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()
A triangulated version also exists, using the function plot_trisurf:
pl.close()
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_trisurf(log_data.T['fl_area'],log_data.T['price'],dens,cmap='Blues',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()
Although you have been working with chunks of code up to this point, it is helpful to be able to create a whole program. An example is given at https://www.dropbox.com/s/jr9qn3yghnwtptd/housing.py?dl=0 - this reproduces the grid-based 3D density surface. The file is listed below:
# Import the various packages as toolkits
import numpy as np
import pylab as pl
import scipy.stats as st
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D
# Read in the house price data
hp = pd.read_csv('hpdemo.csv')
# Take logs of price and floor area
log_data = np.log10(hp[['fl_area','price']].T)
# Create the kernel function
f_hat = st.gaussian_kde(log_data)
# Create the mesh points
x_grd, y_grd = np.meshgrid(np.linspace(1.3,2.5,100),
np.linspace(4.0,6.5,100))
# Create the densities on the mesh point grid
dens_grd = np.zeros_like(x_grd)
for i in range(len(x_grd)):
    for j in range(len(y_grd)):
        dens_grd[i,j] = f_hat([x_grd[i,j],y_grd[i,j]])
# Create the 3D figure
fig = pl.figure()
ax = Axes3D(fig)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1,
cmap='Reds',linewidth=0.05)
pl.xlabel('Log Floor Area')
pl.ylabel('Log Price')
pl.show()
If you have downloaded this file, and you are working in the same folder, you can run the file as a single job via the command
python housing.py
One hint on creating program files: I usually put all the import statements I need at the top of the program.
The following are all exercises that involve modifying the file you have just downloaded.
Modify the function to produce a 2D filled contour plot instead of a 3D surface.
It is possible to save the diagrams to a file (a pdf for example), using pl.savefig('myfile.pdf') instead of pl.show(). Modify your code from the previous exercise to do this. Also, use a different colour map for the contour shading.
Go back to the original 3D plot, and replace the original ax.plot_surface line with the following two lines:
ax.contour3D(x_grd, y_grd, dens_grd, rstride=1, cstride=1,
colors='k', linewidth=2)
ax.plot_surface(x_grd, y_grd, dens_grd, rstride=1, cstride=1,
cmap='Reds',linewidth=0,alpha=0.5)
Can you explain what is going on here?
You won't have seen sub-packages yet, but it really just means a package called matplotlib.pyplot that can be imported without the rest of matplotlib.↩
It's not just me saying this - the one called ‘jet’ used to be the default, leading to this discussion: https://github.com/matplotlib/matplotlib/issues/875 .↩