Assignment 10

Plot everything! In this assignment we will explore the matplotlib library and its features by plotting the results of previous assignments. Please do all of the following:

  1. Express the cars.data.csv data as a series of bar graphs. The x-axis represents a feature and the y-axis is the frequency in the sample. Do this with the buying, maint, safety, and doors fields, with one plot for each, for a total of four. Make each graph a subplot of a single output.
  2. Plot your results from the linear regression in assignments 5 and 7 (for any of the provided data sets). The plot should include:
    1. A scatter of the points in the .csv file
    2. A line showing the regression line (either from the calculation in homework 5 or the line fitting from homework 7).
    3. Something on the plot that specifies the equation for the regression line.
  3. Create an overlay of the center points found in objects.png from homework 8. The image should be in the background and the object centers can be small circles or points at or around the center points.
  4. Plot a line graph that shows the hour-by-hour change in the number of server requests from the HTTP log in homework 9. The x-axis is the discrete hour intervals (e.g. 13:00 - 14:00) and the y-axis is the number of requests.

As with previous assignments, many of the details of the implementation are up to you. However, keep this in mind: much of the purpose of plotting is to communicate data and the information therein effectively and efficiently. Your plots should be easy to interpret and robust enough to express the data used to create them. This means including things like labels, legends, proper scaling, etc. Also, you don’t need to perform the operations themselves from the previous homework. You can just have a static list of data, e.g. for the center points in part 3 you can have a list hard-coded or read in somewhere.

import matplotlib.pyplot as plt

# PLOT CARS DATA
import pandas, numpy
url_cars = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/cars.data.csv"
cars_vars = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class_val']
cars_data = pandas.read_table(url_cars, sep=',', header=None, names=cars_vars)
variable = ['buying', 'maint', 'safety', 'doors']
plt.figure(1)
for i in range(len(variable)):
    plt.subplot(221 + i)  # 2x2 grid, one subplot per field
    frequency = cars_data.groupby(variable[i])[variable[i]].count()  # counts per category
    features = list(frequency.index)
    frequencies = list(frequency)
    y_pos = numpy.arange(len(features))
    plt.bar(y_pos, frequencies, align='center', alpha=0.5)
    plt.xticks(y_pos, features)
    plt.ylabel('Frequency')
    plt.title(variable[i].title())
plt.tight_layout()  # keep subplot titles and tick labels from overlapping
plt.savefig("./DATA_602_HW10_1.png")  # forward slash avoids backslash-escape issues

# PLOT REGRESSION
import pandas, numpy
from scipy import stats
url_brainbody = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/brainandbody.csv"
brainandbody = pandas.read_csv(url_brainbody)
br = brainandbody["brain"]  # X-data
bo = brainandbody["body"]   # Y-data
slope, intercept, r_value, p_value, std_err = stats.linregress(br, bo)
plt.figure(2)
plt.plot(br, bo, '.', label='data')                    # scatter of the .csv points
plt.plot(br, slope*br + intercept, 'r-', label='fit')  # regression line
plt.xlabel('Brain Weight')
plt.ylabel('Body Weight')
plt.legend()
plt.title(r'$bo = %.4f \cdot br %+.4f$' % (slope, intercept))  # %+ keeps the intercept's sign readable
plt.savefig("./DATA_602_HW10_2.png")

# PLOT ON IMAGE
import scipy.ndimage as ndimage
import scipy.misc as misc
import urllib2, cStringIO, numpy
url = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/objects.png"
infile = cStringIO.StringIO(urllib2.urlopen(url).read())
raw = misc.imread(infile)
img = ndimage.gaussian_filter(raw, 2)  # smooth before thresholding
thres = img > img.mean()               # simple mean threshold
labels, count = ndimage.label(thres)   # background is label 0; objects are labels 1..count
index = list(range(1, count + 1))      # skip label 0 so the background gets no center
center = ndimage.measurements.center_of_mass(img, labels, index)
x = []; y = []
for i in range(count):
    x.append(center[i][1])  # column -> plot x
    y.append(center[i][0])  # row -> plot y
plt.figure(3)
infile = plt.imread(url)
implot = plt.imshow(infile)            # image in the background
plt.scatter(x, y, c='r', s=10)         # object centers overlaid as small points
plt.savefig("./DATA_602_HW10_3.png")

# PLOT SERVER REQUESTS
import pandas, datetime
url = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/epa-http.txt"
dirty = pandas.read_table(url, header=None, names=['raw'])
data = dirty.replace('=\\"\sH', '=\sH', regex=True)  # attempt to repair malformed quotes in some request fields
data['date'] = data['raw'].str.extract('(\[\S+\])', expand=True)  # pull the [dd:HH:MM:SS] timestamp
data['date'] = pandas.to_datetime(data['date'] + "081995", format='[%d:%H:%M:%S]%m%Y')  # log covers August 1995
data['hour'] = pandas.DatetimeIndex(data['date']).hour
frequency = data.groupby('hour')['hour'].count()
hours = list(frequency.index)
frequencies = list(frequency)
plt.figure(4)
plt.plot(hours, frequencies, 'ro-')
plt.xlabel('Hour of Day')
plt.ylabel('Frequency')
plt.title("Requests Per Hour")
plt.savefig("./DATA_602_HW10_4.png")
(Output figures: cars bar charts; regression scatter with fitted line; objects.png overlay with center points; server requests per hour.)

Assignment 11

Perform a Monte Carlo simulation to calculate Value at Risk (VaR) for the Apple stock price using the file attached to this lesson. There exist a number of ways to do this type of analysis, but you can follow this basic procedure (refer to this file for a more rigorous mathematical overview):

  1. The file apple.2011.csv has 3 columns: date, price, and percent change. The information you are really interested in is the percent change. This value is the percent change in the price from the previous date to the date on the corresponding row.
  2. Use the percent change values to create a Gaussian random number generator. This generator will create Gaussian distributed randoms to use as hypothetical percent changes on the day to day price of Apple stock.
  3. With that generator, create 20 random numbers. With these randoms, you can find a potential price change over 20 days.
  4. Start with the price from the last row and use the random numbers to determine a hypothetical price after 20 days. To do this, take that last price, apply a percent change, and get a new price. Then take that new price and apply the next percent change. Do this for all 20 randoms to get a final price.
  5. Store the final price after 20 days (you can discard the intermediate price values).
  6. Repeat steps 3 through 5 a very large number of times. Something like 10000. Each run will yield a different result.
  7. Take all of the stored prices (10000 or so), sort them, and find the 1st percentile (the lowest 1% of values). This value is the final result and represents the VaR with 99% confidence: the price will be above this result after 20 days with a confidence level of 99%.

The other requirement for this assignment is to use an IPython Notebook. Include in the notebook all the code, the results, and any other information you feel is needed (charts, graphs, plots, etc). Rather than submitting .py files, give me the .ipynb file for your notebook.

# To create a Jupyter Notebook, start Anaconda,
# open Jupyter Notebook from Anaconda; a browser window will open.
# From the Files tab, select New > Python 2 Notebook.

import pandas, numpy, random

url = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/apple.2011.csv"
apple_vars = ['date', 'price', 'change']
apple_data = pandas.read_csv(url, sep=',', header=None, names=apple_vars, skiprows=1)
change = apple_data['change']
change.pop(0)                        # first row has no percent change from a prior day
change = pandas.to_numeric(change)
mu = change.mean()                   # mean daily percent change
sigma = change.std()                 # standard deviation of daily percent change
print "mu =", mu, "\nsigma =", sigma, "\n\nTRIAL RUN"

# TRIAL RUN
p = [apple_data['price'][0]]  # starting price from the file's first row; the prompt asks
                              # for the last row's price, i.e. apple_data['price'].iloc[-1]
d = [0]
for i in range(20 + 1):
    if i != 0:
        d.append(random.gauss(mu, sigma))  # hypothetical daily percent change
        p.append(p[i - 1] * (1 + d[i]))    # compound it onto the previous day's price
    print "Day", i, "Price:", p[i], "Change:", d[i]

# SIMULATION
sim_price = []
for j in range(10000):                     # repeat the 20-day walk many times
    p = [apple_data['price'][0]]
    d = [0]
    for i in range(20 + 1):
        if i != 0:
            d.append(random.gauss(mu, sigma))
            p.append(p[i - 1] * (1 + d[i]))
    sim_price.append(p[-1])                # keep only the final (day 20) price
VaR = numpy.percentile(sim_price, 1)       # 1st percentile = VaR at 99% confidence
print "\nValue at Risk:", VaR
## mu = 0.000957355207171 
## sigma = 0.0165205562984 
## 
## TRIAL RUN
## Day 0 Price: 329.57 Change: 0
## Day 1 Price: 329.681262543 Change: 0.000337599123228
## Day 2 Price: 338.203725213 Change: 0.0258506128124
## Day 3 Price: 333.233213693 Change: -0.0146967970761
## Day 4 Price: 325.743593696 Change: -0.0224756107396
## Day 5 Price: 334.556182337 Change: 0.0270537588812
## Day 6 Price: 337.579207861 Change: 0.00903592784485
## Day 7 Price: 335.522798413 Change: -0.00609163538528
## Day 8 Price: 342.772031742 Change: 0.0216057846557
## Day 9 Price: 332.760843361 Change: -0.029206549701
## Day 10 Price: 336.873303444 Change: 0.012358605783
## Day 11 Price: 339.611199999 Change: 0.00812737764146
## Day 12 Price: 343.981604379 Change: 0.0128688464342
## Day 13 Price: 345.970091141 Change: 0.00578079390481
## Day 14 Price: 346.497524283 Change: 0.001524505024
## Day 15 Price: 359.932835263 Change: 0.0387746233049
## Day 16 Price: 372.470976744 Change: 0.0348346698403
## Day 17 Price: 369.547412869 Change: -0.00784910518565
## Day 18 Price: 373.823718252 Change: 0.0115717367627
## Day 19 Price: 382.933973097 Change: 0.0243704569818
## Day 20 Price: 370.681369365 Change: -0.0319966484893
## 
## Value at Risk: 281.113617852
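
As a rough sanity check (an addition, not part of the original assignment), the 20-day compounded return is approximately normal with mean 20·mu and standard deviation sigma·sqrt(20), so the 1st-percentile price can be estimated in closed form and compared against the simulated figure:

# Closed-form cross-check of the simulated VaR (a sketch; it assumes the
# mu, sigma, and apple_data computed above are still in scope).
from scipy.stats import norm
import numpy
q = 20 * mu + norm.ppf(0.01) * sigma * numpy.sqrt(20)  # 1% quantile of the 20-day return
print "Approximate VaR:", apple_data['price'][0] * numpy.exp(q)
# Prints a value near 283, in line with the ~281 from the simulation.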

Assignment 12

This homework will get your feet wet with some basic parallel computing approaches. Do both of the following:

  1. Take your solution from Homework 11 and complete the Monte Carlo step (step 6) in parallel. There are many ways you can go about doing this, and I’m not looking for anything too complicated. If you can get multiple processes crunching the data together, that is great. Using IPython’s built-in tools would be a great method.
  2. Compare the timing for your solution in homework 11 and this parallel solution. This is similar to what you did in homeworks 6 and 7. Ideally, you’ll see some speed improvement. The amount you see will largely be based on the capabilities of your hardware, and less on the software implementation. There is additional overhead for running an operation in parallel, so speed gains will be more obvious with a larger number of calculations.

As with the last homework, you will submit this to me as an IPython notebook. Include the results of your comparison there, along with everything else (code, charts, graphs, etc.)

# Install ipyparallel: $ pip install ipyparallel
# To enable the IPython Clusters tab in Jupyter Notebook: $ ipcluster nbextension enable
# Go to the IPython Clusters tab in Jupyter Notebook, select # of engines, click "Start"
# To disable it again: $ ipcluster nbextension disable
# To start a cluster of 4 engines from the command line: $ ipcluster start -n 4
# To stop the cluster from the command line: $ ipcluster stop

import timeit, ipyparallel as ipp

def MC_Simulation(trials):
    # Same Monte Carlo VaR calculation as homework 11, wrapped in a function
    # so it can be shipped to the engines (imports must happen inside).
    import pandas, numpy, random
    url = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/apple.2011.csv"
    apple_vars = ['date', 'price', 'change']
    apple_data = pandas.read_csv(url, sep=',', header=None, names=apple_vars, skiprows=1)
    change = apple_data['change']
    change.pop(0)
    change = pandas.to_numeric(change)
    mu = change.mean()
    sigma = change.std()
    sim_price = []
    for j in range(trials):
        p = [apple_data['price'][0]]
        d = [0]
        for i in range(20 + 1):
            if i != 0:
                d.append(random.gauss(mu, sigma))
                p.append(p[i - 1] * (1 + d[i]))
        sim_price.append(p[-1])
    VaR = numpy.percentile(sim_price, 1)
    return "\nValue at Risk:", VaR

def parallel_computing(trials):
    direct_view = clients[:]  # a DirectView over all engines
    # Note: apply_async on a DirectView sends the SAME call to every engine,
    # so each engine runs the full workload here (see the timing results below).
    async_result = direct_view.apply_async(MC_Simulation, trials)
    return async_result.get()

# PARALLEL COMPUTING
clients = ipp.Client()
clients.block = True
print "Clients:", clients.ids

# COMPARE TIMES
n = 10**2  # number of timing loops (the results below report 100 loops)
t = timeit.Timer(lambda: MC_Simulation(10000))
print "Processed Locally:", n, "loops =", t.timeit(n), "seconds"
t = timeit.Timer(lambda: parallel_computing(10000))
print "Parallel Computing:", n, "loops =", t.timeit(n), "seconds"
clients.shutdown(hub=True)

USING PYTHON LOCALLY

Clients: [0, 1, 2, 3]

Processed Locally: 100 loops = 119.301532646 seconds

Parallel Computing: 100 loops = 483.29072373 seconds

USING JUPYTER NOTEBOOK

Clients: [0, 1, 2, 3]

Processed Locally: 100 loops = 148.152401799 seconds

Parallel Computing: 100 loops = 357.975700053 seconds
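
The parallel timings above are slower than the local run because apply_async on a DirectView replicates the call: every engine ran all 10,000 trials. A minimal sketch of actually dividing the trials among the engines, assuming the simulation is refactored to return the raw prices (the MC_Prices helper below is hypothetical, not part of the original solution):

def MC_Prices(trials):
    # Like MC_Simulation, but returns the simulated final prices instead of a
    # percentile, since percentiles of partial runs cannot simply be pooled.
    import pandas, random
    url = "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20602/apple.2011.csv"
    apple_data = pandas.read_csv(url, sep=',', header=None,
                                 names=['date', 'price', 'change'], skiprows=1)
    change = pandas.to_numeric(apple_data['change'][1:])
    mu, sigma = change.mean(), change.std()
    prices = []
    for j in range(trials):
        p = apple_data['price'][0]         # mirrors the original starting price
        for i in range(20):
            p *= 1 + random.gauss(mu, sigma)  # compound 20 daily changes
        prices.append(p)
    return prices

import numpy, ipyparallel as ipp
clients = ipp.Client()
view = clients[:]
share = 10000 // len(clients)              # e.g. 2500 trials per engine with 4 engines
partials = view.apply_async(MC_Prices, share).get()  # each engine returns its share
print "Value at Risk:", numpy.percentile(numpy.concatenate(partials), 1)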
