Introduction

This Python assignment looks at parking violations in Philadelphia, Pennsylvania across different years. These datasets are from OpenDataPhilly: https://www.opendataphilly.org/dataset/parking-violations. When looking at the data, I wanted to examine the quarter and month in which violations occurred, the actual versus average violations per month, the day of the week and hour when violations occurred, and the top violations. When thinking of parking violations, I can think of about five different types off the top of my head, but I was quite wrong: there are actually 114 different parking violations in this dataset.

Dataset

The dataset that I used was actually a combination of eight different datasets. Each dataset covers six months of a year, with the years ranging from 2014 to 2017. By combining the datasets, I was able to make comparisons across all months of a year as well as across different years. The combined dataset has 6,428,915 rows and 13 different variables. The most heavily used variable is issue_datetime, which was used to derive the Year, Quarter, Month, Day of the Week, and Hour of each violation. Another key variable is violation_desc, which gives the description of the violation that occurred. Lastly, the lat and lon variables were used to identify where each violation occurred.

Findings

Some general findings are shown in Tab 1 with the Top 20 Violations by Description and the Top 20 Violations by Description without the Outlier. These visualizations lay the groundwork for the more in-depth visualizations in Tabs 2-6.


import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go
import folium
#path = "c:/users/pptallon/OneDrive - Loyola University Maryland/desktop/"
path="U:/"
# Read the eight semi-annual extracts (first and second halves of 2014 through 2017)
frames = []
for year in range(2014, 2018):
    for half in ['1st', '2nd']:
        filename = 'parking_violations ' + half + ' half ' + str(year) + '.csv'
        frames.append(pd.read_csv(path + filename))

# Stack the eight extracts into a single DataFrame
df = pd.concat(frames, axis = 0)

# Parse the issue timestamp and derive the date parts used throughout the report
df['Issue Date'] = pd.to_datetime(df.issue_datetime, format = '%Y-%m-%d %H:%M:%S')
df['Year'] = df['Issue Date'].dt.year
df['Quarter'] = df['Issue Date'].dt.quarter
df['Month'] = df['Issue Date'].dt.month
df['MonthName'] = df['Issue Date'].dt.strftime('%b')
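
As a quick sanity check on the combined data (a minimal sketch, assuming the combined df built above), the row count and the number of distinct violation descriptions can be printed to confirm the figures quoted in the Dataset section:

# Quick sanity check on the combined DataFrame (assumes df from the code above)
print('Rows:', '{:,}'.format(len(df)))                               # expected 6,428,915
print('Distinct violation types:', df['violation_desc'].nunique())   # expected 114 per the introduction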

Bar Charts

This tab shows some basic data visualizations for the top violations, by description. These visualizations help to set the stage for the more complex and in-depth visualizations that follow.

The first visualization is a horizontal bar chart that counts the total violations for each violation description. A dashed line marks the average number of violations across the top 20 violations. The legend uses three colors depending on where a violation falls relative to that average: purple for violations more than 10% above the average, grey for violations within 10% of the average, and blue for violations more than 10% below the average. This particular chart has no grey bars, since no violation falls within 10% of the average. The average for this visualization is 305,341 violations.


# Standardize the case of the violation descriptions
df['violation_desc']= df['violation_desc'].str.capitalize()

# Count violations per description and sort from most to least common
x = df.groupby(['violation_desc']).agg({'violation_desc' : ['count']}).reset_index()
x.columns = ['ViolationDescription', 'Count']
x = x.sort_values('Count', ascending = False)
x.reset_index(inplace = True, drop = True)

# Color each bar relative to the mean count: purple above +10%, blue below -10%, grey otherwise
def pick_colors_according_to_mean_count(this_data):
    colors = []
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.1:
            colors.append('darkviolet')
        elif each < avg*0.9:
            colors.append('blue')
        else:
            colors.append('grey')
    return colors

# Keep the top 20 descriptions and reorder ascending so the largest bar plots at the top of the chart
bottom = 0
top = 19
d1 = x.loc[bottom:top]
d1 = d1.sort_values('Count', ascending = True)
d1.reset_index(inplace = True, drop = True)

my_colors = pick_colors_according_to_mean_count(d1)

Above = mpatches.Patch(color = 'darkviolet', label = 'Above Average')
At = mpatches.Patch(color = 'grey', label = 'Within 10% of the Average')
Below = mpatches.Patch(color = 'blue', label = 'Below Average')

fig = plt.figure(figsize = (18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d1.ViolationDescription, d1.Count, color = my_colors);

for row_counter, value_at_row_counter in enumerate(d1.Count):
    if value_at_row_counter > d1.Count.mean()*1.1:
        color = 'darkviolet'
    elif value_at_row_counter < d1.Count.mean()*0.9:
        color = 'blue'
    else: 
        color = 'grey'
    ax1.text(value_at_row_counter+2, row_counter, '{:,}'.format(value_at_row_counter), 
             color = color, size = 12, fontweight = 'bold',
            ha = 'left', va = 'center', backgroundcolor = 'white')
plt.xlim(0, d1.Count.max()*1.1);

ax1.legend(loc = 'lower right', handles = [Above, At, Below], fontsize = 14)
plt.axvline(d1.Count.mean(), color = 'black', linestyle = 'dashed')
ax1.text(d1.Count.mean()+5000, 0, 'Mean = ' + str('{:,.0f}'.format(d1.Count.mean())), rotation = 0, fontsize = 14)

ax1.set_title('Top ' + str(top+1) + ' Violations', size = 20)
ax1.set_xlabel('Violation Count', fontsize = 16)
ax1.set_ylabel('Violation Description', fontsize = 16)
plt.xticks(fontsize = 14);
plt.yticks(fontsize = 14);

ax1.xaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.1fM')%(x*1e-6)))

plt.show()
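
As a side note, the same top-20 counts could also be pulled in one line with value_counts; a minimal alternative sketch, assuming df is loaded as above:

# Alternative, more compact way to get the same top-20 counts (assumes df from above)
top20 = df['violation_desc'].value_counts().head(20)
print(top20)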

The second visualization is also a horizontal bar chart of the top violations, using the same legend as the visualization above. The difference is that the outlier, the top violation (Meter expired cc), has been removed. This brings the average line down from 305,341 to 228,410 violations. With this new average, three violations now fall within 10% of the average (shown as grey bars).


df['violation_desc']= df['violation_desc'].str.capitalize()

x = df.groupby(['violation_desc']).agg({'violation_desc' : ['count']}).reset_index()
x.columns = ['ViolationDescription', 'Count']
x = x.sort_values('Count', ascending = False)
x.reset_index(inplace = True, drop = True)

def pick_colors_according_to_mean_count(this_data):
    colors = []
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.1:
            colors.append('darkviolet')
        elif each < avg*0.9:
            colors.append('blue')
        else:
            colors.append('grey')
    return colors

bottom = 1
top = 20 
d1 = x.loc[bottom:top]
d1 = d1.sort_values('Count', ascending = True)
d1.reset_index(inplace = True, drop = True)

my_colors = pick_colors_according_to_mean_count(d1)

Above = mpatches.Patch(color = 'darkviolet', label = 'Above Average')
At = mpatches.Patch(color = 'grey', label = 'Within 10% of the Average')
Below = mpatches.Patch(color = 'blue', label = 'Below Average')

fig = plt.figure(figsize = (18, 10))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d1.ViolationDescription, d1.Count, color = my_colors);

for row_counter, value_at_row_counter in enumerate(d1.Count):
    if value_at_row_counter > d1.Count.mean()*1.1:
        color = 'darkviolet'
    elif value_at_row_counter < d1.Count.mean()*0.9:
        color = 'blue'
    else: 
        color = 'grey'
    ax1.text(value_at_row_counter+2, row_counter, '{:,}'.format(value_at_row_counter), 
             color = color, size = 12, fontweight = 'bold',
            ha = 'left', va = 'center', backgroundcolor = 'white')
plt.xlim(0, d1.Count.max()*1.1);

ax1.legend(loc = 'lower right', handles = [Above, At, Below], fontsize = 14)
plt.axvline(d1.Count.mean(), color = 'black', linestyle = 'dashed')
ax1.text(d1.Count.mean()+5000, 0, 'Mean = ' + str('{:,.0f}'.format(d1.Count.mean())), rotation = 0, fontsize = 14)

ax1.set_title('Top ' + str(top) + ' Violations without Outlier', size = 20)
ax1.set_xlabel('Violation Count', fontsize = 16)
ax1.set_ylabel('Violation Description', fontsize = 16)
plt.xticks(np.arange(0, max(d1.Count)+1e5, 1e5), fontsize = 14);
plt.yticks(fontsize = 14);

ax1.xaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.0fK')%(x*1e-3)))

plt.show()

Donut Chart

This donut chart shows the total violations for 2017 in the middle of the chart, 1.72 million in all. The outer ring shows the total violations for each quarter and the percentage of the year's violations that each quarter makes up. All four quarters fall between 23.84% and 25.66% of the total. The highest quarter is Quarter 3 with 442.1K violations and the lowest is Quarter 1 with 410.7K violations. The inner ring breaks the totals down by month, showing each month's percentage of the total violations. The highest month is August, which makes up 9.22% of the total violations; this makes sense, since August is in Quarter 3, the quarter with the highest share. The lowest month is December with 7.16%. December is in Quarter 4, which is not the lowest quarter, but Quarter 4 is evened out by the second-highest month, October, also falling in Quarter 4.


# Quarter labels as text plus 2017 totals per quarter and month, ordered by calendar month
df['Quarter'] = 'Quarter ' + df['Issue Date'].dt.quarter.astype('str')
pie_df_2017 = df[df['Year'] == 2017].groupby(['Quarter', 'MonthName', 'Month'])['Year'].count().reset_index(name = 'TotalViolations')
pie_df_2017.sort_values(by = ['Month'], inplace = True)
pie_df_2017.reset_index(inplace = True, drop = True)
del pie_df_2017['Month']

# tab20c groups its 20 colors into hue families of 4 shades; every 4th index starts a new hue,
# so the quarters get the base hue of each family and the months get the in-between shades
number_outside_colors_2017 = len(pie_df_2017.Quarter.unique())
outside_color_ref_number_2017 = np.arange(number_outside_colors_2017)*4

number_inside_colors_2017 = len(pie_df_2017.MonthName.unique())
all_color_ref_number_2017 = np.arange(number_outside_colors_2017 + number_inside_colors_2017)

# Months take every colormap index not already used by a quarter
inside_color_ref_number_2017 = []
for each in all_color_ref_number_2017:
    if each not in outside_color_ref_number_2017:
        inside_color_ref_number_2017.append(each)
        
fig = plt.figure(figsize = (7, 7))
ax1 = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors_2017 = colormap(outside_color_ref_number_2017)

all_violations_2017 = pie_df_2017.TotalViolations.sum()

pie_df_2017.groupby(['Quarter'])['TotalViolations'].sum().plot(
    kind = 'pie', radius = 1, colors = outer_colors_2017, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor = 'white'), textprops = {'fontsize' : 13},
    autopct = lambda p: '{:.2f}%\n({:.1f}K)'.format(p, (p/100)*all_violations_2017/1e+3),
    startangle = 90)

inner_colors_2017 = colormap(inside_color_ref_number_2017)

pie_df_2017.TotalViolations.plot(
    kind = 'pie', radius = 0.7, colors = inner_colors_2017, pctdistance = 0.55, labeldistance = 0.8,
    wedgeprops = dict(edgecolor = 'white'), textprops = {'fontsize' : 12},
    labels = pie_df_2017.MonthName,
    autopct = '%1.2f%%',
    startangle = 90)

hole = plt.Circle((0,0), 0.3, fc = 'white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax1.yaxis.set_visible(False)
plt.title('Total Violations by Quarter and Month', fontsize = 18)

ax1.text(0,0, 'Total Violations\n' + str(round(all_violations_2017/1e+6, 2)) + 'M', ha = 'center', va = 'center', size = 16)
ax1.axis('equal');
plt.tight_layout()

plt.show()
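
To double-check the quarter shares quoted above (between 23.84% and 25.66%), the percentages can be recomputed directly from the grouped data; a minimal sketch, assuming pie_df_2017 from the code above:

# Recompute each quarter's share of 2017 violations (assumes pie_df_2017 from above)
quarter_totals = pie_df_2017.groupby('Quarter')['TotalViolations'].sum()
print((quarter_totals / quarter_totals.sum() * 100).round(2))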

Waterfall Diagram

This visualization is a waterfall diagram comparing the actual total violations per month against the average total violations per month for 2014. The monthly average was calculated by dividing the total violations for 2014 by 12, which works out to 125,805 violations per month. Each month's actual total was then compared to this average to see which months were lower or higher, and a running (cumulative) deviation from the average was tracked across the year. February had the fewest violations, coming in about 18.8K below the monthly average; after January and February both fell below average, the cumulative deviation stood at 33.3K violations below. October had the most violations, with 144.6K, which was 11.2K over the average. From January through September the cumulative actual violations ran below the cumulative average, and in October the actual violations for the year to date exceeded the average for the year to date by 8.7K.


# Monthly totals for 2014, the 'Budget' (annual total spread evenly across 12 months), and each month's deviation
wf_df_2014 = df[df['Year'] == 2014].groupby(['MonthName'])['Year'].count().reset_index(name = 'TotalViolations')
wf_df_2014['Budget'] = round(wf_df_2014.TotalViolations.sum()/12, 0)
wf_df_2014['Deviation'] = wf_df_2014.TotalViolations - wf_df_2014.Budget

# Put the months in calendar order
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
wf_df_2014.MonthName = pd.Categorical(wf_df_2014.MonthName, categories = months, ordered = True)
wf_df_2014.sort_values(by = 'MonthName', inplace = True)
wf_df_2014.reset_index(inplace = True, drop = True)

if wf_df_2014.loc[11, 'Deviation'] > 0:
    end_color = 'black'
elif wf_df_2014.loc[11, 'Deviation'] < 0:
    end_color = 'red'
else: end_color = 'blue'
    
# Every bar is a 'relative' step, so the waterfall accumulates the monthly deviations across the year
fig_2014 = go.Figure(go.Waterfall(name = '', orientation = 'v', x = wf_df_2014['MonthName'], textposition = 'outside',
                             measure = ['relative'] * 12,
                             y = wf_df_2014['Deviation']/1e3,
                             text = ['{:.1f}K'.format(each/1e3) for each in wf_df_2014['TotalViolations']],
                             decreasing = {'marker' : {'color' : 'red'}},
                             increasing = {'marker' : {'color' : 'green'}},
                             hovertemplate = 'Cumulative Deviation to Date: ' + '%{y:,.1f}K' + '<br>' +
                                             'Total Citations in %{x}: %{text}'))

fig_2014.layout = go.Layout(yaxis = dict(tickformat = '.1f'))

fig_2014.update_xaxes(title_text = 'Months', title_font = {'size' : 18});
fig_2014.update_yaxes(title_text = 'Total Citations (Running Total)', ticksuffix = 'K', title_font = {'size' : 18}, 
                 zeroline = True);

fig_2014.update_layout(title = dict(text = 'Deviation Between Actual and Average Monthly Citations in 2014 (Waterfall Diagram)<br>' +
                              'Above Average Appears in Green, Below Average Appears in Red', 
                              font = dict(family = 'Arial', size = 18, color = 'black')),
                 template = 'simple_white',
                 title_x = 0.5,
                 showlegend = False,
                 autosize = True,
                 margin = dict(l = 30, r = 30, t = 60, b = 30));

fig_2014.show()
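
The running deviation described above can also be checked in a table by taking a cumulative sum of the monthly deviations; a minimal sketch, assuming wf_df_2014 from the code above:

# Running (cumulative) deviation from the monthly average (assumes wf_df_2014 from above)
wf_df_2014['CumulativeDeviation'] = wf_df_2014['Deviation'].cumsum()
print(wf_df_2014[['MonthName', 'TotalViolations', 'Deviation', 'CumulativeDeviation']])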

Scatterplot

This visualization is a scatterplot of violations by weekday and hour. The color scale shows that the lighter the color (yellow), the more violations occurred on that day at that hour, and the darker the color (purple), the fewer. The size of each circle also reflects the number of violations: the larger the circle, the more violations. The highest tick on the color bar is 120,000, with ticks increasing in steps of 10,000, which tells us the busiest day-and-hour combination falls between 120,000 and 130,000 violations. The hour of 12 (noon to 1pm) on Tuesday through Friday appears to have the highest number of violations. The lowest is difficult to pin down, since there are many small purple circles on the scatterplot.


df['Issue Date'] = pd.to_datetime(df.issue_datetime, format = '%Y-%m-%d %H:%M:%S')
# Derive the hour of day and abbreviated weekday name for each violation
df['Hour'] = df['Issue Date'].dt.hour
df['DayName'] = df['Issue Date'].dt.strftime('%a')

hour_df = df.groupby(['Hour', 'DayName']).agg({'Hour': ['count']}).reset_index()
hour_df.columns = ['Hour', 'Weekday', 'Count']

days = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']
hour_df.Weekday = pd.Categorical(hour_df.Weekday, categories = days, ordered = True)
hour_df.sort_values(by = 'Weekday', inplace = True)
hour_df.reset_index(inplace = True, drop = True)

hour_df['Hour'] = round(hour_df['Hour'], 0)
hour_df['Count_Hundreds'] = round(hour_df['Count']/ 100, 0)

plt.figure(figsize = (12, 6))

plt.scatter(hour_df['Hour'], hour_df['Weekday'], marker = '.', cmap = 'viridis', 
            s = hour_df['Count_Hundreds'], c = hour_df['Count_Hundreds'])

plt.title('Violations by Weekday and Hour', fontsize = 18)
plt.xlabel('Hour (24 Hour)', fontsize = 14)
plt.ylabel('Weekday', fontsize = 14)

cbar = plt.colorbar()
cbar.set_label('Number of Violations', rotation = 270, fontsize = 14, color = 'black', labelpad = 20)

# Colorbar ticks: the color values are counts in hundreds, so a tick every 100 units equals 10,000 violations
my_colorbar_ticks = [*range(0, int(hour_df['Count_Hundreds'].max()), 100)]
cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = [*range(0, int(hour_df['Count'].max()), 10000)]
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_tick_labels]
cbar.set_ticklabels(my_colorbar_tick_labels)

my_x_ticks = [*range(int(hour_df['Hour'].min()), int(hour_df['Hour'].max()+1), 1)]
plt.xticks(my_x_ticks, fontsize = 14, color = 'black');

plt.show()
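
To confirm which weekday and hour combinations produce the brightest, largest points, the grouped counts can simply be sorted; a minimal sketch, assuming hour_df from the code above:

# Five busiest weekday/hour combinations (assumes hour_df from above)
print(hour_df.nlargest(5, 'Count')[['Weekday', 'Hour', 'Count']])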

Stacked Bar Chart

This visualization is a stacked bar chart showing, for the top seven parking violation types (after the Meter expired cc outlier), the total violations broken down by the day of the week on which they occurred. Overall, the weekdays have many more parking violations than the weekends for these violation types. The Meter expired violation had fairly similar counts from Monday through Saturday, but appears to have essentially no violations on Sunday. The parking prohibited violation had a similar count every day. The drop in parking violations on the weekends could be attributed to fewer people traveling into the city, if many of these violations are issued to people commuting in for work. Another possible cause is that police officers have more pressing priorities on weekends, with more people coming into the city for a good time, leaving less attention for parking violations.


# Rank descriptions by count; .loc[1:7] keeps the seven largest after the 'Meter expired cc' outlier
count_top_desc = df.value_counts(subset = ['violation_desc'], sort = True).reset_index(name = 'count')
seven_largest = count_top_desc.loc[1:7, 'violation_desc']
use_this_df = df[df['violation_desc'].isin(seven_largest)]

# Count violations per weekday for each of the seven descriptions and pivot for stacking
stacked_df = use_this_df.groupby(['DayName', 'violation_desc'])['violation_desc'].count().reset_index(name = 'TotalViolations')
stacked_df = stacked_df.pivot(index = 'violation_desc', columns = 'DayName', values = 'TotalViolations')
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
stacked_df = stacked_df.reindex(columns = list(reversed(day_order)))

fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(1, 1, 1)

stacked_df.plot(kind = 'bar', stacked = True, ax = ax)


plt.ylabel('Total Violations Issued', fontsize = 18, labelpad = 10)
plt.title('Total Violations Issued By Day for the Top 7 Violation Types\n Stacked Bar Plot', fontsize = 18)
plt.xticks(rotation = 0, horizontalalignment = 'center', fontsize = 14);
plt.yticks(rotation = 0, fontsize = 14);
ax.set_xlabel('Violation Type', fontsize = 18)

# Reverse the legend entries so they match the top-to-bottom order of the stacked segments
handles, labels = ax.get_legend_handles_labels()
plt.legend(handles[::-1], labels[::-1], loc = 'best', fontsize = 14)

ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('%1.0fK')%(x*1e-3)))

plt.show()
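
The weekday-versus-weekend gap described above can be quantified by summing the stacked totals for each day; a minimal sketch, assuming stacked_df from the code above:

# Total violations per weekday across the seven plotted violation types (assumes stacked_df from above)
day_totals = stacked_df.sum(axis = 0)
print(day_totals)
print('Weekend share: {:.1%}'.format(day_totals[['Sat', 'Sun']].sum() / day_totals.sum()))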

Map of Philadelphia

This visualization plots violations on a map of Philadelphia as colored dots, placed by the latitude and longitude of the violation and colored by violation description. It only looks at the first 5,000 violations in the dataset, which begins with the first half of 2014, and only certain violation descriptions are plotted. A green dot is for Meter expired, a blue dot is for Stop prohibited cc, a red dot is for Over time limit, a purple dot is for Expired inspection, an orange dot is for Fire hydrant, and a gold dot is for Sidewalk; any other violation description is not plotted. There are many blue dots concentrated in the center of the city, which makes sense, as a prohibited stop would happen more in the center of the city than on the outskirts.


new_df = df.dropna(subset = ['lat', 'lon'])
new_df = new_df.reset_index(drop = True)

center_of_map = [39.952583, -75.165222] # Philadelphia, PA

my_map = folium.Map(location = center_of_map,
                   zoom_start = 12,
                   width = '90%',
                   height = '100%',
                   left = '5%',
                   right = '5%', 
                   top = '0%')

tiles = ['cartodbpositron', 'openstreetmap', 'stamenterrain', 'stamentoner']
for tile in tiles: 
    folium.TileLayer(tile).add_to(my_map);
    
folium.LayerControl().add_to(my_map);

# Plot the first 5,000 violations, color-coded by description; anything marked black is skipped below
for i in range(0, 5000):
    violation = new_df.loc[i, 'violation_desc']
    if violation == 'Meter expired':
        color = 'green'
    elif violation == 'Over time limit':
        color = 'red'
    elif violation == 'Stop prohibited cc':
        color = 'blue'
    elif violation == 'Expired inspection':
        color = 'purple'
    elif violation == 'Fire hydrant':
        color = 'orange'
    elif violation == 'Sidewalk':
        color = 'gold'
    else: color = 'black'
        
    try: 
        if color != 'black':
            folium.Circle(location = [new_df.loc[i, 'lat'], new_df.loc[i, 'lon']],
                         tooltip = new_df.loc[i, 'violation_desc'],
                         popup = 'Date: {}: '.format(new_df.loc[i, 'issue_datetime']),
                         radius = 50, 
                         color = color,
                         fill = True, 
                         fill_color = color,
                         fill_opacity = 0.5).add_to(my_map)
    except Exception: 
        pass
      
my_map.save('U:/Dots_Parking.html')
knitr::include_graphics("U:/Map.png")
Map
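
Because only six violation descriptions are drawn, it can be useful to see how many of the first 5,000 rows actually appear on the map; a minimal sketch, assuming new_df from the code above:

# Count how many of the first 5,000 rows fall into a plotted category (assumes new_df from above)
plotted = ['Meter expired', 'Over time limit', 'Stop prohibited cc',
           'Expired inspection', 'Fire hydrant', 'Sidewalk']
print(new_df.loc[:4999, 'violation_desc'].isin(plotted).sum(), 'of the first 5,000 rows are plotted')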

Conclusion

Overall, these visualizations helped me gain a better understanding of parking violations in Philadelphia and where they occur. The findings also raised further questions. These visualizations show the different types of parking violations and when they occurred, but more analysis could be done if fine amounts were included alongside the violation descriptions. With fine amounts, the total fines collected by hour and by day could be analyzed, as well as the month with the highest fine collections. Overall, these visualizations provide a good basis on the types of parking violations and when and where they are happening in the city, as a starting point for further analysis.
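
If fine amounts were available, the hour-and-day analysis suggested above could reuse the same grouping pattern already applied to the counts; a minimal sketch using a small synthetic DataFrame with a hypothetical fine_amount column (the real dataset does not include fine amounts):

# Hypothetical sketch: total fines by weekday and hour, using synthetic data purely for illustration
demo = pd.DataFrame({'DayName': ['Mon', 'Mon', 'Tue', 'Sat'],
                     'Hour': [12, 12, 9, 22],
                     'fine_amount': [26.0, 36.0, 51.0, 76.0]})
fines_by_time = demo.groupby(['DayName', 'Hour'])['fine_amount'].sum().reset_index(name = 'TotalFines')
print(fines_by_time.sort_values('TotalFines', ascending = False))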