Python Final Project


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter

Analysis of Healthy Lifestyles of Cities Around the World in 2021

Introduction

This data set contains information on the healthy lifestyle of 42 cities around the world in 2021. The cities that will be analyzed are: Amsterdam, Sydney, Vienna, Stockholm, Copenhagen, Helsinki, Fukuoka, Berlin, Barcelona, Vancouver, Melbourne, Beijing, Bangkok, Buenos Aires, Toronto, Madrid, Jakarta, Seoul, Frankfurt, Geneva, Tel Aviv, Istanbul, Cairo, Taipei, Los Angeles, Mumbai, Boston, Dublin, Tokyo, Chicago, Hong Kong, Shanghai, Brussels, San Francisco, Paris, Sao Paulo, Zurich, London, Johannesburg, Milan, Washington, D.C., New York, Moscow, and Mexico City.

The variables that this analysis will take into consideration are:

Sunshine hours
Cost of a bottle of water
Obesity levels
Life expectancy (years)
Annual avg. hours worked
Happiness levels
Outdoor activities
Number of take out places
Cost of a monthly gym membership

The questions that this analysis aims to answer are:

What is the obesity level in each city? Which cities have the highest obesity level?
How happy are the cities that logged the most annual hours worked?
How expensive is it to live a healthy lifestyle in each city?
How many take out places are in each city?
Which cities are the most active?
What is the correlation between obesity levels and life expectancy? Do higher obesity levels indicate a shorter life expectancy?

Findings

Obesity Level Analysis:

According to the Top 42 Obesity Levels chart, obesity levels for 2021 ranged from 5% to 36%, with an average of 21.92%. Among the top 10 cities, the mean obesity level is 34.01%, which is 12.09 percentage points higher than the mean for the top 42 cities. Specifically, Chicago, Boston, San Francisco, Washington, D.C., New York, and Los Angeles are the leading cities for obesity levels. All of these are cities in the United States, which illustrates America’s obesity crisis.

warnings.filterwarnings("ignore")

path = "u:/"

filename = "healthy_lifestyle_city_2021.csv"
#df = pd.read_csv(path + filename, nrows=20)

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.obesity_levels.mean()
    for each in this_data.obesity_levels:
        if each > avg*1.01:
            colors.append('lightblue')
        elif each < avg*0.99:
            colors.append('lightgreen')
        else:
            colors.append('black')
    return colors
  
Above = mpatches.Patch(color='lightblue', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='lightgreen', label='Below Average')

df3 = pd.read_csv(filename, usecols = ['City','Obesity levels(Country)'])

x3 = df3.sort_values(['Obesity levels(Country)'], ascending=True)
x3.loc[x3.index.max()+1] = ['City','Obesity levels(Country)']
x3.reset_index(inplace=True, drop=True)
x3.columns = ['City', 'Obesity_Levels']
x3['Obesity_Levels'] = x3['Obesity_Levels'].str[:-1]

obesity_levels = pd.to_numeric(x3['Obesity_Levels'], errors = 'coerce')
x3['obesity_levels'] = obesity_levels
x3.columns = ['City', 'Obesity_Levels', 'obesity_levels']
df_sorted_desc2 = x3.sort_values('obesity_levels', ascending=False)
df_sorted_desc2.reset_index(inplace=True, drop=True)
df_sorted_desc2.columns = ['City', 'Obesity_Levels', 'obesity_levels']
obesity_mean = df_sorted_desc2['obesity_levels'].mean()

bottom3 = 0
top3 = 43
df3 = df_sorted_desc2.loc[bottom3:top3]
my_colors2 = pick_colors_according_to_mean_count(df3)

fig = plt.figure(figsize=(18,14));
fig.suptitle('Obesity Level Analysis', fontsize=18, fontweight='bold');

ax1 = fig.add_subplot(2, 1, 1);
ax1.bar(df3.City, df3.obesity_levels, label='Obesity Levels', color=my_colors2, edgecolor='black');
ax1.legend(handles=[Above, At, Below],fontsize=14);
ax1.set_title('Top 42 Obesity Levels', size=20);
ax1.axhline(df3.obesity_levels.mean(), color='black', linestyle='dashed');
ax1.spines['right'].set_visible(False);
ax1.spines['top'].set_visible(False);
ax1.axes.xaxis.set_visible(False);
ax1.text(top3-5, df3.obesity_levels.mean()+1, 'Mean = ' + str(round(df3.obesity_levels.mean(),2)) +'%', rotation=0, fontsize=14);

bottom4 = 0
top4 = 9

df4 = df_sorted_desc2.loc[bottom4:top4]
my_colors3 = pick_colors_according_to_mean_count(df4)
ax2 = fig.add_subplot(2, 1, 2)

ax2.bar(df4.City, df4.obesity_levels, label='Obesity Levels', color= my_colors3, edgecolor='black');
ax2.set_ylim(0,df4.obesity_levels.max()*1.5);
ax2.legend(handles=[Above, At, Below],fontsize=14,loc='upper right');

ax2.set_title('Top 10 Cities for Highest Obesity Levels', size=20);
ax2.axhline(df4.obesity_levels.mean(), color='black', linestyle='solid');
ax2.spines['right'].set_visible(False);
ax2.spines['top'].set_visible(False);
ax2.axes.xaxis.set_visible(True);
plt.xticks(rotation=45);
ax2.text(top4-1, df4.obesity_levels.mean()+1, 'Mean = ' + str(round(df4.obesity_levels.mean(),2)) +'%', rotation=0, fontsize=14);

fig.subplots_adjust(hspace=0.35);

ax1.yaxis.set_major_formatter('{x:.0f}%');
ax2.yaxis.set_major_formatter('{x:.0f}%');

plt.show()
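The summary figures quoted in the obesity findings (overall mean, top-N mean, and the gap between them) can be reproduced with a few lines of pandas. The following is a minimal sketch on made-up values, not the real data set; the helper name `percent_to_float` and the sample cities are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, NOT taken from healthy_lifestyle_city_2021.csv.
sample = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Obesity levels(Country)": ["36.20%", "19.50%", "4.30%", "20.00%"],
})

def percent_to_float(series):
    """Strip a trailing '%' and convert to float; unparseable entries become NaN."""
    return pd.to_numeric(series.str.rstrip("%"), errors="coerce")

sample["obesity_levels"] = percent_to_float(sample["Obesity levels(Country)"])

# Mean over all cities, mean over the N highest, and the gap between the two.
overall_mean = sample["obesity_levels"].mean()
top2_mean = sample["obesity_levels"].nlargest(2).mean()
gap_in_points = top2_mean - overall_mean  # a difference of two percentages is in percentage points

print(f"overall {overall_mean:.2f}%, top 2 {top2_mean:.2f}%, gap {gap_in_points:.2f} points")
```

Using `nlargest` avoids the sort, reset and slice steps when only the top rows are needed.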

Annual Avg. Hours Worked and Happiness Levels Analysis

This dual bar chart shows that, on a scale of 1 to 10 (1 being very unhappy and 10 being the happiest), the cities that logged the most annual average hours worked fall in the top half of the happiness range. This suggests that people feel happy about the jobs to which they devote a significant amount of time, and that work can be a fulfilling path to happiness. Low happiness levels in the cities with the most hours worked would be concerning, because people who are generally happy at work are more likely to produce better outcomes that enhance the overall success of a business.


df1 = pd.read_csv(filename, usecols = ['City','Annual avg. hours worked','Happiness levels(Country)'])
x = df1.sort_values('Annual avg. hours worked', ascending=False)
x.loc[x.index.max()+1] = ['City','Annual avg. hours worked','Happiness levels(Country)']
x.reset_index(inplace=True, drop=True)
bottom1 = 0
top1 = 19
d2 = x.loc[bottom1:top1]
d2.columns = ['City', 'Hours', 'Happiness_levels']
df_sorted_desc= d2.sort_values('Hours',ascending=True)

def autolabel(these_bars, this_ax, place_of_decimal, symbol):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimal),
                   fontsize=10, color='black', ha='center', va='bottom'  )
                   
df_sorted_desc.Hours = df_sorted_desc.Hours.astype(int)

fig = plt.figure(figsize=(18, 14));
ax1 = fig.add_subplot(1, 1, 1);
ax2 = ax1.twinx();
bar_width = 0.35
x_pos = np.arange(20)
hours_bars = ax1.bar(x_pos-(0.5*bar_width), df_sorted_desc.Hours, bar_width, color='gray', edgecolor='black', label='Annual Avg. Hours Worked')
happiness_bars = ax2.bar(x_pos+(0.5*bar_width), df_sorted_desc.Happiness_levels, bar_width, color='green', edgecolor='black', label='Happiness Levels')
ax1.set_xticks(x_pos);
# Label the bars from the same frame, and in the same order, that was plotted.
ax1.set_xticklabels(df_sorted_desc['City'], rotation=45);
ax1.set_xlabel('City', fontsize=18);
ax1.set_ylabel('Annual Avg. Hours Worked', fontsize=18, labelpad=20);
ax2.set_ylabel('Happiness Level', fontsize=18, rotation=270, labelpad=20);
ax1.tick_params(axis='y', labelsize=14);
ax2.tick_params(axis='y', labelsize=14);
plt.title('Annual Avg. Hours Worked and Happiness Levels Analysis\n Top 20 Most Logged Hours', fontsize=18);
hours_color, hours_label = ax1.get_legend_handles_labels()
happiness_color, happiness_label = ax2.get_legend_handles_labels()
legend = ax1.legend(hours_color + happiness_color, hours_label + happiness_label, loc='upper left', frameon=True, ncol=1, shadow=True,
                    borderpad=1, fontsize=14);
ax2.set_ylim(0,df_sorted_desc.Happiness_levels.max()*1.5);
autolabel(happiness_bars, ax2, '.2f', '' );
ax1.yaxis.set_major_formatter('{x:,.0f}');
plt.show()
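The "top half" claim about the hardest-working cities can also be checked numerically rather than read off the chart. The sketch below is one possible way to do so, under the assumption that "top half" means above the median happiness of all cities; the sample values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, NOT taken from the real data set.
sample = pd.DataFrame({
    "City": ["A", "B", "C", "D", "E", "F"],
    "Hours": [2100, 1500, 1900, 1700, 2000, 1600],
    "Happiness": [6.0, 7.5, 5.0, 7.0, 6.5, 4.0],
})

# Median happiness across every city is the (assumed) cut-off for "top half".
median_happiness = sample["Happiness"].median()

# The N cities with the most hours worked, highest first.
busiest = sample.nlargest(3, "Hours")

# Fraction of those cities whose happiness is above the median.
share_above_median = (busiest["Happiness"] > median_happiness).mean()

print(busiest[["City", "Hours", "Happiness"]])
print(f"median happiness: {median_happiness}, share above median: {share_above_median:.2f}")
```

A different cut-off, such as the midpoint of the 1 to 10 scale, can be substituted for the median without changing the rest of the sketch.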

Avg. Water Bottle Cost and Monthly Gym Cost per City Analysis

According to this stacked bar chart, which examines how expensive it is to live a healthy lifestyle in each city, Zurich has the highest combined cost for a monthly gym membership and a bottle of water (above 70 dollars), while Sao Paulo has the lowest (under 20 dollars). The top five most expensive cities are Zurich, Tokyo, Geneva, Washington, D.C., and San Francisco. Water bottle costs are depicted in orange; the cities that charge the least for a bottle of water are Istanbul, Buenos Aires, Johannesburg, Vancouver, and Los Angeles.


df3 = pd.read_csv(filename,
                  usecols=['City', 'Cost of a bottle of water(City)', 'Cost of a monthly gym membership(City)'])
df3.columns = ['City', 'WBCost', 'Gym_Cost']

stacked_df = df3.groupby(['City', 'Gym_Cost'])['WBCost'].sum().reset_index(name='WaterBottle_Cost')
stacked_df['Gym_Cost'] = stacked_df['Gym_Cost'].map(lambda x: str(x)[1:])
stacked_df['WaterBottle_Cost'] = stacked_df['WaterBottle_Cost'].map(lambda x: str(x)[1:])
stacked_df['Gym_Cost'] = stacked_df['Gym_Cost'].astype(float)
stacked_df['WaterBottle_Cost'] = stacked_df['WaterBottle_Cost'].astype(float)
stacked_df.set_index("City", drop=True, inplace=True)
stacked_df=stacked_df.sort_values('Gym_Cost', ascending=False)

fig3 = plt.figure(figsize=(18, 14));
ax = fig3.add_subplot(1, 1, 1);

stacked_df.plot(kind='bar', stacked=True, ax=ax)
plt.ylabel('Total Cost', fontsize=18, labelpad=10);
plt.title('Avg. Water Bottle Cost and Monthly Gym Cost per City Analysis \n Stacked Bar Plot', fontsize=18);
plt.xticks(rotation=63, horizontalalignment='center', fontsize=8);
ax.set_xlabel('City',  fontsize=18);
ax.yaxis.set_major_formatter('${x:1.0f}');
plt.show()
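Because the findings rank cities by combined cost, it can help to compute that total explicitly and sort on it. The following is a minimal sketch, assuming the cost columns are strings with a leading currency symbol; the helper name `currency_to_float`, the sample values, and the `Total` column are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, NOT taken from the real data set.
sample = pd.DataFrame({
    "City": ["A", "B", "C"],
    "Gym": ["$40.00", "$70.50", "$15.25"],
    "Water": ["$1.50", "$3.20", "$0.45"],
})

def currency_to_float(series):
    """Drop everything except digits and the decimal point, then convert to float."""
    cleaned = series.str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Convert both cost columns, keeping the city as the index.
costs = sample.set_index("City").apply(currency_to_float)
costs.columns = ["Gym_Cost", "WaterBottle_Cost"]

# Combined cost per city, most expensive first.
costs["Total"] = costs["Gym_Cost"] + costs["WaterBottle_Cost"]
ranked = costs.sort_values("Total", ascending=False)

print(ranked)
```

Plotting `ranked[["Gym_Cost", "WaterBottle_Cost"]]` as a stacked bar chart would then order the bars by total height.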

Take Out Places per City Analysis

According to this horizontal bar chart, London has the most take out restaurants, with 6,417, and Cairo has the fewest, with 250. The average number of take out places among the cities is 1,443. Los Angeles and Buenos Aires are within 1% of the mean. The top five cities by take out place count are London, Tokyo, Paris, Sao Paulo, and Moscow. Cairo and Beijing have the fewest, so opportunities for restaurant expansion look promising there, as millions of potential customers live in those cities.


df2 = pd.read_csv(filename, usecols = ['City','Number of take out places(City)'])
x2 = df2.sort_values(['Number of take out places(City)'], ascending=True)
x2.loc[x2.index.max()+1] = ['City','Number of take out places(City)']
x2.reset_index(inplace=True, drop=True)
x2.columns = ['City', 'TakeOut_Count']
TakeOut_Count = pd.to_numeric(x2['TakeOut_Count'], errors = 'coerce')
x2['TakeOut_Count'] = TakeOut_Count
df_sorted_desc = x2.sort_values('TakeOut_Count', ascending=False)
df_sorted_desc.reset_index(inplace=True, drop=True)
df_sorted_desc.columns = ['City', 'TakeOut_Count']
TakeOut_mean = df_sorted_desc['TakeOut_Count'].mean()

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.TakeOut_Count.mean()
    for each in this_data.TakeOut_Count:
        if each > avg*1.01:
            colors.append('green')
        elif each < avg*0.99:
            colors.append('orange')
        else:
            colors.append('black')
    return colors
  
Above = mpatches.Patch(color='green',  label='Above Average')
At    = mpatches.Patch(color='black',  label='Within 1% of the Average')
Below = mpatches.Patch(color='orange', label='Below Average')

bottom2 = 0
top2 = 43
df2 = df_sorted_desc.loc[bottom2:top2]
df2 = df2.sort_values('TakeOut_Count', ascending=True)
my_colors = pick_colors_according_to_mean_count(df2)

fig2 = plt.figure(figsize=(18,14));
fig2.suptitle('TakeOut Count', fontsize=18, fontweight='bold');
ax1 = fig2.add_subplot(1, 1, 1);
ax1.barh(df2.City, df2.TakeOut_Count, label='TakeOut_Count', color=my_colors);

for row_counter, value_at_row_counter in enumerate(df2.TakeOut_Count):
    if value_at_row_counter > df2.TakeOut_Count.mean() * 1.01:  # same 1% band as the bar colors
        color = 'green'
    elif value_at_row_counter < df2.TakeOut_Count.mean() * .99:
        color = 'orange'
    else:
        color = 'black'
    ax1.text(value_at_row_counter+2, row_counter, str(value_at_row_counter), color=color, size=5.5, fontweight='bold',
             ha='left', va='center', backgroundcolor='white')

plt.xlim(0,df2.TakeOut_Count.max()*1.1);

ax1.legend(handles=[Above, At, Below],fontsize=14);

ax1.set_title('Cities with the Most TakeOut Restaurants', size=20);
ax1.set_xlabel('Number of TakeOut Restaurants', fontsize=14);
ax1.set_ylabel('City', fontsize=14);
plt.xticks(fontsize=12);
plt.yticks(fontsize=12);
plt.axvline(df2.TakeOut_Count.mean(), color='black', linestyle='dashed');
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14);
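# On a horizontal bar chart the count is the x coordinate, so the label sits just right of the mean line.
ax1.text(df2.TakeOut_Count.mean()*1.02, 1, 'Mean = ' + str(round(df2.TakeOut_Count.mean())), rotation=0, fontsize=14);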

plt.show()
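The bar colors and the value-label colors both rely on the same "within 1% of the mean" rule. A minimal sketch of a single shared helper follows, so the two thresholds cannot drift apart; the function name `classify_against_mean`, the `band` parameter, and the sample counts are hypothetical.

```python
import numpy as np
import pandas as pd

def classify_against_mean(values, band=0.01):
    """Label each value 'above', 'below', or 'within' a +/- band around the mean."""
    values = pd.Series(values, dtype=float)
    avg = values.mean()
    labels = np.select(
        [values > avg * (1 + band), values < avg * (1 - band)],
        ["above", "below"],
        default="within",
    )
    return pd.Series(labels, index=values.index)

# Hypothetical sample counts, NOT taken from the real data set.
counts = pd.Series([100, 200, 300, 201], index=["A", "B", "C", "D"])
labels = classify_against_mean(counts)

# One mapping from label to color, reusable for bars, text, and legend patches.
color_map = {"above": "green", "within": "black", "below": "orange"}
colors = labels.map(color_map).tolist()

print(labels.to_dict())
print(colors)
```

Passing `colors` to both `barh(..., color=colors)` and the per-row `text(..., color=...)` calls would keep the chart internally consistent.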

Top 5 Cities for Hours spent on Outdoor Activities

This pie chart shows Amsterdam, London, Sydney, Barcelona, and Istanbul as the top 5 cities for the most logged hours of outdoor activity. Together, these cities logged 2,265 hours of outdoor activities in 2021. Barcelona ranks first with 585 hours, London second with 433 hours, Amsterdam third with 422 hours, Istanbul fourth with 419 hours, and Sydney fifth with 406 hours.


df2 = pd.read_csv(filename, usecols = ['City','Outdoor activities(City)'])
x2 = df2.sort_values(['Outdoor activities(City)'], ascending=True)
pie_df = x2.groupby(['City'])['Outdoor activities(City)'].sum().reset_index(name='Outdoor_Activities')

pie_df=pie_df.sort_values(['Outdoor_Activities'], ascending=False)
top = 5
bottom = 0
pie_df=pie_df[bottom:top]
number_outside_colors = len(pie_df.Outdoor_Activities.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

fig = plt.figure(figsize=(10,10));
ax = fig.add_subplot(1, 1, 1);

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

Total_hours = pie_df.Outdoor_Activities.sum()

pie_df = pie_df.groupby(['City'])['Outdoor_Activities'].sum().plot(
    kind='pie', radius=1, colors = outer_colors,
    pctdistance=0.85, labeldistance = 1.3,
    wedgeprops=dict(edgecolor='white'), textprops={'fontsize':14},
    autopct=lambda p: '{:.2f}%\n({:.0f} Hrs)'.format(p,(p/100)*Total_hours),
    startangle=90)
    
hole = plt.Circle((0,0),0.35,fc='white')

fig4 = plt.gcf();
fig4.gca().add_artist(hole);
ax.yaxis.set_visible(False);
plt.title('Top 5 Cities for Highest Outdoor Activities (Hrs.)', fontsize=18);
ax.axis('equal');
plt.tight_layout();
ax.text(0,0, str(round(Total_hours))+'\nTotal Hours \n Spent on\n Outdoor Activities\n' ,size=14, ha='center', va='center');

legend = ax.legend( loc='lower right', frameon=True, ncol=1, shadow=True,
                    borderpad=1, fontsize=11);

plt.show()
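The percentages shown on the pie wedges are each city's share of the top-5 total. As a small sketch, the shares can be computed directly from the five figures quoted in the findings above; the variable names are illustrative only.

```python
import pandas as pd

# The five values quoted in the findings for the top 5 cities.
hours = pd.Series(
    {"Barcelona": 585, "London": 433, "Amsterdam": 422, "Istanbul": 419, "Sydney": 406},
    name="Outdoor_Activities",
)

# Total across the five cities and each city's share of that total, in percent.
total_hours = hours.sum()
share_pct = (hours / total_hours * 100).round(2)

print(total_hours)
print(share_pct)
```

Computing the shares up front avoids reconstructing the hours from the percentage inside the `autopct` callback.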

Obesity levels and Life Expectancy Analysis

The wide spread of points on the scatter plot shows no clear correlation between life expectancy and obesity levels. The hypothesis that higher obesity levels go with lower life expectancy is therefore rejected. Note that the small cluster in the top center indicates that, for obesity levels of 20-25%, the average life expectancy is about 80 years. Despite this cluster, at around a 5% obesity level life expectancy ranges from 67 to above 80 years, while at around a 30% obesity level it ranges from 56 to above 80 years. Thus, many factors beyond obesity level affect life expectancy.


df4 = pd.read_csv(filename, usecols = ['Obesity levels(Country)','Life expectancy(years) (Country)'])
df4.columns = ['Obesity_Levels','LifeExpectancy']
stacked_df2 = df4.groupby(['Obesity_Levels', 'LifeExpectancy'])['LifeExpectancy'].sum().reset_index(name='ObesityLevels')
stacked_df2['Obesity_Levels'] = stacked_df2['Obesity_Levels'].map(lambda x: str(x)[:-1])
stacked_df2['Obesity_Levels'] = stacked_df2['Obesity_Levels'].astype(float)
stacked_df2 = stacked_df2.sort_values('Obesity_Levels', ascending=False)
stacked_df2.reset_index(inplace=True, drop=True)

fig2 = plt.figure(figsize=(18,14));
ax = fig2.add_subplot(1, 1, 1);

plt.scatter(stacked_df2['Obesity_Levels'],stacked_df2['LifeExpectancy'], s=400,color='lightblue', edgecolors='black');
plt.title('Obesity Level vs. Life Expectancy\n Scatter Plot', fontsize=18);
plt.xlabel('Obesity Level', fontsize=14);
plt.ylabel('Life Expectancy', fontsize=14);
ax.xaxis.set_major_formatter('{x:.0f}%');
plt.show()
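A scatter plot gives a visual impression of correlation, and a correlation coefficient can back that impression with a number. The following is a minimal sketch using the Pearson coefficient from `Series.corr` on made-up values; it is not a result from the real data set.

```python
import pandas as pd

# Hypothetical sample values, NOT taken from the real data set.
sample = pd.DataFrame({
    "Obesity_Levels": [5.0, 10.0, 20.0, 25.0, 35.0],
    "LifeExpectancy": [70.0, 82.0, 80.0, 81.0, 78.0],
})

# Pearson correlation: +1 is a perfect positive linear relation,
# -1 a perfect negative one, and values near 0 indicate little linear relation.
pearson_r = sample["Obesity_Levels"].corr(sample["LifeExpectancy"])

# Share of the variance in one variable that a straight-line fit on the other would explain.
r_squared = pearson_r ** 2

print(f"r = {pearson_r:.3f}, r^2 = {r_squared:.3f}")
```

With only a few dozen cities, and obesity measured at the country level, a coefficient like this is best read as descriptive rather than as a formal hypothesis test.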

Conclusion

Obesity levels averaged 21.92% across the cities in 2021. Cities in the United States lead the other cities around the world in obesity levels.

Among the top 20 cities that logged the most annual average hours worked, Seoul and Moscow are the happiest, each scoring 7.23, and Madrid is the least happy at 5.13. Nonetheless, the overall takeaway is that the cities that logged the most hours worked fall in the top half of the happiness range.

Zurich has the highest combined cost for a monthly gym membership and a bottle of water (above $70), while Sao Paulo has the lowest (under $20). The top five most expensive cities are Zurich, Tokyo, Geneva, Washington, D.C., and San Francisco. The cities that charge the least for a bottle of water, depicted in orange, are Istanbul, Buenos Aires, Johannesburg, Vancouver, and Los Angeles. If Washington, D.C. and San Francisco found a way to lower their monthly gym membership and water bottle costs, it could help reduce their high obesity levels by making a healthy lifestyle more affordable.

London has the most take out restaurants, with 6,417, and Cairo has the fewest, with 250. The average number of take out places is 1,443.

Amsterdam, London, Sydney, Barcelona, and Istanbul logged the most hours of outdoor activity. The cities with the highest obesity levels do not appear among the top cities for outdoor activity, which is consistent with their obesity levels.

Lastly, the hypothesis is rejected, as there is no clear correlation between life expectancy and obesity levels.