The YouTube dataset used in this analysis provides insight into video content and the creators who publish it on YouTube. As one of the largest video-sharing platforms in the world, YouTube has transformed the way we consume media, offering a diverse array of content ranging from educational tutorials and entertainment to product reviews and music videos. This dataset allows for a deeper understanding of the channels of the top YouTubers and how they compare to other channels.
The dataset consists of 28 columns and 995 rows. It covers YouTube channels, their ranks, when they were created, subscriber counts, total views, video categories, the countries where they are based, income levels, and more. The visualizations explore several relationships between these variables, which makes the data easier to understand.
The findings reveal who the top YouTubers in the world were as of 2023, along with their income levels. Other visualizations show when the top 1,000 YouTube channels were created and what percentage of channels falls into each category. Additionally, a heat map reveals which countries have the most YouTube channels and what types of channels they are.
import os
# Point Qt at the Anaconda plugin directory (this path is specific to the author's machine)
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/ProgramData/Anaconda3/Library/plugins/platforms'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from matplotlib.ticker import FuncFormatter
path = "U:/"
filename = path + 'youtube.csv'
df = pd.read_csv(filename, encoding = "ISO-8859-1")
This bar chart shows the top 10 YouTube channels of 2023. Among these ten channels, the average subscriber count is about 144 million. The top channel in 2023 is T-Series, with nearly 250 million subscribers, and the bar chart emphasizes how much more popular it is than the rest of the top ten. The second most popular channel in the chart is YouTube Movies, with about 170 million subscribers, and MrBeast is in third place with roughly 170 million as well.
### BAR CHART
df['video views'] = pd.to_numeric(df['video views'])
df_bar = df[['rank', 'Youtuber', 'subscribers']]
top_youtuber = df[['rank', 'Youtuber', 'subscribers']]
def pick_colors_according_to_mean_data(this_data):
    colors = []
    avg = this_data.subscribers.mean()
    for each in this_data.subscribers:
        if each > avg*1.25:
            colors.append('purple')
        elif each < avg*0.75:
            colors.append('grey')
        else:
            colors.append('darkblue')
    return colors
bottom1 = 0
top1 = 9
d1 = top_youtuber.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_data(d1)
# Legend colors must match the colors returned by pick_colors_according_to_mean_data
Above = mpatches.Patch(color='purple', label='Well Above Average')
Within = mpatches.Patch(color='darkblue', label='Within 25% of Average')
Below = mpatches.Patch(color='grey', label='Well Below Average')
fig = plt.figure(figsize=(18,16))
ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(d1.Youtuber, d1.subscribers, color=my_colors1)
#ax1.legend(fontsize=14)
ax1.legend(handles=[Above, Within, Below], fontsize=14)
plt.axhline(d1.subscribers.mean(), color='black', linestyle='solid')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(True)
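ax1.set_title('Top 10 Most Popular YouTube Channels', fontsize=20)
# Place the label slightly above the mean line, using a relative offset so it scales with the data
ax1.text(top1-0.8, d1.subscribers.mean()*1.02, f'Mean = {d1.subscribers.mean():,.0f}', rotation=0, fontsize=15)
# Matplotlib prints the axis multiplier (for example 1e8) above the axis, so no unit is hard-coded here
ax1.set_ylabel('Subscriber Count', fontsize=14, labelpad=20)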
ax1.set_xlabel('Youtube Channel', fontsize=14, labelpad=20)
ax1.set_xticks(range(len(d1.Youtuber)))
ax1.set_xticklabels(d1.Youtuber, rotation=45)
plt.show()
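The coloring rule described for the bar chart (more than 25% above the mean, more than 25% below it, or in between) can also be written without an explicit loop. The following is a minimal sketch rather than the notebook's own code: the helper name `colors_by_mean_band` and the numbers are made up for illustration.

```python
import numpy as np

def colors_by_mean_band(values, band=0.25):
    """Hypothetical helper: one color per value, based on distance from the mean.

    'purple' if the value is above mean * (1 + band),
    'grey' if it is below mean * (1 - band),
    'darkblue' otherwise.
    """
    values = np.asarray(values, dtype=float)
    avg = values.mean()
    conditions = [values > avg * (1 + band), values < avg * (1 - band)]
    return np.select(conditions, ["purple", "grey"], default="darkblue").tolist()

# Illustrative numbers only (not taken from the dataset)
subs = [300, 160, 150, 100, 90]
colors = colors_by_mean_band(subs)
print(colors)
```

The returned list can be passed directly as the `color` argument of a bar chart, in the same way as the loop-based version.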
This dual-axis chart shows the estimated monthly earnings of the top 10 YouTubers. T-Series and MrBeast have the highest monthly incomes, with highest estimated earnings of nine million and eight million respectively. This is consistent with the bar chart, which showed that these two YouTubers are among the top 3 most popular channels of 2023. What is surprising is that PewDiePie has substantially less income than most of the other channels despite having more than 100 million subscribers.
##Dual Axis Chart
income = df[['rank', 'Youtuber', 'highest_monthly_earnings', 'lowest_monthly_earnings']]
# Leave out the 'YouTube Movies' and 'Music' entries, then keep the first ten remaining channels
omit = ['YouTube Movies', 'Music']
new_income = income.loc[~income['Youtuber'].isin(omit)]
top10 = new_income.head(10)
fig = plt.figure(figsize=(18,12))
ax1 = fig.add_subplot(1, 1, 1)
ax2 = ax1.twinx()  # second y-axis that shares the same x-axis
bar_width = 0.4
x_pos = np.arange(len(top10))
low_bars = ax1.bar(x_pos-(0.5*bar_width), top10['lowest_monthly_earnings'], bar_width, color='red', label='lowest income')
high_bars = ax2.bar(x_pos+(0.5*bar_width), top10['highest_monthly_earnings'], bar_width, color='green', label='highest income')
ax1.set_yticks(np.arange(0, 1000001, 100000))
ax1.set_yticklabels([f'{x:,}' for x in np.arange(0, 1000001, 100000)])
ax2.set_yticks(np.arange(0, 10000001, 1000000))
ax2.set_yticklabels([f'{x:,}' for x in np.arange(0, 10000001, 1000000)])
ax1.set_xlabel('Youtuber', fontsize=18)
ax1.set_ylabel('Lowest Monthly Earnings', fontsize=18, labelpad=20)
ax2.set_ylabel('Highest Monthly Earnings', fontsize=18, labelpad=20, rotation=270)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)
plt.title('Estimated Monthly Earnings of Youtubers\n Top 10 Youtube Channels', fontsize = 18)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(new_income.head(10)['Youtuber'], fontsize=14, rotation=45)
plt.show()
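When rows are filtered out of a DataFrame, the remaining rows keep their original index labels, which matters when picking the "top N" channels afterwards. The sketch below uses made-up channel names and earnings, and assumes a default integer index, to show how a label-based slice and a positional selection can differ after filtering.

```python
import pandas as pd

# Made-up channels and earnings, for illustration only
df_demo = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D", "E", "F"],
    "highest_monthly_earnings": [900, 0, 540, 790, 0, 730],
})

# Drop two entries, similar to filtering with an omit list
kept = df_demo.loc[~df_demo["Youtuber"].isin(["B", "E"])]

# Label slice: returns only the surviving rows whose labels fall in 0..2
by_label = kept.loc[0:2]
# Positional: always returns the first three remaining rows
by_position = kept.head(3)

print(list(by_label["Youtuber"]))
print(list(by_position["Youtuber"]))
```

A positional selection such as `head` does not depend on how many labels the filter happened to remove, which makes it the more predictable choice for a "top N" chart.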
This pie chart shows which types of YouTube channels are most prevalent in 2023. Entertainment is the largest category, representing 25.7% of the channels shown (six very small categories are left out of the chart). Music follows with 21.6%. The remaining categories with a substantial share are People & Blogs at 14.1% and Gaming at 10%.
##Pie Chart
category = df[['Youtuber', 'category']]
gb_cat = df.groupby(['category'])['category'].count().reset_index(name='count')
gb_cat = pd.DataFrame(gb_cat)
total_count = gb_cat['count'].sum()
gb_cat['percent'] = (gb_cat['count']/total_count)*100
gb_cat = gb_cat.sort_values(by='percent', ascending=False)
omit = ['Pets & Animals', 'Trailers', 'Autos & Vehicles', 'Nonprofits & Activism', 'Movies', 'Travel & Events']
gb_cat_filter = gb_cat[~gb_cat['category'].isin(omit)]
fig = plt.figure(figsize=(10,10))
plt.pie(gb_cat_filter['percent'], labels=gb_cat_filter['category'], autopct='%1.1f%%', startangle=0)
plt.title('Proportion of the Category Types of YouTube Channels', fontsize=18)
plt.show()
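Matplotlib's `pie` rescales the values it receives so that the wedges fill the whole circle. As a result, dropping small categories before plotting changes the percentages printed by `autopct`. The sketch below uses made-up counts to compare a category's share of all channels with its share of the plotted categories only.

```python
import pandas as pd

# Made-up category counts, for illustration only
counts = pd.Series({"Entertainment": 50, "Music": 30, "Gaming": 15, "Trailers": 5})

# Share of every channel, including the small category
share_of_all = counts / counts.sum() * 100

# Drop a small category, then recompute shares over what is left
plotted = counts.drop("Trailers")
share_of_plotted = plotted / plotted.sum() * 100

print(share_of_all.round(1).to_dict())
print(share_of_plotted.round(1).to_dict())
```

In this toy example the largest category rises from 50% of all channels to a little under 53% of the plotted ones, so a pie chart built after filtering describes the categories shown rather than the full dataset.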
This heat map is interesting because it breaks down, per country, which types of channels are most popular. Many of the boxes are blank; because the chart keeps only country and category pairs with more than four channels, a blank box means that a country has few or no popular channels in that category. For example, the United States clearly dominates the platform, with channels in almost every category. At the other extreme, Russia appears only once on the heat map, with 8 channels under the People & Blogs category. The only country that compares to the United States is India, which has multiple channels across categories. This data reveals that there are gaps in the YouTube industry depending on the country.
##HeatMap
# Count channels for each (Country, category) pair
country_group = df.groupby(['Country', 'category'])['category'].count().reset_index(name='count')
country_group = country_group.sort_values(by='count', ascending=False)
# Keep only pairs with more than four channels so the heat map stays readable
country_group = country_group[country_group['count']>4]
hm_df2 = pd.pivot_table(country_group, index='category', columns='Country', values='count')
fig = plt.figure(figsize=(18,10))
ax= fig.add_subplot(1, 1, 1)
comma_fmt = FuncFormatter(lambda x , p: format(int(x), ','))
ax = sns.heatmap(hm_df2, linewidths=1, linecolor='black', annot=True, cmap='YlGnBu', fmt=',.0f',
                 square=True, annot_kws={'size': 11},
                 cbar_kws={'format': comma_fmt, 'orientation': 'vertical'})
plt.title('Heatmap of the Number of YouTube Channels per Country by Category Type', fontsize=18, pad=15)
plt.xlabel('Country', fontsize=15, labelpad=10)
plt.ylabel('Channel Category', fontsize=15, labelpad=10)
plt.xticks(rotation=45)
ax.invert_xaxis()
cbar = ax.collections[0].colorbar
cbar.set_label('Number of Youtube Channels', rotation=270, fontsize=14, color='black', labelpad=20)
plt.show()
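Blank boxes in a heat map built this way come from the pivot step: any country and category pair that is missing from the filtered counts becomes `NaN` in the pivoted table. A minimal sketch with made-up rows, assuming the same `Country` and `category` column names, illustrates this.

```python
import pandas as pd

# Made-up rows, for illustration only
demo = pd.DataFrame({
    "Country":  ["US", "US", "US", "US", "India", "India", "India", "Russia"],
    "category": ["Music", "Music", "Gaming", "Gaming", "Music", "Music", "Gaming", "Gaming"],
})

# Count channels for each (Country, category) pair
counts = demo.groupby(["Country", "category"]).size().reset_index(name="count")

# Keep only pairs with at least two channels (the notebook uses count > 4)
counts = counts[counts["count"] >= 2]

# Pairs removed by the filter show up as NaN, which a heat map draws as blank
grid = counts.pivot_table(index="category", columns="Country", values="count")
print(grid)
```

Note that a country whose pairs are all below the threshold disappears from the table entirely, so its absence does not mean it has no channels in the data.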
The scatter plot reveals when YouTube channels were created, from the start of YouTube in 2005 all the way to 2022. Out of the top 1000 channels, the month in which the most channels were created was March 2006. This shows that some channels rely on long-term growth and had to build up their audiences for over a decade and a half. The years 2014 through 2017 saw a slight increase in the number of channels created per month. This could be because of the YouTuber culture that came to life in the mid-to-late 2010s, as social media was becoming a prominent part of everyday life.
##ScatterPlot
df_sp = df[['created_month', 'created_year']]
df_sp = df_sp[ df_sp['created_month'].notna() & df_sp['created_year'].notna() ]
df_sp = df_sp[ df_sp['created_month'] !=0 ]
x = df_sp.groupby(['created_year', 'created_month'])['created_year'].count().reset_index(name='count')
x = pd.DataFrame(x)
omit = [1970.0]
x2 = x.loc[ ~x['created_year'].isin(omit) ]
month_mapping = {
    'Jan': 1,
    'Feb': 2,
    'Mar': 3,
    'Apr': 4,
    'May': 5,
    'Jun': 6,
    'Jul': 7,
    'Aug': 8,
    'Sep': 9,
    'Oct': 10,
    'Nov': 11,
    'Dec': 12,
}
# Work on a copy so the assignments below do not trigger chained-assignment warnings
x2 = x2.copy()
x2['month_numeric'] = x2['created_month'].map(month_mapping)
x2 = x2.dropna(axis=0)
x2['month_numeric'] = x2['month_numeric'].astype('int')
x2['created_year'] = x2['created_year'].astype('int')
x2['count'] = x2['count'].astype('int')
# Total number of channels created in each (year, month) pair
x3 = x2.groupby(['created_year', 'month_numeric'])['count'].sum().reset_index()
x4 = x3.copy()
plt.figure(figsize=(18,10))
scale_factor = 50
plt.scatter(x4['month_numeric'], x4['created_year'], marker='8', cmap='viridis',
            c=x4['count'], s=x4['count']*scale_factor, edgecolors='black')
plt.title('When Youtube Channels Were Created', fontsize=18)
plt.xlabel('Months of the Year', fontsize=14)
plt.ylabel('Year', fontsize=14)
cbar = plt.colorbar()
cbar.set_label('Number of Channels Created', color='black', fontsize=14, rotation=270, labelpad=30)
# range() excludes its stop value, so add 1 to include the maximum count as a tick
my_colorbar_ticks = [*range(1, int(x3['count'].max()) + 1, 1)]
cbar.set_ticks(my_colorbar_ticks)
my_x_ticks = [*range( x4['month_numeric'].min(), x4['month_numeric'].max()+1, 1 )]
plt.xticks(my_x_ticks, fontsize=14, color='black')
my_y_ticks = [*range( x4['created_year'].min(), x4['created_year'].max()+1, 1 )]
plt.yticks(my_y_ticks, fontsize=14, color='black')
plt.show()
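The month mapping and the "busiest month" lookup can be sketched compactly. The example below uses made-up rows and assumes that `created_month` holds three-letter English abbreviations such as `Mar`; the list `MONTHS` and the DataFrame `demo` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Build the abbreviation -> number mapping from an ordered list
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_number = {name: i + 1 for i, name in enumerate(MONTHS)}

# Made-up creation dates, for illustration only
demo = pd.DataFrame({
    "created_year":  [2006, 2006, 2006, 2014, 2014],
    "created_month": ["Mar", "Mar", "Jan", "Sep", "Mar"],
})
demo["month_numeric"] = demo["created_month"].map(month_to_number)

# Number of channels created in each (year, month) pair
per_month = demo.groupby(["created_year", "month_numeric"]).size()

# idxmax returns the (year, month) index label with the largest count
busiest_year, busiest_month = per_month.idxmax()
print(busiest_year, busiest_month, per_month.max())
```

Reading the peak from the grouped counts gives a quick numerical check on what the largest marker in the scatter plot represents.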
The YouTube dataset analyzed in this study offers valuable insights into video content and creators on the platform. It encompasses information about top YouTuber channels, their rankings, subscriber counts, video categories, and more. Through data visualizations, relationships between these variables are explored, enhancing our understanding of the data. The findings reveal details about the world’s leading YouTubers in 2023, their income levels, creation dates, and the prevalence of video categories. This dataset provides a comprehensive view of YouTube’s landscape and its influential content creators.