This post is an English translation of my original blog post on nipponkiyoshi.com.
We are going to learn how to access WordPress stats,
extract the relevant information, and process it in Python.
The normal stats page from wp-admin used to look like
this.
On this page, you can see the number of views and visits by day, week, month, and year, the visitors' countries of origin, and which posts are opened the most. In the Insights section, you can see the summary in more detail.
However, ever since WordPress switched to Jetpack, it is no longer possible to download this data. Fine. Using Python, we will get that information ourselves. I will show you how to do it.
Let’s go through the static URL page that contains the data
Access: https://stats.wordpress.com/csv.php?
You’ll see a picture like this:
This page is a guide on how to access the link that contains the raw data and extract it.
Required parameters must be included in every request:
api_key, plus either blog_id or
blog_uri.
Optional parameters adjust the request to get the
desired data. If you look at the table item, you will see the various data
that can be retrieved, such as views,
postviews, searchterms, clicks,
etc.
We need two indispensable ingredients, namely
api_key and blog_uri.
This API relies on Akismet, WordPress’ free anti-spam service, so you’ll need an Akismet API key.
The blog URI is the blog's URL without https://.
For example, my blog’s URL is
https://nipponkiyoshi.com, so I set blog_uri to
nipponkiyoshi.com.
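If you would rather derive blog_uri from the full URL than type it by hand, the standard library can strip the scheme for you. The sketch below is only an illustration: `to_blog_uri` is a hypothetical helper name, and it assumes the blog lives at the root of its domain.

```python
from urllib.parse import urlparse


def to_blog_uri(url: str) -> str:
    """Return the host part of a blog URL (hypothetical helper)."""
    # urlparse only fills in netloc when a scheme or '//' is present,
    # so prepend '//' for bare inputs such as 'nipponkiyoshi.com'.
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.netloc


print(to_blog_uri("https://nipponkiyoshi.com"))
```

Passing either the full URL or the bare domain gives the same value, so the helper is safe to call on input that is already in the right form.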
We will use the following libraries.
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
Then create some variables to hold the parameters. Change these values to match your own blog.
#edit parameters here
api_key = 'abcxyz123456'
blog_uri = 'somewebsite.com'
table = ['views', 'postviews', 'referrers',
'referrers_grouped', 'searchterms', 'clicks', 'videoplays']
days = -1
master_url = 'https://stats.wordpress.com/csv.php?'
Here the example api_key is abcxyz123456,
and blog_uri is somewebsite.com.
Of course, when running this for my own blog, I replace
both of them with my own values.
Setting days = -1 returns all the stats
since the blog was created. It can be adjusted to any
other value. For example, if days = 7, you will get the
stats of the blog for the last 7 days.
Next, we can build a complete query to get the desired data.
x = 1
url = master_url + 'api_key=' + api_key + '&blog_uri=' + blog_uri + '&table=' + table[x] + '&days=' + str(days)
print(url)
## https://stats.wordpress.com/csv.php?api_key=abcxyz123456&blog_uri=somewebsite.com&table=postviews&days=-1
Here the variable x is the index of an element
in the table list above. You can change x depending on the data you want to
get. For example, x = 0 selects
table[0], which is views. Here I use
x = 1, which is postviews: the result lists,
for every day, which posts were viewed and the number of views of
each post. If you want to see which pages refer to your blog
(referrers), set x = 2.
You can paste this URL into your browser to view the result. Since the
columns are separated by commas, you can copy and paste the output into
any text editor and save it as a .csv file.
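String concatenation works for these simple values, but it is easy to mistype an `&` or `=`. As an alternative sketch, `urllib.parse.urlencode` can assemble the query string from a dictionary. The function name `build_stats_url` and its default arguments are my own choices for illustration, not part of the WordPress API.

```python
from urllib.parse import urlencode


def build_stats_url(api_key, blog_uri, table="views", days=-1,
                    base="https://stats.wordpress.com/csv.php"):
    """Assemble the stats query URL (hypothetical helper)."""
    # The parameter names are the ones used in the post:
    # api_key, blog_uri, table, days.
    params = {
        "api_key": api_key,
        "blog_uri": blog_uri,
        "table": table,
        "days": days,
    }
    # urlencode joins the pairs with '&' and escapes unsafe characters.
    return base + "?" + urlencode(params)


print(build_stats_url("abcxyz123456", "somewebsite.com", table="postviews"))
```

For the placeholder values used in this post, this should produce the same URL as the concatenation, and it keeps working if a value ever contains a character that needs escaping.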
This is the result of the query for my blog:
"date","post_id","post_title","post_permalink","views"
"2021-03-05",469,"Viết thư bằng tiếng Nhật","https://nipponkiyoshi.com/2014/11/20/viet-thu/",23
"2021-03-05",0,"Home page","https://nipponkiyoshi.com/",20
"2021-03-05",644,"Tóm tắt lịch sử Nhật Bản","https://nipponkiyoshi.com/2014/12/14/tom-tat-lich-su-nhat-ban/",17
"2021-03-05",2275,"Mạc phủ Tokugawa sụp đổ. Thiên hoàng trở lại nắm quyền.","https://nipponkiyoshi.com/2019/03/06/mac-phu-tokugawa-sup-do-duy-tan-minh-tri-thanh-cong/",13
"2021-03-05",134,"Tự luyện thi JLPT N3 (P1): Sách luyện thi (có link download)","https://nipponkiyoshi.com/2014/09/27/tu-hoc-on-thi-jlpt-n3-sach-luyen-thi/",13
"2021-03-05",1199,"Tự luyện thi JLPT N2 (P1): Cấu trúc đề thi","https://nipponkiyoshi.com/2015/03/02/tu-luyen-thi-jlpt-n2-p1-cau-truc-de-thi/",10
"2021-03-05",1723,"Những từ ghép với 気","https://nipponkiyoshi.com/2015/11/10/nhung-tu-ghep-voi-%e6%b0%97/",10
"2021-03-05",1995,"Tự luyện thi JLPT N1: Sách luyện thi (Update: 2020/06/29)","https://nipponkiyoshi.com/2017/05/09/tu-luyen-thi-jlpt-n1-sach-luyen-thi-co-link-download/",7
"2021-03-05",1094,"Tiếng Nhật ""lóng""","https://nipponkiyoshi.com/2015/02/08/tieng-nhat-long/",6
Finally, we’ll load the data into a DataFrame. pandas
can read directly from this URL because the data is already
in .csv format. If you also export the data, make sure to edit the
path of the csv file to where you want to save it.
# read_csv already returns a DataFrame
df = pd.read_csv(url)
df['date'] = pd.to_datetime(df['date'])
# if you want to export the data to a csv file (index=False skips the row numbers)
df.to_csv('path/output.csv', index=False)
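Reading straight from the URL gives little feedback when the request fails, for example because of a wrong API key. A slightly more defensive sketch is to download the text with requests first, raise on HTTP errors, and only then parse it. The helper names `parse_stats_csv` and `fetch_stats` and the 30-second timeout are assumptions made for this example.

```python
import io

import pandas as pd
import requests


def parse_stats_csv(text: str) -> pd.DataFrame:
    """Parse CSV text from the stats page into a DataFrame (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(text))
    # Not every table has a date column, so convert it only when present.
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])
    return df


def fetch_stats(url: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download the CSV at `url` and parse it; raises on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_stats_csv(response.text)


# Offline check with two rows in the same shape as the query result shown earlier.
sample = (
    '"date","post_id","post_title","post_permalink","views"\n'
    '"2021-03-05",469,"Viết thư bằng tiếng Nhật",'
    '"https://nipponkiyoshi.com/2014/11/20/viet-thu/",23\n'
    '"2021-03-05",0,"Home page","https://nipponkiyoshi.com/",20\n'
)
df_sample = parse_stats_csv(sample)
print(df_sample.dtypes)
```

Keeping the parsing separate from the download makes it possible to test the parsing on saved text without touching the network.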
The raw data extraction phase is complete. Next, you can do some simple analysis of this data, such as drawing charts or finding which days and which posts have the most views.
With the DataFrame df above, you can use the
groupby function to group posts by post_title
and add up the total number of views. Then we will sort them in order
from high to low.
#Sort by viewcounts (to see which posts have the most view counts)
by_viewcounts = df.groupby(['post_title', 'post_id','post_permalink']).agg(
{'views': 'sum'})
by_viewcounts_view = by_viewcounts.sort_values(by='views', ascending=False)
#pd.set_option('display.max_rows', 500) #uncomment to print all.
print(by_viewcounts_view)
## views
## post_title post_id post_permalink
## 10 Ứng dụng học Tiếng Nhật tốt nhất 1248.0 http://nipponkiyoshi.com/2015/03/20/10-ung-dung... 221085
## Tự luyện thi JLPT N3 (P1): Sách luyện thi (có l... 134.0 http://nipponkiyoshi.com/2014/09/27/tu-hoc-on-t... 188220
## Viết thư bằng tiếng Nhật 469.0 http://nipponkiyoshi.com/2014/11/20/viet-thu/ 146521
## Home page 0.0 http://nipponkiyoshi.com/ 136911
## Tự luyện thi JLPT N2: Sách luyện thi (Update 20... 1555.0 http://nipponkiyoshi.com/2015/07/26/tu-luyen-th... 119392
## ... ...
## Seiryuen-Garden-Nijo-Castle-Kyoto-Japan-768x1366 959.0 http://nipponkiyoshi.com/seiryuen-garden-nijo-c... 1
## FEATURE_docgikochan_tamthe 4948.0 http://nipponkiyoshi.com/2024/06/05/doc-gi-ko-c... 1
## FEATURE_docgikochan2 4927.0 http://nipponkiyoshi.com/2024/05/31/doc-gi-ko-c... 1
## feature_new_OLG_books 4917.0 http://nipponkiyoshi.com/2024/05/22/mot-so-sach... 1
## feature-8-ung-dung-hoc-tieng-nhat 2143.0 http://nipponkiyoshi.com/2015/03/20/10-ung-dung... 1
##
## [339 rows x 1 columns]
The three most read articles on my blog are: “Top 10 Japanese learning applications”, “How to self-study for JLPT N3”, and “How to write a letter in Japanese”.
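One thing to keep in mind: grouping by post_title, post_id, and post_permalink together splits a post into several rows if its title or permalink ever changed. A possible variant, sketched here on made-up rows, groups by post_id alone and keeps the most recent title. The data and the names `df_demo` and `top_posts` are invented for the illustration.

```python
import pandas as pd

# Made-up rows: post 1 was renamed between the two days.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-04", "2021-03-05", "2021-03-05"]),
    "post_id": [1, 1, 2],
    "post_title": ["Old title", "New title", "Other post"],
    "views": [10, 5, 7],
})

top_posts = (
    df_demo.sort_values("date")           # oldest first, so "last" is the newest title
    .groupby("post_id")
    .agg(post_title=("post_title", "last"), views=("views", "sum"))
    .sort_values("views", ascending=False)
)
print(top_posts)
```

Here the two rows of post 1 are merged into a single row with their views added together, which would not happen if the title were part of the grouping key.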
This code shows us the days with the most views.
#Sort by date (days) to see which days have the most views
by_date = df.groupby('date').agg({'views':'sum'})
by_date_view = by_date.sort_values(by='views',ascending=False)
#pd.set_option('display.max_rows', 10000) #uncomment to print all.
print(by_date_view)
## views
## date
## 2016-07-18 7430
## 2018-01-24 7267
## 2015-08-27 4554
## 2015-06-24 2926
## 2016-07-19 2594
## ... ...
## 2014-10-23 1
## 2014-09-07 1
## 2014-10-20 1
## 2014-10-13 1
## 2014-08-28 1
##
## [3599 rows x 1 columns]
My most successful days were 2016/07/18 and 2018/01/24. The gap between those 2 days and the rest is staggering. Since then, I have never been able to reach that level. You can also see that in the early days of my blog (first post in 2014), there was only 1 view a day, which I think was my own view :)
We can also count and rank the total views by month and year.
#Sort by month
by_month = df.groupby(pd.Grouper(key='date', freq='ME')).agg({'views': 'sum'})
by_month_view = by_month.sort_values(by='views',ascending=False)
print(by_month_view)
## views
## date
## 2016-07-31 45368
## 2017-08-31 36712
## 2018-01-31 36609
## 2017-07-31 36454
## 2016-08-31 36130
## ... ...
## 2014-12-31 1800
## 2014-11-30 723
## 2014-10-31 203
## 2014-09-30 153
## 2014-08-31 3
##
## [120 rows x 1 columns]
#Sort by year
by_year = df.groupby(pd.Grouper(key='date', freq='YE')).agg({'views': 'sum'})
by_year_view = by_year.sort_values(by='views',ascending=False)
print(by_year_view)
## views
## date
## 2016-12-31 384866
## 2017-12-31 379951
## 2018-12-31 345058
## 2015-12-31 261304
## 2019-12-31 251436
## 2020-12-31 148610
## 2021-12-31 133182
## 2023-12-31 103652
## 2022-12-31 99657
## 2024-12-31 55673
## 2014-12-31 2882
We can visualize the data using matplotlib. First, I
plot the view counts by month.
import matplotlib.dates as mdates
#plot by month
x = by_month.index
values = by_month['views']
# Set the monthly locator (optional): one tick every 6 months
locator = mdates.MonthLocator(interval=6) # leave interval out to show every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b/%y')
# Specify formatter
X = plt.gca().xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
#specify plot parameters
plt.xlabel('time')
plt.ylabel('views')
plt.plot(x, values, color='cornflowerblue')
plt.xticks(rotation=90)
# highlight July and December every year
for i in range(len(x)):
    # values.iloc[i] selects by position; values[i] would look up the label i
    if x[i].month == 7:
        plt.scatter(x[i], values.iloc[i], color='orange', alpha=0.5, s=50)
    elif x[i].month == 12:
        plt.scatter(x[i], values.iloc[i], color='red', alpha=0.5, s=50)
plt.title('nipponkiyoshi.com \n Views by month')
plt.show()
The view counts by month show that my blog has peaks around July and December every year. This is probably because the JLPT tests are held in July and December, so many people are looking for study materials around that time.
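To check a seasonal pattern like this with numbers rather than by eye, one option is to average the monthly totals by calendar month. The sketch below uses made-up data with an artificial bump in July and December, only to show the mechanics. With the real data you would use by_month['views'] in place of the invented `monthly` series.

```python
import pandas as pd

# Made-up monthly totals for two years, with an artificial bump in July and December.
idx = pd.date_range("2015-01-01", periods=24, freq="MS")
views = [100 + (50 if d.month in (7, 12) else 0) for d in idx]
monthly = pd.Series(views, index=idx, name="views")

# Average over all years for each calendar month (1 = January, ..., 12 = December).
by_calendar_month = monthly.groupby(monthly.index.month).mean()
print(by_calendar_month)
```

Months whose average stands clearly above the others are the seasonal peaks. In this invented data those are months 7 and 12 by construction.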
I now plot the view counts by year, which is similar to the default
stats you can see in wp-admin.
import matplotlib.pyplot as plt
#extract the year from the DatetimeIndex (if not, matplotlib will draw the 2018/12/31 bar at 2019)
y_date = by_year.index.year
#declare plot elements
values = by_year['views']
widths = [0.5 for i in y_date]
colors = ['limegreen' for i in y_date]
plt.xlabel('Time')
plt.ylabel('views (in thousands)')
#plot
plt.bar(y_date, values/1000, width = widths, color = colors)
#show values on top of bars
for i in range(len(y_date)):
    # values.iloc[i] selects by position; the label must be passed as a string
    plt.text(y_date[i], values.iloc[i]/1000, str(round(values.iloc[i]/1000)), ha = 'center')
plt.title('nipponkiyoshi.com \n Views by year')
plt.show()
At the beginning, I wrote my blog to share my notes while learning Japanese. Back in 2015 and 2016, there was a boom of Japanese learners in Vietnam. That’s probably the reason why my blog received a lot of attention in those years. However, the number of views has decreased significantly since then. I think it’s because I haven’t been updating my blog as frequently as before, and there are a lot of other blogs that provide better content.
Nowadays, I write longer blog posts, which take more time to research and write. I also have a full-time Ph.D. to finish, so I don’t have as much time to write as before. The content has also changed. The Japanese language no longer fascinates me as much as before. I mainly write about economics, applied data science, and more personal reviews/thoughts. Well, gaining views was never my goal, so I’m not too worried about it. I’m just happy that my blog is still alive and that I can share my thoughts with the world.
Read this post in Vietnamese here: https://nipponkiyoshi.com/2021/03/05/cach-trich-xuat-du-lieu-wordpress-stats-bang-python/