(Fun DS) Extracting WordPress Stats with Python

This post is an English translation of my original blog post on nipponkiyoshi.com.

We are going to learn how to access WordPress stats, extract the relevant information, and process it in Python.

Data Extraction

The normal stats page from wp-admin used to look like this.

On this page, you can see the number of views and visits by day, week, month and year, broken down by country of origin, as well as which posts are opened the most. The Insights section shows the summary in more detail.

However, ever since WordPress switched to Jetpack, it is no longer possible to download these data. Fine. Using Python, we will get that information ourselves. I will show you how to do it.

Preparation

Let’s start with the static URL that serves the data.

Access: https://stats.wordpress.com/csv.php?

You’ll see a picture like this:

This is a guide on how to access the link that contains the raw data and extract it.

Required parameters are the ones every request must include: api_key, plus either blog_id or blog_uri.

Optional parameters adjust which data you get back. If you look at the table item, you will see the kinds of data that can be retrieved, such as views, postviews, searchterms, clicks, etc.

We need two essential pieces of information: api_key and blog_uri.

Step 1: Get API Key

This API is based on Akismet, which is WordPress’ anti-spam service and is free. You’ll need an API key.

Go to https://akismet.com/
Sign up with the email you used to create your blog (completely free).
After registration, you will receive a notification email from Akismet that contains your Akismet API key (12 characters long).
Copy that API key because we’ll need it to extract the data.

Step 2: Get Blog URI

The blog URI is the URL without https://.

For example, my blog’s domain is https://nipponkiyoshi.com so I set the blog_uri as nipponkiyoshi.com.
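If you would rather derive the blog_uri from a full address than type it by hand, a small helper can strip the scheme for you. This is only a sketch: the function name to_blog_uri is my own invention, and it assumes the stats endpoint wants just the host name, as in the example above.

```python
from urllib.parse import urlparse

def to_blog_uri(url: str) -> str:
    """Return the host part of a blog address, without scheme or path."""
    # urlparse only fills in netloc when a scheme (or a leading //) is present,
    # so add '//' for bare inputs such as 'nipponkiyoshi.com'.
    if '://' not in url:
        url = '//' + url
    return urlparse(url).netloc

print(to_blog_uri('https://nipponkiyoshi.com'))  # nipponkiyoshi.com
print(to_blog_uri('nipponkiyoshi.com/'))         # nipponkiyoshi.com
```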

Step 3: Use Python

We will use the following libraries.

import matplotlib.dates as mdates
import matplotlib.axes as ax
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests

Then create some variables to hold the parameters. Change these values to match your own blog.

#edit parameters here
api_key = 'abcxyz123456'
blog_uri = 'somewebsite.com'
table = ['views', 'postviews', 'referrers',
         'referrers_grouped', 'searchterms', 'clicks', 'videoplays']
days = -1
master_url = 'https://stats.wordpress.com/csv.php?'

Here our example api_key is abcxyz123456 and blog_uri is somewebsite.com.

Of course, for my own blog I replace abcxyz123456 and somewebsite.com with my own values.

Setting days = -1 means that you get the stats for the whole lifetime of the blog. Of course, it can be adjusted to any desired value. For example, with days = 7 you get the stats for the last 7 days.

Next, we can build a complete query to get the desired data.

x = 1
url = master_url + 'api_key=' + api_key + '&blog_uri=' + blog_uri + '&table=' + table[x] + '&days=' + str(days)
print(url)

## https://stats.wordpress.com/csv.php?api_key=abcxyz123456&blog_uri=somewebsite.com&table=postviews&days=-1

Here x is the index of the table you want in the table list above, so you can change x depending on the data you need. For example, x = 0 corresponds to table[0], which is views. I leave x = 1 (postviews), which returns, for each day, which posts were viewed and how many views each one got. If you want to see which pages refer to your site (referrers), set x = 2.
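Picking the table by index works, but it is easy to mix up the numbers. As an alternative sketch, the query can be built from the table name with urllib.parse.urlencode, which also escapes any unsafe characters. The function name build_stats_url and the name check are my own additions, not part of the WordPress API.

```python
from urllib.parse import urlencode

MASTER_URL = 'https://stats.wordpress.com/csv.php?'
TABLES = ['views', 'postviews', 'referrers',
          'referrers_grouped', 'searchterms', 'clicks', 'videoplays']

def build_stats_url(api_key, blog_uri, table, days=-1):
    """Build the query URL for one stats table, selected by name."""
    if table not in TABLES:
        raise ValueError(f'unknown table: {table!r}')
    params = {'api_key': api_key, 'blog_uri': blog_uri,
              'table': table, 'days': days}
    # urlencode joins the pairs with '&' and escapes unsafe characters
    return MASTER_URL + urlencode(params)

print(build_stats_url('abcxyz123456', 'somewebsite.com', 'postviews'))
```

With the example values, this prints the same URL as the string concatenation above.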

You can paste this URL into your browser to view the result. Since the columns are separated by commas, you can copy and paste the output into any text editor and save it as a .csv file.

This is the result of the query for my blog:

"date","post_id","post_title","post_permalink","views"
"2021-03-05",469,"Viết thư bằng tiếng Nhật","https://nipponkiyoshi.com/2014/11/20/viet-thu/",23
"2021-03-05",0,"Home page","https://nipponkiyoshi.com/",20
"2021-03-05",644,"Tóm tắt lịch sử Nhật Bản","https://nipponkiyoshi.com/2014/12/14/tom-tat-lich-su-nhat-ban/",17
"2021-03-05",2275,"Mạc phủ Tokugawa sụp đổ. Thiên hoàng trở lại nắm quyền.","https://nipponkiyoshi.com/2019/03/06/mac-phu-tokugawa-sup-do-duy-tan-minh-tri-thanh-cong/",13
"2021-03-05",134,"Tự luyện thi JLPT N3 (P1): Sách luyện thi (có link download)","https://nipponkiyoshi.com/2014/09/27/tu-hoc-on-thi-jlpt-n3-sach-luyen-thi/",13
"2021-03-05",1199,"Tự luyện thi JLPT N2 (P1): Cấu trúc đề thi","https://nipponkiyoshi.com/2015/03/02/tu-luyen-thi-jlpt-n2-p1-cau-truc-de-thi/",10
"2021-03-05",1723,"Những từ ghép với 気","https://nipponkiyoshi.com/2015/11/10/nhung-tu-ghep-voi-%e6%b0%97/",10
"2021-03-05",1995,"Tự luyện thi JLPT N1: Sách luyện thi (Update: 2020/06/29)","https://nipponkiyoshi.com/2017/05/09/tu-luyen-thi-jlpt-n1-sach-luyen-thi-co-link-download/",7
"2021-03-05",1094,"Tiếng Nhật ""lóng""","https://nipponkiyoshi.com/2015/02/08/tieng-nhat-long/",6

Finally, we’ll load the data into a DataFrame. pandas can read directly from this URL because the response is already in CSV format. If you also want to save a copy to disk, edit the path in to_csv to wherever you want the file to go.

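# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(url)
# parse the date strings so we can group by day, month, or year later
df['date'] = pd.to_datetime(df['date'])
# if you want to extract the data to a csv file
df.to_csv('path/output.csv')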
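pandas handles the quoting in this file for you, but it helps to know what the format looks like: text fields are wrapped in double quotes, and a literal quote inside a title is written as two quotes. As a quick check without pandas, the sketch below parses two rows copied from the sample output above with Python's built-in csv module.

```python
import csv
import io

# Two rows copied from the sample query result, plus the header line
sample = '''"date","post_id","post_title","post_permalink","views"
"2021-03-05",0,"Home page","https://nipponkiyoshi.com/",20
"2021-03-05",1094,"Tiếng Nhật ""lóng""","https://nipponkiyoshi.com/2015/02/08/tieng-nhat-long/",6
'''

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # csv returns every field as a string, so convert the numeric ones
    row['post_id'] = int(row['post_id'])
    row['views'] = int(row['views'])

print(rows[1]['post_title'])              # Tiếng Nhật "lóng"
print(sum(row['views'] for row in rows))  # 26
```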

The raw data extraction phase is complete. Next, you can do simple analyses such as drawing charts or exploring this data (for example, which days or which posts have the most views).

Data Exploration

Most read article

With the DataFrame df above, you can use the groupby function to group posts by post_title and add up the total number of views. Then we will sort them in order from high to low.

#Sort by viewcounts (to see which posts have the most view counts)
by_viewcounts = df.groupby(['post_title', 'post_id','post_permalink']).agg(
    {'views': 'sum'})
by_viewcounts_view = by_viewcounts.sort_values(by='views', ascending=False)
#pd.set_option('display.max_rows', 500) #uncomment to print all.
print(by_viewcounts_view)

##                                                                                                                 views
## post_title                                         post_id post_permalink                                            
## 10 Ứng dụng học Tiếng Nhật tốt nhất                1248.0  http://nipponkiyoshi.com/2015/03/20/10-ung-dung...  221085
## Tự luyện thi JLPT N3 (P1): Sách luyện thi (có l... 134.0   http://nipponkiyoshi.com/2014/09/27/tu-hoc-on-t...  188220
## Viết thư bằng tiếng Nhật                           469.0   http://nipponkiyoshi.com/2014/11/20/viet-thu/       146521
## Home page                                          0.0     http://nipponkiyoshi.com/                           136911
## Tự luyện thi JLPT N2: Sách luyện thi (Update 20... 1555.0  http://nipponkiyoshi.com/2015/07/26/tu-luyen-th...  119392
## ...                                                                                                               ...
## Seiryuen-Garden-Nijo-Castle-Kyoto-Japan-768x1366   959.0   http://nipponkiyoshi.com/seiryuen-garden-nijo-c...       1
## FEATURE_docgikochan_tamthe                         4948.0  http://nipponkiyoshi.com/2024/06/05/doc-gi-ko-c...       1
## FEATURE_docgikochan2                               4927.0  http://nipponkiyoshi.com/2024/05/31/doc-gi-ko-c...       1
## feature_new_OLG_books                              4917.0  http://nipponkiyoshi.com/2024/05/22/mot-so-sach...       1
## feature-8-ung-dung-hoc-tieng-nhat                  2143.0  http://nipponkiyoshi.com/2015/03/20/10-ung-dung...       1
## 
## [339 rows x 1 columns]

The three most read articles on my blog are: “Top 10 Japanese learning applications”, “How to self-study for JLPT N3”, and “How to write a letter in Japanese”.
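One thing to keep in mind: the grouping above keys on title, ID and permalink together, so if a post's title or link changed over the years, it would be counted as several separate rows. A possible variation, sketched here on made-up rows rather than my real data, groups by post_id only and keeps the most recent title.

```python
import pandas as pd

# Made-up rows with the same columns as the postviews table
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-03-04', '2021-03-05', '2021-03-05']),
    'post_id': [469, 469, 0],
    'post_title': ['Old title', 'New title', 'Home page'],
    'views': [10, 23, 20],
})

top_posts = (
    toy.sort_values('date')            # oldest first, so 'last' is the newest title
       .groupby('post_id')
       .agg(post_title=('post_title', 'last'), views=('views', 'sum'))
       .sort_values('views', ascending=False)
       .head(2)                        # keep only the top 2 posts
)
print(top_posts)
```

Here post 469 ends up as a single row with 33 views under its newest title.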

View counts by day

This code shows us which days had the most views.

#Sort by date (days) to see which days have the most views
by_date = df.groupby('date').agg({'views':'sum'})
by_date_view = by_date.sort_values(by='views',ascending=False)
#pd.set_option('display.max_rows', 10000) #uncomment to print all.
print(by_date_view)

##             views
## date             
## 2016-07-18   7430
## 2018-01-24   7267
## 2015-08-27   4554
## 2015-06-24   2926
## 2016-07-19   2594
## ...           ...
## 2014-10-23      1
## 2014-09-07      1
## 2014-10-20      1
## 2014-10-13      1
## 2014-08-28      1
## 
## [3599 rows x 1 columns]

My most successful days were 2016/07/18 and 2018/01/24. The gap between those two days and the rest is staggering, and I have never reached that level since. You can also see that in the early days of my blog (first post in 2014) there was only 1 view a day, which I think was my own :)

We can also count and rank the total views by month and year.

#Sort by month
by_month = df.groupby(pd.Grouper(key='date', freq='ME')).agg({'views': 'sum'})
by_month_view = by_month.sort_values(by='views',ascending=False)
print(by_month_view)

##             views
## date             
## 2016-07-31  45368
## 2017-08-31  36712
## 2018-01-31  36609
## 2017-07-31  36454
## 2016-08-31  36130
## ...           ...
## 2014-12-31   1800
## 2014-11-30    723
## 2014-10-31    203
## 2014-09-30    153
## 2014-08-31      3
## 
## [120 rows x 1 columns]

#Sort by year
by_year = df.groupby(pd.Grouper(key='date', freq='YE')).agg({'views': 'sum'})
by_year_view = by_year.sort_values(by='views',ascending=False)
print(by_year_view)

##              views
## date              
## 2016-12-31  384866
## 2017-12-31  379951
## 2018-12-31  345058
## 2015-12-31  261304
## 2019-12-31  251436
## 2020-12-31  148610
## 2021-12-31  133182
## 2023-12-31  103652
## 2022-12-31   99657
## 2024-12-31   55673
## 2014-12-31    2882
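The 'ME' and 'YE' aliases used above only exist in recent pandas releases, and they label each group with the last day of the month or year. If that is a problem, one alternative (a sketch on toy rows, not the method used for the tables above) is to group by a monthly period instead, which gives labels like 2016-07.

```python
import pandas as pd

# Toy rows with the same 'date' and 'views' columns as df
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-07-18', '2016-07-19', '2016-08-01']),
    'views': [7430, 2594, 100],
})

# to_period('M') maps every date to its calendar month
by_period = toy.groupby(toy['date'].dt.to_period('M'))['views'].sum()
print(by_period)
```

Using to_period('Y') instead gives the yearly totals in the same way.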

Data Visualization

We can visualize the data using matplotlib. First, I plot the view counts by month.

# Set the monthly locator
import datetime
import matplotlib.dates as mdates
#plot by month
x = by_month.index
values = by_month['views']
# Optional
locator = mdates.MonthLocator(interval=6)  # blank to show every month
# Specify the format - %b gives us Jan, Feb...
fmt = mdates.DateFormatter('%b/%y')
# Specify formatter
X = plt.gca().xaxis
X.set_major_locator(locator)
X.set_major_formatter(fmt)
#specify plot parameters
plt.xlabel('time')
plt.ylabel('views')
plt.plot(x[:], values[:], color='cornflowerblue')
plt.xticks(rotation=90)

# highlight July and December every year
for i in range(len(x)):
    # use .iloc for positions; values[i] on a date-indexed Series is deprecated
    if x[i].month == 7:
        plt.scatter(x[i], values.iloc[i], color='orange', alpha=0.5, s=50)
    elif x[i].month == 12:
        plt.scatter(x[i], values.iloc[i], color='red', alpha=0.5, s=50)
plt.title('nipponkiyoshi.com \n Views by month')
plt.show()

The monthly view counts show that my blog peaks around July and December every year. This is probably because the JLPT tests are held in July and December, so many people look for study materials around that time.
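To check a seasonal pattern like this with numbers rather than by eye, you could average the monthly totals by calendar month. The sketch below uses made-up monthly totals (toy_month stands in for by_month), so the figures are purely illustrative.

```python
import pandas as pd

# Made-up monthly totals for two years, with bumps in July and December
idx = pd.date_range('2016-01-01', periods=24, freq='MS')
views = [300 if d.month == 7 else 250 if d.month == 12 else 100 for d in idx]
toy_month = pd.DataFrame({'views': views}, index=idx)

# Average views per calendar month (1 = January, ..., 12 = December)
seasonal = toy_month.groupby(toy_month.index.month)['views'].mean()
print(seasonal.sort_values(ascending=False).head(2))
```

On the real data, replacing toy_month with by_month would show whether July and December really stand out on average.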

I now plot the view counts by year, which is similar to the default stats you can see in wp-admin.

import matplotlib.pyplot as plt
#convert Datetime Index to list and extract year (if not, matplot will automatically assign 2018/12/31 to 2019)
y = by_year.index
y = y.to_list()
y_date = pd.DatetimeIndex(y).year
#declare plot elements
values = by_year['views']
widths = [0.5 for i in y_date]
colors = ['limegreen' for i in y_date]
plt.xlabel('Time')
plt.ylabel('views (in thousands)')
#plot
plt.bar(y_date, values/1000, width = widths, color = colors)
#show values on top of bars
for i in range(len(y_date)):
    # .iloc selects by position, since values is indexed by date
    plt.text(y_date[i], values.iloc[i]/1000, round(values.iloc[i]/1000), ha = 'center')
plt.title('nipponkiyoshi.com \n Views by year')
plt.show()

At the beginning, I wrote my blog to share my notes while learning Japanese. Back in 2015 and 2016, there was a boom of Japanese learners in Vietnam. That’s probably the reason why my blog received a lot of attention in those years. However, the number of views has decreased significantly since then. I think it’s because I haven’t been updating my blog as frequently as before, and there are a lot of other blogs that provide better content.

Nowadays, I write longer blog posts, which take more time to research and write. I also have a full-time Ph.D. to finish, so I don’t have as much time to write as before. The content has also changed: the Japanese language no longer fascinates me as much as it used to, and I mainly write about economics, applied data science, and more personal reviews/thoughts. Well, gaining views was never my goal, so I’m not too worried about it. I’m just happy that my blog is still alive and that I can share my thoughts with the world.

Read this post in Vietnamese here: https://nipponkiyoshi.com/2021/03/05/cach-trich-xuat-du-lieu-wordpress-stats-bang-python/