How The Guardian Uses “Attention Analytics” To Track Rising Stories

A tool created during an internal hack day has become a key part of the Guardian’s future.


A day after Philip Seymour Hoffman died of a drug overdose, a year-old article by comedian and actor Russell Brand on why he gave up drugs suddenly burst into the Guardian’s top 10 most-read list. Facebook traffic grew gradually throughout the day, culminating in Brand himself tweeting the article at 11:13 p.m., which set off a massive wave of Facebook referrals.


“Somebody tweeted that yesterday completely independently of us,” says Guardian architect Graham Tackley. “So we posted it on our Facebook account. We wouldn’t have noticed that it had gotten popular if we didn’t have real-time feedback.”

That real-time feedback came from the Guardian’s in-house “attention data” tool, Ophan. It tracks all of the Guardian’s traffic and makes it available to 400 journalists, editors, and developers with a time lag of less than five seconds. Users can see what’s being read most on the Guardian’s various home pages. The data can be filtered by country, time period, section, mobile app and device, browser, referral source, and more.

For a particular story, journalists can see how the traffic has changed over time and get data like which tweets have driven the most traffic or the effect of internal promotion (putting the article on the Guardian’s home page).

The Guardian’s developers use Ophan to track load time on its various web and mobile sites. The team releases several new versions of the website per day. “You can tell when you have broken something because the graph drops to the floor,” says Tackley. “If you get that feedback in a couple of minutes you can fix it and the impact is not so great.”

From Hack Day To Essential Tool

Last year the Guardian had 84 million unique readers every month, but its third-party analytics tool only provided traffic data with a four-hour time lag, broken down per hour. Most Guardian journalists and editors didn’t even have access to that data. This made life difficult for the newspaper’s digital audience manager, Chris Moran. One of his responsibilities was SEO for the 400 pieces of content the Guardian can produce in a day.


“When he [Moran] was trying to promote stuff and tweak headlines he got very limited feedback four hours later,” says Tackley. “For a news organization, that’s pretty poor.” So Tackley decided to take on the problem during one of the Guardian’s in-house hack days.

Tackley already spent a lot of his time analyzing the Guardian’s web logs to identify the causes of any problems with the site. Every reader visit was also logged there. He tailed the logs on a couple of servers, pushed the entries to a message queue, and created a Scala Play Framework app to consume the data and display it on a dashboard.

Since Moran only cared about what was happening now, Tackley stored the last three minutes of data in memory in a Scala list. He was still only sampling 10% of the Guardian’s traffic at this point, but for the SEO team it was a revolution. “What this enabled them to do was learn for themselves what worked and what didn’t,” says Tackley. Moran asked Tackley to keep the dashboard running, which he did for six months on his own desktop.
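To make the idea concrete, here is a minimal sketch of a three-minute, in-memory window over a 10% sample of page views. It is an illustration only: Tackley's version was written in Scala and its code is not shown in this article, so the class name, method names, and the step that scales sampled counts back up are all assumptions made for this example, written in Python.

```python
import random
import time
from collections import Counter, deque

WINDOW_SECONDS = 180  # "the last three minutes" described in the article
SAMPLE_RATE = 0.10    # the 10% traffic sample described in the article


class RecentPageViews:
    """Hypothetical in-memory store that keeps only recent, sampled page views."""

    def __init__(self, window=WINDOW_SECONDS, sample_rate=SAMPLE_RATE,
                 rng=random.random):
        self.window = window
        self.sample_rate = sample_rate
        self.rng = rng          # injectable so the sampling can be tested
        self.events = deque()   # (timestamp, path) pairs, oldest first

    def record(self, path, now=None):
        """Keep roughly sample_rate of all views, then drop expired ones."""
        now = time.time() if now is None else now
        if self.rng() < self.sample_rate:
            self.events.append((now, path))
        self._expire(now)

    def _expire(self, now):
        """Discard events older than the window."""
        cutoff = now - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def top(self, n=10, now=None):
        """Return the n most viewed paths with counts scaled up from the sample."""
        now = time.time() if now is None else now
        self._expire(now)
        counts = Counter(path for _, path in self.events)
        return [(path, round(count / self.sample_rate))
                for path, count in counts.most_common(n)]
```

Because old events are dropped from the front of the queue as new ones arrive, memory use stays bounded by three minutes of sampled traffic, which is what made running such a dashboard on a single desktop plausible.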

Word got around, and more and more Guardian employees started to use Tackley’s dashboard, now named Ophan. Tackley decided to upgrade it to capture the Guardian’s entire click stream, which generates between 15 million and 25 million events a day, and to store the data for seven days. This meant moving from his desktop to Amazon Web Services.

A hidden JavaScript pixel on the website now records every event, instead of the event being retrieved from the logs, and places it in a message queue. Since there are now too many events to hold in memory, an app called Serf consumes the message queue, extracts what is needed, and inserts it into an ElasticSearch cluster. The dashboard asks the same questions of ElasticSearch, a real-time search and analytics engine, that it had previously posed to the in-memory event list.
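As a rough illustration of what "the same question" might look like once the events live in a search cluster, the sketch below builds a "most read pages in the last few minutes" request body using the engine's range query and terms aggregation. The field names timestamp and path and the function name are hypothetical, and this is not presented as the query Ophan actually runs.

```python
import json


def top_pages_query(minutes=3, size=10):
    """Build a request body asking for the most viewed paths in a recent window.

    The field names "timestamp" and "path" are assumptions for this example.
    """
    return {
        "size": 0,  # return only the aggregation, not the individual events
        "query": {
            "range": {"timestamp": {"gte": f"now-{minutes}m"}}
        },
        "aggs": {
            "top_pages": {"terms": {"field": "path", "size": size}}
        },
    }


if __name__ == "__main__":
    # Print the body that would be sent to the cluster's search endpoint.
    print(json.dumps(top_pages_query(), indent=2))
```

Widening the window from minutes to days changes only the range clause, which is the practical benefit of moving from a fixed in-memory list to a store that holds seven days of events.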


Analyze This

Journalists and editors can now explore and understand audience attention for themselves and take action accordingly. If a great piece isn’t being read, it can be promoted internally or externally. If there’s an unexpected hit like Russell Brand’s article, it can be given an extra boost. The team that manages the front page even uses it to see what has been published recently.

“Historically as an organization we have been a bit nervous of looking at how many people are looking at our content,” says Tackley. “In a serious news organization like ours there’s obviously a fear of the BuzzFeed-ification of news, especially as social starts to catch up with traditional search referrals. People traditionally think that the only thing that does well on Facebook is ‘top 10 cats.’ Actually our serious journalism does really well as well. People are realizing that looking at what people read is not evil. It shouldn’t be the only thing you chase and it’s merely one input into the editorial process but it’s not necessarily a negative thing.”

Ophan has helped the Guardian get a lot better at posting on Facebook. “Most of that is applying the subbing rules you use in a traditional print product. Pick out the key line. Pick out the key quote,” Tackley says. “There’s a number of journalists who run their own Twitter accounts who have actually started seeing the style of tweet that works for their content.”

“It used to be that when your article was queued to be printed, that’s it. It’s over,” says Tackley. “One of the things we have realized, and we see it again and again, is that if you publish something on the web and it doesn’t do very well in the first hour, probably it’s not going anywhere ever. Understanding how people are going to find an article from the moment you have published it is a really important part of the production cycle. We are certainly getting better at that.”