Analysing the wording of the NPS question

NPS (Net Promoter Score) is a popular way to measure customer satisfaction. The score is supposed to correlate with growth, which of course makes it appealing to management teams.

The idea is simple: you ask customers how likely they are to recommend your product or service to others, on a scale from 0 to 10. Then you calculate the score by subtracting the percentage of answers from zero to six (the detractors) from the percentage of nines and tens (the promoters). A positive score is supposed to indicate growth, a negative one decline.
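As a minimal sketch of that calculation in R, using the usual convention of expressing both groups as percentages of all answers (the answers below are made up for illustration, and the function name nps is my own):

```r
# Hypothetical answers on the 0-10 scale (made-up data for illustration)
answers <- c(10, 9, 9, 8, 7, 6, 3, 10, 0, 9)

nps <- function(x) {
  promoters  <- mean(x >= 9)   # share answering 9 or 10
  detractors <- mean(x <= 6)   # share answering 0 to 6
  100 * (promoters - detractors)
}

nps(answers)
```

Here half of the answers are nines or tens and three out of ten are six or lower, so the score is 50 minus 30, i.e. 20. Sevens and eights count towards the total but towards neither group.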

My employer is a news company publishing newspapers and sites mainly in Swedish (some in Finnish too). Therefore we mainly use the key question in Swedish, i.e. Hur sannolikt skulle du rekommendera X till dina vänner? This wording, although an exact match to the original (How likely is it that you would recommend X to a friend?), seems a little bit clumsy in Swedish. We would prefer to use a more direct wording, i.e. Skulle du rekommendera X till dina vänner? which would translate into Would you recommend X to a friend? However, we were a bit hesitant to change the wording without solid proof that it would not affect the answers.

So we decided to test it. We randomly asked our readers either the original key question or the modified one. The total number of answers was 1521. Then, using R and the wilcox.test() function, I analysed the answers and could conclude that there is no statistically significant difference in the results whichever way we ask the question.
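For readers who want to try the same kind of comparison, here is a minimal sketch of a two-sample Wilcoxon test in R. The answers are simulated, not our real survey data, and the split into 760 and 761 answers per group is an assumption made only so that the total matches 1521:

```r
set.seed(1)  # make the simulated example reproducible

# Simulated answers, NOT the real survey data: both groups are drawn
# from the same distribution over the 0-10 scale
original <- sample(0:10, 760, replace = TRUE)
modified <- sample(0:10, 761, replace = TRUE)

# Two-sample Wilcoxon rank-sum test; exact = FALSE because a 0-10 scale
# produces many ties, so the normal approximation is used
result <- wilcox.test(original, modified, exact = FALSE)
result$p.value
```

A p-value above the chosen significance level (commonly 0.05) would be read as no evidence that the two wordings produce different answers.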

There is some criticism out there about using the NPS, and I catch myself wondering every now and again if people are getting too used to the scale for it to be accurate any more. Also, here in Finland there is a small risk that people confuse the scale with the 4-10 scale that is commonly used in schools, and therefore base their answers on their years-old impression of what is considered good and what is considered bad. I’d very much like to see some research about it.

Nevertheless, we are nowadays happily using the shorter version of the NPS key question, and have not found any reason not to. Perhaps it could be altered in other languages too?

The 2018 presidential election in Finland, some observations from a news analytics perspective

The 2018 presidential election in Finland was quite lame. The incumbent president, Sauli Niinistö, was a very strong candidate from the outset and was predicted to win in the first round, which he did. You can read more about the election for instance on Wikipedia.

Boring election or not, from an analytics perspective there is always something interesting to learn. So I dug into the data and tried to understand how the election had played out on our site, hbl.fi (which is the largest Swedish-language news site in Finland).

We published a total of 275 articles about the 2018 presidential election. 15 of these were published as early as 2016, but the largest share (123) was published in January 2018.

Among readers, interest in the election grew over time, which might not be that extraordinary (by Finnish standards at least). Here are the pageviews per article over time (as Google Analytics samples the data heavily, I used Supermetrics to retrieve the unsampled data, filtering on a custom dimension to get only the articles about the election):

Not much interesting going on there. So, I also took a look at the traffic coming in via social media. Twitter is big in certain circles, but not really that important a driver of traffic to our site. Facebook, on the other hand, is quite interesting.

Using Supermetrics again, and doing some manual(!) work too, I matched the Facebook post reach for a selection of our articles to the unsampled pageviews measured by Google Analytics. From this, it is apparent that approximately one in ten people reached on Facebook ended up reading our articles on our site. Or more, as we know that some of the social media traffic is dark.

The problem with traffic that originates from Facebook is that people tend to jump in, read one article and then jump out again. For the presidential election this was painfully clear: sessions originating from Facebook averaged only 1.2 pageviews. You can picture this as follows: four out of five people read only the one article that was linked on Facebook and then leave our site. One out of five reads one additional article and then leaves. But nobody reads three or more articles. This is something to think about: we get a good amount of traffic to these articles from Facebook, but we are not that good at keeping the readers on board. There’s certainly room for improvement.
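The simplified picture is just one way of arriving at an average of 1.2. The arithmetic can be checked in a couple of lines of R (the variable names are my own):

```r
# Shares of sessions by number of articles read, as in the simplified picture
share <- c(one_article = 4/5, two_articles = 1/5)
pages <- c(1, 2)

# Weighted average: pageviews per session
avg_pageviews <- sum(share * pages)
avg_pageviews
```

Four fifths of the sessions contribute one pageview each and one fifth contributes two, which gives 0.8 + 0.4 = 1.2 pageviews per session.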

What about the content then? Which articles interested the readers? Well, with good metadata this is not that difficult an analysis. Looking at the articles split by the candidate they covered and the time of day the article was published:

(The legend of the graph is in Swedish => “Allmän artikel” means a general article, i.e. either it covered many candidates or it didn’t cover any candidates at all.)

Apart from telling us which candidates attracted the most pageviews, this also clearly shows how many articles were written about which candidate. It is quite a simple graph in itself, a scatter diagram coloured by the metadata, but it reveals a lot of information. There are several takeaways from this graph: at what time should we (not) publish, which candidates did our readers find interesting, and should we have written more or less about one candidate or the other. When you plot these graphs for all different kinds of metadata, you get quite an interesting story to tell the editors!

So even a boring election can be interesting when you look at the data. In fact, with data, nothing is ever boring 😉

A note about the graphs: The first graph in this post was made with Google Sheets’ chart function. It was an easy-to-use, and good enough, solution to tell the story of the pageviews. Why use something fancier? The second graph I drew in Tableau, as the visualisation options are so much better there than in other tools. I like using the optimal tool for the task: not overkilling easy stuff by importing it to Tableau, but also not settling for lesser quality when there is a solution using a more advanced tool. If I needed to plot the same graphs over and over again, I would go with an R script to reduce the need for manual clicking and pointing.
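As a rough sketch of what such an R script could look like, here is a small base R function that draws a scatter diagram coloured by a metadata column. The data frame, its column names and the function name plot_articles are all made up for illustration and do not reflect our real data:

```r
# Hypothetical article data; the column names are assumptions, not our real schema
articles <- data.frame(
  publish_hour = c(6, 9, 12, 15, 18, 21),
  pageviews    = c(1200, 800, 1500, 400, 2300, 950),
  candidate    = c("A", "B", "A", "General", "B", "General")
)

plot_articles <- function(df) {
  groups <- factor(df$candidate)
  # One point per article, coloured by the metadata group
  plot(df$publish_hour, df$pageviews,
       col = as.integer(groups), pch = 19,
       xlab = "Hour of publication", ylab = "Pageviews")
  legend("topleft", legend = levels(groups),
         col = seq_along(levels(groups)), pch = 19)
  invisible(groups)  # returned so the grouping can be inspected
}

plot_articles(articles)
```

Swapping candidate for another metadata column would give the same graph split in a different way, without any clicking and pointing.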

Google Analytics and R for a news website

For a news site, understanding the analytics is essential. The basic reporting provided by Google Analytics (GA) gives us good tools for monitoring the performance on a daily basis. Even the standard version of GA (which we use) offers a wide variety of reporting options which carry you a long way. However, when you have exhausted all these options and need more, you can either use a tool like Supermetrics or query the GA API directly. For the latter purpose, I’m using R.

Querying GA with R is a very powerful way to access the analytics data. Where the GA reports only allow you to use two dimensions at the same time, with R you can query several dimensions and easily join different datasets to combine all your data into one large dataset that you can then use for further analysis. Provided you know R, of course. Otherwise I suggest you use a tool like the above-mentioned Supermetrics.

For querying GA with R I have used the package RGoogleAnalytics. There are other packages out there, but as for many other packages in R, this is the one I first stumbled upon and then continued using… And so far, I’m quite happy with it, so why change?!

Setting up R to work with GA is quite straightforward; you can find a nice post on it here.

Recently I needed to query GA for different measures for our main site (hbl.fi, a news site about Finland in Swedish), such as sessions, users and pageviews, but also some custom dimensions including author, publish date etc. The aim was to collate this data for last year and then run some analysis on it.

I started out querying the API for the basic information: date (for when the article was read), publish date (a custom dimension), page path, page title and pageviews. After this I queried several different custom dimensions one by one and joined them in R with the first dataset. This is necessary as GA only returns rows where none of the requested dimensions is missing. And as I know that our metadata is sometimes incomplete, this solution allows me to stitch together a dataset that is as complete as possible.

This is my basic query:

library(RGoogleAnalytics)

# Init() combines all the query parameters into a list that is passed as an argument to QueryBuilder()
query.list <- Init(start.date = "2017-01-01",
                   end.date = "2017-12-31",
                   dimensions = "ga:date,ga:pagePath,ga:pageTitle,ga:dimension13",
                   metrics = "ga:sessions,ga:users,ga:pageviews,ga:avgTimeOnPage",
                   max.results = 99999,
                   sort = "ga:date",
                   table.id = "ga:110501343")

# Create the Query Builder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)

# Extract the data and store it in a data frame
# (token is the OAuth token created when setting up R to work with GA)
ga.data <- GetReportData(ga.query, token, split_daywise = TRUE)

Note this in the Init()-function:

You can have a maximum of 7 dimensions and 10 metrics
In my experience max.results can be at most 99,999 (at 100,000 you get an error).
table.id is the View ID shown in your GA admin panel under View Settings, prefixed with ga:
If you want to use a GA segment* in your query, add the following attribute: segments = "xxxx"

Note this in the GetReportData-function:

Use split_daywise = TRUE to minimize the sampling of GA.
If your data is sampled the output returns the percentage of sessions that were used for the query. Hence, if you get no message, the data is unsampled.

* Finding the segment id isn’t as easy as finding the table id. It isn’t visible from within Google Analytics (or at least I haven’t found it). The easiest way to find it is to use the query explorer tool provided by Google. This tool is actually meant to aid you in creating API query URIs, but it comes in handy for finding the segment id. Just authorise the tool to access your GA account and select the proper view. Go to the segment drop-down box and select the segment you want. This will show the segment id, which is in the format gaid::-2. Use this inside the quotes for the segments attribute.

The basic query returned 770,000 rows of data. The others returned between 250,000 and 490,000 rows. After doing some cleaning and joining these together (using dplyr’s join functions) I ended up with a dataset of 450,000 rows, each containing the number of readers per article per day, information on category, author and publish date, as well as the number of sessions and pageviews for the given day. All ready for the actual analysis of the data!
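To illustrate the stitching, here is a minimal sketch with two tiny made-up data frames standing in for the query results. To keep the sketch free of package dependencies it uses base R's merge() rather than dplyr; left_join(base, authors, by = c("date", "pagePath")) would be the dplyr equivalent. All names and values below are assumptions for illustration only:

```r
# Made-up miniature versions of the query results; column names are assumptions
base <- data.frame(
  date      = c("20170101", "20170101", "20170102"),
  pagePath  = c("/a", "/b", "/a"),
  pageviews = c(10, 5, 7)
)
authors <- data.frame(
  date     = c("20170101", "20170102"),
  pagePath = c("/a", "/a"),
  author   = c("Author 1", "Author 1")
)

# Left join: keep every row of the basic query, fill missing metadata with NA
combined <- merge(base, authors, by = c("date", "pagePath"), all.x = TRUE)
combined
```

The row for /b has no author in the second data frame, so it is kept with author set to NA instead of being dropped. This is what makes the stitched dataset as complete as possible.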

Reading multiple csv files into R while skipping the first lines

Today I needed to read multiple csv files into R and merge them all into one table. I’ve done this before successfully using a nice and short piece of code:

files <- list.files(".", full.names = TRUE, recursive = TRUE)
listcsv <- lapply(files, read.csv)
data <- do.call("rbind", listcsv)

This code works nicely when you have csv files that are of identical structure and don’t include any extra rows at the top of your data.

Today, however, I’m stuck with csv files that each have an extra seven lines at the top that I’d like to strip. Normally, skipping lines while reading in a csv is easy: just specify the argument skip. Like so:

file <- read.csv("filename", skip=7)

This would give me the file just nicely, as a data frame. However, the above code for reading in the files one by one into a list and then binding them together into one data frame doesn’t work as such, as I need to get rid of the seven extra lines at the beginning of each file.

I could, of course, strip the seven lines from each file manually; I currently only have 13 files that I’d like to read in. But I’m sure that there will come a day when I have many more files, so why not do this properly right from the start?

After several trial-and-error approaches I resorted to Google, and found a nice Stack Overflow thread on the subject.

I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:

data <- do.call(rbind, lapply(files, read.csv,
                              as.is = TRUE, skip = 7, header = FALSE))

And it works! In this format passing the extra arguments (as.is, skip and header*) to read.csv works fine. And with my amount of data (only 55000+ rows in the final data frame), it is also fast enough:

   user  system elapsed 
   0.38    0.00    0.37

So now I’m all ready to start joining this data with some other data and get on with my data analysis!

* The as.is argument makes sure that strings are read in as character and not as factors. The skip argument allows you to skip the specified number of rows. The header argument lets you specify whether the first row should be used as a header row or not.
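If you want to try the approach without real files at hand, here is a self-contained sketch. It writes two small demo files with seven lines of preamble each into a temporary folder and then reads them back in. The file names and contents are made up, and the pattern argument is an addition of mine to pick up only csv files:

```r
# Write two small demo files into a temporary folder, each with seven
# lines of preamble before the data (file contents are made up)
dir <- tempfile("csvdemo")
dir.create(dir)
for (i in 1:2) {
  lines <- c(paste("preamble line", 1:7),
             paste0("row", 1:3, ",", i))
  writeLines(lines, file.path(dir, paste0("file", i, ".csv")))
}

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Extra arguments after the function name are passed on to read.csv
data <- do.call(rbind, lapply(files, read.csv,
                              as.is = TRUE, skip = 7, header = FALSE))
dim(data)
```

Each demo file contributes three data rows with two columns, so the combined data frame has six rows and two columns, named V1 and V2 because header = FALSE.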