The coolest thing about data

Perhaps the very coolest thing about data is when it starts talking to you. Not literally, of course, but as a figure of speech. You’ve been working on a set of raw data, spent hours cleaning it, twisting it around and getting to know it. You’ve tried some things, found nothing, tried something else. And then suddenly it’s there: the story the data wants to tell. It’s fascinating, and I know that I, at least, can get very excited about unraveling the secrets of the data at hand.

And there doesn’t need to be much analysis behind it either; sometimes it’s just plain simple data that you haven’t looked at in that way before. Like this past week, when we’ve had both the ice hockey world championships and the Eurovision Song Contest going on. Both are events that our newspaper covers, and both have the potential to attract lots of readers. Which they have. But the thing that has surprised me this week is how differently the two audiences behave. Where the ESC fans find our articles on social media and end up on our site mainly via Facebook, the hockey fans come directly to our site. This is very interesting and definitely needs to be looked into in more depth. It raises a million questions, first and foremost: how have I not seen this before? Is this the normal behaviour of these two groups of readers? Why do they behave like this? And how can we leverage this information?

Most of the time, however, that exciting feeling of discovery, of the data really talking to you, comes when you have a more complex analysis at hand. When you really start seeing patterns emerge from the data and feel the connection between the data and your daily business activities. I’m currently working on a bigger analysis of our online readers that I’m sure will reveal its inner self given some more time. Already I’ve found some interesting things, like a large group of people never visiting the front page. And by never, I really do mean never, not “a few times” or “seldom”, I truly mean never. But more on that later, after I finish the analysis. (I know, I too hate these teasers – I’m sorry.)

I hope your data is speaking to you too, because that really is the coolest thing! :nerd_face:


The 2018 presidential election in Finland, some observations from a news analytics perspective

The 2018 presidential election in Finland was quite lame. The incumbent president, Sauli Niinistö, was a very strong candidate from the outset and was predicted to win in the first round, which he did. You can read more about the election on Wikipedia, for instance.

Boring election or not, from an analytics perspective there is always something interesting to learn. So I dug into the data and tried to understand how the election had played out on our site, hbl.fi (the largest Swedish-language news site in Finland).

We published a total of 275 articles about the presidential election of 2018. Fifteen of these were published as early as 2016, but the largest share (123) was published in January 2018.

Among the readers, interest in the election grew over time, which might not be that extraordinary (for Finnish circumstances at least). Here are the pageviews per article over time (as Google Analytics samples the data heavily, I used Supermetrics to retrieve the unsampled data, filtering on a custom dimension to get only the articles about the election):

[Figure: President_2018_per_day – pageviews per election article per day]

Not much interesting going on there. So, I also took a look at the traffic coming in via social media. Twitter is big in certain circles, but not really that important a driver of traffic to our site. Facebook, on the other hand, is quite interesting.

Using Supermetrics again, and doing some manual(!) work too, I matched the Facebook post reach for a selection of our articles to the unsampled pageviews measured by Google Analytics. From this it is apparent that approximately one in ten people reached on Facebook ended up reading the articles on our site. Or more, as we know that some of the social media traffic is dark.
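
As an aside, this kind of matching is easy to script once the two exports exist. Below is a rough sketch of how the ratio could be computed with dplyr; the data frames fb and ga and their columns article_url, reach and pageviews are hypothetical stand-ins for the Supermetrics exports, not the actual report layout:

library(dplyr)

# Hypothetical data frames: 'fb' holds Facebook post reach per article URL,
# 'ga' holds the unsampled Google Analytics pageviews per article URL
fb_vs_ga <- fb %>%
  inner_join(ga, by = "article_url") %>%
  mutate(read_ratio = pageviews / reach)   # share of reached users who read the article

summary(fb_vs_ga$read_ratio)               # around 0.1 would match "one in ten"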

The problem with traffic that originates from Facebook is that people tend to jump in, read one article and then jump out again. With the presidential election this was painfully clear: average pageviews dropped to 1.2 for sessions originating from Facebook. You can picture it like this: four out of five people read only the one article that was linked on Facebook and then leave our site. One out of five reads an additional article and then decides to leave. But nobody reads three or more articles. This is something to think about – we get a good amount of traffic on these articles from Facebook, but then we are not that good at keeping the readers on board. There’s certainly room for improvement.
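
A quick back-of-the-envelope check of that picture (just arithmetic, not actual session data):

# If 80 % of Facebook sessions read exactly one article, 20 % read exactly two
# and nobody reads three or more, the average works out to 1.2 pageviews per session
sum(c(0.8, 0.2) * c(1, 2))
#> [1] 1.2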

What about the content then? Which articles interested the readers? Well, with good metadata this is not that difficult an analysis. Here are the articles split by the candidate they covered and the time of day the article was published:

[Figure: President_2018_per_candidate – pageviews per article by candidate and time of publication]

(The legend of the graph is in Swedish – “Allmän artikel” means a general article, i.e. one that either covered many candidates or didn’t cover any particular candidate at all.)

Apart from telling us which candidates attracted the most pageviews, this also clearly shows how many articles were written about each candidate. It is quite a simple graph in itself, a scatter diagram coloured by the metadata, but it reveals a lot of information. There are several takeaways: at what time we should (not) publish, which candidates our readers found interesting, whether we should have written more or less about one candidate or the other. When you plot these graphs for all different kinds of metadata, you get quite an interesting story to tell the editors!
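
For illustration, here is a minimal ggplot2 sketch of that kind of scatter plot. The data frame articles and its columns publish_hour, pageviews and candidate are hypothetical stand-ins for the article metadata; the actual graph in this post was drawn in Tableau.

library(ggplot2)

# One point per article: publishing time on the x-axis, pageviews on the y-axis,
# coloured by the candidate the article covered
ggplot(articles, aes(x = publish_hour, y = pageviews, colour = candidate)) +
  geom_point(alpha = 0.7) +
  labs(x = "Hour of publication", y = "Pageviews", colour = "Candidate")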

So even a boring election can be interesting when you look at the data. In fact, with data, nothing is ever boring 😉

 

A note about the graphs: the first graph in this post was made with Google Sheets’ chart function. It was an easy-to-use, and good enough, solution for telling the story of the pageviews. Why use something fancier? The second graph I drew in Tableau, as the visualisation options are so much better there than in other tools. I like using the optimal tool for the task: not overkilling easy tasks by importing them into Tableau, but also not settling for lesser quality when a more advanced tool offers a better solution. If I needed to plot the same graphs over and over again, I would go with an R script to reduce the amount of manual pointing and clicking.

 

Switching Supermetrics reports to a new user – some tips and tricks

Recently I was faced with the need to switch a bunch of Supermetrics reports (in Google Sheets) over to another user. How this is done is perhaps not the most obvious thing, but it’s not at all hard once you figure it out.

This is how you do it:

  1. Open the report and navigate to the sheet called SupermetricsQueries. (If you can’t see this sheet, you can make it visible either via the All sheets button in the lower left-hand corner of your Google Sheet or via the add-on menu Supermetrics / Manage queries.) On this sheet you’ll find some instructions and a table with information about the queries in this report.
  2. Delete the content in the column QueryID:
    [Figure: supermetrics_QueryID]
  3. Replace the content in the column Refresh with user account with the correct credentials. E.g. for Google Analytics this is the email address of the account you want to use; for Facebook it is a long numerical id.
    [Figure: supermetrics_RefreshWithUserAccount]
  4. Navigate to the Supermetrics add-on menu and choose Refresh all.
  5. Be sure to check the results in the column Last status to ensure that all queries were updated as planned.
    [Figure: supermetrics_LastStatus]
  6. Then, check the data in the reports themselves.
  7. When you’re done, I suggest you hide the SupermetricsQueries sheet so that you (or someone you shared the report with) don’t alter the specs by mistake.
  8. Don’t forget to transfer the ownership of the file itself if needed!

 

This is pretty straightforward. While updating a bunch of reports, however, I made the following notes to self that I’d like to share with you:

  • Make sure that the account you are using Supermetrics with has access to all the data you want to query!
  • Before you start transferring your reports, take some time to get acquainted with their content. Perhaps even make a safety copy, so that you can be sure that the new credentials and queries produce the data you expected.
  • When updating the report you will probably want to make changes to some of the queries. I noticed that when updating many queries, it can be easier to edit the specifications directly in the table on the SupermetricsQueries sheet instead of using the add-on. Just be careful while doing this!
  • NB! If the original report was scheduled to auto-refresh or auto-email at certain intervals, you will need to redo the scheduling. So make sure you know who the recipients of the original report were before you switch the ownership!

Autorefreshing and paginating remote Data Studio dashboards

I’d like to share with you a very nice and handy Chrome extension: Data Studio Auto Refresh. With it, you can have your Google Data Studio dashboards auto-refreshing and auto-paginating on a (remote) screen.

At my workplace we have a screen situated at the news desk. It previously showed a Data Studio dashboard with only one page, displaying the number of registered users on our site, the NPS for our sites and the most recent comments from a feedback form on the site.

Following Mark Zuckerberg’s announcement last week about prioritising posts from family and friends over posts from pages, we realised we needed a closer follow-up on our Facebook posts as well. So I added a second page to the dashboard that shows Facebook reach and the number of engaged users:

[Figure: screen_fb – dashboard page with Facebook reach and engaged users]

Now the original page alternates with this new page every couple of minutes. Thus the news desk can monitor the reach and user engagement of our Facebook posts and hopefully learn what makes Facebook’s algorithms tick. Over time we will of course need to conduct some proper analysis of the posts’ performance, but for now this gives us some insight.

(This screen actually runs on a Raspberry Pi, which I manage remotely from another floor. It feels like playing with toys but is actually a very good and cheap solution to this simple need.)

Google Analytics and R for a news website

For a news site, understanding the analytics is essential. The basic reporting provided by Google Analytics (GA) gives us good tools for monitoring performance on a daily basis. Even the standard version of GA (which we use) offers a wide variety of reporting options which carry you a long way. However, when you have exhausted all these options and need more, you can either use a tool like Supermetrics or query the GA API directly. For the latter purpose, I’m using R.

Querying GA with R is a very powerful way to access the analytics data. Where the GA interface only allows you to use two dimensions at the same time, with R you can query several dimensions and easily join different datasets into one large dataset for further analysis. Provided you know R, of course – otherwise I suggest you use a tool like the above-mentioned Supermetrics.

For querying GA with R I have used the package RGoogleAnalytics. There are other packages out there, but as with many other packages in R, this is the one I first stumbled upon and then continued using… And so far I’m quite happy with it, so why change?!

Setting up R to work with GA is quite straightforward; you can find a nice post on it here.
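
In short, the setup boils down to a one-time OAuth authorisation. Here is a minimal sketch with RGoogleAnalytics, assuming you have created a client id and secret in the Google API console (the credential strings below are placeholders):

library(RGoogleAnalytics)

client.id     <- "your-client-id.apps.googleusercontent.com"  # placeholder
client.secret <- "your-client-secret"                         # placeholder

# Authorise once (opens a browser window) and save the token for later sessions
token <- Auth(client.id, client.secret)
save(token, file = "./ga_token")

# In later sessions: load("./ga_token"), then refresh the token if it has expired
ValidateToken(token)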

Recently I needed to query GA for our main site (hbl.fi, a Swedish-language news site about Finland) for measures such as sessions, users and pageviews, but also for some custom dimensions including author, publish date etc. The aim was to collate this data for last year and then run some analysis on it.

I started out querying the API for the basic information: date (for when the article was read), publish date (a custom dimension), page path, page title and pageviews. After this I queried several other custom dimensions one by one and joined them in R with the first dataset. This is necessary as GA only returns rows where none of the requested dimensions are missing (no NAs). And as I know that our metadata sometimes is incomplete, this approach allows me to stitch together a dataset that is as complete as possible.

This is my basic query:

# Init() combines all the query parameters into a list that is passed as an argument to QueryBuilder()
query.list <- Init(start.date = "2017-01-01",
                   end.date = "2017-12-31",
                   dimensions = "ga:date,ga:pagePath,ga:pageTitle,ga:dimension13",
                   metrics = "ga:sessions,ga:users,ga:pageviews,ga:avgTimeOnPage",
                   max.results = 99999,
                   sort = "ga:date",
                   table.id = "ga:110501343")

# Create the QueryBuilder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)

# Extract the data and store it in a data frame
ga.data <- GetReportData(ga.query, token, split_daywise = TRUE)

 

Note the following about the Init() function:

  • You can have a maximum of 7 dimensions and 10 metrics
  • The max.results can, in my experience, be at most 99,999 (at 100,000 you get an error).
  • The table.id is the View ID, found in GA’s admin panel under View Settings (prefixed with "ga:" as in the query above).
  • If you want to use a GA segment* in your query, add the following attribute: segments = "xxxx"

 

Note the following about the GetReportData() function:

  • Use split_daywise = TRUE to minimize the sampling of GA.
  • If your data is sampled, the output shows the percentage of sessions that was used for the query. Hence, if you get no such message, the data is unsampled.

 

* Finding the segment id isn’t as easy as finding the table id. It isn’t visible from within Google Analytics (or at least I haven’t found it). The easiest way is to use the Query Explorer tool provided by Google. This tool is actually meant to help you create API query URIs, but it comes in handy for finding the segment id. Just authorise the tool to access your GA account and select the proper view. Go to the segment drop-down box and select the segment you want. This will show the segment id, which is in the format gaid::-2. Use this inside the quotes for the segments attribute.
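
As a concrete sketch, this is what the basic query above would look like restricted to a segment. The gaid::-2 id is only an example taken from the text above; substitute the id the Query Explorer shows for your own segment:

# The same basic query as above, limited to one segment
query.list <- Init(start.date = "2017-01-01",
                   end.date = "2017-12-31",
                   dimensions = "ga:date,ga:pagePath,ga:pageTitle",
                   metrics = "ga:sessions,ga:users,ga:pageviews",
                   segments = "gaid::-2",        # segment id from the Query Explorer
                   max.results = 99999,
                   sort = "ga:date",
                   table.id = "ga:110501343")

ga.data.segment <- GetReportData(QueryBuilder(query.list), token, split_daywise = TRUE)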

 

The basic query returned 770,000 rows of data. The others returned between 250,000 and 490,000 rows. After doing some cleaning and joining these together (using dplyr’s join functions) I ended up with a dataset of 450,000 rows, each containing the number of readers per article per day, information on category, author and publish date, as well as the number of sessions and pageviews for the given day. All ready for the actual analysis of the data!
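
For reference, the stitching step can look something like this. The data frames ga.data.author and ga.data.category are hypothetical names for the one-custom-dimension-at-a-time queries, and the join columns assume that GetReportData returns the dimension names without the ga: prefix:

library(dplyr)

# Join the extra custom-dimension queries onto the basic dataset,
# keeping rows from the basic query even where metadata is missing
ga.full <- ga.data %>%
  left_join(ga.data.author,   by = c("date", "pagePath")) %>%
  left_join(ga.data.category, by = c("date", "pagePath"))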

 

Supermetrics – Easy access to much data!

One nice and very handy tool for extracting data from various sources is a Google Sheets add-on called Supermetrics. With it you can access several different data sources, e.g. Google Analytics, Facebook Insights, Google AdWords, Twitter Ads, Instagram and many more. Once installed (and that’s super easy) it opens up as a sidebar in your sheet, like this:

[Figure: supermetrics_sidebar]

Then it’s more or less a matter of clicking the right options in the drop-down menus and you have a nice and handy report. Here are some tips for using Google Analytics with Supermetrics:

1) Make sure that the account you are logged in to Google Sheets (and thus Supermetrics) with also has access to the data you want to query.

2) Remember to have cell A1 selected before opening Supermetrics, or your data will appear in some random corner of your spreadsheet.

3) Pay attention when selecting the dates. If you plan to make a report that auto-refreshes, you need to choose the dates using the predefined intervals like today, yesterday, last week, last month, year to date etc. If you choose a custom interval, let’s say January 1st to January 7th, the report will always show the results for those dates even though you ask it to refresh weekly.

4) Split by… rows and/or columns. This is the main benefit compared to querying Google Analytics directly: here you can specify several dimensions for your data, whereas in the GA interface you only get two.

5) You don’t have to define any segments or filters. If you do, make sure that the account you’re logged in as also has access to these in Google Analytics (and that they are available for the view you are querying).

6) Under Options, make sure to tick both Add note to query results showing whether Google has used sampling and Try to avoid Google’s data sampling. You’ll see that Supermetrics is often capable of supplying unsampled data where Google itself would give you sampled data.

Here’s a simple example, querying one of our sites for 2017 sessions, splitting the data by operating system and system version:

[Figure: 2017operatingsystems – 2017 sessions split by operating system and system version]

Nothing spectacular, but very easy to use and easy to share. Absolutely one of my favourite tools!

 

Reading multiple csv files into R while skipping the first lines

Today I needed to read multiple CSV files into R and merge them all into one table. I’ve done this before successfully using a nice and short piece of code:

# List all files in the current directory (and subdirectories)
files <- list.files(".", full.names = TRUE, recursive = TRUE)
# Read each file into a list of data frames
listcsv <- lapply(files, read.csv)
# Bind the data frames together into one
data <- do.call("rbind", listcsv)

This code works nicely when your CSV files have an identical structure and don’t include any extra rows at the top of the data.

Today, however, I’m stuck with CSV files that have an extra seven lines at the top of each file that I’d like to strip. Normally, skipping lines while reading in a CSV is easy; just specify the skip argument. Like so:

file <- read.csv("filename", skip=7)

This gives me the file just nicely, as a data frame. However, the code above for reading the files one by one into a list and then binding them together into one data frame doesn’t work as such, since I need to get rid of the seven extra lines at the beginning of each file.

I could, of course, strip the seven lines from each file manually – I currently have only 13 files to read in. But I’m sure the day will come when I have many more files, so why not do this properly right from the start?

After several trial-and-error approaches I turned to Google. And found a nice Stack Overflow thread on the subject.

I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:

data <- do.call(rbind,
                lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))

And it works! In this format, passing the extra arguments (as.is, skip and header*) to read.csv works fine. And with my amount of data (only 55,000+ rows in the final data frame) it is also fast enough:

   user  system elapsed 
   0.38    0.00    0.37 
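
(A timing like the one above can be produced by wrapping the call in system.time():)

system.time(
  data <- do.call(rbind,
                  lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))
)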

 

So now I’m all ready to start joining this data with some other data and get on with my data analysis!

 

* The as.is argument makes sure that strings are read in as characters and not factors. The skip argument lets you skip the specified number of rows. The header argument lets you specify whether the first row should be used as a header row or not.