AMR outliers or not?

I’m working on a data set with AMR for audio. AMR = Average Minute Rating, in essence how many listeners you content has had on average, each minute. You can think of it as a measure of your audience being spread out evenly over the content, from start to beginning.

To be able to calculate your AMR you need to know the total amount of minutes people have listened to your content and then of course the length of the content. So if your audio content is ten minutes long and your analytics tells you that you have a total of 43 minutes of listening, that would give you an AMR of 4.3 (=on average 4.3 persons listened to the content for the entire duration of it).

My assupmtion is, at least when it comes to well established audio content, like pods running for tens of episodes, that the AMR is more or less the same for each episode. Or at least within the same ball park.

However, at times your data might contain odd numbers. Way to small or way too big numbers. So are these outliers or should you believe that there actually was that few/many listeners at that particular time? Well, there’s no easy answer to that. You need to do some exploratory analysis and have a thorough look at your data.

First, especially if you run into this kind of data often, I would establish some kind of rule-of-thumb as to what is a normal variation in AMR in your case. For some content variation might be small, and thus even smaller deviations from the “normal” should be singled out for further analysis. In other cases the AMR varies a lot, and then you should be more tolerant.

Then, after identifying the potential outliers, you need to start playing detective. Can you find any explanation as to why the AMR is exceptionnally high or low? What date did you publish the content? Was it holidays when your audience had more time than usual to listen to the content or did some special event occur that day, that drew people away from it? Again, there is no one rule to apply, you need to judge for yourself.

Another thing to consider is the content: Was the topic especially interesting/boring? Did you have a celebrity as a guest on your pod/did you not have one (if you usually do)? Was the episode much longer/shorter than normally? Was it published within the same cycle, like day of week/month as you usually do? Did you have technical difficulties recording that affects the quality? And so on, and so son…

It all boils down to knowing your content, examining it from as many different perspectives as possible, and then make a qualified judgement as to whether or not the AMR is to be considered an outlier or not. Only then can you decide which values to filter out and which not.

When you are done with this, you can finally start the analysis of the data. As always, cleaning the data takes 80% of your time and the actual analysis 20% – or was it 90%-10…?

P.S. Sometimes it helps to visualise – but not always:

Failed linegraph of AMRs — Epic fail: Trying to plot a line graph of my AMRs using ggplot2. Well, it didn’t turn out quite as expected 😀

Google Analytics and R for a news website

For a news site understanding the analytics is essential. The basic reporting provided by Google Analytics (GA) gives us good tools for monitoring the performance on a daily bases. Even the standard version of GA (which we use) offers a wide variety of reporting options which carries you a long way. However, when you have exhausted all these options and need more, you can either use some kind of tool like Supermetrics or then query the GA api directly. For the latter purpose, I’m using R.

Querying GA with R is a very powerful way to access the analytics data. Where GA only allows you to use two dimensions at the same time, using R you can query several dimensions and easily join different datasets to combine all your data into one large data set that you then can use for further analysis. Provided you know R of course – otherwise I suggest you use a tool like the above mentioned Supermetrics.

For querying GA with R I have used the package RGoogleAnalytics. There are other packages out there, but as for many other packages in R, this is the one I first stumbled upon and then continued using… And so far, I’m quite happy with it, so why change?!

Setting up R to work with GA is quite straight forward, you can find a nice post on it here.

Recently I needed to query GA for our main site’s (hbl.fi, a newssite about Finland in swedish) different measures such as sessions, users, pageviews but also some custom dimensions including author, publish date etc. The aim was to collate this data for last year and then run some analysis on it.

I started out querying the api for the basic information: date (for when the article was read), publish date (a custom dimension), page path, page title and pageviews. After this I queried several different custom dimension one by one and joined them in R with the first dataset. This is necessary as GA only returns rows where there are no NA:s. And as I know that our metadata sometimes is incomplete, this solution allows me to stitch together a dataset that is as complete as possible.

This is my basic query:

# Init() combines all the query parameters into a list that is passed as an argument to QueryBuilder()
query.list <- Init(start.date = "2017-01-01",
                  end.date = "2017-12-31",
                  dimensions = "ga:date,ga:pagePath,ga:pageTitle,ga:dimension13", 
                  metrics = "ga:sessions,ga:users,ga:pageviews,ga:avgTimeOnPage",
                  max.results = 99999,
                  sort = "ga:date",
                  table.id = "ga:110501343")

# Create the Query Builder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)

# Extract the data and store it in a data-frame
ga.data <- GetReportData(ga.query, token, split_daywise=T)

Note this in the Init()-function:

You can have a maximum of 7 dimensions and 10 metrics
The max.results can (according to my experience) be at the most 99,999 (at 100,000 you get an error).
table.id is called ViewID in your GA’s admin panel under View Settings
If you want to use a GA segment* in your query, add the following attribute: segments = “xxxx”

Note this in the GetReportData-function:

Use split_daywise = TRUE to minimize the sampling of GA.
If your data is sampled the output returns the percentage of sessions that were used for the query. Hence, if you get no message, the data is unsampled.

* Finding the segment id isn’t as easy as finding the table id. It isn’t visible from within Google Analytics (or at least I haven’t found it). The easiest way to do this is to use the query explorer tool provided by Google. This tool is actually meant to aid you in creating api query UPIs but comes in handy for finding the segment id. Just authorise the tool to access your GA account and select the proper view. Go to the segment drop down box and select the segment you want. This will show the segment id which is in format gaid::-2. Use this inside the quotes for the segments attribute.

The basic query returned 770,000 rows of data. The others returned between 250,000 and 490,000 rows. After doing some cleaning and joining these together (using dplyers join functions) I ended up with a dataset of 450,000 rows. Each containing the amount of readers per article per day, information on category, author and publish date as well as amount of sessions and pageviews for the given day. All ready for the actual analysing of the data!

Reading multiple csv files into R while skipping the first lines

Today I needed to read in multiple csv:s into R and merge them all to one table. I’ve done this before succesfully using a nice and short piece of code:

files <- list.files(".", full.names = T, recursive = T)
listcsv <- lapply(files, function(x) read.csv(paste0(x)))
data <- do.call("rbind", listcsv)

This code works nicely when you have csv:s that are of identical structure and don’t include any extra rows at the top of your data.

Today however, I’m stuck with csv:s with an extra seven lines at the top of each file that I’d like to strip. Normally, skipping lines while reading in a csv is easy, just specify the argument skip. Like so:

file <- read.csv("filename", skip=7)

This would give me the file just nicely, as a data frame. However, the above code for reading in the files one by one into a list and the binding them together into one data frame doesn’t work as such as I need to get rid of the seven extra lines at the beginning of each file.

I could, of course, strip the seven lines from each file manually, I currently only have 13 files that I’d like to read in. But I’m sure that there will come a day when I have many more files, so why not do this properly right from the start?

After several trial-and-error approaches I reverted to Google. And found one nice Stackoverflow article on the subject.

I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:

data <- do.call(rbind, lapply
        (files, read.csv, as.is=T, skip = 7, header = FALSE))

And it works! In this format passing the extra arguments (as.is, skip and header*) to read.csv works fine. And with my amount of data (only 55000+ rows in the final data frame), it is also fast enough:

   user  system elapsed 
   0.38    0.00    0.37

So now I’m all ready to start joining this data with some other data and get on with my data analysis!

* The as.is argument makes sure that strings are read in as character and not factors. The skip argument allows you to skip the specified amount of rows. The header argument lets you specify whether the first row shouls be used as a header row or not.