Google Analytics and R for a news website

For a news site, understanding the analytics is essential. The basic reporting provided by Google Analytics (GA) gives us good tools for monitoring performance on a daily basis. Even the standard version of GA (which we use) offers a wide variety of reporting options that carry you a long way. However, when you have exhausted all these options and need more, you can either use a tool like Supermetrics or query the GA API directly. For the latter purpose, I’m using R.

Querying GA with R is a very powerful way to access the analytics data. Where the GA interface only allows you to use two dimensions at the same time, with R you can query several dimensions and easily join different datasets into one large data set that you can then use for further analysis. Provided you know R, of course – otherwise I suggest you use a tool like the above-mentioned Supermetrics.

For querying GA with R I have used the package RGoogleAnalytics. There are other packages out there, but as with many other packages in R, this is the one I first stumbled upon and then continued using… And so far I’m quite happy with it, so why change?!

Setting up R to work with GA is quite straightforward; you can find a nice post on it here.

Recently I needed to query GA for our main site (hbl.fi, a news site about Finland in Swedish) for measures such as sessions, users and pageviews, but also for some custom dimensions including author, publish date etc. The aim was to collate this data for last year and then run some analysis on it.

I started out by querying the API for the basic information: date (when the article was read), publish date (a custom dimension), page path, page title and pageviews. After this I queried several different custom dimensions one by one and joined them in R with the first dataset. This is necessary because GA only returns rows with no NAs, and as I know our metadata is sometimes incomplete, this approach lets me stitch together a dataset that is as complete as possible.

This is my basic query:

# Load the package; the token object comes from the one-time OAuth setup
# (Auth()/ValidateToken()) described in the setup post linked above
library(RGoogleAnalytics)

# Init() combines all the query parameters into a list that is passed as an argument to QueryBuilder()
query.list <- Init(start.date = "2017-01-01",
                   end.date = "2017-12-31",
                   dimensions = "ga:date,ga:pagePath,ga:pageTitle,ga:dimension13",
                   metrics = "ga:sessions,ga:users,ga:pageviews,ga:avgTimeOnPage",
                   max.results = 99999,
                   sort = "ga:date",
                   table.id = "ga:110501343")

# Create the QueryBuilder object so that the query parameters are validated
ga.query <- QueryBuilder(query.list)

# Extract the data and store it in a data frame
ga.data <- GetReportData(ga.query, token, split_daywise = TRUE)


Note the following about the Init() function:

  • You can have a maximum of 7 dimensions and 10 metrics.
  • max.results can, in my experience, be at most 99,999 (at 100,000 you get an error).
  • table.id is the View ID found in GA’s admin panel under View Settings (prefixed with "ga:", as in the query above).
  • If you want to use a GA segment* in your query, add the attribute segments = "xxxx" – see the footnote below, and the sketch after it, for finding and using the segment id.


Note the following about the GetReportData() function:

  • Use split_daywise = TRUE to minimise GA’s sampling.
  • If your data is sampled, the output shows the percentage of sessions that were used for the query. Hence, if you get no such message, the data is unsampled.


* Finding the segment id isn’t as easy as finding the table id. It isn’t visible from within Google Analytics (or at least I haven’t found it). The easiest way is to use the Query Explorer tool provided by Google. The tool is actually meant to help you build API query URIs, but it comes in handy for finding the segment id. Just authorise the tool to access your GA account and select the proper view, then go to the segment drop-down box and select the segment you want. This shows the segment id, which is in the format gaid::-2. Use this inside the quotes for the segments attribute.
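To illustrate, here is what my query from above would look like with a segment added. This is just a sketch reusing the same parameters, with gaid::-2 standing in for whatever segment id the Query Explorer shows you:

# Same query as before, now restricted to a segment.
# "gaid::-2" is only the example id format from the Query Explorer – swap in your own segment id.
query.list <- Init(start.date = "2017-01-01",
                   end.date = "2017-12-31",
                   dimensions = "ga:date,ga:pagePath,ga:pageTitle,ga:dimension13",
                   metrics = "ga:sessions,ga:users,ga:pageviews,ga:avgTimeOnPage",
                   max.results = 99999,
                   sort = "ga:date",
                   segments = "gaid::-2",
                   table.id = "ga:110501343")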


The basic query returned 770,000 rows of data. The others returned between 250,000 and 490,000 rows. After some cleaning and joining them together (using dplyr’s join functions) I ended up with a dataset of 450,000 rows, each containing the number of readers per article per day, information on category, author and publish date, as well as the number of sessions and pageviews for the given day. All ready for the actual analysis of the data!
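To give an idea of the joining step, here is a minimal sketch using dplyr. I’m assuming a second query result called ga.data.author that holds an author custom dimension; the data frame and column names are illustrative, so check names() on your own query results before joining:

# Keep every row of the basic dataset and fill in the author where it exists,
# leaving NA where the metadata is incomplete.
# ga.data.author and the join columns are hypothetical examples.
library(dplyr)
ga.combined <- left_join(ga.data, ga.data.author,
                         by = c("date", "pagePath"))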


Supermetrics – Easy access to a lot of data!

One nice and very handy tool for extracting data from various sources is Supermetrics, an add-on to Google Sheets. Using it you can access several different data sources, e.g. Google Analytics, Facebook Insights, Google AdWords, Twitter Ads, Instagram and many more. Once installed (and that’s super easy), it opens up as a sidebar in your sheet, like this:

[Image: supermetrics_sidebar]

Then it’s more or less a matter of clicking the right options from the drop-down menus and you have a nice and handy report. Here are some tips for using Google Analytics with Supermetrics:

1) Make sure that the account you are logged into Google Sheets (and thus Supermetrics) with also has access to the data you want to query.

2) Remember to have cell A1 selected before opening Supermetrics, or your data will appear in some random corner of your spreadsheet.

3) Pay attention when selecting the dates. If you plan to make an auto-refreshing report you need to choose the dates using the predefined intervals such as today, yesterday, last week, last month, year to date etc. If you choose a custom interval, let’s say January 1st to January 7th, the report will always show the results for those dates even if you ask it to refresh weekly.

4) Split by… rows and/or columns. This is the main benefit compared to the standard Google Analytics interface: here you can specify several dimensions for your data, whereas in GA you only get two.

5) You don’t have to define any segments or filters. If you do, make sure that the account you’re logged in with also has access to these in Google Analytics (and that they are available for the view you are querying).

6) Under Options, make sure to tick both “Add note to query results showing whether Google has used sampling” and “Try to avoid Google’s data sampling”. You’ll see that Supermetrics is often able to supply you with unsampled data where Google itself would give you sampled data.

Here’s a simple example, querying one of our sites for 2017 sessions, splitting the data by operating system and system version:

[Image: 2017operatingsystems]

Nothing spectacular, but very easy to use and easy to share. Absolutely one of my favourite tools!


Reading multiple csv files into R while skipping the first lines

Today I needed to read multiple CSVs into R and merge them all into one table. I’ve done this before successfully using a nice and short piece of code:

# List all csv files in the current directory (recursively) and read each one into a list
files <- list.files(".", full.names = TRUE, recursive = TRUE)
listcsv <- lapply(files, read.csv)

# Bind all the data frames into one table
data <- do.call("rbind", listcsv)

This code works nicely when your CSVs have an identical structure and don’t include any extra rows at the top of the data.

Today, however, I’m stuck with CSVs that have an extra seven lines at the top of each file that I’d like to strip. Normally, skipping lines while reading a CSV is easy: just specify the skip argument, like so:

file <- read.csv("filename", skip=7)

This gives me the file just nicely, as a data frame. However, the code above for reading the files one by one into a list and then binding them together into one data frame doesn’t work as such, since I need to get rid of the seven extra lines at the beginning of each file.

I could, of course, strip the seven lines from each file manually – I currently only have 13 files to read in. But I’m sure the day will come when I have many more files, so why not do this properly right from the start?

After several trial-and-error approaches I reverted to Google. And found a nice Stack Overflow thread on the subject.

I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:

data <- do.call(rbind,
                lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))

And it works! In this form, passing the extra arguments (as.is, skip and header*) to read.csv works fine. And with my amount of data (only 55,000+ rows in the final data frame), it is also fast enough:

   user  system elapsed 
   0.38    0.00    0.37 
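A timing like the one above can be produced by wrapping the call in system.time(), for example:

# Measure how long the read-and-bind step takes (prints user/system/elapsed seconds)
system.time(
  data <- do.call(rbind,
                  lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))
)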


So now I’m all ready to start joining this data with some other data and get on with my data analysis!


* The as.is argument makes sure that strings are read in as characters rather than factors. The skip argument lets you skip the specified number of rows. The header argument lets you specify whether the first row should be used as a header row or not.
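A quick way to see the difference as.is makes is to compare the column types with str(). (Note that since R 4.0 strings are no longer converted to factors by default, so as.is mainly matters on older versions; "filename" is just a placeholder here.)

# Inspect the column types with and without as.is
str(read.csv("filename", skip = 7, as.is = TRUE))   # strings read as character
str(read.csv("filename", skip = 7))                 # strings become factors on R < 4.0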

Headache while trying to filter on a map in Tableau :/

This week’s MakeoverMonday delivered a data set on the accessibility of buildings in Singapore. For each building there is an index for the accessibility level and, of course, information on where the building is situated, along with some information on that area (“subzone”). So I figured, why not plot each area on a map so that by clicking an area you’d get a list of all the buildings in that area and their accessibility indices? Seems straightforward enough.

So I plotted the map, and let Tableau colour the areas according to the average accessibility:

[Image: w50_singapore_averages.PNG]


The darker the colour, the better the accessibility. Now I’d like the user to be able to click an area, for instance Alexandra Hill, and get the information about the buildings in this particular area. Like this:

[Image: w50_alexandrahill_table]

But alas, this table is NOT shown when you click on the map; that action only shows one line per area, for some (to me) still unknown reason:

[Image: w50_alexandrahill_table_short]

The entire list of buildings is shown only when you choose the area from a list on the side of the dashboard, not when you click on the map. You can try it out on Tableau Public yourself.

I’ve tried different ways of filtering and different actions on the filters, but nada. I will, however, fix this! I want to understand why Tableau acts this way; I just need to dig into it some more. So instead of serving you a nice #mmonday blog post, I shared some headache – but hey, this is not that uncommon when working with data after all 😉 Hang in there for the sequel!


Makeover Monday – Prices of curries

This week’s Makeover Monday was about visualising a data set gathered by the Financial Times. The data covers the pricing of curries at Wetherspoon pubs in the UK and Ireland. The original story covers several different aspects of the pricing – my simple makeover is by no means an attempt to do it better. Rather, it is an exercise for myself in using Tableau dashboards.

My makeover is posted on Tableau Public. It shows a map of the pubs, and when you click on a pub, a stacked bar showing the pricing for that pub appears on the right.

[Image: w49_curries]

A simple viz, but a nice exercise in combining maps and charts into an interactive dashboard.


A new acquaintance – Google Data Studio

For the past few months we’ve been building dashboards with Google’s Data Studio, a visualisation tool that can easily be connected to a multitude of data sources. We have uploaded most of our data to BigQuery to be able to query the data easily (and with much better speed!) into a multitude of dashboards.

BigQuery in combination with Data Studio is an easy way to implement the basic dashboards needed in a media house. Here are some examples of dashboards we’ve built over the past months:

  • A live report on the NPS for our site, including open-ended comments, shown on a screen at the news desk
  • A dashboard showing which articles generate the most registrations
  • The number of subscriptions sold per type, date and area
  • A visualisation of the demographics of the registered users (showing demo data):

[Image: Registered_demog]

Data Studio is very easy to use and to set up to work with different data sources. You don’t even need to do any coding to access the data in BigQuery – but then again, the options for how to plot your data are limited. What you gain on the swings you lose on the roundabouts…

The plot types are quite basic: simple time series, bar charts, pie charts, tables etc. One nice feature, though, is the geo map that allows you to visualise your data on a map:

[Image: Subs_geo]

But we non-US users will still have to wait for the zoom level to offer options other than just the whole country for areas outside the US :/

Formatting your visualisation can, however, by no means be compared to e.g. Tableau or even PowerPoint. Limited options for formatting margins etc. mean that using the space on your dashboard effectively is difficult. And you can forget about formatting any of the nitty-gritty details of your chart.

Nevertheless, Data Studio makes it really easy to visualise your data and is a handy tool with a low learning curve. And it’s free. So why not try it out? And I’d love to hear your comments on it, so please pitch in in the comment section!


MakeoverMondays

A week ago on Thursday I attended a meeting of the Tableau User Group in Finland, #fintug. There the inspiring Eva Murray (@TriMyData) from Exasol, a Tableau evangelist, told us about the concept of MakeoverMonday and had us do last week’s challenge live, then and there.

I was paired up with Jaakko Wanhalinna (@JWanhalinna) from Solutive to redo the viz in only 43 minutes. We had a blast, and thanks to Jaakko’s good knowledge of Tableau we came up with this nice remake:


[Image: mmovermonday_w45]

You can find the original at my profile on Tableau Public.

Despite some scheduling constraints I decided to take on this week’s MakeoverMonday challenge as well. It’s about the city transport systems of 100 cities globally. The data provided covered only the names of the cities and an index for each city – the higher the index, the better. More information about the index can be found on the homepage of Arcadis, a design and consultancy firm for natural and built assets.

Here’s my viz on the data:

[Image: mmovermonday_w46.PNG]

And the original is of course on Tableau Public.

MakeoverMonday is a fun way to experiment with Tableau and simultaneously learn about very diverse topics – I can highly recommend it! So there will be more of these, maybe not every Monday, but as often as I can squeeze them into my schedule!

Co-creating the future media business

Opening up the newsroom is a new trend that spins off the idea that co-creation is the key to loyal readers. I believe in it – or at least I believe that co-creation is necessary for some readers to become or remain loyal, while other readers are loyal as is. But since Anette Novak puts it so much better and has first-hand experience of it, you’d better read about it from her, here.

However, reading Anette’s blog post got me thinking about something similar – the co-creation of the business itself. Or call it by its old-fashioned name: networking. But I’d like to find a better term than networking. Networking has a sound of “mingling at an event and sharing fragments of ideas” to it. True networking is more than that. As a phenomenon it’s closer to mentoring than to the sharing of bits and pieces that networking often is. But mentoring implies that there is an apprentice and a master, someone old in the game and someone new. We don’t need that now – there is no grand old man for the new media landscape of tomorrow.

For the print media to overcome the challenges we are facing today we need true networking. Not all companies are fully staffed with the most brilliant minds and the most productive people. We are more or less stuck with the people we have. And too many in the print media industry are happy and content with the business as it has been for the past thirty years or so. Some big media players might be able to re-staff with brilliance and productivity, but most cannot. (And yes, I do know that we also need the people with the old kind of know-how, I do.)

The simple fact is that we need each other. We are faced with new challenges and it is just plain stupid not to search for answers amongst colleagues in other media companies, academic media researchers and even people in other industries. Together we are smarter and more creative.

We need to get together and start networking for real. Networking in a more mentoring way. Networking and trusting each other. There are many things we can share and ponder together without sharing business secrets. Instead of mingling at seminars and events, trying to pry some secrets out of a competitor or scribbling quick notes while some guru talks about how their multinational company launched a new iPad app, we should find people we can co-create our future with. Co-create the media business of tomorrow.

In “reader co-creation” the reader gives input to the journalistic process. In co-creating the entire business I believe in long-lasting networking with people you trust. People who are close enough to your business to understand it, but far enough away – geographically, business-wise or in mindset – not to misuse the relationship. Preferably people who challenge you with their chain of thought and whom you challenge with yours.

The media companies that will succeed will be those that are able to embrace the common know-how of the industry, listen to the quiet messages of the readers and implement this knowledge in their own daily life. (Now substitute the word companies in the previous sentence with individuals – it is just as true.) One additional thing is necessary for the companies to succeed, though: retaining the skilled and the brilliant. There are enough challenges out there for the brilliant not to stay put if their brilliance is not appreciated.

The Road to Nowhere

I was just told that students who multitask during lectures perform up to a whole letter grade worse than their fellow students. Whether this is true or not, I’m pretty sure humans cannot concentrate fully on two things at the same time; our focus is split and our attention jumps back and forth.

In certain situations it is certainly worthwhile devoting your full attention to whatever you are doing. Students who want to perform well should preferably pay attention to the lecturer rather than to their laptops or mobiles. The same is true for our jobs: the result is often better when the person doing the job is paying attention to it, whether it be writing, cooking or taking care of sick people.

An interesting question is how multitasking affects our media consumption. There are studies on this as well. Consumption is certainly becoming more and more fragmented, which puts pressure on media companies to produce content that succeeds in keeping the attention of the audience.

I have to admit that almost every time I sit down on the sofa I bring my iPad along, because most TV shows are boring. So why not check Facebook or read emails at the same time? At least I fool myself into believing I am more efficient this way. Still, I was shocked when a TV strategist told me that the attention span of today’s TV audience is six minutes. Six minutes! Every six minutes something really interesting has to happen on the screen or people zap away (or turn to their iPads). It is just crazy. How can we expect to relax or to learn something if our attention span is that short? At least I know I usually feel more stressed than relaxed after an hour of simultaneous TV and Facebook. It’s a bit like eating a large bag of candy – it feels like a good idea at the beginning, but when it’s done you swear never to do it again. Until next time.

But there is at least one upside to surfing the web while watching TV. When you watch a TV show, you can easily enrich the experience by reading more about the topic at hand online, and this has become so much easier with the iPad. If I watch an old movie on TV I tend to look up the actors and the reception the movie got when it was released, who composed the music, which other films the actors have been involved in, and so on. You learn a lot! Take Vanilla Sky as an example – I had no idea that the name referred to the skies painted by Monet until I read about it on Wikipedia.

I especially love enriching documentaries. The Finnish broadcasting company YLE just showed the four-part documentary Billy Connolly: Journey to the Edge of the World. Fantastic scenery and interesting people! I watched this together with my iPad, looked up the places Connolly visited on Google Maps, and read about the Inuit and about Pond Inlet, a place I didn’t know existed.

Pond Inlet by Michael Saunders

Simultaneous usage of media in a way that enriches the experience gives you so much more than only watching the documentary. At the same time you have to be careful not to overdo it. It is quite easy to get carried away and forget all about the documentary or film you thought you were watching. Maybe we do need some twist in the storytelling every six minutes to stay focused?

Oh yes, I almost forgot, The Road to Nowhere:
Road to Nowhere by janers.sweeter

Poor research is a real burden for media

With my extensive experience of research, my heart always cries when I come across poor research. Be it poorly designed or poorly presented – it’s such a waste of money! Sometimes I also get angry. Angry with the research institutes that sell fancy “truths” to gullible companies. Most of the time, however, there’s not much you can do about it, other than hope the public isn’t stupid enough to believe everything they hear. For instance, when a poll tells you that a certain political party has gained supporters at another party’s expense, when in fact the margins of error make any such conclusion null and void.

But sometimes, when this poor research lands close to my own turf, I feel the need to act.

Last Friday I spent all day tearing a research concept to pieces, comparing the results to the questionnaire and trying to make sense of it all. It’s a study that has been done four times already, and at the second and third rounds I was in the audience when the research institute presented the results. Both times I politely asked the researchers how they calculate certain key figures, but the answers never satisfied me. As the study was commissioned by our newspaper association and not our company, I decided to let it be; it was not my fight.

Then came the fourth round, using exactly the same concept, again with exactly the same dubious figures. So I sat down with the report and the questionnaire once again, pinpointed the problems with the study in a lengthy email and sent it to the people responsible for commissioning the study. I just hope it is well received and at least leads to a thorough discussion.

Poor research should be banned. Even though we have the Esomar professional standards, we are presented with way too much cr*p, even from research institutes that comply with the standards. The research institutes really should go the extra mile in assuring the quality of their concepts and services, because commissioning extensive surveys isn’t easy. (Esomar also has a guideline for commissioning research. Read it. And there are independent researchers out there who can help you with the commissioning. Use them.) There are so many factors to weigh in, ranging from the aim and the sample to the analysis and the conclusions. If you aren’t a research professional yourself you should be able to rely on the research institutes.

My personal favourites in the Esomar code are the following basic principle articles:

1a and 1b) “Market research shall be legal, honest, truthful and objective and be carried out in accordance with appropriate scientific principles.” “Researchers shall not act in any way that could bring discredit on the market research profession or lead to a loss of public confidence in it.” – This is something all researchers should take to heart. Sadly enough, many don’t. Just think about how often you stumble across crazy research and crazy conclusions – research that damages the reputation of market research because people either laugh at it or simply don’t believe it.

4c) “Researchers shall on request allow the client to arrange for checks on the quality of data collection and data preparation.” This article implies that the quality of your work should be impeccable: you should be ready, at any time, to let the customer audit your work. Way too seldom do customers ask for it, though. Working at a research institute myself some years ago, I offered this option to sceptical customers – nobody has ever offered it to me.

Research on media in Finland is seldom good. Too much is lost in the margins of error, and too many conclusions are derived from studying means. The ambition to cover too much has resulted in monstrous surveys that serve nobody well. Thankfully, the print media audience measurements have been criticised publicly by more and more people, and some improvement is under way.

If we make decisions based on mediocre studies and information that cannot stand up to scrutiny, we won’t end up with winning products. As long as we measure a total audience and try to describe that mass of heterogeneous people as one entity, we fool ourselves and we fool the advertisers. We need more detailed information; we need to open our eyes to the multidimensional audience we have. Gone are the days when one product suited all and the audience could be treated as one. Thus we should also realise that the surveys we use to measure our audiences need to be redesigned to fit the needs of today. Although we might lose some trends, and many grand old men and ladies will grunt in discontent, we need the change. The poor research of today is only hampering us, so let’s throw it out and bring in research that really benefits us!