How to examine real website visitor data

by | Dec 12, 2016

Using the statistical computing language R to exclude referral spam from Google Analytics data and visualize traffic from human website visitors.

Recently I stumbled upon the googleAnalyticsR package written by Mark Edmondson, and it motivated me to explore how it can be used to analyze website traffic effectively outside the Google Analytics web interface. I have written this post to spread the word and evangelize this power tool. Hopefully, more people will start using this great method of analyzing websites.

In this post we are going to use the Google Analytics API for R to answer two questions: which sources are generating fewer or more users, and on which sources do users interact with our website for a shorter or longer time? We compare the second period of this year to the first period of this year. First, we answer the questions of why someone should use this API and how you can start using it with R. Subsequently, we exclude sources that represent automated traffic. Then, we request the desired dimensions and metrics for the sources that drive the most human traffic, and finally we visualize these statistics in R with the ggplot2 package.

Why use the Google Analytics API?

It allows you to filter in a flexible manner. For example, the data from the unfiltered view of your Google Analytics property can be somewhat deceiving because it contains automated web visits and referrer spam. This type of traffic does not represent real-world visitors of your website and can, especially for low-traffic sites, skew your statistics. The API can be called from languages such as Java, Python and R. The function wrappers in these languages make it a lot easier for a data analyst with a background in any of them to analyze website behaviour. Thus, these languages offer you more flexibility when using the Google Analytics API. The latest version (V4) of the API offers a wide range of metrics and dimensions you can query, such as browser, source, medium, users, sessions, average session duration and bounce rate.

How should you start using the Google Analytics API for R?

My preferred scripting language for the Google Analytics API is R because it is easy to set up and use. Furthermore, it integrates nicely with other data science libraries you might want to use in subsequent analysis. Surprisingly, the googleAnalyticsR package is still fairly unknown: it has around 550 monthly downloads and does not come up at the top of the page for a Google query.

[Screenshot: Google search results]

To start using the API, you need a Google Developers Cloud account in which you can enable the Google Analytics Reporting API and create a Google OAuth service account JSON file containing the login credentials.

Then you can install the necessary CRAN packages in R as follows.

install.packages("googleAuthR")
install.packages("googleAnalyticsR")

Then you can load the packages with the library command:

library("googleAuthR")
library("googleAnalyticsR")

To find the ID of the view you want to analyze, run the following:

#open the authorization screen.
ga_auth()
#Puts the GA views you have access to in a table.
account_list <- google_analytics_account_list()
#We choose to use the first view, corresponding to the first view id.
ga_id <- account_list$viewId[1]

And you are all set to start analysing.

How can you detect non-human traffic on your website?

As Optimizesmart remarks, sources that visit your site several times and spend on average 0 seconds on your site are highly suspect. We therefore flag non-human traffic by filtering on Google Analytics source/medium combinations that have more than 10 sessions and an average session duration of less than 1 second within our chosen time span. We put the session threshold at 10, but you can vary this number to find out whether there is a long tail of referral spam traffic or whether these settings are stringent enough to filter out the bulk of non-human traffic.

#Filters on the metrics sessions and average session duration
mf <- met_filter("sessions", "GREATER_THAN", 10)
mf2 <- met_filter("avgSessionDuration", "LESS_THAN",1)

#Construct the metric filter object with an AND clause
fc <- filter_clause_ga4(list(mf, mf2), 
  operator = "AND")

#Request the source/medium combinations with the desired metrics
#for the selected view, restricted by the metric filter above.
#The combined sourceMedium dimension is requested because the
#exclusion filter further down refers to a sourceMedium column.
referral_spam_duration <- google_analytics_4(
  ga_id, 
  date_range = c("2016-01-01","2016-11-01"),
  dimensions=c('sourceMedium'), 
  metrics = c("users","sessions","bounceRate",
    "avgSessionDuration"), 
  met_filters = fc)

#Inspect the returned dimensions and metrics.
View(referral_spam_duration)

We find 14 referral sources that satisfy these criteria. All but one of these are clearly spam referrals. Remarkably, google.com also satisfies all the given criteria. Lowering the threshold to 5 or 2 sessions does not add enough sessions to make me want to exclude the additional, possibly non-human, traffic sources as well.
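To see what the two metric filters select, the same rule can be applied to a small data frame in plain R. This is only a sketch: the data frame traffic, its source names and its numbers are invented for illustration and are not results from the API.

```r
# Illustrative data only; the source names and numbers are made up.
traffic <- data.frame(
  sourceMedium = c("google / organic", "spam-site.example / referral",
                   "(direct) / (none)", "bot-traffic.example / referral"),
  sessions = c(540, 37, 210, 8),
  avgSessionDuration = c(95.2, 0, 61.7, 0),
  stringsAsFactors = FALSE
)

# Same rule as the metric filters: more than 10 sessions AND
# an average session duration below 1 second.
is_suspect <- traffic$sessions > 10 & traffic$avgSessionDuration < 1
suspect_sources <- traffic$sourceMedium[is_suspect]
print(suspect_sources)
```

Note that the second made-up spam source has only 8 sessions and therefore slips through; this is the long tail that lowering the session threshold would catch.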

Exclude referral spam and have a look at your website visitor sources

Now that we have learned how to detect referral spam, we are going to filter out this kind of traffic and do a first analysis of our site usage.

#Create a filter that excludes all source medium combinations that are
#marked as referral spam based on the number of sessions and the
#average session duration
dim_filter_source_medium_referrals<-
  dim_filter('sourceMedium',
    'IN_LIST',
    referral_spam_duration$sourceMedium,
    not=TRUE)

#We only put this filter in the filter clause.
referral_filter<-filter_clause_ga4(
  list(dim_filter_source_medium_referrals))

#Request the desired user metrics for the date range and order the
#result by the number of sessions
human_traffic <- google_analytics_4(
  ga_id, 
  date_range = c("2016-01-01","2016-11-01"),
  dimensions=c('source','medium'), 
  metrics = c("users","sessions","bounceRate",
    "avgSessionDuration"), 
  dim_filters = referral_filter,
  order=order_type("sessions", sort_order="DESCENDING"))

This allows us to see where the bulk of our traffic originates from. We notice that our cleaned statistics have changed. For example, the bounce rate is half of what the unfiltered Google Analytics Audience Overview reports.
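The effect of the exclusion on an aggregate metric can be illustrated locally. The sketch below uses made-up numbers and a hypothetical helper overall_bounce that computes a session-weighted bounce rate; the %in% test is a plain-R stand-in for the NOT IN_LIST dimension filter, not a call into the API.

```r
# Illustrative data only; the source names and numbers are made up.
traffic <- data.frame(
  sourceMedium = c("google / organic", "spam-site.example / referral",
                   "(direct) / (none)"),
  sessions = c(100, 100, 50),
  bounceRate = c(40, 100, 30),
  stringsAsFactors = FALSE
)
spam_sources <- c("spam-site.example / referral")

# Local equivalent of the NOT IN_LIST dimension filter.
clean <- traffic[!(traffic$sourceMedium %in% spam_sources), ]

# Hypothetical helper: bounce rate weighted by the number of sessions.
overall_bounce <- function(df) {
  sum(df$bounceRate * df$sessions) / sum(df$sessions)
}

bounce_all <- overall_bounce(traffic)
bounce_clean <- overall_bounce(clean)
print(c(unfiltered = bounce_all, filtered = bounce_clean))
```

With these invented numbers a single spam source with a 100% bounce rate is enough to pull the overall bounce rate up considerably.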

Calculate the number of users and the average session duration at the beginning and end of the current year

Let us focus on the trends for the sources that drive the most traffic. We can achieve this by adding an extra filter on the number of sessions and dividing the year into two time intervals of approximately the same length.

#We impose a restriction on the number of sessions a certain source
#medium combination must generate. We require this to be greater than 100.
sessions_100_filter<-met_filter(
  'sessions','GREATER_THAN', 100)

#We set up the metrics clause.
n_sessions_filter<-filter_clause_ga4(
  list(sessions_100_filter))

We recalculate the metrics only for the source/medium combinations that satisfy both the referral exclusion and the minimum-traffic condition, for the first and second parts of the current year (roughly five months each), and sort the data by the number of sessions in descending order.

#Two date ranges in one request: start and end of period 1,
#followed by start and end of period 2
human_traffic <- google_analytics_4(
  ga_id, 
  date_range = c("2016-01-01","2016-05-15",
    "2016-05-16","2016-10-31"),
  dimensions=c('sourceMedium'), 
  metrics = c("users","sessions","bounceRate",
    "avgSessionDuration"), 
  dim_filters = referral_filter,
  met_filters = n_sessions_filter,
  order=order_type("sessions", sort_order="DESCENDING"))

#We put the obtained data in a dataframe
human_traffic<-as.data.frame(human_traffic)
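The date ranges in the request above were typed in by hand and are not exactly equal in length (136 and 169 days). If you prefer two halves of equal length, the split can be computed instead. The helper split_date_range below is a hypothetical name introduced for this sketch; it only uses base R date arithmetic and returns the four dates in the order the date_range argument is given above.

```r
# Hypothetical helper: split an inclusive date range into two halves.
# When the number of days is odd, the second half gets the extra day.
split_date_range <- function(start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  n_days <- as.integer(end - start) + 1      # inclusive day count
  first_end <- start + (n_days %/% 2) - 1    # last day of the first half
  c(format(start), format(first_end),
    format(first_end + 1), format(end))
}

ranges <- split_date_range("2016-01-01", "2016-10-31")
print(ranges)
```

The resulting character vector could be passed as date_range in place of the hand-written dates.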

This gives us a data frame containing the metrics we requested. However, we would like to have each metric for the two time periods in a single column, so that it represents one field that we can plot.

Prepare the data for visualization

We would like to visualize the number of users and the average session length in both periods, so that we can start to understand whether more people stay engaged with the content we offer on our website. Our data frame human_traffic is in a wide format, where each metric is represented in two different columns, one per period, whereas we want these to be in one column. Therefore we have to reshape the data frame such that each column represents one metric for both periods. We make use of the melt function of the reshape2 package to reshape the data frame from wide to long and subsequently filter the data we want within one column.

library(reshape2)
#Reshape from wide to long, keeping sourceMedium as the identifier
reshaped_human_traffic<-melt(
  human_traffic, id.vars=c("sourceMedium"))

Now we have a data frame of three columns, denoting the source medium combination, the metric name for a given period and the metric value.

First we filter the reshaped data frame reshaped_human_traffic such that it contains only one metric for both time periods. We put this in two separate data frames, one for the number of users and another for the time a user remains engaged in a session:

users_human_traffic<-reshaped_human_traffic[
  grepl('user', reshaped_human_traffic$variable),]
duration_human_traffic<-reshaped_human_traffic [
  grepl('Duration',reshaped_human_traffic$variable),]

Then, we rename the variable fields in both data frames so that we understand what the fields represent and so that the data frames can be merged on common field names. We first create a function wrapper to rename the fields, which keeps the code as readable as possible. Then, we call this function twice on each data frame.

rename_variable<-function(df,old_var,new_var){
  #df: dataframe
  #old_var: the variable to be changed
  #new_var: the new name of the variable
  names(df)[names(df) == old_var] <- new_var
  return(df)
}

users_human_traffic<-rename_variable(
  users_human_traffic, "variable", "user")
users_human_traffic<-rename_variable(
  users_human_traffic, "value", "nUsers")

duration_human_traffic<-rename_variable(
  duration_human_traffic, "variable", "Duration")
duration_human_traffic<-rename_variable(
  duration_human_traffic,"value","avgSessionTime")

Subsequently, we create a new field that indicates for which time period the given metric name and metric value is constructed.

replace_content_other_var<-function(
      df, var_cond, cond_value, var_sel, field_new){
  #df: data frame
  #var_cond: variable to condition on
  #cond_value: a substring of the values of the variable to condition on
  #var_sel: selected variable to change
  #field_new: the new value of the field
  df[grepl(cond_value,df[,c(var_cond)]),c(var_sel)]<-field_new
  return(df)
}

#Create a variable that denotes the period
users_human_traffic$Period<-""
users_human_traffic<-replace_content_other_var(
  users_human_traffic,"user","d1","Period","Period 1")
users_human_traffic<-replace_content_other_var(
  users_human_traffic,"user","d2","Period","Period 2")

#Create the same period variable for the session durations,
#matching on "d1" and "d2" as for the users above
duration_human_traffic$Period<-""
duration_human_traffic<-replace_content_other_var(
  duration_human_traffic,"Duration","d1","Period","Period 1")
duration_human_traffic<-replace_content_other_var(
  duration_human_traffic,"Duration","d2","Period","Period 2")
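As a more compact alternative, the period label can be derived directly from the metric name with a regular expression. The sketch below assumes that the melted variable names end in ".d1" or ".d2" (for example users.d1); the vector variable is a made-up stand-in for the variable column of the reshaped data frame.

```r
# Made-up stand-in for the 'variable' column after melting;
# the ".d1" / ".d2" suffix is an assumption about the column names.
variable <- c("users.d1", "users.d2",
              "avgSessionDuration.d1", "avgSessionDuration.d2")

# Turn the trailing period marker into a readable label.
period <- sub("^.*\\.d([12])$", "Period \\1", variable)

# Strip the marker to recover the bare metric name.
metric <- sub("\\.d[12]$", "", variable)

print(data.frame(variable, metric, period))
```

This avoids initialising an empty Period column and filling it in two passes, at the cost of depending on the exact suffix pattern.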

We complete the data wrangling process with a merge of the two data frames on the common names sourceMedium and Period, such that the two separate data frames become one data frame that can be plotted.

combined_df<-merge(users_human_traffic,duration_human_traffic)

It is also possible to select a subset of variables that are the metrics for the first or the second period, rename the field names such that these are the same for both selections and bind the rows.
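This alternative route can be sketched in base R as follows. The data frame wide_example and its values are invented, the ".d1" and ".d2" column suffixes are an assumption about how the two periods are named, and period_subset is a hypothetical helper written for this illustration.

```r
# Illustrative wide data; names with .d1/.d2 suffixes are assumed.
wide_example <- data.frame(
  sourceMedium = c("google / organic", "(direct) / (none)"),
  users.d1 = c(400, 150),
  avgSessionDuration.d1 = c(90, 60),
  users.d2 = c(380, 200),
  avgSessionDuration.d2 = c(85, 75),
  stringsAsFactors = FALSE
)

# Hypothetical helper: select the columns of one period,
# give them period-independent names and label the rows.
period_subset <- function(df, suffix, label) {
  out <- df[, c("sourceMedium",
                paste0("users", suffix),
                paste0("avgSessionDuration", suffix))]
  names(out) <- c("sourceMedium", "nUsers", "avgSessionTime")
  out$Period <- label
  out
}

# Stack the two periods on top of each other.
combined_alt <- rbind(
  period_subset(wide_example, ".d1", "Period 1"),
  period_subset(wide_example, ".d2", "Period 2")
)
print(combined_alt)
```

The result has the same four fields used for plotting (sourceMedium, nUsers, avgSessionTime and Period) without the intermediate melt and merge steps.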

Visualize the data

Plotting the number of users and the average session length of these users in both periods with the following ggplot2 commands gives us the chart displayed below.

library(ggplot2)
ggplot(data=combined_df, 
    aes(x=sourceMedium, y=nUsers, fill=Period))+
  #Bars: number of users per source/medium, one bar per period
  geom_bar(stat='identity', position='dodge')+
  scale_fill_manual(values=c("yellow","black"),
    labels=c("Period 1","Period 2"),
    guide=guide_legend(title="Amount of Users",
    keywidth = 1, keyheight = 1))+
  #Labels: average session time in seconds, drawn at a height of
  #5 times the session time and in the contrasting colour
  geom_text(data=duration_human_traffic,
    aes(x=sourceMedium,y=avgSessionTime*5,
      label = paste0(as.integer(avgSessionTime), 
      " sec."), colour = Period),
    position = position_dodge(0.9))+
  scale_colour_manual(values=c("black","yellow"), 
    guide=guide_legend(title='Average session time'))+
  labs(x="Source/Medium", y="Amount of users",
       title="Comparison of site usage between period 1 and period 2")+
  theme(plot.title = element_text(size = 16, face = "bold"), 
      legend.title=element_text(size=16), 
      legend.text=element_text(size=14),
      axis.text = element_text(size = 14),
      axis.title = element_text(size=16))

We see that users who arrived on our website directly or through Google paid search remained on our website longer on average in the second time period than in the first, whereas the people who found our site through an organic listing in the search results spent marginally less time on it. In this blog post we will not cover the reasons why that might be. This might be something for a future blog post.
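The comparison read off the chart can also be expressed as numbers per source. The sketch below computes the relative change in users and the absolute change in average session duration between the two periods; the data frame wide_example, its values and its ".d1"/".d2" column names are invented for illustration.

```r
# Illustrative wide data; names with .d1/.d2 suffixes are assumed.
wide_example <- data.frame(
  sourceMedium = c("google / organic", "(direct) / (none)"),
  users.d1 = c(400, 150),
  users.d2 = c(380, 200),
  avgSessionDuration.d1 = c(90, 60),
  avgSessionDuration.d2 = c(85, 75),
  stringsAsFactors = FALSE
)

# Relative change in users from period 1 to period 2, in percent.
wide_example$usersChangePct <-
  100 * (wide_example$users.d2 - wide_example$users.d1) /
  wide_example$users.d1

# Absolute change in average session duration, in seconds.
wide_example$durationChangeSec <-
  wide_example$avgSessionDuration.d2 - wide_example$avgSessionDuration.d1

print(wide_example[, c("sourceMedium", "usersChangePct",
                       "durationChangeSec")])
```

When the two periods differ in length, user counts are not directly comparable, so it may be worth dividing them by the number of days in each period first.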

All in all, the key takeaway is that Google offers an API that allows every data analyst, internet marketeer or entrepreneur to determine whether their internet product is heading in the right direction. Call the API in your preferred scripting language, calculate your KPIs and stop being deceived by the skewed metrics of the unfiltered web interface. Hopefully, you found this blog post informative and, above all, enjoyed it.

Do you want to know more about Google Analytics and R? Contact Bert!
