Quantcast
Channel: joy of data » Pentaho
Viewing all articles
Browse latest Browse all 5

Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku

$
0
0

data-insightsWhat I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.

In case you are new to this I recommend reading my articles on the following topics involved here:

OLAP Setup

The RDBMS I’ll use is MySQL and the OLAP cube consists of four tables:

  • facts_web_stats
  • dim_time
  • dim_location
  • dim_source

The schema and creation of dim_time is described in “Setting Up a Time Dimension Table in MySQL“. The schemas for the other three tables are as follows:

CREATE TABLE `facts_web_stats` (
  `date` int(11) NOT NULL,
  `city` int(11) NOT NULL,
  `source` int(11) NOT NULL,
  `page_views` int(11) DEFAULT NULL,
  PRIMARY KEY (`date`,`city`,`source`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `dim_location` (
  `id` int(11) NOT NULL,
  `continent_name` varchar(45) DEFAULT NULL,
  `subcontinent_name` varchar(45) DEFAULT NULL,
  `country_name` varchar(45) DEFAULT NULL,
  `city_name` varchar(45) DEFAULT NULL,
  `version` int(11) DEFAULT NULL,
  `date_from` date DEFAULT NULL,
  `date_to` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `dim_source` (
  `id` int(11) NOT NULL,
  `source` varchar(45) DEFAULT NULL,
  `category` varchar(45) DEFAULT NULL,
  `version` int(11) DEFAULT NULL,
  `date_from` date DEFAULT NULL,
  `date_to` date DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

 In case the purpose of columns version, date_from and date_to is obscure to you I recommend to check out “Using the Dimension Lookup/Update Step in Pentaho Kettle“. The quick answer is that we are going to represent the dimensions as SCD of Type 2 and hence need those columns for versioning. The brutal truth is that I treat them as SCD Type 2 instead of Type 1 simply so I can apply the “Dimension Lookup/Update” step.

The column “category” in dim_source I added and populated manually. It’s summarizing similar sites into categories. F.x. all Google search sites end up in “google”, the other non-Google search engines end up in “non_google_se” – and so on.

Loading the Data from GA into the Data Warehouse

Thanks to Kettle taking care of all the intricate details regarding querying the GA API and managing the dimensions we end up with a rather simple transformation:

web-stats-trafoYou can download the transformation file here. It’s pretty straightforward – just notice that I use the output type “String” with format “yyyyMMdd” for the date within the Google Analytics step. The field is implicitely cast to integer when it comes to inserting the facts to facts_web_stats – and the ID for date May 3 2014 is 20140503.

Describing the Cube with Mondrian 3.6 Schema

To be able to access those tables as a cube we need to represent the structure with an XML schema that Mondrian 3.6 understands. Because what happens under the hood – while merrily clicking around in Saiku crafting pretty looking data insight to save our imaginative multi-national enterprise billions of dollars every year – is that Saiku translates what you do into an MDX query. Finally the MDX is then translated by Mondrian into plain old SQL to actually query the data from MySQL. Okay there you go:

<Schema name="analytics">
  <Cube name="Analytics">
    <Table name="facts_web_stats"/>

    <!-- the time dimension -->
    <Dimension name="Time" foreignKey="date" type="TimeDimension">

      <!-- year / month / day -->
      <Hierarchy name="Time by Month" hasAll="true" primaryKey="id">
        <Table name="dim_time"/> 
        <Level name="Year" column="y" type="Numeric" 
            levelType="TimeYears"/>
        <Level name="Month" column="m" nameColumn="m_name" type="Numeric" 
            levelType="TimeMonths"/>
        <Level name="Day" column="d" type="Numeric" levelType="TimeDays"/>
      </Hierarchy>

      <!-- year / week / day -->
      <Hierarchy name="Time by Week" hasAll="true" primaryKey="id">
        <Table name="dim_time"/> 
        <Level name="YearWeek" column="yw" type="Numeric" 
            levelType="TimeYears"/>
        <Level name="Week" column="w" type="Numeric" 
            levelType="TimeWeeks"/>
        <Level name="Weekday" column="wd" nameColumn="wd_name" 
            type="Numeric" levelType="TimeDays"/>
      </Hierarchy>

    </Dimension>

    <!-- the location dimension -->
    <Dimension name="Location" foreignKey="city">
      <Hierarchy name="Location of Visitor" hasAll="true" primaryKey="id">
        <Table name="dim_location"/>
        <Level name="Continent" column="continent_name" type="String"/>
        <Level name="Subcontinent" column="subcontinent_name" 
            type="String"/>
        <Level name="Country" column="country_name" type="String"/>
        <Level name="City" column="city_name" type="String"/>
      </Hierarchy>
    </Dimension>

    <!-- the source dimension -->
    <Dimension name="Source" foreignKey="source">
      <Hierarchy name="Source" hasAll="true" primaryKey="id">
        <Table name="dim_source"/>
        <Level name="Category" column="category" type="String"/>
        <Level name="Source" column="source" type="String"/>
      </Hierarchy>
    </Dimension>

    <!-- the only metric -->
    <Measure name="PageViews" column="page_views" aggregator="sum"/>
  </Cube>
</Schema>

I guess the example is simple enough to make quickly sense of the meaning of the various details. The official documentation you may find here. A couple of words are in order though.

“Schema” is the root element which contains one or more cubes. At the center of every “Cube” we have a fact table. A cube is made up of dimensions (“Dimension”) and facts (“Measure”). The dimensions contain one or more hierarchies. That might seem confusing first but makes sense when you think of what f.x. a time dimension might be practically. Time might be week- or month-based. But a week or a month number on itself does not make munch sense as you need the year for the context. And that is where the “Hierarchy” comes into play. The hierarchy is representing an abstract “contains”-relationship. The year contains the month contains the day or for the location dimension – continent contains subcontinent contains country contains city. And given that sometimes a city changes its country – see Crimea – it was maybe not such a bad idea to take that into account by modelling the location dimension as SCD Type 2.

The link between the fact table and a dimension table is defined by “foreignKey” (for the fact table) and “primaryKey” (for the dimension table).

Data Insights with Saiku

Now we load the schema file with Pentaho BI Server and open a new Saiku Analytics page and we are ready to play around with the cube. The following screen shot shows you the setup and the result of a “treemap” plot for subcontinent by source category in May 2014 for some idea on how the world ends up on my site:

data-insightsOr let’s do an area chart of the page views by week for 2014 and the continents:

data-insights-2Well, you get the idea.

Data Exploration like a Pro

Is Saiku Analytics the real deal? Well, I guess that depends on what you need. But judging from what I can tell from using the Community Edition I am missing a lot of features. First and foremost none of the charts seem to be customizable beyond choosing dimensions and metrics. And I am not talking about coloring – which would be nice too – but for example about influencing how a metric is mapped onto the visualization.

As far as I can tell there is no way to configure the scaling for the multiple bar chart – instead automatically for all bar charts the same scaling is applied.

multiple-bar-chartline-chartThen there are some design choices I personally would consider debatable like rounded lines for the line chart – no option for grid lines and missing data point markers. The heat grid is not revealing much due to sub-optimal usage of space and coloring.

But don’t listen to me – form your own opinion. I also would like to add that the Saiku Analytics plugin might not cater to my ideals yet but seems to be a very well developed and visually appealing product meeting all the requirements for making great progress in near future! I also have to emphasize that I used the community edition of Saiku Analytics and can’t say for sure whether some enterprise edition might make all my dreams come true. Though I couldn’t find anything indicating such an extended edition.

xmla4js and d3.js

Few years ago Roland Bouman started xmla4js with the intention of making XMLA server accessible to JavaScript applications. I would love to see a true grassroots developed d3.js driven data explorer connecting to Mondrian. Apparently some russian guy already gave it a try but didn’t took it further than creating a terrible youtube clip showcasing a prototype. I think this would provide a great option to commercial solutions.


(original article published on www.joyofdata.de)


Viewing all articles
Browse latest Browse all 5

Trending Articles