Friday 17 February 2012

Compression in QV

I was talking to someone this morning who mentioned how amazed they were at the compression QlikView managed to achieve. This seems to be a comman misconception about QlikView, so after explaining this to him, I thought I'd repeat my explanation here also.

FACT: QlikView does not use compression for QVDs!

So why then is a QVD so much smaller than a CSV containing the same table of data or why is a QVW so much smaller than several QVD's concatenated together into a single table? The answer lies in understanding how QlikView stores data and the key to this is understanding some basic facts about QlikView's Associative In-Memory Technology. This is what QlikTech have to say about it:
"QlikView uses an associative in-memory technology to allow users to analyze and process data very quickly. Unique entries are only stored once in-memory: everything else are pointers to the parent data. That’s why QlikView is faster and stores more data in memory than traditional cubes."
So from this you can easily deduce the reason QlikView appears to compress data so much is that it stores only unique values thus removing repetition in the data. Think about it for a second, if I have a million rows of data but a field contains only 10 unique values, then QlikView stores those 10 values only and not a million values.This applies equally to the RAM QlikView will need to hold the data as well as the hard disk space needed to store the QVW and QVD files. 

You can even use this fact to your advantage when dealing with large volumes of data in QlikView. Breaking down a field into several fields can cause more repetition and thus reduce the amount of resources QV needs to hold the data both in RAM and on disk. Email addresses are prime candidates for this technique as the domain part will likely not be unique to one entry. Take this example:

EmailAddress
bob.jones@mycompanydomain.com
jane.smith@mycompanydomain.com
paul.hill@mycompanydomain.com

QlikView will consider each of these values as unique and thus will be storing the "@mycompanydomain.com" part repeatedly. If you were to split the email addresses at the "@"  then the domain part becomes the same for all 3 records and thus QV will store it only once. You can split the email address using the following lines within a load statement:

....
left(Email,index(Email,'@')-1) AS EmailName,
right(Email, len(Email)-index(Email, '@')) AS EmailDomain
....


When wanting to display the email address in a chart you only have to concatenate the 2 fields together remembering to replace the '@' in the middle like this:

EmailName & '@' & EmailDomain


The same technique can be used for mail addresses, phone numbers, part codes and any other patterned, repetitive data.

6 comments:

  1. Rationalisation might be a better term than Compression in this case - Sam F

    ReplyDelete
    Replies
    1. I think "Deduplication" is correct term. It just doesn't sound right to me for some reason so I avoid using it.

      Delete
  2. QVW files are stored using compression - based on the document settings. Once it is opened, it loads the rationalized, uncompressed data set. This can be tested by exporting a test data table to qvd from the qvw.

    Also, SubField, in this case, is simpler than the Left/Right functions.
    SubField(Email,'@',1) as EmailName,
    SubField(Email,'@',2) as EmailDomain

    ReplyDelete
  3. I'm sorry to tell you qvd files are using a form of compression.

    A single value for a dataset won't have an index stored since there is obviously only one instance. that's compressing the indexing.

    The index entry size is also compressed to the minimum bits needed. So 4 distinct values will be indexed with a 2-bit size entry.
    Moreover, there is no trash bits left in the qvd file, except maybe for the xml description.

    ReplyDelete
    Replies
    1. Hi

      Thanks for your comment. I would agree that what you describe is both what QlikView does and is highly efficient but it isn't what I would consider any form of compression. This is because compression would indicate the use of an algorithm to remove statistical redundancy in the data which isn't what QlikView does.

      As with everything on the blog, I also like to try and keep things at a level where they are understandable by non-techies and well as those code monkeys amongst us.

      Regards
      Matt

      Delete
  4. Sorry to be techi there, but here is an understandable example:

    given the data (3 columns):

    AAAA;1;L
    BBBB;2;L
    BBBB;3;L
    CCCC;4;L
    CCCC;5;L
    CCCC;6;L

    there are 3 distinct values for col 1, only AAAA,BBBB and CCCC will be store.

    there are 6 disctict values in col 2, statistically, this is the less frequent ones so they will take the greatest part of the binary.

    the last column is always L, so the most frequent one, it will take no place at all, except its own value.

    here is what we get in the qvd data part:
    AAAA BBBB CCCC 1 2 3 4 5 6 L

    followed by the indexing map: 1 1 2 2 2 3 3 4 3 5 3 6
    it gives the position in data lists to recover the table:
    1: the 1st val in (AAAA,BBBB,CCCC) list
    1: the 1st val in (1,2,3,4,5,6) list
    -: the unlisted L value
    2: the 2nd val in (AAAA,...) list
    2: the 2nd val in (1,2,...) list
    -: the unlisted L value
    2the 2nd val in (AAAA,...)
    3:the 3rd val in (1,2,...)
    etc.

    This is a kind of Huffman compression to me.


    the complete qvd file will be:
    ...AAAA BBBB CCCC 1 2 3 4 5 6 L 1 1 2 2 2 3 3 4 3 5 3 6


    oops, i said to much ;-)

    ReplyDelete