1. Motivation

    We would like to be able to ask PostgreSQL to calculate the geometric mean (or harmonic mean) for us, rather than having to remember the formulas and write out the query by hand each time.

    If you’re unsure about what they are, here are pure Python implementations that might be useful for people who don’t enjoy mathematical notation:

    import math
    from typing import List 
    
    def harmonic_mean(x: List[float]) -> float:
        return len(x) / sum(1/val for val in x)
    
    def geometric_mean(x: List[float]) -> float:
        tmp = [math.log(val) for val in x] 
        tmp = sum(tmp) / float(len(x))
        return math.exp(tmp)
    

    To start, assume that we have a SQL table that has three floating point numbers:

    CREATE TEMPORARY TABLE IF NOT EXISTS numbers (
        x float8
    );
    
    INSERT INTO numbers VALUES (1.0), (2.0), (3.0);
    

    An SQL query to calculate the geometric mean for numbers.x looks like this:

    --geometric mean
    SELECT exp(avg(ln(x)))
    FROM numbers;
    

    Sometimes it can be irritating to need to remember the formula though. Wouldn’t it be nice if we could just ask PostgreSQL to do this?

    SELECT geomean(x)
    FROM numbers;
    

    To create a function that accepts a column and returns a scalar, we need a User-defined aggregate.

    How to define a user-defined aggregate

    A user-defined aggregate requires two functions to be defined and some additional information about types and the initial conditions. The first function, sfunc is a reduce/fold. It contains an accumulator value of type stype, that’s initially called with initcond. The second function, finalfunc, is called once every column of has passed through sfunc

    CREATE AGGREGATE geomean(float8) (
        sfunc = geomean_accum,              -- function 1: a "reduce" function, called "s[tate] func"
        stype = float8[],                   -- the "state type" that's passed between calls to function 1 
        finalfunc = geomean_finalize,       -- function 2: called once the whole column has been processed
        initcond = '{0.0, 0.0}'             -- first argument to function 1 in its first invocation
    );
    

    Both of the functions within this snippet use an internal SQL SELECT statement. There are lots of numbers in there, which can be somewhat difficult to parse.

    Geometric mean

    Here is an overview of each function of the geomean aggregate:

    • geomean_accum contains the statement SELECT array[$1[1]+ln($2), $1[2]+1.0];. This is creating an array of length two from the first argument to the function ($1) that is also an array of length two. It’s adding the natural logarithm of the second argument (ln($2)) to the value that’s been accumulated ($1[1]+...). The second value of the array is a count. That’s used later by geomean_finalize.
    • geomean_finalize carries out the exp and avg steps of the query we saw first: exp(avg(ln(x))). It calculates the arithmetic mean via with row count ($2) and the accumulated ln(x) calls ($1): SELECT exp($1[1] / $1[2]);.
    CREATE OR REPLACE FUNCTION geomean_accum(float8[], float8)
    RETURNS float8[]
    LANGUAGE sql
    AS $g$
    SELECT array[$1[1]+ln($2), $1[2]+1.0];
    $g$;
    
    CREATE OR REPLACE FUNCTION geomean_finalize(float8[])
    RETURNS float8
    LANGUAGE sql
    AS $g$
    SELECT exp($1[1] /  $1[2]);
    $g$;
    
    CREATE AGGREGATE geomean(float8) (
        sfunc = geomean_accum,
        stype = float8[],
        finalfunc = geomean_finalize,
        initcond = '{0.0, 0.0}'
    );
    

    Harmonic mean

    The harmonic mean can also be implemented as a user-defined aggregate query. The structure is the same as before, but the operations are different:

    CREATE OR REPLACE FUNCTION harmean_accum(float8[], float8)
    RETURNS float8[]
    LANGUAGE sql
    AS $h$
    SELECT array[$1[1]+1.0/$2, $1[2]+1.0]; 
    $h$;
    
    CREATE OR REPLACE FUNCTION harmean_finalize(float8[])
    RETURNS float8
    LANGUAGE sql
    AS $h$
    SELECT $1[2] /  $1[1];
    $h$;
    
    CREATE AGGREGATE harmean(float8) (
        sfunc = harmean_accum,
        stype = float8[],
        finalfunc = harmean_finalize,
        INITCOND = '{0.0, 0.0}'
    );
    

    Further references

  2. Here is a chance to learn database programming for cheap using quality materials from a quality instructor. The cost is that you need to provide me with full feedback on how I did so that I can improve for future learners.


    About

    Our time will be spent focusing on techniques to pull data out from the database. By the end of the lesson, the whole room will be able to create sophisticated reports from complex datasets.

    Our toolbox will be:

    • SQL syntax
    • filtering
    • ordering
    • aggregation
    • using database functions

    If we have time, we’ll also look at:

    • joining values from multiple tables together
    • working with time series data


    Details

    Price
    koha
    Location
    Central Auckland
    Time/Date
    9:30am-13:30pm, 12th July
    Background
    I am helping out with a Software Carpentry boot camp at the University of Queensland at the end of the month, but would like practice and feedback.

    Software Carpentry is a global initiative run by the Mozilla Science Lab to help scientists build better software. They have create battle-tested learning materials released under an open licence. I will be using that learning material to teach SQL, the programming language for talking to databases.


    RSVP

    Get In touch via email (paperless@timmcnamara.co.nz) or Twitter (@timClicks).

  3. It is customary to hear that nowadays storage is so cheap that one no longer needs to be obsessed with space. True enough, except that such an argument overlooks the fact that there is more and more data to store in today’s world. It also overlooks the fact that data is often mirrored, possibly backed up to other disks on a disaster recovery site where it is mirrored again and that many development databases are mere copies of production databases. As a result, every wasted byte isn’t wasted once, but four or five times in the very best of cases. … Besides the mere cost of storage–sometimes more importantly–there is also the issue of recovery. … All other things being equal, a database that is twice as big as necessary will take twice the time to restore than would otherwise be needed.
    Stéphane Faroult, The Art of SQL, pp8-9

Tim McNamara