I’m a software developer, data scientist, artist and technology writer.
If you have a challenging technical problem or would just like some advice, then send me an email or message me on Twitter! My virtual door is always open.
Oh – and if you want to learn how a computer works or how to use the Rust programming language, then consider buying my book Rust in Action.
We would like to be able to ask PostgreSQL to calculate the geometric mean (or harmonic mean) for us, rather than having to remember the formulas and write out the query by hand each time.
If you’re unsure about what they are, here are pure Python implementations that might be useful for people who don’t enjoy mathematical notation:
import math
from typing import List
def harmonic_mean(x: List[float]) -> float:
return len(x) / sum(1/val for val in x)
def geometric_mean(x: List[float]) -> float:
tmp = [math.log(val) for val in x]
tmp = sum(tmp) / float(len(x))
return math.exp(tmp)
To start, assume that we have a SQL table that has three floating point numbers:
CREATE TEMPORARY TABLE IF NOT EXISTS numbers (
x float8
);
INSERT INTO numbers VALUES (1.0), (2.0), (3.0);
An SQL query to calculate the geometric mean for numbers.x looks like this:
--geometric mean
SELECT exp(avg(ln(x)))
FROM numbers;
Sometimes it can be irritating to need to remember the formula though. Wouldn’t it be nice if we could just ask PostgreSQL to do this?
SELECT geomean(x)
FROM numbers;
To create a function that accepts a column and returns a scalar, we need a User-defined aggregate.
A user-defined aggregate requires two functions to be defined and some additional information about types and the initial conditions. The first function, sfunc is a reduce/fold. It contains an accumulator value of type stype, that’s initially called with initcond. The second function, finalfunc, is called once every column of has passed through sfunc
CREATE AGGREGATE geomean(float8) (
sfunc = geomean_accum, -- function 1: a "reduce" function, called "s[tate] func"
stype = float8[], -- the "state type" that's passed between calls to function 1
finalfunc = geomean_finalize, -- function 2: called once the whole column has been processed
initcond = '{0.0, 0.0}' -- first argument to function 1 in its first invocation
);
Both of the functions within this snippet use an internal SQL SELECT statement. There are lots of numbers in there, which can be somewhat difficult to parse.
Here is an overview of each function of the geomean aggregate:
geomean_accum contains the statement SELECT array[$1[1]+ln($2), $1[2]+1.0];. This is creating an array of length two from the first argument to the function ($1) that is also an array of length two. It’s adding the natural logarithm of the second argument (ln($2)) to the value that’s been accumulated ($1[1]+...). The second value of the array is a count. That’s used later by geomean_finalize.geomean_finalize carries out the exp and avg steps of the query we saw first: exp(avg(ln(x))). It calculates the arithmetic mean via with row count ($2) and the accumulated ln(x) calls ($1): SELECT exp($1[1] / $1[2]);.CREATE OR REPLACE FUNCTION geomean_accum(float8[], float8)
RETURNS float8[]
LANGUAGE sql
AS $g$
SELECT array[$1[1]+ln($2), $1[2]+1.0];
$g$;
CREATE OR REPLACE FUNCTION geomean_finalize(float8[])
RETURNS float8
LANGUAGE sql
AS $g$
SELECT exp($1[1] / $1[2]);
$g$;
CREATE AGGREGATE geomean(float8) (
sfunc = geomean_accum,
stype = float8[],
finalfunc = geomean_finalize,
initcond = '{0.0, 0.0}'
);
The harmonic mean can also be implemented as a user-defined aggregate query. The structure is the same as before, but the operations are different:
CREATE OR REPLACE FUNCTION harmean_accum(float8[], float8)
RETURNS float8[]
LANGUAGE sql
AS $h$
SELECT array[$1[1]+1.0/$2, $1[2]+1.0];
$h$;
CREATE OR REPLACE FUNCTION harmean_finalize(float8[])
RETURNS float8
LANGUAGE sql
AS $h$
SELECT $1[2] / $1[1];
$h$;
CREATE AGGREGATE harmean(float8) (
sfunc = harmean_accum,
stype = float8[],
finalfunc = harmean_finalize,
INITCOND = '{0.0, 0.0}'
);
Here is a chance to learn database programming for cheap using quality materials from a quality instructor. The cost is that you need to provide me with full feedback on how I did so that I can improve for future learners.
Our time will be spent focusing on techniques to pull data out from the database. By the end of the lesson, the whole room will be able to create sophisticated reports from complex datasets.
Our toolbox will be:
If we have time, we’ll also look at:
Get In touch via email (paperless@timmcnamara.co.nz) or Twitter (@timClicks).