Quickest way to return snapshots of an Exponential Moving Average in PostgreSQL
I am collecting and graphing data, and one of the things I need to do is calculate an exponential moving average of that data. The data is stored in Postgres.
Based on a Stack Overflow page I read (How to calculate an exponential moving average on Postgres?), I have the following function:
create or replace function ema_func(
    state double precision, inval double precision, alpha double precision)
  returns double precision as
$body$
begin
  return case
           when state is null then inval
           else alpha * inval + (1 - alpha) * state
         end;
end
$body$ language plpgsql volatile;
which I am using in an aggregate I put together:
create aggregate ema(double precision, double precision) (
  sfunc = ema_func,
  stype = float8
);
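For reference, the recurrence that this aggregate implements can be sketched outside the database. This is a hypothetical Python rendering of the same logic (the function names are mine, not part of the original post):

```python
def ema_step(state, inval, alpha):
    """One aggregate transition: mirrors the CASE expression in ema_func."""
    if state is None:
        return inval  # the first row seeds the average
    return alpha * inval + (1 - alpha) * state


def ema(values, alpha):
    """Fold ema_step over a series, like ema(value, alpha) OVER (ORDER BY datetime)."""
    state = None
    out = []
    for v in values:
        state = ema_step(state, v, alpha)
        out.append(state)
    return out
```

Each output row depends only on the previous state and the current value, which is why a single ordered window pass over the table is enough to compute it.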
I am graphing stock information, and a given day has 7000-8000 pieces of data. I don't need all of that information to graph the data (depending on window settings, 1 pixel might be worth somewhere around 60 seconds), so I want to pull snapshots of the data every nth second. I wrote this function to do that for me, and it has saved me some time:
create or replace function emasnapshots(
    ptable varchar, timestart timestamptz, timeend timestamptz,
    duration double precision, psymbol varchar, alpha double precision)
  returns setof timevalue as
$body$
declare
  localstart timestamptz;
  localend timestamptz;
  timevalues timevalue%rowtype;
  groups int := ceil((select extract(epoch from (timeend - timestart))) / duration);
begin
  execute 'create temp table allemas on commit drop as '
       || 'select datetime, ema(value, ' || quote_literal(alpha) || ') over (order by datetime asc) '
       || 'from ' || quote_ident(ptable)
       || ' where symbol = ' || quote_literal(psymbol)
       || ' and datetime >= ' || quote_literal(timestart)
       || ' and datetime <= ' || quote_literal(timeend);
  for i in 1 .. groups loop
    localstart := timestart + (duration * (i - 1) * interval '1 second');
    localend := timestart + (duration * i * interval '1 second');
    execute 'select * from allemas where datetime >= ' || quote_literal(localstart)
         || ' and datetime <= ' || quote_literal(localend)
         || ' order by datetime desc limit 1'
       into timevalues;
    return next timevalues;
  end loop;
  return;
end
$body$ language plpgsql volatile;
Running the EMA with:
select datetime::timestamptz,
       ema(value, 0.0952380952380952380952380952381 /* alpha */) over (order by datetime asc)
from "4"
where symbol = 'aapl'
  and datetime >= '2015-07-01 7:30'
  and datetime <= '2015-07-01 14:00:00';
takes about 1.5 seconds to collect all of the data (7733 rows) and push it across the internet (my data is in another state).
Running the emasnapshots function I wrote:
select start, average
from emasnapshots('4', '2015-07-01 9:30-4', '2015-07-01 16:00-4', 60, 'aapl',
                  0.0952380952380952380952380952381);
takes about 0.5 seconds to gather all of the data and push it across the internet (390 rows). BTW, for clarity: I am pulling from table "4" for July 1st during stock-market hours, and I want snapshots every 60 seconds. The last number is alpha, which means I am calculating 20-second EMAs (alpha = 2/(period + 1)).
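As a quick sanity check on that alpha literal (an illustrative calculation, assuming the conventional EMA smoothing formula alpha = 2/(period + 1)):

```python
# Deriving the alpha constant used in the queries above from the EMA period.
period = 20               # a 20-sample EMA
alpha = 2 / (period + 1)  # -> 0.0952380952..., the long literal passed to ema()
```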
My question is: am I doing this the fastest way possible? Is there a way to tell which part of the function is the slower part: the temp table creation, or the snapshot-grabbing part? Should I be selecting the most recent date in each interval a different way? Should I be selecting the latest time in each interval from the original table (which is indexed on time) and joining it with the newly created table?
I started writing Postgres functions a week ago. I realize the newly created table is not indexed and might take a bit longer for the date-related work I am asking it to do. Is there a way around this? I am dealing with lots of days' worth of data for lots of different symbols, so I am not sure pre-creating EMA tables for all the possibilities is a good idea. I don't want to pull all of the data down and process it locally, because if the graphing software has multiple days open, that could encompass 35,000 rows that would have to be transferred and processed.
BTW, I don't think indexing will speed this up much, because I can run:
select *
from "4"
where symbol = 'aapl'
  and datetime >= '2015-07-01 07:30'
  and datetime <= '2015-07-01 14:00'
order by datetime asc
limit 450;
and get a response in under 150 ms over the internet. There is far less processing involved in that query, though.
Thank you for your time!
Edited based on Patrick's answer.
I now have the query below, modified as Patrick suggested:
select datetime, ema
from (
  select datetime, ema,
         rank() over (partition by bucket order by datetime desc) = 1 as rank
  from (
    select datetime,
           ema(value, 0.0952380952380952380952380952381) over (order by datetime asc) as ema,
           ceil(extract(epoch from (datetime - '2015-07-01 7:30')) / 60) as bucket
    from "4"
    where symbol = 'aapl'
      and datetime between '2015-07-01 7:30' and '2015-07-01 14:00'
  ) x
) y
where rank = true;
because I was getting an error saying I couldn't put the rank expression in the where clause, so I split it out into a separate select statement. Did I do that right? Having 3 select statements feels odd to me, but as a SQL newbie trying to learn as I go, maybe it isn't bad.
My explain output for the above query looks like this:
Subquery Scan on y  (cost=6423.35..6687.34 rows=4062 width=16)
  Filter: y.rank
  ->  WindowAgg  (cost=6423.35..6606.11 rows=8123 width=24)
        ->  Sort  (cost=6423.35..6443.65 rows=8123 width=24)
              Sort Key: x.bucket, x.datetime
              ->  Subquery Scan on x  (cost=5591.23..5895.85 rows=8123 width=24)
                    ->  WindowAgg  (cost=5591.23..5814.62 rows=8123 width=16)
                          ->  Sort  (cost=5591.23..5611.54 rows=8123 width=16)
                                Sort Key: "4".datetime
                                ->  Bitmap Heap Scan on "4"  (cost=359.99..5063.74 rows=8123 width=16)
                                      Recheck Cond: (((symbol)::text = 'aapl'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
                                      ->  Bitmap Index Scan on "4_pkey"  (cost=0.00..357.96 rows=8123 width=0)
                                            Index Cond: (((symbol)::text = 'aapl'::text) AND (datetime >= '2015-07-01 07:30:00-06'::timestamp with time zone) AND (datetime <= '2015-07-01 14:00:00-06'::timestamp with time zone))
First, a few notes on minor efficiency issues in your function:

- You do not have to quote_literal() anything other than strings. It is impossible for Bobby Tables to get injected into your SQL statement through a double precision or timestamp parameter.
- In a dynamic SQL statement you only have to manually splice in table and column names; parameter values can be injected with a USING clause. That saves a lot of parsing time.
- Move as many calculations as possible outside of the loop. For instance:
declare
  ...
  dur_int    interval    := duration * interval '1 second';
  localstart timestamptz := timestart;
  localend   timestamptz := localstart + dur_int;
begin
  ...
  for i in 1 .. groups loop
    ...
    localstart := localstart + dur_int;
    localend   := localend + dur_int;
  end loop;
  ...
But all of this is moot...
In your code you first populate a temp table with 7733 rows of data, and later extract one record at a time with a dynamic query inside a loop that runs 390 times. Very, very wasteful. You can replace the entire function body with a single statement:
return query execute format(
  'select datetime, ema '
  'from ('
    'select datetime, ema, '
           'rank() over (partition by bucket order by datetime desc) as rank '
    'from ('
      'select datetime, ema(value, $1) over (order by datetime asc) as ema, '
             'ceil(extract(epoch from (datetime - $2)) / $3) as bucket '
      'from %I '
      'where symbol = $4 '
      'and datetime between $2 and $5) x '
    'where rank = 1) y '
  'order by 1', ptable)
using alpha, timestart, duration, psymbol, timeend;
The principle here is that in the innermost query you calculate the "bucket" into which each processed row from the table falls. In the next-level query you calculate the rank of all the rows in each bucket, based on datetime. In the main query you then pick the most recent row for each bucket, i.e. the one with rank = 1.
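The bucket-and-pick-latest idea can also be sketched procedurally. This is a hypothetical Python version (names and the epoch-seconds representation are mine): keeping the last row seen per bucket is equivalent to rank = 1 under ORDER BY datetime DESC.

```python
import math


def snapshots(rows, timestart, duration):
    """rows: (epoch_seconds, ema) pairs in ascending time order.
    Returns the latest row of each `duration`-second bucket,
    mirroring ceil(extract(epoch from (datetime - timestart)) / duration)."""
    latest = {}  # bucket number -> row; later rows overwrite earlier ones
    for ts, ema_val in rows:
        bucket = math.ceil((ts - timestart) / duration)
        latest[bucket] = (ts, ema_val)
    return [latest[b] for b in sorted(latest)]
```

One pass over the data suffices, which is why the single set-based query beats the 390-iteration loop over a temp table.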
On speed in general: you should run EXPLAIN on your queries on the server, rather than measuring on the client, which includes network transfer time.