algorithm - How to keep a random subset of a stream of data?


I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.

So, for each new item, I need an algorithm to decide whether I should add it to the stored set or discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling, without being biased towards old items or new items, once I'm past that limit?

Thanks,

For each element e_i in the input stream, generate a random number r_i. Add the pair (r_i, e_i) to a set. When the set exceeds your sample size n, throw out the pair with the smallest r. (A heap is a convenient data structure for doing this.)

At the end of this procedure, you have a sample of the elements that were paired with the n largest random numbers (and hence were chosen uniformly from all samples of that length).

For example, in Python, you could write this:

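    import heapq, random

    def sample(s, n):
        """
        Generate a random sample of n elements from the sequence s.
        """
        # Pair every element with a random key in [0.0, 1.0).
        pairs = ((random.random(), e) for e in s)
        # Keep only the elements whose keys are among the n largest.
        for r, e in heapq.nlargest(n, pairs):
            yield e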

Or rather, you would, if it weren't for the fact that there's a random.sample function in the standard library!
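The sample function above consumes a whole sequence in one call. For the situation in the question, where items arrive one at a time and the stored set has to stay capped, the same idea can be kept incrementally. What follows is only a minimal sketch: the class name StreamSample, its methods offer and items, and the rng parameter are all made up for illustration, and the sequence number stored in each entry is an added tie-breaker so that two equal random keys never force a comparison of the items themselves.

```python
import heapq
import random


class StreamSample:
    """Hypothetical helper: keep a uniform random sample of at most
    `capacity` items from a stream, one item at a time."""

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._rng = rng or random.Random()
        self._heap = []   # min-heap of (random key, sequence number, item)
        self._seen = 0    # count of items offered; also breaks key ties

    def offer(self, item):
        """Consider one new item from the stream."""
        entry = (self._rng.random(), self._seen, item)
        self._seen += 1
        if len(self._heap) < self.capacity:
            # Below the limit: just save everything.
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new key beats the smallest stored key: evict that pair.
            heapq.heapreplace(self._heap, entry)
        # Otherwise the new item has the smallest key and is discarded.

    def items(self):
        """Return the currently stored sample (in no particular order)."""
        return [item for _, _, item in self._heap]
```

Usage would look something like this: create StreamSample(1000), call offer(event) for every incoming event, and call items() whenever the aggregate processing runs. Each call to offer costs at most O(log capacity), and memory stays bounded by the capacity.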

