performance - Comparing values of one list against a gigantic two-dimensional list in Python, fastest way?
I want to check whether a value from one list exists in another list. Both are huge (50k+ items, from a database).
Edit:
I want to mark duplicated records as duplicated=True and keep them in the table for later reference.
Here is how the lists look:

    # n_emails holds [db_id, checksum] pairs from the search results.
    # I want to check whether a checksum exists elsewhere in the same list
    # (or another list) and, if so, retrieve the matching db_id.
    # Example: for
    #   n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
    #               [3, 'deadbeef101'], [5, 'cafebabe010']]
    # I want to retrieve ids 1 and 5, because they share the same checksum.

    def _getdups(old_lst, em_md5, em_id):
        dups = []
        for old_id, old_md5 in old_lst:
            if em_md5 == old_md5 and old_id != em_id:
                dups.append(dict(org_id=old_id, md5hash=old_md5, dupid=em_id))
        return dups

    for m in n_emails:
        dups = _getdups(n_emails, m[1], m[0])
        n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
        if n_dups:
            print("dupe found")
            casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)
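For reference, a self-contained sketch of that quadratic approach, with the `casesdb` DAL calls stripped out (that database isn't available here), just to show where the time goes:

```python
# Self-contained version of the quadratic scan above. Instead of inserting
# into casesdb.duplicates, we simply collect the duplicate records that
# would have been inserted.

def _getdups(old_lst, em_md5, em_id):
    """Return every other record in old_lst sharing em_md5's checksum."""
    dups = []
    for old_id, old_md5 in old_lst:
        if em_md5 == old_md5 and old_id != em_id:
            dups.append(dict(org_id=old_id, md5hash=old_md5, dupid=em_id))
    return dups

n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
            [3, 'deadbeef101'], [5, 'cafebabe010']]

found = []
for em_id, em_md5 in n_emails:                 # O(n) outer loop
    dups = _getdups(n_emails, em_md5, em_id)   # O(n) inner scan -> O(n^2) total
    if dups:
        found.append((em_id, dups))

print(found)  # ids 1 and 5 are flagged, each pointing at the other
```

With two 50k-item lists this does 50,000 × 50,000 = 2.5 billion comparisons, which is why it never seems to finish.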
But this seems slow, and with larger lists (50k vs. 50k+ records) it had run for over 5000 seconds without finishing. Is it a never-ending loop? The server has 4 GB of RAM and 4 cores. What am I doing wrong?
Please help... thanks a lot!
Solved:
The dict index-mapping way is a lot faster! (At least when the MySQL table is not indexed; please note I have not tested it against an indexed table.)
It's 20 seconds vs. 30 milliseconds: 20 * 1000 / 30 ≈ 666 times faster! lol
The fastest way is to use a dict, like this:
    n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
                [3, 'deadbeef101'], [5, 'cafebabe010']]

    d = {}
    for db_id, checksum in n_emails:
        if checksum not in d:
            d[checksum] = [db_id]
        else:
            d[checksum].append(db_id)

    for checksum, ids in d.items():
        if len(ids) > 1:
            print(checksum, ids)
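The same grouping can be written a bit more compactly with `collections.defaultdict`, which removes the "is the key present yet?" branch:

```python
from collections import defaultdict

n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
            [3, 'deadbeef101'], [5, 'cafebabe010']]

# One pass to build checksum -> [ids]; dict lookups are O(1) on average,
# so the whole thing is O(n) instead of O(n^2).
by_hash = defaultdict(list)
for db_id, checksum in n_emails:
    by_hash[checksum].append(db_id)

# Keep only the checksums that occurred more than once.
dupes = {h: ids for h, ids in by_hash.items() if len(ids) > 1}
print(dupes)  # → {'cafebabe010': [1, 5]}
```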
This algorithm is called a hash join. In pseudocode (mixing SQL and Python):
    for hash, num in (select hash, count(id) as num from emails
                      group by hash having num > 1):
        first = None
        for index, id in enumerate(select id from emails
                                   where hash = hash order by id desc):
            if index == 0:
                first = id
                continue
            update emails set duplicate = first where id = id
This would be a SQL/Python solution in which I take the duplicate column and use it to store which message this one is thought to be a duplicate of.
The emails table would be, at least:

    create table emails (id, hash, duplicate default null)
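A runnable sketch of that hash-join answer, using `sqlite3` in place of MySQL (the table and column names follow the schema above; the sample rows are the ones from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table emails "
            "(id integer, hash text, duplicate integer default null)")
cur.executemany("insert into emails values (?, ?, null)",
                [(1, 'cafebabe010'), (2, 'bfeafe3df1ds'),
                 (3, 'deadbeef101'), (5, 'cafebabe010')])

# Find every checksum that occurs more than once...
cur.execute("select hash from emails group by hash having count(id) > 1")
for (h,) in cur.fetchall():
    # ...keep the highest id as the "original" and point the rest at it.
    cur.execute("select id from emails where hash = ? order by id desc", (h,))
    first = None
    for index, (row_id,) in enumerate(cur.fetchall()):
        if index == 0:
            first = row_id
            continue
        cur.execute("update emails set duplicate = ? where id = ?",
                    (first, row_id))
conn.commit()

cur.execute("select id, duplicate from emails order by id")
print(cur.fetchall())  # → [(1, 5), (2, None), (3, None), (5, None)]
```

The grouping work is pushed into the database, which (with an index on `hash`) can do it without loading both 50k-row lists into Python at all.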