performance - Comparing values of one list against a gigantic two-dimensional list in Python, fastest way?
I want to check whether a value from one list exists in another list. Both are huge (50k+ items, from a database).
Edit:
I want to mark duplicated records as duplicated=True and keep them in the table for later reference.
Here is how the lists look:

    # n_emails holds [db_id, checksum] pairs from the search results.
    # I want to check whether a checksum exists elsewhere in the same list
    # (or another list) and, if so, retrieve the matching db_id.
    # Example: for
    #   n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
    #               [3, 'deadbeef101'], [5, 'cafebabe010']]
    # I want to retrieve ids 1 and 5, because they share the same checksum.

    def _getdups(old_lst, em_md5, em_id):
        dups = []
        for old_id, old_md5 in old_lst:
            if em_md5 == old_md5 and old_id != em_id:
                dups.append(dict(org_id=old_id, md5hash=old_md5, dupid=em_id))
        return dups

    for m in n_emails:
        dups = _getdups(n_emails, m[1], m[0])
        n_dups = [casesdb.duplicates.insert(**dup) for dup in dups]
        if n_dups:
            print("dupe found")
            casesdb(casesdb.email_data.id == m[0]).update(duplicated=True)
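For reference, a self-contained sketch of that quadratic approach, with the `casesdb` DAL calls stripped out (that database isn't available here), just to show where the time goes:

```python
# Self-contained version of the quadratic scan above. Instead of inserting
# into casesdb.duplicates, we simply collect the duplicate records that
# would have been inserted.

def _getdups(old_lst, em_md5, em_id):
    """Return every other record in old_lst sharing em_md5's checksum."""
    dups = []
    for old_id, old_md5 in old_lst:
        if em_md5 == old_md5 and old_id != em_id:
            dups.append(dict(org_id=old_id, md5hash=old_md5, dupid=em_id))
    return dups

n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
            [3, 'deadbeef101'], [5, 'cafebabe010']]

found = []
for em_id, em_md5 in n_emails:                 # O(n) outer loop
    dups = _getdups(n_emails, em_md5, em_id)   # O(n) inner scan -> O(n^2) total
    if dups:
        found.append((em_id, dups))

print(found)  # ids 1 and 5 are flagged, each pointing at the other
```

With two 50k-item lists this does 50,000 × 50,000 = 2.5 billion comparisons, which is why it never seems to finish.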
But this seems slow, and with larger lists (50k vs. 50k+ records) it had run for over 5000 seconds without finishing. Is it a never-ending loop? The server has 4 GB of RAM and 4 cores. What am I doing wrong?
Please help... thanks a lot!
Solved:
The dict index-mapping way is a lot faster! (At least when the MySQL table is not indexed; please note I have not tested it against an indexed table.)
It's 20 seconds vs. 30 milliseconds: 20 * 1000 / 30 ≈ 666 times faster! lol
The fastest way is to use a dict, like this:
    n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
                [3, 'deadbeef101'], [5, 'cafebabe010']]

    d = {}
    for db_id, checksum in n_emails:
        if checksum not in d:
            d[checksum] = [db_id]
        else:
            d[checksum].append(db_id)

    for checksum, ids in d.items():
        if len(ids) > 1:
            print(checksum, ids)
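The same grouping can be written a bit more compactly with `collections.defaultdict`, which removes the "is the key present yet?" branch:

```python
from collections import defaultdict

n_emails = [[1, 'cafebabe010'], [2, 'bfeafe3df1ds'],
            [3, 'deadbeef101'], [5, 'cafebabe010']]

# One pass to build checksum -> [ids]; dict lookups are O(1) on average,
# so the whole thing is O(n) instead of O(n^2).
by_hash = defaultdict(list)
for db_id, checksum in n_emails:
    by_hash[checksum].append(db_id)

# Keep only the checksums that occurred more than once.
dupes = {h: ids for h, ids in by_hash.items() if len(ids) > 1}
print(dupes)  # → {'cafebabe010': [1, 5]}
```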
This algorithm is called a hash join. In pseudocode (mixing SQL and Python):
    for hash, num in (select hash, count(id) as num from emails
                      group by hash having num > 1):
        first = None
        for index, id in enumerate(select id from emails
                                   where hash = hash order by id desc):
            if index == 0:
                first = id
                continue
            update emails set duplicate = first where id = id
This would be a SQL/Python solution in which I take the duplicate column and use it to store which message this one is thought to be a duplicate of.
The emails table would be, at least:

    create table emails (id, hash, duplicate default null)
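A runnable sketch of that hash-join answer, using `sqlite3` in place of MySQL (the table and column names follow the schema above; the sample rows are the ones from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table emails "
            "(id integer, hash text, duplicate integer default null)")
cur.executemany("insert into emails values (?, ?, null)",
                [(1, 'cafebabe010'), (2, 'bfeafe3df1ds'),
                 (3, 'deadbeef101'), (5, 'cafebabe010')])

# Find every checksum that occurs more than once...
cur.execute("select hash from emails group by hash having count(id) > 1")
for (h,) in cur.fetchall():
    # ...keep the highest id as the "original" and point the rest at it.
    cur.execute("select id from emails where hash = ? order by id desc", (h,))
    first = None
    for index, (row_id,) in enumerate(cur.fetchall()):
        if index == 0:
            first = row_id
            continue
        cur.execute("update emails set duplicate = ? where id = ?",
                    (first, row_id))
conn.commit()

cur.execute("select id, duplicate from emails order by id")
print(cur.fetchall())  # → [(1, 5), (2, None), (3, None), (5, None)]
```

The grouping work is pushed into the database, which (with an index on `hash`) can do it without loading both 50k-row lists into Python at all.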