Each of the following functions returns a list with no duplicates from the list given as argument. Three approaches: the simplest approach, then faster methods which do not respect the original ordering, and finally fast methods which do respect the ordering.
All in all, this problem is easily solved with Python’s set and dict builtins.
The following function checks that the new values in the input list were not already seen:
def naive(liste):
seen = []
for i in liste:
if i not in seen:
seen.append(i)
return seen
The in operator, operating on a list gets slower when the list grows. Here is a shorter implementation of the same algorithm which presents two subtleties with regard to the Python language:
def buggy_naive(liste, nodups = []):
[nodups.append(i) for i in liste if i not in nodups]
return nodups
The list gets sorted first. Then to check whether a value has been seen: it is only necessary to check against one value instead of a list.
def usort(liste):
l = sorted(liste)
nodups = [l[0]]
for e in l:
if e != nodups[-1]:
nodups.append(e)
return nodups
Another approach is to store the elements behind a hash: the duplicates are quickly found, they are behind the same hash. The Python dictionary can do that easily (it is only a matter of storing an empty value):
def dico(l):
return dict((i,None) for i in l).keys()
Actually a iterable without duplicate is also called a set. Another way to present the problem is this: how to transform a list into a set!
>>> set([1, 2, 3, 4, 3, 4, 3, 4])
set([1, 2, 3, 4])
The Python set is actually implemented with a dictionary whose values are set to None.
Returns a list whose first occurences of duplicates were removed. A dictionary is built from the input list, the keys are the list element which ensures unicity, the value is the position. The keys are returned sorted on the position:
def keep_last(liste):
d = dict((v,i) for (i,v) in enumerate(liste))
return sorted( d.keys(), key = d.get )
Same algorithm but the last duplicates are removed:
def keep_first(liste):
d = dict( (v,i) for (i,v) in enumerate(reversed(liste)))
return sorted( d.keys(), key = d.get, reverse=True)
Recent Python2.7 and Python3.1 gained a new data structure in the collections package: the ordered dict which remembers the order of the input keys:
def odict(l):
return OrderedDict((i,None) for i in l).keys()
Et voila, set and OrderedDict are Python builtins: the language has many facilities directly available to solve this problem.
Let’s build different lists with different characteristics and compare each implementation. randlist returns a list according to input characteristics:
def randlist(size, freq, almost_sorted=False):
"""Returns a list of int according to three parameters:
- size of the list: either *short* or *long*,
- whether there are *few* or *tons* of duplicates in the list,
- either the list is completely shuffled of almost sorted.
"""
d = {'short': 10, 'long': 5000, 'few': 3, 'tons': 0.5}
l = [randint(1, int( d[size]*d[freq])) for _ in range(d[size])]
if almost_sorted:
l.sort()
for _ in range(int(d[size] * 0.01)):
n, m = randint(0, d[size]-1), randint(0, d[size]-1)
l[n], l[m] = l[m], l[n]
return l
make_lists yields the random lists for all the permutations of the possible characteristics:
def make_lists():
for size in 'short', 'long':
for freq in 'few', 'tons':
for almost_sorted in True, False:
yield (size, freq, almost_sorted), randlist(size,
freq,
almost_sorted)
# the main
The main section of the module iterates over the different implementations, and time the implementations on each list yielded by make_list.
if __name__ == '__main__':
from timeit import Timer
for fun in [ naive, usort, keep_last, keep_first, set, dico, odict]:
for (s, f, a), l in make_lists():
print("%s\t%s\t%s\t%s\t%s" % (fun, s, f, a ,
Timer(lambda:fun(l)).timeit(number=100)))
Conclusion form this short benchmark: the set and keep_* functions are efficient. The naive algorithm should really be avoided and the ordered dict, though a builtin, seems four times less efficient than the keep_* function based on the dict and enumerate functions.