Suppressing duplicates in a list

Each of the following functions returns a list with the duplicates removed from the list given as argument. Three approaches are presented: the simplest naive algorithm, then faster methods which do not preserve the original ordering, and finally fast methods which do preserve it.

All in all, this problem is easily solved with Python’s set and dict builtins.

Simplest naive algorithm

The following function checks, for each value of the input list, whether it has already been seen:

def naive(liste):
    seen = []
    for i in liste:
        if i not in seen:
            seen.append(i)
    return seen
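
For example, only the first occurrence of each value survives:

>>> naive([1, 2, 1, 3, 2, 4])
[1, 2, 3, 4]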

The in operator on a list performs a linear scan, so it gets slower as the list grows. Here is a shorter implementation of the same algorithm, which illustrates two subtleties of the Python language:

def buggy_naive(liste, nodups = []):
    [nodups.append(i) for i in liste if i not in nodups]
    return nodups
  1. the argument default values are only evaluated once, when the function is defined. In our case, the default value for nodups is a reference to a list which happens to be empty at function definition time. While the reference will never change, the list itself may be updated and modified. Do not be surprised if the function's data is not the same between calls: the default value is not re-initialized (see the session below).
  2. nodups.append() does not really return a result, as it works by side effect (None is actually returned). The list comprehension does update the nodups list, but it also internally builds a list as long as the input list, filled with None. This unused list is wasted work.
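
A short interactive session illustrates the first pitfall: the default list keeps its state between calls:

>>> buggy_naive([1, 2])
[1, 2]
>>> buggy_naive([3])
[1, 2, 3]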

Faster methods, which do not respect the input list order

The list is sorted first, so that duplicates end up next to each other. To check whether a value has already been seen, it is then only necessary to compare it against the last kept value instead of scanning a whole list.

def usort(liste):
    l = sorted(liste)
    nodups = l[:1]   # slicing instead of l[0] copes with an empty input list
    for e in l[1:]:
        if e != nodups[-1]:
            nodups.append(e)
    return nodups
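
As announced, the original order is lost:

>>> usort([3, 1, 3, 2, 1])
[1, 2, 3]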

Another approach is to store the elements behind a hash: duplicates are found quickly because they end up behind the same key. A Python dictionary does that easily, it is only a matter of storing an empty value:

def dico(l):
    # list() so the function also returns a list under Python 3,
    # where keys() is a view
    return list(dict((i, None) for i in l).keys())
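
The result contains each element exactly once, but in an arbitrary order that depends on the hashes, hence the sorted() in this check:

>>> sorted(dico([3, 1, 3, 2, 1]))
[1, 2, 3]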

Actually, an iterable without duplicates is also called a set. Another way to present the problem is: how to transform a list into a set?

>>> set([1, 2, 3, 4, 3, 4, 3, 4])
set([1, 2, 3, 4])

Under the hood, a Python set is essentially a dictionary which stores only keys and no values.

Respect the original order

keep_last returns a list where only the last occurrence of each duplicated value is kept. A dictionary is built from the input list: the keys are the list elements, which ensures uniqueness, and the value is the position of the element's last occurrence. The keys are returned sorted on that position:

def keep_last(liste):
    d = dict((v, i) for (i, v) in enumerate(liste))
    return sorted(d.keys(), key=d.get)
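
For instance, the second occurrence of 1 is the one kept, and the last occurrences appear in their original relative order:

>>> keep_last([1, 2, 1, 3])
[2, 1, 3]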

Same algorithm, but the last occurrences are removed instead: the list is enumerated in reverse, and the positions are sorted back in descending order:

def keep_first(liste):
    d = dict((v, i) for (i, v) in enumerate(reversed(liste)))
    return sorted(d.keys(), key=d.get, reverse=True)
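
On the same input, the first occurrence of 1 is kept instead:

>>> keep_first([1, 2, 1, 3])
[1, 2, 3]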

Python 2.7 and Python 3.1 gained a new data structure in the collections module: OrderedDict, which remembers the insertion order of its keys:

from collections import OrderedDict

def odict(l):
    return list(OrderedDict((i, None) for i in l).keys())
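
It keeps the first occurrences, in their original order:

>>> odict([1, 2, 1, 3])
[1, 2, 3]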

Et voilà: set is a builtin and OrderedDict ships with the standard library; the language has many facilities directly available to solve this problem.

What do you mean by fast?

Let’s build different lists with different characteristics and compare each implementation. randlist returns a list according to input characteristics:

from random import randint

def randlist(size, freq, almost_sorted=False):
    """Returns a list of ints according to three parameters:

    - size of the list: either *short* or *long*,
    - whether there are *few* or *tons* of duplicates in the list,
    - whether the list is completely shuffled or almost sorted.
    """
    d = {'short': 10, 'long': 5000, 'few': 3, 'tons': 0.5}
    l = [randint(1, int(d[size]*d[freq])) for _ in range(d[size])]
    if almost_sorted:
        # sort, then swap about 1% of the elements at random
        l.sort()
        for _ in range(int(d[size] * 0.01)):
            n, m = randint(0, d[size]-1), randint(0, d[size]-1)
            l[n], l[m] = l[m], l[n]
    return l
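
The values themselves are random, but the size and the range of the list are deterministic, which allows a couple of sanity checks:

>>> len(randlist('short', 'few'))
10
>>> max(randlist('short', 'tons')) <= 5
True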

make_lists yields a random list for each combination of the possible characteristics:

def make_lists():
    for size in 'short', 'long':
        for freq in 'few', 'tons':
            for almost_sorted in True, False:
                yield (size, freq, almost_sorted), randlist(size, 
                                                            freq, 
                                                            almost_sorted)
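
Two sizes, two duplicate frequencies and two orderings: eight lists in total.

>>> len(list(make_lists()))
8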

The main section

The main section of the module iterates over the different implementations and times each of them on every list yielded by make_lists:

if __name__ == '__main__':
    from timeit import Timer

    for fun in [naive, usort, keep_last, keep_first, set, dico, odict]:
        for (s, f, a), l in make_lists():
            print("%s\t%s\t%s\t%s\t%s" % (fun.__name__, s, f, a,
                                          Timer(lambda: fun(l)).timeit(number=100)))

Conclusion from this short benchmark: the set and keep_* functions are efficient. The naive algorithm should really be avoided, and the ordered dict, though part of the standard library, seems about four times slower than the keep_* functions based on the dict and enumerate builtins.