==================================
Suppressing duplicates in a list
==================================

Each of the following functions returns a list without duplicates, built
from the list given as argument. Three approaches are presented: the
simplest naive algorithm, then faster methods which do not preserve the
original ordering, and finally fast methods which do preserve it. All in
all, this problem is easily solved with Python's *set* and *dict*
builtins.

Simplest naive algorithm
========================

The following function checks whether each new value from the input list
has already been *seen*:

.. literalinclude:: dup.py
   :pyobject: naive

The *in* operator, applied to a list, gets slower as the list grows:
each membership test is a linear scan.

Here is a shorter implementation of the same algorithm, which
illustrates two subtleties of the Python language:

.. literalinclude:: dup.py
   :pyobject: buggy_naive

#. Argument default values are evaluated only once: when the function is
   defined. In our case, the default value for *nodups* is a reference
   to a list which happens to be empty at function definition time.
   While the reference itself never changes, the list it points to can
   be updated and modified. Do not be surprised if the function's data
   is not the same from one call to the next: the default value is not
   re-initialised.

#. *nodups.append()* does not return a useful result, it works by side
   effect (*None* is actually returned). The list comprehension updates
   the *nodups* list, but it also builds, internally, a list as long as
   the input list and filled with *None*. This unused list is wasted
   work.

Faster methods which do not respect the input list order
=========================================================

The list is sorted first. Then, to check whether a value has already
been seen, it is only necessary to compare it with the previous value
instead of scanning a whole list.

.. literalinclude:: dup.py
   :pyobject: usort

Another approach is to store the elements behind a hash: duplicates are
found quickly, as they end up behind the same hash. The Python
dictionary does this easily (it is only a matter of storing an empty
value):

.. literalinclude:: dup.py
   :pyobject: dico

Actually, an iterable without duplicates is also called a *set*. Another
way to state the problem is: how to transform a list into a set!

>>> set([1, 2, 3, 4, 3, 4, 3, 4])
set([1, 2, 3, 4])

The Python set is essentially implemented as a dictionary whose values
are set to *None*.

Respect the original order
==========================

The first function returns a list where the *first* occurrences of the
duplicates are removed. A dictionary is built from the input list: the
keys are the list elements, which ensures uniqueness, and the values are
the positions. The keys are returned sorted on the position:

.. literalinclude:: dup.py
   :pyobject: keep_last

Same algorithm, but the *last* duplicates are removed:

.. literalinclude:: dup.py
   :pyobject: keep_first

Python 2.7 and Python 3.1 gained a new data structure in the
*collections* package: the ordered dictionary, which remembers the
insertion order of its keys:

.. literalinclude:: dup.py
   :pyobject: odict

*Et voilà*: *set* is a builtin and *OrderedDict* comes with the standard
library. The language has many facilities directly available to solve
this problem.
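For readers without *dup.py* at hand, here is a minimal sketch of what
the functions discussed above might look like. It is an assumed
reconstruction, not the literal contents of the module:

.. code-block:: python

   from collections import OrderedDict


   def naive(lst):
       # Quadratic: "value not in nodups" rescans the result list each time.
       nodups = []
       for value in lst:
           if value not in nodups:
               nodups.append(value)
       return nodups


   def usort(lst):
       # Sort first, then compare each value with the previous one only.
       # The original ordering is lost.
       nodups = []
       for value in sorted(lst):
           if not nodups or value != nodups[-1]:
               nodups.append(value)
       return nodups


   def dico(lst):
       # Dictionary keys are unique: duplicates collapse onto the same key.
       d = {}
       for value in lst:
           d[value] = None
       return list(d)


   def keep_last(lst):
       # Map each element to its *last* position, then sort on position.
       d = dict((value, pos) for pos, value in enumerate(lst))
       return sorted(d, key=d.get)


   def keep_first(lst):
       # Enumerate from the end so that earlier positions overwrite later
       # ones: each element keeps its *first* position.
       d = dict((value, pos) for pos, value in reversed(list(enumerate(lst))))
       return sorted(d, key=d.get)


   def odict(lst):
       # OrderedDict remembers insertion order: first occurrences win.
       return list(OrderedDict((value, None) for value in lst))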
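Assuming that sketch, a short hypothetical session shows how the three
order-preserving variants differ on the same input:

>>> lst = [1, 2, 1, 3, 2]
>>> keep_first(lst)
[1, 2, 3]
>>> keep_last(lst)
[1, 3, 2]
>>> odict(lst)
[1, 2, 3]

*keep_first* and *odict* agree; *keep_last* orders the elements by their
last occurrence instead.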
What do you mean by fast?
=========================

Let's build lists with different characteristics and compare each
implementation on them. *randlist* returns a random list according to
the input characteristics:

.. literalinclude:: dup.py
   :pyobject: randlist

*make_lists* yields the random lists for all the permutations of the
possible characteristics:

.. literalinclude:: dup.py
   :pyobject: make_lists

The *main* section of the module iterates over the different
implementations and times each of them on every list yielded by
*make_lists*:

.. literalinclude:: dup.py
   :start-after: the main

Conclusion from this short benchmark: the *set* and *keep_\** functions
are efficient. The naive algorithm should really be avoided, and the
ordered dict, although it ships with the standard library, seems about
four times slower than the *keep_\** functions based on *dict* and
*enumerate*.
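To reproduce a rough version of this benchmark without the module's
*main* section, a few lines of *timeit* are enough. The test list and
the repetition count below are arbitrary choices, and the import assumes
*dup.py* is on the path:

.. code-block:: python

   import timeit

   # Assumes dup.py is importable; the test list is an arbitrary example.
   setup = ("from dup import naive, usort, dico, keep_first, keep_last, odict; "
            "lst = list(range(1000)) * 3")

   for name in ("naive", "usort", "dico", "keep_first", "keep_last", "odict"):
       # Run each implementation 100 times on the same list.
       elapsed = timeit.timeit("%s(lst)" % name, setup=setup, number=100)
       print("%-12s %.4f s" % (name, elapsed))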