.NET 3.5+

A LINQ Style Duplicate Item Finder

by Richard Carr, published at http://www.blackwasp.co.uk/Duplicates.aspx

Language-Integrated Query (LINQ) includes a number of set operators, including one that removes duplicates from a sequence to return only distinct values. This article describes a custom set operator that returns only the duplicated values from a list.

Download Source Code

Distinct

LINQ's Distinct operator processes a sequence to remove the duplicates and create a new sequence of unique items. In some situations you may find that you need to perform some further action for any duplicated items in a collection. For example, if the duplicates have been retrieved from a database you may wish to delete all except one item in each group of duplicates.

In this article we'll put together a simple extension method, similar to existing set operators, which will find the duplicated items from any sequence that implements IEnumerable<T>. We'll name the method, "Duplicates".

Duplicates Method

To follow the pattern of the Distinct method, and other LINQ set operators, we'll create several overloaded versions of the Duplicates method. We will allow the comparison that determines whether or not two items are similar to be based upon either the default comparer for the type, or upon a specific comparer provided via an IEqualityComparer<T> parameter. Two of the method signatures will match those of Distinct, as follows.

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> sequence)
public static IEnumerable<T> Duplicates<T>(
    this IEnumerable<T> sequence, IEqualityComparer<T> comparer)

We also need to consider exactly what we want the Duplicates method to return. We could create a sequence containing all of the duplicated items. We could also return just one of each duplicated element from the sequence. To allow the consumer of the method to decide, we'll add a Boolean parameter. If true, we will return all of the duplicates. If not, we'll return one of each duplicate. This gives two additional overloads.

public static IEnumerable<T> Duplicates<T>(
    this IEnumerable<T> sequence, bool returnAll)
public static IEnumerable<T> Duplicates<T>(
    this IEnumerable<T> sequence, IEqualityComparer<T> comparer, bool returnAll)

Creating the Class

To begin, we need a static class in which to create our extension methods. Create the following declaration.

public static class DuplicatesExtensions
{
}

Creating the Method

Although we have four overloaded versions of the method to create, we'll start with the most complex version, which accepts parameters for the sequence, the comparer and the Boolean value that determines which items are included in the output. The other three overloads will call this method, so will be much simpler to create.

To allow the method to throw exceptions immediately when an invalid value is provided but to defer execution until the sequence is enumerated, we need two methods. The first, Duplicates, performs null checks for the sequence and the comparer, before calling the private DuplicatesImpl method. This second method uses an iterator to defer the results. The public method's code is as follows:

public static IEnumerable<T> Duplicates<T>(
    this IEnumerable<T> sequence, IEqualityComparer<T> comparer, bool returnAll)
{
    if (sequence == null) throw new ArgumentNullException("sequence");
    if (comparer == null) throw new ArgumentNullException("comparer");

    return DuplicatesImpl(sequence, comparer, returnAll);
}

Inside the DuplicatesImpl method we need to loop through the source sequence and take one of three actions according to the values found. If we find a value that has not been seen previously when compared using the provided IEqualityComparer<T> don't add it to the output sequence, as it may not be a duplicate. Instead, we add it to a dictionary of values encountered. When added to the dictionary, we use the source item as the key and set the value to false, indicating that it has not been yielded by the method.

If we obtain a value that has been seen once before, determined by finding the item in the dictionary with a false value, we know it is a duplicate. In this case we return the item from the dictionary, as this was found first. If we are returning all of the duplicates, determined by checking the returnAll flag, we'll also yield the newly found value. To indicate that we have included the value in the output sequence, we set the value of the item in the dictionary to true.

If we find a value that already exists in the dictionary and the value in the dictionary is true, we know that we've seen and returned the value previously. If we are returning all duplicates we yield the new value. If not, no action is required.

All of these rules are included in the following code.

private static IEnumerable<T> DuplicatesImpl<T>(
    IEnumerable<T> sequence, IEqualityComparer<T> comparer, bool returnAll)
{
    Dictionary<T, bool> items = new Dictionary<T, bool>(comparer);
    foreach (T item in sequence)
    {
        if (items.ContainsKey(item))
        {
            if (items[item] == false)
            {
                items[item] = true;
                yield return items.Keys.Single(k => comparer.Equals(k, item));
            }

            if (returnAll)
            {
                yield return item;
            }
        }
        else
        {
            items.Add(item, false);
        }
    }
}

Next: Page 2

18 October 2012