hello and returning unique elements


hello and returning unique elements

Hy Carrinski
Dear BayPIGgies,

Thank you for providing exceptional reading material over the past
year for an avid conversation follower and occasional meeting
attendee.

I would like to join the conversation and contribute to the community
beginning with this post. I will attend our meeting this month, and
look forward to meeting at Noisebridge or elsewhere to watch PyCon
2011 videos together.

While thinking about unique lists in the context of Vikram's recent
question, I came across a related blog post at
http://www.peterbe.com/plog/uniqifiers-benchmark

The post emphasizes speed and maintaining original list order when
removing duplicates from a list.

Although the post is several years old, I thought of a solution to the
question it poses that is different from the ones proposed by
BayPIGgies, either in the blog's comments or in a previous discussion
(from before reversed, sorted, and set were introduced in Python 2.4) at
http://code.activestate.com/recipes/52560/

I will be grateful for any comments on my "ask forgiveness rather than
permission" implementation. It requires that all elements of the list
be hashable, and it works for Python 2.x where x >= 4.

def unique(seq):
    """Return unique list of hashable objects while preserving order."""
    seen = {}
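    # Iterating in reverse means the value finally stored for each
    # element is its lowest index, i.e. the position of its first
    # occurrence in seq.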
    for ix in reversed(xrange(len(seq))):
        seen[seq[ix]] = ix
    return sorted(seen, key=seen.__getitem__)
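
For concreteness, a quick check of the intended behavior (elements
come back in order of first occurrence):

>>> unique([1, 1, 2, 3, 2, 4])
[1, 2, 3, 4]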

For the case of hashable objects, are there disadvantages to this
implementation versus the ones referenced above? Is there a reason to
prefer operator.itemgetter over __getitem__ in this limited context?
Is there an advantage in clarity or performance to using
collections.OrderedDict in Python 2.7 and above? For lists containing
a great number of distinct objects, the sort step places an
O(n log n) lower bound on the running time of this implementation.
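
For reference, the OrderedDict variant I have in mind would be
something like the sketch below (untested; it assumes Python 2.7,
where re-inserting an existing key leaves its position unchanged, so
first-seen order is preserved; the name unique_od is just for
illustration):

import collections

def unique_od(seq):
    """Return unique list of hashable objects while preserving order."""
    return list(collections.OrderedDict.fromkeys(seq))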

I look forward to meeting more BayPIGgies in the coming months and years.

Cheers,
Hy

Re: hello and returning unique elements

Hy Carrinski
Erratum: the function I suggested is incomplete without this line
following the docstring.

if seq is None: return None
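
For clarity, here is the function with the erratum applied (the guard
goes immediately after the docstring):

def unique(seq):
    """Return unique list of hashable objects while preserving order."""
    if seq is None: return None
    seen = {}
    for ix in reversed(xrange(len(seq))):
        seen[seq[ix]] = ix
    return sorted(seen, key=seen.__getitem__)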


Re: hello and returning unique elements

Tim Hatch
From some quick experiments, it looks like you can double the speed
relative to your version by not doing all the reversed/sorted work and
just keeping track of seen items in a much less exciting way:

def unique2(seq):
    """Return a list of the unique elements of seq, preserving order."""
    seen = set()
    out = []
    for i in seq:
        if i not in seen:
            out.append(i)
            seen.add(i)
    return out

# yours
$ python -m timeit -s 'from demo import unique' 'sum(unique([1, 1, 2, 3, 4]))'
100000 loops, best of 3: 5.07 usec per loop
# mine
$ python -m timeit -s 'from demo import unique2' 'sum(unique2([1, 1, 2, 3, 4]))'
100000 loops, best of 3: 2.65 usec per loop

You can actually make it a smidge faster still by making it a
generator, at least for this tiny case (as the input grows into the
thousands of items, its timings become much more like unique2's):

def unique3(seq):
    """Yield the unique elements of seq in order of first occurrence."""
    seen = set()
    for i in seq:
        if i not in seen:
            yield i
            seen.add(i)

$ python -m timeit -s 'from demo import unique3' 'sum(unique3([1, 1, 2, 3, 4]))'
1000000 loops, best of 3: 1.99 usec per loop
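
(Since unique3 is a generator, wrap it as list(unique3(seq)) wherever
an actual list is needed.)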

Tim

Re: hello and returning unique elements

Tung Wai Yip
I've put my previous suggestion into an example. It is similar to many
other suggestions, except that it is two lines of code (by using a
newer tool in the standard library):

>>> import collections
>>> countdict = collections.Counter('abcdeabcxyz')
>>> [v for v, c in countdict.items() if c > 1]
['a', 'c', 'b']

This should be O(N).
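
For completeness, the same idea wrapped in a function (a sketch; the
name repeated is just for illustration, Counter requires Python 2.7,
and the result order is whatever dict iteration gives):

import collections

def repeated(seq):
    """Return the elements of seq that occur more than once."""
    counts = collections.Counter(seq)
    return [v for v, c in counts.items() if c > 1]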

Wai Yip

Re: hello and returning unique elements

Hy Carrinski
In reply to this post by Tim Hatch
Thank you for demonstrating a simpler and faster solution. It is
equivalent to a solution suggested by Andrew Dalke back in 2006 in the
comments of the blog post referenced initially. My performance
comparisons had been made only against even older solutions (designed
prior to Python 2.4) based on dictionaries rather than sets. The
advantages of storing a set of encountered objects, appending to a
result list in the preferred order, and, better yet, making it a
generator are now clear.

Thanks,
Hy
