Jython slow parsing with BeautifulSoup


Jython slow parsing with BeautifulSoup

Indra Talip
G'day,
While looking into using BeautifulSoup with Jython I found that it is exceedingly slow. I came across the following thread from Nov 2009 (http://sourceforge.net/p/jython/mailman/message/23893421/), which also documents the speed issue.

Poking at it with the NetBeans profiler showed that most of the time was spent in PyUnicode#toCodePoints, or more specifically in the calls to SubsequenceIteratorImpl#next(), because the HTMLParser/Jython SRE engine passes the entire file as a string multiple times (for different regexes) to the SRE_STATE constructor, which in turn calls PyUnicode#toCodePoints. The attached patch takes a brute-force approach of memoizing the result of calling PyUnicode#toCodePoints(). This brings the run-time of parsing http://www.fixprotocol.org/specifications/fields/5000-5999 down from 500+ seconds to ~6 seconds. I've wrapped the returned codepoints int[] in a SoftReference to allow it to be garbage collected, although this adds about 1 second of overhead.
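In rough outline the patch does something like the following (a sketch only, not the patch itself; the real field and method names inside PyUnicode differ):

import java.lang.ref.SoftReference;

// Sketch of the brute-force memoization: convert the UTF-16 string to code
// points once and keep the int[] behind a SoftReference so the GC can still
// reclaim it under memory pressure. (Not synchronized; a racy duplicate
// conversion would be benign.)
class MemoizedCodePoints {
    private final String string;
    private SoftReference<int[]> codePointsRef;   // cleared by the GC if memory gets tight

    MemoizedCodePoints(String string) {
        this.string = string;
    }

    int[] toCodePoints() {
        int[] codePoints = (codePointsRef == null) ? null : codePointsRef.get();
        if (codePoints == null) {                 // first call, or the array was collected
            codePoints = convert(string);
            codePointsRef = new SoftReference<int[]>(codePoints);
        }
        return codePoints;
    }

    private static int[] convert(String s) {
        int[] cps = new int[s.codePointCount(0, s.length())];
        for (int i = 0, j = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);            // handles surrogate pairs
            cps[j++] = cp;
            i += Character.charCount(cp);
        }
        return cps;
    }
}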

The code used is as per the previous thread.

> cat beautifulsoup.py 
import time
from bs4 import BeautifulSoup
data = open("fix-5000-5999.html").read()
start = time.time()
soup = BeautifulSoup(data)
print time.time() - start

> JYTHONPATH=beautifulsoup4-4.3.2/ dist/bin/jython beautifulsoup.py 
6.15199995041

Is this approach worthwhile? Are there any other approaches that I could take to try speeding Jython up while using BeautifulSoup?

Cheers
Indra

--
Indra Talip

Attachment: codepoints_memoization.patch (3K)

Re: Jython slow parsing with BeautifulSoup

Jim Baker
Indra,

Thanks for your analysis and patch! It's quite interesting: just today I was discussing this very issue with a colleague as a known performance problem in Jython. Incidentally, Nicholas Riley also mentioned in a comment on that thread (http://sourceforge.net/p/jython/mailman/message/23906076/) that in his work on Joni he saw the same overhead of converting to codepoints (from the underlying UTF-16), so it looks like we just didn't go far enough in looking at this issue nearly 5 years ago.

Memoization is certainly the right approach given how SRE is currently implemented. Rather than doing it in PyUnicode, though, I would think it makes sense to apply the memoization in SRE itself, at the cost of somewhat greater complexity.

One more thing to consider: toCodePoints is really only necessary for unicode objects that contain codepoints outside the basic (multilingual) plane, so we could avoid the construction entirely for BMP-only strings. Even more radically, we could probably avoid it in general if we moved away from indexed ops in SRE and just used a next/prev approach.
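The BMP check would be a single pass over the chars, something like this (a sketch, not code from Jython):

// A UTF-16 string contains a code point outside the BMP iff it contains a
// surrogate code unit, so one scan tells us whether char indices and code
// point indices coincide. If they do, SRE could index the string directly
// and skip toCodePoints() altogether.
static boolean isBmpOnly(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
        if (Character.isSurrogate(s.charAt(i))) {
            return false;   // chars and code points no longer line up one-to-one
        }
    }
    return true;            // every code point fits in a single char
}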

Thanks again, this is great stuff!

- Jim




Re: Jython slow parsing with BeautifulSoup

Indra Talip
Jim,
Thanks for the feedback. The attached patch moves the memoization into SRE_STATE, using a Guava com.google.common.cache.Cache with WeakReference keys.
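The shape of it is roughly this (illustrative only; the details are in the attached patch):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.python.core.PyUnicode;

// Weak keys mean an entry disappears once its PyUnicode is otherwise
// unreachable; weakKeys() also switches the cache to identity (==)
// comparison, which is what we want here: one entry per string instance.
private static final Cache<PyUnicode, int[]> codePointCache =
        CacheBuilder.newBuilder()
                .weakKeys()
                .concurrencyLevel(4)   // the knob I ask about below
                .build();

static int[] codePoints(final PyUnicode key) throws ExecutionException {
    return codePointCache.get(key, new Callable<int[]>() {
        @Override
        public int[] call() {
            return key.toCodePoints();   // computed once per live string instance
        }
    });
}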

It appears to have similar performance characteristics for the simple test case.
> JYTHONPATH=beautifulsoup4-4.3.2/ dist/bin/jython beautifulsoup.py 
6.15499997139

If that looks alright, I'll push it up to the bug tracker. One question I have is whether to make the concurrencyLevel configurable, and if so, what's the best way to do so? Is there a standard mechanism for looking up these sorts of configuration items in Jython?

Regarding the performance of getting code points from the underlying UTF-16 string, it looks like that will be built into Java 8 (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8012665).
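Assuming the API lands as described in that bug, the conversion then becomes a one-liner:

// Java 8's CharSequence.codePoints() returns an IntStream of code points,
// replacing the hand-rolled surrogate-walking loop.
String s = "G'day \uD83D\uDE00";               // ends with one supplementary code point
int[] codePoints = s.codePoints().toArray();   // [71, 39, 100, 97, 121, 32, 128512]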

Cheers
Indra



Attachment: sre_state_codepoints_memoization.patch (1K)

Re: Jython slow parsing with BeautifulSoup

Indra Talip
Jim,
Running the code through the NetBeans profiler again shows that execution time now appears to be dominated by Jython internals rather than by SRE_STATE/PyUnicode as before.

The top results from the profiling session are shown below, with PyTableCode.call, PyType$MethodCache.lookup_where, and PyObject.object___findattr__ together taking ~16% of the total execution time.

"Hot Spots - Method","Self Time [%]","Self Time","Total Time","Invocations"
"org.python.core.PyTableCode.call(org.python.core.ThreadState, org.python.core.PyFrame, org.python.core.PyObject)","9.314184","4736.226 ms","46921.135 ms","267700"
"org.python.core.PyType$MethodCache.lookup_where(org.python.core.PyType, String, org.python.core.PyObject[])","4.092952","2081.25 ms","3450.315 ms","1702296"
"org.python.core.PyObject.object___findattr__(String)","2.7491088","1397.911 ms","6655.189 ms","976144"
"org.python.core.PyType.lookup(String)","1.7592899","894.592 ms","5180.861 ms","1693183"
"org.python.core.PyType.lookup_where(String, org.python.core.PyObject[])","1.6819717","855.276 ms","4306.611 ms","1702296"
"org.python.core.PyObject.__getattr__(String)","1.5009181","763.211 ms","9284.117 ms","1228640"
"org.python.core.PyStringMap.__finditem__(String)","1.4666681","745.795 ms","745.795 ms","2114545"
"org.python.core.PySequence.<init>(org.python.core.PyType)","1.3575838","690.326 ms","1522.962 ms","840393"
"org.python.core.PyFrame.getlocal(int)","1.3472611","685.077 ms","685.077 ms","3150368"
"org.python.modules.sre.SRE_STATE.SRE_MATCH(int[], int, int)","1.2096357","615.095 ms","1305.284 ms","401248"
"org.python.core.PyObject.<init>(org.python.core.PyType)","1.1715233","595.715 ms","595.715 ms","2388936"
"org.python.core.PyMethodDescr.method_descriptor___get__(org.python.core.PyObject, org.python.core.PyObject)","1.0598332","538.921 ms","2489.703 ms","495366"

At this stage it doesn't look like further optimisations to SRE_STATE/indexed ops would buy much improvement in the overall execution time.

Cheers
Indra




Re: Jython slow parsing with BeautifulSoup

Indra Talip
Jim,
I've created a pull request at https://bitbucket.org/jython/jython/pull-request/27/make-sre_state-cache-the-code-points-for-a/diff that implements caching of the code points, with the cache's properties configurable via the registry.
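The idea is that one registry key can drive the whole cache configuration, roughly like this (the key name and default shown here are illustrative; the actual ones are in the pull request):

import org.python.core.PySystemState;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheBuilderSpec;

// Jython's registry is exposed as java.util.Properties, and Guava can parse
// an entire cache configuration out of a single spec string.
String spec = PySystemState.registry.getProperty(
        "python.sre.cachespec",          // assumed key name
        "weakKeys,concurrencyLevel=4");  // assumed default
CacheBuilder<Object, Object> builder = CacheBuilder.from(CacheBuilderSpec.parse(spec));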

Cheers
Indra



Re: Jython slow parsing with BeautifulSoup

Jim Baker
Indra,

Thanks for working on this!

Memoization is the right approach for Jython's regex engine (SRE) as it is currently implemented using array indexed ops. Clearly certain usages are O(n^2) instead of O(n), which the memoization mitigates. My only concern is that the cache could potentially contain references to large objects. But this can be readily resolved: Guava's Cache mechanism supports an arbitrary weighing mechanism and a maximum weight for a cache. So we just need to assign the weight to be the length in codepoints, and we can constrain the cache to, say, on the order of 1MB, or whatever the user wants. Other issues like concurrency level are best left to the user, as you have done with this property.
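Concretely, something along these lines (a sketch; the exact limit is up for discussion):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;
import org.python.core.PyUnicode;

// Weigh each entry by its length in code points; since each code point is a
// 4-byte int, a maximum weight of ~250K bounds the cache at roughly 1MB of
// cached int[] data.
Cache<PyUnicode, int[]> cache = CacheBuilder.newBuilder()
        .weakKeys()
        .maximumWeight(250000)
        .weigher(new Weigher<PyUnicode, int[]>() {
            @Override
            public int weigh(PyUnicode key, int[] codePoints) {
                return codePoints.length;
            }
        })
        .build();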

Longer term, it would be best if we did not need to cache these codepoints at all. (Jython's runtime does not cache user data, although we certainly do cache metadata, especially around types.) I did take a look at JRuby's Joni project, which is their equivalent of what we do; in particular, look at its regex bytecode VM (https://github.com/jruby/joni/blob/master/src/org/joni/ByteCodeMachine.java). It's actually quite close to SRE in design, but it implements the next/prev approach I suggested earlier (called step and stepBack, see https://github.com/jruby/jcodings/blob/master/src/org/jcodings/Encoding.java#L320). The other interesting thing Joni does is break up its methods so they are more inlineable and therefore more likely to run fast on the JVM. This is especially visible when comparing the use of switch statements in Joni and SRE.
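In SRE terms the next/prev approach amounts to replacing the int[] with a cursor over the UTF-16 data, something like this sketch of the idea (not Joni's actual interfaces):

// Walks code points directly over the UTF-16 chars, so no int[] ever needs
// to be materialized, let alone cached.
final class CodePointCursor {
    private final CharSequence text;
    private int index;   // current char (UTF-16) position

    CodePointCursor(CharSequence text) { this.text = text; }

    boolean hasNext() { return index < text.length(); }

    int next() {         // the equivalent of Joni's step
        int cp = Character.codePointAt(text, index);
        index += Character.charCount(cp);
        return cp;
    }

    int prev() {         // the equivalent of Joni's stepBack
        int cp = Character.codePointBefore(text, index);
        index -= Character.charCount(cp);
        return cp;
    }
}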

What does this mean for the pull request? Assuming weight support, it looks good and certainly fine for 2.7.0. But ideally we can find some time to work on SRE. For the interested developer, this would be straightforward work that would introduce what it means to write a fast virtual machine on the JVM (and who doesn't like that idea, VMs on VMs?) and the issues of UTF-16 character encoding for unicode vs. just using bytes (Python allows regexes on both), while still being quite self-contained.

- Jim



Re: Jython slow parsing with BeautifulSoup

Indra Talip
Jim,
I poked around with using weights (sre_state_codepoints_weigher.patch) and, to be honest, I don't really like that approach. The main reason is that it makes tuning a Jython app somewhat harder: you will get regressions in SRE_STATE behaviour if the cache/weigher isn't sized appropriately, and in order to size it you have to know the setting exists and what value is appropriate for your application. For example, my first cut at using the weigher used a 1MB limit, as you suggested earlier. It turns out that the ~320KB HTML file from the first email consumed 1.2MB when converted to code points, so it was always evicted before it could be reused, and the performance of BeautifulSoup/HTMLParser reverted to taking 500+ seconds :(

So while the attached sre_state_codepoints_weigher patch ups the cache to 10MB, which addresses that case, I still think it makes tuning a Jython app somewhat harder




Attachment: sre_state_codepoints_weigher.patch (3K)

Re: Jython slow parsing with BeautifulSoup

Indra Talip
Jim,
Bah. I keep hitting CTRL-Enter accidentally before I finish my emails :(

What I intended to follow up with was that I think the better approach to the memoization is to turn on softValues for the cache, as in the attached patch (soft_values.patch). This, in my opinion, makes the caching a little more transparent and brings the standard JVM tuning approaches into play: for large entries in the cache, the JVM is responsible for cleaning them up as required. It could still mean performance regressions if the code points are collected before they can be reused, but hopefully that becomes a standard JVM/GC tuning exercise rather than an esoteric setting in Jython. The time-based eviction should also mean that, regardless of JVM GC cycles, Jython won't hold onto the code points forever.
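The change boils down to this (illustrative; the exact expiry window is in soft_values.patch):

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.python.core.PyUnicode;

// Soft values hand eviction of the big int[] entries over to the garbage
// collector, while expireAfterAccess guarantees that unused entries are
// dropped eventually even if memory stays plentiful.
Cache<PyUnicode, int[]> cache = CacheBuilder.newBuilder()
        .weakKeys()
        .softValues()
        .expireAfterAccess(30, TimeUnit.SECONDS)   // assumed window, for illustration
        .build();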

Let me know what you think and I'll push another commit up to the pull request. 

Cheers
Indra 



Attachment: soft_values.patch (1K)