differences in IronPython/CPython regular expressions?

differences in IronPython/CPython regular expressions?

jalopyuser
I have a large RE (223613 chars) that works fine in CPython 2.6, but
seems to produce an endless loop in IronPython (see below).  I'm using
Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.  Anyone have
pointers to the differences between them?  Is
System::Text::RegularExpressions in .NET configurable in some fashion
that might help?

I'm a .NET newbie.

TIA,

Bill

--------------------------------------------------
import sys, os, re

try:
    # we use the name lists in nltk to create person-name matching patterns
    import nltk.data
except ImportError:
    sys.stderr.write("Can't import nltk; can't do name lists.\nSee http://www.nltk.org/.\n")
    sys.exit(1)
else:
    __MALE_NAME_EXCLUDES = ("Hill",
                          "Ave",
                          )
    __FEMALE_NAME_EXCLUDES = ()
    __FEMALE_NAMES = [x for x in
                      nltk.data.load("corpora/names/female.txt", format="raw").split("\n")
                      if (x and (x not in __FEMALE_NAME_EXCLUDES))]
    __FEMALE_NAMES += [x.upper() for x in __FEMALE_NAMES]
    __MALE_NAMES = [x for x in
                    nltk.data.load("corpora/names/male.txt", format="raw").split("\n")
                    if (x and (x not in __MALE_NAME_EXCLUDES))]
    __MALE_NAMES += [x.upper() for x in __MALE_NAMES]
    # initials A through Z; range() excludes its end value, hence the + 1
    __INITS = [chr(x) for x in range(ord('A'), ord('Z') + 1)]

PERSON_PATTERN = re.compile(
    # raw strings keep the regex backslashes (\.) out of Python's own escape handling
    r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"        # honorific
    "(?P<firstname>" +
    "|".join(__FEMALE_NAMES + __MALE_NAMES + __INITS) + # first name
    ")"
    r"( (?P<middlename>([A-Z]\.)|(" +
    "|".join(__FEMALE_NAMES + __MALE_NAMES) +         # middle initial or name
    ")))?"
    " +(?P<lastname>[A-Z][A-Za-z]+)",             # space then last name
    re.MULTILINE)

print(PERSON_PATTERN.match("Mr. John Smith"))
_______________________________________________
Users mailing list
[hidden email]
http://lists.ironpython.com/listinfo.cgi/users-ironpython.com

Re: differences in IronPython/CPython regular expressions?

Jeff Hardy-4
On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <[hidden email]> wrote:
> I have a large RE (223613 chars) that works fine in CPython 2.6, but

That's truly horrible, but I assume you have a good reason for it.

> seems to produce an endless loop in IronPython (see below).  I'm using
> Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.  Anyone have
> pointers to the differences between them?  Is
> System::Text::RegularExpressions in .NET configurable in some fashion
> that might help?

First off, is there a reason you don't use re.IGNORECASE? That would
cut the regex in half, at least.

For the most part, CPython and IronPython regexes should be fairly
compatible - IronPython takes the regex and massages it to work with
System.Text.RE, but the changes are pretty straightforward and small,
and I don't think the re you provided hits any of them. It's quite
possible that the Mono version of System.Text.RE can't handle the
expression; you could test this by saving the full regex and building a
small C# program that runs it. The regex template has a lot of
potential backtracking in it; are you sure it's not caught in a
pathological (exponential) case?
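
One way to act on the suggestion above is to write the exact pattern text to a file, so the same string can be handed to a standalone .NET test program. This is only a sketch: the helper name `dump_pattern` is made up for illustration and is not part of IronPython or the original script.

```python
import re


def dump_pattern(compiled, path):
    """Write the source text of a compiled regex to `path`; return its size in bytes."""
    # .pattern holds the original string that was passed to re.compile()
    data = compiled.pattern.encode("utf-8")
    with open(path, "wb") as handle:
        handle.write(data)
    return len(data)
```

Calling `dump_pattern(PERSON_PATTERN, "person_pattern.txt")` (the file name is arbitrary) would save the full regex. Note that the saved text is still Python syntax: .NET writes named groups as `(?<name>...)`, so every `(?P<name>` has to be rewritten before a C# program will accept it.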

Finally, is one ginormous regex really the best way to do this? Have you
tried other approaches?
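
As an illustration of one such other approach (not something proposed in the thread itself), the name lists can be kept out of the regex entirely: a small pattern captures the shape of a name, and the captured words are then checked against a set. The names below are hypothetical stand-ins for the NLTK lists, and `match_person` is an invented helper.

```python
import re

# Hypothetical stand-ins for the NLTK name lists used in the original post.
FIRST_NAMES = {"John", "Mary", "Paul"}
FIRST_NAMES |= {name.upper() for name in FIRST_NAMES}  # capitalized or upper-case
INITIALS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# Structural pattern only: it says what a name looks like, not which names exist.
SHAPE = re.compile(
    r"^(?:(?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
    r"(?P<firstname>[A-Z][A-Za-z'-]*)"
    r"(?: (?P<middlename>[A-Z]\.|[A-Z][A-Za-z'-]*))?"
    r" +(?P<lastname>[A-Z][A-Za-z]+)",
    re.MULTILINE)


def match_person(text):
    """Return the captured fields as a dict, or None if `text` is not a known name."""
    m = SHAPE.match(text)
    if m is None:
        return None
    first, middle = m.group("firstname"), m.group("middlename")
    # Set membership replaces the huge alternation for the first name
    if first not in FIRST_NAMES and first not in INITIALS:
        return None
    # A middle name is either an initial with a dot or another known name
    if middle is not None and not middle.endswith(".") and middle not in FIRST_NAMES:
        return None
    return m.groupdict()
```

This is not an exact equivalent of the original pattern. For example, with an unknown middle word ("John Xyz Smith") the original regex would backtrack and accept "Xyz" as the last name, while this sketch simply returns None.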

- Jeff

Re: differences in IronPython/CPython regular expressions?

jalopyuser
Jeff Hardy <[hidden email]> wrote:

> On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <[hidden email]> wrote:
> > I have a large RE (223613 chars) that works fine in CPython 2.6, but
>
> That's truly horrible, but I assume you have a good reason for it.

Hi, Jeff.  Yes, I think so.

> > seems to produce an endless loop in IronPython (see below).  I'm using
> > Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7.  Anyone have
> > pointers to the differences between them?  Is
> > System::Text::RegularExpressions in .NET configurable in some fashion
> > that might help?
>
> First off, is there a reason you don't use re.IGNORECASE? That would
> cut the regex in half, at least.

Sure.  Names are sensitive to capitalization; the rule I'm implementing says
names are either capitalized or upper-case.

> For the most part, CPython and IronPython regexes should be fairly
> compatible - IronPython takes the regex and massages it to work with
> System.Text.RE, but the changes are pretty straightforward and small,

Are those changes documented anywhere?

> and I don't think the re you provided hits any of them. It's quite
> possible that the Mono version of System.Text.RE can't handle the
> expression; you could test this by saving the full regex and building a
> small C# program that runs it. The regex template has a lot of
> potential backtracking in it; are you sure it's not caught in a
> pathological (exponential) case?

No; all I'm sure of is that this runs in 1.2 seconds in CPython, and
takes up a core for 15 minutes (till I kill it) with IronPython/Mono.
Something is clearly hitting a bug somewhere...  I suppose I should
try it on Windows.

> Finally, is one ginormous regex really the best way to do this? Have you
> tried other approaches?

No need, until I hit .NET.  I'm used to working with a full-featured
finite-state machine (PARC's xfst; see
http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if
we could do similar things with Python's RE machinery.  Long lists like
these names are often used for lists of companies or cities or such.
People's names are actually a fairly simple and short example of this :-).
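
For long word lists like these, one technique that moves a backtracking regex a little closer to a finite-state machine is to merge the words into a prefix trie and emit a pattern that shares common prefixes, instead of one flat alternation. The sketch below is an assumption about how that could be done in plain Python; `build_trie` and `trie_to_regex` are invented names, and nothing here comes from xfst or from the original script.

```python
import re


def build_trie(words):
    """Nested dicts keyed by character; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def trie_to_regex(node):
    """Render a trie as a regex fragment that shares common prefixes."""
    branches = [re.escape(ch) + trie_to_regex(node[ch])
                for ch in sorted(node) if ch != ""]
    if not branches:
        return ""                      # leaf: nothing more to match
    if len(branches) == 1 and "" not in node:
        return branches[0]             # single continuation needs no group
    group = "(?:" + "|".join(branches) + ")"
    # A word may end at this node, so everything that follows is optional
    return group + "?" if "" in node else group
```

For the list `["Jo", "Joan", "John", "Mark", "Mary"]` this produces `(?:Jo(?:an|hn)?|Mar(?:k|y))`. The fragment accepts the same set of words as the flat alternation, but it prefers the longest word at each position, so captured groups can differ from the original pattern in ambiguous cases.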

Bill

Re: differences in IronPython/CPython regular expressions?

George Silva
If you're on Windows, you can test the native C# behavior with a tool called Rad Software Regular Expression Designer. It's very helpful.

--
George R. C. Silva

GIS Development
http://geoprocessamento.net
http://blog.geoprocessamento.net



Re: differences in IronPython/CPython regular expressions?

Jeff Hardy-4
In reply to this post by jalopyuser
> Sure.  Names are sensitive to capitalization; the rule I'm implementing says
> names are either capitalized or upper-case.

Ah, I see that now. I assumed the name lists were in lower case.

>
>> For the most part, CPython and IronPython regexes should be fairly
>> compatible - IronPython takes the regex and massages it to work with
>> System.Text.RE, but the changes are pretty straightforward and small,
>
> Are those changes documented anywhere?

The code is in Languages\IronPython\IronPython.Modules\re.cs in the
PreParseRegex function; it's pretty straightforward, if a little long.
Looking at it again, it's quite possible there's a bug in there, but
we'd need a minimal repro to have any hope of finding it.
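
A possible starting point for such a minimal repro is a script that builds the same pattern shape from a growing list of names and times the match, so the size at which IronPython starts to misbehave can be narrowed down. This is a sketch under assumptions: the filler names are synthetic, the initials list is left out, and the real failure may depend on the actual NLTK names rather than on size alone.

```python
import re
import time


def build_person_pattern(first_names):
    """Build a pattern shaped like PERSON_PATTERN from a list of first names."""
    names = "|".join(re.escape(name) for name in first_names)
    return re.compile(
        r"^((?P<honorific>Mr|Ms|Mrs|Dr|MR|MS|MRS|DR)\.? )?"
        r"(?P<firstname>" + names + r")"
        r"( (?P<middlename>([A-Z]\.)|(" + names + r")))?"
        r" +(?P<lastname>[A-Z][A-Za-z]+)",
        re.MULTILINE)


def time_match(n_names, text="Mr. John Smith"):
    """Return (matched, seconds) for a pattern with n_names filler names plus John."""
    # Synthetic filler names; "John" goes last so the whole alternation is scanned
    names = ["Name%05d" % i for i in range(n_names)] + ["John"]
    start = time.time()
    matched = build_person_pattern(names).match(text) is not None
    return matched, time.time() - start


if __name__ == "__main__":
    for n in (10, 100, 1000):
        matched, seconds = time_match(n)
        print("%5d names: matched=%s in %.3fs" % (n, matched, seconds))
```

Running the same script under CPython and IronPython with larger sizes should show whether the slowdown or crash appears at a particular pattern length.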

> No need, until I hit .NET.  I'm used to working with a full-featured
> finite-state machine (PARC's xfst; see
> http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if
> we could do similar things with Python's RE machinery.  Long lists like
> these names are often used for lists of companies or cities or such.
> People's names are actually a fairly simple and short example of this :-).

The fact that it works on CPython fairly fast indicates a bug
somewhere, I'm just not sure if it's IronPython or Mono.

- Jeff

Re: differences in IronPython/CPython regular expressions?

jalopyuser
In reply to this post by George Silva
George Silva <[hidden email]> wrote:

> If you're on Windows, you can test the native C# behavior with a tool
> called Rad Software Regular Expression Designer. It's very helpful.

Thanks, George.  That looks like a useful piece of software.

Bill


Re: differences in IronPython/CPython regular expressions?

jalopyuser
In reply to this post by Jeff Hardy-4
Jeff Hardy <[hidden email]> wrote:

> The fact that it works on CPython fairly fast indicates a bug
> somewhere, I'm just not sure if it's IronPython or Mono.

I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a
new download of IronPython 2.7.  On that platform, it core-dumps (well,
ipy exits with a StackOverflowException).

Bill

Re: differences in IronPython/CPython regular expressions?

Jeff Hardy-4
On Thu, Jun 2, 2011 at 9:41 AM, Bill Janssen <[hidden email]> wrote:
> Jeff Hardy <[hidden email]> wrote:
>
>> The fact that it works on CPython fairly fast indicates a bug
>> somewhere, I'm just not sure if it's IronPython or Mono.
>
> I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a
> new download of IronPython 2.7.  On that platform, it core-dumps (well,
> ipy exits with a StackOverflowException).

Any chance you could get a debugger on there and figure out where the
SOE is (IronPython or .NET)? If not, I can try to take a look if you
send the complete regex, but probably not until the weekend.

- Jeff

Re: differences in IronPython/CPython regular expressions?

Tomas Matousek
BTW, depending on how CPython regex syntax and semantics compare to Oniguruma (Ruby's regex engine), it might be useful to look at the implementation for IronRuby (http://github.com/IronLanguages/main/blob/master/Languages/Ruby/Ruby/Builtins/RegexpTransformer.cs).
I implemented an Oniguruma-compatible parser that translates 99% of the features to .NET. A few missing features could be implemented relatively easily, and then there are some Unicode features that would be quite difficult to do, since .NET Regex's Unicode support is limited.

In any case, if you find that you need to fix the IronPython translator and the fix requires a precise understanding of the syntax, it might be worth reusing the code from IronRuby.

Tomas


Re: differences in IronPython/CPython regular expressions?

jalopyuser
In reply to this post by Jeff Hardy-4
Jeff Hardy <[hidden email]> wrote:

> On Thu, Jun 2, 2011 at 9:41 AM, Bill Janssen <[hidden email]> wrote:
> > Jeff Hardy <[hidden email]> wrote:
> >
> >> The fact that it works on CPython fairly fast indicates a bug
> >> somewhere, I'm just not sure if it's IronPython or Mono.
> >
> > I just tried it with real MS .NET, on a 64-bit Windows 7 machine with a
> > new download of IronPython 2.7.  On that platform, it core-dumps (well,
> > ipy exits with a StackOverflowException).
>
> Any chance you could get a debugger on there and figure out where the
> SOE is (IronPython or .NET)? If not, I can try to take a look if you
> send the complete regex, but probably not until the weekend.

Would gdb work?  I'll try.

The fact that it's different between .NET and Mono makes me guess it's
in the System::Text::RegularExpressions package.

Bill

Re: differences in IronPython/CPython regular expressions?

Jeff Hardy-4
On Thu, Jun 2, 2011 at 11:58 AM, Bill Janssen <[hidden email]> wrote:
> Would gdb work?  I'll try.

Mono's debugger might be better, if their regex engine is managed. It
looks like there's some Mono support in gdb but I've never used it.

On Windows, windbg is your friend. It's about as user-friendly as gdb,
though (take that how you want).

- Jeff

Re: differences in IronPython/CPython regular expressions?

Lee-124
In reply to this post by jalopyuser
> The fact that it's different between .NET and Mono makes me guess it's
> in the System::Text::RegularExpressions package.

If that is the case, it should be easy to test by using C#.  Just write a little console app to test your RegEx on both Mono and MS .NET.  If that fails too, then the problem is not in IronPython but in the .NET regex library itself.
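
The same check can also be sketched from within IronPython, by calling the .NET regex class directly and bypassing IronPython's `re` module. The code below assumes it runs under IronPython, where the `System` namespaces are importable (a `clr.AddReference` call may be needed in some setups); under CPython the import fails and the .NET result is reported as None. The helper names are invented, and the syntax rewrite is deliberately naive: it only handles named group definitions, not backreferences such as `(?P=name)`.

```python
import re

try:
    # Importable only when running on .NET (for example under IronPython)
    from System.Text.RegularExpressions import Regex, RegexOptions
except ImportError:
    Regex = None


def to_dotnet_syntax(pattern):
    """Naively rewrite Python named groups (?P<name>...) as .NET's (?<name>...)."""
    return pattern.replace("(?P<", "(?<")


def dotnet_matches(pattern, text):
    """Return True/False from the .NET engine, or None when .NET is unavailable."""
    if Regex is None:
        return None
    regex = Regex(to_dotnet_syntax(pattern), RegexOptions.Multiline)
    return bool(regex.Match(text).Success)


def compare(pattern, text):
    """Return (python_result, dotnet_result) for the same pattern and text."""
    # re.search is used because .NET's Match also scans the whole input
    python_result = re.search(pattern, text, re.MULTILINE) is not None
    return python_result, dotnet_matches(pattern, text)
```

If `dotnet_matches` hangs or overflows the stack on the full pattern while a small pattern works, that points at the .NET or Mono regex engine rather than at IronPython's translation step.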

Of course I just may be smoking something.

Lee
