Letter class in re

Antoon Pardon
I am using PLY for a parsing task, which uses re for the lexical
analysis. Does anyone know what regular expression to use for a
sequence of letters? There is a class for alphanumerics but I can't
find one for just letters, which I find odd.

I am using Python 3.4.

--
Antoon Pardon


Wolfgang Maier
On 03/09/2015 11:23 AM, Antoon Pardon wrote:
> I am using PLY for a parsing task which uses re for the lexical
> analysis. Does anyone
> know what regular expression to use for a sequence of letters? There is
> a class for alphanumerics but I can't find one for just letters, which I
> find odd.
>
> I am using python 3.4
>

how about [a-zA-Z] ?



Antoon Pardon
Op 09-03-15 om 11:37 schreef Wolfgang Maier:

> On 03/09/2015 11:23 AM, Antoon Pardon wrote:
>> I am using PLY for a parsing task which uses re for the lexical
>> analysis. Does anyone
>> know what regular expression to use for a sequence of letters? There is
>> a class for alphanumerics but I can't find one for just letters, which I
>> find odd.
>>
>> I am using python 3.4
>>
>
> how about [a-zA-Z] ?
>
No, that limits the characters to ASCII-letters.

This is what the doc says about the alphanumeric class:

\w

For Unicode (str) patterns:
    Matches Unicode word characters; this includes most characters that
    can be part of a word in any language, as well as numbers and the
    underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched. ...

So what I want is a class that just includes those characters that can
be part of a word in any language.



Tim Chase
In reply to this post by Wolfgang Maier
On 2015-03-09 11:37, Wolfgang Maier wrote:
> On 03/09/2015 11:23 AM, Antoon Pardon wrote:
>> Does anyone know what regular expression to use for a sequence of
>> letters? There is a class for alphanumerics but I can't find one
>> for just letters, which I find odd.
>
> how about [a-zA-Z] ?

That breaks if you have Unicode letters.  While ugly, since "\w" is
composed of "letters, numbers, and underscores", you can assert that
the "\w" you find is not a number or underscore by using

  (?:(?!_|\d)\w)

as demonstrated:

Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "??????"
>>> import re
>>> r = re.compile(r'^[a-zA-Z]*$', re.U)
>>> r.match(s)
>>> r = re.compile(r"^(?:(?!_|\d)\w)*$", re.U)
>>> r.match(s)
<_sre.SRE_Match object at 0x7fb205da9850>
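
A minimal sketch of the same comparison on a current Python 3, where str
patterns are Unicode-aware by default, so re.U is not needed. The Greek
sample string and the names LETTER, letters_only and ascii_only are
assumptions made for illustration only.

```python
import re

# A word character that is neither an underscore nor a digit.
LETTER = r"(?:(?!_|\d)\w)"

letters_only = re.compile(rf"^{LETTER}*$")
ascii_only = re.compile(r"^[a-zA-Z]*$")

# Hypothetical sample: Greek alpha, beta, gamma.
sample = "\u03b1\u03b2\u03b3"

# The ASCII range rejects the Greek letters.
print(ascii_only.match(sample) is not None)
# The lookahead-based class accepts them.
print(letters_only.match(sample) is not None)
# Underscores and digits are rejected by the lookahead.
print(letters_only.match("abc_1") is not None)
```

The lookahead consumes nothing, so each repetition of the group checks one
character and then lets \w consume it.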

I do regret that Python uses "\A" for "start of string", so it can't
follow Vim, where "\a" means "alphabetic" (and correspondingly "\A"
means "not an alphabetic").

-tkc


Antoon Pardon
Op 09-03-15 om 12:17 schreef Tim Chase:

> On 2015-03-09 11:37, Wolfgang Maier wrote:
>> On 03/09/2015 11:23 AM, Antoon Pardon wrote:
>>> Does anyone know what regular expression to use for a sequence of
>>> letters? There is a class for alphanumerics but I can't find one
>>> for just letters, which I find odd.
>> how about [a-zA-Z] ?
> That breaks if you have Unicode letters.  While ugly, since "\w" is
> composed of "letters, numbers, and underscores", you can assert that
> the "\w" you find is not a number or underscore by using
>
>   (?:(?!_|\d)\w)

So if I understand correctly the following should be a regular expression for
a python3 identifier.

  (?:(?!_|\d)\w)\w+

It seems odd that one should need such an ugly expression for something that is
used rather frequently for parsing computer languages and the like.

--
Antoon Pardon



Tim Chase
On 2015-03-09 13:26, Antoon Pardon wrote:
> Op 09-03-15 om 12:17 schreef Tim Chase:
>>   (?:(?!_|\d)\w)
>
> So if I understand correctly the following should be a regular
> expression for a python3 identifier.
>
>   (?:(?!_|\d)\w)\w+

If you don't have to treat it as an atom, you can simplify that to
just

  (?!_|\d)\w+

which just means that the first character can't be an underscore or
digit.

Though for a Py3 identifier, the underscore is acceptable as a first
character ("__init__"), so you can simplify it even further to just

  (?!\d)\w+
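
A small sketch of how that last pattern behaves when the whole string must
match. The helper name looks_like_identifier and the sample strings are
assumptions for illustration; \w is only an approximation of what Python
itself accepts in identifiers.

```python
import re

# One or more word characters, the first of which must not be a digit.
IDENT = re.compile(r"(?!\d)\w+")

def looks_like_identifier(text):
    """Return True if the whole string matches the approximate pattern."""
    return IDENT.fullmatch(text) is not None

for text in ["__init__", "x", "caf\u00e9", "9lives", ""]:
    print(repr(text), looks_like_identifier(text))
```

The negative lookahead is checked once, at the start, and consumes no
input, which is why a single character such as "x" or "_" still matches.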

-tkc





Wolfgang Maier
In reply to this post by Antoon Pardon
On 03/09/2015 01:26 PM, Antoon Pardon wrote:

> Op 09-03-15 om 12:17 schreef Tim Chase:
>> On 2015-03-09 11:37, Wolfgang Maier wrote:
>>> On 03/09/2015 11:23 AM, Antoon Pardon wrote:
>>>> Does anyone know what regular expression to use for a sequence of
>>>> letters? There is a class for alphanumerics but I can't find one
>>>> for just letters, which I find odd.
>>> how about [a-zA-Z] ?
>> That breaks if you have Unicode letters.  While ugly, since "\w" is
>> composed of "letters, numbers, and underscores", you can assert that
>> the "\w" you find is not a number or underscore by using
>>
>>    (?:(?!_|\d)\w)
>
> So if I understand correctly the following should be a regular expression for
> a python3 identifier.
>
>    (?:(?!_|\d)\w)\w+
>

No, that is not it. For one thing, a leading underscore is fine in
identifier names. That is easy to fix in your expression though.
Another issue is the Other_ID_Start and Other_ID_Continue categories
defined in http://www.unicode.org/Public/6.3.0/ucd/PropList.txt, e.g.,

 >>> '\u212E'
'℮'
 >>> ℮ = 10
 >>> ℮
10

though ℮ is not included in \w.

> It seems odd that one should need such an ugly expression for something that is
> used rather frequently for parsing computer languages and the like.
>

There is str.isidentifier, which returns True if something is a valid
identifier name:

 >>> '℮'.isidentifier()
True
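
A short sketch contrasting the two checks for U+212E (ESTIMATED SYMBOL),
one of the Other_ID_Start characters. The variable name estimated is an
assumption for illustration.

```python
import re

# U+212E ESTIMATED SYMBOL, written as an escape to avoid encoding trouble.
estimated = "\u212e"

# Python's own rule accepts it as an identifier.
print(estimated.isidentifier())

# The regex word class does not match it, so a \w-based pattern misses it.
print(re.fullmatch(r"\w", estimated) is not None)
```

So a pattern built on \w describes a slightly different set of strings
than str.isidentifier does.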





Albert-jan Roskam
In reply to this post by Tim Chase
--------------------------------------------
On Mon, 3/9/15, Tim Chase <python.list at tim.thechases.com> wrote:

 Subject: Re: Letter class in re
 To: python-list at python.org
 Date: Monday, March 9, 2015, 12:17 PM
 
 On 2015-03-09 11:37, Wolfgang Maier wrote:
 > On 03/09/2015 11:23 AM, Antoon Pardon wrote:
 >> Does anyone know what regular expression to use for a sequence of
 >> letters? There is a class for alphanumerics but I can't find one
 >> for just letters, which I find odd.
 >
 > how about [a-zA-Z] ?

 That breaks if you have Unicode letters.  While ugly, since "\w" is
 composed of "letters, numbers, and underscores", you can assert that
 the "\w" you find is not a number or underscore by using

   (?:(?!_|\d)\w)
 

I was going to make the same remark, but with a slightly different solution:
In [1]: repr(re.search("[a-zA-Z]", "?"))
Out[1]: 'None'
 
In [2]: repr(re.search(u"[^\d\W_]+", u"?", re.I | re.U))
Out[2]: '<_sre.SRE_Match object at 0x027CDB10>'

"[^\d\W_]+" means something like "one or more (+) of 'not (a digit, a non-word, an underscore)'".
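
A minimal sketch of the negated class on mixed text, using Python 3 str
patterns (Unicode-aware by default). The sample sentence and the name
LETTERS are assumptions for illustration.

```python
import re

# Anything that is not a digit, not a non-word character, not an underscore.
LETTERS = re.compile(r"[^\d\W_]+")

# Hypothetical sample with ASCII, accented and Greek letters.
text = "foo_bar42 na\u00efve \u03b1\u03b2"

runs = LETTERS.findall(text)
print(runs)
```

The double negation works because \W inside a negated set excludes every
non-word character, leaving word characters minus digits and underscore.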



Chris Angelico
In reply to this post by Antoon Pardon
On Mon, Mar 9, 2015 at 11:26 PM, Antoon Pardon
<antoon.pardon at rece.vub.ac.be> wrote:
> It seems odd that one should need such an ugly expression for something that is
> used rather frequently for parsing computer languages and the like.

Possibly because computer language parsers don't use regular expressions. :)

ChrisA


Serhiy Storchaka
In reply to this post by Antoon Pardon
On 09.03.15 14:26, Antoon Pardon wrote:
> So if I understand correctly the following should be a regular expression for
> a python3 identifier.
>
>    (?:(?!_|\d)\w)\w+
>
> It seems odd that one should need such an ugly expression for something that is
> used rather frequently for parsing computer languages and the like.

It is not quite that easy.

 >>> import sys
 >>> allchars = ''.join(map(chr, range(sys.maxunicode+1)))
 >>> ''.join(c for c in allchars if ('a'+c).isidentifier() and not (c+'a').isidentifier() and not c.isdigit())
 [output: a long string of non-ASCII characters, not recoverable from the archive]
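
A sketch that runs the same filter and groups the resulting characters by
Unicode general category instead of printing them, which sidesteps
display problems. The names continue_only_chars, chars and by_category
are assumptions for illustration; the exact counts depend on the Unicode
version shipped with the interpreter.

```python
import sys
import unicodedata
from collections import Counter

def continue_only_chars():
    """Characters allowed inside an identifier but not at its start,
    excluding digits (the same filter as the one-liner above)."""
    return [
        c for c in map(chr, range(sys.maxunicode + 1))
        if ("a" + c).isidentifier()
        and not (c + "a").isidentifier()
        and not c.isdigit()
    ]

chars = continue_only_chars()

# Count how many such characters fall into each general category.
by_category = Counter(unicodedata.category(c) for c in chars)
for category, count in by_category.most_common():
    print(category, count)
```

Combining marks (category Mn), for example, show up here: they may
continue an identifier but may not start one, and they are not digits.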




Wolfgang Maier
In reply to this post by Albert-jan Roskam
On 03/09/2015 02:33 PM, Albert-Jan Roskam wrote:
> --------------------------------------------
> On Mon, 3/9/15, Tim Chase <python.list at tim.thechases.com> wrote:
>
> "[^\d\W_]+" means something like "one or more (+) of 'not (a digit, a non-word, an underscore)'.
>

Interesting (using Python 3.4 and
U+2188 ROMAN NUMERAL ONE HUNDRED THOUSAND ↈ):

 >>> re.search('[^\d\W_]+', '\u2188', re.I | re.U)
<_sre.SRE_Match object; span=(0, 1), match='ↈ'>

ↈ and at least some other Nl (letter numbers) category characters seem
to be part of \w (not part of \W).
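
A small sketch for inspecting such a character, plus one possible
stricter test based on the Unicode general category. The helper is_letter
is hypothetical, not something from re or from this thread.

```python
import re
import unicodedata

ch = "\u2188"  # ROMAN NUMERAL ONE HUNDRED THOUSAND

# General category: Nl means "letter number".
print(unicodedata.category(ch))

# Not alphabetic, but numeric, which is enough for \w to accept it.
print(ch.isalpha(), ch.isnumeric())

# \d only covers decimal digits, so the negated class lets it through.
print(re.fullmatch(r"[^\d\W_]", ch) is not None)

def is_letter(c):
    """Hypothetical helper: True only for general categories L*."""
    return unicodedata.category(c).startswith("L")

print(is_letter(ch), is_letter("\u03b1"))
```

A per-character check like is_letter cannot be dropped into a regex-based
lexer directly, but it shows which characters the pattern over-accepts.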

Would that be considered a bug ?

Wolfgang



Wolfgang Maier
On 03/09/2015 03:04 PM, Wolfgang Maier wrote:

> On 03/09/2015 02:33 PM, Albert-Jan Roskam wrote:
>> --------------------------------------------
>> On Mon, 3/9/15, Tim Chase <python.list at tim.thechases.com> wrote:
>>
>> "[^\d\W_]+" means something like "one or more (+) of 'not (a digit, a
>> non-word, an underscore)'.
>>
>
> Interesting (using Python 3.4 and
> U+2188 ROMAN NUMERAL ONE HUNDRED THOUSAND ↈ):
>
>  >>> re.search('[^\d\W_]+', '\u2188', re.I | re.U)
> <_sre.SRE_Match object; span=(0, 1), match='ↈ'>
>
> ↈ and at least some other Nl (letter numbers) category characters seem
> to be part of \w (not part of \W).
>
> Would that be considered a bug ?
>

Sorry for the potential confusion: I meant in the pattern search above
(not in the definition of \w or \W).



Antoon Pardon
In reply to this post by Tim Chase
Op 09-03-15 om 13:50 schreef Tim Chase:

> On 2015-03-09 13:26, Antoon Pardon wrote:
>> Op 09-03-15 om 12:17 schreef Tim Chase:
>>>   (?:(?!_|\d)\w)
>> So if I understand correctly the following should be a regular
>> expression for a python3 identifier.
>>
>>   (?:(?!_|\d)\w)\w+
> If you don't have to treat it as an atom, you can simplify that to
> just
>
>   (?!_|\d)\w+
>
> which just means that the first character can't be an underscore or
> digit.
>
> Though for a Py3 identifier, the underscore is acceptable as a first
> character ("__init__"), so you can simplify it even further to just
>
>   (?!\d)\w+

No, that doesn't work. To begin with, my attempt above should have been:

    (?:(?!_|\d)\w)\w*

because an identifier can be just one letter. So when I change the '+'
into a '*' in your suggestion I get this:

>>> r = re.compile(r"(?!\d)\w*")
>>> r.match('?')
<_sre.SRE_Match object; span=(0, 0), match=''>

But the ? is not a letter.

I have done some tests with (?:(?!\d)\w)\w*, which seems to work.



Antoon Pardon
In reply to this post by Wolfgang Maier
Op 09-03-15 om 14:32 schreef Wolfgang Maier:

...

>
>> It seems odd that one should need such an ugly expression for
>> something that is
>> used rather frequently for parsing computer languages and the like.
>>
>
> There is str.isidentifier, which returns True if something is a valid
> identifier name:
>
> >>> '℮'.isidentifier()
> True

Which is not very useful in the context of lexical analysis. I don't need to know
whether a particular string is usable as an identifier, I want to know which parts
of a text are identifiers.



Chris Angelico
On Tue, Mar 10, 2015 at 1:34 AM, Antoon Pardon
<antoon.pardon at rece.vub.ac.be> wrote:
>> There is str.isidentifier, which returns True if something is a valid
>> identifier name:
>>
>> >>> '℮'.isidentifier()
>> True
>
> Which is not very useful in a context of lexical analysis. I don't need to know
> if a particular string is useful as an identifier, I want to know which parts of
> a text are identifiers.

If you're doing lexical analysis, you probably want a lexer. For
Python, I would recommend parsing to AST and doing your analysis on
that; I've had pretty good success doing that, and then using the
line/column info to go back to the original text if I need it. A regex
is probably not going to be sufficient for that kind of work.
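
A minimal sketch of that approach with the standard ast module, assuming
the text being analysed is Python source. The sample source string and
the variable names are assumptions for illustration.

```python
import ast

# Hypothetical input: two lines of Python source.
source = "total = price * quantity\nprint(total)\n"

tree = ast.parse(source)

# Collect every name together with its 1-based line and 0-based column.
names = sorted(
    (
        (node.id, node.lineno, node.col_offset)
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    ),
    key=lambda item: (item[1], item[2]),
)

for name, line, col in names:
    print(name, line, col)
```

This only finds names that appear as ast.Name nodes; attribute names,
function parameters and the like live in other node types.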

What exactly are you trying to accomplish here? More info would guide
the recommendations, obviously.

ChrisA


Antoon Pardon
In reply to this post by Chris Angelico
Op 09-03-15 om 14:35 schreef Chris Angelico:
> On Mon, Mar 9, 2015 at 11:26 PM, Antoon Pardon
> <antoon.pardon at rece.vub.ac.be> wrote:
>> It seems odd that one should need such an ugly expression for something that is
>> used rather frequently for parsing computer languages and the like.
> Possibly because computer language parsers don't use regular expressions. :)
>
Trying to be funny by being pedantic. Maybe your experience is different from mine,
but it rarely seems to work IMO.

--
Antoon Pardon.



Chris Angelico
On Tue, Mar 10, 2015 at 1:41 AM, Antoon Pardon
<antoon.pardon at rece.vub.ac.be> wrote:
> Op 09-03-15 om 14:35 schreef Chris Angelico:
>> On Mon, Mar 9, 2015 at 11:26 PM, Antoon Pardon
>> <antoon.pardon at rece.vub.ac.be> wrote:
>>> It seems odd that one should need such an ugly expression for something that is
>>> used rather frequently for parsing computer languages and the like.
>> Possibly because computer language parsers don't use regular expressions. :)
>>
> Trying to be funny by being pedantic. Maybe your experience is different from mine,
> but it rarely seems to work IMO.

Not sure what you mean there - what "rarely seems to work"? I've never
written a language parser based on regexps, which is the point of what
I was saying.

ChrisA


Antoon Pardon
In reply to this post by Chris Angelico
Op 09-03-15 om 15:39 schreef Chris Angelico:

> On Tue, Mar 10, 2015 at 1:34 AM, Antoon Pardon
> <antoon.pardon at rece.vub.ac.be> wrote:
>>> There is str.isidentifier, which returns True if something is a valid
>>> identifier name:
>>>
>>> >>> '℮'.isidentifier()
>>> True
>> Which is not very useful in a context of lexical analysis. I don't need to know
>> if a particular string is useful as an identifier, I want to know which parts of
>> a text are identifiers.
> If you're doing lexical analysis, you probably want a lexer. For
> Python, I would recommend parsing to AST and doing your analysis on
> that; I've had pretty good success doing that, and then using the
> line/column info to go back to the original text if I need it. A regex
> is probably not going to be sufficient for that kind of work.

Maybe I am getting behind, but until now the lexers that I used require a regular
expression per kind of token you want to recognize. At least PLY still seems to
work like that. So if an identifier is one such kind of token, I need a regular
expression that matches what an identifier is.
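
A rough sketch of the "one regular expression per token kind" style,
written with plain re rather than PLY so that it stays self-contained.
The token names, the TOKEN_SPEC table and the tokenize function are
assumptions for illustration, not PLY's API.

```python
import re

# One pattern per token kind; order matters when patterns overlap.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID", r"(?!\d)\w+"),      # approximate identifier: no leading digit
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Yield (kind, value) pairs, dropping whitespace."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
        pos = m.end()

tokens = list(tokenize("r\u00e9sultat = x1 + 42"))
print(tokens)
```

Because the ID pattern relies on \w, non-ASCII letters in identifiers are
picked up without listing them explicitly.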

--
Antoon Pardon



Antoon Pardon
In reply to this post by Chris Angelico
Op 09-03-15 om 15:44 schreef Chris Angelico:

> On Tue, Mar 10, 2015 at 1:41 AM, Antoon Pardon
> <antoon.pardon at rece.vub.ac.be> wrote:
>> Op 09-03-15 om 14:35 schreef Chris Angelico:
>>> On Mon, Mar 9, 2015 at 11:26 PM, Antoon Pardon
>>> <antoon.pardon at rece.vub.ac.be> wrote:
>>>> It seems odd that one should need such an ugly expression for something that is
>>>> used rather frequently for parsing computer languages and the like.
>>> Possibly because computer language parsers don't use regular expressions. :)
>>>
>> Trying to be funny by being pedantic. Maybe your experience is different from mine,
>> but it rarely seems to work IMO.
> Not sure what you mean there - what "rarely seems to work"? I've never
> written a language parser based on regexps, which is the point of what
> I was saying.
>
Maybe you should pay better attention to the history of the thread, then.
I am not talking about parsing, I am talking about lexical analysis, and
that does use regular expressions. And the result of that is used rather
frequently for parsing computer languages, even if the parsing itself
doesn't use regular expressions.

--
Antoon Pardon



Tim Chase
In reply to this post by Antoon Pardon
On 2015-03-09 15:29, Antoon Pardon wrote:

> Op 09-03-15 om 13:50 schreef Tim Chase:
> >>   (?:(?!_|\d)\w)\w+
> > If you don't have to treat it as an atom, you can simplify that to
> > just
> >
> >   (?!_|\d)\w+
> >
> > which just means that the first character can't be an underscore
> > or digit.
> >
> > Though for a Py3 identifier, the underscore is acceptable as a
> > first character ("__init__"), so you can simplify it even further
> > to just
> >
> >   (?!\d)\w+
>
> No, that doesn't work. To begin with, my attempt above should have
> been:
>
>     (?:(?!_|\d)\w)\w*

Did you actually test my suggestion?  The "(?!\d)\w+" means "one or
more Word characters, but the first one can't be a digit" because
the "(?!...)" is zero-width. This should match single-character
strings including a single underscore.

> because an identifier can be just one letter. So when I change the '+'
> into a '*' in your suggestion I get this:
>
> >>> r = re.compile(r"(?!\d)\w*")
> >>> r.match('?')
> <_sre.SRE_Match object; span=(0, 0), match=''>
>
> But the ? is not a letter.

Notice that you match an empty string there because the (?!\d) is
zero width, and thus you match 0-or-more-word-characters by matching
nothing.  Try anchoring it with a "$" at the end to see that
it doesn't really match.
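
A small sketch of that difference, using a literal question mark as the
non-letter sample (an assumption, as are the variable names).

```python
import re

loose = re.compile(r"(?!\d)\w*")

# Unanchored: the lookahead succeeds and \w* matches zero characters,
# so the result is an empty match at position 0.
empty = loose.match("?")
print(empty.span())

# Requiring the whole string to match exposes the failure.
print(loose.fullmatch("?"))
print(re.match(r"(?!\d)\w*$", "?"))

# With "+" a single underscore still matches, since the lookahead
# consumes nothing.
print(re.fullmatch(r"(?!\d)\w+", "_") is not None)
```

A match object with span (0, 0) is truthy, so testing "if r.match(s):"
alone does not show that nothing was actually consumed.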

-tkc






