Parsing XML file with Minidom has problem with cr/lf


Parsing XML file with Minidom has problem with cr/lf

Peterson, Wayne

I am parsing an XML file with Python 2.6.5 minidom in Windows and it is mostly working but minidom seems to have problems dealing with Windows cr/lf characters. It creates an extra textnode that needs to be ignored instead of just returning the xml elements. I have tried different methods of opening the file but it doesn’t seem to make a difference. It is happiest when reading a file in Unix format.

 

Wayne Peterson | Consultant
Sierra Systems

(T): 403-264-0955 (C): 403-710-9248 (F): 403-233-2108

7th Floor, Canadian Centre

833 4th Avenue SW
Calgary, Alberta, T2P 3T5

Management Consulting | System Integration | Managed Services
website: www.SierraSystems.com

----Notice Regarding Confidentiality----
This email, including any and all attachments, (this "Email") is intended only for the party to whom it is addressed and may contain information that is confidential or privileged. Sierra Systems Group Inc. and its affiliates accept no responsibility for any loss or damage suffered by any person resulting from any unauthorized use of or reliance upon this Email. If you are not the intended recipient, you are hereby notified that any dissemination, copying or other use of this Email is prohibited. Please notify us of the error in communication by return email and destroy all copies of this Email. Thank you.
_______________________________________________
XML-SIG maillist  -  [hidden email]
http://mail.python.org/mailman/listinfo/xml-sig


Re: Parsing XML file with Minidom has problem with cr/lf

Stefan Behnel-3
Peterson, Wayne, 09.05.2010 08:43:
> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
> mostly working but minidom seems to have problems dealing with Windows
> cr/lf characters. It creates an extra textnode that needs to be ignored
> instead of just returning the xml elements. I have tried different
> methods of opening the file but it doesn't seem to make a difference. It
> is happiest when reading a file in Unix format.

Whitespace is significant in the W3C DOM, so minidom must provide it in the
DOM tree. It doesn't "have problems" because it creates text nodes for
them, that's just the way things work.

Note that the xml.etree.ElementTree package tends to be a lot more user
friendly for XML handling than the minidom package, simply because it
focuses on the XML Infoset and moves text out of the way when dealing with
elements.
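
As a rough illustration of the difference described above, here is a minimal sketch that parses the same document with both packages. It assumes Python 3; the sample document and its element names (form, survey) are made up for the example.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document with Windows (cr/lf) line endings.
XML = b"<form>\r\n  <survey id='1'/>\r\n  <survey id='2'/>\r\n</form>"

# minidom: the whitespace between elements shows up as Text nodes
# (nodeName "#text") interleaved with the element nodes.
dom_root = minidom.parseString(XML).documentElement
dom_kinds = [node.nodeName for node in dom_root.childNodes]

# ElementTree: iterating an element yields child elements only; the
# whitespace is kept in the .text and .tail attributes instead.
et_root = ET.fromstring(XML)
et_tags = [child.tag for child in et_root]

print(dom_kinds)
print(et_tags)
print(repr(et_root.text))
```

In both cases the whitespace is still there; ElementTree just keeps it out of the child list, which is usually what people want when they iterate over elements.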

Stefan

Re: Parsing XML file with Minidom has problem with cr/lf

Dieter Maurer
In reply to this post by Peterson, Wayne
Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>mostly working but minidom seems to have problems dealing with Windows
>cr/lf characters. It creates an extra textnode that needs to be ignored
>instead of just returning the xml elements. I have tried different
>methods of opening the file but it doesn't seem to make a difference. It
>is happiest when reading a file in Unix format.

The parser should not see these "cr/lf" characters at all.

Python strings themselves use only "\n" (aka "lf") to delimit lines.
The "\r" (aka "cr") should only be introduced when those lines
are written to text files. And they should be removed when
those lines are read in again.

Are you sure that you access your files as "text" files?



--
Dieter

Re: Parsing XML file with Minidom has problem with cr/lf

Stefan Behnel-3
Dieter Maurer, 10.05.2010 07:50:

> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>> mostly working but minidom seems to have problems dealing with Windows
>> cr/lf characters. It creates an extra textnode that needs to be ignored
>> instead of just returning the xml elements. I have tried different
>> methods of opening the file but it doesn't seem to make a difference. It
>> is happiest when reading a file in Unix format.
>
> The parser should not see these "cr/lf" characters at all.
>
> Python strings themselves use only "\n" (aka "lf") to delimit lines.
> The "\r" (aka "cr") should only be introduced when those lines
> are written to text files. And they should be removed when
> those lines are read in again.
>
> Are you sure that you access your files as "text" files?

The correct way to parse XML files is as binary data.

Stefan

Re: Parsing XML file with Minidom has problem with cr/lf

Dieter Maurer
Stefan Behnel wrote at 2010-5-10 08:57 +0200:

>Dieter Maurer, 10.05.2010 07:50:
>> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>>> mostly working but minidom seems to have problems dealing with Windows
>>> cr/lf characters. It creates an extra textnode that needs to be ignored
>>> instead of just returning the xml elements. I have tried different
>>> methods of opening the file but it doesn't seem to make a difference. It
>>> is happiest when reading a file in Unix format.
>>
>> The parser should not see these "cr/lf" characters at all.
>>
>> Python strings themselves use only "\n" (aka "lf") to delimit lines.
>> The "\r" (aka "cr") should only be introduced when those lines
>> are written to text files. And they should be removed when
>> those lines are read in again.
>>
>> Are you sure that you access your files as "text" files?
>
>The correct way to parse XML files is as binary data.

Why do you think so?

The default "minidom" parser seems not to expect "\r\n" line endings....



--
Dieter

Re: Parsing XML file with Minidom has problem with cr/lf

Stefan Behnel-3
Dieter Maurer, 10.05.2010 09:07:

> Stefan Behnel wrote at 2010-5-10 08:57 +0200:
>> Dieter Maurer, 10.05.2010 07:50:
>>> Peterson, Wayne wrote at 2010-5-8 23:43 -0700:
>>>> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
>>>> mostly working but minidom seems to have problems dealing with Windows
>>>> cr/lf characters. It creates an extra textnode that needs to be ignored
>>>> instead of just returning the xml elements. I have tried different
>>>> methods of opening the file but it doesn't seem to make a difference. It
>>>> is happiest when reading a file in Unix format.
>>>
>>> The parser should not see these "cr/lf" characters at all.
>>>
>>> Python strings themselves use only "\n" (aka "lf") to delimit lines.
>>> The "\r" (aka "cr") should only be introduced when those lines
>>> are written to text files. And they should be removed when
>>> those lines are read in again.
>>>
>>> Are you sure that you access your files as "text" files?
>>
>> The correct way to parse XML files is as binary data.
>
> Why do you think so?
>
> The default "minidom" parser seems not to expect "\r\n" line endings....

Interesting. Then this might really be a bug. There was a change in Python
2.6.5 that broke universal newline handling for the codecs module, this
might hit here.

However, according to what the OP described, the cr/lf characters turn up
correctly now, so ISTM that it's the plain '\n' line ending that needs fixing.

Stefan

Re: Parsing XML file with Minidom has problem with cr/lf

Peterson, Wayne
In reply to this post by Dieter Maurer
That's what I thought as well. I was expecting the parser to ignore all
forms of linefeed.

I believe I am accessing my files as text files. The documentation for
minidom.parse says you can pass it a file name or a file object and I
have tried it both ways with the same result. Here is the open statement
I am using.

infile = open(in_path_file, 'r')
in_xmldoc = minidom.parse(infile)

The input file contains cr/lf line endings, x'0d0a'.

When I do something like,

surveys = form.childNodes

the first node in that list, surveys[0], will contain x'0a' which I have to ignore.

Wayne  




Re: Parsing XML file with Minidom has problem with cr/lf

Bill Kinnersley
In reply to this post by Peterson, Wayne
> I am parsing an XML file with Python 2.6.5 minidom in Windows and it is
> mostly working but minidom seems to have problems dealing with Windows
> cr/lf characters. It creates an extra textnode that needs to be ignored
> instead of just returning the xml elements. I have tried different
> methods of opening the file but it doesn’t seem to make a difference. It
> is happiest when reading a file in Unix format.

Wayne,

It sounds to me like you're doing everything correctly.

- XML files are text files, and should be read as text.

- In the absence of a DTD, all whitespace is regarded as significant.
Typically this means yes, there will be a text node between consecutive
element nodes.

- The XML processor is required to return end-of-line as a single '\n',
regardless of which OS or programming language.

If you are traversing every node, you'll need to explicitly ignore the
text nodes. More usually you don't have to deal with them, because you
know what nodes you're looking for and pick them out with
getElementsByTagName.
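
The two approaches above can be sketched roughly as follows. This assumes Python 3, and the document and element names (form, survey) are hypothetical stand-ins for the real file.

```python
from xml.dom import minidom, Node

# Hypothetical sample document with Windows (cr/lf) line endings.
XML = b"<form>\r\n<survey>a</survey>\r\n<survey>b</survey>\r\n</form>"
form = minidom.parseString(XML).documentElement

# The first child is the whitespace text node, already normalized to "\n".
first_child_data = form.firstChild.data

# Option 1: traverse every child and explicitly skip non-element nodes.
element_children = [
    node for node in form.childNodes
    if node.nodeType == Node.ELEMENT_NODE
]

# Option 2: pick out the elements by name and never see the text nodes.
surveys = form.getElementsByTagName("survey")
survey_texts = [survey.firstChild.data for survey in surveys]

print(repr(first_child_data), len(element_children), survey_texts)
```

Note that getElementsByTagName searches all descendants, not only direct children, so option 1 is the safer choice when the same tag name can appear at several depths.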



Re: Parsing XML file with Minidom has problem with cr/lf

Fred Drake-3
On Mon, May 10, 2010 at 1:59 PM, Bill Kinnersley <[hidden email]> wrote:
> - XML files are text files, and should be read as text.

XML files contain encoded text, and must be handled as binary files.


  -Fred

--
Fred L. Drake, Jr.    <fdrake at gmail.com>
"Chaos is the score upon which reality is written." --Henry Miller

Re: Parsing XML file with Minidom has problem with cr/lf

Stefan Behnel-3
In reply to this post by Bill Kinnersley
Bill Kinnersley, 10.05.2010 19:59:
> - XML files are text files, and should be read as text.

Sorry, but the only sane way to read them is as binary data. Passing
unicode text to the parser will interfere with the encoding declaration at
the beginning.
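
A minimal sketch of the binary approach, assuming Python 3: the file is opened in "rb" mode so the parser sees the raw bytes and applies the encoding named in the XML declaration itself. The temporary file and its content are made up for the example.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical document stored as ISO-8859-1, as declared in its prolog.
# "\u00f6" is the letter o with a diaeresis.
text = '<?xml version="1.0" encoding="iso-8859-1"?>\r\n<name>L\u00f6wis</name>'
raw = text.encode("iso-8859-1")

# Write the bytes to a temporary file so there is something to open.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: no newline translation and no decoding by Python;
    # the parser decodes according to the declaration.
    with open(path, "rb") as infile:
        doc = minidom.parse(infile)
finally:
    os.remove(path)

name = doc.documentElement.firstChild.data
print(name)
```

Opening the same file in text mode would make Python decode it with the platform default encoding before the parser ever sees the declaration, which is the interference described above.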


> - The XML processor is required to return end-of-line as a single '\n',
> regardless of which OS or programming language.

Interesting. I wasn't aware of that, but it's true.

http://www.w3.org/TR/REC-xml/#sec-line-ends
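
The line-end rule in that section can be checked with a short sketch like the one below (Python 3 assumed; the note element is a made-up example). Each sample uses a different line-ending convention inside the element content.

```python
from xml.dom import minidom

# Hypothetical samples: the same content with three line-ending styles.
samples = {
    "crlf": b"<note>one\r\ntwo</note>",
    "lf": b"<note>one\ntwo</note>",
    "cr": b"<note>one\rtwo</note>",
}

parsed = {}
for label, raw in samples.items():
    doc = minidom.parseString(raw)
    # Merge adjacent text nodes, in case the parser delivered the
    # character data in more than one piece.
    doc.normalize()
    parsed[label] = doc.documentElement.firstChild.data

for label, data in parsed.items():
    print(label, repr(data))
```

All three inputs are expected to come back with a single "\n" between the two words, because the normalization happens inside the XML processor and not in Python's file layer.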

Stefan

Re: Parsing XML file with Minidom has problem with cr/lf

"Martin v. Löwis"
In reply to this post by Dieter Maurer
>> The correct way to parse XML files is as binary data.
>
> Why do you think so?
>
> The default "minidom" parser seems not to expect "\r\n" line endings....

Why do you say that? It expects them just fine, replacing them with \n
line endings, then inserting those into the DOM tree. Just as it should.
I believe the OP was complaining that it creates those text nodes in
the first place, not that it does or does not specifically do that for
\r\n line endings.

Regards,
Martin

Re: Parsing XML file with Minidom has problem with cr/lf

Dieter Maurer
"Martin v. Löwis" wrote at 2010-5-11 09:14 +0200:

>>> The correct way to parse XML files is as binary data.
>>
>> Why do you think so?
>>
>> The default "minidom" parser seems not to expect "\r\n" line endings....
>
>Why do you say that? It expects them just fine, replacing them with \n
>line endings, then inserting those into the DOM tree. Just as it should.
>I believe the OP was complaining that it creates those text nodes in
>the first place, not that it does or does not specifically do that for
>\r\n line endings.

I may have misunderstood the original problem report.
I have read it as: I see "\r\n" text nodes.



--
Dieter

Re: Parsing XML file with Minidom has problem with cr/lf

Peterson, Wayne
In reply to this post by Stefan Behnel-3
Thank you everyone for the excellent replies.

As someone noticed, my original complaint was that the parser was
returning linefeeds at all in the DOM tree. I thought that the Windows
cr/lf format was causing this but I now understand that this is what it
is supposed to do.
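
For anyone who would rather not see those linefeed nodes at all, one possible approach is to prune them after parsing. This is only a sketch, assuming Python 3; strip_whitespace_nodes is a hypothetical helper written for this example, not part of minidom, and the sample document is made up.

```python
from xml.dom import minidom, Node


def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, because removeChild mutates childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)


# Hypothetical sample document with Windows (cr/lf) line endings.
doc = minidom.parseString(
    b"<form>\r\n  <survey>\r\n    <q>yes</q>\r\n  </survey>\r\n</form>"
)
strip_whitespace_nodes(doc.documentElement)

remaining = [node.nodeName for node in doc.documentElement.childNodes]
compact = doc.documentElement.toxml()
print(remaining)
print(compact)
```

This is only appropriate when the whitespace really is just layout. In documents with mixed content, whitespace-only text nodes can carry meaning and should be left alone.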

I received conflicting advice on whether to process the XML files as
binary or text but that is a topic for a different thread.

Wayne


