Converting text file to different encoding.


subhabrata.banerji@gmail.com
I have a few files in the default encoding. I want to change their encoding,
preferably to "UTF-8", or more generally from one encoding to any other encoding.

I was trying it as follows:

   >>> import codecs
   >>> sourceEncoding = "iso-8859-1"
   >>> targetEncoding = "utf-8"
   >>> source = open("source1","w")
   >>> target = open("target", "w")
   >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))

but it gave me the following error:
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    target.write(unicode(source, sourceEncoding).encode(targetEncoding))
TypeError: coercing to Unicode: need string or buffer, file found

Could anybody kindly suggest how I may solve it?

Regards,
Subhabrata Banerjee.
 



Rustom Mody
On Friday, April 17, 2015 at 6:50:08 PM UTC+5:30, subhabrat... at gmail.com wrote:
> I am having few files in default encoding. I wanted to change their encodings,
> preferably in "UTF-8", or may be from one encoding to any other encoding.
>
> I was trying it as follows,
>
>    >>> import codecs
>    >>> sourceEncoding = "iso-8859-1"
>    >>> targetEncoding = "utf-8"
>    >>> source = open("source1","w")

Do you want "w" or "r" ?

>    >>> target = open("target", "w")
>    >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
>
> but it was giving me error as follows,
> Traceback (most recent call last):
>   File "<pyshell#6>", line 1, in <module>
>     target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> TypeError: coercing to Unicode: need string or buffer, file found
>
> If anybody may kindly suggest how may I solve it.
>
> Regards,
> Subhabrata Banerjee.




subhabrata.banerji@gmail.com
In reply to this post by subhabrata.banerji@gmail.com
On Friday, April 17, 2015 at 6:50:08 PM UTC+5:30, subhabrat... at gmail.com wrote:

> I am having few files in default encoding. I wanted to change their encodings,
> preferably in "UTF-8", or may be from one encoding to any other encoding.
>
> I was trying it as follows,
>
>    >>> import codecs
>    >>> sourceEncoding = "iso-8859-1"
>    >>> targetEncoding = "utf-8"
>    >>> source = open("source1","w")
>    >>> target = open("target", "w")
>    >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
>
> but it was giving me error as follows,
> Traceback (most recent call last):
>   File "<pyshell#6>", line 1, in <module>
>     target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> TypeError: coercing to Unicode: need string or buffer, file found
>
> If anybody may kindly suggest how may I solve it.
>
> Regards,
> Subhabrata Banerjee.

As an ace coder you may know better than I do what I need, but if you can give me any workaround or hint, I will practice with it and see if I can port it.



Oscar Benjamin-2
On 17 April 2015 at 14:51,  <subhabrata.banerji at gmail.com> wrote:

> On Friday, April 17, 2015 at 6:50:08 PM UTC+5:30, subhabrat... at gmail.com wrote:
>> I am having few files in default encoding. I wanted to change their encodings,
>> preferably in "UTF-8", or may be from one encoding to any other encoding.
>>
>> I was trying it as follows,
>>
>>    >>> import codecs
>>    >>> sourceEncoding = "iso-8859-1"
>>    >>> targetEncoding = "utf-8"
>>    >>> source = open("source1","w")
>>    >>> target = open("target", "w")
>>    >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
>>
>> but it was giving me error as follows,
>> Traceback (most recent call last):
>>   File "<pyshell#6>", line 1, in <module>
>>     target.write(unicode(source, sourceEncoding).encode(targetEncoding))
>> TypeError: coercing to Unicode: need string or buffer, file found

The error comes from `unicode(source, sourceEncoding)` and results
from the fact that source is a file object when it should be a string.
To read the contents of the file as a string just change `source` to
`source.read()`.
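
For readers on Python 3, where `unicode(data, enc)` is spelled `data.decode(enc)`, the same fix can be sketched roughly as below. This is only an illustration under assumptions: the temporary directory and the sample text are made up so the snippet is self-contained, and the source is opened for reading rather than with "w".

```python
import os
import tempfile

source_encoding = "iso-8859-1"
target_encoding = "utf-8"

# Hypothetical sample file, created here only so the sketch runs on its own.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source1")
target_path = os.path.join(workdir, "target")
with open(source_path, "wb") as f:
    f.write("caf\u00e9".encode(source_encoding))

# Open the source for reading and call read() to get its bytes;
# decoding needs the bytes, not the file object itself.
with open(source_path, "rb") as source:
    raw = source.read()

text = raw.decode(source_encoding)               # bytes -> text
with open(target_path, "wb") as target:
    target.write(text.encode(target_encoding))   # text -> bytes
```

Passing the file object instead of `raw` to the decoding step is what produced the TypeError in the original attempt.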


Oscar



subhabrata.banerji@gmail.com
In reply to this post by subhabrata.banerji@gmail.com
On Friday, April 17, 2015 at 7:36:46 PM UTC+5:30, Oscar Benjamin wrote:

> On 17 April 2015 at 14:51,  <subhabrata.banerji at gmail.com> wrote:
> > On Friday, April 17, 2015 at 6:50:08 PM UTC+5:30, subhabrat... at gmail.com wrote:
> >> I am having few files in default encoding. I wanted to change their encodings,
> >> preferably in "UTF-8", or may be from one encoding to any other encoding.
> >>
> >> I was trying it as follows,
> >>
> >>    >>> import codecs
> >>    >>> sourceEncoding = "iso-8859-1"
> >>    >>> targetEncoding = "utf-8"
> >>    >>> source = open("source1","w")
> >>    >>> target = open("target", "w")
> >>    >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> >>
> >> but it was giving me error as follows,
> >> Traceback (most recent call last):
> >>   File "<pyshell#6>", line 1, in <module>
> >>     target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> >> TypeError: coercing to Unicode: need string or buffer, file found
>
> The error comes from `unicode(source, sourceEncoding)` and results
> from the fact that source is a file object when it should be a string.
> To read the contents of the file as a string just change `source` to
> `source.read()`.
>
>
> Oscar

I tried to do as follows,

>>> import codecs
>>> sourceEncoding = "iso-8859-1"
>>> targetEncoding = "utf-8"
>>> source = open("source1","w")
>>> string1="String type"
>>> str1=str(string1)
>>> source.write(str1)
>>> source.close()
>>> target = open("target", "w")
>>> source=open("source1","r")
>>> target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
>>>

Am I doing this right?




Chris Angelico
On Sat, Apr 18, 2015 at 12:26 AM,  <subhabrata.banerji at gmail.com> wrote:

> I tried to do as follows,
> >>> import codecs
> >>> sourceEncoding = "iso-8859-1"
> >>> targetEncoding = "utf-8"
> >>> source = open("source1","w")
> >>> string1="String type"
> >>> str1=str(string1)
> >>> source.write(str1)
> >>> source.close()
> >>> target = open("target", "w")
> >>> source=open("source1","r")
> >>> target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
> >>>
>
> am I going ok?

Here's how I'd do it.

$ python3
>>> with open("source1", encoding="iso-8859-1") as source, open("target", "w", encoding="utf-8") as target:
...     target.write(source.read())

Or maybe this:

$ pike
> Stdio.write_file("target", string_to_utf8(Stdio.read_file("source1")));

So much easier than fiddling around with all those steps you're doing.
I'm not sure what they're all for, anyway; calling str() on a
double-quoted literal isn't usually going to do anything, and I don't
see "from __future__ import unicode_literals" anywhere.

ChrisA



Dave Angel-4
In reply to this post by subhabrata.banerji@gmail.com
On 04/17/2015 09:19 AM, subhabrata.banerji at gmail.com wrote:
> I am having few files in default encoding. I wanted to change their encodings,
> preferably in "UTF-8", or may be from one encoding to any other encoding.
>

You neglected to specify what Python version this is for.  Other
information that'd be useful is whether the file size is small enough
that two copies of it will all fit reasonably into memory.

I'll assume it's version 2.7, because of various clues in your sample
code.  But if it's version 3.x, it could be substantially easier.

> I was trying it as follows,
>
>     >>> import codecs
>     >>> sourceEncoding = "iso-8859-1"
>     >>> targetEncoding = "utf-8"
>     >>> source = open("source1","w")

Mode "w" will truncate the source1 file, leaving you nothing to process.
I'd suggest "r".
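
A small sketch of the truncation effect, in case it helps to see it. The scratch file lives in a temporary directory (an assumption made here so that nothing real gets emptied), and the snippet is written for Python 3.

```python
import os
import tempfile

# Hypothetical scratch file so no real data is truncated.
path = os.path.join(tempfile.mkdtemp(), "source1")
with open(path, "w") as f:
    f.write("some existing content")
size_before = os.path.getsize(path)

# Opening with "w" empties the file immediately, before anything is written.
open(path, "w").close()
size_after_w = os.path.getsize(path)

# Write the content again, then open with "r": the content is left intact.
with open(path, "w") as f:
    f.write("some existing content")
with open(path, "r") as f:
    content = f.read()
```

After the `open(path, "w")` call the file is empty, which is why the original `source` had nothing left to convert.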

>     >>> target = open("target", "w")

It's not usually a good idea to use the same variable for both the file
name and the opened file object.  What if you need later to print the
name, as in an error message?

>     >>> target.write(unicode(source, sourceEncoding).encode(targetEncoding))

I'd not recommend trying to do so much in one line, at least until you
understand all the pieces.  Programming is not (usually) a contest to
write the most obscure code, but rather to make a program you can still
read and understand six months from now.  And, oh yeah, something that
will run and accomplish something.

>
> but it was giving me error as follows,
> Traceback (most recent call last):
>    File "<pyshell#6>", line 1, in <module>
>      target.write(unicode(source, sourceEncoding).encode(targetEncoding))
> TypeError: coercing to Unicode: need string or buffer, file found


If you factor this you will discover your error.  Nowhere do you read
the source file into a byte string.  And that's what is needed for the
unicode constructor.  Factored, you might have something like:

      encodedtext = source.read()
      text = unicode(encodedtext, sourceEncoding)
      reencodedtext = text.encode(targetEncoding)
      target.write(reencodedtext)

Next, you need to close the files.

     source.close()
     target.close()

There are a number of ways to improve that code, but this is a start.

Improvements:

      Use codecs.open() to open the files, so encoding is handled
implicitly in the file objects.

      Use with... syntax so that the file closes are implicit.

      Read and write the files in a loop, a line at a time, so that you
needn't have all the data in memory (at least twice) at one time.  This
will also help enormously if you encounter any errors, and want to
report the location and problem to the user.  It might even turn out to
be faster.

      You should write non-trivial code in a text file, and run it from
there.
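
The improvements listed above could be combined along these lines. This is a sketch for Python 3, not a definitive implementation: it uses the built-in open() with an encoding argument in place of codecs.open(), and the function name `convert_by_line`, the temporary paths, and the sample bytes are all invented for the example.

```python
import os
import tempfile


def convert_by_line(source_path, target_path, source_encoding, target_encoding):
    """Re-encode a text file one line at a time; return the number of lines read."""
    lineno = 0
    # The with statement closes both files, even if an error occurs.
    # newline="" leaves the original line endings untouched.
    with open(source_path, encoding=source_encoding, newline="") as source, \
         open(target_path, "w", encoding=target_encoding, newline="") as target:
        try:
            for line in source:
                lineno += 1
                target.write(line)
        except UnicodeError as exc:
            # Report how far the conversion got before the bad data turned up.
            raise RuntimeError(
                "conversion failed after reading %d line(s): %s" % (lineno, exc)
            ) from exc
    return lineno


# Hypothetical sample data so the sketch is self-contained.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source1")
target_path = os.path.join(workdir, "target")
with open(source_path, "wb") as f:
    f.write(b"caf\xe9\nna\xefve\n")

line_count = convert_by_line(source_path, target_path, "iso-8859-1", "utf-8")
```

Only one line is held in the program's own variables at a time, and a failure can be reported with an approximate position instead of a bare traceback.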

--
DaveA



Marko Rauhamaa
In reply to this post by subhabrata.banerji@gmail.com
Chris Angelico <rosuav at gmail.com>:

> Here's how I'd do it.
>
> $ python3
> >>> with open("source1", encoding="iso-8859-1") as source, open("target", "w", encoding="utf-8") as target:
> ...     target.write(source.read())

You might run out of memory. How about:

========================================================================
#!/usr/bin/env python3
import shutil
shutil.copyfileobj(
    open("source1", encoding="iso-8859-1"),
    open("target", "w", encoding="utf-8"))
========================================================================


Marko



Dave Angel-3
In reply to this post by Dave Angel-4
On 04/17/2015 10:48 AM, Dave Angel wrote:
> On 04/17/2015 09:19 AM, subhabrata.banerji at gmail.com wrote:

>>     >>> target = open("target", "w")
>
> It's not usually a good idea to use the same variable for both the file
> name and the opened file object.  What if you need later to print the
> name, as in an error message?

Oops, my error.  Somehow my brain didn't notice the quote marks, until I
reread my own message online.



--
DaveA



Peter Otten
In reply to this post by Chris Angelico
Chris Angelico wrote:

> On Sat, Apr 18, 2015 at 12:26 AM,  <subhabrata.banerji at gmail.com> wrote:
>> I tried to do as follows,
>> >>> import codecs
>> >>> sourceEncoding = "iso-8859-1"
>> >>> targetEncoding = "utf-8"
>> >>> source = open("source1","w")
>> >>> string1="String type"
>> >>> str1=str(string1)
>> >>> source.write(str1)
>> >>> source.close()
>> >>> target = open("target", "w")
>> >>> source=open("source1","r")
>> >>> target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
>> >>>
>>
>> am I going ok?
>
> Here's how I'd do it.
>
> $ python3
> >>> with open("source1", encoding="iso-8859-1") as source, open("target", "w", encoding="utf-8") as target:
> ...     target.write(source.read())

This approach is also viable in Python 2.6 and 2.7 if you use io.open()
instead of the builtin.

To limit memory consumption for big files you can replace

target.write(source.read())

with

shutil.copyfileobj(source, target)

If you want to be sure that line endings are preserved open both files with

io.open(..., newline="") # disable newline translation
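
Putting those three suggestions together might look something like the sketch below. It is written against Python 3, where io.open() is the same function as the built-in open(). The helper name `reencode`, the temporary paths, and the sample bytes are assumptions added for illustration.

```python
import io
import os
import shutil
import tempfile


def reencode(source_path, target_path, source_encoding, target_encoding):
    # newline="" disables newline translation on both sides, so a "\r\n"
    # in the source stays "\r\n" in the target.
    with io.open(source_path, encoding=source_encoding, newline="") as source, \
         io.open(target_path, "w", encoding=target_encoding, newline="") as target:
        # copyfileobj copies in chunks rather than reading the whole file.
        shutil.copyfileobj(source, target)


# Hypothetical sample data with Windows-style line endings.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source1")
target_path = os.path.join(workdir, "target")
with open(source_path, "wb") as f:
    f.write(b"caf\xe9\r\nna\xefve\r\n")

reencode(source_path, target_path, "iso-8859-1", "utf-8")

with open(target_path, "rb") as f:
    result = f.read()
```

Reading the target back in binary mode is a simple way to check that only the encoding changed and the line endings did not.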