
CSV writer question


CSV writer question

Jason Swails
Hello,

I have a question about a csv.writer instance.  I have a utility that needs to write a full CSV file from a large amount of data, but for performance (and memory) reasons there's no way I can write the data sequentially.  Therefore, I write the data in chunks to temporary files, then combine them all at the end.  For convenience, I declare each writer instance via a statement like

my_csv = csv.writer(open('temp.1.csv', 'wb'))

so the open file object isn't bound to any explicit reference, and I don't know how to reference it inside the writer class (the documentation doesn't say, unless I've missed the obvious).  Thus, the only way I can think of to make sure that all of the data is written before I start copying these files sequentially into the final file is to unbuffer them so the above command is changed to

my_csv = csv.writer(open('temp.1.csv', 'wb', 0))

unless, of course, I add an explicit reference to track the open file object and manually close or flush it (but I'd like to avoid that if possible).  My question is twofold.  Is there a way to do this directly via the CSV API, or is the approach I'm taking the only way short of binding the open file object to another reference?  Secondly, if these files are potentially very large (anywhere from ~1 KB to 20 GB depending on the amount of data present), what kind of performance hit will I take by disabling buffering on files like these?

Tips, answers, comments, and/or suggestions are all welcome.

Thanks a lot!
Jason


As an afterthought, I suppose I could always subclass the csv.writer class and add the reference I want to that, which I may do if there's no other convenient solution.
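[Editorial aside: csv.writer is actually a factory function rather than a class, so subclassing it isn't straightforward; a thin delegating wrapper that keeps hold of the file is the more direct route.  A minimal sketch, with hypothetical names, shown in Python 3 syntax (the thread's Python 2 code would use open(name, 'wb')):]

```python
import csv

class CSVChunkWriter(object):
    """Hypothetical wrapper: pairs a csv.writer with its underlying
    file object so the file can be flushed or closed explicitly."""

    def __init__(self, filename):
        # keep the file reference that csv.writer itself doesn't expose
        self._file = open(filename, 'w', newline='')
        self._writer = csv.writer(self._file)

    def writerow(self, row):
        self._writer.writerow(row)

    def writerows(self, rows):
        self._writer.writerows(rows)

    def flush(self):
        # force buffered rows out to disk without closing
        self._file.flush()

    def close(self):
        # closing also flushes any remaining buffered data
        self._file.close()
```

With something like this, each chunk file can be flushed (or closed) before the final concatenation, and full buffering stays enabled.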

--
http://mail.python.org/mailman/listinfo/python-list

Re: CSV writer question

Chris Rebert-6
On Sun, Oct 23, 2011 at 10:18 PM, Jason Swails <[hidden email]> wrote:

> [...]
>
> my_csv = csv.writer(open('temp.1.csv', 'wb', 0))
>
> unless, of course, I add an explicit reference to track the open file object
> and manually close or flush it
> (but I'd like to avoid it if possible).

Why? Especially when the performance cost is likely to be nontrivial...

> Is there a way to do that directly via the CSV API,

Very doubtful; csv.writer (and reader for that matter) is implemented
in C, doesn't expose a ._file or similar attribute, and has no
.close() or .flush() methods.

Cheers,
Chris
--
http://rebertia.com

Re: CSV writer question

Chris Angelico
In reply to this post by Jason Swails
On Mon, Oct 24, 2011 at 4:18 PM, Jason Swails <[hidden email]> wrote:
> my_csv = csv.writer(open('temp.1.csv', 'wb'))
>

Have you confirmed, or can you confirm, whether or not the file gets
closed automatically when the writer gets destructed? If so, all you
need to do is:

my_csv = something_else
# or:
del my_csv

to unbind what I assume is the only reference to the csv.writer, upon
which it should promptly clean itself up.
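[Editorial aside: a quick check of this behaviour, which holds on CPython because of reference counting; the file name here is illustrative:]

```python
import csv

# the open file object's only reference is held by the writer
my_csv = csv.writer(open('tmp_demo.csv', 'w', newline=''))
my_csv.writerow(['hello', 'world'])

del my_csv  # drop the only reference to the writer; on CPython the
            # file's refcount hits zero, so it is closed (and its
            # buffer flushed) immediately

with open('tmp_demo.csv') as f:
    print(f.read().strip())  # hello,world
```

As Andrew notes later in the thread, this prompt cleanup is a CPython implementation detail, not a language guarantee.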

ChrisA

Re: CSV writer question

Jason Swails
In reply to this post by Chris Rebert-6


On Mon, Oct 24, 2011 at 2:08 AM, Chris Rebert <[hidden email]> wrote:
On Sun, Oct 23, 2011 at 10:18 PM, Jason Swails <[hidden email]> wrote:

> unless, of course, I add an explicit reference to track the open file object
> and manually close or flush it
> (but I'd like to avoid it if possible).

Why? Especially when the performance cost is likely to be nontrivial...

Because if the CSV API exposed the file object, I wouldn't have to create that extra reference, and the class that handles this stuff already has enough attributes without adding a separate one for each CSV writer it instantiates.  It's not a serious objection to creating it, just something I'd rather avoid (it's less elegant, IMO, for whatever that's worth).


> Is there a way to do that directly via the CSV API,

Very doubtful; csv.writer (and reader for that matter) is implemented
in C, doesn't expose a ._file or similar attribute, and has no
.close() or .flush() methods.

The machinery is implemented in C, but it's wrapped in Python, and it's a Python file object being passed to the writer, so I thought maybe there was a 'standard' method of exposing file objects in these kinds of cases that I just wasn't aware of, but it appears not.

Thanks for the info!
Jason


Re: CSV writer question

Jason Swails
In reply to this post by Chris Angelico


On Mon, Oct 24, 2011 at 3:03 AM, Chris Angelico <[hidden email]> wrote:
On Mon, Oct 24, 2011 at 4:18 PM, Jason Swails <[hidden email]> wrote:
> my_csv = csv.writer(open('temp.1.csv', 'wb'))
>

Have you confirmed, or can you confirm, whether or not the file gets
closed automatically when the writer gets destructed? If so, all you
need to do is:

my_csv = something_else
# or:
del my_csv

I'm not sure why I decided against this approach in the first place.  This does work (at least in my test), so it's what I'll do.  I probably wasn't confident that it would clean itself up properly, but behavior like that would be rather un-pythonic and presumably wouldn't have made it into the stdlib.

Thanks!
Jason



Re: CSV writer question

Andrew McLean-7
In reply to this post by Chris Angelico
On 24/10/2011 08:03, Chris Angelico wrote:

> On Mon, Oct 24, 2011 at 4:18 PM, Jason Swails<[hidden email]>  wrote:
>> my_csv = csv.writer(open('temp.1.csv', 'wb'))
>>
> Have you confirmed, or can you confirm, whether or not the file gets
> closed automatically when the writer gets destructed? If so, all you
> need to do is:
>
> my_csv = something_else
> # or:
> del my_csv
>
> to unbind what I assume is the only reference to the csv.writer, upon
> which it should promptly clean itself up.
My understanding is that in CPython the file does get closed when the
writer is deleted; however, that's not guaranteed to happen in other
Python implementations (e.g. IronPython, PyPy and Jython).

Andrew


Re: CSV writer question

Peter Otten
In reply to this post by Jason Swails
Jason Swails wrote:

> [...]
>
> As an afterthought, I suppose I could always subclass the csv.writer class
> and add the reference I want to that, which I may do if there's no other
> convenient solution.

A contextmanager might help:

import csv
from contextlib import contextmanager

@contextmanager
def filewriter(filename):
    with open(filename, "wb") as outstream:
        yield csv.writer(outstream)


if __name__ == "__main__":
    with filewriter("tmp.csv") as writer:
        writer.writerows([
                ["alpha", "beta"],
                ["gamma", "delta"]])
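
[Editorial aside: the same pattern extends to the chunk-and-combine workflow from the original post; a sketch with illustrative file names and data, in Python 3 syntax:]

```python
import csv
import os
from contextlib import contextmanager

@contextmanager
def filewriter(filename):
    # leaving the with-block closes (and therefore flushes) the file
    with open(filename, "w", newline="") as outstream:
        yield csv.writer(outstream)

chunk_files = ["temp.1.csv", "temp.2.csv"]              # illustrative names
chunk_data = [[["alpha", "beta"]], [["gamma", "delta"]]]

# write each chunk; every temp file is closed when its block exits
for name, rows in zip(chunk_files, chunk_data):
    with filewriter(name) as writer:
        writer.writerows(rows)

# concatenate the already-flushed chunks into the final file
with open("combined.csv", "w") as final:
    for name in chunk_files:
        with open(name) as chunk:
            final.write(chunk.read())
        os.remove(name)
```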

