[Tutor] Finding a specific line in a body of text


[Tutor] Finding a specific line in a body of text

Robert Sjoblom
I'm sorry if the subject is vague, but I can't really explain it very
well. I've been away from programming for a while now (I got a
daughter and a year after that a son, so I've been busy with family
matters). As such, my skills are definitely rusty.

In the file I'm parsing, I'm looking for specific lines. I don't know
the content of these lines but I do know the content that appears two
lines before. As such I thought that maybe I'd flag for a found line
and then flag the next two lines as well, like so:

if keyword in line:
  flag = 1
  continue
if flag == 1 or flag == 2:
  if flag == 1:
    flag = 2
    continue
  if flag == 2:
    list.append(line)

This, however, turned out to be unacceptably slow; this file is 1.1M
lines, and it takes roughly a minute to go through. I have 450 of
these files; I don't have the luxury to let it run for 8 hours.

So I thought that maybe I could use enumerate() somehow, get the index
when I hit the keyword and just append the line at index+2; but I realize
I don't know how to do that. File objects don't have an index method.

For those curious, the data I'm looking for looks like this:
5 72 88 77 90 92
18 80 75 98 84 90
81
12 58 76 77 94 96

There are other parts of the file that contain similar strings of
digits, so I can't just grab any digits I come across either; the only
thing I have to go on is the keyword. It's obvious that my initial
idea was horribly bad (and I knew that as well, but I wanted to first
make sure that I could find what I was after properly). The structure
looks like this (I opted to write \t instead of relying on the tabs
getting formatted properly in the email):

\t\tkeyword=
\t\t{
5 72 88 77 90 92 \t\t}
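Written out, the enumerate() idea from above might look something like
this rough sketch (the hard-coded sample list stands in for a real file
read with readlines()):

```python
# Rough sketch of the enumerate() idea: remember the index where the
# keyword turns up, then keep the line two positions later.
sample = [
    "\t\tkeyword=\n",
    "\t\t{\n",
    "5 72 88 77 90 92 \t\t}\n",
]
results = []
for i, line in enumerate(sample):
    if "keyword=" in line and i + 2 < len(sample):
        results.append(sample[i + 2])
print(results)
```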

--
best regards,
Robert S.
_______________________________________________
Tutor maillist  -  [hidden email]
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] Finding a specific line in a body of text

Steven D'Aprano-8
On Mon, Mar 12, 2012 at 02:56:36AM +0100, Robert Sjoblom wrote:

> In the file I'm parsing, I'm looking for specific lines. I don't know
> the content of these lines but I do know the content that appears two
> lines before. As such I thought that maybe I'd flag for a found line
> and then flag the next two lines as well, like so:
>
> if keyword in line:
>   flag = 1
>   continue
> if flag == 1 or flag == 2:
>   if flag == 1:
>     flag = 2
>     continue
>   if flag == 2:
>     list.append(line)


You haven't shown us the critical part: how are you getting the lines in
the first place?

(Also, you shouldn't shadow built-ins like list as you do above, unless
you know what you are doing. If you have to ask "what's shadowing?", you
don't :)
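For anyone wondering, a tiny illustration of what shadowing a built-in
does (a hypothetical snippet, not from the original code):

```python
list = [1, 2, 3]   # the name "list" now shadows the built-in list type
list.append(4)     # still works as an ordinary list...
try:
    list("abc")    # ...but calling list() as a type now fails
except TypeError as e:
    print(e)       # 'list' object is not callable
```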


> This, however, turned out to be unacceptably slow; this file is 1.1M
> lines, and it takes roughly a minute to go through. I have 450 of
> these files; I don't have the luxury to let it run for 8 hours.

Really? And how many hours have you spent trying to speed this up? Two?
Three? Seven? And if it takes people two or three hours to answer your
question, and you another two or three hours to read it, it would have
been faster to just run the code as given :)

I'm just saying.

Since you don't show the actual critical part of the code, I'm going to
make some simple suggestions that you may or may not have already tried.

- don't read files off USB or CD or over the network, because it will
likely be slow; if you can copy the files onto the local hard drive,
performance may be better;

- but if you include the copying time, it might not make that much
difference;

- can you use a dedicated tool for this, like Unix grep or even perl,
which is optimised for high-speed file manipulations?

- if you need to stick with Python, try this:

# untested
results = []
fp = open('filename')
for line in fp:
    if key in line:  
        # Found key, skip the next line and save the following.
        _ = next(fp, '')
        results.append(next(fp, ''))

By the way, the above assumes you are running Python 2.6 or better. In
Python 2.5, you can define this function:

def next(iterator, default):
    try:
        return iterator.next()
    except StopIteration:
        return default

but it will likely be a little slower.


Another approach may be to read the whole file into memory in one big
chunk. 1.1 million lines, by (say) 50 characters per line comes to about
53 MB per file, which should be small enough to read into memory and
process it in one chunk. Something like this:

# again untested
text = open('filename').read()
results = []
i = 0
while i < len(text):
    i = text.find(key, i)
    if i == -1: break
    i += len(key)  # skip the rest of the key
    # read ahead to the next newline, twice
    i = text.find('\n', i) + 1
    i = text.find('\n', i) + 1
    # now find the following newline, and save everything up to that
    p = text.find('\n', i)
    if p == -1:  p = len(text)
    results.append(text[i:p])
    i = p  # skip ahead


This will likely break if the key is found without two more lines
following it.



--
Steven

Re: [Tutor] Finding a specific line in a body of text

Robert Sjoblom
> You haven't shown us the critical part: how are you getting the lines in
> the first place?

Ah, yes --
with open(address, "r", encoding="cp1252") as instream:
    for line in instream:

> (Also, you shouldn't shadow built-ins like list as you do above, unless
> you know what you are doing. If you have to ask "what's shadowing?", you
> don't :)
Maybe I should have said list_name.append() instead; sorry for that.

>> This, however, turned out to be unacceptably slow; this file is 1.1M
>> lines, and it takes roughly a minute to go through. I have 450 of
>> these files; I don't have the luxury to let it run for 8 hours.
>
> Really? And how many hours have you spent trying to speed this up? Two?
> Three? Seven? And if it takes people two or three hours to answer your
> question, and you another two or three hours to read it, it would have
> been faster to just run the code as given :)
Yes, for one set of files. Since I don't know how many sets of ~450
files I'll have to run this over, I think that asking for help was a
rather acceptable loss of time. I work on other parts while waiting
anyway, or try and find out on my own as well.

> - if you need to stick with Python, try this:
>
> # untested
> results = []
> fp = open('filename')
> for line in fp:
>    if key in line:
>        # Found key, skip the next line and save the following.
>        _ = next(fp, '')
>        results.append(next(fp, ''))

Well that's certainly faster, but not fast enough.
Oh well, I'll continue looking for a solution -- because even with the
speedup it's unacceptable. I'm hoping against hope that I only have to
run it against the last file of each batch of files, but if it turns
out that I don't, I'm in for some exciting days of finding stuff out.
Thanks for all the help though, it's much appreciated!

How do you approach something like this, when someone tells you "we
need you to parse these files. We can't tell you how they're
structured so you'll have to figure that out yourself."? It's just so
much text that it's hard to get a grasp on the structure, and
there's so much information contained in there as well; this is just
the first part of what I'm afraid will be many. I'll try not to bother
this list too much though.
--
best regards,
Robert S.

Re: [Tutor] Finding a specific line in a body of text

ian douglas-3

Erik Rose gave a good talk today at PyCon about a parsing library he's working on called Parsimonious. You could maybe look into what he's doing there and see if that helps you any. Follow him on Twitter at @erikrose to see when his session's video is up. His session was named "Parsing Horrible Things in Python".

On Mar 11, 2012 9:48 PM, "Robert Sjoblom" <[hidden email]> wrote:

Re: [Tutor] Finding a specific line in a body of text

Steven D'Aprano-8
In reply to this post by Robert Sjoblom
On Mon, Mar 12, 2012 at 05:46:39AM +0100, Robert Sjoblom wrote:
> > You haven't shown us the critical part: how are you getting the lines in
> > the first place?
>
> Ah, yes --
> with open(address, "r", encoding="cp1252") as instream:
>     for line in instream:

Seems reasonable.


> > (Also, you shouldn't shadow built-ins like list as you do above, unless
> > you know what you are doing. If you have to ask "what's shadowing?", you
> > don't :)
> Maybe I should have said list_name.append() instead; sorry for that.

No problems :) Shadowing builtins is fine if you know what you're doing,
but it's the people who do it without realising that end up causing
themselves trouble.


> >> This, however, turned out to be unacceptably slow; this file is 1.1M
> >> lines, and it takes roughly a minute to go through. I have 450 of
> >> these files; I don't have the luxury to let it run for 8 hours.
> >
> > Really? And how many hours have you spent trying to speed this up? Two?
> > Three? Seven? And if it takes people two or three hours to answer your
> > question, and you another two or three hours to read it, it would have
> > been faster to just run the code as given :)
> Yes, for one set of files. Since I don't know how many sets of ~450
> files I'll have to run this over, I think that asking for help was a
> rather acceptable loss of time. I work on other parts while waiting
> anyway, or try and find out on my own as well.

All very reasonable. So long as you have considered the alternatives.


> > - if you need to stick with Python, try this:
> >
> > # untested
> > results = []
> > fp = open('filename')
> > for line in fp:
> >    if key in line:
> >        # Found key, skip the next line and save the following.
> >        _ = next(fp, '')
> >        results.append(next(fp, ''))
>
> Well that's certainly faster, but not fast enough.

You may have to consider that your bottleneck is not the speed of your
Python code, but the speed of getting data off the disk into memory. In
which case, you may be stuck.

I suggest you time how long it takes to process a file using the above,
then compare it to how long just reading the file takes:

from time import clock
t = clock()
for line in open('filename', encoding='cp1252'):
    pass
print(clock() - t)

Run both timings a couple of times and pick the smallest number, to
minimise caching effects and other extraneous influences.

Then do the same using a system tool. You're using Windows, right? I
can't tell you how to do it in Windows, but on Linux I'd say:

time cat 'filename' > /dev/null

which should give me a rough-and-ready estimate of the raw speed of
reading data off the disk. If this speed is not *significantly* better
than you are getting in Python, then there simply isn't any feasible way
to speed the code up appreciably. (Except maybe get faster hard drives
or smaller files.)

[...]
> How do you approach something like this, when someone tells you "we
> need you to parse these files. We can't tell you how they're
> structured so you'll have to figure that out yourself."?

Bitch and moan quietly to myself, and then smile when I realise I'm
being paid by the hour.

Reverse-engineering a file structure without any documentation is rarely
simple or fast.



--
Steven

Re: [Tutor] Finding a specific line in a body of text

Alan Gauld
In reply to this post by Steven D'Aprano-8
On 12/03/12 03:28, Steven D'Aprano wrote:

> Another approach may be to read the whole file into memory in one big
> chunk. 1.1 million lines, by (say) 50 characters per line comes to about
> 53 MB per file, which should be small enough to read into memory and
> process it in one chunk. Something like this:
>
> # again untested
> text = open('filename').read()
> results = []
> i = 0
> while i < len(text):
>      i = text.find(key, i)
>      if i == -1: break
>      i += len(key)  # skip the rest of the key
>      # read ahead to the next newline, twice
>      i = text.find('\n', i) + 1
>      i = text.find('\n', i) + 1
>      # now find the following newline, and save everything up to that
>      p = text.find('\n', i)
>      if p == -1:  p = len(text)
>      results.append(text[i:p])
>      i = p  # skip ahead

Or using readlines:

index = 0
results = []
text = open('filename').readlines()
# note: list.index() only matches whole lines, so scan for the
# key as a substring instead
while index < len(text) - 2:
   if key in text[index]:
     results.append(text[index + 2])
   index += 1

readlines will take slightly more memory.

But I suspect a tool like grep will be faster. grep
can be downloaded for Windows.

To use grep explore the -A option.

Even using grep as a pre-filter to pipe into your
program might work.
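As a sketch of that pre-filter idea (using awk rather than grep here,
since an awk one-liner can emit exactly the wanted line, while grep -A2
prints the matching line plus the two lines after it):

```shell
# Print the line two lines after each "keyword=" line; the printf
# just fakes a small sample file on stdin.
printf '\t\tkeyword=\n\t\t{\n5 72 88 77 90 92\nnoise\n' |
awk '/keyword=/ { n = NR + 2 } NR == n'
```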

But you may also have to accept that processing 450
large files will take some time! You can help by
parallel processing, up to the number of cores (less one)
in your PC. But other than that you may just need a
faster computer! Either more RAM or an SSD drive will
help greatly.
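The parallel-processing suggestion could be sketched with
multiprocessing.Pool along these lines (the generated sample files and
the extract() helper are hypothetical stand-ins for the real batch):

```python
import os
import tempfile
from multiprocessing import Pool

def extract(path):
    """Collect the line two lines after each keyword line in one file."""
    results = []
    with open(path, encoding="cp1252") as fp:
        for line in fp:
            if "keyword=" in line:
                next(fp, '')                          # skip the "{" line
                results.append(next(fp, '').strip())  # keep the data line
    return results

if __name__ == '__main__':
    # Two tiny generated files stand in for the real batch of 450.
    paths = []
    for data in ("5 72 88 77 90 92", "18 80 75 98 84 90"):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, 'w', encoding="cp1252") as f:
            f.write("\t\tkeyword=\n\t\t{\n%s \t\t}\n" % data)
        paths.append(path)
    with Pool(2) as pool:
        print(pool.map(extract, paths))
    for path in paths:
        os.remove(path)
```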

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/


Re: [Tutor] Finding a specific line in a body of text

Emile van Sebille
In reply to this post by Robert Sjoblom
On 3/11/2012 6:56 PM Robert Sjoblom said...

> I'm sorry if the subject is vague, but I can't really explain it very
> well. I've been away from programming for a while now (I got a
> daughter and a year after that a son, so I've been busy with family
> matters). As such, my skills are definitely rusty.
>
> In the file I'm parsing, I'm looking for specific lines. I don't know
> the content of these lines but I do know the content that appears two
> lines before. As such I thought that maybe I'd flag for a found line
> and then flag the next two lines as well, like so:
>


If, as others suggest, the files do fit in memory, you might try:

content = open(thisfile).read()
myresults = []
for fragment in content.split(keyword)[1:]:
     # the wanted digits sit on the second line after the keyword line
     myresults.append(fragment.split('\n')[2])


Emile

