How to populate DB from PDF extracted data


How to populate DB from PDF extracted data

Shazia Nusrat
Hi,

I am trying to work with PDFs: a user uploads a PDF through an ImageField or FileField, and I then need a way to extract its contents in Django and update a DB table based on them. The models are as follows:

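class StudentFee(models.Model):
    class_name = models.CharField(choices=CLASSES, max_length=200)
    # FileField rather than ImageField: ImageField validation accepts
    # only images, so a PDF upload would be rejected.
    fee_deposit_slip = models.FileField(upload_to="students/")

    def __unicode__(self):
        return unicode(self.class_name)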

All I need is a view that extracts the data from the uploaded PDF and saves it into the model below:

class StudentInfo(models.Model):
    first_name = models.CharField(max_length=200)
    last_name = models.CharField(max_length=200)
    email = models.EmailField()
    phone = PhoneField()  # custom PhoneField

    def __unicode__(self):
        return unicode(self.first_name)

All the fields of the second model exist in the PDF. In my views.py:

class StudentPDFReader(FormView):
    template_name = 'pdfdata.html'
    form_class = PDFForm
    success_url = '/success/'

    def form_valid(self, form):
        # here I need to extract the data and add entries to the model form
        return super(StudentPDFReader, self).form_valid(form)
            
Looking for kind help.

Regards,
Shazia
 

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAD83tOyHABrpfn48EwMgjbvCB5y1U4AwwL_%2BS1EnCb6WebyWKw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Re: How to populate DB from PDF extracted data

m1chael-2
Good luck.

In my opinion, the best-case scenario is to use a text-extraction utility such as pdftotext together with regular expressions, and even that will be painful.
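
A minimal sketch of that approach, assuming the Poppler `pdftotext` command-line tool is installed and that the slip contains labelled lines such as "First Name: ...". The labels, the regex patterns and the function names are illustrative assumptions, not taken from the actual PDF:

```python
import re
import subprocess

# Hypothetical label patterns; adjust them to the real layout of the slip.
FIELD_PATTERNS = {
    "first_name": re.compile(r"^[ \t]*First Name:\s*(.+)$", re.MULTILINE),
    "last_name": re.compile(r"^[ \t]*Last Name:\s*(.+)$", re.MULTILINE),
    "email": re.compile(r"^[ \t]*Email:\s*(\S+@\S+)[ \t]*$", re.MULTILINE),
    "phone": re.compile(r"^[ \t]*Phone:\s*([+\d][\d -]*)$", re.MULTILINE),
}


def pdf_to_text(path):
    """Run `pdftotext -layout <path> -` and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")


def parse_student_fields(text):
    """Return a dict of the fields found in the text; missing fields are omitted."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields
```

Inside form_valid, the resulting dict could then be passed to StudentInfo.objects.create(**fields), after checking that all four keys are present. This only works for PDFs that contain real text; a scanned slip would need OCR first.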




Re: How to populate DB from PDF extracted data

Jason Johns
PDF processing is very difficult, because the entire standard is a dumpster fire. For example, it has no concept of structure such as headings, paragraphs or sentences: each and every character is stored as just a character with a location coordinate, font size and font type.

To process the document and recover some of that structure, it's necessary to use heuristics. Check out https://github.com/pdfminer/pdfminer.six and, as Mike said above, good luck. It is not a simple task.
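
To give a feel for the kind of heuristic involved, here is a toy sketch that rebuilds text lines from positioned characters. The (char, x, y) tuples, the function name and the tolerance value are assumptions made for illustration; this is not how pdfminer.six is implemented, and real layout analysis also has to deal with word spacing, columns and rotated text:

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Group (char, x, y) tuples into text lines.

    A character joins the current line when its y coordinate is within
    `y_tolerance` of the first character placed on that line; within a
    line, characters are ordered left to right by x.
    """
    lines = []  # each entry: [reference_y, [(x, char), ...]]
    # PDF coordinates grow upwards, so sort by descending y to read top-down.
    for char, x, y in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    return ["".join(ch for _, ch in sorted(items)) for _, items in lines]
```

Even this small example needs a tuning parameter (the tolerance), which is why extraction that works for one PDF layout often breaks on another.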


Re: How to populate DB from PDF extracted data

Jaap van Wingerde-4
Use 'pdftohtml -xml' to convert the PDF into an XML file, then apply
regular expressions to each line of the XML file to extract the data.

[pdftohtml]
https://www.sourceforge.net/projects/pdftohtml/
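
A rough sketch of the per-line regex step. It assumes the XML has one element per line of the form <text top="..." left="..." ...>content</text> and that the slip uses "Label: value" lines; both the sample layout and the function names are assumptions for illustration, so check them against the XML your PDF actually produces:

```python
import html
import re

# One <text> element per line, as assumed for the pdftohtml -xml output.
TEXT_LINE = re.compile(r'<text top="(\d+)" left="(\d+)"[^>]*>(.*?)</text>')
TAG = re.compile(r"<[^>]+>")  # strips inline markup such as <b>...</b>


def iter_text_elements(xml_text):
    """Yield (top, left, text) for every <text> line found in the XML."""
    for line in xml_text.splitlines():
        match = TEXT_LINE.search(line)
        if match:
            top, left, content = match.groups()
            text = html.unescape(TAG.sub("", content)).strip()
            yield int(top), int(left), text


def extract_labelled_values(xml_text):
    """Build a dict from lines of the (assumed) form 'Label: value'."""
    values = {}
    for _top, _left, text in iter_text_elements(xml_text):
        label, sep, value = text.partition(":")
        if sep and value.strip():
            key = label.strip().lower().replace(" ", "_")
            values[key] = value.strip()
    return values
```

The top and left coordinates are kept because a label and its value sometimes land in separate elements; in that case they can be paired by matching elements with the same top value.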

On Fri, 9 Mar 2018 00:00:39 -0800, Shazia Nusrat <[hidden email]> wrote:

> [...]

