Turn Your EBook into Text with Python in Seconds

“I wasted a whole week trying to convert an EPUB file into text, so you don't have to.”

First, let’s make it readable

I will be working with my own ebook. Get yours and note its path if you're running locally. If you're working on Colab it is even easier: just upload the file and copy its path.

I will be working with the following packages, so pip install them before starting:

    !pip install ebooklib
    !pip install beautifulsoup4

Getting the HTML out

I will take the path of the EPUB file and turn it into a list of HTML documents, where each HTML document is a chapter (I believe).

Strictly speaking it's XHTML, I believe, but when I saved its content to a file.txt I could open it in the browser and view it. So I will keep calling it HTML throughout this project to keep things simple, “but it is NOT”.
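
If you want to check the XHTML claim yourself, here is a small sketch that relies only on the fact that an EPUB is a ZIP archive. The helper name `list_epub_documents` and the file names in the stand-in archive are made up for the demo, they are not part of ebooklib, and a real book may lay out its files differently.

```python
import io
import zipfile

def list_epub_documents(epub_file):
    """Return the names of the (X)HTML files stored inside an EPUB archive."""
    with zipfile.ZipFile(epub_file) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(('.xhtml', '.html', '.htm'))]

# Build a tiny stand-in archive in memory (assumption: made-up file names)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
    archive.writestr('OEBPS/chapter1.xhtml', '<html><body><p>Hi</p></body></html>')
    archive.writestr('OEBPS/style.css', 'p { margin: 0; }')
buffer.seek(0)

print(list_epub_documents(buffer))
```

With a real book you would pass the path of your EPUB file instead of the in-memory buffer.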

I used the ebooklib library to extract the HTML from the EPUB file with “item.get_content()”:

    
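      import ebooklib
      from ebooklib import epub

      def epub2thtml(epub_path):
          # Read the EPUB and collect the raw content of every document item
          book = epub.read_epub(epub_path)
          chapters = []
          for item in book.get_items():
              # ITEM_DOCUMENT marks the (X)HTML pages; images, CSS, etc. are skipped
              if item.get_type() == ebooklib.ITEM_DOCUMENT:
                  chapters.append(item.get_content())
          return chapters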
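
One detail worth knowing: as far as I can tell, each item in the list returned by epub2thtml is raw bytes, not a Python string. BeautifulSoup accepts bytes, so this is not a problem later on, but if you want to print or save the HTML yourself, a sketch like the one below may help. The helper `decode_chapters` is hypothetical, and UTF-8 is an assumption about how your book is encoded.

```python
def decode_chapters(chapters, encoding='utf-8'):
    """Convert raw chapter bytes into str, leaving str items untouched."""
    decoded = []
    for chap in chapters:
        if isinstance(chap, bytes):
            # errors='replace' avoids a crash on bytes that are not valid UTF-8
            chap = chap.decode(encoding, errors='replace')
        decoded.append(chap)
    return decoded

# Stand-in data instead of a real book (assumption: UTF-8 encoded chapters)
sample = [b'<p>Caf\xc3\xa9</p>', '<p>already text</p>']
print(decode_chapters(sample))
```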

We got the HTML, now where is my text?

OK, if you look at the output, it is full of HTML tags. My text is meant to be fed to a machine learning model, and I don't want the model to get confused by these tags.

“I myself was confused by these HTML tags.”

So I will be using a very friendly library, BeautifulSoup, which is mostly used to scrape content from the web. I don't have content on the web, but I do have its format! Therefore, it should work just fine :)

To learn more about using BeautifulSoup, check this article.

First I need to filter out the noisy tags and choose the type of content I want. Then I will scrape everything that is left into a text variable. Applying this to the HTML of every chapter gives me a list of texts, one per chapter.

    
      from bs4 import BeautifulSoup

      blacklist = ['[document]', 'noscript', 'header', 'html', 'meta', 'head', 'input', 'script']
      # there may be more elements you don't want, such as "style", etc.

      def chap2text(chap):
          output = ''
          soup = BeautifulSoup(chap, 'html.parser')
          # string=True selects every text node (the older text=True spelling is deprecated)
          text = soup.find_all(string=True)
          for t in text:
              # keep a text node only if its direct parent tag is not blacklisted
              if t.parent.name not in blacklist:
                  output += '{} '.format(t)
          return output

      def thtml2ttext(thtml):
          # apply chap2text to every chapter and collect the results
          output = []
          for html in thtml:
              text = chap2text(html)
              output.append(text)
          return output
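
The blacklist above is my own guess at what counts as noise, and your book may need a different one. A quick way to decide is to count which parent tags the text nodes actually sit under. Below is a minimal sketch of that idea. The helper `count_parent_tags` and the sample markup are made up for illustration.

```python
from collections import Counter
from bs4 import BeautifulSoup

def count_parent_tags(chap):
    """Count how many text nodes sit directly under each parent tag name."""
    soup = BeautifulSoup(chap, 'html.parser')
    return Counter(t.parent.name for t in soup.find_all(string=True))

# Hand-written stand-in for one chapter (assumption: not taken from a real book)
sample = (b'<html><head><title>T</title><style>p{}</style></head>'
          b'<body><p>Hello</p><p>World</p></body></html>')

counts = count_parent_tags(sample)
print(counts.most_common())
```

Any tag that shows up in the counts but holds no readable text, such as "style" in this sample, is a candidate for the blacklist.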

Finally, here is the function that takes the path to our EPUB file and returns its text:

    
      def epub2text(epub_path):
          chapters = epub2thtml(epub_path)
          ttext = thtml2ttext(chapters)
          return ttext
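
If what you want in the end is a single .txt file, the list returned by epub2text can be joined and written to disk. This is only a sketch: `save_chapters` is a hypothetical helper, and separating chapters with a blank line is my own choice.

```python
import tempfile
from pathlib import Path

def save_chapters(chapters, out_path, separator='\n\n'):
    """Join the chapter texts and write them to a UTF-8 text file."""
    out_path = Path(out_path)
    out_path.write_text(separator.join(chapters), encoding='utf-8')
    return out_path

# Demo with stand-in chapter texts written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    target = save_chapters(['Chapter one text', 'Chapter two text'],
                           Path(tmp) / 'book.txt')
    print(target.read_text(encoding='utf-8'))
```

With a real book you would pass the output of epub2text as the first argument and any file path you like as the second.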

Done!

Now that the process is finished, let us see the result:

    
      out = epub2text('/content/[Franz_Kafka,_John_Updike]_The_Complete_Stories(z-lib.org).epub')

OK, I have some “\n” noise, but I need it to separate content. The first two lines basically mean nothing, but the rest is rich in content, so I might call it a day.
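
If the leftover whitespace bothers you, here is one possible cleanup, sketched with the standard library only. It keeps a single "\n" between blocks of text and drops chapters that end up empty. The names `tidy_chapter` and `tidy_book` and the `min_chars` threshold are assumptions of mine, so tune them to your own book.

```python
import re

def tidy_chapter(text):
    """Collapse whitespace noise while keeping one newline between blocks."""
    lines = (re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

def tidy_book(chapters, min_chars=1):
    """Tidy every chapter and drop those shorter than min_chars characters."""
    tidied = (tidy_chapter(chap) for chap in chapters)
    return [chap for chap in tidied if len(chap) >= min_chars]

# Stand-in for the list returned by epub2text
sample = ['\n \n ', '\n Chapter One \n\n\n It   was late. \n']
print(tidy_book(sample))
```

Raising `min_chars` is one way to get rid of near-empty entries such as cover or title pages, at the risk of dropping very short chapters.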

Feel free to ask me if you didn't understand a line, and feel free to contribute to my GitHub repository, which contains all the base code.

Thank you for your time. Leave a couple of claps if you liked it, and comment if you think it can be improved.

This was my first article on Medium, so I hope you liked it.