Using PySpark 2 to read a CSV containing HTML source code

When a CSV file has HTML web-page source code as one of its fields, reading it becomes a real pain, and even more so with PySpark in a Jupyter Notebook.


Before jumping the gun, let us understand the challenges in parsing such files:
  • The HTML source code can contain line endings, new-line characters, EOF markers, etc.
  • The HTML source code can contain a lot of noisy whitespace before, after and in the middle of the content
  • The HTML source code can contain characters that could be construed as the CSV separator
  • The HTML source code can contain quotes that might cause trouble
  • The HTML source code can contain non-ASCII / non-English characters that cause trouble if the encoding isn't set right
  • When an error is thrown during parsing, getting to its cause is kind of a pain with PySpark. (Understand that PySpark is a wrapper on top of the Scala API upon which Spark is built, and the distributed architecture of Spark leads to multiple layers of error wrapping. Ouch, it hurts!)
  • In practice, use cases like this tend to involve large CSV files (in GBs) that won't load in your normal file editor. If you attempt to debug manually, you end up banging your head against the wall.
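
To make the first few challenges concrete, here is a small sketch in plain Python (no Spark needed). The `SAMPLE` data is made up for illustration: a two-column CSV whose HTML field contains new-lines, a comma, doubled quotes and a non-ASCII character. It compares a naive split against a quote-aware parse.

```python
import csv
import io

# Hypothetical sample: an id column and an HTML column. The HTML field
# contains new-lines, a comma, doubled quotes and a non-ASCII character.
SAMPLE = (
    'id,html\n'
    '1,"<html>\n  <body class=""main"">Café, open</body>\n</html>"\n'
    '2,"<p>plain</p>"\n'
)

# Naive approach: treat every physical line as a record and split on commas.
naive_rows = [line.split(",") for line in SAMPLE.splitlines()]

# Quote-aware approach: the csv module keeps each quoted field in one piece.
proper_rows = list(csv.reader(io.StringIO(SAMPLE, newline="")))

print(len(naive_rows))    # 5 fragments, because the HTML spans three lines
print(len(proper_rows))   # 3 records: the header plus two data rows
print(proper_rows[1][1])  # the HTML field, new-lines and comma intact
```

The naive split cuts the first record into pieces at every new-line and comma inside the HTML, which is the same thing that happens when a CSV reader is left in its line-by-line mode.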

Now check out the magic potion, the PySpark source code that overcomes these challenges:
Image generated from https://carbon.now.sh/
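
The code was originally shared as an image, so what follows is only a minimal sketch of how such a read could look, not necessarily the exact code from the image. It assumes Spark 2.2 or later (for the `multiLine` option), a header row, a comma separator, and quotes inside the HTML escaped by doubling them. `HTML_CSV_OPTIONS` and `read_html_csv` are hypothetical names introduced here.

```python
# Assumed reader options, mapped to the challenges listed above.
# The option names are standard Spark CSV reader options; the values
# are assumptions about how the file was written.
HTML_CSV_OPTIONS = {
    "header": True,                    # assumes the first record holds column names
    "sep": ",",                        # assumed field separator
    "quote": '"',                      # fields containing HTML are wrapped in double quotes
    "escape": '"',                     # assumes embedded quotes are doubled ("")
    "multiLine": True,                 # allow a quoted field to span several lines
    "encoding": "UTF-8",               # assumed file encoding, for non-ASCII text
    "ignoreLeadingWhiteSpace": True,   # ask the parser to trim leading whitespace
    "ignoreTrailingWhiteSpace": True,  # ask the parser to trim trailing whitespace
    "mode": "PERMISSIVE",              # keep malformed rows instead of aborting the job
}


def read_html_csv(spark, path):
    """Read a CSV with an HTML column using an existing SparkSession."""
    return spark.read.options(**HTML_CSV_OPTIONS).csv(path)
```

With a real session this could be called as `read_html_csv(SparkSession.builder.getOrCreate(), "pages.csv")`, where `pages.csv` is a placeholder file name. Whitespace in the middle of the HTML is not touched by these options and would need a separate cleanup step on the resulting column, for example with `regexp_replace` from `pyspark.sql.functions`.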

Happy coding!