Using PySpark 2 to read CSV having HTML source code

When a CSV file has HTML web-page source code as one of its fields, reading it becomes a real pain, and much more so with PySpark in a Jupyter Notebook.

Before jumping the gun, let us understand the challenges in parsing such files:
  • That the HTML source code can contain line endings, new-lines, EOF markers, etc.
  • That the HTML source code can contain a lot of noisy whitespace before, after and in the middle of the content
  • That the HTML source code can contain separators that could be misread as the CSV separator
  • That the HTML source code can contain quotes that might cause trouble
  • That the HTML source code can contain non-ASCII / non-English characters that cause trouble if the encoding isn't set right
  • That when an error is thrown during parsing, it is kind of a pain to get to the root cause because of PySpark (understand that PySpark is a wrapper on top of the Scala API upon which Spark is built, and the distributed architecture of Spark leads to multiple layers of error wrapping. Ouch, it hurts!).
  • In practice, use-cases like this tend to involve large CSV files (in GBs) that won't load in your normal file editor. If you attempt to debug manually, you end up banging your head against the wall.
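To make the first few challenges concrete, here is a minimal sketch using only Python's standard `csv` module (not PySpark). The two-column sample and its contents are made up for illustration. It shows how a naive line split miscounts records when the HTML field contains new-lines, commas and doubled quotes, while a quote-aware parser recovers the field intact.

```python
import csv
import io
import re

# Hypothetical sample: an id column and an HTML column whose value contains
# line breaks, a comma, doubled quotes and a non-ASCII character.
raw = (
    'id,html\n'
    '1,"<html>\n'
    '  <body class=""main"">  Café, open   daily </body>\n'
    '</html>"\n'
)

# newline="" leaves line breaks untouched so the csv module can keep the
# ones that sit inside a quoted field.
rows = list(csv.reader(io.StringIO(raw, newline="")))

# A naive split on line endings counts every physical line as a record.
naive_count = len(raw.splitlines())
# The quote-aware parser sees the header plus one data record.
parsed_count = len(rows)

# Collapse runs of whitespace, including the embedded new-lines.
clean_html = re.sub(r"\s+", " ", rows[1][1]).strip()

print(naive_count, parsed_count)
print(clean_html)
```

The same ideas (quote handling, multi-line fields, whitespace clean-up) are what a distributed reader has to be told about explicitly.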

Now check out the magic potion: the PySpark source code that overcomes these challenges:
[Image of the PySpark source code]
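The post's own listing was published as an image. As a hedged sketch, and not the author's actual code, the challenges listed earlier map roughly onto the following options of Spark's CSV reader. The option names are real `DataFrameReader` CSV options; the values are assumptions about the file (comma-separated, header row, UTF-8, quotes inside the HTML escaped by doubling), and `read_html_csv` is a hypothetical helper name.

```python
# Option values below are assumptions about the input file; adjust to match it.
CSV_OPTIONS = {
    "header": "true",      # assumes the first line holds column names
    "sep": ",",            # assumes a comma-separated file
    "quote": '"',          # the HTML field is wrapped in double quotes
    "escape": '"',         # assumes quotes inside the HTML are doubled ("")
    "multiLine": "true",   # allow line breaks inside quoted fields (Spark 2.2+)
    "encoding": "UTF-8",   # assumes the file is UTF-8 encoded
    "mode": "PERMISSIVE",  # keep malformed rows instead of failing the whole job
}


def read_html_csv(spark, path):
    """Read the CSV at `path` using an existing SparkSession `spark`."""
    return spark.read.options(**CSV_OPTIONS).csv(path)


# Usage, assuming PySpark is installed and "pages.csv" is your file:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = read_html_csv(spark, "pages.csv")
```

While debugging, switching `mode` to `FAILFAST` makes Spark raise an error on the first malformed record instead of carrying on. Noisy whitespace inside the HTML column can be collapsed after loading, for example with `pyspark.sql.functions.regexp_replace`.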

Happy coding!
