PySpark Word2Vec – Lessons Learned – Part 1

Training a language model is an art in itself, and there are several factors to consider. Here I share what I learned along the way.

Model
I used PySpark Word2Vec.
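As a rough illustration of what training with PySpark's Word2Vec can look like, here is a minimal sketch. The hyperparameter values, the column names ("text", "tokens", "embedding") and the helper name train_word2vec are all assumptions made for the example; they are not the settings I actually used. The Spark imports sit inside the function so the settings can be inspected without starting Spark.

```python
# Hypothetical hyperparameters and column names, chosen only for illustration.
WORD2VEC_PARAMS = {
    "vectorSize": 100,       # dimensionality of each word vector
    "minCount": 5,           # ignore words seen fewer than 5 times
    "inputCol": "tokens",    # column holding the list of tokens per document
    "outputCol": "embedding" # column that receives the averaged document vector
}


def train_word2vec(df, text_col="text", params=WORD2VEC_PARAMS):
    """Tokenize `text_col` of a Spark DataFrame and fit Word2Vec on the tokens."""
    # Imported here so this module loads even where pyspark is not installed.
    from pyspark.ml.feature import Tokenizer, Word2Vec

    # Drop rows with no text, then lowercase and split on whitespace.
    tokenizer = Tokenizer(inputCol=text_col, outputCol=params["inputCol"])
    tokens = tokenizer.transform(df.na.drop(subset=[text_col]))

    # fit() returns a Word2VecModel holding the learned word vectors.
    return Word2Vec(**params).fit(tokens)
```

Given a DataFrame df with a "text" column, model = train_word2vec(df) returns the fitted model, and model.findSynonyms(word, 5) lists the nearest neighbours of any word that made it into the vocabulary.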

Data
I trained the model on the CFPB (Consumer Financial Protection Bureau) complaints data, which came to about 100 MB in compressed Parquet format.

Spark Driver & Executor memory
The first error I got was an out-of-memory error. When I checked the Spark driver and executor memory, each was set to 1 GB. I increased both to 2 GB, and that resolved the error.
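The memory change described above can be sketched as follows. The property names spark.driver.memory and spark.executor.memory are standard Spark settings; the app name and the helper functions are assumptions made for this example. One caveat: driver memory only takes effect if it is set before the driver JVM starts, so when launching through spark-submit it is safer to pass it on the command line (for example --driver-memory 2g) than to set it in code.

```python
# The 2 GB values from the text above, in Spark's size notation.
MEMORY_CONF = {
    "spark.driver.memory": "2g",
    "spark.executor.memory": "2g",
}


def to_submit_args(conf):
    """Render settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(app_name="word2vec-training", conf=MEMORY_CONF):
    """Create a SparkSession with the given settings (app name is hypothetical)."""
    # Imported here so the settings above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # Reuses an existing session if one is already running in this process,
    # in which case the driver memory setting will not be applied.
    return builder.getOrCreate()
```

Calling build_session() in a fresh Python process gives a session with both limits raised, while to_submit_args(MEMORY_CONF) produces the equivalent arguments for a spark-submit command.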

To be continued in the next part…
