Training a language model is an art in itself. There are several factors to consider, and I am sharing my experiences here as lessons learned.
I used PySpark's Word2Vec implementation.
The dataset I used for training my language model was the CFPB consumer complaints data, which came to about 100 MB in compressed Parquet format.
Spark Driver & Executor memory
The first error I hit was an OutOfMemoryError. When I examined the Spark driver and executor memory settings, both were at 1 GB. Increasing both to 2 GB resolved the error.
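One way to apply that fix is on the command line when submitting the job. The script name below is hypothetical; the two flags correspond to `spark.driver.memory` and `spark.executor.memory`.

```shell
# Raise driver and executor memory from the 1 GB default to 2 GB
spark-submit \
  --driver-memory 2g \
  --executor-memory 2g \
  train_word2vec.py
```

Note that `spark.driver.memory` must be set before the driver JVM starts, so setting it at submit time (or in `spark-defaults.conf`) is more reliable than setting it inside the application after the session exists.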
To be continued in the next part…