A Case of Gensim Word2Vec Model Memory Usage: Cannot Release the Model through “del”

Xing Zeng
4 min readApr 8, 2022

I ran into a weird case with Gensim Word2Vec that forced my program out of memory.

We have a simple Flask application. For each API call, it spins up a subprocess through a customized subclass of Python’s own multiprocessing.Process. This subprocess is long-running and, as part of its work, trains a number of different Gensim-based Word2Vec models with different attributes along the way.

More specifically, each time it finishes training one Word2Vec model, it collects some aggregated statistics from it. At that point we have everything we need from the current Word2Vec model, so the model can be thrown away and the subprocess can proceed to train the next one.
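As a rough sketch of that loop (the names model_configs, corpus and collect_statistics are placeholders I am using for illustration, not the actual code), the subprocess looks something like this:

from gensim.models import Word2Vec

def run(self):
    # Hypothetical outline of the subprocess: train one model, collect the
    # aggregated statistics we need, then release the model before the next one.
    for config in self.model_configs:                 # placeholder attribute
        self.wv_model = Word2Vec(sentences=self.corpus, **config)
        self.collect_statistics(self.wv_model)        # placeholder helper
        self.__release_model()                        # shown below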

As you can see from the above process, no Word2Vec model is needed once its statistics have been collected, so every time training finishes I have to remove the model from memory to reduce the memory taken by the subprocess. More specifically, the code looks like this:

def __release_model(self):
    # self.wv_model is the name of the word2vec model persisted
    # it's a class attribute because it makes inferencing easier
    del self.wv_model
    self.wv_model = None

However, that doesn’t seem to work. When my subprocess finishes training one Word2Vec model, memory usage goes up to 9 GB. When it finishes training the second one, it goes up to 18 GB. After the third, 27 GB. I need to train 4 models in one pass of the subprocess, and my machine only has 32 GB of memory. Of course, Ubuntu won’t be happy, so a -9 (SIGKILL) is sent to my subprocess.
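For reference (this is not from my original code, just one quick way to watch the numbers), you can log the subprocess’s resident memory after each training run with psutil:

import os
import psutil

def __log_memory(self, label):
    # Print the resident set size of the current (sub)process in GiB
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)
    print(f"[{label}] RSS: {rss_gib:.1f} GiB")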

One way to fix this is to split the 4 model trainings into 4 different subprocesses. This is actually the best way to fix it, and it is how I envision the program eventually looking: make all the model training distributed using Ray, and track the chain of processes coming from one API call using Prefect. However, I don’t have enough time for that migration right now. It would certainly be exciting, but I have other matters to attend to. Also, it felt extremely weird that Gensim Word2Vec can’t just release memory through del , so I spent some time looking into whether I could fix it by updating my __release_model function.

The result is very much NOT exciting:

import gc

def __release_model(self):
    # self.wv_model is the name of the word2vec model persisted
    # it's a class attribute because it makes inferencing easier
    del self.wv_model
    self.wv_model = None
    gc.collect()

Basically, just calling the garbage collector explicitly fixes the increasing-memory issue I was having.

While explicitly calling the garbage collector does not look like the standard, recommended way of doing Python programming, after digging into the issue a bit I think it may still be a good way to fix it.

First, we need to understand how Python garbage collection works. To do that, I highly suggest reading:

https://rushter.com/blog/python-garbage-collector/

Basically, Python has two mechanisms for garbage collection. The primary one is reference counting, which is what del interacts with: del removes a reference, and once the last reference to an object is gone, the object is freed. Thus, in most cases, del on the last reference will cause the underlying data to be released.

However, reference counting does not deal with circular references. And unfortunately, Gensim’s implementation somehow introduces a circular reference somewhere, causing del to not actually free the model.
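To see why cycles matter, here is a minimal, self-contained illustration (nothing Gensim-specific): two objects referencing each other are not freed by del alone, but gc.collect() reclaims them.

import gc

class Node:
    def __init__(self):
        self.other = None

a, b = Node(), Node()
a.other = b
b.other = a           # a and b now form a reference cycle

del a
del b                 # refcounts never reach zero, so nothing is freed yet
freed = gc.collect()  # the cyclic collector finds and frees the unreachable cycle
print(freed)          # prints a positive number: objects collected as part of a cycle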

I was able to confirm the existence of the circular reference with:

def __release_model(self):
    # self.wv_model is the name of the word2vec model persisted
    # it's a class attribute because it makes inferencing easier
    wv_model_id = id(self.wv_model)
    del self.wv_model
    self.wv_model = None
    # Code for PyObject can be found in the link above
    # This prints 2
    print(PyObject.from_address(wv_model_id).refcnt)
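For completeness, the PyObject helper used above is roughly the small ctypes structure described in the linked article (this relies on CPython’s object layout, so it is a debugging hack, not production code):

import ctypes

class PyObject(ctypes.Structure):
    # Mirrors the start of CPython's object header so we can read the
    # reference count of an object at a known memory address.
    _fields_ = [("refcnt", ctypes.c_ssize_t)]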

I couldn’t tell which part of Gensim actually introduces that circular reference. The source code around __init__ looked fine, and I didn’t have enough time to dig into the rest of the source code. So the best I can do is nuke it with a gc.collect() here.

That being said, gc.collect() is normally still invoked automatically by Python when appropriate (the cyclic collector is triggered when its allocation thresholds are crossed). I do not know why it wasn’t triggered in time in my case. Each Word2Vec training run takes long enough (about 5 minutes) that I would expect the collector to have been invoked at least once. My guess is that this has to do with the fact that my training runs in a subprocess triggered by the main process, and Python may collect less aggressively in subprocesses for some reason. That wouldn’t surprise me, since if your subprocess terminates, the OS automatically reclaims all its resources anyway. Anyone who knows more about Python’s gc and subprocesses is welcome to shed more light on this matter.
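If someone wants to dig into this, one way to check whether the cyclic collector ever runs inside the subprocess is to turn on gc’s debug output and dump its counters at the start of the subprocess; a sketch (the training loop itself is elided):

import gc

def run(self):
    # Print a line on every automatic collection so we can tell whether
    # the cyclic collector fires at all inside this subprocess.
    gc.set_debug(gc.DEBUG_STATS)
    print("gc enabled:", gc.isenabled())
    print("thresholds:", gc.get_threshold())   # defaults to (700, 10, 10)
    print("counts:", gc.get_count())
    ...  # training loop as before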
