My Understanding of fastText

fastText is a well-known library for text classification and text representation learning, released by Facebook in 2016. That's quite a while ago now, but I've only recently read the fastText papers and source code. It's an elegant piece of work, and it remains a genuinely practical tool for both text classification and text representation learning.

Key Points of fastText

Hierarchical Softmax

Hierarchical softmax is a clever way to organize the output layer of the neural network. It arranges the classes as a binary tree: each leaf node represents one class, and the path from the root to a leaf identifies that class label. Hierarchical softmax reduces the complexity of the output layer from O(V) to O(log V), where V is the number of classes.

The probability of the class label is the product of the probabilities of the nodes on the path from the root node to the leaf node.

P(n_{l+1}) = \prod_{i=1}^{l} P(n_i)

where n_1, ..., n_l are the nodes on the path from the root and n_{l+1} is the leaf for the class.
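To make this concrete, here is a minimal sketch (not fastText's actual C++ implementation) of computing the probability along one root-to-leaf path, where each internal node makes a sigmoid binary decision; the scores and directions below are made-up illustrative values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(scores, directions):
    """Probability of reaching a leaf: the product of the binary
    decisions made at each internal node on the root-to-leaf path.

    scores     -- dot products of the input vector with each node's weights
    directions -- +1 to take one branch, -1 to take the other, per node
    """
    p = 1.0
    for s, d in zip(scores, directions):
        p *= sigmoid(d * s)  # P(take this branch) at the current node
    return p

# A leaf two levels deep: branch "left" (score 0.4), then "right" (score -0.2)
p = path_probability([0.4, -0.2], [-1, +1])
```

Because each factor is a probability of a binary choice, only the log(V) nodes on one path need to be evaluated, which is where the speedup over a flat softmax comes from.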

Huffman Tree

We can construct the hierarchical softmax tree as a Huffman tree, a binary tree built from the class frequencies: classes with higher frequency sit closer to the root, and classes with lower frequency sit deeper in the tree. Frequent classes therefore get shorter paths and cheaper probability computations.
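A small sketch of the construction, assuming made-up class frequencies: repeatedly merge the two least frequent subtrees, so frequent classes end up at shallow depths (short codes).

```python
import heapq

def huffman_code_lengths(freqs):
    """Build a Huffman tree from class frequencies and return each
    class's depth (its code length, i.e. path length from the root)."""
    # Each heap entry: (frequency, tie-breaker, {class: depth-so-far})
    heap = [(f, i, {label: 0}) for i, (label, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, d2 = heapq.heappop(heap)
        # Merging pushes every class in both subtrees one level deeper
        merged = {c: depth + 1 for c, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

depths = huffman_code_lengths({"a": 50, "b": 30, "c": 15, "d": 5})
# "a", the most frequent class, ends up closest to the root
```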

Char-level n-gram

fastText uses character-level n-gram features to represent text. The motivation comes mainly from morphologically rich languages (MRL), where a word can take many inflected forms: French has around 40 different verb forms, and Finnish has 15 noun cases. Word-level features struggle to capture this kind of information, and stemming discards the morphological variation entirely. Character-level n-grams capture the morphological information while keeping related word forms close in the representation.
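The extraction scheme, as described in the fastText subword paper, wraps each word in boundary markers `<` and `>` and slides windows of several lengths over it (the paper uses n = 3 to 6); the full marked word is also kept as a feature. A small sketch:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Extract character n-grams as in the fastText subword paper:
    wrap the word in '<' and '>' markers, then take every substring
    of length nmin..nmax, plus the whole marked word itself."""
    marked = f"<{word}>"
    grams = []
    for n in range(nmin, nmax + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    grams.append(marked)  # special sequence for the whole word
    return grams

print(char_ngrams("where", nmin=3, nmax=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

Note how the boundary markers let the model distinguish the trigram `her` inside "where" from the standalone word "her", which would appear as `<her>`.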

Char-level n-grams also help bound the size of the vocabulary: they reduce it from O(V) to O(V′), where V′ is the number of distinct char-level n-grams.
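In practice fastText goes one step further and hashes n-grams into a fixed number of embedding buckets (the `-bucket` option, which defaults to 2,000,000), so memory stays bounded no matter how many distinct n-grams the corpus produces. A sketch using the 32-bit FNV-1a hash, the hash family fastText's dictionary uses:

```python
def fnv1a_hash(s):
    """32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h

def ngram_bucket(ngram, nbuckets=2_000_000):
    # Map an n-gram to one of a fixed number of embedding rows.
    # Collisions are accepted as the price of a bounded table.
    return fnv1a_hash(ngram) % nbuckets
```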

Published Jun 19, 2023

Flying code monkey