3.3 Q4 Term Frequency-Inverse Document Frequency (TF-IDF)  

To compute TF-IDF, we used a Spark-NLP Processing Job because of the large dataset size, together with the HashingTF feature transformer. The challenge with Spark's HashingTF is that it hashes each word into a fixed-size feature vector. This hashing makes the transformer efficient, but the direct mapping between words and their indices in the feature vector is lost, which makes it difficult to recover the original words from the indices.
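As a rough illustration of the pipeline described above (not our exact Processing Job), the following PySpark sketch tokenizes text, hashes tokens with HashingTF, and rescales the counts with IDF. The column names, in-line sample rows, and numFeatures value are placeholders.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-hashing-sketch").getOrCreate()

# Placeholder documents; in practice this would be the full tweet dataset.
df = spark.createDataFrame(
    [("blockchain burning adventures",), ("ceoofdogecoin announces career",)],
    ["text"],
)

words = Tokenizer(inputCol="text", outputCol="words").transform(df)

# HashingTF assigns each word an index via a hash function; the mapping is
# never stored, so indices cannot be translated back to words afterwards.
tf = HashingTF(inputCol="words", outputCol="raw_tf", numFeatures=1 << 18).transform(words)

# IDF downweights terms that appear in many documents.
tfidf = IDF(inputCol="raw_tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)

Because HashingTF stores only hashed indices rather than a vocabulary, the resulting vectors carry no index-to-word mapping, which is the limitation noted above.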

Table 2: Top 10 Words by TF-IDF Scoring: Highlighting Unique Vocabulary

Rank   Term            TF-IDF Score
1      blockchain      2159.020063
2      burning         1217.472828
3      adventures       988.120861
4      above            969.387716
5      buffet           968.732854
6      are              927.111189
7      ceoofdogecoin    870.622154
8      240k             827.099049
9      announces        817.186108
10     career           780.541266

(Scores computed from a sample of the dataset)
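For reference, one way a ranking like Table 2 could be produced while keeping the index-to-word mapping is to swap HashingTF for CountVectorizer, which stores its vocabulary. The sketch below is illustrative only; the column names and tiny in-line dataset are placeholders, not the job we actually ran.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

spark = SparkSession.builder.appName("top-tfidf-words-sketch").getOrCreate()

df = spark.createDataFrame(
    [("blockchain burning blockchain",), ("announces career blockchain",)],
    ["text"],
)

words = Tokenizer(inputCol="text", outputCol="words").transform(df)

# CountVectorizer keeps an explicit vocabulary, so each feature index
# can be mapped back to its word after scoring.
cv_model = CountVectorizer(inputCol="words", outputCol="tf").fit(words)
tf = cv_model.transform(words)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# Sum each index's TF-IDF weight across documents, then translate indices
# back to words through the stored vocabulary.
totals = (
    tfidf.select("tfidf").rdd
    .flatMap(lambda row: zip(row["tfidf"].indices, row["tfidf"].values))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)
top10 = sorted(totals, key=lambda kv: kv[1], reverse=True)[:10]
print([(cv_model.vocabulary[int(i)], float(score)) for i, score in top10])

The trade-off is that CountVectorizer must build and broadcast a vocabulary, which is more expensive than hashing on very large corpora.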