4.5 Q10: Exploring user content, patterns, networks
We examine the network structure of the r/dogecoin and r/CryptoCurrency subreddits to understand how their communities cluster and how information spreads within them. We first identify the unique posts in these communities and then construct edges between commenters and posters. Each edge is weighted: a weight of 1 is assigned when a commenter comments once on posts by a given author, and a weight of n is assigned when the same commenter directs n comments at that author. In this network, a single commenter can engage with multiple posts, and a single post can attract multiple commenters. After reformatting the dataset to reflect these relationships, we export the network to a GML file in which each node represents a distinct user who participates as a poster or a commenter.
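The edge construction described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline actually used: the sample data, the function names `build_weighted_edges` and `to_gml`, and the choice to skip self-replies are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample data: post_id -> post author, and (commenter, post_id) pairs.
post_author = {"p1": "alice", "p2": "alice", "p3": "bob"}
comments = [
    ("carol", "p1"),
    ("carol", "p2"),
    ("dave", "p1"),
    ("carol", "p3"),
    ("alice", "p3"),
]

def build_weighted_edges(post_author, comments):
    """Count how many comments each commenter directed at each post author."""
    weights = Counter()
    for commenter, post_id in comments:
        author = post_author.get(post_id)
        # Assumption: drop comments on unknown posts and self-replies.
        if author is None or author == commenter:
            continue
        weights[(commenter, author)] += 1
    return weights

def to_gml(weights):
    """Serialize the weighted, directed edges as GML text (one node per user)."""
    users = sorted({user for edge in weights for user in edge})
    ids = {user: i for i, user in enumerate(users)}
    lines = ["graph [", "  directed 1"]
    for user, i in ids.items():
        lines += ["  node [", f"    id {i}", f'    label "{user}"', "  ]"]
    for (src, dst), weight in sorted(weights.items()):
        lines += [
            "  edge [",
            f"    source {ids[src]}",
            f"    target {ids[dst]}",
            f"    weight {weight}",
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines)

edges = build_weighted_edges(post_author, comments)
gml_text = to_gml(edges)
print(gml_text)
```

Here carol commented on two of alice's posts, so the carol-to-alice edge carries a weight of 2, while every other edge has a weight of 1. A graph library with a GML writer would serve the same purpose; the hand-written serializer only keeps the sketch dependency-free.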
We then visualize the network in Gephi to examine its structure. The visualization reveals a decentralized topology: numerous peripheral communities coalesce, yet no single ‘central user’ or dominant opinion leader emerges. The network comprises 73,395 nodes and 209,864 edges, which indicates sparse connectivity. The average degree, the average number of connections per node, is approximately 2.859, consistent with the network’s distributed nature. Modularity analysis yields a value of 0.383, a measure of how strongly the network divides into distinct communities, and the partition identifies 14 communities. The average clustering coefficient, a measure of local interconnectedness, is 0.022, which indicates little local clustering. Finally, the average path length, the average number of steps needed to travel between any two nodes, is 4.225, so users are on average a little over four steps apart.
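As a sanity check on the figures above, the reported average degree equals the edge count divided by the node count, which suggests the graph was treated as directed. The sketch below reproduces that arithmetic and illustrates the standard (Newman) modularity formula on a toy undirected graph. The toy graph, its two-community partition, and the function name `modularity` are assumptions for illustration; Gephi's own community detection may differ in its details.

```python
from collections import defaultdict

# Figures reported in the text.
nodes, edges = 73_395, 209_864

# Average degree under the directed convention: edges per node.
avg_degree = edges / nodes
# Directed density: fraction of all possible ordered pairs that are connected.
density = edges / (nodes * (nodes - 1))
print(round(avg_degree, 3))  # 2.859

def modularity(edge_list, community_of):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2,
    for an undirected, unweighted graph with m edges, where L_c is the
    number of edges inside community c and d_c its total degree."""
    m = len(edge_list)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edge_list:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )

# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity(toy_edges, toy_partition)
print(round(q, 3))  # 0.357
```

Two tightly knit triangles joined by one bridge give a modularity of about 0.357, which is in the same range as the 0.383 reported for the Reddit network.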
In summary, the network is decentralized and sparsely connected, with identifiable community structure. Together, the connectivity, modularity, clustering, and path-length metrics above characterize how information can propagate within these Reddit communities.
3.5.2. Do highly active users in both subreddits post distinct content?
In natural language processing, embeddings are high-dimensional vectors that capture the semantic properties of text. They transform textual data into a numerical format that machines can ‘understand’. Using the results of the earlier Spark processing job, we generate embeddings of the post titles with BERT sentence embeddings trained on Wikipedia and BooksCorpus and fine-tuned on SST-2.
To aid visual interpretation of these embeddings, we apply the dimensionality reduction technique t-SNE (t-distributed Stochastic Neighbor Embedding). This method projects the high-dimensional data into a 2-dimensional space, which makes it easier to visualize and analyze relationships and clusters within the data, such as differences in textual content across subreddits.
The embedding step produces a column containing a 768-dimensional vector that represents each title. After reducing these vectors to 2 dimensions with t-SNE, we plot them, with color indicating subreddit membership. There are three groups of users: those active only in r/CryptoCurrency (red), those active only in r/dogecoin (blue), and the highly active users who are part of both subreddits (green). As the plot shows, the posts of the highly active users are not distinctly different in content or meaning from those of users who are active in only one subreddit.
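The three-way grouping of users used for the plot colors can be sketched with simple set operations. This is an illustrative sketch only: the sample records and the names `authors_by_subreddit` and `group_color` are hypothetical, and the report does not state the activity threshold that defines a "highly active" user, so the sketch assigns green to any author who appears in both subreddits.

```python
from collections import defaultdict

# Hypothetical (author, subreddit) records for post titles.
posts = [
    ("alice", "CryptoCurrency"),
    ("alice", "dogecoin"),
    ("bob", "CryptoCurrency"),
    ("carol", "dogecoin"),
    ("carol", "dogecoin"),
]

# Collect the set of authors seen in each subreddit.
authors_by_subreddit = defaultdict(set)
for author, subreddit in posts:
    authors_by_subreddit[subreddit].add(author)

crypto_authors = authors_by_subreddit["CryptoCurrency"]
doge_authors = authors_by_subreddit["dogecoin"]
both = crypto_authors & doge_authors

def group_color(author):
    """Map an author to the plot color of their group."""
    if author in both:
        return "green"  # part of both subreddits
    if author in crypto_authors:
        return "red"    # only r/CryptoCurrency
    if author in doge_authors:
        return "blue"   # only r/dogecoin
    raise ValueError(f"unknown author: {author}")

colors = {author: group_color(author) for author, _ in posts}
print(colors)
```

Each post title's 2-dimensional t-SNE point would then be drawn in the color of its author's group.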
Figure 5. Scatter plot of t-SNE embeddings of post titles
Elon Musk has been a vocal proponent of Dogecoin, frequently discussing and promoting the cryptocurrency on social media. His tweets and comments can cause significant fluctuations in the price of Dogecoin, demonstrating his substantial impact on the crypto market. Musk’s endorsement has helped to elevate Dogecoin from a lesser-known digital currency to a prominent player in the cryptocurrency space. In the early hours of November 1, 2022 (just after 12 AM), Musk tweeted a picture of a Shiba Inu wearing a Twitter T-shirt, which likely led to an uptick in Dogecoin’s price. We look at the nature of posts before and after this event. Using sentiment analysis, we calculate an average compound score for each period, where a higher value indicates more positive sentiment.
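The before-and-after comparison can be sketched as below, assuming each post already carries a compound sentiment score in the range -1 to 1. The sample timestamps and scores are invented for illustration, the cutoff of midnight on November 1, 2022 is an assumption (the text does not state a timezone), and `average_compound` is a hypothetical helper name.

```python
from datetime import datetime
from statistics import mean

# Assumed cutoff: midnight on November 1, 2022 (timezone not specified in the text).
EVENT_TIME = datetime(2022, 11, 1, 0, 0)

# Hypothetical (post timestamp, compound sentiment score) pairs.
scored_posts = [
    (datetime(2022, 10, 31, 18, 30), 0.10),
    (datetime(2022, 10, 31, 22, 5), -0.20),
    (datetime(2022, 11, 1, 1, 15), 0.60),
    (datetime(2022, 11, 1, 9, 40), 0.40),
]

def average_compound(scored_posts, event_time):
    """Return the mean compound score before and after the event.

    A period with no posts yields None instead of raising an error.
    """
    before = [score for ts, score in scored_posts if ts < event_time]
    after = [score for ts, score in scored_posts if ts >= event_time]
    avg_before = mean(before) if before else None
    avg_after = mean(after) if after else None
    return avg_before, avg_after

avg_before, avg_after = average_compound(scored_posts, EVENT_TIME)
print(f"before: {avg_before:.2f}, after: {avg_after:.2f}")
```

In this invented sample the average rises from slightly negative before the tweet to clearly positive after it; the real comparison depends on the scored posts from the dataset.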