1.2 Introduction

Welcome to our capstone project for PPOL 5206 Big Data and Cloud Computing, titled "Forecasting Cryptocurrency Prices Using Machine Learning: An Analysis of Reddit Discussions". This collaborative effort by Cecil Philip John, Himangshu Kumar, Ocean Chen, and Olivia Zhou uses big data methods to dissect the conversations surrounding Dogecoin within Reddit’s communities. By collecting and examining data from January 2022 to February 2023 from prominent subreddits such as r/cryptocurrency and r/dogecoin, our project seeks to uncover the relationships between the sentiment in online discussions and the market behavior of meme-based cryptocurrencies like Dogecoin.

We examine the degree to which price fluctuations are driven by the enthusiastic participation of online communities as opposed to external market shocks. By analyzing this dataset, we aim to understand the dynamics that drive the valuation of meme coins and to assess the predictive power of social media sentiment on their market performance. Ultimately, our study strives to establish a framework that could help identify future investment opportunities from patterns in the aggregated online discourse surrounding cryptocurrencies.

Between January 2022 and February 2023, the two subreddits under investigation together contained 587,972 comments and submissions. Within r/cryptocurrency, a significant segment of the discourse, but not all of it, is specifically focused on Dogecoin. To isolate the discussions pertinent to our study, we therefore need to screen the complete body of content rather than rely on the subreddit alone.
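As an illustration of what isolating Dogecoin-related content can look like, here is a minimal sketch in plain Python. It is not the project's actual filter: the keyword list, the `body` field name, and the helper names `mentions_dogecoin` and `filter_dogecoin` are all assumptions made for this example.

```python
import re

# Hypothetical keyword list; the terms actually used in the project are not stated.
# \b word boundaries avoid false hits such as "dodged".
DOGE_PATTERN = re.compile(r"\b(doge|dogecoin)\b", re.IGNORECASE)


def mentions_dogecoin(text):
    """Return True if the text contains a Dogecoin keyword as a whole word."""
    # Missing or empty text (e.g. deleted comments) never matches.
    return bool(text) and DOGE_PATTERN.search(text) is not None


def filter_dogecoin(records, text_field="body"):
    """Keep only the records whose text field mentions Dogecoin.

    `text_field` is an assumed column name for the comment or submission text.
    """
    return [r for r in records if mentions_dogecoin(r.get(text_field))]


# Small made-up sample standing in for Reddit comments.
sample = [
    {"subreddit": "cryptocurrency", "body": "DOGE is pumping again"},
    {"subreddit": "cryptocurrency", "body": "Bitcoin halving thread"},
    {"subreddit": "dogecoin", "body": "Holding my Dogecoin"},
    {"subreddit": "cryptocurrency", "body": None},
]

doge_only = filter_dogecoin(sample)
```

A keyword match like this is deliberately simple: it will miss posts that discuss Dogecoin without naming it and will keep posts that mention it only in passing. On a distributed DataFrame, a comparable regular expression could be applied as a column filter instead of a Python loop.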

To process this large dataset efficiently, we used PySpark, the Python API for Apache Spark, an open-source engine for distributed data processing. PySpark allowed us to run complex data processing tasks in parallel, which made it practical to analyze and explore the full dataset at scale.

Recent research has intensively explored the impact of online communities on cryptocurrency markets, leveraging a variety of data sources and machine learning techniques to uncover the relationship between social media sentiment and market fluctuations. Tandon et al. (2021) and Agarwal et al. (2021) both utilized sentiment analysis on Twitter and Reddit discussions, demonstrating that positive social media sentiment can precede market upswings. Similarly, Seroyizhko (2022) focused on Reddit’s cryptocurrency subreddits, employing LSTM networks to predict price trends, highlighting the predictive value of specific online forums. Sridhar and Sanagavarapu (2021) advanced this discussion by applying a mix of NLP, Random Forest, and Gradient Boosting models, emphasizing the complex interplay between online sentiment and other market drivers.

In our analysis, we explore ten questions that cover various aspects of the Dogecoin discussion on Reddit. They range from buy/sell signals in the discussion, the most active users, and the number of posts over time to the sentiment associated with different time periods. Together, they provide a big data perspective on how the online platform might influence the price of Dogecoin. By applying Exploratory Data Analysis (EDA), Natural Language Processing (NLP), and Machine Learning (ML) at big data scale, we aim to uncover hidden patterns and trends that can shed light on the realities of investing in the crypto markets.