Data Sources
1. Reddit Data
Our project uses a comprehensive dataset curated by Professor Arora, collected through Reddit’s API and covering discussions from January 2021 to March 2023. The dataset lets us examine the dialogue within key technology-focused communities on Reddit.
Submissions: The dataset includes an extensive collection of 109 million submissions stored across 412 GB of plain-text JSON files. These submissions, spanning various subreddits, provide a wide lens on the topics that engage the Reddit community.
Comments: To complement the submissions, the dataset also contains 701 million comments, recorded in 918 GB of plain-text JSON files. These comments add depth to the discussions and reveal the community’s sentiment at a granular level.
We refined our dataset to focus on the following subreddits, each offering distinctive insights into the technology discourse:
r/Technology: A primary forum for public dialogue on general technology advancements.
r/Futurology: A platform for discussions on future technology trends and innovations.
r/News: While broader in scope, this subreddit offers a lens into how technology topics are portrayed and perceived in media-driven discussions.
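The subreddit filter described above can be sketched roughly as follows. This is only an illustration, not the pipeline we actually ran: it assumes each line of a JSON file holds one record and that the subreddit name is stored in a `subreddit` field, and the names `TARGET_SUBREDDITS` and `filter_records` are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Assumption: subreddit names are compared case-insensitively, without the "r/" prefix.
TARGET_SUBREDDITS = frozenset({"technology", "futurology", "news"})


def filter_records(lines: Iterable[str], targets=TARGET_SUBREDDITS) -> Iterator[dict]:
    """Yield only the records whose `subreddit` field is one of the targets."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        name = str(record.get("subreddit", "")).lower()
        if name in targets:
            yield record
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, so a large file never has to be loaded into memory at once.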
To provide a clearer picture of the scale of our analysis and the breadth of discussions captured from Reddit, here is a detailed breakdown of the number of submissions and comments we’ve collected from each selected subreddit between January 2021 and February 2023:
Subreddit | Submissions | Comments |
---|---|---|
r/Technology | 181,596 | 7,320,261 |
r/News | 868,430 | 21,525,282 |
r/Futurology | 44,744 | 3,034,605 |
Total | 1,963,200 | 53,405,430 |
2. Yahoo Finance Data
To analyze the relationship between online discussions and market performance, we used the yfinance library to obtain historical stock data for major technology companies, which lets us study market dynamics alongside Reddit discourse.
Stock Price Retrieval: Using yfinance, we fetched the adjusted close prices for a selected group of tech companies: Microsoft (MSFT), Nvidia (NVDA), Adobe (ADBE), Alphabet (GOOG), and Amazon (AMZN). Our focus was on the daily stock metrics from January 1, 2021, to February 28, 2023, aligning with the timeline of our Reddit dataset.
Data Organization: The retrieved stock data was structured into a pandas DataFrame, where each column represents a company, and each row corresponds to the adjusted closing price for a given day. This structured format enables us to efficiently analyze the stock price trends alongside the Reddit sentiment data.
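A minimal sketch of the retrieval and organization steps might look like the following. It is an illustration under stated assumptions rather than our exact code: the helper names `exclusive_end` and `fetch_adjusted_close` are hypothetical, and `auto_adjust=False` is passed on the assumption that the separate "Adj Close" column is wanted. Since yfinance treats the `end` date as exclusive, the sketch requests one day past February 28, 2023.

```python
from datetime import date, timedelta

TICKERS = ("MSFT", "NVDA", "ADBE", "GOOG", "AMZN")
FIRST_DAY = date(2021, 1, 1)
LAST_DAY = date(2023, 2, 28)


def exclusive_end(last_day: date) -> str:
    """Return the day after `last_day`, because yfinance excludes the `end` date."""
    return (last_day + timedelta(days=1)).isoformat()


def fetch_adjusted_close(tickers=TICKERS, first_day=FIRST_DAY, last_day=LAST_DAY):
    """Return a DataFrame with one row per trading day and one column per ticker."""
    import yfinance as yf  # third-party; imported lazily, needs network access

    raw = yf.download(
        list(tickers),
        start=first_day.isoformat(),
        end=exclusive_end(last_day),
        auto_adjust=False,  # keep the separate "Adj Close" column
    )
    # With several tickers the columns are two-level; selecting "Adj Close"
    # leaves one column per ticker.
    return raw["Adj Close"][list(tickers)]
```

Calling `fetch_adjusted_close()` would then give a date-indexed table that can be joined with daily Reddit sentiment on the date.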
Through this process, we aim to identify potential correlations between the stock price trends of these tech companies and the sentiment trends in technology-related discussions on Reddit, showing where online narratives align with or diverge from financial market movements.
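One possible way to quantify such a relationship is a Pearson correlation between a daily sentiment score and the daily stock return. The sketch below is an assumption about how this could be done, not a description of our final method. It expects two hypothetical dictionaries keyed by ISO date strings, one holding sentiment scores and one holding adjusted close prices.

```python
from math import sqrt


def daily_returns(prices: dict) -> dict:
    """Map each date (after the first) to the fractional change from the prior date."""
    days = sorted(prices)  # ISO date strings sort chronologically
    return {cur: prices[cur] / prices[prev] - 1 for prev, cur in zip(days, days[1:])}


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def sentiment_return_correlation(sentiment: dict, prices: dict) -> float:
    """Correlate sentiment with returns over the dates present in both series."""
    returns = daily_returns(prices)
    shared = sorted(set(sentiment) & set(returns))
    return pearson([sentiment[d] for d in shared], [returns[d] for d in shared])
```

Restricting the calculation to shared dates matters because markets are closed on weekends and holidays while Reddit activity continues every day.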