How to Use the Reddit API with Pushshift to Bypass API Limits

I often see people ask how to get more history from Reddit. Say you want every post from 2020 on /r/wallstreetbets. The official Reddit API doesn’t let you do that.

Thankfully, there is another project called Pushshift that stores an archive of Reddit you can query, and querying it is much faster than going through Reddit itself.

Here’s how to use Pushshift combined with the official Reddit API to query more data!

Python Example

While you can query Pushshift with any language, we will use Python because it is easy and versatile. If you’re new to Python, I recommend Corey Schafer’s YouTube tutorials.

Install Steps

(skip straight to the Git Repo)

If this is the first of my tutorials you’ve followed, please start by installing miniconda and cloning the repo.
Install miniconda from https://docs.conda.io/en/latest/miniconda.html (choose the latest Python version for your OS, most likely 64-bit).

git clone https://github.com/rogerfitz/tutorials
cd tutorials

The exact Python version doesn’t matter, because for each project I’ll have you create a separate environment with the proper version of Python.

From the tutorials directory:

git pull origin master
cd subreddit_analyzer
conda create -n subreddit_analysis python=3.9 pandas=1.3.2 jupyter=1.0.0 matplotlib=3.4.2 -y
conda activate subreddit_analysis
pip install -r requirements.txt

Run the Code

conda activate subreddit_analysis
jupyter notebook

Now you’re ready to generate your Reddit API credentials. Please follow along in the video below.

Once that’s set up, feel free to watch my other tutorial videos or jump straight to the Pushshift one. Here are the Jupyter Notebook and the accompanying video.

How to Use Pushshift with the Official Reddit API

Use PSAW (installed earlier) to query Pushshift and get back PRAW objects from the official Reddit API. Pushshift serves as the index of posts, and PRAW fetches the current version of each one, since Pushshift may be out of date if commenters edit their comments or comment on an old post.
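The notebook imports `reddit` and `psaw_pushshift` from a `secret_services` module that isn't shown here. As a rough sketch of what such a module might contain (the environment variable names and both function names are my assumptions, not necessarily what the repo uses), you could read the credentials from the environment and hand the PRAW client to PSAW:

```python
import os

# Hypothetical environment variable names; use whatever your own setup defines.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(env=None):
    """Collect the keyword arguments praw.Reddit needs for read-only access."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing Reddit credentials: " + ", ".join(missing))
    return {kwarg: env[name] for kwarg, name in ENV_KEYS.items()}


def build_clients(env=None):
    """Return (reddit, psaw_pushshift), the two names the notebook imports."""
    # Imported here so the credential helper works even without the packages installed.
    import praw
    from psaw import PushshiftAPI

    reddit = praw.Reddit(**load_reddit_credentials(env))
    # Passing the PRAW instance is what makes PSAW return PRAW objects from its searches.
    return reddit, PushshiftAPI(reddit)
```

Keeping the credentials out of the notebook this way means you can share the notebook without leaking your client secret.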

Below I query all comments in the environment subreddit that mention “companies” and store them, along with the ID of the post each one belongs to, in a pandas DataFrame called “df”.

from secret_services import reddit, psaw_pushshift
import utils
import tqdm
import pandas as pd

subreddit_name = "environment"
word_to_check = "companies"

# PSAW example: Pushshift finds the matching comments and PSAW returns them as PRAW objects
comments = psaw_pushshift.search_comments(q=word_to_check, subreddit=subreddit_name, limit=200, before=1629990795)

post_with_comments = []
for comment in comments:
    if word_to_check in comment.body.lower():  # case-insensitive check
        post_with_comments.append(
            {"comment_id": comment.id, "comment_text": comment.body, "score": comment.score, "post": comment.submission.id}
        )
    else:
        # edited or removed comments no longer contain the word, so print them and skip
        print(comment.body)
df = pd.DataFrame(post_with_comments)
df
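The `before` argument in the query above is a Unix timestamp in seconds. To target a specific window, such as all of 2020 from the intro, a small helper can build those timestamps. This is a sketch with helper names of my own; I'm assuming the Pushshift `after` parameter is passed alongside `before` in the same way.

```python
from datetime import datetime, timezone


def to_epoch(year, month=1, day=1):
    """Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def from_epoch(timestamp):
    """ISO 8601 string in UTC for a Unix timestamp, handy for sanity checks."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


# Bounds for "everything from 2020"
after_2020 = to_epoch(2020)
before_2021 = to_epoch(2021)
print(after_2020, before_2021)
```

You would then pass `after=after_2020, before=before_2021` to `search_comments`. Running `from_epoch(1629990795)` shows that the cutoff used above is 2021-08-26 15:13:15 UTC.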

Now take that DataFrame and extract all the comments from those posts:

full_discussion_rows=[]
# Only the first 5 posts: traversing the 6th post is super slow and takes a while https://www.reddit.com/r/environment/comments/pam1cx/the_colorado_river_that_supplies_water_to_40/
for post_id in tqdm.tqdm(df['post'].iloc[:5]):
    comments=utils.traverse_post(reddit.submission(post_id))
    for comment,level in comments:
        full_discussion_rows.append(
            {"comment_id": comment.id, "comment_text": comment.body, "score": comment.score, "post": post_id, "level": level}
        )
full_discussion_df=pd.DataFrame(full_discussion_rows)
full_discussion_df.to_csv("all_comments_from_found_posts_with_pushshift.csv",index=False)

There you go! If you want to see where the “traverse_post” function came from, watch my full tutorial series.
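If you just want the general idea, here is a minimal sketch of a depth-first walk that yields (comment, level) pairs like the loop above expects. It is not the actual `utils.traverse_post` from the repo: the function name `traverse_comments` is hypothetical, and I'm assuming top-level comments count as level 0.

```python
from types import SimpleNamespace


def traverse_comments(comments, level=0):
    """Yield (comment, level) pairs depth-first; top-level comments are level 0."""
    for comment in comments:
        yield comment, level
        # Each comment's replies sit one level deeper in the thread
        yield from traverse_comments(getattr(comment, "replies", ()), level + 1)


# Stand-in objects so the sketch runs without network access
reply = SimpleNamespace(id="c2", body="a reply", replies=[])
top = SimpleNamespace(id="c1", body="a top-level comment", replies=[reply])
other = SimpleNamespace(id="c3", body="another top-level comment", replies=[])

rows = [(comment.id, level) for comment, level in traverse_comments([top, other])]
print(rows)  # [('c1', 0), ('c2', 1), ('c3', 0)]
```

With PRAW you would likely call `submission.comments.replace_more(limit=None)` first to expand the "load more comments" placeholders, then pass `submission.comments` to the function. Expanding every placeholder costs extra API requests, which is one plausible reason large threads are slow to traverse.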
