Craigslist TCG Web Scraper

A good friend of mine has been playing Magic the Gathering (MTG) since childhood. He still plays and ranks in tournaments, he actively follows MTG news, subscribes to message boards, and every time I go over to his apartment, he has envelopes stacked on the kitchen table containing highly valuable cards that he’s buying and selling. One day, not too long ago, I had inquired about why he buys and sells and what his strategy is. Long story short is that speculation and hype over certain cards influence the market. Certain MTG cards have different synergies with other cards depending on the game format and the deck archetype. Another important factor is that even cards released over a decade ago are still highly relevant and dominant in today’s game. Because my friend has been playing since he was a teenager, he’s collected many of those rare and valuable cards and knows how to create decks around them, even using cards from more recent releases. When MTG news leaks or word gets around about a re-print of an old card or which new cards are being released, my friend is able to discern which cards will lose or gain value. Having this foresight is key to making a profit when buying or selling Magic the Gathering cards or packs. While he does a majority of his selling on TCGPLayer.com, he’ll often browse eBay and Craigslist in the hopes that he’ll find seller who has incorrectly priced his cards because doesn’t have as much MTG knowledge as he. He spends HOURS searching to find these deals that’ll make him either a better player or some profit. With having no knowledge about MTG, but knowing how to code, I told my friend that I could help save him some time looking on Craigslist by creating a tool that’ll do the searching for him.

The Solution

This script will search Craigslist for Pokemon, Magic the Gathering, and Yu-Gi-Oh! card listings and send recipients an email containing a table of new listings in their area. The image below is of the email the user receives from the script. It contains each listing’s posting date, the card game the post pertains to, the title of the listing, the listing’s location, the listing’s price, and the link to the listing on Craigslist. The script can be scheduled to run automatically so that recipients receive email notifications hourly, daily, etc.

How I Did It

  1. Web Scaper - Scrape Craigslist for relevant TCG listings data

  2. MySQL Database - Write data to a local MySQL database

  3. Email - Send price, location, and url to user about new TCG listings

  4. Schedule Job - Send email containing the relevant information at a given time interval

Web Scraper

Typically when I go on Craigslist, often looking for furniture and decor, I will make use of the following features in order to narrow my search to the newest listings near me. After entering in the keywords of the item I am searching for, I want to know what the description of the post is, what city the item is in, and what the selling price is. Because TCG is in such hot demand, making this search manually, multiple times a day (to not miss a deal on cards) becomes time intensive. By gathering this data programmatically, I can easily reduce the amount of effort this takes.

Screen Shot 2021-01-10 at 1.15.29 PM.png
  1. Search for Sale

  2. Miles From Zip

  3. Sort By Newest

As this was my first ever Python project, I looked to the internet to see what resources where available to help guide me through this project. What I found was this article from Riley Predum called “Web Scraping Craigslist: A Complete Tutorial”. In this article, he explains how he would obtain data about apartments in the East Bay Area of California for the purpose of data exploration. I reused a majority of his script but made the necessary modifications to get the relevant data that I needed. My full script can be found here in my GitHub repository for this project, but I will walk you through each section below.

After inspecting Craigslist, I was able to determine that the data for each individual listing was stored in <li class=“result-row”>. From this class, I could grab the post’s date, title, price, and location. In order to get data about each listing, I would need to write a script that loops through each <li class=“result-row”> for each page of search results.

Screen Shot 2021-01-11 at 9.09.40 PM.png

To do this, I am using BeautifulSoup from the bs4 package to parse the HTML of the webpage that’s being returned from the request to the URL. To do that, I am making a call to the website using get from the requests package and then assigning it to a variable, which I then pass to BeautifulSoup so that I can find all of the <li class=“result-row”>. Here, I also check the number of elements in the result to make sure it matches the number of posts on the page (in this case there are 22). This will be important later on when try to obtain data about all of the posts on each page.

Next I went ahead and grabbed all of the information I would need from each listing. Class bs4.element.ResultSet is indexed, so I looked at the first post by indexing posts[0].

Screen+Shot+2021-01-11+at+9.51.41+PM.jpg

The code block below illustrates how I parse out all of the attributes I needed for a single post. You’ll see, in the full script, that as I loop through each listing, some of the attributes that I need do not exist because the seller did not input that data in their listing. I’ve handled these cases using if statements, which you’ll see a little bit later on. For example, sometimes the seller doesn’t list a location. When this occurs, I substitute ‘N/A’ for all instances in which location is missing. You can reference the code below and screenshot above to get a better understanding of how to scrape for the data in the html output.

Now that I know how to grab all of the data for a single post, it’s time to write out the for loop that will allow me to obtain each listings’ data across all pages returned from my search criteria. The code below will not only show you how I loop through each page, but I also how I handle missing data, clean raw data, and how I store each posts’ data to a list sot that it can later be stored in a data frame. The script below is annotated to help explain what I am trying to accomplish in each line or section. However, in summary, I taking the following steps in the code.

  • For each page, out of all pages returned from search (page in pages), and

    • For each post on the page, out of all posts (post in posts)

      • If post is missing information about seller’s neighborhood

        • Add post date to list of post dates

        • Add post title to list of post titles

        • Add post url to list of post urls

        • Add post price to list of post urls

          • If the post has a listing location (alternative location field if neighborhood is missing)

            • Add listing location to list of listing locations, otherwise replace with ‘N/A’ and add that to the list of listing locations

      • If post is NOT missing information about neighborhood

        • Add post date to list of post dates

        • Add post title to list of post titles

        • Add post url to list of post urls

        • Add post price to list of post urls

        • Add post location to list of listing locations

If you’ve successfully run the script above, you should now have all of the information about each post in their respective lists. We’ll need to combine the lists, of equal lengths, into a data frame so that each row will represent each listing in the search. You can do this by using the following code.

For each card game you are interested in searching for, you’ll need to take the same steps as explained above but you’ll replace the search parameters in the response url to search for your TCG of interest. In my full script, I search for Pokemon, Magic the Gathering, and Yu-Gi-Oh!. If you want to try it yourself, go for it, otherwise you can view how I did it here.

In this next step below, assuming you’ve created multiple data frames (one for each TCG), I’m combining all of the search results from each card game into one single data frame.

Write to and Read From Database

Before sending the data from our newly created data frame to the end user, we need to take a few more steps to ensure that the listings we are sending to the recipient are only the new listings. While there are numerous ways to do achieve this, I chose to do the following. I define a function called write_to_DB(), that when called, connects to my local MySQL database instance. The script uses my database username and password and writes new listings from our latest scrape to the MySQL database. Note that the function reads in configs from a file on my desktop called "/Users/brandonlevan/Desktop/Craigslist/cl_config.txt". When you set up your MySQL instance, remember to note what your username and password is so that you can add to the text file, which I named “cl_config”. The screenshot below shows how you should format the text inside this file so that the script can read from it. I have my config file in the same folder as my Python script, but make sure when you are defining the file location that you add the correct location of your config file to the script since ours will likely be different.

Add your username and password to cl_config.txt

Add your username and password to cl_config.txt

At the end of write_to_DB() function, I write new_listings_updated to our TCG table. What is new_listings_updated? To ensure that we are only sending the end user new listings, we need to add some logic to our script to do so. In the next step, we will read all data in from our MySQL table and compare that data to TCG_Table (combined data frame from Pokemon, Magic the Gathering, and Yu-Gi-Oh! scrape). To compare and determine only new listings, we will drop the date field and concatenate the old listings (from the db) and the latest scrape results from TCGTable. After concatenating the two tables together, we drop duplicates. In this case, dropping duplicates will remove duplicate rows that contain the same ['TCG','POST TITLE','POST LOCATION','PRICE','URL']. Note that on Craigslist a seller can renew a listing, so we may see the same listing multiple times, yet with different dates. In step one we remove the date and drop duplicated posts, so that in the next step we can join back to TCG_Table to get the latest date for all of the listings. After getting the date, we use a lambda function to get the price column into a currency format so that it’s more easily readable for the end user.

In the final steps (code block above), we check to see whether we have new listings or not. We write an if statement to check if, when we drop duplicates, there are any new rows of data. If all rows are dropped when checking for duplicates, the if statement will print “No New Listings”; otherwise, the script will run the write_to_DB() and send_email() functions. The send_email() function looks like this.

Email

In this email function, you can set the subject line and recipients. If you have more than one recipient you want to receive the notification, put a comma in between the recipients’ email addresses. You will also have to set msg[‘From’] to your email address (the one you want to send from). This script works for me using a Gmail address (which is what I am using). If this script returns an error, it may be that you need to adjust your Gmail account security settings to allow other applications to send emails from your address.

Scheduling Your Script to Run Automatically

At this point, your script will only run when you choose to run it via your IDE. I did the development of this script using a Jupyter notebook, so whenever I manually chose to run all, I get an email about new listings. But what if you don’t want to manually run your script to get information about new listings? I solved this using a cron job. There are many articles on the internet about how to get started with cron job, but I did the following. First you’ll need to save a copy of your working script to your desktop (ideally in a folder with your config file). Next go to your terminal, and do the following.

And with that, you can now send yourself (and whoever-else you want) and email containing Craigslist’s Pokemon listings whenever you’d like.

If you’d like to view the whole script or download it, you can do so here.

I really hope you enjoyed this post and ultimately learned something new. If you have any questions about what I wrote here or just want to leave some feedback about this post, feel free to do so in the comment section below. Lastly, I just want to state that this was my first Python project so there is definitely room for improvement. Doing these projects and exercises through blogging will help me advance my skills. If you’d like to work on a project together or want to recommend ways to improve this script, please don’t hesitate to reach out. Thanks for reading.

Previous
Previous

Whole Lotta WHAT? Playboi Carti Lyrical Analysis