I have been working on a project where I need to extract RSS feeds from various blogs and news websites. Essentially, I want to pass a URL to my API and have it return the RSS feed associated with that domain.

What is an RSS feed? It is a machine-readable XML file that a website publishes alongside its pages, usually tucked away in the page's code rather than shown to visitors.

As with most things, I wasn't the first person to come across this problem. Aaron Swartz (RIP) wrote his own script called feedfinder.py which does this exact same thing. However, a major shortcoming of this script is that it's fairly dated and written for Python 2. After fighting a losing battle trying to deal with Python's 2to3 conversion tool, I realized I'd already wasted more time trying to port this old script than it would take me to write a new one.

My Solution: Python 3 function for extracting RSS feeds from URLs

I wanted my function to be accurate and thorough, which (for me) means:

- I wouldn't miss any legitimate feeds that were on a website, and
- I wouldn't include any links that were not valid RSS feeds.

This script does have some non-standard dependencies, both of which you are probably already using if you're doing anything related to web scraping or feed reading: feedparser and beautifulsoup4.

I've copied my solution below, which you should be able to interpret fairly easily. I start by looking for <link> tags pointing to RSS feeds, then parse the page looking for any a hrefs pointing to links with "xml", "rss", or "feed" in the URL. Finally, I use feedparser to go through the list of possible RSS feeds and validate them to ensure that the links point to valid feeds.

Feel free to fork this gist on GitHub or download the raw file.
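The three steps described above (check <link> tags, scan a hrefs for "xml", "rss", or "feed", then validate each candidate) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the gist itself: it uses the standard library's html.parser instead of beautifulsoup4 so it runs without extra installs, it takes the page HTML as a string rather than fetching it, and the names FeedLinkParser, find_candidate_feeds, and find_feeds are hypothetical. The validation step is passed in as a function so the sketch does not need network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Substrings that mark an <a href> as a possible feed link (from the post).
FEED_HINTS = ("xml", "rss", "feed")


class FeedLinkParser(HTMLParser):
    """Hypothetical helper: collects candidate feed URLs from <link> and <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "link":
            # Assumption: a <link> whose type mentions "rss" or "xml"
            # (e.g. application/rss+xml) is advertising a feed.
            link_type = (attrs.get("type") or "").lower()
            if "rss" in link_type or "xml" in link_type:
                self._add(href)
        elif tag == "a":
            if any(hint in href.lower() for hint in FEED_HINTS):
                self._add(href)

    def _add(self, href):
        # Resolve relative links against the page URL and skip duplicates.
        url = urljoin(self.base_url, href)
        if url not in self.candidates:
            self.candidates.append(url)


def find_candidate_feeds(html, base_url):
    """Return possible feed URLs found in the page, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.candidates


def find_feeds(html, base_url, is_valid_feed):
    """Keep only the candidates that the supplied validator accepts."""
    return [url for url in find_candidate_feeds(html, base_url) if is_valid_feed(url)]
```

With feedparser installed, one possible validator is `lambda url: len(feedparser.parse(url).entries) > 0`, which treats a link as a valid feed only if it parses to at least one entry. Keeping the validator separate from the link scan makes the two goals above easy to reason about: the scan is deliberately broad so no legitimate feed is missed, and the validator filters out everything that is not actually a feed.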