When we're web scraping, we begin by sending a request to a website. To ensure that we're capable of scraping at all, we'll need to test that we can connect. Let's begin by creating our base scraping function. This will be what we execute.

```python
# scraping.py
import requests

def hackernews_rss():
    try:
        # Assumed target: the HackerNews RSS feed
        r = requests.get('https://news.ycombinator.com/rss')
        return print('The scraping job succeeded: ', r.status_code)
    except Exception as e:
        print('The scraping job failed. See exception: ')
        print(e)

print('Starting scraping')
hackernews_rss()
print('Finished scraping')
```

In the above, we call the Requests library and fetch our website using `requests.get(...)`. Additionally, I've wrapped this in a `try: except:` to catch any errors we may have later on down the road. I'm printing the status code to the terminal using `r.status_code` to check that the website has been successfully called. Once we run the program, we'll see a successful status code of 200:

```
$ python scraping.py
Starting scraping
The scraping job succeeded: 200
Finished scraping
```

This states that we're able to ping the site and "get" information. We've successfully illustrated that we can extract the XML from our HackerNews RSS feed. Next, we'll begin parsing the information. The RSS feed was chosen because it's much easier than parsing website information, as we don't have to worry about nested HTML elements and pinpointing our exact information.

Let's begin by looking at the structure of the feed. Each of the articles available on the RSS feed follows the same structure, containing all of its information within `item` tags. We'll be taking advantage of these consistent `item` tags to parse our information.

```python
# scraping.py
import requests
from bs4 import BeautifulSoup

def hackernews_rss():
    article_list = []
    try:
        r = requests.get('https://news.ycombinator.com/rss')
        soup = BeautifulSoup(r.content, features='xml')
        articles = soup.findAll('item')
        for a in articles:
            title = a.find('title').text
            link = a.find('link').text
            published = a.find('pubDate').text
            article = {
                'title': title,
                'link': link,
                'published': published
            }
            article_list.append(article)
        return print(article_list)
    except Exception as e:
        print('The scraping job failed. See exception: ')
        print(e)
```

Unpacking the above, we'll begin by checking out `articles = soup.findAll('item')`. BS4 has parsed our XML, allowing us to call `.find()` on the tags from the XML that we scraped. Each of the articles is then separated by the loop `for a in articles:`; this allows us to parse the information into separate variables and append them, as dictionaries, to the empty list we've created.
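To see how the `item`/`title`/`link`/`pubDate` extraction behaves without hitting the network, here is a minimal, self-contained sketch. It uses the standard library's `xml.etree.ElementTree` instead of BS4 (a deliberate swap so it runs with no third-party installs), and the RSS fragment is made up for illustration:

```python
# Stdlib-only sketch of the same parsing logic, run against a
# hypothetical RSS fragment rather than the live feed.
import xml.etree.ElementTree as ET

sample = """<rss><channel>
<item>
  <title>Example article</title>
  <link>https://example.com/a</link>
  <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
</item>
</channel></rss>"""

root = ET.fromstring(sample)
article_list = []
for item in root.iter('item'):  # analogous to soup.findAll('item')
    article_list.append({
        'title': item.find('title').text,
        'link': item.find('link').text,
        'published': item.find('pubDate').text,
    })

print(article_list[0]['title'])  # -> Example article
```

The list-of-dictionaries shape matches what the scraping function above builds, so anything that consumes `article_list` can be tested against a fragment like this first.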