06 Scraping Images

January 29, 2022

Signs of a Scraping

What I Did to Reach This Point:

  1. Started with a DataCamp course on Web Scraping in Python: https://www.datacamp.com/courses/web-scraping-with-python
  2. Tried Scrapy + SplashRequest
  3. Tried Selenium

DataCamp

This got me familiar with CSS locators and XPath notation. It also got me working in PyCharm and picking Python back up. Lastly, it introduced me to Scrapy and the concepts behind building a spider. Armed with some new knowledge, I set about writing my first scraper. Following along with a few YouTube videos specifically geared towards scraping Amazon with Scrapy, I was able to get my first spider running. It looks at a specific brand on Amazon and scrapes information about the search results. I have run it multiple times because I was interested in seeing how this data changes over time. While it didn't work perfectly (it kept getting kicked off Amazon after about 100 pages), it was an encouraging enough first step for me to keep on keeping on.

JSON feed of Amazon product titles from the brand CENTUKE
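
As a rough illustration of the locator idea, here is a minimal sketch that pulls titles out of a made-up snippet of search-result markup and prints them as a JSON feed. It uses the standard library's ElementTree and its limited XPath support rather than Scrapy, and the markup, the class names, and the `extract_titles` helper are all assumptions for the example, not Amazon's real page structure or the spider from this post.

```python
import json
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a search results page.
SAMPLE_PAGE = """
<div id="results">
  <div class="result"><h2>Garden Flag A</h2></div>
  <div class="result"><h2>Garden Flag B</h2></div>
  <div class="ad"><h2>Sponsored thing</h2></div>
</div>
"""


def extract_titles(markup):
    """Return the title text of every element matching the 'result' locator."""
    root = ET.fromstring(markup.strip())
    # XPath: any <div> with class="result", anywhere below the root.
    results = root.findall(".//div[@class='result']")
    return [{"title": node.findtext("h2")} for node in results]


if __name__ == "__main__":
    print(json.dumps(extract_titles(SAMPLE_PAGE), indent=2))
```

Real pages are rarely well-formed XML, which is one reason to reach for Scrapy's own selectors in practice; the point here is only how a locator narrows a page down to the pieces you care about.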

I also realized that because scraping is so much about structure, the code you write for it often ends up being very beautiful.

Scrapy + SplashRequest

Looking more specifically at my thesis area of interest, I realized that there was a challenge standing between me and scraping information about this image. At this point, I can pretty reliably spot which flag products on Amazon are going to have an image of the woman with the flag (the first image is at a slight angle, there are waves in the flag itself, and the flag pole is black). But that wasn't exactly something I could easily communicate to a computer. After sleeping on the problem for a few nights, I had a solution: what if, instead of searching and scraping from Amazon, I searched, scraped, and reached Amazon from a Google reverse image search?

With that, I started to sketch out the parameters and information that I would get from my ideal spider.

My Dream Spider

I realized that I wanted to be able to do a few things that I really did not know how to do while scraping: take screenshots of pages and save images. I also wanted to be able to search Google Images and TinEye at the same time. Given that my scraping experience was grounded in JSON as the output, I started by building some of the structure for a Scrapy spider. I had a few things in place and felt confident that I could suss out the informational data. So, armed with a not-terribly-useful JSON, I went to tackle a stickier problem.

The World's Least Helpful JSON Feed

Next, I started tackling the screenshot problem. This led me to SplashRequest. After some noodling around, I was able to get it working, but I had a problem. Things were working OK with TinEye, but Google Images was a bust. Nothing was rendering in the screenshot and I was really struggling to reliably collect data. I assumed that my CSS locators were wrong. I also assumed that if I was hitting the locators correctly, that would force enough of the page to load and allow me to get a nice little screenshot. After working under these mistaken assumptions for longer than I'd care to admit, I came upon a neat little Chrome extension, ScrapeMate: https://github.com/hermit-crab/ScrapeMate#readme This helped me realize that the problem wasn't in my locators, but rather existed somewhere else. I started to investigate infinite scrolls, using the browser's network tool to locate the data source, and a tool named zenserp.
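
For the screenshot step itself, here is a minimal sketch of turning a Splash result into a file on disk. It assumes the request went to Splash's `render.json` endpoint with a PNG requested, in which case the JSON result carries the screenshot as a base64 string under `png` (in scrapy-splash that dictionary is available as `response.data`). The `save_screenshot` helper and the file name are hypothetical, and the demo payload is fake bytes rather than a real render.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(splash_data, path):
    """Decode the base64 'png' field of a Splash JSON result and write it to disk.

    Returns the number of bytes written, or 0 if no screenshot was present.
    """
    encoded = splash_data.get("png")
    if not encoded:
        return 0
    png_bytes = base64.b64decode(encoded)
    Path(path).write_bytes(png_bytes)
    return len(png_bytes)


if __name__ == "__main__":
    # Fake payload: the PNG signature plus filler, standing in for a real render.
    fake_png = b"\x89PNG\r\n\x1a\n" + b"placeholder"
    fake_result = {"png": base64.b64encode(fake_png).decode("ascii")}
    with tempfile.TemporaryDirectory() as folder:
        written = save_screenshot(fake_result, Path(folder) / "page.png")
        print(f"wrote {written} bytes")
```

Note that a blank screenshot decodes just as happily as a good one, so a helper like this cannot tell you whether the page actually rendered.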

Selenium

Eventually, I found my way to this fantastically helpful article: https://medium.com/@wwwanandsuresh/web-scraping-images-from-google-9084545808a2. I set about integrating Selenium on top of Scrapy and Splash. While I was getting some success using both tools, I decided that this was a silly approach and started fresh using only Selenium, focusing on just scraping the images from the Google Image Search page.

This was successful and extremely satisfying to get up and running!
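
To give a feel for the Selenium-only approach, here is a minimal sketch under some loud assumptions: it is not the code from the linked article or from my tool, the function names are made up, and it simply grabs every `<img>` on an already-loaded results page instead of targeting Google's actual markup. It also assumes image sources arrive either as `data:` URIs or as plain URLs, and it saves each distinct image once by hashing its bytes.

```python
import base64
import hashlib
import tempfile
import time
from pathlib import Path
from urllib.parse import unquote_to_bytes
from urllib.request import urlopen


def collect_image_sources(driver, scrolls=3, pause=1.0):
    """Scroll an already-loaded results page, then return every <img> src found."""
    # Imported here so the helpers below work without Selenium installed.
    from selenium.webdriver.common.by import By

    for _ in range(scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for lazily loaded thumbnails
    images = driver.find_elements(By.CSS_SELECTOR, "img")
    sources = (img.get_attribute("src") for img in images)
    return [src for src in sources if src]


def image_bytes(src):
    """Return raw bytes for an image src, which may be a data: URI or a URL."""
    if src.startswith("data:"):
        header, _, payload = src.partition(",")
        if header.endswith(";base64"):
            return base64.b64decode(payload)
        return unquote_to_bytes(payload)
    with urlopen(src, timeout=10) as response:
        return response.read()


def save_unique(sources, out_dir):
    """Save each distinct image once, named by a SHA-256 prefix; return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for src in sources:
        data = image_bytes(src)
        # Generic extension, since the real image type is not inspected here.
        target = out / (hashlib.sha256(data).hexdigest()[:16] + ".img")
        if not target.exists():
            target.write_bytes(data)
            saved.append(target)
    return saved


if __name__ == "__main__":
    # Offline demo with fake data: URIs; the duplicate is saved only once.
    one = "data:image/png;base64," + base64.b64encode(b"first").decode("ascii")
    two = "data:image/png;base64," + base64.b64encode(b"second").decode("ascii")
    with tempfile.TemporaryDirectory() as folder:
        print(len(save_unique([one, two, one], folder)), "images saved")
```

With a real browser, the idea would be to create a Selenium driver, load the search results page, and pass `collect_image_sources(driver)` to `save_unique`. Hashing the bytes keeps repeat runs on different start images from saving the same file twice.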

End result! I now have a tool that I can use to scrape oh so many versions of this woman with a flag. I ran it twice on two different start images and just now collected 2,633 images in a half hour. Including this gem, which might be my new fav flag.

Process below!
