- Web scraping is the act of pulling data directly from a website by parsing the HTML of the web page itself.
23 May 2015
For this tutorial, we are going to build a site with Ruby on Rails that can scrape a particular webpage for some specific information. More specifically, this tutorial will walk you through how to come up with a scraper for the titles and links on Reddit’s front page. You can go to the website and take a look to see what it is we are trying to obtain.
Learning Objectives
By doing this tutorial you should have an understanding of the following:
- How to approach scraping a website
- How to use the Interactive Ruby Shell (IRB)
- How to start a Ruby on Rails project
- What a controller is
- What a view is
Step 1: Try it in the IRB
The IRB is kind of like a sandbox: it allows you to play with (and execute) Ruby code without having to start a huge project or do anything complex at all. That’s why it is often a good idea to try things out in the IRB before we start building a whole site and getting into a lot of complexity. It’s always nice to play with ideas, make sure that they are feasible, and get some idea of how you’re going to approach the problem you want to solve.
Start the IRB
This tutorial is going to assume that you have both Ruby and Rails already installed. The IRB comes with Ruby, so all you have to do is go to the terminal and type irb. You should get a prompt like this:
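Roughly (the exact session and line numbers will differ):

```
irb(main):001:0>
```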
You can try to play around with it, but we won’t spend too much time on explaining the IRB itself. You can do simple math like this:
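For example, evaluating a simple expression at the prompt:

```
irb(main):001:0> 1 + 2
=> 3
```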
From this point on, I will omit writing the prompt irb(main):001:0>; it’s assumed that you will be typing code at the prompt.
Getting Ready to Scrape
We will need a tool to help us open webpages in our code, and also the ability to go through the HTML and find what we want. The former is accomplished by a wrapper called open-uri, and the latter can be done with a parser called nokogiri.
So for our first step, we will require those two tools and then use them to load an entire page into our variable, doc:
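A sketch of what that looks like in the IRB (the target URL is Reddit’s front page; on current Ruby versions Kernel#open no longer opens URLs, so URI.open is used here):

```ruby
require 'open-uri'
require 'nokogiri'

# Fetch Reddit's front page and parse the HTML into a Nokogiri document.
doc = Nokogiri::HTML(URI.open("https://www.reddit.com"))
```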
Notice that the open function opens the webpage, and then Nokogiri parses it into the doc variable as HTML. You will also see a lot of gibberish being displayed, but that’s actually the information on the page as parsed by Nokogiri.
Identifying What We Want
Now that we’ve grabbed the entire page in the doc variable, we can mess with it however we want. So let’s try to see how we can get the titles and links of the entries. To do that, we have to understand the structure of how the page is laid out. Let’s go to the browser and find out.
So go to your favourite browser, navigate to reddit.com, right-click one of the entry titles and select Inspect Element. From inspecting the element of an entry title, you should find something that looks like this:
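A simplified version of that markup (class names as described below; the href value is just a placeholder):

```html
<div class="entry unvoted">
  <p class="title">
    <a class="title" href="http://example.com/some-article">Some post title</a>
  </p>
</div>
```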
What this tells us is that every entry is wrapped inside a <div> with the class entry, and the title and link that we want are represented by an anchor tag <a> inside of a paragraph tag <p> with the class of title.
That’s kind of confusing, so let’s start simple.
Getting at the Information
Let’s start small and simply try to see if we can get a variable that represents all the entries on the page.
Note: Many scrapers use something called XPath, but for simplicity’s sake, we will use CSS selectors, as Nokogiri provides that option for us.
Let’s put all the div tags on the page with a class of entry inside a variable we will call entries:
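With Nokogiri’s CSS selector support, that is a one-liner:

```ruby
entries = doc.css('div.entry')
```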
Let’s check how many we’ve got:
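For example (35 is what the front page returned at the time of writing):

```ruby
entries.count
# => 35
```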
Great! If you go to the front page of Reddit.com in your browser, you can count exactly 35 links on the front page. We’re getting somewhere!
Now let’s try to get the specific title of just one post. We’ll try with just the first post for now, so we’ll use entries[0] as our starting point and try to get more specific from there. Try the following:
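Something along these lines:

```ruby
entries[0].css('p.title>a')
```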
Note that the selector string p.title>a simply means: take the p tag with a class of title and then get the a tag immediately under that.
If you take a careful look, you should see a representation of exactly the anchor tag we are looking for. In order to get the title and the link, we will use the following code:
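A sketch of those two lines (the variable names title and link are just for illustration):

```ruby
title = entries[0].css('p.title>a').text
link  = entries[0].css('p.title>a')[0]['href']
```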
The code for the title is quite self-explanatory: it is simply the text of the tag itself. For the link, we have to read the href attribute; since the css call returns a set of matching nodes, we first grab the first node with [0] and then look up its attribute with ['href'] to get the link.
If you type those two lines above, you should be able to see the title and link as displayed by IRB.
Now let’s see if we can list the titles for all 35 entries. Let’s use the .each construct:
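Roughly like this:

```ruby
entries.each do |entry|
  puts entry.css('p.title>a').text
  puts entry.css('p.title>a')[0]['href']
end
```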
Notice how we use puts so that Ruby knows to display it onto our screen.
You should see the titles and their links being displayed in the terminal.
Using a Class
Okay great, we know how to get the information, but this is kind of unwieldy, so let’s attempt to create a class of Entry objects. Each Entry object will house the title and link for easy access. We will then try to create a whole array of these objects so it will be easy for us to manage and move around.
Note that every time we create an Entry object, we will need to initialize it with a title and a link. The initialized values will be assigned to the instance variables @title and @link respectively. The title and link of each Entry object are also made readable by the attr_reader lines.
Now let’s try to make an array of Entry objects:
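One way to build it, using a temporary variable newEntry for each object:

```ruby
entriesArray = []
entries.each do |entry|
  title    = entry.css('p.title>a').text
  link     = entry.css('p.title>a')[0]['href']
  newEntry = Entry.new(title, link)
  entriesArray.push(newEntry)
end
```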
You can try running something like entriesArray[0].title or entriesArray[0].link to ensure that this works.
Refactoring
Refactoring means to change the code to make it better (in any number of ways) while keeping its functionality the same.
Let’s refactor the code so that it’s more readable. We can immediately place the newly created Entry object into the entriesArray instead of using the temporary variable newEntry:
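For example:

```ruby
entriesArray = []
entries.each do |entry|
  title = entry.css('p.title>a').text
  link  = entry.css('p.title>a')[0]['href']
  entriesArray.push(Entry.new(title, link))
end
```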
Should we refactor again so that we remove the need for the temporary variables of title and link?
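That would end up looking something like this:

```ruby
entriesArray = []
entries.each do |entry|
  entriesArray.push(Entry.new(entry.css('p.title>a').text,
                              entry.css('p.title>a')[0]['href']))
end
```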
This is not so easy to read, so that’s not good. We should refactor to be more concise when possible, but we should not reduce our code so much that it becomes hard to read.
Now we’re ready to make this a Rails app.
Step 2: Make this a Rails App
Let’s open a brand new terminal and type in the following to start our brand new Rails app:
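For example (the app name reddit_scraper is just a placeholder; use whatever you like):

```
rails new reddit_scraper
```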
After some setup stuff is run, you can cd into the directory and then type in rails server to run the app. Navigate to http://localhost:3000 in your browser to see the default app page.
Routing, Actions, and Controllers
The very basic explanation of what controllers do is this: when you try to access a website run by Rails, a bunch of code in the Rails app (called a router) will direct it to process a block of code (called an action) inside a particular file (called a controller).
Essentially, a controller is what processes a request sent to the server by a client. And each controller can house different actions.
For our example, let’s add the scrape_reddit action inside our main controller in app/controllers/application_controller.rb:
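A sketch of the controller at this point (this tutorial dates from Rails 4, where render text: works; on Rails 5.1 and later use render plain: instead):

```ruby
class ApplicationController < ActionController::Base
  def scrape_reddit
    render text: "scrape reddit data here"
  end
end
```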
We just made an action inside the main application controller. Right now, it does nothing other than try to render the text “scrape reddit data here”; we will add more functionality to this later. We just want to see that it works first.
Now let’s point to it with our router so that it loads right away when we load our webpage.
Add the following to config/routes.rb:
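Something along these lines (keeping any other routes you may already have):

```ruby
Rails.application.routes.draw do
  root 'application#scrape_reddit'
end
```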
What we’re doing here is to say that the root route should be directed to the application controller and, more specifically, the scrape_reddit action within that controller.
Now let’s test it. Go back to your browser and hit refresh. You should simply see the text “scrape reddit data here”.
Great, it works as expected! But we don’t just want to render the text, we want to scrape Reddit’s front page and then render the titles and links.
Scraping a Page
Let’s start small and grab the front page like we did before in IRB. Just for fun we’ll also try to see what happens when we render the retrieved document.
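Something like this (the require lines can sit at the top of the controller file):

```ruby
require 'open-uri'
require 'nokogiri'

class ApplicationController < ActionController::Base
  def scrape_reddit
    doc = Nokogiri::HTML(URI.open("https://www.reddit.com"))
    render text: doc
  end
end
```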
Go back to the browser, refresh and see. No you’re not dreaming, it is actually pulling and displaying the reddit front page itself!
Of course this is overkill, so let’s get more specific and paste our previous code into the scrape_reddit action so that we can grab just the titles and links.
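The action might now look like this:

```ruby
def scrape_reddit
  doc     = Nokogiri::HTML(URI.open("https://www.reddit.com"))
  entries = doc.css('div.entry')

  entriesArray = []
  entries.each do |entry|
    entriesArray.push(Entry.new(entry.css('p.title>a').text,
                                entry.css('p.title>a')[0]['href']))
  end

  render text: entriesArray
end
```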
If you tried to refresh and run the changes above, you’ll encounter an error. That’s because we forgot to define our Entry class! So let’s do that right now, outside of the action itself:
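For example, near the top of app/controllers/application_controller.rb, outside the action:

```ruby
class ApplicationController < ActionController::Base
  class Entry
    attr_reader :title, :link

    def initialize(title, link)
      @title = title
      @link  = link
    end
  end

  # scrape_reddit action (unchanged from above)
end
```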
Try that.
No error anymore! But all we see is an array of ‘#’ symbols. What’s happening?
Well, that’s actually the entriesArray! It’s filled with 35 Entry objects as expected. But since there is no string representation of our Entry object, each one shows up as just ‘#’.
Rendering Template/Views
Okay, we need a smart way to be able to render the Entry objects as a list that the user can see. For this, we need to use a view. This is basically the skeleton with which we will send our data.
Let’s make a new file here: app/views/scrape_reddit.html.erb
This will be our template/view that we render after grabbing the information. So, instead of rendering text at the end of our scrape_reddit action, we are going to make it render the view instead, with the entriesArray. We’ll need to make a couple of changes to our scrape_reddit action in order to do this:
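The updated action might look like this:

```ruby
def scrape_reddit
  doc     = Nokogiri::HTML(URI.open("https://www.reddit.com"))
  entries = doc.css('div.entry')

  @entriesArray = []
  entries.each do |entry|
    @entriesArray.push(Entry.new(entry.css('p.title>a').text,
                                 entry.css('p.title>a')[0]['href']))
  end

  render template: 'scrape_reddit'
end
```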
We’ve made two changes:
- We have prepended the entriesArray variable with an @ symbol. This makes it an instance variable so that we can use it inside our view.
- At the end, we have changed render text: entriesArray to render template: 'scrape_reddit' so that Rails will know to render the scrape_reddit view with the data context.
Finally, let’s go back to our view at app/views/scrape_reddit.html.erb and add the following:
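A minimal version of the template (the exact markup is up to you; here each entry becomes a <p> element containing a link):

```erb
<% @entriesArray.each do |entry| %>
  <p>
    <a href="<%= entry.link %>"><%= entry.title %></a>
  </p>
<% end %>
```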
So now that we have access to our @entriesArray variable, we can try to display each entry’s title and link property. To do that, we have:
- @entriesArray.each do |entry| to start things off, so that Rails will know to repeat the following for each entry in @entriesArray
- entry.title and entry.link, which allow Rails to render that <p> element with the corresponding data
That’s it! It’s actually quite straightforward. You can now go into your browser, refresh the page, and see Reddit’s front page entry titles and links!
Search Engine Scraping
Search engine scraping is the process of harvesting URLs, descriptions, or other information from search engines such as Google, Bing or Yahoo. It is a specific form of screen scraping or web scraping dedicated to search engines only.
Most commonly, larger search engine optimization (SEO) providers depend on regularly scraping keywords from search engines, especially Google, to monitor the competitive position of their customers' websites for relevant keywords or their indexing status.
Search engines like Google do not allow any sort of automated access to their service,[1] but from a legal point of view there is no known court case or broken law.
The process of entering a website and extracting data in an automated fashion is also often called 'crawling'. Search engines like Google, Bing or Yahoo get almost all their data from automated crawling bots.
Difficulties
Google is by far the largest search engine, with the most users as well as the most advertising revenue, which makes it the most important search engine to scrape for SEO-related companies.[2]
Google does not take legal action against scraping, likely for self-protective reasons. However, Google uses a range of defensive methods that make scraping its results a challenging task.
- Google tests the User-Agent (browser type) of HTTP requests and serves a different page depending on the User-Agent. Google automatically rejects User-Agents that seem to originate from a possible automated bot. [Part of the Google error page: Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html ] A typical example is the command-line browser cURL: Google will simply refuse to serve any pages to it, while Bing is a bit more forgiving and does not seem to care about User-Agents.[3]
- Google uses a complex system of request rate limitation which differs for each language, country and User-Agent, as well as depending on the keyword and keyword search parameters. The rate limitation can make automated access to a search engine unpredictable, as the behaviour patterns are not known to the outside developer or user.
- Network and IP limitations are also part of the scraping defense systems. Search engines cannot easily be tricked by simply changing to another IP, which is why using proxies is a very important part of successful scraping. The diversity and abusive history of an IP matter as well.
- Offending IPs and offending IP networks can easily be stored in a blacklist database to detect offenders much faster. The fact that most ISPs give dynamic IP addresses to customers requires that such automated bans be only temporary, so as not to block innocent users.
- Behaviour-based detection is the most difficult defense system. Search engines serve their pages to millions of users every day, which provides a large amount of behaviour information. A scraping script or bot does not behave like a real user: aside from having non-typical access times, delays and session times, the keywords being harvested might be related to each other or include unusual parameters. Google, for example, has a very sophisticated behaviour analysis system, possibly using deep learning software to detect unusual patterns of access. It can detect unusual activity much faster than other search engines.[4]
- HTML markup changes: depending on the methods used to harvest a website's content, even a small change in the HTML can break a scraping tool until it is updated.
- General changes in detection systems. In the past years search engines have tightened their detection systems nearly month by month, making it more and more difficult to scrape reliably, as developers need to experiment and adapt their code regularly.[5]
Detection
When a search engine's defenses suspect that an access might be automated, the search engine can react in several ways.
The first layer of defense is a captcha page[6] where the user is prompted to verify that they are a real person and not a bot or tool. Solving the captcha creates a cookie that permits access to the search engine again for a while. After about one day the captcha page is removed again.
The second layer of defense is a similar error page but without a captcha; in such a case the user is completely blocked from using the search engine until the temporary block is lifted or the user changes their IP.
The third layer of defense is a long-term block of the entire network segment. Google has blocked large network blocks for months. This sort of block is likely triggered by an administrator and only happens if a scraping tool is sending a very high number of requests.
All these forms of detection may also happen to a normal user, especially users sharing the same IP address or network class (IPv4 as well as IPv6 ranges).
Methods of scraping Google, Bing or Yahoo
To scrape a search engine successfully, the two major factors are time and amount.
The more keywords a user needs to scrape, and the less time there is for the job, the more difficult scraping will be and the more developed a scraping script or tool needs to be.
Scraping scripts need to overcome a few technical challenges:[7]
- IP rotation using proxies (proxies should be unshared and not listed in blacklists)
- Proper time management: time between keyword changes, pagination, as well as correctly placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords or pages) per hour up to 100 and more per hour for each IP address / proxy in use. The quality of IPs, methods of scraping, keywords requested and language/country requested can greatly affect the possible maximum rate (see the sketch after this list).
- Correct handling of URL parameters, cookies as well as HTTP headers to emulate a user with a typical browser[8]
- HTML DOM parsing (extracting URLs, descriptions, ranking position, sitelinks and other relevant data from the HTML code)
- Error handling, automated reaction to captcha or block pages and other unusual responses[9]
- Captcha handling, as explained above[10]
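As an illustration of a few of these points (IP rotation, browser-like headers and randomized delays), here is a rough Ruby sketch; the proxy hosts, search URL and keyword list are placeholders, not real endpoints:

```ruby
require 'net/http'
require 'uri'

# Hypothetical pool of unshared proxies (placeholders).
PROXIES = [
  { host: 'proxy-1.example.com', port: 8080 },
  { host: 'proxy-2.example.com', port: 8080 }
]

KEYWORDS = ['ruby web scraping', 'rails nokogiri tutorial']

KEYWORDS.each_with_index do |keyword, i|
  proxy = PROXIES[i % PROXIES.size] # simple round-robin IP rotation
  uri   = URI("https://search.example.com/search?q=#{URI.encode_www_form_component(keyword)}")

  http = Net::HTTP.new(uri.host, uri.port, proxy[:host], proxy[:port])
  http.use_ssl = true

  request = Net::HTTP::Get.new(uri)
  # Emulate a typical browser by sending ordinary headers.
  request['User-Agent']      = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'
  request['Accept-Language'] = 'en-US,en;q=0.9'

  response = http.request(request)
  puts "#{keyword}: HTTP #{response.code}"

  sleep(rand(30..90)) # randomized, correctly placed delay between requests
end
```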
An example of open source scraping software which makes use of the above-mentioned techniques is GoogleScraper.[8] This framework controls browsers over the DevTools Protocol and makes it hard for Google to detect that the browser is automated.
Programming languages
When developing a scraper for a search engine, almost any programming language can be used, although, depending on performance requirements, some languages will be preferable.
PHP is a commonly used language for writing scraping scripts for websites or backend services, since it has powerful capabilities built in (DOM parsers, libcURL); however, its memory usage is typically around 10 times that of similar C/C++ code. Ruby on Rails as well as Python are also frequently used for automated scraping jobs. For the highest performance, C++ DOM parsers should be considered.
Additionally, bash scripting can be used together with cURL as a command line tool to scrape a search engine.
Tools and scripts
When developing a search engine scraper there are several existing tools and libraries available that can either be used, extended or just analyzed to learn from.
- iMacros - A free browser automation toolkit that can be used for very small volume scraping from within a user's browser[11]
- cURL - a command-line browser for automation and testing, as well as a powerful open source HTTP interaction library available for a large range of programming languages[12]
- google-search - A Go package to scrape Google. [13]
- GoogleScraper – A Python module to scrape different search engines (like Google, Yandex, Bing, Duckduckgo, Baidu and others) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking support and is able to control real browsers to mitigate detection.[14]
- se-scraper - Successor of GoogleScraper. Scrape search engines concurrently with different proxies. [15]
Legal
When scraping websites and services, the legal part is often a big concern for companies; for web scraping it greatly depends on the country the scraping user/company is from, as well as which data or website is being scraped, with many different court rulings all over the world.[16][17][18] However, when it comes to scraping search engines the situation is different: search engines usually do not list intellectual property, as they just repeat or summarize information they scraped from other websites.
The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was caught scraping unknown keywords from Google for their own, rather new Bing service.[19] But even this incident did not result in a court case.
One possible reason might be that search engines like Google get almost all their data by scraping millions of publicly reachable websites, also without reading and accepting those sites' terms. A legal case won by Google against Microsoft would possibly put its whole business at risk.
References
- ^ 'Automated queries – Search Console Help'. support.google.com. Retrieved 2017-04-02.
- ^ 'Google Still World's Most Popular Search Engine By Far, But Share Of Unique Searchers Dips Slightly'. searchengineland.com. 11 February 2013.
- ^ 'why would curl and wget result in a 403 forbidden?'. unix.stackexchange.com.
- ^ 'Does Google know that I am using Tor Browser?'. tor.stackexchange.com.
- ^ 'Google Groups'. google.com.
- ^ 'My computer is sending automated queries – reCAPTCHA Help'. support.google.com. Retrieved 2017-04-02.
- ^ 'Scraping Google Ranks for Fun and Profit'. google-rank-checker.squabbel.com.
- ^ a b 'Python3 framework GoogleScraper'. scrapeulous.
- ^ Deniel Iblika (3 January 2018). 'De Online Marketing Diensten van DoubleSmart'. DoubleSmart (in Dutch). Diensten. Retrieved 16 January 2019.
- ^ Jan Janssen (26 September 2019). 'Online Marketing Services van SEO SNEL'. SEO SNEL (in Dutch). Services. Retrieved 26 September 2019.
- ^ 'iMacros to extract google results'. stackoverflow.com. Retrieved 2017-04-04.
- ^ 'libcurl - the multiprotocol file transfer library'. curl.haxx.se.
- ^ 'A Go package to scrape Google' – via GitHub.
- ^ 'A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.: NikolaiT/GoogleScraper'. 15 January 2019 – via GitHub.
- ^ Tschacher, Nikolai (2020-11-17). NikolaiT/se-scraper. Retrieved 2020-11-19.
- ^ 'Is Web Scraping Legal?'. Icreon (blog).
- ^ 'Appeals court reverses hacker/troll 'weev' conviction and sentence [Updated]'. arstechnica.com.
- ^ 'Can Scraping Non-Infringing Content Become Copyright Infringement... Because Of How Scrapers Work?'. www.techdirt.com.
- ^ Singel, Ryan. 'Google Catches Bing Copying; Microsoft Says 'So What?''. Wired.
External links
- Scrapy - an open source Python framework, not dedicated to search engine scraping but regularly used as a base, with a large number of users.
- Compunect scraping sourcecode - a range of well-known open source PHP scraping scripts, including a regularly maintained Google Search scraper for scraping advertisements and organic result pages.
- Justone free scraping scripts - information about Google scraping as well as open source PHP scripts (last updated mid-2016).
- Scraping.Services source code - Python and PHP open source classes for a third-party scraping API (updated January 2017, free for private use).
- PHP Simpledom - a widespread open source PHP DOM parser to interpret HTML code into variables.
- SerpApi - a third-party service based in the United States allowing you to scrape search engines legally.