Rust Parallel HTTP Requests in Ruby

Written by Mike Piccolo

Rusty Chains

This is part five of a series where I try to stumble my way through creating a Rust web-scraping library that will be embeddable in a Ruby module. If you are interested in starting from the beginning, you can check out all my posts here: https://medium.com/@mfpiccolo

Follow along with this blog post using the part-5 branch of the scrape repo.

Last time, we learned how to pass structs and arrays back and forth between Rust and Ruby. The ability to work with these more advanced structures will definitely come in handy for what we are about to do. Today we are going to learn how to run Rust code in parallel. One of the most painfully slow aspects of programming is working with multiple HTTP requests, so this is a great problem to tackle.

The Slow Way

To start off, we are going to implement a function that makes multiple HTTP requests synchronously. Each iteration will make a request, wait for the response, do something with it, and then move on to the next. As you can imagine, this will be quite slow.

    
      extern crate hyper;
      extern crate time;

      use hyper::Client;
      use std::io::Read;

      #[no_mangle]
      pub extern fn run_threads() {
        let start_time = time::now();
        for i in 0..5 {
          // Create a client and make a blocking GET request.
          let client = Client::new();
          println!("Requesting {}", i.to_string());
          let mut response = client.get("http://wikipedia.com/").send().unwrap();
          // Read the entire response body into a string.
          let mut body = String::new();
          response.read_to_string(&mut body).unwrap();
          println!("BodyLength: {}", body.len().to_string());
        }
        let end_time = time::now();
        println!("{:?}", (end_time - start_time));
      }
    
  
    
      [package]
      name = "scrape"
      version = "0.1.0"
      authors = ["Mike Piccolo"]

      [lib]
      name = "scrape"
      crate-type = ["dylib"]

      [dependencies]
      time = "0.1"

      [dependencies.hyper]
      git = "https://github.com/hyperium/hyper.git"
    
  

Let's break down this function. First off, we are using an external crate called hyper to handle the HTTP client, so we add it as a dependency in our Cargo.toml file. The function iterates over the range 0..5. Each iteration sets up a client, makes a GET request to wikipedia.com, reads the response body into a string, and prints the length of that string to the console. We are also including a simple benchmark with the time crate so we know how long this takes.
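
As an aside, if you would rather not pull in the time crate just for benchmarking, recent versions of the standard library can do the same job with std::time::Instant. A minimal standalone sketch (not part of the scraper code):

      use std::time::Instant;

      fn main() {
        let start = Instant::now();
        // ... the work you want to measure goes here ...
        // elapsed() returns a std::time::Duration since `start`.
        println!("Elapsed: {:?}", start.elapsed());
      }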

Now let's set up the Ruby interface for this function.

    
      require 'ffi'

      module Scrape
        extend FFI::Library
        # Load the compiled Rust dynamic library.
        ffi_lib './target/debug/libscrape.dylib'

        # Bind the exported Rust function: no arguments, no return value.
        attach_function :run_threads, [], :void
      end

      Scrape.run_threads()
    
  

This simple FFI module allows us to call the Rust function from Ruby. Let's give it a try.

    
      $ cargo build
      $ ruby scrape.rb
      Requesting 0
      BodyLength: 42367
      Requesting 1
      BodyLength: 42367
      Requesting 2
      BodyLength: 42367
      Requesting 3
      BodyLength: 42367
      Requesting 4
      BodyLength: 42367
      Duration { secs: 9, nanos: 235893000 }
    
  

Cool. It worked, and it took around 9.2 seconds to complete. I think we can make that significantly faster if we use threads. Let's give that a try.

The Fast Way

Now it is time to do this in parallel. To do this we will need Rust threads and Arc. Let's go ahead and set up the function.

    
      extern crate hyper;
      extern crate time;

      use std::sync::Arc;
      use std::thread;
      use hyper::Client;
      use std::io::Read;

      #[no_mangle]
      pub extern fn run_threads() {
        let start_time = time::now();
        // Wrap the client in an Arc so every thread can share it.
        let client = Arc::new(Client::new());
        let threads: Vec<_> = (0..5).map(|i| {
          // Each clone bumps the reference count; the client itself is shared.
          let client = client.clone();
          thread::spawn(move || {
            println!("Requesting {}", i.to_string());
            let mut response = client.get("http://wikipedia.com").send().unwrap();
            let mut body = String::new();
            response.read_to_string(&mut body).unwrap();
            body.len().to_string()
          })
        }).collect();

        // Block until every thread finishes, collecting each result.
        let responses: Vec<_> = threads
          .into_iter()
          .map(|thread| thread.join())
          .collect();
        println!("All threads joined. Full responses are:");
        for response in responses.into_iter() {
          println!("The response has the following length: {:?}", response.ok());
        }
        let end_time = time::now();
        println!("{:?}", (end_time - start_time));
      }
    
  

Let's break down what we did here. We wrap the client in Arc::new, which gives us an atomically reference-counted pointer to it. That lets us call client.clone() inside the iteration: each clone hands a new reference to a thread and bumps the count at runtime, and the client is cleaned up once the last reference is dropped. This is how Rust lets multiple threads safely share ownership of a single value.
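
To make the reference counting concrete, here is a small standalone sketch (separate from the scraper, and assuming a Rust version where Arc::strong_count is stable) that watches the count change as clones are created and dropped:

      use std::sync::Arc;

      fn main() {
        let data = Arc::new(String::from("shared"));
        // Cloning an Arc copies the pointer and bumps the atomic count;
        // the String itself is never copied.
        let clone_a = data.clone();
        println!("count = {}", Arc::strong_count(&data)); // prints 2
        drop(clone_a);
        println!("count = {}", Arc::strong_count(&data)); // prints 1
      }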

Next we set the local variable threads to a Vec of thread handles by mapping over the range. Each spawned thread makes an HTTP request and returns the length of the response body as a string.

Then we collect the responses into a Vec by mapping over the thread handles and calling join on each one, which blocks until that thread finishes.
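
For context, join returns a Result: Ok with the closure's return value if the thread completed, or Err if it panicked. That is why we call .ok() on each response when printing. A minimal standalone sketch:

      use std::thread;

      fn main() {
        let handle = thread::spawn(|| 42);
        // join() blocks until the thread finishes and surfaces panics as Err.
        match handle.join() {
          Ok(value) => println!("thread returned {}", value),
          Err(_) => println!("thread panicked"),
        }
      }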

Finally, we print out each response length and the time the whole run took.

We don’t need to change anything about our FFI module, so we can go ahead and compile and run.

    
      $ cargo build
      $ ruby scrape.rb
      Requesting 1
      Requesting 0
      Requesting 2
      Requesting 3
      Requesting 4
      All threads joined. Full responses are:
      The response has the following length: Some("42367")
      The response has the following length: Some("42367")
      The response has the following length: Some("42367")
      The response has the following length: Some("42367")
      The response has the following length: Some("42367")
      Duration { secs: 3, nanos: 933754000 }
    
  

Woohoo! That is around 2.3 times faster (about 9.2 seconds down to about 3.9).

Next Time

To sum up where we are at so far in this series: we can receive, manipulate, and return strings, numbers, structs, and arrays between Ruby and Rust. We can now make HTTP requests in parallel and act on the responses. This is getting pretty close to a working HTTP scraping library. Next time we will be looking into what Mozilla is up to with Servo and use some external libraries for HTML parsing, similar to Nokogiri.

Special Thanks

The Rust community, for the most part, is pretty nice to newbs, so don’t be afraid to ask a Stack Overflow question or get on the Rust IRC channel. Special thanks to Stack Overflow users Adrian, shepmaster, Chris Morgan, Vladimir Matveev and DK. Also Steve Klabnik for doing a great job on the docs.

And of course don’t hesitate to hit me up on Twitter @mfpiccolo.
