Using cURL with a proxy

14 July 2021 | 8 min read

In this article, you will learn how to use the command line tool cURL to transfer data through a proxy server. A proxy server acts as a middleman between a client and a destination server: the client sends its request to the proxy, the proxy forwards it to the destination, and then relays the response back to the client.

We might want to do this when, say, a target service uses geolocation to tailor the data it displays, or blocks access to clients in certain countries entirely. Many global shopping sites use this approach to display prices in a local currency, e.g. euros rather than dollars. If we were to visit such a site directly, we could end up with data in the wrong currency. By using a proxy, we can fetch the data we require based on the locale of the proxy.

You will learn the different ways to use a proxy with cURL.

How to set up a proxy with cURL?

What is cURL?

cURL is a command line tool for transferring data, with roots dating all the way back to 1996. It lets you retrieve and send data in a multitude of ways and writes to standard output, so it can be used with standard Unix pipes. It forms the basis of many developer scripts thanks to its near-universal availability on Unix-like operating systems and its support for a wide variety of protocols, including HTTP/HTTPS, FTP and SCP.

A simple example is to call cURL against a URL, here Google's, which writes the content of the web page to standard output:

curl google.com

Which returns:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>

We can also get further details by adding the -I flag (curl -I google.com), which requests only the response headers:

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Wed, 14 Jul 2021 10:37:31 GMT
Expires: Fri, 13 Aug 2021 10:37:31 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN

This might confuse you - where's Google's search box? Instead, cURL has returned a 301 redirect, which is what our browser also gets when visiting the page. Our browser would follow the redirect and then display that site's content. cURL, however, defaults to HTTP when no scheme is given and won't follow a redirect unless we ask it to, so we need to explicitly tell cURL to visit the www subdomain we'd typically see in our browser, along with the protocol we'd prefer to use.

curl https://www.google.com

This now returns the full content of google.com, as we'd usually see it.
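
If we'd rather let cURL chase the redirect itself, the -L (--location) flag asks it to follow Location headers. The snippet below is a minimal sketch rather than a definitive recipe: final_url is a hypothetical helper name, and it prints only the final status code and URL instead of the page body.

```shell
# Hypothetical helper: follow redirects and report where we ended up.
final_url() {
  # --location   follow any redirects the server sends
  # --silent     hide the progress meter
  # --output     discard the response body
  # --write-out  print the final status code and the URL that was actually fetched
  curl --silent --location --output /dev/null \
       --write-out '%{http_code} %{url_effective}\n' "$1"
}

# Usage (needs network access):
# final_url google.com
```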

This gives you a good idea of the level we're working at when interacting with cURL. One of its driving philosophies is garbage in, garbage out: if it is provided with garbage, then that is what it will return. cURL won't try to guide you if you supply an incorrect URL or use the wrong protocol - it will simply try to run your command. Being explicit with your arguments is therefore very important.

Our Proxy Setup

The proxy we use for our examples in the rest of this article will be set up on our local machine, using port 5000. The only required part of a proxy statement is the host: if the scheme is omitted, cURL assumes HTTP, and if the port is omitted, it assumes 1080. We'll be transferring data to the proxy via HTTP, which we'll specify explicitly for completeness. For most of the examples here, we're going to make a GET request to the httpbin.org developer service. This service is especially useful because it shows us the source address of a request when we access the path /ip.

1: Using command line arguments

The first and simplest option for using a proxy is a command line argument. cURL has extensive built-in help which, on recent versions, you can filter by category to show only the proxy-related options. To look at the documentation for proxy settings, use the following command:

curl --help proxy
-x, --proxy [protocol://]host[:port]

The help output includes the line shown above, which tells us we can use either the -x shorthand or the --proxy argument to specify our proxy.

So, to send data via the proxy, we'd run either of the following in our terminal:

curl --proxy http://127.0.0.1:5000 https://httpbin.org/ip
curl -x http://127.0.0.1:5000 https://httpbin.org/ip

We could drop the protocol specification in our example, given that cURL defaults to HTTP for the proxy, allowing us to specify only the host and port:

curl -x 127.0.0.1:5000 https://httpbin.org/ip
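
To check that requests really do go through the proxy, one option is to compare the origin address httpbin.org reports with and without it. This is only a sketch: direct_ip and proxied_ip are hypothetical helper names, and it assumes the local proxy from our setup is listening on port 5000.

```shell
PROXY_URL="http://127.0.0.1:5000"     # the local proxy from our setup
IP_ENDPOINT="https://httpbin.org/ip"  # reports the address the request came from

# Hypothetical helper: ask httpbin directly, ignoring any proxy settings.
direct_ip() { curl --silent --noproxy '*' "$IP_ENDPOINT"; }

# Hypothetical helper: ask httpbin through the proxy.
proxied_ip() { curl --silent --proxy "$PROXY_URL" "$IP_ENDPOINT"; }

# Usage (needs network access and a running proxy):
# direct_ip
# proxied_ip
```

With a proxy running on the same machine, as in our setup, both calls will typically report the same address; the difference only shows up once the proxy sits on another network.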

Proxy Authentication

Additionally, if our HTTP proxy server requires authentication, we can use the -U flag (short for --proxy-user) to specify the credentials.

curl -U user:password --proxy http://127.0.0.1:5000 https://httpbin.org/ip

This uses the Basic authentication scheme by default, but some proxy servers may require a different one. The proxy will respond with headers detailing which schemes it accepts, and with --proxy-anyauth you can tell cURL to determine the scheme from that response and use it.

curl -U user:password --proxy http://127.0.0.1:5000 https://httpbin.org/ip --proxy-anyauth

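If your password includes special characters, be sure to quote your authentication string so the shell passes it to cURL as a single, unmodified argument. Double quotes work for the example below; single quotes are safer if the password contains characters such as $ or !, which the shell can still expand inside double quotes.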

curl -U "user:p@assword" --proxy http://127.0.0.1:5000 https://httpbin.org/ip

Ideally, you should never have to expose your password in the terminal. Instead, you can supply only the username, and cURL will then prompt you for the password.

curl -U user --proxy http://127.0.0.1:5000 https://httpbin.org/ip

> Enter proxy password for user 'user':
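
For scripts, where an interactive prompt isn't practical, one possible approach is to pass the credentials to cURL on standard input with --config -, so they never appear in the argument list. This is a sketch under assumptions: PROXY_USER and PROXY_PASS are hypothetical environment variables you would set yourself, curl_with_proxy_auth is a made-up helper name, and the password must not contain double quotes or backslashes, as those would need escaping in the config syntax.

```shell
# Hypothetical helper: send proxy credentials to curl via stdin, not argv.
curl_with_proxy_auth() {
  # Fail early if the (assumed) credential variables are missing.
  : "${PROXY_USER:?PROXY_USER is not set}" "${PROXY_PASS:?PROXY_PASS is not set}"

  # "--config -" makes curl read extra options from standard input;
  # proxy-user is the config-file spelling of -U / --proxy-user.
  printf 'proxy-user = "%s:%s"\n' "$PROXY_USER" "$PROXY_PASS" |
    curl --config - --proxy "http://127.0.0.1:5000" "$@"
}

# Usage (needs network access and a running proxy):
# PROXY_USER=user PROXY_PASS=secret curl_with_proxy_auth https://httpbin.org/ip
```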

2: Using Environment Variables

It is possible to configure cURL to use our proxy through environment variables. cURL reads one environment variable for each protocol it supports, named [scheme]_proxy. If these are set, cURL will use them by default whenever the matching protocol is used. Note that cURL only recognizes http_proxy in lowercase. In our examples, where we are using http or https, we'd set http_proxy or https_proxy like so:

export http_proxy="http://127.0.0.1:5000"
export https_proxy="http://127.0.0.1:5000"

We could also include authentication when necessary in these statements:

export http_proxy="http://username:password@127.0.0.1:5000"
export https_proxy="http://username:password@127.0.0.1:5000"

Our call using cURL would then only require the following statement:

curl https://httpbin.org/ip

This means we no longer need to remember to supply the proxy arguments for the duration of our session. One limitation of this approach is that the variables are picked up by every application that supports them (such as wget), which, depending on your use case, could lead to problems.
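
If that side effect is a concern, one way around it is to set the variables for a single command only. The sketch below does so in a subshell; with_proxy is a hypothetical helper name, and the proxy address is the local one from our setup.

```shell
PROXY_URL="http://127.0.0.1:5000"

# Hypothetical helper: run one command with the proxy variables set,
# leaving the rest of the session untouched.
with_proxy() {
  (
    # The parentheses start a subshell, so these exports vanish when it exits.
    export http_proxy="$PROXY_URL" https_proxy="$PROXY_URL"
    "$@"
  )
}

# Usage (needs network access and a running proxy):
# with_proxy curl https://httpbin.org/ip
#
# The reverse case, ignoring proxy variables for one request:
# curl --noproxy '*' https://httpbin.org/ip
```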

As with all environment variables set in this way, these are only temporary and will be gone when our shell session ends or the system restarts. To make them available in all sessions, the export statements can be appended to a shell startup file. The file you need to edit depends on your shell and your system. In some cases the statements would be appended to the bottom of a .bashrc or .zshrc file (for bash and zsh respectively), or they could be put in a .profile file. These files are executed when you log in or start an instance of your shell, meaning the environment variables will be set by the time we call cURL.
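
When adding such a line from a script, it helps to avoid appending it twice. The following is a rough sketch of one way to do that; persist_line is a hypothetical helper name, and which startup file to pass depends on your shell.

```shell
# Hypothetical helper: append a line to a startup file only if it is not there yet.
persist_line() {
  rc_file="$1"
  line="$2"
  # -x matches whole lines only, -F treats the text literally, -q stays quiet.
  # If the file is missing or the line is absent, append it.
  grep -qxF -- "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Usage (pick the startup file your shell actually reads):
# persist_line "$HOME/.bashrc" 'export https_proxy="http://127.0.0.1:5000"'
```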

3: Using an alias

Another, more permanent approach to configuring a proxy is an alias, which is useful if you regularly need to connect in this way. An alias lets us substitute a command we type with another.

Using this approach, we can substitute the call to curl with our proxy command. As with environment variables, the shell startup file in which you define the alias will differ.

alias curl="curl -x http://127.0.0.1:5000"

Now, when calling curl https://httpbin.org/ip, we are actually calling curl -x http://127.0.0.1:5000 https://httpbin.org/ip via the alias, meaning we don't have to set the proxy each time.
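
Aliases are typically only expanded in interactive shells, so scripts won't pick them up. A shell function is one alternative; the sketch below uses a separate, hypothetical name (pcurl) so that plain curl stays available without the proxy.

```shell
# Hypothetical helper: curl through the local proxy from our setup.
# "command curl" runs the real curl binary, skipping any alias or
# function that happens to be called curl.
pcurl() {
  command curl --proxy "http://127.0.0.1:5000" "$@"
}

# Usage (needs network access and a running proxy):
# pcurl https://httpbin.org/ip
```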

4: Using a .curlrc file

We've already seen the vast number of command line options available with cURL. Fortunately, you can set the ones you want cURL to use in a configuration file, to save repetition in your command line usage. The file can be specified on the command line via the -K argument, but by default cURL looks for one in your home directory at ~/.curlrc (_curlrc on Windows). If it doesn't exist already, you can create it, and it will then be used each time cURL is invoked.

To configure cURL to use a proxy this way, we can specify it as follows in our ~/.curlrc, along with any other cURL configuration options.

proxy = "http://127.0.0.1:5000"
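
If you'd rather not change the behaviour of every curl call, a separate config file passed with -K (--config) is one option. The sketch below writes such a file and shows how it might be used; the file name ./proxy.curlrc is purely an assumption for illustration.

```shell
# Write a project-specific config file (hypothetical path).
# Options use their long names without the leading dashes.
cat > ./proxy.curlrc <<'EOF'
# Route requests through the local proxy from our setup
proxy = "http://127.0.0.1:5000"
# Hide the progress meter but still report errors
silent
show-error
EOF

# Usage (needs network access and a running proxy):
# curl --config ./proxy.curlrc https://httpbin.org/ip
```

Conversely, passing -q as the very first argument tells cURL to skip the default ~/.curlrc for that one invocation.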

Using cURL to Extract the Title of a Web Page

We've explained how useful cURL can be on its own, so let's demonstrate a more complex example where we combine it with other tooling using Unix pipes. In this example we're going to parse the title HTML tag from https://snapshooter.com/.

curl --silent https://snapshooter.com/ | grep -Eo '<title>(.*)</title>'

Here is another example, which parses the meta description for https://www.boxmode.com/:

curl --silent https://www.boxmode.com/ | grep -Eo '\"description\":(.*)\"'

We supply the --silent argument to suppress the progress information that cURL otherwise prints when its output is piped, and we use grep with a regular expression to capture the title. You can see how being such a simple tool makes cURL versatile, as we're able to combine it with other tools like grep, sed and awk.
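
Building on that, the pipeline can be extended with sed to strip the surrounding tags and combined with the proxy flag from earlier. This is a rough sketch, not a robust HTML parser: extract_title is a hypothetical helper name, and it assumes the title sits on a single line and contains no nested tags.

```shell
# Hypothetical helper: read HTML on stdin and print the text of the first <title>.
extract_title() {
  grep -Eo '<title>[^<]*</title>' |  # keep only the <title>...</title> match
    head -n 1 |                      # first match only
    sed 's/<[^>]*>//g'               # strip the opening and closing tags
}

# Usage (needs network access and a running proxy):
# curl --silent --proxy http://127.0.0.1:5000 https://snapshooter.com/ | extract_title
```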

Conclusion

In this article we've shown just how simple it is to configure a proxy for use with cURL, using a variety of both temporary and more permanent solutions. If you're going to be making a lot of requests via the same proxy, our recommended approach is to use the cURL configuration file, which keeps cURL's config separate from other tools. That said, any of the approaches here will let you reach the data you require.

Some languages offer bindings for cURL; for example, you can use cURL from Python with the PycURL package.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.