cloudera
Wrapping multiple backend Hadoop web applications with HAProxy
Authorizing access to multiple Hadoop applications on different nodes of the cluster can be complex and troublesome for some organizations.
To ensure a consistent access path, we ideally want to expose all web applications via a single entry point. In this example, we will use HAProxy to aggregate several Hadoop backend web applications and expose them from a single host and port.
HAProxy
HAProxy is a free, open source software load balancer and proxy for TCP and HTTP traffic, with features specific to HTTP such as inspecting and rewriting paths and headers.
http://www.haproxy.org/
Most Linux distributions package a version of HAProxy. It might not be the latest release, but the version that comes with your distro is probably sufficient for what we want to do in this example.
Setting Up
We will simulate the Hue, Oozie and Cloudera Manager backend web apps using the Python 2 SimpleHTTPServer module, which serves a listing of the directory it is started from.
We will create some placeholder content and serve it up on the ports used by the real applications.
cd ~
mkdir -p hue
echo "hello hue" > hue/hellohue.txt
mkdir -p oozie
echo "hello oozie" > oozie/hellooozie.txt
mkdir -p cm
echo "hello Cloudera Manager" > cm/hellocm.txt
cd hue
python -m SimpleHTTPServer 8888 &
cd ..
cd oozie
python -m SimpleHTTPServer 11000 &
cd ..
cd cm
python -m SimpleHTTPServer 7180 &
cd ..
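The commands above rely on the Python 2 SimpleHTTPServer module. As a minimal sketch for Python 3.7 or later, the same placeholder servers could be started from one script with the standard http.server module. The helper name start_static_server and the choice to bind to 127.0.0.1 are our own assumptions for illustration.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_static_server(directory, port):
    """Serve `directory` on 127.0.0.1:`port` in a background thread.

    Returns the server object so the caller can call shutdown() later.
    """
    # Bind the handler to a fixed directory instead of the current one.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Example usage, with the same directories and ports as the shell commands:
#   servers = [
#       start_static_server("hue", 8888),
#       start_static_server("oozie", 11000),
#       start_static_server("cm", 7180),
#   ]
#   threading.Event().wait()  # keep the main thread alive; stop with Ctrl-C
```

The server threads are daemon threads, so the script has to keep its main thread alive for as long as the backends should stay up.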
To route to the right backend, HAProxy needs a way to tell which backend a request is meant for. One way is to use alternative host names and have HAProxy inspect the hostname. However, this may be unacceptably complex for some organizations, so instead we will rely on the root path.
So if, for example, the user enters the path http://loadbalancer.fqdn.org:8080/cm then we want HAProxy to route this request and all subsequent requests to Cloudera Manager.
If, on the other hand, the user enters the path http://loadbalancer.fqdn.org:8080/hue then we want HAProxy to route this request and all subsequent requests to Hue.
Finally if the user enters the path http://loadbalancer.fqdn.org:8080/oozie then we want HAProxy to route this request and all subsequent requests to Oozie.
Seems simple, right? Well, no, because the backend applications are not listening at /oozie and /cm and /hue. They are all listening at the root of the given backend webserver, /.
Furthermore, subrequests, for example for JavaScript, CSS, and images, might be on other paths beneath /. How will the load balancer know to send them to the right backend?
Lastly, when the user follows a link in the application, how will the load balancer know which backend to send the request to?
The answer to these questions is to set a cookie. When the first request comes in, we strip the application identifier from the path and then send the request to the appropriate backend, setting a cookie that identifies the current application at the same time.
When the next request comes in, the application identifier won’t be on the path, but we know which backend to send the request to - based on the cookie.
When the user wants to access another application, they just have to enter the application path for the other application, and HAProxy will know to first strip the application identifier from the path, then set a cookie, and forward this and subsequent requests to the other application backend.
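The routing decision described above can be sketched in a few lines of Python. This is only an illustration of the rule (path prefix first, cookie second), not how HAProxy is implemented; the function name choose_backend and the BACKENDS tuple are hypothetical, and the cookie name BACKEND is the one we use in the configuration.

```python
BACKENDS = ("hue", "cm", "oozie")


def choose_backend(path, cookie_header=""):
    """Return the backend name for a request, or None if nothing matches."""
    # 1. An explicit application prefix on the path always wins
    #    (case-insensitive prefix match).
    lowered = path.lower()
    for name in BACKENDS:
        if lowered.startswith("/" + name):
            return name
    # 2. Otherwise fall back to the cookie set by an earlier response
    #    (plain substring match on the Cookie header).
    for name in BACKENDS:
        if "BACKEND=" + name in cookie_header:
            return name
    return None
```

Because the path is checked before the cookie, a user who is already "in" Hue can switch applications simply by entering /cm or /oozie.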
Here’s a simple example of how the HAProxy configuration file would look:
defaults
    log global
    mode http
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend webfe
    bind *:8080
    mode http
    # Match the application identifier at the start of the path
    acl is_hue_path path_beg -i /hue
    acl is_cm_path path_beg -i /cm
    acl is_oozie_path path_beg -i /oozie
    # Match the BACKEND cookie set by an earlier response
    acl is_hue_cookie hdr_sub(cookie) BACKEND=hue
    acl is_cm_cookie hdr_sub(cookie) BACKEND=cm
    acl is_oozie_cookie hdr_sub(cookie) BACKEND=oozie
    # Path rules come first, so an explicit prefix overrides the cookie
    use_backend hue if is_hue_path
    use_backend cm if is_cm_path
    use_backend oozie if is_oozie_path
    use_backend hue if is_hue_cookie
    use_backend cm if is_cm_cookie
    use_backend oozie if is_oozie_cookie

backend hue
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    # Set a BACKEND cookie carrying the server's cookie value
    cookie BACKEND insert indirect nocache
    # Strip the /hue prefix from the request line
    reqirep ^([^\ :]*)\ /hue([^\ ]*)\ (.*)$ \1\ /\2\ \3
    # Put the /hue prefix back on redirects and on cookie paths
    rspirep ^(Location:)\ http://([^/]*)/(.*)$ \1\ http://\2/hue/\3
    rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$ \1/hue\2\3
    server hue01 localhost:8888 cookie hue

backend cm
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    cookie BACKEND insert indirect nocache
    reqirep ^([^\ :]*)\ /cm([^\ ]*)\ (.*)$ \1\ /\2\ \3
    rspirep ^(Location:)\ http://([^/]*)/(.*)$ \1\ http://\2/cm/\3
    rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$ \1/cm\2\3
    server cm01 localhost:7180 cookie cm

backend oozie
    mode http
    balance roundrobin
    option forwardfor
    http-request set-header X-Forwarded-Port %[dst_port]
    http-request add-header X-Forwarded-Proto https if { ssl_fc }
    cookie BACKEND insert indirect nocache
    reqirep ^([^\ :]*)\ /oozie([^\ ]*)\ (.*)$ \1\ /\2\ \3
    rspirep ^(Location:)\ http://([^/]*)/(.*)$ \1\ http://\2/oozie/\3
    rspirep ^(Set-Cookie:.*\ path=)([^\ ]+)(.*)$ \1/oozie\2\3
    server oozie01 localhost:11000 cookie oozie
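To see what the reqirep and rspirep rules do to a request line and to response headers, here is a rough Python re-creation of the three regular expressions. It is a sketch for experimenting with the patterns, not part of the HAProxy setup; the function names are hypothetical, and the escaped spaces (`\ `) of the HAProxy syntax are written as plain spaces.

```python
import re


def strip_prefix(request_line, app):
    """Mimic the reqirep rule: drop the /<app> prefix from the request line."""
    pattern = r"^([^ :]*) /" + re.escape(app) + r"([^ ]*) (.*)$"
    return re.sub(pattern, r"\1 /\2 \3", request_line, flags=re.IGNORECASE)


def prefix_location(header_line, app):
    """Mimic the first rspirep rule: add /<app> to a redirect target."""
    pattern = r"^(Location:) http://([^/]*)/(.*)$"
    replacement = r"\1 http://\2/" + app + r"/\3"
    return re.sub(pattern, replacement, header_line, flags=re.IGNORECASE)


def prefix_cookie_path(header_line, app):
    """Mimic the second rspirep rule: add /<app> to a cookie's path."""
    pattern = r"^(Set-Cookie:.* path=)([^ ]+)(.*)$"
    replacement = r"\1/" + app + r"\2\3"
    return re.sub(pattern, replacement, header_line, flags=re.IGNORECASE)


print(strip_prefix("GET /hue HTTP/1.1", "hue"))
# GET / HTTP/1.1
print(strip_prefix("GET /hue/hellohue.txt HTTP/1.1", "hue"))
# GET //hellohue.txt HTTP/1.1
print(prefix_location("Location: http://lb.example.org:8080/login", "cm"))
# Location: http://lb.example.org:8080/cm/login
print(prefix_cookie_path("Set-Cookie: sid=abc; path=/; HttpOnly", "oozie"))
# Set-Cookie: sid=abc; path=/oozie/; HttpOnly
```

Note that, as written, the request rule turns /hue/hellohue.txt into //hellohue.txt, with a doubled leading slash. Also, as far as we know, the reqirep and rspirep directives are no longer supported from HAProxy 2.1 onwards, so this configuration assumes an older version such as the one shipped with the distro.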
To test, start HAProxy in the foreground with the config file:
haproxy -f our_test_haproxy.cfg
And try to browse to the HAProxy paths:
Cool, looks like that works. Will browsing to the file work?
What about now browsing to our Cloudera Manager url?
Looks good.
Now on to Oozie.
Yes, that seems to work.
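The same checks can also be scripted instead of clicked through in a browser. The following is a minimal sketch, assuming Python 3 and a running HAProxy; the function name fetch_all is hypothetical. It uses one shared cookie jar so that the BACKEND cookie set on the first request is sent back on the following ones, just as a browser would do.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def fetch_all(base_url, paths, timeout=5):
    """Request each path in order, sharing cookies between requests.

    Returns a list of (path, status, body) tuples.
    """
    opener = urllib.request.build_opener(
        # Ignore http_proxy environment settings; talk to HAProxy directly.
        urllib.request.ProxyHandler({}),
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )
    results = []
    for path in paths:
        try:
            with opener.open(base_url + path, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", "replace")
                results.append((path, resp.status, body))
        except urllib.error.HTTPError as err:
            # 4xx and 5xx responses are recorded instead of raised.
            results.append((path, err.code, ""))
    return results


# Example usage against the setup from this post:
#   for path, status, body in fetch_all(
#       "http://loadbalancer.fqdn.org:8080",
#       ["/hue", "/hellohue.txt", "/cm", "/hellocm.txt",
#        "/oozie", "/hellooozie.txt"],
#   ):
#       print(path, status, body[:40])
```

Each file request directly follows the request for its application path, so it should be routed by the cookie alone.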
Robert Gibbon
Rob is a Hadoop and large-scale distributed computing evangelist. A Solution Architect by trade, Rob is a managing partner at Big Industries, the premier Hadoop & Big Data systems integrator for Belgium and Luxembourg.