Discussion: TCP segments with HTTP traffic
Mark
2013-04-04 12:55:56 UTC
Hello,

I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header? Also, is it possible to have,
for example, a 'GET ...' request split across TCP segments?

Thanks.

Mark
Jorgen Grahn
2013-04-04 14:54:24 UTC
Post by Mark
Hello,
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header?
I don't understand. If you're running HTTP, that means all TCP
payload is HTTP data, and the TCP payload by definition follows
immediately after the TCP header (including TCP options, if any).
Post by Mark
Also, is it possible to have, for example, a 'GET ...' request
split across TCP segments?
Yes. Why do you ask? You wrote that you're looking at the TCP stream;
there are no segments on that level. (And no TCP header, for that
matter.)

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Barry Margolin
2013-04-04 15:13:42 UTC
Post by Jorgen Grahn
Post by Mark
Hello,
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header?
I don't understand. If you're running HTTP, that means all TCP
payload is HTTP data, and the TCP payload by definition follows
immediately after the TCP header (including TCP options, if any).
Post by Mark
Also, is it possible to have, for example, a 'GET ...' request
split across TCP segments?
Yes. Why do you ask? You wrote that you're looking at the TCP stream;
there are no segments on that level. (And no TCP header, for that
matter.)
I suspect he's implementing something like a firewall or IDS, not a
server using the normal TCP stack API.
--
Barry Margolin
Arlington, MA
Jorgen Grahn
2013-04-04 18:22:10 UTC
Post by Barry Margolin
Post by Jorgen Grahn
Post by Mark
Hello,
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header?
I don't understand. If you're running HTTP, that means all TCP
payload is HTTP data, and the TCP payload by definition follows
immediately after the TCP header (including TCP options, if any).
Post by Mark
Also, is it possible to have, for example, a 'GET ...' request
split across TCP segments?
Yes. Why do you ask? You wrote that you're looking at the TCP stream;
there are no segments on that level. (And no TCP header, for that
matter.)
I suspect he's implementing something like a firewall or IDS, not a
server using the normal TCP stack API.
To be honest I suspect that too -- but I want to see him state his
question clearly before I spend time on a reply.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Barry Margolin
2013-04-04 15:12:06 UTC
Post by Mark
Hello,
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header? Also, is it possible to have,
for example, a 'GET ...' request split across TCP segments?
I think the first HTTP request in a connection will always be
immediately after the TCP header, because the client can't send anything
before the request. However, if multiple requests are sent on the same
connection, the client could start the second request in the same TCP
segment as the end of the first request.

The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.

TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
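
As a sketch of that byte-stream approach (assuming a blocking POSIX
socket; parse_requests() here is a hypothetical hook standing in for
the actual HTTP parsing):

    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical parser hook: consumes any complete requests at the
     * front of buf and returns how many bytes it used up. */
    size_t parse_requests(const char *buf, size_t len);

    void read_http(int fd)
    {
        char buf[8192];
        size_t used = 0;

        for (;;) {
            ssize_t n = recv(fd, buf + used, sizeof buf - used, 0);
            if (n <= 0)                   /* error or orderly close */
                break;
            used += (size_t)n;
            size_t eaten = parse_requests(buf, used);
            memmove(buf, buf + eaten, used - eaten);  /* keep leftovers */
            used -= eaten;
            /* real code must also grow the buffer (or give up) when a
             * single request doesn't fit in 8192 bytes */
        }
    }

The code never assumes that request boundaries line up with recv()
calls or segments, which also takes care of pipelined requests.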
--
Barry Margolin
Arlington, MA
glen herrmannsfeldt
2013-04-04 15:30:05 UTC
Post by Barry Margolin
Post by Mark
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header? Also, is it possible to have,
for example, a 'GET ...' request split across TCP segments?
(snip)
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
In theory, it could be split anywhere. It does seem unlikely that
the buffer would be so small as to split 10 characters, but as
I understand it there is no guarantee. Is there a minimum MTU?
Post by Barry Margolin
TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
If you use getc() then you don't have to worry about it at all.
If you use read(), or something similar, be sure to check the length
and count as appropriate.
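
For example, a read() loop that keeps collecting bytes until a full
CRLF-terminated line has arrived, however the bytes were split across
reads (a sketch; fd is assumed to be a connected socket):

    #include <unistd.h>

    /* Returns the number of bytes buffered once at least one CRLF has
     * been seen, or -1 on EOF/error or if the line doesn't fit. */
    ssize_t read_until_crlf(int fd, char *buf, size_t cap)
    {
        size_t used = 0;
        while (used < cap) {
            ssize_t n = read(fd, buf + used, cap - used);
            if (n <= 0)
                return -1;               /* EOF or error mid-line */
            used += (size_t)n;
            for (size_t i = 0; i + 1 < used; i++)
                if (buf[i] == '\r' && buf[i + 1] == '\n')
                    return (ssize_t)used;
        }
        return -1;                       /* line longer than cap */
    }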

-- glen
Barry Margolin
2013-04-04 17:10:03 UTC
Post by glen herrmannsfeldt
Post by Barry Margolin
Post by Mark
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header? Also, is it possible to have,
for example, a 'GET ...' request split across TCP segments?
(snip)
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
In theory, it could be split anywhere. It does seem unlikely that
the buffer would be so small as to split 10 characters, but as
How are you counting 10 characters? The example GET line he gave above
is around 40 characters. And I assumed he was also interested in the
request-headers (the server hostname isn't normally in the GET line as
above, it's in the Host: header), which will add several hundred bytes
to the request.
Post by glen herrmannsfeldt
I understand it there is no guarantee. Is there a minimum MTU?
I think the minimum is something like 563. But almost all systems these
days negotiate at least 1400.
Post by glen herrmannsfeldt
Post by Barry Margolin
TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
If you use getc() then you don't have to worry about it at all.
If you use read(), or something similar, be sure to check the length
and count as appropriate.
As I mentioned in another reply, I suspect he's not using the normal
socket API. If he's implementing a firewall, he's probably using some
kind of raw packet API.
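
If so, a common starting point is libpcap -- roughly like this (a
sketch; the "eth0" device name and the filter string are just
examples):

    #include <pcap/pcap.h>
    #include <stdio.h>

    static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
    {
        /* bytes points at the link-layer frame: the Ethernet, IP and
         * TCP headers still have to be stepped over, and what follows
         * them is only a slice of the TCP byte stream. */
        printf("captured %u bytes\n", hdr->caplen);
        (void)user; (void)bytes;
    }

    int main(void)
    {
        char err[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, err);
        if (p == NULL) {
            fprintf(stderr, "pcap: %s\n", err);
            return 1;
        }

        struct bpf_program prog;
        if (pcap_compile(p, &prog, "tcp port 80", 1,
                         PCAP_NETMASK_UNKNOWN) == 0)
            pcap_setfilter(p, &prog);

        pcap_loop(p, -1, on_packet, NULL);   /* -1 = run until error */
        pcap_close(p);
        return 0;
    }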
--
Barry Margolin
Arlington, MA
Rick Jones
2013-04-04 17:19:49 UTC
Post by Barry Margolin
Post by glen herrmannsfeldt
I understand it there is no guarantee. Is there a minimum MTU?
I think the minimum is something like 563. But almost all systems these
days negotiate at least 1400.
I believe the "de jure" minimum IPv4 MTU is 68 bytes, allowing for 20
bytes of IPv4 header, 40 bytes of IPv4 options and 8 bytes of
"payload."

The "minimum, maximum reassemblable" IP datagram size is 576 bytes.

rick jones
--
"You can't do a damn thing in this house without having to do three
other things first!" - my father (It seems universally applicable :)
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
glen herrmannsfeldt
2013-04-04 20:31:07 UTC
Barry Margolin <***@alum.mit.edu> wrote:

(snip, someone wrote)
Post by Barry Margolin
Post by glen herrmannsfeldt
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
(then I wrote)
Post by Barry Margolin
Post by glen herrmannsfeldt
In theory, it could be split anywhere. It does seem unlikely that
the buffer would be so small as to split 10 characters, but as
How are you counting 10 characters? The example GET line he gave above
is around 40 characters. And I assumed he was also interested in the
request-headers (the server hostname isn't normally in the GET line as
above, it's in the Host: header), which will add several hundred bytes
to the request.
I was counting the "HTTP/1.x\r\n".
Post by Barry Margolin
Post by glen herrmannsfeldt
I understand it there is no guarantee. Is there a minimum MTU?
I think the minimum is something like 563. But almost all systems
these days negotiate at least 1400.
As I posted in another reply, if you:

telnet www.google.com 80

and then slowly type in the request, I believe it still
has to work. (Well, fast enough not to time out the connection.)

-- glen
glen herrmannsfeldt
2013-04-04 15:34:04 UTC
(snip)
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
Thinking about this more, I have at least a few times:

telnet www.google.com 80

then typed the request line ending in HTTP/1.1(enter)(enter)
myself. (I forget if/when those translate to CRLF.)

In the case of a person typing into telnet, it seems likely
that the characters are sent one at a time, and a proper
TCP server should process it if it comes in that way.
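
(For reference, the minimal thing one types there is something like

    GET /index.html HTTP/1.1
    Host: www.google.com

followed by an empty line to end the headers; each line the server
sees should end in CRLF.)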

-- glen
Jorgen Grahn
2013-04-05 06:19:24 UTC
Post by glen herrmannsfeldt
(snip)
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
telnet www.google.com 80
then typed the request line ending in HTTP/1.1(enter)(enter)
myself. (I forget if/when those translate to CRLF.)
In the case of a person typing into telnet, it seems likely
that the characters are sent one at a time, and a proper
TCP server should process it if it comes in that way.
I assumed telnet worked in line buffered mode when used that way ...
but I can't be bothered to test.

If the OP's software has to work in less-than-perfect conditions, it
doesn't matter. It's like this: a TCP client may write 1 or more
octets to the socket at a time. The TCP layer will probably try to
collect these into larger segments, but not if it affects
interactivity too much.
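
That coalescing is largely Nagle's algorithm, and a sender can even
switch it off. A sketch (fd is assumed to be a connected TCP socket):
with TCP_NODELAY set, each of these one-byte send() calls is quite
likely to go out as its own segment.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <string.h>

    void send_byte_by_byte(int fd, const char *request)
    {
        int one = 1;
        /* Disable Nagle coalescing of small writes. */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);

        for (size_t i = 0; i < strlen(request); i++)
            send(fd, &request[i], 1, 0);     /* one byte per call */
    }

So the receiver may well see "GET " arrive separately from the rest of
the request line.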

/If/ the OP is sniffing on the Ethernet or IP layer, this is only the
start of his troubles, of course. He'd also have to handle (off the
top of my head):

HTTP: whatever the RFCs say and maybe common extensions/bugs
TCP: retransmissions, out-of-order segments, orderly shutdown,
dead/hung connections, ...
IP: reassembly

plus malicious attacks of all possible kinds on all of these.
Basically he has to implement half of a TCP/IP stack and half of a
web server.
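
Just to give a flavour of the TCP part, the core of a per-direction
reassembler is tracking the next expected sequence number and trimming
retransmitted or overlapping segments against it (a sketch only; real
code also has to queue out-of-order data, follow SYN/FIN, time
connections out, and so on):

    #include <stdint.h>

    struct half_stream {
        uint32_t next_seq;          /* next byte we expect */
    };

    /* deliver() receives only bytes that continue the stream in order;
     * the signed casts make the comparisons survive 32-bit sequence
     * number wraparound. */
    void on_tcp_segment(struct half_stream *s, uint32_t seq,
                        const uint8_t *data, uint32_t len,
                        void (*deliver)(const uint8_t *, uint32_t))
    {
        if ((int32_t)(seq + len - s->next_seq) <= 0)
            return;                             /* pure retransmission */
        if ((int32_t)(seq - s->next_seq) < 0) { /* overlaps old data */
            uint32_t skip = s->next_seq - seq;
            data += skip;
            len  -= skip;
            seq   = s->next_seq;
        }
        if (seq != s->next_seq)
            return;         /* gap ahead: real code would buffer this */
        deliver(data, len);
        s->next_seq += len;
    }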

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Robert Wessel
2013-04-05 16:26:22 UTC
On Thu, 4 Apr 2013 15:34:04 +0000 (UTC), glen herrmannsfeldt
Post by glen herrmannsfeldt
(snip)
Post by Barry Margolin
The HTTP request could be split across TCP segments if it's very long.
Or, as above, if requests are pipelined, the second (or Nth) request
could start near the end of one segment and be continued in the next
segment.
TCP doesn't provide any correspondence between segments and logical
units of application layer data. Your parser should just treat the TCP
data as a byte stream; processing it at the packet layer will obviously
simplify it, but it's likely to get you into trouble.
telnet www.google.com 80
then typed the request line ending in HTTP/1.1(enter)(enter)
myself. (I forget if/when those translate to CRLF.)
In the case of a person typing into telnet, it seems likely
that the characters are sent one at a time, and a proper
TCP server should process it if it comes in that way.
While I've not seen it on an HTTP server, I've seen several SMTP
servers that apply timing checks to detect telnet-style connections,
and then drop those (IOW, getting the SMTP command split across too
many recv()'s, etc., would be an indication of someone poking at your
mail server).
glen herrmannsfeldt
2013-04-05 17:44:04 UTC
Robert Wessel <***@yahoo.com> wrote:

(snip, I wrote)
Post by Robert Wessel
Post by glen herrmannsfeldt
telnet www.google.com 80
then typed the request line ending in HTTP/1.1(enter)(enter)
myself. (I forget if/when those translate to CRLF.)
In the case of a person typing into telnet, it seems likely
that the characters are sent one at a time, and a proper
TCP server should process it if it comes in that way.
While I've not seen it on an HTTP server, I've seen several SMTP
servers that apply timing checks to detect telnet-style connections,
and then drop those (IOW, getting the SMTP command split across too
many recv()'s, etc., would be an indication of someone poking at your
mail server).
I have probably done that more often with SMTP than with port 80, but
not so much lately.

It used to be that mail servers had the VRFY command that would verify
that a recipient existed. Now most just tell you (if you send VRFY) to
mail and see what happens.

Many will bounce mail if the recipient doesn't exist, but some
silently discard it.

As far as I know, SMTP, probably more than HTTP, was designed such
that mail typed in by a human should work.

-- glen
Jorgen Grahn
2013-04-05 18:03:45 UTC
On Fri, 2013-04-05, glen herrmannsfeldt wrote:
...
Post by glen herrmannsfeldt
It used to be that mail servers had the VRFY command that would verify
that a recipient existed. Now most just tell you (if you send VRFY) to
mail and see what happens.
Many will bounce mail if the recipient doesn't exist, but some
silently discard it.
And those who do discard are a pox on the Internet. "Return to sender"
was designed to be something you could rely on.

I don't know how common this is, but there's a lot of collateral damage
from the war on spam, and this may be one example.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Barry Margolin
2013-04-05 18:50:07 UTC
Post by Jorgen Grahn
...
Post by glen herrmannsfeldt
It used to be that mail servers had the VRFY command that would verify
that a recipient existed. Now most just tell you (if you send VRFY) to
mail and see what happens.
Many will bounce mail if the recipient doesn't exist, but some
silently discard it.
And those who do discard are a pox on the Internet. "Return to sender"
was designed to be something you could rely on.
Unfortunately, spammers have forced many decisions that are violations
of otherwise good practice. Mail bounces result in innocent bystanders
being inundated with blowback. People can't run personal mail servers on
their home computers because most sites block mail from residential
addresses.
--
Barry Margolin
Arlington, MA
Jorgen Grahn
2013-04-06 04:58:08 UTC
Post by Barry Margolin
Post by Jorgen Grahn
...
Post by glen herrmannsfeldt
It used to be that mail servers had the VRFY command that would verify
that a recipient existed. Now most just tell you (if you send VRFY) to
mail and see what happens.
Many will bounce mail if the recipient doesn't exist, but some
silently discard it.
And those who do discard are a pox on the Internet. "Return to sender"
was designed to be something you could rely on.
Unfortunately, spammers have forced many decisions that are violations
of otherwise good practice. Mail bounces result in innocent bystanders
being inundated with blowback.
Post by Jorgen Grahn
I don't know how common this is, but there's a lot of collateral
damage from the war on spam, and this may be one example.
But you make it sound as if disabling all bounces is necessary on the
internet today. I suspect a minority of servers do this. Maybe a tiny
minority? I tested one server (which worked) but surely someone has
collected statistics somewhere ...

IMO, part of the problem with spam is that overzealous mail admins use it
as an excuse to tinker with their servers, until the countermeasures
are more damaging than the spam itself.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Martijn Lievaart
2013-04-06 18:22:23 UTC
Post by Jorgen Grahn
But you make it sound as if disabling all bounces is necessary on the
internet today. I suspect a minority of servers do this. Maybe a tiny
minority? I tested one server (which worked) but surely someone has
collected statistics somewhere ...
Disabling bounces, or at least severely limiting them, is sound practice. It
is better to reject than to bounce.

M4
Jorgen Grahn
2013-04-06 21:13:01 UTC
Post by Martijn Lievaart
Post by Jorgen Grahn
But you make it sound as if disabling all bounces is necessary on the
internet today. I suspect a minority of servers do this. Maybe a tiny
minority? I tested one server (which worked) but surely someone has
collected statistics somewhere ...
Disabling bounces, or at least severely limiting them, is sound practice. It
is better to reject than to bounce.
Now that I think of it, it was rejecting I was thinking of. In my test,
three MTAs were involved: my local one (A), my ISP's relay (B), and the
destination (C). C rejected the unknown recipient, and B generated
the bounce mail. As far as I can tell, this is safe.

I'd have to refresh my SMTP knowledge to understand this other form of
bounces you're referring to. I always saw the MTA for example.org
accepting a mail to ***@example.org as a /promise/ that the foo mailbox
exists and the mail has reached it ... any other design seems
extremely dangerous. But of course there are backup MXes ...

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Barry Margolin
2013-04-07 17:13:12 UTC
Post by Martijn Lievaart
Post by Jorgen Grahn
But you make it sound as if disabling all bounces is necessary on the
internet today. I suspect a minority of servers do this. Maybe a tiny
minority? I tested one server (which worked) but surely someone has
collected statistics somewhere ...
Disabling bounces, or at least severely limiting them, is sound practice. It
is better to reject than to bounce.
Same difference. If a message is rejected, the sending server would
traditionally bounce it.

Unless you're talking about the special case where the client is sending
a message to their own domain. Then the submission server may be able
to reject the message immediately, and many do (e.g. if you're a Comcast
customer and send to ***@comcast.net you'll get a rejection
that says "Not our customer").
--
Barry Margolin
Arlington, MA
Moe Trin
2013-04-07 22:21:34 UTC
On Sun, 07 Apr 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
Post by Barry Margolin
Post by Martijn Lievaart
Post by Jorgen Grahn
But you make it sound as if disabling all bounces is necessary on
the internet today. I suspect a minority of servers do this.
Maybe a tiny minority? I tested one server (which worked) but
surely someone has collected statistics somewhere ...
Disabling bounces, or at least severely limiting them, is sound
practice. It is better to reject than to bounce.
Same difference. If a message is rejected, the sending server would
traditionally bounce it.
I interpret the discussion as relating to a very different part of the
mail transaction. The "news.admin.net-abuse.blocklisting" Usenet
newsgroup was removed in (perhaps) late 2009, but one thing that got
many people posting (whining) to that newsgroup was their being put on
various blocklists for BELATEDLY sending a bounce. Over and over
again, the regulars in that group stressed that if you're not going to
accept mail (non-existent user, full mail-box, WHAT-EVER) you must do
that during the initial mail transfer stage. Returning a "55x" result
code then ends that mail transfer right there. No later discovery that
the mail is actually spam or the user doesn't exist, or anything else.
No bounce. No landing on the blocklists. Yes, RFC 821, 2821 and 5321
require a non-delivery notification if you accepted the mail for
delivery, and if someone sends them they must expect that there
"could" (or maybe "will") be consequences. Not accepting
undeliverable/unwanted mail is a much better solution.
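
Concretely, the rejection happens inside the SMTP dialogue itself,
along these lines (addresses made up):

    C: MAIL FROM:<someone@example.com>
    S: 250 OK
    C: RCPT TO:<no-such-user@example.net>
    S: 550 5.1.1 No such user here
    C: QUIT
    S: 221 Bye

The transfer never reaches DATA, so the receiving server never accepts
the message and never has to mail anything back to a possibly forged
sender address.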
Post by Barry Margolin
Unless you're talking about the special case where the client is
sending a message to their own domain. Then the submission server
may be able to reject the message immediately, and many do
That's the Horse of a Different Color. There, it's all internal to a
domain, and as many are requiring some form of authentication to
submit mail, "this" domain should fix "their" problems. The "bounce"
issue normally refers to the case of a mis-configured RECEIVING mail
server trying to "Return to Sender" garbage that has a forged "envelope
sender", and the worst abusers would not only bounce, but include the
entire (usually spam) content. That was well and truly deserving to be
placed on all blocklists.

Old guy
Barry Margolin
2013-04-08 06:53:43 UTC
Post by Moe Trin
On Sun, 07 Apr 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
Post by Barry Margolin
Post by Martijn Lievaart
Post by Jorgen Grahn
But you make it sound as if disabling all bounces is necessary on
the internet today. I suspect a minority of servers do this.
Maybe a tiny minority? I tested one server (which worked) but
surely someone has collected statistics somewhere ...
Disabling bounces, or at least severely limiting them, is sound
practice. It is better to reject than to bounce.
Same difference. If a message is rejected, the sending server would
traditionally bounce it.
I interpret the discussion as relating to a very different part of the
mail transaction. The "news.admin.net-abuse.blocklisting" Usenet
newsgroup was removed in (perhaps) late 2009, but one thing that got
many people posting (whining) to that newsgroup was their being put on
various blocklists for BELATEDLY sending a bounce. Over and over
again, the regulars in that group stressed that if you're not going to
accept mail (non-existent user, full mail-box, WHAT-EVER) you must do
that during the initial mail transfer stage. Returning a "55x" result
code then ends that mail transfer right there. No later discovery that
the mail is actually spam or the user doesn't exist, or anything else.
No bounce. No landing on the blocklists. Yes, RFC 821, 2821 and 5321
require a non-delivery notification if you accepted the mail for
delivery, and if someone sends them they must expect that there
"could" (or maybe "will") be consequences. Not accepting
undeliverable/unwanted mail is a much better solution.
In most cases these should be equivalent, except for the From address in
the bounce message -- if they reject it the bounce comes from
Mailer-***@senderdomain; if they accept and then bounce, it comes from
Mailer-***@receiverdomain.

But in the case of spammers sending directly to the destination site, I
can see their point -- it creates additional blowback. Spamming
software doesn't send bounces, so the only bounces that would occur in
this mode are when the receiving server accepts the mail and then
returns it to the (forged) sender.
--
Barry Margolin
Arlington, MA
Robert Wessel
2013-04-05 16:23:14 UTC
On Thu, 4 Apr 2013 08:55:56 -0400, "Mark"
Post by Mark
Hello,
I'm implementing an HTTP parser; as an indicator of HTTP data I'm
searching for 'HTTP/1.? CRLF' in the stream.
The TCP layer may cut the application-provided buffer into chunks
suitable for transferring over the network. Is it possible for HTTP
data (e.g. GET http://www.google.com/index.html HTTP/1.1 CRLF) NOT to
follow immediately after the TCP header? Also, is it possible to have,
for example, a 'GET ...' request split across TCP segments?
While I'm not fully understanding your question: with persistent
connections there can be more than one GET in a stream, nor is there
any guarantee that the first GET isn't split across segments (for
example, you could telnet to an HTTP server and type in the GET, and
the server would usually see one character at a time, one per TCP
segment), and something that looks very much like a GET could be
included in the text of the page.