Using .Net Core to Tee Streams and Buffered AWS S3 Uploads
My team was reviewing some .NET Framework (C#) code from a company that we recently acquired. There was a service that took a posted SVG file, converted the image to a thumbnail, and saved both files to AWS S3. The SVG conversion was done by launching a utility called Inkscape as a Process. For the upload to S3, they used a class called TransferUtility in the AWS SDK, which automatically facilitates a multi-part upload.
In the process, the SVG and the converted PDF were saved as temp files. I asked if, instead of temp files, they had thought about piping streams. There turned out to be two reasons why they didn’t:
- TransferUtility only supports “seekable” streams where the content length is known. The STDOUT pipe from a Process call is not seekable.
- .NET doesn’t have built-in support for “Tee”-ing or multiplexing a stream.
I should have left it at that and moved on, but my curiosity got the best of me. Writing temp files is not a problem if you make sure to clean up after them, and the solution was “good enough”. Nonetheless, I started messing around with this topic. I had been porting code to AWS Lambda (serverless), where temp files are not the best thing to be doing, and I was also interested in having more control over the memory footprint during transfer (I did not want to buffer entire image files in memory). I ended up writing an extension to TransferUtility that supports non-seekable streams, as well as a class that redirects a stream to multiple streams (i.e. a “tee”).
Basically, what I wanted to do was to get from this:
In doing so, I wanted something that would work with any “recent” .NET Core code base with minimal external dependencies and libraries. Below is a description of the two pieces of the solution that were created:
- Extension of AWS Transfer Utility to support non-seekable streams and the creation of a .NET Stream to write S3 objects
- A .NET Stream class to redirect data written to it to multiple outputs (TeeStream)
Adding Support for Non-Seekable Streams to AWS TransferUtility
The NuGet package S3BufferedUpload (source and documentation here) adds UploadBuffered and UploadBufferedAsync methods to the TransferUtility class. They facilitate uploading of non-seekable streams (like the redirected STDOUT stream from a Process). Credit to Norm Johanson for posting his re:Invent 2019 demo that provided the roadmap for this effort.
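For example, uploading the redirected STDOUT of a child process might look like the sketch below. I’m assuming here that UploadBuffered mirrors the (stream, bucketName, key) shape of the standard Upload overload; the bucket/key names and converter command are placeholders.

```csharp
using System.Diagnostics;
using Amazon.S3;
using Amazon.S3.Transfer;
using S3BufferedUpload; // NuGet: S3BufferedUpload (namespace assumed)

var s3Client = new AmazonS3Client();
var transferUtility = new TransferUtility(s3Client);

// Launch a converter and redirect its STDOUT -- a non-seekable stream.
var psi = new ProcessStartInfo("convert", "input.svg png:-")
{
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using var process = Process.Start(psi)!;

// UploadBuffered accepts the non-seekable stream directly; the plain
// Upload methods would reject it because the length is unknown.
transferUtility.UploadBuffered(process.StandardOutput.BaseStream,
                               "my-bucket", "thumbnails/input.png");
process.WaitForExit();
```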
This package includes a class called S3BufferedUploadStream which is a stream that, as its name suggests, writes to S3. I use this class from the new UploadBuffered and UploadBufferedAsync TransferUtility methods.
To improve performance, the stream buffers content until a configurable threshold is reached, and then writes the buffered content to S3 as a single part. This reduces the number of calls made to S3.
Unlike TransferUtility, this class does not support concurrent uploads of file parts. For the most part, this is because we have to be able to identify the final part when using S3 server-side encryption, and we don’t know the end until we reach it. Additionally, since I plan on using this in Lambda, I am avoiding scaling out multiple threads, each with its own buffer; otherwise, I would need to size my Lambda function more aggressively. The trade-off is between speed and scalability/cost. For “reasonably sized” image files, this trade-off works for me.
With the S3BufferedUploadStream class, I can now start to model my conversion and upload as a stream pipeline. But, I still want to be able to split/multiplex streams.
.NET TeeStream Class
The TeeStream class, at its most basic, simply forwards the bytes read from an Input stream to one or more Output streams. It does not buffer, provide any backflow handling, etc. If a TeeStream can’t write to an Output stream, it will wait until it can.
What is a little more interesting is that a TeeStream object can itself be used as an input, which means that we can send the data from the Input stream to anything that accepts a readable stream, while still writing to one or more Output streams.
When constructing a TeeStream, you have to define which Output streams you want to write to. In the list, you can include the value TeeStream.Self which will set up the TeeStream to be an Output stream itself.
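As a minimal construction sketch (I’m assuming a params-style constructor based on the description; check the package documentation for the exact signature, and the output streams here are placeholders):

```csharp
using System.IO;
using TeeStreaming; // NuGet: TeeStreaming (namespace assumed)

var fileOut = File.Create("copy.bin");
var memOut  = new MemoryStream();

// Writes to the tee fan out to both output streams; including
// TeeStream.Self also makes the tee itself readable.
using var tee = new TeeStream(fileOut, memOut, TeeStream.Self);
```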
When data is written to a TeeStream, it is forwarded on to the Output streams. In the event that Output streams cannot be written to, the TeeStream will block until they can. You can set timeouts on the Output streams if you do not want to wait “indefinitely”.
When the TeeStream is itself used as an Output stream (via TeeStream.Self), a fixed-length buffer is created at construction time. This buffer should be large enough to hold data between reads by whatever is consuming the TeeStream. If the buffer fills up, the TeeStream will block until data is consumed.
Dealing With the End of a Stream
The Stream method CopyTo is a common way to copy one stream to another. The default implementation of Stream.CopyTo reads from the source stream and writes that data to the destination stream until the source’s Read/ReadAsync returns zero.
Using CopyTo to send data to a TeeStream presents a problem: the TeeStream never “knows” when the Input stream is exhausted. This matters because the TeeStream needs to generate its own end-of-stream indicator (i.e. return zero from Read/ReadAsync) for anything reading from it.
Setting up a timeout isn’t a good option, because there may be something happening “upstream” that takes time. There are two ways of dealing with this:
- Call TeeStream’s SetAtEnd method after sending data to it using CopyTo/CopyToAsync.
- Use TeeStream’s CopyFrom/CopyFromAsync methods to “pull” data from an Input stream instead of “pushing” it using CopyTo.
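In sketch form, the two options look like this (the method names come from the package, though the exact signatures are assumptions, and the streams are placeholders):

```csharp
using System.IO;
using TeeStreaming; // NuGet: TeeStreaming (namespace assumed)

var destination = new MemoryStream();
var source      = new MemoryStream(new byte[] { 1, 2, 3 });

using var tee = new TeeStream(destination, TeeStream.Self);

// Option 1: push with CopyToAsync, then mark the end explicitly so
// readers of the tee eventually get a zero-length read.
await source.CopyToAsync(tee);
tee.SetAtEnd();

// Option 2: have the tee pull from the source instead; it can then
// detect the end of the stream on its own.
// await tee.CopyFromAsync(source);
```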
Putting It Together
Using the S3BufferedUploadStream and TeeStream, we can now do this:
Obviously, embedding the conversion code inline is sloppy, but it’s easier to show here as a single code block. I am demonstrating using ImageMagick convert instead of Inkscape, since it is a more common use case.
What’s going on in this code?
1. Instantiate an AmazonS3Client object. If necessary, you could set up credentials, region, etc. here.
2. Create S3BufferedUploadStream instances to create publicly readable S3 objects for the full-sized and thumbnail image files.
3. Create a TeeStream to output to the S3 thumbnail object and the Response.Body.
4. Create another TeeStream to output to the S3 full-sized object and allow the TeeStream itself to be read from.
5. Create a Task that calls convert, using the TeeStream from step 4 as input and the TeeStream from step 3 as output (if something goes wrong, an exception is thrown with the contents of STDERR used to populate the Exception).
6. Copy the posted Request.Body to the TeeStream created in step 4 (this feeds the Request.Body image to the convert task while saving it to S3).
7. Wait for the Task to finish, which includes writing the thumbnail image data to S3 as well as to the Response.Body.
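A sketch of those steps is below. It assumes an ASP.NET Core action, where Request and Response are available; the constructor and method shapes for S3BufferedUploadStream and TeeStream are assumptions based on the descriptions above, and the bucket/key names and ImageMagick arguments are placeholders.

```csharp
using System.Diagnostics;
using Amazon.S3;
using S3BufferedUpload;   // NuGet: S3BufferedUpload (namespaces assumed)
using TeeStreaming;       // NuGet: TeeStreaming

// 1. S3 client (credentials/region could be configured here).
using var s3Client = new AmazonS3Client();

// 2. Buffered upload streams for the full-size image and the thumbnail.
using var fullUpload  = new S3BufferedUploadStream(s3Client, "my-bucket", "images/full.svg");
using var thumbUpload = new S3BufferedUploadStream(s3Client, "my-bucket", "thumbs/full.png");

// 3. Thumbnail tee: writes fan out to the S3 object and the HTTP response.
using var thumbTee = new TeeStream(thumbUpload, Response.Body);

// 4. Full-size tee: writes fan out to the S3 object and to the tee's own
//    readable side (TeeStream.Self), which will feed the converter.
using var fullTee = new TeeStream(fullUpload, TeeStream.Self);

// 5. Conversion task: ImageMagick reads the SVG from fullTee on STDIN and
//    writes the PNG thumbnail to thumbTee via STDOUT.
var convertTask = Task.Run(async () =>
{
    var psi = new ProcessStartInfo("convert", "svg:- -resize 128x128 png:-")
    {
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
        RedirectStandardError = true,
        UseShellExecute = false
    };
    using var proc = Process.Start(psi)!;
    var feed = Task.Run(async () =>
    {
        await fullTee.CopyToAsync(proc.StandardInput.BaseStream);
        proc.StandardInput.Close(); // EOF on STDIN lets convert finish
    });
    await proc.StandardOutput.BaseStream.CopyToAsync(thumbTee);
    await feed;
    proc.WaitForExit();
    if (proc.ExitCode != 0)
        throw new Exception(await proc.StandardError.ReadToEndAsync());
});

// 6. Pull the posted body through the full-size tee; CopyFromAsync also
//    marks the end of the stream for the converter's reader.
await fullTee.CopyFromAsync(Request.Body);

// 7. Wait for the conversion (and the thumbnail writes) to finish.
await convertTask;
```

Disposing the streams (via the using declarations) is what I would expect to finalize the buffered S3 uploads, but that behavior is an assumption here; consult the package documentation.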
Admittedly this demands refactoring and dependency injection, but for demonstration purposes, it meets the need.
Where to Get It
If any of this is interesting to you, here are the locations where you can get the libraries and a demo.
Using dotnet to add these packages to a project…
dotnet add package TeeStreaming
dotnet add package S3BufferedUpload
The source for these packages and a demonstration project are on GitHub.