Posted about 8 years ago
Ever since the original iPhone came out, I've had several ideas about how they managed to achieve such fluidity with relatively mediocre hardware. I mean, it was good at the time, but Android still struggles on hardware that makes that look like a 486… It's absolutely my fault that none of these have been implemented in any open-source framework I'm aware of, so instead of sitting on these ideas and trotting them out at the pub every few months as we reminisce over what could have been, I'm writing about them here. I'm hoping that either someone takes them and runs with them, or that they get thoroughly debunked and I'm made to look like an idiot. The third option is of course that they're ignored, which I think would be a shame, but given I've not managed to get the opportunity to implement them over the last decade, that would hardly be surprising. I feel I should clarify that these aren't all my ideas, but include a mix of observation of and conjecture about contemporary software. This somewhat follows on from the post I made 6 years ago(!). So let's begin.
1. No main-thread UI
The UI should always be able to start drawing when necessary. As careful as you may be, it's practically impossible to write software that will remain perfectly fluid when the UI can be blocked by arbitrary processing. This seems like an obvious one to me, but I suppose the problem is that legacy makes it very difficult to adopt this at a later date. That said, difficult but not impossible. All the major web browsers have adopted this policy, with caveats here and there. The trick is to switch from the idea of 'painting' to the idea of 'assembling', and then using a compositor to do the painting. Easier said than done, of course: most frameworks include the ability to extend painting in a way that would make it impossible to switch to a different thread without breaking things. But as long as it's possible to block UI, it will inevitably happen.
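To make the 'assemble, then composite' split a bit more concrete, here is a minimal sketch in C with pthreads (all names and structures are hypothetical, not taken from any particular framework): the main thread only publishes display lists, while a dedicated compositor thread owns painting and keeps presenting frames even while the main thread is busy.

    /* Minimal sketch of a main-thread/compositor split; all names hypothetical. */
    #include <pthread.h>
    #include <stdlib.h>

    struct display_list { int n_items; /* retained drawing commands, not pixels */ };

    static pthread_mutex_t dl_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct display_list *pending;  /* latest list produced by the main thread */

    /* Main thread: may block on layout, script or I/O; it only publishes lists. */
    void main_thread_update(void)
    {
        struct display_list *dl = calloc(1, sizeof(*dl));
        /* ... assemble retained drawing commands into dl ... */
        pthread_mutex_lock(&dl_lock);
        free(pending);              /* drop an older list that was never picked up */
        pending = dl;
        pthread_mutex_unlock(&dl_lock);
    }

    /* Compositor thread: never waits for the main thread, redraws every vsync. */
    void *compositor_thread(void *unused)
    {
        struct display_list *current = NULL;
        for (;;) {
            pthread_mutex_lock(&dl_lock);
            if (pending) {           /* pick up new content if any has arrived */
                free(current);
                current = pending;
                pending = NULL;
            }
            pthread_mutex_unlock(&dl_lock);
            /* paint 'current' (possibly stale) and present it, then wait for vsync */
        }
        return NULL;
    }

The important property is that nothing the main thread does can stop the compositor loop from presenting a frame.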
2. Contextually-aware compositor
This follows on from the first point; what's the use of having non-blocking UI if it can't respond? Input also needs to be handled away from the main thread, and the compositor (or whatever you want to call the thread that is handling painting) needs to have enough context available that the first response to user input doesn't need to travel to the main thread. Things like hover states, active states, animations, pinch-to-zoom and scrolling all need to be initiated without interaction on the main thread. Of course, main thread interaction will likely eventually be required to update the view, but that initial response needs to be able to happen without it. This is another seemingly obvious one – how can you guarantee a response rate unless you have a thread dedicated to responding within that time? Most browsers are doing this, but not going far enough in my opinion. Scrolling and zooming are often catered for, but not hover/active states, or initialising animations (note: initialising animations. Once they've been initialised, they are indeed run on the compositor, usually).
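As a rough illustration of what 'enough context' can mean (again a hypothetical sketch, not any browser's actual code): if the compositor is handed the scrollable region's bounds along with the display list, a touch-move event arriving on the compositor thread can be answered immediately, and the main thread is only informed afterwards.

    /* Sketch: the compositor owns enough layout context (scroll bounds/offset)
     * to answer input without a round-trip to the main thread. Names are made up. */
    struct scroll_ctx {
        float offset_y;      /* current scroll position, owned by the compositor */
        float min_y, max_y;  /* bounds shipped along with the display list */
    };

    static float clampf(float v, float lo, float hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    /* Called on the compositor/input thread for every touch-move event. */
    void compositor_handle_touch_move(struct scroll_ctx *s, float delta_y)
    {
        s->offset_y = clampf(s->offset_y + delta_y, s->min_y, s->max_y);
        /* re-composite with the new offset right away ... */
        /* ... and asynchronously notify the main thread so it can lay out
         * content that is about to scroll into view. */
    }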
3. Memory bandwidth budget
This is one of the less obvious ideas and something I’ve really wanted to have a go at implementing, but never had the opportunity. A problem I saw a lot while working on the platform for both Firefox for Android and FirefoxOS is that given the work-load of a web browser (which is not entirely dissimilar to the work-load of any information-heavy UI), it was very easy to saturate memory bandwidth. And once you saturate memory bandwidth, you end up having to block somewhere, and painting gets delayed. We’re assuming UI updates are asynchronous (because of course – otherwise we’re blocking on the main thread). I suggest that it’s worth tracking frame time, and only allowing large asynchronous transfers (e.g. texture upload, scaling, format transforms) to take a certain amount of time. After that time has expired, it should wait on the next frame to be composited before resuming (assuming there is a composite scheduled). If the composited frame was delayed to the point that it skipped a frame compared to the last unladen composite, the amount of time dedicated to transfers should be reduced, or the transfer should be delayed until some arbitrary time (i.e. it should only be considered ok to skip a frame every X ms).
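A sketch of what that book-keeping could look like (all numbers and names are made up, purely to illustrate the idea): give large transfers a per-frame time budget, and shrink the budget whenever the compositor misses its frame deadline.

    /* Sketch: a per-frame budget for large asynchronous transfers (texture
     * uploads, scaling, format conversion). Thresholds are illustrative only. */
    #include <stdint.h>

    #define FRAME_BUDGET_US    16666   /* ~60 Hz */
    #define XFER_BUDGET_MAX_US  4000   /* at most 4 ms of transfers per frame */
    #define XFER_BUDGET_MIN_US   500

    static int64_t xfer_budget_us = XFER_BUDGET_MAX_US;

    /* Asked by the upload queue: may we start/continue a transfer this frame? */
    int transfers_allowed(int64_t spent_this_frame_us)
    {
        return spent_this_frame_us < xfer_budget_us;
    }

    /* Called once per composited frame with the measured frame time. */
    void frame_completed(int64_t frame_time_us)
    {
        if (frame_time_us > FRAME_BUDGET_US) {
            /* We skipped a frame: back off, transfers are saturating bandwidth. */
            xfer_budget_us /= 2;
            if (xfer_budget_us < XFER_BUDGET_MIN_US)
                xfer_budget_us = XFER_BUDGET_MIN_US;
        } else if (xfer_budget_us < XFER_BUDGET_MAX_US) {
            /* Frames are on time again: slowly give bandwidth back to transfers. */
            xfer_budget_us += 250;
        }
    }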
It’s interesting that you can see something very similar to this happening in early versions of iOS (I don’t know if it still happens or not) – when scrolling long lists with images that load in dynamically, none of the images will load while the list is animating. The user response was paramount, to the point that it was considered more important to present consistent response than it was to present complete UI. This priority, I think, is a lot of the reason the iPhone feels ‘magic’ and Android phones felt like junk up until around 4.0 (where it’s better, but still not as good as iOS).
4. Level-of-detail
This is something that I did get to partially implement while working on Firefox for Android, though I didn't do such a great job of it, so its current implementation is heavily compromised compared to how I wanted it to work. This is another idea stolen from game development. There will be times, during certain interactions, where processing time will be necessarily limited. Quite often though, during these times, a user's view of the UI will be compromised in some fashion. It's important to understand that you don't always need to present the full-detail view of a UI. In Firefox for Android, this took the form that when scrolling fast enough that rendering couldn't keep up, we would render at half the resolution. This let us render more, and faster, giving the impression of a consistent UI even when the hardware wasn't quite capable of it. I've noticed Microsoft doing similar things since Windows 8; notice how the quality of image scaling reduces markedly while scrolling or animations are in progress. This idea is very implementation-specific. What can be dropped and what you want to drop will differ between platforms, form-factors, hardware, etc. Generally though, some things you can consider dropping: sub-pixel anti-aliasing, high-quality image scaling, render resolution, colour-depth, animations. You may also want to consider showing partial UI if you know that it will very quickly be updated. The Android web-browser during the Honeycomb years did this, and I attempted (with limited success, because it's hard…) to do this with Firefox for Android many years ago.
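To illustrate the sort of decision involved (a trivial sketch with made-up thresholds, not how Firefox for Android actually decided): pick a render scale and which 'luxuries' to keep based on how fast the content is moving.

    /* Sketch: choose level-of-detail from scroll velocity; thresholds are invented. */
    struct render_quality {
        float scale;                 /* 1.0 = full resolution, 0.5 = half resolution */
        int   subpixel_aa;           /* sub-pixel anti-aliasing on/off */
        int   high_quality_scaling;  /* expensive image filtering on/off */
    };

    struct render_quality pick_quality(float scroll_px_per_frame)
    {
        struct render_quality q = { 1.0f, 1, 1 };
        if (scroll_px_per_frame > 30.0f) {         /* flinging: drop a lot */
            q.scale = 0.5f;
            q.subpixel_aa = 0;
            q.high_quality_scaling = 0;
        } else if (scroll_px_per_frame > 5.0f) {   /* slow scroll: drop a little */
            q.subpixel_aa = 0;
        }
        return q;
    }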
Pitfalls
I think it’s easy to read ideas like this and think it boils down to “do everything asynchronously”. Unfortunately, if you take a naïve approach to that, you just end up with something that can be inexplicably slow sometimes and the only way to fix it is via profiling and micro-optimisations. It’s very hard to guarantee a consistent experience if you don’t manage when things happen. Yes, do everything asynchronously, but make sure you do your book-keeping and you manage when it’s done. It’s not only about splitting work up, it’s about making sure it’s done when it’s smart to do so.
You also need to be careful about how you measure these improvements, and to be aware that sometimes results in synthetic tests will even correlate to the opposite of the experience you want. A great example of this, in my opinion, is page-load speed on desktop browsers. All the major desktop browsers concentrate on prioritising the I/O and computation required to get the page to 100%. For heavy desktop sites, however, this means the browser is often very clunky to use while pages are loading (yes, even with out-of-process tabs – see the point about bandwidth above). I highlight this specifically on desktop, because you’re quite likely to not only be browsing much heavier sites that trigger this behaviour, but also to have multiple tabs open. So as soon as you load a couple of heavy sites, your entire browsing experience is compromised. I wouldn’t mind the site taking a little longer to load if it didn’t make the whole browser chug while doing so.
Don't lose sight of your goals. Don't compromise. Things might take longer to complete, deadlines might be missed… But polish can't be overrated. Polish is what people feel and what they remember, and the lack of it can have a devastating effect on someone's perception. It's not always conscious or obvious either, even when you're the developer. Ask yourself "Am I fully satisfied with this?" before marking something as complete. You might still be able to ship if the answer is "No", but make sure you don't lose sight of that and make sure it gets the priority it deserves.
One last point I'll make: I think to really execute on all of this, it requires buy-in from everyone. Not just engineers, not just engineers and managers, but visual designers, user experience, leadership… Everyone. It's too easy to do a job that's good enough, and it's too much responsibility to put it all on one person's shoulders. You really need to be on the ball to produce the kind of software that Apple does almost routinely, but as much as they'd say otherwise, it isn't magic.
Posted about 8 years ago
Every so often I happen to be involved in designing electronics equipment that's supposed to run reliably in remote, inaccessible locations, without any ability for "remote hands" to perform things like power-cycling or the like. I'm talking about really remote locations, possibly with no or only very limited back-haul, and a very high cost of ever sending somebody there for maintenance.
Given that a lot of computer peripherals (chips, modules, ...) use USB
these days, this is often some kind of an embedded ARM (rarely x86) SoM
or SBC, which is hooked up to a custom board that contains a USB hub
chip as well as a line of peripherals.
One of the most important lessons I've learned from experience is: Never trust reset signals / lines, always include power-switching capability. There are many chips and electronics modules available on the market that have either no RESET at all, or that might claim to have a hardware RESET line which you later (painfully) discover to be just a GPIO polled by software which can get stuck, and hence no way to really hard-reset the given component.
In the case of a USB-attached device (even though the USB
might only exist on a circuit board between two ICs), this is typically
rather easy: The USB hub is generally capable of switching the power of
its downstream ports. Many cheap USB hubs don't implement this at all,
or implement only ganged switching, but if you carefully select your USB hub (or, in the case of a custom PCB, the USB hub chip), you can make sure that the given USB hub supports individual port power switching.
Now the next step is how to actually use this from your (embedded) Linux system. It turns out to be harder than expected. After all, we're talking about a standard feature that has been present in the USB specifications since USB 1.x in the late 1990s. So the expectation is that it should be straight-forward to do with any decent operating system.
I don't know how it is on other operating systems, but on Linux I couldn't really find a proper, clean way to do this. For more details, please read my post to the linux-usb mailing list.
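For the record, the 'raw' way of doing it from user space, assuming the hub really implements per-port power switching, is to send the hub-class SetPortFeature/ClearPortFeature requests for PORT_POWER yourself, e.g. via libusb. The sketch below (helper names are mine, error handling and hub discovery omitted) is exactly the kind of work-around I would prefer a clean kernel interface for:

    /* Sketch: power-cycle a hub port via USB hub class requests (libusb-1.0).
     * This talks to the hub directly instead of going through a proper kernel
     * interface; helper names are made up, feature selector per USB 2.0 spec. */
    #include <libusb-1.0/libusb.h>
    #include <unistd.h>

    #define USB_RT_PORT          (LIBUSB_REQUEST_TYPE_CLASS | LIBUSB_RECIPIENT_OTHER)
    #define USB_PORT_FEAT_POWER  8      /* PORT_POWER feature selector */

    static int set_port_power(libusb_device_handle *hub, int port, int on)
    {
        return libusb_control_transfer(hub, USB_RT_PORT,
                                       on ? LIBUSB_REQUEST_SET_FEATURE
                                          : LIBUSB_REQUEST_CLEAR_FEATURE,
                                       USB_PORT_FEAT_POWER, port,
                                       NULL, 0, 1000 /* ms timeout */);
    }

    int power_cycle_port(libusb_device_handle *hub, int port)
    {
        int rc = set_port_power(hub, port, 0);
        if (rc < 0)
            return rc;
        sleep(2);                       /* give the peripheral time to discharge */
        return set_port_power(hub, port, 1);
    }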
Why am I running into this now? Is it such a strange idea? I mean, power-cycling a device should be the simplest and most straightforward thing to do in order to recover from any kind of "stuck state" or other related issue. Logical enabling/disabling of the port, resetting the USB device via the USB protocol, etc. are all just "soft" forms of a reset which at best help with USB-related issues, but not with any other part of a USB device.
And in the case of e.g. a USB-attached cellular modem, we're actually talking about a multi-processor system with multiple built-in micro-controllers, at least one DSP, an ARM core that might run another Linux itself (to implement the USB gadget), ... - certainly complex enough software that you would want to be able to power-cycle it...
I'm curious what the response of the Linux USB gurus is.
Posted about 8 years ago
Today I want to share the pain of running a production 3GPP TCAP/MAP/CAP system and network protocol design in general. The excellent Free Software ASN1/TCAP/MAP/CAP stack (which is made possible by the Pharo live programming environment) I helped create is in heavy production usage (powering standard off-the-shelf components like a SGSN, an AuC or non-standard components to enable new business cases) and sees roaming traffic from a lot of networks. From time to time something odd comes up.
In TCAP/MAP/CAP the messages, but also the Requests/Responses and the possible Errors, are defined using ASN1. Over the last decades ETSI and 3GPP have made various major versions and minor releases (e.g. adding new optional attributes to requests/responses/errors). The biggest new standard is CAMEL, and it is so big and complicated that it was specified in four phases (each phase with its own versions of the ApplicationContext; think of it as a versioned entry point into the definitions of all messages and RPC calls).
One issue in supporting a specific module version (application-context-name) is finding the right minor release of the 3GPP specification (either the newest or oldest for that ACN). Then it is a matter of copying and pasting the ASN1 definition from either a PDF or a Word document into individual files, and after that is done one can fix the broken imports (or modify the ASN1 parser to make a global look-up) and the typos in element names.
This artificial barrier creates two issues for people implementing MAP/CAP-using components. Some use inferior ASN1 tools or can't be bothered to create the input files and decide to hardcode the message content (after all, BER/DER is more or less just nested TLV entries). The second issue is related to time/effort as well. When creating the CAMEL ASN1 files I didn't want to do the work four times (once for each phase) and searched for shortcuts too.
The first issue materialized itself in equipment sending completely broken messages or not sending mandatory(!) elements. So what happens if a big telco sends you a message the stack can't decode, you look up the oldest and youngest release defining this ACN, and see that the element being parsed was always mandatory? Right, one adds an OPTIONAL modifier to be able to move forward…
The second issue is on me though. I started with a set of CAMEL phase3 files and assumed that only the operations (and their arguments/responses) would be different across different CAMEL phases, but that the supporting structs they use would stay the same. My assumption (and this brings us to protocol design) was that, besides the versioning of the module, they would be conservative and extend supporting types in a forward-compatible way, so I integrated phase2 and phase1 into the same set of files.
And then reality set in, and the logs of the system showed a message that caused an exception during parsing (normally this only happens for the first kind of issue). An extension to the Request structure was changed in a non-forward-compatible way. Let's have a look:
InitialDPArgExtension ::= SEQUENCE {
-    naCarrierInformation           [0] NACarrierInformation OPTIONAL,
-    gmscAddress                    [1] ISDN-AddressString OPTIONAL,
-    ...
+    gmscAddress                    [0] ISDN-AddressString OPTIONAL,
     *more new optional elements*
+    ...,
+    enhancedDialledServicesAllowed [11] NULL OPTIONAL,
     *more elements after the extension marker*
}
So one element (naCarrierInformation) got removed, and then every following element was renumbered and the extension marker was moved further down. In theory the InitialDPArgExtension name binding exists once in the phase2 definition and once in phase3, and 3GPP had every right to define a new binding with different content. The engineering question is whether this was a good decision.
A change in application-context allows one to remove some old cruft and make room for new elements. The tag space might be considered a scarce resource, and making room is saving that resource. On the other hand, in the history of GSM no other struct has run out of tags, and there are various other approaches to the problem. The above is already an extension to an extension, and the step to an extension of an extension of an extension doesn't seem so absurd anymore.
So please think of forward compatibility when designing protocols, think of the implementor, make the definition machine-readable, and please get the imports right so one doesn't need to resort to a global symbol search. If you are having interesting core network issues related to TCAP, MAP and CAP, consider contacting me.
Posted about 8 years ago
My Galera set-up on Kubernetes and the Azure LoadBalancer in front of it seem to work nicely, but one big TODO is to implement proper health checks. If a node is down, in maintenance or split from the network, it should not be part of the LoadBalancer.
The Azure LoadBalancer has support for custom HTTP probes and I wanted to write something very simple that handles the HTTP GET, opens a MySQL connection to the destination and checks if it is connected to a primary. As this is about health checks, the code should be small and reliable.
To improve my Go(-lang) skills I decided to write my healthcheck in Go. And it seemed like a good idea: Go has a powerful HTTP package, a SQL API package and two MySQL implementations. So the entire prototype is just about 72 lines (with comments and empty lines) and I think that qualifies as small. Prototyping the MySQL code took some iterations, but in general it went quite quickly. But how reliable is it? Go introduced the nice concept of a context.Context, so any operation should be associated with a context and it should be passed as an argument from one method to another. One can create a child context and associate it with a deadline (absolute time) or timeout (relative), and there is a way to cancel it.
I grabbed the Context from the HTTP Request, added a timeout and called a function to do the MySQL check. Wow, that was easy. Some polish to parse the parameters from the CLI and I am ready to deploy it! But let's see how reliable it is.
I imagined the following error conditions:
The destination IP is reachable but no one listening on the port. The TCP connection will fail quickly (SYN -> RST,ACK)
The destination IP ends in a blackhole (no RST,ACK received). One would run into a long connect timeout
The Galera node (or machine hosting it) is overloaded. While the connect succeeds, the authentication or a query might stall
The Galera node is split and not a master
The first and fourth error conditions are easy to test/simulate and trivial to implement properly. I then moved to the third one. My first choice was to implement an infinitely slow Galera node, which I did by using nc -l 3006 to accept a TCP connection and then send nothing. I made a health probe and waited… and waited… no timeout. Not after 2s as programmed in the context, not after 2min and not after... (okay, I gave up after 30 min). Pretty discouraging!
After some reading and browsing I saw an open PR to add context.Context support to the MySQL backend. I modified my import, ran go get to fetch it, go build and retested. Okay that didn’t work either. So let’s try the other MySQL implementation, again change the package imports, go get and go build and retest. I picked the wrong package name but even after picking the right package this driver failed to parse the Database URL. At that point I decided to go back to the first implementation and have a deeper look.
So while many of the SQL API methods take a Context as argument, the Open one does not. Open says it might or might not connect to the database, and in the case of MySQL it does connect to it. Let's see if there is a workaround. I could spawn a Go routine and have a selective receive on either the result or a timeout. While this would make it possible to respond to the HTTP request, it does create two issues. First, one can't cancel Go routines, so I would leak memory; but worse, I might run into a connection limit of the Galera node. What about other workarounds? It seems I can play with custom parameters for readTimeout and writeTimeout and at least limit the timeout per I/O operation. I guess it takes a bit of tuning to find good values for a busy system, and let's hope that context.Context will be used in more places in the future.
Posted about 8 years ago
Overhyped Docker missing the most basic features
I've always been extremely skeptical of suddenly emerging over-hyped technologies, particularly if they claim to solve problems by adding yet another layer to systems that are already sufficiently complex themselves.
There are of course many issues with containers, ranging from replicated system libraries to the basic underlying statement that you're giving up on the system package manager to properly deal with dependencies.
I'm also highly skeptical of FOSS projects that are primarily driven by one (VC funded?) company. Especially if their offering includes a so-called cloud service which they can stop operating at any given point in time, or (more realistically) first get everybody to use and then start charging for.
But well, despite all the bad things I read about it over the years, one day in May 2017 I finally thought: let's give it a try. My problem to solve as a test balloon is fairly simple.
My basic use case
The plan is to start OsmoSTP, the m3ua-testtool and the sua-testtool,
which both connect to OsmoSTP. By running this setup inside containers
and inside an internal network, we could then execute the entire
testsuite, e.g. during jenkins tests, without having IP address or port
number conflicts. It could even run multiple times in parallel on one
buildhost, verifying different patches as part of the continuous
integration setup.
This application is not so complex. All it needs is three containers,
an internal network and some connections in between. Should be a piece
of cake, right?
But enter the world of buzzword-fueled web-4000.0 software-defined virtualised and orchestrated container NFV + SDN voodoo: It turns out to be impossible, at least with the preferred tools they advertise.
Dockerfiles
The part that worked relatively easily was writing a few Dockerfiles to
build the actual containers. All based on debian:jessie from the
library.
As m3ua-testsuite is written in guile, and needs to build some guile
plugin/extension, I had to actually include guile-2.0-dev and other
packages in the container, making it a bit bloated.
I couldn't immediately find a nice example Dockerfile recipe that would
allow me to build stuff from source outside of the container, and then
install the resulting binaries into the container. This seems to be a
somewhat weak spot, where more support/infrastructure would be helpful.
I guess the idea is that you simply install applications via package
feeds and apt-get. But I digress.
So after some tinkering, I ended up with three docker containers:
one running OsmoSTP
one running m3ua-testtool
one running sua-testtool
I also managed to create an internal bridged network between the
containers, so the containers could talk to one another.
However, I have to manually start each of the containers with ugly long
command line arguments, such as docker run --network sigtran --ip
172.18.0.200 -it osmo-stp-master. This is of course sub-optimal, and
what Docker Services + Stacks should resolve.
Services + Stacks
The idea seems good: A service defines how a given container is run,
and a stack defines multiple containers and their relation to each
other. So it should be simple to define a stack with three
services, right?
Well, it turns out that it is not. Docker documents that you can
configure a static ipv4_address [1] for each service/container, but it
seems related configuration statements are simply silently
ignored/discarded [2], [3], [4].
This seems to be related to the fact that, for some strange reason, stacks can (at least in later versions of docker) only use overlay type networks, rather than the much simpler bridge networks. And while bridge networks appear to support static IP address allocations, overlay apparently doesn't.
I still have a hard time grasping that something that considers itself a serious product for production use (by a company with an estimated value of over a billion USD, not by a few hobbyists) has no support for running containers on static IP addresses. How many applications out there have I seen that require static IP address configuration? How much simpler do setups get if you don't have to rely on things like dynamic DNS updates (or DNS availability at all)?
So I'm stuck with having to manually configure the network between my
containers, and manually starting them by clumsy shell scripts, rather
than having a proper abstraction for all of that. Well done :/
Exposing Ports
Unrelated to all of the above: If you run some software inside
containers, you will pretty soon want to expose some network services
from containers. This should also be the most basic task on the planet.
However, it seems that the creators of docker live in the early 1980s, when only the TCP and UDP transport protocols existed. They seem to have missed that by the late 1990s to early 2000s, protocols like SCTP or DCCP had been invented.
But yet, in 2017, Docker chooses to
blindly assume TCP in
https://docs.docker.com/engine/reference/builder/#expose without even
mentioning it (or designing the syntax to accept any specification of
the protocol)
design a syntax (/tcp) in the command-line parsing (see
https://docs.docker.com/engine/reference/run/#expose-incoming-ports), but then
only parse tcp and udp, despite people requesting support for other
protocols like SCTP as early as three years ago
Now some of the readers may think 'who uses SCTP anyway'. I will give
you a straight answer: Everyone who has a mobile phone uses SCTP. This
is due to the fact that pretty much all the connections inside cellular
networks (at least for 3G/4G networks, and in reality also for many 2G
networks) are using SCTP as underlying transport protocol, from the
radio access network into the core network. So every time you switch
your phone on, or do anything with it, you are using SCTP. Not on your
phone itself, but by all the systems that form the network that you're
using. And with the drive to C-RAN, NFV, SDN and all the other
buzzwords also appearing in the Cellular Telecom field, people should
actually worry about it, if they want to be a part of the software stack
that is used in future cellular telecom systems.
Summary
After spending the better part of a day to do something that seemed like the most basic use case for running three networked containers using Docker, I'm back to step one: most likely inventing some custom scripts based on unshare to run my three test programs in a separate network namespace for isolated test suite execution as part of a Jenkins CI setup :/
It's also clear that Docker apparently doesn't care much about playing a role in the Cellular Telecom world, which is increasingly moving away from proprietary and hardware-based systems (like STPs) to virtualised, software-based systems.
[1] https://docs.docker.com/compose/compose-file/#ipv4address-ipv6address
[2] https://forums.docker.com/t/docker-swarm-1-13-static-ips-for-containers/28060
[3] https://github.com/moby/moby/issues/31860
[4] https://github.com/moby/moby/issues/24170
Posted about 8 years ago
After the public user-oriented OsmoCon 2017, we also recently had the 6th incarnation of our annual contributors-only Osmocom Developer Conference: the OsmoDevCon 2017. This is a much smaller group, typically about 20 people, and is limited to actual developers who have a past record of contributing to any of the many Osmocom projects.
We had a large number of presentations and discussions. In fact, so
large that the schedule of talks extended from 10am to midnight on some
days. While this is great, it also means that there was definitely too
little time for more informal conversations, chatting or even actual
work on code.
We also have such a wide range of topics and scope inside Osmocom that the traditional ad-hoc scheduling approach no longer seems to be working as it used to. Not everyone is interested in (or has time for)
all the topics, so we should group them according to their topic/subject
on a given day or half-day. This will enable people to attend only
those days that are relevant to them, and spend the remaining day in an
adjacent room hacking away on code.
It's sad that we only have OsmoDevCon once per year. Maybe that's
actually also something to think about. Rather than having 4 days once
per year, maybe have two weekends per year.
Always in motion the future is.
Posted about 8 years ago
My former gpl-violations.org colleague Armijn Hemel and Shane Coughlan
(former coordinator of the FSFE Legal Network) have written a book on
practical GPL compliance issues.
I've read through it (in the bath tub of course, what better place to read technical literature), and I can agree wholeheartedly with its contents. For those who have been involved in GPL compliance engineering there shouldn't be much new - but for the vast majority of developers out there who have had little exposure to the bread-and-butter work of providing complete and corresponding source code, it makes an excellent introductory text.
The book focuses on compliance with GPLv2, which is probably not too surprising given that it's published by the Linux Foundation, and Linux being GPLv2.
You can download an electronic copy of the book from
https://www.linuxfoundation.org/news-media/research/practical-gpl-compliance
Given the subject matter is Free Software, and the book is written by long-time community members, I cannot help but notice with a bit of surprise that the book is released under classic copyright, All rights reserved, with no freedoms granted to the user.
Considering the sensitive legal topics touched, I can understand a possible motivation by the authors not to permit derivative works. But then, there still are licenses such as CC-BY-ND which prevent derivative works but still permit users to make and distribute copies of the work itself. I've made that recommendation / request to Shane, let's see if they can arrange for some more freedom for their readers.
Posted about 8 years ago
It's already one week past the event, so I really have to sit down and write some review of the first public Osmocom Conference ever: OsmoCon 2017.
The event was a huge success, by all accounts.
We've not only been sold out, but we also had to turn down some last-minute registrations due to the venue being beyond capacity (60 seats). People traveled from Japan, India, the US, Mexico and many other places to attend.
We've had an amazing audience ranging from commercial operators to community cellular operators to professional developers doing work related to Osmocom, academia, IT security crowds and last but not least enthusiasts/hobbyists, with whom the project[s] started. I've received exclusively positive feedback from many attendees.
We've had a great programme. Some part of it was of introductory
nature and probably not too interesting if you've been in Osmocom for
a few years. However, the work on 3G as well as the current roadmap
was probably not as widely known yet. Also, I really loved to see
Roch's talk about Running a commercial cellular network with Osmocom
software
as well as the talk on Facebook's OpenCellular BTS hardware
and the Community Cellular Manager.
We had very professional live streaming + video recordings courtesy of the C3VOC team. Thanks a lot for your support and for having the video recordings of all talks online already the day after the event.
We also received some requests for improvements, many of which we will
hopefully consider before the next Osmocom Conference:
have a multiple day event. Particularly if you're traveling
long-distance, it is a lot of overhead for a single-day event. We of
course fully understand that. On the other hand, it was the first
Osmocom Conference, and hence it was a test balloon where it was
initially unclear if we'll be able to get a reasonable number of
attendees interested at all, or not. And organizing an event with
venue and talks for multiple days if in the end only 10 people attend
would have been a lot of effort and financial risk. But now that we
know there are interested folks, we can definitely think of a multiple
day event next time
Signs indicating venue details on the last meters. I agree, this could have been better. The address of the venue was published, but we could have had some signs/posters at the door pointing you to the right meeting room inside the venue. Sorry for that.
Better internet connectivity. This is a double-edged sword. Of
course we want our audience to be primarily focused on the talks and
not distracted :P I would hope that most people are able to survive
a one day event without good connectivity, but for sure we will have
to improve in case of a multiple-day event in the future
In terms of my requests to the attendees, I only have one
Participate in the discussions on the schedule/programme while it is
still possible to influence it. When we started to put together the
programme, I posted about it on the openbsc mailing list and invited
feedback. Still, most people seem to have missed the time window
during which talks could have been submitted and the schedule still
influenced before finalizing it
Register in time. We had almost no registrations until about two weeks ahead of the event (and I was considering cancelling it), and then we suddenly were sold out in the week ahead of the event. We've had people who first booked their travel, only to learn that the tickets were sold out. I guess we will introduce early bird pricing and add a very expensive last minute ticket option next year in order to increase motivation to register early and thus give us flexibility regarding venue planning.
Thanks again to everyone involved in OsmoCon 2017!
Ok, now, all of you who missed the event: Go to
https://media.ccc.de/c/osmocon17 and check out the recordings. Have
fun!
Posted about 8 years ago
In my previous posts I wrote about my set-up of MariaDB Galera on Kubernetes. Now I have some first experience with this set-up and can provide some guidance. I used an ill-fated TCP health-check that led to MariaDB Galera blocking the originating IPv4 address from accessing the cluster because the health-check never completed a MySQL handshake, and it seems (the logs are gone) that this led to the sync between the different systems breaking too.
When I woke up, my entire cluster was down and didn't recover. Some pods restarted and I ran into an Azure Kubernetes bug where a Persistent Storage volume would be unmounted but not detached. This means the storage can not be re-attached to the new pod. The Microsoft upstream project is a bit hostile but the issue is known. If you are seeing an error about the storage still being attached, you can go to the portal, find the agent that has it attached and detach it by hand.
To bring the cluster back online there is a chicken/egg problem. The entrypoint.sh discovers the members of the cluster by using environment variables. If the cluster is entirely down and the first pod is starting, it will just exit as it can't connect to the others. My first approach was to keep the other nodes down, use kubectl edit rc/galera-node-X and set replicas to 0. But then the service is still exporting the information. In the end I deleted svc/galera-node-X and waited for the first pod to start. Once it was up I could re-create the services again.
My next steps are to add proper health checks, some monitoring, and to see if there is a longer-term archive for the log data of a (deleted) pod.
Posted about 8 years ago
Observations on SCTP and Linux
When I was still doing Linux kernel work with netfilter/iptables in the early 2000s, I was somebody who actually regularly had a look at the new RFCs that came out. So I saw the SCTP RFCs, SIGTRAN RFCs, SIP and RTP, etc. all released during those years. I was quite happy to see that for new protocols like SCTP and later DCCP, Linux quickly received a mainline implementation.
Now most people won't have used SCTP so far, but it is a protocol used
as transport layer in a lot of telecom protocols for more than a decade
now. Virtually all protocols that have traditionally been spoken over
time-division multiplex E1/T1 links have been migrated over to SCTP
based protocol stackings.
Working on various Open Source telecom related projects, I of course come into contact with SCTP every so often. Particularly some years back when implementing the Erlang SIGTRAN code in erlang/osmo_ss7, and most recently now with the introduction of libosmo-sigtran with its OsmoSTP, both part of the libosmo-sccp repository.
I've also had to work with various proprietary telecom equipment over the years. Whether that's some eNodeB hardware from a large brand telecom supplier, or whether it's an MSC of some other vendor. And they all had one thing in common: Nobody seemed to use the Linux kernel SCTP code. They all used proprietary implementations in userspace, using RAW sockets on the kernel interface.
I always found this quite odd, knowing that this is the route that you
have to take on proprietary OSs without native SCTP support, such as
Windows. But on Linux? Why? Based on rumors, people find the Linux
SCTP implementation not mature enough, but hard evidence is hard to come
by.
As much as it pains me to say this, the kind of Linux SCTP bugs I have
seen within the scope of our work on Osmocom seem to hint that there is
at least some truth to this (see e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1308360 or
https://bugzilla.redhat.com/show_bug.cgi?id=1308362).
Sure, software always has bugs and will have bugs. But we at Osmocom
are 10-15 years "late" with our implementations of higher-layer
protocols compared to what the mainstream telecom industry does. So if
we find something, and we find it even already during R&D of some
userspace code, not even under load or in production, then that seems a
bit unsettling.
One would have expected that, with all their market power and plenty of Linux-based devices in the telecom sphere, those large telecom suppliers would have invested in improving the mainline Linux SCTP code. I mean, they all use the UDP and TCP code of the kernel, and it works for most of the other network protocols in the kernel, so why not for SCTP? I guess it comes back to the fundamental lack of understanding of how open source development works: that it is something that the given industry/user base must invest in jointly.
The latest discovered bug
During the last months, I have been implementing SCCP, SUA, M3UA and OsmoSTP (a Signal Transfer Point). They were required for an effort to add 3GPP-compliant A-over-IP to OsmoBSC and OsmoMSC.
For quite some time I was seeing some erratic behavior when at some
point the STP would not receive/process a given message sent by one of
the clients (ASPs) connected. I tried to ignore the problem initially
until the code matured more and more, but the problems remained.
It became even more obvious when using Michael Tuexen's m3ua-testtool, where sometimes even the most basic test cases, consisting of sending + receiving a single pair of messages like ASPUP -> ASPUP_ACK, were failing. And when the test case was re-tried, the problem often disappeared.
Also, whenever I tried to observe what was happening by means of strace, the problem would disappear completely and never re-appear until strace was detached.
Of course, given that I've written several thousands of lines of new code, it was clear to me that the bug must be in my code. Yesterday I was finally prepared to accept that it might actually be a Linux SCTP bug. Not being able to reproduce the problem on a FreeBSD VM also pointed clearly in this direction.
Now I could simply have collected some information and filed a bug report (which some kernel hackers at RedHat have thankfully invited me to do!), but I thought my use case was too complex. You would have to compile a dozen different Osmocom libraries, configure the STP, run the scheme-language m3ua-testtool in guile, etc. - I guess nobody would have bothered to go that far.
So today I tried to implement a test case that reproduced the problem in plain C, without any external dependencies. And for many hours, I couldn't make the bug show up. I tried to be as close as possible to what was happening in OsmoSTP: I used non-blocking mode on client and server, used the SCTP_NODELAY socket option, used the sctp_recvmsg() library wrapper to receive events, but the bug was not reproducible.
Some hours later, it became clear that there was one setsockopt() in
OsmoSTP (actually, libosmo-netif) which enabled all existing SCTP
events. I did this at the time to make sure OsmoSTP has the maximum
insight possible into what's happening on the SCTP transport layer, such
as address fail-overs and the like.
As it turned out, adding that setsockopt for SCTP_EVENTS to my test code made the problem reproducible. After playing around with which of the flags, it seems that enabling the SENDER_DRY_EVENT flag makes the bug appear.
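For context, that event subscription happens via the SCTP_EVENTS socket option and struct sctp_event_subscribe; roughly like the reduced sketch below (not the verbatim libosmo-netif code, and the helper name is mine):

    /* Sketch: subscribe to SCTP notifications, including the sender-dry event
     * that triggered the problem. Reduced example, not the actual Osmocom code. */
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    int subscribe_sctp_events(int fd)
    {
        struct sctp_event_subscribe ev;

        memset(&ev, 0, sizeof(ev));
        ev.sctp_data_io_event     = 1;  /* per-message SCTP_SNDRCV ancillary data */
        ev.sctp_association_event = 1;  /* association up/down notifications */
        ev.sctp_address_event     = 1;  /* path/address fail-over notifications */
        ev.sctp_sender_dry_event  = 1;  /* enabling this one made the bug appear */

        return setsockopt(fd, IPPROTO_SCTP, SCTP_EVENTS, &ev, sizeof(ev));
    }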
You can find my detailed report about this issue in
https://bugzilla.redhat.com/show_bug.cgi?id=1442784 and a program to
reproduce the issue at
http://people.osmocom.org/laforge/sctp-nonblock/sctp-dry-event.c
Inside the Osmocom world, luckily we can live without the
SENDER_DRY_EVENT and a corresponding work-around has been submitted and
merged as https://gerrit.osmocom.org/#/c/2386/
With that work-around in place, suddenly all the m3ua-testtool and sua-testtool test cases are reliably green
(PASSED) and OsmoSTP works more smoothly, too.
What do we learn from this?
Free Software in the Telecom sphere is getting too little attention. This is true even for those small portions of telecom-relevant protocols that ended up in the kernel, like SCTP or, more recently, the GTP module I co-authored. They are getting too little attention in development, even less attention in maintenance, and people seem to focus more on not using them, rather than on fixing and maintaining what is there.
It makes me really sad to see this. Telecoms is such a massive industry, with billions upon billions of revenue for the classic telecom equipment vendors. Surely, they would be able to co-invest in some basic infrastructure like proper and reliable testing / continuous integration for SCTP. More recently, we see millions and more millions of VC cash burned by buzzword-flinging companies doing "NFV" and "SDN". But they would rather reimplement network stacks in userspace than fix, complete and test those little telecom infrastructure components which we have so far, like the SCTP protocol :(
Where are the contributions to open source telecom parts from Ericsson,
Nokia (former NSN), Huawei and the like? I'm not even dreaming about
the actual applications / network elements, but merely the maintenance
of something as basic as SCTP. To be fair, Motorola was involved early
on in the Linux SCTP code, and Huawei contributed a long series of fixes
in 2013/2014. But that's not the kind of long-term maintenance
contribution that one would normally expect from the primary interest
group in SCTP.
Finally, let me say thanks to the Linux SCTP maintainers. I'm not complaining about them! They're doing a great job, given the arcane code base and the fact that they are not working for a company that has SCTP-based products as its core business. I'm sure they would love more support and contributions from the Telecom world, too.