My (Painful) Experience With Ubiquiti As A DevOps Engineer

Background

I recently decided to try out Ubiquiti after hearing many of my friends and people on social media (Discord especially) speak highly of their experience. On a previous project to replace my NAS, I did all my own research, bought an Asustor NAS, made a lot of mistakes, and ultimately discovered that the device was neither stable enough nor easy enough to use. After returning a lot of the hardware (I found out pretty fast how bad it sucked), I took my friend's advice and bought Synology, even getting Synology-branded network cards, RAM and SSDs for the best experience, and it turned out great. Their OS is robust and easy to use, and their hardware is performant and reliable. So instead of doing my own research this time, I went with what the smart people around me said and jumped in with Ubiquiti. My plan was to add one piece at a time and eventually replace my router, existing switches and wireless access point, getting deep enough into the ecosystem to reap the benefits of hardware designed to work together.

I mention the DevOps Engineer part mostly for context. I'm a pretty technical person, and I've been a DevOps Engineer for 10 years. But if you know DevOps, you probably know it doesn't intersect with in-depth networking concepts very often. Almost every DevOps Engineer I know reaches for a subnet calculator for anything involving subnets, and that's to be expected: there's no reason to spend effort learning something you might use once every 3 years. I'm a little more experienced than average, but not by much. I can calculate the number of hosts in a subnet without a tool, and I know the networking layers. I also know the private class A, B and C CIDR blocks, and I've set up a small subnet of my network to be routed automatically over a public VPN in OPNSense. A little 👌. So the idea of an easy networking experience that didn't require skills I didn't have yet sounded awesome.
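Most DevOps folks would reach for a calculator here. As an alternative, here's a small sketch that uses Python's standard `ipaddress` module to do the host-count math and list the three RFC 1918 private blocks. The helper name `usable_hosts` is just for illustration.

```python
import ipaddress

# RFC 1918 private ranges, commonly called the private class A, B and C blocks.
PRIVATE_BLOCKS = {
    "A": ipaddress.ip_network("10.0.0.0/8"),
    "B": ipaddress.ip_network("172.16.0.0/12"),
    "C": ipaddress.ip_network("192.168.0.0/16"),
}


def usable_hosts(cidr: str) -> int:
    """Usable IPv4 hosts: 2**(32 - prefix), minus the network and broadcast addresses."""
    net = ipaddress.ip_network(cidr)
    if net.prefixlen >= 31:
        # /31 point-to-point links (RFC 3021) and /32s have nothing to subtract.
        return net.num_addresses
    return net.num_addresses - 2


if __name__ == "__main__":
    for cidr in ("192.168.1.0/24", "172.16.0.0/20", "10.0.0.0/8"):
        print(cidr, usable_hosts(cidr))
    for name, block in PRIVATE_BLOCKS.items():
        print(f"class {name}: {block} ({block.num_addresses} addresses)")
```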

The first piece of hardware - A WiFi Access Point

My first piece of hardware was a UAP-AC-Lite. I didn't have a controller, but setting it up as a standalone device using the mobile app was easy, and it worked quite well. I was a little put off that so many features were gated behind having a controller: picking a single frequency band, running more than one AP, VLANs and much more. I was mostly testing it out and it was free to me, so it wasn't a big deal, but it definitely couldn't replace my old router without a controller.

The next piece of hardware - A Switch

My next piece of hardware was an XG24, a 24-port 10GbE switch. It arrived and worked beautifully, but again I was put off by the limitations, which were much more material this time: without a controller, it's essentially an unmanaged switch. Crucially, at this point I moved everything over to the switch. I planned on working on one port at a time, draining nodes and the like, so services would stay up virtually the entire time.

The final piece of hardware - A Router

Since it was always the plan, I bought a Dream Machine Special Edition, for two reasons. One, it seemed to be the best choice in the category I was shopping in, "prosumer". Two, a few of my friends already had their own, so I could ask them questions if a quick Google search yielded confusing results.

The problems begin

I looked forward to the router the most. My current OPNSense router only has two 1GbE ports and is pretty timid as far as specs go, and everything I'd heard about the Dream Machine made it out to be everything anyone could need. So when it came, I racked it, plugged it in and connected an SFP+ port on the XG24 to the UDM's SFP+ port.

No lights

Alright, no big deal; you probably have to configure the interfaces. I plugged the 2.5GbE WAN port into a port on the XG24. This was temporary, so I could stage the device and make the cutover with as little disruption as possible. I logged in and started poking around, and it definitely looked slick. First, I wanted to adopt my devices and configure them, but they didn't show up. That made sense: I figured they had to be connected through the switch fabric rather than the WAN. So I connected one of the UDM's LAN ports to a port on the XG24.

The chaos monkey is let loose

The UDM set upon my network at once. It began handing out its own DHCP leases, fighting with my existing router. The problem wasn't immediately apparent, so I left it connected for maybe 30 minutes while I tried to adopt the other pieces of gear. They would show up and then disappear.

I got an alert about a service being down, so I left things alone to check it out. That service tested okay, but my Kubernetes cluster was very sluggish. The Kubernetes masters had ended up with IPs in different subnets, so HAProxy hung on connections depending on whether or not the IP it picked actually worked. I tested more services and realized what was happening. I disconnected the UDM and waited for things to recover. However, devices that had already received new IPs had no reason to request a new DHCP lease yet, and the UDM's default lease time must be at least a couple of hours.
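Recovery drags on because DHCP clients don't go looking for a new server just because the old one disappeared. Under the default timers in RFC 2131, a client holding a lease goes through three stages:

- At T1 (50% of the lease), it tries to renew with the server that granted the lease.
- At T2 (87.5%), it broadcasts a rebind request that any server, including the original router, can answer.
- Only when the lease expires does it start over from scratch.

The sketch below computes those three moments. Servers can override T1 and T2 with DHCP options 58 and 59. The 24-hour lease in the usage lines is a hypothetical value, not a confirmed UDM default. Many clients also start over early on a link change, so treat this as roughly the worst case.

```python
from datetime import datetime, timedelta

# Default timer fractions from RFC 2131 (section 4.4.5); servers may override
# them with option 58 (renewal time, T1) and option 59 (rebinding time, T2).
T1_FRACTION = 0.5
T2_FRACTION = 0.875


def lease_timeline(acquired: datetime, lease_seconds: int) -> dict:
    """When a DHCP client will renew, rebind, and finally give up its lease."""
    lease = timedelta(seconds=lease_seconds)
    return {
        # RENEWING: unicast DHCPREQUEST to the server that granted the lease.
        "renew": acquired + lease * T1_FRACTION,
        # REBINDING: broadcast DHCPREQUEST that any server on the segment may answer.
        "rebind": acquired + lease * T2_FRACTION,
        # Expired: stop using the address and start over with DHCPDISCOVER.
        "expire": acquired + lease,
    }


if __name__ == "__main__":
    got_rogue_lease = datetime(2023, 5, 12, 1, 41)  # hypothetical moment
    for event, when in lease_timeline(got_rogue_lease, 86_400).items():
        print(f"{event:>6}: {when:%Y-%m-%d %H:%M}")
```

In practice, forcing a renewal on the affected hosts (or bouncing their links) short-circuits this timeline.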

Fix, or fail forward?

I fixed things one by one, but after a while it became clear that fixing everything was going to take a long time. If I wanted to use the UDM, it would need to be on the switch where everything else already was. I had already removed my previous switches, since I was about 80% done with the cable management and final setup of the new one.

Fail forward, I guess

Let me take a moment to mention something. My current network was unplanned: everything lived in a single 192.168.1.0/24. MetalLB was configured to hand out a small chunk of 20 IPs to LoadBalancer services. That chunk was squeezed between the addresses used by my devices (often several per device, one for each interface) and the DHCP range. LoadBalancers make port forwarding much easier and are necessary for some of my apps. So before the hardware arrived, I had taken the time to formulate some plans.
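In a flat network like that, the static addresses, the MetalLB pool and the DHCP range all share one /24. The only thing keeping them apart is that their boundaries don't overlap. Here's a small sanity check for that kind of layout using Python's `ipaddress` module. The exact boundaries are hypothetical; only the 20-address pool size comes from the setup described above.

```python
import ipaddress
from itertools import combinations

LAN = ipaddress.ip_network("192.168.1.0/24")

# Hypothetical boundaries: only the 20-address MetalLB pool size is known.
RANGES = {
    "static devices": ("192.168.1.2", "192.168.1.99"),
    "MetalLB pool": ("192.168.1.100", "192.168.1.119"),
    "DHCP range": ("192.168.1.120", "192.168.1.254"),
}


def to_span(first: str, last: str) -> range:
    """Inclusive address range expressed as a range of integers."""
    lo = int(ipaddress.ip_address(first))
    hi = int(ipaddress.ip_address(last))
    return range(lo, hi + 1)


def check_layout(lan, ranges) -> list[str]:
    """Report ranges that are empty, fall outside the LAN, or overlap each other."""
    problems = []
    spans = {name: to_span(*bounds) for name, bounds in ranges.items()}
    for name, span in spans.items():
        if not span:
            problems.append(f"{name} is empty")
            continue
        first = ipaddress.ip_address(span[0])
        last = ipaddress.ip_address(span[-1])
        if first not in lan or last not in lan:
            problems.append(f"{name} is outside {lan}")
    # Two half-open integer ranges overlap if each starts before the other ends.
    for (a, span_a), (b, span_b) in combinations(spans.items(), 2):
        if span_a.start < span_b.stop and span_b.start < span_a.stop:
            problems.append(f"{a} overlaps {b}")
    return problems


if __name__ == "__main__":
    print(check_layout(LAN, RANGES) or "no conflicts")
    print(len(to_span(*RANGES["MetalLB pool"])), "MetalLB addresses")
```

A check like this can't catch a second DHCP server on the same segment handing out addresses from its own idea of the range, which is exactly the failure described earlier.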

[Image: Ubiquiti VLAN-IP Range Plan]

I had planned out IP ranges and VLANs for my new network. I knew that even if the UDM didn't work out, OPNSense could do what I needed, so I was committed. Since I was already down, I decided to just push forward. I figured it might take a day to configure everything, so I'd be offline for maybe 12 hours max while I set things up... Since I'm writing a blog post about this, I imagine you know that's not what happened.

Pain. Just pain.

I set up my planned VLANs (side note: I did end up making too many, but even so I only used half the IP space of the private class B CIDR block, 172.16.0.0/12). Then I set up wireless, and I really got moving, because when my apps are down I usually hear about it from multiple people who use them. Then the devices kept going offline. I realized my current router was interfering with the UDM as well. It was a constant struggle to configure devices before they got messed up and I had to readopt them. Services went up and down as I essentially did this:
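Here's a rough illustration of carving VLAN subnets out of that space. The sketch below splits 172.16.0.0/12 in half and hands out one /16 per VLAN from the first half. The VLAN IDs, names and /16 size are all hypothetical; the actual plan isn't reproduced in text here.

```python
import ipaddress

PRIVATE_B = ipaddress.ip_network("172.16.0.0/12")

# Split the /12 into two /13s; only about half the space ended up being used.
used_half, spare_half = PRIVATE_B.subnets(prefixlen_diff=1)

# Hypothetical VLAN IDs and names, for illustration only.
VLANS = {10: "servers", 20: "kubernetes", 30: "trusted", 40: "wifi", 50: "iot"}


def carve(block, vlans, new_prefix=16):
    """Give each VLAN (in VLAN-ID order) its own /new_prefix subnet of block."""
    subnets = list(block.subnets(new_prefix=new_prefix))
    if len(vlans) > len(subnets):
        raise ValueError(f"{block} only fits {len(subnets)} /{new_prefix} subnets")
    return {
        vid: (name, subnets[i])
        for i, (vid, name) in enumerate(sorted(vlans.items()))
    }


if __name__ == "__main__":
    print("used:", used_half, "spare:", spare_half)
    for vid, (name, net) in carve(used_half, VLANS).items():
        print(f"VLAN {vid:>3} {name:<11} {net}")
```

A /13 holds eight /16s, so this layout leaves room to grow within the used half before touching the spare one.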

[Image: plugging-leaks]

Then, the one feature it didn't have

Ubiquiti Dream Machines cannot run BGP natively. It's possible to SSH into the device and run a secondary Docker container that provides BGP, but that was not an acceptable solution for me. The warranty states in quite a few ways that it's void if the UDM is modified or altered. I did not want to void it just to use a routing protocol (on a router) whose first RFC came out in 1989. I bought the best "Gateway Console" they had; obviously, if it didn't do BGP, none of them did. What I learned much later is that Ubiquiti has a whole separate product line (UISP) with devices as inexpensive as $99 that can do BGP. Okay, but let's take stock: I bought a 24-port 10GbE switch. My 9 Kubernetes nodes are already 2.5GbE, and I plan on slowly upgrading them to 10GbE over the next year or more. So which device in the ecosystem could I buy without hobbling any of that? The EdgeRouter Infinity, currently $1,599 (and sold out to boot). That's $300 more than my switch, just to do one thing the UDM couldn't. Laughable.

So if I don't have a UDM, that means...

I don't have a managed switch. At this point, I was venting my frustrations to a friend who already used Ubiquiti, and he casually mentioned that you can run the controller software in a Docker container. I looked into it and yep, another Ubiquiti hack for functionality they gated behind dollar signs. Run the container and you can manage the switch, set up APs and do essentially everything the $499 UDM does, except the physical switching and SFP ports. (I must make clear that I never got those SFP ports working, even after trying two cables.) Cool beans. I guess I have to run it, but I hate it because there's now a dependency loop:

- I run the container on Kubernetes.
- Kubernetes runs on nodes connected to the switch.
- The switch is managed by the container running on those nodes.

If I need to re-bootstrap the network, I'd have to run the controller somewhere else, get everything configured, then start it back up on Kubernetes and adopt the devices. cat_okay
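That dependency loop is a literal cycle in the graph of what needs what in order to come up. Here's a minimal sketch that finds it with a depth-first search. The node names are hypothetical labels for the pieces described above, and an edge means "needs this to be working first".

```python
# Hypothetical dependency graph: "A": ["B"] means A needs B working first.
DEPENDS_ON = {
    "unifi-controller": ["kubernetes"],
    "kubernetes": ["nodes"],
    "nodes": ["switch"],
    "switch": ["unifi-controller"],  # switch config comes from the controller
}


def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    Classic DFS with three colours: WHITE = unvisited, GREY = on the current
    path, BLACK = fully explored. Reaching a GREY node means we found a loop.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GREY:
                return path[path.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(" -> ".join(find_cycle(DEPENDS_ON) or ["no cycle"]))
```

Removing any single edge breaks the loop. For example, running the controller temporarily on a machine that doesn't depend on the switch is the re-bootstrap workaround described above.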

Save me, OPNSense, you're my only hope

Beyond a vague feeling that I'd seen forum posts about OPNSense being able to do BGP, I hadn't researched it at all, because I naively assumed the UDM would work for me. So I set off to figure out how to make that work, while also looking into how to set up VLANs. I wish I could say it was easy, but I was completely redoing three networks: Kubernetes, my local network and my WiFi. On top of that, I was setting up BGP between a Kubernetes cluster and OPNSense, and I couldn't find much information at all about that last part. I struggled for a while before eventually discovering it was actually really easy. Hindsight, 20/20, etc.
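Once services, nodes and clients live on different VLANs, the router has to know which node should receive traffic for each LoadBalancer IP. That is one common reason BGP matters in a setup like this. With BGP, the cluster advertises service IPs as routes (MetalLB in BGP mode, for example, typically advertises each one as a /32). The router then forwards by longest-prefix match. The sketch below models only that routing decision, not the BGP session itself. The addresses, next hops and the MetalLB-in-BGP-mode setup are all assumptions for illustration.

```python
import ipaddress

# Hypothetical router table. The /32s stand in for LoadBalancer IPs that a
# cluster might advertise over BGP; the next hops are made-up node addresses.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "wan-gateway"),
    (ipaddress.ip_network("172.17.0.0/16"), "connected: kubernetes VLAN"),
    (ipaddress.ip_network("172.30.0.10/32"), "172.17.0.21"),  # learned via BGP
    (ipaddress.ip_network("172.30.0.11/32"), "172.17.0.22"),  # learned via BGP
]


def next_hop(dst: str, routes=ROUTES) -> str:
    """Longest-prefix match: the most specific route containing dst wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        raise LookupError(f"no route to {dst}")
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


if __name__ == "__main__":
    for dst in ("172.30.0.10", "172.17.4.2", "1.1.1.1"):
        print(dst, "->", next_hop(dst))
```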

What's the damage?

I got the UDM on May 11th. My first outage started at 2023-05-12 01:41:31, and my downtime finally ended at 2023-05-18 00:32:28 (UTC), a span of 5 days, 22 hours, 50 minutes and 57 seconds. Of that time, I was down for 26 hours and 32 minutes, so my uptime was actually 80.5% during this period (for this one service). I don't know about the others, because I set a maintenance window for all alerts from May 11 at 9:46 PM to May 17 at 6:00 PM. As it turned out, that was enough time for me to get everything almost completely working. Overall, though, most of what I cared about didn't work for those almost 6 days.
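If you want to check the arithmetic, here's how the window between the two UTC timestamps works out, using a trivial sketch with Python's `datetime`:

```python
from datetime import datetime, timezone

start = datetime(2023, 5, 12, 1, 41, 31, tzinfo=timezone.utc)
end = datetime(2023, 5, 18, 0, 32, 28, tzinfo=timezone.utc)
window = end - start


def fmt(delta):
    """Render a timedelta as days, hours, minutes and seconds."""
    hours, rem = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{delta.days} days, {hours} hours, {minutes} minutes, {seconds} seconds"


if __name__ == "__main__":
    print(fmt(window), f"({window.total_seconds():.0f} s)")
```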

My mistakes

My first mistake was not doing enough research. Remembering my NAS experience, I wanted to just dive in and hope for the best. I should have done much, much more research, even though I'd already decided on Ubiquiti. If I had, I would have found out about the Docker controller image before buying the UDM (although I would have decided not to use it, because it is an unofficial hack). I also would have had a better idea of how serious the vendor lock-in is; I originally only knew about the AP limitations. And if I had looked into BGP on the Dream Machine specifically, I would have learned it couldn't do it natively. I would have fallen back to my current configuration, the Docker controller and OPNSense, with so much less downtime.

Final thoughts

Vendor lock-in should be illegal everywhere. It's one thing to make a new standard where one never existed before, but there was no technical requirement for Ubiquiti to build such a high walled garden. I'm both angered and thankful that I can run the controller software and use all of the hardware I paid for, but it's unconscionable of Ubiquiti to create a situation like this. Overall, I'm happy I was able to accomplish what I wanted, which turned out to be a huge feat (bigger than I planned for, of course lol), but I still have a very bad taste in my mouth. Ubiquiti's hardware is the real deal, there's no question about that. It's solid, thoughtfully designed, and priced attractively compared to enterprise equipment. But...

On the other hand, Ubiquiti is no different than Tesla, Apple, BMW and many other disgustingly greedy companies that design and control their products so that the user has less power and so that they make as much money as possible. This is not okay. This must be stopped.

🔥 Down with the system 🔥

We're waking up in beds bought on layaway and leaving houses we pay for but do not own. We drive cars we've leased for a short time (and must give up or pay more to be allowed to keep). We drive for companies, not as employees but as gig workers, with no health insurance, terrible pay and predatory arrangements. Those companies don't make a profit from their main revenue source; they use it to monopolize markets and force any company not operating at a loss to close. Then they buy back stock to inflate their stock prices and cut the number of jobs. A new store opens, then closes some time later because it's "underperforming".

Ubiquiti is just a snowflake on the slope; no single snowflake blames itself for the avalanche, and yet it happens anyway. The people at Ubiquiti making these terrible, anti-consumer decisions are bastards. I hope they all win the lottery and quit their jobs so they fuck off from running Ubiquiti 🖕