The Batshit Crazy Story Of The Day Elon Musk Decided To Personally Rip Servers Out Of A Sacramento Data Center

stopthatgirl7 · 10 months ago

@redcalcium · 10 months ago

Remember he had just gutted Twitter before pulling this stunt, so the estimate of a few months to move the servers might have been accurate if the entire department that handles infra had been gutted, with only a skeleton crew left.

@flathead@lemm.ee · 10 months ago

Well, yes. If nobody left has a clue then it’s going to take a little longer but you could physically move just about anything in a few weeks with the right crew, even if you had to bring them in cold. An open checkbook solves a lot of logistical problems.

The proof is more or less self evident. If this idiot and his cousin were able to pull it off without breaking anything critical, then it stands to reason that a properly managed team would have been able to do it in a more orderly way in a few days.

I get that everyone wants to paint this as completely irresponsible, but apart from the fact that it was done so haphazardly in the dead of night and gate crashing the data center security (nobody is going to refuse access to the CEO), there’s really nothing here that’s completely out of order. Locking the gear in the trucks is pretty standard for intact secure data transport. The real mistake is the infra manager sandbagging the move estimate - or not understanding how to plan and execute it.

@redcalcium · edit-2 10 months ago

Physically moving them is one thing. Reassigning each server into the new data center network is a whole other thing. It won't be as simple as connecting the power and network cables. From the post, the rack density is different, so you'll probably have to change each server's name to match its new rack position. Then the hostnames and subnets probably change in the new data center, so now you'll have to map everything again (the hard-coded references to Sacramento mentioned by Musk). The 100MM contract means they had a lot of servers to account for. This is the real headache of the migration and probably the reason Twitter kept having random outages for months after this stunt. They probably took shortcuts and couldn't bring all those servers online in time to handle traffic bursts, which led to more of Musk's shenanigans (e.g. forbidding visitors from viewing tweets unless they're logged in, to limit server load).
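To give a rough idea of what that remapping looks like, here is a minimal sketch. Everything in it is made up for illustration: the `site-rNNN-sNN` naming scheme, the `dc2` site code, the `10.20.0.0/24` subnet and the `plan_renames` helper are all assumptions, not anything known about Twitter's actual setup. It only shows the bookkeeping: every old hostname needs a new rack/slot name and a new address, and something has to keep the old-to-new mapping so hard-coded references can be found and fixed.

```python
import ipaddress


def plan_renames(old_hosts, servers_per_new_rack, new_site, new_subnet):
    """Map each old hostname to a (new hostname, new IP) pair.

    Hypothetical scheme: servers are placed in sorted order, filling one
    rack at a time, and addresses are handed out sequentially from the
    new subnet. Raises StopIteration if the subnet is too small.
    """
    addresses = iter(ipaddress.ip_network(new_subnet).hosts())
    plan = {}
    for index, old_name in enumerate(sorted(old_hosts)):
        # A different rack density means a different rack/slot for each box.
        rack, slot = divmod(index, servers_per_new_rack)
        new_name = f"{new_site}-r{rack + 1:03d}-s{slot + 1:02d}"
        plan[old_name] = (new_name, str(next(addresses)))
    return plan


# Three servers from one old rack, but the new racks only hold two each.
plan = plan_renames(
    ["sac-r001-u01", "sac-r001-u02", "sac-r001-u03"],
    servers_per_new_rack=2,
    new_site="dc2",
    new_subnet="10.20.0.0/24",
)
for old, (new, ip) in plan.items():
    print(f"{old} -> {new} ({ip})")
```

The mapping itself is the easy part. The pain is that every config file, monitoring rule and service that still points at the old names has to be tracked down and updated from it.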

Edit: the more I think about this, the more my head hurts. If any infra people are reading this, what would you do if you suddenly received truckloads of servers yanked from another data center and were told to bring them online again ASAP, while more than half of your team had been laid off? Seriously, what steps would you take to bring all these servers online again? Oh, and those servers were probably not gracefully shut down and just had their power cables yoinked out.

@flathead@lemm.ee · 10 months ago

Oh by no means am I suggesting it was reasonable to do this. Musk would be a fucking nightmare as an employer. As a customer probably not much better but you know what they say about a fool and his money. This fool would be a great customer as long as you had a good lawyer to write the contracts.

I do suspect that some of the details of this story are somewhat embellished though, if only for the sheer joy of it, which I’m all for. It’s a great story. I don’t believe, for instance, that they could possibly have moved 5000 racks - or even 5000 servers - as I think the story was intimating. It sounds like they filled a few semis, which would be a small fraction of the systems. Maybe this was just the last of it that was too hard to move earlier. As for the rack configs at the other end, they would need power and services and an empty space if they are just rolling the stuff in. That’s only a few weeks of lead time in a properly run facility.

If they had their reservations set up correctly they wouldn’t need to change hostnames or even addresses, just wheel in the racks, brace and connect them. Ideally stuff would be shut down gracefully, but it shouldn’t really matter if they just pulled the plug. The software should be resilient enough to restart OK. Again, no idea if they had anything thought out, probably not, given the way it was done. But I have seen a big tech co move several rows this way when they basically couldn’t be bothered figuring out how to logically migrate them. Of course they weren’t doing it with a coked-up CEO at 2am on Christmas Eve, but it wasn’t as difficult as you might imagine. But yeah, not 5000 racks at once. Not even close to possible.