Below are excerpts from the workshop talk Dr. Welling gave on Ingredients for Bayesian, Privacy Preserving, Distributed Learning, where the professor shares his views on FL, the importance of distributed learning, and the Bayesian aspects of the domain.
Why do we need distributed learning in the first place?
“The question can be separated into two parts. Why do we need distributed or federated inferencing? Maybe that is easier to answer. We need it because of reliability. If you are in a self-driving car, you clearly don’t want to rely on a bad connection to the cloud in order to figure out whether you should brake. Then there is latency. If you have your virtual reality glasses on and you have just a little bit of latency, you’re not going to have a very good user experience. And then there’s, of course, privacy: you don’t want your data to get off your device. Also compute, maybe, because it’s close to where you are, and personalization: you want models to be suited for you.
Why is distributed learning so important?
It took a little bit more thinking why distributed learning is so important, especially within a company: how are you going to sell something like that? Privacy is the biggest factor here. There are many companies and factories that simply don’t want their data to go off site; they don’t want to have it go to the cloud. And so you want to do your training in-house. But there’s also bandwidth. You know, moving around data is actually very expensive and there’s a lot of it. So it’s much better to keep the data where it is and move the computation to the data. And also, personalization plays a role.
There are many challenges when you want to do this. The data could be extremely heterogeneous, so you could have a completely different distribution on one device than you have on another device. Also, the data sizes could be very different. One device could contain 10 times more data than another device. And the compute could be heterogeneous: you could have small devices with a little bit of compute that you can use only now and then, or that you can’t use because the battery’s down. There are other, bigger servers that you also want to have in your distribution of compute devices.
The bandwidth is limited, so you don’t want to send huge amounts of even parameters. Let’s say we don’t move data, but we move parameters. Even then you don’t want to move loads and loads of parameters over the channel. So you may want to quantize them. Here, I believe Bayesian thinking is going to be very helpful. And again, the data needs to be private, so you wouldn’t want to send parameters that contain a lot of information about the data.
What is the solution?
So first of all, of course, we’re going to move model parameters, we’re not going to move data. We have data stored at places and we’re going to move the algorithm to that data. So basically you get your learning update, maybe privatized, and then you move it back to your central place where you’re going to update it. And of course, bandwidth is another challenge that you have to solve.
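The scheme described above can be sketched in a few lines. This is only an illustration, not the method from the talk: the functions `local_update` and `federated_round`, the toy data, the learning rate, and the use of a plain average of the client updates are all assumptions chosen to keep the example short. Each client computes an update from its own data, and only that update (never the data) reaches the central place.

```python
from typing import List


def local_update(w: float, data: List[float]) -> float:
    """Gradient of the loss 0.5 * (w - x)^2, averaged over one client's data.

    This runs where the data lives; only the returned number leaves the device.
    """
    return sum(w - x for x in data) / len(data)


def federated_round(w: float, client_data: List[List[float]], lr: float = 0.5) -> float:
    """One round: collect one update per client, then update centrally."""
    updates = [local_update(w, data) for data in client_data]
    # Assumption: a plain average of the updates, used only for brevity.
    avg_update = sum(updates) / len(updates)
    return w - lr * avg_update


# Hypothetical toy data: three clients with different amounts of data.
clients = [[1.0, 1.2, 0.8], [2.0, 2.2], [3.0]]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
```

With these toy numbers the central parameter settles at the average of the three client means. A privatized variant would perturb each update before it is sent.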
We have these heterogeneous data sources and we have a lot of variability in the speed at which we can sync these updates. Here I think the Bayesian paradigm is going to come in handy because, for instance, if you have been running an update on a very large dataset, you can shrink the posterior over your parameters to a very narrow posterior. Whereas on another device, you might have much less data, and you might have a very wide posterior distribution for those parameters. Now, how to combine that? You shouldn’t average them, it’s silly. You should do a proper posterior update where the one that has a narrow, peaked posterior has a lot more weight than the one with a very wide posterior. Also, uncertainty estimates are important in that respect.
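As a minimal sketch of why a posterior update differs from averaging, assume each device reports a one-dimensional Gaussian posterior as a (mean, variance) pair, that the devices saw disjoint data, and that the prior is flat so it is not counted twice. None of these assumptions come from the talk, and the function name `combine_gaussian_posteriors` is hypothetical. Under them, multiplying the Gaussian densities gives a precision-weighted mean.

```python
from typing import List, Tuple


def combine_gaussian_posteriors(
    posteriors: List[Tuple[float, float]]
) -> Tuple[float, float]:
    """Combine (mean, variance) pairs by multiplying Gaussian densities.

    Precisions (1 / variance) add up, and each mean is weighted by its
    precision, so a peaked posterior counts for more than a wide one.
    """
    total_precision = sum(1.0 / var for _, var in posteriors)
    combined_mean = sum(mean / var for mean, var in posteriors) / total_precision
    return combined_mean, 1.0 / total_precision


# Hypothetical numbers: one device with lots of data, one with little.
peaked = (1.0, 0.01)  # narrow posterior
wide = (3.0, 1.0)     # wide posterior

combined_mean, combined_var = combine_gaussian_posteriors([peaked, wide])
naive_mean = (peaked[0] + wide[0]) / 2.0
```

Here the combined mean stays close to the peaked posterior's mean, while the naive average lands halfway between the two devices regardless of how certain each one is.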
The other thing is that with a Bayesian update, if you have a very wide posterior distribution, then you know that parameter is not going to be very important for making predictions. And so if you’re going to send that parameter over a channel, you will have to quantize it, especially to save bandwidth. The ones that are very uncertain anyway you can quantize at a very coarse level, and the ones which have a very peaked posterior need to be encoded very precisely, and so you need much higher resolution for those. So also there, the Bayesian paradigm is going to be helpful.
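One simple way to picture this, offered as an assumption rather than as the scheme from the talk, is to set the quantization step of each parameter proportional to its posterior standard deviation. The function `quantize_by_uncertainty`, the fixed `value_range`, and the `step_per_std` factor below are all hypothetical.

```python
import math
from typing import Tuple


def quantize_by_uncertainty(
    mean: float,
    std: float,
    value_range: float = 8.0,
    step_per_std: float = 1.0,
) -> Tuple[float, int]:
    """Round `mean` to a grid whose step grows with the posterior std.

    Returns the quantized value and the number of bits needed to index
    the grid over an assumed fixed range of parameter values.
    """
    step = step_per_std * std
    levels = max(2, math.ceil(value_range / step) + 1)
    bits = math.ceil(math.log2(levels))
    quantized = round(mean / step) * step
    return quantized, bits


# Same posterior mean, very different certainty (hypothetical numbers).
precise_value, precise_bits = quantize_by_uncertainty(0.7312, std=0.01)
coarse_value, coarse_bits = quantize_by_uncertainty(0.7312, std=1.0)
```

The peaked parameter ends up on a fine grid and costs more bits, while the uncertain one is rounded coarsely and costs only a few.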
In terms of privacy, there is this interesting result that if you have an uncertain parameter and you draw a sample from the posterior of that parameter, then that single sample is more private than providing the whole distribution. There are results that show that you can get a certain level of differential privacy by just drawing a single sample from that posterior distribution. So effectively you’re adding noise to your parameter, making it more private. Again, Bayesian thinking is synergistic with this sort of Bayesian federated learning scenario.
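The mechanics of releasing a single posterior sample can be sketched as follows, assuming a one-dimensional Gaussian posterior. The function name `release_posterior_sample` and the numbers are hypothetical, and the sketch makes no claim about the level of differential privacy achieved, which depends on conditions not covered here.

```python
import random


def release_posterior_sample(mean: float, std: float, rng: random.Random) -> float:
    """Send one draw from the posterior instead of its mean and std.

    The wider the posterior, the more noise the released value carries.
    """
    return rng.gauss(mean, std)


rng = random.Random(0)  # seeded only to make the sketch repeatable

# Hypothetical posterior held on a device.
posterior_mean, posterior_std = 0.5, 0.2
released = release_posterior_sample(posterior_mean, posterior_std, rng)

# Over many hypothetical releases the draws scatter around the mean.
draws = [release_posterior_sample(posterior_mean, posterior_std, rng) for _ in range(10000)]
average_draw = sum(draws) / len(draws)
```

The receiver sees only `released`, a noisy version of the parameter, rather than the full distribution.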
What are the key takeaways?
We can do MCMC (Markov chain Monte Carlo) and variational-based distributed learning. And there are advantages to doing that, because it makes the updates more principled and you can combine things where one of them might be based on a lot more data than another one.
Then we have private and Bayesian, to privatize the updates of a variational Bayesian model. Many people have worked on many other of these intersections: we have deep learning models which have been privatized, and we have quantization, which is important if you want to send your parameters over a noisy channel. And it’s nice because the more you quantize, the more private things become. You can compute the level of quantization from your Bayesian posterior, so all these things are very nicely tied together.
People have looked at the relation between quantized models and Bayesian models: how can you use Bayesian estimates to quantize better? People have looked at quantized versus deep: to make your deep neural network run faster on a mobile phone, you want to quantize it. People have looked at distributed versus deep, that is, distributed deep learning. So many of these intersections have actually been researched, but it hasn’t been put together. This is what I want to call for. We can try to put these things together, and at the core of all of this is Bayesian thinking; we can use it to execute better on this program.”