An approach called federated learning trains machine learning models on devices such as smartphones and laptops, rather than transferring private data to central servers.
The largest benchmarking data set to date for a machine learning technique designed with data privacy in mind is now available open source.
“Training in situ on data where it is generated allows us to train on larger real-world data,” explains Fan Lai, a doctoral student in computer science and engineering at the University of Michigan, who presents the FedScale training environment at the International Conference on Machine Learning this week. A paper on the work is available on arXiv.
“This also allows us to mitigate privacy risks and high communication and storage costs associated with collecting the raw data from end-user devices to the cloud,” said Lai.
Still a new technology, federated learning relies on an algorithm that serves as a centralized coordinator. It delivers the model to the devices, trains it locally on the relevant user data, and then brings back each partially trained model and uses them to generate a final global model.
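The coordinator loop described above can be sketched in a few lines of Python. This is a minimal illustration of federated averaging on a toy linear model, not FedScale's code: the function names (`local_train`, `federated_average`, `run_rounds`, `make_clients`), the synthetic data, and the hyperparameters are all assumptions made for the example.

```python
import random

def local_train(weights, data, lr=0.05, epochs=5):
    """Run a few epochs of SGD on one device's private data and return the updated model."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error of the linear model
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def federated_average(updates):
    """Combine client models, weighting each by its number of local examples."""
    total = sum(n for _, n in updates)
    w = sum(wts[0] * n for wts, n in updates) / total
    b = sum(wts[1] * n for wts, n in updates) / total
    return (w, b)

def run_rounds(clients, rounds=50, clients_per_round=3, seed=0):
    """Coordinator: send the model out, train locally, collect and average the results."""
    rng = random.Random(seed)
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        selected = rng.sample(clients, clients_per_round)
        # Only model weights and example counts leave a device, never the raw data.
        updates = [(local_train(global_weights, data), len(data)) for data in selected]
        global_weights = federated_average(updates)
    return global_weights

def make_clients(num_clients=10, seed=1):
    """Hypothetical devices, each holding private samples of y = 2x + 1."""
    rng = random.Random(seed)
    clients = []
    for _ in range(num_clients):
        n = rng.randint(5, 20)
        xs = [rng.uniform(-1, 1) for _ in range(n)]
        clients.append([(x, 2.0 * x + 1.0) for x in xs])
    return clients

if __name__ == "__main__":
    print(run_rounds(make_clients()))
```

Weighting each update by its example count keeps devices with very little data from dominating the global model. Production systems add steps this sketch leaves out, such as secure aggregation and handling of devices that drop out mid-round.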
For a number of applications, this workflow provides additional protection of data privacy and security. Messaging apps, health data, personal documents, and other sensitive but useful training materials can improve models without fear of data center vulnerabilities.
In addition to protecting privacy, federated learning can make model training more resource efficient by reducing and sometimes eliminating the transfer of big data, but it faces several challenges before it can be used on a large scale. Multi-device training means there are no guarantees about available computing resources, and uncertainties such as user connection speeds and device specifications lead to a pool of data options of varying quality.
“Federated learning is growing rapidly as a field of research,” said Mosharaf Chowdhury, an associate professor of computer science and engineering. “But most of the work uses a handful of data sets, which are very small and don’t represent many aspects of federated learning.”
And this is where FedScale comes in. The platform can simulate the behavior of millions of user devices on a few GPUs and CPUs, allowing machine learning model developers to explore how their federated learning program will perform without the need for large-scale implementation. It serves a variety of popular learning tasks, including image classification, object detection, language modeling, speech recognition, and machine translation.
“Anything that uses machine learning on end-user data can be federated,” Chowdhury says. “Applications need to be able to learn and improve how they deliver their services without really capturing everything their users do.”
The authors specify several conditions to consider to realistically mimic the federated learning experience: data heterogeneity, device heterogeneity, connectivity heterogeneity, and availability conditions, all with the ability to work at multiple scales across a wide variety of machine learning tasks. FedScale’s datasets are the largest released to date that specifically address these challenges in federated learning, Chowdhury said.
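To make these conditions concrete, here is a rough sketch of how a simulator might model them. It is not FedScale's API: the `SimulatedDevice` fields, the value ranges, and the assumption that a round lasts as long as its slowest participant are all hypothetical choices for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class SimulatedDevice:
    samples_per_sec: float  # device heterogeneity: local training throughput
    mbit_per_sec: float     # connectivity heterogeneity: network bandwidth
    availability: float     # probability the device is online in a given round
    num_samples: int        # data heterogeneity: size of the local data set

def make_devices(n, rng):
    """Draw n devices with randomly varied (assumed) hardware, network, and data sizes."""
    return [
        SimulatedDevice(
            samples_per_sec=rng.uniform(20, 200),
            mbit_per_sec=rng.uniform(1, 50),
            availability=rng.uniform(0.1, 0.9),
            num_samples=rng.randint(10, 1000),
        )
        for _ in range(n)
    ]

def device_round_time(dev, model_mbit, epochs=1):
    """Seconds for one device to download the model, train locally, and upload it."""
    compute = epochs * dev.num_samples / dev.samples_per_sec
    transfer = 2 * model_mbit / dev.mbit_per_sec  # download plus upload
    return compute + transfer

def simulate_round(devices, k, model_mbit, rng):
    """Pick up to k online devices; the round lasts as long as the slowest one."""
    online = [d for d in devices if rng.random() < d.availability]
    participants = rng.sample(online, min(k, len(online)))
    if not participants:
        return [], 0.0
    return participants, max(device_round_time(d, model_mbit) for d in participants)

if __name__ == "__main__":
    rng = random.Random(0)
    devices = make_devices(1000, rng)
    participants, seconds = simulate_round(devices, k=10, model_mbit=80.0, rng=rng)
    print(len(participants), round(seconds, 1))
```

Because only timing and availability are modeled here, many devices can be represented cheaply. A benchmark like the one described would pair such device profiles with real training on realistic per-device data.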
“In recent years we have collected dozens of data sets. The raw data is usually publicly available, but difficult to use because it comes in different sources and formats,” said Lai. “We’re also continuously working to support large-scale deployment on the device.”
The FedScale team has also launched a leaderboard to promote the most successful federated learning solutions trained on the university’s system.
The National Science Foundation and Cisco supported the work.
Source: Zachary Champion for University of Michigan