Few Labels Learning

Ada:

Hi, it’s me again, Ada Lovelace. As namesake of the ADA Lovelace Center for Analytics, Data and Applications, I’m keen to know more about the work being done at this nexus of research and industry. I’ve already discovered that artificial intelligence is the overarching theme, with researchers developing, refining and applying AI methods and techniques. I’m also aware that there are quite a few of these out there. So to get a better overview, today I’m talking to Jann. He’s one of the experts working at the ADA Lovelace Center in the Few Labels Learning competence group. Jann, could you please explain your method to us?

 

Jann:

Hello Ada! I’m delighted that you’re interested in our competence group. I should first point out that our competence group doesn’t work on just one method. Instead, we work on a whole collection of methods. It’s just that these are all used to solve the same problem. And the problem that we want to solve is this: machine learning, more specifically deep learning, can be applied in all kinds of domains, and successfully training machine learning models across all domains hinges on having a wealth of annotated training data. What exactly does “annotated data” mean? An annotated training data point is essentially made up of two parts: the data input and the corresponding data output, also known as the label or annotation. I’ll give you a brief example, Ada. Imagine we want to train a computer vision system so that it can look at pictures of us and determine which are pictures of me and which are pictures of you. In this scenario, the input for a data point would be the image – say, a photo of me – and the output, the label, would be whether it’s a photo of Jann or of Ada. But now let’s imagine that we’re working in the medical field or in industrial production. Data annotation then becomes much more difficult because it takes highly qualified people, experts in their field, to complete this annotation as an essential stage in preparation for the actual model training.
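
To make this concrete in code, here is a minimal Python sketch of what labeled and unlabeled data points look like in Jann’s example. The file names and labels are purely illustrative.

```python
# A minimal illustration of annotated vs. unannotated data points.
# File names and labels are purely illustrative stand-ins.

# An annotated data point pairs an input (here, an image) with a label.
labeled_data = [
    ("photo_001.jpg", "Jann"),
    ("photo_002.jpg", "Ada"),
    ("photo_003.jpg", "Jann"),
]

# Unannotated data points consist of the input alone - in few labels
# settings, this is usually by far the larger share of the data.
unlabeled_data = ["photo_004.jpg", "photo_005.jpg", "photo_006.jpg"]
```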

And to come full circle, our Few Labels Learning competence group examines methods for training machine learning models, even in situations that provide very little annotated training data and a lot of unannotated training data. In our example, we would have the images but not the information telling us whether they were of Jann or Ada. To put it into technical terms, we explore semi-supervised learning and transfer learning, and we’ve got projects on few-shot learning. Then there’s active learning, which basically involves integrating human feedback into the model training process. And when it comes to meta learning, our work also occasionally overlaps with that done by other competence groups.
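
As a flavor of one of these methods, the sketch below shows pseudo-labeling, a common textbook form of semi-supervised learning: a model trained on the few labeled points assigns provisional labels to unlabeled points it is confident about, then is retrained on both. This is a generic illustration, not the competence group’s specific method; the data here is synthetic.

```python
# A minimal pseudo-labeling sketch (semi-supervised learning).
# Generic textbook illustration only. Requires numpy and scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: pretend only 20 of 1000 points come with labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:20], y[:20]
X_unlabeled = X[20:]

# 1. Train an initial model on the few labeled points.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Predict on the unlabeled pool and keep only confident predictions
#    as "pseudo-labels".
probs = model.predict_proba(X_unlabeled)
confident = probs.max(axis=1) > 0.95
pseudo_labels = probs.argmax(axis=1)[confident]

# 3. Retrain on the labeled data plus the pseudo-labeled data.
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)
```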

 

Ada:

That all sounds really interesting. So it’s less about developing methods that simplify the annotation process, and more about training models that can work with very little annotated data – in other words, “few labels data.” How can you use these methods in practice?

 

Jann:

Essentially, you can apply methods from few labels learning, also known as limited labels learning, in all kinds of domains, including text, images, video data and X-rays, but also time series or sensor data. As is often the case with deep learning and machine learning, research in the field focuses on image data, though time series analysis is now also generating more and more research output. Our own competence group specializes in analyzing image data, time series data and X-ray technology data.

 

Ada:

That gives me a general idea of what these methods can do. But at Fraunhofer in general, and at the ADA Lovelace Center in particular, it’s about transferring these methods into application. So can you give me some concrete examples of projects or applications?

 

Jann:

We’re working on an exciting project in the area of medical imaging. The problem we’re trying to solve there is how to tell different cell types apart when looking at a sample of human tissue under a microscope. This is very useful in histopathology, and we’re looking into how we can help medical experts working in this field. However, this particular problem is rather tricky. I should point out that we already have lots of annotated training data collected using a special scanner, and this wealth of training data makes it possible to train powerful machine learning models. But now we also want to use this model on new scanners, for instance in other hospitals. The problem there is that rebuilding such a large dataset for every new scanner would be very expensive, so the aim is to generalize the model to new scanners using as little annotated data as possible. The technical term for this is “domain adaptation.” We do it by using meta-learning methods and, more specifically, methods from the field of few-shot learning. Another example, and another of our focus areas, is the use of sensor data, by which I mean time series. This project is about positioning – we want to see whether sensor data can still be used to determine the position of objects in a space despite potential interference.
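
The project itself uses meta-learning and few-shot learning methods. As a much simpler illustration of the underlying idea – adapting a model to a new scanner with only a handful of labels – the hypothetical PyTorch sketch below freezes a pretrained backbone and fine-tunes only the classification head. The model architecture, class count and data are placeholder assumptions, not details from the project.

```python
# Hypothetical sketch of a very simple adaptation strategy: reuse a model
# trained on the data-rich source scanner and fine-tune only its final
# layer on a few labeled samples from the new scanner. Requires PyTorch.
import torch
import torch.nn as nn

# Placeholder: a model assumed to be pretrained on abundant source data.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 128), nn.ReLU(),   # "backbone" feature layers
    nn.Linear(128, 5),                    # head: 5 illustrative cell types
)

# Freeze the backbone; adapt only the final classification layer.
for param in model[:-1].parameters():
    param.requires_grad = False

# A few labeled images from the new scanner (random stand-ins here).
x_new = torch.randn(10, 1, 64, 64)        # 10 labeled samples
y_new = torch.randint(0, 5, (10,))        # their cell-type labels

optimizer = torch.optim.Adam(model[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):                       # short fine-tuning loop
    optimizer.zero_grad()
    loss = loss_fn(model(x_new), y_new)
    loss.backward()
    optimizer.step()
```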

 

Ada:

That all sounds really exciting. I also gather that you’re collaborating with the Development Center X-ray Technology EZRT at Fraunhofer IIS. Is that right? What projects are you working on there?

 

Jann:

Yes, that’s right, Ada, and I’m glad you brought that up. At the moment we’re collaborating with the EZRT on two projects. The first is very exciting and also maybe a little odd – it’s about the segmentation of car parts. We put an entire car through an X-ray scanner and see if we can segment the various parts from each other in the 3D images we get. What makes the project so exciting is that a car is quite big, which means there are lots of parts. Since it’s basically impossible for a human to annotate all these parts, we use active learning methods to perform what we call model-based annotation for these individual data points. The advantage here is simply that, once trained, the model can ultimately choose which data it would like a human to annotate. The other project we’re collaborating on with the EZRT is about using dual-energy X-ray technology in recycling. We’re working with a recycling company, trying to help them automate trash sorting using machine learning models. Here again, we run into the problem that it would be too expensive to have a human annotate all the tiny scraps of trash – and you’d need an expert to do it. What this really means for us is that we’ve got a small number of annotated pieces of trash, but a large number of non-annotated ones. So far, we’ve had quite a lot of success applying semi-supervised learning methods. We’re also working on a project about reinforcement learning. Overall, we’re always exploring further collaborations both within and outside Fraunhofer IIS.
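
To sketch the basic mechanics of such model-driven annotation: in the toy Python loop below, a model repeatedly ranks the unlabeled pool by how unsure it is and “queries” the most uncertain points for labeling. This is a generic uncertainty-sampling illustration, not the project’s actual pipeline; the synthetic ground-truth labels stand in for the human annotator.

```python
# A minimal uncertainty-based active learning loop. Purely illustrative.
# Requires numpy and scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled_idx = list(range(10))            # start with 10 labeled points
pool_idx = list(range(10, 500))          # the rest is the unlabeled pool

for annotation_round in range(5):        # 5 rounds of "ask the human"
    model = RandomForestClassifier(random_state=0)
    model.fit(X[labeled_idx], y[labeled_idx])

    # Uncertainty = low confidence in the most probable class.
    probs = model.predict_proba(X[pool_idx])
    uncertainty = 1.0 - probs.max(axis=1)

    # Query the 5 most uncertain points; y[idx] stands in for the
    # label a human annotator would provide.
    query = np.argsort(uncertainty)[-5:]
    for i in query:
        labeled_idx.append(pool_idx[i])
    pool_idx = [i for i in pool_idx if i not in labeled_idx]
```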

 

Ada:

Thanks for that. So we’ve just learned that your methods make it easier to work with a small amount of annotated data. But is there a minimum amount that you need? Or to put it another way, can you tell us what you have to take into account in terms of data volume and quality?

 

Jann:

That’s a great question, one that we get asked a lot by our research and industry partners. Generally speaking, it’s extremely difficult to estimate how much annotated data we’ll require for successful model training before we’ve had a closer look at the problem and the underlying data. It simply depends on a great many factors, starting with how hard the problem is, but also extending to how balanced the annotated data is and how many classes there are. Then of course there’s the type of problem itself – is it about classification, segmentation, regression, etc.? This is why I’m having a hard time coming up with a number on the spot. As a general rule, it’s better to have a large number of annotated samples, and we’re seeing that it’s also important to have a large amount of non-annotated data – especially for semi-supervised learning.

 

Ada:

I see. As a mathematician, my work with numbers and data makes it easy for me to imagine the benefits of these methods. Perhaps you can give us a brief overview of what the benefits are. What results can these methods deliver?

 

Jann:

Using few labels learning methods, we’re able to train powerful machine learning models in situations where only very little annotated training data is available – for instance by using large amounts of non-annotated data, as we do in semi-supervised learning. Another benefit or feature of few labels learning methods is that we can integrate human feedback into the training algorithm itself by adopting what we refer to as a human-in-the-loop approach. This is where we’re exploring active learning: we use model uncertainty as a way of essentially asking the model to tell us which data points would still require human annotation in order to continuously improve the model. And lastly, thanks to few-shot learning methods and meta-learning methods, we’re able to train models that can be transferred from one application area to another.

 

Ada:

You’ve mentioned a number of methods that you’re working with in the Few Labels Learning competence group. It sounds like it’s often about using not just one method, but a combination of methods. This raises the question of where your methods can be combined with other AI methods, or those used by other competence groups. Is this already happening?

 

Jann:

That’s another great question, Ada. We do have some thematic overlap and certain points in common with other competence groups at the ADA Lovelace Center. The Few Data Learning group is a good example. That group’s work is basically about how to train models when data is thin on the ground – not just a shortage of annotated data, as in our competence group, but a shortage of data of any kind. There, the work is often about data augmentation or simulations, and this is where it overlaps with Few Labels Learning. Another example is the Sequence-Based Learning group. Our colleagues there work a lot with sensor data and time series data, which also has some points in common with the work we do.

 

Ada:

Thanks a lot, Jann, for this fascinating discussion. I can already see that the days when machine learning was possible only with large amounts of annotated training data are over. This opens up a vast array of new possibilities for AI, and I for one am looking forward to discovering what the other competence groups at the ADA Lovelace Center are all about. Watch this space. Until next time!

 

 
