Machine learning (ML) technology has the potential to fundamentally change how we approach innovation.
Instead of the traditional moonshot approach (iteratively building/improving features toward a lofty goal), our team has had success using a data-first approach* to meet monumental benchmarks. We honed this direction while helping create Einstein Designer, an AI-augmented experience for a variety of design tasks.
There are a few reasons to take a data-first approach. First, it avoids the inefficient rebuilds that the moonshot approach tends to produce. Second, it acknowledges that ML projects are inherently about experiences with underlying statistical patterns, and that those patterns are uncovered through data.
What does a data-first approach to machine learning look like?
It starts with collecting and labeling data so you have a minimum viable dataset (MVD) that informs a model. Read on for specifics on how to do both. Then you can build a minimum viable product (MVP) from that.
Using this method will help your team learn about the problem space, identify canonical edge cases, and even codify some basic rules about the problem. But, most importantly, you’ll exit the process with a valuable dataset that you can use to bootstrap your ML project.
First, ask: “What data do we already have?”
Perhaps you have raw data that just isn’t labeled. In that case, you’re in good shape. But what happens when you don’t already have data? How can you bootstrap your project to get it started and build toward a functioning machine learning system? That’s when it’s time to get creative.
Here, I’ll share how we tackled the two steps in our data-first approach.
Step 1: Collect Data
We’ve found three broad approaches for collecting data: Public (from available datasets), Harvested (from the environment), and Generated (from users).
Public datasets form the basis for many of today’s ML projects. Kaggle competitions, for instance, are famous for providing datasets. In another well-known example, many projects have used the Enron emails, released as part of the court battle in 2002. The sources are endless. A good place to start is reading research papers related to your subject area to learn what data sources they use.
In our project creating Einstein Designer, we found that papers often provide links to the datasets they used. This can be a great starting point, as those datasets are typically high quality. For example, we found the Rico dataset, which has extensive data about mobile UIs.
In other cases, the data you need might exist in the environment. That’s when you need to harvest it. Google Maps Street View cars are a good example of this: all the location data exists in the real world; they just need to get it into a machine-readable format.
In our case, we recognized that there is a ton of design data on the web; we just needed to build a crawler to collect it. This approach can generate a lot of excellent data, but you also have to contend with the grit and grime of the real world. Even when crawling the web, we had to overcome many technical hurdles to find consistent ways of ingesting and labeling the data.
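As a minimal sketch of the extraction half of such a crawler, Python’s built-in HTML parser can pull candidate design elements out of a fetched page. The `TileCandidateExtractor` name is illustrative, not our production code, and the fetching, link-following, and de-duplication machinery a real crawler needs is omitted:

```python
from html.parser import HTMLParser

class TileCandidateExtractor(HTMLParser):
    """Collects <img> tags as candidate design elements from crawled HTML.

    A real crawler would also fetch pages (e.g. with urllib) and follow
    links; this sketch only shows the extraction step on an HTML string.
    """
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs arrives as (name, value) pairs; keep them as a dict
            self.images.append(dict(attrs))

html = '<div class="tile"><img src="/shoe.png" width="200" height="200"></div>'
parser = TileCandidateExtractor()
parser.feed(html)
print(parser.images)  # [{'src': '/shoe.png', 'width': '200', 'height': '200'}]
```

In practice the messy part is everything around this step: retries, malformed markup, and deciding which of the extracted elements are worth keeping.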
Lastly, there are times when the data needs to be created in the first place. In these cases, you’ll need to employ humans to get to work and help you out. For example, most websites and apps today collect a lot of user data. Every image you upload to social media adds to a giant body of images that those companies can use to train models. To make this approach work, you typically need a lot of active users.
Once you have some raw data, the next move is labeling it. This is both a challenge and an incredible opportunity to set a project up for success.
Step 2: Labeling Data
Machine learning isn’t magic. It requires labeled data so that models can encode the patterns within it. Collect your data before starting this process to ensure you’re labeling a consistent dataset. There are many different ways of labeling data that can lead to a successful ML model.
These are perhaps the three most common: Manual (with a Mechanical Turk), Heuristic (for raw datasets), and Synthetic (for created data).
The most obvious way to label data is to have humans do it by hand. This has the advantage of leveraging human intelligence: you can ask fairly nuanced questions and still get back labels. While any one human might get things wrong, if enough people label each item, you can infer the correct answer by majority vote. At a minimum, you can measure the subjectivity within your data. The team behind Bricolage used this technique, asking human labelers to specify which components were comparable across two different pages. They discovered that the labelers agreed ~78% of the time, which also established a benchmark to compare model performance against. This is valuable because it can help measure and limit the inherent bias in the dataset. The downsides of this approach are the expense and varying data quality. Using platforms like Amazon Mechanical Turk, you can pay people to do your task, or you can find colleagues to help out.
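The majority-vote and agreement calculations are simple to sketch. This is not Bricolage’s actual code; the labels and vote counts below are invented for illustration:

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label among several human labelers."""
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(label_sets):
    """Fraction of items on which all labelers agree -- a rough ceiling
    for model performance, analogous to Bricolage's ~78% figure."""
    agreed = sum(1 for labels in label_sets if len(set(labels)) == 1)
    return agreed / len(label_sets)

# three hypothetical labelers, four items
votes = [["tile", "tile", "tile"],
         ["tile", "not_tile", "tile"],
         ["not_tile", "not_tile", "not_tile"],
         ["tile", "not_tile", "not_tile"]]
print(majority_label(votes[1]))  # tile
print(agreement_rate(votes))     # 0.5
```

Production crowdsourcing pipelines typically add per-labeler reliability weighting on top of raw majority voting, but the simple version already gives you a usable agreement benchmark.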
We did both — partly relying on in-house expert designers to label data for us. You’ll also need a tool for these labelers to use. While there are powerful off-the-shelf tools like the VGG Image Annotator, we also found value in building our own labeling tools optimized for specific tasks.
In some cases, you may be able to label your data automatically instead of hiring humans. This has the obvious advantage of being cheaper. The challenge is to create a set of rules that accurately label the data; if that were easy, you probably wouldn’t need an ML model in the first place. So through this process, it’s best to focus on identifying true positives and true negatives. Since you can’t know which labels are false positives or false negatives, it’s better to filter ambiguous data out of the training set, or at least label it as ambiguous, to avoid introducing noise into your data.
For our project, we wanted to sort through thousands of web pages and label the parts that were product tiles. To do this, we created a set of rules that correctly identified only product tiles. Though it did miss a few along the way, this wasn’t important because we still collected a lot of good data.
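A heuristic labeler along these lines might look like the sketch below. The feature names and thresholds are invented for illustration (this is not our actual rule set); the key idea is returning a third, ambiguous label so noisy cases can be excluded from training:

```python
def label_tile(element):
    """Heuristically label a page element as a product tile.

    Returns "tile" only for high-confidence positives, "not_tile" for
    clear negatives, and "ambiguous" for everything else so uncertain
    cases can be filtered out of the training set.
    """
    has_image = element.get("has_image", False)
    has_price = element.get("has_price", False)
    width = element.get("width", 0)
    if has_image and has_price and 100 <= width <= 600:
        return "tile"       # confident true positive by our rules
    if not has_image and not has_price:
        return "not_tile"   # confident true negative
    return "ambiguous"      # exclude (or flag) to keep training data clean

print(label_tile({"has_image": True, "has_price": True, "width": 200}))   # tile
print(label_tile({"has_image": True, "has_price": False, "width": 200}))  # ambiguous
```

Missing some real tiles (the ambiguous bucket) is the price of keeping the positives clean, which is exactly the trade-off described above.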
Be careful, though, that your rules don’t accidentally introduce additional bias into your training data. One way to help prevent this is to manually label a small but representative subset of examples. Using this manual data, you can write tests to verify that your rules work as expected. This is just test-driven development, applied to data gathering and labeling. When labeling product tiles, we took this approach and identified approximately 30 sites. For example, we found a site whose product images were only 30px square. By writing a test for this case, we could see that our rules were too restrictive and then fix the issue in our labeling code. In our case, we only labeled true positives, but the tests gave us greater confidence. Always follow up this process by inspecting the data to find new false positives and write more tests.
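Here is what such a regression test might look like, assuming a hypothetical `is_product_image` rule (not our actual labeling code), with the 30px edge case from above captured as a test:

```python
import unittest

def is_product_image(width_px):
    """Rule under test: accept product images from 30px up to 600px wide.
    The lower bound was loosened after a test exposed a real site whose
    product images were only 30px square (widths are illustrative)."""
    return 30 <= width_px <= 600

class LabelingRuleTests(unittest.TestCase):
    def test_tiny_product_image_is_accepted(self):
        # regression test for the 30px-square product image edge case
        self.assertTrue(is_product_image(30))

    def test_icon_sized_image_is_rejected(self):
        # favicons and UI icons should not be labeled as products
        self.assertFalse(is_product_image(16))

unittest.main(argv=["labeling-tests"], exit=False)
```

Each newly discovered edge case becomes another test, so the rule set can only get stricter about what it gets right.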
In the last approach, you both create the data and label it in one step. This has the advantage of being relatively easy to accomplish since you don’t have to deal with the messiness of the real world. The clear disadvantage is that synthetic data runs the risk of having the most bias since it wasn’t derived from real data. You’ll need to pay particular attention to crafting this data set to avoid bias or at least be aware of it.
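A toy sketch of synthesizing labeled data in one step might look like the following. Every field, range, and the labeling rule here are invented for illustration; the point is that the label is known by construction, so no separate labeling pass is needed:

```python
import random

random.seed(0)  # reproducible sampling

def synthesize_tile():
    """Generate one synthetic product tile plus its label in a single step.

    Varying the fields widely helps limit the bias that synthetic data
    can introduce, but it cannot eliminate it -- the distribution is
    whatever we chose to generate.
    """
    tile = {
        "width": random.choice([150, 200, 300, 400]),
        "has_price": random.random() < 0.9,
        "title_words": random.randint(1, 8),
    }
    # the label falls out of the generation rule itself
    label = "tile" if tile["has_price"] else "not_tile"
    return tile, label

dataset = [synthesize_tile() for _ in range(1000)]
print(len(dataset))  # 1000
```

The convenience is real, but so is the risk: any pattern the generator doesn’t vary becomes a bias the model will happily learn.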
To see how we created ours, check out my other article on creating synthetic data.
When your data is collected and labeled, it’s time to train your first iteration of the model and make your MVP.
Machine learning technology has the potential to fundamentally change how we approach innovation. Maybe instead of the moonshot metaphor, we could think of innovation with machine learning as being more like training a puppy. Unlike building a spaceship, where you might start by building half the systems, with a puppy you don’t start with half a puppy. You start with a small and slightly unruly version of a complete dog (model). In this metaphor, your job is not to tightly engineer all of the puppy’s behaviors but to create experiences in which the puppy (model) can collect experiences (data) and learn to become a well-behaved dog. You bring the puppy on walks to teach it about cars. You go to the dog park so it can learn about other dogs. You visit friends to help it learn about people. And through it all, you reinforce positive behaviors (labeling) so that the puppy grows into a sociable, friendly, and obedient adult.
With machine learning, we are training our innovations to perform the tasks we desire of them. So we need to collect a wide array of unbiased data, labeled to reinforce the behaviors we want. And we need to continue guiding our models even after they’re built, maintaining them and continuing to train them. Models, unlike traditional algorithms, are never done: there is always room to add more data and more labels to increase a model’s performance and reliability.
In this version of innovation, it all starts with the data.
*Note: Prior to executing this approach, it’s a good idea to explore the product idea using cheap prototyping methods. Prototyping predictive systems can be a real challenge, but using the traditional rapid-prototyping toolkit, designers can test out various aspects of the experience and collect data about which products are likely to succeed in the marketplace. Then you’re ready to start.
Learn more at www.salesforce.com/design
Follow us at @SalesforceUX.
Check out the Salesforce Lightning Design System
Machine Learning: Redrawing the Innovation Roadmap was originally published in Salesforce Design on Medium, where people are continuing the conversation by highlighting and responding to this story.