Jiazhen Zhu leads the end-to-end data team at Walmart Global Governance DSI, a diverse group of data engineers and data scientists united in building a better platform through data-driven decisions and data-driven products. Zhu first joined Walmart Global Tech in 2019 to oversee both data engineering and machine learning, giving him a unique vantage point into the interconnected worlds of DataOps, MLOps, and data science. Before Walmart Global Tech, Zhu was a data scientist and software engineer at NTT Data.
Can you briefly introduce yourself and outline your role at Walmart Global Tech?
I am currently the lead data engineer and machine learning engineer at Walmart Global Tech. I work with data end-to-end, from where we get the data to how we clean it, transfer it, feed it into model training and finally move models to the production layer. I enjoy overseeing this process and bring a decade of experience in both data and machine learning, building platforms for both.
What has been your career path so far?
After completing my bachelor’s degree in computer science, I worked as a software engineer at Citi, focusing on the data warehouse used to build models and support scalable data modeling. After that I completed a master’s degree in data science and worked as a software engineer as well as a data scientist. All of this is interrelated, as data engineering and machine learning engineering are really just a part of software engineering – typically the software engineer will focus on the application or UI or full stack tasks, while the data engineer and machine learning engineer are more focused on the data and the model, respectively.
How does Walmart Global Tech fit into Walmart in general?
Walmart Global Tech is working on advanced technologies that create unique and innovative experiences for our employees, customers and members at Walmart, Sam’s Club and Walmart International. We solve the myriad of challenges that every retailer faces, be it suppliers, distribution, ordering, innovation, shopping experience or after-sales service. The only similarity between all of these is that they all benefit from technology.
You oversee both data engineering and machine learning. Are there lessons for others about the benefits of structuring the organization in this way? This should give you a unique vantage point on data-centric AI, per your recent blog.
At other companies, these functions are often separated in different organizations. My own experience is that if we can combine the different roles – especially the data scientists, the research scientists, data engineers, machine learning engineers and software engineers – in one team, it can accelerate product development. Since most domains require specialized knowledge, combining many different roles in one team can also help bring new innovative ideas to the product.
How do you feel about the build versus buy calculus when it comes to ML platforms?
For MLOps platforms, which is clearly a new area, it varies – it’s not as simple as saying we have one tech stack that we follow every time. What we do is approach these decisions based on requirements – then we make sure that each part can be easily replaced or rebuilt so that later we don’t have to rebuild the whole thing just because one part no longer meets our needs.
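One way to picture the "each part can be easily replaced" idea is a pipeline whose stages share a small interface, so a single stage can be swapped without touching the rest. The sketch below is only an illustration under that assumption: the names Stage, DropMissing, Scale and run_pipeline are hypothetical and do not describe Walmart's actual stack.

```python
from typing import List, Protocol


class Stage(Protocol):
    """Any pipeline step: takes records in, returns records out."""

    def run(self, records: List[dict]) -> List[dict]: ...


class DropMissing:
    """Hypothetical cleaning stage: drops records lacking a required field."""

    def __init__(self, field: str) -> None:
        self.field = field

    def run(self, records: List[dict]) -> List[dict]:
        return [r for r in records if r.get(self.field) is not None]


class Scale:
    """Hypothetical transform stage: multiplies one numeric field by a factor."""

    def __init__(self, field: str, factor: float) -> None:
        self.field = field
        self.factor = factor

    def run(self, records: List[dict]) -> List[dict]:
        return [{**r, self.field: r[self.field] * self.factor} for r in records]


def run_pipeline(stages: List[Stage], records: List[dict]) -> List[dict]:
    """Apply each stage in order; stages only depend on the shared interface."""
    for stage in stages:
        records = stage.run(records)
    return records


cleaned = run_pipeline(
    [DropMissing("x"), Scale("x", 2)],
    [{"x": 1}, {"x": None}, {"y": 3}],
)
```

Because run_pipeline only relies on the run method, replacing Scale with a different implementation later would not require rebuilding the other stages.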
What types of models does Walmart Global Tech put into production and why?
It depends on the area, the requirements and the end customers. In the beginning, I always start with the question: do we need machine learning to solve this problem, or is there an easier way to solve it that we should implement instead? When machine learning is required, it is often much easier and better to choose a simple model, such as linear regression, to ensure good performance. We use those kinds of models for base scenarios. If there is a good existing model to use, we will often adapt or use it – such as BERT for natural language processing.
I want to emphasize that trust is crucial for the model itself. Not everyone will trust the model. That’s why I said at the beginning that the simplest is often the best. Either avoid machine learning altogether or, if you must use it, prefer a model that offers simpler explanations, such as a linear regression model. The black box nature of BERT or deep learning makes it more difficult to help people or customers understand the model.
Ultimately, if customers or people don’t trust the model, it’s useless. So it is critical to build a process to explain the model. It is also important to troubleshoot the model itself.
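To make the point about simpler explanations concrete: a linear model's prediction is the intercept plus a sum of per-feature terms, so each feature's contribution can be shown directly to a user. The following is a minimal sketch; the function name, feature names and coefficient values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def explain_linear_prediction(
    intercept: float,
    weights: Dict[str, float],
    features: Dict[str, float],
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the prediction and each feature's additive contribution to it."""
    contributions = [(name, weights[name] * features[name]) for name in weights]
    prediction = intercept + sum(value for _, value in contributions)
    # Largest absolute contributions first, so the main drivers lead the explanation.
    contributions.sort(key=lambda item: abs(item[1]), reverse=True)
    return prediction, contributions


# Hypothetical coefficients and inputs, for illustration only.
weights = {"price_discount": 2.0, "days_since_last_order": -0.5}
features = {"price_discount": 3.0, "days_since_last_order": 4.0}
prediction, parts = explain_linear_prediction(10.0, weights, features)
```

Here the explanation reads straight off the result: the discount adds 6.0 to the baseline of 10.0 and the order gap subtracts 2.0. A deep model such as BERT offers no equally direct decomposition.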
Sounds like a model’s explainability and being able to rely on a model’s decisions are really important to your team?
Yes, it is not only important for the model, but also for the product and its customers. If you can explain a model to a customer or a user, you can also explain it to yourself – so win-win in that way too. Nobody likes a black box.
What is your strategy for model monitoring and model performance management?
Since changes are always happening, monitoring is really the key to successful MLOps. Whether from a data engineering or machine learning engineering perspective, we are always tasked with monitoring all processes in the pipeline or infrastructure. For example, the data engineer will look for data quality issues, data mismatches, missing data, and more.
For machine learning, monitoring includes both the data and the model itself. We look at data drift, concept drift and performance across key metrics (e.g., AUC) to get to the bottom of issues and inform the retraining process. There’s a lot you can track, so having access to key stats for root cause analysis and getting notifications and alerts really helps.
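The checks described above can be sketched in a few lines. This is an illustration, not the team's actual tooling: it assumes the Population Stability Index (PSI) as the data drift measure, computes AUC by pairwise comparison, and uses alert thresholds (0.2 for PSI, 0.7 for AUC) that are arbitrary placeholders. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def check_model(ref_feature, cur_feature, labels, scores,
                psi_limit: float = 0.2, auc_floor: float = 0.7) -> List[str]:
    """Return alert messages; both thresholds are illustrative assumptions."""
    alerts = []
    drift = psi(ref_feature, cur_feature)
    if drift > psi_limit:
        alerts.append(f"data drift: PSI={drift:.3f}")
    score = auc(labels, scores)
    if score < auc_floor:
        alerts.append(f"performance: AUC={score:.3f}")
    return alerts
```

In practice the returned alerts would feed a notification channel, and a sustained drift or AUC alert would be one input into the decision to retrain.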
This must be a really interesting time at Walmart given record demand, supply chain challenges, inflation and much more. Have you encountered any interesting issues with production models responding to a new environment?
Definitely yes. The only constant is that the data is always changing. For example, a model trained on social network data can see a major drop in performance when the social network data changes drastically or disappears overnight. These kinds of problems are common.
Half of the data scientists we recently surveyed (50.3%) say their business counterparts don’t understand machine learning. How did you successfully overcome this hurdle to scale your ML practice?
This kind of situation is common in the industry. As discussed, some models are black boxes, and confidence in black boxes that stay unopened is low, which is why explainability is so important. If your customers can look at a model and understand why it made a certain decision, confidence will grow over time.
For models that directly affect customers, how do you incorporate customer feedback into your models?
Customer feedback is so important. The data may change or the concept may change, but if customer feedback is part of the ML process, we can use that customer data to retrain the model in near real time and get better model performance and a better ability to predict reality as a result. Having that human-in-the-loop process of checking things can help ensure that models stay relevant and perform well.
What are your favorite and least favorite parts of your current role?
I like data and I like to play with data, so that’s really one of my favorite aspects of my current role. Incidentally, it is also one of the more difficult parts of the profession, because you have to know the data well before you fit it into a model. When it comes to machine learning, one of the hardest things is knowing how to take the right approach – not just for the model, not just for the data, but for the tech stacks, scalability, and everything about the ML pipeline.