Use Pydantic to Ensure Data Consistency

Pydantic is a library for data parsing and validation. You declare the shape of your data as typed classes, and Pydantic validates incoming values against those types and coerces them where it can. Using Pydantic along with Prefect is an excellent way to keep the data in your flows consistent, which makes the flows simpler to debug.

The code snippets below show how to set up simple classes using Pydantic and use them in a Prefect flow. In this example we will use Pydantic’s BaseModel to create classes that validate and coerce data from a fake API call. The first step is to create a series of models in a separate Python script, named models.py to match the import used in the flow script. The syntax of BaseModel can be further examined here.

from pydantic import BaseModel, validator
from typing import Optional

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str
    country: str = 'USA'

class User(BaseModel):
    id: int
    first_name: str
    last_name: Optional[str] = None
    address: Address
    email: str
    phone: str

    # Validators receive the class first, then the value being validated
    @validator('phone')
    def check_phone(cls, ph):
        if len(ph) < 10:
            raise ValueError("Please put a full phone number")
        return ph

The validator decorator lets a custom function validate a specific field of the input. In this case, a simple function rejects any phone number shorter than 10 characters.
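To see the validator reject bad input, here is a minimal sketch. The Contact model and the try_phone helper are hypothetical names made up for this illustration; the validator itself mirrors the one in the User model above and uses the same validator decorator. The exact wording of the error message may differ between Pydantic versions.

```python
from pydantic import BaseModel, ValidationError, validator

# Hypothetical, trimmed-down model that only carries the phone field
class Contact(BaseModel):
    phone: str

    @validator('phone')
    def check_phone(cls, ph):
        if len(ph) < 10:
            raise ValueError("Please put a full phone number")
        return ph

# Hypothetical helper: return the validated phone, or a short rejection message
def try_phone(phone):
    try:
        return Contact(phone=phone).phone
    except ValidationError as err:
        return f"rejected: {err.errors()[0]['msg']}"

print(try_phone("3125555555"))  # passes validation
print(try_phone("555"))         # too short, so the validator raises
```

The ValueError raised inside the validator is wrapped by Pydantic in a ValidationError, which is what calling code should catch.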

Pydantic works especially well for parsing API data. For this example, let’s assume an API call returns the following.

from prefect import flow, task
from models import Address, User

@task
def my_fake_api_call():
    res = [
        {
            'id': 42,
            'first_name': 'Ford',
            'last_name': 'Prefect',
            'address': {
                'street': '42 Prefect Way',
                'city': 'Chicago',
                'state': 'Illinois',
                'zip': 60614,
            },
            'email': 'ford.prefect@gmail.com',
            'phone': 3125555555,
        },
        {
            'id': "43",
            'first_name': 'Marvin',
            'last_name': 'McQuack',
            'address': {
                'street': '42 Rubber Duck Lane',
                'city': 'Chicago',
                'state': 'Illinois',
                'zip': '60614',
            },
            'email': 'marvin.mcquack@gmail.com',
            'phone': "3127777777",
        },
    ]
    return res

Looking at the results from the “API call” above, we can see some inconsistencies in our inputs. For example, the first user’s zip and phone are integers, while the second user’s id is a string. This may be easy to fix by hand for a couple of users, but with large amounts of data it becomes a huge pain point in the flow. However, if we load this data into the Pydantic classes we created, the inputs will automatically be coerced into our desired types.
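The coercion described above can be sketched on its own, without Prefect. The models here are trimmed-down copies of Address and User with most fields left out for brevity, and the raw dict is made-up sample data. The sketch sticks to string-to-integer coercion and nested-dict parsing; whether an integer is accepted for a str field (as with zip in this article) depends on the Pydantic version and configuration in use.

```python
from pydantic import BaseModel

# Trimmed-down copies of the article's models, for illustration only
class Address(BaseModel):
    city: str
    country: str = 'USA'

class User(BaseModel):
    id: int
    address: Address

# Made-up raw record: id arrives as a string, address as a plain dict
raw = {'id': '43', 'address': {'city': 'Chicago'}}
user = User(**raw)

print(type(user.id).__name__)       # the string id was coerced to an integer
print(type(user.address).__name__)  # the nested dict became an Address instance
print(user.address.country)         # the default value was filled in
```

Because the nested dict is parsed into an Address automatically, the calling code never has to build the Address by hand.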

With the simple flow below, we will convert the data into a list of Pydantic User instances and extract the Address instance from each user.

@task
def change_to_user_class(user_list):
    # Convert each raw user dict to our custom Pydantic User class.
    # As we defined in our models, the Address class is embedded in the User class.
    user_data = [User(**u) for u in user_list]
    return user_data

@task
def return_address(user_data):
    # Given a specific user, extract the user's address as an Address instance
    returned_address = user_data.address
    return returned_address

@flow
def get_user_address_flow():
    # Putting it all together
    # Get our data from the "API call"
    user_list = my_fake_api_call()
    # Change the data for each user into our custom User class
    user_data = change_to_user_class(user_list)
    # Extract the Address class for each User class instance
    returned_address = [return_address(u) for u in user_data]
    print(returned_address)
    return returned_address

# Run our flow when the Python script is run
if __name__ == "__main__":
    get_user_address_flow()

Finally, when the flow completes, it prints a list containing an Address instance for each user. Thanks to our BaseModel classes, the first user’s zip code is now in the correct string format, and the default country has been filled in for both users.

[Address(street='42 Prefect Way', city='Chicago', state='Illinois', zip='60614', country='USA'), 
Address(street='42 Rubber Duck Lane', city='Chicago', state='Illinois', zip='60614', country='USA')]

This example flow is a simple way to ensure your flow’s data is validated and consistent.

To learn more about how Pydantic’s SecretStr can help to obscure secrets in the Prefect UI, check out the Discourse topic here.
