Testing Masterclass

Aur Saraf, PyCon IL 2025

A hands-on workshop covering agreed-upon best practices, along with some opinions

Intro Who, Why, What

I xUnit tests

II Evaluating prompts

III Mastery

Intro Who, Why, What

Who am I?

Why test?

What is testing?

I xUnit tests

II Evaluating prompts

III Mastery

Intro

Who am I?

Aur, startup CTO / consultant
Wrote my first line of BASIC 27 years ago and fell in love
Run a programming Dojo every Wednesday 20:00 on Zoom

Why test?

What is testing?

Intro

Who am I?

Why test?

To build products faster
Code never works the first time
Working code won't stay working if we leave it alone

What is testing?

Intro

Who am I?

Why test?

What is testing?

Any activity that catches mistakes before production
e.g. manual exploration, manual scripts, automated unit tests, automated e2e tests, fuzzing, running past production workloads, statistical evals, ...
Testing is risk management

TESTING IS RISK MANAGEMENT

TESTING IS RISK MANAGEMENT

TESTING IS RISK MANAGEMENT

Intro Who, Why, What

I xUnit tests

Let's write a test

Choosing good test cases

Testing toolbox

Design exercises

Things that are hard to test

II Evaluating prompts

III Mastery

Let's write a test!

# shekels.py

def amount_to_shekels_in_hebrew(amount):
    ...

print(amount_to_shekels_in_hebrew(1.01))
שקל אחד ואגורה אחת

Usually a test looks like this:

def test_happy_path():
    value = 1.01
    result = amount_to_shekels_in_hebrew(value)
    assert result == "שקל אחד ואגורה אחת"

But let's start simple:

def amount_to_shekels_in_hebrew(amount):
    ...
assert amount_to_shekels_in_hebrew(1.01) == "שקל אחד ואגורה אחת"

Write and run 3-5 assert tests to convince yourself that everything is fine 👌

git clone https://github.com/SonOfLilit/testing.git
cd testing
uv run python shekels.py

https://docs.astral.sh/uv/getting-started/installation/

Choosing good test cases

Start simple

Cover every subdivision of the problem space

There are surprising subdivisions

Hug borders tightly

Look for multipoints

Again!

list(range(1, 10, 3))  # start=1, stop=10, step=3  => [1, 4, 7]

Start simple

assert (
    list(range(3)) == list(range(0, 3)) ==
    list(range(0, 3, 1)) == [0, 1, 2]
)
assert list(range(1, 3)) == [1, 2]
assert list(range(1, 6, 2)) == [1, 3, 5]
assert list(range(5, 0, -1)) == [5, 4, 3, 2, 1]

Cover every subdivision of the problem space

assert (
    list(range(3)) == list(range(0, 3)) ==
    list(range(0, 3, 1)) == [0, 1, 2]
)
assert list(range(1, 3)) == [1, 2]
assert list(range(1, 6, 2)) == [1, 3, 5]
assert list(range(5, 0, -1)) == [5, 4, 3, 2, 1]
assert list(range(3, 1)) == []
assert list(range(5, 5, -1)) == []
assert list(range(5, 0, -2)) == [5, 3, 1]

There are surprising subdivisions

assert (
    list(range(3)) == list(range(0, 3)) ==
    list(range(0, 3, 1)) == [0, 1, 2])
assert list(range(1, 3)) == [1, 2]
assert list(range(1, 6, 2)) == [1, 3, 5]
assert list(range(5, 0, -1)) == [5, 4, 3, 2, 1]
assert list(range(3, 1)) == []
assert list(range(5, 5, -1)) == []
assert list(range(5, 0, -2)) == [5, 3, 1]
assert list(range(3, 4, 2)) == [3]
assert list(range(-1)) == []
try:
    list(range(2.5))
    assert False
except TypeError: pass
try:
    list(range(0, 3, 0))
    assert False
except ValueError: pass

Hug borders tightly

import pytest

assert (list(range(3)) == list(range(0, 3)) ==
        list(range(0, 3, 1)) == [0, 1, 2])
assert list(range(2, 3)) == [2]
assert list(range(3, 3)) == []
assert list(range(3, 2)) == []
assert list(range(-1)) == []
assert list(range(1, 6, 2)) == [1, 3, 5]
assert list(range(1, 5, 2)) == [1, 3]
assert list(range(1, 4, 2)) == [1, 3]
assert list(range(1, 3, 2)) == [1]
assert list(range(5, 0, -1)) == [5, 4, 3, 2, 1]
assert list(range(5, 4, -1)) == [5]
assert list(range(5, 5, -1)) == []
assert list(range(5, 0, -2)) == [5, 3, 1]
assert list(range(5, 0, -4)) == [5, 1]
assert list(range(5, 0, -5)) == [5]
with pytest.raises(ValueError):
    list(range(5, 0, 0))
with pytest.raises(TypeError):
    list(range(2.5))

Look for multipoints

range(1)
range(0, 1)
range(0, 1, 1)

range(0, 1, 0)
range(0, -1, -1)

range(0, 0, 1)
range(0, 0, 0)
range(0, 0, -1)

range(0, 2, 1)
range(0, 2, 0)
range(0, -2, -1)

# what else?

Now let's write some great tests!

# shekels.py

def amount_to_shekels_in_hebrew(amount):
    ...
assert amount_to_shekels_in_hebrew(1.01) == "שקל אחד ואגורה אחת"
# what else?

# bonus:
def normalize_shekel_string(string):
    ...
assert normalize_shekel_string("שנים עשר שמונים") == "שתים עשרה שקל ושמונים אגורות"

Who can find a bug? (I couldn't, but there's enough logic that there must be)

Testing stateful systems

The state is another dimension in the problem space

It gets ugly fast.

Testing stateful systems

As much as you can, avoid writing stateful systems.


Testing stateful systems

user = create_account(name="Adam")
pizza = create_item(name="Pizza", price="$9.99")
add_to_shopping_cart(user, pizza, quantity=1)
shopping_cart = get_shopping_cart(user)
assert len(shopping_cart) == 1
(item,) = shopping_cart
assert item.id == pizza.id and item.quantity == 1
finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")
assert not get_shopping_cart(user)

How do we add more tests?

Testing stateful systems

user = create_account(name="Adam")
pizza = create_item(name="Pizza", price="$9.99")
add_to_shopping_cart(user, pizza, quantity=2)
(item,) = get_shopping_cart(user)
assert item.id == pizza.id and item.quantity == 2
remove_from_shopping_cart(user, pizza, quantity=1)
(item,) = get_shopping_cart(user)
assert item.id == pizza.id and item.quantity == 1
finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")
assert not get_shopping_cart(user)

We had stateful code. We tested it with a stateful test.
Now we have two problems.
It's very hard to add another case without breaking anything
or to figure out why it failed.

Let's try again.

user = create_account(name="Adam")
pizza = create_item(name="Pizza", price="$9.99")
add_to_shopping_cart(user, pizza, quantity=2)
(item,) = get_shopping_cart(user)
assert item.id == pizza.id and item.quantity == 2

user = create_account(name="Adam")
pizza = create_item(name="Pizza", price="$9.99")
add_to_shopping_cart(user, pizza, quantity=2)
remove_from_shopping_cart(user, pizza, quantity=1)
(item,) = get_shopping_cart(user)
assert item.id == pizza.id and item.quantity == 1
                                                                                                            
user = create_account(name="Adam")
pizza = create_item(name="Pizza", price="$9.99")
add_to_shopping_cart(user, pizza, quantity=2)
remove_from_shopping_cart(user, pizza, quantity=1)
finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")
assert not get_shopping_cart(user)

We managed to hide the devil, but he still lurks.
This is slow, and test speed is essential to happy, effective devs.
It's also repetitive, so mistakes won't be noticed (I made one).

Again.

def setup_pizza_test(quantity=1):
    user = create_account(name="Adam")
    pizza = create_item(name="Pizza", price="$9.99")
    add_to_shopping_cart(user, pizza, quantity=quantity)
    (item,) = get_shopping_cart(user)    
    return user, pizza, item

user, pizza, item = setup_pizza_test(quantity=2)
assert item.id == pizza.id and item.quantity == 2

user, pizza, item = setup_pizza_test(quantity=2)
remove_from_shopping_cart(user, pizza, quantity=1)
(item,) = get_shopping_cart(user)
assert item.id == pizza.id and item.quantity == 1
                                                                                                            
user, pizza, item = setup_pizza_test(quantity=1)
finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")
assert not get_shopping_cart(user)

Who put spaghetti in my pizza test? setup() tends to grow wild.

Still, this is usually the winning tradeoff. Unless we can make the
business logic stateless...

A message from the Devil:

Even if you split your test cases between several test_*() functions, feel free to make assumptions about the order in which they run. Hehe. Hehehehe.
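
In case it's not obvious why the Devil is laughing, here's a minimal, made-up example of the trap: these tests only pass when pytest happens to run them top to bottom.

cart = []  # module-level state shared between tests

def test_add():
    cart.append("pizza")
    assert cart == ["pizza"]

def test_remove():
    cart.remove("pizza")  # breaks if it runs first, alone, or in parallel
    assert cart == []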






Testing toolbox

Given...When...Then

Setup/Teardown/Fixtures

Parametrize

Mocks

It's standard to write tests with this structure:

Given...When...Then

def test_finalize_order_empties_shopping_cart():
    # Given: a user's shopping cart with one pizza
    user = create_account(name="Adam")
    pizza = create_item(name="Pizza", price="$9.99")
    add_to_shopping_cart(user, pizza, quantity=1)

    # When: I finalize the order
    finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")

    # Then: The shopping cart is now empty
    shopping_cart = get_shopping_cart(user)
    assert not shopping_cart

Google's testing book calls these DAMP tests: each test is a "Descriptive And Meaningful Phrase" rather than DRY (tests repeat themselves!).

DRY or DAMP?

Should you optimize for maintainability and ease of writing tests or for ease of reading a single test in isolation?

Accepted wisdom says the latter; I advise trying to strike a good balance.

If you have 100 tests like "add some standard items to the shop, create a user, do a series of adds/removes on their shopping cart, check what's in the shopping cart", you can afford to write a test harness (see the sketch below). If you only have 5 and they're not nearly identical, just copy-paste.
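
To make "write a test harness" concrete, here's a rough sketch reusing the hypothetical cart API from the earlier slides. Each case is just data, so the 101st test costs one more line.

# (operations to perform, expected total quantity left in the cart)
CASES = [
    ([("add", 2)], 2),
    ([("add", 2), ("remove", 1)], 1),
    ([("add", 3), ("remove", 3)], 0),
]

def run_cart_case(steps, expected_quantity):
    user = create_account(name="Adam")
    pizza = create_item(name="Pizza", price="$9.99")
    for operation, quantity in steps:
        if operation == "add":
            add_to_shopping_cart(user, pizza, quantity=quantity)
        else:
            remove_from_shopping_cart(user, pizza, quantity=quantity)
    assert sum(item.quantity for item in get_shopping_cart(user)) == expected_quantity

def test_cart_scenarios():
    for steps, expected in CASES:
        run_cart_case(steps, expected)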

Setup/Teardown/Fixtures

Traditional xUnit test frameworks have a hierarchy of setup()/teardown() hooks (for tests and test suites).

pytest improves on it with fixtures, a dependency injection mechanism for resources used by tests.

Fixtures

import pytest

@pytest.fixture(scope="session")
def pizza():  # runs once before the first test
    return create_item(name="Pizza", price="$9.99")
@pytest.fixture
def user():
    user = create_account(name="Adam")
    yield user
    close_account(user)  # you can tear down resources
@pytest.fixture
def cart_item(pizza, user):  # fixtures can depend on other fixtures
    add_to_shopping_cart(user, pizza, quantity=1)
    (item,) = get_shopping_cart(user)
    yield item

def test_finalize_order_empties_shopping_cart(user, cart_item):                         
    finalize_order(user, credit_card="XXXX-XXXX-XXXX-XXXX")
    shopping_cart = get_shopping_cart(user)
    assert not shopping_cart

Parametrization

The poor man's test harness, good enough in 80% of cases.

@pytest.mark.parametrize(
    "amount,hebrew",
    [
        (1, "שקל אחד"),
        (2, "שני שקלים"),
        (10, "עשרה שקלים"),
        # ...
        # Decimal tests
        (1.23, "שקל אחד ועשרים ושלוש אגורות"),
        # ...
        # Error cases
        (None, ""),
        ("א", "לא הוקש סכום תקין"),
        (1222333444555, "לא הוקש סכום תקין"),                                         
    ],
)
def test_conversion(amount, hebrew):
    assert amount_to_shekels_in_hebrew(amount) == hebrew

Mocks

Should we actually talk to the credit card processor when testing "finalizing an order"? Probably not. And to the DB?

Python has absolutely magical libraries for saying "this test believes as dogma that this other code behaves as follows", from "make the system clock always say midnight" to "when sending a POST request to stripe.com/api/charge, it will return HTTP 405 the first time and this JSON document the second time, also assert it was called with the right params".

MagicMock and patch

from unittest.mock import MagicMock, patch

def test_finalize_order_empties_shopping_cart(user, cart_item):
    _pay(user)
    shopping_cart = get_shopping_cart(user)
    assert not shopping_cart

@patch("time.time", return_value=12345)
@patch("requests.post")
def _pay(mock_time, mock_post, user):
    mock_post.return_value.json.return_value = {"id": "123"}

    order = get_order_summary(user)
    assert mock_post.assert_called_once_with("https://.../api/initiate", json={...})

    mock_post.reset_mock()
    payment = initiate_payment(order)
    receive_payment_success(order, {"payment_id": "123", "token": "123"})
    assert mock_post.assert_called_once_with("https://.../api/validate", json={...})

VCR

import vcr

def test_finalize_order_empties_shopping_cart(user, cart_item):
    _pay(user)
    shopping_cart = get_shopping_cart(user)
    assert not shopping_cart

@vcr.use_cassette('fixtures/vcr_cassettes/payment.yaml')
def _pay(user):
    order = get_order_summary(user)
    payment = initiate_payment(order)
    receive_payment_success(order, {"payment_id": "123", "token": "123"})

pip install vcrpy works for HTTP, but you can easily implement your own VCR for anything
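
A toy sketch of "your own VCR": record the real call's result to a file the first time, replay it forever after. (Real VCRs also key recordings by request, record errors, redact secrets, etc.)

import json, os

def with_cassette(path, func, *args, **kwargs):
    if os.path.exists(path):  # replay mode
        with open(path) as f:
            return json.load(f)
    result = func(*args, **kwargs)  # record mode: make the real call once
    with open(path, "w") as f:
        json.dump(result, f)
    return result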

What should we mock?

What should we mock?

Still, in a web app we should probably not mock the DB.

But our ORM can stand in for it, by pointing the same models at an in-memory SQLite database! (sketch below)
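
A sketch of that idea as a pytest fixture, assuming SQLAlchemy models declared on a Base:

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import Session

@pytest.fixture
def db_session():
    engine = create_engine("sqlite://")  # fresh in-memory database per test
    Base.metadata.create_all(engine)  # create tables for all ORM models
    with Session(engine) as session:
        yield session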

Let's look at some tests!

The warehouse (PyPI) tests. Look at their session tests. What's their stance on "test one thing per test"?

The httpx HTTP client tests. Fixtures are in conftest.py. Find the GET / test. Find the tests for JSON encode/decode. What did they choose to mock?

The SQLAlchemy (database ORM) tests. Find the tests that together say "SELECT field FROM table does what we want". Where do they stand on the DRY/DAMP question?

Which test suite did you like the most? Why?

Let's design tests!

How would you test glob.glob?

https://docs.python.org/3/library/glob.html

Write a full suite of test cases in pseudo-code

How would you test the ChatGPT backend?

(the consumer app, not the GPT API, which is one of your dependencies)

Write some example test cases in pseudo-code

Intro Who, Why, What

I xUnit tests

II Evaluating prompts

Let's use an LLM to process a request

How to think about "correctness"

Train/test split

Quality measures

Production monitoring

III Mastery

Let's use an LLM to process a request

from pydantic import BaseModel, Field
from pydantic_ai import Agent

class PalindromeGeneration(BaseModel):
    palindrome: str = Field(description="The generated palindrome")

generator_agent = Agent(
    model="google-gla:gemini-2.0-flash-lite",
    output_type=PalindromeGeneration,
    system_prompt="""You are a creative palindrome generator. Create palindromes that:
1. Are related to the given topic
2. Make grammatical and semantic sense
""",
)
def generate_palindrome(topic):
    prompt = f"Create a palindrome related to the topic: {topic}"
    return generator_agent.run_sync(prompt).output

How do we test non-deterministic outputs?

Traditional test won't work

def test_palindrome():
    topic = "quest for the holy grail"
    result = generate_palindrome(topic)

    # This might fail randomly!
    assert result.palindrome == "Sir, a Grail, a liar, Garis!"

What is a "correct" answer?

Signal + Noise -> [Our Test] -> Signal + Noise

Quality measures

Deterministic measurements

Accuracy curves

Human as judge

LLM as judge

Ensemble

Noisy Metrics

import numpy as np

def get_embedding(text):
    ...  # call an embedding model, e.g. gemini-embedding-001

def embedding_similarity(a, b):
    # cosine similarity between the two embedding vectors
    e_a, e_b = get_embedding(a), get_embedding(b)
    return np.dot(e_a, e_b) / (
        np.linalg.norm(e_a) * np.linalg.norm(e_b))

similarity = embedding_similarity(
    "quest for the holy grail",
    "Sir, a Grail, a liar, Garis!",
)

Deterministic measurements

import re

def is_palindrome(text: str) -> bool:
    cleaned = re.sub(r"[^a-zA-Z0-9]", "", text.lower())
    return cleaned == cleaned[::-1]

Accuracy curves

Human as judge 😩

LLM as judge

class PalindromeEvaluation(BaseModel):
    sense_score: int = Field(description="How much sense it makes (1-10)", ge=1, le=10)
    topic_score: int = Field(description="How on topic it is (1-10)", ge=1, le=10)

evaluator_agent = Agent(
    model="google-gla:gemini-2.0-flash-lite",
    output_type=PalindromeEvaluation,
    system_prompt="""You are a creative writing teacher. Evaluate this short text on:

1. Sense (1-10): How grammatically correct and semantically meaningful is it?
2. Topic relevance (1-10): How well does it relate to the assigned topic?
""",
)
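
The ensemble below calls evaluate_palindrome; a thin wrapper mirroring generate_palindrome is all it needs (the prompt wording is illustrative):

def evaluate_palindrome(topic, palindrome):
    prompt = f"Topic: {topic}\nText: {palindrome}"
    return evaluator_agent.run_sync(prompt).output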

Ensemble

generation = generate_palindrome(topic)
evaluation = evaluate_palindrome(topic, generation.palindrome)
is_valid_palindrome = is_palindrome(generation.palindrome)
palindrome_score = 10 * is_valid_palindrome
average_score = (
    palindrome_score + evaluation.sense_score + evaluation.topic_score
) / 3

Train/test split

import random

all_examples = load_examples()
random.shuffle(all_examples)

# 80/20 split
split = int(len(all_examples) * 0.8)
train_set = all_examples[:split]  # For prompt engineering
test_set = all_examples[split:]   # For evaluation

# Use train_set to iterate on your prompt
# Use test_set ONCE to measure final quality

Never look at test set results until you're done!

Remember: if you already know the answer, it's fake science

Production monitoring

Log everything

Track quality metrics over time

A/B test prompt changes

Monitor for drift
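
A minimal sketch of "log everything" and "track quality metrics" with logfire (the same service the exercise below sets up); the event and field names are illustrative.

import logfire

logfire.configure()  # picks up credentials from `logfire auth` / the environment

def generate_and_log(topic):
    generation = generate_palindrome(topic)
    evaluation = evaluate_palindrome(topic, generation.palindrome)
    # one structured event per request lets us chart quality over time
    logfire.info(
        "palindrome generated",
        topic=topic,
        palindrome=generation.palindrome,
        sense_score=evaluation.sense_score,
        topic_score=evaluation.topic_score,
        valid=is_palindrome(generation.palindrome),
    )
    return generation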

It's science time!

cp .env.example .env
# make free Logfire and Google AI Studio accounts
# put API keys in .env
uv run logfire auth
uv run python palindrome.py train

Exercise: design a prompt to maximize average_score

II Evaluating prompts

III Mastery

TDD

Coverage

Things that are hard to test

BDD & FIT

Golden/snapshot testing

Design exercise

Blesstests

TDD

Test Driven Development

Red (failing test)
Green (minimal change that fixes it)
Refactor (if needed)
Repeat.

TDD is actually not about tests, it's about design.

But it's a good way to practice testing. Let's build FizzBuzz with TDD.

def fizzbuzz(n):
    pass
assert fizzbuzz(1) == "1"

# what's the minimal fix to make it green?

Try it for a week. I myself do pseudo-TDD. I write pseudocode for tests before the first line of code, so my design will be testable.

Coverage metrics

$ pytest --cov=myproject tests/
======================== test session starts ========================
tests/test_shekels.py::test_simple_amounts PASSED
tests/test_shekels.py::test_edge_cases PASSED
tests/test_shekels.py::test_large_numbers PASSED

---------- coverage: platform linux, python 3.11.0 ----------
Name                  Stmts   Miss  Cover
-----------------------------------------
myproject/shekels.py     45      3    93%
myproject/utils.py       20      0   100%
-----------------------------------------
TOTAL                    65      3    95%

100% coverage ≠ bug-free
Coverage shows what you didn't test
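
A minimal, made-up example of why: the test executes every line, so coverage reports 100%, yet the boundary bug at 18 is never caught.

def is_adult(age):
    return age > 18  # bug: should be >= 18

def test_is_adult():
    assert is_adult(30)
    assert not is_adult(5)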

Things that are hard to test

UI - Automated visual tests are probably not worth the effort

Nondeterministic code - Duh. Wait, why are you writing it?

Distributed code - Oof. Good luck. Read some Jepsen posts.

BDD (Behavior-Driven Development)

Tests are executable specifications

Scenario: Counting people in departments
  Given a set of specific users
     | name      | department  |
     | Barry     | Arguments   |
     | Pudey     | Silly Walks |
     | Two-Lumps | Silly Walks |

 When we count the number of people in each department
 Then we will find 2 people in "Silly Walks"
  But we will find 1 person in "Arguments"

BDD implementation

from behave import given, when, then

@given("a set of specific users")
def step_impl(context):
    for row in context.table:
        model.add_user(name=row["name"], department=row["department"])

@when("we count the number of people in each department")
def step_impl(context):
    context.department_counts = model.count_by_department()

@then('we will find {count:d} person in "{department}"')
@then('we will find {count:d} people in "{department}"')
def step_impl(context, count, department):
    actual_count = context.department_counts.get(department, 0)
    assert actual_count == count, f"Expected {count} people in {department}, but found {actual_count}"

Product managers can write tests!
I think it's enough that they can write test cases and read test reports

FIT (Framework for Integrated Test)

A close relative of BDD that expresses test cases as tables of parameters, written in spreadsheet-style documents (FIT) or a wiki (FitNesse).
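
A low-tech sketch of the same idea in pytest: keep the table in a spreadsheet-friendly CSV and feed it to parametrize (the file name and columns here are hypothetical).

import csv
import pytest

with open("tests/shekel_cases.csv", encoding="utf-8") as f:
    CASES = [(row["amount"], row["expected"]) for row in csv.DictReader(f)]

@pytest.mark.parametrize("amount,expected", CASES)
def test_from_table(amount, expected):
    assert amount_to_shekels_in_hebrew(float(amount)) == expected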

Golden/Snapshot testing

from inline_snapshot import snapshot
from inline_snapshot.extra import raises


def test_range():
    assert range(1) == snapshot()
    assert range(0, 1) == snapshot()
    assert range(0, 1, 1) == snapshot()

    with raises(snapshot()):
        range(0, 1, 0)
    assert range(0, -1, -1) == snapshot()

    assert range(0, 0, 1) == snapshot()
    with raises(snapshot()):
        range(0, 0, 0)
    assert range(0, 0, -1) == snapshot()

    assert range(0, 2, 1) == snapshot()
    with raises(snapshot()):
        range(0, 2, 0)
    assert range(0, -2, -1) == snapshot()

Run it and it becomes...

Golden/Snapshot testing

from inline_snapshot import snapshot
from inline_snapshot.extra import raises


def test_range():
    assert range(1) == snapshot(range(0, 1))
    assert range(0, 1) == snapshot(range(0, 1))
    assert range(0, 1, 1) == snapshot(range(0, 1))

    with raises(snapshot("ValueError: range() arg 3 must not be zero")):
        range(0, 1, 0)
    assert range(0, -1, -1) == snapshot(range(0, -1, -1))
    # ...

Golden tests meet world

Benefits

  • Catches unexpected changes
  • Very low friction to add tests
  • Great for complex outputs (HTML, reports, tables, generated code)

Pitfalls

  • Tests become change detectors (also yay!)
  • Need to think while running tests (also yay!)
  • Non-deterministic data (timestamps, IDs) needs to be normalized
  • Higher friction to touch mature project (reflects reality... but must-fix)
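
For the non-deterministic data pitfall, the usual fix is to scrub volatile values before snapshotting. A sketch; the patterns are illustrative.

import re

def normalize(report: str) -> str:
    report = re.sub(r"\d{4}-\d{2}-\d{2}[ T][\d:.]+", "<TIMESTAMP>", report)
    report = re.sub(r"\bid=\w+", "id=<ID>", report)
    return report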

Design exercise: Testing a chatbot

You're building a 0% AI, 100% scripted sales chatbot

How would you test it?

- import test_default_init_album
- לבחירת אלבום בינוני
- B: אנא כתבו את הכתובת אליה נשלח אליכם את האלבום
- B: מה שם העיר?
- גבעתיים
- B: מה שם הרחוב? (בלי מספר)
- שינקין
- B: מה מספר הבניין/בית?
- "1"
- B: מה מספר המיקוד?
- "1231234"
- B: |-
    *סיכום פרטי האלבום עד כה:*
    *שם + שם משפחה:* טסט בוט
    *כתובת מייל:* test@albooms.co.il
    *שם האירוע שיופיע על כריכת האלבום:* בדיקה אוטומטית
    *גודל האלבום הנבחר:* בינוני - 20X54
    *כתובת למשלוח:* גבעתיים, שינקין 1, דירה: 2, מיקוד: 1231234
  buttons:
    - אישור
    - עריכה
- אישור

Blesstests

https://github.com/SonOfLilit/blesstest/tree/main
https://github.com/SonOfLilit/todont/blob/main/test_todont.py
https://github.com/SonOfLilit/rps/blob/main/tests.py

Snapshots + Git + DSLs + Variation trees = MAGIC

Final Exam

Design a test suite for your largest work project that would minimize the friction of adding test cases.

Key takeaways

Testing is risk management

Choose test cases systematically

Different problems need different approaches

Minimize friction to add test cases

aur@sarafconsulting.com