# Lecture 8 – Beyond the Code

### Overview

Writing code that works is necessary but not sufficient. The other half of software engineering — the half that determines whether you're effective on a team, trusted in open source, and able to build things that outlive your involvement — is everything that surrounds the code.

This lecture covers three pillars: **one-way communication** (writing for people who lack your context — comments, READMEs, commit messages), **collaboration** (contributing to and reviewing others' code), and **education** (asking and answering questions effectively). It also addresses the emerging question of AI etiquette in professional and academic settings.

#### Key Takeaways

* The **why** is more valuable than the what in almost every form of technical writing. Code shows *what* it does; only humans (and comments) can capture *why* it was done that way.
* **Good comments** explain things the code cannot: rationale, hard-learned lessons, load-bearing choices, and "why not"s. Bad comments just restate the code.
* **READMEs** should answer four questions immediately: What? Why? How to use? How to install? In that order — show before tell.
* **Commit messages** are the historical record of *why* the codebase evolved. A good body follows Problem → Solution → Implications.
* **Maintainer time is finite and oversubscribed.** High signal-to-noise contributions — clear bug reports, minimal reproducible examples, focused PRs — are far more likely to go somewhere.
* **Code review is not bureaucratic overhead.** It's one of the fastest ways to learn, and reviews catch bugs, spread knowledge, and improve quality. Your perspective as a junior reviewer has genuine value.
* **Asking good questions is a skill.** State your understanding first, ask specific yes/no questions, and don't accept incomplete answers. This applies to humans and LLMs alike.
* **Disclose AI use** when AI meaningfully contributed to your work. It sets appropriate expectations, ensures proper review, and is simply honest.

***

### Core Concepts

#### One-Way Communication: Writing for People Without Your Context

Much of what engineers produce is read without the author present to explain it. A teammate joins six months later. You return to code you wrote a year ago. A maintainer reads your PR at 11pm. In all these cases, the writing has to stand alone.

The central principle: **your job is to capture and convey the *why*, not the *what*.** The what is usually self-evident from the code. The why (the reasoning, the trade-offs, the hard-earned lessons) is easily lost to time and almost never recoverable without explicit documentation.

***

#### Comments: The Good, the Useless, and the Actively Harmful

Most code comments are useless. They either restate what the code plainly shows ("increment i by 1") or are so vague they provide no actionable information. But comments don't have to be that way.

**Comment types that are almost always worth writing:**

**TODOs with context:**

```python
# Bad — provides no actionable information
# TODO: optimize

# Good — explains what, why, and the threshold for action
# TODO: this O(n²) comparison is fine for n<100 (typical batch size),
# but will need a hash-based approach if we process larger datasets.
# See benchmark results in docs/perf/batch_comparison.md
```
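To make the trade-off in the good TODO concrete, here is a minimal sketch of the two approaches it contrasts. The duplicate-detection task and both function names are illustrative assumptions, not taken from any real codebase.

```python
from itertools import combinations


def has_duplicates_pairwise(items):
    """O(n^2): compare every pair. Simple, and fine for small batches."""
    return any(a == b for a, b in combinations(items, 2))


def has_duplicates_hashed(items):
    """O(n) on average: remember what has been seen in a set.

    Assumes the items are hashable.
    """
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


batch = ["a", "b", "c", "b"]
print(has_duplicates_pairwise(batch), has_duplicates_hashed(batch))
```

Both functions give the same answer; the TODO records when the cost of the simpler one would stop being acceptable.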

**References to external sources:**

```c
// Implements the Knuth-Morris-Pratt string search algorithm.
// Reference: https://doi.org/10.1137/0206024 (permalink)
// Divergence: we use 0-indexed positions rather than 1-indexed as in the paper.
```

**Correctness arguments:**

```python
# The parent lock must be held before acquiring the child lock,
# never the reverse. Acquiring in any other order risks deadlock
# because thread B (in gc_sweep) always acquires parent → child.
# This ordering is enforced by the lock hierarchy in LOCKING.md §3.
```
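A minimal sketch of the ordering rule that comment describes, assuming two plain `threading.Lock` objects. The names `parent_lock`, `child_lock`, `update_child`, and `sweep` are hypothetical stand-ins for whatever the real module uses.

```python
import threading

parent_lock = threading.Lock()
child_lock = threading.Lock()
counter = 0


def update_child():
    """Always take the parent lock first, then the child lock."""
    global counter
    with parent_lock:
        with child_lock:
            counter += 1


def sweep():
    """Stands in for gc_sweep: uses the same parent -> child order."""
    global counter
    with parent_lock:
        with child_lock:
            counter += 1


threads = [threading.Thread(target=f) for f in (update_child, sweep) * 50]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 100
```

Because every thread acquires the locks in the same order, no thread can hold the child lock while waiting for the parent lock, which is the situation that would deadlock.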

**Hard-learned lessons:**

```python
# Do NOT use os.rename() across filesystem boundaries on Linux —
# it raises EXDEV (Invalid cross-device link). Use shutil.move()
# instead, which falls back to copy+delete across devices.
# Spent 3 hours debugging this on the CI server (issue #847).
```
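The behavior that comment warns about can be handled explicitly, as in the sketch below. `move_file` is a hypothetical helper written for illustration; in practice `shutil.move` alone is usually enough, since it falls back to copy and delete when a rename is not possible.

```python
import errno
import os
import shutil
import tempfile


def move_file(src, dst):
    """Rename when possible; fall back to copy+delete across devices."""
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # some other failure: do not hide it
        shutil.move(src, dst)


# Demo inside one temporary directory (same filesystem, so rename succeeds).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.txt")
    dst = os.path.join(tmp, "b.txt")
    with open(src, "w") as f:
        f.write("hello")
    move_file(src, dst)
    moved = os.path.exists(dst) and not os.path.exists(src)
print(moved)
```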

**Rationale for constants:**

```python
# 16 — not arbitrary. This is the AES block size in bytes (128 bits).
# Changing this will silently corrupt encrypted data on read.
BLOCK_SIZE = 16
```

**Load-bearing choices:**

```python
# Must be a SortedDict (not a plain dict): the event dispatch loop
# below depends on iteration in key order. A plain dict iterates in
# insertion order, so switching would deliver events out of order.
event_queue = SortedDict()
```
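A short sketch of why the container choice is load-bearing. It uses only built-in types, and the event names are invented for illustration.

```python
events = {}
events[30] = "flush"
events[10] = "connect"
events[20] = "send"

# A plain dict iterates in insertion order (guaranteed since Python 3.7).
insertion_order = list(events)

# Key order has to be requested explicitly, or maintained by a sorted container.
key_order = sorted(events)

print(insertion_order)  # [30, 10, 20]
print(key_order)        # [10, 20, 30]
```

A dispatch loop that silently relies on key order would behave differently the moment someone swapped the container, which is exactly what the comment guards against.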

**"Why not"s — explaining deliberately avoided approaches:**

```python
# We do not use the requests library here, even though it's simpler.
# This module is imported in the embedded firmware build where we
# cannot install third-party packages. urllib.request is stdlib only.
```

**Comments that add noise (avoid these):**

```python
# Bad: restates the code
i += 1  # increment i by 1

# Bad: vague and unactionable
# This is a bit tricky
result = compute(x, y)

# Bad: outdated — code was changed but comment wasn't updated
# Returns a list of strings
def get_items() -> dict:  # actually returns a dict now
    ...
```

> **Respect the reader.** Resist over-explaining once you start writing comments. Explain the why; trust the reader to understand the how for their own situation. Walls of explanation get skipped entirely.

***

#### READMEs: The First-Contact Document

A README is often the first and only thing someone reads before deciding whether your project is relevant to them. Structure it as a funnel:

```
┌─────────────────────────────────────────────────────┐
│ One-liner: what does this do?                        │  ← 5 seconds
│ (optional: visual demo / screenshot / GIF)           │
├─────────────────────────────────────────────────────┤
│ Why should I care? What problem does it solve?       │  ← 30 seconds
├─────────────────────────────────────────────────────┤
│ Usage examples — show what it looks like to use it   │  ← 2 minutes
├─────────────────────────────────────────────────────┤
│ Installation instructions                            │  ← when committed
├─────────────────────────────────────────────────────┤
│ Contributing, license, full API reference, ...       │  ← deep dive
└─────────────────────────────────────────────────────┘
```

**Show usage before installation.** People want to see what they're getting before they commit to a setup process. A code example showing the API is more persuasive than three paragraphs explaining the architecture.

**Answer these four questions, in this order:**

1. What does this do? (one sentence)
2. Why should I care? (what problem it solves)
3. How do I use it? (usage examples)
4. How do I install it? (setup steps)

***

#### Commit Messages: The Historical Record

Commit messages form the historical record of *why* the codebase evolved the way it did. `git blame` is only as useful as the messages it points to.

**The anatomy of a good commit message:**

```
Short subject line (≤72 chars) — imperative mood, present tense
                                  "Fix race condition" not "Fixed race condition"

[blank line]

Body — answers:
  - What problem forced this change?
  - What alternatives did you consider?
  - What are the trade-offs or implications?
  - What might be surprising about this approach?
```

**The Problem → Solution → Implications structure** for complex changes:

```
Avoid thundering herd on cache expiry by adding jitter

Problem: All cache entries for a given key class expire at the same
second (TTL is set at write time to a fixed value). Under high load
this causes simultaneous cache misses → all requests hit the DB
simultaneously → DB overload. Observed in production at 14:32 on 2026-03-01,
caused a 45-second outage (#1241).

Solution: Add ±10% random jitter to all TTL values at write time.
Entries that previously all expired at T now expire uniformly in
[0.9T, 1.1T]. Tested under synthetic load: peak DB QPS reduced from
8,400 to ~520 during expiry windows.

Implications: Cache hit rates are marginally lower (~0.3%) due to some
entries expiring earlier than their nominal TTL. This is acceptable.
Note: if you change CACHE_TTL, the jitter range (cache.py:47) scales
automatically — no secondary change needed.
```
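The jitter described in the Solution paragraph could be sketched as follows. `jittered_ttl` and its parameters are hypothetical names; the commit message only says that the jitter range lives in `cache.py`, not how it is implemented.

```python
import random


def jittered_ttl(base_ttl, fraction=0.10, rng=random):
    """Return a TTL drawn uniformly from [base*(1-f), base*(1+f)]."""
    return base_ttl * rng.uniform(1.0 - fraction, 1.0 + fraction)


rng = random.Random(0)  # seeded only to make the demo repeatable
ttls = [jittered_ttl(300, rng=rng) for _ in range(1000)]
print(min(ttls) >= 270, max(ttls) <= 330)
```

Because the jitter is a fraction of the base value, changing the base TTL rescales the range automatically, which matches the note in the Implications paragraph.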

**Scale detail to complexity:**

* One-line typo fix → subject only is fine
* Subtle race condition or tricky optimization → paragraphs, with the problem clearly stated before the solution

**Using LLMs for commit messages — the right way:**

If you ask an LLM to write a commit message from a diff alone, it can only see the *what* and will produce a descriptive summary — the opposite of what you want. Better approaches:

```
# Option 1: Ask in the same session where you made the change
# (the LLM already has the context from your conversation)
"Write a commit message for this change. Focus on the why, not the what."

# Option 2: Tell the LLM to ask you for the missing context
"Write a commit message for this diff. You need the 'why' and the
trade-offs — ask me any questions you need before writing it."

# Option 3: Include the why in your prompt explicitly
"Write a commit message. The change fixes X because Y. We chose
this approach over Z because of the performance implications."
```

**Logical commit decomposition:** Each commit should represent one coherent change that can be reviewed independently. Don't mix refactoring with feature work. Don't bundle unrelated bug fixes. `git add -p` is your tool for staging individual hunks rather than entire files.

```bash
# Stage specific hunks interactively
git add -p

# y — stage this hunk
# n — skip this hunk
# s — split into smaller hunks
# e — manually edit the hunk
```

***

#### Contributing to Open Source

The fundamental constraint of open source: there are orders of magnitude more users than contributors, and more contributors than maintainers. **Maintainer time is severely oversubscribed.** Every interaction you have with a project should be designed to maximize signal and minimize noise.

**Writing a good bug report:**

```markdown
## Environment
- OS: Ubuntu 22.04 / macOS 14.3 / Windows 11
- Version: v2.3.1 (git: f552b55)
- Python: 3.12.1
- Relevant config: [paste relevant config, redact secrets]

## Expected behavior
Clicking "Export to CSV" downloads a file named `report.csv`
containing all rows currently visible in the table.

## Actual behavior
A 500 error is returned. The server log shows:
  UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013'

## Steps to reproduce
1. Log in as an admin user
2. Navigate to /reports/monthly
3. Apply filter: "date range = last 30 days"
4. Click "Export to CSV"

## What I've tried
- Verified the bug on a fresh install (rules out local config)
- Confirmed it only occurs with rows containing en dashes (U+2013) in the "Notes" field
- Does NOT occur when exporting without the date filter applied

## Minimal reproducible example
[link to a repo / gist / test case that isolates the bug]
```
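Collecting the environment details by script rather than from memory reduces transcription mistakes. The sketch below is one possible way to do it with the standard library; `environment_report` and its `app_version` parameter are hypothetical, and a real project may need additional fields.

```python
import platform
import sys


def environment_report(app_version="unknown"):
    """Build the '## Environment' section of a bug report as Markdown."""
    lines = [
        "## Environment",
        f"- OS: {platform.system()} {platform.release()}",
        f"- Version: {app_version}",
        f"- Python: {platform.python_version()}",
    ]
    return "\n".join(lines)


report = environment_report(app_version="v2.3.1")
print(report)
```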

**Before filing:**

* Search existing issues. Your bug may already be reported.
* "Me too" comments and bare terminal output pastes are net-negative noise.
* Polite follow-up after 2 weeks is fine; daily pings are not.

**Security vulnerabilities:** Never disclose publicly. Contact maintainers privately first (look for `SECURITY.md`). Give them reasonable time to fix before any disclosure.

**Making a code contribution:**

1. Read `CONTRIBUTING.md` and follow it exactly
2. Start small — a typo fix or documentation improvement is a valid first PR
3. Check the license before contributing (copyleft licenses like GPL have implications for your employer)
4. Isolate your change — one PR, one purpose
5. Explain the *why* in the PR description, not just the *what*
6. Call out parts warranting special review attention
7. Document trade-offs you made — maintainers are accepting long-term maintenance responsibility

```markdown
## Why this change is needed
[Explain the problem, not just the solution]

## What changed
[Brief description of the implementation]

## Trade-offs and alternatives considered
[What else did you consider? Why did you pick this approach?]

## How to test
[Specific steps to verify the change works]

## Special attention
[Any areas you're less confident about, or that require careful review]
```

> **Forking:** Prefer contributing upstream over forking. Fork only when your changes are genuinely out of scope for the original project. If you fork, acknowledge the original.

> **AI-generated contributions:** Using AI to help identify issues and produce fixes is fine — but you must do the due diligence to understand and polish the result. Submitting AI-generated code you can't explain burdens maintainers with reviewing and maintaining code its own author doesn't understand. The maintenance burden is transferred, not eliminated.

***

#### Code Review

You will be asked to review code earlier in your career than you expect. Your perspective as someone less familiar with the code is genuinely valuable — fresh eyes catch assumptions that experts have stopped noticing, and questions that expose hidden complexity often trigger simplifications that improve the code for everyone.

**Code review serves multiple functions:**

* Catches bugs and edge cases before production
* Spreads knowledge across the team (reviewer learns; author gets a second opinion)
* Maintains consistent code quality and architecture
* Creates a record of design decisions for future maintainers

**Principles of effective review:**

| Principle                          | Weak                        | Strong                                                                                                            |
| ---------------------------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| Review code, not person            | "You wrote confusing code." | "This function is confusing to follow — could we add a comment explaining the branching logic?"                   |
| Prefer actionable comments         | "Don't use globals here."   | "Can you replace these globals with a config dataclass? It would make the initialization order explicit."         |
| Ask, don't demand                  | "Handle the null case."     | "What happens if `user` is null here — should we return early or raise?"                                          |
| Explain the why                    | "Use a constant here."      | "Consider a named constant here so the timeout value can be adjusted per environment without touching this file." |
| Distinguish blocking from optional | (mixed)                     | "Blocking: this can cause data loss under concurrent writes. Nit: the variable name could be more descriptive."   |
| Acknowledge good work              | (silence)                   | "Nice approach here — using the context manager means the lock is always released even on exception."             |
| Know when to stop                  | Every single nit            | Fix minor style nits yourself after merge; spend review energy on substance                                       |

**AI in code review:** AI tools can catch certain classes of issues (typos, obvious bugs, style violations) and are worth using as a first pass. But they miss product context, don't understand business requirements, and can confidently suggest wrong things. They augment human review; they don't replace it.

> When receiving feedback: your code is not you. Reviewers are trying to make the code better. Disagreements are an opportunity — ask clarifying questions; you might learn something, or they might.

***

#### Asking Good Questions

Asking well-formed questions is a skill that makes you faster at learning from anyone, not just expert explainers. It applies to colleagues, to Stack Overflow, and equally to LLMs.

**The principles:**

**State your understanding first:**

```
# Weak: "How do SQL joins work?"
# Strong: "My understanding is that a LEFT JOIN returns all rows from the
#          left table and matching rows from the right, with NULLs where
#          there's no match. Is that right?"
```

This helps the answerer identify your actual knowledge gap, rather than starting from scratch.
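Stating your understanding is easier when you have tested it first. The sketch below checks the LEFT JOIN claim with Python's built-in `sqlite3` module; the `users` and `orders` tables are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 'book');
    """
)

rows = conn.execute(
    """
    SELECT users.name, orders.item
    FROM users
    LEFT JOIN orders ON orders.user_id = users.id
    ORDER BY users.id
    """
).fetchall()
conn.close()

print(rows)  # [('ada', 'book'), ('bob', None)]
```

The row for `bob` survives the join with `None` (SQL NULL) in the right-hand column, which confirms the understanding stated in the strong question.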

**Ask specific, answerable questions:**

```
# Weak: "Why doesn't my code work?"
# Strong: "Does calling list() on a generator consume it? I'm seeing the
#          second iteration return empty and I think that might be why."
```
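A question this specific can often be answered with a few lines of code before you ask anyone. A minimal sketch, using a made-up `squares` generator:

```python
def squares(n):
    """Yield the first n square numbers lazily."""
    for i in range(n):
        yield i * i


gen = squares(3)
first = list(gen)   # drains the generator
second = list(gen)  # nothing left to yield

print(first)   # [0, 1, 4]
print(second)  # []
```

If the experiment settles the question, you have saved someone's time; if it does not, you can include it in the question as evidence.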

**Ask yes/no questions:** "Is X true?" tends to either confirm or trigger useful elaboration — both productive outcomes. "Explain X" often produces a lecture that may not address your specific confusion.

**Admit when you don't understand:** Interrupt to ask about unfamiliar terms. This signals intellectual engagement, not weakness. "I don't know, but I think..." or "I don't know, but I can find out" are both honest and useful answers when the question is turned back on you.

**Do some investigation first:** Basic research helps you ask more targeted questions and shows respect for the other person's time. But casual questions among colleagues don't require full investigation — calibrate to context.

**Don't accept incomplete answers:** Nodding along when someone asks "Does that make sense?" is not the same as understanding. If you're still confused, keep asking follow-ups. Well-formed follow-ups often benefit not just you but everyone in the conversation.

***

#### AI Etiquette

The professional and social norms around AI use in software engineering are still forming. Some principles worth internalizing now:

**Disclose when AI meaningfully contributed:** This isn't about shame. It's about honesty and ensuring work gets the level of review it actually needs. Also disclose *which parts* — there's a meaningful difference between "this was entirely vibe-coded" and "I wrote the core logic and used an LLM to draft the frontend styling."

```
# In a PR description
Note: the migration SQL in db/migrations/ was generated with Claude.
I reviewed it against the schema and tested it locally, but I'd appreciate
extra scrutiny on the rollback logic (lines 45–62) where I'm less confident.
```

**Follow your team's policies:** Some organizations have strict policies around AI tools — for compliance, data residency, or IP reasons. When in doubt, ask. Being transparent about your tooling prevents accidentally violating policies with serious consequences.

**Learning goals vs. output goals:** If you're trying to learn, having an AI do the work for you mostly teaches you how to prompt, not the subject matter. When learning is the goal, the journey matters — use AI to unblock specific confusion, not to skip the thinking.

**Interviews and assessments:** Assessment situations are intended to evaluate *your* capabilities. If AI use is unclear, ask. If it's explicitly prohibited, don't use it — "almost got away with it" has a bad expected value.

***

### Mental Models

#### The Why/What Asymmetry

Code is very good at recording *what* it does. It is completely silent about *why*. This asymmetry shows up everywhere:

```
What the code shows          What only a human can record
─────────────────────────    ─────────────────────────────────
The algorithm used           Why this algorithm, not the obvious one
The constant value           Why this specific value
The data structure chosen    Why this structure and not the simpler one
The error-handling path      What failure mode this was defending against
The ordering of operations   Why this order is required for correctness
The library used             Why not the more popular alternative
```

Every time you write a comment or commit message, ask: "Am I recording the *what* (already visible) or the *why* (only I know)?" Default to the latter.

***

#### The Maintainer's Perspective

When you interact with open source, mentally swap roles: you are the maintainer. A hundred people filed issues today. Fifty of them are missing reproduction steps. Thirty are duplicates. Ten are feature requests for things explicitly out of scope. You have three hours.

Which contributions do you triage first? The ones where someone clearly investigated the problem, provided a minimal reproduction, and explained what they've already tried. The ones that save you work rather than creating it.

```
Maintainer time budget (notional):
  Reading and triaging a vague bug report:     20 min
  Reading and triaging a complete bug report:   5 min
  Extracting a repro from a user:              30 min
  Using a provided minimal repro:               0 min extra
  Reviewing a 2000-line PR with mixed concerns: 3 hours
  Reviewing a focused 200-line PR:             45 min
```

Making your contributions low-friction is not just courtesy — it's the practical difference between your issue getting fixed and it getting closed.

***

#### Context Gradient in Code Review

A codebase has a context gradient: some people have deep context (wrote the original code, know why it was designed that way), and others have little (just joined, reading it fresh). Both perspectives are valuable in review for different reasons:

```
High context reviewer:
  ✅ Spots violations of subtle invariants
  ✅ Knows which parts are fragile
  ✅ Understands the original design intent
  ❌ May miss things they've "stopped seeing"
  ❌ May assume knowledge the PR author doesn't have

Low context reviewer:
  ✅ Fresh eyes on readability and clarity
  ✅ Will ask questions that reveal hidden assumptions
  ✅ Spots over-complexity that experts have normalized
  ❌ May not catch domain-specific subtle bugs
```

As a junior reviewer, you are a low-context reviewer. This is a feature, not a bug. Your confusion is signal. "I don't understand what this function does" is a valid and valuable review comment.

***

### Diagrams

#### The Comment Decision Tree

```mermaid
graph TD
    Q["You're about to write a comment.\nWhat does it say?"]

    WHAT["It describes *what*\nthe code does"]
    WHY["It explains *why*\nthe code is this way"]
    TODO["It marks incomplete\nor deferred work"]

    SKIP["❌ Skip it.\nThe code already shows this.\nA comment would just add noise."]
    WRITE["✅ Write it.\nThis is knowledge only you have."]
    TODOWRITE["✅ Write it — but include:\n• What's outstanding\n• Why it was deferred\n• Threshold for action"]

    Q --> WHAT --> SKIP
    Q --> WHY --> WRITE
    Q --> TODO --> TODOWRITE

    style SKIP fill:#4a0000,color:#fff
    style WRITE fill:#2d4a22,color:#fff
    style TODOWRITE fill:#2d4a22,color:#fff
```

***

#### Commit Message Quality Spectrum

```mermaid
graph LR
    BAD["❌ Bad\n\n'fixed bug'\n'wip'\n'changes'\n'update stuff'"]
    OK["⚠️ Okay\n\n'Fix null pointer in login'\n(describes what, not why)"]
    GOOD["✅ Good\n\n'Fix null pointer in login\n\nUser object can be null when\nsession expires mid-request.\nAdded null check + redirect\nto login page. Fixes #392.'"]
    GREAT["🏆 Great\n\nProblem → Solution\n→ Implications\n\n(for complex changes)"]

    BAD -->|"make subject specific"| OK -->|"add body"| GOOD -->|"add trade-offs\nand implications"| GREAT

    style BAD fill:#4a0000,color:#fff
    style OK fill:#4a2d00,color:#fff
    style GOOD fill:#2d4a22,color:#fff
    style GREAT fill:#1a3a5c,color:#fff
```

***

#### Open Source Contribution Signal-to-Noise

```mermaid
graph TD
    subgraph "❌ Low Signal (likely to be ignored or closed)"
        N1["Vague bug report\n'It doesn't work'"]
        N2["Missing repro steps\nor environment info"]
        N3["Duplicate issue\n(didn't search first)"]
        N4["PR mixing unrelated\nchanges or refactors"]
        N5["'Me too' comment\nwith no new info"]
    end

    subgraph "✅ High Signal (likely to get traction)"
        S1["Complete bug report:\nenv + expected + actual\n+ repro + what I tried"]
        S2["Minimal reproducible\nexample isolating the bug"]
        S3["Focused PR:\none change, one purpose\nexplains the why"]
        S4["Follow-up after 2 weeks\nif no response (once)"]
        S5["Adds info to\nexisting issue"]
    end

    style N1 fill:#4a0000,color:#fff
    style N2 fill:#4a0000,color:#fff
    style N3 fill:#4a0000,color:#fff
    style N4 fill:#4a0000,color:#fff
    style N5 fill:#4a0000,color:#fff

    style S1 fill:#2d4a22,color:#fff
    style S2 fill:#2d4a22,color:#fff
    style S3 fill:#2d4a22,color:#fff
    style S4 fill:#2d4a22,color:#fff
    style S5 fill:#2d4a22,color:#fff
```

***

#### The README Funnel

```mermaid
graph TB
    R1["⚡ One-liner + demo / screenshot\n(5 seconds — does this solve my problem?)"]
    R2["Why should I care?\nWhat problem does it solve?\n(30 seconds — is this worth my time?)"]
    R3["Usage examples\nShow what it looks like to use it\n(2 minutes — can I see myself using this?)"]
    R4["Installation\n(committed — I've decided I want this)"]
    R5["Contributing / License / Full API reference\n(deep dive — I'm using this seriously)"]

    R1 --> R2 --> R3 --> R4 --> R5

    BAIL1["Most readers stop here\nif the one-liner doesn't click"]
    BAIL2["Many readers stop here\nif no usage example exists"]

    R1 -.-> BAIL1
    R3 -.-> BAIL2

    style R1 fill:#1a3a5c,color:#fff
    style R3 fill:#2d4a22,color:#fff
    style BAIL1 fill:#4a0000,color:#fff
    style BAIL2 fill:#4a0000,color:#fff
```

***

#### Question Quality Framework

```mermaid
graph TD
    Q["You have a question."]

    VAGUE["❌ Vague\n'How does X work?'\n'Why is my code slow?'"]
    SPECIFIC["✅ Specific\n'Does a LEFT JOIN include rows\nwhere the right table has no match?'\n'Does calling list() consume a generator?'"]

    NOCONTEXT["❌ No context\n'It's not working'"]
    CONTEXT["✅ State your understanding\n'My understanding is Y — is that right?'\nHelps answerer find your actual gap"]

    ACCEPT["❌ Accept vague answer\n'Oh okay, thanks'"]
    FOLLOWUP["✅ Follow up until understood\n'I'm still not sure about X specifically —\ncan you say more about that part?'"]

    Q --> VAGUE
    Q --> SPECIFIC
    Q --> NOCONTEXT
    Q --> CONTEXT
    Q --> ACCEPT
    Q --> FOLLOWUP

    style VAGUE fill:#4a0000,color:#fff
    style NOCONTEXT fill:#4a0000,color:#fff
    style ACCEPT fill:#4a0000,color:#fff
    style SPECIFIC fill:#2d4a22,color:#fff
    style CONTEXT fill:#2d4a22,color:#fff
    style FOLLOWUP fill:#2d4a22,color:#fff
```

***

### Practical Workflows

#### Writing a Good Commit Message with LLM Assistance

```bash
# ❌ Weak: LLM sees only the diff — produces a descriptive summary (the what)
git diff HEAD | llm "Write a commit message for this."
# → "Adds null check to login handler" (useless)

# ✅ Better: Ask for the why and tell the LLM to query you for context
git diff HEAD | llm "Write a commit message focused on the *why* (not what
the code does, but why this change was necessary, what was wrong before,
what alternatives were considered, and what the trade-offs are).
Ask me questions for any missing context before writing."

# ✅ Best: Ask in the same session where you made the change
# (the LLM already has the reasoning context from your conversation)
# At the end of the coding session, send this prompt in the chat:
#   "Write a commit message for everything we've done in this session.
#   Focus on the why, the problem we were solving, and the approach we chose
#   over alternatives."
```

***

#### Using `git add -p` to Compose Logical Commits

```bash
# You've made changes to two files, and they represent two separate concerns:
# 1. A bug fix in auth.py
# 2. A refactor of the query builder in db.py

# Stage only the bug fix hunks interactively
git add -p

# For each hunk, you'll be prompted:
# Stage this hunk [y,n,q,a,d,s,e,?]?
# y — include this hunk in the commit
# n — skip this hunk (stage later)
# s — split into smaller hunks if possible
# q — quit; remaining hunks go unstaged

git commit -m "Fix session expiry null pointer in login handler

User object can be null when the session expires between the auth
check and the profile load (race condition under high load, ~1 in
10,000 requests in prod). Added null check and redirect to login.

Previously this caused a 500 that appeared as a generic error page.
Now returns a clean redirect to /login?reason=session_expired.
Fixes #392."

# Now stage and commit the refactor separately
git add -p  # stage db.py changes
git commit -m "Refactor query builder to use method chaining

Reduces the boilerplate in call sites from 5-7 lines to 2-3.
No behavior change — all existing tests pass unchanged.
Motivated by the repeated pattern in the new reporting module."
```

***

#### Creating a Minimal Reproducible Example

```python
# Original bug context: 500 error when exporting CSV with certain characters

# Step 1: Identify the smallest failing case.
# Start with the real code and strip away everything irrelevant.
# Iteration 1 (in a REPL): does it fail with just the encoding?
#   >>> "report\u2013summary".encode("latin-1")
#   UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013'
# ✅ Reproduced: this is the core issue

# Step 2: Confirm the fix is isolated
#   >>> "report\u2013summary".encode("utf-8")
#   b'report\xe2\x80\x93summary'  ✅

# Step 3: Write the minimal repro as a standalone script
"""
Minimal reproduction of CSV export encoding bug (#847)

Expected: export succeeds for rows containing an en dash (U+2013)
Actual: UnicodeEncodeError raised by latin-1 encoder

Python 3.12.1, Ubuntu 22.04
"""
import csv
import io

rows = [["Notes"], ["project\u2013update"]]  # en dash in data
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
# This succeeds: the bug is in the downstream encoding, not csv.writer

# The fix:
buf.getvalue().encode("utf-8")    # ← works
# The bug (kept last so the script ends with the traceback):
buf.getvalue().encode("latin-1")  # ← raises UnicodeEncodeError
```

***

#### Structuring a Pull Request Description

```markdown
## Problem
When a user's session expires between the authentication check and the
profile data load, `current_user` can be `None`. This causes an
unhandled `AttributeError` that manifests as a 500 error with a
generic error page — no information about the cause is shown to the user
or logged in a way that's easy to search.

Observed in production logs ~15 times in the last 30 days, always
during peak load (>500 concurrent users).

## Solution
Added a null check immediately after `current_user` is retrieved.
If null, the request is redirected to `/login?reason=session_expired`
with a 302, rather than proceeding to load profile data.

The redirect URL is constructed in `auth/utils.py` (existing helper) so
the pattern is consistent with other session-expiry redirections in the codebase.

## Alternatives considered
- **Extend session TTL**: would reduce frequency but not fix the root cause;
  also has privacy implications (sessions staying alive longer).
- **Re-authenticate silently**: too complex, out of scope for this fix.

## Trade-offs
The redirect adds one round-trip for the affected ~0.01% of requests.
Acceptable given the alternative is a 500 error.

## Testing
1. Log in, then manually expire the session cookie
2. Navigate to any authenticated page
3. Expected: redirect to `/login?reason=session_expired`
4. Existing auth tests still pass (no regression)

## Special attention
Line 47 in `auth/middleware.py` — I'm not 100% sure whether
`request.session.flush()` should also be called here or whether the
redirect is sufficient. Would appreciate a second opinion.
```
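For readers who want to picture the change this description refers to, here is a framework-free sketch of the guard. `Response`, `load_profile`, and `LOGIN_REDIRECT` are hypothetical names invented for illustration; the real code in `auth/middleware.py` and the helper in `auth/utils.py` are not shown in the source.

```python
from dataclasses import dataclass, field
from typing import Optional

LOGIN_REDIRECT = "/login?reason=session_expired"


@dataclass
class Response:
    """Hypothetical stand-in for a web framework's response type."""
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def load_profile(current_user: Optional[dict]) -> Response:
    """Return the profile page, or redirect if the session has expired."""
    if current_user is None:
        # Session expired between the auth check and this call.
        return Response(status=302, headers={"Location": LOGIN_REDIRECT})
    return Response(status=200, body=f"Profile for {current_user['name']}")


print(load_profile(None).headers["Location"])
print(load_profile({"name": "ada"}).status)
```

The guard sits before any attribute access on `current_user`, so the expired-session path never reaches the code that previously raised `AttributeError`.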

***

### Common Mistakes

#### ❌ Comments That Restate the Code

**Wrong:**

```python
i += 1       # increment i by 1
x = x * 2   # multiply x by 2
return user  # return the user
```

**Correct:**

```python
i += 1       # advance past the header byte (format spec §3.2)
x = x * 2   # double the retry interval (exponential backoff)
return user  # returning None here breaks the login redirect chain (see #441)
```

**Why:** Parroting the code adds characters but no information. Comments should tell readers something the code cannot tell them itself.
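
To see a "why" comment in a slightly larger context, here is a minimal sketch built around the backoff line from the example above. The function name `retry_delays`, its parameters, and the cap are assumptions made for illustration, not part of the lecture.

```python
from typing import List


def retry_delays(base: float, attempts: int, cap: float) -> List[float]:
    """Return the wait time before each retry attempt (illustrative only)."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(delay)
        # Double the interval so repeated failures back off quickly
        # instead of hammering a service that is already struggling.
        # The cap is an assumed requirement: it keeps the worst-case
        # wait bounded so a caller is never stalled indefinitely.
        delay = min(delay * 2, cap)
    return delays


print(retry_delays(1.0, 5, 10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

A reader can see from the code that the delay doubles and is capped. Only the comments say why either choice was made.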

***

#### ❌ Vague TODO Comments

**Wrong:**

```python
# TODO: optimize this
# TODO: fix edge case
# TODO: clean up
```

**Correct:**

```python
# TODO: this O(n²) scan is fine for n<50 (current max users per org),
# but will need a hash-based lookup if we add multi-org support.
# Blocked on schema decision in RFC-12.

# TODO: handle the case where uploaded_at is null (legacy data from
# before the 2024 migration). Low priority — affects ~200 rows.
# Query to find affected records: SELECT id FROM files WHERE uploaded_at IS NULL
```

**Why:** Vague TODOs are noise. A TODO without enough context for someone else to act on it is worse than no TODO — it creates the impression of documentation while providing none.

***

#### ❌ Describing *What* in Commit Messages

**Wrong:**

```
Add null check to current_user in middleware

Added a check for None on current_user and redirected to login page.
```

**Correct:**

```
Fix session expiry crash in authenticated middleware

Without this, a race between session expiry and the profile load causes
current_user to be None, triggering an AttributeError → 500 response.
This manifested as a generic error page ~15 times/month in production.

Added null guard + redirect to /login?reason=session_expired.
Chosen over extending session TTL (doesn't fix the root cause) or
silent re-auth (too complex for this fix). Resolves #392.
```

**Why:** "Added a null check" is already visible from the diff. The reader needed to know *why* the null check is necessary and why this fix was chosen over alternatives.

***

#### ❌ Submitting a Bug Report Without Reproduction Steps

**Wrong:**

```
Subject: Export is broken

When I try to export, I get an error. This is a critical bug.
Please fix immediately.
```

**Correct:** Complete environment info, expected vs. actual behavior, exact reproduction steps, what you've already tried, and ideally a minimal reproducible example.

**Why:** A maintainer receiving the weak version must either spend 20+ minutes asking follow-up questions or close the issue entirely. The complete version might be actionable in 5 minutes. Which one gets fixed first?
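
As a rough illustration, the checklist in the **Correct** item can be expressed as a tiny validator. This is a deliberately naive sketch: the section names and the substring matching are assumptions, and a real issue template would be stricter.

```python
from typing import List

# Assumed section labels, derived from the checklist above.
REQUIRED_SECTIONS = (
    "environment",
    "expected",
    "actual",
    "steps to reproduce",
    "already tried",
)


def missing_sections(report: str) -> List[str]:
    """Return the checklist items that the report never mentions."""
    text = report.lower()
    return [section for section in REQUIRED_SECTIONS if section not in text]


weak_report = (
    "Subject: Export is broken\n\n"
    "When I try to export, I get an error. This is a critical bug.\n"
    "Please fix immediately."
)
print(missing_sections(weak_report))
```

Run against the weak report, every item comes back as missing, which matches the follow-up questions a maintainer would have to ask.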

***

#### ❌ Code Review Comments That Attack Instead of Help

**Wrong:**

```
This is wrong. Why would you do it this way?
This function is a mess. Rewrite it.
You forgot to handle the error case.
```

**Correct:**

```
I'm confused by the ordering here — is there a reason `flush()` is called
before `close()`? In the stdlib docs it says close() already calls flush(),
so this might be a no-op (or could cause a double-flush in edge cases).

This function is doing three distinct things — could we extract the
validation logic into a helper? It would make each part easier to test.

What happens if `response` is None here? I think we'd get an
AttributeError on line 47 — should we add a guard or raise early?
```

**Why:** Aggressive review comments create a hostile environment that discourages contributions and provokes defensive reactions that derail productive discussion. Specific, curious questions lead to better outcomes.
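
A specific claim in a review is also easy to verify before posting it. The sketch below checks the first comment's claim, under the assumption that the code under review uses Python's built-in file objects (the example comment does not say which stdlib it means). The function name is hypothetical.

```python
import os
import tempfile


def written_without_explicit_flush(payload: bytes) -> bytes:
    """Write payload, close the file without calling flush(), read it back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        f = open(path, "wb")
        f.write(payload)  # data may still sit in the userspace buffer here
        f.close()         # close() flushes the buffer before releasing the file
        with open(path, "rb") as check:
            return check.read()
    finally:
        os.remove(path)


print(written_without_explicit_flush(b"hello") == b"hello")  # True
```

A thirty-second experiment like this lets the reviewer phrase the comment as an informed question rather than a guess.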

***

#### ❌ Using AI to Skip the Learning Process

**Wrong:** "I asked Claude to explain it and now I understand it" (without actually checking your understanding), or submitting AI-generated code in a context intended to evaluate your skills.

**Correct:** Use AI to unblock specific confusion, then verify you genuinely understand the result. "Claude explained X — let me test my understanding by explaining it back, implementing it myself, and asking follow-up questions about the parts that are still unclear."

**Why:** If you can't explain code you're submitting, you can't review it for correctness, maintain it when it breaks, or extend it. The output may look right; the understanding gap will surface at the worst possible time.

***

### Exercises

#### Beginner Exercises

1. **Hunt for good comments:** Browse the source of a well-known project ([Redis](https://github.com/redis/redis) or [curl](https://github.com/curl/curl) are good choices). Find at least one example each of: a useful TODO with context, a reference to external documentation, a "why not" comment explaining a deliberately avoided approach, and a hard-learned lesson. For each: what would a reader lose if that comment were removed? Could the information be recovered from the code alone?
2. **Commit message archaeology:** Pick an open-source project and explore its recent `git log`. Find one commit with a strong message (explains *why* the change was made) and one with a weak message (describes only *what* changed). For the weak one, run `git show <hash>` to read the diff. Try to write a better message following the Problem → Solution → Implications structure. Notice how much context you have to reconstruct — and how much is irrecoverably lost.
3. **README audit:** Compare the READMEs of three GitHub projects with 1000+ stars in a domain you're familiar with. Score each against the four questions: What / Why / How to use / How to install. Does the order match the funnel structure? What noise (things that don't help you decide whether to use the project) appears near the top? What would you remove or reorder?
4. **Issue quality evaluation:** Find an open issue on a project you use (filter by "good first issue" or "help wanted"). Evaluate it against the bug report criteria from the lecture. Does it include: environment, expected vs. actual, reproduction steps, what was already tried? If it's missing elements, how many follow-up questions would the maintainer likely need to ask before they could act on it?
5. **Minimal reproducible example practice:** Think of a bug you've encountered (or find one in a project's issue tracker). Practice creating a minimal reproducible example: strip away everything unrelated until you have the smallest case that still demonstrates the problem. Write down what you removed and why. Did the process of stripping things away help you understand the bug better?

***

#### Intermediate Exercises

6. **PR review analysis:** Find a merged pull request with substantive review comments (not just "LGTM") on a project you know. Read through the full review thread. Categorize each comment using the principles from the lecture: is it actionable? Does it explain the why? Does it ask a question or make a demand? Does it distinguish blocking issues from suggestions? If you were the PR author, which comments would you find most and least useful?
7. **Question quality comparison:** Go to Stack Overflow (or a community Discord/forum you're in) and find two questions on a topic you know: one with a highly-voted accepted answer, and one that was closed or heavily downvoted. Compare them against the question quality criteria from the lecture. Was it predictable which would get better answers? Apply this lens to the last question you asked someone — how could it have been better?
8. **Write a `git add -p` commit sequence:** Take your most recent changes in a project (or make some deliberate mixed changes in a test repo). Use `git add -p` to split them into at least two semantically distinct commits. Write a proper commit message for each following the Problem → Solution → Implications structure. The constraint: no commit message body should describe only *what* the diff shows.
9. **Improve a real README:** Find a GitHub project you use that has a weak README (missing usage examples, installation-first structure, no one-liner). Write an improved version following the funnel structure. (Bonus: open a PR with your improvement.)
10. **AI disclosure practice:** Review your last three significant code contributions (to a personal project, coursework, or work project). For each, write a one-sentence disclosure of what AI contributed, what you contributed, and what level of review the AI-generated parts received. Would you be comfortable sharing these disclosures with a teammate or course instructor?

***

#### Advanced Challenge

11. **Make a real open-source contribution:** Find a "good first issue" in a project you use. Work through the full contribution lifecycle:

    * Search for duplicates and add to an existing issue if relevant
    * If filing new: write a complete bug report using the template from the lecture
    * Read `CONTRIBUTING.md` and follow it exactly
    * Open a focused PR with a proper description (Problem / Solution / Trade-offs / How to test)
    * Respond to review comments using the principles from the lecture

    Document what surprised you about the process.
12. **Design a code review rubric:** For a codebase you work on (or a domain you know well), write a one-page code review checklist. It should cover: correctness, edge cases, security considerations, performance, readability, test coverage, and documentation. Distinguish must-fix from suggestion items. Use it on a real PR (your own or a colleague's) and report on which checklist items caught issues that wouldn't otherwise have been caught.
13. **Commit history reconstruction:** Find a significant feature or bug fix in a well-known open-source project (one with a detailed git history). Starting from the commit that introduced the fix, reconstruct: what problem was being solved, what alternatives were considered (look at adjacent PRs, issues, mailing list archives), and what trade-offs were made. How much of this context survived in the commit messages vs. in issue/PR discussions vs. is simply lost? Write a retrospective commit message for the change that captures what you found.

***

### Summary

| Topic                          | Core Principle                                                                                                                        |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| **Comments**                   | Capture the *why*, not the *what*. The code already shows what it does.                                                               |
| **Useful comment types**       | TODOs with context, references, correctness arguments, hard-learned lessons, magic number rationale, load-bearing choices, "why not"s |
| **READMEs**                    | Answer What / Why / Usage / Install in that order. Show before tell. Funnel structure.                                                |
| **Commit messages**            | Historical record of *why* the codebase evolved. Problem → Solution → Implications for complex changes.                               |
| **LLMs for commits**           | LLM sees only the *what* unless you provide the *why* explicitly or ask in the same session as the change.                            |
| **`git add -p`**               | Stage individual hunks for semantically coherent commits. Don't mix refactors with features.                                          |
| **Bug reports**                | Include environment, expected vs. actual, exact repro steps, what you tried. Minimal repro is gold.                                   |
| **Open source contributions**  | Maintainer time is precious. High signal-to-noise. One PR, one purpose. Explain the why.                                              |
| **AI-generated contributions** | Fine to use AI to help; not fine to submit what you don't understand. The maintenance burden is transferred, not eliminated.          |
| **Code review**                | Not bureaucratic overhead. Catches bugs, spreads knowledge, develops intuition. Fresh eyes have genuine value.                        |
| **Review principles**          | Code not person. Actionable and specific. Ask, don't demand. Explain the why. Distinguish blocking from nits.                         |
| **Asking questions**           | State your understanding first. Ask specific, answerable questions. Follow up until genuinely understood.                             |
| **AI etiquette**               | Disclose AI contributions. Follow team policies. If learning is the goal, don't skip the thinking.                                    |

#### What's Next

In **Lecture 9 – Code Quality**, you'll learn how to enforce correctness and consistency automatically: static analysis, type systems, linters, formatters, and the CI pipelines that run them on every commit — turning "code that works" into "code that provably meets its quality bar."

***

*Source:* [*MIT Missing Semester – Beyond the Code*](https://missing.csail.mit.edu/2026/beyond-code/) *Licensed under* [*CC BY-NC-SA 4.0*](https://creativecommons.org/licenses/by-nc-sa/4.0)


