Neural Nets - Fork My Brain

# [[Neural Nets]]

%% Notes from the morning tutorial Noel gave me on **how neural networks actually work**, on [[2026-06-21]] in Setúbal. He walked it from a single neuron up through deep learning, LLMs, and the human brain, sketching it on the reMarkable as we went and sending two cleaner diagrams over WhatsApp afterward (embedded below). This note ties all of that together. %%

A neural network is a system that learns to turn inputs into outputs by example, rather than by being explicitly programmed with rules. It's loosely inspired by the brain: lots of very simple units (neurons), each doing nothing more than a weighted sum followed by a nonlinear squish, wired together in **layers**. On its own a single neuron is almost trivial, but stacked by the thousands or millions they can approximate astonishingly complex functions: recognizing a handwritten digit, classifying an image, or generating text.

The crucial part is that you never hand-write the logic. You show the network many examples of inputs paired with the outputs you want, and a **training** process gradually tunes the internal numbers (the weights) until the network produces the right answers on its own and, ideally, generalizes to inputs it has never seen. Everything it works on, whether an image or a sentence, first gets **mapped into numbers**, so under the hood it really is all just arithmetic.

Here's how I understand it, built up from the smallest piece to the largest.

## A single neuron

The basic unit is a neuron. It takes several **inputs**, multiplies each by a **weight**, sums them, adds a **bias**, and passes the result through a **nonlinear activation function** to produce one **output**.

- Inputs: $x_1, x_2, \dots, x_n$ (each scaled to roughly $[0, 1]$)
- Weights: $w_1, w_2, \dots, w_n$ (one per input)
- Bias: $b$ (an offset)
- Output: $y$

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$

where $f(\cdot)$ is an activation function such as **ReLU, sigmoid, or tanh**.
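The formula above is short enough to sketch directly in plain Python. This is a minimal illustration rather than anything from the tutorial: the function names (`neuron`, `sigmoid`, `relu`) and the example numbers are made up for the sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    """Pass positive values through unchanged, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for one neuron."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum plus bias
    return activation(z)                                    # nonlinear squish


# Hypothetical values, chosen only for illustration.
x = [0.5, 0.2, 0.9]
w = [0.4, -0.6, 0.1]
b = 0.05

y = neuron(x, w, b)  # weighted sum is about 0.22, then passed through the sigmoid
print(y)
```

Passing `activation=relu` changes only the final squish; the weighted sum underneath is identical.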
The thing that finally clicked for me: **the weights are not inputs.** They start as random values and get *learned* through training. The inputs and the expected outputs are what you supply; the weights are what the network figures out for itself.

## The perceptron

A perceptron is the simplest kind of neuron: the same weighted-sum structure, but with a **step activation function** instead of a smooth one. It fires (outputs 1) when the weighted sum clears a threshold, and stays off (outputs 0) otherwise. Folding the threshold into the bias $b$ puts the cutoff at zero:

$z = \sum_{i=1}^{n} w_i x_i + b \qquad y = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases}$

That makes it a **linear classifier**: it draws a single straight decision boundary through the input space. That limitation matters later (see XOR, below).

## How a network learns (training)

Training is just **repeatedly nudging the weights to reduce error**:

1. Start with the weights in some random state.
2. Push a batch of inputs through the network and read off the actual outputs.
3. Compare each actual output to the expected output and measure the **error**:

   $E = \sum_i (y_i - \text{expected}_i)$

   (This is the napkin version. In practice the error is *squared* (mean squared error) or measured with cross-entropy, so that positive and negative misses don't cancel each other out. But the idea is the same: collapse "how wrong was the network" into a single number.)
4. Take the **gradient** of that error with respect to the weights, $\nabla E$, which tells you how much to adjust each weight (by $\Delta w_i$) to move the output closer to what you wanted.
5. Repeat.

The *art* of it comes down to two tuning questions: **how long you accumulate errors before applying an update**, and **how big a step you take each time** (in modern terms, batch size and learning rate). The clever historical breakthrough was realizing you *can* compute the right $\Delta w_i$ for **every** weight inside the network, not just the last layer. That's backpropagation.
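The five training steps can be sketched for the smallest possible case: one sigmoid neuron learning the OR function by gradient descent. Several things here are assumptions for the sketch and not from the tutorial: the squared-error version of $E$ (not the napkin version), full-batch updates (errors are accumulated over all four examples before each step), and the hypothetical names `train`, `predict`, `total_error`, and `OR_EXAMPLES`. With a single neuron there is nothing to backpropagate through, so the chain rule is applied just once.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Forward pass of one sigmoid neuron."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, examples):
    """Squared error summed over all examples: one number for 'how wrong'."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in examples)


def train(examples, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Step 1: start with the weights in some random state.
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):                      # Step 5: repeat.
        grad_w = [0.0] * n
        grad_b = 0.0
        for x, target in examples:
            y = predict(weights, bias, x)        # Step 2: run the inputs through.
            # Steps 3-4: derivative of (y - target)^2 with respect to the
            # weighted sum z, using sigmoid'(z) = y * (1 - y).
            dz = 2.0 * (y - target) * y * (1.0 - y)
            for i in range(n):
                grad_w[i] += dz * x[i]           # accumulate over the whole batch
            grad_b += dz
        # Step against the gradient; learning_rate sets how big the step is.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Inputs paired with expected outputs for OR.
OR_EXAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
               ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = train(OR_EXAMPLES)
for x, target in OR_EXAMPLES:
    print(x, target, round(predict(weights, bias, x), 3))
```

The two tuning questions show up as concrete choices: the inner loop decides how many examples are accumulated before an update, and `learning_rate` decides the step size.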
## Why it's all "just numbers": the digit-recognition example

It used to confuse me that people call neural nets "just numbers" even for LLMs that clearly output text. The resolution is that there's always a **mapping function** that translates a real thing (an image, a piece of text) into numbers (for text, those are **tokens**), runs the net, and maps the numeric output back.

The example that makes it concrete is classic MNIST-style digit recognition:

- Take an image of a handwritten **"2"**. It's a bitmap: a grid of pixels, each pixel a number (real MNIST pixels are grayscale values, usually normalized to $[0,1]$; I'm showing them as `0`s and `1`s just to keep the picture simple).
- **Flatten** that grid into a long input vector of numbers. Each pixel becomes one input neuron.
- The **output** is a vector of ten values, one per digit class: `0 1 2 3 4 5 6 7 8 9`.
- For a "2", the expected output vector is a **1 in the "2" position and 0 everywhere else** (one-hot encoding). Training teaches the net to produce that.

There's also a reason **everything is squashed into a bounded range**: if one neuron emitted a huge value (say a million), it would dominate every downstream multiplication and **drown out the rest of the signal**, like a laser shone into your eye, where you only see the bright point and nothing else.

## The XOR problem and the birth of hidden layers

This is the historical turning point. **Marvin Minsky** and Seymour Papert pointed out in 1969 that a single perceptron **cannot compute XOR** (output is 1 when the two inputs differ, i.e. $x_1 \ne x_2$). There's no single set of weights that draws one straight line separating the XOR cases; the problem isn't *linearly separable*. That objection nearly killed the field.

The fix people eventually found was to insert a **hidden layer** of neurons between input and output. With an intermediate layer, the network can represent the nonlinear function. This is the whole reason we talk about **layers** at all.
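One way to see why a hidden layer fixes XOR is to wire it by hand. The sketch below uses step-activation perceptrons with weights picked manually (an illustrative choice, not learned and not from the tutorial): one hidden unit computes OR, the other computes AND, and the output unit fires for "OR but not AND". The names `perceptron` and `xor_net` are hypothetical.

```python
def step(z):
    """Step activation: 1 when z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def perceptron(inputs, weights, bias):
    """Weighted sum plus bias, pushed through the step function."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_net(x1, x2):
    # Hidden layer: two perceptrons looking at the same two inputs.
    h_or = perceptron([x1, x2], [1, 1], -0.5)    # fires if at least one input is 1
    h_and = perceptron([x1, x2], [1, 1], -1.5)   # fires only if both inputs are 1
    # Output layer: fires when OR is on and AND is off.
    return perceptron([h_or, h_and], [1, -1], -0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

Each unit on its own still draws a single straight line; it is the combination of two lines in the hidden layer that carves out the XOR region.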
How layers connect:

- Input neurons → one or more **hidden layers** → output neuron(s).
- The layers are **fully connected**: every neuron in a layer connects to every neuron in the *next* layer.
- In this basic feedforward picture there's **no cross-layer or within-layer skipping**: signal flows forward, layer to layer.

For the digit example there are ten output neurons (one per class). For a 2-megapixel input image you'd have on the order of **millions of input neurons** (about two million at one per pixel, more with separate color channels), with the hidden layers doing the heavy lifting in between.

## Deep learning, GPUs, and modern architectures

- **1980s–90s:** shallow networks with a hidden layer or two were the norm. Convolutional nets (e.g. LeCun's LeNet for digit recognition) already existed in this era, but they were small by today's standards.
- **The deep-learning explosion came around 2012**, not in the 1990s. That's when AlexNet, a large convolutional net trained on **GPUs**, blew away the ImageNet competition. The architecture ideas (convolution and pooling: shifting data around or taking the maximum over a region, simple operations applied to very large blocks of data) were older; what changed was the scale, the data, and the GPU horsepower to train millions of neurons in reasonable time. GPUs are what unlocked the modern era.
- **Recurrent vs. attention (these are *not* the same thing):**
  - A **recurrent** network (RNN) feeds its own outputs back in as it processes a sequence, giving it a form of memory across steps.
  - **Attention** is a different mechanism: it lets the network look at all parts of the input at once and weigh *which parts matter most* for each output. The **transformer** architecture (2017) is built on attention and actually *removed* recurrence rather than adding it. Modern LLMs are transformers, so it's **attention**, not recurrence, that's the real foundation of today's LLMs.
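To make "weigh which parts matter most" concrete, here is a stripped-down sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real transformers derive the queries, keys, and values from learned projection matrices and run many attention heads in parallel, none of which is shown. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """For each query, score every key, then blend the values by those scores."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match each position in the input?
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the values: high-weight positions dominate.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input with three positions (values invented for illustration).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[0.0, 4.0]]                       # lines up best with the second key

outputs, attn = attention(queries, keys, values)
print(attn[0])     # the second position gets the largest weight
print(outputs[0])
```

Nothing in the loop depends on the order of the positions or on a previous step's output, which is the contrast with a recurrent network: every position is scored against the query at once.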
Two things worth holding onto for scale intuition: an **ECG-style medical image classifier** sits at **ResNet scale**, not LLM scale; and a neat pattern in practice is the **orchestration LLM**: a cheap, fast "router" model triages a request and sends it to the right specialist model. (That's the same idea behind running local open-weight models in a stack: a coding/"thinky" model, a general model, and a fast router, ideally with more specialists than just a couple.)

---

## Diagram 1: Neural Nets 101 (the neuron & the perceptron)

![[2026-06-21-noel neural net_1.png]]

The reference card for sections 1–2 above: the top half shows a generic **neuron** (inputs × weights → weighted sum $\Sigma$ → activation $f(\cdot)$ → output, plus bias) with $y = f(\sum w_i x_i + b)$; the bottom half shows the **perceptron** with its step function and $y \in \{0,1\}$, described as a linear classifier with a decision boundary.

## Diagram 2: The Scale of Neural Networks

![[2026-06-21-noel neural net_2.png]]

This one maps out the **scale** (in neurons/parameters) across very different tasks. From simple logic to human intelligence, each step up is bigger, more capable, and more power-hungry:

| Scale | Example | Rough parameters | Training data | Power |
|---|---|---|---|---|
| **XOR function** (toy problem) | tiny 2-3-1 net | ~a handful of weights | 4 examples | microwatts ("fits in your head") |
| **Image classifier** | ResNet-50 on ImageNet | ~25 million weights | ~1.2M labeled images | ~hundreds of watts ("runs on your machine") |
| **Frontier LLM** | GPT-scale, transformer/attention | ~1–2 trillion parameters | ~trillions of tokens | ~megawatts ("runs in a data center") |
| **Human brain** | biological neural net | ~86 billion neurons, ~100 trillion synapses | continuous lifetime learning | ~20 watts ("shaped by evolution") |
| **Beyond human?** | hypothetical future AI | >100 trillion? | - | - ("larger scale, broader capability") |

The takeaway from the picture: **the brain's synapses (~$10^{14}$–$10^{15}$) are the right thing to compare against an LLM's parameter count**, and the whole spectrum is one of *increasing scale, complexity, capability, and impact*, while, strikingly, the brain does it on about 20 watts. (Numbers are approximate and vary by model/implementation.)

%%
## Sources

- [[plugins/claude/claude-output/2026-06-21 Noel]]: reMarkable handwritten notes from the tutorial (perceptron sketch, error/weight-update formulas, diagram descriptions)
- [[2026-06-21T095401Z]]: Plaud transcript of the morning session (the full spoken walkthrough)
- [[2026-06-21-noel neural net_1.png]]: "Neural Nets 101" diagram (neuron & perceptron)
- [[2026-06-21-noel neural net_2.png]]: "The Scale of Neural Networks" diagram
- [[Daily Summary - 2026-06-21]]: daily summary that ties the session to the rest of the day
%%