    How Andrej Karpathy Built a Transformer in 243 Lines of Code

    By Riya Bansal | February 13, 2026


    AI researcher Andrej Karpathy has released microGPT, an educational project he describes as the most accessible route into GPT technology. Its 243 lines of Python code, with no external dependencies, strip away the complexity of modern deep learning systems to expose the fundamental mathematics that governs Large Language Models.

    Let’s dive into the code and figure out how he achieved such a marvellous feat so economically.

    New art project.
    Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. https://t.co/HmiRrQugnP

    — Andrej Karpathy (@karpathy) February 11, 2026

    Table of Contents

    • What Makes MicroGPT Revolutionary?
    • Understanding the Core Components
      • The Autograd Engine: Building Backpropagation
      • The GPT Architecture: Transformers Demystified
      • The Training Loop: Learning in Action
    • What Makes This a Gamechanger for Learning?
      • Complete Transparency
      • No Dependencies, No Installation Hassles
      • Debuggable and Modifiable
      • Real Results
    • Getting Started: Running Your Own GPT
    • Performance and Practical Limitations
    • Going Deeper with Experimental Ideas
    • Conclusion
    • Frequently Asked Questions

    What Makes MicroGPT Revolutionary?

    Most GPT tutorials today rely on PyTorch, TensorFlow, or JAX. These frameworks are powerful, but their user-friendly interfaces conceal the mathematical foundations. Karpathy’s microGPT takes the opposite approach: it builds everything from Python’s built-in modules alone.

    What the code leaves out:

    • No PyTorch or TensorFlow
    • No NumPy or any other numerical libraries
    • No GPU acceleration or optimization tricks
    • No concealed frameworks or hidden machinery

    What the code includes:

    • A pure-Python autograd engine for automatic differentiation
    • The complete GPT-2-style architecture, featuring multi-head attention
    • The Adam optimizer, built from first principles
    • A full training and inference pipeline
    • Working text generation that produces actual output

    This is a complete language model that trains on real data and produces coherent text. It is designed to prioritize comprehension over processing speed.


    Understanding the Core Components

    The Autograd Engine: Building Backpropagation

    Automatic differentiation is the core of every neural network framework: it is what lets a computer calculate gradients automatically. Karpathy builds a stripped-down version of PyTorch’s autograd, in the spirit of his earlier micrograd project, around a single Value class. Every number in the computation is wrapped in a Value object that tracks:

    • The actual numeric value (data)
    • The gradient with respect to the loss (grad)
    • Which operation created it (addition, multiplication, etc.)
    • How to backpropagate through that operation

    When you write a + b or a * b on Value objects, they build up a computation graph. When you later call loss.backward(), the engine applies the chain rule through that graph to compute every gradient. This is what PyTorch does at its core, minus the optimizations and GPU capabilities.

    import math

    # Let there be an Autograd to apply the chain rule recursively across a computation graph
    class Value:
        """Stores a single scalar value and its gradient, as a node in a computation graph."""
    
        def __init__(self, data, children=(), local_grads=()):
            self.data = data        # scalar value of this node calculated during forward pass
            self.grad = 0           # derivative of the loss w.r.t. this node, calculated in backward pass
            self._children = children   # children of this node in the computation graph
            self._local_grads = local_grads  # local derivative of this node w.r.t. its children
    
        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data, (self, other), (1, 1))
    
        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data, (self, other), (other.data, self.data))
    
        def __pow__(self, other):
            return Value(self.data**other, (self,), (other * self.data**(other - 1),))
    
        def log(self):
            return Value(math.log(self.data), (self,), (1 / self.data,))
    
        def exp(self):
            return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    
        def relu(self):
            return Value(max(0, self.data), (self,), (float(self.data > 0),))
    
        def __neg__(self):
            return self * -1
    
        def __radd__(self, other):
            return self + other
    
        def __sub__(self, other):
            return self + (-other)
    
        def __rsub__(self, other):
            return other + (-self)
    
        def __rmul__(self, other):
            return self * other
    
        def __truediv__(self, other):
            return self * other**-1
    
        def __rtruediv__(self, other):
            return other * self**-1
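
    The excerpt above stops at the operator overloads; it omits the traversal that actually applies the chain rule. Here is a minimal sketch of what the backward() method could look like, modeled on Karpathy’s earlier micrograd (the exact code in microgpt.py may differ):

    # Sketch only: a micrograd-style backward pass, not copied from microgpt.py.
    # Topologically sort the graph, then walk it in reverse, pushing each node's
    # gradient to its children via the chain rule: child.grad += grad * local_grad.
    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += v.grad * local_grad

    Value.backward = backward  # attach to the class above

    # Quick check: loss = log(a*b + a) with a=2, b=3 gives a.grad = (b+1)/(ab+a) = 0.5
    a, b = Value(2.0), Value(3.0)
    loss = (a * b + a).log()
    loss.backward()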

    The GPT Architecture: Transformers Demystified  

    The model implements a simplified GPT-2 architecture with all the essential transformer components: 

    • Token embeddings map each character to a learned vector representation.
    • Positional embeddings tell the model exactly where each token sits in the sequence.
    • Multi-head self-attention lets every position look back at prior positions while merging several streams of information.
    • Feed-forward networks process the attended information through learned transformations.

    The implementation uses ReLU² (squared ReLU) activation instead of GeLU and drops bias terms throughout, which makes the code easier to read while preserving the essential elements.

    def gpt(token_id, pos_id, keys, values):
        tok_emb = state_dict['wte'][token_id]  # token embedding
        pos_emb = state_dict['wpe'][pos_id]    # position embedding
        x = [t + p for t, p in zip(tok_emb, pos_emb)]  # joint token and position embedding
        x = rmsnorm(x)
    
        for li in range(n_layer):
            # 1) Multi-head attention block
            x_residual = x
            x = rmsnorm(x)
            q = linear(x, state_dict[f'layer{li}.attn_wq'])
            k = linear(x, state_dict[f'layer{li}.attn_wk'])
            v = linear(x, state_dict[f'layer{li}.attn_wv'])
            keys[li].append(k)
            values[li].append(v)
    
            x_attn = []
            for h in range(n_head):
                hs = h * head_dim
                q_h = q[hs:hs + head_dim]
                k_h = [ki[hs:hs + head_dim] for ki in keys[li]]
                v_h = [vi[hs:hs + head_dim] for vi in values[li]]
    
                attn_logits = [
                    sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                    for t in range(len(k_h))
                ]
                attn_weights = softmax(attn_logits)
                head_out = [
                    sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                    for j in range(head_dim)
                ]
                x_attn.extend(head_out)
    
            x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
            x = [a + b for a, b in zip(x, x_residual)]
    
            # 2) MLP block
            x_residual = x
            x = rmsnorm(x)
            x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
            x = [xi.relu() ** 2 for xi in x]
            x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
            x = [a + b for a, b in zip(x, x_residual)]
    
        logits = linear(x, state_dict['lm_head'])
        return logits
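
    The function above leans on three helpers the excerpt doesn’t show: linear, softmax, and rmsnorm. The following are minimal pure-Python sketches of plausible implementations over lists of Value objects; the actual definitions in microgpt.py may differ in details such as the epsilon or the stability trick.

    def linear(x, w):
        # matrix-vector product: each output element is a dot product of x with one weight row
        return [sum(x[j] * w_row[j] for j in range(len(x))) for w_row in w]

    def softmax(logits):
        m = max(v.data for v in logits)          # subtract the max for numerical stability
        exps = [(v - m).exp() for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def rmsnorm(x):
        ms = sum(xi * xi for xi in x) * (1.0 / len(x))  # mean of squares
        scale = (ms + 1e-5) ** -0.5                     # 1 / RMS (the epsilon is an assumption)
        return [xi * scale for xi in x]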

    The Training Loop: Learning in Action  

    The training process is refreshingly straightforward:  

    • Take one document from the dataset and tokenize its text into character-ID tokens.
    • Feed the token sequence through the model, one position at a time.
    • Compute the loss from how well the model predicts the next character, then backpropagate to get the gradients.
    • Apply the parameter updates with the Adam optimizer.

    The Adam optimizer itself is implemented from scratch, with proper bias correction and momentum tracking. Every step of the algorithm is visible, with no hidden machinery (a sketch of the update step follows the loop below).

    # Repeat in sequence
    num_steps = 500  # number of training steps
    for step in range(num_steps):
    
        # Take single document, tokenize it, surround it with BOS special token on both sides
        doc = docs[step % len(docs)]
        tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
        n = min(block_size, len(tokens) - 1)
    
        # Forward the token sequence through the model, building up the computation graph all the way to the loss.
        keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
        losses = []
    
        for pos_id in range(n):
            token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
            logits = gpt(token_id, pos_id, keys, values)
            probs = softmax(logits)
            loss_t = -probs[target_id].log()
            losses.append(loss_t)
    
        loss = (1 / n) * sum(losses)  # final average loss over the document sequence. May yours be low.
    
        # Backward the loss, calculating the gradients with respect to all model parameters.
        loss.backward()
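
    The excerpt ends at loss.backward(); the parameter update comes next. Here is a minimal sketch of the Adam step that would follow, using illustrative names (params, m, v, lr are assumptions, not necessarily the file’s own):

    # Sketch only: one Adam update over a flat parameter list, called with step >= 1.
    def adam_step(params, m, v, step, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
        for i, p in enumerate(params):
            m[i] = beta1 * m[i] + (1 - beta1) * p.grad       # first moment (momentum)
            v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2  # second moment
            m_hat = m[i] / (1 - beta1 ** step)               # bias correction
            v_hat = v[i] / (1 - beta2 ** step)
            p.data -= lr * m_hat / (v_hat ** 0.5 + eps)      # update the raw value
            p.grad = 0                                       # reset gradient for the next step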

    What Makes This a Gamechanger for Learning?

    This implementation is pedagogical gold for several reasons: 

    Complete Transparency

    When you call model.forward() in PyTorch, thousands of lines of optimized C++ and CUDA code execute behind the scenes. Here, every mathematical operation is plain Python you can read. Want to know exactly how softmax works? How attention scores are calculated? How gradients flow through matrix multiplications? Every detail is laid bare in the code.

    No Dependencies, No Installation Hassles

    No conda environments, no CUDA toolkit, no version conflicts. Just Python 3 and curiosity. Copy the code into a file, run it, and you have a functional GPT training system. This removes every barrier between you and understanding.

    Debuggable and Modifiable

    Want to see what happens with fewer attention heads? Change one number. Swap ReLU² for another activation function, add extra layers, or alter the learning rate schedule. The entire program is short enough to read and understand, and you can drop print statements anywhere to watch the exact values flowing through the computation graph.

    Real Results

    Despite its limitations, the model demonstrably works: it trains successfully and produces intelligible text. Trained on a database of names, it learns to generate realistic new ones. The inference section uses temperature-based sampling to show how the model produces varied, creative output (sketched below).
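
    For reference, here is a minimal sketch of temperature-based sampling over the model’s final logits. The function name and default temperature are illustrative assumptions, not taken from microgpt.py.

    import math
    import random

    def sample_next(logits, temperature=0.8):
        # higher temperature flattens the distribution; lower makes it more greedy
        scaled = [l.data / temperature for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]   # stable softmax on plain floats
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(probs)), weights=probs)[0]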


    Getting Started: Running Your Own GPT

    The beauty of this project is its simplicity. Download the code from the GitHub repository or Karpathy’s website, save it as microgpt.py, and run:

    python microgpt.py 

    The script downloads the training data, trains the model, and generates text. No virtual environments, no pip installs, no configuration files: just plain Python doing machine learning.

    If you’re interested in the full code, refer to Andrej Karpathy’s GitHub repository.

    Performance and Practical Limitations

    Let’s be honest: this implementation is slow. Training on a CPU in pure Python means:

    • Operations execute one scalar at a time
    • There is no GPU acceleration
    • All math runs through pure Python rather than optimized numerical libraries

    A model that trains in seconds with PyTorch takes around three hours here. This is a learning environment, not production code.

    That trade-off is deliberate: the code prioritizes clarity for learning over execution speed. It is like learning to drive on a manual transmission instead of an automatic: you feel every gear shift and come away knowing how the machine actually works.

    Going Deeper with Experimental Ideas

    First, work through the code until you understand it. Once you do, the experiments begin:

    • Modify the architecture: Add more layers, change embedding dimensions, experiment with different attention heads 
    • Try different datasets: Train on code snippets, song lyrics, or any text that interests you 
    • Implement new features: Add dropout, learning rate schedules, or different normalization schemes 
    • Debug gradient flow: Add logging to see how gradients change during training 
    • Optimize performance: Port pieces to NumPy or add basic vectorization, checking that the results stay the same (see the sketch after this list)
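
    As an example of that last idea, here is an illustrative NumPy rewrite of the per-head attention-score loop from gpt(). It is a sketch under the assumption that Q and K are (T, head_dim) arrays for one head, not code from the repository:

    import numpy as np

    def attention_weights(Q, K, head_dim):
        logits = Q @ K.T / np.sqrt(head_dim)                  # (T, T) scaled dot products
        mask = np.triu(np.ones_like(logits, dtype=bool), k=1)
        logits = np.where(mask, -np.inf, logits)              # causal mask: no peeking ahead
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)              # row-wise softmax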

    The code’s structure makes such experiments easy, and each one deepens your understanding of the principles that make transformers work.

    Conclusion 

    Andrej Karpathy’s 243-line microGPT is more than an algorithm; it is a statement. It shows that understanding transformers does not require massive framework code, that you can see exactly how attention works without expensive GPU hardware, and that basic building blocks are enough to construct a working language model.

    All you need is curiosity, Python, and the patience to read 243 lines of educational code. This implementation is a genuine gift to students, engineers, and researchers who want to understand neural networks: it offers the clearest view of the transformer architecture you will find anywhere.

    Download the file, read the code, make changes, or even take the whole thing apart and build your own version. It will teach you how GPT works, one line of pure Python at a time.

    Frequently Asked Questions

    Q1. What is microGPT and why is it useful for learning transformers?

    A. It is a pure Python GPT implementation that exposes core math and architecture, helping learners understand transformers without relying on heavy frameworks like PyTorch.

    Q2. What components does Andrej Karpathy build from scratch in microGPT?

    A. He implements autograd, the GPT architecture, and the Adam optimizer entirely in Python, showing how training and backpropagation work step by step.

    Q3. Why is microGPT slower than standard deep learning frameworks?

    A. It runs on pure Python without GPUs or optimized libraries, prioritizing clarity and education over speed and production performance.


    Riya Bansal

    Data Science Trainee at Analytics Vidhya
    I am currently working as a Data Science Trainee at Analytics Vidhya, where I focus on building data-driven solutions and applying AI/ML techniques to solve real-world business problems. My work allows me to explore advanced analytics, machine learning, and AI applications that empower organizations to make smarter, evidence-based decisions.
    With a strong foundation in computer science, software development, and data analytics, I am passionate about leveraging AI to create impactful, scalable solutions that bridge the gap between technology and business.
    📩 You can also reach out to me at [email protected]
