A Brief History of NLP and LLM
The past few years have witnessed the rapid development of LLMs. To understand how LLMs achieved their current status, it is worth taking a brief look at the history of Natural Language Processing (NLP) that predates today's LLMs.
The development of Machine Learning can be roughly divided into five stages.
Today, researchers across the world continue to improve LLMs to further hone their AI capabilities. In a nutshell, these capabilities fall into two areas:
The Training Paradigm in LLM
The training framework of LLMs differs from that of traditional machine learning. Traditional machine learning developed tailored models for different tasks: for example, separate models for machine translation, sentiment analysis, and question answering. Each of these models had its own architecture and was trained independently on a large amount of task-specific training data.
In LLMs, model training is broken down into two stages: a pre-training stage and a fine-tuning stage.
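The two-stage idea can be sketched with a toy bigram count model. Everything here (the class, the corpora, the word "model") is an illustrative invention, not a real LLM; the point is only that one model is first trained broadly and then further trained, in place, on narrower data:

```python
from collections import defaultdict

class ToyBigramLM:
    """A toy bigram 'language model' that predicts the next word
    from raw co-occurrence counts. Illustrative only."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, corpus):
        # Update next-word counts from a list of sentences.
        for sentence in corpus:
            words = sentence.split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, word):
        # Return the most frequent continuation seen so far.
        followers = self.counts[word]
        return max(followers, key=followers.get) if followers else None

model = ToyBigramLM()
# Stage 1: "pre-train" on a broad, generic corpus.
model.train(["the cat sat", "the dog ran", "the cat ran"])
# Stage 2: "fine-tune" the same model on a narrow domain corpus.
model.train(["the model converged"] * 3)
print(model.predict("the"))    # domain data now dominates: "model"
print(model.predict("model"))  # knowledge added in stage 2: "converged"
```

The key property mirrored here is that fine-tuning does not start a new model; it continues updating the parameters (here, counts) that pre-training produced.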
How Data Affects LLM Training
Because pre-training and fine-tuning differ in scale and purpose, the data requirements of the two stages are very different.
During pre-training, data volume, quality, and diversity are the most important requirements. Pre-training an LLM today involves training on almost all available human-generated text. Careful data cleaning is needed to filter out noisy, uninformative text, and the data must be balanced across domains so that the LLM can transfer knowledge effectively and retain the potential to be fine-tuned for any downstream task.
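The cleaning step can be sketched with simple heuristic filters. The rules and thresholds below are illustrative assumptions, not the filters any particular LLM pipeline actually uses:

```python
import re

def quality_filter(doc, min_words=20, max_symbol_ratio=0.1):
    """Heuristic pre-training data filter (illustrative thresholds).
    Keeps documents that are long enough and mostly natural text."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to carry useful signal
    symbols = len(re.findall(r"[^\w\s]", doc))
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False  # likely markup or encoding debris, not prose
    return True

def dedup(docs):
    """Drop exact duplicates, a common pre-training cleaning step."""
    seen, kept = set(), []
    for d in docs:
        key = d.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept
```

Real pipelines add many more stages (language identification, near-duplicate detection, domain reweighting), but they follow this same filter-then-deduplicate shape.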
During fine-tuning, quality is the single most important requirement. A pre-trained LLM has already acquired general knowledge about the world, so commonplace data such as web forum discussions can no longer improve it in specific areas. Instead, we need expert-level data in a specific domain: for example, PhD-level text in STEM fields to train the LLM to solve advanced problems in those areas.
This quality requirement creates strong demand for expert-level data annotation, because expert-level data is scarce on the web. To bring expert-level intelligence to LLMs in specific domains, there is no alternative to having human experts iteratively interact with the models, feeding them complex data and providing feedback on their responses, to continually enhance their domain-specific intelligence.
Fine Tuning Algorithms
Typically, LLM fine-tuning involves two steps: supervised fine-tuning (SFT), which trains the model on question-answer pairs written or vetted by experts, and reinforcement learning from human feedback (RLHF), which trains the model on human preferences between candidate answers.
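Schematically, the two steps optimize different objectives: SFT maximizes the likelihood of the expert answer's tokens, while RLHF-style training pushes the reward of a preferred answer above a rejected one (a Bradley-Terry pairwise objective). A minimal sketch with made-up numbers; the probabilities and reward values are hypothetical, not real model outputs:

```python
import math

def sft_loss(token_probs):
    """Supervised fine-tuning: average negative log-likelihood of
    the expert-written answer's tokens under the model."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def preference_loss(reward_chosen, reward_rejected):
    """RLHF-style pairwise objective: negative log of the
    Bradley-Terry probability that the chosen answer wins."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical example values:
print(round(sft_loss([0.9, 0.8, 0.95]), 4))
print(round(preference_loss(2.0, 0.5), 4))  # small loss: chosen already preferred
print(round(preference_loss(0.5, 2.0), 4))  # large loss: preference violated
```

Both losses go to zero as the model behaves as the human data dictates: SFT when the expert tokens get probability 1, and the preference loss when the chosen answer's reward dominates.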
A Fine Tuning Example
Let's take a look at a real-world fine-tuning example. Suppose we want to develop an LLM that can generate coding solutions given a coding problem. How do we do so? For SFT, training data consists of question-answer pairs, e.g.
{
    "Question": "Write a function to generate the nth Fibonacci number. The function should have a time complexity of O(log n) and use dynamic programming. Additionally, the function should only use a constant amount of space, without using any arrays or additional data structures.",
    "Answer": "
        def multiply_matrices(A, B):
            a, b, c, d = A[0][0], A[0][1], A[1][0], A[1][1]
            e, f, g, h = B[0][0], B[0][1], B[1][0], B[1][1]
            return [[a*e + b*g, a*f + b*h],
                    [c*e + d*g, c*f + d*h]]

        def power_matrix(A, n):
            if n == 0:
                return [[1, 0], [0, 1]]  # Identity matrix
            result = power_matrix(A, n // 2)
            result = multiply_matrices(result, result)
            if n % 2 == 1:
                result = multiply_matrices(result, A)
            return result

        def fibonacci(n):
            if n == 0:
                return 0
            matrix = [[1, 1], [1, 0]]
            result = power_matrix(matrix, n - 1)
            return result[0][0]
    "
}
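As a sanity check, the matrix-exponentiation approach used in this answer can be verified against the straightforward iterative definition of Fibonacci. A self-contained sketch:

```python
def fib_matrix(n):
    """Fibonacci via fast matrix exponentiation: O(log n) multiplies,
    using the identity [[1,1],[1,0]]^n = [[F(n+1),F(n)],[F(n),F(n-1)]]."""
    def mul(A, B):
        return [[A[0][0]*B[0][0] + A[0][1]*B[1][0],
                 A[0][0]*B[0][1] + A[0][1]*B[1][1]],
                [A[1][0]*B[0][0] + A[1][1]*B[1][0],
                 A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

    def power(A, k):
        if k == 0:
            return [[1, 0], [0, 1]]  # identity matrix
        half = power(A, k // 2)
        result = mul(half, half)
        return mul(result, A) if k % 2 else result

    if n == 0:
        return 0
    return power([[1, 1], [1, 0]], n - 1)[0][0]

def fib_iter(n):
    """Reference implementation: simple O(n) iteration."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# The two definitions agree on the first 50 Fibonacci numbers.
assert all(fib_matrix(n) == fib_iter(n) for n in range(50))
```

Checks like this matter for fine-tuning data: a subtly incorrect "expert" answer teaches the model the wrong pattern, so candidate solutions should be validated before annotation.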
For RLHF, training data involves multiple answers with labeled preferences, i.e.
{
    "Question": "Write a function to generate the nth Fibonacci number. The function should have a time complexity of O(log n) and use dynamic programming. Additionally, the function should only use a constant amount of space, without using any arrays or additional data structures.",
    "Chosen Answer": "
        def multiply_matrices(A, B):
            a, b, c, d = A[0][0], A[0][1], A[1][0], A[1][1]
            e, f, g, h = B[0][0], B[0][1], B[1][0], B[1][1]
            return [[a*e + b*g, a*f + b*h],
                    [c*e + d*g, c*f + d*h]]

        def power_matrix(A, n):
            if n == 0:
                return [[1, 0], [0, 1]]  # Identity matrix
            result = power_matrix(A, n // 2)
            result = multiply_matrices(result, result)
            if n % 2 == 1:
                result = multiply_matrices(result, A)
            return result

        def fibonacci(n):
            if n == 0:
                return 0
            matrix = [[1, 1], [1, 0]]
            result = power_matrix(matrix, n - 1)
            return result[0][0]
    ",
    "Unchosen Answer": "
        def matrix_mult(A, B):
            return [[A[0][0] * B[0][0] + A[0][1] * B[1][0],
                     A[0][0] * B[0][1] + A[0][1] * B[1][1]],
                    [A[1][0] * B[0][0] + A[1][1] * B[1][0],
                     A[1][0] * B[0][1] + A[1][1] * B[1][1]]]

        def matrix_pow(mat, exp):
            if exp == 1:
                return mat
            if exp % 2 == 0:
                half_pow = matrix_pow(mat, exp // 2)
                return matrix_mult(half_pow, half_pow)
            else:
                return matrix_mult(mat, matrix_pow(mat, exp - 1))

        def fibonacci(n):
            if n == 0:
                return 0
            if n == 1:
                return 1
            base_matrix = [[1, 1], [1, 0]]
            result_matrix = matrix_pow(base_matrix, n - 1)
            return result_matrix[0][0]
    "
}
Summary
Recent developments in LLMs show that models intended to solve natural-language problems exhibit emergent AI behaviors, such as common-sense reasoning, writing code, and solving STEM questions. This motivates further research to enhance LLMs' capabilities in these domains.
The modern LLM training paradigm involves a pre-training stage and a fine-tuning stage. Pre-training is both data and infrastructure hungry, and is typically done by only a few companies in the world. Fine-tuning, on the other hand, focuses on customizing a pre-trained LLM for specific tasks. To improve an LLM's capability in specific domains or for customized applications, fine-tuning is the most important stage.
The fine-tuning stage is heavily driven by high-quality data. To guarantee that such data can further enhance a pre-trained LLM's capability on specific tasks, we need domain-expert-level data: for example, compute-optimal solutions to coding questions.
(c)2016-2025 CHANDLER ZUO ALL RIGHTS RESERVED