Welcome to the Corpus Linguistics page — a digital space dedicated to understanding how real-life language use can be explored, analyzed, and interpreted through data. This page is curated by Dr. Ali Raza Siddique for students, researchers, and language enthusiasts interested in discovering the science behind words, patterns, and meaning.
Here, you’ll find lecture summaries, practical tutorials, reading materials, and homework assignments related to corpus linguistics. The goal is to create a learning environment that combines theory, technology, and critical thinking.
Lecture 1
University of South Asia, Lahore
Department of English
Course: Corpus in Applied Linguistics
Instructor: Dr. Ali Raza Siddique
Topics: Claims and Criticisms of the Use of Corpora and Other Data Sources in Linguistic Research
Level: PhD English (Linguistics)
1. Introduction
In linguistic research, there has long been a debate between corpus-based (empirical) linguists and theoretical / intuition-based linguists. This debate centers on what constitutes valid linguistic evidence:
- Should linguistics rely on naturally occurring language data (corpora),
- or on constructed sentences, elicited judgments, and theoretical modeling?
2. Major Claims in Favour of Corpus Linguistics
(Represented by Sinclair, Leech, McEnery, Biber, Hunston, Stubbs, Baker, and others)

| Claim | Linguistic Rationale | Supporting Scholars / Quotes |
| --- | --- | --- |
| 1. Corpus provides empirical evidence | Language analysis should be based on what people actually say or write, not on intuition. | Sinclair (1991): “The language looks rather different when you look at a lot of it.” |
| 2. Corpus data reveal patterns hidden from intuition | Lexical and grammatical patterns (e.g., collocations, lexical bundles) emerge only from large-scale data. | Hunston & Francis (2000): Corpus shows “the systematicity of phraseology.” |
| 3. Corpus ensures objectivity and replicability | Research can be verified and reproduced by others using the same corpus dataset. | McEnery & Hardie (2012): Corpus linguistics “enhances transparency and reliability.” |
| 4. Corpus supports quantitative and qualitative integration | Enables mixed-methods research: frequency counts plus contextual interpretation. | Biber (1993): “Corpora make linguistic variation empirically observable.” |
| 5. Corpus contributes to applied domains | Pedagogy, lexicography, discourse analysis, ESP, and language policy all benefit from corpus findings. | Sinclair (2004) and Baker (2006) emphasize the “applied utility” of corpus studies. |
| 6. Corpus challenges prescriptive norms | Descriptive analysis based on authentic usage replaces idealized models of grammar. | Leech (1992): “Corpus linguistics democratizes language description.” |
| 7. Corpus aids technological and computational advances | Supports NLP, AI, translation, and language modeling, providing linguistic realism to machines. | McEnery et al. (2006); Stubbs (2001). |
3. Major Criticisms / Limitations of Corpus Linguistics
(Raised by Chomsky, Widdowson, Sampson, and some cognitive linguists)

| Criticism / Limitation | Explanation | Key Critics / Quotes |
| --- | --- | --- |
| 1. Corpus shows performance, not competence | Corpora record actual performance, which may contain errors, slips, and distractions; they do not reveal the mental grammar or underlying competence. | Noam Chomsky (1962, 1965): “A corpus is no substitute for the intuition of the native speaker.” |
| 2. Corpora are finite and selective | No corpus can fully represent an infinite language system; sampling is always partial and subjective. | Widdowson (2000): “Corpora provide evidence of use, not of meaning.” |
| 3. Corpus ignores creativity and intuition | Language users can generate novel utterances not found in corpora; linguistic theory must account for generativity. | Chomsky (1965): Linguistics must explain “the ability to produce and understand new sentences.” |
| 4. Corpus data lack contextual and social depth | Corpora may show frequency but not the pragmatic force, intention, or social meaning behind utterances. | Widdowson (2000): “Corpus evidence without discourse interpretation is incomplete.” |
| 5. Annotation and tagging introduce bias | Decisions about POS tagging, lemmatization, and sampling can distort data and lead to interpretive bias. | Sampson (2001): “Corpus annotation is theory-laden.” |
| 6. Over-reliance on quantitative results | Counting without context can lead to superficial conclusions about meaning and use. | Stubbs (2001): Advocates combining corpus with discourse and ethnographic analysis. |
| 7. Corpora often lack spoken, multimodal, or minority language data | Most large corpora (e.g., BNC, COCA) privilege standard written English. | Leech (2000): Calls for “balanced representation across modes and varieties.” |
4. Claims in Favour of Non-Corpus or Alternative Data Sources
(Used in generative grammar, psycholinguistics, and experimental linguistics)

| Data Source | Claimed Strengths | Examples / Scholars |
| --- | --- | --- |
| Intuition-based judgments | Provide direct insight into linguistic competence and grammaticality. | Chomsky’s generative grammar (1957, 1965) uses constructed examples for theory testing. |
| Elicited / experimental data | Allow control over variables; useful in SLA and cognitive studies. | Ellis (2005); Long (2015) use experimental elicitation. |
| Fieldwork and ethnography | Capture social meaning, variation, and real communicative intent beyond textual data. | Labov (1972); Hymes (1974); Cameron (2001). |
| Psycholinguistic and neurolinguistic datasets | Reveal mental processing mechanisms unavailable in corpora. | Pinker (1994); Tomasello (2003). |
| Constructed / simulated dialogues | Used in pragmatics and speech act research for consistent stimuli. | Blum-Kulka (1989), the CCSARP project. |
5. Attempts to Bridge the Divide
Modern linguistics increasingly adopts a pluralistic approach combining corpus and non-corpus methods:

| Integrative Approach | Explanation | Examples |
| --- | --- | --- |
| Corpus + Introspection | Corpora provide evidence, while intuition helps interpret rare or ambiguous patterns. | Leech (1991): “Corpus linguistics and introspection are complementary.” |
| Corpus + Experimentation | Use corpus data to design experimental tasks and verify patterns. | Ellis (2008): usage-based SLA with corpus priming. |
| Corpus + Discourse Analysis (CDA) | Quantitative frequency supports qualitative discourse interpretation. | Baker, Gabrielatos, KhosraviNik (2008): Corpus-Assisted Discourse Studies (CADS). |
| Corpus + Ethnography | Combines authentic data with social context and participant insight. | Tusting & Maybin (2007): “corpus-ethnographic” approaches. |
6. Key Scholarly Positions at a Glance

| Linguist / School | Position Summary |
| --- | --- |
| John Sinclair (1991) | Advocated “trusting the text” and data-driven language description. |
| Noam Chomsky (1965) | Rejected corpus as evidence of competence; favored idealized native-speaker intuition. |
| Geoffrey Leech (1992) | Balanced view: corpus and intuition are complementary. |
| Tony McEnery & Andrew Hardie (2012) | Defined corpus linguistics as an empirical methodology, not a theory. |
| Henry Widdowson (2000) | Warned against treating corpora as context-free truth; advocated interpretive balance. |
| Douglas Biber (1993) | Pioneered large-scale empirical register analysis; corpus reveals functional variation. |
| Susan Hunston (2002) | Highlighted corpus value in applied linguistics and pedagogy. |
| Paul Baker (2006) | Advanced critical and discourse-based corpus studies. |
7. Discussion Questions
- Is corpus data sufficient for linguistic explanation, or only necessary as empirical grounding?
- How can corpus linguistics address issues of meaning and interpretation raised by Widdowson?
- Should theoretical linguistics (e.g., Chomskyan grammar) integrate corpus findings?
- In what ways can Pakistani linguistics benefit from local corpus construction rather than relying on Western corpora?
8. Suggested Readings
- Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford University Press.
- Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press.
- Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1).
- McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.
- Leech, G. (1992). Corpora and theories of linguistic performance. In Directions in Corpus Linguistics.
- Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge University Press.
- Baker, P. (2006). Using Corpora in Discourse Analysis. Continuum.
Lecture 2
University of South Asia, Lahore
Department of English
Course: Corpus in Applied Linguistics
Instructor: Dr. Ali Raza Siddique
Topic: Corpus Building, Annotation, and Analysis
of Newspaper Editorials
Overview
Today’s class focused on the practical application of
corpus-building and analytical techniques. Students learned how to
construct a small-scale corpus using English-language Pakistani newspaper
editorials and how to process, analyze, and interpret linguistic data using
corpus tools like AntConc and TagAnt. The session aimed to equip
students with both technical skills and analytical insight for
real-world corpus-based research.
1. Data Collection
The lecture began with a demonstration of how to collect
authentic linguistic data from reliable sources.
Students were instructed to select two Pakistani
English-language newspapers such as Dawn, The Nation, The Express
Tribune, or The News International. From each newspaper, they were
to choose two editorials published between 2023 and 2025.
Each editorial was saved in .txt format with a
systematic filename for easy identification.
Example:
0001_Dawn_Politics_2024-04-10.txt
0002_Nation_Economy_2023-11-25.txt
This task helped students understand that consistent file
organization is essential for efficient data management and retrieval
during corpus analysis.
2. Preparation of the Metafile
Next, students learned to prepare a metafile
using Excel, which serves as a metadata sheet for the corpus. It
included important information such as:
- File ID
- Editorial Title
- Publication Date
- Author Name
- Newspaper Name
- Genre
- Word Count (Tokens and Types)
- URL
Students practiced using ChatGPT
to verify missing data or classify editorial genres by giving structured
prompts such as:
“Generate a metafile entry for an
editorial titled [Title] published in [Newspaper] on [Date].”
This exercise highlighted how AI tools can assist in automating corpus documentation and ensuring data accuracy.
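The metafile maps naturally onto a CSV sheet, and the Word Count column (tokens and types) can be computed rather than typed in. A minimal Python sketch; the title, author, and URL below are invented placeholders, and the column names simply echo the field list above:

```python
import csv
import re

def token_type_counts(text):
    """Tokens = running words; types = distinct word forms."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return len(tokens), len(set(tokens))

FIELDS = ["File ID", "Editorial Title", "Publication Date", "Author Name",
          "Newspaper Name", "Genre", "Tokens", "Types", "URL"]

sample_text = "The economy needs reform, and the reform must be swift."
tokens, types = token_type_counts(sample_text)

# Write one illustrative metafile row (all details are placeholders).
with open("metafile.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow({
        "File ID": "0001", "Editorial Title": "Placeholder Title",
        "Publication Date": "2024-04-10", "Author Name": "Editorial Board",
        "Newspaper Name": "Dawn", "Genre": "Politics",
        "Tokens": tokens, "Types": types,
        "URL": "https://example.com/placeholder",
    })
```

One row per .txt file keeps the metadata sheet aligned with the corpus folder.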
3. File Coding Procedure
The class then discussed the
importance of file-naming conventions for maintaining uniformity.
The adopted pattern was:
FileID_Newspaper_Genre_Date.txt
Example:
0001_Dawn_Editorial_2024-06-05.txt
Students wrote short reflective
notes explaining how this system supports data management by making it
easy to identify each file’s source, type, and date without manual searching.
4. Preparing Word Lists
Students were guided through generating word lists
using AntConc.
They learned how to separate words into two major
categories:
- Functional Words: words that provide grammatical structure (e.g., the, of, is, and, was)
- Content Words: words that carry meaning (e.g., economy, policy, reform, government)
Each student created tables displaying the top 10 most
frequent words in both categories for each editorial. This exercise
emphasized the distinction between lexical content and grammatical
function in language.
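The same split can be reproduced outside AntConc. A minimal Python sketch; the functional-word set here is a deliberately tiny stand-in for the full stopword inventory a real study would use:

```python
import re
from collections import Counter

# Tiny illustrative stand-in for a full functional-word (stopword) list.
FUNCTION_WORDS = {"the", "of", "is", "and", "was", "to", "in", "a", "that", "it"}

def word_frequencies(text):
    """Count word forms and split them into functional vs content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    functional = {w: n for w, n in counts.items() if w in FUNCTION_WORDS}
    content = {w: n for w, n in counts.items() if w not in FUNCTION_WORDS}
    return functional, content

functional, content = word_frequencies(
    "The government announced the policy, and the reform of the economy was welcomed."
)
print(sorted(functional.items()))  # grammatical scaffolding
print(sorted(content.items()))     # meaning-bearing vocabulary
```

Sorting either dictionary by frequency gives the top-10 tables the students produced.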
5. Integration and Tagging Preparation
Students merged all individual text files into a single
combined file titled
Combined_Editorials.txt.
This file was then processed through TagAnt, a
part-of-speech (POS) tagging tool, producing a tagged version of the
corpus — Tagged_Editorials.txt.
This tagging step was crucial for identifying grammatical
categories such as nouns, verbs, adjectives, and adverbs, which would later
inform keyword and collocation analysis.
6. Keyword and Collocation Analysis
Using AntConc, students conducted several layers of analysis:
- Keyword Analysis: comparing the editorial corpus with a general English reference corpus to identify distinctive or overused words. Example: frequent use of words like policy, reform, governance, and democracy indicates a political focus.
- Collocation Analysis: identifying the top 10 collocates (frequently co-occurring words) for key items such as Pakistan, government, or policy. Example: the word government frequently co-occurred with federal, corruption, accountability, and reform, suggesting common discourse themes.
- Lexical Pattern Analysis: exploring repeated word sequences or stance markers (e.g., it is clear that, one might argue that).
Each analysis was followed by a
short written interpretation to connect linguistic patterns with editorial
tone and ideology.
7. Linking Keyword Analysis to POS Tagging
In this part, students were asked to relate their lexical
findings to parts of speech identified through tagging.
They explored how certain grammatical categories contribute
to stance or persuasion in editorials.
Example:
“The frequent use of modal verbs such as must and should
reflects a persuasive and advisory tone typical of editorials, whereas
adjectives like urgent and critical express evaluative stance.”
This exercise connected linguistic patterns to discourse
functions, showing how corpus methods reveal subtle rhetorical features in
texts.
8. Reflection and Interpretation
To consolidate learning, students were assigned a 250–300
word reflection on their corpus-building experience. They were encouraged
to discuss:
- Challenges faced during data collection and cleaning (e.g., removing advertisements or formatting errors).
- Role of AI tools (ChatGPT) in helping with metadata preparation or pattern interpretation.
- Insights gained about linguistic and metadiscursive patterns in Pakistani editorials.
This reflection aimed to promote metacognitive awareness,
encouraging students to think critically about both the process and results of
their corpus analysis.
Conclusion
By the end of the lecture, students
had gained practical experience in:
- Collecting and organizing real-world textual data,
- Creating and managing a metadata file,
- Using AntConc and TagAnt for linguistic
analysis,
- Identifying collocational and keyword patterns, and
- Interpreting language use in relation to discourse and
ideology.
The class demonstrated how corpus linguistics serves as a powerful tool for uncovering patterns of meaning, ideology, and persuasion in authentic texts such as newspaper editorials.
Lecture 3
University of South Asia, Lahore
Department of English
Course: Corpus in Applied Linguistics
Instructor: Dr. Ali Raza Siddique
Topic: From Theory to Practice in Corpus Linguistics: Exploring Corpus-Based and Corpus-Driven Approaches, Data Preparation, Analysis, Tagging, and Research Annotation
1. Corpus-Based vs. Corpus-Driven Studies
The class began with an in-depth
discussion on the difference between corpus-based and corpus-driven
approaches in linguistic research. Understanding this distinction is
essential because it shapes how researchers formulate questions, collect data,
and interpret linguistic patterns.
Corpus-Based Studies
Corpus-based research starts from existing
linguistic theories or hypotheses and uses corpus data to test,
support, or illustrate those ideas. The corpus serves as empirical
evidence to confirm or refine what is already known.
In this approach, the researcher
already has a framework — such as Halliday’s Systemic Functional Grammar
or Biber’s Multidimensional Analysis — and uses corpus data to measure
or exemplify certain linguistic behaviors.
Example:
A researcher interested in comparing hedging
devices (e.g., might, perhaps, possibly) across male and
female academic writers might first assume that female writers use more hedging
to appear polite or cautious. Using AntConc or Sketch Engine, they extract all
hedging instances from research articles and calculate frequencies to see whether
the data support or contradict this theoretical assumption.
Here, the corpus is used to test
a pre-defined theory, making the study corpus-based.
Corpus-Driven Studies
Corpus-driven studies work in
the opposite direction. Rather than starting with theory, researchers
let the data speak for itself. Patterns, constructions, and
linguistic categories emerge inductively from the corpus
without being constrained by prior assumptions. In this case, theory is built
from the data, not imposed on it.
Example:
Suppose a linguist analyzes a large
collection of online forums or learner essays without any hypothesis in mind.
After running frequency and collocation analyses, they notice that learners
frequently use the phrase “make a photo” instead of the native-like “take
a photo.” This recurring pattern might lead the researcher to develop
a new hypothesis about how learners’ L1 (first language) influences
verb-noun collocations in English.
Here, the researcher derives
the pattern directly from the data, making it a corpus-driven
study.
Comparative Perspective

| Aspect | Corpus-Based | Corpus-Driven |
| --- | --- | --- |
| Starting Point | Pre-existing theory or hypothesis | No predefined theory; the data guide discovery |
| Purpose | To test, exemplify, or refine theories | To discover new patterns and generate theories |
| Analytical Process | Deductive (top-down) | Inductive (bottom-up) |
| Example Focus | Checking frequency of modal verbs in academic writing to confirm politeness theory | Discovering unexpected lexical bundles in learner essays |
| Outcome | Validation or refinement of known linguistic patterns | Creation of new theoretical insights or linguistic categories |
Concluding Insight
This distinction laid the foundation
for understanding how corpus methodology supports both descriptive
and exploratory linguistic investigations. While corpus-based
studies contribute to verifying and strengthening linguistic theories,
corpus-driven studies enable innovation by uncovering patterns
and phenomena that might otherwise go unnoticed.
2. Merging Text Files Using ILoveMerge
After discussing theoretical
foundations, the class moved on to the practical process of corpus data
preparation.
Students were introduced to ILoveMerge, a user-friendly tool designed to combine multiple Notepad (.txt) files into a single, continuous text file.
This process is an essential step in
corpus building because it reduces manual effort and ensures
that all individual text files—such as student essays, interview
transcripts, or literary excerpts—are merged into one unified
dataset. Once combined, this dataset can be easily imported into
corpus analysis tools for tagging, cleaning, and subsequent linguistic
analysis.
For example, if a researcher has 100
student essays saved as separate text files, manually copying and pasting them
into one document would be time-consuming and error-prone. Using ILoveMerge,
all files can be merged automatically within seconds,
producing a single corpus file ready for tagging in CLAWS or
analysis in AntConc.
This step streamlined the workflow and
prepared students for the next phase of corpus-based investigation.
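When a dedicated merging tool is unavailable, the same step can be scripted. A minimal Python sketch; the corpus/ folder and its two sample files are placeholders created here purely for demonstration:

```python
from pathlib import Path

# Placeholder corpus folder with two tiny sample files (not real editorials).
corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
(corpus_dir / "0001_Dawn_Politics_2024-04-10.txt").write_text(
    "Sample editorial one.", encoding="utf-8")
(corpus_dir / "0002_Nation_Economy_2023-11-25.txt").write_text(
    "Sample editorial two.", encoding="utf-8")

# Merge in sorted filename order so the 0001_, 0002_ numbering is preserved.
merged = Path("Combined_Editorials.txt")
with merged.open("w", encoding="utf-8") as out:
    for txt in sorted(corpus_dir.glob("*.txt")):
        out.write(txt.read_text(encoding="utf-8").strip() + "\n\n")
```

The blank line written after each file keeps the individual texts visually separable in the combined corpus.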
3. Performing Corpus Analyses (Hands-on
Session)
After completing the file-merging
stage, the class engaged in a hands-on session to explore the
practical application of corpus analysis tools, particularly AntConc.
This session allowed students to experience how raw textual data can be
transformed into meaningful linguistic insights through systematic analysis.
The major analytical
procedures introduced were as follows:
· Finding Concordances:
Students learned how to extract textual examples of a specific word or phrase to examine its immediate linguistic context. This helped them observe how the same lexical item behaves differently across various sentences.
Example: Searching for the word “development” revealed patterns such as “language development”, “economic development”, and “personal development”, each carrying distinct contextual meanings.
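The core of a concordancer is small enough to sketch. The hypothetical function below prints keyword-in-context (KWIC) lines in the style a concordance tool displays them:

```python
import re

def concordance(text, word, width=30):
    """Return KWIC lines: left context, bracketed keyword, right context."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

sample = ("Language development depends on input. Economic development slowed, "
          "but personal development programmes continued.")
for line in concordance(sample, "development"):
    print(line)
```

Right-aligning the left context lines up every hit on the keyword, which is what makes recurring patterns visible at a glance.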
· Observing Concordance Plot:
The concordance plot visually displays the distribution of a word or phrase throughout the corpus, helping identify whether it appears frequently in certain sections or is evenly spread across the dataset.
· Studying File View:
This function allows users to read texts directly within the corpus tool, offering a close examination of sentence structure, paragraph organization, and text type. It provides context beyond isolated examples.
· Creating and Customizing Clusters/N-grams:
Students generated multi-word sequences (such as 2-grams, 3-grams, or longer units) to explore recurrent lexical bundles and phraseological patterns.
Example: Frequent 3-grams like “on the other” or “as a result” indicated typical academic discourse markers.
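N-gram extraction reduces to sliding a fixed-size window over the token list and counting each sequence, as this short sketch shows:

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over the tokens and count each sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ("on the other hand the results were clear as a result "
          "on the other hand we revised as a result").split()
trigrams = ngrams(tokens, 3)
print(trigrams[("on", "the", "other")])  # 2
print(trigrams[("as", "a", "result")])   # 2
```

Calling `ngrams(tokens, 2)` or `ngrams(tokens, 4)` gives the other cluster sizes students experimented with.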
· Studying Collocates:
This step involved identifying words that frequently co-occur with a target word, thereby uncovering semantic and grammatical associations.
Example: For the target word “research”, frequent collocates such as “study”, “findings”, and “methodology” reflected its disciplinary context.
· Analyzing Statistical Significance:
To evaluate the strength of association between co-occurring words, students applied the Mutual Information (MI) score, where values above 3 typically indicate statistically significant collocations.
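The MI score is simple to compute by hand: it is the base-2 log of observed versus expected co-occurrence. A sketch with hypothetical counts (corpus tools may additionally adjust the expected value for the collocation span):

```python
import math

def mi_score(joint_freq, freq_x, freq_y, corpus_size):
    """Pointwise Mutual Information: log2(observed / expected co-occurrence)."""
    expected = (freq_x * freq_y) / corpus_size
    return math.log2(joint_freq / expected)

# Hypothetical counts: 'government' 120 times, 'federal' 40 times,
# co-occurring 25 times in a 100,000-word corpus.
mi = mi_score(joint_freq=25, freq_x=120, freq_y=40, corpus_size=100_000)
print(round(mi, 2))  # 9.02, well above the conventional cutoff of 3
```

By chance alone the pair would be expected only about 0.05 times; observing it 25 times is what drives the score so high.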
· Generating Wordlists:
Wordlists were created to display all words in the corpus ranked by frequency. This helped identify lexical density, repetition patterns, and vocabulary richness within the dataset.
· Developing Keyword Lists:
By comparing the corpus against a reference corpus, students identified keywords: words that occur significantly more often than expected. These highlight distinctive lexical features of a particular dataset, genre, or author.
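Keyness is often measured with the log-likelihood statistic. A sketch of that calculation with hypothetical counts (one of several keyness measures corpus tools offer):

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word: target corpus vs reference corpus."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_t = size_target * total_freq / total_size
    expected_r = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll

# Hypothetical counts: 'reform' appears 80 times in a 50,000-word editorial
# corpus but only 90 times in a 1,000,000-word reference corpus.
ll = log_likelihood(80, 50_000, 90, 1_000_000)
print(round(ll, 1))  # far above the 3.84 cutoff for p < 0.05
```

The higher the value, the more distinctive the word is of the target corpus relative to the reference.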
Through this interactive session,
students became familiar with the core functionalities of corpus analysis
software and gained insight into the quantitative logic
underlying linguistic research. They observed how data-driven tools
transform large collections of text into verifiable patterns that inform
linguistic interpretation.
4. Developing Manual Tagging Procedures
The next part of the class focused on
the design and implementation of manual tagging as an
essential step in corpus-based linguistic analysis. Tagging allows researchers
to systematically identify, classify, and analyze linguistic features
or errors across large datasets. The class discussed both conceptual
and practical aspects of this process.
Steps Discussed
1. Listing Language Errors:
Students began by compiling a comprehensive
inventory of common linguistic errors found in learner writing. These
included grammatical (e.g., He go to school every day),
lexical (e.g., discuss about the issue), and syntactic
(e.g., Because he was sick, so he stayed home) errors. This list
formed the foundation for developing consistent tagging criteria.
2. Developing Tags:
A tagging scheme was
then designed to categorize each type of error using short, descriptive
codes that are easy to identify and interpret.
For example:
o SVA → Subject-Verb Agreement Error (He go instead of He goes)
o ART → Article Error (He bought car instead of He bought a car)
o PREP → Preposition Error (discuss about instead of discuss)
o TENSE → Verb Tense Error (He is go yesterday)
This step ensures uniformity
and allows the data to be efficiently searched, filtered, and analyzed.
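Once such tags are inserted into the texts, their frequencies can be extracted with a simple pattern search, much as a corpus tool would. A minimal sketch; the angle-bracket tag format and the sentences below are assumptions for illustration:

```python
import re
from collections import Counter

# Assumed inline format: the error tag in angle brackets right after the error.
tagged = ("He go <SVA> to school every day. He bought car <ART> yesterday. "
          "We discuss about <PREP> the issue. She go <SVA> home early.")

# Find every tag and count how often each error type occurs.
tag_counts = Counter(re.findall(r"<([A-Z]+)>", tagged))
print(tag_counts["SVA"])  # 2
print(tag_counts["ART"])  # 1
```

Running the same count per learner file yields the cross-learner and cross-proficiency comparisons described below.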
3. Tagging Through AI Tools:
To minimize manual labor,
students explored the use of AI-assisted tagging tools. These
tools automatically detect and highlight potential errors, which researchers
can then manually verify and refine. This hybrid approach balances accuracy
and efficiency in large-scale tagging projects.
4. Processing Tagged Data via AntConc:
Once the data were tagged, the corpus
was uploaded into AntConc for further analysis. The software
allowed students to search for specific tags, measure
error frequencies, and compare patterns across
learners, genres, or proficiency levels.
5. Finding and Exploring Results:
Finally, students learned how to run
queries to calculate error frequency, observe recurring
patterns, and extract tagged instances for deeper
interpretation and discussion.
Example: Searching for the tag SVA revealed that subject-verb agreement
errors were most frequent in beginner-level essays, particularly with
third-person singular verbs.
This practical activity demonstrated
how manual tagging bridges qualitative and
quantitative approaches in corpus linguistics. Students discovered
that systematic tagging not only reveals surface-level grammatical
trends but also provides insights into underlying linguistic
competence, enabling richer interpretations and more robust
corpus-based comparisons.
5. Research Article Annotation Activity
In the final part of the session,
students annotated a published research article titled
“Collocation Use in EFL Learners’ Writing Across Multiple Language
Proficiencies: A Corpus-Driven Study” (Du, Afzaal & Al Fadda, 2022).
Each section of the article was analyzed systematically using the following academic parameters:
· Title of Research
· Name of Authors
· Publication Date
· Linguistic Feature to be Studied
· Purpose of Study
· Research Objectives
· Research Questions
· Problem Statement
· Research Gaps
· Research Model Employed for the Analysis
· Nature of Data (e.g., newspapers, books, essays)
· Size of Data
· Distribution of Data
· Source of Data
· Tools for Data Collection
· Tools for Data Analysis
· Results of the Study
· Conclusion of the Study
· Limitations of the Study
· Delimitations of the Study
· Relevance of the Study
· Critical Evaluation of the Study
This annotation exercise aimed to
build the ability to critically evaluate academic research,
extract methodological insights, and structure one’s own study systematically.
Summary
By the end of the class, students had
both conceptual clarity and practical experience
in corpus linguistics. They learned how to:
· Distinguish between corpus-based and corpus-driven methodologies,
· Prepare and merge textual data for analysis,
· Apply corpus tools for linguistic investigation,
· Design and tag error-based corpora, and
· Analyze, annotate, and critically assess a published corpus study.
