Academician Leads Comprehensive Review of Multimodal LLM Alignment Algorithms

Author Info

Marcus Reeves

Senior AI Industry Correspondent

M.S. Computer Science (Georgia Tech); former semiconductor equity research associate

Marcus covers frontier model releases, chip supply chains, and capital markets around AI infrastructure. Before joining our desk he spent six years translating earnings calls and product roadmaps into decision-ready briefs for engineering leaders. He stress-tests vendor claims against filings, benchmarks, and on-the-record statements.

#Frontier Models #Semiconductor Supply Chain #Capital Markets #Product Roadmaps


A 10,000-word comprehensive review of alignment algorithms in multimodal LLMs!

This article systematically covers the application scenarios encompassed by existing alignment algorithms, the core factors involved in constructing alignment datasets, the benchmarks used to evaluate these algorithms, and their potential future development directions.

Large Language Models (LLMs) can perform various tasks through simple prompts without task-specific training. However, these models primarily process text data and have limitations in handling multimodal data.

Since the world is inherently multimodal—encompassing visual, auditory, and textual data—researchers have begun developing Multimodal Large Language Models (MLLMs) based on LLMs to handle more complex forms of data.

However, existing MLLMs still face a series of challenges, particularly regarding truthfulness, safety, reasoning capabilities, and alignment with human preferences, which remain insufficiently addressed.

Therefore, alignment algorithms designed to address these issues have emerged as an effective approach to overcoming these challenges.

The primary contribution of this study is a comprehensive and systematic review of alignment algorithms in Multimodal Large Language Models (MLLMs).

Specifically, it explores four key questions:

  • Application Scenarios of Existing Alignment Algorithms: By categorizing current alignment algorithms, the article shows where each applies and gives researchers a unified notation for understanding the distinctions and connections between them.
  • Construction of Alignment Datasets: The construction of alignment datasets involves three core factors: data sources, model responses, and preference annotations. The article systematically analyzes and categorizes these factors, summarizing the strengths and weaknesses of public datasets to provide references for future improvements.
  • Evaluation Methods for Alignment Algorithms: Given that most alignment algorithms target specific tasks—such as reducing hallucinations, ensuring safety, and improving reasoning capabilities—the article compiles commonly used evaluation benchmarks and proposes a clear evaluation framework.
  • Future Development Directions: The article outlines potential future directions for the development of alignment algorithms, particularly focusing on the integration of visual information, empirical insights from LLM alignment methods, and the challenges and opportunities faced by MLLMs as agents.

This research was conducted jointly by researchers from the Institute of Automation, Chinese Academy of Sciences; Nanjing University; University of Science and Technology of China; Nanyang Technological University; Tsinghua Shenzhen International Graduate School; Tencent Youtu Laboratory; National University of Singapore; Lehigh University; The Hong Kong University of Science and Technology; and Squirrel AI Learning.

The study is led by Tan Tie-Niu, Academician of the Chinese Academy of Sciences, and Wang Liang, Fellow of the China Computer Federation (CCF).

Here are more details.

Application Scenarios and Representative Methods

Application Scenarios

The article introduces the application scenarios of MLLM alignment algorithms, categorized into three main levels:

  • General Image Understanding: Focuses primarily on reducing hallucinations (where models generate inaccurate or irrelevant outputs) and enhancing other capabilities such as dialogue and reasoning.
  • Multi-Image, Video, and Audio: Addresses complex multimodal data, such as multiple images and videos, by proposing different architectures and training methods to handle these tasks, particularly in reducing hallucinations and improving model performance.
  • Extended Applications: Explores the application of MLLMs in domain-specific tasks, such as medicine, mathematical reasoning, and security systems, detailing how models can be optimized according to specific domain requirements.

General Image Understanding and Multimodal o1

General Image Understanding

The original intent of MLLM alignment algorithms was to address hallucination issues in multimodal systems. Recent research indicates that these algorithms not only improve the handling of hallucinations but also enhance multiple functional attributes, including safety, dialogue capabilities, and reasoning skills.

This section systematically introduces several innovative methods, classified by their primary application scenarios: reducing hallucinations and enhancing other capabilities.

Reducing Hallucinations

Several methods target hallucination directly.

For example, Fact-RLHF, the first multimodal RLHF algorithm, trains a reward model on 10K manually annotated samples and introduces mechanisms such as per-token KL penalties, factual information calibration, and penalties for correctness and length.
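To make the per-token KL penalty concrete, here is a minimal sketch of the reward shaping commonly used in RLHF pipelines. It is an illustration, not Fact-RLHF's implementation: the function name, the `kl_coef` value, and placing the reward-model score on the final token are all assumptions.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    sequence_reward: float,
    kl_coef: float = 0.1,  # illustrative value, not taken from the paper
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the scalar
    reward-model score added to the last token of the response."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and of equal length")
    # Penalize tokens where the policy drifts away from the reference model.
    rewards = [
        -kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += sequence_reward
    return rewards
```

Tokens on which the policy and the reference model agree receive no penalty, so only drift away from the reference is discouraged.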

DDPO further optimizes standard DPO by increasing the weight of corrected data.

HA-DPO utilizes MLLMs to generate image descriptions, verifies hallucinations using GPT-4, and rewrites positive and negative samples, incorporating an auxiliary causal language modeling loss to reduce hallucinations.

mDPO addresses the issue of ignoring visual information by introducing a visual loss function and adds an anchoring mechanism to prevent the probability of selected responses from decreasing.
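The methods above all build on the standard DPO objective. As a hedged sketch, the per-pair loss can be written in plain Python as follows. The inputs are summed log-probabilities of each response under the policy and reference models. The `weight` argument is a hypothetical stand-in for the kind of reweighting DDPO is described as using, not its exact formulation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,    # illustrative temperature
    weight: float = 1.0,  # hypothetical per-sample weight
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -weight * log_sigmoid(margin)
```

When the policy equals the reference model the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one.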

Enhancing Comprehensive Capabilities

Beyond reducing hallucinations, some algorithms focus on enhancing various model capabilities.

For instance, Silkie collects diverse instruction datasets and uses GPT-4V to evaluate generated responses, thereby providing preference data for applying DPO. CLIP-DPO labels data using CLIP scores and applies DPO loss, simultaneously improving performance in both hallucination mitigation and zero-shot classification tasks.
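Score-based labeling of this kind can be sketched as follows. The example assumes image and response embeddings have already been produced by some encoder such as CLIP, and here they are plain lists of floats. The function names and the `min_gap` threshold are illustrative assumptions, not the procedure used by CLIP-DPO or Silkie.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_preference_pair(
    image_emb: Sequence[float],
    responses: List[str],
    response_embs: List[Sequence[float]],
    min_gap: float = 0.05,  # hypothetical threshold for a usable pair
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) by image-text similarity, or None when
    the best and worst candidates score too closely to be informative."""
    scores = [cosine_similarity(image_emb, emb) for emb in response_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] - scores[worst] < min_gap:
        return None
    return responses[best], responses[worst]
```

Dropping pairs with a small score gap is one simple way to keep noisy labels out of the preference data.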

SIMA enhances performance on multi-image tasks by having the model self-evaluate its generated responses to construct preference pairs.

Recently, methods such as MM-RLHF have further improved alignment effects through more diverse data and algorithms.

Development of Multimodal o1

The popularity of DeepSeek-R1 has brought new insights to the MLLM community.

LMM-R1 uses pure text mathematical datasets, trains via RLOO, and achieves improvements on multimodal math benchmarks.

Open-R1-Video leverages the GRPO method to enhance model performance in the video domain.

VLM-R1 applies R1 methods to handle referring expression comprehension tasks, further expanding multimodal reasoning capabilities.
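RLOO and GRPO both avoid a learned value model by comparing several sampled responses to the same prompt. A minimal sketch of the two advantage estimates is given below. It uses the population standard deviation for GRPO, and the `eps` constant is an assumption. Neither function is taken from the LMM-R1, Open-R1-Video, or VLM-R1 codebases.

```python
import math
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: standardize rewards within one group
    of responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages: each response is compared with the
    mean reward of the other responses in its group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]
```

With verifiable rewards such as 1 for a correct math answer and 0 otherwise, both estimators give positive advantages to correct samples and negative ones to incorrect samples in a mixed group.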

Multi-Image, Video, and Audio

In this section, the article discusses challenges and solutions in multi-image, video, and audio tasks.

  • Multi-Image Tasks: Existing MLLMs often struggle with multi-image understanding. MIA-DPO addresses this by constructing multi-image preference data, achieving good results.
  • Video Tasks: Video understanding is more complex than single-image tasks. Combining DPO with interleaved visual instructions can effectively enhance video task processing capabilities, as seen in methods like LLaVA-NeXT-Interleave.
  • Audio Tasks: Audio-visual understanding suffers from “audio blindness.” Video-SALMONN 2 successfully resolves this by introducing an audio-visual alignment mechanism.

Extended Multimodal Applications

The article also introduces extended applications in specific domains, proposing more targeted alignment methods.

  • Medical Applications: 3D-CT-GPT++ optimizes medical image analysis, successfully reducing diagnostic errors and achieving clinical-level accuracy.
  • Mathematical Applications: The MAVIS method improves MLLM performance in mathematical reasoning by enhancing the visual math problem-solving framework.
  • Security: To address adversarial attacks on multimodal large language models, the article introduces methods such as AdPO and VLGuard, which improve model robustness by optimizing training data and model structures.
  • Agents and Intelligent Systems: Methods like INTERACTIVECOT and EMMOE enhance the performance of MLLMs in embodied intelligence, particularly during complex decision-making processes, by dynamically optimizing reasoning flows and decomposing tasks.

The authors analyze different application scenarios for multimodal large language models, detailing various algorithms and methods covering everything from general image understanding to specific domain applications.

The main contribution lies in demonstrating how optimizing alignment algorithms can reduce hallucinations and enhance comprehensive model capabilities across different tasks, especially in complex fields such as video, audio, medicine, and mathematics.

As these methods continue to be optimized, MLLMs will demonstrate their powerful processing capabilities in more domains.

The following table summarizes common loss function forms for current alignment strategies:

[Table: loss function forms of current alignment strategies]

MLLM Alignment Data Construction and Summary of Existing Data

Main Content Summary

In the research of Multimodal Large Language Models (MLLMs), alignment datasets are a critical component. Since constructing multimodal datasets involves numerous data sources, generation methods, and annotation techniques, researchers have categorized different construction approaches.

These datasets can generally be divided into two categories: datasets that introduce external knowledge and those that rely on self-annotation.

Through these classifications, researchers can gain a clearer understanding of the characteristics of different datasets, thereby supporting the optimization of multimodal systems.

The authors conduct a comprehensive classification and analysis of existing MLLM alignment datasets, detailing the pros and cons of different construction methods and their application scenarios. The research focuses primarily on the following aspects:

  • Datasets Introducing External Knowledge: Discusses datasets constructed through human annotation and closed-source models (such as the GPT-4 series). While these methods improve data quality, they also face challenges such as high costs and subjectivity.
  • Self-Annotated Datasets: Explores methods that utilize the model itself to generate preference pairs for dataset construction, including three types: single-text modality, single-image modality, and image-text hybrid modality.
  • Balancing Data Quality and Scale: The article also discusses how to balance data quality, scale, and cost, and looks forward to the potential of future automated data augmentation technologies, particularly leveraging self-annotation methods to enhance data quality.

Through this work, researchers can gain a clearer understanding of multimodal dataset construction strategies, providing strong support for future research.

Datasets Introducing External Knowledge

  • Human Annotation: High-quality data from various domains is collected through manual labeling.

For example, LLaVA-RLHF collected 10k samples by manually selecting positive and negative responses, while RLHF-V collected 1.4k samples by manually correcting hallucinated responses.

  • Closed-Source LLMs/MLLMs: Preference data generated using GPT-4 series models allows for large-scale dataset construction while reducing costs.

For instance, LRV-Instruction generated 400k visual instructions via GPT-4, covering 16 vision-language tasks.

  • Open-Source LLMs/MLLMs: Using open-source models (such as CLIP, as in CLIP-DPO) to construct preference data reduces costs but may sacrifice data quality.

For example, INTERACTIVECOT constructed an embodied intelligence preference dataset using predefined scores.

Self-Annotated Datasets

  • Single Text Modality:

SQuBa uses a fine-tuned model to generate negative samples and compares them with positive samples via DPO. SymDPO enhances visual learning by converting VQA/classification data into ICL format.

  • Single Image Modality:

Image DPO constructs DPO preference pairs by perturbing images (e.g., Gaussian blur or pixelation) while keeping the text unchanged.
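As an illustration of this kind of perturbation, the sketch below pixelates a grayscale image stored as a nested list and pairs it with the original under the same text. The data layout, function names, and dictionary keys are assumptions made for the example, not Image DPO's actual pipeline.

```python
from typing import Dict, List

Image = List[List[float]]  # grayscale image as rows of pixel values


def pixelate(image: Image, block: int) -> Image:
    """Replace every block x block tile with its mean pixel value."""
    if block < 1:
        raise ValueError("block size must be >= 1")
    height = len(image)
    width = len(image[0]) if image else 0
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            mean = sum(image[r][c] for r in rows for c in cols) / (
                len(rows) * len(cols)
            )
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


def image_preference_pair(image: Image, text: str, block: int = 2) -> Dict:
    """Same text for both sides; only the image differs between the
    preferred (original) and dispreferred (pixelated) sample."""
    return {
        "text": text,
        "chosen_image": image,
        "rejected_image": pixelate(image, block),
    }
```

A larger `block` destroys more visual detail and gives a stronger contrast between the two sides of the pair.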

  • Image-Text Hybrid Modality:

AdPO constructs preference pairs of original/adversarial images and their model responses. During optimization, the image and text content differ between positive and negative samples.

Experimental Findings

In the experimental section, the study found:

Balancing Dataset Scale and Quality: Introducing external knowledge can improve data quality but increases construction costs. While self-annotation methods can generate large-scale data, current self-annotated datasets suffer from lower quality due to MLLM performance limitations and exhibit certain distribution shift issues.

Potential of Automated Augmentation: With the development of automated data augmentation technologies, future self-annotation methods may address current issues with low data quality while enhancing diversity and credibility.

Overall, dataset construction methodologies and quality control are critical factors influencing the alignment performance of Multimodal Large Language Models (MLLMs). Future research should focus on reducing costs and increasing dataset scale without compromising data quality.

Model Evaluation

Existing MLLM alignment evaluation benchmarks are categorized into six key dimensions:

General Knowledge (assessing foundational capabilities), Hallucination (measuring consistency between generated content and facts), Safety (evaluating the ability to mitigate risks in responses), Dialogue (testing whether models can output user-requested content), Reward Models (evaluating reward model performance), and Alignment with Human Preferences.

General Knowledge

Most benchmarks prioritize high-quality, human-annotated datasets tailored specifically for real-world application scenarios.

For example, MME-RealWorld contains 29K question-answer pairs from 13K images, while MMMU includes 11.5K questions sourced from academic materials. MMStar enhances reliability by minimizing data leakage and emphasizing visual dependency.

Many benchmarks introduce innovative methodologies, such as bilingual evaluation in MMBench alongside CircularEval, task graphs in MMT-Bench for in-domain and out-of-domain analysis, and BLINK’s focus on visual perception tasks. These frameworks improve assessment precision and reveal model limitations.
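CircularEval, for instance, asks the same multiple-choice question once per rotation of its options and counts the model as correct only if it answers correctly every time. The sketch below shows the idea. The `predict` callable is a hypothetical stand-in for a model call that returns the index of the selected option.

```python
from typing import Callable, List, Sequence


def circular_eval(
    choices: Sequence[str],
    answer_index: int,
    predict: Callable[[List[str]], int],
) -> bool:
    """Pass only if the model picks the right option under every
    circular shift of the choice list."""
    n = len(choices)
    for shift in range(n):
        rotated = [choices[(i + shift) % n] for i in range(n)]
        # After rotating, the correct option sits at this position.
        expected = (answer_index - shift) % n
        if predict(rotated) != expected:
            return False
    return True
```

A model that always picks the first option passes one rotation by luck but fails the full check, which is why the strategy is stricter than single-pass accuracy.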

Tasks often require advanced multimodal reasoning capabilities, such as mathematical visual integration in MathVista, 3D contextual question answering in SQA3D, and coverage of charts and maps in MMMU.

These benchmarks drive models to address interdisciplinary challenges by curating difficult, fine-grained tasks—such as temporal understanding in MVBench and multi-image processing in Mantis-Instruct—aiming to enhance problem-solving abilities in real-world scenarios, particularly regarding nuanced perception and reasoning.

Hallucination

These benchmarks systematically identify and classify hallucination issues in multimodal models, including object hallucinations (Object HalBench), intrinsic and extrinsic hallucinations (VideoHallucer), and association biases (VALOR-Eval). They emphasize fine-grained evaluation across visual, textual, and sequential contexts.

Many benchmarks propose innovative frameworks, such as polling-based querying (POPE), LLM-driven scoring (HaELM, RefoMB), open-vocabulary detection (OpenCHAIR), annotation-free evaluation (GAVIE), LLM-free pipelines (AMBER), and GPT-4-assisted reasoning analysis (Mementos).

These methods emphasize automated, scalable assessment while addressing issues like data leakage and language priors.

Datasets prioritize fine-grained human annotations (M-HalDetect, HallusionBench) and synthetic data generation (VHTest, MHaluBench), balancing real-world complexity (e.g., counter-intuitive images in PhD, 58K Q&A pairs in ActivityNet-QA) with controlled challenges (e.g., robustness analysis in R-Bench).

Some benchmarks focus on specific tasks, such as multilingual support (MHumanEval), while others address broader issues like bias and interference (Bingo). All aim to improve model robustness in practical scenarios.

By proposing alignment strategies (such as open-source feedback in RLAIF-V) and unified frameworks (HQH), these benchmarks provide guidance for developing more reliable multimodal systems.

Safety

Some studies introduce novel techniques, such as diffusion-based adversarial attacks (AdvDiffVLM), red-teaming frameworks (RTVLM), and post-training fine-tuning strategies (VLGuard).

These methods enhance the rigor of evaluation by simulating real-world threats or improving model resilience against interference.

Benchmarks like MultiTrust and RTVLM unify trustworthiness assessments across multiple dimensions (e.g., authenticity, fairness), while others focus on specific challenges such as out-of-distribution (OOD) generalization (VLLM-safety-bench) or over-sensitivity (MOSSBench). These benchmarks offer holistic insights into model limitations.

MM-RLHF-SafetyBench samples from existing datasets to further cover areas such as adversarial attacks, privacy, red-teaming, and harmful content detection.

Dialogue

These benchmarks prioritize the evaluation of foundational visual skills, such as low-level perception capabilities (Q-Bench, LLVisionQA), descriptive abilities for low-level information (LLDescribe), and quality assessment.

They emphasize the model’s ability to interpret and express fine-grained visual information.

Several benchmarks test generalization in challenging scenarios, including unconventional images (LLaVA Bench-Wilder), cross-domain tasks (integration of math/news in LiveBench), and adversarial prompts (high-difficulty questions in Vibe-Eval). These benchmarks reveal adaptability beyond standard datasets.

Reward Models

Each benchmark targets specific evaluation dimensions, such as multilingual support (23 languages in M-RewardBench), alignment/safety/bias (MJ-Bench), enhanced interpretability and final model scoring via human annotation (MM-RLHF-RewardBench), and the capability of MLLMs as auxiliary judges across multiple modalities (scoring and pairwise comparisons in MLLM-as-a-Judge).

These frameworks reveal strengths and weaknesses in both structured and OOD scenarios.

High-quality datasets are curated through human-AI collaboration (e.g., annotation pipelines in VL-RewardBench) or structured triplet designs (RewardBench). Tasks range from simple preference ranking to complex reasoning, pushing models to handle nuanced challenges such as hallucination and ethical alignment.

Alignment

Some benchmarks investigate the model’s ability to align with human preferences.

AlignBench is a comprehensive multi-dimensional benchmark designed to evaluate the alignment capabilities of Chinese LLMs. AlpacaEval-V2 proposes a simple regression analysis method to control for length bias in automatic evaluation. Arena-Hard achieved 98.6% correlation with human preference rankings by tripling the separation of model performance. MM-AlignBench is a specialized, manually annotated benchmark designed to evaluate alignment with human values.

Overall, many current MLLM alignment algorithms focus on preventing hallucinations while exploring how to leverage alignment algorithms to enhance general knowledge and dialogue capabilities in MLLMs—a significant direction for future research.

Some researchers view unsafe responses as misaligned with human preferences; therefore, they apply MLLM alignment algorithms to address safety issues. The effectiveness of reward models within these frameworks, particularly their performance in guiding alignment, warrants further study.

Furthermore, regarding alignment with human preferences, benchmarks have evolved from the LLM domain into the MLLM domain.

Future Work and Challenges

With the rapid development of Multimodal Large Language Models (MLLMs), aligning them with human preferences has become a research focus. However, several challenges remain.

First, the scarcity of high-quality and diverse datasets remains unresolved. Second, many methods fail to effectively utilize visual information, often relying primarily on text to construct positive and negative samples, thereby ignoring the full potential of multimodal data. Additionally, there is a lack of comprehensive evaluation standards; current methods are typically validated only on specific types of benchmarks such as hallucination or dialogue tasks, making it difficult to assess their generalizability.

By drawing on advancements in LLM post-training strategies and agent research, limitations in existing MLLM alignment methods can be revealed. Overcoming these challenges is crucial for developing more robust and comprehensive alignment approaches.

Data Challenges

MLLM alignment faces two key data-related challenges: data quality and coverage.

First, the availability of high-quality MLLM alignment data is limited. Acquiring and annotating multimodal data is significantly more complex than for LLMs due to the involvement of processing multiple modalities.

Second, existing datasets lack sufficient coverage of diverse multimodal tasks, such as optical character recognition (OCR), mathematical problems, and chart understanding. Constructing a comprehensive dataset covering a wide range of tasks is highly challenging.

To the authors’ knowledge, no publicly available, fully human-annotated multimodal dataset currently exceeds 200,000 samples.

These limitations in data quality and coverage constitute major obstacles to effective MLLM alignment.

Leveraging Visual Information for Alignment

Current alignment data can be represented as preference data $D=(x, I, y_w, y_l)$, where $x$ is the question, $I$ is the image, and $y_w$ and $y_l$ are the preferred (correct) and dispreferred (incorrect) responses, respectively.

In current research, there are three main approaches to leveraging visual information to enhance alignment performance, each with its own limitations:

  • Using corrupted or irrelevant images as negative samples during the alignment phase. Researchers create a new image $I_{neg}$ and use $(y_w | x, I_{neg})$ as the negative sample. This improves alignment by reducing hallucinations and making MLLMs more robust to different images. However, visual negative samples often rely on diffusion algorithms or image modification, which lack strong quality metrics and incur high computational costs.

  • Generating new questions and answers based on corrupted images. Here researchers create a new image $I_{neg}$, generate an additional response $y_{neg}$ from that image, and treat $(y_{neg} | x, I)$ as the negative sample. This increases the diversity of the text comparisons, but generating the extra negative samples adds computational overhead.

  • Using similarity metrics such as CLIP cosine similarity to evaluate text-image matching. This method filters data by calculating similarity scores between text and images, or incorporates those scores into reinforcement learning reward functions. While this helps reduce data noise, scoring quality depends on the quality of the evaluation model and may inherit its biases.

Each method plays a role in leveraging visual data to enhance MLLM alignment but involves trade-offs regarding efficiency, cost, and potential bias.
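The first two approaches differ only in which element of the tuple changes between the positive and negative sample. A minimal sketch makes this explicit. The class and function names are hypothetical, and `corrupt` and `generate` stand in for an image-perturbation step and a model call that are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class PreferenceSample:
    question: str  # x
    image: str     # I (a path or identifier in this sketch)
    chosen: str    # y_w
    rejected: str  # y_l


@dataclass(frozen=True)
class ConditionedResponse:
    question: str
    image: str
    response: str


def visual_negative_pair(
    sample: PreferenceSample,
    corrupt: Callable[[str], str],
) -> Tuple[ConditionedResponse, ConditionedResponse]:
    """Approach 1: the response stays y_w; the negative side conditions
    on the corrupted image I_neg instead of I."""
    positive = ConditionedResponse(sample.question, sample.image, sample.chosen)
    negative = ConditionedResponse(
        sample.question, corrupt(sample.image), sample.chosen
    )
    return positive, negative


def regenerated_negative_pair(
    sample: PreferenceSample,
    corrupt: Callable[[str], str],
    generate: Callable[[str, str], str],
) -> Tuple[ConditionedResponse, ConditionedResponse]:
    """Approach 2: y_neg is generated from I_neg, then used as the
    negative response for the original image I."""
    y_neg = generate(sample.question, corrupt(sample.image))
    positive = ConditionedResponse(sample.question, sample.image, sample.chosen)
    negative = ConditionedResponse(sample.question, sample.image, y_neg)
    return positive, negative
```

In the first function only the image differs between the two sides; in the second only the response text differs. That is the trade-off the bullets describe.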

Comprehensive Evaluation

Most MLLM alignment research primarily evaluates algorithm performance in several key areas such as hallucination, dialogue capabilities, or safety.

However, future research should adopt more comprehensive evaluation methods, assessing alignment approaches across a broader range of tasks to better demonstrate their generalizability and effectiveness.

Full-Modal Alignment

Align-anything pioneered full-modal alignment through the multimodal dataset “align-anything-200k,” covering text, images, audio, and video. This research demonstrated complementary effects between different modalities.

However, their work is still in its early stages; datasets for each modality are relatively small, limiting task coverage.

Furthermore, the proposed algorithms are merely preliminary improvements to the DPO method, failing to fully utilize the unique structural information inherent in each modality.

In the future, designing alignment algorithms that extend beyond image/text domains—particularly those targeting other modalities—will be a key trend.

MLLM Reasoning

Recently, reasoning LLMs such as OpenAI o1 and DeepSeek-R1 have demonstrated that reinforcement learning algorithms and preference data are crucial for improving LLM performance in complex problem-solving, long-context understanding, and generation tasks.

This section explores insights gained from LLM reasoning enhancement research and their impact on aligning MLLMs, analyzed primarily through the dimensions of data and optimization frameworks.

(1) Data.

  • Scale and Quality. Corresponding methods have evolved from resampling small models (e.g., OpenMathInstruct) to high-quality synthetic data (e.g., AceMath), gradually adopting cutting-edge models (e.g., OpenAI o1) and achieving scalable knowledge transfer via domain-specific model synthesis (e.g., DeepSeek-V3). Currently, datasets used for reasoning enhancement generally reach the scale of millions of samples (e.g., Qwen-2.5-MATH).

  • Efficiency. The “less is more” alignment approach (e.g., LIMA’s 1k samples for a 65B Llama) suggests that a small amount of high-quality data can activate pre-trained capabilities, reducing dependence on dataset scale.

(2) Optimization Frameworks.

  • Sampling Strategies. Recent advances indicate that online reinforcement learning (RL) is becoming the mainstream method. Online sampling methods in DeepSeek-V3 and Qwen-2.5-MATH effectively mitigate distribution shift. Additionally, Mini-Max adopts an offline + online sampling strategy, further enhancing model performance.

  • Training Paradigms. Multi-stage, collaborative optimization has become the mainstream approach. For example, Llama 3 includes six rounds of DPO iterations, while DeepSeek utilizes temperature-varied sampling and reflection/verification hints to optimize reasoning depth (long chain-of-thought) and conciseness.

  • Algorithms. Reinforcement learning algorithms have evolved from early policy gradient methods to more complex Proximal Policy Optimization (PPO). Recent improvements based on PPO mainly follow two directions:

    One direction involves removing the critic model and training the policy with sparse rewards, thereby reducing the parameter count by half (e.g., DPO and GRPO); the other focuses on refining the design of the critic model, such as introducing ratios into the advantage function in PRIME or reshaping positive and negative sample rewards in OREAL.

    By prioritizing high-quality data and innovative optimization frameworks, the Multimodal Large Language Model (MLLM) field is moving toward more effective and scalable models that can better unlock the reasoning potential of MLLMs.

Insights from LLM Alignment

Alignment for Large Language Models (LLMs) has become a key focus of recent research, offering many valuable insights that can guide the development of MLLMs. By examining lessons learned from existing LLM alignment strategies, we can reveal key principles that help advance MLLM research:

(1) Improving Training Efficiency.

Current MLLM alignment methods rely on the Direct Preference Optimization (DPO) loss function. However, because DPO requires keeping both the policy model and the reference model loaded at the same time, training is significantly slower. Can reference-free methods like SimPO be utilized to further improve training efficiency?

This approach could accelerate the training process while reducing dependence on a reference model. Further research into the specific role and impact of reference models in MLLM alignment is crucial for improving efficiency and optimizing model design.
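For comparison, a reference-free objective in the style of SimPO uses the length-normalized log-probability of each response as its implicit reward and subtracts a target margin. The sketch below is an illustration with assumed default values for `beta` and `gamma`; it is not an MLLM-specific recipe from the article.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(
    chosen_logp_sum: float,
    chosen_len: int,
    rejected_logp_sum: float,
    rejected_len: int,
    beta: float = 2.0,   # illustrative reward scale
    gamma: float = 0.5,  # illustrative target margin
) -> float:
    """SimPO-style loss for one pair; no reference model is needed."""
    if chosen_len < 1 or rejected_len < 1:
        raise ValueError("responses must contain at least one token")
    # Implicit reward: average per-token log-probability, scaled by beta.
    chosen_reward = beta * chosen_logp_sum / chosen_len
    rejected_reward = beta * rejected_logp_sum / rejected_len
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)
```

Only the policy's own log-probabilities appear in the loss, so the second forward pass through a frozen reference model is no longer needed.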

(2) Mitigating Over-Optimization/Reward Hacking.

When using DPO or Reinforcement Learning from Human Feedback (RLHF) for LLM alignment, over-optimization remains a key challenge: performance improves by exploiting the learned proxy reward model, but true quality may stagnate or degrade.

To address this challenge, mitigation strategies include:

  • Using balanced training datasets to ensure diversity and representativeness, preventing overly narrow optimization;
  • Implementing early stopping when validation performance plateaus;
  • Introducing regularization techniques to reduce over-reliance on training data and improve model generalization.
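The early-stopping rule in the list above can be sketched as a small helper. It assumes a held-out validation score where higher is better; the `patience` and `min_delta` parameters are illustrative choices, not values recommended by the article.

```python
from typing import Iterable, Optional


def early_stop_step(
    val_scores: Iterable[float],
    patience: int = 3,
    min_delta: float = 0.0,
) -> Optional[int]:
    """Return the index of the evaluation at which training would stop,
    or None if the score kept improving often enough."""
    best = float("-inf")
    stale = 0  # consecutive evaluations without improvement
    for step, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return step
    return None
```

To guard against reward hacking, the validation score should come from held-out human preferences or a separate evaluator rather than the proxy reward model being optimized.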

MLLMs as Agents

MLLMs combine the powerful reasoning capabilities of LLMs with the ability to process data from multiple modalities (such as images, text, and audio). This enables them to extract knowledge from various information sources and perform comprehensive analysis, offering significant advantages in handling complex real-world tasks.

However, transforming MLLMs into efficient agents still requires addressing several pending issues:

  • Multi-Agent Collaboration. Currently, multi-agent collaboration frameworks for text-based agents have made significant progress, but mature solutions for multi-agent systems based on MLLMs are still lacking.
  • Robustness. The robustness of MLLM agents in open environments has not been systematically verified; adversarial robustness testing and assurance techniques need to be introduced.
  • Security. Introducing more complex components into MLLM agents increases security risks. Future research should explore various security protection mechanisms to mitigate these risks.

Paper Link: https://arxiv.org/pdf/2503.14504
GitHub Link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Alignment