ML Publication Trends

Analysis of Publication Trends in Top AI Conferences (2006-2024)
ml publications
data analysis
data science
Author

Devansh Lodha

Published

April 13, 2025

CS 328 Writing Assignment - 2025

Group Members: Devansh Lodha (23110091), Tejas Lohia (23110335), Mohit (23110207), Tanishq Chaudhari (23110329), Shardul Junagade (23110297)

1. Introduction

The field of Artificial Intelligence (AI) has experienced exponential growth over the past two decades. Tracking publications at top-tier Machine Learning (ML) conferences like NeurIPS, ICML, and ICLR provides a valuable lens through which to observe this growth, identify leading institutions and countries, and understand shifting trends, such as the increasing role of industry research.

This report analyzes a dataset compiling paper titles, authors, and affiliations from NeurIPS (2006-2024), ICML (2017-2024), and ICLR (2018-2024, excluding 2020). The primary challenge lies in the unstructured nature of the original affiliation data. To address this, a Large Language Model (LLM) approach was employed to extract and standardize institution names, countries, and affiliation types (Academia/Industry).

The analysis follows the structure outlined in the assignment:

  1. Summarize the prepared data and provide an overview using visualizations.
  2. Posit hypotheses regarding growth, geographic shifts, industry involvement, and research concentration.
  3. Quantify and analyze the data to settle these hypotheses.
  4. Discuss the findings, limitations, and conclusions.

All analyses and visualizations were performed using Python with pandas for data manipulation and Plotly for interactive plotting.

Code
import pandas as pd
import plotly.io as pio
import plotly.express as px
from sklearn.cluster import KMeans 
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA

# Plotly setup for notebook display
pio.renderers.default = "plotly_mimetype+notebook_connected"
pio.templates.default = "plotly_white"

2. Data Loading and Affiliation Processing

The analysis begins with the raw dataset containing publication metadata. A critical challenge is the Affiliation column, which contains unstructured, highly variable text strings. Examples include:

  • "Dept. of Computer Science, Stanford University, CA, USA and Google Research"
  • "MIT"
  • "Institute for Neuroinformatics, University of Zurich and ETH Zurich"
  • "Tsinghua University"

Direct analysis based on this raw text is infeasible due to inconsistency, ambiguity (e.g., “MIT”), and the presence of multiple institutions within single strings. To enable meaningful analysis of institutional and geographic trends, we need to extract structured information:

  1. Standardized Institution Name: Map variations to a common name (e.g., “MIT” -> “Massachusetts Institute of Technology”, “Google Brain” -> “Google”).
  2. Country: Identify or infer the country for each institution.
  3. Type: Classify the institution as “Academia” or “Industry”.

Given the scale (12,887 unique affiliations) and complexity, a Large Language Model (LLM) approach was chosen for this extraction task.

Code
# Load the RAW dataset
df_raw = pd.read_csv('papers.csv')

print("Raw DataFrame shape:", df_raw.shape)
print("Number of unique affiliations:", df_raw['Affiliation'].nunique())
df_raw.head()

2.1 LLM-Based Information Extraction Methodology

We utilize the mistralai/Mistral-Small-24B-Instruct-2501 model via the vLLM inference engine for efficient batch processing. The core idea is to provide each unique affiliation string to the LLM within a carefully crafted prompt, instructing it to return a structured JSON list containing the desired information for all institutions mentioned.

Steps We Followed:

  1. Identify Unique Affiliations: Extract the unique non-null affiliation strings from the raw data.
  2. Prompt Engineering: Develop a detailed prompt template that:
    • Clearly defines the task (extract institution, country, type).
    • Instructs the model to handle multiple institutions per string.
    • Provides examples of desired normalization (e.g., MIT, Google variations).
    • Specifies the output format as a JSON list of objects ([{ "institution": ..., "country": ..., "type": ... }, ...]).
    • Requests country inference where possible, using null if uncertain.
    • Uses TEMPERATURE=0.0 for deterministic output suitable for extraction.
    • (The final prompt template is shown within the code cell below).
  3. Batch Inference (vLLM): Generate responses for all unique affiliations in batches using vLLM for performance. (Note: running the next code cell requires the vLLM library and significant GPU resources; we used an A100 40GB via Ola Krutim credits offered to IIT Gandhinagar.)
  4. Parsing and Validation: Parse the LLM’s text output as JSON. Validate that the output is a list of dictionaries, with each dictionary containing the required keys. Store the parsed list, the raw model output, and a parse status for each affiliation.
  5. Mapping and Exploding: Map the structured results back to the original DataFrame and then use the pandas explode function to create one row per author-institution pair.
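
To make step 5 concrete, here is a toy illustration (with hypothetical data, not the real pipeline) of the placeholder-and-explode pattern applied in Section 2.2: each extracted institution becomes its own author-institution row, and an empty extraction list survives as a single placeholder row.

Code
import pandas as pd

# Toy example: one author with two extracted institutions, one with none
toy = pd.DataFrame({
    "Author": ["A. One", "B. Two"],
    "StructuredOutput": [
        [{"institution": "Stanford University"}, {"institution": "Google"}],
        [],  # e.g. a failed parse
    ],
})
# Replace empty lists with [None] so the row is kept after exploding
toy["Explode"] = toy["StructuredOutput"].apply(lambda x: [None] if not x else x)
toy = toy.explode("Explode", ignore_index=True)
print(toy[["Author", "Explode"]])
# 0  A. One  {'institution': 'Stanford University'}
# 1  A. One  {'institution': 'Google'}
# 2  B. Two  None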
Code
# Import necessary libraries for this block, assuming separate execution
import pandas as pd
import json
from vllm import LLM, SamplingParams

# Configuration
MODEL_NAME = "mistralai/Mistral-Small-24B-Instruct-2501"
TENSOR_PARALLEL_SIZE = 1
MAX_NEW_TOKENS = 800 # Increased for potentially long JSON lists
TEMPERATURE = 0.0
GPU_MEMORY_UTILIZATION = 0.9 # Adjust if needed

# Get Unique Affiliations
unique_affiliations = df_raw['Affiliation'].dropna().unique().tolist()

# Define Prompt Template
# Using f-strings requires escaping the literal braces in the JSON examples with double braces {{ }}
PROMPT_TEMPLATE = """Your task is to analyze the affiliation string provided and extract detailed information about all mentioned institutions.

Follow these instructions carefully:
1.  Identify all distinct institutions mentioned in the affiliation string.
2.  For each institution, determine its full standardized name, its country, and its type ("Academia" or "Industry").
3.  **Normalization:** Standardize common institution names. For example:
    *   "MIT", "M.I.T." should become "Massachusetts Institute of Technology".
    *   "Google Research", "Google Brain", "DeepMind" should become "Google".
    *   "UC Berkeley" should become "University of California, Berkeley".
    *   Use the full official name where possible.
4.  **Country Inference:** If the country is not explicitly mentioned, try to infer it from the institution name, city, or state. If the country cannot be determined reliably, use `null`.
5.  **Output Format:** Return the information as a **JSON list**. Each element in the list should be a JSON object containing the keys "institution", "country", and "type" for one identified institution.
6.  **Output ONLY the JSON list**, nothing else before or after.

Example 1:
Affiliation: "Dept. of Computer Science, Stanford University, CA, USA and Google Research"
JSON Output:
[
  {{"institution": "Stanford University", "country": "USA", "type": "Academia"}},
  {{"institution": "Google", "country": "USA", "type": "Industry"}}
]

Example 2:
Affiliation: "Université de Montréal, Mila - Quebec AI Institute"
JSON Output:
[
  {{"institution": "Université de Montréal", "country": "Canada", "type": "Academia"}},
  {{"institution": "Mila - Quebec AI Institute", "country": "Canada", "type": "Academia"}}
]

Example 3:
Affiliation: "Vrije Universiteit Brussel"
JSON Output:
[
  {{"institution": "Vrije Universiteit Brussel", "country": "Belgium", "type": "Academia"}}
]

Now, analyze the following affiliation:

Affiliation: "{affiliation}"

JSON Output:
"""

# Prepare Prompts
prompts = [PROMPT_TEMPLATE.format(affiliation=aff) for aff in unique_affiliations]

# Initialize vLLM

llm = LLM(
    model=MODEL_NAME,
    tokenizer_mode="mistral",
    tensor_parallel_size=TENSOR_PARALLEL_SIZE,
    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
    trust_remote_code=True
)

results_map = {} # Initialize results map


# Define Sampling Parameters
sampling_params = SamplingParams(
    max_tokens=MAX_NEW_TOKENS,
    temperature=TEMPERATURE,
)

# Run Batch Inference
outputs = llm.generate(prompts, sampling_params)

# Process Outputs
parse_errors = 0
for i, output in enumerate(outputs):
    original_affiliation = unique_affiliations[i]
    generated_text = output.outputs[0].text.strip()

    result_entry = {
        'StructuredOutput': None,
        'ModelOutput': generated_text,
        'ParseStatus': 'Error: Unknown'
    }

    try:
        # Strip optional markdown code fences around the JSON
        cleaned_text = generated_text
        if cleaned_text.startswith("```json"):
            cleaned_text = cleaned_text[len("```json"):].strip()
        if cleaned_text.endswith("```"):
            cleaned_text = cleaned_text[:-len("```")].strip()
        if not cleaned_text:
            raise ValueError("Cleaned text is empty")

        extracted_data = json.loads(cleaned_text)

        if isinstance(extracted_data, list) and all(isinstance(item, dict) for item in extracted_data):
            all_items_valid = True
            for item in extracted_data:
                if not isinstance(item.get("institution"), str) or \
                   not isinstance(item.get("type"), str) or \
                   not (isinstance(item.get("country"), str) or item.get("country") is None):
                    all_items_valid = False
                    print(f"Warning: Item lacks expected keys/types for affiliation '{original_affiliation}'. Item: {item}")
                    break
            if all_items_valid:
                result_entry['StructuredOutput'] = extracted_data
                result_entry['ParseStatus'] = 'Success'
            else:
                result_entry['ParseStatus'] = 'Error: Invalid Item Structure'
                parse_errors += 1
        else:
            result_entry['ParseStatus'] = 'Error: Output Not a JSON List'
            parse_errors += 1
    except (json.JSONDecodeError, ValueError) as e:
        # Record the failure instead of aborting the whole batch
        result_entry['ParseStatus'] = f'Error: {e}'
        parse_errors += 1

    results_map[original_affiliation] = result_entry

# Clean up GPU memory
del llm
import torch
torch.cuda.empty_cache()
print("GPU cache cleared.")

2.2 Processing LLM Results & Data Structuring

The raw text outputs from the LLM containing JSON lists are parsed and validated. Key challenges during this phase include handling malformed JSON, ensuring the output is correctly structured as a list of dictionaries, and managing cases where the model failed to extract information or adhere to the format.

The results_map dictionary now holds the processing status and structured output (or errors) for each unique affiliation. We map this back to the original DataFrame and then “explode” the rows associated with multiple institutions to create the final analysis-ready DataFrame (df), where each row corresponds to a single author-institution link. Missing or invalid extracted data is represented by NaN or specific ‘Unknown’ placeholders:

  • Author (5 NaNs): A very small number of entries had missing author names. Given the large dataset size and the focus on aggregate trends, these rows were removed.
  • Affiliation (3999 NaNs): This is the original raw string and is not used directly in most analyses. Missing values here are expected.
  • Institution (4028 NaNs): Instances where the LLM could not determine a standardized institution (including original affiliation NaNs and parsing issues). For institution-specific analyses, these are treated as ‘Unknown’.
  • Country (8264 NaNs): Includes original NaNs and cases where the country could not be reliably inferred; this is expected given the variability of affiliation strings. These are treated as ‘Unknown’ for geographic analysis.
  • Type (68 NaNs): A small number. We ensure consistency: if Institution is ‘Unknown’, Type is also set to ‘Unknown’. Remaining NaNs are marked as ‘Unknown’.

Code
# Map results back to the raw DataFrame
def get_result_data(affiliation, key, default=None):
    # Handle potential NaN affiliations during mapping
    if pd.isna(affiliation):
        return default
    result = results_map.get(affiliation)
    if isinstance(result, dict):
        return result.get(key, default)
    return default

df_raw['StructuredOutput'] = df_raw['Affiliation'].apply(lambda aff: get_result_data(aff, 'StructuredOutput'))
df_raw['ModelOutput_LLM'] = df_raw['Affiliation'].apply(lambda aff: get_result_data(aff, 'ModelOutput')) # Use different name
df_raw['ParseStatus'] = df_raw['Affiliation'].apply(lambda aff: get_result_data(aff, 'ParseStatus', 'Error: Mapping Failed'))

# Clean up and Prepare for Exploding
def clean_structured_output(row):
    # NaN affiliations and failed/missing parses yield no structured output
    if pd.isna(row['Affiliation']) or not isinstance(row['StructuredOutput'], list):
        return []
    # Keep only dict items; the parser should already guarantee this
    return [item for item in row['StructuredOutput'] if isinstance(item, dict)]

df_raw['StructuredOutput'] = df_raw.apply(clean_structured_output, axis=1)

# Add placeholder for empty lists to keep rows during explode
df_raw['StructuredOutput_Explode'] = df_raw['StructuredOutput'].apply(lambda x: [None] if not x else x)

# Explode
df_exploded = df_raw.explode('StructuredOutput_Explode', ignore_index=True)

# Extract Final Columns
def safe_get(data, key):
    if isinstance(data, dict):
        return data.get(key)
    return None # Handles the None placeholder

df_exploded['Institution'] = df_exploded['StructuredOutput_Explode'].apply(lambda x: safe_get(x, 'institution'))
df_exploded['Country'] = df_exploded['StructuredOutput_Explode'].apply(lambda x: safe_get(x, 'country'))
df_exploded['Type'] = df_exploded['StructuredOutput_Explode'].apply(lambda x: safe_get(x, 'type'))

# Final DataFrame Assembly
# Select and rename columns
df_final = df_exploded[['Conference', 'Year', 'Title', 'Author', 'Affiliation', # Original cols
                        'Institution', 'Country', 'Type', # New extracted cols
                        'ModelOutput_LLM', 'ParseStatus']].copy() # Metadata cols

# Handle NaNs created during processing
df_final.dropna(subset=['Author'], inplace=True) # Drop the handful of rows with missing author names
df_final['Institution'] = df_final['Institution'].fillna("Unknown Institution")
df_final['Country'] = df_final['Country'].fillna("Unknown Country")
# Ensure Type consistency (important AFTER extraction)
df_final.loc[df_final['Institution'] == "Unknown Institution", 'Type'] = "Unknown Type"
df_final['Type'] = df_final['Type'].fillna("Unknown Type") # Catch any remaining NaNs

# Ensure Year is numeric
df_final['Year'] = pd.to_numeric(df_final['Year'])

# Rename df_final to df to match subsequent code cells
df = df_final.copy()

You can download the processed dataframe used further in this analysis here: Download Dataset CSV

3. Data Summary and Overview

We begin by summarizing the key characteristics of the prepared dataset and visualizing the overall trends.

Code
# Calculations for Summary
unique_papers_df = df[['Conference', 'Year', 'Title']].drop_duplicates()
total_unique_papers = len(unique_papers_df)
total_authorship_instances = len(df)
time_frames = df.groupby('Conference')['Year'].agg(['min', 'max'])
overall_min_year = df['Year'].min()
overall_max_year = df['Year'].max()
unknown_institutions_count = (df['Institution'] == 'Unknown Institution').sum()
unknown_countries_count = (df['Country'] == 'Unknown Country').sum()
unknown_types_count = (df['Type'] == 'Unknown Type').sum()
perc_unknown_inst = (unknown_institutions_count / total_authorship_instances) * 100
perc_unknown_country = (unknown_countries_count / total_authorship_instances) * 100
perc_unknown_type = (unknown_types_count / total_authorship_instances) * 100

# Store stats for Markdown
summary_stats = {
    "Total Unique Papers": total_unique_papers,
    "Total Authorship Instances": total_authorship_instances,
    "Overall Years": f"{overall_min_year}-{overall_max_year}",
    "% Unknown Institutions": f"{perc_unknown_inst:.2f}%",
    "% Unknown Countries": f"{perc_unknown_country:.2f}%",
    "% Unknown Types": f"{perc_unknown_type:.2f}%"
}

# Plot 1: Unique Papers per Year
print("\nPlot 1: Unique Papers per Year")
papers_per_year = unique_papers_df.groupby('Year').size().reset_index(name='UniquePaperCount')
fig1 = px.line(papers_per_year,
              x='Year',
              y='UniquePaperCount',
              markers=True,
              title='Total Unique Papers per Year (Across All Conferences)',
              labels={'UniquePaperCount': 'Number of Unique Papers', 'Year': 'Year'})
fig1.update_layout(hovermode="x unified")
fig1.show()

# Plot 2: Authorship Instances per Year
print("Plot 2: Authorship Instances per Year")
instances_per_year = df.groupby('Year').size().reset_index(name='AuthorshipInstanceCount')
fig2 = px.line(instances_per_year,
              x='Year',
              y='AuthorshipInstanceCount',
              markers=True,
              title='Total Authorship Instances per Year (Across All Conferences)',
              labels={'AuthorshipInstanceCount': 'Number of Authorship Instances', 'Year': 'Year'})
fig2.update_layout(hovermode="x unified")
fig2.show()

# Plot 3: Total Authorship Instances per Conference
print("Plot 3: Total Authorship Instances per Conference")
conf_counts = df['Conference'].value_counts().reset_index()
conf_counts.columns = ['Conference', 'Count'] # Rename columns for clarity
fig3 = px.bar(conf_counts,
             x='Conference',
             y='Count',
             color='Conference',
             title='Total Authorship Instances per Conference (Overall)',
             labels={'Count': 'Number of Authorship Instances', 'Conference': 'Conference'},
             text_auto=True)
fig3.update_layout(showlegend=False)
fig3.show()

# Plot 4: Top 15 Countries by Total Authorship Instances
print("Plot 4: Top 15 Countries")
country_df_filtered = df[df['Country'] != 'Unknown Country']
country_counts = country_df_filtered['Country'].value_counts().head(15).reset_index()
country_counts.columns = ['Country', 'Count']
country_counts = country_counts.sort_values(by='Count', ascending=True) # Sort for horizontal bar

fig4 = px.bar(country_counts,
             x='Count',
             y='Country',
             orientation='h',
             title='Top 15 Countries by Total Authorship Instances (Overall)',
             labels={'Count': 'Number of Authorship Instances', 'Country': 'Country'},
             text='Count')
fig4.update_layout(yaxis={'categoryorder':'total ascending'})
fig4.update_traces(textposition='outside')
fig4.show()

# Plot 5: Top 15 Institutions by Total Authorship Instances
print("Plot 5: Top 15 Institutions")
inst_df_filtered = df[df['Institution'] != 'Unknown Institution']
inst_counts = inst_df_filtered['Institution'].value_counts().head(15).reset_index()
inst_counts.columns = ['Institution', 'Count']
inst_counts = inst_counts.sort_values(by='Count', ascending=True) # Sort for horizontal bar

fig5 = px.bar(inst_counts,
             x='Count',
             y='Institution',
             orientation='h',
             title='Top 15 Institutions by Total Authorship Instances (Overall)',
             labels={'Count': 'Number of Authorship Instances', 'Institution': 'Institution'},
             text='Count')
fig5.update_layout(yaxis={'categoryorder':'total ascending'})
fig5.update_traces(textposition='outside')
fig5.show()

# Plot 6: Overall Distribution by Type (Academia/Industry)
print("Plot 6: Distribution by Type")
type_df_filtered = df[df['Type'].isin(['Academia', 'Industry'])] # Filter only known types
type_counts = type_df_filtered['Type'].value_counts().reset_index()
type_counts.columns = ['Type', 'Count']

if not type_counts.empty:
    fig6 = px.pie(type_counts,
                 names='Type',
                 values='Count',
                 title='Overall Distribution of Authorship Instances by Type',
                 hole=0.3)
    fig6.update_traces(textinfo='percent+label', pull=[0.05, 0.05])
    fig6.show()
else:
    print("No known types ('Academia'/'Industry') found to generate distribution plot.")

Plot 1: Unique Papers per Year
Plot 2: Authorship Instances per Year
Plot 3: Total Authorship Instances per Conference
Plot 4: Top 15 Countries
Plot 5: Top 15 Institutions
Plot 6: Distribution by Type

3.1 Summary of Dataset Characteristics

The prepared dataset covers NeurIPS (2006-2024), ICML (2017-2024), and ICLR (2018-2024, excluding 2020). It contains 38,510 unique papers and roughly 184,000 authorship instances (individual author-institution links).

  • Data Completeness: After processing, the remaining percentages of unmapped entries are:
    • Institutions: 2.19%
    • Countries: 4.50%
    • Types: 2.21%
  • Overview Trends (from plots above):
    • Both unique paper counts and total authorship instances show a strong upward trend over the years, confirming significant growth in research output at these venues.
    • NeurIPS has the highest overall volume of authorship instances, followed by ICML and ICLR.
    • Geographically, the USA and China are the dominant contributors, with the UK, Canada, and Germany also having significant presence.
    • Institutionally, Google, Stanford, MIT, Tsinghua, and CMU lead in overall contributions.
    • Affiliation types are predominantly Academia (around 70-80%), but Industry forms a substantial minority.

4. Hypotheses, Analysis and Findings

Based on the dataset and initial overview, we formulate hypotheses to investigate specific trends.

H1: Overall Growth

Hypothesis: The volume of AI research published at these top conferences (measured by unique papers and authorship instances) has increased significantly over the period covered by the dataset.

Quantification: Plotting unique papers per year and total authorship instances per year (Plots 1 & 2 generated in Section 3).

Findings: Plots 1 and 2 clearly show a strong, near-exponential increase in both the number of unique papers and the total volume of authorship instances over the years, particularly accelerating after ~2015.

Conclusion: H1 is strongly supported. The data confirms significant growth in publication volume at these top AI conferences.

H2: Geographic Shifts

Hypothesis: While North America (USA/Canada) has historically dominated, contributions from East Asia (particularly China) have shown the most rapid growth in recent years, significantly increasing their share of publications.

Quantification: Calculate the percentage share of total yearly authorship instances for key countries over time. Filter out “Unknown Country” instances before calculating percentages.

Code
# Filter out unknown countries
geo_df = df[df['Country'] != 'Unknown Country'].copy()

# Calculate counts per year and country
country_yearly_counts = geo_df.groupby(['Year', 'Country']).size().reset_index(name='Count')

# Calculate total counts per year (only considering known countries)
total_yearly_counts = country_yearly_counts.groupby('Year')['Count'].sum().reset_index(name='TotalYearlyCount')

# Merge totals back to calculate percentages
country_yearly_perc = pd.merge(country_yearly_counts, total_yearly_counts, on='Year')
country_yearly_perc['Percentage'] = (country_yearly_perc['Count'] / country_yearly_perc['TotalYearlyCount']) * 100

# Select key countries/regions to plot
key_countries = ['USA', 'China', 'Canada', 'UK', 'Germany', 'France', 'India', 'Japan', 'South Korea', 'Switzerland']

# Filter data for plotting
plot_data_h2 = country_yearly_perc[country_yearly_perc['Country'].isin(key_countries)]

# Create the interactive line plot
fig_h2 = px.line(plot_data_h2,
                 x='Year',
                 y='Percentage',
                 color='Country',
                 markers=True,
                 title='Percentage Share of Authorship Instances by Key Countries Over Time',
                 labels={'Percentage': '% of Total Yearly Instances', 'Year': 'Year'},
                 hover_data={'Count': True, 'Percentage': ':.2f%'})

fig_h2.update_layout(hovermode="x unified")
fig_h2.show()

Findings: The plot reveals dynamic shifts in geographic contributions:

  • The USA’s share, while consistently high, has declined from over 60% in the early years to around 35-40% more recently.
  • China’s share shows dramatic growth, rising from less than 5% before 2014 to become the second-largest contributor, exceeding 25% in recent years and challenging the USA’s dominance.
  • Canada and the UK maintain significant shares, often fluctuating but generally ranking 3rd/4th or 4th/5th.
  • Germany, France, and Switzerland show relatively stable contributions within the top 10.
  • South Korea and India demonstrate noticeable growth in their percentage share, particularly in the later years.

Conclusion: H2 is strongly supported. While North America (primarily USA, with Canada stable) remains a major force, East Asia (driven overwhelmingly by China’s rapid ascent) has significantly increased its share, confirming the geographic shift.

H3: Rise of Industry

Hypothesis: The proportion of research contributions from industry-affiliated authors has steadily increased over time compared to academic institutions.

Quantification: Calculate the percentage share of total yearly authorship instances attributed to ‘Industry’ over time. Filter out “Unknown Type” instances.

Code
# Filter out unknown types
type_df = df[df['Type'].isin(['Academia', 'Industry'])].copy() # Use isin for clarity

# Calculate counts per year and type
type_yearly_counts = type_df.groupby(['Year', 'Type']).size().reset_index(name='Count')

# Calculate total counts per year (known types only)
total_yearly_type_counts = type_yearly_counts.groupby('Year')['Count'].sum().reset_index(name='TotalYearlyCount')

# Merge totals back to calculate percentages
type_yearly_perc = pd.merge(type_yearly_counts, total_yearly_type_counts, on='Year')
type_yearly_perc['Percentage'] = (type_yearly_perc['Count'] / type_yearly_perc['TotalYearlyCount']) * 100

# Filter for Industry type
plot_data_h3 = type_yearly_perc[type_yearly_perc['Type'] == 'Industry']

# Create the interactive line plot
fig_h3 = px.line(plot_data_h3,
                 x='Year',
                 y='Percentage',
                 markers=True,
                 title='Percentage Share of Authorship Instances from Industry Over Time',
                 labels={'Percentage': '% of Total Yearly Instances from Industry', 'Year': 'Year'})

fig_h3.update_layout(hovermode="x unified")
fig_h3.show()

Findings: The plot demonstrates a clear and substantial upward trend in the percentage of authorship instances affiliated with Industry, accelerating after 2015 and reaching approximately 30-35% in the most recent years.

Conclusion: H3 is strongly supported. Industry participation in research publications at these top AI conferences has significantly increased over time.

H4: Institutional Concentration

Hypothesis: A relatively small number of ‘elite’ institutions (both academic and industrial) consistently account for a large percentage of the total publications, indicating a concentration of top AI research.

Quantification: Identify the Top \(N=20\) institutions based on overall authorship instances. Calculate the combined percentage share of total yearly authorship instances held by these Top \(N\) institutions over time. Filter out “Unknown Institution”.

Code
# Filter out unknown institutions
inst_df = df[df['Institution'] != 'Unknown Institution'].copy()

# Part 1: Identify Top N Institutions Overall
N = 20
top_n_institutions = inst_df['Institution'].value_counts().head(N).index.tolist()
print(f"Top {N} Institutions (Overall):")
print(top_n_institutions)

# Part 2: Calculate Yearly Concentration Percentage
inst_yearly_counts = inst_df.groupby(['Year', 'Institution']).size().reset_index(name='Count')
total_yearly_inst_counts = inst_yearly_counts.groupby('Year')['Count'].sum().reset_index(name='TotalYearlyCount')
top_n_yearly_counts = inst_yearly_counts[inst_yearly_counts['Institution'].isin(top_n_institutions)]
top_n_yearly_sum = top_n_yearly_counts.groupby('Year')['Count'].sum().reset_index(name=f'Top{N}_Count')
concentration_df = pd.merge(total_yearly_inst_counts, top_n_yearly_sum, on='Year', how='left').fillna(0) # Use how='left' and fillna(0)
concentration_df[f'Top{N}_Percentage'] = (concentration_df[f'Top{N}_Count'] / concentration_df['TotalYearlyCount']) * 100

# Visualization: Concentration Trend
fig_h4_conc = px.line(concentration_df,
                      x='Year',
                      y=f'Top{N}_Percentage',
                      markers=True,
                      title=f'Percentage Share of Authorship Instances by Top {N} Institutions Over Time',
                      labels={f'Top{N}_Percentage': f'% Share Held by Top {N}', 'Year': 'Year'})

fig_h4_conc.update_layout(hovermode="x unified")
fig_h4_conc.show()
Top 20 Institutions (Overall):
['Google', 'Stanford University', 'Massachusetts Institute of Technology', 'Tsinghua University', 'Carnegie Mellon University', 'Microsoft', 'University of California, Berkeley', 'Peking University', 'University of Oxford', 'University of Texas at Austin', 'Shanghai Jiao Tong University', 'Princeton University', 'Facebook', 'University of California, Los Angeles', 'Zhejiang University', 'University of Washington', 'Georgia Institute of Technology', 'ETH Zurich', 'National University of Singapore', 'University of Cambridge']

Findings: The list reveals the dominant institutions overall (Google, Stanford, MIT, Tsinghua, CMU, Microsoft, Berkeley, etc.). The concentration plot shows that the Top 20 institutions consistently account for a significant portion (roughly 35% to over 50%) of the total authorship instances. While there are fluctuations, the share held by the top institutions has generally increased, peaking around 2016-2020, before declining slightly in the most recent years.

Conclusion: H4 is supported. A relatively small number of elite institutions contribute a disproportionately large share of publications, indicating significant concentration. The level of concentration seems to have increased over time, although it might have plateaued or slightly decreased very recently.

H5: Conference Profile Differences

Hypothesis: The dominant contributing countries and the balance between Academia and Industry differ significantly across the three conferences (NeurIPS, ICML, ICLR).

Quantification: Calculate percentage share by country and type per conference. Visualize using faceted and grouped bar charts.

Code
N_TOP_COUNTRIES = 10

# Part 1: Geographic Distribution per Conference
print(f"\nGeographic Distribution per Conference (Top {N_TOP_COUNTRIES})")
geo_conf_df = df[df['Country'] != 'Unknown Country'].copy()
conf_country_counts = geo_conf_df.groupby(['Conference', 'Country']).size().reset_index(name='Count')
total_conf_counts = conf_country_counts.groupby('Conference')['Count'].sum().reset_index(name='TotalConferenceCount')
conf_country_perc = pd.merge(conf_country_counts, total_conf_counts, on='Conference')
conf_country_perc['Percentage'] = (conf_country_perc['Count'] / conf_country_perc['TotalConferenceCount']) * 100
# Get Top N countries FOR EACH conference using indices from nlargest
top_indices = conf_country_perc.groupby('Conference')['Count'].nlargest(N_TOP_COUNTRIES).reset_index()['level_1']
top_countries_per_conf = conf_country_perc.loc[top_indices]

fig_h5_geo = px.bar(top_countries_per_conf,
                    x='Country', y='Percentage', color='Conference', facet_col='Conference',
                    title=f'Top {N_TOP_COUNTRIES} Contributing Countries by Conference (% of Total Instances per Conference)',
                    labels={'Percentage': '% of Conference Instances', 'Country': 'Country'},
                    category_orders={"Conference": ["NeurIPS", "ICML", "ICLR"]}, text='Percentage')
fig_h5_geo.update_traces(texttemplate='%{text:.1f}%', textposition='outside')
fig_h5_geo.update_xaxes(matches=None, showticklabels=True, title_text='') # Allow independent x-axes and remove redundant title
fig_h5_geo.update_layout(showlegend=True)
fig_h5_geo.add_annotation(x=0.5, y=-0.30, xref='paper', yref='paper', text='Country', showarrow=False, font=dict(size=12))
fig_h5_geo.show()

# Part 2: Academia vs. Industry Distribution per Conference
print("\nAcademia vs. Industry Distribution per Conference")
type_conf_df = df[df['Type'].isin(['Academia', 'Industry'])].copy()
conf_type_counts = type_conf_df.groupby(['Conference', 'Type']).size().reset_index(name='Count')
total_conf_type_counts = conf_type_counts.groupby('Conference')['Count'].sum().reset_index(name='TotalConferenceCount')
conf_type_perc = pd.merge(conf_type_counts, total_conf_type_counts, on='Conference')
conf_type_perc['Percentage'] = (conf_type_perc['Count'] / conf_type_perc['TotalConferenceCount']) * 100

fig_h5_type = px.bar(conf_type_perc,
                     x='Conference', y='Percentage', color='Type', barmode='group',
                     title='Academia vs. Industry Distribution by Conference',
                     labels={'Percentage': '% of Conference Instances', 'Conference': 'Conference', 'Type': 'Institution Type'},
                     category_orders={"Conference": ["NeurIPS", "ICML", "ICLR"], "Type": ["Academia", "Industry"]},
                     text='Percentage')
fig_h5_type.update_traces(texttemplate='%{text:.1f}%', textposition='outside')
fig_h5_type.show()

Geographic Distribution per Conference (Top 10)

Academia vs. Industry Distribution per Conference

Findings:

  • Geographic Profiles: While the USA and China dominate across all three conferences, the relative ranking and presence of other countries vary.
  • Academia vs. Industry: ICLR shows the highest percentage of industry contributions (25.3%), followed by ICML (23.5%) and then NeurIPS (21.3%). Conversely, NeurIPS has the highest academic share.

Conclusion: H5 is supported. The conferences exhibit distinct profiles regarding both the detailed geographic distribution beyond the top two leaders and the balance between academic and industrial participation, with NeurIPS being the most academic-centric and ICLR the most industry-heavy overall.

H8: Evolution within China (Academia vs. Industry)

Hypothesis: Within China, the balance between academic and industrial contributions has shifted significantly over time.

Quantification: Filter for China, then plot the percentage share of ‘Academia’ vs. ‘Industry’ instances over time within the country.

Code
TARGET_COUNTRY = 'China'

print(f"\nAcademia vs. Industry Trend within {TARGET_COUNTRY}")

# Filter for the target country AND known types
country_type_df = df[(df['Country'] == TARGET_COUNTRY) &
                     (df['Type'].isin(['Academia', 'Industry']))].copy()

if country_type_df.empty:
    print(f"No data found for known types in {TARGET_COUNTRY}.")
else:
    country_type_yearly_counts = country_type_df.groupby(['Year', 'Type']).size().reset_index(name='Count')
    total_yearly_country_counts = country_type_yearly_counts.groupby('Year')['Count'].sum().reset_index(name='TotalYearlyCount')
    country_type_yearly_perc = pd.merge(country_type_yearly_counts, total_yearly_country_counts, on='Year')
    country_type_yearly_perc['Percentage'] = (country_type_yearly_perc['Count'] / country_type_yearly_perc['TotalYearlyCount']) * 100

    fig_h8 = px.line(country_type_yearly_perc,
                     x='Year', y='Percentage', color='Type', markers=True,
                     title=f'Academia vs. Industry Share of Authorship Instances from {TARGET_COUNTRY}',
                     labels={'Percentage': f'% of {TARGET_COUNTRY} Yearly Instances', 'Year': 'Year', 'Type': 'Institution Type'})
    fig_h8.update_layout(hovermode="x unified")
    fig_h8.show()

Academia vs. Industry Trend within China

Findings: Within China, industry contributions initially dominated almost entirely. Academia’s share shows noticeable growth from the late 2000s onward, increasing significantly thereafter and reaching around 85% of Chinese contributions in recent years.

Conclusion: H8 is supported. The internal balance within China has shifted markedly: early contributions came overwhelmingly from industry, whereas academia now accounts for the large majority of Chinese output.

H9: Persistence of Top Institutions

Hypothesis: The institutions ranked in the Top \(N=20\) during an earlier period largely remain dominant in a later period.

Quantification: Define early (2010-2015) and late (2019-2023) periods. Identify Top 20 institutions in each. Calculate the percentage of the early Top 20 also present in the late Top 20.

Code
N_RANK = 20
EARLY_START, EARLY_END = 2010, 2015
LATE_START, LATE_END = 2019, 2023

print(f"\nPersistence of Top {N_RANK} Institutions")
print(f"Early Period: {EARLY_START}-{EARLY_END}, Late Period: {LATE_START}-{LATE_END}")

inst_persist_df = df[(df['Institution'] != 'Unknown Institution')].copy()
inst_persist_df['Year'] = pd.to_numeric(inst_persist_df['Year'])

early_df = inst_persist_df[(inst_persist_df['Year'] >= EARLY_START) & (inst_persist_df['Year'] <= EARLY_END)]
late_df = inst_persist_df[(inst_persist_df['Year'] >= LATE_START) & (inst_persist_df['Year'] <= LATE_END)]

if early_df.empty or late_df.empty:
    print("Insufficient data for one or both periods.")
else:
    top_early = early_df['Institution'].value_counts().head(N_RANK).index.tolist()
    print(f"\nTop {N_RANK} Institutions ({EARLY_START}-{EARLY_END}):")
    print(top_early)

    top_late = late_df['Institution'].value_counts().head(N_RANK).index.tolist()
    print(f"\nTop {N_RANK} Institutions ({LATE_START}-{LATE_END}):")
    print(top_late)

    persistent_institutions = [inst for inst in top_early if inst in top_late]
    num_persistent = len(persistent_institutions)
    persistence_rate = (num_persistent / N_RANK) * 100

    print(f"\nNumber of institutions in Top {N_RANK} in both periods: {num_persistent}")
    print(f"Persistence Rate: {persistence_rate:.1f}%")

Persistence of Top 20 Institutions
Early Period: 2010-2015, Late Period: 2019-2023

Top 20 Institutions (2010-2015):
['Google', 'Carnegie Mellon University', 'Stanford University', 'Massachusetts Institute of Technology', 'University of California, Berkeley', 'Microsoft', 'University of Texas at Austin', 'Duke University', 'Princeton University', 'Columbia University', 'University of Washington', 'University of Cambridge', 'University of California, San Diego', 'Facebook', 'University College London', 'University of Toronto', 'Technion - Israel Institute of Technology', 'University of Michigan', 'Georgia Institute of Technology', 'Inria']

Top 20 Institutions (2019-2023):
['Google', 'Stanford University', 'Massachusetts Institute of Technology', 'Carnegie Mellon University', 'Tsinghua University', 'Microsoft', 'University of California, Berkeley', 'Peking University', 'University of Oxford', 'Facebook', 'University of Texas at Austin', 'University of California, Los Angeles', 'Princeton University', 'Georgia Institute of Technology', 'Mila - Quebec AI Institute', 'ETH Zurich', 'University of Washington', 'IBM', 'Amazon', 'Shanghai Jiao Tong University']

Number of institutions in Top 20 in both periods: 11
Persistence Rate: 55.0%

Findings: Comparing the Top 20 institutions from 2010-2015 with those from 2019-2023, we found that 11 institutions appear in both lists, a persistence rate of 55.0%. Key persistent players include Google, Stanford, MIT, CMU, Microsoft, Berkeley, Princeton, UT Austin, the University of Washington, and Georgia Tech. Notable entries into the later Top 20 include Tsinghua, Peking, Oxford, UCLA, Mila, ETH Zurich, IBM, Amazon, and SJTU, while Duke, Columbia, Cambridge, UCSD, UCL, Toronto, Technion, Michigan, and Inria dropped out of the Top 20 between the two periods (though many remain highly ranked).

Conclusion: H9 is partially supported. While there is significant persistence (over half the institutions remain in the top 20), there is also considerable dynamism, with nearly half the top tier changing between the two periods, reflecting the rise of certain institutions (especially from China and industry) and the relative decline of others within this elite group.

H10: National Institutional Dominance

Hypothesis: Within the top contributing countries, the national research output is highly concentrated, with the single top institution often accounting for a substantial percentage of that country’s total publications.

Quantification: Select the Top \(K=5\) countries. For each, identify their single top contributing institution overall and calculate the percentage of the country’s total instances originating from that institution. Visualize with a bar chart, adding the institution name to the bar label.

Code
K_TOP_COUNTRIES = 5 # Number of top countries to analyze

print(f"\nInstitutional Dominance within Top {K_TOP_COUNTRIES} Countries")

# Filter unknowns
dom_df = df[(df['Country'] != 'Unknown Country') &
            (df['Institution'] != 'Unknown Institution')].copy()

# Find Top K countries overall
top_k_countries = dom_df['Country'].value_counts().head(K_TOP_COUNTRIES).index.tolist()
print(f"Analyzing Top {K_TOP_COUNTRIES} Countries: {top_k_countries}")

# Filter data for only these top countries
dom_df_filtered = dom_df[dom_df['Country'].isin(top_k_countries)]

# Calculate total instances per country
total_country_counts = dom_df_filtered.groupby('Country').size().reset_index(name='TotalCountryInstances')

# Find the single top institution WITHIN each of these countries
top_inst_per_country = dom_df_filtered.groupby(['Country', 'Institution']).size().reset_index(name='InstCount')
top_inst_per_country = top_inst_per_country.loc[top_inst_per_country.groupby('Country')['InstCount'].idxmax()]
top_inst_per_country = top_inst_per_country.rename(columns={'InstCount': 'TopInstCount'})

# Merge total country counts and top institution counts
dominance_summary = pd.merge(total_country_counts, top_inst_per_country[['Country', 'Institution', 'TopInstCount']], on='Country')

# Calculate dominance percentage
dominance_summary['DominancePercentage'] = (dominance_summary['TopInstCount'] / dominance_summary['TotalCountryInstances']) * 100

# Create the combined text label for the bars
dominance_summary['BarLabel'] = dominance_summary.apply(
    lambda row: f"{row['DominancePercentage']:.1f}%<br><i>({row['Institution']})</i>",
    axis=1
)

# Sort by dominance percentage for plotting
dominance_summary = dominance_summary.sort_values('DominancePercentage', ascending=False)

# Visualization: Bar chart of dominance percentage
fig_h10 = px.bar(dominance_summary,
                 x='Country', y='DominancePercentage', color='Country',
                 title=f'Dominance of Top Institution within Top {K_TOP_COUNTRIES} Countries',
                 labels={'DominancePercentage': '% National Instances (Top Inst.)', 'Country': 'Country'},
                 text='BarLabel',
                 hover_data={'Institution': True, 'TopInstCount': True, 'TotalCountryInstances': True, 'DominancePercentage':':.1f%'})

fig_h10.update_traces(textposition='outside')
fig_h10.update_layout(showlegend=False, yaxis_title="% National Instances (Top Inst.)")
fig_h10.show()

Institutional Dominance within Top 5 Countries
Analyzing Top 5 Countries: ['USA', 'China', 'UK', 'Canada', 'Germany']

Findings: The plot shows the percentage of each top country’s total output contributed by its single largest institution:

  • The UK shows the highest concentration among the top 5, with the University of Oxford contributing over 27% of its national total.
  • Canada follows, with the University of Toronto contributing around 18.6%.
  • China’s output is less concentrated in its top institution (Tsinghua University, ~14.2%) than the UK’s or Canada’s.
  • The USA, despite having the highest overall output, shows relatively low concentration in its top institution (Google, ~13.1%), indicating a broader base of major contributors.
  • Germany has the lowest concentration among these five, with the Max Planck Institute for Intelligent Systems contributing ~10.0%.

Conclusion: H10 is supported, but with significant variation. While top institutions are clearly important, the degree of national concentration varies. Some countries (UK, Canada) rely more heavily on their single top institution than others (USA, China, Germany) which appear to have a more distributed set of leading contributors within the country.

H11: Rise of New Player Countries

Hypothesis: Beyond the established leaders, other countries have emerged as significant contributors in recent years.

Quantification: Identify countries ranked in the Top 20 post-2018 but not in the Top 20 pre-2017. Plot their absolute contribution trends.

Code
N_RANK_CHECK = 20
EARLY_END_YEAR = 2016
LATE_START_YEAR = 2019

print(f"\nRise of New Player Countries (Comparing Pre-{EARLY_END_YEAR+1} and Post-{LATE_START_YEAR-1})")

rise_df = df[(df['Country'] != 'Unknown Country')].copy()
rise_df['Year'] = pd.to_numeric(rise_df['Year'])

early_period_df = rise_df[rise_df['Year'] <= EARLY_END_YEAR]
late_period_df = rise_df[rise_df['Year'] >= LATE_START_YEAR]

emerging_countries = []
if not early_period_df.empty and not late_period_df.empty:
    top_early_countries = early_period_df['Country'].value_counts().head(N_RANK_CHECK).index.tolist()
    top_late_countries = late_period_df['Country'].value_counts().head(N_RANK_CHECK).index.tolist()
    emerging_countries = [c for c in top_late_countries if c not in top_early_countries]
    print(f"\nPotential Emerging Countries (in Top {N_RANK_CHECK} post-{LATE_START_YEAR-1}, not pre-{EARLY_END_YEAR+1}):")
    print(emerging_countries)
else:
    print("Insufficient data for one or both periods to identify emerging countries.")

if emerging_countries:
    country_yearly_counts_all = rise_df.groupby(['Year', 'Country']).size().reset_index(name='Count')
    plot_data_h11 = country_yearly_counts_all[country_yearly_counts_all['Country'].isin(emerging_countries)]

    fig_h11 = px.line(plot_data_h11,
                      x='Year', y='Count', color='Country', markers=True,
                      title='Authorship Instance Trends for Emerging Countries',
                      labels={'Count': 'Number of Authorship Instances', 'Year': 'Year', 'Country': 'Country'})
    fig_h11.update_layout(hovermode="x unified")
    fig_h11.show()
else:
    print("No emerging countries identified based on the criteria.")

Rise of New Player Countries (Comparing Pre-2017 and Post-2018)

Potential Emerging Countries (in Top 20 post-2018, not pre-2017):
['Russia', 'Sweden', 'Denmark']

Findings: Based on the criteria (in Top 20 post-2018, not Top 20 pre-2017), the analysis identifies Russia, Sweden, and Denmark as emerging contributors. The plot shows their absolute authorship instance counts over time. While starting from a low base, all three show an upward trajectory, particularly in the later years, indicating their growing presence in these conferences.

Conclusion: H11 is supported. Countries like Russia, Sweden, and Denmark, while not among the absolute leaders, have demonstrably increased their contributions and presence within the Top 20 landscape in recent years.

H12: Conference Concentration Comparison

Hypothesis: The degree to which publications are concentrated within the top institutions differs across the conferences.

Quantification: Calculate the percentage share held by the Top \(N=5\) institutions separately for each conference per year. Plot trends and include Top 5 institutions in hover data.

Code
N_CONC = 5 # Focusing on Top 5 for clarity

print(f"\nTop {N_CONC} Institutional Concentration Across Conferences")

conc_df = df[df['Institution'] != 'Unknown Institution'].copy()
conf_inst_yearly_counts = conc_df.groupby(['Year', 'Conference', 'Institution']).size().reset_index(name='Count')
total_conf_yearly_counts = conf_inst_yearly_counts.groupby(['Year', 'Conference'])['Count'].sum().reset_index(name='TotalConfYearlyCount')

concentration_list = []
for year in sorted(conf_inst_yearly_counts['Year'].unique()):
    for conf in ["NeurIPS", "ICML", "ICLR"]:
        if conf not in conf_inst_yearly_counts[conf_inst_yearly_counts['Year'] == year]['Conference'].unique():
            continue

        subset = conf_inst_yearly_counts[(conf_inst_yearly_counts['Year'] == year) & (conf_inst_yearly_counts['Conference'] == conf)]

        if not subset.empty:
            top_n_inst_subset_df = subset.nlargest(N_CONC, 'Count')
            top_n_details = ", ".join([f"{row['Institution']} ({row['Count']})" for index, row in top_n_inst_subset_df.iterrows()])
            top_n_count_subset = top_n_inst_subset_df['Count'].sum()

            total_count_subset_row = total_conf_yearly_counts[(total_conf_yearly_counts['Year'] == year) & (total_conf_yearly_counts['Conference'] == conf)]
            if not total_count_subset_row.empty:
                 total_count_subset = total_count_subset_row['TotalConfYearlyCount'].iloc[0]
            else:
                 total_count_subset = 0

            if total_count_subset > 0:
                percentage = (top_n_count_subset / total_count_subset) * 100
                concentration_list.append({
                    'Year': year, 'Conference': conf, f'Top{N_CONC}_Percentage': percentage,
                    f'Top{N_CONC}_Institutions': top_n_details })

conf_concentration_df = pd.DataFrame(concentration_list)

if not conf_concentration_df.empty:
    fig_h12 = px.line(conf_concentration_df,
                      x='Year', y=f'Top{N_CONC}_Percentage', color='Conference', markers=True,
                      title=f'Top {N_CONC} Institution Concentration by Conference Over Time',
                      labels={f'Top{N_CONC}_Percentage': f'% Share Held by Top {N_CONC}', 'Year': 'Year', 'Conference': 'Conference'},
                      category_orders={"Conference": ["NeurIPS", "ICML", "ICLR"]},
                      hover_data={f'Top{N_CONC}_Institutions': True, f'Top{N_CONC}_Percentage': ':.1f%'})
    fig_h12.update_layout(hovermode="x unified")
    fig_h12.show()
else:
    print("Could not calculate conference concentration trends.")

Top 5 Institutional Concentration Across Conferences

Findings: The plot compares the percentage share held by the Top 5 institutions within each conference over time; ICLR consistently shows a higher Top 5 share than ICML or NeurIPS.

Conclusion: H12 is supported. Concentration levels differ noticeably between conferences. ICLR appears significantly more ‘elitist’ (higher Top 5 concentration) than ICML and NeurIPS.

H13: Institutional Publication Profiles (Clustering)

Hypothesis: Institutions exhibit distinct publication profiles based on conference focus and academia/industry nature, allowing for clustering into meaningful groups.

Quantification: Calculate features (% NeurIPS, % ICML, % ICLR, % Industry) for institutions with >20 publications. Standardize features and apply K-Means (K=5). Analyze cluster profiles and visualize using PCA.

Code
MIN_PUBS_CLUSTER = 20 # Minimum total publications for an institution to be included

print(f"\nClustering Institutions by Publication Profile (Min {MIN_PUBS_CLUSTER} Pubs)")

# Filter unknowns
cluster_df = df[(df['Institution'] != 'Unknown Institution') &
                (df['Type'].isin(['Academia', 'Industry']))].copy()

# Calculate total publications per institution
inst_total_counts = cluster_df['Institution'].value_counts()

# Filter institutions below the threshold
institutions_to_keep = inst_total_counts[inst_total_counts >= MIN_PUBS_CLUSTER].index.tolist()
cluster_df = cluster_df[cluster_df['Institution'].isin(institutions_to_keep)]

if cluster_df.empty:
    print(f"No institutions found with at least {MIN_PUBS_CLUSTER} publications.")
else:
    # --- Feature Calculation ---
    inst_conf_counts = cluster_df.groupby(['Institution', 'Conference']).size().unstack(fill_value=0)
    inst_conf_perc = inst_conf_counts.apply(lambda x: x / x.sum() * 100, axis=1)
    inst_conf_perc.columns = [f'pct_{col.lower()}' for col in inst_conf_perc.columns]

    inst_type_counts = cluster_df.groupby(['Institution', 'Type']).size().unstack(fill_value=0)
    inst_type_perc = inst_type_counts.apply(lambda x: x / x.sum() * 100, axis=1)
    if 'Industry' not in inst_type_perc.columns: inst_type_perc['Industry'] = 0
    if 'Academia' not in inst_type_perc.columns: inst_type_perc['Academia'] = 0
    inst_type_perc = inst_type_perc[['Industry']].rename(columns={'Industry': 'pct_industry'})

    feature_df = pd.merge(inst_conf_perc, inst_type_perc, left_index=True, right_index=True)
    expected_feature_cols = ['pct_neurips', 'pct_icml', 'pct_iclr', 'pct_industry']
    for col in expected_feature_cols:
         if col not in feature_df.columns: feature_df[col] = 0
    feature_cols_to_use = [col for col in expected_feature_cols if col in feature_df.columns]

    # --- Clustering ---
    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(feature_df[feature_cols_to_use])
    K_CLUSTERS = 5
    print(f"K-Means with K={K_CLUSTERS}")
    kmeans = KMeans(n_clusters=K_CLUSTERS, random_state=42, n_init=10)
    feature_df['Cluster'] = kmeans.fit_predict(features_scaled)

    # --- Analysis & Visualization ---
    cluster_centers_scaled = kmeans.cluster_centers_
    cluster_centers = scaler.inverse_transform(cluster_centers_scaled)
    profile_cols_exist = list(feature_cols_to_use)  # Features actually used
    cluster_profile = pd.DataFrame(cluster_centers, columns=profile_cols_exist)
    print("\nCluster Profiles (Average %):")
    print(cluster_profile.round(1))

    # --- PCA Visualization ---
    print("PCA for visualization")
    pca = PCA(n_components=2)
    pca_result = pca.fit_transform(features_scaled)
    feature_df['PCA1'] = pca_result[:, 0]
    feature_df['PCA2'] = pca_result[:, 1]
    feature_df_reset = feature_df.reset_index()

    fig_h13_pca = px.scatter(feature_df_reset,
                         x='PCA1', y='PCA2', color='Cluster', # Integer cluster ID drives the color mapping
                         color_continuous_scale=px.colors.qualitative.Vivid, # Qualitative palette for discrete clusters
                         hover_data={'Institution': True, 'pct_neurips': ':.1f', 'pct_icml': ':.1f',
                                     'pct_iclr': ':.1f', 'pct_industry': ':.1f', 'Cluster': True},
                         title=f'Institution Clusters (K={K_CLUSTERS}) based on Publication Profile (PCA Visualization)',
                         labels={'Cluster': 'Cluster ID'})
    fig_h13_pca.update_layout(coloraxis_showscale=False) # Hide the colorbar; clusters are categorical
    fig_h13_pca.show()

    # --- Parallel Coordinates Plot ---
    # Prepare data: reset index of the average profiles, rename 'index' to 'Cluster'
    plot_data_profiles = cluster_profile.reset_index()
    plot_data_profiles = plot_data_profiles.rename(columns={'index': 'Cluster'})

    # Keep 'Cluster' numeric (0-4) so the color mapping below works

    print("\nCluster Profile Comparison Plot")
    fig_profiles = px.parallel_coordinates(
        plot_data_profiles,
        dimensions=feature_cols_to_use,        # Use only the feature columns for axes
        color='Cluster',                       # Numeric cluster ID for color mapping
        color_continuous_scale=px.colors.qualitative.Vivid, # Qualitative palette for discrete clusters
        title='Average Publication Profiles by Cluster',
        labels={col: col.replace('pct_', '% ').replace('_', ' ').title() for col in feature_cols_to_use}
    )
    # Optionally hide the color scale bar as it represents discrete categories
    fig_profiles.update_layout(coloraxis_showscale=False)
    fig_profiles.show()

Clustering Institutions by Publication Profile (Min 20 Pubs)
K-Means with K=5

Cluster Profiles (Average %):
   pct_neurips  pct_icml  pct_iclr  pct_industry
0         83.2      10.5       6.3          16.4
1         60.2      23.1      16.6           0.1
2         59.7      24.8      15.4          98.6
3         42.2      16.9      40.9          43.3
4         45.2      41.7      13.2           6.8
PCA for visualization

Cluster Profile Comparison Plot

Findings: The K-Means clustering (with \(K=5\)) identifies distinct groups visualized via PCA and characterized by their average profiles:

  • Cluster 2 (Dark Green on PCA): Industry Giants. This cluster is characterized by an extremely high average pct_industry (98.6%) and a relatively balanced distribution across the three conferences (~60% NeurIPS, ~25% ICML, ~15% ICLR). Sample Institutions likely include Google, Microsoft, Meta, etc.
  • Cluster 1 (Light Green on PCA): Balanced Academics. This large cluster has virtually 0% industry affiliation and a fairly balanced conference profile, slightly favoring NeurIPS (~60%) over ICML (~23%) and ICLR (~17%). This likely represents many traditional university CS/AI departments.
  • Cluster 4 (Grey on PCA): ICML-Leaning Academics. Very similar to Cluster 1 (low industry %), but with a significantly higher average focus on ICML (~42%) and lower NeurIPS focus (~45%) compared to Cluster 1.
  • Cluster 0 (Orange on PCA): NeurIPS-Dominant Academics. This cluster also has low average industry presence (~16%) but shows a strong preference for NeurIPS (~83%) compared to ICML and ICLR (both ~10% or less).
  • Cluster 3 (Blue on PCA): ICLR-Heavy/Mixed. This is an interesting cluster with a moderate industry presence (~43%) but a very strong focus on ICLR (~41%), much higher than any other cluster, while having lower NeurIPS presence (~42%).

Conclusion: H13 is supported. Institutions exhibit distinct publication profiles, allowing K-Means to identify meaningful clusters such as industry powerhouses, balanced academic centers, and institutions with specific conference preferences (NeurIPS-dominant, ICML-leaning, ICLR-heavy).

6. Discussion

This analysis of publication trends at NeurIPS, ICML, and ICLR from 2006-2024 reveals several key insights into the evolution of the AI research landscape:

  • Explosive Growth: The sheer increase in paper volume underscores the field’s rapid expansion and the growing importance of these conferences.
  • Geopolitical Shifts: The dramatic rise of China as a leading contributor, challenging the long-standing dominance of the USA, is arguably the most significant geographic trend. Other regions like Europe maintain a strong presence, while countries like South Korea and India are growing their share.
  • Academia-Industry Dynamics: Industry’s role has transitioned from marginal to substantial, accounting for roughly a third of contributions recently. Collaboration between the two sectors has also increased significantly, reflecting the tightening link between fundamental research and industrial application (a sketch of how this can be measured follows this list).
  • Concentration and Elitism: Research remains heavily concentrated in a relatively small number of globally recognized institutions (both academic and industrial). While there is persistence among the top tier, significant dynamism exists, with new players (e.g., Tsinghua, Peking, Amazon, Mila) rising into the elite ranks in recent years.
  • Globalization: The increasing trend of international co-authorship suggests a more interconnected global research community.
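
As a minimal, hedged sketch of how cross-sector collaboration can be quantified from the prepared df: flag papers whose author list spans both Academia and Industry affiliations, then compute the yearly share of such mixed papers.

Code
# Papers listing both Academia and Industry affiliations, per year
known = df[df['Type'].isin(['Academia', 'Industry'])]
paper_types = known.groupby(['Conference', 'Year', 'Title'])['Type'].agg(set)
is_collab = paper_types.apply(lambda t: t == {'Academia', 'Industry'})
collab_share = is_collab.groupby(level='Year').mean() * 100
print(collab_share.round(1))  # % of papers per year with mixed affiliations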

Limitations:

  • Data Scope: The analysis is limited to three major conferences; trends might differ in other venues or subfields. The absence of ICLR 2020 data creates a small gap.
  • Affiliation Parsing: While the LLM approach was effective, potential errors in institution standardization, country inference, or type classification remain (~2-5% unmapped/unknown). A dedicated post-processing step with fuzzy matching could improve accuracy further (see the sketch after this list).
  • Authorship vs. Impact: This analysis measures publication volume (authorship instances), not necessarily research impact (e.g., citations, influence). Dominance in publications doesn’t directly equate to dominance in groundbreaking research.
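
As a hedged illustration of such a post-processing step, the sketch below snaps near-duplicate institution names onto a canonical list using the standard library’s difflib; the canonical list and similarity cutoff are illustrative assumptions, not part of the pipeline above.

Code
import difflib

# Illustrative canonical names; a real list would be much longer
CANONICAL = ["Massachusetts Institute of Technology", "Google",
             "University of California, Berkeley"]

def canonicalize(name, cutoff=0.85):
    # Snap a near-duplicate onto the closest canonical name if similar
    # enough; otherwise keep the LLM-extracted name unchanged
    match = difflib.get_close_matches(name, CANONICAL, n=1, cutoff=cutoff)
    return match[0] if match else name

print(canonicalize("Massachusetts Inst. of Technology"))
# -> Massachusetts Institute of Technology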

Discussion on the ~2015 Dip/Slowdown

Several plots, including the collaboration trends (H6, H7), show a noticeable dip, slowdown, or inflection point around 2015-2016. Potential reasons for this observation include:

  1. NeurIPS Format Change/Lag: NeurIPS (a major contributor to the overall volume) underwent significant changes around this time: in 2015 it moved its submission deadline much earlier (to May, from roughly October) for the December conference. This may have disrupted submission patterns for that cycle, causing a temporary dip or displacing some submissions to other venues or the following year. Because the dataset’s coverage of ICML starts in 2017 and ICLR in 2018 (excluding 2020), the earlier years are dominated by NeurIPS, making NeurIPS-specific fluctuations more apparent in the aggregate plots before 2017.
  2. Rise of Other Venues: While NeurIPS, ICML, and ICLR are top-tier, the rise of other specialized conferences or workshops around this period might have drawn some papers away, causing a temporary slowdown in the growth rate at these specific venues.
  3. “Deep Learning Winter” Echoes / Transition Period?: While the major AI winter was much earlier, 2014-2016 was a period of intense consolidation around deep learning methods after initial breakthroughs (e.g., AlexNet 2012). It’s conceivable there was a brief pause or shift in focus before the next wave of major advancements (like attention/transformers gaining traction post-2017) fully accelerated publication rates again.
  4. Data Artifacts: Although less likely with publication counts, subtle data collection inconsistencies or changes in how conferences reported papers online during that specific year could theoretically contribute.
  5. Saturation/Reviewer Load: Rapid growth before 2015 might have started straining the peer review system, potentially leading to slightly lower acceptance rates or slower review cycles temporarily affecting the number of published papers for a cycle.

Without more specific context on conference policies or submission numbers for those exact years, it’s hard to pinpoint a single cause. However, the NeurIPS deadline shift and the general dynamics of a rapidly evolving field transitioning between major paradigms seem like plausible contributing factors.

7. Conclusion

The AI research landscape, as viewed through the lens of NeurIPS, ICML, and ICLR publications, is characterized by rapid growth, significant geographic power shifts (especially the rise of China), increasing industry involvement and collaboration, and a persistent but evolving concentration of research within elite global institutions. The use of Large Language Models proved crucial for extracting the necessary structured data from complex affiliation strings, enabling these quantitative insights. Future work could involve expanding the conference scope, refining affiliation normalization, and incorporating citation data for impact analysis.