Lecture 02.2 – Privacy Considerations

Introduction

The integration of large language models (LLMs) into psychological research introduces significant privacy and security considerations that researchers must carefully navigate. When we use AI systems to design surveys, generate synthetic data, or interact with participants, we create new data flows that may expose sensitive personal information to novel risks. This lecture examines the privacy implications of using generative AI in psychological research, with special attention to the trade-offs between closed API models and open-source alternatives.

Our focus will be on helping researchers make informed decisions about which AI approaches best protect participant confidentiality while still enabling innovative research. We will also discuss compliance with relevant regulations, institutional review board (IRB) considerations, and emerging best practices in this rapidly evolving landscape.

Understanding the Privacy Landscape

Before diving into specific AI models, it’s important to understand the broader privacy context in which psychological research operates:

1. Regulatory Requirements

Psychological research is subject to various regulations that protect participant privacy:

  • General Data Protection Regulation (GDPR) in the EU imposes strict requirements on the processing of personal data, with special protections for sensitive categories including health and psychological information (European Parliament, 2016).

  • Health Insurance Portability and Accountability Act (HIPAA) in the US regulates protected health information, which may include psychological data in clinical contexts.

  • Institutional Review Board (IRB) Requirements at most research institutions mandate specific privacy protections for human subjects data.

2. Psychological Data Sensitivity

Data collected in psychological research is often inherently sensitive:

  • Mental health assessments may reveal psychiatric conditions

  • Personal narratives often contain identifying details

  • Behavioral measures can reveal patterns unique to individuals

  • Longitudinal studies track changes in individuals over time

3. Participant Expectations

Research participants typically expect that their data will be:

  • Used only for the purposes they consented to

  • Kept confidential and secure

  • Not shared with commercial entities without explicit permission

  • Protected from re-identification

These expectations create ethical obligations beyond regulatory compliance, particularly when introducing new technologies into the research process.

Closed API LLMs vs. Open-Source Models: A Privacy Comparison

When integrating LLMs into research, one of the most consequential decisions is whether to use closed API models (like OpenAI’s GPT-4 or Anthropic’s Claude) or open-source models that can be run locally (such as models from the LLaMA, Mistral, or BLOOM families).

Closed API LLMs: Privacy Considerations

When using closed API models, researchers send data to external servers owned by the model provider:

  • Data Transmission: Participant data (or data about participants) leaves your secure research environment and travels over the internet to the provider’s servers.

  • Data Retention: Many providers retain query data for some period. For example, OpenAI has stated that API inputs and outputs may be retained for up to 30 days for abuse and misuse monitoring, though policies can change and reduced- or zero-retention options may be available for some customers (OpenAI, 2023).

  • Terms of Service: API providers typically have terms that specify how they can use data sent through their services. These terms may change over time and should be carefully reviewed.

  • Third-Party Access: Commercial API providers may be subject to legal demands for data from governments or other entities, potentially exposing research data to access beyond what researchers intend.

  • Model Training: Some providers may use queries to improve their models unless researchers explicitly opt out. This could theoretically incorporate participant data into future model versions.

For example, if a researcher uses ChatGPT to analyze open-ended survey responses about mental health experiences, those sensitive responses are transmitted to OpenAI’s servers, where they may be stored according to the company’s data retention policies. Researchers must consider whether this aligns with the confidentiality promised to participants.

Open-Source Models: Privacy Advantages

Open-source models that can be run locally or on secured cloud infrastructure offer several privacy advantages:

  • Data Containment: Participant data never leaves the researcher’s controlled environment, significantly reducing transmission risks.

  • No Third-Party Exposure: The data is not exposed to commercial entities or their changing terms of service.

  • Complete Control: Researchers determine their own data retention and security policies without depending on external providers.

  • Audit Transparency: The model’s code is available for inspection, allowing for verification of how data is processed.

  • Compliance Simplification: It’s often easier to meet IRB requirements and regulatory obligations when data stays within established secure research environments.

These advantages make open-source models particularly attractive for highly sensitive research contexts. For instance, when analyzing therapy session transcripts or clinical interview data, keeping all processing in-house provides stronger privacy protections (Abdurahman et al., 2024).
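
To make this concrete, the sketch below runs an open-weight model entirely on local hardware using the Hugging Face transformers library. The specific model name is illustrative, and the excerpt is fictional; the point is that inference happens on infrastructure the research team controls, so no transcript text is transmitted to an external provider.

```python
# Minimal sketch: local inference with an open-weight model via the Hugging Face
# transformers library. The model name is illustrative; substitute whatever open
# model your team has vetted and cached locally.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative open model
    device_map="auto",                            # use a local GPU if available
)

prompt = (
    "Summarize the main coping strategies mentioned in this de-identified "
    "interview excerpt:\n"
    "Participant P-017 reports managing worry by scheduling short walks..."
)

output = generator(prompt, max_new_tokens=150, do_sample=False)
print(output[0]["generated_text"])
```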

Performance vs. Privacy Trade-offs

While open-source models offer privacy advantages, there are trade-offs to consider:

  • Computational Requirements: Running state-of-the-art open models locally requires significant computational resources (GPUs, memory, etc.).

  • Technical Expertise: Local deployment requires more technical knowledge than using APIs.

  • Performance Gap: Though the gap is closing rapidly, open-source models may not yet match the capabilities of leading closed API models.

  • Maintenance Burden: Locally deployed models require maintenance and updates, creating additional work for research teams.

Recent research has shown that for specific research tasks, smaller open models (e.g., 7B parameter models) can perform competitively with much larger closed models (175B+ parameters) when properly fine-tuned (Abdurahman et al., 2024). This suggests that the privacy-performance trade-off is becoming less stark as open models improve.

Practical Privacy Preservation Strategies

Beyond the fundamental choice of model type, researchers can implement various strategies to enhance privacy protection:

1. Data Minimization

  • Only send the specific data needed for the task when using APIs

  • Remove personally identifiable information (PII) before processing with any AI system

  • Use pseudonyms or ID codes instead of real names (see the sketch after this list)

  • Consider aggregating data when individual-level processing isn’t necessary
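
A minimal sketch of these steps is shown below, assuming a tabular export with illustrative column names. The file that maps ID codes back to names should be stored separately under strict access control.

```python
# Minimal sketch: replace names with stable ID codes and drop columns that the
# AI task does not need. Column names and file paths are illustrative.
import pandas as pd

df = pd.read_csv("responses.csv")  # hypothetical raw export

# Build a name -> ID-code mapping; keep the key in a separate, access-controlled file.
codes = {name: f"P-{i:03d}" for i, name in enumerate(df["name"].unique(), start=1)}
pd.Series(codes).to_csv("id_key.csv")  # store away from the analysis data

df["participant_id"] = df["name"].map(codes)

# Keep only what the downstream AI task actually needs.
minimal = df[["participant_id", "age_band", "open_ended_response"]]
minimal.to_csv("responses_minimal.csv", index=False)
```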

2. Local Preprocessing and Postprocessing

  • Perform anonymization locally before sending data to API models (a redaction sketch follows this list)

  • Use local tools to filter out potentially identifying details

  • Process raw data locally and only send derived features to external APIs when necessary
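
Below is a toy sketch of local redaction using simple regular expressions for emails and phone numbers. The patterns are assumptions for illustration only; real studies would typically combine a dedicated de-identification tool with human spot checks before anything is sent to an external API.

```python
# Minimal sketch: redact obvious identifiers locally before any external call.
# These patterns are illustrative and not exhaustive; dedicated de-identification
# tools plus manual review are recommended in practice.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with placeholder tags."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

raw = "You can reach me at jane.doe@example.org or 555-010-2398 after 5pm."
print(redact(raw))
# -> "You can reach me at [EMAIL] or [PHONE] after 5pm."
```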

3. Secure Model Deployment

  • When using open-source models, deploy them on secure infrastructure

  • Implement proper authentication and access controls

  • Ensure data-at-rest encryption for stored queries and responses

  • Consider air-gapped systems for particularly sensitive data

4. Consent and Transparency

  • Explicitly inform participants if their data will be processed by AI systems

  • Explain in clear language how the data will flow and what privacy measures are in place

  • Consider offering opt-out options for AI processing

  • Document all AI processing in your research protocol for IRB review

5. Synthetic Data Approaches

  • Use generative AI to create synthetic datasets that preserve statistical properties without exposing real participant data

  • Train models on synthetic data for preliminary analyses

  • Share synthetic rather than real data with collaborators when possible

Kang et al. (2024) demonstrated an effective approach using synthetic data generation for depression prediction. By generating artificial clinical interview snippets that maintained the statistical properties of real interviews without exposing actual patient data, they were able to train improved prediction models while preserving privacy.
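
As a rough illustration of the general idea (not a reproduction of Kang et al.’s method), the sketch below prompts a locally hosted open model to produce fully synthetic, non-identifying responses. The model name, prompt, and sampling settings are assumptions, and generated outputs should be reviewed before any downstream use.

```python
# Rough sketch: generate synthetic open-ended responses with a locally hosted
# open model, so no real participant text is involved. Model name, prompt, and
# sampling settings are illustrative.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative open model
    device_map="auto",
)

prompt = (
    "Write a brief, fictional survey response from an adult describing mild "
    "work-related anxiety. Do not include names, places, or other identifying "
    "details.\nResponse:"
)

synthetic_responses = [
    generator(prompt, max_new_tokens=120, do_sample=True, temperature=0.9)[0]["generated_text"]
    for _ in range(5)  # small batch for illustration
]
```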

Hybrid Approaches: Balancing Privacy and Capability

Rather than viewing the choice as binary, many researchers are adopting hybrid approaches that leverage different models for different tasks based on sensitivity:

1. Sensitivity-Based Model Selection

Match the model deployment type to the sensitivity of the data being processed (a routing sketch follows this list):

  • Low Sensitivity: General survey design, literature summarization, or code generation → API models may be appropriate

  • Moderate Sensitivity: De-identified response analysis or aggregate data processing → Fine-tuned open models or API models with strong controls

  • High Sensitivity: Individual-level clinical data, therapy transcripts, or identifiable information → Locally deployed open-source models only
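
One lightweight way to make such a policy explicit is to encode it in the analysis code itself. The sketch below is illustrative; the tier definitions and routing targets are assumptions and should mirror whatever the approved IRB protocol specifies.

```python
# Minimal sketch: route tasks to a deployment type based on data sensitivity.
# Tier names and routing targets are illustrative; the real mapping should
# mirror the approved IRB protocol.
from enum import Enum

class Sensitivity(Enum):
    LOW = "low"            # survey design, literature summaries, code generation
    MODERATE = "moderate"  # de-identified or aggregate data
    HIGH = "high"          # clinical transcripts, identifiable data

ROUTING = {
    Sensitivity.LOW: "external_api",
    Sensitivity.MODERATE: "open_model_with_controls",
    Sensitivity.HIGH: "local_open_model_only",
}

def route(task_name: str, sensitivity: Sensitivity) -> str:
    """Return the deployment target allowed for this task's sensitivity tier."""
    target = ROUTING[sensitivity]
    print(f"{task_name}: routed to {target}")
    return target

route("generate survey items", Sensitivity.LOW)
route("analyze therapy transcripts", Sensitivity.HIGH)
```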

2. Task Segregation

Divide research workflows to keep sensitive data local:

  • Use API models for generating survey questions, but open models for analyzing responses

  • Generate synthetic respondents via API, but process real participant data locally

  • Develop analysis code using APIs, but run the analysis on sensitive data using local deployment

3. Emerging Middle-Ground Solutions

The landscape is evolving with new options that aim to provide both performance and privacy:

  • On-premises API deployments: Some providers now offer options to deploy their models within a customer’s infrastructure

  • API providers with specialized privacy tiers: Enterprise offerings with enhanced privacy guarantees

  • Fine-tuned smaller models: Specialized open models tuned for specific psychological applications

Responsible Documentation and Transparency

Regardless of the approach chosen, thorough documentation is essential for responsible research:

1. Method Transparency in Publications

When reporting research using LLMs, always document:

  • Which specific models were used (including versions)

  • How data was processed and what privacy measures were implemented

  • Any limitations or potential risks of the approach

  • Validation procedures used to ensure quality

2. IRB Protocol Specificity

When submitting research protocols to IRBs, provide detailed information about:

  • Data flows between systems

  • Privacy safeguards

  • Model provider terms of service (if using APIs)

  • Participant consent language regarding AI processing

3. Reproducibility Considerations

The model choice also affects research reproducibility:

  • Closed API models may change over time, potentially affecting results

  • Open-source models can be archived with specific weights to ensure computational reproducibility

  • Document random seeds and any sampling parameters used to enhance reproducibility

Abdurahman et al. (2024) emphasize that using open models can enhance scientific reproducibility by allowing researchers to specify and share the exact model version used, avoiding the “moving target” problem that can occur with regularly updated API models.
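
A minimal sketch of this kind of pinning is shown below, assuming a Hugging Face-hosted open model. In a real study, the revision would be set to the specific commit hash actually used, and both it and the random seed would be reported in the methods section.

```python
# Minimal sketch: pin the exact model revision and seed so a run can be
# reproduced and reported. The model name is illustrative; in practice, set
# `REVISION` to the specific commit hash used in the study.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative open model
REVISION = "main"  # pin and report the specific commit hash in a real study

set_seed(1234)  # report this seed alongside the model version

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    revision=REVISION,
    torch_dtype=torch.float16,
)

# Record everything needed for the methods section and for re-running later.
print({
    "model": MODEL_NAME,
    "revision": REVISION,
    "seed": 1234,
    "transformers_version": transformers.__version__,
})
```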

Case Study: Privacy-Preserving Clinical Assessment

To illustrate these principles in practice, consider a hypothetical research project developing an AI-assisted clinical assessment tool for anxiety disorders:

Initial Approach (Privacy Concerns):

  • Researchers planned to use a commercial API to analyze transcript data from clinical interviews

  • Raw interview transcripts would be sent to the API for sentiment analysis and symptom detection

  • Privacy review identified serious concerns with sending identifiable clinical data to external servers

Revised Approach (Privacy-Preserving):

  • The team deployed an open-source 7B parameter model on a secure university server

  • The model was fine-tuned on synthetic clinical data to improve performance on anxiety assessment

  • All processing occurred within the university’s HIPAA-compliant infrastructure

  • Only aggregated, de-identified results were reported externally

Result: The research proceeded with stronger privacy protections, and while the analysis required more technical setup, the team avoided exposing sensitive clinical information to third parties. Performance was comparable to the API-based approach for this specific clinical assessment task.

Future Directions in Privacy-Preserving AI

The landscape of privacy-preserving AI is rapidly evolving. Several promising approaches may further improve the balance between capability and privacy:

1. Federated Learning

Federated learning allows models to be trained across multiple institutions without sharing raw data. Only model updates are shared, not the underlying data itself. This approach could enable collaborative model improvement while keeping sensitive psychological data within each institution’s secure environment.

2. Differential Privacy

Differential privacy techniques add noise to data or model outputs to mathematically guarantee privacy while preserving overall statistical properties. As these techniques mature, they may offer new ways to balance privacy and utility in psychological AI applications.
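
As a toy illustration of the core idea, the sketch below applies the Laplace mechanism to a released mean. The scale range, epsilon, and data are illustrative, and production analyses would rely on a vetted differential-privacy library rather than hand-rolled noise.

```python
# Toy sketch of the Laplace mechanism: add calibrated noise to a released
# statistic so that any one participant's influence is masked. Epsilon and the
# clipping bounds are illustrative.
import numpy as np

rng = np.random.default_rng(0)

scores = np.array([12, 15, 9, 20, 18, 11], dtype=float)  # e.g., anxiety scale scores
lower, upper = 0.0, 21.0  # known scale range; bounds each person's contribution
epsilon = 1.0             # privacy budget (smaller = more private, more noise)

clipped = np.clip(scores, lower, upper)
true_mean = clipped.mean()

# With bounded values, replacing one record changes the mean by at most
# (upper - lower) / n, which sets the noise scale.
sensitivity = (upper - lower) / len(clipped)
noisy_mean = true_mean + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(f"true mean: {true_mean:.2f}, DP-released mean: {noisy_mean:.2f}")
```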

3. Homomorphic Encryption

This advanced encryption approach allows computations to be performed on encrypted data without decrypting it first. While still computationally intensive, future developments could enable secure processing of encrypted psychological data even in cloud environments.

4. Dedicated Research Models

There is growing momentum toward developing LLMs specifically designed for scientific research, with appropriate privacy guarantees and governance models aligned with research ethics rather than commercial priorities.

Conclusion

Privacy considerations are fundamental, not peripheral, to the responsible use of generative AI in psychological research. By thoughtfully evaluating the privacy implications of different approaches, researchers can harness the power of these new tools while upholding their ethical obligations to participants.

The choice between closed API models and open-source alternatives represents one of the most consequential decisions affecting data privacy. While closed models may offer convenience and cutting-edge performance, open-source models deployed locally provide stronger privacy guarantees that align well with the sensitive nature of much psychological data (Abdurahman et al., 2024).

As this field continues to evolve, staying informed about emerging best practices, regularly reassessing privacy approaches, and maintaining transparency with participants and IRBs will be essential to conducting both innovative and ethical AI-augmented psychological research.

In the next lecture, we will turn our attention to the practical analysis of Likert-scale data, examining how traditional statistical approaches can be complemented by new AI-enabled techniques.