CI/CD Troubleshooting Guide
Resolving common issues, optimizing performance, and following best practices
This guide helps you diagnose and resolve common CI/CD pipeline issues for both NX-based and standalone architectures.
Common Issues & Solutions
Feature Branch CI Issues
Problem: Feature branch builds failing
# Check workflow status
gh workflow list
gh run list --branch feature/my-feature
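# View detailed logs for one run (take <run-id> from the list above)
gh run view <run-id> --log
# Show only the log output of failed steps
gh run view <run-id> --log-failed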
Solution: Make sure the feature branch is up to date with main, then push the rebased branch:
git fetch origin
git checkout feature/my-feature
git rebase origin/main
git push --force-with-lease
Tag Cutting Issues
Problem: Manual tag action not appearing
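- Check Trigger and Permissions: Ensure the workflow declares the workflow_dispatch trigger and that you have write access to the repository
- Branch Protection: Verify that the repository's Actions settings and the rules on main do not block manually triggered workflows
- Action Visibility: Confirm the workflow file exists on the default branch (main); a manually triggered workflow only appears in the Actions tab once it is there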
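The trigger and visibility checks can be scripted against a local checkout. The sketch below is a rough first pass: has_manual_trigger is a hypothetical helper, and it only searches for the trigger name outside comment lines rather than parsing the YAML.

```shell
# has_manual_trigger FILE
# Succeeds when FILE mentions workflow_dispatch before any "#" on a line.
has_manual_trigger() {
  grep -Eq '^[^#]*workflow_dispatch' "$1"
}

# Usage (the workflow file name is an assumption):
#   git fetch origin
#   git show origin/main:.github/workflows/cut-tag.yml > /tmp/cut-tag.yml
#   has_manual_trigger /tmp/cut-tag.yml && echo "manual trigger present on main"
```

Reading the file from origin/main rather than the working tree matters here, because only the copy on the default branch controls whether the manual action is listed.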
Problem: Image not found during tag cutting
# Authenticate first; a failed login points to a credentials or permissions problem
docker login registry.company.com
# Verify that the image for the commit exists in the registry
docker pull registry.company.com/service:<commit-sha>
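Pulling downloads the whole image just to learn whether it exists. As a lighter alternative, docker manifest inspect asks the registry for metadata only. The sketch below assumes images are tagged with the commit SHA, as in the example above; image_ref and image_exists are hypothetical helper names, and older Docker CLIs may need experimental features enabled for the manifest command.

```shell
REGISTRY="registry.company.com"   # registry host taken from the example above

# image_ref SERVICE TAG
# Prints the full image reference for a service and tag.
image_ref() {
  echo "${REGISTRY}/$1:$2"
}

# image_exists SERVICE TAG
# Succeeds when the registry returns a manifest for the reference.
# Private registries need a prior `docker login`.
image_exists() {
  docker manifest inspect "$(image_ref "$1" "$2")" >/dev/null 2>&1
}

# Usage:
#   image_exists service "$(git rev-parse HEAD)" || echo "image missing for this commit"
```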
Pipeline Performance Issues
NX Build Optimization
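# Clear the NX cache and stop the NX daemon (use when cached results look stale)
npx nx reset
# Analyze the project dependency graph
# (older NX versions use `nx dep-graph` and `nx affected:dep-graph` instead)
npx nx graph
npx nx graph --affected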
Standalone Build Optimization
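# Reclaim disk space by removing unused Docker data and build cache
# (this frees space but does not speed up builds; the next build starts cold)
docker system prune
docker builder prune
# Use BuildKit for faster builds (already the default in Docker Engine 23.0 and later)
export DOCKER_BUILDKIT=1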
Performance Optimization
Build Speed Improvements
For NX-Based Systems
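- Cache Optimization: Mark build, test, and lint targets as cacheable and declare their inputs and outputs so cache hits are reliable
- Dependency Graph: Keep project dependencies narrow so fewer projects rebuild when one changes
- Parallel Execution: Run independent tasks concurrently, for example with the --parallel option
- Selective Testing: Run only affected tests, for example with nx affected --target=test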
For Standalone Systems
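- Docker Layer Caching: Order Dockerfile instructions from least to most frequently changed, so dependency installation is cached ahead of the source copy
- Dependency Caching: Cache node_modules, pip packages, etc. between runs, keyed on the lockfile
- Build Context: Minimize Docker build context size, for example with a .dockerignore file
- Multi-stage Builds: Use multi-stage builds so the final image contains only runtime artifacts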
GitHub Actions Optimization
# Example optimization techniques
jobs:
  build:
    runs-on: ubuntu-latest
    # Parallel matrix builds
    strategy:
      matrix:
        node-version: [16, 18, 20]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      # Use action caching
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
      # Conditional steps (head_commit is only set on push events)
      - name: Run tests
        if: contains(github.event.head_commit.message, '[test]')
        run: npm test  # assumed test command
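The cache key in the example above changes only when a lockfile changes, which is what makes the cached ~/.npm directory safe to reuse. The shell sketch below imitates that idea for a single file. It is not how hashFiles is implemented, and cache_key is a hypothetical name.

```shell
# cache_key OS LOCKFILE
# Prints "<OS>-node-<sha256 of LOCKFILE>", so the key is stable until the
# lockfile content changes.
cache_key() {
  os=$1
  lockfile=$2
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(sha256sum "$lockfile" | cut -d ' ' -f 1)
  else
    # macOS ships shasum instead of sha256sum
    digest=$(shasum -a 256 "$lockfile" | cut -d ' ' -f 1)
  fi
  echo "${os}-node-${digest}"
}

# Usage:
#   cache_key Linux package-lock.json
```

Keying on the lockfile rather than on the commit SHA is a deliberate choice: most commits leave dependencies untouched, so most runs get a cache hit.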
Debugging Workflows
GitHub Actions Debugging
Enable Debug Logging
# Set these as repository secrets or variables, then re-run the job
ACTIONS_RUNNER_DEBUG: true
ACTIONS_STEP_DEBUG: true
Common Debug Commands
# Check runner environment (run inside a workflow step, where ${{ }} expressions are expanded)
echo "Runner OS: ${{ runner.os }}"
echo "GitHub workspace: ${{ github.workspace }}"
echo "GitHub event: ${{ github.event_name }}"
# Debug file permissions
ls -la
pwd
whoami
Container Debugging
Local Docker Testing
# Build and run locally (use /bin/sh if the image does not include bash)
docker build -t test-image .
docker run --rm -it test-image /bin/bash
# Check container layers
docker history test-image
# Inspect image
docker inspect test-image
Registry Issues
# Test registry connectivity
docker login registry.company.com
docker pull hello-world
docker tag hello-world registry.company.com/test:latest
docker push registry.company.com/test:latest
Performance Monitoring
Key Metrics to Track
| Metric | Target | Action if Target Is Missed |
|---|---|---|
| Build Duration | < 10 minutes | Optimize dependencies and caching |
| Test Execution | < 5 minutes | Parallelize tests, run only affected tests |
| Image Size | < 500 MB | Use multi-stage builds, choose a smaller base image |
| Success Rate | > 95% | Investigate frequent failures |
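The success-rate target can be checked from the conclusions of recent runs. In the sketch below, success_rate is a hypothetical helper that reads one conclusion per line, skips blank lines (runs still in progress), and prints the percentage of successful runs rounded down. Cancelled and skipped runs count against the rate here, which may or may not match how your team defines it.

```shell
# success_rate
# Reads run conclusions from stdin and prints the integer success percentage.
success_rate() {
  awk 'NF { total++ }
       $1 == "success" { ok++ }
       END { if (total > 0) printf "%d\n", ok * 100 / total; else print 0 }'
}

# Sample input: three successes out of four runs, so this prints 75.
printf 'success\nfailure\nsuccess\nsuccess\n' | success_rate

# With the GitHub CLI, the input could come from recent runs:
#   gh run list --limit 50 --json conclusion --jq '.[].conclusion' | success_rate
```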
Monitoring Tools
GitHub Actions Insights
- Workflow run history: Identify patterns in failures
- Job duration trends: Track performance over time
- Resource usage: Monitor runner utilization
Custom Monitoring
# Add timing to workflows
- name: Build with timing
  run: |
    start_time=$(date +%s)
    npm run build
    end_time=$(date +%s)
    echo "Build took $((end_time - start_time)) seconds"
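The same timing pattern can be wrapped in a reusable function so that every step reports its duration, including steps that fail. This is a sketch under assumptions: timed is a hypothetical helper, and the limit argument is whatever target you pick (600 seconds would mirror the 10-minute build target in the metrics table).

```shell
# timed LABEL LIMIT_SECONDS COMMAND [ARGS...]
# Runs COMMAND, prints its wall-clock duration, warns when it exceeds
# LIMIT_SECONDS, and returns COMMAND's exit status.
timed() {
  label=$1
  limit=$2
  shift 2
  start=$(date +%s)
  status=0
  # "|| status=$?" keeps the function alive under `bash -e`
  "$@" || status=$?
  elapsed=$(( $(date +%s) - start ))
  echo "$label took $elapsed seconds"
  if [ "$elapsed" -gt "$limit" ]; then
    # ::warning:: is a GitHub Actions workflow command; elsewhere it is plain text
    echo "::warning::$label exceeded ${limit}s"
  fi
  return "$status"
}

# Usage inside a run step:
#   timed build 600 npm run build
```

Capturing the exit status instead of letting the shell abort means the duration is still printed when the build fails, which is often when the number is most useful.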
Security Troubleshooting
Secret Management Issues
Problem: Secrets not available in workflow
Solutions:
- Check secret scope (repository vs. organization)
- Verify workflow permissions
- Ensure secret names match exactly (case-sensitive)
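In GitHub Actions, a secret is not exported to steps automatically: it has to be mapped into the step's environment, and a secret that is not defined expands to an empty string. A guard at the top of a step makes that failure explicit. In this sketch, require_env is a hypothetical helper and the variable names in the usage line are placeholders. It reports names only and never prints values.

```shell
# require_env NAME...
# Fails when any named environment variable is unset or empty.
# Arguments must be plain variable names, since they are expanded via eval.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage inside a run step:
#   require_env REGISTRY_USER REGISTRY_TOKEN || exit 1
```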
Problem: Token permissions insufficient
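# Add proper permissions to workflow
permissions:
  contents: read    # check out repository code
  packages: write   # publish packages or images
  id-token: write   # request an OIDC token (only needed for OIDC-based auth)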
Container Security
Vulnerability Scanning
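# Local security scanning
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image your-image:tag
# Check dependencies for known vulnerabilities (npm) and incompatible requirements (pip)
npm audit
pip check
# List outdated dependencies
npm outdated
pip list --outdated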
Best Practices
Feature Branch Strategy
- Keep Branches Small: Easier to validate and merge
- Regular Rebasing: Stay current with main branch
- Descriptive Names: Use prefixes like feature/, bugfix/, hotfix/
- Clean History: Squash commits before merge
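The naming convention is easy to enforce in a hook or an early CI step. A minimal sketch, assuming the three prefixes listed above are the only ones allowed; valid_branch_name is a hypothetical helper.

```shell
# valid_branch_name NAME
# Succeeds when NAME starts with an allowed prefix followed by at least
# one more character.
valid_branch_name() {
  case $1 in
    feature/?*|bugfix/?*|hotfix/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   valid_branch_name "$(git rev-parse --abbrev-ref HEAD)" || echo "rename the branch"
```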
Tag Cutting Guidelines
- Semantic Versioning: Follow semver (major.minor.patch)
- Release Notes: Document changes in each release
- Environment Testing: Validate in staging before production tag
- Rollback Plan: Ensure previous versions remain available
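Validating the version before a tag is cut catches typos early. The sketch below covers plain MAJOR.MINOR.PATCH versions with an optional leading "v"; pre-release and build-metadata suffixes are deliberately out of scope. Both function names are hypothetical, and next_version prints the result without the "v" prefix.

```shell
# is_semver VERSION
# Succeeds for MAJOR.MINOR.PATCH without leading zeros, optionally prefixed with "v".
is_semver() {
  printf '%s\n' "$1" |
    grep -Eq '^v?(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'
}

# next_version VERSION PART
# Prints VERSION with PART (major, minor, or patch) incremented.
next_version() {
  is_semver "$1" || { echo "not a semantic version: $1" >&2; return 1; }
  version=${1#v}
  part=$2
  major=${version%%.*}
  rest=${version#*.}
  minor=${rest%%.*}
  patch=${rest#*.}
  case $part in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  echo "$major.$minor.$patch"
}

# Usage:
#   next_version v1.2.3 minor    # prints 1.3.0
```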
Pipeline Maintenance
- Regular Updates: Keep actions and dependencies current
- Monitoring: Set up alerts for pipeline failures
- Documentation: Keep runbooks updated
- Testing: Validate pipeline changes in development environments
🆘 Emergency Procedures
Pipeline Outage Response
Immediate Actions
- Assess Impact: Determine affected services and environments
- Communicate: Notify stakeholders via Slack/incident channels
- Investigate: Check GitHub Actions status, runner availability
- Workaround: Consider manual deployment if critical
Escalation Path
- Platform Team: First line of support for CI/CD issues
- DevOps Lead: For architectural decisions
- Engineering Manager: For business impact decisions
Rollback Procedures
Failed Deployment Rollback
# For NX-based systems
git revert <commit-hash>
git push origin main
# For standalone systems
# Redeploy the previous known-good tag, then watch the rollout
kubectl set image deployment/app app=registry.com/app:v1.2.2
kubectl rollout status deployment/app
Database Migration Rollback
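# Always test rollback scripts in a non-production environment first
npm run migrate:down
# or, for Django, migrate back to the target migration
# (omit --fake: it only rewrites the migration history and leaves the schema unchanged)
python manage.py migrate app_name 0001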
This troubleshooting guide is maintained by the DevOps and Platform Engineering teams.