How those damn hidden dims in Q4_K_M quantization made Llama 2-7B dumb

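It had been about three months since we put Llama 2-7B into production. At first I thought, "this thing is the real deal." Fine-tuning went smoothly and response latency was decent. But at some point, something started to feel off. Instruction following in particular began to slip in subtle ways. The model handled simple questions fine, but give it a slightly more complex context or a few constraints and it would talk nonsense or wander off in a completely wrong direction. At first I suspected prompt engineering and tried everything I could think of, but the cause turned out to be somewhere else.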

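Bottom line: the problem was that certain hidden dims got chopped off during Q4_K_M quantization. Apparently this hits a model like Llama 2-7B pretty hard. On paper you're just storing each weight in fewer bits to gain speed and memory (the parameter count itself stays the same), but in practice the model loses the subtle nuance it needs to understand the world. It's like asking for karaoke recommendations in Sangsu and getting back nothing but "Karaoke is nice." The more specific stuff, like which genres you like or what the current hits are, is the "hidden dim" information that went missing.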

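Why call it a "silent regression"? Because no error code pops up and the model doesn't stop working. Only the quality of the responses drops. Regular users have a hard time noticing a change like that. But engineers like us, who wrestle with the model every day, can't help being sensitive to these subtle shifts. The symptoms were especially pronounced with the `llama-2-7b-chat-hf` version released around Q3 2023. VRAM usage may have gone down, but the weaker grasp of complex prompts was a clear step backward.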
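Since nothing crashes, the only way to catch this kind of thing is to measure it. Here's a minimal sketch of a constraint-based check, assuming you can wrap each model build in a plain `prompt -> reply` callable. Every name here (`CASES`, `pass_rate`, `regressed`, the stub generators) is hypothetical and made up for illustration; it is not what we actually ran in production.

```python
import json
from typing import Callable, List, Tuple

# A case is (prompt, check); check returns True when the reply obeys the constraint.
Case = Tuple[str, Callable[[str], bool]]


def is_json_with_key(reply: str, key: str) -> bool:
    """True if reply parses as a JSON object that contains `key`."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and key in parsed


# Hypothetical prompts with constraints that are trivial to verify mechanically.
CASES: List[Case] = [
    ("Answer in exactly three words: what is a GPU?",
     lambda r: len(r.split()) == 3),
    ("Reply with valid JSON containing a 'city' key.",
     lambda r: is_json_with_key(r, "city")),
    ("List two fruits, one per line, no other text.",
     lambda r: len([ln for ln in r.splitlines() if ln.strip()]) == 2),
]


def pass_rate(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose reply satisfies its constraint."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def regressed(baseline: Callable[[str], str],
              candidate: Callable[[str], str],
              cases: List[Case],
              max_drop: float = 0.05) -> bool:
    """True if the candidate's pass rate fell more than max_drop below the baseline's."""
    return pass_rate(baseline, cases) - pass_rate(candidate, cases) > max_drop


# Stub generators standing in for real model calls (assumption: swap in your own).
_GOOD_REPLIES = {
    CASES[0][0]: "Graphics processing unit",
    CASES[1][0]: '{"city": "Seoul"}',
    CASES[2][0]: "apple\nbanana",
}


def good_stub(prompt: str) -> str:
    return _GOOD_REPLIES[prompt]


def vague_stub(prompt: str) -> str:
    # Mimics the "Karaoke is nice." failure mode: fluent, but ignores the constraint.
    return "Sure, happy to help!"


if __name__ == "__main__":
    print("baseline pass rate:", pass_rate(good_stub, CASES))
    print("candidate pass rate:", pass_rate(vague_stub, CASES))
    print("regressed:", regressed(good_stub, vague_stub, CASES))
```

Run the same cases against the full-precision build and the quantized build before every deploy, and a drop like this shows up as a number instead of a hunch.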

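To be precise, with the combination of `group_size=128` and `block_size=64`, we observed that when `qmap_kernel` was called, some dimensions weren't mapped properly during 4-bit quantization and ended up filled with zeros. It's like a handful of pieces going missing from a really complicated puzzle. No wonder instruction following got worse. To fix it, we lowered the quantization setting to `group_size=32`, adjusted `block_size` as well, and redeployed. It cost us a bit more VRAM, but the model came back to its senses. Well, it reminded me once again that going back to basics is often the answer.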
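To make the group-size effect concrete, here's a toy NumPy sketch. Loud assumptions: it uses plain symmetric absmax 4-bit quantization over contiguous groups, it does not reproduce llama.cpp's actual Q4_K_M block layout or the `qmap_kernel` path mentioned above, and the weight matrix is synthetic. It only shows one plausible way dimensions can end up as all zeros: when a single large outlier shares a group with small weights, the group's scale gets so coarse that the small weights round to 0, and a smaller group limits how far that damage spreads.

```python
import numpy as np


def quantize_groupwise_4bit(w: np.ndarray, group_size: int) -> np.ndarray:
    """Quantize to signed 4-bit per group along the last axis, then dequantize."""
    rows, cols = w.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"
    groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per group: the largest magnitude maps to the top positive level (7).
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(groups / scale), -8, 7)
    return (q * scale).reshape(rows, cols)


def zeroed_columns(w: np.ndarray, w_hat: np.ndarray) -> np.ndarray:
    """Indices of columns that had non-zero weights in w but are all zero in w_hat."""
    was_live = np.abs(w).max(axis=0) > 0
    now_dead = np.abs(w_hat).max(axis=0) == 0
    return np.flatnonzero(was_live & now_dead)


# Synthetic weights: small values everywhere, plus one outlier column (index 5).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.01, size=(64, 256))
w[:, 5] = 10.0

w_hat_128 = quantize_groupwise_4bit(w, group_size=128)
w_hat_32 = quantize_groupwise_4bit(w, group_size=32)

dead_128 = zeroed_columns(w, w_hat_128)
dead_32 = zeroed_columns(w, w_hat_32)
mse_128 = float(np.mean((w - w_hat_128) ** 2))
mse_32 = float(np.mean((w - w_hat_32) ** 2))

print("zeroed columns, group_size=128:", len(dead_128))
print("zeroed columns, group_size=32: ", len(dead_32))
print("reconstruction MSE, 128 vs 32: ", mse_128, mse_32)
```

In this toy setup the outlier wipes out every other column in its group, so shrinking the group from 128 to 32 cuts the number of dead columns and the reconstruction error, at the cost of storing more scales. That's the same trade we made: a bit more memory for a model that behaves.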