A new study by the University of Bristol has revealed that 'Deepseek', a recently launched Chinese AI app, poses significant safety risks.
Bristol/UK – Deepseek is a Large Language Model (LLM) that uses Chain of Thought (CoT) reasoning, which enhances problem-solving through a step-by-step reasoning process rather than providing direct answers. Analysis by the Bristol Cyber Security Group reveals that while CoT-enabled models refuse harmful requests at a higher rate, their transparent reasoning process can unintentionally expose harmful information that traditional LLMs might not explicitly reveal.
This study, led by Zhiyuan Xu, provides critical insights into the safety challenges of CoT reasoning models and emphasizes the urgent need for enhanced safeguards. As AI continues to evolve, ensuring responsible deployment and continuous refinement of security measures will be paramount.
Co-author Dr Sana Belguith from Bristol’s School of Computer Science explained: “The transparency of CoT models such as Deepseek’s reasoning process that imitates human thinking makes them very suitable for wide public use.
“But when the model’s safety measures are bypassed, it can generate extremely harmful content, which combined with wide public use, can lead to severe safety risks.”
Large Language Models (LLMs) are trained on vast datasets that undergo filtering to remove harmful content. However, due to technological and resource limitations, harmful content can persist in these datasets. Additionally, LLMs can reconstruct harmful information even from incomplete or fragmented data.
Reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT) are commonly employed as safety training mechanisms during pre-training to prevent the model from generating harmful content. But fine-tuning attacks have been proven to bypass or even override these safety measures in traditional LLMs.
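To illustrate just how little machinery such fine-tuning involves, the minimal sketch below shows a generic supervised fine-tuning loop for a small causal language model using the open-source Hugging Face transformers library. The model name and the toy training examples are placeholders chosen purely for illustration, the data is entirely benign, and the snippet does not reproduce the attack setup studied in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any small causal language model works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of instruction/response pairs stands in for the small publicly
# available datasets mentioned later in the article; the content is benign filler.
examples = [
    "Instruction: Summarise the weather report.\nResponse: Rain is expected tomorrow.",
    "Instruction: Translate 'hello' into French.\nResponse: Bonjour.",
]
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no padding token by default
batch = tokenizer(examples, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few passes over tiny data are enough to shift behaviour
    # Loss is computed over all tokens here; a careful run would mask padding with -100.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()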
In this research, the team discovered that, when exposed to the same attacks, CoT-enabled models not only generated harmful content at a higher rate than traditional LLMs but also provided more complete, accurate and potentially dangerous responses because of their structured reasoning process. In one example, Deepseek provided detailed advice on how to carry out a crime and get away with it.
Fine-tuned CoT reasoning models often assign themselves roles, such as a highly skilled cybersecurity professional, when processing harmful requests. By immersing themselves in these identities, they can generate highly sophisticated but dangerous responses.
Co-author Dr Joe Gardiner added: “The danger of fine-tuning attacks on large language models is that they can be performed on relatively cheap hardware that is well within the means of an individual user for a small cost, and using small publicly available datasets in order to fine tune the model within a few hours.
“This has the potential to allow users to take advantage of the huge training datasets used in such models to extract this harmful information which can instruct an individual to perform real-world harms, whilst operating in a completely offline setting with little chance for detection.
“Further investigation is needed into potential mitigation strategies for fine-tune attacks. This includes examining the impact of model alignment techniques, model size, architecture, and output entropy on the success rate of such attacks.”
While CoT-enabled reasoning models inherently possess strong safety awareness, generating responses that closely align with user queries while maintaining transparency in their thought process, they can become dangerous tools in the wrong hands. This study highlights that, with minimal data, CoT reasoning models can be fine-tuned to exhibit highly dangerous behaviours across various harmful domains, posing safety risks.
Dr Belguith concluded: “The reasoning process of these models is not entirely immune to human intervention, raising the question of whether future research could explore attacks targeting the model's thought process itself.
“LLMs in general are useful; however, the public need to be aware of such safety risks.
“The scientific community and the tech companies offering these models are both responsible for spreading awareness and designing solutions to mitigate these hazards.”
Paper: ‘The dark deep side of Deepseek: Fine-tuning attacks against the safety alignment of CoT-enabled models’ by Zhiyuan Xu, Dr Sana Belguith and Dr Joe Gardiner, available on arXiv.
Date: 08.12.2025