
About Data Scientist
Advanced AI text detection platform designed to help maintain academic integrity by identifying AI-generated content in papers and publications.
Kungratov Ilmurod
Master of Data Science Student
This platform was developed as part of a Master's thesis project in 2026.
Version 1.0
What is Data Scientist?
Data Scientist is an advanced AI detection platform that uses ensemble machine learning methods to analyze text and determine the probability of AI authorship. Our system combines multiple analysis techniques to provide accurate and reliable results.
Whether you're an educator checking student submissions, a publisher verifying content authenticity, or a researcher ensuring academic integrity, Data Scientist provides the tools you need to make informed decisions.
How It Works
Perplexity Analysis
Measures how predictable the text is. AI-generated text typically has lower perplexity because it follows more predictable patterns.
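As a rough illustration of the idea, perplexity can be computed from the log-probabilities a language model assigns to each token. The sketch below assumes those log-probabilities are already available (on this platform they would come from a language model; here they are made-up values), and the function name `perplexity` is hypothetical rather than part of the platform's code.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_negative_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_negative_log_prob)


# Hypothetical values for illustration only, not real model output.
predictable = [math.log(0.5)] * 4   # model gave each token probability 0.5
surprising = [math.log(0.05)] * 4   # model gave each token probability 0.05

print(perplexity(predictable))
print(perplexity(surprising))
```

The more probable the model finds each token, the lower the score, which is why highly predictable text ends up with low perplexity.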
Burstiness Detection
Analyzes sentence length variation. Human writing tends to have more varied sentence structures, while AI text is often more uniform.
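One simple way to quantify sentence length variation is the coefficient of variation (standard deviation divided by mean) of words per sentence. This is a minimal sketch under that assumption; the platform's actual burstiness formula is not stated here, and the helper names `sentence_lengths` and `burstiness` are hypothetical.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on sentence-ending punctuation and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence lengths (0.0 means fully uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # variation is undefined for fewer than two sentences
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = ("No. The storm that rolled in over the hills last night "
          "kept everyone awake until dawn. We slept late.")

print(burstiness(uniform))
print(burstiness(varied))
```

The regular expression is a crude sentence splitter; abbreviations such as "e.g." would be split incorrectly, so a proper tokenizer would be a better choice in practice.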
Transformer Classification
Uses a RoBERTa-based neural network specifically trained to detect AI-generated content from models like GPT.
Stylometric Analysis
Examines vocabulary richness, word patterns, and writing style markers that differ between human and AI authors.
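Vocabulary richness is often approximated by the type-token ratio: the number of distinct words divided by the total number of words. The sketch below uses only a regular expression as a stand-in tokenizer, and the names `tokenize` and `type_token_ratio` are hypothetical. It shows one possible marker, not the platform's full feature set.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for empty input."""
    words = tokenize(text)
    if not words:
        return 0.0
    return len(set(words)) / len(words)


sample = "the cat and the dog"
print(type_token_ratio(sample))
```

The ratio tends to fall as texts get longer, so it is most meaningful when comparing passages of similar length.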
Important Notice
Data Scientist provides probabilistic estimates and should be used as one of several factors in determining content authenticity. No AI detection tool can guarantee 100% accuracy. Results should not be used as sole evidence of AI authorship.
AI Models Used
Our platform uses the following state-of-the-art artificial intelligence models to analyze text:
RoBERTa OpenAI Detector
roberta-base-openai-detector
Transformer model trained by OpenAI to detect GPT-2 generated text. Participates in the ensemble with 40% weight.
DistilGPT-2
distilgpt2 (HuggingFace)
Model used for perplexity and entropy analysis. Detects the low perplexity characteristic of AI-generated text.
Burstiness Analysis
Statistical Analysis
Analyzes the variability of sentence length and complexity. Human-written text tends to be more 'bursty'.
Stylometric + NLTK
NLP Analysis
Uses the NLTK library for POS tagging, lexical diversity, and AI phrase pattern analysis.
Ensemble Method: Results from all models are combined with weight coefficients to calculate the final AI probability.
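A weighted combination of this kind might look like the sketch below. Only the 40% weight for the RoBERTa detector comes from the description above; the other three weights, the component scores, and the function name `ensemble_probability` are assumptions chosen for illustration.

```python
def ensemble_probability(scores, weights):
    """Weighted average of per-component AI probabilities (each in [0, 1])."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# The 0.40 RoBERTa weight is stated above; the rest are hypothetical.
WEIGHTS = {
    "roberta": 0.40,
    "perplexity": 0.25,
    "burstiness": 0.15,
    "stylometric": 0.20,
}

# Hypothetical component outputs for a single document.
scores = {
    "roberta": 0.9,
    "perplexity": 0.7,
    "burstiness": 0.5,
    "stylometric": 0.6,
}

print(ensemble_probability(scores, WEIGHTS))
```

Dividing by the total weight keeps the result in the 0 to 1 range even if the weights do not sum exactly to 1.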