
About Data Scientist

Advanced AI text detection platform designed to help maintain academic integrity by identifying AI-generated content in papers and publications.

Kungratov Ilmurod

Master of Data Science Student

Nordik International University

This platform was developed as part of a Master's thesis project in 2026.

Version 1.0

What is Data Scientist?

Data Scientist is an AI detection platform that uses an ensemble of machine learning methods to estimate the probability that a text was written by AI. It combines four analysis techniques (perplexity analysis, burstiness detection, transformer classification, and stylometric analysis) into a single weighted score.

Whether you're an educator checking student submissions, a publisher verifying content authenticity, or a researcher ensuring academic integrity, Data Scientist provides the tools you need to make informed decisions.

How It Works

Perplexity Analysis

Measures how predictable the text is. AI-generated text typically has lower perplexity because it follows more predictable patterns.
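
As a rough illustration of the idea, perplexity can be computed as the exponential of the negative mean log-probability a language model assigns to each token. The sketch below assumes the per-token log-probabilities are already available; the function name and the sample probabilities are hypothetical and are not taken from the platform's code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities, for illustration only.
predictable = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprising = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

# The more predictable sequence gets the lower perplexity.
print(perplexity(predictable) < perplexity(surprising))
```

In practice the log-probabilities would come from a language model scoring the text; this sketch only shows the arithmetic.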

Burstiness Detection

Analyzes sentence length variation. Human writing tends to have more varied sentence structures, while AI text is often more uniform.
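
One simple way to quantify sentence length variation is the coefficient of variation (standard deviation divided by mean) of words per sentence. The sketch below assumes that measure and a naive punctuation-based sentence splitter; both are illustrative choices, and the platform may define burstiness differently.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., ! and ? (a naive assumption) and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence lengths; 0.0 means perfectly uniform."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.pstdev(lengths) / mean if mean else 0.0


uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = "Stop. The dog ran across the wide green field toward the river. Why?"

# The text with mixed short and long sentences scores higher.
print(burstiness(uniform) < burstiness(varied))
```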

Transformer Classification

Uses a RoBERTa-based neural network trained to detect AI-generated text. The underlying detector was trained on GPT-2 output.

Stylometric Analysis

Examines vocabulary richness, word patterns, and writing style markers that differ between human and AI authors.
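
Vocabulary richness is often approximated by the type-token ratio: the number of distinct words divided by the total number of words. The sketch below assumes that measure and a simple regex tokenizer; it is one possible stylometric feature, not necessarily the one the platform computes.

```python
import re


def type_token_ratio(text):
    """Distinct words divided by total words, case-insensitive; 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Repetitive text has a lower ratio than text where every word is new.
print(type_token_ratio("the cat the cat") < type_token_ratio("a quick brown fox"))
```

The ratio falls as texts get longer, so comparisons are only meaningful between passages of similar length.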

Important Notice

Data Scientist provides probabilistic estimates and should be used as one of several factors in determining content authenticity. No AI detection tool can guarantee 100% accuracy. Results should not be used as sole evidence of AI authorship.

AI Models Used

Our platform uses the following models and analysis methods to score text:

RoBERTa OpenAI Detector

roberta-base-openai-detector

Transformer model trained by OpenAI to detect GPT-2-generated text. Contributes 40% of the ensemble weight.

Transformer (40% weight)

DistilGPT-2

distilgpt2 (HuggingFace)

Language model used to compute perplexity and entropy. Low perplexity under this model is treated as a signal of AI-generated text.

Perplexity (25% weight)

Burstiness Analysis

Statistical Analysis

Analyzes the variability of sentence length and complexity. Human-written text tends to be more 'bursty'.

Statistical (20% weight)

Stylometric + NLTK

NLP Analysis

Uses the NLTK library for part-of-speech (POS) tagging, lexical diversity measurement, and detection of phrase patterns common in AI-generated text.

NLP (15% weight)

Ensemble Method: The score from each component is multiplied by its weight (40%, 25%, 20%, and 15%, totaling 100%), and the weighted scores are combined to calculate the final AI probability.
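
A minimal sketch of such a weighted combination is shown below. It assumes each component has already been converted to an AI-probability score between 0 and 1 (how the platform does that conversion is not described here) and that the scores are combined as a plain weighted sum. The dictionary keys, function name, and sample scores are hypothetical.

```python
# Weights as listed on this page; the key names are illustrative.
WEIGHTS = {
    "transformer": 0.40,
    "perplexity": 0.25,
    "burstiness": 0.20,
    "stylometric": 0.15,
}


def ensemble_probability(scores):
    """Weighted sum of per-component AI-probability scores, each in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {
    "transformer": 0.9,
    "perplexity": 0.8,
    "burstiness": 0.6,
    "stylometric": 0.4,
}
# 0.40*0.9 + 0.25*0.8 + 0.20*0.6 + 0.15*0.4 = 0.74
print(round(ensemble_probability(example), 2))
```

Because the weights sum to 1, the result stays between 0 and 1 whenever every component score does.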

Technology Stack

Python
PyTorch
FastAPI
Next.js
Transformers
NLTK
Tailwind CSS
TypeScript