AI/ML Project
Hate Speech Detection using Machine Learning
Python, Scikit-learn, NLTK, Pandas, Matplotlib

Overview
A hate speech classifier for tweets, built with Scikit-learn and NLTK. The pipeline covers text preprocessing, class balancing with data augmentation, TF-IDF feature extraction, and Random Forest classification.
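The preprocessing step can be illustrated with a minimal, standard-library sketch. The exact cleaning rules used in the project are not listed here, so the function name `clean_tweet` and the rules below (lowercasing, stripping URLs and @mentions, keeping hashtag words) are assumptions. NLTK steps such as stopword removal and lemmatization are left out because they need downloaded corpora.

```python
import re

# Assumed cleaning rules; the project's actual rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s']")  # keep apostrophes so "don't" survives


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")        # keep the hashtag word, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and remaining punctuation
    return " ".join(text.split())        # collapse repeated whitespace


cleaned = clean_tweet("@user Don't miss this!! https://t.co/abc #News")
print(cleaned)
```

Keeping the apostrophe is a deliberate choice in this sketch: negations such as "don't" carry meaning that a classifier may need.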
Key Features
- Tweet Preprocessing & Cleaning
- Class Balancing with Data Augmentation
- TF-IDF Feature Extraction
- Random Forest Classification
- Evaluation with Precision, Recall & F1-score
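To make the evaluation metrics concrete, here is a small sketch that computes precision, recall, and F1 for one class by hand. The labels `"hate"` and `"neutral"` and the toy predictions are hypothetical and are not results from this project.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, f1) for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: 3 "hate" tweets and 5 "neutral" tweets.
y_true = ["hate", "hate", "hate", "neutral", "neutral", "neutral", "neutral", "neutral"]
y_pred = ["hate", "hate", "neutral", "hate", "neutral", "neutral", "neutral", "neutral"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="hate")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

In this toy case accuracy is 0.75 while precision, recall, and F1 for the minority class are each about 0.67, which is why per-class metrics are reported alongside accuracy on imbalanced data.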
Architecture & Decisions
- TF-IDF with Bigrams: Used TF-IDF vectorization with unigrams and bigrams to better capture contextual clues in short text.
- Class Balancing via Augmentation: Used NLPAug to synthetically augment underrepresented classes instead of only undersampling dominant ones.
- Random Forest as Final Classifier: Chose Random Forest for its strong, robust performance on this dataset, where it reached 89% accuracy.
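The TF-IDF and Random Forest decisions above can be sketched as a single Scikit-learn pipeline. This is an illustration under assumptions, not the project's actual code: the toy sentences, the 0/1 labels, and the hyperparameters (`n_estimators=100`, `random_state=42`) are all made up for the example. The `ngram_range=(1, 2)` setting is what tells `TfidfVectorizer` to emit both unigrams and bigrams.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = offensive, 0 = neutral.
texts = [
    "you people are awful and should leave",
    "nobody wants your kind around here",
    "get out you are not welcome here",
    "have a lovely day everyone",
    "thanks for the help with my homework",
    "what a great match last night",
]
labels = [1, 1, 1, 0, 0, 0]

model = Pipeline([
    # Unigrams plus bigrams, so word pairs like "not welcome" become features.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # Assumed hyperparameters; tune them on real data.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(texts, labels)

vocab = model.named_steps["tfidf"].vocabulary_
bigrams = sorted(term for term in vocab if " " in term)
predictions = model.predict(["you are not welcome"])
print(bigrams[:5], predictions)
```

Wrapping both steps in a `Pipeline` keeps the vectorizer fitted only on training data, which avoids leaking vocabulary statistics from a test split.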
Challenges & Solutions
Challenge / Problem
The model struggled with subtle contextual cues such as sarcasm and negation (e.g., 'don't kill').
Solution / Implementation
Explored deep learning and transformer-based approaches such as BERT as a future enhancement, but kept classical ML for simplicity within this project's scope.
Project Timeline
- Day 1: Dataset Cleaning
- Day 2: Exploratory Data Analysis
- Day 3: Feature Engineering
- Day 4: Model Training & Evaluation
- Day 5: Packaging & Testing
Reference Resources
- NLTK Documentation (documentation): Used for text preprocessing tasks like tokenization, stopword removal, and lemmatization.
- Scikit-learn Classification Guide (documentation): Guide for using various ML classifiers, evaluation metrics, and TF-IDF vectorization.
- NLPAug for Text Augmentation (article): Helpful for augmenting underrepresented classes using synonym replacement.
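
The idea behind synonym-replacement augmentation can be sketched without NLPAug. The example below is a standard-library stand-in, not NLPAug's API: the `SYNONYMS` table, the function names `augment` and `balance`, and the replacement probability are all hypothetical. It tops up each minority class with augmented copies until every class matches the largest one.

```python
import random
from collections import Counter

# Hypothetical synonym table; a real setup would use a lexical resource.
SYNONYMS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks"],
    "leave": ["go"],
}


def augment(text, rng, p=0.5):
    """Replace each word that has a known synonym with probability p."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word)
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def balance(samples, rng):
    """Oversample minority classes with augmented copies of their texts."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        for i in range(target - len(texts)):
            # An augmented copy may equal its source if no word was replaced.
            balanced.append((augment(texts[i % len(texts)], rng), label))
    return balanced


samples = [
    ("have a lovely day", 0),
    ("thanks for the help", 0),
    ("what a great match", 0),
    ("you people are awful and should leave", 1),
]
balanced = balance(samples, random.Random(0))
counts = Counter(label for _, label in balanced)
print(counts)
```

Augmenting only the training split, after the train/test split is made, keeps near-duplicate texts from appearing on both sides of the evaluation.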