AI/ML Project

Hate Speech Detection using Machine Learning

Python, Scikit-learn, NLTK, Pandas, Matplotlib

Overview

A hate speech classifier for tweets, built with Scikit-learn and NLTK. The pipeline covers text preprocessing, class balancing through data augmentation, TF-IDF feature extraction, and Random Forest classification.

Key Features

  • Tweet Preprocessing & Cleaning
  • Class Balancing with Data Augmentation
  • TF-IDF Feature Extraction
  • Random Forest Classification
  • Evaluation with Precision, Recall & F1-score
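
The tweet cleaning step listed first above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual code: the function name `clean_tweet` and the specific rules (dropping URLs, mentions, and the retweet marker, keeping apostrophes) are assumptions, and the NLTK steps such as stopword removal and lemmatization are left out to keep the example self-contained.

```python
import re

# Patterns for common tweet noise (illustrative choices, not the project's exact rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\brt\b")
# Keep lowercase letters, apostrophes, and whitespace; drop everything else.
NOISE_RE = re.compile(r"[^a-z'\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, retweet markers, and symbols."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NOISE_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


print(clean_tweet("RT @user: Check THIS out!! https://t.co/abc #WOW"))
```

Apostrophes are kept on purpose so that contractions such as "don't" stay in one piece at this stage; how they are tokenized later is a separate decision.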

Architecture & Decisions

  • TF-IDF with Bigrams: Used TF-IDF vectorization with unigrams and bigrams to better capture contextual clues in short text.
  • Class Balancing via Augmentation: Used NLPAug to synthetically augment underrepresented classes instead of only undersampling dominant ones.
  • Random Forest as Final Classifier: Chose Random Forest for its superior performance and robustness on this dataset, achieving 89% accuracy.
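
The vectorizer and classifier decisions above can be sketched as a single Scikit-learn pipeline. The toy texts, the labels, and the hyperparameters (`n_estimators=100`, `random_state=42`) are placeholders chosen for illustration, not the project's data or settings, and the report is computed on the training data only to show the API; a real evaluation would use a held-out split.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = flagged, 0 = neutral.
texts = [
    "i hate those people so much",
    "those people are awful and should leave",
    "have a nice day everyone",
    "what a nice day for a walk",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    # ngram_range=(1, 2) produces both unigram and bigram features.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
model.fit(texts, labels)

predictions = model.predict(texts)

# Per-class precision, recall, and F1-score in one text table.
report = classification_report(labels, predictions, zero_division=0)
print(report)
```

Wrapping both steps in a `Pipeline` keeps the vocabulary fitted on training text only, which avoids leaking test-set terms into the features once a proper split is added.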

Challenges & Solutions

Challenge / Problem

Model struggled with subtle contextual differences (e.g., sarcasm, negations like 'don't kill').
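
One way to see why negations are hard for n-gram features is to inspect what the vectorizer actually extracts. The sketch below assumes Scikit-learn's default tokenization, which may differ from the project's configuration; with that default, the contraction is split at the apostrophe and the negated and plain phrases end up sharing the feature "kill".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn raw
# text into n-gram features (lowercasing, tokenizing, building bigrams).
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

negated = analyzer("don't kill")
plain = analyzer("kill")

# The default token pattern keeps only runs of two or more word characters,
# so "don't" is reduced to "don" and the stray "t" is discarded.
shared = set(negated) & set(plain)

print("negated features:", negated)
print("plain features:  ", plain)
print("shared features: ", shared)
```

Because a strongly weighted term like "kill" appears in both feature sets, a bag-of-n-grams model has little signal left to tell the two phrases apart.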

Solution / Implementation

Explored deep learning and transformer-based approaches such as BERT as a future enhancement, but ultimately kept the classical ML pipeline for simplicity within this project's scope.

Project Timeline

Day 1

Dataset Cleaning

Day 2

Exploratory Data Analysis

Day 3

Feature Engineering

Day 4

Model Training & Evaluation

Day 5

Packaging & Testing

Reference Resources

documentation

NLTK Documentation

Used for text preprocessing tasks like tokenization, stopword removal, and lemmatization.

documentation

Scikit-learn Classification Guide

Guide for using various ML classifiers, evaluation metrics, and TF-IDF vectorization.

article

NLPAug for Text Augmentation

Helpful for augmenting underrepresented classes using synonym replacement.
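
The idea behind balancing classes through synonym replacement can be illustrated without NLPAug itself. The sketch below is a standard-library stand-in, not NLPAug's API: the `SYNONYMS` table and the functions `synonym_replace` and `balance_by_augmentation` are hypothetical names introduced here, and a real run would rely on the library's synonym augmenter instead of a hand-written table.

```python
import random
from collections import Counter

# Hypothetical synonym table used only for this illustration.
SYNONYMS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks", "persons"],
    "hate": ["despise", "loathe"],
}


def synonym_replace(text, rng):
    """Swap one randomly chosen known word for one of its synonyms."""
    words = text.split()
    candidates = [i for i, word in enumerate(words) if word in SYNONYMS]
    if not candidates:
        return text  # nothing to replace; return the text unchanged
    index = rng.choice(candidates)
    words[index] = rng.choice(SYNONYMS[words[index]])
    return " ".join(words)


def balance_by_augmentation(texts, labels, seed=0):
    """Add augmented copies of minority-class texts until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            out_texts.append(synonym_replace(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels


texts = ["nice day", "good morning", "lovely weather", "i hate awful people"]
labels = [0, 0, 0, 1]
balanced_texts, balanced_labels = balance_by_augmentation(texts, labels)
print(Counter(balanced_labels))
```

Augmentation of this kind should be applied to the training split only, so that synthetic variants of a tweet never end up in the evaluation data.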