AI/ML Project

Hate Speech Detection using Machine Learning

Python, Scikit-learn, NLTK, Pandas, Matplotlib

Overview

A hate speech classifier for tweets, built with Scikit-learn and NLTK. The pipeline covers text preprocessing, class balancing through data augmentation, TF-IDF feature extraction, and Random Forest classification.

Key Features

Tweet Preprocessing & Cleaning
Class Balancing with Data Augmentation
TF-IDF Feature Extraction
Random Forest Classification
Evaluation with Precision, Recall & F1-score
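
As a rough illustration of the cleaning step listed first, the sketch below lowercases a tweet and strips URLs, mentions, symbols, and stopwords. It is not the project's actual code: the project used NLTK for tokenization, stopword removal, and lemmatization, while this version uses only the standard library. The `clean_tweet` name, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for the example.

```python
import re

# Small illustrative stopword set (an assumption); the project used NLTK's resources.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, symbols, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")        # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)   # drop digits, punctuation, emoji
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean_tweet("@user This is SO wrong!!! https://t.co/abc #NotOkay 123")
print(cleaned)  # so wrong notokay
```

Note that stripping every non-letter character also splits contractions ("don't" becomes "don t"), which matters for the negation problem described under Challenges & Solutions.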

Architecture & Decisions

TF-IDF with Bigrams: Used TF-IDF vectorization with unigrams and bigrams to better capture contextual clues in short text.
Class Balancing via Augmentation: Used NLPAug to synthetically augment underrepresented classes instead of only undersampling dominant ones.
Random Forest as Final Classifier: Chose Random Forest for its superior performance and robustness on this dataset, achieving 89% accuracy.
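
The decisions above can be sketched as a single Scikit-learn pipeline: a `TfidfVectorizer` with `ngram_range=(1, 2)` feeding a `RandomForestClassifier`, scored with precision, recall, and F1. This is a minimal sketch under stated assumptions, not the project's training script. The toy sentences, the labels, and the hyperparameters (`n_estimators=100`, `random_state=0`) are invented for illustration and will not reproduce the reported 89% accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; 1 = hateful, 0 = not hateful.
texts = [
    "i hate those people", "those people should leave",
    "you are awful people", "hate hate hate them",
    "i love this song", "what a lovely day",
    "great game last night", "love those people",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# The fitted vocabulary holds both single words and two-word phrases.
vocab = model.named_steps["tfidfvectorizer"].vocabulary_
print("those people" in vocab)  # True

# Scored on the training texts only to show the API; a real evaluation
# would use a held-out test split.
preds = model.predict(texts)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, preds, average="macro", zero_division=0
)
print(precision, recall, f1)
```

Macro averaging weights each class equally, which is usually the more informative view when one class is much rarer than the other.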

Challenges & Solutions

Challenge / Problem

The model struggled with subtle contextual differences, such as sarcasm and negation (e.g., 'don't kill').

Solution / Implementation

Explored deep learning and transformer-based approaches such as BERT as a future enhancement, but kept the classical ML pipeline for simplicity within this project's scope.
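
One way to see why negation is hard for an n-gram model is to look at what the tokenizer keeps. The sketch below reuses the regular expression that Scikit-learn's `TfidfVectorizer` applies by default (`token_pattern=r"(?u)\b\w\w+\b"`) and lists the unigrams and bigrams it would produce. Whether the project kept that default is not stated, so treat this as a possible contributing factor rather than a diagnosis. The `ngrams` helper is hypothetical.

```python
import re
from typing import List

# Same regex as scikit-learn's default TfidfVectorizer token_pattern:
# tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def ngrams(text: str, n_max: int = 2) -> List[str]:
    """Return all n-grams of the text's tokens, for n from 1 to n_max."""
    tokens = TOKEN_RE.findall(text.lower())
    out = []
    for n in range(1, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out


print(ngrams("don't kill"))   # ['don', 'kill', 'don kill']
print(ngrams("do not kill"))  # ['do', 'not', 'kill', 'do not', 'not kill']
```

Under this tokenization the contraction loses its negating part, while the expanded form keeps a "not kill" bigram. Sarcasm has no comparable surface cue, which is part of why contextual models such as BERT were considered.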

Project Timeline

Day 1

Dataset Cleaning

Day 2

Exploratory Data Analysis

Day 3

Feature Engineering

Day 4

Model Training & Evaluation

Day 5

Packaging & Testing

Reference Resources

documentation

NLTK Documentation

Used for text preprocessing tasks like tokenization, stopword removal, and lemmatization.

documentation

Scikit-learn Classification Guide

Guide for using various ML classifiers, evaluation metrics, and TF-IDF vectorization.

article

NLPAug for Text Augmentation

Helpful for augmenting underrepresented classes using synonym replacement.
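
To make the idea concrete, here is a minimal sketch of balancing classes through synonym replacement. It does not call NLPAug: the `SYNONYMS` table, `synonym_replace`, and `balance_by_augmentation` are hypothetical stand-ins written with the standard library, and the sample texts are invented. A real run would draw synonyms from a lexical resource through the library.

```python
import random
from collections import Counter

# Hypothetical synonym table for illustration only.
SYNONYMS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks", "persons"],
    "leave": ["go", "depart"],
}


def synonym_replace(text, rng):
    """Swap every word that has an entry in SYNONYMS for a random synonym."""
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split()
    )


def balance_by_augmentation(texts, labels, seed=0):
    """Append augmented minority-class texts until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == label]
        for _ in range(target - count):
            out_texts.append(synonym_replace(rng.choice(pool), rng))
            out_labels.append(label)
    return out_texts, out_labels


texts = ["nice day", "good game", "fun trip", "great song",
         "awful people", "those people should leave"]
labels = [0, 0, 0, 0, 1, 1]
aug_texts, aug_labels = balance_by_augmentation(texts, labels)
print(Counter(aug_labels))  # both classes now have 4 examples
```

Augmentation of this kind is normally applied to the training split only, so that reworded copies of a text do not also appear in the test set.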
