Data Collection and Preprocessing

Data Types and Sources

1. Data Types:

  • Structured Data: Organized in a clear, easily searchable format, typically in tables with rows and columns (e.g., databases, spreadsheets).
  • Unstructured Data: Lacks a predefined structure, often text-heavy, such as emails, social media posts, images, or videos.
  • Semi-Structured Data: Contains elements of both structured and unstructured data, like JSON, XML, or log files.
  • Time-Series Data: Data points collected or recorded at specific time intervals, used in financial markets, sensor readings, etc.
  • Geospatial Data: Information about physical objects on Earth, often used in maps and GPS systems.

2. Data Sources:

  • Databases: Relational (e.g., MySQL, PostgreSQL) and non-relational (e.g., MongoDB) databases.
  • APIs: Interfaces provided by services to access their data programmatically (e.g., Twitter API, Google Maps API).
  • Web Scraping: Extracting data from websites using tools like BeautifulSoup or Scrapy (see the sketch after this list).
  • Sensors: IoT devices, wearables, and other hardware that collect real-time data.
  • Public Datasets: Open data repositories like Kaggle, UCI Machine Learning Repository, or government databases.
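
As a minimal illustration of programmatic collection, the sketch below fetches a page with requests and parses it with BeautifulSoup; the URL is a placeholder, and any real scraper should respect the target site's robots.txt and terms of service.

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with a page you are permitted to scrape
url = "https://example.com"

# Fetch the page and fail loudly on a bad HTTP status
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML, then pull out the page title and all link targets
soup = BeautifulSoup(response.text, "html.parser")
print("Title:", soup.title.string if soup.title else None)
links = [a.get("href") for a in soup.find_all("a")]
print("Links found:", len(links))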

Tensors:

  • Definition: A tensor is a generalization of vectors and matrices to higher dimensions; a vector is a first-order tensor and a matrix a second-order tensor. Tensors are used in deep learning, physics, and other higher-dimensional data representations.
  • Notation: Tensors are often denoted by uppercase letters (e.g., T) with subscript indices for each dimension, such as T_{ijk} for a third-order tensor.
  • Operations: Tensor operations generalize matrix operations to higher dimensions, including addition, multiplication, and contraction (summing over a shared index).
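
As a quick sketch, NumPy's n-dimensional arrays can represent tensors; the example below builds a third-order tensor T_{ijk} and contracts it with a matrix along the shared index using np.tensordot (the shapes are arbitrary).

import numpy as np

# A third-order tensor T_{ijk} with shape (2, 3, 4)
T = np.arange(24).reshape(2, 3, 4)
print("Tensor shape:", T.shape)

# Addition generalizes element-wise matrix addition to higher dimensions
print("Addition preserves shape:", (T + T).shape)

# Contraction: sum over the shared index k between T_{ijk} and M_{kl},
# yielding a tensor of shape (2, 3, 5)
M = np.ones((4, 5))
contracted = np.tensordot(T, M, axes=([2], [0]))
print("Contracted shape:", contracted.shape)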

Data Cleaning: Handling Missing Values, Outliers

1. Handling Missing Values:

  • Removal:
    • Delete Rows: Remove rows with missing values if they constitute a small portion of the data.
    • Delete Columns: Remove columns with a significant proportion of missing values.
  • Imputation:
    • Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
    • Forward/Backward Fill: Fill missing values with the previous/next observation in time-series data.
    • Interpolation: Estimate missing values based on surrounding data points, particularly in time-series data.
  • Advanced Techniques:
    • K-Nearest Neighbors (KNN) Imputation: Estimate missing values based on similar rows.
    • Multiple Imputation: Generate multiple imputations and average them to handle uncertainty.
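
A minimal sketch of these strategies on toy data, assuming pandas and scikit-learn are available; the values are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy time-series-like column with gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Mean imputation
print(s.fillna(s.mean()).tolist())

# Forward fill / backward fill (common for time series)
print(s.ffill().tolist())
print(s.bfill().tolist())

# Linear interpolation between surrounding points
print(s.interpolate().tolist())

# KNN imputation: estimate a missing value from the most similar rows
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
print(KNNImputer(n_neighbors=2).fit_transform(df))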

2. Handling Outliers:

  • Identification:
    • Z-Score: Compute Z = (x - mean) / standard deviation; data points with |Z| above a chosen threshold (e.g., |Z| > 3) are flagged as outliers.
    • IQR Method: Points lying below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (where IQR = Q3 - Q1 is the interquartile range) are considered outliers.
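
A minimal sketch of both identification rules on synthetic data; the planted value 95 should be flagged by each rule (an occasional extreme random draw may also appear).

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, size=200), [95.0]])  # 95 is a planted outlier

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower) | (data > upper)])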

Feature Engineering: Scaling, Encoding, Selection

Scaling:

  • Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
  • Min-Max Scaling: Scale data to a fixed range, typically [0, 1].
  • Robust Scaling: Use median and IQR for scaling, which is robust to outliers.
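
A minimal sketch comparing the three scalers on the same column, assuming scikit-learn; note how the outlier 100 dominates min-max scaling but barely affects robust scaling.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# Standardization: (x - mean) / std
print("Standard:", StandardScaler().fit_transform(X).ravel())

# Min-max: (x - min) / (max - min); the outlier squashes the rest near 0
print("Min-max: ", MinMaxScaler().fit_transform(X).ravel())

# Robust: (x - median) / IQR, largely insensitive to the outlier
print("Robust:  ", RobustScaler().fit_transform(X).ravel())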

Encoding:

  • One-Hot Encoding: Convert categorical variables into a series of binary columns.
  • Label Encoding: Assign a unique integer to each category.
  • Ordinal Encoding: Encode categorical variables where order matters (e.g., “low”, “medium”, “high”).
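
The sketch below applies all three encodings to a toy column, assuming pandas and scikit-learn; for the ordinal case the category order is supplied explicitly.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df["size"], prefix="size"))

# Label encoding: an arbitrary integer per category (alphabetical here)
print(LabelEncoder().fit_transform(df["size"]))

# Ordinal encoding: integers that respect the stated order low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ordinal.fit_transform(df[["size"]]).ravel())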

Feature Selection:

  • Filter Methods: Select features based on statistical tests like Chi-square or correlation.
  • Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE) to select features.
  • Embedded Methods: Feature selection occurs during the training of the model, e.g., Lasso regression.
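
A minimal sketch with one method from each family, using scikit-learn's built-in iris dataset; keeping two features is an arbitrary choice for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import Lasso, LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest chi-square scores
X_filtered = SelectKBest(chi2, k=2).fit_transform(X, y)
print("Filter kept shape:", X_filtered.shape)

# Wrapper: recursively eliminate features based on a model's coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("Wrapper kept mask:", rfe.support_)

# Embedded: Lasso's L1 penalty can drive some coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)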

Data Splitting: Training, Validation, and Test Sets

Training Set:

  • Purpose: The portion of data used to train the model. The model learns patterns and relationships from this dataset.
  • Typical Split: 60-80% of the entire dataset.

Validation Set:

  • Purpose: Used to tune model parameters and prevent overfitting by evaluating the model’s performance on unseen data during the training process.
  • Typical Split: 10-20% of the entire dataset.

Test Set:

  • Purpose: Used to evaluate the final model’s performance and generalization ability on completely unseen data.
  • Typical Split: 10-20% of the entire dataset.

Example: Data Cleaning and Splitting in Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Example data
data = {
    'Age': [25, 30, 45, None, 35, 50, 28, None],
    'Salary': [50000, 54000, 61000, 58000, None, 69000, 72000, 65000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# 1. Handle missing values (mean imputation on both numeric columns at once)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# 2. Encode categorical variables
df['Purchased'] = df['Purchased'].map({'No': 0, 'Yes': 1})

# 3. Scale features (in a real pipeline, fit the scaler on the training split only to avoid data leakage)
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# 4. Split data into training, validation, and test sets
X = df[['Age', 'Salary']]
y = df['Purchased']

# Split into train+val and test first
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split train+val into train and validation (0.25 of the remaining 80% = 20% of the total, a 60/20/20 split)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))

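Example: Vector and Matrix Operations with NumPy

The snippet below illustrates the basic vector and matrix operations that tensors generalize (see the Tensors subsection above).
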
import numpy as np

# 1. Create a vector
vector = np.array([1, 2, 3])
print("Vector:", vector)

# 2. Create a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Matrix:\n", matrix)

# 3. Perform vector addition
vector2 = np.array([4, 5, 6])
vector_sum = vector + vector2
print("Vector Addition:", vector_sum)

# 4. Perform scalar multiplication
scalar = 3
scalar_mult = scalar * vector
print("Scalar Multiplication:", scalar_mult)

# 5. Perform matrix multiplication
matrix2 = np.array([[1, 2, 1], [2, 1, 2], [1, 2, 1]])
matrix_mult = np.dot(matrix, matrix2)
print("Matrix Multiplication:\n", matrix_mult)

# 6. Compute dot product of two vectors
dot_product = np.dot(vector, vector2)
print("Dot Product of vectors:", dot_product)

# 7. Find the transpose of a matrix
transpose = np.transpose(matrix)
print("Transpose of Matrix:\n", transpose)
