Data Archive


    neo4j

    https://www.youtube.com/watch?v=IShRYPsmiR8

    Related terms:

    • neomodel
    • GraphRAG
    • Cypher
    • Graph Query Language
    • graph database

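    Neo4j is a graph database. Instead of storing data in tables, as a relational (SQL) database does, it stores data as nodes (entities) and relationships (connections between entities). Instead of reconstructing connections with JOINs at query time, Neo4j stores each relationship directly, so a query can follow it from one node to the next.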
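
    The difference between the two storage styles can be sketched in plain Python. This is only an illustration of the idea, not how Neo4j is implemented: the node ids, the `Person` label, the `FOLLOWS` relationship type and both helper functions are made up for the example.

```python
from collections import defaultdict

# Nodes: id -> (label, properties). All names and data here are illustrative.
nodes = {
    1: ("Person", {"name": "Alice"}),
    2: ("Person", {"name": "Bob"}),
    3: ("Person", {"name": "Carol"}),
}

# Relationships as rows: (start_id, type, end_id).
relationships = [
    (1, "FOLLOWS", 2),
    (2, "FOLLOWS", 3),
    (1, "FOLLOWS", 3),
]


def follows_by_scan(person_id):
    """Table-style lookup: check every relationship row (a JOIN in miniature)."""
    return [
        nodes[end][1]["name"]
        for start, rel, end in relationships
        if start == person_id and rel == "FOLLOWS"
    ]


# Graph-style storage: each node keeps its own outgoing relationships,
# so a lookup only touches the connections of the node it starts from.
outgoing = defaultdict(list)
for start, rel, end in relationships:
    outgoing[start].append((rel, end))


def follows_by_adjacency(person_id):
    """Graph-style lookup: read the relationships stored with the node."""
    return [
        nodes[end][1]["name"]
        for rel, end in outgoing[person_id]
        if rel == "FOLLOWS"
    ]


# Roughly the question this Cypher query asks:
#   MATCH (a:Person {name: 'Alice'})-[:FOLLOWS]->(b) RETURN b.name
print(follows_by_adjacency(1))  # ['Bob', 'Carol']
```

    Both functions return the same names; the scan has to look at every relationship row, while the adjacency version only reads the relationships attached to the starting node.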

    When to use:

    • The data has complex relationships (social networks, fraud detection, recommendations).
    • Queries need to traverse many relationships quickly.
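
    As a rough sketch of what traversing relationships means for a recommendation use case, the example below walks a small, made-up "follows" graph breadth-first and suggests people exactly two hops away. The graph data and the `within_hops` and `suggestions` helpers are hypothetical and run in plain Python, not against a Neo4j server.

```python
from collections import deque

# Hypothetical "who follows whom" graph stored as adjacency lists.
follows = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": ["Frank"],
    "Erin": [],
    "Frank": [],
}


def within_hops(graph, start, max_hops):
    """Return {person: hop count} for everyone reachable in at most max_hops steps."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(person, []):
            if neighbour not in seen:
                seen[neighbour] = seen[person] + 1
                queue.append(neighbour)
    del seen[start]  # the starting person is not a result
    return seen


def suggestions(graph, start):
    """People exactly two hops away: followed by someone the start person follows."""
    reach = within_hops(graph, start, 2)
    return sorted(person for person, hops in reach.items() if hops == 2)


print(suggestions(follows, "Alice"))  # ['Dave', 'Erin']
```

    Each extra hop only expands the neighbours of nodes already reached, which is the access pattern the use cases above depend on.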

    Backlinks

    • Cypher
    • database
        • pages
          • Data Archive
          • DE_Tools
          • ML_Tools
          • Quotes
          • Research Questions
          • Reviews
        • standardised
          • 1-on-1 Template
          • 1-to-1's with a Line Manager
          • AB testing
          • Accessing Gen AI generated content
          • Accuracy
          • ACID Transaction
          • Activation atlases
          • Activation Function
          • Active Learning
          • Ada boosting
          • Adam Optimizer
          • Adaptive Learning Rates
          • Adding a database to PostgreSQL
          • Addressing Multicollinearity
          • Addressing_Multicollinearity.py
          • Adjusted R squared
          • Agent Exploration
          • Agent-based modelling
          • Agentic Solutions
          • Aggregation
          • AI Agents Memory
          • AI Engineer
          • AI governance
          • Algorithms
          • Altair
          • altair versus seaborn
          • Alternatives to Batch Processing
          • Amazon S3
          • Anomaly Detection
          • Anomaly Detection in Time Series
          • Anomaly Detection with Clustering
          • Anomaly Detection with Statistical Methods
          • ANOVA
          • Apache Airflow
          • Apache Iceberg
          • Apache Kafka
          • Apache Spark
          • API
          • API Driven Microservices
          • ARIMA
          • Asking questions
          • Assumption of Normality
          • Attack mitigation
          • Attack types
          • Attention Is All You Need
          • Attention mechanism
          • AUC
          • Automated Feature Creation
          • AWS Lambda
          • Azure
          • Backpropagation
          • Bag of words
          • Bag_of_Words.py
          • Bagging
          • Bandit example output
          • Bandit_Example_Fixed.py
          • Bash
          • bat
          • Batch Normalisation
          • Batch Processing
          • Bellman Equations
          • Benefits of Data Transformation
          • Bernoulli
          • BERT
          • BERT Pretraining of Deep Bidirectional Transformers for Language Understanding
          • BERTScore
          • Bias and variance
          • Big Data
          • big o notation
          • BigQuery
          • binary classification
          • Binder
          • Boosting
          • Bootstrap Sampling
          • Boxplot
          • business intelligence
          • Business observability
          • Business value of anomaly detection
          • Casual Inference
          • CatBoost
          • Central Limit Theorem
          • Central Limit Theorem & Small Sample Sizes
          • Chain of thought
          • Change Management
          • ChatGPT
          • Checksum
          • Chi-Squared Test
          • Choosing a Threshold
          • Choosing the Number of Clusters
          • CI-CD
          • Class Separability
          • Classification
          • Classification Report
          • Claude
          • Click_Implementation.py
          • Cloud Providers
          • Cluster Density
          • Cluster Seperation
          • Clustering
          • Clustering_Dashboard.py
          • Clustermap
          • Code Diagrams
          • Columnar Storage
          • Command line
          • Command Prompt
          • Common Table Expression
          • Communication principles
          • Communication Techniques
          • Comparing LLMs
          • Comparing_Ensembles.py
          • Components of the database
          • Computer Science
          • conceptual data model
          • Conceptual Model
          • Concurrency
          • Confidence Interval
          • Confusion Matrix
          • Continuous Delivery - Deployment
          • Continuous Integration
          • Convolutional Neural Networks
          • Correlation
          • Correlation vs Causation
          • Cosine Similarity
          • Cost Function
          • Cost-Sensitive Analysis
          • Covariance
          • Covariance Structures
          • Covariance vs Correlation
          • Covering Index
          • Cron jobs
          • Cross Entropy
          • Cross validation
          • Cross_Entropy_Single.py
          • Cross_Entropy.py
          • Crosstab
          • CRUD
          • Cryptography
          • csv module
          • CUDA
          • Curse of dimensionality
          • Cypher
          • dagster
          • Dash
          • Dashboarding
          • Data AI Education at Work
          • Data Analysis
          • Data Analysis Portal
          • Data Analyst
          • Data Architect
          • Data Assessment
          • Data Cleansing
          • Data Collection
          • Data Contract
          • Data Distribution
          • Data Drift
          • Data Engineer
          • Data Engineering
          • Data Engineering Portal
          • Data Engineering Tools
          • data governance
          • data hierarchy of needs
          • Data Ingestion
          • data integration
          • Data Integrity
          • Data Lake
          • Data Lakehouse
          • Data Leakage
          • Data Lifecycle Management
          • data lineage
          • data literacy
          • Data Management
          • Data Mining - CRISP
          • Data Modelling
          • Data Observability
          • Data Orchestration
          • Data Pipeline
          • Data Pipeline to Data Products
          • Data Principles
          • data product
          • data quality
          • Data Reduction
          • Data Roles
          • Data Science
          • Data Scientist
          • Data Security
          • Data Selection
          • Data Selection in ML
          • Data Steward
          • Data storage
          • Data Streaming
          • Data Transformation
          • Data transformation in Data Engineering
          • Data transformation in Machine Learning
          • Data Transformation with Pandas
          • Data Validation
          • data virtualization
          • Data Visualisation
          • Data Warehouse
          • Database
          • Database Index
          • Database Management System (DBMS)
          • Database schema
          • Database Storage
          • Database Techniques
          • Databricks
          • Databricks vs Snowflake
          • Datasets
          • DBScan
          • dbt
          • Debugging
          • Debugging ipynb
          • Debugging.py
          • Decision Tree
          • Declarative Data Pipeline
          • Deep Learning
          • Deep Learning Frameworks
          • Deep Q-Learning
          • Demand forecasting
          • Dendrograms
          • dependency manager
          • Design Thinking Questions
          • Determining Threshold Values
          • DevOps
          • Differentiation
          • Digital Transformation
          • Digital twin
          • Dimension Table
          • Dimensional Modelling
          • Dimensionality Reduction
          • dimensions
          • Directed Acyclic Graph (DAG)
          • Distillation
          • Distributed Computing
          • Distribution_Analysis.py
          • Distributions
          • Docker
          • Docker Image
          • documentation
          • Documentation & Meetings
          • Dropout
          • DS & ML Portal
          • duckdb
          • DuckDB in python
          • DuckDB vs SQLite
          • Dummy variable trap
          • EDA
          • Edge ML
          • Education and Training
          • Elastic Net
          • ElasticSearch
          • ELT
          • Embedded Methods
          • embeddings for OOV words
          • emergent behavior
          • Encoding Categorical Variables
          • Energy
          • Energy ABM
          • Energy Storage
          • Environment Variables
          • Epoch
          • Epub
          • ER Diagrams
          • Estimator
          • ETL
          • ETL Pipeline example
          • etl vs elt
          • etlt
          • Evaluate Embedding Methods
          • Evaluating Language Models
          • Evaluating the effectiveness of prompts
          • Evaluation Metrics
          • Event Driven
          • Event Driven Events
          • Event Driven Microservices
          • Event-Driven Architecture
          • Everything
          • Excel
          • Excel pivot table
          • Excel vs Google Sheets
          • Experiment Plan Template
          • Exploration vs Exploitation
          • f-regression
          • F-statistic
          • F1 Score
          • Fabric
          • fact table
          • Factor Analysis
          • Factor_Analysis.py
          • facts
          • FAISS
          • FastAPI
          • FastAPI_Example.py
          • Feature Engineering
          • Feature Evaluation
          • Feature Extraction
          • Feature Importance
          • Feature Scaling
          • Feature Selection
          • Feature Selection vs Feature Importance
          • Feature_Distribution.py
          • Feed Forward Neural Network
          • Feedback Template
          • File Management
          • Filter method
          • filter methods
          • Firebase
          • Fishbone diagram
          • Fitting weights and biases of a neural network
          • Flask
          • Folder Tree Diagram
          • Forecasting_AutoArima.py
          • Forecasting_Baseline.py
          • Forecasting_Exponential_Smoothing.py
          • Foreign Key
          • Forward Propagation
          • frontend
          • functional programming
          • Fuzzywuzzy
          • garbage collector
          • Gartner Hype Cycle
          • Gaussian Distribution
          • Gaussian Mixture Models
          • Gaussian Model
          • gaussian_mixture_model_implementation.py
          • General Linear Regression
          • Generative Adversarial Networks
          • Generative AI
          • Generative AI From Theory to Practice
          • Generators in Python
          • Gini Impurity
          • Gini Impurity vs Cross Entropy
          • GIS
          • Git
          • Gitlab
          • gitlab-ci.yml
          • Global Interpreter Lock
          • Google Cloud Platform
          • Google Colab
          • Google My Maps Data Extraction
          • Google OR Tools
          • Google Sheet Pivots Table
          • Google Sheets
          • GPT
          • Gradient Boosting
          • Gradient Boosting Regressor
          • Gradient Descent
          • Gradient descent in linear regression
          • Gradio
          • Grain
          • Grammar method
          • granularity
          • Graph Neural Network
          • Graph Query Language
          • Graph Theory
          • Graph Theory Community
          • GraphRAG
          • Grep
          • GridSearchCV
          • Groupby
          • Groupby vs Crosstab
          • Grouped plots
          • GRU
          • Guardrails
          • Hadoop
          • Handling Different Distributions
          • Handling Missing Data
          • Handling_Missing_Data_Basic.ipynb
          • Handling_Missing_Data.ipynb
          • Hash
          • Heap Data Structure
          • Heap Memory
          • Heatmap
          • Heatmaps_Dendrograms.py
          • heterogeneous features
          • Hierarchical Clustering
          • High cross validation accuracy is not directly proportional to performance on unseen test data
          • Honkit
          • Hosting
          • How businesses use Gen AI
          • How do we evaluate LLM Outputs
          • how do you do the data selection
          • How is reinforcement learning being combined with deep learning
          • How is schema evolution done in practice with SQL
          • How LLMs store facts
          • How to do git commit messages properly
          • How to normalise a merged table
          • How to reduce the need for Gen AI responses
          • How to search within a graph
          • How to use Sklearn Pipeline
          • How would you decide between using TF-IDF and Word2Vec for text vectorization
          • html
          • Hugging Face
          • Hyperparameter
          • Hyperparameter Tuning
          • Hypothesis testing
          • Imbalanced Datasets
          • Imbalanced_Datasets_SMOTE.py
          • Immutable vs mutable
          • Impact of multicollinearity on model parameters
          • imperative
          • Implementing Database Schema
          • Imputation Techniques
          • In NER how would you handle ambiguous entities
          • in-memory format
          • incremental synchronization
          • Indexing in cypher
          • Industries of interest
          • Inertia K Means Cost Function
          • inference
          • inference versus prediction
          • information theory
          • initialization methods
          • Input is Not Properly Sanitized
          • Interoperability
          • interoperable
          • interpretability
          • Interpreting logistic regression model parameters
          • Interquartile Range (IQR) Detection
          • ipynb
          • Isolation Forest
          • Java
          • Java vs JavaScript
          • JavaScript
          • jinja template
          • Jobs to be done
          • Johnson–Lindenstrauss lemma
          • Joining Datasets
          • Json
          • Json to SQLite
          • Junction Tables
          • Jupyter Book
          • jupytext
          • Justfile
          • K_Means.py
          • K-means
          • K-nearest neighbours
          • Keras
          • Kernel Density Estimation
          • Kernelling
          • Key Components of Attention and Formula
          • Kmeans vs GMM
          • KNIME
          • Knowledge Graph
          • Knowledge graph vs RAG setup
          • Knowledge Work
          • kubernetes
          • Label encoding
          • Label encoding vs One-hot encoding
          • Labelling data
          • lambda architecture
          • Langchain
          • Language Model Output Optimisation
          • Language Models
          • Language Models Large (LLMs) vs Small (SLMs)
          • Lasso
          • Latency
          • Latent Dirichlet Allocation
          • LBFGS
          • Learning Curve
          • learning rate
          • Learning Styles
          • lemmatization
          • LightGBM
          • LightGBM vs XGBoost vs CatBoost
          • Linear Discriminant Analysis
          • Linear Regression
          • Linked List
          • LLM
          • LLM Evaluation Metrics
          • LLM Memory
          • Load Balancing
          • Local Interpretable Model-agnostic Explanations
          • Local Outlier Factor (LOF)
          • Log transformation
          • Logical Model
          • Logistic Regression
          • Logistic Regression does not predict probabilities
          • Logistic regression in sklearn & Gradient Descent
          • Logistic Regression Statsmodel Summary table
          • Looker Studio
          • loss function
          • Loss versus Cost function
          • LSTM
          • Machine Learning
          • Machine Learning Algorithms
          • Machine Learning Operations
          • maintainability
          • Maintainable Code
          • Makefile
          • Manifold learning
          • Many-to-Many Relationships
          • map reduce
          • Markov chain
          • Markov Decision Processes
          • master data management
          • Master Observability Datadog
          • Mathematical Reasoning in Transformers
          • Mathematics
          • Maximum Likelihood Estimation
          • mean absolute error
          • Mean Squared Error
          • mean vs median
          • melt
          • Memory
          • Memory Caching
          • Merge
          • Mermaid
          • Metadata Handling
          • Methods for Handling Outliers
          • metric
          • Microsoft
          • Microsoft Access
          • Mini-batch gradient descent
          • Mixture of Experts
          • ML Engineer
          • MNIST
          • Model Building
          • Model Cascading
          • Model Deployment
          • Model Ensemble
          • Model Evaluation
          • Model Evaluation vs Model Optimisation
          • Model Interpretability
          • Model Observability
          • Model Optimisation
          • Model Parameters
          • Model Parameters Tuning
          • Model parameters vs hyperparameters
          • Model Selection
          • Model Validation
          • model-agnostic feature importance
          • Momentum
          • Momentum.py
          • MongoDB
          • Monolith Architecture
          • Monte Carlo Simulation
          • Multi-Agent Reinforcement Learning
          • Multi-head attention
          • Multi-level index
          • Multicollinearity
          • Multinomial Naive bayes
          • Multiprocessing
          • Multiprocessing vs Multithreading
          • Multithreading
          • MySQL
          • Naive Bayes
          • Named Entity Recognition
          • nbconvert
          • nbconvert slideshows
          • neo4j
          • neomodel
          • NET
          • Network Design
          • Neural network
          • Neural Network Classification
          • Neural network in Practice
          • Neural Scaling Laws
          • Ngrams
          • NLP
          • nltk
          • Node.JS
          • non-parametric
          • Non-parametric tests
          • Normalisation
          • Normalisation of data
          • Normalisation of Text
          • Normalisation vs Standardisation
          • Normalised Schema
          • NoSQL
          • NotebookLM
          • npy Files A NumPy Array storage
          • Numpy
          • Object Relational Mapper
          • Odds
          • Odds vs Probability
          • OLAP (online analytical processing)
          • OLTP
          • One Pager Template
          • One_hot_encoding.py
          • One-hot encoding
          • OOV words
          • Operational Resilience for Growth and Adaptability
          • Optimisation function
          • Optimisation techniques
          • Optimising a Logistic Regression Model
          • Optimising Neural Networks
          • Optuna
          • Ordinary Least Squares
          • Orthogonalization
          • Outliers
          • Over parameterised models
          • Overfitting
          • p values
          • Page Rank
          • Pandas
          • Pandas Dataframe Agent
          • Pandas join vs merge
          • Pandas Pivot Table
          • Pandas Stack
          • Pandas_Common.py
          • Pandas_Stack.py
          • Pandoc
          • Parametric tests
          • parametric vs non-parametric models
          • parametric vs non-parametric tests
          • Parquet
          • parsimonious
          • Part of speech tagging
          • PCA Explained Variance Ratio
          • PCA Principal Components
          • PCA_Analysis.ipynb
          • PCA_Based_Anomaly_Detection.py
          • PCA-Based Anomaly Detection
          • pd.Grouper
          • pdoc
          • PDP and ICE
          • Percentile Detection
          • Performance Dimensions
          • Performance Drift
          • Physical Model
          • Pickle
          • Plotly
          • pmdarima
          • Poetry
          • Policy
          • Polynomial Regression
          • Positional Encoding
          • PostgreSQL
          • Postman
          • PowerBI
          • Powerquery
          • PowerShell
          • Powershell scripts
          • Powershell versus Command Prompt
          • Powershell vs Bash
          • Precision
          • Precision or Recall
          • Precision-Recall Curve
          • Prediction Intervals
          • Preprocessing
          • Prevention Is Better Than the Cure
          • Primary Key
          • Principal Component Analysis
          • Probability
          • Problem Definition
          • Process Based Parallelism
          • Processes vs Threads
          • programming languages
          • Project Management Portal
          • Prompt engineering
          • prompt retrievers
          • Prompts
          • Proportion Test
          • Publish and Subscribe
          • Pull Request Template
          • push-down
          • PyCaret
          • Pycaret_Anomaly.ipynb
          • Pycaret_Example.py
          • Pydantic
          • Pydantic_More.py
          • Pydantic.py
          • PyGraphviz
          • PyOD
          • Pyright
          • Pyright vs Pydantic
          • PySpark
          • Pytest
          • Python
          • Python Click
          • PyTorch
          • Pytorch vs Tensorflow
          • Q-Learning
          • Q-Q Plot
          • Quartz
          • Query Optimisation
          • Querying
          • QuickSort
          • R
          • R squared
          • R-squared metric not always a good indicator of model performance in regression
          • Race Conditions
          • RAG
          • Random Access Memory
          • Random Forest Regression
          • Random Forests
          • React
          • Reasoning tokens
          • Recall
          • Recommender systems
          • Recurrent Neural Networks
          • Recursive Algorithm
          • Registering a Scheduled Task
          • Regression
          • Regression metrics
          • Regression_Logistic_Metrics.ipynb
          • Regularisation
          • Regularisation of Tree based models
          • Regularisation.py
          • Reinforcement learning
          • Relating Tables Together
          • Relational Database
          • Relationships in memory
          • Relu
          • REST API
          • Reveal.js
          • reverse etl
          • Reward Function
          • Ridge
          • ROC (Receiver Operating Characteristic)
          • ROC_Curve.py
          • rollup
          • Root Mean Squared Error
          • Row-based Storage
          • Sampling
          • Sarsa
          • Scala
          • Scalability
          • Scaling Agentic Systems
          • Scaling Data Science Capability
          • Scaling Server
          • Scatter Plots
          • schema evolution
          • Scientific Method
          • Scikit-Learn
          • Scipy
          • Seaborn
          • search
          • Security mitigation
          • Security Researcher
          • Security Vulnerabilities
          • Self Attention
          • Self attention vs multi-head attention
          • Self-Attention
          • semantic layer
          • Semantic Relationships
          • Semantic search
          • semi-structured data
          • Sentence Similarity
          • Sentence Transformer Workflow
          • Sentence Transformers
          • shapefile
          • SHapley Additive exPlanations
          • Sharepoint
          • Silhouette Analysis
          • Similarity Search
          • Single source of truth
          • sklearn datasets
          • Sklearn Pipeline
          • Slowly Changing Dimension
          • Small Language Models
          • Smart Grids
          • SMOTE (Synthetic Minority Over-sampling Technique)
          • SSMS
          • Snowflake
          • Snowflake Schema
          • Snowflake vs Hadoop
          • Soft Deletion
          • Software Design Patterns
          • Software Development Life Cycle
          • Software Development Portal
          • spaCy
          • SparseCategoricalCrossentropy or CategoricalCrossentropy
          • Spearman vs Pearson Correlation
          • Specificity
          • Spreadsheets vs Databases
          • SQL
          • SQL Groupby
          • SQL Injection
          • SQL Joins
          • SQL vs NoSQL
          • SQL Window functions
          • SQLAlchemy
          • SQLAlchemy vs. sqlite3
          • SQLite
          • SQLite Studio
          • stack memory
          • Stacking
          • Standard deviation
          • Standardisation
          • Star Schema
          • Statistical Assumptions
          • Statistical Tests
          • Statistical theorems
          • Statistics
          • Stemming
          • Stochastic Gradient Descent
          • storage layer object store
          • Stored Procedures
          • Streamlit
          • Strongly vs Weakly typed language
          • structured data
          • Structuring and organizing data
          • Summarisation
          • Supervised Learning
          • Support Vector Classifier
          • Support Vector Machines
          • Support Vector Regression
          • SVM_Example.py
          • Symbolic computation
          • Sympy
          • syntactic relationships
          • t-SNE
          • T-test
          • Tableau
          • Technical Debt
          • Technical Design Doc Template
          • Telecommunications
          • Tensorflow
          • Terminal commands
          • Test Loss When Evaluating Models
          • Testing
          • Testing_Pytest.py
          • Testing_unittest.py
          • Text2Cypher
          • TF-IDF
          • TF-IDF Implementation
          • Thinking Systems
          • Time Series
          • Time Series Forecasting
          • Time Series Identify Trends and Patterns
          • Tokenisation
          • TOML
          • tool.bandit
          • tool.ruff
          • tool.uv
          • topic modeling
          • Train-Dev-Test Sets
          • Transaction
          • Transfer Learning
          • transfer_learning.py
          • Transformed Target Regressor
          • Transformer
          • Transformers vs RNNs
          • TS_Anomaly_Detection.py
          • Turning a flat file into a database
          • Type I Error (False Positive)
          • Type II Error (False Negative)
          • Types of Computational Bugs
          • Types of Database Schema
          • Types of Neural Networks
          • TypeScript
          • Typical Output Formats in Neural Networks
          • Ubuntu
          • UMAP
          • UML
          • unittest
          • univariate vs multivariate
          • Unix
          • unstructured data
          • Unsupervised learning
          • Use Cases for a Simple Neural Network Like
          • Use of RNNs in energy sector
          • Vacuum
          • vanishing and exploding gradients problem
          • Variability in linear models
          • variance
          • Vector Database
          • Vector Embedding
          • Vector_Embedding.py
          • Vectorisation
          • Vectorized Engine
          • Vercel
          • View Use Case
          • Views
          • Violin plot
          • Virtual environments
          • WCSS and elbow method
          • Weak Learners
          • Web Feature Server (WFS)
          • Web Map Tile Service (WMTS)
          • When and why not to use regularisation
          • Why does increasing the number of models in an ensemble not necessarily improve the accuracy
          • Why does the Adam Optimizer converge
          • Why is named entity recognition (NER) a challenging task
          • Why JSON is Better than Pickle for Untrusted Data
          • Why Removing Outliers May Improve Regression but Harm Classification
          • Why standardise features
          • Why Type 1 and Type 2 matter
          • Why use ER diagrams
          • Wikipedia_API.py
          • Windows Scheduled Tasks
          • Windows Subsystem for Linux
          • Word2vec
          • Word2Vec.py
          • WordNet
          • Wrapper Methods
          • Xavier
          • XGBoost
          • yaml
          • Z-Normalisation
          • Z-Score
          • Z-Scores vs Prediction Intervals
          • Z-Test

      Created with Quartz v4.3.1 © 2025