Advanced SQL: Optimization, Transactions, Big Data, and Security
Working with Queries
In this section, we’ll explore some of SQL’s more advanced features for query building. We’ll focus on enhanced SELECT and JOIN constructs, different types of subqueries, and how to use window functions. We’ll also discuss how to structure queries so they’re both easy to read and straightforward to optimize.
Complex SELECT Statements
1.1. Extended Capabilities of SELECT
In its basic form, the SELECT statement pulls columns from one or more tables. However, it can also perform aggregations, filtering, and grouping.
SELECT department_id,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 5;
- GROUP BY department_id: Groups the rows in the table by department_id.
- COUNT(*) AS employee_count: Counts how many rows appear in each group.
- AVG(salary) AS avg_salary: Calculates the average salary for each group.
- HAVING COUNT(*) > 5: Returns only those groups with more than five employees.
1.2. Using WITH (CTE)
Common Table Expressions (CTEs), created with WITH, help break down complex queries into logical parts. This makes code easier to read and maintain.
WITH high_salary AS (
SELECT employee_id, salary
FROM employees
WHERE salary > 5000
)
SELECT h.employee_id, h.salary
FROM high_salary h
JOIN employees e ON e.employee_id = h.employee_id
WHERE e.department_id = 10;
- WITH high_salary AS (...): Creates a temporary result set (CTE) named high_salary containing employees with salaries above 5000.
- Main query: Uses high_salary to join with the employees table and filters results by department_id = 10.
Advanced JOIN Operations
2.1. Types of JOIN
- INNER JOIN: Returns only those rows with matching keys in both tables.
- LEFT JOIN: Returns all rows from the left table, plus any matching rows from the right table. Rows with no match in the right table will contain NULL values.
- RIGHT JOIN: Similar to LEFT JOIN but applies to the right table.
- FULL OUTER JOIN: Returns all rows from both tables, substituting NULL if there’s no match.
SELECT e.employee_id, e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.department_id = d.id;
- A LEFT JOIN returns all rows from the employees table (the “left” table).
- If a row in employees has no matching row in departments, the columns from departments will be NULL.
- This is helpful for finding employees not assigned to any department, as shown in the sketch below.
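Filtering on the NULL side turns the LEFT JOIN into an “anti-join” that keeps only the unmatched rows; this sketch reuses the employees and departments tables from above:
SELECT e.employee_id, e.name
FROM employees e
LEFT JOIN departments d
    ON e.department_id = d.id
WHERE d.id IS NULL;  -- keep only employees with no matching department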
2.2. FULL OUTER JOIN
A FULL OUTER JOIN combines the results of both LEFT JOIN and RIGHT JOIN.
For example, if the projects table stores projects and the employees table stores employees:
SELECT e.employee_id, e.name, p.project_name
FROM employees e
FULL OUTER JOIN projects p
ON e.project_id = p.id;
- Returns all employees, even if they’re not associated with a project.
- Returns all projects, even if they don’t have any employees.
2.3. SELF JOIN
A self-join is used when you need to match rows in a table with other rows in the same table—often useful in hierarchical structures, like employees and their managers.
SELECT e.name AS employee_name,
m.name AS manager_name
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id;
- e and m are aliases for the same table, employees.
- Each row links to itself based on the “employee ↔ manager” relationship through e.manager_id = m.employee_id.
- The result shows each employee’s name (e.name) along with their manager’s name (m.name).
Subqueries
3.1. Basic Subqueries
Subqueries allow you to nest SELECT statements for filtering, computing values, or comparing results.
SELECT employee_id, name, salary
FROM employees
WHERE salary > (
SELECT AVG(salary)
FROM employees
);
- The subquery returns the average salary of all employees.
- The outer query returns those whose salary is higher than that average.
3.2. Correlated Subqueries
In a correlated subquery, the nested SELECT depends on data from the outer query.
SELECT e.employee_id, e.name, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);
- For each row in employees e, the nested SELECT filters by that specific employee’s department_id.
Window Functions
Window (analytical) functions let you calculate aggregates over partitions (groups of rows) without having to group the entire dataset. They also allow custom partitioning (PARTITION BY) and ordering (ORDER BY).
4.1. Example with RANK
The RANK() function determines the position of each row within its group (or “window”), according to a specified order.
SELECT employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees;
- RANK(): Computes the rank of each row in the window defined by PARTITION BY department_id.
- ORDER BY salary DESC: Sorts each group in descending order by salary.
- Ties in salary produce the same rank, and the next rank is skipped accordingly (e.g., 1, 2, 2, 4); the sketch below contrasts this with related functions.
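If the gap left after a tie is not what you want, standard SQL also provides DENSE_RANK() and ROW_NUMBER(); a quick side-by-side on the same employees table shows the difference:
SELECT employee_id,
       salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk,  -- 1, 2, 2, 3
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num     -- 1, 2, 3, 4
FROM employees;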
4.2. LAG/LEAD
LAG and LEAD let you “peek” at preceding or following rows without an additional JOIN.
SELECT employee_id,
salary,
LAG(salary, 1, 0) OVER (ORDER BY hire_date) AS prev_salary
FROM employees;
- LAG(salary, 1, 0): Fetches the salary from one row earlier; 0 is a default value if there is no previous row.
- OVER (ORDER BY hire_date): Defines the row sequence by hire date.
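LEAD mirrors LAG in the opposite direction, looking at the following row; a minimal sketch on the same table:
SELECT employee_id,
       salary,
       LEAD(salary, 1, 0) OVER (ORDER BY hire_date) AS next_salary  -- 0 if there is no following row
FROM employees;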
Structuring Large Queries
5.1. Formatting Guidelines
- Indentation: Separate SELECT, FROM, JOIN, and WHERE clauses with clear indentation.
- Readable Aliases: Use descriptive aliases (e.g., e for employees, d for departments).
- Comments: Document the logic in key parts to make the query understandable to others; a short example applying all three guidelines follows this list.
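Here is one way the earlier employees/departments query might look with these guidelines applied (the logic is unchanged, only the presentation):
-- Employees in department 10, together with their department name
SELECT e.employee_id,
       e.name,
       d.department_name
FROM employees e
JOIN departments d
    ON e.department_id = d.id   -- join on the department foreign key
WHERE e.department_id = 10;     -- restrict to a single department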
5.2. Splitting Queries into Multiple Parts
- Use WITH (CTE) when your query becomes too large.
- Follow the “one calculation step, one CTE” principle for clarity.
WITH dept_salary AS (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
),
high_paid AS (
SELECT e.employee_id, e.name, e.department_id
FROM employees e
JOIN dept_salary d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary
)
SELECT *
FROM high_paid
ORDER BY department_id;
- First, you calculate each department’s average salary (dept_salary).
- Then, you use that data to get all employees whose salary exceeds that average (high_paid).
Optimization and Analysis
In this section, we’ll explore various tools and techniques you can use to speed up query execution and gain a deeper understanding of how your database processes operations. We’ll learn how to use the EXPLAIN command, examine the main types of indexes, and discover popular query optimization patterns.
Examining Execution Plans with EXPLAIN
1.1. Purpose of EXPLAIN
The EXPLAIN statement (or its extensions, such as EXPLAIN ANALYZE in some DBMSs) shows how the database plans to execute a query. This helps you:
- Identify which indexes are being used.
- Determine the intended order in which tables are joined.
- Understand the estimated cost of each step.
1.2. Example Usage (PostgreSQL)
EXPLAIN
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 6000;
You might see an output like this:
Nested Loop (cost=0.43..8.72 rows=1 width=64)
-> Index Scan using employees_salary_idx on employees e (cost=0.43..4.54 rows=1 width=32)
Filter: (salary > 6000)
-> Index Scan using departments_pkey on departments d (cost=0.29..4.17 rows=1 width=32)
Index Cond: (id = e.department_id)
- Index Scan indicates that the optimizer used an index to filter by salary (employees_salary_idx) and to join on department_id.
- Nested Loop shows that the query plan uses a nested-loop join to iterate over matching keys.
If you use EXPLAIN ANALYZE, the database actually runs the query and returns real execution details (e.g., timing, number of rows), allowing you to compare the plan with actual performance metrics.
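In PostgreSQL, the options go in parentheses; for example, this variant of the query above also reports buffer usage (the option names are PostgreSQL-specific):
EXPLAIN (ANALYZE, BUFFERS)
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 6000;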
Indexes: Types, Creation, and Optimization
2.1. Primary Index Types
- B-Tree Index: The most common type; ideal for exact lookups, range queries, and sorting.
- Hash Index: Specializes in exact comparisons (=) and quickly finds exact matches but is less efficient for other operations.
- GIN/GiST (PostgreSQL): Used for full-text search, geospatial data, and complex data types. (Creation examples for these types follow this list.)
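For reference, here is how the non-default index types are requested in PostgreSQL; the email and bio columns are hypothetical, and the syntax differs in other DBMSs:
-- Hash index: useful only for equality lookups
CREATE INDEX idx_email_hash ON employees USING HASH (email);
-- GIN index over a full-text vector built from a text column
CREATE INDEX idx_bio_fts ON employees USING GIN (to_tsvector('english', bio));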
2.2. Creating Indexes
In SQL, you typically create an index like this:
CREATE INDEX idx_salary ON employees(salary);
Indexing the salary column speeds up queries such as WHERE salary > ... or WHERE salary = .... However, be mindful that having too many indexes can slow down insert and update operations because each index must be kept in sync.
2.3. Composite Indexes
You can create multi-column (composite) indexes if you frequently filter by more than one column at a time:
CREATE INDEX idx_dept_salary
ON employees (department_id, salary);
In most cases, the order of the columns in a composite index matters. For example, this index may be very effective for:
WHERE department_id = 10 AND salary > 6000
But it may not be as effective if you only filter by salary.
Query Optimization Patterns
3.1. Reducing Data Volume
- Select only the columns you need. Instead of SELECT *, specify the exact fields.
- Use aggregation. If you only need summary data, use GROUP BY to reduce the number of rows returned.
3.2. Proper Table Partitioning
If your table contains very large amounts of data, it’s worth storing it in partitions based on criteria like date or region. This speeds up queries because the database can focus on just the relevant partitions.
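As a sketch of what this can look like with PostgreSQL’s declarative partitioning (the sales table and its columns are hypothetical here):
CREATE TABLE sales (
    sale_id   BIGINT,
    region    TEXT,
    sale_date DATE NOT NULL,
    amount    NUMERIC(10, 2)
) PARTITION BY RANGE (sale_date);
-- One partition per year; queries that filter on sale_date scan only the matching partition
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');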
3.3. Query Rewrites
Sometimes restructuring a query can yield a bigger performance boost than simply adding indexes. For instance, replacing a complex correlated subquery with a JOIN plus grouping can lower the load on the server.
Instead of:
SELECT e.employee_id, e.name, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
);
It may be more efficient to write:
WITH avg_dept_salary AS (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.employee_id, e.name, e.salary
FROM employees e
JOIN avg_dept_salary a ON e.department_id = a.department_id
WHERE e.salary > a.avg_salary;
This approach can tap into indexes more effectively and simplify the execution plan.
3.4. Using the Right Data Types
Make sure the column data types match the values you pass (e.g., don’t store dates in string fields, and avoid comparing numeric data to text). Mismatched data types can cause hidden conversions that make index usage less efficient.
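A typical trap is wrapping the column itself in a conversion, which prevents index use; a small sketch on the employees table (assuming hire_date is an indexed DATE column):
-- Problematic: casting the column defeats an index on hire_date
SELECT * FROM employees
WHERE CAST(hire_date AS TEXT) = '2024-01-01';
-- Index-friendly: compare the column against a value of its own type
SELECT * FROM employees
WHERE hire_date = DATE '2024-01-01';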
Additional Tools and Recommendations
- Statistics: Regularly update statistics (e.g., ANALYZE in PostgreSQL) to help the optimizer choose the best plan.
- Profiling: Many DBMSs collect detailed query execution info. In PostgreSQL, for instance, you can view BUFFERS and TIMING output to diagnose bottlenecks.
- Monitoring: System-level metrics (CPU, memory, disk) matter, too. Even a highly optimized query can run slowly if the server is short on RAM or constantly swapping.
By combining careful indexing, thoughtful partitioning, and targeted rewrites, you can significantly improve database performance and ensure your queries run as smoothly as possible.
Transactions and Concurrent Access
In this chapter, we’ll look at the mechanisms that allow multiple users to work with a database at the same time without compromising data integrity. We’ll explore ACID principles, transaction isolation levels, and how the DBMS manages locks. We’ll also learn strategies for avoiding deadlocks.
ACID Principles
A transaction is a logical set of operations in a database that must either be fully completed (committed) or fully rolled back. ACID is a set of properties that ensures transactions are both correct and reliable.
- Atomicity: All operations within a transaction are treated as a single unit; either they all succeed (COMMIT) or they are all undone (ROLLBACK), reverting the system to its original state.
- Consistency: Every committed transaction maintains the database in a consistent state, satisfying all business rules, integrity constraints, and triggers.
- Isolation: One transaction’s changes should not be visible to or affect another transaction until the first transaction is committed. Several isolation levels define precisely how this invisibility is enforced.
- Durability: After a transaction is committed, its results must be preserved and not lost, even in the event of a system failure. This is achieved by writing the transaction log to disk.
Transaction Isolation Levels
Different database systems offer several isolation levels, each designed to address specific needs and performance considerations.
2.1. READ UNCOMMITTED
- Overview: Transactions can see “dirty” data—changes that another transaction has made but not yet committed.
- Potential Issue: Dirty reads.
2.2. READ COMMITTED
- Overview: A transaction only sees data that has been committed. If another transaction updates data and commits, you may see the new value on a subsequent read.
- Potential Issue: Non-repeatable reads, meaning data might change between reads within the same transaction.
2.3. REPEATABLE READ
- Overview: All reads within a single transaction are “frozen” at the moment it starts—subsequent reads of the same data return the same result.
- Potential Issue: Some DBMSs (e.g., older versions of MySQL InnoDB) may still allow phantom reads, where a new row satisfying the query appears that wasn’t there before.
2.4. SERIALIZABLE
- Overview: Maximum isolation—transactions behave as if they run one at a time in a strict sequence.
- Potential Issue: High locking overhead and frequent conflicts in parallel operations, potentially reducing overall performance.
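The isolation level is usually chosen per transaction. In PostgreSQL, for example, it looks like this (the SET TRANSACTION must come before the transaction’s first query; syntax varies slightly across DBMSs):
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT COUNT(*) FROM employees;  -- later reads in this transaction see the same snapshot
COMMIT;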
Lock Management
Locks are one of the core mechanisms that provide transaction isolation.
3.1. Types of Locks
- Row-Level Lock: Locks a specific row. This is the most common approach in OLTP systems.
- Page-Level Lock: Locks a page of data (often 8 KB or more, depending on the DBMS).
- Table-Level Lock: May be used for large-scale operations (e.g., ALTER TABLE).
3.2. Lock Modes
- Shared Lock (S): Multiple transactions can read the same row simultaneously, but not modify it.
- Exclusive Lock (X): A transaction fully locks the row for both read and write access, preventing other transactions from using it.
Some systems also provide additional lock modes (e.g., Intent locks in Microsoft SQL Server or Row Share/Row Exclusive in Oracle) for more granular concurrency control.
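Row locks are normally taken implicitly by UPDATE and DELETE, but you can also request them explicitly; a minimal PostgreSQL-flavored sketch on the employees table:
BEGIN;
-- Exclusive row lock: other transactions can still read the row, but not modify or lock it
SELECT * FROM employees WHERE employee_id = 101 FOR UPDATE;
-- Shared row lock: others may also lock FOR SHARE, but updates must wait
-- SELECT * FROM employees WHERE employee_id = 101 FOR SHARE;
COMMIT;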
Practical Transaction Management
4.1. Essential Statements
- BEGIN (or START TRANSACTION): Starts a transaction.
- COMMIT: Commits the changes.
- ROLLBACK: Rolls back all changes to the state prior to the transaction’s start.
Example (PostgreSQL):
BEGIN;

DO $$
BEGIN
    -- Decrease the balance in account 1
    UPDATE accounts
    SET balance = balance - 100
    WHERE account_id = 1;

    -- Plain SQL has no IF/THEN, so in PostgreSQL the checks run in a PL/pgSQL DO block;
    -- RAISE EXCEPTION aborts the surrounding transaction, undoing the updates
    IF EXISTS (SELECT 1 FROM accounts WHERE account_id = 1 AND balance < 0) THEN
        RAISE EXCEPTION 'Error: insufficient funds in account 1';
    END IF;

    -- Increase the balance in account 2
    UPDATE accounts
    SET balance = balance + 100
    WHERE account_id = 2;

    -- Check that account 2 exists
    IF NOT EXISTS (SELECT 1 FROM accounts WHERE account_id = 2) THEN
        RAISE EXCEPTION 'Error: account 2 not found';
    END IF;
END;
$$;

-- If everything is successful, commit the transaction
COMMIT;
Here, we decrement the balance in account 1 and increment the balance in account 2. If an update fails (for instance, a constraint violation) or one of the checks raises an exception, the transaction is aborted and all of its changes are rolled back, returning the system to its state before the transaction began.
Avoiding Deadlocks
A deadlock occurs when two transactions are waiting on each other, with each transaction holding resources the other one needs.
5.1. Example of a Deadlock
- Transaction A locks row X; Transaction B locks row Y.
- Transaction A tries to lock Y, but must wait for B to release it.
- Transaction B tries to lock X, but must wait for A to release it.
- Both transactions wait indefinitely.
5.2. Prevention Strategies
- Fixed Locking Order: Always update tables/rows in a predictable sequence (e.g., ascending ID) to avoid cycles; see the sketch after this list.
- Explicit Locks (SELECT ... FOR UPDATE): Helps you control the order in which locks are acquired.
- Keep Transactions Short: Less logic within each transaction means lower risk of deadlocks.
- DBMS Settings: Many systems automatically detect and resolve deadlocks by rolling back one of the transactions (the “deadlock victim”).
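One common pattern combines the first two points: lock every row the transaction needs up front, in ascending key order, so two concurrent transfers can never hold the locks in opposite orders. A sketch using the accounts table from earlier:
BEGIN;

-- Acquire both row locks in ascending account_id order
SELECT * FROM accounts
WHERE account_id IN (1, 2)
ORDER BY account_id
FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

COMMIT;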
By understanding ACID properties, isolation levels, and lock management, you can ensure consistent and safe concurrent access to your database.
Storing and Processing Big Data
In this chapter, we’ll explore how SQL is used in the world of big data, where the volume of information exceeds the limits of traditional relational databases. We’ll discuss the key principles behind Apache Hive and Spark SQL, as well as the unique aspects of distributed queries and why these solutions are growing in popularity.
SQL in the Big Data Ecosystem
1.1. Brief Overview
When data volumes or workloads go beyond what traditional relational databases can handle, Big Data platforms come into play. These include:
- Apache Hadoop: A framework for distributed storage and processing of large datasets.
- Apache Hive: Runs on top of Hadoop, providing an SQL-like language (HiveQL) for querying and analyzing data.
- Apache Spark: A cluster computing platform with support for Spark SQL.
- Apache Drill, Presto, Apache Impala: Other systems that let you execute SQL queries against distributed data sources.
1.2. Advantages of SQL in Big Data
- Familiar Syntax: Millions of developers and analysts already know SQL, so less training and onboarding is needed.
- Flexibility: SQL works for a wide range of tasks, from simple queries to complex analytics.
- High Level of Abstraction: Users don’t need to understand the inner workings of MapReduce or Spark—SQL engines handle execution plans under the hood.
Apache Hive: SQL on Hadoop
2.1. Hive Architecture
Hive stores its data in the Hadoop Distributed File System (HDFS). Queries written in HiveQL (an extension of SQL) are converted into MapReduce or Spark tasks. This lets you analyze huge datasets without manually writing MapReduce code.
2.2. Key Concepts
- Tables and Databases: Logical organization of data into tables; physically, the data might be in files of different formats (Text, ORC, Parquet).
- Partitions & Buckets: Large tables are divided into partitions (for example, by date) to optimize reads. Buckets offer additional segmentation within each partition (see the sample DDL after this list).
- Metastore: A service that stores metadata, such as table structures, file paths, and column types.
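A hedged HiveQL sketch of such a layout; the column list is made up to match the sample query in the next subsection, and some Hive versions may require the backticks around year:
-- Sales data partitioned by year, bucketed by sale_id within each partition
CREATE TABLE sales (
    sale_id BIGINT,
    region  STRING,
    amount  DECIMAL(10, 2)
)
PARTITIONED BY (`year` INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS
STORED AS ORC;
With this layout, the query below can prune everything except the year = 2024 partition.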
2.3. Sample HiveQL Query
SELECT region, COUNT(*) AS cnt
FROM sales
WHERE year = 2024
GROUP BY region;
Hive translates this query into a set of tasks that run on the Hadoop cluster, returning aggregated results by region.
Spark SQL: Fast Data Processing
3.1. Features of Spark SQL
Spark SQL lets you query data using SQL syntax, while “under the hood” Spark’s high-performance cluster computing in memory does the heavy lifting.
Benefits:
- High Speed: Spark’s in-memory computation is faster than classic MapReduce.
- DataFrames and Datasets: In addition to SQL, Spark offers APIs for working with structured data in languages like Python, Scala, and Java.
3.2. Spark SQL Code Example (Scala)
// SparkSession lives in the spark-sql module
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
.appName("Spark SQL Example")
.getOrCreate()
// Load data from a CSV file
val df = spark.read
.option("header", "true")
.csv("/path/to/sales.csv")
// Register the DataFrame as a temporary table
df.createOrReplaceTempView("sales")
val result = spark.sql("""
SELECT region, COUNT(*) AS cnt
FROM sales
WHERE year = 2024
GROUP BY region
""")
result.show()
Behind the scenes, Spark distributes the calculations across cluster nodes.
Distributed SQL Queries and Their Characteristics
4.1. Distribution Principles
In traditional databases (PostgreSQL, MySQL, Oracle), the query execution plan typically runs on a single server. In Big Data scenarios, the system breaks a query into sub-tasks that run in parallel across many nodes, then combines the final results.
4.2. Query Planner
Distributed systems use a query planner to decide how best to split the workload among nodes, taking into account:
- Partition Size
- Node Resources
- Network Overhead
4.3. Challenges in Distributed Systems
- Network: Often the bottleneck for transferring large amounts of data between nodes.
- Fault Tolerance: The system must recover if individual nodes fail.
- Load Balancing: Ensuring work is evenly distributed among nodes.
Choosing Between Traditional SQL and Big Data Solutions
- Data Volume: If your data comfortably fits on a single server, there’s no need to introduce a complex architecture.
- Workload Type: For fast transactions (OLTP), traditional databases may be more suitable. For analyzing massive datasets (OLAP), Hadoop and Spark are strong choices.
- Infrastructure Complexity: Big Data systems require clusters, distributed file systems, and DevOps expertise.
Stored Procedures and Functions
In this chapter, we’ll explore how to create and use stored procedures and user-defined functions in relational databases. We’ll see how these features can help encapsulate business logic, reduce network traffic, and boost performance.
Key Differences: Procedures vs. Functions
Stored Procedure
- Often doesn’t return a value directly (though it can pass values through output parameters) but can execute any SQL operations (SELECT, INSERT, UPDATE, DELETE, DDL).
User-Defined Function (UDF)
- Returns a scalar value (number, string, date, etc.) or a table. Functions are typically restricted in how they modify data (in most DBMSs, they can’t perform UPDATE or DELETE operations).
Creating Stored Procedures
2.1. Example in PostgreSQL (PL/pgSQL)
CREATE OR REPLACE PROCEDURE update_salary(
p_employee_id INT,
p_new_salary NUMERIC
)
LANGUAGE plpgsql
AS $$
BEGIN
UPDATE employees
SET salary = p_new_salary
WHERE employee_id = p_employee_id;
END;
$$;
- Parameters: p_employee_id and p_new_salary.
- Body: Updates the corresponding row in the employees table.
In some DBMSs, the keyword might be CREATE PROCEDURE ... instead of CREATE OR REPLACE PROCEDURE, and the procedure body’s syntax can vary.
2.2. Calling a Procedure
CALL update_salary(101, 7000);
This updates the salary to 7000 for the employee whose employee_id is 101.
User-Defined Functions
3.1. Scalar Function (Returns a Single Value)
Example (T-SQL, Microsoft SQL Server):
CREATE FUNCTION fn_total_price(
@orderId INT
)
RETURNS DECIMAL(10, 2)
AS
BEGIN
DECLARE @total DECIMAL(10, 2);
SELECT @total = SUM(price * quantity)
FROM order_items
WHERE order_id = @orderId;
RETURN @total;
END;
- The function fn_total_price takes @orderId and returns the total cost of that specific order.
- In PostgreSQL, a similar function might have slightly different syntax, but the idea is the same (see the sketch at the end of this subsection).
Using the Function:
SELECT dbo.fn_total_price(1234) AS total_price;
This retrieves the total cost for order ID 1234.
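For comparison, here is a hedged PostgreSQL sketch of the same function, written as a plain SQL function over the same order_items table:
CREATE OR REPLACE FUNCTION fn_total_price(p_order_id INT)
RETURNS NUMERIC
LANGUAGE sql
AS $$
    SELECT SUM(price * quantity)
    FROM order_items
    WHERE order_id = p_order_id;
$$;
-- Called without a schema prefix:
-- SELECT fn_total_price(1234) AS total_price;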
3.2. Table-Valued Function (Returns a Set of Rows)
Example (PostgreSQL):
CREATE OR REPLACE FUNCTION get_high_salary_employees(
min_salary NUMERIC
)
RETURNS TABLE (
employee_id INT,
name TEXT,
salary NUMERIC
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
-- Qualify the columns with an alias: the function's OUT parameters share
-- these names, and unqualified references would be ambiguous in PL/pgSQL
SELECT e.employee_id, e.name, e.salary
FROM employees e
WHERE e.salary > min_salary;
END;
$$;
This function returns a result set with three columns (employee_id, name, salary) filtered by the condition salary > min_salary.
In T-SQL (Microsoft SQL Server), table-valued functions might use RETURNS TABLE with RETURN (SELECT ...).
Using the Function:
SELECT *
FROM get_high_salary_employees(5000);
This returns all employees with a salary above 5000.
Working with Parameters and Return Values
4.1. Input and Output Parameters
- Input parameter: Receives a value when the procedure or function is called and can be used inside the routine for filtering, calculations, etc.
- Output parameter (OUT) in some DBMSs: Can return one or more values from a procedure (often found in Oracle and MS SQL Server).
Example (PL/SQL, Oracle):
CREATE OR REPLACE PROCEDURE get_employee_name(
p_employee_id IN NUMBER,
p_employee_name OUT VARCHAR2
)
AS
BEGIN
SELECT name
INTO p_employee_name
FROM employees
WHERE employee_id = p_employee_id;
END;
/
- p_employee_id is an input (IN) parameter.
- p_employee_name is an output (OUT) parameter that stores the result.
Practical Use Cases
5.1. Encapsulating Business Logic
Rather than scattering logic across client code and application layers, it can be simpler to keep it in procedural code within the DBMS. For instance, in an online store, you might have a procedure that:
- Saves an order.
- Decrements the inventory stock.
- Creates a log entry for the transaction.
As a result, the entire purchasing process is carried out as a single database-level transaction.
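A compact PL/pgSQL sketch of such a procedure; the orders, inventory, and order_log tables and their columns are hypothetical:
CREATE OR REPLACE PROCEDURE place_order(
    p_product_id INT,
    p_quantity   INT
)
LANGUAGE plpgsql
AS $$
BEGIN
    -- 1. Save the order
    INSERT INTO orders (product_id, quantity)
    VALUES (p_product_id, p_quantity);

    -- 2. Decrement the inventory stock
    UPDATE inventory
    SET stock = stock - p_quantity
    WHERE product_id = p_product_id;

    -- 3. Create a log entry for the transaction
    INSERT INTO order_log (product_id, quantity, logged_at)
    VALUES (p_product_id, p_quantity, now());
END;
$$;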
5.2. Data Validation
In some projects, it makes sense to place validation logic in a procedure to enhance reliability before inserting or updating data. For example, a procedure add_employee might:
- Check if an employee with the same email already exists.
- Verify that the department’s employee limit hasn’t been exceeded.
- Only then insert a new row into the employees table. (A sketch of such a procedure follows this list.)
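A hedged PL/pgSQL sketch of add_employee; the email column, the limit of 50 employees per department, and the error messages are all assumptions for illustration:
CREATE OR REPLACE PROCEDURE add_employee(
    p_name          TEXT,
    p_email         TEXT,
    p_department_id INT
)
LANGUAGE plpgsql
AS $$
BEGIN
    -- Check for a duplicate email (assumes employees has an email column)
    IF EXISTS (SELECT 1 FROM employees WHERE email = p_email) THEN
        RAISE EXCEPTION 'Employee with email % already exists', p_email;
    END IF;

    -- Verify the department's (hypothetical) limit of 50 employees
    IF (SELECT COUNT(*) FROM employees
        WHERE department_id = p_department_id) >= 50 THEN
        RAISE EXCEPTION 'Department % is full', p_department_id;
    END IF;

    -- Only then insert the new row
    INSERT INTO employees (name, email, department_id)
    VALUES (p_name, p_email, p_department_id);
END;
$$;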
5.3. Speed and Lower Network Traffic
When you need to perform large-scale changes all at once, it’s often more efficient to call a single stored procedure that runs all UPDATE and INSERT operations on the server side, rather than sending multiple separate queries over the network.
Best Practices for Writing and Maintaining Procedures/Functions
- Follow standards: Use consistent naming (for example, sp_ or fn_ prefixes) and comment the logic.
- Beware of extensive business logic at the DB level if you plan a microservices architecture; otherwise, you might need to replicate that logic across multiple databases as you scale.
- Pay attention to DBMS specifics: Different engines handle transactions inside procedures differently (especially Oracle vs. PostgreSQL vs. MS SQL).
- Test performance: Overly complex logic in a single procedure can become a bottleneck.
Security and Access Control
In this chapter, we’ll explore the key aspects of database security in SQL systems, from managing privileges with GRANT and REVOKE to data encryption and common strategies for protecting information. The main goal is to ensure confidentiality, integrity, and availability of data without making life overly complicated for administrators and users.
Managing Access Rights
1.1 Roles and Users
In most DBMSs, security is managed using a combination of “users – roles – privileges”:
- Users: Specific accounts, often tied to individual employees or applications.
- Roles: Sets of access rights that can be assigned to users for easier management.
1.2 GRANT and REVOKE Commands
- GRANT: Grants certain privileges to a user or role.
- REVOKE: Revokes previously granted privileges.
Example (PostgreSQL):
-- Create a 'readonly' role that can only read data
CREATE ROLE readonly;
-- Grant SELECT permissions on the 'employees' table
GRANT SELECT ON employees TO readonly;
-- Assign the 'readonly' role to the user 'user_test'
GRANT readonly TO user_test;
-- To revoke privileges:
REVOKE SELECT ON employees FROM readonly;
This approach allows you to assign privileges via roles instead of granting them directly to every user.
1.3 Levels of Privileges
- Database Level: Can grant privileges for creating/dropping tables, altering structures, etc.
- Table/Schema Level: Controls read (SELECT), write (INSERT, UPDATE, DELETE) access.
- Column Level (in some DBMSs): More granular control if you need to hide certain fields.
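In PostgreSQL, for instance, column-level control is expressed directly in GRANT; this sketch exposes only non-sensitive columns of employees to the readonly role created earlier:
-- Allow readonly to see names and departments, but not salaries
GRANT SELECT (employee_id, name, department_id) ON employees TO readonly;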
Data Encryption
2.1 Encryption at the Connection Level
- SSL/TLS: Ensures secure data transfer between client and server (for example, postgresql://...?sslmode=require in PostgreSQL). This prevents data from being intercepted in plain text.
2.2 Encryption at the Storage Level
- TDE (Transparent Data Encryption): Many DBMSs (Oracle, Microsoft SQL Server, MySQL) support transparent encryption for all or part of the database file.
- Application-Level Encryption: Data is encrypted before it’s written to the database, adding another layer of protection from unauthorized server-side access.
2.3 Key Management
If data is encrypted, it’s crucial to store and manage keys securely—often via a dedicated module or service (KMS). Losing the key can be as catastrophic as losing the data itself.
General Approaches to Data Protection
3.1 Principle of Least Privilege
Grant users only the privileges they actually need for their tasks.
Regularly review issued privileges to remove outdated or excessive access.
3.2 Segregation of Duties
Database administrators don’t always require access to business data; sometimes they only need technical privileges for maintenance.
Analysts, developers, operators—each role should have its own access level.
3.3 Credential Security
Store passwords in encrypted or hashed form.
Use strong passwords and, where possible, one-time tokens or Kerberos/LDAP-based authentication.
Restrict access to configuration files that may contain credentials.
3.4 Monitoring and Auditing
Audit logs (covering DDL and DML commands) help detect suspicious activity.
Set up alerts: if someone attempts to access a table they’re not authorized for, or performs unusual operations, the system should notify administrators.
Protecting Against SQL Injection
Although SQL injection is primarily an application-level concern, proper DBMS configuration also plays a part:
- Use prepared statements rather than dynamically concatenating strings (see the sketch after this list).
- Limit privileges: even if a malicious actor gains access through a vulnerability, they shouldn’t have the authority to view or alter the entire database.
- Keep your DBMS version and security patches up to date.
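At the SQL level, a prepared statement keeps the query text and its parameter values separate; here is a PostgreSQL-flavored sketch (application drivers expose the same idea through placeholders such as ? or $1):
-- The parameter value is passed separately from the SQL text,
-- so user input can never change the query's structure
PREPARE get_employee (INT) AS
    SELECT employee_id, name
    FROM employees
    WHERE employee_id = $1;

EXECUTE get_employee(101);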
Example of a Comprehensive Security Strategy
- Create roles: readonly, readwrite, admin.
- Assign each user an appropriate role based on job responsibilities.
- Enable SSL/TLS for all database connections and use TDE for on-disk data encryption.
- Conduct regular audits: periodically review logs and user activity.
- Rotate encryption keys according to your organization’s security policy.
Recommendations for Enhancing Security
- Use complex passwords and rotate them on a set schedule.
- Restrict access by IP or VPN: avoid allowing external connections unless absolutely necessary.
- Implement multi-factor authentication (MFA) if your DBMS integrates with a corporate directory (e.g., Active Directory).
- Minimize open ports: keep only essential services accessible from outside.
- Perform regular backups, store them in an encrypted form, and test the recovery process periodically.