## Library Carpentry: SQL
Full lesson: https://librarycarpentry.org/lc-sql/
Slides: https://www.angelazoss.com/2021-08-24-suffolk-online/sql
## Schedule
10:00
1.
Introduction to SQL
10:15
2.
Selecting and sorting data
10:35
3.
Filtering
10:55
Short Break
11:00
3.
Filtering, cont'd.
11:10
5.
Aggregating & calculating values
11:30
6.
Joins and aliases
12:00
Long Break
12:30
6.
Joins and aliases, cont'd.
12:45
8.
Database design
13:15
Short Break
13:20
8.
Database design, cont'd.
13:30
9.
Creating tables and modifying data
13:55
Finish
## [Introduction to SQL](https://librarycarpentry.org/lc-sql/01-introduction/index.html)
## What is SQL? **S**tructured **Q**uery **L**anguage, or SQL\ (sometimes pronounced “sequel”)
## Relational Databases
multiple tables
fields (columns)
records
(rows)
Also: every value in the same field has the same data *type*
## Links between tables
articles
ID
Title
Year
Journal ID
1
2
3
4
5
6
7
Aflatoxin...
Metagen...
Synthesis...
Perform...
Dihydro...
Ionic Liq...
Character...
2015
2015
2015
2015
2015
2015
2015
1
22
20
19
35
10
22
journals
ID
Title
ISSN
Agriculture
Int’l Jnl of Mol Sci...
Inorganics
Hydrology
Molecules
Catalysts
2077-0472
1422-0067
2304-6740
2306-5338
1420-3049
2073-4344
1
22
20
19
35
10
## Query: command that lets you look up data in a database or make calculations based on columns
## Why use SQL?
well established, around since the 1970s
keeps data separate from analysis
optimized for large amounts of data
improves data quality
used in other programming languages (e.g., R, Python) and web applications (e.g., WordPress)
SQL in Libraries
make bulk changes to metadata
communicate effectively with database administrators
use for evaluation projects
create custom, direct queries
build databases to keep a local inventory or track licenses
improve data migration projects
connect and share data from different spreadsheets
prepare for other querying systems, like application programming interfaces (APIs)
## Additional resources * [An Introduction to SQL for Librarians](http://ruthtillman.com/an-introduction-to-sql-for-librarians/) * [Data Science is Different Now: Learn SQL](https://veekaybee.github.io/2019/02/13/data-science-is-different/)
## Database Management Systems * SQLite (using today) * MySQL * PostgreSQL * MS Access * Filemaker Pro
## Introduction to DB Browser for SQLite * View Tables * Browse Data * Write a query (Execute SQL) * Add data to a table without SQL * Enter data into a CSV file and append (“File” > “Import” > “Table” from CSV file...) * Click the “Browse Data” tab, then click the “New Record” button.
## Workshop Dataset Description 51 different journals published during 2015
* **articles** (1001 records) * Field names: id, Title, Authors, DOI, URL, Subjects, ISSNs, Citation, LanguageID, LicenseID, Author_Count, First_Author, Citation_Count, Day, Month, Year * **journals** (51 records) * Field names: id, ISSN-L, ISSNs, PublisherID, Journal_Title * **languages** (4 records) * Field names: id, Language * **licenses** (4 records) * Field names: id, License * **publishers** (6 records) * Field names: id, Publisher
## Data Types
|Data type|Details|Name in SQLite| |---|---|---| |**text or string**|a combination of numbers, letters, symbols. Platforms may have different data types: one for variables with a set number of characters - e.g., a zip code or postal code, and one for variables with an open number of characters, e.g., an address or description variable.|TEXT| |**integer**|sometimes called whole numbers or counting numbers. Can be 1,2,3, etc., as well as 0 and negative whole numbers: -1,-2,-3, etc.|INTEGER| |float, real, or double|a decimal number or a floating point value. The largest possible size of the number may be specified.|REAL| |date or datetime|depending on the platform, may represent the date and time or the number of days since a specified date. This field often has a specified format, e.g., YYYY-MM-DD|doesn’t exist - need to use built-in date and time functions and store dates in real, integer, or text formats. See Section 2.2 of SQLite documentation for more details.| |boolean or binary|this variable type is often used to represent variables that can only have two values: yes or no, true or false.|doesn’t exist - need to use integer data type and values of 0 or 1.| |blob|a Binary Large OBject can store a large amount of data, documents, audio or video files.|BLOB|
## [Selecting and sorting data](https://librarycarpentry.org/lc-sql/02-selecting-sorting-data/index.html)
## First query SELECT *column_name_1* FROM *table name*;
## Add more columns SELECT *column_name_1*, *column_name_2* FROM *table name*;
## Select all columns SELECT * FROM *table name*;
## Unique values SELECT DISTINCT *column_name_1* FROM *table name*;
SELECT DISTINCT *column_name_1*, *column_name_2* FROM *table name*;
## Sorting SELECT * FROM *table name* ORDER BY *column_name_1*;
## Sorting, Ascending vs. Descending SELECT * FROM *table name* ORDER BY *column_name_1* ASC; SELECT * FROM *table name* ORDER BY *column_name_1* DESC;
## Sorting, Multiple Columns SELECT * FROM *table name* ORDER BY *column_name_1* DESC, *column_name_2* ASC;
## Challenge \#1 Write a query that returns: `Title`, `First_Author`, `ISSNs` and `Citation_Count` from the **articles** table, ordered by the top cited article and alphabetically by title.
## [Filtering](https://librarycarpentry.org/lc-sql/03-filtering/index.html)
## Filtering with WHERE SELECT * FROM *table name* WHERE *column_name_1* = '*some value*';
## Filtering, Multiple Conditions SELECT * FROM *table name* WHERE (*column_name_1* = '*some value*') AND (*column_name_2* = '*some other value*');
## Filtering, Alternatives SELECT * FROM *table name* WHERE (*column_name_1* = '*some value*') OR (*column_name_2* = '*some other value*');
Can also use `NOT` to reverse either condition.
## Filtering, Other Comparisons SELECT * FROM *table name* WHERE *column_name_1* LIKE '*%part of value%*';
Additional comparisons: `IN`, `BETWEEN...AND`, `IS NULL`
## Challenge \#1 Write a query that returns: `Title`, `First_Author`, `Subjects`, `ISSNs`, `Month` and `Year` for all papers where `Subjects` contains “computer” and that have more than 8 citations.
## [Aggregating & calculating values](https://librarycarpentry.org/lc-sql/05-aggregating-calculating/index.html)
## Aggregation Functions * `MAX` (find the maximum value in a field) * `MIN` (find the minimum value in a field) * `AVG` (find the average value of a field) * `COUNT` (count the number of values in a field and present the total) * `SUM` (add up the values in a field and present the sum)
## Basic Aggregation SELECT *column_name_1*, AVG(*column_name_2*) FROM *table name* GROUP BY *column_name_1*;
## Challenge \#1 Write a query using an aggregate function that returns the number of article titles per ISSNs, sorted by title count in descending order. Which ISSN has the most titles? (Hint: choose which aggregate function to use. It is one of the common aggregate functions: `MAX`, `MIN`, `AVG`, `COUNT`, `SUM`.)
## Filtering after Aggregation SELECT *column_name_1*, AVG(*column_name_2*) FROM *table name* GROUP BY *column_name_1* HAVING AVG(*column_name_2*) >= 10;
## Challenge \#2 Write a query that returns, from the `articles` table, the average `Citation_Count` for each journal ISSN but only for the journals with 5 or more citations on average.
## Calculations After `SELECT`, in addition to listing column names, we can write calculations using arithmetic operators: * `+` * `-` * `*` * `/` * `%` ("modulo") * `SQRT` (square root)
## [Joins and aliases](https://librarycarpentry.org/lc-sql/06-joins-aliases/index.html)
## Basic Join SELECT * FROM *table_name_1* JOIN *table_name_2* ON *table_name_1*.*column_name* = *table_name_2*.*column_name*;

## Alternative Join SELECT * FROM *table_name_1* JOIN *table_name_2* USING (*column_name*);
## Additional Resource [SQL Join Types Explained Visually](https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/)
## Selecting From Multiple Tables SELECT *table_name_1*.*column_name_1*, *table_name_2*.*column_name_2*, *table_name_2*.*column_name_3* FROM *table_name_1* JOIN *table_name_2* ON *table_name_1*.*column_name_1* = *table_name_2*.*column_name_1*;
## Challenge \#1 Write a query that `JOINS` the `articles` and `journals` tables and that returns the `Journal_Title`, total number of articles published and average number of citations for every journal ISSN.
## Challenge \#2 Write a query that returns the `Journal_Title`, `Publisher` name, and number of articles published, ordered by number of articles in descending order.
## [Database design](https://librarycarpentry.org/lc-sql/08-database-design/index.html)
## Spreadsheets Spreadsheets, sometimes referred to as tabular data or flat files, are an easy way to display data organized in columns and rows.
Spreadsheets can make data gathering easier but they can also lead to messy data.
In the figure below, can you identify where inconsistencies in the data have been introduced?
## Terminology
## Tables * Each table stores data about a particular type of "entity," like an article or a journal. * Each record is an example of that entity (e.g., one article, one journal). * Entities are described by attributes, stored in fields (e.g., article title, journal ISSN).
## Linking Tables * To link tables, each table must have a way to uniquely identify each record. * A field that uniquely identifies the records can be used as a table's "primary key." * It is common to have a primary key field called `ID`, filled with integers that count up from 1.
## Entity-Relationship Diagram (ERD)
## Foreign Key When a primary key from one table connects to a field in a second table, the field in the second table is called a "foreign key."
## Types of Relationships * **One to One** - each item in the first table has exactly one match in the second table. * **One to Many** - each item in the first table is related to many items in the second table, sometimes represented as 1 to * or 1 to ∞ * **Many to One** - many items in the first table is related to one item in the second table. * **Many to Many** - many items in the first table are related to many items in the second table.
## Normalization
## Challenge \#1 Are there other tables and relationships you can create to further normalize the data and avoid inconsistencies? For this exercise, you can either use pencil/pen and paper to draw new tables and relationships or modify the previous ERD using [dbdiagram.io](https://dbdiagram.io/d/5cc32b0cf7c5bb70c72fc530).
## Additional Resources * [Database Structure and Design Tutorial](https://www.lucidchart.com/pages/database-diagram/database-design) * [What is an Entity Relationship Diagram](https://www.lucidchart.com/pages/er-diagrams)
## [Creating tables and modifying data](https://librarycarpentry.org/lc-sql/09-create/index.html)
## Create a Table Using SQL CREATE TABLE *table_name*(*col_name_1* *data_type_1*, *col_name_2* *data_type_2*...);
## Delete a Table Using SQL DROP TABLE *table_name*;
## Creating the "journals" table
CREATE TABLE "journals" ( "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, "ISSN-L" TEXT, "ISSNs" TEXT, "PublisherId" INTEGER, "Journal_Title" TEXT, CONSTRAINT "PublisherId" FOREIGN KEY("PublisherId") REFERENCES "publishers"("id") );
## Add Records Using SQL INSERT INTO "*table_name*" VALUES (*value_1*,*value_2*,...);
## Modify Records Using SQL UPDATE *table_name* SET *column_name* = *value* WHERE id = *record_id*;
## Delete Records Using SQL DELETE FROM *table_name* WHERE *column_name* = *value*;
Caution: deleting records can break the links between tables. You should also delete any records that point to the deleted records.
## Challenge \#1 Write an SQL statement to add the journal “New Journal of Physics” (ISSNs & ISSNs: 1367-2630; publisher: “Institute of Physics (IOP)”) to the table `journals`. You need to add the publisher “IOP” to the table `publishers` as well.
## Challenge \#2 SQLite has several administrative commands that aren’t part of the SQL standard. One of them is `.dump`, which prints the SQL commands needed to re-create the database. Another is `.read`, which reads a file created by `.dump` and restores the database. A colleague of yours thinks that storing dump files (which are text) in version control is a good way to track and manage changes to the database. What are the pros and cons of this approach? (Hint: records aren’t stored in any particular order.)
## Feedback on Today https://forms.gle/e8EUdKaeErA7ySjy9 Can also email questions to angela.zoss@duke.edu