The purpose of this project was to explore the CIA Factbook, a database of information about all the countries in the world, using SQL queries in order to find out interesting information.
The project provided the opportunity to develop my knowledge of SQL and learn Jupyter Notebook. The version of SQL used is sqlite. It makes use of the following types of query:
The original Jupyter notebook can be downloaded from [here]http://pravjey.github.io/CIAFactbook/Factbook.ipynb]
%%capture
%load_ext sql
%sql sqlite:///factbook.db
%sql SELECT * FROM sqlite_master WHERE type='table';
%sql SELECT * FROM facts LIMIT 5;
%%sql
SELECT COUNT(*) AS "Number of records",
COUNT(DISTINCT(name)) AS "Number of distinct countries"
FROM facts;
Each country has its own, unique record
The first task was to find the range for the population and population growth
%%sql
SELECT MIN(population) AS "Smallest population",
MAX(population) AS "Largest population",
MIN(population_growth) AS "Smallest population growth",
MAX(population_growth) AS "Largest population growth"
FROM facts;
Out of the whole database, the lowest population of a country was zero and the highest was over 7 billion. This gave rise to two questions:
Upon further examination, it was found that the country with a population of zero is Antarctica. As Antarctica was a country where no-one lived, other than penguins, and the only human inhabitants were temporarily-based research scientists, it made sense to exclude Antartica from further analysis.
So the country with the lowest population is the Pitcairn Islands.
%%sql
SELECT name from facts
WHERE population = (SELECT MIN(population) FROM facts);
%%sql
SELECT name from facts
WHERE population = (SELECT MIN(population) FROM facts
WHERE name <> "Antarctica");
Similarly, the "country" with the largest population in the CIA Factbook is "the World". Again, it did not make sense to include this record in the analysis of countries. So the country with the largest population is China.
%%sql
SELECT name from facts
WHERE population = (SELECT MAX(population) FROM facts);
%%sql
SELECT name from facts
WHERE population = (SELECT MAX(population) FROM facts
WHERE name <> "World");
The countries with the lowest growth in population, i.e. zero growth, were Vatican City State, Cocos Islands, Greenland and the Pitcairn Islands.
The country with the highest growth in population was South Sudan.
%%sql
SELECT name from facts
WHERE population_growth = (SELECT MIN(population_growth) FROM facts);
%%sql
SELECT name from facts
WHERE population_growth = (SELECT MAX(population_growth) FROM facts);
So, excluding the records for Antartica and the World, the range for population and population growth are as follows:
%%sql
SELECT MIN(population) AS "Smallest population",
MAX(population) AS "Largest population",
MIN(population_growth) AS "Smallest population growth",
MAX(population_growth) AS "Largest population growth"
FROM facts
WHERE name <> "World" AND name <> "Antarctica";
%sql SELECT AVG(population) FROM facts
%sql SELECT AVG(area) FROM facts
%%sql
SELECT name, population, area
FROM facts
WHERE population > (SELECT AVG(population) FROM facts) AND
area > (SELECT AVG(area) FROM facts)
ORDER BY population DESC;
It is worth pointing out that the the European Union is included in the list of countries with an above-average area and population. Since the E.U. is not technically a country, but a group of countries, this could also be excluded from further analysis. So the top-ten countries with above-average population and area are:
%%sql
SELECT name, population, area
FROM facts
WHERE population > (SELECT AVG(population) FROM facts) AND
area > (SELECT AVG(area) FROM facts) AND
name <> "European Union"
ORDER BY population DESC;
%%sql
SELECT name, (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT))
AS "Water to land ratio"
FROM facts
ORDER BY (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT)) DESC
LIMIT 10;
It can be seen that only two countries, British Indian Ocean Territory (aka Chagos Islands) and the Virgin Islands, comprise more water than land. This can be seen from the query below, comparing the actual water area to the actual land area.
All the other countries in the Factbook consist of more land than water.
%%sql
SELECT name
FROM facts
WHERE area_water > area_land;
Eighty-nine countries have a water to land ratio of zero, i.e. none of the area of the country comprises of water features.
%%sql
SELECT COUNT(name)
FROM facts
WHERE (CAST(area_water AS FLOAT) / CAST(area_land AS FLOAT)) = 0.0;
A country's population increases in accordance with the birth and migration rate and decreases in accordance with the death rate. It is assumed that the figures for birth rate, death rate and migration rate in the database are percentages. So the query to find out the countries expected to see the largest population increase for the coming are is:
%%sql
SELECT name, ((birth_rate - death_rate + migration_rate) / 100) * population
AS "Population increase"
FROM facts
ORDER BY ((birth_rate - death_rate + migration_rate) / 100) * population DESC
LIMIT 10;
I was particularly interested to find out which countries are likely to see decreases in population.
In the first instance, these would be countries where the death rate is bigger than the birth rate. There are 24 such countries.
%%sql
SELECT COUNT(name)
FROM facts
WHERE death_rate > birth_rate;
%%sql
SELECT name
FROM facts
WHERE death_rate > birth_rate;
The problem with simply comparing death rate and birth rate alone is that birth rate is not the only factor that contributes to population increase. Migration rate also has a role to play. So I then amended the above query to search for the countries where the death rate was higher than the birth and migration rates combined.
There were 13 such countries.
Apart from Japan, they were all in Central and Eastern Europe.
%%sql
SELECT COUNT(name)
FROM facts
WHERE death_rate > (birth_rate + migration_rate)
%%sql
SELECT name
FROM facts
WHERE death_rate > (birth_rate + migration_rate);
Finally, it would be interesting to see in how many countries migration rate has a bigger impact on population increase than birth rate
%%sql
SELECT name
FROM facts
WHERE birth_rate < migration_rate;
There are only seven countries in the world where migration rate has a bigger impact on population increase than the birth rate. In all other countries, birth rate has a bigger impact than migration rate.
China, India, Russia and the United States may be some of the countries with the largest populations but they also take up a lot of geographic area. So, the population may be further apart. The interesting question is in which countries are there high populations crammed into small geographic areas.
%%sql
SELECT name, (CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Population to Area ratio"
FROM facts
ORDER BY (CAST(population AS FLOAT) / CAST(area AS FLOAT)) DESC
LIMIT 10;
It is interesting to note that the countries with largest population densities are all tiny in a geographic sense.
The average population to area ration is:
%%sql
SELECT AVG(CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Average Population to Area ratio"
FROM facts
It is interesting that India, despite having one of the largest populations and area in the world, actually has a below-average population to area ratio.
Most of the countries with below-average population to area ratios appear to be have small countries and islands or island archipelagos.
It is interesting to note that the United Kingdom, and the European Union as a whole - despite grumblings about influx of migrants - actually have below-average population to area ratios.
%%sql
SELECT name, (CAST(population AS FLOAT) / CAST(area AS FLOAT)) AS "Population to Area ratio"
FROM facts
WHERE (CAST(population AS FLOAT) / CAST(area AS FLOAT)) < (SELECT AVG(CAST(population AS FLOAT) / CAST(area AS FLOAT))
FROM facts)
ORDER BY (CAST(population AS FLOAT) / CAST(area AS FLOAT)) DESC;
There is one country in the world where the recorded total area is not the same as the sum of the land area and water area.
%%sql
SELECT name
FROM facts
WHERE area <> (area_land + area_water);
Names of countries vary in length, from 4 to 45, with an average length of 10.
%%sql
SELECT MAX(LENGTH(name)) AS "Longest name",
MIN(LENGTH(name)) AS "Shortest name",
ROUND(AVG(LENGTH(name)),0) AS "Average length of name"
FROM facts;
The countries with the longest names are:
%%sql
SELECT name
FROM facts
WHERE LENGTH(name) = (SELECT MAX(LENGTH(name))
FROM facts);
The countries with the shortest names are:
%%sql
SELECT name
FROM facts
WHERE LENGTH(name) = (SELECT MIN(LENGTH(name))
FROM facts);
There are 84 countries with above-average length of names and 159 countries with below average length of names.
%%sql
SELECT COUNT(name)
FROM facts
WHERE LENGTH(name) > (SELECT ROUND(AVG(LENGTH(name)),0)
FROM facts);
%%sql
SELECT COUNT(name)
FROM facts
WHERE LENGTH(name) < (SELECT ROUND(AVG(LENGTH(name)),0)
FROM facts);
[Back to Portfolio Page][http://pravjey.github.io/archive.html]