0x00. Desciption
execute sql queies in SQLite in IMDB database.
0x01. Evn setup
I’m using Ubuntu20.04 VM for this homework. The VM is installed on VMware Workstation Pro 15. There’re several steps to prepare:
-
install SQLite,
sudo apt-get install sqlite3 libsqlite3-dev
-
Download dataset,
wget https://15445.courses.cs.cmu.edu/fall2019/files/imdb-cmudb2019.db.gz
-
verify dataset,
md5 imdb-cmudb2019.db.gz
-
Unzip and execute it,
gunzip imdb-cmudb2019.db.gz
andsqlite3 imdb-cmudb2019.db
0x02. Q1_SAMPLE
Sample, skip.
0x03. Q2_UNCOMMON_TYPE
List the longest title of each type along with the runtime minutes.
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;
output:
movie|Logistics|51420 short|Kuriocity|461 tvEpisode|Téléthon 2012|1800 tvMiniSeries|Kôya no yôjinbô|1755 tvMovie|ArtQuench Presents Spirit Art|2112 tvSeries|The Sharing Circle|8400 tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60 tvSpecial|Katy Perry Live: Witness World Wide|5760 video|Midnight Movie Madness: 50 Movie Mega Pack|5135 videoGame|Flushy Fish VR: Just Squidding Around|1500
This is almost correct, however, in tvShort
type, a tie exists.
sqlite> select title_id,type, runtime_minutes FROM titles WHERE type = 'tvShort' ORDER BY runtime_minutes DESC LIMIT 5; tt2292857|tvShort|60 tt5613498|tvShort|60 tt10622020|tvShort|53 tt0353158|tvShort|49 tt5452108|tvShort|48
Now the problem is the tie, I need to figure out a way to handle it(show both tuples when tie). So, I take the output of this sql as a temp table.
select type, primary_title, max(runtime_minutes) FROM titles GROUP BY type ORDER BY type ASC, primary_title ASC;
And then, join it with titles
table, on the same runtime_minutes and type, which means it will give me:
if max without tie, the join will output the same tuple, if max WITH tie, the join will output the tuples that has the same attributes* like the max, in this case, has the same runtime and type, which indicates it's a tie.
As for the answer:
select titles.type, titles.primary_title, titles.runtime_minutes from titles join (select type, primary_title, max(runtime_minutes) as maxLength FROM titles GROUP BY type) as maxType on maxType.maxLength = titles.runtime_minutes and maxType.type = titles.type order by titles.type ASC, titles.primary_title ASC;
movie|Logistics|51420 short|Kuriocity|461 tvEpisode|Téléthon 2012|1800 tvMiniSeries|Kôya no yôjinbô|1755 tvMovie|ArtQuench Presents Spirit Art|2112 tvSeries|The Sharing Circle|8400 tvShort|Paul McCartney Backstage at Super Bowl XXXIX|60 tvShort|The People Next Door|60 tvSpecial|Katy Perry Live: Witness World Wide|5760 video|Midnight Movie Madness: 50 Movie Mega Pack|5135 videoGame|Flushy Fish VR: Just Squidding Around|1500
0x04. Q3_TV_VS_MOVIE
List all types of titles along with the number of associated titles.
SELECT type, count(distinct title_id) as number from titles GROUP BY type ORDER BY number;
tvShort|4075 videoGame|9044 tvSpecial|9107 tvMiniSeries|10291 tvMovie|45431 tvSeries|63631 video|90069 movie|197957 short|262038 tvEpisode|1603076
Answer from prof:
SELECT type, count(*) AS title_count FROM titles GROUP BY type ORDER BY title_count ASC;
So, at this time that I figured title_id is primary key and it’s unique, so distinct is redundant. It could be replaced by count(title_id)
or simplily count(*)
.
0x05. Q4_OLD_IS_NOT_GOLD
Which decades saw the most number of titles getting premiered? List the number of titles in every decade. Like 2010s|2789741
.
From what I could do, I got this:
SELECT CAST(premiered/10 AS INT)*10 as upTime, count(*) FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY upTime DESC;v
2020|2492 2010|1050732 2000|494639 1990|211453 1980|119258 1970|99707 1960|75237 1950|39554 1940|10011 1930|11492 1920|13153 1910|26596 1900|9586 1890|2286 1880|22 1870|1
I have no idea how to append ‘s’ at the end of each decade. So, I look for the answer.
SELECT CAST(premiered/10*10 AS TEXT) || 's' AS decade, COUNT(*) AS num_movies FROM titles WHERE premiered is not null GROUP BY decade ORDER BY num_movies DESC ;
2010s|1050732 2000s|494639 1990s|211453 1980s|119258 1970s|99707 1960s|75237 1950s|39554 1910s|26596 1920s|13153 1930s|11492 1940s|10011 1900s|9586 2020s|2492 1890s|2286 1880s|22 1870s|1
I cast the decades into INT
instead of TEXT
which could not use ||
to append someting else. And sort this by number of movies(I sorted by the wrong collum before) Now I change my sql to this:
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, count(*) as numbers FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;
2010s|1050732 2000s|494639 1990s|211453 1980s|119258 1970s|99707 1960s|75237 1950s|39554 1910s|26596 1920s|13153 1930s|11492 1940s|10011 1900s|9586 2020s|2492 1890s|2286 1880s|22 1870s|1
And now, the output is exactly the same.
0x06. Q5_PERCENTAGE
List the decades and the percentage of titles which premiered in the corresponding decade. Display like : 2010s|45.7042
.
SELECT CAST(premiered/10*10 AS TEXT) || 's' as upTime, ROUND(CAST(count(*) AS FLOAD) / (SELECT COUNT(*) FROM titles) * 100, 4) as numbers FROM titles WHERE premiered is not NULL GROUP BY upTime ORDER BY numbers DESC;
2010s|45.7891 2000s|21.5555 1990s|9.2148 1980s|5.1971 1970s|4.3451 1960s|3.2787 1950s|1.7237 1910s|1.159 1920s|0.5732 1930s|0.5008 1940s|0.4363 1900s|0.4177 2020s|0.1086 1890s|0.0996 1880s|0.001 1870s|0.0
This is done based on the previous question, just cast the number to fload and divide. The answer cast it into real. They are both proximity data types, so in this question, there no difference.
If you need accurate data, numeric
and decimal
should be better.