Improve Efficiency of Inner Join

2016-01-07T18:38:44+08:00

Background

I query data on the Stack Exchange Data Explorer.

Problem

I want to select columns from two tables and join them into a single table. I chose INNER JOIN to avoid null entries in the result. The syntax is similar to the following.

Sample SQL Syntax

SELECT * FROM Table1
INNER JOIN Table2
ON Table1.ID = Table2.ID
WHERE Col1 LIKE 'foo'

Unfortunately, when Table1 and Table2 are large, the query takes a while to return the result.

Solution

This page describes two more efficient ways. Since I am searching for recent data, I adopted the third method in my query.

My Query on Stack Exchange Data Explorer

DECLARE @TagLike NVARCHAR(25) = ##taglike:string##

SELECT TOP 500 * FROM (SELECT Id AS [Post Link], AnswerCount AS [Ans],
CommentCount AS [Com], CreationDate, Score AS [Scr],
ViewCount AS [Views], OwnerUserId FROM Posts
WHERE AnswerCount = 0 AND Tags LIKE '%' + @TagLike + '%' AND
ClosedDate IS NULL) AS p
INNER JOIN (SELECT Id, LastAccessDate, Reputation AS [Rep] FROM Users
WHERE LastAccessDate >= '2015-12-01') AS u
ON p.OwnerUserId = u.Id
ORDER BY p.Com
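
The query above filters Posts and Users inside derived tables and only then joins them. As a minimal sketch of that shape, the Python snippet below builds two tiny in-memory SQLite tables and checks that "join, then filter" returns the same rows as "filter, then join". The toy rows, the reduced column set, and the helper name `run_both` are made up for illustration, and SQLite only stands in for the SQL Server engine behind Data Explorer. The sketch says nothing about speed, which depends on the query planner.

```python
import sqlite3

def run_both():
    """Return (join_then_filter_rows, filter_then_join_rows) for toy data."""
    con = sqlite3.connect(":memory:")
    # Hypothetical toy data: four posts and three users.
    con.executescript("""
        CREATE TABLE Posts (Id INTEGER, OwnerUserId INTEGER, AnswerCount INTEGER);
        CREATE TABLE Users (Id INTEGER, LastAccessDate TEXT);
        INSERT INTO Posts VALUES (1, 10, 0), (2, 11, 3), (3, 12, 0), (4, 10, 0);
        INSERT INTO Users VALUES (10, '2015-12-20'), (11, '2015-12-05'), (12, '2015-06-01');
    """)
    # Shape 1: join the full tables, then filter in WHERE.
    join_then_filter = con.execute("""
        SELECT p.Id, u.Id FROM Posts AS p
        INNER JOIN Users AS u ON p.OwnerUserId = u.Id
        WHERE p.AnswerCount = 0 AND u.LastAccessDate >= '2015-12-01'
        ORDER BY p.Id
    """).fetchall()
    # Shape 2: filter each table in a derived table, then join.
    filter_then_join = con.execute("""
        SELECT p.Id, u.Id
        FROM (SELECT Id, OwnerUserId FROM Posts WHERE AnswerCount = 0) AS p
        INNER JOIN (SELECT Id FROM Users WHERE LastAccessDate >= '2015-12-01') AS u
        ON p.OwnerUserId = u.Id
        ORDER BY p.Id
    """).fetchall()
    con.close()
    return join_then_filter, filter_then_join
```

Calling `run_both()` gives two identical lists, so the rewrite changes only how the query is expressed, not which rows come back.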

Distribution of User Reputation on Math Stack Exchange

2016-01-07T17:34:38+08:00

Background

Recently, I have been posting math on Mathematics Stack Exchange instead of here.

Problem

How can one get a table for the distribution of reputation on that site?

Solution

Write a SQL query on Stack Exchange Data Explorer.

User's Total Reputation Distribution

select u.range as [prev row <= Reputation <= this row], count(*) as [Users]
from (
  select case
    when Reputation between 1 and 4 then 4
    when Reputation between 5 and 9 then 9
    when Reputation between 10 and 14 then 14
    when Reputation between 15 and 19 then 19
    when Reputation between 20 and 49 then 49
    when Reputation between 50 and 74 then 74
    when Reputation between 75 and 99 then 99
    when Reputation between 100 and 124 then 124
    when Reputation between 125 and 249 then 249
    when Reputation between 250 and 499 then 499
    when Reputation between 500 and 999 then 999
    when Reputation between 1000 and 1999 then 1999
    when Reputation between 2000 and 2499 then 2499
    when Reputation between 2500 and 2999 then 2999
    when Reputation between 3000 and 4999 then 4999
    when Reputation between 5000 and 9999 then 9999
    when Reputation between 10000 and 14999 then 14999
    when Reputation between 15000 and 19999 then 19999
    when Reputation between 20000 and 24999 then 24999
    else 400000 end as range
  from Users) u
group by u.range

Vim did the indentation automatically. I know that the syntax is ugly. If I assign a text string to the column u.range, the table is sorted in alphabetical order of that column instead of numerical order, which doesn't make sense. Therefore, I used a dirty trick to get the statistics, and played with the built-in graphing function. However, the visual result isn't very satisfactory.
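
To illustrate the sorting issue, here is a small Python sketch (not part of the original query). It shows that text labels such as "10-14" sort before "5-9", while labelling each bucket by its numeric upper bound keeps the rows in numerical order. The names `bucket` and `histogram` and the sample reputations are hypothetical; the bounds are copied from the CASE expression.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the buckets in the CASE expression; 400000 is its catch-all.
UPPER_BOUNDS = [4, 9, 14, 19, 49, 74, 99, 124, 249, 499, 999, 1999,
                2499, 2999, 4999, 9999, 14999, 19999, 24999]
CATCH_ALL = 400000

def bucket(reputation):
    """Return the upper bound of the bucket holding `reputation` (assumes >= 1)."""
    i = bisect_left(UPPER_BOUNDS, reputation)
    return UPPER_BOUNDS[i] if i < len(UPPER_BOUNDS) else CATCH_ALL

def histogram(reputations):
    """Count users per bucket, ordered numerically by bucket upper bound."""
    counts = Counter(bucket(r) for r in reputations)
    return sorted(counts.items())

text_labels = ["1-4", "5-9", "10-14", "15-19", "20-49"]
print(sorted(text_labels))  # text order puts "10-14" ahead of "5-9"
print(histogram([1, 3, 5, 12, 101, 30000]))  # made-up reputations
```
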

Anyone who has completed high school will realise that a log graph is better. Asking for this feature on Meta Stack Exchange takes time, and I believe that such a feature request would be rejected by the moderators to reduce the workload of the Stack Exchange company. Therefore, I plotted the log graph using GNU Octave.

Download the CSV file to get the data.
Change it to a GNU Octave script file.
Open it using Vim.
Do the necessary text substitutions so that the data becomes a matrix.
Complete the script file by adding the plot commands.
- Line format arguments
- Add coordinates to points
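
The text substitutions in the steps above were done by hand in Vim. As a rough scripted alternative, the Python sketch below turns CSV text into an Octave matrix literal. It assumes the CSV has one header row followed by two numeric columns with quoted values; the header names in `sample` and the function name `csv_to_octave_matrix` are hypothetical, and the three data rows are the first three rows of the matrix used in this post.

```python
import csv
import io

def csv_to_octave_matrix(csv_text, name="A"):
    """Turn a two-column CSV (with a header row) into an Octave matrix literal."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row (assumed to be present)
    # Join the cells of each row with commas; blank lines are ignored.
    rows = [",".join(cell.strip() for cell in row) for row in reader if row]
    # Rows are separated by ";" and the last row closes the bracket.
    return name + " = [\n" + ";\n".join(rows) + "];\n"

sample = 'range,Users\n"4","86698"\n"9","23074"\n"14","17793"\n'
print(csv_to_octave_matrix(sample))
```

The printed text can be pasted at the top of a script file, with the plot commands added below it.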

The source code for the log plot (mathse-rep.m):

A = [
4,86698;
9,23074;
14,17793;
19,9897;
49,22201;
74,5852;
99,3187;
124,39216;
249,9960;
499,4471;
999,2496;
1999,1416;
2499,319;
2999,223;
4999,482;
9999,356;
14999,155;
19999,81;
24999,45;
400000,152];
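% Plot user counts against reputation on log-log axes (black dots joined by lines).
loglog(A(:,1), A(:,2), ".-k");
% Label every point with its (reputation, users) coordinates.
for i = 1:numel(A(:,1))
  text(A(i,1), A(i,2), ['(' num2str(A(i,1)) ',' num2str(A(i,2)) ')'], ...
  "fontsize", 12);
end
title("User's Total Reputation Distribution on Math Stack Exchange", ...
"fontsize", 14);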

I chose loglog because semilogx causes the labels on the tail to overlap. Here are the results.
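
One plausible reading of the overlap: semilogx keeps the y-axis linear, so the small counts in the tail all sit almost at the bottom of the plot. The Python sketch below compares the relative height of the last four points on a linear and on a logarithmic y-axis. The axis limits (0 or 1 up to the largest count, 86698) are an assumption made for the arithmetic; Octave picks its own limits.

```python
import math

MAX_COUNT = 86698  # largest user count in the matrix A
# Last four rows of A: (reputation bucket, number of users).
tail = [(14999, 155), (19999, 81), (24999, 45), (400000, 152)]

def linear_height(count):
    """Fraction of a linear y-axis spanning 0..MAX_COUNT."""
    return count / MAX_COUNT

def log_height(count):
    """Fraction of a logarithmic y-axis spanning 1..MAX_COUNT."""
    return math.log10(count) / math.log10(MAX_COUNT)

for rep, count in tail:
    # Columns: bucket, height on a linear axis, height on a log axis.
    print(rep, round(linear_height(count), 4), round(log_height(count), 2))
```

On the linear axis all four points fall within the bottom 0.2% of the axis height, whereas on the log axis they are spread over roughly a tenth of it, which leaves more room for the coordinate labels.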

Lessons learnt

I can save plots in GNU Octave as SVG files. I learnt this after searching for "octave export svg". In Printing and Saving Plots, I found print -d[device], in which one can substitute the output format. For example, I used print -dsvg to generate the SVGs shown above.

Category: Sql | Blog 1
