I search the data on Stack Exchange Data Explorer.
I want to select columns from two tables and join them to form one
single table. I choose INNER JOIN to avoid seeing null entries in
the result. The syntax is similar to the following.
SELECT * FROM Table1
INNER JOIN Table2
ON Table1.ID = Table2.ID
WHERE Col1 LIKE 'foo'
Unluckily, when the size of Table1 and Table2 is large, it takes a
while to get the result.
This page has two more efficient ways. Since I search for recent data, I adopted the third method in my query.
DECLARE @TagLike NVARCHAR(25) = ##taglike:string##

SELECT TOP 500 * FROM (SELECT Id AS [Post Link], AnswerCount AS [Ans],
CommentCount AS [Com], CreationDate, Score AS [Scr],
ViewCount AS [Views], OwnerUserId FROM Posts
WHERE AnswerCount = 0 AND Tags LIKE '%' + @TagLike + '%' AND
ClosedDate IS NULL) AS p
INNER JOIN (SELECT Id, LastAccessDate, Reputation AS [Rep] FROM Users
WHERE LastAccessDate >= '2015-12-01') AS u
ON p.OwnerUserId = u.Id
ORDER BY p.Com
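The query above narrows Posts and Users inside derived tables before joining them. As a minimal sketch of that idea, the following Python script uses the standard sqlite3 module with two tiny made-up tables (the table names, columns, and rows are hypothetical, not the Data Explorer schema) and checks that filtering first returns the same rows as joining first. Whether it is actually faster depends on the database engine's query planner; SQLite is used here only because it runs locally.

```python
import sqlite3

# Hypothetical miniature tables; not the real Stack Exchange schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, owner_id INTEGER, answers INTEGER);
    CREATE TABLE users (id INTEGER PRIMARY KEY, last_access TEXT);
    INSERT INTO posts VALUES (1, 10, 0), (2, 10, 3), (3, 11, 0), (4, 12, 0);
    INSERT INTO users VALUES (10, '2015-12-05'), (11, '2015-11-20'), (12, '2016-01-02');
""")

# Join the full tables first, then filter the joined rows.
join_then_filter = """
    SELECT p.id, u.id
    FROM posts AS p
    INNER JOIN users AS u ON p.owner_id = u.id
    WHERE p.answers = 0 AND u.last_access >= '2015-12-01'
    ORDER BY p.id
"""

# Filter each table in a derived table, then join the smaller results.
filter_then_join = """
    SELECT p.id, u.id
    FROM (SELECT id, owner_id FROM posts WHERE answers = 0) AS p
    INNER JOIN (SELECT id FROM users WHERE last_access >= '2015-12-01') AS u
        ON p.owner_id = u.id
    ORDER BY p.id
"""

rows_a = conn.execute(join_then_filter).fetchall()
rows_b = conn.execute(filter_then_join).fetchall()
print(rows_a)            # (post id, user id) pairs that pass both filters
print(rows_a == rows_b)  # both formulations select the same rows
```

Both statements describe the same result set; the second only makes the early filtering explicit, which is the shape used in the Data Explorer query.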
Recently, I post math on Mathematics Stack Exchange instead of here.
How can one get a table for the distribution of reputation on that site?
Write a SQL query on Stack Exchange Data Explorer.
select u.range as [prev row <= Reputation <= this row], count(*) as [Users]
from (
    select case
        when Reputation between 1 and 4 then 4
        when Reputation between 5 and 9 then 9
        when Reputation between 10 and 14 then 14
        when Reputation between 15 and 19 then 19
        when Reputation between 20 and 49 then 49
        when Reputation between 50 and 74 then 74
        when Reputation between 75 and 99 then 99
        when Reputation between 100 and 124 then 124
        when Reputation between 125 and 249 then 249
        when Reputation between 250 and 499 then 499
        when Reputation between 500 and 999 then 999
        when Reputation between 1000 and 1999 then 1999
        when Reputation between 2000 and 2499 then 2499
        when Reputation between 2500 and 2999 then 2999
        when Reputation between 3000 and 4999 then 4999
        when Reputation between 5000 and 9999 then 9999
        when Reputation between 10000 and 14999 then 14999
        when Reputation between 15000 and 19999 then 19999
        when Reputation between 20000 and 24999 then 24999
        else 400000 end as range
    from Users) u
group by u.range
The indentation is automatically done by Vim. I know that the
syntax is ugly. If I assign a text string to the column u.range,
then the table is sorted in alphabetical order of that column
instead of numerical order. This doesn’t make sense. Therefore, I
used a dirty way to get the statistics, and played with the built-in
graphing function. However, the visual result isn’t so
satisfactory.
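The sorting problem arises because text labels are compared character by character, so '124' sorts before '14'. One way around it, sketched below in Python rather than SQL, is to keep the numeric upper bound as the sort key and attach a readable label only for display. The bounds are a shortened copy of the ones in the query; the sample reputations and the helper name `bucket_counts` are made up for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the first few buckets from the query (shortened list).
BOUNDS = [4, 9, 14, 19, 49, 74, 99, 124]

def bucket_counts(reputations):
    """Count reputations per bucket, returned in numerical bucket order."""
    counts = Counter()
    for rep in reputations:
        # bisect_left finds the first bound >= rep, i.e. the bucket's upper bound.
        idx = bisect_left(BOUNDS, rep)
        if idx < len(BOUNDS):  # values above the last bound are ignored here
            counts[idx] += 1
    rows = []
    for idx in sorted(counts):  # sort by the numeric key, not by the label
        low = 1 if idx == 0 else BOUNDS[idx - 1] + 1
        rows.append((f"{low}-{BOUNDS[idx]}", counts[idx]))
    return rows

# Text labels sort alphabetically, which is the behaviour described above.
print(sorted(["4", "9", "14", "124"]))
# Numeric keys keep the buckets in the intended order.
print(bucket_counts([1, 3, 7, 12, 101, 110, 124]))
```

The same idea should carry over to a query by selecting both a numeric bound and a text label and ordering by the numeric one, though that variant is not tested here.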
Anyone who has completed high school will realise that a log graph is better. Asking for this feature on Meta Stack Exchange takes time. I believe that such a feature request would be rejected by the moderators to reduce the workload of the Stack Exchange company. Therefore, I plot the log graph using GNU Octave.
A = [
4,86698;
9,23074;
14,17793;
19,9897;
49,22201;
74,5852;
99,3187;
124,39216;
249,9960;
499,4471;
999,2496;
1999,1416;
2499,319;
2999,223;
4999,482;
9999,356;
14999,155;
19999,81;
24999,45;
400000,152];
loglog(A(:,1), A(:,2), ".-k");
for i = 1:numel(A(:,1))
text(A(i,1), A(i,2), ['(' num2str(A(i,1)) ',' num2str(A(i,2)) ')'], ...
"fontsize", 12);
end
title("User's Total Reputation Distribution on Math Stack Exchange", ...
"fontsize", 14);
I choose loglog because semilogx causes the labels on the tail to
overlap. Here are the results.
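As for the choice of loglog over semilogx: semilogx keeps the vertical axis linear, so the small counts in the tail land almost on top of each other. The sketch below, in Python with only the standard library, compares where the last four rows of A fall as a fraction of the axis height. The helper names are hypothetical, and the axis limits are assumptions: the linear axis is taken to run from 0 to the largest count in A, and the log axis from 1 to that count.

```python
import math

# Last four (reputation bound, users) rows of A, plus the largest count in A.
TAIL = [(14999, 155), (19999, 81), (24999, 45), (400000, 152)]
MAX_COUNT = 86698

def linear_height(count):
    """Fraction of the axis height on a linear axis from 0 to MAX_COUNT."""
    return count / MAX_COUNT

def log_height(count):
    """Fraction of the axis height on a log axis from 1 to MAX_COUNT."""
    return math.log10(count) / math.log10(MAX_COUNT)

linear = [linear_height(c) for _, c in TAIL]
logged = [log_height(c) for _, c in TAIL]

# Vertical spread of the tail points, as a fraction of the axis height.
linear_spread = max(linear) - min(linear)
log_spread = max(logged) - min(logged)
print(f"linear spread: {linear_spread:.4f}")
print(f"log spread:    {log_spread:.4f}")
```

Under these assumptions the tail points share well under one percent of a linear axis, but roughly a tenth of a logarithmic one, which leaves room for the text labels.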
I can save plots in GNU Octave as an SVG file. I learned this after
searching “octave export svg”. From
Printing and Saving Plots, I see print -d[device],
in which one can substitute the output format. For example, I used
print -dsvg to generate the SVGs shown above.