Stack Overflow is sacred ground for programmers; its questions and answers have rescued countless copy-paste devotees. Even more valuable, it publishes its main database and allows anyone to download and analyze the data, so you can make sense of it in your own way, which is genuinely exciting. (GitHub does something similar.) This post is a modest starting point: it records how I used the dump to find especially valuable questions and answers, which I can then study in a targeted way.

Data Sources

  1. https://archive.org/download/stackexchange
  2. https://empirical-software.engineering/projects/sotorrent/

Stack Overflow has published its database more or less since the site launched. Since 2014 the dumps have been hosted on archive.org; even heavily compressed, the files total close to 100 GB, which is no small bandwidth bill, so the data is refreshed only every few months. Worse, archive.org offers only the latest dump, with no access to historical versions. That hurts anyone who needs comparisons over time, for example analyzing which languages and technologies are rising and which are fading. Fortunately, a researcher (link 2) is also interested in analyzing Stack Overflow and found an organization to host historical dumps. With that, we have our data sources.

Database Schema

  1. https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
  2. https://sedeschema.github.io/

Link 1 explains each field; link 2 has clearer diagrams.

For now I only need to find which questions and answers are high quality, so only the posts table is useful to me. I therefore downloaded the posts tables from March 2020 and December 2017.

Choosing a Data-Processing Tool

  1. Stack Overflow's own online query tool (the Stack Exchange Data Explorer). It runs on a relational database, so I was skeptical about its analytical performance and did not try it.
  2. Google BigQuery. Try it and you will understand: the query performance stunned me on the spot. It presents a relational schema and its SQL support is excellent; even multi-table joins are fast, and you can join against the other public datasets on the platform. Its biggest drawback: it is expensive. There is a free trial, but a handful of queries can exhaust the free quota.
  3. ClickHouse, a columnar database. Extreme performance and easy to pick up; even a single laptop (ideally with an SSD) can scan hundreds of millions of rows in seconds, which is more than enough for this task.

So this post uses ClickHouse. I will skip installation and basic usage; I use DataGrip as the client and import the data in CSV format.

Data Conversion

The downloaded data is compressed XML. Awkwardly, almost nobody analyzes XML directly, so it needs converting first. And even though I only downloaded the Posts table, some of its fields (such as the title and body) are useless to me; trimming them shrinks the files considerably, which also helps storage and analysis. Posts.xml is 15 GB compressed and 73 GB unpacked. I wrote a Rust script to convert the XML to CSV; it is single-threaded, but the performance is acceptable.
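For reference, the `row` elements being parsed look roughly like this (an illustrative example, not taken from the dump; only the attributes the script reads are shown):

```xml
<posts>
  <!-- One post per self-closing <row>; all data is stored as attributes.
       Tags arrive HTML-escaped, which is why the script calls decode_html. -->
  <row Id="42" PostTypeId="1" CreationDate="2008-07-31T21:42:52.667"
       Score="358" ViewCount="24247" Tags="&lt;c&gt;&lt;linux&gt;"
       AnswerCount="13" CommentCount="1" FavoriteCount="28" />
</posts>
```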

use csv;
use htmlescape::decode_html;
use quick_xml::events::attributes::Attribute;
use quick_xml::events::Event;
use quick_xml::Reader;
use std::fs::File;

/// Look up an attribute by name, falling back to `default` when it is absent.
fn get_attribute<'a, 'b>(attributes: &'b [Attribute<'a>], key: &str, default: &'b str) -> &'b str {
    let v = attributes
        .iter()
        .find(|a| a.key == key.as_bytes())
        .map_or(default.as_bytes(), |a| &a.value);
    std::str::from_utf8(v).unwrap()
}

fn main() {
    let mut buf = Vec::new();
    let mut reader = Reader::from_file("Posts.xml").unwrap();

    let csvf = File::create("posts.csv").expect("unable to create file");
    let mut wtr = csv::Writer::from_writer(csvf);
    wtr.write_record(&[
        "id",
        "create_date",
        "post_type",
        "score",
        "view_count",
        "tags",
        "answer_count",
        "comment_count",
        "favorite_count",
    ])
    .unwrap();

    let mut count = 0;
    loop {
        match reader.read_event(&mut buf) {
            // Every post is a (usually self-closing) <row .../> element.
            Ok(Event::Start(ref e)) | Ok(Event::Empty(ref e)) => match e.name() {
                b"row" => {
                    count += 1;

                    let attributes = e.attributes().map(|a| a.unwrap()).collect::<Vec<_>>();
                    let id = get_attribute(&attributes, "Id", "0");
                    let create_date =
                        get_attribute(&attributes, "CreationDate", "1990-01-01T01:49:46.220");
                    let post_type = get_attribute(&attributes, "PostTypeId", "0");
                    let score = get_attribute(&attributes, "Score", "0");
                    let view_count = get_attribute(&attributes, "ViewCount", "0");
                    let tags = get_attribute(&attributes, "Tags", "");
                    let answer_count = get_attribute(&attributes, "AnswerCount", "0");
                    let comment_count = get_attribute(&attributes, "CommentCount", "0");
                    let favorite_count = get_attribute(&attributes, "FavoriteCount", "0");

                    // "2008-07-31T21:42:52.667" -> "2008-07-31 21:42:52",
                    // the format ClickHouse parses as DateTime.
                    let create_date = format!("{} {}", &create_date[0..10], &create_date[11..19]);
                    let create_date = create_date.as_str();

                    // Tags are HTML-escaped, e.g. "&lt;c&gt;&lt;linux&gt;".
                    // Decode, then turn "<c><linux>" into "['c','linux']" so
                    // ClickHouse can parse the column as Array(String).
                    let decode_tag = decode_html(tags).unwrap();
                    let tag = decode_tag
                        .trim_matches('>')
                        .trim_matches('<')
                        .split("><")
                        .map(|s| format!("'{}'", s))
                        .collect::<Vec<String>>()
                        .join(",");

                    let tag = format!("[{}]", tag);
                    let tag = tag.as_str();

                    wtr.write_record(&[
                        id,
                        create_date,
                        post_type,
                        score,
                        view_count,
                        tag,
                        answer_count,
                        comment_count,
                        favorite_count,
                    ])
                    .unwrap();
                }
                _ => println!("{:?}", std::str::from_utf8(e.name()).unwrap()),
            },
            Ok(Event::Eof) => break,
            _ => (),
        }
        // Reuse the buffer; without this it keeps growing across the 73 GB file.
        buf.clear();
    }

    wtr.flush().unwrap();
}

The code above converts the March 2020 data; the December 2017 XML has the same layout, so its conversion is the same minus a few fields.
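For completeness, a Cargo.toml matching the API the script uses; the version numbers are my assumptions (the `read_event(&mut buf)` style belongs to the older quick-xml line):

```toml
[package]
name = "so-posts-to-csv"
version = "0.1.0"
edition = "2018"

[dependencies]
# The buffer-based read_event API is from the pre-0.23 quick-xml releases.
quick-xml = "0.17"
csv = "1"
htmlescape = "0.3"
```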

Loading the Data into the Database

Creating the tables

create table posts
(
    id             UInt64,
    create_date    DateTime,
    post_type      Int8,
    score          Int32,
    view_count     UInt32,
    tags           Array(String),
    answer_count   UInt16,
    comment_count  UInt16,
    favorite_count UInt16
)
engine = MergeTree PARTITION BY toYear(create_date) ORDER BY create_date;

create table posts2017
(
    id             UInt64,
    score          Int32,
    view_count     UInt32,
    answer_count   UInt16,
    comment_count  UInt16,
    favorite_count UInt16
)
engine = MergeTree PARTITION BY id % 10 ORDER BY id;

Importing the CSV files into ClickHouse

# Assuming the 2017 conversion was written to posts2017.csv (the original
# text piped posts.csv into both tables, which cannot match both schemas).
cat posts2017.csv | clickhouse-client --query="INSERT INTO posts2017 Format CSVWithNames";
cat posts.csv | clickhouse-client --query="INSERT INTO posts Format CSVWithNames";
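A quick sanity check after the import is worthwhile (a suggested step, not part of the original workflow):

```sql
-- Row counts and the covered date range should match expectations:
-- the 2020 table should reach into 2020, and posts2017 should be smaller.
select count(), min(create_date), max(create_date) from posts;
select count() from posts2017;
```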

Merge the two tables into one to speed up queries

create table posts_all engine = MergeTree PARTITION BY toYear(create_date) ORDER BY create_date AS
select id,
       tags,
       create_date,
       post_type,
       score,
       view_count,
       answer_count,
       comment_count,
       favorite_count,
       posts2017.score          as _score,
       posts2017.view_count     as _view_count,
       posts2017.answer_count   as _answer_count,
       posts2017.comment_count  as _comment_count,
       posts2017.favorite_count as _favorite_count
from posts
    any left join posts2017 on posts2017.id = posts.id;
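One consequence of the `any left join` worth keeping in mind: posts created after the December 2017 dump have no match in posts2017, so ClickHouse fills their `_*` columns with default values (zeros), and a delta such as `score - _score` degenerates to the raw 2020 value for them. A quick way to gauge how many rows this affects:

```sql
-- Posts with no 2017 counterpart get zero-filled _* columns.
select count() from posts_all where id not in (select id from posts2017);
```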

Start Analyzing

Everything is ready; from here the analysis is up to you. Below is one query I wrote, for reference.

Finding valuable C questions

select id, groupArray(t) x
from (
      select arrayJoin(tags) tag, id, 'score' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (score - _score) desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'view' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (view_count - _view_count) desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'answer' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (answer_count - _answer_count) desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'comment' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (comment_count - _comment_count) desc
      limit 30
      union all
      select arrayJoin(tags) tag, id, 'favorite' t
      from posts_all
      where tag = 'c'
        and post_type = 1
      order by (favorite_count - _favorite_count) desc
      limit 30
     )
group by id
order by length(x) desc, x desc;

The results:

id x
572547 [‘favorite’,’view’,’score’]
1041866 [‘favorite’,’view’,’score’]
47981 [‘favorite’,’view’,’score’]
37538 [‘favorite’,’view’,’score’]
1675351 [‘favorite’,’view’,’score’]
24853 [‘favorite’,’view’,’score’]
58224638 [‘answer’,’score’,’comment’]
612097 [‘view’,’score’]
1921539 [‘view’,’score’]
201101 [‘view’,’score’]
16931770 [‘view’,’score’]
8547778 [‘favorite’,’score’]
840501 [‘favorite’,’score’]
1410563 [‘favorite’,’score’]
1642028 [‘favorite’,’score’]
7825055 [‘favorite’,’score’]
605845 [‘favorite’,’score’]
1143262 [‘favorite’,’score’]
65820 [‘favorite’,’score’]
9229601 [‘favorite’,’score’]
1711631 [‘favorite’,’score’]
21593 [‘favorite’,’score’]
36827659 [‘favorite’,’score’]
1433204 [‘favorite’,’score’]
57842756 [‘answer’,’score’]
52509602 [‘answer’,’favorite’]
4264127 [‘view’]
3501338 [‘view’]
330793 [‘view’]
2279379 [‘view’]
822323 [‘view’]
5029840 [‘view’]
3437404 [‘view’]
38561 [‘view’]
2386772 [‘view’]
1088622 [‘view’]
308695 [‘view’]
111928 [‘view’]
628761 [‘view’]
15004944 [‘view’]
19641597 [‘view’]
27272525 [‘view’]
17307275 [‘view’]
8257714 [‘view’]
7021725 [‘view’]
8440816 [‘view’]
57650895 [‘score’]
1538420 [‘score’]
204476 [‘score’]
2550774 [‘score’]
381542 [‘score’]
5134891 [‘favorite’]
98650 [‘favorite’]
4955198 [‘favorite’]
31162367 [‘favorite’]
5431941 [‘favorite’]
119123 [‘favorite’]
63166 [‘favorite’]
145270 [‘favorite’]
81656 [‘favorite’]
562303 [‘favorite’]
55307892 [‘comment’]
48494733 [‘comment’]
48844198 [‘comment’]
55069186 [‘comment’]
55101427 [‘comment’]
49649253 [‘comment’]
50226598 [‘comment’]
49040259 [‘comment’]
50895849 [‘comment’]
53657018 [‘comment’]
48125683 [‘comment’]
58714944 [‘comment’]
48515115 [‘comment’]
51181037 [‘comment’]
48912653 [‘comment’]
47691216 [‘comment’]
48134447 [‘comment’]
51867939 [‘comment’]
53563600 [‘comment’]
45524284 [‘comment’]
47777205 [‘comment’]
48505911 [‘comment’]
48176305 [‘comment’]
56406841 [‘comment’]
47616508 [‘comment’]
48420442 [‘comment’]
48914413 [‘comment’]
57575599 [‘comment’]
52451184 [‘comment’]
51638803 [‘answer’]
53444743 [‘answer’]
53832063 [‘answer’]
50530113 [‘answer’]
58278260 [‘answer’]
50951261 [‘answer’]
56208872 [‘answer’]
49456599 [‘answer’]
48608993 [‘answer’]
53298646 [‘answer’]
52442415 [‘answer’]
466204 [‘answer’]
53109888 [‘answer’]
50288322 [‘answer’]
59925618 [‘answer’]
52359906 [‘answer’]
48315523 [‘answer’]
48868367 [‘answer’]
122616 [‘answer’]
50822596 [‘answer’]
47906401 [‘answer’]
51548994 [‘answer’]
48226806 [‘answer’]
8487986 [‘answer’]
2602823 [‘answer’]
54025987 [‘answer’]
60113944 [‘answer’]
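To actually read any of the posts above, the ids map straight to URLs. A small helper, hypothetical and not part of the original script; since the query filters on post_type = 1, every id here is a question:

```rust
// Build the canonical question URL for a Stack Overflow post id.
// The analysis query only selects questions (post_type = 1), so the
// /questions/ path is always the right one; the site fills in the slug.
fn question_url(id: u64) -> String {
    format!("https://stackoverflow.com/questions/{}", id)
}

fn main() {
    // A few of the top-ranked ids from the result table above.
    for id in [572547u64, 1041866, 47981] {
        println!("{}", question_url(id));
    }
}
```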

References

All Stack Exchange data dumps