[Learn a New Language in 30 Days: Ruby] Day 6: A Crawler That Automatically Searches for Sensitive Information

2025-08-15

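💡 Why write a crawler?

If you look at common security incidents, you will find that many data leaks are not the result of an attack; a developer accidentally left sensitive information somewhere public.

Examples include an API key hard-coded in JavaScript, test account credentials left in comments, or email addresses published without any protection.

Today we write a crawler that automatically searches web pages for this kind of sensitive information, using regular expressions (regex) to find data that matches specific patterns.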

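🎯 Outline

Fetch page content with open-uri
Write regexes to find emails, API keys, passwords, and similar data
Recursively crawl multiple levels of pages
Output the results as a report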

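📚 Key concepts

URI.open(): opens a web page and reads its content (provided by open-uri)
/pattern/: Ruby's regular expression literal
string.scan(/regex/): finds every substring that matches
Set.new: a set, used here to avoid duplicates
URI.join(): resolves relative paths into absolute URLs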
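
To make these building blocks concrete, here is a minimal sketch. The sample text and URLs are made up for illustration, and the two regexes are simplified versions of the kind used later in this post.

```ruby
require 'set'
require 'uri'

# Hypothetical sample text, invented for this illustration.
text = 'Contact admin@example.com or admin@example.com; api_key = "abcdefghij0123456789XYZ"'

# Without capture groups, scan returns an array of whole matches.
emails = text.scan(/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/)

# With a capture group, scan returns one inner array per match,
# so flatten is needed to get plain strings back.
keys = text.scan(/api[_-]?key\s*[:=]\s*["']?([a-zA-Z0-9_-]{20,})["']?/i).flatten

# A Set keeps only one copy of each value.
unique_emails = Set.new(emails)

# URI.join resolves a relative link against the page it was found on.
absolute = URI.join('http://example.com/docs/index.html', '../about').to_s

p emails, keys, unique_emails.to_a, absolute
```

The difference between the two scan calls is the reason a crawler that mixes patterns with and without capture groups needs to flatten the result before storing it.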

💻 Implementation

require 'open-uri'
require 'nokogiri'
require 'uri'
require 'set'

class SensitiveCrawler
  def initialize(base_url, max_depth = 2)
    @base_url = base_url
    @max_depth = max_depth
    @visited = Set.new
    @findings = {
      emails: Set.new,
      api_keys: Set.new,
      passwords: Set.new,
      tokens: Set.new,
      private_ips: Set.new,
      s3_buckets: Set.new
    }
  end

  # Regex patterns for each kind of sensitive data
  def patterns
    {
      emails: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,

      # Common API key assignment formats
      api_keys: /(?:api[_\-]?key|api_secret)\s*[:=]\s*["']?([a-zA-Z0-9_-]{20,})["']?/i,

      # AWS access key
      aws_keys: /AKIA[0-9A-Z]{16}/,

      # Possible passwords
      passwords: /(?:password|passwd|pwd)\s*[:=]\s*["']([^"']{4,})["']/i,

      # JWT token
      tokens: /eyJ[A-Za-z0-9_=-]+\.[A-Za-z0-9_=-]+\.?[A-Za-z0-9_.+\/=-]*/,

      # Private IPs (RFC 1918 ranges); \b keeps the match from starting mid-number
      private_ips: /\b(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})\b/,

      # S3 bucket
      s3_buckets: /(?:s3\.amazonaws\.com\/[a-z0-9.-]+|[a-z0-9.-]+\.s3\.amazonaws\.com)/i
    }
  end

  def crawl(url = @base_url, depth = 0)
    return if depth > @max_depth
    return if @visited.include?(url)

    @visited.add(url)
    puts "Crawling: #{url} (depth: #{depth})"

    begin
      # The block form closes the underlying IO once the body has been read
      content = URI.open(url, 'User-Agent' => 'Mozilla/5.0', &:read)

      # Search the page for sensitive data
      search_sensitive_data(content, url)

      # Follow links only while below the maximum depth
      if depth < @max_depth
        doc = Nokogiri::HTML(content)
        links = doc.css('a[href]').map { |link| link['href'] }

        links.each do |link|
          next if link.nil? || link.empty?

          begin
            # Resolve relative paths against the current page
            absolute_url = URI.join(url, link).to_s
            # Only follow links on the same host as the base URL
            # (a plain prefix check would also accept hosts like example.com.evil.test)
            crawl(absolute_url, depth + 1) if URI(absolute_url).host == URI(@base_url).host
          rescue URI::Error
            # Skip links that are not valid URLs
          end
        end
      end
    rescue StandardError => e
      puts "Error crawling #{url}: #{e.message}"
    end
  end

  def search_sensitive_data(content, source_url)
    patterns.each do |type, pattern|
      # scan returns nested arrays when the pattern has capture groups, so flatten
      content.scan(pattern).flatten.each do |match|
        next if match.nil? || match.empty?

        # Store by type
        case type
        when :emails
          @findings[:emails].add(match)
        when :api_keys, :aws_keys
          @findings[:api_keys].add(match)
        when :passwords
          # Filter out obvious placeholder passwords
          unless match.match?(/\A(?:password|example|test|demo|sample)\z/i)
            @findings[:passwords].add("#{match} (found in: #{source_url})")
          end
        when :tokens
          # Show only the first 50 characters of long tokens
          @findings[:tokens].add(match.length > 50 ? "#{match[0, 50]}..." : match)
        when :private_ips
          @findings[:private_ips].add(match)
        when :s3_buckets
          @findings[:s3_buckets].add(match)
        end
      end
    end
  end

  def report
    puts "\n" + "=" * 60
    puts "SENSITIVE DATA REPORT"
    puts "=" * 60

    @findings.each do |type, items|
      next if items.empty?

      puts "\n#{type.to_s.upcase.tr('_', ' ')} (Found: #{items.size})"
      items.first(10).each do |item|
        puts "    - #{item}"
      end
      puts "    ... and #{items.size - 10} more" if items.size > 10
    end

    puts "\nNo sensitive data found" if @findings.values.all?(&:empty?)

    puts "\nTotal pages crawled: #{@visited.size}"
  end
end

# Usage example
if __FILE__ == $PROGRAM_NAME
  target = ARGV[0] || 'http://example.com'
  max_depth = 2

  puts "Starting sensitive data crawler"
  puts "Target: #{target}"
  puts "Max depth: #{max_depth}"

  crawler = SensitiveCrawler.new(target, max_depth)
  crawler.crawl
  crawler.report
end
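
Before pointing the crawler at a live site, it can help to check what the patterns capture on a small, known input. The sketch below runs a few patterns adapted from the class over a made-up HTML snippet, with no network access. The snippet, its fake credentials, and the `sample` and `results` names are assumptions for illustration only.

```ruby
# Hypothetical page content, invented for this sketch (not from a real site).
sample = <<~HTML
  <!-- password = "hunter2-staging" -->
  <script>const apiKey = "abcdefghijklmnopqrstuvwxyz";</script>
  <a href="mailto:ops@example.com">ops</a>
  <p>Internal host: 192.168.1.100, public host: 8.8.8.8</p>
  AKIA0123456789ABCDEF
HTML

# A subset of the crawler's patterns, copied here so the sketch is self-contained.
patterns = {
  emails: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
  api_keys: /(?:api[_\-]?key|api_secret)\s*[:=]\s*["']?([a-zA-Z0-9_-]{20,})["']?/i,
  aws_keys: /AKIA[0-9A-Z]{16}/,
  passwords: /(?:password|passwd|pwd)\s*[:=]\s*["']([^"']{4,})["']/i,
  private_ips: /(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})/
}

# Run every pattern over the sample and keep the unique captured strings.
results = patterns.transform_values { |re| sample.scan(re).flatten.uniq }

results.each { |type, hits| puts "#{type}: #{hits.inspect}" }
```

Each pattern should pick out exactly one string from the snippet, and the public address 8.8.8.8 should not appear under private_ips. A check like this makes it easier to spot a pattern that is too loose or too strict before a real crawl.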

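🚀 How to run

Create the file sensitive_crawler.rb and paste in the code above.
Install Nokogiri if it is not already available (gem install nokogiri); the other libraries ship with Ruby.
Run the script, optionally passing the target site:

ruby sensitive_crawler.rb http://target-website.com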

Expected output (illustrative; the pages crawled and the findings depend on the target site):

Starting sensitive data crawler
Target: http://example.com
Max depth: 2
Crawling: http://example.com (depth: 0)
Crawling: http://example.com/about (depth: 1)
Crawling: http://example.com/contact (depth: 1)

============================================================
SENSITIVE DATA REPORT
============================================================

EMAILS (Found: 5)
    - admin@example.com
    - support@example.com
    - test@example.com

API KEYS (Found: 2)
    - Example123
    - Example123456

PRIVATE IPS (Found: 1)
    - 192.168.1.100

Total pages crawled: 15