# access_log
Receives access logs from stdin in JSON format and stores them in a database. It intentionally doesn't collect IP addresses. It doesn't respect the Do Not Track (DNT) header, though, because no personally identifiable data is collected. Referrer collection is optional, but we strongly suggest using a referrer policy that doesn't record full URLs.
See the Rails migration for the database schema, the Nginx configuration, and the site configuration.
It supports SQLite3 and PostgreSQL databases :)
## Build
Install the zlib, sqlite3, and ssl development files (package names vary between distributions).
Install Crystal and its development tools (this also varies).
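For example, on a Debian-based system the dependencies look roughly like this (the package names are assumptions and differ between distributions; Crystal itself usually comes from the upstream repositories):

```sh
# Debian/Ubuntu sketch -- package names are assumptions, adjust for your distribution.
sudo apt install zlib1g-dev libsqlite3-dev libssl-dev make

# Alpine equivalent (crystal and shards live in the community repository).
# apk add zlib-dev sqlite-dev openssl-dev crystal shards make
```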
Run:

```sh
make
```
### Build for Alpine

```sh
make alpine-build
```
## Database configuration

Create an `access_logs` database with the following schema (a sketch in SQL follows the table):
| Field | Type | Description | Index? |
|---|---|---|---|
| id | String | UUID | Unique |
| host | String | Host name | Yes |
| msec | Float | Unix timestamp of visit | ? |
| server_protocol | String | HTTP/Version | ? |
| request_method | String | GET/POST/etc. | ? |
| request_completion | String | "OK" | ? |
| uri | String | Request URI | Yes |
| query_string | String | Arguments | ? |
| status | Integer | HTTP status | ? |
| sent_http_content_type | String | MIME type of response | ? |
| sent_http_content_encoding | String | Compression | ? |
| sent_http_etag | String | ETag header | ? |
| sent_http_last_modified | String | Last modified date | ? |
| http_accept | String | MIME types requested | ? |
| http_accept_encoding | String | Compression accepted | ? |
| http_accept_language | String | Languages supported | ? |
| http_pragma | String | Pragma header | ? |
| http_cache_control | String | Cache requested | ? |
| http_if_none_match | String | ETag requested | ? |
| http_dnt | String | Do Not Track header | ? |
| http_user_agent | String | User Agent | Yes |
| http_origin | String | Request origin | Yes |
| http_referer | String | Referer (see Referrer Policy) | Yes |
| request_time | Float | Request duration | ? |
| bytes_sent | Integer | Bytes sent | ? |
| body_bytes_sent | Integer | Bytes sent not including headers | ? |
| request_length | Integer | Request length (request line, headers, and body) | ? |
| http_connection | String | Connection status | ? |
| pipe | String | Whether the request was pipelined | ? |
| connection_requests | Integer | Requests done on the same connection | ? |
| geoip2_data_country_name | String | Country according to GeoIP | Yes |
| geoip2_data_city_name | String | City according to GeoIP | Yes |
| ssl_server_name | String | SNI | ? |
| ssl_protocol | String | SSL/TLS version used | ? |
| ssl_early_data | String | TLSv1.3 early data used | ? |
| ssl_session_reused | String | TLS session reused | ? |
| ssl_curves | String | Curves used | ? |
| ssl_ciphers | String | Ciphers supported by the client | ? |
| ssl_cipher | String | Cipher used | ? |
| sent_http_x_xss_protection | String | XSS Protection sent | ? |
| sent_http_x_frame_options | String | Frame protection sent | ? |
| sent_http_x_content_type_options | String | Content protection sent | ? |
| sent_http_strict_transport_security | String | HSTS sent | ? |
| nginx_version | String | Server version | ? |
| pid | Integer | Server PID | ? |
| crawler | Boolean | Web crawler detected | ? |
| remote_user | String | HTTP Basic auth user | ? |
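The Rails migration is the authoritative source for the schema. As a rough sketch, an SQLite3 version covering only a handful of the columns above could look like this (the column subset and index names are illustrative):

```sh
# Sketch only: a few of the columns from the table above, mapped to SQLite3 types.
sqlite3 access_logs.db <<'SQL'
CREATE TABLE access_logs (
  id TEXT PRIMARY KEY,   -- UUID
  host TEXT,
  msec REAL,             -- Unix timestamp of the visit
  request_method TEXT,
  uri TEXT,
  status INTEGER,
  http_user_agent TEXT,
  http_referer TEXT,
  crawler BOOLEAN
  -- ...the remaining columns follow the table above
);
CREATE INDEX index_access_logs_on_host ON access_logs (host);
CREATE INDEX index_access_logs_on_uri  ON access_logs (uri);
SQL
```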
## Nginx configuration
Configure Nginx to format the access log as JSON. You can set
`http_referer.policy` to one of `unsafe-url`, `no-referrer`, `origin`,
`origin-when-cross-origin`, `same-origin`, `strict-origin`,
`strict-origin-when-cross-origin`, or `no-referrer-when-downgrade`.
```json
{
  "http_referer": {
    "referrer": "$http_referer",
    "origin": "$http_origin",
    "policy": "origin-when-cross-origin"
  }
}
```
Note: the inner key is `referrer` but the parent key is `http_referer`
(double and single "r" respectively; the single-"r" spelling is a typo
inherited from the HTTP specification).
Install `daemonize` and run `access_logd` to create `access.log` as a FIFO
node, so that Nginx writes to it and `access_log` reads from it.
Check `/var/log/nginx/error.log` when debugging.
`ACCESS_LOG_FLAGS` is the environment variable used to pass flags to `access_logd`.
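A minimal sketch of that setup, assuming `access_logd` is installed at `/usr/local/bin/access_logd` and the default log path is used (the paths and the stderr log file are placeholders, not the program's actual interface):

```sh
# Sketch: run access_logd in the background via daemonize.
export ACCESS_LOG_FLAGS=""   # flags forwarded to access_logd; fill in as needed
daemonize -e /var/log/access_logd.err /usr/local/bin/access_logd

# The access log should now be a FIFO (file type "p") that Nginx writes into.
ls -l /var/log/nginx/access.log
```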
For a working example check our Nginx
container.
```nginx
log_format main escape=json '{"host":"$host","msec":$msec,"server_protocol":"$server_protocol","request_method":"$request_method","request_completion":"$request_completion","uri":"$uri","query_string":"$query_string","status":$status,"sent_http_content_type":"$sent_http_content_type","sent_http_content_encoding":"$sent_http_content_encoding","sent_http_etag":"$sent_http_etag","sent_http_last_modified":"$sent_http_last_modified","http_accept":"$http_accept","http_accept_encoding":"$http_accept_encoding","http_accept_language":"$http_accept_language","http_pragma":"$http_pragma","http_cache_control":"$http_cache_control","http_if_none_match":"$http_if_none_match","http_dnt":"$http_dnt","http_user_agent":"$http_user_agent","http_origin":"$http_origin","http_referer":{"origin":"$http_origin","referrer":"$http_referer","policy":"origin-when-cross-origin"},"request_time":$request_time,"bytes_sent":$bytes_sent,"body_bytes_sent":$body_bytes_sent,"request_length":$request_length,"http_connection":"$http_connection","pipe":"$pipe","connection_requests":$connection_requests,"geoip2_data_country_name":"$geoip2_data_country_name","geoip2_data_city_name":"$geoip2_data_city_name","ssl_server_name":"$ssl_server_name","ssl_protocol":"$ssl_protocol","ssl_early_data":"$ssl_early_data","ssl_session_reused":"$ssl_session_reused","ssl_curves":"$ssl_curves","ssl_ciphers":"$ssl_ciphers","ssl_cipher":"$ssl_cipher","sent_http_x_xss_protection":"$sent_http_x_xss_protection","sent_http_x_frame_options":"$sent_http_x_frame_options","sent_http_x_content_type_options":"$sent_http_x_content_type_options","sent_http_strict_transport_security":"$sent_http_strict_transport_security","nginx_version":"$nginx_version","pid":"$pid","remote_user":""}';
access_log /var/log/nginx/access.log main;
```
## Crawler user agents
Download the crawler user agents database and feed it as an argument to
`access_log`, which will then try to detect whether a user agent belongs to
a web crawler.
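As a sketch, assuming the database meant here is the JSON list from the monperrus/crawler-user-agents project and that the binary ends up in `bin/` (both are assumptions; the actual link and invocation may differ):

```sh
# Assumption: crawler-user-agents.json from github.com/monperrus/crawler-user-agents.
curl -LO https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json

# Sketch invocation: the JSON database as argument, the JSON access log on stdin.
./bin/access_log crawler-user-agents.json < /var/log/nginx/access.log
```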
## TODO
- Make some fields optional