# access_log
Receives access logs on a UNIX socket in JSON format and stores them in
a database. It **intentionally** doesn't collect IP addresses. It
doesn't respect the Do Not Track (DNT) header, though, because we're not
collecting personally identifiable data. Referrer collection is
optional, but we **strongly** suggest using a referrer policy that
doesn't collect full addresses.
See the [Rails
migration](https://0xacab.org/sutty/sutty/blob/rails/db/migrate/20200118155319_create_access_log.rb)
for the database schema, the [Nginx
configuration](https://0xacab.org/sutty/containers/nginx/blob/master/nginx/nginx.conf),
and the [site
configuration](https://0xacab.org/sutty/ansible-sutty/blob/master/templates/sites.conf.j2).
It supports SQLite3 and PostgreSQL databases :)
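
As a rough illustration of the flow described above (and not the actual Crystal implementation), the following Python sketch parses one JSON log line and stores a few of its fields in an in-memory SQLite database. The `access_logs` table name, the chosen columns and the `store_line` helper are assumptions made for this example. Any key that isn't a known column, such as an IP address, is simply dropped.

```python
import json
import sqlite3
import uuid

# Hypothetical subset of columns; the real schema has many more fields.
COLUMNS = ("host", "msec", "uri", "status", "http_user_agent")


def store_line(conn, line):
    """Parse one JSON access log line and insert only the known fields."""
    record = json.loads(line)
    row_id = str(uuid.uuid4())
    values = [record.get(column) for column in COLUMNS]
    placeholders = ", ".join("?" for _ in range(len(COLUMNS) + 1))
    conn.execute(
        f"INSERT INTO access_logs (id, {', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        [row_id, *values],
    )
    return row_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access_logs (id TEXT PRIMARY KEY, host TEXT, msec REAL, "
    "uri TEXT, status INTEGER, http_user_agent TEXT)"
)

# "remote_addr" is not a known column, so it is never stored.
line = (
    '{"host":"example.org","msec":1674434304.123,"uri":"/","status":200,'
    '"http_user_agent":"curl/8.0","remote_addr":"203.0.113.7"}'
)
store_line(conn, line)
print(conn.execute("SELECT host, uri, status FROM access_logs").fetchall())
```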
## Sustainable Web Design
When enabled, you can track CO2 emissions using the [Sustainable Web Design
"Calculating Digital Emissions"
method](https://sustainablewebdesign.org/calculating-digital-emissions/).
The algorithm and data are based on
[CO2.js](https://github.com/thegreenwebfoundation/co2.js).
It follows the calculations, with the added (optional) feature of using
the origin country of the visit for the "consumer device" segment.
To enable this, see the Nginx configuration section.
```bash
# For a datacenter using renewable energy in Costa Rica
access_log --swd --renewable --datacenter CR
```
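
To give a sense of the arithmetic involved, here is a minimal Python sketch of the Sustainable Web Design model. The constants (0.81 kWh/GB, the 52/14/15/19 % segment split, and 442 and 50 g CO2e/kWh for the global and renewable grids) are recalled from version 3 of the model and should be checked against the CO2.js release in use. The function and parameter names are invented for the example, and this is not `access_log`'s actual code.

```python
# Assumed SWD v3 constants; verify against the CO2.js version you use.
KWH_PER_GB = 0.81
SEGMENTS = {
    "device": 0.52,      # consumer device
    "network": 0.14,
    "datacenter": 0.15,
    "production": 0.19,  # hardware production
}
GLOBAL_INTENSITY = 442.0     # g CO2e/kWh
RENEWABLE_INTENSITY = 50.0   # g CO2e/kWh


def co2_grams(bytes_sent,
              datacenter_intensity=GLOBAL_INTENSITY,
              device_intensity=GLOBAL_INTENSITY):
    """Estimate grams of CO2e for a response of `bytes_sent` bytes."""
    energy_kwh = bytes_sent / 1_000_000_000 * KWH_PER_GB
    intensities = {
        "device": device_intensity,
        "network": GLOBAL_INTENSITY,
        "datacenter": datacenter_intensity,
        "production": GLOBAL_INTENSITY,
    }
    return sum(
        energy_kwh * share * intensities[segment]
        for segment, share in SEGMENTS.items()
    )


# A 1 MB response served from an average and from a renewable datacenter.
print(co2_grams(1_000_000))
print(co2_grams(1_000_000, datacenter_intensity=RENEWABLE_INTENSITY))
```

Passing a country-specific value as `device_intensity` mirrors the optional "consumer device" adjustment described above.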
### Average vs marginal intensity
[CO2.js explains this
better](https://developers.thegreenwebfoundation.org/co2js/data/). In
practice, using average intensity data will give lower results and
mostly use the global intensity, since the data by country is missing
most countries.
`access_log` uses marginal data by default.
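
The fallback mentioned above can be pictured as a lookup keyed by country code that returns the global figure whenever a country is missing. In this Python sketch the per-country numbers are placeholders rather than real CO2.js data, the global value is assumed, and `intensity_for` is a hypothetical helper.

```python
# Placeholder figures in g CO2e/kWh; NOT real CO2.js data.
AVERAGE = {"AR": 350.0}
MARGINAL = {"AR": 400.0, "CR": 450.0}
GLOBAL_INTENSITY = 442.0  # assumed global value


def intensity_for(country, marginal=True):
    """Return the grid intensity for a country, or the global fallback."""
    table = MARGINAL if marginal else AVERAGE
    return table.get(country, GLOBAL_INTENSITY)


# "CR" is missing from the placeholder average table, so it falls back.
print(intensity_for("CR", marginal=False))
print(intensity_for("CR"))
```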
## Create database
```bash
sqlite3 access_log.sqlite3 < contrib/create.sql
```
## Build
Install the zlib, sqlite3 and ssl development files (package names vary
between distributions).
Install Crystal and its development tools (these also vary).
Run:
```bash
make
```
## Build for Alpine
```bash
make alpine-build
```
## Database configuration
Create an `access_logs` table with the following schema:
| Field | Type | Reference | Index? |
| ----- | ---- | --------- | ------ |
| id | String | UUID | Unique |
| host | String | Host name | Yes |
| msec | Float | Unix timestamp of visit | ? |
| server_protocol | String | HTTP/Version | ? |
| request_method | String | GET/POST/etc. | ? |
| request_completion | String | "OK" | ? |
| uri | String | Request | Yes |
| query_string | String | Arguments | ? |
| status | Integer | HTTP status | ? |
| sent_http_content_type | String | MIME type of response | ? |
| sent_http_content_encoding | String | Compression | ? |
| sent_http_etag | String | ETag header | ? |
| sent_http_last_modified | String | Last modified date | ? |
| http_accept | String | MIME types requested | ? |
| http_accept_encoding | String | Compression accepted | ? |
| http_accept_language | String | Languages supported | ? |
| http_pragma | String | Pragma header | ? |
| http_cache_control | String | Cache requested | ? |
| http_if_none_match | String | ETag requested | ? |
| http_dnt | String | Do Not Track header | ? |
| http_user_agent | String | User Agent | Yes |
| http_origin | String | Request origin | Yes |
| http_referer | String | Referer (see Referrer Policy) | Yes |
| request_time | Float | Request duration | ? |
| bytes_sent | Integer | Bytes sent | ? |
| body_bytes_sent | Integer | Bytes sent not including headers | ? |
| request_length | Integer | Request length (request line, headers and body) | ? |
| http_connection | String | Connection status | ? |
| pipe | String | Request was pipelined (`p`) or not (`.`) | ? |
| connection_requests | Integer | Requests done on the same connection | ? |
| geoip2_data_country_name | String | Country according to GeoIP | Yes |
| geoip2_data_city_name | String | City according to GeoIP | Yes |
| ssl_server_name | String | SNI | ? |
| ssl_protocol | String | SSL/TLS version used | ? |
| ssl_early_data | String | TLSv1.3 early data used | ? |
| ssl_session_reused | String | TLS session reused | ? |
| ssl_curves | String | Curves used | ? |
| ssl_ciphers | String | Ciphers available | ? |
| ssl_cipher | String | Cipher used | ? |
| sent_http_x_xss_protection | String | XSS Protection sent | ? |
| sent_http_x_frame_options | String | Frame protection sent | ? |
| sent_http_x_content_type_options | String | Content protection sent | ? |
| sent_http_strict_transport_security | String | HSTS sent | ? |
| nginx_version | String | Server version | ? |
| pid | Integer | Server PID | ? |
| crawler | Boolean | Web crawler detected | ? |
| remote_user | String | HTTP Basic auth user | ? |
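
The table above maps naturally onto SQL column types. The Python sketch below assumes a table named `access_logs` and a `String` to `TEXT`, `Float` to `REAL`, `Integer` to `INTEGER` mapping for SQLite, and builds the statements for a handful of the fields plus the indexes they are marked with. It is an illustration only; `contrib/create.sql` and the Rails migration remain the authoritative schema, and the index names are made up.

```python
import sqlite3

# Assumed mapping from the documented types to SQLite column types.
SQL_TYPES = {
    "String": "TEXT",
    "Float": "REAL",
    "Integer": "INTEGER",
    "Boolean": "BOOLEAN",
}

# (field, type, index) for a small subset of the documented fields.
FIELDS = [
    ("id", "String", "unique"),
    ("host", "String", "yes"),
    ("msec", "Float", None),
    ("uri", "String", "yes"),
    ("status", "Integer", None),
    ("crawler", "Boolean", None),
]


def build_ddl(table="access_logs"):
    """Return CREATE TABLE and CREATE INDEX statements for FIELDS."""
    columns = ", ".join(f"{name} {SQL_TYPES[kind]}" for name, kind, _ in FIELDS)
    statements = [f"CREATE TABLE {table} ({columns})"]
    for name, _, index in FIELDS:
        if index:
            unique = "UNIQUE " if index == "unique" else ""
            statements.append(
                f"CREATE {unique}INDEX index_{table}_on_{name} "
                f"ON {table} ({name})"
            )
    return statements


conn = sqlite3.connect(":memory:")
for statement in build_ddl():
    conn.execute(statement)
```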
## Nginx configuration
Configure Nginx to format the access log as JSON. You can set
`http_referer.policy` to one of `unsafe-url`, `no-referrer`, `origin`,
`origin-when-cross-origin`, `same-origin`, `strict-origin`,
`strict-origin-when-cross-origin`, `no-referrer-when-downgrade`.
```json
{
"http_referer": {
"referrer": "$http_referer",
"origin": "$http_origin",
"policy": "origin-when-cross-origin"
}
}
```
**Note:** The internal key is `referrer` but the parent is
`http_referer` (double and single "r" respectively; the second is a typo
in the HTTP specification).
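
How a policy trims the stored referrer can be sketched as follows. This Python function loosely follows the W3C Referrer Policy definitions and only approximates what `access_log` does. The `apply_policy` name, the `site_url` parameter (the URL of the site being visited) and the exact return formats are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return scheme://host[:port] for a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def apply_policy(referrer, site_url, policy):
    """Return the referrer to store under `policy`, or None to drop it."""
    if not referrer or policy == "no-referrer":
        return None
    ref_origin = origin_of(referrer)
    same_origin = ref_origin == origin_of(site_url)
    # A "downgrade" is a visit coming from HTTPS into plain HTTP.
    downgrade = (urlsplit(referrer).scheme == "https"
                 and urlsplit(site_url).scheme == "http")
    full = referrer.split("#", 1)[0]  # fragments are never kept

    if policy == "unsafe-url":
        return full
    if policy == "origin":
        return ref_origin
    if policy == "same-origin":
        return full if same_origin else None
    if policy == "origin-when-cross-origin":
        return full if same_origin else ref_origin
    if policy == "strict-origin":
        return None if downgrade else ref_origin
    if policy == "strict-origin-when-cross-origin":
        if same_origin:
            return full
        return None if downgrade else ref_origin
    if policy == "no-referrer-when-downgrade":
        return None if downgrade else full
    raise ValueError(f"unknown referrer policy: {policy}")


# A cross-origin visit keeps only the origin under this policy.
print(apply_policy("https://a.example/page?q=1", "https://b.example/",
                   "origin-when-cross-origin"))
```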
Install `daemonize` and run `access_logd`. By default it creates a UNIX
socket at `/tmp/access_log.socket` so Nginx can write to it using
its [syslog support](https://nginx.org/en/docs/syslog.html).
Check `/var/log/nginx/error.log` for debugging.
`ACCESS_LOG_FLAGS` is the env variable to pass flags to `access_logd`.
For a working example check our [Nginx
container](https://0xacab.org/sutty/containers/nginx/).
```nginx
log_format main escape=json '{"host":"$host","msec":$msec,"server_protocol":"$server_protocol","request_method":"$request_method","request_completion":"$request_completion","uri":"$uri","query_string":"$query_string","status":$status,"sent_http_content_type":"$sent_http_content_type","sent_http_content_encoding":"$sent_http_content_encoding","sent_http_etag":"$sent_http_etag","sent_http_last_modified":"$sent_http_last_modified","http_accept":"$http_accept","http_accept_encoding":"$http_accept_encoding","http_accept_language":"$http_accept_language","http_pragma":"$http_pragma","http_cache_control":"$http_cache_control","http_if_none_match":"$http_if_none_match","http_dnt":"$http_dnt","http_user_agent":"$http_user_agent","http_origin":"$http_origin","http_referer":{"origin":"$http_origin","referrer":"$http_referer","policy":"origin-when-cross-origin"},"request_time":$request_time,"bytes_sent":$bytes_sent,"body_bytes_sent":$body_bytes_sent,"request_length":$request_length,"http_connection":"$http_connection","pipe":"$pipe","connection_requests":$connection_requests,"geoip2_data_country_name":"$geoip2_data_country_name","geoip2_data_city_name":"$geoip2_data_city_name","ssl_server_name":"$ssl_server_name","ssl_protocol":"$ssl_protocol","ssl_early_data":"$ssl_early_data","ssl_session_reused":"$ssl_session_reused","ssl_curves":"$ssl_curves","ssl_ciphers":"$ssl_ciphers","ssl_cipher":"$ssl_cipher","sent_http_x_xss_protection":"$sent_http_x_xss_protection","sent_http_x_frame_options":"$sent_http_x_frame_options","sent_http_x_content_type_options":"$sent_http_x_content_type_options","sent_http_strict_transport_security":"$sent_http_strict_transport_security","nginx_version":"$nginx_version","pid":"$pid","remote_user":""}';
access_log syslog=unix:/tmp/access_log.socket,nohostname main;
```
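
Nginx's syslog support prefixes each message with a syslog header before the formatted log line, so a receiver has to skip that header before parsing the JSON. The Python sketch below shows one way to do it, assuming the header looks like `<190>Jan 22 21:38:24 nginx: ` (priority, timestamp and the default `nginx` tag, with no hostname because of `nohostname`). The `parse_datagram` helper is illustrative and is not how `access_logd` is actually implemented.

```python
import json


def parse_datagram(data):
    """Skip the syslog header and decode the JSON payload of one message."""
    text = data.decode("utf-8", errors="replace")
    # The log format always starts with "{", so everything before the
    # first brace is treated as the syslog header.
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON payload found")
    return json.loads(text[start:])


sample = b'<190>Jan 22 21:38:24 nginx: {"host":"example.org","status":200}'
record = parse_datagram(sample)
print(record["host"], record["status"])
```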
### Add origin country of visit to SWD
Add a `$geoip2_data_country_iso_code` variable to Nginx and the
corresponding key to the JSON log format.
```nginx
geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
$geoip2_data_country_iso_code country iso_code;
}
log_format main escape=json '{"host":"$host","msec":$msec,"server_protocol":"$server_protocol","request_method":"$request_method","request_completion":"$request_completion","uri":"$uri","query_string":"$query_string","status":$status,"sent_http_content_type":"$sent_http_content_type","sent_http_content_encoding":"$sent_http_content_encoding","sent_http_etag":"$sent_http_etag","sent_http_last_modified":"$sent_http_last_modified","http_accept":"$http_accept","http_accept_encoding":"$http_accept_encoding","http_accept_language":"$http_accept_language","http_pragma":"$http_pragma","http_cache_control":"$http_cache_control","http_if_none_match":"$http_if_none_match","http_dnt":"$http_dnt","http_user_agent":"$http_user_agent","http_origin":"$http_origin","http_referer":{"origin":"$http_origin","referrer":"$http_referer","policy":"origin-when-cross-origin"},"request_time":$request_time,"bytes_sent":$bytes_sent,"body_bytes_sent":$body_bytes_sent,"request_length":$request_length,"http_connection":"$http_connection","pipe":"$pipe","connection_requests":$connection_requests,"geoip2_data_country_name":"$geoip2_data_country_name","geoip2_data_city_name":"$geoip2_data_city_name","ssl_server_name":"$ssl_server_name","ssl_protocol":"$ssl_protocol","ssl_early_data":"$ssl_early_data","ssl_session_reused":"$ssl_session_reused","ssl_curves":"$ssl_curves","ssl_ciphers":"$ssl_ciphers","ssl_cipher":"$ssl_cipher","sent_http_x_xss_protection":"$sent_http_x_xss_protection","sent_http_x_frame_options":"$sent_http_x_frame_options","sent_http_x_content_type_options":"$sent_http_x_content_type_options","sent_http_strict_transport_security":"$sent_http_strict_transport_security","nginx_version":"$nginx_version","pid":"$pid","remote_user":"","geoip2_data_country_iso_code":"$geoip2_data_country_iso_code"}';
```
Then run the program with the required flags enabled:
```bash
access_log --swd --device-country
```
## Crawler user agents
Download the [crawler user agents
database](https://github.com/monperrus/crawler-user-agents) and pass it
as an argument to `access_log`. It'll try to detect whether a UA belongs
to a web crawler.
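
The detection idea can be sketched in a few lines of Python. The database is a JSON list of entries that each carry a regular expression under a `pattern` key. The two entries below are inline samples in that shape, and `compile_patterns` and `is_crawler` are hypothetical helpers rather than `access_log`'s own code.

```python
import re

# Sample entries shaped like the crawler user agents database; the real
# file contains many more and would be loaded with json.load().
SAMPLE_DB = [
    {"pattern": r"Googlebot\/"},
    {"pattern": "bingbot"},
]


def compile_patterns(entries):
    """Compile the `pattern` field of every database entry."""
    return [re.compile(entry["pattern"]) for entry in entries]


def is_crawler(user_agent, patterns):
    """Return True when any pattern matches the user agent string."""
    return any(pattern.search(user_agent) for pattern in patterns)


patterns = compile_patterns(SAMPLE_DB)
print(is_crawler(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    patterns,
))
```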
## TODO
* [ ] Make some fields optional