[backend, crawler] feat: support token statistics

2025-10-25 03:23:15 +02:00
parent 8e456cff75
commit 799cda6e06
32 changed files with 414 additions and 60 deletions
+4 -4
@@ -2,7 +2,7 @@
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
-title: DRC News Corpus
+title: Basango
message: >-
If you use this software, please cite it using the
metadata from this file.
@@ -14,11 +14,11 @@ authors:
email: bernard@devscast.tech
affiliation: Devscast Community
orcid: 'https://orcid.org/0009-0003-9777-6349'
-repository-code: 'https://github.com/bernard-ng/drc-news-corpus'
+repository-code: 'https://github.com/bernard-ng/basango'
repository: >-
-https://www.huggingface.c0/datasets/bernard-ng/drc-news-corpus
+https://www.huggingface.c0/datasets/bernard-ng/basango
abstract: >-
-The "DRC News Corpus" is a curated collection of news
+The "Basango" is a curated collection of news
articles sourced from major media outlets covering a wide
spectrum of topics related to the Democratic Republic of
Congo (DRC). This dataset encompasses a diverse range of
+10 -10
@@ -1,24 +1,24 @@
# Core and Backend
-![Deployed](https://github.com/bernard-ng/drc-news-corpus/actions/workflows/deploy.yaml/badge.svg)
-![Coding Standard](https://github.com/bernard-ng/drc-news-corpus/actions/workflows/quality.yaml/badge.svg)
-![Tests](https://github.com/bernard-ng/drc-news-corpus/actions/workflows/tests.yaml/badge.svg)
-![Security](https://github.com/bernard-ng/drc-news-corpus/actions/workflows/audit.yaml/badge.svg)
+![Deployed](https://github.com/bernard-ng/basango/actions/workflows/deploy.yaml/badge.svg)
+![Coding Standard](https://github.com/bernard-ng/basango/actions/workflows/quality.yaml/badge.svg)
+![Tests](https://github.com/bernard-ng/basango/actions/workflows/tests.yaml/badge.svg)
+![Security](https://github.com/bernard-ng/basango/actions/workflows/audit.yaml/badge.svg)
| Scope | Link |
|-------------------|------------------------------------------------------------|
-| core and backend | https://github.com/bernard-ng/drc-news-corpus |
+| core and backend | https://github.com/bernard-ng/basango |
| ML models | https://github.com/bernard-ng/drc-news-ml |
| Mobile App | https://github.com/bernard-ng/basango |
-| Dataset (partial) | https://huggingface.co/datasets/bernard-ng/drc-news-corpus |
+| Dataset (partial) | https://huggingface.co/datasets/bernard-ng/basango |
---
-## DRC News Corpus : Towards a scalable and intelligent system for Congolese News curation
+## Basango : Towards a scalable and intelligent system for Congolese News curation
### Introduction
-The **"DRC News Corpus"** is a structured and scalable dataset of news articles sourced from major media outlets covering diverse aspects of the Democratic Republic of Congo (DRC). Designed for efficiency, this system enables the automated collection, processing, and organization of news stories spanning politics, economy, society, culture, environment, and international affairs.
+The **"Basango"** is a structured and scalable dataset of news articles sourced from major media outlets covering diverse aspects of the Democratic Republic of Congo (DRC). Designed for efficiency, this system enables the automated collection, processing, and organization of news stories spanning politics, economy, society, culture, environment, and international affairs.
### Scalability and Use Cases:
@@ -45,7 +45,7 @@ If you want to rebuild the dataset follow the steps bellow :
#### Installation
```bash
-git clone https://github.com/bernard-ng/drc-news-corpus.git && cd drc-news-corpus
+git clone https://github.com/bernard-ng/basango.git && cd basango
make build
make start
```
@@ -104,5 +104,5 @@ a CSV file will be generated in the `data` directory.
### Acknowledgment:
-The compilation and curation of the "DRC News Corpus" were conducted by Tshabu Ngandu Bernard with the primary objective of facilitating research and analysis related to the Democratic Republic of Congo.
+The compilation and curation of the "Basango" were conducted by Tshabu Ngandu Bernard with the primary objective of facilitating research and analysis related to the Democratic Republic of Congo.
I do not own the content of the articles, and all rights belong to the respective publishers. The dataset is intended for non-commercial research purposes only.
+1 -1
@@ -1,6 +1,6 @@
{
"version": "1",
-"name": "drc-news-corpus",
+"name": "basango",
"type": "collection",
"ignore": [
"node_modules",
@@ -18,7 +18,7 @@
<field name="hash" length="32" />
<field name="categories" type="text[]" nullable="true" />
-<many-to-one field="source" target-entity="Basango\Aggregator\Domain\Model\Entity\Source">
+<many-to-one field="source" target-entity="Basango\Aggregator\Domain\Model\Entity\Source" fetch="EAGER">
<join-column nullable="false" on-delete="CASCADE" />
</many-to-one>
@@ -30,6 +30,7 @@
</field>
<field name="metadata" type="open_graph" nullable="true" />
<embedded name="readingTime" class="Basango\Aggregator\Domain\Model\ValueObject\ReadingTime" use-column-prefix="false" />
+<field name="tokenStatistics" type="token_statistics" nullable="true" />
<field name="image"
insertable="false"
@@ -8,5 +8,12 @@
repository-class="Gesdinet\JWTRefreshTokenBundle\Entity\RefreshTokenRepository"
table="refresh_tokens"
>
+<id name="id" type="integer">
+<generator strategy="SEQUENCE" />
+<sequence-generator sequence-name="refresh_tokens_id_seq" allocation-size="100" initial-value="1" />
+</id>
+<field name="refreshToken" type="string" column="refresh_token" length="128" unique="true"/>
+<field name="username" type="string" length="255" column="username"/>
+<field name="valid" type="datetime"/>
</entity>
</doctrine-mapping>
@@ -0,0 +1,31 @@
<?php
declare(strict_types=1);
namespace DoctrineMigrations;
use Doctrine\DBAL\Schema\Schema;
use Doctrine\Migrations\AbstractMigration;
/**
* Class Version20251024234318.
*
* @author bernard-ng <bernard@devscast.tech>
*/
final class Version20251024234318 extends AbstractMigration
{
public function getDescription(): string
{
return 'add token statistics to article';
}
public function up(Schema $schema): void
{
$this->addSql('ALTER TABLE article ADD token_statistics JSONB DEFAULT NULL');
}
public function down(Schema $schema): void
{
$this->addSql('ALTER TABLE article DROP token_statistics');
}
}
@@ -70,6 +70,7 @@ doctrine:
article_id: Basango\Aggregator\Infrastructure\Persistence\Doctrine\DBAL\Types\ArticleIdType
source_id: Basango\Aggregator\Infrastructure\Persistence\Doctrine\DBAL\Types\SourceIdType
open_graph: Basango\Aggregator\Infrastructure\Persistence\Doctrine\DBAL\Types\OpenGraphType
+token_statistics: Basango\Aggregator\Infrastructure\Persistence\Doctrine\DBAL\Types\TokenStatisticsType
# Identity and Access
user_id: Basango\IdentityAndAccess\Infrastructure\Persistence\Doctrine\DBAL\Types\UserIdType
@@ -125,6 +126,7 @@ doctrine:
orm:
auto_generate_proxy_classes: true
enable_lazy_ghost_objects: true
+enable_native_lazy_objects: true
entity_managers:
default:
validate_xml_mapping: false
@@ -6,6 +6,7 @@ namespace Basango\Aggregator\Application\UseCase\Command;
use Basango\Aggregator\Domain\Model\ValueObject\Link;
use Basango\Aggregator\Domain\Model\ValueObject\OpenGraph;
+use Basango\Aggregator\Domain\Model\ValueObject\TokenStatistics;
/**
* Class Save.
@@ -17,11 +18,12 @@ final readonly class CreateArticle
public function __construct(
public string $title,
public Link $link,
-public string $categories,
+public array $categories,
public string $body,
public string $source,
public int $timestamp,
-public ?OpenGraph $metadata = null
+public ?OpenGraph $metadata = null,
+public ?TokenStatistics $tokenStatistics = null
) {
}
}
@@ -43,12 +43,13 @@ final readonly class CreateArticleHandler implements CommandHandler
link: $command->link,
body: $command->body,
hash: $hash,
-categories: mb_strtolower($command->categories),
+categories: $command->categories,
source: $source,
publishedAt: $publishedAt
);
$article
->defineOpenGraph($command->metadata)
+->defineTokenStatistics($command->tokenStatistics)
->computeReadingTime();
$this->articleRepository->add($article);
@@ -10,6 +10,7 @@ use Basango\Aggregator\Domain\Model\ValueObject\OpenGraph;
use Basango\Aggregator\Domain\Model\ValueObject\ReadingTime;
use Basango\Aggregator\Domain\Model\ValueObject\Scoring\Credibility;
use Basango\Aggregator\Domain\Model\ValueObject\Scoring\Sentiment;
+use Basango\Aggregator\Domain\Model\ValueObject\TokenStatistics;
/**
* Class Article.
@@ -25,13 +26,14 @@ class Article
public readonly Link $link,
public readonly string $body,
public readonly string $hash,
-private(set) string $categories,
+private(set) array $categories,
public readonly Source $source,
public readonly \DateTimeImmutable $publishedAt,
public readonly \DateTimeImmutable $crawledAt = new \DateTimeImmutable(),
private(set) Credibility $credibility = new Credibility(),
private(set) Sentiment $sentiment = Sentiment::NEUTRAL,
private(set) ?OpenGraph $metadata = null,
+private(set) ?TokenStatistics $tokenStatistics = null,
private(set) ?ReadingTime $readingTime = null,
private(set) ?\DateTimeImmutable $updatedAt = null,
public readonly ?string $image = null,
@@ -56,7 +58,7 @@ class Article
return $this;
}
-public function assignCategories(string $categories): self
+public function assignCategories(array $categories): self
{
$this->categories = $categories;
$this->updatedAt = new \DateTimeImmutable();
@@ -83,4 +85,11 @@ class Article
return $this;
}
+public function defineTokenStatistics(?TokenStatistics $statistics): self
+{
+$this->tokenStatistics = $statistics;
+return $this;
+}
}
@@ -0,0 +1,62 @@
<?php
declare(strict_types=1);
namespace Basango\Aggregator\Domain\Model\ValueObject;
/**
* Class TokenStatistics.
*
* @author bernard-ng <bernard@devscast.tech>
*/
final class TokenStatistics implements \JsonSerializable
{
public ?int $total {
get {
return ($this->title ?? 0)
+ ($this->body ?? 0)
+ ($this->excerpt ?? 0)
+ ($this->categories ?? 0);
}
}
public function __construct(
public readonly ?int $title = null,
public readonly ?int $body = null,
public readonly ?int $excerpt = null,
public readonly ?int $categories = null,
) {
}
public static function tryFrom(?string $value): ?self
{
if ($value === null) {
return null;
}
try {
$object = \json_decode($value, true, 512, JSON_THROW_ON_ERROR);
return new self(
$object['title'] ?? null,
$object['body'] ?? null,
$object['excerpt'] ?? null,
$object['categories'] ?? null,
);
} catch (\Throwable) {
return null;
}
}
#[\Override]
public function jsonSerialize(): array
{
return [
'title' => $this->title,
'body' => $this->body,
'excerpt' => $this->excerpt,
'categories' => $this->categories,
'total' => $this->total,
];
}
}
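The value object above round-trips through JSON: `jsonSerialize()` emits the four counts plus the derived `total`, and `tryFrom()` rebuilds the object from the stored string while ignoring `total`. The sketch below illustrates that behaviour with a hypothetical stand-in class, `TokenStatisticsSketch`, which is not part of the commit. It swaps the PHP 8.4 property hook for a plain `total()` method so it also runs on PHP 8.1+, and the sample counts are made up.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for TokenStatistics (not the committed class):
// the `total` property hook is replaced by a method for wider PHP support.
final class TokenStatisticsSketch implements \JsonSerializable
{
    public function __construct(
        public readonly ?int $title = null,
        public readonly ?int $body = null,
        public readonly ?int $excerpt = null,
        public readonly ?int $categories = null,
    ) {
    }

    // Missing counts are treated as zero, as in the original getter.
    public function total(): int
    {
        return ($this->title ?? 0)
            + ($this->body ?? 0)
            + ($this->excerpt ?? 0)
            + ($this->categories ?? 0);
    }

    // Returns null for null input, invalid JSON, or a non-object payload.
    public static function tryFrom(?string $value): ?self
    {
        if ($value === null) {
            return null;
        }

        try {
            $data = \json_decode($value, true, 512, JSON_THROW_ON_ERROR);
            if (! \is_array($data)) {
                return null;
            }

            return new self(
                $data['title'] ?? null,
                $data['body'] ?? null,
                $data['excerpt'] ?? null,
                $data['categories'] ?? null,
            );
        } catch (\Throwable) {
            return null;
        }
    }

    public function jsonSerialize(): array
    {
        return [
            'title' => $this->title,
            'body' => $this->body,
            'excerpt' => $this->excerpt,
            'categories' => $this->categories,
            'total' => $this->total(),
        ];
    }
}

// Made-up counts, for illustration only.
$stats = new TokenStatisticsSketch(title: 12, body: 480, categories: 3);
$json = \json_encode($stats, JSON_THROW_ON_ERROR);
$restored = TokenStatisticsSketch::tryFrom($json);
```

Because `total` is derived, it is written to the JSON column for convenience but never read back, so a stored `total` that disagrees with the individual counts would be silently recomputed on load.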
@@ -0,0 +1,67 @@
<?php
declare(strict_types=1);
namespace Basango\Aggregator\Infrastructure\Persistence\Doctrine\DBAL\Types;
use Basango\Aggregator\Domain\Model\ValueObject\TokenStatistics;
use Doctrine\DBAL\Platforms\AbstractPlatform;
use Doctrine\DBAL\Types\ConversionException;
use Doctrine\DBAL\Types\Type;
/**
* Class TokenStatisticsType.
*
* @author bernard-ng <bernard@devscast.tech>
*/
final class TokenStatisticsType extends Type
{
public function getSQLDeclaration(array $column, AbstractPlatform $platform): string
{
return $platform->getJsonTypeDeclarationSQL([
'nullable' => true,
'jsonb' => true,
]);
}
public function getName(): string
{
return 'token_statistics';
}
#[\Override]
public function convertToPHPValue(mixed $value, AbstractPlatform $platform): ?TokenStatistics
{
if ($value === null) {
return null;
}
if (! \is_string($value)) {
throw ConversionException::conversionFailedInvalidType($value, $this->getName(), ['null', 'string', TokenStatistics::class]);
}
try {
return TokenStatistics::tryFrom($value);
} catch (\Throwable $e) {
throw ConversionException::conversionFailed($value, $this->getName(), $e);
}
}
#[\Override]
public function convertToDatabaseValue($value, AbstractPlatform $platform): ?string
{
if ($value instanceof TokenStatistics) {
return json_encode($value) ?: null;
}
if ($value === null || $value === '') {
return null;
}
if (! \is_string($value)) {
throw ConversionException::conversionFailedInvalidType($value, $this->getName(), ['null', 'string', TokenStatistics::class]);
}
throw ConversionException::conversionFailed($value, $this->getName());
}
}
@@ -47,11 +47,12 @@ final class AddArticleController extends AbstractController
$this->handleCommand(new CreateArticle(
$model->title,
Link::from($model->link),
-implode(', ', $model->categories),
+$model->categories,
$model->body,
$model->source,
$model->timestamp,
$model->metadata,
+$model->tokenStatistics
));
return new JsonResponse(status: Response::HTTP_CREATED);
@@ -5,6 +5,7 @@ declare(strict_types=1);
namespace Basango\Aggregator\Presentation\WriteModel;
use Basango\Aggregator\Domain\Model\ValueObject\OpenGraph;
+use Basango\Aggregator\Domain\Model\ValueObject\TokenStatistics;
use Symfony\Component\Validator\Constraints as Assert;
/**
@@ -32,4 +33,6 @@ final class AddArticleModel
public array $categories = [];
public ?OpenGraph $metadata = null;
+public ?TokenStatistics $tokenStatistics = null;
}
@@ -42,7 +42,7 @@ final readonly class GetArticleOverviewListDbalHandler implements GetArticleOver
$qb->from('article', 'a')
->innerJoin('a', 'source', 's', 'a.source_id = s.id')
-//->orderBy('a.published_at', $query->filters->sortDirection->value)
+->orderBy('a.published_at', $query->filters->sortDirection->value)
->setParameter('userId', $query->userId->toString())
;
@@ -62,15 +62,17 @@ trait ArticleQuery
private function applyArticleFilters(QueryBuilder $qb, ArticleFilters $filters): QueryBuilder
{
if ($filters->category !== null) {
// PostgreSQL array containment for single value
$qb->andWhere(':category = ANY(a.categories)')
->setParameter('category', $filters->category);
}
if ($filters->search !== null) {
-// Case-insensitive search in PostgreSQL
-$qb->andWhere('a.title ILIKE :search')
-->setParameter('search', sprintf('%%%s%%', $filters->search));
+$qb
+->addSelect("ts_rank(a.tsv, to_tsquery('french', :search)) AS rank")
+->andWhere("a.tsv @@ to_tsquery('french', :search)")
+->setParameter('search', $filters->search)
+->resetOrderBy()
+->orderBy('rank', $filters->sortDirection->value);
}
if ($filters->dateRange instanceof DateRange) {
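One thing to watch in the full-text search hunk above: PostgreSQL's `to_tsquery` expects operator syntax, so a raw multi-word term such as `elections kinshasa` raises a tsquery syntax error. This diff does not show whether `$filters->search` is normalized elsewhere. A minimal sketch of one way to do it, assuming no upstream sanitization, is below. The helper name `toTsQueryExpression` is hypothetical and not part of the commit.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not in the commit): turns free-form user input into
 * an AND-ed tsquery expression that to_tsquery() can parse.
 * Returns null when the input contains no searchable term.
 */
function toTsQueryExpression(string $search): ?string
{
    // Split on anything that is not a Unicode letter or digit, which also
    // drops tsquery operators such as &, |, !, parentheses and quotes.
    $terms = \preg_split('/[^\p{L}\p{N}]+/u', $search, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on invalid UTF-8 input.
    if ($terms === false || $terms === []) {
        return null;
    }

    $terms = \array_map(
        static fn (string $term): string => \mb_strtolower($term),
        $terms
    );

    return \implode(' & ', $terms);
}

$expression = toTsQueryExpression('Élections   Kinshasa');
```

The result would be bound as `:search` in place of the raw filter value, and the search clause skipped when the helper returns `null`. An alternative that avoids the helper entirely is to call PostgreSQL's `plainto_tsquery` or `websearch_to_tsquery` instead of `to_tsquery`, since both accept unformatted text.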
@@ -4,8 +4,8 @@ declare(strict_types=1);
namespace Basango\IdentityAndAccess\Domain\Model\Entity;
-use Gesdinet\JWTRefreshTokenBundle\Entity\RefreshToken as BaseRefreshToken;
+use Gesdinet\JWTRefreshTokenBundle\Model\AbstractRefreshToken;
-class RefreshToken extends BaseRefreshToken
+class RefreshToken extends AbstractRefreshToken
{
}
@@ -11,15 +11,15 @@ namespace Basango\SharedKernel\Domain;
*/
final class Application
{
-public string $name = 'DRC News Corpus';
+public string $name = 'Basango';
-public string $website = 'https://research.devscast.org/drc-news-corpus';
+public string $website = 'https://basango.ngandu.dev';
public string $emailAddress = 'contact@devscast.tech';
public string $infoAddress = 'contact@devscast.tech';
-public string $emailName = 'DRC News Corpus';
+public string $emailName = 'Basango';
public string $legalName = 'Devscast Software SàSu';
@@ -48,14 +48,13 @@ trait PaginationQuery
PaginatorKeyset $keyset,
SortDirection $direction = SortDirection::DESC
): QueryBuilder {
-$orderDirection = strtoupper($direction->value);
$comparisonOperator = $direction === SortDirection::ASC ? '>' : '<';
if ($keyset->date !== null) {
-$qb->addOrderBy($keyset->date, $orderDirection);
+$qb->addOrderBy($keyset->date, $direction->value);
}
-$qb->addOrderBy($keyset->id, $orderDirection);
+$qb->addOrderBy($keyset->id, $direction->value);
$cursor = PaginationCursor::decode($page->cursor);
if (! $cursor instanceof PaginationCursor) {
@@ -22,9 +22,9 @@ final class DefaultController extends AbstractController
public function __invoke(): JsonResponse
{
return $this->json([
-'repository' => 'https://github.com/bernard-ng/drc-news-corpus',
-'title' => 'DRC News Corpus : Towards a scalable and intelligent system for Congolese News curation',
-'description' => 'The DRC News Corpus is a structured and scalable dataset of news articles sourced from major media outlets covering diverse aspects of the Democratic Republic of Congo (DRC). Designed for efficiency, this system enables the automated collection, processing, and organization of news stories spanning politics, economy, society, culture, environment, and international affairs.',
+'repository' => 'https://github.com/bernard-ng/basango',
+'title' => 'Basango : Towards a scalable and intelligent system for Congolese News curation',
+'description' => 'The Basango is a structured and scalable dataset of news articles sourced from major media outlets covering diverse aspects of the Democratic Republic of Congo (DRC). Designed for efficiency, this system enables the automated collection, processing, and organization of news stories spanning politics, economy, society, culture, environment, and international affairs.',
'status' => 200,
]);
}