Massimo CAFARO

Massimo CAFARO

Professore II Fascia (Associato)

Settore Scientifico Disciplinare ING-INF/05: SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI.

Dipartimento di Ingegneria dell'Innovazione

Centro Ecotekne Pal. O - S.P. 6, Lecce - Monteroni - LECCE (LE)

Ufficio, Piano terra

Telefono +39 0832 29 7371

Direttore del Master in Applied Data Science.

Responsabile scientifico del Laboratorio HPC

Afferente al Dipartimento di Ingegneria dell'Innovazione.

Area di competenza:

Calcolo parallelo e distribuito, grid/cloud computing, security, cryptography, data mining, machine learning, deep learning

Orario di ricevimento

Previo appuntamento via email.

Visualizza QR Code Scarica la Visit Card

Curriculum Vitae

Massimo Cafaro è professore associato presso il Dipartimento di Ingegneria dell'Innovazione dell'Università del Salento, nel settore scientifico disciplinare ING-INF/05. Il suo ambito di ricerca riguarda Parallel e Distributed Computing, Grid e Cloud Computing, Data Mining, Machine Learning, Deep Learning, Big Data e Security. È direttore del Master in Applied Data Science presso l'Università del Salento. È Senior Member IEEE ed IEEE Computer Society, Senior Member ACM, Associate Editor per il journal IEEE Access e moderatore per IEEE TechRxiv preprint repository ("Computing and Processing" category). Laureato in Scienze dell'Informazione presso l’Università di Salerno, ha conseguito il dottorato in Informatica presso l'Università di Bari. È inoltre autore di oltre 110 pubblicazioni internazionali in refereed journals, proceedings e book chapters e co-autore del brevetto “Metodo e formalismo per inviare istruzioni a DataBase distribuiti realizzato mediante programma per computer”. È Visiting Professor presso il British Institute of Technology and E-Commerce, London, UK. È invited lecturer presso numerose università ed enti di ricerca, ed è membro dei Technical Committees IEEE on Parallel Processing, Distributed Processing, Scalable Computing e Services Computing. Inoltre, è Vice Chair of Regional Centers e coordinatore della Technical Area on Data Intensive Computing nell’ambito di IEEE Technical Committee on Scalable Computing. Collabora attivamente con l'Istituto Nazionale di Geofisica e Vulcanologia (INGV) ed il Centro Euro-Mediterraneano per i Cambiamenti Climatici (CMCC) occupandosi di aspetti di ricerca teorici e pratici prestando particolare attenzione  al design, analisi ed implementazione di algoritmi sequenziali, paralleli e distribuiti. L'attività scientifica include infine l'organizzazione di workshop internazionali e l'insegnamento in università italiane e straniere, ed in scuole di dottorato internazionali.
 

Ha conseguito l’Abilitazione Scientifica Nazionale alle funzioni di professore universitario di prima fascia (Ordinario) nel Settore Concorsuale 09/H1 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI.

Ha conseguito l’Abilitazione Scientifica Nazionale alle funzioni di professore universitario di prima fascia (Ordinario) nel Settore Concorsuale 01/B1 - INFORMATICA.

Responsabile scientifico del laboratorio di ricerca HPC (High Performance Computing).

Responsabile scientifico del laboratorio di ricerca CAMPI.

Membro del Collegio dei Docenti del Corso di Dottorato di Ricerca in Ingegneria dei Sistemi Complessi presso l'Università del Salento.

 

 

 

 

 

Didattica

A.A. 2023/2024

ALGORITMI PARALLELI

Corso di laurea INGEGNERIA INFORMATICA

Tipo corso di studio Laurea Magistrale

Lingua ITALIANO

Crediti 9.0

Docente titolare Massimo CAFARO

Ripartizione oraria Ore totali di attività frontale: 81.0

  Ore erogate dal docente Massimo CAFARO: 63.0

Anno accademico di erogazione 2023/2024

Per immatricolati nel 2022/2023

Anno di corso 2

Struttura DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Percorso PERCORSO COMUNE

Sede Lecce

DATA MINING & MACHINE LEARNING

Degree course INGEGNERIA INFORMATICA

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2023/2024

For matriculated on 2022/2023

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter Intelligenza artificiale

Location Lecce

LABORATORIO INFORMATICO

Corso di laurea INFERMIERISTICA (ABILITANTE ALLA PROFESSIONE SANITARIA DI INFERMIERE)

Tipo corso di studio Laurea

Crediti 2.0

Docente titolare LUCA MAINETTI

Ripartizione oraria Ore totali di attività frontale: 40.0

  Ore erogate dal docente Massimo CAFARO: 20.0

Anno accademico di erogazione 2023/2024

Per immatricolati nel 2023/2024

Anno di corso 1

Struttura DIPARTIMENTO DI SCIENZE E TECNOLOGIE BIOLOGICHE ED AMBIENTALI

Percorso SEDE LECCE

A.A. 2022/2023

DATA MINING & MACHINE LEARNING

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2022/2023

For matriculated on 2021/2022

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter ARTIFICIAL INTELLIGENCE

Location Lecce

PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Owner professor Massimo CAFARO

Teaching hours Ore totali di attività frontale: 81.0

  Ore erogate dal docente Massimo CAFARO: 54.0

Year taught 2022/2023

For matriculated on 2021/2022

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter PERCORSO COMUNE

Location Lecce

A.A. 2021/2022

DATA MINING

Degree course MATEMATICA

Course type Laurea Magistrale

Language INGLESE

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

Year taught 2021/2022

For matriculated on 2021/2022

Course year 1

Structure DIPARTIMENTO DI MATEMATICA E FISICA "ENNIO DE GIORGI"

Subject matter PERCORSO COMUNE

Location Lecce

PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2021/2022

For matriculated on 2020/2021

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter PERCORSO COMUNE

Location Lecce

A.A. 2020/2021

DATA MINING

Degree course MATEMATICA

Course type Laurea Magistrale

Language INGLESE

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

Year taught 2020/2021

For matriculated on 2020/2021

Course year 1

Structure DIPARTIMENTO DI MATEMATICA E FISICA "ENNIO DE GIORGI"

Subject matter PERCORSO COMUNE

PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2020/2021

For matriculated on 2019/2020

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter PERCORSO COMUNE

Location Lecce

A.A. 2019/2020

DATA MINING

Degree course MATEMATICA

Course type Laurea Magistrale

Language INGLESE

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

Year taught 2019/2020

For matriculated on 2019/2020

Course year 1

Structure DIPARTIMENTO DI MATEMATICA E FISICA "ENNIO DE GIORGI"

Subject matter PERCORSO COMUNE

Location Lecce

PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2019/2020

For matriculated on 2018/2019

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter PERCORSO COMUNE

Location Lecce

A.A. 2018/2019

DATA MINING

Degree course MATEMATICA

Course type Laurea Magistrale

Language INGLESE

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

Year taught 2018/2019

For matriculated on 2018/2019

Course year 1

Structure DIPARTIMENTO DI MATEMATICA E FISICA "ENNIO DE GIORGI"

Subject matter PERCORSO COMUNE

Location Lecce

ELEMENTI DI INFORMATICA

Corso di laurea INGEGNERIA DELLE TECNOLOGIE INDUSTRIALI

Tipo corso di studio Laurea

Lingua ITALIANO

Crediti 6.0

Ripartizione oraria Ore totali di attività frontale: 54.0

Anno accademico di erogazione 2018/2019

Per immatricolati nel 2018/2019

Anno di corso 1

Struttura DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Percorso unico

Sede Lecce

PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Course type Laurea Magistrale

Language INGLESE

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

Year taught 2018/2019

For matriculated on 2017/2018

Course year 2

Structure DIPARTIMENTO DI INGEGNERIA DELL'INNOVAZIONE

Subject matter PERCORSO COMUNE

Location Lecce

Torna all'elenco
ALGORITMI PARALLELI

Corso di laurea INGEGNERIA INFORMATICA

Settore Scientifico Disciplinare ING-INF/05

Tipo corso di studio Laurea Magistrale

Crediti 9.0

Docente titolare Massimo CAFARO

Ripartizione oraria Ore totali di attività frontale: 81.0

  Ore erogate dal docente Massimo CAFARO: 63.0

Per immatricolati nel 2022/2023

Anno accademico di erogazione 2023/2024

Anno di corso 2

Semestre Secondo Semestre (dal 04/03/2024 al 14/06/2024)

Lingua ITALIANO

Percorso PERCORSO COMUNE (999)

Sede Lecce

Analisi Matematica I e II, teoria della probabilità. Capacità di programmazione inl linguaggio C/C++.

Il corso fornisce un'introduzione moderna alla progettazione, analisi ed implementazione di algoritmi sequenziali e paralleli. In particolare, il corso si basa su un approccio pragmatico alla programmazione parallela di algoritmi message-passing attraverso il linguaggio C e la libreria MPI.

Knowledge and understanding.

Gli studenti devono avere un solido background con un ampio spettro di conoscenze di base sugli algoritmi sequenziali e paralleli:

 

  • gli studenti devono possedere gli strumenti cognitivi di base per pensare in modo analitico, creativo, critico e curioso, e possedere le capacità di astrazione e di risoluzione dei problemi necessarie per affrontare sistemi complessi;
  • devono avere una solida conoscenza della progettazione e dell'implementazione di algoritmi efficienti sequenziali e paralleli;
  • devono possedere gli strumenti per analizzare le risorse utilizzate dagli algoritmi;
  • devono possedere un catalogo dei più noti ed efficienti algoritmi sequenziali e paralleli per problemi computazionali di base.
  •  

Applying knowledge and understanding.

Al termine del corso lo studente dovrebbe essere in grado di:

 

  • Descrivere e utilizzare le principali tecniche di progettazione di algoritmi sequenziali;
  • Progettare, dimostrare la correttezza e analizzare la complessità computazionale di algoritmi sequenziali;
  • Comprendere le differenze tra diversi algoritmi che risolvono lo stesso problema e riconoscere quale sia migliore in condizioni diverse;
  • Descrivere e utilizzare gli algoritmi sequenziali di base;
  • Descrivere e utilizzare le strutture dati di base; conoscere l'esistenza di strutture dati avanzate;
  • Comprendere la differenza tra algoritmi sequenziali e paralleli;
  • Progettare, implementare e analizzare algoritmi paralleli basati sul message-passing in C/C++ utilizzando la libreria MPI;
  • Descrivere e utilizzare algoritmi paralleli di base.

 

 

Making judgements. Gli studenti sono guidati ad apprendere criticamente tutto ciò che viene loro spiegato in classe, a confrontare diversi approcci alla soluzione di problemi algoritmici e a identificare e proporre, in modo autonomo, la soluzione più efficienti.

 

Communication. È fondamentale che lo studente sia in grado di comunicare con un pubblico vario e composito, non culturalmente omogeneo, in modo chiaro, logico ed efficace, utilizzando gli strumenti metodologici acquisiti e le proprie conoscenze scientifiche e, in particolare, il lessico specialistico. Il corso promuove lo sviluppo delle seguenti abilità dello studente: capacità di esporre in termini precisi e formali un modello astratto di problemi concreti, individuandone le caratteristiche salienti e scartando quelle non essenziali; capacità di descrivere e analizzare una soluzione efficace per un dato problema.

 

Learning skills. Gli studenti devono acquisire la capacità critica di rapportarsi, con originalità e autonomia, alle problematiche tipiche degli algoritmi sequenziali e paralleli e, in generale, alle questioni culturali legate ad altri ambiti simili. Dovranno essere in grado di sviluppare e applicare autonomamente le conoscenze e i metodi appresi in vista di un eventuale proseguimento degli studi a livello superiore (dottorato) o nella più ampia prospettiva di auto-miglioramento culturale e professionale dell'apprendimento permanente. Pertanto, gli studenti devono essere in grado di passare a forme espositive diverse dai testi di partenza per memorizzare, riassumere per sé e per gli altri e diffondere le conoscenze scientifiche.

Il corso si propone di mettere gli studenti in grado di astrarre modelli e problemi algoritmici formali da problemi computazionali concreti e di progettare soluzioni algoritmiche efficienti per questi ultimi. A tal fine si utilizzerà il seguente metodo di insegnamento. Ogni problema computazionale sarà introdotto motivandolo con esempi concreti. La presentazione di ogni argomento sarà divisa in quattro parti: 1. Descrizione del problema computazionale concreto. 2. Modellazione del problema reale mediante un problema astratto. 3. Risoluzione del problema astratto attraverso un algoritmo ottenuto con l'applicazione delle tecniche generali di progettazione di algoritmi introdotte nel corso. 4. Analisi delle risorse utilizzate dall'algoritmo. Il corso consiste in lezioni frontali ed esercitazioni in aula. Ci saranno lezioni teoriche finalizzate all'apprendimento delle tecniche di base per la progettazione e l'analisi degli algoritmi, e una parte delle lezioni dedicata alle esercitazioni in cui si illustrerà, con dovizia di esempi, come le conoscenze teoriche acquisite possano essere utilizzate per risolvere problemi algoritmici di interesse pratico e implementare algoritmi paralleli in linguaggio C/C++ attraverso la libreria MPI.

L'esame consiste in una prova scritta e nell'implementazione di un prototipo di software parallelo. La prova scritta (2 ore, 21 punti su 30) verte su argomenti teorici relativi alla progettazione e all'analisi di algoritmi sequenziali e paralleli, al fine di verificare la conoscenza e la comprensione della materia da parte dello studente. Per superare la prova scritta, gli studenti dovranno ottenere almeno 9 punti su 21.  Il progetto di software parallelo (9 punti su 30) ha lo scopo di verificare la pratica della programmazione parallela e la capacità dello studente di implementare algoritmi paralleli teorici o di parallelizzare un algoritmo sequenziale. Per superare la prova di programmazione parallela, gli studenti devono ottenere almeno 4 punti su 9. Sia la prova scritta che l'implementazione del prototipo di software parallelo sono obbligatori. Gli studenti devono superare sia la prova scritta sia la prova di programmazione parallela; l'esame è superato solo se la somma dei punti ottenuti nella prova scritta e in quella di programmazione parallela è maggiore o uguale a 18 punti.

Il problema da risolvere in parallelo (o un algoritmo sequenziale da implementare in parallelo) viene assegnato su richiesta dello studente. Il progetto di software parallelo assegnato può essere discusso solo se la prova scritta è stata superata con successo; inoltre, il progetto di software parallelo deve essere obbligatoriamente discusso in uno degli appelli della stessa sessione in cui lo studente supera la prova scritta. A tal fine, il progetto (codice e relativa documentazione) deve essere inviato al docente via email almeno tre giorni prima dell'esame. Non saranno accettati progetti inviati oltre il termine indicato. 

 

La relazione relativa al progetto assegnato deve essere strutturata come segue.

 

1. Introduzione. Lo studente deve fornire una descrizione accurata del progetto assegnato, comprensiva dell’analisi dell’algoritmo sequenziale che risolve il problema affrontato nel progetto. Qualora lo studente lo ritenga importante ai fini della comprensione, possono essere presenti pseudo-codice, esempi, grafici, figure, istanze applicative etc; 

 

2. Design parallelo. Si tratta di uno studio preliminare volto ad evidenziare le opportunità per il parallelismo insite nel problema e/o nell'algoritmo sequenziale di partenza. Partendo dallo pseudo-codice occorre quindi determinare le funzionalità, le parti di codice e/o le strutture dati più opportune che possono essere prese in considerazione per la parallelizzazione. In questa fase devono essere considerati e discussi design alternativi, relativi a diverse strategie di parallelizzazione, motivando opportunamente perché alcune operazioni si prestano ad una parallelizzazione efficace ed altre no. Inoltre, devono essere riportati i risultati attesi, inclusa l’analisi della complessità parallela nel caso peggiore. È richiesta massima chiarezza espositiva in merito alla strategia di parallelizzazione, pertanto le strutture dati utilizzate devono essere descritte integralmente, motivando le scelte adottate per minimizzare il peso  delle comunicazioni e della sincronizzazione. La valutazione analitica teorica della scalabilità della versione parallela è obbligatoria, e deve essere eseguita sia utilizzando la isoefficienza che la funzione di scalabilità. 

 

3. Implementazione. Il design parallelo adottato deve essere implementato utilizzando il linguaggio di programmazione C o, eventualmente, C++ (in modo da usare le strutture dati presenti nella STL del C++). Il codice deve essere necessariamente commentato in modo adeguato. Nel caso in cui lo studente verifichi l’esistenza di strategie di parallelizzazione multiple per lo stesso problema assegnato, è possibile fornire una implementazione per ciascuna strategia, discutendo vantaggi e svantaggi di ogni strategia. Si noti che il codice parallelo non deve essere riportato integralmente nel testo della relazione, in quanto il codice del progetto - sorgenti e Makefile per il build (in alternativa è possibile usare CMake) - deve in ogni caso essere consegnato a parte. Tuttavia, la relazione può includere alcuni snippet di codice relativi alla parti più critiche ed interessanti. 

 

4. Debug e test. Si raccomanda di effettuare un debug e test del codice parallelo che consenta di verificare, ragionevolmente, l’assenza di bugs e la correttezza dell’algoritmo parallelo in relazione all’output prodotto. Si invita lo studente a progettare e verificare gli unit tests ritenuti idonei a valutare la bontà del codice. Lo studente è esplicitamente avvisato che l’implementazione di una strategia di parallelizzazione non corretta e/o la presenza di bugs che mandano in crash l'applicazione parallela a runtime comportano il non superamento dell'esame, mentre la presenza di bugs che inficiano la correttezza dell'algoritmo comporta una penalizzazione relativa al punteggio che sarà attribuito. 

 

5. Analisi delle prestazioni e della scalabilità. Lo studente deve analizzare le prestazioni dell’implementazione sviluppata in termini di tempo di esecuzione, occupazione di memoria (se necessario), speedup ed efficienza. Deve essere valutata sia la strong scalability (Amdahl) che la weak scalability (Gustafson) ove possibile. 

 

6. Sintesi. La relazione deve essere quanto più possibile sintetica, ma senza che ciò vada a discapito della chiarezza espositiva. 

 

7. Valutazione del progetto. Il progetto sarà valutato con un punteggio relativo ad una scala che va da 1 a 9 punti. Il voto sarà definito solo al termine della discussione del progetto. La complessità del problema assegnato sarà presa in considerazione per la valutazione; gli studenti che si occupano di problemi più semplici devono quindi aspettarsi una minore indulgenza nella valutazione rispetto agli studenti che si occupano di problemi più difficili. 

 

La valutazione terrà inoltre conto di: 

 

- Chiarezza ed efficacia della presentazione; 

- Profondità, correttezza ed originalità dell’analisi teorica; 

- Capacità tecniche e documentazione dell’implementazione; 

- Numero e qualità delle strategie multiple parallele, ove presenti; 

- Pensiero critico nella valutazione e nell’analisi delle prestazioni; 

  • Significatività dei risultati: dato che la programmazione parallela si occupa principalmente di prestazioni, un buon progetto deve anche fornire un significativo speedup.

 

 

8. Plagio. Lo studente è avvisato esplicitamente che il plagio è gravissimo, ed è considerato molto seriamente. L’uso di fonti esterne di qualsiasi genere (internet, libri, dispense, lavori pregressi di altri studenti etc) è consentito solo per contributi marginali nel progetto (ad esempio, per la descrizione iniziale del progetto assegnato), ed ogni singola occorrenza deve essere esplicitamente citata e riportata nella relazione. La violazione di questo codice di comportamento non sarà in alcun modo tollerata, e comporta l’immediato annullamento dell’esame ed il deferimento dello studente alla commissione disciplinare di ateneo, con avvio del relativo procedimento disciplinare secondo quanto riportato del Titolo V (PROCEDIMENTI DISCIPLINARI E SANZIONI) del Regolamento di Ateneo per gli studenti (D.R. 672 del 5/12/2017, entrato in vigore in data 8/12/2017).

 

 

 

 

 

 

 

 

Ricevimento Studenti

Su appuntamento; contattare il docente via e-mail o al termine degli incontri di classe.

Algoritmi Sequenziali


Introduzione agli algoritmi sequenziali. Tecniche di progettazione di algoritmi. Design ed analisi. Modello RAM. Decrease and conquer. Russian Peasant multiplication. Il problema dell'ordinamento. Insertion Sort. Analisi di Insertion Sort. Correttezza. Assertions. Invariants. Correttezza di Insertion Sort. Ordini di grandezza e notazioni asintotiche. Uso delle notazioni asintotiche. Classi di funzioni. Analisi del running time di un algoritmo usando le notazioni asintotiche. Determinare la complessità computazionale di un algoritmo. Divide and conquer. Merge Sort. Equazioni di ricorrenza. Recursion tree. Risoluzione di equazioni di ricorrenza. Metodo di sostituzione. Albero di ricorsione. Master theorem e sue estensioni. Teoremi di Akra-Bazzi. Metodo di Iterazione. Cambio di variabile. Ricorrenze lineari di ordine costante con coefficienti costanti. Divide and conquer. Merge Sort. Binary search. Interpolation search. Powering a number. Calcolo dell'ennesimo numero di Fibonacci. Algoritmo di Strassen per la moltiplicazione matriciale. Algoritmo di Karatsuba per la moltiplicazione di due numeri. Minimo e massimo simultanei tramite divide and conquer oppure mediante algoritmo iterativo. Majority element: divide and conquer algorithm. Algoritmo di Boyer and Moore. The hiring problem. Analisi nel caso peggiore. Analisi probabilistica di un algoritmo deterministico (average-case analysis). Variabili aleatorie indicatrici. Media di una variabile aleatoria indicatrice. Algoritmi randomizzati. Caratteristiche e differenze di un algoritmo randomizzato rispetto ad un algoritmo deterministico. Algoritmi randomizzati Monte Carlo e Las Vegas. Esempio di analisi di un algoritmo randomizzato: expected e worst-case running time. Algoritmo randomizzato di Freivald (Monte Carlo Matrix Multiplication Checker) per matrici con entries a valere su un campo GF(2). Correttezza dell'algoritmo di Freivald. Aumento della probabilità che l'output sia corretto da una probabilità costante ad una esponenzialmente elevata. Generalizzazione per matrici con entries a valere su un campo GF(d). QuickSort. Partizionamento. Analisi nel caso peggiore. Analisi nel caso medio. Considerazioni introduttive sulla complessita' dell'algoritmo nel caso medio. Un esempio di analisi nel caso medio errata. Disuguaglianza di Jensen. Analisi dettagliata del caso medio tramite variabili aleatorie indicatrici. Paranoid Quicksort. Analisi nel caso medio di Paranoid Quicksort. Transform and Conquer: instance simplification, representation change, problem reduction. Heaps. Max-Heap e Min-Heap. Max-Heapify. Build-Max-Heap. HeapSort. Priority Queues. Lower bound per algoritmi di ordinamento per confronti. Ordinamento in tempo lineare. Counting sort. Radix sort. Bucket sort. Statistiche d'ordine. Selezione in tempo atteso lineare. Selezione in tempo lineare nel caso peggiore. Introduzione alla programmazione dinamica. Matrix Chain Multiply. Optimal substructure. Overlapping subproblems. Top-down divide and conquer approach with memoization. Bottom-up iterative approach. Longest Common Subsequence. Elementi fondamentali della programmazione dinamica. Knapsack problem. 0-1 and fractional knapsack. Algoritmo basato su programmazione dinamica per 0-1 knapsack. Algoritmi pseudo-polinomiali. Strategia greedy. Greddy choice property. Activity selection problem. Algoritmo basato su programmazione dinamica per activity selection. Algoritmo greedy per activity selection. Confronto tra approccio greedy e programmazione dinamica. Algoritmo greedy applicato a Knapsack 0-1 e fractional knapsack. Minimum Spanning Tree. Algoritmo di Prim. Paths in graphs. Shortest paths. Optimal substructure. Triangle inequality. Negative weight cycles. Single-source shortest paths. Dijkstra’s algorithm. Correctness and analysis. Unweighted graphs and breadth-first search. Correctness and analysis. Belmann-Ford algorithm. Correctness and analysis. Shortest paths in Directed Acyclic Graphs: topological sort and depth-first search. All-pairs shortest paths. Dynamic programming algorithm. Floyd-Warshall algorithm. Correctness and analysis. Transitive closure of a directed graph. Flow networks. Maximum flow problem. Flow cancellation. Net flow. Equivalence between positive and net flow. Properties of flow. Cuts. Flow across a cut. Capacity of a cut. Upper bound on the maximum flow value. Residual network. Residual capacities. Augmenting paths. Max-flow min-cut theorem. Ford-Fulkerson algorithm. Analysis. Pseudo-polinomiality of Ford-Fulkerson. Edmonds-Karp algorithm. Monotonicity lemma. Counting flow augmentations. Analysis of Edmonds-Karp. Maximum matching problem. Maximum and maximal matching. 2-approximation greedy algorithm for maximal matching. Approximation algorithms. Maximum matching on bipartite graphs. Introduzione alla complessita' computazionale concreta. Presentazione informale delle classi di complessita' P, NP e NPC. Problemi astratti e concreti. Problemi decisionali. Codifica di un problema. Linguaggi formali. Caratterizzazione formale delle classi P, NP e NPC. Riduzioni in tempo polinomiale. Il problema P = NP e sue implicazioni. Problemi NP-Completi. Dimostrazioni di NP-Completeness. Circuit-SAT. SAT. 3-CNF-SAT. Clique. Vertex cover. Tipi di computazione: deterministica, randomizzata, nondeterministica. Computing models: RAM e Turing Machine. Nondeterminismo. Church-Turing thesis. Linguaggi decidibili, semidecidibili e indecidibili. Extended Church-Turing thesis. Sequential computing thesis. Upper bounds e lower bounds. Classi di complessita’ di base. Relazioni tra RAM e Turing Machine. Classi di complessita’ L, NL, P, NP, POLYLOGSPACE, PSPACE, EXP. Riduzioni e completezza. P-Completeness. Tesi del calcolo parallelo. Problemi altamente parallelizzabili. Classe Polylogspace. Classe NC. Problemi difficilmente parallelizzabili: i problemi P-Completi.


Algoritmi Paralleli

 

Introduction to parallel computing. Concurrency, parallelism and Bernstein's conditions. Modern scientific method. Evolution of supercomputing. Modern parallel supercomputers. Data Parallelism, functional parallelism and pipeline. Data and functional parallelism example: a generic clustering algorithm. Programming parallel computers. Extending sequential compilers. Extending sequential programming languages. Adding a parallel language layer on top of a sequential programming language. Design and implement a new parallel language and its compiler. MPI and OpenMP. Interconnection networks. Shared and switched medium. Static (direct) and dynamic (indirect) topologies. Switch’s functionalities. Latency and its four components: software overhead, channel delay, switching delay, contention time. Evaluation of network topologies: bisection bandwidth, bisection width, diameter, number of edges per switch node and edge length. 2D mesh, linear network, binary tree, fat tree, hypertree, butterfly, hypercube, shuffle-exchange. Computer vettoriali. Processor array. Multiprocessori. Multiprocessori centralizzati (UMA). Cache coherence. Write-invalidate protocol. Sincronizzazione. Multiprocessori distribuiti (NUMA). Directory based protocol. Multicomputer. Asymmetrical e symmetrical multicomputers. Commodity clusters. Network of Workstations. Tassonomia di Flynn. Task-Channel model. Metodologia PCAM. Partitioning. Domain e functional decomposition. Comunicazione. Comunicazione locale e globale, strutturata e non strutturata. Agglomerazione. Granularity. Surface to volume effect. Communication/computation ratio. Mapping. Static and dynamic mapping decision tree. Case studies. Boundary value problem. Determinare il massimo. Reduction. The n-body problem. Gather. All-gather. Adding data input. Scatter. message-Passing model. Libreria MPI. Circuit-SAT: parallelizzazione tramite MPI. User-Defined Reductions. Derived datatypes. Benchmarking. How to install the MPI library on linux, macOS and Windows. Crivello di Eratostene. Parallelizzazione. Allocazione a blocchi. Broadcast in MPI. Sieve performance enhancements. Eliminare i numeri pari, replicare il calcolo per eliminare le comunicazioni, riorganizzare i cicli per migliorare il cache hit rate. Analysis and benchmarking. All-pairs shortest path problem. Dynamic 2-D arrays in C. Parallel algorithm design: Partitioning. Communication. Agglomeration e mapping. Rowwise e Columnwise block striped decompositions. Point-to-point communication in MPI. Block row matrix I/O. Analysis and benchmarking. Overlapping communication and computation. Analisi delle prestazioni. Communication overhead. Total overhead. Work. Cost. Degree of concurrency. Maximum and average degree of concurrency. Critical Path Length, also known as span or depth. Parallelism. Work Law. Span Law. Brent's theorem. Execution time components. General speedup formula. Optimal number of processors. Super-linear speedup. Another definition of overhead, based on the speedup. Efficiency. Amdahl’s effect. Amdahl’s Law. Gustafson’ Law. Karp-Flatt metric. Isoefficiency Metric. Scalability. Strongly scalable and weakly scalable algorithms. Quasi-scalable algorithms and the scaling zone. Scalability function. Cost-Optimality (Work-Efficiency) and Isoefficiency. Scaling down dei processori. Impatto della mancanza di cost-optimality. Minimum Parallel execution Time. Minimum Cost-Optimal Parallel Execution Time. Moltiplicazione matrice-vettore. Sequential algorithm and its complexity. Design, analysis, and implementation of parallel programs based on Rowwise block striped, and Columnwise block striped decompositions. Replication of vectors. All-gather. MPI_Allgatherv. All-to-all. MPI_Scatterv. MPI_Gatherv. MPI_Alltoallv. Moltiplicazione matrice-vettore: algoritmo basato su decomposizione checkerboard block. Design ed analisi. Creazione di communicators con topologia cartesiana in MPI. Creazione di communicators mediante MPI_Comm_split. Document classification. Parallel algorithm design. Manager-worker paradigm (master-slave). Creating communicators. Non-blocking communications. Additional MPI functions. Moltiplicazione matriciale. Sequential algorithms: Iterative row-oriented and Recursive block-oriented. Parallel algorithms: Rowwise block striped decomposition and Cannon’s algorithm. DNS Algorithm. SUMMA algorithm. Fox's matrix multiplication algorithm. Dense linear systems. Special matrices. Back substitution algorithm. Parallel row-oriented algorithm. Parallel column-oriented algorithm. Gaussian elimination algorithm. Avoiding numerical issues: partial pivoting. Parallel row-oriented algorithm MPI reduction operators MPI_MAXLOC and MPI_MINLOC. Parallel column-oriented algorithm. Pipelined row-oriented algorithm. Sparse linear systems. Jacobi and Gauss-Seidel methods. Conjugate Gradient Method. Parallel algorithm based on rowwise block striped decomposition. Ordinary and partial differential equations. Examples of Phenomena Modeled by PDEs. Solving PDEs. Linear Second-order PDEs. Difference Quotients. Finite difference methods. Vibrating string problem. Discretization. Parallelization. Ghost cells. Analysis. Replication of computations done by neighbouring processes to reduce communications. Steady state heat distribution problem. Parallel algorithm based on rowwise block striped and checkerboard block decompositions. Analysis. Sorting problem. Sequential Quicksort algorithm. Parallel Quicksort. Complexity and isoefficiency analysis. HyperQuicksort. Complexity and isoefficiency analysis. Parallel sorting by regular sampling. Complexity and isoefficiency analysis. Analysis of parallel Floyd's algorithm based on checkerboard block decomposition.


Parallel Programming

 

Message-Passing programming using the MPI library.

Introduction to Algorithms. Fourth edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

 

Parallel Programming in C with MPI and OpenMP (International Edition). Michael J. Quinn. McGraw-Hill

ALGORITMI PARALLELI (ING-INF/05)
DATA MINING & MACHINE LEARNING

Degree course INGEGNERIA INFORMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2022/2023

Year taught 2023/2024

Course year 2

Semestre Primo Semestre (dal 18/09/2023 al 22/12/2023)

Language INGLESE

Subject matter Intelligenza artificiale (A202)

Location Lecce

Analisi Matematica I e II. Teoria della probabilità. Algebra lineare. Capacità di programmazione in linguaggio C/C++.

Il corso fornisce un'introduzione moderna al data mining ed al machine learning, che comprendono tecniche, algoritmi e metodologie per scoprire strutture, schemi e relazioni in insiemi di dati (tipicamente di grandi dimensioni) e fare previsioni. Le applicazioni del data mining e del machine learning sono già presenti intorno a noi e, quando sono fatte bene, a volte passano addirittura inosservate. Per esempio, come funziona la ricerca web di Google? Come fa Netflix a consigliare i film ai suoi utenti? I principi del data mining e del machine learning forniscono risposte a queste e altre domande. Il data mining ed il machine learning intersecano i campi dell'informatica, dell'apprendimento statistico e delle basi di dati. Il corso mira a fornire agli studenti le conoscenze necessarie per esplorare, analizzare e sfruttare i dati disponibili al fine di trasformarli in informazioni preziose e utilizzabili per un'azienda, ad esempio per facilitare il processo decisionale.

Knowledge and understanding. Il corso descrive metodi e modelli per l'analisi di grandi quantità di dati. Gli studenti devono avere un solido background con un ampio spettro di conoscenze di base relative al data mining:

  • gli studenti devono possedere gli strumenti cognitivi di base per pensare in modo analitico, creativo, critico e curioso, e possedere le capacità di astrazione e di risoluzione dei problemi necessarie per affrontare sistemi complessi;
  • devono avere una solida conoscenza dei modelli e delle metodologie di data mining;
  • devono essere in grado di lavorare su grandi raccolte di dati, anche eterogenei e prodotti ad alta velocità, al fine di integrarli - in particolare sapendo gestirne l'origine e la qualità - e di effettuare analisi tematiche approfondite, attingendo a queste conoscenze per migliorare il processo decisionale.

Applying knowledge and understanding. Al termine del corso lo studente dovrà essere in grado di:

  • descrivere e utilizzare le principali tecniche di data mining;
  • comprendere le differenze tra diversi algoritmi che risolvono lo stesso problema e riconoscere quale sia il migliore in diverse condizioni;
  • affrontare nuovi problemi di data mining selezionando i metodi appropriati e giustificando le proprie scelte;
  • affrontare nuovi problemi di data mining progettando algoritmi adeguati e valutando i risultati;
  • spiegare i risultati sperimentali a persone estranee al machine learning o all'informatica.

Making judgements. Lo studente deve avere la capacità di elaborare dati complessi e/o frammentari e deve giungere a idee e giudizi originali e autonomi, nonché a scelte coerenti nel contesto del proprio lavoro, particolarmente delicate nella professione del data scientist. Il corso promuove lo sviluppo dell'autonomia di giudizio nella scelta appropriata di tecniche/modelli per l'elaborazione dei dati e la capacità critica di interpretare la bontà dei risultati dei modelli/metodi applicati ai dataset in esame.

Communication. È fondamentale che gli studenti siano in grado di comunicare con un pubblico vario e composito, non culturalmente omogeneo, in modo chiaro, logico ed efficace, utilizzando gli strumenti metodologici acquisiti e le loro conoscenze scientifiche e, in particolare, il lessico specialistico. Gli studenti devono essere in grado di organizzare materiale divulgativo e di studio efficacemente, attraverso i più comuni strumenti di presentazione anche informatici, per comunicare i risultati dei processi di analisi dei dati, ad esempio utilizzando strumenti di visualizzazione e reportistica rivolti a diversi tipi di pubblico.

Learning skills. Gli studenti devono acquisire la capacità critica di rapportarsi, con originalità e autonomia, alle problematiche tipiche del data mining e, in generale, alle questioni culturali legate ad altri ambiti simili. Dovranno essere in grado di sviluppare e applicare autonomamente le conoscenze e i metodi appresi in vista di un eventuale proseguimento degli studi a livello superiore (dottorato) o nella più ampia prospettiva di auto-miglioramento culturale e professionale dell'apprendimento permanente. Pertanto, gli studenti devono essere in grado di passare a forme espositive diverse dai testi di partenza per memorizzare, riassumere per sé e per gli altri e diffondere le conoscenze scientifiche.

Il corso si propone di fornire agli studenti strumenti avanzati per l'analisi dei dati, attraverso i quali estrapolare informazioni rilevanti da grandi insiemi di dati e guidare i relativi processi decisionali. Il corso si articola in lezioni frontali, con l'ausilio di slide messe a disposizione degli studenti tramite la piattaforma e-learning di UniSalento, ed esercitazioni in aula. Le lezioni frontali mirano a migliorare la conoscenza e la comprensione degli studenti attraverso la presentazione di teorie, modelli e metodi; gli studenti sono invitati a partecipare alla lezione con autonomia di giudizio, ponendo domande e presentando esempi. Le esercitazioni sono finalizzate alla comprensione degli algoritmi e dei modelli presentati.

Progetto software ed esame orale. Durante l'esame viene chiesto allo studente di illustrare argomenti teorici per verificare la sua conoscenza e comprensione degli argomenti selezionati. Lo studente deve dimostrare un'adeguata conoscenza e comprensione degli argomenti presentati o indicati, applicando in modo pertinente le teorie e i modelli concettuali oggetto del programma di studio. L'esame orale è valutato su una scala da 18 a 30 punti. Il progetto software viene assegnato su richiesta dello studente e deve essere obbligatoriamente discusso nella stessa prova in cui si svolge l'esame orale. A tal fine, il progetto (codice e relativa documentazione) deve essere inviato al docente via email almeno tre giorni prima dell'esame. Non saranno accettati progetti inviati oltre il termine indicato. Il progetto software è valutato su una scala da 18 a 30 punti. Il voto finale è dato dalla media dei voti ottenuti, e l'esame è superato se il voto conseguito è almeno pari a 18.

 

La relazione relativa al progetto assegnato deve essere strutturata come segue.

 

1. Introduzione. Lo studente deve fornire una descrizione accurata del progetto assegnato, comprensiva dell’analisi dell’algoritmo sequenziale che risolve il problema affrontato nel progetto. Qualora lo studente lo ritenga importante ai fini della comprensione, possono essere presenti pseudo-codice, esempi, grafici, figure, istanze applicative etc; 

 

2. Implementazione. Il software deve essere implementato utilizzando il linguaggio di programmazione C o, eventualmente, C++ (in modo da usare le strutture dati presenti nella STL del C++). Il codice deve essere necessariamente commentato in modo adeguato. Nel caso in cui lo studente verifichi l’esistenza di strategie multiple per lo stesso problema assegnato, è possibile fornire una implementazione per ciascuna strategia, discutendo vantaggi e svantaggi di ogni strategia. Si noti che il codice non deve essere riportato integralmente nel testo della relazione, in quanto il codice del progetto - sorgenti e Makefile per il build (in alternativa è possibile usare CMake) - deve in ogni caso essere consegnato a parte. Tuttavia, la relazione può includere alcuni snippet di codice relativi alla parti più critiche ed interessanti. 

 

3. Debug e test. Si raccomanda di effettuare un debug e test del codice che consenta di verificare, ragionevolmente, l’assenza di bugs e la correttezza dell’algoritmo in relazione all’output prodotto. Si invita lo studente a progettare e verificare gli unit tests ritenuti idonei a valutare la bontà del codice. Lo studente è esplicitamente avvisato che l’implementazione di una strategia risolutiva non corretta e/o la presenza di bugs che mandano in crash l'applicazione a runtime comportano il non superamento dell'esame, mentre la presenza di bugs che inficiano la correttezza dell'algoritmo comporta una penalizzazione relativa al punteggio che sarà attribuito. 

 

4. Analisi delle prestazioni e della scalabilità. Lo studente deve analizzare le prestazioni dell’implementazione sviluppata in termini di tempo di esecuzione ed occupazione di memoria se necessario. Deve essere valutata la scalabilità dell’algoritmo al variare della problem size. 

 

5. Sintesi. La relazione deve essere quanto più possibile sintetica, ma senza che ciò vada a discapito della chiarezza espositiva. 

 

6. Valutazione del progetto. Il progetto sarà valutato con un punteggio relativo ad una scala che va da 1 a 30 punti. Il voto sarà definito solo al termine della discussione del progetto. La complessità del problema assegnato sarà presa in considerazione per la valutazione; gli studenti che si occupano di problemi più semplici devono quindi aspettarsi una minore indulgenza nella valutazione rispetto agli studenti che si occupano di problemi più difficili. 

 

La valutazione terrà inoltre conto di: 

 

- Chiarezza ed efficacia della presentazione; 

- Capacità tecniche e documentazione dell’implementazione; 

- Pensiero critico nella valutazione e nell’analisi delle prestazioni; 

  • Significatività dei risultati: dato che le prestazioni sono lo scopo principale, un buon progetto deve anche fornire prestazioni significative.

 

 

7. Plagio. Lo studente è avvisato esplicitamente che il plagio è gravissimo, ed è considerato molto seriamente. L’uso di fonti esterne di qualsiasi genere (internet, libri, dispense, lavori pregressi di altri studenti etc) è consentito solo per contributi marginali nel progetto (ad esempio, per la descrizione iniziale del progetto assegnato), ed ogni singola occorrenza deve essere esplicitamente citata e riportata nella relazione. La violazione di questo codice di comportamento non sarà in alcun modo tollerata, e comporta l’immediato annullamento dell’esame ed il deferimento dello studente alla commissione disciplinare di ateneo, con avvio del relativo procedimento disciplinare secondo quanto riportato del Titolo V (PROCEDIMENTI DISCIPLINARI E SANZIONI) del Regolamento di Ateneo per gli studenti (D.R. 672 del 5/12/2017, entrato in vigore in data 8/12/2017).

 

Ricevimento studenti

Su appuntamento; contattare il docente via e-mail o al termine degli incontri di classe.

Introduction to the course.

 

Introduction to the Map-Reduce parallel programming model. Open-source Hadoop implementation. Pros and cons of Hadoop and Map-Reduce. Distributed File System. Chunk servers, Master node. Map Function. Sort and Shuffle. Reduce Function. Map Tasks. Reduce Tasks. Word counting. Handling failures. Number of Map and Reduce jobs. Granularity of tasks and pipelining. Strugglers tasks: mitigating the problem by spawning backup tasks. Combiners. Partition (hash) function. Additional examples of Map-Reduce: natural join, two-pass matrix multiply, single pass matrix multiply. Cost measures for Map-Reduce algorithms.

 

Frequent Pattern Mining. Discovery of association rules. Market-basket model. Examples of possible applications. Frequent itemsets. Support of an itemset. Association rules. Confidence and Interest. Association rules with highly positive or negative interest. Mining association rules. Maximal and closed frequent itemsets. Lattice of the itemsets. Naive approach to counting frequent pairs. A-priori. Monotonicity. PCY. PCY refinements: multistage and multihash. Frequent itemsets in 2 passes: random sampling and choice of the threshold, SON, monotonicity, parallel SON parallelo with Map-Reduce in 2 passes, Toivonen and the negative border. ECLAT. DECLAT. FPGrowth.  Sequence mining. Frequent sequences. Mining frequent sequences. GSP. SPADE. PREFIXSPAN. 

 

Streams. Uniform, 2-universal and pairwise independent hash function. Streaming: turnstile, strict turnstile and cash register models. Frequency estimation. Sketches. Count-Sketch. Count-Min. Comparing Count-Sketch and Count-Min. Frequent items. Phi-frequent items. The majority problem. Boyer-Moore. Misra-Gries. Frequent. Space Saving. Space Saving properties. Comparing Frequent and Space Saving. Sampling from a Data Stream: Sampling a fixed proportion. Sampling a fixed-size sample: varying the sample size. Reservoir Sampling. Queries over a sliding windows: counting the number of bits equal to 1 with DGIM. Filtering data streams: Bloom FIlters. Counting distinct elements: Flajolet-Martin. Estimating moments with AMS.

 

Scene completion problem. Near neighbors in high dimensionality spaces. Document similarity. Pairs of candidate documents. Near neighbor search. Jaccard similarity and distance. Shingling: how to convert documents, emails etc in sets. k-shingles. Compression through hashing of k-shingles. Min-Hashing: converting big sets in short signatures preserving the similarity. Similarity and Jaccard distance for boolean vectors. Boolean matrices. Min-hash signatures. Implementation. Locality-Sensitive Hashing: determining pairs of candidate documents. Matrix partitioning in b bands of r rows: analysis of accuracy with regard to  false positives and false negatives.

 

Link analysis. PageRank. Dead ends. Spider traps. Flow formulation. Matrix formulation. Random walk interpretation. Stationary distribution of a Discrete-Time Markov Chain. Perron-Frobenius Theorem. Google matrix and teleportation. Sparse matrix encoding. Block update algorithm. Topic-specific PageRank. Matrix formulation. Topic vector. Web Spam. Term spam. Spam farms. PageRank value obtained through a Spam Farm. TrustRank. Trust propagation. Spam Mass estimation.

 

Recommender systems. Recommendations. The long tail phenomenon. Content-based systems. Utility function and matrix. Ratings. Extrapolation of ratings (utilities). Item profiles. User profiles. Collaborative filtering. k-NN. Similarity metrics. User-user and item-item collaborative filtering. Evaluation of systems. Error metrics. RMSE, precision, rank correlation. Complexity of collaborative filtering. The Netflix challenge. Bellkor recommender system. Modeling local and global effects. Learning the optimal weights: optimization problem and gradient descent. Latent factor models. SVD decomposition. Learning the P and Q matrices. Preventing overfitting: regularization. Stochastic Gradient Descent. Biases and interactions. Temporal biases and factors.

 

Artificial Intelligence. Different definitions over time: thinking like a human, acting like humans and the Turing test, thinking rationally, acting rationally. Rational agents, decisions and behaviour. Game playing: chess, go, etc. The human intelligence. Human versus artificial intelligence. Benefits of AI. Ai, machine learning and deep learning. AI applications and use-cases. Human learning process. Features, instances, labels, classes, classification function. Training set, validation and test set. Supervised learning: regression, binary and multiclass classification. Unsupervised learning. Semi-supervised and weakly-supervised learning. Reinforcement learning: agent, environment and its state that changes depending on the agent's actions, reward and penalty. Loss functions. Zero-one and squared loss functions. Bias and variance. Deep Learning. Machine Learning vs Deep Learning. Limitations of Deep Learning. Artificial versus biological neural networks. AI Myths.

 

The clustering problem. Curse of dimensionality. Clustering in euclidean and non euclidean spaces. Distances. Hierarchical clustering: agglomerative and divisive algorithms. Clustering by point assignment. Centroid and clustroid. K-means and K-means++. Choosing k: elbow criterion. BFR. Discard, Compression and Retained sets. Summarizing points. Mahalanobis distance. CURE. Representative points and their choice. Input space and feature space. Kernel methods. Kernel matrix. Linear kernel. Kernel trick. Kernel operations in feature space. Represenative clustering: K-means and Kernel K-means. Expectation-Maximization clustering. Hierarchical clustering. Density-based clustering. DBSCAN. Clustering validation: external, internal and relative measures.

 

Machine learning. Supervised and unsupervised approaches. Numerical and categorical attributes. Categorical attributes: nominal and ordinal attributes. Probabilistic classifiers. Parametric approach: Bayes and naive Bayes classifiers. Data centering. Non parametric approach (density based): K-nearest neighbors classifier. Decision Trees. Hyperplans. Split points. Data partion and purity. Split Point Evaluation Measures: entropy, split entropy, information gain, Gini index, CART. Evaluation of numerical and categorical split points. Support Vector machines. Hyperplanes. Support Vectors and Margins. Linear and Separable Case. Soft Margin SVM: Linear and Nonseparable Case. Kernel SVM: Nonlinear Case. SVM Training Algorithms. Multiclass SVM. Assessing the performances of a classifier. Evaluation metrics. ROC curve and AUC. K-fold cross-validation. Bootstrapping. Confidence intervals. Paired t-Test. Bias and variance decomposition. Ensemble classifiers. Bagging. Random Forest. Boosting. Stacking.

Mining of Massive Datasets, 3rd edition

J. Leskovec, A. Rajaraman and J. Ullman

Freely available online: http://www.mmds.org

 

Data Mining and Analysis, 2nd edition

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

 

 

DATA MINING & MACHINE LEARNING (ING-INF/05)
LABORATORIO INFORMATICO

Corso di laurea INFERMIERISTICA (ABILITANTE ALLA PROFESSIONE SANITARIA DI INFERMIERE)

Settore Scientifico Disciplinare ING-INF/05

Tipo corso di studio Laurea

Crediti 2.0

Docente titolare LUCA MAINETTI

Ripartizione oraria Ore totali di attività frontale: 40.0

  Ore erogate dal docente Massimo CAFARO: 20.0

Per immatricolati nel 2023/2024

Anno accademico di erogazione 2023/2024

Anno di corso 1

Semestre Secondo Semestre (dal 04/03/2024 al 07/06/2024)

Lingua

Percorso SEDE LECCE (A231)

Non vi è alcuna propedeuticità per il corso.

In sintesi il laboratorio affronta i seguenti temi: introduzione alle architetture dei dati, introduzione all'analisi dei dati tramite linguaggio SQL, struttura dei sistemi di calcolo, concetto di algoritmo ed elementi della programmazione strutturata, introduzione al linguaggio di programmazione Python.

Il corpus principale del laboratorio intende dunque fornire le basi pratiche per la modellazione e l'analisi dei dati e, parallelamente, per l'elaborazione dei dati sviluppando semplici programmi Python. Ogni concetto esposto è sperimentato in modo pratico insieme agli studenti utilizzando il personal computer e strumenti di sviluppo moderni e ampiamente diffusi nel mondo industriale.

Conoscenze e comprensione. Al termine del corso gli studenti: (a) conosceranno i principi della programmazione strutturata, in relazione alle caratteristiche del software; (b) conosceranno gli aspetti generali del linguaggio Python cioè programmazione con i tipi di dati fondamentali, le strutture di controllo, le funzioni, le liste, i file, gli insiemi e i dizionari; (c) comprenderanno le tecniche di codifica in Python di algoritmi; (d) comprenderanno come utilizzare i principali ambienti di sviluppo Python anche in relazione delle singole necessità rappresentate nei requisiti del software e nella strutturazione dei dati.

Inoltre, al termine del corso gli studenti conosceranno il linguaggio SQL, necesssario per interagire con una base di dati. In particolare: (a) gli statements (DDL, Data Definition Language) necessari per specificare lo schema concettuale ed interno di un DBMS (Data Base Management System) ed il mapping tra i due; (b) gli statements (DML, Data Manipulation Language) necessari per manipolare una base di dati. Le operazioni tipicamente supportate includono il recupero, l’inserimento, la cancellazione e la modifica dei dati.

Capacità di applicare conoscenze e comprensione. Gli studenti saranno in grado di applicare le conoscenze acquisite in diversi ambiti applicativi e, in generale, per la codifica al computer in linguaggio Python di logica di business. Inoltre, gli studenti saranno in grado di creare ed interrogare una base di dati.

Autonomia di giudizio. Il corso favorisce l'autonomia di giudizio degli studenti attraverso l'analisi critica di problemi di modellazione del software da requisiti funzionali e non funzionali, per i quali trovare le soluzioni adeguate a risolverli in linguaggio Python. Diverse soluzioni proposte interattivamente dagli studenti saranno poste a confronto e valutate criticamente dagli studenti stessi. Inoltre, gli studenti devono possedere la capacità di problem solving e devono pervenire a idee e giudizi originali e autonomi, a scelte coerenti nell’ambito del loro lavoro, particolarmente delicate nell’ambito della implementazione di una base di dati. Il corso promuove lo sviluppo dell’autonomia di giudizio nella scelta appropriata della soluzione migliore relativa a semplici problemi legati all’interazione con una base di dati e la capacità critica di interpretare la bontà dei risultati ottenuti.

 

Abilità comunicative. Gli studenti apprenderanno come comunicare adeguatamente e con il corretto livello di formalismo le scelte di design adottate e le strategie di implementazione scelte. Il metodo di insegnamento interattivo e teorico/pratico favorirà momenti di confronto in cui mettere in pratica tali abilità comunicative. È fondamentale che gli studenti siano in grado di comunicare con un pubblico vario e composito, non omogeneo culturalmente, in modo chiaro, logico ed efficace, utilizzando gli strumenti metodologici acquisiti e le loro conoscenze scientifiche e, in particolar modo, il lessico di specialità. Il corso favorisce lo sviluppo delle abilità inerenti le capacità di esporre in termini precisi e formali snippets di codice sorgente in linguaggio Python e queries SQL.

Capacità di apprendimento. La materia in costante evoluzione (sia le tecniche di sviluppo del software, sia i linguaggi che le implementano) richiederà agli studenti la capacità di aggiornarsi e di ricercare materiale on-line, valutandone anche la qualità. Il metodo didattico favorirà l’approfondimento autonomo da parte degli studenti, incuriosendoli su tecniche di sviluppo evolute (vedi i design pattern). Gli studenti fevono inoltre essere in grado di riusare le conoscenze legate alle basi di dati indipendentemente dallo specifico DBMS utilizzato.

Lezioni frontali in laboratorio con ampio spazio a esercitazioni pratiche svolte con l’uso del personal computer, creazione individuale di semplici basi di dati, elaborazione di script SQL e di programmi Python.

L'esame prevede una prova orale per la verifica dell'apprendimento dei concetti teorici (verifica delle conoscenze) e della capacità di applicazione dei medesimi, in particolare per la codifica autonoma di basi di dati, di script SQL e di semplici programmi Python (verifica delle competenze). Durante l’esame lo studente dovrà usare il personal computer, configurato con gli ambienti di sviluppo illustrati e utilizzati in laboratorio.

Si veda www.disteba.unisalento.it.

www.unisalento.it/people/luca.mainetti

www.unisalento.it/people/massimo.cafaro

 

Ricevimento Studenti

Su appuntamento; contattare il docente via e-mail o al termine degli incontri di classe.

Presentazione dettagliata del modulo Python (1 ora). Struttura dei sistemi di calcolo (2 ore). Il concetto di algoritmo (1 ora). Introduzione al linguaggio Python (2 ore). Programmare con numeri e stringhe (2 ore). Decisioni (2 ore). Cicli (2 ore). Funzioni (2 ore). Liste (2 ore). Eccezioni e file (2 ore). Insiemi e dizionari (2 ore).

Il linguaggio SQL. Definizione dei dati e tipi di dato. I concetti di schema e catalogo. Statement CREATE SCHEMA. Statement CREATE TABLE. Tipi di dato di un attributo. Statement CREATE DOMAIN. Specifica di vincoli in SQL. Vincoli sugli attributi e valori di default per gli attributi. PRIMARY Key, FOREIGN KEY, integrità referenziale. Clausola UNIQUE, clausola CHECK per vincoli su tuple. Query SELECT. Aliases. Eliminare tuple duplicate con DISTINCT. Tabelle come insiemi: operazioni UNION, EXCEPT, INTERSECT. Tabelle come multi-insiemi: UNION ALL, EXCEPT ALL, INTERSECT ALL. Pattern matching per sottostringhe ed operatori aritmetici. Ordinare i risultati di una query: ORDER BY. (3 ore)

 

Statement INSERT. Statement DELETE. Statement UPDATE. Valori NULL. Logica basata sui tre valori TRUE, FALSE ed UNKNOWN. Queries annidate. Queries annidate correlate. Funzioni EXISTS e UNIQUE. Insiemi espliciti e ridenominazione di attributi in SQL. (3 ore).

 

JOIN, NATURAL JOIN, OUTER JOIN (left, right, full). Funzioni di aggregazione: COUNT, SUM, MAX, MIN, AVG. Clausola GROUP BY. Clausola HAVING. Clausola WITH. Costrutto CASE. Queries ricorsive. Concetto di vista in SQL. Statement CREATE VIEW. Statement DROP VIEW. Autorizzazione: statements GRANT e REVOKE. Statement DROP. Statement ALTER. (3 ore)

 

Il DBMS SQLite. (3 ore)

 

Accesso ed interazione con una base di dati SQLite in Python. (8 ore)

- Cay Horstmann, Rance D. Necaise, “Concetti di Informatica e Fondamenti di Python”, Seconda Edizione, Apogeo Education, Maggioli Editore, 2019.

- Ramez Elmasri, Shamkant B. Navathe, “Fundamentals of Database Systems”, 7th Edition, Pearson, 2015.

- Materiale didattico fornito dal docente tramite il sito elearning.unisalento.it.

LABORATORIO INFORMATICO (ING-INF/05)
DATA MINING & MACHINE LEARNING

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2021/2022

Year taught 2022/2023

Course year 2

Semestre Primo Semestre (dal 19/09/2022 al 16/12/2022)

Language INGLESE

Subject matter ARTIFICIAL INTELLIGENCE (A182)

Location Lecce

Calculus. Probability theory. Linear Algebra. Programming skills.

The course provides a modern introduction to data mining, which spans techniques, algorithms and methodologies for discovering structure, patterns and relationships in data sets (typically, large ones) and making predictions. Applications of data mining are already happening all around us, and, when they are done well, sometimes they even go unnoticed. For instance, how does the Google web search work? How does Netflix recommend movies to its users? The principles of data mining provide answers to these and others questions. Data mining overlaps the fields of computer science, statistical machine learning and data bases. The course aims at providing the students with the knowldedge required to explore, analyze and leverage available data in order to turn the data into valuable and actionable information for a company, for instance, in order to facilitate a decision-making process.

Knowledge and understanding. The course describes methods and models for the analysis of large amounts of data. Students must have a solid background with a broad spectrum of basic knowledge related to data mining:

  • the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;
  • they must have solid knowledge of data mining models and methodologies;
  • they must be able to work on large data collections, including heterogeneous and produced at high speed data, in order to integrate them - in particular by knowing how to manage their origin and quality - and to carry out in-depth thematic analyses, drawing on this knowledge to improve the decision-making process.

Applying knowledge and understanding. After the course the student should be able to:

  • describe and use the main data mining techniques;
  • understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;
  • tackle new data mining problems by selecting the appropriate methods and justifying his/her choices;
  • tackle new data mining problems by designing suitable algorithms and evaluating the results;
  • explain experimental results to people outside of statistical machine learning or computer science.

Making judgements. Students must have the ability to process complex and/or fragmentary data and must arrive at original and autonomous ideas and judgments, and consistent choices in the context of their work, which are particularly delicate in the profession of data scientist. The course promotes the development of independent judgment in the appropriate choice of technique/model for data processing and the critical ability to interpret the goodness of the results of the models/methods applied to the datasets under examination.

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. Students should be able to organize effective dissemination and study material through the most common presentation tools, including computer-based ones, to communicate the results of data analysis processes, for example by using visualization and reporting tools aimed at different types of audiences.

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to provide students with advanced tools for data analysis, through which to extrapolate relevant information from large datasets and guide the related decision-making processes. The course consists of frontal lessons using slides made available to students via the UniSalento e-learning platform, and classroom exercises. The frontal lessons are aimed at improving students' knowledge and understanding through the presentation of theories, models and methods; students are invited to participate in the lesson with autonomy of judgement, by asking questions and presenting examples. The exercises are aimed at understanding the algorithms and models presented.

Software project and oral exam. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student must demonstrate adequate knowledge and understanding of the issues presented or indicated, applying in a relevant manner the theories and conceptual models covered by the study programme. The oral exam is evaluated on a scale from 18 to 30 points. The software project is assigned upon request to the student, and must be mandatorily discussed into the same trial in which the oral test is performed. The software project is evaluated on a scale from 18 to 30 points. The final grade is given by the average of the grades obtained, and the exam is passed if the grade achieved is at least 18.

 

The report on the assigned project must be structured as follows.

 

1. Introduction. The student must provide an accurate description of the assigned project, including an analysis of the sequential algorithm that solves the problem addressed in the project. If the student deems it important for understanding, then pseudo-code, examples, graphs, figures, application instances etc. may be included; 

 

2. Implementation. The software must be implemented using the C or, possibly, C++ programming language (so as to use the data structures present in the C++ STL). The code must necessarily be appropriately commented. In the event that the student verifies the existence of multiple solution strategies for the same assigned problem, an implementation can be provided for each strategy, discussing the advantages and disadvantages of each strategy. Note that the code does not have to be given in full in the text of the report, as the project code - source and Makefile for the build (alternatively CMake can be used) - must in any case be delivered separately. However, the report may include some code snippets relating to the most critical and interesting parts. 

 

3. Debugging and testing. It is recommended to debug and test the code in order to reasonably verify the absence of bugs and the correctness of the algorithm in relation to the output produced. The student is urged to design and test unit tests deemed suitable for assessing the goodness of the code. The student is explicitly warned that the implementation of an incorrect solving strategy and/or the presence of bugs that crash the application at runtime will result in failing the exam, while the presence of bugs that affect the correctness of the algorithm will result in a penalty relative to the mark that will be awarded.

 

4. Performance and scalability analysis. The student must analyse the performance of the implementation developed in terms of execution time and memory occupation if necessary. The scalability of the algorithm as the problem size changes must be assessed. 

 

5. Synthesis. The report must be as concise as possible, but without detriment to clarity of presentation. 

 

6. Project evaluation. The project will be assessed on a scale of 1 to 30 points. The grade will only be determined at the end of the project discussion. The complexity of the problem assigned will be taken into account in the assessment; students dealing with simpler problems should therefore expect less leniency in the assessment than students dealing with more difficult problems. 

 

Assessment will also take into account: 

 

- Clarity and effectiveness of presentation; 

- Technical skills and documentation of implementation; 

- Critical thinking in evaluation and performance analysis; 

- Meaningfulness of results: since performance is the main purpose, a good project must also deliver meaningful performance.

 

7. Plagiarism. The student is explicitly warned that plagiarism is very serious, and is taken very seriously. The use of external sources of any kind (internet, books, handouts, previous work by other students etc.) is only allowed for marginal contributions in the project (e.g. for the initial description of the assigned project), and each individual occurrence must be explicitly cited and reported in the report. Violation of this code of conduct will not be tolerated in any way, and will result in the immediate cancellation of the examination and the referral of the student to the University Disciplinary Committee, with the commencement of the relevant disciplinary proceedings in accordance with the provisions of Title V (DISCIPLINARY PROCEDIMENTS AND SANCTIONS) of the University Regulations for Students (D.R. 672 of 5/12/2017, which came into force on 8/12/2017).

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Introduction to the course.

 

Introduction to the Map-Reduce parallel programming model. Open-source Hadoop implementation. Pros and cons of Hadoop and Map-Reduce. Distributed File System. Chunk servers, Master node. Map Function. Sort and Shuffle. Reduce Function. Map Tasks. Reduce Tasks. Word counting. Handling failures. Number of Map and Reduce jobs. Granularity of tasks and pipelining. Strugglers tasks: mitigating the problem by spawning backup tasks. Combiners. Partition (hash) function. Additional examples of Map-Reduce: natural join, two-pass matrix multiply, single pass matrix multiply. Cost measures for Map-Reduce algorithms.

 

Frequent Pattern Mining. Discovery of association rules. Market-basket model. Examples of possible applications. Frequent itemsets. Support of an itemset. Association rules. Confidence and Interest. Association rules with highly positive or negative interest. Mining association rules. Maximal and closed frequent itemsets. Lattice of the itemsets. Naive approach to counting frequent pairs. A-priori. Monotonicity. PCY. PCY refinements: multistage and multihash. Frequent itemsets in 2 passes: random sampling and choice of the threshold, SON, monotonicity, parallel SON parallelo with Map-Reduce in 2 passes, Toivonen and the negative border. ECLAT. DECLAT. FPGrowth.  Sequence mining. Frequent sequences. Mining frequent sequences. GSP. SPADE. PREFIXSPAN. 

 

Streams. Uniform, 2-universal and pairwise independent hash function. Streaming: turnstile, strict turnstile and cash register models. Frequency estimation. Sketches. Count-Sketch. Count-Min. Comparing Count-Sketch and Count-Min. Frequent items. Phi-frequent items. The majority problem. Boyer-Moore. Misra-Gries. Frequent. Space Saving. Space Saving properties. Comparing Frequent and Space Saving. Sampling from a Data Stream: Sampling a fixed proportion. Sampling from a Data Stream: Sampling a fixed-size sample (Reserved Sampling). Queries over a sliding windows: counting the number of bits equal to 1 with DGIM. Filtering data streams: Bloom FIlters. Counting distinct elements: Flajolet-Martin. Estimating moments with AMS.

 

Scene completion problem. Near neighbors in high dimensionality spaces. Document similarity. Pairs of candidate documents. Near neighbor search. Jaccard similarity and distance. Shingling: how to convert documents, emails etc in sets. k-shingles. Compression through hashing of k-shingles. Min-Hashing: converting big sets in short signatures preserving the similarity. Similarity and Jaccard distance for boolean vectors. Boolean matrices. Min-hash signatures. Implementation. Locality-Sensitive Hashing: determining pairs of candidate documents. Matrix partitioning in b bands of r rows: analysis of accuracy with regard to  false positives and false negatives.

 

Link analysis. PageRank. Dead ends. Spider traps. Flow formulation. Matrix formulation. Random walk interpretation. Stationary distribution of a Discrete-Time Markov Chain. Perron-Frobenius Theorem. Google matrix and teleportation. Sparse matrix encoding. Block update algorithm. Topic-specific PageRank. Matrix formulation. Topic vector. Web Spam. Term spam. Spam farms. PageRank value obtained through a Spam Farm. TrustRank. Trust propagation. Spam Mass estimation.

 

Recommender systems. Recommendations. The long tail phenomenon. Content-based systems. Utility function and matrix. Ratings. Extrapolation of ratings (utilities). Item profiles. User profiles. Collaborative filtering. k-NN. Similarity metrics. User-user and item-item collaborative filtering. Evaluation of systems. Error metrics. RMSE, precision, rank correlation. Complexity of collaborative filtering. The Netflix challenge. Bellkor recommender system. Modeling local and global effects. Learning the optimal weights: optimization problem and gradient descent. Latent factor models. SVD decomposition. Learning the P and Q matrices. Preventing overfitting: regularization. Stochastic Gradient Descent. Biases and interactions. Temporal biases and factors.

 

Artificial Intelligence. Different definitions over time: thinking like a human, acting like humans and the Turing test, thinking rationally, acting rationally. Rational agents, decisions and behaviour. Game playing: chess, go, etc. The human intelligence. Human versus artificial intelligence. Benefits of AI. Ai, machine learning and deep learning. AI applications and use-cases. Human learning process. Features, instances, labels, classes, classification function. Training set, validation and test set. Supervised learning: regression, binary and multiclass classification. Unsupervised learning. Semi-supervised and weakly-supervised learning. Reinforcement learning: agent, environment and its state that changes depending on the agent's actions, reward and penalty. Loss functions. Zero-one and squared loss functions. Bias and variance. Deep Learning. Machine Learning vs Deep Learning. Limitations of Deep Learning. Artificial versus biological neural networks. AI Myths.

 

The clustering problem. Curse of dimensionality. Clustering in euclidean and non euclidean spaces. Distances. Hierarchical clustering: agglomerative and divisive algorithms. Clustering by point assignment. Centroid and clustroid. K-means and K-means++. Choosing k: elbow criterion. BFR. Discard, Compression and Retained sets. Summarizing points. Mahalanobis distance. CURE. Representative points and their choice. Input space and feature space. Kernel methods. Kernel matrix. Linear kernel. Kernel trick. Kernel operations in feature space. Represenative clustering: K-means and Kernel K-means. Expectation-Maximization clustering. Hierarchical clustering. Density-based clustering. DBSCAN. Clustering validation: external, internal and relative measures.

 

Machine learning. Supervised and unsupervised approaches. Numerical and categorical attributes. Categorical attributes: nominal and ordinal attributes. Probabilistic classifiers. Parametric approach: Bayes and naive Bayes classifiers. Data centering. Non parametric approach (density based): K-nearest neighbors classifier. Decision Trees. Hyperplans. Split points. Data partion and purity. Split Point Evaluation Measures: entropy, split entropy, information gain, Gini index, CART. Evaluation of numerical and categorical split points. Support Vector machines. Hyperplanes. Support Vectors and Margins. Linear and Separable Case. Soft Margin SVM: Linear and Nonseparable Case. Kernel SVM: Nonlinear Case. SVM Training Algorithms. Multiclass SVM. Assessing the performances of a classifier. Evaluation metrics. ROC curve and AUC. K-fold cross-validation. Bootstrapping. Confidence intervals. Paired t-Test. Bias and variance decomposition. Ensemble classifiers. Bagging. Random Forest. Boosting. Stacking.

 

Introduction to neural networks. History. The Biological Inspiration. Learning in Biological vs Artificial Networks. An Alternative View: The Computational Graph Extension of Traditional Machine Learning. Machine Learning versus Deep Learning. Single Layer Networks: the Perceptron. Bias neurons. Training a Perceptron. Perceptron vs Linear SVMs. Activation and Loss Functions. Multilayer Neural Networks. Reasons  for nonlinearity of hidden layers. Role of Hidden Layers. The Feature Engineering View of Hidden Layers. Multilayer Networks as Computational Graphs. Connecting Machine Learning with Shallow Neural Networks: Perceptron versus Linear SVM, RBF Network versus kernel SVM. Neural models for linear regression, classification and the Fisher discriminant. Widrow-Hoff learning. Neural model for linear regression. Comparison of Widrow-Hoff with Perceptron and SVM. Connections with Fisher Discriminant. Neural Models for Logistic Regression. The Softmax Activation Function and Multinomial Logistic Regression. The Autoencoder for Unsupervised Representation Learning. Singular Value Decomposition with Autoencoders. Row-Index to Row-Value Autoencoders: Incomplete Matrix Factorization for Recommender Systems. Model Generalization and the Bias-Variance Trade-Off in the context of neural networks. 

Penalty-Based Regularization. Economy in Parameters. Soft vs hard economy. L1 vs L2 regularization. L2 regularization and noise injection. Weight sharing. Dropout. Feature co-adaptation. Why feature co-adaptation is bad. Dropout training. Dropout testing: repeated sampling and geometric averaging vs weight scaling inference rule. Why dropout helps. Connection of dropout with noise injection and regularization. How to apply dropout in practice.

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Freely availableonline: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

 

Neural Networks and Deep Learning

Charu C. Aggarwal

Springer

 

DATA MINING & MACHINE LEARNING (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Owner professor Massimo CAFARO

Teaching hours Ore totali di attività frontale: 81.0

  Ore erogate dal docente Massimo CAFARO: 54.0

For matriculated on 2021/2022

Year taught 2022/2023

Course year 2

Semestre Secondo Semestre (dal 01/03/2023 al 09/06/2023)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus I and II, Probability Theory. Programming skills and working knowledge of the C programming language.

The course provides a modern introduction to design, analysis and implementation of sequential and parallel algorithms. In particular, the course is based on a pragmatic approach to parallel programming of message-passing algorithms through the C language and the MPI library.

Knowledge and understanding. Students must have a solid background with a broad spectrum of basic knowledge of sequential and parallel algorithms:

 

· the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;

· they must have a solid knowledge of the design and implementation of sequential and parallel efficient algorithms;

· they must have the tools for analysing the resources used by algorithms;

· they must have a catalogue of the most well-known and efficient sequential and parallel algorithms for basic computational problems.

 

Applying knowledge and understanding. After the course the student should be able to:

 

· Describe and use the main design techniques for sequential algorithms;

· Design, prove the correctness and analyze the computational complexity of sequential algorithms;

· Understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;

· Describe and use basic sequential algorithms;

· Describe and use basic data structures; know about the existence of advanced data structures;

· Understand the difference between sequential and parallel algorithms;

· Design, implement and analyze message-passing based parallel algorithms in C using the MPI library;

· Describe and use basic parallel algorithms.

 

 

Making judgements. Students are guided to learn critically everything that is explained to them in class, to compare different approaches to solving algorithmic problems, and to identify and propose, in an autonomous way, the most efficient solution they find.

 

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. The course promotes the development of the following skills of the student: ability to expose in precise and formal terms an abstract model of concrete problems, identifying the salient features of them and discarding the nonessential ones; ability to describe and analyze an efficient solution to the problem in question.

 

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to enable students to abstract formal algorithmic models and problems from concrete computational problems, and to design efficient algorithmic solutions for them. This will be done using the following teaching method. Every computational problem will be introduced, motivating it with concrete examples. The presentation of each topic will be divided into four parts: 1. Description of the actual computational problem. 2. Modelling the real problem by means of an abstract problem. 3. Resolution of the abstract problem through an algorithm obtained through the application of the general techniques of design of algorithms introduced in the course. 4. Analysis of the resources used by the algorithm. The course consists of frontal lessons, and classroom exercises. There will be theoretical lessons aimed at learning the basic techniques for the project and analysis of algorithms, and a part of lessons devoted to exercises in which we will illustrate, with plenty of examples, how the theoretical knowledge acquired can be used in order to solve algorithmic problems of practical interest and implement parallel algorithms in C language through the MPI library.

The exam consists of a written test and a prototype implementation of a parallel software. The written test  (2 hours, 21 points out of 30) covers theoretical topics related to the design and analysis of sequential and parallel algorithms in order to verify the student’s knowledge and understanding of the materials. In order to pass the written test, students must obtain at least 9 points out of 21.  The parallel software project (9 points out of 30) is meant to verify the practice of parallel programming and the ability of the student to implement theoretical parallel algorithms or to parallelize a sequential algorithm. In order to pass the parallel programming challenge, students must obtain at least 4 points out of 9. Both the written test and the prototype implementation of parallel software are mandatory. Students must pass both the written test and the parallel programming challenge; however, the overall exam is passed only if the sum of points obtained in the written test and the parallel programming challenge is greater than or equal to 18 points.

 

The problem to be solved in parallel (or a sequential algorithm to be implemented in parallel) is assigned upon request to the student. The assigned parallel software project can be discussed only if a written test has been successfully passed; moreover, the parallel software project must be mandatorily discussed into the same trial in which the written test has been passed.

 

The report on the assigned parallel software project must be structured as follows.

 

1. Introduction. The student must provide an accurate description of the assigned project, including an analysis of the sequential algorithm that solves the problem addressed in the project. If the student deems it important for understanding, then pseudo-code, examples, graphs, figures, application instances etc. may be included; 

 

2. Parallel design. This is a preliminary study aimed at highlighting the opportunities for parallelism inherent in the problem and/or sequential algorithm. Starting from the pseudo-code, it is then necessary to determine the most appropriate functionalities, code parts and/or data structures that can be considered for parallelisation. At this stage, alternative designs for different parallelisation strategies must be considered and discussed, duly justifying why some operations lend themselves to effective parallelisation and others do not. Furthermore, the expected results must be reported, including a worst-case parallel complexity analysis. Clarity of exposition regarding the parallelisation strategy is required, therefore the data structures used must be described in full, justifying the choices made to minimise the burden of communication and synchronisation. The theoretical analytical evaluation of the scalability of the parallel version is mandatory, and must be performed using both the isoefficiency and the scalability function. 

 

3. Implementation. The parallel design adopted must be implemented using the programming language C or, possibly, C++ (so as to use the data structures present in the STL of C++). The code must necessarily be appropriately commented. In the event that the student verifies the existence of multiple parallelisation strategies for the same assigned problem, an implementation can be provided for each strategy, discussing the advantages and disadvantages of each strategy. Note that the parallel code does not have to be given in full in the text of the report, as the project code - source and Makefile for the build (alternatively CMake can be used) - must in any case be delivered separately. However, the report may include some code snippets relating to the most critical and interesting parts.

 

4. Debugging and testing. Debugging and testing of the parallel code is recommended in order to reasonably verify the absence of bugs and the correctness of the parallel algorithm in relation to the output produced. The student is encouraged to design and verify unit tests deemed suitable for assessing the goodness of the code. The student is explicitly warned that the implementation of an incorrect parallelisation strategy and/or the presence of bugs that crash the parallel application at runtime will result in failing the exam, while the presence of bugs that affect the correctness of the algorithm will result in a penalty relative to the mark that will be awarded. 

 

5. Performance and scalability analysis. The student must analyse the performance of the developed implementation in terms of execution time, memory occupation (if necessary), speedup and efficiency. Both strong scalability (Amdahl) and weak scalability (Gustafson) must be assessed where possible. 

 

6. Synthesis. The report must be as concise as possible, but without detriment to clarity of presentation. 

 

7. Project evaluation. The project will be assessed on a scale of 1 to 9 points. The grade will only be determined at the end of the project discussion. The complexity of the problem assigned will be taken into account in the evaluation; students dealing with simpler problems should therefore expect less leniency in the evaluation than students dealing with more difficult problems. 

 

Assessment will also take into account: 

 

- Clarity and effectiveness of presentation; 

- Depth, correctness and originality of theoretical analysis; 

- Technical ability and documentation of implementation; 

- Number and quality of multiple parallel strategies, if any; 

- Critical thinking in performance evaluation and analysis; 

Significance of results: since parallel programming is primarily concerned with performance, a good project must also provide a significant speedup.

 

 

8. Plagiarism. The student is explicitly warned that plagiarism is very serious, and is taken very seriously. The use of external sources of any kind (internet, books, handouts, previous work by other students etc.) is only allowed for marginal contributions in the project (e.g. for the initial description of the assigned project), and each individual occurrence must be explicitly cited and reported in the report. Violation of this code of conduct will not be tolerated in any way, and will result in the immediate cancellation of the examination and the referral of the student to the University Disciplinary Committee, with the commencement of the relevant disciplinary proceedings in accordance with the provisions of Title V (DISCIPLINARY PROCEDIMENTS AND SANCTIONS) of the University Regulations for Students (D.R. 672 of 5/12/2017, which came into force on 8/12/2017).

 

 

 

 

 

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Parallel Algorithms

Introduction. Parallel Architectures. Parallel algorithm design. Message-Passing Programming. Sieve of Erathostenes. Floyd all-pairs shortest path algorithm. Performance analysis. Matrix-vector multiplication. Document classification. Matrix multiplication. Linear Systems. Finite Difference Methods. Sorting. 

 

Parallel Programming

 

Message-Passing programming using MPI.

 

Sequential Algorithms

 

Introduction. Order of growth. Analysis of algorithms. Decrease and conquer. Divide and conquer. Recurrences. Randomized algorithms. Transform and conquer.  Dynamic programming. Greedy algorithms. Single Source Shortest Paths. Dijkstra algorithm. Breadth-First Search. Bellman-Ford Algorithm. Complexity and computability. NP-Completeness. The transition from sequential to parallel computing. Parallel complexity.

Introduction to Algorithms. Fourth edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

Parallel Programming in C with MPI and OpenMP (International Edition). Michael J. Quinn. McGraw-Hill

PARALLEL ALGORITHMS (ING-INF/05)
DATA MINING

Degree course MATEMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2021/2022

Year taught 2021/2022

Course year 1

Semestre Secondo Semestre (dal 21/02/2022 al 27/05/2022)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus. Probability theory. Linear Algebra. Programming skills.

The course provides a modern introduction to data mining, which spans techniques, algorithms and methodologies for discovering structure, patterns and relationships in data sets (typically, large ones) and making predictions. Applications of data mining are already happening all around us, and, when they are done well, sometimes they even go unnoticed. For instance, how does the Google web search work? How does Shazam recognize a song? How does Netflix recommend movies to its users? The principles of data mining provide answers to these and others questions. Data mining overlaps the fields of computer science, statistical machine learning and data bases. The course aims at providing the students with the knowldedge required to explore, analyze and leverage available data in order to turn the data into valuable and actionable information for a company, for instance, in order to facilitate a decision-making process.

Knowledge and understanding. The course describes methods and models for the analysis of large amounts of data. Students must have a solid background with a broad spectrum of basic knowledge related to data mining:

  • the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;
  • they must have solid knowledge of data mining models and methodologies;
  • they must be able to work on large data collections, including heterogeneous and produced at high speed data, in order to integrate them - in particular by knowing how to manage their origin and quality - and to carry out in-depth thematic analyses, drawing on this knowledge to improve the decision-making process.

Applying knowledge and understanding. After the course the student should be able to:

  • describe and use the main data mining techniques;
  • understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;
  • tackle new data mining problems by selecting the appropriate methods and justifying his/her choices;
  • tackle new data mining problems by designing suitable algorithms and evaluating the results;
  • explain experimental results to people outside of statistical machine learning or computer science.

Making judgements. Students must have the ability to process complex and/or fragmentary data and must arrive at original and autonomous ideas and judgments, and consistent choices in the context of their work, which are particularly delicate in the profession of data scientist. The course promotes the development of independent judgment in the appropriate choice of technique/model for data processing and the critical ability to interpret the goodness of the results of the models/methods applied to the datasets under examination.

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. Students should be able to organize effective dissemination and study material through the most common presentation tools, including computer-based ones, to communicate the results of data analysis processes, for example by using visualization and reporting tools aimed at different types of audiences.

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to provide students with advanced tools for data analysis, through which to extrapolate relevant information from large datasets and guide the related decision-making processes. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. The frontal lessons are aimed at improving students' knowledge and understanding through the presentation of theories, models and methods; students are invited to participate in the lesson with autonomy of judgement, by asking questions and presenting examples. The exercises are aimed at understanding the algorithms and models presented.

Oral exam. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student must demonstrate adequate knowledge and understanding of the issues presented or indicated, applying in a relevant manner the theories and conceptual models covered by the study programme.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Introduzione al corso. Streams. Funzioni hash uniformi, 2-universal e pairwise independent. Streaming: modello turnstile, strict turnstile e cash register. Frequency estimation. Sketches. Count-Sketch. Count-Min. Confronto comparativo tra Count-Sketch e Count-Min. Frequent items. Phi-frequent items. The majority problem. Algoritmo di Boyer-Moore. Algoritmo di Misra-Gries. Algoritmo Frequent.Algoritmo Space Saving. Proprieta' di Space Saving. Confronto comparativo con Frequent. Introduzione al paradigma di programmazione parallela Map-Reduce. Implementazione open-source Hadoop. Pro e contro di Hadoop e Map-Reduce. Distributed File System. Chunk servers, Master node. Map Function. Sort and Shuffle. Reduce Function. Map Tasks. Reduce Tasks. Word counting. Gestione dei guasti. Numero di Map e Reduce jobs. Granularita' dei tasks e pipelining. Mitigare il problema degli strugglers task: spawning di backup tasks. Combiners. Partition (hash) function. Altri esempi di algoritmi Map-Reduce: natural join, two-pass matrix multiply, single pass matrix multiply. Misure di costo per un algoritmo Map-Reduce. Discovery di association rules. Modello market-basket. Esempi di possibili applicazioni. Frequent itemsets. Supporto di un itemset. Association rules. Confidence e Interest.Association rules con elevato interesse positivo o negativo. Mining di association rules. Maximal e closed frequent itemsets. Lattice degli itemsets. Naive approach to counting frequent pairs. Algoritmo A-priori. Monotonicity. Algoritmo PCY. Raffinamenti di PCY: multistage e multihash. Frequent itemsets in 2 passate: random sampling. Frequent itemsets in 2 passate: Random sampling e scelta della soglia opportuna, algoritmo SON, monotonicità, SON parallelo mediante Map-Reduce in 2 passate, algoritmo di Toivonen, bordo negativo. Scene completion problem. Near neighbors in spazi di dimensionalità elevata. Document similarity. Coppie di documenti candidati. Near neighbor search. Jaccard similarity e distance. Shingling: convertire documenti email etc in insiemi. k-shingles. Compressione mediante hashing di k-shingles. Min-Hashing: conversione di insiemi di cardinalità elevata in brevi signatures preservando la similarità. Similarità e distanza di Jaccard per vettori booleani. Boolean matrices. Min-hash signatures. Implementazione. Locality-Sensitive Hashing: determinare coppie di documenti candidate. Matrix partitioning in b bande di r righe: analisi del grado di accuratezza associato rispetto ai falsi positivi ed ai falsi negativi. Link analysis. PageRank. Dead ends. Spider traps. Flow formulation. Matrix formulation. Random walk interpretation. Stationary distribution of a Discrete-Time Markov Chain. Perron-Frobenius Theorem. Google matrix and teleportation. Sparse matrix encoding. Block update algorithm. Topic-specific PageRank. Matrix formulation. Topic vector. Web Spam. Term spam. Spam farms. Analisi del valore di PageRank ottenuto tramite Spam Farm. TrustRank. Trust propagation. Spam Mass estimation. Introduzione al problema del clustering. Curse of dimensionality. Clustering in spazi euclidei e non euclidei. Distanze. Hierarchical clustering: agglomerative and divisive algorithms. Clustering by point assignment. Centroid and clustroid. K-means e K-means++. Scelta di k: elbow criterion. Algoritmo BFR. Discard, Compression e Retained sets. Summarizing points. Distanza di Mahalanobis. Algoritmo CURE. Punti rappresentativi e loro scelta. Input space e feature space. Kernel methods. Kernel matrix. Linear kernel. Kernel trick. Kernel operations in feature space. Represenattive clustering: K-means e Kernel K-means. Expectation-Maximization clustering. Hierarchical clustering. Density-based clustering. Algoritmo DBSCAN. Recommender systems. Recommendations. The long tail phenomenon. Content-based systems. Utility function and matrix. Ratings. Extrapolation of ratings (utilities). Item profiles. User profiles. Collaborative filtering. k-NN. Similarity metrics. User-user and item-item collaborative filtering. Evaluation of systems. Error metrics. RMSE, precision, rank correlation. Complexity of collaborative filtering. The Netflix challenge. Bellkor recommender system. Modeling local and global effects. Learning the optimal weights: optimization problem and gradient descent. Latent factor models. SVD decomposition. Learning the P and Q matrices. Preventing overfitting: regularization. Stochastic Gradient Descent. Biases and interactions. Temporal biases and factors. Machine learning: supervised and unsupervised approaches. Attributi numerici e categorici. Attributi categorici nominali ed ordinali. Probabilistic classifiers. Parametric approach: Bayes and naive Bayes classifiers. Data centering. Non parametric approach (density based): K-nearest neighbors classifier. Decision Trees. Hyperplans. Split points. Data partion and purity. Split Point Evaluation Measures: entropy, split entropy, information gain, Gini index, CART. Valutazione di split points numerici e categorici. Support Vector machines. Hyperplanes. Support Vectors and Margins. Linear and Separable Case. Soft Margin SVM: Linear and Nonseparable Case. Kernel SVM: Nonlinear Case. SVM Training Algorithms. Multiclass SVM. Analisi delle prestazioni di un classifier. Metriche di valutazione. ROC curve e AUC. K-fold cross-validation. Bootstrapping. Intervalli di confidenza. Paired t-Test. Bias and variance decomposition. Ensemble classifiers. Bagging. Random Forest. Boosting. Stacking. Introduction to neural networks. History. The Biological Inspiration. Learning in Biological vs Artificial Networks. An Alternative View: The Computational Graph Extension of Traditional Machine Learning. Machine Learning versus Deep Learning. Single Layer Networks: the Perceptron. Bias neurons. Training a Perceptron. Perceptron vs Linear SVMs. Activation and Loss Functions. Multilayer Neural Networks. Reasons  for nonlinearity of hidden layers. Role of Hidden Layers. The Feature Engineering View of Hidden Layers. Multilayer Networks as Computational Graphs. Connecting Machine Learning with Shallow Neural Networks: Perceptron versus Linear SVM, RBF Network versus kernel SVM.

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Freely availableonline: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

 

Neural Networks and Deep Learning

Charu C. Aggarwal

Springer

 

DATA MINING (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2020/2021

Year taught 2021/2022

Course year 2

Semestre Primo Semestre (dal 20/09/2021 al 17/12/2021)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus I and II, Probability Theory. Programming skills and working knowledge of the C programming language.

The course provides a modern introduction to design, analysis and implementation of sequential and parallel algorithms. In particular, the course is based on a pragmatic approach to parallel programming of message-passing algorithms through the C language and the MPI library.

Knowledge and understanding. Students must have a solid background with a broad spectrum of basic knowledge of sequential and parallel algorithms:

 

· the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;

· they must have a solid knowledge of the design and implementation of sequential and parallel efficient algorithms;

· they must have the tools for analysing the resources used by algorithms;

· they must have a catalogue of the most well-known and efficient sequential and parallel algorithms for basic computational problems.

 

Applying knowledge and understanding. After the course the student should be able to:

 

· Describe and use the main design techniques for sequential algorithms;

· Design, prove the correctness and analyze the computational complexity of sequential algorithms;

· Understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;

· Describe and use basic sequential algorithms;

· Describe and use basic data structures; know about the existence of advanced data structures;

· Understand the difference between sequential and parallel algorithms;

· Design, implement and analyze message-passing based parallel algorithms in C using the MPI library;

· Describe and use basic parallel algorithms.

 

 

Making judgements. Students are guided to learn critically everything that is explained to them in class, to compare different approaches to solving algorithmic problems, and to identify and propose, in an autonomous way, the most efficient solution they find.

 

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. The course promotes the development of the following skills of the student: ability to expose in precise and formal terms an abstract model of concrete problems, identifying the salient features of them and discarding the nonessential ones; ability to describe and analyze an efficient solution to the problem in question.

 

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to enable students to abstract formal algorithmic models and problems from concrete computational problems, and to design efficient algorithmic solutions for them. This will be done using the following teaching method. Every computational problem will be introduced, motivating it with concrete examples. The presentation of each topic will be divided into four parts: 1. Description of the actual computational problem. 2. Modelling the real problem by means of an abstract problem. 3. Resolution of the abstract problem through an algorithm obtained through the application of the general techniques of design of algorithms introduced in the course. 4. Analysis of the resources used by the algorithm. The course consists of frontal lessons, and classroom exercises. There will be theoretical lessons aimed at learning the basic techniques for the project and analysis of algorithms, and a part of lessons devoted to exercises in which we will illustrate, with plenty of examples, how the theoretical knowledge acquired can be used in order to solve algorithmic problems of practical interest and implement parallel algorithms in C language through the MPI library.

The exam consists of a written test and a prototype implementation of a parallel software. The written test  (3 hours, 21 points out of 30) covers theoretical topics related to the design and analysis of sequential and parallel algorithms in order to verify the student’s knowledge and understanding of the materials. In order to pass the written test, students must obtain at least 9 points out of 21.  The parallel software project (9 points out of 30) is meant to verify the practice of parallel programming and the ability of the student to implement theoretical parallel algorithms or to parallelize a sequential algorithm. In order to pass the parallel programming challenge, students must obtain at least 4 points out of 9. Both the written test and the prototype implementation of parallel software are mandatory. Students must pass both the written test and the parallel programming challenge; however, the overall exam is passed only if the sum of points obtained in the written test and the parallel programming challenge is greater to or equal to 18 points.

 

The problem to be solved in parallel (or a sequential algorithm to be implemented in parallel) is assigned upon request to the student. The assigned parallel software project can be discussed only if a written test has been successfully passed; moreover, the parallel software project must be mandatorily discussed into one of the trials available in the same exam session in which the written test has been passed.

 

 

WARNING: THE FOLLOWING ARE DETAILED INSTRUCTIONS RELATED TO ONLINE WRITTEN TESTS USING MICROSOFT TEAMS 


Written test of 
Parallel Algorithms
 


Microsoft Teams is used to verify the student's identity and to supervise the test; the teacher will select each student asking him to show his identification document. The exam takes place within a team that students access through registration by the teacher or through a link that will be communicated via email. Students are asked to enter the team at least 15 minutes prior to the scheduled start time in order to perform recognition. Each student will participate in the meeting through a device equipped with a microphone and a webcam that must remain on for the duration of the test, framing the student and the paper on which he/she writes. The teacher will check the regularity of the students' work, immediately cancelling the test in case of irregularities of any kind.

For the distribution of the exam outline to the students, the teacher shares a pdf file in the TEAMS chat. The track is divided into 3 exercises for Parallel Algorithms, and the teacher releases to the students each of the exercises that compose it at fixed time intervals. The test is then carried out as follows:

 

1. The teacher releases the first exercise at the beginning of the roll call;
a. students print out the text of the exercise and begin drafting the paper on a sheet of paper clearly visible from the video capture device;
b. the teacher provides his/her estimate x of sufficient time to answer;
c. delivery of the paper related to the first exercise after x amount of time has elapsed;
d. Teacher waits 2 minutes for delivery;


2. Teacher delivers the second exercise;
a. students print out the text of the exercise and begin writing the paper on a sheet of paper clearly visible from the video capture device;
b. teacher gives his estimate y of sufficient time to answer;
c. delivery of the paper related to the second exercise after the time y has elapsed;
d. The teacher waits 2 minutes for the delivery;


3. Teacher releases the third exercise;
a. students print out the text of the exercise and begin writing the paper on a piece of paper that is clearly visible from the video capture device;
b. teacher gives his estimate z of sufficient time to answer;
c. delivery of the paper related to the second exercise after the time z has elapsed;
d. The teacher waits 2 minutes for the delivery;


4. The teacher informs the participating students that the roll call is closed.
  
For the delivery of the papers related to the individual exercises, the student is requested to take a photo of the relative sheets, including an image of the university booklet or, alternatively, an identification document. The photos should be placed in a directory called "Exercise_a_b_c_d" in which:

- a indicates the progressive number of the exercise (1, 2 or 3);
- b is PA  (for Parallel Algorithms)
- c indicates the student's first name;
- d indicates the student's last name.

The directory must then be compressed, e.g. in zip format, and the resulting file must be delivered via team chat.

It is not possible to turn in an exercise before the allotted time and move on to the next exercise. The student who intends to abandon the test can give notice to the teacher at any time, and leave the team immediately afterwards. It is not possible to leave for any reason during the test. The student is required to ensure that he/she has everything necessary for the test (microphone, webcam, printer, camera, pen, adequate number of sheets of paper).

 

 

 

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Parallel Algorithms

Introduction. Parallel Architectures. Parallel algorithm design. Message-Passing Programming. Sieve of Erathostenes. Floyd all-pairs shortest path algorithm. Performance analysis. Matrix-vector multiplication. Document classification. Matrix multiplication. Linear Systems. Finite Difference Methods. Sorting. 

 

Parallel Programming

 

Message-Passing programming using MPI.

 

Sequential Algorithms

 

Introduction. Order of growth. Analysis of algorithms. Decrease and conquer. Divide and conquer. Recurrences. Randomized algorithms. Transform and conquer.  Dynamic programming. Greedy algorithms. Single Source Shortest Paths. Dijkstra algorithm. Breadth-First Search. Bellman-Ford Algorithm. Complexity and computability. NP-Completeness. The transition from sequential to parallel computing. Parallel complexity.

Introduction to Algorithms. Third edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

Parallel Programming in C with MPI and OpenMP (International Edition). Michael J. Quinn. McGraw-Hill

PARALLEL ALGORITHMS (ING-INF/05)
DATA MINING

Degree course MATEMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2020/2021

Year taught 2020/2021

Course year 1

Semestre Secondo Semestre (dal 22/02/2021 al 28/05/2021)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Calculus. Probability theory. Linear Algebra. Programming skills.

The course provides a modern introduction to data mining, which spans techniques, algorithms and methodologies for discovering structure, patterns and relationships in data sets (typically, large ones) and making predictions. Applications of data mining are already happening all around us, and, when they are done well, sometimes they even go unnoticed. For instance, how does the Google web search work? How does Shazam recognize a song? How does Netflix recommend movies to its users? The principles of data mining provide answers to these and others questions. Data mining overlaps the fields of computer science, statistical machine learning and data bases. The course aims at providing the students with the knowldedge required to explore, analyze and leverage available data in order to turn the data into valuable and actionable information for a company, for instance, in order to facilitate a decision-making process.

Knowledge and understanding. The course describes methods and models for the analysis of large amounts of data. Students must have a solid background with a broad spectrum of basic knowledge related to data mining:

  • the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;
  • they must have solid knowledge of data mining models and methodologies;
  • they must be able to work on large data collections, including heterogeneous and produced at high speed data, in order to integrate them - in particular by knowing how to manage their origin and quality - and to carry out in-depth thematic analyses, drawing on this knowledge to improve the decision-making process.

Applying knowledge and understanding. After the course the student should be able to:

  • describe and use the main data mining techniques;
  • understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;
  • tackle new data mining problems by selecting the appropriate methods and justifying his/her choices;
  • tackle new data mining problems by designing suitable algorithms and evaluating the results;
  • explain experimental results to people outside of statistical machine learning or computer science.

Making judgements. Students must have the ability to process complex and/or fragmentary data and must arrive at original and autonomous ideas and judgments, and consistent choices in the context of their work, which are particularly delicate in the profession of data scientist. The course promotes the development of independent judgment in the appropriate choice of technique/model for data processing and the critical ability to interpret the goodness of the results of the models/methods applied to the datasets under examination.

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. Students should be able to organize effective dissemination and study material through the most common presentation tools, including computer-based ones, to communicate the results of data analysis processes, for example by using visualization and reporting tools aimed at different types of audiences.

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to provide students with advanced tools for data analysis, through which to extrapolate relevant information from large datasets and guide the related decision-making processes. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. The frontal lessons are aimed at improving students' knowledge and understanding through the presentation of theories, models and methods; students are invited to participate in the lesson with autonomy of judgement, by asking questions and presenting examples. The exercises are aimed at understanding the algorithms and models presented.

Oral exam. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student must demonstrate adequate knowledge and understanding of the issues presented or indicated, applying in a relevant manner the theories and conceptual models covered by the study programme.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Introduzione al corso. Streams. Funzioni hash uniformi, 2-universal e pairwise independent. Streaming: modello turnstile, strict turnstile e cash register. Frequency estimation. Sketches. Count-Sketch. Count-Min. Confronto comparativo tra Count-Sketch e Count-Min. Frequent items. Phi-frequent items. The majority problem. Algoritmo di Boyer-Moore. Algoritmo di Misra-Gries. Algoritmo Frequent.Algoritmo Space Saving. Proprieta' di Space Saving. Confronto comparativo con Frequent. Introduzione al paradigma di programmazione parallela Map-Reduce. Implementazione open-source Hadoop. Pro e contro di Hadoop e Map-Reduce. Distributed File System. Chunk servers, Master node. Map Function. Sort and Shuffle. Reduce Function. Map Tasks. Reduce Tasks. Word counting. Gestione dei guasti. Numero di Map e Reduce jobs. Granularita' dei tasks e pipelining. Mitigare il problema degli strugglers task: spawning di backup tasks. Combiners. Partition (hash) function. Altri esempi di algoritmi Map-Reduce: natural join, two-pass matrix multiply, single pass matrix multiply. Misure di costo per un algoritmo Map-Reduce. Discovery di association rules. Modello market-basket. Esempi di possibili applicazioni. Frequent itemsets. Supporto di un itemset. Association rules. Confidence e Interest.Association rules con elevato interesse positivo o negativo. Mining di association rules. Maximal e closed frequent itemsets. Lattice degli itemsets. Naive approach to counting frequent pairs. Algoritmo A-priori. Monotonicity. Algoritmo PCY. Raffinamenti di PCY: multistage e multihash. Frequent itemsets in 2 passate: random sampling. Frequent itemsets in 2 passate: Random sampling e scelta della soglia opportuna, algoritmo SON, monotonicità, SON parallelo mediante Map-Reduce in 2 passate, algoritmo di Toivonen, bordo negativo. Scene completion problem. Near neighbors in spazi di dimensionalità elevata. Document similarity. Coppie di documenti candidati. Near neighbor search. Jaccard similarity e distance. Shingling: convertire documenti email etc in insiemi. k-shingles. Compressione mediante hashing di k-shingles. Min-Hashing: conversione di insiemi di cardinalità elevata in brevi signatures preservando la similarità. Similarità e distanza di Jaccard per vettori booleani. Boolean matrices. Min-hash signatures. Implementazione. Locality-Sensitive Hashing: determinare coppie di documenti candidate. Matrix partitioning in b bande di r righe: analisi del grado di accuratezza associato rispetto ai falsi positivi ed ai falsi negativi. Link analysis. PageRank. Dead ends. Spider traps. Flow formulation. Matrix formulation. Random walk interpretation. Stationary distribution of a Discrete-Time Markov Chain. Perron-Frobenius Theorem. Google matrix and teleportation. Sparse matrix encoding. Block update algorithm. Topic-specific PageRank. Matrix formulation. Topic vector. Web Spam. Term spam. Spam farms. Analisi del valore di PageRank ottenuto tramite Spam Farm. TrustRank. Trust propagation. Spam Mass estimation. Introduzione al problema del clustering. Curse of dimensionality. Clustering in spazi euclidei e non euclidei. Distanze. Hierarchical clustering: agglomerative and divisive algorithms. Clustering by point assignment. Centroid and clustroid. K-means e K-means++. Scelta di k: elbow criterion. Algoritmo BFR. Discard, Compression e Retained sets. Summarizing points. Distanza di Mahalanobis. Algoritmo CURE. Punti rappresentativi e loro scelta. Input space e feature space. Kernel methods. Kernel matrix. Linear kernel. Kernel trick. Kernel operations in feature space. Represenattive clustering: K-means e Kernel K-means. Expectation-Maximization clustering. Hierarchical clustering. Density-based clustering. Algoritmo DBSCAN. Recommender systems. Recommendations. The long tail phenomenon. Content-based systems. Utility function and matrix. Ratings. Extrapolation of ratings (utilities). Item profiles. User profiles. Collaborative filtering. k-NN. Similarity metrics. User-user and item-item collaborative filtering. Evaluation of systems. Error metrics. RMSE, precision, rank correlation. Complexity of collaborative filtering. The Netflix challenge. Bellkor recommender system. Modeling local and global effects. Learning the optimal weights: optimization problem and gradient descent. Latent factor models. SVD decomposition. Learning the P and Q matrices. Preventing overfitting: regularization. Stochastic Gradient Descent. Biases and interactions. Temporal biases and factors. Machine learning: supervised and unsupervised approaches. Attributi numerici e categorici. Attributi categorici nominali ed ordinali. Probabilistic classifiers. Parametric approach: Bayes and naive Bayes classifiers. Data centering. Non parametric approach (density based): K-nearest neighbors classifier. Decision Trees. Hyperplans. Split points. Data partion and purity. Split Point Evaluation Measures: entropy, split entropy, information gain, Gini index, CART. Valutazione di split points numerici e categorici. Support Vector machines. Hyperplanes. Support Vectors and Margins. Linear and Separable Case. Soft Margin SVM: Linear and Nonseparable Case. Kernel SVM: Nonlinear Case. SVM Training Algorithms. Multiclass SVM. Analisi delle prestazioni di un classifier. Metriche di valutazione. ROC curve e AUC. K-fold cross-validation. Bootstrapping. Intervalli di confidenza. Paired t-Test. Bias and variance decomposition. Ensemble classifiers. Bagging. Random Forest. Boosting. Stacking. Introduction to neural networks. History. The Biological Inspiration. Learning in Biological vs Artificial Networks. An Alternative View: The Computational Graph Extension of Traditional Machine Learning. Machine Learning versus Deep Learning. Single Layer Networks: the Perceptron. Bias neurons. Training a Perceptron. Perceptron vs Linear SVMs. Activation and Loss Functions. Multilayer Neural Networks. Reasons  for nonlinearity of hidden layers. Role of Hidden Layers. The Feature Engineering View of Hidden Layers. Multilayer Networks as Computational Graphs. Connecting Machine Learning with Shallow Neural Networks: Perceptron versus Linear SVM, RBF Network versus kernel SVM.

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Freely availableonline: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

 

Neural Networks and Deep Learning

Charu C. Aggarwal

Springer

 

DATA MINING (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2019/2020

Year taught 2020/2021

Course year 2

Semestre Primo Semestre (dal 22/09/2020 al 18/12/2020)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus I and II, Probability Theory. Programming skills and working knowledge of the C programming language.

The course provides a modern introduction to design, analysis and implementation of sequential and parallel algorithms. In particular, the course is based on a pragmatic approach to parallel programming of message-passing algorithms through the C language and the MPI library.

Knowledge and understanding. Students must have a solid background with a broad spectrum of basic knowledge of sequential and parallel algorithms:

 

· the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;

· they must have a solid knowledge of the design and implementation of sequential and parallel efficient algorithms;

· they must have the tools for analysing the resources used by algorithms;

· they must have a catalogue of the most well-known and efficient sequential and parallel algorithms for basic computational problems.

 

Applying knowledge and understanding. After the course the student should be able to:

 

· Describe and use the main design techniques for sequential algorithms;

· Design, prove the correctness and analyze the computational complexity of sequential algorithms;

· Understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;

· Describe and use basic sequential algorithms;

· Describe and use basic data structures; know about the existence of advanced data structures;

· Understand the difference between sequential and parallel algorithms;

· Design, implement and analyze message-passing based parallel algorithms in C using the MPI library;

· Describe and use basic parallel algorithms.

 

 

Making judgements. Students are guided to learn critically everything that is explained to them in class, to compare different approaches to solving algorithmic problems, and to identify and propose, in an autonomous way, the most efficient solution they find.

 

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. The course promotes the development of the following skills of the student: ability to expose in precise and formal terms an abstract model of concrete problems, identifying the salient features of them and discarding the nonessential ones; ability to describe and analyze an efficient solution to the problem in question.

 

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to enable students to abstract formal algorithmic models and problems from concrete computational problems, and to design efficient algorithmic solutions for them. This will be done using the following teaching method. Every computational problem will be introduced, motivating it with concrete examples. The presentation of each topic will be divided into four parts: 1. Description of the actual computational problem. 2. Modelling the real problem by means of an abstract problem. 3. Resolution of the abstract problem through an algorithm obtained through the application of the general techniques of design of algorithms introduced in the course. 4. Analysis of the resources used by the algorithm. The course consists of frontal lessons, and classroom exercises. There will be theoretical lessons aimed at learning the basic techniques for the project and analysis of algorithms, and a part of lessons devoted to exercises in which we will illustrate, with plenty of examples, how the theoretical knowledge acquired can be used in order to solve algorithmic problems of practical interest and implement parallel algorithms in C language through the MPI library.

The exam consists of a written test and a prototype implementation of a parallel software. The written test  (3 hours, 21 points out of 30) covers theoretical topics related to the design and analysis of sequential and parallel algorithms in order to verify the student’s knowledge and understanding of the materials. In order to pass the written test, students must obtain at least 14 points out of 21.  The parallel software prototype (9 points out of 30) is meant to verify the practice of parallel programming and the ability of the student to implement theoretical parallel algorithms or to parallelize a sequential algorithm. In order to pass the parallel programming challenge, students must obtain at least 4 points out of 9. Both the written test and the prototype implementation of parallel software are mandatory. Students must pass both the written test and the parallel programming challenge.

 

WARNING: THE FOLLOWING ARE DETAILED INSTRUCTIONS RELATED TO ONLINE WRITTEN TESTS USING MICROSOFT TEAMS 


Written test of 
Parallel Algorithms
 


Microsoft Teams is used to verify the student's identity and to supervise the test; the teacher will select each student asking him to show his identification document. The exam takes place within a team that students access through registration by the teacher or through a link that will be communicated via email. Students are asked to enter the team at least 15 minutes prior to the scheduled start time in order to perform recognition. Each student will participate in the meeting through a device equipped with a microphone and a webcam that must remain on for the duration of the test, framing the student and the paper on which he/she writes. The teacher will check the regularity of the students' work, immediately cancelling the test in case of irregularities of any kind.

For the distribution of the exam outline to the students, the teacher shares a pdf file in the TEAMS chat. The track is divided into 3 exercises for Parallel Algorithms, and the teacher releases to the students each of the exercises that compose it at fixed time intervals. The test is then carried out as follows:

 

1. The teacher releases the first exercise at the beginning of the roll call;
a. students print out the text of the exercise and begin drafting the paper on a sheet of paper clearly visible from the video capture device;
b. the teacher provides his/her estimate x of sufficient time to answer;
c. delivery of the paper related to the first exercise after x amount of time has elapsed;
d. Teacher waits 2 minutes for delivery;


2. Teacher delivers the second exercise;
a. students print out the text of the exercise and begin writing the paper on a sheet of paper clearly visible from the video capture device;
b. teacher gives his estimate y of sufficient time to answer;
c. delivery of the paper related to the second exercise after the time y has elapsed;
d. The teacher waits 2 minutes for the delivery;


3. Teacher releases the third exercise;
a. students print out the text of the exercise and begin writing the paper on a piece of paper that is clearly visible from the video capture device;
b. teacher gives his estimate z of sufficient time to answer;
c. delivery of the paper related to the second exercise after the time z has elapsed;
d. The teacher waits 2 minutes for the delivery;


4. The teacher informs the participating students that the roll call is closed.
  
For the delivery of the papers related to the individual exercises, the student is requested to take a photo of the relative sheets, including an image of the university booklet or, alternatively, an identification document. The photos should be placed in a directory called "Exercise_a_b_c_d" in which:

- a indicates the progressive number of the exercise (1, 2 or 3);
- b is PA  (for Parallel Algorithms)
- c indicates the student's first name;
- d indicates the student's last name.

The directory must then be compressed, e.g. in zip format, and the resulting file must be delivered via team chat.

It is not possible to turn in an exercise before the allotted time and move on to the next exercise. The student who intends to abandon the test can give notice to the teacher at any time, and leave the team immediately afterwards. It is not possible to leave for any reason during the test. The student is required to ensure that he/she has everything necessary for the test (microphone, webcam, printer, camera, pen, adequate number of sheets of paper).

 

 

 

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Parallel Algorithms

Introduction. Parallel Architectures. Parallel algorithm design. Message-Passing Programming. Sieve of Erathostenes. Floyd all-pairs shortest path algorithm. Performance analysis. Matrix-vector multiplication. Document classification. Matrix multiplication. Linear Systems. Finite Difference Methods. Sorting. 

 

Parallel Programming

 

Message-Passing programming using MPI.

 

Sequential Algorithms

 

Introduction. Order of growth. Analysis of algorithms. Decrease and conquer. Divide and conquer. Recurrences. Randomized algorithms. Transform and conquer.  Dynamic programming. Greedy algorithms. Single Source Shortest Paths. Dijkstra algorithm. Breadth-First Search. Bellman-Ford Algorithm. Complexity and computability. NP-Completeness. The transition from sequential to parallel computing. Parallel complexity.

Introduction to Algorithms. Third edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

Parallel Programming in C with MPI and OpenMP (International Edition). Michael J. Quinn. McGraw-Hill

PARALLEL ALGORITHMS (ING-INF/05)
DATA MINING

Degree course MATEMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2019/2020

Year taught 2019/2020

Course year 1

Semestre Secondo Semestre (dal 24/02/2020 al 29/05/2020)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus. Probability and Statistics. Linear Algebra. Programming skills.

The course provides a modern introduction to data mining, which spans techniques, algorithms and methodologies for discovering structure, patterns and relationships in data sets (typically, large ones) and making predictions. Applications of data mining are already happening all around us, and, when they are done well, sometimes they even go unnoticed. For instance, how does the Google web search work? How does Shazam recognize a song? How does Netflix recommend movies to its users? The principles of data mining provide answers to these and others questions. Data mining overlaps the fields of computer science, statistical machine learning and data bases. The course aims at providing the students with the knowldedge required to explore, analyze and leverage available data in order to turn the data into valuable and actionable information for a company, for instance, in order to facilitate a decision-making process.

Knowledge and understanding. The course describes methods and models for the analysis of large amounts of data. Students must have a solid background with a broad spectrum of basic knowledge related to data mining:

  • the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;
  • they must have solid knowledge of data mining models and methodologies;
  • they must be able to work on large data collections, including heterogeneous and produced at high speed data, in order to integrate them - in particular by knowing how to manage their origin and quality - and to carry out in-depth thematic analyses, drawing on this knowledge to improve the decision-making process.

Applying knowledge and understanding. After the course the student should be able to:

  • describe and use the main data mining techniques;
  • understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;
  • tackle new data mining problems by selecting the appropriate methods and justifying his/her choices;
  • tackle new data mining problems by designing suitable algorithms and evaluating the results;
  • explain experimental results to people outside of statistical machine learning or computer science.

Making judgements. Students must have the ability to process complex and/or fragmentary data and must arrive at original and autonomous ideas and judgments, and consistent choices in the context of their work, which are particularly delicate in the profession of data scientist. The course promotes the development of independent judgment in the appropriate choice of technique/model for data processing and the critical ability to interpret the goodness of the results of the models/methods applied to the datasets under examination.

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. Students should be able to organize effective dissemination and study material through the most common presentation tools, including computer-based ones, to communicate the results of data analysis processes, for example by using visualization and reporting tools aimed at different types of audiences.

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to provide students with advanced tools for data analysis, through which to extrapolate relevant information from large datasets and guide the related decision-making processes. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. The frontal lessons are aimed at improving students' knowledge and understanding through the presentation of theories, models and methods; students are invited to participate in the lesson with autonomy of judgement, by asking questions and presenting examples. The exercises are aimed at understanding the algorithms and models presented.

Oral exam. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student must demonstrate adequate knowledge and understanding of the issues presented or indicated, applying in a relevant manner the theories and conceptual models covered by the study programme.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Introduction. Map-Reduce. Mining data streams. Frequent Items.  Frequent Itemsets and association rules. Mining similar items and Locality-Sensitive Hashing. Graph analysis. Link analysis and PageRank. Clustering. Recommendation systems. Mining Social-Network Graphs.  Dimensionality reduction.  Classification.

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Freely availableonline: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

DATA MINING (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2018/2019

Year taught 2019/2020

Course year 2

Semestre Primo Semestre (dal 23/09/2019 al 20/12/2019)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus I and II, Probability Theory. Programming skills and working knowledge of the C programming language.

The course provides a modern introduction to design, analysis and implementation of sequential and parallel algorithms. In particular, the course is based on a pragmatic approach to parallel programming of message-passing algorithms through the C language and the MPI library.

Knowledge and understanding. Students must have a solid background with a broad spectrum of basic knowledge of sequential and parallel algorithms:

 

· the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;

· they must have a solid knowledge of the design and implementation of sequential and parallel efficient algorithms;

· they must have the tools for analysing the resources used by algorithms;

· they must have a catalogue of the most well-known and efficient sequential and parallel algorithms for basic computational problems.

 

Applying knowledge and understanding. After the course the student should be able to:

 

· Describe and use the main design techniques for sequential algorithms;

· Design, prove the correctness and analyze the computational complexity of sequential algorithms;

· Understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;

· Describe and use basic sequential algorithms;

· Describe and use basic data structures; know about the existence of advanced data structures;

· Understand the difference between sequential and parallel algorithms;

· Design, implement and analyze message-passing based parallel algorithms in C using the MPI library;

· Describe and use basic parallel algorithms.

 

 

Making judgements. Students are guided to learn critically everything that is explained to them in class, to compare different approaches to solving algorithmic problems, and to identify and propose, in an autonomous way, the most efficient solution they find.

 

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. The course promotes the development of the following skills of the student: ability to expose in precise and formal terms an abstract model of concrete problems, identifying the salient features of them and discarding the nonessential ones; ability to describe and analyze an efficient solution to the problem in question.

 

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to enable students to abstract formal algorithmic models and problems from concrete computational problems, and to design efficient algorithmic solutions for them. This will be done using the following teaching method. Every computational problem will be introduced, motivating it with concrete examples. The presentation of each topic will be divided into four parts: 1. Description of the actual computational problem. 2. Modelling the real problem by means of an abstract problem. 3. Resolution of the abstract problem through an algorithm obtained through the application of the general techniques of design of algorithms introduced in the course. 4. Analysis of the resources used by the algorithm. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. There will be theoretical lessons aimed at learning the basic techniques for the project and analysis of algorithms, and a part of lessons of an exercise type in which you will illustrate, with plenty of examples, how the theoretical knowledge acquired can be used in order to solve algorithmic problems of practical interest and implement parallel algorithms in C language through the MPI library.

Oral exam. Optionally, a student may be assigned a small project. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student may also be asked to design a very simple algorithm in order to assess his/her ability to identify and use the relevant design techniques; alternatively, the student may be asked to analyze the complexity of a small code fragment.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Sequential Algorithms

 

Introduction. Order of growth. Analysis of algorithms. Decrease and conquer.  Divide and conquer. Recurrences.  Randomized algorithms.  Transform and conquer.  Dynamic programming.  Greedy algorithms. Complexity and computability. NP-Completeness. 

 

Parallel Algorithms

Introduction. The transition from sequential to parallel computing. Parallel complexity.  Parallel architectures.  Parallel algorithm design.  Message-Passing programming.  Sieve of Erathostenes.  Floyd all-pairs shortest path algorithm. Performance analysis.  Matrix-vector multiplication. Document classification.  Matrix multiplication.

Introduction to Algorithms. Third edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

Parallel Programming in C with MPI and OpenMP International Edition (2004) Michael J. Quinn McGraw-Hill

PARALLEL ALGORITHMS (ING-INF/05)
DATA MINING

Degree course MATEMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2018/2019

Year taught 2018/2019

Course year 1

Semestre Secondo Semestre (dal 25/02/2019 al 31/05/2019)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus. Probability and Statistics. Linear Algebra. Programming skills.

The course provides a modern introduction to data mining, which spans techniques, algorithms and methodologies for discovering structure, patterns and relationships in data sets (typically, large ones) and making predictions. Applications of data mining are already happening all around us, and, when they are done well, sometimes they even go unnoticed. For instance, how does the Google web search work? How does Shazam recognize a song? How does Netflix recommend movies to its users? The principles of data mining provide answers to these and others questions. Data mining overlaps the fields of computer science, statistical machine learning and data bases. The course aims at providing the students with the knowldedge required to explore, analyze and leverage available data in order to turn the data into valuable and actionable information for a company, for instance, in order to facilitate a decision-making process.

Knowledge and understanding. The course describes methods and models for the analysis of large amounts of data. Students must have a solid background with a broad spectrum of basic knowledge related to data mining:

  • the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;
  • they must have solid knowledge of data mining models and methodologies;
  • they must be able to work on large data collections, including heterogeneous and produced at high speed data, in order to integrate them - in particular by knowing how to manage their origin and quality - and to carry out in-depth thematic analyses, drawing on this knowledge to improve the decision-making process.

Applying knowledge and understanding. After the course the student should be able to:

  • describe and use the main data mining techniques;
  • understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;
  • tackle new data mining problems by selecting the appropriate methods and justifying his/her choices;
  • tackle new data mining problems by designing suitable algorithms and evaluating the results;
  • explain experimental results to people outside of statistical machine learning or computer science.

Making judgements. Students must have the ability to process complex and/or fragmentary data and must arrive at original and autonomous ideas and judgments, and consistent choices in the context of their work, which are particularly delicate in the profession of data scientist. The course promotes the development of independent judgment in the appropriate choice of technique/model for data processing and the critical ability to interpret the goodness of the results of the models/methods applied to the datasets under examination.

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. Students should be able to organize effective dissemination and study material through the most common presentation tools, including computer-based ones, to communicate the results of data analysis processes, for example by using visualization and reporting tools aimed at different types of audiences.

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to provide students with advanced tools for data analysis, through which to extrapolate relevant information from large datasets and guide the related decision-making processes. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. The frontal lessons are aimed at improving students' knowledge and understanding through the presentation of theories, models and methods; students are invited to participate in the lesson with autonomy of judgement, by asking questions and presenting examples. The exercises are aimed at understanding the algorithms and models presented.

Oral exam. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student must demonstrate adequate knowledge and understanding of the issues presented or indicated, applying in a relevant manner the theories and conceptual models covered by the study programme.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Introduction. Map-Reduce. Mining data streams. Frequent Items.  Frequent Itemsets and association rules. Mining similar items and Locality-Sensitive Hashing. Graph analysis. Link analysis and PageRank. Clustering. Recommendation systems. Mining Social-Network Graphs.  Dimensionality reduction.  Classification.

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Freely availableonline: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Freely available online: http://dataminingbook.info

DATA MINING (ING-INF/05)
ELEMENTI DI INFORMATICA

Corso di laurea INGEGNERIA DELLE TECNOLOGIE INDUSTRIALI

Settore Scientifico Disciplinare ING-INF/05

Tipo corso di studio Laurea

Crediti 6.0

Ripartizione oraria Ore totali di attività frontale: 54.0

Per immatricolati nel 2018/2019

Anno accademico di erogazione 2018/2019

Anno di corso 1

Semestre Secondo Semestre (dal 04/03/2019 al 04/06/2019)

Lingua ITALIANO

Percorso unico (A96)

Sede Lecce

Nessuno

Il corso fornisce da una parte una moderna introduzione alla programmazione orientata agli oggetti, in particolare al linguaggio di programmazione Java, dall’altra introduce i concetti fondamentali delle basi di dati.

Conoscenze e comprensione. I risultati attesi di apprendimento prevedono che al termine del corso gli studenti:

· conoscano e siano in grado di applicare la sintassi e la semantica caratterizzanti il linguaggio di programmazione Java;

· conoscano e siano in grado di applicare gli elementi distintivi della programmazione orientata agli oggetti;

· conoscano i principi fondamentali delle basi di dati;

· conoscano il modello Entità-Relazioni ed il design di basi di dati;

· conoscano il linguaggio SQL;

· abbiano acquisito la capacità di problem solving;

· siano in grado di sviluppare, compilare, eseguire ed effettuare il debug di applicazioni Java, incluse applicazioni per l’accesso ad una base di dati.

 

Capacità di applicare conoscenze e comprensione. Dopo aver seguito il corso, lo studente dovrebbe essere in grado di:

· descrivere una possibile soluzione algoritmica per un problema reale;

· effettuare il design di una base di dati;

· interrogare una base di dati;

· implementare in linguaggio Java un’applicazione.

Autonomia di giudizio. Gli studenti devono possedere la capacità di problem solving e devono pervenire a idee e giudizi originali e autonomi, a scelte coerenti nell’ambito del loro lavoro, particolarmente delicate nell’ambito della implementazione di una applicazione in un linguaggio orientato agli oggetti.Il corso promuovelo sviluppo dell’autonomia di giudizio nella scelta appropriata della soluzione migliore relativa a semplici problemi e la capacità critica di interpretare la bontà dei risultati ottenuti.

Abilità comunicative. È fondamentale che gli studenti siano in grado di comunicare con un pubblico vario e composito, non omogeneo culturalmente, in modo chiaro, logico ed efficace, utilizzando gli strumenti metodologici acquisiti e le loro conoscenze scientifiche e, in particolar modo, il lessico di specialità.
Il corso favorisce lo sviluppo delle abilità inerenti le capacità di esporre in termini precisi e formali snippets di codice sorgente in linguaggio Java, modelli Entità-Relazioni, database designs, queries SQL e la descrizione di possibili soluzioni algoritmiche a problemi reali.

Capacità di apprendimento. Gli studenti devono acquisire la capacità critica di rapportarsi, con originalità e autonomia, alle problematiche tipiche della programmazione orientata agli oggetti. Devono essere in grado di rielaborare e di applicare autonomamente le conoscenze nella più ampia prospettiva di auto-aggiornamento culturale e professionale dell'apprendimento permanente. In particolare, devono poter riusare le conoscenze acquisite nell’ambito dell’apprendimento di altri linguaggi di programmazione orientati agli oggetti e, più in generale, nell’ambito dell’Ingegneria del Software. Devono inoltre essere in grado di riusare le conoscenze legate alle basi di dati indipendentemente dallo specifico DBMS utilizzato.

Il corso si articola in lezioni frontali che si avvalgono dell’uso di slides rese disponibili agli studenti mediante la piattaforma Moodle, ed esercitazioni in aula. Le lezioni frontali sono finalizzate al miglioramento delle conoscenze e capacità di comprensione degli studenti mediante l’esposizione del linguaggio di programmazione Java, dei principi di progettazione orientata agli oggetti e dei principi relativi alle basi di dati; gli studenti sono invitati a partecipare alla lezione con autonomia di giudizio, formulando domande, presentando esempi e discutendo possibili soluzioni alternative. Le esercitazioni sono finalizzate sia alla comprensione degli algoritmi e dei codici Java presentati, ed allo sviluppo della capacità di problem solving (dato un problema, lo studente deve analizzarlo ed individuare una soluzione algoritmica appropriata, implementandola correttamente in Java), sia alla comprensione delle queries SQL, dei modelli Entità-Relazioni e dei database designs presentati.

L’esame consiste in una prova scritta,nella quale lo studente dovràdimostrare di aver acquisito da una parte la capacità di modellare ed interrogare una base di dati, dall’altra la capacità di problem solving medianteimplementazionein linguaggio Java di un algoritmorisolutivo di un problema, utilizzandocorrettamente le principali strutture dati e glialgoritmidi base visti a lezione.

 

 

Svolgimento prova scritta mediante Microsoft Teams

degli appelli di

Elementi di Informatica

 

 

Per la verifica dell’identità dello studente e la sorveglianza della prova si usa Microsoft Teams; il docente selezionerà ogni singolo studente chiedendogli di mostrare il documento di identificazione. L’esame si svolge all’interno di un team a cui gli studenti accedono mediante iscrizione da parte del docente o tramite un link che sarà comunicato via mail. Gli studenti sono invitati ad accedere al team almeno 15 minuti prima dell’orario di inizio previsto, al fine di effettuare il riconoscimento. Ogni studente parteciperà al meeting mediante un dispositivo dotato di microfono e webcam che dovrà rimanere accesa per tutta la durata della prova, inquadrando lo studente ed il foglio su cui scrive. Il docente controllerà la regolarità del lavoro degli studenti, annullando immediatamente la prova in caso di irregolarità di ogni genere.

 

Per la distribuzione agli studenti della traccia d’esame, il docente condivide nella chat di TEAMS un file pdf. La traccia è suddivisa in 2 esercizi, ed il docente rilascia agli studenti ognuno degli esercizi che la compongono ad intervalli di tempo determinati. La prova si svolge quindi come segue:

 

  1. Il docente rilascia il primo esercizio all’inizio dell’appello;
    1. gli studenti stampano il testo dell’esercizio ed iniziano la stesura dell’elaborato su un foglio ben visibile dal dispositivo di acquisizione video;
    2. il docente fornisce la sua stima x del tempo sufficiente per rispondere;
    3. consegna dello svolgimento relativo al primo esercizio trascorso il tempo x;
    4. Il docente attende 2 minuti per la consegna;
  2. Il docente rilascia il secondo esercizio;
    1. gli studenti stampano il testo dell’esercizio ed iniziano la stesura dell’elaborato su un foglio ben visibile dal dispositivo di acquisizione video;
    2. il docente fornisce la sua stima y del tempo sufficiente per rispondere;
    3. consegna dello svolgimento relativo al secondo esercizio trascorso il tempo y;
    4. Il docente attende 2 minuti per la consegna;
  3. Il docente comunica agli studenti partecipanti la chiusura dell’appello.

 

Per la consegna degli elaborati relativi ai singoli esercizi, si richiede che lo studente fotografi i fogli relativi, includendo anche un’immagine del libretto universitario o, in alternativa, un documento di riconoscimento. Le foto dovranno essere inserite in una directory chiamata “Esercizio_a_b_c_d” in cui:

 

  • a indica il numero progressivo dell’esercizio (1, 2 o 3);
  • b è EI (per Elementi di Informatica)
  • c indica il nome dello studente;
  • d indica il cognome dello studente.

 

La directory deve quindi essere compressa, ad esempio in formato zip, ed il file risultante deve essere consegnato tramite la chat del team.

 

Non è possibile consegnare lo svolgimento di un esercizio prima del tempo stabilito e passare allo svolgimento dell’esercizio successivo. Lo studente che intende rinunciare alla prova può darne comunicazione al docente in qualunque momento, ed uscire immediatamente dopo dal team. Non è possibile allontanarsi per nessun motivo durante lo svolgimento della prova. Lo studente è tenuto ad assicurarsi di disporre di tutto quanto il necessario allo svolgimento (microfono, webcam, stampante, fotocamera, penna, numero di fogli adeguato).

 

Orario di ricevimento

 

Previo appuntamento da concordare via email o al termine delle lezioni.

Introduzione al corso. Computers, Internet e Java. ApplicazioniJava. Input/Output ed operatori.Classi, oggetti, metodi e stringhe.Strutture di controllo. Operatori di assegnamento, incremento e decremento. Operatori logici. Dettagli sui metodi. Array ed ArrayList. Dettagli su classi e oggetti. Ereditarietà. Polimorfismo. Interfacce. Dettagli sulla gestione delle eccezioni. Stringhe, caratteri ed espressioni regolari. Files, Input/Output Streams, NIO e serializzazione XML. Generic collections. Espressioni lambda e streams. Ricorsione. Algoritmi di ricerca ed ordinamento, notazione asintotica. Concorrenza. Basi di dati ed utenti. Concetti ed architettura dei DBMS (Database Management Systems). Modellazione dei dati mediante il modello Entità-Relazioni (ER). Il modello relazionale, schemi e vincoli. Il linguaggio SQL. Accesso ai database con JDBC.

Java How to Program, Early Objects, 11th Edition

Deitel & Deitel

Pearson

 

Fundamentals Of Database Systems, Seventh Edition

Elmasri, Navathe

Pearson

ELEMENTI DI INFORMATICA (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2017/2018

Year taught 2018/2019

Course year 2

Semestre Primo Semestre (dal 24/09/2018 al 21/12/2018)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Calculus I and II, Probability Theory. Programming skills and working knowledge of the C programming language.

The course provides a modern introduction to design, analysis and implementation of sequential and parallel algorithms. In particular, the course is based on a pragmatic approach to parallel programming of message-passing algorithms through the C language and the MPI library.

Knowledge and understanding. Students must have a solid background with a broad spectrum of basic knowledge of sequential and parallel algorithms:

 

· the students must have the basic cognitive tools to think analytically, creatively, critically and in an inquiring way, and have the abstraction and problem-solving skills needed to cope with complex systems;

· they must have a solid knowledge of the design and implementation of sequential and parallel efficient algorithms;

· they must have the tools for analysing the resources used by algorithms;

· they must have a catalogue of the most well-known and efficient sequential and parallel algorithms for basic computational problems.

 

Applying knowledge and understanding. After the course the student should be able to:

 

· Describe and use the main design techniques for sequential algorithms;

· Design, prove the correctness and analyze the computational complexity of sequential algorithms;

· Understand the differences among several algorithms solving the same problem and recognize which one is better under different conditions;

· Describe and use basic sequential algorithms;

· Describe and use basic data structures; know about the existence of advanced data structures;

· Understand the difference between sequential and parallel algorithms;

· Design, implement and analyze message-passing based parallel algorithms in C using the MPI library;

· Describe and use basic parallel algorithms.

 

 

Making judgements. Students are guided to learn critically everything that is explained to them in class, to compare different approaches to solving algorithmic problems, and to identify and propose, in an autonomous way, the most efficient solution they find.

 

Communication. It is essential that students are able to communicate with a varied and composite audience, not culturally homogeneous, in a clear, logical and effective way, using the methodological tools acquired and their scientific knowledge and, in particular, the specialty vocabulary. The course promotes the development of the following skills of the student: ability to expose in precise and formal terms an abstract model of concrete problems, identifying the salient features of them and discarding the nonessential ones; ability to describe and analyze an efficient solution to the problem in question.

 

Learning skills. Students must acquire the critical ability to relate, with originality and autonomy, to the typical problems of data mining and, in general, cultural issues related to other similar areas. They should be able to develop and apply independently the knowledge and methods learnt with a view to possible continuation of studies at higher (doctoral) level or in the broader perspective of cultural and professional self-improvement of lifelong learning. Therefore, students should be able to switch to exhibition forms other than the source texts in order to memorize, summarize for themselves and for others, and disseminate scientific knowledge.

The course aims to enable students to abstract formal algorithmic models and problems from concrete computational problems, and to design efficient algorithmic solutions for them. This will be done using the following teaching method. Every computational problem will be introduced, motivating it with concrete examples. The presentation of each topic will be divided into four parts: 1. Description of the actual computational problem. 2. Modelling the real problem by means of an abstract problem. 3. Resolution of the abstract problem through an algorithm obtained through the application of the general techniques of design of algorithms introduced in the course. 4. Analysis of the resources used by the algorithm. The course consists of frontal lessons using slides made available to students via the Moodle platform, and classroom exercises. There will be theoretical lessons aimed at learning the basic techniques for the project and analysis of algorithms, and a part of lessons of an exercise type in which you will illustrate, with plenty of examples, how the theoretical knowledge acquired can be used in order to solve algorithmic problems of practical interest and implement parallel algorithms in C language through the MPI library.

Oral exam. Optionally, a student may be assigned a small project. During the exam the student is asked to illustrate theoretical topics in order to verify his/her knowledge and understanding of the selected topics. The student may also be asked to design a very simple algorithm in order to assess his/her ability to identify and use the relevant design techniques; alternatively, the student may be asked to analyze the complexity of a small code fragment.

Office Hours

By appointment; contact the instructor by email or at the end of class meetings.

Sequential Algorithms

 

Introduction. Order of growth. Analysis of algorithms. Decrease and conquer.  Divide and conquer. Recurrences.  Randomized algorithms.  Transform and conquer.  Dynamic programming.  Greedy algorithms. Complexity and computability. NP-Completeness. 

 

Parallel Algorithms

Introduction. The transition from sequential to parallel computing. Parallel complexity.  Parallel architectures.  Parallel algorithm design.  Message-Passing programming.  Sieve of Erathostenes.  Floyd all-pairs shortest path algorithm. Performance analysis.  Matrix-vector multiplication. Document classification.  Matrix multiplication.

Introduction to Algorithms. Third edition. Cormen, Leiserson, Rivest, Stein. The MIT Press

Parallel Programming in C with MPI and OpenMP International Edition (2004) Michael J. Quinn McGraw-Hill

PARALLEL ALGORITHMS (ING-INF/05)
DATA MINING

Degree course MATEMATICA

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 0.0

For matriculated on 2017/2018

Year taught 2017/2018

Course year 1

Semestre Secondo Semestre (dal 26/02/2018 al 25/05/2018)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Analisi Matematica. Probabilità e statistica. Algebra lineare. Programmazione ed analisi di algoritmi.

Introduzione al corso. Map-Reduce. (2 ore) Mining data streams. Frequent Items. (6 ore)  Frequent Itemsets ed association rules.(4 ore)  Mining similar items e Locality-Sensitive Hashing. (2 ore)  Analisi di grafi. Link analysis e PageRank. (2 ore)  Clustering. (4 ore)  Recommendation systems. (4 ore)  Mining Social-Network Graphs. (4 ore)  Dimensionality reduction. (2 ore)  Classification. (6 ore) Esercitazioni. (6 ore).

Il corso fornisce una moderna introduzione al data mining, un insieme di tecniche, algoritmi e metodologie per scoprire la struttura, patterns e relazioni in insiemi di dati (tipicamente, quelli più grandi) e fare previsioni. Le applicazioni del data mining stanno già accadendo intorno a noi, e se ben fatte, possono a volte anche passare inosservate. Come funziona la ricerca sul web di Google? Come fa Shazam a riconoscere una canzone? Come fa Netflix a raccomandare film a ciascuno dei suoi utenti? I principi del data mining forniscono le risposte di base a queste e ad altre domande simili. Il data mining abbraccia i campi dell’informatica, dello statistical machine learning e dei database. Obiettivo del corso è mettere in grado gli studenti di esplorare, analizzare e sfruttare i dati disponibili al fine di trasformarli in informazioni quantitative e qualitative di valore ed interesse per una azienda, ad esempio ai fini di un processo di decision-making.

Risultati di apprendimento

Dopo aver seguito il corso, lo studente dovrebbe essere in grado di:

 

•        descrivere ed utilizzare le principali tecniche di data mining;

•        comprendere le differenze tra algoritmi diversi che risolvono uno stesso problema e riconoscere quale algoritmo è il migliore rispetto a condizioni diverse;

•        affrontare nuovi problemi di data mining scegliendo i metodi più appropriati e giustificando le proprie scelte;

•        affrontare nuovi problemi di data mining progettando appositi algoritmi e valutando i risultati;

•        spiegare i risultati ottenuti sperimentalmente anche a persone con un background teorico diverso da statistica e/o informatica.

L’esame è orale. Durante l’esame, allo studente viene chiesto di illustrare argomenti teorici per verificare la sua conoscenza e comprensione degli argomenti scelti. 

Mining of Massive Datasets 

J. Leskovec, A. Rajaraman and J. Ullman

Disponibile gratuitamente online: http://www.mmds.org

 

Data Mining and Analysis

M. J. Zaki and W. Meira

Disponibile gratuitamente online: http://dataminingbook.info

DATA MINING (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2016/2017

Year taught 2017/2018

Course year 2

Semestre Primo Semestre (dal 25/09/2017 al 22/12/2017)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

Analisi I e II, Teoria della Probabilita’. Conoscenze pregresse del linguaggio di programmazione C.

Algoritmi sequenziali

 

Introduzione. Ordini di grandezza. Analisi di algoritmi. Decrease and conquer. Divide and conquer. Ricorrenze. (7 ore) Algoritmi randomizzati. (5 ore)  Transform and conquer. (2 ore)  Lower bound for sorting. Linear time sorting algorithms: Counting sort, Radix sort and Bucket sort. (3 ore) Order statistics. Selection in expected and worst case linear time. (2 ore) Programmazione dinamica. (4 ore) Algoritmi greedy. (3 ore) Complessita’ e calcolabilita’. NP-Completezza. (7 ore)

 

Algoritmi paralleli

 

1) Introduzione. Metodo scientifico moderno. Concorrenza e parallelismo. Equazioni di Bernstein. Evoluzione del supercalcolo. Calcolatori paralleli. Grafo delle dipendenze dei dati. Data parallelism. Functional Parallelism. Pipelining. Approcci per la programmazione di calcolatori paralleli. (2 ore)

 

2) Architetture parallele. Reti di interconnessione. Shared e switched media. Topologie di rete. 2D mesh. Binary tree. Hypertree. Fat tree. Butterfly. Ipercubo. Torus. Shuffle-exchange. Processor arrays. Multiprocessori centralizzati e distribuiti. Multicomputers asimmetrici e simmetrici. Clusters e reti di workstations. Tassonomia di Flinn. (5 ore)

 

3) Parallel algorithm design. Modello task-channel. Metodologia PCAM di Foster. Decomposizioni, tasks e grafo delle dipendenze dei tasks. Grain size di una decomposizione. Grado di concorrenza. Critical path length. Grafo delle interazioni dei tasks. Processi e mapping. Recursive decomposition, Data decomposition Exploratory decomposition, Speculative decomposition. Decomposizione di dati di input, intermedi e di output. Decomposizione ibride e gerarchiche. The Owner Computes Rule. Tasks static e dinamici. Comunicazione locale, globale, strutturata e non strutturata. Task interactions: read-only, read-write, one-way, two-way. Comunicazione static e dinamica, sincrona ed asincrona. Algoritmi red-black.  Divide and conquer. Agglomerazione. Effetto surface to volume. Communication/computation ratio. Replicare la computazione per eliminare comunicazioni. Mapping. NP-completeness del mapping. Mapping static e dinamico.     Mappings based on data partitioning. Mappings based on task graph partitioning. Hybrid mappings. Probabilistic mapping, cyclic and block-cyclic. Graph partitioning. Recursive bisection. Quad and Oct-trees. Space-filling curves. Scattered decomposition. Dynamic load-balancing. Centralized dynamic mapping, master-slave (manager-worker), chunck scheduling. Distributed dynamic mapping. Dynamic data driven mapping. Dynamic geometric decomposition: Adaptive Mesh Refinement (AMR). Esempi di progettazione di algoritmi paralleli: boundary value problem, reductions, n-body problem. Data input. (10 ore)

 

 

4) Message-Passing programming. Message-passing model. MPI. Circuit-SAT problem. Communicators. Point-to-point communications: vari tipi di send e receive. Collective communications: broadcast, reductions, scatter, gather, all-gather,  all-to-all. Benchmark. (2 ore)

 

5) Crivello di Eratostene: progettazione, analisi, implementazione e benchmark. (2 ore)

 

6) Algoritmo di Floyd all-pairs shortest path: progettazione, analisi, implementazione e benchmark. (2 ore)

 

7) Performance analysis. Tempo di esecuzione sequenziale e parallelo. Communication overhead. Overhead totale. Speedup ed efficienza. Speedup relativo, reale, assoluto, relae asintotico e relativo asintotico. Cost-normalized speedup. Numero ottimale di processori. Speedup superlineare. Legge di Amdahl-Ware. Legge di  Gustafson-Barsis. Metrica di Karp-Flatt. Legge di Lee (generalizzazione della legge di Amdahl-Ware). Metrica di isoefficienza (Grama, Gupta and Kumar). Funzione di scalabilita’ di Sun e Ni. Cost-optimality (work-efficiency). Cost-optimality ed isoefficienza. Grado di concorrenza ed isoefficienza. Impatto della non cost-optimality di un algoritmo parallelo. Minimum Parallel execution Time. Minimum Cost-Optimal Parallel Execution Time. (7 ore)

 

8) Moltiplicazione matrice-vettore. Algoritmo sequenziale. Rowwise block-striped decomposition. Columnwise block-striped decomposition. Checkerboard block decomposition. implementazione e benchmark. (3 ore)

 

 

9) Classificazione di documenti. Design. Manager-worker paradigm. Comunicazioni nonblocking. (2 ore)

 

10) Moltiplicazione di matrici. Algoritmo sequenziale iterativo row-oriented e ricorsivo block-oriented.  Rowwise block-striped  parallel algorithm. Algoritmo di Cannon. Algoritmo di Fox. Algoritmo DNS (dei tre indiani). (3 ore)

 

Esercitazioni. (21 ore)

Il corso fornisce una moderna introduzione alla progettazione, analisi ed implementazione di algoritmi sequenziali e paralleli. In particolare, il corso e’ basato su un approccio pratico alla programmazione parallela di algoritmi message-passing in linguaggio C con la libreria MPI.

 

Dopo aver seguito il corso, lo studente dovrebbe essere in grado di:

 

1)      descrivere ed utilizzare le principali tecniche di progettazione di algoritmi sequenziali;

2)      progettare, provare la correttezza, implementare ed analizzare la complessita’ computazionale di algoritmi sequenziali;

3)      comprendere le differenze tra algoritmi diversi che risolvono uno stesso problema e riconoscere quale algoritmo e’ il migliore rispetto a condizioni diverse;

4)      descrivere ed utilizzare algoritmi sequenziali di base;

5)      descrivere ed utilizzare strutture dati di base; conoscere l’esistenza di strutture dati avanzate;

6)      comprendere le differenze tra algoritmi sequenziali e paralleli;

7)      progettare, provare la correttezza, implementare ed analizzare la complessita’ computazionale di algoritmi paralleli basati su message-passing in C  utilizzando la libreria MPI;

8)      descrivere ed utilizzare algoritmi paralleli di base.

L’esame e’ orale. Opzionalmente, allo studente puo’ essere assegnato un piccolo progetto. Duramte l’esame, allo studente viene chiesto di illustrare argomenti teorici per verificare la sua conoscenza e comprensione degli argomenti scelti. Allo studente puo’ essere chiesto di progettare un semplice algoritmo per verificare la sua capacita’ di identificare ed utilizzare le tecniche di progettazione piu’ appropriate. Alternativamente, allo studente puo’ essere chiesto di analizzare la complessita’ computazionale di un breve frammento di codice.

PARALLEL ALGORITHMS (ING-INF/05)
MOBILE APPLICATIONS DEVELOPMENT

Degree course EUROPEAN HERITAGE,DIGITAL MEDIA AND THE INFORMATION SOCIETY

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2016/2017

Year taught 2016/2017

Course year 1

Semestre Primo Semestre (dal 26/09/2016 al 20/01/2017)

Language INGLESE

Subject matter INTERNAZIONALE (A56)

MOBILE APPLICATIONS DEVELOPMENT (ING-INF/05)
PARALLEL ALGORITHMS

Degree course COMPUTER ENGINEERING

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 9.0

Teaching hours Ore totali di attività frontale: 81.0

For matriculated on 2015/2016

Year taught 2016/2017

Course year 2

Semestre Primo Semestre (dal 26/09/2016 al 22/12/2016)

Language INGLESE

Subject matter PERCORSO COMUNE (999)

Location Lecce

PARALLEL ALGORITHMS (ING-INF/05)
MOBILE APPLICATIONS DEVELOPMENT

Degree course EUROPEAN HERITAGE,DIGITAL MEDIA AND THE INFORMATION SOCIETY

Subject area ING-INF/05

Course type Laurea Magistrale

Credits 6.0

Teaching hours Ore totali di attività frontale: 42.0

For matriculated on 2015/2016

Year taught 2015/2016

Course year 1

Semestre Secondo Semestre (dal 22/02/2016 al 21/05/2016)

Language INGLESE

Subject matter INTERNAZIONALE (A56)

MOBILE APPLICATIONS DEVELOPMENT (ING-INF/05)
PARALLEL ALGORITHMS

Corso di laurea COMPUTER ENGINEERING

Settore Scientifico Disciplinare ING-INF/05

Tipo corso di studio Laurea Magistrale

Crediti 9.0

Ripartizione oraria Ore totali di attività frontale: 0.0

Per immatricolati nel 2014/2015

Anno accademico di erogazione 2015/2016

Anno di corso 2

Semestre Primo Semestre (dal 21/09/2015 al 18/12/2015)

Lingua

Percorso PERCORSO COMUNE (999)

Sede Lecce - Università degli Studi

PARALLEL ALGORITHMS (ING-INF/05)
PARALLEL ALGORITHMS

Corso di laurea COMPUTER ENGINEERING

Settore Scientifico Disciplinare ING-INF/05

Tipo corso di studio Laurea Magistrale

Crediti 9.0

Ripartizione oraria Ore totali di attività frontale: 0.0

Per immatricolati nel 2013/2014

Anno accademico di erogazione 2014/2015

Anno di corso 2

Semestre Primo Semestre (dal 29/09/2014 al 13/01/2015)

Lingua

Percorso PERCORSO COMUNE (999)

Sede Lecce - Università degli Studi

PARALLEL ALGORITHMS (ING-INF/05)

Tesi

Gli studenti interessati possono chiedere una tesi relativa al data mining, machine learning e deep learning sequenziale, parallelo e distribuito. In alternativa, gli studenti possono anche chiedere una tesi di laurea in materia di sicurezza e crittografia. Infine, la tesi può anche essere strettamente correlata al calcolo parallelo e distribuito, inclusi cloud, P2P e internet of things.

Pubblicazioni

 

Profilo ORCID

Profilo DBLP

Profilo Google Scholar 

Profilo ResearchGate

 

Last update: June 16,  2022

 

Edited books


[5] Massimo Cafaro, Jinoh Kim, Alex Sim. Proceedings of the 4th International Workshop on Systems and Network Telemetry and Analytics, SNTA@HPDC 2021, Stockholm, Sweden, June 21, 2021. ACM 2021, ISBN 978-1-4503-8386-8

[4] Massimo Cafaro, Jinoh Kim, Alex Sim. Proceedings of the 3rd International Workshop on Systems and Network Telemetry and Analytics, SNTA@HPDC 2020, Stockholm, Sweden, June 23, 2020. ACM 2020, ISBN 978-1-4503-7980-9

[3] 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2018, Washington, DC, USA, May 1-4, 2018. IEEE Computer Society 2018, ISBN 978-1-5386-5815-4. Edited by Esam El-Araby, Dhabaleswar K. Panda, Sandra Gesing, Amy W. Apon, Volodymyr V. Kindratenko, Massimo Cafaro, Alfredo Cuzzocrea. DOI: 10.1109/CCGRID.2018.00-76

[2] Grids, Clouds and Virtualization, Springer-Verlag, London, 2011, Series: Computer Communications and Networks. Edited by Massimo Cafaro and Giovanni Aloisio. Hardcover ISBN 978-0-85729-048-9, Softcover ISBN 978-1-4471-2592-1, eBook ISBN 978-0-85729-049-6. DOI: 10.1007/978-0-85729-049-6

[1] Workshop Proceedings of the Conference Grid and Pervasive Computing 2009, May 4-8 2009, Geneva, Switzerland, IEEE, edited by H. Muller, J. Chen, M. Cafaro, J. H. Park and N. Abdennadher, ISBN 978-0-7695-3677-4. DOI: 10.1109/GPC.2009.5


Journals


[48] Cesaroni, C., Spogli, L., Franceschi, G. D., Damaceno, J. G., Grzesiak, M., Vani, B., Monico, J. F. G., Romano, V., Alfonsi L. and Cafaro, M. (2021). A measure of ionospheric irregularities: zonal velocity and its implications for L-band scintillation at low-latitudes. Earth and Planetary Physics, 5(5), pp. 1–12. Amer Geophysical Union. Print and Electronic ISSN 2096-3955 DOI 10.26464/epp2021042

[47] I. Epicoco, C. Melle, M. Cafaro, M. Pulimeno. AFQN: Approximate Qn Estimation in Data Streams. To appear in Applied Intelligence. Springer. Print ISSN 0924-669X, Electronic ISSN 1573-7497.

[46] Luca Spogli, Hossein Ghobadi, Antonio Cicone, Lucilla Alfonsi, Claudio Cesaroni, Nicola Linty, Vincenzo Romano, Massimo Cafaro. Adaptive Phase Detrending for GNSS Scintillation Detection: A Case Study Over Antarctica. IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1-5, 2022, Art no. 8009905, Print ISSN 1545-598X Electronic ISSN 1558-0571, DOI: 10.1109/LGRS.2021.3067727

[45] M. Pulimeno, I. Epicoco, M. Cafaro. Distributed mining of time-faded heavy hitters. Information Sciences, Elsevier, Volume 545, 2021, Pages 633-662, ISSN 0020-0255, DOI 10.1016/j.ins.2020.09.048

[44] M. Cafaro, C. Melle, M. Pulimeno, I. Epicoco. Fast Online Computation of the Qn Estimator with Applications to the Detection of Outliers in Data Streams. Expert Systems With Applications, Elsevier, Volume 164, February 2021, Article 113831, ISSN: 0957-4174, DOI: 10.1016/j.eswa.2020.113831

[43] F. Ventruto, M. Pulimeno, M. Cafaro, I. Epicoco. On frequency estimation and detection of heavy hitters in data streams. Future Internet, MDPI, Volume 12, Issue 9, Article 158, September 2020, ISSN 1999-5903, DOI: 10.3390/fi12090158

[42] I. Epicoco, C. Melle, M. Cafaro, M. Pulimeno and G. Morleo. UDDSketch: Accurate Tracking of Quantiles in Data Streams. IEEE Access, vol. 8, pp. 147604-147617, 2020, ISSN: 2169-3536, DOI: 10.1109/ACCESS.2020.3015599

[41] Ghobadi H., Spogli L., Alfonsi L., Cesaroni C., Cicone A., Linty N., Romano V., Cafaro M. Disentangling ionospheric refraction and diffraction effects in GNSS raw phase through fast iterative filtering technique. GPS Solutions 24, 85 (2020), Springer, ISSN 1080-5370, DOI: 10.1007/s10291-020-01001-1

[40] Juliana G. Damaceno, Karl Bolmgren, Jon Bruno, Giorgiana De Franceschi, Cathryn Mitchell, Massimo Cafaro, "GPS loss of lock statistics over Brazil during the 24th solar cycle", Advances in Space Research, Elsevier, Volume 66, Issue 2, 2020, Pages 219-225, ISSN 0273-1177, DOI 10.1016/j.asr.2020.03.041

[39] M. Cafaro, I. Epicoco, M. Pulimeno, “CMSS: Sketching Based Reliable Tracking of Large Network Flows”. Future Generation Computer Systems, Elsevier, Volume 101, 2019, Pages 770-784, ISSN 0167-739X, DOI 10.1016/j.future.2019.07.031

[38] M. Cafaro, I. Epicoco, M. Pulimeno, Special Issue on Parallel and Distributed Data Mining, Information Sciences, Elsevier, Volume 496, 2019, Pages 284-286,, ISSN 0020-0255, DOI 10.1016/j.ins.2019.05.062

[37] H. Ghobadi, P. Testa, L. Spogli, M. Cafaro, L. Alfonsi, V. Romano, R. Bru, "User-Oriented ICT Cloud Architecture for High-Accuracy GNSS-Based Services”, Sensors 19, no. 11: 2635. 2019, ISSN 1424-8220, DOI:10.3390/s19112635. https://www.mdpi.com/1424-8220/19/11/2635

[36] M. Cafaro, I. Epicoco, M. Pulimeno, “Mining frequent items in unstructured P2P networks”, Future Generation Computer Systems, Elsevier, Volume 95, 2019, Pages 1-16, DOI 10.1016/j.future.2018.12.030, ISSN 0167-739X 

[35] M.Cafaro, P. Pelle’, “Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation”, IEEE Transactions on Cloud Computing, IEEE, vol. 6, no. 2, pp. 453-463, April-June 2018, DOI 10.1109/TCC.2015.2396072, ISSN: 2168-7161, http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7018923&isnumber=8372844

[34] M. Cafaro, M. Pulimeno, I. Epicoco, "Parallel mining of time-faded heavy hitters", Expert Systems With Applications, Elsevier, Volume 96, 2018, Pages 115-128, ISSN: 0957-4174, DOI: 10.1016/j.eswa.2017.11.021

[33] M. Cafaro, I. Epicoco, M. Pulimeno, G. Aloisio, "On Frequency Estimation and Detection of Frequent Items in Time Faded Streams”, IEEE Access, IEEE, Volume 5, Issue 1, (2017), pp. 24078-24093, ISSN: 2169-3536, DOI: 10.1109/ACCESS.2017.2757238

[32] I. Epicoco, M. Cafaro, M. Pulimeno, Fast and Accurate Mining of Correlated Heavy Hitters”, Data Mining & Knowledge Discovery, Springer, Volume 32, Issue 1, 2018, pp. 162-186, ISSN: 1384-5810, DOI:10.1007/s10618-017-0526-x

[31] Arcangelo Messina, Massimo Cafaro. "Parallel damage detection through finite frequency changes on multicore processors", Structural Engineering and Mechanics, Techno-Press, Volume 63, No. 4 (2017), pp. 457-469. ISSN: 1225-4568, DOI: https://doi.org/10.12989/sem.2017.63.4.457

[30] M. Cafaro, M. Pulimeno, I. Epicoco, G. Aloisio, “Parallel Space Saving on Multi and Many-Core Processors”, Concurrency and Computation: Practice and Experience, John Wiley and Sons Ltd, Volume 30, Issue 7, DOI: 10.1002/cpe.4160, Print ISSN 1532-0626, Online ISSN 1532-0634

[29] M. Cafaro, M. Pulimeno, I. Epicoco, G. Aloisio, “Mining frequent items in the time fading model”, Information Sciences, Elsevier, Volume 370-371, 2016, pp. 221-238, ISSN: 0020-0255, DOI: 10.1016/j.ins.2016.07.077 available on ScienceDirect http://www.sciencedirect.com/science/article/pii/S0020025516305618 and arXiv https://arxiv.org/abs/1601.03892

[28] M. Cafaro, M. Pulimeno, P. Tempesta, “A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution”, Information Sciences, Elsevier, Volume 329, 2016, pp. 1 - 19, ISSN: 0020-0255, DOI: 10.1016/j.ins.2015.09.003 available on ScienceDirect http://www.sciencedirect.com/science/article/pii/S002002551500657X and arXiv http://arxiv.org/abs/1401.0702

[27] M. Cafaro, R. Civino, B. Masucci, "On the Equivalence of Two Security Notions for Hierarchical Key Assignment Schemes in the Unconditional Setting", IEEE Transactions on Dependable and Secure Computing, Volume 12, Issue 4 (2015), pp. 485 - 490, IEEE, DOI 10.1109/TDSC.2014.2355841, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6894147, ISSN: 1545-5971

[26] M. Cafaro, M. Mirto, G. Aloisio, "Preference-Based Matchmaking of Grid Resources with CP-Nets", Journal of Grid Computing, Volume 11, Issue 2 (2013), pp. 211-237, Springer Netherlands, DOI: 10.1007/s10723-012-9235-2 Print ISSN 1570-7873, Online ISSN 1572-9184

[25] M. Cafaro, P. Tempesta: Finding frequent items in parallel. Concurrency and Computation: Practice and Experience Volume 23 Issue 15 pp. 1774-1788 (2011), John Wiley and Sons Ltd., Chichester, UK DOI: 10.1002/cpe.1761 Print ISSN 1532-0626, Online ISSN 1532-0634

[24] M. Cafaro, H. Müller, N. Abdennadher: Special Section: Grid and Pervasive Computing 2009. Future Generation Computer Systems Volume 27, Issue 5, May 2011, pp. 587-589, Elsevier, DOI:10.1016/j.future.2010.11.019, ISSN 0167-739X

[23] J. Chen, M. Cafaro: Special Issue: 3rd International Workshop on Workflow Management and Applications in Grid Environments (WaGe2008). Concurrency and Computation: Practice and Experience Volume 21 Issue 16 pp. 1961-1964 (2009) John Wiley and Sons Ltd., DOI: 10.1002/cpe.1459, Print ISSN 1532-0626, Online ISSN 1532-0634

[22] M. Cafaro, V. De Bene, G. Aloisio, “Deterministic Parallel Selection Algorithms on Coarse Grained Multicomputers”, Concurrency and Computation: Practice and Experience, Volume 21 Issue 18 (2009), pp. 2336–2354, John Wiley and Sons Ltd., DOI: 10.1002/cpe.1453, Print ISSN 1532-0626, Online ISSN 1532-0634

[21] M. Cafaro, I. Epicoco, S. Fiore, D. Lezzi, S. Mocavero, G. Aloisio, “Near Real-Time Parallel Processing and Advanced Data Management of SAR Images in Grid Environments”, Journal of Real-Time Image Processing, Special issue on architectures and techniques for real-time processing of remotely sensed images, Volume 4, Number 3, 2009, pp. 219-227, Springer-Verlag, DOI 10.1007/s11554-009-0119-z, Print ISSN 1861-8200, Online ISSN 1861-8219

[20] M. Cafaro, I. Epicoco, M. Mirto, D. Lezzi, G. Aloisio, "The Grid Resource Broker Workflow Engine", Concurrency and Computation: Practice and Experience, Wiley, Special Issue: 2nd International Workshop on Workflow Management and Applications in Grid Environments (WaGe2007), Volume 20, Issue 15, pp. 1725 – 1739, 2008, John Wiley & Sons, Ltd., DOI: 10.1002/cpe.1318, Print ISSN 1532-0626, Online ISSN 1532-0634

[19] M. Mirto, S. Fiore, I. Epicoco, M. Cafaro, S. Mocavero, E. Blasi, G. Aloisio, "A Bioinformatics Grid Alignment Toolkit", Future Generation Computer Systems, Elsevier, Volume 24, Number 7, pp. 752-762, 2008, DOI:10.1016/j.future.2008.02.001, ISSN 0167-739X

[18] M. Mirto, M. Cafaro, S. Fiore, Daniele Tartarini, G. Aloisio, "A Grid-enabled Protein Secondary Structure Predictor", IEEE Transactions on NanoBioscience, IEEE, Volume 6, Issue 2, pp. 124-130, June 2007, DOI 10.1109/TNB.2007.897475, ISSN 1536-1241

[17] G. Aloisio, M. Cafaro, G. Carteni, I. Epicoco, S. Fiore, D. Lezzi, M. Mirto , S. Mocavero, "The Grid Resource Broker Portal", Concurrency and Computation: Practice and Experience, Special Issue on Grid Computing Environments, Volume 19, Issue 12 (2007), pp. 1663-1670, John Wiley & Sons, Ltd., DOI: 10.1002/cpe.1131, Print ISSN 1532-0626, Online ISSN 1532-0634

[16] Y. Yang, O. F. Rana, D.W. Walker, R. Williams, C. Georgousopoulos, M. Cafaro, G. Aloisio, "An agent infrastructure for on-demand processing of remote-sensing archives", Int. J. on Digital Libraries Volume 5, Issue 2, pp 120-132 (2005), Springer-Verlag, DOI 10.1007/s00799-003-0054-8, Print ISSN 1432-5012, Online ISSN 1432-1300

[15] G. Aloisio, M.C. Barba, M. Cafaro, E. Blasi, S. Fiore, M. Mirto, "A Web Service based Grid Portal for Edgebreaker Compression", Methods of Information in Medicine, Special issue on HealthGrid Volume 44 Issue 2, pp. 233-238, Schattauer Publishers (2005), ISSN 0026-1270

[14] G. Aloisio, M. Cafaro, G. Carteni, I. Epicoco, G. Quarta (2005). A grid portal for Earth Observation community. IL NUOVO CIMENTO DELLA SOCIETÀ ITALIANA DI FISICA C, GEOPHYSICS AND SPACE PHYSICS, Società Italiana di Fisica, Bologna, vol. 28 issue 2, pp. 193-203 (2005), DOI: 10.1393/ncc/i2005-10182-5, Print ISSN 2037-4909, Online ISSN 1826-9885

[13] G.P. Marra, I. Schipa, G. Aloisio, M. Cafaro, D. Conte, C. Elefante, C. Mangia, M. Miglietta, U. Rizza, A. Tanzarella (2005). G-AQFS (Grid Air Quality Forecast System): An experimental system based on GRID computing technologies to forecast atmospheric dispersion of pollutants. IL NUOVO CIMENTO DELLA SOCIETÀ ITALIANA DI FISICA. C, GEOPHYSICS AND SPACE PHYSICS, Società Italiana di Fisica, Bologna, vol. 28 issue 2, pp. 183-192 (2005), Print ISSN 2037-4909, Online ISSN 1826-9885

[12] G. Aloisio, M. Cafaro, S. Fiore, I. Epicoco, M. Mirto, S. Mocavero, “Performance Analysis of Information Services in a Grid Environment”, Journal of Systemics, Cybernetics and Informatics, Volume 2, Number 5, 2004, pp. 24-30, Online ISSN 1690-4524

[11] G. Aloisio, M.C. Barba, E. Blasi, M. Cafaro, S. Fiore, M. Mirto, “BIG: a Grid Portal for Biomedical Data and Images”, Journal of Systemics, Cybernetics and Informatics, Volume 2, Number 3, 2004, pp. 10-18, Online ISSN 1690-4524

[10] G. Aloisio, M. Cafaro, R. Cesari, C. Mangia, G.P. Marra, M. Miglietta, M. Mirto, U. Rizza, I. Schipa, A. Tanzarella, "G-AQFS: Grid computing exploitation for the management of air quality in presence of complex meteorological circulations", Journal of Digital Information Management, Volume 2, Number 2, pp. 67-73, June 2004, Digital Information Research Foundation (DIRF) Press, ISSN 0972-7272

[9] G. Aloisio, M. Cafaro, M. Mirto, S. Fiore, "Early Experiences with the GRelC Library", Journal of Digital Information Management, Volume 2, Number 2, pp. 54-60, June 2004, Digital Information Research Foundation (DIRF) Press, ISSN 0972-7272

[8] G. Aloisio, M. Cafaro, "A Dynamic Earth Observation System", Parallel Computing, Volume 29, Issue 10 (2003), pp. 1357-1362, Special Issue on High performance computing with geographical data, Elsevier, DOI:10.1016/j.parco.2003.04.002, ISSN 0167-8191

[7] G. Aloisio, M. Cafaro, D. Lezzi, "The Desktop Grid Environment Enabler", Computing and Informatics, Volume 21, Number 4 (2002), pp. 333-345, Special Issue on Grid Computing, Slovak Academy of Sciences, Institute of Informatics, ISSN 1335-9150

[6] G. Aloisio, M. Cafaro, I. Epicoco, "Early experiences with the GridFTP protocol using the GRB-GSIFTP library", Future Generation Computer Systems, Volume 18, Issue 8 (2002), pp. 1053-1059, Special Issue on Grid Computing: Towards a New Computing Infrastructure, North-Holland, DOI:10.1016/S0167-739X(02)00084-5, ISSN 0167-739X

[5] G. Aloisio, M. Cafaro, "Web-based access to the grid using the Grid Resource Broker Portal", Concurrency and Computation: Practice and Experience, Volume 14 Issue 13-15 (2002), pp. 1145-1160, Special Issue on Grid Computing Environments, John Wiley & Sons, Ltd., DOI: 10.1002/cpe.677, Print ISSN 1532-0626, Online ISSN 1532-0634

[4] G. Aloisio, M. Cafaro, E. Blasi, I. Epicoco, "The Grid Resource Broker, a Ubiquitous Grid Computing Framework", Journal of Scientific Programming, Volume 10, Number 2 (2002), pp. 113-119, Special Issue on Grid Computing, IOS Press, Amsterdam, Print ISSN 1058-9244, Online ISSN 1875-919X

[3] G. Aloisio, M. Cafaro, C. Kesselman, R. Williams, "Web Access to SuperComputing", IEEE Computing in Science and Engineering, IEEE, Volume 3, Issue 6 (2001), pp. 66-72, DOI 10.1109/5992.963430, ISSN 1521-9615

[2] G. Aloisio, M.Cafaro, P. Beraldi, F. Guerriero, R. Musmanno,"An algorithm for solving the distributed termination detection problem":  Parallel Algorithms and Applications, Volume 14, Issue 2, pp. 149-164, (1999) Taylor and Francis, DOI 10.1080/10637199808947383, ISSN 1063-7192

[1] B. Paternoster, M. Cafaro, "Computation of interval of stability of Runge-Kutta-Nystrom methods", Journal of Symbolic Computation, Elsevier, Volume 25, Issue 3 (1998) pp. 383-394, DOI:10.1006/jsco.1997.0183, ISSN 0747-7171


Book Chapters

 

[12] M. Cafaro, M. Pulimeno (2019), "Frequent Itemset Mining”. In: Business and Consumer Analytics: New Directions, Springer, DOI 10.1007/978-3-030-06222-4_6, pp. 269—304, book ISBN 978-3-030-06221-7, ebook ISBN 978-3-030-06222-4

[11] M. Cafaro, I. Epicoco and M. Pulimeno (2019) Data Mining: Mining Frequent Patterns, Associations Rules, and Correlations. In: Guenther, R. and Steel, D. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 1, pp. 358–366. Elsevier.

[10] M. Cafaro, I. Epicoco and M. Pulimeno (2019) Techniques for Designing Bioinformatics Algorithms. In: Guenther, R. and Steel, D. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 1, pp. 5–14. Elsevier.

[9] M. Cafaro, G. Aloisio, “Grids, Clouds and Virtualization”, in Grids, Clouds and Virtualization, Springer-Verlag, London, 2011, pp. 1-21, Series: Computer Communications and Networks. Edited by Massimo Cafaro and Giovanni Aloisio. Hardcover ISBN 978-0-85729-048-9, Softcover ISBN 978-1-4471-2592-1, eBook ISBN 978-0-85729-049-6

[8] S. Fiore, A. Negro, S. Vadacca, M. Cafaro, G. Aloisio, R. Barbera and E. Giorgio, “An Architectural Overview of the GRelC Data Access Service”, in Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, Emmanuel Udoh (Editor), Frank Wang (Co-Editor), pp. 98-108, IGI Global, 2009, DOI 10.4018/978-1-60566-184-1.ch010, ISBN13: 9781605661841 ISBN10: 1605661848 EISBN13: 9781605661858

[7] M. Mirto, I. Epicoco, M. Cafaro, S. Fiore, M. Passante, A. negro and G. Aloisio, “ProGenGrid: A Grid Problem Solving Environment for Bioinformatics”, in Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine, and Healthcare, Mario Cannataro (Editor), pp. 269-291, IGI Global, 2009, DOI: 10.4018/978-1-60566-374-6.ch014, ISBN13: 9781605663746 ISBN10: 1605663743 EISBN13: 9781605663753

[6] M. Cafaro et al., “The LIBI Grid Platform for Bioinformatics”, in Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine, and Healthcare, Mario Cannataro (Editor), pp. 577-613, IGI Global, 2009, DOI: 10.4018/978-1-60566-374-6.ch029, ISBN13: 9781605663746 ISBN10: 1605663743 EISBN13: 9781605663753

[5] S. Fiore, M. Cafaro, M. Mirto, S. Vadacca, A. Negro and G. Aloisio, "The GRelC Project: state of the art and future directions", in Advances in Parallel Computing”, “High Performance Computing and Grids in Action”, L. Grandinetti (Ed), pp. 331-344, IOS Press, 2008, ISBN 978-1-58603-839-7 (print) 978-1-60750-313-2 (online), Series “Advances in Parallel Computing”, Volume 16

[4] M. Cafaro, I. Epicoco, G. Quarta, Sandro Fiore, G. Aloisio, “Design and Implementation of a Grid Computing Environment for Remote Sensing”, in “High-Performance Computing in Remote Sensing”, A. Plaza and C. Chang (Eds), pp. 281-308, Chapman & Hall/CRC, 2008, ISBN13 9781584886624 ISBN10 1-58488-662-5

[3] G. Aloisio, M. Cafaro, I. Epicoco, "A Grid Software Process", in "Grid Computing: Software Environments and Tools", Jose C. Cunha and Omer F. Rana (Eds), pp. 75-98, Springer-Verlag London, 2006, DOI: 10.1007/1-84628-339-6_4, Print ISBN 978-1-85233-998-2 Online ISBN 978-1-84628-339-0

[2] G. Aloisio, M. Cafaro, I. Epicoco, J. Nabrzyski, "The EU GridLab Project: A Grid Application Toolkit and Testbed", in "Engineering the Grid: Status and Perspective", pp. 123-138, L. T. Yang, Jack Dongarra, Adolfy Hoisie, Beniamino Di Martino and Hans Zima (Eds), American Scientific Publisher 2006, ISBN: 1-58883-038-1

[1] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "The Grid Relational Catalog Project", in "Grid Computing: The New Frontiers of High Performance Computing”, Series “Advances in Parallel Computing”, Volume 14, pp.129-155, Elsevier, 2005, L. Grandinetti (Ed), ISBN: 978-0-444-51999-3

 


Proceedings

 

[73] A. Mariani, I. Epicoco, M. Cafaro, M. Pulimeno. Grid-based Contraction Clustering in a Peer-to-Peer Network. To appear in Proceedings of LOD 2022 - The 8th International Online & Onsite Conference on Machine Learning, Optimization, and Data Science – September 18 – 22, 2022 – Certosa di Pontignano, Siena – Tuscany, Italy, https://lod2022.icas.cc

[72] A. Della Monaca, M. Cafaro, M. Pulimeno, I. Epicoco. An Adaptive Clustering Approach for Distributed Outlier Detection in Data Streams. To appear in Proceedings of DCAI 2022 - The 19th International Conference on Distributed Computing and Artificial Intelligence, L'Aquila (Italy), 13th-15th July, 2022, www.dcai-conference.net

[71] J. G. Damaceno, C. Cesaroni, L. Spogli, G. D. Franceschi and M. Cafaro, "Multi-instrumental analyses of the September 2017 space weather storm over Brazil”, 2019 URSI Asia-Pacific Radio Science Conference (AP-RASC), New Delhi, India, 2019, pp. 1-4. DOI: 10.23919/URSIAP-RASC.2019.8738327

[70] Damaceno, J.G., Cesaroni, C., Spogli, L., Grzesiak, M., De Franceschi, G., Cafaro, M., “Regional short-term forecasting model to predict ionospheric scintillation and TEC at low latitudes”, 2019 URSI Asia-Pacific Radio Science Conference (AP-RASC), New Delhi, India, 2019, DOI: 10.23919/URSIAP-RASC.2019.8738621 

[69] Zhou, W., Zomaya, A., Fox, G., Martino, B.D., Yang, L., Dong, M., Ranjan, R., Cafaro, M., Wang, W., “Message from the ISPA 2018 Chairs”
(2019) Proceedings - 16th IEEE International Symposium on Parallel and Distributed Processing with Applications, 17th IEEE International Conference on Ubiquitous Computing and Communications, 8th IEEE International Conference on Big Data and Cloud Computing, 11th IEEE International Conference on Social Computing and Networking and 8th IEEE International Conference on Sustainable Computing and Communications, ISPA/IUCC/BDCloud/SocialCom/SustainCom 2018, art. no. 8672268, pp. XXII-XXIII. DOI: 10.1109/BDCloud.2018.00005

[68] M. Pulimeno, I. Epicoco, M. Cafaro, C. Melle and G. Aloisio, "Parallel Mining of Correlated Heavy Hitters on Distributed and Shared-Memory Architectures”, 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 5111-5118. doi: 10.1109/BigData.2018.8622201

[67] S. Samarelli, L. Agrimano, I. Epicoco, M. Cafaro, R. Nutricato, D. O. Nitti and F. Bovenga, “Rheticus®: a Cloud-Based Geo-Information Service for Ground Instabilities Detection and Monitoring”, in Proceedings of IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, 2018, pp. 2238-2240, DOI: 10.1109/IGARSS.2018.8518226 Electronic ISBN: 978-1-5386-7150-4, USB ISBN: 978-1-5386-7149-8 

[66] M. Pulimeno, I. Epicoco, M. Cafaro, C. Melle and G. Aloisio, "Parallel Mining of Correlated Heavy Hitters", in Gervasi O. et al. (eds), Proceedings of the 18th International Conference on Computational Science and Its Applications (ICCSA 2018), July 2 - 5, 2018, Melbourne, Australia. Lecture Notes in Computer Science, vol 10964, pp. 627–641, 2018. Springer, DOI https://doi.org/10.1007/978-3-319-95174-4_48 Print ISBN 978-3-319-95173-7 Online ISBN 978-3-319-95174-4

[65] M. Cafaro, I. Epicoco, G. Aloisio and M. Pulimeno, "CUDA Based Parallel Implementations of Space-Saving on a GPU," 2017 International Conference on High Performance Computing & Simulation (HPCS), Genoa, Italy, 2017, pp. 707-714. doi: 10.1109/HPCS.2017.108, ISBN 978-1-5386-3250-5

[64] M. Cafaro, M. Pulimeno, "Merging Frequent Summaries", ICTCS 2016, Proceedings of the 17th Italian Conference on Theoretical Computer Science, Lecce, Italy, September 7-9 2016, CEUR Proceedings, Volume 1720, pp. 280-285, http://ceur-ws.org/Vol-1720/

[63] Chen, J., Nepal, S., Cafaro, M., Pei, J., Sun, X.-H., Lin, X., Yang, L.T., ”Message from BDSE2013 Chairs," 2013 IEEE 16th International Conference on Computational Science and Engineering, Sydney, NSW, 2013, pp. xxx-xxx.
doi: 10.1109/CSE.2013.206

[62] M. Mirto, M. Cafaro, G. Aloisio, “Peer-to-Peer Data Discovery in Health Centers”, Computer-Based Medical Systems (CBMS), 2013 IEEE 26th International Symposium on, IEEE, pp. 343-348, 20-22 June 2013, DOI 10.1109/CBMS.2013.6627813, ISBN 978-1-4799-1053-3

[61] R. Van Engelen, M. Govindaraju, M. Cafaro, “Editorial message: First workshop on advances in high-performance e-Science middleware and applications: eScience 2008”, Proceedings of the 4th IEEE International Conference on eScience, eScience 2008, Indianapolis, IN, USA, DOI: 10.1109/eScience.2008.4

[60] S. Fiore, A. Negro, S. Vadacca, M. Cafaro, G. Aloisio, R. Barbera, E. Giorgio, “Advances in the GRelC Data Access Service”, IEEE Proceedings of the International Symposium on Parallel and Distributed Processing and Applications (ISPA 2008) - December 10-12, 2008 - Sydney, Australia, IEEE, pp. 849-854, DOI:10.1109/ISPA.2008.87 ISBN 978-0-7695-3471-8

[59] M. Mirto, M. Cafaro, I. Epicoco, G. Aloisio, Advances in the ProGenGrid Workflow Management System, IEEE Proceedings of the 1st International Workshop on High Performance Data Grid (HPDataGrid'08), held in conjunction with the 9th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'08), December 1-4 2008 - Dunedin, New Zealand, IEEE, pp. 538-543, DOI:10.1109/PDCAT.2008.60, ISBN 978-0-7695-3443-5
[58] M. Mirto, S. Fiore, M. Cafaro, M. Passante and G. Aloisio, A Grid-based Bioinformatics Wrapper for Biological Databases, Proceedings of the 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS 2008), June 17-19 2008, Jyvaskyla, Finland, IEEE, pp. 191-196, DOI:10.1109/CBMS.2008.93, ISBN 978-0-7695-3165-6

[57] S. Fiore, M. Mirto, M. Cafaro, S. Vadacca, A. Negro and G. Aloisio, A GRelC based Data Grid Management Environment, Proceedings of the 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS 2008), June 17-19 2008, Jyvaskyla, Finland, IEEE, pp. 355-360, DOI:10.1109/CBMS.2008.125, ISBN 978-0-7695-3165-6

[56] Sandro Fiore, Massimo Cafaro, Salvatore Vadacca, Alessandro Negro, Emanuele Verdesca, Maria Mirto, Giovanni Aloisio, “Asynchronous query mechanisms within the GRelC Data Access Service”, Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2008), February 12-14, 2008, Innsbruck, Austria, ACTA Press Anaheim, CA, USA, pp. 49-54, ISBN: 978-0-88986-714-7

[55] Massimo Cafaro, Daniele Lezzi, Sandro Fiore, Giovanni Aloisio, Robert Van Engelen, "The GSI plug-in for gSOAP: building cross-grid interoperable secure grid services", Proceedings of Parallel Processing and Applied Mathematics 2007, 7th International Conference, PPAM 2007, Gdansk, Poland, September 9-12, 2007, Volume 4967, pp. 894-901, Lecture Notes in Computer Science, Springer-Verlag Berlin, DOI:10.1007/978-3-540-68111-3_95, Print ISBN 978-3-540-68105-2, Online ISBN 978-3-540-68111-3

[54] Sandro Fiore, Alessandro Negro, Salvatore Vadacca, Massimo Cafaro, Maria Mirto, Giovanni Aloisio, "Advanced Grid DataBase Management with the GRelC Data Access Service", Proceedings of the 5th International Symposium on Parallel and Distributed Processing and Applications, ISPA 2007 Niagara Falls, Canada, August 29-31, 2007, Lecture Notes in Computer Science 4742, Springer-Verlag Berlin Heidelberg, pp. 683-694, DOI:10.1007/978-3-540-74742-0_61, Print ISBN 978-3-540-74741-3, Online ISBN 978-3-540-74742-0

[53] M. Cafaro, I. Epicoco, M. Mirto, D. Lezzi, G. Aloisio, "The Grid Resource Broker Workflow Engine", IEEE Proceedings of The 6th International Conference on Grid and Cooperative Computing (GCC 2006), Urumchi, Xinjiang, China, August 16-18, 2007, IEEE, pp. 725-732, DOI:10.1109/GCC.2007.120, ISBN 0-7695-2871-6

[52] S. Fiore, M. Cafaro, A. Negro, S. Vadacca, G. Aloisio, R. Barbera, E. Giorgio, "GRelC DAS: a Grid-DB Access Service for gLite Based Production Grids”, IEEE Proceedings of the Fourth International Workshop on Emerging Technologies for Next-generation GRID (ETNGRID 2007), June 18-20, 2007 - Paris (France), IEEE, in 16th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 2007. WETICE 2007, pp. 261-266, DOI:10.1109/WETICE.2007.4407167, ISBN 978-0-7695-2879-3

[51] S. Fiore, M. Mirto, M. Cafaro, G. Aloisio, "GRelC Data Storage: a Lightweight Disk Storage Management Solution for Bioinformatics “in silico” experiments", Proceedings of the 20th IEEE International Symposium on COMPUTER-BASED MEDICAL SYSTEMS (IEEE CBMS 2007), June 20-22, 2007, Maribor (Slovenia), IEEE, pp. 495-502, DOI:10.1109/CBMS.2007.53, ISBN 0-7695-2905-4

[50] Giovanni Aloisio, Massimo Cafaro, Italo Epicoco, Sandro Fiore, Maria Mirto, "A Services Oriented System for Bioinformatics Applications on the Grid", Studies in Health Technology and Informatics, From Genes to Personalized HealthCare: Grid Solutions for the Life Sciences, Volume 126, Proceedings of HealthGrid 2007, N. Jacq et al. (Eds.), 2007, Volume 126, IOS Press, pp. 174-183, 2007, ISBN 978-1-58603-738-3

[49] Giovanni Aloisio, Massimo Cafaro, Sandro Fiore, Maria Mirto, "A Grid System for the Ingestion of Biological Data into a Relational DBMS", Proceedings of IEEE International Symposium on Bioinformatics and Life Science Computing (BLSC-07), 21-23 May 2007, Niagara Falls, Canada, in 21st International Conference on Advanced Information Networking and Applications Workshops, 2007, AINAW ’07, IEEE, pp. 713-718, DOI:10.1109/AINAW.2007.26, ISBN:978-0-7695-2847-2

[48] Giovanni Aloisio, Massimo Cafaro, Sandro Fiore, Maria Mirto, Salvatore Vadacca, “GRelC Data Gather Service: a Step Towards P2P Production Grids”, Proceedings of 22nd ACM Symposium on Applied Computing (SAC 2007), Seoul, Korea, March 11 - 15, 2007, ACM New York, NY, USA, pp. 561-565, DOI:10.1145/1244002.1244131, ISBN 1-59593-480-4

[47] G. Aloisio, M. Cafaro, I. Epicoco, S. Fiore, M. Mirto, “BioGAT: a Grid Toolkit for Bioinformatics Sequence Alignment”, Proceeding of the Grid-Enabling Legacy Applications and Supporting End Users Workshop (GELA) within the framework of the 15th IEEE International Symposium on High Performance Distributed Computing, HPDC’15, Paris, France, June 19-23, 2006, pp. 77-85, 2006

[46] G. Aloisio, M. Cafaro, S. Fiore, M. Tana, “GridSAT architecture: a step further towards security and efficiency”, Proceedings of Parallel and Distributed Computing and Networks (PDCN) – IASTED, pp. 1-6, February 14 – 16, 2006 Innsbruck, Austria, ACTA Press Anaheim, CA, USA

[45] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "A Split&Merge Data Management Architecture for a Grid Environment", Proceedings of the 19th IEEE International Symposium on Computer-Based Medical Systems (IEEE CBMS 2006), IEEE, pp. 739-744, June 22-23 2006, Salt Lake City Utah (USA), DOI:10.1109/CBMS.2006.28, ISBN 0-7695-2517-1

[44] G. Aloisio, M. Cafaro, S. Molendini, M. Tana, F. Tommasi, A. Tricco, “GridSAT: Grid enabled Satellite Architecture for Reliable Transmissions”, Proceedings of the 2nd IEEE International Symposium on Wireless Communication Systems, IEEE, September 5-7, 2005 Siena, Italy, pp. 634-638, DOI:10.1109/ISWCS.2005.1547782, ISBN 0-7803-9206-X

[43] G. Aloisio, M. Cafaro, G. Carteni, I. Epicoco, G. Quarta, S. Raolil, "GridFlow for Earth Observation Data Processing", Proceedings of The 2005 International Conference on Grid Computing and Applications (GCA'05: June 20-23 2005, Las Vegas, USA), pp. 168-174, CSREA Press, ISBN 1-932415-57-2

[42] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “ProGenGrid: a Grid-enabled platform for Bioinformatics”, Studies in Health Technology and Informatics, From grid to Healthgrid - Proceedings of HealthGrid 2005, Tony Solomonides et al. (Eds), IOS Press, Volume 112, IOS Press, pp. 113-126, 2005, ISBN 978-1-58603-510-5

[41] G. Aloisio, M. Cafaro, I. Epicoco, Sandro Fiore, Maria Mirto, “A Semantic Grid-based Data Access and Integration Service for Bioinformatics”, Electronic Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2005, IEEE, May 9-12, 2005, Cardiff, Wales, UK, pp. 196 - 203 Vol. 1 , DOI:10.1109/CCGRID.2005.1558554, ISBN:0-7803-9074-1

[40] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “ProGenGrid: a Workflow Service Infrastructure for Composing and Executing Bioinformatics Grid Services”, Proceedings of the 18th IEEE International Symposium on Computer-Based Medical Systems (IEEE CBMS 2005), IEEE, pp. 555-560, Dublin, June 23-24 2005, Ireland, DOI:10.1109/CBMS.2005.90, ISBN 0-7695-2355-2

[39] G. Aloisio, M. Cafaro, I. Epicoco, G. Quarta, “Teaching High Performance Computing Parallelizing a real Computational Science Application”, Procedings of Computational Science – ICCS 2005: 5th International Conference, Atlanta, GA, USA, May 22-25, 2005. Proceedings, Part II, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 3515, 2005, pp 10-17, DOI:10.1007/11428848_2, Print ISBN 978-3-540-26043-1, Online ISBN 978-3-540-32114-9

[38] G. Aloisio, M. Cafaro, I. Epicoco, S. Fiore, D. Lezzi, M. Mirto, S. Mocavero, “Resource and Service Discovery in the iGrid Information Service”, Proceedings of International Conference on Computational Science and its Applications (ICCSA 2005), Singapore, May 9-12, 2005, Proceedings, Part III, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 3482, 2005, pp 1-9, DOI:10.1007/11424857_1, Print ISBN 978-3-540-25862-9, Online ISBN 978-3-540-32045-6

[37] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto. “The Grid-DBMS: Towards Dynamic Data Management in Grid Environments”. Proceedings of IEEE International Conference on Information Technology: Coding and Computing, ITCC 2005, 4-6 April 2005, Las Vegas, Nevada, USA, IEEE, Vol. II, pp. 199-204, 2005, DOI:10.1109/ITCC.2005.272, ISBN:0-7695-2315-3

[36] G. Aloisio, M. Cafaro, D. Conte, S. Fiore, I. Epicoco, G.P. Marra, G. Quarta, "A grid enabled Web Map Server", Proceedings of IEEE International Conference on Information Technology: Coding and Computing, ITCC 2005, 4-6 April 2005, Las Vegas, Nevada, USA, IEEE, Volume I, pp. 298-303, 2005, DOI:10.1109/ITCC.2005.12, ISBN 0-7695-2315-3

[35] G. Aloisio, M. Cafaro, I. Epicoco, D. Lezzi, R. Van Engelen, “The GSI plug-in for gSOAP: Enhanced Security, Performance, and Reliability”, Proceedings of IEEE International Conference on Information Technology: Coding and Computing, ITCC 2005, 4-6 April 2005, Las Vegas, Nevada, USA, IEEE, Volume I, pp. 304-309, DOI:10.1109/ITCC.2005.273, ISBN:0-7695-2315-3

[34] G. Aloisio, M. Cafaro, I. Epicoco, S. Fiore, D. Lezzi, M. Mirto and S. Mocavero, “iGrid, a Novel Grid Information Service”, Proceedings of Advances in Grid Computing - EGC 2005 (European Grid Conference, Amsterdam, The Netherlands, February 14-16, 2005, Revised Selected Papers), Lecture Notes in Computer Science, Springer-Verlag Berlin Heidelberg, Volume 3470, pp. 506-515, 2005, DOI:10.1007/11508380_52, Print ISBN 978-3-540-26918-2, Online ISBN 978-3-540-32036-4

[33] G. Aloisio, M. Cafaro, S. Fiore, G. Quarta, “A Grid-Based Architecture for Earth Observation Data Access”, Proceeding of the 2005 ACM Symposium on Applied Computing (SAC 2005), ACM New York, NY, USA, March 13-17, 2005, Santa Fe, New Mexico, USA, Volume I, pp. 701-705, DOI:10.1145/1066677.1066837, ISBN 1-58133-964-0

[32] G. Aloisio, Z. Balaton, P. Boon, M. Cafaro, I. Epicoco, G. Gombás, P. Kacsuk, T. Kielmann, D. Lezzi, “Integrating Resource and Service Discovery in the CoreGrid Information Cache Mediator Component”, In Proceedings of CoreGrid Integration Workshop, 2005

[31] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “Advanced Delivery Mechanisms in the GRelC Project”, Proceeding of 2nd International Workshop on Middleware for Grid Computing (MGC 2004), ACM New York, NY, USA, October 18 2004, Toronto, Ontario Canada, pp. 69-74, DOI:10.1145/1028493.1028505, ISBN 1-58113-950-0

[30] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “Bioinformatics Data Access Service in the ProGenGrid System”, Proceeding of First International Workshop on Grid Computing and its Application to Data Analysis (GADA 2004), October 25-29, Larnaca, Cyprus, Greece, in On the Move to Meaningful Internet Systems 2004: OTM 2004 Workshops, Lecture Notes in Computer Science Volume 3292, 2004, pp. 211-221, Springer-Verlag Berlin Heidelberg, DOI:10.1007/978-3-540-30470-8_38, Print ISBN 978-3-540-23664-1, Online ISBN 978-3-540-30470-8

[29] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "ProGenGrid: A Grid Framework for Bioinformatics", Proceeding of Biological and Artificial Intelligence Environments, 15th Italian Workshop on Neural Nets, WIRN VIETRI 2004 (Part 1, Pre-Wirn workshop on Computational Intelligence Methods for Bioinformatics and Bistatistics - CIBB 2004), Springer Netherlands, September 14-17 2004, Perugia, Italy, pp. 1-9, DOI:10.1007/1-4020-3432-6_1, Print ISBN 978-1-4020-3431-2, Online ISBN 978-1-4020-3432-9

[28] G. Aloisio, M. Cafaro, I. Epicoco, G. Quarta, “Information Management for Grid-Based Remote Sensing Problem Solving Environment,” Proceedings of the 2004 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'04), Las Vegas, Nevada, USA, June 21-24, 2004, Hamid R. Arabnia and Jun Ni (Ed.), pp. 887-894, Vol. II, CSREA Press, 2004, ISBN: 1-932415-24-6

[27] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “A Gather Service in a Health Grid Environment”, Electronic proceedings of Medicon and Health Telematics 2004, IFMBE Proceedings (Springer), Volume 6, July 31 - August 05 2004, Island of Ischia, Naples, Italy

[26] G. Aloisio, M. Cafaro, R. Cesari, C. Mangia, G.P. Marra, M. Miglietta, M. Mirto, U. Rizza, I. Schipa, A. Tanzarella, “G-AQFS: Grid computing exploitation for the management of air quality in presence of complex meteorological circulations”, Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 5 to 7, 2004, Las Vegas, Nevada, IEEE, Volume II, pp. 83-87, DOI:10.1109/ITCC.2004.1286594, ISBN 0-7695-2108-8

[25] G. Aloisio, M. Cafaro, I. Epicoco, G. Quarta, “A Problem Solving Environment for Remote Sensing Data Processing”, in Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 5 to 7, 2004, Las Vegas, Nevada, IEEE, Volume II, pp. 56-61, DOI:10.1109/ITCC.2004.1286590, ISBN 0-7695-2108-8

[24] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “The GRelC Library: A Basic Pillar in the Grid Relational Catalog Architecture”, in Proceedings of IEEE International Conference on Information Technology: Coding and Computing, April 5 to 7, 2004, Las Vegas, Nevada, IEEE Press, Volume I, pp. 372-376, DOI:10.1109/ITCC.2004.1286482, ISBN 0-7695-2108-8

[23] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “The GRELC Project: Towards GRID-DBMS”, Proceedings of the International Conference on Parallel and Distributed Computing and Networks (PDCN) - IASTED, pp. 1-6, Innsbruck (Austria) February 17-19, 2004, ACTA Press Anaheim, CA, USA

[22] G. Aloisio, M.C. Barba, M. Cafaro, E. Blasi, S. Fiore, M. Mirto, “Web Services for Edgebreaker Compression in a Grid Portal”, Electronic Proceedings of HealthGrid, 29-30 January, 2004, Clermont-Ferrand, Francia

[21] G. Aloisio, E. Blasi, M. Cafaro, M, S. Fiore, M. Mirto, “A virtual clinical folder on the grid”, Edited by: Callaos, N; Horimoto, K; Chen, J; et al., the 8th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, JUL 18-21, 2004, Vol. VII pp. 113-118

[20] G. Aloisio, M. Cafaro, S. Fiore, S. Mocavero, “An adaptive scheduling infrastructure for a grid environment”, Edited by: Callaos, N; Lesso, W; Sanchez, B, the 8th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, JUL 18-21, 2004, Vol. II pp. 11-15

[19] G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, G. Quarta, M. Tana, A. Zuccala, “A distributed architecture for remote sensing data management”, Edited by: Callaos, N; Lesso, W; Sanchez, B, the 8th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, JUL 18-21, 2004, Vol. I pp. 236-240 

[18] G. Aloisio, M. Cafaro, I. Epicoco, D. Lezzi, M. Mirto, S. Mocavero, “The Design and Implementation of the GridLab Information Service”, Proceedings of The Second International Workshop on Grid and Cooperative Computing (GCC 2003), 7-10 December 2003, Shanghai (China), Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 3032, 2004, pp. 131-138, DOI:10.1007/978-3-540-24679-4_26, Print ISBN 978-3-540-21988-0, Online ISBN 978-3-540-24679-4

[17] G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, S. Fiore, S. Mocavero, “A Grid Environment for Diesel Engine Chamber Optimization”, Proceedings of ParCo2003, 2-5 September, Dresden (Germany), 2003, Elsevier, ISBN 0-444-51689-1, pp. 599-607

[16] G. Aloisio, M. Cafaro, D. Lezzi, R. Van Engelen, “Secure Web Services with Globus GSI and gSOAP”, Proceedings of Euro-Par 2003 Parallel Processing, 26th - 29th August 2003, Klagenfurt, Austria, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 2790, 2003, pp .421-426, DOI:10.1007/978-3-540-45209-6_62, Print ISBN 978-3-540-40788-1, Online ISBN 978-3-540-45209-6

[15] G. Aloisio, M. Cafaro, E. Blasi, M. Mirto, S. Fiore, D. Lezzi, “Web Services for a Biomedical Imaging Portal”, Proceedings of Information Technology: Coding and Computing (ITCC 2003), IEEE, Volume IV, pp. 432-436, Las Vegas, Nevada, 28-30 April 2003, best student paper award, DOI:10.1109/ITCC.2003.1197568, ISBN 0-7695-1916-4

[14] G. Aloisio, E. Blasi, M. Cafaro, I. Epicoco, S. Fiore, M. Mirto, “Dynamic Grid Catalog Information Service”, Proceedings of the First European Across Grids Conference, February 13-14, 2003 Santiago de Compostela, Spain, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 2970, 2004, pp 198-205, DOI:10.1007/978-3-540-24689-3_25, Print ISBN 978-3-540-21048-1, Online ISBN 978-3-540-24689-3

[13] G. Aloisio, E. Blasi, M. Cafaro, S. Fiore, M. Mirto, “A Grid Portal for Biomedical Imaging”, Edited by: Callaos, N; Lesso, W; Schewe, KD; et al., the 7th World Multiconference on Systemics, Cybernetics and Informatics, ORLANDO, FL, JUL 27-30, 2003, Vol. V pp. 57-62

[12] G. Aloisio, M. Cafaro, S. Fiore, S; I. Epicoco, M. Mirto, S. Mocavero, “A performance comparison between GRIS and LDGC Information Services”, Edited by: Callaos, N; Lesso, W; Schewe, KD; et al., the 7th World Multiconference on Systemics, Cybernetics and Informatics, ORLANDO, FL, JUL 27-30, 2003, Vol. XII pp. 416-420 

[11] G. Aloisio, E. Blasi, M. Cafaro, M. Mirto, S. Fiore, “A genetic algorithm for medical image registration”, Edited by: Callaos, N; Whymark, G; Lesso, W, the 6th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2002)/8th International Conference on Information Systems Analysis and Synthesis (ISAS 2002)ORLANDO, FL, JUL 14-18, 2002, pp. 192-197

[10] G. Aloisio, L. De Paolis, L. Provenzano, M. Cafaro, L. Colizzi, “Coronary stent implant simulation using haptic interface: Problems and solutions “, Edited by: Callaos, N; Whymark, G; Lesso, W, the 6th World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2002)/8th International Conference on Information Systems Analysis and Synthesis (ISAS 2002) ORLANDO, FL, JUL 14-18, 2002, pp. 570-573

[9] G. Aloisio, M. Cafaro, I. Epicoco, “SARA, a Web Based Remote Sensing Digital Library”, Electronic Proceedings of IEEE International Parallel and Distributed Processing Symposium 2002, IEEE, Fort Lauderdale, Florida, April 15-19, 2002, DOI:10.1109/IPDPS.2002.1016634

[8] G. Aloisio, M. Cafaro, E. Blasi, L. Depaolis, I. Epicoco, “The GRB Library: Grid Computing with Globus in C”, Proceedings of High-Performance Computing and Networking, the 9th International Conference HPCN Europe 2001, Amsterdam, Netherlands, June 25–27, 2001, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 2110, 2001, pp. 133-139, DOI:10.1007/3-540-48228-8_14, Print ISBN 978-3-540-42293-8, Online ISBN 978-3-540-48228-4

[7] Allen G., Dramlitsch T., Goodale T., Lanfermann G., Radke T., Seidel E., Kielmann T., Verstoep K., Balaton Z., Kacsuk P., Szalai F., Gehring J., Keller A., Streit A., Matyska L., Ruda M., Krenek A., Knipp H., Merzky A., Reinefeld A., Schintke F., Ludwiczak B., Nabrzyski J., Pukacki J., Kersken H.-P., Aloisio G., Cafaro M., Ziegler W., Russell M., “Early experiences with the EGrid testbed”, Proceedings of the first IEEE/ACM symposium on Cluster Computing and the Grid, CCGRID 2001, 15-18 May 2001, Brisbane, IEEE, pp.130-137, 2001, DOI:10.1109/CCGRID.2001.923185, ISBN 0-7695-1010-8

[6] G. Aloisio, M. Cafaro, “An introduction to the Globus toolkit”, 2000 CERN School of Computing Conference, Edited by: Vandoni, CE, MARATHON, GREECE, SEP 17- 30, 2000, Book Series: C E R N REPORTS , pp. 117-131

[5] G. Aloisio, M. Cafaro, A. Mongelli, R. Williams, “The use of Computational Grids for a Dynamic Earth Observation System”, the 4th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, FL, JUL 23-26, 2000, Vol. VII pp. 535-537

[4] G. Aloisio, M. Cafaro, P. Falabella, C. Kesselman, R. Williams, “Grid Computing on the Web using the Globus Toolkit”, Proceedings of High Performance Computing and Networking, the 8th International Conference HPCN Europe 2000 Amsterdam, The Netherlands, May 8–10, 2000, Springer-Verlag Berlin Heidelberg, Lecture Notes in Computer Science Volume 1823, 2000, pp. 32-40, DOI:10.1007/3-540-45492-6_4, Print ISBN 978-3-540-67553-2, Online ISBN 978-3-540-45492-2

[3] M. Cafaro, B. Paternoster, “Analysis of Stability of Rational Approximations through computer algebra”, Proceedings of The Second Workshop on Computer Algebra in Scientific Computing (CASC 99), Munich, May 31-June 4, 1999, pp. 25-36, Springer-Verlag Berlin Heidelberg, DOI:10.1007/978-3-642-60218-4_2, Print ISBN 978-3-540-66047-7, Online ISBN 978-3-642-60218-4

[2] G. Aloisio, M. Cafaro, R. Williams, “The Digital Puglia Project: an Active Digital Library of remote sensing data”, Proceedings of High-Performance Computing and Networking, the 7th International Conference, HPCN Europe 1999, Amsterdam, The Netherlands, April 12–14, 1999, Springer Berlin Heidelberg, Lecture Notes in Computer Science Volume 1593, 1999, pp. 563-572, DOI:10.1007/BFb0100617, Print ISBN 978-3-540-65821-4, Online ISBN 978-3-540-48933-7

[1] G. Aloisio, M. Cafaro, R. Williams, P. Messina, “A distributed web-based metacomputing environment”, Proceedings of High-Performance Computing and Networking, the 5th International Conference, HPCN Europe 1997, Vienna, Austria, April 28–30, 1997 Springer Berlin Heidelberg, Lecture Notes in Computer Science Volume 1225, 1997, pp. 480-486, DOI:10.1007/BFb0031620, Print ISBN 978-3-540-62898-9, Online ISBN 978-3-540-69041-2

Temi di ricerca

Algoritmi sequenziali, paralleli e distribuiti. Cloud/grid/P2P computing. Data mining. Machine learning. Deep learning. Security e crittografia.