PL Data Analysis

2024-04-03

TABLE OF CONTENTS

Presentation of the subject and context of the study.
Approach and code. 2)1) Approach 2)2) Code
Limits of the models and of their predictive power.
Difficulties.
What could be improved in the future.

PRESENTATION OF THE SUBJECT AND CONTEXT OF THE STUDY

When we speak of the Premier League, we are referring to England's national football championship.

APPROACH AND CODE

Approach

Our approach consists of three steps:

Scrape basic information over time: points, wins and losses.
Scrape information about the current championship: the budget.
Try to apply statistical models and produce descriptive statistics.
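The descriptive-statistics step can be sketched with Python's standard library alone. The example below is a minimal illustration, not the analysis actually run for this study: the team labels and numbers are placeholders rather than scraped results, and `describe` is a hypothetical helper introduced only for this sketch.

```python
import statistics

# Placeholder rows (NOT real results): one dict per team.
season = [
    {"team": "Team A", "points": 90, "wins": 28},
    {"team": "Team B", "points": 70, "wins": 21},
    {"team": "Team C", "points": 50, "wins": 14},
    {"team": "Team D", "points": 30, "wins": 7},
]


def describe(rows, column):
    """Return basic descriptive statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


points_summary = describe(season, "points")
print(points_summary)
```

The same helper can be called on any other numeric column, for example `describe(season, "wins")`.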

CODE 1/

## Warning in system2(command = python, args = shQuote(script), stdout =
## TRUE, : running command ''/Users/polo11/Documents/GitHub/Premierleague/
## Automatisation.ipynb' '/Library/Frameworks/R.framework/Versions/4.1/Resources/
## library/reticulate/config/config.py' 2>/dev/null' had status 126

## Error in python_config_impl(python): Error 126 occurred running /Users/polo11/Documents/GitHub/Premierleague/Automatisation.ipynb:

Code 2/

import re

import urllib3
from bs4 import BeautifulSoup

# Example of a URL that can be passed to get_page.
urlpage_4 = 'https://www.skysports.com/premier-league-table/2023'


def get_page(urlpage_4, element, html_class):
    """Return every HTML element with the given tag and class found on the page."""
    req_5 = urllib3.PoolManager()
    res_5 = req_5.request('GET', urlpage_4)
    row_html_5 = BeautifulSoup(res_5.data, 'html.parser')
    return row_html_5.find_all(element, class_=html_class)


# The table rows are kept as one string so they can be searched with regular expressions.
PL19 = str(get_page(urlpage_4, 'tr', 'row-body'))

# Team names in table order, so index + 1 gives the league position.
list_team_20 = re.findall('<span class="team-name">(.*?)</span>', PL19)


def extract_team_20_stats(PL19, team):
    """Return a dict of season statistics for one team, read from the table HTML."""
    # The name is matched exactly as scraped: team.title() would alter names that
    # contain lowercase words or all-caps abbreviations, and index() would then fail.
    teams = re.findall('<span class="team-name">(.*?)</span>', PL19)
    position = teams.index(team) + 1
    # Slice the row from the team name to the end of its <tr>.
    start = PL19.find(team)
    end = PL19.index("</tr>", start)
    team_data_20 = PL19[start:end]
    match_played = 38  # hard-coded: a full Premier League season
    # Numeric cells of the row, in page order.
    data = [int(s) for s in re.findall(r'<td.*?>(\d+)</td>', team_data_20)]
    # The index-to-column mapping below assumes this cell order; check it against the page.
    team_stats20 = {
        'match_played': match_played,
        'position': position,
        'points': data[0],
        'wins': data[1],
        'drawns': data[2],
        'loses': data[3],
        'goals_for': data[4],
        'goals_against': data[5],
    }
    return team_stats20
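To see how the row slicing and the regular expression behave without any network access, the same logic can be run on a small hand-written snippet. This is only a sketch: `sample_html` is hypothetical markup that imitates two table rows (the real page may be structured differently), the values are placeholders, and `numeric_cells` is a helper introduced for this illustration.

```python
import re

# Hypothetical markup imitating two table rows; NOT copied from the real page.
sample_html = (
    '<tr class="row-body"><td>1</td>'
    '<td><span class="team-name">Team A</span></td>'
    '<td>38</td><td>28</td><td>6</td><td>4</td></tr>'
    '<tr class="row-body"><td>2</td>'
    '<td><span class="team-name">Team B</span></td>'
    '<td>38</td><td>21</td><td>7</td><td>10</td></tr>'
)


def numeric_cells(html, team):
    """Return the integer cells found between the team name and the end of its row."""
    start = html.find(team)
    end = html.index("</tr>", start)
    return [int(s) for s in re.findall(r'<td.*?>(\d+)</td>', html[start:end])]


teams = re.findall('<span class="team-name">(.*?)</span>', sample_html)
cells_a = numeric_cells(sample_html, "Team A")
cells_b = numeric_cells(sample_html, "Team B")
print(teams, cells_a, cells_b)
```

Because the slice starts at the team name, the position cell that precedes it is skipped, and the first integer returned is whichever numeric column follows the name in the markup. Printing the list for one real row is a quick way to confirm which index corresponds to which statistic.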

Code 3/

import pandas as pd

# One single-row DataFrame per team, keyed by team name.
team_stats_20 = {}

for team in list_team_20:
    team_stats = extract_team_20_stats(PL19, team)
    team_stats_df = pd.DataFrame(team_stats, index=[0])
    team_stats_df['team'] = team
    team_stats_df['year'] = 2020  # season label attached to every row
    team_stats_20[team] = team_stats_df
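A dictionary of one-row DataFrames is awkward to analyse directly, because DataFrame methods are only available on the individual values, not on the dictionary itself. One possible next step, shown here as a sketch with placeholder values rather than scraped results, is to stack the frames into a single table with `pd.concat`. The name `season_df` is an assumption introduced for this example.

```python
import pandas as pd

# Placeholder per-team frames shaped like the ones built in the loop (values are NOT real).
team_stats_20 = {
    "Team A": pd.DataFrame(
        {"points": 90, "wins": 28, "team": "Team A", "year": 2020}, index=[0]
    ),
    "Team B": pd.DataFrame(
        {"points": 70, "wins": 21, "team": "Team B", "year": 2020}, index=[0]
    ),
}

# Stack the one-row frames into a single table with a fresh 0..n-1 index.
season_df = pd.concat(list(team_stats_20.values()), ignore_index=True)
print(season_df)
```

With all teams in one DataFrame, column-wise operations such as `season_df["points"].mean()` or `season_df.sort_values("points")` apply to the whole season at once.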

## Error in team_stats_20.replace({: could not find function "team_stats_20.replace"