Skip to the content.

Since I need to choose the school most fitted for my child, I really interested in having more knowledge about primary education system in The Netherlands and to know more about school differences. I found that Dutch Ministry of education, culture and science opened some data about primary schools. It’s very scattered information placed in different files in not very useful for analysis form, so before go to analysis I want to join info I need together, clean data, deal with missing values and maybe tramsform data to the view which would be easier to read and analyse.

All the data for this part of dataset construction I got here: https://duo.nl/open_onderwijsdata/databestanden/po/leerlingen-po/bo-sbo/bo-sbo-eindscores.jsp

1. Exams scores

First I will join a few .cvs files about exam score from differetn years together:

import pandas as pd 
data_score_2014 = pd.read_csv("input_data/Test_score_2014_2015.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1") 
data_score_2015 = pd.read_csv("input_data/Test_score_2015_2016.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1") 
data_score_2016 = pd.read_csv("input_data/Test_score_2016_2017.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1") 
data_score_2017 = pd.read_csv("input_data/Test_score_2017_2018.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1") 
data_score_2018 = pd.read_csv("input_data/Test_score_2018_2019.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1") 

print(data_score_2014.columns)
data_score_2018.head()
Index(['PEILDATUM_LEERLINGEN', 'PRIKDATUM_SCORES', 'BRIN_NUMMER',
       'VESTIGINGSNUMMER', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
       'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
       'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'Leerjaar_8',
       'ONTHEFFING_REDEN_ND', 'CET_AANTAL', 'cet_gem', 'IEP_AANTAL', 'iep_gem',
       'ROUTE8_AANTAL', 'route8_gem', 'DREMPEL_AANTAL', 'drempel_gem'],
      dtype='object')
PEILDATUM_LEERLINGEN PRIKDATUM_SCORES BRIN_NUMMER VESTIGINGSNUMMER INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENUMMER GEMEENTENAAM PROVINCIE ... CET_AANTAL CET_GEM IEP_AANTAL IEP_GEM ROUTE8_AANTAL ROUTE8_GEM DIA_AANTAL DIA_GEM AMN_AANTAL AMN_GEM
0 20190417 20190725 00AP 0 Wereldwijs 2716PH ZOETERMEER 637 Zoetermeer Zuid-Holland ... 11 538,636363636364 0 0 0 0
1 20190417 20190725 00AR 0 BS "De Maasparel" 6107AW STEVENSWEERT 1641 Maasgouw Limburg ... 17 537,588235294118 0 0 0 0
2 20190417 20190725 00AV 0 De Morgenster 3201CN SPIJKENISSE 1930 Nissewaard Zuid-Holland ... 12 532,5 0 0 0 0
3 20190417 20190725 00AZ 0 De Stapsteen 2971AR BLESKENSGRAAF CA 1927 Molenwaard Zuid-Holland ... 0 14 84,3571428571429 0 0 0
4 20190417 20190725 00BA 0 OBS De Klimboom 6666EB HETEREN 1734 Overbetuwe Gelderland ... 0 16 83,625 0 0 0

5 rows × 25 columns

I noticed that the data_score_2014 dataframe have some columns named using small letters, but in other dataframes only capital letters are using, let’s change it:

data_score_2014.rename(columns = {'cet_gem': 'CET_GEM', 'iep_gem': 'IEP_GEM', 'route8_gem':'ROUTE8_GEM', 'Leerjaar_8':'LEERJAAR_8'}, inplace = True)

Now let’s merge everything in one dataframe:

data_frames = [data_score_2014, data_score_2015, data_score_2016, data_score_2017, data_score_2018]
data_score = pd.concat(data_frames).fillna(' ')
data_score.reset_index(drop=True, inplace=True)
print(data_score_2014.shape)
print(data_score_2015.shape)
print(data_score_2016.shape)
print(data_score_2017.shape)
print(data_score_2018.shape)
print(data_score.shape)
(6920, 23)
(6851, 21)
(6751, 27)
(6658, 25)
(6579, 25)
(33759, 29)
data_score.head()
PEILDATUM_LEERLINGEN PRIKDATUM_SCORES BRIN_NUMMER VESTIGINGSNUMMER INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENUMMER GEMEENTENAAM PROVINCIE ... ROUTE8_AANTAL ROUTE8_GEM DREMPEL_AANTAL drempel_gem DIA_AANTAL DIA_GEM CESAN_AANTAL CESAN_GEM AMN_AANTAL AMN_GEM
0 20141001 20150918 00AP 0 De Schanskorf 2715BT ZOETERMEER 637 Zoetermeer Zuid-Holland ... 0,00 0,00 0,00 0,00
1 20141001 20150918 00AR 0 BS "De Maasparel" 6109AM OHE EN LAAK 1641 Maasgouw Limburg ... 0,00 0,00 0,00 0,00
2 20141001 20150918 00AV 0 De Morgenster 3201CN SPIJKENISSE 612 Spijkenisse Zuid-Holland ... 0,00 0,00 0,00 0,00
3 20141001 20150918 00AZ 0 De Kiezel en de Kei 2971AR BLESKENSGRAAF CA 1927 Molenwaard Zuid-Holland ... 0,00 0,00 0,00 0,00
4 20141001 20150918 00BA 0 OBS De Klimboom 6666EB HETEREN 1734 Overbetuwe Gelderland ... 0,00 0,00 0,00 0,00

5 rows × 29 columns

Now I want to see which type of columns we have in dataset:

print(data_score.dtypes)
PEILDATUM_LEERLINGEN          int64
PRIKDATUM_SCORES              int64
BRIN_NUMMER                  object
VESTIGINGSNUMMER              int64
INSTELLINGSNAAM_VESTIGING    object
POSTCODE_VESTIGING           object
PLAATSNAAM                   object
GEMEENTENUMMER                int64
GEMEENTENAAM                 object
PROVINCIE                    object
SOORT_PO                     object
DENOMINATIE_VESTIGING        object
BEVOEGD_GEZAG_NUMMER          int64
LEERJAAR_8                    int64
ONTHEFFING_REDEN_ND          object
CET_AANTAL                   object
CET_GEM                      object
IEP_AANTAL                   object
IEP_GEM                      object
ROUTE8_AANTAL                object
ROUTE8_GEM                   object
DREMPEL_AANTAL               object
drempel_gem                  object
DIA_AANTAL                   object
DIA_GEM                      object
CESAN_AANTAL                 object
CESAN_GEM                    object
AMN_AANTAL                   object
AMN_GEM                      object
dtype: object

There are few columns has type ‘object’, but they should has ‘int’ or ‘float’ type, so I think there are some values which cannot be converted to numeric ones. Let’s check it. I think that such values can be considered as missing values, so let’s change it to 0:

def fill_missing_zero_value(dataframe, column):
    dataframe.loc[dataframe[column] == ' ', column] = 0

def convert_to_numeric(dataframe, column):
    dataframe[column] = dataframe[column].astype(str).str.replace(',','.')
    dataframe[column] = pd.to_numeric(dataframe[column])

def get_numeric_column(dataframe, column):
    fill_missing_zero_value(dataframe, column)
    convert_to_numeric(dataframe, column)

get_numeric_column(data_score, 'ONTHEFFING_REDEN_ND')
data_score['ONTHEFFING_REDEN_ND'].astype(int)

get_numeric_column(data_score, 'CET_AANTAL')
data_score['CET_AANTAL'] = data_score['CET_AANTAL'].astype(int)
get_numeric_column(data_score, 'CET_GEM')

get_numeric_column(data_score, 'IEP_AANTAL')
data_score['IEP_AANTAL'] = data_score['IEP_AANTAL'].astype(int)
get_numeric_column(data_score, 'IEP_GEM')

get_numeric_column(data_score, 'ROUTE8_AANTAL')
data_score['ROUTE8_AANTAL'] = data_score['ROUTE8_AANTAL'].astype(int)
get_numeric_column(data_score, 'ROUTE8_GEM')

get_numeric_column(data_score, 'DREMPEL_AANTAL')
data_score['DREMPEL_AANTAL'] = data_score['DREMPEL_AANTAL'].astype(int)
get_numeric_column(data_score, 'drempel_gem')

get_numeric_column(data_score, 'DIA_AANTAL')
data_score['DIA_AANTAL'] = data_score['DIA_AANTAL'].astype(int)
get_numeric_column(data_score, 'DIA_GEM')

get_numeric_column(data_score, 'CESAN_AANTAL')
data_score['CESAN_AANTAL'] = data_score['CESAN_AANTAL'].astype(int)
get_numeric_column(data_score, 'CESAN_GEM')

get_numeric_column(data_score, 'AMN_AANTAL')
data_score['AMN_AANTAL'] = data_score['AMN_AANTAL'].astype(int)
get_numeric_column(data_score, 'AMN_GEM')

print(data_score.dtypes)
PEILDATUM_LEERLINGEN           int64
PRIKDATUM_SCORES               int64
BRIN_NUMMER                   object
VESTIGINGSNUMMER               int64
INSTELLINGSNAAM_VESTIGING     object
POSTCODE_VESTIGING            object
PLAATSNAAM                    object
GEMEENTENUMMER                 int64
GEMEENTENAAM                  object
PROVINCIE                     object
SOORT_PO                      object
DENOMINATIE_VESTIGING         object
BEVOEGD_GEZAG_NUMMER           int64
LEERJAAR_8                     int64
ONTHEFFING_REDEN_ND          float64
CET_AANTAL                     int64
CET_GEM                      float64
IEP_AANTAL                     int64
IEP_GEM                      float64
ROUTE8_AANTAL                  int64
ROUTE8_GEM                   float64
DREMPEL_AANTAL                 int64
drempel_gem                  float64
DIA_AANTAL                     int64
DIA_GEM                      float64
CESAN_AANTAL                   int64
CESAN_GEM                    float64
AMN_AANTAL                     int64
AMN_GEM                      float64
dtype: object

Let’s check do we have null or NaN values:

print(data_score.isnull().sum())
print(data_score.isna().sum())
PEILDATUM_LEERLINGEN         0
PRIKDATUM_SCORES             0
BRIN_NUMMER                  0
VESTIGINGSNUMMER             0
INSTELLINGSNAAM_VESTIGING    0
POSTCODE_VESTIGING           0
PLAATSNAAM                   0
GEMEENTENUMMER               0
GEMEENTENAAM                 0
PROVINCIE                    0
SOORT_PO                     0
DENOMINATIE_VESTIGING        0
BEVOEGD_GEZAG_NUMMER         0
LEERJAAR_8                   0
ONTHEFFING_REDEN_ND          0
CET_AANTAL                   0
CET_GEM                      0
IEP_AANTAL                   0
IEP_GEM                      0
ROUTE8_AANTAL                0
ROUTE8_GEM                   0
DREMPEL_AANTAL               0
drempel_gem                  0
DIA_AANTAL                   0
DIA_GEM                      0
CESAN_AANTAL                 0
CESAN_GEM                    0
AMN_AANTAL                   0
AMN_GEM                      0
dtype: int64
PEILDATUM_LEERLINGEN         0
PRIKDATUM_SCORES             0
BRIN_NUMMER                  0
VESTIGINGSNUMMER             0
INSTELLINGSNAAM_VESTIGING    0
POSTCODE_VESTIGING           0
PLAATSNAAM                   0
GEMEENTENUMMER               0
GEMEENTENAAM                 0
PROVINCIE                    0
SOORT_PO                     0
DENOMINATIE_VESTIGING        0
BEVOEGD_GEZAG_NUMMER         0
LEERJAAR_8                   0
ONTHEFFING_REDEN_ND          0
CET_AANTAL                   0
CET_GEM                      0
IEP_AANTAL                   0
IEP_GEM                      0
ROUTE8_AANTAL                0
ROUTE8_GEM                   0
DREMPEL_AANTAL               0
drempel_gem                  0
DIA_AANTAL                   0
DIA_GEM                      0
CESAN_AANTAL                 0
CESAN_GEM                    0
AMN_AANTAL                   0
AMN_GEM                      0
dtype: int64

It looks like we don’t have NaN or null values.

Now I would like to change columns with exam’s date info to columns with just a year of the exam:

data_score['PRIKDATUM_SCORES'] = data_score['PRIKDATUM_SCORES'].map({
    20150918 : 2015,
    20160722 : 2016,
    20170722 : 2017,
    20180725 : 2018,
    20190725 : 2019,
})
data_score.rename(columns = {'PRIKDATUM_SCORES': 'DATUM'}, inplace = True)
print(data_score['DATUM'].value_counts())
2015    6920
2016    6851
2017    6751
2018    6658
2019    6579
Name: DATUM, dtype: int64

I know that most common final test in the primary schools in The Netherlands is CET and suppose it shoul be most common schools choise in terms of types of final exam. I want to explore what information we have about exams students are taking, but first let’s explore do we have the rows without any info about exams:

schools_without_exams = data_score[(data_score['CET_AANTAL'] == 0) & 
                               (data_score['DREMPEL_AANTAL'] == 0) & (data_score['AMN_AANTAL'] == 0) 
                                & (data_score['CESAN_AANTAL'] == 0) & (data_score['DIA_AANTAL'] == 0) 
                                & (data_score['ROUTE8_AANTAL'] == 0) & (data_score['IEP_AANTAL'] == 0)]
schools_at_least_1_exam = data_score[(data_score['CET_AANTAL'] > 0) | 
                               (data_score['DREMPEL_AANTAL'] > 0) | (data_score['AMN_AANTAL'] > 0) 
                                | (data_score['CESAN_AANTAL'] > 0) | (data_score['DIA_AANTAL'] > 0) 
                                | (data_score['ROUTE8_AANTAL'] > 0) | (data_score['IEP_AANTAL'] > 0)]
print(data_score.shape)
print(schools_at_least_1_exam.shape)
print(schools_without_exams.shape)
(33759, 29)
(32253, 29)
(1506, 29)

And remove this rows (without any exams):

data_score.drop(schools_without_exams.index, 0,  inplace = True)
print(data_score.shape)
(32253, 29)

Next let’s take a look do we have wrong exams data (for example, we have amount of pupils who took the exam, but score is 0 or vice versa). And if we have it, we need to remove it.

def remove_wrong_data(subset):
    if subset.shape[0] > 0:
        data_score.drop(subset.index, 0,  inplace = True)

wrong_cet = data_score[((data_score['CET_AANTAL'] == 0) & (data_score['CET_GEM'] > 0)) |
                      ((data_score['CET_AANTAL'] > 0) & (data_score['CET_GEM'] == 0))]
print(wrong_cet.shape)
remove_wrong_data(wrong_cet)

wrong_drempel = data_score[((data_score['DREMPEL_AANTAL'] == 0) & (data_score['drempel_gem'] > 0)) |
                      ((data_score['DREMPEL_AANTAL'] > 0) & (data_score['drempel_gem'] == 0))]
print(wrong_drempel.shape)
remove_wrong_data(wrong_drempel)

wrong_AMN = data_score[((data_score['AMN_AANTAL'] == 0) & (data_score['AMN_GEM'] > 0)) |
                      ((data_score['AMN_AANTAL'] > 0) & (data_score['AMN_GEM'] == 0))]
print(wrong_AMN.shape)
remove_wrong_data(wrong_AMN)

wrong_CESAN = data_score[((data_score['CESAN_AANTAL'] == 0) & (data_score['CESAN_GEM'] > 0)) |
                      ((data_score['CESAN_AANTAL'] > 0) & (data_score['CESAN_GEM'] == 0))]
print(wrong_CESAN.shape)
remove_wrong_data(wrong_CESAN)

wrong_DIA = data_score[((data_score['DIA_AANTAL'] == 0) & (data_score['DIA_GEM'] > 0)) |
                      ((data_score['DIA_AANTAL'] > 0) & (data_score['DIA_GEM'] == 0))]
print(wrong_DIA.shape)
remove_wrong_data(wrong_DIA)

wrong_ROUTE8 = data_score[((data_score['ROUTE8_AANTAL'] == 0) & (data_score['ROUTE8_GEM'] > 0)) |
                      ((data_score['ROUTE8_AANTAL'] > 0) & (data_score['ROUTE8_GEM'] == 0))]
print(wrong_ROUTE8.shape)
remove_wrong_data(wrong_ROUTE8)

wrong_IEP = data_score[((data_score['IEP_AANTAL'] == 0) & (data_score['IEP_GEM'] > 0)) |
                      ((data_score['IEP_AANTAL'] > 0) & (data_score['IEP_GEM'] == 0))]
print(wrong_IEP.shape)
remove_wrong_data(wrong_IEP)


print(data_score.shape)
(537, 29)
(68, 29)
(10, 29)
(0, 29)
(12, 29)
(115, 29)
(208, 29)
(31303, 29)

Now I want to see how many schools choose different type of exams and is there a lot of schools which choose more then two kind of test:

schools_more_then_2_exam = data_score.copy()

def count_exams(row):
    count = 1 if row.CET_AANTAL > 0 else 0
    count = count + 1 if row.DREMPEL_AANTAL > 0 else count
    count = count + 1 if row.AMN_AANTAL > 0 else count
    count = count + 1 if row.CESAN_AANTAL > 0 else count
    count = count + 1 if row.DIA_AANTAL > 0 else count
    count = count + 1 if row.ROUTE8_AANTAL > 0 else count
    count = count + 1 if row.IEP_AANTAL > 0 else count
    return count

    
schools_more_then_2_exam['EXAMEN_AANTAL'] = schools_more_then_2_exam.apply(lambda row: count_exams(row), axis = 1) 

schools_more_then_2_exam = schools_more_then_2_exam[schools_more_then_2_exam['EXAMEN_AANTAL'] > 2]
print(schools_more_then_2_exam.shape)
schools_more_then_2_exam.head()
(3, 30)
PEILDATUM_LEERLINGEN DATUM BRIN_NUMMER VESTIGINGSNUMMER INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENUMMER GEMEENTENAAM PROVINCIE ... ROUTE8_GEM DREMPEL_AANTAL drempel_gem DIA_AANTAL DIA_GEM CESAN_AANTAL CESAN_GEM AMN_AANTAL AMN_GEM EXAMEN_AANTAL
18551 20170415 2017 15YU 0 OBS Holy 3137WD VLAARDINGEN 622 Vlaardingen Zuid-Holland ... 184.600000 0 0.0 0 0.0 0 0.0 0 0.0 3
25227 20180417 2018 15YU 0 OBS Holy 3137WD VLAARDINGEN 622 Vlaardingen Zuid-Holland ... 208.166667 0 0.0 0 0.0 0 0.0 0 0.0 3
31810 20190417 2019 15YU 0 OBS Holy 3137WD VLAARDINGEN 622 Vlaardingen Zuid-Holland ... 191.200000 0 0.0 0 0.0 0 0.0 0 0.0 3

3 rows × 30 columns

There are few rows with more then 2 type of test chosen. Let’s just remove it, then it will be possible to make a column of chosen exams in each school and remove all unnecessary columns:

#remove rows with more then 2 choosen type of tests
print(data_score.shape)
data_score.drop(schools_more_then_2_exam.index, 0,  inplace = True)
print(data_score.shape)
(31303, 29)
(31300, 29)
# make columns for the test with biggest amount of pupils who took it
def most_common_exam(row):
    exam_type, amount, score = '', 0, 0
    if row.CET_AANTAL > 0:
        exam_type,amount,score ='CET', row.CET_AANTAL, row.CET_GEM
    if row.DREMPEL_AANTAL > amount:
        exam_type,amount,score = 'DREMPEL', row.DREMPEL_AANTAL, row.drempel_gem
    if row.AMN_AANTAL > amount:
        exam_type,amount,score = 'AMN', row.AMN_AANTAL, row.AMN_GEM
    if row.CESAN_AANTAL > amount:
        exam_type,amount,score = 'CESAN', row.CESAN_AANTAL, row.CESAN_GEM
    if row.DIA_AANTAL > amount:
        exam_type,amount,score = 'DIA', row.DIA_AANTAL, row.DIA_GEM
    if row.ROUTE8_AANTAL > amount:
        exam_type,amount,score = 'ROUTE8', row.ROUTE8_AANTAL, row.ROUTE8_GEM
    if row.IEP_AANTAL > amount:
        exam_type,amount,score = 'IEP', row.IEP_AANTAL, row.IEP_GEM
    return exam_type,amount,score

data_score['EXAMEN_1'], data_score['EXAMEN_1_AANTAL'], data_score['EXAMEN_1_GEM'] = zip(*data_score.apply(lambda row: most_common_exam(row), axis = 1))
data_score.head()
PEILDATUM_LEERLINGEN DATUM BRIN_NUMMER VESTIGINGSNUMMER INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENUMMER GEMEENTENAAM PROVINCIE ... drempel_gem DIA_AANTAL DIA_GEM CESAN_AANTAL CESAN_GEM AMN_AANTAL AMN_GEM EXAMEN_1 EXAMEN_1_AANTAL EXAMEN_1_GEM
0 20141001 2015 00AP 0 De Schanskorf 2715BT ZOETERMEER 637 Zoetermeer Zuid-Holland ... 0.0 0 0.0 0 0.0 0 0.0 CET 11 534.3
1 20141001 2015 00AR 0 BS "De Maasparel" 6109AM OHE EN LAAK 1641 Maasgouw Limburg ... 0.0 0 0.0 0 0.0 0 0.0 CET 18 539.8
3 20141001 2015 00AZ 0 De Kiezel en de Kei 2971AR BLESKENSGRAAF CA 1927 Molenwaard Zuid-Holland ... 0.0 0 0.0 0 0.0 0 0.0 CET 22 532.9
4 20141001 2015 00BA 0 OBS De Klimboom 6666EB HETEREN 1734 Overbetuwe Gelderland ... 0.0 0 0.0 0 0.0 0 0.0 CET 11 536.2
5 20141001 2015 00BB 0 Obs Letterwies 9944AR NIEUWOLDA 1895 Oldambt Groningen ... 0.0 0 0.0 0 0.0 0 0.0 CET 16 531.3

5 rows × 32 columns

# make columns for the second chosen test
def get_second_exam(row):
    if (row.CET_AANTAL > 0) & (row.EXAMEN_1 != 'CET'):
        return 'CET', row.CET_AANTAL, row.CET_GEM
    if (row.DREMPEL_AANTAL > 0) & (row.EXAMEN_1 != 'DREMPEL'):
        return 'DREMPEL', row.DREMPEL_AANTAL, row.drempel_gem
    if (row.AMN_AANTAL > 0) & (row.EXAMEN_1 != 'AMN'):
        return 'AMN', row.AMN_AANTAL, row.AMN_GEM
    if (row.CESAN_AANTAL > 0) & (row.EXAMEN_1 != 'CESAN'):
        return 'CESAN', row.CESAN_AANTAL, row.CESAN_GEM
    if (row.DIA_AANTAL > 0) & (row.EXAMEN_1 != 'DIA'):
        return 'DIA', row.DIA_AANTAL, row.DIA_GEM
    if (row.ROUTE8_AANTAL > 0) & (row.EXAMEN_1 != 'ROUTE8'):
        return 'ROUTE8', row.ROUTE8_AANTAL, row.ROUTE8_GEM
    if (row.IEP_AANTAL > 0) & (row.EXAMEN_1 != 'IEP'):
        return 'IEP', row.IEP_AANTAL, row.IEP_GEM
    return '', 0, 0

data_score['EXAMEN_2'], data_score['EXAMEN_2_AANTAL'], data_score['EXAMEN_2_GEM'] = zip(*data_score.apply(lambda row: get_second_exam(row), axis = 1))
data_score['EXAMEN_2'].value_counts()
           31151
CET           55
IEP           55
ROUTE8        18
DREMPEL       11
DIA            6
AMN            4
Name: EXAMEN_2, dtype: int64

We can see that most of the schools in dataset (31151) has one chosen type of exam, so we can remove new columns EXAMEN_2_… because it has relatively small information.

Next we also need to make a new field which values would be unique for each school (and which we would use later to merge different dataframes). I think if we pool VESTIGINGSNUMMER, BRIN_NUMMER and DATUM columns, we would have unique value for each row:

new_column = data_score["VESTIGINGSNUMMER"].astype(str) + data_score["BRIN_NUMMER"] + "_" + data_score["DATUM"].astype(str) 
data_score.insert (0, "SCHOOL_ID", new_column)
print(data_score.shape)
print(data_score.columns)
(31300, 36)
Index(['SCHOOL_ID', 'PEILDATUM_LEERLINGEN', 'DATUM', 'BRIN_NUMMER',
       'VESTIGINGSNUMMER', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
       'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
       'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'LEERJAAR_8',
       'ONTHEFFING_REDEN_ND', 'CET_AANTAL', 'CET_GEM', 'IEP_AANTAL', 'IEP_GEM',
       'ROUTE8_AANTAL', 'ROUTE8_GEM', 'DREMPEL_AANTAL', 'drempel_gem',
       'DIA_AANTAL', 'DIA_GEM', 'CESAN_AANTAL', 'CESAN_GEM', 'AMN_AANTAL',
       'AMN_GEM', 'EXAMEN_1', 'EXAMEN_1_AANTAL', 'EXAMEN_1_GEM', 'EXAMEN_2',
       'EXAMEN_2_AANTAL', 'EXAMEN_2_GEM'],
      dtype='object')

And last (but not least) what we should do - remove all unnecessary columns:

data_score.drop('PEILDATUM_LEERLINGEN', 1, inplace = True)
data_score.drop('BRIN_NUMMER', 1, inplace = True)
data_score.drop('VESTIGINGSNUMMER', 1, inplace = True)
data_score.drop('LEERJAAR_8', 1, inplace = True)
data_score.drop('ONTHEFFING_REDEN_ND', 1, inplace = True)
data_score.drop('CET_AANTAL', 1, inplace = True)
data_score.drop('CET_GEM', 1, inplace = True)
data_score.drop('IEP_AANTAL', 1, inplace = True)
data_score.drop('IEP_GEM', 1, inplace = True)
data_score.drop('ROUTE8_AANTAL', 1, inplace = True)
data_score.drop('ROUTE8_GEM', 1, inplace = True)
data_score.drop('DREMPEL_AANTAL', 1, inplace = True)
data_score.drop('drempel_gem', 1, inplace = True)
data_score.drop('DIA_AANTAL', 1, inplace = True)
data_score.drop('DIA_GEM', 1, inplace = True)
data_score.drop('CESAN_AANTAL', 1, inplace = True)
data_score.drop('CESAN_GEM', 1, inplace = True)
data_score.drop('AMN_AANTAL', 1, inplace = True)
data_score.drop('AMN_GEM', 1, inplace = True)
data_score.drop('EXAMEN_2', 1, inplace = True)
data_score.drop('EXAMEN_2_AANTAL', 1, inplace = True)
data_score.drop('EXAMEN_2_GEM', 1, inplace = True)
data_score.rename(columns = {'EXAMEN_1': 'EXAMEN', 'EXAMEN_1_AANTAL': 'EXAMEN_AANTAL', 'EXAMEN_1_GEM':'EXAMEN_GEM', 'Leerjaar_8':'LEERJAAR_8'}, inplace = True)
print(data_score.shape)
data_score.head()
(31300, 14)
SCHOOL_ID DATUM INSTELLINGSNAAM_VESTIGING POSTCODE_VESTIGING PLAATSNAAM GEMEENTENUMMER GEMEENTENAAM PROVINCIE SOORT_PO DENOMINATIE_VESTIGING BEVOEGD_GEZAG_NUMMER EXAMEN EXAMEN_AANTAL EXAMEN_GEM
0 000AP_2015 2015 De Schanskorf 2715BT ZOETERMEER 637 Zoetermeer Zuid-Holland Bo Gereformeerd vrijgemaakt 41646 CET 11 534.3
1 000AR_2015 2015 BS "De Maasparel" 6109AM OHE EN LAAK 1641 Maasgouw Limburg Bo Rooms-Katholiek 41797 CET 18 539.8
3 000AZ_2015 2015 De Kiezel en de Kei 2971AR BLESKENSGRAAF CA 1927 Molenwaard Zuid-Holland Bo Openbaar 41638 CET 22 532.9
4 000BA_2015 2015 OBS De Klimboom 6666EB HETEREN 1734 Overbetuwe Gelderland Bo Openbaar 41530 CET 11 536.2
5 000BB_2015 2015 Obs Letterwies 9944AR NIEUWOLDA 1895 Oldambt Groningen Bo Openbaar 41613 CET 16 531.3

First part of our dataset constraction has done, so le’t write results down to temporary .csv file:

data_score.to_csv(r'score_first_result.csv', index = False)