This article is the second part of dataset construction. I’m going to add here some new data to the dataset. It will be info about amount of students on different levels of knowledge in language and arithmetics and school advices for secondary education.
All the data for this part of dataset construction I got here:
- Student’s reference level of language and arithmetic: https://duo.nl/open_onderwijsdata/databestanden/po/leerlingen-po/bo-sbo/refniveau.jsp
- School advices: https://duo.nl/open_onderwijsdata/databestanden/po/leerlingen-po/bo-sbo/schooladviezen.jsp
2. Students level in language and arithmetic
First we need to get data from temporary .cvs file which I created on previous step:
import pandas as pd
data_score = pd.read_csv("score_first_result.csv", error_bad_lines=False, sep=',', encoding = "ISO-8859-1")
print(data_score.shape)
print(data_score.columns)
data_score.head()
(31300, 14)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'EXAMEN',
'EXAMEN_AANTAL', 'EXAMEN_GEM'],
dtype='object')
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | BEVOEGD_GEZAG_NUMMER | EXAMEN | EXAMEN_AANTAL | EXAMEN_GEM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 2015 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | 41646 | CET | 11 | 534.3 |
1 | 000AR_2015 | 2015 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | 41797 | CET | 18 | 539.8 |
2 | 000AZ_2015 | 2015 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | 41638 | CET | 22 | 532.9 |
3 | 000BA_2015 | 2015 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | 41530 | CET | 11 | 536.2 |
4 | 000BB_2015 | 2015 | Obs Letterwies | 9944AR | NIEUWOLDA | 1895 | Oldambt | Groningen | Bo | Openbaar | 41613 | CET | 16 | 531.3 |
Now I will join together a few .cvs files which we will add to dataset after some data transformation and cleaning:
data_subject_score_2015 = pd.read_csv("input_data/Arithmetic_language_score_2015_2016.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_subject_score_2016 = pd.read_csv("input_data/Arithmetic_language_score_2016_2017.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_subject_score_2017 = pd.read_csv("input_data/Arithmetic_language_score_2017_2018.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_subject_score_2018 = pd.read_csv("input_data/Arithmetic_language_score_2018_2019.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
print(data_subject_score_2018.shape)
data_subject_score_2018.head()
(6304, 22)
PEILDATUM | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | ... | REKENEN_LAGER1F | REKENEN_1F | REKENEN_1S | REKENEN_2F | LV_LAGER1F | LV_1F | LV_2F | TV_LAGER1F | TV_1F | TV_2F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20190417 | 00AP | 0 | Wereldwijs | 2716PH | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | ... | 1 | 4 | 6 | 0 | 0 | 2 | 9 | 0 | 2 | 9 |
1 | 20190417 | 00AR | 0 | BS "De Maasparel" | 6107AW | STEVENSWEERT | 1641 | Maasgouw | Limburg | Bo | ... | 2 | 4 | 11 | 0 | 0 | 1 | 16 | 0 | 5 | 12 |
2 | 20190417 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 1930 | Nissewaard | Zuid-Holland | Bo | ... | 1 | 7 | 4 | 0 | 0 | 2 | 10 | 2 | 4 | 6 |
3 | 20190417 | 00AZ | 0 | De Stapsteen | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | ... | 2 | 3 | 9 | 0 | 1 | 2 | 11 | 0 | 3 | 11 |
4 | 20190417 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | ... | 2 | 5 | 9 | 0 | 0 | 2 | 14 | 2 | 4 | 10 |
5 rows × 22 columns
Need to merge everything in one dataframe:
data_frames = [data_subject_score_2015, data_subject_score_2016, data_subject_score_2017, data_subject_score_2018]
data_subject_score = pd.concat(data_frames).fillna(' ')
data_subject_score.reset_index(drop=True, inplace=True)
print(data_subject_score_2015.shape)
print(data_subject_score_2016.shape)
print(data_subject_score_2017.shape)
print(data_subject_score_2018.shape)
print(data_subject_score.shape)
data_subject_score.head()
(6558, 22)
(6463, 22)
(6381, 22)
(6304, 22)
(25706, 22)
PEILDATUM | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | ... | REKENEN_LAGER1F | REKENEN_1F | REKENEN_1S | REKENEN_2F | LV_LAGER1F | LV_1F | LV_2F | TV_LAGER1F | TV_1F | TV_2F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160419 | 00AP | 0 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | ... | 2 | 4 | 6 | 0 | 0 | 2 | 10 | 0 | 4 | 8 |
1 | 20160419 | 00AR | 0 | BS "De Maasparel" | 6107AW | STEVENSWEERT | 1641 | Maasgouw | Limburg | Bo | ... | 0 | 7 | 10 | 0 | 0 | 1 | 16 | 1 | 3 | 13 |
2 | 20160419 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 1930 | Nissewaard | Zuid-Holland | Bo | ... | 3 | 3 | 1 | 0 | 0 | 3 | 4 | 2 | 3 | 2 |
3 | 20160419 | 00AZ | 0 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | ... | 2 | 8 | 1 | 8 | 0 | 5 | 14 | 0 | 4 | 15 |
4 | 20160419 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | ... | 1 | 10 | 2 | 11 | 0 | 8 | 16 | 2 | 4 | 18 |
5 rows × 22 columns
And take a look at columns types to understand do we (probably) have some empty/missing/unappropriate values in numeric columns:
data_subject_score.dtypes
PEILDATUM int64
BRIN_NUMMER object
VESTIGINGSNUMMER int64
INSTELLINGSNAAM_VESTIGING object
POSTCODE_VESTIGING object
PLAATSNAAM object
GEMEENTENUMMER int64
GEMEENTENAAM object
PROVINCIE object
SOORT_PO object
DENOMINATIE_VESTIGING object
BEVOEGD_GEZAG_NUMMER int64
REKENEN_LAGER1F int64
REKENEN_1F int64
REKENEN_1S int64
REKENEN_2F int64
LV_LAGER1F int64
LV_1F int64
LV_2F int64
TV_LAGER1F int64
TV_1F int64
TV_2F int64
dtype: object
You can see that dataset has all the “levels” columns as int type, I think it means we don’t have unexpected values in this columns (like string or empty column), it couldn’t be parsed to int otherwise. But still let’s check null and NaN values in dataset:
print(data_subject_score.isnull().sum())
print(data_subject_score.isna().sum())
PEILDATUM 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
REKENEN_LAGER1F 0
REKENEN_1F 0
REKENEN_1S 0
REKENEN_2F 0
LV_LAGER1F 0
LV_1F 0
LV_2F 0
TV_LAGER1F 0
TV_1F 0
TV_2F 0
dtype: int64
PEILDATUM 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
REKENEN_LAGER1F 0
REKENEN_1F 0
REKENEN_1S 0
REKENEN_2F 0
LV_LAGER1F 0
LV_1F 0
LV_2F 0
TV_LAGER1F 0
TV_1F 0
TV_2F 0
dtype: int64
Let’s make unique columns SCHOOL_ID:
data_subject_score['PEILDATUM'].value_counts()
20160419 6558
20170415 6463
20180417 6381
20190417 6304
Name: PEILDATUM, dtype: int64
data_subject_score['PEILDATUM'] = data_subject_score['PEILDATUM'].map({
20160419 : 2016,
20170415 : 2017,
20180417 : 2018,
20190417 : 2019,
})
data_subject_score.rename(columns = {'PEILDATUM': 'DATUM'}, inplace = True)
data_subject_score['DATUM'].value_counts()
2016 6558
2017 6463
2018 6381
2019 6304
Name: DATUM, dtype: int64
new_column = data_subject_score["VESTIGINGSNUMMER"].astype(str) + data_subject_score["BRIN_NUMMER"] + "_" + data_subject_score["DATUM"].astype(str)
data_subject_score.insert (0, "SCHOOL_ID", new_column)
data_subject_score.drop('VESTIGINGSNUMMER', 1, inplace = True)
data_subject_score.drop('BRIN_NUMMER', 1, inplace = True)
print(data_subject_score.shape)
print(data_subject_score.columns)
data_subject_score.head()
(25706, 21)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'REKENEN_LAGER1F',
'REKENEN_1F', 'REKENEN_1S', 'REKENEN_2F', 'LV_LAGER1F', 'LV_1F',
'LV_2F', 'TV_LAGER1F', 'TV_1F', 'TV_2F'],
dtype='object')
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | ... | REKENEN_LAGER1F | REKENEN_1F | REKENEN_1S | REKENEN_2F | LV_LAGER1F | LV_1F | LV_2F | TV_LAGER1F | TV_1F | TV_2F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2016 | 2016 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 2 | 4 | 6 | 0 | 0 | 2 | 10 | 0 | 4 | 8 |
1 | 000AR_2016 | 2016 | BS "De Maasparel" | 6107AW | STEVENSWEERT | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | ... | 0 | 7 | 10 | 0 | 0 | 1 | 16 | 1 | 3 | 13 |
2 | 000AV_2016 | 2016 | De Morgenster | 3201CN | SPIJKENISSE | 1930 | Nissewaard | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 3 | 3 | 1 | 0 | 0 | 3 | 4 | 2 | 3 | 2 |
3 | 000AZ_2016 | 2016 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | ... | 2 | 8 | 1 | 8 | 0 | 5 | 14 | 0 | 4 | 15 |
4 | 000BA_2016 | 2016 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | ... | 1 | 10 | 2 | 11 | 0 | 8 | 16 | 2 | 4 | 18 |
5 rows × 21 columns
And remove all unnesessary columns which we had already have in data_score:
columns = []
for column in data_subject_score.columns:
if((column in data_score.columns) & (column != 'SCHOOL_ID')):
columns.append(column)
print(columns)
['DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING', 'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO', 'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER']
print(data_subject_score.shape)
data_subject_score.drop(columns, 1, inplace = True)
print(data_subject_score.shape)
print(data_subject_score.columns)
(25706, 21)
(25706, 11)
Index(['SCHOOL_ID', 'REKENEN_LAGER1F', 'REKENEN_1F', 'REKENEN_1S',
'REKENEN_2F', 'LV_LAGER1F', 'LV_1F', 'LV_2F', 'TV_LAGER1F', 'TV_1F',
'TV_2F'],
dtype='object')
And finally merge data_score and data_subject_score:
data = pd.merge(data_score, data_subject_score, on='SCHOOL_ID', how='left').fillna(0)
print(data.shape)
print(data.columns)
data.head()
(31300, 24)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'EXAMEN',
'EXAMEN_AANTAL', 'EXAMEN_GEM', 'REKENEN_LAGER1F', 'REKENEN_1F',
'REKENEN_1S', 'REKENEN_2F', 'LV_LAGER1F', 'LV_1F', 'LV_2F',
'TV_LAGER1F', 'TV_1F', 'TV_2F'],
dtype='object')
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | ... | REKENEN_LAGER1F | REKENEN_1F | REKENEN_1S | REKENEN_2F | LV_LAGER1F | LV_1F | LV_2F | TV_LAGER1F | TV_1F | TV_2F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 2015 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 000AR_2015 | 2015 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 000AZ_2015 | 2015 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 000BA_2015 | 2015 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 000BB_2015 | 2015 | Obs Letterwies | 9944AR | NIEUWOLDA | 1895 | Oldambt | Groningen | Bo | Openbaar | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 24 columns
School advices about secondary education
First let’s read all the school advices datasets, merge their together and check missing values:
data_school_advice_2014 = pd.read_csv("input_data/School_advice_2014_2015.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_school_advice_2015 = pd.read_csv("input_data/School_advice_2015_2016.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_school_advice_2016 = pd.read_csv("input_data/School_advice_2016_2017.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_school_advice_2017 = pd.read_csv("input_data/School_advice_2017_2018.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
data_school_advice_2018 = pd.read_csv("input_data/School_advice_2018_2019.csv", error_bad_lines=False, sep=';', encoding = "ISO-8859-1")
print(data_school_advice_2018.shape)
data_school_advice_2018.head()
(6585, 26)
PEILDATUM_LEERLINGEN | PRIKDATUM_ADVIEZEN | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | VMBO_B_K | VMBO_K | VMBO_K_GT | VMBO_GT | VMBO_GT_HAVO | HAVO | HAVO_VWO | VWO | ADVIES_NIET_MOGELIJK | TOTAAL_ADVIES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20190417 | 20190725 | 00AP | 0 | Wereldwijs | 2716PH | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | ... | 1 | 1 | 0 | 2 | 3 | 0 | 2 | 1 | 1 | 11 |
1 | 20190417 | 20190725 | 00AR | 0 | BS "De Maasparel" | 6107AW | STEVENSWEERT | 1641 | Maasgouw | Limburg | ... | 0 | 3 | 0 | 1 | 0 | 4 | 0 | 8 | 0 | 17 |
2 | 20190417 | 20190725 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 1930 | Nissewaard | Zuid-Holland | ... | 2 | 0 | 2 | 3 | 1 | 1 | 2 | 1 | 0 | 12 |
3 | 20190417 | 20190725 | 00AZ | 0 | De Stapsteen | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | ... | 1 | 1 | 0 | 3 | 3 | 0 | 3 | 2 | 0 | 14 |
4 | 20190417 | 20190725 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | ... | 0 | 1 | 1 | 4 | 2 | 1 | 0 | 7 | 0 | 16 |
5 rows × 26 columns
data_frames = [data_school_advice_2014, data_school_advice_2015, data_school_advice_2016, data_school_advice_2017,data_school_advice_2018]
data_school_advice = pd.concat(data_frames).fillna(' ')
data_school_advice.reset_index(drop=True, inplace=True)
print(data_school_advice_2014.shape)
print(data_school_advice_2015.shape)
print(data_school_advice_2016.shape)
print(data_school_advice_2017.shape)
print(data_school_advice_2018.shape)
print(data_school_advice.shape)
data_school_advice.head()
(6920, 26)
(6853, 26)
(6754, 26)
(6662, 26)
(6585, 26)
(33774, 26)
PEILDATUM_LEERLINGEN | PRIKDATUM_ADVIEZEN | BRIN_NUMMER | VESTIGINGSNUMMER | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | ... | VMBO_B_K | VMBO_K | VMBO_K_GT | VMBO_GT | VMBO_GT_HAVO | HAVO | HAVO_VWO | VWO | ADVIES_NIET_MOGELIJK | TOTAAL_ADVIES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20141001 | 20150722 | 00AP | 0 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | ... | 0 | 1 | 1 | 2 | 0 | 5 | 1 | 1 | 0 | 11 |
1 | 20141001 | 20150722 | 00AR | 0 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | ... | 0 | 0 | 0 | 4 | 0 | 7 | 0 | 6 | 0 | 18 |
2 | 20141001 | 20150722 | 00AV | 0 | De Morgenster | 3201CN | SPIJKENISSE | 612 | Spijkenisse | Zuid-Holland | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
3 | 20141001 | 20150722 | 00AZ | 0 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | ... | 3 | 0 | 0 | 4 | 5 | 3 | 2 | 4 | 0 | 22 |
4 | 20141001 | 20150722 | 00BA | 0 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | ... | 0 | 1 | 0 | 4 | 0 | 1 | 4 | 0 | 0 | 12 |
5 rows × 26 columns
print(data_school_advice.dtypes)
print(data_school_advice.isnull().sum())
print(data_school_advice.isna().sum())
PEILDATUM_LEERLINGEN int64
PRIKDATUM_ADVIEZEN int64
BRIN_NUMMER object
VESTIGINGSNUMMER int64
INSTELLINGSNAAM_VESTIGING object
POSTCODE_VESTIGING object
PLAATSNAAM object
GEMEENTENUMMER int64
GEMEENTENAAM object
PROVINCIE object
SOORT_PO object
DENOMINATIE_VESTIGING object
BEVOEGD_GEZAG_NUMMER int64
VSO int64
PRO int64
VMBO_B int64
VMBO_B_K int64
VMBO_K int64
VMBO_K_GT int64
VMBO_GT int64
VMBO_GT_HAVO int64
HAVO int64
HAVO_VWO int64
VWO int64
ADVIES_NIET_MOGELIJK int64
TOTAAL_ADVIES int64
dtype: object
PEILDATUM_LEERLINGEN 0
PRIKDATUM_ADVIEZEN 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
VSO 0
PRO 0
VMBO_B 0
VMBO_B_K 0
VMBO_K 0
VMBO_K_GT 0
VMBO_GT 0
VMBO_GT_HAVO 0
HAVO 0
HAVO_VWO 0
VWO 0
ADVIES_NIET_MOGELIJK 0
TOTAAL_ADVIES 0
dtype: int64
PEILDATUM_LEERLINGEN 0
PRIKDATUM_ADVIEZEN 0
BRIN_NUMMER 0
VESTIGINGSNUMMER 0
INSTELLINGSNAAM_VESTIGING 0
POSTCODE_VESTIGING 0
PLAATSNAAM 0
GEMEENTENUMMER 0
GEMEENTENAAM 0
PROVINCIE 0
SOORT_PO 0
DENOMINATIE_VESTIGING 0
BEVOEGD_GEZAG_NUMMER 0
VSO 0
PRO 0
VMBO_B 0
VMBO_B_K 0
VMBO_K 0
VMBO_K_GT 0
VMBO_GT 0
VMBO_GT_HAVO 0
HAVO 0
HAVO_VWO 0
VWO 0
ADVIES_NIET_MOGELIJK 0
TOTAAL_ADVIES 0
dtype: int64
Everything looks fine. Next we need to make unique columns SCHOOL_ID:
data_school_advice['PRIKDATUM_ADVIEZEN'].value_counts()
20150722 6920
20160722 6853
20170722 6754
20180725 6662
20190725 6585
Name: PRIKDATUM_ADVIEZEN, dtype: int64
data_school_advice['PRIKDATUM_ADVIEZEN'] = data_school_advice['PRIKDATUM_ADVIEZEN'].map({
20150722 : 2015,
20160722 : 2016,
20170722 : 2017,
20180725 : 2018,
20190725 : 2019,
})
data_school_advice.rename(columns = {'PRIKDATUM_ADVIEZEN': 'DATUM'}, inplace = True)
data_school_advice['DATUM'].value_counts()
2015 6920
2016 6853
2017 6754
2018 6662
2019 6585
Name: DATUM, dtype: int64
new_column = data_school_advice["VESTIGINGSNUMMER"].astype(str) + data_school_advice["BRIN_NUMMER"] + "_" + data_school_advice["DATUM"].astype(str)
data_school_advice.insert (0, "SCHOOL_ID", new_column)
data_school_advice.drop('VESTIGINGSNUMMER', 1, inplace = True)
data_school_advice.drop('BRIN_NUMMER', 1, inplace = True)
data_school_advice.drop('PEILDATUM_LEERLINGEN', 1, inplace = True)
print(data_school_advice.shape)
print(data_school_advice.columns)
data_school_advice.head()
(33774, 24)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'VSO', 'PRO', 'VMBO_B',
'VMBO_B_K', 'VMBO_K', 'VMBO_K_GT', 'VMBO_GT', 'VMBO_GT_HAVO', 'HAVO',
'HAVO_VWO', 'VWO', 'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES'],
dtype='object')
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | ... | VMBO_B_K | VMBO_K | VMBO_K_GT | VMBO_GT | VMBO_GT_HAVO | HAVO | HAVO_VWO | VWO | ADVIES_NIET_MOGELIJK | TOTAAL_ADVIES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 2015 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 0 | 1 | 1 | 2 | 0 | 5 | 1 | 1 | 0 | 11 |
1 | 000AR_2015 | 2015 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | ... | 0 | 0 | 0 | 4 | 0 | 7 | 0 | 6 | 0 | 18 |
2 | 000AV_2015 | 2015 | De Morgenster | 3201CN | SPIJKENISSE | 612 | Spijkenisse | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
3 | 000AZ_2015 | 2015 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | ... | 3 | 0 | 0 | 4 | 5 | 3 | 2 | 4 | 0 | 22 |
4 | 000BA_2015 | 2015 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | ... | 0 | 1 | 0 | 4 | 0 | 1 | 4 | 0 | 0 | 12 |
5 rows × 24 columns
Deleting common (with data_score) columns:
columns = []
for column in data_school_advice.columns:
if((column in data_score.columns) & (column != 'SCHOOL_ID')):
columns.append(column)
print(columns)
['DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING', 'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO', 'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER']
print(data_school_advice.shape)
data_school_advice.drop(columns, 1, inplace = True)
print(data_school_advice.shape)
print(data_school_advice.columns)
(33774, 24)
(33774, 14)
Index(['SCHOOL_ID', 'VSO', 'PRO', 'VMBO_B', 'VMBO_B_K', 'VMBO_K', 'VMBO_K_GT',
'VMBO_GT', 'VMBO_GT_HAVO', 'HAVO', 'HAVO_VWO', 'VWO',
'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES'],
dtype='object')
We have a lot of different type of VMBO advices, they are not that important to us, we would like just to know ‘big’ level of education student was adviced for. Let’s pool all the columns of different VMBO-types into one:
VMBO_column = data_school_advice["VMBO_B"] + data_school_advice["VMBO_B_K"] + data_school_advice["VMBO_K"]
+ data_school_advice["VMBO_K_GT"] + data_school_advice["VMBO_GT"]
data_school_advice.insert (3, "VMBO", VMBO_column)
data_school_advice.drop('VMBO_B', 1, inplace = True)
data_school_advice.drop('VMBO_B_K', 1, inplace = True)
data_school_advice.drop('VMBO_K', 1, inplace = True)
data_school_advice.drop('VMBO_K_GT', 1, inplace = True)
data_school_advice.drop('VMBO_GT', 1, inplace = True)
data_school_advice.rename(columns = {'VMBO_GT_HAVO': 'VMBO_HAVO'}, inplace = True)
print(data_school_advice.shape)
print(data_school_advice.columns)
data_school_advice.head()
(33774, 10)
Index(['SCHOOL_ID', 'VSO', 'PRO', 'VMBO', 'VMBO_HAVO', 'HAVO', 'HAVO_VWO',
'VWO', 'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES'],
dtype='object')
SCHOOL_ID | VSO | PRO | VMBO | VMBO_HAVO | HAVO | HAVO_VWO | VWO | ADVIES_NIET_MOGELIJK | TOTAAL_ADVIES | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 0 | 0 | 1 | 0 | 5 | 1 | 1 | 0 | 11 |
1 | 000AR_2015 | 0 | 0 | 1 | 0 | 7 | 0 | 6 | 0 | 18 |
2 | 000AV_2015 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
3 | 000AZ_2015 | 0 | 1 | 3 | 5 | 3 | 2 | 4 | 0 | 22 |
4 | 000BA_2015 | 1 | 1 | 1 | 0 | 1 | 4 | 0 | 0 | 12 |
Merge data and data_school_advice:
data = pd.merge(data, data_school_advice, on='SCHOOL_ID', how='left').fillna(0)
print(data.shape)
print(data.columns)
data.head()
(31300, 33)
Index(['SCHOOL_ID', 'DATUM', 'INSTELLINGSNAAM_VESTIGING', 'POSTCODE_VESTIGING',
'PLAATSNAAM', 'GEMEENTENUMMER', 'GEMEENTENAAM', 'PROVINCIE', 'SOORT_PO',
'DENOMINATIE_VESTIGING', 'BEVOEGD_GEZAG_NUMMER', 'EXAMEN',
'EXAMEN_AANTAL', 'EXAMEN_GEM', 'REKENEN_LAGER1F', 'REKENEN_1F',
'REKENEN_1S', 'REKENEN_2F', 'LV_LAGER1F', 'LV_1F', 'LV_2F',
'TV_LAGER1F', 'TV_1F', 'TV_2F', 'VSO', 'PRO', 'VMBO', 'VMBO_HAVO',
'HAVO', 'HAVO_VWO', 'VWO', 'ADVIES_NIET_MOGELIJK', 'TOTAAL_ADVIES'],
dtype='object')
SCHOOL_ID | DATUM | INSTELLINGSNAAM_VESTIGING | POSTCODE_VESTIGING | PLAATSNAAM | GEMEENTENUMMER | GEMEENTENAAM | PROVINCIE | SOORT_PO | DENOMINATIE_VESTIGING | ... | TV_2F | VSO | PRO | VMBO | VMBO_HAVO | HAVO | HAVO_VWO | VWO | ADVIES_NIET_MOGELIJK | TOTAAL_ADVIES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 000AP_2015 | 2015 | De Schanskorf | 2715BT | ZOETERMEER | 637 | Zoetermeer | Zuid-Holland | Bo | Gereformeerd vrijgemaakt | ... | 0.0 | 0 | 0 | 1 | 0 | 5 | 1 | 1 | 0 | 11 |
1 | 000AR_2015 | 2015 | BS "De Maasparel" | 6109AM | OHE EN LAAK | 1641 | Maasgouw | Limburg | Bo | Rooms-Katholiek | ... | 0.0 | 0 | 0 | 1 | 0 | 7 | 0 | 6 | 0 | 18 |
2 | 000AZ_2015 | 2015 | De Kiezel en de Kei | 2971AR | BLESKENSGRAAF CA | 1927 | Molenwaard | Zuid-Holland | Bo | Openbaar | ... | 0.0 | 0 | 1 | 3 | 5 | 3 | 2 | 4 | 0 | 22 |
3 | 000BA_2015 | 2015 | OBS De Klimboom | 6666EB | HETEREN | 1734 | Overbetuwe | Gelderland | Bo | Openbaar | ... | 0.0 | 1 | 1 | 1 | 0 | 1 | 4 | 0 | 0 | 12 |
4 | 000BB_2015 | 2015 | Obs Letterwies | 9944AR | NIEUWOLDA | 1895 | Oldambt | Groningen | Bo | Openbaar | ... | 0.0 | 0 | 0 | 7 | 0 | 5 | 0 | 2 | 0 | 16 |
5 rows × 33 columns
Do all the columns has the type they should?
data.dtypes
SCHOOL_ID object
DATUM int64
INSTELLINGSNAAM_VESTIGING object
POSTCODE_VESTIGING object
PLAATSNAAM object
GEMEENTENUMMER int64
GEMEENTENAAM object
PROVINCIE object
SOORT_PO object
DENOMINATIE_VESTIGING object
BEVOEGD_GEZAG_NUMMER int64
EXAMEN object
EXAMEN_AANTAL int64
EXAMEN_GEM float64
REKENEN_LAGER1F float64
REKENEN_1F float64
REKENEN_1S float64
REKENEN_2F float64
LV_LAGER1F float64
LV_1F float64
LV_2F float64
TV_LAGER1F float64
TV_1F float64
TV_2F float64
VSO int64
PRO int64
VMBO int64
VMBO_HAVO int64
HAVO int64
HAVO_VWO int64
VWO int64
ADVIES_NIET_MOGELIJK int64
TOTAAL_ADVIES int64
dtype: object
I see that after merging columns REKENEN_LAGER1F, REKENEN_1F, REKENEN_1S, REKENEN_2F, LV_LAGER1F, LV_1F, LV_2F, TV_LAGER1F, TV_1F, TV_2F has type float64, but they should be int, so let’s convert them:
data["REKENEN_LAGER1F"] = data["REKENEN_LAGER1F"].astype(int)
data["REKENEN_1F"] = data["REKENEN_1F"].astype(int)
data["REKENEN_1S"] = data["REKENEN_1S"].astype(int)
data["REKENEN_2F"] = data["REKENEN_2F"].astype(int)
data["LV_LAGER1F"] = data["LV_LAGER1F"].astype(int)
data["LV_1F"] = data["LV_1F"].astype(int)
data["LV_2F"] = data["LV_2F"].astype(int)
data["TV_LAGER1F"] = data["TV_LAGER1F"].astype(int)
data["TV_1F"] = data["TV_1F"].astype(int)
data["TV_2F"] = data["TV_2F"].astype(int)
Finally, let’s write our result dataset down to .csv file:
data.to_csv(r'score_second_result.csv', index = False)