Please use this identifier to cite or link to this item: http://buratest.brunel.ac.uk/handle/2438/7926
Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Shepperd, M | - |
| dc.contributor.author | Song, Q | - |
| dc.contributor.author | Sun, Z | - |
| dc.contributor.author | Mair, C | - |
| dc.date.accessioned | 2014-01-21T11:44:15Z | - |
| dc.date.available | 2014-01-21T11:44:15Z | - |
| dc.date.issued | 2013 | - |
| dc.identifier.citation | IEEE Transactions on Software Engineering, 39(9), 1208-1215, 2013 | en_US |
| dc.identifier.issn | 0098-5589 | - |
| dc.identifier.uri | http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6464273 | en |
| dc.identifier.uri | http://bura.brunel.ac.uk/handle/2438/7926 | - |
| dc.description.abstract | Background: Self-evidently empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and using the same rather than similar versions of datasets. In recent years, there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research. Objective: This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method: We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results: We find important differences between the two versions of the datasets, implausible values in one dataset and generally insufficient detail documented on dataset preprocessing. Conclusions: It is recommended that researchers 1) indicate the provenance of the datasets they use, 2) report any preprocessing in sufficient detail to enable meaningful replication, and 3) invest effort in understanding the data prior to applying machine learners. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | Institute of Electrical and Electronics Engineers | en_US |
| dc.subject | Empirical software engineering | en_US |
| dc.subject | Data quality | en_US |
| dc.subject | Defect prediction | en_US |
| dc.subject | Machine learning | en_US |
| dc.title | Data quality: Some comments on the NASA software defect datasets | en_US |
| dc.type | Article | en_US |
| dc.identifier.doi | http://dx.doi.org/10.1109/TSE.2013.11 | - |
| pubs.organisational-data | /Brunel | - |
| pubs.organisational-data | /Brunel/Brunel Active Staff | - |
| pubs.organisational-data | /Brunel/Brunel Active Staff/School of Info. Systems, Comp & Maths | - |
| pubs.organisational-data | /Brunel/Brunel Active Staff/School of Info. Systems, Comp & Maths/IS and Computing | - |
| pubs.organisational-data | /Brunel/University Research Centres and Groups | - |
| pubs.organisational-data | /Brunel/University Research Centres and Groups/School of Information Systems, Computing and Mathematics - URCs and Groups | - |
| pubs.organisational-data | /Brunel/University Research Centres and Groups/School of Information Systems, Computing and Mathematics - URCs and Groups/Centre for Information and Knowledge Management | - |
Appears in Collections: Computer Science
Dept of Computer Science Research Papers

Files in This Item:
| File | Description | Size | Format |
| --- | --- | --- | --- |
| TSE_NASADataQualNote_V26.pdf | | 165.81 kB | Adobe PDF |

