{"id":119,"date":"2017-04-21T15:27:02","date_gmt":"2017-04-21T14:27:02","guid":{"rendered":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/?p=119"},"modified":"2017-04-28T10:09:41","modified_gmt":"2017-04-28T09:09:41","slug":"identifying-gaps-in-your-data","status":"publish","type":"post","link":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/2017\/04\/21\/identifying-gaps-in-your-data\/","title":{"rendered":"Identifying gaps in your data"},"content":{"rendered":"<p>One of the first things you want to do when you explore a new dataset is to identify possible gaps. Sample size and the number of variables are relevant but&#8230;how many observations do you have <em>for each variable<\/em>? This distinction is even more relevant for archaeologists because (if we are being honest&#8230;) most of our data has huge gaps.<\/p>\n<p>Just to make the post clear:<br \/>\n&#8211; The <em>Sample<\/em> is the set of entities you collected.<br \/>\n&#8211; <em>Variables<\/em> are measures and properties of this sample.<br \/>\n&#8211; <em>Observations<\/em> are the values of the variables for each item in your sample.<\/p>\n<p>The identification of variables with a decent number of observations is crucial for several processes. Let&#8217;s say that you have a bunch of archaeological sites and you want to create a map where the size of each dot (i.e. <em>site<\/em>) is proportional to the area of the site. This would be a bad idea if 90% of your sample does not have an assigned area because these points will be ignored.<\/p>\n<p>This is even more relevant if you want to do some modelling (e.g. a linear regression). Many statistical models silently drop any observation that has an unassigned value in one of the variables involved, so a single sparse variable can shrink your sample a lot and you have to be very careful about it. 
Let&#8217;s see how we can explore this issue.<\/p>\n<h1>Example: Arrowheads in UK<\/h1>\n<h2>Loading the dataset<\/h2>\n<p>We downloaded a dataset of arrowheads collected by the <a href=\"https:\/\/finds.org.uk\/\">Portable Antiquities Scheme<\/a>:<\/p>\n[generic linenumbers=&#8221;False&#8221;]\narrowheads &lt;- read.csv(&quot;https:\/\/git.io\/v9JJd&quot;, na.strings=&quot;&quot;)<br \/>\n[\/generic]\n<p>As you can see I specified that empty strings of text should be read as <em>NA<\/em> (i.e. Not Available). If you don&#8217;t do that then they will be read as empty strings, which is different than not having a value.<\/p>\n<p>If we take a look at the newly created <em>arrowheads<\/em> data frame we will see a bunch of interesting metrics:<\/p>\n[generic  linenumbers=&#8221;False&#8221;]\nstr(arrowheads)<br \/>\n[\/generic]\n<p>You should get something like:<\/p>\n[generic  linenumbers=&#8221;False&#8221;]\n&#8216;data.frame&#8217;:   1079 obs. of  13 variables:<br \/>\n $ id               : int  522443 179174 233283 199204 106547 45059 508485 646649 401936 133638 &#8230;<br \/>\n $ classification   : Factor w\/ 81 levels &#8220;Arrowhead&#8221;,&#8221;Barb and Tanged&#8221;,..: NA NA 10 NA NA NA NA NA NA NA &#8230;<br \/>\n $ subClassification: Factor w\/ 19 levels &#8220;barbed and tanged&#8221;,..: NA NA NA NA NA NA NA NA NA NA &#8230;<br \/>\n $ length           : num  28 55.2 45.3 100.3 39 &#8230;<br \/>\n $ width            : num  18 11 25 7.6 11 &#8230;<br \/>\n $ thickness        : num  3 2 5.11 6.7 NA 3.54 3.47 6.5 1 NA &#8230;<br \/>\n $ diameter         : num  2.2 4.5 4.58 5.1 6 6.82 6.96 7.5 8 8 &#8230;<br \/>\n $ weight           : num  NA 6.1 2.78 16.69 4.76 &#8230;<br \/>\n $ quantity         : int  1 1 1 1 1 1 1 1 1 1 &#8230;<br \/>\n $ broadperiod      : Factor w\/ 12 levels &#8220;BRONZE AGE&#8221;,&#8221;EARLY MEDIEVAL&#8221;,..: 1 4 1 4 1 4 4 4 4 4 &#8230;<br \/>\n $ fromdate         : int  -2150 NA -2150 1250 -2150 1066 1066 1200 1066 1200 
&#8230;<br \/>\n $ todate           : int  -1500 NA -1500 1499 -800 1350 1400 1300 1500 1499 &#8230;<br \/>\n $ district         : Factor w\/ 188 levels &#8220;Arun&#8221;,&#8221;Ashford&#8221;,..: 45 181 74 NA 26 93 177 137 186 53 &#8230;<br \/>\n[\/generic]\n<p>See all these <em>NA<\/em> values? These are the gaps in our data. We can suspect that <em>diameter<\/em> or <em>subClassification<\/em> will not be widely recorded, but with over 1000 arrowheads it is difficult to know which variables we should use in the analysis.<\/p>\n<h2>Visualizing gaps<\/h2>\n<p>How can you identify these gaps? My preferred method is to visualize them using the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/Amelia\/Amelia.pdf\">Amelia package<\/a> (yes, awesome <a href=\"https:\/\/en.wikipedia.org\/wiki\/Amelia_Earhart\">name<\/a> for an R package on missing data&#8230;). Its use is straightforward:<\/p>\n[generic  linenumbers=&#8221;False&#8221;]\ninstall.packages(&#8220;Amelia&#8221;)<br \/>\nlibrary(&#8220;Amelia&#8221;)<br \/>\nmissmap(arrowheads)<br \/>\n[\/generic]\n<figure id=\"attachment_122\" aria-describedby=\"caption-attachment-122\" style=\"width: 625px\" class=\"wp-caption alignright\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-878x1024.png\" alt=\"\" width=\"625\" height=\"729\" class=\"size-large wp-image-122\" srcset=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-878x1024.png 878w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-257x300.png 257w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-768x896.png 768w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-600x700.png 600w, 
http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1-171x200.png 171w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/missmap1.png 900w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><figcaption id=\"caption-attachment-122\" class=\"wp-caption-text\">Missingness map<\/figcaption><\/figure>\n<p>The structure is R-classic: rows are sample units while columns are variables. Red cells are the ones that have some values while the other ones are empty.<\/p>\n<h2>Interpretation<\/h2>\n<p>The map of missing values allows us to make informed decisions on how to proceed with the analysis. In this case:<br \/>\n&#8211; We should not use <em>diameter<\/em> for analysis because it is not present in most of the sample.<br \/>\n&#8211; We have almost complete information on broad spatial and temporal coordinates (<em>broadperiod<\/em> and <em>district<\/em>).<br \/>\n&#8211; <em>classification<\/em> and <em>subClassification<\/em> are quite useless here.<br \/>\n&#8211; Measures that can be used are: <em>weight<\/em>, <em>thickness<\/em>, <em>width<\/em> and <em>length<\/em>.<\/p>\n<h2>Impact<\/h2>\n<p>You can easily visualize the impact by creating one plot that uses <em>diameter<\/em> and another one that does not:<br \/>\n[generic  linenumbers=&#8221;False&#8221;]\nlibrary(ggplot2)<br \/>\nggplot(arrowheads, aes(x=width, y=diameter, col=broadperiod)) + geom_point() + theme_bw() + facet_wrap(~broadperiod)<br \/>\n[\/generic]\n<figure id=\"attachment_123\" aria-describedby=\"caption-attachment-123\" style=\"width: 625px\" class=\"wp-caption alignright\"><a href=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-1024x1024.png\" alt=\"\" width=\"625\" height=\"625\" class=\"size-large wp-image-123\" 
srcset=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-1024x1024.png 1024w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-150x150.png 150w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-300x300.png 300w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-768x768.png 768w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-600x600.png 600w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/diameter-200x200.png 200w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption id=\"caption-attachment-123\" class=\"wp-caption-text\">Scatterplot width vs diameter<\/figcaption><\/figure>\n<p>Not looking good&#8230;R even tells you that you lost 1051 points in your dataset&#8230;also, most of the periods. 
Compare it with:<\/p>\n[generic  linenumbers=&#8221;False&#8221;]\nggplot(arrowheads, aes(x=width, y=length, col=broadperiod)) + geom_point() + theme_bw() + facet_wrap(~broadperiod)<br \/>\n[\/generic]\n<figure id=\"attachment_124\" aria-describedby=\"caption-attachment-124\" style=\"width: 625px\" class=\"wp-caption alignright\"><a href=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-1024x1024.png\" alt=\"\" width=\"625\" height=\"625\" class=\"size-large wp-image-124\" srcset=\"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-1024x1024.png 1024w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-150x150.png 150w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-300x300.png 300w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-768x768.png 768w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-600x600.png 600w, http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-content\/uploads\/sites\/7\/2017\/04\/length-200x200.png 200w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption id=\"caption-attachment-124\" class=\"wp-caption-text\">Scatterplot width vs length<\/figcaption><\/figure>\n<p>Only 84 rows contained missing values, much better!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the first things you want to do when you explore a new dataset is to identify possible gaps. Sample size and the number of variables are relevant but&#8230;how many observations do you have for each variable? 
This distinction is even more relevant for archaeologists because (if we are being honest&#8230;) most of our [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[16,15,18,19,17],"class_list":["post-119","post","type-post","status-publish","format-standard","hentry","category-r","tag-arrowheads","tag-dataviz","tag-eda","tag-hint","tag-pas"],"_links":{"self":[{"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/posts\/119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/comments?post=119"}],"version-history":[{"count":5,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/posts\/119\/revisions"}],"predecessor-version":[{"id":166,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/posts\/119\/revisions\/166"}],"wp:attachment":[{"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/media?parent=119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/categories?post=119"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/research.shca.ed.ac.uk\/past-by-numbers\/wp-json\/wp\/v2\/tags?post=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}