Ylonka Machado's URP Hour Log

Week 1: Read up on Data Analytics & Visualization, SPARQL, Big Data, etc. and research datasets

Datasets:
http://www3.epa.gov/ttn/analysis/ozone.htm
http://www.health.pa.gov/My%20Health/Environmental%20Health/Environmental%20Public%20Health%20Tracking/Pages/MetaData-Ozone.aspx#.VgfsmY9Viko
http://216.128.241.210/dataset?q=ozone+exceedance&sort=none&tags=ozone&ext_location=&ext_bbox=&ext_prev_extent=-142.03125%2C8.754794702435605%2C-59.0625%2C61.77312286453148

Technologies for Data Visualization:
http://bl.ocks.org/mbostock/2206590
http://bl.ocks.org/mbostock/4060606
http://bl.ocks.org/NPashaP/a74faf20b492ad377312
https://live2.zoomdata.com/zoomdata/visualization#53f22849e4b08f9d5f15360a-522655b0e4b00f4f3af30f12
https://vida.io/gists/FLFFovRPbu2t5QwQC
https://vida.io/gists/ot4Ynw4gZdmKkofo8

Below are slides from Professor Plotka's Web Science course.





Week 2: Data Studies

Data under study: http://spreadsheets.latimes.com/epa-tightens-regulations-ozone-pollution/
Related: http://www.latimes.com/nation/la-na-epa-ozone-rule-20141126-story.html#page=1, https://epafacts.com/high-cost-of-red-tape/smog/

1. What is the problem that we are trying to solve? How will it impact the real world / bottom line? Does it have to be this way?
  • The Environmental Protection Agency (EPA) is attempting to limit the amount of ozone in counties
  • Limiting ozone would improve air quality and lessen the amount of damage done to citizens' lungs
  • The current standard is 75 ppb, but the proposed new limit would be between 60 and 70 ppb
  • Yes, it has to be this way: an enforced standard is the only way to ensure counties stay in compliance
  • The new rule would be a major victory for public health groups
  • It would also further stoke partisan clashes between the President and Republicans poised to take control of Congress

2. How can I measure this? What data do I need? What is the appropriate algorithm and why? How will I evaluate this?
  • The ozone in a county is measured in parts per billion
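  • The data needed are ozone readings in parts per billion (ppb) for each county
  • By mass, ppb = (grams of solute / grams of solution) * 10^9; for a gas such as ozone in air, the ratio is normally taken by volume rather than by mass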
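
As a quick check of the ppb formula above, here is a minimal Python sketch. The function name ppb_by_mass and the sample masses are illustrative assumptions, not values from the dataset.

```python
import math

def ppb_by_mass(solute_grams, solution_grams):
    """Return the concentration in parts per billion (mass ratio * 10^9)."""
    if solution_grams <= 0:
        raise ValueError("solution mass must be positive")
    return solute_grams / solution_grams * 1e9

# Illustrative masses only: 75 nanograms of solute in 1 gram of solution.
example_ppb = ppb_by_mass(75e-9, 1.0)
print(round(example_ppb, 6))  # 75.0
```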

3. What are the attributes/ variables/ features/ columns?
  • Columns (from left to right): County, State, Current reading of ozone ppb, Meets current standard, 70 ppb, 65 ppb
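
The three yes/no columns could in principle be recomputed from the reading itself. The sketch below assumes a county meets a standard when its reading is at or below that limit; that rule, the function name, and the dictionary keys are assumptions made for illustration.

```python
# Limits in ppb: the current standard plus the two tighter limits named in the columns.
STANDARDS_PPB = {"current (75 ppb)": 75, "70 ppb": 70, "65 ppb": 65}

def meets_standards(reading_ppb):
    """Map each standard to True when the reading is at or below its limit (assumed rule)."""
    return {name: reading_ppb <= limit for name, limit in STANDARDS_PPB.items()}

# Illustrative reading, not taken from the dataset.
print(meets_standards(68))
# {'current (75 ppb)': True, '70 ppb': True, '65 ppb': False}
```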

4. What are the domains of each feature?

5. What does each row/ observation represent?
  • Each row is a different county

6. How was/ is the data collected?
  • The data came from the EPA's most recent air quality data at the time (Nov. 2014)

7. What data are we ignoring, and does it have more features?
  • Population in each county (ppb might depend on whether there is dense population or not)

8. Is the data in a usable form? How does it need to be cleaned?
  • Data is available in CSV, JSON, XLS
  • It is organized from lowest ppb to highest ppb
  • The "Meets current standard", "70 ppb", and "65 ppb" columns contain HTML tags instead of text to mark whether a county meets each standard. Clean them by replacing <i class="LATClose01"></i> with "no" and <i class="LATCheck" style="color:#009900;"></i> with "yes"
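
The tag replacement described above could be done with a short Python script. This is only a sketch: the two tags are copied from the notes, while the helper names and the sample row (a made-up county in the column order from question 3) are assumptions.

```python
import csv
import io

# The two HTML markers found in the standards columns.
CHECK = '<i class="LATCheck" style="color:#009900;"></i>'
CLOSE = '<i class="LATClose01"></i>'
TAG_TO_TEXT = {CHECK: "yes", CLOSE: "no"}

def clean_cell(cell):
    """Replace a known HTML marker with "yes"/"no"; leave other cells unchanged."""
    return TAG_TO_TEXT.get(cell.strip(), cell)

def clean_rows(rows):
    """Clean every cell of every row (each row is a list of strings)."""
    return [[clean_cell(cell) for cell in row] for row in rows]

# Build a one-row CSV in memory with made-up values, then clean it.
raw = io.StringIO()
csv.writer(raw).writerow(["Example County", "Example State", "72", CHECK, CLOSE, CLOSE])
raw.seek(0)
cleaned = clean_rows(csv.reader(raw))
print(cleaned[0])
# ['Example County', 'Example State', '72', 'yes', 'no', 'no']
```

For the downloaded file, the same clean_rows call should work on csv.reader(f) with f opened using newline=""; the file name depends on how the CSV was saved.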

9. What metadata is available?
  • No metadata available for this specific dataset

10. Where did this data come from?
  • EPA Air Quality Data

11. What format is it in?
  • CSV, JSON, XLS

12. Who designed it and why?
  • EPA to provide researchers, public health professionals, and the public with air quality data

13. Who provided the data?
  • EPA

Graphical representation of data:
14. Is it titled and are the axes labeled with units?
  • It is titled "155 counties do not meet the current EPA standard"
  • Its single axis, "Ozone concentration compared to selected standards", is labeled in ppb. Shades of blue mark counties within the standards, and yellow/brown marks counties that do not meet the current EPA standard

15. Does it convey a main idea/argument clearly?
  • Yes, although the colors used for counties that do not meet standard should be more attention-getting and aggressive to better represent the urgency of the matter

16. Is this the best format for the data presentation?
  • Yes, counties are easy to find, and it is easy to see whether each one meets the standards

17. Is it pleasing to look at?
  • Yes

18. What data is not represented on it?
  • Population per county

19. Background: Who are the stakeholders? Who cares about this question/data in any significant way? For each: What are their main tasks and goals?
  • Researchers
    • Main tasks: Research ozone ppb in counties and its harmful effects
    • Goals: Spread awareness of ozone ppb in counties and its harmful effects
  • Public health professionals
    • Main tasks: Analyze and develop programs that protect the health of individuals in a community
    • Goals: Push for stricter standards to be enforced for the population's wellbeing
  • Government officials
    • Main tasks: To ensure the health, safety, and wellbeing of the population
    • Goals: Take more action against counties that are not meeting standards
  • Public
    • Main tasks:
    • Goals: Bring attention to what they believe is not being handled correctly/is unfair

20. Catalysts: What would make this stakeholder get involved, or rally around a project? What do they care about?
  • All stakeholders care about health; if population numbers were to be included in the data, it could display how many people are exposed to dangerously high ozone ppb
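
If a population column were joined to the data, the exposure count mentioned above could be computed roughly as follows. This is a sketch under assumptions: the field names are hypothetical, the population figures are placeholders (the dataset has no population column), and "exposed" is taken to mean a reading above the chosen limit.

```python
def population_exposed(counties, limit_ppb=75):
    """Sum the population of counties whose ozone reading exceeds limit_ppb."""
    return sum(c["population"] for c in counties if c["ozone_ppb"] > limit_ppb)

# Placeholder values for illustration only, not real counties or readings.
counties = [
    {"county": "County A", "ozone_ppb": 80, "population": 1000},
    {"county": "County B", "ozone_ppb": 72, "population": 2000},
    {"county": "County C", "ozone_ppb": 66, "population": 3000},
]
print(population_exposed(counties))      # 1000
print(population_exposed(counties, 70))  # 3000
print(population_exposed(counties, 65))  # 6000
```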


22. Benefits: How can data analytics improve the performance of different stakeholders?
  • Data analytics, in this specific case, can spread awareness of ozone's lung-damaging effects
  • It can also draw correlations between lung-related health issues in counties and their ozone ppb

Week 3: Data Visualization
http://cdb.io/1PwvKXN