## Blog 3

As the summer session progresses, I am working to complete the first stage of the research: data collection. This stage is especially important, not only because it is the start of the project but also because the accuracy of the data directly affects the accuracy of the result, no matter how hard we work to investigate and improve the methods. Before conducting research myself, I thought of data collection as a fixed and simple process, but there are many factors to consider when actually searching for data of the best quality.

To start, we plan to create a template for the research, so we are going to use only a few variables for now. Based on the papers I have read, one of the most important variables is land cover, which includes bushes, precipitation, snow, etc. It is directly related to the distribution of population and the habitability of the region. The land cover data is downloaded from xxx. According to the same papers, two other important variables are nighttime light and settlement. One of the main ideas of this research is to investigate how to generate a synthetic population from the various data layers available, using the household as the unit instead of the individual, as most other researchers do. The household is crucial when thinking about population distribution and even the spread of disease, which is why settlement is an important variable.

After collecting all these variables, we have to make sure that the collection dates, regions, and source organizations of the datasets correspond to each other. Another important step is to organize the data before aggregating all the variables together. We have to get the general population information for Liberia and, based on that data, aggregate individual or household data into clan-level data by clan ID. In other words, we are going to get the variable data for each clan. From there, we can combine all the variable data by clan ID using the “merge” function. After working with the variables, this week just passed by.
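
To make the aggregate-then-merge step more concrete, here is a minimal sketch in base R. It is not our actual pipeline: the data frames, column names (`clan_id`, `household_size`, `mean_nightlight`, `settlement_share`), and values are all made up for illustration, and I am assuming the covariates have already been summarized to one row per clan.

```r
# Hypothetical household-level records (invented values, not real Liberia data).
households <- data.frame(
  clan_id        = c("C01", "C01", "C02", "C02", "C03"),
  household_size = c(4, 6, 3, 5, 7)
)

# Hypothetical clan-level covariates, e.g. summarized from raster layers.
covariates <- data.frame(
  clan_id          = c("C01", "C02", "C03"),
  mean_nightlight  = c(0.8, 2.1, 0.3),
  settlement_share = c(0.10, 0.35, 0.05)
)

# Step 1: aggregate household data to the clan level by clan ID.
# Total population per clan is the sum of household sizes.
clan_pop <- aggregate(household_size ~ clan_id, data = households, FUN = sum)
names(clan_pop)[2] <- "population"

# Number of households per clan is the count of rows for each clan ID.
clan_hh <- aggregate(household_size ~ clan_id, data = households, FUN = length)
names(clan_hh)[2] <- "n_households"

# Step 2: merge everything on the shared key, clan_id.
# all.x = TRUE keeps every clan even if a covariate row is missing.
clan_data <- merge(clan_pop, clan_hh, by = "clan_id")
clan_data <- merge(clan_data, covariates, by = "clan_id", all.x = TRUE)

print(clan_data)
```

The result has one row per clan, with the population, household count, and covariates side by side. Using `all.x = TRUE` in the last merge is a design choice for this sketch: a clan with no matching covariate row shows up with `NA` values instead of silently disappearing.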

## Blog 2

After reading some papers during the first week, I got a broad idea of the research and formed some thoughts about it. I met with the professor a few times so that we could understand each other’s ideas and make a more detailed plan for our future work. We agreed that the Random Forest package in RStudio is very powerful and might be a good starting point for the research, and we narrowed the candidate methods down to Random Forest and a Hierarchical Bayesian Model to begin with.

## Blog 1: Outlining the summer research

Since this is my first time doing scientific research, I am very excited and looking forward to the whole summer session. The main idea of this research, “Generate a spatially continuous synthetic population description for improved prediction of human movement and spread of disease”, was inspired by my COLL 150 class with Professor Frazier of the Data Science department. So I started by talking with Professor Frazier about some initial ideas for the research process. We talked for hours, which was really helpful and enlightening. We went through some topics from class during the spring semester and some inspiring thoughts from my classmates. The whole class had collected many promising cutting-edge methodologies, and some of them might be useful to us. We settled on a general outline of the research and decided to read more about the different methods and make a summary list of all the related ones.

## Generate a spatially continuous synthetic population description for improved prediction of human movement and spread of disease

The R language came into my life on the first day of college last semester. In my COLL 100 Data Science Lab, I used RStudio to sort through data, make predictions about stocks of Fortune 500 companies (stocks_map), create word clouds from Twitter data (wordcloud_modern_family), project crop data onto the map of Nigeria (dominant_crop_map_lga), etc. I was amazed at how powerful R is at dealing with big data and drawing conclusions from it. This semester, while working as a teaching assistant in the lab, I am taking the Human Development seminar and diving more deeply into data science. I learned about freedom and complexity, the scale of the world, and the cutting-edge methodologies involved in data science. The most engaging part is putting what is discussed in the research papers into practice with R. We made three-dimensional plots of WorldPop data for Liberia and other countries. While writing the annotated bibliography, literature review, and central research question over the semester, I have become more and more interested in the research and feel motivated to address the gap found in the previous papers in my own research. Meanwhile, I am also taking classes in the Computer Science and Mathematics departments. With all of this inspiration from classes and encouragement from Professor Tyler Frazier, I came up with the idea for the research assignment in class, “Generate a spatially continuous synthetic population description for improved prediction of human movement and spread of disease”, and I plan to actually carry it out during the summer.